# How to Build a Browser-Based Voice Assistant With the AssemblyAI Voice Agent API


Real-time voice apps have a reputation for being painful to build. You’d normally need a speech-to-text service, an LLM, a text-to-speech engine, a WebSocket server to coordinate them, and some way to handle turn-taking so people aren’t talking over each other. AssemblyAI’s Voice Agent API handles all of that behind a single WebSocket endpoint. You stream audio in, you get spoken responses back.

In this tutorial, we’ll build a browser-based voice assistant from scratch—a tiny Express server for authentication, and an HTML page that captures your mic, talks to the agent, and plays its responses. The whole thing is roughly 120 lines of code across two files.

## What we’re building

A browser page where you click a button, talk to an AI voice assistant, and hear it respond in real time. The assistant can also call tools—we’ll wire up a simple weather lookup to demonstrate. No frameworks, no build step, no React. Just vanilla HTML and JavaScript.

## Prerequisites

You need Node.js (v18+) and an AssemblyAI API key. If you don’t have one yet, sign up for free—the API key is on your dashboard.

## Step 1: The token server

Browsers can’t set custom headers on WebSocket connections, so you can’t pass your API key directly. Instead, your server mints a short-lived token and the browser uses that to authenticate. This keeps your API key off the client entirely.

Create a file called `server.js`:

```js
const express = require("express");
const app = express();

app.use(express.static("public"));

app.get("/token", async (req, res) => {
  const response = await fetch(
    "https://agents.assemblyai.com/v1/token?expires_in_seconds=300",
    { headers: { Authorization: `Bearer ${process.env.ASSEMBLYAI_API_KEY}` } }
  );
  if (!response.ok) return res.status(500).send("Token generation failed");
  const { token } = await response.json();
  res.json({ token });
});

app.listen(3000, () => console.log("Running on http://localhost:3000"));
```

That’s the entire backend.
One endpoint, 15 lines. Each token is single-use and expires after 5 minutes, so even if someone intercepts one, the blast radius is minimal.

## Step 2: Capture mic audio in the browser

Create a `public/index.html` file. We’ll build it up section by section, starting with the audio capture. The Voice Agent API expects PCM16 mono audio at 24kHz, base64-encoded.

```html
<!DOCTYPE html>
<html>
<head><title>Voice Assistant</title></head>
<body>
  <h1>Voice Assistant</h1>
  <button id="start">Start Conversation</button>
  <button id="stop" disabled>Stop</button>
  <div id="log"></div>

  <script>
  let ws, audioCtx, micStream, processor;

  // Append a line of transcript to the page
  function log(text) {
    const line = document.createElement("p");
    line.textContent = text;
    document.getElementById("log").appendChild(line);
  }

  document.getElementById("start").onclick = async () => {
    document.getElementById("start").disabled = true;
    document.getElementById("stop").disabled = false;

    // 1. Get a temporary token from our server
    const { token } = await fetch("/token").then(r => r.json());

    // 2. Open WebSocket to the Voice Agent API
    const wsUrl = new URL("wss://agents.assemblyai.com/v1/ws");
    wsUrl.searchParams.set("token", token);
    ws = new WebSocket(wsUrl);

    // 3. Configure the agent on connect
    ws.onopen = () => {
      ws.send(JSON.stringify({
        type: "session.update",
        session: {
          system_prompt: "You are a helpful voice assistant. " +
            "Keep responses under 2 sentences. " +
            "Use get_weather for weather questions.",
          greeting: "Hi! Ask me anything, or try asking about the weather.",
          output: { voice: "ivy" },
          tools: [{
            type: "function",
            name: "get_weather",
            description: "Get current weather for a city",
            parameters: {
              type: "object",
              properties: {
                location: { type: "string", description: "City name" }
              },
              required: ["location"]
            }
          }]
        }
      }));
    };
```

A few things to note: the system prompt tells the agent to keep it short (critical for voice UX—nobody wants to listen to a paragraph), and we’ve registered a `get_weather` tool right in the session config. The `log()` helper simply appends transcript lines to the page.

## Step 3: Handle events and stream audio both ways

Now we need to handle the incoming events from the API and stream our mic audio out. Add this right after the `ws.onopen` handler:
```js
    // 4. Handle incoming events
    const pendingTools = [];

    ws.onmessage = async (event) => {
      const msg = JSON.parse(event.data);

      switch (msg.type) {
        case "session.ready":
          startMic();  // Begin streaming audio once ready
          break;

        case "reply.audio":
          playAudio(msg.data);
          break;

        case "transcript.user":
          log("You: " + msg.text);
          break;

        case "transcript.agent":
          log("Agent: " + msg.text);
          break;

        case "tool.call": {
          // Simulate a weather lookup
          const result = msg.name === "get_weather"
            ? { temp: "72°F", conditions: "Sunny" }
            : { error: "Unknown tool" };
          pendingTools.push({ call_id: msg.call_id, result });
          break;
        }

        case "reply.done":
          if (msg.status === "interrupted") {
            pendingTools.length = 0;
          } else if (pendingTools.length > 0) {
            for (const tool of pendingTools) {
              ws.send(JSON.stringify({
                type: "tool.result",
                call_id: tool.call_id,
                result: JSON.stringify(tool.result)
              }));
            }
            pendingTools.length = 0;
          }
          break;
      }
    };
```

The key pattern with tool calling: accumulate results during `tool.call` events, but don’t send them back until `reply.done` fires. The agent speaks a transition phrase while waiting, and sending results too early causes timing issues.

## Step 4: Mic input and audio playback

Finally, wire up the Web Audio API for both capturing mic input (resampled to 24kHz PCM16) and playing the agent’s audio responses. Note the closing `};` on the second-to-last line—it closes the outer `start.onclick` handler.
```js
    // 5. Mic capture — resample to 24kHz PCM16
    async function startMic() {
      audioCtx = new AudioContext({ sampleRate: 24000 });
      micStream = await navigator.mediaDevices.getUserMedia({
        audio: { sampleRate: 24000, channelCount: 1 }
      });
      const source = audioCtx.createMediaStreamSource(micStream);
      processor = audioCtx.createScriptProcessor(4096, 1, 1);

      processor.onaudioprocess = (e) => {
        if (ws.readyState !== WebSocket.OPEN) return;
        const float32 = e.inputBuffer.getChannelData(0);
        const pcm16 = new Int16Array(float32.length);
        for (let i = 0; i < float32.length; i++) {
          pcm16[i] = Math.max(-32768,
            Math.min(32767, Math.floor(float32[i] * 32768)));
        }
        const b64 = btoa(String.fromCharCode(
          ...new Uint8Array(pcm16.buffer)));
        ws.send(JSON.stringify({
          type: "input.audio", audio: b64
        }));
      };

      source.connect(processor);
      processor.connect(audioCtx.destination);
    }
```
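The float → PCM16 → base64 conversion above, and its inverse that playback needs, can be sanity-checked outside the browser. Here is a standalone Node sketch: the helper names are mine, and it uses `Buffer` in place of the browser's `btoa`/`atob`:

```javascript
// Float32 samples in [-1, 1] -> little-endian PCM16 -> base64
function floatToPcm16Base64(float32) {
  const pcm16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    pcm16[i] = Math.max(-32768, Math.min(32767, Math.floor(float32[i] * 32768)));
  }
  return Buffer.from(new Uint8Array(pcm16.buffer)).toString("base64");
}

// The inverse: base64 -> little-endian PCM16 -> Float32 samples
function base64ToFloat(b64) {
  const bytes = Buffer.from(b64, "base64");
  const float32 = new Float32Array(bytes.length / 2);
  for (let i = 0; i < float32.length; i++) {
    // Reassemble the int16 and sign-extend it, then scale back to [-1, 1]
    const int16 = ((bytes[i * 2] | (bytes[i * 2 + 1] << 8)) << 16) >> 16;
    float32[i] = int16 / 32768;
  }
  return float32;
}

// Round trip: every value survives within one quantization step (1/32768)
const input = new Float32Array([0, 0.5, -0.5, 0.999]);
const out = base64ToFloat(floatToPcm16Base64(input));
console.log(out.every((v, i) => Math.abs(v - input[i]) < 1 / 32768)); // prints true
```

Nothing here is specific to the Voice Agent API; it is just the PCM16 arithmetic the page performs, isolated so you can test it.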
```js
    // 6. Play agent audio
    let playheadTime = 0;
    function playAudio(base64Data) {
      const bytes = atob(base64Data);
      const pcm16 = new Int16Array(bytes.length / 2);
      for (let i = 0; i < pcm16.length; i++) {
        pcm16[i] = bytes.charCodeAt(i * 2)
          | (bytes.charCodeAt(i * 2 + 1) << 8);
      }
      // Convert back to float samples and queue the chunk for playback
      const float32 = new Float32Array(pcm16.length);
      for (let i = 0; i < pcm16.length; i++) float32[i] = pcm16[i] / 32768;
      const buffer = audioCtx.createBuffer(1, float32.length, 24000);
      buffer.getChannelData(0).set(float32);
      const chunk = audioCtx.createBufferSource();
      chunk.buffer = buffer;
      chunk.connect(audioCtx.destination);
      // Schedule chunks back-to-back so replies play without gaps
      playheadTime = Math.max(playheadTime, audioCtx.currentTime);
      chunk.start(playheadTime);
      playheadTime += buffer.duration;
    }

    document.getElementById("stop").onclick = () => {
      if (ws) ws.close();
      if (micStream) micStream.getTracks().forEach(t => t.stop());
      if (processor) processor.disconnect();
      document.getElementById("start").disabled = false;
      document.getElementById("stop").disabled = true;
    };
  };
  </script>
```

## Step 5: Run it

Install Express and start the server:

```bash
npm install express
ASSEMBLYAI_API_KEY=your_key_here node server.js
```

Open http://localhost:3000, click "Start Conversation," and talk. You’ll hear the agent greet you and respond to your questions. Try asking "What’s the weather in Tokyo?" to see tool calling in action.

## Where to go from here

This is a working voice assistant in two files and about 120 lines of meaningful code. No separate STT, LLM, or TTS services to manage. No orchestration layer. Just one WebSocket doing everything.

Some next steps worth exploring: swap `ivy` for a multilingual voice like `lucia` (Spanish/English) or `ren` (Japanese/English). Add more tools—maybe one that queries your database or creates a support ticket. Adjust the `vad_threshold` for noisier environments. Or use `session.resume` to reconnect dropped sessions without losing context (sessions persist for 30 seconds after disconnection).

The full API reference and more examples are in the Voice Agent API docs. If you build something cool with this, I’d love to see it.
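One last aside for anyone wiring up more tools: the accumulate-then-flush logic from Step 3 pulls out cleanly into a plain function you can unit-test outside the browser. A sketch follows; the helper name and return shape are my own, not part of the API:

```javascript
// Hypothetical helper mirroring the tool-calling pattern from Step 3:
// collect tool results during a reply, emit them all on reply.done,
// and drop them if the reply was interrupted.
function makeToolBuffer(runTool) {
  const pending = [];
  return function handle(msg) {
    if (msg.type === "tool.call") {
      pending.push({ call_id: msg.call_id, result: runTool(msg) });
      return []; // nothing to send yet
    }
    if (msg.type === "reply.done") {
      if (msg.status === "interrupted") {
        pending.length = 0; // user cut the agent off; discard stale results
        return [];
      }
      const out = pending.map(t => ({
        type: "tool.result",
        call_id: t.call_id,
        result: JSON.stringify(t.result)
      }));
      pending.length = 0;
      return out; // messages to pass to ws.send, one per tool call
    }
    return [];
  };
}

// Usage: feed events in, send whatever comes back
const handle = makeToolBuffer(() => ({ temp: "72°F" }));
handle({ type: "tool.call", call_id: "c1" });               // -> []
const msgs = handle({ type: "reply.done", status: "done" }); // -> one tool.result
console.log(msgs.length); // prints 1
```

Keeping the buffering separate from the WebSocket plumbing makes the interruption edge case much easier to cover with tests.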