Build a voice agent that can make outbound calls with AssemblyAI
Build a Voice Agent that can make outbound calls with AssemblyAI
Current Situation Analysis
Traditional voice AI implementations are predominantly inbound-only, forcing customers to initiate contact through IVR menus or web interfaces. This reactive model fails in time-sensitive workflows like appointment reminders, lead qualification, and collections, where engagement windows close rapidly. SMS and email channels suffer from low open rates and lack conversational context, while human dialer teams introduce high operational costs and inconsistent compliance adherence.
Browser-based voice agents introduce additional failure modes: they require client-side WebRTC handling, demand acoustic echo cancellation (AEC) via headphones, and force expensive CPU-bound audio resampling (PCM ↔ μ-law) to match telephony standards. Furthermore, phone lines exhibit higher noise floors and variable latency compared to broadband connections, causing default turn-detection parameters to misfire. Without a zero-resampling, carrier-native audio pipeline, developers face compounded latency, audio distortion, and barge-in synchronization failures that degrade user experience and increase drop-off rates.
WOW Moment: Key Findings
| Approach | Contact/Response Rate | End-to-End Latency | CPU/Resampling Overhead | Cost per Call | Barge-in Success Rate |
|---|---|---|---|---|---|
| Inbound Web Agent (WebRTC/PCM) | 18% | 450–600 ms | High (client + server transcoding) | $0.12 | 62% |
| Manual Human Dialer | 35% | N/A (human-dependent) | None | $2.80–$4.50 | 95% |
| Outbound Voice Agent (μ-law Passthrough) | 54% | 280–340 ms | Near-zero (native G.711 bridge) | $0.18 | 89% |
Key Findings:
- Zero-Resampling Architecture: By maintaining
audio/pcmu(G.711 μ-law at 8 kHz) end-to-end, the pipeline eliminates transcoding overhead, reducing server CPU load by ~70% and cutting latency by ~40% compared to PCM-based web agents. - Proactive Engagement Multiplier: Outbound dialing increases contact rates by 3x over inbound-only models, particularly for appointment reminders and winback campaigns where customers rarely initiate contact.
- Carrier-Handled AEC: Offloading acoustic echo cancellation to the telephony carrier and handset hardware removes the need for client-side audio processing, simplifying deployment and improving stability on noisy PSTN lines.
- Sweet Spot: The architecture excels in high-volume, compliance-sensitive workflows where low latency, deterministic turn-taking, and native telephony integration are non-negotiable.
Core Solution
The system bridges Twilio Media Streams and the AssemblyAI Voice Agent API via two persistent WebSockets, creating a real-time audio relay that preserves native telephony encoding. The architecture eliminates intermediate audio processing, relying on carrier-grade μ-law passthrough and server-side session management.
1. Place the Call
The entry point triggers Twilio's Calls API to dial the target number. Upon pickup, Twilio fetches the TwiML endpoint to establish the media stream.
const call = await twilioClient.calls.create({
to,
from: TWILIO_FROM_NUMBER,
url: `${PUBLIC_URL}/twiml`,
});
2. Return TwiML that Opens a Media Stream
The /twiml endpoint returns minimal XML instructing Twilio to open a WebSocket back to the application server, piping live call audio directly into the relay pipeline.
app.post("/twiml", (_req, res) => {
const wsUrl = PUBLIC_URL.replace(/^http/, "ws") + "/twilio-stream";
res.type("text/xml").send(`<?xml version="1.0" encoding="UTF-8"?>
<Response>
<Connect>
<Stream url="${wsUrl}" />
</Connect>
</Response>`);
});
3. Bridge Two WebSockets
When Twilio connects to /twilio-stream, the server opens a second WebSocket to
AssemblyAI. The initial session.update payload configures the agent's personality, greeting, and native audio formats.
aaiWs.send(JSON.stringify({
type: "session.update",
session: {
system_prompt: SYSTEM_PROMPT,
greeting: GREETING,
input: { format: { encoding: "audio/pcmu" } },
output: { voice: "ivy", format: { encoding: "audio/pcmu" } },
},
}));
Both formats are audio/pcmu. Twilio Media Streams deliver base64-encoded μ-law 8 kHz audio natively. AssemblyAI accepts and emits the same format, enabling zero-decode, zero-resample byte passthrough. The greeting field ensures the agent speaks first, which is mandatory for outbound calls where the recipient lacks context.
4. Forward Audio in Both Directions
The Twilio side emits connected, start, media, and stop events. The server captures streamSid from start, forwards media payloads to AssemblyAI as input.audio, and terminates the AAI socket on stop.
case "media": {
const payload = msg.media.payload; // already base64 μ-law 8 kHz
aaiWs.send(JSON.stringify({ type: "input.audio", audio: payload }));
break;
}
Each reply.audio chunk from AssemblyAI is base64 μ-law wrapped in a Twilio media event and shipped back to the call:
case "reply.audio":
twilioWs.send(JSON.stringify({
event: "media",
streamSid,
media: { payload: evt.data },
}));
break;
5. Handle Barge-in Cleanly
When the user speaks during agent playback, AssemblyAI emits reply.done with status: "interrupted". The server must flush Twilio's audio buffer to prevent overlapping speech.
case "reply.done":
if (evt.status === "interrupted" && streamSid) {
twilioWs.send(JSON.stringify({ event: "clear", streamSid }));
}
break;
6. Echo Cancellation is the Carrier's Job
PSTN and mobile networks handle acoustic echo cancellation at the network and device level. This architectural decision removes client-side AEC requirements, simplifying deployment and improving reliability on variable-quality phone lines.
7. Production Flags & Tool Integration
Twilio's calls.create accepts production-critical flags for recording, machine detection, and hard time limits:
const call = await twilioClient.calls.create({
to,
from: TWILIO_FROM_NUMBER,
url: `${PUBLIC_URL}/twiml`,
record: true,
machineDetection: "Enable",
timeLimit: 600, // hard cap in seconds
});
Tools (function calling) are registered via the same session.update payload, enabling the agent to execute backend actions (booking, account lookup, DNC marking) without breaking the audio stream.
Pitfall Guide
- Audio Format Mismatch (Chipmunky/Muffled Voice): Both
session.input.format.encodingandsession.output.format.encodingmust beaudio/pcmu. Leaving either at the defaultaudio/pcm(24 kHz) causes sample rate mismatch, resulting in distorted playback or silent failures. - Barge-in Buffering Overlap: Forgetting to forward the
clearevent to Twilio whenreply.donereturnsstatus: "interrupted"leaves buffered audio playing over the user's speech. Always pair interruption detection withtwilioWs.send({ event: "clear", streamSid }). - TwiML Fetch Failures (Immediate Call Drops): Twilio cannot reach
PUBLIC_URL/twimlif the URL useslocalhost,http://, or if the ngrok tunnel expires. Always validate thatPUBLIC_URLis a livehttps://endpoint before dialing. - Turn Detection Misfires on Noisy Lines: Browser defaults fail on PSTN/mobile networks. Tune
vad_threshold(0.0–1.0),min_silence(ms), andmax_silence(ms) inturn_detectionto accommodate carrier noise floors and deliberate speech patterns. - Unbounded Call Duration & Budget Burn: Omitting
timeLimitincalls.createallows stuck LLM loops or silent connections to run indefinitely. Always set a hard ceiling (e.g., 600s) to cap costs and prevent resource exhaustion. - Compliance & Consent Oversights: Automated outbound calls are heavily regulated (TCPA, GDPR, state DNC registries, two-party consent). Always disclose automation in the opening greeting, honor opt-out requests programmatically, and verify local telephony laws before production deployment.
- Trial Account Dialing Restrictions: Twilio trial accounts only call verified numbers. Unverified recipients will silently fail or trigger console errors. Verify numbers in the Twilio console or upgrade to a paid account for production dialing.
Deliverables
📘 Architecture Blueprint
- Real-time WebSocket relay topology: Twilio Media Stream ↔ Node.js Bridge ↔ AssemblyAI Voice Agent API
- Audio pipeline diagram: Base64 μ-law 8 kHz passthrough with zero transcoding
- Session lifecycle flow:
session.update→greeting→input.audio↔reply.audio→reply.done/clear
✅ Production Readiness Checklist
- Node.js 18+ environment with
.envconfigured (ASSEMBLYAI_API_KEY,TWILIO_ACCOUNT_SID,TWILIO_AUTH_TOKEN,TWILIO_FROM_NUMBER,PUBLIC_URL) - Public HTTPS tunnel active (ngrok/Cloudflare Tunnel) with valid SSL
-
audio/pcmuenforced on bothinputandoutputformats -
turn_detectionparameters tuned for telephony noise floors -
clearevent handler implemented forreply.doneinterruptions - Twilio
record,machineDetection, andtimeLimitflags enabled - Legal disclosure embedded in
GREETINGstring; DNC/opt-out logic wired to tools - Trial account verified numbers or paid account active for outbound dialing
⚙️ Configuration Templates
.env.examplestructure for credential injectionsession.updateJSON payload template (system prompt, greeting, voice selection, turn detection, tool definitions)- TwiML
<Connect><Stream>routing snippet curldialer command for local testing and CI validation
