nt and improving stability on noisy PSTN lines.
- Sweet Spot: The architecture excels in high-volume, compliance-sensitive workflows where low latency, deterministic turn-taking, and native telephony integration are non-negotiable.
Core Solution
The system bridges Twilio Media Streams and the AssemblyAI Voice Agent API via two persistent WebSockets, creating a real-time audio relay that preserves native telephony encoding. The architecture eliminates intermediate audio processing, relying on carrier-grade μ-law passthrough and server-side session management.
1. Place the Call
The entry point triggers Twilio's Calls API to dial the target number. Upon pickup, Twilio fetches the TwiML endpoint to establish the media stream.
const call = await twilioClient.calls.create({
  to,
  from: TWILIO_FROM_NUMBER,
  url: `${PUBLIC_URL}/twiml`,
});
2. Return the TwiML
The /twiml endpoint returns minimal XML instructing Twilio to open a WebSocket back to the application server, piping live call audio directly into the relay pipeline.
app.post("/twiml", (_req, res) => {
  // http -> ws, https -> wss: only the scheme prefix is rewritten
  const wsUrl = PUBLIC_URL.replace(/^http/, "ws") + "/twilio-stream";
  res.type("text/xml").send(`<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="${wsUrl}" />
  </Connect>
</Response>`);
});
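The URL derivation and TwiML rendering can also be pulled out of the route so they are testable without a running server. The following is a minimal sketch under that assumption; toStreamUrl, escapeXmlAttr, and buildTwiml are hypothetical helper names, not part of Twilio's SDK, and the example host is a placeholder.

```javascript
// Hypothetical helpers (not part of any SDK): derive the Stream URL and render TwiML.

// Rewrites the scheme prefix (http -> ws, https -> wss), drops trailing slashes,
// and appends the /twilio-stream path used by the relay server.
function toStreamUrl(publicUrl) {
  return publicUrl.replace(/^http/, "ws").replace(/\/+$/, "") + "/twilio-stream";
}

// Escapes characters that are unsafe inside a double-quoted XML attribute.
function escapeXmlAttr(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Builds the same <Connect><Stream> document the /twiml route returns.
function buildTwiml(publicUrl) {
  const wsUrl = escapeXmlAttr(toStreamUrl(publicUrl));
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${wsUrl}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

console.log(buildTwiml("https://example.ngrok.app/"));
```

Keeping the builder pure means the route handler shrinks to a single res.type("text/xml").send(buildTwiml(PUBLIC_URL)) call.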
3. Bridge Two WebSockets
When Twilio connects to /twilio-stream, the server opens a second WebSocket to AssemblyAI. The initial session.update payload configures the agent's personality, greeting, and native audio formats.
aaiWs.send(JSON.stringify({
  type: "session.update",
  session: {
    system_prompt: SYSTEM_PROMPT,
    greeting: GREETING,
    input: { format: { encoding: "audio/pcmu" } },
    output: { voice: "ivy", format: { encoding: "audio/pcmu" } },
  },
}));
Both formats are audio/pcmu. Twilio Media Streams deliver base64-encoded μ-law 8 kHz audio natively. AssemblyAI accepts and emits the same format, enabling zero-decode, zero-resample byte passthrough. The greeting field ensures the agent speaks first, which is mandatory for outbound calls where the recipient lacks context.
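Because μ-law at 8 kHz uses one byte per sample, the playback duration of any payload follows directly from its decoded byte count. The sketch below illustrates that arithmetic; mulawDurationMs is a hypothetical helper, and the 160-byte frame is an assumed frame size chosen for illustration, not a value taken from this guide.

```javascript
// G.711 μ-law: 8000 samples per second, one byte per sample.
const MULAW_SAMPLE_RATE_HZ = 8000;

// Hypothetical helper: playback duration of a base64 μ-law payload in milliseconds.
// The payload is decoded only to count bytes; the audio itself is never transcoded.
function mulawDurationMs(base64Payload) {
  const byteCount = Buffer.from(base64Payload, "base64").length;
  return (byteCount * 1000) / MULAW_SAMPLE_RATE_HZ;
}

// Assumed example frame: 160 bytes of 0xff (μ-law silence).
const frame = Buffer.alloc(160, 0xff).toString("base64");
console.log(mulawDurationMs(frame)); // 20
```

The relay itself never needs this calculation, but it is handy for logging how much audio is buffered on either side.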
4. Forward Audio in Both Directions
The Twilio side emits connected, start, media, and stop events. The server captures streamSid from start, forwards media payloads to AssemblyAI as input.audio, and closes the AssemblyAI socket on stop.
case "media": {
  const payload = msg.media.payload; // already base64 μ-law 8 kHz
  aaiWs.send(JSON.stringify({ type: "input.audio", audio: payload }));
  break;
}
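The full Twilio-side handling described above (connected, start, media, stop) can be sketched as a pure function that takes the current state and one raw message and returns what the server should do next. handleTwilioMessage and the shape of its return value are assumptions made for illustration; the event names and the input.audio message come from the text above.

```javascript
// Hypothetical pure handler: maps one Twilio Media Streams message to the next
// state, the messages to forward to AssemblyAI, and whether to close that socket.
function handleTwilioMessage(state, raw) {
  const msg = JSON.parse(raw);
  switch (msg.event) {
    case "start":
      // Remember the stream identifier; it is needed to address audio back to the call.
      return { state: { ...state, streamSid: msg.start.streamSid }, toAai: [], close: false };
    case "media":
      // The payload is forwarded untouched: it is already base64 μ-law 8 kHz.
      return { state, toAai: [{ type: "input.audio", audio: msg.media.payload }], close: false };
    case "stop":
      // The call ended: signal the caller to close the AssemblyAI socket.
      return { state, toAai: [], close: true };
    default:
      // "connected" and any unrecognized event need no action.
      return { state, toAai: [], close: false };
  }
}

// Usage with a simulated start event.
const started = handleTwilioMessage(
  { streamSid: null },
  JSON.stringify({ event: "start", start: { streamSid: "MZ123" } })
);
console.log(started.state.streamSid); // MZ123
```

In the live server, the WebSocket message callback would apply the returned state, send each toAai entry with aaiWs.send(JSON.stringify(...)), and close aaiWs when close is true.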
Each reply.audio chunk from AssemblyAI carries base64 μ-law audio. The server wraps it in a Twilio media event and sends it back to the call:
case "reply.audio":
  twilioWs.send(JSON.stringify({
    event: "media",
    streamSid,
    media: { payload: evt.data },
  }));
  break;
5. Handle Barge-in Cleanly
When the user speaks during agent playback, AssemblyAI emits reply.done with status: "interrupted". The server must flush Twilio's audio buffer to prevent overlapping speech.
case "reply.done":
  if (evt.status === "interrupted" && streamSid) {
    twilioWs.send(JSON.stringify({ event: "clear", streamSid }));
  }
  break;
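The return path can be expressed the same way: one pure function that turns an AssemblyAI event into zero or more Twilio messages, covering both audio playback and the barge-in flush. This is a sketch rather than the original implementation; toTwilioMessages is a hypothetical name, and the event fields (type, data, status) are the ones used in the snippets above.

```javascript
// Hypothetical pure mapper from one AssemblyAI event to the Twilio messages to send.
function toTwilioMessages(evt, streamSid) {
  // Nothing can be addressed to the call before Twilio's start event supplies streamSid.
  if (!streamSid) return [];
  switch (evt.type) {
    case "reply.audio":
      // Wrap the base64 μ-law chunk in a Twilio media event, unchanged.
      return [{ event: "media", streamSid, media: { payload: evt.data } }];
    case "reply.done":
      // On barge-in, flush whatever agent audio Twilio still has buffered.
      return evt.status === "interrupted" ? [{ event: "clear", streamSid }] : [];
    default:
      return [];
  }
}

// Usage with a simulated interruption.
console.log(toTwilioMessages({ type: "reply.done", status: "interrupted" }, "MZ123"));
```

The server would then send each returned object with twilioWs.send(JSON.stringify(message)), which keeps the serialization step in one place.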
6. Echo Cancellation is the Carrier's Job
PSTN and mobile networks handle acoustic echo cancellation at the network and device level. This architectural decision removes client-side AEC requirements, simplifying deployment and improving reliability on variable-quality phone lines.
Twilio's calls.create accepts production-critical flags for recording, machine detection, and hard time limits:
const call = await twilioClient.calls.create({
  to,
  from: TWILIO_FROM_NUMBER,
  url: `${PUBLIC_URL}/twiml`,
  record: true,
  machineDetection: "Enable",
  timeLimit: 600, // hard cap in seconds
});
Tools (function calling) are registered via the same session.update payload, enabling the agent to execute backend actions (booking, account lookup, DNC marking) without breaking the audio stream.
Pitfall Guide
- Audio Format Mismatch (Chipmunky/Muffled Voice): Both session.input.format.encoding and session.output.format.encoding must be audio/pcmu. Leaving either at the default audio/pcm (24 kHz) causes a sample rate mismatch, resulting in distorted playback or silent failures.
- Barge-in Buffering Overlap: Forgetting to send the clear event to Twilio when reply.done returns status: "interrupted" leaves buffered audio playing over the user's speech. Always pair interruption detection with twilioWs.send(JSON.stringify({ event: "clear", streamSid })).
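- TwiML Fetch Failures (Immediate Call Drops): Twilio cannot reach PUBLIC_URL/twiml if the URL points at localhost or the ngrok tunnel has expired. An http:// PUBLIC_URL also breaks the call, because the derived Stream URL becomes ws:// and Twilio Media Streams require wss://. Always validate that PUBLIC_URL is a live https:// endpoint before dialing.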
- Turn Detection Misfires on Noisy Lines: Defaults tuned for browser audio can misfire on PSTN and mobile networks. Tune vad_threshold (0.0–1.0), min_silence (ms), and max_silence (ms) in turn_detection to accommodate carrier noise floors and deliberate speech patterns.
- Unbounded Call Duration & Budget Burn: Omitting timeLimit in calls.create allows stuck LLM loops or silent connections to run indefinitely. Always set a hard ceiling (e.g., 600s) to cap costs and prevent resource exhaustion.
- Compliance & Consent Oversights: Automated outbound calls are heavily regulated (TCPA, GDPR, state DNC registries, two-party consent). Always disclose automation in the opening greeting, honor opt-out requests programmatically, and verify local telephony laws before production deployment.
- Trial Account Dialing Restrictions: Twilio trial accounts can only call verified numbers. Calls to unverified recipients fail silently or surface errors in the console. Verify numbers in the Twilio console or upgrade to a paid account for production dialing.
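The PUBLIC_URL pitfall in particular is cheap to catch before any call is placed. The following is a minimal pre-dial check, assuming a hypothetical checkPublicUrl helper; it inspects only the URL's shape, so it cannot detect an expired tunnel, which still requires an actual request to the endpoint.

```javascript
// Hypothetical pre-dial check: rejects URLs Twilio could not use for TwiML or streaming.
function checkPublicUrl(publicUrl) {
  let parsed;
  try {
    parsed = new URL(publicUrl);
  } catch {
    return { ok: false, reason: "not a valid URL" };
  }
  // https:// is needed so the derived Stream URL becomes wss://.
  if (parsed.protocol !== "https:") {
    return { ok: false, reason: "must use https://" };
  }
  // Loopback hosts are unreachable from Twilio's servers.
  const loopbackHosts = new Set(["localhost", "127.0.0.1", "[::1]"]);
  if (loopbackHosts.has(parsed.hostname)) {
    return { ok: false, reason: "must be publicly reachable, not loopback" };
  }
  return { ok: true, reason: null };
}

console.log(checkPublicUrl("http://localhost:3000"));
console.log(checkPublicUrl("https://example.ngrok.app"));
```

Running this at startup and refusing to dial on a failed check turns an immediate call drop into a clear configuration error.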
Deliverables
📘 Architecture Blueprint
- Real-time WebSocket relay topology: Twilio Media Stream ↔ Node.js Bridge ↔ AssemblyAI Voice Agent API
- Audio pipeline diagram: Base64 μ-law 8 kHz passthrough with zero transcoding
- Session lifecycle flow: session.update → greeting → input.audio ↔ reply.audio → reply.done/clear
✅ Production Readiness Checklist
⚙️ Configuration Templates
- .env.example structure for credential injection
- session.update JSON payload template (system prompt, greeting, voice selection, turn detection, tool definitions)
- TwiML <Connect><Stream> routing snippet
- curl dialer command for local testing and CI validation