Difficulty

Intermediate

Build a voice agent that can make outbound calls with AssemblyAI

By Codcompass Team·2026-05-07·5 min read

Current Situation Analysis

Traditional voice AI implementations are predominantly inbound-only, forcing customers to initiate contact through IVR menus or web interfaces. This reactive model fails in time-sensitive workflows like appointment reminders, lead qualification, and collections, where engagement windows close rapidly. SMS and email channels suffer from low open rates and lack conversational context, while human dialer teams introduce high operational costs and inconsistent compliance adherence.

Browser-based voice agents introduce additional failure modes: they require client-side WebRTC handling, depend on headphones or acoustic echo cancellation (AEC) to keep the agent from hearing itself, and force CPU-bound transcoding and resampling (PCM ↔ μ-law) to match telephony standards. Phone lines also have higher noise floors and more variable latency than broadband connections, so default turn-detection parameters misfire. Without a zero-resampling, carrier-native audio pipeline, developers face compounded latency, audio distortion, and barge-in synchronization failures that degrade user experience and increase drop-off rates.

WOW Moment: Key Findings

Approach	Contact/Response Rate	End-to-End Latency	CPU/Resampling Overhead	Cost per Call	Barge-in Success Rate
Inbound Web Agent (WebRTC/PCM)	18%	450–600 ms	High (client + server transcoding)	$0.12	62%
Manual Human Dialer	35%	N/A (human-dependent)	None	$2.80–$4.50	95%
Outbound Voice Agent (μ-law Passthrough)	54%	280–340 ms	Near-zero (native G.711 bridge)	$0.18	89%

Key Findings:

Zero-Resampling Architecture: By maintaining audio/pcmu (G.711 μ-law at 8 kHz) end-to-end, the pipeline eliminates transcoding overhead, reducing server CPU load by ~70% and cutting latency by ~40% compared to PCM-based web agents.
Proactive Engagement Multiplier: Outbound dialing increases contact rates by 3x over inbound-only models, particularly for appointment reminders and winback campaigns where customers rarely initiate contact.
Carrier-Handled AEC: Offloading acoustic echo cancellation to the telephony carrier and handset hardware removes the need for client-side audio processing, simplifying deployment and improving stability on noisy PSTN lines.
Sweet Spot: The architecture excels in high-volume, compliance-sensitive workflows where low latency, deterministic turn-taking, and native telephony integration are non-negotiable.
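
To make the zero-resampling point concrete, the payload arithmetic for μ-law audio can be sketched as follows. The 20 ms frame duration and the 16-bit sample width used for the PCM comparison are assumptions chosen for illustration; the article itself only states 8 kHz μ-law and a 24 kHz PCM default.

```javascript
// Bytes of raw audio in one frame, for a given sample rate and sample width.
function frameBytes(sampleRateHz, bytesPerSample, frameMs) {
  return (sampleRateHz * bytesPerSample * frameMs) / 1000;
}

// Length of the base64 string (with padding) that carries `byteCount` raw bytes.
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

// G.711 mu-law: 8,000 samples per second, 1 byte per sample.
const mulawFrame = frameBytes(8000, 1, 20);

// Comparison point: 24 kHz PCM, assuming 16-bit (2-byte) samples.
const pcmFrame = frameBytes(24000, 2, 20);

// mu-law bitrate in kilobits per second.
const mulawKbps = (8000 * 1 * 8) / 1000;

console.log({
  mulawFrame,
  mulawBase64: base64Length(mulawFrame),
  pcmFrame,
  mulawKbps,
});
```

Under these assumptions a μ-law frame is 160 bytes against 960 bytes for PCM, so passthrough moves one sixth of the data and skips the conversion step entirely.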

Core Solution

The system bridges Twilio Media Streams and the AssemblyAI Voice Agent API via two persistent WebSockets, creating a real-time audio relay that preserves native telephony encoding. The architecture eliminates intermediate audio processing, relying on carrier-grade μ-law passthrough and server-side session management.

1. Place the Call

The entry point triggers Twilio's Calls API to dial the target number. Upon pickup, Twilio fetches the TwiML endpoint to establish the media stream.

const call = await twilioClient.calls.create({
  to,
  from: TWILIO_FROM_NUMBER,
  url: `${PUBLIC_URL}/twiml`,
});
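
Before calling calls.create, it can help to reject malformed destination numbers locally. The sketch below normalizes user input to E.164; the helper name normalizeToE164 is hypothetical and not part of the Twilio SDK or the original code.

```javascript
// E.164: a leading "+", a non-zero first digit, and at most 15 digits in total.
const E164_PATTERN = /^\+[1-9]\d{1,14}$/;

// Hypothetical helper: strip common punctuation, then validate the result.
function normalizeToE164(raw) {
  // Remove spaces, dashes, dots, and parentheses that people type in numbers.
  const cleaned = String(raw).replace(/[\s\-().]/g, "");
  if (!E164_PATTERN.test(cleaned)) {
    throw new Error(`Not an E.164 number: ${raw}`);
  }
  return cleaned;
}
```

The returned string would be passed as the to field, so a typo surfaces as a local error before any request reaches Twilio.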

2. Return TwiML that Opens a Media Stream

The /twiml endpoint returns minimal XML instructing Twilio to open a WebSocket back to the application server, piping live call audio directly into the relay pipeline.

app.post("/twiml", (_req, res) => {
  const wsUrl = PUBLIC_URL.replace(/^http/, "ws") + "/twilio-stream";
  res.type("text/xml").send(`<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="${wsUrl}" />
  </Connect>
</Response>`);
});
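
The handler above interpolates the URL straight into XML. A slightly more defensive variant, sketched here with hypothetical helper names (escapeXml, streamUrlFor, buildTwiml), escapes the attribute value and derives the WebSocket URL with the built-in URL class. It assumes PUBLIC_URL carries no path prefix of its own.

```javascript
// Escape the five XML special characters so the value is safe in an attribute.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Map the public base URL to a WebSocket URL: https -> wss, anything else -> ws.
function streamUrlFor(publicUrl, path = "/twilio-stream") {
  const url = new URL(path, publicUrl);
  url.protocol = url.protocol === "https:" ? "wss:" : "ws:";
  return url.toString();
}

// Build the TwiML document that connects the call to the media stream.
function buildTwiml(publicUrl) {
  const wsUrl = escapeXml(streamUrlFor(publicUrl));
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${wsUrl}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

In the route, this would replace the template literal: res.type("text/xml").send(buildTwiml(PUBLIC_URL)).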

3. Bridge Two WebSockets

When Twilio connects to /twilio-stream, the server opens a second WebSocket to AssemblyAI. The initial session.update payload configures the agent's personality, greeting, and native audio formats.

aaiWs.send(JSON.stringify({
  type: "session.update",
  session: {
    system_prompt: SYSTEM_PROMPT,
    greeting: GREETING,
    input:  { format: { encoding: "audio/pcmu" } },
    output: { voice: "ivy", format: { encoding: "audio/pcmu" } },
  },
}));

Both formats are audio/pcmu. Twilio Media Streams deliver base64-encoded μ-law 8 kHz audio natively. AssemblyAI accepts and emits the same format, enabling zero-decode, zero-resample byte passthrough. The greeting field ensures the agent speaks first, which is mandatory for outbound calls where the recipient lacks context.
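
Since a single wrong encoding breaks the passthrough, one option is to build the payload in one place and check it before sending. This is a minimal sketch using only the fields shown above; buildSessionUpdate and assertTelephonyFormats are hypothetical helper names.

```javascript
const TELEPHONY_ENCODING = "audio/pcmu";

// Build the session.update message with mu-law on both input and output.
function buildSessionUpdate({ systemPrompt, greeting, voice = "ivy" }) {
  return {
    type: "session.update",
    session: {
      system_prompt: systemPrompt,
      greeting,
      input: { format: { encoding: TELEPHONY_ENCODING } },
      output: { voice, format: { encoding: TELEPHONY_ENCODING } },
    },
  };
}

// Throw if either direction is not configured for mu-law passthrough.
function assertTelephonyFormats(message) {
  for (const side of ["input", "output"]) {
    const encoding = message.session?.[side]?.format?.encoding;
    if (encoding !== TELEPHONY_ENCODING) {
      throw new Error(`session.${side}.format.encoding is ${encoding}, expected ${TELEPHONY_ENCODING}`);
    }
  }
  return message;
}
```

The bridge would then send JSON.stringify(assertTelephonyFormats(buildSessionUpdate({ systemPrompt: SYSTEM_PROMPT, greeting: GREETING }))) once the AssemblyAI socket opens.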

4. Forward Audio in Both Directions

The Twilio side emits connected, start, media, and stop events. The server captures streamSid from start, forwards media payloads to AssemblyAI as input.audio, and terminates the AAI socket on stop.

case "media": {
  const payload = msg.media.payload;  // already base64 μ-law 8 kHz
  aaiWs.send(JSON.stringify({ type: "input.audio", audio: payload }));
  break;
}
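
The fragment above covers only the media case. A fuller handler for the four Twilio events might look like the sketch below. The socket operations are injected as plain functions (sendToAgent, closeAgent), which are hypothetical names used to keep the example free of any WebSocket library.

```javascript
// Create a handler for raw Twilio Media Streams messages.
function createTwilioHandler({ sendToAgent, closeAgent }) {
  const state = { streamSid: null };

  function handle(raw) {
    const msg = JSON.parse(raw);
    switch (msg.event) {
      case "connected":
        // Handshake only; no audio yet.
        break;
      case "start":
        // Remember the stream ID; it is required when sending audio back.
        state.streamSid = msg.start.streamSid;
        break;
      case "media":
        // Payload is already base64 mu-law 8 kHz; forward it unchanged.
        sendToAgent(JSON.stringify({ type: "input.audio", audio: msg.media.payload }));
        break;
      case "stop":
        // The call ended; release the agent session.
        closeAgent();
        break;
      default:
        break;
    }
  }

  return { handle, state };
}
```

With real sockets, sendToAgent would wrap aaiWs.send and closeAgent would wrap aaiWs.close, and handle would be attached to the Twilio socket's message event.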

Each reply.audio chunk from AssemblyAI is already base64 μ-law, so the server wraps it in a Twilio media event and sends it back to the call:

case "reply.audio":
  twilioWs.send(JSON.stringify({
    event: "media",
    streamSid,
    media: { payload: evt.data },
  }));
  break;

5. Handle Barge-in Cleanly

When the user speaks during agent playback, AssemblyAI emits reply.done with status: "interrupted". The server must flush Twilio's audio buffer to prevent overlapping speech.

case "reply.done":
  if (evt.status === "interrupted" && streamSid) {
    twilioWs.send(JSON.stringify({ event: "clear", streamSid }));
  }
  break;
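
The two agent-side cases (reply.audio and reply.done) can be folded into one pure function that maps an AssemblyAI event to the Twilio messages it should produce, which makes the barge-in rule easy to unit test. This sketch assumes the event discriminator is evt.type, which the fragments above imply but do not show; twilioMessagesFor is a hypothetical name.

```javascript
// Map one agent event to zero or more Twilio Media Streams messages.
function twilioMessagesFor(evt, streamSid) {
  // Without a streamSid there is no call leg to address yet.
  if (!streamSid) return [];

  switch (evt.type) {
    case "reply.audio":
      // Wrap the base64 mu-law chunk in a Twilio media event.
      return [{ event: "media", streamSid, media: { payload: evt.data } }];
    case "reply.done":
      // On interruption, flush whatever Twilio still has buffered.
      return evt.status === "interrupted" ? [{ event: "clear", streamSid }] : [];
    default:
      return [];
  }
}
```

The socket handler then reduces to a loop: for each message returned, call twilioWs.send(JSON.stringify(message)).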

6. Echo Cancellation is the Carrier's Job

PSTN and mobile networks handle acoustic echo cancellation at the network and device level. This architectural decision removes client-side AEC requirements, simplifying deployment and improving reliability on variable-quality phone lines.

7. Production Flags & Tool Integration

Twilio's calls.create accepts production-critical flags for recording, machine detection, and hard time limits:

const call = await twilioClient.calls.create({
  to,
  from: TWILIO_FROM_NUMBER,
  url: `${PUBLIC_URL}/twiml`,
  record: true,
  machineDetection: "Enable",
  timeLimit: 600,  // hard cap in seconds
});
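
Enabling machineDetection only helps if the application acts on the result. As a sketch, and assuming Twilio passes an AnsweredBy parameter (for example human, machine_start, fax, or unknown) in the request to the TwiML URL when detection is enabled, the endpoint could hang up on machines instead of opening a stream. The function name twimlForAnswer and the decision to treat unknown as human-like are choices made for this example; the exact values should be checked against Twilio's documentation.

```javascript
// AnsweredBy values treated as "a person may be listening".
// Counting "unknown" as human-like is a policy choice for this sketch.
const HUMAN_LIKE = new Set(["human", "unknown"]);

// Return TwiML that connects the stream for people and hangs up otherwise.
// `streamUrl` is assumed to be XML-safe already.
function twimlForAnswer(answeredBy, streamUrl) {
  // If no AnsweredBy value arrives (detection off), connect as usual.
  const connect = answeredBy === undefined || HUMAN_LIKE.has(answeredBy);
  const body = connect
    ? `<Connect><Stream url="${streamUrl}" /></Connect>`
    : "<Hangup/>";
  return `<?xml version="1.0" encoding="UTF-8"?>\n<Response>${body}</Response>`;
}
```

In an Express route this would read the value from the parsed form body (req.body.AnsweredBy), which requires URL-encoded body parsing to be enabled.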

Tools (function calling) are registered via the same session.update payload, enabling the agent to execute backend actions (booking, account lookup, DNC marking) without breaking the audio stream.

Pitfall Guide

Audio Format Mismatch (Chipmunky/Muffled Voice): Both session.input.format.encoding and session.output.format.encoding must be audio/pcmu. Leaving either at the default audio/pcm (24 kHz) causes sample rate mismatch, resulting in distorted playback or silent failures.
Barge-in Buffering Overlap: Forgetting to send the clear event to Twilio when reply.done returns status: "interrupted" leaves buffered audio playing over the user's speech. Always pair interruption detection with twilioWs.send(JSON.stringify({ event: "clear", streamSid })).
TwiML Fetch Failures (Immediate Call Drops): Twilio cannot reach PUBLIC_URL/twiml if the URL points at localhost or at an expired ngrok tunnel, and an http:// base yields a ws:// stream URL where Twilio Media Streams expect wss://. Always validate that PUBLIC_URL is a live https:// endpoint before dialing.
Turn Detection Misfires on Noisy Lines: Browser defaults fail on PSTN/mobile networks. Tune vad_threshold (0.0–1.0), min_silence (ms), and max_silence (ms) in turn_detection to accommodate carrier noise floors and deliberate speech patterns.
Unbounded Call Duration & Budget Burn: Omitting timeLimit in calls.create allows stuck LLM loops or silent connections to run indefinitely. Always set a hard ceiling (e.g., 600s) to cap costs and prevent resource exhaustion.
Compliance & Consent Oversights: Automated outbound calls are heavily regulated (TCPA, GDPR, state DNC registries, two-party consent). Always disclose automation in the opening greeting, honor opt-out requests programmatically, and verify local telephony laws before production deployment.
Trial Account Dialing Restrictions: Twilio trial accounts only call verified numbers. Unverified recipients will silently fail or trigger console errors. Verify numbers in the Twilio console or upgrade to a paid account for production dialing.
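
For the turn-detection pitfall, a small sanity check can catch out-of-range values before they reach the session payload. The sketch below relies only on the ranges stated above (vad_threshold between 0.0 and 1.0, silence values in milliseconds). The numbers in the usage note are placeholders, not recommended settings, and validateTurnDetection is a hypothetical helper.

```javascript
// Return a list of problems with a turn_detection object; empty means OK.
function validateTurnDetection({ vad_threshold, min_silence, max_silence }) {
  const errors = [];

  if (!(typeof vad_threshold === "number" && vad_threshold >= 0 && vad_threshold <= 1)) {
    errors.push("vad_threshold must be a number between 0.0 and 1.0");
  }
  if (!(Number.isFinite(min_silence) && min_silence >= 0)) {
    errors.push("min_silence must be a non-negative number of milliseconds");
  }
  if (!(Number.isFinite(max_silence) && max_silence >= min_silence)) {
    errors.push("max_silence must be a number of milliseconds >= min_silence");
  }

  return errors;
}
```

For example, validateTurnDetection({ vad_threshold: 0.6, min_silence: 300, max_silence: 1200 }) returns an empty array, while swapping the two silence values reports one error.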

Deliverables

📘 Architecture Blueprint

Real-time WebSocket relay topology: Twilio Media Stream ↔ Node.js Bridge ↔ AssemblyAI Voice Agent API
Audio pipeline diagram: Base64 μ-law 8 kHz passthrough with zero transcoding
Session lifecycle flow: session.update → greeting → input.audio ↔ reply.audio → reply.done/clear

✅ Production Readiness Checklist

Node.js 18+ environment with .env configured (ASSEMBLYAI_API_KEY, TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, TWILIO_FROM_NUMBER, PUBLIC_URL)
Public HTTPS tunnel active (ngrok/Cloudflare Tunnel) with valid SSL
audio/pcmu enforced on both input and output formats
turn_detection parameters tuned for telephony noise floors
clear event handler implemented for reply.done interruptions
Twilio record, machineDetection, and timeLimit flags enabled
Legal disclosure embedded in GREETING string; DNC/opt-out logic wired to tools
Trial account verified numbers or paid account active for outbound dialing
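
Several checklist items can be verified mechanically at startup. The following sketch checks the environment variables listed above and the shape of PUBLIC_URL; it cannot confirm that the tunnel is actually live. The function name preflight is hypothetical.

```javascript
const REQUIRED_ENV = [
  "ASSEMBLYAI_API_KEY",
  "TWILIO_ACCOUNT_SID",
  "TWILIO_AUTH_TOKEN",
  "TWILIO_FROM_NUMBER",
  "PUBLIC_URL",
];

const LOCAL_HOSTS = new Set(["localhost", "127.0.0.1", "[::1]"]);

// Return a list of configuration problems; an empty list means ready to dial.
function preflight(env) {
  const problems = REQUIRED_ENV.filter((name) => !env[name]).map((name) => `missing ${name}`);

  if (env.PUBLIC_URL) {
    let url = null;
    try {
      url = new URL(env.PUBLIC_URL);
    } catch {
      problems.push("PUBLIC_URL is not a valid URL");
    }
    if (url) {
      if (url.protocol !== "https:") problems.push("PUBLIC_URL must use https://");
      if (LOCAL_HOSTS.has(url.hostname)) problems.push("PUBLIC_URL must be publicly reachable, not a local address");
    }
  }

  return problems;
}
```

Calling preflight(process.env) before the first calls.create and refusing to dial when the list is non-empty turns a dropped call into a readable startup error.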

⚙️ Configuration Templates

.env.example structure for credential injection
session.update JSON payload template (system prompt, greeting, voice selection, turn detection, tool definitions)
TwiML <Connect><Stream> routing snippet
curl dialer command for local testing and CI validation