# Building a Real-Time AI Voice Agent for Asterisk
## Current Situation Analysis
Every missed phone call represents lost revenue, particularly for time-sensitive home services (plumbers, electricians, locksmiths). Human agents are costly, require shift coverage, and inevitably miss off-hours calls. Traditional IVR systems force callers through rigid menus, degrading user experience and increasing abandonment rates.
The core failure mode of legacy voice AI stacks is latency. Callers expect conversational turn-taking; a 1-2 second pause after every utterance breaks immersion and triggers hang-ups. Traditional cloud providers fail in real-time telephony contexts for three reasons:
- Synchronous Processing Pipelines: Most STT/LLM/TTS chains wait for complete sentences before processing, adding 300-500ms of accumulation delay.
- Inference Bottlenecks: General-purpose LLMs (e.g., GPT-4) exhibit 500ms-2s Time-To-First-Token (TTFT), making real-time dialogue impossible.
- Audio Format Mismatch: High-fidelity TTS engines output 24kHz/44.1kHz audio, requiring CPU-intensive resampling to 8kHz for telephony, adding 50-100ms of processing overhead and breaking streaming continuity.
Without a concurrent, token-streaming architecture that aligns STT endpointing, speculative LLM decoding, and native telephony PCM output, sub-250ms mouth-to-ear latency remains unachievable.
## WOW Moment: Key Findings
| Approach | Mouth-to-Ear Latency | TTFT (LLM) | TTFB (TTS) | Audio Format Handling | Concurrency Model |
|---|---|---|---|---|---|
| Traditional Cloud Stack (Big 3 STT + GPT-4 + ElevenLabs) | 1,200-2,000ms | 500-2,000ms | 300-400ms | Requires 24kHz→8kHz resampling | Synchronous/Blocking |
| Optimized Stack (Deepgram Nova-3 + Groq Specdec + Cartesia Sonic-3) | 200-250ms | 30-50ms | 50-80ms | Native 8kHz PCM (pcm_s16le) | Concurrent Token-Streaming |
**Key Findings:**
- Token-Streaming Breakthrough: Piping LLM tokens directly into TTS as they arrive reduces perceived latency by ~85%. The caller hears the response while the AI is still generating it.
- Speculative Decoding Advantage: Groq's `specdec` variant delivers 1,665 tokens/second (6x the throughput of standard variants) with identical output quality, making 70B-parameter models viable for real-time voice.
- Native Telephony Audio: Cartesia's native 8kHz PCM output eliminates resampling overhead and enables true WebSocket continuation streaming, critical for maintaining AudioSocket frame pacing.
- Latency Budget Alignment: Deepgram STT (~150-200ms) + Groq TTFT (~30-50ms) + Cartesia TTFB (~50-80ms) + AudioSocket transmission (~20ms) = ~200-250ms total, consistently beating human agent response times.
## Core Solution
### Architecture & Data Flow
```
                 TELEPHONE NETWORK
                         |
                     SIP Trunk
                         |
             +-----------v-----------+
             |      ASTERISK PBX     |
             |                       |
             |  1. Answer()          |
             |  2. AGI(setup.agi)    |
             |     - Generate UUID   |
             |     - Write metadata  |
             |  3. AudioSocket(      |
             |       127.0.0.1:9099) |
             +-----------+-----------+
                         |
             TCP (AudioSocket Protocol)
            8kHz 16-bit PCM, 20ms frames
                         |
             +-----------v------------+
             |   PYTHON VOICE AGENT   |
             |  (asyncio TCP server)  |
             |                        |
             | +--------------------+ |     +------------------+
  Caller     | | Audio Reader       | |     | Deepgram Nova-3  |
  speaks --->| | - Read PCM frames  |-+---->| Streaming STT    |
             | | - Barge-in VAD     | |     | (WebSocket)      |
             | +--------------------+ |     +--------+---------+
             |                        |              |
             |                        |          Transcript
             |                        |              |
             | +--------------------+ |     +--------v---------+
             | | Conversation       | |     | Groq Llama 3.3   |
             | | Manager            |<+-----| 70B specdec      |
             | | - State machine    | |     | Streaming LLM    |
             | | - Message history  | |     | (HTTP SSE)       |
             | | - Tool calls       | |     +--------+---------+
             | +--------------------+ |              |
             |           |            |         Token stream
             |           v            |              |
             | +--------------------+ |     +--------v---------+
  Caller     | | Audio Writer       | |     | Cartesia Sonic-3 |
  hears  <---| | - Queue playback   |<+-----| Streaming TTS    |
             | | - 20ms pacing      | |     | (WebSocket)      |
             | +--------------------+ |     +------------------+
             |                        |
             +------------------------+
```
```
Latency Budget (target: <250ms mouth-to-ear):
+------------------------------------------------------+
| Deepgram STT final transcript:   ~150-200ms          |
| Groq LLM first token (TTFT):     ~30-50ms            |
| Cartesia TTS first audio (TTFB): ~50-80ms            |
| AudioSocket frame transmission:  ~20ms (1 frame)     |
|                                                      |
| TOTAL:                           ~200-250ms          |
+------------------------------------------------------+
```
**Data Flow Summary:**
1. Caller audio arrives at Asterisk as RTP, converted to raw PCM via AudioSocket.
2. Python agent streams PCM frames to Deepgram Nova-3 WebSocket.
3. Deepgram returns interim transcript fragments; a `speech_final` event marks the end of the utterance and triggers processing of the complete transcript.
4. Conversation history + utterance sent to Groq Llama 3.3 70B with streaming enabled.
5. **Concurrent Token-Streaming**: Each LLM token immediately routes to Cartesia Sonic-3 continuation API.
6. Cartesia returns PCM chunks, paced at 20ms intervals, written back through AudioSocket to Asterisk.
7. Asterisk converts PCM to RTP and delivers to caller.
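Steps 1-2 on the agent side reduce to a small asyncio TCP handler. The sketch below is illustrative, not the production agent: the AudioSocket wire format (1-byte message kind, 2-byte big-endian payload length, then payload; kind `0x10` for audio, `0x01` for the call UUID, `0x00` for terminate) follows Asterisk's documented protocol, while `on_pcm_frame` is a hypothetical hook where STT streaming and barge-in VAD would attach.

```python
import asyncio
import struct

FRAME_BYTES = 320   # 8 kHz * 2 bytes/sample * 0.020 s = 320 bytes per 20 ms frame

KIND_TERMINATE = 0x00
KIND_UUID = 0x01
KIND_AUDIO = 0x10

def parse_header(header: bytes) -> tuple:
    """AudioSocket message header: 1-byte kind + 2-byte big-endian length."""
    return header[0], struct.unpack(">H", header[1:3])[0]

async def handle_call(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    """One TCP connection per call: Asterisk sends the UUID message first,
    then a stream of 20 ms PCM frames until hangup."""
    call_id = None
    try:
        while True:
            kind, length = parse_header(await reader.readexactly(3))
            payload = await reader.readexactly(length) if length else b""
            if kind == KIND_UUID:
                call_id = payload.hex()                 # look up call metadata
            elif kind == KIND_AUDIO:
                await on_pcm_frame(call_id, payload)    # feed STT / barge-in VAD
            elif kind == KIND_TERMINATE:
                break
    except asyncio.IncompleteReadError:
        pass                                            # peer closed mid-message
    finally:
        writer.close()

async def on_pcm_frame(call_id, pcm: bytes):
    pass  # hypothetical hook: stream to Deepgram, run VAD, etc.

async def main():
    server = await asyncio.start_server(handle_call, "127.0.0.1", 9099)
    async with server:
        await server.serve_forever()
```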
### Component Selection Rationale
- **STT (Deepgram Nova-3)**: Purpose-built for real-time streaming. `endpointing` controls silence detection, `speech_final` marks utterance completion, and keyword boosting (`keywords=postcode:2`) improves domain accuracy. British English model (`language=en-GB`) handles regional accents reliably.
- **LLM (Groq Llama 3.3 70B specdec)**: Speculative decoding uses a draft model to predict tokens, verified in parallel by the 70B model. Delivers 1,665 tok/s with 30-50ms TTFT, outperforming standard variants by 6x while maintaining large-model reasoning for objection handling and workflow progression.
- **TTS (Cartesia Sonic-3)**: Native 8kHz PCM (`pcm_s16le`) eliminates resampling. Continuation API enables true token-streaming over a single WebSocket context, maintaining audio continuity without context-switching overhead.
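The concurrent token-streaming pattern these components enable can be sketched with plain asyncio queues. The Groq SSE stream and the Cartesia continuation WebSocket are stubbed out here (`llm_tokens` and `tts_consumer` are stand-ins, and the demo tokens are invented); the point is the shape of the pipeline: each token flows to TTS the moment it decodes, so speech starts before generation finishes.

```python
import asyncio

async def llm_tokens(prompt: str):
    """Stand-in for the Groq SSE stream: yields tokens as they decode."""
    for tok in ["Sure,", " I", " can", " book", " that."]:
        await asyncio.sleep(0)          # simulate asynchronous arrival
        yield tok

async def stream_llm_to_tts(prompt: str, speak_queue: asyncio.Queue):
    """Push each token to TTS the moment it arrives; None marks end of turn
    (in the real agent, this is where the Cartesia continuation context is
    flushed and closed)."""
    async for tok in llm_tokens(prompt):
        await speak_queue.put(tok)
    await speak_queue.put(None)

async def tts_consumer(speak_queue: asyncio.Queue, spoken: list):
    """Stand-in for the Cartesia WebSocket: consumes text fragments;
    the real pipeline receives 8 kHz PCM chunks back."""
    while (tok := await speak_queue.get()) is not None:
        spoken.append(tok)

async def demo() -> str:
    q: asyncio.Queue = asyncio.Queue()
    spoken: list = []
    # Producer and consumer run concurrently: this is the ~85% perceived
    # latency reduction in a nutshell.
    await asyncio.gather(stream_llm_to_tts("book a plumber", q),
                         tts_consumer(q, spoken))
    return "".join(spoken)
```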
### Implementation Prerequisites & Verification
```bash
pip install websockets aiohttp cartesia
asterisk -rx "module show like audiosocket"
```

Expected output:

```
Module                 Description                Use Count  Status
res_audiosocket.so     AudioSocket support        0          Running
app_audiosocket.so     AudioSocket application    0          Running
2 modules loaded
```

If the modules are missing, load them:

```bash
asterisk -rx "module load res_audiosocket.so"
asterisk -rx "module load app_audiosocket.so"
```

Persist them in `/etc/asterisk/modules.conf`:

```
load = res_audiosocket.so
load = app_audiosocket.so
```
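For reference, a minimal dialplan entry matching the call flow in the architecture diagram might look like this (the context name, extension pattern, and the `CALL_UUID` channel variable assumed to be set by `setup.agi` are illustrative):

```ini
; /etc/asterisk/extensions.conf (sketch; names are illustrative)
[inbound-ai]
exten => _X.,1,Answer()
 same => n,AGI(setup.agi)                          ; generate UUID, write call metadata
 same => n,AudioSocket(${CALL_UUID},127.0.0.1:9099)
 same => n,Hangup()
```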
### Pipeline & State Management
- Barge-In Handling: VAD monitors incoming PCM for energy spikes during TTS playback. On detection, the audio writer queue is flushed, Cartesia context is terminated, and Groq receives an interruption signal to reset generation.
- Conversation State Machine: 8-step workflow (greet β understand β quote β collect details β book β confirm β close). State transitions are triggered by STT confidence thresholds and tool call completions.
- Tool Calling Integration: Groq outputs structured JSON for external booking APIs. The agent parses tool calls, executes `aiohttp` requests, and injects results back into the conversation context without breaking the streaming pipeline.
- DID-to-Company Context API: Inbound DID routing queries a lightweight context API to inject company-specific greetings, pricing tiers, and service areas into the system prompt dynamically.
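The state machine above can be sketched as an enum with confidence-gated transitions. This is a minimal sketch under stated assumptions: the states mirror the workflow listed above, while the `advance` helper and the 0.7 confidence threshold are illustrative, not the production values.

```python
from enum import Enum, auto

class CallState(Enum):
    """States mirror the workflow steps named in the text."""
    GREET = auto()
    UNDERSTAND = auto()
    QUOTE = auto()
    COLLECT_DETAILS = auto()
    BOOK = auto()
    CONFIRM = auto()
    CLOSE = auto()

# Linear happy path; real transitions also gate on tool-call completion
# and carry timeout fallbacks per state.
NEXT = {
    CallState.GREET: CallState.UNDERSTAND,
    CallState.UNDERSTAND: CallState.QUOTE,
    CallState.QUOTE: CallState.COLLECT_DETAILS,
    CallState.COLLECT_DETAILS: CallState.BOOK,
    CallState.BOOK: CallState.CONFIRM,
    CallState.CONFIRM: CallState.CLOSE,
}

def advance(state: CallState, stt_confidence: float,
            threshold: float = 0.7) -> CallState:
    """Advance only when the transcript is trustworthy; otherwise stay
    in the current state and re-prompt the caller."""
    if stt_confidence < threshold:
        return state
    return NEXT.get(state, CallState.CLOSE)
```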
### Latency Optimization Deep Dive
- Frame Pacing: AudioSocket requires strict 20ms PCM frame transmission. Bursting causes Asterisk buffer underruns. The audio writer uses a token bucket algorithm to maintain exact pacing.
- Asyncio Event Loop: All I/O (WebSocket STT/TTS, HTTP LLM, TCP AudioSocket) runs non-blocking. CPU-bound tasks are offloaded to thread pools to prevent loop starvation.
- Endpointing Tuning: Deepgram `endpointing` is set to 400ms for conversational pacing. Aggressive values (<200ms) cause false cuts; lenient values (>600ms) add dead air.
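The frame-pacing requirement above can be met with a deadline-based loop, one possible realization of the token-bucket idea. In this sketch `send` is a placeholder for the AudioSocket TCP write; scheduling against an absolute deadline (rather than sleeping 20ms per frame) stops drift from accumulating across long utterances.

```python
import asyncio
import time

FRAME_BYTES = 320       # 20 ms of 8 kHz 16-bit mono PCM
FRAME_INTERVAL = 0.020  # AudioSocket expects one frame per 20 ms

async def paced_writer(queue: asyncio.Queue, send) -> int:
    """Drain PCM frames at one frame per 20 ms against an absolute
    deadline; None on the queue marks end of playback."""
    sent = 0
    deadline = time.monotonic()
    while (frame := await queue.get()) is not None:
        delay = deadline - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)       # hold bursts back
        send(frame)                          # real agent: AudioSocket write
        sent += 1
        deadline = max(deadline, time.monotonic()) + FRAME_INTERVAL
    return sent
```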
## Pitfall Guide
- Ignoring Barge-In/VAD Synchronization: Failing to flush the TTS queue and terminate the Cartesia context on interruption causes the agent to talk over the caller. Implement energy-based VAD with immediate context teardown.
- Synchronous Token Processing: Waiting for full LLM sentences before invoking TTS destroys the latency budget. Must stream tokens concurrently via Cartesia's continuation API.
- Audio Resampling Overhead: Using 24kHz/44.1kHz TTS requires CPU-intensive resampling to 8kHz for telephony, adding 50-100ms of latency and breaking streaming continuity. Always use native 8kHz PCM outputs.
- Misconfigured STT Endpointing: Too aggressive (`endpointing` < 200ms) cuts off speakers mid-sentence; too lenient (> 600ms) adds silence latency. Tune based on the conversational domain and monitor `speech_final` triggers.
- Blocking the Asyncio Event Loop: Synchronous HTTP calls or heavy CPU tasks in the main loop will drop AudioSocket frames, causing Asterisk to hang up. Use `aiohttp` and `asyncio.gather`, and offload CPU work to executors.
- Neglecting Network Jitter & Frame Pacing: AudioSocket requires strict 20ms frame pacing. Bursting frames causes buffer underruns/overruns. Implement a pacing queue with token bucket rate limiting.
- State Machine Deadlocks: Failing to handle partial transcripts or tool call failures leaves the conversation stuck. Implement timeout fallbacks, confidence thresholds, and explicit state reset handlers.
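The barge-in pitfall above hinges on a cheap per-frame energy check. A minimal RMS-based VAD sketch follows; the `BARGE_IN_RMS` threshold is an illustrative assumption that would need tuning per trunk and codec chain.

```python
import struct

BARGE_IN_RMS = 800.0  # illustrative threshold for 16-bit samples; tune per trunk

def frame_rms(pcm: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian mono PCM frame."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def is_barge_in(pcm: bytes, agent_speaking: bool) -> bool:
    """Caller energy while TTS is playing means interrupt: the caller's
    handler should flush the playback queue, close the Cartesia context,
    and cancel the in-flight LLM turn."""
    return agent_speaking and frame_rms(pcm) > BARGE_IN_RMS
```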
## Deliverables
- System Blueprint: Complete architecture diagram, data flow specification, latency budget breakdown, and component interaction matrix.
- Production Checklist: Pre-deployment verification steps including Asterisk module validation, API credential testing, barge-in stress testing, latency benchmarking (<250ms target), and fallback routing configuration.
- Configuration Templates:
  - `modules.conf` & `extensions.conf` snippets for AudioSocket routing and AGI bootstrapping
  - `systemd` service unit for production daemonization with automatic restart and resource limits
  - `requirements.txt` and environment variable schema for API keys, endpoint URLs, and latency tuning parameters
  - Python asyncio TCP server skeleton with AudioSocket frame reader/writer, Deepgram/Groq/Cartesia WebSocket clients, and state machine integration points
