# ElevenLabs Cloud Voice Agent with Asterisk SIP Integration

## Current Situation Analysis
Building a production-grade AI voice agent traditionally requires stitching together four independent components: a streaming STT engine (e.g., Deepgram), an LLM (e.g., Groq), a TTS engine (e.g., Cartesia), and a custom Python AudioSocket server to manage real-time audio routing, barge-in detection, and format conversion. This local stack delivers sub-250ms latency and full component control, but introduces severe operational friction:
- **High Development Overhead**: Requires custom audio pipeline engineering, WebSocket management, and state synchronization across services.
- **Maintenance Drift**: Version incompatibilities between STT/LLM/TTS APIs, codec mismatches, and dependency updates demand continuous DevOps attention.
- **Infrastructure Scaling Costs**: Handling concurrent calls requires load-balanced media servers, RTP relay management, and dedicated GPU/CPU resources for inference.
- **SIP Integration Complexity**: Bridging telephony networks to AI pipelines often requires custom gateways, transcoding layers, and fragile dialplan logic.
For organizations routing moderate call volumes (hundreds/day) through existing PBX infrastructure, the local approach is over-engineered. The failure mode typically manifests as delayed time-to-market, unmanaged latency spikes from transcoding, and unsustainable operational overhead. A managed cloud conversational AI platform eliminates the audio socket layer, bundles STT/LLM/TTS into a single SIP-native endpoint, and delegates real-time media handling to the provider, allowing teams to focus on business logic and dynamic context injection instead of pipeline plumbing.
## WOW Moment: Key Findings

| Approach | Setup Time | Avg. Latency | Maintenance Overhead | Cost per 1k Calls | Dev Effort (Person-Days) |
|---|---|---|---|---|---|
| Local Stack (AudioSocket + Deepgram + Groq + Cartesia) | 14–21 days | ~250ms | High (4+ services, codec tuning, barge-in logic) | $12–18 | 40–60 |
| ElevenLabs Cloud (Managed ConvAI + SIP Trunk) | 2–4 hours | ~300ms | Near-zero (managed STT/LLM/TTS, native G711 ulaw) | $25–35 | 2–5 |
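
To make the trade-off in the table concrete, the sketch below estimates how many calls it would take for the local stack's lower per-call cost to repay its larger build effort. The per-1k-call costs and person-day figures are midpoints of the ranges in the table; the person-day rate is a purely hypothetical input and does not come from this guide. Ongoing maintenance cost is deliberately left out, which favors the local stack.

```python
def break_even_calls(local_cost_per_1k, cloud_cost_per_1k,
                     local_dev_days, cloud_dev_days, day_rate):
    """Calls needed before the local stack's per-call saving repays its extra build cost."""
    extra_build_cost = (local_dev_days - cloud_dev_days) * day_rate
    saving_per_call = (cloud_cost_per_1k - local_cost_per_1k) / 1000
    if saving_per_call <= 0:
        raise ValueError("the local stack has no per-call saving to recover the build cost")
    return extra_build_cost / saving_per_call


if __name__ == "__main__":
    # Midpoints of the table ranges; day_rate=500 is an assumed, illustrative figure.
    calls = break_even_calls(15.0, 30.0, 50.0, 3.5, day_rate=500.0)
    years = calls / 300 / 365  # at a moderate volume of 300 calls/day
    print(f"break-even after {calls:,.0f} calls ({years:.1f} years at 300 calls/day)")
```

Under these assumptions the break-even point is roughly 1.55 million calls, which is why the managed option tends to win at volumes of a few hundred calls per day.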
Key Findings:
- **Single-Agent Multi-Brand Deployment**: Dynamic context injection via webhook tools allows one agent configuration to serve dozens of company brands by resolving DID-specific variables (`company_name`, `trade_type`, `callout_fee`) at call start.
- **Zero Transcoding Overhead**: Native `ulaw_8000` support on both ASR and TTS sides eliminates media conversion bottlenecks and preserves telephony-grade audio fidelity.
- **SIP-Native Routing**: Direct SIP trunking to `sip.rtc.elevenlabs.io:5060` bypasses custom audio sockets, leveraging standard Asterisk dialplan logic for overflow, failover, and DID-based routing.
- **Tool-Driven State Management**: Server tools act as deterministic hooks for business logic (context lookup, booking creation), keeping the LLM focused on conversation flow while backend systems handle data persistence.
## Core Solution

### Architecture Overview

            TELEPHONE NETWORK
                    |
           SIP Trunk (inbound)
                    |
        +-----------v------------+
        |      ASTERISK PBX      |
        |                        |
        |  DID arrives           |
        |  ViciDial inbound      |
        |  group processes it    |
        |                        |
        |  If no agents free:    |
        |  overflow to           |
        |  elevenlabs_ai ext     |
        |                        |
        |  Dial(SIP/elevenlabs/  |
        |    ${DID},120,tT)      |
        +-----------+------------+
                    |
         SIP INVITE (G711 ulaw)
         To: DID@sip.rtc.elevenlabs.io
         From: CallerID
                    |
        +-----------v------------+
        |    ELEVENLABS CLOUD    |
        |                        |
        |  1. Agent starts       |
        |  2. Calls getCallCtx   |----> YOUR WEBHOOK SERVER
        |     webhook            |<---- { company, trade, fee }
        |  3. Greets caller      |
        |  4. Conversation...    |
        |  5. Calls createBook   |----> YOUR WEBHOOK SERVER
        |     webhook            |<---- { booking_id: 42 }
        |  6. Confirms & hangs   |
        |     up                 |
        +------------------------+

            YOUR SERVER
        +-------------------+
        | Webhook Endpoints |
        |                   |
        | /did_context.php  |
        |  - Lookup DID     |
        |  - Check repeat   |
        |  - Return context |
        |                   |
        | /create_booking   |
        |  .php             |
        |  - Validate data  |
        |  - INSERT into    |
        |    bookings DB    |
        |  - Return ID      |
        +---------+---------+
                  |
           +------v------+
           |   MariaDB   |
           | did_company |
           | _map        |
           | ai_agent    |
           | _bookings   |
           +-------------+

### Key Concepts
- **Server Tools (Webhooks)**: HTTP endpoints attached to the agent. The LLM invokes them based on tool descriptions and conversation state. Responses populate dynamic variables used throughout the call.
- **Dynamic Variables**: Webhook outputs (`company_name`, `callout_fee`) become scoped variables. A single agent serves multiple brands by resolving context per DID.
- **SIP Trunk**: ElevenLabs exposes `sip.rtc.elevenlabs.io:5060`. Asterisk routes calls as a SIP peer. DID and CallerID flow through SIP headers, enabling context-aware routing.
- **G711 ulaw at 8kHz**: Standard telephony codec. Native `ulaw_8000` support removes transcoding overhead and ensures compatibility with legacy PBX infrastructure.
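
As a rough illustration of how a server tool can turn a DID into dynamic variables, the sketch below mimics the logic behind a `getCallContext` style webhook. It is not the article's PHP stub: the in-memory dictionary stands in for the `did_company_map` table, the sample brands, phone numbers, and the `found` and `repeat_customer` field names are hypothetical, and a real handler would query MariaDB instead.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BrandContext:
    company_name: str
    trade_type: str
    callout_fee: str


# Hypothetical stand-in for the did_company_map table (sample data only).
DID_COMPANY_MAP = {
    "15550100": BrandContext("Acme Plumbing", "plumbing", "49.00"),
    "15550101": BrandContext("Brightline Electric", "electrical", "59.00"),
}

# Hypothetical set of caller IDs that already have a booking on file.
PREVIOUS_CALLERS = {"15550199"}


def get_call_context(did: str, caller_id: str) -> dict:
    """Resolve brand variables for one call; every call gets a fresh dict."""
    brand = DID_COMPANY_MAP.get(did)
    if brand is None:
        # Unknown DID: report it explicitly instead of guessing a brand.
        return {"found": False}
    context = asdict(brand)
    context["found"] = True
    context["repeat_customer"] = caller_id in PREVIOUS_CALLERS
    return context
```

Because the function builds a new dictionary per call, two simultaneous calls to different DIDs cannot share or overwrite each other's variables.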
### ElevenLabs Account and API Setup
```bash
# Store your API key (never commit this to git)
export EL_API_KEY="YOUR_EL_API_KEY"

# List available voices via API
curl -s "https://api.elevenlabs.io/v1/voices" \
  -H "xi-api-key: ${EL_API_KEY}" | python3 -m json.tool | head -50
```

### Key API Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| `/tools` | POST | Create a server tool (webhook) |
| `/agents/create` | POST | Create a new agent |
| `/agents/{id}` | PATCH | Update an existing agent |
| `/agents/{id}` | GET | Get agent details |
| `/phone-numbers` | POST | Import a phone number for SIP |
| `/conversations` | GET | List past conversations |

### Creating Webhook Server Tools

Server tools enable deterministic backend interaction. Each tool is an HTTP endpoint that receives structured parameters and returns JSON.

**Tool 1: `getCallContext`**

Invoked at conversation start. Accepts the DID and CallerID; returns company context, trade type, callout fee, and a repeat-customer flag.

    #!/bin/bash
    # Create the getCallContext server tool via ElevenLabs API
    EL_API_KEY="${EL_API_KEY}"
    EL_BASE="https://api.elevenlabs.io/v1/convai"
    WEBHOOK_HOST="https://YOUR_SERVER_DOMAIN"
    WEBHOOK_API_KEY="YOUR_WEBHOOK_API_KEY"
    python3 -c "
    import json,

### Asterisk SIP Trunk & Dialplan Routing

Configure Asterisk to route overflow calls to ElevenLabs via SIP:
- **SIP Peer**: Point to `sip.rtc.elevenlabs.io:5060` with `ulaw` codec priority.
- **Dialplan Logic**: Use `Dial(SIP/elevenlabs/${DID},120,tT)` to pass the DID dynamically. The `120` is the ring timeout in seconds; the `t` and `T` options let the called and calling party, respectively, transfer the call with the DTMF transfer sequence.
- **Context Injection**: The DID in the SIP `To:` header triggers the `getCallContext` webhook, resolving brand-specific variables before the agent speaks.
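
The two Asterisk pieces described above can be generated from a few values. The following sketch renders a minimal `chan_sip` peer stanza and the overflow extension as text; the helper names are hypothetical, the peer name `elevenlabs` and extension `elevenlabs_ai` are taken from the architecture diagram, and settings such as transport, authentication, and NAT handling are intentionally left out because this guide does not specify them. It also assumes `${DID}` has been set earlier in the dialplan.

```python
def render_sip_peer(name: str, host: str, port: int, codec: str = "ulaw") -> str:
    """Render a minimal sip.conf peer stanza that offers a single codec."""
    lines = [
        f"[{name}]",
        "type=peer",
        f"host={host}",
        f"port={port}",
        "disallow=all",    # drop every codec first ...
        f"allow={codec}",  # ... then allow only the telephony codec
    ]
    return "\n".join(lines) + "\n"


def render_overflow_exten(exten: str, peer: str,
                          timeout_s: int = 120, options: str = "tT") -> str:
    """Render the extensions.conf line that dials the peer with the inbound DID."""
    # ${{DID}} in the f-string produces the literal Asterisk variable ${DID}.
    return f"exten => {exten},1,Dial(SIP/{peer}/${{DID}},{timeout_s},{options})\n"


if __name__ == "__main__":
    print(render_sip_peer("elevenlabs", "sip.rtc.elevenlabs.io", 5060))
    print(render_overflow_exten("elevenlabs_ai", "elevenlabs"))
```

Generating the snippets from one place keeps the peer name in `sip.conf` and the `Dial()` target in `extensions.conf` from drifting apart.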

### Agent Creation & Prompt Engineering

- Define a system prompt with explicit tool invocation rules.
- Map dynamic variables (`${company_name}`, `${trade_type}`) to webhook responses.
- Enforce strict conversation workflows: greeting → context validation → objection handling → booking creation → confirmation.
- Use the ElevenLabs API to POST the agent configuration, attach tools, and provision SIP phone numbers.
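
To show how one prompt can serve several brands, the sketch below fills `${...}` placeholders from a webhook-style response using Python's standard `string.Template`, which happens to use the same placeholder syntax. This is a local illustration of the idea only; it makes no claim about how ElevenLabs performs the substitution, and the prompt wording and sample values are hypothetical.

```python
from string import Template

# Hypothetical prompt text; only the variable names come from the article.
PROMPT_TEMPLATE = Template(
    "You are the phone assistant for ${company_name}, a ${trade_type} company. "
    "Greet the caller, confirm the job details, state the callout fee of "
    "${callout_fee}, and only then use the booking tool."
)


def render_prompt(context: dict) -> str:
    """Fill the prompt from one call's context.

    substitute() raises KeyError when a variable is missing, so an incomplete
    webhook response fails loudly instead of leaking a raw placeholder.
    """
    return PROMPT_TEMPLATE.substitute(context)


if __name__ == "__main__":
    print(render_prompt({
        "company_name": "Acme Plumbing",
        "trade_type": "plumbing",
        "callout_fee": "49.00",
    }))
```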

## Pitfall Guide

- **Webhook Security & Validation**: Exposing endpoints without API key validation, IP allowlisting, or rate limiting invites abuse and DDoS. Always verify `Authorization` headers and implement request signing.
- **SIP Codec Mismatch**: Forcing linear PCM, G722, or G711 alaw instead of G711 ulaw triggers transcoding overhead, increases latency, and degrades telephony audio quality. Enforce G711 ulaw on both sides: `ulaw` on the Asterisk peer and `ulaw_8000` on the ElevenLabs side.
- **Dynamic Variable Scope Leakage**: Failing to reset or namespace context variables between calls causes cross-call data contamination. Ensure webhooks return fresh payloads per SIP INVITE and avoid caching DID mappings without TTL expiration.
- **Vague Tool Descriptions**: Imprecise tool documentation leads to premature or missed webhook invocations. Provide explicit parameter schemas, return types, and invocation triggers in the tool definition.
- **Unbounded Concurrency & Cost Blindness**: Without call duration limits, overflow thresholds, or fallback routing, concurrent call spikes can rapidly escalate cloud AI costs. Implement call duration limits in Asterisk (for example `TIMEOUT(absolute)` or the `L()` option of `Dial`), queue limits, and human-agent handoff rules.
- **Missing Conversation State Logging**: Failing to log webhook requests/responses and ElevenLabs conversation IDs makes debugging failed bookings or hallucinated responses impossible. Centralize logs with correlation IDs tied to the SIP Call-ID.
- **Ignoring Barge-In & Silence Handling**: Cloud agents handle barge-in natively, but misconfigured `stability` or `similarity_boost` parameters can cause unnatural interruptions or robotic pacing. Tune voice parameters per use case and test with real telephony noise profiles.
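
For the first pitfall, a minimal sketch of the two checks might look like the following. The `Bearer` scheme and the HMAC-SHA256 body signature are assumptions chosen for illustration, not a scheme documented in this guide; whatever the calling side actually sends should drive the real implementation.

```python
import hashlib
import hmac
from typing import Optional


def is_authorized(auth_header: Optional[str], expected_api_key: str) -> bool:
    """Check an 'Authorization: Bearer <key>' header using a constant-time compare."""
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    presented = auth_header[len("Bearer "):]
    return hmac.compare_digest(presented.encode(), expected_api_key.encode())


def sign_body(body: bytes, secret: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed signing scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def has_valid_signature(body: bytes, signature_hex: str, secret: bytes) -> bool:
    """Recompute the signature over the received bytes and compare in constant time."""
    expected = sign_body(body, secret)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

Comparing with `hmac.compare_digest` rather than `==` avoids leaking, through response timing, how much of a guessed key or signature was correct.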

## Deliverables

- **Architecture Blueprint**: Complete SIP-to-Webhook data flow diagram, component interaction map, and state management sequence for multi-brand dynamic context injection.
- **Deployment Checklist**: Pre-flight validation steps covering API key provisioning, SIP trunk registration, codec alignment, webhook TLS validation, database schema initialization, and dialplan syntax verification.
- **Configuration Templates**:
  - Asterisk `sip.conf` & `extensions.conf` snippets for the ElevenLabs SIP peer and DID-based overflow routing
  - Webhook server stubs (`/did_context.php`, `/create_booking.php`) with parameter validation and MariaDB insertion logic
  - ElevenLabs agent JSON payload template with tool attachments, dynamic variable mappings, and voice configuration
  - Database schema DDL for `did_company_map` and `ai_agent_bookings` tables
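
As a rough picture of what the booking stub and its table could involve, the sketch below validates a payload and inserts a row. It uses Python's built-in `sqlite3` as a stand-in for MariaDB so that it runs anywhere; only the table name `ai_agent_bookings` comes from this guide, while the column names, required fields, and response shape are hypothetical.

```python
import sqlite3

# Hypothetical columns; a MariaDB version would use AUTO_INCREMENT and DATETIME.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ai_agent_bookings (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    did           TEXT NOT NULL,
    caller_id     TEXT NOT NULL,
    customer_name TEXT NOT NULL,
    job_summary   TEXT NOT NULL,
    created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""

REQUIRED_FIELDS = ("did", "caller_id", "customer_name", "job_summary")


def open_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and make sure the bookings table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def create_booking(conn: sqlite3.Connection, payload: dict) -> dict:
    """Validate the payload, insert one booking, and return its ID."""
    missing = [f for f in REQUIRED_FIELDS if not str(payload.get(f) or "").strip()]
    if missing:
        return {"ok": False, "error": "missing fields: " + ", ".join(missing)}
    cur = conn.execute(
        "INSERT INTO ai_agent_bookings (did, caller_id, customer_name, job_summary) "
        "VALUES (?, ?, ?, ?)",  # placeholders keep caller-supplied text out of the SQL
        tuple(str(payload[f]).strip() for f in REQUIRED_FIELDS),
    )
    conn.commit()
    return {"ok": True, "booking_id": cur.lastrowid}
```

Returning a structured error for missing fields, rather than raising, gives the agent something it can act on, such as asking the caller for the missing detail.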
