How I built a voice AI agent that answers phone calls and remembers callers
Stateful Voice Agents: Architecting Persistent Memory and Low-Latency Telephony Integration
Current Situation Analysis
Small businesses lose significant revenue due to missed calls and inconsistent customer interactions. While voice AI agents have emerged to answer phones 24/7, most implementations suffer from a critical flaw: session amnesia. Standard voice agents treat every inbound call as a first-time interaction, forcing customers to repeat information and preventing the agent from adapting to historical context, such as previous complaints or specific preferences.
This limitation is often overlooked because developers prioritize low latency and STT accuracy over state management. However, the absence of persistent memory degrades user trust and limits the agent's utility to simple routing or FAQ tasks. Advanced implementations require a memory layer that survives call termination and injects context dynamically without introducing latency that breaks the natural flow of conversation.
Production data from deployed voice agents demonstrates that stateful architectures can maintain cost efficiency while drastically improving interaction quality. For example, a fully stateful agent handling multilingual calls (Spanish, English, Catalan) with post-call analysis can operate at approximately €0.28 per 5-minute call. This cost point proves that adding memory and analysis does not necessarily explode operational expenses, provided the architecture leverages efficient models and streaming protocols.
WOW Moment: Key Findings
The following comparison highlights the operational impact of implementing a stateful memory layer versus a traditional stateless voice agent.
| Approach | Context Retention | Sentiment Adaptation | Cost (5-min call) | Implementation Complexity |
|---|---|---|---|---|
| Stateless Voice Agent | None | Static | €0.25 | Low |
| Stateful Agent (SQLite + Injection) | Full History | Dynamic (e.g., patience for frustrated callers) | €0.28 | Medium |
| Stateful Agent (Vector DB + RAG) | Semantic Search | Dynamic + Nuanced | €0.35 | High |
Why this matters: The stateful approach using a lightweight relational store like SQLite offers the highest ROI for most business use cases. It adds only €0.03 per call compared to stateless agents but enables critical features like recognizing returning callers, adjusting tone based on past sentiment, and generating actionable post-call summaries. This transforms the agent from a passive answerer into an active participant that builds rapport over time.
Core Solution
Building a stateful voice agent requires orchestrating telephony, speech processing, LLM inference, and persistent storage while maintaining sub-second latency. The architecture below uses Vapi for telephony and STT, Claude Haiku for reasoning, Deepgram Nova 3 for multilingual speech recognition, Azure TTS for voice synthesis, and SQLite for memory.
1. Telephony Orchestration and Custom LLM Integration
Vapi manages the PSTN connection and converts speech to text. To enable custom memory injection, configure Vapi to use a Custom LLM provider. This routes the conversation logic to your own endpoint. Vapi sends a POST request containing the call metadata and message history, expecting an OpenAI-compatible Server-Sent Events (SSE) stream in return.
Architecture Decision: Use a Custom LLM endpoint rather than Vapi's built-in models to decouple memory logic from the telephony provider. This allows you to swap LLMs or update memory schemas without reconfiguring Vapi.
from flask import Flask, request, Response
import json
import sse_client
app = Flask(__name__)
class VoiceOrchestrator:
def __init__(self, memory_repo, llm_client):
self.memory = memory_repo
self.llm = llm_client
def build_context(self, caller_id: str, messages: list) -> dict:
"""Injects persistent memory into the conversation context."""
profile = self.memory.get_caller_profile(caller_id)
system_prompt = self._generate_base_prompt()
if profile:
if profile.get("name"):
system_prompt += f"\n\nThe caller is known to us. Their name is {profile['name']}."
if profile.get("last_sentiment") == "frustrated":
system_prompt += "\n\nIMPORTANT: This caller expressed frustration in the previous session. Prioritize empathy and patience."
return {"system": system_prompt, "messages": messages}
def stream_response(self, context: dict):
"""Yields SSE tokens from the LLM."""
response = self.llm.chat.completions.create(
model="claude-3-haiku-20240307",
messages=[{"role": "s", "content": context["system"]}] + context["messages"],
stream=True,
tools=self._get_tools()
)
for chunk in response:
if chunk.choices[0].delta.content:
yield f"data: {json.dumps({'choices': [{'delta': {'content': chunk.choices[0].delta.content}}]})}\n\n"
elif chunk.choices[0].delta.tool_calls:
# Handle tool calls without breaking stream
yield self._handle_tool_stream(chunk)
@app.route("/v1/chat/completions", methods=["POST"])
def handle_chat():
payload = request.json
caller_id = payload.get("call", {}).get("customer", {}).get("number")
messages = payload.get("messages", [])
orchestrator = VoiceOrchestrator(memory_repo=sqlite_repo, llm_client=claude_client)
context = orchestrator.build_context(caller_id, messages)
return Response(
orchestrator.stream_response(context),
mimetype="text/event-stream"
)
2. Persistent Memory Layer
Memory is stored in SQLite, keyed by the caller's phone number. This schema supports fast lookups and simple updates. On every call, the system queries the database and appends relevant history to the system prompt.
Rationale: SQLite is chosen for its zero-configuration overhead and sufficient performance for read-heavy memory injection. For high-throughput scenarios, this can be swapped for Redis or PostgreSQL without changing the application logic.
import sqlite3
class MemoryRepository:
def __init__(self, db_path: str):
self.conn = sqlite3.connect(db_path)
self._init_schema()
def _init_schema(self):
self.conn.execute("""
CREATE TABLE IF NOT EXISTS callers (
phone_number TEXT PRIMARY KEY,
name TEXT,
last_sentiment TEXT,
last_interaction_summary TEXT,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
self.conn.commit()
def get_caller_profile(self, phone: str) -> dict | None:
cursor = self.conn.execute(
"SELECT * FROM callers WHERE phone_number = ?", (phone,)
)
row = cursor.fetchone()
if row:
return {
"name": row[1],
"last_sentiment": row[2],
"summary": row[3]
}
return None
def update_profile(self, phone: str, name: str = None, sentiment: str = None, summary: str = None):
self.conn.execute("""
INSERT INTO callers (phone_number, name, last_sentiment, last_interaction_summary)
VALUES (?, ?, ?, ?)
ON CONFLICT(phone_number) DO UPDATE SET
name = COALESCE(excluded.name, name),
last_sentiment = COALESCE(excluded.last_sentiment, last_sentiment),
last_interaction_summary = COALESCE(excluded.last_interaction_summary, last_interaction_summary),
updated_at = CURRENT_TIMESTAMP
""", (phone, name, sentiment, summary))
self.conn.commit()
3. Post-Call Analysis and Tool Use Handling
When a call ends, Vapi triggers a webhook. The agent sends the full transcript to Claude Haiku to extract a summary, urgency score, required actions, and sentiment. This data updates the SQLite profile for future calls.
A common technical hurdle is tool use silence. When the LLM invokes a tool (e.g., saving a name) without generating conversational text, the response stream may be empty, causing Vapi to hang or disconnect. The solution is to detect tool usage and trigger a secondary generation pass to produce a verbal acknowledgment.
@app.route("/webhooks/call-ended", methods=["POST"])
def handle_call_ended():
payload = request.json
transcript = payload.get("transcript", [])
caller_id = payload.get("call", {}).get("customer", {}).get("number")
# Extract insights via LLM
analysis_prompt = f"""
Analyze this call transcript. Return JSON with:
- summary: 2-line summary
- urgency: 1-5 score
- action: "call_back", "send_quote", or "none"
- sentiment: "satisfied", "neutral", "frustrated"
Transcript: {json.dumps(transcript)}
"""
result = claude_client.messages.create(
model="claude-3-haiku-20240307",
messages=[{"role": "user", "content": analysis_prompt}],
response_format={"type": "json_object"}
)
insights = json.loads(result.content[0].text)
# Update memory
sqlite_repo.update_profile(
phone=caller_id,
sentiment=insights.get("sentiment"),
summary=insights.get("summary")
)
return {"status": "processed"}
def _handle_tool_stream(chunk):
"""
Mitigates empty response when tool is called.
Triggers a follow-up generation to ensure verbal output.
"""
tool_calls = chunk.choices[0].delta.tool_calls
if tool_calls:
# Execute tool logic here...
# Force a verbal response generation
verbal_response = claude_client.chat.completions.create(
model="claude-3-haiku-20240307",
messages=[{"role": "assistant", "content": "Tool executed successfully."}],
prompt="Respond naturally to the user confirming the action."
)
return f"data: {json.dumps({'choices': [{'delta': {'content': verbal_response.choices[0].message.content}}]})}\n\n"
return ""
Pitfall Guide
Streaming Protocol Mismatch
- Explanation: Vapi requires SSE streaming responses. Returning a standard JSON response causes the call to terminate immediately.
- Fix: Ensure your endpoint returns
text/event-streamand yields chunks incrementally. Use generators in Flask/FastAPI to stream tokens as they arrive from the LLM.
Tool Invocation Silence
- Explanation: If the LLM calls a tool and produces no text content, the voice agent receives an empty stream, leading to dead air or disconnection.
- Fix: Implement a fallback mechanism. Detect tool calls in the stream and trigger a secondary, lightweight generation pass to produce a verbal confirmation before resuming the main conversation.
Environment Variable Shadowing
- Explanation:
load_dotenv()in Python does not override existing environment variables by default. If variables are set in the OS or CI/CD pipeline, local.envchanges are ignored. - Fix: Always use
load_dotenv(override=True)during development to ensure configuration files take precedence, or manage secrets via a dedicated vault in production.
- Explanation:
Token Budget Blowout
- Explanation: Injecting full conversation history or large memory blobs can exceed context windows or increase latency and cost.
- Fix: Summarize historical interactions before injection. Limit memory to key attributes (name, sentiment, last action) rather than raw transcripts. Use token counting utilities to cap injection size.
Latency Spikes from Synchronous DB
- Explanation: Blocking database queries during the request lifecycle can introduce latency, breaking the real-time feel of the voice agent.
- Fix: Use asynchronous database drivers (e.g.,
aiosqliteor async PostgreSQL) or cache caller profiles in memory. Pre-fetch memory data during the initial call setup webhook rather than the chat completion request.
Language Detection Drift
- Explanation: Multilingual callers may switch languages mid-call, causing STT errors if the model is locked to a single language.
- Fix: Use STT models with automatic language detection, such as Deepgram Nova 3. Configure the model to accept multiple language hints and dynamically adjust the system prompt based on detected language changes.
Sentiment Injection Bias
- Explanation: Injecting sentiment labels like "frustrated" can cause the LLM to overcompensate, becoming overly apologetic even when unnecessary.
- Fix: Frame sentiment injection as context rather than instruction. Instead of "Be patient," use "The caller previously mentioned issues with billing; verify if this is resolved." This guides the agent without forcing a tone.
Production Bundle
Action Checklist
- Configure Vapi Custom LLM: Point the Vapi assistant to your
/v1/chat/completionsendpoint and enable streaming. - Implement SSE Generator: Ensure your Flask/FastAPI route yields tokens in SSE format and handles tool calls gracefully.
- Set Up Memory Schema: Initialize SQLite with
callerstable and implement profile retrieval/injection logic. - Add Post-Call Webhook: Create
/webhooks/call-endedto process transcripts, extract insights, and update memory. - Test Tool-Use Fallback: Verify that tool invocations trigger a verbal response and do not result in empty streams.
- Optimize STT Language: Configure Deepgram Nova 3 with language hints for your target regions and test code-switching.
- Monitor Latency: Instrument endpoints to track time-to-first-token (TTFN) and ensure it remains under 500ms.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low Volume / MVP | SQLite + Claude Haiku | Simple setup, low cost, sufficient for <100 calls/day. | Low (~€0.28/call) |
| High Volume / Enterprise | PostgreSQL + Redis Cache + Claude Sonnet | Scalability, concurrent access, and higher reasoning quality for complex queries. | Medium (~€0.45/call) |
| Multilingual Support | Deepgram Nova 3 + Azure TTS | Nova 3 handles language switching; Azure provides high-quality multilingual voices. | Low (STT/TTS included in stack) |
| Strict Latency Requirements | Async DB + Edge Deployment | Reduces round-trip time for memory lookups and LLM inference. | Medium (Infrastructure cost) |
Configuration Template
# config.yaml
telephony:
provider: vapi
assistant_id: "asst_vapi_12345"
custom_llm_url: "https://your-domain.com/v1/chat/completions"
models:
llm:
provider: anthropic
model: claude-3-haiku-20240307
max_tokens: 256
stt:
provider: deepgram
model: nova-3
language: auto
tts:
provider: azure
voice: es-ES-ElviraNeural
rate: 1.0
memory:
provider: sqlite
db_path: ./data/voice_agents.db
max_history_tokens: 500
webhooks:
call_ended: "https://your-domain.com/webhooks/call-ended"
Quick Start Guide
- Initialize Project: Create a Python virtual environment and install dependencies:
pip install flask anthropic deepgram-sdk python-dotenv. - Configure Environment: Create a
.envfile with your API keys for Vapi, Anthropic, Deepgram, and Azure. Useload_dotenv(override=True)in your app. - Deploy Endpoint: Run the Flask application locally or deploy to a cloud provider. Expose the endpoint via a tunnel (e.g., ngrok) for testing.
- Connect Vapi: In the Vapi dashboard, create a new assistant, select "Custom LLM," and enter your endpoint URL. Configure the STT and TTS providers to match your stack.
- Test Call: Place a test call to the Vapi number. Verify that the agent responds, memory is injected on subsequent calls, and post-call analysis updates the database.
