How I built a voice AI agent that answers phone calls and remembers callers

Stateful Voice Agents: Architecting Persistent Memory and Low-Latency Telephony Integration

Current Situation Analysis

Small businesses lose significant revenue due to missed calls and inconsistent customer interactions. While voice AI agents have emerged to answer phones 24/7, most implementations suffer from a critical flaw: session amnesia. Standard voice agents treat every inbound call as a first-time interaction, forcing customers to repeat information and preventing the agent from adapting to historical context, such as previous complaints or specific preferences.

This limitation is often overlooked because developers prioritize low latency and STT accuracy over state management. However, the absence of persistent memory degrades user trust and limits the agent's utility to simple routing or FAQ tasks. Advanced implementations require a memory layer that survives call termination and injects context dynamically without introducing latency that breaks the natural flow of conversation.

Production data from deployed voice agents demonstrates that stateful architectures can maintain cost efficiency while drastically improving interaction quality. For example, a fully stateful agent handling multilingual calls (Spanish, English, Catalan) with post-call analysis can operate at approximately €0.28 per 5-minute call. This cost point proves that adding memory and analysis does not necessarily explode operational expenses, provided the architecture leverages efficient models and streaming protocols.

WOW Moment: Key Findings

The following comparison highlights the operational impact of implementing a stateful memory layer versus a traditional stateless voice agent.

Approach	Context Retention	Sentiment Adaptation	Cost (5-min call)	Implementation Complexity
Stateless Voice Agent	None	Static	€0.25	Low
Stateful Agent (SQLite + Injection)	Full History	Dynamic (e.g., patience for frustrated callers)	€0.28	Medium
Stateful Agent (Vector DB + RAG)	Semantic Search	Dynamic + Nuanced	€0.35	High

Why this matters: The stateful approach using a lightweight relational store like SQLite offers the highest ROI for most business use cases. It adds only €0.03 per call compared to stateless agents but enables critical features like recognizing returning callers, adjusting tone based on past sentiment, and generating actionable post-call summaries. This transforms the agent from a passive answerer into an active participant that builds rapport over time.

Core Solution

Building a stateful voice agent requires orchestrating telephony, speech processing, LLM inference, and persistent storage while maintaining sub-second latency. The architecture below uses Vapi for telephony and STT, Claude Haiku for reasoning, Deepgram Nova 3 for multilingual speech recognition, Azure TTS for voice synthesis, and SQLite for memory.

1. Telephony Orchestration and Custom LLM Integration

Vapi manages the PSTN connection and converts speech to text. To enable custom memory injection, configure Vapi to use a Custom LLM provider. This routes the conversation logic to your own endpoint. Vapi sends a POST request containing the call metadata and message history, expecting an OpenAI-compatible Server-Sent Events (SSE) stream in return.

Architecture Decision: Use a Custom LLM endpoint rather than Vapi's built-in models to decouple memory logic from the telephony provider. This allows you to swap LLMs or update memory schemas without reconfiguring Vapi.

from flask import Flask, request, Response
import json
import sse_client

app = Flask(__name__)

class VoiceOrchestrator:
    def __init__(self, memory_repo, llm_client):
        self.memory = memory_repo
        self.llm = llm_client

    def build_context(self, caller_id: str, messages: list) -> dict:
        """Injects persistent memory into the conversation context."""
        profile = self.memory.get_caller_profile(caller_id)
        system_prompt = self._generate_base_prompt()
        
        if profile:
            if profile.get("name"):
                system_prompt += f"\n\nThe caller is known to us. Their name is {profile['name']}."
            if profile.get("last_sentiment") == "frustrated":
                system_prompt += "\n\nIMPORTANT: This caller expressed frustration in the previous session. Prioritize empathy and patience."
        
        return {"system": system_prompt, "messages": messages}

    def stream_response(self, context: dict):
        """Yields SSE tokens from the LLM."""
        response = self.llm.chat.completions.create(
            model="claude-3-haiku-20240307",
            messages=[{"role": "s", "content": context["system"]}] + context["messages"],
            stream=True,
            tools=self._get_tools()
        )
        
        for chunk in response:
            if chunk.choices[0].delta.content:
                yield f"data: {json.dumps({'choices': [{'delta': {'content': chunk.choices[0].delta.content}}]})}\n\n"
            elif chunk.choices[0].delta.tool_calls:
                # Handle tool calls without breaking stream
                yield self._handle_tool_stream(chunk)

@app.route("/v1/chat/completions", methods=["POST"])
def handle_chat():
    payload = request.json
    caller_id = payload.get("call", {}).get("customer", {}).get("number")
    messages = payload.get("messages", [])
    
    orchestrator = VoiceOrchestrator(memory_repo=sqlite_repo, llm_client=claude_client)
    context = orchestrator.build_context(caller_id, messages)
    
    return Response(
        orchestrator.stream_response(context),
        mimetype="text/event-stream"
    )

2. Persistent Memory Layer

Memory is stored in SQLite, keyed by the caller's phone number. This schema supports fast lookups and simple updates. On every call, the system queries the database and appends relevant history to the system prompt.

Rationale: SQLite is chosen for its zero-configuration overhead and sufficient performance for read-heavy memory injection. For high-throughput scenarios, this can be swapped for Redis or PostgreSQL without changing the application logic.

import sqlite3

class MemoryRepository:
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self._init_schema()

    def _init_schema(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS callers (
                phone_number TEXT PRIMARY KEY,
                name TEXT,
                last_sentiment TEXT,
                last_interaction_summary TEXT,
                updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self.conn.commit()

    def get_caller_profile(self, phone: str) -> dict | None:
        cursor = self.conn.execute(
            "SELECT * FROM callers WHERE phone_number = ?", (phone,)
        )
        row = cursor.fetchone()
        if row:
            return {
                "name": row[1],
                "last_sentiment": row[2],
                "summary": row[3]
            }
        return None

    def update_profile(self, phone: str, name: str = None, sentiment: str = None, summary: str = None):
        self.conn.execute("""
            INSERT INTO callers (phone_number, name, last_sentiment, last_interaction_summary)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(phone_number) DO UPDATE SET
                name = COALESCE(excluded.name, name),
                last_sentiment = COALESCE(excluded.last_sentiment, last_sentiment),
                last_interaction_summary = COALESCE(excluded.last_interaction_summary, last_interaction_summary),
                updated_at = CURRENT_TIMESTAMP
        """, (phone, name, sentiment, summary))
        self.conn.commit()

3. Post-Call Analysis and Tool Use Handling

When a call ends, Vapi triggers a webhook. The agent sends the full transcript to Claude Haiku to extract a summary, urgency score, required actions, and sentiment. This data updates the SQLite profile for future calls.

A common technical hurdle is tool use silence. When the LLM invokes a tool (e.g., saving a name) without generating conversational text, the response stream may be empty, causing Vapi to hang or disconnect. The solution is to detect tool usage and trigger a secondary generation pass to produce a verbal acknowledgment.

@app.route("/webhooks/call-ended", methods=["POST"])
def handle_call_ended():
    payload = request.json
    transcript = payload.get("transcript", [])
    caller_id = payload.get("call", {}).get("customer", {}).get("number")
    
    # Extract insights via LLM
    analysis_prompt = f"""
    Analyze this call transcript. Return JSON with:
    - summary: 2-line summary
    - urgency: 1-5 score
    - action: "call_back", "send_quote", or "none"
    - sentiment: "satisfied", "neutral", "frustrated"
    
    Transcript: {json.dumps(transcript)}
    """
    
    result = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        messages=[{"role": "user", "content": analysis_prompt}],
        response_format={"type": "json_object"}
    )
    
    insights = json.loads(result.content[0].text)
    
    # Update memory
    sqlite_repo.update_profile(
        phone=caller_id,
        sentiment=insights.get("sentiment"),
        summary=insights.get("summary")
    )
    
    return {"status": "processed"}

def _handle_tool_stream(chunk):
    """
    Mitigates empty response when tool is called.
    Triggers a follow-up generation to ensure verbal output.
    """
    tool_calls = chunk.choices[0].delta.tool_calls
    if tool_calls:
        # Execute tool logic here...
        # Force a verbal response generation
        verbal_response = claude_client.chat.completions.create(
            model="claude-3-haiku-20240307",
            messages=[{"role": "assistant", "content": "Tool executed successfully."}],
            prompt="Respond naturally to the user confirming the action."
        )
        return f"data: {json.dumps({'choices': [{'delta': {'content': verbal_response.choices[0].message.content}}]})}\n\n"
    return ""

Pitfall Guide

Streaming Protocol Mismatch
- Explanation: Vapi requires SSE streaming responses. Returning a standard JSON response causes the call to terminate immediately.
- Fix: Ensure your endpoint returns text/event-stream and yields chunks incrementally. Use generators in Flask/FastAPI to stream tokens as they arrive from the LLM.
Tool Invocation Silence
- Explanation: If the LLM calls a tool and produces no text content, the voice agent receives an empty stream, leading to dead air or disconnection.
- Fix: Implement a fallback mechanism. Detect tool calls in the stream and trigger a secondary, lightweight generation pass to produce a verbal confirmation before resuming the main conversation.
Environment Variable Shadowing
- Explanation: load_dotenv() in Python does not override existing environment variables by default. If variables are set in the OS or CI/CD pipeline, local .env changes are ignored.
- Fix: Always use load_dotenv(override=True) during development to ensure configuration files take precedence, or manage secrets via a dedicated vault in production.
Token Budget Blowout
- Explanation: Injecting full conversation history or large memory blobs can exceed context windows or increase latency and cost.
- Fix: Summarize historical interactions before injection. Limit memory to key attributes (name, sentiment, last action) rather than raw transcripts. Use token counting utilities to cap injection size.
Latency Spikes from Synchronous DB
- Explanation: Blocking database queries during the request lifecycle can introduce latency, breaking the real-time feel of the voice agent.
- Fix: Use asynchronous database drivers (e.g., aiosqlite or async PostgreSQL) or cache caller profiles in memory. Pre-fetch memory data during the initial call setup webhook rather than the chat completion request.
Language Detection Drift
- Explanation: Multilingual callers may switch languages mid-call, causing STT errors if the model is locked to a single language.
- Fix: Use STT models with automatic language detection, such as Deepgram Nova 3. Configure the model to accept multiple language hints and dynamically adjust the system prompt based on detected language changes.
Sentiment Injection Bias
- Explanation: Injecting sentiment labels like "frustrated" can cause the LLM to overcompensate, becoming overly apologetic even when unnecessary.
- Fix: Frame sentiment injection as context rather than instruction. Instead of "Be patient," use "The caller previously mentioned issues with billing; verify if this is resolved." This guides the agent without forcing a tone.

Production Bundle

Action Checklist

Configure Vapi Custom LLM: Point the Vapi assistant to your /v1/chat/completions endpoint and enable streaming.
Implement SSE Generator: Ensure your Flask/FastAPI route yields tokens in SSE format and handles tool calls gracefully.
Set Up Memory Schema: Initialize SQLite with callers table and implement profile retrieval/injection logic.
Add Post-Call Webhook: Create /webhooks/call-ended to process transcripts, extract insights, and update memory.
Test Tool-Use Fallback: Verify that tool invocations trigger a verbal response and do not result in empty streams.
Optimize STT Language: Configure Deepgram Nova 3 with language hints for your target regions and test code-switching.
Monitor Latency: Instrument endpoints to track time-to-first-token (TTFN) and ensure it remains under 500ms.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Low Volume / MVP	SQLite + Claude Haiku	Simple setup, low cost, sufficient for <100 calls/day.	Low (~€0.28/call)
High Volume / Enterprise	PostgreSQL + Redis Cache + Claude Sonnet	Scalability, concurrent access, and higher reasoning quality for complex queries.	Medium (~€0.45/call)
Multilingual Support	Deepgram Nova 3 + Azure TTS	Nova 3 handles language switching; Azure provides high-quality multilingual voices.	Low (STT/TTS included in stack)
Strict Latency Requirements	Async DB + Edge Deployment	Reduces round-trip time for memory lookups and LLM inference.	Medium (Infrastructure cost)

Configuration Template

# config.yaml
telephony:
  provider: vapi
  assistant_id: "asst_vapi_12345"
  custom_llm_url: "https://your-domain.com/v1/chat/completions"

models:
  llm:
    provider: anthropic
    model: claude-3-haiku-20240307
    max_tokens: 256
  stt:
    provider: deepgram
    model: nova-3
    language: auto
  tts:
    provider: azure
    voice: es-ES-ElviraNeural
    rate: 1.0

memory:
  provider: sqlite
  db_path: ./data/voice_agents.db
  max_history_tokens: 500

webhooks:
  call_ended: "https://your-domain.com/webhooks/call-ended"

Quick Start Guide

Initialize Project: Create a Python virtual environment and install dependencies: pip install flask anthropic deepgram-sdk python-dotenv.
Configure Environment: Create a .env file with your API keys for Vapi, Anthropic, Deepgram, and Azure. Use load_dotenv(override=True) in your app.
Deploy Endpoint: Run the Flask application locally or deploy to a cloud provider. Expose the endpoint via a tunnel (e.g., ngrok) for testing.
Connect Vapi: In the Vapi dashboard, create a new assistant, select "Custom LLM," and enter your endpoint URL. Configure the STT and TTS providers to match your stack.
Test Call: Place a test call to the Vapi number. Verify that the agent responds, memory is injected on subsequent calls, and post-call analysis updates the database.