Enabled** | < 25% | Low | High |
Key Insight: The < 300ms zone is the gold standard, achievable for simple exchanges using optimized stacks (e.g., Cartesia Sonic TTS + Deepgram STT + Edge LLM). For complex reasoning where sub-300ms is unattainable, the 300-500ms range combined with aggressive filler strategies represents the practical production sweet spot. Fillers reduce overlap incidents by over 60%, effectively buying time without breaking the user's mental model.
Core Solution
Achieving sub-cliff latency requires a holistic architecture that manages the critical path, predicts bottlenecks, and handles user interruptions gracefully. The solution spans pipeline design, model optimization, and adaptive UX.
1. Pipeline Architecture & Bottleneck Management
The voice pipeline must be designed to minimize the critical path while handling interruptions without state corruption.
[User Audio] -> [Streaming STT] -> [LLM Inference] -> [Streaming TTS] -> [Audio Output]
                100-300ms          200-800ms          40-150ms
                (Chunk-based)      (70% of latency)   (Low-latency models)
- STT: Use streaming models (Deepgram, AssemblyAI) to process audio chunks immediately. Do not wait for silence detection before sending data to the LLM if the pipeline supports incremental processing.
- LLM Inference: This is the dominant bottleneck. Implement speculative decoding, prompt caching, and model quantization. Route simple queries to smaller, faster models to preserve the 500ms budget.
- TTS: Deploy low-latency models like Cartesia Sonic (40ms TTFA) or Kokoro (82M params, NPU-native). Ensure TTS can stream audio tokens as soon as they are generated.
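The stage ranges in the diagram can be summed to see where a turn lands relative to the 500ms cliff. The sketch below is illustrative only: `Stage` and `summarizeBudget` are hypothetical names, the numbers are the ranges quoted above rather than measurements, and the stages are treated as strictly sequential, which overstates latency when they stream into each other.

```typescript
// Hypothetical helper: sums the per-stage latency ranges from the diagram.
interface Stage {
  name: string;
  minMs: number;
  maxMs: number;
}

interface BudgetSummary {
  bestCaseMs: number;
  worstCaseMs: number;
  bottleneck: string; // stage with the largest worst-case latency
  fitsBestCase: boolean;
  fitsWorstCase: boolean;
}

function summarizeBudget(stages: Stage[], cliffMs: number): BudgetSummary {
  if (stages.length === 0) throw new Error("at least one stage is required");
  const bestCaseMs = stages.reduce((sum, s) => sum + s.minMs, 0);
  const worstCaseMs = stages.reduce((sum, s) => sum + s.maxMs, 0);
  const bottleneck = stages.reduce((a, b) => (b.maxMs > a.maxMs ? b : a)).name;
  return {
    bestCaseMs,
    worstCaseMs,
    bottleneck,
    fitsBestCase: bestCaseMs < cliffMs,
    fitsWorstCase: worstCaseMs < cliffMs,
  };
}

const pipeline: Stage[] = [
  { name: "STT", minMs: 100, maxMs: 300 },
  { name: "LLM", minMs: 200, maxMs: 800 },
  { name: "TTS", minMs: 40, maxMs: 150 },
];

const summary = summarizeBudget(pipeline, 500);
console.log(summary);
```

With these ranges the best case (340ms) clears the cliff and the worst case (1,250ms) does not, with the LLM stage as the largest contributor, which is why the LLM bullet above gets the most attention.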
2. Implementation Strategy
The following TypeScript implementation demonstrates an adaptive pipeline that monitors p95 latency, injects fillers based on predictive thresholds, and handles overlap interruptions safely.
// Core interfaces for the voice pipeline
interface VoiceTurnConfig {
  p95TargetMs: number;
  fillerTriggerMs: number;
  maxOverlapRetries: number;
}

interface LatencyMetrics {
  p95: number;
  p99: number;
  currentTurnMs: number;
}

// Strategy for handling user interruptions
interface OverlapHandler {
  handleInterruption(turnId: string): Promise<void>;
}

// Filler injection strategy
interface FillerStrategy {
  shouldInject(metrics: LatencyMetrics): boolean;
  getFillerAudio(): Buffer;
}
abstract class AdaptiveVoiceOrchestrator {
  private metrics: LatencyMetrics;
  private config: VoiceTurnConfig;
  private overlapHandler: OverlapHandler;
  private fillerStrategy: FillerStrategy;

  constructor(config: VoiceTurnConfig, overlapHandler: OverlapHandler, fillerStrategy: FillerStrategy) {
    this.config = config;
    this.overlapHandler = overlapHandler;
    this.fillerStrategy = fillerStrategy;
    this.metrics = { p95: 0, p99: 0, currentTurnMs: 0 };
  }

  // The pipeline stages are not defined in this excerpt, so a concrete
  // subclass supplies them. The string response type is an assumption.
  protected abstract runInference(audioStream: AsyncIterable<Buffer>, turnId: string): Promise<string>;
  protected abstract streamAudioOutput(responseText: string, turnId: string): Promise<void>;
  protected abstract writeToOutput(audio: Buffer, turnId: string): Promise<void>;

  /**
   * Processes a voice turn with adaptive filler injection and overlap handling.
   */
  async processVoiceTurn(audioStream: AsyncIterable<Buffer>): Promise<void> {
    const turnStart = performance.now();
    const turnId = crypto.randomUUID();
    let fillerTimer: ReturnType<typeof setTimeout> | undefined;
    let fillerDone: Promise<void> = Promise.resolve();
    try {
      // 1. Start STT and LLM inference without awaiting it yet
      // In production, this would involve chunked processing
      const inference = this.runInference(audioStream, turnId);

      // 2. If inference is still running at fillerTriggerMs, check the
      //    latency budget and inject a filler while the user is waiting
      fillerTimer = setTimeout(() => {
        const elapsed = performance.now() - turnStart;
        if (this.fillerStrategy.shouldInject({ ...this.metrics, currentTurnMs: elapsed })) {
          // A failed filler should not fail the whole turn
          fillerDone = this.injectFiller(turnId).catch(() => undefined);
        }
      }, this.config.fillerTriggerMs);

      const llmResponse = await inference;
      clearTimeout(fillerTimer);
      await fillerDone; // let a filler that already started finish first

      // 3. Stream TTS output
      await this.streamAudioOutput(llmResponse, turnId);

      // 4. Update metrics
      this.updateMetrics(performance.now() - turnStart);
    } catch (error) {
      if (this.isOverlapError(error)) {
        await this.overlapHandler.handleInterruption(turnId);
      } else {
        throw error;
      }
    } finally {
      clearTimeout(fillerTimer);
    }
  }

  private async injectFiller(turnId: string): Promise<void> {
    // Low-latency filler audio injection
    const fillerAudio = this.fillerStrategy.getFillerAudio();
    // Send filler to output stream immediately
    await this.writeToOutput(fillerAudio, turnId);
  }

  private isOverlapError(error: unknown): boolean {
    // Detect interruption signals from STT pipeline
    return error instanceof Error && error.message.includes('INTERRUPTION_DETECTED');
  }

  private updateMetrics(durationMs: number): void {
    // Update rolling p95/p99 calculations
    // Implementation depends on metrics library (e.g., Prometheus, Datadog)
  }
}
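One possible `FillerStrategy` is sketched below as a minimal example, not as the article's implementation. The class name `ThresholdFillerStrategy`, the 25% earlier trigger when recent turns are slow, and the round-robin clip rotation are all assumptions. It uses `Uint8Array` instead of `Buffer` so the snippet also runs outside Node; swap the type back to `Buffer` to plug it into the interfaces above.

```typescript
interface LatencyMetrics {
  p95: number;
  p99: number;
  currentTurnMs: number;
}

interface FillerStrategy {
  shouldInject(metrics: LatencyMetrics): boolean;
  getFillerAudio(): Uint8Array;
}

// Hypothetical strategy: inject once the turn passes a trigger threshold,
// and trigger earlier when the rolling p95 is already above target.
class ThresholdFillerStrategy implements FillerStrategy {
  private nextClip = 0;

  constructor(
    private readonly triggerMs: number,
    private readonly p95TargetMs: number,
    private readonly clips: Uint8Array[],
  ) {
    if (clips.length === 0) throw new Error("at least one filler clip is required");
  }

  shouldInject(metrics: LatencyMetrics): boolean {
    // Assumed policy: fire 25% earlier while recent turns are running slow.
    const threshold = metrics.p95 > this.p95TargetMs ? this.triggerMs * 0.75 : this.triggerMs;
    return metrics.currentTurnMs >= threshold;
  }

  getFillerAudio(): Uint8Array {
    // Rotate clips so the same filler is not repeated on consecutive turns.
    const clip = this.clips[this.nextClip];
    this.nextClip = (this.nextClip + 1) % this.clips.length;
    return clip;
  }
}

const strategy = new ThresholdFillerStrategy(400, 450, [
  new Uint8Array([1]), // placeholder bytes standing in for real audio
  new Uint8Array([2]),
]);

console.log(strategy.shouldInject({ p95: 300, p99: 500, currentTurnMs: 350 }));
console.log(strategy.shouldInject({ p95: 480, p99: 700, currentTurnMs: 300 }));
```

The second call returns true at 300ms because the rolling p95 (480ms) is above the 450ms target, so the trigger moves from 400ms to 300ms.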
Architecture Decisions:
- Predictive Filler Injection: The system checks elapsed time against the fillerTriggerMs. If the LLM is taking longer than expected, a filler is injected before the user crosses the 500ms cliff. This prevents the overlap trap proactively.
- Safe Overlap Handling: The OverlapHandler ensures that when a user interrupts, the pipeline aborts the current TTS stream, clears the LLM context if necessary, and resets the STT buffer. This prevents the "overlap death spiral" where pipeline state becomes corrupted.
- p95-Centric Metrics: The orchestrator tracks p95 and p99 latency, not averages. Alerts and auto-scaling decisions should be based on these tail metrics to ensure cliff thresholds are respected.
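Since `updateMetrics` is left as a stub, here is one way the rolling tail metrics could be computed in-process. This is a sketch under assumptions: `RollingLatencyTracker` is a hypothetical name, it uses the nearest-rank percentile definition over a fixed-size window, and a production system would more likely rely on its metrics library's histograms.

```typescript
// Hypothetical in-memory tracker for rolling latency percentiles.
class RollingLatencyTracker {
  private samples: number[] = [];

  constructor(private readonly windowSize: number = 1000) {
    if (!Number.isInteger(windowSize) || windowSize < 1) {
      throw new Error("windowSize must be a positive integer");
    }
  }

  record(durationMs: number): void {
    this.samples.push(durationMs);
    // Drop the oldest sample once the window is full.
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  // Nearest-rank percentile: the smallest sample such that at least
  // p percent of all samples are less than or equal to it.
  percentile(p: number): number {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
    return sorted[rank - 1];
  }

  mean(): number {
    if (this.samples.length === 0) return 0;
    return this.samples.reduce((sum, ms) => sum + ms, 0) / this.samples.length;
  }
}

// 18 fast turns and 2 slow ones: the mean looks healthy, the tail does not.
const tracker = new RollingLatencyTracker(100);
for (let i = 0; i < 18; i++) tracker.record(300);
tracker.record(900);
tracker.record(900);

console.log({ mean: tracker.mean(), p95: tracker.percentile(95), p99: tracker.percentile(99) });
```

In this made-up sample the mean is 360ms while p95 is 900ms, which is the gap that average-based alerting would miss.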
3. Optimization Techniques
- Speculative Decoding: Use a draft model to propose tokens verified by the larger model, reducing effective inference time by 10-15%.
- Prompt Caching: Cache the model's precomputed attention state (the KV cache) for system prompts and common context prefixes so that this prefix is not reprocessed on every turn.
- Edge Deployment: Deploy inference closer to the user to reduce network RTT. Edge computing has reduced average voice agent latency from 2,500ms to ~600ms in production environments.
- Audio Tokenization: Leverage models that accept audio tokens directly, bypassing the STT step for specific use cases, reducing pipeline stages.
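To make the speculative decoding item concrete, the toy below shows the propose-then-verify loop in its simplest greedy form. It is a sketch, not a real inference stack: the "models" are plain functions, all names are hypothetical, and production systems verify the draft in a single batched forward pass and use a probabilistic acceptance rule when sampling.

```typescript
// A "model" here is just a function from the tokens so far to the next token.
type NextToken = (context: readonly string[]) => string;

interface DecodeResult {
  tokens: string[];
  targetPasses: number; // verification rounds run by the large model
}

function speculativeDecode(
  target: NextToken,
  draft: NextToken,
  maxNewTokens: number,
  draftLength: number,
): DecodeResult {
  if (draftLength < 1) throw new Error("draftLength must be at least 1");
  const out: string[] = [];
  let targetPasses = 0;

  while (out.length < maxNewTokens) {
    // 1. The cheap draft model proposes a short run of tokens.
    const proposal: string[] = [];
    for (let i = 0; i < draftLength; i++) {
      proposal.push(draft([...out, ...proposal]));
    }

    // 2. The target model checks every proposed position. A real system
    //    does this in one batched forward pass, counted here as one pass.
    targetPasses++;
    for (let i = 0; i < draftLength; i++) {
      const expected = target([...out, ...proposal.slice(0, i)]);
      if (expected !== proposal[i]) {
        // 3. First mismatch: keep the target's token and drop the rest.
        out.push(...proposal.slice(0, i), expected);
        break;
      }
      if (i === draftLength - 1) out.push(...proposal); // all accepted
    }
  }
  return { tokens: out.slice(0, maxNewTokens), targetPasses };
}

const sentence = ["let", "me", "check", "that", "order", "for", "you", "now"];
const targetModel: NextToken = (ctx) => sentence[ctx.length % sentence.length];
// Toy draft model: agrees with the target except at every fifth position.
const draftModel: NextToken = (ctx) => (ctx.length % 5 === 4 ? "um" : targetModel(ctx));

const result = speculativeDecode(targetModel, draftModel, 12, 4);
console.log(result.tokens.join(" "), `(${result.targetPasses} target passes)`);
```

The output is identical to what the target model would produce alone, but the 12 tokens take 5 verification rounds instead of 12 sequential target steps. How much wall-clock time that saves depends on how often the draft agrees with the target.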
Pitfall Guide
- The Linear Latency Illusion: Treating latency as a continuous metric. Crossing a cliff (e.g., 500ms) changes the interaction type entirely. Optimization must prioritize staying below cliffs rather than incremental improvements within a failed zone. A 490ms response is fundamentally different from a 510ms response.
- Downstream Optimization Bias: Spending weeks optimizing TTS latency while the LLM inference stage dominates the critical path. Always profile the full pipeline; if LLM takes 400ms, shaving 50ms off TTS will not save you from the 500ms cliff.
- Interruption Cascade: Failing to handle user interruptions gracefully. When a user talks over the agent, the pipeline must abort the current stream and reset state. Without this, latency compounds, leading to rapid abandonment and pipeline corruption.
- The Average Trap: Reporting average latency hides the p95 tail where cliff crossings occur. Users form their worst impressions based on the slowest interactions. Always monitor and optimize p95/p99 metrics.
- Delayed Auditory Feedback Blindspot: Assuming response latency is the only timing problem. When users hear their own voice played back with a delay (delayed auditory feedback), their speech is physiologically disrupted. High-quality Acoustic Echo Cancellation (AEC) is mandatory for any system that may exceed 800ms or operate in noisy environments.
- First-Turn Anchor Effect: The first response sets the user's mental model. A slow first turn causes high drop-off even if subsequent turns are fast. Front-load optimization on the initial exchange; cache common openings.
- The Waiter Demographic Fallacy: Assuming users will wait. In testing, 8 out of 12 users overlapped at 500ms. You cannot design for the demographic that waits 3 seconds after a light turns green. Design for the majority who interrupt.
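For the Interruption Cascade pitfall, the abort-and-reset behaviour can be sketched with the standard `AbortController` API. `TurnManager` and `playChunks` are hypothetical names, and the string chunks stand in for streamed TTS audio; a real pipeline would also reset STT and LLM state when the signal fires.

```typescript
// Hypothetical turn tracker: at most one turn is in flight at a time.
class TurnManager {
  private current: { turnId: string; controller: AbortController } | null = null;

  beginTurn(turnId: string): AbortSignal {
    // A new turn supersedes any turn that is still in flight.
    this.interrupt();
    const controller = new AbortController();
    this.current = { turnId, controller };
    return controller.signal;
  }

  // Aborts the in-flight turn, if any, and returns its id.
  interrupt(): string | null {
    if (!this.current) return null;
    const { turnId, controller } = this.current;
    controller.abort();
    this.current = null;
    return turnId;
  }
}

// Stand-in for a TTS writer: stops emitting as soon as the turn is aborted.
function playChunks(chunks: string[], signal: AbortSignal, onChunk: (chunk: string) => void): number {
  let played = 0;
  for (const chunk of chunks) {
    if (signal.aborted) break;
    onChunk(chunk);
    played++;
  }
  return played;
}

const manager = new TurnManager();
const heard: string[] = [];
const firstSignal = manager.beginTurn("turn-1");

// Simulate the user talking over the agent after the second chunk.
const playedCount = playChunks(["Sure,", "your order", "shipped on", "Monday."], firstSignal, (chunk) => {
  heard.push(chunk);
  if (chunk === "your order") manager.interrupt();
});

const secondSignal = manager.beginTurn("turn-2");
console.log({ playedCount, heard, nextTurnAborted: secondSignal.aborted });
```

Giving each turn its own signal means stale audio from an interrupted turn cannot leak into the next one, because every writer checks the signal it was handed rather than shared mutable state.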
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Simple FAQ / Transactional | Edge LLM + Prompt Caching | Keeps latency < 300ms; avoids cloud RTT. | Low; edge compute is cheaper. |
| Complex Reasoning / Analysis | Cloud LLM + Aggressive Fillers | Cloud models are slower; fillers prevent overlap. | Medium; filler audio adds minimal cost. |
| Noisy Environment | High-Quality AEC + Robust STT | Echo and noise exacerbate latency perception. | Medium; AEC requires DSP resources. |
| High Concurrency | Speculative Decoding + Quantization | Reduces inference time per request. | Low; improves throughput efficiency. |
| Budget-Constrained | Kokoro TTS + Deepgram STT | Open-weight models reduce API costs. | Low; self-hosted models lower OpEx. |
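The first two rows of the matrix imply a routing decision per query. The sketch below shows the shape of such a router; the word-count and keyword heuristic is a placeholder assumption rather than a recommended classifier, and `routeQuery` and `RouterOptions` are hypothetical names.

```typescript
type Route = "edge" | "cloud";

interface RouterOptions {
  maxSimpleWords: number;   // longer transcripts are treated as complex
  complexMarkers: string[]; // lowercase substrings that hint at reasoning-heavy requests
}

// Placeholder heuristic: short transcripts without reasoning markers stay on the edge model.
function routeQuery(transcript: string, opts: RouterOptions): Route {
  const text = transcript.toLowerCase();
  const words = text.split(/\s+/).filter(Boolean);
  const looksComplex =
    words.length > opts.maxSimpleWords ||
    opts.complexMarkers.some((marker) => text.includes(marker));
  return looksComplex ? "cloud" : "edge";
}

const routerOptions: RouterOptions = {
  maxSimpleWords: 12,
  complexMarkers: ["explain", "compare", "why"],
};

console.log(routeQuery("What are your opening hours?", routerOptions));
console.log(routeQuery("Explain why my bill went up", routerOptions));
```

Whatever classifier is used, it sits on the critical path, so it needs to be far cheaper than the latency it saves.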
Configuration Template
# voice-pipeline-config.yaml
latency_budget:
  p95_target_ms: 450
  p99_target_ms: 800
  first_turn_target_ms: 300
fillers:
  enabled: true
  trigger_threshold_ms: 400
  audio_assets:
    - "mmhmm.wav"
    - "let_me_check.wav"
    - "one_second.wav"
routing:
  simple_queries:
    model: "edge-llm-7b-quantized"
    max_tokens: 100
  complex_queries:
    model: "cloud-llm-70b"
    max_tokens: 500
overlap_handling:
  strategy: "abort_and_reset"
  max_retries: 2
  silence_detection_ms: 500
metrics:
  provider: "prometheus"
  labels: ["turn_type", "model_name", "region"]
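A config like this is easy to misorder, for example with a filler trigger set later than the p95 target. The sketch below checks the latency fields after they have been parsed; YAML loading is omitted. The `LatencyConfig` shape, the `validateLatencyConfig` name, and the specific rules are assumptions inferred from the values in the template, not part of any library.

```typescript
// Hypothetical typed view of the latency-related fields in the template.
interface LatencyConfig {
  p95TargetMs: number;
  p99TargetMs: number;
  firstTurnTargetMs: number;
  fillerTriggerMs: number;
}

const OVERLAP_CLIFF_MS = 500;

// Returns a list of human-readable problems; an empty list means the config is consistent.
function validateLatencyConfig(cfg: LatencyConfig): string[] {
  const problems: string[] = [];
  if (cfg.p95TargetMs >= OVERLAP_CLIFF_MS) {
    problems.push(`p95 target (${cfg.p95TargetMs}ms) must stay below the ${OVERLAP_CLIFF_MS}ms cliff`);
  }
  if (cfg.fillerTriggerMs >= cfg.p95TargetMs) {
    problems.push(`filler trigger (${cfg.fillerTriggerMs}ms) must fire before the p95 target`);
  }
  if (cfg.firstTurnTargetMs > cfg.p95TargetMs) {
    problems.push("first-turn target should be at least as strict as the p95 target");
  }
  if (cfg.p99TargetMs < cfg.p95TargetMs) {
    problems.push("p99 target cannot be lower than the p95 target");
  }
  return problems;
}

// Values copied from the template above.
const templateConfig: LatencyConfig = {
  p95TargetMs: 450,
  p99TargetMs: 800,
  firstTurnTargetMs: 300,
  fillerTriggerMs: 400,
};

console.log(validateLatencyConfig(templateConfig));
console.log(validateLatencyConfig({ ...templateConfig, fillerTriggerMs: 480 }));
```

Running a check like this at startup catches a misconfigured trigger before it shows up as overlap incidents in production.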
Quick Start Guide
- Define Thresholds: Set your p95 target to 450ms and configure filler injection at 400ms. The p95 target leaves a 50ms buffer before the 500ms overlap cliff, and the filler trigger fires 100ms before it.
- Integrate Filler Service: Add a low-latency filler audio service to your pipeline. Ensure fillers can be injected within 50ms of the trigger.
- Deploy Edge Inference: Route simple queries to an edge-deployed model. Measure the reduction in RTT and verify p95 improvements.
- Instrument p95 Monitoring: Configure your metrics pipeline to track p95 and p99 latency. Set up dashboards to visualize cliff crossings in real-time.
- Test Interruptions: Run load tests with simulated user interruptions. Verify that the pipeline aborts cleanly and recovers without state corruption.
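For the monitoring step, a dashboard of cliff crossings needs the share of turns at or above each threshold rather than a single latency number. A minimal sketch follows, with hypothetical names and made-up sample latencies.

```typescript
interface CliffRate {
  cliffMs: number;
  rate: number; // fraction of turns at or above the cliff, from 0 to 1
}

// Computes, for each cliff, the fraction of turns that reached or exceeded it.
function cliffCrossingRates(samplesMs: number[], cliffsMs: number[]): CliffRate[] {
  return cliffsMs.map((cliffMs) => {
    const crossed = samplesMs.filter((ms) => ms >= cliffMs).length;
    return { cliffMs, rate: samplesMs.length === 0 ? 0 : crossed / samplesMs.length };
  });
}

// Made-up turn latencies for illustration.
const turnLatenciesMs = [250, 320, 480, 510, 900];
const rates = cliffCrossingRates(turnLatenciesMs, [300, 500]);
console.log(rates);
```

Here 4 of 5 turns miss the 300ms zone and 2 of 5 cross the 500ms cliff, which is the kind of breakdown the dashboards in step 4 are meant to surface.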