sts.
- Claude Haiku generates executive-ready summaries in <2s, while Sonnet/Opus handles complex multi-turn admin queries without bloating per-call latency.
Core Solution
The service is built on a FastAPI microservice architecture that orchestrates CLI audio tools, neural inference, and LLM synthesis into a single HTTP-exposed pipeline.
Architecture & Data Flow
+-------------------+
| Your Dashboard |
| (Grafana, Web) |
+--------+----------+
|
HTTP API
|
+-------------v--------------+
| FastAPI Service |
| (port 8084) |
| |
| +-------+ +-----------+ |
| | /analyze| | /ai-analyze| |
| +---+---+ +-----+-----+ |
| | | |
| +---v------------v------+ |
| | Analysis Pipeline | |
| | | |
| | 1. Fetch recording | |
| | (HTTP/SCP) | |
| | 2. SoX stats | |
| | (RMS, peak, dur) | |
| | 3. FFmpeg silence | |
| | detect | |
| | 4. Silero VAD | |
| | (speech segments) | |
| | 5. NISQA MOS | |
| | (neural scoring) | |
| | 6. One-way detection | |
| +---+--------------------+ |
| | |
| +---v---+ +-----------+ |
| |SQLite | | Claude AI | |
| |Cache | | (Haiku/ | |
| | | | Sonnet) | |
| +-------+ +-----------+ |
+-------|----------|-----------+
| |
+-------------+ +----+----------+
| | |
+--------v-------+ +-------v-----+ +------v------+
| Recording | | Anthropic | | ViciDial |
| Server (HTTP) | | Messages | | Databases |
| /RECORDINGS/ | | API | | (MySQL) |
+----------------+ +-------------+ +-------------+
Core Audio Processing: SoX and FFmpeg
Before neural inference, fundamental acoustic metrics are extracted using battle-tested CLI tools. SoX provides amplitude statistics in a single pass:
import subprocess
import numpy as np
def sox_stats(wav_path: str) -> dict:
"""Run SoX stat on a WAV file, return RMS/peak/duration."""
try:
result = subprocess.run(
['sox', wav_path, '-n', 'stat'],
capture_output=True, text=True, timeout=30
)
# SoX writes stats to stderr (not stdout)
text = result.stderr
stats = {}
for line in text.splitlines():
if 'RMS amplitude' in line:
val = line.split(':')[-1].strip()
try:
amp = float(val)
stats['rms_db'] = round(20 * np.log10(amp + 1e-10), 1)
except ValueError:
stats['rms_db'] = -99.0
elif 'Maximum amplitude' in line:
val = line.split(':')[-1].strip()
try:
amp = float(val)
stats['peak_db'] = round(20 * np.log10(amp + 1e-10), 1)
except ValueError:
stats['peak_db'] = -99.0
elif 'Length (seconds)' in line:
val = line.split(':')[-1].strip()
try:
stats['duration'] = round(float(val), 2)
except ValueError:
pass
return stats
except Exception as e:
return {'error': str(e)}
FFmpeg silencedetect filters identify dead air and gap patterns, while Silero VAD timestamps speech segments for directional analysis. NISQA loads pre-trained nisqa.tar weights to predict MOS and sub-dimensions (noisiness, discontinuity, coloration, loudness) without requiring reference audio.
Analysis Pipeline & API Design
The pipeline executes sequentially but is optimized for parallel I/O where possible:
- Recording Retrieval: Attempts separate
in/out legs first; falls back to mixed if unavailable.
- Metric Extraction: SoX (RMS/peak/duration) β FFmpeg (silence gaps) β Silero VAD (speech % per leg).
- Neural Scoring: NISQA inference on normalized 16kHz mono WAV.
- One-Way Detection: Compares caller vs. agent speech percentages; flags asymmetry >70/30 split.
- Caching & Response: Results serialized to SQLite with hash-based cache keys.
/ai-analyze appends Claude Haiku natural-language assessment. /investigate and /ask endpoints leverage Sonnet/Opus for SIP trace correlation and multi-turn admin queries.
Deployment & Optimization
- Systemd Service: Runs as a persistent daemon with automatic restart, log rotation, and resource limits.
- SQLite Caching: In-memory + disk hybrid; TTL invalidation prevents stale model outputs.
- Hardware Profile: CPU-only deployment sufficient. Models load ~500MB RAM; single analysis completes in 3β8 seconds on 4-core instances.
Pitfall Guide
- SoX Output Redirection: SoX writes statistical output to
stderr, not stdout. Capturing stdout returns empty strings, causing silent metric failures. Always parse result.stderr.
- PyTorch Build Bloat: Installing the default
torch package pulls CUDA dependencies (~1.5GB+). For CPU-only inference, use --index-url https://download.pytorch.org/whl/cpu to reduce disk footprint and startup time.
- Mixed vs. Separate Leg Recordings: Analyzing mixed (merged) audio masks directional failures. Always prioritize separate
in/out leg files; mixed recordings should only be used as a fallback with explicit warnings.
- VAD Threshold Misconfiguration: Silero VAD defaults may misclassify quiet telephony speech or background noise. Tune the probability threshold (typically
0.5β0.6) based on your PBX's SNR profile and codec characteristics.
- Cache Invalidation & Stale Data: Remote recording servers may overwrite or truncate files. Implement hash-based cache keys (
md5(file_path + timestamp)) and TTL expiration to prevent serving outdated analysis.
- LLM Cost & Rate Limiting: Using Sonnet/Opus for every
/analyze request will exhaust API quotas and inflate costs. Route per-call analysis to Haiku; reserve larger models for /investigate and /ask endpoints.
- Temporary File Accumulation: The
tmp/ directory fills rapidly under load. Implement context-managed cleanup (tempfile.TemporaryDirectory) or a background cron job to purge files older than 15 minutes.
Deliverables
- Architecture Blueprint: Complete system topology, data flow diagrams, endpoint specifications, model weight placement strategy, and caching layer design. Includes systemd unit template, Nginx reverse-proxy configuration, and Grafana dashboard JSON for real-time MOS/silence tracking.
- Production Checklist: Step-by-step deployment validation covering OS prerequisites, venv isolation, API key security (
chmod 600), SoX/FFmpeg path verification, NISQA weight validation, SQLite schema initialization, Claude API quota monitoring, and automated cleanup hooks. Includes rollback procedures and health-check endpoints for load balancer integration.