LLM Observability Audit: 32% Error Rate, 720K-Token Bug, and One $1.11 Call
Current Situation Analysis
Production LLM deployments often mask structural failures behind aggregated dashboard metrics. In a self-hosted Langfuse environment processing 21 hours of traffic (516 traces, 24 routed models, $2.86 total spend), traditional monitoring failed to surface critical failure modes. Dashboard-level observability aggregates latency, success rates, and token counts, which obscures three fundamental issues:
- Silent Error Accumulation: A 32.1% error rate was buried in production logs, with gateway-level rejections (`ctx_overflow`) masquerading as normal latency spikes.
- Cost & Context Drift: Retrieval layers and routing configurations were passing unbounded context windows, resulting in a 97:1 input/output token ratio and extreme cost concentration ($1.11 in a single call).
- Routing Configuration Drift: Invalid or deprecated model slugs (`openrouter/free`, `google/gemma-4-26b-a4b-it:free`) were being emitted by the routing layer, causing 100% failure rates that dashboards typically filter out as "noise" or "fallbacks".
Traditional methods fail here because they rely on pre-computed aggregates and UI-rendered panels. Structural bugs such as integer casting errors in `max_tokens`, missing startup validation against provider catalogs, and greedy context injection only surface through raw trace/observation extraction and programmatic analysis.
WOW Moment: Key Findings
Extracting the full dataset via Langfuse's REST API and running a flat audit revealed immediate structural wins. The comparison between traditional dashboard monitoring and raw API auditing demonstrates the visibility gap:
| Approach | Error Detection Latency | Cost Anomaly Visibility | Context Overflow Detection | Model Performance Resolution |
|---|---|---|---|---|
| Traditional Dashboard Monitoring | ~24-48 hours (aggregation cycles) | Low (aggregated spend only) | Missed (filtered as gateway timeout) | Fleet-level only |
| Raw API Audit (Langfuse REST) | <1 hour (programmatic grep) | High (trace-level $0.01 precision) | Instant (statusMessage parsing) | Per-model correctness scoring |
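The `statusMessage` parsing in the last row is mechanical. A minimal sketch, assuming a list of observation dicts already extracted via the `/observations` endpoint (the pipeline is shown in the Core Solution below) exposing the `level`, `statusMessage`, and `model` fields of the Langfuse observation schema:

```python
import pandas as pd

def classify_errors(observations: list[dict]) -> pd.DataFrame:
    """Bucket ERROR-level observations by statusMessage pattern, per model."""
    df = pd.DataFrame(observations)
    errors = df[df["level"] == "ERROR"].copy()
    # ctx_overflow is the marker this deployment's gateway embeds in
    # statusMessage; everything else falls into a catch-all bucket.
    errors["bucket"] = errors["statusMessage"].fillna("").map(
        lambda m: "ctx_overflow" if "ctx_overflow" in m else "other"
    )
    return errors.groupby(["model", "bucket"]).size().unstack(fill_value=0)
```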
Key Findings & Sweet Spot:
- 32.1% error rate driven primarily by `ctx_overflow` (91/106 errors), due to a 720,000-token output request against a 262K context limit.
- 52% of total spend ($1.486) concentrated in two `claude-opus-4.6` calls with 221K and 75K input tokens, indicating missing retrieval truncation.
- 97:1 I/O token ratio confirms aggressive output constraints (tool calls, classification) paired with unbounded input injection.
- Model leaderboard variance (0.589 to 0.940 correctness) highlights that free-tier routing without validation degrades quality predictability.
- Sweet Spot: Combining paginated REST extraction with pandas-based flat analysis catches configuration drift, cost anomalies, and context overflow before they compound in production.
Core Solution
The audit pipeline relies on three Langfuse REST endpoints (`/traces`, `/observations`, `/scores`) with programmatic pagination, followed by targeted fixes for routing, context management, and validation.
1. Data Extraction Pipeline
```python
import os
import httpx
from dotenv import load_dotenv

load_dotenv()
BASE = os.environ["LANGFUSE_BASE_URL"].rstrip("/")
AUTH = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

def paginate(client, path, params=None):
    """Walk a paginated Langfuse endpoint page by page, yielding every record."""
    params = dict(params or {})
    params.setdefault("limit", 100)
    page = 1
    while True:
        params["page"] = page
        r = client.get(f"{BASE}{path}", params=params)
        r.raise_for_status()
        j = r.json()
        yield from j.get("data", [])
        if page >= j.get("meta", {}).get("totalPages", 1):
            break
        page += 1

with httpx.Client(auth=AUTH, timeout=60) as c:
    traces = list(paginate(c, "/api/public/traces"))
    obs = list(paginate(c, "/api/public/observations"))
    scores = list(paginate(c, "/api/public/scores"))
```
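With the three lists in memory, the flat analysis behind the findings above is a few lines of pandas. A sketch, assuming the `type`, `level`, `usage`, and `calculatedTotalCost` fields returned by the observations endpoint (names may differ across Langfuse versions):

```python
import pandas as pd

df = pd.DataFrame(obs)
gen = df[df["type"] == "GENERATION"].copy()

# Headline error rate: share of generations flagged ERROR.
error_rate = (gen["level"] == "ERROR").mean()

# Token totals from the usage object; guards against missing values.
gen["input_tok"] = gen["usage"].apply(lambda u: (u or {}).get("input", 0))
gen["output_tok"] = gen["usage"].apply(lambda u: (u or {}).get("output", 0))
io_ratio = gen["input_tok"].sum() / max(gen["output_tok"].sum(), 1)

# Cost concentration: the handful of calls dominating spend.
gen["cost"] = pd.to_numeric(gen["calculatedTotalCost"], errors="coerce").fillna(0.0)
top_spend = gen.nlargest(5, "cost")[["model", "input_tok", "output_tok", "cost"]]

print(f"error rate={error_rate:.1%}, I/O ratio={io_ratio:.0f}:1")
print(top_spend.to_string(index=False))
```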
2. Context & Token Budgeting Fix
Hardcode an upper sanity bound to prevent integer casting errors or unbounded requests from bypassing gateway limits.
```python
def cap_max_tokens(model_ctx: int, input_tok: int, requested: int, margin: int = 256) -> int:
    """Clamp a requested max_tokens to what the context window can actually hold."""
    # Never exceed: the requested value, the remaining window (minus a safety
    # margin), or a hard sanity ceiling of 8192 output tokens.
    return min(requested, max(0, model_ctx - input_tok - margin), 8192)
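Replaying the audited failure as a sanity check (720K output request against the 262,144-token window; the 40K input figure is illustrative):

```python
# The audited bug: a 720,000-token output request against a 262K window.
assert cap_max_tokens(model_ctx=262_144, input_tok=40_000, requested=720_000) == 8_192
# Well-formed requests pass through unchanged.
assert cap_max_tokens(model_ctx=262_144, input_tok=40_000, requested=1_024) == 1_024
```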
3. Routing Validation Guardrail
Validate model slugs against the provider catalog at startup to catch drift immediately.
```python
import httpx

async def validate_models(used_slugs: set[str]) -> None:
    """Fail fast if any configured slug is missing from the provider catalog."""
    async with httpx.AsyncClient(timeout=30) as client:
        r = await client.get("https://openrouter.ai/api/v1/models")
        r.raise_for_status()
        valid = {m["id"] for m in r.json()["data"]}
    if invalid := used_slugs - valid:
        raise RuntimeError(f"Unknown OpenRouter slugs: {invalid}")
```
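A minimal startup hook, sketched with `asyncio.run` and the slugs flagged in this audit (whether it raises depends on the live catalog at call time):

```python
import asyncio

# Slugs the routing layer emitted during the audited window; both were
# rejected by the gateway, so validation should flag them at startup.
asyncio.run(validate_models({"openrouter/free", "google/gemma-4-26b-a4b-it:free"}))
```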
4. Retrieval Context Trimming
Implement greedy token-budget trimming to prevent cost concentration and context overflow.
```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float  # retrieval relevance score

def trim_context(chunks: list[Chunk], budget_tok: int, encoder) -> list[Chunk]:
    """Greedy by score, stop when the token budget is exhausted."""
    chunks = sorted(chunks, key=lambda c: c.score, reverse=True)
    out, used = [], 0
    for c in chunks:
        n = len(encoder.encode(c.text))
        if used + n > budget_tok:
            break
        out.append(c)
        used += n
    return out
```
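A usage sketch with `tiktoken` as the encoder (any object with an `encode(str) -> list[int]` method works; the 24,000-token budget is an illustrative ceiling well below a 262K window):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
chunks = [Chunk(text="...retrieved passage...", score=0.92),
          Chunk(text="...another passage...", score=0.41)]
kept = trim_context(chunks, budget_tok=24_000, encoder=enc)
prompt_context = "\n\n".join(c.text for c in kept)
```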
Pitfall Guide
- Unbounded `max_tokens` Configuration: Passing raw integer values without sanity caps leads to gateway rejections (`ctx_overflow`) and wasted latency. Always enforce a hard upper bound (e.g., `8192`) relative to the model's context window and input size.
- Unvalidated Model Slugs in Routing: Routing layers often emit placeholder or deprecated slugs (`openrouter/free`, invalid version tags). Implement startup-time validation against the provider's `/models` endpoint to fail fast in CI/CD rather than in production.
- Missing Context Truncation in RAG: Retrieval systems that inject full corpora or untruncated chat history cause extreme cost concentration and I/O ratio drift. Pair greedy score-based trimming with a hard input ceiling well below the model's max context.
- Ignoring Input/Output Token Ratio Drift: A sustained ratio >20:1 indicates input overshoot or aggressive output constraints. Monitor this metric per request (see the sketch after this list); drift signals retrieval inefficiency or prompt bloat that dashboards typically mask.
- Relying on Dashboard Aggregates for Error Analysis: Pre-computed panels filter out gateway rejections, fallbacks, and low-latency failures. Raw trace/observation extraction is required to classify errors by `statusMessage`, `level`, and routing metadata.
- Assuming Free-Tier Model Stability: Free-tier slugs frequently rotate, deprecate, or change performance characteristics. Maintain a validated slug registry and pin versions in configuration rather than relying on dynamic routing aliases.
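For the ratio-drift pitfall, a minimal per-request monitor sketch (the threshold and logger name are illustrative, not from the audited system):

```python
import logging

logger = logging.getLogger("llm.telemetry")

def check_io_ratio(input_tok: int, output_tok: int, threshold: float = 20.0) -> float:
    """Flag requests whose input/output token ratio exceeds the drift threshold."""
    ratio = input_tok / max(output_tok, 1)
    if ratio > threshold:
        logger.warning("I/O ratio drift: %.0f:1 (input=%d, output=%d)",
                       ratio, input_tok, output_tok)
    return ratio
```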
Deliverables
- Langfuse Audit Blueprint: Step-by-step architecture for paginated REST extraction, pandas-based flat analysis, and trace-to-observation joining. Includes endpoint mapping, pagination handling, and metric derivation formulas.
- LLM Routing & Cost Checklist: Pre-deployment validation steps covering model slug verification, `max_tokens` sanity caps, context window margin calculations, and startup-time catalog sync.
- Configuration Templates: Ready-to-use Python snippets for token budgeting (`cap_max_tokens`), greedy context trimming (`trim_context`), and async model validation (`validate_models`). Includes environment variable schemas for Langfuse auth and base URL routing.
