Your LLM-as-a-Judge Sees 86% Hallucinations. 42% Are Your Pipeline.
Current Situation Analysis
Automated LLM-as-a-judge evaluators are increasingly deployed as the primary signal for generation quality, but they introduce a critical blind spot: a structural inability to distinguish model failure from pipeline failure. In a self-hosted Langfuse instance running a custom Hallucination rubric, 86% of scored generations were flagged as hallucinating. At face value, this suggests a fleet of fundamentally broken models. Tracing each flagged score back to its underlying observation, however, reveals that 42% of these flags are actually infrastructure failures (e.g., gateway rejections, invalid model slugs, or excessive `max_tokens` parameters) that the judge cannot detect.
Traditional evaluation methods fail here because they rely on scalar aggregation across all observations without filtering on execution state. LLM-as-a-judge systems score the final artifact in front of them, not the execution path that produced it. When API calls fail and the SDK logs the request envelope as the "output", the judge interprets the prompt/request configuration as a verbatim model response, confidently flagging it as a hallucination. This contaminates aggregate metrics with infrastructure noise, leading teams to optimize model selection or prompt engineering when the actual issue is pipeline routing and error handling.
WOW Moment: Key Findings
Cross-tabulating the judge's scores with each observation's `level` field cleanly separates infrastructure noise from genuine model behavior. Filtering out `level=ERROR` states before aggregation shifts the headline metric from 86.0% to 68.9%, revealing that nearly half the "hallucinations" were pipeline artifacts. The remaining 58% of flags cluster into distinct failure modes that require targeted architectural fixes rather than blanket model replacement.
| Approach | Hallucination Rate | Infrastructure Noise | Actionable Failure Patterns |
|---|---|---|---|
| Naive Scalar Aggregation | 86.0% | 42% | 0 |
| State-Filtered Analysis | 68.9% | 0% | 4 |
Core Solution
The fix requires a two-layer approach: first, isolate infrastructure failures at the data pipeline level before scoring; second, route and validate model outputs based on the clustered failure patterns.
1. Pre-Aggregation Infrastructure Filter

Apply a strict execution-state filter before computing any evaluator metrics. This removes gateway rejections and null completions from the quality signal.
```python
# wrong: includes failed calls, so pipeline errors inflate the metric
hallucination_rate = df["score"].mean()

# right: only score successful generations
genuine = df[df["level"] != "ERROR"]
hallucination_rate = genuine["score"].mean()
```
2. Failure Mode Routing & Validation

Once filtered, the 36 genuine hallucinations resolve into specific patterns requiring distinct technical interventions:
Pattern A – Prompt Echo (Most Frequent)
Small instruction-tuned models (3B–30B) frequently output the input verbatim when tasked with highly structured generation. This is not classical hallucination but an instruction-following gap.
Fix: Bind smaller models to simpler tasks (classification, regex-validated extraction). Route structured-summary tasks to 70B+ tiers. Implement a `pydantic` schema validator on the output with a single-shot retry on parse failure.
Pattern B – Fabricated Tool APIs
Agents confabulate plausible REST endpoints and body parameters when tool schemas are implicit or missing.
Fix: This is a tool-binding problem. Provide explicit tool schemas via function-calling APIs, or wrap unknown surfaces with a tool that returns its own OpenAPI spec on demand. Models stop fabricating when given concrete schemas to bind to.
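For concreteness, one common shape for an explicit tool binding is the OpenAI-style function-calling schema. The tool name and parameters below are illustrative examples, not a real API surface.

```python
# Illustrative function-calling tool definition. With a concrete schema like
# this bound to the agent, the model selects from declared parameters instead
# of confabulating endpoints and body fields.
search_tickets_tool = {
    "type": "function",
    "function": {
        "name": "search_tickets",  # hypothetical tool name
        "description": "Search support tickets by keyword and status.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Keyword to match."},
                "status": {"type": "string", "enum": ["open", "closed"]},
            },
            "required": ["query"],
        },
    },
}
```

The same JSON-Schema shape works for the OpenAPI-spec-on-demand variant: the wrapper tool simply returns the spec document as its result.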
Pattern C – Tool-Output Misinterpretation
Agents execute malformed commands, receive permissive `success: true` responses from runners, and proceed blindly.
Fix: Tool runners must never return `success: true` on non-zero exit codes. Inject the exit code, stderr, and exact command executed into the tool result payload so the agent can perform self-correction.
Pitfall Guide
- Aggregating Scores Without Execution State Filtering: Calculating mean scores across all observations inflates failure metrics with API/pipeline errors. Always filter `level != "ERROR"` before computing aggregate evaluator metrics.
- Misclassifying Prompt Echo as Classical Hallucination: Treating verbatim prompt repetition as factual invention obscures the real issue: instruction-tuning gaps in smaller models on structured tasks. Route these to appropriate model tiers or add schema validation.
- Expecting Judges to Detect Infrastructure Failures: LLM-as-a-judge evaluators only score the final artifact. They cannot distinguish between a null completion due to a gateway rejection and a model refusal. Relying on them for pipeline health monitoring contaminates quality scores.
- Permissive Tool Runner Design: Returning `success: true` for malformed commands or non-zero exit codes creates false positives that agents blindly trust. Tool runners must enforce strict exit-code validation and propagate stderr/stdout explicitly.
- Missing Explicit Tool Schema Binding: Agents confabulate API structures when tool definitions are implicit. Always bind concrete schemas via function-calling APIs or dynamic spec retrieval to prevent fabricated endpoints.
- Over-relying on Scalar Metrics for Model Routing: A single mean score masks distinct failure modes. Cluster judge comments and scores by model size and task type to identify routing mismatches (e.g., 3B–30B models failing at structured JSON generation).
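The last pitfall can be made concrete with a small groupby. The records and column names below are assumed for illustration; in practice they would come from Langfuse observations joined with judge scores.

```python
import pandas as pd

# Hypothetical per-generation records: model tier, task type, and the judge's
# binary hallucination flag (1 = flagged).
df = pd.DataFrame({
    "model_size": ["3B", "3B", "70B", "70B"],
    "task_type": ["structured_json", "classification",
                  "structured_json", "classification"],
    "score": [1, 0, 0, 0],
})

# A single mean hides where failures concentrate...
overall = df["score"].mean()  # 0.25 across the board

# ...while a per-(model, task) breakdown exposes the routing mismatch.
breakdown = df.groupby(["model_size", "task_type"])["score"].mean()
```

Here the overall rate looks modest, but the breakdown shows the 3B tier failing specifically on structured JSON, which is exactly the signal needed for routing decisions.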
Deliverables
- LLM Observability Pipeline Filter Blueprint: Architecture diagram and data flow specification for injecting `level`-state filtering into Langfuse/SDK pipelines before evaluator ingestion.
- LLM-as-a-Judge Validation Checklist: 12-point audit checklist covering execution state isolation, artifact vs. path scoring, schema validation hooks, and tool runner strictness protocols.
- Configuration Templates: Ready-to-deploy snippets including the `pandas`/`polars` execution filter script, `pydantic` output validator with retry logic, and strict tool runner exit-code enforcement configuration.
