AI/ML · 2026-05-05 · 33 min read

Your LLM-as-a-Judge Sees 86% Hallucinations. 42% Are Your Pipeline.

By Julio Molina Soler


Current Situation Analysis

Automated LLM-as-a-judge evaluators are increasingly deployed as the primary signal for generation quality, but they introduce a critical blind spot: a structural inability to distinguish model failure from pipeline failure. In a self-hosted Langfuse instance running a custom Hallucination rubric, 86% of scored generations were flagged as hallucinating. At face value, this suggests a fleet of fundamentally broken models. However, tracing each flagged score back to the underlying observation reveals that 42% of these flags are actually infrastructure failures (e.g., gateway rejections, invalid model slugs, or excessive max_tokens parameters) that the judge cannot detect.

Traditional evaluation methods fail here because they rely on scalar aggregation across all observations without filtering on execution state. LLM-as-a-judge systems score the final artifact in front of them, not the execution path that produced it. When API calls fail and the SDK logs the request envelope as the "output", the judge interprets the prompt/request configuration as a verbatim model response, confidently flagging it as a hallucination. This contaminates aggregate metrics with infrastructure noise, leading teams to optimize model selection or prompt engineering when the actual issue is pipeline routing and error handling.

WOW Moment: Key Findings

Cross-tabulating the judge's scores with the observation's level field cleanly separates infrastructure noise from genuine model behavior. Filtering out level=ERROR states before aggregation shifts the headline metric from 86.0% to 68.9%, revealing that nearly half the "hallucinations" were pipeline artifacts. The remaining 58% cluster into distinct failure modes that require targeted architectural fixes rather than blanket model replacement.

Approach | Hallucination Rate | Infrastructure Noise | Failure Patterns Identified
Naive Scalar Aggregation | 86.0% | 42% | 0 actionable patterns
State-Filtered Analysis | 68.9% | 0% | 4 distinct model failure modes
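
That cross-tabulation is a one-liner in pandas; a minimal sketch, assuming the Langfuse export has been flattened into a dataframe with one row per scored observation, a level column for execution state, and a binary hallucination flag from the judge in score:

import pandas as pd

# Hypothetical flattened export: one row per judge-scored observation
df = pd.DataFrame({
    "level": ["DEFAULT", "ERROR", "DEFAULT", "ERROR", "DEFAULT"],
    "score": [1, 1, 0, 1, 1],  # 1 = judge flagged the output as a hallucination
})

# Rows = execution state, columns = judge verdict; the ERROR row is
# infrastructure noise masquerading as hallucination
print(pd.crosstab(df["level"], df["score"], margins=True))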

Core Solution

The fix requires a two-layer approach: first, isolate infrastructure failures at the data pipeline level before scoring; second, route and validate model outputs based on the clustered failure patterns.

1. Pre-Aggregation Infrastructure Filter

Apply a strict execution-state filter before computing any evaluator metrics. This removes gateway rejections and null completions from the quality signal.

# wrong: includes failed calls in the denominator
naive_rate = df["score"].mean()

# right: only score successful generations
genuine = df[df["level"] != "ERROR"]
hallucination_rate = genuine["score"].mean()

2. Failure Mode Routing & Validation

Once filtered, the 36 genuine hallucinations resolve into specific patterns requiring distinct technical interventions:

Pattern A: Prompt Echo (Most Frequent)

Small instruction-tuned models (3B–30B) frequently output the input verbatim when tasked with highly structured generation. This is not classical hallucination but an instruction-following gap. Fix: Bind smaller models to simpler tasks (classification, regex-validated extraction). Route structured-summary tasks to 70B+ tiers. Implement a pydantic schema validator on the output with a single-shot retry on parse failure.
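
A minimal sketch of that validator, assuming pydantic v2; StructuredSummary is an illustrative schema and generate() stands in for whatever model client the pipeline actually uses:

from pydantic import BaseModel, ValidationError

class StructuredSummary(BaseModel):
    # Illustrative target schema for a structured-summary task
    title: str
    bullet_points: list[str]

def validated_generate(generate, prompt: str) -> StructuredSummary:
    # Call the model, validate the output, and retry exactly once on parse failure
    for attempt in range(2):
        raw = generate(prompt)  # stand-in for the real model client
        try:
            return StructuredSummary.model_validate_json(raw)
        except ValidationError as err:
            if attempt == 1:
                raise  # surface the failure instead of scoring garbage
            prompt = f"{prompt}\n\nThe previous reply failed schema validation: {err}. Return only valid JSON."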

Pattern B: Fabricated Tool APIs

Agents confabulate plausible REST endpoints and body parameters when tool schemas are implicit or missing. Fix: This is a tool-binding problem. Provide explicit tool schemas via function-calling APIs, or wrap unknown surfaces with a tool that returns its own OpenAPI spec on demand. Models stop fabricating when given concrete schemas to bind to.
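
In practice, an explicit schema is just a concrete parameter specification handed to the model. A minimal sketch in the JSON-schema shape that common function-calling APIs accept; the tool name and fields are illustrative, not taken from the original traces:

# Illustrative explicit tool schema; all names are hypothetical
CREATE_TICKET_TOOL = {
    "name": "create_ticket",
    "description": "Create a ticket in the internal issue tracker.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "One-line summary of the issue"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["title", "priority"],
    },
}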

Pattern C: Tool-Output Misinterpretation

Agents execute malformed commands, receive permissive success: true responses from runners, and proceed blindly. Fix: Tool runners must never return success: true on non-zero exit codes. Inject the exit code, stderr, and exact command executed into the tool result payload so the agent can perform self-correction.
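
A minimal sketch of such a runner, assuming tools are shell commands executed via subprocess; the payload shape is illustrative:

import subprocess

def run_tool(command: list[str], timeout: int = 60) -> dict:
    # Execute a tool command and report honestly on failure
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return {
        "success": proc.returncode == 0,  # never report success on a non-zero exit code
        "exit_code": proc.returncode,
        "command": command,               # exact command, so the agent can self-correct
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }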

Pitfall Guide

  1. Aggregating Scores Without Execution State Filtering: Calculating mean scores across all observations inflates failure metrics with API/pipeline errors. Always filter level != "ERROR" before computing aggregate evaluator metrics.
  2. Misclassifying Prompt Echo as Classical Hallucination: Treating verbatim prompt repetition as factual invention obscures the real issue: instruction-tuning gaps in smaller models on structured tasks. Route these to appropriate model tiers or add schema validation.
  3. Expecting Judges to Detect Infrastructure Failures: LLM-as-a-judge evaluators only score the final artifact. They cannot distinguish between a null completion due to a gateway rejection and a model refusal. Relying on them for pipeline health monitoring contaminates quality scores.
  4. Permissive Tool Runner Design: Returning success: true for malformed commands or non-zero exit codes creates false positives that agents blindly trust. Tool runners must enforce strict exit-code validation and propagate stderr/stdout explicitly.
  5. Missing Explicit Tool Schema Binding: Agents confabulate API structures when tool definitions are implicit. Always bind concrete schemas via function-calling APIs or dynamic spec retrieval to prevent fabricated endpoints.
  6. Over-relying on Scalar Metrics for Model Routing: A single mean score masks distinct failure modes. Cluster judge comments and scores by model size and task type to identify routing mismatches (e.g., 3B–30B models failing at structured JSON generation); see the sketch after this list.
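
A rough sketch of that clustering, reusing the filtered genuine frame from the earlier snippet and assuming the trace metadata carries model and task_type columns (both names are illustrative):

# Count genuine hallucination flags per model and task type to expose routing mismatches
flagged = genuine[genuine["score"] == 1]
print(
    flagged.groupby(["model", "task_type"])
    .size()
    .sort_values(ascending=False)
)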

Deliverables

  • LLM Observability Pipeline Filter Blueprint: Architecture diagram and data flow specification for injecting level-state filtering into Langfuse/SDK pipelines before evaluator ingestion.
  • LLM-as-a-Judge Validation Checklist: 12-point audit checklist covering execution state isolation, artifact vs. path scoring, schema validation hooks, and tool runner strictness protocols.
  • Configuration Templates: Ready-to-deploy snippets including the pandas/polars execution filter script, pydantic output validator with retry logic, and strict tool runner exit-code enforcement configuration.