Five ways to test an LLM's answer and what each one misses
Current Situation Analysis
Traditional software testing operates on a deterministic premise: identical inputs must produce identical outputs. LLM evaluation shatters this assumption. A single prompt routed through a generative model can yield structurally different, yet semantically equivalent, responses across consecutive runs. Engineering teams attempting to apply classical assertion frameworks to probabilistic systems consistently encounter false negatives, silent hallucinations, and CI pipelines that either block valid releases or ship broken logic.
The core misunderstanding stems from treating evaluation metrics as universal correctness detectors. Metrics like BLEU and ROUGE were engineered for machine translation and text summarization, where reference texts and model outputs share predictable length and syntactic structure. When applied to open-ended QA, that assumption breaks. BLEU's n-gram precision divides matched tokens by the length of the output, so a one-token reference compared against a paragraph-length response scores near zero regardless of factual accuracy. ROUGE recall can stay high in that case, but its precision and F-measure collapse the same way. Similarly, semantic similarity uses the cosine of the angle between embedding vectors, which measures textual proximity rather than factual correctness. Two answers can occupy the same semantic neighborhood while one contains a fabricated API flag or a reversed causal relationship.
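The length effect is easy to reproduce with a toy unigram overlap. The sketch below is not a full BLEU or ROUGE implementation (no higher-order n-grams, no brevity penalty), and the function names and sample strings are illustrative assumptions. It only shows how precision shrinks as a correct answer gets longer than its reference.

```typescript
// Toy unigram overlap: a simplified stand-in for BLEU-1 precision and ROUGE-1 recall.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function unigramOverlap(reference: string, candidate: string) {
  const refTokens = tokenize(reference);
  const candTokens = tokenize(candidate);

  // Count reference tokens so each one can be matched at most once (clipping).
  const refCounts = new Map<string, number>();
  for (const t of refTokens) refCounts.set(t, (refCounts.get(t) ?? 0) + 1);

  let matches = 0;
  for (const t of candTokens) {
    const remaining = refCounts.get(t) ?? 0;
    if (remaining > 0) {
      matches++;
      refCounts.set(t, remaining - 1);
    }
  }

  const precision = candTokens.length ? matches / candTokens.length : 0;
  const recall = refTokens.length ? matches / refTokens.length : 0;
  const f1 = precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}

// A factually correct answer that is much longer than the one-token reference.
const reference = "Paris";
const candidate = "The capital of France is Paris, which has held that role for centuries.";
const overlap = unigramOverlap(reference, candidate);
console.log(overlap);
```

With these inputs the candidate has 13 tokens and one match, so precision is 1/13 and F1 is 1/7 even though recall is 1. Any fixed pass threshold above that would reject a correct answer.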
Data from controlled evaluation harnesses reveals a consistent pattern: four out of five standard scorers reject factually correct answers when the output format diverges from the reference. More critically, LLM-as-judge systems exhibit self-grading bias when the judge shares architecture with the target model. In benchmark runs, a hallucinated command-line flag received a deterministic score of exactly 0.700 across five identical executions, locking precisely on the pass threshold. The system did not fluctuate; it stabilized on a false positive. Averaging multiple runs, a common mitigation for non-determinism, fails to resolve this because the bias is architectural, not stochastic.
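To see why averaging does not help, consider what the mean of five identical scores looks like. The helper below is a hypothetical sketch (the names and the tolerance are assumptions, not part of the harness): it flags a test whose repeated scores show no spread and sit on the pass threshold.

```typescript
// Summarize repeated judge scores for one test case.
function summarize(scores: number[]): { mean: number; stdDev: number } {
  if (scores.length === 0) throw new Error("No scores to summarize.");
  const mean = scores.reduce((sum, x) => sum + x, 0) / scores.length;
  const variance = scores.reduce((sum, x) => sum + (x - mean) ** 2, 0) / scores.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Hypothetical heuristic: zero spread AND a mean sitting on the threshold
// suggests the judge is locked on the boundary rather than measuring anything.
function looksThresholdLocked(scores: number[], threshold: number, eps = 1e-6): boolean {
  const { mean, stdDev } = summarize(scores);
  return stdDev < eps && Math.abs(mean - threshold) < eps;
}

// Five identical runs, as in the benchmark described above.
const runs = [0.7, 0.7, 0.7, 0.7, 0.7];
const locked = looksThresholdLocked(runs, 0.7);
console.log(summarize(runs), locked);
```

Averaging reduces noise around a true value. When every run returns the same wrong value there is no noise to remove, so the mean simply restates the false positive.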
The industry has not yet standardized on a multi-scorer orchestration pattern. Teams either over-rely on a single metric, manually tune thresholds until they admit incorrect outputs, or deploy judge models that mirror the target architecture. The result is evaluation suites that measure format compliance and lexical overlap rather than factual correctness, leaving production systems vulnerable to deterministic hallucination passes.
WOW Moment: Key Findings
The following table isolates how each evaluation approach behaves when confronted with format variance, factual errors, and non-deterministic scoring. The data reflects controlled benchmark runs using llama3.2 as both target and judge, with a fixed pass threshold of 0.700.
| Approach | Format Sensitivity | Hallucination Catch Rate | Determinism | Best Use Case |
|---|---|---|---|---|
| Exact Match | Critical | High | Absolute | Structured data, IDs, short codes |
| BLEU | Critical | Low | High | Translation, parallel corpora |
| ROUGE | Critical | Low | High | Summarization, extractive tasks |
| Semantic Similarity | Moderate | Low | Moderate | Intent matching, paraphrase detection |
| LLM-as-Judge | Low | Variable | Low (threshold locking) | Open-ended QA, reasoning validation |
The critical insight is that no single scorer captures correctness in isolation. BLEU and ROUGE fail when reference and output lengths diverge because their precision terms divide the overlap by the output's token count, so a long but correct answer is scored as mostly noise. Semantic similarity measures vector proximity, which cannot distinguish between a correct explanation and a plausible but factually inverted one. LLM-as-judge systems introduce architectural bias: when the judge and target share weights, hallucinated tokens do not trigger internal contradiction, causing scores to stabilize on the pass threshold rather than fluctuate.
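The proximity problem can be illustrated without an embedding model. The sketch below uses bag-of-words count vectors as a deliberately crude stand-in for embeddings, which is an assumption made only to keep the example self-contained. Real embedding models are sensitive to word order to some degree and will not return exactly 1 here, but the same failure shape applies: an inverted claim built from the same words lands very close to the correct one.

```typescript
// Crude stand-in for an embedding: word counts over a shared vocabulary.
function words(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function bagOfWords(text: string, vocab: string[]): number[] {
  const tokens = words(text);
  return vocab.map(word => tokens.filter(t => t === word).length);
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const magB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return magA && magB ? dot / (magA * magB) : 0;
}

// Same words, opposite meaning.
const correct = "the cache invalidates the session";
const inverted = "the session invalidates the cache";

const vocab = Array.from(new Set(words(correct).concat(words(inverted))));
const similarity = cosine(bagOfWords(correct, vocab), bagOfWords(inverted, vocab));
console.log(similarity);
```

Both sentences produce identical count vectors, so the cosine is 1 up to floating-point rounding. No threshold on this score can separate the correct statement from the inverted one.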
This finding matters because it shifts evaluation strategy from metric selection to orchestrator design. Teams must accept that every scorer has documented blind spots. The evaluation harness becomes a panel of imperfect validators, and disagreements between them show where the real test cases lie. Production systems that route questions to the appropriate scorer, enforce external judges, and pluralize references consistently outperform single-metric pipelines in both precision and recall.
Core Solution
Building a resilient LLM evaluation pipeline requires decoupling scoring logic from execution, enforcing architectural separation between target and judge models, and normalizing references to handle generative variance. The following TypeScript implementation demonstrates a modular orchestrator that routes evaluations based on question type, applies shape-aware reference matching, and enforces external judging.
Architecture Decisions
- Router Pattern: Questions are classified by expected output shape (structured, prose, reasoning). The router selects the appropriate scorer, preventing format collapse.
- External Judge Enforcement: The judge model must differ from the target model to eliminate self-grading bias. We enforce this via configuration validation.
- Reference Pluralization: Each test case accepts multiple reference answers. The scorer computes the maximum score across all references, mitigating shape mismatch.
- Threshold Calibration: Pass/fail boundaries are not hardcoded. They are derived from a labeled calibration dataset by choosing an operating point on the precision-recall curve (for example, the one that maximizes F1), then locked for CI.
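The calibration step can be sketched as a simple threshold sweep. This is a minimal illustration under stated assumptions: the LabeledScore shape, the function name, and the sample data are hypothetical, and the sweep picks the max-F1 operating point (the max_f1 option named in the configuration template) rather than any other criterion.

```typescript
// One scorer output paired with a human label.
interface LabeledScore {
  score: number;
  correct: boolean;
}

// Sweep every observed score as a candidate threshold and keep the one with the best F1.
function calibrateThreshold(samples: LabeledScore[]): { threshold: number; f1: number } {
  if (samples.length === 0) throw new Error("Calibration set is empty.");
  const candidates = Array.from(new Set(samples.map(s => s.score))).sort((a, b) => a - b);

  let best = { threshold: candidates[0], f1: -1 };
  for (const threshold of candidates) {
    let tp = 0, fp = 0, fn = 0;
    for (const s of samples) {
      const predictedPass = s.score >= threshold;
      if (predictedPass && s.correct) tp++;
      else if (predictedPass && !s.correct) fp++;
      else if (!predictedPass && s.correct) fn++;
    }
    const f1 = tp === 0 ? 0 : (2 * tp) / (2 * tp + fp + fn);
    // On ties, prefer the higher (stricter) threshold.
    if (f1 >= best.f1) best = { threshold, f1 };
  }
  return best;
}

// Hypothetical calibration data: scorer outputs with human correctness labels.
const samples: LabeledScore[] = [
  { score: 0.95, correct: true },
  { score: 0.90, correct: true },
  { score: 0.80, correct: false },
  { score: 0.75, correct: true },
  { score: 0.60, correct: false },
  { score: 0.40, correct: false },
];
const calibrated = calibrateThreshold(samples);
console.log(calibrated);
```

On this data the sweep settles on 0.75, which admits one incorrect answer (the 0.80 sample) in exchange for keeping all three correct ones. The returned value is what gets frozen for CI.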
Implementation
// Types
type QuestionType = 'structured' | 'prose' | 'reasoning';
type ScorerResult = { score: number; passed: boolean; metadata: Record<string, unknown> };

interface TestCase {
  id: string;
  prompt: string;
  references: string[];
  expectedType: QuestionType;
  threshold: number; // calibrated pass boundary, applied by the semantic and judge scorers
}

interface EvaluationConfig {
  targetModel: string;
  judgeModel: string;
  semanticModel: string;
  maxRetries: number;
}

// External dependencies are injected so the orchestrator stays provider-agnostic.
interface EvaluationDeps {
  embed: (text: string) => Promise<number[]>;
  judgeClient: any; // OpenAI-compatible chat client, typed loosely on purpose
  rubric: string;
}

// Cosine Similarity Calculator
function cosineSimilarity(vecA: number[], vecB: number[]): number {
  if (vecA.length !== vecB.length) {
    throw new Error('Cannot compare embeddings with different dimensions.');
  }
  const dotProduct = vecA.reduce((sum, val, i) => sum + val * vecB[i], 0);
  const magA = Math.sqrt(vecA.reduce((sum, val) => sum + val ** 2, 0));
  const magB = Math.sqrt(vecB.reduce((sum, val) => sum + val ** 2, 0));
  return magA && magB ? dotProduct / (magA * magB) : 0;
}

// Scorer Interfaces
interface Scorer {
  // Every scorer is async so local and network-backed scorers share one contract.
  evaluate(target: string, references: string[], threshold: number): Promise<ScorerResult>;
}

class ExactMatchScorer implements Scorer {
  // Exact match is binary, so the threshold argument is not needed here.
  async evaluate(target: string, references: string[]): Promise<ScorerResult> {
    const normalizedTarget = target.trim().toLowerCase();
    const passed = references.some(ref => ref.trim().toLowerCase() === normalizedTarget);
    return { score: passed ? 1.0 : 0.0, passed, metadata: { type: 'exact' } };
  }
}

class SemanticScorer implements Scorer {
  constructor(private embeddingFn: (text: string) => Promise<number[]>) {}

  async evaluate(target: string, references: string[], threshold: number): Promise<ScorerResult> {
    const targetVec = await this.embeddingFn(target);
    const scores = await Promise.all(references.map(async ref => {
      const refVec = await this.embeddingFn(ref);
      return cosineSimilarity(targetVec, refVec);
    }));
    // Reference pluralization: keep the best match across all references.
    const maxScore = Math.max(...scores);
    return { score: maxScore, passed: maxScore >= threshold, metadata: { type: 'semantic', threshold } };
  }
}

class ExternalJudgeScorer implements Scorer {
  constructor(private judgeClient: any, private judgeModel: string, private rubric: string) {}

  async evaluate(target: string, references: string[], threshold: number): Promise<ScorerResult> {
    const prompt = `${this.rubric}\n\nReference:\n${references.join('\n')}\n\nTarget Response:\n${target}\n\n` +
      'Score correctness and relevance from 0 to 1. Respond with a JSON object: {"score": number, "explanation": string}.';
    const response = await this.judgeClient.chat.completions.create({
      model: this.judgeModel, // Must differ from target; checked by the orchestrator
      messages: [{ role: 'user', content: prompt }],
      temperature: 0.0, // Remove sampling variance from scoring
      response_format: { type: 'json_object' }
    });
    const parsed = JSON.parse(response.choices[0].message.content ?? '{}');
    // The prompt asks for a 0-1 score, so clamp it without rescaling.
    const raw = Number(parsed.score);
    const score = Number.isFinite(raw) ? Math.min(1.0, Math.max(0.0, raw)) : 0;
    return { score, passed: score >= threshold, metadata: { type: 'judge', threshold, reasoning: parsed.explanation } };
  }
}

// Orchestrator
class EvaluationOrchestrator {
  private scorers: Record<QuestionType, Scorer>;

  constructor(config: EvaluationConfig, deps: EvaluationDeps) {
    if (config.targetModel === config.judgeModel) {
      throw new Error('Architectural violation: Judge model must differ from target model to prevent self-grading bias.');
    }
    this.scorers = {
      structured: new ExactMatchScorer(),
      prose: new SemanticScorer(deps.embed),
      reasoning: new ExternalJudgeScorer(deps.judgeClient, config.judgeModel, deps.rubric)
    };
  }

  async run(testCase: TestCase, targetOutput: string): Promise<ScorerResult> {
    if (testCase.references.length === 0) {
      throw new Error(`Test case ${testCase.id} has no reference answers.`);
    }
    const scorer = this.scorers[testCase.expectedType];
    return scorer.evaluate(targetOutput, testCase.references, testCase.threshold);
  }
}
Rationale
- Zero Temperature for Judges: LLM-as-judge scoring should be as deterministic as possible. Setting temperature: 0.0 and using structured JSON output removes sampling variance during evaluation, exposing architectural bias rather than masking it with noise.
- Reference Pluralization: The references array allows teams to supply multiple valid phrasings. The scorer takes the maximum score, preventing false negatives when the model rephrases correctly.
- Threshold Calibration: Hardcoded thresholds (like 0.700) create threshold-locking artifacts. Production systems should derive thresholds from a labeled calibration set using precision-recall optimization, then freeze them for CI.
- Model Separation: The constructor explicitly rejects identical target/judge models. Self-grading bias is not a bug; it is an expected consequence of grading with the same weights that produced the answer.
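Structured JSON output only helps if the harness checks what comes back. The sketch below is one possible way to validate a judge response before trusting its score. It assumes the judge is asked for the score and explanation fields used in the implementation above, and the function name is hypothetical. It fails loudly on malformed output instead of clamping it into range.

```typescript
interface JudgeVerdict {
  score: number;
  explanation: string;
}

// Parse and validate the raw judge response; throw rather than guess.
function parseJudgeVerdict(raw: string): JudgeVerdict {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("Judge returned non-JSON output.");
  }
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Judge output is not a JSON object.");
  }
  const { score, explanation } = parsed as Record<string, unknown>;
  if (typeof score !== "number" || !Number.isFinite(score) || score < 0 || score > 1) {
    throw new Error(`Judge score is missing or outside 0-1: ${String(score)}`);
  }
  // The explanation is kept only as a debugging artifact.
  return { score, explanation: typeof explanation === "string" ? explanation : "" };
}

const verdict = parseJudgeVerdict('{"score": 0.8, "explanation": "Matches the reference."}');
console.log(verdict.score);
```

Rejecting an out-of-range score (for example a judge that answers on a 0-10 scale) surfaces a rubric mismatch immediately, where silent clamping or rescaling would hide it.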
Pitfall Guide
1. Self-Grading Bias
Explanation: When the judge model shares architecture with the target model, hallucinated tokens do not trigger internal contradiction. Scores stabilize on the pass threshold rather than fluctuating, creating deterministic false positives.
Fix: Always route evaluation to a distinct, higher-capability model. Validate model separation in CI configuration. Use temperature 0.0 to expose bias rather than mask it.
2. Reference Shape Collapse
Explanation: BLEU's precision, and the precision term inside ROUGE's F-measure, divide shared tokens by total output length. A one-word reference against a paragraph response yields near-zero scores, regardless of factual accuracy.
Fix: Match reference length to expected output shape. For short answers, use exact match or structured parsing. For prose, supply full-sentence references or switch to semantic/judge scorers.
3. Semantic Proximity vs Factual Correctness
Explanation: Cosine similarity measures vector angle, not truth. A response that inverts causality or swaps entities can occupy the same semantic neighborhood as the correct answer.
Fix: Never use semantic similarity as a standalone correctness validator. Combine it with rule-based fact extraction or an external judge that verifies specific claims.
4. Threshold Tuning Illusion
Explanation: Lowering cosine or judge thresholds to admit correct but differently phrased answers inevitably admits incorrect ones. In benchmark data, right and wrong scores often differ by less than 0.005.
Fix: Calibrate thresholds on a held-out dataset. Use precision-recall analysis to find the operating point that maximizes F1 score. Lock thresholds post-calibration; do not adjust them per test case.
5. Reasoning Text Overweighting
Explanation: LLM judges frequently generate explanations that contradict their numerical scores. The prose reflects token-level generation patterns, while the score reflects weighted rubric alignment.
Fix: Treat judge explanations as debugging artifacts, not evidence. Parse and store the numerical score. Log the reasoning separately for post-hoc analysis.
6. Ignoring Non-Determinism in CI
Explanation: Running a single evaluation pass and treating the result as absolute ignores generative variance. Flaky tests are often dismissed rather than investigated.
Fix: Implement strict failure tracking (xfail patterns). Document known scorer limitations. When a test unexpectedly passes, break the build to force investigation. Use multiple runs only when measuring variance, not correctness.
7. Single-Reference Bottleneck
Explanation: Hardcoding one reference answer assumes the model must reproduce exact phrasing. This penalizes valid paraphrasing and inflates false negative rates.
Fix: Maintain a reference pool per test case. Use dynamic matching that selects the highest-scoring reference. Rotate references periodically to prevent overfitting.
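The strict failure tracking recommended under pitfall 6 can be sketched in a few lines. The names below (Outcome, ExpectedFailure, classify, buildShouldBreak) are hypothetical and loosely modeled on the xfail convention from Python test runners. The point is only that an unexpected pass is treated as a build-breaking event, just like an unexpected failure.

```typescript
type Outcome = "pass" | "fail" | "xfail" | "xpass";

// A documented scorer limitation attached to a test case.
interface ExpectedFailure {
  reason: string;
}

// Classify one result against its expectation.
function classify(passed: boolean, expectedFailure?: ExpectedFailure): Outcome {
  if (!expectedFailure) return passed ? "pass" : "fail";
  // A test marked as a known failure that suddenly passes needs investigation.
  return passed ? "xpass" : "xfail";
}

// Strict mode: both unexpected failures and unexpected passes break the build.
function buildShouldBreak(outcomes: Outcome[]): boolean {
  return outcomes.some(o => o === "fail" || o === "xpass");
}

const knownLimit: ExpectedFailure = { reason: "BLEU collapses on one-word references" };
const outcomes: Outcome[] = [
  classify(true),               // ordinary pass
  classify(false, knownLimit),  // known limitation, still failing as documented
  classify(true, knownLimit),   // known limitation that unexpectedly passed
];
console.log(outcomes, buildShouldBreak(outcomes));
```

An unexpected pass usually means either the scorer changed behavior or the judge started admitting something it should not, and both are worth a human look before the marker is removed.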
Production Bundle
Action Checklist
- Separate target and judge models: Verify architectural distinction in configuration validation.
- Pluralize references: Supply 3-5 valid phrasings per test case to handle generative variance.
- Calibrate thresholds: Run evaluation on a labeled dataset and derive pass/fail boundaries from the precision-recall curve (for example, the max-F1 operating point).
- Enforce deterministic judging: Set temperature to 0.0 and require structured JSON output for all judge calls.
- Document scorer blind spots: Maintain a living spec of known failure modes (format collapse, semantic proximity limits).
- Implement strict failure tracking: Use xfail patterns with reason fields to catch unexpected passes and force investigation.
- Scale evaluation datasets: Move from 10-item prototypes to 200+ item production sets with multiple human raters.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Structured data (IDs, codes, enums) | Exact Match + Schema Validation | Deterministic, zero false positives, minimal compute | Low |
| Open-ended QA with factual claims | External Judge (different model) + Fact Extraction | Catches hallucinations, handles format variance | Medium-High |
| Paraphrase/Intent matching | Semantic Similarity + Rule-Based Filters | Fast, format-agnostic, but requires correctness guardrails | Low |
| Translation/Summarization | BLEU/ROUGE with matched-length references | Mathematically sound for parallel corpora | Low |
| High-stakes reasoning (medical, legal) | Multi-Judge Panel + Human-in-the-Loop Audit | Eliminates single-model bias, provides audit trail | High |
Configuration Template
evaluation:
  target_model: "llama3.2-8b"
  judge_model: "qwen2.5-72b-instruct"
  semantic_model: "nomic-embed-text"
  thresholds:
    calibrated: true
    calibration_dataset: "evals/v2/ground_truth.json"
    operating_point: "max_f1"
  scorers:
    structured:
      type: "exact_match"
      normalize: true
    prose:
      type: "semantic_similarity"
      threshold: 0.82
      fallback: "judge"
    reasoning:
      type: "external_judge"
      rubric: "evals/rubrics/factual_correctness_v3.md"
      temperature: 0.0
      response_format: "json"
  ci:
    strict_xfail: true
    max_retries: 3
    variance_alert: true
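A configuration like this is worth validating before any test runs. The sketch below assumes the YAML has already been parsed into a plain object. The HarnessConfig shape and the model-family heuristic (everything before the first hyphen or colon) are illustrative assumptions, not a reliable way to detect shared architectures.

```typescript
// Hypothetical in-memory view of the parsed configuration.
interface HarnessConfig {
  targetModel: string;
  judgeModel: string;
  judgeTemperature: number;
  thresholds: Record<string, number>;
}

// Rough family guess: "llama3.2-8b" and "llama3.2:latest" both map to "llama3.2".
function modelFamily(model: string): string {
  return model.toLowerCase().split(/[-:]/)[0];
}

// Return a list of problems instead of throwing, so CI can report all of them at once.
function validateConfig(cfg: HarnessConfig): string[] {
  const problems: string[] = [];
  if (modelFamily(cfg.targetModel) === modelFamily(cfg.judgeModel)) {
    problems.push("Judge and target appear to belong to the same model family.");
  }
  if (cfg.judgeTemperature !== 0) {
    problems.push("Judge temperature should be 0 for repeatable scoring.");
  }
  for (const key of Object.keys(cfg.thresholds)) {
    const value = cfg.thresholds[key];
    if (!(value >= 0 && value <= 1)) {
      problems.push(`Threshold for "${key}" must be between 0 and 1.`);
    }
  }
  return problems;
}

const templateProblems = validateConfig({
  targetModel: "llama3.2-8b",
  judgeModel: "qwen2.5-72b-instruct",
  judgeTemperature: 0.0,
  thresholds: { prose: 0.82 },
});
console.log(templateProblems);
```

The template above passes this check with no problems reported. A comparison by family prefix is stricter than the plain string equality used in the orchestrator constructor, but it is still only a naming heuristic.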
Quick Start Guide
- Initialize the harness: Create a TypeScript project with tsconfig.json configured for strict mode. Install dependencies: npm i zod openai @types/node.
- Define test cases: Structure your evaluation dataset as JSON arrays containing prompt, references (multiple valid answers), expectedType, and threshold.
- Configure model routing: Set target_model and judge_model to distinct architectures. Validate separation in your orchestrator constructor.
- Run calibration: Execute the harness against a labeled ground-truth set. Extract precision/recall curves and lock the operating threshold.
- Integrate with CI: Add the evaluation script to your pipeline. Enable strict_xfail to break builds on unexpected passes. Monitor variance alerts for threshold drift.