Five ways to test an LLM's answer and what each one misses

Current Situation Analysis

Traditional software testing operates on a deterministic premise: identical inputs must produce identical outputs. LLM evaluation shatters this assumption. A single prompt routed through a generative model can yield structurally different, yet semantically equivalent, responses across consecutive runs. Engineering teams attempting to apply classical assertion frameworks to probabilistic systems consistently encounter false negatives, silent hallucinations, and CI pipelines that either block valid releases or ship broken logic.

The core misunderstanding stems from treating evaluation metrics as universal correctness detectors. Metrics like BLEU and ROUGE were engineered for machine translation and text summarization, where reference texts and model outputs share predictable length and syntactic structure. Applied to open-ended QA, their length normalization works against them: BLEU divides matched n-grams by the length of the output, and ROUGE's F-measure folds in the same precision term. A one-token reference compared against a paragraph-length response therefore yields a near-zero score, regardless of factual accuracy. Similarly, semantic similarity uses the cosine of the angle between embedding vectors to measure textual proximity, not factual correctness. Two answers can occupy the same semantic neighborhood while one contains a fabricated API flag or a reversed causal relationship.
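To make the length effect concrete, here is a minimal sketch of unigram overlap scoring. It is a simplification, not real BLEU or ROUGE (no higher-order n-grams, brevity penalty, or stemming), and the names `tokenize` and `unigramOverlap` are illustrative assumptions rather than functions from any library.

```typescript
// Simplified unigram overlap, used only to illustrate the length effect.
interface OverlapScores {
  precision: number; // overlap / candidate length (BLEU-style)
  recall: number;    // overlap / reference length (ROUGE-style)
  f1: number;        // harmonic mean of the two
}

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function unigramOverlap(reference: string, candidate: string): OverlapScores {
  const refTokens = tokenize(reference);
  const candTokens = tokenize(candidate);
  // Clipped matching: each reference token can be matched at most once.
  const unmatched = refTokens.slice();
  let overlap = 0;
  for (const token of candTokens) {
    const idx = unmatched.indexOf(token);
    if (idx !== -1) {
      overlap += 1;
      unmatched.splice(idx, 1);
    }
  }
  const precision = candTokens.length ? overlap / candTokens.length : 0;
  const recall = refTokens.length ? overlap / refTokens.length : 0;
  const f1 = precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}

const scores = unigramOverlap(
  'Paris',
  'The capital of France is Paris, which sits on the Seine.'
);
console.log(scores);
```

With the one-token reference and the 11-token answer above, precision is 1/11 (about 0.09) and F1 is about 0.17, even though recall is 1.0 and the answer is correct.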

Data from controlled evaluation harnesses reveals a consistent pattern: four out of five standard scorers reject factually correct answers when the output format diverges from the reference. More critically, LLM-as-judge systems exhibit self-grading bias when the judge shares architecture with the target model. In benchmark runs, a hallucinated command-line flag received a deterministic score of exactly 0.700 across five identical executions, locking precisely on the pass threshold. The system did not fluctuate; it stabilized on a false positive. Averaging multiple runs, a common mitigation for non-determinism, fails to resolve this because the bias is architectural, not stochastic.

The industry has not yet standardized on a multi-scorer orchestration pattern. Teams either over-rely on a single metric, manually tune thresholds until they admit incorrect outputs, or deploy judge models that mirror the target architecture. The result is evaluation suites that measure format compliance and lexical overlap rather than factual correctness, leaving production systems vulnerable to deterministic hallucination passes.

WOW Moment: Key Findings

The following table isolates how each evaluation approach behaves when confronted with format variance, factual errors, and non-deterministic scoring. The data reflects controlled benchmark runs using llama3.2 as both target and judge, with a fixed pass threshold of 0.700.

Approach	Format Sensitivity	Hallucination Catch Rate	Determinism	Best Use Case
Exact Match	Critical	High	Absolute	Structured data, IDs, short codes
BLEU	Critical	Low	High	Translation, parallel corpora
ROUGE	Critical	Low	High	Summarization, extractive tasks
Semantic Similarity	Moderate	Low	Moderate	Intent matching, paraphrase detection
LLM-as-Judge	Low	Variable	Low (threshold locking)	Open-ended QA, reasoning validation

The critical insight is that no single scorer captures correctness in isolation. BLEU and ROUGE fail when reference and output lengths diverge because they normalize overlap by token counts, so a long correct answer scored against a short reference is penalized for every extra token. Semantic similarity measures vector proximity, which cannot distinguish between a correct explanation and a plausible but factually inverted one. LLM-as-judge systems introduce architectural bias: when the judge and target share weights, hallucinated tokens do not trigger internal contradiction, causing scores to stabilize on the pass threshold rather than fluctuate.

This finding matters because it shifts evaluation strategy from metric selection to orchestrator design. Teams must accept that every scorer has documented blind spots. The evaluation harness becomes a panel of imperfect validators, where disagreements surface the actual test surface. Production systems that route questions to the appropriate scorer, enforce external judges, and pluralize references consistently outperform single-metric pipelines in both precision and recall.
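The "panel of imperfect validators" idea can be sketched as follows. This is a minimal illustration under stated assumptions: `PanelMember`, `runPanel`, and the two toy scorers are hypothetical names invented here, and real members would wrap the BLEU, semantic, and judge scorers discussed above.

```typescript
// Hypothetical panel sketch: each member is a named scoring function with its own threshold.
interface PanelMember {
  name: string;
  threshold: number;
  score: (output: string, references: string[]) => number;
}

interface Verdict { scorer: string; score: number; passed: boolean }
interface PanelResult { verdicts: Verdict[]; needsReview: boolean }

function runPanel(members: PanelMember[], output: string, references: string[]): PanelResult {
  const verdicts = members.map((m): Verdict => {
    const score = m.score(output, references);
    return { scorer: m.name, score, passed: score >= m.threshold };
  });
  const passes = verdicts.filter(v => v.passed).length;
  // Disagreement (some pass, some fail) is the signal worth investigating.
  const needsReview = passes > 0 && passes < verdicts.length;
  return { verdicts, needsReview };
}

// Two deliberately crude members, used only to show the disagreement path.
const exactMatch: PanelMember = {
  name: 'exact',
  threshold: 1,
  score: (out, refs) =>
    refs.some(r => r.trim().toLowerCase() === out.trim().toLowerCase()) ? 1 : 0,
};
const containsReference: PanelMember = {
  name: 'contains',
  threshold: 1,
  score: (out, refs) =>
    refs.some(r => out.toLowerCase().indexOf(r.toLowerCase()) !== -1) ? 1 : 0,
};

const result = runPanel([exactMatch, containsReference], 'The capital is Paris.', ['Paris']);
console.log(result.needsReview); // true: the scorers disagree
```

Flagging only split verdicts keeps the review queue small: unanimous passes and unanimous failures need no human attention, while disagreements point at either a scorer blind spot or a genuinely ambiguous answer.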

Core Solution

Building a resilient LLM evaluation pipeline requires decoupling scoring logic from execution, enforcing architectural separation between target and judge models, and normalizing references to handle generative variance. The following TypeScript implementation demonstrates a modular orchestrator that routes evaluations based on question type, applies shape-aware reference matching, and enforces external judging.

Architecture Decisions

Router Pattern: Questions are classified by expected output shape (structured, prose, reasoning). The router selects the appropriate scorer, preventing format collapse.
External Judge Enforcement: The judge model must differ from the target model to eliminate self-grading bias. We enforce this via configuration validation.
Reference Pluralization: Each test case accepts multiple reference answers. The scorer computes the maximum score across all references, mitigating shape mismatch.
Threshold Calibration: Pass/fail boundaries are not hardcoded. They are derived from a labeled calibration dataset by choosing an operating point on its ROC or precision-recall curve, then locked for CI.

Implementation

// Sketch only: no runtime imports are required; wire in real embedding and judge clients where noted.

// Types
type QuestionType = 'structured' | 'prose' | 'reasoning';
type ScorerResult = { score: number; passed: boolean; metadata: Record<string, unknown> };

interface TestCase {
  id: string;
  prompt: string;
  references: string[];
  expectedType: QuestionType;
  threshold: number;
}

interface EvaluationConfig {
  targetModel: string;
  judgeModel: string;
  semanticModel: string;
  maxRetries: number;
}

// Cosine Similarity Calculator
function cosineSimilarity(vecA: number[], vecB: number[]): number {
  const dotProduct = vecA.reduce((sum, val, i) => sum + val * vecB[i], 0);
  const magA = Math.sqrt(vecA.reduce((sum, val) => sum + val ** 2, 0));
  const magB = Math.sqrt(vecB.reduce((sum, val) => sum + val ** 2, 0));
  return magA && magB ? dotProduct / (magA * magB) : 0;
}

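// Scorer Interfaces
// Scorers may be synchronous (exact match) or asynchronous (embedding and judge calls),
// so the return type allows both and callers should always await the result.
interface Scorer {
  evaluate(target: string, references: string[]): ScorerResult | Promise<ScorerResult>;
}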

class ExactMatchScorer implements Scorer {
  evaluate(target: string, references: string[]): ScorerResult {
    const normalizedTarget = target.trim().toLowerCase();
    const passed = references.some(ref => ref.trim().toLowerCase() === normalizedTarget);
    return { score: passed ? 1.0 : 0.0, passed, metadata: { type: 'exact' } };
  }
}

class SemanticScorer implements Scorer {
  constructor(private embeddingFn: (text: string) => Promise<number[]>) {}

  async evaluate(target: string, references: string[]): Promise<ScorerResult> {
    const targetVec = await this.embeddingFn(target);
    const scores = await Promise.all(references.map(async ref => {
      const refVec = await this.embeddingFn(ref);
      return cosineSimilarity(targetVec, refVec);
    }));
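    // Math.max() of an empty list is -Infinity, so guard against a test case with no references.
    const maxScore = scores.length ? Math.max(...scores) : 0;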
    return { score: maxScore, passed: maxScore >= 0.85, metadata: { type: 'semantic', threshold: 0.85 } };
  }
}

class ExternalJudgeScorer implements Scorer {
  // judgeClient is assumed to be an OpenAI-compatible chat client; it is typed loosely for this sketch.
  constructor(
    private judgeClient: any,
    private rubric: string,
    private judgeModel: string // Must differ from the target model
  ) {}

  async evaluate(target: string, references: string[]): Promise<ScorerResult> {
    const prompt =
      `${this.rubric}\n\nReference:\n${references.join('\n')}\n\nTarget Response:\n${target}\n\n` +
      'Score correctness and relevance from 0 to 1. Reply as JSON: {"score": number, "explanation": string}.';
    const response = await this.judgeClient.chat.completions.create({
      model: this.judgeModel,
      messages: [{ role: 'user', content: prompt }],
      temperature: 0.0, // Minimize sampling variance
      response_format: { type: 'json_object' }
    });
    const parsed = JSON.parse(response.choices[0].message.content ?? '{}');
    // The prompt asks for a 0-1 score, so clamp it rather than rescaling.
    const raw = typeof parsed.score === 'number' ? parsed.score : 0;
    const score = Math.min(1.0, Math.max(0.0, raw));
    return { score, passed: score >= 0.75, metadata: { type: 'judge', reasoning: parsed.explanation } };
  }
}

// Orchestrator
// External dependencies are injected so the orchestrator never runs against silent stubs.
interface OrchestratorDeps {
  embed: (text: string) => Promise<number[]>;
  judgeClient: any;
  rubric: string;
}

class EvaluationOrchestrator {
  private scorers: Record<QuestionType, Scorer>;

  constructor(config: EvaluationConfig, deps: OrchestratorDeps) {
    if (config.targetModel === config.judgeModel) {
      throw new Error('Architectural violation: Judge model must differ from target model to prevent self-grading bias.');
    }
    this.scorers = {
      structured: new ExactMatchScorer(),
      prose: new SemanticScorer(deps.embed),
      reasoning: new ExternalJudgeScorer(deps.judgeClient, deps.rubric, config.judgeModel)
    };
  }

  async run(testCase: TestCase, targetOutput: string): Promise<ScorerResult> {
    const scorer = this.scorers[testCase.expectedType];
    // await handles both sync and async scorers, so no runtime type check is needed.
    const result = await scorer.evaluate(targetOutput, testCase.references);
    // The calibrated per-case threshold overrides each scorer's default cutoff.
    return { ...result, passed: result.score >= testCase.threshold };
  }
}

Rationale

Zero Temperature for Judges: LLM-as-judge scoring should be as repeatable as possible. Setting temperature: 0.0 and requiring structured JSON output removes most sampling variance during evaluation (some backends remain slightly non-deterministic even at zero temperature), exposing architectural bias rather than masking it with noise.
Reference Pluralization: The references array allows teams to supply multiple valid phrasings. The scorer takes the maximum score, preventing false negatives when the model rephrases correctly.
Threshold Calibration: Hardcoded thresholds (like 0.700) create threshold-locking artifacts. Production systems should derive thresholds from a labeled calibration set using precision-recall optimization, then freeze them for CI.
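Model Separation: The constructor explicitly rejects identical target/judge models. Self-grading bias is not a random bug; it is a systematic effect to expect whenever judge and target share weights. Note that the check compares model names only, so two sizes of the same model family would still pass it.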

Pitfall Guide

1. Self-Grading Bias

Explanation: When the judge model shares architecture with the target model, hallucinated tokens do not trigger internal contradiction. Scores stabilize on the pass threshold rather than fluctuating, creating deterministic false positives.
Fix: Always route evaluation to a distinct, higher-capability model. Validate model separation in CI configuration. Use temperature 0.0 to expose bias rather than mask it.
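A plain string comparison lets two sizes of the same model through, so a stricter separation check may be useful. The sketch below is a hypothetical heuristic, not an established convention: it treats the leading alphabetic run of a model name as its family, which works for names like the ones in this article but would need an explicit allowlist for names with organization prefixes.

```typescript
// Hypothetical heuristic: the leading alphabetic run of a model name is its family.
// "llama3.2-8b" -> "llama", "qwen2.5-72b-instruct" -> "qwen".
function modelFamily(modelName: string): string {
  const normalized = modelName.trim().toLowerCase();
  const match = normalized.match(/^[a-z]+/);
  return match ? match[0] : normalized;
}

// Throws when judge and target look like members of the same family.
function assertJudgeSeparation(targetModel: string, judgeModel: string): void {
  if (modelFamily(targetModel) === modelFamily(judgeModel)) {
    throw new Error(
      `Judge "${judgeModel}" appears to share a model family with target "${targetModel}".`
    );
  }
}

// Passes: different families.
assertJudgeSeparation('llama3.2-8b', 'qwen2.5-72b-instruct');
```

Calling this from the orchestrator constructor or a CI configuration check would reject a pairing such as two Llama variants, which the name-equality check alone accepts.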

2. Reference Shape Collapse

Explanation: BLEU divides matched n-grams by the length of the output, and ROUGE's F-measure includes the same precision term. A one-word reference against a paragraph response therefore yields near-zero scores, regardless of factual accuracy.
Fix: Match reference length to expected output shape. For short answers, use exact match or structured parsing. For prose, supply full-sentence references or switch to semantic/judge scorers.

3. Semantic Proximity vs Factual Correctness

Explanation: Cosine similarity measures vector angle, not truth. A response that inverts causality or swaps entities can occupy the same semantic neighborhood as the correct answer.
Fix: Never use semantic similarity as a standalone correctness validator. Combine it with rule-based fact extraction or an external judge that verifies specific claims.
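The effect is easiest to see with a toy vectorizer. The sketch below uses bag-of-words counts instead of a real embedding model, which is an assumption made purely for illustration: learned embeddings are sensitive to word order to some degree, but they can still place such pairs very close together. The function names are hypothetical.

```typescript
function tokens(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Cosine similarity over word-count vectors built from the two texts' shared vocabulary.
function bagOfWordsCosine(a: string, b: string): number {
  const ta = tokens(a);
  const tb = tokens(b);
  const vocab = ta.concat(tb).filter((w, i, all) => all.indexOf(w) === i);
  const va = vocab.map(w => ta.filter(t => t === w).length);
  const vb = vocab.map(w => tb.filter(t => t === w).length);
  const dot = va.reduce((sum, x, i) => sum + x * vb[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  const denom = norm(va) * norm(vb);
  return denom === 0 ? 0 : dot / denom;
}

const correct = 'smoking causes cancer';
const inverted = 'cancer causes smoking';
console.log(bagOfWordsCosine(correct, inverted).toFixed(3)); // "1.000"
```

The two sentences make opposite causal claims yet score as identical, because the vectors record which words occur and not how they relate. Any threshold on this score would pass both.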

4. Threshold Tuning Illusion

Explanation: Lowering cosine or judge thresholds to admit correctly formatted but differently phrased answers inevitably admits incorrect ones. In benchmark data, right and wrong scores often differ by less than 0.005.
Fix: Calibrate thresholds on a held-out dataset. Sweep candidate thresholds (for example, along the precision-recall curve) to find the operating point that maximizes F1 score. Lock thresholds post-calibration; do not adjust them per test case.
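A minimal sketch of that calibration step, assuming a labeled set of scorer outputs is already available. It sweeps the observed scores as candidate thresholds and keeps the one with the highest F1; the names `LabeledScore` and `calibrateThreshold` and the sample data are illustrative assumptions, not values from the benchmark.

```typescript
interface LabeledScore {
  score: number;    // scorer output for one answer
  correct: boolean; // human label: was the answer actually correct?
}

function f1AtThreshold(data: LabeledScore[], threshold: number): number {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const d of data) {
    const predictedPass = d.score >= threshold;
    if (predictedPass && d.correct) tp += 1;
    else if (predictedPass && !d.correct) fp += 1;
    else if (!predictedPass && d.correct) fn += 1;
  }
  const denom = 2 * tp + fp + fn;
  return denom === 0 ? 0 : (2 * tp) / denom;
}

// Tries every observed score as a threshold; ties keep the lowest threshold.
function calibrateThreshold(data: LabeledScore[]): { threshold: number; f1: number } {
  const candidates = data.map(d => d.score).sort((a, b) => a - b);
  let best = { threshold: 1, f1: 0 };
  for (const t of candidates) {
    const f1 = f1AtThreshold(data, t);
    if (f1 > best.f1) best = { threshold: t, f1 };
  }
  return best;
}

// Made-up calibration data for illustration only.
const sample: LabeledScore[] = [
  { score: 0.91, correct: true },
  { score: 0.88, correct: true },
  { score: 0.86, correct: false },
  { score: 0.84, correct: true },
  { score: 0.70, correct: false },
  { score: 0.65, correct: false },
];
console.log(calibrateThreshold(sample)); // threshold 0.84, F1 of 6/7 (about 0.857)
```

Even at the best operating point, the incorrect answer scored 0.86 still passes, which illustrates why calibration reduces but does not remove the overlap between right and wrong scores.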

5. Reasoning Text Overweighting

Explanation: LLM judges frequently generate explanations that contradict their numerical scores. The prose reflects token-level generation patterns, while the score reflects weighted rubric alignment.
Fix: Treat judge explanations as debugging artifacts, not evidence. Parse and store the numerical score. Log the reasoning separately for post-hoc analysis.
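One way to apply this fix is to validate the judge's JSON strictly and keep the score and the explanation in separate fields. The sketch below assumes the judge was asked to return `{"score": number, "explanation": string}` with a 0-1 score; that response contract and the name `parseJudgeResponse` are assumptions for illustration.

```typescript
interface JudgeVerdict {
  score: number;            // the only field used for pass/fail decisions
  reasoning: string | null; // kept for logging and debugging, never for scoring
}

function parseJudgeResponse(raw: string): JudgeVerdict {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new Error('Judge returned non-JSON output');
  }
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Judge output is not a JSON object');
  }
  const record = parsed as Record<string, unknown>;
  const score = record['score'];
  // Reject out-of-contract scores instead of silently clamping them.
  if (typeof score !== 'number' || !isFinite(score) || score < 0 || score > 1) {
    throw new Error(`Judge score violates the 0-1 contract: ${String(score)}`);
  }
  const explanation = record['explanation'];
  return { score, reasoning: typeof explanation === 'string' ? explanation : null };
}

const verdict = parseJudgeResponse('{"score": 0.4, "explanation": "Looks fully correct."}');
console.log(verdict.score); // 0.4: the decision follows the number, not the upbeat prose
```

Failing loudly on a malformed or out-of-range score is a design choice: a judge that drifts off contract is itself a finding, and clamping would hide it.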

6. Ignoring Non-Determinism in CI

Explanation: Running a single evaluation pass and treating the result as absolute ignores generative variance. Flaky tests are often dismissed rather than investigated.
Fix: Implement strict failure tracking (xfail patterns). Document known scorer limitations. When a test unexpectedly passes, break the build to force investigation. Use multiple runs only when measuring variance, not correctness.
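The strict xfail idea can be sketched without any test framework. This minimal version assumes evaluation results have already been collected for the cases marked as expected failures; `XfailResult` and `enforceStrictXfail` are hypothetical names, not part of an existing library.

```typescript
interface XfailResult {
  id: string;
  reason: string;  // why the case is expected to fail, e.g. a known scorer blind spot
  passed: boolean; // actual outcome of the evaluation run
}

// An expected failure that passes is a surprise worth investigating.
function findUnexpectedPasses(results: XfailResult[]): XfailResult[] {
  return results.filter(r => r.passed);
}

// Throws when any expected-failure case passed, so the CI step fails.
function enforceStrictXfail(results: XfailResult[]): void {
  const surprises = findUnexpectedPasses(results);
  if (surprises.length > 0) {
    const detail = surprises
      .map(r => `${r.id} (expected failure: ${r.reason})`)
      .join('; ');
    throw new Error(`Unexpected pass in xfail cases: ${detail}`);
  }
}

// No surprises here, so this call returns normally.
enforceStrictXfail([
  { id: 'bleu-short-ref', reason: 'one-token reference vs paragraph output', passed: false },
]);
```

An unexpected pass usually means either the model or the scorer changed behavior. Breaking the build forces someone to decide which, and to update the recorded reason accordingly.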

7. Single-Reference Bottleneck

Explanation: Hardcoding one reference answer assumes the model must reproduce exact phrasing. This penalizes valid paraphrasing and inflates false negative rates.
Fix: Maintain a reference pool per test case. Use dynamic matching that selects the highest-scoring reference. Rotate references periodically to prevent overfitting.

Production Bundle

Action Checklist

Separate target and judge models: Verify architectural distinction in configuration validation.
Pluralize references: Supply 3-5 valid phrasings per test case to handle generative variance.
Calibrate thresholds: Run evaluation on a labeled dataset and derive pass/fail boundaries from its ROC or precision-recall curve.
Enforce deterministic judging: Set temperature to 0.0 and require structured JSON output for all judge calls.
Document scorer blind spots: Maintain a living spec of known failure modes (format collapse, semantic proximity limits).
Implement strict failure tracking: Use xfail patterns with reason fields to catch unexpected passes and force investigation.
Scale evaluation datasets: Move from 10-item prototypes to 200+ item production sets with multiple human raters.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Structured data (IDs, codes, enums)	Exact Match + Schema Validation	Deterministic, zero false positives, minimal compute	Low
Open-ended QA with factual claims	External Judge (different model) + Fact Extraction	Catches hallucinations, handles format variance	Medium-High
Paraphrase/Intent matching	Semantic Similarity + Rule-Based Filters	Fast, format-agnostic, but requires correctness guardrails	Low
Translation/Summarization	BLEU/ROUGE with matched-length references	Mathematically sound for parallel corpora	Low
High-stakes reasoning (medical, legal)	Multi-Judge Panel + Human-in-the-Loop Audit	Eliminates single-model bias, provides audit trail	High

Configuration Template

evaluation:
  target_model: "llama3.2-8b"
  judge_model: "qwen2.5-72b-instruct"
  semantic_model: "nomic-embed-text"
  
  thresholds:
    calibrated: true
    calibration_dataset: "evals/v2/ground_truth.json"
    operating_point: "max_f1"
    
  scorers:
    structured:
      type: "exact_match"
      normalize: true
    prose:
      type: "semantic_similarity"
      threshold: 0.82
      fallback: "judge"
    reasoning:
      type: "external_judge"
      rubric: "evals/rubrics/factual_correctness_v3.md"
      temperature: 0.0
      response_format: "json"
      
  ci:
    strict_xfail: true
    max_retries: 3
    variance_alert: true

Quick Start Guide

Initialize the harness: Create a TypeScript project with tsconfig.json configured for strict mode. Install dependencies: npm i zod openai @types/node.
Define test cases: Structure your evaluation dataset as JSON arrays containing prompt, references (multiple valid answers), expectedType, and threshold.
Configure model routing: Set target_model and judge_model to distinct architectures. Validate separation in your orchestrator constructor.
Run calibration: Execute the harness against a labeled ground-truth set. Extract precision/recall curves and lock the operating threshold.
Integrate with CI: Add the evaluation script to your pipeline. Enable strict_xfail to break builds on unexpected passes. Monitor variance alerts for threshold drift.
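For the "Define test cases" step, a small runtime check can catch malformed dataset entries before any model is called. The sketch below mirrors the `TestCase` shape used earlier; the validator names `isTestCase` and `loadTestCases`, and the sample entry, are assumptions for illustration (a schema library such as zod could serve the same purpose).

```typescript
type QuestionType = 'structured' | 'prose' | 'reasoning';

interface TestCase {
  id: string;
  prompt: string;
  references: string[];
  expectedType: QuestionType;
  threshold: number;
}

// Runtime type guard: checks field types, a non-empty reference pool, and a 0-1 threshold.
function isTestCase(value: unknown): value is TestCase {
  if (typeof value !== 'object' || value === null) return false;
  const { id, prompt, references, expectedType, threshold } = value as Record<string, unknown>;
  const allowedTypes = ['structured', 'prose', 'reasoning'];
  return (
    typeof id === 'string' &&
    typeof prompt === 'string' &&
    Array.isArray(references) &&
    references.length > 0 &&
    references.every(r => typeof r === 'string') &&
    typeof expectedType === 'string' &&
    allowedTypes.indexOf(expectedType) !== -1 &&
    typeof threshold === 'number' &&
    threshold >= 0 &&
    threshold <= 1
  );
}

// Parses a JSON array of test cases and rejects the whole file on the first bad entry.
function loadTestCases(json: string): TestCase[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) throw new Error('Dataset must be a JSON array');
  return parsed.map((entry, index) => {
    if (!isTestCase(entry)) throw new Error(`Invalid test case at index ${index}`);
    return entry;
  });
}

const cases = loadTestCases(JSON.stringify([
  {
    id: 'capital-fr',
    prompt: 'What is the capital of France?',
    references: ['Paris', 'The capital of France is Paris.'],
    expectedType: 'prose',
    threshold: 0.82,
  },
]));
console.log(cases.length); // 1
```

Rejecting entries with an empty reference pool at load time avoids a class of silent failures where a scorer has nothing to compare against.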
