discipline. The goal is to eliminate variance sources that confound judge scoring while preserving statistical validity. The implementation below outlines a TypeScript framework that enforces the measurement standard; tokenization and judge API calls are left as placeholders and marked as such in the comments.
Step 1: Fixed Budget & Pool Configuration
Unbounded context windows and variable answer lengths introduce noise that judges cannot reliably disentangle from retrieval quality. The pipeline enforces strict caps at the data ingestion layer.
interface EvaluationBudget {
  maxCandidatePool: number;
  maxEvidenceChunks: number;
  maxAnswerTokens: number;
  fixedGeneratorModel: string;
  judgeModels: string[]; // matches the judgeModels array in the configuration template
  temperature: number;
}
class BudgetEnforcer {
  private config: EvaluationBudget;
  constructor(config: EvaluationBudget) {
    this.config = config;
  }
  validateAndTrim(candidatePool: string[], evidenceChunks: string[], generatedAnswer: string) {
    // An oversized pool is a protocol violation, so fail instead of trimming
    if (candidatePool.length > this.config.maxCandidatePool) {
      throw new Error(`Candidate pool exceeds fixed limit of ${this.config.maxCandidatePool}`);
    }
    // Trim a copy so the caller's array is not mutated
    if (evidenceChunks.length > this.config.maxEvidenceChunks) {
      evidenceChunks = evidenceChunks.slice(0, this.config.maxEvidenceChunks);
    }
    if (this.countTokens(generatedAnswer) > this.config.maxAnswerTokens) {
      throw new Error(`Answer exceeds token cap. Truncation violates evaluation protocol.`);
    }
    return { candidatePool, evidenceChunks, generatedAnswer };
  }
  private countTokens(text: string): number {
    // Whitespace splitting is a rough stand-in.
    // Production: use tiktoken or model-specific tokenizer
    const trimmed = text.trim();
    return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  }
}
Why this choice: Fixed pools prevent retrieval systems from gaming the evaluation by expanding context windows. Token caps eliminate verbosity bias, which LLM judges consistently reward. Hard limits force architectural trade-offs to surface during development rather than evaluation.
Step 2: Cluster-Aware Statistical Engine
Standard tests assume independence. Cluster-aware inference models the dependency structure using permutation testing with cluster-level sign flips.
interface ClusteredSample {
  id: string;
  clusterId: string;
  scoreA: number;
  scoreB: number;
}
class ClusterAwareValidator {
  async computeSignificance(
    samples: ClusteredSample[],
    permutations: number = 10000,
    numHypotheses: number = 1 // number of hypotheses tested together
  ): Promise<{ pValue: number; significant: boolean; alpha: number }> {
    if (samples.length === 0) {
      throw new Error("No samples provided");
    }
    const observedDiff = this.meanDifference(samples);
    const clusterIds = [...new Set(samples.map(s => s.clusterId))];
    let extremeCount = 0;
    for (let i = 0; i < permutations; i++) {
      // Draw one flip decision per cluster so that correlated samples move together
      const flips = new Map<string, boolean>();
      for (const clusterId of clusterIds) {
        flips.set(clusterId, Math.random() < 0.5);
      }
      const permutedSamples = samples.map(sample =>
        flips.get(sample.clusterId)
          ? { ...sample, scoreA: sample.scoreB, scoreB: sample.scoreA }
          : sample
      );
      const permDiff = this.meanDifference(permutedSamples);
      if (Math.abs(permDiff) >= Math.abs(observedDiff)) {
        extremeCount++;
      }
    }
    // Add-one estimator keeps the Monte Carlo p-value above zero
    const pValue = (extremeCount + 1) / (permutations + 1);
    // Bonferroni divides alpha by the number of hypotheses, not the number of clusters
    const alpha = 0.05 / numHypotheses;
    return {
      pValue,
      significant: pValue < alpha,
      alpha
    };
  }
  private meanDifference(samples: ClusteredSample[]): number {
    const diffs = samples.map(s => s.scoreA - s.scoreB);
    return diffs.reduce((a, b) => a + b, 0) / diffs.length;
  }
}
Why this choice: Cluster sign-flip permutation tests preserve the dependency structure while generating a valid null distribution, provided every sample in a cluster is flipped together. Bonferroni correction accounts for multiple hypothesis testing across the hypotheses under test. This approach is computationally heavier than parametric tests but avoids the Type I error inflation that plagues naive evaluations.
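Because `Math.random` cannot be seeded, a permutation p-value computed with it will vary slightly between runs. The sketch below is one possible way to make the cluster sign-flip test reproducible. It assumes a small mulberry32-style generator and hypothetical names (`PairedSample`, `seededRandom`, `clusterSignFlipPValue`) that are not part of the classes above. It sums the paired differences once per cluster, so a flip negates a whole cluster at a time.

```typescript
// Hypothetical helper names; a sketch, not the article's own implementation.
type PairedSample = { clusterId: string; scoreA: number; scoreB: number };

// Small seedable PRNG (mulberry32-style) returning floats in [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function clusterSignFlipPValue(
  samples: PairedSample[],
  permutations: number,
  seed: number
): number {
  if (samples.length === 0) throw new Error("No samples provided");
  // Sum paired differences per cluster; flipping a cluster negates its sum.
  const clusterSums = new Map<string, number>();
  for (const s of samples) {
    const previous = clusterSums.get(s.clusterId) ?? 0;
    clusterSums.set(s.clusterId, previous + (s.scoreA - s.scoreB));
  }
  const sums = Array.from(clusterSums.values());
  // Comparing sums is equivalent to comparing means (same sample count).
  const observed = Math.abs(sums.reduce((a, b) => a + b, 0));
  const random = seededRandom(seed);
  let extreme = 0;
  for (let i = 0; i < permutations; i++) {
    let total = 0;
    for (const sum of sums) total += random() < 0.5 ? -sum : sum;
    if (Math.abs(total) >= observed) extreme++;
  }
  // Add-one estimator keeps the p-value strictly positive.
  return (extreme + 1) / (permutations + 1);
}
```

With a single cluster every permutation is as extreme as the observed result, so the function returns 1. That is the intended behavior: one cluster carries no evidence on its own, however many samples it contains.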
Step 3: Dual-Judge Replication Protocol
Single-model judges introduce systematic bias. Replication across independent models isolates judge-specific artifacts from genuine performance differences.
interface JudgeResponse {
  model: string;
  score: number;
  reasoning: string;
  confidence: number;
}
class DualJudgeArbiter {
  async evaluate(
    prompt: string,
    candidateA: string,
    candidateB: string,
    judgeModels: string[]
  ): Promise<JudgeResponse[]> {
    // Replication needs at least two judges; one judge gives no disagreement signal
    if (judgeModels.length < 2) {
      throw new Error("At least two judge models are required");
    }
    const results: JudgeResponse[] = [];
    for (const model of judgeModels) {
      const response = await this.queryJudge(model, prompt, candidateA, candidateB);
      results.push(response);
    }
    const consensus = this.computeConsensus(results);
    if (consensus.disagreement > 0.3) {
      console.warn(`Judge disagreement threshold exceeded: ${consensus.disagreement}`);
    }
    return results;
  }
  private async queryJudge(
    model: string,
    prompt: string,
    a: string,
    b: string
  ): Promise<JudgeResponse> {
    // Production: route to inference API with deterministic seed
    return {
      model,
      score: Math.random() * 10, // Placeholder for actual API call
      reasoning: "Evaluation rationale",
      confidence: 0.85
    };
  }
  private computeConsensus(responses: JudgeResponse[]) {
    const scores = responses.map(r => r.score);
    const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
    const variance = scores.reduce((acc, val) => acc + Math.pow(val - mean, 2), 0) / scores.length;
    // Disagreement is the coefficient of variation; guard against a zero mean
    const disagreement = mean === 0 ? 0 : Math.sqrt(variance) / Math.abs(mean);
    return { mean, variance, disagreement };
  }
}
Why this choice: Cross-model replication catches prompt-sensitive biases and model-specific verbosity preferences. Disagreement thresholds trigger manual review or rubric refinement. This layer transforms evaluation from a black-box scoring exercise into a reproducible measurement system.
Step 4: Pre-Registered Hypothesis Validation
Post-hoc metric selection is a primary source of evaluation bias. The pipeline enforces hypothesis registration before execution.
interface RegisteredHypothesis {
  id: string;
  primaryMetric: string;
  threshold: number;
  correctionMethod: 'bonferroni' | 'holm' | 'fdr';
  clusterAware: boolean;
  registeredAt: string;
}
class HypothesisRegistry {
  private registry: Map<string, RegisteredHypothesis> = new Map();
  register(hypothesis: RegisteredHypothesis): void {
    if (this.registry.has(hypothesis.id)) {
      throw new Error(`Hypothesis ${hypothesis.id} already registered`);
    }
    this.registry.set(hypothesis.id, hypothesis);
  }
  validateExecution(metricValue: number, hypothesisId: string): boolean {
    const hypothesis = this.registry.get(hypothesisId);
    if (!hypothesis) throw new Error(`Unregistered hypothesis execution`);
    return metricValue >= hypothesis.threshold;
  }
}
Why this choice: Pre-registration eliminates p-hacking and metric switching. It forces teams to define success criteria before observing results, aligning evaluation with scientific rigor rather than retrospective justification.
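The `correctionMethod` field accepts `'holm'`, but the steps above only show a Bonferroni-style adjustment. As an illustration, here is a minimal sketch of the Holm step-down procedure. The function name `holmReject` is hypothetical, and how it would be wired into the registry is left open.

```typescript
// Hypothetical helper: Holm step-down correction over a set of p-values.
// Returns, in input order, whether each hypothesis is rejected.
function holmReject(pValues: number[], alpha: number = 0.05): boolean[] {
  const m = pValues.length;
  // Sort p-values ascending while remembering their original positions.
  const order = pValues
    .map((p, index) => ({ p, index }))
    .sort((a, b) => a.p - b.p);
  const rejected: boolean[] = new Array(m).fill(false);
  for (let k = 0; k < m; k++) {
    // The k-th smallest p-value (0-based) is compared against alpha / (m - k).
    if (order[k].p <= alpha / (m - k)) {
      rejected[order[k].index] = true;
    } else {
      break; // step-down: stop at the first hypothesis that fails
    }
  }
  return rejected;
}
```

For p-values `[0.01, 0.02, 0.04]` at alpha 0.05, the thresholds are 0.05/3, 0.05/2, and 0.05, so all three hypotheses are rejected. Plain Bonferroni, which compares every p-value against 0.05/3, would reject only the first. Holm controls the same family-wise error rate with at least as much power.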
Pitfall Guide
1. Ignoring Intra-Cluster Correlation
Explanation: Standard t-tests and binomial tests assume sample independence. RAG benchmarks contain overlapping documents, shared retrieval paths, and domain clusters. Treating these as independent inflates significance by 30-60%.
Fix: Implement cluster-aware permutation tests or mixed-effects models. Always report intra-cluster correlation coefficients (ICC) alongside p-values.
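To make the ICC reporting step concrete, the sketch below estimates a one-way random-effects ICC from the ANOVA mean squares, using the usual adjusted cluster size for unbalanced clusters. The function name and input shape are assumptions made for illustration. The `value` field could hold, for example, a per-sample score difference.

```typescript
// Hypothetical helper: one-way ANOVA estimate of the intra-cluster correlation.
function intraClusterCorrelation(
  values: { clusterId: string; value: number }[]
): number {
  // Group observations by cluster.
  const groups = new Map<string, number[]>();
  for (const v of values) {
    const group = groups.get(v.clusterId) ?? [];
    group.push(v.value);
    groups.set(v.clusterId, group);
  }
  const k = groups.size; // number of clusters
  const n = values.length; // total observations
  if (k < 2 || n <= k) {
    throw new Error("Need at least two clusters and more samples than clusters");
  }
  const grandMean = values.reduce((acc, v) => acc + v.value, 0) / n;
  let ssBetween = 0;
  let ssWithin = 0;
  let sumSquaredSizes = 0;
  for (const group of Array.from(groups.values())) {
    const mean = group.reduce((a, b) => a + b, 0) / group.length;
    ssBetween += group.length * (mean - grandMean) ** 2;
    for (const x of group) ssWithin += (x - mean) ** 2;
    sumSquaredSizes += group.length ** 2;
  }
  const msBetween = ssBetween / (k - 1);
  const msWithin = ssWithin / (n - k);
  // Adjusted average cluster size; equals the common size when balanced.
  const n0 = (n - sumSquaredSizes / n) / (k - 1);
  const denominator = msBetween + (n0 - 1) * msWithin;
  return denominator === 0 ? 0 : (msBetween - msWithin) / denominator;
}
```

An ICC near 1 means samples within a cluster are nearly interchangeable, so the effective sample size is close to the number of clusters rather than the number of samples. Negative estimates can occur when the within-cluster spread exceeds the between-cluster spread.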
2. Unbounded Answer Length
Explanation: LLM judges consistently reward verbose outputs, even when verbosity adds no factual value. Systems that generate longer answers artificially inflate scores without improving retrieval or reasoning quality.
Fix: Enforce strict token caps. Penalize verbosity in judge prompts. Use length-normalized scoring when comparing systems with different generation strategies.
3. Single-Judge Dependency
Explanation: Each LLM judge has distinct biases: some prefer structured formatting, others favor lexical overlap, and many exhibit temperature-sensitive scoring variance. Relying on one model locks evaluation to its idiosyncrasies.
Fix: Deploy dual or triple judge replication. Require consensus thresholds. Rotate judge models across evaluation cycles to detect systematic drift.
4. Prompt Drift Across Runs
Explanation: Minor changes to judge instructions, system prompts, or temperature settings alter scoring behavior. Teams often tweak prompts iteratively, invalidating longitudinal comparisons.
Fix: Hash and version all prompts. Store prompt configurations alongside evaluation results. Use deterministic seeds for reproducibility. Never modify prompts post-hoc.
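One way to implement prompt hashing is sketched below, assuming a Node.js runtime and a hypothetical `PromptConfig` shape (the field names are illustrative, not taken from the pipeline above). The fields are serialized in a fixed order so that object key ordering cannot change the hash.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape for everything that influences judge behavior.
interface PromptConfig {
  systemPrompt: string;
  judgeInstructions: string;
  temperature: number;
  seed: number;
}

// Returns a hex SHA-256 digest that identifies one exact prompt configuration.
function hashPromptConfig(config: PromptConfig): string {
  // Fixed field order makes the serialization canonical.
  const canonical = JSON.stringify([
    config.systemPrompt,
    config.judgeInstructions,
    config.temperature,
    config.seed,
  ]);
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}
```

Storing this digest next to each evaluation result makes it cheap to detect that two runs used different judge settings before their scores are compared.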
5. Lexical Overlap Confounding
Explanation: Judges frequently reward surface-level keyword matching over genuine multi-hop reasoning. Semantic retrieval systems that paraphrase effectively may score lower than lexical systems that preserve exact terminology.
Fix: Explicitly instruct judges to penalize lexical matching without reasoning. Use semantic similarity filters to isolate overlap bias. Include reasoning rubrics that require hop-by-hop justification.
6. Post-Hoc Hypothesis Tuning
Explanation: Adjusting success thresholds, switching metrics, or excluding outliers after seeing results creates false confidence. This practice is pervasive in rapid prototyping cycles.
Fix: Pre-register all hypotheses, metrics, and correction methods. Use immutable evaluation manifests. Treat post-hoc adjustments as separate experimental phases.
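An immutable manifest can be approximated at runtime by recursively freezing the configuration object once it is registered. The sketch below uses a hypothetical `deepFreeze` helper; it only guards against in-process mutation and is no substitute for storing the manifest in version control or an append-only log.

```typescript
// Hypothetical helper: recursively freeze a plain configuration object.
function deepFreeze<T>(value: T): Readonly<T> {
  // Skipping already-frozen objects also stops recursion on cyclic references.
  if (value !== null && typeof value === "object" && !Object.isFrozen(value)) {
    Object.freeze(value);
    for (const key of Object.keys(value)) {
      deepFreeze((value as Record<string, unknown>)[key]);
    }
  }
  return value;
}

// Example values for illustration only.
const frozenManifest = deepFreeze({
  hypotheses: [{ id: "H1", primaryMetric: "precision", threshold: 0.72 }],
  statistics: { correction: "bonferroni", alpha: 0.05 },
});
```

After freezing, an attempt to change `frozenManifest.hypotheses[0].threshold` throws a `TypeError` in strict-mode code and is silently ignored otherwise, so the registered threshold cannot drift during a run.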
7. Ignoring Retrieval Budget Constraints
Explanation: Unlimited context windows mask retrieval inefficiencies. Systems that retrieve 50 chunks may score well simply because the answer exists somewhere in the noise, not because the retrieval strategy is effective.
Fix: Cap evidence chunks. Fix top-k candidate pools. Evaluate retrieval precision at strict budget boundaries. Report recall@k alongside generation scores.
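For reference, precision and recall at a fixed budget k can be computed as in the sketch below. The function names are illustrative, and the code assumes that `retrieved` is a ranked list of unique chunk IDs and that `relevant` holds the gold evidence IDs.

```typescript
// Fraction of the top-k retrieved chunks that are relevant.
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) throw new Error("k must be positive");
  const hits = retrieved.slice(0, k).filter(id => relevant.has(id)).length;
  return hits / k;
}

// Fraction of all relevant chunks that appear in the top k.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) throw new Error("k must be positive");
  if (relevant.size === 0) return 0; // nothing to recall
  const hits = retrieved.slice(0, k).filter(id => relevant.has(id)).length;
  return hits / relevant.size;
}
```

Reporting both at the same k as the evidence cap shows whether a system finds the gold evidence within budget, rather than somewhere in a long tail of retrieved chunks.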
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Early prototype validation | Naive binomial test + single judge | Speed prioritized over rigor; acceptable for internal iteration | Low |
| Production model selection | Cluster-aware permutation + dual judge | Eliminates false positives; ensures statistical validity | Medium (+15% compute) |
| Cross-domain benchmarking | Hybrid retrieval + fixed budget + Bonferroni correction | Controls for terminology density and cluster variance | High (requires larger pool) |
| Regulatory/compliance evaluation | Pre-registered hypotheses + triple judge + sign-flip validation | Audit trail required; zero tolerance for statistical artifacts | Very High |
| Real-time A/B testing | Cluster-robust standard errors + streaming judge aggregation | Balances latency with dependency awareness | Medium |
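The cluster-robust standard errors mentioned in the last row can be illustrated for the simplest case, the mean paired score difference. The sketch below sums residuals within each cluster before squaring and applies a G/(G-1) small-sample factor, which is one common convention among several. The function name and input shape are assumptions.

```typescript
// Hypothetical helper: cluster-robust standard error of a mean difference.
function clusterRobustMeanSE(
  samples: { clusterId: string; diff: number }[]
): { mean: number; standardError: number } {
  const n = samples.length;
  if (n === 0) throw new Error("No samples provided");
  const mean = samples.reduce((acc, s) => acc + s.diff, 0) / n;
  // Sum residuals within each cluster so correlated errors do not cancel
  // the way they would under an independence assumption.
  const residualSums = new Map<string, number>();
  for (const s of samples) {
    const previous = residualSums.get(s.clusterId) ?? 0;
    residualSums.set(s.clusterId, previous + (s.diff - mean));
  }
  const g = residualSums.size;
  if (g < 2) throw new Error("Need at least two clusters");
  let sumSquares = 0;
  for (const r of Array.from(residualSums.values())) sumSquares += r * r;
  // G / (G - 1) is a small-sample correction (an assumed convention).
  const variance = ((g / (g - 1)) * sumSquares) / (n * n);
  return { mean, standardError: Math.sqrt(variance) };
}
```

For two clusters with differences `[1, 1]` and `[3, 3]`, this yields a standard error of 1, compared with about 0.58 from the naive formula that treats all four samples as independent. Unlike a permutation test, the estimate needs only a single pass over the data, which is why it suits streaming settings.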
Configuration Template
// evaluation-manifest.config.ts
export const EvaluationManifest = {
  budget: {
    maxCandidatePool: 100,
    maxEvidenceChunks: 8,
    maxAnswerTokens: 256,
    fixedGeneratorModel: "llama-3.1-70b-instruct",
    judgeModels: ["gpt-4o-mini", "claude-3.5-haiku"],
    temperature: 0.0
  },
  statistics: {
    method: "cluster_aware_permutation",
    permutations: 10000,
    correction: "bonferroni",
    alpha: 0.05,
    clusterAware: true
  },
  hypotheses: [
    {
      id: "H1-CSML-Hybrid",
      primaryMetric: "cluster_adjusted_precision",
      threshold: 0.72,
      // correctionMethod and clusterAware mirror the statistics block so each
      // entry satisfies the RegisteredHypothesis interface
      correctionMethod: "bonferroni",
      clusterAware: true,
      registeredAt: "2024-11-15T08:00:00Z"
    },
    {
      id: "H2-Materials-BM25",
      primaryMetric: "cluster_adjusted_precision",
      threshold: 0.68,
      correctionMethod: "bonferroni",
      clusterAware: true,
      registeredAt: "2024-11-15T08:00:00Z"
    }
  ],
  validation: {
    requirePreRegistration: true,
    enforceTokenCaps: true,
    judgeDisagreementThreshold: 0.3,
    promptVersioning: true
  }
};
Quick Start Guide
- Initialize the evaluation manifest: Copy the configuration template and adjust budget limits, judge models, and hypothesis thresholds to match your domain requirements.
- Register hypotheses: Use the `HypothesisRegistry` class to log all primary metrics and success criteria before running any evaluations. This step is mandatory for statistical validity.
- Execute the pipeline: Instantiate `BudgetEnforcer`, `ClusterAwareValidator`, and `DualJudgeArbiter` with your manifest. Run evaluations against your fixed candidate pool and evidence budget.
- Validate results: Compare observed p-values against Bonferroni-adjusted alpha thresholds. Flag any evaluations exceeding the judge disagreement threshold for manual review.
- Archive and version: Store prompt hashes, judge responses, and statistical outputs in an immutable evaluation log. Use this archive for longitudinal tracking and audit compliance.
Cluster-aware evaluation is not an academic exercise. It is a production requirement for multi-hop RAG systems where retrieval quality, reasoning composition, and statistical validity must be disentangled. Teams that adopt fixed-budget protocols, pre-registered hypotheses, and cluster-aware inference will eliminate false positives, reduce compute waste, and make architecture decisions grounded in reproducible evidence. The measurement standard shifts the focus from optimizing for judge heuristics to building genuinely robust retrieval and composition pipelines.