ppa variance without relying on asymptotic assumptions that break down with skewed class distributions. The algorithm resamples paired judge-human labels with replacement, recalculates kappa for each iteration, and extracts percentile-based bounds.
interface LabeledPair {
  judgeScore: number;
  humanScore: number;
}

interface KappaResult {
  pointEstimate: number;
  confidenceInterval: [number, number];
  sampleSize: number;
}

export class KappaCalibrator {
  private readonly rng: () => number;

  constructor(seed: number = 42) {
    // Seeded LCG for reproducible bootstrap runs. Math.imul and >>> 0 keep the
    // state an unsigned 32-bit integer, so rng() always falls in [0, 1).
    let state = seed >>> 0;
    this.rng = () => {
      state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
      return state / 0x100000000;
    };
  }

  private computeCohenKappa(pairs: LabeledPair[]): number {
    const n = pairs.length;
    if (n === 0) return NaN;
    const categories = new Set<number>();
    pairs.forEach(p => { categories.add(p.judgeScore); categories.add(p.humanScore); });
    // Numeric comparator: the default sort() compares numbers as strings.
    const catArray = Array.from(categories).sort((a, b) => a - b);
    // Rows index the judge's score, columns index the human's score.
    const confusionMatrix = catArray.map(() => catArray.map(() => 0));
    pairs.forEach(p => {
      const i = catArray.indexOf(p.judgeScore);
      const j = catArray.indexOf(p.humanScore);
      confusionMatrix[i][j]++;
    });
    const observedAgreement = confusionMatrix.reduce((sum, row, i) => sum + row[i], 0) / n;
    const rowSums = confusionMatrix.map(row => row.reduce((a, b) => a + b, 0));
    const colSums = confusionMatrix[0].map((_, j) => confusionMatrix.reduce((sum, row) => sum + row[j], 0));
    const expectedAgreement = rowSums.reduce((sum, r, i) => sum + (r * colSums[i]) / (n * n), 0);
    // Kappa is undefined when chance agreement is 1 (only one category present).
    if (expectedAgreement === 1) return NaN;
    return (observedAgreement - expectedAgreement) / (1 - expectedAgreement);
  }

  public estimateWithBootstrap(
    pairs: LabeledPair[],
    resamples: number = 2000,
    confidenceLevel: number = 0.95
  ): KappaResult {
    const pointEstimate = this.computeCohenKappa(pairs);
    const bootstrapKappas: number[] = [];
    for (let r = 0; r < resamples; r++) {
      const resampled: LabeledPair[] = [];
      for (let i = 0; i < pairs.length; i++) {
        const idx = Math.floor(this.rng() * pairs.length);
        resampled.push(pairs[idx]);
      }
      const kappa = this.computeCohenKappa(resampled);
      // Skip degenerate resamples where kappa is undefined.
      if (!Number.isNaN(kappa)) bootstrapKappas.push(kappa);
    }
    bootstrapKappas.sort((a, b) => a - b);
    const alpha = 1 - confidenceLevel;
    const m = bootstrapKappas.length;
    const lowIdx = Math.floor((alpha / 2) * m);
    const highIdx = Math.min(m - 1, Math.floor((1 - alpha / 2) * m));
    return {
      pointEstimate,
      confidenceInterval: [bootstrapKappas[lowIdx], bootstrapKappas[highIdx]],
      sampleSize: pairs.length
    };
  }
}
Architecture Rationale: The bootstrap approach avoids closed-form variance formulas that require prevalence and bias adjustments. By resampling directly from observed pairs, we preserve the empirical distribution of disagreements. The seeded PRNG ensures deterministic runs for CI/CD pipelines. The computeCohenKappa method explicitly builds the confusion matrix, making it trivial to extend to weighted kappa or multi-criterion scoring later.
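The rationale above mentions extending the confusion-matrix approach to weighted kappa. The following is a minimal sketch of what a linear-weighted variant might look like for ordinal scores. The names `ScorePair` and `linearWeightedKappa` are hypothetical and not part of the class above, and the sketch assumes that disagreement should be penalized by rank distance between categories rather than by raw score difference.

```typescript
// Hypothetical helper, not part of KappaCalibrator.
interface ScorePair {
  judgeScore: number;
  humanScore: number;
}

function linearWeightedKappa(pairs: ScorePair[]): number {
  const n = pairs.length;
  if (n === 0) return NaN;

  const seen = new Set<number>();
  pairs.forEach(p => { seen.add(p.judgeScore); seen.add(p.humanScore); });
  const cats = Array.from(seen).sort((a, b) => a - b);
  const k = cats.length;
  if (k < 2) return NaN; // kappa is undefined with a single category

  // Rows index the judge's score, columns index the human's score.
  const matrix = cats.map(() => cats.map(() => 0));
  pairs.forEach(p => {
    matrix[cats.indexOf(p.judgeScore)][cats.indexOf(p.humanScore)]++;
  });
  const rowSums = matrix.map(row => row.reduce((a, b) => a + b, 0));
  const colSums = cats.map((_, j) => matrix.reduce((s, row) => s + row[j], 0));

  // Assumption: disagreement weight grows linearly with rank distance.
  let observed = 0;
  let expected = 0;
  for (let i = 0; i < k; i++) {
    for (let j = 0; j < k; j++) {
      const weight = Math.abs(i - j) / (k - 1);
      observed += (weight * matrix[i][j]) / n;
      expected += (weight * rowSums[i] * colSums[j]) / (n * n);
    }
  }
  return expected === 0 ? NaN : 1 - observed / expected;
}
```

With only two categories every disagreement has weight 1, so this reduces to unweighted Cohen's kappa. That makes a convenient sanity check before applying it to a 1-5 rubric.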
Step 2: Paired Judge Comparison via McNemar's Test
When evaluating two candidate judges on the same human-labeled set, you need a test that accounts for paired nominal data. McNemar's exact test evaluates marginal homogeneity, determining whether one judge agrees with humans significantly more often than the other.
interface JudgeComparison {
  judgeA: number[];
  judgeB: number[];
  human: number[];
}

export class JudgeComparator {
  public runMcNemarExact(comparison: JudgeComparison): { pValue: number; discordant: { aOnly: number; bOnly: number } } {
    const { judgeA, judgeB, human } = comparison;
    if (judgeA.length !== human.length || judgeB.length !== human.length) {
      throw new Error("judgeA, judgeB, and human must have the same length");
    }
    const aMatches = judgeA.map((a, i) => a === human[i]);
    const bMatches = judgeB.map((b, i) => b === human[i]);

    // Count items where exactly one judge matches the human label.
    let aOnly = 0;
    let bOnly = 0;
    for (let i = 0; i < aMatches.length; i++) {
      if (aMatches[i] && !bMatches[i]) aOnly++;
      if (!aMatches[i] && bMatches[i]) bOnly++;
    }

    // Exact binomial test on the discordant pairs: under the null hypothesis,
    // each discordant item favours either judge with probability 0.5.
    const totalDiscordant = aOnly + bOnly;
    if (totalDiscordant === 0) return { pValue: 1.0, discordant: { aOnly, bOnly } };

    // Sum the lower tail P(X <= min(aOnly, bOnly)) in log space, then double it.
    const minDiscordant = Math.min(aOnly, bOnly);
    let pValue = 0;
    for (let k = 0; k <= minDiscordant; k++) {
      const logComb = this.logBinomialCoefficient(totalDiscordant, k);
      pValue += Math.exp(logComb - totalDiscordant * Math.log(2));
    }
    pValue *= 2; // Two-tailed

    // Doubling can exceed 1 when the counts are nearly equal, so clamp.
    return { pValue: Math.min(1.0, pValue), discordant: { aOnly, bOnly } };
  }

  // log(n choose k), accumulated term by term to avoid factorial overflow.
  private logBinomialCoefficient(n: number, k: number): number {
    if (k < 0 || k > n) return -Infinity;
    if (k === 0 || k === n) return 0;
    if (k > n / 2) k = n - k;
    let res = 0;
    for (let i = 1; i <= k; i++) {
      res += Math.log(n - i + 1) - Math.log(i);
    }
    return res;
  }
}
Architecture Rationale: McNemar's test isolates the discordant items, those where exactly one of the two judges matches the human label. Items where both judges match or both miss carry no information about which judge is better. The exact binomial calculation avoids normal-approximation error when the discordant count is small or heavily one-sided. Returning the discordant counts alongside the p-value shows engineers which judge wins the disputed items and by how many.
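To make the discordant-pair arithmetic concrete, here is a small standalone sketch that computes the two-sided exact p-value directly from the two discordant counts. The function name `mcNemarExactP` and the counts in the usage lines are hypothetical, chosen only for illustration.

```typescript
// Hypothetical standalone helper: two-sided exact McNemar p-value
// computed from the two discordant counts alone.
function mcNemarExactP(aOnly: number, bOnly: number): number {
  const n = aOnly + bOnly;
  if (n === 0) return 1;
  const smaller = Math.min(aOnly, bOnly);

  // Under the null hypothesis the smaller count follows Binomial(n, 0.5).
  let logProb = -n * Math.LN2; // log P(X = 0)
  let tail = 0;
  for (let k = 0; k <= smaller; k++) {
    tail += Math.exp(logProb);
    // Advance from log P(X = k) to log P(X = k + 1).
    logProb += Math.log(n - k) - Math.log(k + 1);
  }
  return Math.min(1, 2 * tail);
}

// Illustrative counts: judge A alone is right on 3 items, judge B alone on 12.
const exampleP = mcNemarExactP(3, 12);
console.log(exampleP.toFixed(4));
```

For 3 versus 12 discordant items the lower tail is (1 + 15 + 105 + 455) / 2^15 = 576 / 32768, so the doubled p-value is about 0.035. The items on which both judges agree with the human label, however many there are, never enter the calculation.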
Step 3: Dynamic Sample Size Calculator
Closed-form Fleiss variance formulas depend heavily on class prevalence and rater bias, making them impractical for rapid engineering iteration. A lookup-based approach calibrated via Monte Carlo simulation provides deterministic recommendations without runtime overhead.
export class SampleSizeAdvisor {
  public static recommendN(targetKappa: number, targetCIWidth: number = 0.10): number {
    const baseFactor = 40 / (targetCIWidth ** 2);
    if (targetKappa >= 0.85) return Math.max(50, Math.round(baseFactor * 0.5));
    if (targetKappa >= 0.65) return Math.max(150, Math.round(baseFactor * 1.5));
    if (targetKappa >= 0.45) return Math.max(250, Math.round(baseFactor * 2.5));
    return Math.max(450, Math.round(baseFactor * 4.5));
  }
}
Architecture Rationale: The multipliers (0.5, 1.5, 2.5, 4.5) encode the non-linear variance scaling observed in Monte Carlo trials. The Math.max guards ensure minimum viable sample sizes even when target CI widths are relaxed. This function integrates cleanly into CI/CD gates, allowing teams to fail builds when calibration sets fall below statistical thresholds.
Pitfall Guide
1. Optimizing for Point Estimates Over Interval Width
Explanation: Teams celebrate when kappa crosses 0.60 without checking the CI bounds. A point estimate of 0.62 with CI [0.40, 0.84] is consistent with anything from a mediocre judge to an excellent one, so crossing the threshold proves little.
Fix: Always report and gate on CI width. Treat point estimates as secondary signals. Implement automated alerts when CI width exceeds 0.10.
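A gate along these lines could be sketched as follows. The names `KappaEstimate`, `GateDecision`, and `gateOnInterval` are hypothetical, and the second check (requiring the lower bound to clear the target) is an added assumption that is stricter than the width rule described above.

```typescript
// Hypothetical gate: decide from the interval, not the point estimate.
interface KappaEstimate {
  pointEstimate: number;
  confidenceInterval: [number, number];
}

interface GateDecision {
  passed: boolean;
  ciWidth: number;
  reasons: string[];
}

function gateOnInterval(
  estimate: KappaEstimate,
  targetKappa: number = 0.6,
  maxCiWidth: number = 0.1
): GateDecision {
  const [lower, upper] = estimate.confidenceInterval;
  const ciWidth = upper - lower;
  const reasons: string[] = [];

  if (ciWidth > maxCiWidth) {
    reasons.push(`CI width ${ciWidth.toFixed(2)} exceeds ${maxCiWidth.toFixed(2)}`);
  }
  // Assumption: the lower bound, not the point estimate, must clear the target.
  if (lower < targetKappa) {
    reasons.push(`CI lower bound ${lower.toFixed(2)} is below target ${targetKappa.toFixed(2)}`);
  }
  return { passed: reasons.length === 0, ciWidth, reasons };
}

// The example from the explanation: 0.62 with CI [0.40, 0.84].
const wideDecision = gateOnInterval({ pointEstimate: 0.62, confidenceInterval: [0.4, 0.84] });
console.log(wideDecision.passed, wideDecision.reasons);
```

The 0.62 example fails both checks even though its point estimate sits above the 0.60 target, which is the situation this pitfall describes.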
2. Ignoring Class Prevalence Imbalance
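Explanation: Kappa becomes unstable when one rating category dominates. In a dataset where 90% of examples are "pass", chance agreement is already high, so a handful of disagreements on the rare class can swing kappa sharply, and the value can look poor even when raw agreement is high.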
Fix: Apply stratified sampling during label collection. Ensure each rating category represents at least 15-20% of the calibration set. Use prevalence-adjusted bootstrap weights if rebalancing is impossible.
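A prevalence check of this kind can run before any kappa is computed. The sketch below is one possible shape for it; the function name `underrepresentedCategories` is hypothetical, and the 0.15 default simply mirrors the lower end of the 15-20% guideline above.

```typescript
// Hypothetical pre-check: list rating categories whose share of the
// calibration set falls below a minimum ratio.
function underrepresentedCategories(labels: number[], minRatio: number = 0.15): number[] {
  const counts = new Map<number, number>();
  for (const label of labels) {
    counts.set(label, (counts.get(label) || 0) + 1);
  }
  const flagged: number[] = [];
  counts.forEach((count, category) => {
    if (count / labels.length < minRatio) flagged.push(category);
  });
  // Note: a category that never appears in `labels` cannot be flagged here.
  return flagged.sort((a, b) => a - b);
}

// Illustrative labels: 90 "pass" (1) and 10 "fail" (0).
const skewedLabels: number[] = [];
for (let i = 0; i < 90; i++) skewedLabels.push(1);
for (let i = 0; i < 10; i++) skewedLabels.push(0);
console.log(underrepresentedCategories(skewedLabels));
```

Running the check on the human labels is usually the more informative choice, since those define the categories that label collection can actually rebalance.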
3. Using Independent Tests for Paired Judge Comparisons
Explanation: Running separate kappa calculations for Judge A and Judge B, then comparing the numbers, ignores the paired nature of the data. Both judges evaluate the exact same prompts, creating dependency.
Fix: Always use McNemar's test or paired bootstrap difference intervals when comparing judges on identical inputs. Tests that assume independence misstate the variance of the difference and can lead to the wrong conclusion in either direction.
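A paired bootstrap difference interval can be sketched as below. The key step is drawing one set of indices per resample and applying it to both judges and the human labels, so the pairing is preserved. The names `cohenKappa` and `pairedKappaDifference` are hypothetical, and the percentile interval is one assumption among several valid bootstrap interval methods.

```typescript
// Hypothetical sketch: percentile bootstrap CI for kappa(A, human) - kappa(B, human).
function cohenKappa(x: number[], y: number[]): number {
  const n = x.length;
  const countX = new Map<number, number>();
  const countY = new Map<number, number>();
  let agree = 0;
  for (let i = 0; i < n; i++) {
    if (x[i] === y[i]) agree++;
    countX.set(x[i], (countX.get(x[i]) || 0) + 1);
    countY.set(y[i], (countY.get(y[i]) || 0) + 1);
  }
  let expected = 0;
  countX.forEach((cx, category) => {
    expected += (cx / n) * ((countY.get(category) || 0) / n);
  });
  return expected === 1 ? NaN : (agree / n - expected) / (1 - expected);
}

function pairedKappaDifference(
  judgeA: number[],
  judgeB: number[],
  human: number[],
  resamples: number = 2000,
  confidenceLevel: number = 0.95,
  seed: number = 42
): { difference: number; interval: [number, number] } {
  const n = human.length;
  if (judgeA.length !== n || judgeB.length !== n) {
    throw new Error("All label arrays must have the same length");
  }
  // Seeded 32-bit LCG so repeated runs give identical intervals.
  let state = seed >>> 0;
  const rng = (): number => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 0x100000000;
  };

  const diffs: number[] = [];
  for (let r = 0; r < resamples; r++) {
    const a: number[] = [];
    const b: number[] = [];
    const h: number[] = [];
    for (let i = 0; i < n; i++) {
      const idx = Math.floor(rng() * n); // same index for all three arrays
      a.push(judgeA[idx]);
      b.push(judgeB[idx]);
      h.push(human[idx]);
    }
    const d = cohenKappa(a, h) - cohenKappa(b, h);
    if (!Number.isNaN(d)) diffs.push(d); // skip degenerate resamples
  }
  diffs.sort((p, q) => p - q);

  const alpha = 1 - confidenceLevel;
  const lo = Math.floor((alpha / 2) * diffs.length);
  const hi = Math.min(diffs.length - 1, Math.floor((1 - alpha / 2) * diffs.length));
  return {
    difference: cohenKappa(judgeA, human) - cohenKappa(judgeB, human),
    interval: [diffs[lo], diffs[hi]],
  };
}
```

If the resulting interval excludes zero, the data support a real difference between the judges at the chosen confidence level. Unlike McNemar's test, this compares kappa itself rather than raw match rates.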
4. Equating High Agreement With High Validity
Explanation: Kappa measures agreement, not correctness. If human raters consistently misapply guidelines, a judge can achieve 0.80 kappa while systematically reproducing human errors.
Fix: Audit human label quality independently. Implement double-blind human reviews on a subset. Track validity metrics (e.g., guideline adherence scores) alongside agreement metrics.
5. Static Calibration Sets in Dynamic Environments
Explanation: Prompt updates, model version changes, and input distribution shifts degrade judge reliability over time. A calibration set valid in Q1 may be obsolete by Q3.
Fix: Implement rolling calibration windows. Retain the most recent 200-400 labeled examples and recompute kappa weekly. Archive older sets for trend analysis but do not use them for active gating.
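One way to keep an active window separate from the archive is a small bounded buffer. This is a sketch only; the class name `RollingCalibrationWindow` and its methods are hypothetical, and the default capacity of 200 is taken from the lower end of the range above.

```typescript
// Hypothetical rolling window: the newest examples stay active for gating,
// older ones move to an archive kept only for trend analysis.
class RollingCalibrationWindow<T> {
  private readonly capacity: number;
  private active: T[] = [];
  private archived: T[] = [];

  constructor(capacity: number = 200) {
    this.capacity = capacity;
  }

  add(example: T): void {
    this.active.push(example);
    // Evict the oldest examples once the window is over capacity.
    while (this.active.length > this.capacity) {
      this.archived.push(this.active.shift() as T);
    }
  }

  // Examples to use for the next kappa recomputation.
  current(): T[] {
    return this.active.slice();
  }

  // Evicted examples, oldest first.
  archive(): T[] {
    return this.archived.slice();
  }
}
```

A weekly job would add the newly labeled examples and pass `current()` to the bootstrap estimator, leaving `archive()` untouched by the gating logic.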
6. Misinterpreting Bootstrap Convergence
Explanation: Running 200 resamples instead of 2000 produces unstable CI bounds. The percentile extraction becomes sensitive to random seed variance.
Fix: Use a minimum of 2000 resamples for production pipelines. Verify convergence by running three independent bootstrap runs and checking that CI bounds vary by less than 0.02.
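The convergence check can be expressed as a small helper over the intervals returned by independent runs. The function name `boundsConverged` is hypothetical, and the sketch assumes each run used a different seed.

```typescript
// Hypothetical convergence check: given CIs from several independent
// bootstrap runs, verify that each bound varies by less than `tolerance`.
function boundsConverged(intervals: Array<[number, number]>, tolerance: number = 0.02): boolean {
  if (intervals.length < 2) return false; // nothing to compare against

  const lowers = intervals.map(ci => ci[0]);
  const uppers = intervals.map(ci => ci[1]);
  const spread = (values: number[]): number =>
    Math.max(...values) - Math.min(...values);

  return spread(lowers) < tolerance && spread(uppers) < tolerance;
}

// Illustrative intervals from three runs with different seeds.
const stableRuns: Array<[number, number]> = [
  [0.5, 0.7],
  [0.51, 0.71],
  [0.505, 0.695],
];
console.log(boundsConverged(stableRuns));
```

If the check fails, increasing the resample count is the first thing to try. A failure that persists at high resample counts usually points to a calibration set that is too small rather than a bootstrap problem.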
7. Single-Criterion Aggregation
Explanation: Collapsing multi-dimensional scores (e.g., factual accuracy, tone, formatting) into a single kappa value masks criterion-specific failures. A judge may excel at accuracy but fail completely on tone.
Fix: Compute per-criterion kappa with separate CIs. Aggregate only after verifying all criteria meet minimum thresholds. Use weighted kappa if criteria have different business impact.
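Splitting records by criterion before estimation might look like the sketch below. The names `CriterionRecord`, `groupByCriterion`, and `failingCriteria` are hypothetical, and the estimator is passed in as a callback so that any kappa implementation can be plugged in.

```typescript
// Hypothetical per-criterion split: group labeled records by rubric
// dimension, score each group separately, and list the ones below threshold.
interface CriterionRecord {
  criterion: string;
  judgeScore: number;
  humanScore: number;
}

type ScoredPair = { judgeScore: number; humanScore: number };

function groupByCriterion(records: CriterionRecord[]): Map<string, ScoredPair[]> {
  const groups = new Map<string, ScoredPair[]>();
  for (const r of records) {
    const bucket = groups.get(r.criterion) || [];
    bucket.push({ judgeScore: r.judgeScore, humanScore: r.humanScore });
    groups.set(r.criterion, bucket);
  }
  return groups;
}

function failingCriteria(
  groups: Map<string, ScoredPair[]>,
  estimate: (pairs: ScoredPair[]) => number, // e.g. a kappa estimator
  minKappa: number
): string[] {
  const failing: string[] = [];
  groups.forEach((pairs, criterion) => {
    const value = estimate(pairs);
    // Treat an undefined estimate (NaN) as a failure rather than a pass.
    if (Number.isNaN(value) || value < minKappa) failing.push(criterion);
  });
  return failing.sort();
}
```

Promoting the judge only when `failingCriteria` returns an empty list enforces the rule that no criterion may hide behind another. In practice the callback would return the CI lower bound or point estimate from a bootstrap estimator.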
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Initial judge calibration with unknown performance | N=200, bootstrap CI, target width 0.10 | Provides baseline reliability without over-labeling | Moderate upfront labeling cost |
| Judge kappa observed at 0.35-0.45 | Expand to N=400-450, re-evaluate | Low agreement requires larger samples for statistical power | High labeling cost; consider prompt engineering first |
| Comparing two candidate judges | McNemar exact test on paired outputs | Detects marginal homogeneity without independence assumptions | Low cost; reuses existing human labels |
| Production drift monitoring | Rolling 200-example window, weekly recompute | Catches 0.10-point drops reliably within sampling noise | Automated pipeline cost; minimal manual overhead |
| Multi-criterion scoring task | Per-criterion kappa + separate CIs | Prevents high-performing criteria from masking failures | Moderate cost; requires structured human rubrics |
Configuration Template
# evaluation-pipeline.config.yaml
calibration:
  target_kappa: 0.60
  max_ci_width: 0.10
  min_sample_size: 200
  resamples: 2000
  confidence_level: 0.95
  stratification:
    enabled: true
    min_category_ratio: 0.15
monitoring:
  drift_detection:
    enabled: true
    window_size: 200
    recompute_frequency: "weekly"
    alert_threshold_kappa_drop: 0.10
judge_comparison:
  test_type: "mcnemar_exact"
  significance_level: 0.05
  require_paired_inputs: true
output:
  format: "json"
  include_confidence_intervals: true
  include_discordant_pairs: true
seed: 42
Quick Start Guide
- Collect paired labels: Run your candidate judge on 200 diverse prompts. Have human raters score the same 200 prompts using identical rubrics. Store as {judgeScore, humanScore} pairs.
- Run bootstrap calibration: Instantiate KappaCalibrator and call estimateWithBootstrap(pairs, 2000, 0.95). Verify CI width ≤ 0.10.
- Gate or expand: If CI width exceeds 0.10, use SampleSizeAdvisor.recommendN(observedKappa, 0.10) to get the recommended total sample size. Label the difference between that total and your current set, then recompute.
- Deploy monitoring: Integrate the calibrator into your CI/CD pipeline. Configure weekly rolling windows to recompute kappa on fresh slices. Set alerts for CI width breaches or kappa drops ≥ 0.10.
- Validate multi-criterion tasks: Split scoring rubrics into dimensions. Run separate bootstrap estimators per criterion. Only promote the judge when all dimensions meet target thresholds.