cal confidence score from vote distribution.
interface RoutingInput {
  query: string;
  availableRoutes: string[];
}
interface RoutingOutput {
  selectedRoute: string;
  rawConfidence: number;
  calibratedConfidence: number;
  action: 'execute' | 'abstain' | 'escalate';
}
class ConfidenceExtractor {
  private readonly apiClient: LLMClient;
  private readonly sampleCount: number;

  constructor(client: LLMClient, samples: number = 5) {
    // At least one sample is required; with zero samples there is no dominant vote to pick.
    if (!Number.isInteger(samples) || samples < 1) {
      throw new RangeError('samples must be a positive integer');
    }
    this.apiClient = client;
    this.sampleCount = samples;
  }

  async extract(input: RoutingInput): Promise<{ vote: string; fraction: number }> {
    // Send the same prompt sampleCount times; a non-zero temperature makes the replies vary.
    const prompts = Array.from({ length: this.sampleCount }, () =>
      this.buildRoutingPrompt(input)
    );
    const responses = await Promise.all(
      prompts.map(p => this.apiClient.generate(p, { temperature: 0.7 }))
    );
    // Normalize replies so "Search" and " search" count as the same vote.
    // Note: replies are not checked against availableRoutes at this point.
    const votes = responses.map(r => r.trim().toLowerCase());
    const voteCounts = votes.reduce<Record<string, number>>((acc, v) => {
      acc[v] = (acc[v] || 0) + 1;
      return acc;
    }, {});
    // The most frequent reply wins; its share of all samples is the raw confidence.
    const dominantRoute = Object.entries(voteCounts).sort((a, b) => b[1] - a[1])[0][0];
    const fraction = voteCounts[dominantRoute] / this.sampleCount;
    return { vote: dominantRoute, fraction };
  }

  private buildRoutingPrompt(input: RoutingInput): string {
    return `Route the following query to exactly one of these options: ${input.availableRoutes.join(', ')}.
Query: "${input.query}"
Return only the route name.`;
  }
}
Architecture Rationale: Separating extraction from calibration allows swapping sampling strategies (logprobs, ensembling) without rewriting downstream logic. Temperature is set to 0.7 to encourage diversity without degrading coherence.
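The extractor above counts whatever string the model returns, so an off-list reply such as "route: search" becomes a vote of its own and the winning route comes back lowercased. A minimal sketch of a stricter tally follows. `tallyVotes` and `VoteTally` are hypothetical names introduced here for illustration, not part of the original code, and the choice to keep invalid replies in the denominator is an assumption.

```typescript
// Hypothetical helper (not in the original article): tally sampled replies,
// counting only those that match one of the allowed routes.
interface VoteTally {
  vote: string | null;   // winning route in its original casing, or null if no reply was valid
  fraction: number;      // winning count divided by ALL samples, valid or not
  invalidCount: number;  // replies that matched no allowed route
}

function tallyVotes(responses: string[], availableRoutes: string[]): VoteTally {
  const normalizedRoutes = availableRoutes.map(r => r.trim().toLowerCase());
  const counts: number[] = availableRoutes.map(() => 0);
  let invalidCount = 0;

  for (const raw of responses) {
    const index = normalizedRoutes.indexOf(raw.trim().toLowerCase());
    if (index === -1) {
      invalidCount += 1;
    } else {
      counts[index] = (counts[index] ?? 0) + 1;
    }
  }

  // Pick the route with the most votes; on a tie the route listed first wins.
  let bestIndex = -1;
  let bestCount = 0;
  for (let i = 0; i < counts.length; i++) {
    const count = counts[i] ?? 0;
    if (count > bestCount) {
      bestCount = count;
      bestIndex = i;
    }
  }

  return {
    vote: bestIndex === -1 ? null : (availableRoutes[bestIndex] ?? null),
    // Invalid replies stay in the denominator, so malformed output lowers confidence.
    fraction: responses.length === 0 ? 0 : bestCount / responses.length,
    invalidCount,
  };
}
```

With five replies of which three match `search`, one matches `kb`, and one is malformed, this yields a fraction of 0.6 and an `invalidCount` of 1. A caller could treat a `null` vote as an immediate escalation.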
Step 2: Apply Platt Scaling
Raw vote fractions are discrete and biased. Platt scaling maps them to calibrated probabilities using a logistic function learned on a held-out dataset.
class ProbabilityCalibrator {
  private readonly weight: number;
  private readonly bias: number;

  constructor(weight: number, bias: number) {
    this.weight = weight;
    this.bias = bias;
  }

  static fromCalibrationSet(rawScores: number[], trueLabels: boolean[]): ProbabilityCalibrator {
    // In production, use scipy.optimize or a JS equivalent to minimize log-loss
    // over (rawScores, trueLabels). This placeholder ignores both arguments and
    // returns hard-coded values simulating a fit on a 1000-sample calibration set.
    return new ProbabilityCalibrator(2.14, -1.87);
  }

  calibrate(rawFraction: number): number {
    // Logistic (sigmoid) of a linear function of the raw vote fraction.
    const z = this.weight * rawFraction + this.bias;
    return 1 / (1 + Math.exp(-z));
  }
}
Architecture Rationale: Platt scaling requires only two parameters, making it stable with 200–500 calibration examples. It outperforms histogram binning in smoothness and avoids the overfitting risks of isotonic regression on smaller datasets.
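`fromCalibrationSet` above returns hard-coded values rather than fitting anything. As a rough sketch of what the fit could look like, the function below runs batch gradient descent on mean log-loss for the two-parameter logistic model. The name `fitPlatt`, the learning rate, and the iteration count are assumptions chosen for illustration. A production fit would more likely use a proper optimizer, and this sketch omits the target smoothing used in Platt's original method.

```typescript
// Illustrative fit of Platt parameters; names and defaults are assumptions.
interface PlattParams {
  weight: number;
  bias: number;
}

function sigmoid(z: number): number {
  return 1 / (1 + Math.exp(-z));
}

function fitPlatt(
  rawScores: number[],
  trueLabels: boolean[],
  learningRate: number = 0.5,
  iterations: number = 5000
): PlattParams {
  if (rawScores.length === 0 || rawScores.length !== trueLabels.length) {
    throw new Error('rawScores and trueLabels must be non-empty and equal in length');
  }
  const n = rawScores.length;
  let weight = 1;
  let bias = 0;

  for (let iter = 0; iter < iterations; iter++) {
    let gradWeight = 0;
    let gradBias = 0;
    rawScores.forEach((x, i) => {
      // Gradient of log-loss for one example: (predicted - actual) times the input.
      const error = sigmoid(weight * x + bias) - (trueLabels[i] ? 1 : 0);
      gradWeight += error * x;
      gradBias += error;
    });
    weight -= (learningRate * gradWeight) / n;
    bias -= (learningRate * gradBias) / n;
  }
  return { weight, bias };
}
```

The returned `weight` and `bias` could then be passed to the `ProbabilityCalibrator` constructor. If the calibration set is perfectly separable, the parameters keep growing with more iterations, which is one reason to prefer a regularized or smoothed fit in practice.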
Step 3: Enforce Thresholds & Abstention Zones
Calibrated probabilities enable cost-aware routing. Symmetric errors use a 0.5 cutoff; asymmetric costs shift the threshold. An abstention zone handles irreducible ambiguity.
class DecisionRouter {
  private readonly extractor: ConfidenceExtractor;
  private readonly calibrator: ProbabilityCalibrator;
  private readonly thresholds: { execute: number; abstain: number };

  constructor(
    extractor: ConfidenceExtractor,
    calibrator: ProbabilityCalibrator,
    thresholds: { execute: number; abstain: number }
  ) {
    this.extractor = extractor;
    this.calibrator = calibrator;
    this.thresholds = thresholds;
  }

  async route(input: RoutingInput): Promise<RoutingOutput> {
    const { vote, fraction } = await this.extractor.extract(input);
    const calibrated = this.calibrator.calibrate(fraction);
    let action: RoutingOutput['action'];
    // Top band: confident enough to act on the selected route.
    if (calibrated >= this.thresholds.execute) {
      action = 'execute';
    // Middle band: the abstention zone.
    } else if (calibrated >= this.thresholds.abstain) {
      action = 'abstain';
    // Bottom band: hand off to a human or a deterministic fallback.
    } else {
      action = 'escalate';
    }
    return {
      selectedRoute: vote,
      rawConfidence: fraction,
      calibratedConfidence: calibrated,
      action
    };
  }
}
Architecture Rationale: The three-tier action space (execute, abstain, escalate) decouples model uncertainty from business logic. Abstention triggers clarifying prompts or parallel execution; escalation routes to human review or deterministic fallbacks.
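The statement that asymmetric costs shift the threshold can be made concrete with a standard expected-cost argument. Assume a false positive means executing a route that turns out to be wrong, a false negative means declining a route that was right, and p is a well-calibrated probability that the route is right. Executing is cheaper in expectation when (1 - p) · C_FP < p · C_FN, which rearranges to p > C_FP / (C_FP + C_FN). The sketch below encodes only that formula. The function name and the cost definitions are assumptions, not something the original code specifies.

```typescript
// Illustrative only: break-even confidence for executing under asymmetric costs.
// falsePositiveCost: cost of executing a wrong route (assumed definition).
// falseNegativeCost: cost of declining a route that was right (assumed definition).
function costWeightedThreshold(falsePositiveCost: number, falseNegativeCost: number): number {
  if (!(falsePositiveCost > 0) || !(falseNegativeCost > 0)) {
    throw new RangeError('both costs must be positive');
  }
  return falsePositiveCost / (falsePositiveCost + falseNegativeCost);
}

// Equal costs recover the symmetric 0.5 cutoff.
const symmetricCutoff = costWeightedThreshold(1, 1);
// A false negative four times as costly as a false positive lowers the cutoff to 0.2.
const asymmetricCutoff = costWeightedThreshold(1, 4);
```

The break-even value is a starting point rather than a final setting: it says nothing about how wide the abstention zone should be, and it is only as trustworthy as the calibration behind p.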
Step 4: Validate Calibration Quality
Production systems must continuously verify calibration using Expected Calibration Error (ECE) and Brier scores.
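function computeECE(predictions: number[], labels: boolean[], bins: number = 10): number {
  if (predictions.length !== labels.length) {
    throw new Error('predictions and labels must have the same length');
  }
  // Per-bin accumulators: sample count, number of correct predictions, summed confidence.
  const binCounts = new Array(bins).fill(0);
  const binAccuracy = new Array(bins).fill(0);
  const binConfidence = new Array(bins).fill(0);
  predictions.forEach((p, i) => {
    // Clamp the index so p = 1.0 lands in the last bin instead of overflowing.
    const binIndex = Math.min(Math.floor(p * bins), bins - 1);
    binCounts[binIndex]++;
    binAccuracy[binIndex] += labels[i] ? 1 : 0;
    binConfidence[binIndex] += p;
  });
  let ece = 0;
  const total = predictions.length;
  for (let i = 0; i < bins; i++) {
    if (binCounts[i] === 0) continue; // empty bins contribute nothing
    const avgConf = binConfidence[i] / binCounts[i];
    const avgAcc = binAccuracy[i] / binCounts[i];
    // Weight each bin's confidence/accuracy gap by its share of all samples.
    ece += (binCounts[i] / total) * Math.abs(avgAcc - avgConf);
  }
  return ece;
}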
Architecture Rationale: ECE measures the weighted average gap between predicted confidence and empirical accuracy per bin. Values below 0.05 indicate production-ready calibration. Pair this with reliability diagrams to visualize drift over time.
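Step 4 names the Brier score alongside ECE but only shows code for the latter. A minimal sketch of the Brier score, the mean squared difference between the predicted probability and the 0/1 outcome, is given below. The function name `computeBrierScore` is an assumption made to mirror `computeECE`.

```typescript
// Brier score: mean of (prediction - outcome)^2, where outcome is 1 for a
// correct route and 0 otherwise. Lower is better; 0 is a perfect score.
function computeBrierScore(predictions: number[], labels: boolean[]): number {
  if (predictions.length === 0 || predictions.length !== labels.length) {
    throw new Error('predictions and labels must be non-empty and equal in length');
  }
  let sum = 0;
  predictions.forEach((p, i) => {
    const outcome = labels[i] ? 1 : 0;
    const diff = p - outcome;
    sum += diff * diff;
  });
  return sum / predictions.length;
}
```

Unlike ECE, the Brier score needs no binning and also penalizes predictions that are calibrated but uninformative: a model that always outputs 0.5 on a balanced set scores 0.25.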
Pitfall Guide
1. The Self-Report Trap
Explanation: Prompting an LLM to output a confidence score yields systematically inflated values. Models lack true meta-cognition and default to high certainty to satisfy instruction-following biases.
Fix: Never trust self-reported probabilities. Extract signals from token distributions, self-consistency voting, or cross-model disagreement, then apply Platt scaling.
2. Static Thresholds in Shifting Traffic
Explanation: Thresholds tuned on Q1 traffic degrade when user intent distribution changes. A fixed 0.6 cutoff may become too aggressive or too conservative as query patterns evolve.
Fix: Implement rolling-window calibration. Recalculate Platt parameters monthly using the most recent 500 labeled examples. Alert when ECE exceeds 0.08.
3. Ignoring Cost Asymmetry
Explanation: Routing a knowledge query to web search wastes tokens and latency. Routing a search-dependent query to local knowledge produces factual errors. These mistakes carry different business costs.
Fix: Weight threshold tuning by cost ratios. If false negatives cost 5× more than false positives, shift the execute threshold downward and widen the abstention zone.
4. The Gold Set Mirage
Explanation: Synthetic benchmarks or manually crafted examples rarely match production traffic distributions. Models optimized on clean data fail on noisy, real-world inputs.
Fix: Stratify your evaluation set using production log sampling. Match the class distribution, query length, and ambiguity rate of live traffic.
5. Calibration Drift on Model Updates
Explanation: Vendors frequently push silent model updates. A calibration curve fitted on model-v2.1 becomes invalid on model-v2.2, causing ECE to spike without warning.
Fix: Version-lock calibration datasets. Run automated calibration regression tests against a shadow deployment before promoting new model versions.
6. Overfitting Calibration Curves
Explanation: Isotonic regression and histogram binning require large datasets. Applying them to 200 examples creates jagged, non-generalizable mappings that fail in production.
Fix: Start with Platt scaling. Only upgrade to non-parametric methods when you have 1,000+ calibration examples and Platt's logistic assumption proves insufficient.
7. Treating Ambiguity as Failure
Explanation: Some inputs are inherently unclear. Forcing a decision on ambiguous queries increases error rates and degrades user trust.
Fix: Design abstention zones as first-class features. Trigger clarifying questions, parallel tool execution, or human-in-the-loop review when calibrated confidence falls in the uncertainty band.
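Pitfalls 2 and 5 both come down to noticing that calibration has degraded on recent traffic. One way to sketch that is a fixed-size window of labeled outcomes whose ECE is compared against warning and critical levels. Everything named below (`RollingCalibrationMonitor`, `LabeledPrediction`, `DriftStatus`) is hypothetical. The defaults of 500 examples and 0.08 / 0.12 are taken from figures quoted elsewhere in this guide, and the equal-width binning mirrors `computeECE`.

```typescript
// Hypothetical drift monitor; names and structure are assumptions.
type DriftStatus = 'ok' | 'warning' | 'critical';

interface LabeledPrediction {
  confidence: number; // calibrated confidence that was served
  correct: boolean;   // whether the selected route turned out to be right
}

class RollingCalibrationMonitor {
  private readonly window: LabeledPrediction[] = [];
  private readonly maxSize: number;
  private readonly warningEce: number;
  private readonly criticalEce: number;

  constructor(maxSize: number = 500, warningEce: number = 0.08, criticalEce: number = 0.12) {
    this.maxSize = maxSize;
    this.warningEce = warningEce;
    this.criticalEce = criticalEce;
  }

  // Add the newest labeled outcome and evict the oldest once the window is full.
  record(example: LabeledPrediction): void {
    this.window.push(example);
    if (this.window.length > this.maxSize) {
      this.window.shift();
    }
  }

  size(): number {
    return this.window.length;
  }

  // Expected Calibration Error over the current window (0 when the window is empty).
  ece(bins: number = 10): number {
    const counts: number[] = [];
    const hits: number[] = [];
    const confSums: number[] = [];
    for (let b = 0; b < bins; b++) {
      counts.push(0);
      hits.push(0);
      confSums.push(0);
    }
    for (const { confidence, correct } of this.window) {
      const clamped = Math.min(Math.max(confidence, 0), 1);
      const b = Math.min(Math.floor(clamped * bins), bins - 1);
      counts[b] = (counts[b] ?? 0) + 1;
      hits[b] = (hits[b] ?? 0) + (correct ? 1 : 0);
      confSums[b] = (confSums[b] ?? 0) + clamped;
    }
    let total = 0;
    for (let b = 0; b < bins; b++) {
      const count = counts[b] ?? 0;
      if (count === 0) continue;
      const gap = Math.abs((hits[b] ?? 0) / count - (confSums[b] ?? 0) / count);
      total += (count / this.window.length) * gap;
    }
    return total;
  }

  status(): DriftStatus {
    const value = this.ece();
    if (value > this.criticalEce) return 'critical';
    if (value > this.warningEce) return 'warning';
    return 'ok';
  }
}
```

A `warning` status could trigger a Platt refit on the window's contents, while `critical` could pause automatic execution until the refit has been validated. Small windows give noisy ECE estimates, so acting on the status only once the window is reasonably full is probably sensible.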
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low-latency internal routing | Platt-scaled self-consistency (N=3) | Balances speed and calibration quality | Moderate (+15% compute) |
| High-stakes customer triage | Cross-model ensemble + isotonic regression | Maximizes calibration accuracy for critical decisions | High (+40% compute) |
| Ambiguous query volume >20% | Wide abstention zone (0.35–0.65) | Prevents forced errors on unclear inputs | Low (shifts cost to clarification) |
| Vendor model updates frequent | Rolling-window Platt + shadow deployment | Detects calibration drift before production impact | Low (automated monitoring) |
Configuration Template
export const routingConfig = {
  sampling: {
    strategy: 'self_consistency',
    count: 5,
    temperature: 0.7,
    maxConcurrency: 10
  },
  calibration: {
    method: 'platt_scaling',
    weight: 2.14,
    bias: -1.87,
    updateFrequency: 'monthly',
    minCalibrationSamples: 300
  },
  thresholds: {
    execute: 0.68,
    abstain: 0.42,
    costWeighting: { falsePositive: 1, falseNegative: 4 }
  },
  monitoring: {
    eceWarning: 0.08,
    eceCritical: 0.12,
    reliabilityDiagramExport: true,
    driftDetectionWindow: '14d'
  },
  fallback: {
    abstentionAction: 'clarify_prompt',
    escalationAction: 'human_review_queue',
    timeoutMs: 2000
  }
};
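Thresholds and calibration parameters have to be chosen together, because the logistic curve bounds the confidence values the router can ever see. With the placeholder parameters in this template (weight 2.14, bias -1.87), a unanimous vote calibrates to roughly 0.567, which is below the 0.68 execute threshold, so `execute` would never fire until the parameters are refit or the thresholds adjusted. A small sanity check along these lines can catch that at startup. `validateThresholds` and its interfaces are hypothetical names used only for this sketch.

```typescript
// Hypothetical startup check; not part of the original configuration code.
interface CalibrationSettings {
  weight: number;
  bias: number;
}

interface ThresholdSettings {
  execute: number;
  abstain: number;
}

function validateThresholds(
  calibration: CalibrationSettings,
  thresholds: ThresholdSettings
): string[] {
  const problems: string[] = [];

  if (thresholds.abstain < 0 || thresholds.execute > 1) {
    problems.push('thresholds must lie within [0, 1]');
  }
  if (!(thresholds.abstain < thresholds.execute)) {
    problems.push('abstain threshold must be below execute threshold');
  }

  // Calibrated confidence at the two extremes of the raw vote fraction.
  const calibrate = (fraction: number): number =>
    1 / (1 + Math.exp(-(calibration.weight * fraction + calibration.bias)));
  const lowest = Math.min(calibrate(0), calibrate(1));
  const highest = Math.max(calibrate(0), calibrate(1));

  if (highest < thresholds.execute) {
    problems.push(
      `execute threshold ${thresholds.execute} is unreachable: ` +
        `maximum calibrated confidence is ${highest.toFixed(3)}`
    );
  }
  if (lowest >= thresholds.abstain) {
    problems.push(
      `escalate is unreachable: minimum calibrated confidence is ${lowest.toFixed(3)}`
    );
  }
  return problems;
}

// Checking the template's placeholder values reports the unreachable execute threshold.
const templateProblems = validateThresholds(
  { weight: 2.14, bias: -1.87 },
  { execute: 0.68, abstain: 0.42 }
);
```

An empty array means the three action bands are all reachable for some vote fraction. The check uses the full range 0 to 1, which is slightly optimistic because the dominant vote fraction can never actually be 0.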
Quick Start Guide
- Extract live traffic samples: Pull 1,000 recent user queries from your application logs. Stratify by intent category to match production distribution.
- Label a gold set: Route 300 samples through your target LLM using self-consistency (N=5). Have two annotators label the correct route. Calculate inter-annotator agreement to establish your Bayes error floor.
- Fit calibration parameters: Split the labeled set into 70% calibration and 30% validation. Run Platt scaling optimization to derive weight and bias parameters. Verify ECE < 0.05 on the validation split.
- Deploy with abstention: Initialize the DecisionRouter with calibrated parameters and cost-weighted thresholds. Route production traffic through the abstention zone for the first 48 hours, logging all escalations for threshold refinement.
- Monitor drift: Configure ECE tracking and reliability diagram exports. Set automated alerts for calibration degradation. Schedule monthly Platt parameter retraining using the most recent labeled cohort.
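The labeling step asks for inter-annotator agreement without naming a statistic. For two annotators, Cohen's kappa is one common choice because it corrects raw agreement for the agreement expected by chance. The sketch below assumes each item carries one route label from each annotator. The function name and the pair-based input shape are assumptions made for illustration.

```typescript
// Cohen's kappa for two annotators. Each pair holds [labelFromA, labelFromB]
// for one item. Returns a value of at most 1, where 1 means perfect agreement.
function cohensKappa(pairs: Array<[string, string]>): number {
  if (pairs.length === 0) {
    throw new Error('at least one labeled pair is required');
  }
  const categories: string[] = [];
  const countsA: number[] = [];
  const countsB: number[] = [];

  // Look up a category's index, registering it on first sight.
  const indexFor = (label: string): number => {
    let index = categories.indexOf(label);
    if (index === -1) {
      categories.push(label);
      countsA.push(0);
      countsB.push(0);
      index = categories.length - 1;
    }
    return index;
  };

  let agreements = 0;
  pairs.forEach(([a, b]) => {
    const ia = indexFor(a);
    const ib = indexFor(b);
    countsA[ia] = (countsA[ia] ?? 0) + 1;
    countsB[ib] = (countsB[ib] ?? 0) + 1;
    if (a === b) agreements += 1;
  });

  const n = pairs.length;
  const observed = agreements / n;
  // Chance agreement: probability both annotators pick the same category independently.
  let expected = 0;
  for (let k = 0; k < categories.length; k++) {
    expected += ((countsA[k] ?? 0) / n) * ((countsB[k] ?? 0) / n);
  }
  // Degenerate case: both annotators used one identical label for every item.
  // Kappa is undefined there; returning 1 is a convention chosen for this sketch.
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}
```

As a worked case, 50 items with 35 agreements (observed 0.7) and marginals that give a chance agreement of 0.5 produce a kappa of 0.4. Low kappa on the gold set suggests the route definitions themselves are ambiguous, which limits how well any calibrated router can score against those labels.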