The approach shifts from probability-based uncertainty to divergence-based uncertainty. The framework relies on three components: multi-pass sampling, numerically stable LogSumExp computation, and Bregman information aggregation.
Step 1: Multi-Pass Logit Collection
Bregman information is an expectation over a distribution of predictions. You must generate multiple logit vectors per input. Common strategies include:
- Weight ensembling: Train multiple models with different random seeds or data subsets.
- Monte Carlo Dropout: Enable dropout at inference time and run multiple forward passes.
- Test-time augmentation: Apply stochastic transformations (crop, flip, noise) and aggregate logits.
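Whatever strategy you choose, the collection loop looks the same. A minimal sketch, with a hypothetical `forwardPass` callback standing in for an MC Dropout pass, an ensemble member, or a TTA variant (the noisy toy pass below only mimics dropout-induced variation):

```typescript
// A single stochastic forward pass: input logits in, perturbed logits out.
type ForwardPass = (input: number[]) => number[];

function collectLogitSamples(
  input: number[],
  forwardPass: ForwardPass,
  numPasses: number
): number[][] {
  const samples: number[][] = [];
  for (let i = 0; i < numPasses; i++) {
    samples.push(forwardPass(input)); // each call is stochastic
  }
  return samples;
}

// Toy stochastic pass: adds small noise to fixed logits to mimic
// dropout-induced variation across passes.
const noisyPass: ForwardPass = (x) => x.map(z => z + (Math.random() - 0.5) * 0.1);
const samples = collectLogitSamples([2.0, -1.0, 0.5], noisyPass, 5);
```

The callback injection keeps the aggregation code identical across all three sampling strategies; only the source of stochasticity changes.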
Step 2: Numerically Stable LogSumExp
The variance term for log-loss uses the LogSumExp (LSE) function as the convex generator $\phi$. Direct computation of $\log(\sum e^{z_i})$ suffers from floating-point overflow. The standard stabilization subtracts the maximum logit before exponentiation:
function stableLogSumExp(logits: number[]): number {
  const maxLogit = Math.max(...logits);
  const sumExp = logits.reduce((acc, z) => acc + Math.exp(z - maxLogit), 0);
  return maxLogit + Math.log(sumExp);
}
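To see why the stabilization matters, compare against a naive implementation (both versions are redefined here so the snippet is self-contained):

```typescript
// Naive LSE: overflows because Math.exp(z) is Infinity for large z.
function naiveLogSumExp(logits: number[]): number {
  return Math.log(logits.reduce((acc, z) => acc + Math.exp(z), 0));
}

// Stable LSE: subtracting the max keeps every exponent <= 0.
function stableLogSumExp(logits: number[]): number {
  const maxLogit = Math.max(...logits);
  const sumExp = logits.reduce((acc, z) => acc + Math.exp(z - maxLogit), 0);
  return maxLogit + Math.log(sumExp);
}

const large = [1000, 999, 998];
console.log(naiveLogSumExp(large));  // Infinity: Math.exp(1000) overflows
console.log(stableLogSumExp(large)); // finite, ≈ 1000.41
```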
Step 3: Bregman Information Aggregation
Bregman information is defined as $B_\phi[X] = \mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X])$. For classification, $\phi$ is the LSE function. The implementation aggregates multiple logit samples, computes the expected LSE, subtracts the LSE of the expected logits, and returns the variance metric.
interface LogitSample {
  id: string;
  values: number[];
}

function computeBregmanInformation(samples: LogitSample[]): number {
  if (samples.length === 0) return 0;
  const numClasses = samples[0].values.length;
  const meanLogits = new Array(numClasses).fill(0);
  const lseValues: number[] = [];
  // Compute mean logits and individual LSE values
  for (const sample of samples) {
    lseValues.push(stableLogSumExp(sample.values));
    for (let c = 0; c < numClasses; c++) {
      meanLogits[c] += sample.values[c];
    }
  }
  // Average the logits
  const avgLogits = meanLogits.map(v => v / samples.length);
  // Bregman Information = E[LSE(z)] - LSE(E[z])
  const expectedLSE = lseValues.reduce((a, b) => a + b, 0) / samples.length;
  const lseOfMean = stableLogSumExp(avgLogits);
  return expectedLSE - lseOfMean;
}
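A quick sanity check of the metric: identical samples must yield (numerically) zero Bregman information, while disagreeing samples yield a strictly positive value, since LSE is convex. The helpers are condensed here so the snippet runs standalone:

```typescript
function lse(z: number[]): number {
  const m = Math.max(...z);
  return m + Math.log(z.reduce((a, v) => a + Math.exp(v - m), 0));
}

function bregmanInfo(samples: number[][]): number {
  const n = samples.length;
  const k = samples[0].length;
  const mean = new Array(k).fill(0);
  let expLse = 0;
  for (const s of samples) {
    expLse += lse(s) / n;
    s.forEach((v, c) => (mean[c] += v / n));
  }
  return expLse - lse(mean); // E[LSE(z)] - LSE(E[z]) >= 0 by convexity
}

const agree = bregmanInfo([[2, -1, 0.5], [2, -1, 0.5], [2, -1, 0.5]]);
const disagree = bregmanInfo([[5, -1, 0.5], [-1, 5, 0.5], [0.5, -1, 5]]);
console.log(agree);    // ≈ 0: no disagreement across passes
console.log(disagree); // positive (≈ 2.33): passes favor different classes
```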
Step 4: Uncertainty Thresholding & Routing
The Bregman information value represents predictive variance. Higher values indicate greater disagreement across samples, signaling potential OOD inputs or high epistemic uncertainty. Production systems should map this metric to a rejection threshold:
function routePrediction(
  bregmanVariance: number,
  threshold: number,
  primaryHandler: () => void,
  fallbackHandler: () => void
): void {
  if (bregmanVariance > threshold) {
    fallbackHandler(); // Route to human review, secondary model, or safe default
  } else {
    primaryHandler(); // Proceed with standard pipeline
  }
}
Architecture Decisions & Rationale
- Logit-space over probability-space: Softmax saturates at extreme values, compressing variance signals. Operating on raw logits preserves the full magnitude of disagreement between passes, so the metric stays sensitive even where softmax outputs have already collapsed to near-identical probabilities.
- LSE as the convex generator: LSE is the convex conjugate of the negative entropy for categorical distributions. This mathematical property ensures the Bregman divergence aligns exactly with log-loss, the standard proper scoring rule for classification.
- Frequentist variance interpretation: This framework measures spread across training initializations or sampling distributions, not a Bayesian posterior. It is computationally far cheaper than full variational inference and still benefits from the variance reduction that ensembling provides, at the cost of offering no credible intervals or parameter uncertainty.
Pitfall Guide
1. Applying the Decomposition to 0-1 Loss or Accuracy
Explanation: The bias-variance decomposition requires strictly proper scoring rules. Accuracy and 0-1 loss are not proper scoring rules; they do not incentivize truthful probability reporting. Attempting to compute Bregman information for accuracy yields mathematically invalid results.
Fix: Always evaluate uncertainty using log-loss, Brier score, or CRPS. If business metrics require accuracy, map uncertainty thresholds to accuracy targets offline, but compute variance using proper scores.
2. Computing Variance After Softmax Normalization
Explanation: Applying softmax before variance calculation destroys the geometric properties required by Bregman divergences. The compression of the probability simplex artificially caps variance, making the metric insensitive to high-uncertainty regions.
Fix: Collect raw logits from the model's final linear layer. Never apply softmax, temperature scaling, or normalization before computing Bregman information.
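A small illustration of the saturation problem: two passes that disagree by 20 logits on the top class become nearly indistinguishable after softmax, so any probability-space variance metric barely registers the disagreement:

```typescript
function softmax(z: number[]): number[] {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

const passA = [10, 0];
const passB = [30, 0]; // disagrees with passA by 20 logits

console.log(softmax(passA)); // ≈ [0.99995, 0.00005]
console.log(softmax(passB)); // ≈ [1, ~9e-14]
// The 20-logit gap shrinks to a probability gap of ~5e-5 after softmax.
```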
3. Ignoring Numerical Stability in LogSumExp
Explanation: A naive implementation of $\log(\sum e^{z_i})$ overflows once any logit exceeds ~709 in double precision (the overflow limit of `Math.exp`). This causes NaN/Infinity propagation and silent pipeline failures.
Fix: Always implement the max-subtraction trick shown in the core solution. Validate inputs for NaN or Infinity before aggregation.
4. Confusing Frequentist Variance with Bayesian Posteriors
Explanation: Bregman information measures spread across a frequentist sampling distribution (e.g., different random seeds, dropout masks). It does not provide a full posterior distribution, credible intervals, or parameter uncertainty.
Fix: Use this framework for predictive uncertainty and OOD detection. If full Bayesian inference or parameter-level uncertainty is required, integrate with variational inference or MCMC-based neural networks.
5. Single-Pass Estimation Attempts
Explanation: Bregman information is an expectation. Computing it from a single forward pass returns zero variance, falsely indicating perfect confidence.
Fix: Enforce multi-pass sampling at the infrastructure level. Cache logit tensors across passes to avoid redundant computation. Use batched inference to amortize overhead.
6. Misinterpreting Markov-Based Confidence Regions
Explanation: Confidence regions derived from Markov's inequality on Bregman divergence are conservative bounds, not exact credible intervals. They guarantee coverage probability but are often wider than Gaussian approximations.
Fix: Treat these regions as safety envelopes for routing and rejection, not as precise probability estimates. Validate coverage empirically on held-out shift data.
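To make the conservatism explicit, the bound in question is a direct application of Markov's inequality. Using the identity $\mathbb{E}[D_\phi(X, \mathbb{E}[X])] = B_\phi[X]$ (the linear term of the Bregman divergence vanishes in expectation), for any $\varepsilon > 0$:

$$\Pr\big[D_\phi(X, \mathbb{E}[X]) \geq \varepsilon\big] \;\leq\; \frac{\mathbb{E}\big[D_\phi(X, \mathbb{E}[X])\big]}{\varepsilon} \;=\; \frac{B_\phi[X]}{\varepsilon}$$

The inequality holds for any distribution, which is what makes the bound distribution-free, but it is tight only in worst-case scenarios; this is precisely why the resulting regions are wider than Gaussian approximations.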
7. Static Thresholding Without Drift Monitoring
Explanation: Bregman variance thresholds calibrated on clean validation data degrade under distribution shift. A fixed threshold will either reject too much data or fail to catch OOD inputs as the data distribution evolves.
Fix: Implement adaptive thresholding using exponential moving averages of variance distributions. Trigger re-calibration when the variance distribution's percentile shifts beyond a tolerance band.
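One possible shape for this fix, sketched below: an exponential moving average of observed variances with a relative-drift trigger. The re-calibration rule (rescale the threshold proportionally and reset the reference EMA) is illustrative, not prescriptive; production systems may prefer percentile tracking as described in the fix:

```typescript
class AdaptiveThreshold {
  private ema: number;

  constructor(
    private threshold: number,
    private baseline: number,          // EMA of variance on calibration data
    private readonly alpha = 0.01,     // EMA smoothing factor
    private readonly tolerance = 0.15  // relative drift before re-calibration
  ) {
    this.ema = baseline;
  }

  // Feed each observed Bregman variance; returns true when re-calibration fires.
  observe(variance: number): boolean {
    this.ema = this.alpha * variance + (1 - this.alpha) * this.ema;
    const drift = Math.abs(this.ema - this.baseline) / this.baseline;
    if (drift > this.tolerance) {
      this.threshold *= this.ema / this.baseline; // rescale to the new regime
      this.baseline = this.ema;                   // reset the reference point
      return true;
    }
    return false;
  }

  shouldReject(variance: number): boolean {
    return variance > this.threshold;
  }

  get currentThreshold(): number {
    return this.threshold;
  }
}
```

A sustained shift in the variance distribution moves the EMA away from its calibration-time baseline; once the relative drift crosses the tolerance band, the threshold is rescaled rather than left stale.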
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low-latency inference (<50ms) | MC Dropout (3-5 passes) | Minimal architectural changes; variance scales with pass count | Low compute overhead; slight latency increase |
| High-reliability safety systems | Weight Ensemble (5-10 models) | Maximizes variance reduction; theoretically grounded via law of total Bregman info | High memory/compute cost; linear scaling with ensemble size |
| Limited GPU memory | Test-Time Augmentation | Reuses single model; variance comes from input stochasticity | Negligible memory cost; moderate CPU overhead for transforms |
| Full posterior required | Bayesian Neural Networks | Provides parameter uncertainty and credible intervals | High training complexity; slower inference; not covered by Bregman framework |
| Real-time OOD filtering | Logit-Space Bregman Info | Superior discard rate vs softmax; distribution-free confidence bounds | Moderate compute; requires multi-pass sampling infrastructure |
Configuration Template
// uncertainty.config.ts
export interface UncertaintyPipelineConfig {
  samplingStrategy: 'mc_dropout' | 'ensemble' | 'tta';
  numPasses: number;
  varianceThreshold: number;
  fallbackRoute: 'human_review' | 'secondary_model' | 'safe_default';
  monitoring: {
    enabled: boolean;
    percentileTrack: number[]; // e.g., [0.5, 0.9, 0.95]
    driftTolerance: number; // % shift before re-calibration
  };
}

export const defaultConfig: UncertaintyPipelineConfig = {
  samplingStrategy: 'mc_dropout',
  numPasses: 5,
  varianceThreshold: 0.42, // Calibrated on validation OOD set
  fallbackRoute: 'secondary_model',
  monitoring: {
    enabled: true,
    percentileTrack: [0.5, 0.9, 0.95],
    driftTolerance: 0.15
  }
};
// uncertainty.engine.ts
// Tensor, InferenceModel, and UncertaintyResult are assumed to be provided
// by the surrounding inference framework.
import { computeBregmanInformation } from './math-utils';

export class UncertaintyEngine {
  private config: UncertaintyPipelineConfig;
  private varianceHistory: number[] = [];

  constructor(config: UncertaintyPipelineConfig) {
    this.config = config;
  }

  async evaluate(input: Tensor, model: InferenceModel): Promise<UncertaintyResult> {
    const logits: number[][] = [];
    for (let i = 0; i < this.config.numPasses; i++) {
      const passLogits = await model.forward(input, {
        applyDropout: this.config.samplingStrategy === 'mc_dropout',
        augment: this.config.samplingStrategy === 'tta'
      });
      logits.push(passLogits.toArray());
    }
    const samples = logits.map((values, idx) => ({ id: `pass_${idx}`, values }));
    const bregmanVar = computeBregmanInformation(samples);
    this.varianceHistory.push(bregmanVar);
    if (this.varianceHistory.length > 1000) this.varianceHistory.shift();
    const isOod = bregmanVar > this.config.varianceThreshold;
    return {
      variance: bregmanVar,
      isOod,
      route: isOod ? this.config.fallbackRoute : 'primary',
      timestamp: Date.now()
    };
  }
}
Quick Start Guide
- Extract raw logits: Modify your inference wrapper to bypass the final softmax layer. Return the pre-activation tensor directly to the uncertainty pipeline.
- Implement multi-pass sampling: Wrap your model inference in a loop that executes numPasses forward passes. Enable dropout at test time or apply stochastic augmentations depending on your latency budget.
- Compute Bregman information: Aggregate the logit samples, compute the stable LogSumExp for each pass, average the logits, compute LogSumExp of the average, and subtract. The result is your variance metric.
- Set empirical thresholds: Run the pipeline on a validation set containing both in-distribution and known OOD/corrupted samples. Plot the variance distribution and select a threshold that separates the two populations while meeting your acceptable reject rate.
- Integrate routing logic: Connect the variance metric to your production router. High-variance inputs trigger fallback handlers; low-variance inputs proceed through the standard pipeline. Enable monitoring to track variance drift and trigger automatic re-calibration when distribution shifts occur.
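Threshold selection (step 4 above) can be sketched as a percentile cut on the in-distribution variance population. The 95th-percentile choice below corresponds to a ~5% false-reject budget, and both it and the toy variance values are illustrative only:

```typescript
// Value at the p-th percentile of a sample (simple nearest-rank estimate).
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor(p * sorted.length));
  return sorted[idx];
}

// Cut the in-distribution variance population at (1 - rejectRate),
// capping the false-reject rate on clean data at roughly rejectRate.
function selectThreshold(idVariances: number[], rejectRate = 0.05): number {
  return percentile(idVariances, 1 - rejectRate);
}

// Toy variance distributions: ID concentrated low, OOD shifted high.
const idVars = [0.01, 0.02, 0.03, 0.05, 0.08, 0.04, 0.02, 0.06, 0.03, 0.07];
const oodVars = [0.4, 0.6, 0.35, 0.5, 0.7];

const threshold = selectThreshold(idVars);
const caught = oodVars.filter(v => v > threshold).length / oodVars.length;
```

Plotting both populations before committing to a threshold remains essential: if they overlap heavily, no single cut meets both the reject-rate and OOD-detection targets.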