ven the scaffolding it will actually use, not how well it performs on a vendor's Kubernetes cluster.
Implementation Strategy
- Profile Production Constraints: Extract hard limits from your deployment environment. This includes memory ceilings, CPU quotas, timeout thresholds, retry budgets, and tool availability.
- Construct the "Iron Cage": Build an evaluation harness that enforces these constraints strictly. This harness must replicate the production sandbox, including middleware for PII redaction, human-in-the-loop escalation, and context management.
- Execute Model Swap: Run candidate models through the fixed harness. The only variable should be the model identifier. All tools, prompts, retry policies, and memory stores must remain constant.
- Measure Trajectory Efficiency: Move beyond binary pass/fail metrics. Analyze the agent's trajectory: tool call precision, context consumption, retry usage, and resource efficiency. A model that succeeds with fewer tool calls and lower memory usage is usually the better production choice, even if its raw success rate is marginally lower.
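The "only the model identifier varies" rule from the model-swap step can be checked mechanically before two runs are compared. The following is a minimal sketch under stated assumptions: `RunConfig` and its field names are hypothetical and stand in for whatever your harness actually records.

```typescript
// Hypothetical run configuration; the field names are assumptions for illustration.
interface RunConfig {
  modelId: string;
  systemPrompt: string;
  tools: string[];
  maxRetries: number;
  memoryLimitMb: number;
}

// Serialize every field except the model identifier into a stable string.
function harnessFingerprint(config: RunConfig): string {
  const stable = {
    maxRetries: config.maxRetries,
    memoryLimitMb: config.memoryLimitMb,
    systemPrompt: config.systemPrompt,
    // Sort a copy so tool ordering does not count as a harness change.
    tools: [...config.tools].sort(),
  };
  return JSON.stringify(stable);
}

// True only when the two runs differ in model and in nothing else.
function onlyModelDiffers(a: RunConfig, b: RunConfig): boolean {
  return a.modelId !== b.modelId && harnessFingerprint(a) === harnessFingerprint(b);
}
```

Calling `onlyModelDiffers` on each pair of runs before scoring gives an early warning when a prompt tweak or retry change has slipped into a comparison.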
Technical Implementation: Constrained Bake-off Runner
The following TypeScript example demonstrates a harness-first evaluation runner. Unlike public benchmarks, this runner enforces production constraints and measures trajectory efficiency.
// Assumption: the harness types and error classes are exported from local modules;
// adjust these paths to match your project layout.
import {
  AgentHarness,
  AgentTrajectory,
  EvalDataset,
  EvalReport,
  EvalResult,
  ResourceMetrics,
} from './types';
import { MemoryLimitExceededError, TimeoutError } from './errors';

interface ProductionConstraints {
  memoryLimitMb: number;
  cpuTimeoutMs: number;
  maxRetries: number;
  allowedTools: string[];
}

interface ModelCandidate {
  id: string;
  provider: string;
}

class ConstrainedBakeoffRunner {
  private constraints: ProductionConstraints;
  private harness: AgentHarness;

  constructor(constraints: ProductionConstraints, harness: AgentHarness) {
    this.constraints = constraints;
    this.harness = harness;
  }

  async runBakeoff(candidates: ModelCandidate[], dataset: EvalDataset): Promise<EvalReport> {
    const results: EvalResult[] = [];
    for (const candidate of candidates) {
      // Clone the harness for each model so no state leaks between candidates
      const modelHarness = this.harness.clone();
      modelHarness.setModel(candidate);
      modelHarness.applyConstraints(this.constraints);
      const trajectoryLog = await this.executeTrajectories(modelHarness, dataset);
      results.push(this.analyzeTrajectory(candidate, trajectoryLog));
    }
    return this.generateReport(results);
  }

  private async executeTrajectories(
    harness: AgentHarness,
    dataset: EvalDataset
  ): Promise<AgentTrajectory[]> {
    const trajectories: AgentTrajectory[] = [];
    for (const task of dataset) {
      // Fresh counters per task; the callbacks below mutate them during the run
      const metrics: ResourceMetrics = {
        peakMemoryMb: 0,
        cpuTimeMs: 0,
        retryCount: 0,
        toolCalls: 0,
      };
      try {
        const result = await harness.invoke(task.input, {
          onResourceUpdate: (m) => this.updateMetrics(metrics, m),
          onToolCall: () => {
            metrics.toolCalls += 1;
          },
          maxRetries: this.constraints.maxRetries,
        });
        trajectories.push({
          taskId: task.id,
          success: result.status === 'success',
          metrics,
          trajectory: result.steps,
        });
      } catch (error) {
        // Record the failure category so limit breaches and timeouts
        // are not mistaken for reasoning errors
        trajectories.push({
          taskId: task.id,
          success: false,
          metrics,
          errorType: this.classifyError(error),
          trajectory: [],
        });
      }
    }
    return trajectories;
  }

  // Assumption: each update is a partial snapshot; keep the highest value seen per field
  private updateMetrics(metrics: ResourceMetrics, update: Partial<ResourceMetrics>): void {
    metrics.peakMemoryMb = Math.max(metrics.peakMemoryMb, update.peakMemoryMb ?? 0);
    metrics.cpuTimeMs = Math.max(metrics.cpuTimeMs, update.cpuTimeMs ?? 0);
    metrics.retryCount = Math.max(metrics.retryCount, update.retryCount ?? 0);
  }

  private classifyError(error: unknown): 'model' | 'infra' | 'timeout' {
    if (error instanceof MemoryLimitExceededError) return 'infra';
    if (error instanceof TimeoutError) return 'timeout';
    return 'model';
  }

  private analyzeTrajectory(candidate: ModelCandidate, trajectories: AgentTrajectory[]): EvalResult {
    // Guard against an empty dataset, which would otherwise produce NaN rates
    const total = Math.max(trajectories.length, 1);
    const successRate = trajectories.filter(t => t.success).length / total;
    const avgMemory = trajectories.reduce((sum, t) => sum + t.metrics.peakMemoryMb, 0) / total;
    const avgToolCalls = trajectories.reduce((sum, t) => sum + t.metrics.toolCalls, 0) / total;
    const infraFailureRate = trajectories.filter(t => t.errorType === 'infra').length / total;
    return {
      modelId: candidate.id,
      successRate,
      efficiencyScore: this.calculateEfficiency(successRate, avgMemory, avgToolCalls),
      infraFailureRate,
      avgMemory,
      avgToolCalls,
    };
  }

  private calculateEfficiency(success: number, memory: number, tools: number): number {
    // Penalizes high memory usage (in GB) and excessive tool calls (0.1 per call).
    // Higher score is better.
    return success / (1 + (memory / 1024) + (tools * 0.1));
  }

  // Assumption: EvalReport wraps the per-model results, ranked by efficiency score
  private generateReport(results: EvalResult[]): EvalReport {
    return { results: [...results].sort((a, b) => b.efficiencyScore - a.efficiencyScore) };
  }
}
Architecture Decisions:
- Constraint Enforcement: The ProductionConstraints interface is applied directly to the harness. This ensures the evaluation environment matches production reality.
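- Error Classification: The classifyError method separates model failures from infrastructure failures and timeouts. This is critical for accurate scoring; a model should not be blamed for an OOM kill caused by the harness itself. As written, the runner still counts such tasks as failures in successRate, so read successRate together with infraFailureRate.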
- Efficiency Scoring: The calculateEfficiency function rewards models that achieve success with lower memory and fewer tool calls. This aligns the evaluation metric with production costs and reliability.
- Trajectory Analysis: By capturing the full trajectory, you can analyze how the model approaches problems. This reveals behavioral patterns that binary metrics miss.
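To make the efficiency trade-off concrete, the sketch below applies the same formula as calculateEfficiency to two made-up candidates. The input numbers are illustrative assumptions, not measurements from any model.

```typescript
// Same formula as calculateEfficiency in the runner above.
function efficiency(success: number, memoryMb: number, toolCalls: number): number {
  return success / (1 + memoryMb / 1024 + toolCalls * 0.1);
}

// Illustrative inputs only: a "heavy" model with a higher raw success rate
// and a "lean" model that uses less memory and fewer tool calls.
const heavyModel = efficiency(0.90, 900, 20); // roughly 0.23
const leanModel = efficiency(0.85, 300, 8);   // roughly 0.41

console.log({ heavyModel, leanModel, leanWins: leanModel > heavyModel });
```

With these inputs the lean model scores higher despite a five-point lower success rate, because each tool call adds 0.1 to the denominator and each gigabyte of average peak memory adds 1.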
Pitfall Guide
1. The OOM Mirage
Explanation: Models that consume excessive memory may appear more capable because they can load larger contexts or run more complex subprocesses. In production, these agents trigger Out-Of-Memory kills.
Fix: Enforce strict memory limits during evaluation. Penalize models that approach the memory ceiling, even if they succeed.
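One possible way to penalize runs that approach the ceiling is a headroom multiplier applied to the task score. This is a sketch, not a prescribed formula: the function name, the linear ramp, and the default soft threshold of 80% are all assumptions.

```typescript
// Returns a multiplier in [0, 1]: 1 while peak usage stays at or below the soft
// threshold, falling linearly to 0 as usage reaches the hard limit.
function memoryHeadroomFactor(peakMb: number, limitMb: number, softFraction = 0.8): number {
  if (limitMb <= 0) throw new RangeError('limitMb must be positive');
  if (softFraction <= 0 || softFraction >= 1) {
    throw new RangeError('softFraction must be between 0 and 1 (exclusive)');
  }
  const usage = peakMb / limitMb;
  if (usage <= softFraction) return 1;
  if (usage >= 1) return 0;
  return (1 - usage) / (1 - softFraction);
}
```

Multiplying a successful task's score by this factor means a run that peaks at 90% of the limit keeps about half its credit, while one that stays under 80% is not penalized.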
2. Retry Amnesia
Explanation: Aggressive retry policies can mask model errors. An agent that fails three times but succeeds on the fourth attempt may look successful, but it is inefficient and costly. Retries can also hide critical errors until they compound.
Fix: Track retry counts as a primary metric. Implement a "retry budget" in the harness and fail the evaluation if the budget is exhausted. Log distinct attempts to analyze error patterns.
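A retry budget can be judged from the per-attempt log after the fact. The sketch below assumes a hypothetical `AttemptRecord` shape and treats `maxRetries` as the number of retries allowed after the first attempt; both are assumptions about how your harness logs attempts.

```typescript
// Hypothetical log entry for one attempt at a task.
interface AttemptRecord {
  ok: boolean;
  errorKind?: string;
}

interface RetryVerdict {
  passed: boolean;
  retriesUsed: number;
  budgetExhausted: boolean;
  errorKinds: string[]; // distinct error kinds seen before success or exhaustion
}

function judgeAttempts(attempts: AttemptRecord[], maxRetries: number): RetryVerdict {
  // One initial attempt plus maxRetries retries are allowed; later attempts are ignored.
  const allowed = attempts.slice(0, maxRetries + 1);
  const successIndex = allowed.findIndex(a => a.ok);
  const passed = successIndex !== -1;
  const failures = passed ? allowed.slice(0, successIndex) : allowed;
  const kinds = new Set<string>();
  for (const failure of failures) {
    if (failure.errorKind !== undefined) kinds.add(failure.errorKind);
  }
  return {
    passed,
    retriesUsed: passed ? successIndex : Math.max(0, allowed.length - 1),
    budgetExhausted: !passed && allowed.length === maxRetries + 1,
    errorKinds: [...kinds],
  };
}
```

Reporting `retriesUsed` next to the pass flag keeps a fourth-attempt success visibly different from a first-attempt success.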
3. Context Shaping Drift
Explanation: The harness defines the agent's role through system prompts and context structure. As noted by Anna Bernad, effective harnesses "make the context describe a different room." If the context implies a reviewer role, the agent may soft-approve work regardless of model capability. This is a harness bug, not a model bug.
Fix: Audit system prompts for behavioral cues. Ensure the context accurately reflects the desired agent behavior. Test multiple prompt variations to isolate model performance from prompt influence.
4. The "Kitchen Sink" Install
Explanation: Agents with unlimited resources may solve tasks by installing numerous packages, including large dependencies. This brute-force approach works in the benchmark but fails in production due to network limits, storage constraints, or security policies.
Fix: Monitor package installation counts and sizes. Penalize agents that install unnecessary dependencies. Enforce network egress limits during evaluation.
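Install counts can be approximated by scanning the shell commands recorded in a trajectory. This sketch assumes the harness exposes those commands as plain strings; the pattern covers a handful of common package managers and is illustrative, not exhaustive.

```typescript
// Matches a few common install commands. Anything not listed here is not counted.
const INSTALL_PATTERN =
  /^\s*(?:sudo\s+)?(?:pip3?\s+install|npm\s+(?:install|i)|yarn\s+add|apt(?:-get)?\s+install)\s+(.+)$/;

function countInstalledPackages(shellCommands: string[]): number {
  let count = 0;
  for (const command of shellCommands) {
    const match = INSTALL_PATTERN.exec(command);
    if (!match) continue;
    // Skip flags such as -y or --save-dev and count the remaining arguments.
    // Caveat: an argument that follows a flag (e.g. "-r requirements.txt") is
    // counted as one package, so treat the result as an estimate.
    const args = (match[1] ?? '').split(/\s+/);
    count += args.filter(arg => arg.length > 0 && !arg.startsWith('-')).length;
  }
  return count;
}
```

Comparing this count against a per-task budget gives a cheap signal for brute-force dependency installs; package sizes would need to come from network or disk accounting in the sandbox.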
5. Vendor Sandbox Bias
Explanation: Vendor benchmarks often run on optimized sandboxes with specific configurations. Reproducing these scores requires identical infrastructure. Using these scores for procurement assumes your environment matches the vendor's, which is rarely true.
Fix: Request detailed infrastructure specifications from vendors. If unavailable, treat the score as a marketing metric, not a technical benchmark. Run your own constrained bake-off.
6. The Noise Floor Trap
Explanation: Anthropic's data shows that between 1x and 3x resource multipliers, differences in success scores fall within noise (p=0.40). Over-provisioning resources in this range yields no capability gain but increases cost.
Fix: Identify the resource threshold where performance plateaus. Allocate resources just above this threshold to avoid waste. Do not assume more resources always equal better performance.
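Given a sweep of success rates at several resource multipliers, the plateau can be located as the smallest multiplier whose score is within a noise tolerance of the best observed score. The sketch below assumes you supply the tolerance yourself, for example from the spread of repeated runs; the `SweepPoint` type is hypothetical.

```typescript
interface SweepPoint {
  multiplier: number;  // resource allocation relative to the production baseline
  successRate: number; // measured success rate at that allocation
}

// Smallest multiplier whose success rate is within noiseTolerance of the best one.
function plateauThreshold(points: SweepPoint[], noiseTolerance: number): number {
  if (points.length === 0) throw new Error('need at least one sweep point');
  if (noiseTolerance < 0) throw new RangeError('noiseTolerance must be non-negative');
  const best = Math.max(...points.map(p => p.successRate));
  let threshold = Number.POSITIVE_INFINITY;
  for (const point of points) {
    const withinNoise = best - point.successRate <= noiseTolerance;
    if (withinNoise && point.multiplier < threshold) threshold = point.multiplier;
  }
  return threshold; // always finite: the best point itself qualifies
}
```

A larger tolerance pushes the threshold toward cheaper allocations, so the tolerance should reflect measured run-to-run variance instead of a guess.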
7. Trajectory Blindness
Explanation: Focusing solely on final output success ignores the path taken. Two agents may achieve the same result, but one may use a robust, efficient trajectory while the other relies on fragile, resource-heavy steps.
Fix: Implement trajectory tracing. Compare actual tool-call paths to reference trajectories. Analyze tool call precision and context consumption.
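Comparing an actual tool-call path to a reference trajectory can start with simple precision and recall over tool names. This sketch is an assumption about how to score the comparison: it matches calls as a multiset and deliberately ignores ordering and arguments.

```typescript
interface ToolCallScore {
  precision: number; // share of actual calls that the reference also contains
  recall: number;    // share of reference calls that the agent actually made
}

function scoreToolCalls(actual: string[], reference: string[]): ToolCallScore {
  // Count how many times each tool appears in the reference trajectory.
  const remaining = new Map<string, number>();
  for (const tool of reference) {
    remaining.set(tool, (remaining.get(tool) ?? 0) + 1);
  }
  // Each actual call can consume at most one matching reference call.
  let matched = 0;
  for (const tool of actual) {
    const left = remaining.get(tool) ?? 0;
    if (left > 0) {
      matched += 1;
      remaining.set(tool, left - 1);
    }
  }
  return {
    precision: actual.length === 0 ? 0 : matched / actual.length,
    recall: reference.length === 0 ? 0 : matched / reference.length,
  };
}
```

Low precision with full recall points to an agent that reaches the goal but wanders on the way, which is the pattern a pass/fail metric hides.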
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Cost-Sensitive Deployment | Efficiency-First Bake-off | Prioritizes models that succeed with minimal resources. Reduces infrastructure costs and improves scalability. | Lowers compute costs; may accept slightly lower raw accuracy. |
| Mission-Critical Systems | Uncapped Validation + Constrained Deploy | Validates model capability under ideal conditions, then deploys with constraints. Ensures safety and reliability. | Higher validation costs; deployment costs remain controlled. |
| Rapid Prototyping | Public Leaderboard Screening | Quick assessment of model capabilities. Useful for initial shortlisting before detailed bake-off. | Low initial cost; risk of procurement mismatch. |
| Strict Compliance Environment | Constrained Bake-off with Security Audit | Ensures model operates within security and compliance boundaries. Validates tool usage and data handling. | Moderate cost; essential for regulatory adherence. |
Configuration Template
Use this JSON configuration to define production constraints, tracked metrics, and scoring weights for your evaluation harness.
{
  "evaluation_config": {
    "constraints": {
      "memory_limit_mb": 512,
      "cpu_timeout_ms": 30000,
      "max_retries": 3,
      "allowed_tools": ["file_read", "file_write", "shell_exec", "web_search"],
      "network_egress_limit_mb": 100
    },
    "metrics": {
      "track_resource_usage": true,
      "track_tool_calls": true,
      "track_retry_count": true,
      "error_classification": true
    },
    "scoring": {
      "success_weight": 0.6,
      "efficiency_weight": 0.4,
      "penalty_oom": 1.0,
      "penalty_timeout": 0.8
    }
  }
}
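Before a bake-off starts, it may help to validate the configuration so a typo does not silently loosen a constraint. The sketch below checks a subset of the template's fields; the specific rules, including the requirement that the two weights sum to 1, are assumptions and not part of the template itself.

```typescript
// Subset of the template's fields that this sketch validates.
interface EvaluationConfig {
  constraints: {
    memory_limit_mb: number;
    cpu_timeout_ms: number;
    max_retries: number;
    allowed_tools: string[];
  };
  scoring: { success_weight: number; efficiency_weight: number };
}

// Returns a list of human-readable problems; an empty list means the config passed.
function validateConfig(rawJson: string): string[] {
  const parsed = JSON.parse(rawJson) as { evaluation_config?: Partial<EvaluationConfig> };
  const constraints = parsed.evaluation_config?.constraints;
  const scoring = parsed.evaluation_config?.scoring;
  if (!constraints || !scoring) {
    return ['evaluation_config.constraints and evaluation_config.scoring are required'];
  }
  const problems: string[] = [];
  if (!(constraints.memory_limit_mb > 0)) problems.push('memory_limit_mb must be positive');
  if (!(constraints.cpu_timeout_ms > 0)) problems.push('cpu_timeout_ms must be positive');
  if (!Number.isInteger(constraints.max_retries) || constraints.max_retries < 0) {
    problems.push('max_retries must be a non-negative integer');
  }
  if (!Array.isArray(constraints.allowed_tools) || constraints.allowed_tools.length === 0) {
    problems.push('allowed_tools must list at least one tool');
  }
  // Assumption: the weights are meant to sum to 1, as they do in the template.
  if (Math.abs(scoring.success_weight + scoring.efficiency_weight - 1) > 1e-9) {
    problems.push('success_weight and efficiency_weight must sum to 1');
  }
  return problems;
}
```

Running this check at harness startup and aborting on a non-empty result keeps a misconfigured run from producing scores that look valid.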
Quick Start Guide
- Export Production Metrics: Gather data on memory usage, CPU consumption, and error rates from your current production environment.
- Create Constraint Config: Use the Configuration Template to define your production constraints. Adjust values based on your metrics.
- Run Local Bake-off: Execute the Constrained Bake-off Runner with your candidate models. Ensure the harness enforces all constraints.
- Select Model: Choose the model with the highest efficiency score and lowest infrastructure failure rate. Validate the selection against production requirements.
- Deploy and Monitor: Deploy the selected model with the validated harness. Monitor production metrics to ensure performance matches evaluation results.
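The selection step can be sketched as a filter followed by a ranking. The ceiling on infrastructure failures and the tie-break order below are assumptions about how to combine the two criteria; `CandidateResult` mirrors the fields the runner's analyzeTrajectory method returns.

```typescript
// Mirrors the fields produced by analyzeTrajectory in the runner.
interface CandidateResult {
  modelId: string;
  efficiencyScore: number;
  infraFailureRate: number;
}

// Drop candidates above the infra failure ceiling, then pick the highest
// efficiency score, breaking ties in favor of the lower infra failure rate.
function selectModel(
  results: CandidateResult[],
  maxInfraFailureRate: number
): CandidateResult | undefined {
  const eligible = results.filter(r => r.infraFailureRate <= maxInfraFailureRate);
  eligible.sort(
    (a, b) => b.efficiencyScore - a.efficiencyScore || a.infraFailureRate - b.infraFailureRate
  );
  return eligible[0]; // undefined when no candidate meets the ceiling
}
```

Returning `undefined` when no candidate qualifies forces an explicit decision to relax the ceiling or widen the candidate pool, instead of shipping the least bad option by default.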
The benchmark is not the product. The harness is the product. By owning the harness, measuring trajectories, and selecting models based on production fit, you eliminate the infrastructure illusion and build agents that ship reliably.