Agent Benchmark Scores Are Measuring the Harness, Not the Model | Focused Labs

By Codcompass Team·2026-05-18·9 min read

The Infrastructure Illusion: Why Agentic Benchmarks Measure Your Sandbox, Not Your Model

Current Situation Analysis

Enterprise procurement teams are making model selection decisions based on a fundamental category error. Public leaderboards for agentic coding models are widely treated as IQ tests for the underlying LLM. In reality, they are stress tests for the evaluation harness. The score you see is a composite of model capability, resource allocation, sandbox stability, and retry logic. When the harness changes, the score changes, often independent of the model.

This discrepancy is not theoretical; it is quantifiable and significant. Recent infrastructure analysis from Anthropic on Terminal-Bench 2.0 found that varying only the resource budget for a single model configuration produced a 6 percentage point gap in success rate between the most and least resourced setups (p < 0.01). That gap is larger than the spread between most frontier models on public leaderboards.

The industry overlooks this because static evaluation metrics do not translate to agentic runtimes. Static evals score text output. Agentic evals score a model operating within a runtime environment that actively participates in the problem-solving process. The runtime decides if a container survives a transient memory spike, if a pip install completes, or if a test subprocess returns before timing out. Two agents running the same model on the same task set but with different resource budgets are effectively taking different exams.

Data from Anthropic's resource sweep highlights the mechanics of this illusion:

Strict Enforcement: 5.8% of tasks failed due to pod errors unrelated to model capability.
Uncapped Resources: Pod error rate dropped to 0.5%.
The Noise Floor: Success scores between 1x and 3x resource multipliers were statistically indistinguishable (p=0.40). The model failed tasks due to capability limits, not resources.
The Brute-Force Threshold: Beyond 3x resources, success scores climbed faster than infrastructure errors declined. Extra headroom allowed agents to attempt strategies that rely on generous allocations, such as installing multiple large packages simultaneously, running memory-intensive test suites, or spawning long-lived subprocesses.

The benchmark has shifted from measuring model capability to measuring the budget available to brute-force a solution. Without standardized infrastructure reporting, leaderboard numbers are incomparable artifacts of specific vendor environments.
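
One way to check whether a gap between two runs clears the noise floor is a two-proportion z-test on pass counts. The sketch below is a minimal illustration under stated assumptions, not the method used in the cited analysis: the function names and the sample counts are hypothetical, the normal CDF relies on the Abramowitz-Stegun approximation of erf, and the test treats the two runs as independent samples.

```typescript
// Abramowitz-Stegun 7.1.26 approximation of erf (absolute error around 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Two-sided p-value for the difference between two pass rates.
function twoProportionPValue(passA: number, totalA: number, passB: number, totalB: number): number {
  const rateA = passA / totalA;
  const rateB = passB / totalB;
  const pooled = (passA + passB) / (totalA + totalB);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  if (standardError === 0) return 1; // both runs all-pass or all-fail: no evidence of a difference
  const z = (rateA - rateB) / standardError;
  const cdf = 0.5 * (1 + erf(Math.abs(z) / Math.SQRT2));
  return 2 * (1 - cdf);
}

// Hypothetical pass counts: 60/100 under one resource budget, 40/100 under another.
const budgetGapPValue = twoProportionPValue(60, 100, 40, 100); // z is about 2.83, so p falls below 0.01
```

When both runs cover the same task set, a paired test on per-task outcomes is usually the tighter choice; the unpaired version here is only the simplest starting point.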

WOW Moment: Key Findings

The critical insight is that the "best" model on a leaderboard is often the model that best exploits unlimited resources, while the "best" model for production is the one that maximizes efficiency under constraints. These are rarely the same model.

Evaluation Mode	Resource Allocation	Primary Driver of Score	Production Predictive Value
Public Leaderboard	Uncapped / Vendor-Defined	Resource Brute-Force + Model Capability	Low. Scores inflate via memory/CPU abundance. Fails to predict OOM kills or timeout failures.
Constrained Bake-off	Production-Replica	Model Efficiency + Harness Robustness	High. Measures trajectory efficiency, error recovery, and success within actual operational limits.

Why this matters: Relying on unconstrained benchmarks leads to procurement of models that appear superior in isolation but degrade rapidly in production. An agent that wins a benchmark by installing 50 dependencies to solve a simple task will fail silently in a production environment with strict memory caps or network egress limits. Conversely, a leaner model that loses the leaderboard may be the only candidate capable of shipping reliably within your infrastructure constraints. The engineering work has moved from model selection to harness design.

Core Solution

To eliminate the infrastructure illusion, organizations must adopt a Harness-First Architecture. The evaluation process must invert: define the production constraints and harness topology first, then benchmark models within that fixed environment. The goal is to measure how well a model performs given the scaffolding it will actually use, not how well it performs on a vendor's Kubernetes cluster.

Implementation Strategy

Profile Production Constraints: Extract hard limits from your deployment environment. This includes memory ceilings, CPU quotas, timeout thresholds, retry budgets, and tool availability.
Construct the "Iron Cage": Build an evaluation harness that enforces these constraints strictly. This harness must replicate the production sandbox, including middleware for PII redaction, human-in-the-loop escalation, and context management.
Execute Model Swap: Run candidate models through the fixed harness. The only variable should be the model identifier. All tools, prompts, retry policies, and memory stores must remain constant.
Measure Trajectory Efficiency: Move beyond binary pass/fail metrics. Analyze the agent's trajectory: tool call precision, context consumption, retry usage, and resource efficiency. A model that succeeds with fewer tool calls and lower memory usage is superior for production, even if its raw success rate is marginally lower.
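
The "only the model identifier varies" rule from the model swap step can be enforced mechanically before any run starts. The following is a minimal sketch; `RunConfig` and `harnessDrift` are hypothetical names, and the fields shown are an assumed subset of what a real harness would pin.

```typescript
// Hypothetical snapshot of everything a single bake-off run depends on.
interface RunConfig {
  modelId: string;
  systemPrompt: string;
  tools: string[];
  maxRetries: number;
  memoryLimitMb: number;
}

// Returns the names of fields (other than modelId) that differ between two run configs.
// An empty result means the two runs differ only in the model under test.
function harnessDrift(a: RunConfig, b: RunConfig): string[] {
  const drift: string[] = [];
  if (a.systemPrompt !== b.systemPrompt) drift.push('systemPrompt');
  // Tool order is ignored; only the set of tool names matters here.
  if ([...a.tools].sort().join('\n') !== [...b.tools].sort().join('\n')) drift.push('tools');
  if (a.maxRetries !== b.maxRetries) drift.push('maxRetries');
  if (a.memoryLimitMb !== b.memoryLimitMb) drift.push('memoryLimitMb');
  return drift;
}
```

Refusing to start a bake-off when `harnessDrift` returns a non-empty list keeps accidental prompt or tool changes from being attributed to the model.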

Technical Implementation: Constrained Bake-off Runner

The following TypeScript example demonstrates a harness-first evaluation runner. Unlike public benchmarks, this runner enforces production constraints and measures trajectory efficiency.

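// './types' is a project-local module; it is assumed to export the harness, dataset,
// report, and error types used below.
import {
  AgentHarness, AgentTrajectory, EvalDataset, EvalReport, EvalResult,
  MemoryLimitExceededError, ResourceMetrics, TimeoutError,
} from './types';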

interface ProductionConstraints {
  memoryLimitMb: number;
  cpuTimeoutMs: number;
  maxRetries: number;
  allowedTools: string[];
}

interface ModelCandidate {
  id: string;
  provider: string;
}

class ConstrainedBakeoffRunner {
  private constraints: ProductionConstraints;
  private harness: AgentHarness;

  constructor(constraints: ProductionConstraints, harness: AgentHarness) {
    this.constraints = constraints;
    this.harness = harness;
  }

  async runBakeoff(candidates: ModelCandidate[], dataset: EvalDataset): Promise<EvalReport> {
    const results: EvalResult[] = [];

    for (const candidate of candidates) {
      // Reset harness state for each model to ensure isolation
      const modelHarness = this.harness.clone();
      modelHarness.setModel(candidate);
      modelHarness.applyConstraints(this.constraints);

      const trajectoryLog = await this.executeTrajectories(modelHarness, dataset);
      results.push(this.analyzeTrajectory(candidate, trajectoryLog));
    }

    return this.generateReport(results);
  }

  private async executeTrajectories(
    harness: AgentHarness,
    dataset: EvalDataset
  ): Promise<AgentTrajectory[]> {
    const trajectories: AgentTrajectory[] = [];

    for (const task of dataset) {
      const metrics: ResourceMetrics = {
        peakMemoryMb: 0,
        cpuTimeMs: 0,
        retryCount: 0,
        toolCalls: 0,
      };

      try {
        const result = await harness.invoke(task.input, {
          onResourceUpdate: (m) => this.updateMetrics(metrics, m),
          onToolCall: () => metrics.toolCalls++,
          maxRetries: this.constraints.maxRetries,
        });

        trajectories.push({
          taskId: task.id,
          success: result.status === 'success',
          metrics,
          trajectory: result.steps,
        });
      } catch (error) {
        // Distinguish between model failure and infra failure
        trajectories.push({
          taskId: task.id,
          success: false,
          metrics,
          errorType: this.classifyError(error),
          trajectory: [],
        });
      }
    }

    return trajectories;
  }

  private classifyError(error: unknown): 'model' | 'infra' | 'timeout' {
    if (error instanceof MemoryLimitExceededError) return 'infra';
    if (error instanceof TimeoutError) return 'timeout';
    return 'model';
  }

  private analyzeTrajectory(candidate: ModelCandidate, trajectories: AgentTrajectory[]): EvalResult {
    // Guard against an empty dataset so the rates below are 0 instead of NaN
    const total = Math.max(trajectories.length, 1);
    const successRate = trajectories.filter(t => t.success).length / total;
    const avgMemory = trajectories.reduce((sum, t) => sum + t.metrics.peakMemoryMb, 0) / total;
    const avgToolCalls = trajectories.reduce((sum, t) => sum + t.metrics.toolCalls, 0) / total;
    const infraFailureRate = trajectories.filter(t => t.errorType === 'infra').length / total;

    return {
      modelId: candidate.id,
      successRate,
      efficiencyScore: this.calculateEfficiency(successRate, avgMemory, avgToolCalls),
      infraFailureRate,
      avgMemory,
      avgToolCalls,
    };
  }

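  private calculateEfficiency(success: number, memory: number, tools: number): number {
    // Efficiency penalizes high resource usage and excessive tool calls
    // Higher score is better
    return success / (1 + (memory / 1024) + (tools * 0.1));
  }

  // Assumption: the harness reports partial resource snapshots shaped like ResourceMetrics.
  // Peak memory keeps the maximum seen; CPU time and retry count take the latest reported value.
  private updateMetrics(metrics: ResourceMetrics, update: Partial<ResourceMetrics>): void {
    metrics.peakMemoryMb = Math.max(metrics.peakMemoryMb, update.peakMemoryMb ?? 0);
    metrics.cpuTimeMs = update.cpuTimeMs ?? metrics.cpuTimeMs;
    metrics.retryCount = update.retryCount ?? metrics.retryCount;
  }

  // Assumption: EvalReport wraps the per-model results; adjust to the real type in './types'.
  private generateReport(results: EvalResult[]): EvalReport {
    const ranked = [...results].sort((a, b) => b.efficiencyScore - a.efficiencyScore);
    return { results: ranked };
  }
}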

Architecture Decisions:

Constraint Enforcement: The ProductionConstraints interface is applied directly to the harness. This ensures the evaluation environment matches production reality.
Error Classification: The classifyError method separates model failures from infrastructure failures and timeouts. This is critical for accurate scoring: reporting infraFailureRate alongside successRate keeps a crash caused by the harness from being read as a capability failure.
Efficiency Scoring: The calculateEfficiency function rewards models that achieve success with lower memory and fewer tool calls. This aligns the evaluation metric with production costs and reliability.
Trajectory Analysis: By capturing the full trajectory, you can analyze how the model approaches problems. This reveals behavioral patterns that binary metrics miss.
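
A short worked example shows how the efficiency formula in the runner above can rank a leaner model ahead of one with a higher raw success rate. The two sets of numbers are hypothetical and chosen only to illustrate the trade-off; the formula itself is copied from calculateEfficiency.

```typescript
// Same formula as calculateEfficiency in the runner, lifted out as a free function.
function efficiency(success: number, memoryMb: number, toolCalls: number): number {
  return success / (1 + memoryMb / 1024 + toolCalls * 0.1);
}

// Hypothetical candidates.
const lean = efficiency(0.80, 512, 10);   // 0.80 / (1 + 0.5 + 1.0) = 0.32
const heavy = efficiency(0.85, 2048, 25); // 0.85 / (1 + 2.0 + 2.5), roughly 0.155

console.log(lean > heavy ? 'lean model ranks first' : 'heavy model ranks first');
```

The constants 1024 and 0.1 set the exchange rate between memory, tool calls, and success. With these values ten tool calls cost as much as a gigabyte of memory, so they are worth calibrating against your own cost data before trusting the ranking.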

Pitfall Guide

1. The OOM Mirage

Explanation: Models that consume excessive memory may appear more capable because they can load larger contexts or run more complex subprocesses. In production, these agents trigger Out-Of-Memory kills. Fix: Enforce strict memory limits during evaluation. Penalize models that approach the memory ceiling, even if they succeed.
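
One possible way to penalize runs that approach the ceiling is a headroom multiplier applied to each task's score. This is a sketch under assumptions: the function name, the linear falloff, and the 80% soft threshold are illustrative choices, not values from the source.

```typescript
// Returns a multiplier in [0, 1]: 1 while peak memory stays at or below the soft threshold,
// falling linearly to 0 as peak memory reaches the hard limit.
function headroomFactor(peakMemoryMb: number, limitMb: number, softFraction = 0.8): number {
  const soft = limitMb * softFraction;
  if (peakMemoryMb <= soft) return 1;
  if (peakMemoryMb >= limitMb) return 0;
  return (limitMb - peakMemoryMb) / (limitMb - soft);
}
```

Multiplying a task's success value (1 or 0) by this factor means a run that passed at 95% of the limit counts for less than one that passed with room to spare.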

2. Retry Amnesia

Explanation: Aggressive retry policies can mask model errors. An agent that fails three times but succeeds on the fourth attempt may look successful, but it is inefficient and costly. Retries can also hide critical errors until they compound. Fix: Track retry counts as a primary metric. Implement a "retry budget" in the harness and fail the evaluation if the budget is exhausted. Log distinct attempts to analyze error patterns.
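
A retry budget that also keeps a per-attempt log could look roughly like the sketch below. `RetryBudget` and `AttemptRecord` are hypothetical names, and the convention that maxRetries counts retries after the first attempt is an assumption.

```typescript
interface AttemptRecord {
  attempt: number; // 1-based attempt index
  error: string;   // error message captured for that attempt
}

class RetryBudget {
  private readonly maxRetries: number;
  private readonly attempts: AttemptRecord[] = [];

  constructor(maxRetries: number) {
    this.maxRetries = maxRetries;
  }

  // Records a failed attempt and returns true if another attempt is still allowed.
  // With maxRetries = 3, the first three failures return true and the fourth returns false.
  recordFailure(error: string): boolean {
    this.attempts.push({ attempt: this.attempts.length + 1, error });
    return this.attempts.length <= this.maxRetries;
  }

  get failureCount(): number {
    return this.attempts.length;
  }

  // Distinct attempts, kept so error patterns can be analyzed after the run.
  get log(): readonly AttemptRecord[] {
    return this.attempts;
  }
}
```

Reporting failureCount for tasks that eventually pass is what exposes the "succeeded on the fourth try" cases that a binary pass rate hides.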

3. Context Shaping Drift

Explanation: The harness defines the agent's role through system prompts and context structure. As noted by Anna Bernad, effective harnesses "make the context describe a different room." If the context implies a reviewer role, the agent may soft-approve work regardless of model capability. This is a harness bug, not a model bug. Fix: Audit system prompts for behavioral cues. Ensure the context accurately reflects the desired agent behavior. Test multiple prompt variations to isolate model performance from prompt influence.

4. The "Kitchen Sink" Install

Explanation: Agents with unlimited resources may solve tasks by installing numerous packages, including large dependencies. This brute-force approach works in the benchmark but fails in production due to network limits, storage constraints, or security policies. Fix: Monitor package installation counts and sizes. Penalize agents that install unnecessary dependencies. Enforce network egress limits during evaluation.
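
Install counts can be approximated from the shell commands an agent issued. The sketch below is a rough heuristic, not a parser: the function name is hypothetical, it only recognizes `pip install` and `npm install`, and it miscounts flags that take a value (for example `-r requirements.txt` is counted as one package).

```typescript
// Counts package arguments passed to `pip install` / `npm install` in a log of shell commands.
function countInstalledPackages(commands: string[]): number {
  let count = 0;
  for (const command of commands) {
    // Split chained commands (&&, ||, ;, |) so each install is inspected on its own.
    for (const segment of command.split(/&&|\|\||[;|]/)) {
      const match = segment.match(/\b(?:pip3?|npm)\s+install\s+(.*)$/);
      if (!match) continue;
      // Drop empty tokens and option flags; what remains is treated as package names.
      const args = (match[1] ?? '').split(/\s+/).filter((arg) => arg !== '' && !arg.startsWith('-'));
      count += args.length;
    }
  }
  return count;
}
```

Package sizes are not visible from the command text; measuring them would need data from the sandbox itself, such as disk usage before and after each install.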

5. Vendor Sandbox Bias

Explanation: Vendor benchmarks often run on optimized sandboxes with specific configurations. Reproducing these scores requires identical infrastructure. Using these scores for procurement assumes your environment matches the vendor's, which is rarely true. Fix: Request detailed infrastructure specifications from vendors. If unavailable, treat the score as a marketing metric, not a technical benchmark. Run your own constrained bake-off.

6. The Noise Floor Trap

Explanation: Anthropic's data shows that between 1x and 3x resource multipliers, success scores are within noise (p=0.40). Over-provisioning resources in this range yields no capability gain but increases cost. Fix: Identify the resource threshold where performance plateaus. Allocate resources just above this threshold to avoid waste. Do not assume more resources always equal better performance.
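
Given a resource sweep of your own, the plateau can be located by picking the smallest budget whose score is within noise of the best one. This is a minimal sketch with assumed names (`SweepPoint`, `smallestSufficientBudget`); the tolerance should come from your measured run-to-run variance, and the sweep should only include budgets you could actually run in production, otherwise the brute-force regime sets the target.

```typescript
interface SweepPoint {
  multiplier: number;  // resource budget relative to the baseline spec
  successRate: number; // fraction of tasks passed at that budget
}

// Returns the smallest multiplier whose success rate is within `tolerance`
// of the best rate observed in the sweep.
function smallestSufficientBudget(sweep: SweepPoint[], tolerance: number): number {
  if (sweep.length === 0) throw new Error('sweep must contain at least one point');
  const best = Math.max(...sweep.map((p) => p.successRate));
  const sufficient = sweep.filter((p) => best - p.successRate <= tolerance);
  return Math.min(...sufficient.map((p) => p.multiplier));
}
```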

7. Trajectory Blindness

Explanation: Focusing solely on final output success ignores the path taken. Two agents may achieve the same result, but one may use a robust, efficient trajectory while the other relies on fragile, resource-heavy steps. Fix: Implement trajectory tracing. Compare actual tool-call paths to reference trajectories. Analyze tool call precision and context consumption.
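
Comparing an actual tool-call path to a reference trajectory can be sketched with a longest-common-subsequence match over tool names. The function names are hypothetical, and matching on tool name alone (ignoring arguments) is a simplifying assumption.

```typescript
// Length of the longest common subsequence of two tool-name sequences.
function lcsLength(a: string[], b: string[]): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = a[i - 1] === b[j - 1]
        ? dp[i - 1][j - 1] + 1
        : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  return dp[a.length][b.length];
}

// Precision: share of the agent's calls that line up with the reference, in order.
// Recall: share of the reference steps the agent actually covered.
function trajectoryMatch(actual: string[], reference: string[]): { precision: number; recall: number } {
  const matched = lcsLength(actual, reference);
  return {
    precision: actual.length === 0 ? 0 : matched / actual.length,
    recall: reference.length === 0 ? 0 : matched / reference.length,
  };
}
```

Two agents that both pass a task can then be told apart: one with precision near 1 followed the reference path, while one with low precision reached the same result through extra calls.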

Production Bundle

Action Checklist

Define Production Constraints: Document memory limits, CPU quotas, timeouts, and retry budgets for your deployment environment.
Build Constrained Harness: Create an evaluation harness that enforces these constraints strictly. Replicate production middleware and tooling.
Separate Infra Errors: Implement error classification to distinguish model failures from infrastructure failures.
Measure Efficiency: Track resource usage, tool call counts, and retry rates. Calculate efficiency scores alongside success rates.
Audit Context Prompts: Review system prompts for behavioral cues that may influence agent performance independently of the model.
Run Model Swap: Execute a bake-off with candidate models, keeping the harness constant. Only vary the model identifier.
Analyze Trajectories: Review agent trajectories for efficiency and robustness. Identify patterns that indicate fragility.
Validate Production Fit: Compare evaluation results against production metrics. Ensure the selected model performs well within actual constraints.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Cost-Sensitive Deployment	Efficiency-First Bake-off	Prioritizes models that succeed with minimal resources. Reduces infrastructure costs and improves scalability.	Lowers compute costs; may accept slightly lower raw accuracy.
Mission-Critical Systems	Uncapped Validation + Constrained Deploy	Validates model capability under ideal conditions, then deploys with constraints. Ensures safety and reliability.	Higher validation costs; deployment costs remain controlled.
Rapid Prototyping	Public Leaderboard Screening	Quick assessment of model capabilities. Useful for initial shortlisting before detailed bake-off.	Low initial cost; risk of procurement mismatch.
Strict Compliance Environment	Constrained Bake-off with Security Audit	Ensures model operates within security and compliance boundaries. Validates tool usage and data handling.	Moderate cost; essential for regulatory adherence.

Configuration Template

Use this JSON configuration as a starting point for defining production constraints in your evaluation harness.

{
  "evaluation_config": {
    "constraints": {
      "memory_limit_mb": 512,
      "cpu_timeout_ms": 30000,
      "max_retries": 3,
      "allowed_tools": ["file_read", "file_write", "shell_exec", "web_search"],
      "network_egress_limit_mb": 100
    },
    "metrics": {
      "track_resource_usage": true,
      "track_tool_calls": true,
      "track_retry_count": true,
      "error_classification": true
    },
    "scoring": {
      "success_weight": 0.6,
      "efficiency_weight": 0.4,
      "penalty_oom": 1.0,
      "penalty_timeout": 0.8
    }
  }
}
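
To feed this template into the runner, the snake_case keys need to be mapped onto the camelCase ProductionConstraints interface. The loader below is a hedged sketch: `parseConstraints` is a hypothetical helper, and it deliberately ignores network_egress_limit_mb because the ProductionConstraints interface shown earlier has no field for it.

```typescript
interface ProductionConstraints {
  memoryLimitMb: number;
  cpuTimeoutMs: number;
  maxRetries: number;
  allowedTools: string[];
}

// Maps the snake_case JSON template onto the runner's camelCase constraints,
// rejecting missing or malformed values instead of silently defaulting them.
function parseConstraints(json: string): ProductionConstraints {
  const raw = JSON.parse(json)?.evaluation_config?.constraints;
  if (typeof raw !== 'object' || raw === null) {
    throw new Error('missing evaluation_config.constraints');
  }
  const numberField = (key: string, min: number): number => {
    const value: unknown = raw[key];
    if (typeof value !== 'number' || !Number.isFinite(value) || value < min) {
      throw new Error(`${key} must be a number >= ${min}`);
    }
    return value;
  };
  const tools: unknown = raw.allowed_tools;
  if (!Array.isArray(tools) || !tools.every((t) => typeof t === 'string')) {
    throw new Error('allowed_tools must be an array of strings');
  }
  return {
    memoryLimitMb: numberField('memory_limit_mb', 1),
    cpuTimeoutMs: numberField('cpu_timeout_ms', 1),
    maxRetries: numberField('max_retries', 0), // zero retries is a valid budget
    allowedTools: tools as string[],
  };
}
```

Failing fast on a malformed config matters here because a silently defaulted memory limit would reintroduce exactly the unreported infrastructure variance the bake-off is meant to remove.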

Quick Start Guide

Export Production Metrics: Gather data on memory usage, CPU consumption, and error rates from your current production environment.
Create Constraint Config: Use the Configuration Template to define your production constraints. Adjust values based on your metrics.
Run Local Bake-off: Execute the Constrained Bake-off Runner with your candidate models. Ensure the harness enforces all constraints.
Select Model: Choose the model with the highest efficiency score and lowest infrastructure failure rate. Validate the selection against production requirements.
Deploy and Monitor: Deploy the selected model with the validated harness. Monitor production metrics to ensure performance matches evaluation results.
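
The selection step can be expressed as a filter followed by a sort. This is one possible reading of "highest efficiency score and lowest infrastructure failure rate", offered as a sketch: `EvalSummary` and `rankCandidates` are hypothetical names, and treating the infrastructure failure rate as a hard cutoff first and a tie-breaker second is an assumption.

```typescript
interface EvalSummary {
  modelId: string;
  efficiencyScore: number;
  infraFailureRate: number;
}

// Drops candidates above the infra failure cutoff, then orders the rest by
// efficiency (descending), breaking ties by infra failure rate (ascending).
function rankCandidates(results: EvalSummary[], maxInfraFailureRate: number): EvalSummary[] {
  return results
    .filter((r) => r.infraFailureRate <= maxInfraFailureRate)
    .sort((a, b) => b.efficiencyScore - a.efficiencyScore || a.infraFailureRate - b.infraFailureRate);
}
```

An empty result is a useful signal in itself: it means no candidate fits the current constraints and either the harness or the limits need revisiting before any model is chosen.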

The benchmark is not the product. The harness is the product. By owning the harness, measuring trajectories, and selecting models based on production fit, you eliminate the infrastructure illusion and build agents that ship reliably.
