The Agent Harness Is the Real Product. The Model Is Just the Engine.

By Codcompass Team·2026-05-17·9 min read

The Agent Harness Architecture: Engineering Context, Tools, and Evaluation Loops

Current Situation Analysis

The industry has spent the last two years fixated on model leaderboards. Engineering teams optimize for SWE-bench scores, chase the latest Sonnet or GPT release, and assume that upgrading the model automatically upgrades their AI coding workflow. This model-centric view ignores the reality of production systems: the model is a stochastic engine, but the harness—the deterministic code wrapping that engine—determines whether the system succeeds or fails.

This misconception persists because harness engineering is invisible in demos. It lives in context window management, tool schema mapping, output compression, and evaluation pipelines. When a model "hallucinates" or "forgets," it is rarely a raw intelligence failure; it is usually a harness failure where the context assembler dropped critical state or the tool executor returned unstructured noise.

The evidence for this shift is now empirical. Internal benchmarks from the VS Code team, specifically VSC-Bench, reveal a counter-intuitive finding about reasoning effort: when reasoning effort is raised from high to xhigh, the system burns significantly more tokens but resolves fewer tasks. The data points to a "useful effort sweet spot." Beyond this threshold, additional computation does not improve outcomes; it degrades them. This suggests that the value a system extracts from raw model capability is bounded by harness design. Without a robust harness to constrain context, manage tools, and evaluate results, increasing model power yields diminishing or negative returns.

WOW Moment: Key Findings

The critical insight from recent benchmarking is that harness tuning outperforms model escalation. The following comparison illustrates the trade-offs observed in containerized agent runs.

Approach	Token Efficiency	Task Resolution	System Stability
Model Escalation (`xhigh` reasoning)	Low (High burn rate)	Decreases (Regression)	Unpredictable
Harness Tuning (Context compression + Per-model tools)	High (Optimized budget)	Increases (Peak performance)	Deterministic
Generic Harness (One-size-fits-all config)	Medium	Plateaued	Fragile across models

Why this matters: The xhigh regression demonstrates that "more thinking" is not a universal good. It consumes context budget and can lead to over-optimization or loop degradation. Teams that invest in harness engineering—specifically context assembly, tool adaptation, and closed-loop evaluation—achieve higher resolution rates at lower costs than teams chasing model upgrades. This enables reliable, production-grade AI agents that function consistently across different model families.

Core Solution

Building a production-ready agent harness requires treating the system as a control loop with three distinct responsibilities: Context Assembly, Tool Exposure, and Execution Control. The harness must adapt dynamically to the model family, as different models exhibit distinct behaviors regarding tool calling, history management, and reasoning depth.

Architecture Decisions

Per-Model Tool Mapping: Models do not share identical tool capabilities. For example, Claude-based models may prefer replace_string_in_file for edits, while GPT-based models perform better with apply_patch. Gemini models may require explicit reminders to use tool calls rather than narrating actions. The harness must include an adapter layer that maps internal operations to model-specific tool schemas.
Progressive Context Loading: Context windows are finite. The harness should implement progressive disclosure for skills and extensions. Metadata loads first; full bodies load only when relevant. This preserves budget for the active task.
Output Compression: Tool outputs can be massive (e.g., npm install logs). The harness must compress or truncate outputs before they enter the context window to prevent "context poisoning."
Closed-Loop Evaluation: Quality cannot be assumed. The harness must integrate with an evaluation pipeline that runs containerized tasks on every change, measuring resolution rates and token costs.

Implementation: TypeScript Harness

The following code demonstrates a modular harness architecture. It separates concerns, implements per-model tool mapping, and includes context management.

1. Harness Configuration and Model Adapter

// config/harness.types.ts
export interface HarnessConfig {
  modelFamily: 'claude' | 'gpt' | 'gemini';
  contextBudget: number;
  reasoningLevel: 'medium' | 'high'; // Avoid 'xhigh' based on benchmarks
  compressionThreshold: number;
}

export interface ToolDefinition {
  name: string;
  schema: Record<string, any>;
  description: string;
}

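// adapters/tool-mapper.ts
interface FamilyProfile {
  tools: Record<string, string>;     // internal operation -> model-facing tool name
  requiresToolCallReminder: boolean; // true for families that tend to narrate actions
}

export class ToolMapper {
  private static readonly PROFILES: Record<string, FamilyProfile> = {
    claude: { tools: { fileEdit: 'replace_string_in_file', fileRead: 'read_file' }, requiresToolCallReminder: false },
    gpt: { tools: { fileEdit: 'apply_patch', fileRead: 'read_file' }, requiresToolCallReminder: false },
    gemini: { tools: { fileEdit: 'apply_patch', fileRead: 'read_file' }, requiresToolCallReminder: true }
  };

  static resolveToolName(family: string, operation: string): string {
    const profile = ToolMapper.PROFILES[family];
    if (!profile) throw new Error(`Unsupported model family: ${family}`);
    // Fall back to the internal operation name when no mapping exists
    return profile.tools[operation] ?? operation;
  }

  static needsToolCallReminder(family: string): boolean {
    return ToolMapper.PROFILES[family]?.requiresToolCallReminder ?? false;
  }
}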

2. Context Assembler with Compression

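// context/context-assembler.ts
export interface ToolHistoryEntry {
  tool: string;
  output?: string;
}

export interface SkillMetadata {
  id: string;
  metadata: string;
}

export class ContextAssembler {
  // Both limits are measured in characters here; use a tokenizer for token-accurate budgets
  private budget: number;
  private compressionThreshold: number;

  constructor(budget: number, compressionThreshold: number) {
    this.budget = budget;
    this.compressionThreshold = compressionThreshold;
  }

  assemble(userMessage: string, toolHistory: ToolHistoryEntry[], skillMetadata: SkillMetadata[]): string {
    const sections = [
      userMessage,
      this.formatSkills(skillMetadata),    // metadata only (progressive disclosure)
      this.formatToolHistory(toolHistory)  // large outputs are replaced by a stub
    ].filter(section => section.length > 0);

    const context = sections.join('\n\n');

    // Enforce budget
    return context.length > this.budget
      ? this.truncateContext(userMessage, context)
      : context;
  }

  private formatSkills(skills: SkillMetadata[]): string {
    return skills.map(skill => `[Skill: ${skill.id}] ${skill.metadata}`).join('\n');
  }

  private formatToolHistory(history: ToolHistoryEntry[]): string {
    return history.map(entry => {
      const output = entry.output ?? '';
      if (output.length > this.compressionThreshold) {
        return `[Tool: ${entry.tool}] Output compressed (${output.length} chars omitted)`;
      }
      return `[Tool: ${entry.tool}] ${output}`;
    }).join('\n');
  }

  private truncateContext(userMessage: string, context: string): string {
    // Placeholder strategy: always keep the task, then fill what is left of the
    // budget with the most recent context. Replace with LRU or semantic truncation.
    // A task longer than the budget is returned whole rather than cut.
    const rest = context.slice(userMessage.length);
    const remaining = Math.max(0, this.budget - userMessage.length);
    return userMessage + (remaining > 0 ? rest.slice(-remaining) : '');
  }
}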

3. Skill Loader with Progressive Disclosure

// skills/skill-loader.ts
export interface SkillManifest {
  id: string;
  metadata: string; // ~100 tokens
  body: string;     // ~5000 tokens
  allowedTools: string[];
}

export class SkillLoader {
  async loadSkill(id: string, relevanceScore: number): Promise<SkillManifest> {
    const manifest = await this.fetchManifest(id);
    
    // Only load full body if relevant
    if (relevanceScore > 0.7) {
      return {
        ...manifest,
        body: await this.fetchBody(id)
      };
    }
    
    return { ...manifest, body: '' };
  }

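  private async fetchManifest(id: string): Promise<SkillManifest> {
    // Fetch from agentskills.io or local registry
    return {
      id,
      metadata: `Skill: ${id} - Description...`,
      body: '',
      allowedTools: ['git_log', 'write_markdown']
    };
  }

  private async fetchBody(id: string): Promise<string> {
    // Placeholder: load the full skill body from the same registry
    return `Full instructions for skill ${id}...`;
  }
}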

4. Agent Loop Controller

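// loop/agent-loop.ts
// Minimal shapes assumed for illustration; adapt them to your model client.
interface ToolCall { name: string; args: Record<string, unknown>; }
interface ModelResponse { toolCalls?: ToolCall[]; finalAnswer?: string; }
interface ModelClient {
  complete(request: {
    context: string;
    tools: ToolDefinition[];
    systemPrompt: string;
    reasoningLevel: HarnessConfig['reasoningLevel'];
  }): Promise<ModelResponse>;
}
type ToolExecutor = (call: ToolCall) => Promise<string>;

export class AgentLoop {
  constructor(
    private config: HarnessConfig,
    private assembler: ContextAssembler,
    private tools: ToolDefinition[], // named by internal operation, e.g. 'fileEdit'
    private executeTool: ToolExecutor,
    private modelClient: ModelClient
  ) {}

  async run(task: string): Promise<string> {
    const history: { tool: string; output: string }[] = [];
    const maxRounds = 10;

    // Inject a tool-call reminder for families that tend to narrate (e.g. Gemini)
    const systemPrompt = ToolMapper.needsToolCallReminder(this.config.modelFamily)
      ? 'You must use tool calls. Do not narrate actions.'
      : '';

    for (let round = 0; round < maxRounds; round++) {
      const response = await this.modelClient.complete({
        context: this.assembler.assemble(task, history, []),
        tools: this.getToolsForModel(),
        systemPrompt,
        reasoningLevel: this.config.reasoningLevel
      });

      if (response.toolCalls?.length) {
        // Record one history entry per call so the assembler can compress each output
        for (const call of response.toolCalls) {
          history.push({ tool: call.name, output: await this.executeTool(call) });
        }
      } else if (response.finalAnswer) {
        return response.finalAnswer;
      } else {
        // Neither a tool call nor an answer: record it so the next round can recover
        history.push({ tool: 'harness', output: 'Model failed to produce valid tool call or answer.' });
      }
    }

    throw new Error('Agent loop exceeded maximum rounds.');
  }

  private getToolsForModel(): ToolDefinition[] {
    // Rename internal operations to the tool names this model family expects
    return this.tools.map(tool => ({
      ...tool,
      name: ToolMapper.resolveToolName(this.config.modelFamily, tool.name)
    }));
  }
}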

Rationale

Separation of Concerns: The ToolMapper isolates model-specific logic. Adding a new model family requires only updating the mapping, not rewriting the loop.
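Budget Enforcement: The ContextAssembler enforces the context budget before every model call, preventing overflow. What survives an overflow depends on the truncation strategy, so that strategy should be chosen to retain the task statement and the most recent state.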
Reasoning Cap: The configuration explicitly avoids xhigh reasoning levels, aligning with benchmark data that shows degradation at that tier.
Skill Gating: The SkillLoader implements progressive disclosure, loading full skill bodies only when relevant, preserving context for the task.

Pitfall Guide

1. The "Universal Prompt" Fallacy

Explanation: Using a single system prompt for all model families. Models have different training data and instruction-following behaviors. A prompt optimized for Claude may confuse GPT or Gemini.

Fix: Implement per-model system prompts. The harness should select the prompt template based on modelFamily. Tune prompts against pre-release checkpoints to ensure compatibility.
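
Per-model prompt selection can be sketched as a small registry keyed by model family. The template wording and the {{editTool}} placeholder below are hypothetical and are not taken from any real harness; only the tool names come from this article.

```typescript
type ModelFamily = 'claude' | 'gpt' | 'gemini';

// Hypothetical templates: the wording is illustrative, not tuned against any model.
const PROMPT_TEMPLATES: Record<ModelFamily, string> = {
  claude: 'You are a coding agent. Edit files with {{editTool}}.',
  gpt: 'You are a coding agent. Express every change as a patch via {{editTool}}.',
  gemini: 'You are a coding agent. Always emit tool calls; never narrate actions. Edit files with {{editTool}}.',
};

// Select the family's template and fill in the model-facing edit tool name.
function buildSystemPrompt(family: ModelFamily, editTool: string): string {
  // split/join replaces every placeholder occurrence without regex escaping.
  return PROMPT_TEMPLATES[family].split('{{editTool}}').join(editTool);
}
```

Keeping the templates in one typed record means that adding a family to the ModelFamily union without a matching template fails at compile time.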

2. Context Poisoning via Tool Output

Explanation: Returning raw, unstructured tool output (e.g., full build logs) to the model. This consumes context budget and can distract the model with irrelevant details.

Fix: Implement output compression in the ToolExecutor. Truncate long outputs, summarize diffs, and omit progress bars. Use settings like chat.tools.compressOutput.enabled to automatically trim terminal output.
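
One possible shape for that compression step is sketched below. The function name compressToolOutput and the head-plus-tail strategy are assumptions for illustration: the idea is to collapse progress-bar rewrites, then keep the start and end of a long log, since build errors usually appear near the end.

```typescript
// Hypothetical helper: trims a raw tool output to roughly maxChars characters.
function compressToolOutput(output: string, maxChars: number): string {
  const cleaned = output
    .split('\n')
    .map(line => {
      // Progress bars redraw a line with carriage returns; keep only the final state.
      const parts = line.replace(/\r+$/, '').split('\r');
      return parts[parts.length - 1];
    })
    .filter(line => line.trim().length > 0) // drop blank lines
    .join('\n');

  if (cleaned.length <= maxChars) return cleaned;

  // Keep the head and the tail; the result is maxChars plus a short marker.
  const half = Math.floor(maxChars / 2);
  const head = cleaned.slice(0, half);
  const tail = cleaned.slice(cleaned.length - half);
  const omitted = cleaned.length - head.length - tail.length;
  return `${head}\n[... ${omitted} chars omitted ...]\n${tail}`;
}
```

Unlike replacing the whole output with a stub, this keeps the final error message visible to the model while still bounding the cost of a noisy command.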

3. Ignoring Tool Schema Divergence

Explanation: Assuming all models support the same tool names and arguments. For example, replace_string_in_file may not work well with GPT, which prefers apply_patch.

Fix: Use a ToolMapper to translate internal operations to model-specific tool schemas. Validate tool calls against the model's expected format before execution.
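
The validation step could look like the following sketch. The argument names (path, oldString, newString, patch) are hypothetical placeholders, since the article does not specify the real tool schemas; only the tool names are taken from the text.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface ToolSpec {
  name: string;
  requiredArgs: string[];
}

// Hypothetical specs: argument names are assumptions, not real tool schemas.
const TOOL_SPECS: Record<string, ToolSpec[]> = {
  claude: [
    { name: 'replace_string_in_file', requiredArgs: ['path', 'oldString', 'newString'] },
    { name: 'read_file', requiredArgs: ['path'] },
  ],
  gpt: [
    { name: 'apply_patch', requiredArgs: ['patch'] },
    { name: 'read_file', requiredArgs: ['path'] },
  ],
};

// Returns a list of problems; an empty list means the call is safe to execute.
function validateToolCall(family: string, call: ToolCall): string[] {
  const specs = TOOL_SPECS[family];
  if (!specs) return [`Unsupported model family: ${family}`];

  const spec = specs.filter(candidate => candidate.name === call.name)[0];
  if (!spec) return [`Tool "${call.name}" is not exposed to ${family}`];

  return spec.requiredArgs
    .filter(arg => !Object.prototype.hasOwnProperty.call(call.args, arg))
    .map(arg => `Missing required argument "${arg}" for ${call.name}`);
}
```

Returning the problems as strings, rather than throwing, lets the harness feed them back to the model as a tool result so it can retry with a corrected call.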

4. Reasoning Escalation Trap

Explanation: Increasing reasoning effort (xhigh) to solve difficult tasks. Benchmarks show this increases token cost and decreases resolution rates due to over-optimization or loop degradation.

Fix: Cap reasoning effort at high. Invest in harness tuning (context assembly, tool exposure) instead of reasoning escalation. Measure the ROI of reasoning levels using a closed-loop eval.
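
Measuring that ROI comes down to two numbers per reasoning level: resolution rate and tokens spent per resolved task. The sketch below shows one way to compare levels. The sample figures are invented placeholders chosen to mirror the shape of the finding (more tokens, fewer resolutions at xhigh); they are not VSC-Bench results.

```typescript
interface EvalRun {
  reasoningLevel: string;
  tasksAttempted: number;
  tasksResolved: number;
  totalTokens: number;
}

interface LevelSummary {
  reasoningLevel: string;
  resolutionRate: number;        // resolved / attempted
  tokensPerResolvedTask: number; // Infinity when nothing was resolved
}

function summarizeRun(run: EvalRun): LevelSummary {
  return {
    reasoningLevel: run.reasoningLevel,
    resolutionRate: run.tasksAttempted > 0 ? run.tasksResolved / run.tasksAttempted : 0,
    tokensPerResolvedTask: run.tasksResolved > 0 ? run.totalTokens / run.tasksResolved : Infinity,
  };
}

// Prefer the highest resolution rate; break ties with the cheaper level.
function pickReasoningLevel(runs: EvalRun[]): LevelSummary {
  if (runs.length === 0) throw new Error('No eval runs to compare.');
  return runs.map(summarizeRun).reduce((best, candidate) => {
    if (candidate.resolutionRate !== best.resolutionRate) {
      return candidate.resolutionRate > best.resolutionRate ? candidate : best;
    }
    return candidate.tokensPerResolvedTask < best.tokensPerResolvedTask ? candidate : best;
  });
}

// Placeholder numbers for illustration only; they are not benchmark data.
const sampleRuns: EvalRun[] = [
  { reasoningLevel: 'medium', tasksAttempted: 100, tasksResolved: 52, totalTokens: 1040000 },
  { reasoningLevel: 'high', tasksAttempted: 100, tasksResolved: 61, totalTokens: 1830000 },
  { reasoningLevel: 'xhigh', tasksAttempted: 100, tasksResolved: 57, totalTokens: 3990000 },
];
const chosenLevel = pickReasoningLevel(sampleRuns);
```

With these placeholder inputs the function selects high: xhigh costs more than twice as many tokens per resolved task and still resolves fewer tasks.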

5. Orphaned Tool Calls

Explanation: Some models (e.g., Gemini) may describe actions instead of calling tools, or fail on dangling tool calls in history. This breaks the agent loop.

Fix: Implement history sanitization. Detect orphaned calls and inject reminders. For Gemini, use explicit tool-call enforcement hooks and validate that tool calls are present in the response.
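
A minimal sanitization pass might pair every tool call with a result before the history is sent back to the model. The HistoryEntry shape below is a hypothetical simplification; real provider APIs structure calls and results differently.

```typescript
// Hypothetical history shape, simplified for illustration.
type HistoryEntry =
  | { kind: 'text'; content: string }
  | { kind: 'toolCall'; id: string; tool: string }
  | { kind: 'toolResult'; callId: string; output: string };

function sanitizeHistory(entries: HistoryEntry[]): HistoryEntry[] {
  const called: Record<string, boolean> = {};
  const answered: Record<string, boolean> = {};
  for (const entry of entries) {
    if (entry.kind === 'toolCall') called[entry.id] = true;
    if (entry.kind === 'toolResult') answered[entry.callId] = true;
  }

  const sanitized: HistoryEntry[] = [];
  for (const entry of entries) {
    // Drop results that refer to a call no longer present in the history.
    if (entry.kind === 'toolResult' && !called[entry.callId]) continue;
    sanitized.push(entry);
    // Close dangling calls with a synthetic result so every call is paired.
    if (entry.kind === 'toolCall' && !answered[entry.id]) {
      sanitized.push({
        kind: 'toolResult',
        callId: entry.id,
        output: '[no result recorded: call was not executed]',
      });
    }
  }
  return sanitized;
}
```

Inserting a synthetic result, instead of deleting the dangling call, preserves the record that the model attempted the action, which can help it decide whether to retry.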

6. Eval Blindness

Explanation: Relying on manual demos or unit tests for JSON schemas. This misses regressions in the agent loop, context assembly, or tool execution.

Fix: Implement a closed-loop evaluation pipeline. Run containerized tasks on every PR, measuring resolution rates and token costs. Use benchmarks like VSC-Bench to catch harness regressions before merge.
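
The merge gate at the end of such a pipeline can be as simple as comparing a candidate run against a stored baseline. In this sketch the result shape and both thresholds (a 2-point drop in resolution rate, a 10% rise in mean tokens) are assumptions to be tuned per project; running the containerized tasks themselves is out of scope here.

```typescript
interface TaskResult {
  taskId: string;
  resolved: boolean;
  tokens: number;
}

interface EvalSummary {
  resolutionRate: number;
  meanTokens: number;
}

function summarizeEval(results: TaskResult[]): EvalSummary {
  if (results.length === 0) return { resolutionRate: 0, meanTokens: 0 };
  const resolved = results.filter(result => result.resolved).length;
  const tokens = results.reduce((sum, result) => sum + result.tokens, 0);
  return { resolutionRate: resolved / results.length, meanTokens: tokens / results.length };
}

// Returns the reasons a candidate should be blocked; empty means it may merge.
function findRegressions(
  baseline: EvalSummary,
  candidate: EvalSummary,
  maxRateDrop: number = 0.02,    // hypothetical threshold
  maxTokenGrowth: number = 0.10  // hypothetical threshold
): string[] {
  const failures: string[] = [];
  if (baseline.resolutionRate - candidate.resolutionRate > maxRateDrop) {
    failures.push(`Resolution rate fell from ${baseline.resolutionRate} to ${candidate.resolutionRate}`);
  }
  if (baseline.meanTokens > 0 && candidate.meanTokens / baseline.meanTokens - 1 > maxTokenGrowth) {
    failures.push(`Mean tokens per task rose from ${baseline.meanTokens} to ${candidate.meanTokens}`);
  }
  return failures;
}
```

Tracking token cost alongside resolution rate matters because a harness change can hold the resolution rate steady while quietly doubling the spend.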

7. Hardcoding Skill Context

Explanation: Loading all skill bodies into the context window at startup. This wastes budget and slows down the agent.

Fix: Implement progressive disclosure for skills. Load metadata first, then load full bodies only when the model indicates relevance. Use SKILL.md files with capped token limits.
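
The disclosure step can be sketched as a renderer that always emits metadata and expands a body only for skills the model has asked for. The Skill shape, the requestedIds parameter, and the four-characters-per-token estimate are all assumptions for illustration; a real harness would use the model's tokenizer.

```typescript
interface Skill {
  id: string;
  metadata: string; // short description, always disclosed
  body: string;     // full instructions, disclosed on request
}

// Rough heuristic, assumed here: about 4 characters per token.
const CHARS_PER_TOKEN = 4;

function discloseSkills(skills: Skill[], requestedIds: string[], bodyTokenCap: number): string {
  const maxChars = bodyTokenCap * CHARS_PER_TOKEN;
  return skills
    .map(skill => {
      const header = `[skill:${skill.id}] ${skill.metadata}`;
      // Skills the model has not asked for stay metadata-only.
      if (requestedIds.indexOf(skill.id) === -1) return header;
      const body = skill.body.length > maxChars
        ? skill.body.slice(0, maxChars) + '\n[body truncated at cap]'
        : skill.body;
      return header + '\n' + body;
    })
    .join('\n');
}
```

On the first round requestedIds is empty, so the model sees only the headers; the harness adds an id once the model indicates that the skill is relevant.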

Production Bundle

Action Checklist

Define Context Budget: Set explicit limits for context window usage and implement compression thresholds.
Map Tools Per Model: Create a ToolMapper to handle model-specific tool schemas and arguments.
Implement Output Compression: Add logic to truncate or summarize tool outputs before they enter the context.
Configure Reasoning Levels: Set reasoning effort to high and avoid xhigh based on benchmark data.
Build Skill Manifests: Create SKILL.md files with metadata and progressive disclosure logic.
Set Up Eval Pipeline: Implement a closed-loop evaluation system that runs containerized tasks on every change.
Sanitize History: Add logic to detect and handle orphaned tool calls, especially for Gemini models.
Tune System Prompts: Develop per-model system prompts and test them against pre-release checkpoints.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Model Switching	Retune harness per model	Different models require different tools, prompts, and reminders. Switching without retuning degrades quality.	Medium (Engineering effort)
High Latency	Use `high` reasoning, not `xhigh`	`xhigh` increases cost and reduces resolution. `high` offers the best balance of performance and cost.	Low (Reduced token burn)
Large Tool Output	Compress output	Raw output causes context poisoning. Compression preserves budget and improves model focus.	Low (Negligible compute)
Skill Management	Progressive disclosure	Loading all skills wastes context. Progressive disclosure loads only relevant skills, optimizing budget.	Low (Negligible compute)
Evaluation	Closed-loop PR eval	Manual demos miss regressions. Automated evals catch harness issues before merge.	Medium (Infrastructure)

Configuration Template

# harness.config.yaml
model:
  family: "claude"
  reasoning_level: "high"
  system_prompt_template: "prompts/claude_v1.txt"

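context:
  budget: 8000                # same unit the assembler measures (characters in the sample code)
  compression_threshold: 2000 # tool outputs longer than this are compressed
  skill_disclosure: "progressive"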

tools:
  mapper: "tool-mapper.ts"
  registry:
    - name: "fileEdit"
      model_names:
        claude: "replace_string_in_file"
        gpt: "apply_patch"
        gemini: "apply_patch"
    - name: "fileRead"
      model_names:
        claude: "read_file"
        gpt: "read_file"
        gemini: "read_file"

skills:
  directory: "./skills"
  metadata_tokens: 100
  body_tokens: 5000
  allowed_tools_field: true

evaluation:
  pipeline: "azure-devops"
  containerized: true
  metrics:
    - "resolution_rate"
    - "token_cost"
    - "loop_stability"

Quick Start Guide

Initialize Harness Config: Create a harness.config.yaml file defining your model family, context budget, and tool mappings.
Implement Tool Mapper: Write a ToolMapper class to translate internal tool operations to model-specific schemas.
Add Context Compression: Integrate output compression logic into your ToolExecutor to trim large outputs.
Run Baseline Eval: Execute a containerized evaluation run to measure baseline resolution rates and token costs.
Iterate on Harness: Adjust context budget, tool mappings, and system prompts based on eval results. Avoid increasing reasoning effort beyond high.
