- Quality cannot be assumed. The harness must integrate with an evaluation pipeline that runs containerized tasks on every change, measuring resolution rates and token costs.
Implementation: TypeScript Harness
The following code demonstrates a modular harness architecture. It separates concerns, implements per-model tool mapping, and includes context management.
1. Harness Configuration and Model Adapter
// config/harness.types.ts
export interface HarnessConfig {
  modelFamily: 'claude' | 'gpt' | 'gemini';
  contextBudget: number;
  reasoningLevel: 'medium' | 'high'; // Avoid 'xhigh' based on benchmarks
  compressionThreshold: number;
}

export interface ToolDefinition {
  name: string;
  schema: Record<string, any>;
  description: string;
}
// adapters/tool-mapper.ts
// Tool names and behavioral flags are kept in separate fields so the name map stays string-only
interface FamilyMapping {
  tools: Record<string, string>;
  requiresToolCallReminder?: boolean;
}

export class ToolMapper {
  private static readonly MAPPINGS: Record<string, FamilyMapping> = {
    claude: { tools: { fileEdit: 'replace_string_in_file', fileRead: 'read_file' } },
    gpt: { tools: { fileEdit: 'apply_patch', fileRead: 'read_file' } },
    gemini: { tools: { fileEdit: 'apply_patch', fileRead: 'read_file' }, requiresToolCallReminder: true }
  };

  static resolveToolName(family: string, operation: string): string {
    const mapping = this.MAPPINGS[family];
    if (!mapping) throw new Error(`Unsupported model family: ${family}`);
    // Fall back to the internal operation name when the family has no specific name
    return mapping.tools[operation] ?? operation;
  }

  static needsToolCallReminder(family: string): boolean {
    return this.MAPPINGS[family]?.requiresToolCallReminder ?? false;
  }
}
2. Context Assembler with Compression
// context/context-assembler.ts
export class ContextAssembler {
  private budget: number;
  private compressionThreshold: number;

  constructor(budget: number, compressionThreshold: number) {
    this.budget = budget;
    this.compressionThreshold = compressionThreshold;
  }

  assemble(userMessage: string, toolHistory: any[], skillMetadata: any[]): string {
    // Order: user message, then skill metadata (progressive disclosure), then tool history
    const parts = [
      userMessage,
      this.formatSkills(skillMetadata),
      this.formatToolHistory(toolHistory)
    ];
    let context = parts.filter(part => part.length > 0).join('\n');
    // Enforce budget (counted in characters here, a rough proxy for tokens)
    if (context.length > this.budget) {
      context = this.truncateContext(context);
    }
    return context;
  }

  // Assumed format: only the short metadata string of each skill is included
  private formatSkills(skills: any[]): string {
    return skills.map(skill => `[Skill: ${skill.id}] ${skill.metadata}`).join('\n');
  }

  private formatToolHistory(history: any[]): string {
    return history.map(entry => {
      const output = entry.output || '';
      // Replace oversized outputs with a short marker instead of the raw text
      if (output.length > this.compressionThreshold) {
        return `[Tool: ${entry.tool}] Output compressed (${output.length} chars omitted)`;
      }
      return `[Tool: ${entry.tool}] ${output}`;
    }).join('\n');
  }

  private truncateContext(context: string): string {
    // Placeholder: keeps only the most recent characters, so the user message at the
    // start can be cut. Replace with LRU or semantic truncation logic.
    return context.slice(-this.budget);
  }
}
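The truncation placeholder above cuts from the front of the string, which can drop the task itself. A minimal sketch of a budget-aware alternative, assuming tool-history entries are already formatted as strings and the budget is counted in characters as in ContextAssembler. The helper fitToBudget is hypothetical and not part of the harness shown here; it keeps the user message and drops the oldest entries first.

```typescript
// Hypothetical helper: drop the oldest tool-history entries until the context fits.
// Assumes the budget is measured in characters, like ContextAssembler above.
function fitToBudget(userMessage: string, entries: string[], budget: number): string {
  const kept = entries.slice(); // copy so the caller's array is not mutated
  const render = (): string => [userMessage].concat(kept).join('\n');
  // Oldest entries sit at the front of the array, so shift() removes them first.
  while (kept.length > 0 && render().length > budget) {
    kept.shift();
  }
  // Note: if the user message alone exceeds the budget, it is returned untruncated.
  return render();
}
```

Dropping whole entries keeps each remaining tool result intact, which a character slice cannot guarantee.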
3. Skill Loader with Progressive Disclosure
// skills/skill-loader.ts
export interface SkillManifest {
  id: string;
  metadata: string; // ~100 tokens
  body: string; // ~5000 tokens
  allowedTools: string[];
}

export class SkillLoader {
  async loadSkill(id: string, relevanceScore: number): Promise<SkillManifest> {
    const manifest = await this.fetchManifest(id);
    // Only load full body if relevant
    if (relevanceScore > 0.7) {
      return {
        ...manifest,
        body: await this.fetchBody(id)
      };
    }
    return { ...manifest, body: '' };
  }

  private async fetchManifest(id: string): Promise<SkillManifest> {
    // Fetch from agentskills.io or local registry
    return {
      id,
      metadata: `Skill: ${id} - Description...`,
      body: '',
      allowedTools: ['git_log', 'write_markdown']
    };
  }

  private async fetchBody(id: string): Promise<string> {
    // Placeholder: fetch the full body from the same source as the manifest
    return `Skill body for ${id}...`;
  }
}
4. Agent Loop Controller
// loop/agent-loop.ts
import { HarnessConfig } from '../config/harness.types';
import { ToolMapper } from '../adapters/tool-mapper';
import { ContextAssembler } from '../context/context-assembler';

export class AgentLoop {
  // ToolMapper exposes only static methods, so it is called directly rather than injected
  constructor(
    private config: HarnessConfig,
    private assembler: ContextAssembler,
    private modelClient: any
  ) {}

  async run(task: string): Promise<string> {
    const history: any[] = [];
    const maxRounds = 10;
    for (let round = 0; round < maxRounds; round++) {
      const context = this.assembler.assemble(task, history, []);
      const tools = this.getToolsForModel();
      // Inject tool call reminder for families that need it (Gemini in the mapping)
      const systemPrompt = ToolMapper.needsToolCallReminder(this.config.modelFamily)
        ? "You must use tool calls. Do not narrate actions."
        : "";
      const response = await this.modelClient.complete({
        context,
        tools,
        systemPrompt,
        reasoningLevel: this.config.reasoningLevel
      });
      if (response.toolCalls?.length) {
        // One { tool, output } entry per call, the shape ContextAssembler formats
        history.push(...await this.executeTools(response.toolCalls));
      } else if (response.finalAnswer) {
        return response.finalAnswer;
      } else {
        // Handle orphaned calls or hallucination
        history.push({ tool: 'harness', output: 'Model failed to produce valid tool call or answer.' });
      }
    }
    throw new Error("Agent loop exceeded maximum rounds.");
  }

  private getToolsForModel(): any[] {
    // Map internal tools to model-specific schemas (schema and description elided)
    return [
      { name: ToolMapper.resolveToolName(this.config.modelFamily, 'fileEdit') /* ... */ },
      { name: ToolMapper.resolveToolName(this.config.modelFamily, 'fileRead') /* ... */ }
    ];
  }

  private async executeTools(toolCalls: any[]): Promise<any[]> {
    // Placeholder: dispatch each call to a real executor; call.name is an assumed field
    return toolCalls.map(call => ({ tool: call.name, output: '' }));
  }
}
Rationale
- Separation of Concerns: The ToolMapper isolates model-specific logic. Adding a new model family requires only updating the mapping, not rewriting the loop.
- Budget Enforcement: The ContextAssembler actively manages the context window, preventing overflow and ensuring critical information is retained.
- Reasoning Cap: The configuration explicitly avoids xhigh reasoning levels, aligning with benchmark data that shows degradation at that tier.
- Skill Gating: The SkillLoader implements progressive disclosure, loading full skill bodies only when relevant, preserving context for the task.
Pitfall Guide
1. The "Universal Prompt" Fallacy
Explanation: Using a single system prompt for all model families. Models have different training data and instruction-following behaviors. A prompt optimized for Claude may confuse GPT or Gemini.
Fix: Implement per-model system prompts. The harness should select the prompt template based on modelFamily. Tune prompts against pre-release checkpoints to ensure compatibility.
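A minimal sketch of template selection by model family. The function name selectPromptTemplate and the GPT and Gemini paths are hypothetical; only prompts/claude_v1.txt appears in the configuration template in this article.

```typescript
type ModelFamily = 'claude' | 'gpt' | 'gemini';

// Illustrative paths: the gpt and gemini entries are assumptions for this sketch.
const PROMPT_TEMPLATES: Record<ModelFamily, string> = {
  claude: 'prompts/claude_v1.txt',
  gpt: 'prompts/gpt_v1.txt',
  gemini: 'prompts/gemini_v1.txt',
};

// Hypothetical helper: fail loudly instead of silently falling back to a shared prompt.
function selectPromptTemplate(family: string): string {
  if (!Object.prototype.hasOwnProperty.call(PROMPT_TEMPLATES, family)) {
    throw new Error(`No system prompt template for model family: ${family}`);
  }
  return PROMPT_TEMPLATES[family as ModelFamily];
}
```

Throwing on an unknown family is a deliberate choice here: a silent default would reintroduce the universal prompt this pitfall warns against.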
2. Context Poisoning via Tool Output
Explanation: Returning raw, unstructured tool output (e.g., full build logs) to the model. This consumes context budget and can distract the model with irrelevant details.
Fix: Implement output compression in the ToolExecutor. Truncate long outputs, summarize diffs, and omit progress bars. Use settings like chat.tools.compressOutput.enabled to automatically trim terminal output.
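One way such compression might look, as a sketch only. The helper compressToolOutput and its character-based limit are assumptions; it collapses carriage-return progress bars, drops blank and consecutive duplicate lines, and keeps the head and tail of whatever remains.

```typescript
// Hypothetical helper: trim terminal output before it is returned to the model.
function compressToolOutput(raw: string, maxChars: number): string {
  // Progress bars redraw a line with carriage returns; keep only the final state.
  const lines = raw.split('\n').map(line => {
    const segments = line.replace(/\r+$/, '').split('\r');
    return segments[segments.length - 1];
  });

  // Drop blank lines and consecutive duplicates.
  const deduped: string[] = [];
  for (const line of lines) {
    if (line.trim() === '') continue;
    if (deduped.length > 0 && deduped[deduped.length - 1] === line) continue;
    deduped.push(line);
  }

  const cleaned = deduped.join('\n');
  if (cleaned.length <= maxChars) return cleaned;

  // Keep the head and the tail: the command context is usually at the start,
  // and errors are usually at the end.
  const half = Math.floor(maxChars / 2);
  const omitted = cleaned.length - 2 * half;
  return `${cleaned.slice(0, half)}\n[... ${omitted} chars omitted ...]\n${cleaned.slice(cleaned.length - half)}`;
}
```

The omission marker tells the model that content was removed, so it can request the full output if the harness exposes a way to do that.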
3. Tool Schema Mismatch
Explanation: Assuming all models support the same tool names and arguments. For example, replace_string_in_file may not work well with GPT, which prefers apply_patch.
Fix: Use a ToolMapper to translate internal operations to model-specific tool schemas. Validate tool calls against the model's expected format before execution.
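A sketch of pre-execution validation, under the assumption that each exposed tool declares its required argument names. The ToolCall and ToolSpec shapes and the validateToolCall helper are hypothetical, not a real provider API.

```typescript
// Hypothetical shapes for this sketch.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface ToolSpec {
  name: string;
  requiredArgs: string[];
}

// Returns a list of problems; an empty list means the call is safe to execute.
function validateToolCall(call: ToolCall, exposedTools: ToolSpec[]): string[] {
  const spec = exposedTools.filter(tool => tool.name === call.name)[0];
  if (spec === undefined) {
    return [`Unknown tool "${call.name}" for this model family`];
  }
  const errors: string[] = [];
  for (const arg of spec.requiredArgs) {
    if (!Object.prototype.hasOwnProperty.call(call.args, arg)) {
      errors.push(`Missing argument "${arg}" for tool "${call.name}"`);
    }
  }
  return errors;
}
```

Returning the errors instead of throwing lets the loop feed them back to the model as a tool result, giving it a chance to retry with a corrected call.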
4. Reasoning Escalation Trap
Explanation: Increasing reasoning effort (xhigh) to solve difficult tasks. Benchmarks show this increases token cost and decreases resolution rates due to over-optimization or loop degradation.
Fix: Cap reasoning effort at high. Invest in harness tuning (context assembly, tool exposure) instead of reasoning escalation. Measure the ROI of reasoning levels using a closed-loop eval.
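To make the ROI comparison concrete, here is a small sketch that picks a reasoning level from eval results. The LevelResult shape and the selection rule (best resolution rate, ties broken by tokens per resolved task) are assumptions for illustration; the allowed list is where the cap is enforced.

```typescript
// Hypothetical per-level summary of one eval run.
interface LevelResult {
  level: string;    // e.g. 'medium' or 'high'
  resolved: number; // tasks resolved
  total: number;    // tasks attempted
  tokens: number;   // total tokens spent across the run
}

function resolutionRate(r: LevelResult): number {
  return r.total === 0 ? 0 : r.resolved / r.total;
}

function tokensPerResolvedTask(r: LevelResult): number {
  return r.resolved === 0 ? Infinity : r.tokens / r.resolved;
}

// Pick the allowed level with the best resolution rate; break ties on cost.
function pickReasoningLevel(results: LevelResult[], allowed: string[]): string {
  const candidates = results.filter(r => allowed.indexOf(r.level) !== -1);
  if (candidates.length === 0) {
    throw new Error('No eval results for any allowed reasoning level');
  }
  let best = candidates[0];
  for (const r of candidates.slice(1)) {
    const higherRate = resolutionRate(r) > resolutionRate(best);
    const sameRateCheaper =
      resolutionRate(r) === resolutionRate(best) &&
      tokensPerResolvedTask(r) < tokensPerResolvedTask(best);
    if (higherRate || sameRateCheaper) best = r;
  }
  return best.level;
}
```

Passing ['medium', 'high'] as the allowed list keeps xhigh out of consideration regardless of what the eval run recorded for it.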
5. Narrated Actions and Orphaned Tool Calls
Explanation: Some models (e.g., Gemini) may describe actions instead of calling tools, or fail on dangling tool calls in history. This breaks the agent loop.
Fix: Implement history sanitization. Detect orphaned calls and inject reminders. For Gemini, use explicit tool-call enforcement hooks and validate that tool calls are present in the response.
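A sketch of history sanitization, assuming each tool call and tool result carries a shared call ID. The HistoryEntry shape and the sanitizeHistory helper are hypothetical; real provider message formats differ.

```typescript
// Hypothetical history shape for this sketch.
type HistoryEntry =
  | { kind: 'tool_call'; callId: string; tool: string }
  | { kind: 'tool_result'; callId: string; output: string }
  | { kind: 'text'; content: string };

interface SanitizedHistory {
  entries: HistoryEntry[];
  orphanedCallIds: string[];
}

// Remove tool calls that never received a result, and results without a matching call.
function sanitizeHistory(history: HistoryEntry[]): SanitizedHistory {
  // Object.create(null) avoids collisions with inherited keys such as "constructor".
  const hasResult: Record<string, boolean> = Object.create(null);
  const hasCall: Record<string, boolean> = Object.create(null);
  for (const entry of history) {
    if (entry.kind === 'tool_result') hasResult[entry.callId] = true;
    if (entry.kind === 'tool_call') hasCall[entry.callId] = true;
  }

  const orphanedCallIds: string[] = [];
  const entries = history.filter(entry => {
    if (entry.kind === 'tool_call' && !hasResult[entry.callId]) {
      orphanedCallIds.push(entry.callId);
      return false;
    }
    if (entry.kind === 'tool_result' && !hasCall[entry.callId]) return false;
    return true;
  });
  return { entries, orphanedCallIds };
}
```

A non-empty orphanedCallIds list is the signal the loop can use to inject a tool-call reminder on the next round.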
6. Eval Blindness
Explanation: Relying on manual demos or unit tests for JSON schemas. This misses regressions in the agent loop, context assembly, or tool execution.
Fix: Implement a closed-loop evaluation pipeline. Run containerized tasks on every PR, measuring resolution rates and token costs. Use benchmarks like VSC-Bench to catch harness regressions before merge.
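The merge gate at the end of such a pipeline could be sketched as follows. The TaskResult shape, the helper names, and the two thresholds are assumptions; running the containerized tasks themselves is outside this sketch.

```typescript
// Hypothetical per-task result produced by one containerized eval run.
interface TaskResult {
  taskId: string;
  resolved: boolean;
  tokens: number;
}

interface EvalSummary {
  resolutionRate: number; // fraction of tasks resolved, 0..1
  meanTokens: number;     // mean tokens per task
}

function summarize(results: TaskResult[]): EvalSummary {
  if (results.length === 0) return { resolutionRate: 0, meanTokens: 0 };
  const resolved = results.filter(r => r.resolved).length;
  const tokens = results.reduce((sum, r) => sum + r.tokens, 0);
  return { resolutionRate: resolved / results.length, meanTokens: tokens / results.length };
}

// Compare a PR's summary against the baseline; an empty list means the gate passes.
function findRegressions(
  baseline: EvalSummary,
  candidate: EvalSummary,
  maxRateDrop: number,    // absolute drop allowed, e.g. 0.05
  maxTokenGrowth: number  // relative growth allowed, e.g. 0.1
): string[] {
  const failures: string[] = [];
  if (baseline.resolutionRate - candidate.resolutionRate > maxRateDrop) {
    failures.push('Resolution rate dropped more than the allowed margin');
  }
  if (
    baseline.meanTokens > 0 &&
    (candidate.meanTokens - baseline.meanTokens) / baseline.meanTokens > maxTokenGrowth
  ) {
    failures.push('Mean token cost grew more than the allowed margin');
  }
  return failures;
}
```

Gating on both metrics matters: a change that holds the resolution rate steady while sharply raising token cost is still a harness regression.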
7. Hardcoding Skill Context
Explanation: Loading all skill bodies into the context window at startup. This wastes budget and slows down the agent.
Fix: Implement progressive disclosure for skills. Load metadata first, then load full bodies only when the model indicates relevance. Use SKILL.md files with capped token limits.
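A sketch of how a cap on skill bodies might be enforced at render time. The Skill shape and the renderSkill helper are hypothetical, and estimateTokens uses a rough four-characters-per-token heuristic, not a real tokenizer.

```typescript
// Hypothetical skill shape for this sketch.
interface Skill {
  id: string;
  metadata: string;
  body: string;
}

// Rough heuristic (assumption): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Metadata is always included; the body only when the model has requested the skill.
function renderSkill(skill: Skill, requested: boolean, bodyTokenCap: number): string {
  const header = `[Skill: ${skill.id}] ${skill.metadata}`;
  if (!requested) return header;
  if (estimateTokens(skill.body) > bodyTokenCap) {
    throw new Error(`Skill ${skill.id} body exceeds ${bodyTokenCap} tokens (estimated)`);
  }
  return `${header}\n${skill.body}`;
}
```

Rejecting an oversized body, instead of truncating it, surfaces the problem to the skill author instead of handing the model half a set of instructions.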
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Model Switching | Retune harness per model | Different models require different tools, prompts, and reminders. Switching without retuning degrades quality. | Medium (Engineering effort) |
| High Latency | Use high reasoning, not xhigh | xhigh increases cost and reduces resolution. high offers the best balance of performance and cost. | Low (Reduced token burn) |
| Large Tool Output | Compress output | Raw output causes context poisoning. Compression preserves budget and improves model focus. | Low (Negligible compute) |
| Skill Management | Progressive disclosure | Loading all skills wastes context. Progressive disclosure loads only relevant skills, optimizing budget. | Low (Negligible compute) |
| Evaluation | Closed-loop PR eval | Manual demos miss regressions. Automated evals catch harness issues before merge. | Medium (Infrastructure) |
Configuration Template
# harness.config.yaml
model:
  family: "claude"
  reasoning_level: "high"
  system_prompt_template: "prompts/claude_v1.txt"
context:
  budget: 8000
  compression_threshold: 2000
  skill_disclosure: "progressive"
tools:
  mapper: "tool-mapper.ts"
  registry:
    - name: "fileEdit"
      model_names:
        claude: "replace_string_in_file"
        gpt: "apply_patch"
        gemini: "apply_patch"
    - name: "fileRead"
      model_names:
        claude: "read_file"
        gpt: "read_file"
        gemini: "read_file"
skills:
  directory: "./skills"
  metadata_tokens: 100
  body_tokens: 5000
  allowed_tools_field: true
evaluation:
  pipeline: "azure-devops"
  containerized: true
  metrics:
    - "resolution_rate"
    - "token_cost"
    - "loop_stability"
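Once the template is parsed into an object (by whichever YAML library the project already uses), a few checks catch common mistakes before the harness starts. This is a sketch: validateConfig and the ParsedConfig shape are hypothetical, and the rule that compression_threshold must be below budget is an assumption, not something the template states.

```typescript
type ModelFamily = 'claude' | 'gpt' | 'gemini';
const FAMILIES: ModelFamily[] = ['claude', 'gpt', 'gemini'];
const ALLOWED_REASONING = ['medium', 'high']; // xhigh is deliberately excluded

// Hypothetical shape covering only the fields this sketch checks.
interface RegistryEntry {
  name: string;
  model_names: Record<string, string>;
}

interface ParsedConfig {
  model: { family: string; reasoning_level: string };
  context: { budget: number; compression_threshold: number };
  tools: { registry: RegistryEntry[] };
}

// Returns a list of problems; an empty list means the config passed these checks.
function validateConfig(config: ParsedConfig): string[] {
  const errors: string[] = [];
  if (FAMILIES.indexOf(config.model.family as ModelFamily) === -1) {
    errors.push(`Unknown model family: ${config.model.family}`);
  }
  if (ALLOWED_REASONING.indexOf(config.model.reasoning_level) === -1) {
    errors.push(`Unsupported reasoning_level: ${config.model.reasoning_level}`);
  }
  // Assumed rule: a single tool output should not be allowed to fill the whole budget.
  if (config.context.compression_threshold >= config.context.budget) {
    errors.push('compression_threshold must be smaller than budget');
  }
  for (const entry of config.tools.registry) {
    for (const family of FAMILIES) {
      if (!entry.model_names[family]) {
        errors.push(`Tool "${entry.name}" has no name for ${family}`);
      }
    }
  }
  return errors;
}
```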
Quick Start Guide
- Initialize Harness Config: Create a harness.config.yaml file defining your model family, context budget, and tool mappings.
- Implement Tool Mapper: Write a ToolMapper class to translate internal tool operations to model-specific schemas.
- Add Context Compression: Integrate output compression logic into your ToolExecutor to trim large outputs.
- Run Baseline Eval: Execute a containerized evaluation run to measure baseline resolution rates and token costs.
- Iterate on Harness: Adjust context budget, tool mappings, and system prompts based on eval results. Avoid increasing reasoning effort beyond high.