AI/ML · 2026-05-11 · 61 min read

Your MCP server eats 55,000 tokens before your agent says a word -- I measured the real cost

By Ken Imoto

Context Tax in AI Tool Routing: Quantifying and Mitigating MCP Schema Overhead

Current Situation Analysis

The industry has rapidly adopted the Model Context Protocol (MCP) to standardize how AI agents interact with external services. The promise is straightforward: plug in a server, gain access to its capabilities, and let the model orchestrate workflows. What most engineering teams overlook is the hidden payload cost of this abstraction. Every time an MCP client establishes a session, it serializes the complete tool registry into the prompt context. Names, descriptions, parameter schemas, enum constraints, and type definitions are injected on every single inference turn, regardless of whether the agent actually invokes those tools.

This overhead is frequently misunderstood because MCP clients abstract the transport layer. Developers assume lazy loading or on-demand schema resolution, but the current protocol specification pushes the entire definition set upfront. The result is a compounding context tax that scales linearly with server count and tool density.

The financial and operational impact is measurable. A minimal database adapter exposing a single query tool consumes roughly 35 tokens. A standard SaaS integration with seven endpoints pushes approximately 704 tokens. A full-featured platform suite like GitHub can inject ~55,000 tokens before the first user message is processed. At enterprise scale, running 1,000 daily requests with heavy schema payloads burns roughly $170 per day, translating to $5,100 monthly, solely for context injection. Beyond direct costs, model reasoning quality exhibits a sharp degradation threshold. Empirical testing consistently shows output quality declining once 50+ tool definitions occupy the prompt. The model begins prioritizing tool metadata over user intent, generating tangent-chasing behavior and confidently recommending irrelevant endpoints for unrelated problems.
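To make the $170-per-day figure concrete, here is the arithmetic as a runnable sketch. The $3-per-million-input-token rate is an assumption chosen to roughly reproduce the numbers above; substitute your provider's actual pricing:

// Back-of-envelope context-tax estimate. The pricing constant is an
// assumption for illustration; plug in your provider's real rate.
const SCHEMA_TOKENS_PER_TURN = 55_000;     // full platform suite, injected every turn
const REQUESTS_PER_DAY = 1_000;
const USD_PER_MILLION_INPUT_TOKENS = 3.0;  // hypothetical input-token price

const dailyTokens = SCHEMA_TOKENS_PER_TURN * REQUESTS_PER_DAY;  // 55M tokens/day
const dailyCost = (dailyTokens / 1_000_000) * USD_PER_MILLION_INPUT_TOKENS;

console.log(`~$${dailyCost.toFixed(0)}/day`);           // ~$165/day
console.log(`~$${(30 * dailyCost).toFixed(0)}/month`);  // ~$4,950/month, in line with the ~$170/day figure above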

WOW Moment: Key Findings

The disparity between raw tool availability and usable context capacity does not grow linearly: overhead per tool climbs as registries get richer, and the reasoning capacity left over collapses once schemas dominate the window. The following comparison isolates the context overhead against inference cost and remaining reasoning capacity:

| Configuration | Tool Count | Context Overhead (tokens) | Inference Cost per Request (est.) | Reasoning Capacity Remaining (200K window) |
| --- | --- | --- | --- | --- |
| Minimal DB Adapter | 1 | ~35 | ~$0.0005 | ~99.9% |
| Standard SaaS Integration | 7 | ~704 | ~$0.009 | ~99.6% |
| Full Platform Suite | 93 | ~55,000 | ~$0.74 | ~72.5% |
| Optimized & Filtered | 10 | ~650 | ~$0.008 | ~99.7% |

This finding matters because context windows are finite compute resources. Every token consumed by schema definitions is a token subtracted from chain-of-thought reasoning, user history, and output generation. The data demonstrates that capability does not scale with tool count; it scales with context efficiency. By treating tool definitions as a budgeted resource rather than an unlimited catalog, engineering teams can reclaim up to 96% of wasted context while maintaining identical functional coverage. This enables predictable cost modeling, stable model performance, and deterministic agent behavior.

Core Solution

Mitigating MCP context tax requires a three-layer architecture: schema pruning, description compression, and dynamic session routing. Each layer addresses a different vector of payload bloat.

Step 1: Implement Schema Pruning via Allowlists

Instead of accepting the full registry from an MCP server, enforce an explicit allowlist at the client layer. This prevents unused endpoints from entering the prompt.

interface MCPToolFilter {
  serverId: string;
  permittedEndpoints: string[];
}

// Minimal tool shape; real MCP tool objects carry more fields,
// but these are the ones pruning needs.
interface MCPTool {
  serverId: string;
  name: string;
  description: string;
}

class ContextRegistry {
  private filters: Map<string, MCPToolFilter> = new Map();

  registerFilter(config: MCPToolFilter): void {
    this.filters.set(config.serverId, config);
  }

  // Drop any tool whose name is absent from its server's allowlist.
  // Note: servers without a registered filter pass through unfiltered,
  // so register a filter for every server to keep the gate fail-closed.
  pruneSchema(rawTools: MCPTool[]): MCPTool[] {
    return rawTools.filter(tool => {
      const filter = this.filters.get(tool.serverId);
      if (!filter) return true;
      return filter.permittedEndpoints.includes(tool.name);
    });
  }
}

Architecture Rationale: Filtering at the client layer ensures the LLM provider only receives the exact subset required for the current workflow. This reduces payload size without modifying the upstream server. The ContextRegistry acts as a deterministic gatekeeper, preventing accidental schema leakage during multi-server sessions.
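A minimal usage sketch: register an allowlist for a hypothetical github server, then prune its raw registry before it reaches the prompt. The tool names are illustrative, not the actual GitHub MCP endpoints.

const registry = new ContextRegistry();

registry.registerFilter({
  serverId: 'github',
  permittedEndpoints: ['search_commits', 'get_file_content'],
});

// Raw tools/list output; only allowlisted names survive pruning.
const rawTools: MCPTool[] = [
  { serverId: 'github', name: 'search_commits', description: 'Search commit history' },
  { serverId: 'github', name: 'create_release', description: 'Publish a new release' },
];

const pruned = registry.pruneSchema(rawTools);
// -> only search_commits remains; create_release never enters the prompt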

Step 2: Compress Descriptions Programmatically

API documentation is written for human readers. LLMs parse structured keywords and type constraints more efficiently than prose. Strip conversational filler; retain action verbs, parameter names, and critical constraints.

class SchemaCompressor {
  // Filler words that add tokens without aiding tool selection.
  // Entries must be single words, since filtering happens word-by-word
  // (a multi-word entry like 'you can' would never match).
  private static readonly STOP_WORDS = new Set([
    'the', 'a', 'an', 'is', 'are', 'uses', 'allows', 'you', 'can',
  ]);

  static optimizeDescription(raw: string): string {
    const cleaned = raw
      .replace(/\s+/g, ' ') // collapse whitespace runs
      .trim()
      .split(' ')
      .filter(word => !SchemaCompressor.STOP_WORDS.has(word.toLowerCase()))
      .join(' ');

    // Hard cap keeps any single description from dominating the payload.
    return cleaned.length > 60 ? cleaned.slice(0, 57) + '...' : cleaned;
  }

  static applyToRegistry(tools: MCPTool[]): MCPTool[] {
    return tools.map(tool => ({
      ...tool,
      description: this.optimizeDescription(tool.description),
    }));
  }
}

Architecture Rationale: Description compression targets the highest-density token waste. A single paragraph description can consume 60-80 tokens. Compressing it to a keyword-dense phrase retains semantic clarity for the model while cutting payload by 70-75%. The compressor runs post-filtering, ensuring only active tools are optimized.
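To make the compression concrete, here is the optimizer applied to a typical prose description (the input string is a hypothetical example):

const raw = 'Allows you to search the commit history of a repository.';
const compressed = SchemaCompressor.optimizeDescription(raw);
// -> 'to search commit history of repository.'
// Ten words in, six out; longer prose descriptions shrink further
// before the 60-character hard cap applies.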

Step 3: Route Sessions Dynamically

Static server attachment forces all schemas into every turn. Dynamic routing attaches servers only when the workflow requires them, then detaches them when the task shifts.

class DynamicSessionRouter {
  private activeServers: Set<string> = new Set();

  // Attach a server when the workflow enters its domain.
  attach(serverId: string): void {
    this.activeServers.add(serverId);
  }

  // Detach as soon as the task shifts; the server's schema
  // immediately stops paying context tax.
  detach(serverId: string): void {
    this.activeServers.delete(serverId);
  }

  // Server IDs whose schemas should be serialized into the next turn.
  getActivePayload(): string[] {
    return Array.from(this.activeServers);
  }
}

Architecture Rationale: Dynamic routing eliminates cross-task contamination. When an agent switches from financial reconciliation to code generation, detaching the accounting server zeroes out its context tax. This requires workflow-aware orchestration but delivers the highest context efficiency.
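A sketch of task-scoped attachment, assuming the orchestration layer knows when the workflow changes domain (the server IDs match the configuration template later in this post):

declare const allTools: MCPTool[]; // combined tools/list output from all connected servers

const router = new DynamicSessionRouter();

// Phase 1: financial reconciliation. Only the accounting schema is live.
router.attach('accounting-platform');

// Phase 2: the task shifts to code review. Swap the active set;
// the accounting schema stops consuming context from this turn on.
router.detach('accounting-platform');
router.attach('code-repository');

// Serialize only schemas that belong to currently active servers.
const active = new Set(router.getActivePayload());
const payload = allTools.filter(tool => active.has(tool.serverId));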

Pitfall Guide

1. The Lazy Loading Fallacy

Explanation: Assuming MCP clients fetch tool schemas only when invoked. The protocol serializes the full registry on session initialization and re-injects it on every turn. Fix: Treat every connected server as a permanent context consumer. Apply allowlists immediately upon connection.

2. Documentation-to-Prompt Leakage

Explanation: Copying raw API documentation into tool descriptions. Human-readable prose contains redundant phrasing, examples, and edge-case warnings that consume tokens without improving model accuracy. Fix: Implement a compression layer that extracts action verbs, parameter names, and type constraints. Discard conversational filler.

3. Enum Explosion

Explanation: Exposing full enum arrays for parameters like status, category, or region. Large enum lists can add hundreds of tokens per tool. Fix: Use dynamic enum resolution or pass only relevant subsets. Where possible, replace enums with free-text parameters validated post-call.
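As a sketch of the post-call validation approach: the tool schema declares the parameter as plain free text, and your own code enforces membership after the model responds. VALID_REGIONS and list_regions are hypothetical names, not real endpoints.

// The full region list never enters the prompt; it lives server-side.
const VALID_REGIONS = new Set([
  'us-east-1', 'eu-west-1', 'ap-northeast-1',
  // ...the hundreds of entries that would otherwise ship as enum tokens
]);

function validateRegion(value: string): string {
  if (!VALID_REGIONS.has(value)) {
    // A corrective error gives the agent a recovery path without
    // paying the enum's token cost on every turn.
    throw new Error(`Unknown region "${value}"; call list_regions for valid options.`);
  }
  return value;
}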

4. Static Session Binding

Explanation: Keeping all MCP servers attached throughout the entire conversation lifecycle. This forces unrelated schemas to occupy context during every inference step. Fix: Implement task-scoped attachment. Detach servers when the workflow transitions to a different domain.

5. Ignoring the 50-Tool Threshold

Explanation: Assuming more tools equal more capability. Empirical data shows model reasoning quality degrades noticeably after ~50 definitions, causing tangent-chasing and incorrect tool selection. Fix: Cap active tool counts per session. Use routing or filtering to stay below the threshold.

6. Premature Protocol Dependency

Explanation: Waiting for infrastructure-level solutions like MCP Tool Search (introduced Jan 2026) to handle schema deferral. While effective, adoption is uneven and client support varies. Fix: Implement client-side filtering and compression now. Treat protocol-level deferral as a fallback, not a primary strategy.

7. Skipping Baseline Measurement

Explanation: Deploying servers without measuring token consumption before and after attachment. Teams cannot optimize what they do not quantify. Fix: Log per-turn token usage for each session. Compare baseline consumption against post-attachment metrics to isolate expensive registries.
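A minimal measurement sketch, assuming a countTokens helper from whatever tokenizer matches your model (the helper name is illustrative, not a real API):

// Stand-in for your tokenizer: a tiktoken binding, your provider's
// token-counting endpoint, or similar.
declare function countTokens(text: string): number;

function measureSchemaOverhead(tools: MCPTool[], contextWindow: number): number {
  const schemaTokens = countTokens(JSON.stringify(tools));
  const percent = (schemaTokens / contextWindow) * 100;
  console.log(`schema overhead: ${schemaTokens} tokens (${percent.toFixed(1)}% of window)`);
  return percent;
}

// Baseline vs. optimized, assuming a 200K window and the Step 1 registry.
declare const rawTools: MCPTool[];       // raw tools/list output
declare const registry: ContextRegistry; // filters already registered
measureSchemaOverhead(rawTools, 200_000);
measureSchemaOverhead(registry.pruneSchema(rawTools), 200_000);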

Production Bundle

Action Checklist

  • Audit connected MCP servers: Run tools/list against each endpoint and record total tool counts.
  • Enforce allowlists: Configure permittedEndpoints for every server to exclude unused workflows.
  • Compress descriptions: Deploy a schema optimizer that strips conversational filler and caps length.
  • Implement dynamic routing: Attach servers only during relevant workflow phases; detach immediately after.
  • Monitor context windows: Track per-turn token usage and alert when schema overhead exceeds 10% of available context.
  • Validate enum payloads: Replace large enum arrays with dynamic resolution or free-text validation where possible.
  • Establish a 50-tool cap: Limit active tool definitions per session to preserve reasoning capacity.

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Single-purpose agent (e.g., code review only) | Static allowlist + compression | Predictable workload; minimal routing overhead | Low; stable token baseline |
| Multi-domain workflow (finance → dev → support) | Dynamic session routing | Prevents cross-contamination; zeroes unused context tax | High savings; context scales with active task |
| Enterprise SaaS integration (200+ tools) | Aggressive filtering + enum pruning | Raw registry consumes ~70% of context window | Critical; avoids $5k+/month overhead |
| Prototyping / exploration | Full registry + monitoring | Maximizes discovery; measurement informs future pruning | Temporary; acceptable for short-lived sessions |

Configuration Template

{
  "mcpContextPolicy": {
    "maxActiveTools": 50,
    "contextThresholdPercent": 10,
    "servers": {
      "accounting-platform": {
        "allowedTools": [
          "create_transaction",
          "list_ledger_entries",
          "get_trial_balance",
          "reconcile_accounts"
        ],
        "compressDescriptions": true,
        "autoDetachOnTaskSwitch": true
      },
      "code-repository": {
        "allowedTools": [
          "search_commits",
          "list_pull_requests",
          "get_file_content"
        ],
        "compressDescriptions": true,
        "autoDetachOnTaskSwitch": true
      }
    }
  }
}
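To connect the template to the Step 1 filter layer, a small loader can translate each server entry into an MCPToolFilter. A sketch, assuming the policy above is saved as mcp-context-policy.json:

import { readFileSync } from 'node:fs';

// Shape of the template above; only the fields the loader needs.
interface ServerPolicy {
  allowedTools: string[];
  compressDescriptions: boolean;
  autoDetachOnTaskSwitch: boolean;
}

interface ContextPolicy {
  maxActiveTools: number;
  contextThresholdPercent: number;
  servers: Record<string, ServerPolicy>;
}

const policy: ContextPolicy = JSON.parse(
  readFileSync('mcp-context-policy.json', 'utf8'),
).mcpContextPolicy;

// Register one allowlist filter per configured server.
const registry = new ContextRegistry();
for (const [serverId, server] of Object.entries(policy.servers)) {
  registry.registerFilter({ serverId, permittedEndpoints: server.allowedTools });
}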

Quick Start Guide

  1. Inventory your registries: Execute tools/list against every connected MCP server. Export the raw JSON and count total tools per server.
  2. Apply an allowlist: Create a configuration file mapping each server to only the endpoints required for your current workflow. Load this into your client's filter layer.
  3. Compress and deploy: Run the schema compressor on the filtered registry. Verify description length and token count. Attach the optimized registry to your session.
  4. Measure and iterate: Log per-turn token consumption before and after optimization. If schema overhead exceeds 10% of your context window, tighten the allowlist or enable dynamic detachment.