AI/ML · 2026-05-13 · 69 min read

I asked Cursor to rename a function. It sent 8,400 tokens. I checked.

By GDS K S

Owning the AI Routing Layer: How Context Overhead Drives LLM Costs and How to Reclaim Control

Current Situation Analysis

The modern AI coding stack suffers from a structural misalignment: wrapper applications prioritize user experience and safety over token efficiency, while developers bear the marginal cost of every inflated API call. When you interact with an AI assistant through a GUI, the application injects system instructions, tool definitions, recent file buffers, and workspace metadata into every request. This routing layer operates as a black box. You see a chat panel; the API sees a payload that can easily exceed 8,000 input tokens for a trivial task.

This problem is routinely overlooked because development teams optimize for feature velocity and interface polish. Token economics are abstracted behind flat subscription fees or buried in monthly API invoices. The incentive structure favors the wrapper: shipping conservative, high-context payloads reduces hallucination rates and support tickets, while the variable cost flows straight through to the end user's API spend. The result is a silent cost multiplier that compounds with every session.

Data from production environments consistently reveals the scale of this overhead. A straightforward three-line function rename triggered approximately 8,400 input tokens when routed through a commercial wrapper; the same operation executed via a direct API call consumed roughly 1,900 tokens. The differential of roughly 6,500 tokens consisted of system prompts, agent tool schemas, and recently accessed buffer states with zero relevance to the task. Over a three-month period, direct API expenditures grew at a 50% month-over-month rate despite stable engineering hours. The growth was driven not by increased workload, but by unoptimized context injection and uncontrolled model selection.

WOW Moment: Key Findings

The turning point occurs when you instrument your AI workflow and compare routing strategies side-by-side. The data reveals that context overhead is not a fixed cost; it is a configurable variable. By decoupling intent classification from model dispatch, you can drastically reduce input token volume while maintaining output quality.

Routing Approach | Avg Input Tokens | Cost per 1k Calls | Context Overhead | Control Level
Wrapper Chat Panel | ~8,400 | High | ~6,500 extra tokens | Low (vendor-managed)
Direct API Call | ~1,900 | Medium | Minimal | High (manual)
Custom Router | ~2,100 | Low (41% reduction) | Optimized per intent | Full (user-owned)

This finding matters because it shifts AI spending from a variable, unpredictable expense to a fixed, optimized line item. When you own the routing layer, you can route trivial queries to low-cost models, reserve high-capability models for complex reasoning, and enforce strict context budgets. The 41% cost reduction observed in production did not come from sacrificing capability; it came from eliminating unnecessary token transmission and aligning model selection with actual task complexity.

Core Solution

Building a custom routing layer requires three architectural components: an intent classifier, a model dispatcher, and a cost telemetry system. The goal is not to replicate a wrapper's UI, but to intercept prompts, classify their purpose, trim irrelevant context, and route them to the most cost-effective model tier.

Step 1: Intent Classification via Scoring Engine

Instead of brittle regular expressions, implement a weighted scoring system that evaluates prompt characteristics against known intent categories. This approach handles edge cases gracefully and allows incremental refinement.

interface IntentRule {
  category: 'trivial' | 'code' | 'planning' | 'embedding';
  keywords: string[];
  weight: number;
}

const INTENT_RULES: IntentRule[] = [
  { category: 'trivial', keywords: ['rename', 'format', 'comment', 'lint', 'spell'], weight: 1.0 },
  { category: 'code', keywords: ['implement', 'refactor', 'debug', 'function', 'class', 'api'], weight: 1.5 },
  { category: 'planning', keywords: ['architecture', 'design', 'strategy', 'roadmap', 'breakdown'], weight: 2.0 },
  { category: 'embedding', keywords: ['vector', 'search', 'similarity', 'classify', 'cluster'], weight: 0.8 }
];

export function classifyIntent(prompt: string): IntentRule['category'] {
  const scores: Record<string, number> = { trivial: 0, code: 0, planning: 0, embedding: 0 };
  const normalized = prompt.toLowerCase();

  for (const rule of INTENT_RULES) {
    for (const keyword of rule.keywords) {
      if (normalized.includes(keyword)) {
        scores[rule.category] += rule.weight;
      }
    }
  }

  // Pick the highest-scoring category; if nothing matched at all, fall back to the
  // mid-tier 'code' intent rather than letting the tie resolve to an arbitrary category.
  const [topCategory, topScore] = Object.entries(scores).reduce((a, b) => (a[1] > b[1] ? a : b));
  return topScore > 0 ? (topCategory as IntentRule['category']) : 'code';
}
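
A few sanity checks with made-up prompts (hypothetical examples, not drawn from the production dataset) show how the weights interact and where the mid-tier fallback kicks in:

console.log(classifyIntent('rename this variable and fix the comment'));        // 'trivial'
console.log(classifyIntent('refactor the auth class and debug the api'));       // 'code'
console.log(classifyIntent('draft an architecture breakdown for the roadmap')); // 'planning'
console.log(classifyIntent('what does this error mean?'));                      // no keyword hit, falls back to 'code'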

Step 2: Context Pruning & Token Budgeting

Before dispatching to an LLM, enforce a hard token limit. Strip irrelevant workspace metadata, truncate conversation history, and compress system instructions based on the classified intent.

import { get_encoding } from 'tiktoken';

const encoder = get_encoding('cl100k_base');

export function pruneContext(prompt: string, history: string[], maxTokens: number): string {
  const systemPrefix = 'You are a technical assistant. ';
  // Reserve budget for the system prefix and the user prompt before fitting any history.
  let remaining =
    maxTokens - encoder.encode(systemPrefix).length - encoder.encode(`User: ${prompt}`).length;

  // Walk history from newest to oldest; stop at the first turn that no longer fits,
  // so an older message cannot slip in after a more recent one was dropped.
  const kept: string[] = [];
  for (let i = history.length - 1; i >= 0; i--) {
    const msgTokens = encoder.encode(history[i]).length;
    if (msgTokens > remaining) break;
    remaining -= msgTokens;
    kept.unshift(history[i]);
  }

  return `${systemPrefix}\n${kept.join('\n')}\nUser: ${prompt}`;
}
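
In practice the budget comes from the classified intent. A short illustrative wiring, with per-intent limits borrowed from the routing configuration later in the post, and prompt and conversationHistory standing in for whatever your caller actually holds:

const intent = classifyIntent(prompt);
const budgets: Record<IntentRule['category'], number> = { trivial: 2000, code: 8000, planning: 12000, embedding: 8000 };
const payload = pruneContext(prompt, conversationHistory, budgets[intent]);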

Step 3: Model Dispatch & Cost Attribution

Map intents to model tiers, attach cost tracking, and implement fallback routing. Externalize pricing to avoid hardcoding.

interface ModelConfig {
  id: string;
  provider: 'anthropic' | 'openai';
  inputPricePer1k: number;
  outputPricePer1k: number;
  maxContext: number;
}

const MODEL_REGISTRY: Record<string, ModelConfig> = {
  haiku: { id: 'claude-3-haiku', provider: 'anthropic', inputPricePer1k: 0.00025, outputPricePer1k: 0.00125, maxContext: 200000 },
  sonnet: { id: 'claude-3-sonnet', provider: 'anthropic', inputPricePer1k: 0.003, outputPricePer1k: 0.015, maxContext: 200000 },
  opus: { id: 'claude-3-opus', provider: 'anthropic', inputPricePer1k: 0.015, outputPricePer1k: 0.075, maxContext: 200000 },
  embedding: { id: 'text-embedding-3-small', provider: 'openai', inputPricePer1k: 0.00002, outputPricePer1k: 0, maxContext: 8191 }
};

export function selectModel(intent: string): ModelConfig {
  const mapping: Record<string, string> = {
    trivial: 'haiku',
    code: 'sonnet',
    planning: 'opus',
    embedding: 'embedding'
  };
  const key = mapping[intent] || 'sonnet';
  return MODEL_REGISTRY[key];
}

export function calculateCost(inputTokens: number, outputTokens: number, model: ModelConfig): number {
  const inputCost = (inputTokens / 1000) * model.inputPricePer1k;
  const outputCost = (outputTokens / 1000) * model.outputPricePer1k;
  return inputCost + outputCost;
}
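
Step 3 computes a cost but does not record it; the cost telemetry component named earlier still needs a sink. A minimal sketch that appends one JSON line per call, assuming the ./logs/ai-router.jsonl path from the configuration template below (and that the logs directory already exists):

import { appendFileSync } from 'fs';

interface CallRecord {
  timestamp: string;
  intent: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

// Append one structured record per API call so cost can be attributed per intent and per model.
export function logCall(record: CallRecord, logPath = './logs/ai-router.jsonl'): void {
  appendFileSync(logPath, JSON.stringify(record) + '\n');
}

// Example wiring with hypothetical token counts:
const model = selectModel('trivial');
logCall({
  timestamp: new Date().toISOString(),
  intent: 'trivial',
  model: model.id,
  inputTokens: 1900,
  outputTokens: 300,
  costUsd: calculateCost(1900, 300, model),
});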

Architecture Rationale

  • Decoupled Classification: Separating intent detection from model selection allows you to swap classifiers (regex, ML, LLM-as-judge) without touching dispatch logic.
  • Token Budgeting: Enforcing limits before API calls prevents context window overflow and caps unpredictable costs.
  • Externalized Pricing: Storing model rates in configuration enables rapid updates when vendors adjust pricing, without code deployments.
  • Fallback Routing: Defaulting to a mid-tier model ensures continuity when classification confidence is low or APIs throttle.
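
The fallback-routing bullet is also where the confidence threshold from the configuration template comes in. A minimal sketch, assuming the scores map built inside classifyIntent is exposed to the caller; classifyWithConfidence is a hypothetical helper, not part of the code above:

// Gate a classification behind a confidence margin: if the top score is within `margin`
// (a fraction of the top score) of the runner-up, or nothing matched at all, use the mid tier.
export function classifyWithConfidence(scores: Record<string, number>, margin = 0.15): string {
  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1]);
  const [top, runnerUp] = ranked;
  if (top[1] === 0 || top[1] - runnerUp[1] <= top[1] * margin) {
    return 'code'; // low confidence: dispatch resolves this to the sonnet fallback
  }
  return top[0];
}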

Pitfall Guide

  1. Over-Indexing on Pattern Matching

    • Explanation: Relying solely on keyword matching causes misclassification when prompts use synonyms or domain-specific terminology.
    • Fix: Implement a confidence threshold. If the top score is within 10% of the second-highest, route to a fallback model or request clarification.
  2. Ignoring Output Token Costs

    • Explanation: Input token optimization is useless if the model generates excessive output. Unbounded responses can double your bill.
    • Fix: Set max_tokens parameters based on intent. Use streaming with early termination for long-form generation. Log output tokens alongside input.
  3. Context Window Overflow from Aggressive Pruning

    • Explanation: Stripping too much history breaks conversational continuity, causing the model to repeat work or lose state.
    • Fix: Implement sliding window history with semantic importance scoring. Preserve the last 3-5 turns regardless of token count.
  4. Hardcoding Model Prices

    • Explanation: Vendor pricing changes frequently. Hardcoded values cause inaccurate cost attribution and budgeting errors.
    • Fix: Load pricing from a versioned JSON/YAML config. Update via CI/CD pipeline or environment variables. Add a lastUpdated timestamp for audit trails.
  5. No Fallback or Retry Logic

    • Explanation: API rate limits, model deprecations, or network failures will break your pipeline if routing is rigid.
    • Fix: Implement exponential backoff with automatic fallback to the next tier. Log failure reasons and trigger alerts on sustained degradation. A sketch of this appears after the list.
  6. Skipping Latency Monitoring

    • Explanation: Cost optimization often sacrifices speed. Routing latency-sensitive tasks to slower models degrades developer experience.
    • Fix: Track p50/p95 response times per model. Route interactive prompts to faster tiers, batch tasks to higher-capability models.
  7. Assuming All Prompts Need Full Context

    • Explanation: Sending entire file contents for simple queries wastes tokens and increases hallucination risk.
    • Fix: Implement context-aware trimming. For code tasks, send only relevant function signatures and call sites. Use AST parsing to extract dependencies.
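
The retry-and-fallback fix from pitfall 5 can stay compact. A sketch in which send stands in for whatever function actually calls the provider; the tier ordering and attempt counts are illustrative assumptions:

// Step down through cheaper tiers when a model keeps failing.
const FALLBACK_CHAIN = ['opus', 'sonnet', 'haiku'];

export async function callWithBackoff(
  send: (model: ModelConfig) => Promise<string>, // placeholder for the actual provider call
  startKey: string,
  maxAttemptsPerTier = 3
): Promise<string> {
  const start = FALLBACK_CHAIN.indexOf(startKey);
  const chain = start >= 0 ? FALLBACK_CHAIN.slice(start) : ['sonnet'];

  for (const key of chain) {
    const model = MODEL_REGISTRY[key];
    for (let attempt = 0; attempt < maxAttemptsPerTier; attempt++) {
      try {
        return await send(model);
      } catch {
        // Exponential backoff: 1s, 2s, 4s before retrying the same tier.
        await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
      }
    }
    console.warn(`Exhausted retries on ${key}; falling back to the next tier`);
  }
  throw new Error('All model tiers failed');
}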

Production Bundle

Action Checklist

  • Audit current AI token usage: Export API logs and calculate average input/output tokens per session (a minimal aggregation sketch follows this checklist).
  • Instrument cost telemetry: Add logging for every API call including model, tokens, intent, and calculated cost.
  • Define intent categories: Map your team's most common prompts to 4-6 routing buckets.
  • Build the classifier: Implement a scoring engine with fallback routing and confidence thresholds.
  • Enforce context budgets: Set hard token limits per intent and implement history truncation.
  • Externalize pricing: Move model rates to configuration files with version control.
  • Monitor latency: Track response times and adjust routing rules based on p95 thresholds.
  • Review monthly: Compare routing efficiency against baseline and iterate on classification rules.
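
The audit and monthly-review items assume you can roll the telemetry log up. A minimal aggregation sketch over the JSONL format; the field names follow the logging sketch earlier and are an assumption rather than a fixed schema:

import { readFileSync } from 'fs';

// Roll the JSONL telemetry log up into per-intent totals for the monthly review.
export function summarizeCosts(logPath = './logs/ai-router.jsonl') {
  const totals: Record<string, { calls: number; inputTokens: number; costUsd: number }> = {};
  for (const line of readFileSync(logPath, 'utf8').split('\n').filter(Boolean)) {
    const record = JSON.parse(line);
    const bucket = (totals[record.intent] ??= { calls: 0, inputTokens: 0, costUsd: 0 });
    bucket.calls += 1;
    bucket.inputTokens += record.inputTokens;
    bucket.costUsd += record.costUsd;
  }
  return totals;
}

console.table(summarizeCosts());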

Decision Matrix

Scenario | Recommended Approach | Why | Cost Impact
Solo developer, chat-heavy workflow | Keep wrapper + add cost logging | UI value outweighs routing complexity | Low optimization potential
Team with mixed CLI/chat usage | Custom router + shared config | Centralized routing reduces duplicate overhead | 30-45% reduction
Agent-heavy automation pipelines | Direct API + intent classifier | Bots generate predictable prompts; routing is highly effective | 50%+ reduction
Real-time collaborative editing | Wrapper + output token caps | Latency and state sync require vendor infrastructure | Moderate (focus on output limits)
Budget-constrained startup | Custom router + Haiku fallback | Maximizes usage volume within fixed budget | Highest ROI per dollar

Configuration Template

# ai-router.config.yml
version: 2.1
pricing:
  updated_at: "2024-05-01T00:00:00Z"
  models:
    haiku:
      input_per_1k: 0.00025
      output_per_1k: 0.00125
    sonnet:
      input_per_1k: 0.003
      output_per_1k: 0.015
    opus:
      input_per_1k: 0.015
      output_per_1k: 0.075
    embedding:
      input_per_1k: 0.00002
      output_per_1k: 0.0

routing:
  intents:
    trivial: { model: haiku, max_input: 2000, max_output: 500 }
    code: { model: sonnet, max_input: 8000, max_output: 4000 }
    planning: { model: opus, max_input: 12000, max_output: 6000 }
    embedding: { model: embedding, max_input: 8000, max_output: 0 }
  fallback: sonnet
  confidence_threshold: 0.15

telemetry:
  log_path: ./logs/ai-router.jsonl
  alert_on_cost_per_hour: 5.00
  retention_days: 90
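
How the file is consumed is left open here. One option, assuming the js-yaml package, is to load it at startup and overwrite the in-code registry prices so that a vendor price change never requires a redeploy:

import { readFileSync } from 'fs';
import { load } from 'js-yaml';

// Partial shape of ai-router.config.yml; routing and telemetry sections omitted for brevity.
interface RouterConfig {
  pricing: {
    updated_at: string;
    models: Record<string, { input_per_1k: number; output_per_1k: number }>;
  };
}

const config = load(readFileSync('ai-router.config.yml', 'utf8')) as RouterConfig;

// Overwrite the hardcoded registry prices with the externalized values.
for (const [key, price] of Object.entries(config.pricing.models)) {
  if (MODEL_REGISTRY[key]) {
    MODEL_REGISTRY[key].inputPricePer1k = price.input_per_1k;
    MODEL_REGISTRY[key].outputPricePer1k = price.output_per_1k;
  }
}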

Quick Start Guide

  1. Initialize the router: Clone the scoring engine and model dispatcher modules. Replace the MODEL_REGISTRY with your provider's current endpoints and pricing.
  2. Configure intent rules: Edit the keyword lists to match your team's terminology. Add domain-specific terms (e.g., terraform, graphql, reactive) to improve classification accuracy.
  3. Instrument logging: Add the cost calculator to your API wrapper. Write each call to a structured log file with timestamp, intent, model, tokens, and cost.
  4. Test with sample prompts: Run 20-30 representative prompts through the router. Verify intent classification matches expectations and token counts stay within configured budgets.
  5. Deploy and monitor: Route production traffic through the router. Review telemetry after 48 hours. Adjust confidence thresholds and context limits based on observed latency and cost distribution.