uts API spend from a line item to a rounding error without sacrificing capability.
Core Solution
Building a cost-aware routing system requires decoupling task complexity from model selection, abstracting provider-specific cache logic, and calculating real-time costs before dispatch. The following implementation demonstrates a production-ready TypeScript architecture.
Step 1: Define Workload Characteristics
Models should be selected based on input/output ratios, repetition frequency, and context length. We abstract this into a RequestProfile interface.
interface RequestProfile {
inputTokens: number;
outputTokens: number;
hasRepeatedPrefix: boolean;
contextLength: number;
complexityScore: number; // 0-1 scale
}
Step 2: Implement a Cost Calculator
The calculator normalizes pricing across providers, applies cache discounts, and accounts for write premiums and context surcharges.
class CostCalculator {
private readonly CACHE_DISCOUNT_RATE = 0.90;
private readonly DEEPSEEK_CACHE_DISCOUNT = 0.98;
private readonly ANTHROPIC_WRITE_PREMIUM = 1.25;
private readonly GOOGLE_LONG_CONTEXT_THRESHOLD = 200_000;
private readonly GOOGLE_LONG_CONTEXT_MULTIPLIER = 2.0;
calculate(profile: RequestProfile, model: ModelSpec, isFirstCall: boolean): number {
let inputCost = model.inputRate * (profile.inputTokens / 1_000_000);
let outputCost = model.outputRate * (profile.outputTokens / 1_000_000);
// Apply context window surcharge
if (profile.contextLength > this.GOOGLE_LONG_CONTEXT_THRESHOLD && model.provider === 'google') {
inputCost *= this.GOOGLE_LONG_CONTEXT_MULTIPLIER;
}
// Apply cache logic
if (profile.hasRepeatedPrefix) {
if (model.provider === 'deepseek') {
inputCost *= (1 - this.DEEPSEEK_CACHE_DISCOUNT);
} else if (model.provider === 'anthropic') {
inputCost = isFirstCall
? inputCost * this.ANTHROPIC_WRITE_PREMIUM
: inputCost * (1 - this.CACHE_DISCOUNT_RATE);
} else {
inputCost *= (1 - this.CACHE_DISCOUNT_RATE);
}
}
return inputCost + outputCost;
}
}
Step 3: Build the Routing Layer
The router evaluates complexity scores, cache eligibility, and cost thresholds to select the optimal model.
interface ModelSpec {
id: string;
provider: 'openai' | 'anthropic' | 'google' | 'deepseek';
inputRate: number;
outputRate: number;
maxContext: number;
tier: 'budget' | 'mid' | 'frontier';
}
class LLMRouter {
private calculator = new CostCalculator();
private cacheHitMap = new Map<string, boolean>();
route(request: RequestProfile, availableModels: ModelSpec[]): ModelSpec {
const eligible = availableModels.filter(m => m.maxContext >= request.contextLength);
// Force frontier tier for high-complexity tasks
if (request.complexityScore > 0.85) {
return eligible.find(m => m.tier === 'frontier') ?? eligible[0];
}
// Cost-optimized selection for mid/budget tasks
const scored = eligible.map(model => {
const cacheKey = `${model.id}:${request.inputTokens}`;
const isFirst = !this.cacheHitMap.has(cacheKey);
const cost = this.calculator.calculate(request, model, isFirst);
this.cacheHitMap.set(cacheKey, true);
return { model, cost };
});
scored.sort((a, b) => a.cost - b.cost);
return scored[0].model;
}
}
Architecture Decisions & Rationale
- Pre-dispatch cost calculation: Computing costs before routing prevents accidental dispatch to expensive models when budget alternatives meet complexity thresholds.
- Provider-specific cache abstraction: Caching mechanics differ fundamentally. Anthropic's write premium requires tracking first-call state, while DeepSeek's 98% discount justifies aggressive prefix reuse. Abstracting this prevents provider lock-in.
- Complexity scoring over hard rules: Static routing (e.g., "always use Flash-Lite for classification") fails when edge cases require reasoning. A 0β1 complexity score enables dynamic fallback without manual intervention.
- Context window validation: Filtering models by
maxContext before cost calculation prevents silent truncation or surcharge application.
Pitfall Guide
Explanation: Teams optimize for input pricing but overlook that generation-heavy workloads amplify output costs. OpenAI's 6xβ8x ratios make long responses expensive, while DeepSeek's 2x ratio flips the economics.
Fix: Calculate total cost using (inputTokens * inputRate) + (outputTokens * outputRate). Weight routing decisions by expected response length, not just prompt size.
2. Misjudging Cache Write Premiums
Explanation: Anthropic charges 25% more on cache writes. Assuming immediate savings leads to negative ROI on low-frequency prefixes.
Fix: Implement a hit counter. Only enable caching for prefixes that repeat β₯3 times within the TTL window. Track first-call costs separately in monitoring dashboards.
3. Overlooking Context Window Surcharges
Explanation: Google applies a 2x multiplier for prompts exceeding 200K tokens. Models appear cheap until long-context workloads trigger hidden pricing.
Fix: Validate contextLength against provider thresholds before routing. Implement chunking or summarization pipelines to keep prompts under surcharge boundaries.
Explanation: Providers enforce different expiration windows (typically 5β10 minutes). Assuming a fixed TTL causes cache misses and unexpected full-price charges.
Fix: Abstract TTL configuration per provider. Use short-lived in-memory caches for high-frequency routing, and persist cache keys with provider-specific expiration metadata.
5. Hardcoding Tier Assignments
Explanation: Static routing (e.g., "budget for extraction, frontier for code") fails when task complexity varies within categories. A simple extraction with ambiguous schema requires reasoning.
Fix: Replace static rules with a complexity scoring function that evaluates schema ambiguity, instruction count, and required reasoning depth. Route dynamically based on score thresholds.
6. Neglecting Batch Processing Discounts
Explanation: Providers offer 30β50% discounts for async batch jobs. Real-time routing ignores these savings, inflating costs for non-urgent workloads.
Fix: Implement a workload classifier that flags latency-tolerant tasks. Route them to batch endpoints with automatic retry and result polling.
7. Treating Caching as Binary
Explanation: Prefix caching only matches exact token sequences. Semantic variations, dynamic variables, or reordered instructions break cache hits.
Fix: Normalize prompts before routing. Strip timestamps, standardize JSON key order, and extract dynamic variables into a separate context layer. Use semantic caching proxies for near-duplicate matching.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| High-throughput classification (<500 tokens) | Budget tier + aggressive prefix caching | Low complexity, high repetition, minimal output | 90β95% reduction vs. frontier |
| Complex reasoning / code generation | Frontier tier + semantic caching | Requires multi-step logic, output quality critical | 3β5x higher than budget, but error reduction offsets cost |
| Long-context analysis (>200K tokens) | Mid-tier + prompt chunking | Avoids Google surcharge, maintains reasoning depth | Prevents 2x multiplier, keeps costs predictable |
| Generation-heavy workflows (long responses) | DeepSeek V4 Pro/Flash | 2x output ratio drastically lowers generation cost | 40β60% cheaper than 6xβ8x ratio models |
| Latency-tolerant batch jobs | Async batch routing | Provider discounts apply, no real-time SLA required | 30β50% savings vs. synchronous routing |
Configuration Template
llm_router:
providers:
openai:
cache_discount: 0.90
write_premium: false
context_surge_threshold: null
anthropic:
cache_discount: 0.90
write_premium: 1.25
cache_break_even: 3
context_surge_threshold: null
google:
cache_discount: 0.90
write_premium: false
context_surge_threshold: 200000
context_surge_multiplier: 2.0
deepseek:
cache_discount: 0.98
write_premium: false
context_surge_threshold: null
routing_rules:
complexity_thresholds:
budget: 0.0 - 0.4
mid: 0.4 - 0.7
frontier: 0.7 - 1.0
batch_eligible:
max_latency_ms: 5000
allowed_tasks: ["extraction", "summarization", "classification"]
cache_policies:
ttl_seconds: 300
min_hit_count: 2
normalize_json_keys: true
strip_dynamic_variables: true
Quick Start Guide
- Install dependencies: Add
@anthropic-ai/sdk, openai, @google/generative-ai, and deepseek client libraries to your project.
- Initialize the router: Import the
LLMRouter and CostCalculator classes. Load provider API keys and configure the YAML template.
- Define request profiles: Wrap incoming requests in
RequestProfile objects, calculating complexity scores and context lengths before routing.
- Deploy with monitoring: Route traffic through the router, log cache hit rates, first-call costs, and model selection decisions. Set alerts for cache break-even violations and context surcharge triggers.