at route sensitive or latency-bound requests locally while offloading heavy reasoning, high-concurrency, or frontier-model demands to cloud endpoints. This approach captures sovereignty where legally required and cost efficiency where economically viable.
Core Solution
Building a cost-aware routing architecture requires decoupling inference selection from application logic. The goal is a middleware layer that evaluates payload characteristics, estimates token consumption, checks compliance flags, and routes to the optimal endpoint. Below is a production-grade TypeScript implementation that demonstrates this pattern.
Step 1: Define Endpoint Profiles and Cost Baselines
First, establish a type-safe contract for inference providers. Each profile contains pricing, latency expectations, concurrency limits, and capability tags.
interface InferenceProfile {
id: string;
provider: 'local' | 'cloud';
model: string;
costPerInputToken: number;
costPerOutputToken: number;
estimatedTTFT: number; // milliseconds
maxConcurrency: number;
supportsFrontier: boolean;
requiresCompliance: boolean;
}
const ENDPOINT_REGISTRY: Record<string, InferenceProfile> = {
local_mac_studio: {
id: 'local_mac_studio',
provider: 'local',
model: 'llama-3.3-70b-q4',
costPerInputToken: 0.000033,
costPerOutputToken: 0.000033,
estimatedTTFT: 85,
maxConcurrency: 1,
supportsFrontier: false,
requiresCompliance: true,
},
cloud_openrouter_qwen: {
id: 'cloud_openrouter_qwen',
provider: 'cloud',
model: 'qwen-2.5-72b-instruct',
costPerInputToken: 0.0000004,
costPerOutputToken: 0.0000004,
estimatedTTFT: 320,
maxConcurrency: 500,
supportsFrontier: false,
requiresCompliance: false,
},
cloud_openrouter_deepseek: {
id: 'cloud_openrouter_deepseek',
provider: 'cloud',
model: 'deepseek-v3.1',
costPerInputToken: 0.00000027,
costPerOutputToken: 0.0000011,
estimatedTTFT: 280,
maxConcurrency: 1000,
supportsFrontier: false,
requiresCompliance: false,
},
};
Step 2: Implement Token Estimation and Routing Logic
The router calculates expected cost based on prompt length, output constraints, and compliance requirements. It prioritizes local endpoints only when data residency is mandatory or latency thresholds are strict.
interface RoutingRequest {
prompt: string;
maxOutputTokens: number;
requiresDataSovereignty: boolean;
latencyBudgetMs: number;
complexity: 'routine' | 'complex';
}
interface RoutingDecision {
selectedEndpoint: string;
estimatedCost: number;
rationale: string;
}
class TokenEconomyRouter {
private estimateTokens(text: string): number {
// Approximation: ~1.3 tokens per English word
return Math.ceil(text.split(/\s+/).length * 1.3);
}
public resolveRoute(request: RoutingRequest): RoutingDecision {
const inputTokens = this.estimateTokens(request.prompt);
const outputTokens = request.maxOutputTokens;
// Compliance override: route locally if data cannot leave premises
if (request.requiresDataSovereignty) {
const localProfile = ENDPOINT_REGISTRY.local_mac_studio;
const cost = (inputTokens * localProfile.costPerInputToken) +
(outputTokens * localProfile.costPerOutputToken);
return {
selectedEndpoint: localProfile.id,
estimatedCost: cost,
rationale: 'Compliance mandate forces local routing despite higher per-token cost.',
};
}
// Latency-bound interactive loops favor local
if (request.latencyBudgetMs < 150 && request.complexity === 'routine') {
const localProfile = ENDPOINT_REGISTRY.local_mac_studio;
const cost = (inputTokens * localProfile.costPerInputToken) +
(outputTokens * localProfile.costPerOutputToken);
return {
selectedEndpoint: localProfile.id,
estimatedCost: cost,
rationale: 'Strict latency budget and routine complexity justify local execution.',
};
}
// Default to cloud for cost efficiency and concurrency
const cloudProfile = ENDPOINT_REGISTRY.cloud_openrouter_qwen;
const cost = (inputTokens * cloudProfile.costPerInputToken) +
(outputTokens * cloudProfile.costPerOutputToken);
return {
selectedEndpoint: cloudProfile.id,
estimatedCost: cost,
rationale: 'Cloud routing selected for cost efficiency and higher concurrency tolerance.',
};
}
}
Step 3: Architecture Decisions and Rationale
Why decouple routing from application code? Monolithic inference calls create tight coupling between business logic and infrastructure constraints. A dedicated router enables centralized policy enforcement, cost tracking, and fallback handling without scattering conditional logic across services.
Why use token estimation instead of exact counting? Exact tokenization requires loading a tokenizer model, adding latency and memory overhead. The word-to-token approximation (1.3x multiplier) provides sufficient accuracy for routing decisions while keeping the middleware lightweight. Production systems can swap this for a fast tokenizer (e.g., tiktoken or @huggingface/tokenizers) when precision is critical.
Why prioritize compliance and latency over cost in the decision tree? Economic optimization is secondary to legal and UX constraints. Data residency requirements are non-negotiable. Latency budgets below 150ms compound quickly in agentic loops or interactive coding assistants, making cloud round-trip overhead unacceptable. Cost routing applies only when neither constraint is active.
Why maintain a registry pattern? Hardcoding endpoints creates maintenance debt. A registry allows dynamic configuration updates, A/B testing of providers, and seamless addition of new models without redeploying application logic.
Pitfall Guide
1. Sunk Cost Procurement Fallacy
Explanation: Teams purchase high-memory Apple Silicon specifically for LLM inference, assuming long-term savings. Hardware depreciation dominates TCO, and break-even requires near-continuous utilization.
Fix: Treat local hardware as a general-purpose development machine. Evaluate inference workloads against existing capacity. Only provision dedicated inference nodes when compliance or latency mandates it.
2. Aggressive Quantization Degradation
Explanation: Dropping to 2-bit or 3-bit quantization to fit larger models into limited memory severely impacts reasoning quality, instruction following, and output consistency.
Fix: Cap quantization at 4-bit (Q4_K_M) for 70B-class models. Validate output quality against a benchmark suite before production deployment. Monitor perplexity drift over time.
3. Ignoring Concurrency Bottlenecks
Explanation: Apple Silicon handles one inference stream at full speed. Concurrent requests queue, increasing tail latency and degrading user experience.
Fix: Implement request queuing with timeout thresholds. Route concurrent workloads to cloud endpoints. Use local inference only for single-user or low-concurrency scenarios.
4. Thermal Throttling Under Sustained Load
Explanation: Continuous inference pushes M-series chips into thermal limits, reducing clock speeds and token throughput by 20-30% after 15-20 minutes.
Fix: Monitor thermal sensors via powermetrics or system APIs. Implement duty cycling or pause inference during thermal spikes. Ensure adequate ventilation and ambient cooling in deployment environments.
5. Misaligned Model Capability Expectations
Explanation: Local 70B models approximate GPT-4o-mini performance on standard benchmarks but lag significantly on complex reasoning, multi-step planning, and nuanced instruction following.
Fix: Map task complexity to model tier. Route routine autocomplete, summarization, and classification locally. Offload architectural design, debugging, and creative generation to frontier cloud models.
6. Overlooking Network Fallback Latency
Explanation: Hybrid routing assumes seamless failover. When local endpoints crash or exceed memory limits, fallback to cloud introduces 200-500ms latency spikes that break interactive UX.
Fix: Implement circuit breakers with warm cloud connections. Pre-authenticate cloud APIs and maintain persistent HTTP/2 sessions. Cache fallback routing decisions to avoid cold-start delays.
7. Neglecting Token Accounting Overhead
Explanation: Teams track cloud billing but ignore local hardware depreciation, power costs, and maintenance labor. This creates false economies and misinformed budgeting.
Fix: Implement unified token accounting across all endpoints. Tag requests with cost centers. Generate monthly TCO reports that include amortization, electricity, and cloud spend for accurate financial planning.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Healthcare/Financial data with strict residency laws | Local inference only | Compliance overrides cost; data cannot leave premises | High upfront hardware cost, zero cloud spend |
| High-throughput SaaS with thousands of concurrent users | Cloud routing via OpenRouter | Local concurrency caps at 1 stream; cloud scales elastically | Low per-token cost, predictable monthly billing |
| Developer autocomplete / IDE plugins | Hybrid routing | Routine tasks route locally for <100ms TTFT; complex queries route to cloud | Minimal cloud spend, optimized UX |
| Offline field operations / conference environments | Local inference only | Network dependency eliminated; deterministic performance | Hardware amortization dominates, no internet required |
| Complex reasoning / multi-step planning | Cloud frontier models | Local 70B lacks capability depth; cloud delivers higher accuracy | Higher per-token cost, reduced rework and error rates |
Configuration Template
# llm-routing-config.yaml
endpoints:
local:
host: "http://localhost:8080"
model: "llama-3.3-70b-q4"
max_concurrency: 1
thermal_threshold_c: 85
cost_per_mtok: 33.00
cloud:
provider: "openrouter"
api_key_env: "OPENROUTER_API_KEY"
default_model: "qwen-2.5-72b-instruct"
fallback_models:
- "deepseek-v3.1"
- "llama-3.3-70b"
cost_per_mtok: 0.65
routing_policy:
compliance_override: true
latency_budget_ms: 150
complexity_threshold: "complex"
fallback_timeout_ms: 2000
circuit_breaker:
failure_threshold: 3
recovery_timeout_s: 30
accounting:
enabled: true
export_format: "csv"
aggregation_interval: "daily"
Quick Start Guide
- Install dependencies:
npm install @huggingface/tokenizers axios yaml
- Configure endpoints: Copy the YAML template, set
OPENROUTER_API_KEY, and adjust thermal thresholds to match your hardware.
- Initialize the router: Import
TokenEconomyRouter, load the configuration, and instantiate the routing engine in your application's middleware layer.
- Test routing logic: Send sample requests with varying complexity, latency budgets, and compliance flags. Verify endpoint selection matches the decision matrix.
- Deploy with monitoring: Enable token accounting, attach thermal alerts, and route production traffic. Review cost dashboards weekly to validate TCO assumptions.