implementation demonstrates a TypeScript-based hybrid router that dynamically directs traffic based on task complexity, token budget, and real-time performance metrics.
Architecture Decisions and Rationale
- Payload Classification Layer: Instead of hardcoding routes, the system analyzes incoming requests for complexity indicators (token count, presence of multi-step instructions, required output schema). This prevents overloading local nodes with tasks they cannot handle efficiently.
- Latency-Driven Fallback: Local inference throughput degrades under concurrency, so the router times each local request against the SLO threshold. If a request exceeds it, the router retries via the cloud provider. This preserves user experience while maximizing local utilization. Note that the implementation below checks requests individually; it does not compute an aggregate p95.
- Token Budget Enforcement: Strict input/output limits prevent context window exhaustion and reduce unnecessary compute. Requests exceeding local capacity are routed upstream.
- Metrics-Driven Routing: The system tracks success rates, latency, and cost per route. Over time, routing rules can be adjusted based on empirical performance rather than static configuration.
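The metrics-driven item above can be sketched as a small in-memory aggregator. This is a minimal illustration, not part of `./router-core`: the names `RouteStats` and `RouteMetrics` are hypothetical, and a production system would more likely export these counters to a metrics backend.

```typescript
// Hypothetical in-memory metrics store; all names are illustrative.
type Route = 'local' | 'cloud';

interface RouteStats {
  requests: number;
  failures: number;
  totalLatencyMs: number;
  totalCostUsd: number;
}

const emptyStats = (): RouteStats => ({
  requests: 0,
  failures: 0,
  totalLatencyMs: 0,
  totalCostUsd: 0,
});

class RouteMetrics {
  private stats: Record<Route, RouteStats> = {
    local: emptyStats(),
    cloud: emptyStats(),
  };

  // Record the outcome of one request on the given route.
  record(route: Route, ok: boolean, latencyMs: number, costUsd: number): void {
    const s = this.stats[route];
    s.requests += 1;
    if (!ok) s.failures += 1;
    s.totalLatencyMs += latencyMs;
    s.totalCostUsd += costUsd;
  }

  // Fraction of successful requests; reports 1 when the route has no traffic yet.
  successRate(route: Route): number {
    const s = this.stats[route];
    return s.requests === 0 ? 1 : (s.requests - s.failures) / s.requests;
  }

  meanLatencyMs(route: Route): number {
    const s = this.stats[route];
    return s.requests === 0 ? 0 : s.totalLatencyMs / s.requests;
  }

  meanCostUsd(route: Route): number {
    const s = this.stats[route];
    return s.requests === 0 ? 0 : s.totalCostUsd / s.requests;
  }
}
```

Comparing `successRate('local')` and `meanCostUsd('cloud')` over a review period gives an empirical basis for raising or lowering the local complexity threshold.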
Implementation
import OpenAI from 'openai';
import { InferenceRouter, RouteDecision, PayloadProfile } from './router-core';

// Configuration interfaces
interface InferenceConfig {
  localEndpoint: string;
  cloudApiKey: string;
  slos: {
    maxLatencyMs: number;
    maxInputTokens: number;
    maxOutputTokens: number;
  };
  routing: {
    localThreshold: number; // 0-1 complexity score
    fallbackEnabled: boolean;
  };
}

// Payload analyzer determines task complexity
class PayloadAnalyzer {
  static analyze(prompt: string, schema?: object): PayloadProfile {
    // Rough heuristic: about 4 characters per token for English text
    const tokenEstimate = Math.ceil(prompt.length / 4);
    const hasComplexInstructions = /reason|plan|analyze|compare|long-context/i.test(prompt);
    const requiresStructuredOutput = !!schema;

    let complexityScore = 0.2; // Base score
    if (hasComplexInstructions) complexityScore += 0.4;
    if (tokenEstimate > 4000) complexityScore += 0.3;
    if (requiresStructuredOutput) complexityScore += 0.1;

    return {
      tokenEstimate,
      complexityScore: Math.min(complexityScore, 1.0),
      requiresStructuredOutput,
    };
  }
}

// Hybrid inference router
class CostAwareRouter extends InferenceRouter {
  private localClient: OpenAI;
  private cloudClient: OpenAI;
  private config: InferenceConfig;

  constructor(config: InferenceConfig) {
    super();
    this.config = config;
    // The local runtime exposes an OpenAI-compatible endpoint; the API key is a placeholder.
    this.localClient = new OpenAI({ baseURL: config.localEndpoint, apiKey: 'local-bypass' });
    this.cloudClient = new OpenAI({ apiKey: config.cloudApiKey });
  }

  async routeRequest(prompt: string, schema?: object): Promise<RouteDecision> {
    const profile = PayloadAnalyzer.analyze(prompt, schema);

    // Route to local if complexity is low and within token budget
    const shouldUseLocal =
      profile.complexityScore <= this.config.routing.localThreshold &&
      profile.tokenEstimate <= this.config.slos.maxInputTokens;

    if (shouldUseLocal) {
      const startTime = Date.now();
      try {
        const response = await this.localClient.chat.completions.create(
          {
            model: 'gemma4:4b',
            messages: [{ role: 'user', content: prompt }],
            max_tokens: this.config.slos.maxOutputTokens,
            temperature: 0.2,
          },
          // Abort at the SLO instead of waiting for a late response and then
          // discarding it; disable SDK retries so the timeout is a hard limit.
          { timeout: this.config.slos.maxLatencyMs, maxRetries: 0 },
        );
        const latency = Date.now() - startTime;
        return { route: 'local', latency, cost: 0, payload: response.choices[0]?.message?.content ?? '' };
      } catch (error) {
        // Timeouts (SLO breaches) and other local failures both land here.
        if (this.config.routing.fallbackEnabled) {
          return this.executeCloudFallback(prompt, schema, Date.now() - startTime);
        }
        throw error;
      }
    }

    return this.executeCloudRequest(prompt, schema);
  }

  // Note: schema currently influences routing only; it is not forwarded to the provider.
  private async executeCloudRequest(prompt: string, schema?: object): Promise<RouteDecision> {
    const startTime = Date.now();
    const response = await this.cloudClient.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: prompt }],
      max_tokens: this.config.slos.maxOutputTokens,
      temperature: 0.2,
    });
    const latency = Date.now() - startTime;
    const content = response.choices[0]?.message?.content ?? '';
    const estimatedCost = this.calculateCloudCost(prompt, content);
    return { route: 'cloud', latency, cost: estimatedCost, payload: content };
  }

  private async executeCloudFallback(prompt: string, schema?: object, localLatency?: number): Promise<RouteDecision> {
    const elapsed = localLatency !== undefined ? `${localLatency}ms` : 'unknown';
    console.warn(`Falling back to cloud. Time spent on local route: ${elapsed}`);
    return this.executeCloudRequest(prompt, schema);
  }

  // Estimated cost in USD from character-based token estimates.
  // Rates are per 1M tokens (input 0.15, output 0.60); check current provider pricing.
  private calculateCloudCost(input: string, output: string): number {
    const inputTokens = Math.ceil(input.length / 4);
    const outputTokens = Math.ceil(output.length / 4);
    const inputCost = (inputTokens / 1_000_000) * 0.15;
    const outputCost = (outputTokens / 1_000_000) * 0.60;
    return inputCost + outputCost;
  }
}

export { CostAwareRouter };
export type { InferenceConfig };
The router prioritizes predictability over raw capability. By enforcing token budgets and complexity thresholds, it prevents local nodes from becoming latency bottlenecks. The fallback mechanism ensures that SLO violations trigger automatic cloud routing, preserving user experience while capturing cost savings on routine tasks. This architecture scales horizontally: you can add multiple local nodes behind a load balancer, or swap model weights without modifying the routing logic.
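To make the thresholds concrete, the sketch below restates the scoring rules from `PayloadAnalyzer` as a standalone function and applies them to three prompts. The names `scorePrompt` and `routeFor` and the sample prompts are illustrative only; the threshold of 0.4 matches the `complexity_threshold` in the configuration template later in this article.

```typescript
// Standalone restatement of the PayloadAnalyzer scoring rules (illustrative names).
function scorePrompt(prompt: string, hasSchema = false): number {
  const tokenEstimate = Math.ceil(prompt.length / 4);
  let score = 0.2; // base score
  if (/reason|plan|analyze|compare|long-context/i.test(prompt)) score += 0.4;
  if (tokenEstimate > 4000) score += 0.3;
  if (hasSchema) score += 0.1;
  return Math.min(score, 1.0);
}

const LOCAL_THRESHOLD = 0.4;

function routeFor(prompt: string, hasSchema = false): 'local' | 'cloud' {
  return scorePrompt(prompt, hasSchema) <= LOCAL_THRESHOLD ? 'local' : 'cloud';
}

routeFor('Extract the invoice number from this email.');      // 'local' (base score only)
routeFor('Extract the invoice number as JSON.', true);         // 'local' (base + schema)
routeFor('Compare these two contracts and plan next steps.');  // 'cloud' (keyword match)
```

With a threshold of 0.4, any keyword match alone is enough to send a request to the cloud. Because the regex matches substrings, a word such as "explanation" also triggers the `plan` keyword, so a real deployment would likely want word boundaries or a stronger classifier.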
Pitfall Guide
1. Concurrency Blindness
Explanation: Local inference engines queue requests when multiple users hit the same endpoint. A single Mac mini handling 80 tokens/sec will degrade rapidly under concurrent load, causing p95 latency to spike.
Fix: Implement request batching, async job queues, or deploy multiple inference nodes behind a round-robin load balancer. Monitor queue depth and trigger cloud fallback when backlog exceeds threshold.
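One way to act on backlog, sketched under the assumption that the application can count its own in-flight local requests: a small admission gate that refuses new local work past a limit so the caller can route to the cloud instead. `LocalAdmissionGate` and `withLocalOrCloud` are hypothetical names, not part of the router above.

```typescript
// Hypothetical admission gate: caps in-flight local requests.
class LocalAdmissionGate {
  private inFlight = 0;
  private readonly maxInFlight: number;

  constructor(maxInFlight: number) {
    this.maxInFlight = maxInFlight;
  }

  // Returns true and reserves a slot if the local node has capacity.
  tryAcquire(): boolean {
    if (this.inFlight >= this.maxInFlight) return false;
    this.inFlight += 1;
    return true;
  }

  // Frees a slot; never drops below zero.
  release(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }
}

// Runs the local call when a slot is free, otherwise goes straight to the cloud.
async function withLocalOrCloud<T>(
  gate: LocalAdmissionGate,
  runLocal: () => Promise<T>,
  runCloud: () => Promise<T>,
): Promise<T> {
  if (!gate.tryAcquire()) return runCloud();
  try {
    return await runLocal();
  } finally {
    gate.release();
  }
}
```

The limit should come from measurement: raise concurrency on the local node until p95 latency approaches the SLO, then set `maxInFlight` just below that point.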
2. Quality Parity Fallacy
Explanation: Expecting a 4B-parameter model to match frontier reasoning capabilities leads to degraded outputs, increased retry rates, and higher effective costs due to failed requests.
Fix: Define task-specific acceptance criteria. Use local models for extraction, classification, routing, and short-context generation. Reserve cloud APIs for multi-step planning, long-document analysis, and complex code reasoning.
3. Maintenance Debt Accumulation
Explanation: Cloud providers handle model updates, security patches, and API versioning. Local deployments require manual quantization, template alignment, and dependency management. Over time, context template drift breaks output parsing.
Fix: Pin model versions in production. Automate validation pipelines that test output format consistency after any model swap. Maintain a rollback strategy and document template requirements for each model variant.
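A format check of this kind can be very small. The sketch below assumes the model is expected to return a JSON object with known keys; `validateJsonOutput` is a hypothetical helper, and the sample outputs are made up for illustration.

```typescript
// Returns a list of format problems; an empty list means the output passed.
function validateJsonOutput(raw: string, requiredKeys: string[]): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ['output is not valid JSON'];
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return ['output is not a JSON object'];
  }
  const errors: string[] = [];
  for (const key of requiredKeys) {
    if (!(key in parsed)) errors.push(`missing key: ${key}`);
  }
  return errors;
}

// Example: run recorded outputs from a candidate model through the check.
const recordedOutputs = [
  '{"invoice_id": "A-17", "total": 42.5}',
  'Sure! Here is the JSON: {"invoice_id": "A-17"}',
];
const failures = recordedOutputs
  .map((out) => validateJsonOutput(out, ['invoice_id', 'total']))
  .filter((errs) => errs.length > 0);
```

Running this over a fixed set of prompts after every model or template change, and blocking the rollout when `failures` is non-empty, catches template drift before it reaches output parsing in production.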
4. Steady Traffic Assumption
Explanation: Local hardware has fixed throughput. Viral traffic or bursty usage patterns will overwhelm single-node deployments, causing request drops or severe latency degradation.
Fix: Use cloud APIs for burst handling. Implement auto-scaling GPU clusters if local deployment is mandatory for high-volume periods. Alternatively, queue burst traffic and process asynchronously with clear user feedback.
5. Premature Infrastructure Lock-in
Explanation: Deploying local inference before product-market fit diverts engineering resources from core product development. At that stage, the cloud bill is usually smaller than the opportunity cost of maintaining inference infrastructure.
Fix: Delay local deployment until unit economics justify the investment. Use cloud APIs during validation phases. Transition to local routing only when monthly token spend consistently exceeds hardware amortization thresholds.
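The amortization threshold is simple arithmetic. The sketch below uses a hypothetical helper, `breakEvenMonths`, and made-up figures; it deliberately ignores engineering time, which this pitfall argues is often the larger cost.

```typescript
// Months until hardware pays for itself, given steady monthly spend.
// Returns Infinity when local operation saves nothing.
function breakEvenMonths(
  hardwareCostUsd: number,
  monthlyCloudSpendUsd: number,
  monthlyLocalOpexUsd: number,
): number {
  const monthlySavings = monthlyCloudSpendUsd - monthlyLocalOpexUsd;
  if (monthlySavings <= 0) return Infinity;
  return hardwareCostUsd / monthlySavings;
}

// Hypothetical figures: $1,200 of hardware, $150/month cloud spend,
// $30/month for power and upkeep.
breakEvenMonths(1200, 150, 30); // 10
```

If the break-even horizon is longer than the period over which the workload is likely to stay stable, staying on cloud APIs is the safer default.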
6. Token Counting Errors
Explanation: Miscounting system prompts, tool definitions, or streaming overhead leads to context window exhaustion and silent truncation. This causes malformed outputs and routing failures.
Fix: Implement strict token budgeting at the application layer. Use tokenizer libraries to count exact tokens before routing. Reserve 10-15% of the context window for system instructions and tool schemas.
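The budgeting rule can be expressed as a small calculation. In this sketch the token counter is injected so that a real tokenizer library can be plugged in; the `approxCount` stand-in reuses the article's characters-divided-by-four heuristic and is not exact. The function names and the 8192-token window are assumptions for illustration.

```typescript
type TokenCounter = (text: string) => number;

// Tokens left for the user prompt after reserving space for system
// instructions, tool schemas, and the model's output.
function promptBudget(
  contextWindow: number,
  maxOutputTokens: number,
  reserveFraction = 0.15,
): number {
  const reserved = Math.ceil(contextWindow * reserveFraction);
  return Math.max(contextWindow - reserved - maxOutputTokens, 0);
}

function fitsPromptBudget(prompt: string, count: TokenCounter, budget: number): boolean {
  return count(prompt) <= budget;
}

// Stand-in counter; swap in an exact tokenizer for the model in use.
const approxCount: TokenCounter = (text) => Math.ceil(text.length / 4);

// Hypothetical 8192-token window with 1024 output tokens and a 15% reserve.
const budget = promptBudget(8192, 1024, 0.15); // 5939
const fits = fitsPromptBudget('Summarize this ticket.', approxCount, budget);
```

Requests that fail `fitsPromptBudget` are the ones to route upstream rather than truncate.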
7. Latency SLO Neglect
Explanation: Focusing exclusively on cost reduction while ignoring response time guarantees degrades user experience. Local inference may be cheaper but slower under load.
Fix: Set hard latency thresholds (e.g., p95 < 2000ms). Implement automatic fallback to cloud providers when local response times breach SLOs. Track latency distributions, not just averages.
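Tracking the distribution rather than the average can be done with a sliding window of recent samples. The sketch below uses the nearest-rank method for percentiles; `LatencyWindow` is a hypothetical helper, and the default window size of 200 samples is an arbitrary choice.

```typescript
// Sliding window over the most recent latency samples.
class LatencyWindow {
  private samples: number[] = [];
  private readonly capacity: number;

  constructor(capacity = 200) {
    this.capacity = capacity;
  }

  record(latencyMs: number): void {
    this.samples.push(latencyMs);
    if (this.samples.length > this.capacity) this.samples.shift();
  }

  // Nearest-rank percentile; returns 0 when no samples have been recorded.
  percentile(p: number): number {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(rank, 1) - 1];
  }

  mean(): number {
    if (this.samples.length === 0) return 0;
    return this.samples.reduce((sum, x) => sum + x, 0) / this.samples.length;
  }

  // True when the observed p95 exceeds the SLO.
  breaches(sloMs: number): boolean {
    return this.percentile(95) > sloMs;
  }
}
```

As an illustration of why averages mislead: with 90 requests at 100 ms and 10 at 3000 ms, the mean is 390 ms, well inside a 2000 ms SLO, while the p95 is 3000 ms and `breaches(2000)` returns true.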
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Pre-PMF validation | Cloud-only | Engineering velocity outweighs marginal cost savings; avoids infra overhead | Higher variable cost, lower opportunity cost |
| High-volume extraction/classification | Local-by-default | Predictable workload, low complexity, marginal cost approaches zero | Fixed hardware cost, ~90% reduction in inference spend |
| Spiky/viral traffic patterns | Cloud-first with local caching | Local nodes cannot handle burst concurrency; cloud provides elastic scaling | Higher cloud spend during peaks, stable baseline |
| Privacy/air-gapped requirements | Local-only | Compliance mandates data residency; cloud APIs violate security policies | High upfront capital, zero ongoing token cost |
| Complex reasoning/long-context | Cloud-only | Frontier models outperform local on multi-step planning and 50k+ token windows | Premium pricing, but necessary for quality |
Configuration Template
# inference-router.config.yaml
routing:
  local:
    endpoint: "http://localhost:11434/v1"
    model: "gemma4:4b"
    max_input_tokens: 4000
    max_output_tokens: 1024
    complexity_threshold: 0.4
  cloud:
    provider: "openai"
    model: "gpt-4o-mini"
    api_key_env: "CLOUD_API_KEY"
  fallback:
    enabled: true
    latency_threshold_ms: 2000
    max_retries: 1
slos:
  p95_latency_ms: 1500
  quality_acceptance_rate: 0.92
metrics:
  export_interval_sec: 30
  log_level: "info"
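The template's keys do not map one-to-one onto the `InferenceConfig` interface used by `CostAwareRouter`, so some translation is needed. The sketch below shows one possible mapping, assuming the file has already been parsed by a YAML library into a plain object and that `local`, `cloud`, and `fallback` nest under `routing`. The `RouterYaml` type, the `toInferenceConfig` function, and the choice of which YAML field feeds which config field are all assumptions.

```typescript
// Assumed shape of the parsed YAML (only the fields this mapping uses).
interface RouterYaml {
  routing: {
    local: {
      endpoint: string;
      max_input_tokens: number;
      max_output_tokens: number;
      complexity_threshold: number;
    };
    cloud: { api_key_env: string };
    fallback: { enabled: boolean; latency_threshold_ms: number };
  };
}

// Repeated from the implementation so the sketch is self-contained.
interface InferenceConfig {
  localEndpoint: string;
  cloudApiKey: string;
  slos: { maxLatencyMs: number; maxInputTokens: number; maxOutputTokens: number };
  routing: { localThreshold: number; fallbackEnabled: boolean };
}

function toInferenceConfig(
  raw: RouterYaml,
  env: Record<string, string | undefined>,
): InferenceConfig {
  const { local, cloud, fallback } = raw.routing;
  // The YAML stores the name of the environment variable, not the key itself.
  const apiKey = env[cloud.api_key_env];
  if (!apiKey) {
    throw new Error(`Missing environment variable ${cloud.api_key_env}`);
  }
  return {
    localEndpoint: local.endpoint,
    cloudApiKey: apiKey,
    slos: {
      maxLatencyMs: fallback.latency_threshold_ms,
      maxInputTokens: local.max_input_tokens,
      maxOutputTokens: local.max_output_tokens,
    },
    routing: {
      localThreshold: local.complexity_threshold,
      fallbackEnabled: fallback.enabled,
    },
  };
}
```

In a Node.js application, `env` would typically be `process.env`; passing it as a parameter keeps the function easy to test.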
Quick Start Guide
- Install local inference runtime: Deploy Ollama or vLLM on your target hardware. Pull the Gemma 4 4B or Qwen 3 7B model weights.
- Configure environment variables: Set LOCAL_INFERENCE_ENDPOINT, CLOUD_API_KEY, and routing thresholds in your application config.
- Initialize the router: Instantiate the CostAwareRouter with your configuration. Replace direct API calls with router.routeRequest(prompt, schema).
- Run validation suite: Execute your last 100 production requests through the router. Verify p95 latency, output quality, and fallback behavior.
- Enable gradual rollout: Route 10% of traffic locally. Monitor metrics for 48 hours. Increase to 50%, then 100% as confidence stabilizes.
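The gradual rollout step can be implemented with deterministic bucketing, so that a given user or request key stays on the same route as the percentage grows. The sketch below is one possible approach using an FNV-1a style hash over the key's UTF-16 code units; `fnv1a` and `inLocalRollout` are hypothetical helpers, and the key format is an assumption.

```typescript
// 32-bit FNV-1a style hash (over UTF-16 code units; identical to byte-wise
// FNV-1a for ASCII keys). Deterministic, not cryptographic.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// True when the key falls inside the first `rolloutPercent` of 100 buckets.
function inLocalRollout(key: string, rolloutPercent: number): boolean {
  return fnv1a(key) % 100 < rolloutPercent;
}

// Example: decide per user whether the local route is eligible at 10%.
const eligible = inLocalRollout('user-1234', 10);
```

Because buckets are fixed per key, every key that is eligible at 10% remains eligible at 50% and 100%, which keeps before-and-after comparisons clean during the 48-hour monitoring windows.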