Ollama and LM Studio expose an API compatible with api.openai.com/v1. By standardizing on this interface, we avoid vendor lock-in and allow seamless swapping between local and cloud backends without rewriting business logic.
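2. Quantization-Aware Model Selection: Local models ship in multiple quantization formats (Q4_K_M, Q5_K_S, Q8_0). The router should expose a quantization parameter that balances memory footprint against output fidelity. Q4_K_M is a sensible default for 8B models on 16GB systems. Q8_0 is preferable when memory pressure is manageable, but at roughly one byte per parameter a 70B model needs about 75GB for weights alone, so it calls for well over 64GB of memory; on a 64GB system a 70B model is realistic only at around 4-bit quantization.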
3. Capability-Based Routing: Instead of hardcoding fallback logic, the router evaluates three dimensions: context length, tool complexity, and privacy sensitivity. This prevents local models from being forced into tasks they cannot reliably execute.
4. Warm-State Management: Local models incur a cold-start penalty while loading weights into memory. The router should talk to a long-running local daemon that keeps the model resident, so interactive sessions avoid reloading weights and can reuse a warm KV cache.
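The memory-versus-fidelity trade-off in item 2 can be made concrete with a rough size estimate. The following is a minimal sketch, not part of the router itself: the bits-per-weight figures are approximate values for GGUF formats, and `estimateWeightGB`, `fitsInMemory`, and the 25% headroom are hypothetical names and defaults chosen for illustration.

```typescript
type Quantization = 'Q4_K_M' | 'Q5_K_S' | 'Q8_0';

// Approximate bits per weight for each format (assumed values; real files vary by model).
const BITS_PER_WEIGHT: Record<Quantization, number> = {
  Q4_K_M: 4.85,
  Q5_K_S: 5.5,
  Q8_0: 8.5,
};

// Rough weight size in GB: parameters (in billions) times bits per weight, divided by 8.
function estimateWeightGB(paramsBillions: number, quant: Quantization): number {
  return (paramsBillions * BITS_PER_WEIGHT[quant]) / 8;
}

// True when the weights fit in memory after reserving headroom for the KV cache,
// the operating system, and other processes (25% by default, an assumption).
function fitsInMemory(
  paramsBillions: number,
  quant: Quantization,
  memoryGB: number,
  headroomFraction = 0.25,
): boolean {
  const budgetGB = memoryGB * (1 - headroomFraction);
  return estimateWeightGB(paramsBillions, quant) <= budgetGB;
}
```

Under these assumptions an 8B model at Q4_K_M comes to about 5GB and fits easily on a 16GB machine, while a 70B model fits a 64GB machine at Q4_K_M but not at Q8_0. A router could run a check like this at startup and refuse to select a model and quantization pair that does not fit.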
Implementation
// Domain models
interface InferenceRequest {
  prompt: string;
  systemContext?: string;
  maxTokens: number;
  tools?: string[];
  requiresPrivacy: boolean;
}

interface RoutingDecision {
  backend: 'local' | 'cloud';
  model: string;
  // Only meaningful for local models; cloud providers do not expose a quantization choice.
  quantization?: 'Q4_K_M' | 'Q5_K_S' | 'Q8_0';
  reason: string;
}

interface InferenceBackend {
  // The model is passed per call so that the routing decision is honored.
  generate(request: InferenceRequest, model: string): Promise<string>;
  isHealthy(): Promise<boolean>;
}

// Router configuration
interface RouterConfig {
  localEndpoint: string;
  cloudEndpoint: string;
  cloudApiKey: string;
  maxLocalContext: number;
  maxToolChainLength: number;
}

class CapabilityRouter {
  private config: RouterConfig;
  private localBackend: InferenceBackend;
  private cloudBackend: InferenceBackend;

  constructor(config: RouterConfig) {
    this.config = config;
    this.localBackend = this.buildLocalBackend(config.localEndpoint);
    this.cloudBackend = this.buildCloudBackend(config.cloudEndpoint, config.cloudApiKey);
  }

  async route(request: InferenceRequest): Promise<RoutingDecision> {
    // Rule 1: Privacy requirement forces local execution
    if (request.requiresPrivacy) {
      return {
        backend: 'local',
        model: 'qwen2.5:7b',
        quantization: 'Q4_K_M',
        reason: 'Privacy constraint enforced'
      };
    }
    // Rule 2: Context window exceeds local practical limit
    // (maxTokens serves as a proxy for the total context size here)
    if (request.maxTokens > this.config.maxLocalContext) {
      return {
        backend: 'cloud',
        // Assumes the cloud endpoint, or a gateway in front of it, serves this model.
        model: 'claude-3-5-sonnet',
        reason: 'Context exceeds local threshold'
      };
    }
    // Rule 3: Complex tool chains require frontier reasoning
    if (request.tools && request.tools.length > this.config.maxToolChainLength) {
      return {
        backend: 'cloud',
        model: 'gpt-4o',
        reason: 'Tool chain complexity exceeds local reliability'
      };
    }
    // Default: Route to local for routine tasks
    return {
      backend: 'local',
      model: 'llama3.1:8b',
      quantization: 'Q4_K_M',
      reason: 'Routine task optimized for local execution'
    };
  }

  async execute(request: InferenceRequest): Promise<string> {
    const decision = await this.route(request);
    const backend = decision.backend === 'local' ? this.localBackend : this.cloudBackend;
    try {
      return await backend.generate(request, decision.model);
    } catch (error) {
      // Fall back to cloud only when the local backend failed and privacy allows it.
      // Privacy-constrained requests must never leave the machine, and a failed
      // cloud call has nothing left to fall back to.
      if (decision.backend !== 'local' || request.requiresPrivacy) {
        throw error;
      }
      console.warn('Local backend failed, falling back to cloud');
      return await this.cloudBackend.generate(request, 'gpt-4o');
    }
  }

  // Shared OpenAI-compatible chat completion call used by both backends.
  private async chatCompletion(
    endpoint: string,
    model: string,
    req: InferenceRequest,
    extraHeaders: Record<string, string> = {}
  ): Promise<string> {
    const response = await fetch(`${endpoint}/v1/chat/completions`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', ...extraHeaders },
      body: JSON.stringify({
        model,
        messages: [
          { role: 'system', content: req.systemContext || 'You are a helpful assistant.' },
          { role: 'user', content: req.prompt }
        ],
        max_tokens: req.maxTokens,
        temperature: 0.2
      })
    });
    // Surface HTTP errors instead of failing later on a missing `choices` field.
    if (!response.ok) {
      throw new Error(`Inference request failed with HTTP ${response.status}`);
    }
    const data = (await response.json()) as { choices: { message: { content: string } }[] };
    return data.choices[0].message.content;
  }

  private buildLocalBackend(endpoint: string): InferenceBackend {
    return {
      generate: (req, model) => this.chatCompletion(endpoint, model, req),
      isHealthy: async () => {
        try {
          // Probe the OpenAI-compatible model listing route, which both
          // Ollama and LM Studio serve.
          const res = await fetch(`${endpoint}/v1/models`);
          return res.ok;
        } catch {
          return false;
        }
      }
    };
  }

  private buildCloudBackend(endpoint: string, apiKey: string): InferenceBackend {
    return {
      generate: (req, model) =>
        this.chatCompletion(endpoint, model, req, { Authorization: `Bearer ${apiKey}` }),
      isHealthy: async () => true // Cloud backends assumed healthy
    };
  }
}
Why This Architecture Works
The router decouples business logic from inference infrastructure. By evaluating requiresPrivacy, maxTokens, and tools.length before execution, it prevents local models from being forced into tasks where they degrade. The fallback mechanism ensures resilience without sacrificing the local-first default. Quantization is explicitly tied to model size and hardware constraints, which reduces the risk of out-of-memory crashes under production load. Finally, the OpenAI-compatible interface means existing SDKs, prompt libraries, and evaluation frameworks typically need little or no modification beyond a new base URL.
Pitfall Guide
1. Quantization Blindness
Explanation: Treating all quantization formats as equivalent leads to unpredictable output quality. Q4_K_M reduces memory usage by ~50% compared to Q8_0 but introduces subtle reasoning degradation in complex chains.
Fix: Match quantization to task complexity. Use Q4_K_M for classification, summarization, and code completion. Reserve Q8_0 for multi-step reasoning or strict JSON output, and only on hardware that can hold the larger weights (a 70B model at Q8_0 needs roughly 75GB).
2. Context Window Overcommitment
Explanation: Local models advertise 8K–128K context windows, but quality often degrades well before the advertised limit, commonly beyond 16K–32K tokens, as attention is spread across more tokens and the growing KV cache adds memory pressure.
Fix: Implement a hard ceiling in your router. Chunk large documents before sending them to local backends, or route directly to cloud providers that maintain quality at 200K+ tokens.
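The chunking step in this fix might look like the sketch below. It is an illustration under stated assumptions: the four-characters-per-token estimate is a rough heuristic for English text, and `estimateTokens` and `chunkByTokenBudget` are hypothetical helpers, not part of any library.

```typescript
// Rough token estimate: about 4 characters per token (an assumed heuristic;
// use the model's real tokenizer when accuracy matters).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Split text into chunks that each stay within maxTokens, preferring paragraph
// boundaries. A paragraph larger than the budget is cut by character count.
function chunkByTokenBudget(text: string, maxTokens: number): string[] {
  if (maxTokens < 1) throw new Error('maxTokens must be at least 1');
  const maxChars = maxTokens * 4;
  const chunks: string[] = [];
  let current = '';

  for (const paragraph of text.split(/\n{2,}/)) {
    let rest = paragraph;
    // Hard-split paragraphs that cannot fit into a single chunk.
    while (rest.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = '';
      }
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    if (!rest) continue;

    const candidate = current ? `${current}\n\n${rest}` : rest;
    if (candidate.length > maxChars) {
      // Adding this paragraph would exceed the budget: close the current chunk.
      chunks.push(current);
      current = rest;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk can then be sent to the local backend separately, with the partial results merged afterwards. Requests that cannot be chunked meaningfully are better routed to the cloud.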
3. The 1:1 Replacement Fallacy
Explanation: Assuming local models can replicate every cloud capability leads to broken agentic workflows. Multi-step tool use with strict schema validation still favors frontier models.
Fix: Define explicit capability boundaries. Route tasks requiring 5+ chained tool calls, complex JSON parsing, or cross-file codebase reasoning to cloud backends.
4. Cold Start Latency Neglect
Explanation: Local models incur a 2–5 second penalty during initial weight loading. Interactive tooling that spawns new processes per request will experience unacceptable delays.
Fix: Maintain a persistent daemon (Ollama/LM Studio) that keeps models resident in memory. Implement connection pooling and avoid spawning fresh inference processes for each developer action.
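One way to keep models resident is to preload them when the router starts. The sketch below assumes an Ollama daemon and its native `/api/generate` endpoint, where a request with a model name and no prompt loads the model and `keep_alive` sets how long it stays in memory. `buildPreloadRequest`, `warmModels`, and the `PostJson` type are hypothetical names for illustration; LM Studio manages model loading differently and is not covered here.

```typescript
// Minimal shape of an HTTP POST helper, so the sketch does not depend on a specific runtime.
type PostJson = (url: string, body: unknown) => Promise<{ ok: boolean }>;

// Build a preload request for Ollama's native /api/generate endpoint.
// A model name with no prompt asks the daemon to load the model;
// keep_alive controls how long it stays resident after the last request.
function buildPreloadRequest(endpoint: string, model: string, keepAlive = '30m') {
  return {
    url: `${endpoint.replace(/\/+$/, '')}/api/generate`,
    body: { model, keep_alive: keepAlive },
  };
}

// Warm every model the router may select, once at startup rather than on first use.
// Returns the names of models that could not be preloaded.
async function warmModels(
  post: PostJson,
  endpoint: string,
  models: string[],
): Promise<string[]> {
  const failed: string[] = [];
  for (const model of models) {
    const { url, body } = buildPreloadRequest(endpoint, model);
    try {
      const res = await post(url, body);
      if (!res.ok) failed.push(model);
    } catch {
      failed.push(model);
    }
  }
  return failed;
}
```

On Node 18 or later, `post` can be a thin wrapper around the built-in `fetch` that sends the body as JSON. Injecting it keeps the warm-up logic easy to test without a running daemon.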
5. Unvalidated Structured Outputs
Explanation: Local models frequently deviate from strict JSON schemas, especially under memory pressure. Parsing failures cascade into broken automation pipelines.
Fix: Always wrap local outputs in a validation layer using Zod or JSON Schema. Implement retry logic with temperature reduction (temperature: 0.1) when validation fails.
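A validation layer with retries could be sketched as follows. To stay dependency-free, the example uses a hand-rolled check where a Zod or JSON Schema validator would normally go. The `CommitSummary` shape and all function names are hypothetical, and lowering the temperature by a fixed step per retry is one reading of the advice above, not the only one.

```typescript
// Hypothetical target shape for a structured response.
interface CommitSummary {
  title: string;
  files: string[];
}

// Hand-rolled validator standing in for a Zod or JSON Schema check.
// Returns the parsed value, or null when the text is not JSON of the expected shape.
function parseCommitSummary(raw: string): CommitSummary | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== 'object' || value === null) return null;
  const candidate = value as Record<string, unknown>;
  const files = candidate.files;
  if (typeof candidate.title !== 'string') return null;
  if (!Array.isArray(files) || !files.every((f) => typeof f === 'string')) return null;
  return { title: candidate.title, files: files as string[] };
}

// Temperatures for the first attempt and each retry, lowered by `step` and floored at 0.
function retryTemperatures(start: number, step: number, maxRetries: number): number[] {
  const temps: number[] = [];
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    temps.push(Math.max(0, Number((start - attempt * step).toFixed(2))));
  }
  return temps;
}

// Call `generate` at each temperature until the output validates.
async function generateValidated(
  generate: (temperature: number) => Promise<string>,
  temperatures: number[],
): Promise<CommitSummary> {
  for (const temperature of temperatures) {
    const parsed = parseCommitSummary(await generate(temperature));
    if (parsed) return parsed;
  }
  throw new Error(`Output failed validation after ${temperatures.length} attempts`);
}
```

Throwing after the last attempt gives the caller a clear point at which to escalate, for example by routing the request to a cloud backend when privacy rules allow it.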
6. Memory Fragmentation in Long Sessions
Explanation: Extended interactive sessions gradually fragment the KV cache, causing latency spikes and eventual out-of-memory errors on consumer hardware.
Fix: Implement periodic cache flushing or session timeouts. Monitor VRAM/RAM usage and restart the inference daemon when utilization exceeds 85%.
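The 85% threshold and the session timeout can be expressed as a small decision function. This is a sketch only: `MemorySample`, `SessionState`, and `nextMaintenanceAction` are hypothetical names, and how memory usage is actually measured (system RAM counters, vendor GPU tools) is left to the caller.

```typescript
interface MemorySample {
  usedBytes: number;
  totalBytes: number;
}

interface SessionState {
  lastActivityMs: number; // timestamp of the most recent request in this session
}

type MaintenanceAction = 'restart-daemon' | 'flush-session' | 'none';

// True when utilization crosses the restart threshold (0.85 means 85%).
function exceedsMemoryThreshold(sample: MemorySample, threshold = 0.85): boolean {
  if (sample.totalBytes <= 0) throw new Error('totalBytes must be positive');
  return sample.usedBytes / sample.totalBytes > threshold;
}

// True when the session has been idle for longer than the timeout.
function isSessionExpired(session: SessionState, nowMs: number, timeoutMs: number): boolean {
  return nowMs - session.lastActivityMs > timeoutMs;
}

// Memory pressure takes priority over idle cleanup, because a restart
// also discards every session's cached state.
function nextMaintenanceAction(
  sample: MemorySample,
  session: SessionState,
  nowMs: number,
  timeoutMs: number,
): MaintenanceAction {
  if (exceedsMemoryThreshold(sample)) return 'restart-daemon';
  if (isSessionExpired(session, nowMs, timeoutMs)) return 'flush-session';
  return 'none';
}
```

A periodic timer can call `nextMaintenanceAction` and act on the result, which keeps the policy separate from the mechanics of restarting the daemon.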
7. Hardcoded Fallback Logic
Explanation: Tying fallback behavior to specific error codes or string matching creates brittle systems that break when backend responses change.
Fix: Use capability-based routing with explicit health checks. Route to cloud only when local backend fails health checks or when task complexity exceeds predefined thresholds.
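Health-gated routing without error-string matching might be sketched like this. The names `HealthTracker` and `chooseBackend` are hypothetical, and rejecting a privacy-constrained request when the local backend is down (rather than sending it to the cloud) is an assumption about the desired policy.

```typescript
// Declares a backend unhealthy only after `failureThreshold` consecutive failed
// probes, so a single slow probe does not flip routing.
class HealthTracker {
  private consecutiveFailures = 0;
  private readonly failureThreshold: number;

  constructor(failureThreshold: number) {
    this.failureThreshold = failureThreshold;
  }

  // Record the outcome of one health probe.
  record(probeSucceeded: boolean): void {
    this.consecutiveFailures = probeSucceeded ? 0 : this.consecutiveFailures + 1;
  }

  isHealthy(): boolean {
    return this.consecutiveFailures < this.failureThreshold;
  }
}

type Target = 'local' | 'cloud' | 'reject';

// Choose a target from capability, privacy, and health alone.
function chooseBackend(opts: {
  withinLocalCapability: boolean;
  requiresPrivacy: boolean;
  localHealthy: boolean;
}): Target {
  // Privacy-constrained work never leaves the machine (assumed policy).
  if (opts.requiresPrivacy) return opts.localHealthy ? 'local' : 'reject';
  if (!opts.withinLocalCapability) return 'cloud';
  return opts.localHealthy ? 'local' : 'cloud';
}
```

A background timer would call the local backend's health probe and feed each result into `record`, and the router would read `isHealthy()` at decision time instead of waiting for a request to fail.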
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Inline code completion | Local (8B, Q4_K_M) | Low latency, high frequency, routine pattern matching | Near-zero marginal cost |
| Log summarization | Local (8B, Q4_K_M) | Context fits within 16K threshold, privacy-sensitive | Eliminates per-request API fees |
| Multi-step agentic workflow | Cloud (GPT-4o/Claude) | Requires strict JSON, 10+ tool chains, frontier reasoning | Higher per-call cost, but prevents pipeline failures |
| Long-context document analysis | Cloud (200K+ context) | Local attention degrades beyond 32K tokens | Predictable cloud spend for edge cases |
| Client codebase refactoring | Local (70B, Q4_K_M) | Privacy requirement, moderate complexity, fits in 64GB RAM (Q8_0 would need roughly 75GB) | One-time hardware cost, zero recurring fees |
Configuration Template
# inference-router.config.yaml
router:
  local_endpoint: "http://localhost:11434"
  # Base URL without /v1; the router appends /v1/chat/completions itself.
  cloud_endpoint: "https://api.openai.com"
  cloud_api_key: "${OPENAI_API_KEY}"
  thresholds:
    max_local_context_tokens: 24000
    max_tool_chain_length: 8
    privacy_enforced: true
  model_routing:
    local:
      default: "llama3.1:8b"
      high_precision: "qwen2.5:14b"
      quantization: "Q4_K_M"
    cloud:
      default: "gpt-4o"
      fallback: "claude-3-5-sonnet"
  health_check:
    interval_seconds: 30
    timeout_ms: 2000
    retry_count: 2
  validation:
    strict_json: true
    max_retries_on_failure: 3
    temperature_reduction: 0.1
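Turning this template into the RouterConfig used by the router takes two steps: expanding the `${OPENAI_API_KEY}` placeholder and mapping the snake_case keys. The sketch below assumes the YAML has already been parsed into a plain object by a YAML library of your choice, and that the endpoint and threshold keys sit under `router` as shown. `RawRouterSection`, `expandEnv`, and `toRouterConfig` are hypothetical names.

```typescript
interface RouterConfig {
  localEndpoint: string;
  cloudEndpoint: string;
  cloudApiKey: string;
  maxLocalContext: number;
  maxToolChainLength: number;
}

// Assumed shape of the parsed `router` section (only the fields used here).
interface RawRouterSection {
  local_endpoint: string;
  cloud_endpoint: string;
  cloud_api_key: string;
  thresholds: {
    max_local_context_tokens: number;
    max_tool_chain_length: number;
  };
}

// Replace ${NAME} placeholders with values from `env`. Missing variables fail
// loudly rather than producing an empty API key.
function expandEnv(value: string, env: Record<string, string | undefined>): string {
  return value.replace(/\$\{([A-Z0-9_]+)\}/g, (_match, name: string) => {
    const resolved = env[name];
    if (resolved === undefined) {
      throw new Error(`Missing environment variable: ${name}`);
    }
    return resolved;
  });
}

// Map the parsed section onto the camelCase RouterConfig.
function toRouterConfig(
  raw: RawRouterSection,
  env: Record<string, string | undefined>,
): RouterConfig {
  return {
    localEndpoint: raw.local_endpoint,
    cloudEndpoint: raw.cloud_endpoint,
    cloudApiKey: expandEnv(raw.cloud_api_key, env),
    maxLocalContext: raw.thresholds.max_local_context_tokens,
    maxToolChainLength: raw.thresholds.max_tool_chain_length,
  };
}
```

In a Node process, `env` would typically be `process.env`. Passing it explicitly keeps the mapping testable and avoids reading global state inside the helper.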
Quick Start Guide
- Install the inference daemon: Run brew install ollama (macOS) or follow the one-line Linux installer. Start the service with ollama serve.
- Pull baseline models: Execute ollama pull llama3.1:8b and ollama pull qwen2.5:7b to populate the local model registry.
- Deploy the router: Copy the TypeScript implementation into your project, configure inference-router.config.yaml, and initialize CapabilityRouter with your endpoints.
- Validate routing: Run a test suite that exercises routine tasks (commit generation, log parsing) and edge cases (long context, tool chains). Verify that 60–80% of requests route locally.
- Monitor and iterate: Track time-to-first-token, fallback frequency, and memory utilization. Adjust quantization tiers and context thresholds based on production telemetry.
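The validation and monitoring steps above need a way to count where requests go. A minimal in-memory sketch follows; `RoutingStats` and its methods are hypothetical, and a production setup would more likely export these counters to an existing metrics system.

```typescript
type Backend = 'local' | 'cloud';

// Counts routing outcomes so the local share and fallback rate can be checked.
class RoutingStats {
  private local = 0;
  private cloud = 0;
  private fallbacks = 0;

  // Record the backend that served a request, and whether it was reached via fallback.
  record(backend: Backend, wasFallback = false): void {
    if (backend === 'local') this.local += 1;
    else this.cloud += 1;
    if (wasFallback) this.fallbacks += 1;
  }

  total(): number {
    return this.local + this.cloud;
  }

  // Fraction of requests served locally (0 when nothing has been recorded).
  localShare(): number {
    return this.total() === 0 ? 0 : this.local / this.total();
  }

  // Fraction of requests that only reached their backend through fallback.
  fallbackRate(): number {
    return this.total() === 0 ? 0 : this.fallbacks / this.total();
  }
}

// Check the local share against a target band (60–80% by default).
function localShareInRange(stats: RoutingStats, min = 0.6, max = 0.8): boolean {
  const share = stats.localShare();
  return share >= min && share <= max;
}
```

A test suite can record each routing decision into a `RoutingStats` instance and assert `localShareInRange` at the end. A share well outside the band suggests the context or tool-chain thresholds need adjusting.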
The local-first inference stack is no longer experimental. It is the operational baseline for cost-efficient, low-latency, and privacy-compliant AI tooling. Route by capability, validate outputs, and let the cloud handle what the hardware cannot.