routing layer that decouples task semantics from hardware constraints. The architecture consists of three components:
- Hardware Profiler: Measures available VRAM, PCIe bandwidth, and baseline throughput
- Task Router: Maps incoming requests to model endpoints based on complexity, context length, and latency SLA
- Metrics Collector: Tracks tokens/sec, cache hits, and error rates to enable dynamic fallback
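As a rough illustration of how the Metrics Collector could feed dynamic fallback, the sketch below keeps a rolling window of request samples and flags when to fall back. The names (`HardwareProfile`, `MetricsCollector`, `shouldFallback`) and the thresholds are assumptions made for this example, not part of any library.

```typescript
// Hypothetical shapes; field names and thresholds are assumptions for illustration.
interface HardwareProfile {
  freeVramGB: number;
  pcieGBps: number;
  baselineTokensPerSec: number;
}

interface MetricsSample {
  tokensPerSec: number;
  cacheHit: boolean;
  ok: boolean;
}

class MetricsCollector {
  private samples: MetricsSample[] = [];
  private readonly windowSize: number;

  constructor(windowSize = 100) {
    this.windowSize = windowSize;
  }

  // Keep only the most recent `windowSize` samples.
  record(sample: MetricsSample): void {
    this.samples.push(sample);
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  errorRate(): number {
    if (this.samples.length === 0) return 0;
    return this.samples.filter(s => !s.ok).length / this.samples.length;
  }

  // Mean throughput over successful requests, or undefined if there are none yet.
  meanTokensPerSec(): number | undefined {
    const ok = this.samples.filter(s => s.ok);
    if (ok.length === 0) return undefined;
    return ok.reduce((sum, s) => sum + s.tokensPerSec, 0) / ok.length;
  }
}

// Fall back when errors climb or throughput drops well below the profiled baseline.
function shouldFallback(
  metrics: MetricsCollector,
  hw: HardwareProfile,
  maxErrorRate = 0.1,
  minThroughputRatio = 0.5,
): boolean {
  if (metrics.errorRate() > maxErrorRate) return true;
  const mean = metrics.meanTokensPerSec();
  return mean !== undefined && mean < hw.baselineTokensPerSec * minThroughputRatio;
}
```

The Task Router would consult `shouldFallback` before choosing an endpoint; the 10% error rate and 50% throughput floor are placeholders to tune against your own baseline.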
Step-by-Step Implementation
1. Hardware Baseline
Run a lightweight probe to determine safe quantization levels and batch limits. Do not assume manufacturer specs match real-world memory fragmentation.
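A minimal sketch of such a probe on NVIDIA hardware follows. It assumes you capture the output of `nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv,noheader,nounits` yourself (for example with `child_process.execFile`); the function names and the 10% headroom are illustrative assumptions.

```typescript
interface GpuMemory {
  totalMiB: number;
  usedMiB: number;
  freeMiB: number;
}

// Parses one CSV line per GPU, as printed by:
//   nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv,noheader,nounits
function parseNvidiaSmi(csv: string): GpuMemory[] {
  return csv
    .trim()
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(line => {
      const [totalMiB, usedMiB, freeMiB] = line.split(',').map(v => Number(v.trim()));
      if ([totalMiB, usedMiB, freeMiB].some(v => !Number.isFinite(v))) {
        throw new Error(`Unparseable nvidia-smi line: ${line}`);
      }
      return { totalMiB, usedMiB, freeMiB };
    });
}

// Reserve a safety margin for fragmentation instead of trusting the spec-sheet total.
function usableVramGB(gpu: GpuMemory, headroom = 0.1): number {
  return (gpu.freeMiB * (1 - headroom)) / 1024;
}
```

Comparing `usableVramGB` against each endpoint's `minVramGB` gives a first cut at which models are safe to load on a given machine.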
2. Model Endpoint Setup
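Deploy models via vLLM (with an explicit --quantization flag) or Ollama (which selects quantization through the GGUF variant of the model you pull). AWQ and GPTQ 4-bit generally offer a good accuracy-throughput trade-off on NVIDIA architectures, but confirm this on your own workload. For Apple Silicon, use MLX with 4-bit group quantization.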
3. Router Implementation (TypeScript)
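The router intercepts requests, classifies task complexity, and forwards each request to the appropriate endpoint. If the primary endpoint returns an error, it retries once against a lower-tier fallback model. Circuit breaking is driven by the circuit_breaker settings in the configuration template and is not part of this listing.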
import { fetch } from 'undici';

interface ModelEndpoint {
  id: string;
  url: string;
  maxContext: number;
  minVramGB: number;
  latencyTargetMs: number;
}

interface RequestPayload {
  prompt: string;
  taskType: 'code' | 'reasoning' | 'extraction' | 'chat';
  maxTokens: number;
  contextLength: number;
  latencyTargetMs?: number; // optional per-request latency SLA
}

const ENDPOINTS: ModelEndpoint[] = [
  { id: 'qwen-coder-32b', url: 'http://localhost:8000/v1/completions', maxContext: 256000, minVramGB: 18, latencyTargetMs: 300 },
  { id: 'llama-scout-17b', url: 'http://localhost:8001/v1/completions', maxContext: 128000, minVramGB: 11, latencyTargetMs: 200 },
  { id: 'phi-medium-14b', url: 'http://localhost:8002/v1/completions', maxContext: 64000, minVramGB: 8, latencyTargetMs: 100 },
];

const FALLBACK_ID = 'phi-medium-14b';

// Pick a model id from task type, output budget, context length, and latency SLA.
async function classifyTask(payload: RequestPayload): Promise<string> {
  if (payload.taskType === 'code' && payload.maxTokens > 2048) return 'qwen-coder-32b';
  if (payload.taskType === 'reasoning' && payload.contextLength > 32000) return 'llama-scout-17b';
  if (payload.latencyTargetMs !== undefined && payload.latencyTargetMs <= 150) return 'phi-medium-14b';
  return 'llama-scout-17b'; // default model
}

// POST a completion request to a single endpoint.
function callEndpoint(endpoint: ModelEndpoint, payload: RequestPayload) {
  return fetch(endpoint.url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: endpoint.id,
      prompt: payload.prompt,
      max_tokens: payload.maxTokens,
      temperature: 0.2,
      top_p: 0.9,
    }),
  });
}

export async function routeInference(payload: RequestPayload): Promise<unknown> {
  const target = await classifyTask(payload);
  const endpoint = ENDPOINTS.find(e => e.id === target);
  if (!endpoint) throw new Error('No matching endpoint');

  const response = await callEndpoint(endpoint, payload);
  if (response.ok) return response.json();

  // Fall back once to the lower-tier model on hardware saturation.
  // Calling the fallback directly (instead of recursing through classifyTask)
  // guarantees the retry hits the fallback model and cannot loop forever.
  const fallback = ENDPOINTS.find(e => e.id === FALLBACK_ID);
  if (fallback && fallback.id !== target) {
    const retry = await callEndpoint(fallback, payload);
    if (retry.ok) return retry.json();
    throw new Error(`Inference failed on fallback: ${retry.status}`);
  }
  throw new Error(`Inference failed: ${response.status}`);
}
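The router listing itself does not contain a circuit breaker. One possible shape is sketched below, using the `failure_threshold` and `recovery_timeout_ms` values from the configuration template as defaults. The class and method names are assumptions for illustration, and the clock is injectable only to make the behavior easy to test.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private readonly failureThreshold: number;
  private readonly recoveryTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 5, recoveryTimeoutMs = 30_000, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
    this.now = now;
  }

  // True if a request may be sent. After the recovery timeout the breaker
  // moves to half-open and lets trial traffic through.
  canRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.recoveryTimeoutMs) {
      this.state = 'half-open';
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  // A failure while half-open, or too many consecutive failures, opens the breaker.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }
}
```

A router could keep one breaker per endpoint id, skip any endpoint whose `canRequest()` returns false, and call `recordSuccess` or `recordFailure` after each response.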
4. Architecture Rationale
- Decoupled routing prevents application code from hardcoding model dependencies
- Dynamic fallback handles VRAM spikes and batch queue saturation without dropping requests
- Explicit context thresholds prevent KV cache thrashing on models with smaller effective windows
- TypeScript runtime enables seamless integration with Next.js, Express, or Deno deployments while maintaining strict typing for payload validation
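To make the context-threshold point concrete, the sketch below checks whether a request fits an endpoint's window before routing. The helper names and the 80% safety fraction are assumptions; the idea is simply to leave headroom below the advertised `maxContext`.

```typescript
interface ContextLimits {
  id: string;
  maxContext: number;
}

// A request fits if prompt plus requested output stays under a fraction of the window.
function fitsContext(
  endpoint: ContextLimits,
  promptTokens: number,
  maxTokens: number,
  safetyFraction = 0.8,
): boolean {
  return promptTokens + maxTokens <= Math.floor(endpoint.maxContext * safetyFraction);
}

// Prefer the smallest window that still fits, so large-context models stay free.
function pickSmallestFitting(
  endpoints: ContextLimits[],
  promptTokens: number,
  maxTokens: number,
): ContextLimits | undefined {
  return [...endpoints]
    .sort((a, b) => a.maxContext - b.maxContext)
    .find(e => fitsContext(e, promptTokens, maxTokens));
}
```

With the three endpoints defined earlier, a 60,000-token prompt requesting 2,048 output tokens skips the 64K model (its 80% budget is 51,200 tokens) and lands on the 128K one.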
Pitfall Guide
- Treating leaderboard scores as production guarantees: Public benchmarks test zero-shot accuracy on curated datasets. They do not measure latency under concurrent load, KV cache fragmentation, or quantization-induced hallucination. Always validate on your actual prompt distribution.
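- Ignoring KV cache memory scaling: Context window size ≠ usable context. KV cache memory grows linearly with sequence length and with batch size; it is attention compute, not the cache, that grows quadratically. MoE layers do not shrink the cache, because it belongs to the attention layers. A 128K window model may OOM at 32K tokens if batch size exceeds 8. Profile cache usage before production.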
- Applying uniform quantization across architectures: AWQ, GPTQ, and GGUF quantization artifacts vary by weight distribution. Qwen-3-Coder-32B retains 94% of FP16 accuracy at 4-bit AWQ, while Llama-4-Scout-17B drops to 89% under identical settings. Validate perplexity on your task domain before committing to quantization.
- Assuming tokenizer consistency: Different tokenizers fragment code, JSON, and non-ASCII text differently. A 512-token prompt in Llama's tokenizer may become 680 tokens in Qwen's, directly impacting latency and context window utilization. Normalize token counts using the target model's tokenizer before routing.
- Benchmarking with single requests: Local inference stacks behave differently under concurrency. PagedAttention efficiency, CUDA stream scheduling, and PCIe bandwidth saturation only appear at batch sizes ≥16. Single-request benchmarks produce misleading throughput numbers.
- Neglecting hardware-specific optimizations: NVIDIA GPUs benefit from FlashAttention-2 and CUDA graphs. Apple Silicon requires MLX with contiguous memory allocation. x86 CPU fallbacks need AVX-512 and OpenBLAS tuning. Shipping a generic inference binary guarantees suboptimal performance.
- Skipping circuit breakers and health checks: VRAM fragmentation and CUDA context resets cause silent failures. Implement /health endpoints, monitor cudaMemGetInfo, and trigger graceful degradation before the inference server crashes.
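The KV cache pitfall is easy to quantify. With standard attention, the cache stores one key and one value vector per token, per layer, per KV head, so its size can be estimated directly. The sketch below does that arithmetic; the model shape in the usage note is a made-up example, not the published configuration of any model named here.

```typescript
interface KvCacheShape {
  layers: number;          // number of transformer layers
  kvHeads: number;         // key/value heads (fewer than query heads under GQA)
  headDim: number;         // dimension per head
  bytesPerElement: number; // 2 for FP16/BF16, 1 for an 8-bit cache
}

// Bytes needed to cache keys and values for `batchSize` sequences of `seqLen` tokens.
// The leading 2 accounts for storing both K and V.
function kvCacheBytes(shape: KvCacheShape, seqLen: number, batchSize: number): number {
  const perToken = 2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerElement;
  return perToken * seqLen * batchSize;
}

function toGiB(bytes: number): number {
  return bytes / 2 ** 30;
}
```

For a hypothetical shape of 48 layers, 8 KV heads, head dimension 128, and an FP16 cache, each token costs 192 KiB. At 32,768 tokens and batch size 8, `toGiB(kvCacheBytes(shape, 32768, 8))` comes to 48 GiB for the cache alone, before weights and activations, which is why a nominal 128K window can run out of memory far earlier.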
Best Practices from Production:
- Run a 24-hour soak test with realistic prompt distributions before deployment
- Pin quantization versions and validate accuracy drift after model updates
- Use request queuing with backpressure instead of dropping requests during saturation
- Maintain a fallback model with an identical tokenizer to avoid re-encoding overhead
- Log actual context usage, not just requested context, to detect cache thrashing
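The queuing-with-backpressure practice can be sketched as a bounded queue in front of the inference call: requests wait for a free slot, and only when the waiting line itself is full does the caller get an immediate error it can translate into a retry-later response. The class name and limits are assumptions for illustration.

```typescript
class BoundedQueue {
  private active = 0;
  private waiting: Array<() => void> = [];
  private readonly maxConcurrent: number;
  private readonly maxQueued: number;

  constructor(maxConcurrent: number, maxQueued: number) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueued = maxQueued;
  }

  // True when no more requests can be parked; upstream should slow down.
  get saturated(): boolean {
    return this.waiting.length >= this.maxQueued;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      if (this.saturated) throw new Error('queue full: ask the caller to retry later');
      // Park until a finishing task hands over its slot.
      await new Promise<void>(resolve => this.waiting.push(resolve));
    } else {
      this.active += 1;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) next();       // pass the slot straight to the next waiter (FIFO)
      else this.active -= 1;  // nobody waiting: release the slot
    }
  }
}
```

In an HTTP handler, the "queue full" error would typically map to a 429 or 503 with a Retry-After header, so clients back off instead of piling onto a saturated GPU.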
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Sub-100ms interactive chat | Phi-4-Medium-14B (4-bit) | Highest throughput, lowest VRAM, stable latency under load | Low hardware cost, high user retention |
| Code generation & refactoring | Qwen-3-Coder-32B (AWQ-4bit) | Superior syntax awareness, AST preservation, multi-file context | Requires 18GB+ VRAM, moderate infra cost |
| Multi-step reasoning & research | Llama-4-Scout-17B (GPTQ-4bit) | Balanced accuracy/latency, strong chain-of-thought retention | Mid-tier GPU sufficient, predictable scaling |
| Edge/mobile deployment | Llama-4-Edge-8B (GGUF-Q4_K_M) | Optimized for NPU/Neural Engine, low power draw | Minimal infra, higher per-device cost |
| High-throughput batch processing | MoE-70B routed to 3x RTX 5090 | Parallel expert routing maximizes throughput, amortizes cost | High upfront GPU cost, lowest per-token cost at scale |
Configuration Template
# router.config.yaml
inference:
  backend: vllm
  version: "0.7.2"
  quantization: awq
  batch_size: 32
  max_num_seqs: 64
  gpu_memory_utilization: 0.92
models:
  - id: qwen-coder-32b
    path: /models/qwen3-coder-32b-awq
    max_model_len: 256000
    tensor_parallel_size: 1
    dtype: float16
    enforce_eager: false
  - id: llama-scout-17b
    path: /models/llama4-scout-17b-gptq
    max_model_len: 128000
    tensor_parallel_size: 1
    dtype: float16
    enforce_eager: true
  - id: phi-medium-14b
    path: /models/phi4-medium-14b-awq
    max_model_len: 64000
    tensor_parallel_size: 1
    dtype: float16
    enforce_eager: false
routing:
  default: llama-scout-17b
  fallback_order: [phi-medium-14b, llama-scout-17b]
  health_check_interval_ms: 5000
  timeout_ms: 15000
  circuit_breaker:
    failure_threshold: 5
    recovery_timeout_ms: 30000
monitoring:
  metrics_endpoint: /metrics
  log_level: info
  track_context_usage: true
  alert_on_vram_spike: true
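One way to connect this template to the backends is to translate each models entry into `vllm serve` arguments. The sketch below assumes the YAML has already been parsed into plain objects (for example with a YAML library of your choice); the interface and function names are hypothetical, while the flags are standard vLLM server options.

```typescript
interface ModelConfig {
  id: string;
  path: string;
  max_model_len: number;
  tensor_parallel_size: number;
  dtype: string;
  enforce_eager: boolean;
}

interface InferenceConfig {
  quantization: string;
  max_num_seqs: number;
  gpu_memory_utilization: number;
}

// Build the argument list for `vllm <args>` for one model entry.
// `quantization` can be overridden per model, since the template mixes AWQ and GPTQ weights.
function buildVllmArgs(
  model: ModelConfig,
  inference: InferenceConfig,
  port: number,
  quantization: string = inference.quantization,
): string[] {
  const args = [
    'serve', model.path,
    '--port', String(port),
    '--served-model-name', model.id, // must match the `model` field the router sends
    '--quantization', quantization,
    '--max-model-len', String(model.max_model_len),
    '--tensor-parallel-size', String(model.tensor_parallel_size),
    '--dtype', model.dtype,
    '--max-num-seqs', String(inference.max_num_seqs),
    '--gpu-memory-utilization', String(inference.gpu_memory_utilization),
  ];
  if (model.enforce_eager) args.push('--enforce-eager');
  return args;
}
```

The resulting array can be passed to a process launcher as the arguments of the `vllm` executable, one process and port per model.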
Quick Start Guide
- Install dependencies: npm install undici @types/node tsx
- Pull and quantize models: Use llama.cpp or autoawq to convert weights to 4-bit. Example (check the exact command and flag names against the version of the tool you have installed): autoawq quantize --model_path qwen3-coder-32b --quant_path ./weights/qwen3-coder-32b-awq --w_bits 4 --group_size 128
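- Launch inference backends: Run vLLM per model: vllm serve ./weights/qwen3-coder-32b-awq --port 8000 --served-model-name qwen-coder-32b --quantization awq --max-model-len 256000. The --served-model-name value must match the model id the router sends; without it, vLLM serves the model under the path given on the command line and rejects requests for qwen-coder-32b.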
- Start router: tsx router.ts (or compile and run). Send a test payload to verify routing and latency.
- Validate under load: Use autocannon or wrk to simulate 50 concurrent requests. Monitor VRAM usage and tokens/sec. Adjust batch_size and gpu_memory_utilization if fragmentation exceeds 15%.
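If you prefer to drive the load check from the same TypeScript codebase rather than autocannon or wrk, a small harness along these lines may be enough for a first pass. The function names are assumptions, and `send` stands for whatever issues one request in your setup.

```typescript
// Nearest-rank percentile over latencies in milliseconds.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Fire `concurrency` requests at once and collect each request's latency.
async function runBurst(send: () => Promise<void>, concurrency: number): Promise<number[]> {
  const timed = async (): Promise<number> => {
    const start = Date.now();
    await send();
    return Date.now() - start;
  };
  return Promise.all(Array.from({ length: concurrency }, () => timed()));
}
```

For example, `runBurst(() => routeInference(payload).then(() => undefined), 50)` followed by `percentile(latencies, 95)` gives a p95 figure to compare against each endpoint's `latencyTargetMs`. A single burst is only a smoke test; sustained load still needs a proper tool or a soak run.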
Deploying open-weight models in 2026 requires treating inference as a resource-aware routing problem. Leaderboard scores are noise; hardware constraints, quantization fidelity, and task semantics are signal. Build the router, measure the reality, and scale what actually works.