equests before application timeouts trigger. The gateway shifts local LLMs from experimental endpoints to production-grade services with predictable SLAs.
Core Solution
A production-ready local LLM API gateway requires four architectural layers: interface normalization, request routing with health awareness, deterministic caching, and streaming backpressure management. The implementation below uses Fastify (TypeScript) for low-overhead HTTP handling, undici for high-performance proxying, and an LRU cache for prompt deduplication.
Architecture Decisions and Rationale
- Fastify over Express/NGINX: Fastify provides schema-based request validation, async-first routing, and plugin isolation. NGINX lacks dynamic model registry awareness and custom auth logic. Express introduces unnecessary middleware overhead for high-throughput proxying.
- OpenAI-Compatible Interface: Normalizing to /v1/chat/completions and /v1/embeddings ensures SDK compatibility across LangChain, Vercel AI SDK, and custom clients without rewriting application code.
- Stream-First Proxying: LLM responses are token-by-token SSE streams. The gateway must pipe streams directly to clients without buffering entire payloads, preserving memory and reducing time-to-first-token (TTFT).
- Health-Aware Routing: Local backends fail unpredictably (OOM, driver crashes, VRAM fragmentation). The gateway maintains a lightweight health registry and routes around degraded nodes.
Step-by-Step Implementation
1. Initialize Gateway with Schema Validation
import Fastify from 'fastify';
import { Type } from '@sinclair/typebox';

const app = Fastify({ logger: true });

// TypeBox schemas are plain JSON Schema objects, so Fastify can validate with them directly.
const ChatCompletionSchema = Type.Object({
  model: Type.String(),
  messages: Type.Array(Type.Object({ role: Type.String(), content: Type.String() })),
  temperature: Type.Optional(Type.Number({ minimum: 0, maximum: 2 })),
  stream: Type.Optional(Type.Boolean())
});

// Requests that fail validation are rejected with a 400 before the handler runs.
app.post('/v1/chat/completions', {
  schema: { body: ChatCompletionSchema },
  handler: handleChatCompletion
});
2. Model Registry and Health Tracking
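interface BackendNode {
  id: string;
  url: string;       // OpenAI-compatible base URL, including /v1
  healthUrl: string; // probe endpoint, which differs per backend
  models: string[];
  healthy: boolean;
  lastCheck: number;
}

const registry: BackendNode[] = [
  // Ollama answers 200 on its root path and serves the OpenAI-compatible API under /v1.
  { id: 'ollama-1', url: 'http://127.0.0.1:11434/v1', healthUrl: 'http://127.0.0.1:11434/', models: ['llama3', 'mistral'], healthy: true, lastCheck: Date.now() },
  // vLLM exposes /health at the server root, not under /v1.
  { id: 'vllm-1', url: 'http://127.0.0.1:8000/v1', healthUrl: 'http://127.0.0.1:8000/health', models: ['qwen2.5-7b'], healthy: true, lastCheck: Date.now() }
];

async function checkHealth(node: BackendNode): Promise<boolean> {
  try {
    const res = await fetch(node.healthUrl, { signal: AbortSignal.timeout(1000) });
    return res.ok;
  } catch {
    return false;
  }
}

// Probe every backend every 5 seconds, off the request path.
setInterval(async () => {
  for (const node of registry) {
    node.healthy = await checkHealth(node);
    node.lastCheck = Date.now();
  }
}, 5000);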
3. Request Routing and Caching Middleware
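import { LRUCache } from 'lru-cache';

// Holds complete (non-streamed) responses for up to 5 minutes.
const promptCache = new LRUCache<string, any>({
  max: 500,
  ttl: 1000 * 60 * 5,
  allowStale: false
});

// Include every sampling parameter that changes the output, so requests
// that differ only in top_p or seed never share a cache entry.
function generateCacheKey(body: any): string {
  return JSON.stringify({
    model: body.model,
    messages: body.messages,
    temperature: body.temperature ?? null,
    top_p: body.top_p ?? null,
    seed: body.seed ?? null
  });
}

// Return the first healthy backend that serves the requested model.
async function resolveBackend(modelName: string): Promise<BackendNode | null> {
  return registry.find(n => n.healthy && n.models.includes(modelName)) ?? null;
}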
4. Streaming Proxy with Backpressure
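import { pipeline } from 'node:stream/promises';
import { request } from 'undici';

async function handleChatCompletion(req: any, reply: any) {
  const body = req.body;
  // Cache only deterministic, non-streamed completions.
  const cacheable = !body.stream && body.temperature === 0;
  const cacheKey = generateCacheKey(body);
  if (cacheable && promptCache.has(cacheKey)) {
    return reply.send(promptCache.get(cacheKey));
  }

  const backend = await resolveBackend(body.model);
  if (!backend) {
    return reply.code(503).send({ error: 'No healthy backend for model' });
  }

  // Abort the upstream request if the client disconnects before the response is complete.
  const abort = new AbortController();
  reply.raw.on('close', () => {
    if (!reply.raw.writableFinished) abort.abort();
  });

  const upstream = await request(`${backend.url}/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
    signal: abort.signal
  });

  if (body.stream) {
    // Headers set through reply.header() are not written when bypassing Fastify,
    // so take over the raw response and write them explicitly.
    reply.hijack();
    reply.raw.writeHead(upstream.statusCode, {
      'Content-Type': upstream.headers['content-type'] ?? 'text/event-stream',
      'Cache-Control': 'no-cache'
    });
    try {
      // pipeline() applies backpressure and destroys both streams on error or disconnect.
      await pipeline(upstream.body, reply.raw);
    } catch (err) {
      req.log.warn({ err }, 'stream to client ended early');
    }
    return;
  }

  const raw = await upstream.body.text();
  let parsed: any;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return reply.code(502).send({ error: 'Backend returned a non-JSON response' });
  }
  if (cacheable && upstream.statusCode === 200 && parsed.choices?.[0]?.message?.content) {
    promptCache.set(cacheKey, parsed);
  }
  return reply.code(upstream.statusCode).send(parsed);
}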
app.listen({ port: 4000, host: '0.0.0.0' }, (err) => {
  if (err) throw err;
  console.log('Local LLM Gateway running on port 4000');
});
Architecture Rationale: The gateway intentionally avoids buffering full responses. Streaming pipes preserve memory during long generations. The LRU cache stores only non-streamed completions to prevent stale token leakage. Health checks run asynchronously to avoid blocking request paths. Backend selection is deterministic but can be extended to weighted routing or VRAM-aware scheduling.
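The weighted-routing extension mentioned above could look roughly like the following. This is a minimal sketch, not part of the gateway code: the `weight` field and the `pickWeighted` helper are hypothetical names introduced for illustration, and the random source is injectable only so the selection can be reproduced.

```typescript
// Hypothetical registry entry: `weight` is an assumed field, not in the original registry.
interface WeightedNode {
  id: string;
  models: string[];
  healthy: boolean;
  weight: number; // relative share of traffic, expected to be > 0
}

// Pick a healthy node that serves `model`, with probability proportional to its weight.
function pickWeighted(
  nodes: WeightedNode[],
  model: string,
  rand: () => number = Math.random
): WeightedNode | null {
  const candidates = nodes.filter(
    n => n.healthy && n.weight > 0 && n.models.includes(model)
  );
  if (candidates.length === 0) return null;

  const total = candidates.reduce((sum, n) => sum + n.weight, 0);
  let threshold = rand() * total;
  for (const node of candidates) {
    threshold -= node.weight;
    if (threshold < 0) return node;
  }
  // Guards against floating-point drift when rand() is very close to 1.
  return candidates[candidates.length - 1] ?? null;
}
```

With weights 3 and 1, the first node would receive roughly three quarters of the requests for a shared model. Unhealthy nodes are skipped entirely, so the health registry still decides eligibility.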
Pitfall Guide
1. Blocking the Event Loop with Synchronous Model Calls
Local SDKs sometimes expose sync wrappers or heavy JSON parsing. In a high-concurrency gateway, synchronous operations stall the Node.js event loop, causing TTFT to spike and connections to time out. Always use async/await with stream piping. Validate that any utility function (token counting, prompt formatting) is non-blocking.
2. Caching Streaming or Non-Deterministic Outputs
Caching SSE streams or sampled responses (any temperature above 0) serves stale or contradictory outputs. Only cache deterministic completions (stream: false, temperature: 0). Normalize the cache key so that it includes temperature, top_p, and seed. Never cache partial chunks.
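One way to combine the cacheability rule and the key normalization is a single function that returns either a key or nothing. The sketch below is illustrative: `CacheableRequest` and `cacheKeyFor` are hypothetical names, and it assumes a missing temperature means the backend default applies, which is treated as non-deterministic.

```typescript
// Hypothetical request shape: only the fields relevant to caching are listed.
interface CacheableRequest {
  model: string;
  messages: { role: string; content: string }[];
  stream?: boolean;
  temperature?: number;
  top_p?: number;
  seed?: number;
}

// Returns a cache key for requests that are safe to cache, or null otherwise.
function cacheKeyFor(req: CacheableRequest): string | null {
  if (req.stream) return null;
  // An absent temperature falls back to the backend default, assumed here to be > 0.
  if (req.temperature !== 0) return null;

  // Properties are written in a fixed order so that the key does not depend
  // on the order in which the client serialized its JSON.
  return JSON.stringify({
    model: req.model,
    messages: req.messages.map(m => ({ role: m.role, content: m.content })),
    temperature: 0,
    top_p: req.top_p ?? null,
    seed: req.seed ?? null,
  });
}
```

A caller would skip both the cache lookup and the cache write whenever the function returns null.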
3. Ignoring VRAM Fragmentation During Model Swaps
Local backends like Ollama unload models from VRAM when switching. If the gateway routes to a backend that just unloaded the target model, the request pays the full reload cost and latency spikes. Mitigate by tracking model load state in the registry, implementing a warm-up queue, or keeping several models resident at once (for example, Ollama's OLLAMA_MAX_LOADED_MODELS setting, or a dedicated vLLM process per model).
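Tracking load state in the registry can be as simple as remembering which models each backend served most recently. The sketch below rests on assumptions: the `loaded` field, `pickWarmFirst`, and `markLoaded` are hypothetical, and the eviction model (oldest model unloaded first, at most `maxLoaded` resident) is a guess about backend behavior rather than something the gateway can observe directly.

```typescript
interface LoadAwareNode {
  id: string;
  healthy: boolean;
  models: string[];    // models the node is able to serve
  loaded: Set<string>; // models believed to be resident in VRAM (assumed field)
}

// Prefer a backend that already holds the model; fall back to any capable one.
function pickWarmFirst(nodes: LoadAwareNode[], model: string): LoadAwareNode | null {
  const capable = nodes.filter(n => n.healthy && n.models.includes(model));
  return capable.find(n => n.loaded.has(model)) ?? capable[0] ?? null;
}

// Record that `model` was just requested on `node`, evicting the least
// recently used entries beyond `maxLoaded`.
function markLoaded(node: LoadAwareNode, model: string, maxLoaded = 1): void {
  node.loaded.delete(model);
  node.loaded.add(model); // re-insert so iteration order reflects recency
  while (node.loaded.size > maxLoaded) {
    const oldest = node.loaded.values().next().value;
    if (oldest === undefined) break;
    node.loaded.delete(oldest);
  }
}
```

Calling `markLoaded` after each successful upstream request keeps the estimate current. A more reliable variant would query the backend for its loaded models instead of inferring them.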
4. Missing Backpressure Handling in SSE Streams
Piping upstream streams to clients with a bare .pipe() call forwards backpressure but not errors or teardown: when a client disconnects mid-stream, the upstream response keeps generating and its buffers are never released. Use pipeline from Node's stream/promises module, which destroys both streams when either side fails. Always attach abort signals to upstream requests.
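The behavior of pipeline can be seen in isolation with a deliberately slow consumer. This is a minimal sketch using only Node's standard library: `slowSink` and `forward` are hypothetical helpers that stand in for a client socket and the proxy step, and the small highWaterMark is chosen only to make the writable fill up quickly.

```typescript
import { Readable, Writable } from 'node:stream';
import { pipeline } from 'node:stream/promises';

// Stand-in for a slow client socket. The small highWaterMark makes the
// writable report "full" early, so pipeline() pauses the source.
function slowSink(received: string[]): Writable {
  return new Writable({
    highWaterMark: 16,
    write(chunk, _encoding, callback) {
      received.push(chunk.toString());
      setTimeout(callback, 1); // acknowledge each chunk after a short delay
    },
  });
}

// Forward chunks from a source to the sink. Resolves once the sink has
// finished; rejects, after destroying both streams, if either side errors
// or the optional signal aborts.
async function forward(chunks: string[], signal?: AbortSignal): Promise<string[]> {
  const received: string[] = [];
  await pipeline(Readable.from(chunks), slowSink(received), { signal });
  return received;
}
```

In a gateway, the source would be the upstream response body and the sink the raw client response. The signal would typically come from an AbortController that fires when the client connection closes.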
5. Hardcoding Endpoints Instead of Dynamic Registry
Static configuration breaks when backends scale, restart, or change ports. Implement a dynamic registry with health probes, TTL-based staleness, and hot-reload capability. Support environment variables or a lightweight config file that the gateway watches for changes.
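A reload step that validates before swapping keeps a bad edit from taking the gateway down. The sketch below assumes the backend list arrives as a JSON string, for example from an environment variable or a watched file. `parseBackends` and `reloadRegistry` are hypothetical helpers, and the watching mechanism itself is left out.

```typescript
interface BackendConfig {
  id: string;
  url: string;
  models: string[];
}

// Parse and validate a JSON array of backends. Throws on malformed input.
function parseBackends(raw: string): BackendConfig[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) throw new Error('backends must be a JSON array');

  return data.map((entry, i) => {
    const { id, url, models } = (entry ?? {}) as Partial<BackendConfig>;
    if (typeof id !== 'string' || typeof url !== 'string') {
      throw new Error(`backend ${i}: id and url must be strings`);
    }
    if (!Array.isArray(models) || !models.every(m => typeof m === 'string')) {
      throw new Error(`backend ${i}: models must be an array of strings`);
    }
    new URL(url); // throws if the URL is not well-formed
    return { id, url, models };
  });
}

// Replace the live registry only when the new configuration parses cleanly;
// otherwise keep serving with the current one.
function reloadRegistry(current: BackendConfig[], raw: string | undefined): BackendConfig[] {
  if (!raw) return current;
  try {
    return parseBackends(raw);
  } catch {
    return current;
  }
}
```

Health state is not carried over in this sketch. A fuller version would merge the new entries with the existing health flags by `id`.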
6. No Circuit Breaker for Unresponsive Backends
Local inference processes crash silently. Without a circuit breaker, the gateway continues routing to dead nodes, accumulating timeouts. Implement a simple state machine: CLOSED → OPEN → HALF_OPEN. Open after N consecutive failures, move to HALF_OPEN after a cooldown period, and allow one probe request: close the circuit if it succeeds, reopen it if it fails.
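The state machine is small enough to write out. The following is a sketch rather than a drop-in component: the `CircuitBreaker` class and its method names are hypothetical, the defaults mirror the values used in the configuration template, and the clock is injectable so the transitions can be exercised without waiting.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private failureThreshold: number;
  private resetTimeoutMs: number;
  private now: () => number;

  constructor(failureThreshold = 3, resetTimeoutMs = 10_000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  // Returns true if a request may be sent to the backend right now.
  allowRequest(): boolean {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN'; // cooldown elapsed, allow a probe
      this.probeInFlight = false;
    }
    if (this.state === 'CLOSED') return true;
    if (this.state === 'HALF_OPEN' && !this.probeInFlight) {
      this.probeInFlight = true; // admit exactly one probe request
      return true;
    }
    return false;
  }

  // A success closes the circuit and clears the failure count.
  recordSuccess(): void {
    this.state = 'CLOSED';
    this.failures = 0;
    this.probeInFlight = false;
  }

  // A failed probe, or N consecutive failures, opens the circuit.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
      this.probeInFlight = false;
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

One breaker per backend is the usual arrangement: the router calls `allowRequest` before selecting a node and reports the outcome afterwards with `recordSuccess` or `recordFailure`.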
7. Over-Provisioning or Under-Provisioning Request Queues
Local backends have strict concurrency limits. Flooding a single Ollama instance with 50 concurrent requests causes OOM or kernel panics. Implement a request queue with per-model concurrency limits. Drop or defer requests when the queue saturates, returning 429 Too Many Requests with a Retry-After header.
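The admission check behind such a queue can be sketched as a per-model counter. The code below covers only the "drop when saturated" half of the advice and omits deferral. `ModelLimiter` and `tooManyRequests` are hypothetical names, and the fixed retry hint is an assumption rather than a measured value.

```typescript
class ModelLimiter {
  private inFlight = new Map<string, number>();
  private limits: Map<string, number>;
  private defaultLimit: number;

  constructor(limits: Record<string, number>, defaultLimit = 1) {
    this.limits = new Map(Object.entries(limits));
    this.defaultLimit = defaultLimit;
  }

  // Try to reserve a slot for `model`. The caller rejects the request when this returns false.
  tryAcquire(model: string): boolean {
    const limit = this.limits.get(model) ?? this.defaultLimit;
    const active = this.inFlight.get(model) ?? 0;
    if (active >= limit) return false;
    this.inFlight.set(model, active + 1);
    return true;
  }

  // Free the slot once the upstream response has completed or failed.
  release(model: string): void {
    const active = this.inFlight.get(model) ?? 0;
    this.inFlight.set(model, Math.max(0, active - 1));
  }
}

// Shape of the rejection described above; retryAfterSeconds is an assumed fixed hint.
function tooManyRequests(retryAfterSeconds = 1) {
  return {
    status: 429,
    headers: { 'Retry-After': String(retryAfterSeconds) },
    body: { error: 'Too many concurrent requests for this model' },
  };
}
```

In a handler, `release` belongs in a `finally` block so that a failed or aborted upstream call does not leak a slot.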
Best Practices from Production:
- Log structured metrics: model, backend_id, cache_hit, ttft, tokens_generated, latency_ms.
- Implement token budgeting at the gateway level to prevent runaway generation costs.
- Use graceful degradation: route to smaller models when VRAM is constrained.
- Separate control plane (health, routing) from data plane (proxy, cache) for independent scaling.
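Token budgeting, as listed above, can be enforced by clamping the max_tokens field before the request is forwarded. This is a minimal sketch: `applyTokenBudget` is a hypothetical helper, and the ceiling is whatever per-request limit the operator chooses.

```typescript
// Clamp the requested completion length to a gateway-wide ceiling.
// Requests that omit max_tokens receive the ceiling, so generation is always bounded.
function applyTokenBudget<T extends { max_tokens?: number }>(body: T, ceiling: number): T {
  const requested = body.max_tokens;
  const capped =
    requested === undefined
      ? ceiling
      : Math.min(Math.max(1, Math.floor(requested)), ceiling);
  // Return a copy so the original request body is left untouched.
  return { ...body, max_tokens: capped };
}
```

For example, with a ceiling of 512, a request asking for 4096 tokens would be forwarded with max_tokens set to 512, while a request asking for 100 would pass through unchanged.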
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single-model development | Direct Ollama/vLLM endpoint | Zero overhead, fast iteration | Negligible |
| Multi-model production routing | Local API gateway with dynamic registry | Prevents VRAM thrashing, enables fallbacks | +12% CPU, -40% compute waste |
| Low VRAM / edge deployment | Gateway + model-aware queue + streaming only | Avoids OOM, preserves TTFT | Lower hardware requirements |
| High-throughput API service | Gateway + Redis cache + circuit breaker | Handles traffic spikes, reduces redundant generation | +15% infra, -60% token cost |
Configuration Template
# gateway.config.yaml
server:
  port: 4000
  host: "0.0.0.0"
  logLevel: "info"
backends:
  - id: "ollama-dev"
    url: "http://127.0.0.1:11434"
    models: ["llama3:8b", "mistral:7b"]
    concurrency: 8
    healthInterval: 5000
    timeout: 3000
  - id: "vllm-prod"
    url: "http://127.0.0.1:8000/v1"
    models: ["qwen2.5:14b"]
    concurrency: 12
    healthInterval: 5000
    timeout: 5000
cache:
  enabled: true
  maxItems: 1000
  ttlMs: 300000
  streamCacheDisabled: true
circuitBreaker:
  failureThreshold: 3
  resetTimeoutMs: 10000
  halfOpenRequests: 1
metrics:
  enabled: true
  endpoint: "/metrics"
  labels: ["model", "backend_id", "cache_hit", "status"]
Quick Start Guide
- Initialize project: npm init -y && npm i fastify @sinclair/typebox undici lru-cache
- Create gateway.ts with the implementation code above and gateway.config.yaml
- Start local backends: Run ollama serve and/or vllm serve <model> on default ports
- Launch gateway: npx tsx gateway.ts (or compile with tsc and run node dist/gateway.js)
- Verify: curl http://localhost:4000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"llama3","messages":[{"role":"user","content":"Hello"}],"stream":false}'
The gateway normalizes requests, routes to healthy backends, caches deterministic outputs, and returns OpenAI-compatible responses. Extend with Redis for distributed caching, add JWT auth for multi-tenant isolation, or integrate with Prometheus for real-time inference observability.