fully. The architecture below follows a four-layer pattern: ingress routing → semantic caching → model orchestration → observability & billing.
Step 1: Design the AI Gateway
The gateway sits between the client and all model providers. It handles authentication, rate limiting, request validation, and routing logic. Never expose provider APIs directly to frontend clients.
// src/gateway/router.ts
import { Router } from 'express';
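// Paths are relative to src/gateway/ and match the file names used in Steps 2 and 3
import { cacheMiddleware } from '../cache/semantic';
import { costLimiter } from '../billing/limiter';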
import { fallbackOrchestrator } from './orchestrator';
const router = Router();
router.post('/v1/chat',
costLimiter,
cacheMiddleware,
async (req, res) => {
const { prompt, model, userId } = req.body;
// Route to primary model with fallback chain
const response = await fallbackOrchestrator.execute({
primary: model || 'openai/gpt-4o',
fallbacks: ['anthropic/claude-3-5-sonnet', 'meta/llama-3.1-70b'],
payload: { prompt, userId, stream: true }
});
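res.setHeader('Content-Type', 'text/event-stream');
// res.write() accepts a string or Buffer, not a stream object, so pipe instead.
// Assumes response.stream is a Node.js Readable; pipe() ends the response when the stream ends.
response.stream.pipe(res);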
}
);
export default router;
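The router above delegates to fallbackOrchestrator, which the article does not show. A minimal sketch of a fallback chain follows, assuming a hypothetical ProviderCall function that wraps each vendor SDK; the names executeWithFallback, FallbackRequest, and FallbackResult are illustrative and not part of any library.

```typescript
// Hypothetical provider call signature; a real gateway would wrap each vendor SDK behind it.
type ProviderCall<T> = (model: string, payload: unknown) => Promise<T>;

interface FallbackRequest {
  primary: string;
  fallbacks: string[];
  payload: unknown;
}

interface FallbackResult<T> {
  model: string;      // model that actually served the request
  attempts: string[]; // every model tried, in order
  result: T;
}

// Try the primary model, then each fallback in order; throw only if all of them fail.
async function executeWithFallback<T>(
  call: ProviderCall<T>,
  request: FallbackRequest
): Promise<FallbackResult<T>> {
  const chain = [request.primary, ...request.fallbacks];
  const attempts: string[] = [];
  const errors: string[] = [];
  for (const model of chain) {
    attempts.push(model);
    try {
      const result = await call(model, request.payload);
      return { model, attempts, result };
    } catch (err) {
      // Record the failure and move on to the next model in the chain.
      errors.push(`${model}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join('; ')}`);
}
```

Returning the attempts list alongside the result makes it straightforward to feed a metric such as fallback.trigger_rate.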
Step 2: Implement Semantic Caching & Fallback Routing
Semantic caching reduces redundant inference calls. Use vector similarity thresholds to cache prompts that are functionally identical, not just string-matched. Fallback routing ensures continuity when providers throttle or fail.
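// src/cache/semantic.ts
import { createClient } from 'redis';
// Assumption: generateEmbedding is a project helper returning number[]; this path is hypothetical.
import { generateEmbedding } from './embeddings';
const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();
export async function cacheMiddleware(req: any, res: any, next: any) {
const embedding = await generateEmbedding(req.body.prompt);
// RediSearch KNN takes the query vector as a FLOAT32 byte blob passed via PARAMS,
// not interpolated into the query string (node-redis v4, requires DIALECT 2).
const vec = Buffer.from(new Float32Array(embedding).buffer);
const result = await redis.ft.search('idx:prompts', '*=>[KNN 1 @vector $vec AS distance]', {
PARAMS: { vec },
RETURN: ['distance', 'text'],
DIALECT: 2
});
const hit = result.documents[0];
// Assumes idx:prompts was created with DISTANCE_METRIC COSINE, so similarity = 1 - distance.
if (hit && 1 - Number(hit.value.distance) > 0.92) {
res.json({ source: 'cache', response: hit.value.text });
return;
}
next();
}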
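The 0.92 threshold is a cosine similarity between prompt embeddings. As a minimal sketch of that comparison in plain TypeScript (the helper name isCacheHit is an assumption, and a production system would usually let the vector store compute the score):

```typescript
// Cosine similarity between two equal-length vectors; returns 0 if either is a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A cache hit requires similarity strictly above the threshold (0.92 in the text).
function isCacheHit(query: number[], cached: number[], threshold = 0.92): boolean {
  return cosineSimilarity(query, cached) > threshold;
}
```

A higher threshold trades hit rate for safety: fewer paraphrases match, but the risk of serving an answer to a subtly different question drops.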
Step 3: Build Token Accounting & Cost Control
Token economics dictate unit economics. Track input/output tokens, apply provider-specific pricing, and enforce hard limits before requests hit inference endpoints.
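// src/billing/limiter.ts
// Assumption: these are project helpers; the module path is hypothetical.
import { getUserQuota, estimateTokens, getProviderRate } from './quota';
export const costLimiter = async (req: any, res: any, next: any) => {
// The route handler reads userId from the request body, so read it from the same place here.
const userQuota = await getUserQuota(req.body.userId);
const estimatedTokens = estimateTokens(req.body.prompt);
// Fall back to the router's default model so the rate lookup never receives undefined.
const cost = estimatedTokens * getProviderRate(req.body.model || 'openai/gpt-4o');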
if (userQuota.remaining < cost) {
res.status(429).json({ error: 'quota_exceeded', retryAfter: 3600 });
return;
}
req.estimatedCost = cost;
next();
};
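The limiter relies on token estimation and a per-model rate lookup that the article leaves undefined. A rough sketch is below; the model keys and prices are placeholders, not real provider rates, and the 4-characters-per-token rule is only a common approximation for English text, so a real tokenizer should replace it where accuracy matters.

```typescript
// Placeholder USD prices per 1,000 input tokens. These are NOT real provider rates.
const RATE_PER_1K_TOKENS_USD: Record<string, number> = {
  'example/large': 0.01,
  'example/small': 0.001
};

// Rough heuristic: about 4 characters per token for English text.
function estimateTokens(prompt: string): number {
  return Math.ceil(prompt.length / 4);
}

// Estimated input cost in USD; throws for models with no configured rate.
function estimateCostUsd(prompt: string, model: string): number {
  const rate = RATE_PER_1K_TOKENS_USD[model];
  if (rate === undefined) {
    throw new Error(`No rate configured for model: ${model}`);
  }
  return (estimateTokens(prompt) / 1000) * rate;
}

// True when the estimated cost fits inside the user's remaining quota.
function withinQuota(remainingUsd: number, prompt: string, model: string): boolean {
  return estimateCostUsd(prompt, model) <= remainingUsd;
}
```

This only bounds input cost before the call. Output tokens are unknown until the response arrives, so actual spend should be reconciled against the estimate afterwards.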
Step 4: Architecture Decisions & Rationale
- Edge vs Cloud Routing: Deploy the gateway on edge runtimes (Vercel Edge, Cloudflare Workers) for sub-50ms routing latency. Keep heavy embedding generation and vector search in regional cloud instances.
- Streaming vs Batch: Default to streaming for conversational UX. Use batch processing for offline RAG indexing and evaluation pipelines.
- Vector DB Selection: Use pgvector for <1M vectors (simpler ops, ACID compliance). Switch to Qdrant or Milvus when scaling beyond 5M vectors or requiring HNSW tuning.
- Observability: Instrument with OpenTelemetry. Track llm.request.duration, llm.token.count, cache.hit_ratio, and fallback.trigger_rate. Alert when p95 request duration exceeds 600 ms or when cost varies by more than 20% hour-over-hour.
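The two alert conditions in the observability item can be expressed as small pure functions. This is a sketch under assumptions: it uses the nearest-rank percentile definition and takes raw samples as arrays, whereas a real deployment would usually evaluate these rules in the metrics backend. The function names are illustrative.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('No samples');
  const sorted = [...samples].sort((x, y) => x - y);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Relative change between two consecutive hourly cost totals.
function hourOverHourVariance(previousUsd: number, currentUsd: number): number {
  if (previousUsd === 0) return currentUsd === 0 ? 0 : Infinity;
  return Math.abs(currentUsd - previousUsd) / previousUsd;
}

// Returns the names of the alert rules that fire for the given window.
function evaluateAlerts(durationsMs: number[], previousUsd: number, currentUsd: number): string[] {
  const alerts: string[] = [];
  if (percentile(durationsMs, 95) > 600) alerts.push('p95_latency');
  if (hourOverHourVariance(previousUsd, currentUsd) > 0.2) alerts.push('cost_variance');
  return alerts;
}
```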
Pitfall Guide
- Treating prompts as immutable configuration: Prompts drift as models update. Hardcoded prompts break when provider APIs change or when context windows shift. Store prompts in versioned configuration, validate against eval sets, and implement A/B routing for prompt variants.
- Ignoring token economics in pricing: Pricing based on feature tiers without mapping to token consumption guarantees margin erosion. Implement per-token cost tracking, expose usage dashboards, and cap high-cost operations (e.g., multi-agent loops) with explicit user consent.
- Skipping model evaluation pipelines: Launching without automated evals means regression goes undetected. Build eval suites measuring factual accuracy, tone consistency, latency, and cost. Run evals on every model switch or prompt update. Never deploy without a passing eval gate.
- Over-relying on single-provider APIs: Provider outages, rate limit changes, or pricing shifts can halt production. Abstract provider interfaces behind a unified gateway. Implement circuit breakers with exponential backoff and automatic fallback to secondary models.
- Neglecting semantic cache invalidation: String-matching caches miss paraphrased queries. Semantic caches without TTL or version tagging serve stale context. Combine vector similarity with prompt version hashes and time-based expiration. Invalidate cache when model weights or system prompts change.
- Deploying without graceful degradation: AI systems fail silently. When a model times out or returns malformed JSON, the UI hangs or crashes. Implement timeout thresholds, structured error responses, and user-facing degradation states (e.g., "Processing slower than usual, try again").
- Missing real-time cost alerting: Token spend scales non-linearly during traffic spikes or agentic loops. Set up real-time cost dashboards with Slack/PagerDuty alerts at 50%, 80%, and 100% of budget thresholds. Implement automatic request throttling when anomalies are detected.
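The single-provider pitfall recommends circuit breakers with exponential backoff. One way to sketch that pattern is shown below; the class and function names are illustrative, the clock is injected so the behavior can be tested deterministically, and the default backoff values are assumptions rather than recommendations from the article.

```typescript
type BreakerState = 'closed' | 'open' | 'half_open';

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private readonly threshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(threshold: number, resetTimeoutMs: number, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  // Returns true if a request may be sent to the provider right now.
  canRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      // After the reset timeout, allow trial requests until an outcome is recorded.
      this.state = 'half_open';
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed trial request, or too many consecutive failures, opens the breaker.
    if (this.state === 'half_open' || this.failures >= this.threshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}

// Exponential backoff delay for retry attempt n (0-based), capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 10000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}
```

Keeping one breaker per provider lets the gateway skip a failing provider immediately and move to the next model in the fallback chain instead of waiting for each request to time out.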
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume conversational UI | Edge gateway + semantic cache + streaming | Minimizes p95 latency, reduces redundant inference | -45% infra cost |
| Batch RAG indexing | Cloud worker queue + pgvector + async embeddings | Handles large payloads without blocking user threads | +12% compute, -60% API calls |
| Multi-tenant SaaS with strict compliance | On-prem VPC routing + local model fallback + audit logging | Meets data residency, prevents provider data leakage | +30% infra, zero vendor risk |
| Agentic workflow with tool use | Structured output parsing + retry budget + cost caps | Prevents infinite loops, bounds token spend | +8% latency, -70% runaway cost |
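The retry budget and cost cap from the agentic workflow row can be illustrated with a bounded loop. This sketch is synchronous for brevity and assumes a hypothetical step function that reports its own cost; a real agent step would be asynchronous and would take its cost from provider usage data.

```typescript
interface StepResult {
  done: boolean;   // true when the agent has finished its task
  costUsd: number; // spend attributed to this step
}

interface LoopOutcome {
  status: 'completed' | 'step_budget_exhausted' | 'cost_cap_reached';
  steps: number;
  spentUsd: number;
}

// Runs steps until the agent reports done, the cost cap is reached, or the step budget runs out.
function runBoundedLoop(
  step: (index: number) => StepResult,
  maxSteps: number,
  costCapUsd: number
): LoopOutcome {
  let spentUsd = 0;
  for (let i = 0; i < maxSteps; i++) {
    const result = step(i);
    spentUsd += result.costUsd;
    if (result.done) {
      return { status: 'completed', steps: i + 1, spentUsd };
    }
    if (spentUsd >= costCapUsd) {
      return { status: 'cost_cap_reached', steps: i + 1, spentUsd };
    }
  }
  return { status: 'step_budget_exhausted', steps: maxSteps, spentUsd };
}
```

Returning a status instead of throwing lets the caller show a degradation message or ask the user for consent before raising the cap.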
Configuration Template
// config/ai-gateway.json
{
"gateway": {
"edge": true,
"timeout_ms": 3000,
"stream_enabled": true
},
"cache": {
"provider": "redis",
"semantic_threshold": 0.92,
"ttl_hours": 24,
"version_key": "prompt_v2"
},
"routing": {
"primary": "openai/gpt-4o",
"fallbacks": ["anthropic/claude-3-5-sonnet", "meta/llama-3.1-70b"],
"circuit_breaker": {
"threshold": 5,
"reset_timeout_ms": 30000
}
},
"billing": {
"track_tokens": true,
"hard_limit_per_request_usd": 0.50,
"alert_thresholds": [0.5, 0.8, 1.0]
},
"observability": {
"otel_enabled": true,
"metrics": ["llm.request.duration", "llm.token.count", "cache.hit_ratio", "fallback.trigger_rate"],
"log_level": "info"
}
}
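The cache block in this template pairs a similarity threshold with a TTL and a version key. As a sketch of how those three settings might gate a cache hit together (the interfaces and the isServable name are assumptions, and the similarity score is taken as an input rather than computed here):

```typescript
interface CacheConfig {
  semanticThreshold: number; // e.g. 0.92
  ttlHours: number;          // e.g. 24
  versionKey: string;        // e.g. "prompt_v2"
}

interface CachedResponse {
  text: string;
  versionKey: string; // version key in effect when the entry was stored
  storedAtMs: number; // epoch milliseconds
}

// An entry is servable only if it is fresh, was stored under the current version, and is similar enough.
function isServable(
  entry: CachedResponse,
  similarity: number,
  config: CacheConfig,
  nowMs: number
): boolean {
  const ttlMs = config.ttlHours * 60 * 60 * 1000;
  const fresh = nowMs - entry.storedAtMs < ttlMs;
  const sameVersion = entry.versionKey === config.versionKey;
  return fresh && sameVersion && similarity > config.semanticThreshold;
}
```

Bumping version_key whenever the system prompt or model changes invalidates every older entry without having to flush the store.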
Quick Start Guide
- Initialize the gateway: npx create-ai-gateway --template production-ready && cd ai-gateway
- Configure environment: Copy .env.example to .env, then set REDIS_URL, provider API keys, and OTEL_EXPORTER_OTLP_ENDPOINT
- Seed cache & evals: npm run setup:cache && npm run eval:baseline
- Launch locally: npm run dev → gateway runs on localhost:3000 with streaming, caching, and cost limits active
- Verify production readiness: npm run test:load -- --sim 10x --inject-failure → confirms fallback routing and circuit breaker behavior under stress
Launch resilience is engineered, not accidental. Treat inference as a distributed system, instrument every token, and validate every routing decision. The models will handle the intelligence; your pipeline must handle the reality.