reads this policy at startup and maintains an in-memory routing table.
```typescript
interface RoutingPolicy {
  policyId: string;
  defaultModel: string;
  providers: Array<{
    vendor: 'openai' | 'anthropic' | 'mistral';
    allowedModels: string[];
    weight: number;
    apiKeyRef: string;
    timeoutMs: number;
  }>;
  fallbackBehavior: 'auto' | 'explicit';
}

const prodRoutingPolicy: RoutingPolicy = {
  policyId: 'gateway-primary-v1',
  defaultModel: 'gpt-4o',
  providers: [
    {
      vendor: 'openai',
      allowedModels: ['gpt-4o', 'gpt-4o-mini'],
      weight: 0.7,
      apiKeyRef: 'env:OPENAI_PROD_KEY',
      timeoutMs: 4000
    },
    {
      vendor: 'anthropic',
      allowedModels: ['claude-3-sonnet-20240229', 'claude-3-haiku-20240307'],
      weight: 0.3,
      apiKeyRef: 'env:ANTHROPIC_PROD_KEY',
      timeoutMs: 5000
    }
  ],
  fallbackBehavior: 'auto'
};
```
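Because the policy is read once at startup, it may be worth rejecting a malformed policy before any traffic is routed. The sketch below shows one possible check. The function name `validatePolicy`, the `PolicyShape` type, and the specific rules are assumptions for illustration, not part of the policy format described above.

```typescript
// Hypothetical startup check. The rule set below is an assumption,
// not something the original policy format prescribes.
interface PolicyShape {
  defaultModel: string;
  providers: Array<{
    vendor: string;
    allowedModels: string[];
    weight: number;
    timeoutMs: number;
  }>;
}

// Returns a list of human-readable problems; an empty list means the policy passed.
function validatePolicy(policy: PolicyShape): string[] {
  const problems: string[] = [];
  if (policy.providers.length === 0) {
    problems.push('policy has no providers');
  }

  const seenVendors: string[] = [];
  for (const p of policy.providers) {
    if (seenVendors.indexOf(p.vendor) !== -1) {
      problems.push(`duplicate vendor: ${p.vendor}`);
    }
    seenVendors.push(p.vendor);
    if (!(p.weight > 0)) problems.push(`${p.vendor}: weight must be positive`);
    if (!(p.timeoutMs > 0)) problems.push(`${p.vendor}: timeoutMs must be positive`);
    if (p.allowedModels.length === 0) problems.push(`${p.vendor}: allowedModels is empty`);
  }

  // The default model should be served by at least one provider.
  const defaultIsServed = policy.providers.some(
    p => p.allowedModels.indexOf(policy.defaultModel) !== -1
  );
  if (!defaultIsServed) {
    problems.push(`defaultModel ${policy.defaultModel} is not served by any provider`);
  }
  return problems;
}
```

A gateway could call this before constructing the router and refuse to start when the returned list is non-empty.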
Step 2: Implement the Gateway Router
The router evaluates the incoming request against the policy. It checks model compatibility, sorts providers by weight, and constructs a fallback chain. If the primary provider times out or returns a 4xx/5xx error, the router attempts the next provider in the chain.
```typescript
class LLMRouter {
  private policy: RoutingPolicy;
  private modelCatalog: Map<string, Set<string>>;

  constructor(policy: RoutingPolicy) {
    this.policy = policy;
    this.modelCatalog = this.buildCatalog();
  }

  // Index each model name to the set of vendors allowed to serve it.
  private buildCatalog(): Map<string, Set<string>> {
    const catalog = new Map<string, Set<string>>();
    for (const p of this.policy.providers) {
      for (const model of p.allowedModels) {
        if (!catalog.has(model)) catalog.set(model, new Set());
        catalog.get(model)!.add(p.vendor);
      }
    }
    return catalog;
  }

  async routeRequest(payload: { model: string; messages: any[] }) {
    const supportedVendors = this.modelCatalog.get(payload.model);
    if (!supportedVendors) {
      throw new Error(`Model ${payload.model} not registered in catalog`);
    }

    // Fallback chain: only vendors that serve the model, highest weight first.
    const chain = this.policy.providers
      .filter(p => supportedVendors.has(p.vendor))
      .sort((a, b) => b.weight - a.weight);

    let lastError: Error | null = null;
    for (let hop = 0; hop < chain.length; hop++) {
      const provider = chain[hop];
      try {
        const response = await this.dispatchToProvider(provider, payload);
        // Any hop after the first means a fallback served the request.
        return this.attachMetadata(response, provider.vendor, hop > 0);
      } catch (err) {
        lastError = err as Error;
        console.warn(`${provider.vendor} failed: ${lastError.message}. Trying next provider, if any.`);
      }
    }
    throw lastError ?? new Error('All providers in chain exhausted');
  }

  private async dispatchToProvider(provider: RoutingPolicy['providers'][0], payload: any) {
    // apiKeyRef may carry an 'env:' prefix (e.g. 'env:OPENAI_PROD_KEY'); strip it before the lookup.
    const apiKey = process.env[provider.apiKeyRef.replace(/^env:/, '')];
    if (!apiKey) throw new Error(`Missing API key for ${provider.vendor}`);

    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), provider.timeoutMs);
    try {
      // Simplified: this URL and auth scheme fit OpenAI-style APIs. Anthropic uses
      // /v1/messages with an x-api-key header, and Mistral is served from api.mistral.ai,
      // so a real gateway needs a per-vendor adapter here.
      const res = await fetch(`https://api.${provider.vendor}.com/v1/chat/completions`, {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${apiKey}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({ model: payload.model, messages: payload.messages }),
        signal: controller.signal
      });
      if (!res.ok) throw new Error(`HTTP ${res.status} from ${provider.vendor}`);
      return await res.json();
    } finally {
      // Clear the timer on every path, including fetch rejections.
      clearTimeout(timeout);
    }
  }

  private attachMetadata(response: any, vendor: string, isFallback: boolean) {
    return {
      ...response,
      _routing: { vendor, isFallback, timestamp: Date.now() }
    };
  }
}
```
Step 3: Integrate with Application Code
The application no longer manages provider selection. It sends requests to the router, which handles validation, fallbacks, and metadata injection.
```typescript
const router = new LLMRouter(prodRoutingPolicy);

async function generateCompletion(userPrompt: string) {
  const result = await router.routeRequest({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: userPrompt }]
  });
  console.log('Response:', result.choices[0].message.content);
  console.log('Routed via:', result._routing.vendor);
  return result;
}
```
Architecture Decisions & Rationale
- Declarative Policy over Imperative Logic: Routing rules live in configuration, not controllers. Weights and provider lists can then be changed without touching application code, and hot-reloaded without a restart if the gateway rebuilds its routing table when the policy changes.
- Weight-Ordered Fallback Chain: Sorting by descending weight means the provider the operator ranks highest (for cost or capability) is always tried first. In this router the weights act as a priority order, not a traffic split, so fallbacks degrade predictably rather than randomly.
- Model Catalog Validation: Prevents routing requests to providers that don't support the requested model. This avoids silent failures and degraded output from a mismatched model.
- Explicit Fallback Override: When applications pass a custom fallbacks array, the gateway skips automatic chaining. This preserves compliance workflows and specialized routing requirements.
- Timeout Per Hop: Each provider in the chain gets an independent timeout. This prevents a slow primary provider from blocking the entire fallback sequence.
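The weight-ordered chain and the explicit override can be combined in a single chain-building step. The following is a minimal sketch under stated assumptions: the `fallbacks` array is assumed to hold vendor names in the caller's preferred order, and `buildFallbackChain`, `ChainProvider`, and `ChainRequest` are hypothetical names introduced here.

```typescript
// Hypothetical shapes for illustration; only the fields used here are declared.
interface ChainProvider {
  vendor: string;
  allowedModels: string[];
  weight: number;
}

interface ChainRequest {
  model: string;
  // Assumption: an ordered list of vendor names supplied by the caller.
  fallbacks?: string[];
}

function buildFallbackChain(providers: ChainProvider[], request: ChainRequest): ChainProvider[] {
  // Catalog validation: keep only providers that serve the requested model.
  const eligible = providers.filter(p => p.allowedModels.indexOf(request.model) !== -1);

  if (request.fallbacks && request.fallbacks.length > 0) {
    // Explicit mode: honor the caller's order and skip automatic chaining.
    const chain: ChainProvider[] = [];
    for (const vendor of request.fallbacks) {
      const match = eligible.filter(p => p.vendor === vendor)[0];
      if (match) chain.push(match);
    }
    return chain;
  }

  // Automatic mode: highest weight first. slice() avoids mutating the input.
  return eligible.slice().sort((a, b) => b.weight - a.weight);
}
```

In explicit mode, vendors that fail catalog validation are dropped silently in this sketch. A stricter gateway might reject the request instead.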
Pitfall Guide
1. Ignoring Streaming State During Failover
Explanation: When a primary provider fails mid-stream, naive routers drop the connection and restart from the beginning. This wastes tokens and breaks user experience.
Fix: Implement chunk buffering and state checkpointing. If a fallback triggers, replay buffered tokens or switch to non-streaming mode for the remaining payload.
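One way to checkpoint streaming state is to record every chunk already delivered and, on failover, ask the fallback provider to continue from that text. The sketch below assumes that approach. `StreamCheckpoint`, `ChatMessage`, and the continuation instruction are hypothetical, and whether a model continues cleanly from a partial answer depends on the model.

```typescript
// Hypothetical message shape mirroring the { role, content } objects used by the router.
interface ChatMessage {
  role: 'user' | 'assistant' | 'system';
  content: string;
}

class StreamCheckpoint {
  private chunks: string[] = [];

  // Call once for every chunk forwarded to the client.
  record(chunk: string): void {
    this.chunks.push(chunk);
  }

  // Everything the client has already received.
  deliveredText(): string {
    return this.chunks.join('');
  }

  // Messages to send to the fallback provider so it resumes instead of restarting.
  buildResumeMessages(original: ChatMessage[]): ChatMessage[] {
    const partial = this.deliveredText();
    if (partial.length === 0) {
      // Nothing was delivered yet, so the original request can be replayed as-is.
      return original.slice();
    }
    const resume: ChatMessage[] = [
      { role: 'assistant', content: partial },
      {
        role: 'user',
        content: 'Continue exactly where the previous answer stopped. Do not repeat earlier text.'
      }
    ];
    return original.concat(resume);
  }
}
```

The router would then forward only the fallback's new chunks to the client, since the buffered text has already been sent.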
2. Weight Misconfiguration Leading to Cost Spikes
Explanation: Assigning high weights to premium models without monitoring actual usage causes unexpected billing spikes during fallback events.
Fix: Audit weights against provider pricing tiers. Implement cost-aware routing that dynamically adjusts weights based on real-time token consumption and budget thresholds.
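Cost-aware adjustment can be as simple as scaling each configured weight by the fraction of budget that provider has left. The sketch below illustrates that idea only. `adjustWeightsForBudget`, the `spentUsd` and `budgetUsd` fields, and the linear scaling rule are assumptions, not a prescribed algorithm.

```typescript
// Hypothetical per-provider spend tracking; field names are illustrative.
interface CostedProvider {
  vendor: string;
  weight: number;
  spentUsd: number;
  budgetUsd: number;
}

function adjustWeightsForBudget(
  providers: CostedProvider[]
): Array<{ vendor: string; weight: number }> {
  // Scale each weight by the share of budget remaining (clamped to [0, 1]).
  const scaled = providers.map(p => {
    const used = p.budgetUsd > 0 ? p.spentUsd / p.budgetUsd : 1;
    const remaining = Math.min(1, Math.max(0, 1 - used));
    return { vendor: p.vendor, weight: p.weight * remaining };
  });

  const total = scaled.reduce((sum, p) => sum + p.weight, 0);
  if (total === 0) {
    // Every budget is exhausted: fall back to the configured weights
    // rather than returning an unusable all-zero table.
    return providers.map(p => ({ vendor: p.vendor, weight: p.weight }));
  }

  // Renormalize so the adjusted weights sum to 1.
  return scaled.map(p => ({ vendor: p.vendor, weight: p.weight / total }));
}
```

A provider that has spent its whole budget ends up with weight 0 and sorts to the end of the fallback chain.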
3. Bypassing Model Catalog Validation
Explanation: Routing requests without verifying model support causes 404 errors or degraded output when fallbacks hit incompatible endpoints.
Fix: Enforce strict allowlists in the routing policy. Sync the model catalog with provider API documentation on deployment and validate incoming requests against it.
4. Overriding Fallbacks Unintentionally
Explanation: Applications that pass explicit fallbacks arrays disable automatic chaining. Teams often forget this behavior and wonder why failover isn't triggering.
Fix: Document fallback override behavior in API contracts. Use environment flags to toggle between automatic and explicit routing during development.
5. Neglecting Fallback Latency Budgets
Explanation: Each hop in the fallback chain adds network and processing latency. Per-hop timeouts cap each attempt but not the total: in the worst case a request takes the sum of every hop's timeout, which can exceed SLA thresholds.
Fix: Define a total latency budget (e.g., 6s) and distribute it across providers. Implement circuit breakers that skip providers with historically high latency during peak load.
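Distributing a total budget can be done by scaling the configured per-hop timeouts proportionally whenever their sum exceeds the budget. The sketch below assumes that proportional rule. `fitTimeoutsToBudget` and `TimedProvider` are hypothetical names.

```typescript
// Hypothetical shape: only the fields needed for budgeting.
interface TimedProvider {
  vendor: string;
  timeoutMs: number;
}

function fitTimeoutsToBudget(chain: TimedProvider[], totalBudgetMs: number): TimedProvider[] {
  const configuredTotal = chain.reduce((sum, p) => sum + p.timeoutMs, 0);

  // Already within budget: return copies with the configured timeouts.
  if (configuredTotal <= totalBudgetMs) {
    return chain.map(p => ({ vendor: p.vendor, timeoutMs: p.timeoutMs }));
  }

  // Over budget: shrink every hop by the same factor, rounding down so the
  // sum can never exceed the budget.
  const scale = totalBudgetMs / configuredTotal;
  return chain.map(p => ({
    vendor: p.vendor,
    timeoutMs: Math.floor(p.timeoutMs * scale)
  }));
}
```

With hop timeouts of 4000 ms and 5000 ms and a 6000 ms budget, the worst case drops from 9000 ms to 2666 ms + 3333 ms = 5999 ms.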
6. Single-Region Key Restrictions
Explanation: Sharing a single set of production API keys across all regions causes routing failures when a regional endpoint is throttled or unavailable.
Fix: Map API keys to geographic zones in the routing policy. Route requests to the nearest healthy endpoint using latency-aware provider selection.
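A zone-to-key mapping with latency-aware selection might look like the sketch below. The `RegionalEndpoint` shape, its `healthy` and `p50LatencyMs` fields, and the `pickEndpoint` helper are assumptions for illustration. How health and latency are measured is outside the scope of this sketch.

```typescript
// Hypothetical per-region entry; key references and metrics are illustrative.
interface RegionalEndpoint {
  region: string;
  apiKeyRef: string;
  healthy: boolean;
  p50LatencyMs: number;
}

// Picks the healthy endpoint with the lowest observed latency,
// or undefined when no region is currently healthy.
function pickEndpoint(endpoints: RegionalEndpoint[]): RegionalEndpoint | undefined {
  const healthy = endpoints.filter(e => e.healthy);
  if (healthy.length === 0) return undefined;
  return healthy.reduce((best, e) => (e.p50LatencyMs < best.p50LatencyMs ? e : best));
}
```

Each region carries its own key reference, so throttling or revoking one regional key leaves the others usable.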
7. Missing Observability Hooks
Explanation: Without tracking which provider fulfilled each request, teams cannot diagnose outages, allocate costs, or optimize weights.
Fix: Inject routing metadata into every response. Export structured logs containing primary_provider, fallback_used, hop_count, and total_latency_ms to your observability stack.
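The structured log record can be derived from a list of per-hop attempts collected while walking the fallback chain. The sketch below is one possible shape. `HopAttempt` and `buildRoutingLog` are hypothetical, and the `served_by` field is an addition not named in the text above.

```typescript
// Hypothetical record of a single attempt against one provider.
interface HopAttempt {
  vendor: string;
  ok: boolean;
  latencyMs: number;
}

interface RoutingLogRecord {
  primary_provider: string;
  served_by: string | null; // assumption: extra field, null when every hop failed
  fallback_used: boolean;
  hop_count: number;
  total_latency_ms: number;
}

function buildRoutingLog(attempts: HopAttempt[]): RoutingLogRecord {
  if (attempts.length === 0) {
    throw new Error('no attempts recorded');
  }
  const last = attempts[attempts.length - 1];
  return {
    primary_provider: attempts[0].vendor,
    served_by: last.ok ? last.vendor : null,
    // More than one attempt means at least one fallback hop was taken.
    fallback_used: attempts.length > 1,
    hop_count: attempts.length,
    total_latency_ms: attempts.reduce((sum, a) => sum + a.latencyMs, 0)
  };
}
```

Serializing the record with `JSON.stringify` gives one log line per request that most log pipelines can ingest directly.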
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Cost-Optimized Batch Processing | High weight to budget models, strict fallback to mid-tier | Minimizes token spend while maintaining acceptable quality | Low (predictable per-token pricing) |
| Low-Latency Interactive Chat | Weighted routing to regional endpoints, tight timeout budgets | Reduces round-trip time and prevents timeout cascades | Medium (regional routing may use premium keys) |
| Compliance-Strict Workloads | Explicit fallback chains with model validation, key isolation | Ensures data residency and audit trail requirements are met | High (dedicated keys and restricted models) |
| High-Availability Production | Auto-fallback with 3+ providers, dynamic weight rebalancing | Maintains availability during provider degradation events | Variable (fallback usage increases spend during outages) |
Configuration Template
```yaml
gateway:
  policy_id: prod-routing-v2
  default_model: gpt-4o
  fallback_mode: auto
  providers:
    - vendor: openai
      allowed_models:
        - gpt-4o
        - gpt-4o-mini
      weight: 0.65
      api_key_ref: OPENAI_PROD_KEY
      timeout_ms: 3500
      region: us-east-1
    - vendor: anthropic
      allowed_models:
        - claude-3-sonnet-20240229
        - claude-3-haiku-20240307
      weight: 0.25
      api_key_ref: ANTHROPIC_PROD_KEY
      timeout_ms: 4500
      region: us-west-2
    - vendor: mistral
      allowed_models:
        - mistral-large-latest
      weight: 0.10
      api_key_ref: MISTRAL_PROD_KEY
      timeout_ms: 5000
      region: eu-central-1
  observability:
    export_fallback_metrics: true
    correlation_id_header: X-Request-ID
    log_level: info
```
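The template uses snake_case keys while the TypeScript policy uses camelCase, so a small mapping step is needed after parsing. The sketch below assumes the YAML has already been parsed into a plain object by a YAML library of your choice and that `providers` sits under the `gateway` key. `GatewayYaml`, `LoadedPolicy`, and `toRoutingPolicy` are hypothetical names.

```typescript
// Assumed shape of the parsed `gateway` section; the parser itself is not shown.
interface GatewayYamlProvider {
  vendor: string;
  allowed_models: string[];
  weight: number;
  api_key_ref: string;
  timeout_ms: number;
  region?: string;
}

interface GatewayYaml {
  policy_id: string;
  default_model: string;
  fallback_mode: 'auto' | 'explicit';
  providers: GatewayYamlProvider[];
}

// Mirrors the RoutingPolicy fields, with vendor widened to string for brevity.
interface LoadedPolicy {
  policyId: string;
  defaultModel: string;
  fallbackBehavior: 'auto' | 'explicit';
  providers: Array<{
    vendor: string;
    allowedModels: string[];
    weight: number;
    apiKeyRef: string;
    timeoutMs: number;
    region?: string;
  }>;
}

function toRoutingPolicy(cfg: GatewayYaml): LoadedPolicy {
  return {
    policyId: cfg.policy_id,
    defaultModel: cfg.default_model,
    fallbackBehavior: cfg.fallback_mode,
    providers: cfg.providers.map(p => ({
      vendor: p.vendor,
      allowedModels: p.allowed_models.slice(), // copy so later edits don't alias the config
      weight: p.weight,
      apiKeyRef: p.api_key_ref,
      timeoutMs: p.timeout_ms,
      region: p.region
    }))
  };
}
```

A production loader would also need to check that each `vendor` value is one the router actually supports before narrowing it to the stricter `RoutingPolicy` type.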
Quick Start Guide
- Initialize the Gateway: Deploy the routing proxy and load the YAML configuration. The gateway will build the model catalog and initialize the weight-sorted provider chain.
- Configure API Keys: Inject provider credentials via environment variables or a secrets manager. Ensure key references match the api_key_ref fields in the policy.
- Route Traffic: Point your application's LLM client to the gateway endpoint. Replace direct provider calls with the LLMRouter wrapper or HTTP proxy.
- Validate Fallbacks: Simulate provider degradation by temporarily revoking a key or injecting latency. Verify that requests automatically traverse the fallback chain and return routing metadata.
- Hook Observability: Connect gateway logs to your monitoring stack. Track fallback_used, hop_count, and total_latency_ms to optimize weights and detect early provider degradation.