es a unified response interface, shielding consumers from provider-specific quirks.
4. Async-First Design: Inference is I/O bound. The implementation uses non-blocking I/O to maximize throughput and support concurrent request handling.
Implementation
// Core Interfaces
interface InferenceRequest {
  prompt: string;
  metadata?: Record<string, string>;
  constraints?: {
    maxTokens?: number;
    maxCost?: number;
    requiredCapabilities?: string[];
  };
}
interface InferenceResponse {
  content: string;
  modelId: string;
  provider: string;
  usage: {
    inputTokens: number;
    outputTokens: number;
    estimatedCost: number;
  };
  latencyMs: number;
}
interface ModelProvider {
  id: string;
  capabilities: string[];
  invoke(request: InferenceRequest): Promise<InferenceResponse>;
  getCostPerToken(): { input: number; output: number };
}
// Circuit Breaker State
enum CircuitState { CLOSED, OPEN, HALF_OPEN }
class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failureCount: number = 0;
  private threshold: number;
  private resetTimeout: number;
  private lastFailureTime: number = 0;
  constructor(threshold: number = 5, resetTimeoutMs: number = 30000) {
    this.threshold = threshold;
    this.resetTimeout = resetTimeoutMs;
  }
  async execute(fn: () => Promise<InferenceResponse>): Promise<InferenceResponse> {
    if (this.state === CircuitState.OPEN) {
      if (Date.now() - this.lastFailureTime > this.resetTimeout) {
        this.state = CircuitState.HALF_OPEN;
      } else {
        throw new Error(`Circuit breaker open for provider`);
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  private onSuccess() {
    this.failureCount = 0;
    this.state = CircuitState.CLOSED;
  }
  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.threshold) {
      this.state = CircuitState.OPEN;
    }
  }
}
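// Gateway Router
class InferenceRouter {
  private providers: Map<string, ModelProvider>;
  private circuitBreakers: Map<string, CircuitBreaker>;
  private priorities: Map<string, number>;
  private fallbackChain: string[];
  constructor() {
    this.providers = new Map();
    this.circuitBreakers = new Map();
    this.priorities = new Map();
    this.fallbackChain = [];
  }
  registerProvider(provider: ModelProvider, fallbackPriority: number) {
    this.providers.set(provider.id, provider);
    this.circuitBreakers.set(provider.id, new CircuitBreaker());
    this.priorities.set(provider.id, fallbackPriority);
    // Rebuild the chain in ascending priority order. Assigning by index
    // (fallbackChain[priority] = id) leaves holes when priorities start at 1
    // and lets providers with equal priority overwrite each other.
    this.fallbackChain = Array.from(this.priorities.entries())
      .sort((a, b) => a[1] - b[1])
      .map(([id]) => id);
  }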
  async route(request: InferenceRequest): Promise<InferenceResponse> {
    const candidate = this.selectProvider(request);
    // Attempt primary provider with circuit breaker
    try {
      const breaker = this.circuitBreakers.get(candidate)!;
      return await breaker.execute(() => this.invokeProvider(candidate, request));
    } catch (error) {
      console.warn(`Primary provider ${candidate} failed, initiating fallback.`);
      return this.executeFallback(request, candidate);
    }
  }
  private selectProvider(request: InferenceRequest): string {
    // Strategy: keep providers that satisfy every required capability, then pick the cheapest
    const required = request.constraints?.requiredCapabilities || [];
    const scored = this.fallbackChain
      .filter(id => {
        const p = this.providers.get(id)!;
        return required.every(cap => p.capabilities.includes(cap));
      })
      .map(id => {
        // Simple cost heuristic; can be extended for dynamic pricing
        const { input, output } = this.providers.get(id)!.getCostPerToken();
        return { id, cost: input + output };
      })
      .sort((a, b) => a.cost - b.cost);
    if (scored.length === 0) {
      // Defaulting to fallbackChain[0] here would route to a provider that lacks a required capability
      throw new Error("No registered provider satisfies the request's required capabilities.");
    }
    return scored[0].id;
  }
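  private async executeFallback(request: InferenceRequest, excluded: string): Promise<InferenceResponse> {
    const required = request.constraints?.requiredCapabilities || [];
    for (const providerId of this.fallbackChain) {
      if (providerId === excluded) continue;
      const provider = this.providers.get(providerId);
      // Skip unknown entries and providers that lack a required capability
      if (!provider || !required.every(cap => provider.capabilities.includes(cap))) continue;
      const breaker = this.circuitBreakers.get(providerId)!;
      try {
        return await breaker.execute(() => this.invokeProvider(providerId, request));
      } catch (error) {
        console.warn(`Fallback provider ${providerId} failed.`);
      }
    }
    throw new Error("All providers exhausted or circuit breakers open.");
  }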
  private async invokeProvider(id: string, request: InferenceRequest): Promise<InferenceResponse> {
    const provider = this.providers.get(id)!;
    const start = Date.now();
    // Validate context window constraints
    if (request.constraints?.maxTokens) {
      // Implementation would check provider's context limit
    }
    const response = await provider.invoke(request);
    const latency = Date.now() - start;
    return {
      ...response,
      latencyMs: latency,
      provider: id
    };
  }
}
Rationale
- Provider Abstraction: The ModelProvider interface allows swapping implementations without touching the router. This supports everything from cloud APIs to local vLLM instances.
- Cost-Aware Selection: The selectProvider method includes a cost heuristic. In production, this heuristic can be backed by a pricing registry so that low-priority tasks are routed to cheaper models automatically.
- Fallback Orchestration: The executeFallback method iterates through a prioritized chain. If the optimal model is down, the system degrades gracefully to the next best option rather than failing entirely.
- Circuit Breaker Per Provider: Each provider has an independent circuit breaker. An outage in one provider does not affect the availability of others.
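To make the provider abstraction point concrete, here is a minimal sketch of a test double that implements ModelProvider. StaticProvider and ask are hypothetical names introduced only for illustration, and the interfaces are trimmed copies of the ones defined earlier so the sketch runs on its own. The caller depends only on the interface, so the same call works against a cloud-backed or local implementation.

```typescript
// Trimmed copies of the interfaces from the Implementation section.
interface InferenceRequest { prompt: string }
interface InferenceResponse {
  content: string;
  modelId: string;
  provider: string;
  usage: { inputTokens: number; outputTokens: number; estimatedCost: number };
  latencyMs: number;
}
interface ModelProvider {
  id: string;
  capabilities: string[];
  invoke(request: InferenceRequest): Promise<InferenceResponse>;
  getCostPerToken(): { input: number; output: number };
}

// Hypothetical test double: returns a canned reply without any network call.
class StaticProvider implements ModelProvider {
  readonly id: string;
  readonly capabilities: string[];
  private readonly reply: string;

  constructor(id: string, capabilities: string[], reply: string) {
    this.id = id;
    this.capabilities = capabilities;
    this.reply = reply;
  }

  async invoke(request: InferenceRequest): Promise<InferenceResponse> {
    // Whitespace word count stands in for a real tokenizer.
    const count = (text: string) => text.split(/\s+/).filter(Boolean).length;
    return {
      content: this.reply,
      modelId: `${this.id}-stub`,
      provider: this.id,
      usage: { inputTokens: count(request.prompt), outputTokens: count(this.reply), estimatedCost: 0 },
      latencyMs: 0,
    };
  }

  getCostPerToken(): { input: number; output: number } {
    return { input: 0, output: 0 };
  }
}

// The caller only knows about ModelProvider, not the concrete class.
async function ask(provider: ModelProvider, prompt: string): Promise<string> {
  const res = await provider.invoke({ prompt });
  return `[${res.provider}] ${res.content}`;
}
```

A stub like this is also convenient for exercising fallback chains in tests, since failures can be simulated by throwing from invoke.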
Pitfall Guide
1. Context Window Mismatch
Explanation: Routing a prompt that exceeds a model's context limit results in immediate failure or silent truncation. Developers often assume all models support similar context lengths.
Fix: The gateway must validate prompt length against the target model's context window before invocation. Implement automatic truncation strategies or reject requests with clear error messages.
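A pre-invocation check along these lines is one way to reject oversized prompts early. This is a sketch under stated assumptions: ContextLimit, estimateTokens, and assertFitsContext are hypothetical names, and the four-characters-per-token estimate is a rough heuristic, not a substitute for the provider's tokenizer.

```typescript
// Hypothetical per-provider limit; real values come from provider documentation.
interface ContextLimit {
  providerId: string;
  maxContextTokens: number;
}

// Rough estimate (assumption: ~4 characters per token).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

class ContextWindowError extends Error {
  constructor(providerId: string, needed: number, limit: number) {
    super(`Request needs ~${needed} tokens but ${providerId} allows ${limit}.`);
    this.name = "ContextWindowError";
  }
}

// Returns the estimated total if it fits; throws a descriptive error otherwise.
// The output budget is reserved up front because prompt and completion share the window.
function assertFitsContext(prompt: string, maxOutputTokens: number, limit: ContextLimit): number {
  const needed = estimateTokens(prompt) + maxOutputTokens;
  if (needed > limit.maxContextTokens) {
    throw new ContextWindowError(limit.providerId, needed, limit.maxContextTokens);
  }
  return needed;
}
```

In the router shown earlier, a check like this would fit in the placeholder inside invokeProvider, so a rejected request never counts as a provider failure.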
2. Schema Drift in Responses
Explanation: Different models may return JSON with varying field names or structures, especially when using function calling or structured output modes. Consumers expecting a fixed schema will break.
Fix: Enforce output normalization in the gateway. Validate each provider response against a JSON schema, then map it into a unified format before returning it to the application.
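The normalization step might look like the sketch below. It uses hand-written type guards in place of a JSON schema validator to stay dependency-free, and the two payload shapes are simplified assumptions about chat-completion-style and messages-style responses rather than complete provider schemas. All function names are hypothetical.

```typescript
// Unified shape handed back to consumers (a trimmed InferenceResponse).
interface NormalizedResult {
  content: string;
  modelId: string;
  inputTokens: number;
  outputTokens: number;
}

function isObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function requireNumber(value: unknown, field: string): number {
  if (typeof value !== "number") throw new Error(`Missing numeric field: ${field}`);
  return value;
}

// Assumed shape: { model, choices: [{ message: { content } }], usage: { prompt_tokens, completion_tokens } }
function normalizeChatCompletionStyle(raw: unknown): NormalizedResult {
  if (!isObject(raw)) throw new Error("Unexpected chat-completion-style payload");
  const { choices, usage, model } = raw;
  if (!Array.isArray(choices) || !isObject(usage)) throw new Error("Unexpected chat-completion-style payload");
  const first: unknown = choices[0];
  const message = isObject(first) ? first.message : undefined;
  const content = isObject(message) ? message.content : undefined;
  if (typeof content !== "string") throw new Error("Missing message content");
  return {
    content,
    modelId: String(model),
    inputTokens: requireNumber(usage.prompt_tokens, "usage.prompt_tokens"),
    outputTokens: requireNumber(usage.completion_tokens, "usage.completion_tokens"),
  };
}

// Assumed shape: { model, content: [{ type: "text", text }], usage: { input_tokens, output_tokens } }
function normalizeMessagesStyle(raw: unknown): NormalizedResult {
  if (!isObject(raw)) throw new Error("Unexpected messages-style payload");
  const { content, usage, model } = raw;
  if (!Array.isArray(content) || !isObject(usage)) throw new Error("Unexpected messages-style payload");
  // Concatenate text blocks and ignore any other block types.
  const text = content
    .map((block: unknown) => {
      if (!isObject(block) || block.type !== "text") return "";
      const t = block.text;
      return typeof t === "string" ? t : "";
    })
    .join("");
  return {
    content: text,
    modelId: String(model),
    inputTokens: requireNumber(usage.input_tokens, "usage.input_tokens"),
    outputTokens: requireNumber(usage.output_tokens, "usage.output_tokens"),
  };
}
```

Malformed payloads fail loudly at the gateway boundary instead of surfacing as undefined fields in consumer code.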
3. Rate Limit Blindness
Explanation: Aggregating traffic from multiple services can inadvertently exceed provider rate limits, even if individual services stay within bounds.
Fix: Implement a global token bucket algorithm in the gateway that tracks requests per second and tokens per minute across all consumers. Throttle requests at the gateway level rather than relying on provider error responses.
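A minimal in-process sketch of the token bucket idea follows, with one bucket for requests per minute and one for tokens per minute. TokenBucket and GatewayRateLimiter are hypothetical names, timestamps are passed in explicitly to keep the logic testable, and a multi-instance gateway would need shared state (for example an external store) that this sketch does not cover.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerMinute: number;

  constructor(capacity: number, refillPerMinute: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerMinute = refillPerMinute;
    this.tokens = capacity; // start full
    this.lastRefillMs = nowMs;
  }

  // Add tokens for the elapsed time, capped at capacity.
  private refill(nowMs: number): void {
    const elapsedMs = Math.max(0, nowMs - this.lastRefillMs);
    this.tokens = Math.min(this.capacity, this.tokens + (elapsedMs * this.refillPerMinute) / 60000);
    this.lastRefillMs = nowMs;
  }

  has(cost: number, nowMs: number): boolean {
    this.refill(nowMs);
    return this.tokens >= cost;
  }

  take(cost: number): void {
    this.tokens -= cost;
  }
}

class GatewayRateLimiter {
  private readonly requests: TokenBucket;
  private readonly tokens: TokenBucket;

  constructor(requestsPerMinute: number, tokensPerMinute: number, nowMs: number = Date.now()) {
    this.requests = new TokenBucket(requestsPerMinute, requestsPerMinute, nowMs);
    this.tokens = new TokenBucket(tokensPerMinute, tokensPerMinute, nowMs);
  }

  // Check both buckets before consuming, so a rejected request drains neither.
  tryAcquire(estimatedTokens: number, nowMs: number = Date.now()): boolean {
    if (!this.requests.has(1, nowMs) || !this.tokens.has(estimatedTokens, nowMs)) return false;
    this.requests.take(1);
    this.tokens.take(estimatedTokens);
    return true;
  }
}
```

Calling tryAcquire before routing lets the gateway return a throttling response itself rather than spending a provider call to discover the limit.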
4. Secret Sprawl
Explanation: Storing API keys in the gateway configuration or environment variables without rotation policies creates security risks.
Fix: Integrate with a secrets manager (e.g., HashiCorp Vault, AWS Secrets Manager). The gateway should fetch credentials dynamically and support automatic rotation. Never log or expose keys in telemetry.
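Dynamic credential fetching with rotation support can be sketched as a small TTL cache. SecretFetcher is a hypothetical stand-in for whatever client call your secrets manager provides; no real Vault or AWS API is used here, and RotatingCredentialCache is an illustrative name.

```typescript
// Hypothetical: wraps the secrets-manager client call for a named secret.
type SecretFetcher = (name: string) => Promise<string>;

class RotatingCredentialCache {
  private readonly fetcher: SecretFetcher;
  private readonly ttlMs: number;
  private readonly entries = new Map<string, { value: string; fetchedAtMs: number }>();

  constructor(fetcher: SecretFetcher, ttlMs: number) {
    this.fetcher = fetcher;
    this.ttlMs = ttlMs;
  }

  // Returns the cached credential while it is fresh; refetches once the TTL
  // expires, so a rotated key is picked up without restarting the gateway.
  async get(name: string, nowMs: number = Date.now()): Promise<string> {
    const cached = this.entries.get(name);
    if (cached && nowMs - cached.fetchedAtMs < this.ttlMs) {
      return cached.value;
    }
    const value = await this.fetcher(name);
    this.entries.set(name, { value, fetchedAtMs: nowMs });
    return value;
  }

  // Drop a credential immediately, e.g. after the provider rejects it as invalid.
  invalidate(name: string): void {
    this.entries.delete(name);
  }
}
```

Keeping the TTL shorter than the rotation interval bounds how long a stale key can remain in use.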
5. Synchronous Fallback Blocking
Explanation: Implementing fallbacks as sequential synchronous calls can increase tail latency significantly. If the primary provider times out, the user waits for the timeout plus the fallback latency.
Fix: Use speculative execution for critical paths. Send the request to the primary and a fast fallback simultaneously, returning the first successful response. Cancel the pending request once a response is received.
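The speculative pattern can be sketched as a "first success wins" race that cancels the losers. Attempt and raceFirstSuccess are hypothetical names; the sketch assumes each provider call accepts an AbortSignal and stops work when it fires, which depends on the underlying client.

```typescript
// Each attempt receives a signal it should honor to stop work when it loses the race.
type Attempt<T> = (signal: AbortSignal) => Promise<T>;

function raceFirstSuccess<T>(attempts: Attempt<T>[]): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    if (attempts.length === 0) {
      reject(new Error("No attempts supplied"));
      return;
    }
    const controllers = attempts.map(() => new AbortController());
    let failures = 0;
    let settled = false;

    attempts.forEach((attempt, index) => {
      // Promise.resolve().then(...) also captures synchronous throws from attempt().
      Promise.resolve()
        .then(() => attempt(controllers[index].signal))
        .then(
          (value) => {
            if (settled) return;
            settled = true;
            // Cancel every attempt except the winner.
            controllers.forEach((controller, other) => {
              if (other !== index) controller.abort();
            });
            resolve(value);
          },
          () => {
            // A single failure is not fatal; reject only when every attempt has failed.
            failures += 1;
            if (!settled && failures === attempts.length) {
              settled = true;
              reject(new Error("All speculative attempts failed"));
            }
          },
        );
    });
  });
}
```

Unlike Promise.race, a fast failure from one provider does not end the race here; the slower provider still gets the chance to answer. The trade-off is extra token spend for whatever work the cancelled call had already done.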
6. Ignoring Hallucination Risks
Explanation: Routing all queries to the cheapest model without considering task complexity can lead to high hallucination rates in sensitive domains.
Fix: Implement capability-based routing. Route high-stakes queries (e.g., medical, legal, financial) to models with strong, verified results on reasoning benchmarks, regardless of cost. Use guardrails to detect and flag potential hallucinations.
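One way to express this rule is to restrict the candidate set before applying the cost heuristic. In the sketch below, the tier field, the domain list, and pickModel are all hypothetical; the domain value is assumed to arrive through request metadata, and how a model earns the high_assurance tier is left to your own evaluation process.

```typescript
// Hypothetical model descriptor; `tier` reflects your own evaluation results.
interface RoutableModel {
  id: string;
  costPer1M: number;
  tier: "standard" | "high_assurance";
}

// Domains treated as high-stakes (taken from the examples in the text).
const HIGH_STAKES_DOMAINS = new Set(["medical", "legal", "financial"]);

// Picks the cheapest eligible model. For high-stakes domains only
// high_assurance models are eligible, regardless of cost.
function pickModel(models: RoutableModel[], domain: string | undefined): RoutableModel {
  const highStakes = domain !== undefined && HIGH_STAKES_DOMAINS.has(domain);
  const eligible = highStakes ? models.filter((m) => m.tier === "high_assurance") : models;
  if (eligible.length === 0) {
    throw new Error(`No eligible model for domain: ${domain ?? "unspecified"}`);
  }
  return eligible.reduce((best, m) => (m.costPer1M < best.costPer1M ? m : best));
}
```

Failing closed when no high-assurance model is available is a deliberate choice here: silently downgrading a medical or legal query to the cheapest model is the failure this pitfall describes.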
7. Lack of Cost Attribution
Explanation: Without per-request cost tracking, organizations cannot attribute AI spend to specific features, teams, or customers.
Fix: The gateway must calculate estimated cost based on token usage and current pricing. Emit this metric to your observability stack with tags for service, endpoint, and user segment.
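The calculation itself is simple arithmetic over token counts and per-million-token prices. The sketch below assumes pricing expressed per 1M tokens, as in a typical provider price list; CostLedger is a hypothetical in-memory stand-in for the metric emission step, not an observability client.

```typescript
interface Pricing { inputPer1M: number; outputPer1M: number }
interface TokenUsage { inputTokens: number; outputTokens: number }
interface CostTags { service: string; endpoint: string; userSegment: string }

// Cost in the currency of the price list: tokens * price per 1M tokens / 1,000,000.
function estimateCost(usage: TokenUsage, pricing: Pricing): number {
  return (usage.inputTokens * pricing.inputPer1M + usage.outputTokens * pricing.outputPer1M) / 1000000;
}

// Hypothetical in-memory aggregate; a real gateway would emit a tagged metric instead.
class CostLedger {
  private readonly totals = new Map<string, number>();

  record(tags: CostTags, cost: number): void {
    const key = `${tags.service}|${tags.endpoint}|${tags.userSegment}`;
    this.totals.set(key, (this.totals.get(key) ?? 0) + cost);
  }

  // Sum across all endpoints and segments for one service.
  totalForService(service: string): number {
    let sum = 0;
    this.totals.forEach((value, key) => {
      if (key.startsWith(`${service}|`)) sum += value;
    });
    return sum;
  }
}
```

For example, 1,000 input tokens and 500 output tokens at 3.00 and 15.00 per 1M tokens come to (1000 × 3.00 + 500 × 15.00) / 1,000,000 = 0.0105.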
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| MVP / Side Project | Direct SDK Integration | Low overhead, faster development cycle. | Minimal infrastructure cost. |
| Enterprise Multi-Team | Centralized Gateway | Shared governance, consistent security, unified observability. | Higher infra cost, offset by optimized routing. |
| Cost-Sensitive Workloads | Gateway with Smart Routing | Automatically routes low-priority tasks to cheaper models. | Lower inference spend; savings depend on the share of traffic that cheaper models can handle. |
| High Compliance Requirements | Gateway with PII Scrubbing | Centralized audit logging and data masking. | Lower compliance overhead; reduces the risk of regulatory fines. |
| Latency-Critical Apps | Gateway with Speculative Execution | Reduces tail latency by parallelizing fallbacks. | Slight increase in token usage for speculative calls. |
Configuration Template
gateway:
  version: "1.0"
  providers:
    - id: openai-gpt4o
      type: openai
      model: gpt-4o
      capabilities: [code, reasoning, vision]
      pricing:
        input_per_1m: 5.00
        output_per_1m: 15.00
      fallback_priority: 1
    - id: anthropic-sonnet
      type: anthropic
      model: claude-3-5-sonnet-20240620
      capabilities: [reasoning, long_context, code]
      pricing:
        input_per_1m: 3.00
        output_per_1m: 15.00
      fallback_priority: 2
    - id: gemini-flash
      type: google
      model: gemini-1.5-flash
      capabilities: [summarization, translation, low_cost]
      pricing:
        input_per_1m: 0.35
        output_per_1m: 1.05
      fallback_priority: 3
  routing:
    strategy: capability_based
    cost_optimization: true
    max_cost_per_request: 0.50
  security:
    pii_masking: true
    prompt_injection_detection: true
    rate_limit:
      requests_per_minute: 1000
      tokens_per_minute: 500000
  observability:
    metrics_endpoint: /metrics
    tracing:
      enabled: true
      provider: opentelemetry
Quick Start Guide
- Initialize the Router: Instantiate InferenceRouter and register providers with their capabilities and fallback priorities.
- Configure Secrets: Set up your secrets manager integration to provide API keys dynamically. Ensure keys are scoped with minimal permissions.
- Deploy the Gateway: Containerize the gateway service and deploy it behind your API gateway or ingress controller. Expose the inference endpoint.
- Update Application Code: Replace direct SDK calls with HTTP requests to the gateway endpoint. Pass request metadata to enable smart routing.
- Verify Telemetry: Check your observability dashboard to confirm metrics are flowing. Validate that fallback chains trigger correctly during simulated outages.