en adaptability.
Core Solution
Building a production-ready AI integration layer requires standardizing on the OpenAI-compatible protocol, implementing config-driven routing, and layering tiered fallback logic. The following implementation demonstrates how to achieve this in TypeScript while maintaining strict separation between application logic and inference routing.
Step 1: Define the Gateway Contract
The foundation is a strict interface that mirrors the OpenAI chat completion payload. This ensures compatibility with any provider that adheres to the standard.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface GatewayRequest {
  model: string;
  messages: ChatMessage[];
  temperature?: number;
  max_tokens?: number;
  stream?: boolean;
}

interface GatewayResponse {
  id: string;
  model: string;
  choices: Array<{
    index: number;
    message: ChatMessage;
    finish_reason: string;
  }>;
  usage: {
    prompt_tokens: number;
    completion_tokens: number;
    total_tokens: number;
  };
}
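Because these interfaces exist only at compile time, a response that deviates from the contract is not caught unless it is checked at runtime. The following is a minimal sketch of such a check. The helper names (`isRecord`, `isChatMessage`, `isGatewayResponse`) are hypothetical and not part of the original implementation, and the check assumes `content` is always a string, as the contract above states.

```typescript
// Hypothetical runtime guard for the GatewayResponse contract (illustrative only).
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface GatewayResponse {
  id: string;
  model: string;
  choices: Array<{ index: number; message: ChatMessage; finish_reason: string }>;
  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

function isChatMessage(value: unknown): value is ChatMessage {
  return (
    isRecord(value) &&
    (value.role === 'system' || value.role === 'user' || value.role === 'assistant') &&
    typeof value.content === 'string'
  );
}

function isGatewayResponse(value: unknown): value is GatewayResponse {
  if (!isRecord(value)) return false;
  if (typeof value.id !== 'string' || typeof value.model !== 'string') return false;
  if (!Array.isArray(value.choices)) return false;

  // Every choice must carry an index, a well-formed message, and a finish reason.
  const choicesOk = value.choices.every(
    (choice: unknown) =>
      isRecord(choice) &&
      typeof choice.index === 'number' &&
      isChatMessage(choice.message) &&
      typeof choice.finish_reason === 'string'
  );
  if (!choicesOk) return false;

  const usage = value.usage;
  return (
    isRecord(usage) &&
    typeof usage.prompt_tokens === 'number' &&
    typeof usage.completion_tokens === 'number' &&
    typeof usage.total_tokens === 'number'
  );
}
```

A caller could run this guard on the parsed JSON body and treat a `false` result like any other failed attempt, instead of casting the body directly.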
Step 2: Implement Configuration-Driven Routing
Instead of hardcoding provider logic, route decisions are driven by a tiered configuration object. This allows runtime adjustments without code changes.
type ModelTier = 'primary' | 'fallback' | 'premium';

interface TierConfig {
  model: string;
  maxCostPerMillion: number;
  priority: number;
}

interface RoutingConfig {
  apiKey: string;
  baseUrl: string;
  tiers: Record<ModelTier, TierConfig>;
  fallbackThreshold: number; // fraction (0 to 1) of requests allowed to escalate
}

const defaultRoutingConfig: RoutingConfig = {
  apiKey: process.env.AI_GATEWAY_KEY || '',
  baseUrl: 'https://api.gateway-provider.com/v1',
  tiers: {
    primary: { model: 'flash-v4', maxCostPerMillion: 0.25, priority: 1 },
    fallback: { model: 'qwen-32b-instruct', maxCostPerMillion: 0.28, priority: 2 },
    premium: { model: 'reasoning-r1', maxCostPerMillion: 2.50, priority: 3 }
  },
  fallbackThreshold: 0.15
};
Step 3: Build the Inference Router
The router handles request execution, error detection, and tier escalation. It uses standard fetch to maintain framework neutrality and avoid SDK lock-in.
class LLMRouter {
  private config: RoutingConfig;
  // One entry per tier attempt, so a single request can add several entries.
  private requestLog: Array<{ tier: ModelTier; success: boolean; latency: number }> = [];

  constructor(config: RoutingConfig) {
    this.config = config;
  }

  async complete(payload: GatewayRequest): Promise<GatewayResponse> {
    const orderedTiers: ModelTier[] = ['primary', 'fallback', 'premium'];
    let lastError: unknown;

    for (const tier of orderedTiers) {
      const model = this.config.tiers[tier].model;
      const startTime = performance.now();
      try {
        const response = await fetch(`${this.config.baseUrl}/chat/completions`, {
          method: 'POST',
          headers: {
            'Authorization': `Bearer ${this.config.apiKey}`,
            'Content-Type': 'application/json'
          },
          // Override the caller's model with the model configured for this tier.
          body: JSON.stringify({ ...payload, model })
        });
        if (!response.ok) throw new Error(`HTTP ${response.status}`);
        const data = await response.json();
        this.requestLog.push({ tier, success: true, latency: performance.now() - startTime });
        // Unchecked cast: the body is trusted to match the gateway contract.
        return data as GatewayResponse;
      } catch (error) {
        // Record the failed attempt, then fall through to the next tier.
        this.requestLog.push({ tier, success: false, latency: performance.now() - startTime });
        lastError = error;
      }
    }

    // Every tier failed; surface the last underlying error instead of discarding it.
    const reason = lastError instanceof Error ? lastError.message : String(lastError);
    throw new Error(`All routing tiers exhausted. Last error: ${reason}`);
  }

  // Share of successfully served requests that were handled by a non-primary tier.
  // Failed attempts are excluded so that one escalated request is not counted twice.
  getFallbackRate(): number {
    const successes = this.requestLog.filter(r => r.success);
    if (successes.length === 0) return 0;
    const fallbacks = successes.filter(r => r.tier !== 'primary').length;
    return fallbacks / successes.length;
  }
}
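The router above escalates on every failure, including errors that a different model on the same gateway cannot fix, such as an invalid API key or a malformed request. A possible refinement is sketched below. The `Failure` type, the `decide` function, and the choice of which status codes count as transient are assumptions made for illustration, not part of the original router.

```typescript
// Hypothetical escalation policy (illustrative; adjust the status list to your gateway).
type Failure =
  | { kind: 'http'; status: number } // the gateway answered with a non-2xx status
  | { kind: 'network' };             // fetch rejected: DNS failure, reset, timeout

type Decision = 'escalate' | 'fail';

// Assumption: timeouts (408) and rate limits (429) are transient.
const TRANSIENT_STATUSES = new Set<number>([408, 429]);

function isTransientStatus(status: number): boolean {
  // Server-side errors (5xx) are also treated as worth a retry on another tier.
  return TRANSIENT_STATUSES.has(status) || (status >= 500 && status <= 599);
}

function decide(failure: Failure): Decision {
  if (failure.kind === 'network') {
    // No status is available, so assume the problem is transient.
    return 'escalate';
  }
  // Client errors such as 400, 401, or 403 would fail identically on every tier,
  // so escalating would only add latency and cost.
  return isTransientStatus(failure.status) ? 'escalate' : 'fail';
}
```

Under this policy a 401 ends the request immediately, while a 503 from the primary tier still moves on to the fallback tier.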
Architecture Decisions & Rationale
- Standardized Protocol: Using the OpenAI-compatible contract eliminates most vendor-specific parsing logic. Providers that implement this format return the same core response shape, so swapping models is usually a configuration change, although optional fields and error bodies can still differ between providers.
- Configuration Over Code Branches: Routing tiers are defined in a single config object. Changing from a cost-optimized startup setup to an enterprise SLA-backed deployment requires only an API key rotation and tier parameter adjustment.
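- Explicit Fallback Chain: The router attempts primary, then fallback, then premium. Every attempt is logged, so degradation is visible rather than silent, and the most expensive tier is reached only after the cheaper ones fail. The fallbackThreshold setting is the target to compare getFallbackRate() against when monitoring routing health; the router as written does not enforce it.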
- Framework Neutrality: Relying on native fetch instead of provider SDKs removes dependency bloat and ensures compatibility across Node.js, Deno, Bun, and edge runtimes.
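Two of the decisions above can be made more config-driven than the router shown earlier, which hardcodes the tier order and never reads `fallbackThreshold`. The sketch below derives the order from each tier's `priority` and compares an observed fallback rate against the threshold. The function names `orderTiers` and `isFallbackRateHealthy` are hypothetical.

```typescript
// Illustrative helpers; the names are assumptions, not part of the original router.
type ModelTier = 'primary' | 'fallback' | 'premium';

interface TierConfig {
  model: string;
  maxCostPerMillion: number;
  priority: number;
}

// Lower priority numbers are attempted first.
function orderTiers(tiers: Record<ModelTier, TierConfig>): ModelTier[] {
  return (Object.keys(tiers) as ModelTier[]).sort(
    (a, b) => tiers[a].priority - tiers[b].priority
  );
}

// Both arguments are fractions between 0 and 1.
function isFallbackRateHealthy(observedRate: number, threshold: number): boolean {
  return observedRate <= threshold;
}
```

With this in place, reordering the chain means editing `priority` values in the config, and a monitoring job could alert when `isFallbackRateHealthy(router.getFallbackRate(), config.fallbackThreshold)` returns `false`.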
Pitfall Guide
1. Hardcoding Provider-Specific SDKs
Explanation: Importing vendor SDKs ties your codebase to a single provider's update cycle, authentication flow, and error handling patterns.
Fix: Abstract behind a unified interface. Use standard HTTP clients or a lightweight wrapper that normalizes payloads before transmission.
2. Ignoring Tokenization Variance
Explanation: Different models tokenize text differently. A 500-token prompt in one model may consume 650 tokens in another, breaking cost estimates and context window limits.
Fix: Implement token counting at the application layer using model-specific estimators or provider-provided tokenization endpoints. Log actual usage, not estimated usage.
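One way to act on this is to record how far the pre-request estimate was from the `prompt_tokens` value the provider actually reports, per model. The sketch below is illustrative: the `UsageSample` shape, the function names, and the 10% default tolerance are assumptions.

```typescript
// Hypothetical drift tracking between estimated and reported prompt tokens.
interface UsageSample {
  model: string;
  estimatedPromptTokens: number; // from whatever estimator the application uses
  actualPromptTokens: number;    // from usage.prompt_tokens in the response
}

// Relative error of the estimate: 0.3 means the model used 30% more tokens than estimated.
function tokenDrift(sample: UsageSample): number {
  if (sample.estimatedPromptTokens <= 0) {
    throw new RangeError('estimatedPromptTokens must be positive');
  }
  return (sample.actualPromptTokens - sample.estimatedPromptTokens) / sample.estimatedPromptTokens;
}

// Flags samples whose estimate was off by more than the tolerance in either direction.
function exceedsDrift(sample: UsageSample, tolerance: number = 0.1): boolean {
  return Math.abs(tokenDrift(sample)) > tolerance;
}
```

For the example in the explanation, an estimate of 500 tokens against 650 reported tokens gives a drift of 0.3, which would be flagged under the default tolerance.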
3. Over-Engineering Fallback Chains
Explanation: Building complex retry logic with exponential backoff, circuit breakers, and custom health checks for every provider creates maintenance debt.
Fix: Rely on the gateway's built-in routing and health monitoring. Implement application-level retries only for transient network failures, not model degradation.
4. Ignoring Rate-Limit Headers
Explanation: Providers return X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After headers. Ignoring these leads to 429 errors and wasted requests.
Fix: Parse rate-limit headers on every response. Implement a lightweight token bucket or queue that respects reset timestamps before dispatching new requests.
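A minimal sketch of the header parsing follows. It assumes `X-RateLimit-Reset` carries a Unix timestamp in seconds, which is common but not universal, so check your provider's documentation before relying on it. The `HeaderReader` interface and `retryDelayMs` function are hypothetical names; the standard `Headers` object returned by `fetch` satisfies `HeaderReader`.

```typescript
// Hypothetical helper: how long to wait before dispatching the next request.
interface HeaderReader {
  get(name: string): string | null;
}

// Returns a delay in milliseconds, or null when the headers impose no wait.
function retryDelayMs(headers: HeaderReader, nowMs: number): number | null {
  // Retry-After is either a number of seconds or an HTTP date.
  const retryAfter = headers.get('Retry-After');
  if (retryAfter !== null && retryAfter.trim() !== '') {
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
    const dateMs = Date.parse(retryAfter);
    if (!Number.isNaN(dateMs)) return Math.max(0, dateMs - nowMs);
  }

  // Assumption: X-RateLimit-Reset is a Unix timestamp in seconds.
  const remaining = headers.get('X-RateLimit-Remaining');
  const reset = headers.get('X-RateLimit-Reset');
  if (remaining !== null && reset !== null && Number(remaining) <= 0) {
    const resetMs = Number(reset) * 1000;
    if (Number.isFinite(resetMs)) return Math.max(0, resetMs - nowMs);
  }

  return null;
}
```

A dispatcher could call `retryDelayMs(response.headers, Date.now())` after each response and hold queued requests until the returned delay has elapsed.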
5. Mixing Auth Tiers in Single Deployment
Explanation: Using a standard API key in production while expecting enterprise SLAs results in shared capacity, unpredictable latency, and no priority support.
Fix: Separate environments by key scope. Use dedicated enterprise keys for production, standard keys for staging, and scoped keys for internal tooling. Rotate keys programmatically.
6. Optimizing for Raw Token Cost Instead of Task Completion
Explanation: Cheap models may require multiple retries, longer prompts, or post-processing to achieve the same output quality as premium models.
Fix: Track cost-per-successful-task, not cost-per-token. Measure first-pass accuracy, retry rates, and downstream processing overhead. Adjust routing based on total workflow cost.
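The metric can be computed from a log of attempts, as in the sketch below. It assumes a single blended price per million tokens, mirroring `maxCostPerMillion` in the routing config; real pricing often differs for input and output tokens. The `TaskAttempt` shape and the function name are hypothetical.

```typescript
// Hypothetical cost-per-successful-task calculation (illustrative only).
interface TaskAttempt {
  taskId: string;
  promptTokens: number;
  completionTokens: number;
  pricePerMillionUsd: number; // assumption: one blended rate for input and output
  succeeded: boolean;
}

function costPerSuccessfulTask(attempts: TaskAttempt[]): number {
  let totalCostUsd = 0;
  const succeededTasks = new Set<string>();

  for (const attempt of attempts) {
    // Failed attempts still cost money, so every attempt is included in the total.
    const tokens = attempt.promptTokens + attempt.completionTokens;
    totalCostUsd += (tokens / 1e6) * attempt.pricePerMillionUsd;
    if (attempt.succeeded) succeededTasks.add(attempt.taskId);
  }

  // With no successful task the cost per success is unbounded.
  return succeededTasks.size === 0 ? Infinity : totalCostUsd / succeededTasks.size;
}
```

Comparing this number across tiers shows whether a cheap model that needs several retries is still cheaper per completed task than a single premium call.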
7. Skipping Structured Observability
Explanation: LLM calls lack traditional metrics. Without structured logging, you cannot diagnose latency spikes, model degradation, or cost anomalies.
Fix: Emit structured events for every request: model, tier, token count, latency, finish reason, and fallback status. Integrate with OpenTelemetry or your existing observability stack.
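As a rough illustration, the fields listed in the fix can be gathered into one JSON line per attempt. The event shape, the `llm.request` event name, and both function names are assumptions; the sink is passed in so the same event can go to stdout, a log shipper, or an exporter for your observability stack.

```typescript
// Hypothetical structured event for one inference attempt (illustrative only).
type ModelTier = 'primary' | 'fallback' | 'premium';

interface AttemptResult {
  model: string;
  tier: ModelTier;
  latencyMs: number;
  success: boolean;
  usage?: { prompt_tokens: number; completion_tokens: number };
  finishReason?: string;
}

interface InferenceEvent {
  event: 'llm.request';
  timestamp: string;
  model: string;
  tier: ModelTier;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  finishReason: string | null;
  usedFallback: boolean;
  success: boolean;
}

function buildInferenceEvent(result: AttemptResult, now: Date): InferenceEvent {
  return {
    event: 'llm.request',
    timestamp: now.toISOString(),
    model: result.model,
    tier: result.tier,
    // Failed attempts have no usage block, so token counts default to zero.
    promptTokens: result.usage?.prompt_tokens ?? 0,
    completionTokens: result.usage?.completion_tokens ?? 0,
    latencyMs: Math.round(result.latencyMs),
    finishReason: result.finishReason ?? null,
    usedFallback: result.tier !== 'primary',
    success: result.success,
  };
}

// Writes the event as a single JSON line to whatever sink the caller supplies.
function emitEvent(event: InferenceEvent, sink: (line: string) => void): void {
  sink(JSON.stringify(event));
}
```

Passing `console.log` as the sink is enough for local debugging, and the flat field names map directly onto span or log attributes if the events are later forwarded elsewhere.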
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| MVP / Early Startup | Unified Gateway Standard Tier | Low overhead, high model variety, pay-as-you-go pricing | $0.01–$0.25/M tokens; ~97.5% savings vs premium models |
| Scaling SaaS (10k–100k users) | Config-Driven Routing with Fallback | Balances cost and reliability; prevents single-provider dependency | Predictable monthly spend; fallback caps cost spikes |
| Enterprise / Regulated Workload | Unified Gateway Pro Channel | 99.9% SLA, dedicated capacity, priority support, custom rate limits | Higher base cost; eliminates downtime risk and compliance gaps |
| Multi-Model Experimentation | Gateway Standard + Token Budget Limits | Rapid model switching without code changes; controlled spend | Low marginal cost; prevents runaway experimentation expenses |
Configuration Template
Copy this environment-driven configuration to initialize a production-ready routing layer. Adjust tiers based on your workload profile.
# Gateway Connection
AI_GATEWAY_KEY=sk-prod-xxxxxxxxxxxxxxxx
AI_GATEWAY_BASE_URL=https://api.gateway-provider.com/v1
# Routing Tiers
TIER_PRIMARY_MODEL=flash-v4
TIER_PRIMARY_MAX_COST=0.25
TIER_FALLBACK_MODEL=qwen-32b-instruct
TIER_FALLBACK_MAX_COST=0.28
TIER_PREMIUM_MODEL=reasoning-r1
TIER_PREMIUM_MAX_COST=2.50
# Operational Limits
FALLBACK_THRESHOLD=0.15
MAX_REQUEST_TIMEOUT_MS=3000
ENABLE_STREAMING=false
LOG_LEVEL=info
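The template variables can be turned into a `RoutingConfig` with a small loader such as the sketch below. The function names are hypothetical, the tier priorities are fixed at 1, 2, and 3 because the template does not define them, and the environment is passed in as a plain object so the loader can be exercised without touching the real process environment.

```typescript
// Hypothetical loader mapping the template variables onto RoutingConfig.
type ModelTier = 'primary' | 'fallback' | 'premium';

interface TierConfig {
  model: string;
  maxCostPerMillion: number;
  priority: number;
}

interface RoutingConfig {
  apiKey: string;
  baseUrl: string;
  tiers: Record<ModelTier, TierConfig>;
  fallbackThreshold: number;
}

type Env = Record<string, string | undefined>;

// Fails fast on missing or empty variables instead of defaulting silently.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === '') {
    throw new Error(`Missing required variable: ${name}`);
  }
  return value;
}

function requireNumber(env: Env, name: string): number {
  const parsed = Number(requireVar(env, name));
  if (!Number.isFinite(parsed)) {
    throw new Error(`Variable ${name} must be numeric`);
  }
  return parsed;
}

function loadRoutingConfig(env: Env): RoutingConfig {
  // Assumption: priority follows the primary, fallback, premium order.
  const tier = (prefix: string, priority: number): TierConfig => ({
    model: requireVar(env, `TIER_${prefix}_MODEL`),
    maxCostPerMillion: requireNumber(env, `TIER_${prefix}_MAX_COST`),
    priority,
  });

  return {
    apiKey: requireVar(env, 'AI_GATEWAY_KEY'),
    baseUrl: requireVar(env, 'AI_GATEWAY_BASE_URL'),
    tiers: {
      primary: tier('PRIMARY', 1),
      fallback: tier('FALLBACK', 2),
      premium: tier('PREMIUM', 3),
    },
    fallbackThreshold: requireNumber(env, 'FALLBACK_THRESHOLD'),
  };
}
```

In a Node.js service this would typically be called once at startup as `loadRoutingConfig(process.env)`, so a missing key stops the process before any request is routed.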
Quick Start Guide
- Initialize the router: Import the LLMRouter class and pass your environment configuration. Ensure AI_GATEWAY_KEY and AI_GATEWAY_BASE_URL are set.
- Define your payload: Construct a standard GatewayRequest object with messages, temperature, and max_tokens. No provider-specific fields required.
- Execute the request: Call router.complete(payload). The router automatically attempts primary, fallback, and premium tiers based on your configuration.
- Monitor routing health: Check router.getFallbackRate() and structured logs to verify that primary tier handles >85% of traffic. Adjust tier models or thresholds if fallback rate exceeds 15%.
- Scale configuration: Rotate to an enterprise API key for production. The routing logic remains identical; only capacity, SLA, and support tier change.