rmalization, and resilient rate-limit handling. The following implementation demonstrates a TypeScript-based adapter pattern that abstracts vendor differences while enforcing operational guardrails.
Step 1: Context Budget Management
Never trust advertised context limits. Implement a budgeting layer that counts tokens, reserves space for system instructions and expected output, and truncates or prioritizes chunks before they reach the API.
import { encoding_for_model } from "tiktoken";
export class ContextBudgeter {
private readonly maxSafeTokens: number;
private readonly outputReserve: number;
constructor(maxSafeTokens = 80_000, outputReserve = 2_000) {
this.maxSafeTokens = maxSafeTokens;
this.outputReserve = outputReserve;
}
public allocate(
systemPrompt: string,
userChunks: string[],
model: string = "gpt-4.1"
): string[] {
const encoder = encoding_for_model(model);
const systemTokens = encoder.encode(systemPrompt).length;
const availableBudget = this.maxSafeTokens - systemTokens - this.outputReserve;
let consumed = 0;
const allocated: string[] = [];
for (const chunk of userChunks) {
const chunkSize = encoder.encode(chunk).length;
if (consumed + chunkSize > availableBudget) break;
allocated.push(chunk);
consumed += chunkSize;
}
return allocated;
}
}
Rationale: Hard limits prevent API errors, but quality degradation happens earlier. By reserving output space and enforcing a conservative threshold, you maintain instruction adherence without hitting provider cutoffs.
Vendor APIs serialize tool calls differently. OpenAI returns arguments as JSON strings; Anthropic returns parsed objects. An adapter layer must normalize this before downstream processing.
export interface ToolCall {
id: string;
name: string;
arguments: Record<string, unknown>;
}
export interface ModelResponse {
content: string;
toolCalls: ToolCall[];
}
export abstract class LLMAdapter {
abstract generate(prompt: string, tools?: unknown[]): Promise<ModelResponse>;
}
export class OpenAIAdapter extends LLMAdapter {
async generate(prompt: string, tools?: unknown[]): Promise<ModelResponse> {
// Simulated OpenAI client call
const raw = await this.callOpenAI(prompt, tools);
const normalizedTools: ToolCall[] = raw.choices[0].message.tool_calls?.map(
(tc: any) => ({
id: tc.id,
name: tc.function.name,
arguments: JSON.parse(tc.function.arguments),
})
) ?? [];
return { content: raw.choices[0].message.content, toolCalls: normalizedTools };
}
}
export class AnthropicAdapter extends LLMAdapter {
async generate(prompt: string, tools?: unknown[]): Promise<ModelResponse> {
// Simulated Anthropic client call
const raw = await this.callAnthropic(prompt, tools);
const normalizedTools: ToolCall[] = raw.content
.filter((block: any) => block.type === "tool_use")
.map((block: any) => ({
id: block.id,
name: block.name,
arguments: block.input, // Already parsed
}));
return { content: raw.content[0].text, toolCalls: normalizedTools };
}
}
Rationale: The adapter pattern isolates vendor-specific serialization logic. Downstream services interact with a uniform ToolCall interface, eliminating json.loads() branching and preventing silent parsing failures when switching providers.
Step 3: Resilient Rate-Limit Handling
Both providers enforce simultaneous RPM and TPM limits. Anthropic adds daily token budgets on lower tiers. A unified retry mechanism must parse response headers and implement jittered exponential backoff.
export class RateLimitHandler {
private readonly maxAttempts: number;
private readonly baseDelayMs: number;
constructor(maxAttempts = 5, baseDelayMs = 1000) {
this.maxAttempts = maxAttempts;
this.baseDelayMs = baseDelayMs;
}
public async execute<T>(fn: () => Promise<T>): Promise<T> {
for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
try {
return await fn();
} catch (error: any) {
const isRateLimit =
error.status === 429 ||
error.message?.toLowerCase().includes("rate_limit") ||
error.message?.toLowerCase().includes("daily_token");
if (!isRateLimit || attempt === this.maxAttempts) throw error;
const jitter = Math.random() * 500;
const delay = this.baseDelayMs * Math.pow(2, attempt - 1) + jitter;
console.warn(`Rate limited. Backing off ${delay}ms (attempt ${attempt})`);
await new Promise((res) => setTimeout(res, delay));
}
}
throw new Error("Max retry attempts exceeded");
}
}
Rationale: Linear retries amplify congestion. Exponential backoff with random jitter distributes retry pressure across time, preventing thundering herd scenarios. Explicit handling of daily token errors ensures batch pipelines fail gracefully rather than burning through allowances.
Pitfall Guide
1. The "Advertised Context" Trap
Explanation: Providers list 128Kβ200K token windows, but model attention mechanisms degrade before reaching those ceilings. GPT-4.1 typically shows instruction-following drift around 80K tokens.
Fix: Implement a conservative MAX_SAFE_CONTEXT constant (75Kβ85K) and enforce chunk budgeting. Monitor output quality metrics, not just token counts.
2. Caching Blindness
Explanation: Teams calculate costs using headline rates, ignoring that repeated system prompts or RAG prefixes trigger cache hits. Claude's 90% cache discount and OpenAI's 75% discount dramatically alter TCO.
Fix: Hash prompt prefixes to track cache hit rates. Model effective cost using (uncached_input * (1 - cache_rate)) + cached_input * cache_rate before vendor selection.
Explanation: Assuming both providers return identical tool call structures leads to runtime JSON.parse errors or undefined property access when switching adapters.
Fix: Never pass raw provider responses to business logic. Always route through a normalization layer that outputs a consistent ToolCall interface.
4. Daily Token Budget Surprise
Explanation: Anthropic's lower tiers enforce daily token caps. High-throughput batch jobs or unthrottled dev environments can exhaust allowances by mid-morning, causing silent failures.
Fix: Implement application-level token accounting. Log cumulative daily usage and trigger circuit breakers or queue draining when approaching 80% of the daily limit.
5. Silent Degradation Ignoration
Explanation: APIs return valid JSON even when output quality has collapsed. Teams assume successful HTTP 200 responses mean correct outputs.
Fix: Deploy lightweight validation layers: schema enforcement, confidence scoring, or secondary model verification for critical paths. Track hallucination rates alongside latency.
6. Unsanitized LLM Output Rendering
Explanation: LLMs faithfully reproduce malicious payloads embedded in user inputs. Rendering raw output in UI components exposes applications to XSS and template injection.
Fix: Treat all LLM responses as untrusted user-generated content. Apply context-aware sanitization (HTML escaping, markdown stripping, or strict DOMPurify pipelines) before rendering.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| High-reuse system prompts / RAG pipelines | Anthropic Claude Sonnet 4-5 | 90% cache discount drastically lowers effective input cost | -60% to -80% on repeated prefixes |
| Output-heavy generation / summarization | OpenAI GPT-4.1 | Lower output pricing ($8 vs $15 per 1M tokens) | -40% to -50% on generation volume |
| Tool-heavy agentic workflows | OpenAI GPT-4.1 | Broader third-party SDK support and mature function calling ecosystem | Neutral (saves dev time, not tokens) |
| Long-document retrieval (>90K tokens) | Anthropic Claude Sonnet 4-5 | More consistent attention distribution across extended contexts | Neutral (improves accuracy, reduces retry costs) |
| Batch processing / offline jobs | OpenAI GPT-4.1 | No daily token caps on standard tiers; predictable RPM/TPM scaling | Lower risk of mid-run budget exhaustion |
Configuration Template
export const LLM_ROUTING_CONFIG = {
providers: {
openai: {
model: "gpt-4.1",
apiKeyEnv: "OPENAI_API_KEY",
rateLimits: { rpm: 500, tpm: 1_000_000 },
pricing: { input: 2.0, output: 8.0, cached: 0.5 },
},
anthropic: {
model: "claude-sonnet-4-5",
apiKeyEnv: "ANTHROPIC_API_KEY",
rateLimits: { rpm: 400, tpm: 800_000, dailyTokens: 50_000_000 },
pricing: { input: 3.0, output: 15.0, cached: 0.3 },
},
},
context: {
maxSafeTokens: 80_000,
outputReserve: 2_000,
cachePrefixLength: 1024, // Tokens to hash for cache tracking
},
resilience: {
maxRetries: 5,
baseBackoffMs: 1000,
circuitBreakerThreshold: 0.8, // 80% of daily limit
fallbackProvider: "openai",
},
};
Quick Start Guide
- Install dependencies:
npm install openai @anthropic-ai/sdk tiktoken
- Initialize the budgeter and adapters: Instantiate
ContextBudgeter with your safe threshold, then create OpenAIAdapter and AnthropicAdapter instances using your environment keys.
- Wire the rate limit handler: Wrap all API calls in
RateLimitHandler.execute() to automatically manage backoff and daily cap enforcement.
- Deploy telemetry: Log cache hit rates, token consumption per provider, and 429 frequency. Use these metrics to adjust routing weights monthly.
- Validate outputs: Add a lightweight schema validator or confidence scorer to catch silent degradation before it reaches end users.