Instead of hardcoding provider URLs, use a structured configuration that maps Claude Code's internal tiers to external inference endpoints. This allows runtime swaps without redeployment.
```typescript
interface ProviderRoute {
  endpoint: string;
  apiKeyEnv: string;
  model: string;
  maxTokens: number;
  supportsStreaming: boolean;
}

interface RoutingConfig {
  primary: ProviderRoute;
  secondary: ProviderRoute;
  background: ProviderRoute;
  fallback: ProviderRoute;
}

export const defaultRouting: RoutingConfig = {
  primary: {
    endpoint: 'https://openrouter.ai/api/v1/chat/completions',
    apiKeyEnv: 'OPENROUTER_KEY',
    model: 'openrouter/qwen/qwen3-235b-a22b:free',
    maxTokens: 8192,
    supportsStreaming: true,
  },
  secondary: {
    endpoint: 'https://api.deepseek.com/v1/chat/completions',
    apiKeyEnv: 'DEEPSEEK_KEY',
    model: 'deepseek-chat',
    maxTokens: 4096,
    supportsStreaming: true,
  },
  background: {
    // Ollama ignores the API key; any placeholder value satisfies the gateway's key check.
    endpoint: 'http://localhost:11434/v1/chat/completions',
    apiKeyEnv: 'OLLAMA_KEY',
    model: 'llama3.1:8b',
    maxTokens: 2048,
    supportsStreaming: true,
  },
  fallback: {
    endpoint: 'https://integrate.api.nvidia.com/v1/chat/completions',
    apiKeyEnv: 'NVIDIA_NIM_KEY',
    model: 'nvidia/nemotron-mini-4b-instruct',
    maxTokens: 4096,
    supportsStreaming: true,
  },
};
```
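Because the routes are plain data, they can also be overridden at startup without touching code. A minimal sketch, assuming a hypothetical `ROUTING_OVERRIDES` environment variable that carries partial per-tier JSON:

```typescript
import { defaultRouting } from './config';

type RoutingConfig = typeof defaultRouting;
type RouteOverrides = { [K in keyof RoutingConfig]?: Partial<RoutingConfig[K]> };

// Merge per-tier overrides from an environment variable (name is illustrative),
// so a tier can be repointed at a different backend without redeploying the gateway.
export function loadRouting(): RoutingConfig {
  const raw = process.env.ROUTING_OVERRIDES; // e.g. '{"background":{"model":"qwen2.5-coder:7b"}}'
  if (!raw) return defaultRouting;

  const overrides = JSON.parse(raw) as RouteOverrides;
  return {
    primary: { ...defaultRouting.primary, ...overrides.primary },
    secondary: { ...defaultRouting.secondary, ...overrides.secondary },
    background: { ...defaultRouting.background, ...overrides.background },
    fallback: { ...defaultRouting.fallback, ...overrides.fallback },
  };
}
```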
Step 2: Implement Schema Translation Middleware
Anthropic's /v1/messages format differs from OpenAI-compatible providers. The gateway must normalize tool definitions, message roles, and streaming chunks.
```typescript
import { Request, Response, NextFunction } from 'express';

// Augment Express's Request so the translated payload can be attached by middleware
// and read later by the router without type errors.
declare global {
  namespace Express {
    interface Request {
      translatedPayload?: Record<string, unknown>;
    }
  }
}

export function translateAnthropicToOpenAI(req: Request, res: Response, next: NextFunction) {
  const { model, messages, tools, max_tokens, stream } = req.body;

  req.translatedPayload = {
    // Strip the vendor prefix; upstream providers expect their own model IDs.
    model: model.replace('claude-', ''),
    messages: messages.map((msg: any) => ({
      role: msg.role === 'assistant' ? 'assistant' : 'user',
      content: msg.content?.[0]?.text || msg.content || '',
    })),
    // Anthropic tool definitions use input_schema; OpenAI-compatible providers
    // expect a function object with a parameters field.
    tools: tools?.map((tool: any) => ({
      type: 'function',
      function: {
        name: tool.name,
        description: tool.description,
        parameters: tool.input_schema,
      },
    })),
    max_tokens: max_tokens || 4096,
    stream: stream || false,
  };
  next();
}
```
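The middleware above only translates the request. For streaming clients, the gateway also has to convert each OpenAI-style `chat.completion.chunk` back into an Anthropic-style SSE event. A minimal sketch covering plain text deltas only; tool-call deltas are omitted, and it assumes each chunk arrives as a complete `data:` line (real code would buffer and split on line boundaries):

```typescript
// Convert one OpenAI-style streaming chunk into an Anthropic-style SSE event.
// Returns null for lines that carry nothing the client needs.
export function translateChunkToAnthropic(openAIChunk: string): string | null {
  if (!openAIChunk.startsWith('data:')) return null;
  const payload = openAIChunk.slice(5).trim();

  if (payload === '[DONE]') {
    return 'event: message_stop\ndata: {"type":"message_stop"}\n\n';
  }

  const parsed = JSON.parse(payload);
  const text = parsed.choices?.[0]?.delta?.content;
  if (!text) return null;

  const event = {
    type: 'content_block_delta',
    index: 0,
    delta: { type: 'text_delta', text },
  };
  return `event: content_block_delta\ndata: ${JSON.stringify(event)}\n\n`;
}
```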
Step 3: Build the Tier Router & Forwarder
The router selects the backend based on the task tier, injects the appropriate API key, and forwards the request while preserving Server-Sent Events (SSE) for streaming.
```typescript
import axios from 'axios';
import { Request, Response } from 'express';
import { defaultRouting } from './config';

export async function routeAndForward(req: Request, res: Response) {
  const tier = (req.headers['x-task-tier'] as string) || 'primary';
  const route = defaultRouting[tier as keyof typeof defaultRouting] || defaultRouting.primary;
  const apiKey = process.env[route.apiKeyEnv];

  if (!apiKey) {
    return res.status(500).json({ error: `Missing API key: ${route.apiKeyEnv}` });
  }

  try {
    const response = await axios.post(route.endpoint, req.translatedPayload, {
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
        'Accept': route.supportsStreaming ? 'text/event-stream' : 'application/json',
      },
      responseType: route.supportsStreaming ? 'stream' : 'json',
      timeout: 30000,
    });

    if (route.supportsStreaming) {
      res.setHeader('Content-Type', 'text/event-stream');
      res.setHeader('Cache-Control', 'no-cache');
      res.setHeader('Connection', 'keep-alive');
      response.data.pipe(res);
    } else {
      res.json(response.data);
    }
  } catch (err) {
    console.error(`Routing failed for tier: ${tier}`, err);
    res.status(502).json({ error: 'Upstream provider unreachable' });
  }
}
```
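For reference, a sketch of how the two pieces compose into the gateway's `/v1/messages` endpoint; the module paths are illustrative:

```typescript
import express from 'express';
import { translateAnthropicToOpenAI } from './translate';
import { routeAndForward } from './router';

const app = express();
app.use(express.json({ limit: '10mb' })); // agentic payloads can carry large tool outputs

// Claude Code speaks /v1/messages: translate the schema first, then route by tier.
app.post('/v1/messages', translateAnthropicToOpenAI, routeAndForward);

app.listen(8082, () => console.log('Gateway listening on http://localhost:8082'));
```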
Architecture Decisions & Rationale
- Tier-Based Routing Over Global Fallback: AI coding assistants partition workloads by complexity. Primary agents require high reasoning capacity; background agents handle deterministic transformations. Routing by tier prevents expensive models from processing trivial tasks while preserving accuracy where it matters.
- Schema Translation at the Edge: Converting Anthropic tool schemas to OpenAI-compatible function definitions at the gateway layer keeps the IDE extension untouched. This preserves compatibility with VS Code, JetBrains, and CLI clients without vendor-specific patches.
- Streaming Preservation: AI coding workflows rely on incremental token delivery for responsive UX. The proxy must forward `text/event-stream` chunks without buffering. Axios streaming with a direct pipe to the response keeps latency within acceptable bounds.
- Environment-Driven Key Injection: Hardcoding credentials creates security and rotation bottlenecks. Mapping keys to environment variables enables secret management via vaults, CI/CD pipelines, or local `.env` files without code changes.
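As a complement to environment-driven key injection, the gateway can fail fast at startup when a configured key is absent instead of surfacing a 500 mid-session. A small sketch against the routing config above:

```typescript
import { defaultRouting } from './config';

// Fail fast if any configured provider key is missing from the environment.
// Local backends (Ollama, LM Studio) accept placeholder values.
export function assertProviderKeys(): void {
  const missing = Object.values(defaultRouting)
    .map((route) => route.apiKeyEnv)
    .filter((envName) => !process.env[envName]);

  if (missing.length > 0) {
    throw new Error(`Missing provider API keys: ${[...new Set(missing)].join(', ')}`);
  }
}
```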
Pitfall Guide
1. Tool Schema Mismatch
Explanation: Anthropic and OpenAI-compatible providers use different JSON structures for function definitions. Direct forwarding causes the IDE to reject tool responses or misparse arguments.
Fix: Implement strict schema normalization that maps input_schema to parameters, preserves additionalProperties: false, and validates required fields before forwarding.
2. Silent Rate Limit Throttling
Explanation: Free-tier aggregators often drop connections or return 429 without clear error bodies, causing the IDE to hang or retry indefinitely.
Fix: Add circuit breaker logic with exponential backoff. Cache rate limit headers (X-RateLimit-Remaining) and switch to fallback providers when thresholds drop below 10%.
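A minimal sketch of that breaker, using a fixed cool-down rather than full exponential backoff and illustrative thresholds; the router would consult `isOpen(provider)` before forwarding and fall through to the fallback route while the breaker is open:

```typescript
// Minimal per-provider breaker: trips on a 429 or when the remaining quota drops
// below 10% of the advertised limit, then cools down for a fixed window.
const COOL_DOWN_MS = 60_000; // illustrative
const trippedAt = new Map<string, number>();

export function recordResponse(
  provider: string,
  status: number,
  headers: Record<string, string | undefined>,
): void {
  const remaining = Number(headers['x-ratelimit-remaining']);
  const limit = Number(headers['x-ratelimit-limit']);
  const lowQuota = Number.isFinite(remaining) && Number.isFinite(limit) && remaining / limit < 0.1;

  if (status === 429 || lowQuota) {
    trippedAt.set(provider, Date.now());
  }
}

export function isOpen(provider: string): boolean {
  const since = trippedAt.get(provider);
  return since !== undefined && Date.now() - since < COOL_DOWN_MS;
}
```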
3. Local VRAM Miscalculation
Explanation: Running 70B-parameter models locally requires 40GB+ VRAM. Under-provisioned hardware causes OOM kills or severe swapping, degrading latency to unusable levels.
Fix: Use quantized weights (Q4_K_M or Q5_K_S) and monitor memory pressure. Route background tasks to 8B models and reserve larger local models only for secondary tier operations.
4. Streaming Buffer Latency
Explanation: Some HTTP clients or reverse proxies buffer SSE chunks, breaking the incremental delivery model that AI coding assistants expect.
Fix: Disable Nagle's algorithm, set Transfer-Encoding: chunked, and avoid middleware that intercepts response bodies. Test with curl -N to verify real-time chunk delivery.
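On the Express side, the relevant knobs can be applied on the streaming branch before piping; the `X-Accel-Buffering` header is an nginx-specific hint and only matters when a reverse proxy fronts the gateway:

```typescript
import { Request, Response } from 'express';

// Apply before piping upstream SSE data to the client.
export function prepareSSE(req: Request, res: Response): void {
  req.socket.setNoDelay(true);               // disable Nagle's algorithm
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache, no-transform');
  res.setHeader('X-Accel-Buffering', 'no');  // tell nginx not to buffer this response
  res.flushHeaders();                        // send headers now, before the first chunk
}
```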
5. Context Window Overflow
Explanation: Free and local models typically support 8K–32K context windows, while Claude Code assumes 200K+. Long sessions trigger silent truncation or response failures.
Fix: Implement a sliding window middleware that summarizes older messages, strips resolved tool outputs, and enforces token budgets before forwarding to upstream providers.
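A rough sketch of the budget-enforcement half, using a ~4-characters-per-token heuristic; summarizing the dropped turns would call a cheap background model and is omitted here:

```typescript
type ChatMessage = { role: string; content: string };

// Rough heuristic: ~4 characters per token for English text and code.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep the first (system) message, then keep the newest turns until the
// estimated total exceeds the budget; older turns are dropped.
export function enforceTokenBudget(messages: ChatMessage[], budget: number): ChatMessage[] {
  const [system, ...rest] = messages;
  const kept: ChatMessage[] = [];
  let used = estimateTokens(system?.content ?? '');

  for (const msg of [...rest].reverse()) {   // walk newest → oldest
    const cost = estimateTokens(msg.content);
    if (used + cost > budget) break;
    kept.unshift(msg);
    used += cost;
  }
  return system ? [system, ...kept] : kept;
}
```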
6. Auth Token Hardcoding in Shared Environments
Explanation: Using static tokens in team IDEs or CI pipelines exposes credentials and prevents per-user billing or quota tracking.
Fix: Generate short-lived tokens via a central auth service. Inject them at runtime using ANTHROPIC_AUTH_TOKEN overrides, and rotate keys every 24 hours.
7. Over-Routing Complex Agentic Tasks
Explanation: Sending architecture decisions or multi-file refactors to weak models causes tool-call loops, hallucinated file paths, and broken state.
Fix: Define task-complexity heuristics. Route tasks with >3 tool calls or >500 token prompts to the primary tier. Use a lightweight classifier or rule-based prefix matcher to enforce routing boundaries.
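A rule-based sketch of that classifier, reusing the thresholds above; the keyword list is illustrative. The result would be sent as the `x-task-tier` header the router already reads:

```typescript
type Tier = 'primary' | 'secondary' | 'background';

// Route by simple heuristics: many tool calls, large prompts, or "architectural"
// keywords go to the primary tier; trivial transformations stay on background.
export function classifyTier(prompt: string, toolCallCount: number): Tier {
  const estimatedTokens = Math.ceil(prompt.length / 4);                  // rough heuristic
  const heavyKeywords = /\b(refactor|architecture|migrate|design)\b/i;   // illustrative

  if (toolCallCount > 3 || estimatedTokens > 500 || heavyKeywords.test(prompt)) {
    return 'primary';
  }
  if (toolCallCount > 0) {
    return 'secondary';
  }
  return 'background';
}
```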
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Rapid prototyping & learning | Free cloud aggregator proxy | Zero marginal cost, sufficient for syntax/formatting tasks | $0 API, minimal infra |
| Enterprise agentic workflows | Direct Anthropic API + gateway fallback | High tool-call accuracy required for complex orchestration | High API, predictable scaling |
| Air-gapped or compliance-restricted | Local inference proxy (Ollama/llama.cpp) | Data never leaves network, full control over weights | Hardware depreciation, electricity |
| Budget-constrained teams | Tiered routing (primary=free cloud, background=local) | Balances capability and cost, isolates expensive operations | Near-zero API, moderate infra |
Configuration Template
```yaml
# gateway-config.yaml
server:
  port: 8082
  auth_token: "${GATEWAY_AUTH_TOKEN}"
  discovery_enabled: true

routing:
  primary:
    provider: openrouter
    model: "openrouter/qwen/qwen3-235b-a22b:free"
    max_tokens: 8192
    fallback: nvidia_nim
  secondary:
    provider: deepseek
    model: "deepseek-chat"
    max_tokens: 4096
    fallback: openrouter
  background:
    provider: ollama
    model: "llama3.1:8b"
    max_tokens: 2048
    fallback: local_lmstudio

providers:
  openrouter:
    endpoint: "https://openrouter.ai/api/v1/chat/completions"
    key_env: "OPENROUTER_KEY"
  deepseek:
    endpoint: "https://api.deepseek.com/v1/chat/completions"
    key_env: "DEEPSEEK_KEY"
  nvidia_nim:
    endpoint: "https://integrate.api.nvidia.com/v1/chat/completions"
    key_env: "NVIDIA_NIM_KEY"
  ollama:
    endpoint: "http://localhost:11434/v1/chat/completions"
    key_env: "OLLAMA_KEY"
  local_lmstudio:
    endpoint: "http://localhost:1234/v1/chat/completions"
    key_env: "LMSTUDIO_KEY"

limits:
  rate_limit_window: 60s
  max_concurrent_streams: 50
  context_truncation_strategy: "summarize_and_slide"
```
Quick Start Guide
- Initialize the gateway: Clone the proxy repository, install dependencies via `uv` or npm, and export provider keys (`export OPENROUTER_KEY=...`, `export DEEPSEEK_KEY=...`).
- Start the service: Run the gateway executable. It will bind to `localhost:8082` and expose an admin UI for key validation and tier routing visualization.
- Configure your IDE: Add `ANTHROPIC_BASE_URL=http://localhost:8082`, `ANTHROPIC_AUTH_TOKEN=your_secure_token`, and `CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY=1` to your VS Code settings.json or JetBrains ACP config.
- Verify routing: Open the IDE's model picker. The gateway's `/v1/models` endpoint will expose all configured backends. Run a simple file edit and check the gateway logs to confirm tier assignment and upstream forwarding.
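The same check can be scripted outside the IDE. A minimal probe of the gateway's `/v1/models` endpoint, assuming the gateway expects the auth token as a bearer header:

```typescript
// Quick sanity check that the gateway is up and advertising its backends.
const res = await fetch('http://localhost:8082/v1/models', {
  headers: { Authorization: `Bearer ${process.env.ANTHROPIC_AUTH_TOKEN}` },
});
console.log(res.status, await res.json());
```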