e TypeScript implementation demonstrating how the three layers chain together during a single request lifecycle.
Step 1: Model Routing Layer
The model routing layer abstracts provider APIs, enforces virtual key scoping, and manages fallback chains. It replaces direct SDK calls with a unified endpoint that handles rate limiting, caching, and cost attribution.
interface ModelRoutingConfig {
  primaryProvider: string;
  fallbackChain: string[];
  cacheTTLSeconds: number;
  virtualKeyScope: string;
}

class ModelRouter {
  // Exact-match cache; each entry records when it expires.
  private cache: Map<string, { response: any; expiresAt: number }> = new Map();

  constructor(private config: ModelRoutingConfig) {}

  async dispatch(messages: any[], model: string): Promise<any> {
    const cacheKey = `${model}:${JSON.stringify(messages)}`;
    const cached = this.cache.get(cacheKey);
    // Honor cacheTTLSeconds so expired entries are never served.
    if (cached && cached.expiresAt > Date.now()) return cached.response;

    // Try the primary provider first, then each fallback in order.
    const providers = [this.config.primaryProvider, ...this.config.fallbackChain];
    for (const provider of providers) {
      try {
        const response = await this.callProvider(provider, model, messages);
        this.cache.set(cacheKey, {
          response,
          expiresAt: Date.now() + this.config.cacheTTLSeconds * 1000
        });
        return response;
      } catch (error) {
        // Log the cause so fallback activations can be diagnosed later.
        console.warn(`Provider ${provider} failed, attempting fallback.`, error);
      }
    }
    throw new Error('All model providers exhausted');
  }

  private async callProvider(provider: string, model: string, messages: any[]): Promise<any> {
    // Illustrative endpoint pattern; real provider URLs and request schemas differ.
    const endpoint = `https://api.${provider}.com/v1/chat/completions`;
    const res = await fetch(endpoint, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.VIRTUAL_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ model, messages, stream: false })
    });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return res.json();
  }
}
Architecture Rationale: Virtual keys replace raw provider credentials, enabling per-team quota enforcement without exposing master API keys. The fallback chain is evaluated sequentially, which avoids racing providers against each other and keeps cost attribution predictable. Caching is intentionally simplified to exact-match lookup here; production implementations that need semantic caching should match on vector similarity thresholds and expire entries to avoid stale responses.
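The per-team quota enforcement mentioned in the rationale could be sketched roughly as follows. This is an illustration under stated assumptions, not any gateway's actual API: `VirtualKeyRecord`, `QuotaGate`, and the fixed one-minute window are hypothetical names and choices introduced for this example.

```typescript
// Hypothetical record tying a virtual key to a scope and a token budget.
interface VirtualKeyRecord {
  scope: string;           // e.g. "team:production:feature"
  tokensPerMinute: number; // budget enforced at the gateway
}

// Illustrative quota gate using fixed one-minute windows per virtual key.
class QuotaGate {
  private usage = new Map<string, { windowStart: number; tokens: number }>();

  constructor(private keys: Map<string, VirtualKeyRecord>) {}

  // Returns the scope to attribute cost to. Throws if the key is unknown
  // or the request would exceed the key's budget for the current window.
  admit(virtualKey: string, requestedTokens: number, nowMs: number = Date.now()): string {
    const record = this.keys.get(virtualKey);
    if (!record) throw new Error('Unknown virtual key');

    let entry = this.usage.get(virtualKey);
    if (!entry || nowMs - entry.windowStart >= 60000) {
      // Start a fresh window once the previous one has elapsed.
      entry = { windowStart: nowMs, tokens: 0 };
      this.usage.set(virtualKey, entry);
    }
    if (entry.tokens + requestedTokens > record.tokensPerMinute) {
      throw new Error(`Quota exceeded for scope ${record.scope}`);
    }
    entry.tokens += requestedTokens;
    return record.scope;
  }
}
```

Because the application only ever holds the virtual key, the provider credential can stay inside the gateway, and the returned scope gives each request a cost-attribution label.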
Step 2: Tool Policy Layer
When an LLM requests tool execution, the call must pass through a policy engine before reaching the actual tool server. This layer enforces permissions, validates payloads, and logs every execution attempt.
interface ToolPolicyRule {
  toolName: string;
  allowedRoles: string[];
  maxExecutionsPerMinute: number;
  requiresConfirmation: boolean;
}

class ToolPolicyEngine {
  // Simplified counter: it never resets, so it only marks the enforcement boundary.
  private executionCounts: Map<string, number> = new Map();

  constructor(private rules: ToolPolicyRule[]) {}

  async evaluate(toolCall: { tool: string; args: any; callerRole: string }): Promise<boolean> {
    const rule = this.rules.find(r => r.toolName === toolCall.tool);
    // Deny by default: a tool without an explicit rule is never executed.
    if (!rule) return false;

    if (!rule.allowedRoles.includes(toolCall.callerRole)) {
      throw new Error(`Role ${toolCall.callerRole} denied access to ${toolCall.tool}`);
    }

    const currentCount = this.executionCounts.get(toolCall.tool) || 0;
    if (currentCount >= rule.maxExecutionsPerMinute) {
      throw new Error(`Rate limit exceeded for ${toolCall.tool}`);
    }

    // Calls held for human approval are not counted against the quota.
    if (rule.requiresConfirmation) {
      console.log(`[POLICY] Confirmation required for ${toolCall.tool}. Awaiting approval.`);
      return false;
    }

    this.executionCounts.set(toolCall.tool, currentCount + 1);
    return true;
  }
}
Architecture Rationale: Policy evaluation happens before network egress to tool servers, which prevents runaway agents from exhausting external APIs or triggering destructive operations. In production the execution counter should reset on a sliding window; the simplified map here never resets and only demonstrates the enforcement boundary. Requiring confirmation for high-impact tools creates a human-in-the-loop checkpoint without blocking the entire pipeline.
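One way the sliding-window reset could look is sketched below. `SlidingWindowLimiter` and `tryAcquire` are hypothetical names introduced for this example, and the in-memory timestamp list assumes a single gateway process.

```typescript
// Illustrative sliding-window limiter: keeps timestamps of recent executions per tool.
class SlidingWindowLimiter {
  private timestamps = new Map<string, number[]>();

  constructor(private windowMs: number = 60000) {}

  // Records the execution and returns true if the tool is under its limit.
  // Returns false, recording nothing, if the limit is already reached.
  tryAcquire(tool: string, maxPerWindow: number, nowMs: number = Date.now()): boolean {
    const cutoff = nowMs - this.windowMs;
    // Keep only executions that still fall inside the window.
    const recent = (this.timestamps.get(tool) || []).filter(t => t > cutoff);
    if (recent.length >= maxPerWindow) {
      this.timestamps.set(tool, recent);
      return false;
    }
    recent.push(nowMs);
    this.timestamps.set(tool, recent);
    return true;
  }
}
```

Unlike a counter that is zeroed once a minute, this approach frees capacity gradually as old executions age out, so a burst at the end of one minute cannot be followed immediately by another full burst.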
Step 3: Inter-Agent Routing Layer
Multi-agent systems require a dedicated routing plane to manage identity, conversation state, and handoff logic. This layer replaces ad-hoc HTTP calls between agents with a structured message bus that enforces traceability.
interface AgentMessage {
  traceId: string;
  sourceAgent: string;
  targetAgent: string;
  payload: any;
  priority: 'low' | 'normal' | 'critical';
}

class AgentMessageBus {
  // One queue per target agent.
  private queues: Map<string, AgentMessage[]> = new Map();

  async route(message: AgentMessage): Promise<void> {
    if (!this.queues.has(message.targetAgent)) {
      this.queues.set(message.targetAgent, []);
    }
    this.queues.get(message.targetAgent)!.push(message);
    console.log(`[ROUTER] Message ${message.traceId} queued for ${message.targetAgent}`);
  }

  async consume(targetAgent: string): Promise<AgentMessage | null> {
    const queue = this.queues.get(targetAgent);
    if (!queue || queue.length === 0) return null;

    const priorityMap = { critical: 0, normal: 1, low: 2 };
    // sort() reorders the queue in place; a stable sort (ES2019+) keeps
    // arrival order among messages of equal priority.
    queue.sort((a, b) => priorityMap[a.priority] - priorityMap[b.priority]);

    // Remove and return only the head; the remaining messages stay queued.
    // (Clearing the array after sorting would discard them, because the
    // sorted result is the same array object as the queue.)
    return queue.shift()!;
  }
}
Architecture Rationale: Agent-to-agent communication is inherently stateful and non-deterministic. A message bus with priority queuing prevents critical handoffs from being starved by background tasks. The traceId field ensures cross-layer observability, allowing you to reconstruct the full request path from initial prompt to final tool execution or agent handoff.
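The priority behavior described above can be made explicit with a comparator that breaks ties by arrival order. The sketch below is illustrative only: `QueuedItem`, its `seq` counter, and `takeNext` are hypothetical names, and a linear scan is used for brevity rather than a heap.

```typescript
type Priority = 'low' | 'normal' | 'critical';

interface QueuedItem {
  traceId: string;
  priority: Priority;
  seq: number; // hypothetical arrival counter assigned at enqueue time
}

const PRIORITY_RANK: Record<Priority, number> = { critical: 0, normal: 1, low: 2 };

// Orders by priority first, then by arrival order within the same priority.
function compareQueued(a: QueuedItem, b: QueuedItem): number {
  return PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority] || a.seq - b.seq;
}

// Removes and returns the next item to deliver, or null if the queue is empty.
function takeNext(queue: QueuedItem[]): QueuedItem | null {
  if (queue.length === 0) return null;
  let best = 0;
  for (let i = 1; i < queue.length; i++) {
    if (compareQueued(queue[i], queue[best]) < 0) best = i;
  }
  return queue.splice(best, 1)[0];
}
```

An explicit sequence number keeps first-in, first-out delivery within a priority level independent of how the underlying array happens to be ordered.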
Pitfall Guide
1. Adopting an All-in-One Platform Prematurely
Explanation: Teams adopt an all-in-one "AI platform" before experiencing the specific pain points each layer solves. This creates vendor lock-in for problems that don't exist yet and obscures visibility into actual bottlenecks.
Fix: Deploy each layer independently. Use open protocols (MCP, A2A) and standard HTTP interfaces. Replace components individually when scaling demands change.
2. Assuming Tools Are Called Exactly Once
Explanation: LLMs select tools based on probabilistic reasoning. Assuming a tool will only be called once per request leads to missing idempotency safeguards and duplicate executions.
Fix: Implement idempotency keys on all tool endpoints. The policy layer should validate request signatures and reject duplicate executions within a defined time window.
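A duplicate-rejection window of this kind might be sketched as follows. `IdempotencyGuard`, `deriveIdempotencyKey`, and the five-minute default are assumptions made for illustration; the in-memory map would need to be replaced by a shared store when several gateway instances run side by side.

```typescript
// Illustrative in-memory guard against duplicate tool executions.
class IdempotencyGuard {
  private seen = new Map<string, number>(); // key -> expiry timestamp (ms)

  constructor(private windowMs: number = 300000) {}

  // Returns true the first time a key is seen within the window, false for duplicates.
  firstSeen(idempotencyKey: string, nowMs: number = Date.now()): boolean {
    const expiresAt = this.seen.get(idempotencyKey);
    if (expiresAt !== undefined && expiresAt > nowMs) return false;
    this.seen.set(idempotencyKey, nowMs + this.windowMs);
    return true;
  }
}

// Hypothetical key derivation: the same trace, tool, and arguments are treated
// as the same intended execution. JSON.stringify is sensitive to property
// order, so callers should build args consistently.
function deriveIdempotencyKey(traceId: string, tool: string, args: unknown): string {
  return `${traceId}:${tool}:${JSON.stringify(args)}`;
}
```

The policy layer would call `firstSeen` before forwarding a tool call and drop the call when it returns false.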
3. Ignoring Cross-Gateway Trace Correlation
Explanation: Each gateway generates its own logs, making it impossible to reconstruct a single user request across model routing, tool execution, and agent handoffs.
Fix: Propagate a single traceId through all layers. Inject it into HTTP headers, policy logs, and message payloads. Use a centralized tracing backend (OpenTelemetry, Jaeger) to correlate spans.
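The three injection points named in the fix could look roughly like this. The `x-trace-id` header name and the helper functions are hypothetical; a deployment built on OpenTelemetry would normally rely on its own propagation format instead.

```typescript
// Hypothetical header name; substitute whatever your tracing backend expects.
const TRACE_HEADER = 'x-trace-id';

// Returns a copy of the headers with the trace ID attached (input is not mutated).
function withTraceHeader(headers: Record<string, string>, traceId: string): Record<string, string> {
  return { ...headers, [TRACE_HEADER]: traceId };
}

// Builds a structured policy log line carrying the same trace ID.
function policyLogLine(traceId: string, tool: string, allowed: boolean): string {
  return JSON.stringify({ traceId, layer: 'tool_policy', tool, allowed });
}

// Wraps an agent payload in an envelope carrying the same trace ID.
function agentEnvelope<T>(traceId: string, sourceAgent: string, targetAgent: string, payload: T) {
  return { traceId, sourceAgent, targetAgent, payload };
}
```

Generating the ID once at the edge and passing it to all three helpers means a single search in the tracing backend returns the model call, every policy decision, and every handoff for that request.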
4. Over-Provisioning MCP Permissions
Explanation: Granting broad tool access to avoid friction results in permission escalation when agents encounter edge cases. An agent can invoke any tool its policy permits, including tools its task never required.
Fix: Apply least-privilege scoping. Define tool access per agent role, not per user. Require explicit confirmation for destructive operations and log every policy evaluation.
5. Skipping Semantic Cache Invalidation
Explanation: Exact-match caching fails when prompts vary slightly but semantically match previous queries. Conversely, aggressive semantic caching returns stale data when context changes.
Fix: Use vector similarity thresholds (e.g., cosine similarity > 0.92) combined with context-aware TTLs. Invalidate cache entries when underlying data sources change or when user intent shifts.
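A similarity-threshold lookup with TTL expiry might be sketched like this. It assumes the caller already has embedding vectors from some embedding model (not shown), and `SemanticCache` with its linear scan is a hypothetical illustration rather than a production vector index.

```typescript
interface SemanticCacheEntry {
  embedding: number[];
  response: string;
  expiresAt: number; // epoch milliseconds
}

// Cosine similarity of two equal-length vectors; returns 0 for mismatched or zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) return 0;
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private entries: SemanticCacheEntry[] = [];

  constructor(private threshold: number = 0.92, private ttlMs: number = 3600000) {}

  put(embedding: number[], response: string, nowMs: number = Date.now()): void {
    this.entries.push({ embedding, response, expiresAt: nowMs + this.ttlMs });
  }

  // Returns the best unexpired match whose similarity exceeds the threshold, or null.
  get(embedding: number[], nowMs: number = Date.now()): string | null {
    this.entries = this.entries.filter(e => e.expiresAt > nowMs);
    let best: SemanticCacheEntry | null = null;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      const score = cosineSimilarity(entry.embedding, embedding);
      if (score > bestScore) { best = entry; bestScore = score; }
    }
    return best ? best.response : null;
  }

  // Call when an underlying data source changes.
  invalidateAll(): void {
    this.entries = [];
  }
}
```

The threshold trades hit rate against the risk of returning an answer to a different question, so it is worth tuning against real traffic rather than treating 0.92 as fixed.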
6. Misaligning Virtual Key Scopes
Explanation: Virtual keys that lack environment or feature boundaries allow cross-team token leakage and make cost attribution impossible.
Fix: Scope virtual keys to team:environment:feature. Enforce quotas at the gateway level, not the application level. Rotate keys automatically on team reorganization.
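Parsing and checking a `team:environment:feature` scope could be sketched as below. `KeyScope`, `parseKeyScope`, and `scopeAllows` are hypothetical helpers, and exact matching on all three segments is an assumption; some deployments may want wildcards.

```typescript
interface KeyScope {
  team: string;
  environment: string;
  feature: string;
}

// Parses "team:environment:feature"; throws on missing or empty segments.
function parseKeyScope(scope: string): KeyScope {
  const parts = scope.split(':');
  if (parts.length !== 3 || parts.some(p => p.trim() === '')) {
    throw new Error(`Invalid virtual key scope: "${scope}"`);
  }
  const [team, environment, feature] = parts;
  return { team, environment, feature };
}

// True only when the key's scope matches the request on every boundary.
function scopeAllows(key: KeyScope, request: KeyScope): boolean {
  return key.team === request.team
    && key.environment === request.environment
    && key.feature === request.feature;
}
```

Rejecting malformed scopes at key creation time keeps an unscoped key from ever reaching the gateway, which is where cross-team leakage would otherwise start.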
7. Neglecting Fallback Strategy Validation
Explanation: Configuring provider fallbacks without testing them results in silent failures. When the primary provider degrades, the fallback chain may lack compatible model capabilities or pricing tiers.
Fix: Implement synthetic traffic testing for fallback routes. Verify model capability parity, latency expectations, and cost deltas before promoting fallback configurations to production.
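Part of that verification can be automated as a pre-promotion check. The sketch below is hypothetical throughout: `ModelProfile`, its fields, and the default cost ratio are illustrative and not drawn from any provider's API, and it complements synthetic traffic rather than replacing it.

```typescript
// Hypothetical capability profile for a model behind a provider.
interface ModelProfile {
  provider: string;
  maxContextTokens: number;
  supportsToolCalls: boolean;
  costPer1kTokensUsd: number;
}

// Lists the reasons a fallback is not a safe substitute for the primary.
// An empty array means no gaps were found by these checks.
function fallbackGaps(primary: ModelProfile, fallback: ModelProfile, maxCostRatio: number = 1.5): string[] {
  const gaps: string[] = [];
  if (fallback.maxContextTokens < primary.maxContextTokens) {
    gaps.push('smaller context window');
  }
  if (primary.supportsToolCalls && !fallback.supportsToolCalls) {
    gaps.push('no tool-call support');
  }
  if (fallback.costPer1kTokensUsd > primary.costPer1kTokensUsd * maxCostRatio) {
    gaps.push('cost delta above limit');
  }
  return gaps;
}
```

Running a check like this in CI for every entry in the fallback chain surfaces capability mismatches before an outage forces the fallback into use.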
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single team, one provider, <10k tokens/day | Direct SDK integration | Overhead outweighs benefits; routing layer adds latency without measurable ROI | Baseline |
| Multiple teams, mixed providers, billing opacity | Model routing layer (AI Gateway) | Centralizes cost tracking, enforces quotas, enables provider fallbacks | +5–10% infrastructure, -30% token waste |
| Agents executing production tools (DB, APIs, SaaS) | Tool policy layer (MCP Gateway) | Prevents permission escalation, enforces execution quotas, creates audit trails | +15% latency per tool call, -90% security incidents |
| Multi-agent workflows or vendor A2A integration | Agent routing layer (Agent Gateway) | Manages identity, prioritizes handoffs, enables cross-agent traceability | +20% orchestration overhead, enables complex workflows |
| Enterprise compliance (SOC2, HIPAA) | All three layers with centralized tracing | Meets audit requirements, enforces least privilege, provides full request reconstruction | +25% operational complexity, enables compliance certification |
Configuration Template
# gateway-stack.config.yaml
model_routing:
  virtual_key_scope: "team:production:feature"
  primary_provider: "anthropic"
  fallback_chain:
    - "openai"
    - "google-vertex"
  cache:
    strategy: "semantic"
    similarity_threshold: 0.92
    ttl_seconds: 3600
  rate_limits:
    tokens_per_minute: 50000
    requests_per_second: 120
tool_policy:
  enforcement_mode: "strict"
  rules:
    - tool: "database.execute_query"
      allowed_roles: ["data_analyst", "admin"]
      max_executions_per_minute: 30
      requires_confirmation: true
    - tool: "slack.send_message"
      allowed_roles: ["support_agent", "admin"]
      max_executions_per_minute: 60
      requires_confirmation: false
  audit:
    log_all_executions: true
    retention_days: 90
agent_routing:
  message_bus:
    queue_strategy: "priority"
    max_queue_depth: 1000
  identity:
    binding: "agent:role:tenant"
    trace_propagation: "header"
  fallback:
    on_timeout_seconds: 30
    retry_attempts: 2
Quick Start Guide
- Replace direct SDK calls with a unified model routing endpoint. Configure virtual keys scoped to your team and environment, and define a fallback chain matching your provider contracts.
- Wrap all tool server calls with a policy evaluation step. Define role-based access rules, set execution quotas, and enable confirmation prompts for high-impact operations.
- Inject a traceId into every outbound request. Propagate it through model routing headers, policy logs, and agent message payloads to enable end-to-end observability.
- Deploy the configuration template to your staging environment. Run synthetic traffic to validate fallback routes, policy enforcement, and queue prioritization before promoting to production.
- Monitor cost attribution and policy violations using the virtual key scopes and audit logs. Adjust quotas and fallback chains based on observed traffic patterns and provider performance.