complexity heuristics.
Architecture Decisions
- Classifier-First Routing: A lightweight classification layer evaluates incoming requests before model invocation. This prevents unnecessary context window expansion and keeps routing latency under 50ms.
- Execution Abstraction: The router delegates to an execution engine that handles retries, token budgeting, and tool-loop management. This isolates model-specific behavior from business logic.
- Cost-Aware Fallbacks: If a Flash-routed task exceeds token thresholds or fails tool-call validation, the system escalates to Pro rather than failing silently. This preserves reliability while maintaining cost discipline.
- Observability Hooks: Every routing decision, token count, and benchmark alignment metric is logged. This enables continuous calibration of the classification rules.
Implementation (TypeScript)
import { z } from 'zod';
// Domain types
type ModelTier = 'flash' | 'pro';
type TaskCategory = 'agentic' | 'reasoning' | 'mixed';
interface RoutingConfig {
maxToolCalls: number;
tokenBudget: { input: number; output: number };
fallbackTier: ModelTier;
}
interface TaskMetadata {
id: string;
category: TaskCategory;
requiresToolUse: boolean;
contextLength: number;
}
interface ExecutionResult {
taskId: string;
modelUsed: ModelTier;
tokensConsumed: { input: number; output: number };
success: boolean;
latencyMs: number;
}
// Classifier logic
class TaskClassifier {
static categorize(metadata: TaskMetadata): TaskCategory {
if (metadata.requiresToolUse && metadata.contextLength < 32000) {
return 'agentic';
}
if (!metadata.requiresToolUse && metadata.contextLength > 64000) {
return 'reasoning';
}
return 'mixed';
}
}
// Router core
class ModelRouter {
private config: RoutingConfig;
private executionEngine: ExecutionEngine;
constructor(config: RoutingConfig, engine: ExecutionEngine) {
this.config = config;
this.executionEngine = engine;
}
async route(metadata: TaskMetadata): Promise<ExecutionResult> {
const category = TaskClassifier.categorize(metadata);
const targetTier = this.selectTier(category);
const result = await this.executionEngine.run({
tier: targetTier,
metadata,
budget: this.config.tokenBudget,
maxRetries: category === 'agentic' ? 3 : 1
});
// Fallback escalation on failure or budget overflow
if (!result.success || result.tokensConsumed.output > this.config.tokenBudget.output * 0.8) {
console.warn(`[Router] Escalating ${metadata.id} to ${this.config.fallbackTier}`);
return this.executionEngine.run({
tier: this.config.fallbackTier,
metadata,
budget: this.config.tokenBudget,
maxRetries: 1
});
}
return result;
}
private selectTier(category: TaskCategory): ModelTier {
switch (category) {
case 'agentic': return 'flash';
case 'reasoning': return 'pro';
case 'mixed': return 'pro'; // Conservative default for ambiguous workloads
}
}
}
// Execution engine stub (replace with actual SDK calls)
class ExecutionEngine {
async run(params: {
tier: ModelTier;
metadata: TaskMetadata;
budget: { input: number; output: number };
maxRetries: number;
}): Promise<ExecutionResult> {
const start = performance.now();
// Simulate model invocation with tool-loop handling
const success = Math.random() > 0.15;
const inputTokens = Math.floor(params.metadata.contextLength * 0.9);
const outputTokens = success ? 1200 : 400;
return {
taskId: params.metadata.id,
modelUsed: params.tier,
tokensConsumed: { input: inputTokens, output: outputTokens },
success,
latencyMs: Math.floor(performance.now() - start)
};
}
}
// Usage example
const router = new ModelRouter(
{
maxToolCalls: 12,
tokenBudget: { input: 100000, output: 8000 },
fallbackTier: 'pro'
},
new ExecutionEngine()
);
router.route({
id: 'task-8842',
category: 'agentic',
requiresToolUse: true,
contextLength: 18500
}).then(console.log);
Why This Architecture Works
The classifier isolates routing logic from execution, making it trivial to swap models or adjust thresholds without touching business code. The fallback mechanism prevents catastrophic failures when Flash's reasoning limits are exposed, while the token budget guardrails stop agentic loops from bleeding output costs. By treating agentic and reasoning as distinct execution topologies rather than complexity levels, the router aligns model strengths with workload patterns, which is where the benchmark divergence actually occurs.
Pitfall Guide
1. Hierarchical Default Routing
Explanation: Routing all complex tasks to Pro because it sits higher in the model lineup. This ignores the agentic optimization baked into Flash and inflates costs by 40% without improving tool-call accuracy.
Fix: Implement category-based routing. Use tool-use detection and context length thresholds to direct iterative workflows to Flash, reserving Pro for novel reasoning tasks.
Explanation: Assuming that because Flash scores higher on MCP Atlas, tool failures won't occur. In production, network latency, malformed arguments, and server-side rate limits still break loops.
Fix: Build exponential backoff with argument validation. Cache successful tool schemas and log failure modes to refine the classifier's requiresToolUse heuristic over time.
3. Over-Classification Latency
Explanation: Running heavy NLP classifiers or embedding similarity searches to determine task category before every request. This adds 200-500ms of routing overhead, negating Flash's speed advantage.
Fix: Use lightweight rule-based classification or metadata tags from the calling service. Reserve ML-based routing for batch processing or offline calibration, not real-time inference.
4. Benchmark-to-Production Parity Assumption
Explanation: Treating Terminal-Bench 2.1 or MCP Atlas scores as direct predictors of production performance. Benchmarks are curated, time-boxed, and lack real-world noise like partial context or degraded APIs.
Fix: Run shadow deployments. Route 10-20% of traffic to Flash while logging success rates, token consumption, and user feedback. Adjust routing thresholds based on observed drift, not lab scores.
5. Output Token Cost Blindness
Explanation: Focusing only on input token pricing while ignoring that agentic loops generate many small output tokens. Flash's $15/M output pricing is 40% cheaper, but unbounded loops still accumulate cost.
Fix: Enforce output token caps per iteration. Implement early stopping when tool outputs repeat or converge. Track cost-per-task rather than cost-per-token to align with business metrics.
6. Hardcoded Model Identifiers
Explanation: Embedding gemini-3.5-flash or gemini-3.1-pro directly in routing logic. Model versioning changes, and hardcoding breaks deployments when Google releases patches or deprecates endpoints.
Fix: Use alias mappings in configuration files. Maintain a modelRegistry that maps logical roles (agenticWorker, reasoningEngine) to current model identifiers. Update aliases without redeploying routing code.
7. Neglecting Context Window Alignment
Explanation: Routing long-context tasks to Flash without verifying context window compatibility. Flash may truncate or compress context aggressively, degrading tool-call accuracy.
Fix: Validate context length against model specifications before routing. Implement context summarization or chunking strategies for tasks exceeding the target model's optimal window.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Terminal automation, code debugging, MCP tool loops | Route to Gemini 3.5 Flash | Higher tool-calling accuracy, faster iteration, 40% lower token pricing | Reduces per-task cost by ~35-45% |
| Financial research, multi-step agent workflows | Route to Gemini 3.5 Flash | Superior long-horizon coherence (57.9% vs 43.0% on Finance Agent v2) | Lowers output token spend despite loop volume |
| Novel analytical reasoning, expert-level Q&A | Route to Gemini 3.1 Pro | Retains edge on one-shot reasoning (44.4% vs 40.2% on HLE) | Accepts 40% premium for accuracy-critical tasks |
| Ambiguous or mixed workloads | Default to Pro with Flash fallback | Conservative routing prevents reasoning degradation; fallback catches agentic patterns | Balances reliability with cost recovery on misclassified tasks |
| High-throughput, low-latency APIs | Route to Gemini 3.5 Flash | ~4x faster generation reduces p95 latency and improves throughput | Cuts infrastructure scaling costs by reducing request duration |
Configuration Template
# routing-config.yaml
model_registry:
agentic_worker: "gemini-3.5-flash"
reasoning_engine: "gemini-3.1-pro"
routing_rules:
agentic:
target_tier: agentic_worker
max_tool_calls: 12
token_budget:
input: 80000
output: 6000
fallback: reasoning_engine
retry_policy:
max_attempts: 3
backoff_ms: [500, 1500, 3000]
reasoning:
target_tier: reasoning_engine
token_budget:
input: 120000
output: 10000
fallback: null
retry_policy:
max_attempts: 1
backoff_ms: []
observability:
metrics_prefix: "llm.router"
log_routing_decisions: true
shadow_traffic_pct: 15
alert_on_budget_overflow: true
Quick Start Guide
- Tag your endpoints: Add
requiresToolUse and contextLength metadata to incoming requests. This takes minutes if your API already tracks payload size and tool dependencies.
- Deploy the classifier: Use the rule-based categorization logic from the Core Solution. No ML training required; it relies on deterministic thresholds.
- Configure the router: Load the YAML template, adjust token budgets to match your SLAs, and point
agentic_worker and reasoning_engine to your current API keys.
- Enable shadow routing: Set
shadow_traffic_pct to 15. Monitor llm.router.fallback_rate and llm.router.cost_per_task for 48 hours before promoting Flash to production traffic.
- Enforce output caps: Add early-stopping checks in your execution engine. When tool outputs repeat or converge, terminate the loop and return results. This prevents cost drift without sacrificing accuracy.