The LLM rate limit that 429s you first is rarely the one you sized for, so I gave my agent a tool to compute it
Deterministic Capacity Planning for Multi-Dimensional LLM Rate Limits
Current Situation Analysis
Modern LLM infrastructure scaling has outgrown single-metric capacity models. Engineering teams traditionally size workloads by examining two variables: cost per million tokens and the requests-per-minute (RPM) ceiling listed on provider pricing pages. This approach assumes a linear relationship between request volume and token throughput. In practice, that assumption breaks down the moment traffic scales.
Provider rate limiting architectures have evolved into orthogonal constraint systems. Anthropic, for example, enforces three independent ceilings per minute for every model and tier combination:
- RPM: Maximum number of discrete API calls
- ITPM: Maximum input tokens processed
- OTPM: Maximum output tokens generated
These limits are not proportional. They operate on separate enforcement tracks, and the dimension that triggers a 429 Too Many Requests response depends entirely on your traffic composition. A retrieval-augmented generation pipeline with dense context windows and concise answers will exhaust ITPM long before touching RPM. Conversely, an agentic loop that issues short queries but generates extensive reasoning traces will bind against OTPM. The same model, the same pricing tier, completely different failure modes.
This problem is systematically overlooked because pricing documentation presents limits in isolation. Engineering capacity models default to RPM as the primary scaling vector, treating token throughput as a secondary cost metric rather than a hard infrastructure constraint. The result is silent degradation: workloads operate comfortably within request ceilings while quietly approaching token throughput walls. When scaling occurs, teams request quota increases on the wrong dimension, wasting engineering cycles and delaying deployments.
Data from recent provider tiers demonstrates the divergence. A workload running at 600 RPM with 2,000 input tokens and 500 output tokens per request consumes only 15% of a Tier 4 RPM ceiling, yet simultaneously utilizes 75% of the OTPM ceiling. The binding constraint has shifted entirely away from the metric teams monitor. Capacity planning based on RPM alone produces mathematically invalid scaling projections.
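To see the arithmetic behind those numbers, here is a minimal sketch using the Tier 4 ceilings quoted above (taken from the 2026-05-15 snapshot in the configuration template later in this article):

```typescript
// Utilization arithmetic for the 600 RPM example above.
// Tier 4 ceilings assumed from the 2026-05-15 snapshot.
const limits = { rpm: 4_000, itpm: 2_000_000, otpm: 400_000 };
const rpm = 600, avgInput = 2_000, avgOutput = 500;

const rpmUtil = rpm / limits.rpm;                  // 0.15 -> 15%
const itpmUtil = (rpm * avgInput) / limits.itpm;   // 0.60 -> 60%
const otpmUtil = (rpm * avgOutput) / limits.otpm;  // 0.75 -> 75%, the binding constraint
```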
WOW Moment: Key Findings
The binding constraint is not static. It migrates based on token shape, traffic volume, and tier allocation. The following comparison demonstrates how identical RPM volumes produce completely different capacity bottlenecks when prompt engineering strategies change.
| Traffic Profile | Avg Input Tokens | Avg Output Tokens | RPM Utilization | ITPM Utilization | OTPM Utilization | Binding Constraint |
|---|---|---|---|---|---|---|
| Retrieval-Heavy | 8,000 | 200 | 15% | 240% | 30% | ITPM |
| Balanced Gen | 2,000 | 500 | 15% | 60% | 75% | OTPM |
| Agentic Loop | 500 | 2,500 | 15% | 15% | 375% | OTPM |
Baseline: 600 RPM on claude-sonnet-4-6 Tier 4 (2026-05-15 snapshot). Utilization above 100% means the profile would throttle at that volume.
This finding matters because it decouples capacity planning from guesswork. When you know which dimension binds first, you can:
- Request precise quota increases instead of blanket tier upgrades
- Forecast monthly costs with deterministic accuracy rather than heuristic ranges
- Design traffic shaping strategies that deliberately avoid the binding constraint
- Prevent silent 429 cascades by monitoring the correct metric in observability stacks
The shift from RPM-centric to multi-dimensional capacity modeling transforms LLM infrastructure from reactive troubleshooting to proactive engineering.
Core Solution
The solution requires a deterministic calculation engine that operates entirely offline. Network-dependent capacity tools introduce latency, reliability risks, and version drift. A pure arithmetic approach against a versioned snapshot guarantees reproducibility, instant feedback, and explicit data freshness tracking.
Architecture Decisions
- Offline Arithmetic: Rate limits and pricing are static configuration data, not dynamic state. Computing capacity bounds requires no live API calls. This eliminates network dependencies, reduces latency to sub-millisecond execution, and guarantees identical outputs for identical inputs.
- Versioned Snapshots: Provider pricing and limits change frequently. Hardcoded tables become liabilities the moment they age. Every calculation must carry an explicit version string that documents exactly when the data was accurate.
- Orthogonal Constraint Evaluation: The engine must evaluate RPM, ITPM, and OTPM simultaneously, then identify the minimum headroom across all three dimensions. The binding constraint is always the dimension with the least remaining capacity.
- MCP Tool Exposure: AI coding agents and orchestration frameworks require structured tool interfaces. Exposing the calculator as a Model Context Protocol tool enables deterministic capacity queries without hallucination or stale training data.
Implementation
The following TypeScript implementation demonstrates the snapshot schema, the calculation engine, and the MCP tool wrapper.
```typescript
// capacity-snapshot.ts
export interface ProviderLimits {
rpm: number;
inputTokensPerMinute: number;
outputTokensPerMinute: number;
}
export interface PricingConfig {
inputPerMillion: number;
outputPerMillion: number;
}
export interface CapacitySnapshot {
version: string;
model: string;
tiers: Record<string, { limits: ProviderLimits; pricing: PricingConfig }>;
}
```

```typescript
// capacity-engine.ts
import { CapacitySnapshot } from './capacity-snapshot.js';
export interface TrafficProfile {
requestsPerMinute: number;
avgInputTokens: number;
avgOutputTokens: number;
}
export interface CapacityResult {
monthlyCost: number;
bindingDimension: 'RPM' | 'ITPM' | 'OTPM';
headroom: { rpm: number; itpm: number; otpm: number };
willThrottle: boolean;
snapshotVersion: string;
}
export class CapacityPlanner {
private snapshot: CapacitySnapshot;
constructor(snapshotData: CapacitySnapshot) {
this.snapshot = snapshotData;
}
evaluate(
model: string,
tier: string,
profile: TrafficProfile
): CapacityResult {
    if (model !== this.snapshot.model) {
      throw new Error(`Model ${model} is not covered by snapshot ${this.snapshot.version} (expected ${this.snapshot.model})`);
    }
    const tierConfig = this.snapshot.tiers[tier];
    if (!tierConfig) {
      throw new Error(`Tier ${tier} not found in snapshot ${this.snapshot.version}`);
    }
const limits = tierConfig.limits;
const pricing = tierConfig.pricing;
// Calculate per-minute demand
const demandITPM = profile.requestsPerMinute * profile.avgInputTokens;
const demandOTPM = profile.requestsPerMinute * profile.avgOutputTokens;
// Calculate utilization ratios
const rpmUtil = profile.requestsPerMinute / limits.rpm;
const itpmUtil = demandITPM / limits.inputTokensPerMinute;
const otpmUtil = demandOTPM / limits.outputTokensPerMinute;
    // Absolute headroom (requests or tokens remaining) per dimension
    const headroom = {
      rpm: limits.rpm - profile.requestsPerMinute,
      itpm: limits.inputTokensPerMinute - demandITPM,
      otpm: limits.outputTokensPerMinute - demandOTPM
    };
    // The binding constraint is the dimension with the highest utilization
    // ratio, i.e. the least relative headroom
    const utilizations = { rpm: rpmUtil, itpm: itpmUtil, otpm: otpmUtil };
    const bindingDimension = Object.entries(utilizations).reduce((a, b) =>
      b[1] > a[1] ? b : a
    )[0].toUpperCase() as 'RPM' | 'ITPM' | 'OTPM';
// Monthly cost projection (30 days, 24 hours, 60 minutes)
const minutesPerMonth = 30 * 24 * 60;
const monthlyInputTokens = demandITPM * minutesPerMonth;
const monthlyOutputTokens = demandOTPM * minutesPerMonth;
const monthlyCost =
(monthlyInputTokens / 1_000_000) * pricing.inputPerMillion +
(monthlyOutputTokens / 1_000_000) * pricing.outputPerMillion;
return {
monthlyCost: Math.round(monthlyCost * 100) / 100,
bindingDimension,
headroom,
willThrottle: rpmUtil > 1 || itpmUtil > 1 || otpmUtil > 1,
snapshotVersion: this.snapshot.version
};
}
}
```
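A usage sketch, assuming the ANTHROPIC_SNAPSHOT_2026_05_15 constant from the configuration template later in this article:

```typescript
// Evaluate the "Balanced Gen" profile from the findings table on Tier 4.
import { CapacityPlanner } from './capacity-engine.js';
import { ANTHROPIC_SNAPSHOT_2026_05_15 } from './capacity-config.js';

const planner = new CapacityPlanner(ANTHROPIC_SNAPSHOT_2026_05_15);
const result = planner.evaluate('claude-sonnet-4-6', 'tier-4', {
  requestsPerMinute: 600,
  avgInputTokens: 2_000,
  avgOutputTokens: 500
});
// result.bindingDimension === 'OTPM' (75% utilized vs 60% ITPM, 15% RPM)
// result.monthlyCost === 349920 (sustained 24/7 at this rate)
```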
MCP Server Wrapper
The calculation engine integrates cleanly into an MCP server framework. The tool signature accepts provider parameters and returns structured capacity data with explicit version tracking.
```typescript
// mcp-tool-wrapper.ts
import { z } from 'zod';
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { CapacityPlanner, TrafficProfile } from './capacity-engine.js';
export function registerCapacityTool(server: McpServer, planner: CapacityPlanner) {
server.tool(
'compute_llm_capacity',
'Deterministic capacity planning for multi-dimensional rate limits',
    {
      model: z.string().describe('Model identifier'),
      tier: z.string().describe('Provider tier level'),
      rpm: z.number().describe('Target requests per minute'),
      inputTokens: z.number().describe('Average input tokens per request'),
      outputTokens: z.number().describe('Average output tokens per request')
    },
async ({ model, tier, rpm, inputTokens, outputTokens }) => {
const profile: TrafficProfile = {
requestsPerMinute: rpm,
avgInputTokens: inputTokens,
avgOutputTokens: outputTokens
};
const result = planner.evaluate(model, tier, profile);
return {
content: [
{
type: 'text',
text: JSON.stringify(result, null, 2)
}
]
};
}
);
}
```
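For the balanced profile from the findings table, the serialized result is easy to verify by hand. A sketch of the expected payload, every value following arithmetically from the Tier 4 snapshot in the configuration template below:

```typescript
// Parsed result for the balanced profile (600 RPM, 2,000 in / 500 out, tier-4).
import { CapacityResult } from './capacity-engine.js';

const expected: CapacityResult = {
  monthlyCost: 349_920,
  bindingDimension: 'OTPM',
  headroom: { rpm: 3_400, itpm: 800_000, otpm: 100_000 },
  willThrottle: false,
  snapshotVersion: '2026-05-15'
};
```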
Why This Architecture Works
- Determinism: Identical inputs always produce identical outputs. No network flakiness, no API rate limits on the calculator itself.
- Version Traceability: Every response carries `snapshotVersion`. Stale data is immediately visible, and teams can audit capacity decisions against specific provider pricing dates.
- Orthogonal Evaluation: The engine never assumes RPM is the primary constraint. It calculates all three dimensions and surfaces the actual binding limit.
- Agent-Ready: The MCP wrapper enables AI coding assistants to answer capacity questions with arithmetic instead of probabilistic generation. This eliminates hallucinated cost estimates and incorrect scaling advice.
Pitfall Guide
1. RPM Tunnel Vision
Explanation: Treating requests-per-minute as the sole scaling metric while ignoring token throughput ceilings. Teams request RPM quota increases that provide zero relief when ITPM or OTPM is the actual bottleneck.
Fix: Always evaluate all three dimensions simultaneously. Design monitoring dashboards that track ITPM and OTPM utilization alongside RPM, as sketched below.
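A sketch of such a dashboard feed, where recordGauge is a hypothetical stand-in for your metrics client (StatsD, Prometheus, and the like):

```typescript
// Emit all three utilizations every minute, not just RPM.
import { ProviderLimits } from './capacity-snapshot.js';

export function reportUtilization(
  observed: { requests: number; inputTokens: number; outputTokens: number },
  limits: ProviderLimits,
  recordGauge: (metric: string, value: number) => void // hypothetical metrics hook
): void {
  recordGauge('llm.rpm.utilization', observed.requests / limits.rpm);
  recordGauge('llm.itpm.utilization', observed.inputTokens / limits.inputTokensPerMinute);
  recordGauge('llm.otpm.utilization', observed.outputTokens / limits.outputTokensPerMinute);
}
```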
2. Static Pricing Tables
Explanation: Hardcoding provider limits and costs into configuration files or documentation. Provider pricing changes frequently, and undated tables produce confident but incorrect capacity projections.
Fix: Implement versioned snapshots with explicit date strings. Require every capacity calculation to return the snapshot version used. Audit tables quarterly against provider documentation.
3. Token Shape Blindness
Explanation: Assuming input and output tokens consume capacity identically. In reality, providers enforce separate ITPM and OTPM ceilings. A 50/50 token split behaves completely differently from a 90/10 split.
Fix: Profile your actual traffic composition. Measure average input and output tokens per request separately. Use these measurements as inputs to the capacity engine rather than estimates.
4. Tier Mismatch Scaling
Explanation: Increasing traffic volume without verifying tier ceiling compatibility. A workload that fits comfortably in Tier 4 may hard-fail in Tier 1 due to orthogonal constraint overshoot, regardless of cost.
Fix: Cross-reference target traffic profiles against tier limits before deployment. Implement automated tier validation checks in CI/CD pipelines that reject configurations exceeding snapshot ceilings, as sketched below.
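A minimal sketch of such a gate, reusing the planner and snapshot from this article; the profile values are placeholders for your measured traffic shape:

```typescript
// ci-tier-check.ts — fail the build if the target profile would throttle.
import { CapacityPlanner } from './capacity-engine.js';
import { ANTHROPIC_SNAPSHOT_2026_05_15 } from './capacity-config.js';

const planner = new CapacityPlanner(ANTHROPIC_SNAPSHOT_2026_05_15);
const check = planner.evaluate('claude-sonnet-4-6', 'tier-1', {
  requestsPerMinute: 600,
  avgInputTokens: 2_000,
  avgOutputTokens: 500
});
if (check.willThrottle) {
  console.error(`Tier ceiling exceeded on ${check.bindingDimension} (snapshot ${check.snapshotVersion})`);
  process.exit(1);
}
```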
5. Burst vs Sustained Confusion
Explanation: Equating peak RPM with sustained token throughput. Burst traffic may trigger RPM limits temporarily, but sustained ITPM/OTPM consumption determines long-term capacity viability.
Fix: Model both p99 burst patterns and sustained average traffic. Use the capacity engine for sustained projections and implement separate burst handling strategies (queueing, backpressure, or graceful degradation).
6. Headroom Misinterpretation
Explanation: Assuming 50% headroom on RPM means safe scaling capacity. In multi-dimensional systems, headroom is determined by the minimum remaining capacity across all constraints.
Fix: Calculate headroom per dimension and identify the minimum value. Scale only up to the binding constraint's limit, as the sketch below shows. Document which dimension restricts growth in capacity reports.
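A compact way to apply this rule: derive the safe scaling multiplier from the most utilized dimension, never the average. A sketch using the balanced-profile utilizations computed earlier:

```typescript
// Safe growth is capped by the binding constraint, not by RPM headroom.
const utils = { rpm: 0.15, itpm: 0.60, otpm: 0.75 }; // balanced profile from above
const maxSafeMultiplier = 1 / Math.max(utils.rpm, utils.itpm, utils.otpm); // ~1.33x
```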
Production Bundle
Action Checklist
- Snapshot Management: Establish a quarterly process to update provider limit snapshots and increment version strings
- Traffic Profiling: Instrument production workloads to capture average input/output token distributions per request type
- Dimension Monitoring: Configure observability dashboards to track RPM, ITPM, and OTPM utilization independently
- Tier Validation: Implement pre-deployment checks that compare target traffic profiles against tier ceilings
- MCP Integration: Wire the capacity calculator into AI coding agents and orchestration frameworks for deterministic queries
- Alert Thresholds: Set warning alerts at 70% utilization on the binding dimension and critical alerts at 85% (see the sketch after this list)
- Cost Forecasting: Use deterministic monthly cost projections for budget planning instead of heuristic ranges
- Version Auditing: Require all capacity reports to include snapshot version for traceability and compliance
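The alert thresholds from the checklist translate directly into a threshold check on the binding dimension's utilization; a minimal sketch:

```typescript
// Warning at 70%, critical at 85% — evaluated on the binding dimension only.
function alertLevel(bindingUtilization: number): 'ok' | 'warning' | 'critical' {
  if (bindingUtilization >= 0.85) return 'critical';
  if (bindingUtilization >= 0.70) return 'warning';
  return 'ok';
}
```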
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Pre-deployment capacity planning | Offline deterministic calculator | Eliminates network latency, guarantees reproducibility, version-traceable | Zero API cost, engineering time only |
| Real-time traffic shaping | Dynamic queueing with backpressure | Handles burst patterns, prevents hard 429s, maintains SLA | Infrastructure overhead, reduced waste |
| AI agent capacity queries | MCP tool integration | Prevents hallucination, provides arithmetic certainty, version-aware | MCP server maintenance, minimal compute |
| Multi-provider scaling | Abstracted capacity interface | Normalizes orthogonal constraints across vendors, simplifies routing | Abstraction layer development |
| Budget forecasting | Deterministic monthly projection | Accurate cost modeling based on actual token shapes | Financial planning accuracy |
Configuration Template
```typescript
// capacity-config.ts
import { CapacitySnapshot } from './capacity-snapshot.js';
export const ANTHROPIC_SNAPSHOT_2026_05_15: CapacitySnapshot = {
version: '2026-05-15',
model: 'claude-sonnet-4-6',
tiers: {
'tier-1': {
limits: {
rpm: 50,
inputTokensPerMinute: 30_000,
outputTokensPerMinute: 8_000
},
pricing: {
inputPerMillion: 3,
outputPerMillion: 15
}
},
'tier-4': {
limits: {
rpm: 4_000,
inputTokensPerMinute: 2_000_000,
outputTokensPerMinute: 400_000
},
pricing: {
inputPerMillion: 3,
outputPerMillion: 15
}
}
}
};
```

```typescript
// mcp-server.ts
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { CapacityPlanner } from './capacity-engine.js';
import { registerCapacityTool } from './mcp-tool-wrapper.js';
import { ANTHROPIC_SNAPSHOT_2026_05_15 } from './capacity-config.js';
const server = new McpServer({
name: 'llm-capacity-planner',
version: '1.0.0'
});
const planner = new CapacityPlanner(ANTHROPIC_SNAPSHOT_2026_05_15);
registerCapacityTool(server, planner);
export { server };
```
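To serve the planner to a local agent, connect the server to a transport. A minimal sketch using the SDK's stdio transport (the file name entry.ts is illustrative):

```typescript
// entry.ts — run the capacity planner as a stdio MCP server.
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { server } from './mcp-server.js';

const transport = new StdioServerTransport();
await server.connect(transport);
```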
Quick Start Guide
- Initialize the Calculator: Import the capacity engine and instantiate the planner with a versioned snapshot. Ensure the snapshot matches your target provider and model.
- Profile Your Traffic: Measure average input and output tokens per request across your primary use cases. Use production telemetry rather than estimates.
- Run Capacity Evaluation: Call the evaluate method with your target RPM, token averages, and desired tier. Review the binding dimension and headroom metrics.
- Integrate with MCP: Register the tool wrapper with your MCP server. Configure your AI coding agent or orchestration framework to query capacity before scaling decisions.
- Monitor and Iterate: Track utilization against the binding dimension in production. Update snapshots quarterly and adjust traffic shaping strategies as provider limits evolve.