
Protecting against token theft

By Codcompass Team · 6 min read

Inference Arbitrage: Defending High-Cost AI Endpoints Against Resale Attacks

Current Situation Analysis

The economics of modern web infrastructure have created a dangerous asymmetry. Standard HTTP requests are nearly free: at scale, providers like Vercel charge approximately $2 per million requests. In contrast, a single prompt to a frontier model can cost $2 or more. That is a roughly million-fold difference in cost per request, and attackers exploit it through inference arbitrage.
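The million-fold figure follows directly from the two prices quoted above. The sketch below reproduces the arithmetic; the constant and function names are illustrative, and the prices are the approximate values from the text rather than any provider's published rate card.

```python
# Approximate figures quoted in the text; not an official price list.
HTTP_COST_PER_MILLION_USD = 2.00  # cost of one million plain HTTP requests
PROMPT_COST_USD = 2.00            # cost of one expensive frontier-model prompt


def cost_ratio(prompt_cost_usd: float, http_cost_per_million_usd: float) -> float:
    """Return how many times more one prompt costs than one HTTP request."""
    cost_per_http_request = http_cost_per_million_usd / 1_000_000
    return prompt_cost_usd / cost_per_http_request


ratio = cost_ratio(PROMPT_COST_USD, HTTP_COST_PER_MILLION_USD)
print(f"One prompt costs about {ratio:,.0f}x one HTTP request")
```

A cheaper prompt narrows the gap proportionally: at $0.02 per prompt the ratio is still around ten thousand.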

Inference theft occurs when attackers consume paid AI inference without authorization and resell the capacity at a discount. Because the attacker's marginal cost for inference is effectively zero, they can undercut legitimate pricing while maintaining high margins. This is not merely rate-limit abuse; it is the creation of a black market for stolen compute.

Many engineering teams overlook this threat because traditional web defenses are designed for low-cost attacks. IP rate limiting and authentication walls assume that the cost of bypassing defenses scales with the value of the resource. In inference theft, this assumption breaks. Attackers deploy residential proxy farms with thousands of IPs and automate account creation, rendering per-IP limits ineffective. Furthermore, defenses that verify identity only at session start allow attackers to amortize a single bypass across thousands of inference calls, destroying the defender's cost advantage.
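To see why per-IP limits stop helping once an attacker rotates proxies, consider a toy fixed-window limiter. This is a minimal sketch, not a production rate limiter: the class name, the limit of 10 requests per window, and the pool of 2,000 proxy IPs are all assumptions chosen for illustration.

```python
from collections import defaultdict


class PerIpLimiter:
    """Toy limiter: allows at most `limit` requests per IP in one window."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts: dict[str, int] = defaultdict(int)

    def allow(self, ip: str) -> bool:
        if self.counts[ip] >= self.limit:
            return False
        self.counts[ip] += 1
        return True


def simulate(num_ips: int, total_requests: int, limit: int) -> int:
    """Send requests round-robin across `num_ips` and count those allowed."""
    limiter = PerIpLimiter(limit)
    ips = [f"10.0.{i // 256}.{i % 256}" for i in range(num_ips)]
    return sum(limiter.allow(ips[i % num_ips]) for i in range(total_requests))


# Hypothetical scenario: 1,300 requests in one window, limit of 10 per IP.
allowed_single_ip = simulate(num_ips=1, total_requests=1_300, limit=10)
allowed_rotating = simulate(num_ips=2_000, total_requests=1_300, limit=10)
print(allowed_single_ip, allowed_rotating)
```

With a single IP the limiter blocks almost everything. With more proxy IPs than requests, no address ever reaches its limit, so every request passes.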

Real-world incidents confirm the severity. On April 12, 2026, a Vercel documentation AI endpoint experienced a traffic spike to 1,300 requests per minute on the Anthropic Claude Haiku 4.5 model. The attack utilized residential proxies to obscure origins, bypassing standard rate limits. The volume represented a potential inference cost run rate exceeding $10,000 per day. Without per-request verification, such attacks can drain budgets rapidly before detection.
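The run-rate figure can be sanity-checked with simple arithmetic. The sketch below converts the quoted request rate into a daily volume and derives the average per-request cost that a $10,000 daily run rate would imply. That implied cost is only a back-of-the-envelope lower bound, since the text says "exceeding"; it is not a published model price, and the function names are illustrative.

```python
REQUESTS_PER_MINUTE = 1_300   # peak rate quoted in the text
DAILY_RUN_RATE_USD = 10_000   # lower bound quoted in the text


def requests_per_day(requests_per_minute: int) -> int:
    """Extrapolate a sustained per-minute rate to a full day."""
    return requests_per_minute * 60 * 24


def implied_cost_per_request(daily_cost_usd: float, requests_per_minute: int) -> float:
    """Average cost per request implied by a daily run rate."""
    return daily_cost_usd / requests_per_day(requests_per_minute)


daily_requests = requests_per_day(REQUESTS_PER_MINUTE)
per_request = implied_cost_per_request(DAILY_RUN_RATE_USD, REQUESTS_PER_MINUTE)
print(f"{daily_requests:,} requests/day, about ${per_request:.4f} per request")
```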

WOW Moment: Key Findings

The critical insight in defending AI endpoints is the amortization effect. Attackers profit by decoupling the cost of bypassing security from the number of inference calls. Per-request verification re-couples these costs, making the attack economically unviable.

| Defense Strategy | Bypass Cost | Calls per Bypass | Attacker Margin | Resale Viability |
|---|---|---|---|---|
| Session-Based Auth | High (One-time) | Thousands | >95% | High |
| IP Rate Limits | Low (Proxy rotation) | Hundreds per IP | ~80% | Medium |
| Per-Request Deep Analysis | High (Per-call) | 1 | Negative | None |
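The amortization effect in the table can be expressed as a small model: the attacker's cost per call is the bypass cost divided by the number of calls each bypass unlocks. The sketch below uses made-up numbers (a $0.01 resale price per call and a $0.05 bypass cost) purely to illustrate the shape of the effect; they are assumptions, not figures from the article.

```python
def attacker_margin(resale_price: float, bypass_cost: float, calls_per_bypass: int) -> float:
    """Attacker's fractional margin when the stolen inference itself is free.

    The only cost is the security bypass, spread over the calls it unlocks.
    """
    cost_per_call = bypass_cost / calls_per_bypass
    return (resale_price - cost_per_call) / resale_price


# Hypothetical values for illustration only.
RESALE_PRICE_USD = 0.01  # price the attacker charges per resold call
BYPASS_COST_USD = 0.05   # cost of defeating verification once

# Session-based auth: one bypass is amortized over thousands of calls.
session_margin = attacker_margin(RESALE_PRICE_USD, BYPASS_COST_USD, calls_per_bypass=5_000)

# Per-request verification: every call pays the full bypass cost.
per_request_margin = attacker_margin(RESALE_PRICE_USD, BYPASS_COST_USD, calls_per_bypass=1)

print(f"session: {session_margin:.1%}, per-request: {per_request_margin:.1%}")
```

Under these assumptions the session-based margin stays above 95%, while the per-request margin turns negative as soon as the bypass costs more than the resale price of a single call.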

Why this matters: When verification runs per request, the attacker must pay the bypass cost for every single inference call. Since inference is the most expensive resource per call, forcing the attacker to solve a challenge for each request destroys their ma
