# How I cut my multi-turn LLM API costs by 90% (O(N²) → O(N))
## Current Situation Analysis
Building multi-turn AI agents introduces a fundamental scaling bottleneck: API costs do not grow linearly; they grow quadratically. In a standard agent loop, every turn replays the full conversation history as input. Consequently, the input token cost on turn N is proportional to N, making the total cost across N turns Θ(N²).
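To make the curve concrete, here is a minimal back-of-the-envelope sketch; the per-turn token counts are illustrative assumptions, not figures from the benchmark below:

```python
# Illustrative token-growth model; per-turn token counts are assumptions,
# not measured benchmark numbers.
TOKENS_PER_TURN = 500   # assumed tokens added to the transcript each turn
CAPSULE_TOKENS = 20     # assumed size of one ~80-character capsule

def naive_input_tokens(turns: int) -> int:
    """Full-history replay: turn n resends all n turns so far -> Θ(N²) total."""
    return sum(n * TOKENS_PER_TURN for n in range(1, turns + 1))

def capsule_input_tokens(turns: int) -> int:
    """Capsule history: turn n resends n-1 tiny capsules plus the fresh turn -> O(N) total."""
    return sum((n - 1) * CAPSULE_TOKENS + TOKENS_PER_TURN for n in range(1, turns + 1))

for n in (10, 50, 100):
    print(f"turns={n:>3}  naive={naive_input_tokens(n):>9,}  capsule={capsule_input_tokens(n):>9,}")
```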
Traditional context management strategies fail to address this because they treat the conversation history as a monolithic payload. Without aggressive compression or caching strategies, developers quickly hit hard API quotas (e.g., consuming 97% of a weekly Anthropic budget in a single heavy coding session). Furthermore, vendor-locked implementations prevent dynamic routing of tasks to cost-optimized models, forcing expensive reasoning models to handle trivial execution steps. The combination of unbounded context replay and static provider routing creates an unsustainable cost curve that collapses under production workloads.
## WOW Moment: Key Findings
The experimental validation demonstrates that collapsing the quadratic history term into a linear one, while leveraging provider-level prefix caching, fundamentally rewrites the economics of multi-turn orchestration. Benchmarks conducted via the Anthropic SDK using raw `response.usage` metrics confirm a 90.3% cost reduction against naive Claude Opus at turn 10, with real-world consumption dropping by ~16x.
| Approach | Cost per 10-turn Session | Token Complexity | Context Payload Size | Cache Hit Rate |
|---|---|---|---|---|
| Standalone (No Cache) | $4.66 | O(N²) | Full raw transcript | ~0% |
| Standalone (+ Prefix Cache) | $0.65 | O(N²) | Full raw transcript | ~85% |
| Burnless Maestro | $0.45 (-90.3%) | O(N) | ~80-char capsules | ~95% |
Key Findings:

- Quadratic → Linear Collapse: Replacing raw transcripts with compressed ~80-character capsules reduces the history term from Θ(N²) to O(N).
- Prefix Cache Leverage: Maintaining a byte-identical system prompt prefix allows Anthropic's prompt caching (`ttl: 1h`) to bill at cache-read rates (~10x cheaper than fresh input); a minimal caching sketch follows this list.
- Provider Agnosticism: The cost curve improvement is mathematically portable to any provider exposing prompt caching and per-token input billing.
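As a rough illustration of the prefix-cache mechanism, the sketch below marks a long, byte-identical system prompt as cacheable via the public Anthropic SDK. It is not Burnless's internal code: the prompt is a placeholder, the model id is copied from the config example further down, and the 1-hour TTL mentioned above is an extended-caching option that may require a beta header depending on SDK version.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The prefix must stay byte-identical across turns (and across model switches
# within the provider), otherwise the cache entry is not reused.
STATIC_SYSTEM_PROMPT = "You are the Maestro. <20k+ tokens of stable instructions>"

def ask(user_text: str):
    return client.messages.create(
        model="claude-sonnet-4-6",   # model id taken from the config example below
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            # Marks the prefix for prompt caching; later calls that share this
            # exact prefix are billed at cache-read rates.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": user_text}],
    )

resp = ask("Summarize the current task state.")
print(resp.usage)  # exposes cache_creation_input_tokens / cache_read_input_tokens
```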
## Core Solution
Burnless operates as a vendor-agnostic orchestration layer that decouples reasoning quality from execution cost. The architecture relies on two synchronized mechanisms:
- Shared Prefix Cache: The persistent system prompt (often 20k+ tokens) is cached using provider-specific prompt caching. Crucially, switching models within the same provider mid-session does not invalidate the cache as long as the prefix remains byte-identical.
- Capsule History: Instead of retaining raw transcripts, the "Maestro" (Brain) model only maintains ~80-character compressed capsules of prior turns. This tiny linear payload replaces the quadratic history accumulation; a minimal sketch follows this list.
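A hedged sketch of the capsule idea (the truncation below is a naive stand-in; Burnless's actual encoder, e.g. a bronze-tier local model, would produce a semantic summary rather than a simple clip):

```python
from dataclasses import dataclass

CAPSULE_CHARS = 80  # target capsule size from the article

@dataclass
class Turn:
    role: str   # "user", "assistant", "worker", ...
    text: str   # raw transcript of that turn

def to_capsule(turn: Turn) -> str:
    # Naive stand-in compression: whitespace-normalized head of the turn,
    # clipped to ~80 characters. A real encoder would summarize semantically.
    flat = " ".join(turn.text.split())
    return f"{turn.role}: {flat}"[:CAPSULE_CHARS]

def maestro_prompt(history: list[Turn], new_request: str) -> str:
    # Linear payload: N-1 tiny capsules plus the fresh request, instead of
    # replaying the full raw transcript every turn.
    capsules = "\n".join(to_capsule(t) for t in history)
    return f"Prior turns (compressed):\n{capsules}\n\nCurrent request:\n{new_request}"
```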
The orchestration layer routes tasks across quality/cost tiers rather than fixed vendors. You designate a conductor (Maestro) and workers, mapping gold/silver/bronze bands to any CLI in config.yaml. This allows mixing providers freely and running encoder/decoder stages on local models for zero marginal cost.
```yaml
agents:
  gold:   { command: "claude --model claude-sonnet-4-6 -p" }    # The Brain
  silver: { command: "codex exec --sandbox workspace-write" }   # Execution
  bronze: { command: "ollama run qwen2.5-coder" }               # Local, zero marginal cost
```

```bash
pip install burnless
burnless setup
```
## Pitfall Guide
- Prefix Byte Drift Invalidates Cache: Modifying the system prompt structure, adding dynamic timestamps, or changing whitespace between turns breaks byte-identical matching. Always hash or normalize the prefix before injection to guarantee cache hits.
- Over-Compression of Capsules: Truncating history too aggressively strips critical state (e.g., variable bindings, error traces). Implement a minimum token threshold and retain structured metadata alongside the 80-char summary to prevent Maestro context loss.
- Tier-to-Provider Misalignment: Mapping `gold` to a low-capability model or `bronze` to an expensive reasoning model defeats the orchestration strategy. Validate tier assignments against actual throughput/quality benchmarks before production deployment.
- Ignoring Cache TTL Boundaries: Setting `ttl: 1h` on a long-running agent loop causes cache eviction mid-session. Align cache TTL with expected session duration or implement proactive cache refresh hooks before expiration.
- Local Model Latency Bottlenecks: Routing encoding/decoding to local Ollama models introduces hardware-dependent latency. If GPU VRAM or CPU throughput is insufficient, the Maestro will time out waiting for capsule generation. Implement async fallbacks or queue-based execution.
- Dashboard vs. SDK Metric Mismatch: Relying on provider UI dashboards for cost tracking often masks cache-read discounts or includes hidden overhead. Always parse raw `response.usage` from the SDK for accurate O(N) vs. O(N²) validation; a minimal parsing sketch follows this list.
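As a rough illustration of that last point, per-call cost can be derived directly from the SDK usage object. The field names below come from the Anthropic Python SDK's `response.usage`; the per-million-token prices are placeholders to be replaced with current rates:

```python
# Hedged sketch: derive per-call cost from raw SDK usage instead of the dashboard.
# Prices are illustrative placeholders per million tokens; substitute current rates.
PRICE_PER_MTOK = {
    "input": 3.00,
    "cache_read": 0.30,    # cache reads bill far below fresh input
    "cache_write": 3.75,
    "output": 15.00,
}

def turn_cost_usd(usage) -> float:
    """usage is response.usage from anthropic.Anthropic().messages.create(...)."""
    fresh = usage.input_tokens
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    out = usage.output_tokens
    return (
        fresh * PRICE_PER_MTOK["input"]
        + cache_read * PRICE_PER_MTOK["cache_read"]
        + cache_write * PRICE_PER_MTOK["cache_write"]
        + out * PRICE_PER_MTOK["output"]
    ) / 1_000_000
```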
## Deliverables
- 📘 Burnless Architecture Blueprint: Complete system design document detailing the Maestro/Worker orchestration flow, capsule compression algorithm, and prefix cache synchronization mechanics.
- ✅ Multi-Turn Cost Reduction Checklist: Step-by-step validation guide covering cache TTL alignment, prefix normalization, tier mapping verification, and SDK usage tracking.
- ⚙️ Configuration Templates: Production-ready `config.yaml` schemas for gold/silver/bronze tier routing, local model fallback configurations, and provider-agnostic cache setup parameters.
