AI/ML · 2026-05-05 · 30 min read

How I cut my multi-turn LLM API costs by 90% (O(N²) → O(N))

By Rudekwydra


Current Situation Analysis

Building multi-turn AI agents introduces a fundamental scaling bottleneck: API costs do not grow linearly; they grow quadratically. In a standard agent loop, every turn replays the full conversation history as input. Consequently, the input token cost on turn N is proportional to N, making the total cost across N turns Θ(N²).
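The arithmetic behind that Θ(N²) claim can be sketched in a few lines. The per-turn token count below is an illustrative placeholder, not a figure from the benchmarks:

```python
# Sketch: why replaying full history makes total input tokens quadratic.
# TOKENS_PER_TURN is an assumed, illustrative average per conversation turn.
TOKENS_PER_TURN = 500

def total_input_tokens(n_turns: int) -> int:
    # Turn k replays all k prior turns, so its input is k * TOKENS_PER_TURN.
    # Summing over k gives N*(N+1)/2 * TOKENS_PER_TURN, i.e. Theta(N^2).
    return sum(k * TOKENS_PER_TURN for k in range(1, n_turns + 1))

ten = total_input_tokens(10)      # 27,500 tokens for a 10-turn session
hundred = total_input_tokens(100) # 2,525,000 tokens for a 100-turn session
```

A 10x longer session costs roughly 100x more input tokens, which is exactly the curve that collapses under production workloads.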

Traditional context management strategies fail to address this because they treat the conversation history as a monolithic payload. Without aggressive compression or caching strategies, developers quickly hit hard API quotas (e.g., consuming 97% of a weekly Anthropic budget in a single heavy coding session). Furthermore, vendor-locked implementations prevent dynamic routing of tasks to cost-optimized models, forcing expensive reasoning models to handle trivial execution steps. The combination of unbounded context replay and static provider routing creates an unsustainable cost curve that collapses under production workloads.

WOW Moment: Key Findings

The experimental validation demonstrates that collapsing the quadratic history term into a linear one, while leveraging provider-level prefix caching, fundamentally rewrites the economics of multi-turn orchestration. Benchmarks conducted via the Anthropic SDK using raw response.usage metrics confirm a 90.3% cost reduction against naive Claude Opus at turn 10, with a ~16x drop in real-world consumption.

| Approach | Cost per 10-turn Session | Token Complexity | Context Payload Size | Cache Hit Rate |
|---|---|---|---|---|
| Standalone (No Cache) | $4.66 | O(N²) | Full raw transcript | ~0% |
| Standalone (+ Prefix Cache) | $0.65 | O(N²) | Full raw transcript | ~85% |
| Burnless Maestro | $0.45 (−90.3%) | O(N) | ~80-char capsules | ~95% |

Key Findings:

  • Quadratic → Linear Collapse: Replacing raw transcripts with compressed ~80-character capsules reduces the history term from Θ(N²) to O(N).
  • Prefix Cache Leverage: Maintaining a byte-identical system prompt prefix allows Anthropic's prompt caching (ttl: 1h) to bill at cache-read rates (~10x cheaper than fresh input).
  • Provider Agnosticism: The cost curve improvement is mathematically portable to any provider exposing prompt caching and per-token input billing.
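The cache-read discount is easy to model. The per-token rates below are illustrative placeholders rather than current Anthropic pricing; only the ~10x cache-read ratio matters for the shape of the result:

```python
# Sketch: per-turn input cost with and without a prefix cache hit.
# Rates are assumed for illustration, not real provider pricing.
BASE_INPUT = 3.00 / 1_000_000    # $/token for fresh input (assumed)
CACHE_READ = BASE_INPUT * 0.10   # cache reads bill ~10x cheaper

def turn_cost(prefix_tokens: int, fresh_tokens: int, cache_hit: bool) -> float:
    # Only the fresh suffix (capsules + new user message) bills at full rate
    # when the byte-identical prefix is served from cache.
    prefix_rate = CACHE_READ if cache_hit else BASE_INPUT
    return prefix_tokens * prefix_rate + fresh_tokens * BASE_INPUT

# A 20k-token cached system prompt plus a tiny capsule payload per turn:
miss = turn_cost(20_000, 200, cache_hit=False)  # ~$0.0606
hit = turn_cost(20_000, 200, cache_hit=True)    # ~$0.0066
```

With the prefix dominating the payload, a cache hit cuts the turn's input cost by roughly 9x under these assumed rates.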

Core Solution

Burnless operates as a vendor-agnostic orchestration layer that decouples reasoning quality from execution cost. The architecture relies on two synchronized mechanisms:

  1. Shared Prefix Cache: The persistent system prompt (often 20k+ tokens) is cached using provider-specific prompt caching. Crucially, switching models within the same provider mid-session does not invalidate the cache as long as the prefix remains byte-identical.
  2. Capsule History: Instead of retaining raw transcripts, the "Maestro" (Brain) model only maintains ~80-character compressed capsules of prior turns. This tiny linear payload replaces the quadratic history accumulation.
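The two mechanisms above can be sketched together. The `summarize` function here is a naive truncation stand-in for whatever compressor produces the real capsules, and the message shape is a simplified assumption, not Burnless's actual internal format:

```python
# Sketch of the capsule-history idea: replace raw transcript turns with
# ~80-char summaries so the history payload grows linearly, while the
# system prompt stays byte-identical so the provider prefix cache keeps hitting.
CAPSULE_LEN = 80

def summarize(turn_text: str) -> str:
    # Placeholder compressor: a real system would use a cheap or local
    # model to distill the turn, not blind truncation.
    return turn_text.replace("\n", " ")[:CAPSULE_LEN]

def build_request(system_prompt: str, capsules: list[str], user_msg: str) -> dict:
    # The prefix (system_prompt) never changes between turns; only the
    # tiny linear capsule list and the new user message are fresh input.
    history = "\n".join(f"- {c}" for c in capsules)
    return {
        "system": system_prompt,  # cached, byte-identical prefix
        "messages": [
            {"role": "user", "content": f"History:\n{history}\n\n{user_msg}"}
        ],
    }
```

Each completed turn appends one capsule, so after N turns the history payload is N × ~80 characters instead of the full N-turn transcript.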

The orchestration layer routes tasks across quality/cost tiers rather than fixed vendors. You designate a conductor (Maestro) and workers, mapping gold/silver/bronze bands to any CLI in config.yaml. This allows mixing providers freely and running encoder/decoder stages on local models for zero marginal cost.

agents:
  gold:    { command: "claude --model claude-sonnet-4-6 -p" }   # The Brain
  silver:  { command: "codex exec --sandbox workspace-write" }  # Execution
  bronze:  { command: "ollama run qwen2.5-coder" }              # Local, zero marginal cost

Install via pip and run the interactive setup:

pip install burnless
burnless setup

Pitfall Guide

  1. Prefix Byte Drift Invalidates Cache: Modifying the system prompt structure, adding dynamic timestamps, or changing whitespace between turns breaks byte-identical matching. Always hash or normalize the prefix before injection to guarantee cache hits.
  2. Over-Compression of Capsules: Truncating history too aggressively strips critical state (e.g., variable bindings, error traces). Implement a minimum token threshold and retain structured metadata alongside the 80-char summary to prevent Maestro context loss.
  3. Tier-to-Provider Misalignment: Mapping gold to a low-capability model or bronze to an expensive reasoning model defeats the orchestration strategy. Validate tier assignments against actual throughput/quality benchmarks before production deployment.
  4. Ignoring Cache TTL Boundaries: Setting ttl: 1h on a long-running agent loop causes cache eviction mid-session. Align cache TTL with expected session duration or implement proactive cache refresh hooks before expiration.
  5. Local Model Latency Bottlenecks: Routing encoding/decoding to local Ollama models introduces hardware-dependent latency. If GPU VRAM or CPU throughput is insufficient, the Maestro will timeout waiting for capsule generation. Implement async fallbacks or queue-based execution.
  6. Dashboard vs. SDK Metric Mismatch: Relying on provider UI dashboards for cost tracking often masks cache-read discounts or includes hidden overhead. Always parse raw response.usage from the SDK for accurate O(N) vs O(N²) validation.
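Parsing raw usage, as pitfall 6 recommends, can be sketched as follows. The field names match the Anthropic SDK's response.usage object; the per-token rates are illustrative placeholders, not real pricing:

```python
# Sketch: compute a turn's true cost from raw SDK usage counters instead of
# trusting dashboard totals. Rates below are assumed for illustration only.
RATES = {
    "input_tokens": 3e-6,                 # fresh input
    "cache_read_input_tokens": 3e-7,      # ~10x cheaper cache reads
    "cache_creation_input_tokens": 3.75e-6,  # one-time cache-write premium
    "output_tokens": 15e-6,
}

def turn_cost_from_usage(usage: dict) -> float:
    # Sum each billed token class at its own rate; missing fields count as 0.
    return sum(usage.get(field, 0) * rate for field, rate in RATES.items())

# Example counters for a cache-hit turn (20k-token prefix read from cache):
usage = {"input_tokens": 220, "cache_read_input_tokens": 20_000,
         "cache_creation_input_tokens": 0, "output_tokens": 900}
cost = turn_cost_from_usage(usage)
```

Logging this per turn makes the O(N) vs O(N²) curves directly visible: under capsule history, input_tokens stays flat across turns instead of growing with N.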

Deliverables

  • 📘 Burnless Architecture Blueprint: Complete system design document detailing the Maestro/Worker orchestration flow, capsule compression algorithm, and prefix cache synchronization mechanics.
  • ✅ Multi-Turn Cost Reduction Checklist: Step-by-step validation guide covering cache TTL alignment, prefix normalization, tier mapping verification, and SDK usage tracking.
  • ⚙️ Configuration Templates: Production-ready config.yaml schemas for gold/silver/bronze tier routing, local model fallback configurations, and provider-agnostic cache setup parameters.