Cutting LLM Inference Costs by 64% and Latency by 48% with Speculative-First Routing and KV-Cache Overcommit
Current Situation Analysis

We migrated our LLM serving layer from a naive round-robin load balancer to specialized serving infrastructure in Q3 2024. The results were not incremental; they were structural. We reduced cost per million output tokens from $3.80 to $1.36 and cut p99 latency from 1.4s to roughly 0.73s.
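
As a quick sanity check, here is a minimal sketch of the arithmetic behind the headline figures, using only the numbers stated above; the ~0.73s p99 target is inferred from the 48% latency reduction rather than quoted directly.

```python
# Sanity check of the headline numbers. The post-migration p99 value is
# an assumption derived from the stated 48% reduction, not a quoted figure.

cost_before, cost_after = 3.80, 1.36   # $ per million output tokens
p99_before = 1.4                       # seconds, pre-migration
latency_reduction = 0.48               # headline claim

cost_reduction = 1 - cost_after / cost_before
p99_after = p99_before * (1 - latency_reduction)

print(f"cost reduction: {cost_reduction:.0%}")   # -> 64%
print(f"implied p99 after: {p99_after:.2f}s")    # -> 0.73s
```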
