
tierKV: A Distributed KV Cache That Makes Evicted Blocks Faster to Restore Than GPU Cache Hits

By Codcompass Team · 7 min read

Distributed KV Cache Tiering: Achieving Sub-GPU Latency via Offloaded Quantized Restoration

Current Situation Analysis

Modern large language model (LLM) inference engines manage the Key-Value (KV) cache exclusively within GPU VRAM. As context windows expand, VRAM becomes the primary bottleneck. When the cache reaches capacity, engines evict blocks to accommodate new requests. This eviction strategy creates a severe performance penalty for repeated prefixes: if a previously evicted prompt reappears, the engine must recompute the entire prefill phase from scratch.
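The eviction penalty described above can be illustrated with a toy cache. This is a minimal sketch, not any engine's actual implementation: the class `VramBlockCache`, its `get_or_prefill` method, and the least-recently-used eviction policy are all assumptions chosen for illustration, and a counter stands in for the real prefill cost.

```python
from collections import OrderedDict


class VramBlockCache:
    """Toy stand-in for a VRAM-resident KV cache with a fixed block budget.

    Hypothetical: real engines track paged blocks, not whole prefixes,
    and may use a different eviction policy than LRU.
    """

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity_blocks = capacity_blocks
        self._blocks: "OrderedDict[str, bytes]" = OrderedDict()
        self.recomputed = 0  # number of simulated prefill recomputations

    def get_or_prefill(self, prefix_hash: str) -> bytes:
        if prefix_hash in self._blocks:
            # Hit: mark the entry as most recently used and reuse it.
            self._blocks.move_to_end(prefix_hash)
            return self._blocks[prefix_hash]
        # Miss: the prefix was never cached or was evicted, so the
        # prefill has to be recomputed from scratch.
        self.recomputed += 1
        kv = f"kv:{prefix_hash}".encode()
        self._blocks[prefix_hash] = kv
        if len(self._blocks) > self.capacity_blocks:
            self._blocks.popitem(last=False)  # evict least recently used
        return kv


cache = VramBlockCache(capacity_blocks=2)
for prefix in ["doc-a", "doc-b", "doc-c", "doc-a"]:
    cache.get_or_prefill(prefix)
print(cache.recomputed)
```

With room for two entries, the third prefix evicts "doc-a", so its reappearance counts as a fourth full prefill rather than a reuse.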

Prefill computation scales quadratically with sequence length, O(n²). For long-context documents, this results in massive compute waste. On a 30,561-token sequence, a cold prefill can consume over 10 seconds of GPU time. This problem is frequently misunderstood; engineering teams often assume that a GPU cache hit represents the performance ceiling. However, even a GPU cache hit requires partial attention computation to validate and integrate the cached blocks. The true bottleneck is not just memory retrieval, but the attention kernel overhead itself.

Data from production workloads on hybrid models like Qwen3.6-35B-A3B demonstrates that relying solely on VRAM caching leaves significant latency on the table. The industry lacks a standardized mechanism to offload KV blocks to cheaper memory tiers while maintaining restoration speeds that compete with on-device access.

WOW Moment: Key Findings

The most critical insight in distributed KV caching is that offloaded restoration can outperform native GPU cache hits. By bypassing attention computation entirely and injecting blocks directly into the paged buffer, distributed tiering achieves lower latency than keeping blocks in VRAM.

The following benchmark data illustrates this phenomenon using the Apple FY2025 10-K filing (30,561 tokens) on a Qwen3.6-35B-A3B model:

Strategy            | Latency (30k tokens) | Compute Path         | Scaling Behavior
Cold Prefill        | 10.75s               | Full Attention O(n²) | Quadratic
GPU Cache Hit       | 1.19s                | Partial Attention    | Linear/Constant
Distributed Restore | 0.52s                | Direct Injection     | Linear + Network

Why this matters: Distributed restoration is approximately 2.3× faster than a GPU cache hit. This occurs because the restoration path skips the attention kernel; blocks are decoded from the vault and written directly into the engine's paged KV buffer. The performance gap widens with context length. Since prefill is O(n²) and restoration is O(n) plus network transfer, the advantage grows significantly for 128k+ contexts, where distributed restore is projected to be ~35× faster than cold prefill.
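The ratios quoted above follow directly from the benchmark latencies. The short check below only divides the three reported numbers; the dictionary keys and the `speedup` helper are illustrative names, not part of any tool described in the article.

```python
# Latencies in seconds, copied from the benchmark table (30,561-token prompt).
latency_s = {
    "cold_prefill": 10.75,
    "gpu_cache_hit": 1.19,
    "distributed_restore": 0.52,
}


def speedup(baseline: str, candidate: str) -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return latency_s[baseline] / latency_s[candidate]


restore_vs_hit = speedup("gpu_cache_hit", "distributed_restore")
restore_vs_cold = speedup("cold_prefill", "distributed_restore")
hit_vs_cold = speedup("cold_prefill", "gpu_cache_hit")

print(f"restore vs GPU cache hit: {restore_vs_hit:.1f}x")   # 2.3x
print(f"restore vs cold prefill:  {restore_vs_cold:.1f}x")  # 20.7x
print(f"GPU hit vs cold prefill:  {hit_vs_cold:.1f}x")      # 9.0x
```

At this context length the restore path is therefore about 20.7× faster than a cold prefill. The 128k+ figure in the text is a projection and is not reproduced here.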

Additionally, this architecture supports hybrid inference engines like EXO. In tests with 8,000-token prompts, restoration reduced latency from 30.83s (cold) to 4.11s, roughly a 7.5× improvement, validating the approach across different inference runtimes.

Core Solution

The solution implements a three-tier architecture that decouples KV storage from compute hardware. It leverages a plugin API to intercept eviction and restore events without modifying the inference engine source code.
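The eviction and restore interception can be pictured with a small sketch. Everything here is hypothetical: the hook names `on_evict` and `on_restore`, the `KVOffloadPlugin` protocol, the `InMemoryVault` class, and the `ToyEngine` are invented for illustration and do not correspond to the plugin API of any specific engine. A dictionary stands in for the remote storage tier, and quantization is omitted.

```python
from typing import Dict, Optional, Protocol


class KVOffloadPlugin(Protocol):
    """Hypothetical hook interface; real engines define their own."""

    def on_evict(self, block_hash: str, payload: bytes) -> None: ...

    def on_restore(self, block_hash: str) -> Optional[bytes]: ...


class InMemoryVault:
    """Stand-in for an offloaded tier; a dict replaces the network store."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def on_evict(self, block_hash: str, payload: bytes) -> None:
        # Keep a copy of the block instead of discarding it.
        self._store[block_hash] = bytes(payload)

    def on_restore(self, block_hash: str) -> Optional[bytes]:
        return self._store.get(block_hash)


class ToyEngine:
    """Toy engine that calls the plugin on eviction and on cache misses."""

    def __init__(self, plugin: KVOffloadPlugin) -> None:
        self.plugin = plugin
        self.paged_buffer: Dict[str, bytes] = {}

    def evict(self, block_hash: str) -> None:
        payload = self.paged_buffer.pop(block_hash)
        self.plugin.on_evict(block_hash, payload)

    def lookup(self, block_hash: str) -> bool:
        if block_hash in self.paged_buffer:
            return True
        payload = self.plugin.on_restore(block_hash)
        if payload is None:
            return False  # genuine miss: the caller must run prefill
        # Write the restored block straight into the paged buffer.
        self.paged_buffer[block_hash] = payload
        return True


engine = ToyEngine(InMemoryVault())
engine.paged_buffer["blk-1"] = b"\x01\x02"
engine.evict("blk-1")
restored = engine.lookup("blk-1")
missing = engine.lookup("blk-unknown")
print(restored, missing)
```

The engine only talks to the two hooks, so the storage backend can change without touching engine code, which is the decoupling the paragraph above describes.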

Architecture Tiers

  1. Hot Tier (GPU VRAM): Active KV blocks managed by the inference engine'
