# tierKV: A Distributed KV Cache That Makes Evicted Blocks Faster to Restore Than GPU Cache Hits

By Codcompass Team · 7 min read

Distributed KV Cache Tiering: Achieving Sub-GPU Latency via Offloaded Quantized Restoration

## Current Situation Analysis

Modern large language model (LLM) inference engines manage the Key-Value (KV) cache exclusively within GPU VRAM. As context windows expand, VRAM becomes the primary bottleneck. When the cache reaches capacity, engines evict blocks to accommodate new requests. This eviction strategy creates a severe performance penalty for repeated prefixes: if a previously evicted prompt reappears, the engine must recompute the entire prefill phase from scratch.

Prefill computation scales quadratically with sequence length, O(n²). For long-context documents, this results in massive compute waste. On a 30,561-token sequence, a cold prefill can consume over 10 seconds of GPU time. This problem is frequently misunderstood; engineering teams often assume that a GPU cache hit represents the performance ceiling. However, even a GPU cache hit requires partial attention computation to validate and integrate the cached blocks. The true bottleneck is not just memory retrieval, but the attention kernel overhead itself.

Data from production workloads on hybrid models like Qwen3.6-35B-A3B demonstrates that relying solely on VRAM caching leaves significant performance on the table. The industry lacks a standardized mechanism to offload KV blocks to cheaper memory tiers while maintaining restoration speeds that compete with on-device access.

## WOW Moment: Key Findings

The most critical insight in distributed KV caching is that offloaded restoration can outperform native GPU cache hits. By bypassing attention computation entirely and injecting blocks directly into the paged buffer, distributed tiering serves a repeated prefix with lower latency than a native GPU cache hit, even though the blocks were evicted from VRAM.

The following benchmark data illustrates this phenomenon using the Apple FY2025 10-K filing (30,561 tokens) on a Qwen3.6-35B-A3B model:

| Strategy | Latency (30k tokens) | Compute Path | Scaling Behavior |
|----------|----------------------|--------------|------------------|
| Cold Prefill | 10.75s | Full Attention O(n²) | Quadratic |
| GPU Cache Hit | 1.19s | Partial Attention | Linear/Constant |
| Distributed Restore | 0.52s | Direct Injection | Linear + Network |

**Why this matters:** Distributed restoration is approximately 2.3× faster than a GPU cache hit. This occurs because the restoration path skips the attention kernel; blocks are decoded from the vault and written directly into the engine's paged KV buffer. The performance gap widens with context length. Since prefill is O(n²) and restoration is O(n) plus network transfer, the advantage grows significantly for 128k+ contexts, where distributed restore is projected to be ~35× faster than cold prefill.

Additionally, this architecture extends to other inference runtimes such as EXO. In tests with 8,000-token prompts, restoration reduced latency from 30.83s (cold) to 4.11s, a 7.5× improvement, validating the approach across different inference runtimes.

## Core Solution

The solution implements a three-tier architecture that decouples KV storage from compute hardware. It leverages a plugin API to intercept eviction and restore events without modifying the inference engine source code.

### Architecture Tiers

  1. Hot Tier (GPU VRAM): Active KV blocks managed by the inference engine's prefix cache.
  2. Cold Tier A (LAN RAM - Attention): KV blocks evicted from full-attention layers, stored in RAM on a LAN machine and accessed via gRPC at ~0.5ms RTT.
  3. Cold Tier B (LAN RAM - SSM/Linear): Dedicated storage for state-space-model and linear-attention layer states in hybrid architectures.

### Implementation Flow

**Eviction Path:** When the GPU cache manager evicts a block, the plugin intercepts the event. The block is quantized using a head-aligned INT8 scheme and transmitted to the vault via a fire-and-forget gRPC Store call. The GPU block is freed immediately, minimizing latency impact on the serving thread.

**Restore Path:** Upon a cache miss for a known prefix, the engine triggers a BatchPromote RPC. This request fetches all required layers in a single round-trip. The vault performs parallel decoding (implementations with a Rust core can release the Python GIL during this step) and streams the tensors back. The engine injects these tensors directly into the paged KV buffer, bypassing attention recomputation.

### Quantization Strategy: Head-Aligned INT8

To maximize compression without degrading model quality, the system uses a per-group INT8 quantizer. Key design decisions include:

  • Group Alignment: Quantization groups are aligned to attention head boundaries. The group size equals the head dimension (e.g., 256 for Qwen3.6-35B-A3B). This isolation prevents outlier values in one head from corrupting the quantization of neighboring heads.
  • Compression Ratio: Achieves approximately 3.9× compression.
  • Quality Metric: Maintains Signal-to-Noise Ratio (SNR) ≥ 52 dB, ensuring negligible impact on generation quality for most workloads.
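
To make the head-aligned grouping concrete, the following is a minimal, self-contained sketch of a per-group INT8 quantizer with an SNR check. It is illustrative only: the real pipeline operates on paged KV blocks and packs scales next to the payload, and the function and field names here are assumptions rather than part of the plugin API.

```typescript
// Sketch: per-group INT8 quantization with groups aligned to the head dimension.
export interface HeadAlignedInt8 {
  values: Int8Array;     // quantized payload
  scales: Float32Array;  // one scale per head-aligned group
  snrDb: number;         // reconstruction quality of this block
}

export function quantizeHeadAligned(x: Float32Array, headDim: number): HeadAlignedInt8 {
  const numGroups = Math.ceil(x.length / headDim);
  const values = new Int8Array(x.length);
  const scales = new Float32Array(numGroups);

  for (let g = 0; g < numGroups; g++) {
    const start = g * headDim;
    const end = Math.min(start + headDim, x.length);

    // Scale is derived only from values inside this head, so an outlier in one
    // head cannot distort the quantization of its neighbors.
    let maxAbs = 0;
    for (let i = start; i < end; i++) maxAbs = Math.max(maxAbs, Math.abs(x[i]));
    const scale = maxAbs > 0 ? maxAbs / 127 : 1;
    scales[g] = scale;

    for (let i = start; i < end; i++) {
      values[i] = Math.max(-127, Math.min(127, Math.round(x[i] / scale)));
    }
  }

  // SNR check: compare the dequantized block against the original.
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < x.length; i++) {
    const reconstructed = values[i] * scales[Math.floor(i / headDim)];
    signal += x[i] * x[i];
    noise += (x[i] - reconstructed) * (x[i] - reconstructed);
  }
  const snrDb = 10 * Math.log10(signal / Math.max(noise, 1e-12));

  return { values, scales, snrDb };
}
```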

### Code Implementation Example

The following TypeScript interfaces define the configuration and connector structure for a distributed KV tiering plugin. This abstraction demonstrates how to integrate with an inference engine's plugin system.

```typescript
// kv-tier-engine/types.ts

// Opaque handles from the engine's plugin SDK; only the methods used below are declared.
export interface QuantizedTensor { decode(opts: { groupSize: number; snrCheck: number }): Tensor; }
export interface Tensor { quantize(opts: { algorithm: string; groupSize: number }): QuantizedTensor; }
export interface ConnectorMetadata { name: string; supportsQuantization: boolean; }

export interface DistributedKVConfig {
  cluster: {
    role: 'inference' | 'vault_attention' | 'vault_ssm';
    vaultEndpoint: string;
    ssmVaultEndpoint?: string;
  };
  quantization: {
    enabled: boolean;
    algorithm: 'INT8_HEAD_ALIGNED';
    headDimension: number;
    snrThresholdDb: number;
  };
  network: {
    maxRetries: number;
    timeoutMs: number;
  };
}

export interface KVConnectorPlugin {
  onEviction(blockId: string, tensor: Tensor): Promise<void>;
  onRestore(prefixHash: string): Promise<Tensor[]>;
  getMetadata(): ConnectorMetadata;
}
```

```typescript
// kv-tier-engine/connectors/vllm-adapter.ts

import { KVConnectorPlugin, DistributedKVConfig, Tensor, ConnectorMetadata } from '../types';
import { VaultClient } from '../client';
import { loadConfig } from '../config'; // config loader; module path is illustrative

export class DistributedKVConnector implements KVConnectorPlugin {
  private vaultClient: VaultClient;
  private config: DistributedKVConfig;

  constructor(configPath: string) {
    this.config = loadConfig(configPath);
    this.vaultClient = new VaultClient(this.config.cluster.vaultEndpoint);
  }

  async onEviction(blockId: string, tensor: Tensor): Promise<void> {
    if (!this.config.quantization.enabled) {
      await this.vaultClient.storeRaw(blockId, tensor);
      return;
    }

    const quantized = tensor.quantize({
      algorithm: this.config.quantization.algorithm,
      groupSize: this.config.quantization.headDimension,
    });

    // Fire-and-forget to minimize eviction latency on the serving thread
    this.vaultClient.storeQuantized(blockId, quantized)
      .catch(err => console.error(`Vault store failed: ${err.message}`));
  }

  async onRestore(prefixHash: string): Promise<Tensor[]> {
    // One BatchPromote round-trip fetches every layer for the prefix
    const response = await this.vaultClient.batchPromote({
      prefixHash,
      layers: 'all',
      format: 'quantized',
    });

    // Decode into tensors ready for direct injection into the paged KV buffer
    return response.blocks.map(block =>
      block.decode({
        groupSize: this.config.quantization.headDimension,
        snrCheck: this.config.quantization.snrThresholdDb,
      })
    );
  }

  getMetadata(): ConnectorMetadata { return { name: 'DistributedKVConnector', supportsQuantization: this.config.quantization.enabled }; }
}
```
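
The adapter above also depends on a `VaultClient` wrapper around the vault's `Store` and `BatchPromote` RPCs. That wrapper is not part of the example; the following is a hedged sketch of the surface it would need, with illustrative method and field names rather than a published API:

```typescript
// kv-tier-engine/client.ts (sketch) -- illustrative surface, not a published API.
import { Tensor, QuantizedTensor } from './types';

export interface BatchPromoteRequest {
  prefixHash: string;
  layers: 'all' | number[]; // fetch every layer, or a specific subset
  format: 'quantized' | 'raw';
}

export interface BatchPromoteResponse {
  blocks: QuantizedTensor[]; // decoded on the inference node before injection
}

export declare class VaultClient {
  constructor(endpoint: string);
  // Eviction path: Store RPC, resolved once the request has been flushed.
  storeRaw(blockId: string, tensor: Tensor): Promise<void>;
  storeQuantized(blockId: string, block: QuantizedTensor): Promise<void>;
  // Restore path: a single BatchPromote round-trip returns all layers for the prefix.
  batchPromote(req: BatchPromoteRequest): Promise<BatchPromoteResponse>;
}
```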


### Architecture Rationale

*   **Separate SSM Vault:** Hybrid models like Qwen3.6-35B-A3B contain both full-attention and linear-attention layers. Routing these to separate vaults allows independent scaling and optimization, as SSM states have different access patterns and compression characteristics.
*   **BatchPromote RPC:** Fetching all layers in a single RPC call reduces network round-trips. This is critical when the vault is on a LAN machine; minimizing latency overhead ensures the restore path remains faster than GPU attention.
*   **Plugin API Integration:** Using the engine's `KVConnectorBase_V1` API allows deployment without rebuilding the inference server. This reduces operational friction and enables incremental adoption.

## Pitfall Guide

### 1. Network Latency Blindness
**Explanation:** Distributed restoration relies on sub-millisecond LAN access. WiFi or WAN connections introduce latency that negates the compute savings.
**Fix:** Deploy vault nodes on the same subnet with 5GbE or 10GbE connectivity. Verify RTT is consistently below 5ms. Use network monitoring to detect jitter.
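
As a quick way to verify this, a connect-time probe against the vault port approximates RTT before the plugin is enabled. This is a minimal sketch: TCP connect time is only a proxy for gRPC round-trip latency, and the address is the example endpoint from the configuration template later in this article.

```typescript
// rtt-probe.ts -- sketch: measure TCP connect time to the vault as an RTT proxy.
import { Socket } from 'net';

function measureConnectMs(host: string, port: number): Promise<number> {
  return new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    const sock = new Socket();
    sock.once('connect', () => {
      const elapsedNs = process.hrtime.bigint() - start;
      sock.destroy();
      resolve(Number(elapsedNs) / 1e6);
    });
    sock.once('error', reject);
    sock.connect(port, host);
  });
}

async function main(): Promise<void> {
  const samples: number[] = [];
  for (let i = 0; i < 20; i++) {
    samples.push(await measureConnectMs('192.168.1.50', 50051));
  }
  samples.sort((a, b) => a - b);
  const median = samples[Math.floor(samples.length / 2)];
  const worst = samples[samples.length - 1];
  console.log(`median ${median.toFixed(2)} ms, worst ${worst.toFixed(2)} ms`,
    worst < 5 ? '(within budget)' : '(exceeds 5 ms budget)');
}

main().catch(err => { console.error(err); process.exit(1); });
```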

### 2. Quantization Drift in Critical Tasks
**Explanation:** INT8 quantization introduces minor numerical noise. While SNR ≥ 52 dB is safe for general generation, tasks requiring bit-for-bit reproducibility may fail.
**Fix:** Provide a configuration flag to disable quantization for sensitive workloads. Monitor SNR metrics and alert if they drop below the configured threshold, indicating potential quality degradation.

### 3. Hybrid Model Routing Errors
**Explanation:** Failing to distinguish between attention and SSM layers can cause blocks to be stored in the wrong vault, leading to restore failures or corruption.
**Fix:** Implement auto-detection of layer types based on model architecture metadata. Ensure the configuration explicitly maps layer indices to the correct vault endpoint.
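
A minimal sketch of such routing is shown below; how linear-attention layer indices are exposed in model metadata varies by architecture, so the `linearLayerIndices` input is an assumption:

```typescript
// layer-routing.ts -- sketch: classify layers and pick the correct vault endpoint.
type LayerKind = 'attention' | 'ssm';

interface LayerSpec {
  index: number;
  kind: LayerKind;
}

// Classify layers from architecture metadata: indices listed as linear-attention/SSM
// layers go to the SSM vault; everything else is treated as full attention.
function detectLayerKinds(numLayers: number, linearLayerIndices: number[]): LayerSpec[] {
  const linear = new Set(linearLayerIndices);
  return Array.from({ length: numLayers }, (_, i): LayerSpec => ({
    index: i,
    kind: linear.has(i) ? 'ssm' : 'attention',
  }));
}

// Misrouting an SSM state into the attention vault is exactly the failure mode above,
// so missing configuration fails loudly instead of silently falling back.
function vaultFor(
  layer: LayerSpec,
  cfg: { vaultEndpoint: string; ssmVaultEndpoint?: string },
): string {
  if (layer.kind === 'ssm') {
    if (!cfg.ssmVaultEndpoint) {
      throw new Error(`No SSM vault configured for layer ${layer.index}`);
    }
    return cfg.ssmVaultEndpoint;
  }
  return cfg.vaultEndpoint;
}
```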

### 4. Over-Provisioning Cold Storage
**Explanation:** Allocating excessive RAM to the vault can starve other processes on the LAN machine. KV blocks are large, and unbounded growth can cause OOM conditions.
**Fix:** Set `max_storage_bytes` limits in the vault configuration. Implement an LRU eviction policy on the vault side to manage capacity. Monitor storage utilization and tune limits based on actual reuse patterns.

### 5. Ignoring Tensor Parallelism Constraints
**Explanation:** Some distributed KV implementations do not support tensor-parallel inference configurations. Attempting to use both can cause synchronization errors.
**Fix:** Verify compatibility with the inference engine's parallelism mode. If tensor parallelism is required, use data parallelism or single-GPU setups until the plugin supports sharded KV distribution.

### 6. Short Prompt Inefficiency
**Explanation:** For very short prompts, the network overhead of vault access may exceed the cost of a local prefill.
**Fix:** Implement a threshold check. If the prefix length is below a certain block count, skip vault lookup and proceed with local computation. This optimization prevents unnecessary RPC calls.
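
A minimal sketch of such a guard (the block-count threshold and wrapper function are illustrative, not part of the connector API):

```typescript
// Sketch: skip the vault round-trip when the cached prefix is too short to pay off.
const MIN_RESTORE_BLOCKS = 4; // below this, local prefill is cheaper than an RPC

async function maybeRestore<T>(
  prefixHash: string,
  prefixBlockCount: number,
  connector: { onRestore(hash: string): Promise<T[]> },
): Promise<T[] | null> {
  if (prefixBlockCount < MIN_RESTORE_BLOCKS) {
    return null; // caller falls through to local prefill
  }
  return connector.onRestore(prefixHash);
}
```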

### 7. Single-Shot Workload Waste
**Explanation:** Deploying distributed caching for workloads with no prefix reuse adds infrastructure cost without benefit.
**Fix:** Profile request patterns before deployment. If prefix reuse rate is low, disable the plugin to avoid network overhead and storage costs.

## Production Bundle

### Action Checklist

- [ ] **Install Package:** Deploy the distributed KV engine package across all inference and vault nodes.
- [ ] **Define Topology:** Create configuration files specifying roles, endpoints, and quantization parameters.
- [ ] **Start Vault Services:** Launch vault daemons on LAN machines with allocated memory limits.
- [ ] **Verify Connectivity:** Run health checks to confirm RPC latency and storage accessibility.
- [ ] **Configure Engine:** Update inference server launch arguments to include the KV connector plugin.
- [ ] **Enable Prefix Caching:** Ensure the inference engine's native prefix caching is active to trigger plugin events.
- [ ] **Monitor Metrics:** Track restore latency, compression ratios, and vault storage utilization.

### Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
|----------|---------------------|-----|-------------|
| High Reuse RAG | Distributed Tiering | Massive compute savings on repeated documents | Low (LAN infra) |
| Single Shot API | Native GPU Cache | No network overhead; simpler architecture | Zero |
| Hybrid Model | Tiered with SSM Vault | Optimizes storage for mixed layer types | Medium |
| Tight VRAM Budget | Distributed Tiering | Extends context window beyond GPU limits | Low |
| Exact Reproducibility | Disable Quantization | Ensures bit-for-bit output consistency | Low |
| High Latency Network | Native GPU Cache | Network overhead negates benefits | Zero |

### Configuration Template

```toml
# kv-tier-config.toml

[cluster_topology]
role = "inference"
vault_endpoint = "192.168.1.50:50051"
ssm_vault_endpoint = "192.168.1.51:50051"

[quantization_profile]
enabled = true
algorithm = "INT8_HEAD_ALIGNED"
head_dimension = 256
snr_threshold_db = 52

[vault_limits]
max_storage_bytes = 24_000_000_000
eviction_policy = "LRU"

[network_settings]
rpc_timeout_ms = 500
max_retries = 2
```

### Quick Start Guide

  1. **Install the engine:**

     ```bash
     pip install kv-tier-engine
     ```

  2. **Create configuration:** Save the template above as `kv-tier-config.toml` and adjust the endpoints.

  3. **Start the vault on the LAN machine:**

     ```bash
     kv-tier vault start --config kv-tier-config.toml
     ```

  4. **Launch the inference server:**

     ```bash
     vllm serve Qwen/Qwen3-30B-A3B \
       --kv-transfer-config '{
         "connector": "DistributedKVConnector",
         "module_path": "kv_tier.plugins.vllm",
         "role": "kv_both",
         "extra_config": {"config_path": "/etc/kv-tier-config.toml"}
       }' \
       --enable-prefix-caching \
       --block-size 16
     ```

  5. **Validate:** Send a repeated long-context request and observe the latency reduction in engine metrics.
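
As a minimal sketch of this validation step (the document path is a placeholder; the endpoint and model name assume the launch command above on the default port):

```typescript
// validate-restore.ts -- send the same long-context prompt twice and compare latency.
// The first request pays prefill (or a distributed restore); the repeat should hit the cache.
import { readFileSync } from 'fs';

const ENDPOINT = 'http://localhost:8000/v1/completions';

async function timedRequest(prompt: string): Promise<number> {
  const start = Date.now();
  const res = await fetch(ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'Qwen/Qwen3-30B-A3B', prompt, max_tokens: 1 }),
  });
  await res.json();
  return (Date.now() - start) / 1000;
}

async function main(): Promise<void> {
  const prompt = readFileSync('/tmp/long-doc.txt', 'utf8'); // long-context document
  const first = await timedRequest(prompt);
  const repeat = await timedRequest(prompt);
  console.log(`first: ${first.toFixed(2)}s, repeat: ${repeat.toFixed(2)}s`);
}

main().catch(err => { console.error(err); process.exit(1); });
```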