s prefix cache.
2. Cold Tier A (LAN RAM - Attention): Offloaded attention layers stored on a LAN machine. Access via gRPC with ~0.5ms RTT.
3. Cold Tier B (LAN RAM - SSM/Linear): Dedicated storage for state-space model or linear-attention layers in hybrid architectures.
Implementation Flow
Eviction Path:
When the GPU cache manager evicts a block, the plugin intercepts the event. The block is quantized using a head-aligned INT8 scheme and transmitted to the vault via a fire-and-forget gRPC Store call. The GPU block is freed immediately, minimizing latency impact on the serving thread.
Restore Path:
Upon a cache miss for a known prefix, the engine triggers a BatchPromote RPC. This request fetches all required layers in a single round-trip. The vault decodes the blocks in parallel (Rust-backed implementations release the Python GIL while decoding) and streams the tensors back. The engine injects these tensors directly into the paged KV buffer, bypassing attention recomputation.
Quantization Strategy: Head-Aligned INT8
To maximize compression without degrading model quality, the system uses a per-group INT8 quantizer. Key design decisions include:
- Group Alignment: Quantization groups are aligned to attention head boundaries. The group size equals the head dimension (e.g., 256 for Qwen3.6-35B-A3B). This isolation prevents outlier values in one head from corrupting the quantization of neighboring heads.
- Compression Ratio: Achieves approximately 3.9× compression.
- Quality Metric: Maintains Signal-to-Noise Ratio (SNR) ≥ 52 dB, ensuring negligible impact on generation quality for most workloads.
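The grouping idea can be sketched in a few lines. This is a minimal illustration, not the plugin's quantizer: it assumes symmetric quantization with one FP32 scale per head-sized group, and every name in it (`quantizeHeadAligned`, `dequantize`, `snrDb`, `compressionRatio`) is hypothetical.

```typescript
// Sketch only: symmetric per-group INT8 quantization, one group per attention head.
// Assumption: one FP32 scale is stored per group; the real scheme may differ.

interface QuantizedBlock {
  values: Int8Array;     // one INT8 code per element
  scales: Float32Array;  // one FP32 scale per head-sized group
  groupSize: number;
}

function quantizeHeadAligned(data: Float32Array, headDim: number): QuantizedBlock {
  if (data.length % headDim !== 0) {
    throw new Error('tensor length must be a multiple of the head dimension');
  }
  const groups = data.length / headDim;
  const values = new Int8Array(data.length);
  const scales = new Float32Array(groups);
  for (let g = 0; g < groups; g++) {
    const start = g * headDim;
    // The scale depends only on this head, so an outlier elsewhere cannot affect it.
    let maxAbs = 0;
    for (let i = 0; i < headDim; i++) {
      maxAbs = Math.max(maxAbs, Math.abs(data[start + i]));
    }
    scales[g] = maxAbs > 0 ? maxAbs / 127 : 1;
    const scale = scales[g]; // read back the FP32-rounded scale used for decoding
    for (let i = 0; i < headDim; i++) {
      const code = Math.round(data[start + i] / scale);
      values[start + i] = Math.max(-127, Math.min(127, code));
    }
  }
  return { values, scales, groupSize: headDim };
}

function dequantize(q: QuantizedBlock): Float32Array {
  const out = new Float32Array(q.values.length);
  for (let i = 0; i < out.length; i++) {
    out[i] = q.values[i] * q.scales[Math.floor(i / q.groupSize)];
  }
  return out;
}

// Signal-to-noise ratio in dB between the original and the restored tensor.
function snrDb(original: Float32Array, restored: Float32Array): number {
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < original.length; i++) {
    const err = original[i] - restored[i];
    signal += original[i] * original[i];
    noise += err * err;
  }
  return noise === 0 ? Infinity : 10 * Math.log10(signal / noise);
}

// Ratio of source bytes to stored bytes (INT8 codes plus per-group scales).
function compressionRatio(q: QuantizedBlock, bytesPerSourceElement: number): number {
  const sourceBytes = q.values.length * bytesPerSourceElement;
  const storedBytes = q.values.byteLength + q.scales.byteLength;
  return sourceBytes / storedBytes;
}
```

Under these assumptions, a 256-element group costs 256 bytes of codes plus a 4-byte scale. Against an FP32 source that is 1024 / 260 ≈ 3.94×, which is in line with the figure above; against an FP16 source the same layout would give about 1.97×. The SNR this sketch produces depends on the data and need not match the 52 dB figure.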
Code Implementation Example
The following TypeScript interfaces define the configuration and connector structure for a distributed KV tiering plugin. This abstraction demonstrates how to integrate with an inference engine's plugin system.
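// kv-tier-engine/types.ts
// Tensor and ConnectorMetadata are not defined in this excerpt; the module
// path below is an assumed location.
import type { Tensor, ConnectorMetadata } from './tensor';

export interface DistributedKVConfig {
  cluster: {
    role: 'inference' | 'vault_attention' | 'vault_ssm';
    vaultEndpoint: string;      // attention vault (Cold Tier A)
    ssmVaultEndpoint?: string;  // SSM/linear-attention vault (Cold Tier B)
  };
  quantization: {
    enabled: boolean;
    algorithm: 'INT8_HEAD_ALIGNED';
    headDimension: number;      // also used as the quantization group size
    snrThresholdDb: number;
  };
  network: {
    maxRetries: number;
    timeoutMs: number;
  };
}

export interface KVConnectorPlugin {
  onEviction(blockId: string, tensor: Tensor): Promise<void>;
  onRestore(prefixHash: string): Promise<Tensor[]>;
  getMetadata(): ConnectorMetadata;
}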
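// kv-tier-engine/connectors/vllm-adapter.ts
import { KVConnectorPlugin, DistributedKVConfig } from '../types';
import { VaultClient } from '../client';
// Assumed locations: the excerpt does not show where these are declared.
import { loadConfig } from '../config';
import type { Tensor, ConnectorMetadata } from '../tensor';

export class DistributedKVConnector implements KVConnectorPlugin {
  private vaultClient: VaultClient;
  private config: DistributedKVConfig;

  constructor(configPath: string) {
    this.config = loadConfig(configPath);
    this.vaultClient = new VaultClient(this.config.cluster.vaultEndpoint);
  }

  async onEviction(blockId: string, tensor: Tensor): Promise<void> {
    if (!this.config.quantization.enabled) {
      await this.vaultClient.storeRaw(blockId, tensor);
      return;
    }
    // One quantization group per attention head
    const quantized = tensor.quantize({
      algorithm: this.config.quantization.algorithm,
      groupSize: this.config.quantization.headDimension,
    });
    // Fire-and-forget to minimize eviction latency
    this.vaultClient.storeQuantized(blockId, quantized)
      .catch(err => console.error(`Vault store failed: ${err.message}`));
  }

  async onRestore(prefixHash: string): Promise<Tensor[]> {
    // Note: this path assumes quantization is enabled. Blocks written via
    // storeRaw would need a separate, non-decoding restore path.
    const batchRequest = {
      prefixHash,
      layers: 'all',
      format: 'quantized',
    };
    // Single round-trip for every layer of the prefix
    const response = await this.vaultClient.batchPromote(batchRequest);
    return response.blocks.map(block =>
      block.decode({
        groupSize: this.config.quantization.headDimension,
        snrCheck: this.config.quantization.snrThresholdDb,
      })
    );
  }

  getMetadata(): ConnectorMetadata {
    // Required by KVConnectorPlugin. The shape of ConnectorMetadata is not
    // shown in this excerpt, so these fields are illustrative only.
    return {
      name: 'DistributedKVConnector',
      quantized: this.config.quantization.enabled,
    } as ConnectorMetadata;
  }
}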
Architecture Rationale
- Separate SSM Vault: Hybrid models like Qwen3.6-35B-A3B contain both full-attention and linear-attention layers. Routing these to separate vaults allows independent scaling and optimization, as SSM states have different access patterns and compression characteristics.
- BatchPromote RPC: Fetching all layers in a single RPC call reduces network round-trips. This is critical when the vault is on a LAN machine: each extra round-trip adds latency, and the restore path only pays off while it stays faster than recomputing the prefix's attention on the GPU.
- Plugin API Integration: Using the engine's KVConnectorBase_V1 API allows deployment without rebuilding the inference server. This reduces operational friction and enables incremental adoption.
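The round-trip argument behind BatchPromote is simple arithmetic, sketched below. The function name and the layer count of 40 are hypothetical, and the per-layer case assumes the RPCs would be issued one after another (a pipelined client would do better). Transfer and decode time are ignored.

```typescript
// Sketch: round-trip overhead of restoring a prefix, ignoring transfer and decode time.
// Assumption: without batching, per-layer RPCs run sequentially (worst case).
function restoreRoundTripOverheadMs(
  layerCount: number,
  rttMs: number,
  batched: boolean,
): number {
  const roundTrips = batched ? 1 : layerCount;
  return roundTrips * rttMs;
}

// 40 layers is a hypothetical count; 0.5 ms is the LAN RTT quoted for Cold Tier A.
const perLayerMs = restoreRoundTripOverheadMs(40, 0.5, false);
const batchedMs = restoreRoundTripOverheadMs(40, 0.5, true);
console.log({ perLayerMs, batchedMs });
```

With these inputs the sequential per-layer case pays 20 ms in round-trips alone, while the batched call pays 0.5 ms.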
Pitfall Guide
1. Network Latency Blindness
Explanation: Distributed restoration relies on sub-millisecond LAN access. WiFi or WAN connections introduce latency that negates the compute savings.
Fix: Deploy vault nodes on the same subnet with 5GbE or 10GbE connectivity. Verify that RTT stays consistently below 1 ms, in line with the ~0.5 ms LAN target. Use network monitoring to detect jitter.
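One way to turn that check into something automatable is to summarize a batch of measured round-trip times. The sketch below assumes the samples were collected elsewhere (for example by timing a no-op RPC); `summarizeRtt` and the budget parameter are hypothetical names, and the percentile uses the nearest-rank method.

```typescript
// Sketch: summarize measured RTT samples and compare the tail against a budget.
// How the samples are collected is outside the scope of this example.
interface RttSummary {
  p50Ms: number;
  p99Ms: number;
  jitterMs: number;       // spread between median and tail latency
  withinBudget: boolean;  // true when the p99 stays at or below the budget
}

function summarizeRtt(samplesMs: number[], budgetMs: number): RttSummary {
  if (samplesMs.length === 0) {
    throw new Error('need at least one RTT sample');
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Nearest-rank percentile over the sorted samples.
  const rank = (p: number): number =>
    sorted[Math.max(0, Math.ceil((p / 100) * sorted.length) - 1)];
  const p50Ms = rank(50);
  const p99Ms = rank(99);
  return { p50Ms, p99Ms, jitterMs: p99Ms - p50Ms, withinBudget: p99Ms <= budgetMs };
}
```

Judging the tail rather than the mean matters here: a link with a healthy median but occasional multi-millisecond spikes would still stall restores.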
2. Quantization Drift in Critical Tasks
Explanation: INT8 quantization introduces minor numerical noise. While SNR ≥ 52 dB is safe for general generation, tasks requiring bit-for-bit reproducibility may fail.
Fix: Provide a configuration flag to disable quantization for sensitive workloads. Monitor SNR metrics and alert if SNR drops below the configured threshold (e.g., 52 dB), which indicates potential quality issues.
3. Hybrid Model Routing Errors
Explanation: Failing to distinguish between attention and SSM layers can cause blocks to be stored in the wrong vault, leading to restore failures or corruption.
Fix: Implement auto-detection of layer types based on model architecture metadata. Ensure the configuration explicitly maps layer indices to the correct vault endpoint.
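A routing table of this kind might look as follows. This is a sketch under assumptions: the per-layer kinds are taken to be already derived from the model's architecture metadata, and `LayerKind`, `VaultEndpoints`, and `buildLayerRouting` are hypothetical names.

```typescript
// Sketch: map each layer index to the vault that should hold its state.
// Assumption: layer kinds were already derived from model architecture metadata.
type LayerKind = 'full_attention' | 'linear_attention';

interface VaultEndpoints {
  attention: string;
  ssm?: string;
}

function buildLayerRouting(
  layerKinds: LayerKind[],
  endpoints: VaultEndpoints,
): Map<number, string> {
  const routing = new Map<number, string>();
  layerKinds.forEach((kind, index) => {
    if (kind === 'full_attention') {
      routing.set(index, endpoints.attention);
      return;
    }
    // Fail at startup rather than silently sending SSM state to the attention vault.
    if (!endpoints.ssm) {
      throw new Error(
        `layer ${index} is linear-attention but no SSM vault endpoint is configured`,
      );
    }
    routing.set(index, endpoints.ssm);
  });
  return routing;
}
```

Building the table once at startup and failing loudly on a missing endpoint surfaces a misconfiguration before any block is stored in the wrong place.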
4. Over-Provisioning Cold Storage
Explanation: Allocating excessive RAM to the vault can starve other processes on the LAN machine. KV blocks are large, and unbounded growth can cause OOM conditions.
Fix: Set a byte limit in the vault configuration (max_storage_bytes in the template below). Implement LRU eviction policies on the vault side to manage capacity. Monitor storage utilization and tune limits based on actual reuse patterns.
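The combination of a byte cap and LRU eviction can be sketched with a `Map`, which preserves insertion order. This is an illustration of the policy only, not the vault's storage engine; `ByteBoundedLru` is a hypothetical name.

```typescript
// Sketch: LRU store bounded by total bytes rather than entry count.
// Relies on Map preserving insertion order: the first key is the least recently used.
class ByteBoundedLru {
  private entries = new Map<string, Uint8Array>();
  private usedBytes = 0;
  private maxBytes: number;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  get bytesInUse(): number {
    return this.usedBytes;
  }

  get(key: string): Uint8Array | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  // Stores a block and returns the keys evicted to make room for it.
  put(key: string, value: Uint8Array): string[] {
    if (value.byteLength > this.maxBytes) {
      throw new Error('block is larger than the configured capacity');
    }
    const existing = this.entries.get(key);
    if (existing !== undefined) {
      this.usedBytes -= existing.byteLength;
      this.entries.delete(key);
    }
    const evicted: string[] = [];
    while (this.usedBytes + value.byteLength > this.maxBytes) {
      const oldest = this.entries.keys().next().value as string;
      this.usedBytes -= this.entries.get(oldest)!.byteLength;
      this.entries.delete(oldest);
      evicted.push(oldest);
    }
    this.entries.set(key, value);
    this.usedBytes += value.byteLength;
    return evicted;
  }
}
```

Returning the evicted keys from `put` gives the caller a hook for metrics, which helps with the tuning step: a high eviction rate for recently stored prefixes suggests the cap is too small for the reuse pattern.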
5. Ignoring Tensor Parallelism Constraints
Explanation: Some distributed KV implementations do not support tensor-parallel inference configurations. Attempting to use both can cause synchronization errors.
Fix: Verify compatibility with the inference engine's parallelism mode. If the plugin does not support tensor parallelism, use data parallelism or single-GPU setups until it supports sharded KV distribution.
6. Short Prompt Inefficiency
Explanation: For very short prompts, the network overhead of vault access may exceed the cost of a local prefill.
Fix: Implement a threshold check. If the prefix length is below a certain block count, skip vault lookup and proceed with local computation. This optimization prevents unnecessary RPC calls.
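Such a gate can be a one-line predicate in front of the restore path. In this sketch `shouldQueryVault` and `minBlocks` are hypothetical; the right threshold depends on measured prefill cost versus RPC cost and would need to be tuned per deployment.

```typescript
// Sketch: skip the vault lookup when the prefix covers too few full KV blocks.
// minBlocks is a tunable assumption, not a value taken from the plugin.
function shouldQueryVault(
  prefixTokens: number,
  blockSize: number,
  minBlocks: number,
): boolean {
  const fullBlocks = Math.floor(prefixTokens / blockSize);
  return fullBlocks >= minBlocks;
}

// With a block size of 16 and a hypothetical threshold of 8 blocks:
console.log(shouldQueryVault(100, 16, 8));  // 6 full blocks, below the threshold
console.log(shouldQueryVault(4096, 16, 8)); // 256 full blocks, above the threshold
```

Only full blocks are counted because a partially filled trailing block would be recomputed locally in any case.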
7. Single-Shot Workload Waste
Explanation: Deploying distributed caching for workloads with no prefix reuse adds infrastructure cost without benefit.
Fix: Profile request patterns before deployment. If prefix reuse rate is low, disable the plugin to avoid network overhead and storage costs.
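A rough reuse estimate can be computed offline from a request log. The sketch below assumes each request has already been reduced to a prefix hash; `prefixReuseRate` is a hypothetical helper, and it counts a request as a hit only when the same hash appeared earlier in the log.

```typescript
// Sketch: fraction of requests whose prefix hash was already seen earlier in the log.
// Assumption: the log is ordered by arrival time and hashes identify reusable prefixes.
function prefixReuseRate(prefixHashes: string[]): number {
  if (prefixHashes.length === 0) return 0;
  const seen = new Set<string>();
  let hits = 0;
  for (const hash of prefixHashes) {
    if (seen.has(hash)) {
      hits++;
    } else {
      seen.add(hash);
    }
  }
  return hits / prefixHashes.length;
}
```

This is an upper bound on what the vault could serve, since it ignores capacity limits and eviction; a low value here is still a strong signal that the plugin would add cost without benefit.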
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High Reuse RAG | Distributed Tiering | Massive compute savings on repeated documents | Low (LAN infra) |
| Single Shot API | Native GPU Cache | No network overhead; simpler architecture | Zero |
| Hybrid Model | Tiered with SSM Vault | Optimizes storage for mixed layer types | Medium |
| Tight VRAM Budget | Distributed Tiering | Extends context window beyond GPU limits | Low |
| Exact Reproducibility | Disable Quantization | Ensures bit-for-bit output consistency | Low |
| High Latency Network | Native GPU Cache | Network overhead negates benefits | Zero |
Configuration Template
# kv-tier-config.toml
[cluster_topology]
role = "inference"
vault_endpoint = "192.168.1.50:50051"
ssm_vault_endpoint = "192.168.1.51:50051"
[quantization_profile]
enabled = true
algorithm = "INT8_HEAD_ALIGNED"
head_dimension = 256
snr_threshold_db = 52
[vault_limits]
max_storage_bytes = 24_000_000_000
eviction_policy = "LRU"
[network_settings]
rpc_timeout_ms = 500
max_retries = 2
Quick Start Guide
- Install the engine:
pip install kv-tier-engine
- Create configuration:
Save the template above as kv-tier-config.toml and adjust endpoints.
- Start vault on LAN machine:
kv-tier vault start --config kv-tier-config.toml
- Launch inference server:
vllm serve Qwen/Qwen3.6-35B-A3B \
--kv-transfer-config '{
"kv_connector": "DistributedKVConnector",
"kv_connector_module_path": "kv_tier.plugins.vllm",
"kv_role": "kv_both",
"kv_connector_extra_config": {"config_path": "/etc/kv-tier-config.toml"}
}' \
--enable-prefix-caching \
--block-size 16
- Validate:
Send a repeated long-context request and observe latency reduction in engine metrics.