
KV FP8 with Gemma 4 26B

By Codcompass Team · 7 min read

TPU-Scale Inference: Unlocking High-Concurrency LLM Serving with KV Cache Quantization

Current Situation Analysis

Deploying large language models on specialized accelerators like Google's TPU v6e series introduces a distinct bottleneck that traditional GPU-centric optimization guides rarely address: KV cache memory exhaustion. While most engineering teams focus on weight quantization (INT8/FP8) to reduce VRAM/HBM pressure, the attention state cache grows linearly with both batch size and context length. At scale, the KV cache quickly dwarfs the model weights themselves.

This problem is frequently overlooked because benchmarking is typically performed at low concurrency (1-16 users) or short contexts (<4k tokens). Under those conditions, memory bandwidth and compute dominate. However, production workloads rarely behave this way. When serving interactive APIs or batch processing pipelines, concurrent request counts spike, and context windows expand. On a TPU v6e cluster with 128GB of HBM per node, a standard bfloat16 KV cache will trigger out-of-memory (OOM) failures around 256-512 concurrent users at 16k context lengths. The industry default response is horizontal scaling: adding more TPU chips. This approach is capital-inefficient and introduces network latency overhead that degrades tail latency.

The misunderstanding stems from treating the KV cache as a secondary concern. In reality, the decode phase is strictly memory-bandwidth bound: moving 2-byte bfloat16 cache values across the HBM bus saturates the memory system long before the TPU's matrix units reach peak utilization. Without shrinking the cache footprint, the hardware cannot be driven toward saturation without risking OOM failures. Recent production benchmarks on the google/gemma-4-26B-A4B-it model demonstrate that shifting optimization focus from weights to the KV cache unlocks previously unreachable concurrency tiers while maintaining sub-second latency floors for interactive workloads.
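
To make the footprint concrete: total KV cache size is roughly concurrency × context length × bytes per cached token. The sketch below is a rough calculator; the ~2 KB-per-token bfloat16 figure is back-derived from the benchmark table in the next section (~33.4 GB for 16.7M tokens), not from Gemma 4's published layer and head dimensions, so treat it as an assumption to recalibrate for your model.

# kv_footprint_estimate.py -- rough KV cache sizing sketch
# Assumption: ~2 KB of KV state per cached token in bf16, back-derived from the
# ~33.4 GB / 16.7M-token figure below; the exact value depends on the model's
# layer count, KV head count, and head dimension.
BYTES_PER_TOKEN_BF16 = 2_000
BYTES_PER_TOKEN_FP8 = BYTES_PER_TOKEN_BF16 // 2   # FP8 halves every cached element

def kv_cache_gb(concurrent_users: int, context_len: int, bytes_per_token: int) -> float:
    """Total KV cache footprint for a fully packed batch, in GB."""
    return concurrent_users * context_len * bytes_per_token / 1e9

for users in (256, 512, 1024):
    bf16 = kv_cache_gb(users, 16_384, BYTES_PER_TOKEN_BF16)
    fp8 = kv_cache_gb(users, 16_384, BYTES_PER_TOKEN_FP8)
    print(f"{users:>4} users @ 16k ctx -> bf16: {bf16:5.1f} GB | fp8: {fp8:5.1f} GB")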

WOW Moment: Key Findings

The deployment of KV FP8 quantization on a TPU v6e cluster fundamentally alters the capacity-to-latency curve. By halving the memory footprint of the attention states, the system bypasses the HBM ceiling that traditionally caps concurrent request handling. The following comparison illustrates the operational shift between standard bfloat16 caching and FP8-optimized caching under identical hardware constraints.

| Approach | Max Stable Concurrency | Peak Prefill Throughput | KV Cache Memory Footprint | TTFT at Peak Load |
| --- | --- | --- | --- | --- |
| BF16 KV Cache | ~256 users (16k ctx) | ~185,000 tok/s | ~33.4 GB (16.7M tokens) | ~14.8s |
| FP8 KV Cache | 1024 users (16k ctx) | 475,552 tok/s | ~16.7 GB (16.7M tokens) | ~19.2s |

Why this matters: The FP8 configuration doesn't just prevent crashes; it doubles the hardware's effective capacity. The 475,552 tokens per second prefill rate represents near-linear scaling across the TPU v6e's memory hierarchy. More importantly, the TTFT increase from ~14.8s to ~19.2s at peak load is a predictable queueing delay, not a hardware failure. This transforms an unstable, OOM-prone deployment into a deterministic, high-density inference engine capable of handling both real-time chat interfaces and massive document ingestion pipelines on the same cluster.

Core Solution

Achieving this performance profile requires a coordinated configuration across the inference runtime, cache quantization layer, and speculative decoding engine. The architecture prioritizes memory bandwidth efficiency during decode and compute saturation during prefill.

Step 1: Runtime Initialization & Cache Quantization

The vLLM engine must be configured to allocate KV cache blocks using 8-bit floating point precision. This requires explicit quantization parameters that map the attention state tensors to FP8 without degrading the model's perplexity. The gemma-4-26B-A4B-it architecture tolerates FP8 KV caching with negligible accuracy loss due to its mixture-of-experts routing and normalized attention heads.

# inference_engine.py
from typing import Optional

from vllm import AsyncEngineArgs, AsyncLLMEngine


class TPUInferenceOrchestrator:
    def __init__(
        self,
        model_id: str,
        max_context_window: int,
        kv_precision: str = "fp8",
        speculative_method: Optional[str] = None,
    ):
        # AsyncLLMEngine expects AsyncEngineArgs. Note that the exact
        # speculative-decoding argument names vary across vLLM releases,
        # so verify them against your installed version before deploying.
        self.engine_args = AsyncEngineArgs(
            model=model_id,
            max_model_len=max_context_window,
            dtype="bfloat16",                      # weights stay in bfloat16
            kv_cache_dtype=kv_precision,           # "fp8" halves KV cache traffic
            speculative_decoding_method=speculative_method,
            num_speculative_tokens=4,
            tensor_parallel_size=4,                # matches a TPU v6e-4 node
            enable_chunked_prefill=True,
            max_num_batched_tokens=32768,
        )
        self._engine: Optional[AsyncLLMEngine] = None

    async def initialize(self) -> AsyncLLMEngine:
        # Lazily construct the engine so the orchestrator can be created
        # before the TPU runtime is ready.
        if self._engine is None:
            self._engine = AsyncLLMEngine.from_engine_args(self.engine_args)
        return self._engine
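
A minimal usage sketch for the orchestrator above, assuming an asyncio entry point; the AsyncLLMEngine.generate call signature and RequestOutput fields should be checked against the installed vLLM release.

# serve_example.py -- minimal usage sketch for the orchestrator above
import asyncio
import uuid

from vllm import SamplingParams

from inference_engine import TPUInferenceOrchestrator

async def main() -> None:
    orchestrator = TPUInferenceOrchestrator(
        model_id="google/gemma-4-26B-A4B-it",
        max_context_window=33280,        # target context plus generation buffer (see Pitfall 1)
        kv_precision="fp8",
        speculative_method="ngram",
    )
    engine = await orchestrator.initialize()

    params = SamplingParams(temperature=0.7, max_tokens=256)
    final = None
    # AsyncLLMEngine.generate yields incremental RequestOutput objects; the exact
    # signature may differ slightly between vLLM releases.
    async for output in engine.generate(
        "Summarize the TPU v6e memory hierarchy.", params, request_id=str(uuid.uuid4())
    ):
        final = output
    if final is not None:
        print(final.outputs[0].text)

if __name__ == "__main__":
    asyncio.run(main())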

Step 2: Speculative Decoding Integration

N-gram speculative decoding complements FP8 KV caching by reducing the number of decode steps required per generated token. The TPU v6e architecture benefits from this approach because the overhead of verifying speculative tokens is minimal compared to the bandwidth savings from fewer HBM round-trips. The engine maintains a small draft buffer that proposes tokens, which are then verified in a single forward pass.
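
Conceptually, the proposal step is just token matching over context the model has already seen. The sketch below illustrates prompt-lookup style n-gram drafting in plain Python; it is not vLLM's internal implementation, and the n-gram size and draft length are illustrative defaults.

# ngram_draft_sketch.py -- conceptual n-gram (prompt-lookup) proposal, not vLLM internals
from typing import List

def propose_draft(token_ids: List[int], ngram_size: int = 3,
                  num_speculative_tokens: int = 4) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier context."""
    if len(token_ids) < ngram_size:
        return []
    tail = token_ids[-ngram_size:]
    # Scan backwards so the most recent earlier occurrence of the n-gram wins.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            return token_ids[start + ngram_size:start + ngram_size + num_speculative_tokens]
    return []

# The target model then verifies all draft tokens in one forward pass,
# accepting the longest prefix whose predictions agree with its own.
print(propose_draft([5, 9, 2, 7, 11, 9, 2, 7]))   # -> [11, 9, 2, 7]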

Step 3: Client-Side Streaming & Health Monitoring

High-concurrency deployments require robust client handling. TTFT will naturally increase as the queue depth grows. Implementing server-sent events (SSE) or chunked streaming ensures the client receives incremental tokens, preventing timeout errors and improving perceived responsiveness.

// client/llm_stream_handler.ts

interface InferenceRequest {
  prompt: string;
  maxTokens: number;
  temperature: number;
}

export class TPUStreamClient {
  private endpoint: string;
  private abortController: AbortController;

  constructor(endpoint: string) {
    this.endpoint = endpoint;
    this.abortController = new AbortController();
  }

  async streamCompletion(request: InferenceRequest): Promise<ReadableStream> {
    const response = await fetch(`${this.endpoint}/v1/completions`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'google/gemma-4-26B-A4B-it',
        prompt: request.prompt,
        max_tokens: request.maxTokens,
        temperature: request.temperature,
        stream: true
      }),
      signal: this.abortController.signal
    });

    if (!response.ok) {
      throw new Error(`Inference failed: ${response.status}`);
    }

    return response.body as ReadableStream;
  }

  cancel() {
    this.abortController.abort();
  }
}
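
For load tests or server-side consumers, the same streaming contract can be exercised from Python. A minimal sketch against an OpenAI-compatible completions endpoint using the requests library; the endpoint URL and the SSE data:-line framing are assumptions to verify against your deployment.

# stream_consumer.py -- minimal Python consumer for the streaming endpoint (sketch)
import json
import requests

def stream_completion(endpoint: str, prompt: str, max_tokens: int = 256) -> str:
    """Consume an OpenAI-style SSE completion stream and return the full text."""
    payload = {
        "model": "google/gemma-4-26B-A4B-it",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": True,                      # required at scale; see Pitfall 4
    }
    text_parts = []
    with requests.post(f"{endpoint}/v1/completions", json=payload, stream=True,
                       timeout=(5, 300)) as resp:   # long read timeout: TTFT grows with queue depth
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data: "):
                continue
            chunk = line[len("data: "):]
            if chunk == "[DONE]":
                break
            text_parts.append(json.loads(chunk)["choices"][0]["text"])
    return "".join(text_parts)

if __name__ == "__main__":
    print(stream_completion("http://localhost:8000",
                            "Explain KV cache quantization in one paragraph."))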

Architecture Rationale

  • bfloat16 for weights: TPUs natively accelerate bfloat16 matrix multiplications. Keeping weights in this precision avoids quantization artifacts in the expert routing layers while maintaining full hardware utilization.
  • FP8 for KV cache: Attention states are read repeatedly during decode but only written once per token. FP8 reduces HBM traffic by 50%, directly increasing decode throughput without impacting weight precision.
  • Chunked Prefill: Breaks long prompts into manageable segments, allowing the scheduler to interleave decode steps from other requests. This prevents head-of-line blocking when processing 16k+ context windows.
  • N-gram Speculation: Zero-training overhead, deterministic, and aligns well with TPU memory hierarchy. Verification passes are batched efficiently, reducing total decode iterations by ~15-20%.

Pitfall Guide

1. The Exact Boundary Trap

Explanation: Setting max_model_len exactly to your target context length (e.g., 32768) causes HTTP 400 failures when the engine attempts to generate the first token. The engine calculates prompt_length + generation_tokens, which exceeds the hard limit. Fix: Always allocate a generation buffer. If your target context is 32k, set max_model_len to 33280 or higher. For production stability, cap prompts at 24k-28k to leave headroom for output.
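
A pre-flight budget check catches the boundary trap before a request ever reaches the engine. A minimal sketch, assuming prompt lengths are already counted with the model's tokenizer:

# context_budget_check.py -- reject requests that would overflow max_model_len (sketch)
MAX_MODEL_LEN = 33280        # target 32k context plus generation buffer
GENERATION_BUFFER = 512      # minimum headroom reserved for output tokens

def validate_request(prompt_tokens: int, max_new_tokens: int) -> None:
    """Raise before dispatch instead of letting the engine return HTTP 400."""
    if prompt_tokens + max_new_tokens > MAX_MODEL_LEN:
        raise ValueError(
            f"prompt ({prompt_tokens}) + max_tokens ({max_new_tokens}) exceeds "
            f"max_model_len ({MAX_MODEL_LEN}); trim the prompt or reduce max_tokens"
        )
    if prompt_tokens > MAX_MODEL_LEN - GENERATION_BUFFER:
        raise ValueError(
            f"prompt leaves less than {GENERATION_BUFFER} tokens of generation headroom"
        )

validate_request(prompt_tokens=28_000, max_new_tokens=1_024)   # passes
# validate_request(prompt_tokens=32_768, max_new_tokens=512)   # raises: exact-boundary trap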

2. Cold Start Latency Misinterpretation

Explanation: Initial requests after deployment show elevated TTFT (0.4s-1.1s vs historical 0.1s). This is not a performance regression; it's JAX compilation caching and KV cache warm-up. Fix: Implement a synthetic warm-up routine that sends 3-5 dummy requests across varying context lengths before routing production traffic. This pre-compiles the XLA graphs and populates the initial cache blocks.
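
A minimal warm-up sketch, assuming the OpenAI-compatible HTTP endpoint from the client example above; prompt lengths are only approximated by repeating filler text rather than tokenizing exactly:

# warmup.py -- synthetic warm-up to pre-compile XLA graphs across context buckets (sketch)
import time
import requests

WARMUP_CONTEXTS = [1_024, 4_096, 8_192, 16_384, 24_576]   # one request per expected bucket

def run_warmup(endpoint: str) -> None:
    for ctx in WARMUP_CONTEXTS:
        # Roughly ctx tokens of filler, assuming ~2 tokens per repeat; verify with
        # the real tokenizer so the warm-up prompt stays under max_model_len.
        prompt = "lorem " * (ctx // 2)
        start = time.perf_counter()
        resp = requests.post(
            f"{endpoint}/v1/completions",
            json={"model": "google/gemma-4-26B-A4B-it", "prompt": prompt,
                  "max_tokens": 8, "temperature": 0.0},
            timeout=600,    # first hit per bucket may include XLA compilation
        )
        resp.raise_for_status()
        print(f"warm-up @ ~{ctx} tokens: {time.perf_counter() - start:.1f}s")

if __name__ == "__main__":
    run_warmup("http://localhost:8000")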

3. FP8 Precision Anxiety

Explanation: Teams often apply FP8 to both weights and KV cache, fearing accuracy loss. However, weight quantization introduces routing errors in MoE models like Gemma 4. Fix: Quantize only the KV cache. Keep weights in bfloat16 or float16. Modern attention mechanisms are highly resilient to FP8 KV states, but weight precision directly impacts expert selection accuracy.

4. Streaming Neglect at Scale

Explanation: At 128+ concurrent users, TTFT can exceed 10 seconds. Non-streaming clients will timeout or appear frozen, triggering retry storms that amplify queue depth. Fix: Enforce stream: true at the API gateway level. Implement client-side timeout overrides for streaming connections and display progressive token rendering to maintain UX stability.

5. Concurrency Over-Saturation

Explanation: Pushing beyond 1024 concurrent users yields diminishing throughput returns. Prefill rates plateau near 480k tok/s, but TTFT doubles as requests queue. Fix: Implement admission control at the load balancer. Queue excess requests externally (e.g., Redis-backed task queue) and drain them at a rate that maintains TTFT under your SLA threshold.
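
Admission control can start as a bounded semaphore in front of the TPU route, with overflow handed to an external queue instead of being rejected. A minimal in-process sketch (a Redis-backed variant would replace the overflow exception with an enqueue):

# admission_control.py -- cap in-flight requests routed to the TPU cluster (sketch)
import asyncio

MAX_IN_FLIGHT = 1024                      # beyond this, throughput plateaus and TTFT doubles
_slots = asyncio.Semaphore(MAX_IN_FLIGHT)

class ClusterSaturated(Exception):
    """Raised when the TPU route is full; callers should enqueue or retry later."""

async def admit_and_run(handler, *args, acquire_timeout: float = 0.5):
    try:
        # Fail fast instead of silently joining an already-deep engine queue.
        await asyncio.wait_for(_slots.acquire(), timeout=acquire_timeout)
    except asyncio.TimeoutError:
        raise ClusterSaturated("route overflow to an external queue (e.g. Redis) instead")
    try:
        return await handler(*args)
    finally:
        _slots.release()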

6. TPU Compilation Cache Eviction

Explanation: Dynamic batch sizes or varying context lengths force repeated XLA recompilation, causing latency spikes and HBM fragmentation. Fix: Use fixed bucket sizes for context lengths (e.g., 1k, 4k, 8k, 16k) and enable enable_prefix_caching. This maximizes cache hits and minimizes compilation overhead.
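
Bucketing is a small lookup at request time: round every prompt up to the nearest fixed bucket before it reaches the scheduler. A minimal sketch using the buckets discussed in this guide; the 28k upper tier is an assumed value for the long-context scenario:

# context_buckets.py -- round prompt lengths up to fixed buckets to stabilize XLA shapes (sketch)
import bisect

CONTEXT_BUCKETS = [1_024, 4_096, 8_192, 16_384, 28_672]   # last tier covers 24k-28k research prompts

def bucket_for(prompt_tokens: int) -> int:
    """Return the smallest configured bucket that fits the prompt."""
    idx = bisect.bisect_left(CONTEXT_BUCKETS, prompt_tokens)
    if idx == len(CONTEXT_BUCKETS):
        raise ValueError(f"prompt of {prompt_tokens} tokens exceeds the largest bucket")
    return CONTEXT_BUCKETS[idx]

assert bucket_for(900) == 1_024
assert bucket_for(4_096) == 4_096
assert bucket_for(17_000) == 28_672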

Production Bundle

Action Checklist

  • Validate KV cache quantization: Confirm kv_cache_dtype="fp8" is active and monitor HBM utilization during load tests.
  • Configure generation buffer: Set max_model_len to at least target_context + 512 to prevent boundary errors.
  • Implement synthetic warm-up: Run 5-10 background requests post-deployment to populate JAX and KV caches.
  • Enforce streaming: Configure API gateway to reject non-streaming requests for long-context endpoints.
  • Set concurrency admission limits: Cap direct TPU routing at 1024 users; queue overflow externally.
  • Monitor TTFT percentiles: Track p50, p95, and p99 latency separately; p99 spikes indicate queue saturation.
  • Verify speculative decoding metrics: Ensure the verification acceptance rate stays above 60%; a drop below 40% indicates a draft mismatch.

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Interactive Chat / UI | 1-8 concurrency, FP8 KV, n-gram speculation | Minimizes TTFT to <1.5s; TPU cores remain underutilized but latency floor is optimal | Higher cost per token; lower hardware utilization |
| Standard API / RAG | 32-64 concurrency, FP8 KV, chunked prefill | Balances 213k tok/s throughput with ~2.1s TTFT; maximizes ROI on TPU v6e | Optimal cost-to-performance ratio; recommended baseline |
| Batch Processing / Indexing | 128-1024 concurrency, FP8 KV, max batch tokens | Saturates HBM bandwidth; achieves 379k-475k tok/s; TTFT acceptable for async jobs | Lowest cost per token; highest hardware utilization |
| Long-Context Research | 16-32 concurrency, FP8 KV, extended context buckets | Prevents OOM while maintaining decode stability for 24k-28k prompts | Moderate cost; requires careful memory budgeting |

Configuration Template

# vllm_tpu_config.yaml
model: google/gemma-4-26B-A4B-it
dtype: bfloat16
kv_cache_dtype: fp8
max_model_len: 33280
max_num_batched_tokens: 32768
tensor_parallel_size: 4
enable_chunked_prefill: true
enable_prefix_caching: true
speculative_decoding_method: ngram  # exact flag name varies across vLLM releases; verify against your version
num_speculative_tokens: 4
gpu_memory_utilization: 0.92
swap_space: 4
enforce_eager: false
compile_cache_max_size: 1024

Quick Start Guide

  1. Provision TPU v6e Cluster: Deploy a 4-chip TPU v6e node with 128GB HBM. Ensure the TPU runtime and vLLM nightly build are installed.
  2. Apply Configuration: Copy the YAML template above and launch the engine using vllm serve or the Python orchestrator. Verify health via GET /health endpoint.
  3. Run Warm-Up Sequence: Execute 5 synthetic requests with context lengths of 1k, 4k, 8k, 16k, and 24k. Monitor compilation logs until XLA cache hits stabilize.
  4. Validate Streaming & Limits: Send a test request with stream: true and a 16k prompt. Confirm TTFT remains under 3s and tokens arrive incrementally. Check server logs for FP8 cache allocation confirmation.
  5. Scale Concurrency: Gradually increase concurrent clients to 64. Monitor prefill throughput and HBM utilization. If TTFT exceeds SLA, adjust admission control or reduce batch size.