Production-Grade MoE Inference on Trillium TPUs: Stabilizing Gemma 4 26B with vLLM

Current Situation Analysis

Deploying large mixture-of-experts (MoE) architectures on specialized accelerator hardware consistently exposes a hidden fault line in modern inference pipelines: the false dichotomy between raw throughput and runtime stability. Engineering teams routinely optimize for peak token generation rates, only to discover that production concurrency patterns trigger catastrophic latency spikes, out-of-memory (OOM) failures, or extended cold-start windows. This problem is frequently misunderstood because operators treat inference servers as stateless HTTP proxies, ignoring the underlying compiler runtime behavior and hardware memory hierarchy.

The Gemma 4 26B MoE model (google/gemma-4-26B-A4B-it) running on Google's TPU v6e-4 (Trillium architecture) exemplifies this challenge. Early deployment attempts using standard vLLM configurations demonstrated that pushing a 26B-parameter model to its theoretical throughput limits inevitably collides with JAX/XLA compilation overhead and HBM fragmentation. Benchmarking across 144 distinct concurrency points (ranging from 1 to 2048 simultaneous requests) revealed that default memory allocation strategies fail at a 94% pass rate. The most severe symptom occurs at the 2,048-token context boundary: under 256 concurrent users, the system stalls for 131.99 seconds while the runtime attempts to reallocate the HBM heap. This is not a model limitation; it is a runtime configuration mismatch between request bucketing, compiler kernel generation, and available memory headroom.

Furthermore, the operational cost of stateless deployments remains severely underestimated. Without persistent compilation caching, every container restart or crash recovery forces the JAX compiler to rebuild execution graphs from scratch, resulting in a 24-minute warm-up period. In production environments where rolling updates, autoscaling events, or hardware preemptions are routine, this downtime directly impacts availability SLAs and increases infrastructure waste.

WOW Moment: Key Findings

The breakthrough emerges when runtime parameters are aligned with Trillium hardware characteristics rather than generic GPU assumptions. By synchronizing request bucketing with optimal XLA graph shapes, reserving precise HBM headroom for speculative decoding kernels, and persisting compilation artifacts to shared memory, the system achieves a stable production state that decouples model size from performance penalties.

Deployment Phase	Model Configuration	Peak Throughput	2K Context Latency	Cold Start Time	Benchmark Pass Rate
Baseline (v1)	4B Standalone	463,345 tok/s	~0.950s	~20 min	100% (Light load)
Unstable Peak (v2)	26B Full MoE	483,930 tok/s	131.99s (Spike)	~24 min	94% (OOM Risk)
Turbo-Stable (v3)	26B Full MoE	467,825 tok/s	1.157s (Stable)	<10 sec	100% (Solid)

This data reveals three critical insights:

Latency Determinism: The 114x reduction in worst-case latency at the 2K boundary transforms an interactive API into a production-ready service. Sub-second response times are maintained across the full concurrency spectrum.
Throughput Preservation: The stable configuration retains 96.6% of the unstable peak throughput while eliminating OOM failures. Large model deployment no longer requires sacrificing speed for reliability.
Operational Agility: Persistent compilation caching reduces restart overhead by 99.3%, enabling rapid rollouts, autoscaling, and maintenance windows without service degradation.

Core Solution

Achieving deterministic performance on Trillium TPUs requires a layered approach that addresses memory topology, compiler caching, request scheduling, and speculative execution. The following implementation strategy replaces ad-hoc flag tuning with a structured deployment architecture.

Step 1: Runtime Environment & Memory Topology

Trillium TPUs expose a distinct HBM architecture that behaves differently from NVIDIA CUDA memory models. Allocating 95% of available memory to the KV cache leaves insufficient headroom for JAX to compile speculative decoding kernels, triggering resource exhaustion during peak load. Reducing utilization to 90% reserves approximately 6GB of HBM specifically for compiler overhead and runtime fragmentation.

# docker-compose.yml
version: "3.9"
services:
  gemma4-inference:
    image: vllm/vllm-tpu:nightly
    container_name: trillium-gemma4-node
    network_mode: host
    privileged: true
    environment:
      - HF_HOME=/shared/cache/huggingface
      - HF_TOKEN_FILE=/run/secrets/hf_token
      - XLA_CACHE_DIR=/shared/cache/xla_graphs
      - TPU_BUCKET_ALIGNMENT=512
    volumes:
      - shared_memory:/shared/cache
      - ./secrets:/run/secrets:ro
    deploy:
      resources:
        limits:
          memory: 10g
    secrets:
      - hf_token
    command: ["bash", "/opt/entrypoint/launch_inference.sh"]

volumes:
  shared_memory:
    driver: local
    driver_opts:
      type: tmpfs
      device: tmpfs
      o: size=10g,mode=1777

secrets:
  hf_token:
    file: ./secrets/hf_token.txt

Step 2: XLA Compilation Caching Strategy

JAX recompiles execution graphs whenever tensor shapes or batch configurations change. In a stateless container, this occurs on every startup. Mounting a tmpfs volume to /shared/cache/xla_graphs and pointing the XLA cache directory to it allows the compiler to serialize and reuse optimized graphs across restarts. This transforms a 24-minute warm-up into a sub-10-second initialization sequence.

#!/bin/bash
# /opt/entrypoint/launch_inference.sh
set -euo pipefail

# Verify cache directory exists and is writable
CACHE_DIR="${XLA_CACHE_DIR:-/shared/cache/xla_graphs}"
mkdir -p "${CACHE_DIR}"
chmod 1777 "${CACHE_DIR}"

# Export runtime flags for vLLM
export VLLM_XLA_CACHE_PATH="${CACHE_DIR}"
export VLLM_TPU_BUCKET_PADDING_GAP="${TPU_BUCKET_ALIGNMENT:-512}"

# Load Hugging Face token securely
if [ -f "${HF_TOKEN_FILE}" ]; then
    export HF_TOKEN=$(cat "${HF_TOKEN_FILE}")
else
    echo "ERROR: HF_TOKEN_FILE not found" >&2
    exit 1
fi

# Execute vLLM with production parameters
exec vllm serve google/gemma-4-26B-A4B-it \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.90 \
    --block-size 32 \
    --max-model-len 16384 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 4096 \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 3}' \
    --enable-prefix-caching \
    --safetensors-load-strategy prefetch \
    --limit-mm-per-prompt '{"image":4,"audio":1}' \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --reasoning-parser gemma4 \
    --trust-remote-code

Step 3: KV Cache & Block Sizing

The 26B MoE architecture generates significantly larger attention maps than dense models. Using a block size of 32 reduces page-table overhead and aligns with Trillium's memory allocation granularity. Combined with FP8 quantization for the KV cache, this configuration minimizes memory fragmentation while preserving generation quality.

Step 4: Speculative Decoding & Request Batching

N-gram speculative decoding with 3 lookahead tokens reduces effective latency without requiring a separate draft model. The VLLM_TPU_BUCKET_PADDING_GAP=512 environment variable forces the scheduler to group incoming requests into larger, hardware-aligned buckets. This eliminates the micro-stalls that previously caused the 132-second latency spike at the 2K context boundary.

Pitfall Guide

1. Ignoring HBM Headroom for Compiler Kernels

Explanation: Setting --gpu-memory-utilization to 0.95 or higher leaves zero margin for JAX to allocate temporary buffers during speculative kernel compilation. Under load, this triggers Resource Exhausted errors and container crashes. Fix: Cap utilization at 0.90. The 10% reserved space is not wasted; it is actively consumed by the compiler runtime to prevent OOM failures during graph finalization.

2. Misaligning TPU Bucket Padding

Explanation: Default bucket padding (typically 64 or 128) creates excessive XLA graph variants. When requests fall between buckets, the scheduler stalls while compiling new graphs on-the-fly, causing latency spikes. Fix: Set VLLM_TPU_BUCKET_PADDING_GAP=512. This forces request grouping into larger, predictable shapes that match Trillium's optimal execution windows.

3. Skipping Persistent XLA Caching

Explanation: Running inference without VLLM_XLA_CACHE_PATH forces full recompilation on every restart. In autoscaling environments, this creates cascading delays as new nodes take 20+ minutes to become ready. Fix: Mount a tmpfs volume and configure VLLM_XLA_CACHE_PATH to point to it. Cache hits reduce initialization to under 10 seconds.

4. Over-Provisioning KV Cache Block Sizes

Explanation: Using block sizes larger than 64 on MoE models increases internal fragmentation. Smaller blocks (16) cause excessive page-table overhead. The sweet spot for 26B parameters on Trillium is 32. Fix: Explicitly set --block-size 32. This balances memory efficiency with page-table lookup speed.

5. Mismatching Speculative Decoding Parameters

Explanation: Configuring speculative decoding without aligning num_speculative_tokens to the model's attention pattern causes verification overhead to exceed generation speed, degrading throughput. Fix: Use ngram method with num_speculative_tokens: 3 for Gemma 4. This provides latency reduction without triggering excessive verification rejections.

6. Neglecting Multimodal Token Limits

Explanation: Gemma 4 supports image and audio inputs. Without explicit limits, a single request can consume disproportionate KV cache space, starving concurrent text requests. Fix: Apply --limit-mm-per-prompt '{"image":4,"audio":1}' to enforce predictable memory consumption per request.

7. Assuming Stateless Restarts

Explanation: Treating inference containers as ephemeral without caching compilation artifacts or model weights leads to repeated disk I/O and compiler overhead. Fix: Persist both Hugging Face cache (HF_HOME) and XLA graphs to shared memory volumes. Pre-warm nodes during deployment pipelines rather than at request time.

Production Bundle

Action Checklist

Verify TPU v6e-4 topology matches --tensor-parallel-size 4 configuration
Allocate tmpfs volume with minimum 10GB capacity for XLA and HF caches
Set VLLM_TPU_BUCKET_PADDING_GAP=512 to align request scheduling with Trillium hardware
Configure --gpu-memory-utilization 0.90 to reserve HBM for compiler kernels
Enable FP8 KV cache via --kv-cache-dtype fp8 to reduce memory fragmentation
Apply --block-size 32 for optimal page-table performance on 26B MoE
Validate speculative decoding with ngram method and 3 lookahead tokens
Implement health checks that monitor XLA cache hit rates and HBM utilization

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Interactive Chat API	Turbo-Stable v3 config	Sub-second TTFT and deterministic latency required for UX	Baseline infrastructure cost
Batch Processing	Increase `--max-num-seqs` to 512, disable speculative decoding	Maximizes throughput for async workloads	+15% compute, -20% latency sensitivity
Multi-Modal Workloads	Enforce `--limit-mm-per-prompt` strictly	Prevents KV cache starvation from heavy media inputs	Requires careful quota management
Autoscaling Clusters	Persist XLA cache to shared storage	Eliminates 24-min warm-up penalty during scale-out	+$0.05/node/hr for tmpfs, saves 90% downtime cost
Cost-Constrained Deployments	Reduce `--max-model-len` to 8192	Cuts KV cache footprint by ~40% with minimal quality loss	-30% memory cost, acceptable for short-context use cases

Configuration Template

# vllm-production.env
# Core Runtime
VLLM_XLA_CACHE_PATH=/shared/cache/xla_graphs
VLLM_TPU_BUCKET_PADDING_GAP=512
HF_HOME=/shared/cache/huggingface

# Model Parameters
MODEL_ID=google/gemma-4-26B-A4B-it
TENSOR_PARALLEL=4
DTYPE=bfloat16
KV_CACHE_DTYPE=fp8
MEMORY_UTILIZATION=0.90
BLOCK_SIZE=32
MAX_MODEL_LEN=16384

# Scheduling & Batching
MAX_NUM_SEQS=256
MAX_NUM_BATCHED_TOKENS=4096
SPECULATIVE_CONFIG='{"method": "ngram", "num_speculative_tokens": 3}'

# Features
ENABLE_PREFIX_CACHING=true
SAFETENSORS_LOAD_STRATEGY=prefetch
MULTIMODAL_LIMITS='{"image":4,"audio":1}'
ENABLE_AUTO_TOOL_CHOICE=true
TOOL_CALL_PARSER=gemma4
REASONING_PARSER=gemma4
TRUST_REMOTE_CODE=true

Quick Start Guide

Prepare Shared Memory Volume: Create a 10GB tmpfs mount point on your TPU node to host XLA compilation artifacts and Hugging Face weights. This eliminates disk I/O bottlenecks during initialization.
Deploy Container Stack: Use the provided docker-compose.yml and vllm-production.env to launch the inference service. The entrypoint script automatically configures environment variables and executes vllm serve with production flags.
Validate Runtime State: Monitor the first 30 seconds of startup. XLA cache population should complete in under 10 seconds. Verify HBM utilization stabilizes at ~90% and no Resource Exhausted warnings appear in logs.
Run Concurrency Baseline: Execute a load test scaling from 1 to 256 concurrent requests with 2,048-token contexts. Confirm TTFT remains below 0.350s and latency does not exceed 1.200s at the 2K boundary.
Enable Production Routing: Register the service endpoint with your API gateway. Configure health checks to verify XLA cache availability and HBM headroom before routing production traffic.