# Optimizing Dense LLM Inference on Trillium TPUs: A Production-Grade vLLM Deployment Guide
## Current Situation Analysis
The industry is currently experiencing a structural shift in how large language models are served at scale. Architectural debates heavily favor Mixture-of-Experts (MoE) designs, with teams assuming that sparse activation automatically translates to lower costs, higher throughput, and better latency. While MoE models undeniably reduce active parameter counts per token, this assumption overlooks a critical hardware reality: modern accelerator architectures are increasingly optimized for dense matrix multiplication patterns. When serving dense models on next-generation silicon, the expected efficiency gap narrows dramatically, and in specific throughput profiles, dense architectures can actually outperform their sparse counterparts.
This problem is frequently misunderstood because benchmarking is often conducted in isolation, without accounting for continuous batching dynamics, KV cache fragmentation, or hardware topology alignment. Engineering teams default to MoE for cost savings, only to discover that dense models on Trillium-class TPUs deliver comparable or superior peak throughput due to tighter memory bandwidth utilization and reduced routing overhead. The misconception stems from evaluating models purely on parameter counts rather than on actual silicon utilization metrics.
Empirical data from recent production deployments confirms this shift. When running google/gemma-4-31B-it on Cloud TPU v6e-4 (Trillium) infrastructure, the dense architecture achieves a peak prefill throughput of 463,345 tokens per second. At Flex-start pricing (~$0.40/hour), this translates to approximately 308 million tokens processed per dollar. The system maintains stability under extreme concurrency, handling 1,024 simultaneous requests without memory exhaustion. These metrics demonstrate that dense models, when paired with optimized serving engines and correctly tuned concurrency windows, remain highly competitive for both interactive and batch workloads. The engineering challenge is no longer about choosing dense versus sparse; it is about aligning model architecture with hardware execution patterns and request scheduling strategies.
## WOW Moment: Key Findings
The most significant insight from recent benchmarking cycles is the throughput parity between dense and sparse architectures on Trillium hardware, coupled with distinct latency and context scaling tradeoffs. This finding fundamentally changes how teams should evaluate model selection for production inference pipelines.
| Architecture | Peak Throughput (v6e-4) | Interactive TTFT (Low Load) | Active Compute per Token | Max Context Window | Cost Efficiency (Peak) |
|---|---|---|---|---|---|
| Gemma-4 31B (Dense) | 463,345 tok/s | 0.314s | 31B parameters | 64K (tested to 16K) | ~308M tokens/$ |
| Gemma-4 26B (MoE A4B) | ~457,000 tok/s | <1.200s | 3.8B parameters | 256K (Shared KV) | Lower active compute, higher routing overhead |
Why this matters: The data reveals that Trillium's matrix multiplication units are heavily co-optimized for dense workloads. The dense model's slight throughput advantage (463k vs 457k tok/s) indicates that routing overhead in MoE architectures introduces latency penalties that partially offset the benefits of sparse activation. For interactive APIs requiring sub-second time-to-first-token (TTFT), the dense model's 0.314s response time at low concurrency is a decisive advantage. Conversely, MoE's shared KV cache and 7.5x reduction in active parameters make it superior for long-context applications and multi-tenant environments where thermal and power constraints dictate scaling limits. Understanding this tradeoff allows infrastructure teams to route workloads intelligently rather than applying a one-size-fits-all model strategy.
## Core Solution
Deploying dense models efficiently on TPU v6e-4 requires a systematic approach that aligns the serving engine, concurrency management, and hardware topology. The following implementation demonstrates a production-ready vLLM deployment strategy optimized for Trillium architecture.
### Step 1: Environment and Runtime Configuration
Trillium TPUs require specific runtime versions to expose optimized matmul kernels and memory management features. The v2-alpha-tpuv6e runtime provides the necessary Flex-start scheduling and dynamic resource allocation. Ensure your deployment environment matches the regional endpoint (southamerica-east1-c or equivalent) to minimize network latency during model weight loading.
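Before launching the engine, a lightweight preflight check can catch runtime or region mismatches that would otherwise surface as slow weight loads. The sketch below is illustrative only: it assumes the YAML template from the Production Bundle is saved as `trillium_inference_config.yaml` and that PyYAML is installed, and the expected values simply mirror that template.

```python
# Preflight sketch: validate the runtime and region fields of the YAML template
# before weight loading starts. Expected values mirror the Configuration
# Template section; adjust them to your own deployment.
import yaml  # PyYAML

EXPECTED_RUNTIME = "v2-alpha-tpuv6e"
EXPECTED_REGION_PREFIX = "southamerica-east1"  # or your chosen regional endpoint


def preflight_check(config_path: str = "trillium_inference_config.yaml") -> None:
    with open(config_path) as fh:
        cfg = yaml.safe_load(fh)

    runtime = cfg["inference"]["runtime"]
    region = cfg["inference"]["region"]

    if runtime != EXPECTED_RUNTIME:
        raise RuntimeError(
            f"Unexpected TPU runtime {runtime!r}; expected {EXPECTED_RUNTIME!r}"
        )
    if not region.startswith(EXPECTED_REGION_PREFIX):
        raise RuntimeError(
            f"Config region {region!r} does not match the deployment region; "
            "cross-region weight loading adds multi-minute delays"
        )


if __name__ == "__main__":
    preflight_check()
```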
### Step 2: Serving Engine Initialization
vLLM's continuous batching and PagedAttention mechanisms are critical for maintaining throughput under variable concurrency. The following implementation abstracts the standard CLI interface into a structured deployment class that handles topology mapping, KV cache allocation, and concurrency throttling.
```python
import asyncio
import logging
from dataclasses import dataclass
from uuid import uuid4


@dataclass
class TrilliumInferenceConfig:
    """Serving configuration tuned for a TPU v6e-4 (Trillium) slice."""
    model_id: str = "google/gemma-4-31B-it"
    tensor_parallel_size: int = 4          # one weight shard per core on a v6e-4 slice
    max_num_seqs: int = 256                # peak-throughput concurrency ceiling
    max_model_len: int = 16384
    kv_cache_dtype: str = "auto"
    enable_prefix_caching: bool = True
    gpu_memory_utilization: float = 0.92   # ~8% headroom for runtime overhead
    scheduler_interval_ms: int = 10


class ProductionInferenceServer:
    def __init__(self, config: TrilliumInferenceConfig):
        self.config = config
        self.engine = None
        # Semaphore caps in-flight requests at max_num_seqs.
        self.concurrency_tracker = asyncio.Semaphore(config.max_num_seqs)
        self.logger = logging.getLogger(__name__)

    async def initialize_engine(self) -> None:
        """
        Initializes the vLLM engine with Trillium-specific optimizations.
        Maps tensor parallelism to v6e-4 slice topology and configures
        PagedAttention block sizes for dense weight alignment.
        """
        try:
            from vllm import AsyncLLMEngine, AsyncEngineArgs

            engine_args = AsyncEngineArgs(
                model=self.config.model_id,
                tensor_parallel_size=self.config.tensor_parallel_size,
                max_num_seqs=self.config.max_num_seqs,
                max_model_len=self.config.max_model_len,
                kv_cache_dtype=self.config.kv_cache_dtype,
                enable_prefix_caching=self.config.enable_prefix_caching,
                gpu_memory_utilization=self.config.gpu_memory_utilization,
                scheduler_delay_factor=self.config.scheduler_interval_ms / 1000.0,
            )
            self.engine = AsyncLLMEngine.from_engine_args(engine_args)
            self.logger.info("Trillium inference engine initialized successfully.")
        except Exception as exc:
            self.logger.error(f"Engine initialization failed: {exc}")
            raise

    async def handle_request(self, prompt: str, max_tokens: int = 512) -> str:
        """
        Routes inference requests through the concurrency semaphore
        to prevent queue saturation and maintain predictable TTFT.
        """
        async with self.concurrency_tracker:
            try:
                from vllm import SamplingParams

                sampling_params = SamplingParams(
                    max_tokens=max_tokens,
                    temperature=0.7,
                    top_p=0.9,
                )
                # Use a UUID for the request id; id(prompt) can collide once
                # prompt objects are garbage-collected.
                generator = self.engine.generate(
                    prompt, sampling_params, request_id=f"req_{uuid4().hex}"
                )
                final_output = None
                async for output in generator:
                    final_output = output
                return final_output.outputs[0].text if final_output else ""
            except Exception as exc:
                self.logger.error(f"Request processing failed: {exc}")
                raise

    async def run_health_check(self) -> dict:
        """
        Exposes runtime metrics for orchestration platforms.
        Tracks active sequences, KV cache utilization, and scheduler latency.
        """
        if not self.engine:
            return {"status": "uninitialized"}
        # Semaphore._value holds the remaining slots, so active = max - remaining.
        return {
            "status": "healthy",
            "active_seqs": self.config.max_num_seqs - self.concurrency_tracker._value,
            "max_seqs": self.config.max_num_seqs,
            "model": self.config.model_id,
        }
```
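A minimal usage sketch for the class above; the prompt and entry point are illustrative:

```python
# Illustrative entry point: initialize the engine, serve one request, and
# print a health snapshot.
import asyncio


async def main() -> None:
    server = ProductionInferenceServer(TrilliumInferenceConfig())
    await server.initialize_engine()

    reply = await server.handle_request(
        "Summarize the tradeoffs between dense and MoE serving on TPUs.",
        max_tokens=256,
    )
    print(reply)
    print(await server.run_health_check())


if __name__ == "__main__":
    asyncio.run(main())
```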
### Step 3: Architecture Decisions and Rationale
- **Tensor Parallelism = 4**: Matches the v6e-4 slice topology, ensuring each TPU core handles an equal shard of the dense weight matrix. This prevents cross-core communication bottlenecks during attention computation.
- **`max_num_seqs = 256`**: Benchmark data shows peak throughput occurs at this concurrency level. Beyond C256, TTFT degrades non-linearly due to scheduler overhead and KV cache fragmentation. Capping at 256 maintains the ~463k tok/s throughput ceiling.
- **PagedAttention with Prefix Caching**: Dense models benefit significantly from caching repeated system prompts and instruction templates. This reduces redundant prefill computation and improves effective throughput by 15-20% in multi-turn conversational workloads.
- **`gpu_memory_utilization = 0.92`**: Leaves an 8% buffer for runtime overhead, preventing OOM crashes during batch expansion. Trillium's memory allocator performs best when not pushed to absolute capacity.
### Step 4: Concurrency and Throughput Tuning
The serving engine must dynamically adjust to request patterns. Production deployments should implement a feedback loop that monitors TTFT and prefill throughput, scaling `max_num_seqs` within a safe band (128-256) based on real-time queue depth. When TTFT exceeds 1.5 seconds, the system should temporarily reject new requests or route them to a secondary pool rather than allowing queue saturation.
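One possible shape for that feedback loop is sketched below. This is not a vLLM API: the controller's output is meant to drive whatever admission layer sits in front of the engine, and the band limits and TTFT targets simply reuse the numbers quoted above.

```python
# Sketch of an adaptive concurrency controller: nudges the admitted concurrency
# within the 128-256 band based on observed TTFT and signals back-pressure
# above 1.5s.
from dataclasses import dataclass


@dataclass
class ConcurrencyController:
    floor: int = 128
    ceiling: int = 256
    ttft_target_sec: float = 1.5
    step: int = 16
    current: int = 256

    def update(self, observed_ttft_sec: float, queue_depth: int) -> int:
        """Return the new concurrency cap for the admission layer."""
        if observed_ttft_sec > self.ttft_target_sec:
            # Latency is degrading: shrink the concurrency window.
            self.current = max(self.floor, self.current - self.step)
        elif queue_depth > self.current and observed_ttft_sec < self.ttft_target_sec / 2:
            # Plenty of headroom and a deep queue: expand toward peak throughput.
            self.current = min(self.ceiling, self.current + self.step)
        return self.current

    def should_shed_load(self, observed_ttft_sec: float) -> bool:
        """True when new requests should be rejected or routed to a secondary pool."""
        return observed_ttft_sec > self.ttft_target_sec
```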
## Pitfall Guide
### 1. Unbounded Concurrency Scaling
**Explanation:** Allowing `max_num_seqs` to scale indefinitely causes TTFT to spike exponentially. At C512 and C1024, the scheduler spends more time managing request queues than processing tokens, degrading throughput from 463k to ~240k tok/s.
**Fix:** Implement a hard concurrency cap aligned with peak throughput benchmarks. Use a circuit breaker that rejects requests when average TTFT exceeds 2.0 seconds.
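A minimal circuit-breaker sketch along those lines, using the 2.0s threshold from the fix above and an assumed cool-down period:

```python
# Sketch of a TTFT circuit breaker: trips when the rolling-average TTFT exceeds
# 2.0s and re-admits traffic after a cool-down window.
import time
from collections import deque
from typing import Optional


class TTFTCircuitBreaker:
    def __init__(self, threshold_sec: float = 2.0, window: int = 50,
                 cooldown_sec: float = 30.0):
        self.threshold_sec = threshold_sec
        self.samples = deque(maxlen=window)   # most recent TTFT observations
        self.cooldown_sec = cooldown_sec
        self.tripped_at: Optional[float] = None

    def record(self, ttft_sec: float) -> None:
        self.samples.append(ttft_sec)
        if sum(self.samples) / len(self.samples) > self.threshold_sec:
            self.tripped_at = time.monotonic()

    def allow_request(self) -> bool:
        if self.tripped_at is None:
            return True
        if time.monotonic() - self.tripped_at > self.cooldown_sec:
            self.tripped_at = None       # cool-down elapsed; close the breaker
            self.samples.clear()
            return True
        return False
```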
### 2. Misaligned KV Cache Block Sizes
**Explanation:** Default block sizes often mismatch Trillium's memory page granularity, causing internal fragmentation. This reduces effective cache capacity and forces premature eviction of active sequences.
**Fix:** Explicitly configure block size to match TPU memory alignment (typically 16 or 32 tokens). Monitor cache hit rates and adjust dynamically based on average sequence length.
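For reference, vLLM exposes a `block_size` engine argument; a minimal sketch of pinning it alongside the Step 2 arguments (16-token blocks, matching the configuration template) might look like this:

```python
# Sketch: pin the KV cache block size explicitly so cache pages align with the
# accelerator's memory granularity (16-token blocks here; try 32 for workloads
# dominated by long sequences).
from vllm import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="google/gemma-4-31B-it",
    tensor_parallel_size=4,
    block_size=16,
    enable_prefix_caching=True,
)
```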
### 3. Ignoring Prefill vs Decode Phase Imbalance
**Explanation:** Dense models spend disproportionate compute on the prefill phase. If the serving engine treats prefill and decode requests identically, GPU utilization becomes skewed, causing decode starvation.
**Fix:** Enable separate scheduling queues for prefill and decode phases. Prioritize decode requests to maintain token generation continuity, especially under high concurrency.
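As an illustration of the scheduling idea only, not vLLM's internal scheduler, here is a sketch of a dispatcher that drains decode work before admitting new prefill work:

```python
# Sketch: decode-priority dispatcher. Decode steps are drained first so token
# generation never stalls behind a burst of new, prefill-heavy requests.
import asyncio


class PhaseAwareDispatcher:
    def __init__(self):
        self.prefill_queue: asyncio.Queue = asyncio.Queue()
        self.decode_queue: asyncio.Queue = asyncio.Queue()

    async def next_task(self):
        """Return the next work item, preferring decode over prefill."""
        if not self.decode_queue.empty():
            return await self.decode_queue.get()
        if not self.prefill_queue.empty():
            return await self.prefill_queue.get()
        # Nothing queued: briefly wait on decode, then fall back to prefill.
        try:
            return await asyncio.wait_for(self.decode_queue.get(), timeout=0.005)
        except asyncio.TimeoutError:
            return await self.prefill_queue.get()
```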
### 4. Overlooking Flex-start Cold Start Latency
**Explanation:** Flex-start pricing reduces costs but introduces provisioning delays. Teams deploying without pre-warming experience 15-30 second initial TTFT spikes that violate SLA requirements.
**Fix:** Implement a synthetic request pre-warm routine that loads weights and initializes KV cache structures before accepting production traffic. Cache warm-up should complete within 45 seconds.
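A hedged sketch of such a pre-warm routine, reusing the `ProductionInferenceServer` from Step 2; the prompt set and timing budget are illustrative:

```python
# Sketch of a pre-warm routine: issues a few synthetic requests so weights are
# resident and KV cache structures are allocated before the gateway opens.
import time

WARMUP_PROMPTS = [
    "Warmup: short prompt.",
    "Warmup: " + "repeat this sentence. " * 64,   # exercises a longer prefill
]


async def prewarm(server: "ProductionInferenceServer",
                  budget_sec: float = 45.0) -> bool:
    """Return True if warm-up finished within the budget."""
    start = time.monotonic()
    await server.initialize_engine()
    for prompt in WARMUP_PROMPTS:
        await server.handle_request(prompt, max_tokens=8)
        if time.monotonic() - start > budget_sec:
            return False
    return True
```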
### 5. Assuming MoE Always Reduces Infrastructure Cost
**Explanation:** While MoE activates fewer parameters, routing overhead, expert load balancing, and shared KV cache management introduce computational taxes. On Trillium, dense matmul kernels are so optimized that the cost differential narrows significantly.
**Fix:** Calculate total cost per 1M tokens including routing overhead, memory bandwidth, and scheduler latency. Choose architecture based on workload profile, not parameter count alone.
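A back-of-the-envelope sketch of that calculation, treating routing and scheduler overhead as a discount on peak throughput; the overhead fractions shown are placeholders, not measured values:

```python
# Sketch: effective cost per 1M tokens, with routing, load balancing, and
# scheduler overhead modeled as a throughput discount. Replace the example
# inputs with your own sustained measurements.
def cost_per_million_tokens(hourly_rate_usd: float,
                            peak_tokens_per_sec: float,
                            overhead_fraction: float) -> float:
    """overhead_fraction: share of peak throughput lost to routing and
    scheduling, between 0.0 and 1.0."""
    effective_tps = peak_tokens_per_sec * (1.0 - overhead_fraction)
    tokens_per_hour = effective_tps * 3600
    return hourly_rate_usd / (tokens_per_hour / 1_000_000)


# Example comparison; overhead fractions are illustrative placeholders.
dense_cost = cost_per_million_tokens(0.40, 463_345, overhead_fraction=0.05)
moe_cost = cost_per_million_tokens(0.40, 457_000, overhead_fraction=0.15)
print(f"dense: ${dense_cost:.6f}/1M tok, moe: ${moe_cost:.6f}/1M tok")
```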
### 6. Neglecting Regional Egress and Weight Loading
**Explanation:** Model weights for 31B dense models exceed 60GB. Loading from distant storage or cross-region endpoints introduces multi-minute delays and bandwidth costs that erase Flex-start savings.
**Fix:** Co-locate model artifacts with the TPU slice. Use regional persistent disks or cached object storage with prefetching enabled. Verify network throughput exceeds 25 Gbps for weight streaming.
### 7. Static Sampling Parameters Across Workloads
**Explanation:** Using identical temperature, top-p, and max_tokens settings for both creative generation and factual QA causes unnecessary compute waste. Deterministic tasks don't require stochastic sampling overhead.
**Fix:** Route requests through a policy engine that adjusts sampling parameters based on task classification. Disable temperature scaling for classification or extraction tasks to reduce decode iterations.
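A minimal policy-routing sketch, assuming an upstream classifier supplies a coarse task label; the label names and policy values are illustrative:

```python
# Sketch: map a coarse task label to SamplingParams so deterministic tasks skip
# stochastic sampling and cap generation length aggressively.
from vllm import SamplingParams

SAMPLING_POLICIES = {
    # Deterministic extraction/classification: greedy decoding, short outputs.
    "extraction": dict(temperature=0.0, top_p=1.0, max_tokens=128),
    "classification": dict(temperature=0.0, top_p=1.0, max_tokens=16),
    # Open-ended generation keeps the defaults used by handle_request.
    "creative": dict(temperature=0.7, top_p=0.9, max_tokens=512),
}


def sampling_for_task(task_label: str) -> SamplingParams:
    policy = SAMPLING_POLICIES.get(task_label, SAMPLING_POLICIES["creative"])
    return SamplingParams(**policy)
```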
## Production Bundle
### Action Checklist
- [ ] Verify TPU v6e-4 runtime version matches `v2-alpha-tpuv6e` and Flex-start is enabled
- [ ] Configure tensor parallelism to 4 to align with Trillium slice topology
- [ ] Set `max_num_seqs` to 256 and implement circuit breaker at 2.0s TTFT threshold
- [ ] Enable PagedAttention with prefix caching and align block size to 16 tokens
- [ ] Pre-warm inference engine with synthetic requests before production traffic
- [ ] Co-locate model weights in regional storage with minimum 25 Gbps network throughput
- [ ] Implement separate prefill/decode scheduling queues to prevent decode starvation
- [ ] Monitor KV cache utilization and adjust memory buffer to 8% headroom
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|----------|---------------------|-----|-------------|
| Interactive API (Sub-1s TTFT) | Dense 31B on v6e-4 | Superior low-load latency (0.314s) and predictable prefill performance | ~$0.40/hr, ~308M tokens/$ |
| Long-Context Document Processing | MoE 26B (A4B) | 256K shared KV cache handles extended sequences without eviction | Higher routing overhead, lower active compute |
| High-Throughput Batch Pipeline | Dense 31B at C256 | Peak throughput of 463k tok/s maximizes silicon utilization | Optimal cost-to-throughput ratio |
| Multi-Tenant SaaS Platform | MoE 26B with expert routing | 7.5x lower active compute reduces thermal/power constraints per tenant | Scales more users per dollar under sustained load |
| Cost-Sensitive Edge Deployment | Dense 31B with quantization | INT8/FP8 reduces memory footprint while preserving Trillium matmul efficiency | Reduces hourly cost by 30-40% with minimal accuracy loss |
### Configuration Template
```yaml
# trillium_inference_config.yaml
inference:
  model: "google/gemma-4-31B-it"
  runtime: "v2-alpha-tpuv6e"
  region: "southamerica-east1-c"
hardware:
  tpu_type: "v6e-4"
  tensor_parallel: 4
  memory_utilization: 0.92
scheduling:
  max_concurrent_sequences: 256
  ttft_threshold_sec: 2.0
  prefill_decode_split: true
  circuit_breaker_enabled: true
caching:
  paged_attention: true
  prefix_caching: true
  block_size_tokens: 16
  eviction_policy: "lru"
monitoring:
  metrics_endpoint: "/metrics"
  health_check_interval_sec: 10
  log_level: "INFO"
```
### Quick Start Guide
- Provision TPU v6e-4 Slice: Deploy a Flex-start v6e-4 instance in your target region. Ensure the runtime version is set to `v2-alpha-tpuv6e` and network throughput is configured for high-bandwidth weight loading.
- Initialize Serving Engine: Clone the production deployment repository, install vLLM 0.20+ with TPU extensions, and apply the YAML configuration template. Run the pre-warm routine to load weights and initialize KV cache structures.
- Validate Concurrency Profile: Execute a load test ramping from C1 to C256 (a minimal ramp sketch follows this list). Monitor TTFT and prefill throughput. Confirm peak throughput reaches ~463k tok/s and TTFT remains below 1.5s at C128.
- Enable Production Routing: Attach the inference server to your API gateway. Configure the circuit breaker to reject requests when TTFT exceeds 2.0s. Route deterministic tasks to low-temperature sampling policies to reduce decode overhead.
- Monitor and Iterate: Track KV cache hit rates, scheduler latency, and active sequence counts. Adjust `max_num_seqs` within the 128-256 band based on real-time queue depth. Archive metrics for capacity planning and cost optimization reviews.
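The ramp sketch referenced above drives the `ProductionInferenceServer` from Step 2 directly; end-to-end latency is used here as a coarse proxy for TTFT, and a real harness should timestamp the first streamed token instead:

```python
# Sketch of a concurrency ramp: fires batches of identical prompts at
# increasing concurrency levels and records mean per-request latency per level.
import asyncio
import time


async def timed_request(server: "ProductionInferenceServer", prompt: str) -> float:
    start = time.monotonic()
    await server.handle_request(prompt, max_tokens=64)
    return time.monotonic() - start


async def ramp_load_test(server: "ProductionInferenceServer",
                         levels=(1, 8, 32, 64, 128, 256),
                         prompt: str = "Benchmark prompt: summarize TPU serving.") -> dict:
    results = {}
    for concurrency in levels:
        latencies = await asyncio.gather(*[
            timed_request(server, prompt) for _ in range(concurrency)
        ])
        results[concurrency] = sum(latencies) / len(latencies)
        print(f"C{concurrency}: mean latency {results[concurrency]:.3f}s")
    return results
```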
