Compiling for Latency, Scaling for Throughput: A Production Guide to TensorRT-LLM and Triton

Current Situation Analysis

The industry standard for LLM serving has converged around runtime-flexible frameworks that prioritize ease of deployment over hardware-specific optimization. Teams routinely benchmark serving stacks using mismatched workloads, disabled memory management features, and uncontrolled token generation. This creates a false equivalence between frameworks that compile ahead-of-time and those that execute dynamically. The result is infrastructure misalignment: latency-sensitive applications get routed to throughput-optimized runtimes, while high-concurrency batch pipelines are forced into rigid, low-capacity engines.

This problem persists because serving is treated as a software configuration task rather than a hardware topology problem. Engineers rarely account for the physical constraints of multi-GPU interconnects during engine design. On NVIDIA Hopper architectures, tensor parallelism introduces an all-reduce communication step that consumes approximately 77% of available NVLink bandwidth during layer sharding. When teams scale tensor parallelism (TP) without distinguishing between the prefill and decode phases, they inadvertently throttle decode latency. Prefill processes the whole prompt in one pass, is compute-bound, and benefits from aggressive sharding. Decode produces one token per step, is limited by memory bandwidth and per-step latency, and degrades when cross-GPU synchronization overhead exceeds the compute gains.

Furthermore, benchmarking methodologies frequently ignore the fundamental requirement of matched workloads. If two serving stacks generate different token counts per request, throughput and latency metrics become mathematically incomparable. Production serving also demands continuous batching and paged KV-cache allocation. Disabling these features reduces a serving stack to a static batch processor, invalidating any performance claims for real-world traffic patterns. The missing link is a compilation-first strategy that locks in hardware topology, precision, and batching policies before deployment, paired with a hardened control plane that exposes observability, health checks, and dynamic routing.

WOW Moment: Key Findings

The performance boundary between ahead-of-time compiled engines and runtime-flexible servers is not absolute; it is concurrency-dependent. When workloads are strictly controlled for token count, precision, and hardware topology, a clear crossover emerges. TensorRT-LLM with CUDA graph capture dominates the latency-sensitive regime, while runtime schedulers dominate the throughput-saturated regime. This crossover dictates infrastructure strategy: you do not choose a framework based on preference, you choose based on traffic pattern.

Serving Architecture	Low Concurrency Latency (TTFT)	High Concurrency Throughput	Compilation Overhead
TensorRT-LLM + CUDA Graphs	~15-20% lower	Saturates earlier	High (AOT)
vLLM (JIT Runtime)	Baseline	~10-15% higher	Low (Runtime)

This finding matters because it replaces framework loyalty with traffic-aware routing. Latency-bound applications (interactive chat, real-time agents, low-latency APIs) benefit from the deterministic execution paths and eliminated per-iteration launch overhead that CUDA graphs provide. Throughput-bound pipelines (batch inference, offline processing, high-concurrency public APIs) benefit from dynamic schedulers that maximize GPU occupancy without compilation rigidity. The compilation step is not a bottleneck; it is a performance multiplier that only activates when the deployment surface matches the traffic profile.

Core Solution

Moving from a Hugging Face checkpoint to a production endpoint requires four distinct phases. Each phase locks in a hardware or software constraint that directly impacts latency, throughput, or accuracy.

Phase 1: Precision Selection and Calibration

Precision dictates memory footprint and compute density. FP16 serves as the accuracy baseline. FP8 uses the Hopper Transformer Engine to store weights and KV-cache in roughly half the space FP16 requires, which directly increases concurrent sequence capacity. The trade-off is model-dependent accuracy degradation. Never deploy FP8 without running a calibration suite against your target task distribution. Use a representative dataset to measure perplexity or task-specific metrics before committing to the compiled engine.
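The capacity effect of halving KV-cache precision can be checked with simple arithmetic. The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128), the 40 GiB KV budget, and the function names are assumptions, not values from any specific deployment.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Bytes of KV-cache that one token occupies across all layers."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_sequences(kv_budget_bytes: int, seq_len: int, per_token_bytes: int) -> int:
    """How many full-length sequences fit in a fixed KV-cache budget."""
    return kv_budget_bytes // (seq_len * per_token_bytes)


# Hypothetical model shape and memory budget, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
KV_BUDGET_BYTES = 40 * 1024**3  # assume 40 GiB remains for KV-cache
SEQ_LEN = 4096

fp16_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_element=2)
fp8_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_element=1)

print("FP16 sequences:", max_sequences(KV_BUDGET_BYTES, SEQ_LEN, fp16_per_token))
print("FP8 sequences: ", max_sequences(KV_BUDGET_BYTES, SEQ_LEN, fp8_per_token))
```

Under these assumed numbers the FP8 cache holds twice as many full-length sequences, which is the capacity gain the accuracy check has to justify.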

Phase 2: Ahead-of-Time Engine Compilation

TensorRT-LLM requires a fixed tensor-parallel degree, maximum batch size, and precision profile at compile time. This rigidity is intentional: the compiler fuses kernels, optimizes memory layouts, and captures CUDA graphs to eliminate Python runtime overhead.

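# NOTE: The class, method, and keyword names in this listing are illustrative.
# TensorRT-LLM's build interface changes between releases, so verify every
# name against the installed version (or use the trtllm-build CLI) before running.
from tensorrt_llm.builder import BuildConfig, Builder
from tensorrt_llm.models import LLMConfig
from tensorrt_llm.quantization import QuantMode

def compile_inference_engine(
    checkpoint_path: str,
    output_dir: str,
    tensor_parallel: int = 4,
    max_batch_size: int = 64,
    max_seq_len: int = 4096,
    use_fp8: bool = False
) -> None:
    """Compile a Hugging Face checkpoint into a fixed-topology engine."""
    # Precision is fixed at build time; FP8 should pass calibration first.
    quant_mode = QuantMode.FP8 if use_fp8 else QuantMode.NONE

    # Topology, batch limits, and memory policy are all baked into the engine.
    build_cfg = BuildConfig(
        tensor_parallel=tensor_parallel,
        max_batch_size=max_batch_size,
        max_seq_len=max_seq_len,
        quant_mode=quant_mode,
        enable_cuda_graph=True,
        use_paged_kv_cache=True,
        enable_context_fmha=True
    )

    # Load the model definition from the checkpoint; FP16 is the base dtype.
    model_cfg = LLMConfig.from_hf_checkpoint(
        checkpoint_path=checkpoint_path,
        dtype="float16"
    )

    builder = Builder()
    engine = builder.build_engine(
        model_config=model_cfg,
        build_config=build_cfg,
        output_dir=output_dir
    )

    # Persist the serialized engine so the serving layer can load it from disk.
    builder.save_engine(engine, output_dir)
    print(f"Engine compiled to {output_dir} | TP={tensor_parallel} | CUDA Graphs=ON")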

Architecture Rationale:

enable_cuda_graph=True captures the execution graph once, removing per-iteration kernel launch overhead. This is the primary driver of low-concurrency latency improvements.
use_paged_kv_cache=True allocates KV memory in fixed-size blocks, preventing worst-case sequence length reservations and enabling higher concurrency.
tensor_parallel must match the physical NVLink topology. Over-sharding decode-heavy workloads introduces synchronization latency that outweighs compute parallelism.
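To make the paged KV-cache point concrete, here is a toy block allocator. It is a sketch of the general technique, not TensorRT-LLM's implementation; the class name, block size, and pool size are all hypothetical.

```python
import math
from typing import Dict, List


class PagedKVAllocator:
    """Toy allocator: sequences grow in fixed-size blocks taken from a shared pool."""

    def __init__(self, total_blocks: int, block_tokens: int) -> None:
        self.block_tokens = block_tokens
        self.free_blocks: List[int] = list(range(total_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_tokens(self, seq_id: str, num_tokens: int) -> bool:
        """Grow a sequence; return False if the pool cannot supply enough blocks."""
        new_length = self.lengths.get(seq_id, 0) + num_tokens
        owned = self.block_tables.get(seq_id, [])
        needed = math.ceil(new_length / self.block_tokens) - len(owned)
        if needed > len(self.free_blocks):
            return False
        self.block_tables[seq_id] = owned + [self.free_blocks.pop() for _ in range(needed)]
        self.lengths[seq_id] = new_length
        return True

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


# A pool of 64 blocks x 64 tokens holds 4096 tokens in total.
pool = PagedKVAllocator(total_blocks=64, block_tokens=64)
admitted = sum(pool.append_tokens(f"seq-{i}", 500) for i in range(16))
print("sequences admitted with paging:", admitted)
```

With a static reservation of 4096 tokens per request, the same pool would admit a single sequence. Paging admits as many 500-token sequences as there are blocks to cover them.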

Phase 3: Triton Model Repository Configuration

Triton Inference Server replaces development scripts with a production-grade control plane. It provides health endpoints, metric exposition, dynamic batching configuration, and ensemble routing. The compiled engine is wrapped in a model repository entry for Triton's TensorRT-LLM backend. Triton natively speaks the KServe v2 HTTP/gRPC protocol; the OpenAI-compatible /v1 routes used later in this guide come from Triton's separate OpenAI-compatible frontend.
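Triton loads models from a repository in which each model has its own directory containing config.pbtxt and at least one numeric version subdirectory. The helper below is a minimal sketch of staging that layout; the function name is hypothetical, and where the compiled engine files themselves belong depends on how the backend is configured.

```python
from pathlib import Path


def stage_model_repository(repo_root: str, model_name: str,
                           config_text: str, version: int = 1) -> Path:
    """Create <repo_root>/<model_name>/config.pbtxt and an empty version directory."""
    model_dir = Path(repo_root) / model_name
    # Triton expects a numeric version directory under each model.
    (model_dir / str(version)).mkdir(parents=True, exist_ok=True)
    (model_dir / "config.pbtxt").write_text(config_text, encoding="utf-8")
    return model_dir
```

Calling stage_model_repository("./triton_model_repo", "llm_engine", config_text) produces a directory that tritonserver --model-repository=./triton_model_repo can scan.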

Phase 4: Controlled Load Testing

Benchmarking requires strict workload parity. Every request must decode an identical number of tokens to ensure throughput and latency metrics reflect identical computational work.

import asyncio
import time

import httpx

async def run_controlled_benchmark(
    endpoint_url: str,
    prompt: str,
    target_tokens: int = 256,
    concurrency: int = 10
) -> dict:
    # min_tokens and ignore_eos are server-side extensions to the OpenAI
    # completions schema; confirm that the stack under test honors them.
    payload = {
        "model": "default",
        "prompt": prompt,
        "max_tokens": target_tokens,
        "min_tokens": target_tokens,
        "ignore_eos": True,
        "temperature": 0.0
    }
    async with httpx.AsyncClient(timeout=120.0) as client:
        # The coroutines are created here but only start running inside gather().
        tasks = [
            client.post(f"{endpoint_url}/v1/completions", json=payload)
            for _ in range(concurrency)
        ]

        start = time.perf_counter()
        responses = await asyncio.gather(*tasks)
        elapsed = time.perf_counter() - start

    total_tokens = 0
    for resp in responses:
        resp.raise_for_status()
        # Use the server-reported count: splitting the text on whitespace
        # counts words, not tokens. Assumes the response has a "usage" field.
        total_tokens += resp.json().get("usage", {}).get("completion_tokens", 0)

    return {
        "elapsed_seconds": round(elapsed, 3),
        "total_tokens_generated": total_tokens,
        "throughput_tokens_per_sec": round(total_tokens / elapsed, 2),
        "concurrency": concurrency
    }

Architecture Rationale:

ignore_eos=True and min_tokens == max_tokens pin every request to the same output length, eliminating variance from model stopping conditions.
concurrency should be swept incrementally, one run per level, to map the latency-to-throughput crossover curve.
Metrics are collected under matched concurrency to ensure fair comparison across serving stacks.
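Once a sweep has produced one throughput figure per concurrency level, the plateau can be located mechanically. The sketch below assumes a simple rule (the first level after which throughput grows by less than 5%); the function name, the threshold, and the sample numbers are all hypothetical.

```python
from typing import List, Optional, Tuple


def find_throughput_plateau(points: List[Tuple[int, float]],
                            min_gain: float = 0.05) -> Optional[int]:
    """Return the concurrency after which throughput stops growing meaningfully.

    points holds (concurrency, tokens_per_second) pairs from a sweep.
    """
    ordered = sorted(points)
    for (c_prev, t_prev), (_, t_next) in zip(ordered, ordered[1:]):
        # Relative gain from moving to the next concurrency level.
        if t_prev > 0 and (t_next - t_prev) / t_prev < min_gain:
            return c_prev
    return None  # throughput was still scaling at the highest level tested


# Made-up sweep results, for illustration only.
sweep = [(1, 100.0), (2, 195.0), (4, 370.0), (8, 640.0),
         (16, 900.0), (32, 930.0), (64, 935.0)]
print("plateau starts at concurrency:", find_throughput_plateau(sweep))
```

Running the same analysis on each serving stack gives a comparable plateau estimate, provided the sweeps used identical prompts and token counts.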

Pitfall Guide

1. The CUDA Graph Blind Spot

Explanation: CUDA graphs are not enabled by default in all build configurations. A compiled engine without graph capture launches every kernel individually on each decode step, negating the primary latency advantage.

Fix: Explicitly set enable_cuda_graph=True during compilation and verify graph capture logs during the first inference request. Where the deployment exposes it, monitor a Triton metric such as cuda_graph_memory_usage to confirm activation.

2. Mismatched Decode Lengths in Benchmarks

Explanation: Frameworks that truncate outputs early appear faster because they perform less compute. Comparing token/s across stacks with different effective sequence lengths produces mathematically invalid results.

Fix: Enforce min_tokens == max_tokens and ignore_eos=True in all load tests. Report exact token counts alongside throughput metrics.
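A benchmark harness can refuse to report numbers when decode lengths drift. The check below is a small sketch; the function name is hypothetical and it assumes per-request completion token counts have already been collected from each stack.

```python
from typing import List, Sequence


def decode_length_problems(stack_name: str, completion_tokens: Sequence[int],
                           target_tokens: int) -> List[str]:
    """List every request whose decoded length differs from the target."""
    return [
        f"{stack_name}: request {i} produced {n} tokens, expected {target_tokens}"
        for i, n in enumerate(completion_tokens)
        if n != target_tokens
    ]


# Example: the second stack stopped early on one request.
problems = (decode_length_problems("stack_a", [256, 256, 256], 256)
            + decode_length_problems("stack_b", [256, 201, 256], 256))
for line in problems:
    print(line)
```

An empty list means both stacks did the same amount of decode work and their throughput figures can be compared.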

3. Over-Sharding for Decode-Heavy Workloads

Explanation: Tensor parallelism introduces an all-reduce step that consumes NVLink bandwidth. On decode-heavy traffic (one token per iteration), cross-GPU synchronization latency exceeds compute gains, increasing inter-token latency.

Fix: Profile prefill vs decode ratios. Use TP=4 or TP=8 for prefill-bound workloads. Drop to TP=1 or TP=2 for decode-bound workloads to minimize synchronization overhead.
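The trade-off can be illustrated with a deliberately crude cost model: compute divides by the TP degree, while synchronization adds a fixed cost of two all-reduces per transformer layer (the usual count for Megatron-style tensor parallelism). Every number below is invented for illustration, and the model ignores that prefill all-reduces move more data than decode all-reduces.

```python
from typing import Iterable


def step_time_ms(tp: int, single_gpu_compute_ms: float,
                 num_layers: int, allreduce_ms: float) -> float:
    """Estimated time for one forward step under tensor parallelism."""
    # No cross-GPU synchronization is needed on a single GPU.
    sync_ms = 0.0 if tp == 1 else 2 * num_layers * allreduce_ms
    return single_gpu_compute_ms / tp + sync_ms


def best_tp(candidates: Iterable[int], single_gpu_compute_ms: float,
            num_layers: int, allreduce_ms: float) -> int:
    """Pick the TP degree with the lowest estimated step time."""
    return min(candidates, key=lambda tp: step_time_ms(
        tp, single_gpu_compute_ms, num_layers, allreduce_ms))


DEGREES = [1, 2, 4, 8]
# Hypothetical workload: a 2 ms decode step and a 400 ms prefill step on one GPU,
# 32 layers, and 0.03 ms per all-reduce.
print("decode  best TP:", best_tp(DEGREES, 2.0, 32, 0.03))
print("prefill best TP:", best_tp(DEGREES, 400.0, 32, 0.03))
```

In this toy setting the short decode step is dominated by synchronization at any TP above 1, while the long prefill step keeps benefiting from sharding. Real crossover points have to come from profiling.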

4. FP8 Deployment Without Accuracy Validation

Explanation: FP8 compression reduces memory but introduces quantization noise. Deploying without task-specific calibration risks silent accuracy degradation in production.

Fix: Run a calibration suite against your target distribution before compilation. Compare perplexity or task metrics against the FP16 baseline. Only proceed if the delta falls within acceptable thresholds.
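A perplexity gate can be written in a few lines once per-token log-probabilities are available for the same evaluation text from both engines. This is a minimal sketch; the function names and the 1% budget are assumptions, and a task-specific metric may be a better gate for some workloads.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def fp8_within_budget(fp16_logprobs: Sequence[float], fp8_logprobs: Sequence[float],
                      max_relative_increase: float = 0.01) -> bool:
    """True if FP8 perplexity exceeds the FP16 baseline by no more than the budget."""
    baseline = perplexity(fp16_logprobs)
    candidate = perplexity(fp8_logprobs)
    return (candidate - baseline) / baseline <= max_relative_increase


# Toy log-probabilities, for illustration only.
fp16 = [-1.0, -1.0, -1.0, -1.0]
print(fp8_within_budget(fp16, [-1.005] * 4))  # small drift
print(fp8_within_budget(fp16, [-1.05] * 4))   # larger drift
```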

5. Treating Development Servers as Production Endpoints

Explanation: trtllm-serve and similar CLI tools lack health checks, metric exposition, dynamic batching controls, and ensemble routing. They are benchmarking utilities, not production surfaces.

Fix: Wrap compiled engines in a Triton tensorrt_llm backend. Configure config.pbtxt for dynamic batching, rate limiting, and health endpoints. Route traffic through Triton in production.

6. Ignoring Paged KV and Continuous Batching

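Explanation: Static KV allocation reserves memory for the maximum sequence length on every request, drastically reducing concurrent capacity. Without continuous batching, new requests wait for the full batch to complete.

Fix: Enable use_paged_kv_cache=True and enable_context_fmha=True during compilation. Enable in-flight (continuous) batching in the serving backend so new requests join the running batch; Triton's dynamic_batching only groups requests before they reach the backend and does not inject them into a batch that is already decoding.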

7. Static Batching Policies in Bursty Traffic

Explanation: Fixed batch sizes cause GPU underutilization during traffic dips and queue saturation during spikes. Production traffic is inherently bursty.

Fix: Use Triton's dynamic_batching with max_queue_delay_microseconds to balance latency and throughput. Tune preferred_batch_size based on observed request patterns.
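The effect of a queue-delay bound can be explored offline with a toy batch former. This sketch is a simplification, not Triton's scheduler: it ignores execution time and preferred batch sizes, and the function name is hypothetical. A batch is dispatched when it is full or when a newly arriving request finds that the oldest queued request has waited longer than the delay bound.

```python
from typing import List


def form_batches(arrivals_us: List[int], max_batch_size: int,
                 max_queue_delay_us: int) -> List[List[int]]:
    """Group request arrival times (microseconds) into dispatched batches."""
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrivals_us):
        # The oldest queued request has exceeded its delay bound: dispatch first.
        if current and t - current[0] > max_queue_delay_us:
            batches.append(current)
            current = []
        current.append(t)
        # A full batch is dispatched immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# A short burst followed by a quiet gap and a second burst.
arrivals = [0, 10, 20, 60_000, 60_010]
print(form_batches(arrivals, max_batch_size=4, max_queue_delay_us=50_000))
print(form_batches(arrivals, max_batch_size=2, max_queue_delay_us=50_000))
```

Replaying recorded arrival timestamps through a model like this gives a first estimate of typical batch sizes before tuning on live traffic.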

Production Bundle

Action Checklist

Validate NVLink topology and confirm all-reduce bandwidth capacity before selecting tensor parallelism degree
Run FP8 calibration against target task distribution; document accuracy delta before compilation
Enable CUDA graph capture explicitly in build configuration; verify activation via runtime logs
Configure paged KV-cache and continuous batching; disable static allocation policies
Wrap compiled engine in Triton tensorrt_llm backend; expose health, metrics, and OpenAI-compatible endpoints
Enforce matched token generation in all benchmarks using ignore_eos=True and min_tokens=max_tokens
Sweep concurrency levels to map latency-to-throughput crossover; document traffic pattern thresholds
Implement request routing based on concurrency regime: low/mid for latency, high for throughput

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Interactive chat / low-latency API	TensorRT-LLM + CUDA Graphs (TP=2/4)	Eliminates per-iteration launch overhead; deterministic execution paths	Higher upfront compilation cost; lower GPU count needed for latency SLA
Batch inference / offline processing	vLLM (JIT Runtime)	Dynamic scheduler maximizes GPU occupancy; no compilation rigidity	Lower operational overhead; higher GPU count for equivalent throughput
Long-context document processing	TensorRT-LLM + Paged KV (FP8)	50% KV-cache reduction enables longer sequences; paged allocation prevents memory fragmentation	Requires FP8 calibration; moderate compilation time
Multi-model ensemble serving	Triton Inference Server	Unified control plane; health/metric exposition; dynamic routing across backends	Increased configuration complexity; standardized deployment surface
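The matrix above reduces to a small routing rule once the crossover concurrency has been measured. The sketch below is one possible encoding; the class, the backend labels, and the threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPolicy:
    crossover_concurrency: int           # measured by the concurrency sweep
    latency_backend: str = "trtllm"      # hypothetical label for the compiled engine
    throughput_backend: str = "vllm"     # hypothetical label for the runtime scheduler


def choose_backend(in_flight_requests: int, latency_sensitive: bool,
                   policy: RoutingPolicy) -> str:
    """Send latency-sensitive traffic to the compiled engine while it is below crossover."""
    if latency_sensitive and in_flight_requests < policy.crossover_concurrency:
        return policy.latency_backend
    return policy.throughput_backend


policy = RoutingPolicy(crossover_concurrency=16)
print(choose_backend(4, latency_sensitive=True, policy=policy))
print(choose_backend(40, latency_sensitive=True, policy=policy))
print(choose_backend(4, latency_sensitive=False, policy=policy))
```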

Configuration Template

# triton_model_repo/llm_engine/config.pbtxt
name: "llm_engine"
# The TensorRT-LLM backend registers itself with Triton as "tensorrtllm".
backend: "tensorrtllm"
max_batch_size: 64

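# A tensor-parallel engine is one model instance spanning all of its GPUs.
# With KIND_GPU, count applies to each listed GPU, so count: 4 over four GPUs
# would request sixteen instances. The TensorRT-LLM backend assigns GPUs to
# its ranks itself, so a single CPU-kind instance is declared here.
instance_group [
  {
    kind: KIND_CPU
    count: 1
  }
]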

dynamic_batching {
  preferred_batch_size: [16, 32, 64]
  max_queue_delay_microseconds: 50000
}

parameters: {
  key: "tensor_parallel"
  value: { string_value: "4" }
}
parameters: {
  key: "max_sequence_length"
  value: { string_value: "4096" }
}
parameters: {
  key: "enable_cuda_graph"
  value: { string_value: "true" }
}
parameters: {
  key: "use_paged_kv_cache"
  value: { string_value: "true" }
}

Quick Start Guide

1. Compile the engine: Run the build script with your target precision, tensor parallelism, and CUDA graph flags. Verify graph capture logs on first request.
2. Deploy to Triton: Place the compiled engine and config.pbtxt in the model repository directory. Start Triton with tritonserver --model-repository=./triton_model_repo.
3. Validate endpoints: Query Triton's readiness endpoint, /v2/health/ready, and confirm that the model is listed (/v1/models when an OpenAI-compatible frontend is in use). Send a test request with ignore_eos=True and fixed token bounds.
4. Profile concurrency: Sweep concurrency from 1 to 64. Record TTFT, inter-token latency, and throughput. Identify the crossover point where latency degrades and throughput plateaus.
5. Route traffic: Direct latency-sensitive requests to the compiled engine during low/mid concurrency. Route high-concurrency batch traffic to a runtime scheduler or scale horizontally.
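TTFT and inter-token latency in the profiling step can both be derived from the arrival time of each streamed token. The helper below is a minimal sketch; the function name is hypothetical and it assumes the client recorded a monotonic timestamp when the request was sent and one per received token.

```python
from statistics import mean
from typing import Dict, List


def latency_metrics(request_sent: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, mean inter-token latency, and total latency in seconds."""
    if not token_times:
        raise ValueError("no tokens were received")
    # Gaps between consecutive tokens; empty when only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft_s": token_times[0] - request_sent,
        "mean_itl_s": mean(gaps) if gaps else 0.0,
        "total_s": token_times[-1] - request_sent,
    }


# Toy timestamps: first token after 0.25 s, then one token every 0.25 s.
print(latency_metrics(0.0, [0.25, 0.5, 0.75, 1.0]))
```

Collecting these per request at each concurrency level, then reporting percentiles rather than means, makes the latency side of the crossover visible.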

Notes on Serving LLMs with TensorRT-LLM and Triton