Notes on Serving LLMs with TensorRT-LLM and Triton
Compiling for Latency, Scaling for Throughput: A Production Guide to TensorRT-LLM and Triton
Current Situation Analysis
The industry standard for LLM serving has converged around runtime-flexible frameworks that prioritize ease of deployment over hardware-specific optimization. Teams routinely benchmark serving stacks using mismatched workloads, disabled memory management features, and uncontrolled token generation. This creates a false equivalence between frameworks that compile ahead-of-time and those that execute dynamically. The result is infrastructure misalignment: latency-sensitive applications get routed to throughput-optimized runtimes, while high-concurrency batch pipelines are forced into rigid, low-capacity engines.
This problem persists because serving is treated as a software configuration task rather than a hardware topology problem. Engineers rarely account for the physical constraints of multi-GPU interconnects during engine design. On NVIDIA Hopper architectures, tensor parallelism introduces an all-reduce communication step that consumes approximately 77% of available NVLink bandwidth during layer sharding. When teams blindly scale tensor parallelism (TP) without distinguishing between prefill and decode phases, they inadvertently throttle decode latency. Prefill operations are bandwidth-bound and benefit from aggressive sharding, while decode operations are latency-bound and degrade when cross-GPU synchronization overhead exceeds compute gains.
Furthermore, benchmarking methodologies frequently ignore the fundamental requirement of matched workloads. If two serving stacks generate different token counts per request, throughput and latency metrics become mathematically incomparable. Production serving also demands continuous batching and paged KV-cache allocation. Disabling these features reduces a serving stack to a static batch processor, invalidating any performance claims for real-world traffic patterns. The missing link is a compilation-first strategy that locks in hardware topology, precision, and batching policies before deployment, paired with a hardened control plane that exposes observability, health checks, and dynamic routing.
WOW Moment: Key Findings
The performance boundary between ahead-of-time compiled engines and runtime-flexible servers is not absolute; it is concurrency-dependent. When workloads are strictly controlled for token count, precision, and hardware topology, a clear crossover emerges. TensorRT-LLM with CUDA graph capture dominates the latency-sensitive regime, while runtime schedulers dominate the throughput-saturated regime. This crossover dictates infrastructure strategy: you do not choose a framework based on preference, you choose based on traffic pattern.
| Serving Architecture | Low Concurrency Latency (TTFT) | High Concurrency Throughput | Compilation Overhead |
|---|---|---|---|
| TensorRT-LLM + CUDA Graphs | ~15-20% lower | Saturates earlier | High (AOT) |
| vLLM (JIT Runtime) | Baseline | ~10-15% higher | Low (Runtime) |
This finding matters because it replaces framework loyalty with traffic-aware routing. Latency-bound applications (interactive chat, real-time agents, low-latency APIs) benefit from the deterministic execution paths and eliminated per-iteration launch overhead that CUDA graphs provide. Throughput-bound pipelines (batch inference, offline processing, high-concurrency public APIs) benefit from dynamic schedulers that maximize GPU occupancy without compilation rigidity. The compilation step is not a bottleneck; it is a performance multiplier that only activates when the deployment surface matches the traffic profile.
Core Solution
Moving from a Hugging Face checkpoint to a production endpoint requires four distinct phases. Each phase locks in a hardware or software constraint that directly impacts latency, throughput, or accuracy.
Phase 1: Precision Selection and Calibration
Precision dictates memory footprint and compute density. FP16 serves as the accuracy baseline. FP8 leverages the Hopper Transformer Engine to compress weights and KV-cache by approximately 50%, directly increasing concurrent sequence capacity. The trade-off is model-dependent accuracy degradation. Never deploy FP8 without running a calibration suite against your target task distribution. Use a representative dataset to measure perplexity or task-specific metrics before committing to the compiled engine.
Phase 2: Ahead-of-Time Engine Compilation
TensorRT-LLM requires a fixed tensor-parallel degree, maximum batch size, and precision profile at compile time. This rigidity is intentional: the compiler fuses kernels, optimizes memory layouts, and captures CUDA graphs to eliminate Python runtime overhead.
import tensorrt_llm
from tensorrt_llm.builder import BuildConfig, Builder
from tensorrt_llm.models import LLMConfig
from tensorrt_llm.quantization import QuantMode
def compile_inference_engine(
checkpoint_path: str,
output_dir: str,
tensor_parallel: int = 4,
max_batch_size: int = 64,
max_seq_len: int = 4096,
use_fp8: bool = False
) -> None:
quant_mode = QuantMode.FP8 if use_fp8 else QuantMode.NONE
build_cfg = BuildConfig(
tensor_parallel=tensor_parallel,
max_batch_size=max_batch_size,
max_seq_len=max_seq_len,
quant_mode=quant_mode,
enable_cuda_graph=True,
use_paged_kv_cache=True,
enable_context_fmha=True
)
model_cfg = LLMConfig.from_hf_checkpoint(
checkpoint_path=checkpoint_path,
dtype="float16"
)
builder = Builder()
engine = builder.build_engine(
model_config=model_cfg,
build_config=build_cfg,
output_dir=output_dir
)
builder.save_engine(engine, output_dir)
print(f"Engine compiled to {output_dir} | TP={tensor_parallel} | CUDA Graphs=ON")
Architecture Rationale:
enable_cuda_graph=Truecaptures the execution graph once, removing per-iteration kernel launch overhead. This is the primary driver of low-concurrency latency improvements.use_paged_kv_cache=Trueallocates KV memory in fixed-size blocks, preventing worst-case sequence length reservations and enabling higher concurrency.tensor_parallelmust match the physical NVLink topology. Over-sharding decode-heavy workloads introduces synchronization latency that outweighs compute parallelism.
Phase 3: Triton Model Repository Configuration
Triton Inference Server replaces development scripts with a production-grade control plane. It provides health endpoints, metric exposition, dynamic batching configuration, and ensemble routing. The compiled engine is wrapped in a tensorrt_llm backend configuration that exposes an OpenAI-compatible HTTP/gRPC interface.
Phase 4: Controlled Load Testing
Benchmarking requires strict workload parity. Every request must decode an identical number of tokens to ensure throughput and latency metrics reflect identical computational work.
import asyncio
import httpx
import time
async def run_controlled_benchmark(
endpoint_url: str,
prompt: str,
target_tokens: int = 256,
concurrency: int = 10
) -> dict:
async with httpx.AsyncClient(timeout=120.0) as client:
tasks = []
for _ in range(concurrency):
payload = {
"model": "default",
"prompt": prompt,
"max_tokens": target_tokens,
"min_tokens": target_tokens,
"ignore_eos": True,
"temperature": 0.0
}
tasks.append(client.post(f"{endpoint_url}/v1/completions", json=payload))
start = time.perf_counter()
responses = await asyncio.gather(*tasks)
elapsed = time.perf_counter() - start
total_tokens = 0
for resp in responses:
data = resp.json()
total_tokens += len(data.get("choices", [{}])[0].get("text", "").split())
return {
"elapsed_seconds": round(elapsed, 3),
"total_tokens_generated": total_tokens,
"throughput_tokens_per_sec": round(total_tokens / elapsed, 2),
"concurrency": concurrency
}
Architecture Rationale:
ignore_eos=Trueandmin_tokens=max_tokensforce deterministic token generation, eliminating variance from model stopping conditions.concurrencyis swept incrementally to map the latency-to-throughput crossover curve.- Metrics are collected under matched concurrency to ensure fair comparison across serving stacks.
Pitfall Guide
1. The CUDA Graph Blind Spot
Explanation: CUDA graphs are not enabled by default in all build configurations. A compiled engine without graph capture runs with standard Python kernel launches, negating the primary latency advantage.
Fix: Explicitly set enable_cuda_graph=True during compilation and verify graph capture logs during the first inference request. Monitor triton metrics for cuda_graph_memory_usage to confirm activation.
2. Mismatched Decode Lengths in Benchmarks
Explanation: Frameworks that truncate outputs early appear faster because they perform less compute. Comparing token/s across stacks with different effective sequence lengths produces mathematically invalid results.
Fix: Enforce min_tokens == max_tokens and ignore_eos=True in all load tests. Report exact token counts alongside throughput metrics.
3. Over-Sharding for Decode-Heavy Workloads
Explanation: Tensor parallelism introduces an all-reduce step that consumes NVLink bandwidth. On decode-heavy traffic (one token per iteration), cross-GPU synchronization latency exceeds compute gains, increasing inter-token latency. Fix: Profile prefill vs decode ratios. Use TP=4 or TP=8 for prefill-bound workloads. Drop to TP=1 or TP=2 for decode-bound workloads to minimize synchronization overhead.
4. FP8 Deployment Without Accuracy Validation
Explanation: FP8 compression reduces memory but introduces quantization noise. Deploying without task-specific calibration risks silent accuracy degradation in production. Fix: Run a calibration suite against your target distribution before compilation. Compare perplexity or task metrics against FP16 baseline. Only proceed if delta falls within acceptable thresholds.
5. Treating Development Servers as Production Endpoints
Explanation: trtllm-serve and similar CLI tools lack health checks, metric exposition, dynamic batching controls, and ensemble routing. They are benchmarking utilities, not production surfaces.
Fix: Wrap compiled engines in a Triton tensorrt_llm backend. Configure config.pbtxt for dynamic batching, rate limiting, and health endpoints. Route traffic through Triton in production.
6. Ignoring Paged KV and Continuous Batching
Explanation: Static KV allocation reserves memory for maximum sequence length per request, drastically reducing concurrent capacity. Without continuous batching, new requests wait for full batch completion.
Fix: Enable use_paged_kv_cache=True and enable_context_fmha=True during compilation. Configure Triton dynamic_batching to allow in-flight request injection.
7. Static Batching Policies in Bursty Traffic
Explanation: Fixed batch sizes cause GPU underutilization during traffic dips and queue saturation during spikes. Production traffic is inherently bursty.
Fix: Use Triton's dynamic_batching with max_queue_delay_microseconds to balance latency and throughput. Tune preferred_batch_size based on observed request patterns.
Production Bundle
Action Checklist
- Validate NVLink topology and confirm all-reduce bandwidth capacity before selecting tensor parallelism degree
- Run FP8 calibration against target task distribution; document accuracy delta before compilation
- Enable CUDA graph capture explicitly in build configuration; verify activation via runtime logs
- Configure paged KV-cache and continuous batching; disable static allocation policies
- Wrap compiled engine in Triton
tensorrt_llmbackend; expose health, metrics, and OpenAI-compatible endpoints - Enforce matched token generation in all benchmarks using
ignore_eos=Trueandmin_tokens=max_tokens - Sweep concurrency levels to map latency-to-throughput crossover; document traffic pattern thresholds
- Implement request routing based on concurrency regime: low/mid for latency, high for throughput
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Interactive chat / low-latency API | TensorRT-LLM + CUDA Graphs (TP=2/4) | Eliminates per-iteration launch overhead; deterministic execution paths | Higher upfront compilation cost; lower GPU count needed for latency SLA |
| Batch inference / offline processing | vLLM (JIT Runtime) | Dynamic scheduler maximizes GPU occupancy; no compilation rigidity | Lower operational overhead; higher GPU count for equivalent throughput |
| Long-context document processing | TensorRT-LLM + Paged KV (FP8) | 50% KV-cache reduction enables longer sequences; paged allocation prevents memory fragmentation | Requires FP8 calibration; moderate compilation time |
| Multi-model ensemble serving | Triton Inference Server | Unified control plane; health/metric exposition; dynamic routing across backends | Increased configuration complexity; standardized deployment surface |
Configuration Template
# triton_model_repo/llm_engine/config.pbtxt
name: "llm_engine"
backend: "tensorrt_llm"
max_batch_size: 64
instance_group [
{
kind: KIND_GPU
count: 4
gpus: [0, 1, 2, 3]
}
]
dynamic_batching {
preferred_batch_size: [16, 32, 64]
max_queue_delay_microseconds: 50000
}
parameters: {
key: "tensor_parallel"
value: { string_value: "4" }
}
parameters: {
key: "max_sequence_length"
value: { string_value: "4096" }
}
parameters: {
key: "enable_cuda_graph"
value: { string_value: "true" }
}
parameters: {
key: "use_paged_kv_cache"
value: { string_value: "true" }
}
Quick Start Guide
- Compile the engine: Run the build script with your target precision, tensor parallelism, and CUDA graph flags. Verify graph capture logs on first request.
- Deploy to Triton: Place the compiled engine and
config.pbtxtin the model repository directory. Start Triton withtritonserver --model-repository=./triton_model_repo. - Validate endpoints: Query
/v1/modelsand/v1/healthto confirm readiness. Send a test request withignore_eos=Trueand fixed token bounds. - Profile concurrency: Sweep concurrency from 1 to 64. Record TTFT, inter-token latency, and throughput. Identify the crossover point where latency degrades and throughput plateaus.
- Route traffic: Direct latency-sensitive requests to the compiled engine during low/mid concurrency. Route high-concurrency batch traffic to a runtime scheduler or scale horizontally.
Mid-Year Sale β Unlock Full Article
Base plan from just $4.99/mo or $49/yr
Sign in to read the full article and unlock all tutorials.
Sign In / Register β Start Free Trial7-day free trial Β· Cancel anytime Β· 30-day money-back
