or Token-Level Observability
Standard Prometheus exporters track HTTP status codes and response times. You need a custom metrics adapter that exposes inference-specific telemetry. The serving engine (vLLM, TGI, or TensorRT-LLM) must emit:
- tokens_per_second (aggregate throughput)
- ttft_seconds (prefill completion time)
- tpot_seconds (decode step duration)
- kv_cache_usage_ratio (VRAM allocated to cached key/value tensors)
- queue_depth (pending requests awaiting GPU scheduling)
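As a rough illustration of how the per-request latency metrics relate to each other, the sketch below derives TTFT, TPOT, and throughput from token emission timestamps. The `RequestTimeline` shape and `summarizeTimeline` function are hypothetical names, not part of any serving engine, and TTFT is measured here from request arrival, so it includes queueing as well as prefill.

```typescript
// Hypothetical shape: timestamps in milliseconds recorded by the caller.
interface RequestTimeline {
  arrivalMs: number;      // when the request was accepted
  tokenTimesMs: number[]; // emission time of each output token, ascending
}

interface LatencyBreakdown {
  ttftSeconds: number;     // arrival -> first token (queueing + prefill)
  tpotSeconds: number;     // mean gap between consecutive output tokens
  tokensPerSecond: number; // output tokens divided by total wall time
}

function summarizeTimeline(t: RequestTimeline): LatencyBreakdown {
  const n = t.tokenTimesMs.length;
  if (n === 0) {
    throw new Error('no output tokens recorded');
  }
  const first = t.tokenTimesMs[0];
  const last = t.tokenTimesMs[n - 1];
  const ttftSeconds = (first - t.arrivalMs) / 1000;
  // With a single token there is no decode gap to measure.
  const tpotSeconds = n > 1 ? (last - first) / (n - 1) / 1000 : 0;
  const totalSeconds = (last - t.arrivalMs) / 1000;
  const tokensPerSecond = totalSeconds > 0 ? n / totalSeconds : 0;
  return { ttftSeconds, tpotSeconds, tokensPerSecond };
}

// Five tokens: the first after 400 ms, then one every 40 ms.
const timelineSummary = summarizeTimeline({
  arrivalMs: 0,
  tokenTimesMs: [400, 440, 480, 520, 560],
});
console.log(timelineSummary);
```

Keeping the two latencies separate makes it visible when a request spent most of its time waiting for the first token rather than decoding.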
Step 2: Decouple Prefill and Decode Workloads
Prefill is compute-bound and scales with input length. Decode is memory-bandwidth-bound and scales with output length. Running them on the same GPU pool creates scheduling contention. Modern engines like vLLM mitigate this with continuous batching and PagedAttention, but production clusters benefit from explicit separation:
- Prefill Group: High compute, lower memory pressure. Optimized for short TTFT.
- Decode Group: High memory bandwidth, steady-state generation. Optimized for low TPOT and KV cache retention.
Step 3: Implement Queue-Aware Routing
Round-robin distribution fragments KV cache locality and leaves some workers idle while overloading others. A custom ingress router should evaluate:
- Estimated prompt length (from Content-Length or tokenized preview)
- Current queue depth per replica
- KV cache hit probability (for prefix caching)
- Tenant priority or SLO tier
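One way to combine these four signals is a per-replica cost function, sketched below. The weights, the `Tier` names, and the `ReplicaState` shape are all hypothetical and would need tuning against real traffic; the idea is only that queue depth and uncached prefill work are added into one comparable number.

```typescript
type Tier = 'standard' | 'premium';

interface ReplicaState {
  id: string;
  queueDepth: number;           // pending requests on this replica
  prefixHitProbability: number; // 0..1, estimated chance of KV prefix reuse
}

interface RoutingRequest {
  estimatedPromptTokens: number;
  tier: Tier;
}

// Hypothetical weights: premium traffic penalizes queueing more heavily.
const QUEUE_WEIGHT: Record<Tier, number> = { standard: 1, premium: 3 };
const COST_PER_UNCACHED_TOKEN = 1 / 1000;

function routingCost(req: RoutingRequest, replica: ReplicaState): number {
  // Tokens expected to need prefill after prefix-cache reuse.
  const uncachedTokens = req.estimatedPromptTokens * (1 - replica.prefixHitProbability);
  return QUEUE_WEIGHT[req.tier] * replica.queueDepth + COST_PER_UNCACHED_TOKEN * uncachedTokens;
}

function pickReplica(req: RoutingRequest, replicas: ReplicaState[]): ReplicaState {
  if (replicas.length === 0) {
    throw new Error('no replicas available');
  }
  // Lowest cost wins; ties keep the earlier replica.
  return replicas.reduce((best, r) => (routingCost(req, r) < routingCost(req, best) ? r : best));
}

const pool: ReplicaState[] = [
  { id: 'a', queueDepth: 2, prefixHitProbability: 0 },
  { id: 'b', queueDepth: 4, prefixHitProbability: 0.9 },
];
console.log(pickReplica({ estimatedPromptTokens: 3000, tier: 'standard' }, pool).id);
console.log(pickReplica({ estimatedPromptTokens: 3000, tier: 'premium' }, pool).id);
```

With these assumed weights, the same 3,000-token prompt goes to the cache-warm replica for standard traffic and to the shorter queue for premium traffic.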
HPA on CPU is ineffective. Use KEDA (Kubernetes Event-driven Autoscaling) with custom metrics to scale based on queue depth and token throughput. Maintain a warm pool of preloaded models to absorb traffic spikes without cold-start penalties.
Code Example: Token-Aware Metrics Adapter (TypeScript)
This adapter transforms raw inference telemetry into KEDA-compatible custom metrics. It replaces naive RPS tracking with token economics.
import { Registry, Gauge, Counter } from 'prom-client';

// One telemetry sample reported by the serving engine.
interface InferenceTelemetry {
  modelId: string;
  inputTokens: number;
  outputTokens: number;
  ttftMs: number;
  tpotMs: number;
  kvCacheOccupancy: number;
  pendingQueue: number;
}

export class TokenMetricsAdapter {
  private registry: Registry;
  private ttftGauge: Gauge;
  private tpotGauge: Gauge;
  private kvCacheGauge: Gauge;
  private queueDepthGauge: Gauge;
  private tokenThroughputCounter: Counter;

  constructor() {
    // Dedicated registry so only inference metrics are exported.
    this.registry = new Registry();
    this.ttftGauge = new Gauge({
      name: 'llm_ttft_seconds',
      help: 'Time to first token in seconds',
      registers: [this.registry],
      labelNames: ['model_id']
    });
    this.tpotGauge = new Gauge({
      name: 'llm_tpot_seconds',
      help: 'Time per output token in seconds',
      registers: [this.registry],
      labelNames: ['model_id']
    });
    this.kvCacheGauge = new Gauge({
      name: 'llm_kv_cache_usage_ratio',
      help: 'Fraction of GPU VRAM allocated to KV cache',
      registers: [this.registry],
      labelNames: ['model_id']
    });
    this.queueDepthGauge = new Gauge({
      name: 'llm_queue_depth',
      help: 'Number of requests waiting for GPU scheduling',
      registers: [this.registry],
      labelNames: ['model_id']
    });
    this.tokenThroughputCounter = new Counter({
      name: 'llm_tokens_processed_total',
      help: 'Total tokens processed (input + output)',
      registers: [this.registry],
      labelNames: ['model_id', 'token_type']
    });
  }

  // Record one telemetry sample; latencies are converted from ms to seconds.
  ingest(telemetry: InferenceTelemetry): void {
    const { modelId, ttftMs, tpotMs, kvCacheOccupancy, pendingQueue, inputTokens, outputTokens } = telemetry;
    this.ttftGauge.set({ model_id: modelId }, ttftMs / 1000);
    this.tpotGauge.set({ model_id: modelId }, tpotMs / 1000);
    this.kvCacheGauge.set({ model_id: modelId }, kvCacheOccupancy);
    this.queueDepthGauge.set({ model_id: modelId }, pendingQueue);
    this.tokenThroughputCounter.inc({ model_id: modelId, token_type: 'input' }, inputTokens);
    this.tokenThroughputCounter.inc({ model_id: modelId, token_type: 'output' }, outputTokens);
  }

  // Prometheus text exposition, suitable for serving from a /metrics endpoint.
  async getMetrics(): Promise<string> {
    return this.registry.metrics();
  }
}
Architecture Rationale
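- Why KEDA over HPA: HPA natively scales on CPU and memory resource metrics; custom or external metrics require a separate metrics adapter. KEDA provides that layer through built-in scalers, including Prometheus queries (which can express SLO burn rates) and cron schedules for predictive scaling, which aligns with token economics.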
- Why separate prefill/decode: Continuous batching helps, but explicit separation prevents long-context prompts from blocking decode streams. It also enables different GPU types per phase (e.g., A100 for prefill, L40S for decode).
- Why queue-depth scaling: Queue depth directly correlates with user-perceived latency. Scaling on it prevents the "utilization looks fine but users are timing out" paradox.
Pitfall Guide
1. Scaling on CPU Utilization
Explanation: LLM inference offloads compute to GPUs. CPU often handles lightweight HTTP framing, tokenization, and routing. CPU can sit at 15% while GPUs are saturated and queues are backing up.
Fix: Replace CPU HPA with KEDA scalers targeting llm_queue_depth or the rate of llm_tokens_processed_total. Use GPU memory bandwidth as a secondary trigger.
2. Ignoring KV Cache Eviction Limits
Explanation: KV cache grows linearly with context length and batch size. When VRAM fills, the serving engine either rejects requests or triggers expensive cache eviction, causing TPOT spikes.
Fix: Configure maximum context windows per model. Use an engine with PagedAttention and enable chunked prefill (both are available in vLLM). Set autoscaling thresholds at 75% KV cache occupancy, not 90%.
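The linear growth can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a Llama-3-70B-like shape (80 layers, 8 KV heads, head dimension 128, BF16); those figures and the 40 GiB KV budget are assumptions for illustration, so substitute the values from your model's config.

```typescript
interface ModelShape {
  layers: number;
  kvHeads: number;         // key/value heads (after grouped-query attention)
  headDim: number;
  bytesPerElement: number; // 2 for BF16/FP16
}

// Each token stores one key and one value vector per layer per KV head.
function kvBytesPerToken(m: ModelShape): number {
  return 2 * m.layers * m.kvHeads * m.headDim * m.bytesPerElement;
}

// Tokens (summed over all sequences in the batch) that fit in a KV budget.
function maxCachedTokens(m: ModelShape, kvBudgetBytes: number): number {
  return Math.floor(kvBudgetBytes / kvBytesPerToken(m));
}

// Assumed shape, roughly matching a 70B model with grouped-query attention.
const shape: ModelShape = { layers: 80, kvHeads: 8, headDim: 128, bytesPerElement: 2 };
const GIB = 1024 * 1024 * 1024;

const perToken = kvBytesPerToken(shape);           // 327680 bytes (320 KiB)
const capacity = maxCachedTokens(shape, 40 * GIB); // 131072 tokens
const scaleOutAt = Math.floor(capacity * 0.75);    // 98304 tokens at 75% occupancy
console.log({ perToken, capacity, scaleOutAt });
```

Under these assumptions a single 128K-token sequence would consume the whole 40 GiB budget, which is why the occupancy threshold needs headroom.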
3. Treating TTFT and TPOT as a Single Latency Metric
Explanation: Aggregating latency into a single p95 masks prefill vs decode behavior. A model can have excellent TPOT but terrible TTFT due to queueing, or vice versa.
Fix: Track TTFT and TPOT as separate SLOs. Route long prompts to prefill-optimized workers. Consider speculative decoding to improve TPOT, keeping in mind that a draft model adds some VRAM overhead of its own.
4. Naive Round-Robin Load Balancing
Explanation: Distributing requests evenly ignores prompt length variance and KV cache locality. Long prompts monopolize workers, while short prompts wait behind them. Prefix caching becomes ineffective.
Fix: Deploy a cache-aware router that hashes prompt prefixes to specific replicas. Factor queue depth and estimated compute cost into routing decisions.
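A minimal sketch of prefix-hash routing with a queue-depth escape hatch follows. It hashes the first characters of the prompt with an FNV-1a style hash over UTF-16 code units; the `Replica` shape, the 256-character prefix, and the queue limit of 8 are hypothetical. Plain modulo hashing reshuffles assignments when the replica count changes, so a production router would more likely use consistent hashing.

```typescript
interface Replica {
  id: string;
  queueDepth: number;
}

// FNV-1a style 32-bit hash; the shift-add form multiplies by the FNV prime 0x01000193.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash >>> 0;
}

function routeByPrefix(
  prompt: string,
  replicas: Replica[],
  prefixChars: number = 256,  // assumed prefix length to hash
  maxQueueDepth: number = 8   // assumed overload limit
): Replica {
  if (replicas.length === 0) {
    throw new Error('no replicas available');
  }
  // Prompts sharing a prefix land on the same replica, keeping its prefix cache warm.
  const preferred = replicas[fnv1a32(prompt.slice(0, prefixChars)) % replicas.length];
  if (preferred.queueDepth <= maxQueueDepth) {
    return preferred;
  }
  // Preferred replica is overloaded: give up locality and take the shortest queue.
  return replicas.reduce((a, b) => (b.queueDepth < a.queueDepth ? b : a));
}

const replicas: Replica[] = [
  { id: 'r0', queueDepth: 1 },
  { id: 'r1', queueDepth: 3 },
  { id: 'r2', queueDepth: 0 },
];
const systemPrompt = 'You are a helpful assistant. ';
console.log(routeByPrefix(systemPrompt + 'Summarize this.', replicas, 20).id);
console.log(routeByPrefix(systemPrompt + 'Translate this.', replicas, 20).id);
```

Both calls hash the same 20-character prefix, so they resolve to the same replica as long as its queue stays under the limit.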
5. Cold-Start Autoscaling Without Warm Pools
Explanation: Loading a 70B model in BF16 takes 2–6 minutes depending on storage I/O and CUDA initialization. Scaling from zero guarantees SLO violations during traffic spikes.
Fix: Maintain a minimum of 2–3 warm replicas. Use local NVMe caching for model weights. Implement predictive scaling based on historical token throughput patterns.
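To see where a multi-minute load time comes from, the sketch below estimates it from weight size and read throughput. The throughput figures (0.5 GB/s for remote storage, 3 GB/s for local NVMe) and the 30-second initialization allowance are illustrative assumptions, not measurements.

```typescript
// Weight size in bytes for a dense model: parameters x bytes per parameter.
function weightBytes(paramsBillions: number, bytesPerParam: number): number {
  return paramsBillions * 1e9 * bytesPerParam;
}

// Rough load time: sequential read of the weights plus a fixed init allowance.
function loadSeconds(bytes: number, readGBps: number, initSeconds: number): number {
  return bytes / (readGBps * 1e9) + initSeconds;
}

const bf16Weights = weightBytes(70, 2);                 // 140e9 bytes for 70B in BF16
const fromRemote = loadSeconds(bf16Weights, 0.5, 30);   // 310 s with the assumed remote rate
const fromLocalNvme = loadSeconds(bf16Weights, 3, 30);  // about 77 s with the assumed NVMe rate
console.log({ fromRemote, fromLocalNvme });
```

Even the optimistic case is far longer than a typical TTFT budget, which is the argument for keeping warm replicas rather than scaling from zero.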
6. Misconfiguring Tensor Parallelism Boundaries
Explanation: Tensor parallelism splits model weights across GPUs. If pod scheduling doesn't respect node locality, inter-GPU communication traverses the network, destroying throughput.
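Fix: Keep each tensor-parallel group on a single node, either by requesting all of its GPUs in one pod or by using pod affinity and node selectors (topologySpreadConstraints spreads pods apart rather than co-locating them). Validate with nvidia-smi nvlink bandwidth checks.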
7. Overprovisioning Context Windows
Explanation: Allowing 128K context for every request burns VRAM on unused KV cache. Most production prompts stay within 4K–8K tokens.
Fix: Set dynamic context limits per tenant or model. Implement prompt truncation or summarization fallbacks. Charge cost per token to discourage unnecessary context expansion.
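A per-tenant limit with a truncation fallback could look like the sketch below. `TenantPolicy` and both function names are hypothetical, and the truncation keeps the most recent tokens, which is only one possible policy: it can drop a system prompt at the start, so a real implementation might pin the first tokens or summarize instead.

```typescript
interface TenantPolicy {
  maxContextTokens: number;     // tenant's allowed window (prompt + output)
  reservedOutputTokens: number; // space held back for generation
}

// Prompt tokens a tenant may send, bounded by both tenant and model limits.
function promptBudget(policy: TenantPolicy, modelMaxContext: number): number {
  const window = Math.min(policy.maxContextTokens, modelMaxContext);
  return Math.max(0, window - policy.reservedOutputTokens);
}

// Fallback: keep only the most recent tokens that fit in the budget.
function truncatePrompt(tokens: number[], budget: number): number[] {
  return tokens.length <= budget ? tokens : tokens.slice(tokens.length - budget);
}

const policy: TenantPolicy = { maxContextTokens: 8192, reservedOutputTokens: 1024 };
const budget = promptBudget(policy, 131072); // 7168 prompt tokens
const kept = truncatePrompt([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 4); // [7, 8, 9, 10]
console.log({ budget, kept });
```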
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Short prompts, high concurrency | vLLM with continuous batching + single GPU pool | Maximizes throughput, minimizes TTFT via efficient KV cache reuse | Lowers cost per token by 30–50% vs naive batching |
| Long context, low concurrency | Dedicated prefill/decode separation + high VRAM GPUs | Prevents decode starvation, preserves KV cache locality | Increases GPU spend but reduces timeout rates and retry costs |
| Multi-tenant SLOs | Queue-aware routing + KEDA SLO burn rate scaling | Guarantees latency tiers without overprovisioning | Optimizes GPU utilization, reduces idle spend by 20–40% |
| Budget-constrained inference | INT8/FP8 quantization + speculative decoding | Cuts VRAM requirements, accelerates decode steps | Reduces GPU hour costs by 40–60% with <2% accuracy loss |
Configuration Template
KEDA scaler targeting queue depth and token throughput. Replace placeholder metrics with your serving engine's Prometheus endpoints.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-decoder-pool
  pollingInterval: 10
  cooldownPeriod: 120
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(llm_queue_depth{model_id="llama-3-70b"})
        threshold: "15"
        metricName: llm_queue_depth
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(llm_tokens_processed_total{token_type="output"}[1m]))
        threshold: "5000"
        metricName: llm_output_tokens_per_sec
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 30
          policies:
            - type: Pods
              value: 3
              periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Pods
              value: 1
              periodSeconds: 120
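To reason about the thresholds in this template, it helps to approximate the replica count the autoscaler will ask for. The sketch below treats each threshold as a per-replica average target, takes the largest demand across triggers, and clamps it to the replica bounds. It deliberately ignores the HPA tolerance band and the stabilization windows and rate policies configured under `behavior`, so it is a simplification; `TriggerSample` and `desiredReplicas` are hypothetical names.

```typescript
interface TriggerSample {
  name: string;
  value: number;     // current result of the Prometheus query
  threshold: number; // per-replica target from the trigger metadata
}

function desiredReplicas(
  triggers: TriggerSample[],
  minReplicas: number,
  maxReplicas: number
): number {
  // Each trigger wants enough replicas to bring its per-replica average under the threshold.
  const wanted = triggers.map((t) => Math.ceil(t.value / t.threshold));
  // The most demanding trigger wins, then the result is clamped to the bounds.
  const highest = Math.max(minReplicas, ...wanted);
  return Math.min(maxReplicas, highest);
}

const replicasNeeded = desiredReplicas(
  [
    { name: 'llm_queue_depth', value: 70, threshold: 15 },              // ceil(70 / 15) = 5
    { name: 'llm_output_tokens_per_sec', value: 12000, threshold: 5000 } // ceil(12000 / 5000) = 3
  ],
  2,
  20
);
console.log(replicasNeeded); // 5
```

In this example the queue-depth trigger drives the decision even though token throughput alone would settle for three replicas.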
Quick Start Guide
- Deploy a token-aware serving engine: Use vLLM or TensorRT-LLM with --enable-prefix-caching and --max-num-batched-tokens tuned to your VRAM.
- Expose custom metrics: Configure the engine's Prometheus exporter to emit ttft, tpot, kv_cache_usage, and queue_depth.
- Install KEDA and apply the ScaledObject: Point the scaler to your Prometheus instance and set thresholds based on your SLOs.
- Route with queue awareness: Deploy a lightweight ingress proxy that reads /metrics from backend pods and routes to the replica with the lowest queue depth and highest KV cache hit rate.
- Validate with load testing: Use a token-aware load generator (e.g., llm-perf or custom scripts) to simulate mixed prompt lengths. Verify TTFT stays under 2s and TPOT under 50ms at target concurrency.
Scaling LLMs successfully requires abandoning request-centric mental models. When you measure tokens, track prefill/decode separately, and route based on cache locality and queue depth, Kubernetes becomes a predictable platform for generative workloads instead of a source of silent degradation.