# Token-Centric Orchestration: Rethinking Kubernetes Scaling for Large Language Models

*Everything you know about scaling web apps breaks when you serve an LLM.*

By Codcompass Team · 8 min read

## Current Situation Analysis

Platform teams have spent the last decade perfecting request-driven autoscaling. You deploy a stateless service, attach a Horizontal Pod Autoscaler (HPA) to CPU or memory thresholds, and let Kubernetes handle the rest. This model assumes uniform workloads, predictable memory footprints, and near-instant capacity provisioning. When you apply the same playbook to large language model (LLM) inference, the abstraction fractures immediately.

The core misunderstanding stems from treating generative AI workloads as traditional HTTP services. In standard web architectures, a request is a discrete, relatively uniform unit of work. In LLM serving, a request is merely a container for a highly variable computational payload. One prompt might consume 64 input tokens and generate 128 output tokens. Another might push 32,000 input tokens through the context window and stream 4,000 outputs. Both register as 200 OK in your access logs, but their GPU compute profiles, memory pressure, and queue impact differ by orders of magnitude.

This mismatch creates three systemic blind spots:

1. **Metric Misalignment**: RPS (requests per second) and CPU utilization become noise. The actual throughput unit is tokens, and the actual bottleneck is GPU memory bandwidth and KV cache capacity.
2. **Topology Ambiguity**: A single model instance rarely maps to one pod. Tensor parallelism, pipeline parallelism, and multi-node Ray clusters mean that "scaling replicas" requires coordinated group scheduling, not independent pod creation.
3. **Latency Fragmentation**: User experience splits into two distinct phases. Time to First Token (TTFT) dictates perceived responsiveness, while Time Per Output Token (TPOT) controls streaming fluidity. Traditional p95 latency masks both.

Production environments that ignore these shifts routinely experience silent degradation: GPUs report 80% utilization while queue depth spikes, TTFT crosses 8-second thresholds, and cloud bills balloon from idle tensor-parallel groups waiting for decode saturation. The infrastructure hasn't failed; the scaling paradigm has.

## Key Findings

The shift from request-centric to token-centric orchestration isn't theoretical. It fundamentally changes how you measure capacity, trigger scaling, and route traffic. The following comparison isolates the operational divergence between traditional API scaling and LLM inference scaling.

| Dimension | Traditional Web Scaling | LLM Inference Scaling |
|-----------|-------------------------|------------------------|
| Unit of Work | HTTP Request | Input/Output Token |
| Primary Bottleneck | CPU Cores / Heap Memory | GPU VRAM / Memory Bandwidth / KV Cache |
| Latency Definition | Single p95/p99 duration | TTFT (prefill) + TPOT (decode) |
| Scaling Trigger | CPU/Memory % or RPS | Queue Depth, Tokens/sec, SLO Burn Rate |
| Pod Startup Impact | Seconds (image pull + health check) | Minutes (weight load + CUDA init + engine warm-up) |
| Load Balancing Logic | Round-robin / Least Connections | Cache-aware / Prompt-length / Queue-depth |

**Why this matters**: When you align your control plane with token economics, you stop scaling on phantom capacity. You can predict GPU memory exhaustion before OOM kills occur, route long-context prompts to dedicated prefill workers, and trigger autoscaling based on actual inference saturation rather than CPU idle time. This enables deterministic SLOs for generative workloads and transforms GPU spend from a black box into a measurable cost-per-token metric.
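To make the cost-per-token framing concrete, the arithmetic is a one-liner. The sketch below uses a hypothetical GPU price and throughput figure, not measured benchmarks.

```typescript
// Cost per 1K output tokens = hourly GPU cost / (sustained output tokens/sec * 3600) * 1000.
// The inputs below are illustrative assumptions, not measured values.
function costPer1kOutputTokens(gpuHourlyUsd: number, outputTokensPerSec: number): number {
  const tokensPerHour = outputTokensPerSec * 3600;
  return (gpuHourlyUsd / tokensPerHour) * 1000;
}

// Example: a hypothetical $4.00/hr GPU sustaining 2,500 output tokens/sec
// works out to roughly $0.00044 per 1K output tokens.
console.log(costPer1kOutputTokens(4.0, 2500).toFixed(5));
```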

## Core Solution

Building a production-ready LLM serving layer requires replacing request-driven assumptions with token-aware orchestration. The implementation spans observability, scheduling, routing, and autoscaling.

### Step 1: Instrument for Token-Level Observability

Standard Prometheus exporters track HTTP status codes and response times. You need a custom metrics adapter that exposes inference-specific telemetry. The serving engine (vLLM, TGI, or TensorRT-LLM) must emit:

- `tokens_per_second` (aggregate throughput)
- `ttft_seconds` (prefill completion time)
- `tpot_seconds` (decode step duration)
- `kv_cache_usage_ratio` (VRAM allocated to cached key/value tensors)
- `queue_depth` (pending requests awaiting GPU scheduling)

### Step 2: Decouple Prefill and Decode Workloads

Prefill is compute-bound and scales with input length. Decode is memory-bandwidth-bound and scales with output length. Running them on the same GPU pool creates scheduling contention. Modern engines like vLLM mitigate this with continuous batching and PagedAttention, but production clusters benefit from explicit separation:

- **Prefill Group**: High compute, lower memory pressure. Optimized for short TTFT.
- **Decode Group**: High memory bandwidth, steady-state generation. Optimized for low TPOT and KV cache retention.

### Step 3: Implement Queue-Aware Routing

Round-robin distribution fragments KV cache locality, leaving some workers underutilized while others are overloaded. A custom ingress router should evaluate the following signals (a scoring sketch follows this list):

- Estimated prompt length (from Content-Length or a tokenized preview)
- Current queue depth per replica
- KV cache hit probability (for prefix caching)
- Tenant priority or SLO tier
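As a minimal sketch of such a router's decision logic, the scoring function below weighs queue depth, KV cache pressure, prompt length, and prefix affinity. The replica fields, weights, and the bytes-per-token heuristic are illustrative assumptions, not a production policy.

```typescript
interface ReplicaState {
  id: string;
  queueDepth: number;          // pending requests reported by the replica
  kvCacheUsageRatio: number;   // 0..1, scraped from the replica's /metrics endpoint
  cachedPrefixes: Set<string>; // prefix hashes this replica has served recently
}

// Rough token estimate from byte length (~4 bytes per token for English text).
function estimateTokens(body: string): number {
  return Math.ceil(body.length / 4);
}

// Lower score wins. Weights are illustrative and should be tuned per SLO tier.
function scoreReplica(replica: ReplicaState, promptTokens: number, prefixHash: string): number {
  const queuePenalty = replica.queueDepth * 10;
  const cachePressurePenalty = replica.kvCacheUsageRatio * 20;
  const lengthPenalty = promptTokens / 1000;                              // long prompts cost more prefill
  const prefixBonus = replica.cachedPrefixes.has(prefixHash) ? -15 : 0;   // prefer likely cache hits
  return queuePenalty + cachePressurePenalty + lengthPenalty + prefixBonus;
}

function pickReplica(replicas: ReplicaState[], requestBody: string, prefixHash: string): ReplicaState {
  const promptTokens = estimateTokens(requestBody);
  return replicas.reduce((best, candidate) =>
    scoreReplica(candidate, promptTokens, prefixHash) < scoreReplica(best, promptTokens, prefixHash)
      ? candidate
      : best
  );
}
```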

### Step 4: Configure Predictive Autoscaling

HPA on CPU is ineffective here. Use KEDA (Kubernetes Event-driven Autoscaling) with custom metrics to scale on queue depth and token throughput, and maintain a warm pool of preloaded models to absorb traffic spikes without cold-start penalties (see the Configuration Template below for a full ScaledObject).
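A minimal sketch of the predictive half, assuming you have a demand forecast and a measured per-replica throughput: size the pool from forecast token demand plus a warm buffer rather than from CPU utilization. The throughput and buffer figures are assumptions to replace with your own capacity measurements.

```typescript
// Size the decode pool from forecast demand, keeping warm headroom so spikes
// never hit a cold start. perReplicaTokensPerSec is measured capacity, not a constant.
function targetReplicas(
  forecastOutputTokensPerSec: number,
  perReplicaTokensPerSec: number,
  warmBuffer = 2,   // always-on replicas with weights already loaded
  maxReplicas = 20
): number {
  const needed = Math.ceil(forecastOutputTokensPerSec / perReplicaTokensPerSec);
  return Math.min(maxReplicas, Math.max(warmBuffer, needed + warmBuffer));
}

// Example: forecasting 12,000 output tokens/sec against replicas that sustain
// ~2,500 tokens/sec each => ceil(4.8) + 2 warm = 7 replicas.
console.log(targetReplicas(12_000, 2_500));
```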

### Code Example: Token-Aware Metrics Adapter (TypeScript)

This adapter transforms raw inference telemetry into KEDA-compatible custom metrics. It replaces naive RPS tracking with token economics.

```typescript
import { Registry, Gauge, Counter } from 'prom-client';

// Raw telemetry emitted by the serving engine for a single completed request.
export interface InferenceTelemetry {
  modelId: string;
  inputTokens: number;
  outputTokens: number;
  ttftMs: number;
  tpotMs: number;
  kvCacheOccupancy: number; // 0..1 fraction of VRAM used by the KV cache
  pendingQueue: number;     // requests waiting for GPU scheduling
}

export class TokenMetricsAdapter {
  private registry: Registry;
  private ttftGauge: Gauge;
  private tpotGauge: Gauge;
  private kvCacheGauge: Gauge;
  private queueDepthGauge: Gauge;
  private tokenThroughputCounter: Counter;

  constructor() {
    this.registry = new Registry();

    this.ttftGauge = new Gauge({
      name: 'llm_ttft_seconds',
      help: 'Time to first token in seconds',
      registers: [this.registry],
      labelNames: ['model_id'],
    });

    this.tpotGauge = new Gauge({
      name: 'llm_tpot_seconds',
      help: 'Time per output token in seconds',
      registers: [this.registry],
      labelNames: ['model_id'],
    });

    this.kvCacheGauge = new Gauge({
      name: 'llm_kv_cache_usage_ratio',
      help: 'Fraction of GPU VRAM allocated to KV cache',
      registers: [this.registry],
      labelNames: ['model_id'],
    });

    this.queueDepthGauge = new Gauge({
      name: 'llm_queue_depth',
      help: 'Number of requests waiting for GPU scheduling',
      registers: [this.registry],
      labelNames: ['model_id'],
    });

    this.tokenThroughputCounter = new Counter({
      name: 'llm_tokens_processed_total',
      help: 'Total tokens processed (input + output)',
      registers: [this.registry],
      labelNames: ['model_id', 'token_type'],
    });
  }

  // Convert one telemetry sample into Prometheus gauges and counters.
  ingest(telemetry: InferenceTelemetry): void {
    const { modelId, ttftMs, tpotMs, kvCacheOccupancy, pendingQueue, inputTokens, outputTokens } = telemetry;

    this.ttftGauge.set({ model_id: modelId }, ttftMs / 1000);
    this.tpotGauge.set({ model_id: modelId }, tpotMs / 1000);
    this.kvCacheGauge.set({ model_id: modelId }, kvCacheOccupancy);
    this.queueDepthGauge.set({ model_id: modelId }, pendingQueue);

    this.tokenThroughputCounter.inc({ model_id: modelId, token_type: 'input' }, inputTokens);
    this.tokenThroughputCounter.inc({ model_id: modelId, token_type: 'output' }, outputTokens);
  }

  // Expose the registry in Prometheus text format for scraping.
  async getMetrics(): Promise<string> {
    return this.registry.metrics();
  }
}
```
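Wiring the adapter into an HTTP surface is straightforward. The sketch below assumes a small Fastify sidecar where the serving engine (or a poller) posts per-request telemetry and Prometheus scrapes `/metrics`; the module path, routes, and port are illustrative.

```typescript
import Fastify from 'fastify';
import { TokenMetricsAdapter, InferenceTelemetry } from './token-metrics-adapter'; // assumed filename

const app = Fastify();
const adapter = new TokenMetricsAdapter();

// The serving engine (or a sidecar poller) POSTs per-request telemetry here.
app.post('/telemetry', async (request) => {
  adapter.ingest(request.body as InferenceTelemetry); // validate the payload in production
  return { ok: true };
});

// Prometheus (and therefore KEDA) scrapes this endpoint.
app.get('/metrics', async (_request, reply) => {
  reply.type('text/plain');
  return adapter.getMetrics();
});

app.listen({ port: 9400, host: '0.0.0.0' });
```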


### Architecture Rationale
- **Why KEDA over HPA**: HPA only supports CPU/memory/RPS natively. KEDA supports custom metrics, SLO burn rates, and cron-based predictive scaling, which aligns with token economics.
- **Why separate prefill/decode**: Continuous batching helps, but explicit separation prevents long-context prompts from blocking decode streams. It also enables different GPU types per phase (e.g., A100 for prefill, L40S for decode).
- **Why queue-depth scaling**: Queue depth directly correlates with user-perceived latency. Scaling on it prevents the "utilization looks fine but users are timing out" paradox.

## Pitfall Guide

### 1. Scaling on CPU Utilization
**Explanation**: LLM inference offloads compute to GPUs. CPU often handles lightweight HTTP framing, tokenization, and routing. CPU can sit at 15% while GPUs are saturated and queues are backing up.
**Fix**: Replace CPU HPA with KEDA scalers targeting `llm_queue_depth` or `llm_tokens_per_second`. Use GPU memory bandwidth as a secondary trigger.

### 2. Ignoring KV Cache Eviction Limits
**Explanation**: KV cache grows linearly with context length and batch size. When VRAM fills, the serving engine either rejects requests or triggers expensive cache eviction, causing TPOT spikes.
**Fix**: Configure maximum context windows per model. Use an engine with PagedAttention (such as vLLM) and enable chunked prefill. Set autoscaling thresholds at 75% KV cache occupancy, not 90%.
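A back-of-the-envelope estimator makes the eviction risk visible before deployment. The per-token formula below is the standard KV cache size estimate; the Llama-3-70B-style parameters (80 layers, 8 KV heads, head dim 128, FP16) are assumptions for illustration.

```typescript
// KV cache bytes per token = 2 (K and V) * layers * kvHeads * headDim * bytesPerElement.
function kvCacheBytesPerToken(layers: number, kvHeads: number, headDim: number, dtypeBytes: number): number {
  return 2 * layers * kvHeads * headDim * dtypeBytes;
}

// Total KV cache for a batch at a given average context length.
function kvCacheGiB(batchSize: number, avgContextTokens: number, bytesPerToken: number): number {
  return (batchSize * avgContextTokens * bytesPerToken) / 1024 ** 3;
}

// Illustrative Llama-3-70B-style configuration (GQA: 8 KV heads, head dim 128, 80 layers, FP16).
const perToken = kvCacheBytesPerToken(80, 8, 128, 2); // ≈ 320 KiB per token
// 32 concurrent requests at 8K context ≈ 80 GiB of KV cache alone.
console.log(kvCacheGiB(32, 8192, perToken).toFixed(1));
```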

### 3. Treating TTFT and TPOT as a Single Latency Metric
**Explanation**: Aggregating latency into p95 masks prefill vs decode behavior. A model can have excellent TPOT but terrible TTFT due to queueing, or vice versa.
**Fix**: Track TTFT and TPOT as separate SLOs. Route long prompts to prefill-optimized workers. Use speculative decoding to improve TPOT without increasing VRAM pressure.

### 4. Naive Round-Robin Load Balancing
**Explanation**: Distributing requests evenly ignores prompt length variance and KV cache locality. Long prompts monopolize workers, while short prompts wait behind them. Prefix caching becomes ineffective.
**Fix**: Deploy a cache-aware router that hashes prompt prefixes to specific replicas. Factor queue depth and estimated compute cost into routing decisions.

### 5. Cold-Start Autoscaling Without Warm Pools
**Explanation**: Loading a 70B model in BF16 takes 2–6 minutes depending on storage I/O and CUDA initialization. Scaling from zero guarantees SLO violations during traffic spikes.
**Fix**: Maintain a minimum of 2–3 warm replicas. Use local NVMe caching for model weights. Implement predictive scaling based on historical token throughput patterns.

### 6. Misconfiguring Tensor Parallelism Boundaries
**Explanation**: Tensor parallelism splits model weights across GPUs. If pod scheduling doesn't respect node locality, inter-GPU communication traverses the network, destroying throughput.
**Fix**: Use `topologySpreadConstraints` or node selectors to co-locate tensor-parallel groups on the same node. Validate with `nvidia-smi nvlink` bandwidth checks.

### 7. Overprovisioning Context Windows
**Explanation**: Allowing 128K context for every request burns VRAM on unused KV cache; in most production traffic, prompts rarely exceed 4K–8K tokens.
**Fix**: Set dynamic context limits per tenant or model. Implement prompt truncation or summarization fallbacks. Charge cost per token to discourage unnecessary context expansion.
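A minimal guard for per-tenant context limits might look like the sketch below; the tier names and budgets are hypothetical.

```typescript
// Hypothetical per-tenant context budgets; real values come from your billing tiers.
const CONTEXT_LIMITS: Record<string, number> = {
  free: 4_096,
  pro: 16_384,
  enterprise: 65_536,
};

// Clamp the requested context to the tenant's budget and flag truncation,
// so the caller can fall back to summarization instead of failing outright.
function clampContext(tenantTier: string, requestedTokens: number): { maxTokens: number; truncated: boolean } {
  const limit = CONTEXT_LIMITS[tenantTier] ?? CONTEXT_LIMITS.free;
  return {
    maxTokens: Math.min(requestedTokens, limit),
    truncated: requestedTokens > limit,
  };
}

// Example: a free-tier request asking for 12K tokens of context gets clamped to 4K.
console.log(clampContext('free', 12_000)); // { maxTokens: 4096, truncated: true }
```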

## Production Bundle

### Action Checklist
- [ ] Replace RPS/CPU autoscaling with KEDA custom metrics for queue depth and tokens/sec
- [ ] Instrument serving engines to emit TTFT, TPOT, and KV cache occupancy
- [ ] Separate prefill and decode workloads or enable continuous batching with PagedAttention
- [ ] Deploy a cache-aware router that considers prompt length and queue depth
- [ ] Maintain warm model pools with local NVMe weight caching
- [ ] Set KV cache autoscaling thresholds at 70–75% VRAM utilization
- [ ] Implement cost attribution per input/output token for tenant billing
- [ ] Validate tensor parallelism node locality with topology constraints

### Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
|----------|---------------------|-----|-------------|
| Short prompts, high concurrency | vLLM with continuous batching + single GPU pool | Maximizes throughput, minimizes TTFT via efficient KV cache reuse | Lowers cost per token by 30–50% vs naive batching |
| Long context, low concurrency | Dedicated prefill/decode separation + high VRAM GPUs | Prevents decode starvation, preserves KV cache locality | Increases GPU spend but reduces timeout rates and retry costs |
| Multi-tenant SLOs | Queue-aware routing + KEDA SLO burn rate scaling | Guarantees latency tiers without overprovisioning | Optimizes GPU utilization, reduces idle spend by 20–40% |
| Budget-constrained inference | INT8/FP8 quantization + speculative decoding | Cuts VRAM requirements, accelerates decode steps | Reduces GPU hour costs by 40–60% with <2% accuracy loss |

### Configuration Template
KEDA scaler targeting queue depth and token throughput. Replace placeholder metrics with your serving engine's Prometheus endpoints.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-decoder-pool
  pollingInterval: 10
  cooldownPeriod: 120
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(llm_queue_depth{model_id="llama-3-70b"})
        threshold: "15"
        metricName: llm_queue_depth
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(llm_tokens_processed_total{token_type="output"}[1m]))
        threshold: "5000"
        metricName: llm_output_tokens_per_sec
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 30
          policies:
            - type: Pods
              value: 3
              periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Pods
              value: 1
              periodSeconds: 120
```

## Quick Start Guide

1. **Deploy a token-aware serving engine**: Use vLLM or TensorRT-LLM with `--enable-prefix-caching` and `--max-num-batched-tokens` tuned to your VRAM.
2. **Expose custom metrics**: Configure the engine's Prometheus exporter to emit `ttft`, `tpot`, `kv_cache_usage`, and `queue_depth`.
3. **Install KEDA and apply the ScaledObject**: Point the scaler at your Prometheus instance and set thresholds based on your SLOs.
4. **Route with queue awareness**: Deploy a lightweight ingress proxy that reads `/metrics` from backend pods and routes to the replica with the lowest queue depth and highest KV cache hit rate.
5. **Validate with load testing**: Use a token-aware load generator (e.g., llm-perf or custom scripts) to simulate mixed prompt lengths, and verify TTFT stays under 2s and TPOT under 50ms at target concurrency. A minimal probe sketch follows this list.
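If you do not have a load generator handy, a minimal TTFT probe can be written against any OpenAI-compatible streaming endpoint. The URL, model name, and payload below are assumptions to adapt to your gateway; the first streamed body chunk is treated as an approximation of the first token.

```typescript
// Requires Node 18+ (global fetch and performance).
const ENDPOINT = 'http://vllm-gateway.inference:8000/v1/completions'; // assumed gateway URL

async function measureTtft(prompt: string, maxTokens: number): Promise<number> {
  const start = performance.now();
  const res = await fetch(ENDPOINT, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ model: 'llama-3-70b', prompt, max_tokens: maxTokens, stream: true }),
  });
  if (!res.ok || !res.body) throw new Error(`request failed: ${res.status}`);

  const reader = res.body.getReader();
  await reader.read();                      // first streamed chunk ≈ first token
  const ttftMs = performance.now() - start;
  await reader.cancel();                    // don't wait for the full completion
  return ttftMs;
}

// Mixed prompt lengths: a short chat-style request and a long context-style request.
async function run(): Promise<void> {
  const shortPrompt = 'Summarize Kubernetes autoscaling in one sentence.';
  const longPrompt = 'Context: ' + 'lorem ipsum '.repeat(2000) + '\nQuestion: what changed?';

  console.log(`short prompt TTFT: ${(await measureTtft(shortPrompt, 64)).toFixed(0)} ms`);
  console.log(`long prompt TTFT:  ${(await measureTtft(longPrompt, 64)).toFixed(0)} ms`);
}

run().catch(console.error);
```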

Scaling LLMs successfully requires abandoning request-centric mental models. When you measure tokens, track prefill/decode separately, and route based on cache locality and queue depth, Kubernetes becomes a predictable platform for generative workloads instead of a source of silent degradation.