
Architecting Predictable LLM Inference on EKS: A Karpenter-Driven Capacity Strategy

By Codcompass Team · 84 min read


Current Situation Analysis

Translating executive requirements into production-ready machine learning infrastructure remains one of the most persistent bottlenecks in modern AI deployments. Engineering teams frequently receive directives like "support 10,000 concurrent users with sub-5-second responses" and immediately jump to GPU procurement or model quantization. This approach bypasses the critical translation layer between business expectations and compute topology.

The core issue stems from treating LLM inference as a static compute problem rather than a dynamic concurrency curve. Traditional Kubernetes autoscaling assumes linear CPU/memory scaling, which collapses under the weight of GPU memory fragmentation, KV cache limits, and the distinct computational phases of transformer models. Teams overlook that inference latency is not a fixed property of the hardware; it is a function of batch size, sequence length, and concurrent request density. Without a structured workload model, infrastructure decisions become reactive, leading to either expensive underutilization or SLO violations during traffic spikes.
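To make the KV cache ceiling concrete, the sketch below estimates how many concurrent requests fit in a single GPU's memory. The model dimensions, the 80 GB card, and the 4,096-token sequence length are illustrative assumptions (roughly a 7B-parameter transformer in FP16), not figures from this article.

```python
# Illustrative sketch: the per-GPU concurrency ceiling is set by KV cache
# memory, not CPU utilization. All model/hardware figures below are
# assumptions (roughly a 7B-class transformer in FP16 on an 80 GB GPU).

def kv_cache_bytes_per_request(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Each token stores one key and one value vector per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

GPU_MEM_GB = 80          # assumed card size
MODEL_WEIGHTS_GB = 14    # ~7B parameters in FP16
seq_len = 4096           # prompt + generated tokens

per_request_gb = kv_cache_bytes_per_request(
    n_layers=32, n_kv_heads=32, head_dim=128, seq_len=seq_len
) / 1024**3

free_for_cache_gb = GPU_MEM_GB - MODEL_WEIGHTS_GB
max_concurrency = int(free_for_cache_gb // per_request_gb)

print(f"KV cache per request: {per_request_gb:.2f} GiB")      # ~2 GiB
print(f"Concurrency ceiling on one GPU: {max_concurrency}")   # ~33 requests
```

A CPU-based Horizontal Pod Autoscaler never sees this ceiling: GPU memory, not core utilization, determines how many requests a replica can admit before latency degrades.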

Empirical validation consistently demonstrates this gap. When organizations measure end-to-end latency against request concurrency on identical GPU hardware, the cost-to-performance ratio shifts dramatically. A single 8-GPU node handling one request at a time delivers fast responses but operates at single-digit utilization. The same node saturated with 128 concurrent executions maximizes hardware efficiency but pushes latency past acceptable thresholds. The optimal architecture rarely lives at either extreme; it emerges from deliberate concurrency balancing, precise NodePool topology, and dual-layer autoscaling. This is where Karpenter's event-driven provisioning and EKS's managed control plane converge to solve the scaling equation.

WOW Moment: Key Findings

The relationship between concurrency, latency, and infrastructure cost is non-linear. The following data illustrates how identical hardware yields vastly different economic and performance outcomes based solely on concurrency management and node distribution.

| Scenario | Instance Count | Concurrent Executions | E2E Latency | RPS | Cost-Efficiency |
|---|---|---|---|---|---|
| Underutilized | 1x instance (8 GPUs) | 1x | 2.5 s | 0.4 | Fast response, but very high cost per request |
| Fully Saturated | 1x instance (8 GPUs) | 128x | 10 s | 12.8 | Highly utilized hardware, but potentially misses latency SLOs |
| Optimized | 2x instances (16 GPUs) | 64x (per node) | 5 s | 25.6 | Great value, balanced performance and cost-efficiency |

This finding matters because it shifts infrastructure planning from hardware procurement to concurrency engineering. By distributing load across multiple nodes and capping per-node concurrency, you maintain predictable Time To First Token (TTFT) while preserving GPU utilization. It enables teams to decouple baseline capacity from burst demand, apply weighted provisioning strategies, and align autoscaling metrics with actual inference behavior rather than generic CPU thresholds.
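The throughput and cost relationships in the table reduce to a few lines of arithmetic. The sketch below derives RPS and relative cost per request from concurrency and latency; the hourly node price is a placeholder assumption, not a figure from this article.

```python
# Reproducing the table's throughput math. The concurrency/latency pairs
# come from the table; the hourly node price is a placeholder assumption.

NODE_HOURLY_COST = 30.0  # assumed $/hour for one 8-GPU instance

scenarios = {
    "Underutilized":   {"nodes": 1, "concurrency": 1,   "latency_s": 2.5},
    "Fully Saturated": {"nodes": 1, "concurrency": 128, "latency_s": 10.0},
    "Optimized":       {"nodes": 2, "concurrency": 64,  "latency_s": 5.0},
}

for name, s in scenarios.items():
    # Steady-state throughput: in-flight requests / time per request, summed across nodes.
    rps = s["nodes"] * s["concurrency"] / s["latency_s"]
    cost_per_request = s["nodes"] * NODE_HOURLY_COST / (rps * 3600)
    print(f"{name:15s} RPS={rps:5.1f}  $/request={cost_per_request:.4f}")

# Underutilized: RPS=0.4   Fully Saturated: RPS=12.8   Optimized: RPS=25.6
```

Under these assumed prices, the optimized layout matches the saturated node's cost per request while halving end-to-end latency, which is exactly the trade the table describes.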

Core Solution

Building a production-grade GenAI data plane on EKS requires a four-phase implementation strategy. Each phase addresses a specific layer of the scaling stack, from workload quantification to node lifecycle management.

Step 1: Quantify the Workload Model

Before provisioning infrastructure, convert business requirements into measurable compute parameters. Define average prompt length (tokens IN), expected response length (tokens OUT), target requests per second (RPS), and maximum acceptable end-to-end latency. These metrics directly inform KV cache sizing, batch limits, and GPU memory allocation. LLM inference operates in two distinct phases: the prefill phase processes all input tokens in parallel to populate the KV cache, while the decode phase generates output tokens one at a time, autoregressively. TTFT is dominated by prefill latency, whereas sustained throughput depends on decode efficiency. Your workload model must therefore quantify both phases separately.
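As a starting point, the calculation below turns a business target into concrete capacity numbers using Little's law (in-flight requests ≈ RPS × average latency) and a per-node concurrency cap. All input values, including the 64-request cap, are assumptions for illustration; substitute the figures from your own workload model.

```python
import math

# Illustrative capacity math for Step 1. All inputs are assumed targets,
# not figures from this article; replace them with your measured values.

target_rps = 25.0                  # requests per second the business expects
avg_e2e_latency_s = 5.0            # maximum acceptable end-to-end latency
tokens_in, tokens_out = 512, 256   # average prompt and response lengths

# Little's law: requests in flight at steady state.
in_flight = target_rps * avg_e2e_latency_s

# Cap per-node concurrency where your latency tests show TTFT stays within SLO
# (64 matches the "Optimized" scenario above; your number may differ).
per_node_concurrency = 64
nodes_needed = math.ceil(in_flight / per_node_concurrency)

# Aggregate token throughput the fleet must sustain (prefill + decode).
tokens_per_second = target_rps * (tokens_in + tokens_out)

print(f"In-flight requests: {in_flight:.0f}")                  # 125
print(f"Nodes required:     {nodes_needed}")                    # 2
print(f"Token throughput:   {tokens_per_second:.0f} tok/s")     # 19200
```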
