account for both.
Step 2: Establish the Node OS & Driver Baseline
Select a minimal, security-hardened operating system with pre-integrated GPU and networking drivers. Amazon Linux 2023 (AL2023) provides a general-purpose foundation with EFA drivers, NVIDIA kernel modules, the NVIDIA container runtime, and Multi-Instance GPU (MIG) support. Bottlerocket offers a hardened, immutable alternative with identical driver support but significantly faster boot times and a reduced attack surface. For AI/ML workloads where rapid node provisioning and security compliance are priorities, Bottlerocket is the preferred baseline.
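As a concrete starting point, the Bottlerocket baseline can be expressed as a Karpenter `EC2NodeClass`. This is a minimal sketch: the IAM role name and the `karpenter.sh/discovery` tag value are assumptions you must replace with your cluster's values.

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: bottlerocket-gpu
spec:
  # Resolve the latest Bottlerocket AMI automatically
  amiSelectorTerms:
    - alias: bottlerocket@latest
  role: KarpenterNodeRole-my-cluster # assumed IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster # assumed discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```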
Step 3: Design Karpenter NodePool Topology
Karpenter replaces traditional Cluster Autoscaler logic with event-driven, pod-aware provisioning. Instead of static node groups, you define NodePools that map directly to workload characteristics. A single NodePool suffices for uniform workloads, but production GenAI architectures benefit from isolation and prioritization. Separate accelerated inference nodes from CPU-bound routing or preprocessing services. Use weighted NodePools to enforce instance family preferences (e.g., G5 before P4) or cost-tiered provisioning. For predictable baseline traffic, anchor capacity with On-Demand Capacity Reservations (ODCR) or ML Capacity Blocks. Reserve bursting NodePools with instance diversification (G4dn, G5, G6, P4) to absorb traffic spikes without provisioning delays.
Step 4: Implement Dual-Layer Autoscaling
Application scaling and data plane scaling must operate independently but coordinate through shared metrics. Use HPA, VPA, or KEDA to scale inference pods based on queue depth, GPU utilization, or custom TTFT metrics. Karpenter watches pending pods and provisions right-sized GPU instances, typically bringing nodes online in under a minute rather than the 5-10 minutes common with Cluster Autoscaler and static node groups. This separation ensures that pod scaling absorbs micro-bursts while node scaling handles sustained load shifts.
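As an illustration of the application-layer half, a KEDA `ScaledObject` can drive pod scaling from an inference-specific signal such as queue depth. This sketch assumes a Prometheus server at the address shown and uses vLLM's `vllm:num_requests_waiting` metric; adjust both for your stack.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference
spec:
  scaleTargetRef:
    name: vllm-inference-deployment
  minReplicaCount: 4
  maxReplicaCount: 64
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090 # assumed endpoint
        query: avg(vllm:num_requests_waiting) # queue depth exposed by vLLM
        threshold: "10"
```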
Architecture Decisions & Rationale
- Why Karpenter over Cluster Autoscaler? Karpenter evaluates pod requirements against available instance types in real time, provisioning the exact GPU architecture needed. It supports consolidation, drift detection, and weighted prioritization, all of which are critical for cost-controlled AI workloads.
- Why Bottlerocket? Immutable OS design reduces patching overhead and boot latency. Faster node initialization directly improves autoscaling responsiveness during traffic surges.
- Why separate prefill/decode workloads? Prefill is compute-bound and benefits from high-throughput accelerators (P-series). Decode is bound by memory bandwidth and KV cache capacity, and scales efficiently on G-series instances. Isolating them prevents resource contention and enables targeted autoscaling.
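To make the split concrete, here is a hypothetical sketch of the prefill half, pinned to P-series capacity via Karpenter's instance-family node label. The deployment name, image, and replica count are placeholders; the decode deployment mirrors this with a `g5` or `g6` selector.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-prefill # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-prefill
  template:
    metadata:
      labels:
        app: llm-prefill
    spec:
      nodeSelector:
        karpenter.k8s.aws/instance-family: p4d # compute-bound prefill on P-series
      containers:
        - name: prefill
          image: registry.example.com/llm-prefill:latest # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 4
```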
TypeScript Infrastructure Definition
The following TypeScript configuration sketches how Karpenter NodePools and HPA targets could be defined programmatically in an infrastructure-as-code pattern. Note that the `@codcompass/*` packages and their APIs are illustrative rather than a published library; map the fields onto your own CDK, Pulumi, or Terraform constructs. The underlying approach, version-controlled and reproducible capacity definitions, is what matters.
```typescript
import { KarpenterNodePool } from '@codcompass/eks-infra'; // illustrative package
import { HpaTarget, MetricType } from '@codcompass/autoscaling'; // illustrative package

// Baseline capacity anchored by an On-Demand Capacity Reservation
const baselinePool = new KarpenterNodePool('llm-inference-baseline', {
  nodeClass: 'Bottlerocket',
  instanceTypes: ['g5.12xlarge', 'p4d.24xlarge'],
  weight: 100,
  capacityType: 'on-demand',
  limits: { cpu: 96, memory: '384Gi', 'nvidia.com/gpu': 8 },
  consolidation: { enabled: true, policy: 'when-empty' },
  capacityReservation: {
    id: 'odcr-llm-prod-01',
    strategy: 'prioritize'
  }
});

// Burst pool with instance diversification
const burstPool = new KarpenterNodePool('llm-inference-burst', {
  nodeClass: 'Bottlerocket',
  instanceTypes: ['g4dn.12xlarge', 'g5.12xlarge', 'g6.12xlarge'],
  weight: 50,
  capacityType: 'spot',
  limits: { cpu: 96, memory: '384Gi', 'nvidia.com/gpu': 8 },
  consolidation: { enabled: false },
  taints: [{ key: 'workload', value: 'burst', effect: 'NoSchedule' }]
});

// HPA targeting GPU utilization and queue depth
const inferenceHpa = new HpaTarget('vllm-inference-deployment', {
  minReplicas: 4,
  maxReplicas: 64,
  metrics: [
    { type: MetricType.PODS, name: 'gpu_utilization', targetAverage: 75 },
    { type: MetricType.PODS, name: 'request_queue_depth', targetAverage: 10 }
  ],
  behavior: {
    scaleUp: { stabilizationWindowSeconds: 30, policies: [{ type: 'Pods', value: 4, periodSeconds: 60 }] },
    scaleDown: { stabilizationWindowSeconds: 300, policies: [{ type: 'Percent', value: 10, periodSeconds: 60 }] }
  }
});

export const capacityTopology = {
  nodePools: [baselinePool, burstPool],
  autoscaler: inferenceHpa
};
```
This configuration enforces weighted provisioning, isolates burst traffic via taints, disables consolidation for spot instances to prevent premature eviction, and aligns HPA scaling behavior with inference latency characteristics.
Pitfall Guide
1. Treating Inference as a Monolithic Workload
Explanation: Applying uniform autoscaling rules to both prefill and decode phases ignores their distinct resource profiles. Prefill demands high compute throughput, while decode is constrained by KV cache memory.
Fix: Separate workloads into distinct deployments. Scale prefill pods based on CPU/GPU compute metrics and decode pods based on memory utilization and queue depth. Use Karpenter NodePools with instance families optimized for each phase.
2. Over-Provisioning for Peak Latency
Explanation: Sizing infrastructure to guarantee sub-2-second latency at all concurrency levels results in severe underutilization during normal traffic. The cost per request becomes unsustainable.
Fix: Define acceptable latency bands rather than absolute minimums. Use weighted NodePools to prioritize cost-effective instances for baseline load and reserve high-performance GPUs for SLO-critical traffic. Implement graceful degradation or request queuing during extreme spikes.
3. Misaligning NodePool Weights with Actual Cost Models
Explanation: Assigning static weights without accounting for spot pricing volatility, instance availability, or regional GPU shortages leads to provisioning failures or unexpected cost spikes.
Fix: Dynamically adjust weights based on real-time pricing APIs and capacity health. Use Karpenter's weight field in conjunction with consolidation policies to automatically shift load to cheaper instances when latency budgets allow.
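One way to implement this is a small weight-derivation function fed by current spot prices and interruption data. This is a sketch under stated assumptions: the pricing inputs would come from the EC2 Spot price feed or a pricing API in practice, and the scoring formula (penalizing price and interruption risk equally) is illustrative.

```typescript
// Sketch: derive NodePool weights from current spot prices. All pool names
// and numbers are illustrative; feed priceUSD from the EC2 Spot price feed.
interface PoolPricing {
  pool: string;
  priceUSD: number;         // current $/hr for the pool's cheapest eligible instance
  interruptionRate: number; // observed spot interruption frequency, 0..1
}

// Higher weight = tried first by Karpenter. Penalize price and interruption risk.
function deriveWeights(pools: PoolPricing[], maxWeight = 100): Map<string, number> {
  const scores = pools.map(p => 1 / (p.priceUSD * (1 + p.interruptionRate)));
  const best = Math.max(...scores);
  return new Map(
    pools.map((p, i) => [p.pool, Math.round((scores[i] / best) * maxWeight)] as [string, number])
  );
}

// The cheapest, most stable pool receives the highest weight:
const weights = deriveWeights([
  { pool: 'llm-inference-baseline', priceUSD: 5.67, interruptionRate: 0.0 },
  { pool: 'llm-inference-burst', priceUSD: 2.12, interruptionRate: 0.15 },
]);
```

A controller can then patch the `weight` field on each NodePool whenever the derived value drifts past a threshold, avoiding churn from minor price fluctuations.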
4. Ignoring KV Cache & Batch Size Limits in Autoscaling Metrics
Explanation: HPA targets based solely on CPU or generic GPU utilization fail to capture memory pressure from KV cache expansion. Pods may scale out prematurely or crash due to OOM errors.
Fix: Expose custom metrics for KV cache hit rate, active sequence count, and GPU memory fragmentation. Configure KEDA or HPA to scale based on these inference-specific indicators rather than generic resource thresholds.
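As a sketch of what "inference-specific indicators" can look like in practice, the following combines queue depth with KV cache headroom into a single scaling signal. The thresholds and the 0..1 cache-utilization convention are assumptions; tune them against your serving stack's actual metrics.

```typescript
// Illustrative inference-aware scaling signal: scale on KV cache pressure and
// queue depth rather than generic CPU/GPU utilization.
interface InferenceMetrics {
  queueDepth: number;         // requests waiting for a slot
  kvCacheUtilization: number; // fraction of GPU KV cache blocks in use, 0..1
  activeSequences: number;    // sequences currently decoding
}

// Returns the desired replica delta: positive = scale out, negative = scale in.
function scaleSignal(m: InferenceMetrics, maxSeqPerPod = 64): number {
  // Scale out before the KV cache saturates, not after an OOM crash.
  if (m.kvCacheUtilization > 0.85 || m.queueDepth > 10) return 1;
  // Scale in only when queue, cache, and concurrency all show sustained headroom.
  if (m.queueDepth === 0 && m.kvCacheUtilization < 0.3 && m.activeSequences < maxSeqPerPod / 4) {
    return -1;
  }
  return 0;
}
```

The same logic translates directly into two KEDA triggers or two HPA custom metrics; the point is that the scale-out condition fires on cache pressure even when GPU compute utilization looks healthy.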
5. Skipping Capacity Reservations for Baseline Load
Explanation: Relying entirely on on-demand or spot provisioning for predictable traffic introduces latency during node initialization and risks capacity shortages during regional demand surges.
Fix: Anchor baseline workloads with ODCR or ML Capacity Blocks. Use Karpenter's capacityReservation configuration to guarantee immediate availability for steady-state inference, reserving dynamic provisioning for burst traffic.
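With Karpenter v1.3 or later, ODCRs are selected on the `EC2NodeClass` and surfaced to NodePools as the `reserved` capacity type. A minimal sketch, assuming the reservation ID, role name, and discovery tag shown (replace all three with your own values):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: bottlerocket-gpu-reserved
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest
  role: KarpenterNodeRole-my-cluster # assumed IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster # assumed discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  capacityReservationSelectorTerms:
    - id: cr-0123456789abcdef0 # assumed ODCR ID
```

The baseline NodePool then adds `reserved` to its `karpenter.sh/capacity-type` requirement so reserved capacity is consumed before on-demand.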
6. Confusing Pod Scaling with Node Scaling
Explanation: Allowing HPA to scale pods beyond node capacity triggers pending pod queues, which Karpenter eventually resolves but with unpredictable latency. This creates a feedback loop where pod scaling masks node scaling delays.
Fix: Implement pod disruption budgets and queue depth limits. Ensure HPA max replicas align with expected NodePool capacity. Use Karpenter's consolidation and drift settings to maintain node health, and monitor pending pod metrics as a leading indicator for data plane scaling.
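A quick way to keep the two layers honest is to derive HPA `maxReplicas` from the NodePool GPU limits rather than picking it independently. A minimal sketch; the pool limits mirror the NodePool `limits` in this article's configuration template, and the per-pod GPU request is an assumption.

```typescript
// Cap pod-level scaling at what the node-level GPU limits can actually host,
// so HPA cannot manufacture pending pods that Karpenter is forbidden to place.
function maxSchedulableReplicas(poolGpuLimits: number[], gpusPerPod: number): number {
  const totalGpus = poolGpuLimits.reduce((sum, g) => sum + g, 0);
  return Math.floor(totalGpus / gpusPerPod);
}

// Baseline and burst pools each limited to 16 GPUs, 1 GPU per inference replica:
const replicaCap = maxSchedulableReplicas([16, 16], 1);
```

If the configured `maxReplicas` exceeds this cap, either raise the NodePool limits or lower `maxReplicas`; the excess replicas can never schedule.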
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Stable, predictable traffic | ODCR-backed NodePool + Bottlerocket | Guarantees capacity, eliminates provisioning latency, reduces spot volatility | Lower operational risk, moderate baseline cost |
| Spiky, unpredictable traffic | Weighted burst NodePool with instance diversification (G4/G5/G6) | Maximizes availability during spikes, leverages spot pricing for cost efficiency | Higher variance, optimized per-request cost |
| Latency-critical SLOs (<3s) | Dedicated P-series NodePool + strict concurrency caps | Prevents KV cache contention, ensures decode phase responsiveness | Premium hardware cost, justified by SLO compliance |
| Cost-optimized batch inference | Trainium/Inferentia instances + aggressive consolidation | AWS ML chips deliver superior $/token for non-interactive workloads | Significant cost reduction, requires model compilation |
Configuration Template
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: llm-inference-baseline
spec:
  template:
    metadata:
      labels:
        workload: llm-inference
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: bottlerocket-gpu
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.12xlarge", "p4d.24xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      taints:
        - key: workload
          value: inference
          effect: NoSchedule
  limits:
    cpu: 192
    memory: 768Gi
    nvidia.com/gpu: 16
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
  weight: 100
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: llm-inference-burst
spec:
  template:
    metadata:
      labels:
        workload: llm-inference-burst
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: bottlerocket-gpu
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g4dn.12xlarge", "g5.12xlarge", "g6.12xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      taints:
        - key: workload
          value: burst
          effect: NoSchedule
  limits:
    cpu: 192
    memory: 768Gi
    nvidia.com/gpu: 16
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: Never # disables consolidation so spot nodes are not drained prematurely
  weight: 50
```
Quick Start Guide
- Provision EKS Control Plane: Deploy a managed EKS cluster with IAM roles for service accounts (IRSA) configured for Karpenter, EC2, and SSM access.
- Install Karpenter & EC2NodeClass: Apply the Karpenter Helm chart, then create an `EC2NodeClass` referencing the Bottlerocket AMI, security groups, and subnet IDs with a GPU-enabled instance profile.
- Apply NodePool Definitions: Deploy the baseline and burst NodePool manifests. Verify that pending pods trigger node provisioning by checking `kubectl get nodepools` and `kubectl logs -n karpenter deploy/karpenter`.
- Deploy Inference Workload: Launch your LLM serving stack (e.g., vLLM, TGI) with GPU resource requests and HPA/KEDA configurations targeting custom inference metrics.
- Validate Scaling Behavior: Run a Locust load test simulating your target RPS and token distribution. Monitor TTFT, GPU memory utilization, and Karpenter provisioning events to confirm concurrency balancing aligns with your SLOs.