Difficulty

Intermediate

Kubernetes Autoscaling: HPA vs. VPA Architecture and Implementation

By Codcompass Team·2026-05-19·8 min read

Current Situation Analysis

Static resource allocation in Kubernetes clusters is a primary driver of cloud infrastructure waste and application instability. Engineering teams typically provision CPU and memory requests based on peak load estimates or guesswork, resulting in two distinct failure modes. Over-provisioning leads to resource hoarding, where pods reserve capacity they never utilize, inflating cluster costs by 30-40% on average. Under-provisioning causes CPU throttling and Out-Of-Memory (OOM) kills during traffic spikes, directly impacting latency and availability.

The industry recognizes Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) as the solution, yet implementation remains fraught with architectural misunderstandings. A significant portion of production clusters disable autoscaling due to fear of "flapping" or resource contention. The core misunderstanding lies in treating HPA and VPA as interchangeable tools rather than complementary mechanisms with distinct control loops and side effects.

Data from infrastructure audits indicates that clusters running HPA without tuned behavior policies experience unnecessary pod churn, increasing API server load and scheduler overhead. Conversely, clusters that switch VPA to Auto mode before the Recommender has observed enough usage history frequently see OOM kills: early recommendations rest on a short sample, so requests and limits can be set below what the application needs once it has warmed its caches. Furthermore, running HPA and VPA on the same workload without proper configuration results in conflicting control loops, where VPA resizes pods while HPA scales replicas against utilization of those same requests, causing eviction storms and service disruption.

WOW Moment: Key Findings

The critical insight for production autoscaling is the distinction between scaling actions and resize actions, and their respective impact on cluster economics and stability. HPA manages horizontal scale (replica count) to handle throughput, while VPA manages vertical scale (resource requests/limits) to optimize density.

The following comparison quantifies the operational differences. Note that combining HPA and VPA is possible but requires strict mode configuration to avoid conflicts.

| Approach | Primary Action | Latency to Effect | Cluster Impact | Cost Efficiency | Best Use Case |
| --- | --- | --- | --- | --- | --- |
| HPA | Add/Remove Pods | 30s - 5m | High (Scheduler load, IP exhaustion risk) | Low (Over-provisioned per pod) | Traffic spikes, bursty workloads |
| VPA | Update Requests/Limits | 5m - 30m | Medium (Pod eviction, restarts) | High (Right-sized per pod) | Steady load with variable size, batch jobs |
| HPA + VPA | Scale & Resize | 5m+ | Very High (Complex interactions) | Very High | Production workloads requiring both elasticity and efficiency |
| Static | None | N/A | None | Low (Fixed waste) | Legacy apps, strict compliance constraints |

Why this matters: Choosing HPA alone leaves you paying for wasted memory/CPU on every pod. Choosing VPA alone leaves you vulnerable to traffic spikes that a single resized pod cannot handle. The optimal production pattern is often a hybrid: VPA ensures pods are right-sized to minimize node count, while HPA scales those right-sized pods to meet demand. However, VPA must be set to updateMode: "Initial" or "Off" when HPA is active to prevent the VPA from evicting pods that HPA is trying to scale.

Core Solution

Architecture Overview

Kubernetes autoscaling relies on the Metrics API and specific controllers.

1. Metrics Server: The foundational component that collects resource usage from Kubelets and serves it through the Metrics API. Without it, HPA cannot scale on CPU or memory and VPA has no usage data to analyze.
2. HPA Controller: Watches HorizontalPodAutoscaler objects, queries metrics, and calculates desired replica counts. It writes the result to the scale subresource of the target (for example a Deployment), which in turn adjusts its ReplicaSet.
3. VPA Components:
   * Recommender: Analyzes usage history and calculates resource recommendations.
   * Updater: Identifies pods that should be updated based on recommendations and evicts them.
   * Admission Controller: Intercepts pod creation to apply recommended resources. In Initial mode this is the only point at which VPA changes a pod; in Auto mode it also sets resources on the replacements for pods the Updater evicted.
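The replica calculation performed by the HPA controller can be sketched in a few lines of Python. This follows the ratio formula given in the Kubernetes documentation, desired = ceil(current * currentMetric / targetMetric), with the default 10% tolerance. The function name and the sample numbers are illustrative, and the sketch deliberately ignores unready pods, missing metrics, and the minReplicas/maxReplicas bounds.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Core HPA ratio calculation (simplified sketch)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Round up so the workload is never left short of capacity.
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90.0, 60.0))
    # 63% against a 60% target is within the 10% tolerance.
    print(desired_replicas(4, 63.0, 60.0))
    # 10 pods averaging 30% against a 60% target.
    print(desired_replicas(10, 30.0, 60.0))
```

Because the result is proportional to the ratio of observed to target metric, a target that is set too low inflates the replica count just as effectively as real load does.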

Step-by-Step Implementation

1. Deploy Metrics Server

Ensure the Metrics Server is running with --kubelet-insecure-tls (for local dev) or proper certificate configuration for production.

# metrics-server-deployment.yaml snippet
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: metrics-server
        image: registry.k8s.io/metrics-server/metrics-server:v0.6.4
        args:
          - --cert-dir=/tmp
          - --secure-port=10250
          - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
          - --kubelet-use-node-status-port
          - --metric-resolution=15s

2. Horizontal Pod Autoscaler Configuration

HPA v2 supports multiple metrics and behavior policies to control scaling velocity.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 100
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

Rationale: The behavior field is critical. scaleDown stabilization prevents premature pod removal during transient lulls. Policies limit the rate of change, protecting downstream dependencies.
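To make the effect of the policies concrete, here is a minimal Python sketch of how a Percent or Pods policy caps the change allowed in one period. It is a simplified model rather than the controller's code: it treats the current replica count as the count at the start of the period, handles a single policy per direction, and the rounding (up for scale-up, down for scale-down) is an assumption about controller behavior. All function names are hypothetical.

```python
def scale_up_limit(start_replicas: int, policy_type: str, value: int) -> int:
    """Highest replica count one scaleUp policy allows within its period."""
    if policy_type == "Percent":
        # Integer ceiling division; rounding up is an assumption of this sketch.
        return -(-start_replicas * (100 + value) // 100)
    if policy_type == "Pods":
        return start_replicas + value
    raise ValueError(f"unknown policy type: {policy_type}")


def scale_down_limit(start_replicas: int, policy_type: str, value: int) -> int:
    """Lowest replica count one scaleDown policy allows within its period."""
    if policy_type == "Percent":
        # Integer floor division; rounding down is an assumption of this sketch.
        return start_replicas * (100 - value) // 100
    if policy_type == "Pods":
        return max(start_replicas - value, 0)
    raise ValueError(f"unknown policy type: {policy_type}")


def rate_limit(current: int, desired: int, up: tuple, down: tuple) -> int:
    """Clamp the desired replica count to what the policies permit right now."""
    if desired > current:
        return min(desired, scale_up_limit(current, *up))
    if desired < current:
        return max(desired, scale_down_limit(current, *down))
    return current


if __name__ == "__main__":
    # Policies from the manifest above: +50% per 60s up, -10% per 60s down.
    up, down = ("Percent", 50), ("Percent", 10)
    print(rate_limit(4, 10, up, down))   # growth is capped for this period
    print(rate_limit(10, 2, up, down))   # shrinkage is capped for this period
```

With these policies a jump from 4 to 10 desired replicas takes several periods, which is the intended protection for downstream dependencies.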

3. Vertical Pod Autoscaler Configuration

VPA must be configured carefully based on the presence of HPA.

Scenario A: VPA Only (No HPA)

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi

Scenario B: HPA and VPA Combined

When HPA manages replica count, VPA should only set resources at pod creation to avoid evicting pods that HPA relies on.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa-combined
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment
  updatePolicy:
    updateMode: "Initial" # Critical: VPA only sets resources on new pods
  resourcePolicy:
    containerPolicies:
      - containerName: "app"
        mode: "Auto"

Integration with Cluster Autoscaler

Autoscaling pods is futile if the cluster cannot provision nodes. Ensure the Cluster Autoscaler is configured with appropriate --scale-down-utilization-threshold and --scale-down-delay-after-add to complement HPA/VPA actions.
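As a rough illustration of what --scale-down-utilization-threshold controls, the sketch below checks whether a node's requested resources fall under the threshold. It assumes utilization is the larger of the CPU and memory request ratios and uses 0.5 as the threshold; both are assumptions to verify against your Cluster Autoscaler version. The real autoscaler also checks whether the node's pods can be rescheduled elsewhere, which is not modeled here. The function name and units (millicores, MiB) are illustrative.

```python
def is_scale_down_candidate(requested_cpu_m: int, allocatable_cpu_m: int,
                            requested_mem_mi: int, allocatable_mem_mi: int,
                            threshold: float = 0.5) -> bool:
    """Return True when requested resources sit below the utilization threshold."""
    cpu_utilization = requested_cpu_m / allocatable_cpu_m
    mem_utilization = requested_mem_mi / allocatable_mem_mi
    # Assumption: the more heavily requested resource decides.
    return max(cpu_utilization, mem_utilization) < threshold


if __name__ == "__main__":
    # Node with 4000m CPU and 16000Mi memory allocatable.
    print(is_scale_down_candidate(1200, 4000, 3000, 16000))
    print(is_scale_down_candidate(2500, 4000, 3000, 16000))
```

Note that the check is driven by requests, not live usage, so right-sizing requests with VPA directly changes which nodes the Cluster Autoscaler considers removable.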

Pitfall Guide

1. The HPA/VPA Eviction Loop

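Mistake: Running HPA and VPA on the same workload with VPA updateMode: "Auto". Impact: VPA detects that a pod needs more memory and evicts it, and the Deployment recreates it with new requests. Because HPA utilization is measured relative to requests, each resize also shifts the HPA signal, so HPA changes the replica count, per-pod usage shifts, and VPA produces a fresh recommendation. The two loops chase each other, creating churn that loads the API server and can cause downtime. Fix: Set VPA to updateMode: "Initial" or "Off" when HPA is present.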
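One way to catch this before deployment is a small lint step over the two manifests. The sketch below is a hypothetical helper, not part of any Kubernetes tool: it takes the HPA and VPA objects as already-parsed dictionaries, compares their targets by kind and name only (namespaces are ignored), and treats every updateMode other than "Off" or "Initial" as one that can evict pods.

```python
SAFE_VPA_MODES = {"Off", "Initial"}


def find_conflicts(hpa: dict, vpa: dict) -> list:
    """Return human-readable problems for an HPA/VPA pair (simplified sketch)."""
    hpa_ref = hpa["spec"]["scaleTargetRef"]
    vpa_ref = vpa["spec"]["targetRef"]
    if (hpa_ref["kind"], hpa_ref["name"]) != (vpa_ref["kind"], vpa_ref["name"]):
        return []  # Different workloads, nothing to compare.

    # VPA defaults to "Auto" when updatePolicy is omitted.
    mode = vpa["spec"].get("updatePolicy", {}).get("updateMode", "Auto")
    if mode in SAFE_VPA_MODES:
        return []

    resources = sorted(
        m["resource"]["name"]
        for m in hpa["spec"].get("metrics", [])
        if m.get("type") == "Resource"
    )
    detail = f" and HPA scales on {', '.join(resources)}" if resources else ""
    return [
        f"VPA updateMode {mode!r} can evict pods of {hpa_ref['name']}{detail}; "
        "use 'Initial' or 'Off'"
    ]


if __name__ == "__main__":
    target = {"apiVersion": "apps/v1", "kind": "Deployment", "name": "app-deployment"}
    hpa = {"spec": {"scaleTargetRef": target,
                    "metrics": [{"type": "Resource", "resource": {"name": "cpu"}}]}}
    vpa = {"spec": {"targetRef": target, "updatePolicy": {"updateMode": "Auto"}}}
    for problem in find_conflicts(hpa, vpa):
        print(problem)
```

A check like this fits naturally into CI, where the manifests are already available as files and can be parsed before they reach the cluster.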

2. VPA "Auto" Mode OOM Kills

Mistake: Switching VPA to Auto mode on a production workload immediately. Impact: VPA recommends higher limits based on usage, but the application may have memory leaks or cache buildup that VPA misinterprets as required memory. The increased limit allows the leak to grow until the container hits the new limit or the node runs out of memory. Fix: Run VPA in recommendation-only mode (updateMode: "Off") for at least one full business cycle and review the recommendations before enabling Auto. Apply minAllowed and maxAllowed constraints.
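The role of minAllowed and maxAllowed can be pictured as a simple clamp on whatever the Recommender proposes. The following Python sketch assumes resources are expressed as plain integers (millicores and MiB) and uses the bounds from Scenario A; the function and key names are hypothetical.

```python
def clamp_recommendation(recommended: dict, min_allowed: dict, max_allowed: dict) -> dict:
    """Bound each recommended value by the configured floor and ceiling."""
    clamped = {}
    for resource, value in recommended.items():
        low = min_allowed.get(resource, value)    # no floor configured: keep value
        high = max_allowed.get(resource, value)   # no ceiling configured: keep value
        clamped[resource] = min(max(value, low), high)
    return clamped


if __name__ == "__main__":
    # Bounds from Scenario A: 100m-2 CPU, 128Mi-2Gi memory.
    bounds_min = {"cpu_m": 100, "memory_mi": 128}
    bounds_max = {"cpu_m": 2000, "memory_mi": 2048}
    # A leaking process drives the memory recommendation past the ceiling.
    print(clamp_recommendation({"cpu_m": 50, "memory_mi": 3000}, bounds_min, bounds_max))
```

With a ceiling in place, a leak surfaces as a pod-level OOM kill at a known size instead of growing until it threatens the node.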

3. Missing Metrics Server or TLS Errors

Mistake: Deploying HPA/VPA without verifying Metrics Server health. Impact: HPA reports unknown metric values in its TARGETS column and cannot compute replicas. VPA fails to generate recommendations. Fix: Run kubectl get --raw "/apis/metrics.k8s.io/v1beta1" to verify API availability and kubectl top pods to confirm that data is flowing. Ensure Kubelet certificates are trusted.

4. HPA Scale-Down Flapping

Mistake: Not configuring stabilizationWindowSeconds for scale-down, or lowering it below the 300-second default without measuring traffic patterns. Impact: HPA scales down pods, then traffic returns slightly, causing immediate scale-up. This wastes resources and increases latency. Fix: Set stabilizationWindowSeconds to a value longer than your typical traffic micro-bursts (e.g., 300s).
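The stabilization window can be modeled as "remember recent recommendations and act on the highest one". The Python sketch below illustrates that idea for scale-down only; the class name is hypothetical, timestamps are plain seconds supplied by the caller, and the real controller's bookkeeping is more involved.

```python
from collections import deque


class ScaleDownStabilizer:
    """Keep recent replica recommendations and return the largest in the window."""

    def __init__(self, window_seconds: int):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp, desired_replicas), oldest first

    def recommend(self, now: int, desired: int) -> int:
        self.history.append((now, desired))
        # Drop recommendations that have aged out of the window.
        while self.history and self.history[0][0] < now - self.window_seconds:
            self.history.popleft()
        # Scaling down only as far as the highest recent recommendation
        # means a brief lull cannot remove pods that a burst needed moments ago.
        return max(replicas for _, replicas in self.history)


if __name__ == "__main__":
    stabilizer = ScaleDownStabilizer(window_seconds=300)
    for timestamp, desired in [(0, 10), (60, 4), (200, 5), (301, 4), (501, 4)]:
        print(timestamp, stabilizer.recommend(timestamp, desired))
```

In this trace the drop from 10 replicas is held back until the high recommendation is older than the window, which is what suppresses flapping.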

5. Requests vs. Limits Mismatch

Mistake: Setting CPU requests high but limits low, or vice versa, without understanding QoS classes. Impact: Under node pressure, BestEffort pods are evicted first, followed by Burstable pods that exceed their requests; Guaranteed pods go last. VPA rewrites requests and, by default, scales limits in proportion, so a resized pod can end up with very different headroom, and a different eviction ranking, than you planned. Fix: Align requests with VPA recommendations and choose the request-to-limit ratio deliberately. Use minAllowed and maxAllowed to bound what VPA can set.
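The QoS rules themselves are short enough to express in code. This Python sketch classifies a pod from its containers' requests and limits, following the documented Kubernetes rules: Guaranteed when every container has CPU and memory limits with equal requests, BestEffort when nothing is set, Burstable otherwise. The function name is hypothetical, and quantities are compared as literal strings, so "500m" and "0.5" would be treated as different here.

```python
def qos_class(containers: list) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort (simplified sketch)."""
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is present.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{}]))
    print(qos_class([{"requests": {"cpu": "250m", "memory": "256Mi"},
                      "limits": {"cpu": "1", "memory": "512Mi"}}]))
    print(qos_class([{"requests": {"cpu": "1", "memory": "512Mi"},
                      "limits": {"cpu": "1", "memory": "512Mi"}}]))
```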

6. Custom Metrics Cardinality

Mistake: Using high-cardinality labels in Custom Metrics for HPA. Impact: Metrics adapter overload; HPA cannot aggregate metrics efficiently; slow reconciliation. Fix: Aggregate metrics at the exporter level or use low-cardinality labels (e.g., namespace, deployment) rather than pod_id or user_id.

7. Ignoring Cluster Capacity

Mistake: Setting maxReplicas on HPA without calculating node capacity. Impact: HPA scales to maxReplicas, but the cluster cannot schedule pods. Pods remain Pending. Cluster Autoscaler may not trigger if a pending pod does not fit on any node type it is able to provision. Fix: Calculate maxReplicas from floor(NodeAllocatable / PodRequests) * NodeCount, using whichever of CPU or memory is the tighter constraint. Ensure Cluster Autoscaler supports the required node sizes.
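The capacity formula can be worked through as a short Python calculation. The sketch assumes identical nodes, counts only this workload's pods (DaemonSets and other tenants would reduce the result), and uses illustrative numbers; the function name is hypothetical.

```python
def max_schedulable_replicas(node_count: int,
                             node_cpu_m: int, node_mem_mi: int,
                             pod_cpu_m: int, pod_mem_mi: int) -> int:
    """Upper bound on replicas that fit, based on requests and allocatable capacity."""
    # Whole pods only, and the scarcer resource decides how many fit per node.
    pods_per_node = min(node_cpu_m // pod_cpu_m, node_mem_mi // pod_mem_mi)
    return pods_per_node * node_count


if __name__ == "__main__":
    # 5 nodes with 3900m CPU and 14000Mi allocatable; pods request 500m and 1024Mi.
    # CPU allows 7 pods per node, memory would allow 13, so CPU is the constraint.
    print(max_schedulable_replicas(5, 3900, 14000, 500, 1024))
```

If the result is lower than the intended maxReplicas, either the node pool's maximum size or the pod requests need to change before the HPA limit is meaningful.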

Production Bundle

Action Checklist

Verify Metrics Pipeline: Confirm Metrics Server is running and kubectl top pods returns data.
Baseline Resources: Ensure all deployments have requests defined. HPA utilization targets are a percentage of requests and cannot be computed without them.
Deploy VPA in Recommendation-Only Mode: Apply VPA with updateMode: "Off" and collect data for 7-14 days.
Analyze Recommendations: Review VPA events and recommendations. Check whether OOM kills occurred during the observation window.
Configure HPA Behavior: Define scaleUp and scaleDown policies with stabilization windows.
Set VPA Constraints: Define minAllowed and maxAllowed in VPA to prevent extreme resizing.
Enable Safe Mode: If using HPA, set VPA updateMode: "Initial". If VPA only, set updateMode: "Auto".
Monitor Events: Set up alerts for HPA warning events such as FailedRescale and FailedGetResourceMetric, and for frequent pod evictions.

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Steady traffic, varying memory usage | VPA (`Auto`) | Optimizes memory requests, reduces over-provisioning waste. | High reduction in waste. |
| Bursty traffic, predictable size | HPA | Scales replicas to handle throughput spikes. | Increases cost during spikes, saves during lulls. |
| Bursty traffic + varying size | HPA + VPA (`Initial`) | HPA handles spikes; VPA ensures pods are right-sized. | Optimal balance of cost and performance. |
| Event-driven batch jobs | KEDA or CronHPA | HPA/VPA react to metrics; KEDA reacts to event sources (queues). | High efficiency; scales to zero. |
| Strict latency requirements | Static + HPA (Aggressive) | VPA evictions cause restart latency. HPA adds cold-start latency. | Higher cost for reserved capacity. |

Configuration Template

Copy-paste template for a production workload using HPA and VPA safely.

# hpa-vpa-production.yaml
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: production-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: production-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Percent
          value: 20
          periodSeconds: 120
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: production-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: production-app
  updatePolicy:
    updateMode: "Initial" # Safe mode when HPA is present
  resourcePolicy:
    containerPolicies:
      - containerName: "app-container"
        mode: "Auto"
        minAllowed:
          cpu: 250m
          memory: 256Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi

Quick Start Guide

1. Install Metrics Server:
```
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```
Patch for local clusters: `kubectl patch deployment metrics-server -n kube-system --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'`
2. Apply VPA in recommendation-only mode: Create vpa-recommend.yaml with updateMode: "Off" and apply:
```
kubectl apply -f vpa-recommend.yaml
```
Wait at least 24 hours to gather data. For production workloads, the 7-14 days suggested in the Action Checklist gives a more representative baseline.
3. Review and Apply HPA: Check kubectl describe hpa (if one exists) or kubectl top pods to determine thresholds. Create hpa.yaml with appropriate metrics and apply:
```
kubectl apply -f hpa.yaml
```
4. Activate VPA: Update VPA to updateMode: "Initial" (if HPA is active) or "Auto" (if HPA is not active). Apply changes:
```
kubectl apply -f vpa-active.yaml
```

5. Verify: Generate load and monitor:

kubectl get hpa -w
kubectl get vpa -w
kubectl get events --field-selector reason=FailedRescale
kubectl get events --field-selector type=Warning
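The Quick Start refers to a vpa-recommend.yaml without showing it. As one possible starting point, the Python sketch below builds an equivalent manifest as a dictionary and prints it as JSON, which kubectl apply -f accepts in place of YAML. The object name app-vpa and target app-deployment are taken from the earlier examples and are placeholders for your own workload; the helper function is hypothetical.

```python
import json


def vpa_recommend_manifest(name: str, deployment: str) -> dict:
    """Build a VPA object that only produces recommendations (updateMode "Off")."""
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            # "Off" means the Recommender runs but no pod is ever changed.
            "updatePolicy": {"updateMode": "Off"},
        },
    }


if __name__ == "__main__":
    print(json.dumps(vpa_recommend_manifest("app-vpa", "app-deployment"), indent=2))
```

Redirect the output to a file and apply it with kubectl apply -f. Changing updateMode later is the only edit needed to move from observation to "Initial" or "Auto".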
