tsandlimits`.
- Requests: Define the baseline resource consumption. The scheduler uses these values to place pods on nodes with sufficient capacity. Accurate requests enable efficient bin-packing and prevent overcommitment.
- Limits: Cap resource usage during bursts. Limits protect the node from resource exhaustion and prevent a single pod from starving others.
Rationale: Without requests, the scheduler has no information about a pod's real footprint and cannot account for it when choosing a node, leading to suboptimal placement. Without limits, a runaway process can consume all node resources, causing OOM kills or CPU starvation for co-located pods.
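To make the placement argument concrete, here is a minimal sketch of the capacity check that requests make possible. The helper names (`parseCpuMillis`, `parseMemoryBytes`, `fitsOnNode`) are hypothetical, only the `m`, `Ki`, `Mi`, and `Gi` suffixes are handled, and the real scheduler applies many more filters than this.

```typescript
interface Resources {
  cpu: string;    // e.g. "250m" or "2"
  memory: string; // e.g. "256Mi" or "4Gi"
}

// Convert a CPU quantity to millicores ("250m" -> 250, "0.5" -> 500).
function parseCpuMillis(quantity: string): number {
  return quantity.endsWith('m')
    ? Number(quantity.slice(0, -1))
    : Number(quantity) * 1000;
}

// Convert a memory quantity with a binary suffix to bytes.
// Assumption: only Ki, Mi, and Gi appear; anything else is read as plain bytes.
function parseMemoryBytes(quantity: string): number {
  const units: Record<string, number> = { Ki: 1024, Mi: 1024 ** 2, Gi: 1024 ** 3 };
  const factor = units[quantity.slice(-2)];
  return factor !== undefined
    ? Number(quantity.slice(0, -2)) * factor
    : Number(quantity);
}

// A pod fits when the requests already placed on the node plus its own
// requests stay within the node's allocatable capacity.
function fitsOnNode(allocatable: Resources, scheduled: Resources[], pod: Resources): boolean {
  const usedCpu = scheduled.reduce((sum, r) => sum + parseCpuMillis(r.cpu), 0);
  const usedMemory = scheduled.reduce((sum, r) => sum + parseMemoryBytes(r.memory), 0);
  return (
    usedCpu + parseCpuMillis(pod.cpu) <= parseCpuMillis(allocatable.cpu) &&
    usedMemory + parseMemoryBytes(pod.memory) <= parseMemoryBytes(allocatable.memory)
  );
}
```

A pod that declares no requests contributes nothing to these sums, which is why the node can end up holding more work than it can actually serve.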
2. Intelligent Autoscaling with Custom Metrics
CPU utilization is often insufficient for modern microservices. Many applications are I/O-bound or latency-sensitive, meaning CPU usage may remain low while performance degrades. Custom metrics provide a more accurate signal for scaling decisions.
Implementation Strategy:
- Expose Metrics: Instrument the application to export relevant metrics via Prometheus. Common signals include request latency, error rates, and queue depth.
- Configure HPA: Use the `autoscaling/v2` API to target custom metrics. Define `behavior` policies to control scale-up and scale-down velocities.
TypeScript Metric Exporter Example:
This snippet demonstrates how to expose a custom queue depth metric using prom-client.
import { Registry, Gauge } from 'prom-client';
import express from 'express';

// Dedicated registry so only the metrics defined here are exported.
const register = new Registry();

const queueDepthGauge = new Gauge({
  name: 'app_queue_depth',
  help: 'Current number of items in the processing queue',
  registers: [register],
});

// Simulate queue processing; a real service would report its actual queue length.
setInterval(() => {
  const depth = Math.floor(Math.random() * 50);
  queueDepthGauge.set(depth);
}, 5000);

const app = express();

// Endpoint scraped by Prometheus.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000, () => console.log('Metrics server running'));
HPA Configuration with Behavior:
The HPA targets the custom metric and includes stabilization windows to prevent thrashing.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: app_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
Rationale: The behavior field is critical. The scale-up policy lets the replica count grow by up to 50% per minute in response to traffic spikes, while the scale-down policy removes at most 10% of pods every two minutes and waits out a five-minute stabilization window, so transient lulls do not trigger premature scale-in. This prevents the "sawtooth" pattern common in naive HPA configurations.
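The arithmetic behind such a configuration can be sketched as follows. This is a simplified model, not the controller's implementation: it ignores the tolerance band, the stabilization windows, and per-period event tracking, and the rounding of the rate limits is an assumption. `ScaleInput` and `desiredReplicas` are hypothetical names.

```typescript
interface ScaleInput {
  currentReplicas: number;
  metricTotal: number;      // sum of the per-pod metric across all pods
  targetAverage: number;    // target.averageValue from the HPA spec
  minReplicas: number;
  maxReplicas: number;
  scaleUpPercent: number;   // Percent policy value under behavior.scaleUp
  scaleDownPercent: number; // Percent policy value under behavior.scaleDown
}

function desiredReplicas(input: ScaleInput): number {
  // Core rule for an AverageValue target: enough pods that the
  // per-pod average drops to the target.
  const raw = Math.ceil(input.metricTotal / input.targetAverage);

  // Percent policies bound how far one step may move from the current count.
  // Integer arithmetic avoids floating-point surprises; rounding is assumed.
  const upLimit = Math.ceil((input.currentReplicas * (100 + input.scaleUpPercent)) / 100);
  const downLimit = Math.floor((input.currentReplicas * (100 - input.scaleDownPercent)) / 100);
  const rateLimited = Math.min(Math.max(raw, downLimit), upLimit);

  // minReplicas and maxReplicas always win.
  return Math.min(Math.max(rateLimited, input.minReplicas), input.maxReplicas);
}
```

With 4 replicas, a total queue depth of 100, and a target of 10 per pod, the raw answer is 10 replicas, but a 50% scale-up policy admits only 6 in the first step. The asymmetry between the two percentages is what makes scale-up fast and scale-down gradual.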
3. Availability Contracts with PDBs and Probes
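Pod Disruption Budgets (PDBs) protect service availability during voluntary disruptions. A PDB defines the minimum number of pods that must remain available while pods are being evicted, for example during node drains or cluster upgrades. Rolling updates of a Deployment are paced by the Deployment's own maxUnavailable and maxSurge settings rather than by the PDB.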
PDB Configuration:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: worker-service
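As a rough illustration of what this budget permits, the sketch below computes how many evictions are allowed for an integer `minAvailable`. The function name is borrowed from the PDB status field `disruptionsAllowed`; percentage-based budgets are deliberately left out.

```typescript
// Evictions permitted right now: healthy pods beyond the required minimum.
// Never negative, even when the service is already below its budget.
function disruptionsAllowed(healthyPods: number, minAvailable: number): number {
  return Math.max(0, healthyPods - minAvailable);
}
```

With three healthy pods and `minAvailable: 2`, one pod can be evicted at a time. If one pod is already unready, the result drops to zero and a node drain waits until the pod recovers.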
Readiness Probes:
PDBs rely on accurate pod status. Readiness probes ensure that traffic is only routed to pods that are fully initialized and capable of handling requests. Without readiness probes, the service may send traffic to pods that are still starting up, causing failures.
Rationale: PDBs prevent evictions from removing too many pods at once, ensuring the remaining pods can absorb the load shift. Readiness probes complement PDBs by guaranteeing that only healthy pods receive traffic, reducing error rates during deployments.
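On the application side, a readiness endpoint can be as small as the sketch below. `ReadinessState`, `createReadinessHandler`, and `MinimalResponse` are hypothetical names, and the `/healthz` path is an assumption; the handler is typed against a minimal response shape so the example carries no dependencies, and it reports 503 until startup work has finished.

```typescript
// Minimal shape of the response object the handler needs.
interface MinimalResponse {
  statusCode: number;
  end(body?: string): void;
}

// Tracks whether startup work (connections, cache warm-up) has completed.
class ReadinessState {
  private ready = false;
  markReady(): void { this.ready = true; }
  markNotReady(): void { this.ready = false; }
  isReady(): boolean { return this.ready; }
}

// Returns a request handler that answers the readiness probe.
function createReadinessHandler(state: ReadinessState) {
  return (req: { url?: string }, res: MinimalResponse): void => {
    if (req.url !== '/healthz') {
      res.statusCode = 404;
      res.end('not found');
      return;
    }
    // The kubelet treats 2xx and 3xx as success; 503 keeps the pod out of rotation.
    res.statusCode = state.isReady() ? 200 : 503;
    res.end(state.isReady() ? 'ok' : 'starting');
  };
}
```

The handler should be wirable into Node's built-in server with `http.createServer(createReadinessHandler(state))`, calling `state.markReady()` only after initialization completes.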
4. Stateless Architecture and External State
For optimal scaling, services should be stateless. Stateful components like caches and sessions should be externalized to managed services like Redis or databases. This simplifies scaling by eliminating the need for persistent volumes and complex state synchronization.
Rationale: Stateless pods can be created and destroyed rapidly without data loss. StatefulSets are appropriate for databases but introduce complexity for application caches. Externalizing state reduces bootstrap times and enables aggressive scaling.
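One way to picture externalized state is a handler that only ever talks to a store interface. In this sketch `SessionStore`, `InMemoryStore`, and `incrementVisits` are hypothetical names, and the in-memory class is a stand-in for tests; a production implementation would back the same interface with Redis or a database.

```typescript
// Contract the application depends on; the backing service is swappable.
interface SessionStore {
  get(id: string): Promise<string | undefined>;
  set(id: string, value: string): Promise<void>;
}

// Test stand-in only. In production this would be an external store.
class InMemoryStore implements SessionStore {
  private data = new Map<string, string>();
  async get(id: string): Promise<string | undefined> {
    return this.data.get(id);
  }
  async set(id: string, value: string): Promise<void> {
    this.data.set(id, value);
  }
}

// The handler keeps nothing in process memory between calls, so any
// replica can serve any request for the same session.
// Note: this read-modify-write is not atomic; a real store would use
// an atomic increment instead.
async function incrementVisits(store: SessionStore, sessionId: string): Promise<number> {
  const visits = Number((await store.get(sessionId)) ?? '0') + 1;
  await store.set(sessionId, String(visits));
  return visits;
}
```

Because the count lives in the store rather than in the pod, replicas can be added or removed without losing it.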
Pitfall Guide
1. The "No-Request" Trap
- Explanation: Omitting resource requests forces the scheduler to rely on heuristics, leading to poor node packing and unpredictable performance.
- Fix: Always define `requests` based on baseline load measurements. Use monitoring data to set accurate values.
2. CPU-Blindness in I/O-Bound Services
- Explanation: Scaling on CPU for I/O-bound services results in delayed responses to traffic spikes, as CPU usage lags behind actual demand.
- Fix: Use custom metrics like queue depth or request latency for autoscaling decisions.
3. HPA Oscillation
- Explanation: Rapid scale-up and scale-down cycles waste resources and destabilize the cluster.
- Fix: Configure `behavior` policies with appropriate stabilization windows. Scale up quickly but scale down slowly.
4. PDB Misconfiguration
- Explanation: Setting `minAvailable` too high can block node drains and cluster upgrades, causing operational bottlenecks.
- Fix: Keep `minAvailable` below the minimum replica count, or express the budget as `maxUnavailable` instead. Ensure the PDB always allows at least one pod to be disrupted.
5. Stateful Caching Layers
- Explanation: Using StatefulSets for caching introduces unnecessary complexity and slows down scaling.
- Fix: Externalize caches to managed services like Redis. Keep application pods stateless.
6. Readiness Probe Neglect
- Explanation: Missing readiness probes cause traffic to be routed to unready pods, increasing error rates during startups and updates.
- Fix: Implement HTTP or TCP readiness probes that verify the application is fully initialized.
7. Ignoring Scale-Down Costs
- Explanation: Aggressive scale-down can lead to frequent pod creation/destruction cycles, increasing API server load and latency.
- Fix: Use longer stabilization windows for scale-down and monitor the cost of scaling events.
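Several of these pitfalls can be caught before deployment with a simple pre-flight check. The sketch below is illustrative only: `WorkloadSummary` and `lintWorkload` are hypothetical, and the input is a hand-built summary rather than parsed manifests.

```typescript
// Hand-built summary of the settings the checks need (hypothetical shape).
interface WorkloadSummary {
  hasRequests: boolean;
  hasReadinessProbe: boolean;
  hpaMinReplicas: number;
  pdbMinAvailable: number;
  scaleUpStabilizationSeconds: number;
  scaleDownStabilizationSeconds: number;
}

// Returns one finding per pitfall detected; an empty array means no findings.
function lintWorkload(w: WorkloadSummary): string[] {
  const findings: string[] = [];
  if (!w.hasRequests) {
    findings.push('missing resource requests');
  }
  if (!w.hasReadinessProbe) {
    findings.push('missing readiness probe');
  }
  // At minReplicas the PDB must still leave room for one eviction.
  if (w.pdbMinAvailable >= w.hpaMinReplicas) {
    findings.push('PDB minAvailable blocks all disruptions at minReplicas');
  }
  // Scale-down should be the slower direction to avoid oscillation.
  if (w.scaleDownStabilizationSeconds < w.scaleUpStabilizationSeconds) {
    findings.push('scale-down stabilization is shorter than scale-up');
  }
  return findings;
}
```

A workload with requests, a readiness probe, `minReplicas: 3`, `minAvailable: 2`, and stabilization windows of 60 and 300 seconds passes all four checks.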
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Bursty Web Traffic | HPA on CPU/Requests | Simple and effective for CPU-bound workloads | Moderate |
| Queue Processing | HPA on Queue Depth | Accurate scaling based on actual workload | Low (Efficient) |
| Database Services | StatefulSet + PV | Ensures data persistence and ordering | High |
| Cache Layers | Stateless + External Store | Fast scaling, no data loss, simplified ops | Low |
| Latency-Sensitive APIs | HPA on P99 Latency | Directly optimizes for user experience | Medium |
Configuration Template
This template combines a Deployment with readiness probes, an HPA with custom metrics and behavior, and a PDB.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: api-service:latest
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 3
  maxReplicas: 15
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_request_duration_seconds
        target:
          type: AverageValue
          averageValue: "0.5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-service
Quick Start Guide
- Install Metrics Server: Ensure the Kubernetes Metrics Server is deployed to enable resource-based autoscaling.
- Deploy Prometheus Adapter: Install the Prometheus Adapter to expose custom metrics to the HPA.
- Apply Resource Quotas: Set namespace-level resource quotas to enforce resource boundaries.
- Deploy HPA and PDB: Apply the HPA and PDB configurations to your services.
- Validate with Load Testing: Use tools like k6 or Locust to simulate traffic and verify scaling behavior.