Decisions & Rationale
- Vertical Pod Autoscaler (VPA) in Recommendation Mode: VPA analyzes historical CPU and memory consumption to generate right-sizing suggestions. Running it in recommendation-only mode prevents disruptive pod restarts during the measurement phase.
- Custom Admission Webhook: A TypeScript-based mutating webhook intercepts Pod creation requests. It queries a metrics backend for P95 historical usage and patches any request that exceeds a defined safety margin (e.g., 150% of observed P95 usage) down to that cap.
- Karpenter Node Provisioning: Unlike the legacy Cluster Autoscaler, Karpenter provisions nodes based on exact pod requirements, supports consolidation, and reduces fragmentation by launching optimally sized instances rather than fixed node pool templates.
- Namespace-Level Cost Tagging: Every workload is required to carry cost-center, team, and environment labels. These propagate to cloud provider billing APIs, enabling showback and accountability.
Implementation: Admission Webhook (TypeScript)
The following webhook checks incoming Pod specs against historical usage thresholds. It uses the Kubernetes client library to parse admission requests, queries a mock metrics service, and returns a JSONPatch that lowers any request above the cap. Pods already within the cap are admitted unchanged.
import { K8sAdmissionReview, AdmissionResponse } from 'k8s-admission-controller';
import { MetricsClient } from './metrics-client';

const metricsClient = new MetricsClient(process.env.METRICS_API_URL);
const SAFETY_MARGIN = 1.5; // 150% of P95 usage

// Units assumed throughout: CPU in cores, memory in MiB.
export async function handlePodAdmission(req: K8sAdmissionReview): Promise<AdmissionResponse> {
  const pod = req.request.object;
  const namespace = pod.metadata.namespace;
  const workloadName = pod.metadata.labels?.['app.kubernetes.io/name'] || 'unknown';
  const patches: { op: 'replace'; path: string; value: string }[] = [];

  for (const [index, container] of pod.spec.containers.entries()) {
    const cpuReq = parseResource(container.resources?.requests?.cpu);
    const memReq = parseResource(container.resources?.requests?.memory);
    const historical = await metricsClient.getP95Usage(namespace, workloadName, container.name);

    // Only patch when a request is set and history exists; 'replace' needs an existing path.
    if (cpuReq && historical.cpu) {
      const maxAllowed = historical.cpu * SAFETY_MARGIN;
      if (cpuReq > maxAllowed) {
        patches.push({
          op: 'replace',
          path: `/spec/containers/${index}/resources/requests/cpu`,
          value: formatResource(maxAllowed, 'cpu')
        });
      }
    }

    if (memReq && historical.memory) {
      const maxAllowed = historical.memory * SAFETY_MARGIN;
      if (memReq > maxAllowed) {
        patches.push({
          op: 'replace',
          path: `/spec/containers/${index}/resources/requests/memory`,
          value: formatResource(maxAllowed, 'memory')
        });
      }
    }
  }

  if (patches.length > 0) {
    return {
      allowed: true,
      patch: Buffer.from(JSON.stringify(patches)).toString('base64'),
      patchType: 'JSONPatch'
    };
  }

  return { allowed: true };
}

// Parses a Kubernetes quantity into cores (CPU) or MiB (memory).
// Only the m, Ki, Mi, and Gi suffixes are handled; other forms (M, G, plain bytes) are not converted.
function parseResource(val?: string): number | null {
  if (!val) return null;
  if (val.endsWith('m')) return parseFloat(val) / 1000;
  if (val.endsWith('Ki')) return parseFloat(val) / 1024;
  if (val.endsWith('Mi')) return parseFloat(val);
  if (val.endsWith('Gi')) return parseFloat(val) * 1024;
  return parseFloat(val);
}

// Formats cores as millicores and MiB as a Mi quantity.
function formatResource(val: number, type: 'cpu' | 'memory'): string {
  return type === 'cpu' ? `${Math.round(val * 1000)}m` : `${Math.round(val)}Mi`;
}
Why this approach: The webhook curbs request inflation at the source. Capping requests at 150% of P95 historical usage removes most defensive over-provisioning while preserving headroom for legitimate traffic spikes. Because the JSONPatch is applied at admission time, enforcement only affects newly created Pods and never restarts running ones.
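To make the cap concrete, here is a minimal sketch of the arithmetic in isolation. The `capRequest` helper and the sample numbers are hypothetical and not part of the webhook itself; units are whatever the caller uses (cores or MiB here).

```typescript
// Hypothetical helper: cap a request at margin * P95 usage.
function capRequest(requested: number, p95: number, margin = 1.5): number {
  const maxAllowed = p95 * margin;
  return Math.min(requested, maxAllowed);
}

// A container asking for 2 cores whose P95 usage is 0.4 cores is capped at 0.6 cores.
const cappedCpuCores = capRequest(2, 0.4);

// A 256 MiB request with a P95 of 200 MiB is under the 300 MiB cap, so it passes through.
const untouchedMemMi = capRequest(256, 200);

console.log(`${Math.round(cappedCpuCores * 1000)}m`, `${Math.round(untouchedMemMi)}Mi`);
```

The first request would be patched to 600m, while the second is left alone because it already sits below the cap.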
Node Pool Segmentation Strategy
Fragmentation thrives when heterogeneous workloads share the same node pool. The solution is to isolate workloads by shape and priority:
- General Pool: Standard CPU/memory workloads, best-effort scheduling
- High-Perf Pool: Latency-sensitive services, guaranteed QoS, dedicated instances
- Batch/GPU Pool: Spot/preemptible instances, time-sliced GPUs, offline jobs
Karpenter supports this natively: each pool maps to a NodePool whose requirements and nodeClassRef fields constrain the instance types it may launch, and Karpenter then picks an instance sized to the pending pods. This reduces the stranded capacity inherent in fixed-size node pools.
Pitfall Guide
1. Peak-Based Request Inflation
Explanation: Teams set resource requests to the absolute maximum observed load, often during a single traffic spike or deployment rollout. The scheduler reserves this capacity permanently, even during idle periods.
Fix: Base requests on P95 or P99 historical usage over a 14-day window. Use VPA recommendations to establish baselines, then apply a 10–20% buffer for variance.
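The difference between a peak-based and a percentile-based request can be sketched as follows. This is an illustration only: the function names, the nearest-rank percentile method, the 15% default buffer, and the sample data are all assumptions, and a real pipeline would read samples from the metrics backend.

```typescript
// Nearest-rank percentile over a set of usage samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] as number;
}

// P95 plus a percentage buffer, rounded up to a whole unit.
function recommendRequest(samples: number[], bufferPct = 15): number {
  return Math.ceil(percentile(samples, 95) * (1 + bufferPct / 100));
}

// 20 CPU samples in millicores: steady at 200m with a single 900m spike.
const cpuMillicores: number[] = [...Array.from({ length: 19 }, () => 200), 900];

const peakBased = percentile(cpuMillicores, 100);
const p95Based = recommendRequest(cpuMillicores);

console.log({ peakBased, p95Based });
```

With these samples the peak-based request reserves 900m permanently, while the P95-plus-buffer request comes out at 230m.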
2. The Fragmentation Blind Spot
Explanation: Assuming 60% cluster utilization means 40% free capacity. In reality, fragmented memory/CPU blocks, daemonset reservations, and pod affinity rules render much of that space unschedulable.
Fix: Track schedulable_capacity vs allocatable_capacity. Use Karpenter consolidation or Descheduler to evict and reschedule pods into tighter bin-packs. Monitor fragmentation index via node-level available metrics.
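A rough way to see the gap between free and schedulable capacity is to count how many pods of a given shape fit per node versus how many would fit if free capacity were pooled. The sketch below is a simplification under stated assumptions: the type and function names are hypothetical, the index formula is one possible definition rather than a standard metric, and affinity rules, taints, and DaemonSets are ignored.

```typescript
interface NodeFree {
  name: string;
  freeCpuMillis: number;
  freeMemMi: number;
}

interface PodShape {
  cpuMillis: number;
  memMi: number;
}

// Pods that fit when each node is considered on its own (what a scheduler can place).
function schedulablePods(nodes: NodeFree[], pod: PodShape): number {
  return nodes.reduce((total, node) => {
    const byCpu = Math.floor(node.freeCpuMillis / pod.cpuMillis);
    const byMem = Math.floor(node.freeMemMi / pod.memMi);
    return total + Math.min(byCpu, byMem);
  }, 0);
}

// Pods that would fit if all free capacity were one contiguous block (the naive reading).
function pooledPods(nodes: NodeFree[], pod: PodShape): number {
  const cpu = nodes.reduce((sum, n) => sum + n.freeCpuMillis, 0);
  const mem = nodes.reduce((sum, n) => sum + n.freeMemMi, 0);
  return Math.min(Math.floor(cpu / pod.cpuMillis), Math.floor(mem / pod.memMi));
}

// Assumed definition: 0 means no capacity is stranded, 1 means all of it is.
function fragmentationIndex(nodes: NodeFree[], pod: PodShape): number {
  const pooled = pooledPods(nodes, pod);
  return pooled === 0 ? 0 : 1 - schedulablePods(nodes, pod) / pooled;
}

const freeByNode: NodeFree[] = [
  { name: 'node-a', freeCpuMillis: 1500, freeMemMi: 4096 },
  { name: 'node-b', freeCpuMillis: 1500, freeMemMi: 4096 },
  { name: 'node-c', freeCpuMillis: 2500, freeMemMi: 4096 },
  { name: 'node-d', freeCpuMillis: 500, freeMemMi: 4096 },
];
const podShape: PodShape = { cpuMillis: 2000, memMi: 2048 };

console.log(schedulablePods(freeByNode, podShape), pooledPods(freeByNode, podShape));
```

Here 6000m of CPU is free in total, enough for three such pods on paper, yet only node-c can actually host one.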
3. Autoscaler Signal Mismatch
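Explanation: HPA measures CPU utilization as a percentage of the container's request (for example, a 70% target), not of actual node capacity. If requests are inflated, measured utilization stays artificially low and the autoscaler reacts late or scales in; if requests are understated, it scales out long before the node is genuinely stressed. Either way, the scaling signal reflects the request value rather than real load.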
Fix: Align HPA targets with actual consumption metrics. Use custom metrics (requests per second, queue depth) instead of raw CPU/memory when possible. Validate that HPA thresholds reflect business load, not scheduling artifacts.
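The effect of the request value on the scaling decision can be illustrated with the documented HPA calculation, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), skipped when the ratio is within the default 10% tolerance. The sketch below is a simplified model with hypothetical function and parameter names; it ignores stabilization windows, min/max replicas, and pod readiness.

```typescript
// Simplified model of the HPA replica calculation for a CPU utilization target.
function desiredReplicas(
  currentReplicas: number,
  avgUsageMillis: number,
  requestMillis: number,
  targetUtilizationPct: number,
  tolerance = 0.1,
): number {
  // Utilization is measured against the request, not against node capacity.
  const utilizationPct = (avgUsageMillis * 100) / requestMillis;
  const ratio = utilizationPct / targetUtilizationPct;
  if (Math.abs(ratio - 1) <= tolerance) return currentReplicas;
  return Math.ceil(currentReplicas * ratio);
}

// Same real load (350m per pod, 4 replicas, 70% target), two different requests.
const withRightSizedRequest = desiredReplicas(4, 350, 400, 70);
const withInflatedRequest = desiredReplicas(4, 350, 1000, 70);

console.log({ withRightSizedRequest, withInflatedRequest });
```

With a 400m request the model scales out to 5 replicas; with a 1000m request the identical load reads as 35% utilization and the model scales in to 2.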
4. GPU Whole-Device Locking
Explanation: Reserving an entire GPU for inference workloads that only consume 10–20% of VRAM or compute cycles. GPU instances carry premium pricing, making idle time exceptionally costly.
Fix: Enable NVIDIA time-slicing or MPS (Multi-Process Service) for shared workloads. Use vGPU drivers or cloud provider GPU partitioning. Separate latency-sensitive inference from batch training into dedicated pools with different scheduling policies.
5. Environment Sprawl Without Lifecycle Policies
Explanation: Staging, preview, and sandbox namespaces accumulate over time. Workloads are deployed for testing but never terminated, consuming node capacity indefinitely.
Fix: Implement namespace TTL controllers. Use GitOps-driven environment provisioning with automatic teardown on branch merge or PR closure. Tag ephemeral namespaces with lifecycle: ephemeral and run a nightly cleanup job.
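The selection step of such a nightly cleanup job might look like the sketch below. It only decides which namespaces are expired; the `NamespaceInfo` shape, the 72-hour default TTL, and the sample namespaces are assumptions, and the actual listing and deletion calls against the cluster are left out.

```typescript
interface NamespaceInfo {
  name: string;
  labels: Record<string, string>;
  createdAt: Date;
}

// Returns names of namespaces labelled lifecycle=ephemeral that are older than the TTL.
function expiredNamespaces(all: NamespaceInfo[], now: Date, ttlHours = 72): string[] {
  const cutoff = now.getTime() - ttlHours * 60 * 60 * 1000;
  return all
    .filter((ns) => ns.labels['lifecycle'] === 'ephemeral' && ns.createdAt.getTime() < cutoff)
    .map((ns) => ns.name);
}

const sampleNamespaces: NamespaceInfo[] = [
  { name: 'pr-101', labels: { lifecycle: 'ephemeral' }, createdAt: new Date('2024-06-01T00:00:00Z') },
  { name: 'pr-202', labels: { lifecycle: 'ephemeral' }, createdAt: new Date('2024-06-09T00:00:00Z') },
  { name: 'staging', labels: {}, createdAt: new Date('2024-01-01T00:00:00Z') },
];

const toDelete = expiredNamespaces(sampleNamespaces, new Date('2024-06-10T00:00:00Z'));
console.log(toDelete);
```

Only pr-101 is selected: pr-202 is still inside its TTL, and staging is never touched because it lacks the ephemeral label.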
6. Ignoring System Overhead Tax
Explanation: Assuming a node's raw capacity equals usable capacity. DaemonSets (CNI, observability, node agents), kubelet reserves, and system-reserved memory consume 10–20% of every node before user workloads are scheduled.
Fix: Explicitly configure kube-reserved and system-reserved in the kubelet configuration for each node. Monitor actual DaemonSet footprint. Size node pools accounting for overhead, not just workload requests.
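The overhead can be estimated per node by subtracting the kubelet reservations and the eviction threshold from capacity (which gives allocatable) and then subtracting DaemonSet requests. The sketch below uses hypothetical field names and made-up figures for a 16 GiB node.

```typescript
interface NodeOverhead {
  capacityMi: number;
  kubeReservedMi: number;
  systemReservedMi: number;
  evictionHardMi: number;
  daemonSetRequestsMi: number;
}

// Allocatable = capacity - kube-reserved - system-reserved - hard eviction threshold.
function allocatableMi(n: NodeOverhead): number {
  return n.capacityMi - n.kubeReservedMi - n.systemReservedMi - n.evictionHardMi;
}

// Memory left for user workloads once DaemonSet requests are also subtracted.
function usableMi(n: NodeOverhead): number {
  return allocatableMi(n) - n.daemonSetRequestsMi;
}

// Illustrative numbers only.
const exampleNode: NodeOverhead = {
  capacityMi: 16384,
  kubeReservedMi: 1024,
  systemReservedMi: 512,
  evictionHardMi: 100,
  daemonSetRequestsMi: 700,
};

const overheadPct = ((exampleNode.capacityMi - usableMi(exampleNode)) / exampleNode.capacityMi) * 100;
console.log(usableMi(exampleNode), overheadPct.toFixed(1));
```

For these figures, 14048 MiB of the 16384 MiB remains for user workloads, an overhead of roughly 14%.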
7. Cost Ownership Vacuum
Explanation: Resource requests are set by developers but billed to a central infrastructure team. No feedback loop exists to correct over-provisioning, leading to chronic waste.
Fix: Enforce mandatory cost labels (cost-center, team, environment). Implement showback dashboards that map namespace spend to engineering teams. Tie request approvals to cost center budgets in CI/CD pipelines.
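A label check of this kind could run in CI or inside an admission webhook. The sketch below covers only the validation logic; the function name and the decision to treat blank values as missing are assumptions.

```typescript
const REQUIRED_LABELS = ['cost-center', 'team', 'environment'] as const;

// Returns the required cost labels that are absent or blank on a workload.
function missingCostLabels(labels: Record<string, string> | undefined): string[] {
  return REQUIRED_LABELS.filter((key) => !labels?.[key]?.trim());
}

const partiallyLabelled = missingCostLabels({ team: 'payments', 'cost-center': 'cc-42' });
const unlabelled = missingCostLabels(undefined);

console.log(partiallyLabelled, unlabelled);
```

A non-empty result can be turned into a CI failure or an admission rejection message that names exactly which labels to add.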
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single API / Low Traffic | Managed container service or serverless | No orchestration overhead, pay-per-use billing | 40–60% lower vs K8s cluster |
| Multi-tenant SaaS (10+ services) | Kubernetes with VPA + Karpenter | Standardization, isolation, autoscaling benefits outweigh overhead | Neutral to 15% lower after right-sizing |
| GPU Batch / Training | Dedicated GPU pool with time-slicing + spot instances | Maximizes VRAM utilization, leverages preemptible pricing | 30–50% lower vs always-on GPU nodes |
| Early MVP / Prototype | Single VM or PaaS | Avoids platform complexity before product-market fit | 70% lower vs managed K8s |
| High-Frequency Trading / Low Latency | Guaranteed QoS pool + dedicated instances | Eliminates noisy neighbor, ensures deterministic scheduling | 20–30% higher (justified by SLA) |
Configuration Template
# vertical-pod-autoscaler-recommendation.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: workload-vpa-recommendation
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  updatePolicy:
    updateMode: "Off" # Recommendation only
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "4"
          memory: "8Gi"
---
# karpenter-nodepool-general.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-workloads
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
      nodeClassRef:
        name: default
  limits:
    cpu: "100"
    memory: "400Gi"
  disruption:
    consolidationPolicy: WhenEmpty
    expireAfter: 720h
Quick Start Guide
- Install Measurement Stack: Deploy Prometheus, Grafana, and VPA in recommendation-only mode. Allow 7–14 days for historical usage data to accumulate.
- Enable Admission Control: Build and deploy the TypeScript webhook. Register a MutatingWebhookConfiguration so the Kubernetes API server routes Pod creation requests to the webhook endpoint. Test with a sample deployment to verify patch behavior.
- Provision Dynamic Nodes: Replace existing node pools with Karpenter NodePool specs. Tag workloads with cost-center and team labels. Verify that pending pods trigger instance launches matching their resource shape.
- Enforce Lifecycle Policies: Deploy a namespace TTL controller or configure GitOps pipeline hooks to auto-delete preview/staging namespaces after merge. Validate showback dashboards reflect accurate spend attribution.
- Iterate & Tighten: Review VPA recommendations weekly. Adjust safety margins, enable Auto mode for stable workloads, and activate Karpenter consolidation. Monitor fragmentation index and autoscaler reaction times to confirm cost trajectory.