Pinterest Engineers Eliminate CPU Zombies to Resolve Production Bottlenecks
Current Situation Analysis
Machine learning training workloads running on Kubernetes frequently encounter unexplained performance degradation. Engineering teams typically observe CPU throttling, prolonged epoch completion times, and inconsistent batch processing rates. The immediate assumption is almost always insufficient cluster capacity, misconfigured resource requests, or scheduler contention. In reality, the bottleneck often originates outside the container runtime, buried in platform defaults and background agent behavior.
This problem is systematically overlooked because modern observability stacks are heavily workload-centric. Metrics servers, Prometheus scrapers, and cloud monitoring agents track container CPU usage, memory limits, and pod scheduling events. They rarely expose cgroup accounting health, kernel memory reclaim overhead, or node-level process anomalies. When a background service leaks memory cgroup entries, the kernel’s memory management subsystem triggers direct reclaim and kswapd activity. This memory pressure forces the CPU scheduler to throttle active workloads to maintain system stability. The result is a false CPU starvation scenario: the hardware has cycles available, but the kernel refuses to allocate them due to corrupted resource accounting.
The engineering team at Pinterest encountered this exact pattern on PinCompute, their Kubernetes-based orchestration platform. ML training jobs experienced severe throughput degradation despite healthy node metrics and adequate CPU requests. Deep investigation revealed an idle Amazon ECS agent running on the worker nodes. Though the agent was no longer required for workload scheduling, it remained active and continuously leaked memory cgroup entries. The kernel’s memory reclaim mechanisms consumed CPU cycles and triggered throttling policies, indirectly starving the ML training containers. Disabling the orphaned agent immediately stabilized performance. This case demonstrates a critical production reality: platform defaults and legacy agents can silently corrupt resource accounting, transforming a memory leak into a compute bottleneck.
WOW Moment: Key Findings
Shifting diagnostic focus from workload metrics to node-level agent auditing dramatically reduces mean time to resolution (MTTR) and eliminates guesswork. The table below compares three common diagnostic approaches used when ML training jobs experience unexplained CPU throttling.
| Diagnostic Approach | Detection Latency | Root Cause Visibility | Resolution Complexity |
|---|---|---|---|
| Standard K8s Metrics | High (hours) | Low | High |
| Cgroup-Level Tracing | Medium (minutes) | Medium | Medium |
| Agent Audit & Isolation | Low (seconds) | High | Low |
Standard Kubernetes metrics only surface the symptom: CPU throttling. Teams respond by scaling nodes or adjusting resource limits, which compounds the leak and increases cost. Cgroup-level tracing identifies memory pressure but requires kernel-level expertise and manual correlation. Agent audit and isolation pinpoints the exact orphaned process, reveals the accounting corruption, and enables a direct fix. This finding matters because it redefines how teams approach resource bottlenecks. Instead of treating CPU starvation as a compute shortage, engineers can treat it as an accounting integrity issue. The fix shifts from infrastructure scaling to configuration hygiene, reducing operational overhead and preventing recurring degradation.
Core Solution
Resolving cgroup-induced CPU starvation requires a systematic approach: verify accounting health, correlate memory pressure with scheduler behavior, audit background agents, and apply targeted isolation. The following implementation demonstrates a production-ready diagnostic and remediation workflow.
Step 1: Verify Cgroup Accounting Health
Cgroup memory leaks manifest as orphaned entries under /sys/fs/cgroup/memory (v1) or the unified hierarchy at /sys/fs/cgroup (v2). A TypeScript-based diagnostic utility can parse these paths, detect abnormal growth patterns, and flag processes with stale cgroup references.
import { readFileSync, readdirSync } from 'fs';
import { join } from 'path';
interface CgroupMetric {
processId: string;
memoryUsageBytes: number;
leakProbability: number;
}
const CGROUP_V1_PATH = '/sys/fs/cgroup/memory';
const THRESHOLD_BYTES = 500 * 1024 * 1024; // 500MB baseline
function scanCgroupEntries(): CgroupMetric[] {
const entries = readdirSync(CGROUP_V1_PATH);
const metrics: CgroupMetric[] = [];
for (const entry of entries) {
const statPath = join(CGROUP_V1_PATH, entry, 'memory.usage_in_bytes');
try {
const usage = parseInt(readFileSync(statPath, 'utf-8').trim(), 10);
const pid = entry.match(/(\d+)/)?.[1] || 'unknown';
const leakProbability = usage > THRESHOLD_BYTES ? 0.85 : 0.15;
metrics.push({ processId: pid, memoryUsageBytes: usage, leakProbability });
} catch {
// Skip inaccessible or kernel-managed cgroups
}
}
return metrics.filter(m => m.leakProbability > 0.7);
}
export { scanCgroupEntries, CgroupMetric };
This utility isolates high-probability leak candidates by comparing current memory usage against a configurable baseline. It avoids parsing transient kernel cgroups and focuses on user-space process references. The leakProbability heuristic flags entries that exceed normal allocation patterns, which typically indicate abandoned cgroup hierarchies.
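For quick ad-hoc checks without deploying the utility, the same scan can be approximated directly in shell. The following is a minimal sketch, assuming cgroup v1 paths and the same 500MB baseline; it only lists candidates and changes nothing.
#!/usr/bin/env bash
# quick_cgroup_scan.sh -- shell approximation of scanCgroupEntries(), cgroup v1 only
THRESHOLD=$((500 * 1024 * 1024))   # mirrors THRESHOLD_BYTES in the TypeScript utility
for usage_file in /sys/fs/cgroup/memory/*/memory.usage_in_bytes; do
  usage=$(cat "$usage_file" 2>/dev/null) || continue   # skip inaccessible cgroups
  if [[ "$usage" -gt "$THRESHOLD" ]]; then
    echo "[LEAK CANDIDATE] $(dirname "$usage_file") usage_bytes=${usage}"
  fi
done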
Step 2: Correlate Memory Pressure with CPU Throttling
Memory cgroup leaks trigger kernel direct reclaim, which consumes CPU cycles and forces the scheduler to throttle container workloads. A PromQL query can correlate node_memory_KswapdWriteback_bytes_total and container_cpu_cfs_throttled_seconds_total to validate the relationship.
# Detect overlap between kernel memory reclaim and container CPU throttling.
# sum() collapses labels so the two sides can be joined with the and operator;
# node-exporter and cAdvisor series do not share identical label sets otherwise.
(
  sum(rate(node_memory_KswapdWriteback_bytes_total[5m])) > 0
)
and
(
  sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="ml-training"}[5m])) > 0.5
)
This query identifies time windows where kernel memory reclaim activity aligns with container CPU throttling. If both metrics spike simultaneously, the bottleneck is accounting corruption, not compute shortage. Teams should avoid scaling nodes during this window, as additional capacity will inherit the same cgroup leak.
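For incident tooling, the same correlation can be checked programmatically against the Prometheus HTTP API (/api/v1/query). The sketch below assumes curl and jq are available and that PROM_URL points at your Prometheus endpoint; the metric names are taken from the query above.
#!/usr/bin/env bash
# correlate_reclaim_throttle.sh -- confirm both signals fire before making a scaling decision
PROM_URL="${PROM_URL:-http://prometheus.monitoring:9090}"
reclaim=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=rate(node_memory_KswapdWriteback_bytes_total[5m]) > 0' \
  | jq '.data.result // [] | length')
throttle=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=rate(container_cpu_cfs_throttled_seconds_total{namespace="ml-training"}[5m]) > 0.5' \
  | jq '.data.result // [] | length')
if [[ "${reclaim:-0}" -gt 0 && "${throttle:-0}" -gt 0 ]]; then
  echo "Reclaim and throttling overlap detected: audit node agents before scaling."
else
  echo "No overlap detected: treat as a genuine capacity or scheduling issue."
fi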
Step 3: Audit Background Agents
Orphaned platform agents are the primary source of cgroup leaks. A node-level audit script should enumerate running processes, cross-reference them with active workloads, and flag unused services.
#!/usr/bin/env bash
# audit_node_agents.sh
# Identifies platform agents running without active workload dependencies
AGENT_LIST=("ecs-agent" "datadog-agent" "fluentd" "node-exporter" "kube-proxy")
ACTIVE_PIDS=$(pgrep -f "k8s_" || true)
for agent in "${AGENT_LIST[@]}"; do
AGENT_PID=$(pgrep -f "$agent" || true)
if [[ -n "$AGENT_PID" && -z "$ACTIVE_PIDS" ]]; then
echo "[ALERT] Orphaned agent detected: $agent (PID: $AGENT_PID)"
echo " -> No active Kubernetes workloads found. Safe to disable."
fi
done
This script checks for known platform agents and verifies whether active Kubernetes containers are running. If an agent is active but no workloads exist, it is a candidate for immediate disablement. The logic prevents accidental termination of essential monitoring or networking components.
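To sweep an entire node pool rather than a single host, the script can be pushed to each worker over SSH. The loop below is a sketch with several assumptions: the workers carry a hypothetical node-pool=ml-training label, their node names resolve over DNS, and the operator has SSH access.
#!/usr/bin/env bash
# Run audit_node_agents.sh on every node in the pool (label, DNS, and SSH access are assumptions).
for node in $(kubectl get nodes -l node-pool=ml-training -o name | cut -d/ -f2); do
  echo "== ${node} =="
  ssh "${node}" 'bash -s' < audit_node_agents.sh
done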
Step 4: Disable and Validate
Once an orphaned agent is identified, disable it via systemd override or Kubernetes node configuration. Validate that cgroup entries are reclaimed and CPU throttling subsides.
# systemd override for ecs-agent
[Service]
ExecStart=
ExecStart=/bin/true
Restart=no
After applying the override, run systemctl daemon-reload, restart the agent service so the no-op ExecStart takes effect, and monitor /sys/fs/cgroup/memory for entry cleanup. CPU throttling should normalize within 2-3 minutes as the kernel reclaims orphaned cgroup references.
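On a single node, rolling out the override and watching for cleanup might look like the sketch below. It assumes the agent runs as a systemd unit named ecs.service; verify the unit name on your node image before running it.
# Apply the override and watch the memory controller for cleanup (unit name is an assumption).
sudo systemctl daemon-reload             # pick up the override file
sudo systemctl restart ecs.service       # ExecStart=/bin/true turns the unit into a no-op
# The entry count should stop growing and begin to drop as the kernel reclaims references.
watch -n 10 'find /sys/fs/cgroup/memory -mindepth 1 -type d | wc -l'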
Architecture Decisions and Rationale
- Node-level auditing over pod-level monitoring: Cgroup leaks originate outside the container runtime. Pod metrics cannot detect kernel accounting corruption. Node-level inspection is mandatory.
- TypeScript diagnostic utility over shell scripts: TypeScript provides type safety, structured error handling, and easier integration with existing observability pipelines. Shell scripts are retained for quick node audits due to lower overhead.
- Disable over patch: Legacy agents often lack active maintenance. Patching memory leaks requires kernel-level changes or agent recompilation. Disabling unused agents is faster, safer, and eliminates the root cause.
- PromQL correlation over single-metric alerts: CPU throttling alone is ambiguous. Correlating with kernel memory reclaim metrics confirms the accounting corruption hypothesis before triggering remediation.
Pitfall Guide
1. Mistaking Memory Pressure for CPU Throttling
Explanation: Teams observe high CPU throttling and assume compute shortage. The actual cause is kernel memory reclaim consuming cycles and triggering scheduler limits.
Fix: Always correlate container_cpu_cfs_throttled_seconds_total with node_memory_KswapdWriteback or node_vmstat_pgmajfault. If both spike together, investigate cgroup health before scaling.
2. Ignoring Cgroup Version Differences
Explanation: Cgroup v1 and v2 handle memory accounting differently. v1 uses separate memory.usage_in_bytes and memory.limit_in_bytes, while v2 consolidates them under memory.current and memory.max. Scripts written for v1 fail on v2 nodes.
Fix: Detect cgroup version at runtime using stat -fc %T /sys/fs/cgroup. Branch logic accordingly or migrate clusters to v2 for unified accounting.
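A small shell guard, sketched below, makes the version branch explicit before any cgroup file is read; it keys off the filesystem type reported by stat.
# Choose cgroup paths based on the hierarchy version reported by the kernel.
FS_TYPE=$(stat -fc %T /sys/fs/cgroup)
if [[ "$FS_TYPE" == "cgroup2fs" ]]; then
  CGROUP_ROOT="/sys/fs/cgroup"            # unified hierarchy (v2)
  USAGE_FILE="memory.current"
else
  CGROUP_ROOT="/sys/fs/cgroup/memory"     # legacy hierarchy (v1, tmpfs mount)
  USAGE_FILE="memory.usage_in_bytes"
fi
echo "cgroup root: ${CGROUP_ROOT}, usage file: ${USAGE_FILE}"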
3. Assuming All Sidecars Are Essential
Explanation: Platform teams deploy monitoring, logging, and security agents as sidecars or node-level services. Over time, workloads change, but agents remain. Unused agents leak resources silently.
Fix: Implement quarterly agent audits. Cross-reference active workloads with running agents. Disable or remove services with zero dependency graphs.
4. Overlooking Kernel Reclaim Overhead
Explanation: Direct memory reclaim runs in process context and consumes CPU cycles. Under heavy cgroup leaks, reclaim overhead can exceed 15% of available CPU, causing apparent throttling.
Fix: Monitor node_vmstat_pgscan_direct and node_vmstat_pgsteal_direct. If direct reclaim dominates, reduce cgroup fragmentation by consolidating workloads or disabling leak sources.
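When Prometheus is unavailable, the same signal can be spot-checked on the node from /proc/vmstat. The sketch below compares two samples taken 30 seconds apart; a large delta indicates heavy direct reclaim.
# Sample the direct-reclaim scan counter twice and report the delta over 30 seconds.
before=$(awk '/^pgscan_direct /{print $2}' /proc/vmstat)
sleep 30
after=$(awk '/^pgscan_direct /{print $2}' /proc/vmstat)
echo "pgscan_direct delta over 30s: $((after - before)) pages scanned"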
5. Relying Solely on Metrics-Server
Explanation: Kubernetes metrics-server aggregates pod-level CPU and memory usage. It does not expose kernel reclaim activity, cgroup accounting health, or node-level process anomalies.
Fix: Supplement metrics-server with node exporters, cgroup scrapers, and kernel tracing tools. Build dashboards that correlate container metrics with kernel behavior.
6. Disabling Agents Without Graceful Drain
Explanation: Terminating an agent abruptly can leave dangling cgroup entries or interrupt active log/metric pipelines. The kernel may retain orphaned references, prolonging the leak.
Fix: Use systemctl stop followed by cgroupfs-mount --remount or cgdelete to force cleanup. Verify /sys/fs/cgroup/memory before marking the node as healthy.
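A graceful drain might look like the sketch below; the ecs.service unit name and the memory:/ecs cgroup path are illustrative placeholders, not prescriptive values.
# Stop the agent, then remove any memory cgroups it left behind (names are illustrative).
sudo systemctl stop ecs.service
sudo cgdelete -r memory:/ecs 2>/dev/null || true
# Confirm nothing agent-related remains before marking the node healthy.
ls /sys/fs/cgroup/memory | grep -i ecs || echo "no residual ecs cgroups"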
7. Skipping Post-Remediation Validation
Explanation: Teams disable the agent and assume the problem is resolved. Without validation, residual cgroup entries or secondary leaks can cause recurring throttling.
Fix: Run a 10-minute validation window. Monitor CPU throttling, memory reclaim metrics, and cgroup entry counts. Confirm normalization before closing the incident.
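A simple on-node loop, sketched below, captures the cgroup entry-count trend during that window; pair it with the PromQL throttling query from Step 2 to cover both signals.
# 10-minute validation: sample the memory cgroup entry count once per minute.
# A flat or shrinking count, alongside normalized throttling metrics, confirms the fix held.
for i in $(seq 1 10); do
  echo "$(date -Is) memory_cgroups=$(find /sys/fs/cgroup/memory -mindepth 1 -type d | wc -l)"
  sleep 60
done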
Production Bundle
Action Checklist
- Verify cgroup version on all worker nodes and align diagnostic scripts accordingly
- Deploy a node-level cgroup scanner to detect orphaned memory entries
- Correlate CPU throttling with kernel memory reclaim metrics before scaling
- Audit all platform agents and cross-reference with active workload dependencies
- Disable unused agents via systemd overrides or Kubernetes node configurations
- Force cgroup cleanup using cgdelete or kernel remount commands
- Run a 10-minute validation window to confirm throttling normalization
- Document agent lifecycle policies to prevent future orphaned deployments
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| ML training jobs show CPU throttling with healthy node metrics | Agent audit and cgroup scan | Throttling is likely accounting corruption, not compute shortage | Low (no node scaling required) |
| Cgroup v1 nodes exhibit persistent memory leaks | Migrate to cgroup v2 or apply kernel patches | v1 lacks unified memory accounting, increasing leak probability | Medium (requires cluster upgrade) |
| Multiple platform agents running with zero workload dependencies | Disable non-essential agents via systemd | Reduces kernel reclaim overhead and cgroup fragmentation | Low (immediate CPU recovery) |
| High direct reclaim overhead (>15% CPU) | Consolidate workloads and disable leak sources | Direct reclaim consumes CPU cycles, causing false throttling | Low (optimizes existing capacity) |
| Metrics-server shows normal usage but jobs stall | Deploy kernel-level tracing (bpftrace) | Metrics-server lacks kernel accounting visibility | Medium (requires tracing setup) |
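For the last row, a single bpftrace probe is often enough to confirm which processes are entering direct reclaim; the sketch below uses the kernel's vmscan tracepoint and assumes bpftrace is installed on the node.
# Count direct-reclaim entries per process name; Ctrl-C prints the per-comm summary.
sudo bpftrace -e 'tracepoint:vmscan:mm_vmscan_direct_reclaim_begin { @reclaims[comm] = count(); }'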
Configuration Template
# kubelet-cgroup-config.yaml
# Aligns kubelet with cgroup v2 and disables legacy memory accounting
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
# Note: the cgroup version is a host-level setting (boot nodes with
# systemd.unified_cgroup_hierarchy=1 for v2); it is not a KubeletConfiguration field.
memorySwap: {}
# systemd-override-ecs-agent.service
[Service]
ExecStart=
ExecStart=/bin/true
Restart=no
LimitNOFILE=1024
LimitNPROC=512
# prometheus-cgroup-alerts.yaml
groups:
- name: cgroup-leak-detection
rules:
- alert: HighCgroupMemoryLeak
expr: rate(node_memory_KswapdWriteback_bytes_total[5m]) > 10485760
for: 2m
labels:
severity: warning
annotations:
summary: "Kernel memory reclaim exceeding threshold"
description: "Cgroup leak likely causing CPU throttling. Audit node agents."
Quick Start Guide
- Detect cgroup version: Run stat -fc %T /sys/fs/cgroup on each worker node. If the output is cgroup2fs, you are on v2; if it is tmpfs, you are on v1. Adjust diagnostic scripts accordingly.
- Deploy cgroup scanner: Copy the TypeScript diagnostic utility to a monitoring pod or node agent. Execute scanCgroupEntries() and filter results where leakProbability > 0.7.
- Correlate metrics: Query Prometheus for rate(node_memory_KswapdWriteback_bytes_total[5m]) and rate(container_cpu_cfs_throttled_seconds_total[5m]). If both exceed thresholds simultaneously, proceed to agent audit.
- Audit and disable: Run audit_node_agents.sh on affected nodes. Identify orphaned services. Apply systemd overrides to disable them. Force cgroup cleanup with cgdelete -r memory:<orphaned_path>.
- Validate: Monitor CPU throttling and memory reclaim metrics for 10 minutes. Confirm normalization. Document the agent lifecycle policy to prevent recurrence.
