Pinterest Engineers Eliminate CPU Zombies to Resolve Production Bottlenecks
Current Situation Analysis
Machine learning training workloads running on Kubernetes frequently encounter unexplained performance degradation. Engineering teams typically observe CPU throttling, prolonged epoch completion times, and inconsistent batch processing rates. The immediate assumption is almost always insufficient cluster capacity, misconfigured resource requests, or scheduler contention. In reality, the bottleneck often originates outside the container runtime, buried in platform defaults and background agent behavior.
This problem is systematically overlooked because modern observability stacks are heavily workload-centric. Metrics servers, Prometheus scrapers, and cloud monitoring agents track container CPU usage, memory limits, and pod scheduling events. They rarely expose cgroup accounting health, kernel memory reclaim overhead, or node-level process anomalies. When a background service leaks memory cgroup entries, the kernel’s memory management subsystem triggers direct reclaim and kswapd activity. This memory pressure forces the CPU scheduler to throttle active workloads to maintain system stability. The result is a false CPU starvation scenario: the hardware has cycles available, but the kernel refuses to allocate them due to corrupted resource accounting.
The engineering team at Pinterest encountered this exact pattern on PinCompute, their Kubernetes-based orchestration platform. ML training jobs experienced severe throughput degradation despite healthy node metrics and adequate CPU requests. Deep investigation revealed an idle Amazon ECS agent running on the worker nodes. Though the agent was no longer required for workload scheduling, it remained active and continuously leaked memory cgroup entries. The kernel’s memory reclaim mechanisms consumed CPU cycles and triggered throttling policies, indirectly starving the ML training containers. Disabling the orphaned agent immediately stabilized performance. This case demonstrates a critical production reality: platform defaults and legacy agents can silently corrupt resource accounting, transforming a memory leak into a compute bottleneck.
WOW Moment: Key Findings
Shifting diagnostic focus from workload metrics to node-level agent auditing dramatically reduces mean time to resolution (MTTR) and eliminates guesswork. The table below compares three common diagnostic approaches used when ML training jobs experience unexplained CPU throttling.
| Diagnostic Approach | Detection Latency | Root Cause Visibility | Resolution Complexity |
|---|---|---|---|
| Standard K8s Metrics | High (hours) | Low | High |
| Cgroup-Level Tracing | Medium (minutes) | Medium | Medium |
| Agent Audit & Isolation | Low (seconds) | High | Low |
Standard Kubernetes metrics only surface the symptom: CPU throttling. Teams respond by scaling nodes or adjusting resource limits, which compounds the leak and increases cost. Cgroup-level tracing identifies memory pressure but requires kernel-level expertise and manual correlation. Agent audit and isolation pinpoints the exact orphaned process, reveals the accounting corruption, and enables a direct fix. This finding matters because it redefines how teams approach resource bottlenecks. Instead of treating CPU starvation as a compute shortage, engineers can treat it as an accounting integrity issue. The fix shifts from infrastructure scaling to configuration hygiene, reducing operational overhead and preventing recurring degradation.
Core Solution
Resolving cgroup-induced CPU starvation requires a systematic approach: verify accounting health, correlate memory pressure with scheduler behavior, audit background agents, and apply targeted isolation. The following implementation demonstrates a production-ready diagnostic and remediation workflow.
Step 1: Verify Cgroup Accounting Health
Cgroup memory leaks manifest as orphaned entries under /sys/fs/cgroup/memory (v1) or the unified hierarchy at /sys/fs/cgroup (v2). A TypeScript-based diagnostic utility can parse these paths, detect abnormal growth patterns, and flag processes with stale cgroup references.
import { readFileSync, readdirSync } from 'fs';
import { join } from 'path';
interface CgroupMetric {
processId: string;
memoryUsageBytes: number;
leakProbability: number;
}
const CGROUP_V1_PATH = '/sys/fs/cgroup/memory';
const THRESHOLD_BYTES = 500 * 1024 * 1024; // 500MB baseline
function scanCgroupEntries(): CgroupMetric[] {
const entries = readdirSync(CGROUP_V1_PATH);
const metrics: CgroupMetric[] = [];
for (const entry of entries) {
const statPath = join(CGROUP_V1_PATH, entry, 'memory.usage_in_bytes');
try {
const usage = parseInt(readFileSync(statPath, 'utf-8').trim(), 10);
const pid = entry.match(/(\d+)/)?.[1] || 'unknown';
const leakProbability = usage > THRESHOLD_BYTES ? 0.85 : 0.15;
metrics.push({ processId: pid, memoryUsageBytes: usage, leakProbability });
} catch {
// Skip inaccessible or kernel-managed cgroups
}
}
return metrics.filter(m => m.leakProbability > 0.7);
}
export { scanCgroupEntries, CgroupMetric };
This utility isolates high-probability leak candidates by comparing current memory usage against a configurable baseline. It avoids parsing transient kernel cgroups and focuses on user-space process references. The leakProbability heuristic flags entries that exceed normal allocation patterns, which typically indicate abandoned cgroup hierarchies.
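For quick ad-hoc checks without deploying the utility, the same scan can be approximated directly in shell. The following is a minimal sketch, assuming cgroup v1 paths and the same 500MB baseline; it only lists candidates and changes nothing.
#!/usr/bin/env bash
# quick_cgroup_scan.sh -- shell approximation of scanCgroupEntries(), cgroup v1 only
THRESHOLD=$((500 * 1024 * 1024))   # mirrors THRESHOLD_BYTES in the TypeScript utility
for usage_file in /sys/fs/cgroup/memory/*/memory.usage_in_bytes; do
  usage=$(cat "$usage_file" 2>/dev/null) || continue   # skip inaccessible cgroups
  if [[ "$usage" -gt "$THRESHOLD" ]]; then
    echo "[LEAK CANDIDATE] $(dirname "$usage_file") usage_bytes=${usage}"
  fi
done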
Step 2: Correlate Memory Pressure with CPU Throttling
Memory cgroup leaks trigger kernel direct reclaim, which consumes CPU cycles and forces the scheduler to throttle container workloads. A PromQL query can correlate node_memory_KswapdWriteback_bytes_total and container_cpu_cfs_throttled_seconds_total to validate the relationship.
# Detect overlap between kernel memory reclaim and container CPU throttling.
# sum() collapses labels so the two sides can be joined with the and operator;
# node-exporter and cAdvisor series do not share identical label sets otherwise.
(
  sum(rate(node_memory_KswapdWriteback_bytes_total[5m])) > 0
)
and
(
  sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="ml-training"}[5m])) > 0.5
)
This query identifies time windows where kernel memory reclaim activity aligns with container CPU throttling. If both metrics spike simultaneously, the bottleneck is accounting corruption, not compute shortage. Teams should avoid scaling nodes during this window, as additional capacity will inherit the same cgroup leak.
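For incident tooling, the same correlation can be checked programmatically against the Prometheus HTTP API (/api/v1/query). The sketch below assumes curl and jq are available and that PROM_URL points at your Prometheus endpoint; the metric names are taken from the query above.
#!/usr/bin/env bash
# correlate_reclaim_throttle.sh -- confirm both signals fire before making a scaling decision
PROM_URL="${PROM_URL:-http://prometheus.monitoring:9090}"
reclaim=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=rate(node_memory_KswapdWriteback_bytes_total[5m]) > 0' \
  | jq '.data.result // [] | length')
throttle=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=rate(container_cpu_cfs_throttled_seconds_total{namespace="ml-training"}[5m]) > 0.5' \
  | jq '.data.result // [] | length')
if [[ "${reclaim:-0}" -gt 0 && "${throttle:-0}" -gt 0 ]]; then
  echo "Reclaim and throttling overlap detected: audit node agents before scaling."
else
  echo "No overlap detected: treat as a genuine capacity or scheduling issue."
fi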
Step 3: Audit Background Agents
Orphaned platform agents are the primary source of cgroup leaks. A node-level audit script should enumerate running processes, cross-reference them with active workloads, and flag unused services.
#!/usr/bin/env bash
# audit_node_agents.sh
# Identifies platform agents running without active workload dependencies
AGENT_LIST=("ecs-agent" "datadog-agent" "fluentd" "node-exporter" "kube-proxy")
ACTIVE_PIDS=$(pgrep -f "k8s_" || true)
for agent in "${AGENT_LIST[@]}"; do
AGENT_PID=$(pgrep -f "$agent" || true)
if [[ -n "$AGENT_PID" && -z "$ACTIVE_PIDS" ]]; then
echo "[ALERT] Orphaned agent detected: $agent (PID: $AGENT_PID)"
echo " -> No active Kubernetes workloads found. Safe to disable."
fi
done
This script checks for known platform agents and verifies whether active Kubernetes containers are running. If an agent is active but no workloads exist, it is a candidate for immediate disablement. The logic prevents accidental termination of essential monitoring or networking components.
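To sweep an entire node pool rather than a single host, the script can be pushed to each worker over SSH. The loop below is a sketch with several assumptions: the workers carry a hypothetical node-pool=ml-training label, their node names resolve over DNS, and the operator has SSH access.
#!/usr/bin/env bash
# Run audit_node_agents.sh on every node in the pool (label, DNS, and SSH access are assumptions).
for node in $(kubectl get nodes -l node-pool=ml-training -o name | cut -d/ -f2); do
  echo "== ${node} =="
  ssh "${node}" 'bash -s' < audit_node_agents.sh
done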
Step 4: Disable and Validate
Once an orphaned agent is identified, disable it via systemd override or Kubernetes node configuration. Validate that cgroup entries are reclaimed and CPU throttling subsides.
# systemd override for ecs-agent
[Service]
ExecStart=
ExecStart=/bin/true
Restart=no
After applying the override, run systemctl daemon-reload, restart the agent service so the no-op ExecStart takes effect, and monitor /sys/fs/cgroup/memory for entry cleanup. CPU throttling should normalize within 2-3 minutes as the kernel reclaims orphaned cgroup references.
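On a single node, rolling out the override and watching for cleanup might look like the sketch below. It assumes the agent runs as a systemd unit named ecs.service; verify the unit name on your node image before running it.
# Apply the override and watch the memory controller for cleanup (unit name is an assumption).
sudo systemctl daemon-reload             # pick up the override file
sudo systemctl restart ecs.service       # ExecStart=/bin/true turns the unit into a no-op
# The entry count should stop growing and begin to drop as the kernel reclaims references.
watch -n 10 'find /sys/fs/cgroup/memory -mindepth 1 -type d | wc -l'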
Architecture Decisions and Rationale
- Node-level auditing over pod-level monitoring: Cgroup leaks originate outside the container runtime. Pod metrics cannot detect kernel accounting corruption. Node-level inspection is mandatory.
- TypeScript diagnostic utility over shell scripts: TypeScript provides type safety, structured error handling, and easier integration with existing observability pipelines. Shell scripts are retained for quick node audits due to lower overhead.
- Disable over patch: Legacy agents often lack active maintenance. Patching memory leaks requires kernel-level changes or agent recompilation. Disabling unused agents is faster, safer, and eliminates the root cause.
- PromQL correlation over single-metric alerts: CPU throttling alone is ambiguous. Correlating with kernel memory reclaim metrics confirms the accounting corruption hypothesis before triggering remediation.
Pitfall Guide
1. Mistaking Memory Pressure for CPU Throttling
Explanation: Teams observe high CPU throttling and assume compute shortage. The actual cause is kernel memory reclaim consuming cycles and triggering scheduler limits.
Fix: Always correlate container_cpu_cfs_throttled_seconds_total with node_memory_KswapdWriteback or node_vmstat_pgmajfault. If both spike together, investigate cgroup health before scaling.
2. Ignoring Cgroup Version Differences
Explanation: Cgroup v1 and v2 handle memory accounting differently. v1 uses separate memory.usage_in_bytes and memory.limit_in_bytes, while v2 consolidates them under memory.current and memory.max. Scripts written for v1 fail on v2 nodes.
Fix: Detect cgroup version at runtime using stat -fc %T /sys/fs/cgroup. Branch logic accordingly or migrate clusters to v2 for unified accounting.
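A small shell guard, sketched below, makes the version branch explicit before any cgroup file is read; it keys off the filesystem type reported by stat.
# Choose cgroup paths based on the hierarchy version reported by the kernel.
FS_TYPE=$(stat -fc %T /sys/fs/cgroup)
if [[ "$FS_TYPE" == "cgroup2fs" ]]; then
  CGROUP_ROOT="/sys/fs/cgroup"            # unified hierarchy (v2)
  USAGE_FILE="memory.current"
else
  CGROUP_ROOT="/sys/fs/cgroup/memory"     # legacy hierarchy (v1, tmpfs mount)
  USAGE_FILE="memory.usage_in_bytes"
fi
echo "cgroup root: ${CGROUP_ROOT}, usage file: ${USAGE_FILE}"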
3. Assuming All Sidecars Are Essential
Explanation: Platform teams deploy monitoring, logging, and security agents as sidecars or node-level services. Over time, workloads change, but agents remain. Unused agents leak resources silently.
Fix: Implement quarterly agent audits. Cross-reference active workloads with running agents. Disable or remove services with zero dependency graphs.
4. Overlooking Kernel Reclaim Overhead
Explanation: Direct memory reclaim runs in process context and consumes CPU cycles. Under heavy cgroup leaks, reclaim overhead can exceed 15% of available CPU, causing apparent throttling.
Fix: Monitor node_vmstat_pgscan_direct and node_vmstat_pgsteal_direct. If direct reclaim dominates, reduce cgroup fragmentation by consolidating workloads or disabling leak sources.
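When Prometheus is unavailable, the same signal can be spot-checked on the node from /proc/vmstat. The sketch below compares two samples taken 30 seconds apart; a large delta indicates heavy direct reclaim.
# Sample the direct-reclaim scan counter twice and report the delta over 30 seconds.
before=$(awk '/^pgscan_direct /{print $2}' /proc/vmstat)
sleep 30
after=$(awk '/^pgscan_direct /{print $2}' /proc/vmstat)
echo "pgscan_direct delta over 30s: $((after - before)) pages scanned"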
5. Relying Solely on Metrics-Server
Explanation: Kubernetes metrics-server aggregates pod-level CPU and memory usage. It does not expose kernel reclaim activity, cgroup accounting health, or node-level process anomalies.
Fix: Supplement metrics-server with node exporters, cgroup scrapers, and kernel tracing tools. Build dashboards that correlate container metrics with kernel behavior.
6. Disabling Agents Without Graceful Drain
Explanation: Terminating an agent abruptly can leave dangling cgroup entries or interrupt active log/metric pipelines. The kernel may retain orphaned references, prolonging the leak.
Fix: Use systemctl stop followed by cgroupfs-mount --remount or cgdelete to force cleanup. Verify /sys/fs/cgroup/memory before marking the node as healthy.
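A graceful drain might look like the sketch below; the ecs.service unit name and the memory:/ecs cgroup path are illustrative placeholders, not prescriptive values.
# Stop the agent, then remove any memory cgroups it left behind (names are illustrative).
sudo systemctl stop ecs.service
sudo cgdelete -r memory:/ecs 2>/dev/null || true
# Confirm nothing agent-related remains before marking the node healthy.
ls /sys/fs/cgroup/memory | grep -i ecs || echo "no residual ecs cgroups"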
7. Skipping Post-Remediation Validation
Explanation: Teams disable the agent and assume the problem is resolved. Without validation, residual cgroup entries or secondary leaks can cause recurring throttling.
Fix: Run a 10-minute validation window. Monitor CPU throttling, memory reclaim metrics, and cgroup entry counts. Confirm normalization before closing the incident.
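A simple on-node loop, sketched below, captures the cgroup entry-count trend during that window; pair it with the PromQL throttling query from Step 2 to cover both signals.
# 10-minute validation: sample the memory cgroup entry count once per minute.
# A flat or shrinking count, alongside normalized throttling metrics, confirms the fix held.
for i in $(seq 1 10); do
  echo "$(date -Is) memory_cgroups=$(find /sys/fs/cgroup/memory -mindepth 1 -type d | wc -l)"
  sleep 60
done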
Production Bundle
Action Checklist
- Verify cgroup version on all worker nodes and align diagnostic scripts accordingly
- Deploy a node-level cgroup scanner to detect orphaned memory entries
- Correlate CPU throttling with kernel memory reclaim metrics before scaling
- Audit all platform agents and cross-reference with active workload dependencies
- Disable unused agents via systemd overrides or Kubernetes node configurations
- Force cgroup cleanup using cgdelete or kernel remount commands
- Run a 10-minute validation window to confirm throttling normalization
- Document agent lifecycle policies to prevent future orphaned deployments
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| ML training jobs show CPU throttling with healthy node metrics | Agent audit and cgroup scan | Throttling is likely accounting corruption, not compute shortage | Low (no node scaling required) |
| Cgroup v1 nodes exhibit persistent memory leaks | Migrate to cgroup v2 or apply kernel patches | v1 lacks unified memory accounting, increasing leak probability | Medium (requires cluster upgrade) |
| Multiple platform agents running with zero workload dependencies | Disable non-essential agents via systemd | Reduces kernel reclaim overhead and cgroup fragmentation | Low (immediate CPU recovery) |
| High direct reclaim overhead (>15% CPU) | Consolidate workloads and disable leak sources | Direct reclaim consumes CPU cycles, causing false throttling | Low (optimizes existing capacity) |
| Metrics-server shows normal usage but jobs stall | Deploy kernel-level tracing (bpftrace) | Metrics-server lacks kernel accounting visibility | Medium (requires tracing setup) |
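For the last row, a single bpftrace probe is often enough to confirm which processes are entering direct reclaim; the sketch below uses the kernel's vmscan tracepoint and assumes bpftrace is installed on the node.
# Count direct-reclaim entries per process name; Ctrl-C prints the per-comm summary.
sudo bpftrace -e 'tracepoint:vmscan:mm_vmscan_direct_reclaim_begin { @reclaims[comm] = count(); }'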
Configuration Template
# kubelet-cgroup-config.yaml
# Aligns kubelet with cgroup v2 and disables legacy memory accounting
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
# Note: the cgroup version is a host-level setting (boot nodes with
# systemd.unified_cgroup_hierarchy=1 for v2); it is not a KubeletConfiguration field.
memorySwap: {}
# systemd-override-ecs-agent.service
[Service]
ExecStart=
ExecStart=/bin/true
Restart=no
LimitNOFILE=1024
LimitNPROC=512
# prometheus-cgroup-alerts.yaml
groups:
- name: cgroup-leak-detection
rules:
- alert: HighCgroupMemoryLeak
expr: rate(node_memory_KswapdWriteback_bytes_total[5m]) > 10485760
for: 2m
labels:
severity: warning
annotations:
summary: "Kernel memory reclaim exceeding threshold"
description: "Cgroup leak likely causing CPU throttling. Audit node agents."
Quick Start Guide
- Detect cgroup version: Run stat -fc %T /sys/fs/cgroup on each worker node. If the output is cgroup2fs, you are on v2; if it is tmpfs, you are on v1. Adjust diagnostic scripts accordingly.
- Deploy cgroup scanner: Copy the TypeScript diagnostic utility to a monitoring pod or node agent. Execute scanCgroupEntries() and filter results where leakProbability > 0.7.
- Correlate metrics: Query Prometheus for rate(node_memory_KswapdWriteback_bytes_total[5m]) and rate(container_cpu_cfs_throttled_seconds_total[5m]). If both exceed thresholds simultaneously, proceed to agent audit.
- Audit and disable: Run audit_node_agents.sh on affected nodes. Identify orphaned services. Apply systemd overrides to disable them. Force cgroup cleanup with cgdelete -r memory:<orphaned_path>.
- Validate: Monitor CPU throttling and memory reclaim metrics for 10 minutes. Confirm normalization. Document the agent lifecycle policy to prevent recurrence.
