and data aggregation.
Step 1: Identify Target Synchronization Primitives
Modern Linux kernels expose synchronization primitives through tracepoints and kprobes. For mutex and read-write semaphore contention, the relevant hooks are:
- tracepoint:sched:sched_switch (fires on every context switch; prev_state shows whether the outgoing thread blocked)
- kprobe:mutex_lock / kprobe:mutex_unlock (tracks mutex acquisition/release)
- tracepoint:lock:contention_begin / tracepoint:lock:contention_end (kernel 5.19+, explicit contention events)
We focus on sched_switch combined with mutex_lock because it captures the exact moment a thread yields the CPU while waiting for a lock, allowing precise off-CPU duration calculation.
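Because hook availability differs between kernels, it can help to probe for a tracepoint before relying on it. The sketch below is one way to do that, assuming the caller has already read the text of tracefs' available_events file (commonly /sys/kernel/tracing/available_events, one subsystem:event entry per line). The function name availableTracepoints and the sample input are illustrative, not part of any tool.

```typescript
// Returns the subset of wanted tracepoints ("subsystem:event") that appear in
// the text of tracefs' available_events file.
function availableTracepoints(availableEvents: string, wanted: string[]): string[] {
  const present = new Set(
    availableEvents
      .split('\n')
      .map(line => line.trim())
      .filter(line => line.length > 0)
  );
  return wanted.filter(tp => present.has(tp));
}

// Made-up sample shaped like available_events on a kernel without lock tracepoints.
const sample = ['sched:sched_switch', 'sched:sched_wakeup', 'block:block_rq_issue'].join('\n');
const found = availableTracepoints(sample, ['sched:sched_switch', 'lock:contention_begin']);
console.log(found); // [ 'sched:sched_switch' ]
```

When the explicit contention tracepoints are missing from the result, the sched_switch plus mutex_lock combination remains the fallback.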
Step 2: Attach eBPF Probes with Duration Tracking
The eBPF program must record a timestamp when a thread enters a blocked state and calculate the delta when it resumes. We use a BPF hash map to store per-thread start times and a ring buffer to emit one event per qualifying wait to user space, where aggregation happens.
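The timestamp pairing can be modeled outside the kernel to check the logic before writing any BPF code. The following is a minimal sketch, not the kernel implementation: SwitchEvent and offCpuDurations are hypothetical names, and the trace is made-up data. A Map plays the role of the per-thread hash map.

```typescript
// One context switch: prevTid leaves the CPU, nextTid starts running.
type SwitchEvent = { tsNs: number; prevTid: number; nextTid: number };
type OffCpuWait = { tid: number; offCpuNs: number };

function offCpuDurations(events: SwitchEvent[], minNs = 50_000): OffCpuWait[] {
  const startTimes = new Map<number, number>(); // tid -> time it left the CPU
  const waits: OffCpuWait[] = [];
  for (const ev of events) {
    startTimes.set(ev.prevTid, ev.tsNs); // outgoing thread: start the clock
    const start = startTimes.get(ev.nextTid);
    if (start !== undefined) {
      startTimes.delete(ev.nextTid); // incoming thread: stop the clock
      const delta = ev.tsNs - start;
      if (delta > minNs) waits.push({ tid: ev.nextTid, offCpuNs: delta });
    }
  }
  return waits;
}

const trace: SwitchEvent[] = [
  { tsNs: 1_000, prevTid: 10, nextTid: 11 },
  { tsNs: 31_000, prevTid: 11, nextTid: 12 },
  { tsNs: 201_000, prevTid: 12, nextTid: 10 }, // tid 10 was off-CPU for 200000 ns
  { tsNs: 211_000, prevTid: 10, nextTid: 12 }, // tid 12 was off-CPU for 10000 ns (filtered)
];
const waits = offCpuDurations(trace);
console.log(waits); // [ { tid: 10, offCpuNs: 200000 } ]
```

This model counts every switch-out as the start of a wait. The kernel program narrows that to threads inside mutex_lock.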
eBPF Kernel-Side Implementation (C)
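// offcpu_lock_tracker.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MAX_THREADS         65536
#define MAX_STACK_DEPTH     64
#define MIN_WAIT_NS         50000  /* ignore waits shorter than 50 microseconds */
#define TASK_UNINTERRUPTIBLE 0x0002 /* macros are not part of vmlinux.h */

/* Per-thread state, present only while the thread is inside mutex_lock(). */
struct wait_state {
    u64 off_cpu_start_ns; /* 0 while the thread is on-CPU */
    u32 tgid;
    s32 stack_id;         /* kernel stack captured at switch-out */
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_THREADS);
    __type(key, u32); /* thread ID */
    __type(value, struct wait_state);
} thread_start_times SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, 16384);
    __uint(key_size, sizeof(u32));
    __uint(value_size, MAX_STACK_DEPTH * sizeof(u64));
} stack_traces SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} contention_events SEC(".maps");

struct lock_event {
    u32 pid;      /* thread ID */
    u32 tgid;     /* process ID */
    u64 off_cpu_ns;
    s32 stack_id; /* key into stack_traces; negative if capture failed */
    char comm[16];
};

SEC("tracepoint/sched/sched_switch")
int trace_offcpu_lock(struct trace_event_raw_sched_switch *ctx)
{
    u64 now = bpf_ktime_get_ns();
    u32 prev = ctx->prev_pid, next = ctx->next_pid;
    struct wait_state *w;

    /* Outgoing thread: if it blocks inside mutex_lock(), start the clock.
     * "current" is still the outgoing thread here, so the stack is its own. */
    w = bpf_map_lookup_elem(&thread_start_times, &prev);
    if (w && (ctx->prev_state & TASK_UNINTERRUPTIBLE)) {
        w->off_cpu_start_ns = now;
        w->stack_id = bpf_get_stackid(ctx, &stack_traces, BPF_F_FAST_STACK_CMP);
    }

    /* Incoming thread: if it was blocked inside mutex_lock(), report the wait. */
    w = bpf_map_lookup_elem(&thread_start_times, &next);
    if (!w || !w->off_cpu_start_ns)
        return 0;
    u64 duration = now - w->off_cpu_start_ns;
    w->off_cpu_start_ns = 0;
    if (duration <= MIN_WAIT_NS)
        return 0;

    struct lock_event *evt = bpf_ringbuf_reserve(&contention_events, sizeof(*evt), 0);
    if (!evt)
        return 0; /* ring buffer full: this event is dropped */
    evt->pid = next;
    evt->tgid = w->tgid;
    evt->off_cpu_ns = duration;
    evt->stack_id = w->stack_id;
    /* bpf_get_current_comm() would return the outgoing thread's name here. */
    bpf_probe_read_kernel_str(evt->comm, sizeof(evt->comm), ctx->next_comm);
    bpf_ringbuf_submit(evt, 0);
    return 0;
}

SEC("kprobe/mutex_lock")
int track_mutex_entry(struct pt_regs *ctx)
{
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u32 tid = (u32)pid_tgid; /* lower 32 bits: thread ID, matches next_pid */
    struct wait_state w = { .tgid = pid_tgid >> 32, .stack_id = -1 };

    bpf_map_update_elem(&thread_start_times, &tid, &w, BPF_ANY);
    return 0;
}

/* Remove the entry once mutex_lock() returns, so later context switches of
 * this thread are not mistaken for lock waits. */
SEC("kretprobe/mutex_lock")
int track_mutex_exit(struct pt_regs *ctx)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();

    bpf_map_delete_elem(&thread_start_times, &tid);
    return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";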
Step 3: Aggregate and Filter in User Space
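Kernel-side eBPF programs operate under strict verifier constraints, so complex aggregation, string formatting, and network transmission must occur in user space. Node.js has no built-in API for reading BPF ring buffers, so the TypeScript consumer below collects blocked-thread data by spawning bpftrace and parsing its output, then attaches kernel stacks and prints a ranked contention report. A production consumer would instead read contention_events through libbpf bindings and resolve stack IDs to symbols.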
TypeScript User-Space Consumer
// contention-analyzer.ts
import { spawn, ChildProcessWithoutNullStreams } from 'child_process';
import { createInterface } from 'readline';
import { readFileSync } from 'fs';

interface ContentionRecord {
  pid: number;
  tgid: number;
  durationMs: number;
  comm: string;
  stackFrames: string[];
}

// Sums the time each thread spends off-CPU after switching out in
// TASK_UNINTERRUPTIBLE (prev_state & 2) and prints cumulative totals every 5 s.
const BPFTRACE_SCRIPT = `
tracepoint:sched:sched_switch {
  if (args->prev_state & 2) {
    @start[args->prev_pid] = nsecs;
  }
  $s = @start[args->next_pid];
  if ($s) {
    @blocked[args->next_pid, args->next_comm] = sum(nsecs - $s);
    delete(@start[args->next_pid]);
  }
}
interval:s:5 { print(@blocked); }`;

class LockContentionAnalyzer {
  private bpftraceProcess: ChildProcessWithoutNullStreams;
  // Keyed by thread ID. Each print is cumulative, so newer lines replace older ones.
  private records = new Map<number, ContentionRecord>();

  constructor() {
    // bpftrace needs root privileges to attach the tracepoint.
    this.bpftraceProcess = spawn('bpftrace', ['-e', BPFTRACE_SCRIPT]);
    this.setupStreamProcessing();
  }

  private setupStreamProcessing(): void {
    const rl = createInterface({ input: this.bpftraceProcess.stdout });
    rl.on('line', (line: string) => {
      if (line.includes('@blocked')) {
        const parsed = this.parseBpftraceOutput(line);
        if (parsed) this.records.set(parsed.pid, parsed);
      }
    });
  }

  private parseBpftraceOutput(line: string): ContentionRecord | null {
    const match = line.match(/@blocked\[(\d+),\s*(.+?)\]\s*:\s*(\d+)/);
    if (!match) return null;
    const [, pidStr, comm, nsStr] = match;
    const pid = parseInt(pidStr, 10);
    return {
      pid,
      tgid: pid, // sched_switch exposes only the thread ID; the TGID is not resolved here
      durationMs: parseInt(nsStr, 10) / 1e6, // bpftrace reports nanoseconds
      comm: comm.replace(/"/g, '').trim(),
      stackFrames: []
    };
  }

  // Reads the thread's kernel stack from procfs (requires root). This is a
  // snapshot at report time, not the stack at the moment the thread blocked.
  private resolveKernelStack(pid: number): string[] {
    try {
      const output = readFileSync(`/proc/${pid}/stack`, 'utf-8');
      return output.split('\n').filter(l => l.includes('0x')).slice(0, 8);
    } catch {
      return ['<stack resolution failed>'];
    }
  }

  public generateReport(): void {
    const sorted = Array.from(this.records.values()).sort((a, b) => b.durationMs - a.durationMs);
    console.log('\n=== KERNEL LOCK CONTENTION REPORT ===');
    sorted.slice(0, 10).forEach((rec, idx) => {
      rec.stackFrames = this.resolveKernelStack(rec.pid);
      console.log(`#${idx + 1} | PID: ${rec.pid} | Comm: ${rec.comm} | Blocked: ${rec.durationMs.toFixed(2)}ms`);
      rec.stackFrames.forEach(frame => console.log(`  ${frame}`));
    });
  }

  public stop(): void {
    this.bpftraceProcess.kill('SIGINT'); // let bpftrace detach its probes
  }
}

const analyzer = new LockContentionAnalyzer();
setTimeout(() => {
  analyzer.generateReport();
  analyzer.stop();
  process.exit(0);
}, 30000);
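The report above ranks individual threads. For database workloads it is often more useful to group waits by process name first. The sketch below shows one possible aggregation step; WaitSample, CommSummary, and summarizeByComm are hypothetical names, and the sample values are invented for illustration.

```typescript
interface WaitSample { comm: string; durationMs: number }
interface CommSummary { comm: string; count: number; totalMs: number; maxMs: number }

// Groups wait samples by process name and ranks the groups by total blocked time.
function summarizeByComm(samples: WaitSample[]): CommSummary[] {
  const byComm = new Map<string, CommSummary>();
  for (const s of samples) {
    const cur = byComm.get(s.comm) ?? { comm: s.comm, count: 0, totalMs: 0, maxMs: 0 };
    cur.count += 1;
    cur.totalMs += s.durationMs;
    cur.maxMs = Math.max(cur.maxMs, s.durationMs);
    byComm.set(s.comm, cur);
  }
  return Array.from(byComm.values()).sort((a, b) => b.totalMs - a.totalMs);
}

const summary = summarizeByComm([
  { comm: 'postgres', durationMs: 12.5 },
  { comm: 'redis-server', durationMs: 20 },
  { comm: 'postgres', durationMs: 3.5 },
]);
console.log(summary[0].comm); // redis-server (20 ms total, ahead of postgres at 16 ms)
```

Keeping count and maxMs next to totalMs helps separate one long stall from many short ones.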
Architecture Decisions and Rationale
- Why sched_switch + kprobe instead of perf? perf's default CPU sampling only observes on-CPU execution, so blocked threads never appear in its profiles. sched_switch fires on every context switch, capturing the exact moment a thread yields. Combined with the mutex_lock kprobe, it gives a precise start/end boundary for off-CPU duration.
- Why ring buffer over hash map aggregation? A ring buffer is a single high-throughput channel shared across CPUs that preserves event order from kernel to user space. Hash maps require user-space polling and overflow under high contention. A ring buffer can still drop events when the consumer falls behind (bpf_ringbuf_reserve() returns NULL), so it must be sized for bursts; see Pitfall 6.
- Why filter <50μs? Very short waits are frequent and rarely actionable, and emitting an event for each one floods the ring buffer with noise. The 50μs threshold is a starting point that keeps the waits long enough to matter; tune it for your workload.
- TypeScript consumer rationale: While eBPF requires C, user-space aggregation benefits from modern runtime features. TypeScript provides strong typing for event schemas, async I/O handling for ring buffers, and seamless integration with existing observability pipelines (Prometheus, OpenTelemetry, or custom dashboards).
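As an illustration of the pipeline integration mentioned above, the sketch below renders contention records in the Prometheus text exposition format. The metric name lock_contention_blocked_ms and the function names are assumptions made for this example, not an established convention.

```typescript
interface BlockedRecord { pid: number; comm: string; durationMs: number }

// Escapes backslashes, double quotes, and newlines in a label value.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

function toPrometheus(records: BlockedRecord[]): string {
  const name = 'lock_contention_blocked_ms'; // hypothetical metric name
  const lines = [
    `# HELP ${name} Off-CPU time attributed to lock waits, in milliseconds.`,
    `# TYPE ${name} gauge`,
  ];
  for (const r of records) {
    lines.push(`${name}{pid="${r.pid}",comm="${escapeLabel(r.comm)}"} ${r.durationMs}`);
  }
  return lines.join('\n') + '\n';
}

const exposition = toPrometheus([{ pid: 42, comm: 'postgres', durationMs: 12.5 }]);
console.log(exposition);
```

Serving this string from an HTTP endpoint is left out. A PID label has high cardinality, so a real exporter would likely aggregate by process name instead.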
Pitfall Guide
1. Verifier Rejection Due to Unbounded Control Flow
Explanation: The BPF verifier enforces strict limits on loop iterations, pointer arithmetic, and function calls. Complex aggregation logic or unbounded stack traversal triggers errors such as "invalid indirect read from stack" or "function calls are not allowed".
Fix: Unroll loops or give them a constant upper bound (bounded loops are accepted from kernel 5.3). Use bpf_probe_read_kernel() with explicit size limits. Pre-allocate maps instead of relying on dynamic allocation. Test with bpftool prog load to catch verifier errors early.
2. Stack Trace Truncation or Missing Frames
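Explanation: bpf_get_stack() returns raw instruction pointers. Until they are symbolized, frames appear as hex addresses; stacks deeper than the buffer are truncated, and a single call captures either the kernel stack or the user stack, not both. Kernel symbolization fails when user space cannot read real addresses from /proc/kallsyms (for example, when kptr_restrict hides them from unprivileged readers).
Fix: Use bpf_get_stackid() with BPF_F_FAST_STACK_CMP for deduplication. Resolve kernel addresses in user space against /proc/kallsyms, or with addr2line against the running kernel's vmlinux image built with debug info. Keep BTF enabled (CONFIG_DEBUG_INFO_BTF=y) so the program itself loads through CO-RE.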
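Resolving raw kernel addresses in user space amounts to a nearest-lower-symbol search. The sketch below assumes the caller has read the text of /proc/kallsyms as root (lines of the form "address type name"). The sample addresses are made up, and parseKallsyms and resolveAddress are hypothetical helper names. BigInt is used because kernel addresses exceed the range JavaScript numbers represent exactly.

```typescript
interface KernelSymbol { addr: bigint; name: string }

// Parses kallsyms-style text and returns symbols sorted by address.
function parseKallsyms(text: string): KernelSymbol[] {
  const syms: KernelSymbol[] = [];
  for (const line of text.split('\n')) {
    const parts = line.trim().split(/\s+/);
    if (parts.length < 3) continue; // skip blank or malformed lines
    syms.push({ addr: BigInt('0x' + parts[0]), name: parts[2] });
  }
  return syms.sort((a, b) => (a.addr < b.addr ? -1 : a.addr > b.addr ? 1 : 0));
}

// Binary search for the last symbol whose address is <= addr.
function resolveAddress(syms: KernelSymbol[], addr: bigint): string {
  let lo = 0;
  let hi = syms.length - 1;
  let best = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (syms[mid].addr <= addr) { best = mid; lo = mid + 1; } else { hi = mid - 1; }
  }
  if (best < 0) return `0x${addr.toString(16)}`; // below the first symbol: leave unresolved
  const offset = addr - syms[best].addr;
  return `${syms[best].name}+0x${offset.toString(16)}`;
}

// Made-up addresses for illustration only.
const kallsymsSample = [
  'ffffffff81000000 T _text',
  'ffffffff81a3c310 T mutex_unlock',
  'ffffffff81a3c2d0 T mutex_lock',
].join('\n');
const symbols = parseKallsyms(kallsymsSample);
console.log(resolveAddress(symbols, BigInt('0xffffffff81a3c2f4'))); // mutex_lock+0x24
```

If every parsed address is zero, kptr_restrict is hiding the real values and the consumer needs more privileges.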
3. High Overhead from Tracing Every Lock Acquisition
Explanation: Attaching to mutex_lock without filtering captures every acquisition, including uncontended ones. On a busy system this can mean millions of probe hits per second, saturating ring buffers and pushing CPU overhead to roughly 5–8%.
Fix: Filter on prev_state & 2 (TASK_UNINTERRUPTIBLE) in sched_switch so that only threads that actually blocked are recorded. Add a duration threshold. Use bpf_core_type_exists() to enable contention-specific probes conditionally; CO-RE checks are resolved when the program is loaded, not when it is compiled.
4. Kernel Version ABI Instability
Explanation: Kernel internal structures (task_struct, mutex, rw_semaphore) change across versions. Hardcoded offsets break eBPF programs on kernel upgrades.
Fix: Use CO-RE (Compile Once – Run Everywhere) with vmlinux.h generated from BTF. Guard optional fields with bpf_core_field_exists() and read them with BPF_CORE_READ() or bpf_core_read(). Avoid hardcoded offsets and direct struct pointer casting. Test on each target kernel version by loading the object there: bpf_object__load() performs the CO-RE relocations and runs the verifier.
5. Misattributing I/O Wait as Lock Contention
Explanation: Threads blocked on disk I/O or network sockets also appear in off-CPU traces. Confusing I/O latency with synchronization contention leads to incorrect optimizations.
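Fix: Cross-reference with tracepoint:block:block_rq_issue or tracepoint:net:net_dev_xmit. Exclude events whose prev_state is TASK_INTERRUPTIBLE, which is typical of socket and poll waits; mutex_lock() sleeps in TASK_UNINTERRUPTIBLE. Disk I/O also sleeps in TASK_UNINTERRUPTIBLE, so state alone cannot separate it from lock waits: check the captured stack for lock functions versus io_schedule(). Correlate with iostat and netstat metrics.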
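One way to separate lock waits from I/O waits after the fact is to inspect the symbolized stack. The sketch below is a heuristic under stated assumptions: the symbol fragments listed are typical kernel function names but are neither exhaustive nor stable across kernel versions, and classifyWait is a hypothetical helper. The sample frames are invented.

```typescript
type WaitKind = 'lock' | 'io' | 'unknown';

// Assumption: substrings of kernel function names commonly seen in each kind of wait.
const IO_FRAMES = ['io_schedule', 'folio_wait_bit', 'wait_on_page_bit'];
const LOCK_FRAMES = ['mutex_lock', 'rwsem_down_read_slowpath', 'rwsem_down_write_slowpath'];

function classifyWait(stackFrames: string[]): WaitKind {
  const hasAny = (needles: string[]): boolean =>
    stackFrames.some(frame => needles.some(n => frame.includes(n)));
  if (hasAny(IO_FRAMES)) return 'io';
  if (hasAny(LOCK_FRAMES)) return 'lock';
  return 'unknown';
}

const lockStack = [
  'schedule+0x5a/0xd0',
  'schedule_preempt_disabled+0x15/0x30',
  '__mutex_lock.constprop.0+0x2f5/0x5e0',
  'ext4_file_write_iter+0x1a2/0x8c0',
];
const ioStack = ['schedule+0x5a/0xd0', 'io_schedule+0x46/0x80', 'folio_wait_bit_common+0x13d/0x350'];

console.log(classifyWait(lockStack)); // lock
console.log(classifyWait(ioStack));   // io
```

Stacks that match neither list are reported as unknown rather than guessed, which keeps misattribution visible.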
6. Map Overflow and Data Loss
Explanation: BPF maps have fixed sizes. Once a hash map is full, bpf_map_update_elem() returns an error; if the return value is ignored, the entry is silently lost. Ring buffers also overflow if user-space consumption lags.
Fix: Size maps based on expected thread count + 20% headroom. Check the return values of bpf_map_update_elem() and bpf_ringbuf_reserve(), and count failures in a dedicated counter map so drops are visible. Implement backpressure handling in user space.
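The sizing rule can be turned into a small calculation. The sketch below applies the 20% headroom to the hash map and rounds the ring buffer up to a power of two, since BPF ring buffer sizes must be a power of two and a multiple of the page size. The function names, the 4096-byte page size, and the burst figure are assumptions for illustration; the 8 bytes added per record account for the ring buffer's record header.

```typescript
const RINGBUF_HEADER_BYTES = 8; // per-record header in BPF ring buffers

function nextPowerOfTwo(n: number): number {
  let p = 1;
  while (p < n) p *= 2;
  return p;
}

// expectedThreads: peak number of threads tracked at once
// eventBytes: size of one event record
// burstEvents: events the buffer should absorb while the consumer is stalled
function sizeMaps(expectedThreads: number, eventBytes: number, burstEvents: number, pageSize = 4096) {
  const hashEntries = expectedThreads + Math.ceil(expectedThreads / 5); // +20% headroom
  const needed = (eventBytes + RINGBUF_HEADER_BYTES) * burstEvents;
  const ringBytes = nextPowerOfTwo(Math.max(pageSize, needed));
  return { hashEntries, ringBytes };
}

const sizing = sizeMaps(4000, 40, 10_000);
console.log(sizing); // { hashEntries: 4800, ringBytes: 524288 }
```

Here 10,000 events of 48 bytes need 480,000 bytes, which rounds up to 512 KiB.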
7. Ignoring User-Space vs Kernel-Space Lock Boundaries
Explanation: Database engines often implement user-space spinlocks or futexes before falling back to kernel mutexes. Tracing only kernel primitives misses the initial contention layer.
Fix: Combine eBPF kernel tracing with uprobes on the database's lock functions (e.g., pthread_mutex_lock, custom spinlock implementations). To tell kernel and user-space call origins apart, capture both stacks: bpf_get_stackid() returns the kernel stack by default and the user stack when BPF_F_USER_STACK is set.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Intermittent <100ms database freezes | eBPF off-CPU tracing | Captures blocked threads invisible to on-CPU profilers | Low (0.5–1.5% CPU) |
| Sustained high CPU utilization | On-CPU sampling (perf record) | Identifies hot paths consuming cycles | Low (1–3% CPU) |
| Application-level query timeouts | Distributed tracing (OpenTelemetry) | Maps latency across service boundaries | Medium (5–10% CPU + network) |
| Disk/Network I/O bottlenecks | blktrace / tcpdump / eBPF socket filters | Isolates storage/network layer delays | Low (1–2% CPU) |
| Production lock contention diagnosis | eBPF + kernel stack resolution | Direct attribution to synchronization primitives | Low (0.5–1.5% CPU) |
Configuration Template
#!/bin/bash
# deploy-offcpu-tracer.sh
set -euo pipefail

KERNEL_VERSION=$(uname -r)
BTF_PATH="/sys/kernel/btf/vmlinux"
PIN_DIR="/sys/fs/bpf/offcpu_lock_tracker"
MAP_DIR="/sys/fs/bpf/offcpu_maps"

if [ ! -f "$BTF_PATH" ]; then
  echo "ERROR: BTF not available. Enable CONFIG_DEBUG_INFO_BTF=y"
  exit 1
fi

# Register cleanup before loading anything so it also runs on failure.
# Removing the pins releases the programs and maps.
trap 'echo "[*] Cleaning up..."; sudo rm -rf "$PIN_DIR" "$MAP_DIR"' EXIT

echo "[*] Loading eBPF off-CPU lock tracer for kernel $KERNEL_VERSION"
# loadall pins every program in the object; autoattach needs a recent bpftool.
sudo bpftool prog loadall offcpu_lock_tracker.bpf.o "$PIN_DIR" \
  pinmaps "$MAP_DIR" autoattach

echo "[*] Starting TypeScript consumer..."
# Assumes ts-node is installed; the consumer stops after its built-in 30 seconds.
sudo npx ts-node contention-analyzer.ts | sudo tee /var/log/lock_contention.log
Quick Start Guide
- Verify prerequisites: Ensure kernel ≥5.8 with BTF enabled (CONFIG_DEBUG_INFO_BTF=y). Install bpftool, bpftrace, libbpf, and Node.js 18+.
- Compile the eBPF program: Generate vmlinux.h with bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h, then run clang -O2 -g -target bpf -c offcpu_lock_tracker.bpf.c -o offcpu_lock_tracker.bpf.o using LLVM 14+.
- Deploy and attach: Execute the configuration template script. It loads the program, pins maps, and starts the TypeScript consumer.
- Trigger load and collect: Run your database workload. After 30–60 seconds, the consumer outputs a ranked contention report with kernel stack traces and duration metrics.
- Analyze and resolve: Identify the top contended lock primitive. Adjust database connection pooling, tune kernel scheduler parameters, or refactor synchronization boundaries based on the captured call stacks.