ou can measure the exact duration of collective operations per rank. Combined with cgroup attribution, you can link kernel events back to specific tenant processes, enabling precise isolation of noisy neighbors and skew causes without the overhead of process-level profilers.
Core Solution
Implementing kernel-side observability requires a strategy that captures signals at the right abstraction layers, attributes them correctly, and aggregates them using robust statistical methods.
Architecture Decisions
- Non-Invasive Instrumentation: Use eBPF uprobes on libnccl symbols rather than the NCCL profiler. The NCCL profiler requires loading a plugin shared object at process start and emitting to a configured target, which adds friction and overhead. eBPF uprobes attach dynamically to running processes, require no workload modification, and stream events directly to user space.
- Cgroup Attribution: Multi-tenancy demands that every kernel event be tagged with the cgroup ID of the originating process. This requires reading the cgroup_id of the current task during probe execution (for example, with the bpf_get_current_cgroup_id() helper on cgroup v2) and attaching it to the telemetry payload.
- Robust Aggregation: Use Median Absolute Deviation (MAD) for skew detection rather than standard deviation. MAD is resistant to outliers, making it suitable for latency distributions where occasional spikes should not skew the threshold calculation.
- OTLP Emission: Standardize on OpenTelemetry Protocol (OTLP) for telemetry transport. This ensures compatibility with existing observability stacks (Prometheus, Grafana, Datadog) and enables cross-node correlation.
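To make the first two decisions concrete, here is a minimal sketch of how a user-space agent might turn uprobe entry and exit events into per-call durations that carry a cgroup ID. The `ProbeEvent` shape, its field names, and the `pairEntryExit` helper are hypothetical and chosen for illustration; they are not a real agent schema.

```typescript
// Hypothetical event shape emitted by the eBPF side; field names are assumptions.
interface ProbeEvent {
  kind: "entry" | "exit";
  pid: number;
  tid: number;
  cgroupId: string;
  timestampNs: number;
}

interface CollectiveSpan {
  pid: number;
  tid: number;
  cgroupId: string;
  startNs: number;
  durationNs: number;
}

// Pairs each exit event with the most recent unmatched entry on the same thread.
function pairEntryExit(events: ProbeEvent[]): CollectiveSpan[] {
  const open = new Map<string, ProbeEvent>();
  const spans: CollectiveSpan[] = [];
  // Sort a copy by timestamp so the caller's array is left untouched.
  const ordered = [...events].sort((a, b) => a.timestampNs - b.timestampNs);

  for (const ev of ordered) {
    const key = `${ev.pid}:${ev.tid}`;
    if (ev.kind === "entry") {
      open.set(key, ev);
      continue;
    }
    const start = open.get(key);
    if (start === undefined) continue; // exit without entry, e.g. probe attached mid-call
    open.delete(key);
    spans.push({
      pid: ev.pid,
      tid: ev.tid,
      cgroupId: start.cgroupId,
      startNs: start.timestampNs,
      durationNs: ev.timestampNs - start.timestampNs,
    });
  }
  return spans;
}
```

One caveat on interpretation: NCCL collectives enqueue work on a CUDA stream, so an entry-to-exit span measures host-side time spent in the call and may differ from the on-device completion time.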
Implementation: Skew Detection Algorithm
The following TypeScript example demonstrates a skew detection utility that processes per-rank NCCL latency samples. It computes a MAD-based threshold for each cluster and flags ranks whose median latency exceeds it, which makes it suitable for identifying slow ranks in real time.
```typescript
interface RankLatencySample {
  rankId: number;
  clusterId: string;
  timestamp: number;
  durationNs: number;
  cgroupId: string;
  tenantId?: string;
}

interface SkewAlert {
  clusterId: string;
  slowRankId: number;
  medianLatencyNs: number;
  madValue: number;
  thresholdNs: number;
  slowRankLatencyNs: number;
  cgroupId: string;
  tenantId?: string;
}

/**
 * Detects slow ranks using Median Absolute Deviation (MAD).
 * MAD is robust against outliers, making it ideal for latency distributions.
 */
export function detectRankSkew(
  samples: RankLatencySample[],
  madMultiplier: number = 3.0
): SkewAlert[] {
  const alerts: SkewAlert[] = [];
  const clusters = new Map<string, RankLatencySample[]>();

  // Group samples by cluster
  for (const sample of samples) {
    const clusterSamples = clusters.get(sample.clusterId) || [];
    clusterSamples.push(sample);
    clusters.set(sample.clusterId, clusterSamples);
  }

  for (const [clusterId, clusterSamples] of clusters.entries()) {
    // Group by rank within cluster
    const ranks = new Map<number, RankLatencySample[]>();
    for (const sample of clusterSamples) {
      const rankSamples = ranks.get(sample.rankId) || [];
      rankSamples.push(sample);
      ranks.set(sample.rankId, rankSamples);
    }

    // Calculate global median and MAD for the cluster
    const allDurations = clusterSamples.map(s => s.durationNs);
    const median = getMedian(allDurations);
    const deviations = allDurations.map(d => Math.abs(d - median));
    const mad = getMedian(deviations);
    // If more than half of the samples share one value, MAD is 0 and
    // the threshold collapses to the median.
    const threshold = median + (madMultiplier * mad);

    // Check each rank against threshold
    for (const [rankId, rankSamples] of ranks.entries()) {
      const rankMedian = getMedian(rankSamples.map(s => s.durationNs));
      if (rankMedian > threshold) {
        // Attribute the alert using the last sample received for this rank
        const slowSample = rankSamples[rankSamples.length - 1];
        alerts.push({
          clusterId,
          slowRankId: rankId,
          medianLatencyNs: median,
          madValue: mad,
          thresholdNs: threshold,
          slowRankLatencyNs: rankMedian,
          cgroupId: slowSample.cgroupId,
          tenantId: slowSample.tenantId,
        });
      }
    }
  }
  return alerts;
}

// Returns the median of the values. Sorts a copy, so callers may pass
// unsorted input (the deviations and per-rank durations above are unsorted).
function getMedian(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 !== 0
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}
```
Rationale
- MAD vs. Standard Deviation: Latency distributions in inference clusters are often non-Gaussian due to stragglers and system noise. MAD provides a more stable baseline for thresholding.
- Cgroup Tagging: Including cgroupId in the alert payload allows operators to immediately correlate a slow rank with a specific tenant or workload, accelerating remediation.
- Real-Time Processing: This algorithm can run within an OpenTelemetry Collector extension or a sidecar agent, enabling immediate alerting without batch processing delays.
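The MAD versus standard deviation point can be illustrated with a small worked example. The sketch below compares the two thresholds on a made-up sample containing one straggler; the values and the helper names (`median`, `madThreshold`, `stdThreshold`) are illustrative assumptions, not measurements or functions from the utility above.

```typescript
// Median of a list of numbers; sorts a copy so input order does not matter.
function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 !== 0 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Threshold = median + k * MAD.
function madThreshold(values: number[], k: number = 3): number {
  const m = median(values);
  const mad = median(values.map(v => Math.abs(v - m)));
  return m + k * mad;
}

// Threshold = mean + k * population standard deviation.
function stdThreshold(values: number[], k: number = 3): number {
  const mean = values.reduce((acc, v) => acc + v, 0) / values.length;
  const variance =
    values.reduce((acc, v) => acc + (v - mean) ** 2, 0) / values.length;
  return mean + k * Math.sqrt(variance);
}

// Illustrative latencies (arbitrary units) with one straggler at 500.
const latencies = [100, 102, 98, 101, 99, 500];
console.log("MAD threshold:", madThreshold(latencies));
console.log("Std-dev threshold:", stdThreshold(latencies));
```

For this sample the MAD threshold is 105, so the 500 straggler is flagged. The straggler inflates the mean and standard deviation enough that the standard deviation threshold lands at roughly 614, above the straggler itself, so it would go undetected.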
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Ignoring p99 Distributions | Relying on p90 masks tail latency issues that impact user experience. p99/p99.9 are critical for reliability SLAs. | Configure dashboards and alerts to track p99 and p99.9 latency distributions. Investigate tail causes like speculative decoding accept ratios. |
| NUMA Misalignment | Processes bound to the wrong NUMA node can suffer increased memory latency, causing cross-rank skew. | Use numactl or Kubernetes topology hints to pin inference processes to the correct NUMA node. Verify with hwloc. |
| eBPF Overhead | Excessive probing or inefficient map usage can introduce CPU overhead, impacting inference performance. | Sample high-frequency events instead of tracing every call. Use per-CPU maps (BPF_MAP_TYPE_PERCPU_HASH or BPF_MAP_TYPE_PERCPU_ARRAY) to avoid cross-CPU contention. Monitor the CPU time consumed by eBPF programs. |
| NCCL Profiler Dependency | Relying on NCCL's built-in profiler requires process restarts and configuration changes, hindering agility. | Prefer eBPF uprobes for dynamic, non-invasive instrumentation. Reserve NCCL profiler for deep-dive debugging sessions. |
| Averaging Multi-Tenant Metrics | Aggregating metrics across tenants hides noisy neighbor effects and attribution issues. | Implement cgroup-aware telemetry. Tag all metrics with cgroup_id and tenant_id for granular analysis. |
| Speculative Decoding Masking Skew | High speculative decoding acceptance can mask underlying latency issues, giving false confidence. | Correlate speculative decoding accept ratios with latency metrics. Monitor both to ensure gains are genuine. |
| TCP Retransmits Hidden by NCCL | Network issues causing TCP retransmits may be obscured by NCCL's internal retry mechanisms. | Use eBPF kprobes on tcp_retransmit_skb to capture retransmit events. Attribute retransmits to specific ranks and nodes. |
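To illustrate the first pitfall, here is a minimal sketch of a nearest-rank percentile helper. The `percentile` function and the sample data are assumptions made for illustration; a production pipeline would more likely rely on histogram-based percentiles from its metrics backend.

```typescript
// Nearest-rank percentile: the smallest value such that at least p percent
// of the samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Illustrative data: 197 fast requests at 10 ms and 3 slow ones at 500 ms.
const requestLatenciesMs: number[] = [
  ...new Array<number>(197).fill(10),
  ...new Array<number>(3).fill(500),
];

console.log("p90:", percentile(requestLatenciesMs, 90));
console.log("p99:", percentile(requestLatenciesMs, 99));
```

With this data p90 stays at 10 ms while p99 is 500 ms, which is the kind of tail that a p90-only dashboard would hide.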
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single-Tenant, Small Cluster | DCGM + NCCL Profiler | Simpler setup; sufficient for isolated workloads. | Low |
| Multi-Tenant, Large Cluster | eBPF + OTLP + MAD | Required for tenant attribution and skew detection at scale. | Medium |
| Debugging Tail Latency | eBPF + Kernel Probes | Deep visibility into kernel-launch overhead and TCP retransmits. | High Effort |
| Compliance/Audit | eBPF + Cgroup Tagging | Precise resource attribution per tenant for billing/compliance. | Medium |
Configuration Template
Below is a YAML configuration template for an eBPF observability agent targeting NCCL collectives and cgroup attribution.
```yaml
agent:
  name: "inference-observer"
  version: "1.0.0"

probes:
  - name: "nccl_allreduce"
    type: "uprobe"
    target: "libnccl.so"
    symbol: "ncclAllReduce"
    events: ["entry", "exit"]
    attributes:
      - "rank_id"
      - "cgroup_id"
      - "timestamp_ns"
  - name: "tcp_retransmit"
    type: "kprobe"
    target: "tcp_retransmit_skb"
    events: ["entry"]
    attributes:
      - "rank_id"
      - "cgroup_id"
      - "retransmit_count"
  # cudaLaunchKernel is a user-space CUDA runtime symbol, so it is attached via uprobe
  - name: "cuda_launch"
    type: "uprobe"
    target: "libcudart.so"
    symbol: "cudaLaunchKernel"
    events: ["entry", "exit"]
    attributes:
      - "kernel_name"
      - "launch_overhead_ns"

aggregation:
  skew_detection:
    method: "MAD"
    multiplier: 3.0
    window_seconds: 60

output:
  protocol: "OTLP"
  endpoint: "otel-collector:4317"
  batch_size: 100
  flush_interval_ms: 500
```
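The `window_seconds` setting implies that skew detection runs over a rolling window of recent samples. The template does not specify how the agent implements this, so the following is only a sketch of one possible approach: a small buffer that evicts samples older than the window. The `SlidingWindow` class and the `timestampMs` field are hypothetical, and the sketch assumes samples arrive in timestamp order.

```typescript
// Keeps only the items whose timestamp falls within the last `windowMs`.
// Assumes items are pushed in non-decreasing timestamp order.
class SlidingWindow<T extends { timestampMs: number }> {
  private items: T[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Adds an item and evicts anything older than the window,
  // using the new item's timestamp as "now".
  push(item: T): void {
    this.items.push(item);
    this.evict(item.timestampMs);
  }

  // Returns a copy of the items still inside the window at `nowMs`.
  snapshot(nowMs: number): T[] {
    this.evict(nowMs);
    return [...this.items];
  }

  private evict(nowMs: number): void {
    const cutoff = nowMs - this.windowMs;
    for (;;) {
      const first = this.items[0];
      if (first === undefined || first.timestampMs >= cutoff) break;
      this.items.shift();
    }
  }
}

// Mirrors window_seconds: 60 from the template above.
const skewWindow = new SlidingWindow<{ timestampMs: number; durationNs: number }>(60 * 1000);
skewWindow.push({ timestampMs: 0, durationNs: 1200 });
skewWindow.push({ timestampMs: 30000, durationNs: 1100 });
skewWindow.push({ timestampMs: 61000, durationNs: 1300 });
```

A snapshot of this buffer could then be passed to the detection function on each evaluation tick, so that thresholds reflect only the most recent minute of samples.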
Quick Start Guide
- Install Agent: Deploy the eBPF observability agent binary to your inference nodes. Ensure kernel headers are available for eBPF compilation.
- Load Probes: Start the agent with the configuration template above. Verify that uprobes on libnccl are active and events are being captured.
- Stream Telemetry: Confirm OTLP connectivity to your collector. Check that events include cgroup_id and rank_id attributes.
- Query Skew: Use the MAD-based detection algorithm to query for slow ranks. Validate alerts by checking the cgroup_id against known tenants.
- Correlate Metrics: Overlay eBPF skew alerts with DCGM utilization metrics to identify hardware vs. software causes of latency.
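The validation in the "Query Skew" step can be sketched as a lookup against a tenant registry. Everything here is hypothetical: the `AlertLike` shape is a trimmed stand-in for a skew alert, and the registry is assumed to be a map from cgroup ID to tenant name that your platform maintains elsewhere.

```typescript
// Trimmed, hypothetical stand-in for a skew alert.
interface AlertLike {
  slowRankId: number;
  cgroupId: string;
}

interface AttributedAlert extends AlertLike {
  tenantName: string;
}

// Splits alerts into those whose cgroup ID maps to a known tenant
// and those that need manual investigation.
function attributeAlerts(
  alerts: AlertLike[],
  tenantsByCgroup: Map<string, string>
): { attributed: AttributedAlert[]; unattributed: AlertLike[] } {
  const attributed: AttributedAlert[] = [];
  const unattributed: AlertLike[] = [];
  for (const alert of alerts) {
    const tenantName = tenantsByCgroup.get(alert.cgroupId);
    if (tenantName === undefined) {
      unattributed.push(alert);
    } else {
      attributed.push({ ...alert, tenantName });
    }
  }
  return { attributed, unattributed };
}
```

Alerts that land in `unattributed` are worth a closer look, since they may indicate stale registry data or a process running outside the expected cgroup hierarchy.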