# What Inference-Platform Benchmark Posts Leave Out
## Beyond p90: Kernel-Side Observability for Multi-GPU Inference Clusters
### Current Situation Analysis
The industry standard for benchmarking large language model inference has converged on a narrow set of metrics: p90 Time-to-First-Token (TTFT) and aggregate throughput. Recent platform writeups for models like Kimi K2.5 (deployed on clusters of 8+ H100 GPUs) and Llama 4 Scout (running on dual H200 configurations) highlight these headline numbers to demonstrate performance gains. While these metrics are useful for high-level capacity planning, they obscure the operational realities of serving production inference workloads at scale.
The gap exists because platform benchmarks are designed for external consumption, whereas site reliability engineering (SRE) requires deep internal visibility. Host-level monitoring tools like NVIDIA DCGM provide essential hardware counters—GPU utilization, memory consumption, power draw, and thermal status—but they stop at the device boundary. They cannot see inside the collective communication libraries, attribute resource consumption to specific tenants in a multi-tenant environment, or isolate kernel-launch overhead from actual compute time.
This gap leaves three critical blind spots:
- Tail Latency Obscurity: p90 metrics mask the p99 and p99.9 distributions where user experience degrades. Tail latency is often driven by speculative decoding accept ratios dropping, PCIe contention, or kernel-launch spikes, none of which are visible in aggregate throughput graphs.
- Cross-Rank Skew: In tensor-parallel deployments, every forward pass concludes with an AllReduce barrier, so wall-clock time is dictated by the slowest rank. A single rank suffering from NUMA misalignment, thermal throttling, or a noisy neighbor degrades the entire cluster's serving rate by a proportional amount, yet DCGM will only show average utilization (a small worked example follows this list).
- Multi-Tenant Attribution: Production clusters host multiple tenants. When latency spikes, operators must determine if the cause is a specific GPU, a colocated tenant consuming host CPU, or network saturation. Host-level Prometheus metrics average across tenants, destroying the resolution needed for root-cause analysis.
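As a small worked example of the barrier effect described above (all timings are hypothetical, for illustration only):

```typescript
// Under a synchronous AllReduce barrier, step time is the max over ranks.
// Hypothetical forward-pass times (ms) for an 8-way tensor-parallel step:
const rankTimesMs = [41, 42, 40, 43, 41, 42, 58, 41]; // rank 6 is throttling

const stepTimeMs = Math.max(...rankTimesMs);                // 58 ms
const medianMs = [...rankTimesMs].sort((a, b) => a - b)[4]; // ≈42 ms

// One rank running ~38% slow drags the whole cluster down by ~38%,
// while average GPU utilization in DCGM barely moves.
console.log(
  `step=${stepTimeMs}ms median≈${medianMs}ms ` +
  `slowdown=${(((stepTimeMs / medianMs) - 1) * 100).toFixed(1)}%`
);
```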
### Key Findings
The transition from host-level polling to kernel-side eBPF instrumentation unlocks a class of signals that are otherwise inaccessible without modifying the inference workload or restarting processes. The following comparison illustrates the observability delta provided by eBPF uprobes and kprobes targeting libnccl and driver interfaces.
| Signal Category | DCGM / Host-Level Metrics | eBPF Kernel-Side Instrumentation |
|---|---|---|
| GPU Utilization & Memory | ✅ Available | ✅ Available (Redundant) |
| Per-Rank NCCL Collective Latency | ❌ Blind | ✅ uprobes on ncclAllReduce, ncclBroadcast |
| Kernel-Launch Overhead vs. Runtime | ❌ Blind | ✅ uprobes on cudaLaunchKernel in libcudart + GPU events |
| PCIe Transfer Cost by Cgroup | ❌ Blind | ✅ kprobes on driver IOCTLs + cgroup_id |
| Inter-Node TCP Retransmits by Rank | ❌ Blind | ✅ kprobes on tcp_retransmit_skb + rank env |
| Per-Tenant Resource Attribution | ❌ Averaged | ✅ cgroup-aware event tagging |
**Why this matters:** eBPF allows operators to capture per-rank latency histograms and detect stragglers in real time. By attaching uprobes to libnccl symbols, you can measure the exact duration of collective operations per rank. Combined with cgroup attribution, you can link kernel events back to specific tenant processes, enabling precise isolation of noisy neighbors and skew causes without the overhead of process-level profilers.
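As a concrete sketch of how those uprobe events become latency samples, the snippet below pairs entry/exit events per thread into durations. The `ProbeEvent` shape and its field names are assumptions for illustration, not a real agent's wire format.

```typescript
// Hypothetical raw events streamed from eBPF uprobes on libnccl symbols.
// Field names are illustrative; a real agent defines its own wire format.
interface ProbeEvent {
  kind: "entry" | "exit";
  pidTid: number; // thread identifier used to pair entry with exit
  rankId: number;
  clusterId: string;
  cgroupId: string;
  timestampNs: number;
}

interface DurationSample {
  rankId: number;
  clusterId: string;
  cgroupId: string;
  durationNs: number;
}

// Pair each collective entry with the next exit on the same thread.
export function pairCollectiveEvents(events: ProbeEvent[]): DurationSample[] {
  const inFlight = new Map<number, ProbeEvent>();
  const samples: DurationSample[] = [];
  for (const ev of events) {
    if (ev.kind === "entry") {
      inFlight.set(ev.pidTid, ev);
    } else {
      const start = inFlight.get(ev.pidTid);
      if (start) {
        samples.push({
          rankId: start.rankId,
          clusterId: start.clusterId,
          cgroupId: start.cgroupId,
          durationNs: ev.timestampNs - start.timestampNs,
        });
        inFlight.delete(ev.pidTid);
      }
    }
  }
  return samples;
}
```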
### Core Solution
Implementing kernel-side observability requires a strategy that captures signals at the right abstraction layers, attributes them correctly, and aggregates them using robust statistical methods.
#### Architecture Decisions
- Non-Invasive Instrumentation: Use eBPF uprobes on `libnccl` symbols rather than the NCCL profiler. The NCCL profiler requires linking against a shared object at process start and emitting to a configured target, which adds friction and overhead. eBPF uprobes attach dynamically, require no workload modification, and stream data directly to user space.
- Cgroup Attribution: Multi-tenancy demands that every kernel event be tagged with the cgroup ID of the originating process. This requires extracting the `cgroup_id` from the kernel task struct during probe execution and attaching it to the telemetry payload.
- Robust Aggregation: Use Median Absolute Deviation (MAD) for skew detection rather than standard deviation. MAD is resistant to outliers, making it suitable for latency distributions where occasional spikes should not skew the threshold calculation.
- OTLP Emission: Standardize on OpenTelemetry Protocol (OTLP) for telemetry transport. This ensures compatibility with existing observability stacks (Prometheus, Grafana, Datadog) and enables cross-node correlation (a minimal emission sketch follows this list).
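A minimal emission sketch using the OpenTelemetry JavaScript API, assuming the SDK has already been initialized elsewhere with an OTLP metric exporter; the metric and attribute names are illustrative, not a fixed schema:

```typescript
import { metrics } from "@opentelemetry/api";

// Assumes the OTel SDK was initialized with an OTLP exporter
// (e.g. pointed at otel-collector:4317, as in the config template below).
const meter = metrics.getMeter("inference-observer");

const allreduceDuration = meter.createHistogram("nccl.allreduce.duration", {
  unit: "ns",
  description: "Per-rank ncclAllReduce duration measured via eBPF uprobes",
});

export function emitSample(sample: {
  rankId: number;
  clusterId: string;
  cgroupId: string;
  durationNs: number;
  tenantId?: string;
}): void {
  // Attribute keys are illustrative; pick names that match your schema.
  allreduceDuration.record(sample.durationNs, {
    "rank.id": sample.rankId,
    "cluster.id": sample.clusterId,
    "cgroup.id": sample.cgroupId,
    ...(sample.tenantId !== undefined ? { "tenant.id": sample.tenantId } : {}),
  });
}
```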
#### Implementation: Skew Detection Algorithm
The following TypeScript example demonstrates a skew detection utility that processes per-rank NCCL latency samples. It calculates a MAD-based threshold to identify slow ranks in real time.
```typescript
interface RankLatencySample {
  rankId: number;
  clusterId: string;
  timestamp: number;
  durationNs: number;
  cgroupId: string;
  tenantId?: string;
}

interface SkewAlert {
  clusterId: string;
  slowRankId: number;
  medianLatencyNs: number;
  madValue: number;
  thresholdNs: number;
  slowRankLatencyNs: number;
  cgroupId: string;
  tenantId?: string;
}

/**
 * Detects slow ranks using Median Absolute Deviation (MAD).
 * MAD is robust against outliers, making it ideal for latency distributions.
 */
export function detectRankSkew(
  samples: RankLatencySample[],
  madMultiplier: number = 3.0
): SkewAlert[] {
  const alerts: SkewAlert[] = [];

  // Group samples by cluster
  const clusters = new Map<string, RankLatencySample[]>();
  for (const sample of samples) {
    const clusterSamples = clusters.get(sample.clusterId) ?? [];
    clusterSamples.push(sample);
    clusters.set(sample.clusterId, clusterSamples);
  }

  for (const [clusterId, clusterSamples] of clusters.entries()) {
    // Group by rank within cluster
    const ranks = new Map<number, RankLatencySample[]>();
    for (const sample of clusterSamples) {
      const rankSamples = ranks.get(sample.rankId) ?? [];
      rankSamples.push(sample);
      ranks.set(sample.rankId, rankSamples);
    }

    // Calculate global median and MAD for the cluster
    const allDurations = clusterSamples.map(s => s.durationNs).sort((a, b) => a - b);
    const median = getMedian(allDurations);
    // Deviations must be re-sorted before taking their median
    const deviations = allDurations.map(d => Math.abs(d - median)).sort((a, b) => a - b);
    const mad = getMedian(deviations);
    const threshold = median + madMultiplier * mad;

    // Flag any rank whose median latency exceeds the cluster threshold
    for (const [rankId, rankSamples] of ranks.entries()) {
      const rankMedian = getMedian(
        rankSamples.map(s => s.durationNs).sort((a, b) => a - b)
      );
      if (rankMedian > threshold) {
        const slowSample = rankSamples[rankSamples.length - 1];
        alerts.push({
          clusterId,
          slowRankId: rankId,
          medianLatencyNs: median,
          madValue: mad,
          thresholdNs: threshold,
          slowRankLatencyNs: rankMedian,
          cgroupId: slowSample.cgroupId,
          tenantId: slowSample.tenantId,
        });
      }
    }
  }
  return alerts;
}

// Assumes input is sorted in ascending order.
function getMedian(sortedValues: number[]): number {
  const mid = Math.floor(sortedValues.length / 2);
  return sortedValues.length % 2 !== 0
    ? sortedValues[mid]
    : (sortedValues[mid - 1] + sortedValues[mid]) / 2;
}
```
#### Rationale
* **MAD vs. Standard Deviation:** Latency distributions in inference clusters are often non-Gaussian due to stragglers and system noise. MAD provides a more stable baseline for thresholding.
* **Cgroup Tagging:** Including `cgroupId` in the alert payload allows operators to immediately correlate a slow rank with a specific tenant or workload, accelerating remediation.
* **Real-Time Processing:** This algorithm can run within an OpenTelemetry Collector extension or a sidecar agent, enabling immediate alerting without batch processing delays; a minimal windowed driver is sketched below.
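One way to wire that up is a tumbling window feeding `detectRankSkew` on the same 60-second cadence as the `window_seconds` setting in the configuration template below (a minimal sketch; the alert sink is left as a `console.warn`):

```typescript
// Minimal tumbling-window driver around detectRankSkew.
// Assumes RankLatencySample and detectRankSkew from the example above.
const WINDOW_MS = 60_000; // matches window_seconds: 60 in the config template
let buffer: RankLatencySample[] = [];

export function ingest(sample: RankLatencySample): void {
  buffer.push(sample);
}

setInterval(() => {
  const alerts = detectRankSkew(buffer, 3.0);
  buffer = []; // start the next tumbling window
  for (const alert of alerts) {
    // A production agent would emit an OTLP event here instead.
    console.warn(
      `slow rank ${alert.slowRankId} in ${alert.clusterId}: ` +
      `${alert.slowRankLatencyNs}ns > threshold ${alert.thresholdNs}ns ` +
      `(cgroup ${alert.cgroupId})`
    );
  }
}, WINDOW_MS);
```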
### Pitfall Guide
| Pitfall | Explanation | Fix |
| :--- | :--- | :--- |
| **Ignoring p99 Distributions** | Relying on p90 masks tail latency issues that impact user experience; p99/p99.9 are critical for reliability SLAs. | Configure dashboards and alerts to track p99 and p99.9 latency distributions (a percentile helper is sketched after this table). Investigate tail causes such as speculative decoding accept ratios. |
| **NUMA Misalignment** | Processes bound to the wrong NUMA node can suffer increased memory latency, causing cross-rank skew. | Use `numactl` or Kubernetes topology hints to pin inference processes to the correct NUMA node. Verify with `hwloc`. |
| **eBPF Overhead** | Excessive probing or inefficient map usage can introduce CPU overhead, impacting inference performance. | Use sampling strategies for high-frequency events. Prefer per-CPU maps (`BPF_MAP_TYPE_PERCPU_HASH`, `BPF_MAP_TYPE_PERCPU_ARRAY`) to avoid contention. Monitor the agent's own CPU usage. |
| **NCCL Profiler Dependency** | Relying on NCCL's built-in profiler requires process restarts and configuration changes, hindering agility. | Prefer eBPF uprobes for dynamic, non-invasive instrumentation. Reserve NCCL profiler for deep-dive debugging sessions. |
| **Averaging Multi-Tenant Metrics** | Aggregating metrics across tenants hides noisy neighbor effects and attribution issues. | Implement cgroup-aware telemetry. Tag all metrics with `cgroup_id` and `tenant_id` for granular analysis. |
| **Speculative Decoding Masking Skew** | High speculative decoding acceptance can mask underlying latency issues, giving false confidence. | Correlate speculative decoding accept ratios with latency metrics. Monitor both to ensure gains are genuine. |
| **TCP Retransmits Hidden by NCCL** | Network issues causing TCP retransmits may be obscured by NCCL's internal retry mechanisms. | Use eBPF kprobes on `tcp_retransmit_skb` to capture retransmit events. Attribute retransmits to specific ranks and nodes. |
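For the tail-latency row above, a nearest-rank percentile helper is enough to compute p99 and p99.9 over a window of eBPF duration samples. This is a minimal sketch; at production volumes a streaming sketch such as t-digest avoids buffering raw samples.

```typescript
// Nearest-rank percentile over a buffered window of latency samples.
export function percentile(durationsNs: number[], p: number): number {
  if (durationsNs.length === 0) throw new Error("empty sample window");
  const sorted = [...durationsNs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Toy example; a real 60 s window holds thousands of samples per rank.
const windowNs = [1_050_000, 1_060_000, 1_080_000, 1_100_000, 1_120_000, 9_400_000];
console.log(`p90=${percentile(windowNs, 90)}ns p99.9=${percentile(windowNs, 99.9)}ns`);
```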
### Production Bundle
#### Action Checklist
- [ ] **Deploy eBPF Agent:** Install an eBPF-based observability agent configured with uprobes on `libnccl` symbols (`ncclAllReduce`, `ncclBroadcast`, `ncclAllGather`).
- [ ] **Enable Cgroup Attribution:** Configure the agent to extract `cgroup_id` from the kernel task struct and attach it to all telemetry events for tenant isolation.
- [ ] **Configure MAD Thresholds:** Set up skew detection using Median Absolute Deviation (MAD) with a multiplier of 3.0 to identify slow ranks robustly.
- [ ] **Integrate OTLP Pipeline:** Ensure telemetry is emitted via OTLP to your central observability platform. Verify correlation with DCGM metrics.
- [ ] **Validate with Straggler Injection:** Test the observability stack by injecting artificial latency on a single rank (see the validation sketch after this checklist). Confirm that alerts trigger and attribution is correct.
- [ ] **Monitor Kernel-Launch Overhead:** Add uprobes on `cudaLaunchKernel` in the CUDA runtime library (a userspace symbol, so kprobes cannot reach it) to measure launch overhead vs. runtime. Alert on overhead spikes.
- [ ] **Review Tail Latency:** Set up dashboards for p99 and p99.9 latency. Investigate correlations with speculative decoding accept ratios and PCIe contention.
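A synthetic version of the straggler-injection check, reusing `RankLatencySample` and `detectRankSkew` from the implementation section (in a live validation you would inject real delay instead, e.g. with `tc netem` on one rank's link, and watch the same alert fire):

```typescript
// Synthetic straggler injection: 8 ranks, rank 5 runs ~50% slow.
const now = Date.now();
const samples: RankLatencySample[] = [];
for (let rank = 0; rank < 8; rank++) {
  for (let i = 0; i < 100; i++) {
    const baseNs = 2_000_000 + Math.random() * 100_000; // ~2 ms AllReduce
    samples.push({
      rankId: rank,
      clusterId: "cluster-a",
      timestamp: now + i,
      durationNs: rank === 5 ? baseNs * 1.5 : baseNs, // injected delay
      cgroupId: rank === 5 ? "cg-tenant-b" : "cg-tenant-a",
    });
  }
}

const alerts = detectRankSkew(samples);
console.assert(alerts.length === 1 && alerts[0].slowRankId === 5, "straggler detected");
console.assert(alerts[0]?.cgroupId === "cg-tenant-b", "attribution survives the pipeline");
```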
#### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
| :--- | :--- | :--- | :--- |
| **Single-Tenant, Small Cluster** | DCGM + NCCL Profiler | Simpler setup; sufficient for isolated workloads. | Low |
| **Multi-Tenant, Large Cluster** | eBPF + OTLP + MAD | Required for tenant attribution and skew detection at scale. | Medium |
| **Debugging Tail Latency** | eBPF + Kernel Probes | Deep visibility into kernel-launch overhead and TCP retransmits. | High Effort |
| **Compliance/Audit** | eBPF + Cgroup Tagging | Precise resource attribution per tenant for billing/compliance. | Medium |
#### Configuration Template
Below is a YAML configuration template for an eBPF observability agent targeting NCCL collectives and cgroup attribution.
```yaml
agent:
name: "inference-observer"
version: "1.0.0"
probes:
- name: "nccl_allreduce"
type: "uprobe"
target: "libnccl.so"
symbol: "ncclAllReduce"
events: ["entry", "exit"]
attributes:
- "rank_id"
- "cgroup_id"
- "timestamp_ns"
- name: "tcp_retransmit"
type: "kprobe"
target: "tcp_retransmit_skb"
events: ["entry"]
attributes:
- "rank_id"
- "cgroup_id"
- "retransmit_count"
- name: "cuda_launch"
type: "kfunc"
target: "cudaLaunchKernel"
events: ["entry", "exit"]
attributes:
- "kernel_name"
- "launch_overhead_ns"
aggregation:
skew_detection:
method: "MAD"
multiplier: 3.0
window_seconds: 60
output:
protocol: "OTLP"
endpoint: "otel-collector:4317"
batch_size: 100
flush_interval_ms: 500
```
#### Quick Start Guide
1. **Install Agent:** Deploy the eBPF observability agent binary to your inference nodes. Ensure kernel headers are available for eBPF compilation.
2. **Load Probes:** Start the agent with the configuration template above. Verify that uprobes on `libnccl` are active and events are being captured.
3. **Stream Telemetry:** Confirm OTLP connectivity to your collector. Check that events include `cgroup_id` and `rank_id` attributes.
4. **Query Skew:** Use the MAD-based detection algorithm to query for slow ranks. Validate alerts by checking the `cgroup_id` against known tenants.
5. **Correlate Metrics:** Overlay eBPF skew alerts with DCGM utilization metrics to identify hardware vs. software causes of latency (a correlation sketch follows these steps).
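To make the correlation step concrete, the sketch below joins a skew alert against DCGM-side utilization samples by GPU and time window. The `DcgmSample` shape and the rank-to-GPU mapping are assumptions; a real deployment would query dcgm-exporter metrics from Prometheus instead.

```typescript
// Hypothetical DCGM-side sample, e.g. scraped from dcgm-exporter.
interface DcgmSample {
  gpuId: number;
  timestampMs: number;
  smUtilizationPct: number;
}

// Assumes rank r is pinned to GPU r on the node (deployment-specific!).
export function explainSkew(
  alert: { slowRankId: number; slowRankLatencyNs: number; thresholdNs: number },
  dcgm: DcgmSample[],
  window: { startMs: number; endMs: number }
): string {
  const gpuSamples = dcgm.filter(
    s =>
      s.gpuId === alert.slowRankId &&
      s.timestampMs >= window.startMs &&
      s.timestampMs <= window.endMs
  );
  const avgSm =
    gpuSamples.reduce((sum, s) => sum + s.smUtilizationPct, 0) /
    (gpuSamples.length || 1);
  // A rank that is slow but idle on the GPU points away from compute and
  // toward host/network causes (PCIe contention, retransmits, noisy neighbor).
  return avgSm < 50
    ? "GPU mostly idle during spike: suspect host/network (PCIe, tcp_retransmit_skb)"
    : "GPU busy during spike: suspect thermal throttling or kernel-launch overhead";
}
```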
