with precise off-CPU events, futex contention, or kernel scheduler preemptions.
Core Solution
Building a dual-layer diagnostic system requires three distinct components: kernel event capture, event-shaped data exposure, and agent orchestration. The architecture deliberately separates the "what" (application telemetry) from the "why" (kernel mechanics) to maintain clean boundaries, enforce security, and preserve performance.
Step 1: Kernel Event Capture Strategy
Deploy eBPF programs that target specific execution boundaries. Avoid blanket syscall tracing. Instead, attach uprobes to GPU runtime libraries and tracepoints to the Linux scheduler and block layer. This captures:
- CUDA kernel launch latency and synchronization waits
- Thread off-CPU time and scheduler preemption events
- Cross-process futex contention and I/O blocking
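To make the off-CPU bullet concrete, the sketch below pairs scheduler switch-out and switch-in records to recover how long each thread spent off the CPU. The `SchedSwitchRecord` shape and the `deriveOffCpuIntervals` helper are hypothetical names chosen for illustration; a real collector would more likely do this pairing inside the eBPF program with a map keyed by TID, and would carry 64-bit timestamps rather than JavaScript numbers.

```typescript
// Hypothetical, simplified view of one sched:sched_switch record.
interface SchedSwitchRecord {
  tsNs: number;    // monotonic timestamp in nanoseconds (assumed to fit in a number here)
  prevTid: number; // thread being switched out
  nextTid: number; // thread being switched in
}

interface OffCpuInterval {
  tid: number;
  startNs: number;
  durationNs: number;
}

// Pairs each switch-out of a thread with its next switch-in.
// Assumes records are sorted by timestamp and come from a single host.
function deriveOffCpuIntervals(records: SchedSwitchRecord[]): OffCpuInterval[] {
  const switchedOutAt = new Map<number, number>();
  const intervals: OffCpuInterval[] = [];
  for (const r of records) {
    // nextTid is back on a CPU: close its open interval, if one exists.
    const start = switchedOutAt.get(r.nextTid);
    if (start !== undefined) {
      intervals.push({ tid: r.nextTid, startNs: start, durationNs: r.tsNs - start });
      switchedOutAt.delete(r.nextTid);
    }
    // prevTid leaves the CPU; TID 0 is the idle task and is skipped.
    if (r.prevTid !== 0) {
      switchedOutAt.set(r.prevTid, r.tsNs);
    }
  }
  return intervals;
}
```

Filtering the resulting intervals by a minimum duration is what keeps the event volume manageable before anything is written to storage.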
Step 2: Event Streaming & Storage
Stream captured events into a columnar or time-series store optimized for SQL-like aggregation. The schema must remain event-shaped, not metric-shaped. Pre-bucketed percentiles obscure tail latency and causal chains. Store raw event timestamps, PIDs, TIDs, stack traces, and correlation IDs.
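A small sketch shows what pre-bucketing loses. The record shape and field names (`RawOffCpuRecord`, `tsMs`, `durationMs`, `correlationId`) are illustrative assumptions, not a schema the pipeline prescribes. One hundred events fall in the same one-minute bucket; a single 901 ms stall disappears into a 10 ms mean, while the event-shaped view can still name the request that stalled.

```typescript
// Hypothetical raw event record; field names are assumptions for this sketch.
interface RawOffCpuRecord {
  tsMs: number;          // event start, milliseconds
  pid: number;
  tid: number;
  durationMs: number;
  correlationId: string;
}

// Metric-shaped view: one mean per fixed-width time bucket.
function bucketMeans(events: RawOffCpuRecord[], bucketMs: number): Map<number, number> {
  const sums = new Map<number, { total: number; count: number }>();
  for (const e of events) {
    const bucket = Math.floor(e.tsMs / bucketMs) * bucketMs;
    const acc = sums.get(bucket) ?? { total: 0, count: 0 };
    acc.total += e.durationMs;
    acc.count += 1;
    sums.set(bucket, acc);
  }
  const means = new Map<number, number>();
  sums.forEach((acc, bucket) => means.set(bucket, acc.total / acc.count));
  return means;
}

// Event-shaped view: the worst event keeps its TID and correlation ID.
function worstEvent(events: RawOffCpuRecord[]): RawOffCpuRecord | undefined {
  let worst: RawOffCpuRecord | undefined;
  for (const e of events) {
    if (worst === undefined || e.durationMs > worst.durationMs) worst = e;
  }
  return worst;
}

// 99 short waits plus one long stall, all inside the first minute.
const sample: RawOffCpuRecord[] = [];
for (let i = 0; i < 99; i++) {
  sample.push({ tsMs: i * 100, pid: 4242, tid: 4242, durationMs: 1, correlationId: `req-${i}` });
}
sample.push({ tsMs: 9900, pid: 4242, tid: 4243, durationMs: 901, correlationId: "req-stall" });

const means = bucketMeans(sample, 60_000); // bucket 0 has a mean of 10 ms
const worst = worstEvent(sample);          // the 901 ms event, tagged "req-stall"
```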
Step 3: MCP Server Implementation
Expose the event store through a read-only MCP server. Tools should return causal chains, not just counts. The following TypeScript implementation demonstrates a production-ready structure:
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

// Minimal type shapes this example assumes; adapt them to your event store.
interface OffCpuEvent {
  duration_ms: number;
  blocking_pid: number;
  blocking_comm: string;
}

interface CausalChain {
  severity: string;
  mechanism: string;
  source_process: string;
  stack?: string[];
}

interface EventStoreAdapter {
  fetchOffCpuEvents(query: {
    pid: number;
    window: { start: string; end: string };
    threshold: number;
  }): Promise<OffCpuEvent[]>;
  buildCausalChain(query: {
    pid: number;
    windowSeconds: number;
    attachStacks: boolean;
  }): Promise<CausalChain>;
}

export class KernelDiagnosticMCP {
  private server: McpServer;
  private eventStore: EventStoreAdapter;

  constructor(store: EventStoreAdapter) {
    this.eventStore = store;
    // Server identity goes in the first argument; capabilities go in the options.
    this.server = new McpServer(
      { name: "kernel-diagnostic-bridge", version: "1.0.0" },
      { capabilities: { tools: {} } }
    );
    this.registerTools();
  }

  private registerTools(): void {
    // Read-only tool: summarizes off-CPU time for one process in a time window.
    this.server.tool(
      "query_off_cpu_events",
      "Retrieves scheduler preemption and off-CPU events for a specific process window",
      {
        pid: z.number().int().positive(),
        start_ts: z.string().datetime(),
        end_ts: z.string().datetime(),
        min_duration_ms: z.number().min(0).default(100)
      },
      async ({ pid, start_ts, end_ts, min_duration_ms }) => {
        const events = await this.eventStore.fetchOffCpuEvents({
          pid,
          window: { start: start_ts, end: end_ts },
          threshold: min_duration_ms
        });
        return {
          content: [{
            type: "text",
            text: JSON.stringify({
              total_events: events.length,
              cumulative_off_cpu_ms: events.reduce((sum, e) => sum + e.duration_ms, 0),
              top_contenders: this.extractTopContenders(events)
            })
          }]
        };
      }
    );

    // Read-only tool: maps a latency spike to the kernel mechanism behind it.
    this.server.tool(
      "resolve_causal_chain",
      "Maps a latency spike to kernel-level blocking events and stack traces",
      {
        target_pid: z.number().int().positive(),
        latency_window_s: z.number().min(1).max(300),
        include_stacks: z.boolean().default(true)
      },
      async ({ target_pid, latency_window_s, include_stacks }) => {
        const chain = await this.eventStore.buildCausalChain({
          pid: target_pid,
          windowSeconds: latency_window_s,
          attachStacks: include_stacks
        });
        return {
          content: [{
            type: "text",
            text: JSON.stringify({
              severity: chain.severity,
              blocking_mechanism: chain.mechanism,
              contention_source: chain.source_process,
              stack_trace: include_stacks ? (chain.stack ?? null) : null,
              recommended_action: this.deriveRemediation(chain)
            })
          }]
        };
      }
    );
  }

  // Ranks blocking processes by the total off-CPU time they caused; returns the top three.
  private extractTopContenders(events: OffCpuEvent[]): string[] {
    const contentionMap = new Map<string, number>();
    events.forEach(e => {
      const key = `${e.blocking_pid}:${e.blocking_comm}`;
      contentionMap.set(key, (contentionMap.get(key) ?? 0) + e.duration_ms);
    });
    return Array.from(contentionMap.entries())
      .sort((a, b) => b[1] - a[1])
      .slice(0, 3)
      .map(([key]) => key);
  }

  // Returns advisory text only; the server never applies a fix itself.
  private deriveRemediation(chain: CausalChain): string {
    if (chain.mechanism.includes("futex_wait")) {
      return "Isolate competing threads via cgroup cpuset or pin critical paths to dedicated cores.";
    }
    if (chain.mechanism.includes("sched_switch")) {
      return "Reduce host-level CPU contention; consider workload co-location adjustments.";
    }
    return "Review kernel scheduler tuning and verify hardware interrupt affinity.";
  }

  public getServer(): McpServer {
    return this.server;
  }
}
Step 4: Agent Orchestration Flow
The agent receives a symptom (e.g., TTFT spike) and routes calls sequentially:
- Query application MCP for metrics/logs to establish the time window.
- Query kernel MCP for off-CPU events and causal chains within that window.
- Synthesize findings into a root-cause statement with actionable remediation.
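The three steps above can be sketched as a single sequential function. The client interfaces here (`AppTelemetryClient`, `KernelDiagnosticClient`) and their method names are hypothetical stand-ins for whatever MCP client wrappers the agent runtime provides; only the ordering (application window first, kernel query second, synthesis last) comes from the flow described above.

```typescript
interface TimeWindow {
  start: string; // ISO 8601 timestamp
  end: string;
}

interface KernelFinding {
  mechanism: string;
  contentionSource: string;
  recommendedAction: string;
}

// Hypothetical wrappers around the two MCP servers.
interface AppTelemetryClient {
  findSpikeWindow(symptom: string): Promise<TimeWindow | null>;
}

interface KernelDiagnosticClient {
  resolveCausalChain(pid: number, spike: TimeWindow): Promise<KernelFinding>;
}

async function diagnose(
  symptom: string,
  pid: number,
  app: AppTelemetryClient,
  kernel: KernelDiagnosticClient
): Promise<string> {
  // Step 1: the application layer establishes the time window.
  const spike = await app.findSpikeWindow(symptom);
  if (spike === null) {
    return `No ${symptom} spike found in application telemetry.`;
  }
  // Step 2: the kernel layer explains what happened inside that window.
  const finding = await kernel.resolveCausalChain(pid, spike);
  // Step 3: synthesize a root-cause statement with a suggested action.
  return (
    `${symptom} spike between ${spike.start} and ${spike.end}: ` +
    `${finding.mechanism} caused by ${finding.contentionSource}. ` +
    `Suggested action: ${finding.recommendedAction}`
  );
}
```

Keeping the kernel query conditional on the application window avoids scanning host-level events when the symptom cannot be located in time.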
Architecture Rationale
- Read-Only Enforcement: Kernel-layer tools must never expose write capabilities. Agents investigate; human operators or separate automation pipelines execute fixes.
- Event-Shaped Schema: Pre-aggregated metrics lose causal context. Storing raw events allows the agent to reconstruct execution paths dynamically.
- Per-Host Boundary: eBPF data is inherently host-local. The MCP server should operate per-node, with cluster-wide visibility achieved through fan-out aggregation, not centralized indexing.
- Fixed Overhead Targeting: Probes attach only to high-signal libraries and tracepoints. This maintains sub-2% CPU overhead regardless of process count, making it safe for production inference clusters.
Pitfall Guide
1. Metric-Bucket Schema Contamination
Explanation: Developers often force eBPF event data into pre-bucketed time series formats (e.g., 1-minute aggregates). This destroys tail latency visibility and breaks causal chain reconstruction.
Fix: Store raw events with millisecond precision. Use SQL aggregation tools within the MCP server to generate summaries on-demand, preserving the underlying event stream.
2. Over-Probing the Kernel
Explanation: Attaching to every syscall or tracepoint creates massive data volume, triggering storage bottlenecks and pushing overhead past safe thresholds.
Fix: Target specific uprobes (libcudart.so, libcuda.so) and high-signal tracepoints (sched:sched_switch, block:block_rq_issue). Use PID/TID filters to scope probes to relevant workloads.
3. Cluster-Wide Indexing Assumption
Explanation: Teams attempt to centralize eBPF data into a single distributed index. This introduces network serialization costs, breaks per-host isolation, and complicates security boundaries.
Fix: Design the MCP server as a per-host daemon. Implement a lightweight fan-out orchestrator that aggregates per-node responses only when the agent requests cluster-wide context.
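A fan-out orchestrator of this kind can be quite small. The sketch below is one possible shape, assuming a caller-supplied `queryNode` function that talks to a single per-host MCP server; `fanOut` and `NodeResult` are hypothetical names. Each node gets its own timeout, and a failing node yields an error entry instead of failing the whole request.

```typescript
interface NodeResult<T> {
  node: string;
  ok: boolean;
  data?: T;
  error?: string;
}

// Queries every node in parallel; failures and timeouts are reported per node.
async function fanOut<T>(
  nodes: string[],
  queryNode: (node: string) => Promise<T>,
  timeoutMs: number
): Promise<NodeResult<T>[]> {
  const queryOne = (node: string): Promise<NodeResult<T>> => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<NodeResult<T>>((resolve) => {
      timer = setTimeout(() => resolve({ node, ok: false, error: "timeout" }), timeoutMs);
    });
    const attempt = queryNode(node).then(
      (data): NodeResult<T> => ({ node, ok: true, data }),
      (err: unknown): NodeResult<T> => ({ node, ok: false, error: String(err) })
    );
    return Promise.race([attempt, timeout]).then((result) => {
      // Clear the timer so a fast response does not leave it pending.
      if (timer !== undefined) clearTimeout(timer);
      return result;
    });
  };
  return Promise.all(nodes.map(queryOne));
}
```

Because partial results are returned, the agent can still reason about the reachable hosts and report which nodes were missing from the answer.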
4. Ignoring CPU Pinning & Cgroup Isolation
Explanation: Monitoring threads, eBPF collectors, and agent runtimes compete with production workloads for CPU cycles. This creates artificial contention that the agent misdiagnoses as application issues.
Fix: Pin critical monitoring processes to dedicated cores using taskset or cgroup cpuset. Reserve isolated CPU sets for inference engines and diagnostic tooling.
5. Exposing Write Operations to Agents
Explanation: Granting agents remediation capabilities (e.g., kill_process, modify_cgroup) through the kernel MCP layer violates the principle of least privilege and introduces blast radius risks.
Fix: Enforce strict read-only contracts on kernel diagnostic tools. Separate investigation MCP servers from execution/automation pipelines. Require human approval or policy-gated automation for state changes.
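One way to make the read-only contract hard to bypass is to gate tool registration on an explicit allowlist. The sketch below is an assumption about how that could look, not part of any MCP SDK: `ReadOnlyToolRegistry` is a hypothetical wrapper, and the allowlisted names are the three tools listed in the configuration template.

```typescript
// Exact tool names permitted on the kernel diagnostic server.
const READ_ONLY_TOOLS = new Set<string>([
  "query_off_cpu_events",
  "resolve_causal_chain",
  "analyze_gpu_launch_latency",
]);

type ToolHandler = (args: Record<string, unknown>) => Promise<string>;

// Hypothetical registry that refuses any tool not on the allowlist.
class ReadOnlyToolRegistry {
  private handlers = new Map<string, ToolHandler>();

  register(name: string, handler: ToolHandler): void {
    if (!READ_ONLY_TOOLS.has(name)) {
      throw new Error(`Refusing to register non-allowlisted tool: ${name}`);
    }
    this.handlers.set(name, handler);
  }

  has(name: string): boolean {
    return this.handlers.has(name);
  }
}
```

An exact-name allowlist fails closed: a newly added tool such as `kill_process` is rejected at startup until someone deliberately reviews and lists it.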
6. Latency Blindness in Aggregation
Explanation: Relying on p50 or p90 metrics masks the actual bottleneck. GPU inference stalls and scheduler preemptions manifest in the p99/p999 tail.
Fix: Always expose p99 and p999 latency distributions alongside causal chain data. Configure the MCP server to return tail-event samples by default, not just aggregate counts.
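The effect is easy to reproduce with a nearest-rank percentile over raw durations. In the sketch below (function names are illustrative), 99 waits of 1 ms and one stall of 901 ms produce a p50 and a p99 of 1 ms; only p99.9 and the tail samples reveal the stall.

```typescript
// Nearest-rank percentile over an ascending-sorted array.
function nearestRank(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p * sortedAsc.length) / 100);
  const clamped = Math.min(Math.max(rank, 1), sortedAsc.length);
  return sortedAsc[clamped - 1];
}

interface TailSummary {
  p50: number;
  p99: number;
  p999: number;
  tailSamples: number[]; // longest durations first
}

// Summarizes a set of durations and keeps the longest few as raw samples.
function summarizeTail(durationsMs: number[], sampleCount: number): TailSummary {
  const sorted = durationsMs.slice().sort((a, b) => a - b);
  return {
    p50: nearestRank(sorted, 50),
    p99: nearestRank(sorted, 99),
    p999: nearestRank(sorted, 99.9),
    tailSamples: sampleCount > 0 ? sorted.slice(-sampleCount).reverse() : [],
  };
}

const durations: number[] = [];
for (let i = 0; i < 99; i++) durations.push(1);
durations.push(901);

const tail = summarizeTail(durations, 2);
// tail.p50 is 1, tail.p99 is 1, tail.p999 is 901, tail.tailSamples is [901, 1]
```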
7. Cross-Layer Schema Mismatch
Explanation: Application MCP servers return metric-shaped data (timestamps, values, labels). Kernel MCP servers return event-shaped data (PIDs, stacks, durations). Agents struggle to correlate them without explicit mapping.
Fix: Implement a correlation layer that maps application request IDs to kernel thread IDs and process boundaries. Include trace_id or span_id in eBPF uprobes where framework support exists.
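A minimal version of such a correlation layer is a join on thread ID and time overlap. The sketch below assumes the application layer can report which thread served each request and that both layers share a clock; `RequestSpan`, `KernelBlockEvent`, and `correlate` are hypothetical names. It does not handle thread pools that interleave several requests on one thread, which would need the trace_id carried in the probe itself.

```typescript
// Application side: which thread handled which request, and when.
interface RequestSpan {
  requestId: string;
  tid: number;
  startMs: number;
  endMs: number;
}

// Kernel side: a blocking event observed on a thread.
interface KernelBlockEvent {
  tid: number;
  tsMs: number;
  durationMs: number;
  mechanism: string;
}

// Attaches to each request the kernel events that started on its thread
// while the request was in flight.
function correlate(
  spans: RequestSpan[],
  events: KernelBlockEvent[]
): Map<string, KernelBlockEvent[]> {
  const byRequest = new Map<string, KernelBlockEvent[]>();
  for (const span of spans) {
    const hits = events.filter(
      (e) => e.tid === span.tid && e.tsMs >= span.startMs && e.tsMs <= span.endMs
    );
    byRequest.set(span.requestId, hits);
  }
  return byRequest;
}
```

With this mapping in place, the agent can move from a slow request ID in application logs directly to the futex or scheduler events recorded for the thread that served it.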
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Microservice latency spike | Standard App-Layer MCP + Distributed Tracing | Application boundaries are clear; SDK instrumentation is sufficient | Low (existing observability stack) |
| GPU inference TTFT stall | Kernel-Enhanced eBPF MCP + Application MCP | Bottleneck lives in scheduler preemption or CUDA sync waits; app telemetry is blind | Medium (eBPF deployment + per-host storage) |
| Network retransmit storm | eBPF tcp:tcp_retransmit_skb tracepoints + MCP | Kernel sees packet drops before application logs; causal chain requires socket-level context | Medium (targeted probe deployment) |
| Multi-tenant CPU contention | eBPF sched_switch + cgroup cpuset isolation | Cross-process contention requires kernel scheduler visibility; app metrics only show aggregate CPU | High (requires host-level policy enforcement) |
| Compliance/audit investigation | Standard App-Layer MCP with immutable logs | Audit trails require application-level context; kernel data is noisy and irrelevant | Low (leverages existing SIEM/audit pipelines) |
Configuration Template
# kernel-mcp-config.yaml
server:
  name: "kernel-diagnostic-bridge"
  version: "1.0.0"
  transport: "stdio"
  read_only: true
probes:
  uprobes:
    - library: "libcudart.so"
      symbol: "cudaLaunchKernel"
      attach: "entry"
    - library: "libcudart.so"
      symbol: "cudaDeviceSynchronize"
      attach: "entry"
  tracepoints:
    - name: "sched:sched_switch"
      filter: "prev_pid != 0 && next_pid != 0"
    - name: "block:block_rq_issue"
      filter: "rwbs ~ '.*W.*'"
storage:
  backend: "columnar"
  retention_hours: 72
  precision_ms: 1
  max_events_per_second: 50000
tools:
  - name: "query_off_cpu_events"
    description: "Retrieves scheduler preemption events"
    schema: "event"
  - name: "resolve_causal_chain"
    description: "Maps latency to kernel blocking mechanics"
    schema: "event"
  - name: "analyze_gpu_launch_latency"
    description: "Returns CUDA kernel launch distribution"
    schema: "event"
security:
  write_access: false
  pid_filtering: true
  rate_limit_per_second: 100
Quick Start Guide
- Install eBPF Collector: Deploy the probe binary to the target host. Verify kernel headers are available and the bpf syscall is unblocked.
- Start Event Store: Initialize the columnar storage backend with the provided configuration. Confirm raw event ingestion is active.
- Launch MCP Server: Run the TypeScript MCP bridge in stdio mode. Test tool discovery using mcp-cli list-tools.
- Connect Agent: Configure your LLM agent to register both the application MCP and the kernel MCP. Execute a diagnostic query against a known latency window.
- Validate Overhead: Monitor host CPU usage during probe execution. Confirm overhead remains under 2% and event throughput matches expected workload volume.
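For the overhead check, one simple approach is to compare host CPU busy fractions measured over two equal intervals, one without probes and one with them attached. The sketch below assumes you already have cumulative busy and total tick counters (for example, derived from /proc/stat); the `CpuSample` shape and function names are hypothetical, and the comparison is only meaningful if the workload is steady across both intervals.

```typescript
// Cumulative CPU counters sampled at one point in time.
interface CpuSample {
  busyTicks: number;
  totalTicks: number;
}

// Fraction of CPU time spent busy between two samples.
function busyFraction(before: CpuSample, after: CpuSample): number {
  const total = after.totalTicks - before.totalTicks;
  if (total <= 0) throw new Error("samples must span a positive interval");
  return (after.busyTicks - before.busyTicks) / total;
}

// Difference in busy percentage points between the probed and baseline intervals.
function probeOverheadPercent(
  baseline: [CpuSample, CpuSample],
  withProbes: [CpuSample, CpuSample]
): number {
  const base = busyFraction(baseline[0], baseline[1]);
  const probed = busyFraction(withProbes[0], withProbes[1]);
  return (probed - base) * 100;
}

// Returns true when the measured overhead stays under the given budget.
function withinBudget(overheadPercent: number, budgetPercent: number): boolean {
  return overheadPercent < budgetPercent;
}
```

A baseline of 400 busy ticks out of 1000 against 415 out of 1000 with probes attached gives roughly 1.5 percentage points, which is inside the 2% budget.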