
# eBPF-Based Observability for Kubernetes Sidecars You Actually Understand

By Codcompass Team · 8 min read

Kernel-Native Telemetry: Decoupling Metrics from Sidecars in Kubernetes

## Current Situation Analysis

Modern Kubernetes observability has converged on a problematic default: inject a proxy sidecar into every workload, or pay per-host licensing fees for commercial APMs. Both models share a fundamental scaling flaw. They multiply infrastructure overhead by pod count. As microservices architectures fragment into dozens of small deployments, the cumulative memory footprint of Envoy or Linkerd sidecars, combined with per-agent APM licensing, creates a steep operational tax. Teams routinely accept 50–100 MB of resident memory per pod and 1–3% added latency as "the cost of visibility." Meanwhile, commercial monitoring platforms charge $3,000–$5,000 monthly for mid-sized clusters, locking engineering budgets to agent-based telemetry.

This problem persists because observability tooling has historically been built at the application or network proxy layer. Developers assume that extracting HTTP status codes, request durations, or TCP health signals requires instrumenting the runtime, injecting middleware, or terminating traffic at a sidecar. The Linux kernel's eBPF (extended Berkeley Packet Filter) capability is frequently dismissed as too low-level, too complex, or too unstable for production telemetry. In reality, eBPF has matured into a stable, verifiable execution environment that runs safely in kernel space. It can intercept syscalls, tracepoints, and kprobes without modifying application binaries, restarting containers, or routing traffic through user-space proxies.

The economic and operational friction is quantifiable. Sidecar architectures scale linearly with pod density. A 200-pod cluster running 80 MB sidecars consumes roughly 16 GB of RAM solely for observability. Commercial APMs compound this with licensing tiers that ignore actual resource utilization. eBPF flips the scaling model: telemetry runs per-node via a DaemonSet, consuming a flat memory budget regardless of pod count. The kernel handles packet processing, syscall tracing, and histogram aggregation natively. Userspace agents only read aggregated results. The result is L4/L7 visibility with negligible overhead, zero application code changes, and licensing costs that drop to infrastructure you already provision.

## Key Findings

The architectural shift from per-pod proxies to per-node kernel probes fundamentally changes how observability scales. The following comparison illustrates the operational and economic divergence:

| Approach | Scaling Factor | Memory Footprint (200 pods) | Licensing Model |
| --- | --- | --- | --- |
| Service Mesh Sidecar (Envoy) | Per-pod | 10–20 GB | Open source (compute cost only) |
| Lightweight Sidecar (Linkerd) | Per-pod | 4–6 GB | Open source (compute cost only) |
| Commercial APM Agent | Per-host/agent | 50–150 MB per node | $3,000–$5,000/mo (mid-cluster) |
| eBPF DaemonSet | Per-node | ~40 MB per node | Open source (compute cost only) |

This data reveals a critical insight: eBPF decouples telemetry density from workload density. At startup or mid-market scale, the difference between multiplying memory by pod count versus node count often covers the salary of a platform engineer. More importantly, it enables metrics collection at the syscall and network stack boundary, capturing TCP retransmits, connection handshakes, and HTTP request boundaries without requiring language-specific SDKs, proxy injection, or application restarts. The finding matters because it shifts observability from an application-layer dependency to an infrastructure-layer capability, allowing teams to standardize telemetry across polyglot stacks while reclaiming compute budget and eliminating licensing lock-in.

## Core Solution

Building a production-grade eBPF telemetry pipeline requires three coordinated layers: kernel-space probes, a portable compilation strategy, and a userspace aggregation agent. The architecture prioritizes stability, verifiability, and seamless integration with existing Prometheus/Grafana stacks.

### Step 1: Kernel Prerequisites and BTF Validation

eBPF programs require the BPF Type Format (BTF) to achieve binary portability. BTF embeds kernel struct definitions into the kernel image, allowing the eBPF verifier to resolve struct field offsets at load time. Without BTF, probes must be compiled against exact kernel headers for each node, which breaks in auto-scaling or auto-updating managed clusters.

Verify BTF availability before deployment:

```bash
# Check if BTF is embedded in the running kernel
cat /sys/kernel/btf/vmlinux | file -
# Expected: /dev/stdin: BTF blob
```

Managed Kubernetes distributions (GKE, EKS with Amazon Linux 2023, AKS) ship with BTF-enabled kernels (5.8+). If running self-managed clusters, ensure `CONFIG_DEBUG_INFO_BTF=y` is set during kernel compilation.

### Step 2: Portable Probe Design with BPF CO-RE

BPF CO-RE (Compile Once, Run Everywhere) eliminates kernel-version coupling. Instead of hardcoding struct offsets, CO-RE uses `BPF_CORE_READ` macros to defer relocation to load time. The probe compiles once in CI, ships as a container image, and loads safely across heterogeneous node pools.

Below is a production-ready C probe that tracks TCP retransmits and extracts destination port/address for pod correlation:

```c
// tcp_retransmit_tracker.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_endian.h>

struct retransmit_event {
    __u32 saddr;
    __u32 daddr;
    __u16 sport;
    __u16 dport;
    __u64 timestamp_ns;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} retransmit_ring SEC(".maps");

SEC("tracepoint/tcp/tcp_retransmit_skb")
int handle_tcp_retransmit(struct trace_event_raw_tcp_event_sk_skb *ctx)
{
    struct sock *sk = (struct sock *)ctx->skaddr;
    if (!sk)
        return 0;

    struct retransmit_event *evt =
        bpf_ringbuf_reserve(&retransmit_ring, sizeof(*evt), 0);
    if (!evt)
        return 0;

    evt->saddr = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
    evt->daddr = BPF_CORE_READ(sk, __sk_common.skc_daddr);
    /* skc_num is the local port, already in host byte order */
    evt->sport = BPF_CORE_READ(sk, __sk_common.skc_num);
    /* skc_dport is stored in network byte order */
    evt->dport = bpf_ntohs(BPF_CORE_READ(sk, __sk_common.skc_dport));
    evt->timestamp_ns = bpf_ktime_get_ns();

    bpf_ringbuf_submit(evt, 0);
    return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";
```


**Architecture Rationale:**
- `BPF_MAP_TYPE_RINGBUF` replaces legacy perf buffers. It provides lockless, batched delivery to userspace, reducing context-switch overhead.
- Tracepoint attachment (`tcp_retransmit_skb`) is stable across kernel versions. Unlike kprobes, tracepoints guarantee ABI stability.
- Network byte order conversion (`bpf_ntohs`) happens in-kernel to avoid userspace parsing delays.
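
The compiled object still has to be loaded on each node before the userspace agent can read from it. Below is a minimal loader sketch using `cilium/ebpf`; the object filename and the `loadProbe` helper are conventions of this sketch, not part of any standard API, and error handling is deliberately terse:

```go
// loader.go: load the CO-RE object built in CI and attach the tracepoint.
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// loadProbe returns the ring buffer map handle and the attached tracepoint link.
func loadProbe() (*ebpf.Map, link.Link) {
	// CO-RE relocations are resolved here against the node's /sys/kernel/btf/vmlinux.
	spec, err := ebpf.LoadCollectionSpec("tcp_retransmit_tracker.o")
	if err != nil {
		log.Fatalf("load spec: %v", err)
	}
	coll, err := ebpf.NewCollection(spec)
	if err != nil {
		log.Fatalf("create collection: %v", err)
	}

	tp, err := link.Tracepoint("tcp", "tcp_retransmit_skb",
		coll.Programs["handle_tcp_retransmit"], nil)
	if err != nil {
		log.Fatalf("attach tracepoint: %v", err)
	}
	return coll.Maps["retransmit_ring"], tp
}
```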

### Step 3: Userspace Aggregation and Prometheus Exposition
Kernel probes emit raw events. A userspace DaemonSet agent reads the ring buffer, correlates network addresses to Kubernetes pod metadata via the API server, and aggregates metrics into Prometheus histograms.

```go
// collector.go (userspace agent)
package main

import (
	"bytes"
	"context"
	"encoding/binary"
	"fmt"
	"log"
	"net"
	"net/http"

	"github.com/cilium/ebpf/ringbuf"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// RetransmitEvent mirrors the kernel struct retransmit_event, including the
// 4 bytes of padding the compiler inserts before the 8-byte-aligned timestamp.
type RetransmitEvent struct {
	Saddr       [4]byte
	Daddr       [4]byte
	Sport       uint16
	Dport       uint16
	_           [4]byte
	TimestampNs uint64
}

var tcpRetransmits = promauto.NewCounterVec(
	prometheus.CounterOpts{Name: "node_tcp_retransmits_total"},
	[]string{"pod", "namespace", "service", "dest_port"},
)

func main() {
	ctx := context.Background()

	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("in-cluster config: %v", err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatalf("clientset: %v", err)
	}

	// Load the CO-RE object and attach the tracepoint (loader sketch in Step 2).
	ringBufMap, probeLink := loadProbe()
	defer probeLink.Close()

	rd, err := ringbuf.NewReader(ringBufMap)
	if err != nil {
		log.Fatalf("ringbuf reader: %v", err)
	}
	defer rd.Close()

	// Serve /metrics on :9090 for the Prometheus scrape config below.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		log.Fatal(http.ListenAndServe(":9090", nil))
	}()

	for {
		record, err := rd.Read()
		if err != nil {
			log.Printf("ringbuf read: %v", err)
			continue
		}

		// Decode the raw sample; assumes little-endian nodes (x86-64, arm64).
		var evt RetransmitEvent
		if err := binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &evt); err != nil {
			continue
		}

		// Daddr bytes arrive exactly as the kernel stored them (network order).
		destIP := net.IP(evt.Daddr[:]).String()
		podMeta := resolvePodByIP(ctx, clientset, destIP) // TTL cache, sketched below
		if podMeta == nil {
			continue // not attributable to a pod (host network, external peer)
		}

		tcpRetransmits.WithLabelValues(
			podMeta.Name, podMeta.Namespace, podMeta.Labels["app"],
			fmt.Sprintf("%d", evt.Dport),
		).Inc()
	}
}
```

**Architecture Rationale:**
- The agent runs as a DaemonSet, ensuring one instance per node. It watches the Kubernetes API for pod IP assignments, maintaining a local cache to resolve `daddr` to pod/namespace/service labels (a cache sketch follows this list).
- Prometheus counters and histograms are exposed via `/metrics`; no external push gateway is required.
- Binary event decoding uses `encoding/binary` with little-endian alignment matching the kernel struct layout.
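
The `resolvePodByIP` call above is the cache described in the first bullet. A minimal TTL-based sketch follows; it polls the API with LIST for brevity, whereas a production agent would use a watch or informer as noted above, and every name in it is a convention of this sketch:

```go
// podcache.go: minimal TTL-based IP-to-pod resolution for the collector.
package main

import (
	"context"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

var podCache = struct {
	sync.RWMutex
	byIP    map[string]*corev1.Pod
	updated time.Time
}{byIP: map[string]*corev1.Pod{}}

const podCacheTTL = 30 * time.Second // matches the 30s TTL in the checklist

// resolvePodByIP returns pod metadata for a destination IP, or nil when the
// IP cannot be attributed (host network, external endpoint, reclaimed IP).
func resolvePodByIP(ctx context.Context, cs *kubernetes.Clientset, ip string) *corev1.Pod {
	podCache.RLock()
	stale := time.Since(podCache.updated) > podCacheTTL
	pod := podCache.byIP[ip]
	podCache.RUnlock()
	if !stale {
		return pod
	}

	// Refresh the whole IP index; on error, keep serving the stale cache.
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	if err != nil {
		return pod
	}
	byIP := make(map[string]*corev1.Pod, len(pods.Items))
	for i := range pods.Items {
		if p := &pods.Items[i]; p.Status.PodIP != "" {
			byIP[p.Status.PodIP] = p
		}
	}

	podCache.Lock()
	podCache.byIP, podCache.updated = byIP, time.Now()
	podCache.Unlock()
	return byIP[ip]
}
```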

### Step 4: HTTP Latency Extraction via Syscall Boundaries

For L7 metrics, attach kprobes or raw syscall tracepoints to the `accept`, `read`, and `write` syscalls. Parse the initial bytes in-kernel to extract the HTTP method and status code. Aggregate durations into Prometheus histograms:

```text
http_request_duration_seconds_bucket{pod="checkout-svc-8f2a",method="POST",status="201",le="0.05"} 8420
http_request_duration_seconds_bucket{pod="checkout-svc-8f2a",method="POST",status="201",le="0.1"} 9105
http_request_duration_seconds_bucket{pod="checkout-svc-8f2a",method="POST",status="201",le="+Inf"} 9142
```

This approach works across Go, Rust, Python, and Node.js because it intercepts the OS boundary, not the runtime. No SDK installation, no environment variable injection, no container rebuilds.
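
The exposition above corresponds to an ordinary Prometheus `HistogramVec` maintained by the userspace agent. Here is a minimal sketch of the metric definition and the observation call; the bucket boundaries and the `observeHTTPRequest` hook are illustrative, not prescribed by the exposition sample:

```go
// httpmetrics.go: histogram backing the exposition sample above.
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var httpRequestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency observed at the syscall boundary.",
		Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1},
	},
	[]string{"pod", "method", "status"},
)

// observeHTTPRequest is called once the probe has paired a request's first
// read with the response's final write for a single HTTP exchange.
func observeHTTPRequest(pod, method, status string, d time.Duration) {
	httpRequestDuration.WithLabelValues(pod, method, status).Observe(d.Seconds())
}
```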

## Pitfall Guide

| Pitfall | Explanation | Fix |
| --- | --- | --- |
| Assuming eBPF replaces distributed tracing | eBPF captures network and syscall boundaries, not application-level trace context (e.g., `traceparent` headers). It cannot reconstruct cross-service spans. | Pair eBPF metrics with OpenTelemetry SDKs for trace propagation. Use eBPF for aggregate latency/error rates, OTel for request-scoped traces. |
| TLS/mTLS blindness at the socket layer | eBPF probes attached to `tcp_sendmsg` or `tcp_recvmsg` see encrypted payloads. HTTP methods and status codes are unreadable. | Attach uprobes to TLS library entry points (`SSL_read`/`SSL_write` in OpenSSL, rustls equivalents). Maintain version-specific offset maps or use CO-RE-compatible TLS libraries. |
| Verifier rejection due to unbounded complexity | The eBPF verifier rejects programs with unbounded loops, excessive stack usage, or unvalidated pointer arithmetic. | Replace loops with the `bpf_loop()` helper, limit stack usage to 512 bytes, use `BPF_CORE_READ` for all struct access, and validate pointers before dereferencing. |
| Ring buffer overflow under high throughput | If userspace reads more slowly than the kernel emits events, the ring buffer drops samples and metrics become inaccurate during traffic spikes. | Size the ring buffer at 256 KB–1 MB per node, implement backpressure-aware reading, and monitor `bpf_ringbuf_discard` events. Consider per-CPU maps for extreme throughput. |
| Pod IP cache staleness | Kubernetes rapidly assigns and reclaims IPs. A static lookup table causes metric misattribution or label drift. | Watch `v1/pods` with `resourceVersion`, maintain a TTL-based cache (30 s), and fall back to node-level labels when pod resolution fails. |
| Metric cardinality explosion | Attaching labels like `request_id` or `full_url` to Prometheus metrics creates unbounded series, crashing the TSDB. | Limit labels to `pod`, `namespace`, `service`, `method`, `status`. Use exemplars for trace correlation instead of high-cardinality labels (sketched below). |
| Ignoring the kernel version floor | BTF and CO-RE require kernel 5.8+. Deploying to legacy nodes causes load failures or fallback to non-portable probes. | Validate node kernel versions in admission webhooks or DaemonSet tolerations. Ship fallback probes or exclude pre-5.8 nodes from eBPF telemetry. |
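
On the cardinality pitfall: exemplars attach a trace ID to individual samples without minting new series. A minimal sketch against the Step 3 counter; the trace-ID source is assumed to come from elsewhere, and exemplars only surface when `/metrics` is served with OpenMetrics negotiation (`promhttp.HandlerOpts{EnableOpenMetrics: true}`):

```go
// exemplars.go: trace correlation without high-cardinality labels.
package main

import "github.com/prometheus/client_golang/prometheus"

// recordRetransmit increments the Step 3 counter; when a trace ID is known,
// it rides along as an exemplar rather than a label, so no new series appears.
func recordRetransmit(pod, ns, svc, port, traceID string) {
	c := tcpRetransmits.WithLabelValues(pod, ns, svc, port)
	if adder, ok := c.(prometheus.ExemplarAdder); ok && traceID != "" {
		adder.AddWithExemplar(1, prometheus.Labels{"trace_id": traceID})
		return
	}
	c.Inc()
}
```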

## Production Bundle

### Action Checklist

- Verify BTF availability on all target nodes (`/sys/kernel/btf/vmlinux`)
- Compile eBPF probes with a CO-RE toolchain (`bpftool gen`, libbpf, or `cilium/ebpf`)
- Package probes and the userspace agent into a single DaemonSet container image
- Implement the pod IP-to-metadata cache with the Kubernetes watch API and a 30 s TTL
- Configure Prometheus scrape targets for the DaemonSet `/metrics` endpoint
- Set up Grafana dashboards for TCP retransmit rates and HTTP latency histograms
- Validate metric accuracy against baseline sidecar/APM data during a shadow deployment
- Document fallback procedures for pre-5.8 nodes or TLS library version mismatches

### Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Polyglot microservices, strict budget | eBPF DaemonSet + Prometheus | Zero SDK overhead, scales by node, eliminates licensing | ~$0 licensing, minimal compute |
| Strict compliance requiring full request tracing | OpenTelemetry SDKs + eBPF metrics | OTel handles trace context; eBPF handles network health | Moderate SDK overhead, no licensing |
| Legacy cluster (kernel < 5.8) | Sidecar proxy or host-level agent | BTF unavailable, CO-RE fails, kernel compatibility risk | Higher memory/CPU, potential licensing |
| High-throughput gRPC/mTLS services | eBPF + TLS library uprobes | Socket layer is blind to ciphertext; uprobes decode at the library boundary | Requires version mapping, moderate dev effort |
| Multi-tenant cluster with strict isolation | eBPF with cgroup-based filtering | Prevents cross-tenant metric leakage, enforces namespace boundaries | Requires cgroup v2, moderate config complexity |

### Configuration Template

```yaml
# ebpf-telemetry-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kernel-telemetry-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: ebpf-collector
  template:
    metadata:
      labels:
        app: ebpf-collector
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: collector
        image: registry.internal/ebpf-telemetry:latest
        securityContext:
          privileged: true
        volumeMounts:
        - name: bpf-maps
          mountPath: /sys/fs/bpf
        - name: proc
          mountPath: /host/proc
          readOnly: true
      volumes:
      - name: bpf-maps
        hostPath:
          path: /sys/fs/bpf
      - name: proc
        hostPath:
          path: /proc
---
# prometheus-scrape-config.yaml
scrape_configs:
  - job_name: 'ebpf-node-telemetry'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        target_label: __address__
        replacement: '${1}:9090'
    metrics_path: /metrics
```

### Quick Start Guide

1. **Validate kernel compatibility:** Run `cat /sys/kernel/btf/vmlinux | file -` on a representative node. Proceed only if the output confirms a BTF blob.
2. **Compile probes:** Use `bpftool gen skeleton` or the `cilium/ebpf` toolchain to generate CO-RE binaries. Package them into a container image with the userspace Go agent.
3. **Deploy the DaemonSet:** Apply the YAML manifest. Verify pods reach `Running` state and mount `/sys/fs/bpf` successfully.
4. **Expose metrics:** Confirm the agent listens on port 9090 and serves the Prometheus exposition format at `/metrics`.
5. **Ingest and visualize:** Add the scrape config to Prometheus. Import pre-built Grafana dashboards for TCP retransmit rates and HTTP latency histograms. Validate pod label resolution against the Kubernetes API.