ubleshooting requires a layered instrumentation strategy that correlates network telemetry with application traces and real-user monitoring.
Step 1: Instrumentation Hierarchy
You must capture latency at four distinct layers:
- Client/RUM: DNS time, TCP connect, TLS handshake, TTFB, DOM load.
- Network/Edge: RTT, Jitter, Packet Loss, Retransmissions, BGP convergence.
- Application/Service: Queue wait time, Processing time, Downstream calls, Serialization cost.
- Kernel/OS: Context switches, CPU steal time, TCP state transitions, Interrupts.
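As a rough illustration of the Client/RUM layer, the sketch below splits one resource-timing record into the phases listed above. The ResourceTimingLike interface and the breakdown function are hypothetical names introduced here; the field names mirror the browser's PerformanceResourceTiming entry, so a real entry could be passed in directly. TTFB is taken here as the wait between sending the request and receiving the first byte, which is one of several definitions in use.

```typescript
// Subset of PerformanceResourceTiming fields (all in milliseconds).
interface ResourceTimingLike {
  domainLookupStart: number;
  domainLookupEnd: number;
  connectStart: number;
  connectEnd: number;
  secureConnectionStart: number; // 0 when no TLS handshake happened
  requestStart: number;
  responseStart: number;
  responseEnd: number;
}

interface ClientBreakdown {
  dnsMs: number;
  tcpMs: number;
  tlsMs: number;
  ttfbMs: number;
  downloadMs: number;
}

// Hypothetical helper: attributes elapsed time to each client-side phase.
function breakdown(t: ResourceTimingLike): ClientBreakdown {
  const hasTls = t.secureConnectionStart > 0;
  // TCP connect ends where TLS starts; without TLS it ends at connectEnd.
  const tcpEnd = hasTls ? t.secureConnectionStart : t.connectEnd;
  return {
    dnsMs: t.domainLookupEnd - t.domainLookupStart,
    tcpMs: tcpEnd - t.connectStart,
    tlsMs: hasTls ? t.connectEnd - t.secureConnectionStart : 0,
    ttfbMs: t.responseStart - t.requestStart,
    downloadMs: t.responseEnd - t.responseStart,
  };
}

// Made-up sample values for illustration only.
const sample = breakdown({
  domainLookupStart: 0, domainLookupEnd: 20,
  connectStart: 20, connectEnd: 90, secureConnectionStart: 50,
  requestStart: 90, responseStart: 190, responseEnd: 210,
});
console.log(sample);
```

A large dnsMs or tlsMs relative to ttfbMs points at connection setup rather than server processing, which helps decide which of the other three layers to inspect next.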
Step 2: Distributed Tracing with Latency Spans
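Implement distributed tracing that explicitly captures network hops. Use OpenTelemetry to create spans for external calls. Where the runtime exposes the peer address (server-side clients do; browser fetch does not), ensure net.peer.ip and net.peer.port are attached. Newer OpenTelemetry semantic conventions name these network.peer.address and network.peer.port.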
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';

export async function fetchWithLatencyTracking(url: string, options?: RequestInit) {
  const tracer = trace.getTracer('network-latency-tracker');
  const method = options?.method ?? 'GET';
  // Keep the span name low-cardinality; the full URL goes in an attribute.
  const span = tracer.startSpan(`HTTP ${method}`, {
    kind: SpanKind.CLIENT,
    attributes: {
      'http.url': url,
      'http.method': method,
    },
  });
  const startTime = performance.now();
  try {
    const response = await fetch(url, options);
    // fetch resolves when headers arrive; the body may still be streaming.
    span.setAttribute('http.status_code', response.status);
    span.setAttribute('network.latency_ms', performance.now() - startTime);

    // Browser only: the Resource Timing API exposes DNS/TCP/TLS phases.
    // Cross-origin entries report zeros unless the server sends Timing-Allow-Origin,
    // and the entry may not exist yet if the body is still downloading.
    if (typeof window !== 'undefined' && window.performance?.getEntriesByName) {
      // Entries are keyed by absolute URL; take the most recent match.
      const absoluteUrl = new URL(url, window.location.href).href;
      const entries = window.performance.getEntriesByName(absoluteUrl, 'resource');
      const timing = entries[entries.length - 1] as PerformanceResourceTiming | undefined;
      if (timing) {
        span.setAttribute('network.dns_ms', timing.domainLookupEnd - timing.domainLookupStart);
        // connectStart..connectEnd covers TCP setup plus the TLS handshake.
        span.setAttribute('network.tcp_ms', timing.connectEnd - timing.connectStart);
        // secureConnectionStart is 0 for plain HTTP or a reused connection.
        if (timing.secureConnectionStart > 0) {
          span.setAttribute('network.tls_ms', timing.connectEnd - timing.secureConnectionStart);
        }
      }
    }
    return response;
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
    throw error;
  } finally {
    span.end();
  }
}
Step 3: Kernel-Level Visibility with eBPF
For deep troubleshooting, user-space tools alone are often insufficient. eBPF allows safe inspection of the kernel network stack without restarting services. Use bpftrace to monitor TCP retransmissions and latency spikes in real time.
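# Log each IPv4 TCP retransmission with the task that was on-CPU (requires root).
# Retransmits often fire from timer/softirq context, so pid/comm may not be
# the process that owns the socket.
bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb {
  printf("%-6d %-16s %s:%d -> %s:%d\n", pid, comm,
    ntop(args->saddr), args->sport,
    ntop(args->daddr), args->dport);
}'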
Step 4: Protocol Optimization Analysis
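Evaluate protocol overhead. HTTP/1.1 suffers from head-of-line blocking because each connection serves one response at a time. HTTP/2 multiplexes streams over a single TCP connection, but one lost packet still stalls every stream on that connection. QUIC (HTTP/3) runs over UDP, recovers loss per stream, and reduces connection establishment latency by combining the transport and TLS 1.3 handshakes.
- Decision: If P99 latency is driven by connection churn, migrate to HTTP/2 or HTTP/3 so that requests share a long-lived connection.
- TLS: Ensure TLS 1.3 is enabled. A full TLS 1.2 handshake requires two round trips; TLS 1.3 requires one. This saves roughly one RTT per new connection, which matters most for global deployments with long paths.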
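To make the round-trip arithmetic concrete, here is a minimal sketch that estimates connection setup time for a new connection. The Transport type, the SETUP_ROUND_TRIPS table, and the 150 ms RTT are illustrative assumptions; the model counts full handshakes only and ignores session resumption, 0-RTT, and packet loss.

```typescript
type Transport = 'tcp+tls1.2' | 'tcp+tls1.3' | 'quic';

// Round trips before the first request can be sent on a NEW connection.
const SETUP_ROUND_TRIPS: Record<Transport, number> = {
  'tcp+tls1.2': 3, // 1 (TCP handshake) + 2 (TLS 1.2)
  'tcp+tls1.3': 2, // 1 (TCP handshake) + 1 (TLS 1.3)
  'quic': 1,       // transport and TLS 1.3 handshakes combined
};

function setupLatencyMs(transport: Transport, rttMs: number): number {
  return SETUP_ROUND_TRIPS[transport] * rttMs;
}

const rttMs = 150; // assumed long-haul RTT
for (const t of ['tcp+tls1.2', 'tcp+tls1.3', 'quic'] as const) {
  console.log(`${t}: ${setupLatencyMs(t, rttMs)} ms before the first request`);
}
```

Under these assumptions a client 150 ms away spends 450 ms on setup with TLS 1.2 over TCP, 300 ms with TLS 1.3, and 150 ms with QUIC, which is why connection reuse matters more than any single protocol upgrade.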
Step 5: Resilient Client Patterns
Latency troubleshooting must address how the application reacts to latency. Implement timeouts and circuit breakers to prevent latency amplification.
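import CircuitBreaker from 'opossum';
// fetchWithLatencyTracking is the function defined in Step 2.

const circuitBreakerOptions = {
  timeout: 2000, // Reject the call after 2 seconds
  errorThresholdPercentage: 50, // Open the circuit when 50% of calls fail
  resetTimeout: 30000, // Allow a probe request after 30 seconds
};

const breaker = new CircuitBreaker(fetchWithLatencyTracking, circuitBreakerOptions);

// opossum passes the timeout error first, then the observed latency in ms.
breaker.on('timeout', (_error, latency) => {
  console.warn(`Circuit breaker timeout after ${latency}ms`);
  // Trigger alerting or fallback logic
});

export async function resilientFetch(url: string): Promise<Response> {
  try {
    return await breaker.fire(url);
  } catch (error) {
    // opossum marks its timeout errors with code 'ETIMEDOUT'.
    if ((error as { code?: string }).code === 'ETIMEDOUT') {
      // 504 is Gateway Timeout (408 would be Request Timeout).
      return new Response('Gateway Timeout', { status: 504 });
    }
    throw error;
  }
}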
Pitfall Guide
1. Optimizing P50 Instead of P99
Mistake: Teams celebrate low average latency while ignoring tail latency.
Impact: P50 masks the experience of users on poor connections or those hitting slow replicas. A system can have P50=10ms and P99=2000ms. Users in the tail churn.
Best Practice: Define SLOs based on P99 and P99.9. Alert on the burn rate of the latency error budget, not on absolute averages.
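The gap between median and tail, and the idea of a burn rate, can be sketched in a few lines. The percentile and burnRate helpers below are illustrative, not a monitoring library's API; percentile uses the nearest-rank method, and the sample data is invented to match the P50=10ms, P99=2000ms case described above.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Burn rate: how fast the error budget is consumed relative to plan.
// 1.0 means the budget lasts exactly as long as the SLO window.
function burnRate(slowRequests: number, totalRequests: number, sloTarget: number): number {
  const badFraction = slowRequests / totalRequests;
  return badFraction / (1 - sloTarget);
}

// 98 fast requests and 2 slow ones: the median looks healthy, the tail does not.
const latenciesMs: number[] = [...new Array<number>(98).fill(10), 2000, 2000];
console.log(percentile(latenciesMs, 50)); // 10
console.log(percentile(latenciesMs, 99)); // 2000

// SLO: 99% of requests under 200 ms. 2% are slow, so the budget burns
// roughly twice as fast as planned.
console.log(burnRate(2, 100, 0.99));
```

Alerting on a sustained burn rate above 1 catches tail regressions that an average would hide, while staying quiet for brief spikes that consume little budget.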
2. Ignoring DNS Resolution Time
Mistake: Assuming DNS is instant.
Impact: DNS lookups can take 50-200ms, especially with misconfigured resolvers or DNSSEC validation overhead. Without connection reuse, every new connection can trigger a lookup unless a cache answers it.
Best Practice: Implement local DNS caching. Monitor dns.lookup latency in traces. Use getent or dig to verify resolver performance. If corporate DNS is the bottleneck, consider DNS-over-HTTPS (DoH) or DNS-over-TLS (DoT), but measure first, since encrypted DNS adds its own connection setup.
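One way to picture local DNS caching is a small TTL cache wrapped around whatever lookup function the client already uses. This is a sketch under stated assumptions: cachedResolver and the Resolver type are hypothetical names, the TTL is a fixed value rather than the record's own TTL, and negative results and in-flight deduplication are not handled.

```typescript
type Resolver = (hostname: string) => Promise<string[]>;

// Wraps a resolver with an in-memory TTL cache. The clock is injectable
// so expiry can be exercised without waiting.
function cachedResolver(
  resolve: Resolver,
  ttlMs: number,
  now: () => number = Date.now,
): Resolver {
  const cache = new Map<string, { addresses: string[]; expiresAt: number }>();
  return async (hostname) => {
    const hit = cache.get(hostname);
    if (hit && hit.expiresAt > now()) return hit.addresses; // served from cache
    const addresses = await resolve(hostname); // cache miss: pay the lookup cost
    cache.set(hostname, { addresses, expiresAt: now() + ttlMs });
    return addresses;
  };
}

// Stand-in resolver for illustration; 192.0.2.10 is a documentation address.
const slowResolve: Resolver = async () => ['192.0.2.10'];
const lookup = cachedResolver(slowResolve, 30_000);
lookup('api.example.com').then((addresses) => console.log(addresses));
```

In Node, the wrapped function could be dns.promises.resolve4. A fixed TTL trades freshness for speed, so it should stay short enough that failover via DNS still works.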
3. TCP Head-of-Line Blocking and Retransmissions
Mistake: Not monitoring TCP retransmissions.
Impact: A single lost packet can block all subsequent data in a TCP stream until retransmitted, causing massive latency spikes for HTTP/1.1 and HTTP/2.
Best Practice: Monitor tcp.retransmit metrics. If retransmissions are high, investigate network path quality or MTU mismatches. Consider QUIC to mitigate HOL blocking.
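On Linux, host-wide TCP counters are exposed in /proc/net/snmp as a header line followed by a values line. The sketch below derives a retransmission ratio from that text; tcpRetransmitRatio is a hypothetical helper, and the sample numbers are invented. The counters are cumulative since boot, so a real monitor would diff two snapshots rather than use the raw totals.

```typescript
// Parses the two "Tcp:" lines of /proc/net/snmp and returns
// RetransSegs / OutSegs (0 if the counters are missing or zero).
function tcpRetransmitRatio(snmpText: string): number {
  const tcpLines = snmpText.split('\n').filter((line) => line.startsWith('Tcp:'));
  if (tcpLines.length < 2) throw new Error('Tcp section not found');
  const names = tcpLines[0].trim().split(/\s+/).slice(1);
  const values = tcpLines[1].trim().split(/\s+/).slice(1).map(Number);
  const outSegs = values[names.indexOf('OutSegs')];
  const retransSegs = values[names.indexOf('RetransSegs')];
  if (!(outSegs > 0) || !(retransSegs >= 0)) return 0;
  return retransSegs / outSegs;
}

// Invented sample in the /proc/net/snmp layout.
const sampleSnmp = [
  'Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors',
  'Tcp: 1 200 120000 -1 100 50 3 2 10 100000 80000 400 0 5 0',
].join('\n');

console.log(tcpRetransmitRatio(sampleSnmp)); // 0.005, i.e. 0.5% of segments retransmitted
```

What counts as "high" depends on the path; the useful signal is usually a change relative to the host's own baseline.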
4. Misconfigured Keep-Alive and Connection Pools
Mistake: Using default connection pool settings or disabling keep-alive.
Impact: Frequent TCP handshakes and TLS negotiations add RTT overhead. If pools are too small, requests queue waiting for connections, increasing latency artificially.
Best Practice: Tune keepalive_timeout and pool size based on concurrency. Ensure idle connections are validated before reuse to avoid "zombie" connection errors.
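Sizing a pool "based on concurrency" can be approximated with Little's Law: average in-flight requests equal arrival rate times average latency. The sketch below applies it; requiredPoolSize and the 1.5x headroom factor are assumptions for illustration, not a recommendation from any particular client library.

```typescript
// Little's Law: concurrent requests = arrival rate x time in system.
// headroom pads the estimate so bursts do not immediately queue.
function requiredPoolSize(
  requestsPerSecond: number,
  avgLatencyMs: number,
  headroom = 1.5,
): number {
  const inFlight = (requestsPerSecond * avgLatencyMs) / 1000;
  return Math.max(1, Math.ceil(inFlight * headroom));
}

// 200 req/s at 50 ms keeps about 10 connections busy; with headroom, 15.
console.log(requiredPoolSize(200, 50)); // 15

// The same traffic at 500 ms (a slow dependency) needs ten times as many.
console.log(requiredPoolSize(200, 500)); // 150
```

The second case shows why a pool that is adequate in normal operation starts queuing as soon as a downstream dependency slows down. The result would feed whatever maximum-connections setting the client exposes, such as maxSockets on a Node http.Agent.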
5. Garbage Collection Pauses Masquerading as Network Latency
Mistake: Attributing request delays to the network when the process is paused.
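Impact: In garbage-collected runtimes (Java, Go, Node.js), stop-the-world pauses freeze the application threads that read from and write to sockets. The kernel keeps receiving packets, but nothing processes them until the pause ends. The client sees a timeout, but the server was just collecting garbage.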
Best Practice: Correlate latency spikes with GC metrics. Tune GC for low latency (e.g., G1GC or ZGC in Java). Use non-blocking I/O patterns.
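Correlating latency spikes with GC metrics comes down to checking how much of a slow request's lifetime overlapped a pause. A minimal sketch, assuming pause intervals have already been collected from the runtime (GC logs in Java, or perf_hooks 'gc' entries in Node); the Interval type and pausedTimeMs are hypothetical names.

```typescript
interface Interval {
  startMs: number;
  endMs: number;
}

// Milliseconds of `request` that overlap any recorded pause.
// Assumes the pauses do not overlap each other.
function pausedTimeMs(request: Interval, pauses: Interval[]): number {
  let total = 0;
  for (const pause of pauses) {
    const overlap =
      Math.min(request.endMs, pause.endMs) - Math.max(request.startMs, pause.startMs);
    if (overlap > 0) total += overlap;
  }
  return total;
}

// Invented example: a 400 ms request with one pause before it and one during it.
const slowRequest: Interval = { startMs: 1000, endMs: 1400 };
const gcPauses: Interval[] = [
  { startMs: 900, endMs: 950 },   // finished before the request started
  { startMs: 1100, endMs: 1350 }, // 250 ms inside the request
];

const paused = pausedTimeMs(slowRequest, gcPauses);
console.log(paused); // 250
console.log(paused / (slowRequest.endMs - slowRequest.startMs)); // 0.625
```

If most of a slow request's duration falls inside pauses, as in this example, tuning the collector is likely to help more than any network change.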
6. Assuming "The Network" is the Problem Without Evidence
Mistake: Blaming the network immediately.
Impact: Wastes time troubleshooting BGP or ISP issues when the problem is application logic or database locking.
Best Practice: Use the "Three-Segment" rule. Verify Client, Network, and Server independently. If Client RTT is high, check user network. If Server RTT is high, check load balancer and backend. If RTT is normal but latency is high, check application processing.
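The rule above reads naturally as a short decision function. This sketch encodes it directly; the SegmentSample fields, the Suspect labels, and the default budgets (100 ms RTT, 500 ms total) are placeholders chosen for illustration and would need to be replaced with values from your own baseline.

```typescript
interface SegmentSample {
  clientRttMs: number;    // client to edge or load balancer
  serverRttMs: number;    // load balancer to backend
  totalLatencyMs: number; // end-to-end request latency
}

type Suspect = 'client-network' | 'server-path' | 'application' | 'none';

// Checks the segments in order, returning the first one that exceeds its budget.
function classify(
  s: SegmentSample,
  rttBudgetMs = 100,
  latencyBudgetMs = 500,
): Suspect {
  if (s.clientRttMs > rttBudgetMs) return 'client-network'; // check user network
  if (s.serverRttMs > rttBudgetMs) return 'server-path';    // check load balancer and backend
  if (s.totalLatencyMs > latencyBudgetMs) return 'application'; // RTT normal, request slow
  return 'none';
}

console.log(classify({ clientRttMs: 250, serverRttMs: 2, totalLatencyMs: 900 }));  // client-network
console.log(classify({ clientRttMs: 30, serverRttMs: 2, totalLatencyMs: 1800 })); // application
```

The second call is the case the pitfall warns about: both RTTs are normal, so the 1.8 s belongs to application processing, not to the network.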
7. Ignoring Bufferbloat
Mistake: Over-provisioning buffers in routers/switches.
Impact: Excessive buffering hides packet loss but increases queuing delay and jitter, causing latency to skyrocket under load.
Best Practice: Enable AQM (Active Queue Management) like CoDel or FQ-CoDel on network devices. Monitor jitter, not just latency.
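Jitter can be tracked with a very small calculation over RTT probes. The sketch below uses the mean absolute difference between consecutive samples, which is one common simplification; jitterMs is a hypothetical helper and both sample series are invented to contrast an idle link with a bloated one.

```typescript
// Mean absolute difference between consecutive RTT samples, in milliseconds.
function jitterMs(rtts: number[]): number {
  if (rtts.length < 2) return 0;
  let sum = 0;
  for (let i = 1; i < rtts.length; i++) {
    sum += Math.abs(rtts[i] - rtts[i - 1]);
  }
  return sum / (rtts.length - 1);
}

const idleLink = [20, 21, 20, 22, 21];     // queues empty, RTT stable
const loadedLink = [20, 180, 60, 400, 90]; // deep buffers filling and draining

console.log(jitterMs(idleLink));   // 1.25
console.log(jitterMs(loadedLink)); // 232.5
```

A link whose jitter rises sharply under load while packet loss stays near zero is a typical bufferbloat signature, since the oversized queue absorbs the loss and converts it into delay.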
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High P99 with normal P50 | Implement Exponential Backoff + Jitter; Optimize DB queries | Tail latency often caused by retries or resource contention | Low (Engineering time) |
| Global users experiencing latency | Deploy Edge Compute / CDN; Use Anycast DNS | Reduces physical distance; caches content closer to user | Medium (CDN/Edge costs) |
| Sporadic timeouts in microservices | Enable Circuit Breakers; Tune Timeouts; Check Connection Pools | Prevents latency amplification and cascading failures | Low |
| High TLS handshake overhead | Migrate to TLS 1.3; Enable Session Resumption | Reduces round trips for connection setup | Low |
| TCP retransmissions causing spikes | Investigate MTU/Path MTU Discovery; Enable QUIC if possible | MTU fixes remove a common cause of loss; QUIC keeps one lost packet from stalling unrelated streams | Low/Medium |
| GC pauses correlating with latency | Tune GC parameters; Switch to low-latency GC | Removes application-level stalls | Low |
Configuration Template
Prometheus Alert Rule for Latency SLO Violation:
groups:
  - name: latency_slo_alerts
    rules:
      - alert: HighLatencySLOViolation
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket{job="api-gateway"}[5m])
          ) > 0.2
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "P99 latency exceeds 200ms for {{ $labels.job }}"
          description: "Current P99 latency is {{ $value }}s. Check traces and downstream dependencies."
TypeScript Resilient HTTP Client Configuration:
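export const HTTP_CLIENT_CONFIG = {
  timeout: 3000, // ms per attempt
  retries: 2,
  // Exponential backoff (100, 200, 400 ms for attempt 0, 1, 2) plus up to 100 ms of jitter
  retryDelay: (attempt: number) => Math.pow(2, attempt) * 100 + Math.random() * 100,
  // HTTP/1.1 only: Connection and Keep-Alive are connection-specific headers
  // that HTTP/2 forbids, so omit them when HTTP/2 is negotiated
  headers: {
    'Connection': 'keep-alive',
    'Keep-Alive': 'timeout=5, max=100',
  },
  // Enable HTTP/2 if supported
  http2: true,
  // TLS 1.3 minimum
  tls: {
    minVersion: 'TLSv1.3',
  },
};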
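The retries and retryDelay fields only take effect if something applies them. The following is a minimal sketch of such a wrapper, not the behavior of any specific HTTP client: withRetry and RetryConfig are hypothetical names, attempts are assumed to be zero-based, and the sleep function is injectable so the delays can be observed without waiting.

```typescript
interface RetryConfig {
  retries: number;
  retryDelay: (attempt: number) => number; // ms to wait after a failed attempt
}

// Calls fn up to (1 + retries) times, sleeping between failed attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  config: RetryConfig,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= config.retries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // No point sleeping after the final attempt.
      if (attempt < config.retries) await sleep(config.retryDelay(attempt));
    }
  }
  throw lastError;
}

// Same backoff-plus-jitter shape as the configuration above.
const retryConfig: RetryConfig = {
  retries: 2,
  retryDelay: (attempt) => Math.pow(2, attempt) * 100 + Math.random() * 100,
};

// Stand-in operation that fails once, then succeeds.
let attempts = 0;
withRetry(async () => {
  attempts += 1;
  if (attempts < 2) throw new Error('transient failure');
  return 'ok';
}, retryConfig).then((result) => console.log(result, 'after', attempts, 'attempts'));
```

Retries multiply load on a struggling dependency, so they are best limited to idempotent requests and combined with the circuit breaker from Step 5.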
Quick Start Guide
- Instrument Tracing: Deploy OpenTelemetry agents to all services. Ensure http and net instrumentation libraries are enabled. Verify spans are flowing to your backend.
- Add RUM: Integrate the Real User Monitoring SDK into your frontend. Configure it to capture PerformanceResourceTiming and correlate with trace IDs.
- Deploy eBPF Probe: Install bpftrace or bcc tools on critical hosts. Run a script to monitor TCP retransmissions and latency distribution for 15 minutes to establish a baseline.
- Configure Alerts: Set up Prometheus alerts for P99 latency based on your SLO. Include burn rate alerts for rapid detection.
- Validate Dashboard: Create a cross-observability dashboard linking RUM P99, Service Span Latency, Network RTT, and TCP Retransmissions. Test by inducing latency (e.g., tc qdisc add) to verify detection.