ctured approach centered on OpenTelemetry (OTel) for vendor neutrality, SLOs for alerting logic, and correlation for root cause analysis.
Step 1: Define Service Level Objectives (SLOs)
Before instrumenting code, define SLOs based on user-centric metrics. Common Service Level Indicators (SLIs) include:
- Availability: Success rate of API requests (e.g., 99.9%).
- Latency: Percentage of requests served within a threshold (e.g., 95% of requests under 200ms, i.e., p95 < 200ms).
- Throughput: Requests per second handled without degradation.
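As a rough illustration of how these three SLIs can be computed from raw request data, consider the sketch below. The `RequestRecord` shape, the `computeSlis` helper, and the threshold values are assumptions made for this example; in practice the numbers would come from your metrics backend rather than from in-memory records.

```typescript
// Hypothetical record shape; field names are assumptions for this sketch.
interface RequestRecord {
  ok: boolean;        // true if the request succeeded (e.g., non-5xx)
  durationMs: number; // end-to-end latency in milliseconds
}

interface SliReport {
  availability: number;    // fraction of successful requests
  withinThreshold: number; // fraction of requests at or under the latency threshold
  throughputRps: number;   // requests per second over the window
}

function computeSlis(
  records: RequestRecord[],
  windowSeconds: number,
  latencyThresholdMs: number,
): SliReport {
  if (records.length === 0 || windowSeconds <= 0) {
    throw new Error('need at least one request and a positive window');
  }
  const okCount = records.filter((r) => r.ok).length;
  const fastCount = records.filter((r) => r.durationMs <= latencyThresholdMs).length;
  return {
    availability: okCount / records.length,
    withinThreshold: fastCount / records.length,
    throughputRps: records.length / windowSeconds,
  };
}

// An SLO is met when the measured SLI is at or above its target.
function meetsSlo(sli: number, target: number): boolean {
  return sli >= target;
}
```

For example, `meetsSlo(report.availability, 0.999)` checks the availability SLI against a 99.9% target.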
Step 2: Implement OpenTelemetry Instrumentation
Use OpenTelemetry to generate telemetry data. OTel provides a unified API for traces, metrics, and logs, ensuring data can be exported to any backend.
TypeScript Implementation Example:
This example demonstrates configuring the OTel Node SDK with auto-instrumentation, custom business metrics, and trace context propagation.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-proto';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
import { trace, metrics, SpanStatusCode } from '@opentelemetry/api';

// Placeholder for the real gateway call; assumed to be implemented elsewhere.
declare function executePaymentGateway(transactionId: string, amount: number): Promise<unknown>;

// 1. Initialize SDK with auto-instrumentation for Express, HTTP, etc.
const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'payment-service',
    [SEMRESATTRS_SERVICE_VERSION]: '1.2.0',
    environment: process.env.NODE_ENV || 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// 2. Custom Business Metric: payment transactions, labeled by outcome.
// The success rate is derived at query time from the `status` label.
const meter = metrics.getMeter('payment-service');
const paymentCounter = meter.createCounter('payment.transactions', {
  description: 'Total number of payment transactions',
});

// 3. Custom Trace: Enriching spans with business context
const tracer = trace.getTracer('payment-service');

export async function processPayment(transactionId: string, amount: number) {
  // startActiveSpan makes this span the active context, so spans created by
  // auto-instrumentation inside the callback become its children.
  return tracer.startActiveSpan('process-payment', async (span) => {
    span.setAttribute('payment.transaction_id', transactionId);
    span.setAttribute('payment.amount', amount);
    try {
      // Simulate payment logic
      const result = await executePaymentGateway(transactionId, amount);
      // Emit metric with labels for aggregation
      paymentCounter.add(1, { status: 'success', currency: 'USD' });
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      // `error` is `unknown` in strict TypeScript; normalize it before use.
      const err = error instanceof Error ? error : new Error(String(error));
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      // Fall back to a fixed value so the label stays low-cardinality.
      const errorType = (err as { code?: string }).code ?? 'unknown';
      paymentCounter.add(1, { status: 'error', currency: 'USD', error_type: errorType });
      throw error;
    } finally {
      span.end();
    }
  });
}
Step 3: Configure Tail-Based Sampling
Full trace collection is cost-prohibitive at scale. Implement tail-based sampling in the OTel Collector, which decides whether to keep a trace after its spans have arrived. This lets you sample on trace attributes (e.g., keep 100% of error traces and 1% of successful traces). Head-based sampling decides at the start of a trace, before the outcome is known, so it often discards critical failure paths.
OTel Collector Configuration Snippet:
processors:
  tail_sampling:
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: latency-policy
        type: latency
        latency: { threshold_ms: 500 }
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
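To make the effect of these three policies concrete, the sketch below models the decision for a completed trace in application code. It is a simplification and does not mirror the Collector's internals: `TraceSummary`, `decideSampling`, and the injected `random` function are all hypothetical names introduced for illustration.

```typescript
// Hypothetical summary of a completed trace.
interface TraceSummary {
  hasError: boolean;  // any span in the trace ended with ERROR status
  durationMs: number; // total trace duration in milliseconds
}

type SamplingDecision = 'keep:error' | 'keep:latency' | 'keep:probabilistic' | 'drop';

// A trace is kept if any policy matches; otherwise it is dropped.
function decideSampling(
  t: TraceSummary,
  latencyThresholdMs: number = 500,
  samplingPercentage: number = 1,
  random: () => number = Math.random, // injectable so the logic is testable
): SamplingDecision {
  if (t.hasError) return 'keep:error';
  if (t.durationMs > latencyThresholdMs) return 'keep:latency';
  if (random() * 100 < samplingPercentage) return 'keep:probabilistic';
  return 'drop';
}
```

The key property is that the decision uses information (error status, total duration) that only exists once the trace has finished, which is why this logic cannot run at the head of the request.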
Step 4: Alerting on Error Budget Burn Rates
Replace static thresholds with burn rate alerts. A burn rate alert triggers when the error budget is being consumed too quickly to last until the end of the measurement window.
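- Fast Burn (Page): The budget is burning far faster than sustainable over a short window, e.g., 14.4x the sustainable rate over 1 hour, which would exhaust a 28-day budget in roughly two days. Page immediately.
- Slow Burn (Ticket): The budget is burning moderately faster than sustainable over a longer window, e.g., 3x over 6 hours, which would exhaust a 28-day budget in roughly nine days. Create a ticket and investigate during business hours.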
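The burn rate itself is just the observed error ratio divided by the error ratio the SLO allows. A minimal sketch, assuming request counts are already available for the alert window (`burnRate` and `hoursToExhaustion` are hypothetical helper names):

```typescript
// Burn rate = observed error ratio / allowed error ratio.
// A burn rate of 1 consumes the budget exactly over the SLO window.
function burnRate(errorCount: number, totalCount: number, sloTarget: number): number {
  if (totalCount <= 0) return 0; // no traffic, nothing burned
  const errorRatio = errorCount / totalCount;
  const allowedErrorRatio = 1 - sloTarget;
  return errorRatio / allowedErrorRatio;
}

// How long the full budget would last if the given burn rate were sustained.
function hoursToExhaustion(rate: number, sloWindowDays: number): number {
  return rate > 0 ? (sloWindowDays * 24) / rate : Infinity;
}
```

For instance, 144 failed requests out of 10,000 against a 99.9% SLO gives a burn rate of about 14.4, and sustaining that rate would exhaust a 28-day budget in under 47 hours.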
Pitfall Guide
1. High Cardinality Explosion
Mistake: Adding unbounded labels to metrics (e.g., user_id, transaction_id, request_path with dynamic IDs).
Impact: Database performance degradation, query timeouts, and massive storage costs.
Fix: Limit label cardinality. Use low-cardinality labels like service_name, endpoint_group, http_method, and status_code. Bucket dynamic values or use traces for high-cardinality data.
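One way to apply this fix is to normalize dynamic values before they become labels. The sketch below is illustrative only: `normalizePath` and `statusClass` are hypothetical helpers, not OTel APIs, and the ID patterns (numeric segments and UUIDs) are assumptions that would need adjusting to your own URL scheme.

```typescript
// Replace dynamic path segments with placeholders so the label set stays bounded.
function normalizePath(path: string): string {
  const uuidPattern = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
  return path
    .split('?')[0] // drop the query string
    .split('/')
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ':id';
      if (uuidPattern.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}

// Bucket status codes into classes (2xx, 4xx, 5xx, ...).
function statusClass(statusCode: number): string {
  return `${Math.floor(statusCode / 100)}xx`;
}
```

The raw path and exact status code can still be attached to the span, where high cardinality is acceptable.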
2. Alerting on Causes, Not Symptoms
Mistake: Alerting on "CPU High" or "Memory High" without correlating to request latency or error rates.
Impact: Waking engineers for autoscaling events or batch jobs that do not impact users.
Fix: Alert on SLO violations. If CPU is high but latency and error rates are healthy, do not page.
3. Ignoring Tail Latency
Mistake: Relying on average latency metrics.
Impact: Averages mask the experience of the slowest 1% of requests, which may belong to power users or paying customers.
Fix: Monitor p95 and p99 latencies. Ensure instrumentation captures histogram data for accurate percentile calculation.
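A small worked example shows how far the average can drift from the tail. This sketch computes percentiles from raw samples with the nearest-rank method, purely for illustration; a production setup would typically derive them from histogram buckets. The function names are assumptions for this example.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// 98 fast requests at 50ms and 2 slow requests at 3000ms.
const latencySamples: number[] = [];
for (let i = 0; i < 98; i++) latencySamples.push(50);
for (let i = 0; i < 2; i++) latencySamples.push(3000);

const meanLatency = mean(latencySamples);          // 109
const p95Latency = percentile(latencySamples, 95); // 50
const p99Latency = percentile(latencySamples, 99); // 3000
```

The mean of 109ms looks healthy, and even p95 reports 50ms, while p99 exposes the 3-second requests.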
4. Lack of Log-Trace Correlation
Mistake: Logs and traces are stored separately without a shared correlation ID.
Impact: During an incident, engineers must manually search logs after finding a trace, increasing MTTR.
Fix: Inject the trace_id and span_id into every log entry. Ensure the logging library automatically enriches logs with active span context.
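The enrichment step can be sketched as a small formatter. `TraceIds` and `formatLogLine` are hypothetical names for this example; in a real service the IDs would typically be read from the active span's context (for example via `trace.getSpan(context.active())?.spanContext()` from `@opentelemetry/api`) rather than passed in by hand.

```typescript
// Mirrors the traceId/spanId fields of an OpenTelemetry SpanContext.
interface TraceIds {
  traceId: string;
  spanId: string;
}

// Produce a structured JSON log line, adding trace_id and span_id when available.
function formatLogLine(
  level: string,
  message: string,
  ids?: TraceIds,
  fields: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    level,
    message,
    ...fields,
    ...(ids ? { trace_id: ids.traceId, span_id: ids.spanId } : {}),
  });
}
```

With `trace_id` present on every log line, an engineer can jump from a slow trace straight to its logs with a single filter.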
5. Static Thresholds in Dynamic Environments
Mistake: Using fixed thresholds (e.g., "Alert if error rate > 1%") in systems with variable traffic.
Impact: False positives during low-traffic periods or missed alerts during traffic spikes.
Fix: Use rate-based alerts and anomaly detection for baseline metrics. For SLOs, use burn rates, which normalize over traffic volume.
6. PII Leakage in Telemetry
Mistake: Logging user emails, passwords, or payment details in traces or logs.
Impact: GDPR/CCPA violations, security breaches, and compliance fines.
Fix: Implement scrubbing middleware in the OTel exporter or use attribute processors to redact sensitive fields before export.
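A minimal sketch of key-based redaction is shown below. The `redactAttributes` helper and the key pattern are assumptions for illustration; a real deployment would more likely configure redaction in the Collector or in a span processor, and would need a key list reviewed against its own data model.

```typescript
type AttributeValue = string | number | boolean;

// Assumed list of sensitive key fragments; extend to match your own schema.
const SENSITIVE_KEY_PATTERN = /(email|password|card|ssn|token)/i;

// Return a copy of the attributes with sensitive values replaced.
function redactAttributes(
  attrs: Record<string, AttributeValue>,
): Record<string, AttributeValue> {
  const result: Record<string, AttributeValue> = {};
  for (const key of Object.keys(attrs)) {
    result[key] = SENSITIVE_KEY_PATTERN.test(key) ? '[REDACTED]' : attrs[key];
  }
  return result;
}
```

Key-based matching only catches sensitive data stored under recognizable keys; free-text fields such as error messages need separate handling.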
7. Alert Silence Without Resolution
Mistake: Silencing alerts in the monitoring tool without fixing the underlying issue.
Impact: Technical debt accumulation and blind spots during real incidents.
Fix: Enforce a process where alerts must be acknowledged with a ticket or remediation action. Use runbooks linked to alerts.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-throughput API Gateway | Prometheus + OTel Metrics + Grafana | Low overhead, high query flexibility for rates/histograms. | Low |
| Complex Distributed Transaction | Distributed Tracing + Tail Sampling | Essential for root cause across service boundaries. | Medium/High |
| Batch Processing / ETL | Custom Metrics + Dead Man Switches | Latency is less critical; completion and data integrity matter. | Low |
| Multi-tenant SaaS | SLOs per Tenant Tier | High-value tenants require stricter SLAs and alerting. | Medium |
| Legacy Monolith Migration | Strangler Fig + Dual Instrumentation | Maintain visibility during transition; compare old vs new metrics. | Medium |
Configuration Template
Prometheus SLO Burn Rate Alerting Rules
Copy this configuration to implement burn rate alerting for a 99.9% availability SLO over a 28-day window. It records burn rates over four windows (1h, 6h, 1d, 3d) and alerts on the 1h and 6h rates.
groups:
  - name: payment-service-slo
    rules:
      # 1. Recording Rules: Calculate error budget burn rates
      - record: slo:error_budget_burn_rate:1h
        expr: |
          (
            sum(rate(http_requests_total{job="payment-service", status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="payment-service"}[1h]))
          ) / (1 - 0.999)
      - record: slo:error_budget_burn_rate:6h
        expr: |
          (
            sum(rate(http_requests_total{job="payment-service", status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{job="payment-service"}[6h]))
          ) / (1 - 0.999)
      - record: slo:error_budget_burn_rate:1d
        expr: |
          (
            sum(rate(http_requests_total{job="payment-service", status=~"5.."}[1d]))
            /
            sum(rate(http_requests_total{job="payment-service"}[1d]))
          ) / (1 - 0.999)
      - record: slo:error_budget_burn_rate:3d
        expr: |
          (
            sum(rate(http_requests_total{job="payment-service", status=~"5.."}[3d]))
            /
            sum(rate(http_requests_total{job="payment-service"}[3d]))
          ) / (1 - 0.999)
      # 2. Alerting Rules: Define thresholds based on burn rate multiples
      # Fast Burn: 14.4x burn rate over 1h (consumes about 2% of the 28-day budget in 1h)
      - alert: SLOFastBurn
        expr: slo:error_budget_burn_rate:1h > 14.4
        for: 2m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "Payment Service SLO burn rate critical"
          description: "Error budget is being consumed at more than 14.4x the sustainable rate. About 2% of the budget is lost per hour."
      # Slow Burn: 3x burn rate over 6h (consumes about 2.7% of the 28-day budget in 6h)
      - alert: SLOSlowBurn
        expr: slo:error_budget_burn_rate:6h > 3
        for: 15m
        labels:
          severity: warning
          ticket: "true"
        annotations:
          summary: "Payment Service SLO burn rate elevated"
          description: "Error budget is being consumed at more than 3x the sustainable rate. Investigate during business hours."
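The relationship between a burn rate threshold, the alert window, and the share of budget consumed is simple arithmetic, sketched below so thresholds can be re-derived for other SLO windows. Both function names are hypothetical helpers for this example.

```typescript
// Fraction of the total error budget consumed when `burnRateMultiple`
// is sustained for `alertWindowHours` within an SLO window of `sloWindowDays`.
function budgetConsumedFraction(
  burnRateMultiple: number,
  alertWindowHours: number,
  sloWindowDays: number,
): number {
  return (burnRateMultiple * alertWindowHours) / (sloWindowDays * 24);
}

// Inverse: the burn rate threshold that consumes `budgetFraction`
// of the budget within `alertWindowHours`.
function burnRateThreshold(
  budgetFraction: number,
  alertWindowHours: number,
  sloWindowDays: number,
): number {
  return (budgetFraction * sloWindowDays * 24) / alertWindowHours;
}
```

With a 28-day window, `budgetConsumedFraction(14.4, 1, 28)` is about 0.021 and `budgetConsumedFraction(3, 6, 28)` is about 0.027. The 14.4 figure corresponds to exactly 2% of a 30-day budget in one hour, so it is only approximate for 28 days.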
Quick Start Guide
- Install OTel Dependencies:
npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/sdk-metrics @opentelemetry/resources @opentelemetry/semantic-conventions @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-proto @opentelemetry/exporter-metrics-otlp-proto
- Initialize SDK in Entry Point:
Add the initialization code from the Core Solution to your application's startup file, before other modules are loaded, so auto-instrumentation can patch them. Ensure NODE_ENV=production is set for correct resource attributes.
- Deploy OTel Collector:
Run the OpenTelemetry Collector as a sidecar or daemonset. Configure the tail_sampling processor and otlp receivers/exporters pointing to your backend (e.g., Grafana Cloud, Datadog, or self-hosted Prometheus/Tempo).
- Verify Data Flow:
Generate traffic to your service. Query your backend for http_requests_total metrics and verify traces appear in the trace explorer with correct service names and spans. Confirm trace_id appears in log entries.