SLO Alerting with OpenTelemetry and Prometheus
Beyond Thresholds: Implementing Error Budget Burn-Rate Alerting with OpenTelemetry and Prometheus
Current Situation Analysis
Modern distributed systems generate telemetry at a scale that traditional monitoring paradigms cannot sustainably handle. Engineering teams routinely configure static thresholds—CPU utilization above 80%, p95 latency exceeding 500ms, or error rates crossing 5%—only to discover that these rules trigger constantly during routine deployments, traffic spikes, or minor backend degradation. The result is alert fatigue: engineers mute channels, ignore pages, and eventually miss genuine outages.
This problem persists because threshold-based alerting measures infrastructure health rather than user-facing reliability. A server can be running at 90% CPU while serving requests flawlessly, or it can be idling at 10% CPU while a database connection pool exhaustion silently drops 30% of user transactions. When alerting is decoupled from actual service reliability, teams waste cognitive bandwidth on symptoms instead of business impact.
The misunderstanding runs deeper: many organizations document Service Level Objectives (SLOs) in wikis or compliance reports but never operationalize them into alerting pipelines. SLOs are treated as retrospective reporting metrics rather than proactive control mechanisms. Industry telemetry from incident management platforms consistently shows that 60–75% of production alerts are either false positives or low-severity noise, directly correlating with teams that rely on static thresholds instead of error budget consumption models.
OpenTelemetry solves the data collection fragmentation problem by providing a vendor-neutral standard for metrics, traces, and logs. Prometheus solves the query and alerting problem with a powerful time-series database and rule engine. When combined, they enable a shift from reactive threshold monitoring to proportional, budget-aware alerting that aligns engineering toil with actual user experience degradation.
WOW Moment: Key Findings
The fundamental shift from static thresholds to burn-rate alerting changes how engineering teams allocate attention. Instead of waking up for every metric spike, alerts fire only when the system is consuming its reliability budget faster than sustainable.
| Approach | Alert Volume (Weekly) | False Positive Rate | Alignment with User Impact | Operational Toil |
|---|---|---|---|---|
| Static Thresholds | 45–120 | 65–80% | Low (infrastructure-focused) | High (constant triage) |
| SLO Burn-Rate | 8–15 | 10–20% | High (user-experience focused) | Low (proportional response) |
This finding matters because it transforms alerting from a noise generator into a reliability governor. Burn-rate alerting ensures that pages only trigger when the error budget is being depleted at a pace that threatens the monthly SLO target. It enables proportional alerting: fast burn rates trigger immediate pages, slow burn rates trigger next-day tickets, and normal consumption triggers nothing. This directly reduces on-call burnout while improving mean time to resolution (MTTR) for genuine reliability events.
Core Solution
Implementing burn-rate alerting requires aligning telemetry collection, metric computation, and alert routing into a cohesive pipeline. The architecture follows four logical phases: contract definition, signal collection, budget computation, and proportional alerting.
Step 1: Define the Reliability Contract
Before writing any rules, establish the SLO target and measurement window. A standard contract includes:
- Service: The boundary of what you're measuring (e.g., `checkout-api`)
- SLI (Service Level Indicator): The metric representing user experience (e.g., successful HTTP responses)
- SLO Target: The acceptable reliability threshold (e.g., 99.5% success rate)
- Measurement Window: The rolling period for budget calculation (typically 30 days)
The error budget is simply `1 - SLO_target`. For a 99.5% target, the budget is 0.5%. This budget represents the maximum allowable failure rate over the measurement window.
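To make the contract executable rather than aspirational, some teams version it alongside the service code. The sketch below is illustrative only: the schema, field names, and traffic figures are assumptions, not a standard format.

```yaml
# slo-contract.yaml (illustrative sketch, not a standard schema)
service: checkout-api
sli: http_request_success_ratio
slo_target: 0.995      # 99.5% of requests must succeed
window_days: 30
error_budget: 0.005    # 1 - slo_target: 0.5% of requests may fail
# At ~10M requests per 30-day window, the budget allows ~50,000 failed
# requests, or roughly 3.6 hours of total downtime if traffic were uniform.
```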
Step 2: Instrument with OpenTelemetry
OpenTelemetry standardizes how applications emit metrics. Instead of vendor-specific SDKs, you configure a single collector pipeline that receives metrics from your services and exports them to Prometheus. The key is emitting a counter metric that tracks request outcomes with consistent labels.
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_max_size: 1000

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "platform"
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
Applications increment a counter such as `platform_api_request_total` with `status` (success/failure) and `endpoint` labels. The OpenTelemetry SDK and Collector handle metric aggregation, batching, and OTLP protocol compliance, ensuring Prometheus receives clean, standardized time-series data; keeping labels low-cardinality remains the application's responsibility.
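Assuming a counter registered as `api_request` in the SDK and the `platform` namespace from the collector config above, the exporter's scrape endpoint would expose something like the sample below (illustrative, not verbatim output; the `job` label used in later queries is attached by Prometheus at scrape time):

```
# HELP platform_api_request_total Request outcomes for SLI computation
# TYPE platform_api_request_total counter
platform_api_request_total{endpoint="/checkout",status="success"} 18734
platform_api_request_total{endpoint="/checkout",status="failure"} 41
```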
Step 3: Compute SLI and Burn Rate via Recording Rules
Prometheus recording rules precompute expensive PromQL expressions at fixed intervals. This reduces query latency and ensures alerting rules evaluate against consistent, cached values.
```yaml
# prometheus_slo_rules.yml
groups:
  - name: platform.slo.checkout
    interval: 30s
    rules:
      # SLIs: success rate over each window referenced by the alerts
      - record: sli:checkout_success_rate:5m
        expr: |
          sum(rate(platform_api_request_total{job="checkout-api", status="success"}[5m]))
          /
          clamp_min(sum(rate(platform_api_request_total{job="checkout-api"}[5m])), 1)
      - record: sli:checkout_success_rate:30m
        expr: |
          sum(rate(platform_api_request_total{job="checkout-api", status="success"}[30m]))
          /
          clamp_min(sum(rate(platform_api_request_total{job="checkout-api"}[30m])), 1)
      - record: sli:checkout_success_rate:1h
        expr: |
          sum(rate(platform_api_request_total{job="checkout-api", status="success"}[1h]))
          /
          clamp_min(sum(rate(platform_api_request_total{job="checkout-api"}[1h])), 1)
      - record: sli:checkout_success_rate:6h
        expr: |
          sum(rate(platform_api_request_total{job="checkout-api", status="success"}[6h]))
          /
          clamp_min(sum(rate(platform_api_request_total{job="checkout-api"}[6h])), 1)

      # Burn rate: error ratio relative to the 0.5% budget
      # (1.0 = sustainable pace; the window's share of the 30-day
      # budget cancels out of the consumed/expected division)
      - record: slo:checkout_burn_rate:5m
        expr: (1 - sli:checkout_success_rate:5m) / (1 - 0.995)
      - record: slo:checkout_burn_rate:30m
        expr: (1 - sli:checkout_success_rate:30m) / (1 - 0.995)
      - record: slo:checkout_burn_rate:1h
        expr: (1 - sli:checkout_success_rate:1h) / (1 - 0.995)
      - record: slo:checkout_burn_rate:6h
        expr: (1 - sli:checkout_success_rate:6h) / (1 - 0.995)

      # Fraction of the 30-day budget consumed in the last hour (dashboards)
      - record: slo:checkout_budget_consumed:ratio
        expr: slo:checkout_burn_rate:1h * (3600 / (30 * 24 * 3600))
```
Architecture Rationale:
- `clamp_min(..., 1)` in the denominator prevents division-by-zero during low-traffic periods or deployment windows.
- The burn rate divides the budget actually consumed in a window by the consumption a sustainable pace allows. A 1-hour window is allotted `1/(30*24)` of the 30-day budget; because both sides scale by the same time fraction, the expression reduces to `(1 - SLI) / (1 - target)`. A value above 1.0 means the budget is depleting faster than the SLO window can absorb.
- Recording rules run every 30 seconds, so alerting rules evaluate against fresh, pre-aggregated data without repeatedly querying raw counters.
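Before wiring alerts, it is worth spot-checking the arithmetic in the Prometheus expression browser. With the recording rules above loaded, a hypothetical 2% error ratio over the last hour should yield 0.02 / 0.005 = 4, i.e., the 30-day budget gone in roughly 7.5 days:

```promql
# Should return the same value as slo:checkout_burn_rate:1h
(1 - sli:checkout_success_rate:1h) / (1 - 0.995)
```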
Step 4: Configure Multi-Window Burn-Rate Alerts
Google's SRE methodology recommends multi-window, multi-burn-rate alerting to distinguish between fast failures and slow degradation. A single window creates either too much noise or too much delay.
```yaml
# prometheus_alert_rules.yml
groups:
  - name: platform.alerts.checkout
    rules:
      # Fast burn: 14x rate → budget exhausted in ~2 days
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          slo:checkout_burn_rate:1h > 14
          and
          slo:checkout_burn_rate:5m > 14
        for: 2m
        labels:
          severity: page
          team: platform-reliability
        annotations:
          summary: "Checkout API error budget burning at 14×"
          description: "At the current rate, the 30-day budget behind the 99.5% SLO will be exhausted in ~2 days. Immediate investigation required."

      # Slow burn: 3x rate → budget exhausted in ~10 days
      - alert: CheckoutErrorBudgetSlowBurn
        expr: |
          slo:checkout_burn_rate:6h > 3
          and
          slo:checkout_burn_rate:30m > 3
        for: 15m
        labels:
          severity: ticket
          team: platform-reliability
        annotations:
          summary: "Checkout API error budget burning at 3×"
          description: "SLO at risk over the next 10 days. Schedule investigation during business hours."
```
Why Multi-Window?
- The fast burn rule (`1h` + `5m`) catches sudden failures (e.g., database failover, deployment regression) without triggering on transient spikes.
- The slow burn rule (`6h` + `30m`) catches gradual degradation (e.g., memory leaks, cache eviction storms) that wouldn't justify a page but will exhaust the budget if unaddressed.
- The `for` duration acts as a debounce mechanism, ensuring the burn rate is sustained before alerting.
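These thresholds translate directly into time-to-exhaustion: at a sustained burn rate of b, a 30-day budget lasts 30/b days, so 14× exhausts it in about two days and 3× in ten. A dashboard-friendly projection is sketched below; the 0.001 floor is an arbitrary guard against division by zero, not a standard value:

```promql
# Projected days until budget exhaustion at the current 1-hour burn pace
30 / clamp_min(slo:checkout_burn_rate:1h, 0.001)
```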
Pitfall Guide
1. Single-Window Burn Rate Triggers
Explanation: Using only one time window (e.g., 1 hour) causes either false positives during brief traffic anomalies or delayed detection of slow degradation.
Fix: Implement the standard two-tier pattern: fast burn (short window + short evaluation) for pages, slow burn (long window + longer evaluation) for tickets. Always require both windows to exceed the threshold simultaneously.
2. Ignoring Request Volume Thresholds
Explanation: Burn rate calculations become mathematically unstable when request volume drops near zero. A single failed request out of two can trigger a 100% error rate and massive burn rate.
Fix: Add a volume guard to alert expressions: `sum(rate(platform_api_request_total[5m])) > 10`. This ensures alerts only fire when sufficient traffic exists to make the SLI statistically meaningful, as in the sketch below.
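Combined with the fast-burn condition from Step 4, the guarded expression could look like this sketch (the 10 req/s floor is an example value, not a recommendation):

```promql
(slo:checkout_burn_rate:1h > 14 and slo:checkout_burn_rate:5m > 14)
and
sum(rate(platform_api_request_total{job="checkout-api"}[5m])) > 10
```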
3. Misaligned Measurement Windows
Explanation: Calculating burn rate over a 1-hour window while the SLO measurement period is 30 days creates mathematical inconsistency. The expected consumption rate must match the SLO window.
Fix: Always normalize burn rate against the full SLO window. If your SLO uses a 30-day rolling window, divide the fraction of budget actually consumed in the alert window by `alert_window_seconds / (30 * 24 * 3600)` to get the true multiplier. For example, an hour that consumes 1/72 of the budget against an expected 1/720 is burning at 10×.
4. Alerting on SLI Instead of Burn Rate
Explanation: Teams often alert when the success rate drops below 99.5%, which is functionally identical to a static threshold. This ignores how much budget remains and how fast it is being consumed.
Fix: Alert exclusively on burn rate multipliers. The SLI is for dashboards and reporting; the burn rate is for alerting. This decouples threshold noise from budget consumption.
5. Missing Budget Tracking Dashboards
Explanation: Without visibility into remaining budget, teams cannot make informed decisions about feature releases, deployment freezes, or incident prioritization.
Fix: Build a dedicated SLO dashboard showing current burn rate, remaining budget percentage, days until exhaustion, and historical consumption trends. Integrate it into deployment gates and incident command workflows.
6. Over-Engineering OTel Pipelines Early
Explanation: Teams attempt to collect traces, logs, and metrics simultaneously, causing high cardinality, storage bloat, and collector instability before establishing baseline reliability.
Fix: Start with metrics-only pipelines for SLO alerting. Add traces only when alerting identifies a specific latency or error path requiring root-cause analysis. Add logs only for compliance or audit requirements.
7. Static Error Budget Reset Logic
Explanation: Treating the error budget as a monthly reset without accounting for carryover or deployment windows leads to gaming the system or unnecessary feature freezes.
Fix: Implement budget carryover policies (e.g., unused budget rolls over up to 10%), and define explicit deployment freeze triggers when the burn rate exceeds 3× for more than 24 hours; a sketch of such a trigger follows. Document these policies in the team's reliability runbook.
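A sketch of such a freeze trigger, reusing the recording rules from Step 3 (the rule name, thresholds, and `action` label are illustrative policy choices, not a standard):

```yaml
- alert: CheckoutDeploymentFreeze
  expr: avg_over_time(slo:checkout_burn_rate:6h[24h]) > 3
  labels:
    severity: ticket
    action: deployment-freeze
  annotations:
    summary: "Burn rate above 3× for 24h; apply deployment freeze per runbook"
```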
Production Bundle
Action Checklist
- Define SLO target, SLI metric, and measurement window for each critical service
- Deploy OpenTelemetry collector with OTLP receiver and Prometheus exporter
- Instrument applications to emit outcome counters with consistent status labels
- Create Prometheus recording rules for SLI calculation with volume guards
- Implement multi-window burn rate rules (fast + slow) with appropriate `for` durations
- Route alerts to PagerDuty/Opsgenie with severity-based escalation policies (a routing sketch follows this checklist)
- Build SLO tracking dashboard showing budget consumption and burn rate trends
- Document budget freeze policies and integrate with CI/CD deployment gates
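A minimal Alertmanager routing sketch matching the `severity` labels used throughout this article; the receiver names, PagerDuty key, and webhook URL are placeholders:

```yaml
# alertmanager.yml (fragment)
route:
  receiver: default
  routes:
    - matchers: ['severity="page"']
      receiver: oncall-pager
    - matchers: ['severity="ticket"']
      receiver: issue-tracker
receivers:
  - name: default
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: <pagerduty-events-v2-key>
  - name: issue-tracker
    webhook_configs:
      - url: https://tickets.example.internal/webhook
```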
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Customer-facing API with strict latency requirements | Multi-window burn rate (14x/1h + 3x/6h) | Catches both sudden failures and gradual degradation without noise | Low (Prometheus recording rules are CPU-efficient) |
| Internal batch processing with SLA of 99% | Single-window burn rate (5x/30m) | Simpler rules suffice for non-user-facing workloads; reduces rule maintenance | Minimal (fewer recording rules) |
| High-cardinality service with dynamic routing | Volume-guarded burn rate + label filtering | Prevents alert storms from low-traffic edge cases; focuses on core paths | Moderate (requires careful label design in OTel) |
| Startup with limited on-call capacity | Slow-burn only (3x/12h) + daily digest | Reduces page frequency while maintaining budget visibility; aligns with smaller team size | Low (fewer alerts, less tooling overhead) |
Configuration Template
```yaml
# prometheus_slo_pipeline.yml
groups:
  - name: reliability.slo.core
    interval: 30s
    rules:
      # Guarded SLI calculation over each window used downstream
      - record: sli:core_api_success:5m
        expr: |
          sum(rate(core_api_requests_total{status="ok"}[5m]))
          /
          clamp_min(sum(rate(core_api_requests_total[5m])), 5)
      - record: sli:core_api_success:1h
        expr: |
          sum(rate(core_api_requests_total{status="ok"}[1h]))
          /
          clamp_min(sum(rate(core_api_requests_total[1h])), 5)
      - record: sli:core_api_success:6h
        expr: |
          sum(rate(core_api_requests_total{status="ok"}[6h]))
          /
          clamp_min(sum(rate(core_api_requests_total[6h])), 5)

      # Burn rate = error ratio / budget (0.995 target → 0.005 budget)
      - record: slo:core_burn:1h
        expr: (1 - sli:core_api_success:1h) / 0.005
      - record: slo:core_burn:6h
        expr: (1 - sli:core_api_success:6h) / 0.005

  - name: reliability.alerts.core
    rules:
      - alert: CoreAPI_BudgetFastBurn
        expr: |
          slo:core_burn:1h > 14
          and
          sum(rate(core_api_requests_total[5m])) > 20
        for: 2m
        labels:
          severity: page
          team: reliability
        annotations:
          summary: "Core API budget burning at 14×"
          description: "Budget exhaustion projected in ~2 days at this rate. Investigate error sources immediately."
      - alert: CoreAPI_BudgetSlowBurn
        expr: |
          slo:core_burn:6h > 3
          and
          sum(rate(core_api_requests_total[30m])) > 50
        for: 15m
        labels:
          severity: ticket
          team: reliability
        annotations:
          summary: "Core API budget burning at 3×"
          description: "Schedule investigation. Budget exhaustion projected in ~10 days."
```
Quick Start Guide
- Deploy the Collector: Run the OpenTelemetry Collector with the provided YAML configuration. Verify metrics are exposed at `http://localhost:8889/metrics` and contain your custom counters.
- Load Recording Rules: Place the `prometheus_slo_pipeline.yml` file in Prometheus's rule directory and reload the configuration. Confirm the recording rules appear in the Prometheus UI under the Rules tab.
- Validate Burn Rate Math: Simulate traffic using a load generator. Check that `sli:core_api_success:5m` reflects your success ratio and that `slo:core_burn:1h` yields the expected multiplier against the 30-day window (see the `promtool` test sketch after this list).
- Activate Alerts: Enable the alert rules and configure Alertmanager to route `page` severity to the on-call rotation and `ticket` severity to your issue tracker. Verify debounce behavior by triggering brief error spikes.
- Monitor Budget: Add the recording rule metrics to a Grafana dashboard. Track remaining budget percentage, current burn rate, and historical consumption to inform deployment and incident decisions.
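To exercise the burn-rate math offline, Prometheus's `promtool test rules` can replay synthetic series against the rule file. A minimal sketch, assuming the template above is saved as `prometheus_slo_pipeline.yml`; the traffic values are fabricated to produce a 2% error ratio, i.e., an expected burn rate of 4:

```yaml
# slo_rules_test.yml (run with: promtool test rules slo_rules_test.yml)
rule_files:
  - prometheus_slo_pipeline.yml
evaluation_interval: 30s
tests:
  - interval: 1m
    input_series:
      # 9.8 req/s ok + 0.2 req/s errors: 2% error ratio, above the 5 req/s clamp
      - series: 'core_api_requests_total{status="ok"}'
        values: '0+588x120'
      - series: 'core_api_requests_total{status="error"}'
        values: '0+12x120'
    promql_expr_test:
      - expr: slo:core_burn:1h
        eval_time: 2h
        exp_samples:
          - labels: 'slo:core_burn:1h'
            value: 4  # 0.02 / 0.005; minor float drift is possible
```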
