SLO Alerting with OpenTelemetry and Prometheus
Beyond Thresholds: Implementing Error Budget Burn-Rate Alerting with OpenTelemetry and Prometheus
Current Situation Analysis
Modern distributed systems generate telemetry at a scale that traditional monitoring paradigms cannot sustainably handle. Engineering teams routinely configure static thresholds—CPU utilization above 80%, p95 latency exceeding 500ms, or error rates crossing 5%—only to discover that these rules trigger constantly during routine deployments, traffic spikes, or minor backend degradation. The result is alert fatigue: engineers mute channels, ignore pages, and eventually miss genuine outages.
This problem persists because threshold-based alerting measures infrastructure health rather than user-facing reliability. A server can be running at 90% CPU while serving requests flawlessly, or it can be idling at 10% CPU while a database connection pool exhaustion silently drops 30% of user transactions. When alerting is decoupled from actual service reliability, teams waste cognitive bandwidth on symptoms instead of business impact.
The misunderstanding runs deeper: many organizations document Service Level Objectives (SLOs) in wikis or compliance reports but never operationalize them into alerting pipelines. SLOs are treated as retrospective reporting metrics rather than proactive control mechanisms. Industry telemetry from incident management platforms consistently shows that 60–75% of production alerts are either false positives or low-severity noise, directly correlating with teams that rely on static thresholds instead of error budget consumption models.
OpenTelemetry solves the data collection fragmentation problem by providing a vendor-neutral standard for metrics, traces, and logs. Prometheus solves the query and alerting problem with a powerful time-series database and rule engine. When combined, they enable a shift from reactive threshold monitoring to proportional, budget-aware alerting that aligns engineering toil with actual user experience degradation.
WOW Moment: Key Findings
The fundamental shift from static thresholds to burn-rate alerting changes how engineering teams allocate attention. Instead of waking up for every metric spike, alerts fire only when the system is consuming its reliability budget faster than sustainable.
| Approach | Alert Volume (Weekly) | False Positive Rate | Alignment with User Impact | Operational Toil |
|---|---|---|---|---|
| Static Thresholds | 45–120 | 65–80% | Low (infrastructure-focused) | High (constant triage) |
| SLO Burn-Rate | 8–15 | 10–20% | High (user-experience focused) | Low (proportional response) |
This finding matters because it transforms alerting from a noise generator into a reliability governor. Burn-rate alerting ensures that pages only trigger when the error budget is being depleted at a pace that threatens the monthly SLO target. It enables proportional alerting: fast burn rates trigger immediate pages, slow burn rates trigger next-day tickets, and normal consumption triggers nothing. This directly reduces on-call burnout while improving mean time to resolution (MTTR) for genuine reliability events.
Core Solution
Implementing burn-rate alerting requires aligning telemetry collection, metric computation, and alert routing into a cohesive pipeline. The architecture follows four logical phases: contract definition, signal collection, budget computation, and proportional alerting.
Step 1: Define the Reliability Contract
Before writing any rules, establish the SLO target and measurement window. A standard contract includes:
- Service: The boundary of what you're measuring (e.g., `checkout-api`)
- SLI (Service Level Indicator): The metric representing user experience (e.g., successful HTTP responses)
- SLO Target: The acceptable reliability threshold (e.g., 99.5% success rate)
- Measurement Window: The rolling period for budget calculation (typically 30 days)
The error budget is simply `1 - SLO_target`. For a 99.5% target, the budget is 0.5%. This budget represents the maximum allowable failure rate over the measurement window.
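To make the contract executable rather than aspirational, some teams version it alongside the service code. The sketch below is illustrative only: the schema, field names, and traffic figures are assumptions, not a standard format.

```yaml
# slo-contract.yaml (illustrative sketch, not a standard schema)
service: checkout-api
sli: http_request_success_ratio
slo_target: 0.995      # 99.5% of requests must succeed
window_days: 30
error_budget: 0.005    # 1 - slo_target: 0.5% of requests may fail
# At ~10M requests per 30-day window, the budget allows ~50,000 failed
# requests, or roughly 3.6 hours of total downtime if traffic were uniform.
```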
Step 2: Instrument with OpenTelemetry
OpenTelemetry standardizes how applications emit metrics. Instead of vendor-specific SDKs, you configure a single collector pipeline that receives metrics from your services and exports them to Prometheus. The key is emitting a counter metric that tracks request outcomes with consistent labels.
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_max_size: 1000

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "platform"
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
Applications increment a counter such as `platform_api_request_total` with `status` (success/failure) and `endpoint` labels. The OpenTelemetry SDK and Collector handle metric aggregation, batching, and OTLP protocol compliance, ensuring Prometheus receives clean, standardized time-series data; keeping labels low-cardinality remains the application's responsibility.
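Assuming a counter registered as `api_request` in the SDK and the `platform` namespace from the collector config above, the exporter's scrape endpoint would expose something like the sample below (illustrative, not verbatim output; the `job` label used in later queries is attached by Prometheus at scrape time):

```
# HELP platform_api_request_total Request outcomes for SLI computation
# TYPE platform_api_request_total counter
platform_api_request_total{endpoint="/checkout",status="success"} 18734
platform_api_request_total{endpoint="/checkout",status="failure"} 41
```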
Step 3: Compute SLI and Burn Rate via Recording Rules
Prometheus recording rules precompute expensive PromQL expressions at fixed intervals. This reduces query latency and ensures alerting rules evaluate against consistent, cached values.
```yaml
# prometheus_slo_rules.yml
groups:
  - name: platform.slo.checkout
    interval: 30s
    rules:
      # SLIs: success rate over each window referenced by the alerts
      - record: sli:checkout_success_rate:5m
        expr: |
          sum(rate(platform_api_request_total{job="checkout-api", status="success"}[5m]))
          /
          clamp_min(sum(rate(platform_api_request_total{job="checkout-api"}[5m])), 1)
      - record: sli:checkout_success_rate:30m
        expr: |
          sum(rate(platform_api_request_total{job="checkout-api", status="success"}[30m]))
          /
          clamp_min(sum(rate(platform_api_request_total{job="checkout-api"}[30m])), 1)
      - record: sli:checkout_success_rate:1h
        expr: |
          sum(rate(platform_api_request_total{job="checkout-api", status="success"}[1h]))
          /
          clamp_min(sum(rate(platform_api_request_total{job="checkout-api"}[1h])), 1)
      - record: sli:checkout_success_rate:6h
        expr: |
          sum(rate(platform_api_request_total{job="checkout-api", status="success"}[6h]))
          /
          clamp_min(sum(rate(platform_api_request_total{job="checkout-api"}[6h])), 1)

      # Burn rate: error ratio relative to the 0.5% budget
      # (1.0 = sustainable pace; the window's share of the 30-day
      # budget cancels out of the consumed/expected division)
      - record: slo:checkout_burn_rate:5m
        expr: (1 - sli:checkout_success_rate:5m) / (1 - 0.995)
      - record: slo:checkout_burn_rate:30m
        expr: (1 - sli:checkout_success_rate:30m) / (1 - 0.995)
      - record: slo:checkout_burn_rate:1h
        expr: (1 - sli:checkout_success_rate:1h) / (1 - 0.995)
      - record: slo:checkout_burn_rate:6h
        expr: (1 - sli:checkout_success_rate:6h) / (1 - 0.995)

      # Fraction of the 30-day budget consumed in the last hour (dashboards)
      - record: slo:checkout_budget_consumed:ratio
        expr: slo:checkout_burn_rate:1h * (3600 / (30 * 24 * 3600))
```
Architecture Rationale:
- `clamp_min(..., 1)` in the denominator prevents division-by-zero during low-traffic periods or deployment windows.
- The burn rate divides the budget actually consumed in a window by the consumption a sustainable pace allows. A 1-hour window is allotted `1/(30*24)` of the 30-day budget; because both sides scale by the same time fraction, the expression reduces to `(1 - SLI) / (1 - target)`. A value above 1.0 means the budget is depleting faster than the SLO window can absorb.
- Recording rules run every 30 seconds, so alerting rules evaluate against fresh, pre-aggregated data without repeatedly querying raw counters.
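Before wiring alerts, it is worth spot-checking the arithmetic in the Prometheus expression browser. With the recording rules above loaded, a hypothetical 2% error ratio over the last hour should yield 0.02 / 0.005 = 4, i.e., the 30-day budget gone in roughly 7.5 days:

```promql
# Should return the same value as slo:checkout_burn_rate:1h
(1 - sli:checkout_success_rate:1h) / (1 - 0.995)
```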
Step 4: Configure Multi-Window Burn-Rate Alerts
Google's SRE methodology recommends multi-window, multi-burn-rate alerting to distinguish between fast failures and slow degradation. A single window creates either too much noise or too much delay.
```yaml
# prometheus_alert_rules.yml
groups:
  - name: platform.alerts.checkout
    rules:
      # Fast burn: 14x rate → budget exhausted in ~2 days
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          slo:checkout_burn_rate:1h > 14
          and
          slo:checkout_burn_rate:5m > 14
        for: 2m
        labels:
          severity: page
          team: platform-reliability
        annotations:
          summary: "Checkout API error budget burning at 14×"
          description: "At the current rate, the 30-day budget behind the 99.5% SLO will be exhausted in ~2 days. Immediate investigation required."

      # Slow burn: 3x rate → budget exhausted in ~10 days
      - alert: CheckoutErrorBudgetSlowBurn
        expr: |
          slo:checkout_burn_rate:6h > 3
          and
          slo:checkout_burn_rate:30m > 3
        for: 15m
        labels:
          severity: ticket
          team: platform-reliability
        annotations:
          summary: "Checkout API error budget burning at 3×"
          description: "SLO at risk over the next 10 days. Schedule investigation during business hours."
```
Why Multi-Window?
- The fast burn rule (`1h` + `5m`) catches sudden failures (e.g., database failover, deployment regression) without triggering on transient spikes.
- The slow burn rule (`6h` + `30m`) catches gradual degradation (e.g., memory leaks, cache eviction storms) that wouldn't justify a page but will exhaust the budget if unaddressed.
- The `for` duration acts as a debounce mechanism, ensuring the burn rate is sustained before alerting.
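These thresholds translate directly into time-to-exhaustion: at a sustained burn rate of b, a 30-day budget lasts 30/b days, so 14× exhausts it in about two days and 3× in ten. A dashboard-friendly projection is sketched below; the 0.001 floor is an arbitrary guard against division by zero, not a standard value:

```promql
# Projected days until budget exhaustion at the current 1-hour burn pace
30 / clamp_min(slo:checkout_burn_rate:1h, 0.001)
```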
Pitfall Guide
1. Single-Window Burn Rate Triggers
Explanation: Using only one time window (e.g., 1 hour) causes either false positives during brief traffic anomalies or delayed detection of slow degradation.
Fix: Implement the standard two-tier pattern: fast burn (short window + short evaluation) for pages, slow burn (long window + longer evaluation) for tickets. Always require both windows to exceed the threshold simultaneously.
2. Ignoring Request Volume Thresholds
Explanation: Burn rate calculations become mathematically unstable when request volume drops near zero. A single failed request out of two can trigger a 100% error rate and massive burn rate.
Fix: Add a volume guard to alert expressions: `sum(rate(platform_api_request_total[5m])) > 10`. This ensures alerts only fire when sufficient traffic exists to make the SLI statistically meaningful, as in the sketch below.
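Combined with the fast-burn condition from Step 4, the guarded expression could look like this sketch (the 10 req/s floor is an example value, not a recommendation):

```promql
(slo:checkout_burn_rate:1h > 14 and slo:checkout_burn_rate:5m > 14)
and
sum(rate(platform_api_request_total{job="checkout-api"}[5m])) > 10
```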
3. Misaligned Measurement Windows
Explanation: Calculating burn rate over a 1-hour window while the SLO measurement period is 30 days creates mathematical inconsistency. The expected consumption rate must match the SLO window.
Fix: Always normalize burn rate against the full SLO window. If your SLO uses a 30-day rolling window, divide the fraction of budget actually consumed in the alert window by `alert_window_seconds / (30 * 24 * 3600)` to get the true multiplier. For example, an hour that consumes 1/72 of the budget against an expected 1/720 is burning at 10×.
4. Alerting on SLI Instead of Burn Rate
Explanation: Teams often alert when the success rate drops below 99.5%, which is functionally identical to a static threshold. This ignores how much budget remains and how fast it is being consumed.
Fix: Alert exclusively on burn rate multipliers. The SLI is for dashboards and reporting; the burn rate is for alerting. This decouples threshold noise from budget consumption.
5. Missing Budget Tracking Dashboards
Explanation: Without visibility into remaining budget, teams cannot make informed decisions about feature releases, deployment freezes, or incident prioritization.
Fix: Build a dedicated SLO dashboard showing current burn rate, remaining budget percentage, days until exhaustion, and historical consumption trends. Integrate it into deployment gates and incident command workflows.
6. Over-Engineering OTel Pipelines Early
Explanation: Teams attempt to collect traces, logs, and metrics simultaneously, causing high cardinality, storage bloat, and collector instability before establishing baseline reliability.
Fix: Start with metrics-only pipelines for SLO alerting. Add traces only when alerting identifies a specific latency or error path requiring root-cause analysis. Add logs only for compliance or audit requirements.
7. Static Error Budget Reset Logic
Explanation: Treating the error budget as a monthly reset without accounting for carryover or deployment windows leads to gaming the system or unnecessary feature freezes.
Fix: Implement budget carryover policies (e.g., unused budget rolls over up to 10%), and define explicit deployment freeze triggers when the burn rate exceeds 3× for more than 24 hours; a sketch of such a trigger follows. Document these policies in the team's reliability runbook.
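A sketch of such a freeze trigger, reusing the recording rules from Step 3 (the rule name, thresholds, and `action` label are illustrative policy choices, not a standard):

```yaml
- alert: CheckoutDeploymentFreeze
  expr: avg_over_time(slo:checkout_burn_rate:6h[24h]) > 3
  labels:
    severity: ticket
    action: deployment-freeze
  annotations:
    summary: "Burn rate above 3× for 24h; apply deployment freeze per runbook"
```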
Production Bundle
Action Checklist
- Define SLO target, SLI metric, and measurement window for each critical service
- Deploy OpenTelemetry collector with OTLP receiver and Prometheus exporter
- Instrument applications to emit outcome counters with consistent status labels
- Create Prometheus recording rules for SLI calculation with volume guards
- Implement multi-window burn rate rules (fast + slow) with appropriate `for` durations
- Route alerts to PagerDuty/Opsgenie with severity-based escalation policies (a routing sketch follows this checklist)
- Build SLO tracking dashboard showing budget consumption and burn rate trends
- Document budget freeze policies and integrate with CI/CD deployment gates
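A minimal Alertmanager routing sketch matching the `severity` labels used throughout this article; the receiver names, PagerDuty key, and webhook URL are placeholders:

```yaml
# alertmanager.yml (fragment)
route:
  receiver: default
  routes:
    - matchers: ['severity="page"']
      receiver: oncall-pager
    - matchers: ['severity="ticket"']
      receiver: issue-tracker
receivers:
  - name: default
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: <pagerduty-events-v2-key>
  - name: issue-tracker
    webhook_configs:
      - url: https://tickets.example.internal/webhook
```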
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Customer-facing API with strict latency requirements | Multi-window burn rate (14x/1h + 3x/6h) | Catches both sudden failures and gradual degradation without noise | Low (Prometheus recording rules are CPU-efficient) |
| Internal batch processing with SLA of 99% | Single-window burn rate (5x/30m) | Simpler rules suffice for non-user-facing workloads; reduces rule maintenance | Minimal (fewer recording rules) |
| High-cardinality service with dynamic routing | Volume-guarded burn rate + label filtering | Prevents alert storms from low-traffic edge cases; focuses on core paths | Moderate (requires careful label design in OTel) |
| Startup with limited on-call capacity | Slow-burn only (3x/12h) + daily digest | Reduces page frequency while maintaining budget visibility; aligns with smaller team size | Low (fewer alerts, less tooling overhead) |
Configuration Template
```yaml
# prometheus_slo_pipeline.yml
groups:
  - name: reliability.slo.core
    interval: 30s
    rules:
      # Guarded SLI calculation over each window used downstream
      - record: sli:core_api_success:5m
        expr: |
          sum(rate(core_api_requests_total{status="ok"}[5m]))
          /
          clamp_min(sum(rate(core_api_requests_total[5m])), 5)
      - record: sli:core_api_success:1h
        expr: |
          sum(rate(core_api_requests_total{status="ok"}[1h]))
          /
          clamp_min(sum(rate(core_api_requests_total[1h])), 5)
      - record: sli:core_api_success:6h
        expr: |
          sum(rate(core_api_requests_total{status="ok"}[6h]))
          /
          clamp_min(sum(rate(core_api_requests_total[6h])), 5)

      # Burn rate = error ratio / budget (0.995 target → 0.005 budget)
      - record: slo:core_burn:1h
        expr: (1 - sli:core_api_success:1h) / 0.005
      - record: slo:core_burn:6h
        expr: (1 - sli:core_api_success:6h) / 0.005

  - name: reliability.alerts.core
    rules:
      - alert: CoreAPI_BudgetFastBurn
        expr: |
          slo:core_burn:1h > 14
          and
          sum(rate(core_api_requests_total[5m])) > 20
        for: 2m
        labels:
          severity: page
          team: reliability
        annotations:
          summary: "Core API budget burning at 14×"
          description: "Budget exhaustion projected in ~2 days at this rate. Investigate error sources immediately."
      - alert: CoreAPI_BudgetSlowBurn
        expr: |
          slo:core_burn:6h > 3
          and
          sum(rate(core_api_requests_total[30m])) > 50
        for: 15m
        labels:
          severity: ticket
          team: reliability
        annotations:
          summary: "Core API budget burning at 3×"
          description: "Schedule investigation. Budget exhaustion projected in ~10 days."
```
Quick Start Guide
- Deploy the Collector: Run the OpenTelemetry Collector with the provided YAML configuration. Verify metrics are exposed at `http://localhost:8889/metrics` and contain your custom counters.
- Load Recording Rules: Place the `prometheus_slo_pipeline.yml` file in Prometheus's rule directory and reload the configuration. Confirm the recording rules appear in the Prometheus UI under the Rules tab.
- Validate Burn Rate Math: Simulate traffic using a load generator. Check that `sli:core_api_success:5m` reflects your success ratio and that `slo:core_burn:1h` yields the expected multiplier against the 30-day window (see the `promtool` test sketch after this list).
- Activate Alerts: Enable the alert rules and configure Alertmanager to route `page` severity to the on-call rotation and `ticket` severity to your issue tracker. Verify debounce behavior by triggering brief error spikes.
- Monitor Budget: Add the recording rule metrics to a Grafana dashboard. Track remaining budget percentage, current burn rate, and historical consumption to inform deployment and incident decisions.
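To exercise the burn-rate math offline, Prometheus's `promtool test rules` can replay synthetic series against the rule file. A minimal sketch, assuming the template above is saved as `prometheus_slo_pipeline.yml`; the traffic values are fabricated to produce a 2% error ratio, i.e., an expected burn rate of 4:

```yaml
# slo_rules_test.yml (run with: promtool test rules slo_rules_test.yml)
rule_files:
  - prometheus_slo_pipeline.yml
evaluation_interval: 30s
tests:
  - interval: 1m
    input_series:
      # 9.8 req/s ok + 0.2 req/s errors: 2% error ratio, above the 5 req/s clamp
      - series: 'core_api_requests_total{status="ok"}'
        values: '0+588x120'
      - series: 'core_api_requests_total{status="error"}'
        values: '0+12x120'
    promql_expr_test:
      - expr: slo:core_burn:1h
        eval_time: 2h
        exp_samples:
          - labels: 'slo:core_burn:1h'
            value: 4  # 0.02 / 0.005; minor float drift is possible
```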
