oards provide real-time visibility.
Implementation: Python Budget Calculator + Prometheus Integration
Below is a minimal implementation that calculates error budget consumption over a rolling window and exposes the results as Prometheus metrics. It keeps SLI data in memory, so treat it as a starting point rather than a finished service.
1. SLO Configuration (slo_config.yaml)
services:
  checkout-api:
    sli_type: success_rate
    target: 0.999
    window_days: 30
    alerts:
      fast_burn:
        window_minutes: 5
        threshold: 14.4
      slow_burn:
        window_minutes: 60
        threshold: 6
      critical_burn:
        window_minutes: 1440
        threshold: 3
2. Budget Engine (budget_engine.py)
import time

import prometheus_client
import yaml
from prometheus_client import Counter, Gauge

# Prometheus metrics, labelled per service so one engine can track several SLOs
budget_remaining = Gauge('error_budget_remaining_pct', 'Percentage of error budget remaining', ['service'])
budget_consumed = Counter('error_budget_consumed_total', 'Total error budget consumed', ['service'])
burn_rate_fast = Gauge('burn_rate_fast', 'Current fast burn rate', ['service'])
burn_rate_slow = Gauge('burn_rate_slow', 'Current slow burn rate', ['service'])


class ErrorBudgetEngine:
    def __init__(self, config_path):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)
        self.sli_counts = {}     # service -> list of {'success', 'total', 'timestamp'}
        self.last_consumed = {}  # service -> consumed fraction at the previous update

    def record_sli(self, service, success, total):
        now = time.time()
        records = self.sli_counts.setdefault(service, [])
        records.append({'success': success, 'total': total, 'timestamp': now})
        # Keep only data that falls inside the service's SLO window
        window_days = self.config['services'][service]['window_days']
        cutoff = now - window_days * 24 * 3600
        self.sli_counts[service] = [r for r in records if r['timestamp'] > cutoff]

    def _error_ratio(self, service, window_seconds):
        """Observed error rate over the window divided by the error rate the SLO allows."""
        target = self.config['services'][service]['target']
        cutoff = time.time() - window_seconds
        records = [r for r in self.sli_counts.get(service, []) if r['timestamp'] > cutoff]
        total_requests = sum(r['total'] for r in records)
        error_budget_total = 1 - target
        if total_requests == 0 or error_budget_total <= 0:
            return 0.0
        total_success = sum(r['success'] for r in records)
        actual_rate = total_success / total_requests
        return (1 - actual_rate) / error_budget_total

    def calculate_budget(self, service):
        cfg = self.config['services'][service]
        # Budget consumed = (1 - actual_rate) / (1 - target) over the SLO window.
        # Simplification: observed traffic stands in for the whole window's traffic.
        consumed = self._error_ratio(service, cfg['window_days'] * 24 * 3600)
        remaining_pct = max(0.0, (1 - consumed) * 100)
        # Burn rate per alert window; 1.0 means the budget lasts exactly one SLO window
        burn_rates = {
            name: self._error_ratio(service, alert['window_minutes'] * 60)
            for name, alert in cfg['alerts'].items()
        }
        return remaining_pct, burn_rates, consumed

    def evaluate_alerts(self, service, burn_rates):
        cfg = self.config['services'][service]['alerts']
        # Compare each window's burn rate with its own threshold, shortest window first
        if burn_rates['fast_burn'] > cfg['fast_burn']['threshold']:
            return 'FAST_BURN'
        elif burn_rates['slow_burn'] > cfg['slow_burn']['threshold']:
            return 'SLOW_BURN'
        elif burn_rates['critical_burn'] > cfg['critical_burn']['threshold']:
            return 'CRITICAL_BURN'
        return 'HEALTHY'

    def update_metrics(self, service):
        remaining, burn_rates, consumed = self.calculate_budget(service)
        budget_remaining.labels(service=service).set(remaining)
        # Counters only go up, so add just the increase since the last update
        delta = consumed - self.last_consumed.get(service, 0.0)
        if delta > 0:
            budget_consumed.labels(service=service).inc(delta)
        self.last_consumed[service] = consumed
        burn_rate_fast.labels(service=service).set(burn_rates['fast_burn'])
        burn_rate_slow.labels(service=service).set(burn_rates['slow_burn'])
        return self.evaluate_alerts(service, burn_rates)


# Example usage
if __name__ == "__main__":
    engine = ErrorBudgetEngine('slo_config.yaml')
    # Simulate metric ingestion
    engine.record_sli('checkout-api', success=99850, total=100000)
    engine.record_sli('checkout-api', success=99900, total=100000)
    state = engine.update_metrics('checkout-api')
    print(f"Alert State: {state}")
    # Start Prometheus HTTP server (serves /metrics on port 8000)
    prometheus_client.start_http_server(8000)
    while True:
        time.sleep(60)
        engine.update_metrics('checkout-api')
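To make the budget formula concrete, the sketch below applies (1 - actual_rate) / (1 - target) to the two simulated samples from the example usage. The helper name budget_consumed_fraction is hypothetical and exists only for this illustration; it is not part of the engine.

```python
def budget_consumed_fraction(success, total, target):
    """Fraction of the error budget used: (1 - actual_rate) / (1 - target)."""
    if total == 0:
        return 0.0
    actual_rate = success / total
    return (1 - actual_rate) / (1 - target)


# The two simulated samples: 150 + 100 failed requests out of 200,000
consumed = budget_consumed_fraction(success=99850 + 99900, total=200000, target=0.999)
remaining_pct = max(0.0, (1 - consumed) * 100)
print(f"consumed={consumed:.2f}, remaining={remaining_pct:.1f}%")
```

With a 99.9% target, 200,000 requests allow 200 failures. The samples contain 250, so the budget is overspent by 25% and the remaining percentage is clamped to zero.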
3. Deployment Notes
- Run the engine as a sidecar or dedicated service alongside your application.
- Expose the /metrics endpoint for Prometheus scraping.
- Integrate with Grafana using the error_budget_remaining_pct and burn_rate_* metrics.
- Replace the simulated record_sli calls with actual OpenTelemetry/Prometheus metric exports.
- For production, replace in-memory storage with Redis or PostgreSQL to persist SLI data across restarts.
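When the simulated calls are replaced with real metrics, the source is usually a pair of cumulative counters rather than per-interval counts. The following is a minimal sketch of that conversion, assuming two successive (success, total) snapshots. The helper names counter_delta and sli_sample are hypothetical, and how the snapshots are fetched is left out.

```python
def counter_delta(previous, current):
    """Increase of a cumulative counter between two reads, tolerating resets."""
    # A counter that went backwards was reset (for example by a restart),
    # so the current value is the whole increase since the reset.
    return current if current < previous else current - previous


def sli_sample(prev, curr):
    """Turn two cumulative (success, total) snapshots into per-interval counts."""
    success = counter_delta(prev[0], curr[0])
    total = counter_delta(prev[1], curr[1])
    # Guard against success exceeding total when only one counter was reset
    return min(success, total), total


success, total = sli_sample(prev=(99850, 100000), curr=(199750, 200000))
print(success, total)  # 99900 100000
# The pair can then be passed on, e.g.:
# engine.record_sli('checkout-api', success=success, total=total)
```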
Policy Enforcement Layer
Error budgets become actionable when coupled with automation:
- Release Gating: CI/CD pipelines query the budget API. If remaining_pct < 20, deployments are blocked or require VP approval.
- Feature Flags: New features deploy behind flags. If burn rate exceeds thresholds, flags automatically roll back.
- Reliability Sprints: When budget drops below 10%, the next sprint shifts to stability work (chaos engineering, debt reduction, performance optimization).
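The release gate can be sketched as a small pipeline step. The example below assumes the engine's gauge is exposed in the Prometheus text format with a service label, and it parses a hard-coded sample instead of making an HTTP request. The function names read_gauge and gate_decision are hypothetical, and failing closed on a missing metric is an assumption, not something the policy above prescribes.

```python
def read_gauge(metrics_text, name, service):
    """Return the value of name{service="..."} from exposition text, or None."""
    wanted_label = f'service="{service}"'
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.startswith(name + "{"):
            continue
        labels, _, value = line.partition("} ")
        if wanted_label in labels:
            return float(value.split()[0])
    return None


def gate_decision(remaining_pct, threshold_pct=20.0):
    """Block when the budget is unknown or below the threshold."""
    if remaining_pct is None:
        return "block"  # assumption: fail closed when the metric is missing
    return "block" if remaining_pct < threshold_pct else "allow"


sample = '''# HELP error_budget_remaining_pct Percentage of error budget remaining
# TYPE error_budget_remaining_pct gauge
error_budget_remaining_pct{service="checkout-api"} 12.5
'''
decision = gate_decision(read_gauge(sample, "error_budget_remaining_pct", "checkout-api"))
print(decision)  # block
```

In a real pipeline the step would exit with a non-zero status when the decision is "block", which fails the job and stops the deployment.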
Pitfall Guide (6 Critical Mistakes)
1. Defining SLOs Without User-Centric Context
Problem: Teams set SLOs based on infrastructure metrics (CPU, memory, pod restarts) rather than user-facing outcomes (checkout success, search latency, payment completion).
Mitigation: Map every SLO to a user journey. Use product analytics and customer support data to identify which failures actually impact retention or revenue. Validate SLO targets with stakeholders before implementation.
2. Treating Error Budgets as Compliance Checkboxes
Problem: Budgets are calculated once, stored in spreadsheets, and reviewed quarterly. Teams ignore daily consumption patterns, missing early degradation signals.
Mitigation: Automate budget calculation and expose it in real-time dashboards. Integrate budget status into daily standups and release reviews. Treat exhaustion as an operational event, not a reporting requirement.
3. Ignoring Multi-Window Burn Rate Dynamics
Problem: Single-threshold alerts trigger on both transient spikes and sustained degradation, causing alert fatigue or missed incidents.
Mitigation: Implement multi-window multi-burn-rate alerting (e.g., 5m/14.4x for fast burn, 1h/6x for slow burn, 24h/3x for critical). Fast burn indicates immediate user impact; slow burn indicates creeping degradation requiring planned intervention.
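The arithmetic behind these pairs is worth spelling out: a burn rate of 1 spends the budget exactly over the SLO window, so a burn rate of N spends it N times faster. The sketch below assumes a 30-day SLO window and a constant burn rate; the function names are hypothetical.

```python
def budget_fraction_burned(burn_rate, window_minutes, slo_window_days=30):
    """Share of the total budget spent if burn_rate holds for the whole alert window."""
    return burn_rate * window_minutes / (slo_window_days * 24 * 60)


def hours_to_exhaustion(burn_rate, slo_window_days=30):
    """Hours until a full budget is gone at a constant burn rate."""
    return slo_window_days * 24 / burn_rate


for name, window_minutes, rate in [("fast", 5, 14.4), ("slow", 60, 6), ("critical", 1440, 3)]:
    burned = budget_fraction_burned(rate, window_minutes)
    hours = hours_to_exhaustion(rate)
    print(f"{name}: {burned:.2%} of budget per window, exhausted in {hours:.0f} h")
```

At 14.4x a 30-day budget is gone in 50 hours, at 6x in 5 days, and at 3x in 10 days, which is why the faster tiers call for a quicker response.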
4. Lack of Automated Policy Enforcement
Problem: Teams manually check budgets and rely on tribal knowledge to decide whether to release. Inconsistency leads to either reckless deployments or unnecessary delays.
Mitigation: Build policy-as-code. Use tools like OPA, custom CI/CD plugins, or service mesh policies to automatically gate deployments, trigger rollback, or require reliability sign-off based on budget state.
5. Cultural Blame When Budgets Are Exhausted
Problem: Exhaustion is treated as a failure of individuals or teams, leading to risk aversion, hidden incidents, and psychological unsafety.
Mitigation: Frame budget consumption as expected behavior. Reliability is a spectrum, not a binary. Conduct blameless postmortems focused on system design, not human error. Celebrate teams that proactively invest in reliability when budgets drop.
6. Static Windows and Unchanging Targets
Problem: 30-day rolling windows don't adapt to seasonal traffic, product launches, or architectural migrations. SLOs become misaligned with reality.
Mitigation: Use dynamic windows (e.g., 7-day for fast-moving services, 90-day for stable core systems). Review SLOs quarterly with product and customer success teams. Adjust targets based on actual user tolerance and business impact, not arbitrary percentages.
Production Bundle
Checklist
Decision Matrix
| Scenario | Recommended Action | Rationale |
|---|---|---|
| Budget > 80% remaining | Allow standard releases | High reliability headroom |
| Budget 50–80% remaining | Proceed with monitoring | Normal consumption, no action needed |
| Budget 20–50% remaining | Require reliability sign-off | Moderate risk, validate changes |
| Budget < 20% remaining | Block non-critical releases | Preserve remaining budget |
| Fast burn rate triggered | Immediate rollback/feature flag disable | Prevent user impact escalation |
| Slow burn rate triggered | Schedule stability sprint | Address creeping degradation |
| Budget exhausted for 7+ days | Freeze releases, launch reliability initiative | Systemic issue, requires investment |
| SLO miss rate < 1% annually | Re-evaluate SLO target | Target may be too lenient or service over-engineered |
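The matrix translates naturally into policy code. The sketch below is one possible encoding, not a prescribed one: the function name recommend_action is hypothetical, and the precedence between rows (fast burn first, then exhaustion, then the budget tiers) and the handling of exact boundaries are assumptions, since the matrix does not specify them.

```python
def recommend_action(remaining_pct, fast_burn=False, slow_burn=False, exhausted_days=0):
    """Map budget state to the matrix's recommended action, most urgent row first."""
    if fast_burn:
        return "Immediate rollback/feature flag disable"
    if remaining_pct <= 0 and exhausted_days >= 7:
        return "Freeze releases, launch reliability initiative"
    if remaining_pct < 20:
        return "Block non-critical releases"
    if slow_burn:
        return "Schedule stability sprint"
    if remaining_pct < 50:
        return "Require reliability sign-off"
    if remaining_pct <= 80:
        return "Proceed with monitoring"
    return "Allow standard releases"


for state in [dict(remaining_pct=90), dict(remaining_pct=30), dict(remaining_pct=60, slow_burn=True)]:
    print(state, "->", recommend_action(**state))
```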
Config Template (error-budget-policy.yaml)
policy:
  version: "2.0"
  evaluation_interval: 60s
  services:
    - name: payment-gateway
      sli:
        metric: http_request_duration_seconds_bucket
        le: 0.5
        label: status_code
        success_values: ["200", "201"]
      slo:
        target: 0.9995
        window_days: 30
      burn_rates:
        fast:
          window_minutes: 5
          multiplier: 14.4
        slow:
          window_minutes: 60
          multiplier: 6
        critical:
          window_minutes: 1440
          multiplier: 3
      enforcement:
        release_gate:
          threshold_pct: 20
          action: block
          approval_override: true
        feature_flag:
          auto_rollback: true
          burn_threshold: 10.0
      notification:
        channels:
          - slack:#reliability
          - pagerduty:escalation-tier-2
        when: ["fast_burn", "budget_exhausted"]
Quick Start (30-Minute Deployment)
- Instrument Your Service: Add OpenTelemetry SDK to your application. Export HTTP success rates and latency to Prometheus.
pip install opentelemetry-sdk opentelemetry-exporter-prometheus
- Deploy Budget Engine: Clone the provided Python code, install dependencies (pip install pyyaml prometheus_client), and run:
python budget_engine.py
- Configure Prometheus: Add a scrape job for localhost:8000/metrics in prometheus.yml.
- Import Dashboard: Use the provided Grafana JSON dashboard (export error_budget_remaining_pct and burn_rate_* metrics).
- Set CI/CD Gate: Add a pipeline step that queries http://localhost:8000/metrics or the Prometheus API. Block if error_budget_remaining_pct < 20.
- Test with Chaos: Use toxiproxy or kubectl delete pod to simulate failures. Verify that budget consumption and alerting trigger correctly.
- Document & Socialize: Share the dashboard with engineering and product teams. Establish a weekly 15-minute reliability sync to review budget trends.
Closing Notes
Error budget management is not a monitoring feature; it is a release governance mechanism and a cultural operating system. When implemented correctly, it transforms reliability from a cost center into a strategic asset. Teams stop guessing about risk, product managers gain visibility into trade-offs, and engineers ship faster with confidence.
The code, configurations, and workflows provided here are designed for incremental adoption. Start with one user-critical journey, automate the budget calculation, enforce a simple release gate, and iterate. Reliability engineering is a practice, not a project. Measure, adjust, and let data drive the balance between speed and stability.