t servers report. Common categories:
- Availability: success_count / total_count over a rolling window
- Latency: p99_request_duration_seconds or percentage_within_threshold
- Throughput: requests_per_second relative to capacity ceiling
- Freshness: data_age_seconds for cache or sync systems
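The availability and latency categories above can be computed from raw request records roughly as follows. This is a minimal sketch: the `RequestSample` shape, the function names, and the nearest-rank percentile method are assumptions chosen for illustration, not part of any library.

```typescript
// Hypothetical request record; the field names are illustrative.
interface RequestSample {
  ok: boolean; // true if the request succeeded from the user's point of view
  durationSeconds: number;
}

// Availability SLI: success_count / total_count.
function availability(samples: RequestSample[]): number {
  if (samples.length === 0) return 1; // assumption: no traffic counts as available
  return samples.filter((s) => s.ok).length / samples.length;
}

// Latency SLI, percentile form: nearest-rank p-th percentile of duration.
function percentile(samples: RequestSample[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = samples.map((s) => s.durationSeconds).sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

// Latency SLI, threshold form: fraction of requests at or under the threshold.
function fractionWithin(samples: RequestSample[], thresholdSeconds: number): number {
  if (samples.length === 0) return 1;
  const fast = samples.filter((s) => s.durationSeconds <= thresholdSeconds).length;
  return fast / samples.length;
}
```

The threshold form is usually easier to turn into an SLO because it is already a ratio of good events to total events.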
Step 2: Set SLOs with Rolling Windows
SLOs are targets defined over rolling windows (typically 30 days). Example:
- Availability SLO: 99.9% over 30-day rolling window
- Latency SLO: 99.5% of requests complete in under 300ms
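To make the availability target concrete, the error budget it implies can be worked out directly. The sketch below is illustrative only; the function names are assumptions, and it treats the budget either as time (for uptime-style SLOs) or as a request count (for request-based SLOs).

```typescript
// Allowed "bad" minutes for a time-based availability SLO.
function errorBudgetMinutes(target: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - target) * windowMinutes;
}

// Allowed failed requests for a request-based SLO.
function errorBudgetRequests(target: number, totalRequests: number): number {
  return (1 - target) * totalRequests;
}

// 99.9% over 30 days leaves roughly 43.2 minutes of budget.
console.log(errorBudgetMinutes(0.999, 30).toFixed(1)); // 43.2
```

For a service handling one million requests in the window, the same 99.9% target allows about 1,000 failed requests.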
Step 3: Implement Measurement Pipeline
Use OpenTelemetry for instrumentation, Prometheus for aggregation, and a custom SLO calculator for burn rate math. Avoid static thresholds.
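Burn rate = actual_error_rate / allowed_error_rate, where allowed_error_rate = 1 - SLO target. A burn rate of 1 spends the budget exactly over the SLO window; higher values exhaust it early. Multi-window alerting reduces false positives by requiring consistent degradation across time scales.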
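The burn rate formula and the multi-window check can be sketched as below. The `WindowCounts` type and the function names are assumptions for illustration; the pairing of a long and a short window is one common way to require agreement across time scales, not necessarily the only one.

```typescript
// Request counts observed in one look-back window (hypothetical shape).
interface WindowCounts {
  total: number;
  failed: number;
}

// Burn rate = actual_error_rate / allowed_error_rate.
function burnRate(w: WindowCounts, targetAvailability: number): number {
  if (w.total === 0) return 0; // no traffic, nothing burned
  const actualErrorRate = w.failed / w.total;
  const allowedErrorRate = 1 - targetAvailability;
  return actualErrorRate / allowedErrorRate;
}

// Alert only when both the long and the short window exceed the threshold,
// so a spike that has already ended does not page anyone.
function shouldAlert(
  longWindow: WindowCounts,
  shortWindow: WindowCounts,
  targetAvailability: number,
  threshold: number,
): boolean {
  return (
    burnRate(longWindow, targetAvailability) >= threshold &&
    burnRate(shortWindow, targetAvailability) >= threshold
  );
}
```

With a 99.9% target, a 2% error rate is a burn rate of about 20, which exceeds a 14.4x threshold only if it shows up in both windows.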
Step 5: Integrate Error Budget Policy
Expose budget consumption to CI/CD. Block deployments when less than 10% of the budget remains. Allow deployments when more than 50% remains. Require manual approval in between.
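The gating policy reduces to a small decision function. The sketch below is an assumption about how a pipeline step might encode it; the function name and the fail-closed handling of missing data are illustrative choices, while the 10% and 50% thresholds come from the policy above.

```typescript
type GateDecision = 'block' | 'require_approval' | 'allow';

// Map remaining error budget (0-100) to a deployment decision.
function gateDeployment(
  budgetRemainingPct: number,
  blockBelow = 10,
  allowAbove = 50,
): GateDecision {
  // Assumption: an unreadable budget value blocks the release (fail closed).
  if (!Number.isFinite(budgetRemainingPct)) return 'block';
  if (budgetRemainingPct < blockBelow) return 'block';
  if (budgetRemainingPct > allowAbove) return 'allow';
  return 'require_approval';
}
```

Failing closed is the conservative choice; a team that values deploy availability over strictness might prefer to require approval instead.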
TypeScript Implementation: SLO Calculator & Burn Rate Tracker
import { Counter, Gauge } from 'prom-client';

export interface SLOConfig {
  name: string;
  windowHours: number;
  targetAvailability: number; // e.g., 0.999
  burnRates: {
    fast: { windowHours: number; multiplier: number };
    slow: { windowHours: number; multiplier: number };
  };
}

export class SLOTracker {
  private totalRequests: Counter;
  private failedRequests: Counter;
  private errorBudgetRemaining: Gauge;
  private burnRateFast: Gauge;
  private burnRateSlow: Gauge;
  private config: SLOConfig;

  constructor(config: SLOConfig) {
    this.config = config;
    const prefix = `slo_${config.name.replace(/\s+/g, '_').toLowerCase()}`;

    // prom-client adds each metric to the default registry on construction,
    // so no explicit register.registerMetric() calls are needed.
    this.totalRequests = new Counter({
      name: `${prefix}_total_requests`,
      help: 'Total requests for SLO calculation'
    });
    this.failedRequests = new Counter({
      name: `${prefix}_failed_requests`,
      help: 'Failed requests for SLO calculation'
    });
    this.errorBudgetRemaining = new Gauge({
      name: `${prefix}_error_budget_remaining`,
      help: 'Remaining error budget as percentage (0-100)'
    });
    this.burnRateFast = new Gauge({
      name: `${prefix}_burn_rate_fast`,
      help: 'Burn rate relative to the fast alert threshold (>1 means breach)'
    });
    this.burnRateSlow = new Gauge({
      name: `${prefix}_burn_rate_slow`,
      help: 'Burn rate relative to the slow alert threshold (>1 means breach)'
    });
  }

  recordRequest(success: boolean): void {
    this.totalRequests.inc();
    if (!success) this.failedRequests.inc();
  }

  async calculateAndExpose(): Promise<void> {
    const total = await this.totalRequests.get();
    const failed = await this.failedRequests.get();
    const totalValue = total.values[0]?.value || 0;
    const failedValue = failed.values[0]?.value || 0;
    if (totalValue === 0) return;

    const actualErrorRate = failedValue / totalValue;
    const allowedErrorRate = 1 - this.config.targetAvailability;
    // How many times faster than allowed the budget is being spent.
    const currentBurnRate = actualErrorRate / allowedErrorRate;

    // These counters are cumulative since process start, so this class cannot
    // separate the 1h window from the 6h window. Both gauges therefore use the
    // same lifetime burn rate, each divided by its own threshold multiplier.
    // In production, compute per-window rates with Prometheus recording rules.
    const fastBurn = currentBurnRate / this.config.burnRates.fast.multiplier;
    const slowBurn = currentBurnRate / this.config.burnRates.slow.multiplier;

    // Share of the error budget consumed so far, capped at 100%.
    const expectedFailures = totalValue * allowedErrorRate;
    const budgetConsumed = Math.min(failedValue / expectedFailures, 1) * 100;
    const budgetRemaining = Math.max(100 - budgetConsumed, 0);

    this.errorBudgetRemaining.set(budgetRemaining);
    this.burnRateFast.set(fastBurn);
    this.burnRateSlow.set(slowBurn);
  }
}

// Usage
const apiSLO = new SLOTracker({
  name: 'Payment API',
  windowHours: 720, // 30 days
  targetAvailability: 0.999,
  burnRates: {
    fast: { windowHours: 1, multiplier: 14.4 },
    slow: { windowHours: 6, multiplier: 3 }
  }
});

// Instrument request handler.
// processPayment is assumed to be defined elsewhere and to resolve to a boolean.
export async function handlePaymentRequest(req: any, res: any) {
  const success = await processPayment(req);
  apiSLO.recordRequest(success);
  // Expose metrics endpoint separately via /metrics
  res.json({ success });
}
Architecture Decisions & Rationale
- Push vs Pull: Use pull-based scraping (Prometheus) for stable aggregation. Push gateways introduce timestamp skew that breaks rolling window math.
- Rolling Windows: 30-day windows align with billing cycles and SLA review periods. Shorter windows (1h, 6h) are used exclusively for burn rate alerting, not SLO targets.
- Multi-Window Burn Rates: Single-window alerting causes noise. Fast burn (14.4x over 1h) catches catastrophic failures. Slow burn (3x over 6h) catches gradual degradation. Requiring elevated burn on more than one window confirms that degradation is sustained rather than a transient spike.
- Error Budget Exposure: Budget metrics must be queryable by CI/CD pipelines. REST or Prometheus query API enables automated deployment gating.
- SLA Integration: SLAs should never drive engineering targets. They are business contracts. Map SLO breaches to SLA clauses via policy engines, not direct metric thresholds.
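The burn rate multipliers above are easier to reason about once they are converted into budget terms. The following sketch, with assumed function names, shows how much of a 30-day budget each alert window would consume at its threshold and how long the budget would last at that pace.

```typescript
// Fraction of the whole error budget spent if `burnRate` is sustained
// for `alertWindowHours` within an SLO window of `sloWindowHours`.
function budgetFractionConsumed(
  burnRate: number,
  alertWindowHours: number,
  sloWindowHours: number,
): number {
  return (burnRate * alertWindowHours) / sloWindowHours;
}

// Hours until the whole budget is gone at a constant burn rate.
function hoursToExhaustion(burnRate: number, sloWindowHours: number): number {
  return burnRate > 0 ? sloWindowHours / burnRate : Infinity;
}
```

For a 720-hour window, 14.4x sustained for 1 hour spends about 2% of the budget and would exhaust it in about 50 hours, while 3x sustained for 6 hours spends about 2.5% and would exhaust it in 240 hours (10 days).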
Pitfall Guide
1. Measuring Infrastructure SLIs Instead of User-Facing Ones
Tracking CPU utilization or pod restarts as primary SLIs creates blind spots. A service can run at 100% CPU while returning 500s to users. Always anchor SLIs to request success, latency percentiles, or data freshness.
2. Static Thresholds Over Multi-Window Burn Rates
Firing alerts when error rate exceeds 0.1% guarantees alert fatigue. Real degradation is defined by velocity, not absolute value. Multi-window burn rate math separates transient spikes from sustained budget consumption.
3. Ignoring Rolling Window Semantics
SLOs are not calendar-month targets. They are rolling windows. A 30-day rolling window never resets: it slides forward continuously, so each day's errors age out of the budget 30 days later. Misaligned windows cause budget miscalculation and false deployment gates.
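The sliding behavior can be illustrated with a small in-memory window of daily counts. This is a teaching sketch with an assumed `RollingWindow` class, not how Prometheus stores data; it only shows that a bad day leaves the calculation when it ages out, not at a month boundary.

```typescript
interface DayCounts {
  total: number;
  failed: number;
}

// Sliding window of per-day counts; the oldest day drops off as new days arrive.
class RollingWindow {
  private days: DayCounts[] = [];
  private readonly windowDays: number;

  constructor(windowDays: number) {
    this.windowDays = windowDays;
  }

  // Append one finished day and discard anything older than the window.
  addDay(total: number, failed: number): void {
    this.days.push({ total, failed });
    while (this.days.length > this.windowDays) this.days.shift();
  }

  // Availability across every day currently inside the window.
  availability(): number {
    const total = this.days.reduce((n, d) => n + d.total, 0);
    const failed = this.days.reduce((n, d) => n + d.failed, 0);
    return total === 0 ? 1 : 1 - failed / total;
  }
}
```

With a 3-day window, a day with 10% failures keeps dragging availability down until three further days have been added, at which point it no longer counts.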
4. Treating SLAs as Engineering Targets
SLAs dictate financial consequences. Engineering targets must be SLOs with error budgets. Directly optimizing for SLA thresholds removes the safety buffer required for incident response and leaves zero margin for recovery.
5. Single-Metric SLOs Ignoring Tail Latency
Availability alone masks performance degradation. A service returning 200 OK with 8-second p99 latency is functionally broken. Composite SLIs (success rate + latency threshold) prevent latency-induced churn.
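A composite SLI can be expressed as a single "good event" predicate. The sketch below assumes a hypothetical `Outcome` record and treats any status below 500 as a success, which is one possible definition rather than a rule.

```typescript
// Hypothetical per-request outcome.
interface Outcome {
  status: number;
  durationSeconds: number;
}

// A request is good only if it is not a server error AND it is fast enough.
function isGood(o: Outcome, latencyThresholdSeconds: number): boolean {
  return o.status < 500 && o.durationSeconds <= latencyThresholdSeconds;
}

// Composite SLI: good events / total events.
function compositeSli(outcomes: Outcome[], latencyThresholdSeconds: number): number {
  if (outcomes.length === 0) return 1; // assumption: no traffic counts as healthy
  const good = outcomes.filter((o) => isGood(o, latencyThresholdSeconds)).length;
  return good / outcomes.length;
}
```

A request that returns 200 after 8 seconds counts against this SLI even though an availability-only SLI would count it as a success.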
6. No Error Budget Policy in CI/CD
Tracking SLOs without enforcing budget policies creates theoretical reliability. Deployment velocity must be coupled to budget consumption. Healthy budgets enable rapid iteration; depleted budgets trigger freeze or rollback.
7. Manual SLO Reporting
Spreadsheets and quarterly reviews cannot support real-time reliability engineering. SLOs require automated calculation, continuous exposure, and programmatic policy enforcement. Manual tracking guarantees drift.
Production Best Practices:
- Define SLIs before SLOs. Measurement dictates targets.
- Start with 99.9% availability and tighten incrementally based on user impact data.
- Implement burn rate alerting before deployment gating.
- Review SLOs quarterly with product, engineering, and business stakeholders.
- Document error budget consumption policies explicitly. Treat them as runbooks.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup MVP (0-10k users) | Single SLO: 99.5% availability, no burn rate alerting | Velocity prioritization; overhead must not exceed engineering capacity | Low monitoring cost, moderate churn risk |
| Regulated FinTech | Composite SLOs: 99.99% availability + p99 < 200ms, strict deployment gating | Compliance requirements mandate measurable reliability and audit trails | High tooling cost, low regulatory penalty risk |
| Microservices Platform | Per-service SLOs with shared error budget pool | Prevents cascade failures; isolates budget consumption to offending service | Medium platform cost, high stability ROI |
| Legacy Monolith | Gradual SLI extraction: Start with request success rate, migrate to latency/threshold composite | Avoids measurement shock; enables incremental SLO adoption without rewrite | Low upfront cost, medium migration overhead |
Configuration Template
# prometheus-slo-rules.yaml
groups:
  - name: slo_payment_api
    interval: 30s
    rules:
      # SLI: Success rate over 5m window
      - record: slo:payment_api:success_rate_5m
        expr: |
          sum(rate(http_requests_total{job="payment-api", status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="payment-api"}[5m]))
      # SLI: p99 latency
      - record: slo:payment_api:p99_latency_5m
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="payment-api"}[5m])) by (le))
      # Error budget remaining in percent (30d rolling; 0.001 = 1 - 0.999 target)
      - record: slo:payment_api:error_budget_remaining_pct
        expr: |
          100 * (1 - (
            sum(increase(http_requests_total{job="payment-api", status=~"5.."}[30d]))
            /
            sum(increase(http_requests_total{job="payment-api"}[30d]))
          ) / 0.001)
      # Burn rate, fast: 1h burn rate divided by the 14.4x threshold (> 1 means breach)
      - record: slo:payment_api:burn_rate_fast
        expr: |
          (sum(rate(http_requests_total{job="payment-api", status=~"5.."}[1h]))
          / sum(rate(http_requests_total{job="payment-api"}[1h])))
          / (0.001 * 14.4)
      # Burn rate, slow: 6h burn rate divided by the 3x threshold (> 1 means breach)
      - record: slo:payment_api:burn_rate_slow
        expr: |
          (sum(rate(http_requests_total{job="payment-api", status=~"5.."}[6h]))
          / sum(rate(http_requests_total{job="payment-api"}[6h])))
          / (0.001 * 3)
# slo-policy-config.yaml
slo_policies:
  payment_api:
    target_availability: 0.999
    window_days: 30
    alert_thresholds:
      fast_burn: 14.4
      slow_burn: 3.0
    deployment_gating:
      budget_remaining_min: 10
      budget_remaining_max: 50
      action_on_low: "block_release"
      action_on_medium: "require_approval"
      action_on_high: "allow_deployment"
    sla_mapping:
      budget_exhausted: "trigger_credits"
      repeated_breach: "escalate_to_executive_review"
Quick Start Guide
- Instrument requests: Add OpenTelemetry HTTP middleware to your primary service. Ensure every request increments http_requests_total with a status label and records its duration in the http_request_duration_seconds histogram.
- Deploy recording rules: Load prometheus-slo-rules.yaml into Prometheus. Verify that slo:payment_api:success_rate_5m and slo:payment_api:error_budget_remaining_pct return data.
- Configure burn rate alerts: Add Prometheus alerting rules that fire on slo:payment_api:burn_rate_fast > 1 and slo:payment_api:burn_rate_slow > 1. In Alertmanager, route fast burn to PagerDuty and slow burn to Slack.
- Integrate CI/CD gate: Add a pre-deploy step that queries /api/v1/query?query=slo:payment_api:error_budget_remaining_pct. Block if < 10. Allow if > 50. Route to manual approval otherwise.
- Validate: Simulate a 5xx spike. Confirm fast burn fires within 15 minutes. Confirm error budget decreases proportionally. Verify deployment gate blocks release when budget < 10%.
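For the CI/CD gate step, the pre-deploy check has to pull a single number out of the Prometheus instant-query response. The sketch below assumes the standard vector response shape of the Prometheus HTTP API; the `PromQueryResponse` type and the `firstSampleValue` helper are illustrative names, and fetching the JSON from the query endpoint is left to the pipeline.

```typescript
// Minimal shape of a Prometheus instant-query response with a vector result.
interface PromQueryResponse {
  status: 'success' | 'error';
  data?: {
    resultType: string;
    result: { metric: Record<string, string>; value: [number, string] }[];
  };
}

// Return the first sample as a number, or null when the query failed,
// returned no series, or returned a non-numeric value such as "NaN".
function firstSampleValue(resp: PromQueryResponse): number | null {
  if (resp.status !== 'success' || !resp.data || resp.data.result.length === 0) {
    return null;
  }
  const parsed = Number(resp.data.result[0].value[1]);
  return Number.isFinite(parsed) ? parsed : null;
}
```

Prometheus encodes sample values as strings, so the explicit conversion matters. A null result is best treated as a reason to stop and investigate rather than as a healthy budget.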