Step 1: Derive Metrics from SLOs
Metrics must map directly to SLIs. If an SLO defines availability as successful_requests / total_requests, the underlying metrics must support this calculation. This typically requires a counter for total requests and a counter for failed requests, with successful requests derived as the difference. Metrics that contribute neither to SLO calculations nor to essential operational visibility (e.g., critical infrastructure health) should be deprecated.
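As a rough illustration of that calculation, the sketch below derives the availability SLI from the two counters. The names `RequestCounts` and `availabilitySli` are hypothetical, and treating a window with zero traffic as fully available is an assumption rather than something the SLO definition dictates.

```typescript
// Hypothetical snapshot of the two counters over one evaluation window.
interface RequestCounts {
  total: number;  // all requests observed in the window
  failed: number; // requests that count against the SLO
}

// availability = successful_requests / total_requests,
// where successful_requests = total - failed.
function availabilitySli(counts: RequestCounts): number {
  if (counts.total === 0) {
    return 1; // assumption: no traffic means no error budget was spent
  }
  return (counts.total - counts.failed) / counts.total;
}

const exampleSli = availabilitySli({ total: 10000, failed: 25 });
console.log(exampleSli.toFixed(4)); // 0.9975
```

Deriving the success count from the failure counter keeps the instrumentation to two series per label set, which is the point of the step above.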
Step 2: Enforce Cardinality Constraints
Every unique combination of label values creates its own time series, so high cardinality becomes a problem when those combinations multiply past storage or query limits. SRE metric design mandates a strict allowlist for label keys and values.
- Allowed Keys: service, version, region, endpoint, status_code.
- Forbidden Keys: user_id, request_id, session_id, ip_address.
- Value Constraints: Enumerate values where possible. Avoid free-text values that can drift.
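One way to express such an allowlist is a map from each permitted key to a rule that bounds its values. This is a minimal sketch: `LABEL_RULES`, `filterLabels`, the regular expressions, and the example regions and endpoints are all illustrative assumptions, not values taken from any real service.

```typescript
// Each permitted key maps either to an enumerated set of values
// or to a pattern that bounds the value space.
type ValueRule = ReadonlySet<string> | RegExp;

// Hypothetical rules; a real allowlist would come from reviewed config.
const LABEL_RULES = new Map<string, ValueRule>([
  ["service", /^[a-z][a-z0-9-]{0,39}$/],
  ["version", /^\d+\.\d+\.\d+$/],
  ["region", new Set(["us-east-1", "eu-west-1"])],
  ["endpoint", new Set(["/checkout", "/cart", "/health"])], // route templates, not raw URLs
  ["status_code", /^[1-5]\d\d$/],
]);

// Returns only the labels that pass the allowlist; everything else is dropped.
function filterLabels(labels: Record<string, string>): Record<string, string> {
  const accepted: Record<string, string> = {};
  for (const [key, value] of Object.entries(labels)) {
    const rule = LABEL_RULES.get(key);
    if (rule === undefined) continue; // key is not allowlisted (e.g. user_id)
    const valueAllowed = rule instanceof RegExp ? rule.test(value) : rule.has(value);
    if (valueAllowed) accepted[key] = value;
  }
  return accepted;
}
```

Forbidden keys need no explicit list here: anything absent from the map is rejected by default, so a newly invented high-cardinality label cannot slip through.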
Step 3: Implement Burn Rate-Ready Metrics
Metrics must be emitted in a way that supports rapid aggregation. Counters are preferred for rates. Histograms must be configured with buckets that align with latency SLOs. For example, if the latency SLO is p99 < 500ms, the histogram buckets should include a boundary at 500ms to enable efficient calculation of the success ratio.
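To see why the boundary matters, consider a sketch that computes the latency success ratio from cumulative bucket counts. The `CumulativeHistogram` shape and `latencySuccessRatio` are hypothetical names chosen for illustration, and boundaries are assumed to be in milliseconds.

```typescript
// Cumulative histogram snapshot: counts[i] is the number of observations
// less than or equal to boundaries[i].
interface CumulativeHistogram {
  boundaries: number[]; // upper bounds in milliseconds, ascending
  counts: number[];     // cumulative count per boundary
  total: number;        // all observations, including those above the last bound
}

// Fraction of requests at or under the threshold. Without a boundary exactly
// at the threshold the ratio can only be estimated, so this sketch refuses.
function latencySuccessRatio(h: CumulativeHistogram, thresholdMs: number): number {
  const index = h.boundaries.indexOf(thresholdMs);
  if (index === -1) {
    throw new Error(`no bucket boundary at ${thresholdMs}ms; ratio would need interpolation`);
  }
  if (h.total === 0) {
    return 1; // assumption: no traffic counts as compliant
  }
  return (h.counts[index] ?? 0) / h.total;
}

const snapshot: CumulativeHistogram = {
  boundaries: [100, 250, 500, 1000],
  counts: [600, 900, 990, 998],
  total: 1000,
};
console.log(latencySuccessRatio(snapshot, 500)); // 0.99
```

With a boundary at 500, the ratio is a single division of two counters; with boundaries at, say, 400 and 600 only, the same question has no exact answer.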
Step 4: TypeScript Implementation with OpenTelemetry
OpenTelemetry is the vendor-neutral standard for instrumentation across metrics, traces, and logs. The following TypeScript pattern demonstrates a metric factory that enforces SRE constraints. The wrapper validates metric names, filters labels against an allowlist, and attaches the SLI metadata that burn rate tooling relies on.
```typescript
import { Meter, Counter, Histogram, Attributes } from '@opentelemetry/api';

// SRE Configuration Interface
interface SREMetricConfig {
  allowedLabels: Set<string>;
  maxCardinalityPerMetric: number;
  sliDimensions: string[]; // Labels critical for SLO slicing
}

// SRE Metric Factory
export class SREMetricFactory {
  private readonly config: SREMetricConfig;
  // Reserved for per-metric series counting; not enforced in this example
  private readonly cardinalityTracker: Map<string, number> = new Map();

  constructor(private meter: Meter, config: SREMetricConfig) {
    this.config = config;
  }

  createCounter(name: string, description: string, sliName: string): Counter {
    this.validateSREConstraints(name);
    const counter = this.meter.createCounter(name, { description });
    // Attach metadata for burn rate tooling (adding 0 does not change the count)
    counter.add(0, { _sre_sli: sliName, _sre_metric_type: 'counter' });
    return counter;
  }

  createHistogram(name: string, description: string, sliName: string, buckets: number[]): Histogram {
    this.validateSREConstraints(name);
    // Validate buckets against SLO thresholds
    this.validateBuckets(buckets, sliName);
    // No zero-valued seed sample here: Histogram exposes record(), not add(),
    // and a recorded 0 would be counted as a fast request in the latency SLI.
    return this.meter.createHistogram(name, {
      description,
      advice: { explicitBucketBoundaries: buckets }
    });
  }

  // `value` is the counter increment or the histogram observation (defaults to 1)
  recordWithValidation(metric: Counter | Histogram, attributes: Attributes, value = 1): void {
    // Cardinality check: copy only allowlisted labels so the caller's object is not mutated
    const safeAttributes: Attributes = {};
    for (const [key, attributeValue] of Object.entries(attributes)) {
      if (this.config.allowedLabels.has(key)) {
        safeAttributes[key] = attributeValue;
      } else {
        console.warn(`[SRE] Blocked label '${key}' for metric due to cardinality risk.`);
      }
    }
    // Validate SLI dimensions presence
    for (const dim of this.config.sliDimensions) {
      if (!(dim in safeAttributes)) {
        console.warn(`[SRE] Missing required SLI dimension '${dim}' in attributes.`);
      }
    }
    // Counters expose add(); histograms expose record()
    if ('add' in metric) {
      metric.add(value, safeAttributes);
    } else {
      metric.record(value, safeAttributes);
    }
  }

  private validateSREConstraints(name: string): void {
    // Naming convention enforcement: lowercase snake_case starting with a letter
    if (!/^[a-z][a-z0-9_]*$/.test(name)) {
      throw new Error(`[SRE] Metric name '${name}' violates naming conventions.`);
    }
  }

  private validateBuckets(buckets: number[], sliName: string): void {
    // Example: ensure the SLO threshold is an exact bucket boundary (values in milliseconds)
    // In production, this would parse SLO config to verify bucket coverage
    if (sliName.includes('latency') && !buckets.includes(500)) {
      console.warn(`[SRE] Buckets for '${sliName}' may not support p99 < 500ms calculation.`);
    }
  }
}
```
Architecture Decisions
- OpenTelemetry SDK: Use the SDK for instrumentation to ensure vendor neutrality. Metric design decisions should be embedded in the SDK wrapper, not the collector configuration, to catch issues at compile/runtime.
- Label Normalization: Normalize label values (e.g., lowercasing, truncating strings) within the factory to prevent cardinality drift from inconsistent data.
- Metric Descriptions: Enforce descriptions that link the metric to the specific SLO. This aids in auditing metric relevance over time.
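The label normalization decision above could look roughly like the following. `normalizeLabelValue` and the 64-character limit are assumptions made for this sketch; note that truncation can merge two distinct long values into one, which is usually acceptable for metrics but worth knowing.

```typescript
// Hypothetical upper bound on label value length.
const MAX_LABEL_LENGTH = 64;

// Trims, lowercases, replaces whitespace runs with underscores, and truncates,
// so that "US-East 1" and " us-east 1 " collapse into a single series.
function normalizeLabelValue(raw: string): string {
  return raw
    .trim()
    .toLowerCase()
    .replace(/\s+/g, "_")
    .slice(0, MAX_LABEL_LENGTH);
}

console.log(normalizeLabelValue("  US-East 1 ")); // us-east_1
```

Calling a helper like this inside the factory, before the allowlist check, keeps every call site from having to remember the convention.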
Pitfall Guide
- Cardinality Explosion via User Context:
  - Mistake: Including user_id or email in metric labels.
  - Impact: Creates millions of unique time series, crashing the TSDB or causing query timeouts.
  - Mitigation: Keep user_id in traces or logs, never in metrics. If user segmentation is needed, use coarse buckets (e.g., tier: premium vs tier: free).
- Static Thresholds on Volatile Metrics:
  - Mistake: Alerting when error rate > 5% regardless of traffic volume.
  - Impact: During traffic dips, a handful of failures pushes the rate over the threshold and pages for negligible user impact. During traffic spikes, a large absolute number of errors can stay under 5% and go unnoticed.
  - Mitigation: Use burn rate alerts based on error budget consumption. This accounts for traffic variance and aligns alerts with SLO risk.
- Ignoring the "Unknown" State:
  - Mistake: Assuming metrics are always available.
  - Impact: If the metric pipeline fails, the system appears healthy because no error metrics are emitted.
  - Mitigation: Implement heartbeat metrics or pipeline health checks. Alert on metric absence for critical SLIs.
- Histogram Bucket Misalignment:
  - Mistake: Using default histogram buckets that do not cover SLO thresholds.
  - Impact: Cannot calculate SLI compliance accurately. Queries return interpolated or incorrect values.
  - Mitigation: Define buckets based on SLO targets. If the SLO is p99 < 200ms, ensure a bucket boundary at 200ms exists.
- Metric Drift and Schema Rot:
  - Mistake: Changing label names or values in code without updating dashboards and alerts.
  - Impact: Dashboards break silently; alerts stop firing.
  - Mitigation: Treat metric schemas as API contracts. Use CI/CD checks to validate metric definitions against the SRE allowlist. Version metrics if breaking changes are unavoidable.
- Over-Instrumentation of Non-Critical Paths:
  - Mistake: Emitting metrics for every function call or internal library event.
  - Impact: High cardinality, storage waste, and noise.
  - Mitigation: Instrument at service boundaries and critical failure points. Internal metrics should only exist if they directly aid in debugging SLO violations.
- Lack of Metric Documentation:
  - Mistake: Metrics exist without context on what they measure or how to use them.
  - Impact: Engineers create duplicate metrics or misinterpret existing ones.
  - Mitigation: Enforce documentation standards. Every metric must have a description linking it to an SLO and usage guidelines.
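The "Unknown" state pitfall is the easiest of these to miss in code, so here is a minimal sketch of treating a missing or stale SLI sample as its own state instead of as healthy. The `SliSample` shape, `classifySli`, and the staleness window are hypothetical; a real system would typically express this as an absence alert in the monitoring backend.

```typescript
type SliState = "healthy" | "violating" | "unknown";

// Hypothetical shape of the most recent SLI reading.
interface SliSample {
  value: number;       // SLI ratio in [0, 1]
  timestampMs: number; // when the sample was received
}

// A missing or stale sample is reported as "unknown", never as "healthy".
function classifySli(
  sample: SliSample | undefined,
  target: number,
  nowMs: number,
  maxAgeMs: number,
): SliState {
  if (sample === undefined || nowMs - sample.timestampMs > maxAgeMs) {
    return "unknown";
  }
  return sample.value >= target ? "healthy" : "violating";
}

const now = 1_000_000;
console.log(classifySli({ value: 0.999, timestampMs: now - 400_000 }, 0.99, now, 120_000)); // unknown
```

The last good value in the example is well above target, yet the result is "unknown" because the sample is older than the allowed two minutes.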
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-Volume Transactional Service | Counters with Burn Rates | Provides precise error budget tracking and efficient aggregation. | Low storage; high compute efficiency. |
| Latency-Sensitive API | Histograms with SLO-Aligned Buckets | Enables accurate percentile calculations and SLI compliance checks. | Moderate storage; requires bucket tuning. |
| System with Dynamic User Base | Coarse-Grained Dimensions Only | Prevents cardinality explosion while retaining segmentation value. | Significant storage reduction. |
| Cost-Constrained Environment | Aggregation at Collector | Reduce cardinality and volume before ingestion. | Lower ingestion costs; potential loss of granularity. |
| Rapidly Evolving Microservices | OpenTelemetry with Schema Validation | Ensures consistency across services and prevents drift. | Moderate dev overhead; long-term savings. |
Configuration Template
Prometheus recording rules for burn rate calculation. This template assumes the metrics http_requests_total (a counter with a status_code label) and http_request_duration_seconds (a histogram measured in seconds). Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a value of 1 means the budget lasts exactly the SLO period.

```yaml
groups:
  - name: sre_burn_rates
    interval: 30s
    rules:
      # 1. Short-window burn rate (5m)
      # Sustained at 14.4, this consumes 2% of a 30-day error budget in 1 hour
      - record: job:http_request_error_budget_burn:5m
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          )
          / (1 - 0.99)  # Assuming 99% SLO

      # 2. Long-window burn rate (1h)
      # Sustained at 6, this consumes 5% of a 30-day error budget in 6 hours
      - record: job:http_request_error_budget_burn:1h
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          )
          / (1 - 0.99)

      # 3. Latency SLI: fraction of requests completing within 500ms
      - record: job:http_request_latency_sli:1h
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
          / sum(rate(http_request_duration_seconds_count[1h]))
```
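The arithmetic behind a burn rate fits in a few lines, which makes thresholds such as 14.4 easier to sanity-check. The function names below are hypothetical and the 30-day SLO period is an assumption; adjust it to whatever period the SLO actually uses.

```typescript
// Burn rate = observed error ratio / error budget, where budget = 1 - SLO target.
// A value of 1 means the budget lasts exactly the SLO period.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Fraction of the period's error budget spent if `rate` holds for `hours`.
function budgetConsumed(rate: number, hours: number, periodDays: number = 30): number {
  return (rate * hours) / (periodDays * 24);
}

// 14.4% of requests failing against a 99% SLO burns budget 14.4x too fast.
console.log(burnRate(0.144, 0.99).toFixed(1)); // 14.4
// Held for one hour, that spends 2% of a 30-day budget.
console.log((budgetConsumed(14.4, 1) * 100).toFixed(1) + "%"); // 2.0%
```

Working backwards from "how much budget may an incident spend before someone is paged" to a threshold is often clearer than picking a burn rate number directly.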
Quick Start Guide
- Define SLOs: Document at least one availability and one latency SLO for the service. Specify targets (e.g., 99.9% availability, p99 < 500ms).
- Instrument with SRE Factory: Replace direct calls to the metric SDK with the SREMetricFactory wrapper. Ensure labels adhere to the allowlist.
- Deploy Recording Rules: Add the burn rate recording rules to your Prometheus/Thanos/VictoriaMetrics configuration. Adjust the SLO target and windows as needed.
- Configure Alerts: Create alerting rules that fire on the burn rate metrics. Example: Alert when job:http_request_error_budget_burn:5m > 14.4 (a burn rate of 14.4 sustained for one hour consumes 2% of a 30-day error budget).
- Validate: Simulate traffic and errors. Verify that metrics appear in the TSDB, burn rates calculate correctly, and alerts fire within expected windows. Check cardinality to ensure no explosion occurred.