rchy
Classify all metrics into tiers based on SLO impact.
- Tier 1 (Critical): Directly impacts user experience or revenue. Must appear on the primary view.
- Tier 2 (Component): Internal service health. Accessible via drill-down.
- Tier 3 (Diagnostic): Debugging metrics. Hidden by default.
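As a rough illustration of the tiering rule, the sketch below filters metrics by view depth. The `Metric` shape, the metric names, and the assumption that deeper views also show the tiers above them are illustrative, not part of the schema defined later.

```typescript
type SignalTier = 'critical' | 'component' | 'diagnostic';

interface Metric {
  name: string;
  tier: SignalTier;
}

// View depth: 1 = primary view, 2 = drill-down, 3 = diagnostic view.
type ViewDepth = 1 | 2 | 3;

// Assumption: each deeper view also shows the tiers above it.
const TIER_DEPTH: Record<SignalTier, ViewDepth> = {
  critical: 1,
  component: 2,
  diagnostic: 3,
};

// Returns only the metrics that should be visible at the given depth.
export function metricsForView(metrics: Metric[], depth: ViewDepth): Metric[] {
  return metrics.filter((m) => TIER_DEPTH[m.tier] <= depth);
}

// Hypothetical metric names, for illustration only.
const sampleMetrics: Metric[] = [
  { name: 'checkout_error_rate', tier: 'critical' },
  { name: 'db_pool_saturation', tier: 'component' },
  { name: 'gc_pause_ms', tier: 'diagnostic' },
];

const primaryNames = metricsForView(sampleMetrics, 1).map((m) => m.name);
```

Keeping the tier-to-depth mapping in one table means a metric's visibility changes by reclassifying it, not by editing individual views.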
2. Enforce Cross-Observability Links
Metrics must not exist in isolation. Every widget representing an error or latency spike must include a correlation link to:
- Traces: Sampled traces with high latency or error status codes.
- Logs: Log lines matching the timestamp and service instance of the anomaly.
- Infrastructure: Node metrics if the service is containerized.
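One way to picture these correlation links is a function that turns an anomaly's context into pre-filtered URLs. This is a minimal sketch: the `AnomalyContext` shape, the `/traces`, `/logs`, and `/infra/nodes` paths, and the query parameter names are all hypothetical and would need to match your own tooling.

```typescript
type CrossObsLink = 'trace' | 'log' | 'infra';

interface AnomalyContext {
  service: string;
  instance: string;
  startMs: number; // anomaly window start, epoch milliseconds
  endMs: number; // anomaly window end, epoch milliseconds
  containerized: boolean;
}

// Hypothetical base paths for each observability backend.
const LINK_PATHS: Record<CrossObsLink, string> = {
  trace: '/traces',
  log: '/logs',
  infra: '/infra/nodes',
};

export function buildCorrelationLinks(
  ctx: AnomalyContext,
): Partial<Record<CrossObsLink, string>> {
  const query = [
    `service=${encodeURIComponent(ctx.service)}`,
    `instance=${encodeURIComponent(ctx.instance)}`,
    `from=${ctx.startMs}`,
    `to=${ctx.endMs}`,
  ].join('&');

  const links: Partial<Record<CrossObsLink, string>> = {
    // Traces are narrowed to error status; a latency filter could be added similarly.
    trace: `${LINK_PATHS.trace}?${query}&status=error`,
    log: `${LINK_PATHS.log}?${query}`,
  };
  // Node metrics only apply when the service runs in containers.
  if (ctx.containerized) {
    links.infra = `${LINK_PATHS.infra}?${query}`;
  }
  return links;
}
```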
3. Implement Context-Aware Layouts
Use a layout engine that adapts to the incident state.
- Normal State: Shows SLO burn rates and throughput.
- Alert State: Automatically surfaces related components and recent changes (deployments, config updates).
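The two states above could be modeled as a pure function from incident context to a panel list. The sketch below assumes a hypothetical `LayoutContext` and string panel identifiers; a real layout engine would return richer panel definitions.

```typescript
type IncidentState = 'normal' | 'alert';

interface LayoutContext {
  state: IncidentState;
  relatedComponents: string[]; // e.g. services upstream or downstream of the alert
  recentChanges: string[]; // e.g. deployment or config change identifiers
}

// Panels shown in every state.
const BASE_PANELS = ['slo-burn-rate', 'throughput'];

export function selectPanels(ctx: LayoutContext): string[] {
  if (ctx.state === 'normal') {
    return [...BASE_PANELS];
  }
  // Alert state: keep the base panels and surface related context automatically.
  return [
    ...BASE_PANELS,
    ...ctx.relatedComponents.map((c) => `component:${c}`),
    ...ctx.recentChanges.map((c) => `change:${c}`),
  ];
}
```

Because the function is pure, the layout for any incident state can be unit-tested without rendering anything.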
4. Optimize Query Performance
Dashboards must load faster than the incident evolves.
- Use pre-aggregation for time-series data.
- Implement query timeouts at the dashboard layer to prevent cascading load on the observability backend.
- Use progressive rendering to display cached data while fresh queries resolve.
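The timeout and progressive-rendering points can be combined in a small loader. This is a sketch under assumptions: `withTimeout` and `loadWidget` are hypothetical helper names, the cache is a plain in-memory `Map`, and the timeout only stops waiting on the client side (it does not cancel the query on the backend).

```typescript
// Rejects if the query has not settled within timeoutMs.
// Note: this stops waiting; it does not cancel work already running on the backend.
export function withTimeout<T>(query: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Query exceeded ${timeoutMs}ms`)),
      timeoutMs,
    );
    query.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Renders cached data immediately (marked stale), then replaces it when the fresh query resolves.
export async function loadWidget<T>(
  key: string,
  cache: Map<string, T>,
  runQuery: () => Promise<T>,
  timeoutMs: number,
  render: (data: T, stale: boolean) => void,
): Promise<void> {
  const cached = cache.get(key);
  if (cached !== undefined) {
    render(cached, true);
  }
  try {
    const fresh = await withTimeout(runQuery(), timeoutMs);
    cache.set(key, fresh);
    render(fresh, false);
  } catch (error) {
    // With cached data on screen, fail quietly; otherwise surface the error.
    if (cached === undefined) {
      throw error;
    }
  }
}
```

The `stale` flag lets the UI mark cached values visually so operators know when they are not looking at live data.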
TypeScript Implementation: Dashboard Schema and Validation
The following TypeScript code defines a type-safe dashboard configuration schema. It includes a validation builder that enforces ops best practices, such as widget limits, required drill-downs, and query constraints. This acts as a guardrail for dashboard development.
// types.ts
export type SignalTier = 'critical' | 'component' | 'diagnostic';
export type CrossObsLink = 'trace' | 'log' | 'infra';
export interface WidgetConfig {
id: string;
title: string;
query: string;
tier: SignalTier;
// Every critical widget must have cross-observability links
drillDowns: CrossObsLink[];
// Query must complete within this time or fail fast
maxQueryTimeMs: number;
// Thresholds based on SLO burn rate, not static values
sloburnRateThresholds: {
warning: number; // e.g., 2x burn rate
critical: number; // e.g., 14.4x burn rate (2% of a 30-day budget consumed in 1 hour)
};
}
export interface DashboardConfig {
id: string;
name: string;
// Max widgets to prevent cognitive overload
maxWidgetsPerView: number;
widgets: WidgetConfig[];
// Auto-refresh interval; must balance freshness vs load
refreshIntervalSeconds: number;
}
// builder.ts
import { randomUUID } from 'node:crypto';
import type { DashboardConfig, WidgetConfig } from './types';
export class OpsDashboardBuilder {
private config: DashboardConfig;
constructor(name: string) {
this.config = {
id: randomUUID(),
name,
maxWidgetsPerView: 8, // Cognitive load constraint
widgets: [],
refreshIntervalSeconds: 30,
};
}
addWidget(widget: WidgetConfig): this {
// Validation 1: Cognitive Load Check
const criticalCount = this.config.widgets.filter(w => w.tier === 'critical').length;
if (widget.tier === 'critical' && criticalCount >= 6) {
throw new Error(
`Dashboard "${this.config.name}": Critical widget limit reached. Max 6 allowed to prevent cognitive overload.`
);
}
// Validation 2: Total Widget Limit
if (this.config.widgets.length >= this.config.maxWidgetsPerView) {
throw new Error(
`Dashboard "${this.config.name}": Total widget limit reached. Use drill-downs for additional metrics.`
);
}
// Validation 3: Cross-Obs Enforcement
if (widget.tier === 'critical' && widget.drillDowns.length === 0) {
throw new Error(
`Widget "${widget.title}": Critical widgets must define drillDowns for trace/log correlation.`
);
}
// Validation 4: Query Safety
if (widget.maxQueryTimeMs < 1000) {
console.warn(
`Widget "${widget.title}": Query timeout < 1s may cause false negatives during load spikes. Recommend >= 2000ms.`
);
}
this.config.widgets.push(widget);
return this;
}
setRefreshInterval(seconds: number): this {
if (seconds < 5) {
throw new Error('Refresh interval cannot be less than 5s to prevent query storms.');
}
this.config.refreshIntervalSeconds = seconds;
return this;
}
build(): Readonly<DashboardConfig> {
// Final validation
if (this.config.widgets.length === 0) {
throw new Error('Dashboard must contain at least one widget.');
}
// Check for duplicate queries
const queries = this.config.widgets.map(w => w.query);
if (new Set(queries).size !== queries.length) {
console.warn('Dashboard contains duplicate queries. Consider consolidating widgets.');
}
// Note: Object.freeze is shallow; the widgets array and its elements remain mutable.
return Object.freeze(this.config);
}
}
// usage.ts
import { OpsDashboardBuilder } from './builder';
const dashboard = new OpsDashboardBuilder('Payment Service Ops')
.addWidget({
id: 'req-rate',
title: 'Request Rate',
query: 'sum(rate(http_requests_total[5m]))',
tier: 'critical',
drillDowns: ['trace', 'log'],
maxQueryTimeMs: 2000,
sloburnRateThresholds: { warning: 2, critical: 14.4 },
})
.addWidget({
id: 'error-burn',
title: 'Error Budget Burn Rate',
query: 'burn_rate(error_slo, 1h)',
tier: 'critical',
drillDowns: ['trace'],
maxQueryTimeMs: 1500,
sloburnRateThresholds: { warning: 2, critical: 14.4 },
})
.setRefreshInterval(15)
.build();
console.log(dashboard);
Architecture Decisions
- Schema-Driven Dashboards: Dashboards are defined via code/schema, enabling version control, peer review, and automated validation. This prevents "dashboard drift" and ensures consistency.
- Query Abstraction: The dashboard layer should not query raw data directly. It should query a Metric Query API that handles caching, aggregation, and rate limiting. This isolates the dashboard from backend load.
- Correlation Engine: Implement a backend service that indexes metrics against trace and log IDs. When a widget is clicked, the dashboard passes the time range and service context to this engine, which returns pre-filtered traces/logs. This removes the need for manual query construction by the operator.
- Edge Caching: For global teams, cache dashboard metadata and aggregated results at the edge. Only drill-down queries hit the central observability store.
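To make the query abstraction decision more concrete, the sketch below wraps a backend query function with a short-lived cache. The `QueryFn` signature, the numeric-array result type, and the injectable clock are assumptions for illustration; a real Metric Query API would also handle aggregation and rate limiting.

```typescript
type QueryFn = (query: string) => Promise<number[]>;

interface CacheEntry {
  at: number; // time the query was issued, in milliseconds
  result: Promise<number[]>;
}

// Wraps a backend query function so repeated dashboard refreshes within ttlMs
// reuse one result. Caching the promise also deduplicates in-flight queries.
export function cachedQuery(
  backend: QueryFn,
  ttlMs: number,
  now: () => number = Date.now,
): QueryFn {
  const cache = new Map<string, CacheEntry>();
  return (query) => {
    const currentTime = now();
    const hit = cache.get(query);
    if (hit !== undefined && currentTime - hit.at < ttlMs) {
      return hit.result;
    }
    const result = backend(query);
    cache.set(query, { at: currentTime, result });
    // Evict failed queries so errors are not served from the cache.
    result.catch(() => {
      if (cache.get(query)?.result === result) {
        cache.delete(query);
      }
    });
    return result;
  };
}
```

With ten widgets sharing one query and a 10-second TTL, the backend sees one request per window instead of ten.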
Pitfall Guide
1. The "All Green" Fallacy
Mistake: Dashboards show green status while users report errors.
Cause: Relying solely on server-side metrics that don't capture client-side failures or network issues.
Fix: Integrate client-side RUM (Real User Monitoring) metrics and SLO burn rates. A green dashboard should mean "SLO is healthy," not "Server is up."
2. Query Storms During Incidents
Mistake: Dashboard refresh queries overload the TSDB during traffic spikes, causing data gaps.
Cause: Unoptimized queries or aggressive refresh intervals.
Fix: Use pre-aggregated data for real-time views. Implement query queuing and give dashboard queries lower priority than alerting pipelines. Enforce maxQueryTimeMs in the schema.
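A minimal sketch of the queuing idea follows, assuming two query sources and a bounded dashboard backlog. The class name `PriorityQueryQueue` and the shed-on-overflow policy are illustrative choices, not a prescribed design.

```typescript
type QuerySource = 'alerting' | 'dashboard';

interface QueuedQuery {
  source: QuerySource;
  query: string;
}

export class PriorityQueryQueue {
  private readonly alerting: QueuedQuery[] = [];
  private readonly dashboard: QueuedQuery[] = [];
  private readonly maxDashboardBacklog: number;

  constructor(maxDashboardBacklog: number) {
    this.maxDashboardBacklog = maxDashboardBacklog;
  }

  // Returns false when a dashboard query is shed instead of queued.
  enqueue(item: QueuedQuery): boolean {
    if (item.source === 'alerting') {
      this.alerting.push(item);
      return true;
    }
    if (this.dashboard.length >= this.maxDashboardBacklog) {
      return false; // shed load rather than pile onto the TSDB
    }
    this.dashboard.push(item);
    return true;
  }

  // Alerting queries always drain before dashboard queries.
  next(): QueuedQuery | undefined {
    return this.alerting.shift() ?? this.dashboard.shift();
  }
}
```

Shedding dashboard queries under pressure trades a stale panel for an alerting pipeline that keeps evaluating.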
3. Static Thresholds on Dynamic Workloads
Mistake: Alerts trigger on normal traffic variations.
Cause: Hard-coded thresholds (e.g., error rate > 1%) that don't account for traffic volume.
Fix: Use dynamic baselines or SLO-based burn rates. Thresholds should scale with traffic or focus on error budget consumption.
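For reference, burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so it scales with the SLO instead of with raw traffic. The sketch below works through the arithmetic; the function names are hypothetical, and the 99.9% target and 30-day period are example values.

```typescript
// Burn rate = observed error rate / error budget, where budget = 1 - SLO target.
export function burnRate(errorRate: number, sloTarget: number): number {
  return errorRate / (1 - sloTarget);
}

// Fraction of the period's error budget consumed in one window at a given burn rate.
export function budgetConsumed(
  rate: number,
  windowHours: number,
  periodHours: number,
): number {
  return (rate * windowHours) / periodHours;
}

type BurnStatus = 'ok' | 'warning' | 'critical';

export function classifyBurn(
  rate: number,
  thresholds: { warning: number; critical: number },
): BurnStatus {
  if (rate >= thresholds.critical) return 'critical';
  if (rate >= thresholds.warning) return 'warning';
  return 'ok';
}

// Example: 99.9% SLO, 1.5% of requests failing. The budget is 0.1%, so the burn rate is about 15x.
const exampleRate = burnRate(0.015, 0.999);
const exampleStatus = classifyBurn(exampleRate, { warning: 2, critical: 14.4 });

// At 14.4x, one hour consumes 14.4 / 720 = 2% of a 30-day (720-hour) budget.
const hourlyBudgetShare = budgetConsumed(14.4, 1, 720);
```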
4. Missing Cross-Observability Links
Mistake: Metric shows high latency, but operator must manually search traces.
Cause: Siloed tooling and lack of correlation metadata.
Fix: Ensure every metric carries context tags (service, version, region). Implement the drill-down links enforced in the TypeScript builder. Clicking a metric should open a trace view filtered to that anomaly.
5. Metric Overload
Mistake: Adding every available metric to the dashboard.
Cause: Fear of missing data.
Fix: Adhere to the widget cap. If a metric isn't actionable or tied to an SLO, it belongs in a diagnostic view, not the ops dashboard. Use the builder's total and critical widget limits to enforce the cap on primary views.
6. Dashboard Latency Mismatch
Mistake: Dashboard takes 5 seconds to load, but the incident evolves in seconds.
Cause: Heavy client-side rendering or slow backend queries.
Fix: Implement skeleton loading states. Cache previous state data. Use progressive rendering. The UI shell must render within 500ms, with data updating asynchronously.
7. Lack of Ownership and Lifecycle
Mistake: Dashboards rot as services change.
Cause: No owner assigned; dashboards are treated as static assets.
Fix: Treat dashboards as code. Assign ownership to service teams. Include dashboard validation in CI/CD pipelines. Prune widgets quarterly based on usage analytics.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-Volume Microservices | Aggregated SLO Dashboards with Drill-Down | Prevents query explosion; focuses on service boundaries. | Lowers storage costs via aggregation; reduces TSDB load. |
| Batch Processing Jobs | Event-Driven Dashboards | Real-time streaming is unnecessary; job completion events are sufficient. | Reduces compute costs; eliminates unnecessary polling. |
| Multi-Region Deployment | Regional Shards with Global Rollup | Global view for status; regional drill-down for latency isolation. | Increases storage slightly for rollups; improves incident isolation speed. |
| Regulatory Compliance | Immutable Audit Dashboards | Requires tamper-proof logs and metrics for compliance reporting. | Higher storage costs for retention; adds validation overhead. |
| Development/Testing | On-Demand Sampling | Full fidelity data is too expensive; sampling is sufficient for debugging. | Significantly reduces ingest and storage costs. |
Configuration Template
Use this JSON template as a base for dashboard configurations. It includes fields for ownership, validation constraints, and cross-observability mapping.
{
"dashboardId": "svc-payment-ops-v1",
"name": "Payment Service Operations",
"owner": "team-payments",
"reviewCadenceDays": 30,
"constraints": {
"maxCriticalWidgets": 6,
"maxTotalWidgets": 8,
"minRefreshIntervalSec": 15,
"requireDrillDowns": true
},
"widgets": [
{
"id": "w-req-rate",
"title": "Request Rate",
"tier": "critical",
"query": "sum(rate(http_requests_total{service=\"payment\"}[5m]))",
"drillDowns": {
"trace": {
"index": "traces-payment",
"filterField": "service.name",
"filterValue": "payment"
},
"log": {
"index": "logs-payment",
"filterField": "service",
"filterValue": "payment"
}
},
"thresholds": {
"type": "slo-burn-rate",
"sloId": "slo-payment-availability",
"windows": ["1h", "6h"],
"warning": 2,
"critical": 14.4
},
"visualization": {
"type": "timeseries",
"aggregation": "rate",
"cacheTtlSec": 10
}
}
]
}
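Because the template carries its own constraints, a small check can enforce them in CI before a dashboard ships. The sketch below is an assumption-laden example: `validateTemplate` and the `TemplateConfig` type are hypothetical, and only the widget-count and drill-down constraints from the template are checked.

```typescript
interface TemplateWidget {
  id: string;
  tier: 'critical' | 'component' | 'diagnostic';
  drillDowns?: Record<string, unknown>;
}

interface TemplateConfig {
  constraints: {
    maxCriticalWidgets: number;
    maxTotalWidgets: number;
    requireDrillDowns: boolean;
  };
  widgets: TemplateWidget[];
}

// Returns a list of human-readable violations; an empty list means the config passes.
export function validateTemplate(config: TemplateConfig): string[] {
  const errors: string[] = [];
  const { constraints, widgets } = config;

  if (widgets.length > constraints.maxTotalWidgets) {
    errors.push(`Too many widgets: ${widgets.length} > ${constraints.maxTotalWidgets}.`);
  }

  const critical = widgets.filter((w) => w.tier === 'critical');
  if (critical.length > constraints.maxCriticalWidgets) {
    errors.push(
      `Too many critical widgets: ${critical.length} > ${constraints.maxCriticalWidgets}.`,
    );
  }

  if (constraints.requireDrillDowns) {
    for (const w of critical) {
      if (Object.keys(w.drillDowns ?? {}).length === 0) {
        errors.push(`Widget "${w.id}": critical widgets need at least one drill-down.`);
      }
    }
  }
  return errors;
}
```

Returning all violations at once, instead of throwing on the first, tends to give reviewers a more useful CI report.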
Quick Start Guide
- Define Three SLOs: Identify the three most critical user-facing metrics for your service (e.g., Availability, Latency, Throughput).
- Create Schema Config: Copy the configuration template and define three widgets corresponding to your SLOs. Ensure drillDowns are populated with trace/log indices.
- Run Validation: Pass the config through the OpsDashboardBuilder or equivalent validation logic. Fix any errors regarding widget limits or missing links.
- Deploy Dashboard: Render the dashboard using your UI framework, consuming the validated config. Implement the drill-down handlers to fetch correlated data.
- Test with Chaos: Simulate a latency spike. Verify the dashboard highlights the anomaly, the burn rate threshold triggers, and clicking the widget reveals relevant traces within 2 seconds.