services. Mount configuration files and data volumes explicitly.
# docker-compose.yml
version: "3.8"
services:
prometheus:
image: prom/prometheus:v2.48.0
ports: ["9090:9090"]
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- prometheus_data:/prometheus
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.retention.time=15d"
- "--storage.tsdb.retention.size=50GB"
- "--web.enable-lifecycle"
grafana:
image: grafana/grafana:10.2.0
ports: ["3000:3000"]
environment:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: changeme
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning:ro
- grafana_data:/var/lib/grafana
alertmanager:
image: prom/alertmanager:v0.26.0
ports: ["9093:9093"]
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
volumes:
prometheus_data:
grafana_data:
Step 2: Prometheus Scrape Configuration
Prometheus must know what to scrape, how often, and how to handle failures. Use scrape_interval and scrape_timeout deliberately. Align timeout to 90% of interval. Enable service discovery for dynamic environments; avoid static target lists in containerized deployments.
# prometheus/prometheus.yml
global:
scrape_interval: 15s
scrape_timeout: 10s
evaluation_interval: 15s
scrape_configs:
- job_name: "node"
static_configs:
- targets: ["node-exporter:9100"]
- job_name: "app-services"
metrics_path: /metrics
scheme: http
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: "(.+):(.+)"
replacement: "${1}"
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
rule_files:
- "recording_rules.yml"
- "alerting_rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"]
Step 3: TypeScript Instrumentation
Use prom-client for Node.js/TypeScript. Expose metrics on a dedicated endpoint. Avoid dynamic label values. Use counters for cumulative events, gauges for current state, and histograms for latency distributions.
// src/metrics.ts
import * as promClient from 'prom-client';
const register = new promClient.Registry();
// Enable default metrics (event loop lag, memory, file descriptors)
promClient.collectDefaultMetrics({ register, prefix: 'app_' });
// HTTP request counter with bounded labels
const httpRequestCounter = new promClient.Counter({
name: 'app_http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code'],
registers: [register]
});
// Request duration histogram
const httpRequestDuration = new promClient.Histogram({
name: 'app_http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'route'],
buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
registers: [register]
});
// Gauge for active connections
const activeConnections = new promClient.Gauge({
name: 'app_active_connections',
help: 'Number of active WebSocket connections',
registers: [register]
});
export { register, httpRequestCounter, httpRequestDuration, activeConnections };
// src/middleware/metrics.ts
import { Request, Response, NextFunction } from 'express';
import { httpRequestCounter, httpRequestDuration } from '../metrics';
export const metricsMiddleware = (req: Request, res: Response, next: NextFunction) => {
const start = process.hrtime.bigint();
res.on('finish', () => {
const duration = Number(process.hrtime.bigint() - start) / 1e9;
const route = req.route?.path || req.path;
httpRequestCounter.inc({ method: req.method, route, status_code: res.statusCode });
httpRequestDuration.observe({ method: req.method, route }, duration);
});
next();
};
Step 4: Grafana Provisioning
Provision data sources and dashboards declaratively. Avoid manual UI imports in production. Use JSON dashboard exports with variable templating for environment/portability.
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
jsonData:
timeInterval: 15s
queryTimeout: 30s
httpMethod: POST
Architecture Decisions and Rationale
- Pull Model: Prometheus scrapes targets. This prevents agents from overwhelming the server during network partitions and enables centralized retention control.
- Local TSDB + Retention Limits: Default 15-day retention balances cost and debugging window. Size limits prevent disk exhaustion. Remote write is reserved for long-term archival, not real-time querying.
- Recording Rules: Precompute expensive aggregations (
rate(), histogram_quantile()) at the scrape interval. Reduces dashboard query latency by 60β80%.
- Alertmanager Integration: Decouples alerting from scraping. Enables grouping, inhibition, and multi-channel routing without modifying Prometheus rules.
- Label Cardinality Enforcement: Maximum 3β4 high-cardinality dimensions per metric. Dynamic values (IDs, timestamps) belong in logs or traces, not metrics.
Pitfall Guide
1. Unbounded Label Cardinality
Adding {user_id}, {request_id}, or {endpoint_with_params} creates millions of unique series. Prometheus compresses series by label set. Fragmented sets disable chunk compression, increase memory usage, and degrade query performance.
Best Practice: Restrict labels to low-cardinality dimensions (service, region, status). Route high-cardinality data to structured logs or distributed tracing.
2. Scrape Timeout Misalignment
Setting scrape_timeout equal to or greater than scrape_interval causes overlapping scrapes. Prometheus queues requests, increasing memory pressure and missing data points.
Best Practice: Configure timeout to 90% of interval. If scrape_interval: 15s, set scrape_timeout: 10s. Tune per-job based on endpoint response times.
3. Dashboard Query Anti-Patterns
Using rate() on counters without proper windowing, querying raw histograms, or joining unindexed metrics causes full TSDB scans. Dashboards load slowly and timeout under load.
Best Practice: Use rate(metric[5m]) for counters. Precompute percentiles with histogram_quantile(0.95, rate(metric_bucket[5m])). Leverage Grafana query caching and template variables for range selection.
4. Static Target Lists in Dynamic Environments
Hardcoding IPs or hostnames in static_configs breaks in auto-scaling or containerized deployments. Targets drift, scrapes fail, and gaps appear in metrics.
Best Practice: Use kubernetes_sd_configs, consul_sd_configs, or ec2_sd_configs. Rely on annotations or tags for scrape eligibility. Validate with prometheus_sd_files during local testing.
5. Ignoring Metric Type Semantics
Using gauges for cumulative events or counters for fluctuating values breaks rate() calculations. Prometheus assumes monotonic increase for counters. Resetting a counter without proper handling causes negative spikes.
Best Practice: Counters for totals (requests, errors). Gauges for current state (memory, queue depth). Histograms/Summaries for distributions (latency, payload size). Never mix semantics.
6. Alerting Without Severity Tiers
Firing identical alerts for disk usage, latency spikes, and error rates creates noise. Engineers mute channels, missing critical incidents.
Best Practice: Implement severity: critical|warning|info. Use Alertmanager grouping, inhibition rules, and route-specific receivers. Require acknowledgment for critical alerts.
7. No Retention or Compaction Policy
Running Prometheus without --storage.tsdb.retention.size or --storage.tsdb.retention.time leads to unbounded disk growth. Default compaction may lag under high write throughput.
Best Practice: Set explicit retention limits. Monitor prometheus_tsdb_head_chunks and prometheus_tsdb_compactions_failed. Tune --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration for write-heavy workloads.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Small team (<10 services), single region | Local TSDB, 15d retention, manual dashboard import | Simplicity reduces operational overhead; low cardinality expected | Low storage, minimal engineering time |
| Cloud-native auto-scaling (K8s/ECS) | Service discovery, recording rules, remote write to Thanos/VictoriaMetrics | Dynamic targets require SD; remote write enables cross-cluster querying and longer retention | Moderate storage, higher query reliability |
| High-throughput SaaS (>100k RPS) | Sharded Prometheus, federation, histogram precomputation, strict label governance | Single instance cannot handle write throughput; federation distributes load | High infrastructure cost, optimized query latency |
| Compliance/audit requirements | Dual-write to local TSDB + immutable object storage, WORM retention, dashboard audit trails | Metrics must be tamper-proof and retained for regulatory periods | Premium storage, added pipeline complexity |
Configuration Template
# prometheus/prometheus.yml (Production Baseline)
global:
scrape_interval: 10s
scrape_timeout: 8s
evaluation_interval: 10s
storage:
tsdb:
retention.time: 30d
retention.size: 100GB
scrape_configs:
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
- job_name: "application"
metrics_path: /metrics
scheme: http
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: "true"
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: (\d+)
replacement: $1
target_label: __address__
rule_files:
- "recording_rules.yml"
- "alerting_rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"]
# grafana/provisioning/dashboards/providers.yml
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards
recurse: true
Quick Start Guide
- Initialize project structure: Create directories
prometheus/, grafana/provisioning/datasources/, grafana/provisioning/dashboards/, and alertmanager/. Place the configuration templates above into their respective paths.
- Start the stack: Run
docker compose up -d. Verify Prometheus targets at http://localhost:9090/targets and Grafana at http://localhost:3000.
- Instrument a service: Add
prom-client to your TypeScript application. Expose /metrics on port 9090 or 8080. Annotate the service/pod with prometheus.io/scrape: "true" and prometheus.io/port: "<port>".
- Validate data flow: In Grafana, create a new query using the Prometheus datasource. Test
rate(app_http_requests_total[5m]) and histogram_quantile(0.95, rate(app_http_request_duration_seconds_bucket[5m])). Confirm dashboard renders without timeout.
- Lock configuration: Commit all YAML files to version control. Disable manual Grafana edits in production. Enforce label governance via CI linting (
promtool check metrics).