Step-by-Step Implementation
-
Instrument the driver layer, not the host
Host metrics are insufficient for application-level database behavior. Instrument the database driver to capture request rate, error rate, duration, and saturation (RED method) alongside connection pool metrics.
-
Establish latency distributions with histograms
Averages mask tail latency. Configure histogram buckets that match your SLO targets. Track p95 and p99 latency to detect degradation before p50 shifts.
-
Track connection pool saturation as a leading indicator
Connection exhaustion precedes query timeouts. Monitor active connections, idle connections, pending requests, and queue depth. Alert on saturation trends, not absolute counts.
-
Correlate traces, metrics, and logs
Propagate trace context into database spans. Tag metrics with service, environment, and tenant identifiers. Ensure logs carry trace IDs for cross-referencing during incident response.
TypeScript Implementation
The following example demonstrates pattern-based instrumentation using OpenTelemetry and the pg driver. It captures RED metrics, connection pool saturation, and propagates trace context.
import { trace, metrics, context, SpanKind } from '@opentelemetry/api';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { Pool, PoolClient } from 'pg';
// 1. Initialize OpenTelemetry instrumentation
const exporter = new PrometheusExporter({ port: 9464, endpoint: '/metrics' });
registerInstrumentations({
instrumentations: [
new PgInstrumentation({
enhanceDatabaseStatement: true,
ignoreStatements: ['SELECT 1'],
responseHook: (span, result) => {
span.setAttribute('db.rows_affected', result?.rowCount ?? 0);
},
}),
],
});
// 2. Configure latency histogram aligned with SLOs
const meter = metrics.getMeter('database-monitoring');
const queryLatencyHistogram = meter.createHistogram('db.query.latency', {
description: 'Database query latency in milliseconds',
unit: 'ms',
advice: {
explicitBucketBoundaries: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500],
},
});
// 3. Connection pool saturation tracking
const pool = new Pool({
host: process.env.DB_HOST,
database: process.env.DB_NAME,
max: 20,
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000,
});
const poolGauge = meter.createUpDownCounter('db.pool.connections', {
description: 'Active vs idle vs pending connections',
});
pool.on('acquire', () => poolGauge.add(1, { state: 'active' }));
pool.on('release', () => poolGauge.add(-1, { state: 'active' }));
pool.on('error', (err) => {
console.error('Pool error:', err);
poolGauge.add(1, { state: 'error' });
});
// 4. Query execution with context propagation and latency tracking
export async function executeQuery<T>(query: string, params?: any[]): Promise<T> {
const tracer = trace.getTracer('db-operations');
const startTime = performance.now();
return tracer.startActiveSpan('db.query', { kind: SpanKind.CLIENT }, async (span) => {
try {
const client: PoolClient = await pool.connect();
try {
const result = await client.query(query, params);
const duration = performance.now() - startTime;
// Record latency distribution
queryLatencyHistogram.record(duration, {
operation: query.trim().split(' ')[0].toLowerCase(),
status: 'success',
});
return result.rows as T;
} finally {
client.release();
}
} catch (error: any) {
const duration = performance.now() - startTime;
queryLatencyHistogram.record(duration, {
operation: query.trim().split(' ')[0].toLowerCase(),
status: 'error',
error_type: error.code || 'unknown',
});
span.recordException(error);
span.setStatus({ code: 2, message: error.message });
throw error;
} finally {
span.end();
}
});
}
Architecture Decisions and Rationale
- OpenTelemetry over vendor SDKs: Ensures metric and trace format consistency across PostgreSQL, Redis, and MongoDB. Eliminates lock-in and simplifies migration.
- Push vs Pull: Metrics are pushed via OTLP or Prometheus remote write to decouple collection from query patterns. Traces use OTLP push for low-latency correlation.
- Cardinality control: Tag metrics with
service, environment, and operation. Avoid tagging with user IDs or raw query strings to prevent metric explosion.
- SLO-aligned alerting: Alert on rate-of-change and saturation thresholds rather than absolute values. Use multi-window alerting to distinguish transient spikes from sustained degradation.
Pitfall Guide
-
Alerting on absolute CPU/RAM instead of saturation
High CPU does not equal database degradation. A saturated connection pool or lock wait queue will impact latency long before CPU spikes. Monitor request rate vs capacity, not resource utilization.
-
Tracking average latency instead of distributions
p50 latency is useless for SLO compliance. A single unoptimized query can spike p99 to 2000ms while p50 remains at 15ms. Always instrument histograms and alert on tail latency.
-
Ignoring connection pool queue depth
Connection exhaustion manifests as timeout errors, not slow queries. If pending requests exceed 20% of pool size, the database is backpressuring the application. Track acquire wait time and queue depth.
-
Treating replication lag as a static threshold
Replication lag drift is a trend problem, not a binary state. A 2-second lag may be acceptable for analytics but catastrophic for financial transactions. Monitor lag velocity and correlate with write throughput.
-
Over-instrumenting with debug-level query logging
Logging every query string in production introduces I/O overhead, exposes sensitive data, and bloats storage. Sample high-cardinality traces, mask parameters, and rely on metrics for continuous monitoring.
-
Decoupling database spans from application traces
Database latency without request context is unactionable. Propagate trace context into db.query spans, tag with http.route or queue.name, and ensure logs carry trace.id for cross-referencing.
-
Setting static alert thresholds
Workloads are dynamic. Static thresholds generate noise during traffic surges and miss degradation during quiet periods. Implement dynamic baselines using moving averages or SLO burn rate calculations.
Best Practices from Production
- Define database SLOs per workload tier (e.g., p95 < 50ms for user-facing, p95 < 200ms for background jobs).
- Use multi-window alerting (e.g., 5m short window + 1h long window) to filter transient spikes.
- Correlate connection pool saturation with query execution plans to identify missing indexes or lock contention.
- Implement circuit breakers at the application layer when pool saturation exceeds 80%.
- Regularly audit metric cardinality; prune unused tags to control Prometheus/Grafana storage costs.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Small team, single DB instance | Pattern-based OpenTelemetry + Prometheus | Low overhead, covers 90% of degradation patterns without full APM licensing | $0-50/mo infrastructure |
| High-throughput e-commerce | Full-stack tracing + query sampling + SLO alerting | Tail latency and connection saturation directly impact conversion rates | $200-800/mo observability stack |
| Multi-region replication | Lag velocity tracking + cross-region trace correlation | Consistency drift causes stale reads; trend detection prevents cascading failures | $150-400/mo network/metric storage |
| Legacy monolith with ORM | Driver-level instrumentation + connection pool monitoring | ORM abstraction hides query patterns; pool saturation is the leading failure indicator | $50-150/mo agent + storage |
Configuration Template
OpenTelemetry Collector configuration for database metric ingestion and trace forwarding:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
prometheus:
config:
scrape_configs:
- job_name: 'app-database-metrics'
static_configs:
- targets: ['app-service:9464']
metrics_path: '/metrics'
scrape_interval: 15s
processors:
batch:
timeout: 5s
send_batch_max_size: 1000
filter/verbose:
spans:
exclude:
match_type: regexp
names:
- 'db.query.*SELECT 1.*'
attributes/cardinality:
actions:
- key: db.statement
action: delete
- key: http.user_agent
action: delete
exporters:
prometheus:
endpoint: '0.0.0.0:8889'
namespace: 'database_observability'
otlp/jaeger:
endpoint: 'jaeger-collector:4317'
tls:
insecure: true
logging:
loglevel: warn
service:
pipelines:
metrics:
receivers: [otlp, prometheus]
processors: [batch]
exporters: [prometheus]
traces:
receivers: [otlp]
processors: [batch, filter/verbose, attributes/cardinality]
exporters: [otlp/jaeger]
Quick Start Guide
- Install OpenTelemetry SDK and PG instrumentation
npm install @opentelemetry/api @opentelemetry/instrumentation-pg @opentelemetry/exporter-prometheus
- Deploy OpenTelemetry Collector and Prometheus
Use Docker Compose to run the collector, Prometheus, and Grafana. Mount the configuration template above and expose ports 4317, 9090, and 3000.
- Instrument your application
Import the OTel initialization code in your app entrypoint. Replace raw
pg queries with the executeQuery wrapper or use @opentelemetry/instrumentation-pg auto-instrumentation.
- Validate with synthetic load
Run a lightweight load test (e.g.,
autocannon or k6) against a database endpoint. Verify metrics appear in Prometheus, traces propagate through Jaeger, and latency histograms populate correct buckets. Adjust alert thresholds based on observed p95/p99 values.