layer must never block the application thread.
Implementation: TypeScript Telemetry Batcher
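This example demonstrates an asynchronous batcher with a simple circuit breaker: while the ingestion endpoint keeps failing, new payloads are dropped instead of accumulating in memory. Callers supply traceId and spanId on each payload.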
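import { EventEmitter } from 'events';

interface TelemetryPayload {
  traceId: string;
  spanId: string;
  timestamp: number;
  data: Record<string, unknown>;
}

interface BatcherConfig {
  batchSize: number;
  flushIntervalMs: number;
  maxRetries: number; // reserved for retry logic in sendToEndpoint; unused in this sketch
  endpoint: string;
}

// Consecutive failed flushes before the breaker opens.
const BREAKER_THRESHOLD = 3;
// Assumed cooldown before the breaker lets traffic through again; tune per deployment.
const BREAKER_COOLDOWN_MS = 30000;

export class TelemetryBatcher extends EventEmitter {
  private buffer: TelemetryPayload[] = [];
  private isFlushing = false;
  private circuitBreakerFailures = 0;
  private lastFailureAt = 0;
  private readonly config: BatcherConfig;
  private readonly timer: NodeJS.Timeout;

  constructor(config: BatcherConfig) {
    super();
    this.config = config;
    this.timer = setInterval(() => void this.flush(), this.config.flushIntervalMs);
    // Do not keep the process alive only to flush telemetry.
    this.timer.unref();
  }

  add(payload: TelemetryPayload): void {
    if (this.isCircuitOpen()) {
      // 'dropped' always carries an array of payloads.
      this.emit('dropped', [payload]);
      return;
    }
    this.buffer.push(payload);
    if (this.buffer.length >= this.config.batchSize) {
      void this.flush();
    }
  }

  // Stop the periodic flush, e.g. during shutdown.
  close(): void {
    clearInterval(this.timer);
  }

  private isCircuitOpen(): boolean {
    if (this.circuitBreakerFailures < BREAKER_THRESHOLD) return false;
    // After the cooldown, accept payloads again so a successful flush can close the breaker.
    return Date.now() - this.lastFailureAt < BREAKER_COOLDOWN_MS;
  }

  private async flush(): Promise<void> {
    if (this.isFlushing || this.buffer.length === 0) return;
    this.isFlushing = true;
    const batch = this.buffer.splice(0, this.config.batchSize);
    try {
      await this.sendToEndpoint(batch);
      this.circuitBreakerFailures = 0;
    } catch (error) {
      this.circuitBreakerFailures++;
      this.lastFailureAt = Date.now();
      // EventEmitter throws on an 'error' event with no listener, so emit only when one exists.
      if (this.listenerCount('error') > 0) this.emit('error', error);
      // Re-queue or drop based on policy; here we drop to keep memory bounded.
      this.emit('dropped', batch);
    } finally {
      this.isFlushing = false;
    }
  }

  private async sendToEndpoint(batch: TelemetryPayload[]): Promise<void> {
    // Single HTTP POST (use an https:// endpoint for TLS). A production version
    // would add retry backoff bounded by maxRetries and idempotency keys.
    const response = await fetch(this.config.endpoint, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(batch),
    });
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
  }
}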
Architecture Rationale:
- Batching: Reduces network overhead and improves compression ratios at the storage layer.
- Circuit Breaker: Prevents cascading failures when the ingestion endpoint is degraded.
- Async Flush: Keeps network I/O off the caller's path. add() only appends to an in-memory buffer, so application latency does not depend on the ingestion endpoint.
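The batcher's send step mentions retry backoff without showing it. The following is a minimal sketch of one way to add it, assuming exponential backoff with a cap. The names backoffDelayMs and sendWithRetry and the default delays are hypothetical and not part of the batcher above.

```typescript
// Hypothetical helper: delay before retry number `attempt` (0-based),
// doubling from baseMs and capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 5000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: calls `send` up to maxRetries + 1 times and
// rethrows the last error if every attempt fails.
async function sendWithRetry<T>(
  send: () => Promise<T>,
  maxRetries: number,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await send();
    } catch (error) {
      lastError = error;
      // Wait only if another attempt will follow.
      if (attempt < maxRetries) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

Wrapping the POST in sendWithRetry(() => post(batch), config.maxRetries) would let the circuit breaker count only batches that failed after all retries. If idempotency keys are used, the key should be generated once per batch, outside the retry loop, so every attempt carries the same key.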
3. Polyglot Storage Backends
Select storage engines optimized for specific data shapes.
- Metrics: Deploy a TSDB like VictoriaMetrics or Prometheus-compatible remote storage. Configure downsampling rules to aggregate raw samples into 1m, 5m, and 1h resolutions for long-term retention.
- Logs: Use a search-optimized store like OpenSearch or ClickHouse. Enforce strict mappings for trace_id, span_id, and level to enable fast filtering.
- Traces: Utilize a dedicated trace store like Tempo or Jaeger. Index traces by trace_id, service name, duration, and error status.
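To make the downsampling idea concrete, here is a minimal sketch of aggregating raw samples into fixed windows. In practice the TSDB does this itself through its own configuration. The MetricSample and DownsampledBucket types and the downsample function are hypothetical, and timestamps are assumed to be epoch milliseconds.

```typescript
interface MetricSample {
  timestamp: number; // assumed epoch milliseconds
  value: number;
}

interface DownsampledBucket {
  start: number; // inclusive start of the window
  count: number;
  sum: number;
  min: number;
  max: number;
}

// Group samples into windows of windowMs and keep count/sum/min/max per window.
// The average can be derived later as sum / count.
function downsample(samples: MetricSample[], windowMs: number): DownsampledBucket[] {
  const buckets = new Map<number, DownsampledBucket>();
  for (const s of samples) {
    const start = Math.floor(s.timestamp / windowMs) * windowMs;
    const bucket = buckets.get(start);
    if (bucket === undefined) {
      buckets.set(start, { start, count: 1, sum: s.value, min: s.value, max: s.value });
    } else {
      bucket.count += 1;
      bucket.sum += s.value;
      bucket.min = Math.min(bucket.min, s.value);
      bucket.max = Math.max(bucket.max, s.value);
    }
  }
  return Array.from(buckets.values()).sort((a, b) => a.start - b.start);
}
```

Calling downsample(raw, 60000) yields the 1m resolution. Because count, sum, min, and max combine without loss, coarser resolutions can be built from the 1m buckets rather than from the raw samples.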
4. The Correlation Bridge
Implement a lightweight service that ingests metadata from all backends and builds a correlation index. This index maps trace_id to log references and metric windows.
Implementation: Correlation Indexer
interface CorrelationRecord {
  traceId: string;
  service: string;
  logRef: string; // Pointer to log document ID
  metricWindow: { // Time window for metric aggregation
    start: number;
    end: number;
  };
  errorStatus: boolean;
}

export class CorrelationIndexer {
  private index: Map<string, CorrelationRecord[]> = new Map();

  // Insert a record, or replace the existing one for the same log reference
  // (assumes logRef identifies a record within a trace).
  upsert(record: CorrelationRecord): void {
    const records = this.index.get(record.traceId) ?? [];
    const i = records.findIndex(r => r.logRef === record.logRef);
    if (i >= 0) records[i] = record;
    else records.push(record);
    this.index.set(record.traceId, records);
  }

  // Return every record linked to the trace, or an empty array if none is indexed.
  resolve(traceId: string): CorrelationRecord[] {
    return this.index.get(traceId) ?? [];
  }

  // Periodic cleanup to manage memory footprint: drop a trace once all of
  // its metric windows end before the cutoff.
  prune(olderThan: number): void {
    for (const [traceId, records] of this.index.entries()) {
      if (records.every(r => r.metricWindow.end < olderThan)) {
        this.index.delete(traceId);
      }
    }
  }
}
Architecture Rationale:
- Pre-computation: Joins are performed at ingestion time, not query time. This shifts cost to the write path, where it is amortized, so a query reduces to a keyed lookup. That is what makes a sub-50ms query latency target realistic.
- Lightweight: The index stores only references and metadata, not full payloads, keeping memory usage low.
- Lifecycle: Periodic pruning prevents unbounded growth of the correlation index.
5. Unified Query Layer
Expose a GraphQL or gRPC API that accepts cross-signal queries. The query layer translates high-level requests into backend-specific queries and merges results using the correlation index.
- Query Pattern: A lookup by trace ID, shown here in REST form as GET /correlation?traceId=abc123, returns the trace graph, associated error logs, and latency metrics for the duration of the trace.
- Materialized Views: For high-frequency dashboard queries, pre-aggregate results into materialized views to avoid repeated computation.
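The translate-and-merge step can be sketched as follows, assuming the correlation index is available as a lookup function and each backend is wrapped in a small client. SignalBackends, CorrelatedView, and resolveCorrelation are hypothetical names; the backend methods stand in for whatever query APIs the chosen stores expose.

```typescript
interface MetricWindow {
  start: number;
  end: number;
}

interface CorrelationRef {
  logRef: string;
  metricWindow: MetricWindow;
}

// Hypothetical backend clients; real ones would wrap the trace, log, and metric stores.
interface SignalBackends {
  fetchTrace(traceId: string): Promise<unknown>;
  fetchLogs(logRefs: string[]): Promise<unknown[]>;
  fetchMetrics(window: MetricWindow): Promise<unknown[]>;
}

interface CorrelatedView {
  trace: unknown;
  logs: unknown[];
  metrics: unknown[];
}

async function resolveCorrelation(
  traceId: string,
  lookup: (traceId: string) => CorrelationRef[],
  backends: SignalBackends,
): Promise<CorrelatedView | null> {
  const refs = lookup(traceId);
  if (refs.length === 0) return null; // nothing indexed for this trace

  // Widen to a single window that covers every record of the trace.
  const metricWindow: MetricWindow = {
    start: Math.min(...refs.map((r) => r.metricWindow.start)),
    end: Math.max(...refs.map((r) => r.metricWindow.end)),
  };

  // Query the three backends concurrently, then merge into one response.
  const [trace, logs, metrics] = await Promise.all([
    backends.fetchTrace(traceId),
    backends.fetchLogs(refs.map((r) => r.logRef)),
    backends.fetchMetrics(metricWindow),
  ]);
  return { trace, logs, metrics };
}
```

Because the index already holds the log references and metric window, the three backend calls are independent and can run in parallel; the slowest backend, not the sum of all three, bounds the response time.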
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Cardinality Explosion | Using high-cardinality fields (e.g., user_id, request_id) as metric tags causes index blowouts and makes storage costs scale with traffic. | Restrict metric tags to low-cardinality dimensions (service, region, env). Use trace_id for debugging correlations instead of metric tags. |
| Synchronous Ingestion | Sending telemetry synchronously within request handlers adds latency and risks blocking application threads during network hiccups. | Use async batchers with backpressure. Ensure telemetry failures do not impact user-facing requests. |
| Context Loss | Logs emitted without trace_id or span_id cannot be correlated with traces, breaking the debugging workflow. | Enforce context propagation in sidecars or SDKs. Validate that all log entries contain mandatory correlation fields via schema validation. |
| Alert Fatigue | Threshold-based alerts on raw metrics generate noise during transient spikes, leading to ignored alerts. | Implement SLO-based burn rate alerts. Use multi-condition logic (e.g., error rate AND latency) and suppress alerts during known incidents. |
| Retention Mismatch | Applying uniform retention policies wastes storage. Traces are rarely needed after 90 days, while compliance logs may require 365 days. | Define tiered retention policies per data type. Move cold data to cheaper object storage with reduced query performance. |
| Global Search Dependency | Relying on full-text searches across all logs for debugging is slow and expensive at scale. | Create materialized views for common query patterns. Use structured fields for filtering instead of text search. |
| Missing Governance | Uncontrolled dashboard and alert proliferation leads to configuration drift and security risks. | Treat observability configs as code. Version control dashboards, enforce RBAC, and validate changes via CI/CD pipelines. |
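The burn rate alerting suggested for the Alert Fatigue pitfall comes down to a small calculation. The sketch below assumes error rates are already measured over a long and a short window; burnRate and shouldAlert are hypothetical names, and the threshold is a parameter to tune rather than a fixed rule.

```typescript
// Burn rate: how fast the error budget is being consumed relative to plan.
// A value of 1 means the budget lasts exactly the SLO period.
function burnRate(errorRate: number, sloTarget: number): number {
  const errorBudget = 1 - sloTarget; // e.g. 0.001 for a 99.9% SLO
  return errorRate / errorBudget;
}

// Multi-window check: fire only when both windows exceed the threshold.
// The long window shows the burn is sustained; the short window shows it
// is still happening, which filters out transient spikes.
function shouldAlert(
  longWindowErrorRate: number,
  shortWindowErrorRate: number,
  sloTarget: number,
  threshold: number,
): boolean {
  return (
    burnRate(longWindowErrorRate, sloTarget) >= threshold &&
    burnRate(shortWindowErrorRate, sloTarget) >= threshold
  );
}
```

For a 99.9% SLO over 30 days, a burn rate of 14.4 sustained for one hour spends 2% of the monthly budget (14.4 of 720 hours), which is why 14.4 is a common paging threshold for a 1h/5m window pair.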
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-Volume Metrics | Remote write to TSDB with downsampling | TSDBs optimize for time-series compression and high write throughput. | Low (Efficient compression) |
| Compliance Logs | Immutable object store with indexing | Ensures auditability and long-term retention at minimal cost. | Medium (Storage cost, low compute) |
| Debugging Traces | Probabilistic sampling + Error-based retention | Reduces trace volume while preserving high-value error traces. | Low (Significant volume reduction) |
| Cross-Signal Queries | Pre-computed correlation index | Eliminates expensive joins at query time; ensures low latency. | Medium (Ingestion compute cost) |
| Ad-Hoc Analysis | Materialized views for hot paths | Accelerates frequent queries; falls back to raw scan for rare cases. | Low (Storage overhead vs compute savings) |
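The "Probabilistic sampling + Error-based retention" row can be illustrated with a small decision function. This is a sketch under two assumptions: the decision is made once the trace's error status is known, and sampling should be deterministic per trace ID so every service reaches the same verdict. fnv1a32 and shouldKeepTrace are hypothetical names; the hash is a plain FNV-1a over the ID's characters.

```typescript
// 32-bit FNV-1a hash of a string, returned as an unsigned integer.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Keep every trace that contains an error; keep the rest with probability `rate`.
// Hashing the trace ID (instead of calling Math.random) makes the decision
// repeatable for the same trace.
function shouldKeepTrace(traceId: string, hasError: boolean, rate: number): boolean {
  if (hasError) return true;
  const position = fnv1a32(traceId) / 4294967296; // maps the hash into [0, 1)
  return position < rate;
}
```

With rate set to 0.1 and the error override on, roughly one in ten healthy traces is stored while error traces are always retained.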
Configuration Template
# observability-pipeline.yaml
pipeline:
  ingestion:
    batch_size: 500
    flush_interval_ms: 2000
    max_retries: 3
    tls_enabled: true
    redaction_rules:
      - field: "user_email"
        action: "hash"
      - field: "credit_card"
        action: "mask"
  storage:
    metrics:
      backend: "victoria-metrics"
      retention:
        raw: "7d"
        downsample_1m: "90d"
        downsample_1h: "365d"
    logs:
      backend: "opensearch"
      retention: "180d"
      mapping_strict: true
    traces:
      backend: "tempo"
      sampling:
        rate: 0.1
        error_override: true
      retention: "90d"
  correlation:
    index_ttl: "24h"
    prune_interval: "1h"
    fields:
      - "trace_id"
      - "span_id"
      - "service"
      - "environment"
  governance:
    rbac:
      roles:
        - name: "sre"
          access: ["metrics", "traces", "logs:non-pii"]
        - name: "dev"
          access: ["metrics", "traces", "logs:own-service"]
    cost_alerts:
      threshold_ingest_mbps: 500
      notification_channel: "#observability-costs"
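The template expresses retention and TTL values as strings such as "7d" and "24h". A minimal sketch of turning them into numbers a service can act on follows. parseDurationMs and pruneCutoff are hypothetical helpers, and only the s, m, h, and d suffixes are assumed to be supported.

```typescript
const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60000,
  h: 3600000,
  d: 86400000,
};

// Convert a duration like "24h" or "7d" into milliseconds.
function parseDurationMs(text: string): number {
  const match = /^(\d+)([smhd])$/.exec(text);
  if (match === null) throw new Error(`Unsupported duration: ${text}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}

// Cutoff timestamp for pruning: entries that ended before it are older than the TTL.
function pruneCutoff(indexTtl: string, now: number = Date.now()): number {
  return now - parseDurationMs(indexTtl);
}
```

A scheduler running every prune_interval could pass pruneCutoff(index_ttl) to the correlation indexer's prune method, assuming metricWindow.end is stored as an epoch-millisecond timestamp.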
Quick Start Guide
- Deploy Ingestion Agents: Install the sidecar agent on target services. Configure the OTLP endpoint to point to your ingestion gateway. Verify that telemetry is being batched and transmitted without blocking application threads.
- Initialize Storage Backends: Provision the TSDB, log store, and trace store. Apply retention policies and indexing configurations as defined in the template. Ensure TLS is enforced for all data in transit.
- Launch Correlation Service: Deploy the correlation indexer. Configure it to consume metadata streams from all backends. Verify that trace_id lookups return linked log references and metric windows.
- Validate Cross-Signal Query: Execute a test query using a known trace_id. Confirm that the unified query layer returns the trace graph, associated logs, and latency metrics within the expected latency threshold (<50ms).
- Enable Governance: Apply RBAC policies and set up cost monitoring alerts. Run a synthetic load test to validate ingestion resilience and alert triggering. Review dashboard configurations for cardinality compliance.