tion at query time. The architecture centers on OpenTelemetry as the collection standard, an OTel Collector as the routing and transformation layer, and purpose-built storage backends optimized for each signal type.
Step 1: Instrumentation Strategy
Use OpenTelemetry SDKs for explicit instrumentation of business logic, combined with auto-instrumentation for framework and library telemetry. Explicit spans must carry domain attributes (tenant ID, request type, feature flag state) to enable high-cardinality filtering without cross-referencing external systems.
import { SpanStatusCode, trace } from '@opentelemetry/api';

// PaymentRequest and executeGateway are assumed to be defined elsewhere in the service.
export async function processPayment(transaction: PaymentRequest) {
  const tracer = trace.getTracer('payment-service');
  return tracer.startActiveSpan('payment.process', async (span) => {
    try {
      // Domain attributes use plain string keys; the semantic-conventions
      // package does not define payment attributes.
      span.setAttributes({
        'payment.method': transaction.method,
        'payment.tenant_id': transaction.tenantId,
        'payment.is_retry': transaction.retryCount > 0,
      });
      const result = await executeGateway(transaction);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      // Caught values are typed unknown in strict TypeScript, so normalize to Error first.
      const error = err instanceof Error ? err : new Error(String(err));
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw err;
    } finally {
      span.end();
    }
  });
}
Step 2: Collector Pipeline Architecture
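Deploy the OTel Collector as a sidecar or DaemonSet. Configure separate pipelines for metrics, traces, and logs so that each signal gets its own processing: tail-based sampling for traces, aggregation for high-cardinality metric dimensions, and structured parsing and enrichment for logs. Tail-based sampling only works when every span of a trace reaches the same Collector instance, so per-pod or per-node agents typically forward traces to a gateway tier that routes by trace ID.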
Architecture decisions:
- Collector over direct export: Decouples instrumentation from backend volatility. Enables protocol translation, batching, retry logic, and sampling before data leaves the cluster.
- Tail-based sampling over head-based: Preserves complete traces containing errors or latency outliers. Head-based sampling decides before the request outcome is known, so it drops failing and slow traces at the same rate as healthy ones.
- Metric aggregation at collection: Prevents high-cardinality explosion by applying dimension budgets and rollups before persistence. Raw histograms are aggregated into percentiles; counters are downsampled by interval.
- Separate storage, unified query: Metrics to Prometheus/VictoriaMetrics, traces to Tempo/Jaeger, logs to Loki/OpenSearch. Correlation happens at query time via trace ID injection into logs and metric label matching.
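To make the "metric aggregation at collection" decision concrete, the following is a minimal sketch of a dimension budget: labels outside an allow-list are dropped, and counter points that collapse onto the same series are summed. The `CounterPoint` shape and the `applyDimensionBudget` name are hypothetical and exist only for this illustration; they are not OpenTelemetry types, and a real Collector would do this in a processor rather than in application code.

```typescript
// Hypothetical data shape for illustration; not an OpenTelemetry type.
interface CounterPoint {
  labels: Record<string, string>;
  value: number;
}

// Keep only allow-listed labels and sum the points that collapse onto the same series.
function applyDimensionBudget(
  points: CounterPoint[],
  allowedLabels: string[],
): CounterPoint[] {
  const sortedAllowed = allowedLabels.slice().sort();
  const merged = new Map<string, CounterPoint>();

  for (const point of points) {
    const kept: Record<string, string> = {};
    for (const name of sortedAllowed) {
      const value = point.labels[name];
      if (value !== undefined) kept[name] = value;
    }
    // Label names are inserted in sorted order, so the JSON string is a stable series key.
    const seriesKey = JSON.stringify(kept);
    const existing = merged.get(seriesKey);
    if (existing) {
      existing.value += point.value;
    } else {
      merged.set(seriesKey, { labels: kept, value: point.value });
    }
  }
  return Array.from(merged.values());
}
```

With an allow-list of `['http.method']`, two points that differ only in `user.id` become a single series whose value is their sum, which is the effect a cardinality budget is meant to have before data reaches the TSDB.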
Step 3: Adaptive Sampling & Cardinality Control
Configure the OTel Collector to sample based on trace attributes, not random probability. Combine with metric dimension limits and log retention policies.
processors:
  tail_sampling:
    policies:
      - name: error_preservation
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: latency_threshold
        type: latency
        latency: { threshold_ms: 800 }
      - name: business_critical
        # payment.is_retry is recorded as a boolean, so match it with boolean_attribute
        type: boolean_attribute
        boolean_attribute: { key: "payment.is_retry", value: true }
  metricstransform:
    transforms:
      - include: "http_server_duration"
        match_type: strict
        action: update
        operations:
          # Keep only the listed labels and merge data points that collapse together.
          # This drops high-cardinality labels like "user.id" or "session.token".
          - action: aggregate_labels
            label_set:
              - "service.name"
              - "http.method"
              - "http.status_code"
            aggregation_type: sum
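The policy list above can be read as "keep the trace if any policy matches". The sketch below models that decision in plain TypeScript, assuming a simplified `FinishedSpan` shape and helper names (`hasError`, `slowerThan`, `hasAttribute`, `shouldKeepTrace`) that are invented for this illustration. It is not the Collector's implementation, which also buffers spans for a configurable `decision_wait` period before evaluating a trace.

```typescript
// Hypothetical, simplified span shape; real spans carry many more fields.
interface FinishedSpan {
  status: 'UNSET' | 'OK' | 'ERROR';
  startMs: number;
  endMs: number;
  attributes: Record<string, string | number | boolean>;
}

type Policy = (spans: FinishedSpan[]) => boolean;

// Matches when any span in the trace ended with an error status.
const hasError: Policy = (spans) => spans.some((s) => s.status === 'ERROR');

// Matches when the whole trace (earliest start to latest end) exceeds the threshold.
const slowerThan = (thresholdMs: number): Policy => (spans) => {
  if (spans.length === 0) return false;
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  return end - start > thresholdMs;
};

// Matches when any span carries the given attribute value.
const hasAttribute = (key: string, value: string | number | boolean): Policy =>
  (spans) => spans.some((s) => s.attributes[key] === value);

// The decision is made once the trace is complete: keep it if any policy matches.
function shouldKeepTrace(spans: FinishedSpan[], policies: Policy[]): boolean {
  return policies.some((policy) => policy(spans));
}
```

Because the decision needs every span of the trace, this kind of evaluation only makes sense on a component that sees the complete trace, which is why it lives in the Collector rather than in the SDK.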
Step 4: Cross-Signal Correlation
Inject trace IDs into log output at the application level. Configure the collector to parse logs and attach trace_id and span_id fields. Query engines then join signals using these identifiers without requiring manual correlation.
import { context, trace } from '@opentelemetry/api';

// Returns correlation fields for the active span, or an empty object when no span is active.
function attachTraceToLogger(): Record<string, string> {
  const span = trace.getSpan(context.active());
  if (!span) return {};
  const spanContext = span.spanContext();
  return {
    trace_id: spanContext.traceId,
    span_id: spanContext.spanId,
    trace_flags: spanContext.traceFlags.toString(16).padStart(2, '0'),
  };
}
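The trace ID, span ID, and flags used for log correlation are the same components that the W3C Trace Context `traceparent` header carries, so the encoding is worth seeing once. The sketch below handles only version `00` of the header; the `TraceFields` type and the `toTraceparent` and `fromTraceparent` helpers are hypothetical names for this illustration. In an instrumented service, the `propagation.inject` and `propagation.extract` functions from `@opentelemetry/api` would normally do this work.

```typescript
// Field layout follows W3C Trace Context: version-traceid-parentid-flags.
interface TraceFields {
  trace_id: string;    // 32 lowercase hex characters
  span_id: string;     // 16 lowercase hex characters
  trace_flags: string; // 2 lowercase hex characters
}

const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Serialize correlation fields into a traceparent value for a message header.
function toTraceparent(fields: TraceFields): string {
  return `00-${fields.trace_id}-${fields.span_id}-${fields.trace_flags}`;
}

// Parse a traceparent value; returns undefined for malformed or all-zero IDs.
function fromTraceparent(header: string): TraceFields | undefined {
  const match = TRACEPARENT_V00.exec(header.trim());
  if (!match) return undefined;
  const [, traceId = '', spanId = '', flags = ''] = match;
  // The specification treats all-zero trace and span IDs as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { trace_id: traceId, span_id: spanId, trace_flags: flags };
}
```

A producer could place the output of `toTraceparent` in a message header, and the consumer could call `fromTraceparent` before starting its own span, which keeps queue and batch hops inside the same trace.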
Pitfall Guide
- Treating auto-instrumentation as complete coverage
  Auto-instrumentation captures HTTP calls, DB queries, and message broker interactions, but it cannot infer business intent. Without explicit spans for domain operations (e.g., checkout.calculate_tax, onboarding.verify_identity), traces become infrastructure noise. Best practice: Reserve auto-instrumentation for libraries and frameworks. Explicitly instrument business workflows and attach domain attributes.
- Ignoring high-cardinality dimension budgets
  Ingesting user IDs, session tokens, or ephemeral request IDs into metrics creates unbounded series. Prometheus and VictoriaMetrics can run out of memory or suffer degraded query performance as the series count grows. Best practice: Apply cardinality budgets at the collector. Drop or hash ephemeral dimensions. Use logs for high-cardinality data, metrics for aggregated patterns.
- Fixed-rate sampling without context awareness
  Dropping 90% of traces at random also drops roughly 90% of the traces that contain failures. Tail-based sampling preserves error paths, latency outliers, and business-critical transactions. Best practice: Configure multiple sampling policies. Use status_code for errors, latency for SLO breaches, and attribute-based policies for feature flags or tenant tiers.
- Mixing correlation boundaries across services
  Passing trace context via HTTP headers works for synchronous calls. Async boundaries (message queues, event buses, batch jobs) break correlation if trace context is not explicitly propagated in message metadata. Best practice: Serialize traceparent into message headers or the payload, following the W3C Trace Context standard. Verify propagation in async consumers.
- Alerting on symptoms instead of SLOs
  Static thresholds such as CPU > 80% or error rate > 5% fire on momentary symptoms rather than sustained user impact. Cloud-native systems self-heal; transient spikes are normal. Best practice: Define error budgets and SLOs. Alert on burn rate, not absolute thresholds. Use multi-window, multi-burn-rate alerting to distinguish transient noise from sustained degradation.
- Storing raw logs without structured parsing
  Unstructured logs require full-text search, which is slow and expensive. JSON logs with consistent keys enable index-free filtering. Best practice: Enforce structured logging at the SDK level. Map log levels to OTel severity. Attach trace_id and span_id to every log line. Drop debug logs in production unless explicitly enabled per tenant.
- Neglecting infrastructure-level visibility
  Application telemetry cannot diagnose node networking issues, container OOM kills, or storage latency spikes. eBPF and node exporters fill the gap. Best practice: Deploy eBPF-based collectors (e.g., Cilium, Pixie, or OpenTelemetry eBPF receiver) for network, filesystem, and scheduler metrics. Correlate infrastructure anomalies with application traces using node/pod labels.
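The burn-rate alerting mentioned in the pitfalls above comes down to a small piece of arithmetic: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a multi-window alert fires only when both a long and a short window exceed the threshold. The sketch below assumes hypothetical `WindowCounts`, `burnRate`, and `shouldPage` names; the threshold of 14.4 used in the usage note is a commonly cited value for a 1-hour and 5-minute window pair against a 30-day budget, not a figure taken from this article.

```typescript
// Request counts observed in one evaluation window (hypothetical shape).
interface WindowCounts {
  total: number;
  errors: number;
}

// Burn rate = observed error ratio / error budget, where budget = 1 - SLO target.
// A burn rate of 1 consumes the budget exactly over the SLO period.
function burnRate(window: WindowCounts, sloTarget: number): number {
  const errorBudget = 1 - sloTarget;
  if (errorBudget <= 0) throw new Error('SLO target must be below 1');
  if (window.total === 0) return 0;
  return window.errors / window.total / errorBudget;
}

// Page only when both windows burn fast: the long window shows the problem is
// significant, the short window shows it is still happening.
function shouldPage(
  longWindow: WindowCounts,
  shortWindow: WindowCounts,
  sloTarget: number,
  threshold: number,
): boolean {
  return (
    burnRate(longWindow, sloTarget) >= threshold &&
    burnRate(shortWindow, sloTarget) >= threshold
  );
}
```

For a 99.9% SLO, a 2% error ratio corresponds to a burn rate of about 20. Calling `shouldPage` with a threshold of 14.4 would page if both windows show that ratio, while a brief spike that appears only in the short window would not.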
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-traffic API with user-level metrics | Aggregate at collector, store in TSDB, query via PromQL | Prevents series explosion; maintains SLO visibility | -60% vs raw ingestion |
| Event-driven microservices | Tail-based sampling + async trace propagation | Preserves failure paths across queues/batches | +15% storage, -40% MTTR |
| Multi-tenant SaaS with feature flags | Explicit instrumentation with tenant/flag attributes | Enables per-tenant SLO tracking and canary analysis | +20% compute, +35% diagnostic accuracy |
| Batch processing / data pipelines | eBPF + job-level spans, not per-record tracing | Reduces noise; focuses on job completion and resource contention | -50% trace volume, stable MTTR |
| Serverless / ephemeral functions | Auto-instrumentation + cold-start metrics, no high-cardinality logs | Functions lack persistent state; cold-start dominates latency | -30% retention cost, faster scale diagnosis |
Configuration Template
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'node'
          static_configs:
            - targets: ['localhost:9100']
  filelog:
    # Kubernetes pod logs live under /var/log/pods/<namespace>_<pod>_<uid>/<container>/
    include: ['/var/log/pods/*/*/*.log']
    operators:
      # Assumes each line body is already JSON; any container runtime prefix must be parsed first
      - type: json_parser
        timestamp:
          parse_from: attributes.timestamp
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      - type: trace_parser
        trace_id:
          parse_from: attributes.trace_id
        span_id:
          parse_from: attributes.span_id

processors:
  tail_sampling:
    policies:
      - name: error_trace
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow_request
        type: latency
        latency: { threshold_ms: 1200 }
      - name: critical_tenant
        type: string_attribute
        string_attribute: { key: "tenant.tier", values: ["enterprise"] }
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512
  batch:
    timeout: 5s
    send_batch_max_size: 10000

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "otel"
  otlp:
    endpoint: "tempo.monitoring:4317"
    tls:
      insecure: true
  # Check that your Collector release still ships the loki exporter; newer
  # releases may expect logs to be sent to Loki over OTLP instead.
  loki:
    endpoint: "http://loki.monitoring:3100/loki/api/v1/push"
    default_labels_enabled:
      exporter: true
      job: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [filelog]
      processors: [memory_limiter, batch]
      exporters: [loki]
Quick Start Guide
- Deploy the OTel Collector: Run otelcol-contrib as a sidecar or DaemonSet using the configuration template above. Map pod log volumes and expose OTLP ports (4317/4318).
- Instrument the application: Add @opentelemetry/sdk-node and auto-instrumentation packages. Set OTEL_EXPORTER_OTLP_ENDPOINT to the Collector's OTLP address: http://otel-collector:4317 for gRPC exporters or http://otel-collector:4318 for HTTP exporters.
- Configure storage backends: Deploy Prometheus (metrics), Tempo (traces), and Loki (logs) in a monitoring namespace. Point collector exporters to their respective endpoints.
- Validate correlation: Trigger a request that fails or exceeds latency thresholds. Query Tempo for the trace ID, then filter Loki logs by the same trace_id and check the Prometheus metrics for the matching service and time window. Confirm cross-signal alignment.
- Enforce cardinality limits: Review metric series cardinality via the /metrics endpoint. Adjust metricstransform or other collector processors to drop ephemeral labels. Set log retention to 7-14 days for debug, 30-90 days for info/warn.
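One way to review cardinality from a /metrics endpoint is to count how many sample lines each metric name contributes, since every distinct label combination appears as its own line in the Prometheus text exposition format. The sketch below is a rough, assumption-laden helper: `countSeriesByMetric` is a hypothetical name, it expects the response body as a string, and it counts histogram `_bucket`, `_sum`, and `_count` lines as separate names.

```typescript
// Count sample lines (one per series) for each metric name in exposition text.
function countSeriesByMetric(exposition: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const rawLine of exposition.split('\n')) {
    const line = rawLine.trim();
    // Skip blank lines and "# HELP" / "# TYPE" comment lines.
    if (line === '' || line.charAt(0) === '#') continue;
    // The metric name is everything before the label block or the value.
    const match = /^[a-zA-Z_:][a-zA-Z0-9_:]*/.exec(line);
    const name = match ? match[0] : undefined;
    if (!name) continue;
    counts.set(name, (counts.get(name) ?? 0) + 1);
  }
  return counts;
}
```

Sorting the resulting map by count gives a quick list of the metric names with the most series, which are the first candidates for label dropping or aggregation in the collector.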