y standard for this stack.
Step 1: Initialize the Tracer Provider
Install the core SDK and auto-instrumentations. Configure the tracer provider with W3C Trace Context propagation and an OTLP exporter.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

// NodeSDK registers the W3C Trace Context and Baggage propagators by default
// when no textMapPropagator is supplied.
const sdk = new NodeSDK({
  // Resource attributes identify this service on every exported span.
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'order-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.4.2',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
    headers: { 'api-key': process.env.OTEL_API_KEY || '' },
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        // Skip health checks; the hook must return a boolean, so default to false.
        ignoreIncomingRequestHook: (req) => req.url?.includes('/health') ?? false,
      },
    }),
  ],
});

sdk.start();

// Flush pending spans before the process exits.
process.on('SIGTERM', () => sdk.shutdown().catch(console.error));
Step 2: Bridge Async Context Boundaries
Auto-instrumentation does not cover every message queue client, worker thread, or custom transport. Where it does not, you must manually propagate the W3C trace context.
import { context } from '@opentelemetry/api';
import { W3CTraceContextPropagator } from '@opentelemetry/core';

const propagator = new W3CTraceContextPropagator();

// Inject context into outgoing message headers
export function injectTraceContext(headers: Record<string, string>) {
  propagator.inject(context.active(), headers, {
    set: (h, k, v) => {
      h[k] = v;
    },
  });
  return headers;
}

// Extract and restore context on consumer side.
// Run the consumer handler inside context.with(ctx, handler) so that spans
// started there join the producer's trace.
export function extractTraceContext(headers: Record<string, string>) {
  const ctx = propagator.extract(context.active(), headers, {
    get: (h, k) => h[k],
    keys: (h) => Object.keys(h),
  });
  return ctx;
}
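To make the propagated data concrete, here is a minimal sketch of the W3C traceparent header that the inject and extract helpers carry between producer and consumer. It is not the propagator's own code: the TraceParent type and both function names are hypothetical, and the parser only handles the four-field layout of version 00.

```typescript
// Shape of a W3C traceparent header: version-traceId-spanId-flags.
interface TraceParent {
  version: string;  // 2 hex chars, "00" in the current spec
  traceId: string;  // 32 hex chars, must not be all zeros
  spanId: string;   // 16 hex chars, must not be all zeros
  sampled: boolean; // lowest bit of the 2-hex-char flags field
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Serialize a trace parent into the header value a producer would attach.
function formatTraceParent(tp: TraceParent): string {
  const flags = tp.sampled ? '01' : '00';
  return `${tp.version}-${tp.traceId}-${tp.spanId}-${flags}`;
}

// Parse a header value on the consumer side; returns null when it is
// missing or malformed, in which case the consumer starts a new trace.
function parseTraceParent(header: string | undefined): TraceParent | null {
  if (!header) return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  if (version === 'ff') return null; // reserved as invalid by the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

If a broker or gateway strips or rewrites this one header, the consumer has nothing to extract and its spans start a separate trace, so it is worth checking that the header survives each hop.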
Step 3: Implement Business-Level Spans
Auto-instrumentation captures infrastructure calls. Business logic requires explicit spans to map user journeys.
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-business-logic');

// validateInventory, chargePayment, and dispatchToFulfillment are assumed to
// be defined elsewhere in the service.
export async function processOrder(orderId: string, payload: any) {
  return tracer.startActiveSpan('process.order', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      span.setAttribute('order.total', payload.total);
      await validateInventory(orderId, payload);
      await chargePayment(orderId, payload);
      await dispatchToFulfillment(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      // Caught values are typed as unknown, so normalize before reading .message.
      const err = error instanceof Error ? error : new Error(String(error));
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw error;
    } finally {
      span.end();
    }
  });
}
Step 4: Architecture Decisions & Rationale
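- OTLP over HTTP: Use OTLP as the standard protocol. It supports batching and compression, and most backends accept it natively. Note that @opentelemetry/exporter-trace-otlp-http sends JSON-encoded OTLP; use @opentelemetry/exporter-trace-otlp-proto if you want Protobuf over HTTP.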
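- Head-Based Sampling at Edge: Apply probabilistic sampling (e.g., 10%) at the SDK level to reduce network egress. A head-based sampler decides when the root span starts, before the outcome is known, so it cannot keep 100% of errors on its own. Error and latency retention has to come from tail-based sampling in the collector, which only sees the traces the SDK exported.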
- Collector as Central Hub: Deploy the OpenTelemetry Collector to handle batching, filtering, attribute enrichment, and tail-based sampling. This decouples instrumentation from backend storage.
- Resource Attribute Standardization: Enforce service.name, service.version, and deployment.environment at initialization. Missing attributes break service maps and alert routing.
- Log-Trace Correlation: Inject trace_id and span_id into structured logs. Use the same resource attributes across logs, metrics, and traces to enable cross-signal querying.
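The head-based sampling decision can be illustrated with a simplified sketch. This is not the SDK's algorithm (the SDK provides TraceIdRatioBasedSampler and ParentBasedSampler in @opentelemetry/sdk-trace-base); the shouldSample function is hypothetical and only shows why a decision derived from the trace ID is deterministic, so the same trace gets the same verdict wherever it is evaluated.

```typescript
// Simplified, deterministic head-sampling decision (illustrative only).
// The decision depends solely on the trace ID and the ratio, never on the
// outcome of the request, which is why errors cannot be favored here.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the first 8 hex chars of the trace ID as an unsigned 32-bit integer.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  // Keep the trace when its bucket falls in the lowest `ratio` share of the range.
  return bucket < Math.floor(ratio * 0x100000000);
}
```

Because the input is the trace ID alone, a 10% ratio keeps roughly one trace in ten for randomly generated IDs, and a failed request is exactly as likely to be dropped as a successful one.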
Pitfall Guide
- Unbounded Span Generation Without Sampling: Generating spans for every request without a sampling strategy makes storage cost and CPU overhead grow in step with traffic. Production systems should combine probabilistic head-based sampling in the SDK with tail-based rules in the collector that retain error traces.
- Broken Context Propagation Across Async Boundaries: Message brokers (Kafka, RabbitMQ, SQS) and worker threads do not inherit async context automatically. Failing to inject/extract W3C headers severs the trace chain, creating orphaned spans that cannot be reconstructed.
- Ignoring Resource Attributes: Traces without service.name or deployment.environment are unqueryable at scale. Backends rely on these attributes for service maps, retention policies, and alert routing. Set them at SDK initialization, not per-span.
- Mixing Trace IDs with Log Correlation Incorrectly: Logging frameworks do not automatically attach trace context. You must configure log appenders to extract trace_id from the active span and inject it into JSON log structures. Mismatched attribute names break correlation queries.
- Using Fragmented or Deprecated Instrumentation Libraries: Mixing legacy tracing SDKs (Zipkin, Jaeger client) with OTel creates protocol translation overhead and context loss. Standardize on OTel auto-instrumentations and verify version compatibility with your runtime.
- Storing Raw Spans Without Aggregation or Indexing Strategy: Raw span storage is expensive and slow for debugging. Use the collector to drop high-cardinality attributes, aggregate identical spans, and route error traces to hot storage while routing success traces to cold storage.
- Treating Tracing as a Replacement for Metrics and Logs: Tracing answers "what happened in this request." Metrics answer "what is the system doing overall." Logs answer "what did the code output." All three signals must share resource attributes and trace IDs to form a complete observability stack.
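The log correlation pitfall can be sketched as follows. This is a minimal illustration, not a logging-framework integration: correlatedLogLine and SpanContextLike are hypothetical names, and the span context is passed in explicitly so the snippet stays self-contained. In an instrumented service it would typically come from trace.getActiveSpan()?.spanContext() in @opentelemetry/api.

```typescript
// Minimal subset of a span context needed for correlation.
interface SpanContextLike {
  traceId: string;
  spanId: string;
}

// Build one JSON log line; trace_id and span_id are added only when a span
// context is available, using the same field names on every service.
function correlatedLogLine(
  level: 'info' | 'warn' | 'error',
  message: string,
  spanContext: SpanContextLike | undefined,
  fields: Record<string, unknown> = {},
): string {
  const record: Record<string, unknown> = {
    ...fields,
    level,
    message,
    timestamp: new Date().toISOString(),
  };
  if (spanContext) {
    record.trace_id = spanContext.traceId;
    record.span_id = spanContext.spanId;
  }
  return JSON.stringify(record);
}
```

Keeping the field names in one helper avoids the mismatch described above, where one service logs trace_id and another logs traceId and the correlation query silently misses half the lines.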
Production Best Practices:
- Implement tail-based sampling in the collector for error traces and latency outliers.
- Use semantic conventions for attribute naming to ensure cross-service compatibility.
- Set span limits (attributeCountLimit and attributeValueLengthLimit under spanLimits in the Node SDK) to prevent payload bloat.
- Validate trace continuity with synthetic transactions before deploying to production.
- Monitor collector health metrics (otelcol_exporter_send_failed_spans, otelcol_processor_batch_batch_send_size) to detect pipeline bottlenecks.
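As a rough illustration of what span limits do to a payload, the sketch below caps the number of attributes and truncates long string values. It is not SDK code: applySpanLimits and SpanLimitsSketch are hypothetical, and the exact behavior of a real SDK when a limit is hit should be checked against its documentation.

```typescript
type AttributeValue = string | number | boolean;

// Hypothetical limits object for this sketch.
interface SpanLimitsSketch {
  attributeCountLimit: number;       // maximum number of attributes kept
  attributeValueLengthLimit: number; // maximum length of string values
}

// Keep attributes in insertion order up to the count limit and truncate
// string values that exceed the length limit.
function applySpanLimits(
  attrs: Record<string, AttributeValue>,
  limits: SpanLimitsSketch,
): Record<string, AttributeValue> {
  const limited: Record<string, AttributeValue> = {};
  let kept = 0;
  for (const key of Object.keys(attrs)) {
    if (kept >= limits.attributeCountLimit) break; // remaining attributes are dropped
    const value = attrs[key];
    limited[key] =
      typeof value === 'string' ? value.slice(0, limits.attributeValueLengthLimit) : value;
    kept++;
  }
  return limited;
}
```

The practical consequence is that a request body or stack trace accidentally attached as an attribute is cut to a bounded size instead of inflating every export batch.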
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Small monolith or single service | SDK-only + direct OTLP export | Minimal context propagation needs; collector adds unnecessary latency | Low |
| Multi-service mesh with HTTP/RPC | SDK + Collector pipeline | Enables cross-service routing, attribute enrichment, and centralized sampling | Medium |
| Event-driven architecture (Kafka/RabbitMQ) | SDK + manual context bridging + Collector | Async boundaries require explicit W3C header propagation; collector handles tail-based sampling | Medium-High |
| High-throughput serverless | SDK with aggressive head sampling + Collector | Cold starts and ephemeral instances demand low overhead; collector aggregates sparse spans | Low-Medium |
Configuration Template
otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_max_size: 1000
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
  # Spans without an http.method attribute do not match and are dropped.
  filter/attributes:
    spans:
      include:
        match_type: regexp
        attributes:
          - key: http.method
            value: "(GET|POST|PUT|DELETE)"
  tail_sampling:
    policies:
      - name: error_traces
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: latency_traces
        type: latency
        latency: { threshold_ms: 500 }

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # The debug exporter replaces the removed logging exporter in recent collector releases.
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sample before batching so tail_sampling sees whole traces.
      processors: [resource, filter/attributes, tail_sampling, batch]
      exporters: [otlp/tempo, debug]
Quick Start Guide
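- Install dependencies: npm install @opentelemetry/api @opentelemetry/core @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http @opentelemetry/resources @opentelemetry/semantic-conventions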
- Add the SDK initialization code to your application entry point with your service name and OTLP endpoint.
- Run the collector locally: docker run -p 4318:4318 -v "$(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml" otel/opentelemetry-collector-contrib:latest --config /etc/otel-collector-config.yaml
- Generate test traffic and verify span arrival in your backend UI (Tempo, Jaeger, or commercial vendor).
- Confirm context propagation by checking that downstream service spans share the same trace_id as the originating request.
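The continuity check in the last step can be automated against spans collected from a synthetic transaction. The sketch below assumes a simplified span shape (ExportedSpan is hypothetical; real backends and the OTLP format carry more fields) and reports whether all spans share one trace ID, have a single root, and reference parents that exist.

```typescript
// Simplified span record, assumed to be exported from the tracing backend.
interface ExportedSpan {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  name: string;
}

interface ContinuityReport {
  traceIds: string[]; // distinct trace IDs seen
  roots: string[];    // names of spans without a parent
  orphans: string[];  // names of spans whose parent is missing
  continuous: boolean;
}

// A transaction is continuous when every span shares one trace ID, exactly
// one span is a root, and every parent reference resolves to a known span.
function checkTraceContinuity(spans: ExportedSpan[]): ContinuityReport {
  const spanIds = new Set(spans.map((s) => s.spanId));
  const traceIds = Array.from(new Set(spans.map((s) => s.traceId)));
  const roots = spans.filter((s) => !s.parentSpanId).map((s) => s.name);
  const orphans = spans
    .filter((s) => s.parentSpanId !== undefined && !spanIds.has(s.parentSpanId))
    .map((s) => s.name);
  const continuous = traceIds.length === 1 && roots.length === 1 && orphans.length === 0;
  return { traceIds, roots, orphans, continuous };
}
```

A second root or a second trace ID in the report usually points at the hop where headers were not injected or extracted, which narrows the search to one producer and consumer pair.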