chitecture must ensure headers are injected into outgoing HTTP/gRPC requests and extracted in incoming handlers.
3. Sampling Strategy: Random sampling is insufficient for debugging rare errors. Implement tail-based sampling or error-based sampling to retain traces containing failures while dropping healthy traffic to control cost.
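To make the limits of purely random sampling concrete, the sketch below estimates how many failing traces survive it. The function names and traffic figures are illustrative assumptions rather than part of any OpenTelemetry API, and the model assumes each request is sampled independently.

```typescript
// Expected number of failing traces retained under independent random sampling.
function expectedRetainedErrors(
  totalRequests: number,
  errorRate: number,
  sampleRate: number,
): number {
  return totalRequests * errorRate * sampleRate;
}

// Probability that at least one failing trace is retained.
function probabilityOfCapturingAnError(
  totalRequests: number,
  errorRate: number,
  sampleRate: number,
): number {
  // A request contributes a retained failing trace only if it fails AND is sampled.
  const perRequest = errorRate * sampleRate;
  return 1 - Math.pow(1 - perRequest, totalRequests);
}

// Hypothetical workload: 10,000 requests, 0.1% error rate, 10% sampling.
const demoRequests = 10000;
const demoErrorRate = 0.001;
const demoSampleRate = 0.1;

console.log(expectedRetainedErrors(demoRequests, demoErrorRate, demoSampleRate));
console.log(probabilityOfCapturingAnError(demoRequests, demoErrorRate, demoSampleRate));
```

Under these assumed numbers, roughly one failing trace is expected per 10,000 requests, and there is a meaningful chance of retaining none at all. That gap is what tail-based or error-based sampling is meant to close.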
Implementation Steps (TypeScript)
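The following implementation uses @opentelemetry/api and @opentelemetry/sdk-trace-node. It targets the 1.x SDK packages; SDK 2.x replaced new Resource(...) with resourceFromAttributes(...) and provider.addSpanProcessor(...) with the spanProcessors constructor option.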
1. Initialize Tracer Provider:
Configure the SDK to export spans to a collector. Use BatchSpanProcessor to aggregate spans and reduce network overhead.
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
const provider = new NodeTracerProvider({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'payment-service',
    [SEMRESATTRS_SERVICE_VERSION]: '1.2.0',
  }),
});
// Use BatchSpanProcessor for production efficiency
const exporter = new OTLPTraceExporter({
  url: 'http://otel-collector:4318/v1/traces',
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
2. Instrument Execution Boundaries:
Wrap logical units of work in spans. Use tracer.startActiveSpan to automatically manage context scope via AsyncLocalStorage, ensuring child spans inherit the parent context in asynchronous flows.
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('payment-service');
async function processPayment(transactionId: string, amount: number) {
  return tracer.startActiveSpan(
    'processPayment',
    { attributes: { 'payment.transaction_id': transactionId } },
    async (span) => {
      try {
        // Validate input
        await validateTransaction(transactionId);
        // Child span for the downstream call; CLIENT marks it as an outbound request
        const result = await tracer.startActiveSpan(
          'callGateway',
          { kind: SpanKind.CLIENT },
          async (gatewaySpan) => {
            try {
              const res = await fetchPaymentGateway(transactionId, amount);
              gatewaySpan.setAttribute('payment.gateway.status', res.status);
              return res;
            } catch (err) {
              gatewaySpan.recordException(err as Error);
              gatewaySpan.setStatus({ code: SpanStatusCode.ERROR });
              throw err;
            } finally {
              // Always end the span, on success and on failure
              gatewaySpan.end();
            }
          }
        );
        span.setAttribute('payment.amount', amount);
        return result;
      } catch (error) {
        // Record exception and set span status
        span.recordException(error as Error);
        span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
        throw error;
      } finally {
        span.end();
      }
    }
  );
}
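The context inheritance described in this step can be illustrated without the OpenTelemetry SDK. The sketch below uses Node's built-in AsyncLocalStorage directly; DemoSpan, withSpan, and runDemo are hypothetical names invented for this illustration, and real OTel context managers do considerably more than this.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Minimal stand-in for a span: just a name and the name of its parent.
interface DemoSpan {
  name: string;
  parentName?: string;
}

// Holds the "active span" for the current asynchronous call chain.
const activeSpan = new AsyncLocalStorage<DemoSpan>();

// Creates a span whose parent is whatever span is active, then runs fn
// with the new span as the active one.
async function withSpan<T>(
  name: string,
  recorded: DemoSpan[],
  fn: () => Promise<T>,
): Promise<T> {
  const span: DemoSpan = { name, parentName: activeSpan.getStore()?.name };
  recorded.push(span);
  return activeSpan.run(span, fn);
}

async function runDemo(): Promise<DemoSpan[]> {
  const recorded: DemoSpan[] = [];
  await withSpan('processPayment', recorded, async () => {
    // The active span survives this await without being passed explicitly.
    await new Promise<void>((resolve) => setTimeout(resolve, 5));
    await withSpan('callGateway', recorded, async () => {});
  });
  return recorded;
}

runDemo().then((spans) => console.log(spans));
```

The inner span picks up its parent even though an await sits between the two calls, which is the property startActiveSpan relies on. Code that leaves the async call chain, such as worker threads, does not get this behavior automatically.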
3. Context Propagation in HTTP Clients:
Ensure that either OTel HTTP auto-instrumentation or manual injection adds the W3C traceparent header to outgoing requests.
import { context, propagation } from '@opentelemetry/api';
async function fetchPaymentGateway(id: string, amount: number) {
  const headers: Record<string, string> = {};
  // Inject the active context (set by startActiveSpan in the caller) into the headers
  propagation.inject(context.active(), headers);
  const response = await fetch('https://gateway.example.com/pay', {
    method: 'POST',
    headers: { ...headers, 'Content-Type': 'application/json' },
    body: JSON.stringify({ id, amount }),
  });
  return response.json();
}
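For orientation, the traceparent header written by the default W3C propagator has the shape version-traceid-parentid-flags. The sketch below formats and parses that shape by hand so the wire format is visible; formatTraceparent and parseTraceparent are hypothetical helpers for illustration only and are not OpenTelemetry APIs. In real code, propagation.inject and propagation.extract should do this work.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Version 00 of the header: 00-<trace-id>-<parent-id>-<flags>
const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  const flags = ctx.sampled ? '01' : '00';
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero trace or span IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // The lowest bit of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Sample header value taken from the W3C Trace Context specification.
const exampleHeader = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
console.log(parseTraceparent(exampleHeader));
```

A receiving service that finds no traceparent header, or an invalid one, starts a new trace. That is usually how fragmented traces first show up in the backend.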
4. Correlation with Logs:
For trace-based debugging to be effective, logs must be correlated with the active span. Inject trace_id and span_id into log entries.
import { context, trace } from '@opentelemetry/api';
function debugLog(message: string) {
  const span = trace.getSpan(context.active());
  const ctx = span ? span.spanContext() : null;
  console.log(JSON.stringify({
    level: 'debug',
    message,
    trace_id: ctx?.traceId,
    span_id: ctx?.spanId,
    timestamp: new Date().toISOString(),
  }));
}
Pitfall Guide
Production trace-based debugging fails when implementation details are ignored. The following pitfalls are common in high-scale environments.
-
Cardinality Explosion:
Adding high-cardinality attributes (e.g., user emails, request bodies, UUIDs) to spans causes index bloat in the backend, drastically increasing storage costs and query latency.
- Best Practice: Restrict attributes to low-cardinality keys. Use trace_id to fetch full request details from logs or databases only when debugging a specific trace. Define an allowlist of permitted attributes.
-
Async Context Loss:
In Node.js, failing to use AsyncLocalStorage or OTel's startActiveSpan results in child spans detaching from the parent, creating fragmented traces.
- Best Practice: Always use tracer.startActiveSpan for async operations. Verify context propagation in custom async wrappers or worker threads.
-
Over-Instrumentation:
Creating spans for every function call generates excessive data volume and noise, obscuring the actual execution path.
- Best Practice: Instrument only boundaries that represent distinct logical operations or network hops. Internal helper functions should not create spans unless they contain complex logic requiring isolation.
-
Sampling Bias:
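Random (head-based) sampling discards traces without regard to their outcome. If a bug affects 0.1% of requests and 10% of traces are sampled, only about 1 in 10,000 requests yields a retained failing trace, so on low-traffic services the error may never be captured.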
- Best Practice: Implement tail-based sampling or error-based sampling. Configure the collector to retain traces where any span has an error status, regardless of the global sampling rate.
-
Secret Leakage:
Span attributes and logs can accidentally capture sensitive data (passwords, tokens, PII).
- Best Practice: Implement attribute sanitization filters in the SDK or Collector. Use regex patterns to redact sensitive keys. Conduct security reviews of instrumentation code.
-
Ignoring Downstream Latency:
Focusing only on local execution time hides the latency contributed by downstream dependencies.
- Best Practice: Ensure every outbound call creates a child span with SpanKind.CLIENT. Record attributes like http.status_code, db.statement, and rpc.service to identify bottleneck dependencies.
-
Treating Traces as Logs:
Storing verbose, unstructured messages in span attributes turns traces into a poor substitute for logs.
- Best Practice: Traces describe flow and timing. Logs describe events. Use trace events for discrete moments within a span, but rely on correlated logs for detailed state dumps. Keep span attributes structured and queryable.
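The allowlist and redaction advice from the Cardinality Explosion and Secret Leakage items can be sketched as a small pure function applied before attributes are set on a span. The key names, the regex, and the sanitizeAttributes helper are all assumptions made for illustration; a real deployment would tune them and would likely enforce the same rules again in the Collector.

```typescript
type AttributeValue = string | number | boolean;
type Attributes = Record<string, AttributeValue>;

// Hypothetical allowlist of low-cardinality, non-sensitive keys.
const ALLOWED_KEYS = new Set<string>([
  'payment.gateway.status',
  'http.status_code',
  'rpc.service',
]);

// Hypothetical pattern for keys that must never carry a raw value.
const SENSITIVE_KEY = /(password|secret|token|authorization|email)/i;

function sanitizeAttributes(input: Attributes): Attributes {
  const output: Attributes = {};
  for (const [key, value] of Object.entries(input)) {
    if (SENSITIVE_KEY.test(key)) {
      // Keep the key so its presence is visible, but never the value.
      output[key] = '[REDACTED]';
    } else if (ALLOWED_KEYS.has(key)) {
      output[key] = value;
    }
    // Any other key is dropped silently.
  }
  return output;
}

console.log(
  sanitizeAttributes({
    'http.status_code': 502,
    'user.password': 'hunter2',
    'debug.request_body': '{"card":"..."}',
  }),
);
```

Checking the sensitive pattern before the allowlist is a deliberate choice in this sketch: a key added to the allowlist by mistake still gets redacted if it looks sensitive.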
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-Throughput API (>50k RPS) | Low-rate sampling + Tail-based retention | Random sampling reduces volume; tail sampling preserves errors. | Low storage cost; moderate compute for collector. |
| Complex Microservice Mesh | Full instrumentation + eBPF overlay | SDK captures business logic; eBPF captures network/infra issues without code changes. | Moderate cost; high debugging efficiency. |
| Batch Processing / ETL | 100% Sampling for job spans | Batch jobs have low volume; full traces provide complete auditability. | Low cost; high data retention value. |
| Compliance-Heavy (GDPR/HIPAA) | Strict attribute allowlist + Redaction | Prevents PII leakage; ensures audit trails without storing sensitive data. | High engineering overhead; low compliance risk. |
| Serverless Functions | Auto-instrumentation + Cold start optimization | Serverless requires minimal init time; auto-instrumentation reduces boilerplate. | Low cost; potential latency spike if SDK is heavy. |
Configuration Template
OTel Collector configuration for tail-based sampling and attribute redaction.
receivers:
  otlp:
    protocols:
      http:
        endpoint: "0.0.0.0:4318"
processors:
  # Redact sensitive attributes
  attributes/redact:
    actions:
      - key: "user.email"
        action: "delete"
      - key: "http.request.body"
        action: "delete"
      - key: "db.statement"
        action: "hash"
  # Tail-based sampling to retain errors
  tail_sampling:
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ "ERROR" ] }
      - name: latency-policy
        type: latency
        latency: { threshold_ms: 500 }
      - name: default-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
  batch:
    timeout: 1s
    send_batch_max_size: 1024
exporters:
  otlp/jaeger:
    # OTLP over gRPC; port 14250 is Jaeger's legacy gRPC port and does not accept OTLP
    endpoint: "jaeger:4317"
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/redact, tail_sampling, batch]
      exporters: [otlp/jaeger]
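The three sampling policies in this template combine as a logical OR: a trace is kept if any policy matches. The sketch below models that decision for a completed trace. It is a simplified illustration, not the Collector's implementation; TraceSummary, bucketOf, and shouldKeep are hypothetical names, and the hash used for the probabilistic policy is an arbitrary stand-in.

```typescript
interface SpanSummary {
  status: 'UNSET' | 'OK' | 'ERROR';
}

interface TraceSummary {
  traceId: string;
  durationMs: number;
  spans: SpanSummary[];
}

// Maps a trace ID to a stable bucket in [0, 100) so the same trace
// always gets the same probabilistic decision.
function bucketOf(traceId: string): number {
  let hash = 0;
  for (let i = 0; i < traceId.length; i++) {
    hash = (hash * 31 + traceId.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

function shouldKeep(
  trace: TraceSummary,
  latencyThresholdMs: number = 500,
  samplingPercentage: number = 10,
): boolean {
  const hasError = trace.spans.some((span) => span.status === 'ERROR'); // error-policy
  const isSlow = trace.durationMs > latencyThresholdMs; // latency-policy
  const inSample = bucketOf(trace.traceId) < samplingPercentage; // default-policy
  return hasError || isSlow || inSample;
}

// Hypothetical failing trace: kept regardless of the probabilistic rate.
console.log(
  shouldKeep({
    traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
    durationMs: 42,
    spans: [{ status: 'OK' }, { status: 'ERROR' }],
  }),
);
```

The decision needs the whole trace, so the Collector must buffer spans until the trace is considered complete. This is where the extra collector compute cost noted in the decision matrix comes from.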
Quick Start Guide
-
Install Dependencies:
npm install @opentelemetry/api @opentelemetry/sdk-trace-node @opentelemetry/sdk-trace-base @opentelemetry/resources @opentelemetry/semantic-conventions @opentelemetry/exporter-trace-otlp-http @opentelemetry/auto-instrumentations-node
-
Initialize Tracing:
Create tracer.ts with the provider setup and register it before importing application code.
import './tracer'; // Ensure this runs first
import express from 'express';
// ... app logic
-
Add Spans to Critical Paths:
Wrap request handlers and external calls with tracer.startActiveSpan.
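import { SpanStatusCode } from '@opentelemetry/api';
app.get('/users/:id', async (req, res) => {
  // Name the span after the route template, not the concrete URL
  return tracer.startActiveSpan('GET /users/:id', async (span) => {
    try {
      const user = await db.findUser(req.params.id);
      span.setAttribute('user.id', req.params.id);
      res.json(user);
    } catch (err) {
      span.recordException(err as Error);
      // Mark the span as failed so error filters and tail sampling can match it
      span.setStatus({ code: SpanStatusCode.ERROR });
      res.status(500).send('Error');
    } finally {
      span.end();
    }
  });
});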
-
Run Collector:
Deploy the OTel Collector using the provided configuration template.
docker run -p 4318:4318 -v "$(pwd)/otel-config.yaml:/etc/otel/config.yaml" otel/opentelemetry-collector-contrib --config=/etc/otel/config.yaml
-
Visualize and Debug:
Query traces in your backend (e.g., Jaeger UI, Grafana Tempo). Filter by error=true or specific trace_id to reconstruct execution flow and identify bottlenecks.