error traces.
* Tail Sampling: Decides after the trace completes. Requires the Collector to buffer traces. Essential for capturing error traces and high-latency outliers without noise.
3. Context Propagation: Use W3C Trace Context headers (traceparent, tracestate). This ensures interoperability across polyglot services.
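To make the header format concrete, here is a minimal sketch of parsing a version-00 `traceparent` value (`version-traceid-parentid-flags`). The `parseTraceparent` helper and the `TraceParent` type are hypothetical names for illustration only; in a real service the SDK's propagator handles this, and this sketch only accepts the four-field layout.

```typescript
// Parsed form of a W3C `traceparent` header (four-field layout).
interface TraceParent {
  version: string;
  traceId: string; // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // bit 0 of the trace-flags field
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns null for anything that is not a valid header.
function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const version = match[1];
  const traceId = match[2];
  const parentId = match[3];
  const flags = match[4];
  // Version "ff" and all-zero IDs are invalid per the Trace Context spec.
  if (version === 'ff') return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

console.log(parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'));
```

Because every service reads and writes the same fixed-width fields, a Go or Java service can continue a trace started in Node.js without any shared library.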
Implementation Steps (TypeScript / Node.js)
This example uses @opentelemetry/sdk-node to configure the SDK, @opentelemetry/auto-instrumentations-node for auto-instrumentation, and @opentelemetry/api for manual span creation.
1. Initialization and Auto-Instrumentation
Auto-instrumentation hooks into popular libraries (HTTP, Express, pg, redis) to generate spans automatically.
// tracer.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';

// A URL passed explicitly is used as-is, so it must include the /v1/traces path.
// OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is the signal-specific variable that carries the full URL.
const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://localhost:4318/v1/traces',
});

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'order-service',
    [SEMRESATTRS_SERVICE_VERSION]: '1.0.0',
  }),
  // BatchSpanProcessor must be given the exporter it flushes to.
  spanProcessor: new BatchSpanProcessor(traceExporter),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingRequestHook: (req) => {
          // Ignore health checks to reduce noise
          return req.url?.includes('/health') || false;
        },
      },
    }),
  ],
});

sdk.start();

// Flush buffered spans before the process exits.
process.on('SIGTERM', () => {
  sdk.shutdown().catch((err) => console.error('Error shutting down tracer', err));
});

export default sdk;
2. Custom Spans and Context Propagation
Auto-instrumentation covers infrastructure calls. Business logic requires manual spans to provide semantic meaning.
// order-handler.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

export async function processOrder(orderId: string) {
  // Auto-instrumentation creates the root span for the HTTP request.
  // We create a child span for business logic.
  return tracer.startActiveSpan('process-order-business-logic', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      // Simulate validation
      await validateOrder(orderId);
      // Simulate downstream call (context propagates automatically via AsyncLocalStorage)
      await callPaymentGateway(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      // Caught values are `unknown` under strict TypeScript, so normalize first.
      const err = error instanceof Error ? error : new Error(String(error));
      // Record exception and set status
      span.recordException(err);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: err.message,
      });
      throw error;
    } finally {
      // Always end the span, on success and on failure.
      span.end();
    }
  });
}

async function validateOrder(id: string) {
  // No manual span needed if this is synchronous CPU work,
  // but useful if it involves complex logic you want to track.
  return true;
}

async function callPaymentGateway(id: string) {
  // HTTP client instrumentation will automatically create a child span
  // and propagate the trace context to the payment service.
  // For Node's global fetch this depends on the undici instrumentation being enabled.
  return fetch(`https://payment-gateway/api/charge/${id}`);
}
3. Context Propagation Mechanics
The OTel Node.js SDK uses AsyncLocalStorage (ALS) as its default context manager to maintain context across asynchronous boundaries. When you start an active span, it is automatically available to any asynchronous work started inside it, including callbacks, promises, and timers.
Critical Rationale: If you use worker threads or custom thread pools, ALS may not propagate automatically. In these cases, you must manually bind the context or use OTel's context wrappers to ensure child spans link correctly to the parent.
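The mechanism can be seen without any OTel packages. The following is a minimal sketch that uses Node's built-in `node:async_hooks` module directly; `DemoContext`, `handleRequest`, and `demo` are hypothetical names, and the real SDK stores a richer context object than a single trace ID.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Hypothetical stand-in for the context the SDK would store.
interface DemoContext {
  traceId: string;
}

const storage = new AsyncLocalStorage<DemoContext>();

function currentTraceId(): string | undefined {
  return storage.getStore()?.traceId;
}

function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Each "request" runs inside its own store; the value survives the await.
async function handleRequest(traceId: string): Promise<string | undefined> {
  return storage.run({ traceId }, async () => {
    await delay(5);
    return currentTraceId();
  });
}

async function demo(): Promise<Array<string | undefined>> {
  // Two concurrent requests do not see each other's context.
  const inside = await Promise.all([handleRequest('trace-a'), handleRequest('trace-b')]);
  // Outside storage.run() there is no active context.
  return [...inside, currentTraceId()];
}

demo().then((seen) => console.log(seen));
```

A worker thread has its own event loop and its own store, which is why the context has to be serialized and passed explicitly when work crosses that boundary.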
Semantic Conventions
Adhere to OpenTelemetry Semantic Conventions for attribute naming. This ensures traces are queryable across services without custom schema mapping.
- Use http.method, http.status_code, db.statement instead of custom attributes.
- Avoid high-cardinality attributes like user.email or transaction.uuid in span attributes unless necessary for debugging; use baggage for request-scoped data if propagation is required, but be mindful of header size limits.
Pitfall Guide
1. Cardinality Explosion
Mistake: Adding unique identifiers (UUIDs, emails, timestamps) as span attributes.
Impact: Trace backends (especially columnar stores) index attributes. High cardinality causes storage bloat, query degradation, and increased costs.
Best Practice: Only add low-cardinality attributes (e.g., http.method, db.operation). Use logs for high-cardinality details, correlated via trace_id.
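One way to enforce this in code is an allowlist that routes everything else to logs. This is a sketch under assumptions: `splitAttributes` and the allowlist contents are hypothetical and should be adapted to the attributes your backend actually indexes.

```typescript
type SpanAttributeValue = string | number | boolean;
type SpanAttributes = { [key: string]: SpanAttributeValue };

// Hypothetical allowlist: keys whose set of possible values is small and bounded.
const ALLOWED_SPAN_ATTRIBUTES = ['http.method', 'http.status_code', 'db.operation'];

// Splits candidate attributes into those safe for the span and those better sent to logs.
function splitAttributes(
  attrs: SpanAttributes,
  allowed: string[] = ALLOWED_SPAN_ATTRIBUTES,
): { forSpan: SpanAttributes; forLogs: SpanAttributes } {
  const forSpan: SpanAttributes = {};
  const forLogs: SpanAttributes = {};
  for (const key of Object.keys(attrs)) {
    if (allowed.indexOf(key) !== -1) {
      forSpan[key] = attrs[key];
    } else {
      forLogs[key] = attrs[key];
    }
  }
  return { forSpan, forLogs };
}

console.log(splitAttributes({ 'http.method': 'POST', 'user.email': 'a@example.com' }));
```

The `forLogs` half would go into a structured log line that carries the same trace_id, so the detail stays reachable from the trace without being indexed by the trace backend.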
2. Broken Context in Async Boundaries
Mistake: Spawning background jobs or using thread pools without context propagation.
Impact: Spans become orphaned or detached from the root trace, breaking the causality graph.
Best Practice: Use context.with() to bind context to async operations. For worker queues, propagate traceparent in the message payload and start a new root span or link to the parent in the consumer.
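For the worker-queue case, the idea can be sketched as follows. The `QueueMessage` envelope, `SpanRef`, `wrapMessage`, and `extractParent` are hypothetical names; with the real API you would typically call `propagation.inject()` on the producer and `propagation.extract()` on the consumer instead of formatting the value by hand.

```typescript
// Hypothetical message envelope; the field name mirrors the HTTP header.
interface QueueMessage<T> {
  traceparent: string;
  body: T;
}

interface SpanRef {
  traceId: string; // 32 hex characters
  spanId: string; // 16 hex characters
  sampled: boolean;
}

// Producer side: serialize the active span's identifiers into the message.
function wrapMessage<T>(body: T, active: SpanRef): QueueMessage<T> {
  const flags = active.sampled ? '01' : '00';
  return { traceparent: `00-${active.traceId}-${active.spanId}-${flags}`, body };
}

// Consumer side: recover the producer's span so the consumer span can be
// parented to it or linked to it. Returns null if the field is malformed.
function extractParent(message: QueueMessage<unknown>): SpanRef | null {
  const parts = message.traceparent.split('-');
  if (parts.length !== 4 || parts[1].length !== 32 || parts[2].length !== 16) {
    return null;
  }
  return {
    traceId: parts[1],
    spanId: parts[2],
    sampled: (parseInt(parts[3], 16) & 1) === 1,
  };
}

const queued = wrapMessage(
  { orderId: 'A-1' },
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7', sampled: true },
);
console.log(queued.traceparent, extractParent(queued));
```

Whether the consumer continues the same trace or starts a new root span with a link is a design choice: links tend to read better when a message waits in the queue for a long time or when one batch fans in from many producers.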
3. Over-Tracing Everything
Mistake: Exporting 100% of traces in high-throughput systems.
Impact: Network saturation, backend storage costs spike, and UI performance degrades.
Best Practice: Implement probabilistic sampling (e.g., 10% of requests). Use tail sampling in the Collector to ensure 100% of error traces and high-latency traces are retained, while sampling successful traces.
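The decision logic combines three rules: keep errors, keep slow traces, and keep a fixed fraction of the rest. The sketch below illustrates that logic in application code purely for explanation; `shouldKeepTrace`, `FinishedSpan`, and `TailPolicy` are hypothetical, and the hash is a simplification rather than the algorithm the SDK or Collector actually uses.

```typescript
interface FinishedSpan {
  traceId: string;
  durationMs: number;
  isError: boolean;
}

interface TailPolicy {
  latencyThresholdMs: number;
  keepRatio: number; // fraction of healthy traces to keep, between 0 and 1
}

// Maps a trace ID to a stable value in [0, 1) using its first 8 hex digits,
// so every component makes the same decision for the same trace.
function traceIdToUnitInterval(traceId: string): number {
  return parseInt(traceId.slice(0, 8), 16) / 0x100000000;
}

// Decides once all spans of a trace have been buffered.
function shouldKeepTrace(spans: FinishedSpan[], policy: TailPolicy): boolean {
  if (spans.length === 0) return false;
  if (spans.some((s) => s.isError)) return true; // always keep errors
  if (spans.some((s) => s.durationMs > policy.latencyThresholdMs)) return true; // keep slow traces
  return traceIdToUnitInterval(spans[0].traceId) < policy.keepRatio; // sample the rest
}

const policy: TailPolicy = { latencyThresholdMs: 500, keepRatio: 0.1 };
console.log(
  shouldKeepTrace([{ traceId: 'ffffffff000000000000000000000000', durationMs: 20, isError: true }], policy),
);
```

The need to see every span before deciding is what forces the buffering cost mentioned for tail sampling; a head-based sampler could only apply the last rule.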
4. Ignoring Error Semantics
Mistake: Catching errors but not recording them on the span.
Impact: Traces appear successful in the UI even when the business logic failed. Alerts based on trace status miss failures.
Best Practice: Always call span.recordException(error) and span.setStatus({ code: SpanStatusCode.ERROR }) in catch blocks.
5. Hardcoding Exporter Endpoints
Mistake: Embedding backend URLs in application code.
Impact: Inflexible deployments; changing backends requires code changes and redeployment.
Best Practice: Configure endpoints via environment variables (OTEL_EXPORTER_OTLP_ENDPOINT). Use the OTel Collector to abstract the backend.
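One detail is easy to get wrong when resolving the endpoint yourself: under the OTLP exporter specification, OTEL_EXPORTER_OTLP_ENDPOINT is a base URL to which the signal path is appended, while OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is used exactly as given. A minimal sketch of that resolution follows; `resolveTracesUrl` is a hypothetical helper, and the exporters perform equivalent logic themselves when no URL is passed in code.

```typescript
type EnvVars = { [name: string]: string | undefined };

const DEFAULT_TRACES_URL = 'http://localhost:4318/v1/traces';

// Hypothetical helper: resolves the OTLP/HTTP traces URL from environment variables.
function resolveTracesUrl(env: EnvVars): string {
  // Signal-specific variable: used exactly as given.
  const tracesEndpoint = env['OTEL_EXPORTER_OTLP_TRACES_ENDPOINT'];
  if (tracesEndpoint) return tracesEndpoint;

  // Generic variable: treated as a base URL, so the traces path is appended.
  const baseEndpoint = env['OTEL_EXPORTER_OTLP_ENDPOINT'];
  if (baseEndpoint) return baseEndpoint.replace(/\/+$/, '') + '/v1/traces';

  return DEFAULT_TRACES_URL;
}

console.log(resolveTracesUrl({ OTEL_EXPORTER_OTLP_ENDPOINT: 'http://otel-collector:4318' }));
```

In a Node.js service you would pass `process.env` as the argument; the hostname in the usage line is a placeholder.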
6. Treating Traces as Logs
Mistake: Dumping large payloads or verbose logs into span events.
Impact: Trace payload size increases, causing serialization overhead and storage issues.
Best Practice: Traces are for timing and structure. Use structured logging with trace_id and span_id for detailed payloads. Correlate them rather than embedding.
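Correlation can be as simple as stamping each log line with the active span's identifiers. The sketch below assumes the IDs are already at hand; `logLine` and `TraceIds` are hypothetical names, and with the real API the IDs would typically come from `trace.getActiveSpan()?.spanContext()`.

```typescript
interface TraceIds {
  traceId: string;
  spanId: string;
}

// Builds one JSON log line that carries the correlation IDs next to the bulky details,
// so the payload lives in the log store and the span stays small.
function logLine(level: string, message: string, ids: TraceIds, details: object = {}): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    trace_id: ids.traceId,
    span_id: ids.spanId,
    details,
  });
}

console.log(
  logLine(
    'error',
    'payment declined',
    { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7' },
    { gatewayResponse: { code: 'card_declined', attempts: 2 } },
  ),
);
```

From a slow or failed span in the trace UI, searching the log store for its trace_id then returns the full payload.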
7. Missing Resource Attributes
Mistake: Failing to set service name, version, or environment.
Impact: Inability to filter traces by deployment, version, or environment. Traces become indistinguishable.
Best Practice: Set service.name, service.version, deployment.environment in the Resource configuration. Inject these via CI/CD pipelines.
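A common injection path is the OTEL_RESOURCE_ATTRIBUTES environment variable, a comma-separated list of `key=value` pairs that the pipeline sets per deployment. The sketch below shows how such a value breaks down; `parseResourceAttributes` is a hypothetical helper written for illustration, since the SDK's environment detector normally reads this variable for you.

```typescript
// Hypothetical helper: parses "k1=v1,k2=v2" into a plain object.
function parseResourceAttributes(raw: string): { [key: string]: string } {
  const attrs: { [key: string]: string } = {};
  for (const pair of raw.split(',')) {
    const idx = pair.indexOf('=');
    if (idx <= 0) continue; // skip empty or malformed entries
    const key = pair.slice(0, idx).trim();
    const value = pair.slice(idx + 1).trim();
    if (!key) continue;
    try {
      // Values may be percent-encoded.
      attrs[key] = decodeURIComponent(value);
    } catch {
      attrs[key] = value; // keep the raw value if decoding fails
    }
  }
  return attrs;
}

console.log(parseResourceAttributes('service.version=1.4.2,deployment.environment=staging'));
```

A deployment step would then export something like `OTEL_RESOURCE_ATTRIBUTES=service.version=$GIT_TAG,deployment.environment=staging`, where `GIT_TAG` is a placeholder for whatever version variable your pipeline provides.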
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-Throughput Microservices | Tail Sampling via OTel Collector | Retains error/latency outliers; drops healthy noise efficiently. | Low (Storage savings offset Collector cost) |
| Serverless Functions | Lightweight SDK + Explicit Context Propagation | Minimizes cold start latency; context travels in request headers or the event payload. | Low |
| Compliance/Audit Requirements | Head-based Sampling (100% or high %) + Full Export | Ensures no request is dropped for audit trails. | High |
| Budget-Constrained Startup | Probabilistic Sampling (1-5%) + Open Source Backend | Minimizes data volume; uses cost-effective OSS stack. | Very Low |
Configuration Template
OTel Collector Configuration (otel-collector-config.yaml)
receivers:
  otlp:
    protocols:
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 5s
    send_batch_max_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 500
    spike_limit_mib: 100
  # Tail sampling processor for production
  tail_sampling:
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ "ERROR" ] }
      - name: latency-policy
        type: latency
        latency: { threshold_ms: 500 }
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp:
    endpoint: "trace-backend:4317"
    tls:
      insecure: true # Plaintext for local testing only; enable TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter runs first; batch comes after tail_sampling so that
      # sampling decisions are made before spans are batched for export.
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]
Quick Start Guide
- Install Dependencies:
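npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http @opentelemetry/sdk-trace-base @opentelemetry/resources @opentelemetry/semantic-conventions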
- Initialize Tracer:
Create tracer.ts with the initialization code from the Core Solution. Import this file at the entry point of your application (before any other modules).
- Run Local Collector:
docker run -d --name otel-collector \
  -p 4318:4318 \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml" \
  otel/opentelemetry-collector-contrib:latest \
  --config /etc/otel-collector-config.yaml
- Verify:
Generate traffic to your service. Check the Collector logs for received spans. Query your trace backend to visualize the trace graph.