auto-instrumentation and manual instrumentation based on service criticality and control requirements. Auto-instrumentation uses environment variables and agent libraries to patch popular frameworks without code changes. Manual instrumentation provides precise span control, custom attributes, and business logic context.
Python FastAPI Example (Manual + Auto Hybrid)
# app.py
import asyncio
import os

from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Resource attributes (essential for service identification)
resource = Resource.create({
    "service.name": os.getenv("OTEL_SERVICE_NAME", "payment-service"),
    "service.version": os.getenv("OTEL_VERSION", "1.0.0"),
    "deployment.environment": os.getenv("OTEL_ENV", "production"),
})

# Tracer setup: spans are batched and exported over OTLP/gRPC
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(trace_provider)

# Meter setup: metrics are exported periodically over OTLP/gRPC
metric_reader = PeriodicExportingMetricReader(OTLPMetricExporter())
metric_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(metric_provider)

tracer = trace.get_tracer("payment.tracer")
meter = metrics.get_meter("payment.meter")
request_counter = meter.create_counter("http.server.requests", unit="1")

app = FastAPI()


@app.get("/process/{payment_id}")
async def process_payment(payment_id: str):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.id", payment_id)
        span.set_attribute("payment.type", "credit_card")
        # Business logic with child span
        with tracer.start_as_current_span("validate_payment") as child_span:
            child_span.set_attribute("validation.result", "success")
            # Simulate validation without blocking the event loop
            await asyncio.sleep(0.05)
        # Low-cardinality labels only; payment_id stays on the span
        request_counter.add(1, {"method": "GET", "status": "200"})
        return {"status": "processed", "id": payment_id}
2. Collector Configuration
The Collector acts as the telemetry data plane. It receives OTLP, applies processors, and exports to backends. Production deployments should run collectors as sidecars or daemonsets, never as single points of failure.
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  resource:
    attributes:
      - key: k8s.cluster.name
        value: "prod-us-east-1"
        action: upsert
exporters:
  otlp/traces:
    endpoint: "tempo:4317"
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "otel"
  # Loki ingests OTLP over HTTP, so logs use the otlphttp exporter
  otlphttp/logs:
    endpoint: "http://loki:3100/otlp"
service:
  pipelines:
    # memory_limiter runs first and batch last in every pipeline
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/traces]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/logs]
3. Context Propagation & Correlation
Distributed tracing requires propagating context across service boundaries. OTel uses W3C Trace Context headers (traceparent, tracestate). HTTP clients and message brokers must be instrumented to inject/extract context automatically.
# Example: HTTP client propagation
import requests
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

propagator = TraceContextTextMapPropagator()
headers = {}
# Adds traceparent (and tracestate, if set) from the current span context;
# nothing is injected when there is no valid active span.
propagator.inject(headers)
response = requests.get("http://inventory-service/check", headers=headers)
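To make the header format concrete, here is a minimal standard-library sketch of what a `traceparent` value looks like on the wire. It is not the OTel SDK's implementation; the helper names `build_traceparent` and `parse_traceparent` are hypothetical, and the sketch only handles the four-field layout `version-traceid-parentid-flags`.

```python
import re

# Four lowercase-hex fields: version, 16-byte trace ID, 8-byte parent span ID, flags.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Format a version-00 traceparent header value (hypothetical helper)."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(header: str):
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id_hex, span_id_hex, flags_hex = match.groups()
    trace_id = int(trace_id_hex, 16)
    span_id = int(span_id_hex, 16)
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == 0 or span_id == 0:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags_hex, 16) & 0x01),
    }
```

A receiving service would parse the incoming header, reuse the trace ID, and record the incoming span ID as the parent of its own new span. In practice the instrumentation libraries do this for you.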
4. Backend Integration
OTel exports via OTLP to any compliant backend. Common combinations include:
- Traces: Tempo, Jaeger, Honeycomb, Datadog
- Metrics: Prometheus, VictoriaMetrics, New Relic
- Logs: Loki, Elasticsearch, Splunk
Query examples (PromQL for metrics, LogQL for logs, Tempo for traces) become trivial once semantic conventions are enforced. Use service.name, http.status_code, and k8s.pod.name as primary dimensions.
Pitfall Guide
1. Over-Instrumentation Without Sampling
Risk: Capturing every span for high-throughput services overwhelms storage and degrades P99 latency.
Mitigation: Implement probabilistic head-based sampling at the SDK level (OTEL_TRACES_SAMPLER=parentbased_traceidratio, OTEL_TRACES_SAMPLER_ARG=0.1) to cap volume. To retain error traces reliably, add tail-based sampling in the Collector (tail_sampling processor), keeping in mind that the Collector can only evaluate traces the SDK has not already dropped.
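The idea behind a parent-based, trace-ID-ratio sampler can be sketched in a few lines. This is an illustration of the technique, not the SDK's exact code; the function names are hypothetical. The key property is that the decision is a deterministic function of the trace ID, so every service that sees the same trace makes the same choice.

```python
from typing import Optional

_LOW_64_BITS = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio sampling keyed on the low 64 bits of the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Trace IDs are random, so comparing against a threshold keeps
    # roughly `ratio` of all traces.
    bound = round(ratio * (1 << 64))
    return (trace_id & _LOW_64_BITS) < bound


def parent_based_decision(
    parent_sampled: Optional[bool], trace_id: int, ratio: float
) -> bool:
    """Honor the parent's decision if there is one; otherwise sample by ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return should_sample(trace_id, ratio)
```

Honoring the parent's decision is what keeps traces complete: a downstream service never drops spans from a trace that an upstream service decided to keep.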
2. Ignoring Cardinality Limits
Risk: Unbounded labels (e.g., user IDs, request UUIDs) in metrics create a new time series per distinct value, which can cause Prometheus to run out of memory or reject data.
Mitigation: Restrict metric labels to low-cardinality attributes (service, method, status, region). Use span attributes for high-cardinality data. Enforce limits via Collector filter processor or backend admission controllers.
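One way to apply this at the call site is an allowlist that splits attributes into metric labels and span-only attributes. The sketch below is an assumption about how a team might wire this in; `ALLOWED_METRIC_LABELS` and `split_attributes` are hypothetical names, and the allowlist reuses the low-cardinality keys listed above.

```python
# Low-cardinality keys that are safe to use as metric labels (from the list above).
ALLOWED_METRIC_LABELS = frozenset({"service", "method", "status", "region"})


def split_attributes(attributes: dict) -> tuple:
    """Split attributes into (metric_labels, span_only_attributes)."""
    metric_labels = {
        key: value
        for key, value in attributes.items()
        if key in ALLOWED_METRIC_LABELS
    }
    span_only = {
        key: value
        for key, value in attributes.items()
        if key not in ALLOWED_METRIC_LABELS
    }
    return metric_labels, span_only
```

The first dictionary would be passed to a counter or histogram, and the second to `span.set_attribute`, so identifiers such as user IDs remain searchable in traces without multiplying time series.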
3. Mixing Vendor SDKs with OTel
Risk: Running the Datadog APM agent, the New Relic SDK, and the OTel Collector simultaneously creates duplicate spans, causes conflicting context propagation, and inflates costs.
Mitigation: Standardize on OTel as the single instrumentation layer. Use vendor-specific exporters only in the Collector. Remove legacy agents before deployment.
4. Neglecting Log-Trace Correlation
Risk: Logs and traces exist in separate systems without shared identifiers, forcing manual cross-referencing during incidents.
Mitigation: Inject trace_id and span_id into log records using OTel log bridge or framework-specific appenders. Configure backends to index correlation fields. Use opentelemetry-instrumentation-logging for automatic enrichment.
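The enrichment step can be sketched with the standard `logging` module alone. This is a simplified stand-in, not the behavior of `opentelemetry-instrumentation-logging`: the context variable `current_span_ids` is a hypothetical placeholder for the active span context, which a real service would read from the OTel API instead.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the active span context: (trace_id, span_id).
current_span_ids = contextvars.ContextVar("current_span_ids", default=(0, 0))


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every log record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = current_span_ids.get()
        record.trace_id = f"{trace_id:032x}"
        record.span_id = f"{span_id:016x}"
        return True


def build_logger(stream) -> logging.Logger:
    """Create a logger whose output lines carry the correlation fields."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    ))
    handler.addFilter(TraceContextFilter())
    logger = logging.getLogger("payment.correlated")
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def demo() -> str:
    """Log one line inside a simulated span and return the captured output."""
    buffer = io.StringIO()
    logger = build_logger(buffer)
    token = current_span_ids.set(
        (0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7)
    )
    try:
        logger.info("payment processed")
    finally:
        current_span_ids.reset(token)
    return buffer.getvalue()
```

With both fields on every log line, a backend that indexes `trace_id` can jump from a log entry straight to the corresponding trace.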
5. Fragile Collector Deployment Topology
Risk: Deploying a single collector instance creates a bottleneck and single point of failure. Running collectors in-process blocks application threads.
Mitigation: Use sidecar containers for per-pod isolation or daemonsets for node-level aggregation. Deploy collectors as stateless deployments with horizontal scaling. Always configure memory_limiter and batch processors.
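The last rule lends itself to a simple check in CI. The sketch below assumes the pipeline's processor list has already been parsed from the Collector config into a Python list; `check_pipeline` is a hypothetical helper, not part of any OTel tooling.

```python
def check_pipeline(processors: list) -> list:
    """Return a list of problems found in a pipeline's processor list."""
    problems = []
    if "memory_limiter" not in processors:
        problems.append("memory_limiter is missing")
    elif processors[0] != "memory_limiter":
        # The limiter can only shed load early if it runs before other processors.
        problems.append("memory_limiter should be the first processor")
    if "batch" not in processors:
        problems.append("batch is missing")
    return problems
```

For example, `check_pipeline(["batch", "resource"])` reports the missing `memory_limiter`, while `check_pipeline(["memory_limiter", "resource", "batch"])` returns an empty list.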
6. Skipping Semantic Conventions
Risk: Custom attribute names (user_id vs enduser.id, http_url vs http.url) break dashboards, prevent cross-service correlation, and require constant query rewrites.
Mitigation: Enforce the OpenTelemetry semantic conventions via SDK configuration, linting tools, and code review checklists. Use the opentelemetry-semantic-conventions package rather than hand-typed attribute names. Document exceptions in an internal observability playbook.
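A lint step of this kind can be very small. The sketch below is hypothetical (`RENAMES`, `lint_attributes`, and `normalize_attributes` are illustrative names) and covers only the two name pairs mentioned above; a real rule set would be driven by the semantic conventions package.

```python
# Ad-hoc names mapped to the conventional names cited in the text above.
RENAMES = {
    "user_id": "enduser.id",
    "http_url": "http.url",
}


def lint_attributes(attributes: dict) -> list:
    """Return (found_name, suggested_name) pairs for non-conventional keys."""
    return [(name, RENAMES[name]) for name in attributes if name in RENAMES]


def normalize_attributes(attributes: dict) -> dict:
    """Return a copy of the attributes with known ad-hoc names rewritten."""
    return {RENAMES.get(name, name): value for name, value in attributes.items()}
```

Running `lint_attributes` in code review or CI flags violations early, while `normalize_attributes` could serve as a temporary shim during a migration.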
Production Bundle
Checklist
Decision Matrix
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| Legacy app, minimal code changes allowed | Auto-instrumentation + OTel Agent | Zero code modifications; framework patching covers 80% of HTTP/DB calls |
| Critical payment/auth service | Manual instrumentation + custom spans | Precise control over business logic spans, error handling, and attribute enrichment |
| Kubernetes cluster with 50+ services | OTel Collector Operator + DaemonSet | Centralized management, automatic config reloading, resource governance |
| Multi-cloud hybrid (on-prem + AWS/GCP) | Collector as gateway + OTLP over mTLS | Unified data plane, secure cross-network export, consistent processing |
| Budget-constrained startup | Prometheus + Tempo + Loki (open source) | Zero licensing, community support, scales to 10k RPS with proper tuning |
| Enterprise compliance (SOC2, HIPAA) | Commercial backend (Datadog/Honeycomb) + OTel SDK | Built-in audit trails, data residency controls, vendor support SLAs |
Config Template (OTel Collector Production-Ready)
# otel-prod.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
        max_recv_msg_size_mib: 32
      http:
        endpoint: "0.0.0.0:4318"
        cors:
          allowed_origins:
            - "https://*.yourdomain.com"
          allowed_headers:
            - "Authorization"
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 768
    spike_limit_mib: 192
  batch:
    timeout: 10s
    send_batch_max_size: 2048
    send_batch_size: 1024
  resource:
    attributes:
      - key: k8s.cluster.name
        value: "${K8S_CLUSTER_NAME}"
        action: upsert
      - key: deployment.environment
        value: "${OTEL_ENV}"
        action: upsert
  filter:
    metrics:
      include:
        match_type: regexp
        metric_names:
          - "^http\\.server\\..*"
          - "^rpc\\.client\\..*"
          - "^process\\.runtime\\..*"
exporters:
  otlp/traces:
    endpoint: "${TRACE_BACKEND}:4317"
    tls:
      insecure: false
      ca_file: "/etc/ssl/certs/ca-bundle.crt"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "prod"
    send_timestamps: true
    metric_expiration: 180m
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/traces]
    metrics:
      receivers: [otlp]
      # filter matches metric names, so it belongs in the metrics pipeline
      processors: [memory_limiter, filter, resource, batch]
      exporters: [prometheus]
  telemetry:
    logs:
      level: "info"
      development: false
    metrics:
      address: "0.0.0.0:8888"
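To get a feel for the `retry_on_failure` settings above, the following sketch computes the waits they would produce under a plain exponential backoff. It is a simplification: the growth factor of 1.5 is an assumption, jitter is omitted, and `retry_schedule` is a hypothetical helper rather than Collector code.

```python
def retry_schedule(
    initial: float = 5.0,
    max_interval: float = 30.0,
    max_elapsed: float = 300.0,
    multiplier: float = 1.5,  # assumed growth factor, no jitter
) -> list:
    """Return the wait (in seconds) before each retry until the budget runs out."""
    waits = []
    elapsed = 0.0
    interval = initial
    while elapsed + interval <= max_elapsed:
        waits.append(interval)
        elapsed += interval
        # Grow the wait, but never beyond max_interval.
        interval = min(interval * multiplier, max_interval)
    return waits
```

Under these assumptions the waits grow from 5 s to the 30 s cap within a handful of attempts, and the exporter gives up once roughly five minutes have been spent. Data that cannot be delivered within that window is dropped, which is worth weighing against the expected length of backend outages.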
Quick Start
- Initialize SDK in your service:
  pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
  export OTEL_SERVICE_NAME="demo-app"
  export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
- Run Collector locally:
  docker run -d --name otel-collector \
    -p 4317:4317 -p 4318:4318 -p 8889:8889 \
    -v $(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml \
    otel/opentelemetry-collector-contrib:latest \
    --config=/etc/otel-collector-config.yaml
- Instrument & run application:
  # Ensure OTel providers are initialized before app startup
  # Run FastAPI/Uvicorn normally; OTel will auto-export
  uvicorn app:app --host 0.0.0.0 --port 8000
- Verify data flow:
  # Check metrics endpoint
  curl http://localhost:8889/metrics | grep http_server_requests
  # Query traces (if Tempo is connected); quote the URL so the shell leaves "?" alone
  curl "http://localhost:3200/api/search?tags=service.name%3Ddemo-app"
- Add to CI/CD:
  - Inject OTEL_* variables via secrets manager
  - Validate collector health probes (/health)
  - Run integration tests with OTEL_TRACES_SAMPLER=always_on
  - Monitor collector memory/CPU via Prometheus metrics
OpenTelemetry is not a monitoring tool; it is an instrumentation standard. Treat it as critical infrastructure. Enforce conventions, govern cardinality, decouple export from collection, and align telemetry with SLOs. When implemented correctly, OTel transforms observability from a cost center into a reliability engine.