from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flask, request
import os
# Resource attributes for service identification
resource = Resource.create({
    "service.name": os.getenv("SERVICE_NAME", "checkout-service"),
    "service.version": os.getenv("SERVICE_VERSION", "1.0.0"),
    "deployment.environment": os.getenv("ENV", "production"),
})

# Tracing setup: batch spans and export them to the Collector over OTLP/gRPC
trace.set_tracer_provider(TracerProvider(resource=resource))
span_exporter = OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(span_exporter))

# Metrics setup: export periodically to the same Collector endpoint
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[metric_reader]))

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # creates a server span per request
tracer = trace.get_tracer(__name__)


@app.route("/checkout", methods=["POST"])
def checkout():
    with tracer.start_as_current_span("process_payment") as span:
        # Tolerate a missing or non-JSON body; attribute values must not be None
        payload = request.get_json(silent=True) or {}
        span.set_attribute("payment.method", payload.get("method", "unknown"))
        # Business logic here
        return {"status": "success"}, 200


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
For Node.js, auto-instrumentation requires no application code changes once the @opentelemetry/auto-instrumentations-node package is installed:
```bash
node --require @opentelemetry/auto-instrumentations-node/register server.js
```
2. Context Propagation
Distributed tracing relies on the W3C Trace Context headers (traceparent and tracestate). OTel propagates them automatically for supported frameworks, but messages sent through async queues need manual propagation when no instrumentation library covers the client:
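# Producer
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

# inject() only writes headers while a span is active; "publish_order" is an illustrative name
with tracer.start_as_current_span("publish_order"):
    carrier = {}
    inject(carrier)  # adds traceparent (and tracestate, if set) to the dict
    # Send carrier as message headers to Kafka/RabbitMQ

# Consumer
ctx = extract(carrier)  # carrier: the headers read back from the received message
with tracer.start_as_current_span("consume_order", context=ctx):
    # Process message
    pass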
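To make the header format concrete, here is a minimal standard-library sketch of what a version-00 traceparent value looks like and how it can be validated. It is not the OpenTelemetry propagator; the helper names `build_traceparent` and `parse_traceparent` are hypothetical, and the parser is deliberately strict (it accepts only the exact four-field layout).

```python
import re
import secrets

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags (lowercase hex)
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render a version-00 traceparent header value."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(value: str):
    """Return (trace_id, span_id, sampled), or None if the value is invalid."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None or match["version"] == "ff":  # version ff is forbidden
        return None
    trace_id = int(match["trace_id"], 16)
    span_id = int(match["span_id"], 16)
    if trace_id == 0 or span_id == 0:  # all-zero IDs are invalid
        return None
    return trace_id, span_id, bool(int(match["flags"], 16) & 0x01)


# Hypothetical message headers, as a producer might attach them to a queue message.
headers = {
    "traceparent": build_traceparent(
        secrets.randbits(128) or 1, secrets.randbits(64) or 1
    )
}
print(headers["traceparent"])
print(parse_traceparent(headers["traceparent"]))
```

In practice the SDK's inject and extract calls produce and consume this value; the sketch is only meant to show what travels in the message headers.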
3. Log Correlation
Logs must include trace_id and span_id to enable one-click navigation from log lines to full traces:
import logging
from opentelemetry import trace


class OTelLogFormatter(logging.Formatter):
    """Adds trace_id and span_id fields to every log record."""

    def format(self, record):
        span = trace.get_current_span()
        if span and span.is_recording():
            ctx = span.get_span_context()
            record.trace_id = format(ctx.trace_id, "032x")
            record.span_id = format(ctx.span_id, "016x")
        else:
            # No active recording span: fall back to the all-zero (invalid) IDs
            record.trace_id = "00000000000000000000000000000000"
            record.span_id = "0000000000000000"
        return super().format(record)


handler = logging.StreamHandler()
handler.setFormatter(
    OTelLogFormatter("%(asctime)s [%(trace_id)s %(span_id)s] %(message)s")
)
logging.getLogger().addHandler(handler)
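A variation worth considering is to attach the IDs in a logging filter and emit JSON, so that the fields are available to any formatter and easy for a log backend to index. The sketch below uses only the standard library: the `current_trace` context variable is a hypothetical stand-in for OpenTelemetry's active span context, and the logger name and field names are assumptions.

```python
import contextvars
import io
import json
import logging

# Stand-in for the active span context (an assumption made for this sketch).
current_trace = contextvars.ContextVar("current_trace", default=(0, 0))


class TraceContextFilter(logging.Filter):
    """Attach trace_id/span_id to every record, zero-filled when no trace is active."""

    def filter(self, record):
        trace_id, span_id = current_trace.get()
        record.trace_id = format(trace_id, "032x")
        record.span_id = format(span_id, "016x")
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so trace_id is a queryable field."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


stream = io.StringIO()  # captured in memory here; a real service would log to stdout
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout-demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

token = current_trace.set((0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7))
logger.info("payment authorized")
current_trace.reset(token)
logger.info("outside any trace")

lines = [json.loads(line) for line in stream.getvalue().splitlines()]
print(lines)
```

Putting the enrichment in a filter rather than in the formatter keeps it independent of the output format, at the cost of one more object to register on each handler.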
4. Kubernetes Deployment (Collector)
The OpenTelemetry Collector acts as a vendor-neutral proxy. Deploy it as a DaemonSet or Deployment:
# otel-collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
      memory_limiter:
        check_interval: 1s
        limit_mib: 1024
    exporters:
      otlp/jaeger:
        # Jaeger accepts OTLP over gRPC on 4317; 14250 is its legacy Jaeger-protocol port
        endpoint: "jaeger-collector:4317"
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"
      loki:
        endpoint: "http://loki:3100/loki/api/v1/push"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          # memory_limiter goes first so it can refuse data before batching
          processors: [memory_limiter, batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
        logs:
          receivers: [otlp]
          processors: [batch]
          exporters: [loki]
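The batch processor in each pipeline above groups telemetry before export. As a rough mental model only, and not the Collector's actual implementation, batching can be pictured as "flush when the buffer is full, or when the oldest item has waited too long". The `Batcher` class and its parameters below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Batcher:
    """Toy model of size-or-timeout batching; not the Collector's implementation."""

    max_size: int
    timeout_s: float
    _items: list = field(default_factory=list)
    _first_at: float = 0.0

    def add(self, item, now: float):
        """Buffer an item; return a full batch when the size limit is hit, else None."""
        if not self._items:
            self._first_at = now  # remember when the current batch started
        self._items.append(item)
        if len(self._items) >= self.max_size:
            return self._flush()
        return None

    def tick(self, now: float):
        """Return a partial batch once the oldest buffered item has waited timeout_s."""
        if self._items and now - self._first_at >= self.timeout_s:
            return self._flush()
        return None

    def _flush(self):
        batch, self._items = self._items, []
        return batch


batcher = Batcher(max_size=3, timeout_s=5.0)
print(batcher.add("span-a", now=0.0))  # None: still buffering
print(batcher.add("span-b", now=0.5))  # None: still buffering
print(batcher.add("span-c", now=1.0))  # ['span-a', 'span-b', 'span-c']: size limit hit
print(batcher.add("span-d", now=10.0)) # None: a new batch starts
print(batcher.tick(now=15.0))          # ['span-d']: timeout reached
```

Time is passed in explicitly so the behavior is deterministic; a real exporter path would use a timer instead.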
5. Query & Visualization
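Grafana unifies Prometheus (metrics), Loki (logs), and Tempo (traces) as data sources. Example PromQL for the 5xx error ratio by service (the exact metric and label names depend on the SDK and exporter versions in use):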
sum(rate(http_server_duration_seconds_count{status=~"5.."}[5m])) by (service_name)
/
sum(rate(http_server_duration_seconds_count[5m])) by (service_name)
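To make the arithmetic behind this query concrete, the following sketch computes the same ratio from raw counter readings. It is a simplification under stated assumptions: the sample values are invented, and `counter_rate` ignores the counter-reset handling and extrapolation that PromQL's rate() performs.

```python
def counter_rate(first: float, last: float, window_s: float) -> float:
    """Per-second increase of a monotonic counter over a window (no reset handling)."""
    return (last - first) / window_s


def error_ratio(samples: dict, window_s: float = 300.0) -> dict:
    """samples maps service -> status code -> (first_value, last_value)."""
    ratios = {}
    for service, by_status in samples.items():
        total = sum(
            counter_rate(first, last, window_s) for first, last in by_status.values()
        )
        errors = sum(
            counter_rate(first, last, window_s)
            for status, (first, last) in by_status.items()
            if status.startswith("5")  # same intent as status=~"5.."
        )
        ratios[service] = errors / total if total else 0.0
    return ratios


# Hypothetical counter readings taken 5 minutes (300 s) apart.
samples = {
    "checkout-service": {"200": (1000, 1570), "500": (10, 40)},
    "cart-service": {"200": (500, 800)},
}
print(error_ratio(samples))
```

Here checkout-service served 600 requests in the window, 30 of them 5xx, so its ratio is about 0.05. One difference from the query: PromQL returns no sample at all for a service that has no 5xx series, whereas this sketch reports 0.0.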
In Grafana, enable "Explore" mode to pivot from a latency spike in metrics → filter logs by trace_id → open the exact failing trace in Tempo. This closed-loop workflow eliminates tool-switching and accelerates incident response.
Pitfall Guide
- High-Cardinality Explosion
  - Description: Tagging metrics or traces with user IDs, email addresses, or request UUIDs causes index blowouts and makes storage costs scale linearly with traffic.
  - Root Cause: Misunderstanding cardinality limits in Prometheus/Loki/Tempo.
  - Mitigation: Enforce low-cardinality attributes (service, endpoint, region, status). Use OTel processors to drop or hash high-cardinality keys. Implement cardinality guardrails in the Collector.
- Siloed Observability Stack
  - Description: Running separate tools for logs, metrics, and traces without correlation capabilities forces engineers to manually cross-reference data.
  - Root Cause: Legacy tool adoption or vendor-driven procurement without architectural alignment.
  - Mitigation: Standardize on OpenTelemetry, deploy a unified query layer (Grafana), and mandate trace_id/span_id in all log lines. Use Tempo/Loki/Prometheus or commercial equivalents with native cross-pillar linking.
- Naive Sampling Strategies
  - Description: Recording 100% of traces in production consumes excessive storage and network bandwidth, while purely random sampling drops critical failure paths.
  - Root Cause: Lack of sampling policy design during early adoption.
  - Mitigation: Use head sampling to cap baseline volume; it decides when a trace starts, so it cannot see latency or errors. Use tail sampling for decisions that depend on the finished trace, such as latency or error thresholds, and adjust rates as traffic grows. In the Collector, the probabilistic_sampler processor covers the first case and the tail_sampling processor (with policies such as latency, status_code, and rate_limiting) covers the second.
- Treating Observability as Monitoring
  - Description: Building static dashboards and alerting on fixed thresholds without enabling exploratory analysis or SLO tracking.
  - Root Cause: Cultural inertia; teams expect observability to replace PagerDuty without changing workflows.
  - Mitigation: Define SLOs first. Use error budgets to govern release velocity. Train engineers on query-driven debugging. Shift from "alert fatigue" to "signal-driven investigation."
- Ignoring Security & Compliance
  - Description: Traces and logs inadvertently capture PII, secrets, or health data, violating GDPR/HIPAA and creating liability.
  - Root Cause: Over-instrumentation without data classification or redaction policies.
  - Mitigation: Implement OTel processors to mask/strip sensitive fields. Classify telemetry data streams. Apply RBAC in query interfaces. Audit sampling rules for compliance alignment.
- Lack of Service Dependency Mapping
  - Description: Teams cannot visualize how services interact, leading to blind spots during cascading failures or capacity planning.
  - Root Cause: Traces are collected but not processed into topology graphs.
  - Mitigation: Enable the Collector's servicegraph component (shipped as a connector in current contrib releases, formerly a processor). Export to Jaeger/Grafana Service Map. Correlate with Kubernetes networking policies. Use dependency maps to identify single points of failure and optimize sync/async boundaries.
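The cardinality and compliance mitigations above both come down to filtering attributes before they leave the process or the Collector. A minimal application-side sketch, assuming an allow-list policy; the key names, the bucket count, and the `sanitize_attributes` helper are all hypothetical and would need to match your own conventions:

```python
import hashlib

# Hypothetical policy: these sets are illustrative, not an OpenTelemetry convention.
ALLOWED_KEYS = {"service.name", "http.route", "http.method", "http.status_code", "region"}
HASHED_KEYS = {"user.id"}       # keep coarse grouping, hide the raw value
REDACTED_KEYS = {"user.email"}  # never emit in clear text


def sanitize_attributes(attributes: dict, buckets: int = 64) -> dict:
    """Keep allow-listed keys, bucket hashed keys, redact PII, drop everything else."""
    clean = {}
    for key, value in attributes.items():
        if key in ALLOWED_KEYS:
            clean[key] = value
        elif key in HASHED_KEYS:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            # At most `buckets` distinct values, so cardinality stays bounded.
            clean[key + ".bucket"] = int(digest, 16) % buckets
        elif key in REDACTED_KEYS:
            clean[key] = "***REDACTED***"
        # Unknown keys (request UUIDs, free-form text) are dropped by default.
    return clean


raw = {
    "service.name": "checkout-service",
    "http.route": "/checkout",
    "user.id": "u-123",
    "user.email": "alice@example.com",
    "request.id": "7f3c9f0e-1b2a-4c5d-8e9f-0a1b2c3d4e5f",
}
print(sanitize_attributes(raw))
```

An allow-list is stricter than a deny-list: a newly introduced attribute is dropped until someone decides it is safe, which is usually the failure mode you want for both cost and compliance.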
Production Bundle
Checklist
Pre-Deployment
Runtime
Post-Deployment & Governance
Decision Matrix
| Criteria | OpenTelemetry + Open Source (Prometheus/Loki/Tempo/Grafana) | Commercial APM (Datadog/New Relic/Dynatrace) | Cloud-Native (AWS CloudWatch/Azure Monitor/GCP Operations) |
|---|---|---|---|
| Cost | Low (infrastructure only); scales with retention | High (per-GB/per-host pricing); predictable but expensive | Medium-High (pay-as-you-go); can spike with high volume |
| Vendor Lock-in | Minimal (OTel standard, pluggable backends) | High (proprietary SDKs, custom query languages) | Medium (cloud-specific APIs, but OTel exporters available) |
| Scalability | High (horizontal scaling, Kubernetes-native) | High (managed, but cost-prohibitive at scale) | High (fully managed, but limited cross-cloud) |
| Ease of Use | Medium (requires SRE expertise, self-managed) | High (out-of-box dashboards, AI insights) | Medium-High (managed, but fragmented across services) |
| Compliance | Full control (on-prem, air-gapped, custom retention) | Vendor-dependent (SOC2, HIPAA, GDPR certified) | Cloud-region dependent; strong enterprise certifications |
| Best For | Engineering teams with SRE maturity, cost-sensitive, multi-cloud | Fast-moving teams, limited ops resources, budget-flexible | Single-cloud deployments, regulated industries, managed-service preference |
Config Template
# otel-collector-prod.yaml
receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 32
      http:
        cors:
          allowed_origins: ["*"]  # restrict to known origins in production
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true
processors:
  batch:
    timeout: 5s
    send_batch_size: 1000  # must not exceed send_batch_max_size
    send_batch_max_size: 2000
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
  attributes:
    actions:
      - key: "http.user_agent"
        action: delete
      - key: "user.email"
        action: update
        value: "***REDACTED***"
  tail_sampling:
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: ["ERROR"] }
      - name: latency-policy
        type: latency
        latency: { threshold_ms: 500 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
exporters:
  otlp/jaeger:
    # Jaeger accepts OTLP over gRPC on 4317; 14250 is its legacy Jaeger-protocol port
    endpoint: "jaeger-collector.monitoring.svc:4317"
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: "http://victoriametrics:8428/api/v1/write"
    tls:
      insecure: true
  loki:
    endpoint: "http://loki.monitoring.svc:3100/loki/api/v1/push"
    # Batching is handled by the batch processor in the pipelines below
service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sample before batching so dropped traces are never batched or exported
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      # Redact attributes before batching
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]
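The three tail_sampling policies in this template combine as "keep the trace if any policy matches". A simplified sketch of that decision, assuming the trace has already completed; the `TraceSummary` type, the `keep_trace` helper, and the CRC-based bucketing are hypothetical illustrations rather than the processor's internals:

```python
import zlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary of a finished trace."""

    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(
    trace: TraceSummary,
    latency_threshold_ms: float = 500.0,
    sampling_percentage: float = 10.0,
) -> bool:
    """Keep a completed trace if any policy matches: error, latency, or probabilistic."""
    if trace.has_error:                           # error-policy
        return True
    if trace.duration_ms > latency_threshold_ms:  # latency-policy
        return True
    # probabilistic: derive a stable bucket in [0, 10000) from the trace ID so
    # that every Collector replica reaches the same decision for the same trace.
    bucket = zlib.crc32(trace.trace_id.encode("ascii")) % 10_000
    return bucket < sampling_percentage * 100


traces = [
    TraceSummary("4bf92f3577b34da6a3ce929d0e0e4736", has_error=True, duration_ms=40),
    TraceSummary("0af7651916cd43dd8448eb211c80319c", has_error=False, duration_ms=900),
    TraceSummary("a3ce929d0e0e47364bf92f3577b34da6", has_error=False, duration_ms=35),
]
for trace in traces:
    print(trace.trace_id, keep_trace(trace))
```

The first two traces are always kept (error and slow, respectively); the third is kept only if its ID happens to fall into the sampled 10%.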
Quick Start
- Clone the baseline stack:
    git clone https://github.com/open-telemetry/opentelemetry-demo.git
    cd opentelemetry-demo
- Launch observability infrastructure:
    docker compose -f docker-compose.yml -f docker-compose.override.yml up -d jaeger prometheus grafana loki
- Instrument a sample service (Python). The environment variables below take effect only when main.py does not pass an explicit endpoint or service.name in code:
    pip install flask opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-flask
    export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
    export OTEL_SERVICE_NAME=quickstart-service
    python main.py
- Verify data flow:
  - Send traffic:
      curl -X POST http://localhost:8080/checkout -H "Content-Type: application/json" -d '{"method":"card"}'
  - Open Grafana (http://localhost:3000) → Explore → query metrics, logs, and traces using service_name="quickstart-service"
  - Confirm that trace_id and span_id appear in log lines and that metrics carry the service_name label
- Productionize:
  - Replace Docker with Kubernetes manifests
  - Apply the otel-collector-prod.yaml config
  - Enforce cardinality policies and sampling
  - Integrate with CI/CD for automated SLO validation
Observability is not a destination; it's a continuous feedback loop between system behavior and engineering decisions. By standardizing on OpenTelemetry, enforcing data hygiene, and aligning telemetry with SLOs, teams transform microservices complexity from a liability into a competitive advantage. The tools are mature. The patterns are proven. The only remaining variable is organizational commitment to instrument, observe, and iterate.