ervice for dynamic enrichment and a message broker for durable buffering.
Step 1: Decouple Collection from Processing
Install the OpenTelemetry Collector as a sidecar or daemonset. Configure it to receive telemetry via OTLP (gRPC/HTTP) and forward to a durable buffer. This isolates application threads from network latency and backend outages.
# otel-collector-config.yaml (receivers + exporters)
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
exporters:
kafka:
brokers: ["kafka-1:9092", "kafka-2:9092"]
topic: "telemetry-ingest"
encoding: otlp_proto
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
sending_queue:
enabled: true
num_consumers: 4
queue_size: 5000
Step 2: Implement Backpressure-Aware Buffering
Kafka (or Pulsar/Redis Streams) decouples ingestion from processing. The Collector's sending_queue prevents OOM by dropping or rejecting data when downstream consumers lag. Configure consumer groups in the processing layer to scale horizontally based on partition lag.
Step 3: Normalize Schemas & Reduce Cardinality
Telemetry from multiple services arrives with inconsistent attributes. A normalization stage enforces a schema contract, strips high-cardinality fields, and maps vendor-specific tags to OpenTelemetry semantic conventions.
// pipeline-processor.ts
import { Kafka, logLevel } from 'kafkajs';
const kafka = new Kafka({
clientId: 'telemetry-processor',
brokers: ['kafka-1:9092', 'kafka-2:9092'],
logLevel: logLevel.WARN,
});
const consumer = kafka.consumer({ groupId: 'normalize-group' });
const HIGH_CARDINALITY_KEYS = ['request_id', 'user_session', 'trace_id'];
const SCHEMA_MAP: Record<string, string> = {
'http.status_code': 'http.response.status_code',
'db.statement': 'db.statement',
'custom.env': 'deployment.environment',
};
async function normalizeAndRoute() {
await consumer.connect();
await consumer.subscribe({ topic: 'telemetry-ingest', fromBeginning: false });
await consumer.run({
eachMessage: async ({ message }) => {
const payload = JSON.parse(message.value!.toString());
const normalized = { ...payload };
// Strip high-cardinality attributes from metrics/logs
if (normalized.resource?.attributes) {
HIGH_CARDINALITY_KEYS.forEach(key => {
if (normalized.resource.attributes[key]) {
delete normalized.resource.attributes[key];
}
});
}
// Map legacy keys to OTel conventions
if (normalized.attributes) {
Object.keys(SCHEMA_MAP).forEach(oldKey => {
if (normalized.attributes[oldKey]) {
normalized.attributes[SCHEMA_MAP[oldKey]] = normalized.attributes[oldKey];
delete normalized.attributes[oldKey];
}
});
}
// Route based on signal type and business context
const targetTopic = normalized.type === 'trace' && normalized.resource?.attributes?.['service.name']?.includes('billing')
? 'telemetry-billing'
: 'telemetry-processed';
await producer.send({
topic: targetTopic,
messages: [{ value: JSON.stringify(normalized) }],
});
},
});
}
normalizeAndRoute().catch(console.error);
Step 4: Dynamic Routing & Multi-Backend Export
Route normalized signals to specialized backends. Traces go to Tempo/Jaeger, metrics to Prometheus/VictoriaMetrics, logs to Loki/ELK. Use pipeline-level routing rules to enforce data retention policies and cost controls.
Step 5: Pipeline Observability
Monitor the pipeline itself. Export Collector metrics (otelcol_exporter_sent_spans, otelcol_receiver_refused_spans, otelcol_processor_batch_batch_send_size) to your metrics backend. Set alerts on queue depth, drop rates, and export latency. A pipeline that cannot report its own health is a liability.
Architecture Decisions & Rationale
- OTel Collector over language-specific agents: Centralizes protocol handling, reduces SDK memory footprint, and enforces vendor neutrality.
- Kafka over in-memory queues: Provides durable buffering, partition-level scaling, and consumer group semantics for horizontal processing.
- Schema normalization at ingest: Prevents downstream query failures, reduces storage bloat, and ensures consistent correlation across services.
- TypeScript processing layer: Chosen for rapid iteration on routing logic and attribute mapping. In production, this layer can be rewritten in Go/Rust for throughput, but TS demonstrates the pipeline contract clearly.
Pitfall Guide
- Synchronous SDK Exports: Blocking application threads on network calls increases tail latency and causes cascading timeouts. Always use async exporters with bounded queues.
- Unbounded Memory Queues: Default OTel/Fluent Bit queues can grow indefinitely during backend outages, triggering OOM kills. Configure
queue_size and retry_on_failure limits explicitly.
- High-Cardinality Attribute Explosion: Adding
request_id, user_email, or dynamic headers to metrics creates exponential series cardinality. Strip or hash these fields before metric export.
- Treating the Pipeline as a Black Box: Failing to instrument the pipeline itself means you won't detect data loss until users report missing dashboards. Export pipeline metrics and set SLOs on signal delivery.
- Vendor-Specific Agent Lock-in: Proprietary agents tie you to a single backend. Migrate to OTel-native receivers and use vendor exporters only at the final hop.
- Ignoring Temporal Ordering & Clock Skew: Distributed services drift. Normalize timestamps to UTC at the pipeline edge. Use OTel's
timestamp_unix_nano and avoid relying on local clock injection.
- Over-Sampling Without Business Context: Uniform sampling discards critical transactions. Implement head-based sampling for traces (keep 100% of errors, sample 10% of success) and tail-based sampling for latency outliers.
Best Practices from Production:
- Enforce schema contracts via pipeline validation gates.
- Implement circuit breakers for backend exporters.
- Use dynamic routing based on service criticality, not just signal type.
- Run pipeline load tests that simulate traffic spikes and backend failures.
- Maintain a telemetry catalog documenting attribute definitions and retention policies.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Startup / Low Volume (<10k RPS) | OTel Collector + Direct Export | Simplicity outweighs buffering complexity. In-memory queues suffice. | Low ($0.05/GB) |
| Mid-Scale / Multi-Service (10kβ100k RPS) | OTel Collector + Kafka Buffer + Normalization | Prevents OOM, enforces schema consistency, enables horizontal scaling. | Medium ($0.18/GB) |
| Enterprise / Multi-Region (>100k RPS) | Collector Mesh + Kafka/Pulsar + Tail Sampling + Dynamic Routing | Guarantees signal fidelity, isolates regional failures, optimizes storage spend. | Optimized ($0.12/GB) |
Configuration Template
# otel-collector-prod.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 5s
send_batch_max_size: 5000
memory_limiter:
check_interval: 1s
limit_mib: 1500
spike_limit_mib: 256
transform:
trace_statements:
- context: resource
statements:
- delete_key(attributes, "request_id")
- set(attributes["deployment.environment"], "production")
exporters:
kafka:
brokers: ["kafka-1:9092", "kafka-2:9092", "kafka-3:9092"]
topic: "telemetry-ingest"
encoding: otlp_proto
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
sending_queue:
enabled: true
num_consumers: 4
queue_size: 5000
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, transform]
exporters: [kafka]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [kafka]
logs:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [kafka]
Quick Start Guide
- Deploy the Collector: Run
docker run -p 4317:4317 -p 4318:4318 -v ./otel-collector-prod.yaml:/etc/otel/config.yaml otel/opentelemetry-collector-contrib:latest --config=/etc/otel/config.yaml
- Configure Application SDKs: Set
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector-host:4317 and enable async batching in your OTel SDK initialization.
- Validate Pipeline Flow: Generate test telemetry, check Kafka topic lag, and verify Collector metrics (
otelcol_exporter_sent_spans_total) in your metrics backend.
- Enable Self-Monitoring: Add
otelcol metrics to your Prometheus scrape config and set alerts on otelcol_processor_batch_batch_send_size and queue depth thresholds.