ndent; it’s routing-dependent. Teams that implement schema validation at the edge, correlation injection before transmission, and tier-aware sinks reduce index bloat, eliminate redundant parsing, and align storage cost with actual access patterns.
Core Solution
Building a production-grade log aggregation pipeline requires five architectural phases: source standardization, edge collection, correlation enrichment, tiered routing, and lifecycle enforcement.
Step 1: Enforce Structured Logging at the Source
Text-based logs force parsers to run regex on every ingestion cycle. Structured JSON eliminates this overhead and enables field-level indexing.
TypeScript implementation using pino with async context storage for correlation ID injection:
```typescript
import pino from 'pino';
import { AsyncLocalStorage } from 'async_hooks';

// Holds the correlation ID for the current async call chain.
const correlationStore = new AsyncLocalStorage<string>();

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    // Emit the level as an uppercase label instead of pino's numeric value.
    level: (label) => ({ level: label.toUpperCase() }),
  },
  base: undefined, // omit the default pid and hostname fields
  transport: {
    target: 'pino-socket',
    options: { address: '127.0.0.1', port: 5044, mode: 'tcp' },
  },
});

// Runs fn with correlationId visible to every log() call made inside it.
export function withCorrelation<T>(correlationId: string, fn: () => T): T {
  return correlationStore.run(correlationId, fn);
}

export function log(message: string, meta?: Record<string, unknown>): void {
  const correlationId = correlationStore.getStore();
  logger.info(
    {
      ...meta,
      correlation_id: correlationId || 'none',
      service: process.env.SERVICE_NAME || 'unknown',
      environment: process.env.NODE_ENV || 'development',
      timestamp: new Date().toISOString(),
    },
    message
  );
}
```
This ensures every log carries context before it leaves the process. Correlation IDs must propagate across HTTP/gRPC boundaries via headers (X-Correlation-ID, traceparent).
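Header propagation can be sketched as a pair of small helpers. This is a minimal illustration rather than a complete tracing integration: the names `HeaderMap`, `extractCorrelationId`, and `outgoingHeaders` are hypothetical, and the sketch assumes that the trace-id segment of a W3C `traceparent` header is an acceptable fallback correlation ID.

```typescript
// Hypothetical helpers; only the header names come from the text above.
type HeaderMap = Record<string, string | string[] | undefined>;

// W3C traceparent layout: version-traceid-parentid-flags, lowercase hex.
const TRACEPARENT_PATTERN = /^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/;

function firstValue(value: string | string[] | undefined): string | undefined {
  return Array.isArray(value) ? value[0] : value;
}

// Prefers X-Correlation-ID, then falls back to the traceparent trace-id.
function extractCorrelationId(headers: HeaderMap): string | undefined {
  // HTTP header names are case-insensitive, so normalize before lookup.
  const lower: HeaderMap = {};
  for (const [key, value] of Object.entries(headers)) {
    lower[key.toLowerCase()] = value;
  }

  const explicit = firstValue(lower['x-correlation-id'])?.trim();
  if (explicit) return explicit;

  const match = TRACEPARENT_PATTERN.exec(firstValue(lower['traceparent'])?.trim() ?? '');
  const traceId = match?.[1];
  // An all-zero trace-id is invalid, so treat it as absent.
  if (traceId && !/^0+$/.test(traceId)) return traceId;
  return undefined;
}

// Headers to attach to outbound requests so the next service sees the same ID.
function outgoingHeaders(correlationId: string): Record<string, string> {
  return { 'x-correlation-id': correlationId };
}
```

An inbound request handler would call `extractCorrelationId(req.headers)` and pass the result (or a freshly generated ID when it is undefined) to `withCorrelation`.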
Step 2: Deploy Lightweight Edge Collectors
Avoid heavy daemons. Use sidecar or daemonset deployments of Vector or Fluent Bit. Vector (Rust) offers 40% lower memory footprint and native async batching compared to Ruby/Python-based alternatives.
Architecture decision: Push-based collection with internal backpressure queues. Pull-based (scraping) introduces latency and misses short-lived containers. Sidecars isolate collection per service; daemonsets reduce overhead but require namespace-aware filtering.
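The idea of an internal backpressure queue can be illustrated with a bounded buffer that refuses new items once it is full. This is a hypothetical sketch of the concept, not how Vector or Fluent Bit implement their buffers; `BoundedLogBuffer` and its drop-newest policy are assumptions made for the example.

```typescript
// Hypothetical bounded buffer illustrating backpressure; not a collector's real internals.
class BoundedLogBuffer<T> {
  private items: T[] = [];
  private readonly maxItems: number;
  droppedCount = 0;

  constructor(maxItems: number) {
    this.maxItems = maxItems;
  }

  // Returns false when full so the producer can slow down, retry, or shed load.
  offer(item: T): boolean {
    if (this.items.length >= this.maxItems) {
      this.droppedCount += 1;
      return false;
    }
    this.items.push(item);
    return true;
  }

  // Removes and returns up to batchSize items for one downstream write.
  drain(batchSize: number): T[] {
    return this.items.splice(0, batchSize);
  }

  get size(): number {
    return this.items.length;
  }
}
```

Because the buffer has a hard ceiling, a slow sink shows up as a rising `droppedCount` (or a blocked producer, if the caller chooses to wait) instead of unbounded memory growth.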
Step 3: Implement Correlation & Enrichment
Collectors should not parse or transform logs beyond field extraction. Enrichment happens in a dedicated processing layer:
- Extract `correlation_id`, `service`, `severity`
- Drop or sample routine health checks
- Attach Kubernetes metadata (`pod_name`, `namespace`, `node_ip`)
- Validate schema against a registry (JSON Schema or Protobuf)
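The enrichment steps above can be sketched as a single function. This is an illustration under stated assumptions: `createEnricher`, `validateRecord`, and the required-field list are hypothetical, the health-check messages are taken from the configuration template later in this article, and the hand-written field check stands in for real JSON Schema or Protobuf validation against a registry.

```typescript
type RawLogRecord = Record<string, unknown>;

interface PodMetadata {
  pod_name: string;
  namespace: string;
  node_ip: string;
}

// Assumed health-check messages and required fields for this sketch.
const HEALTH_MESSAGES = new Set(['GET /health', 'GET /ready']);
const REQUIRED_STRING_FIELDS = ['message', 'service', 'correlation_id'];

// Stand-in for schema validation: returns a list of problems, empty if valid.
function validateRecord(record: RawLogRecord): string[] {
  const problems: string[] = [];
  for (const field of REQUIRED_STRING_FIELDS) {
    if (typeof record[field] !== 'string') problems.push(`${field} must be a string`);
  }
  return problems;
}

// Returns a function that enriches one record, or null when the record is dropped.
function createEnricher(pod: PodMetadata, healthSampleEvery: number) {
  let healthSeen = 0;
  return (record: RawLogRecord): RawLogRecord | null => {
    const message = record['message'];
    if (typeof message === 'string' && HEALTH_MESSAGES.has(message)) {
      healthSeen += 1;
      // Keep one health check in every healthSampleEvery; drop the rest.
      if (healthSeen % healthSampleEvery !== 0) return null;
    }
    const level = record['level'];
    const enriched: RawLogRecord = {
      ...record,
      severity: typeof level === 'string' ? level : 'INFO',
      ...pod,
    };
    const problems = validateRecord(enriched);
    // Invalid records are tagged instead of discarded so they can be dead-lettered.
    return problems.length === 0 ? enriched : { ...enriched, schema_errors: problems };
  };
}
```

Tagging invalid records with `schema_errors` rather than dropping them is a design choice for the sketch; it keeps schema drift visible downstream.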
Step 4: Tiered Routing Architecture
```
[App/Container] → [Sidecar Collector] → [Kafka/Pulsar Buffer] → [Vector Processor]
                                                                        ↓
                              [Hot Tier: OpenSearch/Loki]  ← (index < 7d, high cardinality)
                              [Warm Tier: S3/GCS Parquet]  ← (7d–90d, columnar, analytical)
                              [Cold Tier: Glacier/Archive] ← (>90d, compliance only)
```
Hot tier handles real-time debugging. Warm tier supports SLA reporting and trend analysis. Cold tier satisfies regulatory retention. Routing is driven by log severity, service criticality, and age.
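A routing decision based on severity, service criticality, and age could look like the sketch below. The function `chooseTier`, the severity set, and the list of critical services are hypothetical; only the 7-day and 90-day boundaries come from the diagram above.

```typescript
type StorageTier = 'hot' | 'warm' | 'cold';

interface RoutableLog {
  severity: string;
  service: string;
  ageDays: number;
}

// Assumed for illustration: which severities and services justify hot indexing.
const HOT_SEVERITIES = new Set(['WARN', 'ERROR', 'FATAL']);
const CRITICAL_SERVICES = new Set(['payments', 'auth']);

function chooseTier(log: RoutableLog): StorageTier {
  // Age dominates: old data never belongs in the hot tier.
  if (log.ageDays > 90) return 'cold';
  if (log.ageDays >= 7) return 'warm';
  // Fresh data is indexed hot only when it is likely to be queried during an incident.
  const severe = HOT_SEVERITIES.has(log.severity.toUpperCase());
  return severe || CRITICAL_SERVICES.has(log.service) ? 'hot' : 'warm';
}
```

For example, a fresh `DEBUG` line from a non-critical service goes straight to the warm tier, which keeps it out of the hot index entirely.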
Step 5: Lifecycle & Cost Enforcement
Configure index lifecycle management (ILM) or equivalent:
- hot → warm transition at 7 days
- warm → cold at 90 days
- Delete after 365 days unless compliance flag exists
- Apply compression (zstd) and rollover policies based on size (50GB) or age
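The policy above can be expressed as two small decision functions. This is a sketch of the rules as listed, not an ILM configuration; the names `lifecycleAction` and `shouldRollover` are hypothetical, and the 50GB limit is interpreted here as 50 GiB.

```typescript
type LifecycleAction = 'keep_hot' | 'move_to_warm' | 'move_to_cold' | 'delete';

// Assumption: "50GB" is read as 50 GiB for the purposes of this sketch.
const ROLLOVER_MAX_BYTES = 50 * 1024 * 1024 * 1024;

function lifecycleAction(ageDays: number, complianceHold: boolean): LifecycleAction {
  // Deletion applies after 365 days unless a compliance flag keeps the data.
  if (ageDays > 365 && !complianceHold) return 'delete';
  if (ageDays >= 90) return 'move_to_cold';
  if (ageDays >= 7) return 'move_to_warm';
  return 'keep_hot';
}

// Rollover triggers on whichever limit is reached first: size or age.
function shouldRollover(sizeBytes: number, ageDays: number, maxAgeDays: number): boolean {
  return sizeBytes >= ROLLOVER_MAX_BYTES || ageDays >= maxAgeDays;
}
```

Data under a compliance hold stays in the cold tier past 365 days instead of being deleted.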
Architecture rationale: Decoupling ingestion from indexing prevents backpressure cascades. Object storage provides 80% cost reduction over block storage. Schema enforcement at the edge prevents parser failures downstream.
Pitfall Guide
- Unbounded Log Retention: Storing everything indefinitely inflates storage costs and degrades query performance. Best practice: enforce tiered lifecycle policies with compliance overrides.
- Missing Correlation Context: Logs without trace IDs force manual cross-service matching. Best practice: propagate correlation IDs via headers and inject them at the edge before transmission.
- Over-Indexing Hot Storage: Indexing debug logs or health checks consumes CPU and memory. Best practice: route low-value logs directly to warm/cold tiers or drop them at the collector.
- Ignoring Collector Backpressure: When sinks slow down, collectors buffer indefinitely, causing OOM crashes. Best practice: configure internal queue limits, retry policies, and dead-letter routing.
- Schema Drift Without Validation: Field name changes or type mismatches break parsers downstream. Best practice: enforce JSON Schema at the source and validate in the processing layer.
- Treating Logs as Primary Debugging Tool: Logs are expensive and slow for real-time debugging. Best practice: use metrics for alerting, traces for request flow, and logs for forensic analysis.
- Single-Point Collector Deployment: Centralized collectors create bottlenecks and failure domains. Best practice: distribute collection via sidecars/daemonsets with independent buffering.
Production experience shows that 70% of log pipeline failures stem from unmanaged backpressure and schema inconsistency. Addressing these two areas yields the highest ROI.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup / MVP | Sampling-Driven (Loki) | Low setup overhead, label-based indexing, sufficient for <10 services | Low ($0.15–0.25/GB) |
| High-Throughput SaaS | Tiered/Hybrid | Decouples ingestion from query, handles 10k+ EPS, scales linearly | Medium ($0.20–0.30/GB) |
| Compliance / Regulated | Centralized + Cold Archive | Immutable storage, audit trails, legal hold capabilities | High ($0.40–0.60/GB) |
| Multi-Cloud / Hybrid | Stream-First (Vector+S3) | Cloud-agnostic, consistent routing, avoids vendor lock-in | Medium ($0.25–0.35/GB) |
Configuration Template
Vector YAML configuration for tiered log routing with an Elasticsearch hot tier and an S3 warm tier:
```yaml
sources:
  app_logs:
    type: socket
    address: "0.0.0.0:5044"
    mode: "tcp"
    decoding:
      codec: "json"

transforms:
  enrich_metadata:
    type: "remap"
    inputs: ["app_logs"]
    source: |
      # get_env_var is fallible, so ?? supplies a default when the variable is unset
      .service = get_env_var("SERVICE_NAME") ?? "unknown"
      .environment = get_env_var("NODE_ENV") ?? "production"
      if is_null(.correlation_id) { .correlation_id = "none" }
      .ingested_at = now()

  drop_health_checks:
    type: "filter"
    inputs: ["enrich_metadata"]
    # filter forwards events for which the condition is true,
    # so the health-check match is negated to drop those events
    condition: |
      !(.message == "GET /health" || .message == "GET /ready")

sinks:
  hot_elasticsearch:
    type: "elasticsearch"
    inputs: ["drop_health_checks"]
    endpoints: ["https://es-hot.internal:9200"]
    healthcheck: true
    bulk:
      index: "logs-hot-%Y.%m.%d"
    batch:
      max_bytes: 10485760
      timeout_secs: 5

  warm_s3:
    type: "aws_s3"
    inputs: ["drop_health_checks"]
    region: "us-east-1" # placeholder (not in the original template); set to the bucket's region
    bucket: "company-logs-warm"
    key_prefix: "logs/warm/%Y/%m/%d/"
    encoding:
      codec: "json"
    compression: "gzip"
    batch:
      max_bytes: 52428800
      timeout_secs: 300
    healthcheck: false
```
Quick Start Guide
- Install Vector: Download the binary or run `docker run --rm -it timberio/vector:0.38.0-alpine`.
- Configure Source & Sink: Use the template above, update endpoints/bucket names, and save as `vector.yaml`.
- Deploy Collector: Run `vector --config vector.yaml` locally or deploy as a Kubernetes DaemonSet using the official Helm chart.
- Verify Ingestion: Send a test log via `echo '{"message":"test","correlation_id":"abc-123"}' | nc -w1 localhost 5044`. Check the hot index or S3 bucket for arrival.
- Monitor Backpressure: Watch the `vector_component_received_events_total` and `vector_component_discarded_events_total` metrics. Adjust batch/queue limits if discards exceed 1% of received events.
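The 1% threshold can be checked with a small script against a Prometheus text-format scrape. This is a rough sketch, not a full exposition-format parser: `sumMetric` and `discardRatio` are hypothetical helpers, the metric names are passed in as parameters because they can differ between Vector versions, and the sketch assumes each sample carries a `component_id` label.

```typescript
// Sums the samples of one metric for one component in Prometheus text format.
function sumMetric(scrape: string, metricName: string, componentId: string): number {
  const wantedLabel = `component_id="${componentId}"`;
  let total = 0;
  for (const rawLine of scrape.split('\n')) {
    const line = rawLine.trim();
    // Skip blank lines and # HELP / # TYPE comments.
    if (line === '' || line.startsWith('#')) continue;
    // A sample line looks like: name{labels} value [timestamp]
    const m = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/.exec(line);
    if (!m || m[1] !== metricName) continue;
    if ((m[2] ?? '').indexOf(wantedLabel) === -1) continue;
    total += Number(m[3]);
  }
  return total;
}

// Fraction of received events that the component discarded (0 when nothing was received).
function discardRatio(
  scrape: string,
  componentId: string,
  receivedMetric: string,
  discardedMetric: string
): number {
  const received = sumMetric(scrape, receivedMetric, componentId);
  if (received === 0) return 0;
  return sumMetric(scrape, discardedMetric, componentId) / received;
}
```

Computing the ratio per component avoids double-counting events that pass through several stages of the pipeline; a result above `0.01` for a sink is the signal to revisit its batch and queue limits.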
Log aggregation is not a storage problem. It is a routing problem. Architect for signal, enforce schema at the edge, tier by access pattern, and let lifecycle policies manage cost. The pipeline that pays for itself is the one that stops logging noise before it reaches the network.