Log Aggregation with the ELK Stack
Current Situation Analysis
Log aggregation is not a luxury; it is the foundational layer of operational visibility. In modern distributed architectures, applications emit logs across containers, serverless functions, edge nodes, and legacy VMs. Without centralized aggregation, engineering teams operate with fragmented telemetry, forcing them to SSH into individual hosts, parse multiline stack traces manually, and reconstruct event timelines from isolated files. This fragmentation directly inflates Mean Time to Resolution (MTTR) and creates blind spots that mask cascading failures until they impact end users.
The problem is routinely overlooked because logging is treated as a development artifact rather than an observability primitive. Teams ship console.log or print statements during development, assume stdout capture solves the problem in production, and defer aggregation until incidents force reactive triage. Cloud providers advertise built-in logging, but native solutions rarely unify cross-service correlation, lack advanced filtering, or become cost-prohibitive at scale.
Industry data consistently validates the operational cost of unaggregated logs:
- DORA's Accelerate State of DevOps research shows that elite performers restore service dramatically faster than low performers (the 2019 report cites a 2,604x gap in time to restore), a difference attributed in large part to centralized telemetry and automated log correlation.
- PagerDuty's State of On-Call reports indicate that engineers spend 30-40% of incident response time manually locating and parsing logs across disparate systems.
- Log volume grows 40-50% year-over-year in microservices environments, yet 68% of organizations lack automated retention and indexing policies, leading to storage bloat and degraded query performance.
When logs remain siloed, debugging shifts from deterministic analysis to forensic guesswork. Centralized aggregation transforms logs from noise into structured, queryable signals.
Key Findings
Centralized log aggregation fundamentally alters how teams interact with operational data. The shift from file-based retrieval to indexed, correlated log streams produces measurable improvements across resolution speed, query complexity, storage efficiency, and horizontal scalability.
| Approach | MTTR (Avg Incident) | Query Complexity | Storage Efficiency | Scalability Model |
|---|---|---|---|---|
| File-based/Stdout logging | 45-90 minutes | grep/awk + manual correlation | Linear growth, no compression | Vertical only |
| Centralized ELK aggregation | 8-15 minutes | DSL/KQL + cross-service correlation | 60-75% reduction via compression & ILM | Horizontal, sharded |
This finding matters because it decouples operational visibility from infrastructure topology. ELK aggregation enables time-series correlation, field-level filtering, and automated alerting without requiring direct node access. The storage efficiency gain stems from Elasticsearch's Lucene-based compression, index lifecycle management, and the elimination of duplicate log shipping. Horizontal scalability is achieved through shard distribution and replica routing, allowing query latency to remain stable as log volume scales.
Core Solution
Implementing log aggregation with the ELK stack requires four coordinated layers: instrumentation, collection, processing, and storage/visualization. The modern reference architecture uses Beats for lightweight collection, Logstash for transformation, Elasticsearch for indexing, and Kibana for exploration.
Step 1: Instrument Applications with Structured Logging
Plain text logs are unqueryable. All services must emit JSON-formatted logs with consistent field naming. This enables Elasticsearch to map fields automatically and Kibana to filter without regex parsing.
```typescript
// logger.ts
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: {
    service: process.env.SERVICE_NAME || 'unknown',
    environment: process.env.NODE_ENV || 'development',
    version: process.env.APP_VERSION || '0.0.0',
  },
  formatters: {
    level: (label) => ({ level: label }),
    bindings: (bindings) => ({
      pid: bindings.pid,
      hostname: bindings.hostname,
    }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

export default logger;
```
Usage in request handlers:
```typescript
import logger from './logger';

app.get('/api/users/:id', async (req, res) => {
  const userId = req.params.id;
  logger.info({ userId, action: 'fetch_user' }, 'User lookup initiated');
  // business logic
  logger.debug({ userId, cacheHit: true }, 'User retrieved from cache');
});
```
Step 2: Deploy Filebeat for Log Collection
Filebeat reads log files or Docker container stdout, attaches metadata, and ships events to Logstash or directly to Elasticsearch. For production, route through Logstash to enforce schema validation and enrichment.
Filebeat configuration (filebeat.yml):
```yaml
filebeat.inputs:
  - type: log    # note: the 'filestream' input supersedes 'log' in newer Filebeat releases
    enabled: true
    paths:
      - /var/log/app/*.json
    json.keys_under_root: true
    json.add_error_key: true
    json.message_key: message

processors:
  - add_host_metadata: ~
  - add_docker_metadata: ~
  - add_cloud_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]
  loadbalance: true
```
Step 3: Build Logstash Processing Pipeline
Logstash ingests Beats events, applies filters, and writes to Elasticsearch. The pipeline should normalize timestamps, parse stack traces, drop debug noise in production, and enrich with geographic or service metadata.
logstash/pipelines/main.conf:
```conf
input {
  beats {
    port => 5044
    ssl => false
  }
}

filter {
  if [level] == "debug" and [environment] == "production" {
    drop { }
  }

  json {
    source => "message"
    target => "parsed"
    skip_on_invalid_json => true
  }

  if [parsed][error] {
    grok {
      match => { "[parsed][error][stack_trace]" => "%{GREEDYDATA:stack_trace}" }
    }
  }

  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
  }

  mutate {
    rename => { "parsed" => "app" }
    remove_field => [ "host", "agent", "ecs" ]
  }
}

output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
    index => "logs-%{[app][service]}-%{+YYYY.MM.dd}"
    user => "${ES_USER}"
    password => "${ES_PASSWORD}"
    ssl_certificate_authorities => ["/usr/share/logstash/config/certs/http_ca.crt"]
  }
}
```
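Sensitive data can be redacted in the same filter stage before it ever reaches an index. The following sketch uses the `mutate` filter's `gsub` option; the field path `[app][message]` and the patterns are illustrative assumptions, not part of the pipeline above:

```conf
filter {
  # Redact bearer tokens and email addresses before indexing.
  # The field path and regexes below are examples; adjust to your schema.
  mutate {
    gsub => [
      "[app][message]", "Bearer [A-Za-z0-9\-\._~\+\/]+", "Bearer [REDACTED]",
      "[app][message]", "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+", "[EMAIL]"
    ]
  }
}
```

`gsub` takes flat triples of field, pattern, replacement, so multiple redactions can live in a single `mutate` block.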
Step 4: Configure Index Lifecycle Management (ILM)
Elasticsearch indices must be managed through ILM to prevent storage exhaustion and maintain query performance. Define phases: hot (ingest & search), warm (read-heavy), cold (archive), delete.
```json
PUT _ilm/policy/logs-retention-policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": { "max_size": "50gb", "max_age": "1d" },
"set_priority": { "priority": 100 }
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 },
"set_priority": { "priority": 50 }
}
},
"cold": {
"min_age": "30d",
"actions": {
"set_priority": { "priority": 0 },
"searchable_snapshot": { "snapshot_repository": "s3-repo" }
}
},
"delete": {
"min_age": "90d",
"actions": { "delete": {} }
}
}
}
}
```
Step 5: Visualize & Alert in Kibana
Import index patterns, configure field types, and build dashboards using Kibana Lens or TSVB. Set up alerting rules on error rate thresholds, latency spikes, or specific exception patterns. Use Kibana's built-in anomaly detection for log rate baselines.
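As a starting point for an error-rate alert, a Dev Tools query can count recent errors per service. The index pattern and field names below (`app.level`, `app.service`) are assumptions based on the pipeline above, and the terms aggregation requires `app.service` to be mapped as `keyword`:

```json
GET logs-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "app.level": "error" } },
        { "range": { "@timestamp": { "gte": "now-15m" } } }
      ]
    }
  },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "app.service" }
    }
  }
}
```

A Kibana alerting rule can then fire when any bucket's count crosses a threshold.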
Architecture rationale:
- Beats over Logstash agents: lower CPU/memory footprint, native Docker/container awareness, reliable delivery with ACKs.
- Logstash as central processor: enables complex transformations without coupling business logic to infrastructure.
- ILM-driven indexing: prevents shard bloat, reduces storage costs, maintains query latency under load.
- Structured JSON logging: eliminates parsing overhead, enables exact field filtering, supports KQL syntax.
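To illustrate that last point, KQL filters in Kibana Discover target structured fields directly. The field names here follow the pipeline's `app` namespace, and `durationMs` is a hypothetical numeric field:

```
app.service : "auth" and app.level : "error"
app.durationMs > 1000 and not app.environment : "development"
```

The first query isolates errors from one service; the second finds slow requests outside development, neither requiring any regex parsing.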
Pitfall Guide
- **Shipping unstructured logs.** Plain text logs force Grok parsing at ingestion, which is CPU-intensive and brittle; a single format change breaks pipelines. Best practice: enforce JSON emission at the application layer and validate the schema with JSON Schema or OpenTelemetry semantic conventions.
- **Indexing high-cardinality fields.** Indexing fields like `user_id`, `session_id`, or `request_id` without mapping constraints creates millions of unique terms, exhausting heap memory and degrading query performance. Best practice: set `index: false` or use the `keyword` type with explicit mapping, or route such identifiers to separate trace/span stores.
- **Ignoring log sampling and rate limiting.** High-throughput services can generate 100k+ logs/second. Shipping every event overwhelms Logstash workers and spikes Elasticsearch cluster load. Best practice: implement application-level sampling for debug/info levels, tune Filebeat's `spool_size` and `bulk_max_size`, and configure Logstash pipeline workers to match cluster capacity.
- **Synchronous log shipping.** Blocking request threads on log output adds latency and creates backpressure during cluster outages. Best practice: use async logging libraries (Pino, Winston, Logback), configure Filebeat's `queue.mem.events` and `bulk_max_size`, and enable dead letter queues in Logstash for failed events.
- **Missing index templates and field mappings.** Elasticsearch auto-mapping creates dynamic fields with unpredictable types (e.g., IP addresses mapped as text, numbers as strings), which breaks aggregations and range queries. Best practice: define explicit index templates with `keyword`, `date`, `integer`, and `boolean` mappings before ingestion.
- **Neglecting security and access control.** Logs contain PII, tokens, and internal architecture details; exposing raw indices to all teams violates compliance and increases breach surface. Best practice: enable Elasticsearch security, configure role-based access in Kibana, mask sensitive fields in Logstash (`mutate` + `gsub`), and audit index access via audit logging.
- **Skipping log rotation and retention alignment.** Filebeat reads from files that rotate via `logrotate`, which can cause duplicate ingestion or missed lines. Best practice: configure `close_inactive` and `clean_removed` in Filebeat, align rotation schedules with collection windows, and verify inode handling on containerized workloads.
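The application-level sampling mentioned above can be sketched without any library dependency. This is a minimal illustration, not a production implementation: `shouldSample` and `sampledDebug` are hypothetical helpers, and the default 10% rate is arbitrary. Hashing the request ID makes sampling deterministic, so a sampled request keeps its complete debug trail.

```typescript
// Simple 32-bit string hash (deterministic across processes).
function hashCode(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (Math.imul(31, h) + s.charCodeAt(i)) | 0;
  }
  return Math.abs(h);
}

// Keep roughly `rate` of request IDs (e.g. 0.1 ≈ 10%).
function shouldSample(requestId: string, rate: number): boolean {
  return hashCode(requestId) % 1000 < rate * 1000;
}

// Hypothetical wrapper around a structured logger's debug method:
// emits the entry only for sampled requests, tagging it as sampled.
function sampledDebug(
  log: (fields: object, msg: string) => void,
  requestId: string,
  fields: object,
  msg: string,
  rate = 0.1,
): void {
  if (shouldSample(requestId, rate)) {
    log({ requestId, sampled: true, ...fields }, msg);
  }
}
```

Because the decision is per-request rather than per-line, dashboards still see coherent traces for the requests that are kept.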
Production Bundle
Action Checklist
- Standardize JSON log format across all services using a shared logging library
- Deploy Filebeat as a DaemonSet (Kubernetes) or systemd service (VMs) with Docker metadata enrichment
- Configure the Logstash pipeline with explicit field mapping, debug filtering, and a dead letter queue
- Create an Elasticsearch index template enforcing keyword/date/integer types and disabling dynamic mapping
- Implement an ILM policy with hot/warm/cold/delete phases aligned to compliance requirements
- Mask PII and secrets in Logstash using `mutate` filters or application-level redaction
- Set up Kibana index patterns, role-based dashboards, and alerting rules for error rate/latency thresholds
- Validate log delivery with Filebeat ACKs, the Logstash monitoring API, and Elasticsearch cluster health checks
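The index template item in the checklist above might look like the following sketch. Field names under `app` mirror the earlier pipeline and are assumptions; `"dynamic": "strict"` rejects unmapped fields, which is what "disabling dynamic mapping" implies here:

```json
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-retention-policy",
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "@timestamp": { "type": "date" },
        "app": {
          "properties": {
            "service": { "type": "keyword" },
            "environment": { "type": "keyword" },
            "level": { "type": "keyword" },
            "message": { "type": "text" },
            "userId": { "type": "keyword" }
          }
        }
      }
    }
  }
}
```

Applying the template before the first document arrives guarantees every daily index inherits both the mappings and the ILM policy.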
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup/MVP (<5 services) | Direct Filebeat → Elasticsearch | Simplifies the pipeline, reduces operational overhead, sufficient for low volume | Low (single cluster, minimal Logstash nodes) |
| Microservices/Cloud (5-50 services) | Filebeat → Logstash → Elasticsearch | Enables schema validation, cross-service enrichment, and centralized filtering | Medium (Logstash cluster, ILM storage optimization) |
| High-Volume/Enterprise (>50 services, compliance) | Filebeat → Kafka buffer → Logstash → Elasticsearch/OpenSearch | Kafka decouples ingestion from processing, ensures zero data loss during spikes, meets audit requirements | High (Kafka cluster, multi-tier storage, dedicated security layer) |
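For the Kafka-buffered tier, Filebeat can ship directly to Kafka via its native `output.kafka`. The broker addresses and topic naming scheme below are placeholders; Logstash then consumes from the same topics with its `kafka` input plugin:

```yaml
output.kafka:
  hosts: ["kafka-1:9092", "kafka-2:9092", "kafka-3:9092"]
  topic: "logs-%{[service]}"
  partition.round_robin:
    reachable_only: true
  required_acks: 1
  compression: gzip
  max_message_bytes: 1000000
```

Per-service topics keep noisy producers from starving quiet ones and let retention be tuned per stream.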
Configuration Template
docker-compose.yml (local development stack):
```yaml
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data
  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
    ports:
      - "5044:5044"
    depends_on:
      - elasticsearch
  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
  filebeat:
    image: docker.elastic.co/beats/filebeat:8.12.0
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/log:/var/log:ro
    user: root
    depends_on:
      - logstash
volumes:
  es_data:
```
logstash/config/logstash.yml:
```yaml
http.host: "0.0.0.0"
xpack.monitoring.enabled: false
pipeline.workers: 2
pipeline.batch.size: 125
pipeline.batch.delay: 50
```
Quick Start Guide
1. Create project directories: `mkdir -p logstash/pipeline filebeat` and place the `logstash.yml`, pipeline config, and `filebeat.yml` from the templates above.
2. Start the stack: `docker compose up -d`. Elasticsearch initializes first (~30s), followed by Logstash and Kibana.
3. Generate test logs: `echo '{"timestamp":"2024-01-15T10:00:00Z","level":"info","service":"auth","message":"User login successful"}' >> /var/log/app/test.json`
4. Open Kibana at `http://localhost:5601`, navigate to Stack Management → Index Patterns, create `logs-*`, and verify documents appear in Discover within 10 seconds.
5. Configure ILM and the index template via Kibana Dev Tools or `curl` before routing production traffic.
Log aggregation is not a set-and-forget utility. It requires disciplined instrumentation, explicit schema contracts, and lifecycle management. When implemented correctly, the ELK stack transforms raw output into deterministic observability, reducing incident resolution time, eliminating manual log hunting, and providing the telemetry foundation required for reliable distributed systems.