Difficulty: Intermediate · Read time: 8 min

Log aggregation with ELK stack

By Codcompass Team · 8 min read

Current Situation Analysis

Log aggregation is not a luxury; it is the foundational layer of operational visibility. In modern distributed architectures, applications emit logs across containers, serverless functions, edge nodes, and legacy VMs. Without centralized aggregation, engineering teams operate with fragmented telemetry, forcing them to SSH into individual hosts, parse multiline stack traces manually, and reconstruct event timelines from isolated files. This fragmentation directly inflates Mean Time to Resolution (MTTR) and creates blind spots that mask cascading failures until they impact end users.

The problem is routinely overlooked because logging is treated as a development artifact rather than an observability primitive. Teams ship console.log or print statements during development, assume stdout capture solves the problem in production, and defer aggregation until incidents force reactive triage. Cloud providers advertise built-in logging, but native solutions rarely unify cross-service correlation, lack advanced filtering, or become cost-prohibitive at scale.

Industry data consistently validates the operational cost of unaggregated logs:

  • DORA research shows that high-performing teams recover from incidents orders of magnitude faster than low performers, a gap largely attributed to centralized telemetry and automated log correlation.
  • PagerDuty's State of On-Call reports indicate that engineers spend 30-40% of incident response time manually locating and parsing logs across disparate systems.
  • Log volume grows 40-50% year-over-year in microservices environments, yet 68% of organizations lack automated retention and indexing policies, leading to storage bloat and degraded query performance.

When logs remain siloed, debugging shifts from deterministic analysis to forensic guesswork. Centralized aggregation transforms logs from noise into structured, queryable signals.

WOW Moment: Key Findings

Centralized log aggregation fundamentally alters how teams interact with operational data. The shift from file-based retrieval to indexed, correlated log streams produces measurable improvements across resolution speed, query complexity, storage efficiency, and horizontal scalability.

| Approach | MTTR (Avg Incident) | Query Complexity | Storage Efficiency | Scalability Model |
| --- | --- | --- | --- | --- |
| File-based/stdout logging | 45-90 minutes | grep/awk + manual correlation | Linear growth, no compression | Vertical only |
| Centralized ELK aggregation | 8-15 minutes | DSL/KQL + cross-service correlation | 60-75% reduction via compression & ILM | Horizontal, sharded |

This finding matters because it decouples operational visibility from infrastructure topology. ELK aggregation enables time-series correlation, field-level filtering, and automated alerting without requiring direct node access. The storage efficiency gain stems from Elasticsearch's Lucene-based compression, index lifecycle management, and the elimination of duplicate log shipping. Horizontal scalability is achieved through shard distribution and replica routing, allowing query latency to remain stable as log volume scales.

Core Solution

Implementing log aggregation with the ELK stack requires four coordinated layers: instrumentation, collection, processing, and storage/visualization. The modern reference architecture uses Beats for lightweight collection, Logstash for transformation, Elasticsearch for indexing, and Kibana for exploration.

Step 1: Instrument Applications with Structured Logging

Plain text logs are unqueryable. All services must emit JSON-formatted logs with consistent field naming. This enables Elasticsearch to map fields automatically and Kibana to filter without regex parsing.

// logger.ts
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: {
    service: process.env.SERVICE_NAME || 'unknown',
    environment: process.env.NODE_ENV || 'development',
    version: process.env.APP_VERSION || '0.0.0',
  },
  formatters: {
    level: (label) => ({ level: label }),
    bindings: (bindings) => ({
      pid: bindings.pid,
      hostname: bindings.hostname,
    }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

export default logger;

Usage in request handlers:

import express from 'express';
import logger from './logger';

const app = express();

app.get('/api/users/:id', async (req, res) => {
  const userId = req.params.id;
  logger.info({ userId, action: 'fetch_user' }, 'User lookup initiated');
  // business logic (for example, a cache lookup, then a database read on miss)
  logger.debug({ userId, cacheHit: true }, 'User retrieved from cache');
  res.json({ userId });
});
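
With this configuration, each logger call emits a single NDJSON line to stdout (or a file), roughly like the following; the field values are illustrative:

{"level":"info","time":"2024-01-15T10:00:00.000Z","pid":42,"hostname":"api-1","service":"auth","environment":"production","version":"1.4.2","userId":"9f2c","action":"fetch_user","msg":"User lookup initiated"}

Note that pino writes the log text under msg by default. Either set messageKey: 'message' in the pino options or point the collector's json.message_key at msg, so the application output and the shipping configuration below stay aligned.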

Step 2: Deploy Filebeat for Log Collection

Filebeat reads log files or Docker container stdout, attaches metadata, and ships events to Logstash or directly to Elasticsearch. For production, route through Logstash to enforce schema validation and enrichment.

Filebeat configuration (filebeat.yml):

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/app/*.json
  json.keys_under_root: true
  json.add_error_key: true
  json.message_key: message

processors:
  - add_host_metadata: ~
  - add_docker_metadata: ~
  - add_cloud_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]
  loadbalance: true
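
For services that log only to container stdout rather than to files, the same pipeline applies with Filebeat's container input. A minimal sketch, assuming the default Docker log path (adjust for your runtime) and JSON-formatted application output:

filebeat.inputs:
- type: container
  paths:
    - /var/lib/docker/containers/*/*.log
  stream: all

processors:
  - add_docker_metadata: ~
  - decode_json_fields:
      fields: ["message"]
      target: ""
      overwrite_keys: true

The output section remains the same output.logstash block shown above.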

Step 3: Build Logstash Processing Pipeline

Logstash ingests Beats events, applies filters, and writes to Elasticsearch. The pipeline should normalize timestamps, parse stack traces, drop debug noise in production, and enrich with geographic or service metadata.

logstash/pipelines/main.conf:

input {
  beats {
    port => 5044
    ssl => false
  }
}

filter {
  if [level] == "debug" and [environment] == "production" {
    drop { }
  }

  json {
    source => "message"
    target => "parsed"
    skip_on_invalid_json => true
  }

  if [parsed][error] {
    grok {
      match => { "[parsed][error][stack_trace]" => "%{GREEDYDATA:stack_trace}" }
    }
  }

  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
  }

  mutate {
    rename => { "parsed" => "app" }
    remove_field => [ "host", "agent", "ecs" ]
  }
}

output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
    index => "logs-%{[app][service]}-%{+YYYY.MM.dd}"
    user => "${ES_USER}"
    password => "${ES_PASSWORD}"
    ssl_certificate_authorities => ["/usr/share/logstash/config/certs/http_ca.crt"]
  }
}
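
Because logs routinely carry PII and secrets (see the pitfall guide below), a masking filter is worth adding before events reach the index. A minimal sketch that can live in the same pipeline file, since Logstash applies filter blocks in order; the [app][token] field and the email pattern are illustrative assumptions about the payload:

filter {
  mutate {
    # Drop a hypothetical credential field outright
    remove_field => [ "[app][token]" ]
    # Redact anything resembling an email address in the message text
    gsub => [ "message", "[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED_EMAIL]" ]
  }
}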


Step 4: Configure Index Lifecycle Management (ILM)

Elasticsearch indices must be managed through ILM to prevent storage exhaustion and maintain query performance. Define phases: hot (ingest & search), warm (read-heavy), cold (archive), delete.

PUT _ilm/policy/logs-retention-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "set_priority": { "priority": 0 },
          "searchable_snapshot": { "snapshot_repository": "s3-repo" }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
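
After the first indices managed by this policy exist, the policy definition and per-index lifecycle progress can be checked from Kibana Dev Tools; the index pattern here assumes the Logstash output naming above:

GET _ilm/policy/logs-retention-policy
GET logs-*/_ilm/explain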

Step 5: Visualize & Alert in Kibana

Import index patterns, configure field types, and build dashboards using Kibana Lens or TSVB. Set up alerting rules on error rate thresholds, latency spikes, or specific exception patterns. Use Kibana's built-in anomaly detection for log rate baselines.
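
Once a logs-* index pattern exists, exploration happens in KQL. A few illustrative queries against the field names produced by the pipeline above (the app.* prefix comes from the mutate rename, so adjust to your own schema):

level : "error" and app.service : "auth"
app.service : "payments" and message : timeout*
level : ("error" or "warn") and not app.environment : "development"

The same queries can back threshold-based alerting rules, for example firing when matching error documents exceed a set count within a five-minute window.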

Architecture rationale:

  • Beats over Logstash agents: lower CPU/memory footprint, native Docker/container awareness, reliable delivery with ACKs.
  • Logstash as central processor: enables complex transformations without coupling business logic to infrastructure.
  • ILM-driven indexing: prevents shard bloat, reduces storage costs, maintains query latency under load.
  • Structured JSON logging: eliminates parsing overhead, enables exact field filtering, supports KQL syntax.

Pitfall Guide

  1. Shipping Unstructured Logs. Plain text logs force Grok parsing at ingestion, which is CPU-intensive and brittle; a single format change breaks pipelines. Best practice: enforce JSON emission at the application layer and validate the schema with JSON Schema or OpenTelemetry semantic conventions.

  2. Indexing High-Cardinality Fields. Indexing fields like user_id, session_id, or request_id without mapping constraints creates millions of unique terms, exhausting heap memory and degrading query performance. Best practice: set index: false or a keyword type with explicit mapping, or route such identifiers to separate trace/span stores.

  3. Ignoring Log Sampling & Rate Limiting. High-throughput services can generate 100k+ logs/second, and shipping every event overwhelms Logstash workers and spikes Elasticsearch cluster load. Best practice: implement application-level sampling for debug/info levels, tune Filebeat's internal queue (queue.mem) and bulk_max_size, and size Logstash pipeline workers to match cluster capacity.

  4. Synchronous Log Shipping. Blocking request threads on log output adds latency and creates backpressure during cluster outages. Best practice: use async logging libraries (Pino, Winston, Logback), configure Filebeat's queue.mem.events and bulk_max_size, and enable dead letter queues in Logstash for failed events.

  5. Missing Index Templates & Field Mapping. Elasticsearch auto-mapping creates dynamic fields with unpredictable types (e.g., IP addresses mapped as text, numbers as strings), which breaks aggregations and range queries. Best practice: define explicit index templates with keyword, date, integer, and boolean mappings before ingestion; a template sketch follows this list.

  6. Neglecting Security & Access Control. Logs contain PII, tokens, and internal architecture details, so exposing raw indices to all teams violates compliance and widens the breach surface. Best practice: enable Elasticsearch security, configure role-based access in Kibana, mask sensitive fields in Logstash (mutate + gsub), and audit index access via audit logging.

  7. Skipping Log Rotation & Retention Alignment. Filebeat reads from files that logrotate rotates underneath it, which can cause duplicate ingestion or missed lines. Best practice: configure close_inactive and clean_removed in Filebeat, align rotation schedules with collection windows, and verify inode handling on containerized workloads.
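
Following up on pitfalls 2 and 5, here is a sketch of an explicit index template that ties the logs-* pattern to the ILM policy from Step 4 and maps the fields emitted by the example logger. Names such as app.userId are illustrative; with classic rollover indices you would additionally configure index.lifecycle.rollover_alias, or use a data stream instead:

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-retention-policy",
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "mappings": {
      "dynamic": false,
      "properties": {
        "@timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "message": { "type": "text" },
        "app": {
          "properties": {
            "service": { "type": "keyword" },
            "environment": { "type": "keyword" },
            "version": { "type": "keyword" },
            "userId": { "type": "keyword", "index": false }
          }
        }
      }
    }
  }
}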

Production Bundle

Action Checklist

  • Standardize JSON log format across all services using a shared logging library
  • Deploy Filebeat as a DaemonSet (Kubernetes) or systemd service (VMs) with Docker metadata enrichment
  • Configure Logstash pipeline with explicit field mapping, debug filtering, and dead letter queue
  • Create Elasticsearch index template enforcing keyword/date/integer types and disabling dynamic mapping
  • Implement ILM policy with hot/warm/cold/delete phases aligned to compliance requirements
  • Mask PII and secrets in Logstash using mutate filters or application-level redaction
  • Set up Kibana index pattern, role-based dashboards, and alerting rules for error rate/latency thresholds
  • Validate log delivery with Filebeat ACKs, the Logstash monitoring API, and Elasticsearch cluster health checks (example commands below)
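
Example commands for that validation step, assuming default API ports on a local or port-forwarded cluster (adjust hosts, paths, and authentication for secured deployments):

# Elasticsearch: cluster health and the indices created by the pipeline
curl -s "http://localhost:9200/_cluster/health?pretty"
curl -s "http://localhost:9200/_cat/indices/logs-*?v"

# Logstash: per-pipeline event counts and queue pressure via the monitoring API
curl -s "http://localhost:9600/_node/stats/pipelines?pretty"

# Filebeat: confirm the configured output is reachable
filebeat test output -c /etc/filebeat/filebeat.yml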

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Startup/MVP (<5 services) | Direct Filebeat → Elasticsearch | Simplifies pipeline, reduces operational overhead, sufficient for low volume | Low (single cluster, minimal Logstash nodes) |
| Microservices/Cloud (5-50 services) | Filebeat → Logstash → Elasticsearch | Enables schema validation, cross-service enrichment, and centralized filtering | Medium (Logstash cluster, ILM storage optimization) |
| High-Volume/Enterprise (>50 services, compliance) | Filebeat → Logstash → Elasticsearch + OpenSearch/Kafka buffer | Kafka decouples ingestion from processing, ensures zero data loss during spikes, meets audit requirements | High (Kafka cluster, multi-tier storage, dedicated security layer) |

Configuration Template

docker-compose.yml (local development stack):

version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
    ports:
      - "5044:5044"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.12.0
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/log:/var/log:ro
    user: root
    depends_on:
      - logstash

volumes:
  es_data:

logstash/config/logstash.yml:

http.host: "0.0.0.0"
xpack.monitoring.enabled: false
pipeline.workers: 2
pipeline.batch.size: 125
pipeline.batch.delay: 50

Quick Start Guide

  1. Create project directories: mkdir -p logstash/pipeline filebeat and place the logstash.yml, pipeline config, and filebeat.yml from the templates above. For this security-disabled local stack, point the pipeline's elasticsearch output at http://elasticsearch:9200 and remove the user, password, and ssl_certificate_authorities settings.
  2. Start the stack: docker compose up -d. Elasticsearch initializes first (~30s), followed by Logstash and Kibana.
  3. Generate test logs: echo '{"timestamp":"2024-01-15T10:00:00Z","level":"info","service":"auth","message":"User login successful"}' >> /var/log/app/test.json
  4. Open Kibana at http://localhost:5601, navigate to Stack Management → Data Views (index patterns), create logs-*, and verify documents appear in Discover within 10 seconds.
  5. Configure ILM and index template via Kibana Dev Tools or curl before routing production traffic.
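
As a concrete example for the last step, both objects can be applied with curl against the security-disabled local stack; the file names are placeholders for the JSON bodies shown earlier:

curl -X PUT "http://localhost:9200/_ilm/policy/logs-retention-policy" \
  -H "Content-Type: application/json" -d @ilm-policy.json

curl -X PUT "http://localhost:9200/_index_template/logs-template" \
  -H "Content-Type: application/json" -d @logs-template.json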

Log aggregation is not a set-and-forget utility. It requires disciplined instrumentation, explicit schema contracts, and lifecycle management. When implemented correctly, the ELK stack transforms raw output into deterministic observability, reducing incident resolution time, eliminating manual log hunting, and providing the telemetry foundation required for reliable distributed systems.
