
Log Aggregation Architecture: A Production-Ready Guide

By Codcompass Team · 9 min read


Current Situation Analysis

Modern software delivery has fundamentally shifted the requirements for log aggregation. What was once a simple exercise in tailing text files and shipping them to a central syslog server has evolved into a high-velocity, multi-tenant data engineering problem. Today's environments are characterized by ephemeral infrastructure, distributed microservices, multi-cloud deployments, and event-driven architectures. Each component generates logs at varying frequencies, formats, and cardinalities, creating a data ingestion challenge that traditional pipelines cannot sustain.

The current landscape presents three critical friction points:

  1. Volume & Velocity: Container orchestration platforms like Kubernetes spin up and tear down thousands of pods daily. Each pod produces stdout/stderr streams, application logs, and sidecar metrics. Without intelligent buffering and rate control, ingestion pipelines choke, causing backpressure that cascades into application latency or data loss.
  2. Variety & Schema Drift: Logs arrive as raw text, JSON, protobuf, or structured traces. Without a unified schema or parsing strategy, query performance degrades exponentially. Field type mismatches, nested objects, and inconsistent timestamps break aggregation logic and inflate storage costs.
  3. Cost & Compliance: Storing every debug line indefinitely is financially unsustainable. Organizations face regulatory mandates (GDPR, HIPAA, SOC 2, PCI-DSS) that require data retention policies, PII redaction, and audit trails. Balancing observability needs with storage economics demands tiered lifecycle management and intelligent sampling.

Legacy architectures often rely on synchronous, monolithic collectors that couple ingestion with storage. This tight coupling eliminates fault tolerance, complicates scaling, and makes schema evolution painful. Modern log aggregation must decouple collection, transport, storage, and consumption. It must treat logs as a first-class data stream, applying patterns borrowed from event-driven architecture: idempotent delivery, partitioned storage, schema enforcement, and automated lifecycle transitions.

The architecture described in this guide addresses these realities by implementing a pipeline that prioritizes resilience, cost-awareness, and query performance. It is designed for cloud-native environments but remains applicable to hybrid and on-premises deployments.


WOW Moment Table

| Architectural Pattern | Traditional Approach | Modern Implementation | Impact / Metric |
|---|---|---|---|
| Collection | Agent-per-host, synchronous file tailing | Lightweight sidecar/daemon with async batching & backpressure control | 40-60% reduction in CPU/memory overhead; zero app-blocking |
| Transport | Direct push to storage (tight coupling) | Decoupled message broker with partitioning & consumer groups | 10x throughput scaling; graceful degradation during storage outages |
| Parsing & Enrichment | Post-storage regex extraction | Pre-ingestion structured parsing with schema registry | 70% faster query execution; consistent field types across tenants |
| Storage & Indexing | Flat indices, manual rotation | Index Lifecycle Management (ILM) with tiered hot/warm/cold | 50-80% storage cost reduction; automatic data aging |
| Query & Consumption | Single-engine search, monolithic dashboards | Multi-engine routing (search vs. analytics vs. ML) with federated access | Sub-second P95 queries; role-based data isolation |
| Resilience | Single point of failure in collector or broker | Multi-AZ replication, dead-letter queues, idempotent writes | 99.99% pipeline availability; zero data loss under node failure |

Core Solution with Code

Architecture Overview

The pipeline follows a four-stage data flow:

  1. Collection: Fluent Bit runs as a DaemonSet or sidecar, reading container stdout, application logs, and systemd/journald entries. It applies lightweight filtering, parses JSON, enriches with Kubernetes metadata, and batches records.
  2. Transport: Apache Kafka acts as a durable, partitioned buffer. Fluent Bit pushes to Kafka using async producers with configurable linger, batch size, and retry policies. Kafka decouples ingestion from storage, enabling independent scaling.
  3. Storage: OpenSearch (or Elasticsearch-compatible engines) consumes from Kafka via connectors. Index templates enforce schema, and lifecycle policies (Elasticsearch ILM or OpenSearch ISM) transition indices through hot → warm → cold → delete tiers based on age and size.
  4. Consumption: Grafana provides dashboards; OpenSearch Dashboards enables ad-hoc search; Alertmanager integrates with OpenSearch for threshold-based alerting. All queries are routed through a query cache and field data optimization layer.

Component Configurations

1. Fluent Bit (Collection)

[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    info
    Parsers_File parsers.conf

[INPUT]
    Name         tail
    Path         /var/log/containers/*.log
    Parser       json
    Tag          kube.*
    Refresh_Interval 5
    Mem_Buf_Limit 50MB
    Skip_Long_Lines On

[FILTER]
    Name         kubernetes
    Match        kube.*
    Kube_URL     https://kubernetes.default.svc:443
    Kube_Tag_Prefix kube.var.log.containers.
    Merge_Log    On
    Keep_Log     Off
    Buffer_Size  0

[OUTPUT]
    Name         kafka
    Match        *
    Brokers      kafka-1:9092,kafka-2:9092,kafka-3:9092
    Topics       logs-ingestion
    Timestamp_Key @timestamp
    Retry_Limit  5
    rdkafka.queue.buffering.max.messages 10000
    rdkafka.batch.size                   1048576
    rdkafka.linger.ms                    1000
    rdkafka.compression.codec            lz4

Key design choices: Mem_Buf_Limit prevents OOM during spikes. The Kafka output uses LZ4 compression (rdkafka.compression.codec) to reduce network payload, while rdkafka.linger.ms and rdkafka.batch.size tune the throughput vs. latency trade-off.
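
The Parser json directive relies on a json parser being defined in parsers.conf. Fluent Bit ships a default parsers file that includes one; if you maintain your own, a minimal sketch looks like this (Time_Key and Time_Format are assumptions about how the applications emit timestamps):

[PARSER]
    Name        json
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z
    Time_Keep   On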

2. Kafka (Transport)

# docker-compose snippet for Kafka broker
KAFKA_CFG_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
KAFKA_CFG_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
KAFKA_CFG_LOG_RETENTION_HOURS: 72
KAFKA_CFG_LOG_SEGMENT_BYTES: 1073741824
KAFKA_CFG_LOG_RETENTION_CHECK_INTERVAL_MS: 300000
KAFKA_CFG_NUM_PARTITIONS: 12
KAFKA_CFG_DEFAULT_REPLICATION_FACTOR: 3
KAFKA_CFG_MIN_IN_SYNC_REPLICAS: 2
KAFKA_CFG_LOG_CLEANER_ENABLE: "true"

Key design choices: 12 partitions enable parallel consumption. MIN_IN_SYNC_REPLICAS=2 ensures durability without sacrificing availability. 72-hour retention provides a safety net for storage pipeline failures.
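
Broker defaults only apply to auto-created topics, so it is worth creating logs-ingestion explicitly with the same guarantees. A sketch using the standard Kafka CLI (retention.ms of 259200000 ms matches the 72-hour window above):

kafka-topics.sh --bootstrap-server kafka-1:9092 --create \
  --topic logs-ingestion \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=259200000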

3. OpenSearch Index Template & ILM Policy

PUT _index_template/logs_v1
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-ilm-policy",
      "index.lifecycle.rollover_alias": "logs-current"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "host": { "type": "keyword" },
        "service": { "type": "keyword" },
        "level": { "type": "keyword" },
        "message": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
        "trace_id": { "type": "keyword" },
        "k8s": {
          "properties": {
            "namespace": { "type": "keyword" },
            "pod": { "type": "keyword" },
            "container": { "type": "keyword" }
          }
        }
      }
    }
  }
}

PUT _ilm/policy/logs-ilm-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "14d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "s3-cold" },
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}

Key design choices: rollover prevents oversized shards. forcemerge in the warm phase reduces segment count for faster queries. searchable_snapshot in the cold phase moves data to S3 while retaining query capability, cutting storage costs by roughly 70%. Note that the _ilm/policy endpoint and index.lifecycle.* settings shown here follow the Elasticsearch ILM API; on OpenSearch itself the equivalent mechanism is the ISM plugin (_plugins/_ism), which expresses the same hot/warm/cold/delete progression as states and transitions.
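
The cold phase also assumes a snapshot repository named s3-cold already exists. A minimal registration sketch, assuming the S3 repository plugin is installed; the bucket name and base path are placeholders:

PUT _snapshot/s3-cold
{
  "type": "s3",
  "settings": {
    "bucket": "example-logs-cold-tier",
    "base_path": "searchable-snapshots/logs"
  }
}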

4. Data Flow Validation

# Generate a test log event (use the stdin input so the piped record is actually ingested;
# the tail-based config above only reads files under /var/log/containers/)
echo '{"@timestamp":"2024-06-15T10:00:00Z","host":"node-1","service":"auth","level":"info","message":"User login successful","trace_id":"abc123"}' | fluent-bit -i stdin -o kafka -p brokers=kafka-1:9092 -p topics=logs-ingestion

# Verify Kafka topic
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic logs-ingestion --from-beginning --max-messages 1

# Query OpenSearch
curl -X GET "localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "service": "auth" } },
  "sort": [{ "@timestamp": "desc" }],
  "size": 5
}'

Pitfall Guide

1. Blind Ingestion Without Sampling

Problem: Shipping every debug line from high-throughput services inflates storage costs and degrades query performance. Mitigation: Implement dynamic sampling at the collector level. Drop DEBUG logs in production, sample high-cardinality events, and use grep/lua filters in Fluent Bit to drop known noise patterns before they enter Kafka.
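
As a concrete example, a Fluent Bit grep filter can drop DEBUG records at the edge before they reach Kafka (this assumes the parsed records carry a level field, as in the mapping above):

[FILTER]
    Name     grep
    Match    kube.*
    Exclude  level ^(debug|DEBUG)$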

2. Ignoring Backpressure & Buffer Overflow

Problem: When storage lags, collectors fill memory buffers and crash, causing data loss or application throttling. Mitigation: Configure Mem_Buf_Limit and Storage.Type filesystem in Fluent Bit. Set Kafka queue.buffering.max.messages and enable retry.backoff.ms. Monitor output.dropped and input.bytes metrics to trigger auto-scaling or alerting.
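
A sketch of the filesystem buffering described above, layered onto the Fluent Bit config from earlier (the storage path is an assumption and must point at persistent, writable node storage):

[SERVICE]
    storage.path              /var/lib/fluent-bit/buffer/
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 50MB

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    Mem_Buf_Limit 50MB
    storage.type  filesystem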

3. Poor Shard & Partition Strategy

Problem: Too many small shards waste cluster resources; too few large shards cause hotspots and slow queries. Mitigation: Align shard count with data volume and query concurrency. Use 1-3 primary shards per 50GB of data. Leverage ILM shrink action to reduce shard count in warm phase. Avoid time-based indices without rollover; use alias-based rollover instead.
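
A quick way to audit shard sizing against that guideline (the _cat API works on both OpenSearch and Elasticsearch):

GET _cat/shards/logs-*?v&h=index,shard,prirep,store,docs&s=store:desc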

4. PII Leakage & Compliance Gaps

Problem: Logs accidentally contain emails, tokens, or personal data, violating GDPR/HIPAA and exposing legal risk. Mitigation: Deploy a pre-ingestion PII redaction filter using regex or ML-based detection. Mask sensitive fields in Fluent Bit or Kafka Streams. Maintain an audit log of data access and enforce field-level security in OpenSearch via index-level permissions.
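
A minimal sketch of collector-side masking using Fluent Bit's lua filter; the script name, function name, and email pattern are illustrative only, and real deployments need a broader rule set (tokens, card numbers, national IDs):

[FILTER]
    Name    lua
    Match   kube.*
    Script  redact.lua
    Call    redact_pii

-- redact.lua: mask anything that looks like an email address in the message field
function redact_pii(tag, timestamp, record)
    if record["message"] ~= nil then
        record["message"] = string.gsub(record["message"],
            "[%w%.%-_]+@[%w%.%-]+%.%w+", "[REDACTED_EMAIL]")
        return 1, timestamp, record  -- 1 = keep record with modifications
    end
    return 0, timestamp, record      -- 0 = keep record unmodified
end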

5. Single Point of Failure in Transport

Problem: A lone Kafka broker or a single centralized Fluent Bit aggregator becomes a bottleneck; if it fails, the entire pipeline halts. Mitigation: Deploy Kafka as a 3-node cluster with replication.factor=3. Use Fluent Bit in DaemonSet mode (not a centralized aggregator) to distribute collection. Route records that repeatedly fail processing to a dead-letter topic (DLQ) in Kafka, and monitor consumer lag with Prometheus + Alertmanager, as in the rule sketch below.
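
A sketch of that consumer-lag alert as a Prometheus rule; the metric name follows kafka_exporter conventions and the consumer group name is a placeholder, so adjust both to your exporter and sink connector:

groups:
  - name: logging-pipeline
    rules:
      - alert: KafkaConsumerLagHigh
        # Sink consumers falling behind means OpenSearch ingestion is lagging.
        expr: sum(kafka_consumergroup_lag{consumergroup="opensearch-sink"}) > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Log pipeline consumer lag above 1000 for 10 minutes"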

6. Neglecting Schema Evolution

Problem: Application updates change log structure, causing index mapping conflicts or query failures. Mitigation: Use dynamic: false in OpenSearch mappings to prevent automatic field creation. Maintain a schema registry or versioned index templates. Use ignore_malformed and coerce settings where appropriate. Version indices (logs-v1, logs-v2) and route traffic via aliases during migration.
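
A sketch of the alias-based cutover described above, assuming a logs_v2 template has been applied and a first logs-v2-000001 index bootstrapped (index names are illustrative):

POST _aliases
{
  "actions": [
    { "remove": { "index": "logs-v1-*",      "alias": "logs-current" } },
    { "add":    { "index": "logs-v2-000001", "alias": "logs-current", "is_write_index": true } }
  ]
}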


Production Bundle

Checklist

| Phase | Action Item | Verification |
|---|---|---|
| Pre-Deployment | Size CPU/memory for Fluent Bit based on log volume | kubectl top pods -n logging |
| | Configure network policies for Kafka/OpenSearch | kubectl get netpol |
| | Set up TLS/mTLS for all pipeline components | openssl s_client -connect kafka:9092 |
| | Define retention & compliance boundaries | Documented ILM policy + PII redaction rules |
| Runtime | Monitor consumer lag & output drops | Prometheus: kafka_consumer_group_lag |
| | Validate index rollover & ILM transitions | GET _ilm/explain/logs-* |
| | Test failover by killing a Kafka broker | Pipeline continues; no data loss |
| | Audit query performance & cache hit ratio | GET _nodes/stats/indices/query_cache |
| Post-Deployment | Run load test with synthetic log generator | Verify P95 latency < 2s |
| | Document runbook for pipeline recovery | Stored in wiki/Confluence |
| | Schedule quarterly cost & storage review | Compare cold tier savings vs. baseline |

Decision Matrix

| Criterion | Fluent Bit + Kafka + OpenSearch | Fluentd + Kafka + ELK | Vector + Pulsar + Loki | Splunk Cloud |
|---|---|---|---|---|
| Scale | ★★★★★ (partitioned, async) | ★★★★☆ (Ruby-based, higher overhead) | ★★★★★ (Rust, low footprint) | ★★★☆☆ (vendor limits) |
| Cost | ★★★★☆ (open source, self-hosted) | ★★★☆☆ (higher resource usage) | ★★★★★ (efficient, label-based) | ★☆☆☆☆ (license-heavy) |
| Complexity | ★★★☆☆ (requires Kafka/OS ops) | ★★★☆☆ (similar complexity) | ★★★★☆ (simpler storage model) | ★★★★★ (managed) |
| Query Flexibility | ★★★★★ (full-text, aggregations) | ★★★★★ (identical to OS) | ★★★☆☆ (LogQL, limited joins) | ★★★★☆ (SPL, mature) |
| Cloud-Native Fit | ★★★★★ (K8s, eBPF, sidecar ready) | ★★★★☆ (DaemonSet supported) | ★★★★★ (eBPF, vector-native) | ★★☆☆☆ (agent-heavy) |
| Best For | Enterprise, compliance, high throughput | Legacy ELK migrations | Cost-sensitive, Kubernetes-native | Non-technical teams, budget |

Config Template (Consolidated)

# fluent-bit-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info
        Parsers_File parsers.conf

    [INPUT]
        Name         tail
        Path         /var/log/containers/*.log
        Parser       json
        Tag          kube.*
        Refresh_Interval 5
        Mem_Buf_Limit 50MB
        Skip_Long_Lines On

    [FILTER]
        Name         kubernetes
        Match        kube.*
        Kube_URL     https://kubernetes.default.svc:443
        Merge_Log    On
        Keep_Log     Off

    [OUTPUT]
        Name         kafka
        Match        *
        Brokers      kafka.logging.svc:9092
        Topics       logs-ingestion
        Timestamp_Key @timestamp
        Retry_Limit  5
        rdkafka.queue.buffering.max.messages 10000
        rdkafka.batch.size                   1048576
        rdkafka.linger.ms                    1000
        rdkafka.compression.codec            lz4

Quick Start

  1. Deploy Kafka: Use the Strimzi Operator or a Helm chart (a minimal Helm sketch for steps 1-3 follows this list). Create a 3-broker cluster with 12 partitions and replication.factor=3. Verify connectivity: kafka-broker-api-versions.sh --bootstrap-server localhost:9092.
  2. Deploy OpenSearch: Use the official Helm chart. Apply the index template and ILM policy via curl or API. Ensure JVM heap is set to 50% of node memory (max 31GB).
  3. Install Fluent Bit: Deploy as a DaemonSet in logging namespace. Mount /var/log/containers and /var/lib/kubelet/pods as read-only. Apply the ConfigMap above.
  4. Generate Test Traffic: Run kubectl run log-generator --image=busybox --command -- /bin/sh -c 'while true; do echo "{\"@timestamp\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"service\":\"test\",\"level\":\"info\",\"message\":\"heartbeat\"}"; sleep 1; done' so that the timestamp is evaluated inside the container on every iteration rather than once by your local shell.
  5. Validate Pipeline:
    • Check Kafka: kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic logs-ingestion --from-beginning --max-messages 1
    • Query OpenSearch: GET logs-*/_search { "query": { "match": { "service": "test" } } }
    • Monitor: kubectl top pods -n logging, kubectl logs -l app=fluent-bit -n logging
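
A minimal sketch of the Helm-based installs referenced in steps 1-3; chart and repository names are the upstream defaults at the time of writing, so verify them against each project's documentation and supply production values files:

# Step 1: Strimzi operator for Kafka
helm repo add strimzi https://strimzi.io/charts/
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator -n kafka --create-namespace

# Step 2: OpenSearch cluster
helm repo add opensearch https://opensearch-project.github.io/helm-charts/
helm install opensearch opensearch/opensearch -n logging --create-namespace

# Step 3: Fluent Bit DaemonSet
helm repo add fluent https://fluent.github.io/helm-charts
helm install fluent-bit fluent/fluent-bit -n logging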

Once validated, enable ILM transitions, configure Grafana dashboards, and set up Alertmanager rules for consumer lag > 1000 or output drop rate > 0.1%. The pipeline is now production-ready.


Log aggregation is no longer a utility task; it is a foundational data engineering discipline. By decoupling collection, transport, storage, and consumption, enforcing schema discipline, and automating lifecycle management, organizations transform logs from a cost center into a strategic observability asset. The architecture outlined here balances performance, resilience, and cost, providing a repeatable blueprint for modern infrastructure at scale.
