: Classification, Filtering/Sampling, Tiers, and Lifecycle Policies.
Step 1: Log Classification and Schema Enforcement
Define a strict schema for logs. Unstructured logs prevent efficient filtering and increase storage costs. Use structured logging (JSON) with standardized attributes.
// TypeScript: Strict Log Schema Definition
import { z } from 'zod';

export const LogLevel = z.enum(['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']);
export const LogCategory = z.enum(['APP', 'AUDIT', 'ACCESS', 'SYSTEM']);

export interface LogRetentionPolicy {
  tier: 'HOT' | 'WARM' | 'COLD' | 'ARCHIVE';
  retentionDays: number;
  samplingRate?: number; // 0.0 to 1.0
  redactionRules?: string[];
  complianceHold?: boolean;
}

export const LogPayload = z.object({
  timestamp: z.string().datetime(),
  level: LogLevel,
  category: LogCategory,
  service: z.string(),
  traceId: z.string().optional(),
  message: z.string(),
  attributes: z.record(z.unknown()).optional(),
});
Step 2: Source-Side Filtering and Sampling
Implement filtering at the agent or sidecar level (e.g., Fluent Bit, Vector, OTEL Collector). Do not ship data that will be dropped downstream.
Dynamic Sampling Strategy:
- Error-Driven Sampling: Always ship ERROR and FATAL logs.
- Trace-Linked Sampling: If a log contains a traceId associated with a sampled trace, retain the log to ensure correlation.
- Volume-Based Sampling: For INFO/DEBUG, apply a sampling rate (e.g., 10%) that adjusts based on current ingest throughput.
// TypeScript: Sampling Decision Engine
// Assumes `z` and `LogPayload` from the schema definition above are in scope.
class LogSamplingEngine {
  private currentThroughput: number = 0; // same unit as the 50000 threshold below
  private baseSampleRate: number = 0.1; // 10% for non-errors

  // Feed in the latest ingest throughput so the adaptive rate can react to it
  updateThroughput(throughput: number): void {
    this.currentThroughput = throughput;
  }

  shouldRetain(log: z.infer<typeof LogPayload>): boolean {
    // 1. Critical logs always retained
    if (log.level === 'ERROR' || log.level === 'FATAL') return true;
    // 2. Audit logs always retained (Compliance)
    if (log.category === 'AUDIT') return true;
    // 3. Adaptive sampling for everything else
    const adaptiveRate = this.calculateAdaptiveRate();
    // Hash-based sampling: logs sharing a traceId get the same decision
    const hash = this.hashString(log.traceId || log.message);
    const normalizedHash = hash % 100;
    return normalizedHash < (adaptiveRate * 100);
  }

  private calculateAdaptiveRate(): number {
    // Halve the sampling rate if throughput exceeds the threshold
    if (this.currentThroughput > 50000) {
      return this.baseSampleRate * 0.5;
    }
    return this.baseSampleRate;
  }

  // 32-bit string hash (multiply by 31, add char code)
  private hashString(str: string): number {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      const char = str.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash |= 0; // truncate to a 32-bit integer
    }
    return Math.abs(hash);
  }
}
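The engine above hashes on traceId but does not check whether the trace itself was sampled. The trace-linked rule from the list could be sketched roughly as follows, assuming the tracing layer can report which trace IDs it sampled. `SampledTraceRegistry`, `MinimalLog`, and `retainForTrace` are hypothetical names introduced for illustration, and the capacity default is arbitrary.

```typescript
// Hypothetical helper: remembers recently sampled trace IDs, bounded in size.
class SampledTraceRegistry {
  private readonly ids = new Set<string>();
  private readonly maxEntries: number;

  constructor(maxEntries: number = 10000) {
    this.maxEntries = maxEntries;
  }

  // Call this when the tracing layer decides to sample a trace.
  markSampled(traceId: string): void {
    if (this.ids.has(traceId)) return;
    if (this.ids.size >= this.maxEntries) {
      // Evict the oldest entry; a Set iterates in insertion order.
      const oldest = this.ids.values().next().value;
      if (oldest !== undefined) this.ids.delete(oldest);
    }
    this.ids.add(traceId);
  }

  isSampled(traceId: string | undefined): boolean {
    return traceId !== undefined && this.ids.has(traceId);
  }
}

// Minimal log shape for this sketch (a subset of the schema fields).
interface MinimalLog {
  level: string;
  traceId?: string;
}

// True when the log belongs to a sampled trace and should be kept for correlation.
function retainForTrace(log: MinimalLog, registry: SampledTraceRegistry): boolean {
  return registry.isSampled(log.traceId);
}
```

A check like this would sit before the hash-based step, so that a log tied to a sampled trace is retained regardless of the adaptive rate.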
Step 3: Tiered Storage Architecture
Map retention policies to storage tiers based on access patterns and cost.
- Hot Tier: High-performance search (Elasticsearch/OpenSearch/ClickHouse). Retention: 3–7 days. Use for active debugging.
- Warm Tier: Lower cost, slightly higher latency (S3 with query acceleration or warm shards). Retention: 30–90 days. Use for trend analysis and post-incident review.
- Cold/Archive Tier: Object storage (S3 Glacier Deep Archive). Retention: 1–7 years. Use for compliance and forensic audits. Format: Parquet/ORC for cost efficiency.
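As a rough illustration of the mapping, the sketch below assigns a tier from a log's age, using the upper bound of each range listed above. `StorageTier`, `TIER_MAX_AGE_DAYS`, and `tierForAge` are hypothetical names, and treating the upper bounds as hard cutoffs is an assumption; real transitions would normally be handled by the storage system's lifecycle rules.

```typescript
type StorageTier = 'HOT' | 'WARM' | 'COLD';

// Upper age bound in days for each tier, taken from the ranges in the list above.
const TIER_MAX_AGE_DAYS: ReadonlyArray<[StorageTier, number]> = [
  ['HOT', 7],
  ['WARM', 90],
  ['COLD', 7 * 365],
];

// Returns the tier for a log of the given age, or null once it is past all retention windows.
function tierForAge(ageDays: number): StorageTier | null {
  if (ageDays < 0) throw new RangeError('ageDays must be non-negative');
  for (const [tier, maxAge] of TIER_MAX_AGE_DAYS) {
    if (ageDays <= maxAge) return tier;
  }
  return null; // eligible for deletion
}
```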
Step 4: PII Redaction and Compliance
PII must be redacted before data leaves the application boundary. Use regex or NLP-based redaction rules defined in configuration.
// TypeScript: Redaction Middleware
const PII_PATTERNS = [
  { regex: /\b\d{3}-\d{2}-\d{4}\b/g, mask: '***-**-****', label: 'SSN' },
  // TLD part is two or more letters
  { regex: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g, mask: '***@***.***', label: 'EMAIL' },
  // Matches only 16 contiguous digits, not grouped card numbers
  { regex: /\b\d{16}\b/g, mask: '****-****-****-****', label: 'CC_NUMBER' },
];

function redactLog(log: string): string {
  // Apply each pattern in turn to the accumulated string
  return PII_PATTERNS.reduce((acc, pattern) => {
    return acc.replace(pattern.regex, pattern.mask);
  }, log);
}
Step 5: Lifecycle Policy Implementation
Automate transitions using infrastructure-as-code. Ensure WORM (Write Once, Read Many) policies for archive tiers to prevent tampering.
# Example Vector Configuration for Lifecycle Routing
sources:
  app_logs:
    type: "kubernetes_logs"
transforms:
  filter_debug:
    type: "filter"
    inputs: ["app_logs"]
    condition: '.level != "DEBUG"'
  redact_pii:
    type: "remap"
    inputs: ["filter_debug"]
    source: |
      # Mask SSNs, then email addresses
      .message = replace(.message, r'\b\d{3}-\d{2}-\d{4}\b', "***-**-****")
      .message = replace(.message, r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', "***@***.***")
sinks:
  hot_storage:
    type: "elasticsearch"
    inputs: ["redact_pii"]
    endpoint: "https://es-hot.cluster.internal"
    healthcheck:
      enabled: true
  cold_archive:
    type: "aws_s3"
    inputs: ["redact_pii"]
    bucket: "compliance-logs-archive"
    encoding:
      # Check that your Vector version supports this codec for the aws_s3 sink
      codec: "parquet"
    storage_class: "GLACIER"
    buffer:
      type: "memory"
    batch:
      # Flush at roughly 50 MB or after 5 minutes, whichever comes first
      max_bytes: 52428800
      timeout_secs: 300
Pitfall Guide
1. Retaining Debug Logs in Production
Mistake: Shipping DEBUG level logs to production aggregators.
Impact: Debug logs can increase volume by 500% without adding diagnostic value for production incidents. This inflates costs and slows queries.
Fix: Configure agents to drop DEBUG logs at the source unless a specific debug flag is enabled via dynamic configuration.
2. Ignoring Indexing Costs
Mistake: Focusing only on storage costs and ignoring index overhead.
Impact: In Elasticsearch, every field indexed consumes heap memory and disk space. Indexing high-cardinality fields (like user_id or request_id in logs) can crash clusters and cost more than storage.
Fix: Use keyword only for fields used in filters/aggregations. Use text with index: false for message bodies unless full-text search is required. Disable indexing on high-cardinality attributes.
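As a sketch of what this could look like, the object below is an Elasticsearch index mapping body written as a TypeScript constant. The field names follow the LogPayload schema from Step 1. `logIndexMapping` is a hypothetical name, and keeping traceId indexed (so logs can still be looked up by trace) is an assumption, not a recommendation from the text.

```typescript
// Sketch of an index mapping body; field names follow the LogPayload schema.
const logIndexMapping = {
  mappings: {
    dynamic: false, // do not index fields that are not listed here
    properties: {
      timestamp: { type: 'date' },
      level: { type: 'keyword' },    // used in filters and aggregations
      category: { type: 'keyword' },
      service: { type: 'keyword' },
      traceId: { type: 'keyword' },  // assumption: kept indexed for trace lookups
      message: { type: 'text', index: false }, // stored, but not searchable
      attributes: { type: 'object', enabled: false }, // kept in _source only
    },
  },
} as const;
```

With `index: false` the message is still returned in search hits but cannot be queried, so this only suits deployments that filter on the keyword fields.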
3. Flat Retention Policies
Mistake: Applying a single retention window (e.g., 30 days) to all log types.
Impact: Audit logs are lost after 30 days (compliance violation), while debug logs are kept for 30 days unnecessarily (cost waste).
Fix: Implement category-based retention. Audit logs: 7 years. Error logs: 90 days. Info/Debug: 3 days.
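A category-based rule set could be sketched as a small lookup, as below. `Level`, `Category`, and `retentionDaysFor` are hypothetical names. The values for audit, error, and info/debug logs come from the fix above; the WARN value is an assumption, since the text does not specify one.

```typescript
type Level = 'TRACE' | 'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'FATAL';
type Category = 'APP' | 'AUDIT' | 'ACCESS' | 'SYSTEM';

// Returns the retention window in days for a log of the given category and level.
function retentionDaysFor(category: Category, level: Level): number {
  // Audit logs take precedence over level: 7 years
  if (category === 'AUDIT') return 7 * 365;
  // Error logs: 90 days
  if (level === 'ERROR' || level === 'FATAL') return 90;
  // Assumption: WARN is not covered by the text; 30 days is a placeholder
  if (level === 'WARN') return 30;
  // INFO, DEBUG, TRACE: 3 days
  return 3;
}
```

Checking the category before the level means a DEBUG-level audit record still gets the long window.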
4. Sampling Away Error Logs
Mistake: Applying uniform sampling rates to all log levels.
Impact: Error logs are dropped along with routine ones, so critical context is missing during high-load events, exactly when incident response needs it.
Fix: Always retain errors at 100%. Apply sampling only to lower-severity logs. Use trace-correlated sampling to retain logs associated with sampled traces.
5. PII Leakage into Cold Storage
Mistake: Redacting PII only in the hot tier or UI, but storing raw logs in cold storage.
Impact: Cold storage often has looser access controls. Storing raw PII in S3 Glacier creates a massive compliance risk and violates GDPR data minimization.
Fix: Redaction must occur at the edge, before data is routed to any sink. Cold storage should only contain redacted, structured data.
6. Lack of Log Integrity Verification
Mistake: Assuming logs in cold storage are immutable and intact.
Impact: Without integrity checks, logs can be silently corrupted or tampered with, rendering them useless for forensics or legal discovery.
Fix: Enable WORM policies on archive buckets. Implement hash-chain verification or digital signatures for critical audit logs.
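Hash-chain verification can be sketched as follows, assuming a Node.js runtime for `node:crypto`. Each entry stores the SHA-256 of the previous entry's hash plus its own payload, so changing any entry breaks every hash after it. `ChainedEntry`, `GENESIS`, `appendEntry`, and `verifyChain` are hypothetical names; a production setup would also need to anchor the latest hash somewhere the writer cannot modify.

```typescript
import { createHash } from 'node:crypto';

interface ChainedEntry {
  payload: string;  // serialized audit log line
  prevHash: string; // hash of the previous entry (GENESIS for the first)
  hash: string;     // SHA-256 over prevHash, a newline, and payload
}

// Fixed starting value for the first entry in a chain.
const GENESIS = '0'.repeat(64);

function entryHash(prevHash: string, payload: string): string {
  return createHash('sha256').update(prevHash).update('\n').update(payload).digest('hex');
}

// Returns a new chain with the payload appended and linked to the previous entry.
function appendEntry(chain: ChainedEntry[], payload: string): ChainedEntry[] {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : GENESIS;
  return [...chain, { payload, prevHash, hash: entryHash(prevHash, payload) }];
}

// Returns the index of the first entry that fails verification, or -1 if the chain is intact.
function verifyChain(chain: ChainedEntry[]): number {
  let prevHash = GENESIS;
  for (let i = 0; i < chain.length; i++) {
    const entry = chain[i];
    const expected = entryHash(prevHash, entry.payload);
    if (entry.prevHash !== prevHash || entry.hash !== expected) return i;
    prevHash = entry.hash;
  }
  return -1;
}
```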
7. Treating Logs and Metrics as Identical
Mistake: Applying log retention strategies to metrics or vice versa.
Impact: Metrics require rollups and downsampling, not raw retention. Logs require raw text retention for debugging. Mixing strategies leads to data loss or cost inefficiency.
Fix: Separate pipelines. Metrics should use time-series databases with aggregation policies. Logs should use object storage or search engines with lifecycle policies.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup / MVP | SaaS Default + Error-Only Retention | Minimize ops overhead; focus on errors for stability. | Low initial cost; scales with usage. |
| Enterprise / High-Volume | Tiered Lifecycle + Vector/FluentBit + S3/ClickHouse | Granular control; cost optimization via cold storage; performance via ClickHouse. | 70-80% cost reduction vs flat retention. |
| High Compliance (Finance/Health) | WORM Archive + Immutable Audit Logs + Strict Redaction | Regulatory requirement for data integrity and long-term retention. | Higher storage cost for archive, but mitigates massive compliance fines. |
| Cost-Constrained | Aggressive Sampling + Parquet Compression + Glacier Deep Archive | Maximize compression; minimize hot storage; accept higher retrieval latency. | Lowest possible cost; retrieval delays acceptable for audit. |
Configuration Template
Vector Configuration for Tiered Log Retention
This template demonstrates routing, redaction, and lifecycle management using Vector.
# vector.toml
[sources.app_logs]
type = "kubernetes_logs"
exclude_paths_glob_patterns = ["/var/log/containers/*-debug-*"]
[transforms.redact_pii]
type = "remap"
inputs = ["app_logs"]
source = '''
# Drop DEBUG logs first so no work is spent redacting them
if .level == "DEBUG" {
  abort
}
# Redact Email
.message = replace(.message, r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', "***@***.***")
# Redact SSN
.message = replace(.message, r'\b\d{3}-\d{2}-\d{4}\b', "***-**-****")
'''
[transforms.sample_info]
type = "sample"
inputs = ["redact_pii"]
# Keep 1 in 5 events (20% sampling)
rate = 5
# Events matching this condition bypass sampling, so only INFO logs are sampled
exclude = '.level != "INFO"'
[sinks.hot_elasticsearch]
type = "elasticsearch"
inputs = ["sample_info"]
endpoint = "https://es-hot:9200"
index = "logs-hot-%Y.%m.%d"
healthcheck.enabled = true
buffer.type = "memory"
# Memory buffers are sized in events, not bytes; this value is a placeholder to tune
buffer.max_events = 10000
[sinks.cold_s3]
type = "aws_s3"
inputs = ["sample_info"]
bucket = "prod-logs-archive"
key_prefix = "logs/%Y/%m/%d/"
encoding.codec = "parquet"
storage_class = "GLACIER"
compression = "gzip"
buffer.type = "memory"
# Flush at roughly 50 MB or after 10 minutes, whichever comes first
batch.max_bytes = 52428800
batch.timeout_secs = 600
Quick Start Guide
- Deploy Agent: Install Vector or Fluent Bit as a DaemonSet on your Kubernetes cluster.
- Apply Redaction & Filter: Use the configuration template to enable PII redaction and drop DEBUG logs immediately.
- Define Retention Rules: Configure your storage sink (Elasticsearch/S3) with lifecycle policies: Hot for 7 days, Warm for 30 days, Cold for 1 year.
- Verify Pipeline: Check agent metrics to confirm DEBUG logs are dropped and PII is masked. Monitor storage costs for the first 24 hours.
- Optimize: Adjust sampling rates for INFO logs based on your volume targets. Enable WORM on archive buckets for compliance.
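Result: Dropping DEBUG logs and sampling INFO logs at the source cuts ingest volume as soon as the agent is deployed, which lowers cost and speeds up queries. Redaction at the edge and tiered lifecycle rules then give you a compliant retention posture.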