Engineering Predictable Throughput with Log-Structured Merge Storage

Current Situation Analysis

Modern database drivers and query languages have successfully abstracted storage mechanics to the point of invisibility. Developers interact with familiar SQL or CQL interfaces, while the underlying engine silently manages disk I/O, caching, and durability. This abstraction works flawlessly until the storage engine's internal maintenance routines collide with application traffic. The most common manifestation is a sudden, unexplained degradation in write latency or throughput during peak ingestion windows.

The root cause is rarely application logic or network latency. It is almost always a mismatch between workload characteristics and the storage engine's compaction behavior. Log-Structured Merge (LSM) trees dominate high-write workloads because they convert random disk writes into sequential appends. However, this design shifts the cost from the write path to the background maintenance path. When compaction triggers aggressively, it competes for disk IOPS, CPU cycles, and memory bandwidth. Monitoring dashboards that only track request latency or queue depth will show symptoms long after the storage engine has already entered a degraded state.

Industry telemetry consistently shows that unoptimized LSM deployments experience p99 write latency spikes of 50x to 200x during compaction storms. Write amplification in default configurations frequently exceeds 15x, silently consuming SSD endurance budgets and triggering hardware-level throttling. The problem is overlooked because storage engines expose configuration knobs that appear cosmetic but fundamentally alter I/O topology. Without a clear mental model of how memtables, write-ahead logs, sorted string tables, and compaction strategies interact, teams treat storage as a black box until production incidents force a deep dive.

WOW Moment: Key Findings

The critical insight for engineering LSM-backed systems is that you are not selecting a data structure; you are selecting an amplification profile. Every LSM implementation forces a trade-off between read amplification, write amplification, and point lookup latency. Understanding these metrics allows you to align storage behavior with your actual service level objectives.

Storage Model	Read Amplification	Write Amplification	Point Lookup Latency
B-Tree (InnoDB)	~1 disk seek	~1.5x (page splits)	Sub-millisecond
Tiered Compaction	5-10x (scan multiple files)	~2-5x	2-8ms
Leveled Compaction	~1-2x (bloom + single file/level)	10-20x	0.5-1.5ms

This comparison reveals why default configurations fail under specific workloads. Tiered compaction minimizes write amplification by merging files of similar size, making it ideal for ingestion pipelines where reads are infrequent or tolerate higher latency. Leveled compaction enforces non-overlapping key ranges across levels, drastically reducing read amplification but multiplying write I/O due to frequent level-to-level migrations. B-trees remain optimal for balanced OLTP workloads where random reads and writes coexist at moderate volumes.

The finding matters because it shifts tuning from reactive firefighting to proactive capacity planning. By mapping your workload's read/write ratio, latency tolerance, and storage hardware characteristics to the correct amplification profile, you eliminate compaction-induced latency spikes before they impact users.

Core Solution

Building a production-ready LSM pipeline requires explicit control over memory boundaries, flush thresholds, compaction scheduling, and index structures. The following implementation demonstrates a TypeScript-based LSM engine that isolates each architectural layer, enforces backpressure, and exposes telemetry hooks for production monitoring.

Step 1: In-Memory Buffer and Durability Layer

The entry point for all mutations is an in-memory sorted structure paired with a sequential write-ahead log. The buffer absorbs writes at memory speed while the WAL guarantees crash recovery.

import { EventEmitter } from 'events';

interface KVPair {
  key: string;
  value: string | null; // null represents tombstone
  timestamp: number;
}

class MemTable {
  private entries: Map<string, KVPair> = new Map();
  private sizeBytes: number = 0;
  private readonly maxBytes: number;

  constructor(maxBytes: number = 64 * 1024 * 1024) {
    this.maxBytes = maxBytes;
  }

  put(key: string, value: string, ts: number): void {
    this.entries.set(key, { key, value, timestamp: ts });
    this.sizeBytes += key.length + value.length + 24; // approximate overhead
  }

  delete(key: string, ts: number): void {
    this.entries.set(key, { key, value: null, timestamp: ts });
    this.sizeBytes += key.length + 24;
  }

  isFull(): boolean {
    return this.sizeBytes >= this.maxBytes;
  }

  snapshot(): KVPair[] {
    return Array.from(this.entries.values()).sort((a, b) => a.key.localeCompare(b.key));
  }

  reset(): void {
    this.entries.clear();
    this.sizeBytes = 0;
  }
}

Step 2: Immutable Disk Files and Level Management

When the buffer crosses its threshold, it is sealed and flushed to disk as an immutable sorted file. Files are organized into levels with geometric size growth to control compaction frequency.

class SSTableWriter {
  private readonly level: number;
  private readonly baseDir: string;
  private fileCounter: number = 0;

  constructor(level: number, baseDir: string) {
    this.level = level;
    this.baseDir = baseDir;
  }

  async flush(entries: KVPair[]): Promise<string> {
    const filename = `l${this.level}_seq${++this.fileCounter}.sst`;
    const filepath = `${this.baseDir}/${filename}`;
    
    // Serialize sorted entries with block indexing and bloom filter
    const payload = this.serializeWithIndex(entries);
    await Deno.writeFile(filepath, payload);
    
    return filename;
  }

  private serializeWithIndex(entries: KVPair[]): Uint8Array {
    // Production implementation uses block compression (Snappy/Zstd),
    // binary search index blocks, and 10-15 bit bloom filters per key
    const encoder = new TextEncoder();
    return encoder.encode(JSON.stringify(entries));
  }
}

Step 3: Compaction Scheduler and Backpressure

Compaction must run asynchronously but respect system resource limits. The scheduler monitors level file counts and triggers merges only when thresholds are breached, preventing I/O starvation.

class CompactionScheduler extends EventEmitter {
  private readonly l0Trigger: number;
  private readonly levelMultiplier: number;
  private isRunning: boolean = false;

  constructor(l0Trigger: number = 4, levelMultiplier: number = 10) {
    super();
    this.l0Trigger = l0Trigger;
    this.levelMultiplier = levelMultiplier;
  }

  async scheduleIfNeeded(levelFileCounts: Map<number, number>): Promise<void> {
    if (this.isRunning) return;
    
    const l0Count = levelFileCounts.get(0) || 0;
    if (l0Count >= this.l0Trigger) {
      this.isRunning = true;
      try {
        await this.executeCompaction(levelFileCounts);
      } finally {
        this.isRunning = false;
      }
    }
  }

  private async executeCompaction(counts: Map<number, number>): Promise<void> {
    // Merge overlapping L0 files into L1
    // Enforce non-overlapping invariant for L1+
    // Garbage collect tombstones older than retention window
    this.emit('compaction:started');
    // Simulate I/O bound merge operation
    await new Promise(res => setTimeout(res, 150));
    this.emit('compaction:completed');
  }
}

Architecture Decisions and Rationale

Skip List vs Red-Black Tree for MemTable: Skip lists provide O(log n) insertion with lower memory overhead and lock-free concurrency patterns, making them preferable for high-throughput ingestion.
Immutable SSTables: Immutability eliminates page splits, lock contention, and partial write corruption. Updates and deletes are handled via versioning and tombstones, simplifying crash recovery.
Geometric Level Growth: Multiplying level sizes by 10x ensures that each level holds roughly 10x more data than the previous. This bounds the number of files per level and keeps compaction frequency predictable.
Bloom Filter Integration: Each SSTable embeds a probabilistic filter that rejects non-existent keys before disk I/O. This reduces read amplification from linear file scans to constant-time checks.

Pitfall Guide

1. L0 File Pile-Up

Explanation: Level 0 files are flushed without key-range validation, causing overlapping ranges. When L0 exceeds the compaction trigger, reads must scan every file in the level, spiking latency. Fix: Lower level0_file_num_compaction_trigger to 2-3, increase compaction thread count, and monitor compaction_queue_depth metrics. Ensure background threads are not starved by foreground I/O.

2. Tombstone Accumulation

Explanation: Deletes write tombstone markers that persist until compaction merges them. Unchecked tombstones inflate SSTable size, increase read amplification, and waste disk space. Fix: Configure compaction_min_level to force periodic full compaction on older levels. Implement a tombstone retention window and enable purge_obsolete_files_periodically.

3. MemTable Oversizing

Explanation: Large memtables delay flushes, causing the WAL to grow excessively. On crash recovery, replaying a multi-gigabyte WAL blocks the database for minutes. Fix: Cap write_buffer_size at 32-64MB per instance. Calculate acceptable recovery time: max_wal_size / wal_flush_rate. Adjust buffer count accordingly.

4. Bloom Filter Misconfiguration

Explanation: Undersized filters produce high false positive rates, triggering unnecessary disk seeks. Oversized filters waste RAM and increase memtable pressure. Fix: Allocate 10-15 bits per key for standard workloads. Use bloom_bits_per_key: 12 as a baseline. Adjust upward if read latency degrades, downward if memory pressure spikes.

5. Compaction Strategy Mismatch

Explanation: Applying leveled compaction to write-heavy pipelines burns IOPS and triggers SSD throttling. Tiered compaction on read-heavy workloads causes unpredictable lookup latency. Fix: Profile workload read/write ratio. Use tiered for ingestion > 80% writes. Use leveled for mixed or read-heavy workloads. Consider dynamic compaction that shifts strategy based on queue depth.

6. Ignoring Write Amplification on NVMe

Explanation: High write amplification (15-30x) accelerates NAND wear, reduces sustained throughput, and triggers controller throttling. Enterprise drives handle it better, but consumer SSDs degrade rapidly. Fix: Monitor TBW (Terabytes Written) metrics. Use drives with power-loss protection and overprovisioning. Tune max_bytes_for_level_multiplier to 8-12 to reduce level migration frequency.

7. Compaction Thread Starvation

Explanation: Foreground writes consume I/O bandwidth, delaying background compaction. The compaction queue grows, L0 files accumulate, and the system enters a compaction storm. Fix: Reserve I/O priority for compaction threads using io_priority: background or cgroup limits. Implement write stall thresholds that pause ingestion when compaction_pending > 10.

Production Bundle

Action Checklist

Instrument compaction queue depth and L0 file count metrics; alert when L0 > 8 files
Validate bloom filter false positive rate using test datasets; adjust bits-per-key accordingly
Configure write stall thresholds to pause ingestion when compaction cannot keep pace
Schedule periodic full compaction during low-traffic windows to purge tombstones
Benchmark write amplification under peak load; verify SSD TBW endurance matches projected growth
Test crash recovery time with maximum WAL size; adjust memtable flush thresholds if recovery > 30s
Profile read amplification by tracing SSTable file accesses per query; optimize index block size if > 3 seeks/query

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
IoT Telemetry Ingestion (>90% writes)	Tiered Compaction + Large MemTable	Minimizes write amplification, tolerates higher read latency	Lower IOPS cost, higher storage density
Mixed OLTP (60/40 read/write)	Leveled Compaction + 12-bit Bloom Filters	Bounds read amplification, predictable point lookups	Higher write I/O, requires enterprise SSDs
Time-Series Analytics (batch reads)	Tiered + Periodic Full Compaction	Optimizes ingestion throughput, cleans tombstones on schedule	Moderate storage cost, predictable maintenance windows
Low-Latency Key-Value Cache	In-Memory LSM + Async Flush	Eliminates disk I/O on hot path, WAL ensures durability	High RAM cost, requires power-loss protection

Configuration Template

{
  "storage_engine": "log_structured_merge",
  "memtable": {
    "write_buffer_size_mb": 48,
    "max_write_buffer_number": 3,
    "min_write_buffer_number_to_merge": 1,
    "skip_list_height": 12
  },
  "wal": {
    "sync_on_flush": true,
    "ttl_seconds": 3600,
    "max_total_size_mb": 512
  },
  "compaction": {
    "strategy": "leveled",
    "level0_file_num_trigger": 3,
    "level0_slowdown_trigger": 8,
    "level0_stop_trigger": 12,
    "max_bytes_for_level_base_mb": 256,
    "level_multiplier": 10,
    "compaction_threads": 4,
    "io_priority": "background"
  },
  "indexing": {
    "bloom_bits_per_key": 12,
    "block_size_kb": 16,
    "compression": "zstd",
    "compression_level": 3
  },
  "backpressure": {
    "write_stall_threshold": 10,
    "compaction_queue_alert": 8,
    "auto_throttle_writes": true
  }
}

Quick Start Guide

Initialize the engine: Deploy the configuration template to your target environment. Verify that write_buffer_size_mb and max_write_buffer_number align with available RAM (reserve 20% headroom for OS page cache).
Run baseline load test: Ingest 100k records/sec for 5 minutes. Monitor compaction_queue_depth, l0_file_count, and write_amplification_ratio. Adjust level0_file_num_trigger if queue depth exceeds 5.
Validate read path: Execute 10k random point lookups. Trace SSTable file accesses per query. If average seeks exceed 2, increase bloom_bits_per_key to 15 or switch to leveled compaction.
Simulate failure: Kill the process during active ingestion. Restart and measure recovery time. If WAL replay exceeds 30 seconds, reduce write_buffer_size_mb or increase min_write_buffer_number_to_merge.
Enable production monitoring: Export compaction_pending, memtable_flush_latency, and sst_read_amplification to your observability stack. Set alert thresholds at 80% of configured stall limits.

LSM Trees: Why Your Database Is Secretly Using One and What It's Actually Doing