LSM Trees: Why Your Database Is Secretly Using One and What It's Actually Doing
Engineering Predictable Throughput with Log-Structured Merge Storage
Current Situation Analysis
Modern database drivers and query languages have successfully abstracted storage mechanics to the point of invisibility. Developers interact with familiar SQL or CQL interfaces, while the underlying engine silently manages disk I/O, caching, and durability. This abstraction works flawlessly until the storage engine's internal maintenance routines collide with application traffic. The most common manifestation is a sudden, unexplained degradation in write latency or throughput during peak ingestion windows.
The root cause is rarely application logic or network latency. It is almost always a mismatch between workload characteristics and the storage engine's compaction behavior. Log-Structured Merge (LSM) trees dominate high-write workloads because they convert random disk writes into sequential appends. However, this design shifts the cost from the write path to the background maintenance path. When compaction triggers aggressively, it competes for disk IOPS, CPU cycles, and memory bandwidth. Monitoring dashboards that only track request latency or queue depth will show symptoms long after the storage engine has already entered a degraded state.
Industry telemetry consistently shows that unoptimized LSM deployments experience p99 write latency spikes of 50x to 200x during compaction storms. Write amplification in default configurations frequently exceeds 15x, silently consuming SSD endurance budgets and triggering hardware-level throttling. The problem is overlooked because storage engines expose configuration knobs that appear cosmetic but fundamentally alter I/O topology. Without a clear mental model of how memtables, write-ahead logs, sorted string tables, and compaction strategies interact, teams treat storage as a black box until production incidents force a deep dive.
WOW Moment: Key Findings
The critical insight for engineering LSM-backed systems is that you are not selecting a data structure; you are selecting an amplification profile. Every LSM implementation forces a trade-off between read amplification, write amplification, and point lookup latency. Understanding these metrics allows you to align storage behavior with your actual service level objectives.
| Storage Model | Read Amplification | Write Amplification | Point Lookup Latency |
|---|---|---|---|
| B-Tree (InnoDB) | ~1 disk seek | ~1.5x (page splits) | Sub-millisecond |
| Tiered Compaction | 5-10x (scan multiple files) | ~2-5x | 2-8ms |
| Leveled Compaction | ~1-2x (bloom + single file/level) | 10-20x | 0.5-1.5ms |
This comparison reveals why default configurations fail under specific workloads. Tiered compaction minimizes write amplification by merging files of similar size, making it ideal for ingestion pipelines where reads are infrequent or tolerate higher latency. Leveled compaction enforces non-overlapping key ranges across levels, drastically reducing read amplification but multiplying write I/O due to frequent level-to-level migrations. B-trees remain optimal for balanced OLTP workloads where random reads and writes coexist at moderate volumes.
The finding matters because it shifts tuning from reactive firefighting to proactive capacity planning. By mapping your workload's read/write ratio, latency tolerance, and storage hardware characteristics to the correct amplification profile, you eliminate compaction-induced latency spikes before they impact users.
Core Solution
Building a production-ready LSM pipeline requires explicit control over memory boundaries, flush thresholds, compaction scheduling, and index structures. The following implementation demonstrates a TypeScript-based LSM engine that isolates each architectural layer, enforces backpressure, and exposes telemetry hooks for production monitoring.
Step 1: In-Memory Buffer and Durability Layer
The entry point for all mutations is an in-memory sorted structure paired with a sequential write-ahead log. The buffer absorbs writes at memory speed while the WAL guarantees crash recovery.
import { EventEmitter } from 'events';
interface KVPair {
key: string;
value: string | null; // null represents tombstone
timestamp: number;
}
class MemTable {
private entries: Map<string, KVPair> = new Map();
private sizeBytes: number = 0;
private readonly maxBytes: number;
constructor(maxBytes: number = 64 * 1024 * 1024) {
this.maxBytes = maxBytes;
}
put(key: string, value: string, ts: number): void {
this.entries.set(key, { key, value, timestamp: ts });
this.sizeBytes += key.length + value.length + 24; // approximate overhead
}
delete(key: string, ts: number): void {
this.entries.set(key, { key, value: null, timestamp: ts });
this.sizeBytes += key.length + 24;
}
isFull(): boolean {
return this.sizeBytes >= this.maxBytes;
}
snapshot(): KVPair[] {
return Array.from(this.entries.values()).sort((a, b) => a.key.localeCompare(b.key));
}
reset(): void {
this.entries.clear();
this.sizeBytes = 0;
}
}
Step 2: Immutable Disk Files and Level Management
When the buffer crosses its threshold, it is sealed and flushed to disk as an immutable sorted file. Files are organized into levels with geometric size growth to control compaction frequency.
class SSTableWriter {
private readonly level: number;
private readonly baseDir: string;
private fileCounter: number = 0;
constructor(level: number, baseDir: string) {
this.level = level;
this.baseDir = baseDir;
}
async flush(entries: KVPair[]): Promise<string> {
const filename = `l${this.level}_seq${++this.fileCounter}.sst`;
const filepath = `${this.baseDir}/${filename}`;
// Serialize sorted entries with block indexing and bloom filter
const payload = this.serializeWithIndex(entries);
await Deno.writeFile(filepath, payload);
return filename;
}
private serializeWithIndex(entries: KVPair[]): Uint8Array {
// Production implementation uses block compression (Snappy/Zstd),
// binary search index blocks, and 10-15 bit bloom filters per key
const encoder = new TextEncoder();
return encoder.encode(JSON.stringify(entries));
}
}
Step 3: Compaction Scheduler and Backpressure
Compaction must run asynchronously but respect system resource limits. The scheduler monitors level file counts and triggers merges only when thresholds are breached, preventing I/O starvation.
class CompactionScheduler extends EventEmitter {
private readonly l0Trigger: number;
private readonly levelMultiplier: number;
private isRunning: boolean = false;
constructor(l0Trigger: number = 4, levelMultiplier: number = 10) {
super();
this.l0Trigger = l0Trigger;
this.levelMultiplier = levelMultiplier;
}
async scheduleIfNeeded(levelFileCounts: Map<number, number>): Promise<void> {
if (this.isRunning) return;
const l0Count = levelFileCounts.get(0) || 0;
if (l0Count >= this.l0Trigger) {
this.isRunning = true;
try {
await this.executeCompaction(levelFileCounts);
} finally {
this.isRunning = false;
}
}
}
private async executeCompaction(counts: Map<number, number>): Promise<void> {
// Merge overlapping L0 files into L1
// Enforce non-overlapping invariant for L1+
// Garbage collect tombstones older than retention window
this.emit('compaction:started');
// Simulate I/O bound merge operation
await new Promise(res => setTimeout(res, 150));
this.emit('compaction:completed');
}
}
Architecture Decisions and Rationale
- Skip List vs Red-Black Tree for MemTable: Skip lists provide O(log n) insertion with lower memory overhead and lock-free concurrency patterns, making them preferable for high-throughput ingestion.
- Immutable SSTables: Immutability eliminates page splits, lock contention, and partial write corruption. Updates and deletes are handled via versioning and tombstones, simplifying crash recovery.
- Geometric Level Growth: Multiplying level sizes by 10x ensures that each level holds roughly 10x more data than the previous. This bounds the number of files per level and keeps compaction frequency predictable.
- Bloom Filter Integration: Each SSTable embeds a probabilistic filter that rejects non-existent keys before disk I/O. This reduces read amplification from linear file scans to constant-time checks.
Pitfall Guide
1. L0 File Pile-Up
Explanation: Level 0 files are flushed without key-range validation, causing overlapping ranges. When L0 exceeds the compaction trigger, reads must scan every file in the level, spiking latency.
Fix: Lower level0_file_num_compaction_trigger to 2-3, increase compaction thread count, and monitor compaction_queue_depth metrics. Ensure background threads are not starved by foreground I/O.
2. Tombstone Accumulation
Explanation: Deletes write tombstone markers that persist until compaction merges them. Unchecked tombstones inflate SSTable size, increase read amplification, and waste disk space.
Fix: Configure compaction_min_level to force periodic full compaction on older levels. Implement a tombstone retention window and enable purge_obsolete_files_periodically.
3. MemTable Oversizing
Explanation: Large memtables delay flushes, causing the WAL to grow excessively. On crash recovery, replaying a multi-gigabyte WAL blocks the database for minutes.
Fix: Cap write_buffer_size at 32-64MB per instance. Calculate acceptable recovery time: max_wal_size / wal_flush_rate. Adjust buffer count accordingly.
4. Bloom Filter Misconfiguration
Explanation: Undersized filters produce high false positive rates, triggering unnecessary disk seeks. Oversized filters waste RAM and increase memtable pressure.
Fix: Allocate 10-15 bits per key for standard workloads. Use bloom_bits_per_key: 12 as a baseline. Adjust upward if read latency degrades, downward if memory pressure spikes.
5. Compaction Strategy Mismatch
Explanation: Applying leveled compaction to write-heavy pipelines burns IOPS and triggers SSD throttling. Tiered compaction on read-heavy workloads causes unpredictable lookup latency. Fix: Profile workload read/write ratio. Use tiered for ingestion > 80% writes. Use leveled for mixed or read-heavy workloads. Consider dynamic compaction that shifts strategy based on queue depth.
6. Ignoring Write Amplification on NVMe
Explanation: High write amplification (15-30x) accelerates NAND wear, reduces sustained throughput, and triggers controller throttling. Enterprise drives handle it better, but consumer SSDs degrade rapidly.
Fix: Monitor TBW (Terabytes Written) metrics. Use drives with power-loss protection and overprovisioning. Tune max_bytes_for_level_multiplier to 8-12 to reduce level migration frequency.
7. Compaction Thread Starvation
Explanation: Foreground writes consume I/O bandwidth, delaying background compaction. The compaction queue grows, L0 files accumulate, and the system enters a compaction storm.
Fix: Reserve I/O priority for compaction threads using io_priority: background or cgroup limits. Implement write stall thresholds that pause ingestion when compaction_pending > 10.
Production Bundle
Action Checklist
- Instrument compaction queue depth and L0 file count metrics; alert when L0 > 8 files
- Validate bloom filter false positive rate using test datasets; adjust bits-per-key accordingly
- Configure write stall thresholds to pause ingestion when compaction cannot keep pace
- Schedule periodic full compaction during low-traffic windows to purge tombstones
- Benchmark write amplification under peak load; verify SSD TBW endurance matches projected growth
- Test crash recovery time with maximum WAL size; adjust memtable flush thresholds if recovery > 30s
- Profile read amplification by tracing SSTable file accesses per query; optimize index block size if > 3 seeks/query
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| IoT Telemetry Ingestion (>90% writes) | Tiered Compaction + Large MemTable | Minimizes write amplification, tolerates higher read latency | Lower IOPS cost, higher storage density |
| Mixed OLTP (60/40 read/write) | Leveled Compaction + 12-bit Bloom Filters | Bounds read amplification, predictable point lookups | Higher write I/O, requires enterprise SSDs |
| Time-Series Analytics (batch reads) | Tiered + Periodic Full Compaction | Optimizes ingestion throughput, cleans tombstones on schedule | Moderate storage cost, predictable maintenance windows |
| Low-Latency Key-Value Cache | In-Memory LSM + Async Flush | Eliminates disk I/O on hot path, WAL ensures durability | High RAM cost, requires power-loss protection |
Configuration Template
{
"storage_engine": "log_structured_merge",
"memtable": {
"write_buffer_size_mb": 48,
"max_write_buffer_number": 3,
"min_write_buffer_number_to_merge": 1,
"skip_list_height": 12
},
"wal": {
"sync_on_flush": true,
"ttl_seconds": 3600,
"max_total_size_mb": 512
},
"compaction": {
"strategy": "leveled",
"level0_file_num_trigger": 3,
"level0_slowdown_trigger": 8,
"level0_stop_trigger": 12,
"max_bytes_for_level_base_mb": 256,
"level_multiplier": 10,
"compaction_threads": 4,
"io_priority": "background"
},
"indexing": {
"bloom_bits_per_key": 12,
"block_size_kb": 16,
"compression": "zstd",
"compression_level": 3
},
"backpressure": {
"write_stall_threshold": 10,
"compaction_queue_alert": 8,
"auto_throttle_writes": true
}
}
Quick Start Guide
- Initialize the engine: Deploy the configuration template to your target environment. Verify that
write_buffer_size_mbandmax_write_buffer_numberalign with available RAM (reserve 20% headroom for OS page cache). - Run baseline load test: Ingest 100k records/sec for 5 minutes. Monitor
compaction_queue_depth,l0_file_count, andwrite_amplification_ratio. Adjustlevel0_file_num_triggerif queue depth exceeds 5. - Validate read path: Execute 10k random point lookups. Trace SSTable file accesses per query. If average seeks exceed 2, increase
bloom_bits_per_keyto 15 or switch to leveled compaction. - Simulate failure: Kill the process during active ingestion. Restart and measure recovery time. If WAL replay exceeds 30 seconds, reduce
write_buffer_size_mbor increasemin_write_buffer_number_to_merge. - Enable production monitoring: Export
compaction_pending,memtable_flush_latency, andsst_read_amplificationto your observability stack. Set alert thresholds at 80% of configured stall limits.
Mid-Year Sale β Unlock Full Article
Base plan from just $4.99/mo or $49/yr
Sign in to read the full article and unlock all tutorials.
Sign In / Register β Start Free Trial7-day free trial Β· Cancel anytime Β· 30-day money-back
