M-Tree (Leveled) | 400k+ | 20β30% | 1.5β3.0ms (w/ Bloom) | 2β5x | ~500k+ IOPS |
Key Findings:
- Sequential I/O Dominance: LSM trees convert random writes into sequential appends (WAL + MemTable flush), bypassing NVMe erase-block penalties and achieving 8β10x higher write throughput on identical hardware.
- Deferred Ordering Cost: The B-tree's O(log n) point-read guarantee is replaced by multi-level SSTable scanning. Bloom filters recover ~99% of this overhead by eliminating unnecessary disk seeks.
- Sweet Spot: LSM trees excel in write-heavy, append-dominant workloads (time-series, event logging, high-frequency ingestion). B-trees remain optimal for read-heavy OLTP where point-query latency and strict consistency outweigh write volume.
Core Solution
The LSM architecture inverts traditional storage design: writes are cheap and unordered in memory, while read-optimized sorting is assembled lazily in the background. Every production LSM implementation (RocksDB, LevelDB, Cassandra, ScyllaDB) maps to three core layers:
- MemTable: A mutable, in-memory sorted structure (typically a skip list or red-black tree) that absorbs all incoming writes.
- SSTables: Immutable, sorted files written to disk when the MemTable hits its size threshold.
- Compaction: Background threads that merge SSTables across levels to reclaim space, remove tombstones, and enforce read-optimized ordering.
The Write Path: Why Appends Win
Every write touches two places sequentially: the WAL first, then the MemTable. The WAL is a sequential append requiring a single fdatasync call. The MemTable is updated in memory with zero page splits or disk reads. When the MemTable reaches its threshold, it freezes, a new MemTable takes over, and a background thread flushes the frozen one to disk as an SSTable. The application never blocks on this flush.
# RocksDB config that controls this behavior
# rocksdb/options.h equivalents in a typical config file
write_buffer_size = 134217728 # 128MB per MemTable
max_write_buffer_number = 3 # how many MemTables can exist simultaneously
min_write_buffer_number_to_merge = 1 # flush when 1 MemTable is full
# WAL behavior
wal_dir = /data/wal # separate WAL on a different device if you want
sync_log_entry = true # fsync per write β safe but slower
The Read Path: Where the Pain Starts
Reading is the honest cost paid for cheap writes. Lookup order: active MemTable β frozen MemTables β Level 0 SSTables β Level 1, Level 2, etc. In the worst case (key absence), the engine may scan every SSTable across all levels. This is why Bloom filters are mandatory in production. Each SSTable carries a probabilistic filter that skips non-existent keys with ~99% confidence, preventing full-level scans.
# In a RocksDB ColumnFamilyOptions β bloom filter per SSTable
BlockBasedTableOptions table_options;
table_options.filter_policy.reset(
NewBloomFilterPolicy(
10, // bits per key β 10 gives ~1% false positive rate
false // use block-based filter, not full filter
)
);
table_options.whole_key_filtering = true;
Architecture Decisions
- Immutable SSTables: Enable lock-free reads, efficient compaction, and crash recovery via WAL replay.
- Leveled vs. Tiered Compaction: Leveled compaction minimizes read amplification but increases write amplification. Tiered compaction reduces write amplification but allows SSTable proliferation, increasing read overhead.
- Background-First Design: Sorting, merging, and space reclamation are offloaded to dedicated threads, ensuring the write path remains non-blocking.
Pitfall Guide
- Disabling or Misconfiguring Bloom Filters: Turning off Bloom filters or setting bits-per-key too low causes cold-key lookups to scan every SSTable, spiking latency to 50ms+. Always enable
whole_key_filtering and tune bits-per-key based on acceptable false-positive rates.
- Ignoring Compaction Strategy Alignment: Choosing Leveled compaction for write-heavy pipelines causes excessive background I/O and SSD wear. Choosing Tiered compaction for read-heavy workloads creates read amplification bottlenecks. Match compaction to your dominant access pattern.
- Misaligned MemTable Flush Thresholds: Setting
write_buffer_size too low triggers frequent SSTable creation, overwhelming compaction threads. Setting it too high increases memory pressure and WAL recovery time after crashes. Default to 64β128MB and adjust based on RAM availability and write velocity.
- Co-locating WAL and Data Files: Placing the WAL on the same NVMe device as SSTables creates I/O contention, negating LSM's sequential write advantage. Separate WAL onto a dedicated low-latency device or partition to maintain append performance.
- Overlooking Write vs. Space Amplification Trade-offs: Aggressive compaction reduces read amplification but spikes write amplification and storage overhead. Monitor
compaction_stats and tune level0_file_num_compaction_trigger to balance background I/O with read performance.
Deliverables
- π LSM Architecture Blueprint: Visual mapping of MemTable β SSTable β Compaction flow, including I/O path analysis for write vs. read operations and hardware-level NAND interaction diagrams.
- β
Production Tuning Checklist: Step-by-step validation for Bloom filter configuration, compaction strategy selection, MemTable sizing, WAL isolation, and monitoring metrics (
iostat, rocksdb.dbstats, pg_stat_bgwriter equivalents).
- βοΈ Configuration Templates: Pre-optimized RocksDB/LevelDB config files for three workload profiles: (1) High-Throughput Ingestion, (2) Balanced OLTP, (3) Read-Optimized Analytics. Includes annotated
ColumnFamilyOptions, DBOptions, and compaction thresholds ready for deployment.