LSM Trees: Why Your Database Writes Are Fast and Your Reads Are Lying to You

By Codcompass Team·2026-05-07·4 min read

Current Situation Analysis

The fundamental failure mode of traditional B-tree storage engines emerges under sustained high-throughput ingestion. When profiling write-heavy workloads (e.g., 40–50k inserts/sec from event pipelines), the bottleneck is rarely CPU saturation or memory pressure—it is await time on disk I/O. B-trees enforce strict on-disk sorted order, meaning every insert triggers a random write seek to a leaf node. At scale, this translates to tens of thousands of random seeks per second, which spinning disks physically cannot sustain and NVMe SSDs struggle with due to NAND flash architecture.

NVMe drives erase data in blocks of 128KB–512KB. Writing a single 4KB B-tree page forces the SSD controller to read the surrounding block, erase it, and rewrite the entire block. This hardware-level write amplification is compounded by B-tree maintenance: leaf node fills trigger page splits that propagate upward, causing cascading random writes. Under load, background writers fall behind (pg_stat_bgwriter shows buffers_backend_fsync spiking), forcing foreground queries to block on dirty page flushes. Traditional B-trees structurally cannot match high-write throughput because their architecture prioritizes read-optimized random access over sequential ingestion.

WOW Moment: Key Findings

Benchmarking B-tree vs. LSM-tree architectures under identical hardware constraints reveals a stark architectural trade-off. LSM trees defer sorting and enforce sequential I/O, shifting the cost from the write path to the read path and background compaction.

Approach	Sustained Writes/sec	Random I/O Utilization	Point Read Latency (Cold Key)	Write Amplification Factor	Hardware Saturation Point
B-Tree Index	~50k	100%	0.5–1.0ms	10–64x	~50k–100k IOPS
LS

M-Tree (Leveled) | 400k+ | 20–30% | 1.5–3.0ms (w/ Bloom) | 2–5x | ~500k+ IOPS |

Key Findings:

Sequential I/O Dominance: LSM trees convert random writes into sequential appends (WAL + MemTable flush), bypassing NVMe erase-block penalties and achieving 8–10x higher write throughput on identical hardware.
Deferred Ordering Cost: The B-tree's O(log n) point-read guarantee is replaced by multi-level SSTable scanning. Bloom filters recover ~99% of this overhead by eliminating unnecessary disk seeks.
Sweet Spot: LSM trees excel in write-heavy, append-dominant workloads (time-series, event logging, high-frequency ingestion). B-trees remain optimal for read-heavy OLTP where point-query latency and strict consistency outweigh write volume.

Core Solution

The LSM architecture inverts traditional storage design: writes are cheap and unordered in memory, while read-optimized sorting is assembled lazily in the background. Every production LSM implementation (RocksDB, LevelDB, Cassandra, ScyllaDB) maps to three core layers:

MemTable: A mutable, in-memory sorted structure (typically a skip list or red-black tree) that absorbs all incoming writes.
SSTables: Immutable, sorted files written to disk when the MemTable hits its size threshold.
Compaction: Background threads that merge SSTables across levels to reclaim space, remove tombstones, and enforce read-optimized ordering.

The Write Path: Why Appends Win

Every write touches two places sequentially: the WAL first, then the MemTable. The WAL is a sequential append requiring a single fdatasync call. The MemTable is updated in memory with zero page splits or disk reads. When the MemTable reaches its threshold, it freezes, a new MemTable takes over, and a background thread flushes the frozen one to disk as an SSTable. The application never blocks on this flush.

# RocksDB config that controls this behavior
# rocksdb/options.h equivalents in a typical config file

write_buffer_size = 134217728       # 128MB per MemTable
max_write_buffer_number = 3         # how many MemTables can exist simultaneously
min_write_buffer_number_to_merge = 1 # flush when 1 MemTable is full

# WAL behavior
wal_dir = /data/wal                 # separate WAL on a different device if you want
sync_log_entry = true               # fsync per write — safe but slower

The Read Path: Where the Pain Starts

Reading is the honest cost paid for cheap writes. Lookup order: active MemTable → frozen MemTables → Level 0 SSTables → Level 1, Level 2, etc. In the worst case (key absence), the engine may scan every SSTable across all levels. This is why Bloom filters are mandatory in production. Each SSTable carries a probabilistic filter that skips non-existent keys with ~99% confidence, preventing full-level scans.

# In a RocksDB ColumnFamilyOptions — bloom filter per SSTable
BlockBasedTableOptions table_options;
table_options.filter_policy.reset(
    NewBloomFilterPolicy(
        10,    // bits per key — 10 gives ~1% false positive rate
        false  // use block-based filter, not full filter
    )
);
table_options.whole_key_filtering = true;

Architecture Decisions

Immutable SSTables: Enable lock-free reads, efficient compaction, and crash recovery via WAL replay.
Leveled vs. Tiered Compaction: Leveled compaction minimizes read amplification but increases write amplification. Tiered compaction reduces write amplification but allows SSTable proliferation, increasing read overhead.
Background-First Design: Sorting, merging, and space reclamation are offloaded to dedicated threads, ensuring the write path remains non-blocking.

Pitfall Guide

Disabling or Misconfiguring Bloom Filters: Turning off Bloom filters or setting bits-per-key too low causes cold-key lookups to scan every SSTable, spiking latency to 50ms+. Always enable whole_key_filtering and tune bits-per-key based on acceptable false-positive rates.
Ignoring Compaction Strategy Alignment: Choosing Leveled compaction for write-heavy pipelines causes excessive background I/O and SSD wear. Choosing Tiered compaction for read-heavy workloads creates read amplification bottlenecks. Match compaction to your dominant access pattern.
Misaligned MemTable Flush Thresholds: Setting write_buffer_size too low triggers frequent SSTable creation, overwhelming compaction threads. Setting it too high increases memory pressure and WAL recovery time after crashes. Default to 64–128MB and adjust based on RAM availability and write velocity.
Co-locating WAL and Data Files: Placing the WAL on the same NVMe device as SSTables creates I/O contention, negating LSM's sequential write advantage. Separate WAL onto a dedicated low-latency device or partition to maintain append performance.
Overlooking Write vs. Space Amplification Trade-offs: Aggressive compaction reduces read amplification but spikes write amplification and storage overhead. Monitor compaction_stats and tune level0_file_num_compaction_trigger to balance background I/O with read performance.

Deliverables

📘 LSM Architecture Blueprint: Visual mapping of MemTable → SSTable → Compaction flow, including I/O path analysis for write vs. read operations and hardware-level NAND interaction diagrams.
✅ Production Tuning Checklist: Step-by-step validation for Bloom filter configuration, compaction strategy selection, MemTable sizing, WAL isolation, and monitoring metrics (iostat, rocksdb.dbstats, pg_stat_bgwriter equivalents).
⚙️ Configuration Templates: Pre-optimized RocksDB/LevelDB config files for three workload profiles: (1) High-Throughput Ingestion, (2) Balanced OLTP, (3) Read-Optimized Analytics. Includes annotated ColumnFamilyOptions, DBOptions, and compaction thresholds ready for deployment.

🎉 Mid-Year Sale — Unlock Full Article

Base plan from just $4.99/mo or $49/yr

7-day free trial · Cancel anytime · 30-day money-back

Current Situation Analysis

WOW Moment: Key Findings

🎉 Mid-Year Sale — Unlock Full Article

Production Bundle