What is a storage engine?
It is easy to confuse a storage engine with a full-fledged database server like PostgreSQL, MongoDB, or Cassandra. Those are complete systems — they speak a wire protocol, manage connections, parse queries, and handle replication. You start them as a separate process and your application talks to them over a network or socket.
BedrockDB is none of that. It is a LSM (Log-Structured Merge-tree) key-value storage engine — the layer beneath the database. Not a standalone process you connect to, but the primitive that a database is built on top of. The part that actually persists bytes to disk, manages memory buffers, and reliably retrieves records. PostgreSQL has one of these inside it. RocksDB is one. BedrockDB is a from-scratch implementation of the same primitive.
This distinction matters because storage engines are where the hard problems live. Durability guarantees, write amplification, read latency under memory pressure, crash recovery — none of that is the database's job. It's the engine's job. Every database you've used delegates all of it to a layer exactly like this one.
How does a storage engine differ from a database?
A database is a complete system — query parsing, connection handling, authentication, replication, transactions. A storage engine is the primitive buried inside it that actually reads and writes bytes. Swapping the engine changes the performance profile; everything above it stays the same.
| Storage Engine | Database | |
|---|---|---|
| Interface | Embedded library / function calls | Wire protocol (TCP, HTTP, Unix socket) |
| Deployment | Linked into your process | Separate process / cluster |
| Query language | Put, Get, Delete, Scan | SQL, MQL, CQL, etc. |
| Concurrency control | No transaction isolation | MVCC, 2PL, OCC |
| Replication | None — caller's responsibility | Leader-follower, Raft, Paxos |
| Schema | Opaque bytes — []byte → []byte | Typed columns, indexes, constraints |
| Durability | WAL + fsync, crash recovery | WAL + replication log |
| Auth / access control | None | Users, roles, row-level security |
| Data structure | B-tree, LSM tree, Hash | Abstracted — engine's concern |
| Example | RocksDB, LevelDB, BedrockDB | PostgreSQL, CockroachDB, Cassandra |
PostgreSQL uses its own heap-based engine (not pluggable, unlike MySQL/InnoDB). CockroachDB embeds Pebble (a RocksDB-compatible LSM engine). Cassandra ships with a built-in LSM engine of the same family. In each case, the engine handles durability, caching, and compaction — the database handles everything else.
BedrockDB sits squarely in the left column. It gives you durable, sorted key-value storage with crash recovery, bloom-filtered reads, and an LRU block cache — and nothing more. No query planner, no replication, no auth.
BedrockDB uses an LSM tree — optimized for write-heavy workloads via sequential disk I/O and in-memory buffering.
That is the point: it is the foundation a database builds on, not the database itself.
Architecture
Write path
Read path
Range path
Range path — N-way heap merge across Memtable + SSTables → deduplicate → filter tombstones.
Crash recovery
Crash recovery — WAL replay → Memtable rebuild → resume.
Compaction (background)
Write-ahead log (WAL) — Group commit batches concurrent writes into a single fsync. Every write is durable before ack. Writers never block waiting for fsync.
Crash recovery — On restart, Open() replays the WAL into a fresh memtable before accepting any reads or writes. Replay stops at the first incomplete or corrupt record (expected after a crash — the tail write is partial). All records before the corrupt tail are recovered. Once the memtable flushes to an SSTable, the WAL is truncated — only unflushed writes need replay. A clean shutdown (Close()) flushes the memtable and truncates the WAL, so restart is instant with nothing to replay.
Memtable — Concurrent sorted structure (B-tree) protected by a read-write mutex. Reads take a read lock only — concurrent reads never block each other.
SSTable — Immutable on-disk files with a three-layer read path: bloom filter (skip absent keys), binary-searched sparse index (find the right block), LRU block cache (avoid redundant decodes). SSTables are maintained in newest-first order, ensuring recent updates shadow older versions during lookups and compaction.
Compaction— Size-tiered compaction groups SSTables into size buckets where max_size / min_size < 2.0 and compacts any bucket that reaches the threshold. Two tiers emerge naturally under typical workloads — small (fresh memtable flushes) and large (previous compaction outputs) — each compacting independently when their bucket fills. Duplicate keys are resolved by recency. Tombstones are dropped during compaction using key range awareness and bloom filters across non-selected SSTables (see design decisions). Compaction uses a streaming iterator — blocks are decoded and written one at a time rather than loading entire SSTables into memory, keeping heap usage bounded regardless of SSTable size.
Read Amplification
In STCS, read amplification is not structurally bounded to a fixed number of tiers — the number of tiers grows with compaction history and workload. The bound below is empirical for BedrockDB's two-tier steady state under typical write patterns.
Observed steady state
Under typical workloads, BedrockDB naturally produces two size tiers:
- Small tier — fresh memtable flush outputs (~64MB each)
- Large tier — outputs of previous compactions (significantly larger)
Each tier compacts independently when its bucket accumulates T files.
The fixed cost (+2)
Every read always checks:
- Active memtable — contains the most recent writes
- Immutable memtable — a frozen memtable waiting to be flushed
These are always present regardless of compaction state:
+1 (active memtable)
+1 (immutable memtable)
= +2Per-tier cost
Within each tier, a bucket holds at most T − 1 SSTables before compaction fires. In the worst case, both tiers are just below threshold:
- Small tier: T − 1 SSTables
- Large tier: T − 1 SSTables
Two-tier bound
active memtable → 1
immutable memtable → 1
small-tier SSTables → T − 1
large-tier SSTables → T − 1
total = 2(T − 1) + 2For T = 4: 2(3) + 2 = 8 sources
This bound holds as long as the workload produces exactly two tiers. If compaction falls behind and a third tier accumulates, the bound extends to 3(T − 1) + 2, and so on. Unlike leveled compaction, STCS offers no hard structural guarantee on the number of tiers — the bound is workload-dependent.
Each SSTable probe involves a bloom filter check (cheap) and a possible disk read if the filter passes (expensive). This bound has real latency implications on cache misses.
The tradeoff
The threshold T directly controls the balance between read and write cost:
- Higher T → fewer compactions → lower write amplification
- But also → more SSTables per tier → higher read amplification
This is the core tradeoff of size-tiered compaction: you reduce write cost by increasing T, but pay for it with increased read fan-out.
SSTable file layout
[ data block 0 ][ data block 1 ]...[ data block N ] — 4KB each, page-aligned
[ index block ] — sparse index: firstKey → blockOffset
[ bloom filter ] — serialised bloom filter bytes, length-prefixed
[ footer ] — indexOffset (8B) | bloomOffset (8B) | magic (8B)Each data block encodes entries as:
| key_len (4B) | val_len (4B) | key bytes | value bytes | ... | num_entries (4B) |The 4KB block size aligns with Linux page size and NVMe sector size — a single pread maps to one page fault and one sector read.
API & Usage
Ready to run BedrockDB? See the HTTP endpoints, CLI shell, and setup instructions.
HTTP API, CLI & Run ↗Benchmarks
WAL
| Benchmark | Throughput | ns/op |
|---|---|---|
| Group commit (default) | 171 MB/s | 309 ns/op |
| Fsync-every-write | 0.05 MB/s | 964,717 ns/op |
Group commit delivers a 3,122× throughput advantage over fsync-per-write. The durability guarantee is identical — both strategies fsync to disk. The difference is how many writes share each fsync.
Memtable
| Benchmark | ns/op | B/op | allocs/op |
|---|---|---|---|
| Put | 83 | 64 | 2 |
| Get | 37 | 32 | 1 |
| ConcurrentReadWrite | 319 | — | — |
Get is 2× faster than Put. Put acquires a write lock and may subtract old value size on overwrite. Get acquires a read lock only — concurrent reads proceed in parallel.
SSTable
| Benchmark | ns/op | allocs/op | Notes |
|---|---|---|---|
| BlockEncode | 502 | 0 | Pure memcopy into pre-allocated 4KB buffer |
| BlockDecode | 2,961 | 191 | One allocation per entry (Go string conversion) |
| Get (Memtable) | ~45 | 1 | Pure in-memory lookup |
| Get (SST cache hit) | ~550 | 2 | Bloom check + LRU hit + scan, no decode |
| Get (SST cache miss) | ~4600 | 149 | pread + full block decode |
| BloomFilterReject | 98 | 2 | Absent keys rejected before any I/O |
Cache hit avoids block decode entirely — decoded entries are reused directly from the LRU cache. The ~2 allocations are key/value handling only.
Block decode dominates the cold read path — ~150 allocations per block — but is amortised across subsequent cache hits via the LRU.
Read path characteristics — BedrockDB exhibits three distinct latency tiers. Each jump is explained by what the engine can no longer skip:
- Memtable hit (~45ns) — The key is still in the in-memory B-tree. One read-lock acquisition and a pointer chase. No bloom filter, no index lookup, no disk I/O, no block decode. This is the fastest possible path.
- SSTable cache hit (~550ns, ~12× slower) — The key has been flushed to an SSTable, so the engine must check the bloom filter, binary-search the sparse index to locate the block, and then scan entries inside the block. But the block is already decoded in the LRU cache — the expensive
preadsyscall and the ~150-allocation block decode are skipped entirely. The 12× jump over memtable is the cost of the bloom + index path and two extra allocations for key/value handling. - SSTable cache miss (~4,600ns, ~8× slower than cache hit) — Same bloom + index path, but the block is not in the LRU. The engine issues a
preadsyscall to fetch 4KB from disk, then decodes the raw bytes into[]Entrystructs (~149 allocations from Go string conversion). This decode cost dominates the cold path — but it is paid once per block and then amortised across all subsequent cache hits on that block.
The full span from fastest to slowest is ~100× (45ns → 4,600ns), but in practice the LRU collapses most of that gap. Under a Zipfian access distribution (typical for key-value workloads), a small cache captures a disproportionate fraction of reads — most lookups land in the ~45–550ns range, while the ~4.6µs cold path fires only on first access or after eviction.
Range
| Benchmark | Window | ns/op | Notes |
|---|---|---|---|
| Range (memtable, small) | 100 keys | ~11,250 | In-memory B-tree scan, no I/O |
| Range (memtable, large) | 800 keys | ~81,000 | Same path, 8× more keys |
| Range (SST cache hit) | 100 keys | ~10,900 | Blocks warm in LRU, no pread |
| Range (SST cold, 5 SSTables) | 30k keys | ~5,500,000 | N-way heap merge, full block decodes |
Cache hit ≈ memtable — ~10,900ns vs ~11,250ns for the same 100-key window. Once blocks are warm in the LRU, the SSTable path costs the same as an in-memory scan. Both paths end up doing the same thing: pointer-chasing through in-memory structures. The memtable entries were never on disk to begin with; the SSTable entries were decoded from disk once and now sit as []Entry structs in the LRU. By the time the range scan actually runs, the work is identical.
Cold multi-SST is the outlier — 5.5ms for a 30k-key window across 5 SSTables is the pathological case: three bad things happening simultaneously. A wide range (keys 10,000–40,000) forces the engine to touch many blocks. A cold LRU means every block requires a pread syscall + full decode (~149 allocations per block, hence ~60k total). And 5 SSTables means all decoded streams must be heap-merged via the N-way merge iterator. In isolation, any one of these is manageable — together they produce the worst-case latency the engine can exhibit.
In practice these three conditions rarely coincide. A repeated range query on the same window warms the cache on its first call and drops to ~11µs on the second — equivalent to a memtable scan. This is the design intent behind the LRU: absorb the decode cost once, then serve all subsequent reads at memory speed.
DB layer
| Benchmark | ns/op | Notes |
|---|---|---|
| Put | 725 | WAL write (lock-free) + memtable insert |
| Get | 52 | Memtable hit |
| Get (SSTable cache miss) | 4,637 | Full stack: memtable miss → SSTable cold pread |
| MixedWorkload (80% read) | 770 | — |
Compaction
Microbenchmarks (go test -bench . -benchmem)
| Benchmark | ns/op | B/op | allocs/op | allocs/key |
|---|---|---|---|---|
| Compact4x500 (2k keys) | 1,783,560 | 901 KB | 9,602 | ~4.8 |
| Compact4x5000 (20k keys) | 5,587,839 | 5.7 MB | 82,759 | ~4.1 |
| CompactTombstoneHeavy (2k keys, 50% tombstones) | 1,920,463 | 1.2 MB | 17,011 | ~8.5 |
| CompactTombstoneWithExternal (2k keys + bloom checks) | 2,009,813 | 1.25 MB | 17,059 | ~8.5 |
~4 allocs/key on the normal compaction path is on par with production-grade Go LSMs (Pebble: 3–5, Badger: 6–10).
End-to-end compaction (streaming)
| Input | Output | Duration |
|---|---|---|
| 4 × 71 MB SSTables (284 MB total) | 284 MB SSTable | ~4.1 s |
| 4 × 284 MB SSTables (1.1 GB total) | ~1.1 GB SSTable | ~16.5 s |
Compaction is streaming — blocks are decoded and written one at a time. Peak heap usage stays bounded regardless of SSTable size. Throughput scales linearly: ~69 MB/s on the 284 MB input, ~67 MB/s on the 1.1 GB input, consistent across runs.
HTTP (wrk2, 20 connections, 30s, corrected for coordinated omission)
| Endpoint | RPS | P50 | P90 | P99 |
|---|---|---|---|---|
| PUT | 500 | 0.99ms | 1.60ms | 2.36ms |
| GET (cache hit) | 1000 | 0.91ms | 1.53ms | 1.96ms |
| GET (cache miss, cold) | 1000 | 1.05ms | 1.67ms | 2.13ms |
| GET (absent key) | 2000 | 0.92ms | 1.59ms | 2.05ms |
| RANGE (cold, no cache) | 3000 | 1.39ms | 1.79ms | 2.36ms |
GET vs GET miss are nearly identical — bloom filter rejection (98ns) is invisible at the HTTP level. Latency is dominated by HTTP stack overhead, not the storage engine.
LSM write semantics — BedrockDB follows LSM tree design. Writes are sequential appends — fast on any storage media. Reads are optimised via bloom filters and LRU block cache to minimise read amplification across SSTables.
Design decisions
Why 4KB blocks?Aligns with PAGE_SIZE on Linux. A cache miss triggers one page fault — the OS fetches exactly the data needed. Larger blocks increase read amplification on point lookups; smaller blocks increase index size and reduce LRU effectiveness.
Why a sparse index?One index entry per block, not per key. At 4KB blocks with ~50-byte average keys, a 1GB SSTable needs ~5,000 index entries (~250KB in memory) rather than millions. The index fits in L2 cache during binary search.
Why group commit in the WAL?fsync latency on NVMe is 50–200µs regardless of how many bytes were written. Batching N writes into one fsync amortises that latency across N operations. BedrockDB buffers writes in a bufio.Writer and a background goroutine fsyncs every 100µs — or immediately when the buffer exceeds 64KB. Writers never block waiting for fsync.
Why pread + LRU over mmap?mmap delegates caching to the OS page cache, which gives up control. The OS evicts pages under memory pressure without regard for access frequency — it doesn't know a block cache hit is worth more than a cold page. It also surfaces I/O errors as SIGBUS rather than return values, which are difficult to handle correctly in Go. pread keeps the read path explicit: BedrockDB decides which blocks stay resident, eviction policy is LRU over decoded []Entry structs, and errors propagate as normal Go errors. The tradeoff is that pread pays a syscall per cache miss whereas mmap faults in pages transparently — acceptable because the LRU makes cache misses rare on hot data.
Why LRU over a simpler cache?LRU evicts cold blocks under memory pressure while keeping hot blocks resident. Under a Zipfian access distribution (typical for key-value workloads), a small cache captures a disproportionate fraction of reads. BedrockDB defaults to 256 entries (1MB of decoded blocks).
Why string conversion on block decode?BlockDecode allocates once per entry — string(buf[...]) copies bytes out of the block's backing buffer rather than returning a subslice. A []byte subslice would be zero-copy and eliminate the 191 allocs, but it would pin the entire 4KB block buffer in memory for as long as the caller holds the value. The LRU cache evicts the block logically, but the GC cannot collect the backing array until all references are gone — breaking the memory boundedness the cache is designed to provide. The allocation cost is paid once per cache miss and amortised across all subsequent hits on that block.
Compaction — Size-tiered compaction groups SSTables into size buckets where max_size / min_size < 2.0 and compacts any bucket that reaches the threshold. Two tiers emerge naturally — small (fresh memtable flushes) and large (previous compaction outputs) — each compacting independently when their bucket fills. Duplicate keys are resolved by recency and tombstones are dropped at compaction time. Read amplification is bounded at 2 * (threshold - 1) + 2 sources per lookup.
Why size-tiered compaction?Size-tiered groups files of similar size rather than compacting by key range overlap (leveled) or recency (FIFO). This minimises write amplification — files are only rewritten when merged with peers of similar size, not on every level push. The tradeoff is higher read amplification than leveled compaction, mitigated by bloom filters and the LRU block cache. Cassandra uses the same strategy as its default for write-heavy workloads.
Why streaming compaction?The previous compaction implementation loaded entire SSTables into memory before merging. The current implementation uses a streaming block iterator: each SSTable is read one block at a time, decoded, and its entries fed into the heap. The output SSTable is written block-by-block in parallel. This keeps peak heap usage proportional to the number of open SSTables × block size (a few MB at most), not to the total compaction input size. The end-to-end benchmarks confirm this — 1.1 GB of input compacts at ~67 MB/s with no OOM risk.
How are tombstones dropped during compaction?In leveled compaction, tombstone dropping is straightforward — once a tombstone reaches the bottom level, no older version can exist below it. STCS has no such structural guarantee: a large previously-compacted SST may still hold the original value for a key being deleted in a smaller tier.
BedrockDB uses key range awareness combined with bloom filters to decide whether a tombstone is safe to drop. During compaction of a selected bucket, every tombstone is checked against all non-selected (external) SSTables. For each external SST, BedrockDB first checks whether the key falls within that SST's [MinKey, MaxKey] range — a cheap string comparison that eliminates irrelevant files entirely. Only if the key falls within range does it consult the bloom filter. If the bloom filter reports the key might exist, the tombstone is kept. If every external SST either has a non-overlapping range or a bloom miss for that key, the tombstone is safely dropped.
Bloom filters have false positives but never false negatives — a bloom miss is a guarantee of absence. The two-stage check (range first, bloom second) prevents bloom false positives from unrelated SSTs from unnecessarily retaining tombstones.
Why N-way heap merge for compaction and range queries?Both compaction and range queries require merging N sorted streams into one globally sorted output. A min-heap gives O(log N) per element across N iterators. The heap entry carries a source index — on key collision, the lower source index (newer SSTable) wins.
Why double-buffer immutable memtable?When the active memtable hits the size threshold, it is frozen in place as the immutable memtable and a fresh memtable takes writes immediately. The flush goroutine drains the immutable to an SSTable in the background. Writers never stall waiting for a flush to complete — the double-buffer decouples write throughput from disk I/O latency.
Why sync.Cond over busy-poll in Close()?The original Close() spun in a tight loop checking db.flushing. Replaced with sync.Cond — the goroutine sleeps until the flush goroutine signals completion. Eliminates CPU waste during shutdown with no change in correctness.