## What is a storage engine?
It is easy to confuse a storage engine with a full-fledged database server like PostgreSQL, MongoDB, or Cassandra. Those are complete systems — they speak a wire protocol, manage connections, parse queries, and handle replication. You start them as a separate process and your application talks to them over a network or socket.
BedrockDB is none of that. It is an LSM (Log-Structured Merge-tree) key-value storage engine — the layer beneath the database. Not a standalone process you connect to, but the primitive that a database is built on top of. The part that actually persists bytes to disk, manages memory buffers, and reliably retrieves records. PostgreSQL has one of these inside it. RocksDB is one. BedrockDB is a from-scratch implementation of the same primitive.
This distinction matters because storage engines are where the hard problems live. Durability guarantees, write amplification, read latency under memory pressure, crash recovery — none of that is the database's job. It's the engine's job. Every database you've used delegates all of it to a layer exactly like this one.
## Architecture

*(Diagrams: write path, read path.)*
- **Write-ahead log (WAL)** — Group commit batches concurrent writes into a single fsync. Every write is durable before ack. Writers never block waiting for fsync.
- **Memtable** — Concurrent sorted structure (B-tree) protected by a read-write mutex. Reads take a read lock only — concurrent reads never block each other.
- **SSTable** — Immutable on-disk files with a three-layer read path: bloom filter (skip absent keys), binary-searched sparse index (find the right block), LRU block cache (avoid redundant decodes). SSTables are maintained in newest-first order, ensuring recent updates shadow older versions during lookups and compaction.
- **Compaction** — Size-tiered compaction groups SSTables into size buckets where max_size / min_size < 2.0 and compacts any bucket that reaches the threshold. Two tiers emerge naturally — small (fresh memtable flushes) and large (previous compaction outputs) — each compacting independently when their bucket fills. Duplicate keys are resolved by recency and tombstones are dropped at compaction time. Read amplification is bounded at 2 * (threshold - 1) + 2 sources per lookup.
## SSTable file layout

```
[ data block 0 ][ data block 1 ]...[ data block N ]   — 4KB each, page-aligned
[ index block ]                                       — sparse index: firstKey → blockOffset
[ bloom filter ]                                      — serialised bloom filter bytes, length-prefixed
[ footer ]                                            — indexOffset (8B) | bloomOffset (8B) | magic (8B)
```

Each data block encodes entries as:

```
| key_len (4B) | val_len (4B) | key bytes | value bytes | ... | num_entries (4B) |
```

The 4KB block size aligns with Linux page size and NVMe sector size — a single pread maps to one page fault and one sector read.
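The entry encoding above can be sketched in a few lines of Go. This is an illustrative round-trip, not BedrockDB's actual code — the `Entry` type, function names, and the little-endian byte order are assumptions:

```go
// block_sketch.go — minimal sketch of the data-block entry layout:
// | key_len (4B) | val_len (4B) | key | value | ... | num_entries (4B) |
// Byte order (little-endian) is an assumption of this sketch.
package main

import (
	"encoding/binary"
	"fmt"
)

type Entry struct{ Key, Value string }

// encodeBlock appends length-prefixed entries, then a 4-byte entry count.
func encodeBlock(entries []Entry) []byte {
	var buf []byte
	for _, e := range entries {
		buf = binary.LittleEndian.AppendUint32(buf, uint32(len(e.Key)))
		buf = binary.LittleEndian.AppendUint32(buf, uint32(len(e.Value)))
		buf = append(buf, e.Key...)
		buf = append(buf, e.Value...)
	}
	return binary.LittleEndian.AppendUint32(buf, uint32(len(entries)))
}

// decodeBlock reads the trailing count, then walks the entries forward.
func decodeBlock(buf []byte) []Entry {
	n := binary.LittleEndian.Uint32(buf[len(buf)-4:])
	entries := make([]Entry, 0, n)
	off := 0
	for i := uint32(0); i < n; i++ {
		kl := int(binary.LittleEndian.Uint32(buf[off:]))
		vl := int(binary.LittleEndian.Uint32(buf[off+4:]))
		off += 8
		// string() copies out of the block buffer — the per-entry
		// allocation the benchmarks attribute to decode.
		key := string(buf[off : off+kl])
		val := string(buf[off+kl : off+kl+vl])
		entries = append(entries, Entry{key, val})
		off += kl + vl
	}
	return entries
}

func main() {
	blk := encodeBlock([]Entry{{"city", "bangalore"}, {"name", "alice"}})
	fmt.Println(decodeBlock(blk))
}
```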
## HTTP API

```sh
# write
curl -X PUT http://localhost:8080/key/{key} \
  -H "Content-Type: application/json" \
  -d '{"value": "..."}'

# read
curl http://localhost:8080/key/{key}

# delete
curl -X DELETE http://localhost:8080/key/{key}

# range query — returns all keys in [start, end] inclusive, sorted
curl "http://localhost:8080/range?start=a&end=z"
```

### Example responses
```sh
# PUT
curl -X PUT http://localhost:8080/key/city \
  -H "Content-Type: application/json" \
  -d '{"value": "bangalore"}'
# {"key":"city","value":"bangalore"}

# GET
curl http://localhost:8080/key/city
# {"key":"city","value":"bangalore"}

# RANGE
curl "http://localhost:8080/range?start=city&end=name"
# [{"key":"city","value":"bangalore"},{"key":"name","value":"alice"}]

# GET deleted key
curl http://localhost:8080/key/city
# {"error":"not found"}
```

## CLI (Interactive Shell)

BedrockDB provides an interactive CLI for local debugging and inspection of the storage engine.

```sh
go run cmd/cli/main.go
```

### Supported commands
```
# basic operations
put <key> <value>
get <key>
delete <key>
range <start> <end>

# observability
stats
sstables

# exit
exit
```

### Example
```
bedrockdb> put user:1 abhiram
OK
bedrockdb> get user:1
abhiram
bedrockdb> stats
memtable_size=3
immutable=false
sst_count=1
bedrockdb> sstables
[0] data/sst-000001.sst
```

## Run

```sh
# HTTP server
go run cmd/server/main.go

# CLI
go run cmd/cli/main.go

# docker
docker compose up --build
```

The server listens on :8080. The data directory defaults to ./data.
## Benchmarks

### WAL
| Benchmark | Throughput | ns/op |
|---|---|---|
| Group commit (default) | 120 MB/s | 439 ns/op |
| Fsync-every-write | 0.10 MB/s | 519,012 ns/op |
Group commit delivers a 1,181× throughput advantage over fsync-per-write. The durability guarantee is identical — both strategies fsync to disk. The difference is how many writes share each fsync.
### Memtable
| Benchmark | ns/op | B/op | allocs/op |
|---|---|---|---|
| Put | 83 | 64 | 2 |
| Get | 37 | 32 | 1 |
| ConcurrentReadWrite | 319 | — | — |
Get is 2× faster than Put. Put acquires a write lock and may subtract old value size on overwrite. Get acquires a read lock only — concurrent reads proceed in parallel.
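The locking discipline described above can be sketched with `sync.RWMutex`. A plain map stands in for the B-tree, and the type and method names are assumptions of this sketch, not BedrockDB's actual API:

```go
// memtable_sketch.go — illustrative memtable locking: Put takes the
// write lock (and adjusts size accounting on overwrite); Get takes
// only the read lock, so concurrent readers never block each other.
package main

import (
	"fmt"
	"sync"
)

type Memtable struct {
	mu   sync.RWMutex
	data map[string]string // real engine: a sorted structure (B-tree)
	size int               // approximate byte size, for flush decisions
}

func NewMemtable() *Memtable {
	return &Memtable{data: make(map[string]string)}
}

func (m *Memtable) Put(key, value string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if old, ok := m.data[key]; ok {
		m.size -= len(old) // subtract old value size on overwrite
	} else {
		m.size += len(key)
	}
	m.size += len(value)
	m.data[key] = value
}

func (m *Memtable) Get(key string) (string, bool) {
	m.mu.RLock() // read lock only — concurrent Gets proceed in parallel
	defer m.mu.RUnlock()
	v, ok := m.data[key]
	return v, ok
}

func main() {
	mt := NewMemtable()
	mt.Put("user:1", "abhiram")
	v, _ := mt.Get("user:1")
	fmt.Println(v) // → abhiram
}
```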
### SSTable
| Benchmark | ns/op | allocs/op | Notes |
|---|---|---|---|
| BlockEncode | 502 | 0 | Pure memcopy into pre-allocated 4KB buffer |
| BlockDecode | 2,961 | 191 | One allocation per entry (Go string conversion) |
| GetCacheHit | 237 | 0 | Bloom check + LRU lookup + linear scan, no decode |
| GetCacheMiss | 11,173 | 373 | pread + decode |
| BloomFilterReject | 98 | 2 | Absent keys rejected before any I/O |
Cache hit at 237ns allocates nothing — bloom filter test, LRU map lookup, and linear scan over already-decoded entries stay on the stack or in existing heap objects. Block decode cost (191 allocs) is paid once per cache miss, then amortised across all subsequent reads of that block.
### DB layer
| Benchmark | ns/op | Notes |
|---|---|---|
| Put | 725 | WAL write + memtable insert |
| Get | 52 | Memtable hit |
| MixedWorkload (80% read) | 770 | — |
### Compaction
| Benchmark | Duration | Notes |
|---|---|---|
| 4 SSTables × 500 keys | 2.18ms | N-way heap merge |
### HTTP (wrk2, 20 connections, 30s, corrected for coordinated omission)
| Endpoint | RPS | P50 | P90 | P99 |
|---|---|---|---|---|
| PUT | 500 | 1.36ms | 2.71ms | 521ms |
| GET (cache hit) | 2000 | 1.06ms | 1.76ms | 625ms |
| GET (absent key) | 2000 | 1.04ms | 1.80ms | 621ms |
| RANGE | 200 | 4.58ms | 7.26ms | 622ms |
GET vs GET miss are nearly identical — bloom filter rejection (98ns) is invisible at the HTTP level. Latency is dominated by HTTP stack overhead, not the storage engine.
P99 spikes (~520–625ms) — WAL fsync and memtable flush compete with foreground requests under sustained load. P50 and P90 remain stable — the spike affects fewer than 1% of requests.
RANGE P50 is 4.58ms — range queries perform an N-way heap merge across the memtable and all SSTables, with per-SSTable block scans feeding the heap. Latency scales with the number of SSTables and result set size. Expected behaviour for an LSM tree without seek-based SSTable iterators.
LSM write semantics — BedrockDB follows the LSM-tree design. Writes are sequential appends — fast on any storage media. Reads are optimised via bloom filters and an LRU block cache to minimise read amplification across SSTables.
## Design decisions
### Why 4KB blocks?

Aligns with PAGE_SIZE on Linux. A cache miss triggers one page fault — the OS fetches exactly the data needed. Larger blocks increase read amplification on point lookups; smaller blocks increase index size and reduce LRU effectiveness.
### Why a sparse index?

One index entry per block, not per key. At 4KB blocks with ~50-byte average keys, a 1GB SSTable needs ~5,000 index entries (~250KB in memory) rather than millions. The index fits in L2 cache during binary search.
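The sparse-index lookup is a standard "last entry with firstKey ≤ target" binary search, which `sort.Search` expresses directly. A sketch, with illustrative names:

```go
// sparse_index_sketch.go — binary search over a sparse index: one
// entry per block, find the block that could contain the key.
package main

import (
	"fmt"
	"sort"
)

type indexEntry struct {
	firstKey    string
	blockOffset int64
}

// findBlock returns the offset of the candidate block for key,
// or -1 if key sorts before the first block's first key.
func findBlock(index []indexEntry, key string) int64 {
	// i is the first entry whose firstKey is strictly greater than key
	i := sort.Search(len(index), func(i int) bool {
		return index[i].firstKey > key
	})
	if i == 0 {
		return -1 // key precedes every block
	}
	return index[i-1].blockOffset // last block starting at or before key
}

func main() {
	index := []indexEntry{{"apple", 0}, {"mango", 4096}, {"peach", 8192}}
	fmt.Println(findBlock(index, "melon")) // → 4096 (the "mango" block)
}
```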
### Why group commit in the WAL?

fsync latency on NVMe is 50–200µs regardless of how many bytes were written. Batching N writes into one fsync amortises that latency across N operations. BedrockDB buffers writes in a bufio.Writer and a background goroutine fsyncs every 100µs — or immediately when the buffer exceeds 64KB. Writers never block waiting for fsync.
### Why pread + LRU over mmap?

mmap delegates caching to the OS page cache, which gives up control. The OS evicts pages under memory pressure without regard for access frequency — it doesn't know a block cache hit is worth more than a cold page. It also surfaces I/O errors as SIGBUS rather than return values, which is difficult to handle correctly in Go. pread keeps the read path explicit: BedrockDB decides which blocks stay resident, eviction policy is LRU over decoded []Entry structs, and errors propagate as normal Go errors. The tradeoff is that pread pays a syscall per cache miss whereas mmap faults in pages transparently — acceptable because the LRU makes cache misses rare on hot data. io_uring was considered for async I/O but deferred — Go's runtime has immature native support for it and the benefit is marginal when cache hit rates are high.
### Why LRU over a simpler cache?

LRU evicts cold blocks under memory pressure while keeping hot blocks resident. Under a Zipfian access distribution (typical for key-value workloads), a small cache captures a disproportionate fraction of reads. BedrockDB defaults to 256 entries (1MB of decoded blocks).
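The classic LRU shape — a map for O(1) lookup plus a doubly linked list for recency order — fits in a few lines of Go with `container/list`. A sketch with illustrative types (the real cache stores decoded block entries keyed by block offset, per the description above):

```go
// lru_sketch.go — LRU block cache: Get touches an element to the
// front; Put over capacity evicts from the back (least recently used).
package main

import (
	"container/list"
	"fmt"
)

type Entry struct{ Key, Value string }

type cacheItem struct {
	off     int64
	entries []Entry // decoded block contents
}

type blockCache struct {
	cap   int
	ll    *list.List              // front = most recently used
	items map[int64]*list.Element // blockOffset → list element
}

func newBlockCache(capacity int) *blockCache {
	return &blockCache{cap: capacity, ll: list.New(), items: make(map[int64]*list.Element)}
}

func (c *blockCache) Get(off int64) ([]Entry, bool) {
	if el, ok := c.items[off]; ok {
		c.ll.MoveToFront(el) // touch: keep hot blocks resident
		return el.Value.(*cacheItem).entries, true
	}
	return nil, false
}

func (c *blockCache) Put(off int64, entries []Entry) {
	if el, ok := c.items[off]; ok {
		el.Value.(*cacheItem).entries = entries
		c.ll.MoveToFront(el)
		return
	}
	c.items[off] = c.ll.PushFront(&cacheItem{off, entries})
	if c.ll.Len() > c.cap {
		old := c.ll.Back() // evict the coldest block
		c.ll.Remove(old)
		delete(c.items, old.Value.(*cacheItem).off)
	}
}

func main() {
	c := newBlockCache(2)
	c.Put(0, []Entry{{"a", "1"}})
	c.Put(4096, []Entry{{"b", "2"}})
	c.Get(0)         // touch block 0
	c.Put(8192, nil) // evicts block 4096, the least recently used
	_, ok := c.Get(4096)
	fmt.Println(ok) // → false
}
```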
### Why string conversion on block decode?

BlockDecode allocates once per entry — string(buf[...]) copies bytes out of the block's backing buffer rather than returning a subslice. A []byte subslice would be zero-copy and eliminate the 191 allocs, but it would pin the entire 4KB block buffer in memory for as long as the caller holds the value. The LRU cache evicts the block logically, but the GC cannot collect the backing array until all references are gone — breaking the memory boundedness the cache is designed to provide. The allocation cost is paid once per cache miss and amortised across all subsequent hits on that block.
### Why size-tiered compaction?

Size-tiered groups files of similar size rather than compacting by key range overlap (leveled) or recency (FIFO). This minimises write amplification — files are only rewritten when merged with peers of similar size, not on every level push. The tradeoff is higher read amplification than leveled compaction, mitigated by bloom filters and the LRU block cache. Cassandra uses the same strategy as its default for write-heavy workloads.
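The bucketing rule (max_size / min_size < 2.0, compact a bucket once it reaches the threshold) can be sketched as follows. Function names and the greedy grouping of sorted sizes are assumptions of this sketch:

```go
// compaction_buckets_sketch.go — size-tiered bucketing: sort SSTable
// sizes, group consecutive files while max/min < 2.0, and pick any
// bucket that has reached the file-count threshold.
package main

import (
	"fmt"
	"sort"
)

// bucketize groups sorted sizes so that within each bucket the ratio
// of any size to the bucket minimum stays below 2.0.
func bucketize(sizes []int64) [][]int64 {
	sort.Slice(sizes, func(i, j int) bool { return sizes[i] < sizes[j] })
	var buckets [][]int64
	for _, s := range sizes {
		if n := len(buckets); n > 0 {
			min := buckets[n-1][0] // smallest size in the current bucket
			if float64(s)/float64(min) < 2.0 {
				buckets[n-1] = append(buckets[n-1], s)
				continue
			}
		}
		buckets = append(buckets, []int64{s}) // start a new bucket
	}
	return buckets
}

// pickCompaction returns the first bucket with at least threshold files,
// or nil if no bucket is ready.
func pickCompaction(buckets [][]int64, threshold int) []int64 {
	for _, b := range buckets {
		if len(b) >= threshold {
			return b
		}
	}
	return nil
}

func main() {
	// four fresh flushes (~4MB) and two compaction outputs (~32MB):
	// the two tiers described above emerge from the size ratio alone.
	sizes := []int64{4_100_000, 4_000_000, 4_200_000, 3_900_000, 33_000_000, 31_000_000}
	buckets := bucketize(sizes)
	fmt.Println(len(buckets), "buckets; compact:", pickCompaction(buckets, 4))
}
```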
### Why N-way heap merge for compaction and range queries?

Both compaction and range queries require merging N sorted streams into one globally sorted output. A min-heap gives O(log N) per element across N iterators. The heap entry carries a source index — on key collision, the lower source index (newer SSTable) wins. Tombstones are dropped at the deepest compaction level where no older versions can exist below.
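The merge above can be sketched with `container/heap`. Each source is a sorted slice standing in for an iterator; ties break on source index so the newer source (lower index) is emitted and the older duplicate is skipped. All names are illustrative:

```go
// merge_sketch.go — N-way heap merge with newest-wins duplicate
// resolution: order by key, then by source index on collision.
package main

import (
	"container/heap"
	"fmt"
)

type Entry struct{ Key, Value string }

type cursor struct {
	src  int // 0 = newest source
	pos  int
	rows []Entry
}

type mergeHeap []*cursor

func (h mergeHeap) Len() int { return len(h) }
func (h mergeHeap) Less(i, j int) bool {
	ki, kj := h[i].rows[h[i].pos].Key, h[j].rows[h[j].pos].Key
	if ki != kj {
		return ki < kj
	}
	return h[i].src < h[j].src // newer source sorts first on collision
}
func (h mergeHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x any)   { *h = append(*h, x.(*cursor)) }
func (h *mergeHeap) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// mergeSources merges sorted sources into one sorted stream, keeping
// only the newest version of each key. Because the output is globally
// sorted, duplicates are adjacent and the older one is simply skipped.
func mergeSources(sources [][]Entry) []Entry {
	h := &mergeHeap{}
	for i, rows := range sources {
		if len(rows) > 0 {
			*h = append(*h, &cursor{src: i, rows: rows})
		}
	}
	heap.Init(h)
	var out []Entry
	for h.Len() > 0 {
		c := heap.Pop(h).(*cursor)
		e := c.rows[c.pos]
		if len(out) == 0 || out[len(out)-1].Key != e.Key {
			out = append(out, e)
		}
		if c.pos++; c.pos < len(c.rows) {
			heap.Push(h, c)
		}
	}
	return out
}

func main() {
	newer := []Entry{{"city", "bangalore"}, {"name", "alice"}}
	older := []Entry{{"city", "mumbai"}, {"zip", "560001"}}
	fmt.Println(mergeSources([][]Entry{newer, older}))
}
```

For compaction, dropping tombstones at the deepest level would be one extra filter on the emitted stream.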