LevelDB

LevelDB is a key/value storage library built by Google. It provides an ordered mapping from string keys to string values. It is an open-source implementation of the log-structured merge-tree (LSM-tree), a write-optimized structure for storing key-value pairs that turns small random writes into large sequential writes.

History

Two Google engineers, Jeff Dean and Sanjay Ghemawat, were inspired by the design of the Bigtable tablet. A tablet in Bigtable is a segment of a table, split along row boundaries. They wanted to build an open-source system with the characteristics of a Bigtable tablet. They also wanted LevelDB to back Chrome's IndexedDB implementation. This is the origin of LevelDB.

Checkpoints

Blocking

When the operation log file grows beyond a size limit, LevelDB performs a checkpoint. The contents of the MemTable are flushed to disk as a new SSTable, and compaction may then be triggered, which moves data down the levels. LevelDB also creates a new log file and a new MemTable to receive subsequent writes.
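The rotation described above can be sketched with a toy model. This is only an illustration, not LevelDB's code: the class `ToyStore`, the parameter `log_limit_bytes`, and the way record size is counted are all assumptions made for the example.

```python
class ToyStore:
    """Toy model of log/MemTable rotation; not LevelDB's real implementation."""

    def __init__(self, log_limit_bytes=64):
        self.log_limit_bytes = log_limit_bytes  # hypothetical size threshold
        self.log = []         # current operation log (list of records)
        self.log_bytes = 0    # approximate size of the current log
        self.memtable = {}    # current in-memory table
        self.level0 = []      # flushed tables, newest last

    def put(self, key, value):
        # Record the operation in the log first, then apply it in memory.
        self.log.append((key, value))
        self.log_bytes += len(key) + len(value)
        self.memtable[key] = value
        if self.log_bytes > self.log_limit_bytes:
            self._checkpoint()

    def _checkpoint(self):
        # Flush the MemTable to "disk" as a sorted, immutable table.
        self.level0.append(sorted(self.memtable.items()))
        # Start a fresh log and a fresh MemTable for subsequent writes.
        self.log = []
        self.log_bytes = 0
        self.memtable = {}
```

In this sketch the flush happens synchronously inside `put`; a real engine would typically do this work in the background so that writers are not stalled.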

System Architecture

Embedded

LevelDB is an embedded library that runs inside the application process. Its immutable tables (SSTables) are stored on disk. There are 7 on-disk levels in total, plus at most two in-memory tables. The system first buffers write operations in an in-memory table called the MemTable and flushes it to disk when it becomes full. On disk, tables are organized into levels, and each level contains multiple tables called SSTables. Each lower level has a larger capacity than the level above it. When an upper level is full, the system pushes data down to the next level, which may require reading and rewriting multiple SSTables.
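Pushing data from one level to the next amounts to merging sorted runs, with newer entries replacing older ones. A minimal sketch follows; the function names `compact` and `split_into_tables` and the entry-count limit are hypothetical, and real compaction also handles deletion markers and picks only the overlapping tables.

```python
def compact(upper, lower):
    """Merge two sorted lists of (key, value) pairs.

    Entries in `upper` come from the higher (newer) level and win
    whenever the same key appears in both inputs.
    """
    merged = dict(lower)
    merged.update(upper)  # newer values overwrite older ones
    return sorted(merged.items())


def split_into_tables(entries, max_entries):
    """Cut one sorted run into tables of at most `max_entries` pairs each."""
    return [entries[i:i + max_entries]
            for i in range(0, len(entries), max_entries)]
```

For example, merging `[("b", "new")]` into `[("a", "1"), ("b", "old")]` keeps a single entry for `"b"` holding the newer value.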

Stored Procedures

Not Supported

Storage Organization

Log-structured

Storage Model

N-ary Storage Model (Row/Record)

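An SSTable stores data in the N-ary storage model: it holds a sequence of key-value pairs sorted by key and grouped into blocks. An index block near the end of the file records the start offset and a boundary key for each data block, so a lookup can find the one block that may contain a given key. An optional Bloom filter lets a lookup skip blocks that cannot contain the key.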
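Locating a block through a per-block index can be illustrated with a binary search. The index contents, the offsets, and the function name `find_block_offset` below are made up for the example, and the sketch assumes each index entry stores the last key of its block.

```python
import bisect

# Hypothetical index: one (last_key_in_block, block_offset) entry per block.
index = [("cherry", 0), ("mango", 4096), ("zebra", 8192)]


def find_block_offset(index, key):
    """Return the offset of the only block that could hold `key`, or None."""
    last_keys = [last_key for last_key, _ in index]
    # First block whose last key is >= the search key.
    i = bisect.bisect_left(last_keys, key)
    return index[i][1] if i < len(index) else None
```

Only the selected block then needs to be read and searched; a key greater than every boundary key cannot be in the table at all.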

Isolation Levels

Snapshot Isolation

A snapshot captures the state of the database at a given point in time and gives the user a handle to it. Reads issued against a snapshot see the data as it was when the snapshot was created, regardless of later writes.
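One common way to provide this behavior is to tag every write with an increasing sequence number and let a snapshot remember the number that was current when it was taken. The sketch below illustrates that idea only; `VersionedStore` and its methods are hypothetical names, not LevelDB's API.

```python
class VersionedStore:
    """Toy multi-version store: every write gets a new sequence number."""

    def __init__(self):
        self.seq = 0
        self.versions = {}  # key -> list of (sequence, value), oldest first

    def put(self, key, value):
        self.seq += 1
        self.versions.setdefault(key, []).append((self.seq, value))

    def snapshot(self):
        # A snapshot is just the sequence number at creation time.
        return self.seq

    def get(self, key, snapshot=None):
        limit = self.seq if snapshot is None else snapshot
        # Newest version that is visible at the requested sequence number.
        for seq, value in reversed(self.versions.get(key, [])):
            if seq <= limit:
                return value
        return None
```

A read without a snapshot sees the latest value, while a read through an older snapshot ignores every version written after it.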

Indexes

Skip List

The MemTable is implemented as a skip list. The on-disk data is organized as an LSM-tree, a persistent key-value structure optimized for insertions and deletions; LevelDB is an open-source LSM-tree implementation.
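For readers unfamiliar with the structure, here is a minimal single-threaded skip list. It is a generic textbook sketch, not LevelDB's implementation: the maximum height, the promotion probability, the seeded random generator, and the in-place overwrite of existing keys are all choices made for this example.

```python
import random


class Node:
    def __init__(self, key, value, height):
        self.key = key
        self.value = value
        self.next = [None] * height  # one forward pointer per level


class SkipList:
    MAX_HEIGHT = 8  # assumed cap for this sketch

    def __init__(self, seed=0):
        self.head = Node(None, None, self.MAX_HEIGHT)
        self.rng = random.Random(seed)

    def _random_height(self):
        height = 1
        while height < self.MAX_HEIGHT and self.rng.random() < 0.25:
            height += 1
        return height

    def _find_predecessors(self, key):
        # For each level, the last node whose key is strictly less than `key`.
        preds = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in reversed(range(self.MAX_HEIGHT)):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            preds[level] = node
        return preds

    def insert(self, key, value):
        preds = self._find_predecessors(key)
        candidate = preds[0].next[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value  # simplification: overwrite in place
            return
        node = Node(key, value, self._random_height())
        for level in range(len(node.next)):
            node.next[level] = preds[level].next[level]
            preds[level].next[level] = node

    def get(self, key):
        candidate = self._find_predecessors(key)[0].next[0]
        if candidate is not None and candidate.key == key:
            return candidate.value
        return None

    def items(self):
        # Walking level 0 yields every entry in key order.
        node = self.head.next[0]
        while node is not None:
            yield node.key, node.value
            node = node.next[0]
```

Because level 0 links every node in key order, flushing such a table to a sorted file is a simple linear walk.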

Query Interface

Custom API

Keys and values in LevelDB are byte arrays of arbitrary length. It supports the basic operations Put(), Get(), and Delete(). It also supports batched writes: several updates are collected in a WriteBatch and applied together, atomically, with a single Write() call. It does not support SQL queries because it is not a relational database. It also has no support for secondary indexes.
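The difference between single operations and a batch can be shown with a toy in-memory model. `ToyDB` and its method names are hypothetical and only mimic the shape of the interface; the point of the sketch is that a batch is applied all at once or not at all.

```python
class ToyDB:
    """Toy key/value store over byte strings; not the LevelDB API."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def delete(self, key):
        self.data.pop(key, None)

    def write(self, batch):
        """Apply a list of (op, key, value) tuples atomically."""
        staged = dict(self.data)  # work on a copy
        for op, key, value in batch:
            if op == "put":
                staged[key] = value
            elif op == "delete":
                staged.pop(key, None)
            else:
                raise ValueError(f"unknown operation: {op!r}")
        self.data = staged  # publish only if every operation succeeded
```

A typical use is moving a value from one key to another: `db.write([("delete", b"k1", None), ("put", b"k2", b"v")])` leaves no intermediate state in which both keys or neither key exists.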

Storage Architecture

Disk-oriented

Recently written data is kept in the MemTable; when the MemTable fills up it becomes an immutable MemTable, which is then written to disk. In addition, compaction removes overwritten and deleted data from each level and generates new SSTables at the next level.

Data Model

Key/Value

A key/value store maps each key to its corresponding value. In an SSTable, keys and values are laid out as adjacent byte strings.

Logging

Logical Logging

Before every insertion, update, or deletion is applied, the system appends a record of the operation to the log. After a failure, operations that had not yet been flushed to an SSTable are read back from the log and reapplied to recover the lost state.
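Log-then-replay recovery can be sketched in a few lines. The JSON-lines record format and the function names `append_record` and `replay` are assumptions made for this example; LevelDB uses its own binary log format.

```python
import json
import os


def append_record(log_path, op, key, value=None):
    """Append one logical operation to the log and force it to disk."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"op": op, "key": key, "value": value}) + "\n")
        log.flush()
        os.fsync(log.fileno())


def replay(log_path):
    """Rebuild the in-memory table by reapplying every logged operation."""
    memtable = {}
    if not os.path.exists(log_path):
        return memtable
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            if record["op"] == "put":
                memtable[record["key"]] = record["value"]
            elif record["op"] == "delete":
                memtable.pop(record["key"], None)
    return memtable
```

Because the records describe operations rather than page contents, replaying them in order reproduces the same final state that the lost in-memory table held.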

Concurrency Control

Two-Phase Locking (Deadlock Prevention)

LevelDB allows only one process to open a database at a time; it uses an operating-system file lock to prevent concurrent access from other processes. Within one process, the database can be accessed by multiple threads. When several threads write concurrently, only one writer at a time writes to the database and the others are blocked until it finishes. For read-write conflicts, readers can read from the immutable tables, which are separate from the structures being written. Updated versions are merged into the on-disk tables during compaction.

Query Execution

Tuple-at-a-Time Model

Query Compilation

Not Supported

LevelDB has no query language, so there are no queries to compile.