WiredTiger is an open-source key/value storage engine for NoSQL databases and is currently the default storage engine for MongoDB.
WiredTiger supports both B-tree and log-structured merge tree for data storage. It also supports both row store and column store. WiredTiger employs multi-version concurrency control for better scalability on multi-core architectures and supports several standard isolation levels. Though WiredTiger is often considered as a storage engine, it also has some basic schema support where schemas are provided by users at runtime.
WiredTiger is developed in C and provides Java/Python language bindings. It has a set of APIs that is compatible with LevelDB.
Checkpoints in WiredTiger are different from typical checkpoints in other database systems. In WiredTiger, the block storage manager is actually managing multiple versions of checkpoints, i.e. all the durable data in WiredTiger are considered as checkpoints. Checkpoint process in WiredTiger are triggered every 60 seconds or manually triggered.
A successful checkpoint in WiredTiger requires four phases. We will take B-tree as an example in the following discussion of these phases. First, all dirty leaf pages are written to disk. Then all dirty internal pages are written to disk. Then fsync is called on the checkpoint file so that the checkpoint is persistent in disk. Finally, the new root of the B-tree is atomically changed to the new root. This is somewhat similar to shadow paging. When a checkpoint is deleted, WiredTiger will check whether there is any dependency between the two adjacent checkpoints and merge them if there are common pages in the two B-trees.
Nested Loop Join Index Nested Loop Join
Nested loop joins in WiredTiger are supported by join cursors. When a join cursor is created, it creates corresponding iterators on tables to be joined. Indexes can be used to narrow down the range of iterators for WHERE
clauses in joins.
Custom API Command-line / Shell
WiredTiger has a command line tool that supports basic read
/write
functionalities. WiredTiger provides language bindings for C/Python/Java and databases are accessed via cursors. It also provides optional asynchronous query APIs with callback mechanism if high throughput is desired.
B+Tree Log-Structured Merge Tree
B-tree: WiredTiger uses B-tree for in-memory indexes. Hazard pointers are used in their B-tree implementation, which points to file offsets on disk files. The checksum of a child page, as well as the child page's pointer are stored in the parent page to avoid potential memory flip issues.
Log-structured merge tree: LSM trees are used in WireTiger to improve write throughput. Each LSM tree has a corresponding Bloom filter so that unnecessary disk access can be avoided.
Dictionary Encoding Run-Length Encoding Naïve (Page-Level) Prefix Compression
Multiple compression techniques are used in WiredTiger.
Block-level: Existing compression libraries can be directly plugged into WiredTiger. It supports LZ4, snappy, zlib, and Zstd.
Record-level: WiredTiger supports dictionary encoding, run-length encoding, prefix encoding, and Huffman encoding.
Multi-version Concurrency Control (MVCC)
WiredTiger employs multi-version concurrency control (MVCC), in which readers never block writers. In case of write-write conflict, one transaction will be rolled back and retried. In addition, transactions' working set is required to fit in main memory in WiredTiger.
In WiredTiger, transactions can pin particular versions of the database, which are called "named snapshots". Queries within such transactions are executed as if they are operating on the named snapshots.
Read Uncommitted Read Committed Snapshot Isolation
Three isolation levels are provided: READ_UNCOMMITTED
, READ_COMMITTED
, and SNAPSHOT
. READ_COMMIT
is the default isolation level.
https://github.com/wiredtiger/wiredtiger
http://source.wiredtiger.com/develop/index.html
WiredTiger Inc.
2008
MongoDB