TiKV

TiKV is an open source distributed Key-Value database based on the designs of Google Spanner and HBase, but it is much simpler, with no dependency on any distributed file system. Its primary features include geo-replication, horizontal scalability, consistent distributed transactions, and coprocessor support.

History

Inspired by Google Spanner, PingCAP started developing TiKV in 2015 and released the first version of TiKV along with TiDB in 2016. As of April 27, 2018, PingCAP had released TiDB/TiKV 2.0, which brings many new features and significant performance gains compared to 1.0.

Logging

Physical Logging

TiKV uses Raft to replicate data, and each data change is recorded as a Raft log entry. Through Raft's log replication, data is safely and reliably synchronized to multiple nodes of the Raft group.
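
A minimal sketch of this flow, using hypothetical types rather than TiKV's actual code: a data change is first appended to the Raft log, and it is applied to the Key-Value state only after the entry is known to be committed (replicated to a majority of the Raft group).

    use std::collections::BTreeMap;

    // A data change, recorded as the payload of a Raft log entry.
    struct LogEntry {
        index: u64,   // position in the replicated log
        term: u64,    // Raft term in which the entry was proposed
        key: Vec<u8>, // the Key-Value change being replicated
        value: Vec<u8>,
    }

    struct Peer {
        log: Vec<LogEntry>,             // the replicated Raft log
        kv: BTreeMap<Vec<u8>, Vec<u8>>, // the applied Key-Value state
        applied_index: u64,
    }

    impl Peer {
        // Record a change in the log (on the leader, or copied from the leader).
        fn append(&mut self, entry: LogEntry) {
            self.log.push(entry);
        }

        // Apply every entry up to the committed index to the Key-Value state.
        fn apply_to(&mut self, committed_index: u64) {
            for e in &self.log {
                if e.index > self.applied_index && e.index <= committed_index {
                    self.kv.insert(e.key.clone(), e.value.clone());
                    self.applied_index = e.index;
                }
            }
        }
    }

    fn main() {
        let mut peer = Peer { log: Vec::new(), kv: BTreeMap::new(), applied_index: 0 };
        peer.append(LogEntry { index: 1, term: 1, key: b"k".to_vec(), value: b"v".to_vec() });
        peer.apply_to(1); // pretend the leader has reported index 1 as committed
        assert_eq!(peer.kv.get(b"k".as_slice()), Some(&b"v".to_vec()));
    }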

Data Model

Key/Value

TiKV uses a Key-Value data model, and the Key-Value pairs are ordered by the Keys' binary (byte-wise) order.
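
A short illustration of byte-ordered keys, using a plain in-memory ordered map as a stand-in for TiKV (the key layout is invented for the example): because pairs are sorted by key bytes, range scans over a key prefix fall out naturally.

    use std::collections::BTreeMap;

    fn main() {
        // Keys and values are raw byte strings; a BTreeMap keeps keys in
        // byte-wise (lexicographic) order, the same ordering TiKV uses.
        let mut kv: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();
        kv.insert(b"user/001".to_vec(), b"alice".to_vec());
        kv.insert(b"user/002".to_vec(), b"bob".to_vec());
        kv.insert(b"order/100".to_vec(), b"book".to_vec());

        // Ordered keys make prefix/range scans straightforward:
        // "user0" is the first byte string past the "user/" prefix.
        for (k, v) in kv.range(b"user/".to_vec()..b"user0".to_vec()) {
            println!("{} => {}", String::from_utf8_lossy(k), String::from_utf8_lossy(v));
        }
    }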

Storage Model

Custom

As in the data model, Key-Value pairs are kept ordered by the Keys' binary order; physically they are persisted through RocksDB (see Storage Architecture).

Query Execution

Tuple-at-a-Time Model

Little public information is available about TiKV's query execution model.

System Architecture

Shared-Nothing

TiKV is built on top of RocksDB; all data in a TiKV node is stored in two RocksDB instances, one for data and the other for the Raft log. The major components in TiKV are listed below (a rough structural sketch follows the list):

  • Placement Driver (PD): Manages the metadata about Nodes, Stores, Regions mapping, and makes decisions for data placement and load balancing.
  • Node: A physical node in the cluster. Each node contains one or more Stores.
  • Store: Stores data in local disks using RocksDB. Each store contains one or more regions.
  • Region: The basic unit of Key-Value data movement; each Region corresponds to a data range in a Store. Each Region is replicated to multiple Nodes, and the replicas form a Raft group. A replica of a Region is called a Peer.
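
A rough structural sketch of how these components relate, with hypothetical field names (not TiKV's actual definitions): a Node hosts Stores, a Store holds Region replicas, and each Region covers a contiguous key range replicated across Stores as one Raft group.

    // Hypothetical structures illustrating the component hierarchy.
    struct Node {
        address: String,
        stores: Vec<Store>, // a physical node contains one or more Stores
    }

    struct Store {
        id: u64,
        region_ids: Vec<u64>, // Region replicas kept in this Store's RocksDB
    }

    struct Region {
        id: u64,
        start_key: Vec<u8>,   // inclusive start of the key range
        end_key: Vec<u8>,     // exclusive end of the key range
        peers: Vec<PeerInfo>, // one replica (Peer) per Store in the Raft group
    }

    struct PeerInfo {
        peer_id: u64,
        store_id: u64, // the Store this replica lives on
    }

    fn main() {
        // A Region covering ["a", "m"), replicated as three Peers on three Stores.
        let region = Region {
            id: 1,
            start_key: b"a".to_vec(),
            end_key: b"m".to_vec(),
            peers: vec![
                PeerInfo { peer_id: 10, store_id: 1 },
                PeerInfo { peer_id: 11, store_id: 2 },
                PeerInfo { peer_id: 12, store_id: 3 },
            ],
        };
        println!("region {} has {} peers", region.id, region.peers.len());
    }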

Joins

Not Supported

Joins can be implemented at the application level.
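
A minimal sketch of such an application-level join, using an in-memory ordered map as a stand-in for TiKV and an invented key layout: the client reads the two key ranges itself and matches rows in application code.

    use std::collections::BTreeMap;

    fn main() {
        let mut kv: BTreeMap<String, String> = BTreeMap::new();
        // "users/<id>" -> name, "orders/<order_id>" -> user id
        kv.insert("users/1".into(), "alice".into());
        kv.insert("users/2".into(), "bob".into());
        kv.insert("orders/100".into(), "1".into());
        kv.insert("orders/101".into(), "2".into());

        // Join orders with users on user id, entirely on the client side.
        for (order_key, user_id) in kv.range("orders/".to_string().."orders0".to_string()) {
            if let Some(name) = kv.get(&format!("users/{}", user_id)) {
                println!("{} belongs to {}", order_key, name);
            }
        }
    }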

Isolation Levels

Read Committed, Repeatable Read

TiDB/TiKV uses the Percolator transaction model. The default isolation level in TiKV is Repeatable Read. When a transaction starts, it obtains a global read timestamp; when it commits, it obtains a global commit timestamp. The execution order of transactions is determined by these timestamps. The underlying details can be found in the Concurrency Control section.

Concurrency Control

Multi-version Concurrency Control (MVCC)

TiKV has a Timestamp Oracle (TSO) that provides globally unique timestamps. The core transaction model of TiKV is a two-phase commit powered by MVCC. There are two stages within each transaction (a simplified sketch follows the list):

  • PreWrite:
      • Obtain a startTS timestamp. Select one row as the primary row and the others as secondary rows.
      • Check whether there is a lock on the primary row or a commit after startTS. If a conflict exists, the transaction is rolled back; otherwise, lock the row.
      • Repeat the previous step on the secondary rows.
  • Commit:
      • Write to CF_WRITE with the current timestamp commitTS.
      • Release all the locks.
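
A deliberately simplified, single-process sketch of these two phases, with in-memory maps standing in for the lock and write column families; the names and structure are illustrative only and omit details such as primary/secondary row handling and rollback.

    use std::collections::BTreeMap;

    #[derive(Default)]
    struct MvccStore {
        data: BTreeMap<(Vec<u8>, u64), Vec<u8>>, // CF_DATA: (key, start_ts) -> value
        lock: BTreeMap<Vec<u8>, u64>,            // CF_LOCK: key -> start_ts of the locking txn
        write: BTreeMap<(Vec<u8>, u64), u64>,    // CF_WRITE: (key, commit_ts) -> start_ts
    }

    impl MvccStore {
        // Phase 1: lock each row and stage its value, aborting on conflicts.
        fn prewrite(&mut self, rows: &[(Vec<u8>, Vec<u8>)], start_ts: u64) -> Result<(), String> {
            for (key, value) in rows {
                // Conflict checks: an existing lock, or a commit at or after start_ts.
                if self.lock.contains_key(key) {
                    return Err(format!("key {:?} is locked", key));
                }
                if self.write.keys().any(|(k, commit_ts)| k == key && *commit_ts >= start_ts) {
                    return Err(format!("write conflict on {:?}", key));
                }
                self.lock.insert(key.clone(), start_ts);
                self.data.insert((key.clone(), start_ts), value.clone());
            }
            Ok(())
        }

        // Phase 2: record the commit in CF_WRITE and release the locks.
        fn commit(&mut self, keys: &[Vec<u8>], start_ts: u64, commit_ts: u64) {
            for key in keys {
                self.write.insert((key.clone(), commit_ts), start_ts);
                self.lock.remove(key);
            }
        }
    }

    fn main() {
        let mut store = MvccStore::default();
        let rows = vec![(b"k1".to_vec(), b"v1".to_vec()), (b"k2".to_vec(), b"v2".to_vec())];
        // The timestamps would come from the TSO; here they are hard-coded.
        store.prewrite(&rows, 10).expect("no conflicts");
        store.commit(&[b"k1".to_vec(), b"k2".to_vec()], 10, 11);
    }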

Storage Architecture

Disk-oriented

TiKV does not write data to disk directly; instead, it uses RocksDB as its underlying storage engine, where RocksDB can be regarded as a standalone Key-Value map.
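
For illustration, a minimal example of using RocksDB as a standalone Key-Value map, assuming the third-party rust-rocksdb crate (rocksdb); this is not TiKV's code, only the idea that reads and writes ultimately land in an embedded RocksDB instance on local disk.

    use rocksdb::DB;

    fn main() -> Result<(), rocksdb::Error> {
        // Open (or create) a RocksDB instance backed by files on local disk.
        let db = DB::open_default("/tmp/rocksdb-sketch")?;
        db.put(b"key", b"value")?;
        if let Some(value) = db.get(b"key")? {
            println!("got {}", String::from_utf8_lossy(&value));
        }
        Ok(())
    }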

Query Interface

Custom API

TiKV supports simple Key-Value, transactional Key-Value, and push-down queries. Whether a request is transactional Key-Value or push-down, it is transformed into simple Key-Value operations inside TiKV.
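
As a sketch of that idea, a push-down style request (here, counting rows in a key range) can be answered purely with simple Key-Value scans; the in-memory map and key layout are invented for illustration.

    use std::collections::BTreeMap;

    // A "push-down" style request served by nothing more than an ordered KV scan.
    fn count_in_range(kv: &BTreeMap<Vec<u8>, Vec<u8>>, start: &[u8], end: &[u8]) -> usize {
        kv.range(start.to_vec()..end.to_vec()).count()
    }

    fn main() {
        let mut kv = BTreeMap::new();
        kv.insert(b"row/1".to_vec(), b"a".to_vec());
        kv.insert(b"row/2".to_vec(), b"b".to_vec());
        kv.insert(b"tbl/1".to_vec(), b"c".to_vec());
        // Count keys in [row/, row0) using only simple Key-Value operations.
        println!("{}", count_in_range(&kv, b"row/", b"row0"));
    }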

Query Compilation

Not Supported

Website

https://github.com/pingcap/tikv

Source Code

https://github.com/pingcap/tikv

Tech Docs

https://pingcap.com/docs/

Developer

PingCAP

Country of Origin

CN

Start Year

2016

Project Type

Open Source

Written in

Rust

Supported languages

Go, Java

Derived From

RocksDB

Operating Systems

Linux

Licenses

Apache v2