The TiDB cluster has three components: the TiDB server, the PD server, and the TiKV server.
- The TiDB server is stateless. It does not store data and is used for computing only. TiDB is horizontally scalable and exposes a unified interface to the outside through load balancing components such as Linux Virtual Server (LVS), HAProxy, or F5.
- The Placement Driver (PD) server is the managing component of the entire cluster.
- The TiKV server is responsible for storing data. From an external view, TiKV is a distributed transactional Key-Value storage engine. The Region is the basic unit for storing data. Each Region stores the data for a particular Key Range, which is a left-closed, right-open interval from StartKey to EndKey. Each TiKV node holds multiple Regions. TiKV uses the Raft protocol for replication to ensure data consistency and disaster recovery. The replicas of the same Region on different nodes form a Raft Group. Load balancing of data among TiKV nodes is scheduled by PD, and the Region is also the basic unit of that scheduling.
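The left-closed, right-open Key Range can be illustrated with a small lookup routine. This is a minimal sketch, not TiKV's actual routing code: the `REGIONS` table, the `locate_region` helper, and the convention that an empty `end_key` means "no upper bound" are all assumptions made for the example.

```python
import bisect

# Hypothetical Region table, sorted by start_key. Each Region covers
# [start_key, end_key). In this sketch an empty end_key means "unbounded".
REGIONS = [
    {"id": 1, "start_key": b"", "end_key": b"g"},
    {"id": 2, "start_key": b"g", "end_key": b"p"},
    {"id": 3, "start_key": b"p", "end_key": b""},
]

def locate_region(regions, key):
    """Return the Region whose left-closed, right-open range contains key."""
    starts = [r["start_key"] for r in regions]
    # Pick the rightmost Region whose start_key <= key.
    idx = bisect.bisect_right(starts, key) - 1
    region = regions[idx]
    end = region["end_key"]
    # The end key itself is excluded from the range.
    if end != b"" and key >= end:
        raise KeyError(f"no Region covers key {key!r}")
    return region
```

Because the interval is right-open, the key `b"g"` belongs to Region 2 rather than Region 1, so adjacent Regions never overlap and never leave a gap.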
Multi-version Concurrency Control (MVCC)
Historical versions of data are kept because each update or removal creates a new version of the data object instead of modifying or removing the object in place. Not all versions are kept, however. Versions older than a specific time are removed completely to reduce storage usage and the performance overhead caused by too many historical versions.
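The idea that updates and removals append versions rather than overwrite can be sketched as follows. The `MVCCStore` class, its method names, and the use of plain integers as timestamps are hypothetical and chosen only for illustration; they do not reflect TiKV's on-disk layout.

```python
class MVCCStore:
    """Toy multi-version store: writes append versions, never overwrite."""

    def __init__(self):
        # key -> list of (timestamp, value); a value of None marks a removal.
        self.versions = {}

    def put(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))

    def delete(self, key, ts):
        # A removal is recorded as one more version (a tombstone).
        self.put(key, None, ts)

    def get(self, key, read_ts):
        """Return the newest value written at or before read_ts."""
        visible = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= read_ts]
        if not visible:
            return None
        return max(visible, key=lambda tv: tv[0])[1]
```

A reader that supplies an older `read_ts` keeps seeing the older value even after later updates or a removal, which is what makes the retained history useful.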
In TiDB, Garbage Collection (GC) runs periodically to remove obsolete data versions. GC is triggered as follows: a gc_worker goroutine runs in the background of each TiDB server. In a cluster with multiple TiDB servers, one of the gc_worker goroutines is automatically selected as the leader. The leader is responsible for maintaining the GC state and sending GC commands to each TiKV Region leader.
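What a GC pass does to the versions of a single key might look like the sketch below. The function name `gc_versions` and the parameter name `safe_point` are assumptions. The rule of retaining the newest version at or before the cutoff (so that a read at the cutoff still finds a value) is also an assumption of this sketch; the description above only says that sufficiently old versions are removed.

```python
def gc_versions(versions, safe_point):
    """Drop versions that no reader at or after safe_point can still see.

    versions: list of (timestamp, value) pairs; value None is a tombstone.
    """
    ordered = sorted(versions, key=lambda tv: tv[0])
    newer = [tv for tv in ordered if tv[0] > safe_point]
    older = [tv for tv in ordered if tv[0] <= safe_point]
    kept = []
    if older:
        latest_old = older[-1]
        # Keep the newest old version so reads at safe_point still work,
        # unless it is a tombstone, in which case nothing is visible anyway.
        if latest_old[1] is not None:
            kept.append(latest_old)
    return kept + newer
```

With versions at timestamps 10, 20, and 30 and a cutoff of 25, only the version at 10 is discarded, since a read at timestamp 25 still needs the version written at 20.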
Any durable storage engine stores data on disk, and TiKV is no exception. However, TiKV does not write data to disk directly. Instead, it stores data in RocksDB, which is then responsible for persisting it. The reason is that developing a standalone storage engine, especially a high-performance one, is costly.
Read Committed, Repeatable Read
TiDB uses the Percolator transaction model. A global read timestamp is obtained when a transaction starts, and a global commit timestamp is obtained when it commits. The execution order of transactions is determined by these timestamps. Repeatable Read is the default transaction isolation level in TiDB.
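The role of the two timestamps can be sketched with a toy in-memory store. This is a simplified illustration under stated assumptions, not the Percolator protocol itself: it omits locks, primary keys, and two-phase commit, and the names `TimestampOracle`, `Store`, and `Transaction` are hypothetical. A transaction reads only versions committed before its start timestamp, and it aborts at commit time if another transaction has committed one of its keys in the meantime.

```python
import itertools

class TimestampOracle:
    """Hands out strictly increasing timestamps (stand-in for a global source)."""

    def __init__(self):
        self._counter = itertools.count(1)

    def next_ts(self):
        return next(self._counter)

class Store:
    def __init__(self):
        self.tso = TimestampOracle()
        self.committed = {}  # key -> list of (commit_ts, value)

    def begin(self):
        # The read (start) timestamp is fixed when the transaction begins.
        return Transaction(self, self.tso.next_ts())

class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}  # buffered writes, invisible to others until commit

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        # Snapshot read: newest version committed before start_ts.
        history = self.store.committed.get(key, [])
        visible = [(ts, v) for ts, v in history if ts < self.start_ts]
        return max(visible, key=lambda tv: tv[0])[1] if visible else None

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Abort if another transaction committed one of our keys after we started.
        for key in self.writes:
            for ts, _ in self.store.committed.get(key, []):
                if ts > self.start_ts:
                    raise RuntimeError(f"write conflict on {key!r}")
        # The commit timestamp is obtained only at commit time.
        commit_ts = self.store.tso.next_ts()
        for key, value in self.writes.items():
            self.store.committed.setdefault(key, []).append((commit_ts, value))
        return commit_ts
```

Because every read is evaluated against the fixed start timestamp, repeating a read inside one transaction returns the same value even if other transactions commit in between.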
Tuple-at-a-Time Model, Vectorized Model
In most cases, TiDB processes data tuple by tuple. But in some cases, TiDB uses vectorized execution.
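The difference between the two processing models can be sketched with a simple filter operator. Both functions below are hypothetical illustrations rather than TiDB's executor code; the row and column layouts, the `batch_size` parameter, and the fixed `value > threshold` predicate are assumptions of the example.

```python
def filter_tuple_at_a_time(rows, predicate):
    """Tuple-at-a-time: the operator hands one row to its parent per step."""
    for row in rows:
        if predicate(row):
            yield row

def filter_vectorized(columns, column, threshold, batch_size=4):
    """Vectorized: evaluate the predicate over a slice of one column at a time."""
    n = len(columns[column])
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        batch = columns[column][start:end]
        # One pass over the batch builds a selection vector of matching offsets.
        selected = [start + i for i, v in enumerate(batch) if v > threshold]
        # Materialize the matching rows of every column for this batch.
        yield {name: [vals[i] for i in selected] for name, vals in columns.items()}
```

The tuple-at-a-time version pays the per-call overhead once per row, whereas the batch version pays it once per batch and runs the predicate in a tight loop over a single column.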
https://github.com/pingcap/tidb
PingCAP
2016
C, C++, Cocoa, D, Eiffel, Erlang, Go, Haskell, Java, Lua, Ocaml, Perl, PHP, Python, Ruby, Scheme, SQL, Tcl