Tera is a 100 PBs level, real-time high performance distributed NoSQL database. The storage architecture of Tera is based on Google's BigTable and targeted for real-time applications, mainly for Baidu's own services such as its very own web crawling engine, Spider. Design choices are made based on the challenging contexts of Baidu's webpage indexing services, leading to requirements of scalability and very high real-time performance (over hundreds of billions QPS). Besides the web crawling engine, Tera is widely used in other services of Baidu with various requirements, ranging from throughput-oriented applications to latency-sensitive service, including web indexing, WebPage DB, LinkBase DB, etc.
Tera is able to support hundreds of millions QPS in terms of random reads and writes, 100,000 billions level of records and hundreds of PB volumns of data. It replaced the former Hadoop based storage system in Baidu, which lacked scalability and timeliness due to its nature of heavy MapReduce programs usage. Tera supports a global ordering of all records, which facilitates the statistical analysis of intervals, as well as automatic loading balancing and MVCC.
Baidu, as it is known for, is a company focusing on providing web searching and indexing services mainly to Chinese internet. In 2014, the number of web pages indexed by Baidu is roughly 100,000 billion pages and only a tenth of it are considered valuable in the criteria used by Baidu. Every day there were 10 billion web pages generated around that time and in average 120 links was included in each web page. These numbers led to the fact that the web crawling engine of Baidu will have to process 1,200 billion links everyday, which is a very tough problem to deal with in terms of either scalability or the speed of new records taking effect.
Before Tera came into existence, the links data are processed by a series of MapReduce batched programs based on Hadoop. The issues with the former system are mainly linear scalability and timeliness. In 2014, the processing of 100 billion links in Baidu will usually use 500 servers, which means 1,000 billion links will require 50,000 servers, and typically 10 passes of MapReduce programs took 2 days at the time, thus incremental and streaming processing is necessary in terms of performance. The new design of Baidu is based on these judgements. The whole processing process will have to be real-time (from being discovered to stored in the database and to picked out by the scheduler). There are a lot of updates happening in this process, so the new storage system will have to support hundreds of millions QPS in terms of random reads and writes.
Baidu Tera supports table snapshots with records in unfinished transactions. Data are stored in the DFS in varied sections by key and the dissection is managed by the master server and every section is served by one data node at one time, and consistency is guaranteed this way.
Multi-version Concurrency Control (MVCC)
Tera uses a MVCC concurrency control strategy.
Multiple versions are kept for the history of crawling status regarding links. For some scenarios, timestamps are critical for judgement to prevent the DBMS from overwriting new data with old data. This is to support analysis of the historical data and rollback of transactions.
The data model of Tera is based upon Google BigTable. It is essentially a collection of sparse, distributed, multidimensional tables (sorted maps). It is indexed by a row key, a column key and a timestamp and each value in the table is an uninterpreted array of bytes in the form of:
(row:string, (column family+qualifier):string, time:int64) string.
Or a table in tera can be interpreted as a data structure like
map<RowKey, map<ColummnFamily:Qualifier, map<Timestamp, Value>>>.
In the original paper of Google BigTable, it is said that one factor that drove their design decisions was to store a large collection of web pages or related information, which is similar to what is desired for the usecases of Baidu Tera, such as Baidu Spider.
The row keys in a table are arbitrary strings. Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system's behavior in the presence of concurrent updates to the same row.
Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type. A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. Timestamps are 64-bit integers. They can be assigned by Tera, in which case they represent 'real time' in microseconds, or be explicitly assigned by client applications.
Tera uses a LSM-tree index instead of a skip list, in that it has a large volume of inserts and skip lists do not perform as well when the tables are reduced as key value pairs.
Tera supports snapshot isolation for its tables.
All the data of Tera is stored in Baidu File System and use this distributed file system to guarantee the durability of data by a strategy of returning success of a write only after over 3 copies are made in BFS.
Logs are always written after data is written to the memory. The write-ahead log is usually only written and barely read, for fault tolerance considerations.
Tera have these SST files as an auxiliary logging method, which is read-only static files produced by memory dump or compaction.
Tera accepts general SQL queries. It also provides other APIs, for example, C++/Java/Python/REST-ful APIs.
Tera uses a RAM + SSD + SATA disk hybrid storage architecture. The typical single machine performance of Tera is 30000QPS for random reads of 1KB and 30000QPS of random writes of 1KB. Data is always written to memory first, and will be made durable to disks later on.
One feature of Tera is that it supports row key as well as column key, but essentially it is a hybrid database. For durable storage, Baidu developed based on leveldb, extending it with several components to suit their own needs. For example, one is DBTable, which provides the capability of column major storage optimization, through forming localitygroups of records. Localitygroups and tablets are two kinds of ways to split tables in Tera. DBTable provides a facade on top of several
DBImpl and uses WAL and MVCC to guarantee atomic reads and writes. DFSENv component provides a distributed file system (on top of BFS), which can work with HDFS and NFS, etc. MemTableOnLevelDB is LSM-tree memory table instead of skiplist, which is designed to provide good performance when the table is expanded as key value pairs.
Baidu Tera is developed upon Baidu File System (BFS), which is a distributed file system designed to support real-time applications based on shared SSD + SATA disks. Basically Baidu Tera does not manage the shared data space itself and BFS is in charge.