Tera is a 100 PBs level, real-time high performance distributed NoSQL database. The storage architecture of Tera is based on Google's BigTable and targeted for real-time applications, mainly for Baidu's own services such as its very own web crawling engine, Spider. Design choices are made based on the challenging contexts of Baidu's webpage indexing services, leading to requirements of scalability and very high real-time performance (over hundreds of billions QPS). Besides the web crawling engine, Tera is widely used in other services of Baidu with various requirements, ranging from throughput-oriented applications to latency-sensitive service, including web indexing, WebPage DB, LinkBase DB, etc.
Tera is able to support hundreds of millions QPS in terms of random reads and writes, 100,000 billions level of records and hundreds of PB volumns of data. It replaced the former Hadoop based storage system in Baidu, which lacked scalability and timeliness due to its nature of heavy MapReduce programs usage. Tera supports a global ordering of all records, which facilitates the statistical analysis of intervals, as well as automatic loading balancing and MVCC.
Baidu, as it is known for, is a company focusing on providing web searching and indexing services mainly to Chinese internet. In 2014, the number of web pages indexed by Baidu is roughly 100,000 billion pages and only a tenth of it are considered valuable in the criteria used by Baidu. Every day there were 10 billion web pages generated around that time and in average 120 links was included in each web page. These numbers led to the fact that the web crawling engine of Baidu will have to process 1,200 billion links everyday, which is a very tough problem to deal with in terms of either scalability or the speed of new records taking effect.
Before Tera came into existence, the links data are processed by a series of MapReduce batched programs based on Hadoop. The issues with the former system are mainly linear scalability and timeliness. In 2014, the processing of 100 billion links in Baidu will usually use 500 servers, which means 1,000 billion links will require 50,000 servers, and typically 10 passes of MapReduce programs took 2 days at the time, thus incremental and streaming processing is necessary in terms of performance. The new design of Baidu is based on these judgements. The whole processing process will have to be real-time (from being discoverd to stored in the database and to picked out by the scheduler). There are a lot of updates happening in this process, so the new storage system will have to support hundreds of millions QPS in terms of random reads and writes.
Tera is designed to support services of 100,000 billions of records, hundreds of PB volumns of data and hundreds of millions QPS and it has become a core module for Baidu spider and WebPage database ever since it came out. Baidu spider, as the name indicates, is the web crawling spider of Baidu, which collects web pages and submits the updates to the Baidu index. Baidu spider 3.0 is an important upgrade in 2014 for Baidu spider, which made an effor to refactor the former offline, whole quantity calculation to a real-time, incremental calculation system, meanwhile improving the diversity and uniqueness of the results, as well as the real-time performance of Baidu index. Tera is tested and found to be performing reasonably well in terms of scalability, single machine performance, as well as real-time response speed, etc.
Multi-version Concurrency Control (MVCC)
Tera uses a MVCC concurrency control strategy.
Multiple versions are kept for the history of crawling status regarding links. For some scenarios, timestamps are critical for judgement to prevent the DBMS from overwriting new data with old data. This is to support analysis of the historical data and rollback of transactions.
All the data of Tera is stored in Baidu File System and use this distributed file system to guarantee the durability of data by a strategy of returning success of a write only after over 3 copies are made in BFS.
Logs are always written after data is written to the memory. The write-ahead log is usually only written and barely read, for fault tolerance considerations.
Tera have these SST files as an auxiliary logging method, which is read-only static files produced by memory dump or compaction.
One feature of Tera is that it supports row key as well as column key, but essentially it is a hybrid database. For durable storage, Baidu developed based on leveldb, extending it with several components to suit their own needs. For example, one is DBTable, which provides the capability of column major storage optimization, through forming localitygroups of records. Localitygroups and tablets are two kinds of ways to split tables in Tera. DBTable provides a facade on top of several DBImpl
and uses WAL and MVCC to guarantee atomic reads and writes. DFSENv component provides a distributed file system (on top of BFS), which can work with HDFS and NFS, etc. MemTableOnLevelDB is LSM-tree memory table instead of skiplist, which is designed to provide good performance when the table is expanded as key value pairs.
The data model of Tera is based upon Google BigTable. It is essentially a collection of sparse, distributed, multidimensional tables (sorted maps). It is indexed by a row key, a column key and a timestamp and each value in the table is an uninterpreted array of bytes in the form of:
(row:string, (column family+qualifier):string, time:int64) string
.
Or a table in tera can be interpreted as a data structure like map<RowKey, map<ColummnFamily:Qualifier, map<Timestamp, Value>>>
.
In the original paper of Google BigTable, it is said that one factor that drove their design decisions was to store a large collection of web pages or related information, which is similar to what is desired for the usecases of Baidu Tera, such as Baidu Spider.
The row keys in a table are arbitrary strings. Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system's behavior in the presence of concurrent updates to the same row.
Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type. A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. Timestamps are 64-bit integers. They can be assigned by Tera, in which case they represent 'real time' in microseconds, or be explicitly assigned by client applications.