OceanBase is a distributed, scalable, shared-nothing relational DBMS developed by Alibaba. Its goal is to serve financial scenarios, which are demanding in performance, cost, and scalability and require a database with high availability and strong consistency. It is designed and optimized for diverse OLTP applications on structured relational data, though its shared-nothing architecture also supports OLAP applications.
In 2010, OceanBase team leader Zhenkun Yang joined Alibaba. Because concurrency in Alibaba's business kept increasing and the development cycle for building a database for new transaction workloads kept shrinking, Yang found that the existing DBMS could not support Alibaba's rapidly growing workloads. He decided to abandon the traditional DBMS framework and develop a new DBMS from scratch. At the very beginning, he set out three core principles for the new product: (1) distributed, (2) low cost, (3) high reliability. In 2013, Alipay decided to abandon Oracle. Because MySQL could not ensure strong consistency between the active server and the standby server, OceanBase got its first opportunity. From then on, OceanBase was no longer open source. From 2014 to 2016, the team spent three years developing OceanBase 1.0. It is the first and only commercial DBMS that supports distributed transactions. In 2017, OceanBase started to serve external customers. In 2019, OceanBase beat Oracle and took first place in the TPC-C benchmark.
OceanBase supports materialized views. Its first production workload, Taobao Favorites, was built by leveraging materialized views.
OceanBase uses MVCC for concurrency control. If an operation involves a single partition, or multiple partitions on one ObServer, it reads a snapshot on that ObServer. If an operation involves partitions on multiple ObServers, it performs a distributed snapshot read.
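The single-ObServer case of a snapshot read can be illustrated with a toy multi-version store. This is a minimal sketch, not OceanBase's implementation: the `VersionedStore` class, its integer commit timestamps, and the method names are all hypothetical, and the distributed snapshot read across several ObServers is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Toy multi-version store (hypothetical): key -> list of (commit_ts, value)."""
    versions: dict = field(default_factory=dict)
    last_commit_ts: int = 0

    def write(self, key, value):
        # Every committed write receives a new, strictly larger timestamp
        # and is appended as a new version instead of overwriting the old one.
        self.last_commit_ts += 1
        self.versions.setdefault(key, []).append((self.last_commit_ts, value))
        return self.last_commit_ts

    def snapshot(self):
        # A snapshot is identified by the newest commit timestamp it can see.
        return self.last_commit_ts

    def read(self, key, snapshot_ts):
        # Return the newest version committed at or before the snapshot.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None


store = VersionedStore()
store.write("balance", 100)
snap = store.snapshot()              # the reader takes its snapshot here
store.write("balance", 70)           # a later write creates a new version
old_value = store.read("balance", snap)
new_value = store.read("balance", store.snapshot())
```

Because writers append versions rather than overwrite them, the reader holding `snap` keeps seeing the value that was current when its snapshot was taken, without blocking the later write.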
OceanBase is a distributed, disk-oriented DBMS. From the perspective of storage management, OceanBase is divided into multiple Zones. Each Zone is a collection of physical server nodes called ObServers. Several Zones store replicas of the same data and keep them synchronized using the Paxos distributed consensus algorithm. OceanBase also supports horizontal partitioning and automatically balances partition load across ObServers. Data files are stored in two kinds of blocks, `Macro Block` and `Micro Block`. A `Macro Block` (2MB) is the smallest unit for write operations. A `Micro Block` (16KB before compression) is the smallest unit for read operations.
From the perspective of resource management, each database instance is treated as a tenant in OceanBase. Every tenant is allocated a unit pool containing units. Each unit is a group of computation and storage resources on an ObServer, and each tenant can have at most one unit on any one ObServer. Conceptually, a unit is a container for replicas.
OceanBase implements a block cache for `Micro Block` units on disk to accelerate large scan queries. It also implements a row cache for rows in the block cache to accelerate small get queries.
The storage data structure of OceanBase is based on the LSM-Tree design in LevelDB. A data modification is first recorded in the `MemTable` (dynamic data in memory) using a redo linked list, whose head is linked to the corresponding block in the block cache. During the off-peak period at night, or when the size of the `MemTable` reaches a threshold, OceanBase merges the `MemTable` into the `SSTable` (static data on disk) using one of the following merge algorithms:
(1) Major Compaction: Read all the static data from disk, merge it with the dynamic data, and write the result back to disk as new static data. This is the most expensive algorithm and is typically used by OceanBase after a DDL operation.
(2) Minor Compaction: Reuse every `Macro Block` that has not been modified. For a dirty `Macro Block`, directly copy the `Micro Block` units that have not been modified. This is the default algorithm OceanBase adopts.
(3) Alternate Compaction: The Zones that store the replicas about to be merged take turns blocking and merging. While one Zone is merging data, queries on the replica being merged are sent to other Zones that store the same replica. The Zone also warms its cache after compaction. OceanBase adopts this algorithm when it has to merge data during a peak period. It is orthogonal to minor compaction and major compaction and should be used in combination with one of them.
(4) Dump: Dump the `MemTable` to disk as a `Minor SSTable` and merge it with the previously dumped `Minor SSTable`. When the `Minor SSTable` grows large enough, merge it into the `SSTable` using one of the aforementioned compaction algorithms. This lightweight approach is used when the dynamic data is significantly smaller than the static data.
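The block-reuse idea behind minor compaction, as described above, can be sketched in a few lines. This is an illustrative toy under stated assumptions, not OceanBase's code: the `build_sstable` and `minor_compaction` functions are hypothetical, a macro block is modeled as a small immutable tuple of sorted rows, and reuse at the `Micro Block` level, deletions, and block splitting are left out.

```python
from bisect import bisect_right


def build_sstable(rows, block_size=2):
    """Split sorted (key, value) rows into fixed-size toy macro blocks."""
    rows = sorted(rows)
    return [tuple(rows[i:i + block_size]) for i in range(0, len(rows), block_size)]


def minor_compaction(sstable, memtable, block_size=2):
    """Merge MemTable updates into the SSTable, rewriting only dirty blocks."""
    if not sstable:
        return build_sstable(memtable.items(), block_size), 0

    # Route every update to the block whose key range should hold it.
    first_keys = [block[0][0] for block in sstable]
    updates_per_block = {}
    for key, value in memtable.items():
        idx = max(bisect_right(first_keys, key) - 1, 0)
        updates_per_block.setdefault(idx, {})[key] = value

    new_sstable, reused = [], 0
    for idx, block in enumerate(sstable):
        if idx not in updates_per_block:
            new_sstable.append(block)          # clean block: reused as-is
            reused += 1
            continue
        merged = dict(block)                   # dirty block: rewritten
        merged.update(updates_per_block[idx])
        # Simplification: a rewritten block may exceed block_size here.
        new_sstable.append(tuple(sorted(merged.items())))
    return new_sstable, reused


sstable = build_sstable([(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (6, "f")])
memtable = {3: "C", 7: "g"}                    # one update, one new row
new_sstable, reused = minor_compaction(sstable, memtable)
```

In this example only the blocks touched by keys 3 and 7 are rewritten, while the first block is carried over unchanged. A major compaction, by contrast, would rewrite every block regardless of whether it was modified.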
OceanBase neither requires nor relies on checkpoints. When a server node shuts down, OceanBase can still ensure data consistency without a checkpoint. There are two reasons for this: (1) OceanBase is deployed across Zones with multiple replicas, so it recovers data through a majority-vote mechanism. (2) The `MemTable` and `SSTable` storage architecture does not require frequent checkpoints.
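The majority argument can be made concrete with simple quorum counting. The sketch below is not Paxos and not OceanBase's recovery path. It only shows why an entry persisted on a majority of replicas survives the loss of one Zone. The zone names, log contents, and function names are hypothetical.

```python
from collections import Counter


def majority(replica_count):
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def committed_entries(replica_logs):
    """Log entries that are persisted on a majority of replicas."""
    counts = Counter(entry for log in replica_logs.values() for entry in set(log))
    needed = majority(len(replica_logs))
    return {entry for entry, count in counts.items() if count >= needed}


# Hypothetical redo-log entry ids persisted in each Zone.
replica_logs = {
    "zone1": [1, 2, 3, 4],   # entry 4 reached only this replica
    "zone2": [1, 2, 3],
    "zone3": [1, 2],
}
committed = committed_entries(replica_logs)

# Simulate losing zone1: every committed entry is still held by a survivor.
survivors = {zone: log for zone, log in replica_logs.items() if zone != "zone1"}
still_available = set().union(*(set(log) for log in survivors.values()))
```

With three replicas a majority is two, so any committed entry exists on at least one of the two surviving replicas after a single failure.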
OceanBase uses column compression for the `SSTable` on disk. It implements several encoding algorithms and automatically chooses the most suitable one for each column. As a result, the data takes only about half as much space as it does in MySQL.
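Per-column encoding selection can be sketched as "encode the column every way you know, keep the smallest result". The three encodings below (run-length, dictionary, delta) are common columnar encodings chosen for illustration and are not confirmed to be the ones OceanBase implements. The function names are hypothetical, and the length of the JSON serialization is only a stand-in for a real size estimate.

```python
import json


def rle_encode(values):
    """Run-length encoding: a list of [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def dict_encode(values):
    """Dictionary encoding: the distinct values plus one small index per row."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def delta_encode(values):
    """Delta encoding for integers: first value, then successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def choose_encoding(values):
    """Try each applicable encoding and return (best_name, sizes_by_name)."""
    candidates = {
        "plain": list(values),
        "rle": rle_encode(values),
        "dict": dict_encode(values),
    }
    if values and all(isinstance(v, int) for v in values):
        candidates["delta"] = delta_encode(values)
    # Assumed cost model: size of the JSON text of the encoded column.
    sizes = {name: len(json.dumps(enc)) for name, enc in candidates.items()}
    return min(sizes, key=sizes.get), sizes


country_best, _ = choose_encoding(["CN"] * 100)                      # one long run
order_id_best, _ = choose_encoding(list(range(1000000, 1000100)))    # steadily increasing ids
city_best, _ = choose_encoding(["Hangzhou", "Beijing"] * 50)         # few distinct strings
```

The three sample columns have different shapes, so each ends up with a different encoding, which is the reason a per-column choice pays off compared with one encoding for the whole table.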
C, C++, Java