OceanBase

OceanBase is a distributed, scalable, shared-nothing relational DBMS developed by Alibaba. The goal of OceanBase is to serve for financial scenarios which are demanding on performance, cost, scalability, and requires database with high availability and strong consistency. It is designed and optimized for OLTP applications on relational structured data, though its shared-nothing structure also supports OLAP applications.

History

In 2010, OceanBase team leader Zhenkun Yang joined Alibaba. Because of the increasing concurrency in Alibaba's business and the shortened development cycle to build a database for new transaction, Yang found that original DBMS can't support for rapidly growing workloads in Alibaba. He decided to abandon the traditional DBMS framework and develop a novel DBMS from scratch. At the very beginning, he presented three kernel principles for their new products: (1) distributed (2) low cost (3) high reliability.
In 2013, Alipay decided to stop using Oracle. Since the alternate choice MySQL can't ensure strong consistency between active server and standby server, OceanBase got its first opportunity. From now on, OceanBase is not open sourced anymore.
From 2014 to 2016, this team spent three years developing OceanBase 1.0. It is the first commercial DBMS which supports distributed transactions.
From 2017, OceanBase started to serve for external customers.
In 2019, OceanBase beat Oracle and won the first place in TPC-C test.

Checkpoints

Non-Blocking Consistent

OceanBase adopts consistent non-blocking checkpoint mechanism. Their team is currently working on the development of fuzzy non-blocking checkpoint mechanism implemented in LSM tree structure and plan to release that in next version.

Data Model

Relational

OceanBase is a relational database. It supports relational data model and is compatible with MySQL.

System Architecture

Shared-Nothing

OceanBase adopts shared-nothing system architecture. It stores replica of each partition on at least three server nodes in different Zones. Each server node has its own SQL engine and storage engine. The storage engine only accesses the local data on that node. The SQL engine accesses the global schema and generates the distributed query plan. Query executors visit the storage engine of each node to distribute and gather data among them to complete the query. For each database instance, it sets one server node as active root server to provide root service which monitors the health of all the nodes related to this database. The root service is responsible for load balance, data consistency, error recovery, etc. If this active root server shuts down, OceanBase automatically promotes one standby root server to become new active root server.

Query Interface

SQL

OceanBase supports standard SQL query interface, though there are slight differences.
The detailed OceanBase SQL syntax doc can be found in citations.

Storage Organization

Heaps

Views

Materialized Views

OceanBase supports materialized view. OceanBase is a commercial DBMS mainly serves corporate clients with large-scale data storage and high QPS, so it implements materialized view to increase the throughput and reduce latency in order to reduce the number of servers needed(save hardware cost).

Storage Model

N-ary Storage Model (Row/Record) Hybrid

Originally, OceanBase is designed to only support N-ary Storage Model. From OceanBase 2.0, it supports hybrid storage model. Attributes belong to the same tuple are stored in the same block, but the tuples in the same block are compressed and stored in columnar model.

Concurrency Control

Multi-version Concurrency Control (MVCC)

OceanBase adopts MVCC to do concurrency control. If the operation involves single partition or multiple partitions on single server node, it will read the snapshot of that server node. If the operation involves partitions on multiple server nodes, it executes distributed snapshot read.

Parallel Execution

Intra-Operator (Horizontal) Inter-Operator (Vertical)

OceanBase supports both vertical and horizontal parallelism, which increases throughput and reduces latency.

Query Compilation

Code Generation

OceanBase implements code generator to translate the logical execution plan into physical execution plan. OceanBase caches these plans to avoid re-compiling them.

Logging

Physiological Logging

OceanBase uses physiological logging to records all the modification on MemTable. It uses Paxos consensus algorithm to synchronize log replicas on different server nodes.

Indexes

B+Tree

For index structure, the only available value for the parameter of index type in OceanBase is B+Tree when creating index.
For index range, as OceanBase splits table into partitions, it supports local index for local partitions and global index for global table. OceanBase also supports secondary index. It combines the index keys and table primary key for secondary index.

Stored Procedures

Supported

From OceanBase 2.0, Stored Procedures written in SQL are supported. It allows much more complicated queries by organizing the data manipulation and multiple simple queries into procedural code. Stored Procedure is an extension of SQL in relational DBMS.

Isolation Levels

Read Committed Serializable Snapshot Isolation

From OceanBase 1.0, it supports read committed. Read committed is the default isolation level.
From OceanBase 2.0, it supports snapshot isolation. From OceanBase 2.2, it supports serializable.

Foreign Keys

Supported

OceanBase supports foreign key to constrain data consistency, which is an important advantage compared to many other distributed DBMS. Foreign key reference is defined when creating new table schema.

Query Execution

Tuple-at-a-Time Model

OceanBase uses iterator model to executes queries by default. Its default iterator interface is Tuple-at-a-Time Model.

Compression

Dictionary Encoding Delta Encoding Run-Length Encoding Prefix Compression

OceanBase uses column compression. It implements several encoding algorithms and it automatically chooses the most suitable one for every column. It adopts column compression to leverage better data similarity including same data type, same value range, etc.

Joins

Nested Loop Join Hash Join Sort-Merge Join Index Nested Loop Join

OceanBase supports three kinds of join algorithms: Nested Loop Join, Sort-Merge Join, Hash Join. Sort-Merge Join and Hash Join only works for equijoin scenario while Nested Loop Join works under any join conditions. For nested loop join, OceanBase supports both sequential scan and index scan for inner table. OceanBase also implements Blocked Nested Loop Join.

Storage Architecture

Disk-oriented

OceanBase is a distributed disk-oriented DBMS.
From the perspective of storage management, OceanBase is divided into multiple Zones. Each Zone is a collection of physical server nodes. Several Zones store replicas of same partitions and synchronize logs using Paxos distributed consensus algorithm. Each Zone has multiple server nodes, ObServers. OceanBase also supports horizontal partitions and automatically balance partition load across ObServers. There are two kinds of blocks for data storage, Macro Block and Micro Block. Macro Block(2MB) is the smallest unit for write operation. Micro Block(16KB before compression) is the smallest unit for read operation.
From the perspective of resource management, each database instance is considered as a tenant in OceanBase. Every tenant is allocated with a unit pool containing units. Each unit is a group of computation and storage resource on a ObServer. Each tenant has at most one unit on one ObServer. Conceptually, unit is receptacle for replica.
OceanBase implements block cache for Micro Block to accelerate big scan query. It also implements a row cache for rows in block cache to accelerate small get query.
The storage data structure of OceanBase is designed based on LSM-Tree in LevelDB. The data modification is first recorded in MemTable (dynamic data in memory) using linked list, and the head is linked to the corresponding block in block cache. During the low peak period at night or when the size of MemTable reaches the threshold, OceanBase will merges the MemTable to SSTable(static data in disk) using one of following merge algorithms:
(1) Major Compaction: Read all the static data from disk, merge it with the dynamic data and then write back to disk as new static data. This is the most expensive algorithm and is typically used after DDL operation.
(2) Minor Compaction: Reuse all the Macro Blocks which are not dirty. This is the default algorithm OceanBase adopts.
(3) Alternate Compaction: When one ObServer is about to compact one partition, queries on the merged partition will be sent to ObServers in other Zones storing replicas of the same partition. The merged Zone warms the cache after compaction. When having to merge data during peak period, OceanBase adopts this algorithm. This algorithm is orthogonal to minor compaction and major compaction and should be used in combination with one of them.
(4) Dump: Dump the MemTable to disk as Minor SSTable and merge it with the previous dumbed Minor SSTable. When the size of Minor SSTable exceeds the threshold, OceanBase merges it to SSTable using aforementioned compaction algorithm. This lightweight approach is used when the dynamic data is significantly less than static data.

Revision #16 | Updated 12/05/2019 10:39 p.m.

OceanBase

History

Checkpoints

Data Model

System Architecture

Query Interface

Storage Organization

Views

Storage Model

Concurrency Control

Parallel Execution

Query Compilation

Logging

Indexes

Stored Procedures

Isolation Levels

Foreign Keys

Query Execution

Compression

Joins

Storage Architecture

Website

Source Code

Tech Docs

Developer

Country of Origin

Start Year

Project Type

Written in

Supported languages

Compatible With

Operating Systems

Licenses