Cubrick

Cubrick is a distributed, multidimensional, in-memory DBMS developed for internal use at Facebook. It is designed for low-latency realtime OLAP analysis over large datasets. It was built from scratch and supports only the features that its realtime analysis use cases require.

Checkpoints

Non-Blocking

Cubrick delegates data persistence to an external disk-based key-value store (e.g., RocksDB). In-memory data is periodically and asynchronously flushed to that store.

Compression

Dictionary Encoding

String fields in Cubrick are dictionary encoded, for both *dimensions* (i.e., indices) and *metrics* (i.e., values). Internally, Cubrick processes string fields as their encoded integers and converts them back to strings only when returning results to users. Cubrick also uses BESS (Bit-Encoded Sparse Structure) encoding to compress the multidimensional index of each *cell* (i.e., the group of metrics that share the same combination of dimension values).
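The two encodings can be illustrated with a minimal sketch. It is not Cubrick's implementation: the `Dictionary` class, the `bess_pack` and `bess_unpack` helpers, and the per-dimension bit widths are hypothetical names and parameters chosen for illustration, and the sketch assumes a BESS index is simply the per-dimension integer codes packed side by side into one integer.

```python
class Dictionary:
    """Hypothetical dictionary encoder: distinct string <-> small integer code."""

    def __init__(self):
        self._codes = {}    # string -> code
        self._values = []   # code -> string

    def encode(self, value):
        # Assign the next free code the first time a string is seen.
        if value not in self._codes:
            self._codes[value] = len(self._values)
            self._values.append(value)
        return self._codes[value]

    def decode(self, code):
        return self._values[code]


def bess_pack(coords, bits):
    """Pack one integer per dimension into a single integer, bits[i] bits each."""
    packed = 0
    for coord, width in zip(coords, bits):
        if not 0 <= coord < (1 << width):
            raise ValueError("coordinate does not fit in its bit width")
        packed = (packed << width) | coord
    return packed


def bess_unpack(packed, bits):
    """Inverse of bess_pack: recover the per-dimension integers."""
    coords = []
    for width in reversed(bits):
        coords.append(packed & ((1 << width) - 1))
        packed >>= width
    return coords[::-1]


region = Dictionary()
codes = [region.encode(v) for v in ["us", "br", "us"]]  # repeated strings share a code
packed = bess_pack([codes[1], 5], bits=[2, 3])          # two dimensions in 5 bits
```

Because every string is replaced by a small integer before packing, the index of a cell costs only as many bits as its dimensions need, and the strings are recovered through `decode` only when results are returned.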

Data Model

Column Family

Cubrick stores data in *bricks* (i.e., partitions) in a column-oriented way. In each brick, each column has a dynamic vector to store the metrics or the BESS encoded indices. Cells in a brick are unordered, and the ingested cells are only appended to the end of the brick.
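As a rough sketch of this layout, assuming 64-bit packed indices and floating-point metrics (the `Brick` class, its method names, and the metric names are hypothetical, not Cubrick's API):

```python
from array import array


class Brick:
    """Hypothetical append-only, column-oriented partition."""

    def __init__(self, metric_names):
        # One growable vector for the BESS-encoded indices ...
        self.bess = array("Q")
        # ... and one growable vector per metric column.
        self.metrics = {name: array("d") for name in metric_names}

    def append(self, bess_index, **metric_values):
        # Ingested cells always go to the end; no ordering is maintained.
        self.bess.append(bess_index)
        for name, column in self.metrics.items():
            column.append(metric_values[name])

    def __len__(self):
        return len(self.bess)


brick = Brick(["likes", "comments"])
brick.append(13, likes=3, comments=1)
brick.append(2, likes=7, comments=0)
```

Position `i` in every vector belongs to the same cell, so a scan over one metric touches only that metric's vector.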

Indexes

Hash Table

Cubrick uses *Granular Partitioning* as its main indexing approach to organize *bricks* (i.e., partitions) within a *cube* (i.e., table). A conversion function maps each predefined multidimensional range to an integer, so the multidimensional index of a cell determines its partition id. The mapping from partition ids to storage nodes is maintained by consistent hashing.
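The two mappings can be sketched as follows. This is an illustration under stated assumptions rather than Cubrick's actual conversion function: it assumes each dimension has a fixed cardinality and range size and numbers the ranges in row-major order, and it uses a generic consistent-hash ring with virtual nodes. The names `brick_id`, `HashRing`, `node_for`, and `vnodes` are hypothetical.

```python
import bisect
import hashlib


def brick_id(coords, range_sizes, cardinalities):
    """Map per-dimension coordinates to a partition id (row-major over ranges)."""
    bid = 0
    for coord, size, card in zip(coords, range_sizes, cardinalities):
        ranges_in_dim = -(-card // size)           # ceiling division
        bid = bid * ranges_in_dim + coord // size  # index of the range holding coord
    return bid


def _hash(key):
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, bid):
        # First ring position clockwise from the partition's hash, wrapping around.
        i = bisect.bisect(self._keys, _hash(str(bid))) % len(self._ring)
        return self._ring[i][1]


# Two dimensions with cardinalities 8 and 4, split into ranges of 4 and 2.
bid = brick_id([5, 3], range_sizes=[4, 2], cardinalities=[8, 4])
ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for(bid)
```

Computing the partition id needs only arithmetic on the cell's coordinates, so no per-cell index structure has to be maintained, and the ring lets nodes join or leave while moving only a fraction of the partitions.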

Joins

Not Supported

Cubrick assumes the ingested data are denormalized, and it does not support joins.

Logging

Not Supported

Cubrick does not support logging. It is purely in-memory and relies on an external disk-based key-value store (e.g., RocksDB) for data persistence.

Parallel Execution

Intra-Operator (Horizontal)

Each query is sent to all nodes, and every node processes the same query locally on its own data.
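A minimal sketch of this pattern for a grouped sum is shown below. Threads stand in for nodes, and the final merge of per-node partial results is an assumption about how the results are combined, not something described above. The function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def local_sum_by_key(cells):
    """Run the query on one node's data: sum the metric per group key."""
    partial = Counter()
    for key, value in cells:
        partial[key] += value
    return partial


def run_everywhere(node_data):
    """Send the same query to every node, then merge the partial sums."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_sum_by_key, node_data))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return dict(total)


result = run_everywhere([
    [("us", 1), ("br", 2)],  # cells held by the first node
    [("us", 3)],             # cells held by the second node
])
```

Sums merge by simple addition; other aggregates such as averages would have to carry extra partial state (for example a sum and a count) through the merge step.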

Query Execution

Materialized Model

Each step of a query generates its intermediate results in full before execution moves on to the next step.

Query Interface

SQL

A subset of SQL is supported, including filtering, aggregations, GROUP BY, ORDER BY, and HAVING clauses, and some arithmetic and logical expressions. Nested queries and joins are not supported.

Storage Architecture

In-Memory

Cubrick stores all data in memory to provide low-latency OLAP analysis. Data persistence is supported, but it is delegated to an external disk-based key-value store.

Storage Model

Hybrid

Records are partitioned by a predefined conversion function and stored on nodes determined by consistent hashing (row-oriented partitioning). Within each partition, records are stored in a column-oriented way (column-oriented storage).

System Architecture

Shared-Nothing

Cubrick is designed to run on a shared-nothing cluster.

Website

https://research.fb.com/cubrick-a-new-multidimensional-in-memory-dbms/

Developer

Facebook

Country of Origin

US

Start Year

2016

Project Type

Industrial Research

Licenses

Proprietary
