Scuba

Scuba is a distributed in-memory database built at Facebook. It is a time-series analysis database designed to serve real-time analytical queries with approximate results. Scuba keeps data ingestion latency low and handles its high ingestion volume by evicting old data from memory.

Compression

Dictionary Encoding

Scuba uses dictionary compression for strings and variable length encoding for integers.
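
The two techniques named above can be illustrated with a minimal sketch. The function names are hypothetical, and the byte layout shown (7 payload bits per byte, LEB128 style) is an assumption: the source only says that integers use a variable length encoding, not which one.

```python
def dictionary_encode(values):
    """Replace each string with a small integer id.

    Returns (dictionary, ids), where dictionary[i] is the string for id i.
    """
    dictionary = []   # id -> string
    index = {}        # string -> id
    ids = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index[value])
    return dictionary, ids


def varint_encode(n):
    """Encode a non-negative integer using 7 bits per byte (assumed layout).

    The high bit of each byte signals that more bytes follow.
    """
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)          # last byte
            return bytes(out)


def varint_decode(data):
    """Decode a single integer produced by varint_encode."""
    result = 0
    shift = 0
    for b in data:
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7
    raise ValueError("truncated varint")
```

With this sketch, a column of repeated strings is stored once in the dictionary plus a list of small ids, and small integers take one byte instead of a fixed-width word.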

Data Model

Relational

Scuba follows a relational model with some key differences. It does not support a CREATE TABLE statement; the table's schema is inferred from the ingested data. Since the data is partitioned, the schema for the same table can differ across nodes. This difference is reconciled during aggregation.
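
As a rough illustration of schema inference and reconciliation, the sketch below infers a schema per node from the rows it holds and merges the per-node schemas by taking the union of their columns. The source does not describe how Scuba reconciles schemas, so the union rule, the "first type seen wins" policy, and all names here are assumptions.

```python
def infer_schema(rows):
    """Infer {column: type name} from the rows stored on one node."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            # Assumption: the first type seen for a column wins.
            schema.setdefault(column, type(value).__name__)
    return schema


def reconcile(schemas):
    """Merge per-node schemas into one by taking the union of columns."""
    merged = {}
    for schema in schemas:
        for column, type_name in schema.items():
            merged.setdefault(column, type_name)
    return merged


# Two nodes holding rows of the same table with different columns.
node_a = [{"time": 1, "host": "web1"}]
node_b = [{"time": 2, "latency_ms": 30}]
table_schema = reconcile([infer_schema(node_a), infer_schema(node_b)])
```

Under these assumptions, a column that is missing on one node would simply be treated as absent (null) for that node's rows when results are combined.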

Indexes

Not Supported

No table has an associated index. Instead, the leaf nodes (the nodes that store data) record the time range covered by the data they hold, so that a query can skip data that falls outside its requested time range.
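
The time-range skipping described above can be sketched as follows. The `Block` type, its field names, and the half-open query interval are hypothetical choices for illustration, not Scuba's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A chunk of rows on a leaf, tagged with the time range it covers."""
    min_time: int
    max_time: int
    rows: list


def scan(blocks, start, end):
    """Return rows with start <= time < end and the number of blocks scanned.

    Blocks whose recorded time range cannot overlap the query are skipped
    without looking at their rows.
    """
    hits = []
    scanned = 0
    for block in blocks:
        if block.max_time < start or block.min_time >= end:
            continue  # no overlap with the query range
        scanned += 1
        hits.extend(r for r in block.rows if start <= r["time"] < end)
    return hits, scanned


blocks = [
    Block(0, 9, [{"time": 5}]),
    Block(10, 19, [{"time": 12}, {"time": 18}]),
]
hits, scanned = scan(blocks, 10, 15)
```

Here only the second block is scanned, because the first block's range ends before the query starts.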

Logging

Not Supported

Scuba is an analytical database and therefore does not need to support logging. However, it backs up all ingested data to disk for future recovery.

Query Interface

Custom API, SQL, HTTP / REST

Scuba supports a web-based interface, a SQL interface through the command line, and a custom Thrift-based API for running queries from application code. All the queries originating from the SQL interface and the web interface ultimately rely on the Thrift interface to query the database backend.

Storage Architecture

In-Memory

Scuba allocates contiguous space in memory for a table (Shared Memory Layout), since the size and the contents are known at the time of allocation. To cope with high rates of data ingestion, Scuba evicts old data: a row is evicted once it becomes too old, using a variant of TTL. Where old data must be kept around, Scuba supports subsampling, in which a fraction of the old data is retained for analytical purposes.
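
A minimal sketch of age-based eviction with optional subsampling is shown below. The function name, the `ttl` and `sample_rate` parameters, and the use of uniform random sampling are assumptions; the source only says that a variant of TTL is used and that a fraction of old data can be retained.

```python
import random


def evict(rows, now, ttl, sample_rate=0.0, rng=random):
    """Keep rows no older than ttl; keep each older row with probability
    sample_rate (assumed uniform sampling), and drop the rest.
    """
    kept = []
    for row in rows:
        age = now - row["time"]
        if age <= ttl:
            kept.append(row)              # still fresh
        elif rng.random() < sample_rate:
            kept.append(row)              # old, but retained as a sample
    return kept


rows = [{"time": 0}, {"time": 90}, {"time": 100}]
fresh_only = evict(rows, now=100, ttl=20)
everything = evict(rows, now=100, ttl=20, sample_rate=1.0)
```

Note that calling this sketch repeatedly would re-sample rows that already survived an earlier pass; a real system would need to track which old rows have already been sampled.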

Storage Model

N-ary Storage Model (Row/Record)

Because its primary workload is analytical, Scuba is considering a shift to a columnar layout in the future.

System Architecture

Shared-Nothing

Scuba partitions data across nodes and, upon receiving a query, aggregates results from all nodes containing the requested data. The architecture is hierarchical, with leaf nodes storing the data. A query originates at a single root node and passes top-down through intermediate aggregator nodes. The leaf nodes scan their locally stored data and return results, which are aggregated bottom-up.
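
The top-down fan-out and bottom-up aggregation can be sketched with a small tree of objects. The `Leaf` and `Aggregator` classes and the single count-per-group query are hypothetical simplifications; in the real system these roles run on separate machines and communicate over the network.

```python
from collections import Counter


class Leaf:
    """Stores a partition of the rows and scans it locally."""

    def __init__(self, rows):
        self.rows = rows

    def query(self, group_by):
        # Local scan: count rows per value of the group_by column.
        return Counter(row[group_by] for row in self.rows if group_by in row)


class Aggregator:
    """Forwards a query to its children and merges their partial results."""

    def __init__(self, children):
        self.children = children

    def query(self, group_by):
        total = Counter()
        for child in self.children:          # top-down fan-out
            total += child.query(group_by)   # bottom-up merge
        return total


root = Aggregator([
    Aggregator([Leaf([{"host": "a"}, {"host": "b"}]), Leaf([{"host": "a"}])]),
    Aggregator([Leaf([{"host": "b"}, {"host": "b"}])]),
])
counts = root.query("host")
```

Because counts can be merged by simple addition, each aggregator only needs the partial results of its children, never the underlying rows.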
