Scuba uses dictionary compression for strings and variable-length encoding for integers.
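As a rough illustration of the two techniques named above, the following sketch dictionary-encodes a string column and encodes non-negative integers in 7-bit groups. The function names are hypothetical, and the LEB128-style byte layout is an assumption; the exact on-disk and in-memory formats Scuba uses are not described here.

```python
def dictionary_encode(values):
    """Replace each string with a small integer ID into a dictionary."""
    dictionary = []      # ID -> string
    ids_by_value = {}    # string -> ID
    encoded = []
    for value in values:
        if value not in ids_by_value:
            ids_by_value[value] = len(dictionary)
            dictionary.append(value)
        encoded.append(ids_by_value[value])
    return dictionary, encoded


def varint_encode(n):
    """Encode a non-negative integer in 7-bit groups (LEB128 style, assumed)."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_decode(data):
    """Decode a single varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")
```

Repeated strings cost one small ID each after their first occurrence, and small integers occupy a single byte instead of a fixed-width word.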
Scuba follows a relational model with some key differences. It does not support a
CREATE TABLE statement; the table's schema is inferred from the ingested data. Since the data is partitioned, the schema for the same table can differ across nodes. This difference is reconciled during aggregation.
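One way to picture schema inference and reconciliation is sketched below. It assumes that reconciliation amounts to taking the union of the per-node schemas and treating a column that a node has never seen as null; how Scuba actually reconciles the difference is not stated here, and all names are hypothetical.

```python
def infer_schema(rows):
    """Infer column -> type name from the rows a node has ingested."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            schema.setdefault(column, type(value).__name__)
    return schema


def reconcile(schemas):
    """Union the per-node schemas (assumed reconciliation strategy)."""
    merged = {}
    for schema in schemas:
        for column, type_name in schema.items():
            merged.setdefault(column, type_name)
    return merged


def project(row, schema):
    """View a row under the merged schema; missing columns become None."""
    return {column: row.get(column) for column in schema}
```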
No table has an associated index. Instead, the leaf nodes (the nodes that store data) record the time range of the data they hold, so that they can skip scanning irrelevant data when a query arrives.
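The time-range check can be sketched as a simple overlap test performed before any rows are read. The `LeafTable` class, its fields, and the `time` column name are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LeafTable:
    min_time: int                 # earliest timestamp stored in this chunk
    max_time: int                 # latest timestamp stored in this chunk
    rows: list = field(default_factory=list)

    def overlaps(self, start, end):
        """True if [start, end] intersects [min_time, max_time]."""
        return self.min_time <= end and start <= self.max_time


def scan(leaf_tables, start, end):
    """Return matching rows and the number of chunks actually scanned."""
    scanned = 0
    matches = []
    for table in leaf_tables:
        if not table.overlaps(start, end):
            continue              # skipped without touching its rows
        scanned += 1
        matches.extend(r for r in table.rows if start <= r["time"] <= end)
    return matches, scanned
```

Only chunks whose recorded range intersects the query window are scanned; the rest are rejected with two integer comparisons.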
Scuba is an analytical database, so it does not need to support logging. However, it backs up all ingested data to disk for future recovery.
Scuba supports a web-based interface, a SQL interface through the command line, and a custom Thrift-based API for running queries from application code. All the queries originating from the SQL interface and the web interface ultimately rely on the Thrift interface to query the database backend.
Scuba allocates contiguous space in memory for a table (Shared Memory Layout), since the size and the contents are known at the time of allocation. To cope with high rates of data ingestion, Scuba evicts old data: a row is evicted once it becomes too old, using a variant of TTL (time to live). Where old data must be kept around, Scuba supports subsampling, in which a fraction of the old data is retained for analytical purposes.
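A minimal sketch of age-based eviction with optional subsampling follows. It assumes uniform random sampling of expired rows and a per-row `time` field; the actual TTL variant and sampling policy are not specified here, and the function and parameter names are hypothetical.

```python
import random


def evict(rows, now, ttl_seconds, keep_fraction=0.0, rng=None):
    """Keep fresh rows; keep roughly keep_fraction of rows older than the TTL."""
    rng = rng or random.Random()
    kept = []
    for row in rows:
        age = now - row["time"]
        if age <= ttl_seconds:
            kept.append(row)                  # still within its TTL
        elif rng.random() < keep_fraction:
            kept.append(row)                  # expired, but retained as a sample
    return kept
```

With `keep_fraction=0.0` this is plain TTL eviction; a small positive value retains a thin sample of old data for long-range analysis.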
N-ary Storage Model (Row/Record)
Because its primary workload is analytical, Scuba is considering a shift to a columnar layout in the future.
Scuba partitions data across nodes and, upon receiving a query, aggregates results from all nodes containing the requested data. The architecture is hierarchical, with leaf nodes storing the data. A query originates at a single root node and passes top-down through intermediate aggregator nodes. The leaf nodes scan their locally stored data and return results, which are then aggregated bottom-up.
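The top-down fan-out and bottom-up merge can be sketched with two small classes. The class names, the `value` column, and the choice of count and sum as the aggregates are assumptions made for illustration; they do not reflect Scuba's actual interfaces.

```python
class Leaf:
    """Stores rows locally and answers a query with a partial aggregate."""

    def __init__(self, rows):
        self.rows = rows

    def query(self, predicate):
        matching = [r["value"] for r in self.rows if predicate(r)]
        return {"count": len(matching), "sum": sum(matching)}


class Aggregator:
    """Forwards a query to its children and merges their partial results."""

    def __init__(self, children):
        self.children = children

    def query(self, predicate):
        total = {"count": 0, "sum": 0}
        for child in self.children:       # top-down: push the query to each child
            partial = child.query(predicate)
            total["count"] += partial["count"]   # bottom-up: merge partials
            total["sum"] += partial["sum"]
        return total
```

Count and sum merge by simple addition, which is why each level only needs its children's partial results rather than the raw rows. An average can be derived at the root as `sum / count`.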