Rockset

Rockset is a cloud-based search and analytics database-as-a-service offering.

Compression

Bitmap Encoding

Uses Zstandard (zstd) compression with per-file dictionary encoding, plus a 10-bit Bloom filter to quickly find keys in its proprietary Converged Index.
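To make the Bloom filter idea concrete, here is a minimal sketch in Python. It assumes "10-bit" means ten bits of filter per stored key, which is a reading of the text rather than a documented Rockset detail. The class name, method names, and the SHA-256 double-hashing scheme are illustrative choices, not Rockset's implementation. With ten bits per key and seven hash functions, standard Bloom filter math gives a false-positive rate of roughly 0.8%, so most lookups for absent keys can skip the file read.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter sized by bits per key (hypothetical helper)."""

    def __init__(self, expected_keys, bits_per_key=10):
        self.num_bits = max(8, expected_keys * bits_per_key)
        # Optimal number of hash functions is bits_per_key * ln(2).
        self.num_hashes = max(1, round(bits_per_key * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all probe positions from two 64-bit values.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

A reader would consult `might_contain` before opening a file and only perform the read when it returns `True`.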

Data Model

Relational

Rockset has dynamic, "smart" schemas that support semi-structured data such as JSON, CSV, and TSV. Because it does not require rigidly structured input, it works well as a back-end database for web applications. Rockset indexes the ingested data so that it can be queried with SQL (relational queries).
- The schema is not inferred from a sample of the data. Instead, the entire data set is indexed, so newly arriving data that contains new fields or information immediately updates the dynamic schema. No rows are ever rejected, and all of them are immediately queryable by Rockset.
- The most primitive data structure in Rockset is the document. A document contains a set of fields, has a unique document ID, and is mutable.
- A collection is a container of documents and is analogous to a table in a traditional relational database.
- Updates to a single document are atomic, even when they touch multiple fields, but writes to multiple documents are not atomic (recall that 4 KB is the most that hardware can guarantee to write atomically).
- Writes are asynchronous (asynchronous propagation), but user-placed blocks or barriers (which no process can pass until all processes have reached them) can provide a form of synchronization when certain writes must complete first.
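The document, collection, and dynamic-schema ideas described here can be illustrated with a small in-memory sketch. This is not Rockset's API: the `Collection` class and its `add` and `update` methods are hypothetical names, and recording Python type names per field is only one simple way to model a schema that grows as data arrives.

```python
import uuid
from collections import defaultdict


class Collection:
    """Toy collection: a container of mutable documents (hypothetical)."""

    def __init__(self):
        self.documents = {}             # document ID -> dict of fields
        self.schema = defaultdict(set)  # field name -> observed type names

    def _observe(self, fields):
        # The schema grows with every document; nothing is rejected.
        for name, value in fields.items():
            self.schema[name].add(type(value).__name__)

    def add(self, fields, doc_id=None):
        doc_id = doc_id or str(uuid.uuid4())
        self.documents[doc_id] = dict(fields)
        self._observe(fields)
        return doc_id

    def update(self, doc_id, changes):
        # Build the new version first, then swap it in with one assignment,
        # so a reader sees either the old or the new document, never a mix.
        new_version = {**self.documents[doc_id], **changes}
        self.documents[doc_id] = new_version
        self._observe(changes)
```

For example, adding `{"name": "a", "age": 3}` and then `{"name": "b", "age": "unknown", "city": "x"}` leaves both documents stored, with `age` recorded as both `int` and `str` and `city` appearing as a new field.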

Hardware Acceleration

Custom

Indexes

BitMap Inverted Index (Full Text)

- Rockset has a proprietary combination of three different indexes:
  - Inverted index
  - Columnar index
  - Document index
- The combination is optimized for the following six query types:
  - Key-value
  - Time-series
  - Document
  - Search
  - Aggregation
  - Graph
- Because Rockset has a smart, dynamic schema, it does not know the shape of the data ahead of time. For this reason, Rockset uses the three-index system described above to optimize queries over data with an unknown shape or schema. This is especially relevant for point queries and aggregation queries.
- The index is a live, real-time index that is updated from numerous data sources.
- The index is a covering index: every column is indexed, so a query never needs to go back to the underlying table for computation. This is very important for performance, especially when the shape of the data is unknown ahead of time.
- Updates to the indexes are served through the cloud.
- The index includes many storage and I/O optimizations, mainly the separation of the logical and physical indexes. These are linked through a key-value store, which differs from the implementation of other databases.
- Additionally, Rockset uses a 10-bit Bloom filter to make keys easier to find. This reduces the I/O load by 99%.
- The indexes point to documents, which are mutable. A document can be edited without being reindexed from scratch after each write, because the index entries can be edited directly.
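A toy version of the three-index layout over a key-value store might look like the sketch below. The key shapes (`"document"`, `"columnar"`, `"inverted"` tuples) and the class and method names are assumptions made for illustration. The source only states that three indexes are combined and linked through a key-value store. Field values are assumed to be hashable scalars.

```python
class ConvergedIndexSketch:
    """Stores each document under three key layouts in one key-value map."""

    def __init__(self):
        self.kv = {}  # stands in for the underlying key-value store

    def put(self, doc_id, doc):
        self.delete(doc_id)  # drop stale entries so updates edit in place
        self.kv[("document", doc_id)] = dict(doc)
        for field, value in doc.items():
            self.kv[("columnar", field, doc_id)] = value
            self.kv.setdefault(("inverted", field, value), set()).add(doc_id)

    def delete(self, doc_id):
        old = self.kv.pop(("document", doc_id), None)
        if old is None:
            return
        for field, value in old.items():
            del self.kv[("columnar", field, doc_id)]
            posting = self.kv[("inverted", field, value)]
            posting.discard(doc_id)
            if not posting:
                del self.kv[("inverted", field, value)]

    def get(self, doc_id):
        # Document index: fetch a whole document by ID.
        return self.kv.get(("document", doc_id))

    def search(self, field, value):
        # Inverted index: point lookup of documents matching field == value.
        return set(self.kv.get(("inverted", field, value), set()))

    def column(self, field):
        # Columnar index: read one field across all documents.
        # A real store would use a sorted prefix scan instead of a full pass.
        return [
            v for k, v in self.kv.items()
            if k[0] == "columnar" and k[1] == field
        ]
```

In this sketch a selective filter is answered from the inverted entries, an aggregation such as `sum(index.column("amount"))` reads only the columnar entries, and neither needs to touch the full documents.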

Parallel Execution

Intra-Operator (Horizontal)

- Uses a bottom-up approach to process queries.
- Uses the iterator (Volcano) model to process data, but can switch to vectorized execution if the query needs to scale.
- Uses a combination of rule-based and cost-based optimization.

Query Execution

Tuple-at-a-Time Model, Vectorized Model

Uses the Volcano (iterator) model by default to execute queries, but can switch to vectorized execution to scale queries over large amounts of data.
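The difference between the two execution models can be sketched with Python generators. This is a generic illustration of tuple-at-a-time versus vectorized processing, not Rockset's engine. All function names are hypothetical, and the same scan, filter, and sum pipeline is written both ways.

```python
def scan(rows):
    # Tuple-at-a-time: each pull from the operator produces one row.
    for row in rows:
        yield row


def filter_rows(child, predicate):
    for row in child:
        if predicate(row):
            yield row


def sum_rows(child, field):
    return sum(row[field] for row in child)


def scan_batches(rows, batch_size):
    # Vectorized: each pull produces a batch (list) of rows.
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def filter_batches(child, predicate):
    for batch in child:
        selected = [row for row in batch if predicate(row)]
        if selected:
            yield selected


def sum_batches(child, field):
    return sum(sum(row[field] for row in batch) for batch in child)


rows = [{"amount": i} for i in range(10)]


def is_even(row):
    return row["amount"] % 2 == 0


tuple_result = sum_rows(filter_rows(scan(rows), is_even), "amount")
batch_result = sum_batches(
    filter_batches(scan_batches(rows, 4), is_even), "amount"
)
```

Both pipelines compute the same answer. The vectorized form pays the per-call operator overhead once per batch rather than once per row, which is why it tends to scale better on large inputs.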

Query Interface

SQL, HTTP / REST

Storage Architecture

In-Memory

Data is held in memory in a distributed, cloud-based system: each node keeps a small portion of the data in its own memory (data parallelism).

Storage Model

Decomposition Storage Model (Columnar), Hybrid

- Uses a hybrid storage model because of its dynamic schema system.
- Has an index on columns (columnar) to support this dynamic system.

System Architecture

Shared-Nothing

- Rockset has a distributed, cloud-based architecture that takes advantage of the data parallelism in queries to speed them up. Each machine runs the same query on different data.
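The "same query, different data" pattern is commonly implemented as scatter-gather, sketched below. Everything here is an illustrative assumption: the CRC32-based sharding, the `id` and `amount` field names, and the use of threads to stand in for separate machines. None of it is taken from Rockset's documentation.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def shard_for(doc_id, num_nodes):
    # Deterministic hash so a document always lands on the same node.
    return zlib.crc32(doc_id.encode("utf-8")) % num_nodes


def partition(docs, num_nodes):
    shards = [[] for _ in range(num_nodes)]
    for doc in docs:
        shards[shard_for(doc["id"], num_nodes)].append(doc)
    return shards


def local_aggregate(shard):
    # Each node runs the same query over only its own data.
    return sum(doc["amount"] for doc in shard), len(shard)


def distributed_average(docs, num_nodes=4):
    shards = partition(docs, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_aggregate, shards))
    # The coordinator merges the partial (sum, count) pairs.
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count if count else None
```

Returning partial sums and counts, rather than partial averages, keeps the merge step exact regardless of how unevenly the documents are spread across nodes.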

Website

https://rockset.com

Source Code

https://github.com/rockset

Tech Docs

https://docs.rockset.com/

Developer

Rockset

Country of Origin

US

Start Year

2016

Project Type

Commercial

Written in

C++

Supported languages

Go, Java, JavaScript, Python

Operating Systems

Hosted

Licenses

Proprietary