Rockset

Acquired Company OLAP

Rockset is a cloud-based search and analytics database-as-a-service offering.

History

Rockset was acquired by OpenAI in June 2024. The company then announced that it would discontinue the database service in September 2024 and delete all customer data.

Compression

Naïve (Page-Level)

Zstandard (zstd) compression with dictionary encoding per file.

Data Model

Relational

Rockset is a relational database that supports dynamic, semi-structured data (JSON, CSV, TSV).

This is handled by indexing these files so that they can be used in SQL-based (relational) queries. The schema is not inferred from a sample of the data. Instead, the entire data set is indexed, so that when new data arrives it can immediately be used to update the dynamic schema if it contains new rows or information. This means that no rows are ever rejected and all of them are immediately queryable by Rockset.

The most primitive data structure in Rockset is the document. A document contains a set of fields, has a unique document ID, and is mutable. A Rockset collection is a container of documents and is analogous to a table in a traditional relational database.

Updates to a single document are atomic, even when they touch multiple fields, but writes to multiple documents are not atomic (for context, 4 KB is the most that hardware can guarantee to write atomically). Writes are asynchronous (asynchronous propagation), but user-placed blocks or barriers, which no process can pass until all processes have reached them, can be used to implement a form of synchronization when certain writes must be completed before proceeding.
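The schemaless ingestion described above can be illustrated with a minimal sketch. This is not Rockset's implementation: the `Collection` class, its `upsert` method, and the sample field names are hypothetical, and the "schema" here is simply the set of value types observed so far for each field.

```python
from collections import defaultdict


class Collection:
    """Toy stand-in for a collection: a container of mutable documents."""

    def __init__(self):
        self.documents = {}             # document ID -> dict of fields
        self.schema = defaultdict(set)  # field name -> observed type names

    def upsert(self, doc_id, fields):
        """Insert or update one document and widen the schema; never rejects."""
        doc = self.documents.setdefault(doc_id, {})
        doc.update(fields)
        for name, value in fields.items():
            self.schema[name].add(type(value).__name__)


users = Collection()
users.upsert("u1", {"name": "Ada", "age": 36})
# A new field ("tags") and a new type for "age" are accepted, not rejected.
users.upsert("u2", {"name": "Lin", "age": "unknown", "tags": ["admin"]})
# Documents are mutable: this changes one field of an existing document.
users.upsert("u1", {"age": 37})

print(dict(users.schema))
```

In this sketch a field can carry several types at once, which is one way to avoid rejecting rows whose shape differs from earlier data.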

Indexes

BitMap Inverted Index (Full Text)

Rockset uses a proprietary combination of three different indexes: an inverted index, a columnar index, and a document index. The combined index is optimized for six query types: key-value, time-series, document, search, aggregation, and graph queries. Because Rockset's schema is dynamic, the shape of the data is not known ahead of time. For this reason, Rockset uses the three-index system described above to optimize queries over data with an unknown shape or schema. This is especially relevant for point queries and aggregation queries.

The index is a live, real-time index that is updated from numerous data sources. It is a covering index, meaning that every column is indexed so that a query never needs to go back to the underlying table for computation. This is very important for performance, especially when the shape of the data is unknown ahead of time. Updates to the indexes are served through a hierarchical, cloud-based, disaggregated system that helps maintain the live indexes efficiently.

The index includes many storage and I/O optimizations, chiefly the separation of the logical and physical indexes. These are linked through a key-value store, which differs from how other databases implement their indexes. Additionally, Rockset uses a 10-bit Bloom filter to make keys easier to find, which reduces the I/O load by 99%.

The indexes point to documents, which are mutable. This means that a document can be edited without being reindexed after each write, since the index can be edited directly.
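One way to picture three logical indexes stored in a single key-value store is the sketch below. The key layout (the `"R"`, `"C"`, and `"S"` prefixes), the function names, and the sample data are assumptions made for illustration, not Rockset's actual format. A plain Python dict stands in for the key-value store, so lookups here scan all keys, whereas a sorted store would use prefix range scans.

```python
def index_document(store, doc_id, doc):
    """Write each field of one document into three logical indexes."""
    for field, value in doc.items():
        store[("R", doc_id, field)] = value        # document (row) index
        store[("C", field, doc_id)] = value        # columnar index
        store[("S", field, value, doc_id)] = True  # inverted (search) index


def fetch_document(store, doc_id):
    """Document lookup: rebuild one document from the row entries."""
    return {k[2]: v for k, v in store.items() if k[0] == "R" and k[1] == doc_id}


def column_values(store, field):
    """Aggregation input: read every value of one column."""
    return [v for k, v in store.items() if k[0] == "C" and k[1] == field]


def search(store, field, value):
    """Point or search query: IDs of documents where field == value."""
    return sorted(
        k[3] for k in store if k[0] == "S" and k[1] == field and k[2] == value
    )


store = {}
index_document(store, "d1", {"city": "Oslo", "temp": 4})
index_document(store, "d2", {"city": "Lima", "temp": 19})
index_document(store, "d3", {"city": "Oslo", "temp": 6})

print(search(store, "city", "Oslo"))      # answered from the inverted entries
print(sum(column_values(store, "temp")))  # answered from the columnar entries
print(fetch_document(store, "d2"))        # answered from the document entries
```

Because every field is written to all three layouts, each query shape can be answered from index entries alone, which is the covering-index property described above.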

Parallel Execution

Intra-Operator (Horizontal)

Rockset processes queries bottom-up using the iterator (Volcano) model, but can switch to vectorized execution if the query needs to scale. Rockset combines rule-based and cost-based optimization to find the best query plan.

Query Execution

Tuple-at-a-Time Model Vectorized Model

Rockset uses the Volcano (iterator) model by default to execute queries, but can use the vectorized model to scale queries over large amounts of data.
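The difference between the two execution models can be sketched with Python generators. This is a generic illustration of tuple-at-a-time versus batch-at-a-time processing, not Rockset's engine; all operator names, the batch size, and the sample rows are hypothetical.

```python
# Tuple-at-a-time (Volcano/iterator): each operator hands one row to its parent.
def scan(rows):
    for row in rows:
        yield row


def select(child, predicate):
    for row in child:
        if predicate(row):
            yield row


def project(child, column):
    for row in child:
        yield row[column]


# Vectorized: each operator hands a batch of rows to its parent,
# so the per-call overhead is paid once per batch instead of once per row.
def scan_batches(rows, batch_size):
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def select_batches(child, predicate):
    for batch in child:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept


def project_batches(child, column):
    for batch in child:
        yield [row[column] for row in batch]


def is_expensive(row):
    return row["price"] >= 30


rows = [{"id": i, "price": i * 10} for i in range(1, 7)]

tuple_result = list(project(select(scan(rows), is_expensive), "id"))
batch_plan = project_batches(select_batches(scan_batches(rows, 4), is_expensive), "id")
batch_result = [value for batch in batch_plan for value in batch]

print(tuple_result)
print(batch_result)
```

Both plans return the same rows; only the unit of work passed between operators changes.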

Query Interface

SQL HTTP / REST

Storage Architecture

In-Memory

Rockset uses in-memory storage in a distributed, cloud-based system where each node holds a small portion of the data (data parallel) in the machine's memory.

Storage Model

Decomposition Storage Model (Columnar) Hybrid

Rockset uses a hybrid storage model: its dynamic schema system is supported by indexes on columns (columnar).

System Architecture

Shared-Nothing

Rockset has a distributed, cloud-based architecture that takes advantage of the data parallelism present in queries to speed them up. This means that each machine runs the same query on different data. Each node is designed to perform a specific task in ingesting unstructured data and funneling it to RocksDB-Cloud, which handles storage and query execution. The system places hot data on a durable local SSD and colder data in a cloud instance (instead of writing to disk). The SSD is durable due to constant data replication (active-active) among the nodes. If one node dies, another can immediately pick up where it left off.
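The "same query on different data" pattern can be sketched as a scatter-gather aggregation. This is a single-process illustration under stated assumptions: threads stand in for nodes, documents are assigned to shards by a hash of their ID, and the names `shard_for`, `partition`, `local_aggregate`, and `average_latency` as well as the `latency_ms` field are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def shard_for(doc_id, shard_count):
    """Assign a document to a shard by hashing its ID."""
    return crc32(doc_id.encode("utf-8")) % shard_count


def partition(documents, shard_count):
    """Split the data set so that each 'node' holds a different subset."""
    shards = [[] for _ in range(shard_count)]
    for doc in documents:
        shards[shard_for(doc["id"], shard_count)].append(doc)
    return shards


def local_aggregate(shard):
    """The same query runs on every shard: a partial (sum, count)."""
    values = [doc["latency_ms"] for doc in shard]
    return sum(values), len(values)


def average_latency(shards):
    """Scatter the query to all shards, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(local_aggregate, shards))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count if count else None


documents = [{"id": f"req-{i}", "latency_ms": 10 * i} for i in range(1, 9)]
shards = partition(documents, 3)

print(average_latency(shards))
```

Shipping partial sums and counts, rather than per-shard averages, keeps the merged result correct even when shards hold different numbers of documents.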

Rockset runs its key components in Kubernetes containers so that the system is cloud-agnostic (independent of the cloud platform).
