Rockset

View Current Viewing Revision #4 from 11/24/2019 10:59 p.m.

Rockset is a cloud-based search and analytics database-as-a-service offering.

Data Model

Has dynamic, smart schemas that supports semi-structured data like JSON, csv, tsv, etc… This makes it great as a database for the back end for a web application as it does not need to take in very structured data. This is handled by special indexing of these files such that they can be used for SQL based queries (relational queries). -The schema is not inferred from a sample of the data. Instead, the entire data set is indexed so that when new data arrives it can instantly be used to update the dynamic schema if it contains new rows or information. This means that no rows are ever rejected and all of them are instantly query-able by Rocket. -Most primitive data structure is the Document in Rocket. Documents contains a set of fields and has a unique document ID and are mutable. -A collection is a container of documents and is analogous to a table in a formal relational database. -Updates to a document even with multiple fields are atomic, but writes to multiple documents are not atomic (recall that 4k is max amount that hardware can guarantee atomicity). -Writes are asynchronous (asynchronous propagation), but user placed blocks or barriers (cannot pass until all processes have reached the barrier) can be used to implement a form of synchronization if certain writes need to happen.

System Architecture

Shared-Nothing

-Rockset contains a distributed cloud based architecture that takes advantage of the data parallelism present in queries to speed up queries. This means that each machine operates the same query on different data.

In memory storage in a distributed cloud based system where each node contains a small amount of data (data parallel) in the memory of the machine.

Parallel Execution

Intra-Operator (Horizontal)

-Uses a bottom-up approach to process queries. -Uses the iterator or volcano model to process data, but can switch to vectorized if the query needs to scale. -Uses a combination of both rule based and cost based optimization.

Indexes

BitMap Inverted Index (Full Text)

-Have a proprietary combination of three different indexes: -Inverted Index -Columnar Index -Document Index -Optimized for the following 6 query types: -Key-value -Time-series -Document -Search -Aggregation -Graph -Because Rockset contains a smart, dynamic schema, it will not know what the shape of the data is ahead of time. For this reason, Rockset uses the above described three index system to optimize for queries with an unknown shape or schema. This is especially relevant for point queries and aggregation queries. -The index is a live, real-time index that is updated from numerous data sources. -The index is a covering index, which means that there is an index on all columns to ensure that the query never needs to revert back to the actual table for computation. This is very important for performance especially in the case where the shape of the data is unknown ahead of time. -Updates to the indexes are served through the cloud. -There are many storage and I/O optimizations present in the index, mainly the separation of the logical and physical indexes. These are linked through a key-value store, which is different from the implementation of any other database. -Additionally, there Rockset uses a 10-bit bloom filter to make it easier to find keys. This reduces the I/O load by 99%. -The indexes point to documents, which are mutable. This means that a document can be edited without being reindexed after each write since the index can be edited directly.

Query Execution

Tuple-at-a-Time Model Vectorized Model

Uses the volcano or iterator model by default to execute queries, but can use the vectorized approach in order to scale queries on large amounts of data.