Crux is an open source document database that uses Apache Kafka for the primary storage of transactions and documents, and RocksDB or LMDB to host indexes for rich query support. Decoupling the log from the indexes lets Crux scale well and serve a wide variety of use cases. Crux is a bitemporal database: data can be stored and queried along two time axes, valid time and system time. Crux does not enforce any schema for the documents it stores. It supports a Datalog query interface for reading data and traversing relationships across all documents, and query results are streamed lazily. Additionally, even though the main transaction log is immutable, Crux still supports the eviction of active as well as historical data.
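To make the two time axes concrete, here is a minimal Python sketch of a bitemporal lookup. It is an illustration only, not Crux's implementation or API: the `Version` record and the `as_of` function are hypothetical names, and small integers stand in for timestamps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    valid_time: int   # when the fact holds in the modelled world
    system_time: int  # when the database recorded the fact
    doc: dict


def as_of(history, valid_time, system_time) -> Optional[dict]:
    """Return the document visible at (valid_time, system_time), or None."""
    visible = [v for v in history
               if v.valid_time <= valid_time and v.system_time <= system_time]
    if not visible:
        return None
    # The latest valid time wins; ties go to the most recent write.
    return max(visible, key=lambda v: (v.valid_time, v.system_time)).doc


history = [
    Version(valid_time=1, system_time=1, doc={"name": "Ada", "city": "London"}),
    # Recorded late (system_time 5) but effective from valid_time 3.
    Version(valid_time=3, system_time=5, doc={"name": "Ada", "city": "Paris"}),
]
```

Asking for valid time 4 as the database stood at system time 4 returns the London document, because the correction had not yet been recorded. The same valid time queried at system time 6 returns the Paris document.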
Crux has been available as a Public Alpha since April 19th 2019. The Public Alpha period will continue until Crux is released as a Generally Available open source software product by JUXT later in 2019.
Each Crux node reads from and writes to disk, and each node stores the full database on disk. This is because Crux doesn't offer sharding, so each node has to keep track of all of the data. When a node receives an update, the update is automatically written to disk.
The language that Crux uses to execute queries, Datalog, offers functionality comparable to SQL, but allows for more efficient joins. Like SQL engines, it uses nested loop joins and sorted merge joins, but it also uses joins over granular indexes. As a result, the DBMS does not have to worry about normalizing the data or about what shape the data has.
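As an illustration of one of the join strategies named above, the following Python sketch shows a sorted merge join over two lists of (key, value) pairs. It is a generic textbook version written for this entry, not code taken from Crux; the function name and the sample data are hypothetical.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Pair left[i] with every right row in the run of equal keys.
            k = j
            while k < len(right) and right[k][0] == left_key:
                out.append((left_key, left[i][1], right[k][1]))
                k += 1
            # Leave j at the start of the run so that a duplicate key
            # on the left side is paired with the same run again.
            i += 1
    return out


people = [(1, "Ada"), (2, "Grace"), (4, "Edsger")]
cities = [(1, "London"), (2, "New York"), (2, "Arlington"), (3, "Oslo")]
joined = sort_merge_join(people, cities)
```

Because both inputs are sorted, each side is scanned in one forward pass, with re-reads limited to runs of duplicate keys.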
Crux's query interface is Datalog. It allows Crux to read data and explore relationships across documents. It supports most SQL-like join operations and, since Crux is a database with graph queries, recursive graph traversals as well.
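A recursive graph traversal of the kind described here amounts to computing a transitive closure. The Python sketch below shows the idea with a breadth-first search over edge pairs. It only approximates what a recursive Datalog rule would express and does not reflect how Crux evaluates queries; `reachable` and the `follows` data are hypothetical.

```python
from collections import deque


def reachable(edges, start):
    """Return every node reachable from `start` by one or more edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = set()
    queue = deque(adjacency.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; this also guards against cycles
        seen.add(node)
        queue.extend(adjacency.get(node, []))
    return seen


follows = [("ada", "grace"), ("grace", "edsger"),
           ("edsger", "ada"), ("alan", "ada")]
```

The `seen` set is what keeps the traversal finite on cyclic data, such as the ada, grace, edsger loop in the sample edges.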
The documents in Crux are all stored as Extensible Data Notation (EDN) documents. The fields within these documents are triples, which have entity, attribute, and value fields. This data model gives Crux better support for efficient graph queries.
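The relationship between a document and its triples can be sketched in a few lines of Python. This is a simplified illustration: a Python dict stands in for an EDN map, the `id` key and the `document_to_triples` function are hypothetical, and expanding a list value into one triple per element is an assumption of this sketch rather than a statement about Crux's indexing.

```python
def document_to_triples(doc, id_key="id"):
    """Flatten one document into (entity, attribute, value) triples."""
    entity = doc[id_key]
    triples = []
    for attribute, value in doc.items():
        if attribute == id_key:
            continue  # the id names the entity; it is not a triple itself
        # Assumption: a collection yields one triple per element.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            triples.append((entity, attribute, v))
    return triples


doc = {"id": "ada", "name": "Ada", "knows": ["grace", "alan"]}
triples = document_to_triples(doc)
```

A triple whose value is another entity's id, such as `("ada", "knows", "grace")`, acts as an edge, which is what makes this shape convenient for graph queries.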
When Crux runs a Datalog query, it outputs a lazy sequence of all of the tuples that satisfy all of the clauses in the query. This means that as the database finds tuples that satisfy the predicate, it outputs the tuples one at a time. Therefore, query execution is done using the Tuple-at-a-Time Model.
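The tuple-at-a-time behaviour can be illustrated with a Python generator, which produces each result only when the caller asks for it. This is a sketch of the general model, not of Crux's query engine; `scan`, `query`, and the sample triples are hypothetical.

```python
scanned = []  # records which triples the scan has touched so far


def scan(triples):
    """Yield triples one by one, noting each one as it is read."""
    for triple in triples:
        scanned.append(triple)
        yield triple


def query(source, attribute, value):
    """Lazily yield a 1-tuple for each entity where attribute == value."""
    for entity, attr, val in source:
        if attr == attribute and val == value:
            yield (entity,)


triples = [("ada", "city", "London"),
           ("grace", "city", "New York"),
           ("alan", "city", "London")]

results = query(scan(triples), "city", "London")
first = next(results)  # pulls only as much input as the first match needs
```

After the single `next` call, `scanned` holds only the first triple. The remaining input is read only if the caller keeps consuming `results`.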
Crux does not support node-level sharding, so every Crux node has the same data, and this is the same data that is stored on disk. When a change is made, all of the nodes must incorporate this change, so that the nodes are all consistent with each other. Crux may add functionality for node-level sharding in the future.
Crux uses RocksDB or LMDB to host its indexes. RocksDB uses two different formats for its indexes: block-based table and plain table. In a block-based table, it is easier to compress the data into blocks, but queries take longer to execute. In a plain table, the data is stored in a hash table, so it takes more space to store the data, but queries execute faster. LMDB uses two different B+ trees for its index format. One of the B+ trees stores pages with data, and the other tracks the pages that are freed after deletes.
Crux uses Apache Kafka to store the transaction and document logs. These logs are semi-immutable, and because they are decoupled from the Crux nodes themselves, Crux is very scalable. As an alternative to Kafka, Crux can use a local log store that operates within a standalone Crux node.
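The decoupling of the log from the nodes can be sketched as follows. In this hypothetical Python model, an in-memory list stands in for the Kafka log and a dict stands in for a node's RocksDB or LMDB index; `TransactionLog`, `Node`, and `catch_up` are illustrative names and not part of Crux.

```python
class TransactionLog:
    """Append-only log; an in-memory stand-in for a Kafka topic."""

    def __init__(self):
        self._entries = []

    def append(self, doc_id, doc):
        self._entries.append((doc_id, doc))
        return len(self._entries) - 1  # offset of the new entry

    def read_from(self, offset):
        return self._entries[offset:]


class Node:
    """Builds a full local index by replaying the shared log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0   # next log offset this node has not yet seen
        self.index = {}   # stand-in for the node's local index store

    def catch_up(self):
        for doc_id, doc in self.log.read_from(self.offset):
            self.index[doc_id] = doc
            self.offset += 1


log = TransactionLog()
node_a = Node(log)
log.append("ada", {"city": "London"})
node_a.catch_up()

node_b = Node(log)                    # a node added later
log.append("ada", {"city": "Paris"})
node_a.catch_up()
node_b.catch_up()                     # replays the whole log from offset 0
```

Because every node derives its index from the same log, a node added later converges to the same state as the others, which matches the unsharded design in which each node holds all of the data.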