Umbra

View Current Viewing Revision #28 from 04/02/2023 4:25 p.m.

OLAP OLTP

Umbra is a relational DBMS designed to support high performance for OLAP and OLTP workloads using flash-based storage. Umbra provides the performance of a pure main-memory DBMS for workloads that fit within main memory, with the scalability of a disk-based system. Umbra's buffer manager is based on LeanStore and supports variable-sized pages, enabling disk-based data structures to be accessed directly. Umbra integrates Worst-Case Optimal Joins (WCOJ) into the query optimizer, allowing WCOJ to be used for sub-plans of a query, improving performance for queries with large intermediate results. Umbra extends the query compilation approach from Hyper with a low-latency backend, Flying Start, which emits x86 machine code in a single pass. Umbra also supports User-Defined Operators (UDOs), which extend the DBMS functionality to support custom algorithms using any programming language that compiles programs to ELF object files.

History

Umbra is the new DBMS built at the Technical University of Munich (TUM) after the HyPer project. The Umbra project was initiated in 2018 by Prof. Thomas Neumann and advised by Prof. Alfons Kemper, with frequent collaborations with Prof. Viktor Leis and Prof. Jana Giceva.

Although HyPer provided excellent performance, DRAM remained relatively expensive, and DRAM sizes stopped growing. By contrast, Solid State Drives (SSDs) became much cheaper while still providing high read bandwidth, making them an ideal choice for new database systems. However, supporting SSD-backed storage with a traditional buffer manager would limit performance when the working set fits within main memory. As a result, the Umbra project was initiated, with the primary goal of achieving in-memory performance for working sets fitting within main memory with SSD-backed storage.

Concurrency Control

Multi-version Concurrency Control (MVCC)

Umbra employs Multi-Version Concurrency Control (MVCC) for its concurrency control scheme. Umbra defaults to a purely in-memory MVCC scheme and falls back to a different scheme for bulk operations. By default, version chains are stored exclusively in memory and are associated with pages through local mapping tables. For bulk operations, the transaction is given exclusive write access to the relevant relations by setting the "created" or "deleted" bits directly on the database pages, creating "virtual versions" of a database object.

Umbra supports the "Serializable" isolation level, implemented with a graph-based scheduling algorithm that delivers linear scalability in transaction throughput with the number of cores.

Joins

Nested Loop Join Hash Join Semi Join Index Nested Loop Join Worst-Case Optimal Join

Umbra executes queries using traditional binary joins such as hash-joins and nested-loop joins. Additionally, Umbra has integrated Worst-Case Optimal Joins (WCOJ) into the query optimizer and execution engine. Worst-case optimal joins provide superior performance to binary joins when the cardinality of intermediate joins is large. Therefore, during query optimization, Umbra detects when a portion of the query plan would result in large intermediate results and use a WCOJ instead. To execute a WCOJ, Umbra builds hash-trie indexes on the involved relations and performs a multi-way join using these indexes.

Query Compilation

JIT Compilation

Umbra performs Just-In-Time (JIT) compilation of queries into Umbra IR, a custom intermediate representation (IR) similar to LLVM IR but optimized for use in a database system. After generating Umbra IR, the code is lowered using one of two backends:

LLVM
Flying Start

The LLVM backend emits LLVM IR, compiled at optimization level -O3. This backend is the slowest but generates the fastest executing code, making it suitable for long-running queries.

The Flying Start backend emits x86 machine code using asmJIT, generating x86 in a single pass. In addition, the Flying Start backend implements Stack Space Reuse, Machine Register Allocation, Lazy Address Calculation, and Comparison-Branch Fusion optimizations. As a result, the code generated by Flying Start has performance on par with code generated by LLVM -O0 (i.e., with optimizations disabled). Additionally, Flying Start outperforms interpretation of Umbra IR, making Flying Start suitable for all but the longest-running queries.

Umbra supports adaptive execution, pioneered by HyPer, allowing the DBMS to switch execution strategies while processing a single query. Umbra first generates x86 machine code using the Flying Start backend and then switches to the code generated by the LLVM backend for long-running queries.

Query Execution

Tuple-at-a-Time Model

Umbra uses data-centric query processing, executing operators tuple-at-a-time.

Query Interface

SQL

Umbra supports SQL and ArrayQL as query interfaces, where ArrayQL operators are compiled to relational algebra operators.

Umbra allows the user to extend the functionality of the DBMS with custom algorithms with User-Defined Operators (UDOs). UDOs can be written using any programming language that can be compiled to ELF object files. The interface of a UDO consists of 2 functions:

void Accept(const InputTuple &tuple);

bool Process();

Storage Architecture

Hybrid

Views

Virtual Views Materialized Views

Umbra supports Virtual Views and "Continuous Views," which are Materialized Views maintained over append-only data streams. Continuous Views are maintained by splitting view maintenance between inserts and queries, achieving superior performance to deferred or incremental view maintenance approaches.

Revision #28 | Updated 04/02/2023 4:25 p.m.