Velox is a reusable vectorized database execution engine. It can be used to build compute engines focused on analytical workloads, including batch (Spark, Presto), interactive (PyVelox), stream, log processing, and AI/ML.[04][05]
- Website
- https://velox-lib.io[01]
- Source Code
- https://github.com/facebookincubator/velox[02]
- @velox_lib
- Developer
- Country of Origin
- US
- Start Year
- 2020 [21]
- Project Types
- Industrial Research, Open Source
- Written in
- C++
- License
- Apache v2
Unlike a complete DBMS, Velox is not meant to be used directly by end-users. Rather, it is designed to be a general-purpose component to handle execution that database developers can use in their systems.
Velox is a reusable vectorized database execution engine. It can be used to build compute engines focused on analytical workloads, including batch (Spark, Presto), interactive (PyVelox), stream, log processing, and AI/ML.
Unlike a complete DBMS, Velox is not meant to be used directly by end-users. Rather, it is designed to be a general-purpose component to handle execution that database developers can use in their systems.[04][05]
History[04][06]
Meta's data infrastructure contains dozens of specialized data computation engines, which have been largely developed independently. Maintaining and enhancing each of them can be difficult, especially considering the rapid change of workload requirements and hardware condition.
Velox is created in 2020 and open-sourced in 2021 to address this problem as a unified execution engine. It is under active development, but it is already in various stages of integration with some systems, including Presto, Spark, and PyTorch (the latter through a data preprocessing library called TorchArrow), etc. Additional contributions were provided by Intel, ByteDance, and Ahana, etc.
Checkpoints[04][07][08]
Does not support database checkpoints. However, there is some actions in Velox to make its executor more resilient. For example, when memory allocation fails in a task, the current state can be spilled to disk, the task can be paused and return a continuable VeloxPromise. The Task can be resumed when memory tension is fixed.
Compression[04][05]
Vectors in Velox are arrow compatible but with slight difference. During execution, the vector being passed around may be compressed based on its value feature. There are flat vector, dictionary vector, constant vector and lazy vector, etc. Vectors' metadata typically contains four lists: Null Value Mask, Offsets, Sizes, Elements. Different vector types can achieve different extent of compression.
For strings, it adopts the Umbra way of storing 4-byte string prefix plus 8-byte pointer to BLOB buffer if the string size is larger than 12 bytes.
Data Model[04][09]
Velox uses relational model. Inside the Velox dataframe abstraction, there are scalar values such as numbers of different length and precision, strings (VARCHAR or VARBINARY), timestamps of different precision and lambda functions. Complex types in Velox include arrays, maps, and structs, all of which can embed arbitrary scalar types.
Velox also provides a data type called OPAQUE that can wrap arbitrary C++ data structures.
Hardware Acceleration[10]
Velox uses SIMD among multiple Nodes, such as BigintValuesUsingHashTable::testValues and processFixedFilter in the Filter Operator. Simple scalar UDFs may also be compiled as SIMD by Velox.
Indexes[04]
Velox does not support index. But it supports expression pushdown to linked storage layer.
Joins[11]
Velox supports most common join rules, such as inner, left, right, semi, outer joins.
During Hash Join when the selectivity of the join keys on the build side is high and the table can fit into memory, Velox will use Broadcast distribution strategy, i.e. enforcing "Dynamic Filter Pushdown" to the Tablescan node. Other times, Velox will use partitioned strategy.
Velox also supports inner and left merge join for the case where both sides are sorted on the join keys.
If the join condition is IN in Semi joins and Anti Joins, Velox will maintain a null-mask to distinguish it with EXISTS case.
Parallel Execution[04][12][13]
The top level concept in Velox execution is the query plan, a.k.a Task. Task can then be converted to multiple stacked pipelines, which is similar to the idea of pipeline in Morsel-Driven Parallelism in Hyper. Each Pipeline is consist of multiple Nodes, such as HashJoinNode, CrossJoinNode, MergeJoinNode, LocalMergeNode, LocalPartitionNode and TableScanNode. Nodes complete their execution using Operators and Drivers, which are both created by the Task. Each Driver represents a thread and it takes over the ownership of the Operator from Task. One node can have multiple Drivers working at the same time to achieve inner-query parallelism.
Query Compilation[04][14]
For now, Velox experimentally supports query compilation through Codegen. It will transpile the query plan (Task) into C++ code and compile it to shared library using regular compilers (gcc / clang). This shared library can be linked to the main process at runtime.
Query Interface[04]
Users should use C++ for native support. Velox is also built as binary wheels for PyVelox (the Velox Python Bindings).
Storage Architecture[04]
Velox itself does not manage disk storage. It is expected to be connected with a disaggregated storage system such as S3, HDFS and Tectonic.
Storage Format[16]
Velox is natively arrow compatible. By implementing its connector interface, users can make Velox support more storage formats. Support for formats such as Parquet and DWRF are already included in the library.
Storage Model[04]
As an execution engine designed for OLAP systems, DBMS is designed mostly for DSM storage model.
Storage Organization[04]
Velox has a cache system between execution and underlying disaggregated storage to increase cache hit rate. Cache can is in RAM when first use and can be spilled out to disk for later. Velox also supports column prefetching by tracking the access frequency of each column in a query.
Stored Procedures[04][17][18][19]
Velox exposes C++ scalar UDF function API to the user. Users can write business logic based on a template, and Velox will help compile arithmetic functions to SIMD automatically. It also supports user-defined aggregation functions.
System Architecture[04]
Velox focuses on computation efficiency on single computer. Velox itself is an embedded engine. However, depending on the host system, it can also be expanded to run as standalone program. Prestissimo is the example of such practice.
Views[20]
There is materialized view between pipelines (thread-level) stored in local exchange queues. Also, there is materialized view between tasks (computer-level) maintaining by exchange client.
Citations
22 sources- Velox | Open-Source Composable Execution Engine velox-lib.io
- GitHub - facebookincubator/velox: A composable and fully extensible C++ execution engine library for data management systems. · GitHub github.com
- Velox Documentation — Velox documentation github.io
- Velox: Meta’s Unified Execution Engine fbcdn.net
- https://www.youtube.com/watch?v=HgNP3d93Jb4 youtube.com
- Introducing Velox: An open source unified execution engine fb.com
- velox/velox/common/future at main · facebookincubator/velox · GitHub github.com
- Spilling — Velox documentation github.io
- velox/velox/examples/OpaqueType.cpp at main · facebookincubator/velox · GitHub github.com
- SIMD Usage in Velox — Velox documentation github.io
- Joins — Velox documentation github.io
- What’s in the Task? — Velox documentation github.io
- https://15721.courses.cs.cmu.edu/spring2016/papers/p743-leis.pdf cmu.edu
- https://github.com/facebookincubator/velox/tree/main/velox/experimental/codegen github.com
- 23 - Meta Velox (CMU Advanced Databases / Spring 2023) - YouTube youtube.com
- velox/velox/dwio at main · facebookincubator/velox · GitHub github.com
- velox/velox/examples/SimpleFunctions.cpp at main · facebookincubator/velox · GitHub github.com
- How to add a scalar function? — Velox documentation github.io
- How to add an aggregate function? — Velox documentation github.io
- What’s in the Task? — Velox documentation github.io
- https://twitter.com/andy_pavlo/status/1523666175602814976?s=20 twitter.com
- https://github.com/facebookincubator/velox/commit/6af674c384870322d9d2ff3d6510b004c861c3ec github.com