BlazingSQL is a distributed, GPU-accelerated SQL engine with data lake integration, where a data lake is a large quantity of raw data stored in a flat architecture. It is ACID-compliant. BlazingSQL targets ETL workloads and aims to perform efficient read IO and OLAP querying. BlazingDB refers to the company and BlazingSQL refers to the product. The product is under active development, and the company has offices in San Francisco and Peru.[04][05][06]
- Website
- https://blazingsql.com[01]
- Source Code
- https://github.com/BlazingDB/blazingsql[02]
- Tech Docs
- https://docs.blazingdb.com[03]
- @blazingsql
- Developer
- Country of Origin
- PE
- Start Year
- 2015 [34]
- Former Name
- BlazingDB
- Project Types
- Commercial, Open Source
- Written in
- C++
- Supported Languages
- SQL
- Operating System
- Linux
- Licenses
- Apache v2, Proprietary
History[07][08][09][10][11]
BlazingSQL started as a GPU table joiner for multi-terabyte databases. The Aramburu brothers, Rodrigo and Felipe, founded a company in 2013 that provided analytical solutions and needed to speed up joins for pension fraud detection. BlazingSQL now integrates with RAPIDS, an open-source GPU data science initiative that relies on NVIDIA GPUs.
The system was originally closed-source with a free community binary; it became open-source in August 2019.
Checkpoints
It is unclear if BlazingSQL supports checkpointing.
Compression[12][13][14][15][16]
Historically, BlazingSQL supported compression and decompression on the GPU with bit-packing, delta encoding, dictionary encoding, and run-length encoding. This is currently disabled alongside its custom Simpatico file format. As of November 2018, it operates directly on Apache Parquet, CSV, and ORC. BlazingSQL does not currently write data and instead reads it from the data lake. It is able to operate directly on compressed data.
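To make the encodings named above concrete, the following is a minimal CPU-side sketch in plain Python of run-length, delta, and dictionary encoding. It is an illustration of the general techniques only, not BlazingSQL's GPU implementation or the Simpatico format; all function names are hypothetical, and bit-packing is not shown.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def delta_encode(values):
    """Keep the first value, then store each difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum over the deltas."""
    out = []
    total = 0
    for delta in deltas:
        total += delta
        out.append(total)
    return out


def dictionary_encode(values):
    """Replace each value with an integer code into a list of distinct values."""
    dictionary = []
    index = {}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes
```

Run-length encoding pays off on sorted or low-cardinality columns, delta encoding on slowly changing numeric columns such as timestamps, and dictionary encoding on repeated strings.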
Concurrency Control[05]
BlazingSQL does not write data. It reads directly from the data lake, loading it into GPU data frames that can be shared with other BlazingSQL worker nodes through interprocess communication. Worker nodes do not have to be on the same machine; they can run on different machines and use different GPUs. BlazingSQL handles concurrency for the generation of result sets. However, the user is responsible for ensuring that the data in the data lake is internally consistent and free of corruption when it is queried.
Data Model[17]
BlazingSQL is a relational database. It accepts multiple file formats (e.g. Apache Parquet) and provides a SQL interface for querying the data.
Foreign Keys
It is unclear if foreign keys are supported by BlazingSQL.
Hardware Acceleration[18][19][20][21]
BlazingSQL is hardware-accelerated with NVIDIA GPUs. Relevant columnar data is compressed, cached and sent to the GPU. The GPUs are used to speed up transforms, predicates, running predicates while skipping metadata, and to perform accelerated joins. This is accomplished by hooking into the cu* libraries that are part of the RAPIDS initiative, which are themselves bindings around NVIDIA's CUDA libraries.
Joins[22][23][24]
BlazingSQL supports transformations and hash joins (left, left-outer, full-outer) on all the column types supported by rapids.ai. Ordering, arithmetic, date transformations, predicates and group by operations are performed over vectors of data with GPU SIMD.
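As an illustration of the hash join technique mentioned above, here is a minimal sketch of a left-outer hash join with its build and probe phases. It runs on the CPU over lists of dictionaries and is not BlazingSQL's or RAPIDS' implementation; the function name and row representation are assumptions made for the example.

```python
from collections import defaultdict


def hash_left_outer_join(left_rows, right_rows, key):
    """Join two lists of dict rows on `key`, keeping every left row."""
    # Build phase: hash the right side on the join key.
    buckets = defaultdict(list)
    for row in right_rows:
        buckets[row[key]].append(row)

    # Columns that exist only on the right side; filled with None on a miss.
    right_columns = {col for row in right_rows for col in row if col != key}

    # Probe phase: look up each left row's key in the hash table.
    joined = []
    for row in left_rows:
        matches = buckets.get(row[key])
        if matches:
            for match in matches:
                joined.append({**row, **match})
        else:
            joined.append({**row, **{col: None for col in right_columns}})
    return joined
```

For example, joining orders to customers on a hypothetical `cust` column keeps an order whose customer is missing and fills the customer's columns with `None`.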
Query Compilation[25]
BlazingSQL uses RAPIDS libraries, which themselves use NVIDIA's CUDA. CUDA has support for JIT and code generation.
Storage Architecture[05][06]
BlazingSQL caches the data which is read from the data lake. The cache is cascading, storing data in GPU memory, then CPU memory, and finally SSD/NVMe.
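A cascading cache can be sketched as an ordered list of tiers in which entries evicted from a faster tier spill into the next slower one. The sketch below is a simplified model under stated assumptions: the tier names ("gpu", "host", "ssd"), the per-tier entry counts, and the least-recently-used eviction policy are all hypothetical and are not taken from BlazingSQL's documentation.

```python
from collections import OrderedDict


class CascadingCache:
    """Tiers ordered fastest first; evicted entries spill to the next tier."""

    def __init__(self, tiers):
        # tiers: list of (name, capacity_in_entries), fastest tier first.
        self.names = [name for name, _ in tiers]
        self.capacities = [capacity for _, capacity in tiers]
        self.levels = [OrderedDict() for _ in tiers]

    def _insert(self, key, value, level):
        if level >= len(self.levels):
            return  # dropped from the slowest tier; must be re-read from the lake
        store = self.levels[level]
        store[key] = value
        if len(store) > self.capacities[level]:
            # Evict the least recently inserted entry to the next tier down.
            old_key, old_value = store.popitem(last=False)
            self._insert(old_key, old_value, level + 1)

    def put(self, key, value):
        for store in self.levels:
            store.pop(key, None)  # keep at most one copy across all tiers
        self._insert(key, value, 0)

    def get(self, key):
        """Return (value, tier_name_where_found), or (None, None) on a miss."""
        for level, store in enumerate(self.levels):
            if key in store:
                value = store.pop(key)
                self._insert(key, value, 0)  # promote to the fastest tier
                return value, self.names[level]
        return None, None
```

A hit in a slow tier promotes the entry back to the fastest tier, which in turn may push colder entries down the cascade.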
Storage Model[16][26]
BlazingSQL does not write data. It reads compressed data directly from the data lake and transmits relevant columns to the GPU. On the GPU, data is represented as a GPU DataFrame (GDF). GDFs are built on top of Apache Arrow, which is a columnar in-memory format.
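The idea of sending only relevant columns to the GPU follows from the columnar layout: each column is a contiguous array, so a query can select the columns it needs without touching the rest. The following sketch shows this with plain Python lists; it does not use Apache Arrow or cuDF, and the function names are hypothetical.

```python
def to_columns(rows):
    """Pivot a list of dict rows (all with the same keys) into column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def project(columns, needed):
    """Select only the columns a query references."""
    return {name: columns[name] for name in needed}
```

With a table of `id`, `city`, and `amount` columns, a query that sums `amount` per `id` would only need `project(columns, ["id", "amount"])`.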
Storage Organization[27]
BlazingSQL does not write data. It relies on external storage, i.e. data being available in a "data lake": huge quantities of raw data in some flat-file architecture, commonly Parquet.
System Architecture[28][29][30][31][32][20]
BlazingSQL can utilize multiple GPUs distributed across different servers. BlazingSQL also has a distributed cache. Upon reading from the data lake, data is cached on the worker nodes. If a worker node A requests data that was recently read from the data lake by another worker node B, worker node B is able to push the desired data to worker node A.
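The peer-to-peer behaviour described above can be modelled in a few lines. In this sketch, a worker that misses in its own cache first checks whether a peer already holds the data and, if so, has the peer push it instead of reading the data lake again. How BlazingSQL actually locates the peer is not described here, so the shared `directory` dictionary, the `Worker` class, and the in-memory `data_lake` dictionary are all assumptions made for illustration.

```python
class Worker:
    """A worker node with a local cache and a view of which peer holds what."""

    def __init__(self, name, data_lake, directory):
        self.name = name
        self.data_lake = data_lake  # dict: path -> bytes, stands in for remote storage
        self.directory = directory  # dict shared by all workers: path -> Worker holding it
        self.cache = {}
        self.lake_reads = 0

    def push(self, path, peer):
        """Copy a cached entry into a peer's cache."""
        peer.cache[path] = self.cache[path]

    def read(self, path):
        if path in self.cache:
            return self.cache[path]
        owner = self.directory.get(path)
        if owner is not None:
            # A peer read this recently: have it push the data to us.
            owner.push(path, self)
        else:
            # Nobody has it cached: read from the data lake and advertise it.
            self.cache[path] = self.data_lake[path]
            self.lake_reads += 1
            self.directory[path] = self
        return self.cache[path]
```

In this model, if worker B reads a file first, a later read of the same file by worker A is served from B's cache and A never touches the data lake.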
Views[33]
BlazingSQL supports both virtual and materialized views. Materialized views are currently not persistent.
Citations
- [01] https://blazingsql.com
- [02] GitHub - BlazingDB/blazingsql: BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF. (github.com)
- [03] https://docs.blazingdb.com
- [04] Technology Wallpaper (livewallpapers.com)
- [05] In all honesty we get very few questions about ACID compliance from users and cu... | Hacker News (ycombinator.com)
- [06] https://blog.blazingdb.com/announcing-blazingsql-a-gpu-sql-engine-for-rapids-open-source-software-from-nvidia-11e115ba7dd7
- [07] https://blog.blazingdb.com/blazingsql-is-now-open-source-b859d342ec20
- [08] https://blog.blazingdb.com/blazingdb-origins-oh-and-we-just-raised-2-9m-from-nvidia-and-samsung-99cd581e66c7
- [09] https://blog.blazingdb.com/tcdrisupt-the-database-dabf044178ce
- [10] https://www.linkedin.com/in/roaramburu/
- [11] https://www.linkedin.com/in/felipe-aramburu-707a5b48/
- [12] https://blazingdb.atlassian.net/wiki/spaces/BlazPub/overview
- [13] https://news.ycombinator.com/item?id=15840900
- [14] SolidWorks 2013 Solution Overview (nvidia.com)
- [15] https://news.ycombinator.com/item?id=15820091
- [16] https://news.ycombinator.com/item?id=12485967
- [17] https://blog.blazingdb.com/blazingdb-2-0-gpu-fast-sql-on-apache-parquet-f2e8eff1f77a
- [18] https://blazingdb.atlassian.net/wiki/spaces/BlazPub/pages/105807873/BlazingSQL+Release+Notes
- [19] RAPIDS | GPU Accelerated Data Science (rapids.ai)
- [20] https://news.ycombinator.com/item?id=18201604
- [21] https://news.ycombinator.com/item?id=13992328
- [22] https://news.ycombinator.com/item?id=12488062
- [23] https://docs.blazingdb.com/docs/blazingdb-sql-guide
- [24] https://news.ycombinator.com/item?id=12486060
- [25] 1. Introduction - NVIDIA CUDA Compiler Driver 13.3 documentation (nvidia.com)
- [26] GitHub - rapidsai/cudf: cuDF - GPU DataFrame Library (github.com)
- [27] https://news.ycombinator.com/item?id=13990901
- [28] https://twitter.com/blazingdb/status/1055894085330976769
- [29] https://docs.blazingdb.com/discuss/57e2544bcda3750e0054a7e8
- [30] https://youtu.be/tUIrR_mj9fQ?t=2194
- [31] https://news.ycombinator.com/item?id=18201881
- [32] https://news.ycombinator.com/item?id=18201738
- [33] https://docs.blazingdb.com/docs/database-administration
- [34] https://www.crunchbase.com/organization/blazing-db