BlinkDB is an approximate query engine built on top of Hive and Shark ("Hive on Spark", the former Spark SQL). It allows users to trade-off query accuracy for response time, thus enabling interactive queries on big data. BlinkDB builds a couple of stratified samples on the original data and executes the queries on the samples instead of the original data to reduce query execution time. The number and sizes of the stratified samples are limited by the storage budget specified when importing the data. It has two major parts: one is the sample building engine that selects what stratified samples to build by considering historic workloads and the distribution of the data; the other part is a dynamic sample selection module that chooses appropriate sample files at runtime according to specific time/accuracy requirements specified by the query. BlinkDB has a public open source repository on Github. The latest version number is alpha-0.2.0. However, its features are quite limited. For example, instead of automatically creating samples for a dataset, this version supports manual sample creation with explicitly specified sample ratio only.
BlinkDB was built at Berkeley's AMPLab (now RISELab). It was proposed in [BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data](https://sameeragarwal.github.io/blinkdb_eurosys13.pdf), which is the best paper of Eurosys 2013. BlinkDB is no longer maintained. The successor of BlinkDB is VerdictDB, which builds on the same idea but supports more features than BlinkDB.
The samples in BlinkDB are essentially Hive tables. BlinkDB does not maintain large physical samples. A large sample in BlinkDB actually consists of a collection of smaller samples, which are stored in consecutive blocks on the HDFS. This is beneficial to runtime sample selection because BlinkDB can simply read more blocks to construct a larger sample, which makes it easy to estimate the execution cost of a query on different sizes of samples.
BlinkDB does not build online samples, so it does not build samples for the joined tables. Therefore, BlinkDB supports two types of joins: 1) Arbitrary joins are supported if the join key is present in the columns set of one of the stratifies samples. In that way, BlinkDB can use one of the samples to join the other table even if it is not sampled. 2) If no samples contain the key columns, then the join is only allowed if one of the original tables is small enough to fit in the memory. Since BlinkDB does not sample data that fits in the memory, this can also guarantee that the result table is small enough to execute the query on. Joining sample tables is identical to normal table joins, and it is left to underlying DBMS. Instead of using closed-form error bound computation, BlinkDB uses statistical bootstrapping method for estimating query error bounds on joined tables.
BlinkDB executes the query on one or more samples instead of the original data. It tries to select the most appropriate sample to execute the query at runtime. The sample selection involves two steps: first is to find a *sample family*, which is a set of columns on which BlinkDB has built the stratified samples; then is to select a *sample resolution*, that is, to find a sample in the sample family with the most appropriate size to satisfy the requirement of the query. For example, BlinkDB finds the smallest sample for an error bound constrained query to minimize its execution time, and it finds the largest sample for a time-constrained query to minimize its error bound. BlinkDB "probes" the smaller samples in a couple of sample families to estimate the query's selectivity, complexity, and the data distribution it specifies, then decides a sample family and a sample size to execute the query on.
The query interface of BlinkDB is SQL-based aggregation queries along with response time of error bound constraints. Like: `SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 2 SECONDS` will return an estimated average of `sessionTime` of `San Francisco` with error bounds of default confidence level within 2 seconds. `SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0%` will return the estimated average `sessionTime` until it has processed enough data so that the error is less than 10% with 95% confidence. To support these SQL queries and sample maintenance, BlinkDB made a couple of changes to the HiveQL parser to: 1. Support queries with response time and error bounds; 2. Detect data modification inputs, which could trigger creating new samples or updating the existing samples; 3. Support re-writing the original query to execute on samples, and iteratively assigning appropriately sized samples for this query to run on; 4. Support returning error bounds and confidence for aggregation functions by modifying the implementation of aggregation functions. However, the open-source version does not support the `WITHIN xx SECONDS` statement. It only supports `APPROX_SUM`, `APPROX_AVG` and `APPROX_COUNT` aggregation operators that return the estimated sum, average or count with error bars of 99% confidence.
BlinkDB stores its physical samples on the HDFS, and stores its metadata in Hive Metastore. The metadata maps the logical samples to the physical blocks on the HDFS. It can also directly read the Hive Metastore to access the native tables in Hive.
BlinkDB is built on Shark which uses columnar storage model. In the Github version, samples are saved as Hive tables and can be cached in the memory by Shark like other Hive tables.
University of California-Berkeley, Massachusetts Institute of Technology
Academic, Open Source
All OS with Java VM