BlinkDB

View Current Viewing Revision #11 from 12/07/2018 8:28 p.m.

BlinkDB is an approximate query engine built on top of Hive as well as Shark (Hive on Spark, the former Spark SQL). It allows users to trade-off query accuracy for response time, thus enabling interactive queries on big data. BlinkDB builds a couple of stratified samples on the original data and executes the queries on the samples instead of the original data to reduce query execution time. The number and sizes of the stratified samples are limited by the storage budget specified when importing the data. It has two major parts: one is the sample building engine that selects what stratified samples to build by considering historic workloads and the distribution of the data; the other part is a dynamic sample selection module that chooses appropriate sample files at runtime according to specific time/accuracy requirements specified by the query.

BlinkDB has a public open source repository on Github. The latest version number is alpha-0.2.0. However, its features are quite limited. Many features in the paper are not implemented in this version. For example, instead of automatically creating samples for a dataset, this version supports manual sample creation with explicitly specified sample ratio only.

History

BlinkDB was built at Berkeley's AMPLab (now RISELab). It was proposed in BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data, which is the best paper of Eurosys 2013.

BlinkDB is no longer maintained. The successor of BlinkDB is VerdictDB, which builds on the same idea but supports more features than BlinkDB.

Concurrency Control

Not Supported

BlinkDB leaves concurrency control to the base database system.

BlinkDB stores its physical samples on the HDFS. A logical large sample actually consists of a collection of smaller samples, which are stored in consecutive blocks on the HDFS. This is beneficial to runtime sample selection because BlinkDB can simply read more blocks to construct a larger sample, which makes it easy to estimate the execution cost of a query on different sizes of samples.

BlinkDB stores its metadata in Hive Metastore. The metadata maps the logical samples to the physical blocks on the HDFS. It can also directly read the Hive Metastore to access the native tables in Hive.

Query Execution

Tuple-at-a-Time Model Vectorized Model

BlinkDB executes the query on one or more samples instead of the original data. It tries to select the most appropriate sample to execute the query at runtime. The sample selection involves two steps: first is to find a sample family, which is a set of columns on which BlinkDB has built the stratified samples; then is to select a sample resolution, that is, to find a sample in the sample family with the most appropriate size to satisfy the requirement of the query. For example, BlinkDB finds the smallest sample for an error bound constrained query to minimize its execution time, and it finds the largest sample for a time-constrained query to minimize its error bound. BlinkDB "probes" the smaller samples in a couple of sample families to estimate the query's selectivity, complexity, and the data distribution it specifies, then decides a sample family and a sample size to execute the query on.

Storage Organization

Heaps

Indexes

Not Supported

Storage Model

Decomposition Storage Model (Columnar)

Samples are distinguished by their column sets. Samples which has the same column set belongs to a "sample family" and are stored together (actually is a series of consecutive blocks on the HDFS).

In the Github version, samples are saved as Hive tables and can be cached in the memory by Shard like other Hive tables.

System Architecture

Shared-Nothing

Query Interface

SQL

The query interface of BlinkDB is SQL-based aggregation queries along with response time of error bound constraints. Like:

SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 2 SECONDS will return an estimated average of sessionTime of San Francisco with error bounds of default confidence level within 2 seconds.

SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0% will return the estimated average sessionTime until it has processed enough data so that the error is less than 10% with 95% confidence.

To support these SQL queries and sample maintenance, BlinkDB made a couple of changes to the HiveQL parser to:

1) support queries with response time and error bounds;

2) detect data modification inputs, which could trigger creating new samples or updating the existing samples;

3) support re-writing the original query to execute on samples, and iteratively assigning appropriately sized samples for this query to run on;

4) support returning error bounds and confidence for aggregation functions by modifying the implementation of aggregation functions.

However, the Github version does not support WITHIN xx SECONDSstatement as said in the paper. It only supports APPROX_SUM,APPROX_AVG and APPROX_COUNT aggregation operators that returns the estimated sum, average or count with error bars with 99% confidence.

Joins

Sort-Merge Join

BlinkDB supports two types of joins:

1) Arbitrary joins are supported if the join key is present in the columns set of one of the stratifies samples. In that way, BlinkDB can use one of the samples to join the other table even if it is not sampled.

2) If no samples contain the key columns, then the join is only allowed if one of the original tables is small enough to fit in the memory. Since BlinkDB does not sample data that fits in the memory, this can also guarantee that the result table is small enough to execute the query on.

Revision #11 | Updated 12/07/2018 8:28 p.m.