DBDB.io The Encyclopedia of Database Systems · Est. 2017
Database of Databases

Database Entry

BlinkDB


BlinkDB is an approximate query engine built on top of Hive and Shark (Hive on Spark, the predecessor of Spark SQL). It lets users trade off query accuracy for response time, enabling interactive queries on big data. BlinkDB builds a set of stratified samples from the original data and executes queries on those samples rather than on the original data to reduce query execution time. It has two major parts: a sample-building engine that selects which stratified samples to build by considering historical workloads and the distribution of the data, and a dynamic sample selection module that chooses appropriate sample files at runtime according to the specified time or accuracy requirements.
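
The sample-building idea can be illustrated with a small Python sketch. It is not BlinkDB's code: the function name, the `cap` parameter, and the toy rows are assumptions made for illustration. It stratifies on one column and keeps at most `cap` rows per distinct value, so rare groups survive in full while large groups are subsampled.

```python
import random
from collections import defaultdict


def build_stratified_sample(rows, key, cap, seed=0):
    """Keep at most `cap` rows per distinct value of `key` (hypothetical helper).

    Returns (sample_rows, rates), where rates[v] is the fraction of group v
    that was retained; an estimator would use it to rescale counts and sums.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sample, rates = [], {}
    for value, members in groups.items():
        if len(members) <= cap:
            chosen = list(members)             # rare group: keep every row
        else:
            chosen = rng.sample(members, cap)  # large group: uniform subsample
        sample.extend(chosen)
        rates[value] = len(chosen) / len(members)
    return sample, rates


# Toy data: one large group and one rare group.
rows = [{"city": "New York", "sessionTime": t} for t in range(1000)]
rows += [{"city": "Galena", "sessionTime": t} for t in range(5)]
sample, rates = build_stratified_sample(rows, "city", cap=100)
```

A uniform 10% sample of the same table would likely contain no rows for the rare city at all, which is the case stratification is meant to avoid.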

Source Code
https://github.com/sameeragarwal/blinkdb [02]
Country of Origin
US
Start Year
2012
End Year
2014
Project Types
Academic, Open Source
Derived From
Spark SQL
Operating System
All OS with Java VM
License
Apache v2

History


BlinkDB was proposed in the paper "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data", which won the Best Paper Award at EuroSys 2013.

BlinkDB is no longer maintained. It is integrated into VerdictDB.

Concurrency Control


BlinkDB leaves concurrency control to the underlying database system.

Data Model


Indexes


Joins[03]


BlinkDB supports two types of joins:

1) Arbitrary joins (self-joins or joins between two tables) are allowed as long as one of the joined tables has a stratified sample whose column set contains the join key;

2) In the absence of any suitable stratified sample, the join is still allowed as long as one of the two tables fits in memory (since BlinkDB does not sample tables that fit in memory).

The implementation is left to the base database system.
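
The two join rules above can be written as a small eligibility check. This Python sketch is only an illustration: the function name, the dictionary layout, and the table names are hypothetical, not part of BlinkDB.

```python
def join_allowed(join_key, sample_column_sets, fits_in_memory):
    """Return the reason a join is permitted, or None if neither rule applies.

    sample_column_sets: table name -> list of column sets, one per stratified
                        sample built on that table (hypothetical layout).
    fits_in_memory:     table name -> True if the whole table fits in memory.
    """
    # Rule 1: some stratified sample on one table covers the join key.
    for table, column_sets in sample_column_sets.items():
        if any(join_key in columns for columns in column_sets):
            return f"stratified sample on {table}"
    # Rule 2: otherwise, one of the tables must fit in memory (unsampled).
    for table, fits in fits_in_memory.items():
        if fits:
            return f"{table} fits in memory"
    return None


samples = {"sessions": [{"city", "os"}], "users": []}
memory = {"sessions": False, "users": True}
reason_by_sample = join_allowed("city", samples, memory)
reason_by_memory = join_allowed("user_id", samples, memory)
reason_none = join_allowed("user_id", samples, {"sessions": False, "users": False})
```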

Query Compilation[03]


BlinkDB made a couple of changes to the HiveQL parser to:

1) support queries with response time and error bounds;

2) detect data modification inputs, which could trigger creating new samples or updating the existing samples;

3) support rewriting the query and iteratively assigning appropriately sized samples for the query to run on;

4) support returning error bounds and confidence for aggregation functions.
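
Items 3 and 4 can be pictured as a loop that tries progressively larger samples until the reported error bound is tight enough. The Python sketch below is an assumption-laden illustration, not BlinkDB's estimator: it uses a normal-approximation confidence interval for an average, and the function name and data are hypothetical.

```python
import math
import statistics

Z_95 = 1.96  # normal quantile for a two-sided 95% confidence interval


def approximate_avg(samples, max_relative_error):
    """Try samples from smallest to largest; stop once the bound is met.

    Returns (estimate, half_width, rows_used). If no sample satisfies the
    bound, the result from the largest sample is returned. Each sample must
    contain at least two values.
    """
    result = None
    for sample in sorted(samples, key=len):
        mean = statistics.fmean(sample)
        stderr = statistics.stdev(sample) / math.sqrt(len(sample))
        half_width = Z_95 * stderr  # error bound reported with the answer
        result = (mean, half_width, len(sample))
        if half_width <= max_relative_error * abs(mean):
            break
    return result


small = [10, 50, 30, 70]   # too noisy for a 10% relative error
large = [38, 42] * 200     # 400 rows, tight around 40
estimate, half_width, rows_used = approximate_avg([large, small], 0.1)
```

Here the four-row sample is tried first and rejected because its interval is far wider than 10% of the mean, so the answer comes from the 400-row sample.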

Query Interface[01]


SQL

BlinkDB's query interface consists of SQL-based aggregation queries annotated with response time or error bound constraints. For example:

SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 2 SECONDS

SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0%
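
To show how such constraints could be separated from the underlying SQL, here is a minimal Python sketch. It is not BlinkDB's HiveQL parser: the regular expressions cover only the two clause shapes in the examples above, and `split_bounds` is a hypothetical name.

```python
import re

# Trailing "WITHIN <n> SECONDS" clause.
TIME_RE = re.compile(r"\s+WITHIN\s+(\d+(?:\.\d+)?)\s+SECONDS\s*$", re.IGNORECASE)
# Trailing "ERROR <e> CONFIDENCE <c>%" clause.
ERROR_RE = re.compile(
    r"\s+ERROR\s+(\d+(?:\.\d+)?)\s+CONFIDENCE\s+(\d+(?:\.\d+)?)%\s*$",
    re.IGNORECASE,
)


def split_bounds(query):
    """Return (plain_sql, bounds); bounds is empty if no constraint is found."""
    query = query.strip()
    match = TIME_RE.search(query)
    if match:
        return query[:match.start()], {"seconds": float(match.group(1))}
    match = ERROR_RE.search(query)
    if match:
        bounds = {
            "error": float(match.group(1)),
            "confidence": float(match.group(2)) / 100,
        }
        return query[:match.start()], bounds
    return query, {}


base = "SELECT avg(sessionTime) FROM Table WHERE city='San Francisco'"
timed_sql, timed_bounds = split_bounds(base + " WITHIN 2 SECONDS")
error_sql, error_bounds = split_bounds(base + " ERROR 0.1 CONFIDENCE 95.0%")
```

The plain SQL that remains is what would be handed to the underlying engine, while the extracted bounds would drive the choice of sample.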

Storage Architecture


BlinkDB maintains samples both on disks and in memory.

Storage Model


System Architecture


Revision #6