PrestoDB is an open-source distributed SQL query engine for running interactive analytic queries against heterogeneous data sources. Facebook open-sourced it in 2013. It does not manage data storage itself. Instead, it queries data where it lives, including Hive, Cassandra, Kafka, and relational databases, and a single PrestoDB query can combine data from multiple sources. Presto was designed, built, and optimized for interactive queries: both Presto and Hive support SQL queries against HDFS, but Presto targets interactive queries, whereas Hive is better suited to batch processing. Presto supports ANSI-compatible SQL statements.
The Presto project started at Facebook in 2012 and was launched internally to the company in early 2013. Facebook open-sourced Presto in November 2013 under the Apache License 2.0.
To avoid confusion with the separate PrestoSQL project, Facebook's version of Presto is commonly referred to as PrestoDB.
Dictionary Encoding, Run-Length Encoding
PrestoDB can operate on dictionary and run-length-encoded blocks from connectors. When generating intermediate results, Presto also produces compressed data in the form of dictionary or run-length-encoded blocks.
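The two encodings named above can be illustrated with a minimal sketch. This is not PrestoDB's implementation (which is written in Java); the function names and the sample column are hypothetical and only show the idea behind each encoding.

```python
from itertools import groupby


def dictionary_encode(values):
    """Return (dictionary, ids): distinct values in first-seen order,
    plus one small integer index per row pointing into the dictionary."""
    dictionary, index_of, ids = [], {}, []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index_of[value])
    return dictionary, ids


def dictionary_decode(dictionary, ids):
    """Rebuild the original column from the dictionary and the indexes."""
    return [dictionary[i] for i in ids]


def rle_encode(values):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical sample column with few distinct values and long runs.
column = ["US", "US", "US", "DE", "DE", "US"]
dictionary, ids = dictionary_encode(column)
runs = rle_encode(column)
```

Dictionary encoding pays off when a column has few distinct values, while run-length encoding pays off when equal values are adjacent, for example after sorting or when a column is constant within a block.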
Column Family / Wide-Column, Relational, Key/Value, Document / XML
PrestoDB provides connectors to different data sources, and each connector is implemented to be compatible with the data model of the underlying database. Thus, PrestoDB supports different types of data models, including column family, document/XML, key/value, and relational.
Read Uncommitted, Read Committed, Serializable, Repeatable Read
Transaction support depends on the underlying data source and on the implementation of the specific connector. For connectors that support transactions, the PrestoDB API supports four isolation levels: read uncommitted, read committed, repeatable read, and serializable. The isolation level is specified when a transaction is started.
Hash Join, Broadcast Join, Shuffle Join
PrestoDB supports two join distribution types: broadcast join and partitioned (shuffle) join. The distribution type can either be specified by the user or be chosen by the cost-based optimization strategies that Presto supports.
Within each node, PrestoDB performs a hash-based join.
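The difference between the two distribution types, and the local hash join that runs on each node, can be sketched as follows. This is a single-process simulation under stated assumptions: the "workers" are plain Python lists, all function names are hypothetical, and nothing here reflects PrestoDB's actual Java code.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Local hash join: build a hash table on one side, probe it with the other."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    return [
        {**match, **row}
        for row in probe_rows
        for match in table.get(row[key], [])
    ]


def partition(rows, key, n_workers):
    """Route each row to a worker by hashing its join key."""
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[hash(row[key]) % n_workers].append(row)
    return parts


def partitioned_join(build_rows, probe_rows, key, n_workers):
    """Shuffle both sides by join key; each worker joins only its own partition."""
    build_parts = partition(build_rows, key, n_workers)
    probe_parts = partition(probe_rows, key, n_workers)
    out = []
    for build_part, probe_part in zip(build_parts, probe_parts):
        out.extend(hash_join(build_part, probe_part, key))
    return out


def broadcast_join(build_rows, probe_rows, key, n_workers):
    """Copy the whole build side to every worker; the probe side is not shuffled."""
    probe_chunks = [probe_rows[i::n_workers] for i in range(n_workers)]
    out = []
    for chunk in probe_chunks:
        out.extend(hash_join(build_rows, chunk, key))
    return out


# Hypothetical sample tables.
nations = [{"nation_id": 1, "nation": "US"}, {"nation_id": 2, "nation": "DE"}]
orders = [
    {"order_id": 10, "nation_id": 1},
    {"order_id": 11, "nation_id": 2},
    {"order_id": 12, "nation_id": 1},
    {"order_id": 13, "nation_id": 3},  # no matching nation, dropped by the inner join
]
shuffled = partitioned_join(nations, orders, "nation_id", n_workers=3)
broadcast = broadcast_join(nations, orders, "nation_id", n_workers=3)
```

Both strategies return the same rows. The trade-off is that a broadcast join replicates the build side to every worker, which is attractive when that side is small, whereas a partitioned join moves both sides across the network but keeps each worker's hash table small.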
Code Generation, JIT Compilation
PrestoDB uses code generation targeting JVM bytecode. It compiles expression evaluation into bytecode, and it uses heuristics to generate code that is amenable to the optimizations of the JVM's JIT compiler, which yields better performance.
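As a loose analogy only, the sketch below contrasts interpreting an expression tree row by row with generating code for it once and reusing the compiled result. PrestoDB emits JVM bytecode; this sketch emits Python source instead, and the tuple-based expression format and all function names are hypothetical.

```python
def interpret(expr, row):
    """Tree-walking evaluation: re-dispatches on the node type for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    op, left, right = expr
    lhs, rhs = interpret(left, row), interpret(right, row)
    if op == "+":
        return lhs + rhs
    if op == "*":
        return lhs * rhs
    if op == ">":
        return lhs > rhs
    raise ValueError(f"unsupported operator: {op}")


def to_source(expr):
    """Translate the expression tree into a Python source fragment."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    op, left, right = expr
    if op not in {"+", "*", ">"}:
        raise ValueError(f"unsupported operator: {op}")
    return f"({to_source(left)} {op} {to_source(right)})"


def compile_expr(expr):
    """Generate code for the expression once and return a callable for all rows."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


# Hypothetical filter: price * qty > 100
expr = (">", ("*", ("col", "price"), ("col", "qty")), ("const", 100))
rows = [{"price": 30, "qty": 4}, {"price": 10, "qty": 5}]

predicate = compile_expr(expr)
interpreted = [interpret(expr, row) for row in rows]
generated = [predicate(row) for row in rows]
```

The generated function contains no per-row dispatch on node types, which is the kind of straight-line code a JIT compiler can optimize well.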
PrestoDB is designed to query data from external sources, including Hadoop environments and relational database systems, so it does not itself take the role of data storage. All data and intermediate results are kept in memory whenever possible. For communication between nodes, data is likewise held in in-memory buffers and sent over the network. This avoids the high cost of disk I/O and speeds up execution.
For memory-intensive queries, PrestoDB can also spill data to disk. However, spilling is not a primary function of PrestoDB, and most query operations are expected to run entirely in memory.
Decomposition Storage Model (Columnar)
To execute a query, PrestoDB divides the work among the workers, and the workers fetch the data from the data sources. The unit of data that PrestoDB operates on locally is called a page. A page is a columnar encoding of a sequence of rows.
Like many classic MPP (massively parallel processing) database management systems, PrestoDB uses a shared-nothing system architecture.
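The notion of a page as a column-wise layout of a batch of rows can be sketched as below. The `Page` class here is a hypothetical illustration and does not mirror PrestoDB's Java classes; it only shows that the same rows are stored one column at a time.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence, Tuple


@dataclass
class Page:
    """A batch of rows stored column by column (hypothetical sketch)."""

    columns: Dict[str, List[Any]]

    @classmethod
    def from_rows(cls, names: Sequence[str], rows: Sequence[Tuple[Any, ...]]) -> "Page":
        # Transpose row tuples into one list per column.
        return cls({name: [row[i] for row in rows] for i, name in enumerate(names)})

    @property
    def row_count(self) -> int:
        # All columns have the same length; an empty page has zero rows.
        return len(next(iter(self.columns.values()), []))

    def row(self, i: int) -> Tuple[Any, ...]:
        # Reassemble a single row by reading position i of every column.
        return tuple(column[i] for column in self.columns.values())


# Hypothetical sample data.
page = Page.from_rows(["id", "city"], [(1, "Oslo"), (2, "Rome"), (3, "Lima")])
total = sum(page.columns["id"])  # an aggregate touches only the column it needs
```

Because each column is contiguous, an operator that reads one column does not have to touch the others, and each column can be encoded independently.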
PrestoDB is deployed on a cluster of nodes. A node can take the role of either a coordinator or a worker. Each node has its own private disk and memory, and the user can configure the memory usage of each node. Since PrestoDB does not store data directly, each node's disk is used minimally, for storing logs only, and all communication goes through the network.
https://github.com/prestodb/presto
https://prestodb.io/docs/current/
2013
Presto
Accumulo, Cassandra, Elasticsearch, Hive, Kudu, MongoDB, MySQL, Pinot, PostgreSQL, Redis, Redshift