Impala is an open source SQL engine that offers interactive query processing on data stored in Apache Hadoop file formats. As opposed to SQL-on-Hadoop databases such as Hive that are used for long batch jobs, Impala enables interactive exploration and fine-tuning analytic queries by using its Massively Parallel Process (MPP) model. Impala avoids data movement and enables the users to interact with the data stored in HDFS via a SQL front-end rather than the traditional HDFS jobs.
The impala project was announced in October 2012 with the objective to provide a SQL interface and Business Intelligence tools for data scientists. Impala supports various HDFS file formats, however it is optimized for Parquet, a column-oriented file format which was announced in early 2013. Impala was accepted into the Apache incubator on December 2, 2015.
Impala supports SQL as its query language. It provides a high dgree of compatibility with the Hive Query Language (HiveQL). Additionally it also provides an impala-shell interpreter which processes all the SQL commands supported by Impala along with a few shell-only commands which can be used for performance tuning.
Nested Loop Join Hash Join Broadcast Join Shuffle Join
Impala provides a variety of Join Options. Impala does not provide a command to hint on the type of join to be executed incase of Nested Loop Joins and Hash Joins. Impala internally decied on the most suitable join mechanism for the query. However, it supports query hints for choosing between Broadcast and Shuffle joins.
Read Uncommitted Read Committed
Impala supports both Read Committed and Read Uncommitted isolation levels.
Support for stored procedures in Impala was added from the 1.2 release. It now enables users to write UDFs in C++ or Java based Hive UDFs. C++ UDFs achieve a significant performance improvement over the Java written UDFs. Currently support for User Defined Table Functions (UDTF) has not been added.
Impala does not support indexes. Although HIVE provides limited index capabilities, they are not leveraged by Impala. Since Impala is not a monolithic DBMS, Impala is often unaware of the data the shows up in the HDFS files. Hence it is not possible for the index to stay in sync with the base data.
Impala can access data stored on HDFS in any of the Apache Hadoop file formats, including, Parquet, Text, Avro, RCFile and SequenceFile. It also supports compressed file formats in order to reduce the disk space and I/O volume, although such formats induce a CPU overhead to decompress the data.
Impala is a distributed, Massively Parallel Processing (MPP) query engine which uses a Shared-Nothing architecture. Impala consists of the following three major components 1. Impala Daemon - A daemon process runs on each data node to read and write data for the accepted queries and parallelizes the work across the cluster. It transmits the query results to the central coordinator node. 2. Impala Statestore - It is a daemon process which continously monitors the health status of the daemons on the datanodes in the cluster. When a datanode goes down, it ensures that no requests are made to an unreachable datanode. It provides robustness, load balancing and high availability. 3. Impala Catalog Service - It relays the metadata changes from SQL statements to all the Imapala Daemons. The catalog server ensures that if the metadata change occured via SQL queries issued through Impala.
Code Generation JIT Compilation
Impala uses the LLVM engine to perform just in time (JIT) query compilation. It uses runtime code generation for specific versions of the function by which performance improvements of more than 5x are achieved.