DBDB.io The Encyclopedia of Database Systems · Est. 2017
Database of Databases

Database Entry

Impala


Impala is an open source SQL engine that offers interactive query processing on data stored in Apache Hadoop file formats. As opposed to SQL-on-Hadoop databases such as Hive, which are used for long batch jobs, Impala enables interactive exploration and fine-tuning of analytic queries through its Massively Parallel Processing (MPP) model. Impala avoids data movement and lets users interact with the data stored in HDFS via a SQL front-end rather than through traditional HDFS jobs.[05][06]

Source Code
https://github.com/apache/impala[02]
Developer
Country of Origin
US
Start Year
2013 [29]
Project Types
Commercial, Open Source
Written in
C++
Inspired By
BigQuery
Operating System
Linux
License
Apache v2

History[07][08]


The Impala project was announced in October 2012 with the objective of providing a SQL interface and Business Intelligence tools for data scientists. Impala supports various HDFS file formats; however, it is optimized for Parquet, a column-oriented file format which was announced in early 2013. Impala was accepted into the Apache incubator on December 2, 2015.

Checkpoints[09][10]


Impala does not support query checkpoints. When a host node on which a query is running fails, Impala cancels the query. Additional support for long-running queries is planned, so that a query could complete even in the presence of node failures.

Concurrency Control[11]


Impala does not support any concurrency control mechanism. The transactional nature of the Hive Metastore (HMS), which receives updates on inserts and updates, raises an error in case parallel inserts are made into the same table.

Data Model[12]


Impala is a massively parallel query engine which is not strongly coupled with the underlying storage layer. Currently, Impala only supports a flat relational schema. The developers plan to add support for nested schemas with complex column types.

Foreign Keys[13][14]


Although foreign keys are not currently supported by Impala, they are planned to be added later for use in cardinality estimation during query planning. However, they will not be enforced by Impala.

Indexes[15][16]


Impala does not support indexes. Although Hive provides limited index capabilities, they are not leveraged by Impala. Since Impala is not a monolithic DBMS, it is often unaware of the data that shows up in the HDFS files. Hence it is not possible for an index to stay in sync with the base data.

Isolation Levels[17]


Impala supports both Read Committed and Read Uncommitted isolation levels.

Joins[18]


Impala provides a variety of join options. Impala does not provide a command to hint at the type of join to be executed in the case of Nested Loop Joins and Hash Joins; it decides internally on the most suitable join mechanism for the query. However, it supports query hints for choosing between Broadcast and Shuffle joins.
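The difference between the two distributed strategies named above can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions: the node count, the table contents, and the function names (`local_hash_join`, `broadcast_join`, `shuffle_join`) are hypothetical and do not mirror Impala's internals. The `rows_sent` figures are a rough stand-in for network traffic.

```python
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size


def local_hash_join(left_rows, right_rows):
    """Join two lists of (key, value) rows on key using an in-memory hash table."""
    table = defaultdict(list)
    for key, value in right_rows:
        table[key].append(value)
    return [(key, lval, rval)
            for key, lval in left_rows
            for rval in table.get(key, ())]


def broadcast_join(big, small):
    """The big table stays partitioned in place; the small table is copied to every node."""
    # Round-robin placement stands in for wherever the big table's blocks already live.
    partitions = [big[i::NUM_NODES] for i in range(NUM_NODES)]
    rows_sent = len(small) * NUM_NODES
    result = []
    for part in partitions:
        result.extend(local_hash_join(part, small))
    return sorted(result), rows_sent


def shuffle_join(left, right):
    """Both tables are hash-partitioned on the join key and routed to the owning node."""
    left_parts = [[] for _ in range(NUM_NODES)]
    right_parts = [[] for _ in range(NUM_NODES)]
    for row in left:
        left_parts[hash(row[0]) % NUM_NODES].append(row)
    for row in right:
        right_parts[hash(row[0]) % NUM_NODES].append(row)
    rows_sent = len(left) + len(right)  # upper bound: every row may cross the network
    result = []
    for lpart, rpart in zip(left_parts, right_parts):
        result.extend(local_hash_join(lpart, rpart))
    return sorted(result), rows_sent


orders = [(i % 4, f"order-{i}") for i in range(12)]   # larger side
customers = [(k, f"cust-{k}") for k in range(4)]      # smaller side

broadcast_result, broadcast_cost = broadcast_join(orders, customers)
shuffle_result, shuffle_cost = shuffle_join(orders, customers)
```

Both strategies produce the same rows; they differ in how much data moves. Broadcasting tends to win when one side is small, while shuffling tends to win when both sides are large.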

Logging[19]


Since Impala does not support transactions and is suited for analytical queries, it does not support logging.

Query Compilation[12]


Impala uses LLVM to perform just-in-time (JIT) query compilation. It uses runtime code generation to produce query-specific versions of functions, by which performance improvements of more than 5x are achieved.
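The idea of generating a query-specific function at runtime can be sketched in Python. This is an analogy only: Impala emits LLVM IR and compiles it to machine code, whereas the sketch below generates Python source. The predicate format and the names `interpret` and `generate` are assumptions made for illustration.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def interpret(predicate, row):
    """Generic path: decode the predicate description again for every row."""
    column, op, constant = predicate
    return OPS[op](row[column], constant)


def generate(predicate):
    """Specialized path: emit source for this one predicate and compile it once."""
    column, op, constant = predicate
    if op not in OPS:
        raise ValueError(f"unsupported operator: {op}")
    # The column name, operator and constant are baked into the generated code.
    source = f"def specialized(row):\n    return row[{column!r}] {op} {constant!r}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["specialized"]


rows = [{"price": p} for p in (5, 15, 25)]
predicate = ("price", ">", 10)

specialized = generate(predicate)
interpreted_hits = [r["price"] for r in rows if interpret(predicate, r)]
compiled_hits = [r["price"] for r in rows if specialized(r)]
```

The specialized function skips the per-row lookup of the operator and the unpacking of the predicate, which is the kind of overhead that runtime code generation aims to remove.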

Query Execution[12][20]


Query Interface[21][22]


Impala supports SQL as its query language. It provides a high degree of compatibility with the Hive Query Language (HiveQL). It also provides an impala-shell interpreter, which processes all the SQL commands supported by Impala along with a few shell-only commands that can be used for performance tuning.

Storage Architecture[23][12]


Impala can access data stored on HDFS in any of the Apache Hadoop file formats, including Parquet, Text, Avro, RCFile and SequenceFile. It also supports compressed file formats in order to reduce disk space and I/O volume, although such formats incur CPU overhead to decompress the data.
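The space-versus-CPU trade-off described above can be observed with a small experiment. This is a sketch under stated assumptions: it uses Python's standard `zlib` module as a stand-in codec and a made-up, highly repetitive data set, so the numbers say nothing about Impala or any particular Hadoop file format.

```python
import time
import zlib

# Hypothetical, repetitive rows standing in for a delimited text table.
raw = "\n".join(
    f"{i % 100},store-{i % 10},2013-01-01" for i in range(50_000)
).encode()

compressed = zlib.compress(raw, 6)

# Reading compressed data back costs CPU time that uncompressed data would not.
start = time.perf_counter()
restored = zlib.decompress(compressed)
decompress_seconds = time.perf_counter() - start

ratio = len(raw) / len(compressed)
print(f"raw={len(raw)} B, compressed={len(compressed)} B, ratio={ratio:.1f}x")
print(f"CPU time spent decompressing: {decompress_seconds * 1e3:.2f} ms")
```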

Storage Format[24]


Storage Model[25][12]


Impala does not provide its own storage engine but rather reads data from any of the underlying storage formats. Nonetheless, when data is stored in Parquet, a binary columnar storage format, it displays a significant performance improvement because the columnar layout substantially reduces the I/O volume.
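Why a columnar layout reduces I/O can be shown with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical table with fixed-width columns and ignores encoding, compression and metadata, so it is a simplified model rather than a description of Parquet itself.

```python
def bytes_scanned_row_layout(num_rows, column_widths):
    """Row-oriented files interleave all columns, so a scan reads every column."""
    return num_rows * sum(column_widths.values())


def bytes_scanned_column_layout(num_rows, column_widths, needed):
    """Column-oriented files let a scan read only the columns the query references."""
    return num_rows * sum(column_widths[name] for name in needed)


# Hypothetical schema: bytes per value for each column.
widths = {"id": 8, "name": 32, "city": 24, "amount": 8, "ts": 8}
num_rows = 1_000_000
needed = ["amount"]  # e.g. a query that only aggregates one column

row_bytes = bytes_scanned_row_layout(num_rows, widths)
col_bytes = bytes_scanned_column_layout(num_rows, widths, needed)
print(f"row layout: {row_bytes} B, column layout: {col_bytes} B")
```

Under these assumptions the query touches one tenth of the bytes in the columnar layout, and the gap widens as the table gets wider relative to the columns a query uses.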

Stored Procedures[26]


Support for stored procedures in Impala was added in the 1.2 release, in the form of user-defined functions (UDFs). Users can write UDFs in C++ or use Java-based Hive UDFs. C++ UDFs achieve a significant performance improvement over UDFs written in Java. Support for User Defined Table Functions (UDTFs) has not yet been added.

System Architecture[27][28]


Impala is a distributed, Massively Parallel Processing (MPP) query engine which uses a shared-nothing architecture. Impala consists of the following three major components:
1. Impala Daemon - A daemon process runs on each data node to read and write data for the accepted queries and to parallelize the work across the cluster. It transmits the query results to the central coordinator node.
2. Impala Statestore - A daemon process which continuously monitors the health of the daemons on the data nodes in the cluster. When a data node goes down, it ensures that no requests are made to the unreachable node. It provides robustness, load balancing and high availability.
3. Impala Catalog Service - It relays the metadata changes from SQL statements to all the Impala Daemons. When a metadata change is made through SQL statements issued via Impala, the catalog service removes the need to issue REFRESH or INVALIDATE METADATA statements.
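The Statestore's role of keeping requests away from unreachable nodes can be modeled with a toy liveness tracker. Everything in this sketch is hypothetical: the `Statestore` class, its `TIMEOUT`, the `plan_fragments` helper and the daemon names are illustrative assumptions, not Impala's actual protocol or API.

```python
class Statestore:
    """Toy liveness tracker; the timeout and method names are hypothetical."""

    TIMEOUT = 3.0  # seconds without a heartbeat before a daemon counts as down

    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, daemon, now):
        """Record that a daemon reported in at time `now`."""
        self.last_seen[daemon] = now

    def live_daemons(self, now):
        """Return the daemons whose last heartbeat is recent enough."""
        return sorted(
            d for d, seen in self.last_seen.items() if now - seen <= self.TIMEOUT
        )


def plan_fragments(statestore, num_fragments, now):
    """Assign query fragments round-robin to daemons currently believed healthy."""
    live = statestore.live_daemons(now)
    if not live:
        raise RuntimeError("no healthy daemons available")
    return {i: live[i % len(live)] for i in range(num_fragments)}


store = Statestore()
for name in ("impalad-1", "impalad-2", "impalad-3"):
    store.heartbeat(name, now=0.0)

# impalad-2 stops reporting; the other two keep sending heartbeats.
store.heartbeat("impalad-1", now=4.0)
store.heartbeat("impalad-3", now=4.0)

assignment = plan_fragments(store, num_fragments=4, now=5.0)
```

In this model no fragment is assigned to `impalad-2` once its heartbeat is stale, which mirrors the behavior described for the Statestore in item 2.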

Views[10]


Impala supports virtual views as lightweight logical constructs that act as query aliases. It does not support materialized views, since data updates in the Hadoop environment make it difficult to keep them up to date.
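The notion of a virtual view as a query alias can be demonstrated with any SQL engine. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a stand-in, since it needs no cluster; the table and view names are hypothetical, and Impala's own DDL and behavior may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("east", 10), ("west", 20)])

# A virtual view stores only the query text, never the query result.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

before = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()

# New base data is visible through the view immediately, with no refresh step.
conn.execute("INSERT INTO sales VALUES ('east', 5)")
after = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
conn.close()
```

Because the view is re-evaluated on every query, there is no stored result that could fall out of date, which is the maintenance problem the text attributes to materialized views.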

Compatible Systems
Kudu
Derivative Systems
StarRocks

Citations

33 sources
  1. Impala apache.org
  2. GitHub - apache/impala: Apache Impala · GitHub github.com
  3. Impala apache.org
  4. Apache Impala - Wikipedia wikipedia.org
  5. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala.html cloudera.com Dead — Check Archive
  6. Introducing Apache Impala apache.org
  7. Apache Impala - Wikipedia wikipedia.org
  8. Introducing Apache Impala apache.org
  9. Impala Frequently Asked Questions | 5.2.x | Cloudera Documentation cloudera.com
  10. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_faq.html cloudera.com Dead — Check Archive
  11. Archived - Cloudera Community cloudera.com
  12. Impala: A Modern, Open-Source SQL Engine for Hadoop cidrdb.org
  13. https://issues.apache.org/jira/browse/IMPALA-2112 apache.org
  14. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_reserved_words.html cloudera.com Dead — Check Archive
  15. Impala Frequently Asked Questions | 5.6.x | Cloudera Documentation cloudera.com
  16. Google Groups google.com
  17. impala/thirdparty/hbase-0.94.6-cdh4.3.0/src/main/java/org/apache/hadoop/hbase/client/IsolationLevel.java at master · schubertzhang/impala · GitHub github.com
  18. Joins in Impala SELECT Statements | 5.9.x | Cloudera Documentation cloudera.com
  19. https://www.tutorialspoint.com/impala/impala_overview.html tutorialspoint.com Dead — Check Archive
  20. https://chatwithengineers.com/2016/08/29/a-survey-of-query-execution-engines-from-volcano-to-vectorized-processing chatwithengineers.com Dead — Check Archive
  21. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_langref.html cloudera.com Dead — Check Archive
  22. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_impala_shell.html cloudera.com Dead — Check Archive
  23. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_file_formats.html cloudera.com Dead — Check Archive
  24. How Impala Works with Hadoop File Formats apache.org
  25. Latest Insights on Data and AI | Cloudera Blog cloudera.com
  26. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_udf.html cloudera.com Dead — Check Archive
  27. Impala Concepts and Architecture apache.org
  28. Impala apache.org
  29. Impala Version and Download Information | 5.x | Cloudera Documentation cloudera.com
  30. https://github.com/apache/impala/commit/11b5fd6c1f5201af5710342e9c793b586664d6b1 github.com
  31. https://github.com/apache/impala/commit/82292dfd1add34f8e148b45239d58187cfd55e16 github.com
  32. https://github.com/apache/impala/commit/12e79a22ef438c3b515ba0218e8e002b3b234cc6 github.com
  33. https://github.com/apache/impala/commit/328448d0600ee0adcc15a4180a18b624b0ef87d0 github.com
Revision #13 Last Updated: