Impala is an open source SQL engine that offers interactive query processing on data stored in Apache Hadoop file formats. Unlike SQL-on-Hadoop systems such as Hive, which are used for long batch jobs, Impala enables interactive exploration and fine-tuning of analytic queries through its Massively Parallel Processing (MPP) model. Impala avoids data movement and lets users interact with data stored in HDFS through a SQL front end rather than traditional Hadoop jobs.[05][06]
- Source Code: https://github.com/apache/impala[02]
- Twitter: @ApacheImpala
- Developer
- Governance: Apache Software Foundation
- Country of Origin: US
- Start Year: 2013 [29]
- Project Types: Commercial, Open Source
- Written in: C++
- Inspired By: BigQuery
- Operating System: Linux
- License: Apache v2
History[07][08]
The Impala project was announced in October 2012 with the objective of providing a SQL interface and Business Intelligence tools for data scientists. Impala supports various HDFS file formats; however, it is optimized for Parquet, a column-oriented file format that was announced in early 2013. Impala was accepted into the Apache Incubator on December 2, 2015.
Checkpoints[09][10]
Impala does not support query checkpoints. When a host node on which a query is running fails, Impala cancels the query. Additional support for long-running queries is planned so that a query can complete even in the presence of node failures.
Concurrency Control[11]
Impala does not support any concurrency control mechanism. Because the Hive Metastore (HMS), which is updated on inserts and updates, is transactional, parallel inserts into the same table raise an error.
Data Model[12]
Impala is a massively parallel query engine that is not strongly coupled to the underlying storage layer. Currently, Impala supports only a flat relational schema. Its developers plan to add support for nested schemas with complex column types.
Foreign Keys[13][14]
Impala does not currently support foreign keys. They are planned as an aid to cardinality estimation during query planning, but Impala will not enforce them.
Indexes[15][16]
Impala does not support indexes. Although Hive provides limited index capabilities, Impala does not leverage them. Since Impala is not a monolithic DBMS, it is often unaware of the data that shows up in the HDFS files. Hence it is not possible for an index to stay in sync with the base data.
Joins[18]
Impala provides a variety of join options. It does not provide a hint for choosing between nested loop joins and hash joins; Impala decides internally on the most suitable join mechanism for the query. However, it supports query hints for choosing between Broadcast and Shuffle joins.
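The difference between the Broadcast and Shuffle strategies can be illustrated with a toy simulation. The sketch below is not Impala code: the function names, the (key, value) row layout, and the way "rows sent" is counted are all assumptions made for illustration. It only shows that both strategies produce the same result while moving different amounts of data.

```python
from collections import defaultdict

def hash_join(left, right):
    """Local hash join of two lists of (key, value) rows on key."""
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(key, lval, rval) for key, lval in left for rval in table.get(key, [])]

def broadcast_join(left_parts, right_parts):
    """Each node gets a full copy of the right table; left rows stay where they are."""
    full_right = [row for part in right_parts for row in part]
    # Upper bound: assume the whole right table is shipped to every node.
    rows_sent = len(full_right) * len(left_parts)
    out = []
    for part in left_parts:
        out.extend(hash_join(part, full_right))
    return sorted(out), rows_sent

def shuffle_join(left_parts, right_parts):
    """Both tables are hash-partitioned on the join key, then joined node by node."""
    n = len(left_parts)
    new_left = [[] for _ in range(n)]
    new_right = [[] for _ in range(n)]
    rows_sent = 0
    for parts, target in ((left_parts, new_left), (right_parts, new_right)):
        for part in parts:
            for row in part:
                target[hash(row[0]) % n].append(row)
                rows_sent += 1  # Upper bound: count every row as sent over the network.
    out = []
    for lpart, rpart in zip(new_left, new_right):
        out.extend(hash_join(lpart, rpart))
    return sorted(out), rows_sent
```

In this model, broadcasting is cheaper when the right-hand table is small relative to the left, and shuffling is cheaper when both tables are large. In Impala SQL the choice is expressed with a hint on the join, for example `JOIN /* +BROADCAST */` or `JOIN /* +SHUFFLE */`; the exact syntax should be checked against the documentation for the version in use.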
Logging[19]
Since Impala does not support transactions and is suited for analytical queries, it does not support logging.
Query Compilation[12]
Impala uses LLVM to perform just-in-time (JIT) query compilation. It uses runtime code generation to produce query-specific versions of functions, which yields performance improvements of more than 5x.
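The idea behind runtime code generation can be sketched in a few lines. The example below is only an analogy: Impala generates native code through LLVM, whereas this sketch uses Python's built-in `exec` as a stand-in. The predicate format `(column_index, operator, constant)` and the function names are assumptions for illustration.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}

def interpret(row, predicates):
    """Generic path: look up the operator and column for every predicate on every row."""
    return all(OPS[op](row[col], const) for col, op, const in predicates)

def generate(predicates):
    """Specialized path: emit source for this exact predicate list and compile it once."""
    for col, op, const in predicates:
        # Validate inputs so only known operators and plain numbers reach exec().
        if op not in OPS or not isinstance(col, int) or not isinstance(const, (int, float)):
            raise ValueError("unsupported predicate")
    body = " and ".join(f"(row[{col}] {op} {const!r})" for col, op, const in predicates) or "True"
    namespace = {}
    exec(f"def specialized(row):\n    return {body}\n", namespace)
    return namespace["specialized"]
```

The specialized function has the column offsets, operators, and constants baked into its body, so the per-row work no longer includes dictionary lookups or a loop over the predicate list. This is the same kind of saving, at a much smaller scale, that query-specific code generation aims for.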
Query Interface[21][22]
Impala supports SQL as its query language and provides a high degree of compatibility with the Hive Query Language (HiveQL). It also provides the impala-shell interpreter, which processes all the SQL commands supported by Impala along with a few shell-only commands that can be used for performance tuning.
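As a small illustration of driving impala-shell from a script, the sketch below builds a command line without running it. The helper name and the example host are hypothetical. The flags used (`-i` for the daemon host, `-d` for the database, `-B` for delimited output, `-q` for the query) follow the impala-shell documentation as best understood here and should be verified against the installed version.

```python
import shlex

def impala_shell_command(query, host="impalad-host.example.com", database=None, delimited=False):
    """Build an argv list for a one-shot impala-shell query (nothing is executed)."""
    argv = ["impala-shell", "-i", host]
    if database:
        argv += ["-d", database]
    if delimited:
        argv.append("-B")  # plain delimited output instead of pretty-printed tables
    argv += ["-q", query]
    return argv

if __name__ == "__main__":
    # Print a shell-quoted form that could be pasted into a terminal.
    print(shlex.join(impala_shell_command("SELECT COUNT(*) FROM t", database="default")))
```

Returning a list rather than a single string keeps the query text as one argument, so it can be passed to `subprocess.run` without any shell quoting issues.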
Storage Architecture[23][12]
Impala can access data stored on HDFS in any of the Apache Hadoop file formats, including Parquet, Text, Avro, RCFile, and SequenceFile. It also supports compressed file formats, which reduce disk space and I/O volume at the cost of CPU overhead to decompress the data.
Storage Model[25][12]
Impala does not provide its own storage engine but rather reads data from any of the underlying storage formats. Nonetheless, when data is stored in Parquet, a binary columnar storage format, Impala shows a significant performance improvement because the columnar layout substantially reduces the I/O volume.
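A back-of-the-envelope calculation shows why a columnar layout reduces I/O for analytic queries. The sketch below is a simplified model with assumed fixed column widths; it ignores compression, encoding, and file metadata, and the table and column names are hypothetical.

```python
def bytes_scanned_row_layout(num_rows, column_widths):
    """Row-oriented files: every column of every row is read, even if unused."""
    return num_rows * sum(column_widths.values())

def bytes_scanned_columnar(num_rows, column_widths, needed):
    """Columnar files: only the columns the query references are read."""
    return num_rows * sum(column_widths[name] for name in needed)

if __name__ == "__main__":
    widths = {"id": 8, "name": 32, "price": 8, "notes": 200}  # bytes per value (assumed)
    rows = 1_000_000
    print(bytes_scanned_row_layout(rows, widths))           # whole rows
    print(bytes_scanned_columnar(rows, widths, ["price"]))  # e.g. SELECT AVG(price)
```

Under these assumptions a query that touches only `price` reads 8 MB from the columnar layout versus 248 MB from the row layout, a 31x reduction before any compression is applied.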
Stored Procedures[26]
Support for stored procedures, in the form of user-defined functions (UDFs), was added in the 1.2 release. Users can write UDFs in C++ or use Java-based Hive UDFs. C++ UDFs achieve a significant performance improvement over Java UDFs. Support for User Defined Table Functions (UDTFs) has not yet been added.
System Architecture[27][28]
Impala is a distributed, Massively Parallel Processing (MPP) query engine that uses a shared-nothing architecture. It consists of the following three major components:
1. Impala Daemon: A daemon process runs on each data node, reads and writes data for the queries it accepts, and parallelizes the work across the cluster. It transmits the query results to the central coordinator node.
2. Impala Statestore: A daemon process that continuously monitors the health of the daemons on the data nodes in the cluster. When a data node goes down, it ensures that no requests are made to the unreachable node. It provides robustness, load balancing, and high availability.
3. Impala Catalog Service: It relays metadata changes from SQL statements to all the Impala Daemons. As a result, when metadata changes are made by SQL statements issued through Impala, there is no need to issue REFRESH or INVALIDATE METADATA statements manually.
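The role of the Statestore in keeping work away from unreachable nodes can be sketched with a toy model. This is not how Impala implements it: the class name, the timeout, the direction of the heartbeats, and the round-robin assignment are all simplifying assumptions for illustration.

```python
class ToyStatestore:
    """Tracks the last heartbeat time of each daemon (hypothetical model)."""

    def __init__(self, timeout_s=10.0):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, daemon, now):
        """Record that a daemon was alive at time `now` (seconds)."""
        self.last_seen[daemon] = now

    def healthy(self, now):
        """Daemons whose last heartbeat is within the timeout window."""
        return sorted(d for d, t in self.last_seen.items() if now - t <= self.timeout_s)

def plan_fragments(statestore, now, num_fragments):
    """Assign query fragments round-robin to daemons currently considered healthy."""
    nodes = statestore.healthy(now)
    if not nodes:
        raise RuntimeError("no healthy daemons available")
    return {i: nodes[i % len(nodes)] for i in range(num_fragments)}
```

A daemon that stops sending heartbeats simply drops out of the `healthy` list, so later queries are planned without it. Queries already running on a failed node are a separate matter; as noted under Checkpoints, Impala cancels them.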
Views[10]
Impala supports virtual views as lightweight logical constructs that act as query aliases. It does not support materialized views, since data updates in the Hadoop environment make it difficult to keep them up to date.
Citations
- [01] Impala apache.org
- [02] GitHub - apache/impala: Apache Impala · GitHub github.com
- [03] Impala apache.org
- [04] Apache Impala - Wikipedia wikipedia.org
- [05] https://docs.cloudera.com/documentation/enterprise/latest/topics/impala.html cloudera.com
- [06] Introducing Apache Impala apache.org
- [07] Apache Impala - Wikipedia wikipedia.org
- [08] Introducing Apache Impala apache.org
- [09] Impala Frequently Asked Questions | 5.2.x | Cloudera Documentation cloudera.com
- [10] https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_faq.html cloudera.com
- [11] Archived - Cloudera Community cloudera.com
- [12] Impala: A Modern, Open-Source SQL Engine for Hadoop cidrdb.org
- [13] https://issues.apache.org/jira/browse/IMPALA-2112 apache.org
- [14] https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_reserved_words.html cloudera.com
- [15] Impala Frequently Asked Questions | 5.6.x | Cloudera Documentation cloudera.com
- [16] Google Groups google.com
- [17] impala/thirdparty/hbase-0.94.6-cdh4.3.0/src/main/java/org/apache/hadoop/hbase/client/IsolationLevel.java at master · schubertzhang/impala · GitHub github.com
- [18] Joins in Impala SELECT Statements | 5.9.x | Cloudera Documentation cloudera.com
- [19] https://www.tutorialspoint.com/impala/impala_overview.html tutorialspoint.com
- [20] https://chatwithengineers.com/2016/08/29/a-survey-of-query-execution-engines-from-volcano-to-vectorized-processing chatwithengineers.com
- [21] https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_langref.html cloudera.com
- [22] https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_impala_shell.html cloudera.com
- [23] https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_file_formats.html cloudera.com
- [24] How Impala Works with Hadoop File Formats apache.org
- [25] Latest Insights on Data and AI | Cloudera Blog cloudera.com
- [26] https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_udf.html cloudera.com
- [27] Impala Concepts and Architecture apache.org
- [28] Impala apache.org
- [29] Impala Version and Download Information | 5.x | Cloudera Documentation cloudera.com
- [30] https://github.com/apache/impala/commit/11b5fd6c1f5201af5710342e9c793b586664d6b1 github.com
- [31] https://github.com/apache/impala/commit/82292dfd1add34f8e148b45239d58187cfd55e16 github.com
- [32] https://github.com/apache/impala/commit/12e79a22ef438c3b515ba0218e8e002b3b234cc6 github.com
- [33] https://github.com/apache/impala/commit/328448d0600ee0adcc15a4180a18b624b0ef87d0 github.com