What is Apache Impala?
Apache Impala is a massively parallel processing (MPP) query engine that runs on the Hadoop platform. It is open-source software, developed on the basis of Google's Dremel paper.
It is an interactive, SQL-like query engine that runs on top of the Hadoop Distributed File System (HDFS). Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries against data stored in HDFS and Apache HBase without requiring data movement or transformation.
Impala integrates with the Apache Hive metastore database, so the two components share table metadata. Analysts and data scientists can perform analytical operations on data stored in Hadoop through familiar SQL tools. The same system, data, and metadata can thus serve both large-scale batch processing (via MapReduce) and interactive queries, which eliminates the need to copy data sets into specialized systems.
Why use Impala?
- Impala provides parallel-processing database technology on top of the Hadoop ecosystem, so it can run low-latency queries interactively.
- Impala saves time by returning results in seconds, whereas Hive on MapReduce spends considerable time launching and processing queries.
- Impala lets analysts and data scientists perform analytics on data stored in the Hadoop file system through a real-time query engine.
- Because it provides near-real-time results, it works well with reporting and visualization tools such as Pentaho.
- Impala has built-in support for the commonly used Hadoop file formats (Parquet, ORC, etc.), providing high-performance, low-latency SQL queries on data stored in these formats.
Components of the Impala Server
Impala is a massively parallel processing query engine that sits on top of cluster systems like Apache Hadoop. It involves several daemon processes that run on specific hosts within your CDH cluster, given as follows:
- Impala daemon
- Impala Statestore
- Impala Catalog Service
Impala Daemon
The Impala daemon (impalad) is the core component of Impala and runs on each node of the cluster. Its main tasks are to read from and write to the data files, and to accept queries transmitted from data-analytics tools over JDBC or ODBC connections.
The daemon that receives a query acts as the coordinator node for that query: it divides the work, distributes the load to the other nodes in the cluster, and collects their intermediate results. Because every Impala daemon can handle user queries, any node in your cluster can serve as the coordinator for a given query. The other nodes process their portions of the work simultaneously and transmit their partial results back to the coordinator, which then constructs the final result set for the query.
Impala Statestore
The statestore's task is to check the health of the Impala daemons on all nodes of the cluster. It is physically represented by a daemon process named statestored, and only one node in the cluster needs to run it. The daemons are in continuous contact with the statestore to learn which nodes are healthy and can accept new work. If an impalad daemon stops working for any reason, the statestore informs all the other Impala daemons running in the cluster. The cluster becomes less robust if an Impala daemon fails on a node while the statestore is offline; when statestored comes back online, it re-establishes communication with the other nodes and resumes its monitoring.
Impala Catalog Service
The catalog service is responsible for broadcasting metadata changes made by Impala DDL (Data Definition Language) and DML (Data Manipulation Language) statements to all nodes in the cluster. It is represented by a daemon process named catalogd, and only one host in the cluster needs to run it.
Impala Basic Commands and Syntax
Impala offers interactive query results through its command-line interface, which also supports DML and DDL operations. Thanks to the catalog daemon, any change made through Impala is reflected in the Hive metastore immediately.
However, if a change is made directly in the Hive metastore or in the HDFS file system, you must run a manual command, INVALIDATE METADATA, so that Impala picks it up.
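As a sketch of what this looks like in practice (the database and table names here are illustrative), after a table has been created or loaded outside of Impala you would run:

```sql
-- Assumed example: dezyre_db.user_logs was created through Hive,
-- so Impala's cached metadata does not know about it yet.
INVALIDATE METADATA dezyre_db.user_logs;

-- If only the underlying HDFS data files changed (e.g., new files
-- were appended), the lighter-weight REFRESH is sufficient:
REFRESH dezyre_db.user_logs;
```

Running INVALIDATE METADATA without a table name reloads metadata for all tables, which is more expensive, so scoping it to the affected table is usually preferred.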
The basic command syntax is as follows:
- Impala supported Datatypes
It supports the common numeric, character, date/time, and Boolean data types, such as:
TINYINT, SMALLINT, INT, BIGINT, FLOAT, DECIMAL, DOUBLE, REAL,
CHAR, VARCHAR, STRING, TIMESTAMP, BOOLEAN.
- Impala table queries
CREATE TABLE DEZYRE_USER_INFO (user_name bigint, unique_accounts bigint, server_hits bigint, topic_name string, accessed_date string)
STORED AS PARQUET;
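A few illustrative statements against a table like the one above (the row values and column meanings are made up for the sake of the example):

```sql
-- Insert a sample row (values are illustrative):
INSERT INTO DEZYRE_USER_INFO
VALUES (101, 3, 250, 'impala-basics', '2020-01-15');

-- Aggregate server hits per topic, largest first:
SELECT topic_name, SUM(server_hits) AS total_hits
FROM DEZYRE_USER_INFO
GROUP BY topic_name
ORDER BY total_hits DESC;
```

These statements run from impala-shell or from any client connected over JDBC/ODBC, with the catalog service propagating the resulting metadata changes to the other nodes.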
Features of Impala
Following are some important features of Impala:
- Open Source: Apache Impala is open-source software, so users can freely access and modify the code.
- In-Memory Processing: Impala supports in-memory data processing, which means it accesses and analyzes data stored on Hadoop data nodes without any data movement.
- Easy Data Access: We can easily access data with Impala using SQL-like queries. Moreover, Impala offers common data-access interfaces, including:
a) JDBC driver
b) ODBC driver
- Faster Access: Impala is an efficient tool for rapid access to data in HDFS.
- Storage Systems: Impala can query data held in storage systems such as HDFS and Apache HBase.
- High Performance: Impala offers higher performance and lower latency than other SQL engines for Hadoop.
- Authentication: Authentication plays a significant role in security, because it identifies users and protects data from eavesdroppers. For this purpose, Impala uses Kerberos authentication.
Security plays a vital role in preventing your data from being misused or stolen by an unauthorized party, so we must verify the identity of each user and check whether they have permission to read and write the files.
Impala Security Features
The security features of Impala are as follows:
- Authorization: Impala uses the OS user ID of the user running impala-shell or another client program, and applies the appropriate privileges to each user when authorization is enabled.
- Authentication: Impala relies on the Kerberos subsystem for authentication.
- Auditing: Impala collects audit data so that suspicious activity or illegal operations can be tracked down. This feature provides a way to look back and diagnose security incidents.
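When authorization is enabled, privileges can be managed with SQL statements. This is a hedged sketch: it assumes an authorization provider such as Apache Sentry or Apache Ranger is configured for the cluster, and the role, group, and table names are illustrative:

```sql
-- Create a role and attach it to an OS/LDAP group (names are assumptions):
CREATE ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;

-- Allow the role to read one table:
GRANT SELECT ON TABLE dezyre_db.user_logs TO ROLE analyst_role;
```

Without a configured authorization provider, these statements are rejected; the exact syntax supported also varies between Impala versions.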
Advantages of Impala
There are several advantages of Impala, which are given as follows:
- Fast Speed: Using Impala, we can process data in HDFS at very high speed.
- No Data Migration Necessary: We don't need to transform or move data stored in Hadoop, because processing is carried out where the data resides.
- Big Data: Users can easily store and manage large amounts of data with Impala.
- Languages: Language support is not an issue, because Impala can be used from many programming languages through its JDBC and ODBC interfaces.
- High Performance: It offers high performance and low latency for SQL on Hadoop.
- Distributed: It provides a distributed environment in which a query is spread across the nodes of the cluster, reducing per-node workload and providing convenient scalability.
- Easy Access: We can easily access data stored in HDFS, HBase, and Amazon S3 without requiring knowledge of Java.
Disadvantages of Impala
There are also some disadvantages of Impala which are as follows:
- No Support for Triggers: Impala does not support triggers.
- No Update or Delete: In Impala, we cannot update or delete individual records.
- No Transactions: Impala has no support for transactions.
- No Indexing: Impala does not support indexing.