DSS and Impala

Impala is a tool of the Hadoop ecosystem for running interactive analytic SQL queries on large volumes of HDFS data.

Unlike Hive, Impala does not use MapReduce but a “massively parallel processing” (MPP) architecture: each node of the Hadoop cluster runs the query on its own part of the data.

Data Science Studio provides the following integration points with Impala:

  • All HDFS datasets can be made available in the Impala environment, where they can be used by any Impala-capable tool.
  • The Impala recipes run queries on Impala, while handling the schema of the output dataset.
  • The “Impala notebook” allows you to run Impala queries on any Impala database, whether or not it was created by DSS.
  • When performing data visualization with DSS on an HDFS dataset, you can choose Impala as the query execution engine.
  • The grouping and join visual recipes can run on Impala when the generated query permits it.

Metastore synchronization

Making HDFS datasets automatically available to Impala is done through the same mechanism as for Hive. See DSS and Hive for more info.
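
For illustration, once a dataset has been synchronized to the metastore, any Impala client can see the corresponding table. The sketch below uses the impyla Python package (an assumption for this example, not part of DSS); the hostname and database name are placeholders.

    from impala.dbapi import connect

    # Connect to an impalad daemon (21050 is the default impalad client port).
    conn = connect(host="datanode1.example.com", port=21050)
    cur = conn.cursor()

    # Tell Impala to pick up tables created in the Hive metastore by DSS.
    cur.execute("INVALIDATE METADATA")

    # The synchronized HDFS datasets now appear as regular tables.
    cur.execute("SHOW TABLES IN dss_managed")  # "dss_managed" is a placeholder
    for (table,) in cur.fetchall():
        print(table)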

Supported formats and limitations

Impala can only interact with HDFS datasets with the following formats:

  • CSV
    • only in “Escaping only” or “No escaping nor quoting” modes.
    • only with “NONE” compression.
  • Parquet
    • If the dataset has been built by DSS, it should use the “Hive flavor” option of the Parquet parameters.
  • Hive Sequence File
  • Hive RC File
  • Avro

Additional limitations apply:

  • Impala cannot handle datasets that contain complex-type columns.
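
If you are unsure which format a synchronized table actually uses, you can inspect it from Impala itself. A minimal sketch with the impyla Python client (hostname, database and table names are placeholders):

    from impala.dbapi import connect

    conn = connect(host="datanode1.example.com", port=21050)
    cur = conn.cursor()

    # DESCRIBE FORMATTED reports, among other things, the table's
    # InputFormat/OutputFormat and SerDe, from which the storage format
    # (CSV, Parquet, Sequence File, ...) can be read.
    cur.execute("DESCRIBE FORMATTED dss_managed.mydataset")
    for row in cur.fetchall():
        print(row)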

How to set up the connection to the Impala servers

The settings for Impala are located in the administration section, under “Settings”.

Impala queries are analyzed, and their execution initiated, by an impalad daemon on one datanode of your Hadoop cluster. Thus, in order for DSS to interact with Impala, it must know the hostnames of the datanodes, or at least of a fraction of them. You can set up the list of these hostnames in the “Hosts” field. If the list is left empty, DSS assumes that localhost is a datanode.

Should you need a custom port, you can also set it in the “Port” field.
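
As an illustration of the behavior described above (not of DSS’s actual implementation), here is how a client might pick one datanode from a configured list, falling back to localhost when the list is empty; the hostnames are placeholders and the impyla Python package stands in for DSS’s JDBC connection:

    import random

    from impala.dbapi import connect

    # Mirrors the "Hosts" and "Port" settings: pick one datanode at random,
    # or fall back to localhost when no host is configured.
    hosts = ["datanode1.example.com", "datanode2.example.com"]  # may be empty
    port = 21050  # default impalad client port; override via the "Port" field

    host = random.choice(hosts) if hosts else "localhost"
    conn = connect(host=host, port=port)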

Since Impala queries are run by the impalad daemons as the impala user, in a kerberized environment the principal of that impala user is required in order to connect to the daemons through JDBC; it can be set in the “Principal” field. When multiple hostnames have been specified in the “Hosts” field, DSS provides the same placeholder mechanism as Impala itself: the string _HOST is replaced with the hostname DSS tries to run the query against.
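
A sketch of this placeholder mechanism (the principal, realm and hostname are placeholders, and impyla again stands in for DSS’s JDBC connection):

    from impala.dbapi import connect

    def resolve_principal(principal, host):
        """Replace the _HOST placeholder with the target datanode's hostname."""
        return principal.replace("_HOST", host)

    host = "datanode2.example.com"  # the datanode chosen for this query
    print(resolve_principal("impala/_HOST@EXAMPLE.COM", host))
    # -> impala/datanode2.example.com@EXAMPLE.COM

    # With impyla, a kerberized connection uses GSSAPI; the service name is
    # the first component of the principal ("impala" here).
    conn = connect(host=host, port=21050,
                   auth_mechanism="GSSAPI",
                   kerberos_service_name="impala")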

Using Impala to write the query output

Even though Impala is traditionally used to perform SELECT queries, it also offers INSERT capabilities, albeit reduced ones.

First, Impala supports fewer formats for writing than it does for reading. You can check Impala’s support for your format in Cloudera’s documentation.

Second, Impala does not do impersonation: it writes its output as the impala user. Since DSS uses EXTERNAL tables (in the sense Hive gives to the term), you must pay particular attention to file permissions.
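
To make this concrete, here is a hedged sketch of the kind of statements involved, again through the impyla client; the table, column and path names are placeholders, not the exact SQL DSS generates:

    from impala.dbapi import connect

    conn = connect(host="datanode1.example.com", port=21050)
    cur = conn.cursor()

    # DSS-managed datasets are EXTERNAL tables: the data lives at an explicit
    # HDFS LOCATION, and dropping the table does not delete the files.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS dss_managed.output_dataset (
            customer_id STRING,
            total       DOUBLE
        )
        STORED AS PARQUET
        LOCATION '/user/dss/datasets/output_dataset'
    """)

    # INSERT OVERWRITE replaces the dataset's content. The files are written
    # by the impalad daemons as the impala user, hence the permission caveats
    # discussed below.
    cur.execute("""
        INSERT OVERWRITE TABLE dss_managed.output_dataset
        SELECT customer_id, SUM(amount)
        FROM dss_managed.orders
        GROUP BY customer_id
    """)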

In a non-secured Hadoop cluster

For Impala to write to the directories corresponding to the managed datasets, it needs write permissions on them. This requires that:

  • The directory holding the datasets gives write permission to the impala user, for example by having rwx permissions for all users (see the sketch after this list).
  • Hive propagates parent permissions onto the sub-folders it creates, which means the property hive.warehouse.subdir.inherit.perms must be set to “true”.
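
For example, one permissive way to open the managed-datasets root to the impala user is sketched below; the path is a placeholder, and the rwx-for-all choice is the example given above (a group-based scheme is a tighter alternative). The hive.warehouse.subdir.inherit.perms property itself is typically set in hive-site.xml.

    import subprocess

    # Placeholder path: the HDFS root under which DSS stores managed datasets.
    datasets_root = "/user/dss/datasets"

    # Give rwx to all users so that files written by the impala user stay
    # readable and writable by everyone, including the DSS user.
    subprocess.run(["hdfs", "dfs", "-chmod", "-R", "777", datasets_root],
                   check=True)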

In a kerberized environment

Sentry must be activated to control write permissions, and the synchronization of HDFS’s ACLs with Sentry must be active. Refer to Cloudera’s documentation for help on setting up Sentry for Impala and HDFS synchronization.