DSS and Impala¶
Impala is a tool of the Hadoop environment to run interactive analytic SQL queries on large amounts of HDFS data.
Unlike Hive, Impala does not use MapReduce nor Tez but a custom Massive Parallel Processing engine, ie. each node of the Hadoop cluster runs the query on its part of the data.
Data Science Studio provides the following integration points with Impala :
- All HDFS datasets can be made available in the Impala environment, where they can be used by any Impala-capable tool.
- The Impala recipes run queries on Impala, while handling the schema of the output dataset.
- The “Impala notebook” allows you to run Impala queries on any Impala database, whether they have been created by DSS or not.
- When performing Charts on a HDFS dataset, you can choose to use Impala as the query execution engine.
- Many visual recipes can use Impala as their execution engine if the computed query permits it.
Data Science Studio connects to Impala through a JDBC connection to one of the impalad server(s) configured in the “Settings / Hadoop” administration screen.
Hive connectivity is mandatory for Impala use, as Impala connections use the Hive JDBC driver, and Impala table definitions are stored in the global Hive metastore.
Making HDFS datasets automatically available to Impala is done through the same mechanism as for Hive. See DSS and Hive for more info.
Supported formats and limitations¶
Impala can only interact with HDFS datasets with the following formats:
- only in “Escaping only” or “No escaping nor quoting” modes.
- only in “NONE” compression
- If the dataset has been built by DSS, it should use the “Hive flavor” option of the Parquet parameters.
- Hive Sequence File
- Hive RC File
Additional limitations apply:
- Impala cannot handle datasets if they contain any complex type column.
Configuring connection to Impala servers¶
The settings for Impala are located in Administration > Settings > Impala.
Impala queries are analyzed and their execution initiated by a
impalad daemon on one datanode from your Hadoop cluster. Thus, in order for DSS to interact with Impala, DSS must know the hostnames of the impalad hosts (or at least a fraction of them). You need to setup the list of these hostnames in the “Hosts” field.
Should you need a custom port, you can also set it in the “Port” field.
DSS can handle both the regular Hive-provided JDBC driver and the Cloudera enterprise connector.
DSS supports the following authentication modes for Impala:
- No authentication (on non-secure Hadoop clusters)
- Kerberos (on secure Hadoop clusters)
Kerberos authentication (secure clusters)¶
Since Impala queries are run by the
impalad daemons under the
impala user, in a kerberized environment the principal of that
impala user is required in order to properly connect to the daemons through jdbc, and it can be set in the “Principal” field.
When multiple hostnames have been specified in the “Hosts” field, DSS provides the same placeholder mechanism as Impala itself: the string
_HOST is swapped with the hostname DSS tries to run the query against.
Just like SQL connections in DSS, credentials must be provided in form of a user/password. These credentials can be GLOBAL and shared by all users of the DSS instance, of defined per-DSS user in their Profile settings.
If Impala has been setup to encrypt connections to client services (see Cloudera’s documentation), then DSS, as a JDBC client of Impala, needs access to the Java truststore holding the necessary security certificates (see the Hive documentation on JDBC connection).
Please refer to Adding SSL certificates to the Java truststore for the procedure to add trusted certificates to the JVM used by DSS.
Using Impala to write outputs¶
Even though Impala is traditionally used to perform SELECT queries, it also offers INSERT capabilities, albeit reduced ones.
First, Impala supports less formats for writing than it does for reading. You can check Impala’s support of your format on Cloudera’s documentation.
Second, Impala doesn’t do impersonation and writes its output using the
impala user. Since DSS uses EXTERNAL tables (in the meaning Hive gives to it), the user must be particularly attentive to the handling of file permissions, depending on the Hive authorization mode and the DSS security mode.
Sentry (DSS regular security)¶
With ACL synchronization (recommended)¶
For this mode, you must have the “HDFS ACL Sentry synchronization” mechanism enabled (this is the default on Cloudera). Refer to Cloudera’s documentation for help on setting up Sentry for Impala and HDFS synchronization.
You need to add the root folders of all DSS HDFS managed connections to the list of roots that the Sentry ACL synchronization plugin handles.
For Cloudera Manager:
- Go to HDFS > Configuration and look for the “Sentry Synchronization Path Prefixes” setting
- Add an entry with the root path of each of your HDFS connections
- Save settings, and restart stale services
In DSS, make sure that in Administration > Settings > Impala, the
Pre-create folder for write recipes setting is checked. Save DSS settings.
With permissions inheritance¶
If you don’t have HDFS ACL Sentry synchronization enabled, or can’t add the DSS connection roots to the synchronization roots, you can use the permissions inheritance + permissions setup mode described above.
Sentry (DSS User Isolation Framework)¶
Sentry must be activated to control write permissions, and the synchronization of HDFS’s ACLs with Sentry must be active.
Switching from write-through-DSS to write-through-Impala¶
Before switching to write-through-Impala, you must clear the dataset. Failure to clear the dataset will lead to permission issues that will require cluster administrator intervention.