DSS and Impala
Impala is a tool in the Hadoop ecosystem for running interactive analytic SQL queries on large amounts of HDFS data.
Unlike Hive, Impala does not use MapReduce but “Massively Parallel Processing” (MPP), i.e. each node of the Hadoop cluster runs the query on its own part of the data.
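The idea of each node working on its own part of the data can be illustrated with a toy sketch. This is only a conceptual model and not Impala's actual implementation; the function names `node_partial` and `coordinator_merge` and the sample partitions are hypothetical, chosen for illustration.

```python
# Toy model of MPP-style aggregation. Not Impala's real implementation:
# all names and data here are hypothetical.

def node_partial(rows):
    """Work done locally by one node: sum and count over its own rows."""
    return sum(rows), len(rows)


def coordinator_merge(partials):
    """Combine the per-node partial results into the final average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Each inner list stands for the slice of the data held by one node.
partitions = [[10, 20], [30], [40, 50, 60]]

# Every node computes its partial result independently of the others.
partials = [node_partial(p) for p in partitions]

# Only the small partial results are combined at the end.
average = coordinator_merge(partials)
```

The point of the sketch is that only the small per-node results travel to the merge step, while the bulk of the data is processed where it is stored.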
Data Science Studio provides the following integration points with Impala:
- All HDFS datasets can be made available in the Impala environment, where they can be used by any Impala-capable tool.
- The “Impala notebook” allows you to run Impala queries on any Impala database, whether it was created by DSS or not.
- When performing visualization on an HDFS dataset, you can choose to use Impala as the query execution engine.
HDFS datasets are made automatically available to Impala through the same mechanism as for Hive. See DSS and Hive for more information.
Supported formats and limitations
Impala can only interact with HDFS datasets in the following formats:
- CSV
  - only in “Escaping only” or “No escaping nor quoting” modes.
  - only in “NONE” compression.
- Parquet
  - If the dataset has been built by DSS, it should use the “Hive flavor” option of the Parquet parameters.
- Hive Sequence File
- Hive RC File
Additional limitations apply:
- Impala cannot handle datasets that contain any complex-type column.
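The rules in this section can be summarized as a small pre-flight check. The sketch below is illustrative only and is not a DSS or Impala API: `DatasetInfo`, `impala_blockers`, the format and mode identifiers, and the set of type names treated as complex are all hypothetical. It also assumes that the escaping and compression restrictions apply to CSV datasets and that the “Hive flavor” remark applies to Parquet datasets.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers throughout: none of these names come from DSS or Impala.
IMPALA_FORMATS = {"csv", "parquet", "hive_sequence_file", "hive_rc_file"}
CSV_STYLES = {"escaping_only", "no_escaping_nor_quoting"}
# Assumption: which type names count as "complex" is chosen for illustration.
COMPLEX_TYPES = {"array", "map", "object"}


@dataclass
class DatasetInfo:
    """Minimal description of an HDFS dataset for the check below."""
    fmt: str
    csv_style: str = ""
    compression: str = "NONE"
    built_by_dss: bool = False
    parquet_flavor: str = ""
    column_types: list = field(default_factory=list)


def impala_blockers(ds):
    """Return the list of reasons why Impala could not use this dataset."""
    problems = []
    if ds.fmt not in IMPALA_FORMATS:
        problems.append(f"unsupported format: {ds.fmt}")
    if ds.fmt == "csv":
        if ds.csv_style not in CSV_STYLES:
            problems.append(f"unsupported CSV quoting mode: {ds.csv_style}")
        if ds.compression != "NONE":
            problems.append(f"unsupported CSV compression: {ds.compression}")
    if ds.fmt == "parquet" and ds.built_by_dss and ds.parquet_flavor != "hive":
        problems.append("Parquet dataset built by DSS without the Hive flavor")
    complex_cols = [t for t in ds.column_types if t in COMPLEX_TYPES]
    if complex_cols:
        problems.append(f"complex type columns present: {complex_cols}")
    return problems
```

For example, `impala_blockers(DatasetInfo(fmt="csv", csv_style="escaping_only"))` returns an empty list, while a gzip-compressed CSV dataset or a dataset with an `array` column yields one or more reasons.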