DSS and Impala

Impala is a tool from the Hadoop ecosystem that runs interactive analytic SQL queries on large amounts of HDFS data.

Unlike Hive, Impala does not use MapReduce but “Massively Parallel Processing” (MPP), i.e. each node of the Hadoop cluster runs the query on its part of the data.

Data Science Studio provides the following integration points with Impala:

  • All HDFS datasets can be made available in the Impala environment, where they can be used by any Impala-capable tool.
  • The “Impala notebook” allows you to run Impala queries on any Impala database, whether or not it was created by DSS.
  • When visualizing an HDFS dataset, you can choose to use Impala as the query execution engine.

Metastore synchronization

HDFS datasets are made automatically available to Impala through the same mechanism as for Hive. See DSS and Hive for more information.

Supported formats and limitations

Impala can only interact with HDFS datasets with the following formats:

  • CSV
    • only in “Escaping only” or “No escaping nor quoting” modes
    • only with “NONE” compression
  • Parquet
    • If the dataset has been built by DSS, it should use the “Hive flavor” option of the Parquet parameters.
  • Hive Sequence File
  • Hive RC File
  • Avro

Additional limitations apply:

  • Impala cannot handle a dataset that contains any column of a complex type.
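
The rules above can be read as a simple checklist. The following Python sketch is only an illustration of that checklist, not part of DSS: the `DatasetInfo` class, its field names, and the `impala_issues` function are hypothetical, and the set of type names treated as "complex" (`array`, `map`, `object`) is an assumption.

```python
from dataclasses import dataclass, field

# Formats and CSV modes as listed in this section.
SUPPORTED_FORMATS = {"CSV", "Parquet", "Hive Sequence File", "Hive RC File", "Avro"}
SUPPORTED_CSV_MODES = {"Escaping only", "No escaping nor quoting"}
# Assumption: type names considered "complex" for this illustration.
COMPLEX_TYPES = {"array", "map", "object"}


@dataclass
class DatasetInfo:
    """Hypothetical description of an HDFS dataset (not a DSS API object)."""
    format: str
    column_types: list = field(default_factory=list)
    csv_mode: str = ""
    compression: str = "NONE"
    built_by_dss: bool = False
    parquet_hive_flavor: bool = False


def impala_issues(ds):
    """Return the list of reasons why Impala could not use `ds` (empty if none)."""
    issues = []
    if ds.format not in SUPPORTED_FORMATS:
        issues.append(f"unsupported format: {ds.format}")
    if ds.format == "CSV":
        if ds.csv_mode not in SUPPORTED_CSV_MODES:
            issues.append(f"unsupported CSV mode: {ds.csv_mode}")
        if ds.compression != "NONE":
            issues.append(f"unsupported CSV compression: {ds.compression}")
    if ds.format == "Parquet" and ds.built_by_dss and not ds.parquet_hive_flavor:
        issues.append("Parquet dataset built by DSS should use the Hive flavor")
    complex_cols = [t for t in ds.column_types if t in COMPLEX_TYPES]
    if complex_cols:
        issues.append(f"complex type columns: {complex_cols}")
    return issues


if __name__ == "__main__":
    example = DatasetInfo(format="CSV", column_types=["string", "array"],
                          csv_mode="Escaping only", compression="GZIP")
    for issue in impala_issues(example):
        print(issue)
```

An empty list means that none of the limitations listed in this section applies to the described dataset; it does not guarantee that Impala can query it.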