Parquet

Parquet is an efficient file format of the Hadoop ecosystem. Its main points are:

  • Column-oriented, even for nested complex types
  • Block-based compression
  • Ability to “push down” filtering predicates to avoid useless reads

Using Parquet or another efficient file format is strongly recommended when working with Hadoop data (rather than CSV data). Speedups can reach up to x100 on select queries.

Applicability

  • Parquet datasets can only be stored on HDFS. If the data is on S3 or Azure Blob Storage, then access needs to be setup through Hadoop with HDFS connections
  • Parquet datasets can be used as inputs and outputs of all recipes
  • Parquet datasets can be used in the Hive and Impala notebooks

Limitations and issues

Case-sensitivity

Due to differences in how Hive and Parquet treat identifiers, it is strongly recommended that you only use lowercase identifiers when dealing with Parquet files.

Misc

  • Due to various differences in how Pig and Hive map their data types to Parquet, you must select a writing Flavor when DSS writes a Parquet dataset. Reading with Hive a Parquet dataset written by Pig (and vice versa) leads to various issues, most being related to complex types.