Parquet

Parquet is an efficient file format from the Hadoop ecosystem. Its main characteristics are:

  • Column-oriented, even for nested complex types
  • Block-based compression
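To make both points concrete, here is a minimal sketch using Python with pyarrow (a common Parquet library, not part of DSS itself; the file and column names are invented for the example). It writes a Snappy-compressed file, then reads back a single column without scanning the others:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write a small table as Parquet with Snappy block compression
    table = pa.table({
        "user_id": [1, 2, 3],
        "country": ["FR", "US", "DE"],
    })
    pq.write_table(table, "users.parquet", compression="snappy")

    # Column orientation: fetch only the columns you need;
    # the reader skips the other column chunks entirely
    countries = pq.read_table("users.parquet", columns=["country"])
    print(countries)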

Applicability

  • Parquet datasets can only be stored on HDFS
  • They can be used as inputs and outputs of Pig and Hive recipes
  • They can be used in the Hive and Impala notebooks
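For instance, a Parquet-backed table can be queried from Hive like any other table. The sketch below uses Python with the PyHive library; the host, database, and table names are placeholders, not DSS specifics:

    from pyhive import hive

    # Connect to HiveServer2 (host, port, and database are placeholders)
    conn = hive.connect(host="hive-server.example.com", port=10000,
                        database="default")
    cursor = conn.cursor()

    # Query a table whose underlying storage is Parquet
    cursor.execute("SELECT country, COUNT(*) FROM users_parquet GROUP BY country")
    for row in cursor.fetchall():
        print(row)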

Limitations and issues

Misc

  • Because Pig and Hive map their data types to Parquet differently, you must select a writing flavor when DSS writes a Parquet dataset. Reading with Hive a Parquet dataset written by Pig (or vice versa) leads to various issues, most of them related to complex types. A first diagnostic step is shown in the sketch below.
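When diagnosing such cross-flavor issues, it can help to inspect how the file actually encoded its complex types before reading it with the other engine. A minimal sketch with pyarrow, assuming a hypothetical part-file path:

    import pyarrow.parquet as pq

    # Print the Parquet schema, including how nested/complex types
    # were encoded, without reading any row data
    schema = pq.read_schema("dataset/part-00000.parquet")
    print(schema)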