Parquet is an efficient file format from the Hadoop ecosystem. Its main characteristics are:
- Column-oriented, even for nested complex types
- Block-based compression
- Parquet datasets can only be stored on HDFS
- They can be used as inputs and outputs of Pig and Hive recipes
- They can be used in the Hive and Impala notebooks
Limitations and issues
- Because Pig and Hive map their data types to Parquet differently, you must select a writing flavor when DSS writes a Parquet dataset. Reading a Parquet dataset written by Pig with Hive (and vice versa) leads to various issues, most of them related to complex types.