Parquet is an efficient file format from the Hadoop ecosystem. Its main features are:
- Column-oriented, even for nested complex types
- Block-based compression
- Ability to “push down” filtering predicates to avoid useless reads
Using Parquet or another efficient file format is strongly recommended when working with Hadoop data (rather than CSV data). Speedups can reach 100x on select queries.
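To make the idea of predicate pushdown more concrete, here is a minimal conceptual sketch in plain Python. It is not how Parquet or DSS is implemented: the RowGroup class, the scan function and the sample data are all hypothetical, and the min/max statistics are recomputed on the fly, whereas a real Parquet file stores them in its metadata. The sketch only shows why a filter can skip whole blocks of data without reading them, and why only the columns a query needs have to be touched.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a block of rows stored column by column."""
    columns: Dict[str, List[int]]

    def stats(self, column: str) -> Tuple[int, int]:
        # Recomputed here for simplicity; assumed to be precomputed metadata
        # in a real columnar file.
        values = self.columns[column]
        return min(values), max(values)


def scan(row_groups: List[RowGroup], select: str, filter_column: str,
         lower_bound: int) -> Tuple[List[int], int]:
    """Return the `select` values of rows where filter_column >= lower_bound,
    together with the number of row groups that actually had to be read."""
    results: List[int] = []
    groups_read = 0
    for group in row_groups:
        _, col_max = group.stats(filter_column)
        if col_max < lower_bound:
            continue  # no row in this group can match: skip it entirely
        groups_read += 1
        # Only the two columns involved in the query are touched.
        for value, key in zip(group.columns[select], group.columns[filter_column]):
            if key >= lower_bound:
                results.append(value)
    return results, groups_read


row_groups = [
    RowGroup({"year": [2001, 2002, 2003], "sales": [10, 20, 30]}),
    RowGroup({"year": [2010, 2011, 2012], "sales": [40, 50, 60]}),
    RowGroup({"year": [2019, 2020, 2021], "sales": [70, 80, 90]}),
]

values, groups_read = scan(row_groups, select="sales",
                           filter_column="year", lower_bound=2011)
print(values, groups_read)  # [50, 60, 70, 80, 90] 2
```

In this toy data, the first row group is never read because its largest year is below the filter bound. The same reasoning, applied to much larger blocks, is where the savings on select queries come from.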
- Parquet datasets can only be stored on Hadoop filesystems. If the data is on S3 or Azure Blob Storage, then access must be set up through Hadoop with HDFS connections
- Parquet datasets can be used as inputs and outputs of all recipes
- Parquet datasets can be used in the Hive and Impala notebooks
Limitations and issues
Due to differences in how Hive and Parquet treat identifiers, it is strongly recommended that you only use lowercase identifiers when dealing with Parquet files.
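One way to follow the lowercase recommendation is to normalize column names before writing the data. The sketch below is an illustration only and is not part of DSS: the lowercase_columns helper and the sample column names are hypothetical. It builds a rename mapping and refuses to proceed when two names would collide once lowercased, since a silent collision would lose a column.

```python
from typing import Dict, List


def lowercase_columns(names: List[str]) -> Dict[str, str]:
    """Map each original column name to its lowercase form.

    Raises ValueError if two distinct names become identical once lowercased.
    """
    mapping: Dict[str, str] = {}
    seen: Dict[str, str] = {}  # lowercase name -> original name that produced it
    for name in names:
        lowered = name.lower()
        if lowered in seen and seen[lowered] != name:
            raise ValueError(
                f"{seen[lowered]!r} and {name!r} both map to {lowered!r}"
            )
        seen[lowered] = name
        mapping[name] = lowered
    return mapping


renames = lowercase_columns(["CustomerId", "order_date", "TotalAmount"])
print(renames)
```

The resulting mapping can then be applied with whatever rename mechanism the processing tool offers, before the dataset is written as Parquet.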
- Due to various differences in how Pig and Hive map their data types to Parquet, you must select a writing Flavor when DSS writes a Parquet dataset. Reading a Parquet dataset written by Pig with Hive (and vice versa) leads to various issues, most of them related to complex types.
- When reading Parquet files, DSS uses the schema from the dataset settings, not the schema embedded in the files. To use the schema from the Parquet files, set the corresponding property to true in the Spark settings.
- On recent EMR clusters, the EmrOptimizedSparkSqlParquetOutputCommitter conflicts with the fs.s3.impl.disable.cache=true setting that DSS applies, which causes failures to create the staging directory. The fix is either to disable the optimized EMRFS committer or to add the property dku.no.disable.hdfs.cache -> true to the S3 connection in DSS.