Parquet¶
Parquet is an efficient file format of the Hadoop ecosystem. Its main characteristics are:
Column-oriented, even for nested complex types
Block-based compression
Ability to “push down” filtering predicates to avoid useless reads
Using Parquet or another efficient file format is strongly recommended when working with Hadoop data, rather than CSV. Thanks to columnar storage and predicate pushdown, speedups of up to 100x can be observed on some queries (see the sketch below).
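For illustration, here is a minimal sketch, independent of DSS, showing how a filter on a Parquet file can be pushed down by Spark so that only matching data is read. The file path and the country, user_id and timestamp columns are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical Parquet file and columns; the filter on "country" can be
    # pushed down to the Parquet reader, so only matching row groups are read
    df = spark.read.parquet("hdfs:///data/events.parquet")
    df.filter(df.country == "FR").select("user_id", "timestamp").show()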
Requirements¶
Using the Parquet format requires Setting up DSS Hadoop integration. If you don't have a Hadoop cluster, you must run the Standalone Hadoop integration instead.
Applicability¶
Parquet datasets can be stored on the following cloud storage and Hadoop connections: HDFS, S3, GCS, Azure Blob Storage. For more details, see Hadoop filesystems connections (HDFS, S3, EMRFS, WASB, ADLS, GS).
Parquet datasets can be used as inputs and outputs of all recipes
Parquet datasets can be used in the Hive and Impala notebooks
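Since the storage format is defined on the dataset itself, a recipe reads and writes a Parquet dataset exactly like any other dataset. A minimal Python recipe sketch, assuming hypothetical input and output datasets named events_parquet and events_prepared:

    import dataiku

    # Hypothetical dataset names; both can be stored as Parquet
    input_ds = dataiku.Dataset("events_parquet")
    df = input_ds.get_dataframe()

    output_ds = dataiku.Dataset("events_prepared")
    output_ds.write_with_schema(df)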
Limitations and issues¶
Case-sensitivity¶
Due to differences in how Hive and Parquet treat identifiers, it is strongly recommended that you only use lowercase identifiers when dealing with Parquet files.
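For example, column names can be normalized to lowercase before the data is written as Parquet. A minimal sketch, assuming a hypothetical pandas DataFrame with mixed-case column names:

    import pandas as pd

    # Hypothetical DataFrame with mixed-case column names
    df = pd.DataFrame({"UserId": [1, 2], "Country": ["FR", "US"]})

    # Lowercase all column names so that Hive and Parquet agree on identifiers
    df.columns = [c.lower() for c in df.columns]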
Misc¶
Due to various differences in how Pig and Hive map their data types to Parquet, you must select a writing Flavor when DSS writes a Parquet dataset. Reading a Pig-written Parquet dataset with Hive (and vice versa) leads to various issues, most of them related to complex types.
While reading Parquet files, DSS uses the schema from the dataset settings, and not the schema embedded in the files. To use the schema from the Parquet files instead, set spark.dku.allow.native.parquet.reader.infer to true in the Spark settings (see the sketch at the end of this section).

On recent EMR clusters, the EmrOptimizedSparkSqlParquetOutputCommitter conflicts with the fs.s3.impl.disable.cache=true setting that DSS sets, which causes failures to create the staging directory. Disabling the optimized EMRFS committer, or adding the property dku.no.disable.hdfs.cache -> true to the S3 connection in DSS, is then needed.
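Regarding the native Parquet reader option above, the property is normally added as a key/value pair in the DSS Spark configuration settings. The following PySpark recipe sketch assumes, as an assumption rather than documented behavior, that the property can also be toggled on the SQL context at runtime; the Parquet-backed dataset name events_parquet is hypothetical:

    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Assumption: toggling the property here; the supported place to set it
    # is the Spark configuration in the DSS settings
    sqlContext.setConf("spark.dku.allow.native.parquet.reader.infer", "true")

    # Hypothetical Parquet-backed dataset, read with the schema taken from
    # the Parquet files rather than from the dataset settings
    df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("events_parquet"))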