Parquet

Parquet is an efficient file format from the Hadoop ecosystem. Its main features are:

  • Column-oriented, even for nested complex types

  • Block-based compression

  • Ability to “push down” filtering predicates to avoid useless reads
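To make the column-oriented layout and predicate pushdown more concrete, here is a minimal, purely illustrative sketch in plain Python. It is a toy model, not the Parquet reader used by DSS or any real library API: the `RowGroup` class, the `scan_greater_than` function and the sample data are all hypothetical. It assumes that each block of rows carries minimum and maximum statistics per column, so a reader can skip whole blocks that cannot match a filter and only touch the columns it needs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a block of rows stored column by column (hypothetical)."""
    columns: Dict[str, List[int]]

    def stats(self, column: str) -> Tuple[int, int]:
        # A real file would store these statistics as metadata;
        # this sketch simply recomputes them from the values.
        values = self.columns[column]
        return min(values), max(values)


def scan_greater_than(row_groups, filter_column, threshold, wanted_column):
    """Return the wanted_column values of rows where filter_column > threshold.

    Also returns the number of row groups that actually had to be read.
    """
    results = []
    groups_read = 0
    for group in row_groups:
        _, col_max = group.stats(filter_column)
        if col_max <= threshold:
            # Pushed-down predicate: no row in this group can match, so skip it.
            continue
        groups_read += 1
        # Column-oriented access: only the two needed columns are touched.
        filter_values = group.columns[filter_column]
        wanted_values = group.columns[wanted_column]
        results.extend(
            wanted for value, wanted in zip(filter_values, wanted_values)
            if value > threshold
        )
    return results, groups_read


row_groups = [
    RowGroup({"year": [2001, 2002, 2003], "sales": [10, 20, 30]}),
    RowGroup({"year": [2010, 2011, 2012], "sales": [40, 50, 60]}),
    RowGroup({"year": [2020, 2021, 2022], "sales": [70, 80, 90]}),
]

matching_sales, groups_read = scan_greater_than(row_groups, "year", 2011, "sales")
print(matching_sales, groups_read)
```

In this sketch the first block is never read, because its maximum `year` is below the threshold. Skipping blocks in this way is the kind of saving that predicate pushdown aims for.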

Using Parquet or another efficient file format is strongly recommended when working with Hadoop data (rather than CSV data). Speedups can reach up to 100x on select queries.

Requirements

Using the Parquet format requires Setting up DSS Hadoop integration. If you don’t have a Hadoop cluster, you must run the Standalone Hadoop integration in order to use the Parquet format.

Applicability

  • Parquet datasets can be stored on the following cloud storage and Hadoop connections: HDFS, S3, GCS, Azure Blob storage. For more details, see Hadoop filesystems connections (HDFS, S3, EMRFS, WASB, ADLS, GS).

  • Parquet datasets can be used as inputs and outputs of all recipes

  • Parquet datasets can be used in the Hive and Impala notebooks

Limitations and issues

Case-sensitivity

Due to differences in how Hive and Parquet treat identifiers, it is strongly recommended that you only use lowercase identifiers when dealing with Parquet files.
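One way to follow this recommendation is to normalize column names before writing data to Parquet. The helper below is a minimal sketch in plain Python and is not part of DSS. The function name `lowercase_columns` and the sample column names are hypothetical. It also checks that no two names become identical once lowercased, since such a collision would otherwise go unnoticed.

```python
from collections import defaultdict
from typing import Dict, List


def lowercase_columns(names: List[str]) -> Dict[str, str]:
    """Map each original column name to its lowercase form.

    Raises ValueError if two different names collide once lowercased.
    """
    by_lower = defaultdict(list)
    for name in names:
        by_lower[name.lower()].append(name)

    # Keep only the lowercase names claimed by more than one original name.
    collisions = {low: originals for low, originals in by_lower.items()
                  if len(originals) > 1}
    if collisions:
        raise ValueError(f"Column names collide once lowercased: {collisions}")

    return {name: name.lower() for name in names}


mapping = lowercase_columns(["CustomerId", "OrderDate", "total"])
print(mapping)
```

The resulting mapping can then be used to rename columns in whichever tool produces the dataset. A schema containing both `Id` and `ID`, for example, raises an error instead of silently merging the two columns.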

Misc

  • When reading Parquet files, DSS uses the schema from the dataset settings and not the schema embedded in the files. To use the schema from the Parquet files, set spark.dku.allow.native.parquet.reader.infer to true in the Spark settings.

  • On recent EMR clusters, the EmrOptimizedSparkSqlParquetOutputCommitter conflicts with the fs.s3.impl.disable.cache=true setting that DSS applies, which causes failures when creating the staging directory. To work around this, either disable the optimized EMRFS committer, or add the property dku.no.disable.hdfs.cache with the value true to the S3 connection in DSS.