Working with partitions

Partitioning refers to the splitting of a dataset along meaningful dimensions. Each partition contains a subset of the dataset.

For a general introduction to partitioning, see DSS concepts.

The two partitioning models

There are two major models for partitioning datasets: files-based partitioning and column-based partitioning.

Files-based partitioning

This partitioning method is used for all datasets based on a filesystem hierarchy. This includes FS, HDFS, S3, and RemoteFiles datasets.

In this method:

  • The partitioning is given by the organization of files in folders.
  • The actual data in the files is NOT used to decide which records belong to which partition.
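As an illustration of the two points above, here is a minimal sketch in plain Python. It is not DSS code: the year/month/day folder layout, the file names, and the partition_of helper are all assumptions made for the example. It shows a partition identifier being read from a file's path alone, without opening the file.

```python
import re

# Hypothetical layout (an assumption for this sketch): <year>/<month>/<day>/<file>
PATH_PATTERN = re.compile(r"^(\d{4})/(\d{2})/(\d{2})/[^/]+$")


def partition_of(path):
    """Return the partition identifier implied by a file's path, or None.

    Only the path is inspected; the file's content is never read.
    """
    match = PATH_PATTERN.match(path)
    if match is None:
        return None
    return "-".join(match.groups())


# Hypothetical file listing, relative to the dataset's root folder.
files = [
    "2020/01/15/part-0.csv",
    "2020/01/15/part-1.csv",
    "2020/01/16/part-0.csv",
]

# Group the files by the partition that their location implies.
partitions = {}
for path in files:
    partitions.setdefault(partition_of(path), []).append(path)

for partition_id, paths in sorted(partitions.items()):
    print(partition_id, paths)
```

In this sketch, moving a file to another folder changes its partition, while editing the records inside the file does not.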

For more information, see Partitioning files-based datasets.

Column-based partitioning

This partitioning method is used for datasets based on structured storage engines:

  • All SQL databases
  • NoSQL databases: MongoDB and Cassandra

In this method, the partitioning is derived from information (generally a column) that is part of the data itself.

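An important consequence is that, in this method, the schema of the dataset does contain the partitioning data. This is the opposite of files-based partitioning, where the partition is determined only by where the files are stored.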
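The idea can be sketched in plain Python as follows. This is only an illustration, not DSS code or a database query: the records, the "country" column, and the partition_by_column helper are assumptions made for the example. Each record is assigned to a partition by the value it holds in one of its own columns.

```python
from collections import defaultdict

# Hypothetical records; "country" plays the role of the partitioning column.
records = [
    {"order_id": 1, "country": "FR", "amount": 10.0},
    {"order_id": 2, "country": "US", "amount": 25.5},
    {"order_id": 3, "country": "FR", "amount": 7.25},
]


def partition_by_column(rows, column):
    """Group rows by the value found in `column` of each row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


by_country = partition_by_column(records, "country")

for partition_id, rows in sorted(by_country.items()):
    print(partition_id, [row["order_id"] for row in rows])
```

Note that every row still carries its "country" value after grouping, which mirrors the point that the partitioning data is part of the schema.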