Working with partitions¶
Partitioning refers to the splitting of a dataset along meaningful dimensions. Each partition contains a subset of the dataset that can be built independently.
For a general introduction to partitioning, see DSS concepts
The two partitioning models¶
There are two models for partitioning datasets: files-based partitioning and column-based partitioning.
Files-based partitioning¶
This partitioning method is used for all datasets based on a filesystem hierarchy. This includes Filesystem, HDFS, Amazon S3, Azure Blob Storage, Google Cloud Storage and Network datasets.
In this method, partitioning is defined by the layout of the files on disk., so the data in the files is NOT used to decide which records belong to which partition.
For more information, see Partitioning files-based datasets
Column-based partitioning¶
This partitioning method is used for datasets based on structured storage engines:
All SQL databases
NoSQL databases: MongoDB, Cassandra and Elasticsearch
In this method, the partitioning is derived from information (one or several columns) which is part of the data.
A very important point is that in this method, the schema of the dataset does contain the partitioning data.