Partitioning files-based datasets

All datasets that are based on files can be partitioned. This includes the following kinds of datasets:

  • Server Filesystem
  • Hadoop HDFS
  • Amazon S3
  • Remote files

On files-based datasets, partitioning is defined by the layout of the files on disk.


Partitioning a files-based dataset cannot be defined by the content of a column within this dataset.

For example, if a filesystem is organized this way:

  • /folder/2013/02/03/file0.csv
  • /folder/2013/02/03/file1.csv
  • /folder/2013/02/04/file0.csv
  • /folder/2013/02/04/file1.csv

This folder can be partitioned at the day level, with one folder per partition.

Files-based partitioning is defined by a matching pattern which allows mapping each file to a given partition identifier.

For example, the previous example would be represented by the pattern /%Y/%M/%D/.*
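The mapping from files to partitions can be illustrated with a minimal Python sketch. This is not how DSS implements matching; it only assumes that the pattern behaves like a regular expression in which %Y, %M and %D stand for 4, 2 and 2 digits, that paths are taken relative to the dataset root (assumed here to be /folder), and that a day partition is identified as YYYY-MM-DD. The function names pattern_to_regex and group_by_partition are hypothetical.

```python
import re
from collections import defaultdict

# Regex fragment for each time specifier: 4-digit year, 2-digit month and day.
TIME_SPECIFIERS = {
    "%Y": r"(?P<year>\d{4})",
    "%M": r"(?P<month>\d{2})",
    "%D": r"(?P<day>\d{2})",
}


def pattern_to_regex(pattern):
    """Translate a day-level pattern such as /%Y/%M/%D/.* into a compiled regex."""
    regex = pattern
    for specifier, fragment in TIME_SPECIFIERS.items():
        regex = regex.replace(specifier, fragment)
    return re.compile(regex)


def group_by_partition(paths, pattern):
    """Group paths (relative to the dataset root) by day partition identifier."""
    regex = pattern_to_regex(pattern)
    partitions = defaultdict(list)
    for path in paths:
        match = regex.fullmatch(path)
        if match is None:
            continue  # the file does not belong to any partition
        # Assumed identifier format for a day partition: YYYY-MM-DD.
        partition_id = "{year}-{month}-{day}".format(**match.groupdict())
        partitions[partition_id].append(path)
    return dict(partitions)


paths = [
    "/2013/02/03/file0.csv",
    "/2013/02/03/file1.csv",
    "/2013/02/04/file0.csv",
    "/2013/02/04/file1.csv",
]
partitions = group_by_partition(paths, "/%Y/%M/%D/.*")
for partition_id, files in sorted(partitions.items()):
    print(partition_id, files)
```

In this sketch the four files fall into two partitions, one per day folder.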

Define a partitioned dataset

First define the connection and format parameters. Once these are set and you can see your data, go to the Partitioning tab and click “Activate partitioning”.


Dataiku DSS automatically tries to recognize the pattern. If it succeeds, a partitioning scheme will be suggested, which you can directly use.


To define partitioning manually, first add your dimensions.

  • You can add a single “time” dimension. The time dimension has a fixed periodicity, which can be year, month, day or hour.
  • You can add multiple discrete value dimensions, if each dimension corresponds to a subdirectory in your file structure.

Each dimension has a name.

Then, define the pattern. The indicators on the right of the pattern allow you to check that you have used all dimensions in the pattern.

The time dimension is referred to in the pattern by the %Y (year, on 4 digits), %M (month, on 2 digits), %D (day, on 2 digits) and %H (hour, on 2 digits) specifiers. The pattern for the time dimension must represent a valid time hierarchy for the chosen period. For example, if you choose “Day” as the period for the time dimension, then the pattern must include %Y, %M, and %D.
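The hierarchy rule can be checked with a few lines of Python. This is only a sketch: the requirement for “Day” comes from the text above, while the lists for year, month and hour are extrapolated from it and are an assumption. The name missing_time_specifiers is hypothetical and not part of DSS.

```python
# Specifiers assumed to be required for each period, from coarsest to finest.
# Only the "day" entry is stated explicitly; the others are extrapolated.
REQUIRED_SPECIFIERS = {
    "year": ["%Y"],
    "month": ["%Y", "%M"],
    "day": ["%Y", "%M", "%D"],
    "hour": ["%Y", "%M", "%D", "%H"],
}


def missing_time_specifiers(pattern, period):
    """Return the time specifiers that `period` requires but `pattern` lacks."""
    return [spec for spec in REQUIRED_SPECIFIERS[period] if spec not in pattern]


print(missing_time_specifiers("/%Y/%M/%D/.*", "day"))  # nothing missing
print(missing_time_specifiers("/%Y/%M/.*", "day"))     # %D is missing
```

An empty result means the pattern covers the full hierarchy for the chosen period.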

Each discrete value dimension is referred to by the %{dimension_name} specifier.


For example, a partitioning scheme with two dimensions, a time dimension with a “Day” period and a discrete value dimension, could match files such as:

  • /2013-02-04/France/file0.csv
  • /2013-02-05/Italy/file1.csv


Your file names and paths must exactly match the pattern.

The initial '/' is important, as all paths are anchored to the root of the dataset. The final .* is important: it catches all files with the given prefix.
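The effect of the leading '/' and the trailing .* can be seen in a small Python sketch. It assumes that a pattern must match the whole path, modeled here with re.fullmatch, and that each discrete value covers one path segment. The pattern /%Y-%M-%D/%{country}/.* and the dimension name country are assumptions chosen to fit the two example files, and compile_pattern is a hypothetical helper, not a DSS function.

```python
import re

# Regex fragments for the time specifiers.
TIME_FRAGMENTS = {
    "Y": r"(?P<year>\d{4})",
    "M": r"(?P<month>\d{2})",
    "D": r"(?P<day>\d{2})",
    "H": r"(?P<hour>\d{2})",
}

# Matches either a discrete specifier %{name} or a time specifier %Y, %M, %D, %H.
SPECIFIER = re.compile(r"%\{([A-Za-z_]\w*)\}|%([YMDH])")


def compile_pattern(pattern):
    """Turn a partitioning pattern into a regex with one named group per specifier."""
    def replace(match):
        name, time_letter = match.groups()
        if name is not None:
            # Assumption: a discrete value never spans a '/' separator.
            return "(?P<%s>[^/]+)" % name
        return TIME_FRAGMENTS[time_letter]
    return re.compile(SPECIFIER.sub(replace, pattern))


path = "/2013-02-04/France/file0.csv"

complete = compile_pattern("/%Y-%M-%D/%{country}/.*")
no_suffix = compile_pattern("/%Y-%M-%D/%{country}/")   # final .* left out
no_root = compile_pattern("%Y-%M-%D/%{country}/.*")    # initial '/' left out

print(complete.fullmatch(path).groupdict())
print(no_suffix.fullmatch(path))  # None: the file name is not covered
print(no_root.fullmatch(path))    # None: the path starts at the dataset root
```

Only the complete pattern matches the file; dropping either the initial '/' or the final .* leaves the path unmatched under these assumptions.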

The “Test” button inspects the dataset and displays which partitions would be generated by the current pattern.


Here is an example of partitioning without a time dimension:

Our data is stored in a directory called dir, which contains two subdirectories subdir_1 (which contains x.csv) and subdir_2 (which contains y.csv and z.csv).

We can partition over the subdirectories only, so that each subdirectory is one partition.

Alternatively, we can partition over the individual files, so that each file is its own partition.
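Both options can be sketched in Python to see which partitions result. The dimension names subdir and file, and the two patterns /%{subdir}/.* and /%{subdir}/%{file}.csv, are assumptions for illustration, as is treating dir as the dataset root. The helper names are hypothetical, and partitions are keyed by tuples of dimension values rather than by DSS partition identifiers.

```python
import re
from collections import defaultdict


def discrete_pattern_to_regex(pattern):
    """Replace each %{name} specifier with a named group matching within one path segment."""
    return re.compile(re.sub(r"%\{([A-Za-z_]\w*)\}", r"(?P<\1>[^/]+)", pattern))


def partitions_for(paths, pattern):
    """Group paths by the tuple of dimension values extracted by the pattern."""
    regex = discrete_pattern_to_regex(pattern)
    partitions = defaultdict(list)
    for path in paths:
        match = regex.fullmatch(path)
        if match:
            partitions[match.groups()].append(path)
    return dict(partitions)


# Paths relative to the dataset root (assumed to be the directory dir).
FILES = ["/subdir_1/x.csv", "/subdir_2/y.csv", "/subdir_2/z.csv"]

by_subdir = partitions_for(FILES, "/%{subdir}/.*")
by_file = partitions_for(FILES, "/%{subdir}/%{file}.csv")
unmatched = partitions_for(FILES, "/%{subdir}/%{file}.json")

print(len(by_subdir), sorted(by_subdir))
print(len(by_file), sorted(by_file))
print(len(unmatched))
```

In this sketch the first pattern yields one partition per subdirectory, the second one partition per file, and the third, which does not fit the files, yields no partitions at all.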



If 0 partitions are detected, it generally means that your pattern does not match your files.

More information might be available in the backend log file; see Diagnosing and debugging issues.