Recipes for partitioned datasets

When a recipe writes to a partitioned dataset, reads from a partitioned dataset, or both, its processing applies to specific partitions of the datasets involved rather than to the datasets as a whole.

A recipe computing a partitioned dataset computes only one partition of the target dataset at a time.

If a recipe computes several datasets:

  • All output datasets must have the same partitioning schema.
  • The same partition will be computed for all target datasets.

A single invocation of a recipe will therefore:

  • Read one or several partitions of the input datasets.
  • Write one partition for each output dataset (with multiple output datasets, the same partition for all).
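
The read/write pattern of a single invocation can be sketched with a small in-memory model. This is only an illustration, not the DSS API: the `run_recipe_once` function, the dictionary-based datasets, and the dataset names are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory model: a dataset maps partition ids to lists of rows.
Dataset = Dict[str, List[dict]]


def run_recipe_once(
    inputs: Dict[str, Dataset],
    outputs: Dict[str, Dataset],
    input_partitions: Dict[str, List[str]],
    target_partition: str,
    transform: Callable[[List[dict]], List[dict]],
) -> None:
    """Model one invocation: read some input partitions, write one output partition."""
    rows: List[dict] = []
    # Read only the requested partitions of each input dataset.
    for name, dataset in inputs.items():
        for partition_id in input_partitions[name]:
            rows.extend(dataset.get(partition_id, []))
    result = transform(rows)
    # Write the same partition in every output dataset, replacing its content.
    for dataset in outputs.values():
        dataset[target_partition] = list(result)


# Example: two of three daily input partitions feed one output partition.
logs: Dataset = {
    "2024-01-01": [{"n": 1}],
    "2024-01-02": [{"n": 2}],
    "2024-01-03": [{"n": 3}],
}
totals: Dataset = {}
copies: Dataset = {}
run_recipe_once(
    inputs={"logs": logs},
    outputs={"totals": totals, "copies": copies},
    input_partitions={"logs": ["2024-01-01", "2024-01-02"]},
    target_partition="2024-01-02",
    transform=lambda rows: [{"total": sum(r["n"] for r in rows)}],
)
```

After the call, both output datasets contain exactly one partition, `2024-01-02`, and the unread input partition `2024-01-03` has played no part in the computation.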

Data Science Studio (DSS) guarantees that a recipe computing a partition is idempotent. This means that a partition is always written atomically. A recipe cannot append data to a partition; it must instead replace the content of the target partition.
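
The replace-not-append behavior can be illustrated outside of DSS with a minimal filesystem sketch. It assumes, purely for illustration, that each partition is a directory under a dataset root; the `replace_partition` function and the `part-0.txt` file name are hypothetical and say nothing about how DSS stores data. The swap here is a removal followed by a rename, so it only approximates an atomic write.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import List


def replace_partition(dataset_root: Path, partition_id: str, lines: List[str]) -> Path:
    """Write `lines` as the full content of one partition, discarding any old content."""
    dataset_root.mkdir(parents=True, exist_ok=True)
    final_dir = dataset_root / partition_id
    # Stage the new content next to the destination so the rename
    # stays on the same filesystem.
    staging_dir = Path(tempfile.mkdtemp(prefix=partition_id + ".tmp-", dir=dataset_root))
    (staging_dir / "part-0.txt").write_text("".join(line + "\n" for line in lines))
    # Drop the previous content, then move the staged content into place.
    if final_dir.exists():
        shutil.rmtree(final_dir)
    os.replace(staging_dir, final_dir)
    return final_dir
```

Because the partition is rebuilt from scratch on every call, running the same computation twice leaves the partition in the same state as running it once, which is the property that appending would break.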

DSS automatically computes the partitions of the input datasets to read, based on the requested output partitions, using the partition-level dependencies mechanism. For more information, see DSS concepts and Specifying partition dependencies.
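
Conceptually, a partition dependency is a function from a requested output partition to the list of input partitions to read. The sketch below shows two such functions for day-based partitions. The function names are hypothetical and the code is not DSS's implementation; it only illustrates the idea of the mapping.

```python
from datetime import date, timedelta
from typing import List


def equals_dependency(output_partition: str) -> List[str]:
    """The output partition depends on the input partition with the same id."""
    return [output_partition]


def sliding_days_dependency(output_partition: str, days: int) -> List[str]:
    """The output day depends on the `days` input days ending at that day, inclusive."""
    end = date.fromisoformat(output_partition)
    return [
        (end - timedelta(days=offset)).isoformat()
        for offset in range(days - 1, -1, -1)
    ]


# Requesting output partition 2024-01-03 with a 3-day window
# yields the input partitions 2024-01-01, 2024-01-02 and 2024-01-03.
needed = sliding_days_dependency("2024-01-03", 3)
```

With a mapping of this kind, the user only asks for an output partition, and the set of input partitions follows from the dependency declared on each input.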

See Partitioned Hive recipes and Partitioned SQL recipes for details on how to read only the relevant input partitions and write to the output partition.