Recipes for partitioned datasets
When a recipe reads from a partitioned dataset, writes to one, or both, its processing does not apply to the involved datasets as a whole, but only to the involved partitions.
A recipe computing a partitioned dataset computes only one partition of the target dataset at a time.
If a recipe computes several datasets:
- All output datasets must have the same partitioning schema.
- The same partition will be computed for all target datasets.
A single invocation of a recipe will therefore:
- Read one or several partitions of the input datasets
- Write one partition for each output dataset (if there are multiple output datasets, the same partition for all).
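The read and write pattern above can be sketched with a toy in-memory model. This is only an illustration, not the DSS API: datasets are represented as plain dictionaries mapping a partition identifier to a list of rows, and `run_recipe`, `sales`, `totals`, and `counts` are hypothetical names introduced for the example.

```python
def run_recipe(inputs, input_partitions, outputs, target_partition, transform):
    """Toy model of one recipe invocation (hypothetical, not the DSS API)."""
    # Read one or several partitions from each input dataset.
    rows = []
    for name, partition_ids in input_partitions.items():
        for partition_id in partition_ids:
            rows.extend(inputs[name][partition_id])

    # The transform returns one list of rows per output dataset name.
    results = transform(rows)

    # Write exactly one partition, the same one, in every output dataset.
    for name, dataset in outputs.items():
        dataset[target_partition] = results[name]


# One input dataset partitioned by day, two output datasets.
sales = {"2024-01-01": [10, 20], "2024-01-02": [5], "2024-01-03": [7]}
totals = {}
counts = {}

run_recipe(
    inputs={"sales": sales},
    input_partitions={"sales": ["2024-01-01", "2024-01-02"]},
    outputs={"totals": totals, "counts": counts},
    target_partition="2024-01-02",
    transform=lambda rows: {"totals": [sum(rows)], "counts": [len(rows)]},
)
```

After this single invocation, both `totals` and `counts` contain only the `2024-01-02` partition, even though two input partitions were read.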
Data Science Studio (DSS) guarantees that a recipe computing a partition is idempotent. A partition is always written atomically, and a recipe cannot append data to a partition. Instead, it must replace the entire content of the target partition.
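The replace-not-append rule can be illustrated with a small file-based sketch. It assumes one JSON file per partition and uses the common write-to-temporary-file-then-rename technique; this is not a description of how DSS stores or writes partitions internally, and `write_partition` and `read_partition` are hypothetical helpers.

```python
import json
import os
import tempfile


def write_partition(dataset_dir, partition_id, rows):
    """Replace the whole content of one partition; never append to it."""
    os.makedirs(dataset_dir, exist_ok=True)
    final_path = os.path.join(dataset_dir, f"{partition_id}.json")
    # Write to a temporary file in the same directory first, so readers
    # never observe a half-written partition.
    fd, tmp_path = tempfile.mkstemp(dir=dataset_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f)
    # Swap the new file in place of the old one in a single step.
    os.replace(tmp_path, final_path)
    return final_path


def read_partition(dataset_dir, partition_id):
    """Load the rows currently stored for one partition."""
    with open(os.path.join(dataset_dir, f"{partition_id}.json")) as f:
        return json.load(f)
```

Because each write replaces the partition, calling `write_partition` twice with the same rows leaves the partition in the same state as calling it once, which is what makes rebuilding a partition safe.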
DSS automatically determines which partitions of the input datasets are needed for the requested output partitions, using the partition-level dependencies mechanism. For more information, please refer to DSS concepts and Specifying partition dependencies.
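As a rough illustration of what a partition-level dependency expresses, the sketch below maps a requested output partition to the input partitions it requires. It assumes day-based partition identifiers in ISO format; the two functions are hypothetical and only model the idea. In DSS, dependencies are configured on the recipe rather than written as code like this.

```python
from datetime import date, timedelta


def equals_dependency(output_partition):
    """The output partition depends on the input partition with the same id."""
    return [output_partition]


def sliding_days_dependency(output_partition, days):
    """The output day depends on the last `days` input days, itself included."""
    end = date.fromisoformat(output_partition)
    # Walk backwards from the oldest required day to the requested day.
    return [
        (end - timedelta(days=offset)).isoformat()
        for offset in range(days - 1, -1, -1)
    ]


same_day = equals_dependency("2024-01-03")
last_three_days = sliding_days_dependency("2024-01-03", 3)
```

Here `same_day` lists a single input partition, while `last_three_days` lists the three consecutive days ending on the requested output partition.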