DSS is aware of the dependencies of each dataset in the flow. A change to a dataset or a recipe can be propagated downstream through the flow. There are multiple options for propagating these changes.
There are a few options for taking upstream dependencies into account when building a dataset. To do this:
Right-click a dataset and select Build, then choose one of the following options.
Non recursive (Build only this dataset) builds the selected dataset using its parent recipe. This option requires the least computation, but does not take into account any upstream changes to datasets or recipes.
Recursive determines which recipes need to be run based on your choice:
- Smart reconstruction checks each dataset and recipe upstream of the selected dataset to see if it has been modified more recently than the selected dataset. Dataiku DSS then rebuilds all impacted datasets down to the selected one. This is the recommended default.
- Forced recursive rebuild rebuilds all of the dependencies of the selected dataset, going back to the start of the flow. This is the most computationally intensive option, but it can be used for overnight builds so that the next day starts with a fully checked, up-to-date flow.
- “Missing” data only is a very specific and advanced mode that you are unlikely to need. It works somewhat like Smart reconstruction, but a dataset is (re)built only if it is completely empty. This is not recommended for general usage.
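The difference between these modes comes down to how the flow's dependency graph is walked. The Smart reconstruction logic can be sketched as a timestamp comparison over the ancestry of the selected dataset. This is a toy model, not DSS code; the dataset names and timestamps in the usage example are hypothetical.

```python
def smart_rebuild_plan(parents, modified, target):
    """Toy model of Smart reconstruction: return, upstream-first, the
    datasets that must be rebuilt so that `target` is up to date.
    `parents` maps dataset -> list of input datasets; `modified` maps
    dataset -> last-modification timestamp."""
    order, seen = [], set()

    def topo(ds):
        # Visit the ancestry of `target` in upstream-first order.
        if ds in seen:
            return
        seen.add(ds)
        for p in parents.get(ds, []):
            topo(p)
        order.append(ds)

    topo(target)

    effective = dict(modified)  # timestamps as they would be after rebuilds
    plan = []
    for ds in order:
        inputs = [effective[p] for p in parents.get(ds, [])]
        if inputs and max(inputs) > effective[ds]:
            plan.append(ds)              # an input is newer: rebuild
            effective[ds] = max(inputs)  # the rebuild refreshes the dataset
    return plan
```

For a chain A → B → C where A was modified after B and C, the plan for C is `["B", "C"]`. A Forced recursive rebuild would instead rebuild every ancestor unconditionally, and “Missing” data only would rebuild only datasets that are empty.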
You might want to prevent some datasets from being rebuilt, for instance if rebuilding them is particularly expensive or if the downtime caused by a rebuild must be limited to certain hours. In a dataset’s Settings > Advanced tab, you can configure its Rebuild behaviour:
Normal: the dataset can be rebuilt, including recursively in the cases described above.
Explicit: the dataset can be rebuilt, but not recursively when rebuilding a downstream dataset.
Given a flow such as the following:
If B is set to Explicit rebuild, building Output recursively, even with Forced recursive rebuild, will only rebuild C (and Output). B will not be rebuilt, nor will its upstream datasets.
To rebuild B, you need to build it explicitly, e.g. by right-clicking it and choosing Build. This also holds true when Output is built from a Scenario or via an API call.
Write-protected: the dataset cannot be rebuilt, even explicitly, making it effectively read-only from the Flow’s perspective. You can still write to this dataset from a Notebook.
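The effect of the Explicit setting on a recursive build can be sketched as pruning the upstream walk. This is a toy model, not DSS code; the A → B → C → Output flow mirrors the example above.

```python
def forced_recursive_targets(parents, behaviour, target):
    """Toy model: datasets a forced-recursive build of `target` rebuilds,
    upstream-first. A dataset marked 'explicit' (other than the build
    target itself) is skipped, along with everything upstream of it."""
    result = []

    def visit(ds, is_target):
        if not is_target and behaviour.get(ds, "normal") == "explicit":
            return  # must be built explicitly; prune this whole branch
        for p in parents.get(ds, []):
            visit(p, False)
        if ds not in result:
            result.append(ds)

    visit(target, True)
    return result
```

With `parents = {"B": ["A"], "C": ["B"], "Output": ["C"]}` and B marked explicit, building Output yields `["C", "Output"]`: neither B nor A is touched.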
It is also possible to propagate dataset changes downstream; that is, to rebuild the datasets impacted by a change. To do this:
Right-click a dataset and select Tools > Rebuild from here then select a method for handling dependencies (see the descriptions above).
- Dataiku DSS starts from the selected dataset and walks the flow downstream, following all branches, to determine the terminal datasets at the “end” of the flow.
- For each terminal dataset, DSS builds it with the selected option.
This means that the selected dataset has no special meaning; it may or may not be rebuilt, depending on whether or not it is out of date with respect to the terminal datasets.
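The two steps above amount to a graph traversal: collect the terminal datasets reachable downstream, then build each one with the chosen dependency-handling mode. A minimal sketch of the traversal, not DSS code, with hypothetical dataset names:

```python
def terminal_datasets(children, start):
    """Toy model of 'Rebuild from here': walk downstream from `start`,
    following all branches, and return the terminal datasets (those
    with nothing downstream of them)."""
    terminals, seen, stack = set(), set(), [start]
    while stack:
        ds = stack.pop()
        if ds in seen:
            continue
        seen.add(ds)
        downstream = children.get(ds, [])
        if downstream:
            stack.extend(downstream)
        else:
            terminals.add(ds)  # nothing after it: end of the flow
    return terminals
```

For a flow where X feeds Y and Z, and Y feeds T, starting from X yields the terminals `{"T", "Z"}`; DSS then builds each of these with the selected option.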
When a new column is added to a dataset at the beginning of a flow, you will usually want to add/remove/rename the impacted column(s) in all downstream datasets. To do this:
Right-click a dataset and select Tools > Start schema propagation from here, then click Start in the dialog that opens in the lower-left corner. Then, for each recipe that needs an update (shown in red):
- Open it (preferably in a new tab)
- Force a save of the recipe: hit Ctrl+S (or modify anything, click Save, revert the change, and save again). For most recipe types, saving triggers a schema check, which detects the need for an update and offers to fix the output schema. DSS will not silently update the output schema, as this could break other recipes that rely on it. Most of the time, the desired action is to accept by clicking “Update schema”.
- You will probably need to run the recipe again.
Some recipe types, such as Python recipes, cannot be automatically checked.
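Conceptually, schema propagation walks the flow downstream and brings each dataset's schema in line with its (possibly changed) input. A toy model of that walk, assuming every recipe simply passes its input columns through; this is not DSS code, and real recipes of course transform schemas in recipe-specific ways:

```python
def propagate_schema(children, schema, start):
    """Toy model of schema propagation: push `start`'s column list to
    every downstream dataset, returning the ones whose schema changed
    (i.e. where a user would accept an "update schema" prompt)."""
    updated, stack = [], [start]
    while stack:
        ds = stack.pop(0)
        for child in children.get(ds, []):
            if schema[child] != schema[ds]:
                schema[child] = list(schema[ds])  # accept the schema update
                updated.append(child)
            stack.append(child)
    return updated
```

Adding a column to the first dataset of a raw → clean → final chain marks both downstream datasets as updated, matching the recipes that would turn red in the schema propagation tool.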