Scenario steps¶
The following steps can be executed by a scenario.
Build / Train¶
This step builds elements from the Dataiku Flow:
Datasets (or dataset partitions in the case of partitioned datasets)
Managed folders
Saved models
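The same builds can be triggered from Python code running inside a scenario (for example from an Execute Python code step) through the scenario API. A minimal sketch, assuming a dataset named customers and a saved model with id ABC123 (both placeholders) exist in the project:
import dataiku.scenario

s = dataiku.scenario.Scenario()
# Build a dataset; for partitioned datasets, the partitions to build can also be passed
s.build_dataset("customers")
# Retrain a saved model, identified by its id (placeholder)
s.train_model("ABC123")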
Clear¶
This step clears the contents of elements from the Dataiku Flow:
Datasets (or dataset partitions in the case of partitioned datasets): the corresponding data is deleted
Managed folders (or folder partitions in the case of partitioned folders): all the contents of the folder (or the folder partitions) are deleted
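The same clear can also be performed programmatically from an Execute Python code step using the public API client. A minimal sketch, assuming a dataset named staging_data (a placeholder) in the current project:
import dataiku

client = dataiku.api_client()
project = client.get_default_project()
# Drop the data of the dataset; for partitioned datasets, specific partitions can be cleared instead
project.get_dataset("staging_data").clear()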
Verify rules or run checks¶
This step computes the checks or data quality rules defined on elements from the Dataiku Flow:
Datasets (or dataset partitions in the case of partitioned datasets)
Managed folders
Saved models
Model evaluation stores
For Datasets, this step computes the Data Quality rules specified in the Data Quality tab of the Dataset. This includes computing the underlying metrics that the rules may require. Optionally, you can prevent the computation of rules that are automatically run on build: since those rule results are usually already up to date with the data, skipping them avoids redundant computation.
For other objects, the step runs the checks defined on the Status tab of the element.
The outcomes of the checks and rules are collected and the step fails if at least one check on the selected elements fails.
The outcomes of the checks and rules (OK, ERROR, WARNING), along with the optional message, are available to subsequent steps as variables (See Variables in scenarios).
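For example, a subsequent Execute Python code step can inspect those variables through the scenario API; a minimal sketch (the exact variable names depend on the step, see Variables in scenarios):
import dataiku.scenario

s = dataiku.scenario.Scenario()
# All variables currently visible to the scenario, including the outcomes
# reported by previous "Verify rules or run checks" steps
variables = s.get_all_variables()
for name, value in variables.items():
    print(name, value)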
Compute metrics¶
This step runs the probes defined on elements from the Dataiku Flow:
Datasets (or dataset partitions in the case of partitioned datasets)
Managed folders
The probes are those defined on the Status tab of the element.
The values of the metrics are available to subsequent steps as variables (See Variables in scenarios).
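Metrics can also be computed and read back programmatically with the public API client, for instance from an Execute Python code step. A minimal sketch, assuming a dataset named orders (a placeholder) on which the record count probe is enabled:
import dataiku

client = dataiku.api_client()
dataset = client.get_default_project().get_dataset("orders")
# Run the probes defined on the dataset's Status tab
dataset.compute_metrics()
# Read back the last computed values
metrics = dataset.get_last_metric_values()
print(metrics.get_global_value("records:COUNT_RECORDS"))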
Synchronize Hive table¶
DSS can associate HDFS datasets with external Hive tables. This is done automatically by jobs launched in DSS, or from the dataset's advanced settings pane, and can be scheduled in a scenario using a Sync Hive table step. The step regenerates the schema of the Hive table so that it matches the schema of the dataset.
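The same synchronization can also be triggered from Python; a minimal sketch using the public API client, assuming it exposes synchronize_hive_metastore on the dataset handle in your DSS version and that an HDFS dataset named web_logs (a placeholder) exists:
import dataiku

client = dataiku.api_client()
dataset = client.get_default_project().get_dataset("web_logs")
# Recreate or update the associated external Hive table so it matches the dataset's schema
dataset.synchronize_hive_metastore()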
Create notebook export¶
Python notebooks can be turned into insights by publishing them. The insight is then disconnected from the notebook. The Create notebook export step re-publishes a Python notebook to the insight it was previously published to, and optionally runs the notebook before publishing it.
By checking Execute the notebook in this step, you can thus refresh the data used by the published insight.
Execute SQL¶
This step executes one or more SQL statements on a DSS connection. Both plain SQL connections (e.g. a PostgreSQL connection) and HiveQL connections (Hive and Impala) can be used.
The output of the query, if there is one, is available to subsequent steps as variables (See Variables in scenarios).
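If you need to post-process the result in Python, an equivalent query can be run from an Execute Python code step with SQLExecutor2. A minimal sketch, assuming a SQL connection named my_postgres and a table named my_table (both placeholders):
from dataiku import SQLExecutor2

executor = SQLExecutor2(connection="my_postgres")
# Run the statement on the connection and fetch the result as a pandas DataFrame
df = executor.query_to_df("SELECT COUNT(*) AS n FROM my_table")
print(df)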
Execute Python code¶
This step runs a chunk of Python code in the context of the scenario. The Dataiku API available in Python notebooks is available in this step, as well as the scenario-specific API.
To get the parameters of the trigger that started the scenario, for example, one can use:
import dataiku.scenario

# Get a handle on the current scenario run
s = dataiku.scenario.Scenario()
# Retrieve the parameters of the trigger that started this run
params = s.get_trigger_params()
Define variables, Set project variables, Set global variables¶
In DSS, there are instance-level variables, project-level variables, and scenario-level variables. The scenario-level variables are only visible for the duration of the scenario run, and are defined either programmatically (in Python) or using a Define variables step. All 3 types of variables are available to recipes of jobs started in a scenario run.
The variables can be specified either by inputting a JSON object, in which case the variable values are fixed, or by inputting a list of key-value pairs, where the values are DSS formulas and can depend on pre-existing variables.
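Variables can also be set from an Execute Python code step through the scenario API; a minimal sketch (the variable names and values are placeholders):
import dataiku.scenario

s = dataiku.scenario.Scenario()
# Scenario-scoped variables: visible only for the duration of this run
s.set_scenario_variables(run_date="2024-01-01")
# Project-level variables: persisted in the project's variables
s.set_project_variables(last_processed="2024-01-01")
# Instance-level variables: persisted in the DSS instance's global variables
s.set_global_variables(maintenance_mode=False)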
Run global variables update¶
This step runs the update of the DSS instance's variables as defined in the administration section, or runs the update code specified in the step, if any.
Send message¶
This step sends a message, like the reporters do at the end of a scenario run. The setup is the same; the only difference is that, since the scenario is not yet finished when the step runs, not all variables created during the run are available.
Run another scenario¶
This step starts a run of a scenario and waits for its completion. The step’s outcome is the outcome of that scenario run.
Only scenarios of projects the user has access to can be used. The user under which this step is run is used for the scenario started by this step, regardless of that scenario’s Run as user setting.
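A scenario can also be started and awaited from Python, for instance from an Execute Python code step, using the public API client. A minimal sketch, assuming a scenario with id NIGHTLY_LOAD (a placeholder) in the same project:
import dataiku

client = dataiku.api_client()
scenario = client.get_default_project().get_scenario("NIGHTLY_LOAD")
# Start the scenario and block until it completes; this raises if the run does not succeed
run = scenario.run_and_wait()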
Package API service¶
This step generates a package of the specified API service. The package identifier needs to be unique. It can be specified using an expression with variables, and can be automatically padded with a number to ensure uniqueness.
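If you prefer to generate the package from Python, the public API client can be used; a minimal sketch, assuming it exposes generate_package on the API service handle in your DSS version, and using placeholder identifiers my_service and v12:
import dataiku

client = dataiku.api_client()
service = client.get_default_project().get_api_service("my_service")
# Generate a new package; the identifier must not already exist
service.generate_package("v12")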
Create dashboard export¶
This step generates a dashboard export inside a managed folder. The dashboard must be specified, or the step will fail. If you don't specify a managed folder, the generated files are stored in a temporary dashboard-exports folder inside the DSS data directory.
The generated files are fully customizable, so you have full control over what you obtain. Several parameters enable this:
File type determines the file's extension.
Export format determines the file's dimensions. If a standard format (A4 or US Letter) is chosen, the dimensions are computed from your screen, and the file matches what you see. A Custom format, on the other hand, lets you set a custom width and offers two ways of computing the file's height:
Grid cells correspond to the cells displayed in the dashboard's edit mode; the height is computed in pixels from the width, the number of grid cells, and their height.
Pixels sets the height directly in pixels.
Execute Python unit test¶
This step executes one or more Python pytest tests from a project’s Libraries folder using a Pytest selector.
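For example, a test module placed in the project's Libraries folder (the file name, dataset name, and column below are placeholders) can be targeted by the step with a pytest selector such as test_cleaning.py::test_no_null_ids:
# test_cleaning.py, stored in the project's Libraries folder
import dataiku

def test_no_null_ids():
    # Load a dataset of the project and check a simple data invariant
    df = dataiku.Dataset("customers").get_dataframe()
    assert df["customer_id"].notnull().all()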
Run integration test¶
This step is used to run non-regression tests of your Dataiku project flow. It does this by:
Swapping one or more datasets in the flow with reference input datasets that are stable and known.
Rebuilding one or more items to generate a new set of output datasets.
Comparing the new output with reference datasets.
WebApp test¶
This step checks whether the selected webapp is up and reachable.