The Flow

In DSS, the datasets and the recipes together make up the flow. We have created a visual grammar for data science, so users can quickly understand a data pipeline through the flow.

Through the flow, DSS knows the lineage of every dataset. DSS is therefore able to dynamically rebuild datasets whenever one of their parent datasets or recipes has been modified.
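The idea of lineage can be illustrated with a minimal Python sketch. It is a toy model, not the DSS API: the dataset names, the FLOW_PARENTS mapping and the upstream and downstream helpers are all hypothetical, introduced only to show how a dependency graph lets a tool find what a change affects.

```python
from collections import deque

# Hypothetical toy flow (not the DSS API): each dataset maps to the parent
# datasets read by the recipe that produces it. Names are invented.
FLOW_PARENTS = {
    "orders_raw": [],
    "customers_raw": [],
    "orders_clean": ["orders_raw"],
    "orders_enriched": ["orders_clean", "customers_raw"],
    "revenue_report": ["orders_enriched"],
}


def upstream(dataset, parents=FLOW_PARENTS):
    """Return every ancestor of `dataset` (its lineage), excluding itself."""
    seen = set()
    queue = deque(parents[dataset])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen


def downstream(dataset, parents=FLOW_PARENTS):
    """Return every dataset that depends, directly or indirectly, on `dataset`."""
    return {name for name in parents if dataset in upstream(name, parents)}
```

In this sketch, downstream("customers_raw") yields the datasets that would need rebuilding after the customer data changes, while upstream("revenue_report") yields the full lineage of the report.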

Visual Grammar

Here’s an example of a flow in DSS:

../_images/Pipeline_DSS_1.png

Datasets

Datasets in DSS appear as blue squares. An icon in the lower left of each square represents the type of dataset. For instance, an upward-pointing arrow indicates that the dataset was uploaded.

See Connecting to data for more information on the types of data that you can connect to in DSS.

Visual Recipes

Visual recipes in DSS appear as yellow circles. The icon inside the circle indicates the type of recipe. For instance, the broom icon represents a preparation recipe, while an arrow represents a syncing recipe. Visual recipes are created by users in a web browser using our data exploration window.

See Data preparation for an explanation of visual recipes and the transformations that they can accomplish.

Machine Learning

Processes related to machine learning are shown in green. Here, a barbell represents a model training event, a scatter plot represents model scoring, and the trophy shows an application of the model to a new dataset.

See Machine learning for more information about the machine learning capabilities of DSS.

Code-based recipes

DSS also allows users to include R, Python, SQL, Hive and Pig scripts inside the flow.

../_images/Pipeline_DSS_2.png

Code-based recipes are represented by red circles. The icon inside the circle indicates the type of recipe. For instance, the honeycomb icon represents a Hive recipe.

See Recipes based on code for information on R, Python, SQL, Hive and Pig recipes.

Rebuilding datasets

DSS is aware of the dependencies of each dataset in the flow. A change to a dataset or a recipe can be propagated downstream through the flow.

Users have four different options for building a dataset:

Build only this dataset builds the selected dataset, but none of its dependencies. This option requires the least computation, but does not take into account any changes upstream of the dataset.

Build required datasets is aware of changes made upstream of the selected dataset. It first rebuilds any datasets that require an update due to changes in the flow. These changes could be a modification to a recipe or the addition of new data into one of the datasets. Then, it rebuilds the selected dataset.

Force-rebuild this dataset and all dependencies rebuilds all of the dependencies of the selected dataset, going back to the start of the flow. This is the most computationally intensive operation, but can be used to demonstrate the entire operation of the pipeline.

Build missing dependent datasets then this one builds any missing dependent datasets. This option does not rebuild dependencies affected by changes in the flow. Rather, it solely rebuilds the datasets that are completely missing.
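The difference between the four options can be sketched in Python as a toy build planner. This is an illustrative model only, not how DSS is implemented or invoked: the PARENTS mapping, the plan_build function, the mode labels ("only_this", "required", "force_all", "missing_then_this") and the missing and changed sets are all hypothetical names chosen for this example.

```python
# Hypothetical toy flow (not the DSS API): dataset -> parent datasets read by
# the recipe that produces it. "raw" is a source dataset with no recipe.
PARENTS = {
    "raw": [],
    "clean": ["raw"],
    "joined": ["clean"],
    "report": ["joined"],
}


def lineage_order(target, parents):
    """List the buildable ancestors of `target`, parents first, then `target`.

    Source datasets (no parents, hence no recipe) are skipped.
    """
    order = []

    def visit(name):
        for parent in parents[name]:
            visit(parent)
        if parents[name] and name not in order:
            order.append(name)

    visit(target)
    return order


def plan_build(target, mode, parents, missing=frozenset(), changed=frozenset()):
    """Return the datasets to build, in order, for a hypothetical build mode."""
    chain = lineage_order(target, parents)
    if mode == "only_this":
        # Ignore everything upstream.
        return [target]
    if mode == "force_all":
        # Rebuild the whole lineage back to the start of the flow.
        return chain
    if mode == "missing_then_this":
        # Build only dependencies that do not exist yet, then the target.
        return [d for d in chain if d in missing or d == target]
    if mode == "required":
        # A dataset is stale if it changed, is missing, or has a stale parent.
        # Parents come first in `chain`, so staleness propagates downstream.
        stale = set(changed)
        for d in chain:
            if d in missing or d in stale or any(p in stale for p in parents[d]):
                stale.add(d)
        return [d for d in chain if d in stale or d == target]
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, a change to "joined" makes the "required" mode plan "joined" and then "report", whereas "missing_then_this" with only "clean" missing plans "clean" and "report" and leaves "joined" untouched, mirroring the distinction drawn above.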