Schema for data preparation
Data preparation can be accessed in two places in DSS:
- In the Lab, through the Visual analysis. This is used to iterate between preparation, visualization and machine learning.
- In a Prepare recipe in the Flow. This is used to create a new dataset.
In Data Preparation, meanings are used to automatically suggest relevant transformations (either when clicking on a column header or on a cell).
In this example, the column has a meaning of “HTTP Query string” and has invalid values, so DSS suggests both removal of invalid values and query-string-specific operations.
A visual analysis is a pure “Lab” object, which does not persist its output data. As such, columns in an analysis have no notion of storage type (since they are not stored).
In an analysis, only meanings are shown.
When you create an analysis from a dataset, the forced meanings and column descriptions are carried over from the dataset.
When you create a data preparation recipe, DSS automatically fills the schema of the output dataset. At creation time, the forced meanings and column descriptions are carried over from the input to the output dataset.
The schema of the output dataset is inferred by looking at the data in the sample. DSS automatically tries to use the “best” storage type that matches the data.
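To illustrate the idea of picking the most specific storage type that fits every sampled value, here is a minimal sketch. It is not how DSS implements inference: the function names `infer_storage_type` and `_all_parse` are hypothetical, only three storage types (“bigint”, “double”, “string”) are considered, and Python's own `int` and `float` parsers stand in for the strict validity rules.

```python
def _all_parse(values, parser):
    """Return True if every value is accepted by the given strict parser."""
    for value in values:
        try:
            parser(value)
        except ValueError:
            return False
    return True


def infer_storage_type(sample):
    """Pick the narrowest storage type that is valid for all sampled values.

    Hypothetical helper: tries "bigint", then "double", and falls back
    to "string" when neither fits (or when the sample has no values).
    """
    # Empty cells carry no type information, so they are ignored.
    values = [v for v in sample if v != ""]
    if values and _all_parse(values, int):
        return "bigint"
    if values and _all_parse(values, float):
        return "double"
    return "string"


print(infer_storage_type(["1", "2", "3"]))       # bigint
print(infer_storage_type(["1.5", "2"]))          # double
print(infer_storage_type(["1 245,21", "3"]))     # string
```

A single value that does not parse is enough to push the whole column to a wider type, which is why a formatted number in the sample leads to a “string” column in this sketch.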
Since storage types use strict interpretation of what data is valid, you often need to parse or format the data before being able to use it with a precise storage type.
For example, the string “1 245,21” has meaning “Decimal (comma)”, but is not valid for the “double” storage type, which only accepts “raw” decimals (i.e. “1245.21”). You need to use a “Numerical Format converter” processor to convert to proper raw decimals.
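The conversion in this example can be sketched in a few lines of Python. This only illustrates the transformation itself, not the processor's actual implementation; the function name `decimal_comma_to_raw` is hypothetical, and the sketch assumes spaces are the only thousands separator and the comma is the decimal mark.

```python
def decimal_comma_to_raw(text):
    """Convert a comma-decimal string such as '1 245,21' to a raw decimal.

    Hypothetical helper: assumes spaces (regular or non-breaking) group
    thousands and a comma marks the decimal part.
    """
    cleaned = text.strip()
    # Remove regular, non-breaking and narrow non-breaking spaces.
    for separator in (" ", "\u00a0", "\u202f"):
        cleaned = cleaned.replace(separator, "")
    # Swap the decimal comma for a dot.
    cleaned = cleaned.replace(",", ".")
    # Validate strictly: float() raises ValueError on anything else.
    float(cleaned)
    return cleaned


raw = decimal_comma_to_raw("1 245,21")
print(raw)           # 1245.21
print(float(raw))    # 1245.21
```

Once the value is in raw form, it is valid for a strict numeric parser, which is the property a precise storage type relies on.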
If no other type is possible, DSS defaults to generating a “string” column.
In the UI of the Prepare recipe, both the meaning of the column and the storage type in the output dataset are displayed.
When you change the storage type here, it changes what will be stored in the output dataset. Types that appear in white are “possible”, while those appearing in red will generate errors or warnings.
By default, DSS discards invalid data when storing in the output dataset.
Note that the modification is only applied when you Save the recipe.