Data preparation and schemas

Datasets in Data Science Studio have a schema. The schema of a dataset is the list of columns, with their names and types.

Storage types and meaning

In a Data Preparation recipe, two kinds of “types” are handled.

  • The storage type, which appears directly in the schema of the dataset. The storage type indicates how the dataset backend should store the column data.
  • The meaning, which is the “rich” semantic type that always appears in data preparation. Meanings are automatically detected from the contents of the columns. They provide advanced validation and transformation capabilities. Meanings include, for example: IP address, User-Agent, URL, E-Mail, Latitude, …

Storage types and meanings are related, but the relationship is flexible enough to let Data Science Studio handle invalid data while retaining advanced types.
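To make the distinction concrete, here is a minimal Python sketch of how a meaning could be detected from column contents while the values themselves remain stored as strings. It is an illustration only, not the detection logic DSS actually uses: the function names, the two candidate meanings, the "Text" fallback label and the 50% threshold are all assumptions.

```python
import ipaddress


def is_ip(value):
    # Valid if the standard library can parse it as an IPv4/IPv6 address
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def is_latitude(value):
    # Valid if it is a number between -90 and 90
    try:
        return -90.0 <= float(value) <= 90.0
    except ValueError:
        return False


# Hypothetical list of candidate meanings, checked in order
CANDIDATES = [("IP address", is_ip), ("Latitude", is_latitude)]


def detect_meaning(values, threshold=0.5):
    """Return (meaning, invalid_values) for a column of raw strings.

    A meaning is retained when more than `threshold` of the values
    are valid for it (assumed rule, for illustration only).
    """
    if not values:
        return "Text", []
    for name, check in CANDIDATES:
        valid_count = sum(1 for v in values if check(v))
        if valid_count / len(values) > threshold:
            return name, [v for v in values if not check(v)]
    return "Text", []


column = ["10.0.0.1", "192.168.1.20", "n/a", "172.16.0.5"]
meaning, invalid = detect_meaning(column)
print(meaning, invalid)  # IP address ['n/a']
```

The column keeps its rich meaning even though one value is invalid, which is the flexibility described above: the invalid value is flagged rather than preventing the meaning from being assigned.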

Output dataset schema in data preparation recipes

When you create a data preparation recipe, DSS automatically fills the schema of the output managed dataset.

It uses a mapping between the meaning and storage types to achieve that. For example, if a column was detected as an e-mail in the Shaker, the column will be created with “string” storage type in the output dataset. If a column was detected as a Latitude in the Shaker, it will be created with “double” storage type.

For columns that contain values that the Shaker considers invalid with regard to their meaning, DSS defaults to generating a “string” column.

In this example, the ep_creation column mostly contains boolean values, so its meaning was inferred as Boolean. However, some values (undefined) are invalid, so DSS chose to use the string storage type.

../_images/schema-1.png
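The mapping and the fallback to “string” described above can be sketched in a few lines of Python. This is a simplified model, not DSS code: only the E-Mail and Latitude entries of the mapping come from this page, while the Boolean entry, the toy validator and all function names are assumptions made for the example.

```python
# Meaning -> default storage type.
# E-Mail and Latitude are taken from the text; Boolean is an assumption.
MEANING_TO_STORAGE = {
    "E-Mail": "string",
    "Latitude": "double",
    "Boolean": "boolean",
}


def is_valid_boolean(value):
    # Toy validator (assumed): only "true"/"false" count as valid
    return value.strip().lower() in {"true", "false"}


# Hypothetical validators, keyed by meaning
VALIDATORS = {"Boolean": is_valid_boolean}


def choose_storage_type(meaning, values):
    """Pick the output storage type for a column.

    Falls back to "string" as soon as one value is invalid for the
    detected meaning, so that no value is lost when writing.
    """
    validator = VALIDATORS.get(meaning)
    if validator is not None and not all(validator(v) for v in values):
        return "string"
    return MEANING_TO_STORAGE.get(meaning, "string")


ep_creation = ["true", "false", "undefined", "true"]
print(choose_storage_type("Boolean", ep_creation))        # string
print(choose_storage_type("Boolean", ["true", "false"]))  # boolean
```

Falling back to “string” is the conservative choice: every value, valid or not, can be written to a string column.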

You can click the column header and choose Store as to change the storage type for the column. DSS highlights the compatible storage types in green.

For example, here, this integer column can be stored in various integer and decimal representations, and also as a string, but not as a boolean.

../_images/schema-2.png
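One way to think about “compatible” storage types is that every value of the column must be parseable as that type. The Python sketch below follows that idea under stated assumptions: it covers only a handful of storage types, the accepted boolean spellings are assumed, and the actual compatibility rules applied by DSS may differ.

```python
def parses_as(caster):
    """Build a check that is True when `caster` accepts the value."""
    def check(value):
        try:
            caster(value)
            return True
        except ValueError:
            return False
    return check


def to_boolean(value):
    # Assumed rule: only the literals "true" and "false" are booleans
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    raise ValueError(value)


def to_bounded_int(bits):
    """Build a caster for signed integers that fit in `bits` bits."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

    def cast(value):
        n = int(value)
        if not lo <= n <= hi:
            raise ValueError(value)
        return n
    return cast


# Subset of storage types, for illustration only
STORAGE_CHECKS = {
    "string": lambda value: True,
    "int": parses_as(to_bounded_int(32)),
    "bigint": parses_as(to_bounded_int(64)),
    "double": parses_as(float),
    "boolean": parses_as(to_boolean),
}


def compatible_storage_types(values):
    """List the storage types that can hold every value of the column."""
    return [
        name
        for name, is_ok in STORAGE_CHECKS.items()
        if all(is_ok(v) for v in values)
    ]


print(compatible_storage_types(["12", "7", "40321"]))
# ['string', 'int', 'bigint', 'double']
```

As in the screenshot, the integer column is accepted by the integer, decimal and string types, but not by boolean.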

Each time you Save the preparation recipe, DSS checks if you changed any storage type and warns you that you need to update the output dataset schema to conform to what is declared in the recipe.

You also get a schema mismatch warning if you change the columns (for example, if you add a new step that creates columns): you also need to update the dataset schema to match what the preparation recipe produces.

For example, here is what happens when you save after adding a Split processor.

../_images/schema-3.png
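Conceptually, the check performed on Save amounts to comparing the schema currently declared on the output dataset with the schema the recipe now produces. The sketch below illustrates that comparison in Python. The representation of a schema as a list of (name, storage type) pairs, the function name and the column names are assumptions for the example, not DSS internals.

```python
def schema_diff(dataset_schema, recipe_schema):
    """Compare two schemas given as lists of (column name, storage type)."""
    current = dict(dataset_schema)
    produced = dict(recipe_schema)
    return {
        # Columns the recipe produces but the dataset does not declare
        "added": [n for n in produced if n not in current],
        # Columns the dataset declares but the recipe no longer produces
        "removed": [n for n in current if n not in produced],
        # Columns present on both sides with a different storage type
        "retyped": [
            (n, current[n], produced[n])
            for n in produced
            if n in current and current[n] != produced[n]
        ],
    }


# Schema declared on the output dataset (hypothetical columns)
dataset_schema = [("ip", "string"), ("count", "string")]

# Schema produced after adding a split step and changing "count" to int
recipe_schema = [
    ("ip", "string"),
    ("ip_0", "string"),
    ("ip_1", "string"),
    ("count", "int"),
]

diff = schema_diff(dataset_schema, recipe_schema)
print(diff["added"])    # ['ip_0', 'ip_1']
print(diff["removed"])  # []
print(diff["retyped"])  # [('count', 'string', 'int')]
```

Any non-empty entry in the result corresponds to a situation where the output dataset schema needs to be updated.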

Warning

If you update the schema of a partitioned dataset, you generally need to manually Clear the output dataset.

Failing to do so can lead to invalid data or, especially in the case of SQL datasets, to DSS being unable to write to the dataset because the existing table definition does not match the new schema.