Handling of schemas
Datasets in Data Science Studio have a schema. The schema of a dataset is the list of columns, with their names and types.
Some dataset backends (like SQL databases) have strict requirements for types, while others, like most text-based formats (CSV, fixed-width, JSON, …), accept invalid data more easily.
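As an illustration of "a list of columns, with their names and types", here is a minimal sketch in Python. The `Column` class, the `column_names` helper, and the sample column names are hypothetical and are not part of the DSS API; they only show the shape of the information a schema carries.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Column:
    """One schema entry (hypothetical helper, not a DSS class)."""
    name: str
    type: str  # storage type, for example "string", "bigint" or "double"


# A schema is an ordered list of columns.
schema: List[Column] = [
    Column("customer_id", "bigint"),
    Column("country", "string"),
    Column("revenue", "double"),
]


def column_names(columns: List[Column]) -> List[str]:
    """Return the column names in schema order."""
    return [c.name for c in columns]
```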
Schemas of new external datasets
When an external dataset is created, DSS automatically detects the column names and, in some cases, their types, and initializes the schema of the dataset based on the data.
For source datasets based on SQL, Data Science Studio retrieves the names and exact storage types from the SQL engine. The schema of the dataset is not user-editable, as the “source of truth” for the real schema is the database table.
What if the data changes?
If the schema of the underlying table changes, DSS will automatically update the schema of the dataset. However, it will only do so when you go to the edit page for this dataset. In that case, the “Save” button will be enabled.
For source datasets based on text-like files without a strict schema (CSV, fixed-width, JSON, …), Data Science Studio tries to detect column names from the content and metadata of the files. Column names can be freely edited by the user.
As these files don’t include a schema restricting what kind of data can be present, Data Science Studio takes a conservative approach to typing: all columns in the generated schema will be typed as “string”, which accepts any kind of data.
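The conservative approach can be sketched as follows for a CSV file with a header row. This is only an illustration built on Python's standard `csv` module, not the detection code DSS actually runs; the `detect_schema` function and the sample data are assumptions.

```python
import csv
import io


def detect_schema(csv_text):
    """Take column names from the header row and type every column as 'string'.

    Hypothetical helper: it never looks at the values, so no value can
    contradict the generated schema.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty schema
    return [{"name": name.strip(), "type": "string"} for name in header]


# The second data row contains values that are neither a date nor a number;
# an all-string schema still accepts them.
sample = "id,signup_date,amount\n1,2014-03-01,12.5\n2,not a date,n/a\n"
schema = detect_schema(sample)
```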
There are two main usage patterns from here:
- If you are sure that your data is “valid” for what you want to do with it, you can directly set the storage type in the schema of the dataset. The storage type will be accessible to the recipes using the dataset.
- If you need to clean, enrich or preprocess your data, you can leave all storage types as “string”, and use a Data Preparation recipe to generate a clean dataset. The Data Preparation recipe will automatically generate an output dataset with precise storage types depending on the transformations defined in it. More details are available in Data exploration and preparation.
Why would I want to give a specific type?
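Before setting a storage type directly, it can help to confirm that every value in the column is actually valid for that type. The sketch below is one way to do this outside DSS, under the assumption that "valid" means "parses with the matching Python constructor"; the `PARSERS` table and both function names are hypothetical.

```python
# Assumed mapping from storage type names to Python parsers (illustrative only).
PARSERS = {"string": str, "int": int, "double": float}


def is_valid(value, storage_type):
    """Return True if the text value parses as the given storage type."""
    parser = PARSERS[storage_type]  # an unknown type name raises KeyError
    try:
        parser(value)
        return True
    except ValueError:
        return False


def invalid_values(values, storage_type):
    """List the values that would not fit the candidate storage type."""
    return [v for v in values if not is_valid(v, storage_type)]
```

If `invalid_values` returns an empty list for a column, setting the corresponding storage type is reasonable; otherwise the column is a candidate for cleaning in a Data Preparation recipe first.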
Setting a storage type on a column from a text-based dataset does not directly impact how DSS itself interprets the content of the column. For example, Data Preparation does not use the defined storage type when reading a text-based dataset.
However, some recipes will use it. For example, if you use a text-based dataset directly in Pig, the storage type information will be taken into account for the typing of the input relations. If you use a text-based dataset directly in Hive, the storage type information will be taken into account for the typing of the input Hive table.
What if the data changes?
If you have not edited the schema detected by Data Science Studio, it will automatically update the schema of the dataset if it notices that the columns of the underlying data files have changed. However, it will only do so when you go to the edit page for this dataset. In that case, the “Save” button will be enabled.
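To make the Hive case concrete, the sketch below turns a schema into the column list of a Hive table definition. The mapping DSS really applies is not described here, so the `HIVE_TYPES` table, the fallback to `STRING`, and the `hive_columns` function are all assumptions made for illustration.

```python
# Assumed storage type to Hive type mapping (illustrative, not the DSS mapping).
HIVE_TYPES = {
    "string": "STRING",
    "int": "INT",
    "bigint": "BIGINT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
}


def hive_columns(schema):
    """Build the column list of a Hive table definition.

    schema is a list of (column name, storage type) pairs. Storage types
    missing from HIVE_TYPES fall back to STRING.
    """
    parts = []
    for name, storage_type in schema:
        hive_type = HIVE_TYPES.get(storage_type, "STRING")
        parts.append("`{}` {}".format(name, hive_type))
    return ", ".join(parts)
```

With every column left as “string”, the same function would declare all Hive columns as `STRING`, which is why setting storage types matters for such recipes.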
If you have manually edited the schema, Data Science Studio will notice the mismatch when you go to the edit page for this dataset and display a warning. You can then manually adapt the schema to the new data.
Schemas of managed datasets
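The two cases above amount to a small decision rule, sketched below. The function name, the `user_edited` flag, and the returned status strings are hypothetical; the sketch only restates the behavior described in this section and does not show how DSS implements it.

```python
def on_open_settings(stored_schema, detected_schema, user_edited):
    """Decide what happens to the schema when the dataset page is opened.

    Returns a pair (schema to display, status), where status is one of:
    "unchanged": the data still matches the stored schema
    "updated":   the schema was refreshed from the data and can be saved
    "warning":   the user-edited schema no longer matches the data
    """
    if stored_schema == detected_schema:
        return stored_schema, "unchanged"
    if not user_edited:
        # Never edited by hand: follow the data.
        return detected_schema, "updated"
    # Edited by hand: keep the user's schema and flag the mismatch.
    return stored_schema, "warning"
```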
In an external dataset, the “source of truth” about the dataset is the data itself. This is why, on a SQL external dataset, the schema is not editable, as Data Science Studio implicitly trusts the SQL table. In a managed dataset, on the other hand, the user controls the schema, and defines it from the start.
When you manually create a managed dataset, it starts empty with an empty schema. You can then manually fill the schema in the dataset editing UI.
In most situations, you would not fill the schema manually but let the recipe that generates the dataset do it. Managed datasets are then created from the recipe editing UI, to be used as output of the recipe being edited.
If at some point you modify the schema of a managed dataset while it already contains data, and the new schema does not match the existing data, Data Science Studio will notice the error and give you the option to reload the schema from the actual data, or to drop the existing data.
Each time you modify the schema (from the dataset UI, the recipe UI or by validating a recipe), it is recommended to click on the button to check the consistency between data and schema.