Definitions¶
Datasets in Data Science Studio have a schema. The schema of a dataset is the list of columns, with their names and types.
There are two kinds of “types” in DSS:
The storage type, used to indicate how the dataset backend should store the column data.
The meaning, a “rich” semantic type. Meanings are automatically detected from the contents of the columns.
Storage types and meanings are related, but the link between them is deliberately flexible. This flexibility allows Data Science Studio to handle invalid data while retaining advanced meanings.
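As a rough illustration of these definitions, a schema can be pictured as a list of columns, each carrying a name, a storage type and, optionally, a meaning. The sketch below is a plain-Python model only: the class names `Column` and `Schema`, the column names, and the use of plain strings for types and meanings are assumptions made for illustration, not the DSS API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Column:
    name: str
    storage_type: str               # e.g. "string", "bigint", "double"
    meaning: Optional[str] = None   # e.g. "Email address"; None if none is set


@dataclass
class Schema:
    columns: List[Column] = field(default_factory=list)

    def names(self) -> List[str]:
        """Return the column names in schema order."""
        return [c.name for c in self.columns]


# Hypothetical dataset: two columns are stored as plain strings
# but still carry a richer meaning.
schema = Schema([
    Column("customer_id", "bigint"),
    Column("email", "string", meaning="Email address"),
    Column("last_ip", "string", meaning="IP Address"),
])
```

Note how the `email` and `last_ip` columns keep the permissive `string` storage type while carrying a more specific meaning, which is the kind of flexibility described above.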
Storage types¶
Storage types are “technical” types:
string
int (32 bits), bigint (64 bits), smallint (16 bits), tinyint (8 bits)
float (32 bits, floating point), double (64 bits, floating point)
boolean
date
geopoint (for storing coordinates)
geometry (for storing lines, polygons, …)
array
map
object
A storage type is “strict”, i.e. it is generally not possible to store data that would be “invalid” for that storage type. For example, if a SQL table has an “int” column, a decimal value cannot be stored in it.
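The strictness described above can be sketched as a simple check that a value fits a storage type. This is a minimal illustration, not how DSS or any particular backend validates data: the function name `fits_storage_type` is hypothetical, the integer types are assumed to be signed, and only a few of the storage types are covered.

```python
# Inclusive value ranges for the integer storage types, assuming signed integers.
INT_RANGES = {
    "tinyint": (-2**7, 2**7 - 1),
    "smallint": (-2**15, 2**15 - 1),
    "int": (-2**31, 2**31 - 1),
    "bigint": (-2**63, 2**63 - 1),
}


def fits_storage_type(value, storage_type: str) -> bool:
    """Return True if `value` can be stored as-is in `storage_type`."""
    if storage_type in INT_RANGES:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return False
        low, high = INT_RANGES[storage_type]
        return low <= value <= high
    if storage_type in ("float", "double"):
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if storage_type == "boolean":
        return isinstance(value, bool)
    if storage_type == "string":
        return isinstance(value, str)
    raise ValueError(f"storage type not covered by this sketch: {storage_type}")
```

With this sketch, `fits_storage_type(42, "int")` is true, while `fits_storage_type(3.14, "int")` and `fits_storage_type(300, "tinyint")` are both false: the first because a decimal is not an integer, the second because the value is out of range for 8 bits.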
Why use precise storage types?¶
Storage types are used in many places in DSS, notably in recipes, including to generate the queries and jobs that run in other systems (like SQL, Hadoop, Spark).
For example, if you use a text-based dataset directly in a Spark recipe, the storage type information will be taken into account for the typing of the input dataframes. Likewise, if you use a text-based dataset directly in Hive, the storage type information will be taken into account for the typing of the input Hive table.
A mistyped column will generally result in failures.
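To make the dependency on storage types concrete, the sketch below builds a Hive-style `CREATE TABLE` statement from a list of (column name, storage type) pairs. It is an illustration under stated assumptions, not the statement DSS actually generates: the function name `hive_create_table` and the table and column names are hypothetical, and only the storage types that exist under the same name in Hive are handled.

```python
# Storage types that exist under the same name in Hive.
HIVE_TYPES = {
    "string", "int", "bigint", "smallint", "tinyint",
    "float", "double", "boolean",
}


def hive_create_table(table: str, columns: list) -> str:
    """Build a CREATE TABLE statement from (name, storage_type) pairs."""
    parts = []
    for name, storage_type in columns:
        if storage_type not in HIVE_TYPES:
            raise ValueError(f"type not covered by this sketch: {storage_type}")
        # The declared storage type goes straight into the table definition.
        parts.append(f"`{name}` {storage_type.upper()}")
    return f"CREATE TABLE `{table}` ({', '.join(parts)})"


ddl = hive_create_table("orders", [("order_id", "bigint"), ("amount", "double")])
print(ddl)
```

Because the declared type is copied directly into the table definition, declaring `amount` as `int` while the underlying text file contains values such as `12.5` would produce a definition that does not match the data.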
Meanings¶
Meanings have a “high-level” definition, like:
URL
IP Address
Email address
Country code
Currency code
…
Each meaning in DSS is able to validate a cell value. Thus, each cell can be “valid” or “invalid” for a given meaning.
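Per-cell validation can be sketched as one validator function per meaning, applied to each value of a column. The validators below are deliberately simple stand-ins built on the Python standard library; they are not the rules DSS uses to detect or validate meanings, and the names `VALIDATORS` and `validate_column` are hypothetical.

```python
import ipaddress
import re
from urllib.parse import urlparse


def is_ip_address(value: str) -> bool:
    """Accept any IPv4 or IPv6 address the standard library can parse."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def is_url(value: str) -> bool:
    """Accept http(s) URLs that have a network location."""
    try:
        parsed = urlparse(value)
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# Deliberately simple pattern; real email validation is more involved.
_EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_email(value: str) -> bool:
    return _EMAIL_RE.fullmatch(value) is not None


VALIDATORS = {
    "IP Address": is_ip_address,
    "URL": is_url,
    "Email address": is_email,
}


def validate_column(values, meaning):
    """Return one True/False flag per cell for the given meaning."""
    check = VALIDATORS[meaning]
    return [check(v) for v in values]


flags = validate_column(["192.168.0.1", "not-an-ip", "::1"], "IP Address")
print(flags)  # [True, False, True]
```

Every value here is an ordinary string as far as storage is concerned; only the meaning decides which cells count as valid and which as invalid.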