Component: Sample Dataset
Description
Dataiku DSS lets you kickstart a project by adding ready-to-use datasets, called sample datasets.
Some sample datasets are installed by default. If you need to add your own sample datasets, you can develop a sample dataset plugin.
See also
A tutorial on this plugin component is available in the Developer Guide: Sample dataset.
Creation
To develop a new sample dataset plugin, go to a development plugin page, click on “+New Component”, then choose “Sample Dataset” from the list of components. Choose a name for your component, and click on “Add”.
Note
Once this is done, Dataiku DSS opens the code editor, allowing you to update the metadata of your plugin, as well as the sample’s content, configuration and resource files.
Configuration
A sample dataset plugin is configured through the associated JSON files dataset.json and config_{<version>}.json (automatically created by Dataiku DSS),
in the directory sample-datasets/{<sample-dataset-id>}.
These JSON configuration files comprise different parts, as shown in the code below.
dataset.json
{
    // Metadata section.
    "meta": {
        // Metadata used for display purposes.
        // See below for more information.
    }
}
config_{<version>}.json
{
    // Global configuration about the sample data plugin.
    // See below for more information.
    "columns": [...]
}
Metadata configuration
For the metadata section, the usual configuration applies. Please refer to Metadata section.
There are some additional optional parameters that you may want to fill depending on your needs.
/* Logo used on the Sample Dataset modal to represent your sample */
"logo": "logo-name.png",
/* Number used to sort the dataset samples by descending order when listed */
"displayOrderRank": 10,
/* List of available versions in your plugin */
"versions": ["v1", "v2"],
/* Version of your sample to install by default or when listed in the UI */
"activeVersion": "v2"
Logo
If you want to specify a logo for your sample, place the image inside a shared resource folder at the root of your plugin.
The logo file name can only contain letters, digits, dots, underscores, hyphens and spaces.
Your logo should ideally be an image of 280x200 pixels.
The available extensions for a logo are the following: .apng, .png, .avif, .gif, .jpg, .jpeg, .jfif, .svg, .webp, .bmp, .ico, .cur.
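As an illustration, the naming and extension rules above could be checked before packaging a plugin. This is only a sketch: `is_valid_logo_name` is a hypothetical helper, not part of Dataiku DSS, and it assumes that the character restriction applies to the logo file name and that "letters" means ASCII letters.

```python
import re
from pathlib import PurePosixPath

# Extensions listed in the documentation above.
ALLOWED_LOGO_EXTENSIONS = {
    ".apng", ".png", ".avif", ".gif", ".jpg", ".jpeg",
    ".jfif", ".svg", ".webp", ".bmp", ".ico", ".cur",
}

# Assumption: "letters" means ASCII letters; adjust if DSS accepts more.
_LOGO_NAME_PATTERN = re.compile(r"[A-Za-z0-9._\- ]+")


def is_valid_logo_name(file_name: str) -> bool:
    """Return True if file_name looks like an acceptable logo file name."""
    # Reject empty names and any character outside the allowed set.
    if not _LOGO_NAME_PATTERN.fullmatch(file_name):
        return False
    # Compare the extension case-insensitively against the allowed list.
    return PurePosixPath(file_name).suffix.lower() in ALLOWED_LOGO_EXTENSIONS
```

For example, `is_valid_logo_name("thermometer-logo.png")` is accepted, while a name containing a slash or ending in `.txt` is rejected.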
Versioning
Use versioning if you want to update the configuration or content of a sample dataset without impacting the projects or datasets that are currently using it.
To do that, list the versions of your dataset in the versions field, and set activeVersion to the version you want installed by default.
For each version, you must have a dedicated data folder located in sample-datasets/{<sample-dataset-id>}/data_{<version>}.
It is also highly recommended to have a dedicated configuration file in sample-datasets/{<sample-dataset-id>}/config_{<version>}.json.
Otherwise, the configuration will be read from the dataset.json file.
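The per-version layout described above can be checked with a few lines of Python. The following is a minimal sketch, not a Dataiku DSS API: `check_version_layout` is a hypothetical helper that takes the sample dataset directory and the list of versions (as declared in the versions field) and reports which versions lack a data folder or a dedicated configuration file.

```python
from pathlib import Path


def check_version_layout(sample_dir, versions):
    """Report which versions lack a data folder or a dedicated config file."""
    sample_dir = Path(sample_dir)
    report = {"missing_data": [], "missing_config": []}
    for version in versions:
        # Required: one data folder per version.
        if not (sample_dir / f"data_{version}").is_dir():
            report["missing_data"].append(version)
        # Recommended: without it, the configuration is read from dataset.json.
        if not (sample_dir / f"config_{version}.json").is_file():
            report["missing_config"].append(version)
    return report
```

The two lists are kept separate because a missing data folder is an error, whereas a missing configuration file only triggers the fallback to dataset.json.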
Sample data
Your sample data must consist of one or more .csv or .csv.gz files
located in sample-datasets/{<sample-dataset-id>}/data_{<version>}/{<sample-file>}.
By default, your sample dataset component is created with a small example file named sample.csv.
Each sample file is a CSV file where the separator is a comma (,) and the quote character is a double quote (").
Ensure that the input CSV files adhere to this format for correct parsing and processing.
Each sample file should not contain any header row, as the column names are defined in the JSON configuration file.
You will need at least one sample file to save your plugin.
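One way to produce a file in the expected format is Python's standard csv module. The sketch below is an illustration only: `write_sample_file` is a hypothetical helper, and the choice of UTF-8 encoding is an assumption that the documentation above does not state.

```python
import csv
import gzip


def write_sample_file(path, rows):
    """Write rows as a header-less CSV (comma separator, double-quote quoting)."""
    # Use gzip for .csv.gz targets, a plain text file otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    # Assumption: UTF-8 encoding.
    with opener(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(
            handle, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL
        )
        # No header row is written: column names live in the JSON configuration.
        writer.writerows(rows)
```

Calling `write_sample_file("sample.csv", [["2020-01-01", 3.5]])` writes a single data row; values containing commas or double quotes are quoted automatically by the writer.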
Global configuration
For the global configuration, a sample dataset has to define the following in its configuration file config_{<version>}.json:
/* The number of rows in your sample, will be displayed when listing the dataset samples */
"rowCount": 100,
/* Description of the schema of your dataset */
"columns": [...],
Schema
The columns field is where you specify the schema of your sample dataset. It contains one entry per column of your dataset.
Your columns should be listed in the same order as in your CSV file.
Each column is structured with the following fields:
name: unique identifier of the column
type: storage type of the column
comment (optional): description of the column
meaning (optional): “high-level” definition of the column, used to validate the cell. If not set, the meaning will be deduced from the sample content.
For the list of column types, please refer to Storage types.
For the list of available meanings, please refer to List of recognized meanings.
{
    /* Unique identifier for your column */
    "name": "columns-name",
    /* Type of the column */
    "type": "double",
    /* Optional description of the column, that will be displayed in the dataset's explore view */
    "comment": "This is the description of my column",
    /* Optional meaning of the column */
    "meaning": "DoubleMeaning"
}
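Because the sample files carry no header row, column names are matched to values purely by position. The sketch below illustrates that pairing and flags rows whose field count differs from the schema. `rows_as_records` is a hypothetical helper, and the columns list is passed as plain Python data, since the configuration files shown here contain comments that the standard json module would reject.

```python
import csv
import io


def rows_as_records(csv_text, columns):
    """Pair each header-less CSV row with the column names from the schema."""
    names = [column["name"] for column in columns]
    records = []
    for index, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        # A row with a different width cannot be mapped onto the schema.
        if len(row) != len(names):
            raise ValueError(
                f"row {index} has {len(row)} fields, expected {len(names)}"
            )
        records.append(dict(zip(names, row)))
    return records
```

Values are returned as strings; converting them to the declared storage type is left out of this sketch.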
Complete example
dataset.json
/* This file is the descriptor for the sample dataset template my-sample-dataset */
{
    "meta": {
        "label": "Temperature Time Series",
        "description": "This dataset contains daily temperature data over a multi-year period",
        "icon": "fas fa-thermometer",
        "logo": "thermometer-logo.png",
        "displayOrderRank": 100,
        "versions": ["v1"],
        "activeVersion": "v1"
    }
}
config_v1.json
/* This file is the config for the sample dataset template my-sample-dataset version v1 */
{
    "rowCount": 100,
    "columns": [
        {
            "name": "date",
            "type": "dateonly",
            "comment": "The date of the temperature observation in the format YYYY-MM-DD",
            "meaning": "DateOnly"
        },
        {
            "name": "temperature",
            "type": "float",
            "comment": "The recorded temperature value for the given date"
        },
        {
            "name": "temperature_trend",
            "type": "float",
            "comment": "A calculated trend value representing the smoothed or averaged temperature over a specific period"
        }
    ]
}
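The rowCount value is meant to reflect the number of rows in the sample, so it can be useful to compute it from the data files rather than by hand. The following sketch shows one possible approach. `count_sample_rows` is a hypothetical helper, and it assumes that rowCount is the total across all files of a version and that the files are UTF-8 encoded; neither is stated explicitly above.

```python
import csv
import gzip
from pathlib import Path


def count_sample_rows(data_dir):
    """Count CSV records across all .csv and .csv.gz files in data_dir."""
    total = 0
    for path in sorted(Path(data_dir).iterdir()):
        if path.name.endswith(".csv.gz"):
            opener = gzip.open
        elif path.name.endswith(".csv"):
            opener = open
        else:
            continue  # Ignore anything that is not a sample file.
        with opener(path, "rt", newline="", encoding="utf-8") as handle:
            # csv.reader yields records, so a quoted multi-line cell counts once;
            # blank lines are skipped.
            total += sum(1 for row in csv.reader(handle) if row)
    return total
```

For the example above, `count_sample_rows("sample-datasets/my-sample-dataset/data_v1")` would be expected to match the rowCount declared in config_v1.json.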