Component: Sample Dataset

Description

Dataiku DSS lets you kickstart a project by adding ready-to-use datasets, called sample datasets.

Some sample datasets are installed by default. If you need to add your own sample datasets, you can develop a sample dataset plugin.

See also

A tutorial on this plugin component is available in the Developer Guide: Sample dataset.

Creation

To develop a new sample dataset plugin, go to a development plugin page, click on “+New Component”, then choose “Sample Dataset” from the list of components. Choose a name for your component, and click on “Add”.

Note

Once this is done, Dataiku DSS opens the code editor, allowing you to update the metadata of your plugin, as well as the sample’s content, configuration and resource files.

Configuration

A sample dataset plugin can be configured via the associated JSON files dataset.json and config_{<version>}.json (automatically created by Dataiku DSS), in the directory sample-datasets/{<sample-dataset-id>}.

These JSON configuration files comprise several parts, as shown in the code below.

Main configuration file of a sample dataset plugin dataset.json
{
    // Metadata section.
    "meta": {
        // Metadata used for display purposes.
        // See below for more information.
    }
}
Dataset configuration file of a sample dataset plugin config_{<version>}.json
{
    // Global configuration of the sample dataset.
    // See below for more information.
    "columns": [...]
}

Metadata configuration

For the metadata section, the usual configuration applies. Please refer to Metadata section.

There are some additional optional parameters that you may want to fill depending on your needs.

Additional optional metadata parameters of a sample dataset
/* Logo used on the Sample Dataset modal to represent your sample */
"logo": "logo-name.png",

/* Number used to sort the dataset samples by descending order when listed */
"displayOrderRank": 10,

/* List of available versions in your plugin */
"versions": ["v1", "v2"],

/* Version of your sample to install by default or when listed in the UI */
"activeVersion": "v2"

Versioning

Use versioning if you want to update the configuration or content of a sample dataset without impacting the projects or datasets that are currently using it.

To do so, list the versions of your dataset in the versions field, and set activeVersion to the version you want to install by default.

Each version must have a dedicated data folder located in sample-datasets/{<sample-dataset-id>}/data_{<version>}.

It is also highly recommended to have a dedicated configuration file for each version in sample-datasets/{<sample-dataset-id>}/config_{<version>}.json. Otherwise, the configuration will be read from the dataset.json file.
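
The layout rules above can be checked with a short script before you save the plugin. The following is a minimal sketch and not part of Dataiku DSS: the helper name check_version_layout is hypothetical, and it assumes you pass in the list you declared in the versions field rather than parsing dataset.json itself.

```python
from pathlib import Path


def check_version_layout(sample_dir, versions):
    """Return a list of layout problems for the given versions.

    sample_dir is the sample-datasets/<sample-dataset-id> directory and
    versions is the list declared in the "versions" metadata field.
    Hypothetical helper that only mirrors the rules described above.
    """
    sample_dir = Path(sample_dir)
    problems = []
    for version in versions:
        # Each version must have its own data folder.
        if not (sample_dir / f"data_{version}").is_dir():
            problems.append(f"missing data folder data_{version}")
        # A dedicated config file is recommended, not required.
        if not (sample_dir / f"config_{version}.json").is_file():
            problems.append(f"no config_{version}.json (dataset.json will be used)")
    return problems
```

An empty list means every listed version has both its data folder and its own configuration file.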

Sample data

Your sample data must consist of one or more .csv or .csv.gz files located in sample-datasets/{<sample-dataset-id>}/data_{<version>}/{<sample-file>}. By default, your sample dataset component is created with a small example file named sample.csv.

Each sample file is a CSV file where the separator is a comma , and the quote character is a double quote ". Make sure your CSV files follow this format so that they are parsed correctly. Sample files must not contain a header row, as the column names are defined in the JSON configuration file.

You will need at least one sample file to save your plugin.
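
If you generate your sample files from code, the format described above can be produced with Python's standard csv module. This is a minimal sketch under assumptions: the function name write_sample_file and its signature are illustrative, and UTF-8 encoding is assumed because the text above does not specify one.

```python
import csv
import gzip


def write_sample_file(path, rows):
    """Write rows as a headerless CSV using ',' as separator and '"' as quote.

    If path ends with .gz the output is gzip-compressed.
    Illustrative helper; UTF-8 encoding is an assumption.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    # newline="" lets the csv module control line endings itself.
    with opener(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=",", quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        writer.writerows(rows)  # no header row is written
```

For example, write_sample_file("data_v1/sample.csv.gz", rows) would write a compressed sample file, with rows being any iterable of row lists.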

Global configuration

For the global configuration, a sample dataset has to define the following in its configuration file config_{<version>}.json:

Global configuration of a sample dataset
/* The number of rows in your sample, will be displayed when listing the dataset samples */
"rowCount": 100,

/* Description of the schema of your dataset */
"columns": [...],

Schema

The columns field is where you specify the schema of your sample dataset. It must contain one entry per column of your dataset, listed in the same order as the columns appear in your CSV files.

Each column is structured with the following fields:

  • name: unique identifier of the column

  • type: storage type of the column

  • comment (optional): description of the column

  • meaning (optional): “high-level” definition of the column, used to validate the cell. If not set, the meaning will be deduced from the sample content.

For the list of column types, please refer to Storage types.

For the list of available meanings, please refer to List of recognized meanings.

Structure of a column
{
    /* Unique identifier for your column */
    "name": "columns-name",
    /* Type of the column */
    "type": "double",
    /* Optional description of the column, that will be displayed in the dataset's explore view */
    "comment": "This is the description of my column",
    /* Optional meaning of the column */
    "meaning": "DoubleMeaning"
}
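
Because sample files have no header row, a mismatch between the schema and the data is easy to miss. The sketch below, which is only an illustration, flags rows whose number of fields differs from the number of declared columns. The helper name find_bad_rows is hypothetical, and it assumes you already have the columns list and the text of one sample file in memory.

```python
import csv
import io


def find_bad_rows(csv_text, columns):
    """Return the 1-based numbers of rows whose field count differs from the schema.

    columns is the list from the "columns" field and csv_text is the
    content of one headerless sample file. Hypothetical helper.
    """
    expected = len(columns)
    reader = csv.reader(io.StringIO(csv_text), delimiter=",", quotechar='"')
    return [number for number, row in enumerate(reader, start=1)
            if len(row) != expected]
```

This only compares field counts. It does not verify that values match the declared storage type or meaning.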

Complete example

Complete example of the dataset.json
/* This file is the descriptor for the sample dataset template my-sample-dataset */
{
    "meta": {
        "label": "Temperature Time Series",
        "description": "This dataset contains daily temperature data over a multi-year period",
        "icon": "fas fa-thermometer",
        "logo": "thermometer-logo.png",
        "displayOrderRank": 100,
        "versions": ["v1"],
        "activeVersion": "v1"
    }
}
Complete example of the config_v1.json
/* This file is the config for the sample dataset template my-sample-dataset version v1 */
{
    "rowCount": 100,
    "columns": [
        {
            "name": "date",
            "type": "dateonly",
            "comment": "The date of the temperature observation in the format YYYY-MM-DD",
            "meaning": "DateOnly"
        },
        {
            "name": "temperature",
            "type": "float",
            "comment": "The recorded temperature value for the given date"
        },
        {
            "name": "temperature_trend",
            "type": "float",
            "comment": "A calculated trend value representing the smoothed or averaged temperature over a specific period"
        }
    ]
}
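
A configuration file like the one above could also be generated from the sample data. The following sketch is an illustration only: the function name build_config is hypothetical, and it assumes that rowCount is meant to be the total number of non-empty rows across all sample files of a version, which this page does not state explicitly.

```python
import csv
import io
import json


def build_config(columns, csv_texts):
    """Return the JSON text of a config_<version>.json file.

    columns is the schema list and csv_texts holds the content of each
    headerless sample file. rowCount is assumed to be the total number
    of non-empty rows across those files.
    """
    row_count = sum(
        1
        for text in csv_texts
        for row in csv.reader(io.StringIO(text))
        if row  # skip blank lines
    )
    return json.dumps({"rowCount": row_count, "columns": columns}, indent=4)
```

The returned string can then be written to config_{<version>}.json in the sample dataset directory.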