Managing datasets

Datasets belong to a given project, so all access to datasets requires getting a handle on the project first.
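
For example (a minimal sketch, assuming an API client has already been created and that the project and dataset names below exist on your instance):

project = client.get_project('TEST_PROJECT')
dataset = project.get_dataset('TEST_DATASET')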

Basic operations

The list of all datasets of a project is accessible via list_datasets()

import pprint
prettyprinter = pprint.PrettyPrinter(indent=4)

project = client.get_project('TEST_PROJECT')
datasets = project.list_datasets()
prettyprinter.pprint(datasets)

outputs

[   {   'checklists': {   'checklists': []},
        'customMeta': {   'kv': {   }},
        'flowOptions': {   'crossProjectBuildBehavior': 'DEFAULT',
                            'rebuildBehavior': 'NORMAL'},
        'formatParams': {  /* Parameters specific to each format type */ },
        'formatType': 'csv',
        'managed': False,
        'name': 'train_set',
        'params': { /* Parameters specific to each dataset type */  },
        'partitioning': {   'dimensions': [], 'ignoreNonMatchingFile': False},
        'projectKey': 'TEST_PROJECT',
        'schema': {   'columns': [   {   'name': 'col0',
                                         'type': 'string'},
                                     {   'name': 'col1',
                                         'type': 'string'},
                                     /* Other columns ... */
                                     ],
                      'userModified': False},
        'tags': ['creator_admin'],
        'type': 'UploadedFiles'},
...
]

Datasets can be deleted.

dataset = project.get_dataset('TEST_DATASET')
dataset.delete()

The metadata of the dataset can be modified. It is advised to first retrieve the current metadata with the get_metadata() call, modify the returned object, and then set it back on the DSS instance with set_metadata():

dataset_metadata = dataset.get_metadata()
dataset_metadata['tags'] = ['tag1', 'tag2']
dataset.set_metadata(dataset_metadata)

Accessing the dataset data

The data of a dataset can be streamed over HTTP to the API client with the iter_rows() method. This call returns the raw data, so in most cases it is necessary to first get the dataset’s schema with a call to get_schema(). For example, printing the first 10 rows can be done with

columns = [column['name'] for column in dataset.get_schema()['columns']]
print(columns)
row_count = 0
for row in dataset.iter_rows():
        print(row)
        row_count = row_count + 1
        if row_count >= 10:
                break

outputs

['tube_assembly_id', 'supplier', 'quote_date', 'annual_usage', 'min_order_quantity', 'bracket_pricing', 'quantity', 'cost']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '1', '21.9059330191461']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '2', '12.3412139792904']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '5', '6.60182614356538']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '10', '4.6877695119712']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '25', '3.54156118026073']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '50', '3.22440644770007']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '100', '3.08252143576504']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '250', '2.99905966403855']
['TA-00004', 'S-0066', '2013-07-07', '0', '0', 'Yes', '1', '21.9727024365273']
['TA-00004', 'S-0066', '2013-07-07', '0', '0', 'Yes', '2', '12.4079833966715']

Schema

The schema of a dataset can be modified with the set_schema() method:

schema = dataset.get_schema()
schema['columns'].append({'name' : 'new_column', 'type' : 'bigint'})
dataset.set_schema(schema)

Partitions

For partitioned datasets, the list of partitions is retrieved with list_partitions():

partitions = dataset.list_partitions()

And the data of a given partition can be retrieved by passing the appropriate partition spec as a parameter to iter_rows():

row_count = 0
for row in dataset.iter_rows(partitions='partition_spec1,partition_spec2'):
        print(row)
        row_count = row_count + 1
        if row_count >= 10:
                break

Dataset operations

The rows of the dataset can be cleared, entirely or on a per-partition basis, with the clear() method.

dataset = project.get_dataset('SOME_DATASET')
dataset.clear(['partition_spec_1', 'partition_spec_2'])    # clears the specified partitions
dataset.clear()                                            # clears all partitions

For datasets associated with a table in the Hive metastore, the table definition in the metastore must be synchronized with the dataset’s schema in DSS before the dataset becomes visible to Hive and usable by Impala queries.

dataset = project.get_dataset('SOME_HDFS_DATASET')
dataset.synchronize_hive_metastore()

Creating datasets

Datasets can be created. For example, to load the CSV files of a folder:

import pandas
from os import listdir

project = client.get_project('TEST_PROJECT')
folder_path = 'path/to/folder/'
for file in listdir(folder_path):
    if not file.endswith('.csv'):
        continue
    # the dot is not allowed in dataset names, so strip the '.csv' extension
    dataset = project.create_dataset(file[:-4],
        'Filesystem',
        params={
            'connection': 'filesystem_root',
            'path': folder_path + file
        },
        formatType='csv',
        formatParams={
            'separator': ',',
            'style': 'excel',    # excel-style quoting
            'parseHeaderRow': True
        })
    # read the column names with pandas and declare them all as strings in the schema
    df = pandas.read_csv(folder_path + file)
    dataset.set_schema({'columns': [{'name': column, 'type': 'string'} for column in df.columns]})

Reference documentation

class dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)

A dataset on the DSS instance

delete(drop_data=False)

Delete the dataset

Parameters:
drop_data (bool) – Should the data of the dataset be dropped
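
For example (a minimal sketch), deleting a dataset and dropping its underlying data:

dataset = project.get_dataset('TEST_DATASET')
dataset.delete(drop_data=True)
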
get_definition()

Get the definition of the dataset

Returns:
the definition, as a JSON object
set_definition(definition)

Set the definition of the dataset

Args:
definition: the definition, as a JSON object. You should only set a definition object that has been retrieved using the get_definition call.
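
For example (a minimal sketch for a csv-format dataset; the keys available in the definition depend on the dataset type and format):

dataset = project.get_dataset('TEST_DATASET')
definition = dataset.get_definition()
definition['formatParams']['separator'] = '\t'   # e.g. switch the CSV separator to tabs
dataset.set_definition(definition)
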
get_schema()

Get the schema of the dataset

Returns:
a JSON object of the schema, with the list of columns
set_schema(schema)

Set the schema of the dataset

Args:
schema: the desired schema for the dataset, as a JSON object. All columns have to provide their name and type
get_metadata()

Get the metadata attached to this dataset. The metadata contains the label, description, checklists, tags and custom metadata of the dataset

Returns:
a dict object. For more information on available metadata, please see https://doc.dataiku.com/dss/api/latest
set_metadata(metadata)

Set the metadata on this dataset.

Args:
metadata: the new state of the metadata for the dataset. You should only set a metadata object that has been retrieved using the get_metadata call.
iter_rows(partitions=None)

Get the dataset’s data

Returns:
an iterator over the rows, each row being a tuple of values. The order of values in the tuples is the same as the order of columns in the schema returned by get_schema
list_partitions()

Get the list of all partitions of this dataset

Returns:
the list of partitions, as a list of strings
clear(partitions=None)

Clear all data in this dataset

Args:
partitions: (optional) a list of partitions to clear. When not provided, the entire dataset is cleared
synchronize_hive_metastore()

Synchronize this dataset with the Hive metastore

compute_metrics(partition='', metric_ids=None, probes=None)

Compute metrics on a partition of this dataset. If neither metric ids nor a custom set of probes is specified, the metrics set up on the dataset are used.
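
For example (a minimal sketch; 'partition_spec1' is a placeholder partition spec):

dataset.compute_metrics()                               # on the whole dataset
dataset.compute_metrics(partition='partition_spec1')    # on a single partition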

run_checks(partition='', checks=None)

Run checks on a partition of this dataset. If no checks are specified, the checks set up on the dataset are used.
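
For example (a minimal sketch; checks must already be configured on the dataset, and 'partition_spec1' is a placeholder):

dataset.run_checks()                               # on the whole dataset
dataset.run_checks(partition='partition_spec1')    # on a single partition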

get_last_metric_values(partition='')

Get the last values of the metrics on this dataset

Returns:
a list of metric objects and their value
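
For example (a minimal sketch, assuming metrics have already been computed on this dataset):

last_values = dataset.get_last_metric_values()
print(last_values)   # the metrics returned depend on what is configured on the dataset
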
get_metric_history(metric, partition='')

Get the history of the values of the metric on this dataset

Returns:
an object containing the values of the metric, cast to the appropriate type (double, boolean,…)
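
For example (a minimal sketch; 'records:COUNT_RECORDS' is an example metric id and may not be configured on your dataset):

history = dataset.get_metric_history('records:COUNT_RECORDS')
print(history)
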
get_usages()

Get the recipes or analyses referencing this dataset

Returns:
a list of usages
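
For example (a minimal sketch):

usages = dataset.get_usages()
for usage in usages:
    print(usage)   # each entry describes a recipe or analysis referencing this dataset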