Managing datasets

Datasets belong to a given project, so all access to datasets requires getting a handle on the project first.
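For example, a minimal sketch of getting a handle on a project and then on one of its datasets (the host URL and API key below are placeholders for your own instance):

import dataikuapi

client = dataikuapi.DSSClient('http://localhost:11200', 'YOUR_API_KEY')  # placeholder host and key
project = client.get_project('TEST_PROJECT')
dataset = project.get_dataset('TEST_DATASET')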

Basic operations

The list of all datasets of a project is accessible via list_datasets()

import pprint
prettyprinter = pprint.PrettyPrinter(indent=4)

project = client.get_project('TEST_PROJECT')
datasets = project.list_datasets()
prettyprinter.pprint(datasets)

outputs

[   {   'checklists': {   'checklists': []},
        'customMeta': {   'kv': {   }},
        'flowOptions': {   'crossProjectBuildBehavior': 'DEFAULT',
                           'rebuildBehavior': 'NORMAL'},
        'formatParams': {   /* Parameters specific to each format type */ },
        'formatType': 'csv',
        'managed': False,
        'name': 'train_set',
        'params': {   /* Parameters specific to each dataset type */ },
        'partitioning': {   'dimensions': [], 'ignoreNonMatchingFile': False},
        'projectKey': 'TEST_PROJECT',
        'schema': {   'columns': [   {   'name': 'col0',
                                         'type': 'string'},
                                     {   'name': 'col1',
                                         'type': 'string'},
                                     /* Other columns ... */
                                 ],
                      'userModified': False},
        'tags': ['creator_admin'],
        'type': 'UploadedFiles'},
    ...
]

Datasets can be deleted.

dataset = project.get_dataset('TEST_DATASET')
dataset.delete()

The metadata of the dataset can be modified. It is advised to first retrieve the current state with the get_metadata() call, modify the returned object, and then set it back on the DSS instance.

dataset_metadata = dataset.get_metadata()
dataset_metadata['tags'] = ['tag1','tag2']
dataset.set_metadata(dataset_metadata)

Accessing the dataset data

The data of a dataset can be streamed over HTTP to the API client with the iter_rows() method. This call returns the raw data, so in most cases it is necessary to first get the dataset’s schema with a call to get_schema(). For example, printing the first 10 rows can be done with:

columns = [column['name'] for column in dataset.get_schema()['columns']]
print(columns)
row_count = 0
for row in dataset.iter_rows():
    print(row)
    row_count += 1
    if row_count >= 10:
        break

outputs

['tube_assembly_id', 'supplier', 'quote_date', 'annual_usage', 'min_order_quantity', 'bracket_pricing', 'quantity', 'cost']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '1', '21.9059330191461']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '2', '12.3412139792904']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '5', '6.60182614356538']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '10', '4.6877695119712']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '25', '3.54156118026073']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '50', '3.22440644770007']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '100', '3.08252143576504']
['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '250', '2.99905966403855']
['TA-00004', 'S-0066', '2013-07-07', '0', '0', 'Yes', '1', '21.9727024365273']
['TA-00004', 'S-0066', '2013-07-07', '0', '0', 'Yes', '2', '12.4079833966715']
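Since iter_rows() yields raw values only, a convenient pattern is to combine it with the schema to build a pandas DataFrame; a minimal sketch, assuming pandas is installed and keeping in mind that all values arrive as strings:

import pandas as pd
from itertools import islice

columns = [column['name'] for column in dataset.get_schema()['columns']]
# Take only the first 1000 rows to avoid streaming the entire dataset
rows = list(islice(dataset.iter_rows(), 1000))
df = pd.DataFrame(rows, columns=columns)
print(df.head())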

Schema

The schema of a dataset can be modified with the set_schema() method:

schema = dataset.get_schema()
schema['columns'].append({'name' : 'new_column', 'type' : 'bigint'})
dataset.set_schema(schema)
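Existing columns can be updated in the same way, for example to change the storage type of a column; a sketch, assuming a column named 'col0' as in the listing above:

schema = dataset.get_schema()
for column in schema['columns']:
    if column['name'] == 'col0':
        column['type'] = 'double'
dataset.set_schema(schema)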

Partitions

For partitioned datasets, the list of partitions is retrieved with list_partitions():

partitions = dataset.list_partitions()

And the data of a given partition can be retrieved by passing the appropriate partition spec as a parameter to iter_rows():

row_count = 0
for row in dataset.iter_rows(partitions='partition_spec1,partition_spec2'):
    print(row)
    row_count += 1
    if row_count >= 10:
        break
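A common pattern is to combine list_partitions() with iter_rows() to read only the most recent partition; a minimal sketch, assuming the partition identifiers sort chronologically:

partitions = dataset.list_partitions()
if partitions:
    latest_partition = sorted(partitions)[-1]
    for row in dataset.iter_rows(partitions=latest_partition):
        print(row)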

Dataset operations

The rows of the dataset can be cleared, entirely or on a per-partition basis, with the clear() method.

dataset = project.get_dataset('SOME_DATASET')
dataset.clear(['partition_spec_1', 'partition_spec_2'])  # clears the specified partitions
dataset.clear()                                          # clears all partitions

For datasets associated with a table in the Hive metastore, the table definition in the metastore needs to be synchronized with the dataset’s schema in DSS before the data is visible to Hive and usable in Impala queries.

dataset = project.get_dataset('SOME_HDFS_DATASET')
dataset.synchronize_hive_metastore()

Creating datasets

Datasets can be created. For example, to load the CSV files of a folder:

from os import listdir

import pandas

project = client.get_project('TEST_PROJECT')
folder_path = 'path/to/folder/'
for file in listdir(folder_path):
    if not file.endswith('.csv'):
        continue
    # strip the .csv extension: dots are not allowed in dataset names
    dataset = project.create_dataset(file[:-4],
        'Filesystem',
        params={
            'connection': 'filesystem_root',
            'path': folder_path + file
        },
        formatType='csv',
        formatParams={
            'separator': ',',
            'style': 'excel',  # excel-style quoting
            'parseHeaderRow': True
        })
    df = pandas.read_csv(folder_path + file)
    dataset.set_schema({'columns': [{'name': column, 'type': 'string'} for column in df.columns]})
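Managed datasets can be created as well, through the DSSManagedDatasetCreationHelper documented below; a minimal sketch, where 'filesystem_managed' is an assumed connection name to adapt to your instance:

from dataikuapi.dss.dataset import DSSManagedDatasetCreationHelper

# Build a managed dataset stored in a managed connection
builder = DSSManagedDatasetCreationHelper(project, 'my_managed_dataset')
builder.with_store_into('filesystem_managed')  # assumed connection name
dataset = builder.create()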

Reference documentation

class dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)

A dataset on the DSS instance

delete(drop_data=False)

Delete the dataset

Parameters:drop_data (bool) – Should the data of the dataset be dropped
get_definition()

Get the definition of the dataset

Returns:
the definition, as a JSON object
set_definition(definition)

Set the definition of the dataset

Args:
definition: the definition, as a JSON object. You should only set a definition object that has been retrieved using the get_definition call.
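
For example, a minimal sketch following the same read-modify-write pattern as for the metadata (the added tag is purely illustrative; 'tags' is one of the fields visible in the list_datasets() output above):

definition = dataset.get_definition()
definition['tags'].append('reviewed')  # illustrative tag
dataset.set_definition(definition)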
get_schema()

Get the schema of the dataset

Returns:
a JSON object of the schema, with the list of columns
set_schema(schema)

Set the schema of the dataset

Args:
schema: the desired schema for the dataset, as a JSON object. All columns have to provide their name and type
get_metadata()

Get the metadata attached to this dataset. The metadata contains the label, description, checklists, tags and custom metadata of the dataset

Returns:
a dict object. For more information on available metadata, please see https://doc.dataiku.com/dss/api/5.0/rest/
set_metadata(metadata)

Set the metadata on this dataset.

Args:
metadata: the new state of the metadata for the dataset. You should only set a metadata object that has been retrieved using the get_metadata call.
iter_rows(partitions=None)

Get the dataset’s data

Returns:
an iterator over the rows, each row being a tuple of values. The order of values in the tuples is the same as the order of columns in the schema returned by get_schema
list_partitions()

Get the list of all partitions of this dataset

Returns:
the list of partitions, as a list of strings
clear(partitions=None)

Clear all data in this dataset

Args:
partitions: (optional) a list of partitions to clear. When not provided, the entire dataset is cleared
synchronize_hive_metastore()

Synchronize this dataset with the Hive metastore

update_from_hive()

Resynchronize this dataset from its Hive definition

compute_metrics(partition='', metric_ids=None, probes=None)

Compute metrics on a partition of this dataset. If neither metric ids nor custom probes set are specified, the metrics setup on the dataset are used.

run_checks(partition='', checks=None)

Run checks on a partition of this dataset. If the checks are not specified, the checks setup on the dataset are used.
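
For example, a minimal sketch that computes the metrics configured on the dataset and then runs its configured checks (no partition argument, so the whole dataset is used):

metrics_result = dataset.compute_metrics()
checks_result = dataset.run_checks()
print(metrics_result)
print(checks_result)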

list_statistics_worksheets(as_objects=True)

List the statistics worksheets associated to this dataset.

Return type:list of dataikuapi.dss.statistics.DSSStatisticsWorksheet
create_statistics_worksheet(name='My worksheet')

Create a new worksheet in the dataset, and return a handle to interact with it.

Parameters:name (string) – name of the worksheet
Returns:
A dataikuapi.dss.statistics.DSSStatisticsWorksheet worksheet handle
get_statistics_worksheet(worksheet_id)

Get a handle to interact with a statistics worksheet

Parameters:worksheet_id (string) – the ID of the desired worksheet
Returns:A dataikuapi.dss.statistics.DSSStatisticsWorksheet worksheet handle
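
For example, a minimal sketch that creates a worksheet and then lists the worksheets of the dataset (the worksheet name is arbitrary):

worksheet = dataset.create_statistics_worksheet(name='Exploration')
for ws in dataset.list_statistics_worksheets():
    print(ws)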
get_last_metric_values(partition='')

Get the last values of the metrics on this dataset

Returns:
a list of metric objects and their value
get_metric_history(metric, partition='')

Get the history of the values of the metric on this dataset

Returns:
an object containing the values of the metric, cast to the appropriate type (double, boolean,…)
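
For example, a minimal sketch reading back metric values ('records:COUNT_RECORDS' is an assumed metric id; use an id actually configured on your dataset):

last_values = dataset.get_last_metric_values()
print(last_values)
history = dataset.get_metric_history('records:COUNT_RECORDS')  # assumed metric id
print(history)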
get_usages()

Get the recipes or analyses referencing this dataset

Returns:
a list of usages
get_object_discussions()

Get a handle to manage discussions on the dataset

Returns:the handle to manage discussions
Return type:dataikuapi.discussion.DSSObjectDiscussions
class dataikuapi.dss.dataset.DSSManagedDatasetCreationHelper(project, dataset_name)

get_creation_settings()

with_store_into(connection, type_option_id=None, format_option_id=None)

Sets the connection into which to store the new managed dataset

Parameters:
  • connection (str) – Name of the connection to store into
  • type_option_id (str) – If the connection accepts several types of datasets, the type
  • format_option_id (str) – Optional identifier of a file format option
Returns:self

with_copy_partitioning_from(dataset_ref, object_type='DATASET')

Sets the new managed dataset to use the same partitioning as an existing dataset

Parameters:dataset_ref (str) – Name of the dataset to copy partitioning from
Returns:self
create()

Executes the creation of the managed dataset according to the selected options

Returns:The DSSDataset corresponding to the newly created dataset