Datasets (reference)

Please see Datasets (introduction) for an introduction to interacting with datasets in the Dataiku Python API.

API reference: The dataiku.Dataset class

For starting code samples, please see Python recipes.

class dataiku.Dataset(name, project_key=None, ignore_flow=False)

This is a handle to obtain readers and writers on a dataiku Dataset. From this Dataset class, you can:

  • Read a dataset as a Pandas dataframe
  • Read a dataset as a chunked Pandas dataframe
  • Read a dataset row-by-row
  • Write a Pandas dataframe to a dataset
  • Write a series of chunked Pandas dataframes to a dataset
  • Write to a dataset row-by-row
  • Edit the schema of a dataset
static list(project_key=None)

Lists the names of datasets. If project_key is None, the current project key is used.

full_name
get_location_info(sensitive_info=False)
get_files_info(partitions=[])
set_write_partition(spec)

Sets which partition of the dataset gets written to when you create a DatasetWriter. Setting the write partition is not allowed in Python recipes, where write is controlled by the Flow.

add_read_partitions(spec)

Add a partition or range of partitions to read.

The spec argument must be given in the DSS partition spec format. You cannot manually set partitions when running inside a Python recipe. They are automatically set using the dependencies.

read_schema(raise_if_empty=True)

Gets the schema of this dataset, as an array of objects like this one: { 'type': 'string', 'name': 'foo', 'maxLength': 1000 }. There is more information for the map, array and object types.
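Because the schema is a list of column dicts, it can be post-processed with ordinary Python. The following is a minimal sketch assuming the list shape described above. The helper name column_types and the sample schema literal are illustrative and not part of the API; inside DSS the list would come from read_schema().

```python
def column_types(schema):
    """Map each column name to its declared storage type."""
    return {column["name"]: column["type"] for column in schema}


# Hand-written sample in the shape returned by read_schema() (illustrative)
schema = [
    {"type": "string", "name": "foo", "maxLength": 1000},
    {"type": "bigint", "name": "bar"},
]

types = column_types(schema)
print(types)
```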

list_partitions(raise_if_empty=True)

List the partitions of this dataset, as an array of partition specifications

set_preparation_steps(steps, requested_output_schema)
get_dataframe(columns=None, sampling='head', sampling_column=None, limit=None, ratio=None, infer_with_pandas=True, parse_dates=True, bool_as_str=False, float_precision=None)

Read the dataset (or its selected partitions, if applicable) as a Pandas dataframe.

Pandas dataframes are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this.

Keyword arguments:

  • columns – When not None, returns only the given list of columns (default None)

  • limit – Limits the number of rows returned (default None)

  • sampling – Sampling method, one of:

    • ‘head’ returns the first rows of the dataset. Incompatible with ratio parameter.
    • ‘random’ returns a random sample of the dataset
    • ‘random-column’ returns a random sample of the dataset. Incompatible with limit parameter.
  • sampling_column – Select the column used for “random-column” sampling (default None)

  • ratio – Limits the sample to n% of the dataset (default None)

  • infer_with_pandas – uses the types detected by pandas rather than the dataset schema as detected in DSS. (default True)

  • parse_dates – Date columns in the DSS dataset schema are parsed (default True)

  • bool_as_str – Leave boolean values as strings (default False)

Inconsistent sampling parameters raise a ValueError.

Note about encoding:

  • Column labels are “unicode” objects
  • When a column is of string type, the content is made of utf-8 encoded “str” objects
static get_dataframe_schema_st(schema, columns=None, parse_dates=True, infer_with_pandas=False, bool_as_str=False)
iter_dataframes_forced_types(names, dtypes, parse_date_columns, chunksize=10000, sampling='head', sampling_column=None, limit=None, ratio=None, float_precision=None)
iter_dataframes(chunksize=10000, infer_with_pandas=True, sampling='head', sampling_column=None, parse_dates=True, limit=None, ratio=None, columns=None, bool_as_str=False, float_precision=None)

Read the dataset to Pandas dataframes by chunks of fixed size.

Returns a generator over pandas dataframes.

Useful if the dataset doesn’t fit in RAM.

write_with_schema(df, dropAndCreate=False)

Writes this dataset (or its target partition, if applicable) from a single Pandas dataframe.

This variant replaces the schema of the output dataset with the schema of the dataframe.

Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.

Parameters:
  • df – input Pandas dataframe.
  • dropAndCreate – drop and recreate the dataset.
write_dataframe(df, infer_schema=False, dropAndCreate=False)

Writes this dataset (or its target partition, if applicable) from a single Pandas dataframe.

This variant only edits the schema if infer_schema is True; otherwise you must take care to only write dataframes that have a compatible schema. Also see “write_with_schema”.

Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.

Parameters:
  • df – input Pandas dataframe.
  • infer_schema – infer the schema from the dataframe.
  • dropAndCreate – if infer_schema and this parameter are both set to True, clear and recreate the dataset structure.
iter_rows(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)

Returns a generator on the rows (as a dict-like object) of the data (or its selected partitions, if applicable)

Keyword arguments:

  • limit – maximum number of rows to be emitted
  • log_every – print out the number of rows read on stdout

Field values are cast according to their types. Strings are parsed into “unicode” values.

raw_formatted_data(sampling=None, columns=None, format='tsv-excel-noheader', format_params=None)

Get a stream of raw bytes from a dataset as a file-like object, formatted in a supported DSS output format.

You MUST close the file handle. Failure to do so will result in resource leaks.

iter_tuples(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)

Returns the rows of the dataset as tuples. The order and type of the values match the dataset’s schema.

Keyword arguments:

  • limit – maximum number of rows to be emitted
  • log_every – print out the number of rows read on stdout
  • timeout – time (in seconds) of inactivity after which we want to close the generator if nothing has been read. Without it notebooks typically tend to leak “DKU” processes.

Field values are cast according to their types. Strings are parsed into “unicode” values.
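Tuples carry no column names, so it can help to pair them with the names from the schema. This is a minimal sketch assuming the tuple values follow the schema column order. The helper tuples_to_dicts and the sample data are illustrative; inside DSS the inputs would come from read_schema() and iter_tuples().

```python
def tuples_to_dicts(column_names, tuples):
    """Pair each tuple's values with the column names, in order."""
    for values in tuples:
        yield dict(zip(column_names, values))


# Stand-in data (illustrative, not read from a real dataset)
schema = [{"name": "id", "type": "int"}, {"name": "city", "type": "string"}]
rows = [(1, "Paris"), (2, "Lyon")]

names = [column["name"] for column in schema]
records = list(tuples_to_dicts(names, rows))
print(records)
```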

get_writer()

Get a stream writer for this dataset (or its target partition, if applicable). The writer must be closed as soon as you no longer need it.

The schema of the dataset MUST be set before using this. If you don’t set the schema of the dataset, your data will generally not be stored by the output writers.
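One way to guarantee the writer is closed is to wrap it in contextlib.closing from the standard library, which calls close() even if a write raises. The sketch below takes any object exposing get_writer(), and assumes the dataset schema has already been set. FakeWriter and FakeDataset are stand-ins so the example runs outside DSS; they are not part of the API.

```python
from contextlib import closing


def write_rows(dataset, rows):
    """Write dict rows through a stream writer that is always closed."""
    # closing() calls writer.close() on exit, including on errors
    with closing(dataset.get_writer()) as writer:
        for row in rows:
            writer.write_row_dict(row)


# --- Stand-ins so the sketch runs outside DSS (not part of the API) ---
class FakeWriter:
    def __init__(self):
        self.rows = []
        self.closed = False

    def write_row_dict(self, row_dict):
        self.rows.append(row_dict)

    def close(self):
        self.closed = True


class FakeDataset:
    def __init__(self):
        self.writer = FakeWriter()

    def get_writer(self):
        return self.writer


demo = FakeDataset()
write_rows(demo, [{"id": 1}, {"id": 2}])
print(demo.writer.rows, demo.writer.closed)
```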
get_continuous_writer()
write_schema(columns, dropAndCreate=False)

Write the dataset schema into the dataset JSON definition file.

Sometimes, the schema of a dataset being written is known only by the code of the Python script itself. In that case, it can be useful for the Python script to actually modify the schema of the dataset. Obviously, this must be used with caution. 'columns' must be an array of dicts like { 'name': 'column name', 'type': 'column type' }
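The columns argument can be assembled from simple (name, type) pairs. This is a minimal sketch; the helper make_columns, the column names and the dataset name in the final comment are hypothetical.

```python
def make_columns(pairs):
    """Turn (name, type) pairs into the list of dicts write_schema expects."""
    return [{"name": name, "type": col_type} for name, col_type in pairs]


columns = make_columns([("customer_id", "bigint"), ("country", "string")])
print(columns)

# Inside DSS, the list would then be passed on, for example:
# dataiku.Dataset("my_output").write_schema(columns)
```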

write_schema_from_dataframe(df, dropAndCreate=False)
read_metadata()

Reads the dataset metadata object

write_metadata(meta)

Writes the dataset metadata object

get_config()
get_last_metric_values(partition='')

Get the set of last values of the metrics on this dataset, as a dataiku.ComputedMetrics object

get_metric_history(metric_lookup, partition='')

Get the set of all values a given metric took on this dataset

Parameters:
  • metric_lookup – metric name or unique identifier
  • partition – optionally, the partition for which the values are to be fetched
save_external_metric_values(values_dict, partition='')

Save metrics on this dataset. The metrics are saved with the type “external”

Parameters:
  • values_dict – the values to save, as a dict. The keys of the dict are used as metric names
  • partition – optionally, the partition for which the values are to be saved
save_external_check_values(values_dict, partition='')

Save checks on this dataset. The checks are saved with the type “external”

Parameters:
  • values_dict – the values to save, as a dict. The keys of the dict are used as check names
class dataiku.core.dataset_write.DatasetWriter(dataset)

Handle to write to a dataset. Use Dataset.get_writer() to obtain a DatasetWriter.

Very important: a DatasetWriter MUST be closed after usage. Failure to close a DatasetWriter will lead to incomplete or no data being written to the output dataset

active_writers = {}
static atexit_handler()
write_tuple(row)

Write a single row from a tuple or list of column values. Columns must be given in the order of the dataset schema.

Note: The schema of the dataset MUST be set before using this.

Encoding note: strings MUST be given as Unicode objects. Giving str objects will fail.

write_row_array(row)
write_row_dict(row_dict)

Write a single row from a dict of column name -> column value.

Some columns can be omitted, empty values will be inserted instead.

Note: The schema of the dataset MUST be set before using this.

Encoding note: strings MUST be given as Unicode objects. Giving str objects will fail.

write_dataframe(df)

Appends a Pandas dataframe to the dataset being written.

This method can be called multiple times (especially when you have been using iter_dataframes to read from an input dataset)

Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
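Combining iter_dataframes with a single writer gives a chunk-by-chunk copy that never holds the whole dataset in memory. The sketch below assumes the target schema has already been set, and takes any objects exposing iter_dataframes() and get_writer(). The stand-in classes use plain lists in place of Pandas dataframes so the example runs outside DSS; they are not part of the API.

```python
from contextlib import closing


def copy_in_chunks(source, target, chunksize=10000):
    """Stream source to target chunk by chunk; return the chunk count."""
    chunks = 0
    # One writer for the whole copy, closed even if a write fails
    with closing(target.get_writer()) as writer:
        for chunk in source.iter_dataframes(chunksize=chunksize):
            writer.write_dataframe(chunk)
            chunks += 1
    return chunks


# --- Stand-ins so the sketch runs outside DSS (not part of the API) ---
class FakeSource:
    def __init__(self, rows):
        self.rows = rows

    def iter_dataframes(self, chunksize=10000):
        for start in range(0, len(self.rows), chunksize):
            yield self.rows[start:start + chunksize]


class FakeWriter:
    def __init__(self):
        self.chunks = []
        self.closed = False

    def write_dataframe(self, df):
        self.chunks.append(df)

    def close(self):
        self.closed = True


class FakeTarget:
    def __init__(self):
        self.writer = FakeWriter()

    def get_writer(self):
        return self.writer


target = FakeTarget()
count = copy_in_chunks(FakeSource(list(range(5))), target, chunksize=2)
print(count, target.writer.chunks)
```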

close()

Closes this dataset writer

API reference: The dataikuapi.dss.dataset package

Main DSSDataset class

class dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)

A dataset on the DSS instance. Do not instantiate this class, use dataikuapi.dss.project.DSSProject.get_dataset()

id
name
delete(drop_data=False)

Delete the dataset

Parameters: drop_data (bool) – Whether the data of the dataset should be dropped
get_settings()

Returns the settings of this dataset as a DSSDatasetSettings, or one of its subclasses.

Known subclasses of DSSDatasetSettings include FSLikeDatasetSettings and SQLDatasetSettings.

You must use save() on the returned object to make your changes effective on the dataset.

# Example: activating discrete partitioning on a SQL dataset
# `project` is a dataikuapi.dss.project.DSSProject handle
dataset = project.get_dataset("my_database_table")
settings = dataset.get_settings()
settings.add_discrete_partitioning_dimension("country")
settings.save()

Return type: DSSDatasetSettings
get_definition()

Deprecated. Use get_settings() instead.

Get the raw settings of the dataset as a dict

Return type: dict

set_definition(definition)

Deprecated. Use get_settings() and DSSDatasetSettings.save() instead.

Set the definition of the dataset

Parameters: definition – the definition, as a dict. You should only set a definition object that has been retrieved using the get_definition call.
exists()

Returns whether this dataset exists

get_schema()

Get the schema of the dataset

Returns:
a JSON object of the schema, with the list of columns
set_schema(schema)

Set the schema of the dataset

Args:
schema: the desired schema for the dataset, as a JSON object. All columns have to provide their name and type
get_metadata()

Get the metadata attached to this dataset. The metadata contains the label, description, checklists, tags and custom metadata of the dataset

Returns:
a dict object. For more information on available metadata, please see https://doc.dataiku.com/dss/api/5.0/rest/
set_metadata(metadata)

Set the metadata on this dataset.

Args:
metadata: the new state of the metadata for the dataset. You should only set a metadata object that has been retrieved using the get_metadata call.
iter_rows(partitions=None)

Get the dataset’s data

Return:
an iterator over the rows, each row being a tuple of values. The order of values in the tuples is the same as the order of columns in the schema returned by get_schema
list_partitions()

Get the list of all partitions of this dataset

Returns:
the list of partitions, as a list of strings
clear(partitions=None)

Clear all data in this dataset

Args:
partitions: (optional) a list of partitions to clear. When not provided, the entire dataset is cleared
copy_to(target, sync_schema=True, write_mode='OVERWRITE')

Copies the data of this dataset to another dataset

Parameters: target – a dataikuapi.dss.dataset.DSSDataset representing the target of this copy
Returns: a DSSFuture representing the operation
build(job_type='NON_RECURSIVE_FORCED_BUILD', partitions=None, wait=True, no_fail=False)

Starts a new job to build this dataset and waits for it to complete. Raises if the job failed.

job = dataset.build()
print("Job %s done" % job.id)
Parameters:
  • job_type – The job type. One of RECURSIVE_BUILD, NON_RECURSIVE_FORCED_BUILD or RECURSIVE_FORCED_BUILD
  • partitions – If the dataset is partitioned, a list of partition ids to build
  • no_fail – if True, does not raise if the job failed.
Returns:

the dataikuapi.dss.job.DSSJob job handle corresponding to the built job

Return type:

dataikuapi.dss.job.DSSJob

synchronize_hive_metastore()

Synchronize this dataset with the Hive metastore

update_from_hive()

Resynchronize this dataset from its Hive definition

compute_metrics(partition='', metric_ids=None, probes=None)

Compute metrics on a partition of this dataset. If neither metric ids nor a custom probes set are specified, the metrics set up on the dataset are used.

run_checks(partition='', checks=None)

Run checks on a partition of this dataset. If the checks are not specified, the checks set up on the dataset are used.

uploaded_add_file(fp, filename)

Adds a file to an “uploaded files” dataset

Parameters:
  • fp (file) – A file-like object that represents the file to upload
  • filename (str) – The filename for the file to upload
uploaded_list_files()

List the files in an “uploaded files” dataset

create_prediction_ml_task(target_variable, ml_backend_type='PY_MEMORY', guess_policy='DEFAULT', prediction_type=None, wait_guess_complete=True)

Creates a new prediction task in a new visual analysis lab for a dataset.

Parameters:
  • target_variable (string) – the variable to predict
  • ml_backend_type (string) – ML backend to use, one of PY_MEMORY, MLLIB or H2O
  • guess_policy (string) – Policy to use for setting the default parameters. Valid values are: DEFAULT, SIMPLE_FORMULA, DECISION_TREE, EXPLANATORY and PERFORMANCE
  • prediction_type (string) – The type of prediction problem this is. If not provided the prediction type will be guessed. Valid values are: BINARY_CLASSIFICATION, REGRESSION, MULTICLASS
  • wait_guess_complete (boolean) – if False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms. You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)
create_clustering_ml_task(input_dataset, ml_backend_type='PY_MEMORY', guess_policy='KMEANS')

Creates a new clustering task in a new visual analysis lab for a dataset.

The returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms.

You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)

Parameters:
  • ml_backend_type (string) – ML backend to use, one of PY_MEMORY, MLLIB or H2O
  • guess_policy (string) – Policy to use for setting the default parameters. Valid values are: KMEANS and ANOMALY_DETECTION
create_analysis()

Creates a new visual analysis lab

list_analyses(as_type='listitems')

List the visual analyses on this dataset

Parameters: as_type (str) – How to return the list. Supported values are “listitems” and “objects”.
Returns: The list of the analyses. If “as_type” is “listitems”, each one as a dict. If “as_type” is “objects”, each one as a dataikuapi.dss.analysis.DSSAnalysis
Return type: list
delete_analyses(drop_data=False)

Deletes all analyses that have this dataset as input dataset. Also deletes ML tasks that are part of the analysis

Parameters: drop_data (bool) – whether to drop data for all ML tasks in the analysis
list_statistics_worksheets(as_objects=True)

List the statistics worksheets associated to this dataset.

Return type: list of dataikuapi.dss.statistics.DSSStatisticsWorksheet
create_statistics_worksheet(name='My worksheet')

Create a new worksheet in the dataset, and return a handle to interact with it.

Parameters:
  • name (string) – name of the worksheet
Returns:
A dataikuapi.dss.statistics.DSSStatisticsWorksheet worksheet handle
get_statistics_worksheet(worksheet_id)

Get a handle to interact with a statistics worksheet

Parameters: worksheet_id (string) – the ID of the desired worksheet
Returns: A dataikuapi.dss.statistics.DSSStatisticsWorksheet worksheet handle
get_last_metric_values(partition='')

Get the last values of the metrics on this dataset

Returns:
a list of metric objects and their value
get_metric_history(metric, partition='')

Get the history of the values of the metric on this dataset

Returns:
an object containing the values of the metric, cast to the appropriate type (double, boolean,…)
get_zone()

Gets the flow zone of this dataset

Return type: dataikuapi.dss.flow.DSSFlowZone
move_to_zone(zone)

Moves this object to a flow zone

Parameters: zone (object) – a dataikuapi.dss.flow.DSSFlowZone where to move the object
get_usages()

Get the recipes or analyses referencing this dataset

Returns:
a list of usages
get_object_discussions()

Get a handle to manage discussions on the dataset

Returns: the handle to manage discussions
Return type: dataikuapi.discussion.DSSObjectDiscussions
test_and_detect(infer_storage_types=False)

Used internally by autodetect_settings. It is not usually required to call this method

autodetect_settings(infer_storage_types=False)

Detects appropriate settings for this dataset using the Dataiku detection engine

Returns new suggested settings that you can save with DSSDatasetSettings.save()

Return type: DSSDatasetSettings or a subclass
get_as_core_dataset()

Returns the dataiku.Dataset object corresponding to this dataset

new_code_recipe(type, code=None, recipe_name=None)

Starts creation of a new code recipe taking this dataset as input

Parameters:
  • type (str) – Type of the recipe (‘python’, ‘r’, ‘pyspark’, ‘sparkr’, ‘sql’, ‘sparksql’, ‘hive’, …)
  • code (str) – The code of the recipe

new_recipe(type, recipe_name=None)

Starts creation of a new recipe taking this dataset as input. For more details, please see dataikuapi.dss.project.DSSProject.new_recipe()

Parameters: type (str) – Type of the recipe

Listing datasets

class dataikuapi.dss.dataset.DSSDatasetListItem(client, data)

An item in a list of datasets. Do not instantiate this class, use dataikuapi.dss.project.DSSProject.list_datasets()

to_dataset()

Gets the DSSDataset corresponding to this dataset

name
id
type
schema
connection

Returns the connection on which this dataset is attached, or None if there is no connection for this dataset

get_column(column)

Returns the schema column given a name.

Parameters: column (str) – Column to find
Returns: a dict of the column settings or None if column does not exist

Settings of datasets

class dataikuapi.dss.dataset.DSSDatasetSettings(dataset, settings)

Base settings class for a DSS dataset. Do not instantiate this class directly, use DSSDataset.get_settings()

Use save() to save your changes

get_raw()

Get the raw dataset settings as a dict

get_raw_params()

Get the type-specific params, as a raw dict

type
schema_columns
remove_partitioning()
add_discrete_partitioning_dimension(dim_name)
add_time_partitioning_dimension(dim_name, period='DAY')
add_raw_schema_column(column)
save()
class dataikuapi.dss.dataset.SQLDatasetSettings(dataset, settings)

Settings for a SQL dataset. This class inherits from DSSDatasetSettings. Do not instantiate this class directly, use DSSDataset.get_settings()

Use save() to save your changes

set_table(connection, schema, table)

Sets this SQL dataset in ‘table’ mode, targeting a particular table of a connection

class dataikuapi.dss.dataset.FSLikeDatasetSettings(dataset, settings)

Settings for a files-based dataset. This class inherits from DSSDatasetSettings. Do not instantiate this class directly, use DSSDataset.get_settings()

Use save() to save your changes

set_connection_and_path(connection, path)
get_raw_format_params()

Get the raw format parameters as a dict

set_format(format_type, format_params=None)
set_csv_format(separator=', ', style='excel', skip_rows_before=0, header_row=True, skip_rows_after=0)
set_partitioning_file_pattern(pattern)

Creation of managed datasets

class dataikuapi.dss.dataset.DSSManagedDatasetCreationHelper(project, dataset_name)
get_creation_settings()
with_store_into(connection, type_option_id=None, format_option_id=None)

Sets the connection into which to store the new managed dataset

Parameters:
  • connection (str) – Name of the connection to store into
  • type_option_id (str) – If the connection accepts several types of datasets, the type
  • format_option_id (str) – Optional identifier of a file format option
Returns: self

with_copy_partitioning_from(dataset_ref, object_type='DATASET')

Sets the new managed dataset to use the same partitioning as an existing dataset

Parameters: dataset_ref (str) – Name of the dataset to copy partitioning from
Returns: self
create(overwrite=False)

Executes the creation of the managed dataset according to the selected options

Parameters: overwrite – If the dataset being created already exists, delete it first (removing data)
Returns: The DSSDataset corresponding to the newly created dataset

already_exists()

Returns whether this managed dataset already exists
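The helper methods above can be combined into a create-if-missing flow. This is a minimal sketch: the function ensure_managed_dataset and the connection name are hypothetical, and how the helper is obtained is outside the scope of this section. FakeHelper is a stand-in so the example runs outside DSS; it is not part of the API.

```python
def ensure_managed_dataset(helper, connection):
    """Create the managed dataset unless it already exists."""
    if helper.already_exists():
        return None
    # with_store_into() returns the helper itself, so calls can be chained
    return helper.with_store_into(connection).create()


# --- Stand-in so the sketch runs outside DSS (not part of the API) ---
class FakeHelper:
    def __init__(self, exists):
        self.exists = exists
        self.connection = None

    def already_exists(self):
        return self.exists

    def with_store_into(self, connection, type_option_id=None,
                        format_option_id=None):
        self.connection = connection
        return self

    def create(self, overwrite=False):
        return "created in %s" % self.connection


created = ensure_managed_dataset(FakeHelper(exists=False), "my_connection")
skipped = ensure_managed_dataset(FakeHelper(exists=True), "my_connection")
print(created, skipped)
```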