Datasets (reference)

Please see Datasets (introduction) for an introduction to interacting with datasets in the Dataiku Python API

API reference: The dataiku.Dataset class

Tip

For starting code samples, please see Python recipes.

class dataiku.Dataset(name, project_key=None, ignore_flow=False)

This is a handle to obtain readers and writers on a Dataiku dataset. From this Dataset class, you can:

  • Read a dataset as a Pandas dataframe

  • Read a dataset as a chunked Pandas dataframe

  • Read a dataset row-by-row

  • Write a pandas dataframe to a dataset

  • Write a series of chunked Pandas dataframes to a dataset

  • Write to a dataset row-by-row

  • Edit the schema of a dataset
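
A minimal usage sketch, reading an existing dataset of the current project into memory (the dataset name is a placeholder):

import dataiku

# Obtain a handle on an existing dataset of the current project
dataset = dataiku.Dataset("mydataset")

# Read it entirely into memory as a Pandas dataframe
df = dataset.get_dataframe()
print(df.shape)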

static list(project_key=None)

Lists the names of datasets. If project_key is None, the current project key is used.

property full_name
get_location_info(sensitive_info=False)
get_files_info(partitions=[])
set_write_partition(spec)

Sets which partition of the dataset gets written to when you create a DatasetWriter. Setting the write partition is not allowed in Python recipes, where write is controlled by the Flow.

add_read_partitions(spec)

Add a partition or range of partitions to read.

The spec argument must be given in the DSS partition spec format. You cannot manually set partitions when running inside a Python recipe. They are automatically set using the dependencies.
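
For example, outside of a recipe (in a notebook, for instance), reading a single partition might look like the following sketch; the partition spec value is a placeholder and depends on how the dataset is partitioned:

import dataiku

dataset = dataiku.Dataset("mydataset")
# Restrict reading to one partition (DSS partition spec format)
dataset.add_read_partitions("2023-01-20")
df = dataset.get_dataframe()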

read_schema(raise_if_empty=True)

Gets the schema of this dataset, as an array of objects like this one: { ‘type’: ‘string’, ‘name’: ‘foo’, ‘maxLength’: 1000 }. The map, array and object types carry additional type information.
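
For instance, a minimal sketch printing the name and type of each column:

import dataiku

dataset = dataiku.Dataset("mydataset")
for column in dataset.read_schema():
    print(column["name"], column["type"])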

list_partitions(raise_if_empty=True)

List the partitions of this dataset, as an array of partition specifications

set_preparation_steps(steps, requested_output_schema, context_project_key=None)
get_dataframe(columns=None, sampling='head', sampling_column=None, limit=None, ratio=None, infer_with_pandas=True, parse_dates=True, bool_as_str=False, float_precision=None, na_values=None, keep_default_na=True)

Read the dataset (or its selected partitions, if applicable) as a Pandas dataframe.

Pandas dataframes are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this.

Keyword arguments:

  • columns – When not None, returns only the given list of columns (default None)

  • limit – Limits the number of rows returned (default None)

  • sampling – Sampling method. One of:

    • ‘head’ returns the first rows of the dataset. Incompatible with the ratio parameter.

    • ‘random’ returns a random sample of the dataset

    • ‘random-column’ returns a column-wise random sample of the dataset, based on the values of sampling_column. Incompatible with the limit parameter.

  • sampling_column – Column to use for ‘random-column’ sampling (default None)

  • ratio – Approximate fraction of the dataset to return, as a value between 0 and 1 (default None)

  • infer_with_pandas – Use the types detected by pandas rather than the dataset schema as detected in DSS (default True)

  • parse_dates – Date columns in the DSS dataset schema are parsed (default True)

  • bool_as_str – Leave boolean values as strings (default False)

Inconsistent sampling parameters raise a ValueError.

Note about encoding:

  • Column labels are “unicode” objects

  • When a column is of string type, the content is made of utf-8 encoded “str” objects
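
A sketch reading a subset of columns on a 10% random sample (column names are placeholders):

import dataiku

dataset = dataiku.Dataset("mydataset")
df = dataset.get_dataframe(
    columns=["customer_id", "amount"],
    sampling="random",
    ratio=0.1,
)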

static get_dataframe_schema_st(schema, columns=None, parse_dates=True, infer_with_pandas=False, bool_as_str=False, int_as_float=False)
iter_dataframes_forced_types(names, dtypes, parse_date_columns, chunksize=10000, sampling='head', sampling_column=None, limit=None, ratio=None, float_precision=None, na_values=None, keep_default_na=True)
iter_dataframes(chunksize=10000, infer_with_pandas=True, sampling='head', sampling_column=None, parse_dates=True, limit=None, ratio=None, columns=None, bool_as_str=False, float_precision=None, na_values=None, keep_default_na=True)

Read the dataset as Pandas dataframes, in chunks of fixed size.

Returns a generator over pandas dataframes.

Useful if the dataset doesn’t fit in RAM.
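
For example, a minimal sketch computing a row count without loading the whole dataset in memory:

import dataiku

dataset = dataiku.Dataset("mydataset")
total_rows = 0
# Process the dataset chunk by chunk, 100,000 rows at a time
for chunk_df in dataset.iter_dataframes(chunksize=100000):
    total_rows += len(chunk_df)
print("Row count: %d" % total_rows)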

write_with_schema(df, dropAndCreate=False)

Writes this dataset (or its target partition, if applicable) from a single Pandas dataframe.

This variant replaces the schema of the output dataset with the schema of the dataframe.

Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.

Parameters
  • df – input Pandas dataframe.

  • dropAndCreate – drop and recreate the dataset.
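
A minimal write sketch (the dataframe contents and output dataset name are placeholders):

import dataiku
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [0.5, 0.8]})
output = dataiku.Dataset("my_output_dataset")
# Replace the output schema with the dataframe's schema, then write the data
output.write_with_schema(df)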

write_dataframe(df, infer_schema=False, dropAndCreate=False)

Writes this dataset (or its target partition, if applicable) from a single Pandas dataframe.

This variant only edits the schema if infer_schema is True; otherwise, you must take care to only write dataframes that have a compatible schema. Also see write_with_schema().

Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.

Parameters
  • df – input Pandas dataframe.

  • infer_schema – infer the schema from the dataframe.

  • dropAndCreate – if infer_schema and this parameter are both set to True, clear and recreate the dataset structure.

iter_rows(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)

Returns a generator over the rows (as dict-like objects) of the dataset (or its selected partitions, if applicable)

Keyword arguments:

  • limit – maximum number of rows to be emitted

  • log_every – print out the number of rows read on stdout

Field values are cast according to their types. Strings are parsed into “unicode” values.
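
A minimal iteration sketch ("amount" is a placeholder column name):

import dataiku

dataset = dataiku.Dataset("mydataset")
# Each row behaves like a dict keyed by column name
for row in dataset.iter_rows(limit=1000):
    print(row.get("amount"))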

raw_formatted_data(sampling=None, columns=None, format='tsv-excel-noheader', format_params=None, read_session_id=None)

Get a stream of raw bytes from a dataset as a file-like object, formatted in a supported DSS output format.

You MUST close the file handle. Failure to do so will result in resource leaks.

After closing, you can also call verify_read() to check for any errors that occurred while reading the dataset data.

verify_read(read_session_id)

Verifies that no error occurred when using raw_formatted_data() to read a dataset. Use the same read_session_id that you passed to the call to raw_formatted_data().
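
A sketch dumping a dataset to a local TSV file, assuming the returned file-like object supports read() and using a caller-generated read session id (the file path is a placeholder):

import uuid
import dataiku

dataset = dataiku.Dataset("mydataset")
session_id = str(uuid.uuid4())
stream = dataset.raw_formatted_data(format="tsv-excel-noheader",
                                    read_session_id=session_id)
try:
    with open("/tmp/mydataset.tsv", "wb") as f:
        f.write(stream.read())
finally:
    # The stream MUST be closed to avoid resource leaks
    stream.close()

# Check that no error occurred while reading
dataset.verify_read(session_id)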

iter_tuples(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)

Returns the rows of the dataset as tuples. The order and types of the values match the dataset’s schema.

Keyword arguments:

  • limit – maximum number of rows to be emitted

  • log_every – print out the number of rows read on stdout

  • timeout – time (in seconds) of inactivity after which the generator is closed if nothing has been read. Without it, notebooks typically tend to leak “DKU” processes.

Field values are cast according to their types. Strings are parsed into “unicode” values.

get_writer()

Get a stream writer for this dataset (or its target partition, if applicable). The writer must be closed as soon as you don’t need it.

The schema of the dataset MUST be set before using this. If you don’t set the schema of the dataset, your data will generally not be stored by the output writers
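
A minimal writer sketch (column names and values are placeholders):

import dataiku

output = dataiku.Dataset("my_output_dataset")
# The schema must be set before obtaining the writer
output.write_schema([
    {"name": "name", "type": "string"},
    {"name": "score", "type": "double"},
])

writer = output.get_writer()
try:
    writer.write_row_dict({"name": "alice", "score": 0.5})
    writer.write_tuple(("bob", 0.8))
finally:
    # The writer must always be closed
    writer.close()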

get_continuous_writer(source_id, split_id=0)
write_schema(columns, dropAndCreate=False)

Write the dataset schema into the dataset JSON definition file.

Sometimes, the schema of a dataset being written is known only by the code of the Python script itself. In that case, it can be useful for the Python script to actually modify the schema of the dataset. Obviously, this must be used with caution. ‘columns’ must be an array of dicts like { ‘name’ : ‘column name’, ‘type’ : ‘column type’}

write_schema_from_dataframe(df, dropAndCreate=False)
read_metadata()

Reads the dataset metadata object

write_metadata(meta)

Writes the dataset metadata object

get_config()
get_last_metric_values(partition='')

Get the set of last values of the metrics on this dataset, as a dataiku.ComputedMetrics object

get_metric_history(metric_lookup, partition='')

Get the set of all values a given metric took on this dataset

Parameters
  • metric_lookup – metric name or unique identifier

  • partition – optionally, the partition for which the values are to be fetched

save_external_metric_values(values_dict, partition='')

Save metrics on this dataset. The metrics are saved with the type “external”

Parameters
  • values_dict – the values to save, as a dict. The keys of the dict are used as metric names

  • partition – optionally, the partition for which the values are to be saved
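
A sketch saving two external metric values (metric names and values are placeholders):

import dataiku

dataset = dataiku.Dataset("mydataset")
dataset.save_external_metric_values({
    "rows_rejected": 42,
    "source_freshness_hours": 3.5,
})

# Latest values of all metrics on this dataset, as a dataiku.ComputedMetrics object
metrics = dataset.get_last_metric_values()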

save_external_check_values(values_dict, partition='')

Save checks on this dataset. The checks are saved with the type “external”

Parameters

values_dict – the values to save, as a dict. The keys of the dict are used as check names

dataset.create_sampling_argument(sampling_column=None, limit=None, ratio=None)
class dataiku.core.dataset.Schema(data)
class dataiku.core.dataset.DatasetCursor(val, col_names, col_idx)

A dataset cursor that helps iterating on rows.

column_id(name)
keys()
items()
values()
get(col_name, default_value=None)
class dataiku.core.dataset_write.DatasetWriter(dataset)

Handle to write to a dataset. Use Dataset.get_writer() to obtain a DatasetWriter.

Very important: a DatasetWriter MUST be closed after usage. Failure to close a DatasetWriter will lead to incomplete or no data being written to the output dataset

write_tuple(row)

Write a single row from a tuple or list of column values. Columns must be given in the order of the dataset schema.

Note: The schema of the dataset MUST be set before using this.

Encoding note: strings MUST be given as Unicode objects. Giving str objects will fail.

write_row_array(row)
write_row_dict(row_dict)

Write a single row from a dict of column name -> column value.

Some columns can be omitted; empty values will be inserted instead.

Note: The schema of the dataset MUST be set before using this.

Encoding note: strings MUST be given as Unicode objects. Giving str objects will fail.

write_dataframe(df)

Appends a Pandas dataframe to the dataset being written.

This method can be called multiple times (especially when you have been using iter_dataframes to read from an input dataset)

Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
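
For example, a chunked copy from one dataset to another (a sketch, assuming both datasets already exist in the Flow):

import dataiku

input_dataset = dataiku.Dataset("my_input")
output_dataset = dataiku.Dataset("my_output")

# Reuse the input schema, then stream the data chunk by chunk
output_dataset.write_schema(input_dataset.read_schema())
writer = output_dataset.get_writer()
try:
    for chunk_df in input_dataset.iter_dataframes(chunksize=100000):
        writer.write_dataframe(chunk_df)
finally:
    writer.close()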

close()

Closes this dataset writer

API reference: The dataikuapi.dss.dataset package

Main DSSDataset class

class dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)

A dataset on the DSS instance. Do not instantiate this class, use dataikuapi.dss.project.DSSProject.get_dataset()

property id
property name
delete(drop_data=False)

Delete the dataset

Parameters

drop_data (bool) – whether the data of the dataset should be dropped

get_settings()

Returns the settings of this dataset as a DSSDatasetSettings, or one of its subclasses.

Known subclasses of DSSDatasetSettings include FSLikeDatasetSettings and SQLDatasetSettings

You must use save() on the returned object to make your changes effective on the dataset.

# Example: activating discrete partitioning on a SQL dataset
dataset = project.get_dataset("my_database_table")
settings = dataset.get_settings()
settings.add_discrete_partitioning_dimension("country")
settings.save()
Return type

DSSDatasetSettings

get_definition()

Deprecated. Use get_settings(). Gets the raw settings of the dataset as a dict

Return type

dict

set_definition(definition)

Deprecated. Use get_settings() and DSSDatasetSettings.save(). Sets the definition of the dataset

Parameters

definition – the definition, as a dict. You should only set a definition object that has been retrieved using the get_definition call.

exists()

Returns whether this dataset exists

get_schema()

Get the schema of the dataset

Returns:

a JSON object of the schema, with the list of columns

set_schema(schema)

Set the schema of the dataset

Args:

schema: the desired schema for the dataset, as a JSON object. All columns have to provide their name and type

get_metadata()

Get the metadata attached to this dataset. The metadata contains the label, description, checklists, tags and custom metadata of the dataset

Returns:

a dict object. For more information on available metadata, please see https://doc.dataiku.com/dss/api/5.0/rest/

set_metadata(metadata)

Set the metadata on this dataset.

Args:

metadata: the new state of the metadata for the dataset. You should only set a metadata object that has been retrieved using the get_metadata call.

iter_rows(partitions=None)

Get the dataset’s data

Return:

an iterator over the rows, each row being a tuple of values. The order of values in the tuples is the same as the order of columns in the schema returned by get_schema

list_partitions()

Get the list of all partitions of this dataset

Returns:

the list of partitions, as a list of strings

clear(partitions=None)

Clear all data in this dataset

Args:

partitions: (optional) a list of partitions to clear. When not provided, the entire dataset is cleared

copy_to(target, sync_schema=True, write_mode='OVERWRITE')

Copies the data of this dataset to another dataset

Parameters

target (Dataset) – a dataikuapi.dss.dataset.DSSDataset representing the target of this copy

Returns

a DSSFuture representing the operation

build(job_type='NON_RECURSIVE_FORCED_BUILD', partitions=None, wait=True, no_fail=False)

Starts a new job to build this dataset and waits for it to complete. Raises if the job failed.

job = dataset.build()
print("Job %s done" % job.id)
Parameters
  • job_type – The job type. One of RECURSIVE_BUILD, NON_RECURSIVE_FORCED_BUILD or RECURSIVE_FORCED_BUILD

  • partitions – If the dataset is partitioned, a list of partition ids to build

  • no_fail – if True, does not raise if the job failed.

Returns

the dataikuapi.dss.job.DSSJob job handle corresponding to the built job

Return type

dataikuapi.dss.job.DSSJob

synchronize_hive_metastore()

Synchronize this dataset with the Hive metastore

update_from_hive()

Resynchronize this dataset from its Hive definition

compute_metrics(partition='', metric_ids=None, probes=None)

Compute metrics on a partition of this dataset. If neither metric ids nor custom probes set are specified, the metrics setup on the dataset are used.

run_checks(partition='', checks=None)

Run checks on a partition of this dataset. If the checks are not specified, the checks setup on the dataset are used.

uploaded_add_file(fp, filename)

Adds a file to an “uploaded files” dataset

Parameters
  • fp (file) – A file-like object that represents the file to upload

  • filename (str) – The filename for the file to upload
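
A sketch adding a local file through the public API (the instance URL, API key, project key, dataset name and file path are placeholders):

import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "my_api_key")
project = client.get_project("MYPROJECT")
dataset = project.get_dataset("my_uploaded_dataset")

with open("/path/to/data.csv", "rb") as fp:
    dataset.uploaded_add_file(fp, "data.csv")

print(dataset.uploaded_list_files())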

uploaded_list_files()

List the files in an “uploaded files” dataset

create_prediction_ml_task(target_variable, ml_backend_type='PY_MEMORY', guess_policy='DEFAULT', prediction_type=None, wait_guess_complete=True)

Creates a new prediction task in a new visual analysis lab for a dataset.

Parameters
  • target_variable (string) – the variable to predict

  • ml_backend_type (string) – ML backend to use, one of PY_MEMORY, MLLIB or H2O

  • guess_policy (string) – Policy to use for setting the default parameters. Valid values are: DEFAULT, SIMPLE_FORMULA, DECISION_TREE, EXPLANATORY and PERFORMANCE

  • prediction_type (string) – The type of prediction problem this is. If not provided the prediction type will be guessed. Valid values are: BINARY_CLASSIFICATION, REGRESSION, MULTICLASS

  • wait_guess_complete (boolean) – if False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms. You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)
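
A sketch creating and training a prediction model, assuming project is a dataikuapi.dss.project.DSSProject handle and that the returned ML task exposes wait_guess_complete() and train() as described above ("churn" and the dataset name are placeholders):

dataset = project.get_dataset("my_training_dataset")
mltask = dataset.create_prediction_ml_task(
    target_variable="churn",
    ml_backend_type="PY_MEMORY",
    guess_policy="DEFAULT",
    wait_guess_complete=False,
)
# Wait for DSS to finish analyzing the input dataset before training
mltask.wait_guess_complete()
mltask.train()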

create_clustering_ml_task(input_dataset, ml_backend_type='PY_MEMORY', guess_policy='KMEANS', wait_guess_complete=True)

Creates a new clustering task in a new visual analysis lab for a dataset.

The returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms.

You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)

Parameters
  • ml_backend_type (string) – ML backend to use, one of PY_MEMORY, MLLIB or H2O

  • guess_policy (string) – Policy to use for setting the default parameters. Valid values are: KMEANS and ANOMALY_DETECTION

  • wait_guess_complete (boolean) – if False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms. You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)

create_timeseries_forecasting_ml_task(target_variable, time_variable, timeseries_identifiers=None, guess_policy='TIMESERIES_DEFAULT', wait_guess_complete=True)

Creates a new time series forecasting task in a new visual analysis lab for a dataset.

Parameters
  • target_variable (string) – The variable to forecast

  • time_variable (string) – Column to be used as time variable. Should be a Date (parsed) column.

  • timeseries_identifiers (list) – List of columns to be used as time series identifiers (when the dataset has multiple series)

  • guess_policy (string) – Policy to use for setting the default parameters. Valid values are: TIMESERIES_DEFAULT, TIMESERIES_STATISTICAL, and TIMESERIES_DEEP_LEARNING

  • wait_guess_complete (boolean) – If False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms. You should wait for the guessing to be completed by calling wait_guess_complete on the returned object before doing anything else (in particular calling train or get_settings)

create_analysis()

Creates a new visual analysis lab

list_analyses(as_type='listitems')

List the visual analyses on this dataset

Parameters

as_type (str) – How to return the list. Supported values are “listitems” and “objects”.

Returns

The list of the analyses. If “as_type” is “listitems”, each one as a dict; if “as_type” is “objects”, each one as a dataikuapi.dss.analysis.DSSAnalysis

Return type

list

delete_analyses(drop_data=False)

Deletes all analyses that have this dataset as their input dataset. Also deletes ML tasks that are part of these analyses

Parameters

drop_data (bool) – whether to drop data for all ML tasks in the analysis

list_statistics_worksheets(as_objects=True)

List the statistics worksheets associated to this dataset.

Return type

list of dataikuapi.dss.statistics.DSSStatisticsWorksheet

create_statistics_worksheet(name='My worksheet')

Create a new worksheet in the dataset, and return a handle to interact with it.

Parameters

name (string) – name of the worksheet

Returns:

A dataikuapi.dss.statistics.DSSStatisticsWorksheet worksheet handle

get_statistics_worksheet(worksheet_id)

Get a handle to interact with a statistics worksheet

Parameters

worksheet_id (string) – the ID of the desired worksheet

Returns

A dataikuapi.dss.statistics.DSSStatisticsWorksheet worksheet handle

get_last_metric_values(partition='')

Get the last values of the metrics on this dataset

Returns:

a list of metric objects and their value

get_metric_history(metric, partition='')

Get the history of the values of the metric on this dataset

Returns:

an object containing the values of the metric, cast to the appropriate type (double, boolean,…)

get_info()

Retrieve all the information about a dataset

Returns

a DSSDatasetInfo containing all the information about a dataset.

Return type

DSSDatasetInfo

get_zone()

Gets the flow zone of this dataset

Return type

dataikuapi.dss.flow.DSSFlowZone

move_to_zone(zone)

Moves this object to a flow zone

Parameters

zone (object) – a dataikuapi.dss.flow.DSSFlowZone where to move the object

share_to_zone(zone)

Share this object to a flow zone

Parameters

zone (object) – a dataikuapi.dss.flow.DSSFlowZone where to share the object

unshare_from_zone(zone)

Unshare this object from a flow zone

Parameters

zone (object) – a dataikuapi.dss.flow.DSSFlowZone from where to unshare the object

get_usages()

Get the recipes or analyses referencing this dataset

Returns:

a list of usages

get_object_discussions()

Get a handle to manage discussions on the dataset

Returns

the handle to manage discussions

Return type

dataikuapi.discussion.DSSObjectDiscussions

test_and_detect(infer_storage_types=False)

Used internally by autodetect_settings. It is not usually required to call this method

autodetect_settings(infer_storage_types=False)

Detects appropriate settings for this dataset using the Dataiku detection engine

Returns new suggested settings that you can then save with DSSDatasetSettings.save()

Return type

DSSDatasetSettings or a subclass

get_as_core_dataset()

Returns the dataiku.Dataset object corresponding to this dataset

new_code_recipe(type, code=None, recipe_name=None)

Starts creation of a new code recipe taking this dataset as input

Parameters
  • type (str) – Type of the recipe (‘python’, ‘r’, ‘pyspark’, ‘sparkr’, ‘sql’, ‘sparksql’, ‘hive’, …)

  • code (str) – The code of the recipe

new_recipe(type, recipe_name=None)

Starts creation of a new recipe taking this dataset as input. For more details, please see dataikuapi.dss.project.DSSProject.new_recipe()

Parameters

type (str) – Type of the recipe

Listing datasets

class dataikuapi.dss.dataset.DSSDatasetListItem(client, data)

An item in a list of datasets. Do not instantiate this class, use dataikuapi.dss.project.DSSProject.list_datasets()

to_dataset()

Gets the DSSDataset corresponding to this dataset

property name
property id
property type
property schema
property connection

Returns the connection to which this dataset is attached, or None if there is no connection for this dataset

get_column(column)

Returns the schema column given a name.

Parameters

column (str) – Column to find

Returns

a dict of the column settings, or None if the column does not exist

Settings of datasets

class dataikuapi.dss.dataset.DSSDatasetSettings(dataset, settings)

Base settings class for a DSS dataset. Do not instantiate this class directly, use DSSDataset.get_settings()

Use save() to save your changes

get_raw()

Get the raw dataset settings as a dict

get_raw_params()

Get the type-specific params, as a raw dict

property type
property schema_columns
remove_partitioning()
add_discrete_partitioning_dimension(dim_name)
add_time_partitioning_dimension(dim_name, period='DAY')
add_raw_schema_column(column)
property is_feature_group

Indicates whether the Dataset is defined as a Feature Group, available in the Feature Store.

Return type

bool

set_feature_group(status)

(Un)sets the dataset as a Feature Group, available in the Feature Store. Changes of this property will be applied when calling save() and require the “Manage Feature Store” permission.

Parameters

status (bool) – whether the dataset should be defined as a feature group

save()
class dataikuapi.dss.dataset.SQLDatasetSettings(dataset, settings)

Settings for a SQL dataset. This class inherits from DSSDatasetSettings. Do not instantiate this class directly, use DSSDataset.get_settings()

Use save() to save your changes

set_table(connection, schema, table)

Sets this SQL dataset in ‘table’ mode, targeting a particular table of a connection

class dataikuapi.dss.dataset.FSLikeDatasetSettings(dataset, settings)

Settings for a files-based dataset. This class inherits from DSSDatasetSettings. Do not instantiate this class directly, use DSSDataset.get_settings()

Use save() to save your changes

set_connection_and_path(connection, path)
get_raw_format_params()

Get the raw format parameters as a dict

set_format(format_type, format_params=None)
set_csv_format(separator=',', style='excel', skip_rows_before=0, header_row=True, skip_rows_after=0)
set_partitioning_file_pattern(pattern)
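
A sketch configuring a files-based dataset, assuming dataset is a files-based DSSDataset handle (the connection name and path are placeholders):

settings = dataset.get_settings()  # an FSLikeDatasetSettings instance
settings.set_connection_and_path("filesystem_managed", "/data/mydataset")
# Declare a semicolon-separated CSV format with a header row
settings.set_csv_format(separator=";", header_row=True)
settings.save()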

Dataset Information

class dataikuapi.dss.dataset.DSSDatasetInfo(dataset, info)

Info class for a DSS dataset (Read-Only). Do not instantiate this class directly, use DSSDataset.get_info()

get_raw()

Get the raw dataset full information as a dict

Returns

the raw dataset full information

Return type

dict

property last_build_start_time

The last build start time of the dataset as a datetime.datetime or None if there is no last build information.

Returns

the last build start time

Return type

datetime.datetime or None

property last_build_end_time

The last build end time of the dataset as a datetime.datetime or None if there is no last build information.

Returns

the last build end time

Return type

datetime.datetime or None

property is_last_build_successful

Get whether the last build of the dataset is successful.

Returns

True if the last build is successful

Return type

bool

Creation of managed datasets

class dataikuapi.dss.dataset.DSSManagedDatasetCreationHelper(project, dataset_name)
get_creation_settings()
with_store_into(connection, type_option_id=None, format_option_id=None)

Sets the connection into which to store the new managed dataset

Parameters
  • connection (str) – Name of the connection to store into

  • type_option_id (str) – If the connection accepts several types of datasets, the type to use

  • format_option_id (str) – Optional identifier of a file format option

Returns

self

with_copy_partitioning_from(dataset_ref, object_type='DATASET')

Sets the new managed dataset to use the same partitioning as an existing dataset

Parameters

dataset_ref (str) – Name of the dataset to copy partitioning from

Returns

self

create(overwrite=False)

Executes the creation of the managed dataset according to the selected options

Parameters

overwrite – If the dataset being created already exists, delete it first (removing data)

Returns

The DSSDataset corresponding to the newly created dataset

already_exists()

Returns whether this managed dataset already exists
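
A sketch creating a managed dataset on a connection, assuming project is a dataikuapi.dss.project.DSSProject handle (the connection and dataset names are placeholders):

from dataikuapi.dss.dataset import DSSManagedDatasetCreationHelper

helper = DSSManagedDatasetCreationHelper(project, "my_new_dataset")
helper.with_store_into("filesystem_managed")

if not helper.already_exists():
    new_dataset = helper.create()  # returns a DSSDataset handle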