Datasets (reference)¶
Please see Datasets (introduction) for an introduction to interacting with datasets in the Dataiku Python API.
API reference: The dataiku.Dataset class¶
Tip
For starting code samples, please see Python recipes.
-
class dataiku.Dataset(name, project_key=None, ignore_flow=False)¶ This is a handle to obtain readers and writers on a dataiku Dataset. From this Dataset class, you can:
Read a dataset as a Pandas dataframe
Read a dataset as a chunked Pandas dataframe
Read a dataset row-by-row
Write a pandas dataframe to a dataset
Write a series of chunked Pandas dataframes to a dataset
Write to a dataset row-by-row
Edit the schema of a dataset
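For example, here is a minimal sketch of the read/transform/write pattern inside a Python recipe, assuming input and output datasets named "input_dataset" and "output_dataset" (hypothetical names) already exist in the Flow:
import dataiku

# Read the input dataset entirely into memory as a pandas dataframe
src = dataiku.Dataset("input_dataset")
df = src.get_dataframe()

# Any pandas transformation
df = df.dropna()

# Write the result and replace the output dataset's schema
dst = dataiku.Dataset("output_dataset")
dst.write_with_schema(df)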
-
static
list
(project_key=None)¶ Lists the names of datasets. If project_key is None, the current project key is used.
-
property
full_name
¶
-
get_location_info
(sensitive_info=False)¶
-
get_files_info
(partitions=[])¶
-
set_write_partition
(spec)¶ Sets which partition of the dataset gets written to when you create a DatasetWriter. Setting the write partition is not allowed in Python recipes, where write is controlled by the Flow.
-
add_read_partitions
(spec)¶ Add a partition or range of partitions to read.
The spec argument must be given in the DSS partition spec format. You cannot manually set partitions when running inside a Python recipe. They are automatically set using the dependencies.
-
read_schema
(raise_if_empty=True)¶ Gets the schema of this dataset, as an array of objects like this one: { ‘type’: ‘string’, ‘name’: ‘foo’, ‘maxLength’: 1000 }. There is more information for the map, array and object types.
-
list_partitions
(raise_if_empty=True)¶ List the partitions of this dataset, as an array of partition specifications
-
set_preparation_steps
(steps, requested_output_schema, context_project_key=None)¶
-
get_dataframe
(columns=None, sampling='head', sampling_column=None, limit=None, ratio=None, infer_with_pandas=True, parse_dates=True, bool_as_str=False, float_precision=None, na_values=None, keep_default_na=True)¶ Read the dataset (or its selected partitions, if applicable) as a Pandas dataframe.
Pandas dataframes are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this.
Keyword arguments:
columns – When not None, returns only the given list of columns (default None)
limit – Limits the number of rows returned (default None)
sampling – Sampling method, if:
‘head’ returns the first rows of the dataset. Incompatible with ratio parameter.
‘random’ returns a random sample of the dataset
‘random-column’ returns a random sample of the dataset. Incompatible with limit parameter.
sampling_column – Select the column used for “columnwise-random” sampling (default None)
ratio – Limits the sample to the given ratio (fraction between 0 and 1) of the dataset (default None)
infer_with_pandas – uses the types detected by pandas rather than the dataset schema as detected in DSS. (default True)
parse_dates – Date columns in DSS’s dataset schema are parsed (default True)
bool_as_str – Leave boolean values as strings (default False)
Inconsistent sampling parameters raise ValueError.
Note about encoding:
Column labels are “unicode” objects
When a column is of string type, the content is made of utf-8 encoded “str” objects
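A short sketch of typical get_dataframe calls, assuming a dataset named "customers" with customer_id and country columns (hypothetical names):
import dataiku

ds = dataiku.Dataset("customers")

# Read only two columns, full dataset
df = ds.get_dataframe(columns=["customer_id", "country"])

# Read a 10% random sample, keeping the DSS schema types instead of pandas inference
sample = ds.get_dataframe(sampling="random", ratio=0.1, infer_with_pandas=False)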
-
static
get_dataframe_schema_st
(schema, columns=None, parse_dates=True, infer_with_pandas=False, bool_as_str=False, int_as_float=False)¶
-
iter_dataframes_forced_types
(names, dtypes, parse_date_columns, chunksize=10000, sampling='head', sampling_column=None, limit=None, ratio=None, float_precision=None, na_values=None, keep_default_na=True)¶
-
iter_dataframes
(chunksize=10000, infer_with_pandas=True, sampling='head', sampling_column=None, parse_dates=True, limit=None, ratio=None, columns=None, bool_as_str=False, float_precision=None, na_values=None, keep_default_na=True)¶ Read the dataset to Pandas dataframes by chunks of fixed size.
Returns a generator over pandas dataframes.
Useful if the dataset doesn’t fit in RAM.
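A sketch of chunked reading, assuming a large dataset named "events" (hypothetical name):
import dataiku

ds = dataiku.Dataset("events")

total_rows = 0
for chunk_df in ds.iter_dataframes(chunksize=50000):
    # Each chunk is a regular pandas dataframe with at most 50000 rows
    total_rows += len(chunk_df)
print("Row count: %d" % total_rows)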
-
write_with_schema
(df, dropAndCreate=False)¶ Writes this dataset (or its target partition, if applicable) from a single Pandas dataframe.
This variant replaces the schema of the output dataset with the schema of the dataframe.
Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
- Parameters
df – input pandas dataframe.
dropAndCreate – drop and recreate the dataset.
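A minimal sketch, assuming an output dataset named "scores" (hypothetical name) exists as an output of the recipe:
import dataiku
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [0.9, 0.4]})

out = dataiku.Dataset("scores")
# The schema of "scores" is replaced by the schema inferred from df
out.write_with_schema(df)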
-
write_dataframe
(df, infer_schema=False, dropAndCreate=False)¶ Writes this dataset (or its target partition, if applicable) from a single Pandas dataframe.
This variant only edits the schema if infer_schema is True; otherwise you must take care to only write dataframes that have a compatible schema. Also see “write_with_schema”.
Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
- Parameters
df – input pandas dataframe.
infer_schema – infer the schema from the dataframe.
dropAndCreate – if infer_schema and this parameter are both set to True, clear and recreate the dataset structure.
-
iter_rows
(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)¶ Returns a generator on the rows (as a dict-like object) of the data (or its selected partitions, if applicable)
Keyword arguments:
limit – maximum number of rows to be emitted
log_every – print out the number of rows read on stdout
Field values are cast according to their types. Strings are parsed into “unicode” values.
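A sketch of row-by-row reading, assuming a dataset named "transactions" with transaction_id and amount columns (hypothetical names):
import dataiku

ds = dataiku.Dataset("transactions")

for row in ds.iter_rows(limit=1000, log_every=100):
    # Each row is a dict-like cursor; values are already cast to their schema types
    amount = row.get("amount")
    if amount is not None and amount > 100:
        print(row.get("transaction_id"), amount)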
-
raw_formatted_data
(sampling=None, columns=None, format='tsv-excel-noheader', format_params=None, read_session_id=None)¶ Get a stream of raw bytes from a dataset as a file-like object, formatted in a supported DSS output format.
You MUST close the file handle. Failure to do so will result in resource leaks.
After closing, you can also call
verify_read()
to check for any errors that occurred while reading the dataset data.
-
verify_read
(read_session_id)¶ Verifies that no error occurred when using
raw_formatted_data()
to read a dataset. Use the same read_session_id that you passed to the call to raw_formatted_data()
.
-
iter_tuples
(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)¶ Returns the rows of the dataset as tuples. The order and type of the values match the columns of the dataset’s schema.
Keyword arguments:
limit – maximum number of rows to be emitted
log_every – print out the number of rows read on stdout
timeout – time (in seconds) of inactivity after which we want to close the generator if nothing has been read. Without it notebooks typically tend to leak “DKU” processes.
Field values are cast according to their types. Strings are parsed into “unicode” values.
-
get_writer
()¶ Get a stream writer for this dataset (or its target partition, if applicable). The writer must be closed as soon as you don’t need it.
The schema of the dataset MUST be set before using this. If you don’t set the schema of the dataset, your data will generally not be stored by the output writers.
-
get_continuous_writer
(source_id, split_id=0)¶
-
write_schema
(columns, dropAndCreate=False)¶ Write the dataset schema into the dataset JSON definition file.
Sometimes, the schema of a dataset being written is known only by the code of the Python script itself. In that case, it can be useful for the Python script to actually modify the schema of the dataset. Obviously, this must be used with caution. ‘columns’ must be an array of dicts like { ‘name’ : ‘column name’, ‘type’ : ‘column type’}
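A sketch combining write_schema and get_writer for row-by-row writing, assuming an output dataset named "results" (hypothetical name):
import dataiku

out = dataiku.Dataset("results")

# The schema must be set before obtaining a writer
out.write_schema([
    {"name": "id", "type": "string"},
    {"name": "value", "type": "double"},
])

writer = out.get_writer()
try:
    writer.write_row_dict({"id": "a", "value": 1.0})
    writer.write_row_dict({"id": "b", "value": 2.5})
finally:
    # The writer must always be closed
    writer.close()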
-
write_schema_from_dataframe
(df, dropAndCreate=False)¶
-
read_metadata
()¶ Reads the dataset metadata object
-
write_metadata
(meta)¶ Writes the dataset metadata object
-
get_config
()¶
-
get_last_metric_values
(partition='')¶ Get the set of last values of the metrics on this dataset, as a
dataiku.ComputedMetrics
object
-
get_metric_history
(metric_lookup, partition='')¶ Get the set of all values a given metric took on this dataset
- Parameters
metric_lookup – metric name or unique identifier
partition – optionally, the partition for which the values are to be fetched
-
save_external_metric_values
(values_dict, partition='')¶ Save metrics on this dataset. The metrics are saved with the type “external”
- Parameters
values_dict – the values to save, as a dict. The keys of the dict are used as metric names
partition – optionally, the partition for which the values are to be saved
-
save_external_check_values
(values_dict, partition='')¶ Save checks on this dataset. The checks are saved with the type “external”
- Parameters
values_dict – the values to save, as a dict. The keys of the dict are used as check names
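A sketch of saving and reading back external metrics with save_external_metric_values and get_last_metric_values, assuming a dataset named "orders" and arbitrary metric names (all hypothetical):
import dataiku

ds = dataiku.Dataset("orders")

# The dict keys become the metric names; metrics are stored with the "external" type
ds.save_external_metric_values({"row_count_check": 12345, "freshness_hours": 2})

# Retrieve the most recent metric values as a dataiku.ComputedMetrics object
metrics = ds.get_last_metric_values()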
-
dataset.
create_sampling_argument
(sampling_column=None, limit=None, ratio=None)¶
-
class dataiku.core.dataset.Schema(data)¶
-
class dataiku.core.dataset.DatasetCursor(val, col_names, col_idx)¶ A dataset cursor that helps iterating on rows.
-
column_id
(name)¶
-
keys
()¶
-
items
()¶
-
values
()¶
-
get
(col_name, default_value=None)¶
-
-
class dataiku.core.dataset_write.DatasetWriter(dataset)¶ Handle to write to a dataset. Use Dataset.get_writer() to obtain a DatasetWriter.
Very important: a DatasetWriter MUST be closed after usage. Failure to close a DatasetWriter will lead to incomplete or no data being written to the output dataset.
-
write_tuple
(row)¶ Write a single row from a tuple or list of column values. Columns must be given in the order of the dataset schema.
Note: The schema of the dataset MUST be set before using this.
Encoding note: strings MUST be given as Unicode object. Giving str objects will fail.
-
write_row_array
(row)¶
-
write_row_dict
(row_dict)¶ Write a single row from a dict of column name -> column value.
Some columns can be omitted, empty values will be inserted instead.
Note: The schema of the dataset MUST be set before using this.
Encoding note: strings MUST be given as Unicode object. Giving str objects will fail.
-
write_dataframe
(df)¶ Appends a Pandas dataframe to the dataset being written.
This method can be called multiple times (especially when you have been using iter_dataframes to read from an input dataset)
Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
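A sketch of a chunked copy combining Dataset.iter_dataframes with DatasetWriter.write_dataframe, assuming datasets named "big_input" and "big_output" (hypothetical names) and that the output can simply reuse the input schema:
import dataiku

src = dataiku.Dataset("big_input")
dst = dataiku.Dataset("big_output")

# Set the output schema before writing (here, reuse the input schema as-is)
dst.write_schema(src.read_schema())

writer = dst.get_writer()
try:
    for chunk_df in src.iter_dataframes(chunksize=100000):
        writer.write_dataframe(chunk_df)
finally:
    writer.close()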
-
close
()¶ Closes this dataset writer
-
API reference: The dataikuapi.dss.dataset package¶
Main DSSDataset class¶
-
class dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)¶ A dataset on the DSS instance. Do not instantiate this class, use dataikuapi.dss.project.DSSProject.get_dataset().
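A minimal sketch of obtaining a DSSDataset handle, assuming a hypothetical DSS URL, API key and project key (from inside DSS, a client can typically be obtained with dataiku.api_client() instead):
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("MYPROJECT")
dataset = project.get_dataset("my_dataset")

print(dataset.exists())
print(dataset.get_schema())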
-
property
id
¶
-
property
name
¶
-
delete
(drop_data=False)¶ Delete the dataset
- Parameters
drop_data (bool) – Should the data of the dataset be dropped
-
get_settings
()¶ Returns the settings of this dataset as a DSSDatasetSettings, or one of its subclasses. Known subclasses of DSSDatasetSettings include FSLikeDatasetSettings and SQLDatasetSettings.
You must use save() on the returned object to make your changes effective on the dataset.
# Example: activating discrete partitioning on a SQL dataset
dataset = project.get_dataset("my_database_table")
settings = dataset.get_settings()
settings.add_discrete_partitioning_dimension("country")
settings.save()
- Return type
DSSDatasetSettings
-
get_definition
()¶ Deprecated. Use
get_settings()
Get the raw settings of the dataset as a dict.
- Return type
dict
-
set_definition
(definition)¶ Deprecated. Use
get_settings()
and DSSDatasetSettings.save()
Set the definition of the dataset
- Parameters
definition – the definition, as a dict. You should only set a definition object that has been retrieved using the get_definition call.
-
exists
()¶ Returns whether this dataset exists
-
get_schema
()¶ Get the schema of the dataset
- Returns:
a JSON object of the schema, with the list of columns
-
set_schema
(schema)¶ Set the schema of the dataset
- Args:
schema: the desired schema for the dataset, as a JSON object. All columns have to provide their name and type
-
get_metadata
()¶ Get the metadata attached to this dataset. The metadata contains label, description, checklists, tags and custom metadata of the dataset
- Returns:
a dict object. For more information on available metadata, please see https://doc.dataiku.com/dss/api/5.0/rest/
-
set_metadata
(metadata)¶ Set the metadata on this dataset.
- Args:
metadata: the new state of the metadata for the dataset. You should only set a metadata object that has been retrieved using the get_metadata call.
-
iter_rows
(partitions=None)¶ Get the dataset’s data
- Return:
an iterator over the rows, each row being a tuple of values. The order of values in the tuples is the same as the order of columns in the schema returned by get_schema
-
list_partitions
()¶ Get the list of all partitions of this dataset
- Returns:
the list of partitions, as a list of strings
-
clear
(partitions=None)¶ Clear all data in this dataset
- Args:
partitions: (optional) a list of partitions to clear. When not provided, the entire dataset is cleared
-
copy_to
(target, sync_schema=True, write_mode='OVERWRITE')¶ Copies the data of this dataset to another dataset
- Parameters
target (Dataset) – a dataikuapi.dss.dataset.DSSDataset representing the target of this copy
- Returns
a DSSFuture representing the operation
-
build
(job_type='NON_RECURSIVE_FORCED_BUILD', partitions=None, wait=True, no_fail=False)¶ Starts a new job to build this dataset and waits for it to complete. Raises if the job failed.
job = dataset.build()
print("Job %s done" % job.id)
- Parameters
job_type – The job type. One of RECURSIVE_BUILD, NON_RECURSIVE_FORCED_BUILD or RECURSIVE_FORCED_BUILD
partitions – If the dataset is partitioned, a list of partition ids to build
no_fail – if True, does not raise if the job failed.
- Returns
the
dataikuapi.dss.job.DSSJob
job handle corresponding to the built job
- Return type
dataikuapi.dss.job.DSSJob
-
synchronize_hive_metastore
()¶ Synchronize this dataset with the Hive metastore
-
update_from_hive
()¶ Resynchronize this dataset from its Hive definition
-
compute_metrics
(partition='', metric_ids=None, probes=None)¶ Compute metrics on a partition of this dataset. If neither metric ids nor custom probes set are specified, the metrics setup on the dataset are used.
-
run_checks
(partition='', checks=None)¶ Run checks on a partition of this dataset. If the checks are not specified, the checks setup on the dataset are used.
-
uploaded_add_file
(fp, filename)¶ Adds a file to an “uploaded files” dataset
- Parameters
fp (file) – A file-like object that represents the file to upload
filename (str) – The filename for the file to upload
-
uploaded_list_files
()¶ List the files in an “uploaded files” dataset
-
create_prediction_ml_task
(target_variable, ml_backend_type='PY_MEMORY', guess_policy='DEFAULT', prediction_type=None, wait_guess_complete=True)¶ Creates a new prediction task in a new visual analysis lab for a dataset.
- Parameters
input_dataset (string) – the dataset to use for training/testing the model
target_variable (string) – the variable to predict
ml_backend_type (string) – ML backend to use, one of PY_MEMORY, MLLIB or H2O
guess_policy (string) – Policy to use for setting the default parameters. Valid values are: DEFAULT, SIMPLE_FORMULA, DECISION_TREE, EXPLANATORY and PERFORMANCE
prediction_type (string) – The type of prediction problem this is. If not provided the prediction type will be guessed. Valid values are: BINARY_CLASSIFICATION, REGRESSION, MULTICLASS
wait_guess_complete (boolean) – if False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms. You should wait for the guessing to be completed by calling
wait_guess_complete
on the returned object before doing anything else (in particular calling train or get_settings).
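A sketch of creating and training a prediction task, assuming dataset is a DSSDataset handle and that the dataset contains a "churn" column (hypothetical name):
# Create the ML task and wait for the initial guessing to complete
mltask = dataset.create_prediction_ml_task(
    target_variable="churn",
    ml_backend_type="PY_MEMORY",
    guess_policy="DEFAULT",
    wait_guess_complete=True,
)

# Inspect or adjust the guessed settings, then train
settings = mltask.get_settings()
mltask.train()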
-
create_clustering_ml_task
(input_dataset, ml_backend_type='PY_MEMORY', guess_policy='KMEANS', wait_guess_complete=True)¶ Creates a new clustering task in a new visual analysis lab for a dataset.
The returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms.
You should wait for the guessing to be completed by calling
wait_guess_complete
on the returned object before doing anything else (in particular calling train or get_settings).
- Parameters
ml_backend_type (string) – ML backend to use, one of PY_MEMORY, MLLIB or H2O
guess_policy (string) – Policy to use for setting the default parameters. Valid values are: KMEANS and ANOMALY_DETECTION
wait_guess_complete (boolean) – if False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms. You should wait for the guessing to be completed by calling
wait_guess_complete
on the returned object before doing anything else (in particular calling train or get_settings).
-
create_timeseries_forecasting_ml_task
(target_variable, time_variable, timeseries_identifiers=None, guess_policy='TIMESERIES_DEFAULT', wait_guess_complete=True)¶ Creates a new time series forecasting task in a new visual analysis lab for a dataset.
- Parameters
target_variable (string) – The variable to forecast
time_variable (string) – Column to be used as time variable. Should be a Date (parsed) column.
timeseries_identifiers (list) – List of columns to be used as time series identifiers (when the dataset has multiple series)
guess_policy (string) – Policy to use for setting the default parameters. Valid values are: TIMESERIES_DEFAULT, TIMESERIES_STATISTICAL, and TIMESERIES_DEEP_LEARNING
wait_guess_complete (boolean) – If False, the returned ML task will be in ‘guessing’ state, i.e. analyzing the input dataset to determine feature handling and algorithms. You should wait for the guessing to be completed by calling
wait_guess_complete
on the returned object before doing anything else (in particular calling train or get_settings).
-
create_analysis
()¶ Creates a new visual analysis lab
-
list_analyses
(as_type='listitems')¶ List the visual analyses on this dataset.
- Parameters
as_type (str) – How to return the list. Supported values are "listitems" and "objects".
- Returns
The list of the analyses. If "as_type" is "listitems", each one as a dict; if "as_type" is "objects", each one as a
dataikuapi.dss.analysis.DSSAnalysis
- Return type
list
-
delete_analyses
(drop_data=False)¶ Deletes all analyses that have this dataset as input dataset. Also deletes ML tasks that are part of the analysis
- Param
bool drop_data: whether to drop data for all ML tasks in the analysis
-
list_statistics_worksheets
(as_objects=True)¶ List the statistics worksheets associated to this dataset.
- Return type
list of dataikuapi.dss.statistics.DSSStatisticsWorksheet
-
create_statistics_worksheet
(name='My worksheet')¶ Create a new worksheet in the dataset, and return a handle to interact with it.
- Parameters
input_dataset (string) – input dataset of the worksheet
worksheet_name (string) – name of the worksheet
- Returns:
A
dataikuapi.dss.statistics.DSSStatisticsWorksheet
worksheet handle
-
get_statistics_worksheet
(worksheet_id)¶ Get a handle to interact with a statistics worksheet
- Parameters
worksheet_id (string) – the ID of the desired worksheet
- Returns
A
dataikuapi.dss.statistics.DSSStatisticsWorksheet
worksheet handle
-
get_last_metric_values
(partition='')¶ Get the last values of the metrics on this dataset
- Returns:
a list of metric objects and their value
-
get_metric_history
(metric, partition='')¶ Get the history of the values of the metric on this dataset
- Returns:
an object containing the values of the metric, cast to the appropriate type (double, boolean,…)
-
get_info
()¶ Retrieve all the information about a dataset
- Returns
a
DSSDatasetInfo
containing all the information about a dataset.
- Return type
DSSDatasetInfo
-
get_zone
()¶ Gets the flow zone of this dataset
- Return type
dataikuapi.dss.flow.DSSFlowZone
-
move_to_zone
(zone)¶ Moves this object to a flow zone
- Parameters
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
where to move the object
-
share_to_zone
(zone)¶ Share this object to a flow zone
- Parameters
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
where to share the object
-
unshare_from_zone
(zone)¶ Unshare this object from a flow zone
- Parameters
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
from where to unshare the object
-
get_usages
()¶ Get the recipes or analyses referencing this dataset
- Returns:
a list of usages
-
get_object_discussions
()¶ Get a handle to manage discussions on the dataset
- Returns
the handle to manage discussions
- Return type
dataikuapi.discussion.DSSObjectDiscussions
-
test_and_detect
(infer_storage_types=False)¶ Used internally by autodetect_settings. It is not usually required to call this method
-
autodetect_settings
(infer_storage_types=False)¶ Detects appropriate settings for this dataset using Dataiku detection engine
Returns new suggested settings that you can then save with DSSDatasetSettings.save().
- Return type
DSSDatasetSettings
or a subclass
-
get_as_core_dataset
()¶ Returns the
dataiku.Dataset
object corresponding to this dataset
-
new_code_recipe
(type, code=None, recipe_name=None)¶ Starts creation of a new code recipe taking this dataset as input.
- Parameters
type (str) – Type of the recipe (‘python’, ‘r’, ‘pyspark’, ‘sparkr’, ‘sql’, ‘sparksql’, ‘hive’, …)
code (str) – The code of the recipe
-
new_recipe
(type, recipe_name=None)¶ Starts creation of a new recipe taking this dataset as input. For more details, please see
dataikuapi.dss.project.DSSProject.new_recipe()
- Parameters
type (str) – Type of the recipe
-
Listing datasets¶
-
class dataikuapi.dss.dataset.DSSDatasetListItem(client, data)¶ An item in a list of datasets. Do not instantiate this class, use dataikuapi.dss.project.DSSProject.list_datasets().
-
to_dataset
()¶ Gets the
DSSDataset
corresponding to this dataset
-
property
name
¶
-
property
id
¶
-
property
type
¶
-
property
schema
¶
-
property
connection
¶ Returns the connection on which this dataset is attached, or None if there is no connection for this dataset
-
get_column
(column)¶ Returns the schema column given a name.
- Parameters
column (str) – Column to find
- Returns
a dict of the column settings, or None if the column does not exist
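A sketch of working with dataset list items, assuming project is a dataikuapi.dss.project.DSSProject handle and a dataset named "customers" (hypothetical name) exists:
for item in project.list_datasets():
    print(item.name, item.type, item.connection)

    if item.name == "customers":
        # Promote the lightweight list item to a full DSSDataset handle
        customers = item.to_dataset()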
-
Settings of datasets¶
-
class dataikuapi.dss.dataset.DSSDatasetSettings(dataset, settings)¶ Base settings class for a DSS dataset. Do not instantiate this class directly, use DSSDataset.get_settings().
Use save() to save your changes.
-
get_raw
()¶ Get the raw dataset settings as a dict
-
get_raw_params
()¶ Get the type-specific params, as a raw dict
-
property
type
¶
-
property
schema_columns
¶
-
remove_partitioning
()¶
-
add_discrete_partitioning_dimension
(dim_name)¶
-
add_time_partitioning_dimension
(dim_name, period='DAY')¶
-
add_raw_schema_column
(column)¶
-
property
is_feature_group
¶ Indicates whether the Dataset is defined as a Feature Group, available in the Feature Store.
- Return type
bool
-
set_feature_group
(status)¶ (Un)sets the dataset as a Feature Group, available in the Feature Store. Changes of this property will be applied when calling
save()
and require the “Manage Feature Store” permission.
- Parameters
status (bool) – whether the dataset should be defined as a feature group
-
save
()¶
-
-
class dataikuapi.dss.dataset.SQLDatasetSettings(dataset, settings)¶ Settings for a SQL dataset. This class inherits from DSSDatasetSettings. Do not instantiate this class directly, use DSSDataset.get_settings().
Use save() to save your changes.
-
set_table
(connection, schema, table)¶ Sets this SQL dataset in ‘table’ mode, targeting a particular table of a connection
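A sketch of re-pointing a SQL dataset to another table, assuming dataset is a DSSDataset backed by a SQL connection and hypothetical connection, schema and table names:
settings = dataset.get_settings()  # a SQLDatasetSettings for a SQL dataset
settings.set_table(connection="my_postgres", schema="public", table="customers")
settings.save()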
-
-
class dataikuapi.dss.dataset.FSLikeDatasetSettings(dataset, settings)¶ Settings for a files-based dataset. This class inherits from DSSDatasetSettings. Do not instantiate this class directly, use DSSDataset.get_settings().
Use save() to save your changes.
-
set_connection_and_path
(connection, path)¶
-
get_raw_format_params
()¶ Get the raw format parameters as a dict
-
set_format
(format_type, format_params=None)¶
-
set_csv_format
(separator=',', style='excel', skip_rows_before=0, header_row=True, skip_rows_after=0)¶
-
set_partitioning_file_pattern
(pattern)¶
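A sketch of configuring a files-based dataset, assuming dataset is a DSSDataset stored on a files-based connection and hypothetical connection and path names:
settings = dataset.get_settings()  # a FSLikeDatasetSettings for a files-based dataset
settings.set_connection_and_path("filesystem_managed", "/exports/daily")
settings.set_csv_format(separator=";", header_row=True)
settings.save()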
-
Dataset Information¶
-
class dataikuapi.dss.dataset.DSSDatasetInfo(dataset, info)¶ Info class for a DSS dataset (read-only). Do not instantiate this class directly, use DSSDataset.get_info().
-
get_raw
()¶ Get the raw dataset full information as a dict
- Returns
the raw dataset full information
- Return type
dict
-
property
last_build_start_time
¶ The last build start time of the dataset as a
datetime.datetime
or None if there is no last build information.
- Returns
the last build start time
- Return type
datetime.datetime
or None
-
property
last_build_end_time
¶ The last build end time of the dataset as a
datetime.datetime
or None if there is no last build information.
- Returns
the last build end time
- Return type
datetime.datetime
or None
-
property
is_last_build_successful
¶ Get whether the last build of the dataset is successful.
- Returns
True if the last build is successful
- Return type
bool
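A sketch of reading build information through DSSDatasetInfo, assuming dataset is a DSSDataset handle:
info = dataset.get_info()

if info.is_last_build_successful:
    print("Last build finished at", info.last_build_end_time)
else:
    print("Last build failed or no build information is available")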
-
Creation of managed datasets¶
-
class dataikuapi.dss.dataset.DSSManagedDatasetCreationHelper(project, dataset_name)¶
-
get_creation_settings
()¶
-
with_store_into
(connection, type_option_id=None, format_option_id=None)¶ Sets the connection into which to store the new managed dataset.
- Parameters
connection (str) – Name of the connection to store into
type_option_id (str) – If the connection accepts several types of datasets, the type
format_option_id (str) – Optional identifier of a file format option
- Returns
self
-
with_copy_partitioning_from
(dataset_ref, object_type='DATASET')¶ Sets the new managed dataset to use the same partitioning as an existing dataset
- Parameters
dataset_ref (str) – Name of the dataset to copy partitioning from
- Returns
self
-
create
(overwrite=False)¶ Executes the creation of the managed dataset according to the selected options.
- Parameters
overwrite – If the dataset being created already exists, delete it first (removing data)
- Returns
The
DSSDataset
corresponding to the newly created dataset
-
already_exists
()¶ Returns whether this managed dataset already exists
-
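A sketch of the creation helper, assuming project is a DSSProject handle; the helper is typically obtained through DSSProject.new_managed_dataset(), which is not documented on this page, and the connection name is hypothetical:
builder = project.new_managed_dataset("my_managed_dataset")
builder.with_store_into("filesystem_managed")

if not builder.already_exists():
    new_dataset = builder.create()  # returns a DSSDataset handle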