Interacting with datasets

Basic usage

For starting code samples, please see Python recipes.

Typing of dataframes

This applies when reading a dataset as a dataframe.

By default, the dataframe is created without explicit typing: we let Pandas “guess” the proper type for each column. The main advantage of this approach is that even if your dataset only contains “string” columns (which is the default on a newly imported dataset), a proper numerical type will be used for any column that actually contains numbers.

If you pass infer_with_pandas=False as an option to get_dataframe(), the exact dataset types will be passed to Pandas. Note that if your dataset contains invalid values, the whole get_dataframe call will fail.
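
For example, a minimal sketch (the dataset name “mydataset” is illustrative):

import dataiku

# Let Pandas infer the column types (default behavior)
df_inferred = dataiku.Dataset("mydataset").get_dataframe()

# Enforce the exact types declared in the dataset schema instead
df_typed = dataiku.Dataset("mydataset").get_dataframe(infer_with_pandas=False)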

Chunked reading and writing with Pandas

When using Dataset.get_dataframe(), the whole dataset (or selected partitions) is read into a single Pandas dataframe, which must fit in RAM on the DSS server.

This is sometimes inconvenient, so DSS provides a way to read a dataset in chunks:

from dataiku import Dataset

mydataset = Dataset("myname")

for df in mydataset.iter_dataframes(chunksize=10000):
        # df is a dataframe of at most 10K rows
        print(len(df))

By doing this, you only need to load a few thousand rows at a time.
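
For instance, a minimal sketch (the dataset name is illustrative) that counts rows without ever holding the full dataset in memory:

from dataiku import Dataset

mydataset = Dataset("myname")

total_rows = 0
for df in mydataset.iter_dataframes(chunksize=10000):
        # Each chunk is a regular Pandas dataframe
        total_rows += len(df)

print(total_rows)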

Writing to a dataset can also be done in chunks of dataframes. For that, you need to obtain a writer:

inp = Dataset("input")
out = Dataset("output")

with out.get_writer() as writer:

        for df in inp.iter_dataframes():
                # Process the df dataframe ...

                # Write the processed dataframe
                writer.write_dataframe(df)

Note

When using chunked writing, you cannot set the schema for each chunk, so you cannot use Dataset.write_with_schema.

Instead, you should set the schema first on the dataset object, using Dataset.write_schema_from_dataframe(first_output_dataframe).
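
Putting the pieces together, a minimal sketch of the full pattern (dataset names are illustrative), where the schema is set from the first processed chunk before the writer is opened:

from dataiku import Dataset

inp = Dataset("input")
out = Dataset("output")

writer = None
try:
        for df in inp.iter_dataframes():
                # Process the df dataframe ...

                if writer is None:
                        # Set the output schema from the first processed chunk,
                        # then open the writer
                        out.write_schema_from_dataframe(df)
                        writer = out.get_writer()

                writer.write_dataframe(df)
finally:
        if writer is not None:
                writer.close()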

Encoding

When dealing with both dataframes and row-by-row iteration, you must pay attention to str/unicode and encoding issues:

  • DSS provides dataframes where the string content is utf-8 encoded str
  • When writing dataframes, DSS expects utf-8 encoded str
  • Per-line iterators provide string content as unicode objects
  • Per-line writers expect unicode objects.

For example, if you read from a dataframe but write row-by-row, you must decode your str values into unicode objects.
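
For instance, a minimal sketch (dataset and column names are illustrative, and the output schema is assumed to be already set), reading from a dataframe and writing row by row:

import dataiku

inp = dataiku.Dataset("input")
out = dataiku.Dataset("output")

df = inp.get_dataframe()

with out.get_writer() as writer:
        for comment in df["comment"]:
                # The dataframe holds utf-8 encoded str values, but the
                # per-line writer expects unicode objects, so decode first
                writer.write_row_dict({"comment": comment.decode("utf-8")})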

Sampling

All calls to iterate the dataset (get_dataframe, iter_dataframes, iter_rows and iter_tuples) take several arguments to set sampling.

Sampling lets you only retrieve a selection of the rows of the input dataset. It’s often useful when using Pandas if your dataset does not fit in RAM.

For more information about sampling methods, please see Sampling.

The sampling argument takes the following values.

head

Returns the first rows of the dataset. Additional arguments:

  • limit=X: number of rows to read.

random

Returns a random sample of the dataset. Additional arguments:

  • ratio=X: ratio (between 0 and 1) to select.
  • OR: limit=X: number of rows to read.

random-column

Return a column-based random sample. Additional arguments:

  • sampling_column: column to use for sampling
  • ratio=X: ratio (between 0 and 1) to select

Examples

# Get a Dataframe over the first 3K rows
dataset.get_dataframe(sampling='head', limit=3000)

# Iterate over a random 10% sample
dataset.iter_tuples(sampling='random', ratio=0.1)

# Iterate over 27% of the values of column 'user_id'
dataset.iter_tuples(sampling='random-column', sampling_column='user_id', ratio=0.27)

# Get a chunked stream of dataframes over 100K randomly selected rows
dataset.iter_dataframes(sampling='random', limit=100000)

Getting a dataset as raw bytes

In addition to retrieving a dataset as a Pandas dataframe or a row iterator, you can also ask DSS for a streamed export of the dataset as formatted data.

Data can be exported by DSS in various formats: CSV, Excel, Avro, …

# Read a dataset as Excel, and dump to a file, chunk by chunk
#
# Very important: you MUST use a with statement to ensure that the stream
# returned by raw_formatted_data is closed

with open(target_path, "wb") as ofl:
        with dataset.raw_formatted_data(format="excel") as ifl:
                while True:
                        chunk = ifl.read(32000)
                        if len(chunk) == 0:
                                break
                        ofl.write(chunk)

API reference: The Dataset class

This is the main class that you will use in Python recipes and the IPython notebook.

For starting code samples, please see Python recipes.

class dataiku.Dataset(name, project_key=None)

This is a handle to obtain readers and writers on a dataiku Dataset. From this Dataset class, you can:

  • Read a dataset as a Pandas dataframe
  • Read a dataset as a chunked Pandas dataframe
  • Read a dataset row-by-row
  • Write a pandas dataframe to a dataset
  • Write a series of chunked Pandas dataframes to a dataset
  • Write to a dataset row-by-row
  • Edit the schema of a dataset

add_read_partitions(spec)

Add a partition or range of partitions to read.

The spec argument must be given in the DSS partition spec format. You cannot manually set partitions when running inside a Python recipe. They are automatically set using the dependencies.
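
For example, in a notebook (the partition identifier below is purely illustrative):

import dataiku

mydataset = dataiku.Dataset("mydataset")

# Only read a single partition (not allowed inside a Python recipe)
mydataset.add_read_partitions("2015-01-03")
df = mydataset.get_dataframe()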

get_dataframe(columns=None, sampling='head', sampling_column=None, limit=None, ratio=None, infer_with_pandas=True, parse_dates=True, bool_as_str=False)

Read the dataset (or its selected partitions, if applicable) as a Pandas dataframe.

Pandas dataframes are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this.

Keyword arguments:

  • columns – When not None, returns only the given list of columns (default None)

  • limit – Limits the number of rows returned (default None)

  • sampling – Sampling method:

    • ‘head’ returns the first rows of the dataset. Incompatible with the ratio parameter.
    • ‘random’ returns a random sample of the dataset
    • ‘random-column’ returns a column-based random sample. Incompatible with the limit parameter.
  • sampling_column – Select the column used for ‘random-column’ sampling (default None)

  • ratio – Ratio (between 0 and 1) of the dataset to return (default None)

  • infer_with_pandas – Use the types detected by Pandas rather than the dataset schema as detected in DSS (default True)

  • parse_dates – Date columns in the DSS dataset schema are parsed (default True)

  • bool_as_str – Leave boolean values as strings (default False)

Inconsistent sampling parameters raise a ValueError.

Note about encoding:

  • Column labels are “unicode” objects
  • When a column is of string type, the content is made of utf-8 encoded “str” objects
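
For example, a minimal sketch (dataset and column names are illustrative):

import dataiku

mydataset = dataiku.Dataset("mydataset")

# Only retrieve two columns, letting Pandas infer their types
df = mydataset.get_dataframe(columns=["customer_id", "amount"])
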
get_last_metric_values(partition='')

Get the set of last values of the metrics on this dataset, as a dataiku.ComputedMetrics object

get_metric_history(metric_lookup, partition='')

Get the set of all values a given metric took on this dataset

Parameters:
  • metric_lookup – metric name or unique identifier
  • partition – optionally, the partition for which the values are to be fetched

get_writer()

Get a stream writer for this dataset (or its target partition, if applicable). The writer must be closed as soon as you don’t need it.

The schema of the dataset MUST be set before using this. If you don’t set the schema of the dataset, your data will generally not be stored by the output writers.
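
For example, a minimal sketch (dataset name, column names and types are illustrative):

import dataiku

out = dataiku.Dataset("output")

# The schema must be set before the writer is used
out.write_schema([{"name": "customer_id", "type": "string"},
                  {"name": "amount", "type": "double"}])

with out.get_writer() as writer:
        writer.write_row_dict({"customer_id": u"A-42", "amount": 17.5})
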
iter_dataframes(chunksize=10000, infer_with_pandas=True, sampling='head', sampling_column=None, parse_dates=True, limit=None, ratio=None, columns=None, bool_as_str=False)

Read the dataset to Pandas dataframes by chunks of fixed size.

Returns a generator over pandas dataframes.

Useful if the dataset doesn’t fit in RAM.

iter_rows(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)

Returns a generator on the rows (as a dict-like object) of the data (or its selected partitions, if applicable)

Keyword arguments:

  • limit – maximum number of rows to be emitted
  • log_every – print out the number of rows read on stdout

Field values are cast according to their types. Strings are parsed into “unicode” values.
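
For example, a minimal sketch (dataset and column names are illustrative), assuming the dict-like rows support lookup by column name:

import dataiku

mydataset = dataiku.Dataset("mydataset")

for row in mydataset.iter_rows(limit=10):
        print(row["customer_id"])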

iter_tuples(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)

Returns the rows of the dataset as tuples. The order and type of the values match the dataset’s schema.

Keyword arguments:

  • limit – maximum number of rows to be emitted
  • log_every – print out the number of rows read on stdout
  • timeout – time (in seconds) of inactivity after which the generator is closed if nothing has been read. Without it, notebooks typically tend to leak “DKU” processes.

Field values are cast according to their types. Strings are parsed into “unicode” values.
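
For example, a minimal sketch (the dataset name is illustrative), pairing each tuple with the column names from the schema:

import dataiku

mydataset = dataiku.Dataset("mydataset")

# Column order in each tuple follows the dataset schema
names = [column["name"] for column in mydataset.read_schema()]

for values in mydataset.iter_tuples(limit=10):
        print(dict(zip(names, values)))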

static list(project_key=None)

Lists the names of datasets. If project_key is None, the current project key is used.
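
For example (the project key is illustrative):

import dataiku

# Names of the datasets in the current project
print(dataiku.Dataset.list())

# Names of the datasets in another project
print(dataiku.Dataset.list("SOME_PROJECT"))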

list_partitions(raise_if_empty=True)

List the partitions of this dataset, as an array of partition specifications

raw_formatted_data(sampling=None, columns=None, format='tsv-excel-noheader', format_params=None)

Get a stream of raw bytes from a dataset as a file-like object, formatted in a supported DSS output format.

You MUST close the file handle. Failure to do so will result in resource leaks.

read_metadata()

Reads the dataset metadata object

read_schema(raise_if_empty=True)

Gets the schema of this dataset, as an array of objects like this one: { ‘type’: ‘string’, ‘name’: ‘foo’, ‘maxLength’: 1000 }. There is more information for the map, array and object types.
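
For example, a minimal sketch (the dataset name is illustrative) printing the name and type of each column:

import dataiku

mydataset = dataiku.Dataset("mydataset")

for column in mydataset.read_schema():
        print("%s (%s)" % (column["name"], column["type"]))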

save_external_check_values(values_dict, partition='')

Save checks on this dataset. The checks are saved with the type “external”

Parameters:
  • values_dict – the values to save, as a dict. The keys of the dict are used as check names

save_external_metric_values(values_dict, partition='')

Save metrics on this dataset. The metrics are saved with the type “external”

Parameters:
  • values_dict – the values to save, as a dict. The keys of the dict are used as metric names
  • partition – optionally, the partition for which the values are to be saved
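
For example, a minimal sketch (dataset and metric names are illustrative):

import dataiku

mydataset = dataiku.Dataset("mydataset")

# The dict keys become metric names of type "external"
mydataset.save_external_metric_values({"row_count": 1000, "source": "daily_batch"})
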
set_write_partition(spec)

Sets which partition of the dataset gets written to when you create a DatasetWriter. Setting the write partition is not allowed in Python recipes, where write is controlled by the Flow.

write_from_dataframe(df, infer_schema=False, write_direct=False, dropAndCreate=False)

Writes this dataset (or its target partition, if applicable) from a single Pandas dataframe.

This variant does not edit the schema of the output dataset, so you must take care to only write dataframes that have a compatible schema. Also see “write_with_schema”.

Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.

Parameters:
  • df – input Pandas dataframe
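
For example, a minimal sketch (dataset and column names are illustrative), assuming the schema of “output” already matches the dataframe:

import dataiku
import pandas as pd

out = dataiku.Dataset("output")

df = pd.DataFrame({"customer_id": ["A-42"], "amount": [17.5]})
out.write_from_dataframe(df)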

write_metadata(meta)

Writes the dataset metadata object

write_schema(columns, dropAndCreate=False)

Write the dataset schema into the dataset JSON definition file.

Sometimes, the schema of a dataset being written is known only by the code of the Python script itself. In that case, it can be useful for the Python script to actually modify the schema of the dataset. Obviously, this must be used with caution. ‘columns’ must be an array of dicts like { ‘name’ : ‘column name’, ‘type’ : ‘column type’}
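
For example, a minimal sketch (dataset name, column names and types are illustrative):

import dataiku

out = dataiku.Dataset("output")

out.write_schema([{"name": "customer_id", "type": "string"},
                  {"name": "amount", "type": "double"}])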

write_with_schema(df, dropAndCreate=False)

Writes this dataset (or its target partition, if applicable) from a single Pandas dataframe.

This variant replaces the schema of the output dataset with the schema of the dataframe.

Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.
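
For example, a minimal sketch (dataset and column names are illustrative):

import dataiku
import pandas as pd

out = dataiku.Dataset("output")

df = pd.DataFrame({"customer_id": ["A-42"], "amount": [17.5]})

# Replaces the schema of "output" with the dataframe's schema, then writes the data
out.write_with_schema(df)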

class dataiku.core.dataset_write.DatasetWriter(dataset)

Handle to write to a dataset. Use Dataset.get_writer() to obtain a DatasetWriter.

Very important: a DatasetWriter MUST be closed after usage. Failure to close a DatasetWriter will lead to incomplete or no data being written to the output dataset.

close()

Closes this dataset writer

write_dataframe(df)

Appends a Pandas dataframe to the dataset being written.

This method can be called multiple times (especially when you have been using iter_dataframes to read from an input dataset)

Encoding note: strings MUST be in the dataframe as UTF-8 encoded str objects. Using unicode objects will fail.

write_row_dict(row_dict)

Write a single row from a dict of column name -> column value.

Some columns can be omitted, empty values will be inserted instead.

Note: The schema of the dataset MUST be set before using this.

Encoding note: strings MUST be given as unicode objects. Giving str objects will fail.

write_tuple(row)

Write a single row from a tuple or list of column values. Columns must be given in the order of the dataset schema.

Note: The schema of the dataset MUST be set before using this.

Encoding note: strings MUST be given as unicode objects. Giving str objects will fail.
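
For example, a minimal sketch (dataset name, column names and values are illustrative):

import dataiku

out = dataiku.Dataset("output")
out.write_schema([{"name": "customer_id", "type": "string"},
                  {"name": "amount", "type": "double"}])

with out.get_writer() as writer:
        # Values are given in schema order; strings as unicode objects
        writer.write_tuple((u"A-42", 17.5))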