Python recipes

Data Science Studio lets you write recipes in the Python language. Python recipes can read and write datasets, regardless of their storage backend.

For example, you can write a Python recipe that reads a SQL dataset and an HDFS dataset and writes an S3 dataset. Python recipes use a specific API to read and write datasets.

Python recipes can manipulate datasets in either of two ways:

  • Using regular Python code to iterate over the rows of the input datasets and to write the rows of the output datasets
  • Using Pandas dataframes

Your first Python recipe

  • From the Flow, select one of the datasets that you want to use as input of the recipe.
  • In the right column, in the “Actions” tab, click on “Python”.
  • In the recipe creation window, create a new dataset that will contain the output of your Python code.
  • Validate to create the recipe.
  • You can now write your Python code.

(Note that you may need to fill in the partition dependencies. For more information, see Working with partitions.)

First, load the Dataiku API (it is preloaded when you create a new Python recipe):

import dataiku

You then need to obtain Dataset objects corresponding to your inputs and outputs.

For example, if your recipe has datasets A and B as inputs and dataset C as output, you can use:

datasetA = dataiku.Dataset("A")
datasetB = dataiku.Dataset("B")
datasetC = dataiku.Dataset("C")

You can interact with a Dataset object in two ways:

  • Using a streaming read and write API
  • Using Pandas dataframes

Using Pandas

Pandas is a popular Python package for in-memory data manipulation: http://pandas.pydata.org/

Reading a dataset through Pandas loads the whole dataset into memory, so it is critical that the dataset is “small enough” to fit in the memory of the DSS server.

The core object of Pandas is the DataFrame object, which represents a dataset.

Getting a Pandas DataFrame from a Dataset object is straightforward:

# Object representing our input dataset
cars = dataiku.Dataset("mycarsdataset")

# Read the whole dataset into a Pandas dataframe
cars_df = cars.get_dataframe()

cars_df is a regular Pandas dataframe, which you can manipulate using any Pandas method.

Writing a Pandas DataFrame to a dataset

Once you have used Pandas to manipulate the input data frame, you generally want to write it to the output dataset.

The Dataset object provides the write_with_schema method for this:

output_ds = dataiku.Dataset("myoutputdataset")
output_ds.write_with_schema(my_dataframe)

Writing the output schema

Generally speaking, it is preferable to declare the schema of the output dataset prior to running the Python code. However, it is often impractical to do so, especially when you write data frames with many columns (or columns that change often). In that case, it can be useful for the Python script to actually modify the schema of the dataset.

This is what happens when you use the write_with_schema method: each time the Python recipe runs, the schema of the dataframe is used to modify the schema of the output dataset. Use this with caution, as a mistake could cause the downstream parts of your Flow to fail.

You can also choose to write only the schema (not the data):

# Set the schema of "myoutputdataset" to match the columns of the dataframe
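
One way to limit that risk is to compare the columns you are about to write against the columns the rest of the Flow expects, before calling write_with_schema. The following is only a minimal sketch: schema_drift is a hypothetical helper (not part of the Dataiku API), and the column lists are made up for illustration.

```python
def schema_drift(expected_columns, actual_columns):
    """Return (missing, unexpected) column names, preserving order."""
    expected = list(expected_columns)
    actual = list(actual_columns)
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected


# Hypothetical column lists. In a recipe, the second argument would
# typically be list(my_dataframe.columns).
expected = ["origin", "count"]
actual = ["origin", "count", "debug_flag"]

missing, unexpected = schema_drift(expected, actual)
if missing or unexpected:
    print("Schema drift detected:", missing, unexpected)
```

Whether to stop the recipe or only log a warning when drift is detected depends on how strict the rest of the Flow needs to be.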
output_ds.write_schema_from_dataframe(my_dataframe)

And you can write the data in the dataframe without changing the schema:

# Write the dataframe without touching the schema
output_ds.write_from_dataframe(my_dataframe)

Using the streaming API

If the dataset does not fit in memory, you can instead stream its rows using a Python generator.

Reading

The Dataset object provides two methods for reading:

  • iter_rows iterates over the rows of the dataset, represented as dictionary-like objects.
  • iter_tuples iterates over the rows of the dataset, represented as tuples. Values are ordered according to the schema of the dataset.

import dataiku
from collections import Counter

cars = dataiku.Dataset("cars")

origin_count = Counter()

# Iterate over the dataset. Each row is a dict-like object.
# The dataset is "streamed", so it does not need to fit in RAM.
for row in cars.iter_rows():
    origin_count[row["origin"]] += 1
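
To make the difference between the two row representations concrete, here is a small sketch that runs without DSS. The generators fake_iter_rows and fake_iter_tuples, as well as the SCHEMA and ROWS values, are hypothetical stand-ins written for this illustration; they only mimic the shape of what iter_rows and iter_tuples are described as returning above.

```python
from collections import Counter

# Stand-in data: a hypothetical schema and rows, not a real DSS dataset.
SCHEMA = ["name", "origin"]
ROWS = [
    {"name": "car1", "origin": "US"},
    {"name": "car2", "origin": "Europe"},
    {"name": "car3", "origin": "US"},
]


def fake_iter_rows():
    """Yield rows one at a time as dicts (mimics dict-like rows)."""
    for row in ROWS:
        yield row


def fake_iter_tuples():
    """Yield rows one at a time as tuples ordered by SCHEMA."""
    for row in ROWS:
        yield tuple(row[col] for col in SCHEMA)


# Dict-style access: look up the column by name.
by_name = Counter(row["origin"] for row in fake_iter_rows())

# Tuple-style access: look up the column by its position in the schema.
origin_index = SCHEMA.index("origin")
by_position = Counter(t[origin_index] for t in fake_iter_tuples())
```

Both counters hold the same counts. The tuple form avoids a name lookup per row, but the code then depends on the column order of the schema.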

Writing

Writing the output dataset is done via a writer object returned by Dataset.get_writer:

# "output" is the Dataset object of the recipe's output dataset
with output.get_writer() as writer:
    for origin, count in origin_count.items():
        writer.write_row_array((origin, count))

Note

Don’t forget to close your writer. If you don’t, your data will not get fully written. In some cases (like SQL output datasets), no data will get written at all.

We strongly recommend that you use the Python with statement to ensure that the writer is closed.
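
The sketch below illustrates why the with statement helps. BufferedWriter is a hypothetical class written for this example, not the writer returned by Dataset.get_writer; it assumes, for illustration only, a writer that keeps rows in a buffer and persists them when close() is called.

```python
class BufferedWriter:
    """Hypothetical writer that only persists rows when it is closed."""

    def __init__(self, sink):
        self.sink = sink      # a list standing in for the storage backend
        self.buffer = []
        self.closed = False

    def write_row_array(self, row):
        self.buffer.append(tuple(row))

    def close(self):
        if not self.closed:
            self.sink.extend(self.buffer)
            self.buffer = []
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions


# Without close(): the row stays in the buffer and never reaches storage.
forgotten = []
w = BufferedWriter(forgotten)
w.write_row_array(("US", 2))

# With "with": close() is called automatically when the block exits.
safe = []
with BufferedWriter(safe) as writer:
    writer.write_row_array(("US", 2))
```

After this runs, forgotten is still empty while safe contains the row, which mirrors the situation described in the note above.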

Writing the output schema

Generally speaking, it is preferable to declare the schema of the output dataset prior to running the Python code. However, it is often impractical to do so, especially when you write rows with many columns (or columns that change often). In that case, it can be useful for the Python script to modify the schema of the dataset itself.

The Dataset API provides a method to set the schema of the output dataset. When doing that, the schema of the dataset is modified each time the Python recipe is run. This must obviously be used with caution.

output.write_schema([
    {
        "name": "origin",
        "type": "string",
    },
    {
        "name": "count",
        "type": "int",
    },
])
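
Because rows written as arrays must follow the column order of the schema, it can help to check a row against the schema before writing it. The following is a minimal sketch under stated assumptions: row_problems and PYTHON_TYPES are hypothetical names introduced here, and the mapping covers only the two types used in the example above.

```python
# Assumed mapping from the two schema types used above to Python types.
PYTHON_TYPES = {"string": str, "int": int}

schema = [
    {"name": "origin", "type": "string"},
    {"name": "count", "type": "int"},
]


def row_problems(schema, row):
    """Return a list of problems found in row; empty if the row fits."""
    if len(row) != len(schema):
        return ["expected %d values, got %d" % (len(schema), len(row))]
    problems = []
    for column, value in zip(schema, row):
        expected = PYTHON_TYPES[column["type"]]
        if not isinstance(value, expected):
            problems.append(
                "column %r expects %s, got %r"
                % (column["name"], column["type"], value)
            )
    return problems


good = row_problems(schema, ("US", 2))
swapped = row_problems(schema, (2, "US"))
short = row_problems(schema, ("US",))
```

Here good is empty, swapped reports one problem per column, and short reports the wrong number of values. A check like this adds a cost per row, so it is probably best reserved for debugging.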

When to use Python recipes

In many cases, you might want to use a Python recipe because you need to perform a custom operation on cells that is not possible using traditional Preparation processors.

In these cases, rather than creating a Python recipe, you should consider using a Python UDF within the Preparation.

For simple operations, using a Python UDF has several advantages over a Python recipe:

  • You do not need an “intermediate” dataset after the preparation operation, which means that you don’t need to copy the whole data.
  • Python UDFs can be more efficient because they don’t need to re-read the whole dataset: they use the in-memory representation.
  • Python UDFs are automatically multithreaded when possible. When using a Python recipe, handling multithreading is your responsibility.

However:

  • Python UDFs can only use a subset of the packages in the base Python installation and cannot use any other package.