Python recipes

Data Science Studio gives you the ability to write recipes using the Python language. Python recipes can read and write datasets, whatever their storage backend is.

For example, you can write a Python recipe that reads a SQL dataset and an HDFS dataset and writes an S3 dataset. Python recipes use a specific API to read and write datasets.

Python recipes can manipulate datasets either:

  • Using regular Python code to iterate over the rows of the input datasets and write the rows of the output datasets
  • Using Pandas DataFrames

Basic Python recipe

  • Create a new Python recipe by clicking the "Python" button on the Recipes page.
  • Go to the Inputs/Outputs tab.
  • Add the input datasets that will be used as source data in your recipe.
  • Select or create the output datasets that will be created by your recipe. For more information, see Creating recipes.
  • If needed, fill in the partition dependencies. For more information, see Working with partitions.
  • Give your recipe a name and save it.
  • You can now write your Python code.

First of all, you will need to load the Dataiku API (this import is preloaded when you create a new Python recipe):

import dataiku

You will then need to obtain Dataset objects corresponding to your inputs and outputs.

For example, if your recipe has datasets A and B as inputs and dataset C as output, you can use:

datasetA = dataiku.Dataset("A")
datasetB = dataiku.Dataset("B")
datasetC = dataiku.Dataset("C")

You can interact with a Dataset in two ways:

  • Using regular core Python
  • Using Pandas

Using Pandas

Pandas is a popular Python package for in-memory data manipulation. See http://pandas.pydata.org/.

Using a dataset via Pandas loads the whole dataset in memory, so it is critical that your dataset is small enough to fit in the memory of the Data Science Studio server.

The core object of Pandas is the DataFrame object, which represents a dataset.

Getting a Pandas DataFrame from a Dataset object is straightforward:

# Object representing our input dataset
cars = dataiku.Dataset("mycarsdataset")

# We read the whole dataset in a pandas dataframe
cars_df = cars.get_dataframe()

cars_df is a regular Pandas DataFrame, which you can manipulate using all the usual Pandas methods.
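For example, a minimal sketch of typical Pandas manipulations (the "origin" column and the "USA" value are purely illustrative; adapt them to your data):

# Filter rows: keep only the cars from a given origin
us_cars_df = cars_df[cars_df["origin"] == "USA"]

# Aggregate: count the number of cars per origin
counts = cars_df.groupby("origin").size()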

Writing a Pandas DataFrame to a dataset

Once you have used Pandas to manipulate the input data frame, you generally want to write it to the output dataset.

The Dataset object provides the write_from_dataframe method:

output_ds = dataiku.Dataset("myoutputdataset")
output_ds.write_from_dataframe(my_dataframe)

Writing the output schema

Generally, you should declare the schema of the output dataset prior to running the Python code. However, it is often impractical to do so, especially when you write data frames with many columns (or columns that change often). In that case, it can be useful for the Python script to actually modify the schema of the dataset.

The Dataset API provides a method to set the schema of the output dataset. When doing that, the schema of the dataset is modified each time the Python recipe is run. This must obviously be used with caution.

# Set the schema of 'myoutputdataset' to match the columns of the dataframe
output_ds.write_schema_from_dataframe(my_dataframe)

You can also write the schema and the dataframe at the same time:

# Write the schema from the dataframe and write the dataframe
output_ds.write_with_schema(my_dataframe)
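Putting these pieces together, a complete Pandas-based recipe might look like this (the dataset names are the illustrative ones used above, and the transformation is a placeholder):

import dataiku

# Read the input dataset into a Pandas DataFrame
cars = dataiku.Dataset("mycarsdataset")
cars_df = cars.get_dataframe()

# Transform with regular Pandas operations (illustrative)
cars_df["origin"] = cars_df["origin"].str.upper()

# Write both the schema and the data to the output dataset
output_ds = dataiku.Dataset("myoutputdataset")
output_ds.write_with_schema(cars_df)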

Using the native API

If the dataset does not fit in memory, you can also stream over the rows using a Python generator.

Reading

The Dataset object provides two streaming methods:

  • iter_rows iterates over the rows of the dataset, represented as dictionary-like objects.
  • iter_tuples iterates over the rows of the dataset, represented as tuples. Values are ordered according to the schema of the dataset.

from dataiku import Dataset
from collections import Counter

cars = Dataset("cars")

origin_count = Counter()

# iterate on the dataset. The row object is a dict-like object
# the dataset is "streamed" and it is not required to fit in RAM.
for row in cars.iter_rows():
    origin_count[row["origin"]] += 1
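iter_tuples works the same way, except that each row arrives as a tuple whose values follow the order of the dataset schema. A minimal sketch (we assume here that "origin" is the first column of the schema; adjust the index to your own schema):

# Same count as above, using positional access instead of a dict
for t in cars.iter_tuples():
    # t[0] is assumed to be the "origin" column
    origin_count[t[0]] += 1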

Writing

Writing the output dataset is done via a writer object returned by Dataset.get_writer:

output = Dataset("myoutputdataset")
writer = output.get_writer()

for (origin, count) in origin_count.items():
    writer.write_row_array((origin, count))

writer.close()

Note

Don’t forget to close your writer. If you don’t, your data will not get fully written.
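One way to guarantee that the writer is always closed, even if the recipe fails midway, is a plain Python try/finally block (this uses no additional API):

writer = output.get_writer()
try:
    for (origin, count) in origin_count.items():
        writer.write_row_array((origin, count))
finally:
    # close() runs even if write_row_array raises, flushing what was written
    writer.close()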

Setting the schema

As with data frames above, you should generally declare the schema of the output dataset prior to running the Python code. When that is impractical, for example if your code generates many columns or columns that change often, the Dataset API lets the Python script set the schema of the output dataset explicitly with write_schema. The schema of the dataset is then modified each time the Python recipe is run, so this must be used with caution.

output.write_schema([
    {
        "name": "origin",
        "type": "string",
    },
    {
        "name": "count",
        "type": "int",
    },
])
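Putting the streaming pieces together, a complete recipe that counts rows per origin without ever loading the dataset in RAM might look like this (dataset names are the illustrative ones used above):

from dataiku import Dataset
from collections import Counter

cars = Dataset("cars")
output = Dataset("myoutputdataset")

# Stream over the input: rows are read one by one, not held in RAM
origin_count = Counter()
for row in cars.iter_rows():
    origin_count[row["origin"]] += 1

# Declare the output schema, then stream the rows out
output.write_schema([
    {"name": "origin", "type": "string"},
    {"name": "count", "type": "int"},
])
writer = output.get_writer()
for (origin, count) in origin_count.items():
    writer.write_row_array((origin, count))
writer.close()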

When to use Python recipes

In many cases, you might want to use a Python recipe because you need to perform a custom operation on cells that is not possible using traditional Shaker processors.

In these cases, rather than creating a Python recipe, you should consider using a Python UDF within the Shaker.

For simple operations, a Python UDF has several advantages over a Python recipe:

  • You do not need an "intermediate" dataset after the Shaker operation, so you do not need to copy the whole data.
  • A Python UDF can be more efficient because it does not need to re-read the whole dataset; it works on the in-memory representation.
  • Python UDFs are automatically multithreaded when possible. With a Python recipe, handling multithreading is your responsibility.

However:

  • Python UDFs can only use a subset of the packages in the base Python installation and cannot use any other packages