Python function

This processor executes a custom Python function for each row.

It allows you to easily perform complex computations in a preparation script.

To use this processor, you write a process() Python function, which can modify, add, or remove rows.

Operation modes

The processor has three modes of operation:

  • ‘cell’: in this mode, the processor receives the data for a row and outputs the value for a single output column

  • ‘row’: in this mode, the processor receives the data for a row, and can modify in place all the values of the row

  • ‘rows’: in this mode, the processor receives the data for a row, and can output an arbitrary number of output rows. The input row is deleted and replaced by all rows returned by the function (so you can have 1->N processing).

Python vs Jython mode

Out of the box, the Python processor uses Jython, a reimplementation of Python in Java. This mode of operation provides good performance for simple operations. However, Jython only provides Python 2, only supports the standard Python library, and cannot use code defined in libraries.

The Python processor can also use a “real” Python process. In that case, you can use all normal Python packages and the code defined in libraries. Multiple versions of Python can be used, thanks to the code environment capabilities of Dataiku DSS.

To enable normal Python process mode, select the “Use a real Python process” checkbox.

Vectorized operation

When using a real Python process, vectorized operation using Pandas is possible. With vectorized operation, the processor receives rows in batches, each as a Pandas dataframe. The processor is still called multiple times, with batches of a few dozen to a few hundred records.

Vectorized operation provides much improved performance and is strongly recommended when using a real Python process.

Operation modes (non-vectorized)

Cell mode

In this mode, the process(row) function receives the input row as a dict, and must return a single value, which is used as the value of the output column.
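
As an illustration, here is a minimal sketch of a cell-mode function. The column names "first_name" and "last_name" are hypothetical, and the sketch assumes that missing or empty cells should be treated as empty strings.

```python
def process(row):
    # `row` is a dict mapping column names to cell values.
    # "first_name" and "last_name" are hypothetical column names.
    first = row.get("first_name") or ""
    last = row.get("last_name") or ""
    # The returned value is used as the value of the output column.
    return (first + " " + last).strip()
```

Outside of DSS, the function can be tried on a plain dict, for example process({"first_name": "Ada", "last_name": "Lovelace"}).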

Row mode

In this mode, the process(row) function receives the input row as a dict and returns a Python dictionary. All columns and values of the input row are replaced by the keys and values of the dictionary.

Modifying the input dictionary in place and returning it is supported.
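
A minimal sketch of a row-mode function that modifies the input dictionary in place and returns it. The column names "email" and "debug_info" are hypothetical and only serve the example.

```python
def process(row):
    # `row` is a dict mapping column names to cell values.
    # "email" and "debug_info" are hypothetical column names.
    email = row.get("email")
    if email:
        # Normalize the value of an existing column.
        row["email"] = email.strip().lower()
    # Keys removed from the dict are absent from the returned row.
    row.pop("debug_info", None)
    # Returning the modified input dict is supported.
    return row
```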

Rows mode

In this mode, the process(row) function receives the input row as a dict and must return an iterable of rows. The input row is deleted and replaced by all rows returned by the function (so you can have 1->N processing).

Returning the input dictionary is supported. However, if you want to return multiple rows, each output row must be a separate copy of the dictionary, made for example with copy.deepcopy.
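
The following sketch shows one way to write a rows-mode function that turns one input row into several output rows. It assumes a hypothetical "tags" column holding comma-separated values, and emits one copied row per tag.

```python
import copy


def process(row):
    # "tags" is a hypothetical column containing comma-separated values.
    raw = row.get("tags") or ""
    tags = [t.strip() for t in raw.split(",") if t.strip()]
    if not tags:
        # Nothing to split: return the input row unchanged.
        return [row]
    output_rows = []
    for tag in tags:
        # Each output row must be its own copy of the input dict.
        new_row = copy.deepcopy(row)
        new_row["tags"] = tag
        output_rows.append(new_row)
    return output_rows
```

Returning an empty list from such a function would, under the replacement rule described above, drop the input row.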

Operation modes (vectorized)

Cell mode

In this mode, the process(rows) function receives the input batch of rows as a pandas DataFrame, and must return a pandas Series with the same number of records, which will be used as the values of the output column.
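
A minimal sketch of a vectorized cell-mode function. The columns "price" and "quantity" are hypothetical, and the sketch assumes their values may arrive as strings, so they are converted to numbers first.

```python
import pandas as pd


def process(rows):
    # `rows` is a pandas DataFrame holding one batch of input rows.
    # "price" and "quantity" are hypothetical column names.
    price = pd.to_numeric(rows["price"], errors="coerce")
    quantity = pd.to_numeric(rows["quantity"], errors="coerce")
    # The result is a Series with one value per input row;
    # values that could not be parsed become NaN.
    return price * quantity
```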

Row mode

In this mode, the process(rows) function receives the input batch of rows as a pandas DataFrame, and must return a pandas DataFrame with the same number of records, which will be used as the output batch of rows.
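
A minimal sketch of a vectorized row-mode function. The "name" column is hypothetical and is assumed to contain strings. The function works on a copy of the batch and returns a DataFrame with as many rows as it received.

```python
import pandas as pd


def process(rows):
    # `rows` is a pandas DataFrame holding one batch of input rows.
    out = rows.copy()
    # "name" is a hypothetical column, assumed to contain strings.
    out["name"] = out["name"].str.strip().str.upper()
    # The returned DataFrame has the same number of rows as the input batch.
    return out
```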

Rows mode

In this mode, the process(rows) function receives the input batch of rows as a pandas DataFrame, and must return an indexed dictionary of vectors. You can build it either by modifying ‘rows’ or by returning a pandas DataFrame.

The inline help contains more details on this mode.
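
As a rough sketch of the pandas DataFrame option, the function below returns only some of the rows of the batch. The "amount" column is hypothetical, and resetting the index of the result is an assumption of this sketch, not a documented requirement; check the inline help for the exact expectations.

```python
import pandas as pd


def process(rows):
    # `rows` is a pandas DataFrame holding one batch of input rows.
    # "amount" is a hypothetical column; unparseable values become NaN.
    amount = pd.to_numeric(rows["amount"], errors="coerce")
    # Keep only rows with a strictly positive amount (NaN compares as False).
    kept = rows[amount > 0]
    # Assumption: a fresh 0..n-1 index is used for the returned batch.
    return kept.reset_index(drop=True)
```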

Help and code

When you select an operation mode, some sample code is automatically written so you only need to write your custom logic.

Restrictions

The Python function should remain a “streaming” operation. If you need more complex operations, such as sorting all the rows, joining, deduplicating, or grouping, you should create a data preparation recipe containing all previous steps, followed by a Python recipe.