Component: Preparation Processor

Preparation processors are additional steps you can add to a Prepare recipe Script.

To create a new Preparation processor, we recommend that you use the plugin developer tools (see the tutorial for an introduction). In the Definition tab, click on “+ ADD COMPONENT”, choose “PREPARATION PROCESSOR”, and enter the identifier for your new processor. This creates a new folder named after your identifier, containing two files: processor.json (the descriptor) and processor.py (the implementation).

A basic processor definition looks like:

{
    "meta" : {
        "label" : "Custom processor",
        "description" : "",
        "icon" : "icon-puzzle-piece"
    },
    "mode": "CELL",
    "params" : [
        {
            "name": "param1",
            "label": "Parameter 1",
            "type": "STRING",
            "description": "Some documentation for parameter1",
            "mandatory": true
        }
    ]
}

A basic implementation looks like:

def process(row):
    # row is a dict of the row on which the step is applied
    param1_value = params.get('param1')
    return param1_value

The “meta” field works the same way as in all other kinds of DSS components.

Output single column

When mode is set to CELL in the descriptor, the preparation processor outputs a single column.

To generate the values for this output column, DSS calls the process function of the processor once for each row of the dataset, and stores the returned value in the output column for that row.

The following implementation creates a new column containing a salutation message using the Name column in the input dataset.

def process(row):
    return "Dear " + row['Name']
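
To try this logic outside DSS, the per-row calling convention described above can be imitated with a small harness. This is a minimal sketch, not DSS code: `apply_cell_processor`, the output column name and the sample rows are hypothetical names introduced here. It relies only on what the text states, namely that each row is passed as a dict and the returned value is stored in the output column.

```python
def process(row):
    # Same implementation as above: build a salutation from the Name column.
    return "Dear " + row['Name']


def apply_cell_processor(process_fn, rows, output_column):
    """Hypothetical stand-in for DSS in CELL mode: call process_fn once per
    row and store the returned value in output_column of a copy of that row."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[output_column] = process_fn(row)
        result.append(new_row)
    return result


# Made-up sample data, for illustration only.
sample_rows = [{"Name": "Ada"}, {"Name": "Grace"}]
prepared = apply_cell_processor(process, sample_rows, "Salutation")
for row in prepared:
    print(row)
```

A harness like this is only a convenience for quick local checks; the behavior of the real step should still be verified in a Prepare recipe.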

To allow end-users to select an input column, you add a parameter of type COLUMN.

{
    "meta" : {
        "label" : "Custom processor (cell)",
        "description" : "",
        "icon" : "icon-puzzle-piece"
    },
    "mode": "CELL",
    "params" : [
        {
            "name": "input_column",
            "label": "Input column",
            "type": "COLUMN",
            "description": "Column containing the name of the person",
            "columnRole": "main",
            "mandatory": true
        }
    ]
}

The implementation then reads the name of the selected column from the parameters:

def process(row):
    input_column = params.get('input_column')
    return "Dear " + row[input_column]
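
The concatenation above raises an error if the selected column is absent from the row or holds a value that is not a string. One possible defensive variant is sketched below. It assumes, purely for illustration, that an empty string is an acceptable output for such rows, and it defines `params` as a plain dict only so that the snippet runs on its own (inside processor.py, `params` is used as in the examples above).

```python
# Stand-in for the params object used inside processor.py;
# defined here only so the sketch is runnable outside DSS.
params = {"input_column": "Name"}


def process(row):
    input_column = params.get('input_column')
    # dict.get returns None instead of raising when the column is absent.
    name = row.get(input_column)
    # Assumption: rows with a missing or empty name get an empty output cell.
    if name is None or name == "":
        return ""
    # str() guards against non-string cell values.
    return "Dear " + str(name)


print(process({"Name": "Ada"}))
print(repr(process({"Name": None})))
```

Whether an empty cell, a placeholder, or an error is the right outcome for incomplete rows depends on the dataset; the point is only that the choice is made explicitly.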

Output multiple columns

To output or modify more than one column, mode must be set to ROW in the descriptor.

The implementation of your process function must return a dict in which each key/value pair represents a cell in the row. You will usually return the row object received as argument, after applying your modifications to it.

To configure the names of the output columns, you add one parameter per output column.

{
    "meta" : {
        "label" : "Custom processor (row)",
        "description" : "",
        "icon" : "icon-puzzle-piece"
    },
    "mode": "ROW",
    "params" : [
        {
            "name": "input_column",
            "label": "Input column",
            "type": "COLUMN",
            "description": "Column containing the name of the person",
            "columnRole": "main",
            "mandatory": true
        },
        {
            "name": "salutation_column",
            "label": "Salutation column",
            "type": "COLUMN",
            "description": "Output for salutation message",
            "columnRole": "output_salutation"
        },
        {
            "name": "greeting_column",
            "label": "Greeting column",
            "type": "COLUMN",
            "description": "Output column for greeting message",
            "columnRole": "output_greeting"
        }
    ]
}

For example, to generate 2 additional columns containing a salutation and a greeting message for each person in the dataset, you would use the above descriptor and this implementation:

def process(row):
    input_column = params.get('input_column')

    salutation_column = params.get('salutation_column')
    if salutation_column is not None and salutation_column != "":
        row[salutation_column] = "Dear " + row[input_column]

    greeting_column = params.get('greeting_column')
    if greeting_column is not None and greeting_column != "":
        row[greeting_column] = "Hello " + row[input_column]

    return row
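
The effect of leaving one of the output parameters empty can be checked with a standalone run. The sketch below is an illustration under stated assumptions: `params` is a plain dict standing in for the object used inside processor.py, its values are made up, and the two emptiness checks are written with the shorter truthiness test, which treats None and "" the same way as the explicit comparisons above.

```python
# Stand-in for the params object used inside processor.py. The values
# mimic a user who filled in only the salutation column (made-up data).
params = {
    "input_column": "Name",
    "salutation_column": "Salutation",
    "greeting_column": "",
}


def process(row):
    input_column = params.get('input_column')

    salutation_column = params.get('salutation_column')
    if salutation_column:  # false for both None and ""
        row[salutation_column] = "Dear " + row[input_column]

    greeting_column = params.get('greeting_column')
    if greeting_column:  # empty here, so no greeting cell is written
        row[greeting_column] = "Hello " + row[input_column]

    return row


result = process({"Name": "Ada"})
print(result)
```

Only the salutation cell is added in this run, because the greeting parameter is empty.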

Using code environment for a processor

To use the code environment defined for the processor, add the “useKernel” parameter, set to true, at the top level of processor.json (next to “meta”, “mode” and “params”).

The updated processor definition looks like:

{
    "meta" : {
        "label" : "Custom processor",
        "description" : "",
        "icon" : "icon-puzzle-piece"
    },
    "mode": "CELL",
    "params" : [
        {
            "name": "param1",
            "label": "Parameter 1",
            "type": "STRING",
            "description": "Some documentation for parameter1",
            "mandatory": true
        }
    ],
    "useKernel" : true
}

Reload the plugin after making this change so that it takes effect.
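
A misplaced “useKernel” key (for instance nested inside “params”) is easy to overlook. The check below is a hedged sketch for catching that before reloading: `uses_kernel` is a hypothetical helper, not part of DSS, and the snippet assumes the descriptor is strict JSON without comments, as in the example above.

```python
import json

# Descriptor text copied from the example above. In practice it would be
# read from the processor.json file in the processor's folder.
descriptor_text = """
{
    "meta" : {
        "label" : "Custom processor",
        "description" : "",
        "icon" : "icon-puzzle-piece"
    },
    "mode": "CELL",
    "params" : [
        {
            "name": "param1",
            "label": "Parameter 1",
            "type": "STRING",
            "description": "Some documentation for parameter1",
            "mandatory": true
        }
    ],
    "useKernel" : true
}
"""


def uses_kernel(descriptor):
    """Hypothetical helper: True only if useKernel is set to true at the
    top level of the descriptor."""
    return descriptor.get("useKernel") is True


descriptor = json.loads(descriptor_text)
print(uses_kernel(descriptor))
```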