Component: Feature preprocessor

This plugin component allows you to extend the feature handling capabilities of Visual Machine Learning with custom preprocessors for numerical and categorical features.

When a feature uses a plugin preprocessor, DSS calls Python code from the plugin during model training and scoring.

First example

To start adding feature preprocessors, use the plugin developer tools. In the Definition tab, click on + NEW COMPONENT, choose Feature Preprocessor, and enter the identifier for your new preprocessor.

This creates a folder under python-preprocessors containing:

preprocessor.json for the component descriptor
preprocessor.py for the Python implementation

A basic descriptor looks like:

{
    "meta": {
        "label": "Custom preprocessor normalize-score",
        "description": "Normalize a numeric feature with a z-score transform",
        "icon": "fas fa-puzzle-piece"
    },
    "featureTypes": ["NUMERIC"],
    "params": []
}

and the corresponding Python code:

import pandas as pd

class CustomPreprocessor(object):
    def __init__(self, params, feature_type=None):
        self.params = params
        self.feature_type = feature_type
        self.mean = 0.0
        self.std = 1.0

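    def fit(self, series):
        # Learn the statistics from the training series only
        self.mean = series.mean()
        std = series.std()
        # Fall back to 1.0 when the deviation is zero or undefined (NaN),
        # since "NaN or 1.0" would keep the NaN
        self.std = 1.0 if pd.isna(std) or std == 0 else std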

    def transform(self, series):
        values = (series - self.mean) / self.std
        return pd.DataFrame({"zscore": values})

Once the plugin component is added, it is available in the Feature handling tab for compatible numerical or categorical features through the Plugin preprocessing option.

Descriptor

Each feature preprocessor descriptor has the following structure:

meta: Description of the component. Similar to other plugin components.
featureTypes: List of feature types for which the preprocessor is available. Supported values are "NUMERIC" and "CATEGORY".
params: List of plugin parameters exposed in the Feature handling UI, using the same parameter definitions as other plugin components.
paramsPythonSetup: Optional Python file in the plugin resource folder, used with the standard plugin parameter mechanism to dynamically populate SELECT or MULTISELECT parameters if their definition uses "getChoicesFromPython": true.

If any parameter uses "getChoicesFromPython": true, then paramsPythonSetup is mandatory and must point to a Python file. Each component of a plugin can reference its own file.

When paramsPythonSetup is used, the referenced Python file must implement a do(payload, config, plugin_config, inputs) function returning a list of {"value": ..., "label": ...} dictionaries.

For feature preprocessor parameters, the payload passed to do includes the standard fields used by dynamic choices:

parameterType: the parameter type, for example SELECT or MULTISELECT
parameterName: the name of the parameter being populated, useful when multiple parameters use "getChoicesFromPython": true and thus share the same callback function
customChoices: boolean flag indicating that the call is requesting dynamic choices
rootModel: the current parameter model

In addition, feature preprocessors receive extra context in the payload:

dataset: smart name of the input dataset of the Visual ML task
column: name of the feature currently being configured
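As a minimal sketch of such a callback, the function below dispatches on parameterName and uses the dataset and column fields to build labels. It follows the return shape described above (a list of value/label dictionaries). The parameter names scaling_mode and clip_mode, and the choice values, are hypothetical and only serve the illustration.

```python
def do(payload, config, plugin_config, inputs):
    """Return dynamic choices for SELECT / MULTISELECT parameters."""
    parameter_name = payload.get("parameterName")
    # Extra context passed to feature preprocessors
    dataset = payload.get("dataset", "")
    column = payload.get("column", "")
    target = "{}.{}".format(dataset, column)

    # "scaling_mode" is a hypothetical parameter name
    if parameter_name == "scaling_mode":
        return [
            {"value": "zscore", "label": "Z-score of {}".format(target)},
            {"value": "minmax", "label": "Min-max of {}".format(target)},
        ]

    # "clip_mode" is a hypothetical parameter name
    if parameter_name == "clip_mode":
        return [
            {"value": "none", "label": "Do not clip"},
            {"value": "quantile", "label": "Clip to quantiles"},
        ]

    # Unknown parameter: no choices to offer
    return []
```

Dispatching on parameterName keeps a single do function usable by several parameters of the same component.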

Python object

The preprocessor.py file must define a CustomPreprocessor class.

Its constructor receives:

params: dictionary of parameter values configured by the user in the UI
feature_type: feature type for which the preprocessor is used. Possible values are "NUMERIC" and "CATEGORY"

The class must implement:

fit(self, series): called on a Pandas series during training
transform(self, series): called on a Pandas series during training and scoring
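To illustrate how the constructor arguments and the two methods fit together, here is a sketch of a numeric preprocessor that clips values to quantiles learned during fit. The parameter names lower_quantile and upper_quantile are hypothetical: they would need to be declared in the params list of the descriptor.

```python
import pandas as pd


class CustomPreprocessor(object):
    def __init__(self, params, feature_type=None):
        # "lower_quantile" and "upper_quantile" are hypothetical parameter names
        self.lower_q = float(params.get("lower_quantile", 0.01))
        self.upper_q = float(params.get("upper_quantile", 0.99))
        self.feature_type = feature_type
        self.lower = None
        self.upper = None

    def fit(self, series):
        # Learn the clipping bounds from the training series
        self.lower = series.quantile(self.lower_q)
        self.upper = series.quantile(self.upper_q)

    def transform(self, series):
        # Apply the learned bounds to training or scoring data
        clipped = series.clip(lower=self.lower, upper=self.upper)
        return pd.DataFrame({"clipped": clipped})
```

The bounds are stored as attributes in fit so that transform can reuse them unchanged on new data.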

The transform method must return one of:

a Pandas DataFrame
a NumPy ndarray
a SciPy csr_matrix

Both single-column and multi-column outputs are supported.
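As an example of a multi-column sparse output, the sketch below one-hot encodes a categorical feature and returns a SciPy csr_matrix with one column per category seen during fit. The encoding strategy, including the choice to map unseen or missing categories to an all-zero row, is an assumption made for the illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


class CustomPreprocessor(object):
    """Illustrative one-hot encoder returning a sparse matrix."""

    def __init__(self, params, feature_type=None):
        self.params = params
        self.feature_type = feature_type
        self.categories = []

    def fit(self, series):
        # Remember the categories seen at training time, in a stable order
        self.categories = sorted(series.dropna().astype(str).unique())

    def transform(self, series):
        index = {cat: i for i, cat in enumerate(self.categories)}
        rows, cols = [], []
        for row, value in enumerate(series.astype(str)):
            col = index.get(value)
            if col is not None:  # unseen categories yield an all-zero row
                rows.append(row)
                cols.append(col)
        data = np.ones(len(rows))
        shape = (len(series), len(self.categories))
        return csr_matrix((data, (rows, cols)), shape=shape)
```

Fixing the column order in fit keeps the output width and column meaning identical between training and scoring.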

Execution model

Feature preprocessors are applied to one feature at a time.

At training time, DSS:

instantiates CustomPreprocessor
calls fit on the feature’s training series
serializes the fitted preprocessor with the model
calls transform to generate the preprocessed features

At scoring time, DSS reloads (deserializes) the fitted preprocessor and calls transform on the incoming series.

The preprocessor receives the feature after the standard missing-value handling selected in Feature handling has been applied to it.
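The training and scoring sequence described above can be imitated locally to check that a preprocessor survives a serialization round trip. This is only a sketch: the use of pickle as a stand-in for the serialization performed by DSS is an assumption, and the sample data is invented for the illustration.

```python
import pickle

import pandas as pd


class CustomPreprocessor(object):
    def __init__(self, params, feature_type=None):
        self.params = params
        self.feature_type = feature_type
        self.mean = 0.0
        self.std = 1.0

    def fit(self, series):
        self.mean = series.mean()
        std = series.std()
        # Fall back to 1.0 when the deviation is zero or undefined
        self.std = 1.0 if pd.isna(std) or std == 0 else std

    def transform(self, series):
        return pd.DataFrame({"zscore": (series - self.mean) / self.std})


# Training time: instantiate, fit, serialize, transform
train = pd.Series([1.0, 2.0, 3.0])
preprocessor = CustomPreprocessor(params={}, feature_type="NUMERIC")
preprocessor.fit(train)
blob = pickle.dumps(preprocessor)  # assumption: stand-in for DSS serialization
train_features = preprocessor.transform(train)

# Scoring time: reload the fitted object and transform incoming data
restored = pickle.loads(blob)
scored = restored.transform(pd.Series([2.0, 4.0]))
print(scored)
```

Under this assumption, anything learned in fit should be kept in plain attributes that can be serialized, so that the reloaded object produces the same transformation at scoring time.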

Output naming

Output feature names follow the pattern plugin:{plugin_component_id}:{column_name}:{suffix}, with the suffix depending on the type returned by transform:

For a NumPy array, SciPy sparse matrix, or unnamed Pandas Series, unnamed_{0,1,2,...}.
For a Pandas DataFrame, the column names.
For a named Pandas Series, the series name.
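The naming rules above can be expressed as a small helper, which is handy for predicting feature names when testing a preprocessor outside DSS. The function expected_output_names is hypothetical and is not part of any DSS API. The component identifier normalize-score and the column name age used in the usage lines are also assumptions.

```python
import numpy as np
import pandas as pd


def expected_output_names(plugin_component_id, column_name, output):
    """Predict output feature names from the documented pattern (hypothetical helper)."""
    prefix = "plugin:{}:{}".format(plugin_component_id, column_name)
    if isinstance(output, pd.DataFrame):
        # DataFrame: the suffixes are the column names
        suffixes = [str(c) for c in output.columns]
    elif isinstance(output, pd.Series) and output.name is not None:
        # Named Series: the suffix is the series name
        suffixes = [str(output.name)]
    else:
        # NumPy array, SciPy sparse matrix or unnamed Series
        n_columns = output.shape[1] if len(output.shape) > 1 else 1
        suffixes = ["unnamed_{}".format(i) for i in range(n_columns)]
    return ["{}:{}".format(prefix, s) for s in suffixes]


frame = pd.DataFrame({"zscore": [0.1, -0.3]})
print(expected_output_names("normalize-score", "age", frame))
print(expected_output_names("normalize-score", "age", np.zeros((2, 2))))
```

Returning a DataFrame with explicit column names gives more readable feature names than the unnamed_ suffixes produced for arrays.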

Versioning

The model stores the plugin identifier, component identifier, and plugin version used at training time.

If the installed plugin version changes after training, DSS keeps using the component but logs a warning about the version mismatch.

Compatibility and limitations

Feature preprocessors currently have the following scope and limitations:

They apply only to numerical and categorical input features.
They are available through the Plugin preprocessing option in Visual ML feature handling.
They can output dense or sparse features. Sparse features may be densified during training and scoring depending on the compatibility of the selected model and scoring engine.
Models using a plugin preprocessor can be deployed on API nodes. The necessary plugin files are bundled into the API service package.
Models using a plugin preprocessor are not compatible with Java optimized scoring, SQL scoring, or model exports to Python (through the dataiku-scoring package), Java, MLflow, PMML, or Snowflake functions. See Exporting models.
Ensembling is not supported when any feature uses a plugin preprocessor.

Usage of a code-env

Feature preprocessors do not use a dedicated plugin code environment, as they run in the same process as the model training/scoring.

If the preprocessor depends on external Python packages, install them in the code environment used to train and score the model. For API node deployments, the API node environment must also contain these dependencies.