Component: Feature preprocessor¶
This plugin component allows you to extend the feature handling capabilities of Visual Machine Learning with custom preprocessors for numerical and categorical features.
When a feature uses a plugin preprocessor, DSS calls Python code from the plugin during model training and scoring.
First example¶
To start adding feature preprocessors, use the plugin developer tools. In the Definition tab, click on + NEW COMPONENT, choose Feature Preprocessor, and enter the identifier for your new preprocessor.
This creates a folder under python-preprocessors containing:
preprocessor.json for the component descriptor
preprocessor.py for the Python implementation
A basic descriptor looks like:
{
    "meta": {
        "label": "Custom preprocessor normalize-score",
        "description": "Normalize a numeric feature with a z-score transform",
        "icon": "fas fa-puzzle-piece"
    },
    "featureTypes": ["NUMERIC"],
    "params": []
}
and the corresponding Python code:
import pandas as pd


class CustomPreprocessor(object):
    def __init__(self, params, feature_type=None):
        self.params = params
        self.feature_type = feature_type
        self.mean = 0.0
        self.std = 1.0

    def fit(self, series):
        # Learn the statistics on the training series only
        self.mean = series.mean()
        std = series.std()
        # Fall back to 1.0 when the deviation is zero or undefined (NaN)
        self.std = std if pd.notna(std) and std != 0 else 1.0

    def transform(self, series):
        values = (series - self.mean) / self.std
        return pd.DataFrame({"zscore": values})
Once the plugin component is added, it is available in the Feature handling tab for compatible numerical or categorical features through the Plugin preprocessing option.
Descriptor¶
Each feature preprocessor descriptor has the following structure:
meta: Description of the component. Similar to other plugin components.
featureTypes: List of feature types for which the preprocessor is available. Supported values are "NUMERIC" and "CATEGORY".
params: List of plugin parameters exposed in the Feature handling UI, using the same parameter definitions as other plugin components.
paramsPythonSetup: Optional Python file in the plugin resource folder, used with the standard plugin parameter mechanism to dynamically populate SELECT or MULTISELECT parameters if their definition uses "getChoicesFromPython": true.
If any parameter uses "getChoicesFromPython": true, then paramsPythonSetup is mandatory and must point to a Python file. There can be as many files as there are components in a plugin.
When paramsPythonSetup is used, the referenced Python file must implement a do(payload, config, plugin_config, inputs) function returning a list of {"value": ..., "label": ...} dictionaries.
For feature preprocessor parameters, the payload passed to do includes the standard fields used by dynamic choices:
parameterType: the parameter type, for example SELECT or MULTISELECT
parameterName: the name of the parameter being populated, useful when multiple parameters use "getChoicesFromPython": true and thus share the same callback function
customChoices: boolean flag indicating that the call is requesting dynamic choices
rootModel: the current parameter model
In addition, feature preprocessors receive extra context in the payload:
dataset: smart name of the input dataset of the Visual ML task
column: name of the feature currently being configured
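As an illustration of the callback described above, here is a minimal sketch of a file referenced by paramsPythonSetup. The parameter name "strategy" and the choice values are hypothetical, and the return shape simply follows the list-of-dictionaries format stated above; verify it against your DSS version before relying on it.

```python
# Hypothetical resource file referenced by paramsPythonSetup.
# The parameter name "strategy" and its choices are made up for this sketch.

def do(payload, config, plugin_config, inputs):
    """Return the choices for the parameter named in the payload."""
    parameter_name = payload.get("parameterName")
    column = payload.get("column")  # feature currently being configured

    if parameter_name == "strategy":
        return [
            {"value": "zscore", "label": "Z-score ({})".format(column)},
            {"value": "minmax", "label": "Min-max ({})".format(column)},
        ]

    # Unknown parameter: offer no choices rather than failing.
    return []
```

Branching on parameterName keeps a single do function usable when several parameters of the same component request dynamic choices.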
Python object¶
The preprocessor.py file must define a CustomPreprocessor class.
Its constructor receives:
params: dictionary of parameter values configured by the user in the UI
feature_type: feature type for which the preprocessor is used. Possible values are "NUMERIC" and "CATEGORY"
The class must implement:
fit(self, series): called on a Pandas series during training
transform(self, series): called on a Pandas series during training and scoring
The transform method must return one of:
a Pandas DataFrame
a NumPy ndarray
a SciPy csr_matrix
Both single-column and multi-column outputs are supported.
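To make the multi-column case concrete, the following sketch returns a two-column NumPy ndarray from transform. It encodes a periodic numeric feature (for example an hour of day) as sine and cosine columns. The "period" parameter is a hypothetical name chosen for this example and would need a matching entry in the descriptor's params.

```python
import numpy as np
import pandas as pd


class CustomPreprocessor(object):
    """Encode a periodic numeric feature as (sin, cos) columns."""

    def __init__(self, params, feature_type=None):
        # "period" is a hypothetical parameter name used for this sketch.
        self.period = float(params.get("period", 24))
        self.feature_type = feature_type

    def fit(self, series):
        # Nothing to learn: the transform depends only on the period.
        pass

    def transform(self, series):
        angle = 2.0 * np.pi * series.to_numpy(dtype=float) / self.period
        # Shape (n_rows, 2): one sine column and one cosine column.
        return np.column_stack([np.sin(angle), np.cos(angle)])
```

Returning a DataFrame instead would let you choose the column names that appear in the generated feature names.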
Execution model¶
Feature preprocessors are applied to one feature at a time.
At training time, DSS:
instantiates CustomPreprocessor
calls fit on the feature’s training series
serializes the fitted preprocessor with the model
calls transform to generate the preprocessed features
At scoring time, DSS reloads (deserializes) the fitted preprocessor and calls transform on the incoming series.
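The sequence above can be mimicked outside DSS to check that a fitted preprocessor survives a serialization round trip. This sketch uses pickle as a stand-in, which is an assumption: the serialization mechanism actually used by DSS is not specified here. The toy class below is only for illustration.

```python
import pickle

import pandas as pd


class CustomPreprocessor(object):
    """Toy preprocessor: center a numeric feature on its training mean."""

    def __init__(self, params, feature_type=None):
        self.params = params
        self.feature_type = feature_type
        self.mean = 0.0

    def fit(self, series):
        self.mean = float(series.mean())

    def transform(self, series):
        return pd.DataFrame({"centered": series - self.mean})


# Training time: instantiate, fit, serialize, transform.
train = pd.Series([1.0, 2.0, 3.0])
preprocessor = CustomPreprocessor({}, feature_type="NUMERIC")
preprocessor.fit(train)
blob = pickle.dumps(preprocessor)  # pickle is an assumption of this sketch
train_features = preprocessor.transform(train)

# Scoring time: reload the fitted object and call transform only (no fit).
reloaded = pickle.loads(blob)
scored = reloaded.transform(pd.Series([4.0]))
```

Because fit is not called again at scoring time, everything transform needs must be stored on the instance during fit, in a form that can be serialized.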
The preprocessor receives the feature after the standard missing-value handling selected in Feature handling has been applied to it.
Output naming¶
Output feature names follow the pattern plugin:{plugin_component_id}:{column_name}:{suffix}, with the suffix depending on the type returned by transform:
For a NumPy array, SciPy sparse matrix, or unnamed Pandas Series, unnamed_{0,1,2,...}.
For a Pandas DataFrame, the column names.
For a named Pandas Series, the series name.
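The naming rules above can be sketched as a small helper. This is an illustration of the documented pattern, not the code DSS runs internally, and output_feature_names is a hypothetical function name.

```python
import numpy as np
import pandas as pd


def output_feature_names(plugin_component_id, column_name, result):
    """Build names following plugin:{plugin_component_id}:{column_name}:{suffix}."""
    if isinstance(result, pd.DataFrame):
        suffixes = [str(c) for c in result.columns]
    elif isinstance(result, pd.Series) and result.name is not None:
        suffixes = [str(result.name)]
    else:
        # NumPy arrays, SciPy sparse matrices and unnamed Series:
        # one "unnamed_i" suffix per output column.
        shape = result.shape
        n_columns = shape[1] if len(shape) > 1 else 1
        suffixes = ["unnamed_{}".format(i) for i in range(n_columns)]
    return [
        "plugin:{}:{}:{}".format(plugin_component_id, column_name, s)
        for s in suffixes
    ]
```

For instance, the z-score preprocessor from the first example, applied to a column named age, would yield a single feature whose name ends with :age:zscore.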
Versioning¶
The model stores the plugin identifier, component identifier, and plugin version used at training time.
If the installed plugin version changes after training, DSS keeps using the component but logs a warning about the version mismatch.
Compatibility and limitations¶
Feature preprocessors currently have the following scope and limitations:
They apply only to numerical and categorical input features.
They are available through the Plugin preprocessing option in Visual ML feature handling.
They can output dense or sparse features. Sparse features may be densified during training and scoring depending on the compatibility of the selected model and scoring engine.
Models using a plugin preprocessor can be deployed on API nodes. The necessary plugin files are bundled into the API service package.
Models using a plugin preprocessor are not compatible with Java optimized scoring, SQL scoring, model exports to Python (through the dataiku-scoring package), Java, MLflow, or PMML, or with Snowflake functions. See Exporting models.
Ensembling is not supported when any feature uses a plugin preprocessor.
Usage of a code-env¶
Feature preprocessors do not use a dedicated plugin code environment, as they run in the same process as the model training/scoring.
If the preprocessor depends on external Python packages, install them in the code environment used to train and score the model. For API node deployments, the API node environment must also contain these dependencies.
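One way to check that an environment contains the packages a preprocessor needs is to test whether they can be imported there, for example from a notebook running in the model's code environment. The helper below is a hypothetical convenience, not part of the DSS API, and it expects top-level package names.

```python
import importlib.util


def missing_packages(required):
    """Return the top-level package names in `required` that cannot be imported."""
    return [name for name in required if importlib.util.find_spec(name) is None]
```

Calling missing_packages(["scipy", "sklearn"]) returns an empty list when both packages are importable in the current environment.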