Machine learning

Through the public API, the Python client allows you to automate all aspects of the lifecycle of machine learning models:

  • Creating a visual analysis and ML task
  • Tuning settings
  • Training models
  • Inspecting model details and results
  • Deploying saved models to the Flow and retraining them

Concepts

In DSS, you train models as part of a visual analysis. A visual analysis is made of a preparation script and one or several ML Tasks.

An ML Task is an individual section in which you train models. An ML Task is either a prediction of a single target variable, or a clustering.

The ML API allows you to manipulate ML Tasks, and use them to train models, inspect their details, and deploy them to the Flow.

Once deployed to the Flow, the Saved model can be retrained by the usual build mechanism of DSS.

An ML Task has settings, which control:

  • Which features are active
  • The preprocessing settings for each feature
  • Which algorithms are active
  • The hyperparameter settings (including grid searched hyperparameters) for each algorithm
  • The settings of the grid search
  • Train/Test splitting settings
  • Feature selection and generation settings

Usage samples

The whole cycle

This example creates a prediction ML task, enables an algorithm, trains models, inspects their details, and deploys one of them to the Flow:

# client is a DSS API client

p = client.get_project("MYPROJECT")

# Create a new ML Task to predict the variable "target" from "trainset"
mltask = p.create_prediction_ml_task(
    input_dataset="trainset",
    target_variable="target",
    ml_backend_type='PY_MEMORY', # ML backend to use
    guess_policy='DEFAULT' # Template to use for setting default parameters
)

# Wait for the ML task to be ready
mltask.wait_guess_complete()

# Obtain settings, enable GBT, save settings
settings = mltask.get_settings()
settings.set_algorithm_enabled("GBT_CLASSIFICATION", True)
settings.save()

# Start train and wait for it to be complete
mltask.start_train()
mltask.wait_train_complete()

# Get the identifiers of the trained models
# There will be 3 of them, because Logistic regression and Random forest are enabled by default
ids = mltask.get_trained_models_ids()

for model_id in ids:
    details = mltask.get_trained_model_details(model_id)
    algorithm = details.get_modeling_settings()["algorithm"]
    auc = details.get_performance_metrics()["auc"]

    print("Algorithm=%s AUC=%s" % (algorithm, auc))

# Let's deploy the first model
model_to_deploy = ids[0]

ret = mltask.deploy_to_flow(model_to_deploy, "my_model", "trainset")

print("Deployed to saved model id = %s train recipe = %s" % (ret["savedModelId"], ret["trainRecipeName"]))

The methods for creating prediction and clustering ML tasks are defined at dataikuapi.dss.project.DSSProject.create_prediction_ml_task() and dataikuapi.dss.project.DSSProject.create_clustering_ml_task().

Obtaining a handle to an existing ML Task

When you create these ML tasks, the returned dataikuapi.dss.ml.DSSMLTask object contains two fields, analysis_id and mltask_id, that can later be used to retrieve the same DSSMLTask object:

# client is a DSS API client

p = client.get_project("MYPROJECT")
mltask = p.get_ml_task(analysis_id, mltask_id)
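For example, a minimal sketch that persists the identifiers at creation time and reuses them later:

# Create the task and keep its identifiers for later reuse
mltask = p.create_prediction_ml_task(
    input_dataset="trainset",
    target_variable="target"
)
analysis_id = mltask.analysis_id
mltask_id = mltask.mltask_id

# ... later, possibly in another process ...
same_mltask = p.get_ml_task(analysis_id, mltask_id)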

Tuning feature preprocessing

Enabling and disabling features

# mltask is a DSSMLTask object

settings = mltask.get_settings()

settings.reject_feature("not_useful")
settings.use_feature("useful")

settings.save()

Changing advanced parameters for a feature

# mltask is a DSSMLTask object

settings = mltask.get_settings()

# Use impact coding rather than dummy-coding
fs = settings.get_feature_preprocessing("mycategory")
fs["category_handling"] = "IMPACT"

# Impute missing with most frequent value
fs["missing_handling"] = "IMPUTE"
fs["missing_impute_with"] = "MODE"

settings.save()

Tuning algorithms
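Algorithm settings are manipulated through the settings object: fetch the settings, enable an algorithm, modify its settings dict in place, then save. A minimal sketch, using the LOGISTIC_REGRESSION parameters documented in the Algorithm details section below (keys are algorithm-dependent, so print the returned dict to discover them):

# mltask is a DSSMLTask object

settings = mltask.get_settings()

# Enable logistic regression and fetch its settings (a dict, modified in place)
settings.set_algorithm_enabled("LOGISTIC_REGRESSION", True)
lr_settings = settings.get_algorithm_settings("LOGISTIC_REGRESSION")

# Array-valued settings are grid-searched: every listed value will be tested
lr_settings["C"] = [0.01, 0.1, 1.0]

settings.save()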

API Reference

Interaction with a ML Task

class dataikuapi.dss.ml.DSSMLTask(client, project_key, analysis_id, mltask_id)

A handle to interact with an ML Task for prediction or clustering in a DSS visual analysis

delete()

Deletes the present ML task

wait_guess_complete()

Waits for guess to be complete. This should be called immediately after the creation of a new ML Task (if the ML Task was created with wait_guess_complete=False), before calling get_settings or train

get_status()

Gets the status of this ML Task

Returns: a dict
get_settings()

Gets the settings of this ML Task

Returns: a DSSMLTaskSettings object to interact with the settings
Return type: dataikuapi.dss.ml.DSSMLTaskSettings
train(session_name=None, session_description=None)

Trains models for this ML Task

Parameters:
  • session_name (str) – name for the session
  • session_description (str) – description for the session

This method waits for train to complete. If you want to train asynchronously, use start_train() and wait_train_complete()

This method returns the list of trained model identifiers. It returns models that have been trained for this train session, not all trained models for this ML task. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids()

These identifiers can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow()

Returns: A list of model identifiers
Return type: list of strings
ensemble(model_ids=[], method=None)

Create an ensemble model of a set of models

Parameters:
  • model_ids (list) – A list of model identifiers
  • method (str) – the ensembling method. One of: AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL

This method waits for the ensemble train to complete. If you want to train asynchronously, use start_ensembling() and wait_train_complete()

This method returns the identifier of the trained ensemble. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids()

This identifier can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow()

Returns: A model identifier
Return type: string
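For example, a minimal sketch (model identifiers as returned by train() or get_trained_models_ids()):

# ids is a list of model identifiers
ensemble_id = mltask.ensemble(model_ids=ids, method="PROBA_AVERAGE")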
start_train(session_name=None, session_description=None)

Starts asynchronously a new train session for this ML Task.

Parameters:
  • session_name (str) – name for the session
  • session_description (str) – description for the session

This returns immediately, before train is complete. To wait for train to complete, use wait_train_complete()
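For example, a minimal asynchronous sketch:

# Launch the session, do other work, then block until training finishes
mltask.start_train(session_name="nightly session")
# ... other work ...
mltask.wait_train_complete()
ids = mltask.get_trained_models_ids()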

start_ensembling(model_ids=[], method=None)

Asynchronously creates a new ensemble model from a set of models.

Parameters:
  • model_ids (list) – A list of model identifiers
  • method (str) – the ensembling method (AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL)

This returns immediately, before train is complete. To wait for train to complete, use wait_train_complete()

Returns: the model identifier of the ensemble
Return type: string
wait_train_complete()

Waits for train to be complete.

get_trained_models_ids(session_id=None, algorithm=None)

Gets the list of trained model identifiers for this ML task.

Parameters:
  • session_id (str) – Optional. If set, only return models trained in this session
  • algorithm (str) – Optional. If set, only return models trained with this algorithm

These identifiers can be used for get_trained_model_snippet() and deploy_to_flow()

Returns: A list of model identifiers
Return type: list of strings
get_trained_model_snippet(id=None, ids=None)

Gets a quick summary of a trained model, as a dict. For complete information and a structured object, use get_trained_model_details()

Parameters:
  • id (str) – a model id
  • ids (list) – a list of model ids
Return type: dict

get_trained_model_details(id)

Gets details for a trained model

Parameters: id (str) – Identifier of the trained model, as returned by get_trained_models_ids()
Returns: A DSSTrainedModelDetails representing the details of this trained model id
Return type: DSSTrainedModelDetails
deploy_to_flow(model_id, model_name, train_dataset, test_dataset=None, redo_optimization=True)

Deploys a trained model from this ML Task to a saved model + train recipe in the Flow.

Parameters:
  • model_id (str) – Model identifier, as returned by get_trained_models_ids()
  • model_name (str) – Name of the saved model to deploy in the Flow
  • train_dataset (str) – Name of the dataset to use as train set. May either be a short name or a PROJECT.name long name (when using a shared dataset)
  • test_dataset (str) – Name of the dataset to use as test set. If null, split will be applied to the train set. May either be a short name or a PROJECT.name long name (when using a shared dataset). Only for PREDICTION tasks
  • redo_optimization (bool) – Should the hyperparameter optimization phase be redone? Defaults to True. Only for PREDICTION tasks
Returns: A dict containing “savedModelId” and “trainRecipeName”. Both can be used to obtain further handles
Return type: dict

redeploy_to_flow(model_id, recipe_name=None, saved_model_id=None, activate=True)

Redeploys a trained model from this ML Task to a saved model + train recipe in the Flow. Either recipe_name or saved_model_id needs to be specified

Parameters:
  • model_id (str) – Model identifier, as returned by get_trained_models_ids()
  • recipe_name (str) – Name of the training recipe to update
  • saved_model_id (str) – Name of the saved model to update
  • activate (bool) – Should the deployed model version become the active version
Returns: A dict containing “impactsDownstream” – whether the active version changed and downstream recipes are impacted
Return type: dict
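For example, a minimal sketch retraining and updating a previously deployed saved model (ret is the dict returned by an earlier deploy_to_flow() call):

# Train new models, then push one as the new active version
new_ids = mltask.train()
mltask.redeploy_to_flow(new_ids[0], saved_model_id=ret["savedModelId"])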

Manipulation of settings

class dataikuapi.dss.ml.DSSMLTaskSettings(client, project_key, analysis_id, mltask_id, mltask_settings)

Object to read and modify the settings of a ML task.

Do not create this object directly, use DSSMLTask.get_settings() instead

get_raw()

Gets the raw settings of this ML Task. This returns a reference to the raw settings, not a copy, so changes made to the returned object will be reflected when saving.

Return type: dict
get_split_params()

Gets an object to modify train/test splitting params.

Return type: PredictionSplitParamsHandler
get_feature_preprocessing(feature_name)

Gets the feature preprocessing params for a particular feature. This returns a reference to the feature’s settings, not a copy, so changes made to the returned object will be reflected when saving

Returns: A dict of the preprocessing settings for a feature
Return type: dict
foreach_feature(fn, only_of_type=None)

Applies a function to all features (except target)

Parameters:
  • fn (function) – Function that takes 2 parameters: feature_name and feature_params and returns modified feature_params
  • only_of_type (str) – if not None, only applies to features of the given type. Can be one of CATEGORY, NUMERIC, TEXT or VECTOR
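For example, a minimal sketch switching every categorical feature to impact coding (the same category_handling value as in the preprocessing sample above):

def use_impact_coding(feature_name, feature_params):
    # Switch this categorical feature to impact coding
    feature_params["category_handling"] = "IMPACT"
    return feature_params

settings.foreach_feature(use_impact_coding, only_of_type="CATEGORY")
settings.save()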
reject_feature(feature_name)

Marks a feature as rejected and not used for training

Parameters: feature_name (str) – Name of the feature to reject

use_feature(feature_name)

Marks a feature as an input for training

Parameters: feature_name (str) – Name of the feature to use

use_sample_weighting(feature_name)

Uses a feature as sample weight

Parameters: feature_name (str) – Name of the feature to use

remove_sample_weighting()

Removes sample weighting. If a feature was used as weight, it is set back to being an input feature

get_algorithm_settings(algorithm_name)

Gets the training settings for a particular algorithm. This returns a reference to the algorithm’s settings, not a copy, so changes made to the returned object will be reflected when saving.

All algorithms have at least an “enabled” setting. Other settings are algorithm-dependent. You can print the returned object to learn more about the settings of each particular algorithm

Parameters: algorithm_name (str) – Name (in capitals) of the algorithm.
Returns: A dict of the settings for an algorithm
Return type: dict
set_algorithm_enabled(algorithm_name, enabled)

Enables or disables an algorithm.

Parameters:
  • algorithm_name (str) – Name (in capitals) of the algorithm.
  • enabled (bool) – Whether the algorithm should be enabled
set_metric(metric=None, custom_metric=None, custom_metric_greater_is_better=True, custom_metric_use_probas=False)

Set a metric on a prediction ML task

Parameters:
  • metric (str) – metric to use. Leave empty for custom_metric
  • custom_metric (str) – code of the custom metric
  • custom_metric_greater_is_better (bool) – whether the custom metric is a score or a loss
  • custom_metric_use_probas (bool) – whether to use the classes’ probas or the predicted value (for classification)
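For example, a minimal sketch (the metric code is an assumption; valid codes depend on the prediction type):

# Optimize models for area under the ROC curve ("ROC_AUC" is an assumed code)
settings.set_metric(metric="ROC_AUC")
settings.save()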
save()

Saves back these settings to the ML Task

class dataikuapi.dss.ml.PredictionSplitParamsHandler(mltask_settings)

Object to modify the train/test splitting params.

set_split_random(train_ratio=0.8, selection=None, dataset_name=None)

Sets the train/test split to random splitting of an extract of a single dataset

Parameters:
  • train_ratio (float) – Ratio of rows to use for train set. Must be between 0 and 1
  • selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the dataset. May be None (won’t be changed)
  • dataset_name (str) – Name of dataset to split. If None, the main dataset used to create the ML Task will be used.
set_split_kfold(n_folds=5, selection=None, dataset_name=None)

Sets the train/test split to k-fold splitting of an extract of a single dataset

Parameters:
  • n_folds (int) – number of folds. Must be greater than 0
  • selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the dataset. May be None (won’t be changed)
  • dataset_name (str) – Name of dataset to split. If None, the main dataset used to create the ML Task will be used.
set_split_explicit(train_selection, test_selection, dataset_name=None, test_dataset_name=None, train_filter=None, test_filter=None)

Sets the train/test split to explicit extracts of one or two datasets

Parameters:
  • train_selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the train dataset. May be None (won’t be changed)
  • test_selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the test dataset. May be None (won’t be changed)
  • dataset_name (str) – Name of dataset to use for the extracts. If None, the main dataset used to create the ML Task will be used.
  • test_dataset_name (str) – Name of a second dataset to use for the test data extract. If None, both extracts are done from dataset_name
  • train_filter (object) – A DSSFilterBuilder to build the settings of the filter of the train dataset. May be None (won’t be changed)
  • test_filter (object) – A DSSFilterBuilder to build the settings of the filter of the test dataset. May be None (won’t be changed)
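For example, a minimal sketch setting a 90/10 random split:

# settings is a DSSMLTaskSettings object
split_params = settings.get_split_params()
split_params.set_split_random(train_ratio=0.9)
settings.save()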

Exploration of results

class dataikuapi.dss.ml.DSSTrainedPredictionModelDetails(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)

Object to read details of a trained prediction model

Do not create this object directly, use DSSMLTask.get_trained_model_details() instead

get_roc_curve_data()
get_performance_metrics()

Returns all performance metrics for this model.

For binary classification models, this includes both “threshold-independent” metrics like AUC and “threshold-dependent” metrics like precision. Threshold-dependent metrics are returned at the threshold value that was found to be optimal during training.

To get access to the per-threshold values, use the following:

# Returns a list of tested threshold values
details.get_performance_metrics()["perCutData"]["cut"]
# Returns a list of F1 scores at the tested threshold values
details.get_performance_metrics()["perCutData"]["f1"]
# Both lists have the same length

If K-fold cross-test was used, most metrics will have a “std” variant, which is the standard deviation across the K cross-tested folds. For example, “auc” will be accompanied by “aucstd”

Returns: a dict of performance metrics values
Return type: dict
get_preprocessing_settings()

Gets the preprocessing settings that were used to train this model

Return type: dict
get_modeling_settings()

Gets the modeling (algorithms) settings that were used to train this model.

Note: the structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms)

Return type: dict
get_actual_modeling_params()

Gets the actual / resolved parameters that were used to train this model, post hyperparameter optimization.

Returns: A dictionary, which contains at least a “resolved” key, which is a dict containing the post-optimization parameters
Return type: dict
get_trees()

Gets the trees in the model (for tree-based models)

Returns: a DSSTreeSet object to interact with the trees
Return type: dataikuapi.dss.ml.DSSTreeSet
get_coefficient_paths()

Gets the coefficient paths for Lasso models

Returns: a DSSCoefficientPaths object to interact with the coefficient paths
Return type: dataikuapi.dss.ml.DSSCoefficientPaths
class dataikuapi.dss.ml.DSSTrainedClusteringModelDetails(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)

Object to read details of a trained clustering model

Do not create this object directly, use DSSMLTask.get_trained_model_details() instead

get_raw()

Gets the raw dictionary of trained model details

get_train_info()

Returns various information about the train process (size of the train set, quick description, timing information)

Return type: dict
get_facts()

Gets the ‘cluster facts’ data, i.e. the structure behind the screen “for cluster X, average of Y is Z times higher than average”

Return type: DSSClustersFacts
get_performance_metrics()

Returns all performance metrics for this clustering model.

Returns: a dict of performance metrics values
Return type: dict

get_preprocessing_settings()

Gets the preprocessing settings that were used to train this model

Return type: dict
get_modeling_settings()

Gets the modeling (algorithms) settings that were used to train this model.

Note: the structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms)

Return type: dict
get_actual_modeling_params()

Gets the actual / resolved parameters that were used to train this model.

Returns: A dictionary, which contains at least a “resolved” key
Return type: dict

get_scatter_plots()

Gets the cluster scatter plot data

Returns: a DSSScatterPlots object to interact with the scatter plots
Return type: dataikuapi.dss.ml.DSSScatterPlots

Algorithm details

This section documents which algorithms are available, and some of the settings for them.

These algorithm names can be used for dataikuapi.dss.ml.DSSMLTaskSettings.get_algorithm_settings() and dataikuapi.dss.ml.DSSMLTaskSettings.set_algorithm_enabled()

Note

This documentation does not cover all settings of all algorithms. To know which settings are available for an algorithm, use mltask_settings.get_algorithm_settings('ALGORITHM_NAME') and print the returned dictionary.

Generally speaking, an algorithm setting that is an array means that the parameter can be grid-searched: all listed values will be tested as part of the hyperparameter optimization.

For more documentation of settings, please refer to the visual machine learning UI, which contains detailed documentation for all algorithm parameters.
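For example, a short sketch to discover the available settings of an algorithm:

import pprint

# mltask_settings is a DSSMLTaskSettings object
pprint.pprint(mltask_settings.get_algorithm_settings('GBT_CLASSIFICATION'))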

LOGISTIC_REGRESSION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY
  • Main parameters:
"multi_class": "ovr",
"l1": false,
"l2": true,
"C": [
    0.01,
    0.1,
  ],
"n_jobs": 2

RANDOM_FOREST_CLASSIFICATION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY

RANDOM_FOREST_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

EXTRA_TREES

  • Type: Prediction (all kinds)
  • Available on backend: PY_MEMORY

RIDGE_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

LASSO_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

LEASTSQUARE_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

SVC_CLASSIFICATION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY

SVM_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

SGD_CLASSIFICATION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY

SGD_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

GBT_CLASSIFICATION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY

GBT_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

DECISION_TREE_CLASSIFICATION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY

DECISION_TREE_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

XGBOOST_CLASSIFICATION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY

XGBOOST_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

NEURAL_NETWORK

  • Type: Prediction (all kinds)
  • Available on backend: PY_MEMORY

KNN

  • Type: Prediction (all kinds)
  • Available on backend: PY_MEMORY

LARS

  • Type: Prediction (all kinds)
  • Available on backend: PY_MEMORY

MLLIB_LOGISTIC_REGRESSION

  • Type: Prediction (binary or multiclass)
  • Available on backend: MLLIB

MLLIB_DECISION_TREE

  • Type: Prediction (all kinds)
  • Available on backend: MLLIB

MLLIB_RANDOM_FOREST

  • Type: Prediction (all kinds)
  • Available on backend: MLLIB

MLLIB_GBT

  • Type: Prediction (all kinds)
  • Available on backend: MLLIB

MLLIB_LINEAR_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: MLLIB

MLLIB_NAIVE_BAYES

  • Type: Prediction (all kinds)
  • Available on backend: MLLIB

Other

  • SCIKIT_MODEL
  • MLLIB_CUSTOM
  • SPARKLING_DEEP_LEARNING
  • SPARKLING_GBM
  • SPARKLING_RF
  • SPARKLING_GLM
  • SPARKLING_NB