Machine learning

Through the public API, the Python client allows you to automate all aspects of the lifecycle of machine learning models:

  • Creating a visual analysis and ML task
  • Tuning settings
  • Training models
  • Inspecting model details and results
  • Deploying saved models to Flow and retraining them

Examples

The whole cycle

This example creates a prediction ML task, enables an algorithm, trains models, inspects the trained models, and deploys one of them to the Flow

# client is a DSS API client

p = client.get_project("MYPROJECT")

# Create a new ML Task to predict the variable "target" from "trainset"
mltask = p.create_prediction_ml_task(input_dataset="trainset", target_variable="target")

# Wait for the ML task to be ready
mltask.wait_guess_complete()

# Obtain settings, enable GBT, save settings
settings = mltask.get_settings()
settings.set_algorithm_enabled("GBT_CLASSIFICATION", True)
settings.save()

# Start train and wait for it to be complete
mltask.start_train()
mltask.wait_train_complete()

# Get the identifiers of the trained models
# There will be 3 of them because Logistic regression and Random forest are enabled by default
ids = mltask.get_trained_models_ids()

for model_id in ids:
    details = mltask.get_trained_model_details(model_id)
    algorithm = details.get_modeling_settings()["algorithm"]
    auc = details.get_performance_metrics()["auc"]

    print("Algorithm=%s AUC=%s" % (algorithm, auc))

# Let's deploy the first model
model_to_deploy = ids[0]

ret = mltask.deploy_to_flow(model_to_deploy, "my_model", "trainset")

print("Deployed to saved model id = %s train recipe = %s" % (ret["savedModelId"], ret["trainRecipeName"]))

Obtaining a handle to an existing ML Task

When you create a new ML task using dataikuapi.dss.ml.DSSProject.create_prediction_ml_task() or dataikuapi.dss.ml.DSSProject.create_clustering_ml_task(), the returned dataikuapi.dss.ml.DSSMLTask object contains two fields, analysis_id and mltask_id, that can later be used to retrieve the same DSSMLTask object

# client is a DSS API client

p = client.get_project("MYPROJECT")
mltask = p.get_ml_task(analysis_id, mltask_id)

Tuning feature preprocessing

Enabling and disabling features

# mltask is a DSSMLTask object

settings = mltask.get_settings()

settings.reject_feature("not_useful")
settings.use_feature("useful")

settings.save()

Changing advanced parameters for a feature

# mltask is a DSSMLTask object

settings = mltask.get_settings()

# Use impact coding rather than dummy-coding
fs = settings.get_feature_preprocessing("mycategory")
fs["category_handling"] = "IMPACT"

# Impute missing with most frequent value
fs["missing_handling"] = "IMPUTE"
fs["missing_impute_with"] = "MODE"

settings.save()

Reference documentation

class dataikuapi.dss.ml.DSSMLTask(client, project_key, analysis_id, mltask_id)

A handle to interact with an ML Task for prediction or clustering in a DSS visual analysis

wait_guess_complete()

Waits for guess to be complete. This should be called immediately after the creation of a new ML Task (if the ML Task was created with wait_guess_complete=False), before calling get_settings or train

get_status()

Gets the status of this ML Task

Returns:a dict
get_settings()

Gets the settings of this ML Task

Returns:a DSSMLTaskSettings object to interact with the settings
Return type:dataikuapi.dss.ml.DSSMLTaskSettings
train(session_name=None, session_description=None)

Trains models for this ML Task

Parameters:
  • session_name (str) – name for the session
  • session_description (str) – description for the session

This method waits for train to complete. If you want to train asynchronously, use start_train() and wait_train_complete()

This method returns the list of trained model identifiers. It returns models that have been trained for this train session, not all trained models for this ML task. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids()

These identifiers can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow()

Returns:A list of model identifiers
Return type:list of strings
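
For example, a minimal sketch (the session name is illustrative) that trains a named session and prints a quick summary of each model it produced:

# mltask is a DSSMLTask object
ids = mltask.train(session_name="tuning session")
for model_id in ids:
    print(mltask.get_trained_model_snippet(model_id))
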
ensemble(model_ids=[], method=None)

Creates an ensemble model from a set of models

Parameters:
  • model_ids (list) – A list of model identifiers
  • method (str) – the ensembling method. One of: AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL

This method waits for the ensemble train to complete. If you want to train asynchronously, use start_ensembling() and wait_train_complete()

This method returns the identifier of the trained ensemble. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids()

This identifier can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow()

Returns:A model identifier
Return type:string
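
A minimal sketch, assuming the ML task already has several compatible trained models, that averages their predicted probabilities:

# mltask is a DSSMLTask object with several trained models
ids = mltask.get_trained_models_ids()
ensemble_id = mltask.ensemble(ids, "PROBA_AVERAGE")
details = mltask.get_trained_model_details(ensemble_id)
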
start_train(session_name=None, session_description=None)

Asynchronously starts a new training session for this ML Task.

Parameters:
  • session_name (str) – name for the session
  • session_description (str) – description for the session

This returns immediately, before train is complete. To wait for train to complete, use wait_train_complete()

start_ensembling(model_ids=[], method=None)

Asynchronously creates a new ensemble model from a set of models.

Parameters:
  • model_ids (list) – A list of model identifiers
  • method (str) – the ensembling method (AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL)

This returns immediately, before train is complete. To wait for train to complete, use wait_train_complete()

Returns:the model identifier of the ensemble
Return type:string
wait_train_complete()

Waits for train to be complete.

get_trained_models_ids(session_id=None, algorithm=None)

Gets the list of trained model identifiers for this ML task.

These identifiers can be used for get_trained_model_snippet and deploy_to_flow

Returns:A list of model identifiers
Return type:list of strings
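
For example, a sketch that restricts the list to a single algorithm (the algorithm name is the same capitalized identifier used with set_algorithm_enabled()):

# mltask is a DSSMLTask object
gbt_ids = mltask.get_trained_models_ids(algorithm="GBT_CLASSIFICATION")
for model_id in gbt_ids:
    details = mltask.get_trained_model_details(model_id)
    print(details.get_performance_metrics()["auc"])
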
get_trained_model_snippet(id=None, ids=None)

Gets a quick summary of a trained model, as a dict. For complete information and a structured object, use get_trained_model_details()

Parameters:
  • id (str) – a model id
  • ids (list) – a list of model ids
Return type:dict
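
A minimal sketch, assuming model_id was returned by get_trained_models_ids():

# mltask is a DSSMLTask object
snippet = mltask.get_trained_model_snippet(model_id)
print(snippet)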

get_trained_model_details(id)

Gets details for a trained model

Parameters:id (str) – Identifier of the trained model, as returned by get_trained_models_ids()
Returns:A DSSTrainedModelDetails representing the details of this trained model id
Return type:DSSTrainedModelDetails
deploy_to_flow(model_id, model_name, train_dataset, test_dataset=None, redo_optimization=True)

Deploys a trained model from this ML Task to a saved model + train recipe in the Flow.

Parameters:
  • model_id (str) – Model identifier, as returned by get_trained_models_ids()
  • model_name (str) – Name of the saved model to deploy in the Flow
  • train_dataset (str) – Name of the dataset to use as train set. May either be a short name or a PROJECT.name long name (when using a shared dataset)
  • test_dataset (str) – Name of the dataset to use as test set. If None, a split will be applied to the train set. May either be a short name or a PROJECT.name long name (when using a shared dataset). Only for PREDICTION tasks
  • redo_optimization (bool) – Whether the hyperparameter optimization phase should be redone. Defaults to True. Only for PREDICTION tasks
Returns:A dict containing “savedModelId” and “trainRecipeName”. Both can be used to obtain further handles
Return type:dict

redeploy_to_flow(model_id, recipe_name=None, saved_model_id=None, activate=True)

Redeploys a trained model from this ML Task to a saved model + train recipe in the Flow. Either recipe_name or saved_model_id needs to be specified

Parameters:
  • model_id (str) – Model identifier, as returned by get_trained_models_ids()
  • recipe_name (str) – Name of the training recipe to update
  • saved_model_id (str) – Name of the saved model to update
  • activate (bool) – Should the deployed model version become the active version
Returns:A dict containing “impactsDownstream” - whether the active version changed and downstream recipes are impacted
Return type:dict
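
For example, a sketch that retrains the ML task and redeploys the first model of the new session to an existing saved model; saved_model_id is assumed to come from an earlier deploy_to_flow() call:

# mltask is a DSSMLTask object
new_ids = mltask.train(session_name="retrain")
ret = mltask.redeploy_to_flow(new_ids[0], saved_model_id=saved_model_id)
print("Downstream impacted: %s" % ret["impactsDownstream"])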

class dataikuapi.dss.ml.DSSMLTaskSettings(client, project_key, analysis_id, mltask_id, mltask_settings)

Object to read and modify the settings of an ML task.

Do not create this object directly, use DSSMLTask.get_settings() instead

get_raw()

Gets the raw settings of this ML Task. This returns a reference to the raw settings, not a copy, so changes made to the returned object will be reflected when saving.

Return type:dict
get_split_params()

Gets an object to modify train/test splitting params.

Return type:PredictionSplitParamsHandler
get_feature_preprocessing(feature_name)

Gets the feature preprocessing params for a particular feature. This returns a reference to the feature’s settings, not a copy, so changes made to the returned object will be reflected when saving

Returns:A dict of the preprocessing settings for a feature
Return type:dict
foreach_feature(fn, only_of_type=None)

Applies a function to all features (except target)

Parameters:
  • fn (function) – Function that takes 2 parameters: feature_name and feature_params and returns modified feature_params
  • only_of_type (str) – if not None, only applies to features of the given type. Can be one of CATEGORY, NUMERIC, TEXT or VECTOR
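
For example, a hedged sketch that rescales every numeric feature; it assumes the numeric preprocessing params expose a "rescaling" key, which you can confirm by printing the params of one feature first:

# mltask is a DSSMLTask object
settings = mltask.get_settings()

def rescale(feature_name, feature_params):
    # "rescaling" is an assumed key: print(feature_params) to confirm on your instance
    feature_params["rescaling"] = "MINMAX"
    return feature_params

settings.foreach_feature(rescale, only_of_type="NUMERIC")
settings.save()
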
reject_feature(feature_name)

Marks a feature as rejected and not used for training

Parameters:feature_name (str) – Name of the feature to reject

use_feature(feature_name)

Marks a feature as an input for training

Parameters:feature_name (str) – Name of the feature to use

use_sample_weighting(feature_name)

Uses a feature as the sample weight

Parameters:feature_name (str) – Name of the feature to use

remove_sample_weighting()

Remove sample weighting. If a feature was used as weight, it’s set back to being an input feature
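
A minimal sketch, assuming the train set has a numeric column named "row_weight" (illustrative name):

# mltask is a DSSMLTask object
settings = mltask.get_settings()
settings.use_sample_weighting("row_weight")
settings.save()

# Later, to go back to unweighted training
settings.remove_sample_weighting()
settings.save()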

get_algorithm_settings(algorithm_name)

Gets the training settings for a particular algorithm. This returns a reference to the algorithm’s settings, not a copy, so changes made to the returned object will be reflected when saving.

All algorithms have at least an “enabled” setting. Other settings are algorithm-dependent. You can print the returned object to learn more about the settings of each particular algorithm

Parameters:algorithm_name (str) – Name (in capitals) of the algorithm.
Returns:A dict of the settings for an algorithm
Return type:dict
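
For example, a sketch that inspects an algorithm's settings before changing them; printing the returned dict is the safest way to discover the algorithm-dependent keys:

# mltask is a DSSMLTask object
settings = mltask.get_settings()
rf = settings.get_algorithm_settings("RANDOM_FOREST_CLASSIFICATION")
print(rf)             # discover the algorithm-dependent keys
rf["enabled"] = True  # "enabled" is available for every algorithm
settings.save()
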
set_algorithm_enabled(algorithm_name, enabled)

Enables or disables an algorithm.

Parameters:
  • algorithm_name (str) – Name (in capitals) of the algorithm.
  • enabled (bool) – Whether the algorithm should be enabled
set_metric(metric=None, custom_metric=None, custom_metric_greater_is_better=True, custom_metric_use_probas=False)

Sets the evaluation metric of a prediction ML task

Parameters:
  • metric (str) – metric to use. Leave empty for custom_metric
  • custom_metric (str) – code of the custom metric
  • custom_metric_greater_is_better (bool) – whether the custom metric is a score or a loss
  • custom_metric_use_probas (bool) – whether to use the classes’ probas or the predicted value (for classification)
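
For example, a sketch that selects a built-in metric; "ROC_AUC" is assumed here to be the identifier of the AUC metric on your DSS version:

# mltask is a DSSMLTask object
settings = mltask.get_settings()
settings.set_metric(metric="ROC_AUC")  # assumed metric identifier
settings.save()
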
save()

Saves back these settings to the ML Task

class dataikuapi.dss.ml.DSSTrainedPredictionModelDetails(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)

Object to read details of a trained prediction model

Do not create this object directly, use DSSMLTask.get_trained_model_details() instead

get_roc_curve_data()
get_performance_metrics()

Returns all performance metrics for this model.

For binary classification models, this includes both “threshold-independent” metrics like AUC and “threshold-dependent” metrics like precision. Threshold-dependent metrics are returned at the threshold value that was found to be optimal during training.

To get access to the per-threshold values, use the following:

# Returns a list of tested threshold values
details.get_performance()["perCutData"]["cut"]
# Returns a list of F1 scores at the tested threshold values
details.get_performance()["perCutData"]["f1"]
# Both lists have the same length

If K-fold cross-test was used, most metrics will have a “std” variant, which is the standard deviation across the K cross-tested folds. For example, “auc” will be accompanied by “aucstd”

Returns:a dict of performance metrics values
Return type:dict
get_preprocessing_settings()

Gets the preprocessing settings that were used to train this model

Return type:dict
get_modeling_settings()

Gets the modeling (algorithms) settings that were used to train this model.

Note: the structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms)

Return type:dict
get_actual_modeling_params()

Gets the actual / resolved parameters that were used to train this model, post hyperparameter optimization.

Returns:A dictionary, which contains at least a “resolved” key, which is a dict containing the post-optimization parameters
Return type:dict
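
A minimal sketch that prints the post-optimization parameters of a trained model:

# details is a DSSTrainedPredictionModelDetails object
params = details.get_actual_modeling_params()
print(params["resolved"])
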
get_trees()

Gets the trees in the model (for tree-based models)

Returns:a DSSTreeSet object to interact with the trees
Return type:dataikuapi.dss.ml.DSSTreeSet
get_coefficient_paths()

Gets the coefficient paths for Lasso models

Returns:a DSSCoefficientPaths object to interact with the coefficient paths
Return type:dataikuapi.dss.ml.DSSCoefficientPaths