Machine learning

Through the public API, the Python client allows you to automate all aspects of the lifecycle of machine learning models:

  • Creating a visual analysis and ML task
  • Tuning settings
  • Training models
  • Inspecting model details and results
  • Deploying saved models to the Flow and retraining them

Concepts

In DSS, you train models as part of a visual analysis. A visual analysis is made of a preparation script and one or several ML Tasks.

An ML Task is an individual section in which you train models. An ML Task is either a prediction of a single target variable, or a clustering.

The ML API allows you to manipulate ML Tasks, and use them to train models, inspect their details, and deploy them to the Flow.

Once deployed to the Flow, the Saved model can be retrained by the usual build mechanism of DSS.

An ML Task has settings, which control:

  • Which features are active
  • The preprocessing settings for each feature
  • Which algorithms are active
  • The hyperparameter settings (including grid searched hyperparameters) for each algorithm
  • The settings of the grid search
  • Train/Test splitting settings
  • Feature selection and generation settings

Usage samples

The whole cycle

This example creates a prediction ML task, enables an algorithm, trains models, inspects their details, and deploys one of them to the Flow:

# client is a DSS API client

p = client.get_project("MYPROJECT")

# Create a new ML Task to predict the variable "target" from "trainset"
mltask = p.create_prediction_ml_task(
    input_dataset="trainset",
    target_variable="target",
    ml_backend_type='PY_MEMORY', # ML backend to use
    guess_policy='DEFAULT' # Template to use for setting default parameters
)

# Wait for the ML task to be ready
mltask.wait_guess_complete()

# Obtain settings, enable GBT, save settings
settings = mltask.get_settings()
settings.set_algorithm_enabled("GBT_CLASSIFICATION", True)
settings.save()

# Start train and wait for it to be complete
mltask.start_train()
mltask.wait_train_complete()

# Get the identifiers of the trained models
# There will be 3 of them, because Logistic regression and Random forest are enabled by default
ids = mltask.get_trained_models_ids()

for model_id in ids:
    details = mltask.get_trained_model_details(model_id)
    algorithm = details.get_modeling_settings()["algorithm"]
    auc = details.get_performance_metrics()["auc"]

    print("Algorithm=%s AUC=%s" % (algorithm, auc))

# Let's deploy the first model
model_to_deploy = ids[0]

ret = mltask.deploy_to_flow(model_to_deploy, "my_model", "trainset")

print("Deployed to saved model id = %s train recipe = %s" % (ret["savedModelId"], ret["trainRecipeName"]))

The methods for creating prediction and clustering ML tasks are defined at dataikuapi.dss.project.DSSProject.create_prediction_ml_task() and dataikuapi.dss.project.DSSProject.create_clustering_ml_task().

Obtaining a handle to an existing ML Task

When you create these ML tasks, the returned dataikuapi.dss.ml.DSSMLTask object contains two fields, analysis_id and mltask_id, that can later be used to retrieve the same DSSMLTask object:

# client is a DSS API client

p = client.get_project("MYPROJECT")
mltask = p.get_ml_task(analysis_id, mltask_id)
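For example, a minimal sketch that persists the identifiers at creation time and reuses them later:

# Create the task and keep its identifiers for later reuse
mltask = p.create_prediction_ml_task(
    input_dataset="trainset",
    target_variable="target"
)
analysis_id = mltask.analysis_id
mltask_id = mltask.mltask_id

# ... later, possibly in another process ...
same_mltask = p.get_ml_task(analysis_id, mltask_id)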

Tuning feature preprocessing

Enabling and disabling features

# mltask is a DSSMLTask object

settings = mltask.get_settings()

settings.reject_feature("not_useful")
settings.use_feature("useful")

settings.save()

Changing advanced parameters for a feature

# mltask is a DSSMLTask object

settings = mltask.get_settings()

# Use impact coding rather than dummy-coding
fs = settings.get_feature_preprocessing("mycategory")
fs["category_handling"] = "IMPACT"

# Impute missing with most frequent value
fs["missing_handling"] = "IMPUTE"
fs["missing_impute_with"] = "MODE"

settings.save()

Tuning algorithms
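Algorithm settings are manipulated through the settings object: fetch the settings, enable an algorithm, modify its settings dict in place, then save. A minimal sketch, using the LOGISTIC_REGRESSION parameters documented in the Algorithm details section below (keys are algorithm-dependent, so print the returned dict to discover them):

# mltask is a DSSMLTask object

settings = mltask.get_settings()

# Enable logistic regression and fetch its settings (a dict, modified in place)
settings.set_algorithm_enabled("LOGISTIC_REGRESSION", True)
lr_settings = settings.get_algorithm_settings("LOGISTIC_REGRESSION")

# Array-valued settings are grid-searched: every listed value will be tested
lr_settings["C"] = [0.01, 0.1, 1.0]

settings.save()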

API Reference

Interaction with a ML Task

class dataikuapi.dss.ml.DSSMLTask(client, project_key, analysis_id, mltask_id)

A handle to interact with an ML Task for prediction or clustering in a DSS visual analysis

delete()

Deletes the present ML task

wait_guess_complete()

Waits for guess to be complete. This should be called immediately after the creation of a new ML Task (if the ML Task was created with wait_guess_complete=False), before calling get_settings or train

get_status()

Gets the status of this ML Task

Returns: a dict
get_settings()

Gets the settings of this ML Task

Returns: a DSSMLTaskSettings object to interact with the settings
Return type: dataikuapi.dss.ml.DSSMLTaskSettings
train(session_name=None, session_description=None)

Trains models for this ML Task

Parameters:
  • session_name (str) – name for the session
  • session_description (str) – description for the session

This method waits for train to complete. If you want to train asynchronously, use start_train() and wait_train_complete()

This method returns the list of trained model identifiers. It returns models that have been trained for this train session, not all trained models for this ML task. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids()

These identifiers can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow()

Returns: A list of model identifiers
Return type: list of strings
ensemble(model_ids=[], method=None)

Create an ensemble model of a set of models

Parameters:
  • model_ids (list) – A list of model identifiers
  • method (str) – the ensembling method. One of: AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL

This method waits for the ensemble train to complete. If you want to train asynchronously, use start_ensembling() and wait_train_complete()

This method returns the identifier of the trained ensemble. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids()

This identifier can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow()

Returns: A model identifier
Return type: string
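For example, a minimal sketch (model identifiers as returned by train() or get_trained_models_ids()):

# ids is a list of model identifiers
ensemble_id = mltask.ensemble(model_ids=ids, method="PROBA_AVERAGE")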
start_train(session_name=None, session_description=None)

Starts asynchronously a new train session for this ML Task.

Parameters:
  • session_name (str) – name for the session
  • session_description (str) – description for the session

This returns immediately, before train is complete. To wait for train to complete, use wait_train_complete()
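For example, a minimal asynchronous sketch:

# Launch the session, do other work, then block until training finishes
mltask.start_train(session_name="nightly session")
# ... other work ...
mltask.wait_train_complete()
ids = mltask.get_trained_models_ids()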

start_ensembling(model_ids=[], method=None)

Asynchronously creates a new ensemble model from a set of models.

Parameters:
  • model_ids (list) – A list of model identifiers
  • method (str) – the ensembling method (AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL)

This returns immediately, before train is complete. To wait for train to complete, use wait_train_complete()

Returns: the model identifier of the ensemble
Return type: string
wait_train_complete()

Waits for train to be complete.

get_trained_models_ids(session_id=None, algorithm=None)

Gets the list of trained model identifiers for this ML task.

Parameters:
  • session_id (str) – Optional. If set, only return models trained in this session
  • algorithm (str) – Optional. If set, only return models trained with this algorithm

These identifiers can be used for get_trained_model_snippet() and deploy_to_flow()

Returns: A list of model identifiers
Return type: list of strings
get_trained_model_snippet(id=None, ids=None)

Gets a quick summary of a trained model, as a dict. For complete information and a structured object, use get_trained_model_details()

Parameters:
  • id (str) – a model id
  • ids (list) – a list of model ids
Return type: dict

get_trained_model_details(id)

Gets details for a trained model

Parameters: id (str) – Identifier of the trained model, as returned by get_trained_models_ids()
Returns: A DSSTrainedModelDetails representing the details of this trained model id
Return type: DSSTrainedModelDetails
deploy_to_flow(model_id, model_name, train_dataset, test_dataset=None, redo_optimization=True)

Deploys a trained model from this ML Task to a saved model + train recipe in the Flow.

Parameters:
  • model_id (str) – Model identifier, as returned by get_trained_models_ids()
  • model_name (str) – Name of the saved model to deploy in the Flow
  • train_dataset (str) – Name of the dataset to use as train set. May either be a short name or a PROJECT.name long name (when using a shared dataset)
  • test_dataset (str) – Name of the dataset to use as test set. If null, split will be applied to the train set. May either be a short name or a PROJECT.name long name (when using a shared dataset). Only for PREDICTION tasks
  • redo_optimization (bool) – Should the hyperparameter optimization phase be redone? Defaults to True. Only for PREDICTION tasks
Returns: A dict containing “savedModelId” and “trainRecipeName”. Both can be used to obtain further handles
Return type: dict

redeploy_to_flow(model_id, recipe_name=None, saved_model_id=None, activate=True)

Redeploys a trained model from this ML Task to a saved model + train recipe in the Flow. Either recipe_name or saved_model_id needs to be specified

Parameters:
  • model_id (str) – Model identifier, as returned by get_trained_models_ids()
  • recipe_name (str) – Name of the training recipe to update
  • saved_model_id (str) – Name of the saved model to update
  • activate (bool) – Should the deployed model version become the active version
Returns: A dict containing “impactsDownstream” – whether the active version changed and downstream recipes are impacted
Return type: dict
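For example, a minimal sketch retraining and updating a previously deployed saved model (ret is the dict returned by an earlier deploy_to_flow() call):

# Train new models, then push one as the new active version
new_ids = mltask.train()
mltask.redeploy_to_flow(new_ids[0], saved_model_id=ret["savedModelId"])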

Manipulation of settings

class dataikuapi.dss.ml.DSSMLTaskSettings(client, project_key, analysis_id, mltask_id, mltask_settings)

Object to read and modify the settings of a ML task.

Do not create this object directly, use DSSMLTask.get_settings() instead

get_raw()

Gets the raw settings of this ML Task. This returns a reference to the raw settings, not a copy, so changes made to the returned object will be reflected when saving.

Return type: dict
get_split_params()

Gets an object to modify train/test splitting params.

Return type: PredictionSplitParamsHandler
get_feature_preprocessing(feature_name)

Gets the feature preprocessing params for a particular feature. This returns a reference to the feature’s settings, not a copy, so changes made to the returned object will be reflected when saving

Returns: A dict of the preprocessing settings for a feature
Return type: dict
foreach_feature(fn, only_of_type=None)

Applies a function to all features (except target)

Parameters:
  • fn (function) – Function that takes 2 parameters: feature_name and feature_params and returns modified feature_params
  • only_of_type (str) – if not None, only applies to features of the given type. Can be one of CATEGORY, NUMERIC, TEXT or VECTOR
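For example, a minimal sketch switching every categorical feature to impact coding (the same category_handling value as in the preprocessing sample above):

def use_impact_coding(feature_name, feature_params):
    # Switch this categorical feature to impact coding
    feature_params["category_handling"] = "IMPACT"
    return feature_params

settings.foreach_feature(use_impact_coding, only_of_type="CATEGORY")
settings.save()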
reject_feature(feature_name)

Marks a feature as rejected and not used for training

Parameters: feature_name (str) – Name of the feature to reject

use_feature(feature_name)

Marks a feature as an input for training

Parameters: feature_name (str) – Name of the feature to use

use_sample_weighting(feature_name)

Uses a feature as sample weight

Parameters: feature_name (str) – Name of the feature to use

remove_sample_weighting()

Removes sample weighting. If a feature was used as weight, it is set back to being an input feature

get_algorithm_settings(algorithm_name)

Gets the training settings for a particular algorithm. This returns a reference to the algorithm’s settings, not a copy, so changes made to the returned object will be reflected when saving.

All algorithms have at least an “enabled” setting. Other settings are algorithm-dependent. You can print the returned object to learn more about the settings of each particular algorithm

Parameters: algorithm_name (str) – Name (in capitals) of the algorithm.
Returns: A dict of the settings for an algorithm
Return type: dict
set_algorithm_enabled(algorithm_name, enabled)

Enables or disables an algorithm.

Parameters:
  • algorithm_name (str) – Name (in capitals) of the algorithm.
  • enabled (bool) – Whether the algorithm should be enabled
set_metric(metric=None, custom_metric=None, custom_metric_greater_is_better=True, custom_metric_use_probas=False)

Set a metric on a prediction ML task

Parameters:
  • metric (str) – metric to use. Leave empty for custom_metric
  • custom_metric (str) – code of the custom metric
  • custom_metric_greater_is_better (bool) – whether the custom metric is a score or a loss
  • custom_metric_use_probas (bool) – whether to use the classes’ probas or the predicted value (for classification)
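For example, a minimal sketch (the metric code is an assumption; valid codes depend on the prediction type):

# Optimize models for area under the ROC curve ("ROC_AUC" is an assumed code)
settings.set_metric(metric="ROC_AUC")
settings.save()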
save()

Saves back these settings to the ML Task

class dataikuapi.dss.ml.PredictionSplitParamsHandler(mltask_settings)

Object to modify the train/test splitting params.

set_split_random(train_ratio=0.8, selection=None, dataset_name=None)

Sets the train/test split to random splitting of an extract of a single dataset

Parameters:
  • train_ratio (float) – Ratio of rows to use for train set. Must be between 0 and 1
  • selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the dataset. May be None (won’t be changed)
  • dataset_name (str) – Name of dataset to split. If None, the main dataset used to create the ML Task will be used.
set_split_kfold(n_folds=5, selection=None, dataset_name=None)

Sets the train/test split to k-fold splitting of an extract of a single dataset

Parameters:
  • n_folds (int) – number of folds. Must be greater than 0
  • selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the dataset. May be None (won’t be changed)
  • dataset_name (str) – Name of dataset to split. If None, the main dataset used to create the ML Task will be used.
set_split_explicit(train_selection, test_selection, dataset_name=None, test_dataset_name=None, train_filter=None, test_filter=None)

Sets the train/test split to explicit extracts of one or two datasets

Parameters:
  • train_selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the train dataset. May be None (won’t be changed)
  • test_selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the test dataset. May be None (won’t be changed)
  • dataset_name (str) – Name of dataset to use for the extracts. If None, the main dataset used to create the ML Task will be used.
  • test_dataset_name (str) – Name of a second dataset to use for the test data extract. If None, both extracts are done from dataset_name
  • train_filter (object) – A DSSFilterBuilder to build the settings of the filter of the train dataset. May be None (won’t be changed)
  • test_filter (object) – A DSSFilterBuilder to build the settings of the filter of the test dataset. May be None (won’t be changed)
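For example, a minimal sketch setting a 90/10 random split:

# settings is a DSSMLTaskSettings object
split_params = settings.get_split_params()
split_params.set_split_random(train_ratio=0.9)
settings.save()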

Exploration of results

class dataikuapi.dss.ml.DSSTrainedPredictionModelDetails(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)

Object to read details of a trained prediction model

Do not create this object directly, use DSSMLTask.get_trained_model_details() instead

get_roc_curve_data()
get_performance_metrics()

Returns all performance metrics for this model.

For binary classification models, this includes both “threshold-independent” metrics like AUC and “threshold-dependent” metrics like precision. Threshold-dependent metrics are returned at the threshold value that was found to be optimal during training.

To get access to the per-threshold values, use the following:

# Returns a list of tested threshold values
details.get_performance_metrics()["perCutData"]["cut"]
# Returns a list of F1 scores at the tested threshold values
details.get_performance_metrics()["perCutData"]["f1"]
# Both lists have the same length

If K-fold cross-test was used, most metrics will have a “std” variant, which is the standard deviation across the K cross-tested folds. For example, “auc” will be accompanied by “aucstd”

Returns: a dict of performance metrics values
Return type: dict
get_preprocessing_settings()

Gets the preprocessing settings that were used to train this model

Return type: dict
get_modeling_settings()

Gets the modeling (algorithms) settings that were used to train this model.

Note: the structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms)

Return type: dict
get_actual_modeling_params()

Gets the actual / resolved parameters that were used to train this model, post hyperparameter optimization.

Returns: A dictionary, which contains at least a “resolved” key, which is a dict containing the post-optimization parameters
Return type: dict
get_trees()

Gets the trees in the model (for tree-based models)

Returns: a DSSTreeSet object to interact with the trees
Return type: dataikuapi.dss.ml.DSSTreeSet
get_coefficient_paths()

Gets the coefficient paths for Lasso models

Returns: a DSSCoefficientPaths object to interact with the coefficient paths
Return type: dataikuapi.dss.ml.DSSCoefficientPaths
class dataikuapi.dss.ml.DSSTrainedClusteringModelDetails(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)

Object to read details of a trained clustering model

Do not create this object directly, use DSSMLTask.get_trained_model_details() instead

get_raw()

Gets the raw dictionary of trained model details

get_train_info()

Returns various information about the train process (size of the train set, quick description, timing information)

Return type: dict
get_facts()

Gets the ‘cluster facts’ data, i.e. the structure behind the screen “for cluster X, average of Y is Z times higher than average”

Return type: DSSClustersFacts
get_performance_metrics()

Returns all performance metrics for this clustering model.

Returns: a dict of performance metrics values
Return type: dict

get_preprocessing_settings()

Gets the preprocessing settings that were used to train this model

Return type: dict
get_modeling_settings()

Gets the modeling (algorithms) settings that were used to train this model.

Note: the structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms)

Return type: dict
get_actual_modeling_params()

Gets the actual / resolved parameters that were used to train this model.

Returns: A dictionary, which contains at least a “resolved” key
Return type: dict

get_scatter_plots()

Gets the cluster scatter plot data

Returns: a DSSScatterPlots object to interact with the scatter plots
Return type: dataikuapi.dss.ml.DSSScatterPlots

Algorithm details

This section documents which algorithms are available, and some of the settings for them.

These algorithm names can be used for dataikuapi.dss.ml.DSSMLTaskSettings.get_algorithm_settings() and dataikuapi.dss.ml.DSSMLTaskSettings.set_algorithm_enabled()

Note

This documentation does not cover all settings of all algorithms. To know which settings are available for an algorithm, use mltask_settings.get_algorithm_settings('ALGORITHM_NAME') and print the returned dictionary.

Generally speaking, an algorithm setting that is an array means that the parameter can be grid-searched: all listed values will be tested as part of the hyperparameter optimization.

For more documentation of settings, please refer to the visual machine learning UI, which contains detailed documentation for all algorithm parameters.
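For example, a short sketch to discover the available settings of an algorithm:

import pprint

# mltask_settings is a DSSMLTaskSettings object
pprint.pprint(mltask_settings.get_algorithm_settings('GBT_CLASSIFICATION'))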

LOGISTIC_REGRESSION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY
  • Main parameters:
"multi_class": "ovr",
"l1": false,
"l2": true,
"C": [
    0.01,
    0.1,
  ],
"n_jobs": 2

RANDOM_FOREST_CLASSIFICATION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY

RANDOM_FOREST_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

EXTRA_TREES

  • Type: Prediction (all kinds)
  • Available on backend: PY_MEMORY

RIDGE_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

LASSO_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

LEASTSQUARE_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

SVC_CLASSIFICATION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY

SVM_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

SGD_CLASSIFICATION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY

SGD_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

GBT_CLASSIFICATION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY

GBT_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

DECISION_TREE_CLASSIFICATION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY

DECISION_TREE_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

XGBOOST_CLASSIFICATION

  • Type: Prediction (binary or multiclass)
  • Available on backend: PY_MEMORY

XGBOOST_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: PY_MEMORY

NEURAL_NETWORK

  • Type: Prediction (all kinds)
  • Available on backend: PY_MEMORY

KNN

  • Type: Prediction (all kinds)
  • Available on backend: PY_MEMORY

LARS

  • Type: Prediction (all kinds)
  • Available on backend: PY_MEMORY

MLLIB_LOGISTIC_REGRESSION

  • Type: Prediction (binary or multiclass)
  • Available on backend: MLLIB

MLLIB_DECISION_TREE

  • Type: Prediction (all kinds)
  • Available on backend: MLLIB

MLLIB_RANDOM_FOREST

  • Type: Prediction (all kinds)
  • Available on backend: MLLIB

MLLIB_GBT

  • Type: Prediction (all kinds)
  • Available on backend: MLLIB

MLLIB_LINEAR_REGRESSION

  • Type: Prediction (regression)
  • Available on backend: MLLIB

MLLIB_NAIVE_BAYES

  • Type: Prediction (all kinds)
  • Available on backend: MLLIB

Other

  • SCIKIT_MODEL
  • MLLIB_CUSTOM
  • SPARKLING_DEEP_LEARNING
  • SPARKLING_GBM
  • SPARKLING_RF
  • SPARKLING_GLM
  • SPARKLING_NB