Machine learning¶
Through the public API, the Python client allows you to automate all aspects of the lifecycle of machine learning models:
Creating a visual analysis and ML task
Tuning settings
Training models
Inspecting model details and results
Deploying saved models to Flow and retraining them
Concepts¶
In DSS, you train models as part of a visual analysis. A visual analysis is made of a preparation script, and one or several ML Tasks.
A ML Task is an individual section in which you train models. A ML Task is either a prediction of a single target variable, or a clustering.
The ML API allows you to manipulate ML Tasks, and use them to train models, inspect their details, and deploy them to the Flow.
Once deployed to the Flow, the Saved model can be retrained by the usual build mechanism of DSS.
A ML Task has settings, which control the following (a short inspection sketch follows this list):
Which features are active
The preprocessing settings for each feature
Which algorithms are active
The hyperparameter settings (including grid-searched hyperparameters) for each algorithm
The settings of the grid search
Train/test splitting settings
Feature selection and generation settings
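These settings can be inspected programmatically, as the sketch below shows. This is a minimal sketch, assuming an existing DSSMLTask handle named mltask:
# mltask is a DSSMLTask object
settings = mltask.get_settings()
# get_raw() returns a reference to the underlying settings dict;
# its top-level keys cover the setting groups listed above
print(settings.get_raw().keys())
# Enabled algorithms can also be listed directly
print(settings.get_enabled_algorithm_names())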
Usage samples¶
The whole cycle¶
This example creates a prediction ML task, enables an algorithm, trains models, inspects them, and deploys one of them to the Flow.
# client is a DSS API client
p = client.get_project("MYPROJECT")
# Create a new ML Task to predict the variable "target" from "trainset"
mltask = p.create_prediction_ml_task(
    input_dataset="trainset",
    target_variable="target",
    ml_backend_type='PY_MEMORY', # ML backend to use
    guess_policy='DEFAULT' # Template to use for setting default parameters
)
# Wait for the ML task to be ready
mltask.wait_guess_complete()
# Obtain settings, enable GBT, save settings
settings = mltask.get_settings()
settings.set_algorithm_enabled("GBT_CLASSIFICATION", True)
settings.save()
# Start train and wait for it to be complete
mltask.start_train()
mltask.wait_train_complete()
# Get the identifiers of the trained models
# There will be 3 of them, because Logistic regression and Random forest were enabled by default
ids = mltask.get_trained_models_ids()
for id in ids:
    details = mltask.get_trained_model_details(id)
    algorithm = details.get_modeling_settings()["algorithm"]
    auc = details.get_performance_metrics()["auc"]
    print("Algorithm=%s AUC=%s" % (algorithm, auc))
# Let's deploy the first model
model_to_deploy = ids[0]
ret = mltask.deploy_to_flow(model_to_deploy, "my_model", "trainset")
print("Deployed to saved model id = %s train recipe = %s" % (ret["savedModelId"], ret["trainRecipeName"]))
The methods for creating prediction and clustering ML tasks are dataikuapi.dss.project.DSSProject.create_prediction_ml_task() and dataikuapi.dss.project.DSSProject.create_clustering_ml_task().
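Creating a clustering ML task follows the same pattern. A minimal sketch; the guess_policy value shown is an assumption, check the method reference for the accepted templates:
# client is a DSS API client
p = client.get_project("MYPROJECT")
# Create a clustering ML Task on the same dataset
mltask = p.create_clustering_ml_task(
    input_dataset="trainset",
    ml_backend_type='PY_MEMORY', # ML backend to use
    guess_policy='KMEANS' # Template for default parameters (assumed value)
)
# As for prediction tasks, wait for the initial guessing to complete
mltask.wait_guess_complete()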
Obtaining a handle to an existing ML Task¶
When you create these ML tasks, the returned dataikuapi.dss.ml.DSSMLTask object contains two fields, analysis_id and mltask_id, that can later be used to retrieve the same DSSMLTask object:
# client is a DSS API client
p = client.get_project("MYPROJECT")
mltask = p.get_ml_task(analysis_id, mltask_id)
Tuning feature preprocessing¶
Enabling and disabling features¶
# mltask is a DSSMLTask object
settings = mltask.get_settings()
settings.reject_feature("not_useful")
settings.use_feature("useful")
settings.save()
Changing advanced parameters for a feature¶
# mltask is a DSSMLTask object
settings = mltask.get_settings()
# Use impact coding rather than dummy-coding
fs = settings.get_feature_preprocessing("mycategory")
fs["category_handling"] = "IMPACT"
# Impute missing with most frequent value
fs["missing_handling"] = "IMPUTE"
fs["missing_impute_with"] = "MODE"
settings.save()
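To apply the same change to many features at once, foreach_feature() (documented in the API reference below) can be used. A sketch that rejects every text feature; the "role" key is an assumption about the per-feature settings dict:
# mltask is a DSSMLTask object
settings = mltask.get_settings()

def reject_feature(feature_name, feature_params):
    # Mark the feature as rejected and return the modified params
    feature_params["role"] = "REJECT" # "role" key assumed
    return feature_params

# Only apply to features of type TEXT
settings.foreach_feature(reject_feature, only_of_type="TEXT")
settings.save()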
Tuning algorithms¶
Global parameters for hyperparameter search¶
This sample shows how to modify the parameters of the search to be performed on the hyperparameters.
# mltask is a DSSMLTask object
settings = mltask.get_settings()
hp_search_settings = settings.get_hyperparameter_search_settings()
# Set the search strategy either to "GRID", "RANDOM" or "BAYESIAN"
hp_search_settings.strategy = "RANDOM"
# Alternatively use a setter, either set_grid_search
# set_random_search or set_bayesian_search
hp_search_settings.set_random_search(seed=1234)
# Set the validation mode either to "KFOLD", "SHUFFLE" (or accordingly their
# "TIME_SERIES"-prefixed counterpart) or "CUSTOM"
hp_search_settings.validation_mode = "KFOLD"
# Alternatively use a setter, either set_kfold_validation, set_single_split_validation
# or set_custom_validation
hp_search_settings.set_kfold_validation(n_folds=5, stratified=True)
# Save the settings
settings.save()
Algorithm specific hyperparameter search¶
This sample shows how to modify the settings of the Random Forest Classification algorithm, where two kinds of hyperparameters (multi-valued numerical and single-valued) are introduced.
# mltask is a DSSMLTask object
settings = mltask.get_settings()
rf_settings = settings.get_algorithm_settings("RANDOM_FOREST_CLASSIFICATION")
# rf_settings is an object representing the settings for this algorithm.
# The 'enabled' attribute indicates whether this algorithm will be trained.
# Other attributes are the various hyperparameters of the algorithm.
# The precise hyperparameters for each algorithm are not all documented, so let's
# print the dictionary keys to see available hyperparameters.
# Alternatively, tab completion will provide relevant hints to available hyperparameters.
print(rf_settings.keys())
# Let's first have a look at rf_settings.n_estimators which is a multi-valued hyperparameter
# represented as a NumericalHyperparameterSettings object
print(rf_settings.n_estimators)
# Set multiple explicit values for "n_estimators" to be explored during the search
rf_settings.n_estimators.definition_mode = "EXPLICIT"
rf_settings.n_estimators.values = [100, 200]
# Alternatively use the set_values setter
rf_settings.n_estimators.set_values([100, 200])
# Set a range of values for "n_estimators" to be explored during the search
rf_settings.n_estimators.definition_mode = "RANGE"
rf_settings.n_estimators.range.min = 10
rf_settings.n_estimators.range.max = 100
rf_settings.n_estimators.range.nb_values = 5 # Only relevant for grid-search
# Alternatively, use the set_range setter
rf_settings.n_estimators.set_range(min=10, max=100, nb_values=5)
# Let's now have a look at rf_settings.selection_mode which is a single-valued hyperparameter
# represented as a SingleCategoryHyperparameterSettings object.
# The object stores the valid options for this hyperparameter.
print(rf_settings.selection_mode)
# Feature selection mode is not multi-valued, so it is not actually searched
# during the hyperparameter search
rf_settings.selection_mode = "sqrt"
# Save the settings
settings.save()
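A related pattern is to start from a clean slate: disable every algorithm, then enable only the ones you want, using methods documented in the API reference below:
# mltask is a DSSMLTask object
settings = mltask.get_settings()
# List the valid algorithm identifiers for this task
print(settings.get_all_possible_algorithm_names())
# Disable everything, then enable a chosen subset
settings.disable_all_algorithms()
settings.set_algorithm_enabled("RANDOM_FOREST_CLASSIFICATION", True)
settings.set_algorithm_enabled("GBT_CLASSIFICATION", True)
print(settings.get_enabled_algorithm_names())
settings.save()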
The next sample shows how to modify the settings of the Logistic Regression classification algorithm, where a new kind of hyperparameter (multi-valued categorical) is introduced.
# mltask is a DSSMLTask object
settings = mltask.get_settings()
logit_settings = settings.get_algorithm_settings("LOGISTIC_REGRESSION")
# Let's have a look at logit_settings.penalty which is a multi-valued categorical
# hyperparameter represented as a CategoricalHyperparameterSettings object
print(logit_settings.penalty)
# List currently enabled values
print(logit_settings.penalty.get_values())
# List all possible values
print(logit_settings.penalty.get_all_possible_values())
# Set the values for the "penalty" hyperparameter to be explored during the search
logit_settings.penalty = ["l1", "l2"]
# Alternatively use the set_values setter
logit_settings.penalty.set_values(["l1", "l2"])
# Save the settings
settings.save()
Exporting a model documentation¶
This sample shows how to generate and download a model documentation from a template.
See Model Document Generator for more information.
# mltask is a DSSMLTask object
# id is a trained model identifier, as returned by mltask.get_trained_models_ids()
details = mltask.get_trained_model_details(id)
# Launch the model document generation by either
# using the default template for this model by calling without argument
# or specifying a managed folder id and the path to the template to use in that folder
future = details.generate_documentation(FOLDER_ID, "path/my_template.docx")
# Alternatively, use a custom uploaded template file
with open("my_template.docx", "rb") as f:
    future = details.generate_documentation_from_custom_template(f)
# Wait for the generation to finish, retrieve the result and download the generated
# model documentation to the specified file
result = future.wait_for_result()
export_id = result["exportId"]
details.download_documentation_to_file(export_id, "path/my_model_documentation.docx")
API Reference¶
Interaction with a ML Task¶
-
class
dataikuapi.dss.ml.
DSSMLTask
(client, project_key, analysis_id, mltask_id)¶ -
static
from_full_model_id
(client, fmi, project_key=None)¶
-
delete
()¶ Delete the present ML task
-
wait_guess_complete
()¶ Waits for the guessing to be complete. This should be called immediately after the creation of a new ML Task (if the ML Task was created with wait_guess_complete=False), before calling get_settings() or train()
-
get_status
()¶ Gets the status of this ML Task
- Returns
a dict
-
get_settings
()¶ Gets the settings of this ML Task
- Returns
a DSSMLTaskSettings object to interact with the settings
- Return type
DSSMLTaskSettings
-
train
(session_name=None, session_description=None, run_queue=False)¶ Trains models for this ML Task
- Parameters
session_name (str) – name for the session
session_description (str) – description for the session
This method waits for train to complete. If you want to train asynchronously, use start_train() and wait_train_complete().
This method returns the list of trained model identifiers. It returns models that have been trained for this train session, not all trained models for this ML task. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids().
These identifiers can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow()
- Returns
A list of model identifiers
- Return type
list of strings
-
ensemble
(model_ids=None, method=None)¶ Create an ensemble model of a set of models
- Parameters
model_ids (list) – A list of model identifiers (defaults to [])
method (str) – the ensembling method. One of: AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL
This method waits for the ensemble train to complete. If you want to train asynchronously, use start_ensembling() and wait_train_complete().
This method returns the identifier of the trained ensemble. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids().
This identifier can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow()
- Returns
A model identifier
- Return type
string
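A sketch of how ensemble() composes with the rest of the task API; it assumes at least two compatible models were already trained:
# mltask is a DSSMLTask object with at least two trained models
ids = mltask.get_trained_models_ids()
# Build a probability-averaging ensemble of the first two models
ensemble_id = mltask.ensemble(model_ids=ids[:2], method="PROBA_AVERAGE")
# The returned identifier behaves like any other trained model id
details = mltask.get_trained_model_details(ensemble_id)
print(details.get_performance_metrics())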
-
start_train
(session_name=None, session_description=None, run_queue=False)¶ Starts asynchronously a new train session for this ML Task.
- Parameters
session_name (str) – name for the session
session_description (str) – description for the session
This returns immediately, before train is complete. To wait for train to complete, use
wait_train_complete()
-
start_ensembling
(model_ids=None, method=None)¶ Asynchronously starts creating a new ensemble model of a set of models.
- Parameters
model_ids (list) – A list of model identifiers (defaults to [])
method (str) – the ensembling method (AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL)
This returns immediately, before train is complete. To wait for train to complete, use
wait_train_complete()
- Returns
the model identifier of the ensemble
- Return type
string
-
wait_train_complete
()¶ Waits for train to be complete (if started with start_train())
-
get_trained_models_ids
(session_id=None, algorithm=None)¶ Gets the list of trained model identifiers for this ML task.
These identifiers can be used for get_trained_model_snippet() and deploy_to_flow()
- Returns
A list of model identifiers
- Return type
list of strings
-
get_trained_model_snippet
(id=None, ids=None)¶ Gets a quick summary of a trained model, as a dict. For complete information and a structured object, use get_trained_model_details()
- Parameters
id (str) – a model id
ids (list) – a list of model ids
- Return type
dict
-
get_trained_model_details
(id)¶ Gets details for a trained model
- Parameters
id (str) – Identifier of the trained model, as returned by get_trained_models_ids()
- Returns
A DSSTrainedPredictionModelDetails or DSSTrainedClusteringModelDetails representing the details of this trained model id
- Return type
DSSTrainedPredictionModelDetails or DSSTrainedClusteringModelDetails
-
delete_trained_model
(model_id)¶ Deletes a trained model
- Parameters
model_id (str) – Model identifier, as returned by get_trained_models_ids()
-
train_queue
()¶ Trains this ML Task’s queue
- Returns
A dict including the next session ID to be trained in the queue
- Return type
dict
-
deploy_to_flow
(model_id, model_name, train_dataset, test_dataset=None, redo_optimization=True)¶ Deploys a trained model from this ML Task to a saved model + train recipe in the Flow.
- Parameters
model_id (str) – Model identifier, as returned by
get_trained_models_ids()
model_name (str) – Name of the saved model to deploy in the Flow
train_dataset (str) – Name of the dataset to use as train set. May either be a short name or a PROJECT.name long name (when using a shared dataset)
test_dataset (str) – Name of the dataset to use as test set. If null, the split will be applied to the train set. May either be a short name or a PROJECT.name long name (when using a shared dataset). Only for PREDICTION tasks
redo_optimization (bool) – Should the hyperparameter optimization phase be redone? Defaults to True. Only for PREDICTION tasks
- Returns
A dict containing: “savedModelId” and “trainRecipeName” - Both can be used to obtain further handles
- Return type
dict
-
redeploy_to_flow
(model_id, recipe_name=None, saved_model_id=None, activate=True)¶ Redeploys a trained model from this ML Task to a saved model + train recipe in the Flow. Either recipe_name or saved_model_id needs to be specified
- Parameters
model_id (str) – Model identifier, as returned by
get_trained_models_ids()
recipe_name (str) – Name of the training recipe to update
saved_model_id (str) – Name of the saved model to update
activate (bool) – Should the deployed model version become the active version
- Returns
A dict containing: “impactsDownstream” - whether the active version changed and downstream recipes are impacted
- Return type
dict
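A sketch of a retrain-and-redeploy loop; saved_model_id is assumed to come from an earlier deploy_to_flow() call:
# mltask is a DSSMLTask object; saved_model_id comes from an earlier
# deploy_to_flow() call (its "savedModelId" return value)
ids = mltask.train(session_name="retrain")
best_model_id = ids[0]  # pick a model, e.g. after comparing metrics
ret = mltask.redeploy_to_flow(best_model_id, saved_model_id=saved_model_id, activate=True)
print(ret["impactsDownstream"])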
-
remove_unused_splits
()¶ Deletes all stored split data that is no longer in use for this ML Task.
It is generally not needed to call this method.
-
remove_all_splits
()¶ Deletes all stored splits data for this ML Task. This operation saves disk space.
After performing this operation, it will not be possible anymore to:
Ensemble already trained models
View the “predicted data” or “charts” for already trained models
Resume training of models for which optimization had been previously interrupted
Training new models remains possible.
-
guess
(prediction_type=None, reguess_level=None, target_variable=None, timeseries_identifiers=None, time_variable=None, full_reguess=None)¶ Reguesses all the settings of the ML task when no optional parameters are given. For prediction ML tasks only, sets a new value for a core parameter of the task (target variable or prediction type) and subsequently reguesses the impacted settings.
- Parameters
prediction_type (string) – Only valid for prediction tasks of either BINARY_CLASSIFICATION, MULTICLASS or REGRESSION type, ignored otherwise. The prediction type to set. Cannot be set if target_variable, time_variable, or timeseries_identifiers is also specified.
target_variable (string) – Only valid for prediction tasks, ignored for clustering. The target variable to set. Cannot be set if prediction_type, time_variable, or timeseries_identifiers is also specified.
timeseries_identifiers (list) – Only valid for time series forecasting tasks. List of columns to be used as time series identifiers. Cannot be set if prediction_type, target_variable, or time_variable is also specified.
time_variable (string) – Only valid for time series forecasting tasks. Column to be used as time variable. Cannot be set if prediction_type, target_variable, or timeseries_identifiers is also specified.
full_reguess (bool) – Only valid for prediction tasks, ignored for clustering. Scope of the reguess process: whether it should reguess all the settings after changing a core parameter, or only reguess impacted settings (e.g. target remapping when changing the target, metrics when changing the prediction type…). Ignored if no core parameter is given. Defaults to true.
reguess_level (string) – Deprecated, use full_reguess instead. Only valid for prediction tasks. Can be one of the following values:
TARGET_CHANGE: Change the target if target_variable is specified, reguess the target remapping, and clear the model’s assertions if any. Equivalent to full_reguess=False (recommended usage)
FULL_REGUESS: All the settings of the ML task are reguessed. Equivalent to full_reguess=True (recommended usage)
Manipulation of settings¶
-
class
dataikuapi.dss.ml.
DSSMLTaskSettings
(client, project_key, analysis_id, mltask_id, mltask_settings)¶ Object to read and modify the settings of a ML task.
Do not create this object directly, use DSSMLTask.get_settings() instead.
-
get_raw
()¶ Gets the raw settings of this ML Task. This returns a reference to the raw settings, not a copy, so changes made to the returned object will be reflected when saving.
- Return type
dict
-
get_feature_preprocessing
(feature_name)¶ Gets the feature preprocessing params for a particular feature. This returns a reference to the feature’s settings, not a copy, so changes made to the returned object will be reflected when saving
- Returns
A dict of the preprocessing settings for a feature
- Return type
dict
-
foreach_feature
(fn, only_of_type=None)¶ Applies a function to all features (except target)
- Parameters
fn (function) – Function that takes 2 parameters: feature_name and feature_params and returns modified feature_params
only_of_type (str) – if not None, only applies to features of the given type. Can be one of CATEGORY, NUMERIC, TEXT or VECTOR
-
reject_feature
(feature_name)¶ Marks a feature as rejected and not used for training
- Parameters
feature_name (str) – Name of the feature to reject
-
use_feature
(feature_name)¶ Marks a feature as input for training
- Parameters
feature_name (str) – Name of the feature to use
-
get_algorithm_settings
(algorithm_name)¶
-
get_diagnostics_settings
()¶ Gets the diagnostics settings for an ML task. This returns a reference to the diagnostics’ settings, not a copy, so changes made to the returned object will be reflected when saving.
This method returns a dictionary of the settings with:
‘enabled’: indicates if the diagnostics are enabled globally; if False, all diagnostics will be disabled
‘settings’: a list of dicts, each comprised of:
‘type’: the diagnostic type
‘enabled’: indicates if the diagnostic type is enabled; if False, all diagnostics of that type will be disabled
Please refer to the documentation for details on available diagnostics.
- Returns
A dict of diagnostics settings
- Return type
dict
-
set_diagnostics_enabled
(enabled)¶ Globally enables or disables all diagnostics.
- Parameters
enabled (bool) – if the diagnostics should be enabled or not
-
set_diagnostic_type_enabled
(diagnostic_type, enabled)¶ Enables or disables a diagnostic based on its type.
Please refer to the documentation for details on available diagnostics.
- Parameters
diagnostic_type (str) – Name (in capitals) of the diagnostic type.
enabled (bool) – if the diagnostic should be enabled or not
-
set_algorithm_enabled
(algorithm_name, enabled)¶ Enables or disables an algorithm based on its name.
Please refer to the documentation for details on available algorithms.
- Parameters
algorithm_name (str) – Name (in capitals) of the algorithm.
enabled (bool) – whether the algorithm should be enabled
-
disable_all_algorithms
()¶ Disables all algorithms
-
get_all_possible_algorithm_names
()¶ Returns the list of possible algorithm names, i.e. the list of valid identifiers for set_algorithm_enabled() and get_algorithm_settings().
This includes all possible algorithms, regardless of the prediction kind (regression/classification) or engine, so some algorithms may be irrelevant
- Returns
the list of algorithm names as a list of strings
- Return type
list of string
-
get_enabled_algorithm_names
()¶ - Returns
the list of enabled algorithm names as a list of strings
- Return type
list of string
-
get_enabled_algorithm_settings
()¶ - Returns
the map of enabled algorithm names with their settings
- Return type
dict
-
set_metric
(metric=None, custom_metric=None, custom_metric_greater_is_better=True, custom_metric_use_probas=False)¶ Sets the score metric to optimize for a prediction ML Task
- Parameters
metric (str) – metric to use. Leave empty to use a custom metric; you then need to set the custom_metric value
custom_metric (str) – code of the custom metric
custom_metric_greater_is_better (bool) – whether the custom metric is a score or a loss
custom_metric_use_probas (bool) – whether to use the classes’ probas or the predicted value (for classification)
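For example, to optimize for log loss rather than the default metric (the "LOG_LOSS" identifier is an assumption; valid identifiers depend on the prediction type):
# settings is a DSSPredictionMLTaskSettings object
# "LOG_LOSS" is an assumed metric identifier; inspect the raw settings
# of your task to see the identifiers valid for its prediction type
settings.set_metric(metric="LOG_LOSS")
settings.save()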
-
add_custom_python_model
(name='Custom Python Model', code='')¶ Adds a new custom python model
- Parameters
name (str) – name of the custom model
code (str) – code of the custom model
-
add_custom_mllib_model
(name='Custom MLlib Model', code='')¶ Adds a new custom MLlib model
- Parameters
name (str) – name of the custom model
code (str) – code of the custom model
-
save
()¶ Saves back these settings to the ML Task
-
-
class
dataikuapi.dss.ml.
DSSPredictionMLTaskSettings
(client, project_key, analysis_id, mltask_id, mltask_settings)¶ -
class
PredictionTypes
¶ -
BINARY
= 'BINARY_CLASSIFICATION'¶
-
REGRESSION
= 'REGRESSION'¶
-
MULTICLASS
= 'MULTICLASS'¶
-
-
get_all_possible_algorithm_names
()¶ Returns the list of possible algorithm names, i.e. the list of valid identifiers for set_algorithm_enabled() and get_algorithm_settings().
This includes all possible algorithms, regardless of the prediction kind (regression/classification) or engine, so some algorithms may be irrelevant
- Returns
the list of algorithm names as a list of strings
- Return type
list of string
-
get_enabled_algorithm_names
()¶ - Returns
the list of enabled algorithm names as a list of strings
- Return type
list of string
-
get_algorithm_settings
(algorithm_name)¶ Gets the training settings for a particular algorithm. This returns a reference to the algorithm’s settings, not a copy, so changes made to the returned object will be reflected when saving.
This method returns the settings for this algorithm as a PredictionAlgorithmSettings (extended dict). All algorithm dicts have at least an “enabled” property/key in the settings. The “enabled” property/key indicates whether this algorithm will be trained.
Other settings are algorithm-dependent and are the various hyperparameters of the algorithm. The precise properties/keys for each algorithm are not all documented. You can print the returned AlgorithmSettings to learn more about the settings of each particular algorithm.
Please refer to the documentation for details on available algorithms.
- Parameters
algorithm_name (str) – Name (in capitals) of the algorithm.
- Returns
A PredictionAlgorithmSettings (extended dict) for one of the built-in prediction algorithms
- Return type
PredictionAlgorithmSettings
-
split_ordered_by
(feature_name, ascending=True)¶ Deprecated. Use split_params.set_time_ordering()
-
remove_ordered_split
()¶ Deprecated. Use split_params.unset_time_ordering()
-
use_sample_weighting
(feature_name)¶ Deprecated. Use set_weighting()
-
set_weighting
(method, feature_name=None)¶ Sets the method to weight samples.
If there was a WEIGHT feature declared previously, it will be set back as an INPUT feature first.
- Parameters
method (str) – Method to use. One of NO_WEIGHTING, SAMPLE_WEIGHT (must give a feature name), CLASS_WEIGHT or CLASS_AND_SAMPLE_WEIGHT (must give a feature name)
feature_name (str) – Name of the feature to use as sample weight
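For instance, to weight training samples by a numeric column (a sketch; "weight_col" is a hypothetical feature name):
# settings is a DSSPredictionMLTaskSettings object
# "weight_col" is a hypothetical column carrying per-row weights
settings.set_weighting(method="SAMPLE_WEIGHT", feature_name="weight_col")
settings.save()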
-
remove_sample_weighting
()¶ Deprecated. Use set_weighting(method=”NO_WEIGHTING”) instead
-
get_assertions_params
()¶ Retrieves the assertions parameters for this ML task
- Return type
DSSMLAssertionsParams
-
class
dataikuapi.dss.ml.
DSSClusteringMLTaskSettings
(client, project_key, analysis_id, mltask_id, mltask_settings)¶ -
get_algorithm_settings
(algorithm_name)¶ Gets the training settings for a particular algorithm. This returns a reference to the algorithm’s settings, not a copy, so changes made to the returned object will be reflected when saving.
This method returns a dictionary of the settings for this algorithm. All algorithm dicts have at least an “enabled” key in the dictionary. The ‘enabled’ key indicates whether this algorithm will be trained
Other settings are algorithm-dependent and are the various hyperparameters of the algorithm. The precise keys for each algorithm are not all documented. You can print the returned dictionary to learn more about the settings of each particular algorithm
Please refer to the documentation for details on available algorithms.
- Parameters
algorithm_name (str) – Name of the algorithm (uppercase).
- Returns
A dict of the settings for an algorithm
- Return type
dict
-
-
class
dataikuapi.dss.ml.
DSSTimeseriesForecastingMLTaskSettings
(client, project_key, analysis_id, mltask_id, mltask_settings)¶ -
-
get_time_step_params
()¶ Gets the time step parameters for the time series forecasting task. This returns a reference to the time step parameters, not a copy, so changes made to the returned object will be reflected when saving
- Returns
A dict of the time step parameters
- Return type
dict
-
set_time_step
(time_unit=None, n_time_units=None, end_of_week_day=None, reguess=True, update_algorithm_settings=True)¶ Sets the time step parameters for the time series forecasting task.
- Parameters
time_unit (str) – time unit for forecasting step. Valid values are: MILLISECOND, SECOND, MINUTE, HOUR, DAY, BUSINESS_DAY, WEEK, MONTH, QUARTER, HALF_YEAR, YEAR
n_time_units (int) – number of time units within a time step
end_of_week_day (int) – only useful for the WEEK time unit. Valid values are: 1 (Sunday), 2 (Monday), …, 7 (Saturday)
reguess (bool) – Defaults to true. Whether to reguess the ML task settings after changing the time step params
update_algorithm_settings (bool) – Defaults to true. Whether the algorithm settings should be reguessed after changing time step parameters.
-
get_resampling_params
()¶ Gets the time series resampling parameters for the time series forecasting task. This returns a reference to the time series resampling parameters, not a copy, so changes made to the returned object will be reflected when saving
- Returns
A dict of the resampling parameters
- Return type
dict
-
set_numerical_interpolation
(method=None, constant=None)¶ Sets the time series resampling numerical interpolation parameters
- Parameters
method (str) – Interpolation method. Valid values are: NEAREST, PREVIOUS, NEXT, LINEAR, QUADRATIC, CUBIC, CONSTANT
constant (float) – Value for the CONSTANT interpolation method
-
set_numerical_extrapolation
(method=None, constant=None)¶ Sets the time series resampling numerical extrapolation parameters
- Parameters
method (str) – Extrapolation method. Valid values are: PREVIOUS_NEXT, NO_EXTRAPOLATION, CONSTANT, LINEAR, QUADRATIC, CUBIC
constant (float) – Value for the CONSTANT extrapolation method
-
set_categorical_imputation
(method=None, constant=None)¶ Sets the time series resampling categorical imputation parameters
- Parameters
method (str) – Imputation method. Valid values are: MOST_COMMON, NULL, CONSTANT, PREVIOUS_NEXT, PREVIOUS, NEXT
constant (str) – Value for the CONSTANT imputation method
-
set_duplicate_timestamp_handling
(method)¶ Sets the time series duplicate timestamp handling
- Parameters
method (str) – Duplicate timestamp handling method. Valid values are: FAIL_IF_CONFLICTING, DROP_IF_CONFLICTING, MEAN_MODE
-
property
forecast_horizon
¶ - Returns
Number of time steps to be forecast
- Return type
int
-
set_forecast_horizon
(forecast_horizon, reguess=True, update_algorithm_settings=True)¶ - Parameters
forecast_horizon (int) – Number of time steps to be forecast
reguess (bool) – Defaults to true. Whether to reguess the ML task settings after changing the forecast horizon
update_algorithm_settings (bool) – Defaults to true. Whether the algorithm settings should be reguessed after changing the forecast horizon.
-
property
evaluation_gap
¶ - Returns
Number of skipped time steps for evaluation
- Return type
int
-
property
time_variable
¶ - Returns
Feature used as time variable (read-only)
- Return type
str
-
property
timeseries_identifiers
¶ - Returns
Features used as time series identifiers (read-only copy)
- Return type
list
-
property
quantiles_to_forecast
¶ - Returns
List of quantiles to forecast
- Return type
list
-
-
class
dataikuapi.dss.ml.
PredictionSplitParamsHandler
(mltask_settings)¶ Object to modify the train/test splitting params.
-
SPLIT_PARAMS_KEY
= 'splitParams'¶
-
get_raw
()¶ Gets the raw settings of the prediction split configuration. This returns a reference to the raw settings, not a copy, so changes made to the returned object will be reflected when saving.
- Return type
dict
-
set_split_random
(train_ratio=0.8, selection=None, dataset_name=None)¶ Sets the train/test split to random splitting of an extract of a single dataset
- Parameters
train_ratio (float) – Ratio of rows to use for train set. Must be between 0 and 1
selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the dataset. May be None (won’t be changed)
dataset_name (str) – Name of dataset to split. If None, the main dataset used to create the visual analysis will be used.
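A sketch of tuning the split, assuming split_params is exposed as a property of prediction task settings (as the deprecated methods above suggest); "order_date" is a hypothetical column:
# settings is a DSSPredictionMLTaskSettings object
split_params = settings.split_params
# Random 90/10 split of the main dataset
split_params.set_split_random(train_ratio=0.9)
# Or order by a date column so the test set holds the most recent rows
split_params.set_time_ordering("order_date", ascending=True)
settings.save()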
-
set_split_kfold
(n_folds=5, selection=None, dataset_name=None)¶ Sets the train/test split to k-fold splitting of an extract of a single dataset
- Parameters
n_folds (int) – number of folds. Must be greater than 0
selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the dataset. May be None (won’t be changed)
dataset_name (str) – Name of dataset to split. If None, the main dataset used to create the visual analysis will be used.
-
set_split_explicit
(train_selection, test_selection, dataset_name=None, test_dataset_name=None, train_filter=None, test_filter=None)¶ Sets the train/test split to explicit extract of one or two dataset(s)
- Parameters
train_selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the train dataset. May be None (won’t be changed)
test_selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the test dataset. May be None (won’t be changed)
dataset_name (str) – Name of dataset to use for the extracts. If None, the main dataset used to create the ML Task will be used.
test_dataset_name (str) – Name of a second dataset to use for the test data extract. If None, both extracts are done from dataset_name
train_filter (object) – A DSSFilterBuilder to build the settings of the filter of the train dataset. May be None (won’t be changed)
test_filter (object) – A DSSFilterBuilder to build the settings of the filter of the test dataset. May be None (won’t be changed)
-
set_time_ordering
(feature_name, ascending=True)¶ Uses a variable to sort the data for train/test split and hyperparameter optimization by time
- Parameters
feature_name (str) – Name of the variable to use
ascending (bool) – True iff the test set is expected to have larger time values than the train set
-
unset_time_ordering
()¶ Remove time-based ordering for train/test split and hyperparameter optimization
-
has_time_ordering
()¶ - Returns
whether the splitting uses time ordering
- Return type
bool
-
get_time_ordering_variable
()¶ - Returns
the name of the variable
- Return type
str
-
is_time_ordering_ascending
()¶ - Returns
True if the ordering is set to be ascending with respect to the time-ordering variable
- Return type
bool
-
Exploration of results¶
-
class
dataikuapi.dss.ml.
DSSTrainedPredictionModelDetails
(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)¶ Object to read details of a trained prediction model
Do not create this object directly, use DSSMLTask.get_trained_model_details() instead.
-
get_roc_curve_data
()¶
-
get_performance_metrics
()¶ Returns all performance metrics for this model.
For binary classification models, this includes both “threshold-independent” metrics like AUC and “threshold-dependent” metrics like precision. Threshold-dependent metrics are returned at the threshold value that was found to be optimal during training.
To get access to the per-threshold values, use the following:
# Returns a list of tested threshold values
details.get_performance()["perCutData"]["cut"]
# Returns a list of F1 scores at the tested threshold values
details.get_performance()["perCutData"]["f1"]
# Both lists have the same length
If K-fold cross-test was used, most metrics will have a “std” variant, which is the standard deviation across the K cross-tested folds. For example, “auc” will be accompanied by “aucstd”.
- Returns
a dict of performance metrics values
- Return type
dict
-
get_assertions_metrics
()¶ Retrieves assertions metrics computed for this trained model
- Returns
an object representing assertion metrics
- Return type
DSSMLAssertionsMetrics
-
get_hyperparameter_search_points
()¶ Gets the list of points in the hyperparameter search space that have been tested.
Returns a list of dicts. Each entry in the list represents a point. For each point, the dict contains at least:
“score”: the average value of the optimization metric over all the folds at this point
“params”: a dict of the parameters at this point. This dict has the same structure as the params of the best parameters
-
get_preprocessing_settings
()¶ Gets the preprocessing settings that were used to train this model
- Return type
dict
-
get_modeling_settings
()¶ Gets the modeling (algorithms) settings that were used to train this model.
Note: the structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms)
- Return type
dict
-
get_actual_modeling_params
()¶ Gets the actual / resolved parameters that were used to train this model, post hyperparameter optimization.
- Returns
A dictionary, which contains at least a “resolved” key, which is a dict containing the post-optimization parameters
- Return type
dict
-
get_trees
()¶ Gets the trees in the model (for tree-based models)
- Returns
a DSSTreeSet object to interact with the trees
- Return type
dataikuapi.dss.ml.DSSTreeSet
-
get_coefficient_paths
()¶ Gets the coefficient paths for Lasso models
- Returns
a DSSCoefficientPaths object to interact with the coefficient paths
- Return type
dataikuapi.dss.ml.DSSCoefficientPaths
-
get_scoring_jar_stream
(model_class='model.Model', include_libs=False)¶ Get a scoring jar for this trained model, provided that you have the license to do so and that the model is compatible with optimized scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.
- Parameters
model_class (str) – fully-qualified class name, e.g. “com.company.project.Model”
include_libs (bool) – if True, also packs the required dependencies; if False, runtime will require the scoring libs given by
DSSClient.scoring_libs()
- Returns
a jar file, as a stream
- Return type
file-like
-
get_scoring_pmml_stream
()¶ Get a scoring PMML for this trained model, provided that you have the license to do so and that the model is compatible with PMML scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.
- Returns
a PMML file, as a stream
- Return type
file-like
-
get_scoring_python_stream
()¶ Download the zip containing data to use for this trained model, provided that you have the license to do so and that the model is compatible with Python scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.
- Returns
an archive file, as a stream
- Return type
file-like
-
get_scoring_python
(filename)¶ Download the zip containing data to use Python scoring for this trained model in filename, provided that you have the license to do so and that the model is compatible with Python scoring.
- Parameters
filename (str) – filename of the resulting downloaded file
-
get_scoring_mlflow_stream
()¶ Download the zip containing this trained model using MLflow Model format, provided that you have the license to do so and that the model is compatible with MLflow scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.
- Returns
an archive file, as a stream
- Return type
file-like
-
get_scoring_mlflow
(filename)¶ Download the zip containing data for this trained model, using MLflow Model format, provided that you have the license to do so and that the model is compatible with MLflow scoring
- Parameters
filename (str) – filename to the resulting MLflow Model zip
-
compute_subpopulation_analyses
(split_by, wait=True, sample_size=1000, random_state=1337, n_jobs=1, debug_mode=False)¶ Launch computation of Subpopulation analyses for this trained model.
- Parameters
split_by (list|str) – column(s) on which subpopulation analyses are to be computed (one analysis per column)
wait (bool) – if True, the call blocks until the computation is finished and returns the results directly
sample_size (int) – number of records of the dataset to use for the computation
random_state (int) – random state to use to build sample, for reproducibility
n_jobs (int) – number of cores used for parallel training. (-1 means ‘all cores’)
debug_mode (bool) – if True, output all logs (slower)
- Returns
if wait is True, an object containing the Subpopulation analyses, else a future to wait on the result
- Return type
dataikuapi.dss.ml.DSSSubpopulationAnalyses or dataikuapi.dss.future.DSSFuture
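A short usage sketch; "gender" is a hypothetical column of the train dataset:
# details is a DSSTrainedPredictionModelDetails object
analyses = details.compute_subpopulation_analyses(split_by=["gender"], wait=True, sample_size=1000)
# Previously computed analyses can be fetched again later
analyses = details.get_subpopulation_analyses()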
-
get_subpopulation_analyses
()¶ Retrieve all subpopulation analyses computed for this trained model
- Returns
the subpopulation analyses
- Return type
dataikuapi.dss.ml.DSSSubpopulationAnalyses
-
compute_partial_dependencies
(features, wait=True, sample_size=1000, random_state=1337, n_jobs=1, debug_mode=False)¶ Launch computation of Partial dependencies for this trained model.
- Parameters
features (list|str) – feature(s) on which partial dependencies are to be computed
wait (bool) – if True, the call blocks until the computation is finished and returns the results directly
sample_size (int) – number of records of the dataset to use for the computation
random_state (int) – random state to use to build sample, for reproducibility
n_jobs (int) – number of cores used for parallel training. (-1 means ‘all cores’)
debug_mode (bool) – if True, output all logs (slower)
- Returns
if wait is True, an object containing the Partial dependencies, else a future to wait on the result
- Return type
dataikuapi.dss.ml.DSSPartialDependencies or dataikuapi.dss.future.DSSFuture
-
get_partial_dependencies
()¶ Retrieve all partial dependencies computed for this trained model
- Returns
the partial dependencies
- Return type
dataikuapi.dss.ml.DSSPartialDependencies
-
download_documentation_stream
(export_id)¶ Download a model documentation, as a binary stream.
Warning: this stream will monopolize the DSSClient until closed.
- Parameters
export_id – the id of the generated model documentation returned as the result of the future
- Returns
the model documentation, as a binary stream
-
download_documentation_to_file
(export_id, path)¶ Download a model documentation into the given output file.
- Parameters
export_id – the id of the generated model documentation returned as the result of the future
path – the path where to download the model documentation
- Returns
None
-
property
full_id
¶
-
generate_documentation
(folder_id=None, path=None)¶ Start the model document generation from a template docx file in a managed folder, or from the default template if no folder id and path are specified.
- Parameters
folder_id – (optional) the id of the managed folder
path – (optional) the path to the file from the root of the folder
- Returns
A
DSSFuture
representing the model document generation process
-
generate_documentation_from_custom_template
(fp)¶ Start the model document generation from a docx template (as a file object).
- Parameters
fp (object) – A file-like object pointing to a template docx file
- Returns
A
DSSFuture
representing the model document generation process
-
get_diagnostics
()¶ Retrieves diagnostics computed for this trained model
- Returns
list of diagnostics
- Return type
list of type dataikuapi.dss.ml.DSSMLDiagnostic
-
get_origin_analysis_trained_model
()¶ Fetches details about the analysis model from which this model was exported. Returns None if the deployed trained model does not have an origin analysis trained model.
- Return type
DSSTrainedModelDetails | None
-
get_raw
()¶ Gets the raw dictionary of trained model details
-
get_raw_snippet
()¶ Gets the raw dictionary of trained model snippet. The snippet is a lighter version than the details.
-
get_train_info
()¶ Returns various information about the train process (size of the train set, quick description, timing information)
- Return type
dict
-
get_user_meta
()¶ Gets the user-accessible metadata (name, description, cluster labels, classification threshold). Returns the original object, not a copy. Changes to the returned object are persisted to DSS by calling save_user_meta()
-
save_user_meta
()¶
-
-
class
dataikuapi.dss.ml.
DSSTrainedClusteringModelDetails
(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)¶ Object to read details of a trained clustering model
Do not create this object directly, use DSSMLTask.get_trained_model_details() instead.
-
get_raw
()¶ Gets the raw dictionary of trained model details
-
get_train_info
()¶ Returns various information about the train process (size of the train set, quick description, timing information)
- Return type
dict
-
get_facts
()¶ Gets the ‘cluster facts’ data, i.e. the structure behind the screen “for cluster X, average of Y is Z times higher than average”.
- Return type
DSSClustersFacts
-
get_performance_metrics
()¶ Returns all performance metrics for this clustering model.
- Returns
a dict of performance metrics values
- Return type
dict
-
get_preprocessing_settings
()¶ Gets the preprocessing settings that were used to train this model
- Return type
dict
-
get_modeling_settings
()¶ Gets the modeling (algorithms) settings that were used to train this model.
Note: the structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms)
- Return type
dict
-
get_actual_modeling_params
()¶ Gets the actual / resolved parameters that were used to train this model.
- Returns
A dictionary, which contains at least a “resolved” key
- Return type
dict
-
get_scatter_plots
()¶ Gets the cluster scatter plot data
- Returns
a DSSScatterPlots object to interact with the scatter plots
- Return type
dataikuapi.dss.ml.DSSScatterPlots
-
download_documentation_stream
(export_id)¶ Download a model documentation, as a binary stream.
Warning: this stream will monopolize the DSSClient until closed.
- Parameters
export_id – the id of the generated model documentation returned as the result of the future
- Returns
the model documentation, as a binary stream
-
download_documentation_to_file
(export_id, path)¶ Download a model documentation into the given output file.
- Parameters
export_id – the id of the generated model documentation returned as the result of the future
path – the path where to download the model documentation
- Returns
None
-
property
full_id
¶
-
generate_documentation
(folder_id=None, path=None)¶ Start the model document generation from a template docx file in a managed folder, or from the default template if no folder id and path are specified.
- Parameters
folder_id – (optional) the id of the managed folder
path – (optional) the path to the file from the root of the folder
- Returns
A
DSSFuture
representing the model document generation process
-
generate_documentation_from_custom_template
(fp)¶ Start the model document generation from a docx template (as a file object).
- Parameters
fp (object) – A file-like object pointing to a template docx file
- Returns
A
DSSFuture
representing the model document generation process
-
get_diagnostics
()¶ Retrieves diagnostics computed for this trained model
- Returns
list of diagnostics
- Return type
list of type dataikuapi.dss.ml.DSSMLDiagnostic
-
get_origin_analysis_trained_model
()¶ Fetches details about the analysis model from which this model was exported. Returns None if the deployed trained model does not have an origin analysis trained model.
- Return type
DSSTrainedModelDetails | None
-
get_raw_snippet
()¶ Gets the raw dictionary of trained model snippet. The snippet is a lighter version than the details.
-
get_user_meta
()¶ Gets the user-accessible metadata (name, description, cluster labels, classification threshold). Returns the original object, not a copy. Changes to the returned object are persisted to DSS by calling save_user_meta()
-
save_user_meta
()¶
-
Saved models¶
-
class
dataikuapi.dss.savedmodel.
DSSSavedModel
(client, project_key, sm_id)¶ A handle to interact with a saved model on the DSS instance.
Do not create this directly, use dataikuapi.dss.project.DSSProject.get_saved_model()
-
property
id
¶
-
get_settings
()¶ Returns the settings of this saved model.
- Return type
DSSSavedModelSettings
-
list_versions
()¶ Get the versions of this saved model
- Returns
a list of the versions, as a list of dicts. Each dict contains at least an “id” field, which can be passed to get_metric_values(), get_version_details() and set_active_version()
- Return type
list
-
get_active_version
()¶ Gets the active version of this saved model
- Returns
a dict representing the active version, or None if no version is active. The dict contains at least an “id” field, which can be passed to get_metric_values(), get_version_details() and set_active_version()
- Return type
dict
-
get_version_details
(version_id)¶ Gets details for a version of a saved model
- Parameters
version_id (str) – Identifier of the version, as returned by
list_versions()
- Returns
A DSSTrainedPredictionModelDetails representing the details of this trained model id
- Return type
DSSTrainedPredictionModelDetails
-
set_active_version
(version_id)¶ Sets a particular version of the saved model as the active one
-
delete_versions
(versions, remove_intermediate=True)¶ Delete version(s) of the saved model
- Parameters
versions (list[str]) – list of versions to delete
remove_intermediate (bool) – also remove intermediate versions (default: True). In the case of a partitioned model, an intermediate version is created every time a partition has finished training.
-
get_origin_ml_task
()¶ Fetch the last ML task that has been exported to this saved model. Returns None if the saved model does not have an origin ml task.
- Return type
DSSMLTask | None
-
import_mlflow_version_from_path
(version_id, path, code_env_name='INHERIT', container_exec_config_name='NONE', set_active=True, binary_classification_threshold=0.5)¶ Create a new version for this saved model from a path containing an MLflow model.
Requires the saved model to have been created using dataikuapi.dss.project.DSSProject.create_mlflow_pyfunc_model().
- Parameters
version_id (str) – Identifier of the version to create
path (str) – An absolute path on the local filesystem. Must be a folder, and must contain a MLFlow model
code_env_name (str) – Name of the code env to use for this model version. The code env must contain at least mlflow and the package(s) corresponding to the used MLFlow-compatible frameworks. If value is “INHERIT”, the default active code env of the project will be used
container_exec_config_name (str) – Name of the containerized execution configuration to use while creating this model version. If value is “INHERIT”, the container execution configuration of the project will be used. If value is “NONE”, local execution will be used (no container)
set_active (bool) – sets this new version as the active version of the saved model
binary_classification_threshold (float) – For binary classification, define the actual threshold for the imported version. Default to 0.5
- Returns
an ExternalModelVersionHandler to interact with the new MLflow model version
-
import_mlflow_version_from_managed_folder
(version_id, managed_folder, path, code_env_name='INHERIT', container_exec_config_name='INHERIT', set_active=True, binary_classification_threshold=0.5)¶ Create a new version for this saved model from a path containing an MLflow model in a managed folder.
Requires the saved model to have been created using dataikuapi.dss.project.DSSProject.create_mlflow_pyfunc_model().
- Parameters
version_id (str) – Identifier of the version to create
managed_folder (str) – Identifier of the managed folder or dataikuapi.dss.managedfolder.DSSManagedFolder
path (str) – Path of the MLflow folder in the managed folder
code_env_name (str) – Name of the code env to use for this model version. The code env must contain at least mlflow and the package(s) corresponding to the used MLFlow-compatible frameworks. If value is “INHERIT”, the default active code env of the project will be used
container_exec_config_name (str) – Name of the containerized execution configuration to use for evaluating this model version. If value is “INHERIT”, the container execution configuration of the project will be used. If value is “NONE”, local execution will be used (no container)
set_active (bool) – sets this new version as the active version of the saved model
binary_classification_threshold (float) – For binary classification, define the actual threshold for the imported version. Default to 0.5
- Returns
an ExternalModelVersionHandler to interact with the new MLflow model version
-
create_proxy_model_version
(version_id, protocol, configuration)¶ EXPERIMENTAL. Creates a new version of a proxy model.
This is an experimental API, subject to change. Requires the saved model to have been created using dataikuapi.dss.project.DSSProject.create_proxy_model().
- Parameters
version_id (str) – Identifier of the version to create
protocol (str) – one of [“KServe”, “DSS_API_NODE”]
configuration (dict) – A dictionary containing the required params for the selected protocol
- Returns
an ExternalModelVersionHandler to interact with the new proxy model version
-
get_external_model_version_handler
(version_id)¶ Returns an ExternalModelVersionHandler to interact with an external model version (MLflow or proxy model)
-
get_metric_values
(version_id)¶ Get the values of the metrics on the version of this saved model
- Returns
a list of metric objects and their value
-
get_zone
()¶ Gets the flow zone of this saved model
- Return type
dataikuapi.dss.flow.DSSFlowZone
-
move_to_zone
(zone)¶ Moves this object to a flow zone
- Parameters
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
where to move the object
-
share_to_zone
(zone)¶ Share this object to a flow zone
- Parameters
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
where to share the object
-
unshare_from_zone
(zone)¶ Unshare this object from a flow zone
- Parameters
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
from where to unshare the object
-
get_usages
()¶ Get the recipes referencing this model
- Returns
a list of usages
-
get_object_discussions
()¶ Get a handle to manage discussions on the saved model
- Returns
the handle to manage discussions
- Return type
dataikuapi.discussion.DSSObjectDiscussions
-
delete
()¶ Delete the saved model
-
class
dataikuapi.dss.savedmodel.
DSSSavedModelSettings
(saved_model, settings)¶ A handle on the settings of a saved model
Do not create this class directly, instead use
dataikuapi.dss.savedmodel.DSSSavedModel.get_settings()
-
get_raw
()¶
-
property
prediction_metrics_settings
¶ The settings of evaluation metrics for a prediction saved model
-
save
()¶ Saves the settings of this saved model
-
MLflow models¶
-
class
dataikuapi.dss.savedmodel.
ExternalModelVersionHandler
(saved_model, version_id)¶ Handler to interact with an external model version (MLflow import or proxy model)
-
get_settings
()¶
-
set_core_metadata
(target_column_name, class_labels=None, get_features_from_dataset=None, features_list=None, output_style='AUTO_DETECT', container_exec_config_name='NONE')¶ Sets metadata for this MLFlow model version
In addition to target_column_name, one of get_features_from_dataset or features_list must be passed in order to be able to evaluate performance
- Parameters
target_column_name (str) – name of the target column. Mandatory in order to be able to evaluate performance
class_labels (list) – List of strings, ordered class labels. Mandatory in order to be able to evaluate performance on classification models
get_features_from_dataset (str) – Name of a dataset to get feature names from
features_list (list) – List of {“name”: “feature_name”, “type”: “feature_type”}
container_exec_config_name (str) – Name of the containerized execution configuration to use for running the evaluation process. If value is “INHERIT”, the container execution configuration of the project will be used. If value is “NONE” (default), local execution will be used (no container)
-
evaluate
(dataset_ref, container_exec_config_name='INHERIT', selection=None, use_optimal_threshold=True)¶ Evaluates the performance of this model version on a particular dataset. After calling this, the “result screens” of the MLflow model version will be available (confusion matrix, error distribution, performance metrics, …) and more information will be available when calling DSSSavedModel.get_version_details(). set_core_metadata() must be called before you can evaluate a dataset.
- Parameters
dataset_ref (str) – Evaluation dataset to use (either a dataset name, “PROJECT.datasetName”, a DSSDataset instance or a dataiku.Dataset instance)
container_exec_config_name (str) – Name of the containerized execution configuration to use for running the evaluation process. If value is “INHERIT”, the container execution configuration of the project will be used. If value is “NONE”, local execution will be used (no container)
selection (str) – will default to HEAD_SEQUENTIAL with a maxRecords of 10_000
use_optimal_threshold (boolean) – Choose between optimized or actual threshold. The optimized threshold has been computed according to the metric set on the saved model setting “prediction_metrics_settings[‘thresholdOptimizationMetric’]”
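Putting these methods together, a hedged end-to-end sketch of importing and evaluating an MLflow model; the model name, path and dataset name are hypothetical:
# p is a DSSProject handle
sm = p.create_mlflow_pyfunc_model("mlflow_model", prediction_type="BINARY_CLASSIFICATION")
version = sm.import_mlflow_version_from_path("v01", "/path/to/mlflow_model")
# Metadata is required before evaluation
version.set_core_metadata("target", class_labels=["0", "1"],
                          get_features_from_dataset="eval_dataset")
# Evaluate on a dataset; the result screens then become available
version.evaluate("eval_dataset")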
-
Algorithm details¶
This section documents which algorithms are available, and some of the settings for them.
These algorithm names can be used for dataikuapi.dss.ml.DSSMLTaskSettings.get_algorithm_settings()
and dataikuapi.dss.ml.DSSMLTaskSettings.set_algorithm_enabled()
Note
This documentation does not cover all settings of all algorithms. To know which settings are
available for an algorithm, use mltask_settings.get_algorithm_settings('ALGORITHM_NAME')
and print the returned dictionary.
Generally speaking, algorithm settings which are arrays mean that the parameter can be grid-searched: all values will be tested as part of the hyperparameter optimization.
For more documentation of settings, please refer to the UI of the visual machine learning, which contains detailed documentation for all algorithm parameters.
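For instance, to see what can be configured for one algorithm:
# mltask_settings comes from mltask.get_settings()
algo_settings = mltask_settings.get_algorithm_settings("RANDOM_FOREST_CLASSIFICATION")
# Array-valued entries can generally be grid-searched
print(algo_settings)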
LOGISTIC_REGRESSION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
Main parameters:
{
    "multi_class": SingleCategoryHyperparameterSettings, # accepted values: ['multinomial', 'ovr']
    "penalty": CategoricalHyperparameterSettings, # possible values: ["l1", "l2"]
    "C": NumericalHyperparameterSettings, # scaling: "LOGARITHMIC"
    "n_jobs": 2
}
RANDOM_FOREST_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
Main parameters:
{
    "n_estimators": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "min_samples_leaf": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "max_tree_depth": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "max_feature_prop": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "max_features": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "selection_mode": SingleCategoryHyperparameterSettings, # accepted_values=['auto', 'sqrt', 'log2', 'number', 'prop']
    "n_jobs": 4
}
RANDOM_FOREST_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
Main parameters: same as RANDOM_FOREST_CLASSIFICATION
EXTRA_TREES¶
Type: Prediction (all kinds)
Available on backend: PY_MEMORY
RIDGE_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
LASSO_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
LEASTSQUARE_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
SVC_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
SVM_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
SGD_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
SGD_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
GBT_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
GBT_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
DECISION_TREE_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
DECISION_TREE_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
LIGHTGBM_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
LIGHTGBM_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
XGBOOST_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
XGBOOST_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
NEURAL_NETWORK¶
Type: Prediction (all kinds)
Available on backend: PY_MEMORY
DEEP_NEURAL_NETWORK_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
DEEP_NEURAL_NETWORK_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
MLLIB_LOGISTIC_REGRESSION¶
Type: Prediction (binary or multiclass)
Available on backend: MLLIB
MLLIB_DECISION_TREE¶
Type: Prediction (all kinds)
Available on backend: MLLIB
MLLIB_RANDOM_FOREST¶
Type: Prediction (all kinds)
Available on backend: MLLIB
MLLIB_LINEAR_REGRESSION¶
Type: Prediction (regression)
Available on backend: MLLIB
MLLIB_NAIVE_BAYES¶
Type: Prediction (all kinds)
Available on backend: MLLIB