Machine learning¶
Through the public API, the Python client allows you to automate all aspects of the lifecycle of machine learning models:
Creating a visual analysis and ML task
Tuning settings
Training models
Inspecting model details and results
Deploying saved models to Flow and retraining them
Concepts¶
In DSS, you train models as part of a visual analysis. A visual analysis is made of a preparation script, and one or several ML Tasks.
A ML Task is an individual section in which you train models. A ML Task is either a prediction of a single target variable, or a clustering.
The ML API allows you to manipulate ML Tasks, and use them to train models, inspect their details, and deploy them to the Flow.
Once deployed to the Flow, the Saved model can be retrained by the usual build mechanism of DSS.
A ML Task has settings, which control the following (a short inspection sketch follows this list):
Which features are active
The preprocessing settings for each feature
Which algorithms are active
The hyperparameter settings (including grid-searched hyperparameters) for each algorithm
The settings of the grid search
Train/test splitting settings
Feature selection and generation settings
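These settings can be inspected programmatically, as the sketch below shows. This is a minimal sketch, assuming an existing DSSMLTask handle named mltask:
# mltask is a DSSMLTask object
settings = mltask.get_settings()
# get_raw() returns a reference to the underlying settings dict;
# its top-level keys cover the setting groups listed above
print(settings.get_raw().keys())
# Enabled algorithms can also be listed directly
print(settings.get_enabled_algorithm_names())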
Usage samples¶
The whole cycle¶
This example creates a prediction ML task, enables an algorithm, trains models, inspects them, and deploys one of them to the Flow.
# client is a DSS API client
p = client.get_project("MYPROJECT")
# Create a new ML Task to predict the variable "target" from "trainset"
mltask = p.create_prediction_ml_task(
    input_dataset="trainset",
    target_variable="target",
    ml_backend_type='PY_MEMORY', # ML backend to use
    guess_policy='DEFAULT' # Template to use for setting default parameters
)
# Wait for the ML task to be ready
mltask.wait_guess_complete()
# Obtain settings, enable GBT, save settings
settings = mltask.get_settings()
settings.set_algorithm_enabled("GBT_CLASSIFICATION", True)
settings.save()
# Start train and wait for it to be complete
mltask.start_train()
mltask.wait_train_complete()
# Get the identifiers of the trained models
# There will be 3 of them, because Logistic regression and Random forest were enabled by default
ids = mltask.get_trained_models_ids()
for id in ids:
    details = mltask.get_trained_model_details(id)
    algorithm = details.get_modeling_settings()["algorithm"]
    auc = details.get_performance_metrics()["auc"]
    print("Algorithm=%s AUC=%s" % (algorithm, auc))
# Let's deploy the first model
model_to_deploy = ids[0]
ret = mltask.deploy_to_flow(model_to_deploy, "my_model", "trainset")
print("Deployed to saved model id = %s train recipe = %s" % (ret["savedModelId"], ret["trainRecipeName"]))
The methods for creating prediction and clustering ML tasks are dataikuapi.dss.project.DSSProject.create_prediction_ml_task() and dataikuapi.dss.project.DSSProject.create_clustering_ml_task().
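Creating a clustering ML task follows the same pattern. A minimal sketch; the guess_policy value shown is an assumption, check the method reference for the accepted templates:
# client is a DSS API client
p = client.get_project("MYPROJECT")
# Create a clustering ML Task on the same dataset
mltask = p.create_clustering_ml_task(
    input_dataset="trainset",
    ml_backend_type='PY_MEMORY', # ML backend to use
    guess_policy='KMEANS' # Template for default parameters (assumed value)
)
# As for prediction tasks, wait for the initial guessing to complete
mltask.wait_guess_complete()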
Obtaining a handle to an existing ML Task¶
When you create these ML tasks, the returned dataikuapi.dss.ml.DSSMLTask object contains two fields, analysis_id and mltask_id, that can later be used to retrieve the same DSSMLTask object:
# client is a DSS API client
p = client.get_project("MYPROJECT")
mltask = p.get_ml_task(analysis_id, mltask_id)
Tuning feature preprocessing¶
Enabling and disabling features¶
# mltask is a DSSMLTask object
settings = mltask.get_settings()
settings.reject_feature("not_useful")
settings.use_feature("useful")
settings.save()
Changing advanced parameters for a feature¶
# mltask is a DSSMLTask object
settings = mltask.get_settings()
# Use impact coding rather than dummy-coding
fs = settings.get_feature_preprocessing("mycategory")
fs["category_handling"] = "IMPACT"
# Impute missing with most frequent value
fs["missing_handling"] = "IMPUTE"
fs["missing_impute_with"] = "MODE"
settings.save()
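To apply the same change to many features at once, foreach_feature() (documented in the API reference below) can be used. A sketch that rejects every text feature; the "role" key is an assumption about the per-feature settings dict:
# mltask is a DSSMLTask object
settings = mltask.get_settings()

def reject_feature(feature_name, feature_params):
    # Mark the feature as rejected and return the modified params
    feature_params["role"] = "REJECT" # "role" key assumed
    return feature_params

# Only apply to features of type TEXT
settings.foreach_feature(reject_feature, only_of_type="TEXT")
settings.save()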
Tuning algorithms¶
Global parameters for hyperparameter search¶
This sample shows how to modify the parameters of the search to be performed on the hyperparameters.
# mltask is a DSSMLTask object
settings = mltask.get_settings()
hp_search_settings = settings.get_hyperparameter_search_settings()
# Set the search strategy either to "GRID", "RANDOM" or "BAYESIAN"
hp_search_settings.strategy = "RANDOM"
# Alternatively use a setter, either set_grid_search
# set_random_search or set_bayesian_search
hp_search_settings.set_random_search(seed=1234)
# Set the validation mode either to "KFOLD", "SHUFFLE" (or accordingly their
# "TIME_SERIES"-prefixed counterpart) or "CUSTOM"
hp_search_settings.validation_mode = "KFOLD"
# Alternatively use a setter, either set_kfold_validation, set_single_split_validation
# or set_custom_validation
hp_search_settings.set_kfold_validation(n_folds=5, stratified=True)
# Save the settings
settings.save()
Algorithm specific hyperparameter search¶
This sample shows how to modify the settings of the Random Forest Classification algorithm, where two kinds of hyperparameters (multi-valued numerical and single-valued) are introduced.
# mltask is a DSSMLTask object
settings = mltask.get_settings()
rf_settings = settings.get_algorithm_settings("RANDOM_FOREST_CLASSIFICATION")
# rf_settings is an object representing the settings for this algorithm.
# The 'enabled' attribute indicates whether this algorithm will be trained.
# Other attributes are the various hyperparameters of the algorithm.
# The precise hyperparameters for each algorithm are not all documented, so let's
# print the dictionary keys to see available hyperparameters.
# Alternatively, tab completion will provide relevant hints to available hyperparameters.
print(rf_settings.keys())
# Let's first have a look at rf_settings.n_estimators which is a multi-valued hyperparameter
# represented as a NumericalHyperparameterSettings object
print(rf_settings.n_estimators)
# Set multiple explicit values for "n_estimators" to be explored during the search
rf_settings.n_estimators.definition_mode = "EXPLICIT"
rf_settings.n_estimators.values = [100, 200]
# Alternatively use the set_values setter
rf_settings.n_estimators.set_values([100, 200])
# Set a range of values for "n_estimators" to be explored during the search
rf_settings.n_estimators.definition_mode = "RANGE"
rf_settings.n_estimators.range.min = 10
rf_settings.n_estimators.range.max = 100
rf_settings.n_estimators.range.nb_values = 5 # Only relevant for grid-search
# Alternatively, use the set_range setter
rf_settings.n_estimators.set_range(min=10, max=100, nb_values=5)
# Let's now have a look at rf_settings.selection_mode which is a single-valued hyperparameter
# represented as a SingleCategoryHyperparameterSettings object.
# The object stores the valid options for this hyperparameter.
print(rf_settings.selection_mode)
# Feature selection mode is not multi-valued, so it is not actually searched
# during the hyperparameter search
rf_settings.selection_mode = "sqrt"
# Save the settings
settings.save()
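A related pattern is to start from a clean slate: disable every algorithm, then enable only the ones you want, using methods documented in the API reference below:
# mltask is a DSSMLTask object
settings = mltask.get_settings()
# List the valid algorithm identifiers for this task
print(settings.get_all_possible_algorithm_names())
# Disable everything, then enable a chosen subset
settings.disable_all_algorithms()
settings.set_algorithm_enabled("RANDOM_FOREST_CLASSIFICATION", True)
settings.set_algorithm_enabled("GBT_CLASSIFICATION", True)
print(settings.get_enabled_algorithm_names())
settings.save()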
The next sample shows how to modify the settings of the Logistic Regression classification algorithm, where a new kind of hyperparameter (multi-valued categorical) is introduced.
# mltask is a DSSMLTask object
settings = mltask.get_settings()
logit_settings = settings.get_algorithm_settings("LOGISTIC_REGRESSION")
# Let's have a look at logit_settings.penalty which is a multi-valued categorical
# hyperparameter represented as a CategoricalHyperparameterSettings object
print(logit_settings.penalty)
# List currently enabled values
print(logit_settings.penalty.get_values())
# List all possible values
print(logit_settings.penalty.get_all_possible_values())
# Set the values for the "penalty" hyperparameter to be explored during the search
logit_settings.penalty = ["l1", "l2"]
# Alternatively use the set_values setter
logit_settings.penalty.set_values(["l1", "l2"])
# Save the settings
settings.save()
Exporting a model documentation¶
This sample shows how to generate and download a model documentation from a template.
See Model Document Generator for more information.
# mltask is a DSSMLTask object
# id is a trained model identifier, as returned by mltask.get_trained_models_ids()
details = mltask.get_trained_model_details(id)
# Launch the model document generation by either
# using the default template for this model by calling without argument
# or specifying a managed folder id and the path to the template to use in that folder
future = details.generate_documentation(FOLDER_ID, "path/my_template.docx")
# Alternatively, use a custom uploaded template file
with open("my_template.docx", "rb") as f:
    future = details.generate_documentation_from_custom_template(f)
# Wait for the generation to finish, retrieve the result and download the generated
# model documentation to the specified file
result = future.wait_for_result()
export_id = result["exportId"]
details.download_documentation_to_file(export_id, "path/my_model_documentation.docx")
API Reference¶
Interaction with a ML Task¶
-
class
dataikuapi.dss.ml.
DSSMLTask
(client, project_key, analysis_id, mltask_id)¶ -
static
from_full_model_id
(client, fmi, project_key=None)¶
-
delete
()¶ Delete the present ML task
-
wait_guess_complete
()¶ Waits for the guessing to be complete. This should be called immediately after the creation of a new ML Task (if the ML Task was created with wait_guess_complete=False), before calling get_settings() or train()
-
get_status
()¶ Gets the status of this ML Task
- Returns
a dict
-
get_settings
()¶ Gets the settings of this ML Task
- Returns
a DSSMLTaskSettings object to interact with the settings
- Return type
DSSMLTaskSettings
-
train
(session_name=None, session_description=None, run_queue=False)¶ Trains models for this ML Task
- Parameters
session_name (str) – name for the session
session_description (str) – description for the session
This method waits for train to complete. If you want to train asynchronously, use start_train() and wait_train_complete().
This method returns the list of trained model identifiers. It returns models that have been trained for this train session, not all trained models for this ML task. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids().
These identifiers can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow()
- Returns
A list of model identifiers
- Return type
list of strings
-
ensemble
(model_ids=None, method=None)¶ Create an ensemble model of a set of models
- Parameters
model_ids (list) – A list of model identifiers (defaults to [])
method (str) – the ensembling method. One of: AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL
This method waits for the ensemble train to complete. If you want to train asynchronously, use start_ensembling() and wait_train_complete().
This method returns the identifier of the trained ensemble. To get all identifiers for all models trained across all training sessions, use get_trained_models_ids().
This identifier can be used for get_trained_model_snippet(), get_trained_model_details() and deploy_to_flow()
- Returns
A model identifier
- Return type
string
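A sketch of how ensemble() composes with the rest of the task API; it assumes at least two compatible models were already trained:
# mltask is a DSSMLTask object with at least two trained models
ids = mltask.get_trained_models_ids()
# Build a probability-averaging ensemble of the first two models
ensemble_id = mltask.ensemble(model_ids=ids[:2], method="PROBA_AVERAGE")
# The returned identifier behaves like any other trained model id
details = mltask.get_trained_model_details(ensemble_id)
print(details.get_performance_metrics())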
-
start_train
(session_name=None, session_description=None, run_queue=False)¶ Starts asynchronously a new train session for this ML Task.
- Parameters
session_name (str) – name for the session
session_description (str) – description for the session
This returns immediately, before train is complete. To wait for train to complete, use
wait_train_complete()
-
start_ensembling
(model_ids=None, method=None)¶ Asynchronously starts creating a new ensemble model of a set of models.
- Parameters
model_ids (list) – A list of model identifiers (defaults to [])
method (str) – the ensembling method (AVERAGE, PROBA_AVERAGE, MEDIAN, VOTE, LINEAR_MODEL, LOGISTIC_MODEL)
This returns immediately, before train is complete. To wait for train to complete, use
wait_train_complete()
- Returns
the model identifier of the ensemble
- Return type
string
-
wait_train_complete
()¶ Waits for train to be complete (if started with start_train())
-
get_trained_models_ids
(session_id=None, algorithm=None)¶ Gets the list of trained model identifiers for this ML task.
These identifiers can be used for get_trained_model_snippet() and deploy_to_flow()
- Returns
A list of model identifiers
- Return type
list of strings
-
get_trained_model_snippet
(id=None, ids=None)¶ Gets a quick summary of a trained model, as a dict. For complete information and a structured object, use get_trained_model_details()
- Parameters
id (str) – a model id
ids (list) – a list of model ids
- Return type
dict
-
get_trained_model_details
(id)¶ Gets details for a trained model
- Parameters
id (str) – Identifier of the trained model, as returned by get_trained_models_ids()
- Returns
A DSSTrainedPredictionModelDetails or DSSTrainedClusteringModelDetails representing the details of this trained model id
- Return type
DSSTrainedPredictionModelDetails or DSSTrainedClusteringModelDetails
-
delete_trained_model
(model_id)¶ Deletes a trained model
- Parameters
model_id (str) – Model identifier, as returned by get_trained_models_ids()
-
train_queue
()¶ Trains this ML Task’s queue
- Returns
A dict including the next session ID to be trained in the queue
- Return type
dict
-
deploy_to_flow
(model_id, model_name, train_dataset, test_dataset=None, redo_optimization=True)¶ Deploys a trained model from this ML Task to a saved model + train recipe in the Flow.
- Parameters
model_id (str) – Model identifier, as returned by
get_trained_models_ids()
model_name (str) – Name of the saved model to deploy in the Flow
train_dataset (str) – Name of the dataset to use as train set. May either be a short name or a PROJECT.name long name (when using a shared dataset)
test_dataset (str) – Name of the dataset to use as test set. If null, the split will be applied to the train set. May either be a short name or a PROJECT.name long name (when using a shared dataset). Only for PREDICTION tasks
redo_optimization (bool) – Should the hyperparameter optimization phase be redone? Defaults to True. Only for PREDICTION tasks
- Returns
A dict containing: “savedModelId” and “trainRecipeName” - Both can be used to obtain further handles
- Return type
dict
-
redeploy_to_flow
(model_id, recipe_name=None, saved_model_id=None, activate=True)¶ Redeploys a trained model from this ML Task to a saved model + train recipe in the Flow. Either recipe_name or saved_model_id needs to be specified
- Parameters
model_id (str) – Model identifier, as returned by
get_trained_models_ids()
recipe_name (str) – Name of the training recipe to update
saved_model_id (str) – Name of the saved model to update
activate (bool) – Should the deployed model version become the active version
- Returns
A dict containing: “impactsDownstream” - whether the active version changed and downstream recipes are impacted
- Return type
dict
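A sketch of a retrain-and-redeploy loop; saved_model_id is assumed to come from an earlier deploy_to_flow() call:
# mltask is a DSSMLTask object; saved_model_id comes from an earlier
# deploy_to_flow() call (its "savedModelId" return value)
ids = mltask.train(session_name="retrain")
best_model_id = ids[0]  # pick a model, e.g. after comparing metrics
ret = mltask.redeploy_to_flow(best_model_id, saved_model_id=saved_model_id, activate=True)
print(ret["impactsDownstream"])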
-
remove_unused_splits
()¶ Deletes all stored split data that is no longer in use for this ML Task.
It is generally not needed to call this method.
-
remove_all_splits
()¶ Deletes all stored splits data for this ML Task. This operation saves disk space.
After performing this operation, it will not be possible anymore to:
Ensemble already trained models
View the “predicted data” or “charts” for already trained models
Resume training of models for which optimization had been previously interrupted
Training new models remains possible.
-
guess
(prediction_type=None, reguess_level=None, target_variable=None, timeseries_identifiers=None, time_variable=None, full_reguess=None)¶ Reguesses all the settings of the ML task when no optional parameters are given. For prediction ML tasks only, sets a new value for a core parameter of the task (target variable or prediction type) and subsequently reguesses the impacted settings.
- Parameters
prediction_type (string) – Only valid for prediction tasks of either BINARY_CLASSIFICATION, MULTICLASS or REGRESSION type, ignored otherwise. The prediction type to set. Cannot be set if target_variable, time_variable, or timeseries_identifiers is also specified.
target_variable (string) – Only valid for prediction tasks, ignored for clustering. The target variable to set. Cannot be set if prediction_type, time_variable, or timeseries_identifiers is also specified.
timeseries_identifiers (list) – Only valid for time series forecasting tasks. List of columns to be used as time series identifiers. Cannot be set if prediction_type, target_variable, or time_variable is also specified.
time_variable (string) – Only valid for time series forecasting tasks. Column to be used as time variable. Cannot be set if prediction_type, target_variable, or timeseries_identifiers is also specified.
full_reguess (bool) – Only valid for prediction tasks, ignored for clustering. Scope of the reguess process: whether it should reguess all the settings after changing a core parameter, or only reguess impacted settings (e.g. target remapping when changing the target, metrics when changing the prediction type…). Ignored if no core parameter is given. Defaults to true.
reguess_level (string) – Deprecated, use full_reguess instead. Only valid for prediction tasks. Can be one of the following values:
TARGET_CHANGE: Change the target if target_variable is specified, reguess the target remapping, and clear the model’s assertions if any. Equivalent to full_reguess=False (recommended usage)
FULL_REGUESS: All the settings of the ML task are reguessed. Equivalent to full_reguess=True (recommended usage)
Manipulation of settings¶
-
class
dataikuapi.dss.ml.
DSSMLTaskSettings
(client, project_key, analysis_id, mltask_id, mltask_settings)¶ Object to read and modify the settings of a ML task.
Do not create this object directly, use DSSMLTask.get_settings() instead.
-
get_raw
()¶ Gets the raw settings of this ML Task. This returns a reference to the raw settings, not a copy, so changes made to the returned object will be reflected when saving.
- Return type
dict
-
get_feature_preprocessing
(feature_name)¶ Gets the feature preprocessing params for a particular feature. This returns a reference to the feature’s settings, not a copy, so changes made to the returned object will be reflected when saving
- Returns
A dict of the preprocessing settings for a feature
- Return type
dict
-
foreach_feature
(fn, only_of_type=None)¶ Applies a function to all features (except target)
- Parameters
fn (function) – Function that takes 2 parameters: feature_name and feature_params and returns modified feature_params
only_of_type (str) – if not None, only applies to features of the given type. Can be one of CATEGORY, NUMERIC, TEXT or VECTOR
-
reject_feature
(feature_name)¶ Marks a feature as rejected and not used for training
- Parameters
feature_name (str) – Name of the feature to reject
-
use_feature
(feature_name)¶ Marks a feature as input for training
- Parameters
feature_name (str) – Name of the feature to use
-
get_algorithm_settings
(algorithm_name)¶
-
get_diagnostics_settings
()¶ Gets the diagnostics settings for an ML task. This returns a reference to the diagnostics’ settings, not a copy, so changes made to the returned object will be reflected when saving.
This method returns a dictionary of the settings with:
‘enabled’: indicates if the diagnostics are enabled globally; if False, all diagnostics will be disabled
‘settings’: a list of dicts, each comprised of:
‘type’: the diagnostic type
‘enabled’: indicates if the diagnostic type is enabled; if False, all diagnostics of that type will be disabled
Please refer to the documentation for details on available diagnostics.
- Returns
A dict of diagnostics settings
- Return type
dict
-
set_diagnostics_enabled
(enabled)¶ Globally enables or disables all diagnostics.
- Parameters
enabled (bool) – if the diagnostics should be enabled or not
-
set_diagnostic_type_enabled
(diagnostic_type, enabled)¶ Enables or disables a diagnostic based on its type.
Please refer to the documentation for details on available diagnostics.
- Parameters
diagnostic_type (str) – Name (in capitals) of the diagnostic type.
enabled (bool) – if the diagnostic should be enabled or not
-
set_algorithm_enabled
(algorithm_name, enabled)¶ Enables or disables an algorithm based on its name.
Please refer to the documentation for details on available algorithms.
- Parameters
algorithm_name (str) – Name (in capitals) of the algorithm.
enabled (bool) – whether the algorithm should be enabled
-
disable_all_algorithms
()¶ Disables all algorithms
-
get_all_possible_algorithm_names
()¶ Returns the list of possible algorithm names, i.e. the list of valid identifiers for set_algorithm_enabled() and get_algorithm_settings().
This includes all possible algorithms, regardless of the prediction kind (regression/classification) or engine, so some algorithms may be irrelevant
- Returns
the list of algorithm names as a list of strings
- Return type
list of string
-
get_enabled_algorithm_names
()¶ - Returns
the list of enabled algorithm names as a list of strings
- Return type
list of string
-
get_enabled_algorithm_settings
()¶ - Returns
the map of enabled algorithm names with their settings
- Return type
dict
-
set_metric
(metric=None, custom_metric=None, custom_metric_greater_is_better=True, custom_metric_use_probas=False)¶ Sets the score metric to optimize for a prediction ML Task
- Parameters
metric (str) – metric to use. Leave empty to use a custom metric; you then need to set the custom_metric value
custom_metric (str) – code of the custom metric
custom_metric_greater_is_better (bool) – whether the custom metric is a score or a loss
custom_metric_use_probas (bool) – whether to use the classes’ probas or the predicted value (for classification)
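For example, to optimize for log loss rather than the default metric (the "LOG_LOSS" identifier is an assumption; valid identifiers depend on the prediction type):
# settings is a DSSPredictionMLTaskSettings object
# "LOG_LOSS" is an assumed metric identifier; inspect the raw settings
# of your task to see the identifiers valid for its prediction type
settings.set_metric(metric="LOG_LOSS")
settings.save()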
-
add_custom_python_model
(name='Custom Python Model', code='')¶ Adds a new custom python model
- Parameters
name (str) – name of the custom model
code (str) – code of the custom model
-
add_custom_mllib_model
(name='Custom MLlib Model', code='')¶ Adds a new custom MLlib model
- Parameters
name (str) – name of the custom model
code (str) – code of the custom model
-
save
()¶ Saves back these settings to the ML Task
-
-
class
dataikuapi.dss.ml.
DSSPredictionMLTaskSettings
(client, project_key, analysis_id, mltask_id, mltask_settings)¶ -
class
PredictionTypes
¶ -
BINARY
= 'BINARY_CLASSIFICATION'¶
-
REGRESSION
= 'REGRESSION'¶
-
MULTICLASS
= 'MULTICLASS'¶
-
-
get_all_possible_algorithm_names
()¶ Returns the list of possible algorithm names, i.e. the list of valid identifiers for set_algorithm_enabled() and get_algorithm_settings().
This includes all possible algorithms, regardless of the prediction kind (regression/classification) or engine, so some algorithms may be irrelevant
- Returns
the list of algorithm names as a list of strings
- Return type
list of string
-
get_enabled_algorithm_names
()¶ - Returns
the list of enabled algorithm names as a list of strings
- Return type
list of string
-
get_algorithm_settings
(algorithm_name)¶ Gets the training settings for a particular algorithm. This returns a reference to the algorithm’s settings, not a copy, so changes made to the returned object will be reflected when saving.
This method returns the settings for this algorithm as a PredictionAlgorithmSettings (extended dict). All algorithm dicts have at least an “enabled” property/key in the settings. The “enabled” property/key indicates whether this algorithm will be trained.
Other settings are algorithm-dependent and are the various hyperparameters of the algorithm. The precise properties/keys for each algorithm are not all documented. You can print the returned AlgorithmSettings to learn more about the settings of each particular algorithm.
Please refer to the documentation for details on available algorithms.
- Parameters
algorithm_name (str) – Name (in capitals) of the algorithm.
- Returns
A PredictionAlgorithmSettings (extended dict) for one of the built-in prediction algorithms
- Return type
PredictionAlgorithmSettings
-
split_ordered_by
(feature_name, ascending=True)¶ Deprecated. Use split_params.set_time_ordering()
-
remove_ordered_split
()¶ Deprecated. Use split_params.unset_time_ordering()
-
use_sample_weighting
(feature_name)¶ Deprecated. Use set_weighting()
-
set_weighting
(method, feature_name=None)¶ Sets the method to weight samples.
If there was a WEIGHT feature declared previously, it will be set back as an INPUT feature first.
- Parameters
method (str) – Method to use. One of NO_WEIGHTING, SAMPLE_WEIGHT (must give a feature name), CLASS_WEIGHT or CLASS_AND_SAMPLE_WEIGHT (must give a feature name)
feature_name (str) – Name of the feature to use as sample weight
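For instance, to weight training samples by a numeric column (a sketch; "weight_col" is a hypothetical feature name):
# settings is a DSSPredictionMLTaskSettings object
# "weight_col" is a hypothetical column carrying per-row weights
settings.set_weighting(method="SAMPLE_WEIGHT", feature_name="weight_col")
settings.save()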
-
remove_sample_weighting
()¶ Deprecated. Use set_weighting(method=”NO_WEIGHTING”) instead
-
get_assertions_params
()¶ Retrieves the assertions parameters for this ML task
- Return type
DSSMLAssertionsParams
-
class
dataikuapi.dss.ml.
DSSClusteringMLTaskSettings
(client, project_key, analysis_id, mltask_id, mltask_settings)¶ -
get_algorithm_settings
(algorithm_name)¶ Gets the training settings for a particular algorithm. This returns a reference to the algorithm’s settings, not a copy, so changes made to the returned object will be reflected when saving.
This method returns a dictionary of the settings for this algorithm. All algorithm dicts have at least an “enabled” key in the dictionary. The ‘enabled’ key indicates whether this algorithm will be trained
Other settings are algorithm-dependent and are the various hyperparameters of the algorithm. The precise keys for each algorithm are not all documented. You can print the returned dictionary to learn more about the settings of each particular algorithm
Please refer to the documentation for details on available algorithms.
- Parameters
algorithm_name (str) – Name of the algorithm (uppercase).
- Returns
A dict of the settings for an algorithm
- Return type
dict
-
-
class
dataikuapi.dss.ml.
DSSTimeseriesForecastingMLTaskSettings
(client, project_key, analysis_id, mltask_id, mltask_settings)¶ -
-
get_time_step_params
()¶ Gets the time step parameters for the time series forecasting task. This returns a reference to the time step parameters, not a copy, so changes made to the returned object will be reflected when saving
- Returns
A dict of the time step parameters
- Return type
dict
-
set_time_step
(time_unit=None, n_time_units=None, end_of_week_day=None, reguess=True, update_algorithm_settings=True)¶ Sets the time step parameters for the time series forecasting task.
- Parameters
time_unit (str) – time unit for forecasting step. Valid values are: MILLISECOND, SECOND, MINUTE, HOUR, DAY, BUSINESS_DAY, WEEK, MONTH, QUARTER, HALF_YEAR, YEAR
n_time_units (int) – number of time units within a time step
end_of_week_day (int) – only useful for the WEEK time unit. Valid values are: 1 (Sunday), 2 (Monday), …, 7 (Saturday)
reguess (bool) – Defaults to true. Whether to reguess the ML task settings after changing the time step params
update_algorithm_settings (bool) – Defaults to true. Whether the algorithm settings should be reguessed after changing time step parameters.
-
get_resampling_params
()¶ Gets the time series resampling parameters for the time series forecasting task. This returns a reference to the time series resampling parameters, not a copy, so changes made to the returned object will be reflected when saving
- Returns
A dict of the resampling parameters
- Return type
dict
-
set_numerical_interpolation
(method=None, constant=None)¶ Sets the time series resampling numerical interpolation parameters
- Parameters
method (str) – Interpolation method. Valid values are: NEAREST, PREVIOUS, NEXT, LINEAR, QUADRATIC, CUBIC, CONSTANT
constant (float) – Value for the CONSTANT interpolation method
-
set_numerical_extrapolation
(method=None, constant=None)¶ Sets the time series resampling numerical extrapolation parameters
- Parameters
method (str) – Extrapolation method. Valid values are: PREVIOUS_NEXT, NO_EXTRAPOLATION, CONSTANT, LINEAR, QUADRATIC, CUBIC
constant (float) – Value for the CONSTANT extrapolation method
-
set_categorical_imputation
(method=None, constant=None)¶ Sets the time series resampling categorical imputation parameters
- Parameters
method (str) – Imputation method. Valid values are: MOST_COMMON, NULL, CONSTANT, PREVIOUS_NEXT, PREVIOUS, NEXT
constant (str) – Value for the CONSTANT imputation method
-
set_duplicate_timestamp_handling
(method)¶ Sets the time series duplicate timestamp handling
- Parameters
method (str) – Duplicate timestamp handling method. Valid values are: FAIL_IF_CONFLICTING, DROP_IF_CONFLICTING, MEAN_MODE
-
property
forecast_horizon
¶ - Returns
Number of time steps to be forecast
- Return type
int
-
set_forecast_horizon
(forecast_horizon, reguess=True, update_algorithm_settings=True)¶ - Parameters
forecast_horizon (int) – Number of time steps to be forecast
reguess (bool) – Defaults to true. Whether to reguess the ML task settings after changing the forecast horizon
update_algorithm_settings (bool) – Defaults to true. Whether the algorithm settings should be reguessed after changing the forecast horizon.
-
property
evaluation_gap
¶ - Returns
Number of skipped time steps for evaluation
- Return type
int
-
property
time_variable
¶ - Returns
Feature used as time variable (read-only)
- Return type
str
-
property
timeseries_identifiers
¶ - Returns
Features used as time series identifiers (read-only copy)
- Return type
list
-
property
quantiles_to_forecast
¶ - Returns
List of quantiles to forecast
- Return type
list
-
-
class
dataikuapi.dss.ml.
PredictionSplitParamsHandler
(mltask_settings)¶ Object to modify the train/test splitting params.
-
SPLIT_PARAMS_KEY
= 'splitParams'¶
-
get_raw
()¶ Gets the raw settings of the prediction split configuration. This returns a reference to the raw settings, not a copy, so changes made to the returned object will be reflected when saving.
- Return type
dict
-
set_split_random
(train_ratio=0.8, selection=None, dataset_name=None)¶ Sets the train/test split to random splitting of an extract of a single dataset
- Parameters
train_ratio (float) – Ratio of rows to use for train set. Must be between 0 and 1
selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the dataset. May be None (won’t be changed)
dataset_name (str) – Name of dataset to split. If None, the main dataset used to create the visual analysis will be used.
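A sketch of tuning the split, assuming split_params is exposed as a property of prediction task settings (as the deprecated methods above suggest); "order_date" is a hypothetical column:
# settings is a DSSPredictionMLTaskSettings object
split_params = settings.split_params
# Random 90/10 split of the main dataset
split_params.set_split_random(train_ratio=0.9)
# Or order by a date column so the test set holds the most recent rows
split_params.set_time_ordering("order_date", ascending=True)
settings.save()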
-
set_split_kfold
(n_folds=5, selection=None, dataset_name=None)¶ Sets the train/test split to k-fold splitting of an extract of a single dataset
- Parameters
n_folds (int) – number of folds. Must be greater than 0
selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the dataset. May be None (won’t be changed)
dataset_name (str) – Name of dataset to split. If None, the main dataset used to create the visual analysis will be used.
-
set_split_explicit
(train_selection, test_selection, dataset_name=None, test_dataset_name=None, train_filter=None, test_filter=None)¶ Sets the train/test split to explicit extract of one or two dataset(s)
- Parameters
train_selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the train dataset. May be None (won’t be changed)
test_selection (object) – A DSSDatasetSelectionBuilder to build the settings of the extract of the test dataset. May be None (won’t be changed)
dataset_name (str) – Name of dataset to use for the extracts. If None, the main dataset used to create the ML Task will be used.
test_dataset_name (str) – Name of a second dataset to use for the test data extract. If None, both extracts are done from dataset_name
train_filter (object) – A DSSFilterBuilder to build the settings of the filter of the train dataset. May be None (won’t be changed)
test_filter (object) – A DSSFilterBuilder to build the settings of the filter of the test dataset. May be None (won’t be changed)
-
set_time_ordering
(feature_name, ascending=True)¶ Uses a variable to sort the data for train/test split and hyperparameter optimization by time
- Parameters
feature_name (str) – Name of the variable to use
ascending (bool) – True iff the test set is expected to have larger time values than the train set
-
unset_time_ordering
()¶ Remove time-based ordering for train/test split and hyperparameter optimization
-
has_time_ordering
()¶ - Returns
whether the splitting uses time ordering
- Return type
bool
-
get_time_ordering_variable
()¶ - Returns
the name of the variable
- Return type
str
-
is_time_ordering_ascending
()¶ - Returns
True if the ordering is set to be ascending with respect to the time-ordering variable
- Return type
bool
-
Exploration of results¶
-
class
dataikuapi.dss.ml.
DSSTrainedPredictionModelDetails
(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)¶ Object to read details of a trained prediction model
Do not create this object directly, use DSSMLTask.get_trained_model_details() instead.
-
get_roc_curve_data
()¶
-
get_performance_metrics
()¶ Returns all performance metrics for this model.
For binary classification models, this includes both “threshold-independent” metrics like AUC and “threshold-dependent” metrics like precision. Threshold-dependent metrics are returned at the threshold value that was found to be optimal during training.
To get access to the per-threshold values, use the following:
# Returns a list of tested threshold values
details.get_performance()["perCutData"]["cut"]
# Returns a list of F1 scores at the tested threshold values
details.get_performance()["perCutData"]["f1"]
# Both lists have the same length
If K-fold cross-test was used, most metrics will have a “std” variant, which is the standard deviation across the K cross-tested folds. For example, “auc” will be accompanied by “aucstd”.
- Returns
a dict of performance metrics values
- Return type
dict
-
get_assertions_metrics
()¶ Retrieves assertions metrics computed for this trained model
- Returns
an object representing assertion metrics
- Return type
DSSMLAssertionsMetrics
-
get_hyperparameter_search_points
()¶ Gets the list of points in the hyperparameter search space that have been tested.
Returns a list of dicts. Each entry in the list represents a point. For each point, the dict contains at least:
“score”: the average value of the optimization metric over all the folds at this point
“params”: a dict of the parameters at this point. This dict has the same structure as the params of the best parameters
-
get_preprocessing_settings
()¶ Gets the preprocessing settings that were used to train this model
- Return type
dict
-
get_modeling_settings
()¶ Gets the modeling (algorithms) settings that were used to train this model.
Note: the structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms)
- Return type
dict
-
get_actual_modeling_params
()¶ Gets the actual / resolved parameters that were used to train this model, post hyperparameter optimization.
- Returns
A dictionary, which contains at least a “resolved” key, which is a dict containing the post-optimization parameters
- Return type
dict
-
get_trees
()¶ Gets the trees in the model (for tree-based models)
- Returns
a DSSTreeSet object to interact with the trees
- Return type
dataikuapi.dss.ml.DSSTreeSet
-
get_coefficient_paths
()¶ Gets the coefficient paths for Lasso models
- Returns
a DSSCoefficientPaths object to interact with the coefficient paths
- Return type
dataikuapi.dss.ml.DSSCoefficientPaths
-
get_scoring_jar_stream
(model_class='model.Model', include_libs=False)¶ Get a scoring jar for this trained model, provided that you have the license to do so and that the model is compatible with optimized scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.
- Parameters
model_class (str) – fully-qualified class name, e.g. “com.company.project.Model”
include_libs (bool) – if True, also packs the required dependencies; if False, runtime will require the scoring libs given by
DSSClient.scoring_libs()
- Returns
a jar file, as a stream
- Return type
file-like
-
get_scoring_pmml_stream
()¶ Get a scoring PMML for this trained model, provided that you have the license to do so and that the model is compatible with PMML scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.
- Returns
a PMML file, as a stream
- Return type
file-like
-
get_scoring_python_stream
()¶ Download the zip containing data to use for this trained model, provided that you have the license to do so and that the model is compatible with Python scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.
- Returns
an archive file, as a stream
- Return type
file-like
-
get_scoring_python
(filename)¶ Download the zip containing data to use Python scoring for this trained model in filename, provided that you have the license to do so and that the model is compatible with Python scoring.
- Parameters
filename (str) – filename of the resulting downloaded file
-
get_scoring_mlflow_stream
()¶ Download the zip containing this trained model using MLflow Model format, provided that you have the license to do so and that the model is compatible with MLflow scoring. You need to close the stream after download. Failure to do so will result in the DSSClient becoming unusable.
- Returns
an archive file, as a stream
- Return type
file-like
-
get_scoring_mlflow
(filename)¶ Download the zip containing data for this trained model, using MLflow Model format, provided that you have the license to do so and that the model is compatible with MLflow scoring
- Parameters
filename (str) – filename to the resulting MLflow Model zip
-
compute_subpopulation_analyses
(split_by, wait=True, sample_size=1000, random_state=1337, n_jobs=1, debug_mode=False)¶ Launch computation of Subpopulation analyses for this trained model.
- Parameters
split_by (list|str) – column(s) on which subpopulation analyses are to be computed (one analysis per column)
wait (bool) – if True, the call blocks until the computation is finished and returns the results directly
sample_size (int) – number of records of the dataset to use for the computation
random_state (int) – random state to use to build sample, for reproducibility
n_jobs (int) – number of cores used for parallel training. (-1 means ‘all cores’)
debug_mode (bool) – if True, output all logs (slower)
- Returns
if wait is True, an object containing the Subpopulation analyses, else a future to wait on the result
- Return type
dataikuapi.dss.ml.DSSSubpopulationAnalyses or dataikuapi.dss.future.DSSFuture
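A short usage sketch; "gender" is a hypothetical column of the train dataset:
# details is a DSSTrainedPredictionModelDetails object
analyses = details.compute_subpopulation_analyses(split_by=["gender"], wait=True, sample_size=1000)
# Previously computed analyses can be fetched again later
analyses = details.get_subpopulation_analyses()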
-
get_subpopulation_analyses
()¶ Retrieve all subpopulation analyses computed for this trained model
- Returns
the subpopulation analyses
- Return type
dataikuapi.dss.ml.DSSSubpopulationAnalyses
-
compute_partial_dependencies
(features, wait=True, sample_size=1000, random_state=1337, n_jobs=1, debug_mode=False)¶ Launch computation of Partial dependencies for this trained model.
- Parameters
features (list|str) – feature(s) on which partial dependencies are to be computed
wait (bool) – if True, the call blocks until the computation is finished and returns the results directly
sample_size (int) – number of records of the dataset to use for the computation
random_state (int) – random state to use to build sample, for reproducibility
n_jobs (int) – number of cores used for parallel training. (-1 means ‘all cores’)
debug_mode (bool) – if True, output all logs (slower)
- Returns
if wait is True, an object containing the Partial dependencies, else a future to wait on the result
- Return type
dataikuapi.dss.ml.DSSPartialDependencies or dataikuapi.dss.future.DSSFuture
-
get_partial_dependencies
()¶ Retrieve all partial dependencies computed for this trained model
- Returns
the partial dependencies
- Return type
dataikuapi.dss.ml.DSSPartialDependencies
-
download_documentation_stream
(export_id)¶ Download a model documentation, as a binary stream.
Warning: this stream will monopolize the DSSClient until closed.
- Parameters
export_id – the id of the generated model documentation returned as the result of the future
- Returns
the model documentation, as a binary stream
-
download_documentation_to_file
(export_id, path)¶ Download a model documentation into the given output file.
- Parameters
export_id – the id of the generated model documentation returned as the result of the future
path – the path where to download the model documentation
- Returns
None
-
property
full_id
¶
-
generate_documentation
(folder_id=None, path=None)¶ Start the model document generation from a template docx file in a managed folder, or from the default template if no folder id and path are specified.
- Parameters
folder_id – (optional) the id of the managed folder
path – (optional) the path to the file from the root of the folder
- Returns
A
DSSFuture
representing the model document generation process
-
generate_documentation_from_custom_template
(fp)¶ Start the model document generation from a docx template (as a file object).
- Parameters
fp (object) – A file-like object pointing to a template docx file
- Returns
A
DSSFuture
representing the model document generation process
-
get_diagnostics
()¶ Retrieves diagnostics computed for this trained model
- Returns
list of diagnostics
- Return type
list of type dataikuapi.dss.ml.DSSMLDiagnostic
-
get_origin_analysis_trained_model
()¶ Fetches details about the analysis model from which this model was exported. Returns None if the deployed trained model does not have an origin analysis trained model.
- Return type
DSSTrainedModelDetails | None
-
get_raw
()¶ Gets the raw dictionary of trained model details
-
get_raw_snippet
()¶ Gets the raw dictionary of trained model snippet. The snippet is a lighter version than the details.
-
get_train_info
()¶ Returns various information about the train process (size of the train set, quick description, timing information)
- Return type
dict
-
get_user_meta
()¶ Gets the user-accessible metadata (name, description, cluster labels, classification threshold). Returns the original object, not a copy. Changes to the returned object are persisted to DSS by calling save_user_meta()
-
save_user_meta
()¶
-
-
class
dataikuapi.dss.ml.
DSSTrainedClusteringModelDetails
(details, snippet, saved_model=None, saved_model_version=None, mltask=None, mltask_model_id=None)¶ Object to read details of a trained clustering model
Do not create this object directly, use DSSMLTask.get_trained_model_details() instead.
-
get_raw
()¶ Gets the raw dictionary of trained model details
-
get_train_info
()¶ Returns various information about the train process (size of the train set, quick description, timing information)
- Return type
dict
-
get_facts
()¶ Gets the ‘cluster facts’ data, i.e. the structure behind the screen “for cluster X, average of Y is Z times higher than average”.
- Return type
DSSClustersFacts
-
get_performance_metrics
()¶ Returns all performance metrics for this clustering model.
- Returns
a dict of performance metrics values
- Return type
dict
-
get_preprocessing_settings
()¶ Gets the preprocessing settings that were used to train this model
- Return type
dict
-
get_modeling_settings
()¶ Gets the modeling (algorithms) settings that were used to train this model.
Note: the structure of this dict is not the same as the modeling params on the ML Task (which may contain several algorithms)
- Return type
dict
-
get_actual_modeling_params
()¶ Gets the actual / resolved parameters that were used to train this model.
- Returns
A dictionary, which contains at least a “resolved” key
- Return type
dict
-
get_scatter_plots
()¶ Gets the cluster scatter plot data
- Returns
a DSSScatterPlots object to interact with the scatter plots
- Return type
dataikuapi.dss.ml.DSSScatterPlots
-
download_documentation_stream
(export_id)¶ Download a model documentation, as a binary stream.
Warning: this stream will monopolize the DSSClient until closed.
- Parameters
export_id – the id of the generated model documentation returned as the result of the future
- Returns
the model documentation, as a binary stream
-
download_documentation_to_file
(export_id, path)¶ Download a model documentation into the given output file.
- Parameters
export_id – the id of the generated model documentation returned as the result of the future
path – the path where to download the model documentation
- Returns
None
-
property
full_id
¶
-
generate_documentation
(folder_id=None, path=None)¶ Start the model document generation from a template docx file in a managed folder, or from the default template if no folder id and path are specified.
- Parameters
folder_id – (optional) the id of the managed folder
path – (optional) the path to the file from the root of the folder
- Returns
A
DSSFuture
representing the model document generation process
-
generate_documentation_from_custom_template
(fp)¶ Start the model document generation from a docx template (as a file object).
- Parameters
fp (object) – A file-like object pointing to a template docx file
- Returns
A
DSSFuture
representing the model document generation process
-
get_diagnostics
()¶ Retrieves diagnostics computed for this trained model
- Returns
list of diagnostics
- Return type
list of type dataikuapi.dss.ml.DSSMLDiagnostic
-
get_origin_analysis_trained_model
()¶ Fetches details about the analysis model from which this model was exported. Returns None if the deployed trained model does not have an origin analysis trained model.
- Return type
DSSTrainedModelDetails | None
-
get_raw_snippet
()¶ Gets the raw dictionary of trained model snippet. The snippet is a lighter version than the details.
-
get_user_meta
()¶ Gets the user-accessible metadata (name, description, cluster labels, classification threshold). Returns the original object, not a copy. Changes to the returned object are persisted to DSS by calling save_user_meta()
-
save_user_meta
()¶
-
Saved models¶
-
class
dataikuapi.dss.savedmodel.
DSSSavedModel
(client, project_key, sm_id)¶ A handle to interact with a saved model on the DSS instance.
Do not create this directly, use dataikuapi.dss.project.DSSProject.get_saved_model()
-
property
id
¶
-
get_settings
()¶ Returns the settings of this saved model.
- Return type
DSSSavedModelSettings
-
list_versions
()¶ Get the versions of this saved model
- Returns
a list of the versions, as a list of dicts. Each dict contains at least an “id” field, which can be passed to get_metric_values(), get_version_details() and set_active_version()
- Return type
list
-
get_active_version
()¶ Gets the active version of this saved model
- Returns
a dict representing the active version, or None if no version is active. The dict contains at least an “id” field, which can be passed to get_metric_values(), get_version_details() and set_active_version()
- Return type
dict
-
get_version_details
(version_id)¶ Gets details for a version of a saved model
- Parameters
version_id (str) – Identifier of the version, as returned by
list_versions()
- Returns
A DSSTrainedPredictionModelDetails representing the details of this trained model id
- Return type
DSSTrainedPredictionModelDetails
-
set_active_version
(version_id)¶ Sets a particular version of the saved model as the active one
-
delete_versions
(versions, remove_intermediate=True)¶ Delete version(s) of the saved model
- Parameters
versions (list[str]) – list of versions to delete
remove_intermediate (bool) – also remove intermediate versions (default: True). In the case of a partitioned model, an intermediate version is created every time a partition has finished training.
-
get_origin_ml_task
()¶ Fetch the last ML task that has been exported to this saved model. Returns None if the saved model does not have an origin ml task.
- Return type
DSSMLTask | None
-
import_mlflow_version_from_path
(version_id, path, code_env_name='INHERIT', container_exec_config_name='NONE', set_active=True, binary_classification_threshold=0.5)¶ Create a new version for this saved model from a path containing an MLflow model.
Requires the saved model to have been created using dataikuapi.dss.project.DSSProject.create_mlflow_pyfunc_model().
- Parameters
version_id (str) – Identifier of the version to create
path (str) – An absolute path on the local filesystem. Must be a folder, and must contain a MLFlow model
code_env_name (str) – Name of the code env to use for this model version. The code env must contain at least mlflow and the package(s) corresponding to the used MLFlow-compatible frameworks. If value is “INHERIT”, the default active code env of the project will be used
container_exec_config_name (str) – Name of the containerized execution configuration to use while creating this model version. If value is “INHERIT”, the container execution configuration of the project will be used. If value is “NONE”, local execution will be used (no container)
set_active (bool) – sets this new version as the active version of the saved model
binary_classification_threshold (float) – For binary classification, define the actual threshold for the imported version. Default to 0.5
- Returns
an ExternalModelVersionHandler to interact with the new MLflow model version
-
import_mlflow_version_from_managed_folder
(version_id, managed_folder, path, code_env_name='INHERIT', container_exec_config_name='INHERIT', set_active=True, binary_classification_threshold=0.5)¶ Create a new version for this saved model from a path containing an MLflow model in a managed folder.
Requires the saved model to have been created using dataikuapi.dss.project.DSSProject.create_mlflow_pyfunc_model().
- Parameters
version_id (str) – Identifier of the version to create
managed_folder (str) – Identifier of the managed folder or dataikuapi.dss.managedfolder.DSSManagedFolder
path (str) – Path of the MLflow folder in the managed folder
code_env_name (str) – Name of the code env to use for this model version. The code env must contain at least mlflow and the package(s) corresponding to the used MLFlow-compatible frameworks. If value is “INHERIT”, the default active code env of the project will be used
container_exec_config_name (str) – Name of the containerized execution configuration to use for evaluating this model version. If value is “INHERIT”, the container execution configuration of the project will be used. If value is “NONE”, local execution will be used (no container)
set_active (bool) – sets this new version as the active version of the saved model
binary_classification_threshold (float) – For binary classification, define the actual threshold for the imported version. Default to 0.5
- Returns
an ExternalModelVersionHandler to interact with the new MLflow model version
-
create_proxy_model_version
(version_id, protocol, configuration)¶ EXPERIMENTAL. Creates a new version of a proxy model.
This is an experimental API, subject to change. Requires the saved model to have been created using dataikuapi.dss.project.DSSProject.create_proxy_model().
- Parameters
version_id (str) – Identifier of the version to create
protocol (str) – one of [“KServe”, “DSS_API_NODE”]
configuration (dict) – A dictionary containing the required params for the selected protocol
- Returns
an ExternalModelVersionHandler to interact with the new proxy model version
-
get_external_model_version_handler
(version_id)¶ Returns an ExternalModelVersionHandler to interact with an external model version (MLflow or proxy model)
-
get_metric_values
(version_id)¶ Get the values of the metrics on the version of this saved model
- Returns
a list of metric objects and their value
-
get_zone
()¶ Gets the flow zone of this saved model
- Return type
dataikuapi.dss.flow.DSSFlowZone
-
move_to_zone
(zone)¶ Moves this object to a flow zone
- Parameters
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
where to move the object
-
share_to_zone
(zone)¶ Share this object to a flow zone
- Parameters
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
where to share the object
-
unshare_from_zone
(zone)¶ Unshare this object from a flow zone
- Parameters
zone (object) – a
dataikuapi.dss.flow.DSSFlowZone
from where to unshare the object
-
get_usages
()¶ Get the recipes referencing this model
- Returns
a list of usages
-
get_object_discussions
()¶ Get a handle to manage discussions on the saved model
- Returns
the handle to manage discussions
- Return type
dataikuapi.discussion.DSSObjectDiscussions
-
delete
()¶ Delete the saved model
-
class
dataikuapi.dss.savedmodel.
DSSSavedModelSettings
(saved_model, settings)¶ A handle on the settings of a saved model
Do not create this class directly, instead use
dataikuapi.dss.savedmodel.DSSSavedModel.get_settings()
-
get_raw
()¶
-
property
prediction_metrics_settings
¶ The settings of evaluation metrics for a prediction saved model
-
save
()¶ Saves the settings of this saved model
-
MLflow models¶
-
class
dataikuapi.dss.savedmodel.
ExternalModelVersionHandler
(saved_model, version_id)¶ Handler to interact with an external model version (MLflow import or proxy model)
-
get_settings
()¶
-
set_core_metadata
(target_column_name, class_labels=None, get_features_from_dataset=None, features_list=None, output_style='AUTO_DETECT', container_exec_config_name='NONE')¶ Sets metadata for this MLFlow model version
In addition to target_column_name, one of get_features_from_dataset or features_list must be passed in order to be able to evaluate performance
- Parameters
target_column_name (str) – name of the target column. Mandatory in order to be able to evaluate performance
class_labels (list) – List of strings, ordered class labels. Mandatory in order to be able to evaluate performance on classification models
get_features_from_dataset (str) – Name of a dataset to get feature names from
features_list (list) – List of {“name”: “feature_name”, “type”: “feature_type”}
container_exec_config_name (str) – Name of the containerized execution configuration to use for running the evaluation process. If value is “INHERIT”, the container execution configuration of the project will be used. If value is “NONE” (default), local execution will be used (no container)
-
evaluate
(dataset_ref, container_exec_config_name='INHERIT', selection=None, use_optimal_threshold=True)¶ Evaluates the performance of this model version on a particular dataset. After calling this, the “result screens” of the MLflow model version will be available (confusion matrix, error distribution, performance metrics, …) and more information will be available when calling DSSSavedModel.get_version_details(). set_core_metadata() must be called before you can evaluate a dataset.
- Parameters
dataset_ref (str) – Evaluation dataset to use (either a dataset name, “PROJECT.datasetName”, a DSSDataset instance or a dataiku.Dataset instance)
container_exec_config_name (str) – Name of the containerized execution configuration to use for running the evaluation process. If value is “INHERIT”, the container execution configuration of the project will be used. If value is “NONE”, local execution will be used (no container)
selection (str) – will default to HEAD_SEQUENTIAL with a maxRecords of 10_000
use_optimal_threshold (boolean) – Choose between optimized or actual threshold. The optimized threshold has been computed according to the metric set on the saved model setting “prediction_metrics_settings[‘thresholdOptimizationMetric’]”
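Putting these methods together, a hedged end-to-end sketch of importing and evaluating an MLflow model; the model name, path and dataset name are hypothetical:
# p is a DSSProject handle
sm = p.create_mlflow_pyfunc_model("mlflow_model", prediction_type="BINARY_CLASSIFICATION")
version = sm.import_mlflow_version_from_path("v01", "/path/to/mlflow_model")
# Metadata is required before evaluation
version.set_core_metadata("target", class_labels=["0", "1"],
                          get_features_from_dataset="eval_dataset")
# Evaluate on a dataset; the result screens then become available
version.evaluate("eval_dataset")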
-
Algorithm details¶
This section documents which algorithms are available, and some of the settings for them.
These algorithm names can be used for dataikuapi.dss.ml.DSSMLTaskSettings.get_algorithm_settings()
and dataikuapi.dss.ml.DSSMLTaskSettings.set_algorithm_enabled()
Note
This documentation does not cover all settings of all algorithms. To know which settings are
available for an algorithm, use mltask_settings.get_algorithm_settings('ALGORITHM_NAME')
and print the returned dictionary.
Generally speaking, algorithm settings which are arrays mean that the parameter can be grid-searched: all values will be tested as part of the hyperparameter optimization.
For more documentation of settings, please refer to the UI of the visual machine learning, which contains detailed documentation for all algorithm parameters.
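For instance, to see what can be configured for one algorithm:
# mltask_settings comes from mltask.get_settings()
algo_settings = mltask_settings.get_algorithm_settings("RANDOM_FOREST_CLASSIFICATION")
# Array-valued entries can generally be grid-searched
print(algo_settings)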
LOGISTIC_REGRESSION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
Main parameters:
{
    "multi_class": SingleCategoryHyperparameterSettings, # accepted values: ['multinomial', 'ovr']
    "penalty": CategoricalHyperparameterSettings, # possible values: ["l1", "l2"]
    "C": NumericalHyperparameterSettings, # scaling: "LOGARITHMIC"
    "n_jobs": 2
}
RANDOM_FOREST_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
Main parameters:
{
    "n_estimators": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "min_samples_leaf": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "max_tree_depth": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "max_feature_prop": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "max_features": NumericalHyperparameterSettings, # scaling: "LINEAR"
    "selection_mode": SingleCategoryHyperparameterSettings, # accepted_values=['auto', 'sqrt', 'log2', 'number', 'prop']
    "n_jobs": 4
}
RANDOM_FOREST_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
Main parameters: same as RANDOM_FOREST_CLASSIFICATION
EXTRA_TREES¶
Type: Prediction (all kinds)
Available on backend: PY_MEMORY
RIDGE_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
LASSO_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
LEASTSQUARE_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
SVC_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
SVM_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
SGD_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
SGD_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
GBT_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
GBT_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
DECISION_TREE_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
DECISION_TREE_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
LIGHTGBM_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
LIGHTGBM_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
XGBOOST_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
XGBOOST_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
NEURAL_NETWORK¶
Type: Prediction (all kinds)
Available on backend: PY_MEMORY
DEEP_NEURAL_NETWORK_REGRESSION¶
Type: Prediction (regression)
Available on backend: PY_MEMORY
DEEP_NEURAL_NETWORK_CLASSIFICATION¶
Type: Prediction (binary or multiclass)
Available on backend: PY_MEMORY
MLLIB_LOGISTIC_REGRESSION¶
Type: Prediction (binary or multiclass)
Available on backend: MLLIB
MLLIB_DECISION_TREE¶
Type: Prediction (all kinds)
Available on backend: MLLIB
MLLIB_RANDOM_FOREST¶
Type: Prediction (all kinds)
Available on backend: MLLIB
MLLIB_LINEAR_REGRESSION¶
Type: Prediction (regression)
Available on backend: MLLIB
MLLIB_NAIVE_BAYES¶
Type: Prediction (all kinds)
Available on backend: MLLIB