Model Evaluation Stores

Through the public API, the Python client allows you to evaluate models. These models are typically trained in the Lab and then deployed to the Flow as Saved Models (see Machine learning for additional information). They can also be external models.

Concepts

With a DSS model

In DSS, you can evaluate a version of a Saved Model using an Evaluation Recipe. An Evaluation Recipe takes as input a Saved Model and a Dataset on which to perform this evaluation. An Evaluation Recipe can have three outputs:

  • an output dataset,

  • a metrics dataset, or

  • a Model Evaluation Store (MES).

By default, the active version of the Saved Model is evaluated. This can be configured in the Evaluation Recipe.

If a MES is configured as an output, a Model Evaluation (ME) will be written in the MES each time the MES is built (or each time the Evaluation Recipe is run).

A Model Evaluation is a container for metrics of the evaluation of the Saved Model Version on the Evaluation Dataset. Those metrics include:

  • all available performance metrics,

  • the Data Drift metric.

The Data Drift metric is the accuracy of a model trained to recognize whether a line comes:

  • from the evaluation dataset

  • from the train time test dataset of the configured version of the Saved Model.

The higher this metric, the better this model can separate lines of the evaluation dataset from lines of the train time test dataset, and therefore the more the evaluation data differs from the train time data, as sketched below.
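
To make the idea concrete, here is a minimal sketch of such a drift model, assuming scikit-learn and two pandas DataFrames; it is an illustration only, not the implementation DSS uses:

# Illustrative sketch of a drift model, not DSS internals: label each line by its origin,
# train a classifier to predict that origin, and use its accuracy as the drift score.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def drift_score(evaluation_df, train_test_df):
    data = pd.concat([evaluation_df, train_test_df], ignore_index=True)
    origin = [1] * len(evaluation_df) + [0] * len(train_test_df)  # 1 = evaluation, 0 = train time test
    features = pd.get_dummies(data).fillna(0)  # naive encoding, good enough for a sketch
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    # accuracy close to 0.5: samples are hard to tell apart (little drift); close to 1.0: strong drift
    return cross_val_score(clf, features, origin, cv=3, scoring="accuracy").mean()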

Detailed information and other tools, including a binomial test, univariate data drift, and feature drift importance, are available in the Input Data Drift tab of a Model Evaluation. Note that this tool is interactive and that displayed results are not persisted.

With an external model

In DSS, you can also evaluate an external model using a Standalone Evaluation Recipe. A Standalone Evaluation Recipe (SER) takes as input a dataset containing labels, predictions, and (optionally) weights. A SER has a single output: a Model Evaluation Store.

Like the Evaluation Recipe, the Standalone Evaluation Recipe writes a Model Evaluation to the configured Model Evaluation Store each time it runs. In this case, however, the Data Drift cannot be computed, as there is no notion of reference data.

How evaluation is performed

The Evaluation Recipe and its counterpart for external models, the Standalone Evaluation Recipe, perform the evaluation on a sample of the Evaluation Dataset. The sampling parameters are defined in the recipe. Note that the sample will contain at most 20,000 lines.

Performance metrics are then computed on this sample.
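
Conceptually, this amounts to computing standard metrics on the sampled rows. A minimal sketch, assuming the evaluation sample exposes ground truth and prediction columns named label and prediction (the actual column names depend on your data):

# Illustrative only: compute two standard metrics on an evaluation sample
from sklearn.metrics import accuracy_score, precision_score

sample_df = me.get_sample_df()  # me is a DSSModelEvaluation handle (see the API reference below)
accuracy = accuracy_score(sample_df["label"], sample_df["prediction"])
precision = precision_score(sample_df["label"], sample_df["prediction"], average="macro")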

Data drift can be computed in three ways:

  • at evaluation time, between the evaluation dataset and the train time test dataset;

  • using the API, between the samples of a Model Evaluation, a Saved Model Version (sample of train time test dataset) or a Lab Model (sample of train time test dataset);

  • interactively, in the “Input data drift” tab of a Model Evaluation.

In all cases, to compute the Data Drift, the sample of the Model Evaluation and a sample of the reference data are concatenated. In order to balance the data, both samples are truncated to the length of the smaller one: if the size of the reference sample is larger than the size of the ME sample, the reference sample is truncated (see the sketch after the list below).

Therefore:

  • at evaluation time, the data drift computation takes as input the sample of the Model Evaluation (whose length is at most 20,000 lines) and a sample of the train time test dataset;

  • using the API or the interactive tab, it takes the sample of the Model Evaluation being analyzed and:

    • if the other compared item is an ME, its sample;

    • if the other compared item is a Lab Model or an SMV, a sample of its train time test dataset.
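
A minimal pandas sketch of the balancing step described above (illustrative only, not DSS internals):

import pandas as pd

def balance_and_concat(me_sample, reference_sample):
    # truncate both samples to the size of the smaller one, then concatenate them
    n = min(len(me_sample), len(reference_sample))
    return pd.concat([me_sample.head(n), reference_sample.head(n)], ignore_index=True)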

Limitations

Model Evaluation Stores cannot be used with:

  • clustering models,

  • ensembling models,

  • partitioned models.

Compatible prediction models have to be Python models.

Usage samples

Create a Model Evaluation Store

# client is a DSS API client
p = client.get_project("MYPROJECT")
mes = p.create_model_evaluation_store("My Mes Name")

Note that the display name of a Model Evaluation Store (in the above sample My Mes Name) is distinct from its unique id.

Retrieve a Model Evaluation Store

# client is a DSS API client
p = client.get_project("MYPROJECT")
mes = p.get_model_evaluation_store("mes_id")

List Model Evaluation Stores

# client is a DSS API client
p = client.get_project("MYPROJECT")
stores = p.list_model_evaluation_stores(as_type="objects")

Create an Evaluation Recipe

See dataikuapi.dss.recipe.EvaluationRecipeCreator
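
As a sketch only, creating an Evaluation Recipe through the recipe builder could look like the following; the builder method names used here (with_input_model, with_output_evaluation_store, build) are assumptions to be checked against the EvaluationRecipeCreator reference:

# Sketch, assuming the builder methods shown below exist on EvaluationRecipeCreator
p = client.get_project("MYPROJECT")
builder = p.new_recipe("evaluation")
builder.with_input_model("my_saved_model_id")      # hypothetical saved model id
builder.with_input("dataset_to_evaluate")
builder.with_output_evaluation_store("mes_id")
recipe = builder.build()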

Build a Model Evaluation Store and retrieve the performance and data drift metrics of the just computed ME

# client is a DSS API client
p = client.get_project("MYPROJECT")
mes = p.get_model_evaluation_store("M3s_1d")
mes.build()
me = mes.get_latest_model_evaluation()
full_info = me.get_full_info()
metrics = full_info.metrics

List Model Evaluations from a store

# client is a DSS API client
p = client.get_project("MYPROJECT")
mes = p.get_model_evaluation_store("M3s_1d")
me_list = mes.list_model_evaluations()

Retrieve an array of creation date / accuracy from a store

p = client.get_project("MYPROJECT")
mes = p.get_model_evaluation_store("M3s_1d")
me_list = mes.list_model_evaluations()

res = []
for me in me_list:
    full_info = me.get_full_info()
    creation_date = full_info.creation_date
    accuracy = full_info.metrics["accuracy"]
    res.append([creation_date, accuracy])

Retrieve an array of label value / precision from a store

The date of creation of a model evaluation might not be the best way to key a metric. In some cases, it might be more interesting to use the labeling system, for instance to tag the version of the evaluation dataset.

If the user created a label with key "myCustomLabel:evaluationDataset", they can retrieve an array of label value / precision from a store with the following snippet:

p = client.get_project("MYPROJECT")
mes = p.get_model_evaluation_store("M3s_1d")
me_list = mes.list_model_evaluations()

res = []
for me in me_list:
    full_info = me.get_full_info()
    label_value = next(x for x in full_info.user_meta["labels"] if x["key"] == "myCustomLabel:evaluationDataset")
    precision = full_info.metrics["precision"]
    res.append([label_value, precision])

Compute data drift of the evaluation dataset of a Model Evaluation with the train time test dataset of its base DSS model version

# me1 is a DSSModelEvaluation retrieved from a Model Evaluation Store
# using its base SMV as reference is implicit
drift = me1.compute_data_drift()

drift_model_result = drift.drift_model_result
drift_model_accuracy = drift_model_result.drift_model_accuracy
print("Value: {} < {} < {}".format(drift_model_accuracy.lower_confidence_interval,
                                    drift_model_accuracy.value,
                                    drift_model_accuracy.upper_confidence_interval))
print("p-value: {}".format(drift_model_accuracy.pvalue))

Compute data drift, display results and adjust parameters

# me1 and me2 are two compatible model evaluations (having the same prediction type) from any store

drift = me1.compute_data_drift(me2)

drift_model_result = drift.drift_model_result
drift_model_accuracy = drift_model_result.drift_model_accuracy
print("Value: {} < {} < {}".format(drift_model_accuracy.lower_confidence_interval,
                                    drift_model_accuracy.value,
                                    drift_model_accuracy.upper_confidence_interval))
print("p-value: {}".format(drift_model_accuracy.pvalue))

# Check sample sizes
print("Reference sample size: {}".format(drift_model_result.get_raw()["referenceSampleSize"]))
print("Current sample size: {}".format(drift_model_result.get_raw()["currentSampleSize"]))


# check columns handling
per_col_settings = drift.per_column_settings
for col_settings in per_col_settings:
    print("col {} - default handling {} - actual handling {}".format(col_settings.name, col_settings.default_column_handling, col_settings.actual_column_handling))

# recompute, with Pclass set as CATEGORICAL
from dataikuapi.dss.modelevaluationstore import DataDriftParams, PerColumnDriftParamBuilder

drift = me1.compute_data_drift(
    me2,
    DataDriftParams.from_params(
        PerColumnDriftParamBuilder().with_column_drift_param("Pclass", "CATEGORICAL", True).build()
    )
)
...

API reference

There are two main parts related to the handling of model evaluation stores in Dataiku's Python APIs:

  • dataiku.ModelEvaluationStore, in the dataiku package

  • dataikuapi.dss.modelevaluationstore.DSSModelEvaluationStore, in the dataikuapi package

Both sets of classes have fairly similar capabilities.

For more details on the two packages, please see Python APIs.

dataiku package API

class dataiku.ModelEvaluationStore(lookup, project_key=None, ignore_flow=False)

This is a handle to interact with a model evaluation store

Note: this class is also available as dataiku.ModelEvaluationStore

get_info(sensitive_info=False)

Get information about the location and settings of this model evaluation store

Return type

dict

get_path()

Gets the filesystem path of this model evaluation store.

get_id()
get_name()
list_runs()
get_evaluation(evaluation_id)
get_last_metric_values()

Get the set of last values of the metrics on this model evaluation store, as a dataiku.ComputedMetrics object

get_metric_history(metric_lookup)

Get the set of all values a given metric took on this model evaluation store

Parameters

metric_lookup – metric name or unique identifier

class dataiku.core.model_evaluation_store.ModelEvaluation(store, evaluation_id)

This is a handle to interact with a model evaluation

set_preparation_steps(steps, requested_output_schema, context_project_key=None)
get_schema()

Gets the schema of the sample in this model evaluation, as an array of objects like this one: { 'type': 'string', 'name': 'foo', 'maxLength': 1000 }. There is more information for the map, array and object types.

get_dataframe(columns=None, infer_with_pandas=True, parse_dates=True, bool_as_str=False, float_precision=None)

Read the sample in the run as a Pandas dataframe.

Pandas dataframes are fully in-memory, so you need to make sure that your dataset will fit in RAM before using this.

Keyword arguments:

  • infer_with_pandas – uses the types detected by pandas rather than the dataset schema as detected in DSS. (default True)

  • parse_dates – Date column in DSS’s dataset schema are parsed (default True)

  • bool_as_str – Leave boolean values as strings (default False)

Inconsistent sampling parameters raise a ValueError.

Note about encoding:

  • Column labels are “unicode” objects

  • When a column is of string type, the content is made of utf-8 encoded “str” objects

iter_dataframes_forced_types(names, dtypes, parse_date_columns, sampling=None, chunksize=10000, float_precision=None)
iter_dataframes(chunksize=10000, infer_with_pandas=True, parse_dates=True, columns=None, bool_as_str=False, float_precision=None)

Read the model evaluation sample to Pandas dataframes by chunks of fixed size.

Returns a generator over pandas dataframes.

Useful if the sample doesn't fit in RAM.
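
For example, a minimal sketch of chunked reading inside DSS; the store and evaluation ids are placeholders:

import dataiku

mes = dataiku.ModelEvaluationStore("my_mes_id")    # placeholder store id
me = mes.get_evaluation("my_evaluation_id")        # placeholder evaluation id
for chunk_df in me.iter_dataframes(chunksize=10000):
    # each chunk_df is a pandas DataFrame of at most 10000 rows
    print(len(chunk_df))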

dataikuapi package API

class dataikuapi.dss.modelevaluationstore.DSSModelEvaluationStore(client, project_key, mes_id)

A handle to interact with a model evaluation store on the DSS instance.

Do not create this directly, use dataikuapi.dss.DSSProject.get_model_evaluation_store()

property id
get_settings()

Returns the settings of this model evaluation store.

Return type

DSSModelEvaluationStoreSettings

get_zone()

Gets the flow zone of this model evaluation store

Return type

dataikuapi.dss.flow.DSSFlowZone

move_to_zone(zone)

Moves this object to a flow zone

Parameters

zone (object) – a dataikuapi.dss.flow.DSSFlowZone where to move the object

share_to_zone(zone)

Share this object to a flow zone

Parameters

zone (object) – a dataikuapi.dss.flow.DSSFlowZone where to share the object

unshare_from_zone(zone)

Unshare this object from a flow zone

Parameters

zone (object) – a dataikuapi.dss.flow.DSSFlowZone from where to unshare the object

get_usages()

Get the recipes referencing this model evaluation store

Returns:

a list of usages

get_object_discussions()

Get a handle to manage discussions on the model evaluation store

Returns

the handle to manage discussions

Return type

dataikuapi.discussion.DSSObjectDiscussions

delete()

Delete the model evaluation store

list_model_evaluations()

List the model evaluations in this model evaluation store. The list is sorted by ME creation date.

Returns

The list of the model evaluations

Return type

list of dataikuapi.dss.modelevaluationstore.DSSModelEvaluation

get_model_evaluation(evaluation_id)

Get a handle to interact with a specific model evaluation

Parameters

evaluation_id (string) – the id of the desired model evaluation

Returns

A dataikuapi.dss.modelevaluationstore.DSSModelEvaluation model evaluation handle

get_latest_model_evaluation()

Get a handle to interact with the latest model evaluation computed

Returns

A dataikuapi.dss.modelevaluationstore.DSSModelEvaluation model evaluation handle if the store is not empty, else None

delete_model_evaluations(evaluations)

Remove model evaluations from this store

build(job_type='NON_RECURSIVE_FORCED_BUILD', wait=True, no_fail=False)

Starts a new job to build this model evaluation store and waits for it to complete. Raises if the job failed.

job = mes.build()
print("Job %s done" % job.id)
Parameters
  • job_type – The job type. One of RECURSIVE_BUILD, NON_RECURSIVE_FORCED_BUILD or RECURSIVE_FORCED_BUILD

  • wait – wait for the build to finish before returning

  • no_fail – if True, does not raise if the job failed. Valid only when wait is True

Returns

the dataikuapi.dss.job.DSSJob job handle corresponding to the built job

Return type

dataikuapi.dss.job.DSSJob

get_last_metric_values()

Get the metrics of the latest model evaluation built

Returns:

a list of metric objects and their value

get_metric_history(metric)

Get the history of the values of the metric on this model evaluation store

Returns:

an object containing the values of the metric, cast to the appropriate type (double, boolean,…)

compute_metrics(metric_ids=None, probes=None)

Compute metrics on this model evaluation store. If the metrics are not specified, the metrics setup on the model evaluation store are used.

class dataikuapi.dss.modelevaluationstore.DSSModelEvaluationStoreSettings(model_evaluation_store, settings)

A handle on the settings of a model evaluation store

Do not create this class directly, instead use dataikuapi.dss.DSSModelEvaluationStore.get_settings()

get_raw()
save()
class dataikuapi.dss.modelevaluationstore.DSSModelEvaluation(model_evaluation_store, evaluation_id)

A handle on a model evaluation

Do not create this class directly, instead use dataikuapi.dss.DSSModelEvaluationStore.get_model_evaluation()

get_full_info()

Retrieve the model evaluation with its performance data

Returns

the model evaluation full info, as a dataikuapi.dss.modelevaluationstore.DSSModelEvaluationFullInfo

get_full_id()
delete()

Remove this model evaluation

property full_id
compute_data_drift(reference=None, data_drift_params=None, wait=True)

Compute data drift against a reference model or model evaluation. The reference is determined automatically unless specified.

Parameters

  • reference – the model evaluation, saved model version or lab model to use as reference; determined automatically if not specified

  • data_drift_params – a dataikuapi.dss.modelevaluationstore.DataDriftParams controlling the drift computation

  • wait – if True, wait for the computation to complete and return the result directly

Returns

a dataikuapi.dss.modelevaluationstore.DataDriftResult containing data drift analysis results if wait is True, or a DSSFuture handle otherwise

get_metrics()

Get the metrics for this model evaluation. Metrics must be understood here as Metrics in DSS Metrics & Checks

Returns

the metrics, as a JSON object

get_sample_df()

Get the sample of the evaluation dataset on which the evaluation was performed

Returns

the sample content, as a pandas.DataFrame

class dataikuapi.dss.modelevaluationstore.DSSModelEvaluationFullInfo(model_evaluation, full_info)

A handle on the full information on a model evaluation.

Includes information such as the full id of the evaluated model, the evaluation params, the performance and drift metrics, if any, etc.

Do not create this class directly, instead use dataikuapi.dss.DSSModelEvaluation.get_full_info()

metrics

The performance and data drift metrics, if any.

creation_date

The date and time of the creation of the model evaluation, as an epoch.

user_meta

The user-accessible metadata (name, labels). Returns the original object, not a copy. Changes to the returned object are persisted to DSS by calling save_user_meta().

get_raw()
save_user_meta()
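
For instance, a hypothetical sketch of adding a label to a model evaluation and persisting it; the {"key": ..., "value": ...} label structure is an assumption based on the usage sample above:

# me is a dataikuapi DSSModelEvaluation handle
full_info = me.get_full_info()
# hypothetical label structure, mirroring the earlier usage sample
full_info.user_meta["labels"].append({"key": "myCustomLabel:evaluationDataset", "value": "v2"})
full_info.save_user_meta()
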
class dataikuapi.dss.modelevaluationstore.DataDriftParams(data)

Object that represents parameters for data drift computation. Do not create this object directly, use dataikuapi.dss.modelevaluationstore.DataDriftParams.from_params() instead.

static from_params(per_column_settings, nb_bins=10, compute_histograms=True, confidence_level=0.95)

Creates parameters for data drift computation from per column settings, number of bins, histogram computation flag, and confidence level

Parameters

  • per_column_settings (dict) – A dict representing the per column settings. You should use a PerColumnDriftParamBuilder to build it.

  • nb_bins (int) – (optional) Number of bins in histograms (applies to all columns). Default: 10

  • compute_histograms (bool) – (optional) Enable/disable histograms. Default: True

  • confidence_level (float) – (optional) Used to compute the confidence interval on the drift model accuracy. Default: 0.95

Return type

dataikuapi.dss.modelevaluationstore.DataDriftParams

class dataikuapi.dss.modelevaluationstore.PerColumnDriftParamBuilder

Builder for a map of per column drift params settings. Used as a helper, before computing data drift, to build the per column settings dict expected by dataikuapi.dss.modelevaluationstore.DataDriftParams.from_params().

build()

Returns the built dict for per column drift params settings

with_column_drift_param(name, handling='AUTO', enabled=True)

Sets the drift params settings for given column name.

Parameters

  • name (string) – The name of the column

  • handling (string) – (optional) The column type, should be either NUMERICAL, CATEGORICAL or AUTO (default: AUTO)

  • enabled (bool) – (optional) False means the column is ignored in drift computation (default: True)

class dataikuapi.dss.modelevaluationstore.DataDriftResult(data)

A handle on the data drift result of a model evaluation.

Do not create this class directly, instead use dataikuapi.dss.DSSModelEvaluation.compute_data_drift()

drift_model_result

Drift analysis based on drift modeling.

univariate_drift_result

Per-column drift analysis based on pairwise comparison of distributions.

per_column_settings

Information about column handling that has been used (errors, types, etc).

get_raw()
Returns

the raw data drift result

Return type

dict

class dataikuapi.dss.modelevaluationstore.DriftModelResult(data)

A handle on the drift model result.

Do not create this class directly, instead use dataikuapi.dss.modelevaluationstore.DataDriftResult.drift_model_result

get_raw()
Returns

the raw drift model result

Return type

dict

class dataikuapi.dss.modelevaluationstore.UnivariateDriftResult(data)

A handle on the univariate data drift.

Do not create this class directly, instead use dataikuapi.dss.modelevaluationstore.DataDriftResult.univariate_drift_result

per_column_drift_data

Drift data per column, as a dict of column name -> drift data.
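
For example, iterating over the per-column drift data of a data drift result:

# drift is a DataDriftResult, e.g. returned by me1.compute_data_drift(me2)
univariate = drift.univariate_drift_result
for column_name, drift_data in univariate.per_column_drift_data.items():
    print(column_name, drift_data)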

get_raw()
Returns

the raw univariate data drift

Return type

dict

class dataikuapi.dss.modelevaluationstore.ColumnSettings(data)

A handle on column handling information.

Do not create this class directly, instead use dataikuapi.dss.modelevaluationstore.DataDriftResult.get_per_column_settings()

actual_column_handling

The actual column handling (either forced via drift params or inferred from model evaluation preprocessings). It can be any of NUMERICAL, CATEGORICAL, or IGNORED.

default_column_handling

The default column handling (based on model evaluation preprocessing only). It can be any of NUMERICAL, CATEGORICAL, or IGNORED.

get_raw()
Returns

the raw column handling information

Return type

dict

class dataikuapi.dss.modelevaluationstore.DriftModelAccuracy(data)

A handle on the drift model accuracy.

Do not create this class directly, instead use dataikuapi.dss.modelevaluationstore.DriftModelResult.drift_model_accuracy

get_raw()
Returns

the drift model accuracy data

Return type

dict