ML Diagnostics

ML Diagnostics are designed to identify and help troubleshoot potential problems and suggest possible improvements at different stages of training and building machine learning models.

Some checks are based on the characteristics of the datasets and serve as warnings to avoid common pitfalls when interpreting the evaluation metrics.

Other checks run additional tests after training to identify overfitting or potential data leakage, allowing you to fix these issues before deployment.

Use of Diagnostics can be disabled in Analysis > Design > Debugging.

Dataset Sanity Checks

When evaluating a machine learning model, it is important that the dataset used for evaluation is representative of both the training data and future scoring data. This is often referred to as the i.i.d. (independent and identically distributed) assumption.

Test set might be too small for reliable performance estimation

If the test dataset is too small, the performance measurement may not be reliable. If possible, provide a larger test set. If the test set is split from the training set, either use a larger percentage for testing (the default is 20%) or use five-fold cross-validation.
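As an illustration of the five-fold option, the following minimal sketch splits row indices so that every row is used for testing exactly once. The function name `k_fold_indices` and the 103-row dataset are hypothetical; this is not how DSS implements its splitting.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each row lands in exactly one test fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [row for f, fold in enumerate(folds) if f != i for row in fold]
        yield train_idx, test_idx


# Hypothetical dataset of 103 rows, split into five folds.
splits = list(k_fold_indices(n_rows=103, k=5))
```

With five folds, each model is evaluated on roughly 20% of the rows, but the reported metric is averaged over all rows rather than over a single small test set.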

Target variable distribution in test data does not match the training data distribution, metrics could be misleading

If the test dataset’s target is drawn from a different distribution than that of the training dataset, the model may not generalize and may perform poorly. For example, if the training and testing data were collected at different times, the data may have changed in ways that need to be addressed.

Statistical tests are performed to assess whether the target of the test set is drawn from the same distribution as that of the training set. For classification tasks a Chi-squared test is used, and for regression tasks a Kolmogorov-Smirnov test is used. A p-value of less than 0.05 means that the difference is considered statistically significant.

Interactive statistics can be used to examine and better understand the target distributions of these two datasets.
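The two tests described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the DSS implementation: `chi2_binary_target` only handles a two-class target (one degree of freedom, no continuity correction), and `ks_two_sample` uses the asymptotic approximation of the p-value, which is only accurate for reasonably large samples. Both function names are hypothetical.

```python
import math
from collections import Counter


def chi2_binary_target(train_labels, test_labels):
    """Chi-squared homogeneity test for a two-class target (1 degree of freedom)."""
    classes = sorted(set(train_labels) | set(test_labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    train_counts, test_counts = Counter(train_labels), Counter(test_labels)
    observed = [[train_counts[c] for c in classes],
                [test_counts[c] for c in classes]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / total
            stat += (observed[r][c] - expected) ** 2 / expected
    # Survival function of a chi-squared variable with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2))


def ks_two_sample(train_values, test_values):
    """Two-sample Kolmogorov-Smirnov statistic with an asymptotic p-value."""
    a, b = sorted(train_values), sorted(test_values)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both empirical CDFs past every value equal to x.
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.3:
        # For small lam the alternating series is numerically unreliable
        # and the p-value is effectively 1.
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return d, min(1.0, max(0.0, p))
```

In both cases, a returned p-value below 0.05 corresponds to the situation in which the diagnostic would be raised.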

Training set might be too small for robust training

To train an ML model, the training dataset should be large and diverse enough to capture all the patterns needed to make reliable predictions. While this is task dependent, a good rule of thumb is to gather more than 1000 observations.

The dataset is imbalanced, metrics can be misleading

When training a classification model, one factor that can negatively impact the model is the balance between the classes. If one class is strongly underrepresented in the training data, it may be difficult for the model to make accurate predictions for this class. While DSS uses class weighting to help train models on imbalanced data, it is always better to gather more data for the underrepresented classes if at all possible.

During evaluation, it is also important to be mindful of the impact this imbalance has on the metrics. For example, in a binary classification task, if 94% of the target values are 1, then a model that always predicts 1 achieves 94% accuracy. In this case it is best to use metrics that are less dominated by the majority class, such as AUC, or F1-score, which balances precision and recall. The confusion matrix can also help to identify this type of failure.
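The 94% example can be reproduced in a few lines. The sketch below builds the confusion matrix for a model that always predicts 1 and shows that the high accuracy hides a recall of zero on the minority class. The per-class recall and balanced accuracy shown here are additions for illustration and are not mentioned above; note also that F1 computed with the majority class as the positive class would still look good, so the minority class is the one to inspect.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


# 94% of the targets are 1 and the model always predicts 1.
y_true = [1] * 94 + [0] * 6
y_pred = [1] * 100

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)        # looks good
recall_class_1 = tp / (tp + fn)           # perfect on the majority class
recall_class_0 = tn / (tn + fp)           # never finds the minority class
balanced_accuracy = (recall_class_1 + recall_class_0) / 2
```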

Modeling Parameters

Some modeling parameters need to be adapted to the characteristics of the data, otherwise they could lead to slower training time, possible data leakage or overfitting.

Outlier detection: The mini-cluster size threshold may be too high with respect to the training dataset size. Training might fail. Consider using a smaller value.

When performing outlier detection for a clustering ML task with a mini-cluster size threshold that is too high, all rows are likely to be dropped before training. If this happens, the training will fail. To avoid this, consider reducing the mini-cluster size threshold to less than ~10% of the training set size.
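The ~10% guideline above can be checked before launching a training. This is a minimal sketch; the function name and the `max_fraction` parameter are hypothetical and do not correspond to a DSS setting.

```python
def mini_cluster_threshold_too_high(threshold, n_train_rows, max_fraction=0.10):
    """Return True when the threshold is not below ~10% of the training set size."""
    return threshold >= max_fraction * n_train_rows


# Hypothetical training set of 200 rows.
risky = mini_cluster_threshold_too_high(threshold=50, n_train_rows=200)
safe = mini_cluster_threshold_too_high(threshold=10, n_train_rows=200)
```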

Training Speed

Training speed might not be optimal due to runtime environment bottlenecks, hyperparameter search strategy or other factors.

N remote workers failed to start.

In distributed hyperparameter search mode, some of the remote workers may fail to start without interrupting the whole search. The search then runs more slowly because fewer workers are running. This might be due to a failure to start a Kubernetes pod, or to some other issue on the cluster.

N remote workers taking a long time to start

When performing a distributed hyperparameter search, some remote workers are taking more than 2 minutes to start. This might be due to a bottleneck in the Kubernetes cluster, e.g. the maximum number of running pods has been reached.

N remote workers took more than T to start

In distributed hyperparameter search mode, some remote workers started successfully but took more than 2 minutes to start (T is the minimum time for a remote worker to start). This might be due to a bottleneck in the Kubernetes cluster, or to a slow-starting process in the container.

Overfitting Detection

Training a machine learning model is a delicate balance between bias and variance. If the model does not capture enough information from the data, it is underfit: it has high bias and is unable to make accurate predictions. On the other hand, if the model is overfit, it has fit the training set too closely and learned the noise specific to that set. It has high variance and will fail to generalize to new, unseen data.

The algorithm seems to have overfit the train set, all the leaves in the model are pure

For tree-based algorithms, one way to identify likely overfitting is to examine the leaves in the trees. For a classification task, if the model has been able to partition the data fully, then unless the task is relatively simple it has probably overfit, and the model size needs to be restricted. This can be detected if all the leaves in the model are pure, that is, each leaf contains training samples from a single class only.
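As a minimal sketch of the purity check, assume each leaf is described by the number of training samples it holds per class. This list-of-counts representation and the function name are hypothetical and chosen only for illustration.

```python
def all_leaves_pure(leaf_class_counts):
    """Return True when every leaf holds samples from at most one class."""
    return all(
        sum(1 for count in counts if count > 0) <= 1
        for counts in leaf_class_counts
    )


# Each inner list is one leaf: [samples of class 0, samples of class 1].
fully_partitioned = all_leaves_pure([[10, 0], [0, 7], [3, 0]])
still_mixed = all_leaves_pure([[10, 1], [0, 7], [3, 0]])
```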

Number of tree leaves is too large with respect to dataset size

For regression tasks, the number of leaves in a tree can hint at overfitting. If the number of leaves in a tree is greater than 50 percent of the number of samples, the tree has probably overfit the data. For tree ensembles such as Random Forest, if more than 10 percent of the trees in the model are overfit, it may be worth checking the parameters of the model.
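The two thresholds above (50 percent of the samples per tree, 10 percent of the trees per ensemble) can be expressed as follows. This is a sketch with hypothetical function and parameter names, not the check as implemented in DSS.

```python
def tree_looks_overfit(n_leaves, n_samples, max_leaf_ratio=0.5):
    """Flag a tree whose leaf count exceeds 50% of the number of samples."""
    return n_leaves > max_leaf_ratio * n_samples


def ensemble_needs_review(leaves_per_tree, n_samples, max_overfit_share=0.10):
    """Flag an ensemble when more than 10% of its trees look overfit."""
    flagged = sum(tree_looks_overfit(n, n_samples) for n in leaves_per_tree)
    return flagged / len(leaves_per_tree) > max_overfit_share


# Hypothetical forests of 10 trees trained on 1000 samples.
two_large_trees = ensemble_needs_review([600] * 2 + [100] * 8, n_samples=1000)
one_large_tree = ensemble_needs_review([600] * 1 + [100] * 9, n_samples=1000)
```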

The best way to avoid overfitting is to add more data if possible. Otherwise, constrain the model further by changing the hyperparameters of the algorithm: use additional regularization, or decrease the values of the hyperparameters that control the size of the model.

Leakage Detection

Data leakage occurs when information that will not be available at test or scoring time accidentally appears in the training dataset. This allows the model to achieve unrealistically high performance during training, but it will fail to reproduce this performance once deployed. A good example is a sales prediction task that uses windowing features capturing all sales for the previous week: if the window includes the day to be predicted, the model has information about the sales on the very day it is trying to predict.
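The windowing example can be made concrete with a short sketch. The sales figures and the function name `trailing_sum` are made up for illustration; the only point is the difference between a window that ends on the day to be predicted and one that ends the day before.

```python
def trailing_sum(values, day, window, include_current):
    """Sum `window` consecutive days ending at `day` (leaky) or just before it (safe)."""
    end = day + 1 if include_current else day
    start = max(0, end - window)
    return sum(values[start:end])


# Hypothetical daily sales; the last value is the day to be predicted.
daily_sales = [10, 12, 9, 11, 13, 8, 14, 100]
target_day = 7

leaky_feature = trailing_sum(daily_sales, target_day, window=7, include_current=True)
safe_feature = trailing_sum(daily_sales, target_day, window=7, include_current=False)
```

The leaky feature contains the target value itself, so a model trained on it would appear very accurate while learning nothing that is available at scoring time.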

Too good to be true?

One indicator that data leakage might have occurred is an extremely high performance metric, such as an AUC above 98%.

Feature has suspiciously high importance that could be indicative of a possible data leak or overfitting

If data leakage has occurred, the feature importances can offer insights into which features contain leaked data as the model will attribute very high importances to these features. DSS will warn you if it identifies that a single feature accounts for more than 80% of the feature importance.
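A minimal sketch of the 80% rule, assuming feature importances are available as a mapping from feature name to a non-negative score. The feature names below are hypothetical.

```python
def dominant_features(importances, threshold=0.80):
    """Return the features that account for more than `threshold` of total importance."""
    total = sum(importances.values())
    if total <= 0:
        return []
    return [name for name, value in importances.items() if value / total > threshold]


# Hypothetical importances: one feature dominates in the first model only.
suspicious = dominant_features({"sales_last_week": 0.90, "store_id": 0.06, "weekday": 0.04})
balanced = dominant_features({"sales_last_week": 0.50, "store_id": 0.30, "weekday": 0.20})
```

A flagged feature is a candidate for closer inspection, since high importance can also come from a legitimately strong predictor.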

Model Checks

When evaluating machine learning models, it is helpful to have a baseline model to compare with in order to establish that the model is performing better than an extremely simple rule. A good baseline is a dummy model which simply predicts the most common value. Especially in the presence of imbalanced data, this can give a more accurate picture of how much value this model is really able to bring.

The model accuracy is not significantly different than a random classifier

DSS calculates the performance a dummy classifier would achieve on this dataset and performs a statistical test to ensure that the trained model performs better. If the trained model does not outperform the dummy classifier, it could be indicative of problems in the data preparation or a lack of data for underrepresented classes.
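The comparison with a dummy classifier can be sketched as follows. The statistical test used by DSS is not specified here, so the one-sided exact binomial test below is an assumption chosen for illustration; it also simplifies by treating the baseline accuracy as a fixed value. Function names are hypothetical.

```python
import math
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Accuracy of a dummy model that always predicts the most common class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def p_value_better_than_baseline(n_correct, n_rows, baseline_accuracy):
    """P(X >= n_correct) for X ~ Binomial(n_rows, baseline_accuracy)."""
    return sum(
        math.comb(n_rows, k)
        * baseline_accuracy ** k
        * (1 - baseline_accuracy) ** (n_rows - k)
        for k in range(n_correct, n_rows + 1)
    )


# Hypothetical test set: 70 rows of class 1 and 30 rows of class 0.
y_true = [1] * 70 + [0] * 30
baseline = majority_baseline_accuracy(y_true)

p_strong_model = p_value_better_than_baseline(85, len(y_true), baseline)
p_weak_model = p_value_better_than_baseline(72, len(y_true), baseline)
```

Under this assumed test, a model with 85 correct predictions is clearly better than the baseline, whereas one with 72 correct predictions is not significantly different from it.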

R2 score is suspiciously low - the model is marginally better than a constant mean prediction / The mean constant predictor outperforms the model

For a regression model, an R2 score close to zero means that the model performs barely better than a constant predictor that always returns the mean of the target, and a negative R2 means that the constant mean predictor outperforms the model. It could be beneficial to add more features or data to the model.
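The link between R2 and the constant mean predictor follows from its definition, R2 = 1 - SS_res / SS_tot, where SS_tot is the squared error of the mean predictor. A minimal sketch with made-up values:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1, 2, 3, 4, 5]

r2_mean_predictor = r2_score(y_true, [3, 3, 3, 3, 3])   # constant mean prediction
r2_perfect = r2_score(y_true, [1, 2, 3, 4, 5])          # exact predictions
r2_worse_than_mean = r2_score(y_true, [5, 4, 3, 2, 1])  # reversed predictions
```

The constant mean predictor scores exactly 0, so any model with an R2 near or below 0 adds little or nothing over that baseline.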

ML assertions

ML assertions provide a way to streamline and accelerate the model evaluation process, by automatically checking that predictions for specified subpopulations meet certain conditions.

DSS raises diagnostics to warn you when assertions could not be computed or fail.

X assertion(s) failed

DSS computes each assertion and warns you if any fail.

X assertion(s) got 0 matching rows

After applying the filter to the test set, a diagnostic is raised if the subsample is empty, i.e. none of the rows met the criteria.

X assertion(s) got matching rows but all rows were dropped by the model’s preprocessing

After applying the model’s preprocessing to the subsample a diagnostic is raised if the preprocessed subsample is empty, i.e. all rows that matched the criteria were dropped during the preprocessing. Rows may have been dropped because of the feature handling chosen, or because targets were not defined for those rows.

In the three diagnostics above, X is an integer less than or equal to the total number of assertions defined for the ML task.
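The three outcomes above can be illustrated with a small sketch. The data structures, the function `evaluate_assertion`, and the idea of a minimum ratio of valid predictions are hypothetical simplifications for illustration and are not the DSS API.

```python
def evaluate_assertion(rows, predictions, kept_by_preprocessing,
                       row_filter, is_valid, min_valid_ratio):
    """Return the outcome of one assertion on a test set."""
    matching = [i for i, row in enumerate(rows) if row_filter(row)]
    if not matching:
        return "no_matching_rows"
    # Only rows that survive the model's preprocessing have a prediction.
    scored = [i for i in matching if kept_by_preprocessing[i]]
    if not scored:
        return "all_rows_dropped"
    valid_ratio = sum(1 for i in scored if is_valid(predictions[i])) / len(scored)
    return "passed" if valid_ratio >= min_valid_ratio else "failed"


# Hypothetical test set with a "segment" column.
rows = [{"segment": "vip"}, {"segment": "vip"}, {"segment": "std"}]


def is_vip(row):
    return row["segment"] == "vip"


def predicts_one(prediction):
    return prediction == 1


passed = evaluate_assertion(rows, [1, 1, 0], [True, True, True], is_vip, predicts_one, 0.9)
failed = evaluate_assertion(rows, [1, 0, 0], [True, True, True], is_vip, predicts_one, 0.9)
dropped = evaluate_assertion(rows, [1, 1, 0], [False, False, True], is_vip, predicts_one, 0.9)
empty = evaluate_assertion(rows, [1, 1, 0], [True, True, True],
                           lambda row: row["segment"] == "new", predicts_one, 0.9)
```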