Advanced models optimization
Each machine learning algorithm has some settings, called hyperparameters.
For each algorithm that you select in DSS, you can ask DSS to explore several values for each hyperparameter. For example, for a regression algorithm, you can try several values of the regularization. If you ask DSS to explore several values for several hyperparameters, all combinations of values will be assessed in a “grid search”.
DSS will automatically try each combination and only keep the best one. “Best” means that it maximizes the metric chosen in the Metric section. To do so, DSS resplits the train set and extracts a “cross-validation” set. It then repeatedly trains on train set minus cross-validation set, and then verifies how the model performed on the cross-validation set.
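The enumeration described above can be illustrated with a minimal sketch. It is not DSS code: the grid contents and the `cross_validation_score` function are hypothetical stand-ins for a real model and metric, and the sketch assumes a metric where higher is better.

```python
from itertools import product

# Hypothetical grid: two hyperparameters with a few candidate values each.
grid = {
    "regularization": [0.01, 0.1, 1.0],
    "max_depth": [3, 5],
}

def cross_validation_score(params):
    """Toy stand-in for 'train without the cross-validation set, then score
    on the cross-validation set'. Higher is better."""
    return -abs(params["regularization"] - 0.1) - abs(params["max_depth"] - 5) / 10

# Every combination of values is one point of the grid.
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

# Keep only the combination with the best metric.
best_params = max(combinations, key=cross_validation_score)
```

With 3 values for one hyperparameter and 2 for the other, the grid contains 3 × 2 = 6 points, which is why the search cost grows quickly as values are added.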
Hyperparameter optimization is always performed before the Train / Test is applied. During this optimization of hyperparameters, DSS never uses the test set, which must remain “pristine” for final evaluation of the model quality.
DSS gives you a lot of settings to tune how the search for the best hyperparameters is performed.
You can tune the following parameters
There are several strategies for selecting the cross-validation set.
With a simple train/validation split, the training set is split into a “real training” set and a “cross-validation” set. The split is performed either randomly, or according to a time variable if “Time-based ordering” is activated in the “Train/test split” section. For each value of each hyperparameter, DSS trains the model and computes the evaluation metric, keeping the value of the hyperparameter that provides the best evaluation metric.
The obvious drawback of this method is that it further restricts the size of the data on which DSS truly trains. This method also comes with some uncertainty, linked to the characteristics of the random split.
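As a rough illustration of a single split, the sketch below separates training rows into a “real training” part and a “cross-validation” part. The function name, the `validation_ratio` parameter and the choice to put the most recent rows in the cross-validation set when a time variable is used are assumptions made for this example, not documented DSS behavior.

```python
import random

def simple_split(rows, validation_ratio=0.2, time_key=None, seed=0):
    """Return (real_train, validation) from the training rows.

    If time_key is given, rows are sorted by it and the most recent ones form
    the validation set (an assumption of this sketch). Otherwise the rows are
    shuffled with a fixed seed before being cut.
    """
    if time_key is not None:
        ordered = sorted(rows, key=time_key)
    else:
        ordered = list(rows)
        random.Random(seed).shuffle(ordered)
    cut = len(ordered) - int(round(len(ordered) * validation_ratio))
    return ordered[:cut], ordered[cut:]
```

With 10 rows and a ratio of 0.2, only 8 rows are left for training, which shows the drawback mentioned above. Changing `seed` changes which rows are held out, which is the source of the uncertainty.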
With K-fold cross-validation, by default the training set is randomly split into K equally sized portions, known as folds. Random splits are stratified with respect to the prediction target, so that the percentage of samples of each class is preserved in each fold.
For each hyperparameter combination and each fold, DSS trains the model on K-1 folds and computes the evaluation metric on the remaining one. For each combination, DSS then computes the average metric across all folds. DSS keeps the hyperparameter combination that maximizes this average evaluation metric across folds and then retrains the model with this hyperparameter combination on the whole training set.
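The fold mechanics can be sketched in plain Python as follows. This is a simplified illustration, not the DSS implementation: `stratified_folds`, `k_fold_score` and the `train_and_score` callback are hypothetical names, and stratification is approximated by dealing the samples of each class to the folds in turn.

```python
import random
from collections import defaultdict
from statistics import mean

def stratified_folds(targets, k, seed=0):
    """Assign each sample index to one of k folds, class by class, so that
    every fold keeps roughly the same class proportions."""
    by_class = defaultdict(list)
    for index, target in enumerate(targets):
        by_class[target].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def k_fold_score(targets, k, train_and_score):
    """Average the metric over k rounds of 'train on k-1 folds, score on 1'."""
    folds = stratified_folds(targets, k)
    scores = []
    for held_out in range(k):
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        scores.append(train_and_score(train_idx, folds[held_out]))
    return mean(scores)
```

Calling `k_fold_score` once per hyperparameter combination and keeping the combination with the highest average reproduces the selection rule described above. The final retraining on the whole training set is a separate, additional fit.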
Note that if “Time-based ordering” is activated in the “Train/test split” section, the training set is instead split into K equally sized portions sorted according to the time variable. The training splits are then assembled so that they ignore samples posterior to each evaluation split, which emulates a forecasting situation (see e.g. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html).
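One possible reading of this time-ordered scheme is sketched below. The function name is hypothetical, and the handling of the earliest portion (used for training only, never for evaluation) is an assumption, since the text does not say how DSS treats it.

```python
def time_ordered_splits(timestamps, k):
    """Yield (train_indices, evaluation_indices) pairs in which every
    training sample is earlier than the evaluation portion."""
    ordered = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    size = len(ordered) // k
    portions = [ordered[p * size:(p + 1) * size] for p in range(k - 1)]
    portions.append(ordered[(k - 1) * size:])  # last portion takes the remainder
    # Assumption: the earliest portion is only ever used for training.
    for p in range(1, k):
        train = [i for portion in portions[:p] for i in portion]
        yield train, portions[p]
```

Each evaluation portion is scored by a model that has only seen earlier samples, so no information from the future leaks into training.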
This method increases the training time (roughly by a factor of K) but allows training on the whole training set. It also decreases the uncertainty, since it provides several values for the metric.
The methodology for K-Fold cross-validation is the same as K-Fold cross-test but they serve different goals. K-Fold cross-validation aims at finding the best hyperparameter combination. K-Fold cross-test aims at evaluating error margins on the final scores by using the test set.
When using the K-fold strategy both for hyperparameter search (cross-validation) and for testing (cross-test), the following steps are applied for all algorithms:
- Step 1, hyperparameter search: The dataset is split into K_val folds. For each hyperparameter combination, a model is trained K_val times to find the best combination. Finally, the model with the best combination is retrained on the entire dataset. This will be the model used for deployment.
- Step 2, test: The dataset is split again into K_test folds, independently from the previous step. The model with the best hyperparameter combination from Step 1 is trained and evaluated on the new test folds. The performance metric is reported as the average across test folds, with min-max values for estimating uncertainty.
The two steps are done independently, but are shared for all algorithms. Hence one algorithm is compared to another using the same folds.
The number of model trainings needed for a given algorithm to go through the two steps is:
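Reading the two steps above literally, the count can be worked out as in the sketch below. The function and the name `n_combinations` (the number of hyperparameter combinations in the grid) are introduced here for illustration; the arithmetic is derived from the step descriptions rather than quoted from DSS.

```python
def count_trainings(n_combinations, k_val, k_test):
    """Count model fits for one algorithm, following the two steps."""
    search_fits = n_combinations * k_val  # every combination is trained on every fold
    final_refit = 1                       # best combination retrained on the entire dataset
    test_fits = k_test                    # best combination trained once per test fold
    return search_fits + final_refit + test_fits
```

For example, a grid of 6 combinations with K_val = 5 and K_test = 3 gives 6 × 5 + 1 + 3 = 34 trainings.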
This only applies to the “Python in-memory” training engine.
If you are using scikit-learn or XGBoost, you can provide a custom cross-validation object. This object must follow the protocol for cross-validation objects of scikit-learn.
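In scikit-learn, a cross-validation object is one that exposes `split(X, y, groups)`, yielding pairs of train and test indices, and `get_n_splits(X, y, groups)`. The class below is a minimal, dependency-free sketch of that protocol. The class name, the expanding-window logic and the variable name `cv` are illustrative assumptions, and how DSS expects the object to be exposed is not covered here.

```python
class ExpandingWindowSplit:
    """Hypothetical splitter exposing the two methods scikit-learn expects
    from a cross-validation object: split() and get_n_splits()."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        block = n_samples // (self.n_splits + 1)
        if block == 0:
            raise ValueError("not enough samples for the requested number of splits")
        for i in range(1, self.n_splits + 1):
            # Train on everything before the block, evaluate on the block.
            train = list(range(0, i * block))
            test = list(range(i * block, (i + 1) * block))
            yield train, test

cv = ExpandingWindowSplit(n_splits=3)
```

scikit-learn's built-in splitters yield NumPy integer arrays; plain lists are used here only to keep the sketch free of dependencies.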
If you select a large number of hyperparameters to optimize, or a large number of values for each of them, training can become very slow.
At any time while DSS is grid-searching, you can choose to interrupt the optimization. DSS will finish the current grid point it is evaluating, and will train and evaluate the model on the “best hyperparameters found so far”.
We recommend that you enable “Randomize grid search” if you plan on interrupting your grid search.
An interrupted grid search can be resumed later on. DSS will only try the hyperparameter values that it hadn’t tried yet.
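The interaction between randomization, interruption and resuming can be sketched as follows. This is an illustration under stated assumptions, not how DSS is implemented: `grid_points`, `run_search`, the `state` dictionary and the `budget` argument (which simulates an interruption) are all hypothetical.

```python
import random
from itertools import product

def grid_points(grid, randomize=True, seed=0):
    """List every grid point, optionally in a shuffled order."""
    names = list(grid)
    points = [dict(zip(names, values)) for values in product(*grid.values())]
    if randomize:
        random.Random(seed).shuffle(points)
    return points

def run_search(points, score, state, budget):
    """Evaluate up to `budget` untried points; update and return `state`.

    `state` records which points were tried and the best one so far, so
    calling run_search again with the same state resumes the search.
    """
    for point in points:
        if budget == 0:
            break  # simulated interruption, between two grid points
        key = tuple(sorted(point.items()))
        if key in state["tried"]:
            continue  # already evaluated before the interruption
        value = score(point)
        state["tried"].add(key)
        if state["best"] is None or value > state["best"][1]:
            state["best"] = (point, value)
        budget -= 1
    return state
```

A plausible reason for the recommendation above is that an unshuffled grid is visited in a fixed order, so an early interruption would only have explored one corner of it. A shuffled order makes the points tried before the interruption a more representative sample of the grid.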
If you have selected several hyperparameters for DSS to test, during training, DSS will show a graph of the evolution of the best cross-validation scores found so far. DSS only shows the best score found so far, so the graph will show “ever-improving” results, even though the latest evaluated model might not be good. If you hover over one of the points, you’ll see the evolution of hyperparameter values that yielded an improvement.
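The “ever-improving” shape of this graph is simply a running maximum. A minimal sketch, assuming a metric where higher is better and using made-up scores:

```python
from itertools import accumulate

# Hypothetical cross-validation scores, in the order the grid points finished.
scores = [0.71, 0.78, 0.74, 0.80, 0.79]

# Running maximum: the value plotted at each step is the best score seen so far.
best_so_far = list(accumulate(scores, max))

# Steps at which the best score actually improved.
improving_steps = [
    i for i, s in enumerate(scores) if i == 0 or s > best_so_far[i - 1]
]
```

Here the third and fifth models score lower than an earlier one, so the curve stays flat at those steps instead of dropping.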
In the right part of the charts, you see final test scores for completed models (models for which the grid-search phase is done).
The time shown on the X axis is the time spent training that particular algorithm. DSS does not train all algorithms at once, but the X axis of each algorithm starts at 0.
The scores that you see in the left part of the chart are cross-validation scores, computed on the cross-validation set. They cannot be directly compared to the test scores that you see in the right part:
- They are not computed on the same data set
- They are not computed with the same model (after grid-search, DSS retrains the model on the whole train set)
In this example:
- Even though XGBoost was better than Random Forest on the cross-validation set, ultimately on the test set (once trained on the whole dataset), Random Forest won (this might indicate that Random Forest didn’t have enough data once the cross-validation set was taken out)
- The ANN scored 0.83 on the cross-validation set, but its final score on the test set was slightly lower at 0.812