Advanced models optimization¶
- Search strategies
- Interrupting and resuming hyperparameter search
- Visualization of hyperparameter search results
Each machine learning algorithm has some settings, called hyperparameters.
For each algorithm that you select in DSS, you can ask DSS to explore several values for each hyperparameter. For example, for a regression algorithm, you can try several values of the regularization parameter. If you ask DSS to explore several values for hyperparameters, all combinations of values are assessed in a “hyperparameter optimization” phase.
Instead of searching for specific discrete values, like “1, 3, 10, 30”, DSS can also search for hyperparameters in continuous ranges like “between 1 and 30”.
DSS will automatically train a model for each combination of hyperparameters and keep only the best one. “Best” means that it maximizes the metric chosen in the Metric section. To do so, DSS resplits the train set and extracts a “cross-validation” set. It then repeatedly trains on the train set minus the cross-validation set, and verifies how the model performs on the cross-validation set.
Hyperparameter optimization is always performed before the final Train / Test evaluation. During this optimization of hyperparameters, DSS never uses the test set, which must remain “pristine” for the final evaluation of model quality.
DSS gives you a lot of settings to tune how the search for the best hyperparameters is performed.
The most classical strategy for hyperparameter search is called “Grid search”. For each hyperparameter, you specify either a list of values to test, or a range specification like “5 values equally spaced between 30 and 80” or “8 values logarithmically spaced between 1 and 1000”.
DSS tries all combinations of all hyperparameters as discrete “grid points”.
The grid can be explored either in order or in a shuffled order. Shuffling the grid tends to find better points earlier on average, which is preferable if you plan to interrupt the search.
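The grid search described above can be sketched in a few lines. This is a toy illustration, not DSS's implementation: the scoring function and the hyperparameter names `C` and `gamma` are hypothetical stand-ins for a real train-and-evaluate step.

```python
import itertools
import random

def make_grid(param_specs):
    """Expand per-parameter value lists into all discrete grid points."""
    names = list(param_specs)
    combos = itertools.product(*(param_specs[n] for n in names))
    return [dict(zip(names, c)) for c in combos]

def grid_search(score_fn, param_specs, shuffle=True, seed=0):
    """Evaluate every grid point and keep the best-scoring one.
    Shuffling matters mainly when the search can be interrupted early."""
    grid = make_grid(param_specs)
    if shuffle:
        random.Random(seed).shuffle(grid)
    best = max(grid, key=score_fn)
    return best, score_fn(best)

# Toy objective peaking at C=10, gamma=0.01 (hypothetical hyperparameters).
def toy_score(p):
    return -((p["C"] - 10) ** 2) - (100 * (p["gamma"] - 0.01)) ** 2

# "8 values logarithmically spaced between 1 and 1000" for C:
log_spaced = [10 ** (3 * i / 7) for i in range(8)]
best, best_score = grid_search(toy_score, {"C": log_spaced,
                                           "gamma": [0.001, 0.01, 0.1]})
```

Note that the number of grid points is the product of the per-parameter value counts (here 8 × 3 = 24), which is why grid search becomes expensive quickly as hyperparameters are added.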
Instead of exploring discrete points on a grid, random search treats the hyperparameters as a continuous space and tests randomly chosen points in that space.
For each hyperparameter, you specify a range to test. DSS will then pick random points in the space defined by all parameters and test them.
A Random search is by nature infinite, so you must set a maximum number of iterations and/or a maximum time, after which the search stops.
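The bounded random search described above can be sketched as follows. The ranges and the toy objective are illustrative assumptions, not DSS defaults; `C` is sampled log-uniformly, a common choice for scale-like hyperparameters.

```python
import math
import random
import time

def sample_point(rng):
    """Draw one random point in the hyperparameter space (illustrative ranges)."""
    return {
        "C": 10 ** rng.uniform(0, 3),      # log-uniform between 1 and 1000
        "gamma": rng.uniform(0.001, 0.1),  # uniform range
    }

def random_search(score_fn, max_iter=200, max_seconds=None, seed=0):
    """Random search must be bounded: stop after max_iter points
    and/or after a wall-clock budget."""
    rng = random.Random(seed)
    deadline = None if max_seconds is None else time.monotonic() + max_seconds
    best, best_score = None, -math.inf
    for _ in range(max_iter):
        if deadline is not None and time.monotonic() > deadline:
            break
        point = sample_point(rng)
        score = score_fn(point)
        if score > best_score:
            best, best_score = point, score
    return best, best_score

# Toy objective peaking at C=10, gamma=0.01 (hypothetical hyperparameters).
best, best_score = random_search(
    lambda p: -((p["C"] - 10) ** 2) - (100 * (p["gamma"] - 0.01)) ** 2,
    max_iter=500,
)
```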
Bayesian search starts like a Random search, but as new points in the hyperparameter space are tried, a predictive model is trained to model the search space. This predictive model is used to focus the search on the most promising parts of the hyperparameter space, so that a good set of hyperparameters is reached faster.
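The surrogate-guided loop can be illustrated with a deliberately crude sketch. This is not scikit-optimize's algorithm (which uses a Gaussian-process surrogate and a proper acquisition function); here the surrogate is a k-nearest-neighbor average and the exploration bonus is a simple distance term, just to show the principle: evaluated points inform where to sample next.

```python
import random

def toy_score(x):
    # 1-D objective on [0, 1]; peak at x = 0.7 (illustrative only)
    return -(x - 0.7) ** 2

def surrogate_predict(x, tried, k=3):
    """Toy surrogate: mean score of the k nearest evaluated points."""
    nearest = sorted(tried, key=lambda t: abs(t[0] - x))[:k]
    return sum(s for _, s in nearest) / len(nearest)

def bayesian_like_search(score_fn, n_init=5, n_iter=25, seed=0):
    rng = random.Random(seed)
    # Start like a random search with a few initial points.
    tried = [(x, score_fn(x)) for x in (rng.random() for _ in range(n_init))]
    for _ in range(n_iter):
        # Rank random candidates by surrogate prediction plus a small
        # exploration bonus for being far from already-evaluated points.
        candidates = [rng.random() for _ in range(50)]
        def acquisition(x):
            explore = min(abs(t[0] - x) for t in tried)
            return surrogate_predict(x, tried) + 0.1 * explore
        x = max(candidates, key=acquisition)
        tried.append((x, score_fn(x)))
    return max(tried, key=lambda t: t[1])

best_x, best_score = bayesian_like_search(toy_score)
```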
DSS Bayesian search leverages a dedicated Python package, scikit-optimize, and therefore requires running in a code environment with the appropriate packages installed. To do so, you need to:
- Create a new code environment
- Go to the “Packages to install” tab of this code-env and click on “Add sets of packages”
- Select “Visual Machine Learning with Bayesian search (scikit-learn, XGBoost, scikit-optimize)” and click “Add”
- Update your code-env
You can now select the code-env in the “Runtime environment” tab of the Design part of the Lab, and train your experiments leveraging Bayesian search.
There are several strategies for selecting the cross-validation set.
With the simple split method, the training set is split into a “real training” set and a “cross-validation” set. The split is performed either randomly, or according to a time variable if “Time-based ordering” is activated in the “Train/test split” section. For each combination of hyperparameter values, DSS trains the model and computes the evaluation metric, keeping the values that provide the best evaluation metric.
The obvious drawback of this method is that it further restricts the size of the data on which DSS truly trains. This method also comes with some uncertainty, linked to the characteristics of the random split.
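The simple split can be sketched as a plain holdout: carve a cross-validation set out of the training rows and keep the rest for actual training. The 20% validation fraction below is an illustrative assumption, not a DSS default.

```python
import random

def holdout_split(rows, valid_fraction=0.2, seed=0):
    """Randomly carve a cross-validation set out of the training set."""
    rows = rows[:]  # avoid mutating the caller's list
    random.Random(seed).shuffle(rows)
    n_valid = int(len(rows) * valid_fraction)
    return rows[n_valid:], rows[:n_valid]  # (real training, cross-validation)

data = list(range(100))  # stand-in for training rows
train_rows, valid_rows = holdout_split(data)
```

Every row lands in exactly one of the two parts, which is also why the “real training” set shrinks: the held-out rows are never used to fit the model during the search.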
With the K-fold method, the training set is by default randomly split into K equally sized portions, known as folds. Random splits are stratified with respect to the prediction target, so that the percentage of samples of each class is preserved in each fold.
For each combination of hyperparameters and each fold, DSS trains the model on the other K-1 folds and computes the evaluation metric on the remaining one. For each combination, DSS then computes the average metric across all folds. DSS keeps the hyperparameter combination that maximizes this average evaluation metric across folds and then retrains the model with this hyperparameter combination on the whole training set.
Note that if “Time-based ordering” is activated in the “Train/test split” section, the training set is split into K equally sized portions sorted according to the time variable, and the training splits are assembled so as to ignore samples posterior to each evaluation split, emulating a forecasting situation (see e.g. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html).
This method increases the training time (roughly by a factor of K) but allows training on the whole training set (and also decreases uncertainty, since it provides several values for the metric).
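The K-fold procedure above can be sketched as follows. For brevity this sketch omits stratification and shuffling and works on sample indices only; `train_eval` is a hypothetical callback standing in for "train on these indices, score on those".

```python
def kfold_indices(n, k):
    """Split n sample indices into k (almost) equally sized folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(train_eval, n, k=5):
    """For each fold: train on the other k-1 folds, score on the held-out
    one, then average the k scores (the value compared across combinations)."""
    folds = kfold_indices(n, k)
    scores = []
    for i, valid in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_eval(train, valid))
    return sum(scores) / k

# Toy "train and evaluate": the score is just the held-out fold's mean index.
avg = cross_validate(lambda train, valid: sum(valid) / len(valid), n=10, k=5)
```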
The methodology for K-Fold cross-validation is the same as K-Fold cross-test but they serve different goals. K-Fold cross-validation aims at finding the best hyperparameter combination. K-Fold cross-test aims at evaluating error margins on the final scores by using the test set.
When using K-fold strategy both for hyperparameter search (cross-validation) and for testing (cross-test), the following steps are applied for all algorithms:
- Hyperparameter search: The dataset is split into K_val folds. For each combination of hyperparameters, a model is trained K_val times to find the best combination. Finally, the model with the best combination is retrained on the entire dataset. This is the model used for deployment.
- Test: The dataset is split again into K_test folds, independently from the previous step. The model with the best hyperparameter combination from Step 1 is trained and evaluated on the new test folds. The performance metric is reported as the average across test folds, with min-max values to estimate uncertainty.
The two steps are done independently, but the splits are shared across all algorithms. Hence one algorithm is compared to another using the same folds.
The number of model trainings needed for a given algorithm to go through the two steps is the number of hyperparameter combinations times K_val (search), plus one (final retrain on the whole dataset), plus K_test (cross-test).
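Taking the two steps above literally (K_val trainings per combination, one final retrain, and K_test trainings for the cross-test), the count can be computed as follows; the example figures are arbitrary.

```python
def n_trainings(n_combinations, k_val, k_test):
    # search phase + final retrain on the full dataset + cross-test folds
    return n_combinations * k_val + 1 + k_test

# e.g. 20 hyperparameter combinations, 5 validation folds, 3 test folds:
total = n_trainings(20, k_val=5, k_test=3)  # 20*5 + 1 + 3 = 104
```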
This only applies to the “Python in-memory” training engine
If you are using scikit-learn or XGBoost, you can provide a custom cross-validation object. This object must follow the protocol for cross-validation objects of scikit-learn.
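A minimal sketch of that protocol: scikit-learn expects an object exposing `get_n_splits()` and a `split(X, y=None, groups=None)` generator yielding `(train_indices, test_indices)` pairs. The interleaved split below is a made-up toy scheme just to show the shape (a real splitter would typically yield numpy index arrays; plain lists keep the sketch dependency-free).

```python
class EveryOtherSplit:
    """Toy cross-validation object: fold k holds out every k-th sample."""

    def __init__(self, n_splits=2):
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n = len(X)
        for k in range(self.n_splits):
            test = list(range(k, n, self.n_splits))
            train = [i for i in range(n) if i % self.n_splits != k]
            yield train, test

cv = EveryOtherSplit(n_splits=2)
splits = list(cv.split(list(range(6))))
```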
If you select a large number of hyperparameters to optimize and hyperparameter values, training can become very slow.
At any time while DSS is searching, you can choose to interrupt the optimization. DSS will finish the current point it is evaluating, and will train and evaluate the model on the “best hyperparameters found so far”.
If you are using the Grid search strategy, we recommend enabling “Randomize grid search” if you plan on interrupting the search.
Alternatively, before starting the search, you can select a maximum time or number of points to evaluate. DSS will automatically interrupt the search when one of these criteria is reached.
If you have selected several hyperparameters for DSS to test, during training, DSS will show a graph of the evolution of the best cross-validation scores found so far. DSS only shows the best score found so far, so the graph will show “ever-improving” results, even though the latest evaluated model might not be good. If you hover over one of the points, you’ll see the evolution of hyperparameter values that yielded an improvement.
In the right part of the charts, you see final test scores for completed models (models for which the hyperparameter-search phase is done).
The timing that you see on the X axis represents the time spent training this particular algorithm. DSS does not train all algorithms at once; each algorithm has its own X axis starting at 0.
The scores that you are seeing in the left part of the chart are cross-validation scores on the cross-validation set. They cannot be directly compared to the test scores that you see in the right part.
- They are not computed on the same data set
- They are not computed with the same model (after hyperparameter-search, DSS retrains the model on the whole train set)
In this example:
- Even though XGBoost was better than Random Forest on the cross-validation set, ultimately on the test set (once trained on the whole dataset), Random Forest won (this might indicate that the Random Forest didn’t have enough data once the cross-validation set was held out)
- The ANN scored 0.83 on the cross-validation set, but its final score on the test set was slightly lower at 0.812