Advanced models optimization
Each machine learning algorithm has settings, called hyperparameters.
For each algorithm that you select in DSS, you can ask DSS to explore several values for each parameter. For example, for a regression algorithm, you can try several values of the regularization parameter.
DSS will automatically try each specified value and keep only the best one. This process of hyperparameter optimization is also known as "grid search".
To decide which value is best, DSS re-splits the training set and extracts a "cross-validation" set. It then repeatedly trains on the training set minus the cross-validation set, and verifies how the model performs on the cross-validation set.
During this optimization of hyper-parameters, DSS never uses the test set, which must remain “pristine” for final evaluation of the model quality.
DSS gives you many settings to tune how the search for the best hyperparameters is performed.
There are several strategies for selecting the cross-validation set.
With the simple split method, the training set is split into a "real training" set and a "cross-validation" set. For each value of each hyperparameter, DSS trains the model and computes the evaluation metric, keeping the hyperparameter value that yields the best evaluation metric.
The obvious drawback of this method is that it further restricts the size of the data on which DSS truly trains. This method also comes with some uncertainty, linked to the characteristics of the split.
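The simple split strategy can be sketched as follows. This is a minimal illustration using scikit-learn, not DSS's actual implementation; the dataset, model, and grid of regularization values are placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Carve a cross-validation set out of the training data.
# The test set (not shown) stays untouched during this search.
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2, random_state=0)

best_alpha, best_score = None, float("-inf")
for alpha in [0.01, 0.1, 1.0, 10.0]:  # grid of regularization values to try
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    score = r2_score(y_cv, model.predict(X_cv))
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```

Note that every candidate is trained on only part of the training data, which is the drawback described above.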
With K-fold cross-validation, the training set is split into n equally sized portions, known as folds. For each value of the hyperparameter and each fold, DSS trains the model on the other n-1 folds and computes the evaluation metric on the remaining fold. For each hyperparameter value, DSS keeps the average over all folds. DSS keeps the value of the hyperparameter that provides the best evaluation metric and then retrains the model with this hyperparameter value on the whole training set.
This method increases the training time (roughly by a factor of n) but allows training on the whole training set. It also decreases the uncertainty, since it provides several values for the metric.
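The K-fold variant can be sketched with scikit-learn's GridSearchCV; DSS's own implementation may differ, and the dataset and grid below are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# cv=5: each alpha is scored as the average over 5 folds.
# refit=True (the default) then retrains the best model on the whole
# training set, matching the behavior described above.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```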
K-fold cross-validation is a way to optimize hyperparameters on a cross-validation set. It should not be confused with K-fold cross-test, which is used to evaluate error margins on the final scores by using the test set.
This only applies to the "Python in-memory" training engine.
If you are using scikit-learn or XGBoost, you can provide a custom cross-validation object. This object must follow the protocol for cross-validation objects of scikit-learn.
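The scikit-learn splitter protocol requires a `get_n_splits` method and a `split` method yielding (train indices, validation indices) pairs. Here is a minimal sketch of a custom splitter; the class name and 80/20 split are illustrative, not anything DSS prescribes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

class FirstLastSplit:
    """One fold: train on the first 80% of rows, validate on the last 20%."""

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1

    def split(self, X, y=None, groups=None):
        n = len(X)
        cut = int(n * 0.8)
        yield np.arange(cut), np.arange(cut, n)

# Any scikit-learn utility that accepts a `cv` argument can use it:
X, y = make_regression(n_samples=100, n_features=5, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=FirstLastSplit())
print(scores)
```

A splitter like this is useful when rows have an ordering (e.g. time) that a random split would leak across.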
If you select a large number of hyperparameters and hyperparameter values to optimize, training can become very slow.
At any time while DSS is grid-searching, you can interrupt the optimization. DSS will finish evaluating the current grid point, then train and evaluate the model with the best hyperparameters found so far.
We recommend that you enable “Randomize grid search” if you plan on interrupting your grid search.
An interrupted grid search can be resumed later on. DSS will only try the hyperparameter values that it hadn’t tried yet.
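The interplay between randomization and interruption can be sketched as follows. This is a conceptual illustration, not DSS code: the grid, the `evaluate` stand-in, and all names are made up.

```python
import itertools
import random

# A small hyperparameter grid, expanded into individual grid points.
grid = {"max_depth": [3, 5, 7], "n_estimators": [50, 100]}
points = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
random.Random(0).shuffle(points)  # "Randomize grid search"

def evaluate(params):
    # Stand-in for training a model and scoring it on the
    # cross-validation set.
    return -params["max_depth"] + params["n_estimators"] / 100

best, best_score = None, float("-inf")
for params in points:
    # The loop could be interrupted after any iteration: because the
    # order is randomized, `best` is already a reasonable choice, and
    # the remaining (untried) points can be resumed later.
    score = evaluate(params)
    if score > best_score:
        best, best_score = params, score

print(best)
```

Without randomization, an early interruption would only ever have explored one corner of the grid, which is why randomizing is recommended when you plan to interrupt.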
If you have selected several hyperparameters for DSS to test, during training, DSS will show a graph of the evolution of the best cross-validation scores found so far. DSS only shows the best score found so far, so the graph will show “ever-improving” results, even though the latest evaluated model might not be good. If you hover over one of the points, you’ll see the evolution of hyperparameter values that yielded an improvement.
In the right part of the chart, you see the final test scores for completed models (models for which the grid-search phase is done).
The timing on the X axis represents the time spent training this particular algorithm. DSS does not train all algorithms at once, so each algorithm has its own X axis starting at 0.
The scores that you are seeing in the left part of the chart are cross-validation scores on the cross-validation set. They cannot be directly compared to the test scores that you see in the right part.
- They are not computed on the same data set
- They are not computed with the same model (after grid-search, DSS retrains the model on the whole train set)
In this example:
- Even though XGBoost was better than Random Forest on the cross-validation set, ultimately on the test set (once trained on the whole dataset), Random Forest won (this might indicate that the Random Forest didn't have enough data once the cross-validation set was held out)
- The ANN scored 0.83 on the cross-validation set, but its final score on the test set was slightly lower at 0.812