Prediction (Supervised ML)
Prediction (aka supervised machine learning) is used when you have a target variable that you want to predict. For instance, you may want to predict the price of apartments in New York City using the size of the apartments, their location and amenities in the building. In this case, the price of the apartments is the target, while the size of the apartments, their location and the amenities are the features used for prediction.
Our Tutorial 103 provides a step-by-step explanation of how to create your first prediction model and deploy it for scoring of new records.
The rest of this document assumes that you have followed this tutorial.
- Running Supervised Machine Learning in DSS
- Learning task settings
- Train and validation
- Features handling
- Feature generation
- Feature reduction
Use the following steps to access supervised machine learning in DSS:
- Go to the Flow for your project
- Click on the dataset you want to use
- Select the Lab
- Create a new visual analysis
- Click on the Models tab
- Select Create first model
- Select Prediction
You can change the learning task for a model under Models > Settings > Learning task.
DSS supports three different types of prediction for three different types of targets.
- Regression is used when the target is numeric (e.g. price of the apartment).
- Two-class classification is used when the target can be one of two categories (e.g. presence or absence of a doorman).
- Multi-class classification is used when targets can be one of many categories (e.g. neighborhood of the apartment).
DSS can build predictive models for each of these kinds of learning tasks. Available options, algorithms and result screens will vary depending on the kind of learning task.
The model is optimized according to the selected measure. This measure is used for model evaluation in cross-validation (see the Train and validation panel) and hyperparameter grid search (when you specify a list of possible values in an algorithm’s settings).
For Two-class classification problems, the probability threshold for scoring the target class is optimized according to the selected scoring measure.
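The threshold optimization mentioned above can be illustrated with a small sketch. It assumes F1 as the selected measure and a fixed grid of candidate thresholds; the function names, the grid, and the toy data are hypothetical and do not describe how DSS implements the search.

```python
import numpy as np

def f1_at_threshold(y_true, proba, threshold):
    """F1 score obtained when rows with proba >= threshold are scored as positive."""
    pred = proba >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def best_threshold(y_true, proba, candidates=None):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    if candidates is None:
        # Hypothetical grid: 0.05, 0.10, ..., 0.95
        candidates = np.linspace(0.05, 0.95, 19)
    scores = [f1_at_threshold(y_true, proba, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])

# Toy labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
proba = np.array([0.12, 0.22, 0.31, 0.37, 0.41, 0.62, 0.71, 0.93])

threshold = best_threshold(y_true, proba)
print(f"best threshold: {threshold:.2f}")
```

With a different measure (for example accuracy or a cost matrix), only `f1_at_threshold` would need to change.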
When training a model, it is important to evaluate its performance on a held-out “test set” that was not used for training. DSS provides two main strategies for separating the training set from the test set.
By default, DSS randomly splits the dataset into a training and a testing set. The fraction of data used for training can be specified in DSS. 80% is a standard fraction of data to use for training.
Furthermore, depending on the engine, DSS can perform this random split on a subsample of the dataset. This is especially important for in-memory engines, such as the Scikit-learn / XGBoost engine. DSS defaults to using the first 100,000 rows of the dataset.
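As a rough illustration of the default strategy, the sketch below keeps the first 100,000 rows and then splits them at random with an 80% training ratio. The function name, its parameters, and the seed are hypothetical; this only mimics the behaviour described above on row indices.

```python
import numpy as np

def head_then_random_split(n_rows, max_rows=100_000, train_ratio=0.8, seed=0):
    """Keep the first max_rows row indices, then split them at random."""
    n_used = min(n_rows, max_rows)           # subsample: first rows only
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_used)       # random order of the kept rows
    n_train = int(round(train_ratio * n_used))
    return shuffled[:n_train], shuffled[n_train:]

# A dataset of 250,000 rows: only the first 100,000 take part in the split
train_idx, test_idx = head_then_random_split(n_rows=250_000)
print(len(train_idx), len(test_idx))
```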
A variant of this method is called “K-Fold cross test”, which DSS can also use. With k-fold cross-test, the dataset is split into n equally sized portions, known as folds. Each fold is independently used as a separate testing set, with the remaining n-1 folds used as a training set. This method strongly increases training time (roughly speaking, it multiplies it by n). However, it allows for two interesting features:
- It provides a more accurate estimation of model performance, by providing “error margins” on the performance metrics. When K-Fold cross test is enabled, all performance metrics will have tolerance information.
- Once the scores have been computed on each fold, DSS can retrain the model on 100% of the dataset’s data. This is useful if you don’t have much training data.
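The mechanics of K-Fold cross test can be sketched with NumPy alone. The example below assumes a plain least-squares model and uses the standard deviation of the per-fold scores as the "error margin"; that choice, like all the function names, is an assumption for illustration and not necessarily how DSS computes its tolerance information.

```python
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_linear(X, y):
    # Ordinary least squares with an intercept column appended
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def k_fold_cross_test(X, y, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on the other n-1 folds, score on the held-out fold
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coef = fit_linear(X[train_idx], y[train_idx])
        scores.append(r2_score(y[test_idx], predict_linear(coef, X[test_idx])))
    scores = np.array(scores)
    # Final model retrained on 100% of the data
    final_coef = fit_linear(X, y)
    return scores.mean(), scores.std(), final_coef

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

mean_r2, std_r2, final_coef = k_fold_cross_test(X, y)
print(f"R2 = {mean_r2:.3f} +/- {std_r2:.3f}")
```

The model is fitted n + 1 times in total (once per fold, plus the final refit), which is where the increase in training time comes from.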
In general, use a random split of your dataset if your data is homogeneous.
DSS also allows the user to specify explicitly which data to use as the training and testing set. If your data has a known structure, such as apartment prices from two different cities, it may be beneficial to use this structure to specify training and testing sets.
The explicit extracts can either come from a single dataset or from two different datasets. Each extract can be defined using:
- Filtering rules
- Sampling rules
In general, use an explicit extract of your dataset if your data is heterogeneous.
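As an illustration of explicit extracts defined by filtering rules, the sketch below trains on apartments from one city and tests on another. The rows, column names, and the `extract` helper are hypothetical; in DSS the rules are configured in the settings rather than written as code.

```python
# Toy dataset: apartments from two different cities
rows = [
    {"city": "New York", "size_sqft": 700, "price": 950_000},
    {"city": "New York", "size_sqft": 1200, "price": 1_800_000},
    {"city": "New York", "size_sqft": 450, "price": 600_000},
    {"city": "Boston", "size_sqft": 800, "price": 720_000},
    {"city": "Boston", "size_sqft": 1000, "price": 910_000},
]

def extract(rows, rule):
    """Return the rows matching a filtering rule (a function row -> bool)."""
    return [row for row in rows if rule(row)]

# Explicit extracts: one filtering rule per set
train_rows = extract(rows, lambda row: row["city"] == "New York")
test_rows = extract(rows, lambda row: row["city"] == "Boston")

print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```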
Each machine learning algorithm has some settings, called hyper-parameters.
For each algorithm that you select in DSS, you can ask DSS to explore several values for each parameter. For example, for a regression algorithm, you can try several values of the regularization parameter.
DSS will automatically try each specified value and only keep the best one. This process is the optimization of hyper-parameters.
In order to decide which value is the best, DSS resplits the train set and extracts a “cross validation” set. During this optimization of hyper-parameters, DSS never uses the test set, which must remain “pristine” for final evaluation of the model quality.
There are several strategies for selecting the cross-validation set.
With a simple train/validation split, the training set is split into a “real training” set and a “validation” set. For each value of the hyperparameter, DSS trains the model on the “real training” set and computes the evaluation metric on the “validation” set, keeping the value of the hyperparameter that provides the best evaluation metric.
The obvious drawback of this method is that it further restricts the size of the data on which DSS truly trains. This method also comes with some uncertainty, linked to the characteristics of the split.
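A single train/validation split of the training set can be sketched as follows. The example assumes a ridge regression solved in closed form, mean squared error as the evaluation metric, and a 20% validation ratio; all names and values are hypothetical. The test set does not appear anywhere in the selection.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    # Closed-form ridge regression (no intercept, for brevity)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def mse(X, y, coef):
    return float(np.mean((X @ coef - y) ** 2))

def select_alpha(X_train, y_train, alphas, validation_ratio=0.2, seed=0):
    """Resplit the training set and pick the alpha with the lowest validation MSE."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_train))
    n_val = int(round(validation_ratio * len(X_train)))
    val_idx, fit_idx = order[:n_val], order[n_val:]
    errors = {}
    for alpha in alphas:
        coef = fit_ridge(X_train[fit_idx], y_train[fit_idx], alpha)
        errors[alpha] = mse(X_train[val_idx], y_train[val_idx], coef)
    return min(errors, key=errors.get), errors

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

best_alpha, errors = select_alpha(X_train, y_train, alphas=[0.01, 1.0, 1000.0])
print("best alpha:", best_alpha)
```

Only 80 of the 100 training rows are used for fitting here, which illustrates the drawback noted above.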
With K-Fold cross validation, the training set is split into n equally sized portions, known as folds. For each value of the hyperparameter and each fold, DSS trains the model on the other n-1 folds and computes the evaluation metric on the remaining one. For each value of the hyperparameter, DSS keeps the average over all folds. DSS keeps the value of the hyperparameter that provides the best evaluation metric and then retrains the model with this hyperparameter value on the whole training set.
This method increases the training time (roughly by a factor of n) but makes it possible to train on the whole training set. It also decreases the uncertainty, since it provides several values for the metric.
K-Fold cross validation is a way to optimize hyperparameters on validation sets extracted from the training set. It is not to be confused with K-Fold cross test, which is used to evaluate error margins on the final scores.
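Outside DSS, the same procedure can be reproduced with scikit-learn's `GridSearchCV`. The sketch below is only an analogy under assumed settings (a `Ridge` model, three candidate values of `alpha`, 5 folds, negative mean squared error as the metric); it does not show DSS internals.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# The test set is put aside first and plays no part in the search
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

alphas = [0.01, 1.0, 100.0]
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    refit=True,  # retrain the best setting on the whole training set
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]
test_r2 = search.best_estimator_.score(X_test, y_test)
print("best alpha:", best_alpha, "test R2:", round(test_r2, 3))
```

Each candidate is fitted once per fold, then the selected one is fitted again on the full training set, which matches the extra cost described above.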
If you are using scikit-learn or XGBoost, you can provide a custom cross-validation object. This object must follow the protocol for cross-validation objects of scikit-learn.
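In scikit-learn, a cross-validation object is any object exposing `split(X, y, groups)`, which yields pairs of train and test index arrays, and `get_n_splits(X, y, groups)`. The class below is a hypothetical minimal example that uses contiguous blocks of rows as folds; how such an object is handed to DSS is not covered here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

class ContiguousBlockSplit:
    """Hypothetical splitter: each fold is a contiguous block of rows."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        for block in np.array_split(indices, self.n_splits):
            # Train on everything outside the block, validate on the block
            yield np.setdiff1d(indices, block), block

# Toy data with an exact linear relationship
X = np.arange(30, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

splits = list(ContiguousBlockSplit(n_splits=3).split(np.zeros((9, 1))))
scores = cross_val_score(LinearRegression(), X, y, cv=ContiguousBlockSplit(3))
print(scores)
```

Any scikit-learn utility that accepts a `cv` argument can consume such an object, which is a convenient way to check it before using it elsewhere.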
You can change the settings for feature generation under Models > Settings > Feature generation
DSS can compute interactions between variables, such as linear and polynomial combinations. These generated features allow linear methods, such as linear regression, to detect non-linear relationships between the variables and the target, which may improve model performance in these cases.
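The effect of polynomial feature generation on a linear model can be seen on a toy quadratic target. This sketch uses scikit-learn's `PolynomialFeatures` as a stand-in for the generation step; the data and the choice of degree 2 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Target depends on the square of the feature: no linear trend at all
x = np.linspace(-3.0, 3.0, 201).reshape(-1, 1)
y = x.ravel() ** 2

# Linear regression on the raw feature
plain = LinearRegression().fit(x, y)
r2_plain = plain.score(x, y)

# Same model, with generated features x and x^2
with_poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x, y)
r2_poly = with_poly.score(x, y)

print(f"R2 without generated features: {r2_plain:.3f}")
print(f"R2 with polynomial features:   {r2_poly:.3f}")
```

The model itself is still linear in its inputs; only the inputs have been enriched.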
You can change the settings for feature reduction under Models > Settings > Feature reduction
Feature reduction operates on the preprocessed features. It allows you to reduce the dimension of the feature space in order to regularize your model or make it more interpretable.
- Correlation with target: Only the features most correlated (Pearson) with the target will be selected. A threshold for minimum absolute correlation can be set.
- Tree-based: This will create a Random Forest model to predict the target. Only the top features according to the feature importances computed by the algorithm will be selected.
- Principal Component Analysis: The feature space dimension will be reduced using Principal Component Analysis. Only the top principal components will be selected. Note: This method will generate non-interpretable feature names as its output. The model may perform well, but it will not be interpretable.
- Lasso regression: This will create a LASSO model to predict the target, using 3-fold cross-validation to select the best value of the regularization term. Only the features with nonzero coefficients will be selected.
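The four reduction methods listed above can be approximated with NumPy and scikit-learn as follows. This is a sketch on synthetic data where only the first two features carry signal; the correlation threshold, the number of kept features, and the forest size are hypothetical values, not DSS defaults.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only features 0 and 1 influence the target
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Correlation with target: keep features whose absolute Pearson
# correlation with the target reaches a minimum threshold
correlations = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
corr_selected = np.flatnonzero(np.abs(correlations) >= 0.3).tolist()

# Tree-based: fit a random forest and keep the most important features
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
tree_selected = sorted(np.argsort(forest.feature_importances_)[::-1][:2].tolist())

# Principal Component Analysis: keep the top components
# (the resulting columns no longer map to the original feature names)
X_pca = PCA(n_components=2).fit_transform(X)

# Lasso regression: 3-fold cross-validation picks the regularization term,
# then features with nonzero coefficients are kept
lasso = LassoCV(cv=3, random_state=0).fit(X, y)
lasso_selected = np.flatnonzero(lasso.coef_ != 0).tolist()

print("correlation:", corr_selected)
print("tree-based: ", tree_selected)
print("PCA shape:  ", X_pca.shape)
print("lasso:      ", lasso_selected)
```

The first, second and fourth methods return a subset of the original columns, while PCA returns new columns, which is why its output is harder to interpret.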
You can change the settings for algorithms under Models > Settings > Algorithms
DSS supports several algorithms that can be used to train predictive models. We recommend trying several different algorithms before deciding on one particular modeling method.
The available algorithms depend on the engine. See The machine learning engines for details.
DSS cannot handle a large number of classes. We recommend that you do not try to use machine learning with more than about 50 classes.
You must ensure that all classes are detected while creating the machine learning task. Detection of possible classes is done on the analysis’s script sample. Make sure that this sample includes at least one row for each possible class. If some classes are not detected on this sample but found when fitting the algorithm, training will fail.
Furthermore, you need to ensure that all classes are present both in the train and the test set. You might need to adjust the split settings for that assertion to hold true.
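A quick check for the second constraint is to compare the classes seen in each set against the full list. The sketch below uses a hypothetical helper, `missing_classes`, and contrasts an ordered split with a stratified random split from scikit-learn. Stratification is one possible way to adjust a split, shown here as an assumption rather than as a DSS setting.

```python
from sklearn.model_selection import train_test_split

def missing_classes(all_labels, train_labels, test_labels):
    """List the classes that are absent from the train set and from the test set."""
    classes = set(all_labels)
    return {
        "train": sorted(classes - set(train_labels)),
        "test": sorted(classes - set(test_labels)),
    }

# Toy target sorted by class, with one rare class
labels = ["a"] * 50 + ["b"] * 40 + ["c"] * 10

# Ordered split (first 80 rows / last 20 rows): classes go missing
ordered = missing_classes(labels, labels[:80], labels[80:])

# Stratified random split: class proportions are preserved in both sets
train_labels, test_labels = train_test_split(
    labels, train_size=0.8, stratify=labels, random_state=0
)
stratified = missing_classes(labels, train_labels, test_labels)

print("ordered split:   ", ordered)
print("stratified split:", stratified)
```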
Note that these constraints are harder to satisfy when there is a large number of classes or when some classes are very rare.