Supervised Machine Learning

Supervised machine learning is used when you have a target variable that you want to predict. For instance, you may want to predict the price of apartments in New York City using the size of the apartments, their location and amenities in the building. In this case, the price of the apartments is the target, while the size of the apartments, their location and the amenities are the features used for prediction.

Running Supervised Learning in DSS

To conduct supervised learning in DSS, follow these steps:

  • Go to the flow
  • Click on the dataset you would like to use
  • Click on the analyse widget on the right
  • Create or reuse an analysis
  • Click on the column you would like to use as the target of the model
  • Select create predictive model

Learning Task

Note

You can change the learning task for a model under the Models > Settings > Learning task tab

DSS supports three different types of modeling for three different types of targets.

Regression is used when the target is numeric (e.g. price of the apartment).

Two-Way Classification is used when the target can be one of two categories (e.g. presence or absence of a doorman).

Multi-class Classification is used when targets can be one of many categories (e.g. neighborhood of the apartment).

We can build predictive models for each of these targets. However, different methods are available for different targets.

Algorithm                           | Regression | Two-Way Classification | Multi-class Classification
Ridge Regression                    | X          |                        |
Lasso Regression                    | X          |                        |
Ordinary Least Squares Regression   | X          |                        |
Logistic Regression                 |            | X                      | X
Random Forest                       | X          | X                      | X
Support Vector Machine              |            | X                      | X
Gradient Boosted Tree               |            | X                      | X
Decision Tree                       |            | X                      | X
Stochastic Gradient Descent         |            | X                      | X
Custom Model                        | X          | X                      | X

Train & Validation

Note

You can change the parameters for training and validating a model under the Models > Settings > Train & Validation tab

When training a model, it is important to test the performance of the model on a testing set. DSS provides two main strategies for conducting this separation between a training and validation set.

Splitting the dataset

By default, DSS randomly splits the dataset into a training and a testing set. The fraction of data used for training can be specified in DSS; 80% is a standard fraction of data to use for training.

A variant of this method is k-fold cross-validation, which DSS can also use. With k-fold cross-validation, the dataset is split into k equally sized portions, known as folds. Each fold is used in turn as the testing set, with the remaining k-1 folds used as the training set. This method increases training time; however, it allows for a more accurate estimate of model performance.

In general, use a random split of your dataset if your data is homogeneous.
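
For illustration, the same two strategies can be sketched outside DSS with scikit-learn on a synthetic dataset (a minimal example of the concepts, not of DSS internals):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split, cross_val_score

    # Synthetic regression data standing in for the apartment example
    X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

    # Simple random split: 80% for training, 20% for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
    model = Ridge().fit(X_train, y_train)
    print("hold-out R^2:", model.score(X_test, y_test))

    # 5-fold cross-validation: each fold is used once as the testing set
    scores = cross_val_score(Ridge(), X, y, cv=5)
    print("R^2 per fold:", scores)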

Explicit extracts from the dataset

DSS also allows the user to specify explicitly which data to use as the training and testing set. If your data has a known structure, such as apartment prices from two different cities, it may be beneficial to use this structure to specify training and testing sets.

In general, use an explicit extract of your dataset if your data is heterogeneous.
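
As a minimal illustration (assuming a pandas DataFrame with a hypothetical city column), an explicit extract simply selects the training and testing rows by a known attribute:

    import pandas as pd

    # Hypothetical apartment data from two cities
    df = pd.DataFrame({
        "city": ["New York", "New York", "Boston", "Boston"],
        "size_sqft": [650, 900, 700, 1100],
        "price": [3200, 4500, 2400, 3100],
    })

    train = df[df["city"] == "New York"]  # explicit training extract
    test = df[df["city"] == "Boston"]     # explicit testing extract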

Features

Note

You can change the settings for feature processing under the Models > Settings > Features tab

DSS allows users to specify pre-processing of variables before model training.

Rescaling numeric variables

Numeric features can be rescaled prior to training, which can improve model performance in some instances. Standard rescaling scales the feature to a standard deviation of one and a mean of zero. Min-max rescaling sets the minimum value of the feature to zero and the maximum to one.

Rescale numeric variables if there are large differences in scale between the features.
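
A minimal sketch of both rescaling modes with scikit-learn (illustration only; DSS applies these as part of its own preprocessing):

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    # Two features on very different scales, e.g. size (sqft) and bedrooms
    X = np.array([[650.0, 1.0], [900.0, 2.0], [1400.0, 3.0]])

    X_standard = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1
    X_minmax = MinMaxScaler().fit_transform(X)      # minimum 0, maximum 1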

Missing values

DSS has facilities for handling missing data prior to model training. First, the user must decide whether to discard rows with missing data.

Avoid discarding rows, unless missing data is extremely rare.

The user must also decide whether to treat “missing” as a regular value. Structurally missing data are those that are impossible to measure, e.g. the US state for an address in Canada. In contrast, randomly missing data are missing due to random noise.

Treat “missing” as a regular value when data is structurally missing. Impute when data is randomly missing.
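
A minimal sketch of both treatments, using scikit-learn's SimpleImputer on small made-up arrays (an illustration of the concepts, not DSS's implementation):

    import numpy as np
    from sklearn.impute import SimpleImputer

    # Randomly missing numeric data: impute (here with the mean)
    sizes = np.array([[650.0], [np.nan], [1400.0]])
    sizes_imputed = SimpleImputer(strategy="mean").fit_transform(sizes)

    # Structurally missing categorical data: treat "missing" as a regular value
    states = np.array([["NY"], [np.nan], ["MA"]], dtype=object)
    states_filled = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(states)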

Feature Generation

Note

You can change the settings for feature generation under Models > Settings > Feature Generation

DSS can also compute interactions between variables, such as linear and polynomial combinations. These generated features allow linear methods, such as linear regression, to detect non-linear relationships between the variables and the target, and may therefore improve performance.
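
For illustration, the equivalent transformation can be sketched outside DSS with scikit-learn's PolynomialFeatures (a small example, not DSS's implementation):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[2.0, 3.0], [4.0, 5.0]])  # two original features
    X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
    # Generated columns: x1, x2, x1^2, x1*x2, x2^2; a linear model fit on X_poly
    # can capture non-linear relationships between the original features and the target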

Algorithms

Note

You can change the settings for the algorithms under the Models > Settings > Algorithms tab

DSS supports several algorithms that can be used to train predictive models. We recommend trying several different algorithms before deciding on one particular modeling method.

Ordinary Least Squares

Ordinary Least Squares or Linear Least Squares is the simplest algorithm for linear regression. The target variable is computed as the sum of weighted input variables. OLS finds the appropriate weights by minimizing the cost function (ie, how ‘wrong’ the algorithm is).

OLS is very simple and provides a very “explainable” model, but:

  • it cannot automatically fit data for which the target variable is not the result of a linear combination of the input features
  • it is highly sensitive to errors in the input dataset and prone to overfitting
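
A minimal illustration of OLS outside DSS, using scikit-learn's LinearRegression on synthetic data:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)
    ols = LinearRegression().fit(X, y)
    print(ols.coef_, ols.intercept_)  # the learned weights are directly inspectable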

Ridge Regression

Ridge Regression addresses some problems of Ordinary Least Squares by imposing a penalty (or regularization term) on the weights. Ridge regression uses L2 regularization, which reduces the size of the coefficients in the model.

  • Regularization term (auto-optimized or specific values): Auto-optimization is generally faster than trying multiple values, but it does not support sparse features (like text hashing)
  • Alpha: The regularization parameter
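
A minimal sketch with scikit-learn, using Ridge for a specific alpha and RidgeCV to pick alpha automatically from a list of candidate values (an illustration of the concept, not of DSS's auto-optimization):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge, RidgeCV

    X, y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)
    ridge = Ridge(alpha=1.0).fit(X, y)                             # specific regularization value
    ridge_auto = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)  # pick alpha by cross-validation
    print(ridge_auto.alpha_)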

Lasso Regression

Lasso Regression is another linear model, using a different regularization term (L1 regularization). L1 regularization reduces the number of features included in the final model.

  • Regularization term (auto-optimized or specific values): Auto-optimization is generally faster than trying multiple values, but it does not support sparse features (like text hashing)
  • Alpha: The regularization term
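
A minimal sketch with scikit-learn's Lasso, showing how L1 regularization drives some coefficients to exactly zero (synthetic data for illustration only):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=0.5, random_state=0)
    lasso = Lasso(alpha=1.0).fit(X, y)
    print((lasso.coef_ != 0).sum(), "features kept out of", X.shape[1])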

Random Forest

A random forest is a collection of decision trees. Each decision tree is trained using a random sample of the dataset. A prediction is then made from the entire forest by averaging the predictions of the individual trees.

A random forest has three parameters that can affect performance:

  • Number of trees: DSS can automatically train trees until performance is maximized, or the user can specify a number of trees. Increasing the number of trees in a random forest does not result in overfitting.
  • Maximum depth of tree: Maximum depth of each tree in the forest. Higher values generally increase the quality of the prediction, but can lead to overfitting. High values also increase the training and prediction time. Use 0 for unlimited depth (ie, keep splitting the tree until each node contains a single target value)
  • Minimum samples per leaf: Minimum number of samples required in a single tree node to split this node. Lower values increase the quality of the prediction (by splitting the tree more), but can lead to overfitting and increased training and prediction time.
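
A minimal sketch of these parameters with scikit-learn's RandomForestClassifier (the parameter names below are scikit-learn's, used here only to illustrate the concepts):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    forest = RandomForestClassifier(
        n_estimators=100,     # number of trees
        max_depth=None,       # None behaves like unlimited depth
        min_samples_leaf=3,   # minimum samples per leaf
        random_state=0,
    ).fit(X, y)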

Gradient Boosted Tree

Gradient boosted trees are another ensemble method based on decision trees. Trees are added to the model sequentially, and each tree attempts to improve the performance of the ensemble as a whole.

The gradient boosted tree algorithm has four parameters:

  • Number of boosting stages: The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance. You can try several values by using a comma-separated list.
  • Learning rate: The learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between the learning rate and the number of boosting stages: smaller learning rates require a greater number of boosting stages.
  • Loss (deviance or exponential): The ‘deviance’ loss (the same loss as logistic regression) is used for classification with probabilistic outputs. With the ‘exponential’ loss, gradient boosting recovers the AdaBoost algorithm.
  • Maximum depth of tree: Maximum depth of the trees in the ensemble. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.
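
A minimal sketch of these parameters with scikit-learn's GradientBoostingClassifier (parameter names are scikit-learn's; synthetic data for illustration only):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    gbt = GradientBoostingClassifier(
        n_estimators=200,    # number of boosting stages
        learning_rate=0.05,  # smaller rates usually need more stages
        max_depth=3,         # maximum depth of each tree
    ).fit(X, y)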

Decision Tree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Decision trees have four parameters that can affect performance:

  • Maximum depth: The maximum depth of the tree. You can try several values by using a comma separated list.
  • Criterion (Gini or Entropy): The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
  • Minimum samples per leaf: Minimum number of samples required to be at a leaf node. You can try several values by using a comma separated list.
  • Split strategy (Best or random): The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
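
A minimal sketch with scikit-learn's DecisionTreeClassifier (parameter names are scikit-learn's; synthetic data for illustration only):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    tree = DecisionTreeClassifier(
        max_depth=5,
        criterion="gini",    # or "entropy"
        min_samples_leaf=3,
        splitter="best",     # or "random"
        random_state=0,
    ).fit(X, y)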

Logistic regression

Despite its name, Logistic Regression is a classification algorithm using a linear model (ie, it computes the target as a linear combination of input features). Logistic Regression applies the logistic (sigmoid) function to this linear combination and minimizes the corresponding cost function (the log loss), which makes it appropriate for classification. A simple Logistic Regression model is prone to overfitting and sensitive to errors in the input dataset. To address these issues, it is possible to apply a penalty (or regularization term) to the weights.

Logistic regression has two parameters:

  • Regularization (L1 or L2 regularization): L1 regularization reduces the number of features that are used in the model. L2 regularization reduces the size of the coefficient for each feature.
  • C: Penalty parameter C of the error term. A low value of C will generate a smoother decision boundary (higher bias) while a high value aims at correctly classifying all training examples, at the risk of overfitting (high variance). (C corresponds to the inverse of a regularization parameter). You can try several values of C by using a comma-separated list.
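
A minimal sketch with scikit-learn's LogisticRegression (note that in scikit-learn, L1 regularization requires a compatible solver such as liblinear):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    logreg_l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
    logreg_l1 = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X, y)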

Support Vector Machine

Support Vector Machine is a powerful ‘black-box’ algorithm for classification. Through the use of kernel functions, it can learn complex non-linear decision boundaries (ie, when it is not possible to compute the target as a linear combination of input features). SVM is effective with a large number of features. However, this algorithm is generally slower than others.

  • Kernel (linear, RBF, polynomial, sigmoid): The kernel function used for computing the similarity of samples. Try several to see which works the best.
  • C: Penalty parameter C of the error term. A low value of C will generate a smoother decision boundary (higher bias) while a high value aims at correctly classifying all training examples, at the risk of overfitting (high variance). (C corresponds to the inverse of a regularization parameter). You can try several values of C by using a comma-separated list.
  • Gamma: Kernel coefficient for the RBF, polynomial and sigmoid kernels. Gamma defines the ‘influence’ of each training example in the feature space. A low value of gamma means that each example has ‘far-reaching influence’, while a high value means that each example only has close-range influence. If no value is specified (or 0.0), then 1/nb_features is used. You can try several values of Gamma by using a comma-separated list.
  • Tolerance: Tolerance for stopping criterion.
  • Maximum number of iterations: Number of iterations when fitting the model. -1 can be used to specify no limit.
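
A minimal sketch with scikit-learn's SVC, mapping the parameters above to their scikit-learn equivalents (illustration only):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    svm = SVC(
        kernel="rbf",    # or "linear", "poly", "sigmoid"
        C=1.0,
        gamma="scale",   # or an explicit float value
        tol=1e-3,        # tolerance for the stopping criterion
        max_iter=-1,     # -1 means no iteration limit
    ).fit(X, y)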

Stochastic Gradient Descent

SGD is a family of algorithms that reuse concepts from Support Vector Machines and Logistic Regression. SGD uses an optimized method to minimize the cost (or loss) function, making it particularly suitable for large datasets (or datasets with a large number of features).

  • Loss function (logit or modified Huber): Selecting ‘logit’ loss will make the SGD behave like a Logistic Regression. Selecting ‘modified Huber’ loss will make the SGD behave quite like a Support Vector Machine.
  • Iterations: Number of iterations over the data
  • Penalty (L1, L2 or elastic net): L1 and L2 regularization are similar to those for linear and logistic regression. Elastic net regularization is a combination of L1 and L2 regularization.
  • Alpha: Regularization parameter. A high value of alpha (ie, more regularization) will generate a smoother decision boundary (higher bias) while a lower value (less regularization) aims at correctly classifying all training examples, at the risk of overfitting (high variance). You can try several values of alpha by using a comma-separated list.
  • L1 ratio: Elastic net regularization mixes both L1 and L2 regularization. This ratio controls the proportion of L1 in the mix (ie, 0 corresponds to L2-only, 1 corresponds to L1-only). Defaults to 0.15 (85% L2, 15% L1).
  • Parallelism: Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets.
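
A minimal sketch with scikit-learn's SGDClassifier (parameter names are scikit-learn's; in recent scikit-learn versions the logit loss is spelled "log_loss"):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
    sgd = SGDClassifier(
        loss="modified_huber",  # SVM-like behaviour; "log_loss" behaves like logistic regression
        penalty="elasticnet",
        alpha=1e-4,
        l1_ratio=0.15,          # proportion of L1 in the elastic net mix
        max_iter=1000,
        n_jobs=-1,              # parallel training for multi-class problems
    ).fit(X, y)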

Custom Models

The user can also specify a custom model using Python. The model should expose fit and predict methods, following the convention used by scikit-learn models.
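
A minimal sketch of a scikit-learn-style estimator that could serve as a starting point (the class below is a made-up example; check the DSS documentation for the exact interface your DSS version expects):

    import numpy as np

    class MeanRegressor:
        """Toy custom model: predicts the mean of the training target for every row."""

        def fit(self, X, y):
            self.mean_ = np.mean(y)
            return self

        def predict(self, X):
            return np.full(len(X), self.mean_)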