Scikit-learn / XGBoost engine

This engine provides in-memory processing.

The train and test sets must fit in memory. Use the sampling settings if needed.

Most algorithms are based on the scikit-learn machine learning library.

Prediction algorithms

Prediction with this engine supports the following algorithms.

(Regression) Ordinary Least Squares

Ordinary Least Squares or Linear Least Squares is the simplest algorithm for linear regression. The target variable is computed as the sum of weighted input variables. OLS finds the appropriate weights by minimizing the cost function (ie, how ‘wrong’ the algorithm is).

OLS is very simple and provides a very “explainable” model, but:

  • it cannot automatically fit data for which the target variable is not the result of a linear combination of input features
  • it is highly sensitive to errors in the input dataset and prone to overfitting
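
As a rough illustration of the weighted-sum idea, the sketch below fits scikit-learn's LinearRegression on synthetic data whose target really is a linear combination of the inputs. The data and the weights (3, -2, intercept 1) are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Synthetic target: a true linear combination of the inputs plus a little noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.01, size=200)

ols = LinearRegression().fit(X, y)
# The learned weights should be close to [3, -2] and the intercept close to 1.
print(ols.coef_, ols.intercept_)
```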

(Regression) Ridge Regression

Ridge Regression addresses some problems of Ordinary Least Squares by imposing a penalty (or regularization term) on the weights. Ridge regression uses an L2 regularization. L2 regularization reduces the size of the coefficients in the model.

  • Regularization term (auto-optimized or specific values): Auto-optimization is generally faster than trying multiple values, but it does not support sparse features (like text hashing)
  • Alpha: The regularization parameter
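
The two modes above can be sketched with scikit-learn directly, assuming that a specific value corresponds to Ridge(alpha=...) and that auto-optimization behaves like RidgeCV. That mapping is an assumption about the engine, not something this page states. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)                    # one specific alpha
ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)  # alpha picked by cross-validation

# The L2 penalty shrinks the coefficient vector compared with plain OLS.
ols_norm = float(np.linalg.norm(ols.coef_))
ridge_norm = float(np.linalg.norm(ridge.coef_))
print(ols_norm, ridge_norm, ridge_cv.alpha_)
```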

(Regression) Lasso Regression

Lasso Regression is another linear model, using a different regularization term (L1 regularization). L1 regularization reduces the number of features included in the final model.

  • Regularization term (auto-optimized or specific values): Auto-optimization is generally faster than trying multiple values, but it does not support sparse features (like text hashing)
  • Alpha: The regularization term
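
To see the feature-selection effect of L1 regularization, the following sketch fits scikit-learn's Lasso on synthetic data where only two of ten features carry signal. The alpha value is arbitrary and chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features influence the target.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
# Coefficients driven exactly to zero correspond to features dropped from the model.
n_used = int(np.sum(lasso.coef_ != 0))
print(lasso.coef_)
print("features kept:", n_used)
```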

(Classification) Logistic regression

Despite its name, Logistic Regression is a classification algorithm using a linear model (ie, it computes the target feature as a linear combination of input features). Logistic Regression minimizes a specific cost function (the log loss, derived from the logistic, or sigmoid, function), which makes it appropriate for classification. A simple Logistic regression algorithm is prone to overfitting and sensitive to errors in the input dataset. To address these issues, it is possible to apply a penalty (or regularization term) to the weights.

Logistic regression has two parameters:

  • Regularization (L1 or L2 regularization): L1 regularization reduces the number of features that are used in the model. L2 regularization reduces the size of the coefficient for each feature.
  • C: Penalty parameter C of the error term. A low value of C will generate a smoother decision boundary (higher bias) while a high value aims at correctly classifying all training examples, at the risk of overfitting (high variance). (C corresponds to the inverse of a regularization parameter). You can try several values of C by using a comma-separated list.
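
The effect of C can be sketched by training scikit-learn's LogisticRegression (default L2 penalty) for several values, which is roughly what a comma-separated list would try. The dataset and the three values of C are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # Low C means strong regularization, hence smaller weights.
    norms[C] = float(np.linalg.norm(clf.coef_))
print(norms)
```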

(Regression & classification) Random Forest

A random forest is a collection of decision trees. Each decision tree is trained using a random sample of the dataset. Then, a prediction is made from the entire forest by averaging the prediction of the trees.

A random forest has three parameters that can affect performance:

  • Number of trees: DSS can automatically train trees until performance is maximized, or the user can specify a number of trees. Increasing the number of trees in a random forest does not result in overfitting.
  • Maximum depth of tree: Maximum depth of each tree in the forest. Higher values generally increase the quality of the prediction, but can lead to overfitting. High values also increase the training and prediction time. Use 0 for unlimited depth (ie, keep splitting the tree until each node contains a single target value)
  • Minimum samples per leaf: Minimum number of samples required in a single tree node to split this node. Lower values increase the quality of the prediction (by splitting the tree more), but can lead to overfitting and increased training and prediction time.
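
A minimal sketch of these three parameters with scikit-learn's RandomForestClassifier, assuming they map to n_estimators, max_depth and min_samples_leaf. Note that scikit-learn itself uses max_depth=None, not 0, for unlimited depth. The last lines check that the forest's class probabilities are the average of the individual trees' outputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100, max_depth=8, min_samples_leaf=2, random_state=0
).fit(X_train, y_train)

# Average the per-tree probabilities by hand and compare with the forest output.
avg_proba = np.mean([tree.predict_proba(X_test) for tree in forest.estimators_], axis=0)
same = np.allclose(avg_proba, forest.predict_proba(X_test))
print(len(forest.estimators_), forest.score(X_test, y_test), same)
```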

(Regression & classification) Gradient Boosted Trees

Gradient boosted trees are another ensemble method based on decision trees. Trees are added to the model sequentially, and each tree attempts to improve the performance of the ensemble as a whole.

The gradient boosted tree algorithm has four parameters:

  • Number of boosting stages: The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance. You can try several values by using a comma-separated list.
  • Learning rate: Learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning rate and number of boosting stages. Smaller learning rates require a greater number of boosting stages.
  • Loss (deviance or exponential): Deviance refers to deviance (= logistic regression) for classification with probabilistic outputs. For loss ‘exponential’, gradient boosting recovers the AdaBoost algorithm.
  • Maximum depth of tree: Maximum depth of the trees in the ensemble. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.
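
The sequential nature of boosting can be sketched with scikit-learn's GradientBoostingClassifier, assuming the parameters above map to n_estimators, learning_rate and max_depth. The loss is left at its default because its accepted names differ between scikit-learn versions. staged_predict exposes the ensemble's predictions after each boosting stage.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbt = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=0
).fit(X_train, y_train)

# Test accuracy after each boosting stage (one entry per tree added).
staged = [float((pred == y_test).mean()) for pred in gbt.staged_predict(X_test)]
print(len(staged), staged[0], staged[-1])
```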

This algorithm also provides the ability to visualize partial dependency plots of your features.

(Regression & classification) XGBoost

XGBoost uses a specific library instead of scikit-learn.

It implements a variant of the gradient boosting algorithm.

XGBoost provides very fast learning, and the ability to “early stop” the growing of new trees when the model stops progressing.

(Regression & classification) Decision Tree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Decision trees have four parameters that can affect performance:

  • Maximum depth: The maximum depth of the tree. You can try several values by using a comma separated list.
  • Criterion (Gini or Entropy): The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
  • Minimum samples per leaf: Minimum number of samples required to be at a leaf node. You can try several values by using a comma separated list.
  • Split strategy (Best or random): The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
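
The "simple decision rules" are easy to inspect on a small tree. This sketch uses scikit-learn's DecisionTreeClassifier on the Iris dataset, with the four parameters assumed to map to max_depth, criterion, min_samples_leaf and splitter. export_text prints the learned rules.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(
    max_depth=2, criterion="gini", min_samples_leaf=5, splitter="best", random_state=0
).fit(iris.data, iris.target)

# Human-readable if/else rules learned from the features.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```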

(Regression & Classification) Support Vector Machine

Support Vector Machine is a powerful ‘black-box’ algorithm for classification. Through the use of kernel functions, it can learn complex non-linear decision boundaries (ie, when it is not possible to compute the target as a linear combination of input features). SVM is effective with a large number of features. However, this algorithm is generally slower than others.

  • Kernel (linear, RBF, polynomial, sigmoid): The kernel function used for computing the similarity of samples. Try several to see which works the best.
  • C: Penalty parameter C of the error term. A low value of C will generate a smoother decision boundary (higher bias) while a high value aims at correctly classifying all training examples, at the risk of overfitting (high variance). (C corresponds to the inverse of a regularization parameter). You can try several values of C by using a comma-separated list.
  • Gamma: Kernel coefficient for RBF, polynomial and sigmoid kernels. Gamma defines the ‘influence’ of each training example in the features space. A low value of gamma means that each example has ‘far-reaching influence’, while a high value means that each example only has close-range influence. If no value is specified (or 0.0), then 1/nb_features is used. You can try several values of Gamma by using a comma-separated list.
  • Tolerance: Tolerance for stopping criterion.
  • Maximum number of iterations: Number of iterations when fitting the model. -1 can be used to specify no limit.
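
To illustrate why the kernel matters, the sketch below compares a linear and an RBF kernel with scikit-learn's SVC on a dataset that is not linearly separable. The values of C and gamma are arbitrary choices for the example, and the inputs are standardized first.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.1, random_state=0)

scores = {}
for kernel in ["linear", "rbf"]:
    # gamma is ignored by the linear kernel; max_iter=-1 means no iteration limit.
    svm = make_pipeline(
        StandardScaler(),
        SVC(kernel=kernel, C=1.0, gamma=2.0, tol=1e-3, max_iter=-1),
    )
    scores[kernel] = svm.fit(X, y).score(X, y)
print(scores)
```

On this data the RBF kernel can follow the curved boundary between the two classes, which a linear kernel cannot.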

(Regression & Classification) Stochastic Gradient Descent

SGD is a family of algorithms that reuse concepts from Support Vector Machines and Logistic Regression. SGD uses an optimized method to minimize the cost (or loss) function, making it particularly suitable for large datasets (or datasets with a large number of features).

  • Loss function (logit or modified Huber): Selecting ‘logit’ loss will make the SGD behave like a Logistic Regression. Enabling ‘modified huber’ loss will make the SGD behave quite like a Support Vector Machine.
  • Iterations: number of iterations on the data
  • Penalty (L1, L2 or elastic net): L1 and L2 regularization are similar to those for linear and logistic regression. Elastic net regularization is a combination of L1 and L2 regularization.
  • Alpha: Regularization parameter. A high value of alpha (ie, more regularization) will generate a smoother decision boundary (higher bias) while a lower value (less regularization) aims at correctly classifying all training examples, at the risk of overfitting (high variance). You can try several values of alpha by using a comma-separated list.
  • L1 ratio: ElasticNet regularization mixes both L1 and L2 regularization. This ratio controls the proportion of L1 in the mix (ie: 0 corresponds to L2-only, 1 corresponds to L1-only). Defaults to 0.15 (85% L2, 15% L1).
  • Parallelism: Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets.
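
A minimal sketch of these settings with scikit-learn's SGDClassifier, assuming they map to loss, max_iter, penalty, alpha and l1_ratio. The modified Huber loss is used here because its name is stable across scikit-learn versions. Inputs are standardized, which SGD generally benefits from.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

sgd = make_pipeline(
    StandardScaler(),
    SGDClassifier(
        loss="modified_huber",   # SVM-like behaviour
        penalty="elasticnet",    # mix of L1 and L2
        alpha=1e-4,              # regularization strength
        l1_ratio=0.15,           # 15% L1, 85% L2
        max_iter=1000,
        tol=1e-3,
        random_state=0,
    ),
)
sgd.fit(X, y)
train_accuracy = sgd.score(X, y)
print(train_accuracy)
```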

(Regression & Classification) Custom Models

You can also specify custom models using Python.

Your custom models should follow the scikit-learn predictor protocol with proper fit and predict methods.
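
As a rough sketch of what that protocol looks like, here is a deliberately trivial classifier exposing fit and predict. The class name MajorityClassifier is hypothetical and the model is not useful in itself. It only shows the shape a custom model is expected to have.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical custom model: always predicts the most frequent training class."""

    def fit(self, X, y):
        # Learn the set of classes and remember the most frequent one.
        self.classes_, counts = np.unique(y, return_counts=True)
        self.majority_ = self.classes_[np.argmax(counts)]
        return self

    def predict(self, X):
        # One prediction per input row.
        return np.full(len(X), self.majority_)


clf = MajorityClassifier().fit([[0], [1], [2]], [1, 1, 0])
predictions = clf.predict([[5], [6]])
print(predictions)  # [1 1]
```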

Code samples are available for custom models.

Clustering algorithms

K-means

The k-means algorithm clusters data by trying to separate samples in n groups, minimizing a criterion known as the ‘inertia’ of the groups.

In k-means clustering, you must specify the number of desired clusters. You can try multiple values by providing a comma-separated list.
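
Trying several cluster counts can be sketched with scikit-learn's KMeans, whose inertia_ attribute holds the criterion being minimized. The dataset and the candidate values are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

inertias = {}
for k in [2, 3, 4]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia: sum of squared distances from each sample to its cluster center.
    inertias[k] = float(km.inertia_)
print(inertias)
```

Inertia keeps decreasing as clusters are added, so it cannot be used alone to choose the number of clusters.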

Mini-batch K-means

The Mini-Batch k-means is a variant of the k-means algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function.

In mini-batch k-means clustering, you must specify the number of desired clusters. You can try multiple values by providing a comma-separated list.
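
For comparison, this sketch runs scikit-learn's KMeans and MiniBatchKMeans on the same synthetic data. The batch size is an arbitrary choice. The mini-batch variant typically reaches a similar, often slightly higher, inertia in less time.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=4, random_state=0)

full = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
# Each update step uses a random batch of 256 samples instead of the full dataset.
mini = MiniBatchKMeans(n_clusters=4, batch_size=256, n_init=3, random_state=0).fit(X)
print(full.inertia_, mini.inertia_)
```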

Gaussian Mixture

The Gaussian Mixture Model models the distribution of the data as a “mixture” of several populations, each of which can be described by a single multivariate normal distribution.

An example of such a distribution is that of heights among adults, which is described by the mixture of two distributions: the heights of men, and those of women, each of which is approximately described by a normal distribution.
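
The two-population example can be sketched with scikit-learn's GaussianMixture. The numbers below (means of 178 and 163 cm, standard deviation of 5 cm) are synthetic values chosen for illustration, not real statistics.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic populations drawn from different normal distributions.
group_a = rng.normal(loc=178.0, scale=5.0, size=2000)
group_b = rng.normal(loc=163.0, scale=5.0, size=2000)
heights = np.concatenate([group_a, group_b]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(heights)
# The fitted component means should land near the two population means.
means = np.sort(gmm.means_.ravel())
print(means, gmm.weights_)
```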

Ward Hierarchical Clustering

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.

In Ward hierarchical clustering, you must specify the number of desired clusters. You can try multiple values by providing a comma-separated list.
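
A minimal sketch using scikit-learn's AgglomerativeClustering with Ward linkage on synthetic data. The children_ attribute records the successive merges that make up the tree: with n samples, a full tree contains n - 1 merges.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

ward = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
# Cluster sizes after cutting the tree at 3 clusters.
print(np.bincount(ward.labels_))
# Each row of children_ is one merge of two nodes in the hierarchy.
print(ward.children_.shape)
```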

Spectral Clustering

The spectral clustering algorithm uses the graph distance in the nearest neighbor graph. It does a low-dimensional embedding of the affinity matrix between samples, followed by a k-means in the low-dimensional space.

There are two parameters that you can modify in spectral clustering:

  • Number of clusters: You can try several values by using a comma-separated list
  • Affinity measure: The method used to compute the distance between samples. Possible options are nearest neighbors, RBF kernel and polynomial kernel.
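
A minimal sketch with scikit-learn's SpectralClustering, using the nearest neighbors affinity on two interleaved half-moons. The number of neighbors is an arbitrary choice for the example. scikit-learn may warn that the neighbor graph is not fully connected on such data.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

spectral = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
)
# Embeds the affinity graph in a low-dimensional space, then runs k-means there.
labels = spectral.fit_predict(X)
print(set(labels.tolist()))
```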

DBSCAN

The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. Numerical features should use standard rescaling.

There are two parameters that you can modify in DBSCAN:

  • Epsilon: Maximum distance to consider two samples in the same neighborhood. You can try several values by using a comma-separated list
  • Min. Sample ratio: Minimum ratio of records to form a cluster
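
The sketch below applies scikit-learn's DBSCAN after standard rescaling. scikit-learn expects an absolute min_samples count, so the ratio is converted by multiplying it by the number of records. That conversion and the 0.02 ratio are assumptions made for the example, not the engine's documented behaviour.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

min_sample_ratio = 0.02  # hypothetical ratio of records needed to form a cluster
min_samples = max(1, int(min_sample_ratio * len(X_scaled)))

db = DBSCAN(eps=0.3, min_samples=min_samples).fit(X_scaled)
# The label -1 marks noise points that belong to no cluster.
n_clusters = len(set(db.labels_.tolist()) - {-1})
n_noise = int(np.sum(db.labels_ == -1))
print(n_clusters, n_noise)
```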

Custom Models

You can also specify custom models using Python.

Your custom models should follow the scikit-learn predictor protocol with proper fit and fit_predict methods.

A specified number of clusters can also be passed to the model through the interface.
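
As a rough sketch, a custom clustering model could look like the class below. The name QuantileClusterer is hypothetical, and so is the assumption that the number of clusters reaches the model as a constructor argument called n_clusters. The toy model simply splits samples into quantile groups of their first feature.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin


class QuantileClusterer(ClusterMixin, BaseEstimator):
    """Hypothetical custom model: groups samples by quantiles of the first feature."""

    def __init__(self, n_clusters=3):
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        first = np.asarray(X, dtype=float)[:, 0]
        # Interior quantile cut points, e.g. 1/3 and 2/3 for three clusters.
        cuts = np.linspace(0, 1, self.n_clusters + 1)[1:-1]
        self.edges_ = np.quantile(first, cuts)
        self.labels_ = np.digitize(first, self.edges_)
        return self

    def fit_predict(self, X, y=None):
        return self.fit(X).labels_


X = np.arange(12.0).reshape(-1, 1)
labels = QuantileClusterer(n_clusters=3).fit_predict(X)
print(labels)  # [0 0 0 0 1 1 1 1 2 2 2 2]
```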

Code samples are available for custom models.