Scikit-learn / XGBoost engine

Most algorithms are based on the scikit-learn or XGBoost machine learning libraries.

This engine provides in-memory processing. The train and test sets must fit in memory. Use the sampling settings if needed.

Prediction algorithms

Prediction with this engine supports the following algorithms.

(Regression) Ordinary Least Squares

Ordinary Least Squares, or Linear Least Squares, is the simplest algorithm for linear regression. The target variable is computed as a weighted sum of the input variables. OLS finds the appropriate weights by minimizing the cost function (ie, how ‘wrong’ the algorithm is); for OLS, this is the sum of squared errors.

OLS is very simple and provides a very “explainable” model, but:

  • it cannot automatically fit data for which the target variable is not the result of a linear combination of input features
  • it is highly sensitive to errors in the input dataset and prone to overfitting

Parameters:

  • Parallelism: Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets. (-1 means ‘all cores’)
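
For illustration, here is a minimal sketch of what an OLS fit looks like with scikit-learn’s LinearRegression (presumably the class this engine relies on); the synthetic data and values are placeholders, not what the engine generates internally.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression

    # Synthetic data standing in for a prepared train set
    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

    ols = LinearRegression(n_jobs=-1)   # n_jobs=-1 means 'all cores', like the Parallelism setting
    ols.fit(X, y)
    print(ols.coef_, ols.intercept_)    # the learned weights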

(Regression) Ridge Regression

Ridge Regression addresses some problems of Ordinary Least Squares by imposing a penalty (or regularization term) on the weights. Ridge regression uses L2 regularization, which reduces the size of the coefficients in the model.

Parameters:

  • Regularization term (auto-optimized or specific values): Auto-optimization is generally faster than trying multiple values, but it does not support sparse features (like text hashing)
  • Alpha: The regularization term. You can try multiple values by providing a comma-separated list. This increases the training time.
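
As a sketch, trying several alpha values roughly corresponds to cross-validating scikit-learn’s Ridge over an alpha grid; the data and grid below are illustrative only.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge, RidgeCV

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

    # Auto-optimization over an explicit alpha grid (akin to a comma-separated list of values)
    ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
    print(ridge_cv.alpha_)              # the retained regularization term

    # A single fixed regularization term
    ridge = Ridge(alpha=1.0).fit(X, y)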

(Regression) Lasso Regression

Lasso Regression is another linear model, using a different regularization term (L1 regularization). L1 regularization reduces the number of features included in the final model.

Parameters:

  • Regularization term (auto-optimized or specific values): Auto-optimization is generally faster than trying multiple values, but it does not support sparse features (like text hashing)
  • Alpha: The regularization term. You can try multiple values by providing a comma-separated list. This increases the training time.
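
The sketch below (synthetic data, arbitrary alpha) shows the practical effect of L1 regularization with scikit-learn’s Lasso: some coefficients are driven exactly to zero, effectively removing those features.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # Only 3 of the 10 features are informative in this synthetic dataset
    X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                           noise=5.0, random_state=0)

    lasso = Lasso(alpha=1.0).fit(X, y)
    print(lasso.coef_)                  # zeros mark features dropped from the model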

(Classification) Logistic regression

Despite its name, Logistic Regression is a classification algorithm using a linear model (ie, it computes the target feature as a linear combination of input features). The linear combination is passed through the logistic (or sigmoid) function and the corresponding cost function is minimized, which makes it appropriate for classification. A simple Logistic Regression algorithm is prone to overfitting and sensitive to errors in the input dataset. To address these issues, it is possible to apply a penalty (or regularization term) to the weights.

Logistic regression has two parameters:

  • Regularization (L1 or L2 regularization): L1 regularization reduces the number of features that are used in the model. L2 regularization reduces the size of the coefficient for each feature.
  • C: Penalty parameter C of the error term. A low value of C will generate a smoother decision boundary (higher bias) while a high value aims at correctly classifying all training examples, at the risk of overfitting (high variance). (C corresponds to the inverse of a regularization parameter). You can try several values of C by using a comma-separated list.
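
A minimal sketch with scikit-learn’s LogisticRegression on synthetic data; penalty and C map to the two parameters above (note that L1 regularization requires a compatible solver such as ‘liblinear’ or ‘saga’).

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    clf = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=1000).fit(X, y)
    print(clf.predict_proba(X[:5]))     # class probabilities for the first rows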

(Regression & Classification) Random Forests

Decision tree classification is a simple algorithm which builds a decision tree. Each node of the decision tree includes a condition on one of the input features.

A Random Forest regressor is made of many decision trees. When predicting a new record, each tree makes its own prediction and “votes” for the final answer of the forest. The forest then averages the individual trees’ answers. When “growing” (ie, training) the forest:

  • for each tree, a random sample of the training set is used;
  • for each decision point in the tree, a random subset of the input features is considered.

Random Forests generally provide good results, at the expense of “explainability” of the model.

Parameters:

  • Number of trees: Number of trees in the forest. Increasing the number of trees in a random forest does not result in overfitting. You can try multiple values by providing a comma-separated list. This increases the training time.
  • Feature sampling strategy: Adjusts the number of features to sample at each split.
    • Automatic will select 30% of the features.
    • Square root and Logarithm will select the square root or base 2 logarithm of the number of features respectively
    • Fixed number will select the given number of features
    • Fixed proportion will select the given proportion of features
  • Maximum depth of tree: Maximum depth of each tree in the forest. Higher values generally increase the quality of the prediction, but can lead to overfitting. High values also increase the training and prediction time. Use 0 for unlimited depth (ie, keep splitting the tree until each node contains a single target value)
  • Minimum samples per leaf: Minimum number of samples required in a single tree node to split this node. Lower values increase the quality of the prediction (by splitting the tree more), but can lead to overfitting and increased training and prediction time.
  • Parallelism: Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets.
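
For illustration, here is how these parameters roughly map onto scikit-learn’s RandomForestClassifier (synthetic data; the exact mapping used by the engine may differ).

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=100,      # Number of trees
        max_features=0.3,      # "Automatic" strategy: 30% of the features at each split
        max_depth=None,        # unlimited depth (the setting's 0)
        min_samples_leaf=1,    # Minimum samples per leaf
        n_jobs=-1,             # Parallelism: all cores
        random_state=0,
    ).fit(X, y)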

(Regression & Classification) Gradient Boosted Trees

Gradient boosted trees are another ensemble method based on decision trees. Trees are added to the model sequentially, and each tree attempts to improve the performance of the ensemble as a whole. The advantages of GBRT are:

  • Natural handling of data of mixed type (= heterogeneous features)
  • Predictive power
  • Robustness to outliers in output space (via robust loss functions)

Please note that you may face scalability issues: due to the sequential nature of boosting, it can hardly be parallelized.

The gradient boosted tree algorithm has four parameters:

  • Number of boosting stages: The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance. You can try multiple values by providing a comma-separated list. This increases the training time.

  • Learning rate: Learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning rate and number of boosting stages. Smaller learning rates require a greater number of boosting stages. You can try multiple values by providing a comma-separated list. This increases the training time.

  • Loss: The available loss functions depend upon whether this is a classification or regression problem.

    • Classification: Deviance refers to deviance (equivalent to logistic regression) for classification with probabilistic outputs. For exponential loss, gradient boosting recovers the AdaBoost algorithm.
    • Regression: Choose from least squares, least absolute deviation, or Huber. Huber is a combination of Least Squares and Least Absolute Deviation.
  • Maximum depth of tree: Maximum depth of the trees in the ensemble. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables. You can try multiple values by providing a comma-separated list. This increases the training time.

This algorithm also provides the ability to visualize partial dependency plots of your features.
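
A hedged sketch with scikit-learn’s GradientBoostingRegressor on synthetic data (loss names vary across scikit-learn versions); the last line shows a partial dependence plot, which requires matplotlib.

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.inspection import PartialDependenceDisplay

    X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)

    gbt = GradientBoostingRegressor(
        n_estimators=100,      # Number of boosting stages
        learning_rate=0.1,     # shrinks the contribution of each tree
        loss="huber",          # robust mix of least squares and least absolute deviation
        max_depth=3,           # Maximum depth of tree
        random_state=0,
    ).fit(X, y)

    # Partial dependence of the prediction on the first two features
    PartialDependenceDisplay.from_estimator(gbt, X, features=[0, 1])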

(Regression & Classification) XGBoost

XGBoost uses its own library rather than scikit-learn.

XGBoost is an advanced gradient boosted tree algorithm. It has support for parallel processing, regularization and early stopping, which makes it a very fast, scalable and accurate algorithm.

Parameters:

  • Maximum number of trees: XGBoost has an early stopping mechanism, so the exact number of trees will be optimized. A high number of actual trees will increase the training and prediction time. Typical values: 100 - 10000
  • Early stopping: Use XGBoost’s built-in early stop mechanism so the exact number of trees will be optimized. The cross-validation scheme defined in the Train & validation tab will be used.
  • Early stopping rounds: The optimizer stops if the loss never decreases for this consecutive number of iterations. Typical values: 1 - 100
  • Maximum depth of tree: Maximum depth of each tree. High values can increase the quality of the prediction, but can lead to overfitting. Typical values: 3 - 10. You can try multiple values by providing a comma-separated list. This increases the training time.
  • Learning rate: Lower values slow down convergence and can make the model more robust. Typical values: 0.01 - 0.3. You can try multiple values by providing a comma-separated list. This increases the training time.
  • L2 regularization: L2 regularization reduces the size of the coefficient for each feature. You can try multiple values by providing a comma-separated list. This increases the training time.
  • L1 regularization: In addition to reducing overfitting, L1 regularization may improve scoring speed for very high dimensional datasets. You can try multiple values by providing a comma-separated list. This increases the training time.
  • Gamma: Minimum loss reduction to split a leaf. You can try multiple values by providing a comma-separated list. This increases the training time.
  • Minimum child weight: Minimum sum of weights (hessian) in a node. High values can prevent overfitting by keeping the model from learning highly specific cases. Smaller values allow leaf nodes to match a small set of rows, which can be relevant for highly imbalanced sets. You can try multiple values by providing a comma-separated list. This increases the training time.
  • Subsample: Subsample ratio for the data to be used in each tree. Low values can prevent overfitting but can make specific cases harder to learn. Typical values: 0.5 - 1. You can try multiple values by providing a comma-separated list. This increases the training time.
  • Colsample by tree: Fraction of the features to be used in each tree. Typical values: 0.5-1. You can try multiple values by providing a comma-separated list. This increases the training time.
  • Replace missing values:
  • Parallelism: Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets. (-1 means “all cores”)
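
The sketch below shows how these parameters correspond to the xgboost Python package’s XGBClassifier, with early stopping against a validation split; the data is synthetic and, depending on the xgboost version, early_stopping_rounds is passed to the constructor or to fit().

    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    clf = xgb.XGBClassifier(
        n_estimators=1000,         # Maximum number of trees
        max_depth=6,               # Maximum depth of tree
        learning_rate=0.1,
        reg_lambda=1.0,            # L2 regularization
        reg_alpha=0.0,             # L1 regularization
        gamma=0.0,                 # minimum loss reduction to split a leaf
        min_child_weight=1,
        subsample=0.8,
        colsample_bytree=0.8,
        n_jobs=-1,                 # Parallelism
        early_stopping_rounds=50,  # in older xgboost versions, pass this to fit() instead
    )
    clf.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)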

(Regression & Classification) Decision Tree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Parameters:

  • Maximum depth: The maximum depth of the tree. You can try several values by using a comma separated list. This increases the training time.
  • Criterion (Gini or Entropy): The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. This applies only to classification problems.
  • Minimum samples per leaf: Minimum number of samples required to be at a leaf node. You can try several values by using a comma separated list. This increases the training time.
  • Split strategy (Best or random). The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
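
A minimal sketch with scikit-learn’s DecisionTreeClassifier on synthetic data, showing how the four parameters above are usually expressed.

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    tree = DecisionTreeClassifier(
        max_depth=5,            # Maximum depth
        criterion="gini",       # or "entropy" (classification only)
        min_samples_leaf=3,     # Minimum samples per leaf
        splitter="best",        # Split strategy: "best" or "random"
        random_state=0,
    ).fit(X, y)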

(Regression & Classification) Support Vector Machine

Support Vector Machine is a powerful ‘black-box’ algorithm for classification. Through the use of kernel functions, it can learn complex non-linear decision boundaries (ie, when it is not possible to compute the target as a linear combination of input features). SVM is effective with a large number of features. However, this algorithm is generally slower than others.

Parameters:

  • Kernel (linear, RBF, polynomial, sigmoid): The kernel function used for computing the similarity of samples. Try several to see which works the best.
  • C: Penalty parameter C of the error term. A low value of C will generate a smoother decision boundary (higher bias) while a high value aims at correctly classifying all training examples, at the risk of overfitting (high variance). (C corresponds to the inverse of a regularization parameter). You can try several values of C by using a comma-separated list.
  • Gamma: Kernel coefficient for RBF, polynomial and sigmoid kernels. Gamma defines the ‘influence’ of each training example in the features space. A low value of gamma means that each example has ‘far-reaching influence’, while a high value means that each example only has close-range influence. If no value is specified (or 0.0), then 1/nb_features is used. You can try several values of Gamma by using a comma-separated list.
  • Tolerance: Tolerance for stopping criterion.
  • Maximum number of iterations: Number of iterations when fitting the model. -1 can be used to specify no limit.
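
For illustration, a sketch with scikit-learn’s SVC on synthetic data; gamma="auto" corresponds to the 1/nb_features behavior described above.

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    svm = SVC(
        kernel="rbf",       # Kernel: "linear", "rbf", "poly" or "sigmoid"
        C=1.0,              # penalty parameter of the error term
        gamma="auto",       # 1 / nb_features
        tol=1e-3,           # Tolerance for the stopping criterion
        max_iter=-1,        # no limit on the number of iterations
    ).fit(X, y)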

(Regression & Classification) Stochastic Gradient Descent

SGD is a family of algorithms that reuse concepts from Support Vector Machines and Logistic Regression. SGD uses an optimized method to minimize the cost (or loss) function, making it particularly suitable for large datasets (or datasets with a large number of features).

Parameters:

  • Loss function (logit or modified Huber): Selecting ‘logit’ loss will make the SGD behave like a Logistic Regression. Selecting ‘modified huber’ loss will make the SGD behave quite like a Support Vector Machine.
  • Iterations: number of iterations on the data
  • Penalty (L1, L2 or elastic net): L1 and L2 regularization are similar to those for linear and logistic regression. Elastic net regularization is a combination of L1 and L2 regularization.
  • Alpha: Regularization parameter. A high value of alpha (ie, more regularization) will generate a smoother decision boundary (higher bias) while a lower value (less regularization) aims at correctly classifying all training examples, at the risk of overfitting (high variance). You can try several values of alpha by using a comma-separated list.
  • L1 ratio: Elastic net regularization mixes both L1 and L2 regularization. This ratio controls the proportion of L1 in the mix (ie, 0 corresponds to L2-only, 1 corresponds to L1-only). Defaults to 0.15 (85% L2, 15% L1).
  • Parallelism: Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets.
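
A minimal sketch with scikit-learn’s SGDClassifier on synthetic data; note that the ‘logit’ loss is called "log_loss" in recent scikit-learn versions ("log" in older ones).

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    sgd = SGDClassifier(
        loss="log_loss",        # 'logit' loss; "modified_huber" is the other option above
        penalty="elasticnet",   # L1, L2 or elastic net
        alpha=1e-4,             # regularization strength
        l1_ratio=0.15,          # 85% L2, 15% L1
        max_iter=1000,          # Iterations
        n_jobs=-1,              # Parallelism
        random_state=0,
    ).fit(X, y)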

(Regression & Classification) K Nearest Neighbors

K Nearest Neighbor classification makes predictions for a sample by finding the k nearest samples and assigning the most represented class among them.

Warning: this algorithm requires storing the entire training data into the model. This will lead to a very large model if the data is larger than a few hundred lines. Predictions may also be slow.

Parameters:

  • K: The number of neighbors to examine for each sample. You can try several values by using a comma separated list. This increases the training time.
  • Distance weighting: If enabled, voting across neighbors will be weighed by the inverse distance from the sample to the neighbor.
  • Neighbor finding algorithm: The method used to find the nearest neighbors to each point. Has no impact on predictive performance, but will have a high impact on training and prediction speed.
    • Automatic: a method will be selected empirically depending on the data.
    • KD & Ball Tree: store the data points in a partitioned data structure for efficient lookup.
    • Brute force: will examine every training sample for every prediction. Usually inefficient.
  • p: The exponent of the Minkowski metric used to search neighbors. For p = 2, this gives the Euclidean distance; for p = 1, the Manhattan distance. Greater values lead to the Lp distances.
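
A sketch of these parameters with scikit-learn’s KNeighborsClassifier, using synthetic data.

    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    knn = KNeighborsClassifier(
        n_neighbors=5,          # K
        weights="distance",     # distance weighting enabled ("uniform" to disable)
        algorithm="auto",       # or "kd_tree", "ball_tree", "brute"
        p=2,                    # Minkowski exponent: 2 = Euclidean, 1 = Manhattan
    ).fit(X, y)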

(Regression & Classification) Extra Random Trees

Extra Trees, just like Random Forests, is an ensemble model. In addition to sampling features at each stage of splitting the tree, it also samples random thresholds at which to make the splits. The additional randomness may improve the generalization of the model.

Parameters:

  • Number of trees: Number of trees in the forest. You can try several values by using a comma-separated list. This increases the training time.
  • Feature sampling strategy: Adjusts the number of features to sample at each split.
    • Automatic will select 30% of the features.
    • Square root and Logarithm will select the square root or base 2 logarithm of the number of features respectively
    • Fixed number will select the given number of features
    • Fixed proportion will select the given proportion of features
  • Maximum depth of tree: Maximum depth of each tree in the forest. Higher values generally increase the quality of the prediction, but can lead to overfitting. High values also increase the training and prediction time. Use 0 for unlimited depth (ie, keep splitting the tree until each node contains a single target value). You can try several values by using a comma separated list. This increases the training time.
  • Minimum samples per leaf: Minimum number of samples required in a single tree node to split this node. Lower values increase the quality of the prediction (by splitting the tree more), but can lead to overfitting and increased training and prediction time. You can try several values by using a comma-separated list. This increases the training time.
  • Parallelism: Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets.

(Regression & Classification) Artificial Neural Network

Neural Networks are a class of parametric models which are inspired by the functioning of neurons. They consist of several “hidden” layers of neurons, which receive inputs and transmit them to the next layer, mixing the inputs and applying non-linearities, allowing for a complex decision function.

Parameters:

  • Hidden layer sizes: Number of neurons on each hidden layer. Separate by commas to add additional layers.
  • Activation: The activation function for the neurons in the network.
  • Alpha: L2 regularization parameter. Higher values lead to smaller neuron weights and a more generalizable, although less sharp model.
  • Max iterations: Maximum iterations for learning. Higher values lead to better convergence, but take more time.
  • Convergence tolerance: If the loss does not improve by this ratio over two iterations, training stops.
  • Early stopping: Whether the model should use validation and stop early.
  • Solver: The solver to use for optimization. LBFGS is a batch algorithm and is not suited for larger datasets.
  • Shuffle data: Whether the data should be shuffled between epochs (recommended, unless the data is already in random order).
  • Initial Learning Rate: The initial learning rate for gradient descent.
  • Automatic batching: Whether batches should be created automatically (will use 200, or the whole dataset if there are fewer samples). Uncheck to select the batch size.
  • beta_1: beta_1 parameter for ADAM solver.
  • beta_2: beta_2 parameter for ADAM solver.
  • epsilon: epsilon parameter for ADAM solver.
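
For illustration, here is how these parameters are typically expressed with scikit-learn’s MLPClassifier (synthetic data; the values are placeholders).

    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    mlp = MLPClassifier(
        hidden_layer_sizes=(64, 32),   # two hidden layers
        activation="relu",
        alpha=1e-4,                    # L2 regularization
        solver="adam",                 # LBFGS is batch-only, for smaller datasets
        learning_rate_init=0.001,      # Initial Learning Rate
        max_iter=200,                  # Max iterations
        tol=1e-4,                      # Convergence tolerance
        early_stopping=True,
        shuffle=True,
        batch_size="auto",             # min(200, n_samples) when automatic
        beta_1=0.9, beta_2=0.999, epsilon=1e-8,   # Adam solver parameters
        random_state=0,
    ).fit(X, y)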

(Regression & Classification) Lasso Path

The Lasso Path is a method which computes the LASSO path (ie, the Lasso solution for all values of the regularization parameter). This is performed using LARS regression. It requires a number of passes on the data equal to the number of features. If this number is large, computation may be slow. This computation makes it possible to select a given number of non-zero coefficients, ie, to select a given number of features. After training, you will be able to visualize the LASSO path and select a new number of features.

Parameters:

  • Maximum features: The number of kept features. Input 0 to have all features enabled (no regularization). Has no impact on training time.
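
A sketch of the underlying computation with scikit-learn’s lars_path on synthetic data: each column of the returned coefficient matrix corresponds to one value of the regularization parameter along the path, so the number of non-zero coefficients (ie, selected features) can be read off at any point.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import lars_path

    X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                           noise=5.0, random_state=0)

    alphas, active, coefs = lars_path(X, y, method="lasso")
    n_selected = np.count_nonzero(coefs, axis=0)   # features kept at each step of the path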

(Regression & Classification) Custom Models

You can also specify custom models using Python.

Your custom models should follow the scikit-learn predictor protocol with proper fit and predict methods.

Code samples are available for custom models.
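
As a sketch of the expected protocol, a hypothetical custom regressor (the class name and behavior are illustrative only) just needs fit and predict methods, typically by deriving from the scikit-learn base classes:

    import numpy as np
    from sklearn.base import BaseEstimator, RegressorMixin

    class MeanRegressor(BaseEstimator, RegressorMixin):
        """Toy custom model: always predicts the mean of the training target."""

        def fit(self, X, y):
            self.mean_ = float(np.mean(y))
            return self

        def predict(self, X):
            return np.full(len(X), self.mean_)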

Clustering algorithms

K-means

The k-means algorithm clusters data by trying to separate samples into n groups, minimizing a criterion known as the ‘inertia’ of the groups.

Parameters:

  • Number of clusters: You can try multiple values by providing a comma-separated list. This increases the training time.
  • Seed: Used to generate reproducible results. 0 or no value means that no known seed is used (results will not be fully reproducible)
  • Parallelism: Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption. If -1 all CPUs are used. For values below -1, (n_cpus + 1 + value) are used: ie for -2, all CPUs but one are used.
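
A minimal sketch with scikit-learn’s KMeans on synthetic data; the ‘inertia’ is exposed directly on the fitted model.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    km = KMeans(n_clusters=4, random_state=42).fit(X)   # random_state plays the role of the Seed
    labels = km.labels_
    print(km.inertia_)                                  # the criterion minimized by k-means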

Gaussian Mixture

The Gaussian Mixture Model models the distribution of the data as a “mixture” of several populations, each of which can be described by a single multivariate normal distribution.

An example of such a distribution is that of heights among adults, which is described by the mixture of two distributions: the heights of men and those of women, each of which is approximately described by a normal distribution.

Parameters:

  • Number of mixture components: Number of populations. You can try multiple values by providing a comma-separated list. This increases the training time.
  • Max Iterations: The maximum number of iterations to learn the model. The Gaussian Mixture model uses the Expectation-Maximization algorithm, which is iterative, each iteration running on all of the data. A higher value of this parameter will lead to a longer running time, but a more precise clustering. A value between 10 and 100 is recommended.
  • Seed: Used to generate reproducible results. 0 or no value means that no known seed is used (results will not be fully reproducible)
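
A sketch with scikit-learn’s GaussianMixture on synthetic data; unlike k-means, the model also exposes per-component membership probabilities.

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

    gmm = GaussianMixture(
        n_components=3,     # Number of mixture components
        max_iter=100,       # Max Iterations of the EM algorithm
        random_state=42,    # Seed
    ).fit(X)

    hard_labels = gmm.predict(X)
    soft_labels = gmm.predict_proba(X)   # probability of belonging to each component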

Mini-batch K-means

The Mini-Batch k-means is a variant of the k-means algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function.

Parameters:

  • Number of clusters: You can try multiple values by providing a comma-separated list. This increases the training time.
  • Seed: Used to generate reproducible results. 0 or no value means that no known seed is used (results will not be fully reproducible)

Agglomerative Clustering

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.

Parameters:

  • Number of clusters: You can try multiple values by providing a comma-separated list. This increases the training time.

Spectral Clustering

The spectral clustering algorithm uses the graph distance in the nearest neighbor graph. It performs a low-dimension embedding of the affinity matrix between samples, followed by k-means clustering in the low-dimensional space.

Parameters:

  • Number of clusters: You can try several values by using a comma-separated list. This increases the training time.
  • Affinity measure: The method for computing the distance between samples. Possible options are nearest neighbors, RBF kernel and polynomial kernel.
  • Gamma: Kernel coefficient for RBF and polynomial kernels. Gamma defines the ‘influence’ of each training example in the features space. A low value of gamma means that each example has ‘far-reaching influence’, while a high value means that each example only has close-range influence. If no value is specified (or 0.0), then 1/nb_features is used.
  • Coef0: Independent term for ‘polynomial’ or ‘sigmoid’ kernel function.
  • Seed: Used to generate reproducible results. 0 or no value means that no known seed is used (results will not be fully reproducible)

DBSCAN

The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. Numerical features should use standard rescaling.

There are two parameters that you can modify in DBSCAN:

  • Epsilon: Maximum distance to consider two samples in the same neighborhood. You can try several values by using a comma-separated list
  • Min. Sample ratio: Minimum ratio of records to form a cluster
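
A sketch with scikit-learn’s DBSCAN on rescaled synthetic data; note that scikit-learn’s min_samples is an absolute count rather than the ratio exposed here.

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons
    from sklearn.preprocessing import StandardScaler

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
    X = StandardScaler().fit_transform(X)     # standard rescaling, as recommended above

    db = DBSCAN(
        eps=0.3,            # Epsilon: maximum neighborhood distance
        min_samples=10,     # absolute counterpart of the minimum sample ratio
    ).fit(X)
    labels = db.labels_     # -1 marks points considered as noise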

Interactive Clustering (Two-step clustering)

Interactive clustering is based on a two-step clustering algorithm. This two-staged algorithm first agglomerates data points into small clusters using K-Means clustering. Then, it applies agglomerative hierarchical clustering in order to further cluster the data, while also building a hierarchy between the smaller clusters, which can then be interpreted. It therefore makes it possible to extract hierarchical information from datasets larger than a few hundred lines, which cannot be achieved through standard methods. The clustering can then be manually adjusted in DSS’s interface.

Parameters:

  • Number of Pre-clusters: The number of clusters for KMeans preclustering. It is recommended that this number be lower than a couple hundred for readability.
  • Number of clusters: The number of clusters in the hierarchy. The full hierarchy will be built and displayed, but these clusters will be used for scoring.
  • Max Iterations: The maximum number of iterations for preclustering. KMeans is an iterative algorithm. A higher value of this parameter will lead to a longer running time, but a more precise pre-clustering. A value between 10 and 100 is recommended.
  • Seed: Used to generate reproducible results. 0 or no value means that no known seed is used (results will not be fully reproducible)
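
A rough sketch of the two-step idea using scikit-learn (not the exact DSS implementation): K-Means first compresses the data into many small pre-clusters, then agglomerative clustering builds a hierarchy over the pre-cluster centers.

    from sklearn.cluster import KMeans, AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=5000, centers=5, random_state=0)

    # Step 1: pre-clustering (Number of Pre-clusters, Max Iterations, Seed)
    pre = KMeans(n_clusters=100, max_iter=50, random_state=42).fit(X)

    # Step 2: hierarchy over the pre-cluster centers (Number of clusters)
    hier = AgglomerativeClustering(n_clusters=5).fit(pre.cluster_centers_)

    # Map each point to the final cluster of its pre-cluster
    final_labels = hier.labels_[pre.labels_]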

Isolation Forest (Anomaly Detection)

Isolation Forest is an anomaly detection algorithm. It isolates observations by building a forest of random trees, each of which splits samples into different partitions. Anomalies tend to have much shorter paths from the root of the tree. Thus, the mean distance from the root provides a good measure of abnormality.

Parameters:

  • Number of trees: Number of trees in the forest.
  • Contamination: Expected proportion of anomalies in the data.
  • Anomalies to display: Maximum number of anomalies to display in the model report. Too high a number may cause memory and UI problems.
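
A minimal sketch with scikit-learn’s IsolationForest on synthetic data; predictions are -1 for anomalies and 1 for normal points, and lower scores mean more anomalous.

    from sklearn.datasets import make_blobs
    from sklearn.ensemble import IsolationForest

    X, _ = make_blobs(n_samples=1000, centers=1, random_state=0)

    iso = IsolationForest(
        n_estimators=100,     # Number of trees
        contamination=0.01,   # expected proportion of anomalies
        random_state=0,
    ).fit(X)

    labels = iso.predict(X)        # -1 = anomaly, 1 = normal
    scores = iso.score_samples(X)  # lower = more anomalous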

Custom Models

You can also specify custom models using Python.

Your custom models should follow the scikit-learn predictor protocol with proper fit and fit_predict methods.

A specified number of clusters can also be passed to the model through the interface.

Code samples are available for custom models.