# In-memory Python (Scikit-learn / LightGBM / XGBoost)¶

Most algorithms are based on the scikit-learn, LightGBM, or XGBoost machine learning libraries.

This engine provides in-memory processing. The train and test sets must fit in memory. Use the sampling settings if needed.

## Prediction algorithms¶

Prediction with this engine supports the following algorithms.

### (Regression) Ordinary Least Squares¶

Ordinary Least Squares or Linear Least Squares is the simplest algorithm for linear regression. The target variable is computed as the sum of weighted input variables. OLS finds the appropriate weights by minimizing the cost function (the sum of squared errors, ie, how ‘wrong’ the algorithm is).

OLS is very simple and provides a very “explainable” model, but:

- it cannot automatically fit data for which the target variable is not the result of a linear combination of input features
- it is highly sensitive to errors in the input dataset and prone to overfitting

Parameters:

- **Parallelism:** Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets. (-1 means ‘all cores’)
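
As a minimal sketch (on synthetic data, not DSS's exact invocation), the underlying scikit-learn estimator is `LinearRegression`; `n_jobs` mirrors the Parallelism setting:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is an exact linear combination of the inputs
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

# n_jobs=-1 mirrors the "Parallelism" setting (-1 means 'all cores')
ols = LinearRegression(n_jobs=-1).fit(X, y)
coefs = ols.coef_           # learned weights, close to [3, -2]
intercept = ols.intercept_  # close to 1.0
```

Because the data here is noiseless and truly linear, OLS recovers the generating weights exactly, which is the "explainable model" property described above.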

### (Regression) Ridge Regression¶

Ridge Regression addresses some problems of Ordinary Least Squares by imposing a penalty (or regularization term) to the weights. Ridge regression uses a L2 regularization. L2 regularization reduces the size of the coefficients in the model.

Parameters:

- **Regularization term (auto-optimized or specific values)**: Auto-optimization is generally faster than trying multiple values, but it does not support sparse features (like text hashing).
- **Alpha**: The regularization term. This parameter can be optimized.
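
A small sketch of the shrinkage effect, using scikit-learn's `Ridge` directly on synthetic data (illustrative, not DSS's exact invocation): a larger alpha shrinks the coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = X @ np.array([4.0, -3.0, 2.0, 0.5, 0.0]) + rng.normal(0, 0.1, 100)

# Larger alpha = stronger L2 penalty = smaller coefficients overall
weak = Ridge(alpha=0.01).fit(X, y)
strong = Ridge(alpha=100.0).fit(X, y)
shrunk = np.abs(strong.coef_).sum() < np.abs(weak.coef_).sum()
```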

### (Regression) Lasso Regression¶

Lasso Regression is another linear model, using a different regularization term (L1 regularization). L1 regularization reduces the number of features included in the final model.

Parameters:

- **Regularization term (auto-optimized or specific values)**: Auto-optimization is generally faster than trying multiple values, but it does not support sparse features (like text hashing).
- **Alpha**: The regularization term. This parameter can be optimized.
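
To illustrate how L1 regularization reduces the number of features, here is a sketch with scikit-learn's `Lasso` on synthetic data where only two of ten features matter (illustrative setup, not DSS's exact invocation):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.rand(200, 10)
# Only the first two features actually matter
y = 5.0 * X[:, 0] - 4.0 * X[:, 1] + rng.normal(0, 0.1, 200)

lasso = Lasso(alpha=0.1).fit(X, y)
# L1 drives irrelevant coefficients to exactly 0
n_kept = int(np.sum(lasso.coef_ != 0))
```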

### (Classification) Logistic regression¶

Despite its name, Logistic Regression is a classification algorithm using a linear model (ie, it computes the target feature as a linear combination of input features). Logistic Regression minimizes a specific cost function (the logistic loss, based on the sigmoid function), which makes it appropriate for classification. A simple Logistic Regression algorithm is prone to overfitting and sensitive to errors in the input dataset. To address these issues, it is possible to apply a penalty (or regularization term) to the weights.

Logistic regression has two parameters:

- **Regularization (L1 or L2 regularization)**: L1 regularization reduces the number of features that are used in the model. L2 regularization reduces the size of the coefficient for each feature.
- **C**: Penalty parameter C of the error term. A low value of C will generate a smoother decision boundary (higher bias) while a high value aims at correctly classifying all training examples, at the risk of overfitting (high variance). (C corresponds to the inverse of a regularization parameter). You can try several values of C by using a comma-separated list.
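
As a sketch of the C parameter's effect, using scikit-learn's `LogisticRegression` on synthetic data (illustrative, not DSS's exact invocation): low C (strong regularization) yields smaller coefficients than high C.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Low C = strong regularization (smoother boundary); high C = weak regularization
smooth = LogisticRegression(C=0.01, penalty="l2").fit(X, y)
flexible = LogisticRegression(C=100.0, penalty="l2").fit(X, y)
shrunk = np.abs(smooth.coef_).sum() < np.abs(flexible.coef_).sum()
```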

### (Regression & Classification) Random Forests¶

*Decision tree classification* is a simple algorithm which builds a decision tree. Each node of the decision tree includes a condition on one of the input features.

A *Random Forest* model is made of many decision trees. When predicting a new record, each tree makes its own prediction and “votes” for the final answer of the forest; for regression, the forest averages the individual trees’ answers. When “growing” (ie, training) the forest:

- for each tree, a random sample of the training set is used;
- for each decision point in the tree, a random subset of the input features is considered.

Random Forests generally provide good results, at the expense of “explainability” of the model.

Parameters:

- **Number of trees**: Number of trees in the forest. Increasing the number of trees in a random forest does not result in overfitting. This parameter can be optimized.
- **Feature sampling strategy:** Adjusts the number of features to sample at each split.
  - Automatic will select 30% of the features.
  - Square root and Logarithm will select the square root or base 2 logarithm of the number of features, respectively.
  - Fixed number will select the given number of features.
  - Fixed proportion will select the given proportion of features.

- **Maximum depth of tree**: Maximum depth of each tree in the forest. Higher values generally increase the quality of the prediction, but can lead to overfitting. High values also increase the training and prediction time. Use 0 for unlimited depth (ie, keep splitting the tree until each node contains a single target value).
- **Minimum samples per leaf**: Minimum number of samples required in a single tree node to split this node. Lower values increase the quality of the prediction (by splitting the tree more), but can lead to overfitting and increased training and prediction time.
- **Parallelism:** Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets.
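
The parameters above map onto scikit-learn's `RandomForestClassifier` roughly as sketched below (synthetic data; illustrative, not DSS's exact invocation):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # "Number of trees"
    max_features="sqrt",   # "Feature sampling strategy": square root
    max_depth=None,        # unlimited depth (the "0" setting above)
    min_samples_leaf=1,    # "Minimum samples per leaf"
    n_jobs=-1,             # "Parallelism": all cores
    random_state=0,
).fit(X, y)
train_acc = forest.score(X, y)
```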

### (Regression & Classification) Gradient Boosted Trees¶

Gradient boosted trees are another ensemble method based on decision trees. Trees are added to the model sequentially, and each tree attempts to improve the performance of the ensemble as a whole. The advantages of GBRT are:

Natural handling of data of mixed type (= heterogeneous features)

Predictive power

Robustness to outliers in output space (via robust loss functions)

Please note that you may face scalability issues: due to the sequential nature of boosting, it can hardly be parallelized.

The gradient boosted tree algorithm has four parameters:

- **Number of boosting stages**: The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting, so a large number usually results in better performance. This parameter can be optimized.
- **Learning rate**: Multiplier applied to all base learners. Lower values slow down convergence and can make the model more robust. Typical values: 0.01 - 0.3. This parameter can be optimized.
- **Loss:** The available loss functions depend upon whether this is a classification or regression problem.
  - **Classification:** Deviance refers to deviance (equivalent to logistic regression) for classification with probabilistic outputs. For exponential loss, gradient boosting recovers the AdaBoost algorithm.
  - **Regression:** Choose from least squares, least absolute deviation, or Huber. Huber is a combination of least squares and least absolute deviation.

- **Maximum depth of tree:** Maximum depth of each tree. High values can increase the quality of the prediction, but can also lead to over-fitting. Typical values: 3 - 10. This parameter can be optimized.

This algorithm also provides the ability to visualize partial dependency plots of your features.

### (Regression & Classification) LightGBM¶

LightGBM uses a specific library instead of scikit-learn.

LightGBM is a tree-based gradient boosting library designed to be distributed and efficient. It provides fast training speed, low memory usage, good accuracy and is capable of handling large scale data.

Parameters:

- **Maximum number of trees:** LightGBM has an early stopping mechanism, so the exact number of trees will be optimized. A high number of actual trees will increase the training and prediction time. Typical values: 50 - 200. This parameter can be optimized.
- **Maximum depth of tree:** Maximum depth of each tree. High values can increase the quality of the prediction, but can also lead to over-fitting. Typical values: 3 - 10.
- **Number of leaves:** Maximum tree leaves for base learners. Typical values range between 20 and 500. This parameter can be optimized.
- **Learning rate:** Multiplier applied to all base learners. Lower values slow down convergence and can make the model more robust. Typical values: 0.01 - 0.3. This parameter can be optimized.
- **Minimum split gain:** Minimum loss reduction required to make a further partition on a leaf node of the tree. This parameter can be optimized.
- **Minimum child weight:** Minimum sum of instance weight (hessian) needed in a child (leaf). High values can prevent over-fitting by stopping the model from learning highly specific cases. Smaller values allow leaf nodes to match a small set of rows, which can be relevant for highly imbalanced datasets. This parameter can be optimized.
- **Minimum leaf samples:** Minimum number of data samples needed in a leaf. This parameter can be optimized.
- **Colsample by tree:** Fraction of the features to be used in each tree. Typical values: 0.5 - 1. This parameter can be optimized.
- **L1 regularization:** L1 regularization coefficient applied to the weight of potential splits for their evaluation during tree-building. Aims at reducing over-fitting and the complexity of trees. This parameter can be optimized.
- **L2 regularization:** L2 regularization coefficient applied to the weight of potential splits for their evaluation during tree-building. Aims at reducing over-fitting and the complexity of trees. This parameter can be optimized.
- **Use bagging:** Bagging can be used to speed up training and/or prevent over-fitting, but can also make specific cases harder to learn. Enabling bagging allows you to configure the “Subsample” parameters.
- **Subsample ratio:** Subsample ratio for the data to be used in each tree. Low values can prevent over-fitting but can make specific cases harder to learn. Typical values: 0.5 - 1. Note that 1 will de facto disable bagging.
- **Subsample frequency:** Frequency (in number of iterations) at which bagging must be performed. Setting a value of k means “perform bagging every k iterations”. Note that 0 will de facto disable bagging.
- **Early stopping:** Use LightGBM’s built-in early stopping mechanism, so that the exact number of trees is optimized (up to the specified maximum number of trees). The cross-validation scheme defined in the **Train & validation** tab will be used.
- **Early stopping rounds:** The optimizer stops if the loss does not decrease for this consecutive number of iterations. Typical values: 4 - 10.
- **Parallelism:** Number of cores used for parallel training (-1 means “all cores”). Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets.

### (Regression & Classification) XGBoost¶

XGBoost uses a specific library instead of scikit-learn.

XGBoost is an advanced gradient boosted tree algorithm. It has support for parallel processing, regularization, and early stopping, which makes it a very fast, scalable and accurate algorithm.

Parameters:

- **Maximum number of trees:** XGBoost has an early stopping mechanism, so the exact number of trees will be optimized. A high number of actual trees will increase the training and prediction time. Typical values: 50 - 200. This parameter can be optimized.
- **Early stopping:** Use XGBoost’s built-in early stopping mechanism, so that the exact number of trees is optimized (up to the specified maximum number of trees). The cross-validation scheme defined in the **Train & validation** tab will be used.
- **Early stopping rounds:** The optimizer stops if the loss does not decrease for this consecutive number of iterations. Typical values: 4 - 10.
- **Maximum depth of tree:** Maximum depth of each tree. High values can increase the quality of the prediction, but can also lead to over-fitting. Typical values: 3 - 10. This parameter can be optimized.
- **Learning rate:** Multiplier applied to all base learners. Lower values slow down convergence and can make the model more robust. Typical values: 0.01 - 0.3. This parameter can be optimized.
- **L2 regularization:** L2 regularization coefficient applied to the weight of potential splits for their evaluation during tree-building. Aims at reducing over-fitting and the complexity of trees. This parameter can be optimized.
- **L1 regularization:** L1 regularization coefficient applied to the weight of potential splits for their evaluation during tree-building. Aims at reducing over-fitting and the complexity of trees. This parameter can be optimized.
- **Gamma:** Minimum loss reduction required to make a further partition on a leaf node of the tree. This parameter can be optimized.
- **Minimum child weight:** Minimum sum of instance weight (hessian) needed in a child (leaf). High values can prevent over-fitting by stopping the model from learning highly specific cases. Smaller values allow leaf nodes to match a small set of rows, which can be relevant for highly imbalanced datasets. This parameter can be optimized.
- **Subsample:** Subsample ratio for the data to be used in each tree. Low values can prevent over-fitting but can make specific cases harder to learn. Typical values: 0.5 - 1. This parameter can be optimized.
- **Colsample by tree:** Fraction of the features to be used in each tree. Typical values: 0.5 - 1. This parameter can be optimized.
- **Replace missing values:**
- **Parallelism:** Number of cores used for parallel training (-1 means “all cores”). Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets.

### (Regression & Classification) Decision Tree¶

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Parameters:

- **Maximum depth**: The maximum depth of the tree. This parameter can be optimized.
- **Criterion (Gini or Entropy)**: The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. This applies only to classification problems.
- **Minimum samples per leaf**: Minimum number of samples required to be at a leaf node. This parameter can be optimized.
- **Split strategy (Best or random)**: The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
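
These parameters map onto scikit-learn's `DecisionTreeClassifier` roughly as sketched below (on the bundled Iris dataset; illustrative, not DSS's exact invocation):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=3,         # "Maximum depth"
    criterion="gini",    # "Criterion": gini or entropy
    min_samples_leaf=5,  # "Minimum samples per leaf"
    splitter="best",     # "Split strategy": best or random
    random_state=0,
).fit(X, y)
acc = tree.score(X, y)
```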

### (Regression & Classification) Support Vector Machine¶

Support Vector Machine is a powerful ‘black-box’ algorithm for classification. Through the use of kernel functions, it can learn complex non-linear decision boundaries (ie, when it is not possible to compute the target as a linear combination of input features). SVM is effective with a large number of features. However, this algorithm is generally slower than others.

Parameters:

- **Kernel (linear, RBF, polynomial, sigmoid)**: The kernel function used for computing the similarity of samples. Try several to see which works the best.
- **C**: Penalty parameter C of the error term. A low value of C will generate a smoother decision boundary (higher bias) while a high value aims at correctly classifying all training examples, at the risk of overfitting (high variance). (C corresponds to the inverse of a regularization parameter). This parameter can be optimized.
- **Gamma**: Kernel coefficient for RBF, polynomial and sigmoid kernels. Gamma defines the ‘influence’ of each training example in the features space. A low value of gamma means that each example has ‘far-reaching influence’, while a high value means that each example only has close-range influence. If no value is specified (or 0.0), then 1/nb_features is used. This parameter can be optimized.
- **Tolerance**: Tolerance for stopping criterion.
- **Maximum number of iterations**: Number of iterations when fitting the model. -1 can be used to specify no limit.
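
As a sketch of a non-linear boundary learned through a kernel, here is scikit-learn's `SVC` with an RBF kernel on the two-moons toy dataset (illustrative, not DSS's exact invocation):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

# Non-linearly separable data: an RBF kernel can learn the curved boundary
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

svm = SVC(
    kernel="rbf",
    C=1.0,          # low C = smoother boundary, high C = fit training points harder
    gamma="scale",  # kernel coefficient; a fixed float can be given instead
    tol=1e-3,       # "Tolerance" stopping criterion
    max_iter=-1,    # -1 = no iteration limit
).fit(X, y)
acc = svm.score(X, y)
```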

### (Regression & Classification) Stochastic Gradient Descent¶

SGD is a family of algorithms that reuse concepts from Support Vector Machines and Logistic Regression. SGD uses an optimized method to minimize the cost (or loss) function, making it particularly suitable for large datasets (or datasets with a large number of features).

Parameters:

- **Loss function (logit or modified Huber)**: Selecting ‘logit’ loss will make the SGD behave like a Logistic Regression. Enabling ‘modified huber’ loss will make the SGD behave quite like a Support Vector Machine.
- **Iterations**: Number of iterations on the data.
- **Penalty (L1, L2 or elastic net)**: L1 and L2 regularization are similar to those for linear and logistic regression. Elastic net regularization is a combination of L1 and L2 regularization.
- **Alpha**: Regularization parameter. A high value of alpha (ie, more regularization) will generate a smoother decision boundary (higher bias) while a lower value (less regularization) aims at correctly classifying all training examples, at the risk of overfitting (high variance). This parameter can be optimized.
- **L1 ratio**: Elastic net regularization mixes both L1 and L2 regularization. This ratio controls the proportion of L1 in the mix (ie: 0 corresponds to L2-only, 1 corresponds to L1-only). Defaults to 0.15 (85% L2, 15% L1).
- **Parallelism**: Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets.

### (Regression & Classification) K Nearest Neighbors¶

K Nearest Neighbor classification makes predictions for a sample by finding the k nearest samples and assigning the most represented class among them.

*Warning:* this algorithm requires storing the entire training data into the model. This will lead to a very large model if the data is larger than a few hundred lines. Predictions may also be slow.

Parameters:

- **K:** The number of neighbors to examine for each sample. This parameter can be optimized.
- **Distance weighting:** If enabled, voting across neighbors will be weighted by the inverse distance from the sample to the neighbor.
- **Neighbor finding algorithm:** The method used to find the nearest neighbors to each point. Has no impact on predictive performance, but will have a high impact on training and prediction speed.
  - Automatic: a method will be selected empirically depending on the data.
  - KD & Ball Tree: stores the data points into a partitioned data structure for efficient lookup.
  - Brute force: will examine every training sample for every prediction. Usually inefficient.
- **p:** The exponent of the Minkowski metric used to search neighbors. For p = 2, this gives Euclidean distance; for p = 1, Manhattan distance. Greater values lead to the Lp distances.
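
A sketch of these settings with scikit-learn's `KNeighborsClassifier`, on two well-separated synthetic blobs (illustrative, not DSS's exact invocation):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two tight, well-separated blobs
X = np.vstack([np.random.RandomState(0).normal(0, 0.3, (50, 2)),
               np.random.RandomState(1).normal(3, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

knn = KNeighborsClassifier(
    n_neighbors=5,       # "K"
    weights="distance",  # "Distance weighting": inverse-distance vote
    algorithm="auto",    # "Neighbor finding algorithm": automatic
    p=2,                 # Minkowski exponent: 2 = Euclidean, 1 = Manhattan
).fit(X, y)
pred = knn.predict([[0.0, 0.0], [3.0, 3.0]])
```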

### (Regression & Classification) Extra Random Trees¶

Extra trees, just like Random Forests, are an ensemble model. In addition to sampling features at each stage of splitting the tree, they also sample random thresholds at which to make the splits. The additional randomness may improve generalization of the model.

Parameters:

- **Number of trees:** Number of trees in the forest. This parameter can be optimized.
- **Feature sampling strategy:** Adjusts the number of features to sample at each split.
  - Automatic will select 30% of the features.
  - Square root and Logarithm will select the square root or base 2 logarithm of the number of features, respectively.
  - Fixed number will select the given number of features.
  - Fixed proportion will select the given proportion of features.

- **Maximum depth of tree:** Maximum depth of each tree in the forest. Higher values generally increase the quality of the prediction, but can lead to overfitting. High values also increase the training and prediction time. Use 0 for unlimited depth (ie, keep splitting the tree until each node contains a single target value). This parameter can be optimized.
- **Minimum samples per leaf:** Minimum number of samples required in a single tree node to split this node. Lower values increase the quality of the prediction (by splitting the tree more), but can lead to overfitting and increased training and prediction time. This parameter can be optimized.
- **Parallelism:** Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption, especially for large training datasets.

### (Regression & Classification) Artificial Neural Network¶

Neural Networks are a class of parametric models which are inspired by the functioning of neurons. They consist of several “hidden” layers of neurons, which receive inputs and transmit them to the next layer, mixing the inputs and applying non-linearities, allowing for a complex decision function.

Parameters:

- **Hidden layer sizes:** Number of neurons on each hidden layer. Separate by commas to add additional layers.
- **Activation:** The activation function for the neurons in the network.
- **Alpha:** L2 regularization parameter. Higher values lead to smaller neuron weights and a more generalizable, although less sharp, model.
- **Max iterations:** Maximum iterations for learning. Higher values lead to better convergence, but take more time.
- **Convergence tolerance:** If the loss does not improve by this ratio over two iterations, training stops.
- **Early stopping:** Whether the model should use validation and stop early.
- **Solver:** The solver to use for optimization. LBFGS is a batch algorithm and is not suited for larger datasets.
- **Shuffle data:** Whether the data should be shuffled between epochs (recommended, unless the data is already in random order).
- **Initial learning rate:** The initial learning rate for gradient descent.
- **Automatic batching:** Whether batches should be created automatically (will use 200, or the whole dataset if there are fewer samples). Uncheck to select batch size.
- **beta_1:** beta_1 parameter for the ADAM solver.
- **beta_2:** beta_2 parameter for the ADAM solver.
- **epsilon:** epsilon parameter for the ADAM solver.

### (Regression & Classification) Lasso Path¶

The Lasso Path is a method which computes the LASSO path (ie. for all values of the regularization parameter). This is performed using LARS regression. It requires a number of passes on the data equal to the number of features. If this number is large, computation may be slow. This computation allows you to select a given number of non-zero coefficients, ie. to select a given number of features. After training, you will be able to visualize the LASSO path and select a new number of features.

Parameters:

- **Maximum features:** The number of kept features. Input 0 to have all features enabled (no regularization). Has no impact on training time.

### (Regression & Classification) Custom Models¶

You can make custom models using Python. Your custom models should be scikit-learn compatible:

- They must implement the methods `fit` and `predict`.
- They must subclass `sklearn.base.BaseEstimator`.
- They must receive the parameters of the `__init__` function as explicit keyword arguments.

Warning

Classes cannot be declared directly in the Models > Design tab. They must be packaged in a library and imported, as demonstrated in the examples below.

For more details and advanced examples, please refer to Advanced Custom Models.

#### Regression¶

##### Example¶

On the Models > Design > Algorithms tab, in the “Custom python model” code editor, you should create the `clf` variable:

```python
from custom_python_model import MyRandomRegressor

clf = MyRandomRegressor()
```

In `custom_python_model.py`:

```python
from sklearn.base import BaseEstimator
import numpy as np


class MyRandomRegressor(BaseEstimator):
    """This model predicts random values between the minimum and the maximum of y"""

    def fit(self, X, y):
        self.y_range = [np.min(y), np.max(y)]
        return self

    def predict(self, X):
        return np.random.uniform(self.y_range[0], self.y_range[1], size=X.shape[0])
```

#### Classification¶

In addition to `fit` and `predict`, a classifier must also have a `classes_` attribute, and it can implement a `predict_proba` method.

##### Example¶

On the Models > Design > Algorithms tab, in the “Custom python model” code editor, you should create the `clf` variable:

```python
from custom_python_model import MyRandomClassifier

clf = MyRandomClassifier()
```

In `custom_python_model.py`:

```python
from sklearn.base import BaseEstimator
import numpy as np


class MyRandomClassifier(BaseEstimator):
    """This model predicts classes randomly"""

    def fit(self, X, y):
        self.classes_ = list(set(y))
        return self

    def predict(self, X):
        return np.random.choice(self.classes_, size=X.shape[0])

    def predict_proba(self, X):
        return np.random.rand(X.shape[0], len(self.classes_))
```

Note

For linear binary classification models, it is possible to display the fitted regression coefficients in the “Regression coefficients” tab of the model report. To do so, you need to specify them using the scikit-learn approach, i.e. the custom classifier must satisfy the following conditions:

- The classifier has attributes `coef_` and `intercept_`.
- These attributes are either of type `numpy.ndarray` or `list`.
- These attributes only have one row (i.e. `coef_.shape[0] == 1`, or `len(coef_) == 1` if of type `list`, and the same for `intercept_`).
- `len(clf.coef_[0])` is equal to the number of preprocessed features (i.e. the number of columns in the train dataframe).

### (Regression & Classification) Plugin Models¶

You can also build and use plugin models using Python.

See Component: Prediction algorithm for more details.

## Clustering algorithms¶

### K-means¶

The k-means algorithm clusters data by trying to separate samples in *n* groups, minimizing a criterion known as the ‘inertia’ of the groups.

Parameters:

- **Number of clusters:** This parameter can be optimized.
- **Seed:** Used to generate reproducible results. 0 or no value means that no known seed is used (results will not be fully reproducible).
- **Parallelism:** Number of cores used for parallel training. Using more cores leads to faster training but at the expense of more memory consumption. If -1, all CPUs are used. For values below -1, (n_cpus + 1 + value) cores are used: ie for -2, all CPUs but one are used.
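
A sketch with scikit-learn's `KMeans` on three well-separated synthetic blobs (illustrative, not DSS's exact invocation); `inertia_` is the criterion k-means minimizes:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Three well-separated one-dimensional-ish blobs in 2D
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in (0.0, 5.0, 10.0)])

km = KMeans(
    n_clusters=3,     # "Number of clusters"
    random_state=42,  # "Seed" for reproducible results
    n_init=10,
).fit(X)
labels = km.labels_
inertia = km.inertia_  # the 'inertia' criterion k-means minimizes
```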

### Gaussian Mixture¶

The Gaussian Mixture Model models the distribution of the data as a “mixture” of several populations, each of which can be described by a single multivariate normal distribution.

An example of such a distribution is that of sizes among adults, which is described by the mixture of two distributions: the sizes of men, and those of women, each of which is approximately described by a normal distribution.

Parameters:

- **Number of mixture components:** Number of populations. This parameter can be optimized.
- **Max iterations:** The maximum number of iterations to learn the model. The Gaussian Mixture model uses the Expectation-Maximization algorithm, which is iterative, each iteration running on all of the data. A higher value of this parameter will lead to a longer running time, but a more precise clustering. A value between 10 and 100 is recommended.
- **Seed:** Used to generate reproducible results. 0 or no value means that no known seed is used (results will not be fully reproducible).

### Mini-batch K-means¶

The Mini-Batch k-means is a variant of the k-means algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function.

Parameters:

- **Number of clusters:** This parameter can be optimized.
- **Seed:** Used to generate reproducible results. 0 or no value means that no known seed is used (results will not be fully reproducible).

### Agglomerative Clustering¶

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.

Parameters:

- **Number of clusters:** This parameter can be optimized.

### Spectral Clustering¶

The spectral clustering algorithm uses the graph distance in the nearest neighbor graph. It performs a low-dimensional embedding of the affinity matrix between samples, followed by k-means clustering in the low-dimensional space.

Parameters:

- **Number of clusters:** This parameter can be optimized.
- **Affinity measure:** The method for computing the distance between samples. Possible options are nearest neighbors, RBF kernel and polynomial kernel.
- **Gamma:** Kernel coefficient for RBF and polynomial kernels. Gamma defines the ‘influence’ of each training example in the features space. A low value of gamma means that each example has ‘far-reaching influence’, while a high value means that each example only has close-range influence. If no value is specified (or 0.0), then 1/nb_features is used.
- **Coef0:** Independent term for the ‘polynomial’ or ‘sigmoid’ kernel function.
- **Seed:** Used to generate reproducible results. 0 or no value means that no known seed is used (results will not be fully reproducible).

### DBSCAN¶

The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. Numerical features should use standard rescaling.

There are two parameters that you can modify in DBSCAN:

- **Epsilon:** Maximum distance to consider two samples in the same neighborhood. You can try several values by using a comma-separated list.
- **Min. sample ratio:** Minimum ratio of records to form a cluster.
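
A sketch with scikit-learn's `DBSCAN` on synthetic data containing two dense blobs and one far-away outlier (illustrative, not DSS's exact invocation); numerical features are standard-rescaled as recommended above:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two dense blobs plus a far-away outlier
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(4, 0.2, (50, 2)),
               [[20.0, 20.0]]])
X = StandardScaler().fit_transform(X)  # standard rescaling, as recommended

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
# Points labeled -1 are treated as noise/outliers
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = int(np.sum(db.labels_ == -1))
```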

### Interactive Clustering (Two-step clustering)¶

Interactive clustering is based on a two-step clustering algorithm. This two-staged algorithm first agglomerates data points into small clusters using K-Means clustering. Then, it applies agglomerative hierarchical clustering in order to further cluster the data, while also building a hierarchy between the smaller clusters, which can then be interpreted. It therefore makes it possible to extract hierarchical information from datasets larger than a few hundred lines, which cannot be achieved through standard methods. The clustering can then be manually adjusted in DSS’s interface.

Parameters:

- **Number of pre-clusters:** The number of clusters for K-Means pre-clustering. It is recommended that this number be lower than a couple hundred for readability.
- **Number of clusters:** The number of clusters in the hierarchy. The full hierarchy will be built and displayed, but these clusters will be used for scoring.
- **Max iterations:** The maximum number of iterations for pre-clustering. K-Means is an iterative algorithm. A higher value of this parameter will lead to a longer running time, but a more precise pre-clustering. A value between 10 and 100 is recommended.
- **Seed:** Used to generate reproducible results. 0 or no value means that no known seed is used (results will not be fully reproducible).

### Isolation Forest (Anomaly Detection)¶

Isolation forest is an anomaly detection algorithm. It isolates observations by creating a Random Forest of trees, each splitting samples in different partitions. Anomalies tend to have much shorter paths from the root of the tree. Thus, the mean distance from the root provides a good measure of non-normality.

Parameters:

- **Number of trees:** Number of trees in the forest.
- **Contamination:** Expected proportion of anomalies in the data.
- **Anomalies to display:** Maximum number of anomalies to display in the model report. Too high a number may cause memory and UI problems.
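
A sketch with scikit-learn's `IsolationForest` on synthetic data where two points sit far from the bulk (illustrative, not DSS's exact invocation); `predict` returns -1 for anomalies and 1 for normal points:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Mostly normal points, plus two obvious anomalies far from the bulk
normal = rng.normal(0, 1, (200, 2))
anomalies = np.array([[8.0, 8.0], [-9.0, 7.0]])
X = np.vstack([normal, anomalies])

iso = IsolationForest(
    n_estimators=100,    # "Number of trees"
    contamination=0.01,  # "Contamination": expected proportion of anomalies
    random_state=0,
).fit(X)
pred = iso.predict(X)    # -1 = anomaly, 1 = normal
flagged_far_points = pred[-2:]
```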

### Custom Clustering Models¶

You can make custom models using Python. Your custom models should be scikit-learn compatible:

- They must implement the methods `fit` and `predict`.
- They must subclass `sklearn.base.BaseEstimator`.
- They must receive the parameters of the `__init__` function as explicit keyword arguments.

For clustering tasks, the number of clusters can be passed to the model through the interface.

Moreover, the model should implement the method `fit_predict(self, X)`, in addition to `fit(self, X)` and `predict(self, X)`.

Warning

Classes cannot be declared directly in the Models > Design tab. They must be packaged in a library and imported, as demonstrated in the examples below.

For more details and advanced examples, please refer to Advanced Custom Models.

#### Example¶

On the Models > Design > Algorithms tab, in the “Custom python model” code editor, you should create the `clf` variable:

```python
from custom_python_model import MyRandomClusteringModel

clf = MyRandomClusteringModel()
```

In `custom_python_model.py`:

```python
import numpy as np
from sklearn.base import BaseEstimator


class MyRandomClusteringModel(BaseEstimator):
    """This model assigns clusters randomly"""

    def fit(self, X):
        pass

    def predict(self, X):
        return np.random.choice([0, 1, 2], size=X.shape[0])

    def fit_predict(self, X):
        return np.random.choice([0, 1, 2], size=X.shape[0])
```

## Advanced Custom Models¶

This section shows advanced concepts for building custom models.

For simple use cases, please refer to the Custom Models sections above.

### Handling parameters¶

The estimator must be clonable by `sklearn.base.clone()`, which only clones attributes that have constructor arguments with the same name.

Therefore, when using parameters at the class level, custom models should always:

- receive the parameters of the `__init__` function as explicit keyword arguments
- implement `get_params(deep=True)` and `set_params(**params)`

These methods can either be implemented manually or by having the class extend `sklearn.base.BaseEstimator`.

### Retrieving column names in the custom model¶

In order to have access to the column names of the preprocessed dataset (i.e. `X` in the functions `fit` and `predict`), a method `set_column_labels(self, column_labels)` can be implemented in the model.
If this method exists, DSS will automatically call it and provide the list of the column names as argument.

Warning

The column labels passed to `set_column_labels` are the labels of the prepared and preprocessed columns resulting from the preparation script followed by features handling. Hence, their names may not correspond to the original columns of the dataset.
For instance, if automatic pairwise linear combinations were enabled, some columns may take the form `pw_linear:<A>+<B>`. To find the exact name of the columns, it is advisable to print the column labels received in `set_column_labels`.

#### Example¶

On the Models > Design > Algorithms tab, in the “Custom python model” code editor, you should create the `clf` variable:

```python
from custom_python_model import MyCustomRegressor

important_column_name = ...
clf = MyCustomRegressor(important_column_name)
```

In `custom_python_model.py`:

```python
from sklearn.base import BaseEstimator


class MyCustomRegressor(BaseEstimator):
    def __init__(self, important_column=None, column_labels=None):
        self.important_column = important_column
        self.column_labels = column_labels

    def set_column_labels(self, column_labels):
        # In order to preserve the attribute `column_labels` when cloning
        # the estimator, we have declared it as a keyword argument in the
        # `__init__` and set it there
        self.column_labels = column_labels

    def fit(self, X, y):
        if self.important_column is not None:
            # Retrieve the index of the important column
            column_index = self.column_labels.index(self.important_column)
            # Retrieve the corresponding data column
            column = X[:, column_index]
        # Finish the implementation of the fit function
        ...

    def predict(self, X):
        # Implement the predict function
        ...
```

#### Advanced example: Setting monotonicity constraints¶

The following example uses XGBoost and shows how to set monotonicity constraints on specific columns given their name.

On the Models > Design > Algorithms tab, in the “Custom python model” code editor, you should create the `clf` variable:

```python
from constrained_python_model import MyConstrainedRegressor

clf = MyConstrainedRegressor(["important_column"])
```

In `constrained_python_model.py`:

```python
import numpy as np
from sklearn.base import BaseEstimator
from xgboost import XGBRegressor


class MyConstrainedRegressor(BaseEstimator):
    def __init__(self, monotone_column_labels=None, column_labels=None,
                 xgb_regressor=None):
        if monotone_column_labels is None:
            self.monotone_column_labels = []
        else:
            self.monotone_column_labels = monotone_column_labels
        self.column_labels = column_labels
        if xgb_regressor is None:
            self.xgb = XGBRegressor()
        else:
            self.xgb = xgb_regressor

    def set_column_labels(self, column_labels):
        # In order to preserve the attribute `column_labels` when cloning
        # the estimator, we have declared it as a keyword argument in the
        # `__init__` and set it there
        self.column_labels = column_labels

    def fit(self, X, y):
        # Initialize the constraints array
        monotone_constraints = np.zeros(X.shape[1], int)
        for monotone_column_label in self.monotone_column_labels:
            # Retrieve the index of the column that should be monotonic
            # NB: the corresponding data would then be X[:, monotone_column_index]
            monotone_column_index = self.column_labels.index(monotone_column_label)
            # Set the increasing monotonic constraint for the corresponding column
            monotone_constraints[monotone_column_index] = 1
        # Convert the array into an XGBoost-compatible parameter
        stringified_monotone_constraints = "(" + ",".join(map(str, monotone_constraints)) + ")"
        # Instantiate and fit the XGBoost model
        self.xgb.set_params(monotone_constraints=stringified_monotone_constraints)
        self.xgb.fit(X, y)

    def predict(self, X):
        return self.xgb.predict(X)
```