Unsupervised Machine Learning

Unsupervised machine learning is used to understand the structure of your data. For instance, you could group customers into clusters based on their payment history, which could be used to guide sales strategies.

Note

Unlike supervised machine learning, you don’t need a target variable to conduct unsupervised machine learning.

Running Unsupervised Machine Learning in DSS

Use the following steps to access unsupervised machine learning in DSS:

  • Go to the Flow for your project
  • Click on the dataset you want to use
  • Select the Analyse widget on the right
  • Click on the Models tab
  • Select Create a new model
  • Select Clustering

Sampling

Note

You can access the sampling settings in Models > Settings > Sampling

If your dataset does not fit in RAM, you may want to extract a sample on which the clustering will be performed. Data can be sampled from the beginning of the dataset (fastest) or sampled randomly from the entire dataset.
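For illustration outside of DSS, the two strategies can be mimicked with pandas; the file name and sample size below are hypothetical:

    import pandas as pd

    SAMPLE_SIZE = 100_000  # hypothetical sample size

    # "First records" sampling: fastest, reads only the start of the file
    head_sample = pd.read_csv("customers.csv", nrows=SAMPLE_SIZE)

    # Random sampling: requires reading the entire dataset first
    full = pd.read_csv("customers.csv")
    random_sample = full.sample(n=SAMPLE_SIZE, random_state=0)

Random sampling is slower but avoids the bias of taking only the first records, which may not be representative if the data is ordered.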

Dimensionality Reduction

Note

You can access the dimensionality reduction settings in Models > Settings > Dimensionality Reduction

Dimensionality reduction reduces the number of variables by combining them into ‘principal components’ that group together correlated variables. The principal components are computed to carry as much of the variance of the original dataset as possible.

The main benefit of using PCA for clustering is to improve the running time of the algorithms, especially when you have a large number of dimensions.

You can choose to enable it, disable it, or try both options to compare.
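As a rough sketch of what this step does, here is PCA with scikit-learn; the data and the 95% variance threshold are illustrative assumptions, not DSS defaults:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(500, 50)             # stand-in for a wide dataset

    # PCA is scale-sensitive, so standardise the features first
    X_scaled = StandardScaler().fit_transform(X)

    # Keep enough components to explain 95% of the variance
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_scaled)

    print(X_reduced.shape)                      # fewer columns than the original
    print(pca.explained_variance_ratio_.sum())  # variance actually retained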

Outlier Detection

Note

You can access the parameters for outlier detection in Models > Settings > Outlier detection

When performing clustering, it is generally recommended to detect outliers. Not doing so could generate very skewed clusters, or many small clusters and one cluster containing almost the whole dataset.

DSS detects outliers by performing a pre-clustering with a large number of clusters and treating the smallest “mini-clusters” as outliers if:

  • their cluster size is less than a specified threshold (e.g. 10)
  • their cumulative percentage is less than a specified threshold (e.g. 1%)

Once outliers are detected, you can either:

  • Drop: discard the detected outliers.
  • Cluster: group all detected outliers into a “cluster” of their own.
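DSS does not document the exact implementation, but the pre-clustering approach described above can be sketched with scikit-learn’s KMeans; the data, cluster count, and thresholds below are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(10_000, 5)            # stand-in for rescaled features

    # Pre-cluster with a deliberately large number of clusters
    pre = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)
    sizes = np.bincount(pre.labels_)

    MIN_SIZE = 10          # "cluster size less than 10"
    MAX_CUM_RATIO = 0.01   # "cumulative percentage less than 1%"

    # Walk through mini-clusters from smallest to largest
    order = np.argsort(sizes)
    cum_ratio = np.cumsum(sizes[order]) / len(X)
    outlier_clusters = order[(sizes[order] < MIN_SIZE)
                             & (cum_ratio < MAX_CUM_RATIO)]
    is_outlier = np.isin(pre.labels_, outlier_clusters)

    # "Drop": keep only inliers; "Cluster": give outliers their own label
    X_dropped = X[~is_outlier]
    labels_with_outlier_cluster = np.where(is_outlier, -1, pre.labels_)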

Algorithms

Note

You can access the algorithm used for clustering in Models > Settings > Algorithms

DSS supplies several algorithms that can be used for clustering. You can select multiple algorithms to see which performs best on your dataset.

K-means

The k-means algorithm clusters data by trying to separate the samples into n groups of equal variance, minimizing a criterion known as the ‘inertia’ of the groups: the sum of squared distances between each sample and its nearest cluster centroid.

In k-means clustering, you must specify the number of desired clusters. You can try multiple values by providing a comma-separated list.
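A minimal sketch with scikit-learn’s KMeans, looping over several cluster counts the way a comma-separated list would in DSS (the data and values of k are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    X = StandardScaler().fit_transform(np.random.rand(1_000, 4))

    for k in (3, 5, 7):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # inertia_ is the criterion k-means minimizes: the sum of squared
        # distances between each sample and its nearest cluster centroid
        print(k, model.inertia_)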

Mini-batch K-means

Mini-batch k-means is a variant of the k-means algorithm that uses mini-batches to reduce computation time, while still attempting to optimise the same objective function.

In mini-batch k-means clustering, you must specify the number of desired clusters. You can try multiple values by providing a comma-separated list.
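For illustration, scikit-learn provides MiniBatchKMeans; the batch size below is an illustrative choice:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    X = np.random.rand(100_000, 8)   # large enough for batching to pay off

    # Each iteration fits on a small random batch instead of the full dataset,
    # trading a little cluster quality for a large speed-up
    model = MiniBatchKMeans(n_clusters=5, batch_size=1_024,
                            random_state=0).fit(X)
    print(model.inertia_)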

Ward Hierarchical Clustering

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, and the leaves are the clusters containing only one sample.

In Ward hierarchical clustering, you must specify the number of desired clusters. You can try multiple values by providing a comma-separated list.
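For illustration, Ward linkage is available through scikit-learn’s AgglomerativeClustering (the data and cluster count are stand-ins):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    X = np.random.rand(500, 4)

    # linkage="ward" merges, at each step, the pair of clusters whose union
    # least increases the total within-cluster variance
    model = AgglomerativeClustering(n_clusters=4, linkage="ward")
    labels = model.fit_predict(X)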

Spectral Clustering

The spectral clustering algorithm uses the graph distance in the nearest-neighbor graph. It performs a low-dimensional embedding of the affinity matrix between samples, followed by k-means clustering in the low-dimensional space.

There are two parameters that you can modify in spectral clustering:

  • Number of clusters: You can try several values by using a comma-separated list
  • Affinity measure: The method for computing the distance between samples. Possible options are nearest neighbors, RBF kernel, and polynomial kernel.
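For illustration, both parameters map directly onto scikit-learn’s SpectralClustering (the data and parameter values are stand-ins):

    import numpy as np
    from sklearn.cluster import SpectralClustering

    X = np.random.rand(300, 4)

    # affinity="nearest_neighbors" builds the nearest-neighbor graph;
    # "rbf" and "polynomial" kernels are the alternative affinity measures
    model = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0)
    labels = model.fit_predict(X)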

DBSCAN

The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can take any shape, as opposed to k-means, which assumes that clusters are convex. Numerical features should use standard rescaling.

There are two parameters that you can modify in DBSCAN:

  • Epsilon: The maximum distance for two samples to be considered part of the same neighborhood. You can try several values by using a comma-separated list
  • Min. sample ratio: The minimum ratio of records required to form a cluster
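For illustration, here is scikit-learn’s DBSCAN; note that scikit-learn takes an absolute count (min_samples) rather than a ratio, so a DSS-style ratio would be multiplied by the number of records (the values below are stand-ins):

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    # Standard rescaling, as recommended for numerical features
    X = StandardScaler().fit_transform(np.random.rand(1_000, 3))

    model = DBSCAN(eps=0.5, min_samples=10)
    labels = model.fit_predict(X)   # label -1 marks noise points in no cluster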