Clustering (Unsupervised ML)
Clustering (a form of unsupervised machine learning) helps you understand the structure of your data. For instance, you could group customers into clusters based on their payment history, and use those clusters to guide sales strategies.
Unlike supervised machine learning, unsupervised machine learning does not require a target.
Use the following steps to access unsupervised machine learning in DSS:
- Go to the Flow for your project
- Click on the dataset you want to use
- Select the Lab
- Create a new visual analysis
- Click on the Models tab
- Select Create first model
- Select Clustering
You can access the sampling settings in Models > Settings > Sampling.
The available sampling methods depend on the machine learning engine.
If your dataset does not fit in your RAM, you may want to extract a sample on which clustering will be performed. Data can be sampled from the beginning of a dataset (fastest) or randomly sampled from the entire dataset.
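The two sampling strategies described above can be illustrated outside DSS with a minimal Python sketch. This is not how DSS implements sampling; the function names `head_sample` and `random_sample` are hypothetical, and the random variant uses reservoir sampling as one possible way to draw a uniform sample in a single pass without loading the whole dataset in memory.

```python
import itertools
import random


def head_sample(rows, n):
    """Take the first n rows: fastest, but biased if the data is ordered."""
    return list(itertools.islice(rows, n))


def random_sample(rows, n, seed=0):
    """Uniform random sample of n rows in one pass (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < n:
            # Fill the reservoir with the first n rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability n / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < n:
                reservoir[j] = row
    return reservoir


# Stand-in for a dataset that is streamed row by row.
first = head_sample(iter(range(1000)), 5)
rand = random_sample(iter(range(1000)), 5)
```

The head sample only reads the rows it returns, which is why it is the fastest option. The random sample has to read every row once, but it is not affected by the order in which the dataset is stored.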
You can access the dimensionality reduction settings in Models > Settings > Dimensionality Reduction.
Dimensionality reduction reduces the number of variables by combining them into ‘principal components’, each of which groups correlated variables together. The principal components are computed to carry as much of the original dataset’s variance as possible.
The main benefit of applying principal component analysis (PCA) before clustering is a shorter running time for the algorithms, especially when your dataset has a large number of dimensions.
You can choose to enable it, disable it, or try both options to compare.
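To see what "trying both options" looks like in practice, here is a minimal sketch using scikit-learn rather than DSS itself. The synthetic data, the 95% variance threshold, and the choice of 3 clusters are all assumptions made for illustration; DSS may use different defaults.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 2 latent factors spread over 20 correlated columns.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 20))

# PCA is sensitive to scale, so standardize the columns first.
X_scaled = StandardScaler().fit_transform(X)

# Keep the smallest number of components that retains 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

# Cluster with and without dimensionality reduction to compare.
labels_full = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
labels_pca = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)

print("columns before PCA:", X_scaled.shape[1])
print("columns after PCA:", X_reduced.shape[1])
```

Because the clustering algorithm only sees the reduced columns, each distance computation is cheaper. Comparing `labels_full` and `labels_pca` (for example with a cross-tabulation) shows whether the reduction changed the resulting clusters.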
You can access the parameters for outlier detection in Models > Settings > Outlier detection.
When performing clustering, it is generally recommended to detect outliers. Not doing so could generate very skewed clusters, or many small clusters and one cluster containing almost the whole dataset.
DSS detects outliers by performing a pre-clustering with a large number of clusters and considering the smallest “mini-clusters” as outliers, if:
- their cluster size is less than a specified threshold (e.g. 10)
- their cumulative percentage is less than a specified threshold (e.g. 1%)
Once outliers are detected, you can either:
- Drop: discard the outliers.
- Cluster: group all detected outliers into a single “cluster”.
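The mini-cluster approach described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the DSS implementation: the function name `detect_outliers`, the use of k-means for the pre-clustering, the choice of 50 mini-clusters, and the reading that both thresholds must hold at once are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def detect_outliers(X, n_mini_clusters=50, min_size=10, max_cum_fraction=0.01, seed=0):
    """Return a boolean mask marking the rows that fall in outlier mini-clusters."""
    # Pre-clustering with a large number of clusters.
    labels = KMeans(n_clusters=n_mini_clusters, n_init=10, random_state=seed).fit_predict(X)
    sizes = np.bincount(labels, minlength=n_mini_clusters)

    # Walk through the mini-clusters from smallest to largest.
    order = np.argsort(sizes)
    cum_fraction = np.cumsum(sizes[order]) / len(X)

    # Assumption: a mini-cluster is an outlier only if BOTH conditions hold.
    selected = (sizes[order] < min_size) & (cum_fraction < max_cum_fraction)
    return np.isin(labels, order[selected])


rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(1000, 2)),
    rng.normal(loc=8.0, scale=1.0, size=(1000, 2)),
    [[60.0, -60.0], [-70.0, 70.0], [90.0, 90.0]],  # three isolated points
])
mask = detect_outliers(X)

# "Drop": run the final clustering on the remaining rows only.
X_kept = X[~mask]
final_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_kept)

# "Cluster": keep every row and give all outliers one extra cluster label.
labels_all = np.full(len(X), 2)
labels_all[~mask] = final_labels
```

The cumulative threshold caps the total share of rows that can be flagged, so a dataset made of many legitimately small groups is not discarded wholesale.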