Clustering settings¶
The “Settings” tab allows you to fully customize all aspects of your clustering.
Sampling¶
Note
You can access the sampling settings in Models > Settings > Sampling
The available sampling methods depend on the machine learning engine.
If your dataset does not fit in your RAM, you may want to extract a sample on which clustering will be performed. Data can be sampled from the beginning of a dataset (fastest) or randomly sampled from the entire dataset.
Features¶
See Features handling.
Dimensionality reduction¶
Note
You can access the sampling settings in Models > Settings > Dimensionality Reduction
Dimensionality reduction reduces the number of variables by arranging them into ‘principal components’ grouping together all correlated variables. The principal components are computed to carry as much variance as possible from the original dataset.
The main interest of using PCA for clustering is to improve the running time of the algorithms, especially when you have a large number of dimensions.
You can choose to enable it, disable it, or try both options to compare.
Outliers detection¶
Note
You can access the parameters for outlier detection in Models > Settings > Outlier detection
When performing clustering, it is generally recommended to detect outliers. Not doing so could generate very skewed clusters, or many small clusters and one cluster containing almost the whole dataset.
DSS detects outliers by performing a pre-clustering with a large number of clusters and considering the smallest “mini-clusters” as outliers, if:
their cluster size is less than a specified threshold (ex : 10)
their cumulative percentage is less than a specified threshold (ex: 1%)
Once outliers are detected, you can either:
Drop: outliers are dropped.
Cluster : create a “cluster” from all detected outliers.
Algorithms¶
Note
You can change the settings for algorithms under Models > Settings > Algorithms
DSS supports several algorithms that can be used for clustering. You can select multiple algorithms to see which performs best for your dataset.
The algorithms depend on the machine learning engine.