What is the best clustering in my data when labels are unknown?

Reda Merzouki
5 min read · Mar 10, 2021


Best Clustering in KMeans using Silhouette Score

Unsupervised learning: finding the best split in data with unknown labels.

Ethereal clusters (photo by Nareeta Martin on Unsplash)

Imagine you have millions of unlabeled data points to explore. Each point is represented by a vector, and all these vectors are grouped into a matrix or a table. The more data you have, in quantity and variety, the better you will be able to capture the phenomenon you are trying to describe, such as categories of customers, states of a machine or different states of a disease. In this case, you don't know how many states or classes are represented in the data at your disposal, because they are not labeled and you do not have business experts to manually label millions of records. You just know that the number of labels is greater than or equal to two. In other words, you have to solve an unsupervised classification problem and then answer questions such as: how many different states does my machine have?… Personally, I have already encountered this situation in real use cases.

In this article, for the sake of simplicity and confidentiality, I will generate data and try to recover the original clustering using the Python code snippets I share with you below.

We will use the Silhouette Coefficient score, which is the average of the Silhouette Coefficient over all examples in the dataset.

Indeed, when the "ground truth" labels are unknown, we need to evaluate different clusterings using the model itself. The Silhouette Coefficient score is a metric that allows exactly this type of evaluation.

A high Silhouette Coefficient score (close to 1) indicates a model with better-defined clusters. Here, we will be using the KMeans implementation of scikit-learn.

The Silhouette Coefficient is defined for each sample and is composed of two scores:

a: The mean distance between a sample and all other points in the same class.

b: The mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient s for a single sample is then given as:
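s = (b - a) / max(a, b)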

1- Import some useful modules
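A minimal set of imports covering everything used below might look like this (NumPy and pandas for data handling, Matplotlib for plotting, and scikit-learn for KMeans, silhouette_score and MinMaxScaler):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler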

2- Generate a dataset with variables coming from 3 different distributions

For the sake of simplicity, we will generate 3 datasets with only one variable each. The code in the third section, however, is intended to be used with multivariate datasets as well.
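As an illustration (the choice of Gaussians and their parameters here is my own assumption), we can draw 1000 points each from three distributions with well-separated means:

# Three univariate samples of 1000 points each, drawn from Gaussians
# with different means and spreads (illustrative parameters)
np.random.seed(42)
dataset_1 = np.random.normal(loc=0.0, scale=1.0, size=1000)
dataset_2 = np.random.normal(loc=10.0, scale=1.5, size=1000)
dataset_3 = np.random.normal(loc=20.0, scale=2.0, size=1000)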

Now let's group and shuffle the data:
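Assuming the three arrays generated above, one way to do this with pandas is:

# Stack the three samples into a single-column DataFrame, then shuffle the rows
data = pd.DataFrame(np.concatenate([dataset_1, dataset_2, dataset_3]), columns=["x"])
data = data.sample(frac=1, random_state=42).reset_index(drop=True)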

In this article, we are using a contrived dataset with 3000 examples and one variable. In the real world, of course, we encounter multivariate datasets with millions of rows, and we may then need to sub-sample our initial dataset to improve computation time. The sub-sample must be representative of the original population.
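For example, with pandas one could draw a simple random sub-sample (the DataFrame name big_data and the size of 100,000 rows here are purely hypothetical):

# Hypothetical example: down-sample a large multivariate dataset
# (check afterwards that the sub-sample remains representative)
sub_sample = big_data.sample(n=100_000, random_state=42)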

3- Write a function that makes use of the silhouette_score metric

This function will help you find the best split in data with unknown labels, and it will also provide you with a labeled dataset.

When your variables take values on different scales, this can negatively influence the performance of an algorithm. This is the reason why we need to scale our data as part of a preprocessing step.

Decision trees and Random Forests are two of the very few machine learning algorithms where we don't need to worry about feature scaling: those algorithms are scale invariant. Rescaling is useful for the optimization algorithms at the core of machine learning, like gradient descent, but also for algorithms that weight inputs, like regression and neural networks, and for algorithms that rely on distance measures, like k-nearest neighbors or K-Means. We can rescale our data using, for example, the MinMaxScaler class of scikit-learn. In the function below, we set the parameter "scaling" to True whenever we want to scale our data. Notice that since our data is univariate, we do not need to set this parameter to True.
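The full snippet lives on GitHub; a sketch of the function, consistent with how it is called in the next section (the k range of 2 to 10, the random_state and the bar-chart visualization are my assumptions), could look like this:

def Best_Clustering(data, k_min=2, k_max=10, scaling=False, visualization=True):
    # Optionally rescale the features to [0, 1]
    X = MinMaxScaler().fit_transform(data.values) if scaling else data.values

    # Fit KMeans for each candidate number of clusters and record
    # the mean Silhouette Coefficient of the resulting clustering
    scores, models = {}, {}
    for k in range(k_min, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        scores[k] = silhouette_score(X, model.labels_)
        models[k] = model

    # Keep the number of clusters with the highest score
    best_k = max(scores, key=scores.get)
    best_params = {"n_clusters": best_k, "silhouette_score": scores[best_k]}

    # Attach the winning labels to a copy of the original data
    my_labeled_data = data.copy()
    my_labeled_data["label"] = models[best_k].labels_

    if visualization:
        plt.bar(list(scores.keys()), list(scores.values()))
        plt.xlabel("Number of clusters k")
        plt.ylabel("Mean Silhouette Coefficient")
        plt.title("Best clustering: k = %d" % best_k)
        plt.show()

    return best_params, my_labeled_data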

4- What is the best clustering for my data?

We will now use the function written above to assess the best clustering in our data.

• Visualization of the best clustering:

# We don't need to scale the data since we only have one variable
Best_Clustering(data=data, scaling=False)

• Getting the best parameters and my labeled data so that we can use them later:

best_params, my_labeled_data = Best_Clustering(data=data, scaling=False, visualization=False)
best_params
my_labeled_data

Conclusion

With this function, we were able to determine the number of clusters in the unlabeled data: 3 is exactly the number of clusters in the initially generated data. Thus, we were able to automatically label an initially unlabeled dataset.

Note: do not hesitate to use my code snippets on GitHub to assess the best clustering for multivariate unlabeled datasets and provide them with labels.

Should you have any questions or would like to stay in touch, feel free to contact me on LinkedIn: Reda Merzouki

Thank you for reading!

