K-means algorithm… It’s all about the story! Our first “meeting” was in the spring of 2013: the third year of university, the Data Analysis module… It was one of my favorite modules and the first time I dug deep into data classification algorithms.
That module had a pretty complex final evaluation, one step being to conduct and present the methodology of a classification technique. My team and I chose to present… guess what?! You know, the K-Means algorithm. 😀
Let the story begin…
According to Wikipedia, K-Means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
Firstly, we have to keep the difference between K-Means and the KNN (K Nearest Neighbors) classifier clear in mind. For a better understanding, remember that KNN is a form of supervised learning, while K-Means is a form of unsupervised learning.
K-Means aims to split a set of n points into k clusters such that the points in each cluster tend to be near each other. It is considered unsupervised because the points do not carry any external classification.
To determine the classification of a point, the KNN algorithm combines the classifications of the K nearest points. It is supervised learning because it classifies a point based on the known classifications of other points.
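To make the contrast concrete, here is a minimal KNN sketch in plain Python. The data points and labels are made up for illustration:

```python
# A tiny KNN classifier: majority vote among the k nearest *labeled* points.
from collections import Counter
import math

def knn_classify(point, labeled_points, k=3):
    """Classify `point` by majority vote among its k nearest labeled neighbors."""
    neighbors = sorted(labeled_points,
                       key=lambda pl: math.dist(point, pl[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy labeled data: (coordinates, class label) -- the supervision KNN needs
data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
        ((5.0, 5.0), "B"), ((4.8, 5.2), "B"), ((5.1, 4.9), "B")]
print(knn_classify((1.1, 1.0), data))  # prints "A"
```

Note that KNN needs the labels up front; K-Means, as we will see, gets none.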
- Set the number of clusters a priori.
To run the K-Means algorithm we have to set the number of clusters k a priori. The main idea is then to define a centroid for each cluster. These centroids should be placed cleverly, because different locations cause different results. The best choice is to place them as far away from each other as possible.
An empirical way to find the best number of clusters is to try K-Means clustering with different numbers of clusters and measure the resulting sum of squares.
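The quantity being measured is the within-cluster sum of squares. A minimal sketch, with made-up points and hand-placed centroids just to show the measurement itself:

```python
# Within-cluster sum of squares (WSS): total squared distance from each
# point to its nearest centroid. Lower WSS = tighter clusters.
import math

def wss(points, centroids):
    """Total squared distance from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
print(wss(points, [(1, 1.5), (8.5, 8)]))  # two well-placed centroids: small WSS
print(wss(points, [(5, 5)]))              # a single centroid: much larger WSS
```

Plotting WSS against k and looking for the point where the curve flattens is the usual "elbow" heuristic.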
- Associate each point to the nearest centroid
Now we take each point in the data set and associate it with the nearest centroid. When no point is pending, the first step is complete and an early grouping is done.
We also have to initialize the positions of the centroids. How? This is debatable… Mostly, it is up to you!
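This assignment step can be sketched in a few lines (points and centroids here are hypothetical):

```python
# Assignment step: attach each point to the index of its nearest centroid.
import math

def assign(points, centroids):
    """Return a list of cluster indices, one per point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

points = [(1, 1), (2, 1), (8, 9), (9, 8)]
centroids = [(0, 0), (10, 10)]
print(assign(points, centroids))  # -> [0, 0, 1, 1]
```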
But here are some common methods:
Forgy: set the positions of the k centroids to k observations chosen randomly from the dataset.
Random partition: assign a cluster randomly to each observation, then compute each centroid as the mean of the observations assigned to it: c_j = (1/|S_j|) · Σ_{x_i ∈ S_j} x_i.
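Both initialization schemes are simple enough to sketch in plain Python (the empty-cluster fallback in the second one is my own choice, not part of the classic description):

```python
# Two common initialization schemes for K-Means.
import random

def forgy_init(points, k):
    """Forgy: pick k distinct observations as the initial centroids."""
    return random.sample(points, k)

def random_partition_init(points, k):
    """Random partition: label each point randomly, then take cluster means."""
    labels = [random.randrange(k) for _ in points]
    centroids = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        if members:
            # centroid = coordinate-wise mean of the cluster's points
            centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            # a cluster may come up empty; fall back to a random observation
            centroids.append(random.choice(points))
    return centroids
```

Forgy tends to spread the initial centroids out, while random partition places them all near the center of the data.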
- Recalculate k new centroids
At this point we recalculate the k centroids as the barycenters of the clusters resulting from the previous step. Once we have these new centroids, a new binding is done between the same data set points and the nearest new centroid. A loop has thus been generated: with each pass, the k centroids change their location step by step until no more changes occur. In other words, the centroids no longer move.
- Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
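The steps above can be sketched end to end in plain Python. This is a minimal sketch, not a production implementation: the data is made up, and Forgy initialization is assumed:

```python
# K-Means end to end: initialize, assign, recompute, repeat until stable.
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)  # Forgy initialization
    for _ in range(max_iter):
        # Step 2: associate each point with its nearest centroid
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3: recompute each centroid as the barycenter of its cluster
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new_centroids.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centroids[j])
        # Step 4: stop once the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (8.5, 8.2)]
centroids, labels = k_means(points, 2)
print(sorted(centroids))
```

On this toy data the loop separates the two obvious groups regardless of which observations Forgy happens to pick.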
Finally, this algorithm aims at minimizing an objective function, in this case a squared error function:

J = Σ_{j=1}^{k} Σ_{i=1}^{n} ‖x_i^{(j)} − c_j‖²

where ‖x_i^{(j)} − c_j‖² is a chosen distance measure between a data point x_i^{(j)} and the cluster center c_j, and J is an indicator of the distance of the n data points from their respective cluster centers.