Contents
1. Introduction
2. What does the K-Means algorithm do?
3. Implementation in Python
4. Assessment and interpretation
5. Conclusions and next steps
Most widely used machine learning algorithms, such as linear regression, logistic regression, and decision trees, are useful for making predictions from labeled data: each input consists of feature values with an associated label value. This is what we call supervised learning.
However, we often need to process large datasets without associated labels. Imagine a business that needs to understand different customer groups based on their purchasing behavior, demographics, address and other information, so that it can offer better services, products and promotions.
These types of problems can be resolved through the use of Unsupervised learning techniques. The K-Means algorithm is an unsupervised learning algorithm widely used in Machine Learning. Its simple and elegant approach allows a dataset to be separated into a desired number of K distinct clusters, thereby enabling models to be learned from unlabeled data.
As stated earlier, the K-Means algorithm seeks to partition data points into a given number of clusters. The points within each cluster are similar, while the points in different clusters have considerable differences.
That said, a question arises: how do we define similarity or difference? In K-Means clustering, the Euclidean distance is the most common metric for measuring similarity.
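As a quick illustration, the Euclidean distance between two vectors can be computed with NumPy in one line (a minimal sketch; the helper name is ours):

```python
import numpy as np

# Euclidean distance: square root of the sum of squared coordinate differences
def euclidean_distance(x, y):
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(euclidean_distance(a, b))  # 5.0
```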
In the figure below we can clearly see 3 different groups. Thus, we could determine the centers of each group and each point would be associated with the nearest center.
By doing this, mathematically speaking, the idea is to minimize the within-cluster variance, that is, the total squared distance between each point and its nearest center.
Performing the task in the example above was simple because the data was two-dimensional and the groups were clearly distinct. However, as the number of dimensions increases and different values of K are considered, we need an algorithm to handle the complexity.
Step 1: Choose Initial Centers (Randomly)
We need to bootstrap the algorithm with initial center vectors, which can either be chosen randomly from the data or generated as random vectors with the same dimensions as the original data. See the white diamonds in the image below.
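A sketch of this random initialization step with NumPy, assuming we pick existing data points as the initial centers (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # toy data: 500 two-dimensional points
k = 3

# pick k distinct row indices and use those points as the initial centers
centers_idxs = rng.choice(len(X), size=k, replace=False)
centroids = X[centers_idxs]
print(centroids.shape)  # (3, 2)
```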
Step 2: Find the distances from each point to the centers
Now we will calculate the distance from each data point to the K centers. Then we associate each point with the center closest to that point.
Given a dataset with N entries and M features, the distance from a point to the centers v can be given by the following equation:

D_nk = sqrt( sum_{m=1}^{M} (x_nm - v_km)^2 )

Where:
k varies from 1 to K;
D_nk is the distance from point n to center k;
x_n is the point vector;
v_k is the center vector.

So, for each data point n we will have K distances, and we label the point with the center at the smallest distance:

label_n = argmin(D_n)

where D_n is a vector with the K distances.
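The assignment step above can be sketched with NumPy broadcasting (a minimal standalone illustration, not the class method implemented later):

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# distances[n, k] = Euclidean distance from point n to centroid k
distances = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))

# each point gets the label of its nearest centroid
labels = np.argmin(distances, axis=1)
print(labels)  # [0 0 1 1]
```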
Step 3: Find the K centroids and iterate
For each of the K clusters, recalculate the centroid: the new centroid is the average of all data points assigned to that cluster. Then update the centroid positions with the newly calculated values.
Check if the centroids have changed significantly from the previous iteration. This can be done by comparing the positions of the centroids in the current iteration with those in the last iteration.
If the centroids have changed significantly, return to step 2. Otherwise, the algorithm has converged and the process stops. See the image below.
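Steps 2 and 3 together form the main loop. A compact sketch of the assign-update-check cycle on toy data (variable names are illustrative):

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
tol = 1e-6

for _ in range(100):
    # step 2: assign each point to its nearest centroid
    d = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    labels = np.argmin(d, axis=1)
    # step 3: recompute each centroid as the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    # stop when the centroids barely move between iterations
    if np.linalg.norm(new_centroids - centroids) < tol:
        break
    centroids = new_centroids

print(centroids)  # [[ 0.  1.] [10. 11.]]
```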
Now that we know the fundamental concepts of the K-Means algorithm, it's time to implement a Python class. The packages used were NumPy for mathematical calculations, Matplotlib for visualization, and scikit-learn's make_blobs function for simulated data.
# import required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
The class will have the following methods:
- A constructor method to initialize the basic parameters of the algorithm: the number of clusters k, the maximum number of iterations max_iter, and the tolerance tol used to stop the optimization when there is no significant improvement.
- Helper methods for the optimization process during training: calculating the Euclidean distance, randomly choosing the initial centroids, assigning the closest centroid to each point, updating the centroid values, and checking whether the optimization has converged.
- A fitting method. As mentioned earlier, K-Means is an unsupervised learning technique, so it does not require labeled data during training; a single method can both fit the data and predict which cluster each data point belongs to.
- A method to evaluate the quality of the optimization by calculating the total squared error. This will be explored in the next section.
Here is the complete code:
# helper function to compute the Euclidean distance between two vectors
def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

class Kmeans:
    # constructor method for hyperparameter initialization
    def __init__(self, k=3, max_iter=100, tol=1e-06):
        self.k = k
        self.max_iter = max_iter
        self.tol = tol

    # randomly picks the initial centroids from the input data
    def pick_centers(self, X):
        centers_idxs = np.random.choice(self.n_samples, self.k, replace=False)
        return X[centers_idxs]

    # finds the closest centroid for each data point
    def get_closest_centroid(self, x, centroids):
        distances = [euclidean_distance(x, centroid) for centroid in centroids]
        return np.argmin(distances)

    # creates a list of lists containing the idxs of each cluster
    def create_clusters(self, centroids, X):
        clusters = [[] for _ in range(self.k)]
        labels = np.empty(self.n_samples)
        for i, x in enumerate(X):
            centroid_idx = self.get_closest_centroid(x, centroids)
            clusters[centroid_idx].append(i)
            labels[i] = centroid_idx
        return clusters, labels

    # calculates the centroids for each cluster using the mean value
    def compute_centroids(self, clusters, X):
        centroids = np.empty((self.k, self.n_features))
        for i, cluster in enumerate(clusters):
            centroids[i] = np.mean(X[cluster], axis=0)
        return centroids

    # helper function to verify if the centroids changed significantly
    def is_converged(self, old_centroids, new_centroids):
        distances = [euclidean_distance(old_centroids[i], new_centroids[i]) for i in range(self.k)]
        return sum(distances) < self.tol

    # method to train the data, find the optimized centroids and label each data point according to its cluster
    def fit_predict(self, X):
        self.n_samples, self.n_features = X.shape
        self.centroids = self.pick_centers(X)
        for i in range(self.max_iter):
            self.clusters, self.labels = self.create_clusters(self.centroids, X)
            new_centroids = self.compute_centroids(self.clusters, X)
            if self.is_converged(self.centroids, new_centroids):
                break
            self.centroids = new_centroids

    # method for evaluating the intracluster variance of the optimization
    def clustering_errors(self, X):
        cluster_values = [X[cluster] for cluster in self.clusters]
        squared_distances = []
        # calculation of total squared Euclidean distance
        for i, cluster_array in enumerate(cluster_values):
            squared_distances.append(np.sum((cluster_array - self.centroids[i]) ** 2))
        total_error = np.sum(squared_distances)
        return total_error
We will now use the Kmeans class to cluster the simulated data, generated with scikit-learn's make_blobs function. The data consists of 500 two-dimensional points with 4 fixed centers.
# create simulated data for examples
X, _ = make_blobs(n_samples=500, n_features=2, centers=4,
shuffle=False, random_state=0)
After performing the training using four clusters, we obtain the following result.
model = Kmeans(k=4)
model.fit_predict(X)
labels = model.labels
centroids = model.centroids
plot_clusters(X, labels, centroids)
In this case, the algorithm was able to calculate the clusters successfully in 18 iterations. However, we must keep in mind that we already know the optimal number of clusters from the simulated data. In real-world applications, we often don't know this value.
As stated previously, the K-Means algorithm aims to make the variance within each cluster as small as possible. The metric used to measure this variance is the total squared Euclidean distance, given by:

E = sum_{k=1}^{K} sum_{i=1}^{p} ||x_i - c_k||^2

Where:
p is the number of data points in a cluster;
c_k is the centroid vector of cluster k;
K is the number of clusters.
In simple terms, the above formula adds the distances of the data points to the nearest centroid. The error decreases as the number K increases.
In the extreme case of K = N, you have one cluster for each data point and this error will be zero.
Wilmott, Paul (2019).
If we plot the error versus the number of clusters and look at where the graph “curves”, we will be able to find the optimal number of clusters.
As we can see, the plot has an “elbow shape” and it curves at K = 4, which means that for higher values of K the decrease in total error will be less significant.
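This elbow search can be sketched as a simple loop over candidate values of K. For a self-contained snippet we use scikit-learn's KMeans here rather than the class above; its inertia_ attribute is the same total squared error:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# same simulated data: 500 two-dimensional points with 4 fixed centers
X, _ = make_blobs(n_samples=500, n_features=2, centers=4,
                  shuffle=False, random_state=0)

# total squared error (inertia) for each candidate number of clusters
errors = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    errors.append(km.inertia_)

# the "elbow" is where the error stops dropping sharply
print(errors)
```

Plotting `errors` against K with Matplotlib reproduces the elbow curve discussed above.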
In this article, we have covered the fundamental concepts of the K-Means algorithm, its uses and applications. Additionally, using these concepts, we implemented a Python class from scratch that performed clustering of simulated data, and we saw how to find the optimal value of K using a scree plot.
However, since this is an unsupervised technique, there is an additional step. The algorithm can successfully assign a label to the clusters, but the meaning of each label is a task that the data scientist or machine learning engineer will need to complete by analyzing the data from each cluster.
Additionally, I will leave a few points for further exploration:
- Our simulated data used two-dimensional points. Try using the algorithm for other datasets and find the optimal values for K.
- There are other widely used unsupervised learning algorithms, such as hierarchical clustering.
- Depending on the problem domain, it may be necessary to use other error measures such as Manhattan distance and cosine similarity. Try to investigate them.
Full code available here: