Contents
1. Introduction
2. What does the K-Means algorithm do?
3. Implementation in Python
4. Assessment and interpretation
5. Conclusions and next steps
Most widely used machine learning algorithms, such as linear regression, logistic regression, and decision trees, are useful for making predictions from labeled data: each input consists of feature values with an associated label value. This is what we call supervised learning.
However, we often need to process large datasets without associated labels. Imagine a business that needs to understand different customer groups based on their purchasing behavior, demographics, address and other information, so that it can offer better services, products and promotions.
These types of problems can be resolved through the use of Unsupervised learning techniques. The K-Means algorithm is an unsupervised learning algorithm widely used in Machine Learning. Its simple and elegant approach allows a dataset to be separated into a desired number of K distinct clusters, thereby enabling models to be learned from unlabeled data.
As stated earlier, the K-Means algorithm seeks to partition data points into a given number of clusters. The points within each cluster are similar, while the points in different clusters have considerable differences.
That said, a question arises: how do we define similarity or difference? In K-Means clustering, the Euclidean distance is the most common metric for measuring similarity.
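As a quick illustration, the Euclidean distance between two vectors can be computed with NumPy in one line (a minimal sketch; the helper name is ours):

```python
import numpy as np

# Euclidean distance: square root of the sum of squared coordinate differences
def euclidean_distance(x, y):
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(euclidean_distance(a, b))  # 5.0
```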
In the figure below we can clearly see 3 different groups. Thus, we could determine the centers of each group and each point would be associated with the nearest center.
By doing this, mathematically speaking, the idea is to minimize the within-cluster variance, that is, the total squared distance between each point and its nearest center.
Performing the task in the example above was simple because the data was two-dimensional and the groups were clearly distinct. However, as the number of dimensions increases and different values of K are considered, we need an algorithm to handle the complexity.
Step 1: Choose Initial Centers (Randomly)
We need to bootstrap the algorithm with initial center vectors, which can either be chosen randomly from the data or generated as random vectors with the same dimensions as the original data. See the white diamonds in the image below.
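A sketch of this random initialization step with NumPy, assuming we pick existing data points as the initial centers (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # toy data: 500 two-dimensional points
k = 3

# pick k distinct row indices and use those points as the initial centers
centers_idxs = rng.choice(len(X), size=k, replace=False)
centroids = X[centers_idxs]
print(centroids.shape)  # (3, 2)
```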
Step 2: Find the distances from each point to the centers
Now we will calculate the distance from each data point to the K centers. Then we associate each point with the center closest to that point.
Given a dataset with N entries and M features, the distance from a point to the centers v can be given by the following equation:

D_nk = sqrt( sum_{m=1}^{M} (x_nm - v_km)^2 )

Where:
k varies from 1 to K;
D_nk is the distance from point n to center k;
x_n is the point vector;
v_k is the center vector.

So, for each data point n we will have K distances, and we label the point with the center at the smallest distance:

label_n = argmin(D_n)

where D_n is a vector with the K distances.
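The assignment step above can be sketched with NumPy broadcasting (a minimal standalone illustration, not the class method implemented later):

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# distances[n, k] = Euclidean distance from point n to centroid k
distances = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))

# each point gets the label of its nearest centroid
labels = np.argmin(distances, axis=1)
print(labels)  # [0 0 1 1]
```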
Step 3: Find the K centroids and iterate
For each of the K clusters, recalculate the centroid: the new centroid is the average of all data points assigned to that cluster. Then update the centroid positions with the newly calculated values.
Check if the centroids have changed significantly from the previous iteration. This can be done by comparing the positions of the centroids in the current iteration with those in the last iteration.
If the centroids have changed significantly, return to step 2. Otherwise, the algorithm has converged and the process stops. See the image below.
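Steps 2 and 3 together form the main loop. A compact sketch of the assign-update-check cycle on toy data (variable names are illustrative):

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
tol = 1e-6

for _ in range(100):
    # step 2: assign each point to its nearest centroid
    d = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    labels = np.argmin(d, axis=1)
    # step 3: recompute each centroid as the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    # stop when the centroids barely move between iterations
    if np.linalg.norm(new_centroids - centroids) < tol:
        break
    centroids = new_centroids

print(centroids)  # [[ 0.  1.] [10. 11.]]
```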
Now that we know the fundamental concepts of the K-Means algorithm, it's time to implement a Python class. The packages used were NumPy for mathematical calculations, Matplotlib for visualization, and scikit-learn's make_blobs function for simulated data.
# import required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
The class will have the following methods:
- A constructor method to initialize the basic parameters of the algorithm: the number of clusters k, the maximum number of iterations max_iter, and the tolerance tol used to stop the optimization when there is no significant improvement.
- Helper methods for the optimization process during training: calculating the Euclidean distance, randomly choosing the initial centroids, assigning the closest centroid to each point, updating the centroid values, and checking whether the optimization has converged.
- A fitting method. As mentioned earlier, K-Means is an unsupervised learning technique, so it does not require labeled data during training; a single method can both fit the data and predict which cluster each data point belongs to.
- A method to evaluate the quality of the optimization by calculating the total squared error. This will be explored in the next section.
Here is the complete code:
# helper function to compute the Euclidean distance between two vectors
def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

class Kmeans:
    # constructor method for hyperparameter initialization
    def __init__(self, k=3, max_iter=100, tol=1e-06):
        self.k = k
        self.max_iter = max_iter
        self.tol = tol

    # randomly picks the initial centroids from the input data
    def pick_centers(self, X):
        centers_idxs = np.random.choice(self.n_samples, self.k, replace=False)
        return X[centers_idxs]

    # finds the closest centroid for each data point
    def get_closest_centroid(self, x, centroids):
        distances = [euclidean_distance(x, centroid) for centroid in centroids]
        return np.argmin(distances)

    # creates a list of lists containing the idxs of each cluster
    def create_clusters(self, centroids, X):
        clusters = [[] for _ in range(self.k)]
        labels = np.empty(self.n_samples)
        for i, x in enumerate(X):
            centroid_idx = self.get_closest_centroid(x, centroids)
            clusters[centroid_idx].append(i)
            labels[i] = centroid_idx
        return clusters, labels

    # calculates the centroids for each cluster using the mean value
    def compute_centroids(self, clusters, X):
        centroids = np.empty((self.k, self.n_features))
        for i, cluster in enumerate(clusters):
            centroids[i] = np.mean(X[cluster], axis=0)
        return centroids

    # helper function to verify if the centroids changed significantly
    def is_converged(self, old_centroids, new_centroids):
        distances = [euclidean_distance(old_centroids[i], new_centroids[i]) for i in range(self.k)]
        return sum(distances) < self.tol

    # method to train the data, find the optimized centroids and label each data point according to its cluster
    def fit_predict(self, X):
        self.n_samples, self.n_features = X.shape
        self.centroids = self.pick_centers(X)
        for i in range(self.max_iter):
            self.clusters, self.labels = self.create_clusters(self.centroids, X)
            new_centroids = self.compute_centroids(self.clusters, X)
            if self.is_converged(self.centroids, new_centroids):
                break
            self.centroids = new_centroids

    # method for evaluating the intracluster variance of the optimization
    def clustering_errors(self, X):
        cluster_values = [X[cluster] for cluster in self.clusters]
        squared_distances = []
        # calculation of total squared Euclidean distance
        for i, cluster_array in enumerate(cluster_values):
            squared_distances.append(np.sum((cluster_array - self.centroids[i]) ** 2))
        total_error = np.sum(squared_distances)
        return total_error
We will now use the Kmeans class to cluster the simulated data, generated with scikit-learn's make_blobs function. The data consists of 500 two-dimensional points with 4 fixed centers.
# create simulated data for examples
X, _ = make_blobs(n_samples=500, n_features=2, centers=4,
shuffle=False, random_state=0)
After performing the training using four clusters, we obtain the following result.
model = Kmeans(k=4)
model.fit_predict(X)
labels = model.labels
centroids = model.centroids
plot_clusters(X, labels, centroids)
In this case, the algorithm was able to calculate the clusters successfully in 18 iterations. However, we must keep in mind that we already know the optimal number of clusters from the simulated data. In real-world applications, we often don't know this value.
As stated previously, the K-Means algorithm aims to make the variance within each cluster as small as possible. The metric used to measure this variance is the total squared Euclidean distance, given by:

E = sum_{k=1}^{K} sum_{i=1}^{p} ||x_i - c_k||^2

Where:
p is the number of data points in a cluster;
c_k is the centroid vector of cluster k;
K is the number of clusters.
In simple terms, the above formula adds the distances of the data points to the nearest centroid. The error decreases as the number K increases.
In the extreme case of K = N, you have one cluster for each data point and this error will be zero.
Wilmott, Paul (2019).
If we plot the error versus the number of clusters and look at where the graph “curves”, we will be able to find the optimal number of clusters.
As we can see, the plot has an “elbow shape” and it curves at K = 4, which means that for higher values of K the decrease in total error will be less significant.
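This elbow search can be sketched as a simple loop over candidate values of K. For a self-contained snippet we use scikit-learn's KMeans here rather than the class above; its inertia_ attribute is the same total squared error:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# same simulated data: 500 two-dimensional points with 4 fixed centers
X, _ = make_blobs(n_samples=500, n_features=2, centers=4,
                  shuffle=False, random_state=0)

# total squared error (inertia) for each candidate number of clusters
errors = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    errors.append(km.inertia_)

# the "elbow" is where the error stops dropping sharply
print(errors)
```

Plotting `errors` against K with Matplotlib reproduces the elbow curve discussed above.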
In this article, we have covered the fundamental concepts of the K-Means algorithm, its uses and applications. Additionally, using these concepts, we implemented a Python class from scratch that performed clustering of simulated data, and we saw how to find the optimal value of K using a scree plot.
However, since this is an unsupervised technique, there is an additional step. The algorithm can successfully assign a label to the clusters, but the meaning of each label is a task that the data scientist or machine learning engineer will need to complete by analyzing the data from each cluster.
Additionally, I will leave a few points for further exploration:
- Our simulated data used two-dimensional points. Try using the algorithm for other datasets and find the optimal values for K.
- There are other widely used unsupervised learning algorithms, such as hierarchical clustering.
- Depending on the problem domain, it may be necessary to use other error measures such as Manhattan distance and cosine similarity. Try to investigate them.
Full code available here: