February 1, 2023
Clustering is one of the most common tasks in unsupervised machine learning. We use clustering algorithms to segment a dataset into groups based on the patterns in the data.
For instance, let's say we have data about thousands of our company's customers. We could use a clustering algorithm to split these customers into groups in order to apply different marketing strategies to each group.
To perform this kind of segmentation, there are two main machine learning algorithms we could use:
The k-means algorithm
Hierarchical clustering
In this tutorial, we'll focus on the k-means algorithm.
You don't need advanced knowledge to follow along with this tutorial, but we do expect you to be familiar with basic Python and the pandas library.
The K-means Algorithm
The k-means algorithm is an iterative algorithm designed to find a split for a dataset given a number of clusters set by the user. The number of clusters is called K.
The algorithm starts by randomly choosing K points to be the initial centers of the clusters. These points are called the clusters' centroids.
Then, as an iterative algorithm, k-means repeats the same process until it reaches a determined stopping condition. In this process, the following steps are repeated:
Calculate the distance from each data point to each centroid.
Assign each data point to its closest centroid.
Recalculate the centroids as the mean of the points in their cluster.
The stopping condition is met when the centroids sit at the mean of their clusters, which means they will no longer change.
The animation below illustrates this process:
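Before we build this step by step with pandas, here's a compact NumPy sketch of the loop described above. It's a simplified illustration under stated assumptions (the data is a 2-D float array, and edge cases such as empty clusters are ignored); kmeans_sketch is just a name chosen for this preview, not a function we'll use later:

import numpy as np

def kmeans_sketch(X, k, max_iterations=50, seed=1):
    rng = np.random.default_rng(seed)
    # Pick k distinct points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iterations):
        # Step 1: distance from every point to every centroid (shape: n_points x k)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 2: assign each point to its closest centroid
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # Centroids stopped moving: the stopping condition is met
        centroids = new_centroids
    return labels, centroids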
Implementing K-means from Scratch
Now that we understand k-means, we'll start building an implementation from scratch.
To put our algorithm to work, we'll use a dataset of customers that we intend to segment based on their annual income and spending score. You can access the dataset here. It contains the following variables:
CustomerID: a unique identifier for each customer
Annual Income: the customer's annual income in thousands of dollars
Spending Score: a score based on customer shopping patterns, ranging from 1 to 100
Here are the first five rows of the dataset:
import pandas as pd
import numpy as np

customers = pd.read_csv('customers_clustering.csv')
customers.head()
   CustomerID  Annual Income  Spending Score
0           1             15              39
1           2             15              81
2           3             16               6
3           4             16              77
4           5             17              40
In order to segment this dataset into two clusters, we'll write a series of functions that, once combined, will behave as a simplified version of k-means.
The first function is the get_centroids() function. This is a very simple piece of code responsible for randomly choosing K customers as the initial centroids. It returns the chosen rows and the values of the two variables we're using as a list of lists. From now on, we'll refer to these values as coordinates.
Here's the function:
def get_centroids(df, k):
    # Fixing the seed so the initial centroids are reproducible
    np.random.seed(1)
    centroids = df.sample(k).reset_index(drop=True)
    return centroids, centroids.values.tolist()
And here's the output when we set k=2:
centroids, coords = get_centroids(customers[['Annual Income', 'Spending Score']], k=2)
print(centroids)
print(coords)
   Annual Income  Spending Score
0             46              51
1             38              35
[[46, 51], [38, 35]]
Next, we'll write a function to calculate the distance from every customer to each of the two initial centroids. The calculate_distance() function takes in the data and the coordinates of the centroids:
def calculate_distance(df, centroids_coords):
    names = []
    for i, centroid in enumerate(centroids_coords):
        name = f'dist_centroid_{i + 1}'
        # Euclidean distance from each row to this centroid
        df.loc[:, name] = np.sqrt((df.iloc[:, 0] - centroid[0])**2 + (df.iloc[:, 1] - centroid[1])**2)
        names.append(name)
    return df, names
df, dist_columns = calculate_distance(customers[['Annual Income', 'Spending Score']], coords)
print(df)
print(dist_columns)
     Annual Income  Spending Score  dist_centroid_1  dist_centroid_2
0               15              39        33.241540        23.345235
1               15              81        43.139309        51.429563
2               16               6        54.083269        36.400549
3               16              77        39.698866        47.413078
4               17              40        31.016125        21.587033
..             ...             ...              ...              ...
195            120              79        79.120162        93.059121
196            126              28        83.240615        88.277970
197            126              74        83.240615        96.254870
198            137              18        96.798760       100.448992
199            137              83        96.462428       110.022725

[200 rows x 4 columns]
['dist_centroid_1', 'dist_centroid_2']
Notice that the function adds new columns to the DataFrame representing the distance from each customer to each centroid. It then returns the modified DataFrame and a list with the names of the new columns.
To calculate the distance, we used the Euclidean distance formula.
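For two points (x1, y1) and (x2, y2), that formula is d = √((x1 − x2)² + (y1 − y2)²), which is exactly what the line inside the loop above computes for each customer and centroid.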
With these two functions working, we'll start writing our main function. The kmeans() function takes in two arguments: the DataFrame to be segmented and a value for k.
The first two steps are to store the names of the original variables in the dataset and then call the get_centroids() function to initialize the centroids. Here's how it looks at this point:
def kmeans(df, k):
    variables = df.columns
    centroids, coords = get_centroids(df, k)
Notice that we're supposed to pass the DataFrame containing only the variables used in the clustering process, and not the CustomerID, for instance, as we've been doing already.
Building on that, inside a loop, we'll perform three other steps:
Create a copy of the coordinates we got from the get_centroids() function.
Use those coordinates to calculate the distances from each customer to each centroid with the calculate_distance() function.
Create a cluster column in the DataFrame containing the number of the centroid closest to each customer.
The code below shows the function after these changes. We added a print and a break statement so we can visualize what the data looks like after the first iteration.
def kmeans(df, k):
    variables = df.columns
    centroids, coords = get_centroids(df, k)
    while True:
        # Creating a copy of the coordinates to compare with the new coordinates that we'll generate
        last_coords = coords.copy()
        df, dist_columns = calculate_distance(df, coords)
        # The number at the end of the closest centroid's column name becomes the cluster
        df['cluster'] = df[dist_columns].idxmin(axis=1).str.split('_').str[-1]
        print(df)
        break
We use the idxmin() method to get, for each row, the name of the column with the lowest value in a DataFrame containing only the dist_columns, which are the columns we created to represent the distance from each customer to each centroid.
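To see what idxmin() does here, consider a tiny hypothetical DataFrame of distances (the values below are made up for illustration):

import pandas as pd

dists = pd.DataFrame({'dist_centroid_1': [5.0, 2.0, 9.0],
                      'dist_centroid_2': [3.0, 7.0, 1.0]})
# idxmin(axis=1) returns the name of the column holding the smallest value in each row
print(dists.idxmin(axis=1))
# 0    dist_centroid_2
# 1    dist_centroid_1
# 2    dist_centroid_2
# Splitting each name on '_' and taking the last piece yields the cluster numbers: 2, 1, 2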
When we run the function as it is, we can see that we have a cluster, 1 or 2, assigned to each customer:
kmeans(customers[['Annual Income', 'Spending Score']], k=2)
     Annual Income  Spending Score  dist_centroid_1  dist_centroid_2 cluster
0               15              39        33.241540        23.345235       2
1               15              81        43.139309        51.429563       1
2               16               6        54.083269        36.400549       2
3               16              77        39.698866        47.413078       1
4               17              40        31.016125        21.587033       2
..             ...             ...              ...              ...     ...
195            120              79        79.120162        93.059121       1
196            126              28        83.240615        88.277970       1
197            126              74        83.240615        96.254870       1
198            137              18        96.798760       100.448992       1
199            137              83        96.462428       110.022725       1

[200 rows x 5 columns]
We have four new steps left to finish the main function:
Recalculate the centroids as the mean of their cluster. For that, we use groupby() and the list of variables we created at the beginning of the function.
Transform the coordinates of these new centroids into a list of lists, just like the one we get from the get_centroids() function.
Check if these new coordinates are equal to the last ones. If so, it means the centroids are located at the mean of their clusters and we've reached our stopping point. We then break the loop and return only the cluster column and the centroids DataFrame.
We also added a counter so we can know how many iterations were necessary to find a split for the dataset.
Finally, we added the max_iterations parameter to stop the loop if it takes too long for the centroids to converge to the mean of their clusters.
Here's the completed function:
def kmeans(df, k, max_iterations=50):
    variables = df.columns
    centroids, coords = get_centroids(df, k)
    counter = 0
    while True:
        last_coords = coords.copy()
        df, dist_columns = calculate_distance(df, coords)
        df['cluster'] = df[dist_columns].idxmin(axis=1).str.split('_').str[-1]
        # Recalculating the centroids as the mean of their cluster
        centroids = round(df.groupby('cluster')[variables].mean(), 2)
        coords = centroids.values.tolist()
        # Stop when the centroids no longer move or when we hit max_iterations
        if last_coords == coords or counter + 1 == max_iterations:
            print(f'Number of iterations: {counter + 1}')
            return df['cluster'], centroids
        else:
            counter += 1
We then call the function and assign the result to a new column called cluster in our original DataFrame:
clusters, centroids = kmeans(customers[['Annual Income', 'Spending Score']], k=2)
customers['cluster'] = clusters
print(customers)
Number of iterations: 8
     CustomerID  Annual Income  Spending Score cluster
0             1             15              39       1
1             2             15              81       1
2             3             16               6       2
3             4             16              77       1
4             5             17              40       1
..          ...            ...             ...     ...
195         196            120              79       1
196         197            126              28       2
197         198            126              74       1
198         199            137              18       2
199         200            137              83       1

[200 rows x 4 columns]
We now use matplotlib and seaborn to visualize how the customers were split and where the final centroids (the black points) are located:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

fig, ax = plt.subplots(figsize=(10, 5))
sns.scatterplot(x='Annual Income', y='Spending Score', hue='cluster', palette='tab10', data=customers, s=50, ax=ax)
sns.scatterplot(x='Annual Income', y='Spending Score', color='black', data=centroids, s=100, ax=ax)
plt.tight_layout()
plt.show()
Inertia and the Elbow Curve
This looks like a good split, right? Customers with higher spending scores are in one cluster, and those with lower spending scores are in another. But is there a way to know whether there's a better split than this one? The answer is yes.
First, we need to understand an important metric for k-means: inertia. Inertia is a measure of how far the data points assigned to a cluster are from that cluster's centroid.
Mathematically speaking, inertia is the sum of the squared distances from each of the data points to its centroid. The lower the inertia, the closer the points are to their centroids, which is what we're looking for.
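In symbols, inertia = Σᵢ ‖xᵢ − μᵢ‖², where xᵢ is a data point and μᵢ is the centroid of the cluster that xᵢ is assigned to.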
Our dataset contains 200 customers. If we split it into 200 clusters, the inertia would be zero, and each customer would be its own cluster, which doesn't make sense at all. But haven't we just said the lower the inertia, the better? Yes, but only to a certain extent.
As we've seen, the higher the number of clusters, the lower the inertia, and we need to decide when the inertia is low enough for us to stop adding clusters. This is where the Elbow Curve comes in.
The Elbow Curve is nothing more than a line plot of the inertias against the number of clusters. Since inertia decreases as the number of clusters increases, the curve looks like this:
As we start from one cluster and add the second and third, inertia drops very fast. At some point, however, it plateaus, creating a "sharp elbow" in the curve. At this point, adding more clusters will no longer significantly reduce the inertia, and that's where we should stop.
Here's an example of a sharp elbow:
To create such a plot, we need to run k-means for several values of K and then calculate the inertia for each one. The following code performs this calculation using K from 1 to 10 and stores the results in a DataFrame:
inertias = []
for k in range(1, 11):
    # Creating a temporary DataFrame to calculate the inertia for each k
    df_temp = customers[['Annual Income', 'Spending Score']].copy()
    # Calculating the clusters for each k
    clusters, centroids = kmeans(customers[['Annual Income', 'Spending Score']], k=k)
    # Adding the clusters and the coordinates of the centroids to the temporary DataFrame
    df_temp['cluster'] = clusters
    df_temp = df_temp.merge(centroids.rename(columns={'Annual Income': 'Annual Income_centroid',
                                                      'Spending Score': 'Spending Score_centroid'}), on='cluster')
    # Using the Euclidean distance to calculate the squared distance of each point to its centroid
    df_temp['sqr_distance'] = (np.sqrt((df_temp['Annual Income'] - df_temp['Annual Income_centroid'])**2 + (df_temp['Spending Score'] - df_temp['Spending Score_centroid'])**2))**2
    # Storing the sum of the squared distances
    inertias.append([k, df_temp['sqr_distance'].sum()])

df_inertia = pd.DataFrame(inertias, columns=['k', 'inertia'])
Number of iterations: 2
Number of iterations: 8
Number of iterations: 10
Number of iterations: 8
Number of iterations: 10
Number of iterations: 10
Number of iterations: 7
Number of iterations: 13
Number of iterations: 9
Number of iterations: 50
If we use the inertias to create the elbow curve, we'd have this plot:
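One way to draw this curve from df_inertia, using the libraries we've already imported:

fig, ax = plt.subplots(figsize=(10, 5))
sns.lineplot(x='k', y='inertia', data=df_inertia, marker='o', ax=ax)
ax.set_xlabel('Number of clusters (K)')
ax.set_ylabel('Inertia')
plt.tight_layout()
plt.show()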
Notice that, this time, we don't have an obvious point to pick and that the "elbow" is not that sharp. This is quite normal, and sometimes we must inspect where the decrease in inertia really slows down and look for business insights to determine K.
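One way to inspect that numerically is to look at the percentage drop in inertia from one value of K to the next, for example:

# Percentage drop in inertia relative to the previous k (the first row has no previous value)
df_inertia['pct_drop'] = -df_inertia['inertia'].pct_change() * 100
print(df_inertia)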
In our scenario, we'll go with five. Here's what this new split looks like:
# Reading the data again
customers = pd.read_csv('customers_clustering.csv')
# Creating the clusters using k=5
clusters, centroids = kmeans(customers[['Annual Income', 'Spending Score']], k=5)
customers['cluster'] = clusters

# Plotting the results
fig, ax = plt.subplots(figsize=(10, 5))
sns.scatterplot(x='Annual Income', y='Spending Score', hue='cluster', palette='tab10', data=customers, s=50, ax=ax)
sns.scatterplot(x='Annual Income', y='Spending Score', color='black', data=centroids, s=100, ax=ax)
plt.tight_layout()
plt.show()
Number of iterations: 10
Indeed, it's much better, as we can now clearly see five groups with distinct characteristics regarding both variables.
Conclusion
In this article, in order to take a closer look at how the k-means clustering algorithm works, we did the following:
Explored some important clustering concepts
Implemented a simplified-but-functional version of k-means from scratch
Learned how to use the Elbow Curve
If you're interested in going deeper, not only into k-means but also into other machine learning topics and algorithms, you can find all of that in Dataquest's Machine Learning in Python path!

About the author
Otávio Simões Silveira
Otávio is an economist and data scientist from Brazil. In his free time, he writes about Python and Data Science on the internet. You can find him on LinkedIn.