K-Means is an unsupervised machine learning algorithm that groups data into k clusters. The number of clusters is chosen by the user, and the algorithm will partition the data into that many groups even if the number is a poor fit for the data.
We therefore need a technique that helps us decide how many clusters to use for the K-Means model.
The Elbow method is a very popular technique. The idea is to run K-Means for a range of values of k (let’s say from 1 to 10) and, for each value, calculate the sum of squared distances from each point to its assigned cluster center (the distortion).
When the distortions are plotted against k and the curve looks like an arm, the “elbow” (the point where the curve bends most sharply and further increases in k bring only small reductions in distortion) is a good choice for k.
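To make the distortion concrete, here is a minimal sketch that computes it by hand with NumPy on four made-up 2-D points. The helper name `distortion` and the toy data are illustrative assumptions, not part of the original example; each cluster center is taken to be the mean of its members, which is where K-Means places its centers once it has converged.

```python
import numpy as np

def distortion(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        center = members.mean(axis=0)          # cluster center = mean of members
        total += ((members - center) ** 2).sum()
    return float(total)

# Toy data (hypothetical): two tight pairs of points.
points = [[1, 1], [1, 2], [4, 4], [5, 4]]

one_cluster = distortion(points, [0, 0, 0, 0])   # everything in one group
two_clusters = distortion(points, [0, 0, 1, 1])  # one group per pair
print(one_cluster, two_clusters)                 # 19.5 1.0
```

Going from one cluster to two drops the distortion from 19.5 to 1.0. A large drop like this, followed by much smaller ones, is what shows up as the elbow in the plot.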
K-Means Elbow method example with Iris Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
from sklearn import datasets

iris = datasets.load_iris()
# put the four feature columns into a DataFrame
df = pd.DataFrame(iris['data'])
print(df.head())
     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2
Let’s see the number of groups that the Iris dataset has
iris['target']
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
The dataset contains 3 classes (labelled 0, 1 and 2, with 50 samples each), and we have 4 features that we can feed to the K-Means model.
Running K-Means with a range of k
We can easily run K-Means for a range of values of k with a for loop, collecting the distortions into a list. scikit-learn exposes the distortion of a fitted model as its inertia_ attribute.
distortions = []
K = range(1,10)  # k = 1, 2, ..., 9
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(df)
    # inertia_ is the sum of squared distances to the closest center
    distortions.append(kmeanModel.inertia_)
Plotting the distortions of K-Means
plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
We can observe that the “elbow” is at k = 3: the distortion falls steeply up to that point and only slowly afterwards, so 3 is a good choice for this case. Now we can run K-Means with n_clusters set to 3.
kmeanModel = KMeans(n_clusters=3)
kmeanModel.fit(df)
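Reading the elbow off the plot by eye can be subjective when the curve bends gradually. One common heuristic, which is not part of the walkthrough above, is to pick the point that lies farthest from the straight line joining the first and last points of the curve. The sketch below assumes a hypothetical helper named `find_elbow` and uses made-up distortion values purely for illustration; it also assumes the curve is not flat, since it rescales both axes to the range 0 to 1.

```python
import numpy as np

def find_elbow(ks, distortions):
    """Return the k whose point is farthest from the first-to-last chord."""
    x = np.asarray(ks, dtype=float)
    y = np.asarray(distortions, dtype=float)
    # Rescale both axes to [0, 1] so that neither dominates the distance.
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    # Unit vector along the chord from the first to the last point.
    start = np.array([x[0], y[0]])
    chord = np.array([x[-1], y[-1]]) - start
    chord = chord / np.linalg.norm(chord)
    # Perpendicular distance of every point to the chord (2-D cross product).
    rel = np.column_stack([x, y]) - start
    dist = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])
    return ks[int(np.argmax(dist))]

# Made-up distortion values, for illustration only.
example_ks = [1, 2, 3, 4, 5, 6]
example_distortions = [100, 40, 10, 8, 7, 6]
print(find_elbow(example_ks, example_distortions))  # 3
```

With the variables from the loop above, the call would be find_elbow(K, distortions). Treat the result as a suggestion to check against the plot rather than a definitive answer.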
K-Means vs Actual for n_clusters=3
df['k_means'] = kmeanModel.predict(df)
df['target'] = iris['target']

fig, axes = plt.subplots(1, 2, figsize=(16,8))
# plot the first two features, colored by true class and by cluster
axes[0].scatter(df[0], df[1], c=df['target'])
axes[1].scatter(df[0], df[1], c=df['k_means'], cmap=plt.cm.Set1)
axes[0].set_title('Actual', fontsize=18)
axes[1].set_title('K_Means', fontsize=18)
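Given the number of clusters, K-Means does a good job of recovering the groups in the dataset. Note that the cluster numbers assigned by K-Means are arbitrary, so the colors in the two plots need not match even where the groupings agree.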
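The visual comparison can be backed with a number. One simple option is cluster purity: for each cluster, count how many of its points belong to its most common true class, then divide the total by the number of points. Because it only looks at the majority class inside each cluster, it does not depend on how the clusters happen to be numbered. The function name `cluster_purity` and the small label lists below are illustrative assumptions, not output from the Iris model.

```python
from collections import Counter, defaultdict

def cluster_purity(true_labels, cluster_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    true_labels = list(true_labels)
    cluster_labels = list(cluster_labels)
    # Count the true classes that appear inside each cluster.
    classes_per_cluster = defaultdict(Counter)
    for truth, cluster in zip(true_labels, cluster_labels):
        classes_per_cluster[cluster][truth] += 1
    # Each cluster contributes the size of its most common class.
    matched = sum(max(counts.values()) for counts in classes_per_cluster.values())
    return matched / len(true_labels)

# Hypothetical labels: the cluster ids are permuted and one point is misplaced.
example_truth    = [0, 0, 0, 1, 1, 1, 2, 2, 2]
example_clusters = [1, 1, 1, 0, 0, 2, 2, 2, 2]
print(cluster_purity(example_truth, example_clusters))  # 8 of 9 points match
```

On the DataFrame above, the call would be cluster_purity(df['target'], df['k_means']). A purity of 1.0 means every cluster contains a single class; keep in mind that purity rises as k grows, so it is best used to compare results at the same k.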
In case you want to apply the Elbow Method in R you can have a look at our post!