Predictive Hacks

Hierarchical Clustering in Python


We have provided an example of K-means clustering and now we will provide an example of Hierarchical Clustering. We will work with the famous Iris Dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

from sklearn import datasets
iris = datasets.load_iris()


df=pd.DataFrame(iris['data'])
print(df.head())
 
     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2

Let’s see the number of targets that the Iris dataset has and their frequency:

np.unique(iris.target,return_counts=True)
 
(array([0, 1, 2]), array([50, 50, 50], dtype=int64))

As we can see, there are three targets with 50 observations each. If we want to see the names of the targets:

iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

Data Preparation for Cluster Analysis

When we apply Cluster Analysis we need to scale our data. There are many different approaches, such as standardizing or normalizing the values. We can also whiten the values, which is a process of rescaling each feature to a standard deviation of 1:

\(x_{new} = x/std\_dev(x)\)
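As a quick sanity check of the formula above, we can whiten a small toy vector by hand and confirm that the result matches SciPy's `whiten` (the numbers here are made up purely for illustration):

```python
import numpy as np
from scipy.cluster.vq import whiten

# toy vector, values chosen arbitrarily for illustration
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

# manual whitening: divide by the (population) standard deviation
x_manual = x / np.std(x)

# scipy's whiten() does the same thing, column by column
x_scipy = whiten(x.reshape(-1, 1)).ravel()

print(np.allclose(x_manual, x_scipy))  # True
print(np.std(x_manual))                # ~1.0
```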

Let’s scale the iris dataset.

# Import the whiten function
from scipy.cluster.vq import whiten
scaled_data = whiten(df.to_numpy())
 

Let’s check that the standard deviation of every feature is close to 1 now:

pd.DataFrame(scaled_data).describe()
 
                0           1           2           3
count  150.000000  150.000000  150.000000  150.000000
mean     7.080243    7.037882    2.135951    1.578709
std      1.003350    1.003350    1.003350    1.003350
min      5.210218    4.603935    0.568374    0.131632
25%      6.179561    6.445509    0.909399    0.394897
50%      7.027736    6.905903    2.472428    1.711218
75%      7.754744    7.596493    2.898709    2.369379
max      9.572262   10.128658    3.921782    3.290805

Create the Distance Matrix based on linkage

Look at the documentation of the `linkage` function to see the available methods and metrics.

# Import the fcluster and linkage functions
from scipy.cluster.hierarchy import fcluster, linkage

# Use the linkage() function
distance_matrix = linkage(scaled_data, method = 'ward', metric = 'euclidean')
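Note that the object returned by `linkage` is not literally a distance matrix but an (n−1)×4 merge table: each row records the two cluster indices being merged, the linkage distance between them, and the size of the newly formed cluster. A minimal sketch on a toy dataset (the data below is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# toy data: two well-separated pairs of points
data = np.array([[0.0, 0.0],
                 [0.1, 0.0],
                 [5.0, 5.0],
                 [5.1, 5.0]])

Z = linkage(data, method='ward', metric='euclidean')

# one merge per row: [cluster_i, cluster_j, distance, new cluster size]
print(Z.shape)   # (3, 4): n - 1 merges for n = 4 observations
print(Z[-1, 3])  # 4.0: the final merge contains all observations
```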
 

How many Clusters – Introduction to dendrograms

A dendrogram shows the progression of the algorithm as clusters are merged: it is a branching diagram that demonstrates how each cluster is composed of its child clusters.

# Import the dendrogram function
from scipy.cluster.hierarchy import dendrogram

# Create a dendrogram
dn = dendrogram(distance_matrix)

# Display the dendrogram
plt.show()
 
[Figure: dendrogram of the hierarchically clustered Iris data]

From the dendrogram we can see that 3 is a good candidate for the number of clusters, and that the two red clusters are closer to each other than either is to the green one.


Run the Hierarchical Clustering

# Assign cluster labels
df['cluster_labels'] = fcluster(distance_matrix, 3, criterion='maxclust')
 

Notice that we can also define clusters based on the linkage distance by changing the criterion to 'distance' in the fcluster function!
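For example, a minimal sketch (with made-up 1-D data) of cutting the tree at a chosen linkage distance instead of asking for a fixed number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# made-up 1-D data with three visually obvious groups
data = np.array([[0.0], [0.2], [5.0], [5.2], [10.0]])
Z = linkage(data, method='ward')

# every merge above distance t=2.0 is cut, so each remaining
# subtree becomes its own cluster
labels = fcluster(Z, t=2.0, criterion='distance')
print(np.unique(labels).size)  # 3
```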


Hierarchical vs Actual for n_clusters=3

df['target'] = iris.target

fig, axes = plt.subplots(1, 2, figsize=(16,8))
axes[0].scatter(df[0], df[1], c=df['target'])
axes[1].scatter(df[0], df[1], c=df['cluster_labels'], cmap=plt.cm.Set1)
axes[0].set_title('Actual', fontsize=18)
axes[1].set_title('Hierarchical', fontsize=18)

# Display the plots
plt.show()
[Figure: side-by-side scatter plots of the actual species vs the hierarchical cluster labels]
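Beyond eyeballing the scatter plots, we can cross-tabulate the cluster labels against the true species to quantify the agreement (the snippet below recomputes the pipeline so that it stands alone):

```python
import pandas as pd
from sklearn import datasets
from scipy.cluster.vq import whiten
from scipy.cluster.hierarchy import fcluster, linkage

iris = datasets.load_iris()
scaled_data = whiten(iris['data'])
distance_matrix = linkage(scaled_data, method='ward', metric='euclidean')
cluster_labels = fcluster(distance_matrix, 3, criterion='maxclust')

# rows: true species, columns: assigned cluster label
print(pd.crosstab(iris.target, cluster_labels))
```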
