Predictive Hacks

Tip: How to define your distance function for Hierarchical Clustering

custome function

Many times there is a need to define your distance function. I found this answer in StackOverflow very helpful and for that reason, I posted here as a tip.

All of the SciPy hierarchical clustering routines will accept a custom distance function that accepts two 1D vectors specifying a pair of points and returns a scalar. For example, using fclusterdata:

import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# a custom function that just computes Euclidean distance
def mydist(p1, p2):
    diff = p1 - p2
    return np.vdot(diff, diff) ** 0.5

X = np.random.randn(100, 2)

fclust1 = fclusterdata(X, 1.0, metric=mydist)
fclust2 = fclusterdata(X, 1.0, metric='euclidean')

print(np.allclose(fclust1, fclust2))
# True

Valid inputs for the metric= kwarg are the same as for scipy.spatial.distance.pdist. Also here you can find some other info

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

Leave a Comment

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore


Document Splitting with LangChain

In this tutorial, we will talk about different ways of how to split the loaded documents into smaller chunks using