Predictive Hacks

Tip: How to define your distance function for Hierarchical Clustering

custome function

Many times there is a need to define your distance function. I found this answer in StackOverflow very helpful and for that reason, I posted here as a tip.

All of the SciPy hierarchical clustering routines will accept a custom distance function that accepts two 1D vectors specifying a pair of points and returns a scalar. For example, using fclusterdata:

import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# a custom function that just computes Euclidean distance
def mydist(p1, p2):
    diff = p1 - p2
    return np.vdot(diff, diff) ** 0.5

X = np.random.randn(100, 2)

fclust1 = fclusterdata(X, 1.0, metric=mydist)
fclust2 = fclusterdata(X, 1.0, metric='euclidean')

print(np.allclose(fclust1, fclust2))
# True

Valid inputs for the metric= kwarg are the same as for scipy.spatial.distance.pdist. Also here you can find some other info

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

Leave a Comment

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

connect with sql

How to Connect R with SQL

Need to Connect R with SQL It is common for Data Analysts/Scientists to connect R with SQL. For that reason,

[the_ad_group id="232"]
[the_ad id="2133"]