Predictive Hacks

Tip: How to define your distance function for Hierarchical Clustering

You will often need to define your own distance function. I found this StackOverflow answer very helpful, so I am posting it here as a tip.

All of the SciPy hierarchical clustering routines will accept a custom distance function that accepts two 1D vectors specifying a pair of points and returns a scalar. For example, using fclusterdata:

import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# a custom function that just computes Euclidean distance
def mydist(p1, p2):
    diff = p1 - p2
    return np.vdot(diff, diff) ** 0.5

X = np.random.randn(100, 2)

fclust1 = fclusterdata(X, 1.0, metric=mydist)
fclust2 = fclusterdata(X, 1.0, metric='euclidean')

print(np.allclose(fclust1, fclust2))
# True
 

Valid inputs for the metric= keyword argument are the same as for scipy.spatial.distance.pdist: either the name of a built-in metric given as a string, or a callable that takes two 1D arrays and returns a scalar.
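Because the metric argument follows the pdist convention, the same kind of callable can also be passed to pdist directly, and the resulting condensed distance matrix can be handed to linkage and fcluster. The snippet below is a minimal sketch of that workflow and not part of the original answer: the function name manhattan, the random data, the 'average' linkage method and the choice of three clusters are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


# Hypothetical custom metric: Manhattan (city block) distance.
def manhattan(p1, p2):
    return np.abs(p1 - p2).sum()


# Illustrative data: 50 random 2D points with a fixed seed.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))

# pdist returns a condensed distance matrix with n * (n - 1) / 2 entries.
D_custom = pdist(X, metric=manhattan)
D_builtin = pdist(X, metric='cityblock')

# Build the hierarchy from the precomputed distances, then cut it
# into at most three flat clusters.
Z = linkage(D_custom, method='average')
labels = fcluster(Z, t=3, criterion='maxclust')

# The custom callable should agree with the built-in metric.
same_distances = np.allclose(D_custom, D_builtin)
print(same_distances)
```

A Python callable is invoked once per pair of points, so it is usually much slower than the equivalent built-in string metric. When a built-in metric matches what you need, prefer its name and keep the callable for distances SciPy does not provide.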
