An efficient way to compute the pairwise similarity between the rows of a NumPy array (or a pandas DataFrame) is to use the pdist and squareform functions from the scipy package. Let’s start with a practical example based on the Jaccard similarity:
import numpy as np
from scipy.spatial.distance import pdist, squareform

my_data = np.array([[1,1,1,0,1],
                    [1,1,0,0,1],
                    [0,0,1,1,1],
                    [1,0,1,1,0]])
my_data
Now we are going to calculate the pairwise Jaccard distance:
# Calculate all pairwise distances
jaccard_distances = pdist(my_data, metric='jaccard')

# Convert the distances to a square matrix
jaccard_distances = squareform(jaccard_distances)
Finally, the Jaccard similarity is 1 minus the Jaccard distance:
jaccard_similarity = 1 - jaccard_distances
jaccard_similarity
As we can see, the final outcome is a 4×4 array. The input has 4 rows (for example, 4 documents), so we get one similarity value for every pair of rows, with ones on the diagonal because each row is identical to itself.
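To make the numbers in the matrix easier to trust, here is a minimal sketch that recomputes one entry by hand as intersection over union and compares it with the pdist result. The helper name jaccard_similarity_manual is hypothetical (it is not part of scipy), and the sketch assumes the two rows are not both all zeros, since the union would then be empty.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Same data as above, stored as booleans (True = item present)
my_data = np.array([[1, 1, 1, 0, 1],
                    [1, 1, 0, 0, 1],
                    [0, 0, 1, 1, 1],
                    [1, 0, 1, 1, 0]], dtype=bool)


def jaccard_similarity_manual(a, b):
    """Hypothetical helper: |A and B| / |A or B| for two boolean vectors.

    Assumes at least one of the vectors has a True entry.
    """
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return intersection / union


# Similarity matrix computed with scipy, as in the text
jaccard_similarity = 1 - squareform(pdist(my_data, metric='jaccard'))

# Rows 0 and 1 share 3 items out of 4 distinct items in total
manual = jaccard_similarity_manual(my_data[0], my_data[1])
print(manual, jaccard_similarity[0, 1])
```

Both values should agree, because the Jaccard distance used by pdist is the complement of this intersection-over-union ratio.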
Note that scipy.spatial.distance supports many other distance metrics, such as:
‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’. The exact set depends on the installed SciPy version; for example, ‘kulsinski’ was deprecated in favor of ‘kulczynski1’.
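Switching to another metric only requires changing the metric argument. As an illustration, the sketch below uses 'hamming', which in scipy is the fraction of positions at which two rows differ. The variable name hamming_distances is just chosen for this example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

my_data = np.array([[1, 1, 1, 0, 1],
                    [1, 1, 0, 0, 1],
                    [0, 0, 1, 1, 1],
                    [1, 0, 1, 1, 0]])

# Same pattern as before: condensed distances, then a square matrix
hamming_distances = squareform(pdist(my_data, metric='hamming'))

# Rows 0 and 1 differ in exactly 1 of their 5 positions
print(hamming_distances[0, 1])
```

Keep in mind that the "1 minus distance" conversion to a similarity is only meaningful for distances bounded between 0 and 1, such as Jaccard or Hamming; it would not apply directly to an unbounded metric like 'euclidean'.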
Pairwise Distance with Scikit-Learn
Alternatively, you can work with Scikit-learn as follows:
import numpy as np
from sklearn.metrics import pairwise_distances

# Get the pairwise Jaccard similarity
1 - pairwise_distances(my_data, metric='jaccard')
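As a quick sanity check, the following sketch compares the two approaches on the same data. It casts the array to bool first, on the assumption that scikit-learn may otherwise emit a data conversion warning when a boolean metric such as 'jaccard' receives integer input. The variable names are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import pairwise_distances

# Boolean input, since Jaccard is defined on presence/absence data
my_data = np.array([[1, 1, 1, 0, 1],
                    [1, 1, 0, 0, 1],
                    [0, 0, 1, 1, 1],
                    [1, 0, 1, 1, 0]]).astype(bool)

# scipy route: condensed distances expanded to a square matrix
similarity_scipy = 1 - squareform(pdist(my_data, metric='jaccard'))

# scikit-learn route: returns the square matrix directly
similarity_sklearn = 1 - pairwise_distances(my_data, metric='jaccard')

# Check that both routes give the same similarity matrix
same_result = np.allclose(similarity_scipy, similarity_sklearn)
print(same_result)
```

The main practical difference is that pairwise_distances returns the full square matrix in one call, whereas pdist returns a condensed vector that needs squareform.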