We have provided examples of how you can apply fuzzy joins in R, and we assume that you are familiar with string distances and similarities. In this tutorial, we will show how you can apply fuzzy joins in Python. Since we work mainly with the Levenshtein distance, it will be helpful to state the formula here:

\[
\operatorname{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j) & \text{if } \min(i,j)=0,\\
\min\begin{cases}
\operatorname{lev}_{a,b}(i-1,j)+1\\
\operatorname{lev}_{a,b}(i,j-1)+1\\
\operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}
\end{cases} & \text{otherwise}
\end{cases}
\]

where a and b are strings, lev(i, j) is the distance between the first i characters of a and the first j characters of b, and the three terms of the min correspond to deletion, insertion, and substitution respectively. Notice that it is very helpful to also define the Levenshtein Similarity Ratio, which can be calculated from the formula below:

\[
\text{Ratio} = \frac{(|a|+|b|) - \operatorname{lev}_{a,b}(|a|,|b|)}{|a|+|b|}
\]

where |a| and |b| refer to the lengths of the strings a and b respectively.
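To make the formulas concrete, below is a minimal pure-Python sketch of both quantities; the function names lev and levenshtein_ratio are our own, not from any library:

def lev(a, b):
    """Levenshtein distance via the dynamic-programming recurrence above."""
    m, n = len(a), len(b)
    # dist[i][j] = distance between the first i characters of a and the first j of b
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # i deletions
    for j in range(n + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution
    return dist[m][n]

def levenshtein_ratio(a, b):
    """Levenshtein Similarity Ratio as defined above."""
    return ((len(a) + len(b)) - lev(a, b)) / (len(a) + len(b))

print(lev("kitten", "sitting"))                           # 3
print(round(levenshtein_ratio("kitten", "sitting"), 2))   # 0.77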
Example of Fuzzy Joins
We will provide a practical example of Fuzzy Joins, working with the FuzzyWuzzy and textdistance libraries. Assume that you are dealing with two different data frames and you want to match them on a text column. Below we provide the two hypothetical data frames:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import textdistance

df1 = pd.DataFrame({'Text_A':['12 Estias Street, Ampelokipi',
                              'Georgios Pipis',
                              'fuzzy much in python',
                              'Today is Friday! TGIF',
                              'This is a partial text']})

df2 = pd.DataFrame({'Text_B':['Predictive Hacks is a Data Science Blog',
                              'Abelokipi, Estias Str 12',
                              'Fuzzy matching in Python',
                              'George P. Pipis',
                              'partial text']})
Cross Join
The first thing that we can do is to cross join the two data frames as follows:
# cross join the data frames
df1['dummy'] = True
df2['dummy'] = True
df = pd.merge(df1, df2, on='dummy')
df.drop('dummy', axis=1, inplace=True)
df
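Note that if you are on pandas 1.2 or newer, merge supports how='cross' directly, so the same cross join can be written without the dummy column:

# equivalent cross join without the helper column (pandas >= 1.2)
df = pd.merge(df1, df2, how='cross')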
FuzzyWuzzy Library
We will calculate the following ratios between the two columns of our data frame:
- Ratio: It refers to the Levenshtein Distance Ratio.
- Partial Ratio: Assume that we are dealing with two strings of different lengths, L1 and L2, where L1 is the shorter one. The algorithm then returns the score of the best-matching substring of length L1 within the longer string.
- Token Sort Ratio: First it removes punctuation, converts the text to lower case, and tokenizes it. It then sorts the tokens alphabetically, joins them into a single string, and computes the ratio.
- Token Set Ratio: Similar to the Token Sort Ratio, but it takes into consideration the unique tokens. A quick example of all four measures follows below.
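To make the differences between these measures concrete, here is a quick standalone example; the exact scores can vary slightly across fuzzywuzzy versions, but the pattern should hold:

# same tokens, different order: the plain ratio drops, token_sort_ratio is 100
print(fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
print(fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
# the shorter string appears verbatim inside the longer one: partial_ratio is 100
print(fuzz.partial_ratio("fuzzy wuzzy", "fuzzy wuzzy was a bear"))  # 100
# duplicated tokens are ignored by token_set_ratio, so this is also 100
print(fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100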
Below we will return these four measures by adding them as columns to the joined data frame.
df['Ratio'] = df[['Text_A','Text_B']].apply(lambda x: fuzz.ratio(x.Text_A, x.Text_B), axis=1)
df['Partial_Ratio'] = df[['Text_A','Text_B']].apply(lambda x: fuzz.partial_ratio(x.Text_A, x.Text_B), axis=1)
df['Token_Sort_Ratio'] = df[['Text_A','Text_B']].apply(lambda x: fuzz.token_sort_ratio(x.Text_A, x.Text_B), axis=1)
df['Token_Set_Ratio'] = df[['Text_A','Text_B']].apply(lambda x: fuzz.token_set_ratio(x.Text_A, x.Text_B), axis=1)
df
Fuzzy Joins
Since we have calculated the pairwise similarities of the texts, we can join the two string columns by keeping the most similar pair. Let's assume that we want to match df1 on df2. We can group the joined df on Text_A, rank the similarities, and keep the most similar pair (i.e., Rank = 1). We have to choose a measure; for this example, we will keep the Token_Set_Ratio.
df['Rank_Token_Set_Ratio'] = df.groupby('Text_A')['Token_Set_Ratio'].rank(ascending=False, method='dense')
df[['Text_A','Text_B', 'Token_Set_Ratio']].loc[df.Rank_Token_Set_Ratio==1]
As we can see, it captured all the matches, but it also matched two strings that are not related. The reason is that the text "Today is Friday! TGIF" cannot be matched with any text from Text_B. For that reason, it makes sense to add a threshold; a good threshold here is around 70.
df[['Text_A','Text_B', 'Token_Set_Ratio']].loc[(df.Rank_Token_Set_Ratio==1)&(df.Token_Set_Ratio>70)]
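As a side note, when ties do not matter, a more compact pandas idiom for keeping the best match per group is idxmax, combined with the same threshold:

# keep, for each Text_A, the single row with the highest Token_Set_Ratio
# (idxmax breaks ties arbitrarily, unlike the dense rank above)
best = df.loc[df.groupby('Text_A')['Token_Set_Ratio'].idxmax()]
best[['Text_A', 'Text_B', 'Token_Set_Ratio']].loc[best.Token_Set_Ratio > 70]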
Process Extract
We can find the most similar string out of a vector of strings using the process module:
df1['Most_Similar'] = df1.Text_A.apply(lambda x: process.extract(x, df2.Text_B, limit=1)[0][0])
df1[['Text_A', 'Most_Similar']]
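Note that fuzzywuzzy also provides process.extractOne, which returns only the single best match together with its score, and accepts a score_cutoff argument if you want to discard weak matches (it returns None below the cutoff). A minimal equivalent of the above:

# extractOne returns a (match, score, key) tuple when choices is a Series
df1['Most_Similar'] = df1.Text_A.apply(lambda x: process.extractOne(x, df2.Text_B)[0])
df1[['Text_A', 'Most_Similar']]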
TextDistance Library
Previously, we showed how to get similarity measures based on the Levenshtein similarity obtained from the FuzzyWuzzy library. Let's see how we can get other similarities based on q-grams, like the Jaccard and Cosine similarity, using the textdistance package. Let's return the Cosine and Jaccard similarity of the joined df based on character 2-grams.
# cross join the data frames
df1['dummy'] = True
df2['dummy'] = True
df = pd.merge(df1, df2, on='dummy')
df.drop('dummy', axis=1, inplace=True)

df['Jaccard'] = df.apply(lambda x: (1 - textdistance.Jaccard(qval=2).distance(x['Text_A'], x['Text_B'])), axis=1)
df['Cosine'] = df.apply(lambda x: (1 - textdistance.Cosine(qval=2).distance(x['Text_A'], x['Text_B'])), axis=1)
df
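As a side note, the textdistance algorithm objects also expose a similarity method directly, so for measures normalized to [0, 1], like the Jaccard and Cosine above, the 1 - distance trick can equivalently be written as:

# equivalent to 1 - distance for these normalized measures
df['Jaccard'] = df.apply(lambda x: textdistance.Jaccard(qval=2).similarity(x['Text_A'], x['Text_B']), axis=1)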
# get the most similar with a threshold of 0.3
df['Rank_Jaccard'] = df.groupby('Text_A')['Jaccard'].rank(ascending=False, method='dense')
df[['Text_A','Text_B', 'Jaccard']].loc[(df.Rank_Jaccard==1)&(df.Jaccard>0.3)]
PolyFuzz Library
There is a relatively new library called PolyFuzz that I have not yet had time to explore thoroughly, so I will provide some examples from the documentation.
Currently, the following models are implemented in PolyFuzz:
- TF-IDF
- EditDistance (you can use any distance measure, see documentation)
- FastText and GloVe
- Transformers
from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

model = PolyFuzz("TF-IDF").match(from_list, to_list)
model.get_matches()
model.group(link_min_similarity=0.75)
model.get_matches()
You can find more examples here.