Predictive Hacks

# Fuzzy Joins Tutorial We have provided examples of how you can apply fuzzy joins in R and we assume that you are familiar with string distances and similarities. In this tutorial, we will show how you can apply fuzzy join in Python. Since we work mainly with the Levenshtein distance, it will be helpful to provide here the formula:

where “a” and “b” are strings and the min refers to the deletion, insertion, and substitution. Notice that is very helpful to provide the Levenshtein Similarity Ratio which can be calculated from the formula below:

where |a| and |b| refer to the length of the strings a and b respectively.

## Example of Fuzzy Joins

We will provide a practical example of Fuzzy Joins. We will work with the FuzzyWuzzy and textdistance libraries. Assume that you are dealing with two different data frames and you want to match them on the string. Below we provide the two hypothetical data frames:

```import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import textdistance

df1 = pd.DataFrame({'Text_A':['12 Estias Street, Ampelokipi',
'Georgios Pipis',
'fuzzy much in python',
'Today is Friday! TGIF',
'This is a partial text']})

df2 = pd.DataFrame({'Text_B':['Predictive Hacks is a Data Science Blog',
'Abelokipi, Estias Str 12',
'Fuzzy matching in Python',
'George P. Pipis',
'partial text']})

```

### Cross Join

The first thing that we can do is to cross join the two data frames as follows:

```# cross join the data frames

df1['dummy'] = True
df2['dummy'] = True

df = pd.merge(df1,df2, on='dummy')

df.drop('dummy', axis=1, inplace=True)
df

```

## FuzzyWuzzy Library

We will caclulate the follwing ratios between the two columns of our data frame:

• Ratio: It refers to the Levenshtein Distance Ratio.
• Partial Ratio: Assume that we are dealing with two strings of different lengths such as L1 and L2, and assume that L1 is less than L2. Then the algorithm seeks the score of the best matching of length -L1 substring.
• Token Sort Ratio: First it removes punctuations and converts the text to lower case and then it tokenizes it. Then it sorts the tokens alphabetically and then it joins them in a single string.
• Token Set Ratio: Similar to the Token Sort Ratio, but it takes into consideration the unique tokens.

Below we will return these four measures by adding them as columns to the joined data frame.

```df['Ratio'] = df[['Text_A','Text_B']].apply(lambda x:fuzz.ratio(x.Text_A, x.Text_B), axis=1)
df['Partial_Ratio'] = df[['Text_A','Text_B']].apply(lambda x:fuzz.partial_ratio(x.Text_A, x.Text_B), axis=1)
df['Token_Sort_Ratio'] = df[['Text_A','Text_B']].apply(lambda x:fuzz.token_sort_ratio(x.Text_A, x.Text_B), axis=1)
df['Token_Set_Ratio'] = df[['Text_A','Text_B']].apply(lambda x:fuzz.token_set_ratio(x.Text_A, x.Text_B), axis=1)

df

```

### Fuzzy Joins

Since we have calculated the pairwise similarities of the text, we can join the two string columns by keeping the most similar pair. Let’s assume that we want to match df1 on df2. We can group the joined df on Text_A and get the rank of similarities and then keep the most similar (i.e. Rank=1). We have to choose measure, and for this example, we will keep the `Token_Set_Ratio`

```df['Rank_Token_Set_Ratio'] = df.groupby('Text_A')['Token_Set_Ratio'].rank(ascending=False, method='dense')
df[['Text_A','Text_B', 'Token_Set_Ratio']].loc[df.Rank_Token_Set_Ratio==1]
```

As we can see, it captured all the matches but it matched also two strings that are not related. The reason for that is because the text “Today is Friday! TGIF” cannot be matched with any text from Text_B. For that reason, it makes sense to add a threshold. A good threshold is around 70.

```df[['Text_A','Text_B', 'Token_Set_Ratio']].loc[(df.Rank_Token_Set_Ratio==1)&amp;(df.Token_Set_Ratio>70)]

```

### Process Extract

We can find the most similar string out of a vector of strings using the `process` module:

```df1['Most_Similar'] = df1.Text_A.apply(lambda x : [process.extract(x, df2.Text_B, limit=1)])

df1[['Text_A', 'Most_Similar']]

```

## TextDistance Library

Previously we showed how to get similarity measures based on Levenshtein similarity obtained from the FuzzyWuzzy library. Let’s see how we can get other similarities based on q-grams, like Jaccard and Cosine Similarity using the textdistance package. Let’s return the Cosine and Jaccard similarity of the joined df based on 2-gram based on characters.

```# cross join the data frames

df1['dummy'] = True
df2['dummy'] = True

df = pd.merge(df1,df2, on='dummy')

df.drop('dummy', axis=1, inplace=True)

df['Jaccard'] = df.apply(lambda x: (1-textdistance.Jaccard(qval=2).distance(x['Text_A'],  x['Text_B'])), axis=1)
df['Cosine'] = df.apply(lambda x: (1-textdistance.Cosine(qval=2).distance(x['Text_A'],  x['Text_B'])), axis=1)
df

```
```# get the most similar with a threshold of 0.3
df['Rank_Jaccard'] = df.groupby('Text_A')['Jaccard'].rank(ascending=False, method='dense')
df[['Text_A','Text_B', 'Jaccard']].loc[(df.Rank_Jaccard==1)&amp;(df.Jaccard>0.3)]

```

## PollyFuzz Library

There exists a new library called PollyFuzz that I didn’t have time to work thoroughly on it. I will provide some examples from the documentation.

Currently, the following models are implemented in PolyFuzz:

• TF-IDF
• EditDistance (you can use any distance measure, see documentation)
• FastText and GloVe
• Transformers
```from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

model = PolyFuzz("TF-IDF").match(from_list, to_list)
model.get_matches()
```
```model.group(link_min_similarity=0.75)
model.get_matches()
```

You can find more examples here