Edit Distance for measuring the Text Distance
Today we will talk about text similarities and how we can “calculate” a distance measure between two texts. For example, intuitively we know that the text “cats” is close to “rats” since they differ in one letter (Note: Forget about the meaning of the words). A popular approach of distance measure is the Edit Distance which is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. The main transformations are the following:
- Insertion of a new character. Example: boat vs boats (insertion of an
s
) - Deletion of an existing character. Example: fair vs far (deletion of an
i
) - Substitution of an existing character. Example: lamp vs lamb (substitution of
p
withb
) - Transposition of two existing consecutive characters. Example: sing vs sign (transposition of
ng
togn
)
We will work with the following two string metrics:
- Levenshtein distance which is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
- Damerau-Levenshtein distance which differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions).
Let’s try to calculate the Levenshtein distance of the words “balloon” and “baboon”. The changes that we need to apply are one deletion (i.e. the “l”) and one insertion (i.e. the “o”). Thus, their distance is 2.
R gives us the opportunity to calculate easily the distances between two texts. For example:
library(stringdist) # methods "lv" is for Levenshtein and "dl" is for Damerau-Levenshtein stringdist("antarctica", "africa", method = "lv")
5
Notice that the string distance can be applied to sentences and documents. For example:
stringdist("Georgios Pipis", "George Pipis", method = "lv")
3
Q-Grams
Another approach of measuring the distance between text is the q-gram distance which is the sum of absolute differences between N-gram vectors of both strings. Once we calculate the vector space of the Q-grams (which is like the N-grams of letters) we can calculate the distance by applying Cosine Distance or Jaccard Distance. Notice that both Cosine and Jaccard distance take value from 0 to 1, where the smaller the number the smaller the distance and the higher the similarity between two texts.
Let’s see the 1-Gram and 2-Grams of the input text “George Pipis“
# 1 gram qgrams("George Pipis", q = 1)
G e o r s p g i P
V1 1 2 1 1 1 1 1 2 1 1
# 2 gram qgrams("George Pipis", q = 2)
Ge eo or rg pi ge ip is e Pi P
V1 1 1 1 1 1 1 1 1 1 1 1
Let’s calculate the 2-grams of two texts such as “George Pipis” and “Georgios Pipis“
qgrams("George Pipis", "Georgios Pipis", q = 2)
Ge eo or rg s pi os ge ip is gi io e Pi P
V1 1 1 1 1 0 1 0 1 1 1 0 0 1 1 1
V2 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1
Now we can calculate the Cosine and Jaccard distance of the two texts above based on 2-grams.
# cosine distance stringdist("George Pipis", "Georgios Pipis", method = "cosine", q = 2)
[1] 0.2473822
stringdist("George Pipis", "Georgios Pipis", method = "jaccard", q = 2)
[1] 0.4
For comparison let’s calculate the Jaccard distance of “George Pipis” vs “Rick Pitino“
stringdist("George Pipis", "Rick Pitino", method = "jaccard", q = 2)
[1] 0.8947368
Fuzzy Joins based on Text Distance
As a data scientist, it is quite common to apply Data Linkage which is briefly a method of bringing information from different sources together about the same person or entity to create a new, richer dataset.
Let’s describe one case where we want to match the basketball players of two different data sources where the names have been written slightly different as well as their height (cm). For this task, we will apply fuzzy matches.
We assume that we are dealing with the following two datasets.
dfa<-data.frame(ID=c(1,2,3,4), Name=c("Dimitrios", "Nikolas", "Georgios", "Nick"), Surname=c("Diamantidis", "Pappas", "Papagiannis", "Calathis"), Height = c(196, 195, 218, 198) ) dfb<-data.frame(DoB=c(1980,1990,1997,1989), Name=c("Dimitris", "Nikos", "George", "Nick"), Surname=c("Diamantides", "Pappas", "Papagiannis", "Calathes"), Height = c(196, 196, 220, 199) )
And we want to join those two data frames based on the following conditions:
- The Levenshtein distance of the Names must be less than 4
- The cosine distance of (q=2) of the Surnames must be less than 0.5
- The difference in the Height must be less than 3 cm
We will work with the fuzzyjoin
library:
library(fuzzyjoin) # Check if the absolute value between left and right is smaller than three is_closer_than_three_cm <- function(left, right) { abs(left - right) < 3 } # Calculate the string distance - it should be smaller than 4 is_name_distance_below_four <- function(left, right) { stringdist(left, right) < 4 } # Calculate the string distance - it should be smaller than 0.5 is_surname_distance_below_four <- function(left, right) { stringdist(left, right, method="cosine", q=2) < 0.5 } # Join by "title" and "year" with our two helper functions fuzzy_left_join( dfa,dfb, by = c("Name", "Surname", "Height"), match_fun = c("Name" = is_name_distance_below_four, "Surname" = is_surname_distance_below_four, "Height" = is_closer_than_three_cm) )
ID Name.x Surname.x Height.x DoB Name.y Surname.y Height.y
1 1 Dimitrios Diamantidis 196 1980 Dimitris Diamantides 196
2 2 Nikolas Pappas 195 1990 Nikos Pappas 196
3 3 Georgios Papagiannis 218 1997 George Papagiannis 220
4 4 Nick Calathis 198 1989 Nick Calathes 199
Although there were differences in all three fields (Name, Surname, Heigh), we managed to join the two data sources taking into consideration the distance between the fields.
Generally, working with full names is a difficult task. You can have a look at the name parser which is a good idea to start with before you apply the fuzzy joins.
1 thought on “How to Apply Text Distances and Fuzzy Joins”