Predictive Hacks

Spelling Recommender with NLTK

autocorrect

Spelling Recommender

We showed how you can build an autocorrect based on Jaccard distance by returning also the probability of each word. We will create three different spelling recommenders, that each takes a list of misspelled words and recommends a correctly spelled word for every word in the list. For every misspelled word, the recommender should find the word in correct_spellings that has the shortest distance and starts with the same letter as the misspelled word, and return that word as a recommendation.

As a dictionary of correct words, we will consider the `words` from NTLK 

Note: Each of the two different recommenders will use a different distance measure. For our example, we will consider the following misspelled words: [spleling, mispelling, reccomender]


Jaccard distance on the 2 Q-Grams of the two words

The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union and can be described by the following formula:

Spelling Recommender with NLTK 1

Also, we will work with Q-grams which are equivalent to N-grams but they referred to characters instead of tokens. We will provide an example by taking the Jaccard distance of 2 Q-Gams. First, we will take the dictionary of the correct words from NLTK.


import nltk
from nltk.corpus import words

correct_spellings = words.words()



from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.metrics.distance  import edit_distance
 

Since we loaded the libraries, let’s work on the function. We will work with list comprehensions.

entries=['spleling', 'mispelling', 'reccomender']

for entry in entries:
    temp = [(jaccard_distance(set(ngrams(entry, 2)), set(ngrams(w, 2))),w) for w in correct_spellings if w[0]==entry[0]]
    print(sorted(temp, key = lambda val:val[0])[0][1])

And we get:

spelling
misspelling
recommender

Good job no? We got the right words


Edit Distance

A popular approach of distance measure is the Edit Distance which is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. The main transformations are the following:

  • Insertion of a new character. Example: boat vs boats (insertion of an s)
  • Deletion of an existing character. Example: fair vs far (deletion of an i)
  • Substitution of an existing character. Example: lamp vs lamb (substitution of p with b)
  • Transposition of two existing consecutive characters. Example: sing vs sign (transposition of ng to gn)


for entry in entries:
    temp = [(edit_distance(entry, w),w) for w in correct_spellings if w[0]==entry[0]]
    print(sorted(temp, key = lambda val:val[0])[0][1])
 

and we get:

selling
misspelling
recommender

Conclusion

We showed how we can apply a Spelling Recommender using Python and NLTK. We have to mention that the advanced spelling recommenders take into consideration more factors than just the distance, such as the context and the relative frequency of the suggested word.

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

1 thought on “Spelling Recommender with NLTK”

Leave a Comment

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

fuzzy matching
Python

Fuzzy Joins Tutorial

We have provided examples of how you can apply fuzzy joins in R and we assume that you are familiar

data science journey
Miscellaneous

My Journey as a Data Science Blogger

Μy Background My Studies Back in 2001, I entered university to study Statistics. During my first year, I ran my