We showed how you can build an autocorrect based on Jaccard distance by returning also the probability of each word. We will create three different spelling recommenders, that each takes a list of misspelled words and recommends a correctly spelled word for every word in the list. For every misspelled word, the recommender should find the word in
correct_spellings that has the shortest distance and starts with the same letter as the misspelled word, and return that word as a recommendation.
As a dictionary of correct words, we will consider the `words` from NTLK
Note: Each of the two different recommenders will use a different distance measure. For our example, we will consider the following misspelled words: [spleling, mispelling, reccomender]
Jaccard distance on the 2 Q-Grams of the two words
The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union and can be described by the following formula:
Also, we will work with Q-grams which are equivalent to N-grams but they referred to characters instead of tokens. We will provide an example by taking the Jaccard distance of 2 Q-Gams. First, we will take the dictionary of the correct words from NLTK.
import nltk from nltk.corpus import words correct_spellings = words.words() from nltk.metrics.distance import jaccard_distance from nltk.util import ngrams from nltk.metrics.distance import edit_distance
Since we loaded the libraries, let’s work on the function. We will work with list comprehensions.
entries=['spleling', 'mispelling', 'reccomender'] for entry in entries: temp = [(jaccard_distance(set(ngrams(entry, 2)), set(ngrams(w, 2))),w) for w in correct_spellings if w==entry] print(sorted(temp, key = lambda val:val))
And we get:
spelling misspelling recommender
Good job no? We got the right words
A popular approach of distance measure is the Edit Distance which is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. The main transformations are the following:
- Insertion of a new character. Example: boat vs boats (insertion of an
- Deletion of an existing character. Example: fair vs far (deletion of an
- Substitution of an existing character. Example: lamp vs lamb (substitution of
- Transposition of two existing consecutive characters. Example: sing vs sign (transposition of
for entry in entries: temp = [(edit_distance(entry, w),w) for w in correct_spellings if w==entry] print(sorted(temp, key = lambda val:val))
and we get:
selling misspelling recommender
We showed how we can apply a Spelling Recommender using Python and NLTK. We have to mention that the advanced spelling recommenders take into consideration more factors than just the distance, such as the context and the relative frequency of the suggested word.