Predictive Hacks

How to Build an Autocorrect in Python


Description

We assume that you are familiar with the concepts of string distance and string similarity. You can also have a look at the Spelling Recommender. We will show how to build a simple autocorrect tool in Python with a few lines of code. All you need is a corpus from which to build your vocabulary and the word frequencies. The idea is the following:

  • You enter a word. If the word exists in the vocabulary, we assume that it is correct.
  • If the word does not exist in the vocabulary, we find the most similar vocabulary words and rank them by similarity and then by their relative frequency (probability).

Build the Vocabulary

We will work with the Moby Dick book. Let’s start.


import re
from collections import Counter

import pandas as pd
import textdistance

# Read the whole book, lowercase it, and split it into word tokens
with open('moby.txt', 'r') as f:
    file_name_data = f.read().lower()

# \w+ matches runs of letters, digits, and underscores
words = re.findall(r'\w+', file_name_data)

# This is our vocabulary
V = set(words)

print(f"The first ten words in the text are: \n{words[0:10]}")
print(f"There are {len(V)} unique words in the vocabulary.")
 
The first ten words in the text are: 
['moby', 'dick', 'by', 'herman', 'melville', '1851', 'etymology', 'supplied', 'by', 'a']
There are 17140 unique words in the vocabulary.
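
To see what the `\w+` pattern keeps and drops, here is a small standalone sketch. The sample sentence is made up for illustration and is not taken from moby.txt. Note that digits survive as tokens (which is why '1851' appears above) and that apostrophes split words apart.

```python
import re

# Hypothetical sample text, not taken from moby.txt
sample = "Call me Ishmael. It's 1851; don't panic!"

# Same tokenization as above: lowercase, then keep runs of word characters
tokens = re.findall(r'\w+', sample.lower())
print(tokens)
# ['call', 'me', 'ishmael', 'it', 's', '1851', 'don', 't', 'panic']
```

If fragments such as 's' and 't' are a problem for your corpus, you may want a different pattern, at the cost of a slightly different vocabulary.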

Get the Word Frequencies

We have already built a list of words called words, so now we can count how often each word occurs. We can use the Counter class from the collections module.


# Counter maps each word to its number of occurrences
word_freq_dict = Counter(words)

print(word_freq_dict.most_common()[0:10])
 
[('the', 14431), ('of', 6609), ('and', 6430), ('a', 4736), ('to', 4625), ('in', 4172), ('that', 3085), ('his', 2530), ('it', 2522), ('i', 2127)]

Get the Relative Word Frequencies

Now we want the probability of each word appearing, which is equivalent to its relative frequency: the word's count divided by the total number of words in the text.

probs = {}

# Total number of word tokens in the text
Total = sum(word_freq_dict.values())

for k in word_freq_dict.keys():
    probs[k] = word_freq_dict[k] / Total
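
The same computation can be written more compactly with a dict comprehension. The sketch below uses a tiny made-up word list (`toy_words` and the other `toy_` names are hypothetical stand-ins for the variables built from the book) and checks that the relative frequencies sum to 1.

```python
import math
from collections import Counter

# Toy word list standing in for the `words` list built from the book
toy_words = ['the', 'whale', 'the', 'sea', 'the', 'whale']

toy_freq = Counter(toy_words)
toy_total = sum(toy_freq.values())

# Relative frequency = count of the word / total number of tokens
toy_probs = {word: count / toy_total for word, count in toy_freq.items()}

print(toy_probs['the'])  # 3 of the 6 tokens, i.e. 0.5
# The relative frequencies form a probability distribution, so they sum to 1
print(math.isclose(sum(toy_probs.values()), 1.0))
```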
  

Similarity based on Jaccard Distance and Q-Grams

We will rank the vocabulary words by their Jaccard similarity to the input word (1 minus the Jaccard distance), computed on the character 2-grams (q-grams with q = 2) of each word. We will return the 5 most similar words, ordered by similarity and then by probability.
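
To make the measure concrete, here is a minimal pure-Python sketch of Jaccard similarity on character 2-grams. It is an illustration, not the textdistance implementation: the helper names `qgrams` and `jaccard_similarity` are hypothetical, and the sketch assumes q-grams are counted as multisets. For the example words every 2-gram occurs only once, so set and multiset counting give the same result.

```python
from collections import Counter

def qgrams(word, q=2):
    """Return the multiset of overlapping character q-grams of `word`."""
    return Counter(word[i:i + q] for i in range(len(word) - q + 1))

def jaccard_similarity(a, b, q=2):
    """Jaccard similarity = |intersection| / |union| of the q-gram multisets."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    intersection = sum((grams_a & grams_b).values())
    union = sum((grams_a | grams_b).values())
    # Words shorter than q have no q-grams; this sketch returns 0.0 in that case
    return intersection / union if union else 0.0

# 'occurence' has 8 distinct 2-grams and 'occurrence' has 9 ('rr' is the extra one)
print(round(jaccard_similarity('occurence', 'occurrence'), 6))  # 8/9 -> 0.888889
print(jaccard_similarity('neverteless', 'nevertheless'))        # 9/12 -> 0.75
```

A similarity of 1 means the two words share exactly the same 2-grams, and 0 means they share none.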


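def my_autocorrect(input_word):
    input_word = input_word.lower()

    # A word already in the vocabulary is assumed to be spelled correctly
    if input_word in V:
        return 'Your word seems to be correct'

    # Jaccard similarity on character 2-grams between the input and every vocabulary word
    jaccard = textdistance.Jaccard(qval=2)
    similarities = [1 - jaccard.distance(v, input_word) for v in word_freq_dict.keys()]
    df = pd.DataFrame.from_dict(probs, orient='index').reset_index()
    df = df.rename(columns={'index': 'Word', 0: 'Prob'})
    df['Similarity'] = similarities
    # Rank by similarity first, then by relative frequency, and keep the top 5
    output = df.sort_values(['Similarity', 'Prob'], ascending=False).head()
    return output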
 

Let’s see some examples:

Autocorrect neverteless:

my_autocorrect('neverteless')
 

       Word          Prob      Similarity
2209   nevertheless  0.000229  0.750000
13300  boneless      0.000014  0.416667
12309  elevates      0.000005  0.416667
718    never         0.000942  0.400000
6815   level         0.000110  0.400000

Autocorrect nesseccary:

my_autocorrect('nesseccary')
 
       Word         Prob      Similarity
6546   necessary    0.000032  0.545455
4088   unnecessary  0.000027  0.461538
14079  caressed     0.000005  0.454545
1766   seneca       0.000005  0.400000
16770  caress       0.000005  0.400000

Autocorrect occurence:

my_autocorrect('occurence')
       Word        Prob      Similarity
10986  occurrence  0.000009  0.888889
1202   occurred    0.000046  0.500000
11230  occur       0.000032  0.500000
11824  occurs      0.000005  0.444444
2817   current     0.000009  0.400000

Conclusion

We have presented just one case, where the vocabulary was taken from Moby Dick, which certainly does not reflect the actual relative frequencies of English words. Nevertheless, the results were relatively good. We used the Jaccard distance here; you can try other distances such as cosine distance or edit distance. You can also have a look at the documentation of the textdistance library.
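
As an illustration of swapping in another measure, here is a hand-written sketch of edit (Levenshtein) distance used to rank a few candidate words. The `edit_distance` function and the `candidates` list are hypothetical and only meant to show the idea; in practice you would probably use the implementation that ships with the textdistance library. Unlike the similarity above, lower values are better here.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

# Hypothetical candidate list; a real tool would scan the whole vocabulary
candidates = ['occurrence', 'occurred', 'occur', 'current']
ranked = sorted(candidates, key=lambda w: edit_distance('occurence', w))

print(edit_distance('occurence', 'occurrence'))  # 1 (one missing 'r')
print(ranked[0])                                 # occurrence
```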
