Letter Frequency
We will walk through an example of how to easily compute the letter frequency in documents, considering either the whole document or only its unique words. Finally, we will compare our observed relative frequencies with the letter frequency of the English language.
In both English texts and English dictionaries, the letter e is the most common. Notice also that the distribution differs between texts and dictionaries.
Part A: Get the Letter Frequency in Documents
We will work with the book Moby Dick and report the frequency and the relative frequency of its letters. Finally, we will apply a chi-square test to check whether the distribution of the letters in Moby Dick is the same as what we see in English texts in general.
import pandas as pd
import numpy as np
import re
from collections import Counter

with open('moby.txt', 'r') as f:
    file_name_data = f.read()
file_name_data = file_name_data.lower()

# convert to a list where each character is an element
letter_list = list(file_name_data)

# get the frequency of each letter
my_counter = Counter(letter_list)

# convert the Counter into a Pandas data frame
df = pd.DataFrame.from_dict(my_counter, orient='index').reset_index()
df = df.rename(columns={'index': 'letter', 0: 'frequency'})

# keep only the 26 English letters
df = df.loc[df['letter'].isin(list('abcdefghijklmnopqrstuvwxyz'))]

# relative frequency of each letter within the document
df['doc_rel_freq'] = df['frequency'] / df['frequency'].sum()
df = df.sort_values('letter')

# load the English letter frequency according to Wikipedia
english = pd.read_csv("english_freq.csv")
df = pd.merge(df, english, on="letter")

# get the expected frequency under the English-language distribution
df['expected'] = np.round(df['rel_freq'] * df['frequency'].sum(), 0)
df
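Note that english_freq.csv is not shown above; from the merge we can infer that it is expected to have a letter column and a rel_freq column holding the Wikipedia relative frequencies. A minimal sketch of how such a file could be built (only the first five letters are shown, purely for illustration):

# hypothetical sketch of english_freq.csv with the assumed schema
# (letter, rel_freq); extend the lists down to 'z' for real use
import pandas as pd

wiki_freq = pd.DataFrame({
    'letter':   ['a', 'b', 'c', 'd', 'e'],
    'rel_freq': [0.08167, 0.01492, 0.02782, 0.04253, 0.12702],
})
wiki_freq.to_csv("english_freq.csv", index=False)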
import matplotlib.pyplot as plt
%matplotlib inline

df.plot(x="letter", y=["doc_rel_freq", "rel_freq"], kind="barh", figsize=(12, 8))
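If you prefer the bars ranked by frequency rather than alphabetically, you can sort the data frame before plotting (a minor tweak on the same call):

# optional: order the bars by the document's relative frequency
df.sort_values('doc_rel_freq').plot(
    x="letter", y=["doc_rel_freq", "rel_freq"], kind="barh", figsize=(12, 8)
)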
Compare the Observed Frequencies with the Expected
We will apply the Chi-Square test to compare the observed with the expected letter frequencies.
from scipy.stats import chi2_contingency

# Chi-square test of independence
c, p, dof, expected = chi2_contingency(df[['frequency', 'expected']])
p
We get that the p-value (p) is essentially 0, which implies that the letter frequency in Moby Dick does not follow the same distribution as what we see in English texts, even though the Pearson correlation is very high (~99.6%).
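This is largely a sample-size effect: with hundreds of thousands of letters, even tiny deviations from the reference distribution become statistically significant. A quick sanity check (a sketch, assuming the df and imports from the cells above) is to shrink the counts and re-run the test:

from scipy.stats import chi2_contingency

# hypothetical sketch: scale the counts down by a factor of 100 so the
# same relative deviations are tested at a much smaller sample size
# (assumes no letter count rounds down to zero)
scaled = (df[['frequency', 'expected']] / 100).round()
c_s, p_s, dof_s, expected_s = chi2_contingency(scaled)
p_s  # noticeably larger p-value than for the full counts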
df[['frequency', 'expected']].corr()
Part B: Get the Letter Frequency in Unique Words
We will apply the same logic as above, but this time we will consider only the unique words, and we will compare them with the letter frequency of the English dictionary according to Wikipedia.
# get the words
words = re.findall(r'\w+', file_name_data)

# get the unique words
V = list(set(words))

# concatenate all words into one text
# and then get the list of each character
letter_list = list(" ".join(V))

# get the frequency of each letter
my_counter = Counter(letter_list)

# convert the Counter into a Pandas data frame
df = pd.DataFrame.from_dict(my_counter, orient='index').reset_index()
df = df.rename(columns={'index': 'letter', 0: 'frequency'})

# keep only the 26 English letters
df = df.loc[df['letter'].isin(list('abcdefghijklmnopqrstuvwxyz'))]

# relative frequency of each letter among the unique words
df['doc_rel_freq'] = df['frequency'] / df['frequency'].sum()
df = df.sort_values('letter')

# load the English dictionary letter frequency according to Wikipedia
english = pd.read_csv("english_dict_freq.csv")
df = pd.merge(df, english, on="letter")

# get the expected frequency under the dictionary distribution
df['expected'] = np.round(df['rel_freq'] * df['frequency'].sum(), 0)
df
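Note that the \w+ pattern also matches digits and underscores; this is harmless here because we keep only the 26 letters afterwards, but if you wanted strictly alphabetic tokens you could tokenize as follows (an alternative sketch, given that the text is already lowercased):

# alternative tokenization: keep only purely alphabetic "words"
words = re.findall(r'[a-z]+', file_name_data)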
df.plot(x="letter", y=["doc_rel_freq", "rel_freq"], kind="barh", figsize=(12, 8))
Compare the Observed Frequencies with the Expected
As before, we will apply the Chi-Square test.
# Chi-square test of independence
c, p, dof, expected = chi2_contingency(df[['frequency', 'expected']])
p
1.7915973729245735e-84
Again, we may infer that there is a statistically significant difference between the letter distribution of the unique words of our document and that of the English dictionary, even though the Pearson correlation is once more very high (~99.6%).
df[['frequency', 'expected']].corr()
Discussion
The use of letter frequencies and frequency analysis plays a fundamental role in cryptograms and several word puzzle games, including Hangman, Scrabble and the television game show Wheel of Fortune. Letter frequencies also have a strong effect on the design of some keyboard layouts.
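As a tiny illustration of frequency analysis in cryptograms (a hypothetical sketch, not part of the walk-through above): to break a Caesar cipher you can guess the shift by assuming the most frequent ciphertext letter stands for e.

from collections import Counter

def guess_caesar_shift(ciphertext):
    """Guess a Caesar shift by assuming the most frequent letter maps to 'e'."""
    counts = Counter(c for c in ciphertext.lower() if c.isalpha())
    most_common_letter = counts.most_common(1)[0][0]
    return (ord(most_common_letter) - ord('e')) % 26

# "this is a longer sample sentence to test the heuristic" shifted by 3;
# very short texts can fool the heuristic
print(guess_caesar_shift("wklv lv d orqjhu vdpsoh vhqwhqfh wr whvw wkh khxulvwlf"))  # 3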
Today, we showed how easily you can get the letter frequency using Python, and how you can apply statistical tests to compare the distribution of the letters between two documents or against a reference distribution (like that of English texts).