In this tutorial, we will compute a Bayesian score for each word as well as for the whole Subject Line. The score indicates how likely a Subject Line and/or token is to be “spam”. You can find the dataset here. We used the same dataset in the Email Spam Detector Tutorial, so feel free to compare the Bayesian approach with Logistic Regression.
Load the libraries
import pandas as pd
import numpy as np
import re
from collections import Counter
import string
Theory and Formulas
So how do you train a Naive Bayes classifier?
- The first part of training a Naive Bayes classifier is to identify the number of classes that you have.
- You will create a probability for each class. \(P(D_{pos})\) is the probability that the document is positive. \(P(D_{neg})\) is the probability that the document is negative.
Use the formulas as follows and store the values in a dictionary:
\(P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}\)
\(P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}\)
Where \(D\) is the total number of documents (Subject Lines in this case), \(D_{pos}\) is the total number of positive SLs and \(D_{neg}\) is the total number of negative SLs.
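As a minimal sketch with made-up counts (not yet our dataset), the two class probabilities can be computed and stored in a dictionary like this:

# hypothetical counts, for illustration only
D_pos = 15            # number of spam (positive) Subject Lines
D_neg = 85            # number of ham (negative) Subject Lines
D = D_pos + D_neg     # total number of Subject Lines

# store the class probabilities in a dictionary, as in formulas (1) and (2)
priors = {'pos': D_pos / D, 'neg': D_neg / D}
print(priors)         # {'pos': 0.15, 'neg': 0.85}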
Prior and Logprior
The prior probability represents the underlying probability in the target population that a SL is positive versus negative. In other words, if we had no specific information and blindly picked a SL out of the population set, what is the probability that it will be positive versus that it will be negative? That is the “prior”.
The prior is the ratio of the probabilities \(\frac{P(D_{pos})}{P(D_{neg})}\).
We can take the log of the prior to rescale it, and we’ll call this the logprior.
\(\text{logprior} = log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = log \left( \frac{D_{pos}}{D_{neg}} \right)\).
Note that \(\log(\frac{A}{B})\) is the same as \(\log(A) - \log(B)\). So the logprior can also be calculated as the difference between two logs:
\(\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}\)
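A quick numeric sketch (with the same hypothetical counts as above) shows that both forms of formula (3) agree:

import numpy as np

# hypothetical counts, for illustration only
D_pos, D_neg = 15, 85

logprior = np.log(D_pos / D_neg)              # log of the ratio
logprior_alt = np.log(D_pos) - np.log(D_neg)  # difference of two logs
print(round(logprior, 4), round(logprior_alt, 4))   # both print -1.7346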
Positive and Negative Probability of a Word
To compute the positive probability and the negative probability for a specific word in the vocabulary, we’ll use the following inputs:
- \(freq_{pos}\) and \(freq_{neg}\) are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- \(N_{pos}\) and \(N_{neg}\) are the total number of positive and negative words for all documents (for all SLs), respectively.
- \(V\) is the number of unique words in the entire set of documents, for all classes, whether positive or negative.
We’ll use these to compute the positive and negative probability for a specific word using this formula:
\(P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} \)
\(P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} \)
Notice that we add the “+1” in the numerator for additive smoothing, and \(V\) in the denominator so the smoothed probabilities still sum to one; this way a word that never appears in a class does not get a probability of zero. This wiki article explains more about additive smoothing.
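Here is a small sketch of formulas (4) and (5) with made-up frequencies, so the effect of the smoothing is visible:

# hypothetical values, for illustration only
freq_pos, freq_neg = 30, 2    # counts of one word in the spam / ham class
N_pos, N_neg = 1000, 5000     # total number of words in each class
V = 3000                      # number of unique words in the vocabulary

p_w_pos = (freq_pos + 1) / (N_pos + V)   # formula (4): 31/4000
p_w_neg = (freq_neg + 1) / (N_neg + V)   # formula (5): 3/8000
print(p_w_pos, p_w_neg)

# thanks to the "+1", a word never seen in a class still gets a non-zero probability
print((0 + 1) / (N_pos + V))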
Log likelihood
To compute the loglikelihood of that very same word, we can implement the following equation:
\(\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}\)
The overall probability of each SL is:
\(p = \text{logprior} + \sum_{i=1}^{N} \text{loglikelihood}_i\tag{7}\)
where \(N\) is the number of words in the SL, i.e. we sum up the loglikelihood of each word in the SL and add the logprior.
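For intuition, here is a tiny sketch with made-up numbers: a logprior of -1.9 and a three-word SL with word loglikelihoods 2.1, 1.4 and -0.2 gives a score of 1.4, so that SL would lean towards spam.

import numpy as np

# hypothetical numbers, for illustration only
logprior = -1.9
loglikelihoods = [2.1, 1.4, -0.2]   # one value per word in the SL

score = logprior + np.sum(loglikelihoods)
print(round(score, 2))   # 1.4 > 0, so this SL leans towards spam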
Coding
Let’s get our hands dirty by building the formulas above.
# load the data and set spam=1 and ham=0
# convert the text to lower case and remove the punctuation
df = pd.read_csv("spam.csv")
df['target'] = df.target.map({'spam':1, 'ham':0})
df['text'] = df.text.apply(lambda x: x.lower())
df['text'] = df.text.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
df
# get the vocabulary frequencies over the whole corpus
V_freq = Counter(" ".join(df['text'].values.tolist()).split(" "))

# V is the number of unique words in the whole corpus
V = len(V_freq.keys())

# get freq_pos: word counts in the spam (positive) class
freq_pos = Counter(" ".join(df.loc[df.target==1]['text'].values.tolist()).split(" "))

# get freq_neg: word counts in the ham (negative) class
freq_neg = Counter(" ".join(df.loc[df.target==0]['text'].values.tolist()).split(" "))

# get the number of positive and negative documents
D_pos = sum(df.target==1)
D_neg = sum(df.target==0)

# get the total number of words in the positive and negative class
N_pos = sum(freq_pos.values())
N_neg = sum(freq_neg.values())

logprior = np.log(D_pos/D_neg)

def word_loglikelihood(w):
    w = w.lower()
    if w in V_freq:
        p_w_pos = (freq_pos.get(w, 0) + 1) / (N_pos + V)   # formula (4)
        p_w_neg = (freq_neg.get(w, 0) + 1) / (N_neg + V)   # formula (5)
        return np.log(p_w_pos / p_w_neg)                   # formula (6)
    else:
        return 0
Let’s see the score of some words, like “lovable” and “free”.
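The exact values depend on the training data, but the calls are simply:

word_loglikelihood("lovable")
word_loglikelihood("free")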
As we can see, the word “free” has a high score (>0), which means that this word is more related to spam emails. On the contrary, the word “lovable” has a very low score (<0), which means that it is not related to spam emails.
Let’s create a function that returns the score of the whole subject line by adding up the word likelihood of each word plus the logprior.
def text_loglikelihood(mytxt):
    mytxt = mytxt.lower().split(" ")
    score = logprior
    for w in mytxt:
        score += word_loglikelihood(w)
        # print(w, word_loglikelihood(w))
    return score
Get the score of the first SL from our data frame:
text_loglikelihood(df.iloc[0]['text'])
We get:
-107.49288547485799
which implies that this SL is more likely to be Ham.
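We can score any string in the same way. For example, with a made-up, obviously spammy subject line (the exact value depends on the training data, but we would expect a positive score):

text_loglikelihood("congratulations you have won a free prize call now")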
Make Predictions
Let’s say that we want to make predictions for all the SLs. We will add two columns: the score, and the predicted label, which takes the value 1 when the score is positive and 0 otherwise.
df['score'] = df.text.apply(lambda x: text_loglikelihood(x))
df['prediction'] = df.score.apply(lambda x: int(x>0))

# confusion matrix
df.groupby(['target','prediction']).size().reset_index()
Finally, the accuracy on the train dataset is 99.5%:
np.mean(df.target==df.prediction)