Predictive Hacks

# Naive Bayes Classification in NLP tasks from Scratch

In this tutorial, we will compute the Bayesian score of each word as well as of the whole Subject Line. The score indicates how likely a Subject Line or token is to be “spam”. You can find the dataset here. We used the same dataset in the Email Spam Detector Tutorial, so feel free to compare the Bayesian approach with Logistic Regression.

import pandas as pd
import numpy as np
import re
from collections import Counter
import string


## Theory and Formulas

#### So how do you train a Naive Bayes classifier?

• The first part of training a Naive Bayes classifier is to identify the number of classes that you have. Here there are two: positive (spam, label 1) and negative (ham, label 0).
• You will create a probability for each class. $$P(D_{pos})$$ is the probability that the document is positive. $$P(D_{neg})$$ is the probability that the document is negative.
Use the formulas as follows and store the values in a dictionary:

$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$

$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$

Where $$D$$ is the total number of documents (Subject Lines, abbreviated SL, in this case), $$D_{pos}$$ is the total number of positive SLs and $$D_{neg}$$ is the total number of negative SLs.
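
As a quick illustration of equations (1) and (2), here is a minimal sketch on a made-up list of labels. The `labels` list and the `class_prob` dictionary are assumptions for this example only and are not part of the dataset used later.

```python
# Toy labels (hypothetical): 1 = spam (positive), 0 = ham (negative)
labels = [1, 0, 0, 1, 0, 0, 0, 1]

D = len(labels)      # total number of documents
D_pos = sum(labels)  # documents labelled 1
D_neg = D - D_pos    # documents labelled 0

# Equations (1) and (2), stored in a dictionary keyed by class
class_prob = {"pos": D_pos / D, "neg": D_neg / D}
print(class_prob)
```

The two values always add up to 1, since every document belongs to exactly one of the two classes.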

#### Prior and Logprior

The prior probability represents the underlying probability in the target population that a SL is positive versus negative. In other words, if we had no specific information and blindly picked a SL out of the population set, what is the probability that it will be positive versus that it will be negative? That is the “prior”.

The prior is the ratio of the probabilities $$\frac{P(D_{pos})}{P(D_{neg})}$$.
We can take the log of the prior to rescale it, and we’ll call this the logprior:

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$$

Note that $$\log(\frac{A}{B})$$ is the same as $$\log(A) - \log(B)$$. So the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}$$
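
To check that the ratio form and the difference form of the logprior agree, here is a small sketch with hypothetical document counts (3 positive and 5 negative, chosen only for illustration).

```python
import math

# Hypothetical document counts, for illustration only
D_pos, D_neg = 3, 5
D = D_pos + D_neg

# Ratio form: log of P(D_pos) / P(D_neg)
logprior_ratio = math.log((D_pos / D) / (D_neg / D))

# Difference form, equation (3): the common factor D cancels out
logprior_diff = math.log(D_pos) - math.log(D_neg)

print(logprior_ratio, logprior_diff)
```

Both forms give the same value. It is negative here because the negative class is the more frequent one in this toy setup.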

#### Positive and Negative Probability of a Word

To compute the positive probability and the negative probability for a specific word in the vocabulary, we’ll use the following inputs:

• $$freq_{pos}$$ and $$freq_{neg}$$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
• $$N_{pos}$$ and $$N_{neg}$$ are the total number of words, counting repeats, in all positive and all negative documents (SLs), respectively.
• $$V$$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We’ll use these to compute the positive and negative probability for a specific word using this formula:

$$P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4}$$
$$P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5}$$
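
The following sketch applies equations (4) and (5) to made-up word counts. The counters and the helper `word_probs` are hypothetical names used only for this example. The `+ 1` in the numerator is add-one smoothing, which keeps the probability above zero for a word that never appears in one of the classes.

```python
from collections import Counter

# Hypothetical per-class word counts, for illustration only
freq_pos = Counter({"free": 3, "win": 2, "now": 1})
freq_neg = Counter({"meeting": 3, "lunch": 2, "now": 1})

N_pos = sum(freq_pos.values())            # total words in the positive class
N_neg = sum(freq_neg.values())            # total words in the negative class
vocab = set(freq_pos) | set(freq_neg)     # unique words across both classes
V = len(vocab)

def word_probs(word):
    """Return (P(W_pos), P(W_neg)) using equations (4) and (5)."""
    p_pos = (freq_pos[word] + 1) / (N_pos + V)
    p_neg = (freq_neg[word] + 1) / (N_neg + V)
    return p_pos, p_neg

print(word_probs("free"))  # "free" never occurs in the negative class
```

Note the parentheses around `freq_pos[word] + 1`: without them, Python would divide only the `1` by the denominator, which is not what the formula says.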

#### Log likelihood

To compute the loglikelihood of that very same word, we can implement the following equation:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$

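The overall score of each SL is:

$$p = \text{logprior} + \sum_{i=1}^{N} \text{loglikelihood}_i$$

where $$N$$ is the number of words in the SL, so we sum up the loglikelihoods of each word in the SL plus the logprior. Strictly speaking, $$p$$ is a log-odds score rather than a probability: a positive value means the SL is more likely to be spam, and a negative value means it is more likely to be ham.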
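
Here is a minimal sketch of that sum. The `word_scores` dictionary holds hypothetical loglikelihood values chosen for illustration, and the helper `score` is an assumed name, not the function built later in this tutorial.

```python
import math

# Hypothetical values, for illustration only
logprior = math.log(3 / 5)
word_scores = {
    "free": math.log(4.0),      # 4 times more likely in spam
    "now": 0.0,                 # equally likely in both classes
    "meeting": math.log(0.25),  # 4 times more likely in ham
}

def score(tokens):
    """logprior plus the sum of per-word loglikelihoods; unknown words add 0."""
    return logprior + sum(word_scores.get(t, 0.0) for t in tokens)

s_spam = score(["free", "now"])
s_ham = score(["meeting", "now"])
print(s_spam, s_ham)
```

Because everything is in log space, the ratios multiply: the first score equals log(0.6 × 4) and is positive, while the second equals log(0.6 × 0.25) and is negative.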

## Coding

Let’s get our hands dirty by building the formulas above.

# df is assumed to be loaded already (e.g. with pd.read_csv),
# with a 'text' column and a 'target' column containing 'spam' or 'ham'
# set spam=1 and ham=0
df['target'] = df.target.map({'spam': 1, 'ham': 0})
# convert the text to lower case and remove the punctuation
df['text'] = df.text.str.lower()
df['text'] = df.text.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
df
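
To see what this cleaning step does to a single subject line, here is a small standalone sketch. The sample strings and the helper name `clean` are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character
table = str.maketrans("", "", string.punctuation)

def clean(text):
    """Lower-case the text and strip punctuation."""
    return text.lower().translate(table)

cleaned = clean("Congrats!!! You WON a FREE ticket.")
print(cleaned)

# Removing a standalone punctuation mark leaves two spaces behind
gappy = clean("Win - now")
print(gappy.split(" "))  # contains an empty token
print(gappy.split())     # no empty token
```

Splitting with no argument collapses runs of whitespace, which avoids counting the empty string as a word.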


# Get the frequency of every word in the whole corpus
# (split() with no argument avoids empty tokens from repeated spaces)
V_freq = Counter(" ".join(df['text']).split())

# Get V, the number of unique words
V = len(V_freq)

# get the freq_pos
freq_pos = Counter(" ".join(df.loc[df.target == 1, 'text']).split())

# get the freq_neg
freq_neg = Counter(" ".join(df.loc[df.target == 0, 'text']).split())

# get the number of positive and negative documents
D_pos = (df.target == 1).sum()
D_neg = (df.target == 0).sum()

# get the total number of words (counting repeats) in each class
N_pos = sum(freq_pos.values())
N_neg = sum(freq_neg.values())

logprior = np.log(D_pos / D_neg)

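def word_loglikelihood(w):
    w = w.lower()
    if w in V_freq:
        # smoothed probabilities, equations (4) and (5);
        # the parentheses around freq + 1 are required
        p_w_pos = (freq_pos.get(w, 0) + 1) / (N_pos + V)
        p_w_neg = (freq_neg.get(w, 0) + 1) / (N_neg + V)
        return np.log(p_w_pos / p_w_neg)
    else:
        # words outside the vocabulary do not change the score
        return 0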



Let’s see the score of some words, like “lovable” and “free”.

As we can see, the word “free” has a positive score (>0), which means that this word is more associated with spam emails. On the contrary, the word “lovable” has a negative score (<0), which means that it is more associated with ham.
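
The same pattern can be reproduced on a tiny made-up corpus. The sketch below is a standalone toy version of the steps above, so the subject lines, labels and the function name `toy_loglikelihood` are all assumptions for illustration, and the values will differ from those on the real dataset.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (subject line, label) with 1 = spam, 0 = ham
corpus = [
    ("free prize win now", 1),
    ("win free cash", 1),
    ("lovable puppy photos", 0),
    ("lunch meeting now", 0),
    ("lovable friends lunch", 0),
]

freq_pos, freq_neg = Counter(), Counter()
for text, label in corpus:
    (freq_pos if label == 1 else freq_neg).update(text.split())

V = len(set(freq_pos) | set(freq_neg))  # unique words in the corpus
N_pos = sum(freq_pos.values())          # total words in spam lines
N_neg = sum(freq_neg.values())          # total words in ham lines

def toy_loglikelihood(w):
    """Equation (6) with add-one smoothing; unknown words score 0."""
    if w not in freq_pos and w not in freq_neg:
        return 0.0
    p_pos = (freq_pos[w] + 1) / (N_pos + V)
    p_neg = (freq_neg[w] + 1) / (N_neg + V)
    return math.log(p_pos / p_neg)

for word in ["free", "lovable", "unicorn"]:
    print(word, toy_loglikelihood(word))
```

In this toy corpus “free” only appears in spam lines and “lovable” only in ham lines, so their scores land on opposite sides of zero.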

Let’s create a function that returns the score of the whole subject line by adding up the loglikelihood of each word plus the logprior.

def text_loglikelihood(mytxt):
    # split on any whitespace so repeated spaces do not create empty tokens
    words = mytxt.lower().split()
    score = logprior
    for w in words:
        score += word_loglikelihood(w)
        # print(w, word_loglikelihood(w))
    return score


Get the score of the first SL from our data frame:

text_loglikelihood(df.iloc[0]['text'])


In our run, we get:

-107.49288547485799


which implies that this SL is more likely to be Ham.

## Make Predictions

Let’s say that we want to make predictions for all the SLs. We will add two columns: the score, and the predicted label, which is 1 when the score is positive and 0 otherwise.

df['score'] = df.text.apply(text_loglikelihood)
df['prediction'] = (df.score > 0).astype(int)

# confusion matrix
df.groupby(['target','prediction']).size().reset_index()


Finally, the accuracy on the training dataset was 99.5% in our run:

np.mean(df.target==df.prediction)
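
To make the link between the confusion matrix and the accuracy explicit, here is a small sketch on made-up labels. The `targets` and `predictions` lists are hypothetical and unrelated to the results above.

```python
from collections import Counter

# Hypothetical true labels and predictions (1 = spam, 0 = ham)
targets =     [1, 1, 0, 0, 0, 1, 0, 0]
predictions = [1, 0, 0, 0, 1, 1, 0, 0]

# Count each (target, prediction) pair: the four cells of the confusion matrix
confusion = Counter(zip(targets, predictions))
tp = confusion[(1, 1)]  # spam predicted as spam
fn = confusion[(1, 0)]  # spam predicted as ham
fp = confusion[(0, 1)]  # ham predicted as spam
tn = confusion[(0, 0)]  # ham predicted as ham

# Accuracy is the share of the diagonal cells
accuracy = (tp + tn) / len(targets)
print(tp, fn, fp, tn, accuracy)
```

Keep in mind that an accuracy measured on the same data used for training tends to be optimistic; a held-out test set gives a fairer estimate.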