Predictive Hacks

Email Spam Detector in Python


Ham or Spam

One of the most common projects, especially for teaching purposes, is building a model that predicts whether a message is spam or not. Our dataset, called Spam, contains the subject lines and a target label, which we encode as 0 for ham and 1 for spam.

import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

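# Load the data and encode the label: spam -> 1, ham -> 0
spam_data = pd.read_csv('spam.csv')
spam_data['target'] = np.where(spam_data['target'] == 'spam', 1, 0)

# Show the first 10 rows
spam_data.head(10)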

	text	target
0	Go until jurong point, crazy.. Available only ...	0
1	Ok lar... Joking wif u oni...	0
2	Free entry in 2 a wkly comp to win FA Cup fina...	1
3	U dun say so early hor... U c already then say...	0
4	Nah I don't think he goes to usf, he lives aro...	0
5	FreeMsg Hey there darling it's been 3 week's n...	1
6	Even my brother is not like to speak with me. ...	0
7	As per your request 'Melle Melle (Oru Minnamin...	0
8	WINNER!! As a valued network customer you have...	1
9	Had your mobile 11 months or more? U R entitle...	1

Split the Data into Train and Test Dataset

X_train, X_test, y_train, y_test = train_test_split(spam_data['text'],
                                                    spam_data['target'])  # call is truncated in the source; the target column is assumed from context
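To make the split step concrete, here is a minimal sketch on a toy DataFrame. The rows are hypothetical stand-ins for spam.csv, and `random_state=0` is an arbitrary seed chosen for this illustration, not a value taken from the article.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for spam_data (hypothetical rows, not from spam.csv)
toy = pd.DataFrame({
    "text": [f"message {i}" for i in range(8)],
    "target": [0, 0, 0, 0, 0, 0, 1, 1],
})

# With no test_size given, scikit-learn holds out 25% of the rows.
# random_state=0 is an arbitrary seed so that reruns give the same split.
X_tr, X_te, y_tr, y_te = train_test_split(toy["text"], toy["target"], random_state=0)

print(len(X_tr), len(X_te))
```

When the classes are imbalanced, as spam data usually is, passing `stratify=toy["target"]` keeps the ham/spam ratio roughly the same in both parts.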

Build the tf-idf on N-grams

Fit and transform the training data X_train using a TfidfVectorizer, ignoring terms that have a document frequency strictly lower than 5 and using word n-grams from n=1 to n=3 (unigrams, bigrams, and trigrams).

vect = TfidfVectorizer(min_df=5, ngram_range=(1,3)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
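The effect of `min_df` and `ngram_range` is easier to see on a tiny corpus. The sketch below uses three made-up documents and `min_df=2` (rather than 5, which would filter out everything in so small a corpus), so only n-grams that occur in at least two documents survive.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, for illustration only
docs = [
    "free entry win prize",
    "free entry today",
    "see you today",
]

# Keep unigrams, bigrams and trigrams that appear in at least 2 documents
vect_toy = TfidfVectorizer(min_df=2, ngram_range=(1, 3)).fit(docs)

vocab = sorted(vect_toy.vocabulary_)
print(vocab)  # ['entry', 'free', 'free entry', 'today']

X_toy = vect_toy.transform(docs)
print(X_toy.shape)  # (3, 4): 3 documents, 4 surviving n-grams
```

Note that the bigram "free entry" is kept because it occurs in two documents, while every trigram occurs only once and is dropped.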

Add Features

Apart from the tokens, we can add features such as the length of the subject line, the number of digits, the number of dollar signs, and the number of non-word characters (anything other than a letter, digit or underscore). Let’s create a function for that.

def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
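As a rough illustration of what this stacking does, the sketch below computes the four extra features for two made-up messages and appends them to a small stand-in matrix. The regular expressions and the variable names are assumptions based on the feature description, and `X_base` is a hypothetical placeholder for the tf-idf matrix.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack

# Two hypothetical messages
texts = pd.Series(["Win $100 now!!", "ok see u at 5"])

# Assumed feature definitions (one value per message)
length = texts.str.len()            # number of characters
digits = texts.str.count(r"\d")     # number of digits
dollars = texts.str.count(r"\$")    # number of dollar signs
non_word = texts.str.count(r"\W")   # anything other than letter, digit, underscore

# Stand-in for a tf-idf matrix: 2 documents x 3 terms
X_base = csr_matrix([[0.5, 0.0, 0.1], [0.0, 0.7, 0.0]])

# Stacking 4 features gives shape (4, 2); the transpose makes it (2, 4),
# i.e. one row per document, so it can be appended column-wise.
extra = csr_matrix(np.array([length, digits, dollars, non_word])).T
X_aug = hstack([X_base, extra], "csr")

print(X_aug.shape)  # (2, 7)
print(X_aug.toarray()[:, 3:])
```

The transpose is the important detail: without it the feature rows would not line up with the document rows of the tf-idf matrix.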

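# Train Data
# The add_* series are not defined in the listing; the definitions below are assumed from the feature description above
add_length = X_train.str.len()
add_digits = X_train.str.count(r'\d')
add_dollars = X_train.str.count(r'\$')
add_characters = X_train.str.count(r'\W')
X_train_transformed = add_feature(X_train_vectorized, [add_length, add_digits, add_dollars, add_characters])

# Test Data (same assumed definitions, applied to X_test)
add_length_t = X_test.str.len()
add_digits_t = X_test.str.count(r'\d')
add_dollars_t = X_test.str.count(r'\$')
add_characters_t = X_test.str.count(r'\W')
X_test_transformed = add_feature(vect.transform(X_test), [add_length_t, add_digits_t, add_dollars_t, add_characters_t])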

Train the Logistic Regression Model

We will build the Logistic Regression Model and we will report the AUC score on the test dataset:

clf = LogisticRegression(C=100, solver='lbfgs', max_iter=1000).fit(X_train_transformed, y_train)

y_predicted = clf.predict(X_test_transformed)

auc = roc_auc_score(y_test, y_predicted)
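The AUC here is computed from the hard 0/1 predictions. AUC is more commonly computed from continuous scores (for a fitted LogisticRegression, `clf.predict_proba(X)[:, 1]`), and the two can differ. The toy numbers below are hypothetical and only illustrate that difference; they are not results from this dataset, and switching to probabilities would change the score reported in this post.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Hypothetical predicted probabilities of the spam class
scores = np.array([0.1, 0.6, 0.7, 0.9])

# Thresholding at 0.5 mimics what predict() returns: [0, 1, 1, 1]
labels = (scores >= 0.5).astype(int)

auc_scores = roc_auc_score(y_true, scores)  # ranking is perfect: 1.0
auc_labels = roc_auc_score(y_true, labels)  # ties after thresholding: 0.75

print(auc_scores, auc_labels)
```

Thresholding discards the ranking information inside each predicted class, which is why the label-based value can be lower.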

Get the Most Important Features

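We will show the 50 features with the largest positive coefficients, which push a prediction towards Spam, and the 50 features with the most negative coefficients, which push it towards Ham.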

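# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = np.array(list(vect.get_feature_names_out()) + ['lengthc', 'digit', 'dollars', 'n_char'])
sorted_coef_index = clf.coef_[0].argsort()            # indices in ascending order of coefficient
smallest = feature_names[sorted_coef_index[:50]]      # most negative coefficients (Ham)
largest = feature_names[sorted_coef_index[:-51:-1]]   # most positive coefficients, largest first (Spam)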

Features which lead to Spam:

array(['text', 'sale', 'free', 'uk', 'content', 'tones', 'sms', 'reply',
       'order', 'won', 'ltd', 'girls', 'ringtone', 'to', 'comes',
       'darling', 'this message', 'what you', 'new', 'www', 'co uk',
       'std', 'co', 'about the', 'strong', 'txt', 'your', 'user',
       'all of', 'choose', 'service', 'wap', 'mobile', 'the new', 'with',
       'sexy', 'sunshine', 'xxx', 'this', 'hot', 'freemsg', 'ta',
       'waiting for your', 'asap', 'stop', 'll have', 'hello', 'http',
       'vodafone', 'of the'], dtype='<U31')

Features which lead to Ham:

array(['ì_ wan', 'for 1st', 'park', '1st', 'ah', 'wan', 'got', 'say',
       'tomorrow', 'if', 'my', 'ì_', 'call', 'opinion', 'days', 'gt',
       'its', 'lt', 'lovable', 'sorry', 'all', 'when', 'can', 'hope',
       'face', 'she', 'pls', 'lt gt', 'hav', 'he', 'smile', 'wife',
       'for my', 'trouble', 'me', 'went', 'about me', 'hey', '30', 'sir',
       'lovely', 'small', 'sun', 'silent', 'me if', 'happy', 'only',
       'them', 'my dad', 'dad'], dtype='<U31')
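The slicing used to pick these two lists can be checked on a handful of values. In this sketch the feature names and coefficients are made up; the point is only how `argsort` and the negative-step slice select the two ends.

```python
import numpy as np

# Hypothetical feature names and coefficients
names = np.array(["free", "dad", "txt", "sorry", "win"])
coefs = np.array([2.1, -1.5, 1.2, -0.8, 3.0])

order = coefs.argsort()         # indices from most negative to most positive

smallest = names[order[:2]]     # two most negative coefficients
largest = names[order[:-3:-1]]  # two most positive, largest first

print(smallest)  # ['dad' 'sorry']
print(largest)   # ['win' 'free']
```

The slice `[:-3:-1]` walks backwards from the end and stops before the third-from-last element, so it returns the top two; `[:-51:-1]` in the article returns the top 50 in the same way.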


We provided a practical and reproducible example of how you can build a decent Ham or Spam classifier, a classic text classification task in NLP. Our model achieved an AUC score of 97% on the test dataset. We also added hand-crafted features on top of the tf-idf n-grams and identified the features that are most strongly associated with Spam and with Ham.
