Ham or Spam
One of the most common projects, especially for teaching purposes, is to build a model that predicts whether a message is spam or not. Our dataset, called spam.csv, contains the subject lines and the target, which takes the values 0 for ham and 1 for spam respectively.
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

spam_data = pd.read_csv('spam.csv')

# encode the target as 1 for spam and 0 for ham
spam_data['target'] = np.where(spam_data['target']=='spam', 1, 0)
spam_data.head(10)
text target
0 Go until jurong point, crazy.. Available only ... 0
1 Ok lar... Joking wif u oni... 0
2 Free entry in 2 a wkly comp to win FA Cup fina... 1
3 U dun say so early hor... U c already then say... 0
4 Nah I don't think he goes to usf, he lives aro... 0
5 FreeMsg Hey there darling it's been 3 week's n... 1
6 Even my brother is not like to speak with me. ... 0
7 As per your request 'Melle Melle (Oru Minnamin... 0
8 WINNER!! As a valued network customer you have... 1
9 Had your mobile 11 months or more? U R entitle... 1
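Before modelling, it is worth a quick look at how imbalanced the two classes are (a small inspection step added here; the exact counts depend on the spam.csv file you use):

# number of ham (0) versus spam (1) messages
spam_data['target'].value_counts()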
Split the Data into Train and Test Dataset
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], spam_data['target'], random_state=0)
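By default, train_test_split holds out 25% of the rows for the test set. As a quick sanity check (a minimal sketch; the stratify argument is an optional extra, not part of the original recipe), we can confirm that the spam ratio is similar in both splits:

# fraction of spam in each split; the two numbers should be close
print(y_train.mean(), y_test.mean())

# if they diverged noticeably, a stratified split would keep the ratio fixed:
# X_train, X_test, y_train, y_test = train_test_split(
#     spam_data['text'], spam_data['target'],
#     random_state=0, stratify=spam_data['target'])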
Build the tf-idf on N-grams
Fit and transform the training data X_train using a TfidfVectorizer, ignoring terms that have a document frequency strictly lower than 5 and using word n-grams from n=1 to n=3 (unigrams, bigrams, and trigrams):
vect = TfidfVectorizer(min_df=5, ngram_range=(1,3)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
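To get a sense of the size of the resulting feature space (a quick inspection, not part of the original walkthrough), we can check the shape of the document-term matrix; note that in recent scikit-learn versions the method is called get_feature_names_out():

# rows = training messages, columns = word n-grams kept by min_df=5
print(X_train_vectorized.shape)
print(len(vect.get_feature_names_out()))  # vocabulary size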
Add Features
Apart from the tokens, we can add features such as the number of digits, the number of dollar signs, the length of the subject line, and the number of non-word characters (anything other than a letter, digit or underscore). Let's create a function for that.
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

# Train Data
add_length = X_train.str.len()
add_digits = X_train.str.count(r'\d')
add_dollars = X_train.str.count(r'\$')
add_characters = X_train.str.count(r'\W')

X_train_transformed = add_feature(X_train_vectorized,
                                  [add_length, add_digits,
                                   add_dollars, add_characters])

# Test Data
add_length_t = X_test.str.len()
add_digits_t = X_test.str.count(r'\d')
add_dollars_t = X_test.str.count(r'\$')
add_characters_t = X_test.str.count(r'\W')

X_test_transformed = add_feature(vect.transform(X_test),
                                 [add_length_t, add_digits_t,
                                  add_dollars_t, add_characters_t])
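A quick way to verify that the four extra columns were appended correctly (a sanity check added here, not in the original post):

# the transformed matrix should have exactly 4 more columns than the tf-idf matrix
print(X_train_vectorized.shape, X_train_transformed.shape)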
Train the Logistic Regression Model
We will build the Logistic Regression model and report the AUC score on the test dataset:
clf = LogisticRegression(C=100, solver='lbfgs', max_iter=1000)
clf.fit(X_train_transformed, y_train)

y_predicted = clf.predict(X_test_transformed)
auc = roc_auc_score(y_test, y_predicted)
auc
0.9674528462047772
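Note that the AUC above is computed from the hard 0/1 predictions. Scoring with predicted probabilities is the more common convention for ROC AUC and typically yields a slightly different number; a minimal variant, reusing the fitted clf:

# probability of the positive class (spam) rather than hard labels
y_scores = clf.predict_proba(X_test_transformed)[:, 1]
roc_auc_score(y_test, y_scores)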
Get the Most Important Features
We will show the 50 most important features that lead to either Ham or Spam, i.e., the features with the most negative and the most positive coefficients respectively.
# names of the tf-idf n-grams followed by the four hand-crafted features
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
feature_names = np.array(list(vect.get_feature_names_out())
                         + ['lengthc', 'digit', 'dollars', 'n_char'])

# argsort orders the coefficients from most negative (ham) to most positive (spam)
sorted_coef_index = clf.coef_[0].argsort()
smallest = feature_names[sorted_coef_index[:50]]
largest = feature_names[sorted_coef_index[:-51:-1]]
Features which lead to Spam:
largest
array(['text', 'sale', 'free', 'uk', 'content', 'tones', 'sms', 'reply',
'order', 'won', 'ltd', 'girls', 'ringtone', 'to', 'comes',
'darling', 'this message', 'what you', 'new', 'www', 'co uk',
'std', 'co', 'about the', 'strong', 'txt', 'your', 'user',
'all of', 'choose', 'service', 'wap', 'mobile', 'the new', 'with',
'sexy', 'sunshine', 'xxx', 'this', 'hot', 'freemsg', 'ta',
'waiting for your', 'asap', 'stop', 'll have', 'hello', 'http',
'vodafone', 'of the'], dtype='<U31')
Features which lead to Ham:
smallest
array(['ì_ wan', 'for 1st', 'park', '1st', 'ah', 'wan', 'got', 'say',
'tomorrow', 'if', 'my', 'ì_', 'call', 'opinion', 'days', 'gt',
'its', 'lt', 'lovable', 'sorry', 'all', 'when', 'can', 'hope',
'face', 'she', 'pls', 'lt gt', 'hav', 'he', 'smile', 'wife',
'for my', 'trouble', 'me', 'went', 'about me', 'hey', '30', 'sir',
'lovely', 'small', 'sun', 'silent', 'me if', 'happy', 'only',
'them', 'my dad', 'dad'], dtype='<U31')
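If you want to see the coefficient values alongside the n-grams (an optional extra, not part of the original output), a small DataFrame does the job:

# pair every feature with its logistic regression coefficient
coef_df = pd.DataFrame({'feature': feature_names, 'coef': clf.coef_[0]})
coef_df.sort_values('coef', ascending=False).head(10)  # strongest spam signals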
Discussion
We provided a practical and reproducible example of how to build a decent Ham or Spam classifier, one of the canonical tasks in NLP. Our model achieved an AUC score of roughly 0.97 on the test dataset, which is really good. We were also able to add hand-crafted features and to identify the features that are more likely to appear in a Spam message and vice versa.
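To make the result easy to reuse, here is a sketch of how new subject lines could be scored end to end; predict_spam is a hypothetical helper that assumes the fitted vect, clf and the add_feature function from above:

def predict_spam(messages):
    """Return spam probabilities for a pandas Series of subject lines."""
    vec = vect.transform(messages)
    extra = [messages.str.len(),         # length of the subject line
             messages.str.count(r'\d'),  # number of digits
             messages.str.count(r'\$'),  # number of dollar signs
             messages.str.count(r'\W')]  # number of non-word characters
    return clf.predict_proba(add_feature(vec, extra))[:, 1]

predict_spam(pd.Series(["WINNER!! Claim your free prize now",
                        "Are we still on for lunch tomorrow?"]))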