Predictive Hacks

How to Find Similar Documents using N-grams and Word Embeddings


Introduction

A very common task in NLP is to measure the similarity between documents. The usual metric is the Cosine Similarity, and there are two main approaches:

  • Transform the documents into a vector space by generating the Document-Term Matrix or the TF-IDF. This approach is based on n-grams, where usually we consider up to bi-grams.
  • Transform the documents into a vector space by taking the average of the pre-trained word embeddings.
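Both approaches end with the same step: comparing two vectors with the Cosine Similarity. As a minimal sketch of what that metric computes, here is a plain-Python version. The function name `cosine_sim` and the choice to return 0.0 for all-zero vectors are our own assumptions for illustration; the rest of this tutorial uses scikit-learn's `cosine_similarity` instead.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_u * norm_v)

# Parallel vectors point in the same direction, so the similarity is close to 1.
same_direction = cosine_sim([1, 2, 0], [2, 4, 0])
# Vectors with no shared non-zero dimensions are orthogonal.
no_overlap = cosine_sim([1, 0, 0], [0, 3, 0])
```

Because the metric depends only on direction, a long and a short document that use the same words in the same proportions still score close to 1.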

In this tutorial, we will give you a hands-on example of how to find similar documents in a list of documents using these two approaches. We will not try to optimize the performance of the algorithms with techniques such as stemming, lemmatization, different tokenizers, different n-gram ranges, etc.

For our example, we will use the Spam dataset, which contains a list of subject lines marked as Ham or Spam. If you want to build a model that predicts whether an email is Ham or Spam, you can have a look at our tutorial.

Similar Documents using N-grams

Let’s start by loading the data:

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity 
pd.set_option("max_colwidth", 300)

df = spam_data = pd.read_csv('spam.csv')
df
 

Now, let’s build the TF-IDF:

vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,1), lowercase=True, min_df=2)
X = vectorizer.fit_transform(df.text)
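The `ngram_range=(1,1)` setting above keeps single words only, while `ngram_range=(1,2)` would also add bi-grams. To make the idea concrete, here is a rough sketch of how word n-grams are formed. The helper `word_ngrams` is hypothetical and splits on whitespace only; scikit-learn applies its own tokenization rules, so the features it actually produces can differ.

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams of length min_n..max_n (whitespace tokens)."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens across the text
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# 6 uni-grams plus 5 bi-grams for a six-word subject line
features = word_ngrams("free entry in a prize draw")
```

Bi-grams such as "prize draw" keep some word-order information that single words lose, at the cost of a larger and sparser matrix.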

Finally, let’s write a function that takes as input the text of a document, the data frame of existing documents and the number of recommendations, and returns the n most similar documents.

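def similar_documents(text, df, n=10):
    # work on a copy so the original data frame is not modified
    df = df.copy()
    # project the input text onto the already fitted TF-IDF vocabulary
    input_vect = vectorizer.transform([text])
    # cosine similarity between the input and every row of X
    df['similarity'] = cosine_similarity(input_vect, X).flatten()
    return (df.sort_values(by='similarity', ascending=False)[['text', 'similarity']].head(n))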


user_input = """Nah I don't think he goes to usf, he lives around here though"""

similar_documents(text=user_input, df=df, n=10)
 

Similar Documents using Word Embeddings

Another approach is to work with Word Embeddings. We have provided a similar tutorial using GloVe. In this post, we will work with the SpaCy library.

import spacy

# load the word embeddings
nlp = spacy.load("en_core_web_md")

# in case we want to work with 2D NumPy arrays we need to unnest the column of arrays as follows
# np.stack(df.embedding.to_numpy()).shape
# np.vstack(df.embedding.to_numpy()).shape

# create a column of word embedding vectors
df['embedding'] = df['text'].apply(lambda x: nlp(x).vector)

df
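As far as we know, spaCy's `Doc.vector` defaults to the average of the token vectors, which is the "average of the word embeddings" approach mentioned in the introduction. The sketch below shows that averaging step on its own with NumPy. The 3-D vectors in `toy_vectors` and the helper `average_embedding` are invented for illustration and do not come from any real model.

```python
import numpy as np

# Hypothetical 3-D word vectors, made up for this example
# (the spaCy model used in this post produces 300-D vectors).
toy_vectors = {
    "cheap": np.array([1.0, 0.0, 2.0]),
    "pills": np.array([3.0, 2.0, 0.0]),
    "now":   np.array([2.0, 1.0, 1.0]),
}

def average_embedding(tokens, vectors):
    """Document vector = element-wise mean of the known token vectors."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        # assumption: fall back to a zero vector when no token is known
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(found, axis=0)

doc_vector = average_embedding(["cheap", "pills", "now"], toy_vectors)
```

Note that the mean ignores word order entirely, so two documents with the same words in a different order get the same vector.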
 

Let’s create a similar function to the above, but this time by taking into consideration the word embeddings.

def emb_similar_documents(text, df, n=10):
    df = df.copy()
    input_vect = nlp(text).vector
    # reshape the inputs to 1, 300 since we are dealing with vectors of 300-D
    df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(input_vect.reshape(1,300), x.reshape(1,300))[0][0])
    return (df.sort_values(by='similarity', ascending=False)[['text', 'similarity']].head(n))


user_input = """I don't quite know what to do. I still can't get hold of anyone. I cud pick you up bout 7.30pm and we can see if they're in the pub?"""

emb_similar_documents(text=user_input, df=df, n=10)
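The function above calls `cosine_similarity` once per row. An alternative, sketched here under our own assumptions, is to stack the embeddings into one 2D array (for example with `np.vstack(df.embedding.to_numpy())`) and compute all similarities in a single matrix operation. The helper `top_n_by_cosine` and the toy 2-D vectors are hypothetical and are used only so that the example runs without the spaCy model.

```python
import numpy as np

def top_n_by_cosine(query_vec, doc_matrix, n=3):
    """Return (row indices, scores) of the n rows most similar to query_vec."""
    query_norm = np.linalg.norm(query_vec)
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    denom = doc_norms * query_norm
    # rows with a zero norm keep similarity 0 instead of dividing by zero
    sims = np.divide(doc_matrix @ query_vec, denom,
                     out=np.zeros(len(doc_matrix)), where=denom > 0)
    order = np.argsort(-sims)[:n]
    return order, sims[order]

# Toy stand-in for np.vstack(df.embedding.to_numpy()): one row per document
doc_matrix = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0],
                       [0.0, 0.0]])
query_vec = np.array([1.0, 0.0])

idx, scores = top_n_by_cosine(query_vec, doc_matrix, n=2)
```

The returned indices could then be passed to `df.iloc` to retrieve the matching rows. On a large data frame this kind of vectorized computation is usually faster than a row-wise `apply`.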
 

Final Thoughts

I believe that for short documents, like subject lines, the TF-IDF approach works better than the Word Embeddings. Also, based on my experience, taking the average of the word embeddings does not lead to a meaningful vector representation of the document. There are other techniques, like Doc2Vec, which we can discuss in another post. Stay tuned!
