Predictive Hacks

# How to Find Similar Documents using N-grams and Word Embeddings

## Introduction

A very common task in NLP is to measure the similarity between documents. The usual metric is the Cosine Similarity, and there are two main approaches:

• Transform the documents into a vector space by generating the Document-Term Matrix or the TF-IDF. This approach is based on n-grams, where usually we consider up to bi-grams.
• Transform the documents into a vector space by taking the average of the pre-trained word embeddings.
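Both approaches end with the same comparison step. As a minimal sketch on two made-up vectors (the numbers and the `cosine` helper are illustrative assumptions, not taken from the Spam dataset), the cosine similarity is the dot product divided by the product of the norms:

```python
import numpy as np

# Two toy document vectors (hypothetical term counts, for illustration only)
a = np.array([1.0, 2.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 1.0, 0.0])

def cosine(u, v):
    """Cosine of the angle between u and v: (u . v) / (||u|| * ||v||)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

similarity = cosine(a, b)
print(round(similarity, 4))  # 0.7071
```

The `cosine_similarity` function from scikit-learn computes the same quantity, but for whole matrices of row vectors at once.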

In this tutorial, we provide a hands-on example of how to find similar documents in a list of documents using these two approaches. We will not try to optimize the results with techniques such as stemming, lemmatization, different tokenizers, or a different n-gram range.

For our example, we will consider the Spam dataset, which contains a list of subject lines marked as Ham or Spam. If you want to build a model that predicts whether an email is Ham or Spam, you can have a look at our tutorial.

## Similar Documents using N-grams

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
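
# show long messages in full when printing the DataFrame
pd.set_option("display.max_colwidth", 300)

# df is assumed to be the Spam dataset already loaded as a DataFrame,
# with the messages in a `text` column
df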



Now, let’s build the TF-IDF:

vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,1), lowercase=True, min_df=2)
X = vectorizer.fit_transform(df.text)
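To see what `ngram_range` changes, here is a small sketch on a made-up three-sentence corpus (the sentences and variable names are illustrative assumptions). With `ngram_range=(1, 2)` the vocabulary contains bi-grams in addition to single words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration
docs = [
    "free prize waiting for you",
    "call now to claim your free prize",
    "are you free for lunch",
]

unigram_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 1), lowercase=True)
bigram_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), lowercase=True)

X_uni = unigram_vec.fit_transform(docs)
X_bi = bigram_vec.fit_transform(docs)

# vocabulary_ maps each term to its column index in the matrix
print(X_uni.shape, X_bi.shape)
print("free prize" in bigram_vec.vocabulary_)   # bi-gram feature is present
print("free prize" in unigram_vec.vocabulary_)  # absent with unigrams only
```

The bi-gram matrix has more columns, so it can tell "free prize" apart from "free for lunch", at the cost of a larger and sparser vocabulary.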


Finally, let’s write a function that takes as input the content of a document, the existing documents, and the number of recommendations, and returns the n most similar documents.

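def similar_documents(text, df, n=10):
    # work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # project the input text into the fitted TF-IDF space
    input_vect = vectorizer.transform([text])
    df['similarity'] = cosine_similarity(input_vect, X).flatten()
    # return the n most similar documents, best match first
    return df.sort_values(by='similarity', ascending=False).head(n)

user_input = """Nah I don't think he goes to usf, he lives around here though"""

similar_documents(text=user_input, df=df, n=10)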
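The same ranking logic can be tried without the Spam dataset. The sketch below is self-contained and uses a made-up DataFrame; `rank_by_tfidf` is a hypothetical helper name, and it receives the fitted vectorizer and matrix as arguments instead of reading them from global variables.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data, invented for illustration
toy_df = pd.DataFrame({"text": [
    "he lives around here",
    "win a free prize now",
    "I think he goes to school here",
    "claim your free prize",
]})

toy_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 1), lowercase=True)
toy_X = toy_vectorizer.fit_transform(toy_df["text"])

def rank_by_tfidf(text, df, vectorizer, X, n=10):
    """Return the n rows of df most similar to text (hypothetical helper)."""
    out = df.copy()
    query = vectorizer.transform([text])  # shape (1, n_features)
    out["similarity"] = cosine_similarity(query, X).flatten()
    return out.sort_values(by="similarity", ascending=False).head(n)

result = rank_by_tfidf("does he live around here", toy_df, toy_vectorizer, toy_X, n=2)
print(result)
```

Note that "live" in the query does not match "lives" in the corpus: without stemming or lemmatization, TF-IDF only rewards exact token overlap.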



## Similar Documents using Word Embeddings

Another approach is to work with word embeddings. We have provided a similar tutorial using GloVe. In this post, we will work with the spaCy library.

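import spacy

# load a spaCy pipeline that ships with 300-D word vectors
# (assumption: the original does not name the model; en_core_web_md is one such model)
nlp = spacy.load("en_core_web_md")

# create a column of word embedding vectors
df['embedding'] = df['text'].apply(lambda x: nlp(x).vector)

# in case we want to work with 2D NumPy arrays, we need to unnest the column as follows
# np.stack(df.embedding.to_numpy()).shape
# np.vstack(df.embedding.to_numpy()).shape

df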
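By default, spaCy's `Doc.vector` is the average of the token vectors. The sketch below imitates that with a tiny hand-written embedding table (the 3-D vectors and the `average_embedding` helper are hypothetical; real models use 300-D vectors) and shows one consequence: the average ignores word order.

```python
import numpy as np

# Hypothetical 3-D word vectors, invented for illustration
toy_embeddings = {
    "dog":   np.array([0.9, 0.1, 0.0]),
    "bites": np.array([0.2, 0.8, 0.1]),
    "man":   np.array([0.1, 0.2, 0.9]),
}

def average_embedding(text, embeddings):
    """Average the vectors of the known words in text (hypothetical helper)."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector of the right size
        dim = len(next(iter(embeddings.values())))
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

v1 = average_embedding("dog bites man", toy_embeddings)
v2 = average_embedding("man bites dog", toy_embeddings)
print(v1)
print(np.allclose(v1, v2))  # True: same words, different order, same vector
```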



Let’s create a function similar to the one above, but this time based on the word embeddings.

def emb_similar_documents(text, df, n=10):
    # work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    input_vect = nlp(text).vector
    # reshape the inputs to (1, 300) since we are dealing with 300-D vectors
    df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(input_vect.reshape(1, 300), x.reshape(1, 300))[0][0])
    # return the n most similar documents, best match first
    return df.sort_values(by='similarity', ascending=False).head(n)

user_input = """I don't quite know what to do. I still can't get hold of anyone. I cud pick you up bout 7.30pm and we can see if they're in the pub?"""

emb_similar_documents(text=user_input, df=df, n=10)
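Calling `cosine_similarity` once per row works, but it can be slow on large DataFrames. A possible alternative, sketched here with random stand-in vectors rather than real spaCy embeddings (`demo_df` and `query_vect` are illustrative names), stacks the column into a 2D array with `np.vstack` and computes all similarities in one call:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Stand-in for df['embedding']: five random 300-D vectors (not real embeddings)
demo_df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(5)],
    "embedding": [rng.normal(size=300) for _ in range(5)],
})
query_vect = rng.normal(size=300)

# Stack the column of 1D arrays into one (n_docs, 300) matrix
matrix = np.vstack(demo_df["embedding"].to_numpy())

# One call for all rows: the result has shape (1, n_docs) before flattening
fast = cosine_similarity(query_vect.reshape(1, -1), matrix).flatten()

# Row-by-row version, for comparison
slow = demo_df["embedding"].apply(
    lambda x: cosine_similarity(query_vect.reshape(1, -1), x.reshape(1, -1))[0][0]
).to_numpy()

print(matrix.shape)
print(np.allclose(fast, slow))
```

Using `reshape(1, -1)` instead of a hard-coded `reshape(1, 300)` also keeps the code working if a model with a different vector size is loaded.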



## Final Thoughts

I believe that for short documents, like subject lines, the TF-IDF approach works better than word embeddings. Also, in my experience, taking the average of the word embeddings does not lead to a meaningful vector representation of the document. There are other techniques, like Doc2Vec, which we can discuss in another post. Stay tuned!
