A very common task in NLP is to measure the similarity between documents. The usual metric is cosine similarity, and there are two main approaches:
- Transform the documents into a vector space by generating the Document-Term Matrix or the TF-IDF matrix. This approach is based on n-grams, usually up to bi-grams.
- Transform the documents into a vector space by taking the average of the pre-trained word embeddings.
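For reference, the cosine similarity between two vectors is their dot product divided by the product of their norms. Here is a minimal sketch in plain Python, using made-up term-count vectors. The function name `cosine` and the sample vectors are illustrative only and are not part of the tutorial's code.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical term counts over the vocabulary ["free", "prize", "meeting"]
doc_a = [2, 1, 0]
doc_b = [1, 1, 0]
doc_c = [0, 0, 3]

print(cosine(doc_a, doc_b))  # close to 1: same terms in similar proportions
print(cosine(doc_a, doc_c))  # 0.0: no shared terms
```

Because the score depends only on the angle between the vectors, a long document and a short one that use the same terms in the same proportions are still rated as highly similar.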
In this tutorial, we provide a hands-on example of how to find similar documents in a list of documents using these two approaches. We will not try to optimize the performance of the algorithms with techniques such as stemming, lemmatization, different tokenizers, or different n-gram ranges.
For our example, we will use the Spam dataset, which contains a list of subject lines labeled as Ham or Spam. If you want to build a model that predicts whether an email is Ham or Spam, you can have a look at our tutorial.
Similar Documents using N-grams
Let’s start by loading the data:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pd.set_option("max_colwidth", 300)

df = spam_data = pd.read_csv('spam.csv')
df
Now, let’s build the TF-IDF:
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,1), lowercase=True, min_df=2)
X = vectorizer.fit_transform(df.text)
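To see what `ngram_range` changes, the sketch below fits the vectorizer on a tiny made-up corpus with and without bi-grams. The corpus and the variable names are assumptions for illustration, and `min_df` is left at its default so that no term is dropped from such a small sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for df.text
toy_corpus = [
    "win a free prize now",
    "free prize waiting for you",
    "are we meeting for lunch",
]

unigram_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 1), lowercase=True)
bigram_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), lowercase=True)

X_uni = unigram_vec.fit_transform(toy_corpus)
X_bi = bigram_vec.fit_transform(toy_corpus)

# One row per document, one column per n-gram in the learned vocabulary
print(X_uni.shape, X_bi.shape)

# With bi-grams, word pairs such as "free prize" become features of their own
print("free prize" in bigram_vec.vocabulary_)
```

The bi-gram matrix has more columns for the same documents, which is the usual trade-off: more context per feature, but a larger and sparser matrix.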
Finally, let’s write a function that takes the content of a document, the existing documents, and the number of recommendations, and returns the n most similar documents.
def similar_documents(text, df, n=10):
    df = df.copy()
    # transform expects an iterable of documents, so wrap the text in a list
    input_vect = vectorizer.transform([text])
    # similarity of the input against every row of the TF-IDF matrix X
    df['similarity'] = cosine_similarity(input_vect, X).flatten()
    return (df.sort_values(by='similarity', ascending=False)[['text', 'similarity']].head(n))

user_input = """Nah I don't think he goes to usf, he lives around here though"""
similar_documents(text=user_input, df=df, n=10)
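Since `spam.csv` is not included in this post, here is a self-contained sketch of the same idea on a small made-up DataFrame. The sample texts and the names `toy_df`, `toy_vectorizer`, `toy_X` and `toy_similar_documents` are hypothetical. Passing the vectorizer and the matrix as arguments, rather than reading them from global variables, is a design choice of this sketch only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up messages standing in for the Spam dataset
toy_df = pd.DataFrame({"text": [
    "free prize waiting for you",
    "does he go to usf or live around here",
    "call now to claim your free prize",
    "see you at the pub around seven",
]})

toy_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 1), lowercase=True)
toy_X = toy_vectorizer.fit_transform(toy_df["text"])

def toy_similar_documents(text, df, vectorizer, X, n=10):
    df = df.copy()
    # transform expects an iterable of documents, hence the list
    input_vect = vectorizer.transform([text])
    df["similarity"] = cosine_similarity(input_vect, X).flatten()
    return df.sort_values(by="similarity", ascending=False)[["text", "similarity"]].head(n)

result = toy_similar_documents("he lives around here", toy_df, toy_vectorizer, toy_X, n=2)
print(result)
```

Note that "lives" is not in the toy vocabulary (only "live" is), so it contributes nothing to the score. This is the kind of mismatch that stemming or lemmatization would address.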
Similar Documents using Word Embeddings
Another approach is to work with word embeddings. We have provided a similar tutorial using GloVe. In this post, we will work with the spaCy library.
import spacy

# load the word embeddings
nlp = spacy.load("en_core_web_md")

# in case we want to work with 2D NumPy arrays, we need to unnest the column as follows
# np.stack(df.embedding.to_numpy()).shape
# np.vstack(df.embedding.to_numpy()).shape

# create a column of word embedding vectors
df['embedding'] = df['text'].apply(lambda x: nlp(x).vector)
df
Let’s create a function similar to the one above, this time using the word embeddings.
def emb_similar_documents(text, df, n=10):
    df = df.copy()
    input_vect = nlp(text).vector
    # reshape the inputs to (1, 300) since we are dealing with 300-D vectors;
    # cosine_similarity returns a 1x1 array, so [0, 0] extracts the scalar score
    df['similarity'] = df['embedding'].apply(
        lambda x: cosine_similarity(input_vect.reshape(1, 300), x.reshape(1, 300))[0, 0])
    return (df.sort_values(by='similarity', ascending=False)[['text', 'similarity']].head(n))

user_input = """I don't quite know what to do. I still can't get hold of anyone. I cud pick you up bout 7.30pm and we can see if they're in the pub?"""
emb_similar_documents(text=user_input, df=df, n=10)
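The commented-out `np.stack` lines in the embedding code hint at a vectorized alternative: stack the per-document vectors into one 2D array and compute all similarities in a single matrix operation, instead of one `cosine_similarity` call per row. The sketch below illustrates this with NumPy only and a made-up 3-dimensional embedding table. The names `toy_embeddings`, `average_embedding` and `top_n_similar` are hypothetical; with spaCy, each row would instead be `nlp(text).vector`, which by default is the average of the token vectors.

```python
import numpy as np

# Made-up 3-D word vectors; real pre-trained embeddings have hundreds of dimensions
toy_embeddings = {
    "free":  np.array([1.0, 0.0, 0.0]),
    "prize": np.array([0.9, 0.1, 0.0]),
    "pub":   np.array([0.0, 1.0, 0.0]),
    "drink": np.array([0.1, 0.9, 0.0]),
    "pick":  np.array([0.0, 0.2, 1.0]),
}

def average_embedding(text, embeddings, dim=3):
    """Average the vectors of the known words; zero vector if none are known."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def top_n_similar(query_vec, doc_matrix, n=10):
    """Return (indices, scores) of the n rows most similar to query_vec."""
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # Rows (or a query) with zero norm get similarity 0 instead of NaN
    scores = np.divide(doc_matrix @ query_vec, denom,
                       out=np.zeros(len(doc_matrix)), where=denom > 0)
    order = np.argsort(-scores)[:n]
    return order, scores[order]

docs = ["free prize", "pub drink", "pick drink"]
# Stack the per-document vectors into one 2D array of shape (3, 3)
doc_matrix = np.stack([average_embedding(d, toy_embeddings) for d in docs])

query = average_embedding("drink pub", toy_embeddings)
indices, scores = top_n_similar(query, doc_matrix, n=2)
print([docs[i] for i in indices], scores)
```

Because averaging ignores word order, "drink pub" and "pub drink" map to exactly the same vector, which is one concrete way the averaged representation loses information.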
I believe that for short documents, such as subject lines, the TF-IDF approach works better than word embeddings. Also, in my experience, taking the average of the word embeddings does not lead to a meaningful vector representation of the document. There are other techniques, such as Doc2Vec, which we can discuss in another post. Stay tuned!