Sentence Transformers is a Python framework for computing state-of-the-art vector representations (embeddings) of sentences. Once the sentences are embedded in a vector space, we can compute the distance between them and so find the sentences that are most similar in meaning.
As an example, let’s say that we have these two sentences:
Coffee makes mornings better.
Morning walks tend to start and end your day in a good mood.
And we want to find the most similar sentence to the following:
I like a nice cup of tea in the morning.
The closest sentence is “Coffee makes mornings better.” because both are about morning drinks. Even though they don’t share the same words, their vector representations will be close to each other.
Cosine Similarity
To measure the similarity of two sentence vectors, we use the cosine similarity (1 - cosine distance). This is because the direction of a vector plays a large role in its semantic meaning. For example, the vectors of ‘coffee’ and ‘tea’ may point in a direction that roughly corresponds to ‘morning drinks’. Now, let’s say that we have the vector a=[1,1,-1] and b=2a=[2,2,-2]. These two vectors have different magnitudes but the same direction. Their cosine distance is 0, but their Euclidean distance, which does not take direction into account, is about 1.73 (the length of a). It would be easy to find another vector that is closer to a in Euclidean terms yet points in a different direction.
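The comparison can be checked numerically. The following is a minimal sketch that uses only the standard library; the vector `c` is a hypothetical third vector added here for illustration and is not part of the original example.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 - cosine distance)."""
    return dot(u, v) / (norm(u) * norm(v))

def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

a = [1, 1, -1]
b = [2, 2, -2]  # b = 2a: same direction, twice the magnitude
c = [1, 1, 0]   # hypothetical vector: different direction, but close to a

print(round(cosine_similarity(a, b), 4))   # 1.0
print(round(euclidean_distance(a, b), 4))  # 1.7321
print(round(cosine_similarity(a, c), 4))   # 0.8165
print(round(euclidean_distance(a, c), 4))  # 1.0
```

By Euclidean distance, `c` looks closer to `a` than `b` does, even though `b` points in exactly the same direction as `a`. Cosine similarity ranks them the other way round.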
The Most Similar Lyrics
In this post, we will use the Song Lyrics Dataset from Kaggle and, given a text input, find the songs with the most similar lyrics.
The pipeline is the following:
- Get the embeddings for every lyric in the Dataset
- Get the embedding of the text input
- Compute the cosine similarity of the text input with every lyric
- Return the songs with the highest similarity
First things first, you need to install sentence transformers.
pip install -U sentence-transformers
Then, import the necessary libraries.
import pandas as pd
import numpy as np
from scipy import spatial
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
We are using the “all-MiniLM-L6-v2” pretrained model, but you can experiment with the other models listed on the SBERT models page.
Let’s read and have a look at the data.
df = pd.read_csv('lyrics.csv')
df.head(3)
Now, to get the embeddings of the Lyrics, we use the model.encode function.
df = df.assign(embeddings=df['Lyric'].apply(lambda x: model.encode(x)))
df.head(3)
As you can see, we end up with a DataFrame that has the song’s attributes along with the lyric embeddings.
Finally, we create a function that gets the embedding of the input, computes its cosine similarity (1 - cosine distance) with every lyric, and returns the 5 songs with the highest similarity.
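Calling `model.encode` once per row works, but `encode` also accepts a list of sentences and batches them internally, which is usually faster on a large dataset. The helper below is a minimal sketch of that alternative; the name `embed_texts` and the batch size of 64 are illustrative choices, not part of the original post.

```python
import numpy as np

def embed_texts(model, texts, batch_size=64):
    """Encode all texts in a single call so the model can batch them.

    `model` is assumed to be a SentenceTransformer (or any object with a
    compatible `encode` method). Returns an array with one row per text.
    """
    vectors = model.encode(
        list(texts),
        batch_size=batch_size,
        show_progress_bar=False,
    )
    return np.asarray(vectors)
```

With the DataFrame from this post, the result could then be stored back with something like `df['embeddings'] = list(embed_texts(model, df['Lyric']))`, assuming the `Lyric` column contains no missing values.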
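def closest_lyrics(inp):
    # Work on a copy so the original DataFrame is not modified
    data = df.copy()
    # Embed the input text with the same model used for the lyrics
    inp_vector = model.encode(inp)
    # Cosine similarity = 1 - cosine distance, computed against every lyric
    s = data['embeddings'].apply(lambda x: 1 - spatial.distance.cosine(x, inp_vector))
    data = data.assign(similarity=s)
    # Return the 5 songs with the highest similarity
    return data.sort_values('similarity', ascending=False).head(5)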
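For larger datasets, the row-by-row similarity computation can be replaced by a single matrix operation. The following is a minimal sketch of that idea using NumPy only; the function name `top_k_cosine` and the toy 2-D vectors are hypothetical stand-ins for real lyric embeddings, and the sketch assumes no embedding is an all-zero vector.

```python
import numpy as np

def top_k_cosine(query_vec, matrix, k=5):
    """Return (indices, similarities) of the k rows most similar to query_vec."""
    matrix = np.asarray(matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Dot product of every row with the query, divided by the product of
    # their norms, gives the cosine similarity for all rows at once.
    row_norms = np.linalg.norm(matrix, axis=1)
    sims = matrix @ query_vec / (row_norms * np.linalg.norm(query_vec))
    # Sort by descending similarity and keep the first k positions.
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy 2-D vectors standing in for real lyric embeddings.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idx, sims = top_k_cosine([2.0, 0.0], toy, k=2)
print(idx)  # [0 2]
```

With the DataFrame from this post, the matrix could be built with `np.vstack(df['embeddings'])` and the matching rows selected with `df.iloc[idx]`.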
Let’s try it:
closest_lyrics('thinking about you')
As you can see, the returned songs are relevant to the input.
Example Applications
As an example, I created a perfume recommendation app that recommends perfumes based on scenery, notes, or a description of a perfume.
Summing it up
In this post, we talked about how to use Sentence Transformers to represent sentences in space and how to find the most similar sentence based on their semantic meaning using cosine similarity.