Predictive Hacks

How to Run Sentiment Analysis in Python using VADER


Words Sentiment Score

We have explained how to get a sentiment score for words in Python. Instead of building our own lexicon, we can use a ready-made one like VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner and is specifically attuned to sentiments expressed in social media.

You can install the VADER library with pip (pip install vaderSentiment), or you can use the version that ships with NLTK. See the VADER documentation for details.

Examples of Sentiment Scores

The VADER library returns 4 values:

  • pos: The proportion of the text that is rated as positive
  • neu: The proportion of the text that is rated as neutral
  • neg: The proportion of the text that is rated as negative
  • compound: The sum of all lexicon ratings in the text, normalized to take values from -1 (most negative) to 1 (most positive)
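
To make the normalization step concrete, here is a minimal sketch. It assumes the formula x / sqrt(x^2 + alpha) with alpha = 15, which to my knowledge is what the reference implementation uses; treat the constant as an assumption and check it against the version you have installed. The function name normalize and the raw sums below are illustrative only.

```python
import math


def normalize(raw_sum, alpha=15):
    """Map an unbounded sum of word ratings into the range [-1, 1].

    alpha = 15 is an assumption about the library's default constant.
    """
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clip defensively so the result never leaves [-1, 1].
    return max(-1.0, min(1.0, score))


# Illustrative raw sums, not taken from real tweets.
for raw in (-3.4, 0.0, 1.0, 10.0):
    print(raw, round(normalize(raw), 4))
```

The larger the raw sum in absolute value, the closer the result gets to -1 or 1, which is why long, strongly worded texts rarely move the compound score much beyond 0.9.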

Notice that the pos, neu and neg proportions add up to 1 (up to rounding). The compound score is a very useful metric when we want a single measure of sentiment. Typical threshold values are the following:

  • positive: compound score >= 0.05
  • neutral: compound score > -0.05 and < 0.05
  • negative: compound score <= -0.05
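
These thresholds translate directly into a small helper. The sketch below is only one way to write it; the function name label_from_compound and the threshold parameter are my own choices, not part of the VADER API.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a compound score into a positive / neutral / negative label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# A few illustrative compound scores.
for score in (0.8476, 0.0, -0.6597):
    print(score, label_from_compound(score))
```

Keeping the threshold as a parameter makes it easy to widen or narrow the neutral band for your own data.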

Let’s see these features in practice. We will work with a sample of tweets obtained from NLTK.

import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import twitter_samples 

nltk.download('twitter_samples')
nltk.download('vader_lexicon')

# get the 5000 positive and 5000 negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

analyzer = SentimentIntensityAnalyzer()
 
 

Let’s get an arbitrary positive tweet and then a negative one.

# positive
all_positive_tweets[100]
 

Let’s have a look at the tweet.

"@metalgear_jp @Kojima_Hideo I want you're T-shirts ! They are so cool ! :D"

Let’s get its sentiment score:

analyzer.polarity_scores(all_positive_tweets[100])
 

The output is 56.8% positive and 43.2% neutral. The compound score is 0.8476.

{'neg': 0.0, 'neu': 0.432, 'pos': 0.568, 'compound': 0.8476}

Let’s do the same for a negative tweet.

all_negative_tweets[20]

which is:

'I feel lonely someone talk to me guys and girls :(\n\n@TheOnlyRazzYT @imarieuda @EiroZPegasus @AMYSQUEE @UdotV'

Let’s get the sentiment score:

analyzer.polarity_scores(all_negative_tweets[20])
 

The output is 70.7% neutral and 29.3% negative. The compound score is -0.6597.

{'neg': 0.293, 'neu': 0.707, 'pos': 0.0, 'compound': -0.6597}

Important Note about Sentiment Scores

In most NLP tasks we need to apply data cleansing first. In my opinion, this should be avoided when we run sentiment analysis with VADER, because VADER uses exactly the cues that cleansing usually removes. Notice that:

  • It is case sensitive. The sentence This is great has a different score than the sentence This is GREAT.
  • Punctuation matters. Exclamation marks, for example, increase the intensity of the sentiment without changing its direction.
  • Emoticons also have a score, and often a very strong one. Try <3, :), :p and :(
  • Words after @ and # have a neutral score.

Get the Sentiment Score of Thousands of Tweets

We will show how you can run sentiment analysis on many tweets. We will work with the 10K sample of tweets obtained from NLTK. We start our analysis by creating a pandas data frame with two columns: tweets, and my_labels, which takes the values 0 (negative) and 1 (positive).

# label positive tweets 1 and negative tweets 0
my_labels = [1]*len(all_positive_tweets) + [0]*len(all_negative_tweets)

# concatenate into a new list so the original lists stay unchanged
all_tweets = all_positive_tweets + all_negative_tweets

df = pd.DataFrame({'tweets' : all_tweets,
                   'my_labels' : my_labels})

df 
 

Now, we will add 4 new columns, neg, neu, pos and compound, using apply with a lambda function.

df['neg'] = df['tweets'].apply(lambda x:analyzer.polarity_scores(x)['neg'])
df['neu'] = df['tweets'].apply(lambda x:analyzer.polarity_scores(x)['neu'])
df['pos'] = df['tweets'].apply(lambda x:analyzer.polarity_scores(x)['pos'])
df['compound'] = df['tweets'].apply(lambda x:analyzer.polarity_scores(x)['compound'])
df
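
The four apply calls above score every tweet four times. If that becomes slow, or if you already have your own list of tweets, one option is to score each text once and keep the whole result. The sketch below assumes only that the scorer returns a dict with the keys neg, neu, pos and compound; score_texts and fake_scores are hypothetical names, and fake_scores is a stand-in so that the example runs without NLTK.

```python
def score_texts(texts, score_fn):
    """Score each text exactly once and return one row (dict) per text."""
    rows = []
    for text in texts:
        scores = score_fn(text)  # a single call per text
        rows.append({"tweets": text, **scores})
    return rows


def fake_scores(text):
    """Hypothetical stand-in for analyzer.polarity_scores (not real VADER)."""
    if ":)" in text:
        return {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.5}
    if ":(" in text:
        return {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.5}
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}


# Made-up texts, only to show the shape of the result.
for row in score_texts(["nice day :)", "so tired :(", "just a tweet"], fake_scores):
    print(row)
```

With the real analyzer you would pass analyzer.polarity_scores as score_fn and your own list of tweets as texts. The resulting list of dicts can be given to pd.DataFrame to get the same columns as above.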
 

Analyze the Sentiment Score

Since we have tidied the data and gathered the required information in a structured format, we can apply any kind of analysis. For example, let’s have a look at the compound score for the positive and negative labels.

df.groupby('my_labels')['compound'].describe()
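
Another simple check is to compare the thresholded compound score with the labels. The sketch below is one possible way to do it, assuming the usual 0.05 threshold and counting neutral scores as misses, since the labels here are only 0 or 1. The function name evaluate and the toy numbers are made up for illustration and are not results from the tweet sample.

```python
def evaluate(compounds, labels, threshold=0.05):
    """Compare thresholded compound scores with 0/1 labels."""
    total = len(labels)
    if total == 0:
        raise ValueError("labels must not be empty")
    correct = 0
    neutral = 0
    for compound, label in zip(compounds, labels):
        if compound >= threshold:
            predicted = 1
        elif compound <= -threshold:
            predicted = 0
        else:
            predicted = None  # neutral: neither positive nor negative
            neutral += 1
        correct += int(predicted == label)
    return {"accuracy": correct / total, "neutral_share": neutral / total}


# Toy data, made up for illustration only.
toy_compounds = [0.85, 0.40, 0.0, -0.66, 0.10, -0.30]
toy_labels = [1, 1, 1, 0, 0, 0]
print(evaluate(toy_compounds, toy_labels))
```

With the data frame above, the same function can be called as evaluate(df['compound'], df['my_labels']).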
 

Let’s also have a look at the boxplot.

df.boxplot(by='my_labels', column='compound', figsize=(12,8))
 

Discussion

The results suggest that VADER is a reliable tool for sentiment analysis, especially on social media comments. As we can see from the box plot above, the positive labels achieved a much higher compound score, with the majority higher than 0.5. On the contrary, the negative labels got a low compound score, with the majority lying below 0.


6 thoughts on “How to Run Sentiment Analysis in Python using VADER”

  1. HI Author,

    Thank you for sharing this.

    I would like to know though how the sentiment scores (neg, neu and pos) are computed manually. I did the Compound score using Excel and I was able to replicate it, except for the 3 scores I mentioned. I need to understand the algorithm before I apply it in any programming work.

    Thank you very much.

  2. How would the code change if I already have a corpus of tweets instead of downloading arbitrary positive or negative tweets?

