Words Sentiment Score
We have explained how to get a sentiment score for words in Python. Instead of building our own lexicon, we can use a pre-trained one like VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner and is specifically attuned to sentiments expressed in social media.
You can install the VADER library using pip:
pip install vaderSentiment
or you can get it directly from NLTK. You can have a look at the VADER documentation.
Examples of Sentiment Scores
The VADER library returns 4 values:
- pos: The proportion of the text that is rated positive
- neu: The proportion of the text that is rated neutral
- neg: The proportion of the text that is rated negative
- compound: The sum of all lexicon ratings in the text, normalized to take values from -1 to 1
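For readers curious about the normalization step: as far as I can tell from the reference VADER source code, the raw sum of ratings x is rescaled as x / sqrt(x^2 + alpha), with alpha = 15 as the default, and then clipped to the range -1 to 1. The sketch below reproduces only that step. The function name and the example inputs are my own, and the raw sum itself (which depends on the lexicon and on VADER's rules for capitalization, punctuation, negation and so on) is not computed here.

```python
import math

def normalize_compound(raw_sum, alpha=15):
    """Rescale a raw sum of lexicon ratings into the range [-1, 1].

    Assumption: alpha=15 mirrors the default in the reference VADER code.
    """
    norm = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clip defensively so the result never leaves [-1, 1].
    return max(-1.0, min(1.0, norm))

print(normalize_compound(0.0))            # 0.0
print(round(normalize_compound(4.0), 4))  # 0.7184
```

Because of the square root in the denominator, large raw sums flatten out close to 1 or -1 instead of growing without bound.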
Notice that the pos, neu and neg proportions add up to 1 (up to rounding). Also, the compound score is a very useful metric in case we want a single measure of sentiment. Typical threshold values are the following:
- positive: compound score>=0.05
- neutral: compound score between -0.05 and 0.05
- negative: compound score<=-0.05
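These thresholds are easy to turn into a small helper. This is a minimal sketch; the function name sentiment_label and its threshold parameter are my own and not part of VADER.

```python
def sentiment_label(compound, threshold=0.05):
    """Map a compound score to a label using the typical thresholds."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(sentiment_label(0.8476))   # positive
print(sentiment_label(-0.6597))  # negative
print(sentiment_label(0.0))      # neutral
```

Keeping the threshold as a parameter makes it easy to try a wider or narrower neutral band on your own data.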
Let’s see these features in practice. We will work with a sample of tweets obtained from NLTK.
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import twitter_samples

nltk.download('twitter_samples')
nltk.download('vader_lexicon')

# get 5000 positive and 5000 negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

analyzer = SentimentIntensityAnalyzer()
Let’s get an arbitrary positive tweet and then a negative one.
# positive
all_positive_tweets[100]
Let’s have a look at the tweet.
"@metalgear_jp @Kojima_Hideo I want you're T-shirts ! They are so cool ! :D"
Let’s get its sentiment score:
analyzer.polarity_scores(all_positive_tweets[100])
The output is 56.8% positive and 43.2% neutral. The compound score is 0.8476:
{'neg': 0.0, 'neu': 0.432, 'pos': 0.568, 'compound': 0.8476}
Let’s do the same for a negative tweet.
all_negative_tweets[20]
which is:
'I feel lonely someone talk to me guys and girls :(\n\n@TheOnlyRazzYT @imarieuda @EiroZPegasus @AMYSQUEE @UdotV'
Let’s get the sentiment score:
analyzer.polarity_scores(all_negative_tweets[20])
The output is 70.7% neutral and 29.3% negative. The compound score is -0.6597:
{'neg': 0.293, 'neu': 0.707, 'pos': 0.0, 'compound': -0.6597}
Important Note about Sentiment Scores
In most NLP tasks we need to apply data cleansing first. In my opinion, this should be avoided when we run sentiment analysis. Notice that VADER:
- Is case sensitive. The sentence This is great has a different score than the sentence This is GREAT.
- Takes punctuation into account. Exclamation marks, for example, intensify the sentiment of the sentence.
- Scores emoticons too, and often with very strong sentiment. Try <3, :), :p and :(
- Gives a neutral score to words after @ and #.
Get the Sentiment Score of Thousands of Tweets
We will show how you can run sentiment analysis on many tweets. We will work with the 10K sample of tweets obtained from NLTK. We start our analysis by creating a pandas data frame with two columns, tweets and my_labels, where my_labels takes the values 0 (negative) and 1 (positive).
my_labels = [1]*len(all_positive_tweets)
negative_labels = [0]*len(all_negative_tweets)
my_labels.extend(negative_labels)
all_positive_tweets.extend(all_negative_tweets)
df = pd.DataFrame({'tweets' : all_positive_tweets, 'my_labels' : my_labels})
df
Now, we will add 4 new columns, neg, neu, pos and compound, using a lambda function.
df['neg'] = df['tweets'].apply(lambda x:analyzer.polarity_scores(x)['neg'])
df['neu'] = df['tweets'].apply(lambda x:analyzer.polarity_scores(x)['neu'])
df['pos'] = df['tweets'].apply(lambda x:analyzer.polarity_scores(x)['pos'])
df['compound'] = df['tweets'].apply(lambda x:analyzer.polarity_scores(x)['compound'])
df
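The four apply calls above score every tweet four times. If speed matters, one alternative is to score each tweet once and expand the returned dictionary into columns. This is a minimal sketch of that idea: add_sentiment_columns is a hypothetical helper of my own, and toy_scorer is a made-up stand-in (it is not VADER) so that the example runs without NLTK.

```python
import pandas as pd

def add_sentiment_columns(df, score_fn, text_col="tweets"):
    """Return a copy of df with one new column per key returned by score_fn."""
    # Score each text exactly once; every result is a dict such as
    # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
    score_dicts = df[text_col].map(score_fn).tolist()
    # Expand the list of dicts into columns aligned with df's index
    scores = pd.DataFrame(score_dicts, index=df.index)
    return df.join(scores)

def toy_scorer(text):
    """Hypothetical stand-in scorer for the demo only, NOT VADER."""
    pos = float(":)" in text)
    neg = float(":(" in text)
    return {"neg": neg, "neu": 1.0 - max(pos, neg), "pos": pos, "compound": pos - neg}

demo = pd.DataFrame({"tweets": ["nice :)", "sad :(", "plain"]})
scored = add_sentiment_columns(demo, toy_scorer)
print(scored)
```

With the analyzer created earlier, the call would be add_sentiment_columns(df, analyzer.polarity_scores), which should give the same four columns as the lambda version.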
Analyze the Sentiment Score
Since we have tidied the data and gathered the required information in a structured format, we can apply any kind of analysis. For example, let’s have a look at the compound score for the positive and negative labels.
df.groupby('my_labels')['compound'].describe()
Let’s also have a look at the boxplot.
df.boxplot(by='my_labels', column='compound', figsize=(12,8))
Discussion
VADER appears to be a reliable tool for sentiment analysis, especially on social media comments. As we can see from the box plot above, the positive labels achieved a much higher compound score, with the majority above 0.5. On the contrary, the negative labels got a very low compound score, with the majority lying below 0.
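To put a number on this, we could turn the compound score into a predicted label and compare it with my_labels. This is a minimal sketch: the helper names, the cut-off at 0 and the choice to count a compound score of exactly 0 as negative are my own assumptions, and the values in the demo are made up for illustration, not results from the tweets.

```python
def predict_label(compound, threshold=0.0):
    """Predict 1 (positive) if compound is above the threshold, else 0 (negative)."""
    return 1 if compound > threshold else 0

def accuracy(compounds, labels, threshold=0.0):
    """Share of items whose predicted label matches the true label."""
    pairs = list(zip(compounds, labels))
    correct = sum(predict_label(c, threshold) == y for c, y in pairs)
    return correct / len(pairs)

# Made-up compound scores and labels, for illustration only
toy_compounds = [0.8, -0.6, 0.0, 0.3]
toy_labels = [1, 0, 0, 0]
print(accuracy(toy_compounds, toy_labels))  # 0.75
```

With the data frame from this post, the call would be accuracy(df['compound'], df['my_labels']).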