We have provided an example of how to get a sentiment score for words in Python based on frequency ratios. In this example, we will work with the Naive Bayes approach, using a manually annotated Twitter dataset that ships with NLTK. The sample dataset is separated into positive and negative tweets and contains exactly 5,000 of each.
Positive and Negative Probability of a Word
We have provided an example of Naive Bayes Classification where we explain the theory. To compute the positive probability and the negative probability for a specific word in the vocabulary, we’ll use the following inputs:
- \(freq_{pos}\) and \(freq_{neg}\) are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- \(N_{pos}\) and \(N_{neg}\) are the total number of positive and negative words for all documents (for all SLs), respectively.
- \(V\) is the number of unique words in the entire set of documents, for all classes, whether positive or negative.
We’ll use these to compute the positive and negative probability for a specific word using this formula:
\(P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V} \)
\(P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V} \)
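Notice that we add “+1” to the numerator and \(V\) to the denominator for additive smoothing, so a word that never appears in one class still gets a small non-zero probability. The Wikipedia article on additive smoothing explains more.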
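As a quick check of the two formulas, here is a minimal sketch with made-up counts. The function name `smoothed_probability` and all the numbers are assumptions chosen for illustration; they do not come from the Twitter dataset.

```python
def smoothed_probability(freq, n_class, vocab_size):
    """Additively smoothed probability: (freq + 1) / (N_class + V)."""
    return (freq + 1) / (n_class + vocab_size)

# Hypothetical counts, chosen only for illustration
word_freq_pos, word_freq_neg = 30, 0   # occurrences of the word in each class
n_pos, n_neg = 1000, 1200              # total word counts per class
vocab_size = 500                       # unique words across both classes

p_w_pos = smoothed_probability(word_freq_pos, n_pos, vocab_size)
p_w_neg = smoothed_probability(word_freq_neg, n_neg, vocab_size)

print(p_w_pos)  # 31 / 1500
print(p_w_neg)  # 1 / 1700: small, but not zero thanks to the smoothing
```

Without the smoothing, the negative probability of this word would be exactly zero and the ratio used in the next step would be undefined.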
Log likelihood
To compute the log likelihood of the same word, we use the following equation:
\(\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\)
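A word with a positive log likelihood has a positive sentiment, and a word with a negative log likelihood has a negative sentiment. In principle, the log likelihood can take any value from \(-\infty\) to \(+\infty\); a value near 0 means the word is about equally likely in both classes.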
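To see how the sign behaves, here is a small sketch that applies the equation to hypothetical counts. The helper `log_likelihood` and the numbers are assumptions for illustration only, with both classes given the same total size so that only the word frequencies matter.

```python
import math

def log_likelihood(freq_pos, freq_neg, n_pos, n_neg, vocab_size):
    """log(P(W_pos) / P(W_neg)) using the additively smoothed probabilities."""
    p_w_pos = (freq_pos + 1) / (n_pos + vocab_size)
    p_w_neg = (freq_neg + 1) / (n_neg + vocab_size)
    return math.log(p_w_pos / p_w_neg)

# Hypothetical counts: both classes contain 1000 words, vocabulary of 500
positive_score = log_likelihood(30, 2, 1000, 1000, 500)   # log(31 / 3), above zero
negative_score = log_likelihood(2, 30, 1000, 1000, 500)   # log(3 / 31), below zero
neutral_score = log_likelihood(10, 10, 1000, 1000, 500)   # log(1), i.e. zero

print(positive_score, negative_score, neutral_score)
```

Swapping the positive and negative counts flips the sign of the score, which is why the measure is symmetric around zero.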
Coding
First things first, we will load the libraries and the data.
import numpy as np
import nltk                                # Python library for NLP
from nltk.corpus import twitter_samples    # sample Twitter dataset from NLTK
from collections import Counter

nltk.download('twitter_samples')

# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# get a sample
all_positive_tweets[0:10]
A sample of positive tweets:
['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)', '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!', '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!', '@97sides CONGRATS :)', 'yeaaaah yippppy!!! my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days', '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM', "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI", '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.', 'Jgh , but we have to go to Bayan :D bye', 'As an act of misc
Let’s continue by building the formulas above.
# Create three Counter objects to store positive, negative and total counts
V_freq = Counter()
freq_neg = Counter()
freq_pos = Counter()

for tweet in all_positive_tweets:
    for word in tweet.lower().split(" "):
        freq_pos[word] += 1
        V_freq[word] += 1

for tweet in all_negative_tweets:
    for word in tweet.lower().split(" "):
        freq_neg[word] += 1
        V_freq[word] += 1

# V: the number of unique words across both classes
V = len(V_freq)

# N_pos and N_neg: the total number of word occurrences in each class
N_pos = sum(freq_pos.values())
N_neg = sum(freq_neg.values())

# define the loglikelihood function
def word_loglikelihood(w):
    w = w.lower()
    if w in V_freq:
        # (count + 1) is parenthesised so the whole smoothed count is divided by (N + V).
        # The scores listed under Results were printed without this grouping and with
        # unique-word counts for N, so the exact values this code prints will differ.
        p_w_pos = (freq_pos.get(w, 0) + 1) / (N_pos + V)
        p_w_neg = (freq_neg.get(w, 0) + 1) / (N_neg + V)
        return np.log(p_w_pos / p_w_neg)
    else:
        # unseen words are treated as neutral
        return 0.0

# get the sentiment score of the words that have appeared at least 100 times
wl_dict = {}
for v in V_freq:
    if V_freq[v] >= 100:
        wl_dict[v] = word_loglikelihood(v)
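To check the counting logic without downloading the corpus, the same steps can be tried on a tiny hand-made corpus. This is only a sketch: the four toy tweets and the helpers `build_counts` and `score` are assumptions for illustration, and it splits on any whitespace with `split()` rather than on single spaces.

```python
import math
from collections import Counter

def build_counts(positive_docs, negative_docs):
    """Return per-class word counts and the vocabulary size V."""
    freq_pos, freq_neg = Counter(), Counter()
    for doc in positive_docs:
        freq_pos.update(doc.lower().split())
    for doc in negative_docs:
        freq_neg.update(doc.lower().split())
    vocab = set(freq_pos) | set(freq_neg)
    return freq_pos, freq_neg, len(vocab)

# Hypothetical toy corpus
toy_pos = ["Great day :)", "thanks for the follow :)"]
toy_neg = ["sad day :(", "miss you :("]

freq_pos, freq_neg, V = build_counts(toy_pos, toy_neg)
N_pos = sum(freq_pos.values())   # total word occurrences in positive tweets
N_neg = sum(freq_neg.values())   # total word occurrences in negative tweets

def score(word):
    """Smoothed log likelihood of a word on the toy corpus."""
    p_w_pos = (freq_pos[word] + 1) / (N_pos + V)
    p_w_neg = (freq_neg[word] + 1) / (N_neg + V)
    return math.log(p_w_pos / p_w_neg)

smile_score = score(":)")   # above zero: ':)' only occurs in positive tweets
frown_score = score(":(")   # below zero: ':(' only occurs in negative tweets
print(V, N_pos, N_neg, smile_score, frown_score)
```

A `Counter` returns 0 for missing keys, so no explicit default is needed when a word is absent from one class.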
Results
Let’s see the 20 words with the highest positive and negative sentiment respectively.
Positive Words
We will sort wl_dict by value in descending order and take the top 20 words:
# sort the dictionary by value in descending order and keep the first 20 items
dict(sorted(wl_dict.items(), key=lambda item: item[1], reverse=True)[0:20])
And we get:
{':)': 18.63959625712883, ':-)': 17.004791788224438, ':d': 16.994987788504915, ':p': 15.43519992996352, ':))': 15.265300927223933, 'thanks': 2.807678903079016, 'great': 2.251290369790345, 'thank': 2.1465800079419135, 'happy': 2.003728796400575, 'hi': 1.8458256608722114, '<3': 1.7086919541878434, 'nice': 1.6204865995790583, '!': 1.6094369300047262, 'our': 1.0426532872476826, 'new': 1.0310186653054187, 'an': 1.000973431018713, 'us': 0.9842014904267523, 'follow': 0.9831628218512771, 'good': 0.9139890521105093, 'your': 0.9117691277531761}
Negative Words
Similarly, we will return the top 20 negative words:
# sort the dictionary by value in ascending order and keep the first 20 items
dict(sorted(wl_dict.items(), key=lambda item: item[1])[0:20])
And we get:
{':-(': -16.72062100340213, ':((': -16.277236693311615, ':(((': -15.756702318116552, ':(': -8.222261541574424, 'sad': -3.267660345532505, 'miss': -2.573458846755138, 'followed': -2.1355294279645554, 'sorry': -2.1102119210977404, 'why': -1.7176508188113777, 'wish': -1.6211328621671777, "can't": -1.4271159399725901, 'feel': -1.2425058888621603, 'wanna': -1.2110897009657746, 'want': -1.0801783823774231, 'please': -1.045266117959381, 'been': -0.9671488467945346, 'still': -0.9373438513917545, 'but': -0.9068343183526079, 'too': -0.8574500795948041, 'im': -0.8552028034988816}
The Takeaway
As we can see, we got the expected results. More specifically:
- The emoticons are the most powerful tokens. This is another reason to be careful about removing punctuation in sentiment analysis and other NLP tasks.
- The top 5 positive tokens were smiley faces: :), :-), :D, :P and :)). Notice that we have converted all the letters to lower case.
- Other positive words that we found were thanks, great, happy and nice.
- The top 4 negative tokens were sad faces: :(((, :((, :-( and :(.
- Other negative words that we found were sad, miss, sorry, why, wish, but, feel, etc.
We can improve the sentiment analysis by applying different tokenizers, text-mining techniques and so on. If we want a sentiment score for a word and do not have annotated documents, we can work with other libraries like VADER, as we have explained in another post.
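As one illustration of what a different tokenizer could look like, here is a minimal regex-based sketch that keeps emoticons together and separates punctuation from words. The pattern and the `tokenize` helper are assumptions written for this example, not a standard tokenizer; NLTK also ships a `TweetTokenizer` class in `nltk.tokenize` that is designed for tweets and would be a natural alternative.

```python
import re

# Hypothetical pattern: emoticons first, then words, then any other symbol
TOKEN_PATTERN = re.compile(
    r"""
    [:;=][-']?(?:[)(]+|[dp]\b)   # emoticons such as :) :-( :(( :d :p
    | <3                         # heart
    | [@#]?\w+(?:'\w+)?          # words, @handles, #hashtags, contractions
    | [^\w\s]                    # any other single punctuation character
    """,
    re.VERBOSE | re.IGNORECASE,
)

def tokenize(text):
    """Lower-case the text and split it into emoticons, words and punctuation."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("Thanks @bob!! Can't wait :) :-(( <3")
print(tokens)
```

Compared with splitting on spaces, this keeps "!!" from sticking to the preceding word, so "bob!!" and "bob" are no longer counted as different tokens.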