In many NLP tasks, it is necessary to remove “stopwords” from the text. By “stopwords” we usually mean words that occur frequently and contribute little to the overall meaning of a sentence. Some examples of stopwords are "a", "an", "the", "this", "that", "is", "it", "to", "and", and so on.
In this tutorial, we will show how to remove stopwords in Python using the NLTK library.
Let’s load the libraries:
import nltk

nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
The English stop words are given by the list:
stopwords.words('english')
However, you can also create your own stop word list, for example:
stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]
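As a minimal sketch of how such a custom list can be used on its own, the snippet below stores the words in a set (membership tests on a set are generally faster than on a list) and filters a sentence split on whitespace. The name `custom_stop_words` and the sample sentence are assumptions made for illustration; no NLTK data is needed for this part.

```python
# A hand-written stop word collection (same words as above), stored as a set
custom_stop_words = {"a", "an", "the", "this", "that", "is", "it", "to", "and"}

# Hypothetical sample sentence, split on whitespace for simplicity
sample = "This is a short example and it is easy to follow"
tokens = sample.split()

# Keep only the tokens whose lowercase form is not a stop word
kept = [t for t in tokens if t.lower() not in custom_stop_words]
print(kept)
```

Splitting on whitespace is only a rough stand-in for a real tokenizer. Punctuation stays attached to the words, so `word_tokenize` is the better choice for real text.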
How to Add Stopwords to the NLTK Stopword List
Alternatively, you can add your custom stop words to the NLTK stopword list. For example:
# stopwords from NLTK
my_stopwords = nltk.corpus.stopwords.words('english')

# my new custom stopwords
my_extra = ['abc', 'google', 'apple']

# add the new custom stopwords to my stopwords
my_stopwords.extend(my_extra)
How to Remove Stopwords from the NLTK Stopword List
Similarly, you can remove some words from the “stopword list” using list comprehensions. For example:
# remove these words from stop words
my_lst = ['have', 'few']

# update the stopwords list without the words above
my_stopwords = [el for el in my_stopwords if el not in my_lst]
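Adding and removing words can also be expressed with set operations. The sketch below is one possible alternative, not part of NLTK itself. To keep it self-contained, `base_stopwords` is a small hypothetical stand-in for the list returned by `stopwords.words('english')`.

```python
# Small stand-in for the NLTK list (hypothetical sample, not the full list)
base_stopwords = ["a", "the", "have", "few", "is"]

to_add = {"abc", "google", "apple"}   # custom words to include
to_remove = {"have", "few"}           # words we want to keep in the text

# Union adds the custom words, difference drops the unwanted ones
updated = (set(base_stopwords) | to_add) - to_remove

print(sorted(updated))
```

A set discards duplicates and does not preserve the original order, which rarely matters for stop word filtering.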
How to Remove Stopwords from Text
Now, we are ready to remove the stopwords from the text. Let’s consider the following nonsense text for demonstration purposes.
my_txt = "I'm George. I live in Athens! This is my blog, hopefully you enjoy this post! Look at this!"

filtered_list = []
stop_words = nltk.corpus.stopwords.words('english')

# Tokenize the sentence
words = word_tokenize(my_txt)

for w in words:
    if w.lower() not in stop_words:
        filtered_list.append(w)

filtered_list
Output:
["'m",
'George',
'.',
'live',
'Athens',
'!',
'blog',
',',
'hopefully',
'enjoy',
'post',
'!',
'Look',
'!']
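Notice that the punctuation tokens survive, because they are not in the stop word list. If they are not wanted, one possible follow-up step is to drop every token that consists only of punctuation characters. This is a sketch using Python's standard `string.punctuation`; the token list is copied from the output above, and the name `no_punct` is an assumption for illustration.

```python
import string

# Tokens taken from the output above
filtered_list = ["'m", "George", ".", "live", "Athens", "!", "blog", ",",
                 "hopefully", "enjoy", "post", "!", "Look", "!"]

# Drop tokens made up entirely of punctuation characters
no_punct = [t for t in filtered_list
            if not all(ch in string.punctuation for ch in t)]
print(no_punct)
```

The token `'m` is kept, since it contains a letter. Whether to remove such contraction fragments as well depends on the task.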
Finally, we may want to join the filtered tokens back into a single string:
my_clean_txt = " ".join(filtered_list)
my_clean_txt
Output:
"'m George . live Athens ! blog , hopefully enjoy post ! Look !"
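Joining with a single space leaves a gap before each punctuation mark. If a tidier string is preferred, one option is a small regular-expression cleanup. This is a sketch under the assumption that only the marks listed in the pattern need handling; the name `tidy_txt` is hypothetical.

```python
import re

# The joined string from the previous step
my_clean_txt = "'m George . live Athens ! blog , hopefully enjoy post ! Look !"

# Remove the whitespace that join() left before punctuation marks
tidy_txt = re.sub(r"\s+([.,!?;:])", r"\1", my_clean_txt)
print(tidy_txt)
```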