Predictive Hacks

How to Remove Stopwords from Text in Python

In many NLP tasks, it is necessary to remove “stopwords” from the text. By “stopwords” we usually mean words that occur very frequently and don’t contribute much to the overall meaning of a sentence. Examples of stopwords include "a", "an", "the", "this", "that", "is", "it", "to" and "and".

In this tutorial, we will show how to remove stopwords in Python using the NLTK library.

Let’s load the libraries and download the NLTK resources we need:

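import nltk

# Download the stop word lists and the Punkt tokenizer models (only needed once).
# Note: recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize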

The English stop words are given by the list:

stopwords.words('english')
 

However, you can also create your own stop word list, for example:

stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]
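
As a quick illustration of how such a custom list can be used on its own, here is a minimal sketch that does not rely on NLTK at all. It assumes a simplified regular-expression tokenizer (runs of letters and apostrophes), which is not the NLTK tokenizer; the function name `remove_custom_stopwords` and the sample sentence are hypothetical and only serve the example.

```python
import re

# Custom stop word list from the example above
stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]

def remove_custom_stopwords(text, stop_words):
    """Return the words of `text` that are not in `stop_words` (case-insensitive)."""
    stop_set = set(stop_words)  # set lookup is faster than list lookup
    # Simplified tokenizer (an assumption): runs of letters and apostrophes
    tokens = re.findall(r"[A-Za-z']+", text)
    return [t for t in tokens if t.lower() not in stop_set]

# Hypothetical sample sentence
sample = "This is a post about the weather and it is short"
kept = remove_custom_stopwords(sample, stop_words)
print(kept)
```

With this sample, only the words that are not in the custom list should remain. For real text, the NLTK tokenizer used later in this post handles punctuation and contractions more carefully.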

How to Add Stopwords to the NLTK Stopword List

Alternatively, you can add your own custom stop words to the NLTK stopword list. For example:

# stopwords from NLTK
my_stopwords = nltk.corpus.stopwords.words('english')

# my new custom stopwords
my_extra = ['abc', 'google', 'apple']

# add the new custom stopwords to my stopwords
my_stopwords.extend(my_extra)
 

How to Remove Stopwords from the NLTK Stopword List

Similarly, you can remove some words from the “stopword list” using list comprehensions. For example:

# remove these words from stop words
my_lst = ['have', 'few']

# update the stopwords list without the words above
my_stopwords = [el for el in my_stopwords if el not in my_lst]
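
Both operations (adding and removing words) can also be expressed with Python sets, which avoids duplicates and makes membership tests faster. The following is only a sketch: `base_stopwords` is a small hypothetical stand-in for the NLTK list so that the example runs without any download, and the variable names are assumptions.

```python
# Small stand-in for the NLTK list (an assumption, so the sketch runs offline)
base_stopwords = ["a", "the", "have", "few", "is"]

to_add = {"abc", "google", "apple"}  # extra custom stop words
to_remove = {"have", "few"}          # words we no longer want to treat as stop words

# Set union adds the new words, set difference removes the unwanted ones
custom_stopwords = (set(base_stopwords) | to_add) - to_remove

print(sorted(custom_stopwords))
```

Keep in mind that a set has no order, so convert it back with `sorted(...)` or `list(...)` if you need a list.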
 

How to Remove Stopwords from Text

Now we are ready to remove the stopwords from the text. Let’s consider the following nonsense text for demonstration purposes.

my_txt = "I'm George. I live in Athens! This is my blog, hopefully you enjoy this post! Look at this!"

# Use a set for fast membership tests
stop_words = set(stopwords.words('english'))

# Tokenize the text
words = word_tokenize(my_txt)

# Keep only the tokens that are not stop words (case-insensitive comparison)
filtered_list = [w for w in words if w.lower() not in stop_words]

filtered_list

Output:

["'m",
 'George',
 '.',
 'live',
 'Athens',
 '!',
 'blog',
 ',',
 'hopefully',
 'enjoy',
 'post',
 '!',
 'Look',
 '!']

Now, we may want to convert the list back to a string:

my_clean_txt = " ".join(filtered_list)
my_clean_txt
 

Output:

"'m George . live Athens ! blog , hopefully enjoy post ! Look !"
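
As the output shows, the punctuation tokens are still there, because they are not stop words. If you want to drop them too, one possible approach is to filter out tokens that consist only of punctuation characters. The sketch below assumes that `string.punctuation` from the standard library is a good enough definition of punctuation for your text; the names `no_punct` and `clean_txt` are hypothetical.

```python
import string

# Tokens from the output above
filtered_list = ["'m", 'George', '.', 'live', 'Athens', '!', 'blog', ',',
                 'hopefully', 'enjoy', 'post', '!', 'Look', '!']

# Keep a token only if at least one of its characters is not punctuation
no_punct = [t for t in filtered_list
            if not all(ch in string.punctuation for ch in t)]

clean_txt = " ".join(no_punct)
print(clean_txt)
```

Note that a token such as "'m" is kept, since it contains a letter. Whether to remove such leftovers depends on your task.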
