In NLP tasks, we usually apply some text cleansing before moving on to the machine learning part. There are, of course, many approaches; in this tutorial, we will walk through a basic one that includes the following steps:
- Convert text to lower case
- Remove leading and trailing whitespace
- Remove extra spaces and tabs
- Remove HTML tags and markup
- Exclude some words from the stop words list
- Remove the updated list of stop words from the text
- Remove the numbers
- Apply a word tokenizer
- Remove tokens shorter than 3 characters, which also drops most punctuation
- Apply stemming (a short stemmer example follows this list)
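Before we wire everything together, here is a minimal look at what the Snowball stemmer used in the last step does to a few sample words (the word list is just an illustration):

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
for word in ['working', 'worked', 'works', 'batteries']:
    # Each word is reduced to its stem, e.g. 'working' -> 'work'
    print(word, '->', stemmer.stem(word))
```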
Let’s start coding. First, we will load the required libraries and download the NLTK resources:
```python
import pandas as pd
import nltk
import re

nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
```
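One caveat from our side rather than from the tutorial itself: on recent NLTK releases, word_tokenize may additionally need the punkt_tab resource. If you hit a LookupError, downloading it should fix things:

```python
import nltk

# Only needed if word_tokenize raises a LookupError on newer NLTK versions
nltk.download('punkt_tab')
```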
Load the Data
```python
df = pd.read_csv('AMAZON-REVIEW-DATA-CLASSIFICATION.csv')
df.head(10)
```
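Before cleaning anything, a quick sanity check on the data is helpful. This is a generic pandas sketch; only the reviewText column name comes from the dataset itself:

```python
# Dataset size and number of missing reviews.
# NaN reviews are handled inside the cleansing function defined below.
print(df.shape)
print(df['reviewText'].isna().sum())
```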
Build the Text Cleansing Function
Let’s build the text cleansing function and apply it to the reviewText column.
```python
# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't", 'ain', 'aren', "aren't",
             'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't",
             'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't",
             'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't", 'shouldn', "shouldn't", 'wasn', "wasn't",
             'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def text_cleansing(sent):
    # Treat missing values (e.g. NaN) as empty strings
    if not isinstance(sent, str):
        sent = ""
    filtered_sentence = []
    sent = sent.lower()                       # Lowercase
    sent = sent.strip()                       # Remove leading/trailing whitespace
    sent = re.sub(r'\s+', ' ', sent)          # Remove extra spaces and tabs
    sent = re.compile('<.*?>').sub('', sent)  # Remove HTML tags/markups
    for w in word_tokenize(sent):
        # We are applying some custom filtering here; feel free to try different things.
        # Keep a token only if it is not numeric, is longer than 2 characters,
        # and is not in the stop word list.
        if (not w.isnumeric()) and (len(w) > 2) and (w not in stop_words):
            # Stem and add to the filtered list
            filtered_sentence.append(snow.stem(w))
    # Final string of cleaned words
    return " ".join(filtered_sentence)

df['cleanReviewText'] = df['reviewText'].apply(text_cleansing)
df.head(10)
```
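As a quick sanity check, you can run the function on a made-up review (not one from the dataset) and confirm that negations survive while numbers, short tokens, punctuation, and HTML tags are stripped:

```python
sample = "I bought 2 of these <br> and they are NOT working!!"
# Should print something along the lines of: bought not work
print(text_cleansing(sample))
```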
If you compare “reviewText” with “cleanReviewText”, you will notice that the transformations we mentioned above have been applied. Feel free to experiment with different text cleansing processes, since this is a very important part of NLP tasks. For example, if you are working on a spam detection problem, you may need to keep currency signs as well as numbers, as sketched below; in sentiment analysis, you may need to keep some punctuation, and so on.
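To make that last point concrete, here is one way you might adapt the filter for a spam-like problem, reusing stop_words and snow from above. This is only a sketch under our own assumptions; the set of currency symbols to keep, in particular, is something you should adjust to your data:

```python
def text_cleansing_spam(sent):
    # Variant of text_cleansing for spam-like problems:
    # numbers and currency signs can be strong spam signals, so we keep them.
    if not isinstance(sent, str):
        sent = ""
    sent = sent.lower().strip()
    sent = re.sub(r'\s+', ' ', sent)
    sent = re.compile('<.*?>').sub('', sent)
    keep_symbols = {'$', '€', '£'}  # assumption: extend with whatever matters for your data
    filtered = []
    for w in word_tokenize(sent):
        if w in keep_symbols or w.isnumeric():
            filtered.append(w)  # keep numbers and currency signs as-is
        elif len(w) > 2 and w not in stop_words:
            filtered.append(snow.stem(w))
    return " ".join(filtered)

# The amount and the currency sign survive the cleaning here
print(text_cleansing_spam("WIN $1000000 now!!!"))
```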