Predictive Hacks

Text Cleansing in NLP Tasks


In NLP tasks, we usually apply some text cleansing before moving to the machine learning part. Of course, there are many approaches, but in this tutorial, we will show you a basic approach that includes the following steps:

  • Convert text to lower case
  • Remove leading and trailing whitespace
  • Remove extra space and tabs
  • Remove HTML tags and markups
  • Exclude some words from the stop words list
  • Remove the updated list of stop words from the text
  • Remove the numbers
  • Apply a word tokenizer
  • Remove tokens shorter than 3 characters, which also drops most punctuation
  • Apply stemming

Let’s start coding. First, we will load the dataset and the required libraries.

import pandas as pd
import nltk
import re

nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

Load the Data

df = pd.read_csv('AMAZON-REVIEW-DATA-CLASSIFICATION.csv')
df.head(10)

Build the Text Cleansing Function

Let’s build the text cleansing function and apply it to the reviewText column.

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]


# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')


def text_cleansing(sent): 
    
    # Replace missing values (e.g. NaN) with an empty string
    if not isinstance(sent, str):
        sent = ""

    filtered_sentence=[]

    sent = sent.lower() # Lowercase 
    sent = sent.strip() # Remove leading/trailing whitespace
    sent = re.sub(r'\s+', ' ', sent) # Remove extra spaces and tabs
    sent = re.sub(r'<.*?>', '', sent) # Remove HTML tags/markup

    for w in word_tokenize(sent):
        # We are applying some custom filtering here, feel free to try different things
        # Check if it is not numeric and its length>2 and not in stop words
        if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
            # Stem and add to filtered list
            filtered_sentence.append(snow.stem(w))
    final_string = " ".join(filtered_sentence) #final string of cleaned words
        
    return final_string
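Before applying the function to the whole column, it can help to see what the two regex steps do on their own. Here is a small self-contained illustration (the sample string is made up, not from the dataset). Note that, following the order in the function above, whitespace is collapsed before the HTML tags are removed, so a removed tag can leave a double space behind:

```python
import re

raw = "  Great   product!\t<br/> Works <b>well</b>  "
s = raw.lower().strip()          # lowercase, drop leading/trailing whitespace
s = re.sub(r'\s+', ' ', s)       # collapse runs of spaces and tabs into one space
s = re.sub(r'<.*?>', '', s)      # strip HTML tags (non-greedy match)
print(s)                         # removed tags leave a double space behind
```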



df['cleanReviewText'] = df['reviewText'].apply(text_cleansing)

df.head(10)
  

If you compare the “reviewText” column with the “cleanReviewText” column, you will notice that the transformations we described above have been applied. Feel free to experiment with different text cleansing processes, since this is a very important part of NLP tasks. For example, if you work on a spam detection problem, you may need to keep the currency signs as well as the numbers. Or, in sentiment analysis, you may need to keep some punctuation, and so on.
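As a minimal sketch of such a variation, here is how a cleanser for a spam detector might look. The function name, the character whitelist, and the sample string are all illustrative assumptions, not part of the tutorial's dataset:

```python
import re

def spam_text_cleansing(sent):
    # A lighter cleanser for spam detection: keep digits and currency
    # symbols, since tokens like "$100" can be strong spam signals.
    if not isinstance(sent, str):
        sent = ""
    sent = sent.lower().strip()
    sent = re.sub(r'<.*?>', '', sent)      # strip HTML tags
    sent = re.sub(r'\s+', ' ', sent)       # collapse whitespace
    # Keep word characters, whitespace, and currency/percent signs; drop the rest
    sent = re.sub(r'[^\w\s$€£%]', '', sent)
    return sent

print(spam_text_cleansing("WIN a FREE <b>iPhone</b>!!! Send $100 now"))
# -> win a free iphone send $100 now
```

Unlike the main function, this sketch skips tokenization, stop word removal, and stemming, because exclamation marks, numbers, and currency signs carry signal in spam classification.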
