Predictive Hacks

Stemming and Lemmatization in Python using NLTK


In this tutorial, we will show you how to use stemming and lemmatization in NLP tasks. You can find more information about stemming and lemmatization in this post from Stanford.

Stemming

Stemming is a rule-based process that reduces tokens to their root form by stripping suffixes. Let’s consider the following text and apply stemming using the SnowballStemmer from NLTK.

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')


import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer


text = 'Jim has an engineering background and he works as project manager! Before he was working as a developer in a software company'


snow = SnowballStemmer('english')

stemmed_sentence = []
# Word Tokenizer
words = word_tokenize(text)
for w in words:
    # Apply Stemming
    stemmed_sentence.append(snow.stem(w))
stemmed_text = " ".join(stemmed_sentence)
 
stemmed_text
 

Output

'jim has an engin background and he work as project manag ! befor he was work as a develop in a softwar compani'

Note that we could have used the Porter stemmer as follows:

from nltk.stem.porter import PorterStemmer

# Initialize the stemmer
porter = PorterStemmer()

stemmed_sentence = []
# Tokenize the sentence
words = word_tokenize(text)
for w in words:
    # Stem the word/token
    stemmed_sentence.append(porter.stem(w))
stemmed_text = " ".join(stemmed_sentence)
 
stemmed_text
 

Output

'jim ha an engin background and he work as project manag ! befor he wa work as a develop in a softwar compani'

As we can see, the two stemmers return similar results, although the Porter stemmer also reduces “has” to “ha” and “was” to “wa”. Focus on the words:

  • engineering -> engin
  • manager -> manag
  • developer -> develop

and so on.
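A quick way to see where the two stemmers agree and where they differ is to stem a handful of words from the text with both and compare; a minimal sketch:

```python
from nltk.stem import SnowballStemmer
from nltk.stem.porter import PorterStemmer

snow = SnowballStemmer('english')
porter = PorterStemmer()

# Words from the example text, stemmed by both algorithms
for word in ['engineering', 'manager', 'developer', 'has', 'was']:
    print(f"{word}: snowball={snow.stem(word)}, porter={porter.stem(word)}")
```

Both stemmers map “engineering” to “engin”, “manager” to “manag”, and “developer” to “develop”, but Snowball (also known as the “Porter2” algorithm) leaves “has” and “was” intact while Porter trims them to “ha” and “wa”.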

Lemmatization

Lemmatization is not a rule-based process like stemming, and it is much more computationally expensive. In lemmatization, we need to know the part of speech (POS) of each token, such as “verb”, “adverb”, or “noun”. Let’s look at the following example.

# Importing the necessary functions
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Initialize the lemmatizer
wl = WordNetLemmatizer()

# This is a helper function to map NLTK POS tags to the tags used by WordNet
# Full list is available here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

lemmatized_sentence = []
# Tokenize the sentence
words = word_tokenize(text)
# Get the POS tag for each token
word_pos_tags = nltk.pos_tag(words)
# Map the POS tag and lemmatize the word/token
for word, tag in word_pos_tags:
    lemmatized_sentence.append(wl.lemmatize(word, get_wordnet_pos(tag)))

lemmatized_text = " ".join(lemmatized_sentence)

lemmatized_text 
 

Output

Jim have an engineering background and he work a project manager ! Before he be work a a developer in a software company

