Stemming is a rule-based process that reduces tokens to their root form by removing suffixes. Let's consider the following text and apply stemming using the SnowballStemmer from NLTK.
```python
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

text = 'Jim has an engineering background and he works as project manager! Before he was working as a developer in a software company'

snow = SnowballStemmer('english')
stemmed_sentence = []
# Word Tokenizer
words = word_tokenize(text)
for w in words:
    # Apply Stemming
    stemmed_sentence.append(snow.stem(w))
stemmed_text = " ".join(stemmed_sentence)
stemmed_text
```
'jim has an engin background and he work as project manag ! befor he was work as a develop in a softwar compani'
Note that we could have used the Porter stemmer as follows:
```python
from nltk.stem.porter import PorterStemmer

# Initialize the stemmer
porter = PorterStemmer()
stemmed_sentence = []
# Tokenize the sentence
words = word_tokenize(text)
for w in words:
    # Stem the word/token
    stemmed_sentence.append(porter.stem(w))
stemmed_text = " ".join(stemmed_sentence)
stemmed_text
```
'jim ha an engin background and he work as project manag ! befor he wa work as a develop in a softwar compani'
As we can see, both stemmers returned similar results. Focus on the following words:
- engineering –> engin
- manager –> manag
- developer –> develop
and so on.
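To confirm these mappings, here is a minimal sketch (using the same two NLTK stemmers as above) that stems the individual words directly rather than the whole sentence:

```python
from nltk.stem import SnowballStemmer
from nltk.stem.porter import PorterStemmer

snow = SnowballStemmer('english')
porter = PorterStemmer()

# Stem each word with both stemmers and compare the results
for word in ['engineering', 'manager', 'developer']:
    print(f"{word} -> Snowball: {snow.stem(word)}, Porter: {porter.stem(word)}")
```

For these three words the two stemmers agree; they differ on words like "has" and "was", which Porter trims to "ha" and "wa" while Snowball leaves them intact.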
Lemmatization is not a rule-based process like stemming, and it is much more computationally expensive. In lemmatization, we need to know the part of speech of each token, such as "verb", "adverb", or "noun". Let's look at the following example.
```python
# Importing the necessary functions
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Initialize the lemmatizer
wl = WordNetLemmatizer()

# This is a helper function to map NLTK POS tags to WordNet POS tags
# Full list is available here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

lemmatized_sentence = []
# Tokenize the sentence
words = word_tokenize(text)
# Get the POS tags
word_pos_tags = nltk.pos_tag(words)
# Map the POS tag and lemmatize the word/token
for word, tag in word_pos_tags:
    lemmatized_sentence.append(wl.lemmatize(word, get_wordnet_pos(tag)))
lemmatized_text = " ".join(lemmatized_sentence)
lemmatized_text
```
'Jim have an engineering background and he work a project manager ! Before he be work a a developer in a software company'