TfidfVectorizer and CountVectorizer from scikit-learn are convenient for NLP tasks. Sometimes, however, other packages such as NLTK offer more tokenizer options. Let’s see how to plug an NLTK tokenizer into TfidfVectorizer. Assume that we want to use the TweetTokenizer, that our data frame is called train, and that its “Tweet” column holds the documents. All we have to do is wrap the tokenizer in a function and pass that function to TfidfVectorizer through its “tokenizer” parameter. For example:
import pandas as pd
import nltk
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

# create a function for the tweet tokenizer from NLTK
def tok(text):
    tt = TweetTokenizer()
    return tt.tokenize(text)

vect = TfidfVectorizer(min_df=20, max_df=0.95, ngram_range=(1,1),
                       stop_words='english', tokenizer=tok).fit(train.Tweet)
train_transformed = vect.transform(train.Tweet)
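To see what the custom tokenizer changes, here is a minimal self-contained sketch on a toy corpus. The `tweets` list and the variable names are made up for illustration and stand in for train.Tweet. The `strip_handles` and `preserve_case` options are just one possible TweetTokenizer configuration, not something the example above requires. The sketch fits one vectorizer with scikit-learn's default tokenization and one with the NLTK tokenizer, then compares the vocabularies.

```python
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for train.Tweet
tweets = [
    "@alice loving the new #python release :)",
    "@bob the #python docs are great :)",
    "no hashtags or smileys in this one",
]

# Build the tokenizer once and pass its bound method as the callable,
# so a new TweetTokenizer is not created for every document.
# strip_handles=True drops @mentions; preserve_case=False lowercases words.
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True)

# Vectorizer using scikit-learn's default token pattern
default_vect = TfidfVectorizer().fit(tweets)

# Vectorizer using the NLTK tweet tokenizer
tweet_vect = TfidfVectorizer(tokenizer=tweet_tokenizer.tokenize)
X = tweet_vect.fit_transform(tweets)

# The default pattern keeps only word characters, so '#python' becomes 'python'
print(sorted(default_vect.vocabulary_))
# The tweet tokenizer keeps hashtags and emoticons as single tokens
print(sorted(tweet_vect.vocabulary_))
```

With the default settings, hashtags lose their `#` and emoticons disappear, while the tweet tokenizer keeps both as features. scikit-learn may warn that `token_pattern` is ignored when a `tokenizer` is supplied; that warning is expected here.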