Predictive Hacks

How to Add NLTK Tokenizers to Scikit-learn's TfidfVectorizer

Scikit-learn's TfidfVectorizer and CountVectorizer are very convenient for NLP tasks. However, other packages such as NLTK sometimes offer more tokenizer options. Let’s see how we can add an NLTK tokenizer to the TfidfVectorizer. Assume that we want to use the TweetTokenizer and that our data frame is called train, with the documents stored in the “Tweet” column. All we have to do is wrap the tokenizer in a function and pass that function to the TfidfVectorizer through its “tokenizer” argument. For example:

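import pandas as pd
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

# `train` is assumed to be a pandas DataFrame that is already loaded,
# with the raw documents in its "Tweet" column

# create the NLTK tweet tokenizer once and wrap it in a function
tt = TweetTokenizer()

def tok(text):
    return tt.tokenize(text)

# pass the function to the "tokenizer" argument and fit on the tweets
vect = TfidfVectorizer(min_df=20, max_df=0.95, ngram_range=(1,1), stop_words='english', tokenizer=tok).fit(train.Tweet)

# transform the tweets into a TF-IDF matrix
train_transformed = vect.transform(train.Tweet)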
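
To see what the custom tokenizer actually changes, here is a minimal self-contained sketch on a toy corpus. The three example tweets, the variable names, and the TweetTokenizer options (preserve_case, strip_handles, reduce_len) are illustrative assumptions and not part of the original example; min_df and stop_words are left out so that the tiny vocabulary is not filtered away.

```python
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration only)
tweets = [
    "I love #nlp :) @user",
    "I hate waiting :( #mondays",
    "love the new #nlp course @user",
]

# Build the tokenizer once; its bound `tokenize` method can be passed directly
tweet_tokenizer = TweetTokenizer(
    preserve_case=False,  # lowercase words (emoticons are left untouched)
    strip_handles=True,   # drop @mentions
    reduce_len=True,      # shorten long character runs such as "sooooo"
)

vect = TfidfVectorizer(
    tokenizer=tweet_tokenizer.tokenize,
    token_pattern=None,   # not used when a tokenizer is given; avoids a warning
    lowercase=False,      # let the TweetTokenizer handle casing
)
X = vect.fit_transform(tweets)

# The default tokenization, for comparison
default_vect = TfidfVectorizer().fit(tweets)

print(sorted(vect.vocabulary_))
print(sorted(default_vect.vocabulary_))
```

With the TweetTokenizer, hashtags and emoticons survive as single tokens, while the default token pattern reduces "#nlp" to "nlp" and discards the emoticons entirely. Setting lowercase=False hands casing over to the tokenizer, so that the vectorizer does not lowercase the text before the tokenizer sees it.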
