In NLP tasks we usually remove punctuation and “stopwords” from the corpus. This is fine when we are dealing with large corpora and performing tasks like document similarity, classification, clustering, etc. However, in some projects we need to keep everything. For example, when we model small documents like Email Subject Lines, we want to keep tokens such as “50% off” and “$50”. Also, when we do Natural Language Generation, we want to keep the stopwords as well as the punctuation marks, since they indicate where a sentence ends.
Thus, the tokenizer replaces each punctuation mark with itself surrounded by spaces. The text can then be split into tokens on whitespace (for example with the regex “\S+”, which matches runs of non-whitespace characters).
In the following code, we are replacing the punctuation marks as described above.
Let’s assume we have the sentence:
That's an example. Don't ignore it!
```python
import re
import pandas as pd

text = '''That's an example. Don't ignore it!'''

# List of all punctuation marks except "'" (because we want to keep
# words like "that's" and "don't" as single tokens)
r = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

# Escape each punctuation mark so it can be used safely in a regex
to_replace = [re.escape(i) for i in r]

# Add a space before and after each punctuation mark
replace_with = [' ' + i + ' ' for i in r]

# Convert the sentence to a DataFrame so we can easily replace all
# punctuation marks with the pandas "replace" function
x = pd.DataFrame([text])[0].replace(to_replace, replace_with, regex=True)[0]
print(x)
```
That's an example . Don't ignore it !
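Since every punctuation mark is now padded with spaces, splitting on whitespace gives the final tokens. A minimal sketch of that step, reusing the padded string `x` from the snippet above:

```python
import re

# "\S+" matches runs of non-whitespace characters, so it effectively
# splits the padded string on one or more spaces
tokens = re.findall(r'\S+', x)
print(tokens)
# ["That's", 'an', 'example', '.', "Don't", 'ignore', 'it', '!']
```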
Tokenizer Function
We can wrap the above in a function so that we can easily apply it to a data frame column.
```python
import re
import pandas as pd

def tokenizer(x):
    # All punctuation marks except "'"
    r = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
    to_replace = [re.escape(i) for i in r]
    replace_with = [' ' + i + ' ' for i in r]
    # Lowercase the text, then pad every punctuation mark with spaces
    x = x.lower()
    x = pd.DataFrame([x])[0].replace(to_replace, replace_with, regex=True)[0]
    return x

corpus = pd.DataFrame(["That's an example.",
                       "Don't ignore it!",
                       "This is another example.",
                       "50$ is a price:-)",
                       "Look @predictivehacks for more hacks",
                       "These are hashtags #predictivehacks #datascience #100%fun"],
                      columns=['sentences'])

corpus['tokenized'] = corpus['sentences'].apply(tokenizer)
print(corpus)
```
| sentences | tokenized |
|---|---|
| That's an example. | that's an example . |
| Don't ignore it! | don't ignore it ! |
| This is another example. | this is another example . |
| 50$ is a price:-) | 50 $ is a price : - ) |
| Look @predictivehacks for more hacks | look @ predictivehacks for more hacks |
| These are hashtags #predictivehacks #datascience #100%fun | these are hashtags # predictivehacks # datascience # 100 % fun |
Now we can feed a count vectorizer with the tokenized column we just created, using “\S+” as the token pattern so that tokens are split on whitespace.
```python
from sklearn.feature_extraction.text import CountVectorizer

# With '\S+' as token_pattern, every run of non-whitespace characters
# becomes a token, so the punctuation marks are kept
v = CountVectorizer(token_pattern=r'\S+')
x = v.fit_transform(corpus['tokenized'])
```
Printing the tokens of the count vectorizer, we can see that all the punctuation marks have been kept, as well as words like “don't” and “that's”.
```python
f = pd.DataFrame(x.toarray())
f.columns = v.get_feature_names()  # in newer scikit-learn versions: get_feature_names_out()
# Sum over the documents to get the total count of each token
f = f.sum(axis=0)
print(f)
```
! 1
# 3
$ 1
% 1
) 1
- 1
. 2
100 1
50 1
: 1
@ 1
a 1
an 1
another 1
are 1
datascience 1
don't 1
example 2
for 1
fun 1
hacks 1
hashtags 1
ignore 1
is 2
it 1
look 1
more 1
predictivehacks 2
price 1
that's 1
these 1
this 1
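For comparison, if we fed the raw sentences to a CountVectorizer with its default token pattern, the punctuation would be dropped and contractions would be split, which is exactly what we are trying to avoid. A quick sketch, reusing the `corpus` data frame from above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The default token_pattern (r"(?u)\b\w\w+\b") keeps only word tokens of
# two or more characters, so "!", "#", "$" disappear from the vocabulary
# and "don't" is cut down to "don".
v_default = CountVectorizer()
x_default = v_default.fit_transform(corpus['sentences'])
print(sorted(v_default.get_feature_names()))
```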