Predictive Hacks

Keep Punctuation in NLP tasks with Python


In NLP tasks we usually remove punctuation and “stopwords” from the corpus. This works well when we are dealing with large corpora and specific tasks such as document similarity, classification or clustering. However, some projects need to keep everything. For example, when we model short documents like email subject lines, we want to keep tokens such as the “%” in “50% off” and the “$” in “$50”. Likewise, for Natural Language Generation we want to keep the stopwords and also the punctuation marks, since they indicate where a sentence ends.
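To make the loss concrete, here is a minimal sketch of what a word-only tokenizer does to a subject line. The pattern is the default token_pattern of scikit-learn's CountVectorizer; the subject line itself is a made-up example, not taken from any dataset.

```python
import re

# Word-only pattern: runs of two or more word characters.
# This is the default token_pattern of scikit-learn's CountVectorizer.
WORD_ONLY = r"(?u)\b\w\w+\b"

# Hypothetical email subject line, used only for illustration
subject = "Get 50% off, now $50!"

word_tokens = re.findall(WORD_ONLY, subject.lower())
print(word_tokens)  # ['get', '50', 'off', 'now', '50']
```

The “%” and the “$” are gone, so “50% off” and “$50” become indistinguishable.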

Thus, the tokenizer replaces every punctuation mark with itself surrounded by spaces. It then splits the text on whitespace, treating each run of non-whitespace characters (“\S+”) as a token.

In the following code, we are replacing the punctuation marks as described above.

Let’s assume we have the sentence:

That's an example. Don't ignore it!
import re
import pandas as pd

text='''That's an example. Don't ignore it!'''

#list of all punctuation marks except the "'" (because we want
#to keep words like that's, don't etc. as one token)
r='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

#escaping them so they are treated as literal characters in a regex
to_replace=[re.escape(i) for i in r]
#adding a space before and after them
replace_with=[' '+i+' ' for i in r]

#We're converting the sentence to a dataframe so we can easily replace all
#punctuation marks with the function "replace" of pandas
#one way to do this: replace each escaped mark with its padded version
pd.DataFrame([text])[0].replace(to_replace, replace_with, regex=True)[0]
That's an example .  Don't ignore it !
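The same pad-then-split idea can also be sketched with the standard library alone, which may be handy for a single string. This is only an illustration, not the approach used in the rest of the post: the helper names pad_punctuation and split_tokens are hypothetical, and string.punctuation without the apostrophe is assumed to stand in for the list r above.

```python
import re
import string

# Every ASCII punctuation mark except the apostrophe
PUNCT = string.punctuation.replace("'", "")
# One character class that matches any of those marks
PUNCT_RE = re.compile("([" + re.escape(PUNCT) + "])")

def pad_punctuation(text):
    """Surround each punctuation mark with spaces."""
    return PUNCT_RE.sub(r" \1 ", text)

def split_tokens(text):
    """Pad the punctuation, then take every run of non-whitespace as a token."""
    return re.findall(r"\S+", pad_punctuation(text))

tokens = split_tokens("That's an example. Don't ignore it!")
print(tokens)
# ["That's", 'an', 'example', '.', "Don't", 'ignore', 'it', '!']
```

Using a single character class does the replacement in one pass instead of one pass per punctuation mark.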

Tokenizer Function

We can create a function using the above so we can easily apply it on a data frame.

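def tokenizer(x):
    #x is a pandas Series of strings
    r='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
    to_replace=[re.escape(i) for i in r]
    replace_with=[' '+i+' ' for i in r]
    #lowercase, then pad every punctuation mark with spaces
    return x.str.lower().replace(to_replace, replace_with, regex=True)

corpus=pd.DataFrame(["That's an example.",
"Don't ignore it!",
"This is another example.",
"50$ is a price:-)",
"Look @predictivehacks for more hacks",
"These are hashtags #predictivehacks #datascience #100%fun"],columns=['sentences'])

corpus['tokenized']=tokenizer(corpus['sentences'])
corpus
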
sentences                                                  tokenized
That's an example.                                         that's an example .
Don't ignore it!                                           don't ignore it !
This is another example.                                   this is another example .
50$ is a price:-)                                          50 $  is a price :  -  )
Look @predictivehacks for more hacks                       look  @ predictivehacks for more hacks
These are hashtags #predictivehacks #datascience #100%fun  these are hashtags  # predictivehacks  # datascience  # 100 % fun
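Leaving the apostrophe out of the punctuation list is what keeps “don't” and “that's” in one piece. The following sketch compares the two choices on one sentence; the helper split_with is a hypothetical name, and the standard-library string.punctuation is assumed as the full list of marks.

```python
import re
import string

def split_with(punct, text):
    """Pad every character in `punct` with spaces, then split on whitespace."""
    pattern = "([" + re.escape(punct) + "])"
    return re.findall(r"\S+", re.sub(pattern, r" \1 ", text.lower()))

# All ASCII punctuation marks except the apostrophe
no_apostrophe = string.punctuation.replace("'", "")

kept = split_with(no_apostrophe, "Don't ignore it!")
broken = split_with(string.punctuation, "Don't ignore it!")

print(kept)    # ["don't", 'ignore', 'it', '!']
print(broken)  # ['don', "'", 't', 'ignore', 'it', '!']
```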

Now we can feed a count vectorizer with the tokenized column we just created, using runs of non-whitespace characters (“\S+”) as the token pattern.

from sklearn.feature_extraction.text import CountVectorizer

#We are using '\S+' as the token pattern, so every run of
#non-whitespace characters becomes a token

v = CountVectorizer(token_pattern=r'\S+')
x = v.fit_transform(corpus['tokenized'])

Printing the token counts of the count vectorizer, we can see that all the punctuation marks are kept, along with words like “don’t” and “that’s”.

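#label each column with its token (get_feature_names_out needs scikit-learn 1.0+)
f = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
#total count of every token across the corpus
f.sum(axis=0)
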
!                  1
#                  3
$                  1
%                  1
)                  1
-                  1
.                  2
100                1
50                 1
:                  1
@                  1
a                  1
an                 1
another            1
are                1
datascience        1
don't              1
example            2
for                1
fun                1
hacks              1
hashtags           1
ignore             1
is                 2
it                 1
look               1
more               1
predictivehacks    2
price              1
that's             1
these              1
this               1
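The counts above can be cross-checked without scikit-learn. The sketch below is an assumption-laden stand-in for the vectorizer, not its actual implementation: it lowercases each sentence, pads the punctuation, splits on “\S+” and tallies the tokens with collections.Counter.

```python
import re
import string
from collections import Counter

# Every ASCII punctuation mark except the apostrophe, as one character class
PUNCT_RE = re.compile("([" + re.escape(string.punctuation.replace("'", "")) + "])")

sentences = [
    "That's an example.",
    "Don't ignore it!",
    "This is another example.",
    "50$ is a price:-)",
    "Look @predictivehacks for more hacks",
    "These are hashtags #predictivehacks #datascience #100%fun",
]

counts = Counter()
for sentence in sentences:
    # Lowercase and pad punctuation, then count each non-whitespace run
    padded = PUNCT_RE.sub(r" \1 ", sentence.lower())
    counts.update(re.findall(r"\S+", padded))

for token in sorted(counts):
    print(f"{token:<18} {counts[token]}")
```

The printout should list the same 32 tokens as the vectorizer, for example “#” three times and “predictivehacks” twice.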
