
# Keep Punctuation in NLP tasks with Python

Usually in NLP tasks we remove punctuation and “stopwords” from the corpus. This is valid when we are dealing with large corpora and we want to do specific tasks like document similarity, classification, clustering, etc. However, in some projects there is a need to keep everything. For example, when we want to model small documents like email subject lines, we want to keep tokens like “50% off” and “$50”. Also, when we want to do Natural Language Generation, we want to keep the stopwords, and we also want the punctuation marks, since they indicate the end of a sentence. Thus, the tokenizer can replace every punctuation mark with itself surrounded by spaces, and then split the text on whitespace (token pattern “\S+”). In the following code, we replace the punctuation marks as described above. Let’s assume we have the sentence: That’s an example. Don’t ignore it!

import re
import pandas as pd

text='''That's an example. Don't ignore it!'''

#list of all punctuation marks except the "'" (because we want
#to keep words like that's, don't etc. as single tokens)
r='!"#$%&()*+,-./:;<=>?@[\\]^_{|}~'

#adding an escape character to them
to_replace=[re.escape(i) for i in r]

#adding a space between and after them
replace_with=[' '+i+' ' for i in r]

# We convert the sentence to a dataframe so we can easily replace all
#punctuation marks with the pandas "replace" function
x=pd.DataFrame([text])[0].replace(to_replace,replace_with,regex=True)[0]

print(x)

That's an example .  Don't ignore it !
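
As a quick aside, the same padding can be done without pandas, using re.sub with a character class built from the same punctuation string. This is a minimal sketch (reusing the text and r variables defined above), not the approach used in the rest of the post:

#pad every punctuation mark (except "'") with spaces in a single pass
padded = re.sub('[' + re.escape(r) + ']', lambda m: ' ' + m.group(0) + ' ', text)
print(padded)
#That's an example .  Don't ignore it !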

### Tokenizer Function

We can wrap the above in a function so we can easily apply it to a data frame. Note that the function also lowercases the text before padding the punctuation marks.

def tokenizer(x):
    r='!"#$%&()*+,-./:;<=>?@[\\]^_{|}~'
    to_replace=[re.escape(i) for i in r]
    replace_with=[' '+i+' ' for i in r]
    x=x.lower()
    x=pd.DataFrame([x])[0].replace(to_replace,replace_with,regex=True)[0]
    return x

corpus=pd.DataFrame(["That's an example.",
                     "Don't ignore it!",
                     "This is another example.",
                     "50$ is a price:-)",
                     "Look @predictivehacks for more hacks",
                     "These are hashtags #predictivehacks #datascience #100%fun"],
                    columns=['sentences'])

corpus['tokenized']=corpus['sentences'].apply(tokenizer)

print(corpus)
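
For instance, applying the function to the fourth sentence and splitting on whitespace shows that the price and the emoticon become separate tokens:

print(tokenizer("50$ is a price:-)").split())
#['50', '$', 'is', 'a', 'price', ':', '-', ')']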


Now we can feed a count vectorizer with the tokenized column we just created, using “\S+” (one or more non-whitespace characters) as the token pattern.

from sklearn.feature_extraction.text import CountVectorizer

#We are using r'\S+' as the token pattern so each sequence of
#non-whitespace characters becomes a token

v = CountVectorizer(token_pattern=r'\S+')
x = v.fit_transform(corpus['tokenized'])
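
As a side note, the two steps can also be combined by passing a custom tokenizer to the vectorizer. This is a minimal sketch (reusing the tokenizer function above; the names v2 and x2 are ours), not part of the original pipeline:

#a sketch: let CountVectorizer call our tokenizer directly on the raw sentences;
#token_pattern=None tells sklearn the pattern is intentionally unused
v2 = CountVectorizer(tokenizer=lambda s: tokenizer(s).split(), token_pattern=None)
x2 = v2.fit_transform(corpus['sentences'])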


Printing the tokens of the count vectorizer, we can see that we have kept all the punctuation marks as well as words like “don’t”, “that’s”, etc.

f = pd.DataFrame(x.toarray())
f.columns=v.get_feature_names_out()

f=f.sum(axis=0)

print(f)

!                  1
#                  3
$                  1
%                  1
)                  1
-                  1
.                  2
100                1
50                 1
:                  1
@                  1
a                  1
an                 1
another            1
are                1
datascience        1
don't              1
example            2
for                1
fun                1
hacks              1
hashtags           1
ignore             1
is                 2
it                 1
look               1
more               1
predictivehacks    2
price              1
that's             1
these              1
this               1
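
If we later want to isolate only the punctuation tokens from these counts, a small filter over the f Series could look like the following sketch (the regex [^\w\s]+ matches tokens made purely of punctuation; punct_counts is our own name):

#keep only the tokens that consist purely of punctuation marks
punct_counts = f[f.index.str.fullmatch(r'[^\w\s]+')]
print(punct_counts)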
