In NLP projects, we often remove punctuation from the text. However, we should be careful when doing so, because in some tasks, such as sentiment analysis, punctuation can carry important information. Let’s look at some examples:
import re
import string

text = "This is a text!!! It has (parenthesis), square and curly brackets [[{{}}]] and hashtags #."
text.translate(str.maketrans('', '', string.punctuation))
'This is a text It has parenthesis square and curly brackets  and hashtags '
Another way to do that is with a regular expression:
re.compile('[%s]' % re.escape(string.punctuation)).sub('', text)
'This is a text It has parenthesis square and curly brackets  and hashtags '
Awesome, we managed to remove all punctuation. But what if we want to keep some of it, like the hashtag?
Remove Some Punctuation and Keep the Rest
Let’s see how we can keep some punctuation. First, let’s look at the regular expression that matches all punctuation.
('[%s]' % re.escape(string.punctuation))
'[!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~]'
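The above is the regular expression, a character class that lists every punctuation mark. To keep hashtags, we drop the hashtag from that class, so that every punctuation mark except the hashtag is removed.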
re.compile('[!"\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~]').sub('', text)
'This is a text It has parenthesis square and curly brackets  and hashtags #'
Voilà! We managed to keep hashtags!
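Editing the regular expression by hand is easy to get wrong, so it may help to build it programmatically. Below is a minimal sketch of that idea: the helper name remove_punctuation and its keep parameter are our own illustrative choices, not part of any library. It filters string.punctuation down to the characters we want to strip and lets re.escape handle the escaping.

```python
import re
import string


def remove_punctuation(text, keep=""):
    """Strip ASCII punctuation from text, except the characters in keep.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    # All punctuation characters minus the ones we want to preserve.
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    if not to_remove:
        # Nothing left to strip; an empty character class "[]" would not compile.
        return text
    # re.escape makes each character safe to place inside a character class.
    pattern = re.compile("[%s]" % re.escape(to_remove))
    return pattern.sub("", text)


text = "This is a text!!! It has (parenthesis), square and curly brackets [[{{}}]] and hashtags #."
print(remove_punctuation(text, keep="#"))
```

Passing keep="#" reproduces the hashtag example above, and passing several characters (for example keep="#@") should preserve all of them. Calling it with the default keep="" strips all punctuation, like the first two examples.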