Predictive Hacks

Regular Expression for Exact Match

In this post, we give an example of how to write a regular expression (regex) that finds exact matches of a user's input. We will work with Python and pandas. Let's start with a dummy pandas data frame of documents.

import re
import pandas as pd


df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Document': ['This t-shirt costs 30$',
                                'I am 30 years old',
                                'There is a discount of $ 30',
                                'Now you can get it with 30% off',
                                'I am 30+ :)']})

df

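When the input starts or ends with a symbol such as $, % or +, a regular expression built with word boundaries (\b) does not work. The reason is that \b only matches at a position where a word character sits next to a non-word character, and a symbol followed by a space (or by the end of the string) offers no such transition. Below, we present a “hack” that returns the documents containing the exact input keyword.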
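To make the failure concrete, here is a minimal sketch that runs both approaches on one of the documents above. It uses only the standard `re` module; the variable names are chosen for illustration, and the lookaround pattern is the one explained in the next section.

```python
import re

text = "Now you can get it with 30% off"
keyword = re.escape("30%")

# \b needs a word character on one side and a non-word character on the other.
# After "%" comes a space, so both sides are non-word and the trailing \b fails.
with_boundaries = re.search(rf"\b{keyword}\b", text)

# The lookarounds only require that no word character touches the keyword.
with_lookarounds = re.search(rf"(?<!\w){keyword}(?!\w)", text)

print(with_boundaries)            # None
print(with_lookarounds.group())   # 30%
```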

Examples

The regular expression that returns the exact match is the following:

(?<!\w)input(?!\w)

where input is the user’s input. The building blocks are described in the Python re documentation as follows:

(?<!...) Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?!...) Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.

\w Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
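The three pieces can be combined in a small helper. This is only a sketch: `build_exact_pattern` is a hypothetical name, not part of `re` or pandas, and it escapes the keyword itself so that symbols are treated literally.

```python
import re

def build_exact_pattern(keyword: str) -> re.Pattern:
    """Hypothetical helper: wrap an escaped keyword in the two lookarounds."""
    return re.compile(rf"(?<!\w){re.escape(keyword)}(?!\w)", re.IGNORECASE)

pattern = build_exact_pattern("30")

# Match: the characters around "30" are spaces, symbols, or string edges.
print(bool(pattern.search("I am 30 years old")))       # True
print(bool(pattern.search("This t-shirt costs 30$")))  # True

# No match: a digit or a letter touches the keyword.
print(bool(pattern.search("It costs 130 dollars")))    # False
print(bool(pattern.search("Model 30a is sold out")))   # False
```

Note that “exact” here means “not glued to another word character”: the keyword 30 still matches inside 30$ because $ is not a word character.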

Return the documents that contain “30%”

Note that the input must be escaped with re.escape before it is placed in the pattern, because characters such as + and $ are regex metacharacters and would otherwise not be matched literally. For example:

keyword = "30%"
keyword = re.escape(keyword)
df.loc[df.Document.str.contains(rf"(?<!\w){keyword}(?!\w)", case=False, regex=True)]

Return the documents that contain “30+”

keyword = "30+"
keyword = re.escape(keyword)
df.loc[df.Document.str.contains(rf"(?<!\w){keyword}(?!\w)", case=False, regex=True)]

Return the documents that contain “$ 30”

keyword = "$ 30"
keyword = re.escape(keyword)
df.loc[df.Document.str.contains(rf"(?<!\w){keyword}(?!\w)", case=False, regex=True)]
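The three queries above repeat the same steps, so they can be wrapped in one function. The sketch below assumes the same dummy data frame as the post; `exact_match_rows` is a hypothetical helper name, not a pandas function.

```python
import re
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Document': ['This t-shirt costs 30$',
                                'I am 30 years old',
                                'There is a discount of $ 30',
                                'Now you can get it with 30% off',
                                'I am 30+ :)']})

def exact_match_rows(frame, keyword):
    """Hypothetical helper: rows whose Document contains the exact keyword."""
    # Escape the keyword so that symbols like + and $ are matched literally.
    pattern = rf"(?<!\w){re.escape(keyword)}(?!\w)"
    mask = frame['Document'].str.contains(pattern, case=False, regex=True)
    return frame.loc[mask]

# Collect the matching IDs for each keyword used in the post.
matched_ids = {kw: exact_match_rows(df, kw)['ID'].tolist()
               for kw in ["30%", "30+", "$ 30"]}
print(matched_ids)  # {'30%': [4], '30+': [5], '$ 30': [3]}
```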
