If you are an NLP enthusiast, you surely know the spaCy library. It's a powerful library best known for its pretrained pipelines covering tokenization, part-of-speech tagging, named-entity recognition, and word vectors. Today we will show a different use of spaCy: rule-based matching with its Matcher class.
You may ask: why not just use regular expressions?
The answer is Token Attributes.
For example, with token attributes you can:
- Distinguish *shop* the noun from *shop* the verb
- Match the lemma *begin* so that *began* also matches
With just these two examples we can see the power of Matcher over regular expressions. However, we will also show you how to use them both together to create "next-level" patterns.
For this project we will use the Alice in Wonderland book.
First of all, make sure that you have installed the spaCy library and downloaded the en_core_web_sm model as follows.
pip install -U spacy
python -m spacy download en_core_web_sm
Let’s begin by reading the data and importing the libraries.
#reading the data
data = open('11-0.txt').read()
#if you get an error try the following
#data = open('11-0.txt', encoding='cp850').read()

import spacy
# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(data)
The available token attributes that we can use are the following:
| ATTRIBUTE | VALUE TYPE | DESCRIPTION |
|---|---|---|
| ORTH | unicode | The exact verbatim text of a token. |
| TEXT (v2.1+) | unicode | The exact verbatim text of a token. |
| LOWER | unicode | The lowercase form of the token text. |
| LENGTH | int | The length of the token text. |
| IS_ALPHA, IS_ASCII, IS_DIGIT | bool | Token text consists of alphabetic characters, ASCII characters, digits. |
| IS_LOWER, IS_UPPER, IS_TITLE | bool | Token text is in lowercase, uppercase, titlecase. |
| IS_PUNCT, IS_SPACE, IS_STOP | bool | Token is punctuation, whitespace, stop word. |
| IS_SENT_START | bool | Token is start of sentence. |
| SPACY | bool | Token has a trailing space. |
| LIKE_NUM, LIKE_URL, LIKE_EMAIL | bool | Token text resembles a number, URL, email. |
| POS, TAG, DEP, LEMMA, SHAPE | unicode | The token's simple and extended part-of-speech tag, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the Annotation Specifications. |
| ENT_TYPE | unicode | The token's entity label. |
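For instance, attributes such as LIKE_NUM and LOWER need no statistical model at all, so a blank pipeline is enough. A minimal sketch with an invented sentence (using the spaCy v3 `matcher.add` signature):

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline: tokenizer only, no tagger or parser needed
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match a number-like token followed by the word "miles" in any casing
pattern = [{"LIKE_NUM": True}, {"LOWER": "miles"}]
matcher.add("distance", [pattern])

doc = nlp("She had fallen four thousand miles, or maybe 15 Miles.")
matches = matcher(doc)
print([doc[start:end].text for match_id, start, end in matches])
# → ['thousand miles', '15 Miles']
```

Note that LIKE_NUM also covers spelled-out numbers like "thousand", which no reasonable regex on digits would catch.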
Examples
Let’s say we want to find phrases starting with the word Alice followed by a verb.
#initialize the matcher
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "Alice" and a verb
#TEXT is for the exact match and VERB for a verb
pattern = [{"TEXT": "Alice"}, {"POS": "VERB"}]

# Add the pattern to the matcher
#the first argument is a unique id for the pattern (alice),
#the second is a list of patterns.
#(In spaCy v2 the signature was matcher.add("alice", None, pattern).)
matcher.add("alice", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
Matches: ['Alice think', 'Alice started', 'Alice began', 'Alice opened', 'Alice ventured', 'Alice felt', 'Alice took', 'Alice thought', 'Alice went', 'Alice went', 'Alice thought', 'Alice kept', 'Alice thought', 'Alice called', 'Alice replied', 'Alice began', 'Alice guessed', 'Alice said', 'Alice went', 'Alice knew', 'Alice heard', 'Alice thought', 'Alice heard', 'Alice noticed', 'Alice dodged', 'Alice looked', 'Alice looked', 'Alice replied', 'Alice replied', 'Alice felt', 'Alice turned', 'Alice thought', 'Alice replied', 'Alice folded', 'Alice said', 'Alice waited', 'Alice remained', 'Alice crouched', 'Alice noticed', 'Alice laughed', 'Alice went', 'Alice thought', 'Alice said', 'Alice said', 'Alice glanced', 'Alice caught', 'Alice looked', 'Alice added', 'Alice felt', 'Alice remarked', 'Alice waited', 'Alice coming', 'Alice looked', 'Alice said', 'Alice thought', 'Alice considered', 'Alice replied', 'Alice felt', 'Alice replied', 'Alice sighed', 'Alice asked', 'Alice ventured', 'Alice tried', 'Alice replied', 'Alice said', 'Alice said', 'Alice thought', 'Alice looked', 'Alice recognised', 'Alice joined', 'Alice gave', 'Alice thought', 'Alice found', 'Alice began', 'Alice waited', 'Alice put', 'Alice began', 'Alice thought', 'Alice appeared', 'Alice ventured', 'Alice whispered', 'Alice thought', 'Alice remarked', 'Alice said', 'Alice said', 'Alice looked', 'Alice heard', 'Alice thought', 'Alice asked', 'Alice ventured', 'Alice went', 'Alice began', 'Alice replied', 'Alice looked', 'Alice asked', 'Alice began', 'Alice said', 'Alice said', 'Alice panted', 'Alice whispered', 'Alice began', 'Alice felt', 'Alice guessed', 'Alice watched', 'Alice looked', 'Alice got']
Find adjectives followed by a noun.
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}]
matcher.add("id1", [pattern])
matches = matcher(doc)

# We will show the set of the first 20 matches
print("Matches:", set([doc[start:end].text for match_id, start, end in matches][:20]))
Matches: {'grand words', 'hot day', 'legged table', 'dry leaves', 'great delight', 'low hall', 'own mind', 'many miles', 'little girl', 'good opportunity', 'right word', 'long passage', 'other parts', 'low curtain', 'large rabbit', 'pink eyes', 'several things', 'golden key', 'little door'}
Match the LEMMA begin followed by an adposition.
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "begin"}, {"POS": "ADP"}]
matcher.add("id1", [pattern])
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
Matches: {'began by', 'begin at', 'begins with', 'beginning with', 'beginning to', 'begin with', 'began in'}
Quantifiers
We can use quantifiers just like in regular expressions.
| OP | DESCRIPTION |
|---|---|
| ! | Negate the pattern, by requiring it to match exactly 0 times. |
| ? | Make the pattern optional, by allowing it to match 0 or 1 times. |
| + | Require the pattern to match 1 or more times. |
| * | Allow the pattern to match zero or more times. |
For example, match the exact word Alice followed by zero or more punctuation tokens:
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": "Alice"}, {"IS_PUNCT": True, "OP": "*"}]
matcher.add("id1", [pattern])
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
Matches: {'Alice', 'Alice,', 'Alice:', 'Alice (', 'Alice!', 'Alice;', 'Alice,)', 'Alice, (', 'Alice.'}
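Similarly, the "?" operator makes a token optional. A small sketch with an invented sentence and a blank pipeline (spaCy v3 `matcher.add` signature); note that the matcher returns both the short and the long span:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "up" is optional: both "looked" and "looked up" will match
pattern = [{"LOWER": "looked"}, {"LOWER": "up", "OP": "?"}]
matcher.add("looked", [pattern])

doc = nlp("Alice looked up, and then she looked again.")
matches = matcher(doc)
print(set(doc[start:end].text for match_id, start, end in matches))
# → {'looked', 'looked up'}
```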
Use of Regular Expressions
We can create more complex patterns by using regular expressions. This unlocks a new level of rule-based matching.
Example: Match all words starting with "a" followed by a part of speech that starts with "V" (VERB, etc.)
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": {"REGEX": "^a"}}, {"POS": {"REGEX": "^V"}}]
matcher.add("id1", [pattern])
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches][:20]))
Matches: {'and make', 'are located', 'away went', 'and found', 'about stopping', 'and finding', 'and burning', 'and cried', 'and went', 'all round', 'all seemed', 'and round', 'and noticed', 'and saying', 'all made', 'all think', 'and looked', 'all locked', 'and wander'}
Add and Remove Patterns
You can add more patterns to the Matcher before running it. You only need to use a unique id for every pattern.
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": "Alice"}, {"IS_PUNCT": True, "OP": "*"}]
matcher.add("id1", [pattern])

pattern = [{"POS": "ADJ"}, {"LOWER": "rabbit"}]
matcher.add("id2", [pattern])

matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
Matches: {'Alice', 'Alice,', 'Alice:', 'Alice (', 'Alice;', 'Alice!', 'Alice,)', 'Alice, (', 'large rabbit', 'Alice.'}
To remove a pattern, call the remove function with the pattern's id.
matcher.remove('id1')
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
Matches: {'large rabbit'}
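Besides patterns, add also accepts an optional on_match callback that runs once per match, which is handy for side effects such as logging or span labelling. A minimal sketch with an invented sentence and a blank pipeline (spaCy v3 signature, where the callback is passed as a keyword argument):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

found = []

# The callback receives the matcher, the doc, the index of the current
# match, and the full list of matches
def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    found.append(doc[start:end].text)
    print("Found:", doc[start:end].text)

pattern = [{"LOWER": "white"}, {"LOWER": "rabbit"}]
matcher.add("white_rabbit", [pattern], on_match=on_match)

doc = nlp("Suddenly a White Rabbit with pink eyes ran close by her.")
matches = matcher(doc)  # prints "Found: White Rabbit"
```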
You can learn more in spaCy’s documentation and you can experiment in the Rule-Based Matcher Explorer.