Predictive Hacks

Rule-Based Matching for NLP using spaCy

If you are an NLP enthusiast, you almost certainly know the spaCy library. It’s a powerful library, best known for its pretrained pipelines (tokenization, part-of-speech tagging, named-entity recognition and word vectors). Today we will show a different use of spaCy: rule-based matching with its Matcher class.

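You may ask: why not just use regular expressions?
The answer is token attributes. The Matcher tests tokens and their linguistic annotations rather than raw characters, so a rule can, for example:

Distinguish “shop” used as a noun from “shop” used as a verb
Match every form of a lemma, so that “begin” also finds “began”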


These two examples alone show what the Matcher can do that a regular expression cannot. The two are not mutually exclusive, though: we will also show how to combine them to create a “next-level” pattern.
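To make the noun-versus-verb point concrete, here is a minimal sketch that contrasts a regular expression with a token-attribute pattern. It assumes spaCy v3. The sentence and its part-of-speech tags are assigned by hand for illustration (they are not model output), so the snippet runs without a downloaded model.

```python
import re

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# A blank English pipeline provides a vocab with lexical attributes
# (LOWER, IS_PUNCT, ...) but no trained components.
nlp = spacy.blank("en")

# Hand-assigned POS tags (an assumption for illustration, not model output).
words = ["I", "shop", "at", "the", "shop"]
pos = ["PRON", "VERB", "ADP", "DET", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# A regex sees only characters, so it finds both occurrences.
regex_hits = re.findall(r"\bshop\b", doc.text)

# The Matcher can additionally require the token to be a verb.
matcher = Matcher(nlp.vocab)
matcher.add("shop_verb", [[{"LOWER": "shop", "POS": "VERB"}]])
verb_hits = [(start, end) for _, start, end in matcher(doc)]

print(regex_hits)  # ['shop', 'shop']
print(verb_hits)   # [(1, 2)]
```

With a real pipeline such as en_core_web_sm the tags would come from the tagger instead, so the result depends on the model's predictions.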

For this project we will use the text of the book Alice in Wonderland.

First of all, make sure that you have installed the spaCy library and downloaded the en_core_web_sm model as follows.

pip install -U spacy
python -m spacy download en_core_web_sm

Let’s begin by reading the data and importing the libraries.

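# Read the book, stating the encoding explicitly and closing the file afterwards
with open('11-0.txt', encoding='utf-8') as f:
    data = f.read()

# If you get a decoding error, try a different encoding, for example:
# data = open('11-0.txt', encoding='cp850').read()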

import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(data)


The token attributes that we can use in a pattern are the following:

ATTRIBUTE | VALUE TYPE | DESCRIPTION
ORTH | unicode | The exact verbatim text of a token.
TEXT (v2.1+) | unicode | The exact verbatim text of a token.
LOWER | unicode | The lowercase form of the token text.
LENGTH | int | The length of the token text.
IS_ALPHA, IS_ASCII, IS_DIGIT | bool | Token text consists of alphabetic characters, ASCII characters, digits.
IS_LOWER, IS_UPPER, IS_TITLE | bool | Token text is in lowercase, uppercase, titlecase.
IS_PUNCT, IS_SPACE, IS_STOP | bool | Token is punctuation, whitespace, stop word.
IS_SENT_START | bool | Token is start of sentence.
SPACY | bool | Token has a trailing space.
LIKE_NUM, LIKE_URL, LIKE_EMAIL | bool | Token text resembles a number, URL, email.
POS, TAG, DEP, LEMMA, SHAPE | unicode | The token’s simple and extended part-of-speech tag, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the Annotation Specifications.
ENT_TYPE | unicode | The token’s entity label.
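Most of these attributes are also exposed as properties on each token, which is a convenient way to check what a pattern will be tested against. The sketch below uses a blank English pipeline and an invented sentence, so only the lexical attributes are populated (no POS tags, lemmas or entities).

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only: no tagger, parser or NER
doc = nlp("Alice paid 20 dollars!")

# Collect a few lexical attributes per token:
# (LOWER, SHAPE, IS_TITLE, LIKE_NUM, IS_PUNCT)
rows = {
    token.text: (
        token.lower_,
        token.shape_,
        token.is_title,
        token.like_num,
        token.is_punct,
    )
    for token in doc
}

for text, attrs in rows.items():
    print(text, attrs)
```

Here “Alice” is title-cased with shape “Xxxxx”, “20” resembles a number, and “!” is punctuation, which are exactly the values that IS_TITLE, SHAPE, LIKE_NUM and IS_PUNCT would be compared with in a pattern.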

Examples

Let’s say we want to find phrases starting with the word Alice followed by a verb.

#initialize matcher
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "Alice" followed by a verb.
# TEXT requires an exact text match; POS "VERB" requires a verb.
pattern = [{"TEXT": "Alice"}, {"POS": "VERB"}]

# Add the pattern to the matcher.
# The first argument is a unique id for the rule ("alice");
# the second is a list of patterns for that rule.
# (spaCy v2 used matcher.add("alice", None, pattern), where None was an
# optional callback; that form raises an error in spaCy v3.)
matcher.add("alice", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
Matches: ['Alice think', 'Alice started', 'Alice began', 'Alice opened', 'Alice ventured', 'Alice felt', 'Alice took', 'Alice thought', 'Alice went', 'Alice went', 'Alice thought', 'Alice kept', 'Alice thought', 'Alice called', 'Alice replied', 'Alice began', 'Alice guessed', 'Alice said', 'Alice went', 'Alice knew', 'Alice heard', 'Alice thought', 'Alice heard', 'Alice noticed', 'Alice dodged', 'Alice looked', 'Alice looked', 'Alice replied', 'Alice replied', 'Alice felt', 'Alice turned', 'Alice thought', 'Alice replied', 'Alice folded', 'Alice said', 'Alice waited', 'Alice remained', 'Alice crouched', 'Alice noticed', 'Alice laughed', 'Alice went', 'Alice thought', 'Alice said', 'Alice said', 'Alice glanced', 'Alice caught', 'Alice looked', 'Alice added', 'Alice felt', 'Alice remarked', 'Alice waited', 'Alice coming', 'Alice looked', 'Alice said', 'Alice thought', 'Alice considered', 'Alice replied', 'Alice felt', 'Alice replied', 'Alice sighed', 'Alice asked', 'Alice ventured', 'Alice tried', 'Alice replied', 'Alice said', 'Alice said', 'Alice thought', 'Alice looked', 'Alice recognised', 'Alice joined', 'Alice gave', 'Alice thought', 'Alice found', 'Alice began', 'Alice waited', 'Alice put', 'Alice began', 'Alice thought', 'Alice appeared', 'Alice ventured', 'Alice whispered', 'Alice thought', 'Alice remarked', 'Alice said', 'Alice said', 'Alice looked', 'Alice heard', 'Alice thought', 'Alice asked', 'Alice ventured', 'Alice went', 'Alice began', 'Alice replied', 'Alice looked', 'Alice asked', 'Alice began', 'Alice said', 'Alice said', 'Alice panted', 'Alice whispered', 'Alice began', 'Alice felt', 'Alice guessed', 'Alice watched', 'Alice looked', 'Alice got']
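Matcher.add can also register a callback that runs once per match. The following is a minimal sketch, assuming spaCy v3 (where the callback is passed as the on_match keyword) and a hand-tagged toy sentence rather than the book. The function name collect and the list found are hypothetical names chosen for this example.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Toy sentence with hand-assigned POS tags (illustrative, not model output).
words = ["Alice", "thought", "Alice", "was", "late"]
pos = ["PROPN", "VERB", "PROPN", "AUX", "ADJ"]
doc = Doc(nlp.vocab, words=words, pos=pos)

found = []

def collect(matcher, doc, i, matches):
    """Called for each match; `i` indexes the current match in `matches`."""
    match_id, start, end = matches[i]
    rule_name = doc.vocab.strings[match_id]  # map the hash back to the string id
    found.append((rule_name, doc[start:end].text))

matcher = Matcher(nlp.vocab)
matcher.add("alice", [[{"TEXT": "Alice"}, {"POS": "VERB"}]], on_match=collect)
matcher(doc)

print(found)  # [('alice', 'Alice thought')]
```

Note that “Alice was” is not matched here because “was” is tagged AUX, which is a separate part-of-speech tag from VERB.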

Find adjectives followed by a noun.

matcher = Matcher(nlp.vocab)

pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}]

# spaCy v3 signature: the patterns are passed as a list
matcher.add("id1", [pattern])
matches = matcher(doc)

# Show the first 20 matches (set() removes duplicates, so fewer may be printed)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches][:20]))
Matches: {'grand words', 'hot day', 'legged table', 'dry leaves', 'great delight', 'low hall', 'own mind', 'many miles', 'little girl', 'good opportunity', 'right word', 'long passage', 'other parts', 'low curtain', 'large rabbit', 'pink eyes', 'several things', 'golden key', 'little door'}

Match any form of the lemma “begin” followed by an adposition.

matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "begin"},{"POS": "ADP"}]
matcher.add("id1", [pattern])  # spaCy v3 signature: patterns passed as a list
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
Matches: {'began by', 'begin at', 'begins with', 'beginning with', 'beginning to', 'begin with', 'began in'}
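For comparison, a plain regex over the surface text misses irregular forms such as “began” unless every form is listed by hand. This is a small sketch, assuming spaCy v3 and an invented sentence whose lemmas and POS tags are assigned by hand:

```python
import re

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Invented sentence; lemmas and POS tags are hand-assigned for illustration.
words = ["She", "began", "with", "tea", "and", "begins", "at", "noon"]
lemmas = ["she", "begin", "with", "tea", "and", "begin", "at", "noon"]
pos = ["PRON", "VERB", "ADP", "NOUN", "CCONJ", "VERB", "ADP", "NOUN"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas, pos=pos)

# The regex only finds forms that literally start with "begin".
regex_hits = re.findall(r"\bbegin\w*", doc.text)

# The LEMMA attribute covers every inflected form.
matcher = Matcher(nlp.vocab)
matcher.add("begin_adp", [[{"LEMMA": "begin"}, {"POS": "ADP"}]])
ordered = sorted(matcher(doc), key=lambda m: m[1])  # sort by start token
lemma_hits = [doc[start:end].text for _, start, end in ordered]

print(regex_hits)  # ['begins']
print(lemma_hits)  # ['began with', 'begins at']
```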

Quantifiers

We can use quantifiers, just as in regular expressions, by adding an OP key to a token pattern.

OP | DESCRIPTION
! | Negate the pattern, by requiring it to match exactly 0 times.
? | Make the pattern optional, by allowing it to match 0 or 1 times.
+ | Require the pattern to match 1 or more times.
* | Allow the pattern to match zero or more times.

For example, match the exact word Alice followed by zero or more punctuation tokens:

matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": "Alice"}, {"IS_PUNCT": True,"OP":"*"}]
matcher.add("id1", [pattern])  # spaCy v3 signature: patterns passed as a list
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
Matches: {'Alice', 'Alice,', 'Alice:', 'Alice (', 'Alice!', 'Alice;', 'Alice,)', 'Alice, (', 'Alice.'}
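As the output shows, a quantified pattern reports every span that satisfies it, so “Alice”, “Alice,” and “Alice, (” can all come from the same position. If only the longest span is wanted, spaCy v3 accepts a greedy argument on Matcher.add. The sketch below assumes spaCy v3 and builds a short invented text token by token so that the tokenization is explicit.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Build the text "Alice, (she said)" token by token.
words = ["Alice", ",", "(", "she", "said", ")"]
spaces = [False, True, False, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

pattern = [{"TEXT": "Alice"}, {"IS_PUNCT": True, "OP": "*"}]

# Default behaviour: every way of satisfying "*" is reported.
matcher = Matcher(nlp.vocab)
matcher.add("alice_punct", [pattern])
all_hits = sorted({doc[start:end].text for _, start, end in matcher(doc)})

# greedy="LONGEST" keeps only the longest of the overlapping matches.
greedy_matcher = Matcher(nlp.vocab)
greedy_matcher.add("alice_punct", [pattern], greedy="LONGEST")
longest_hits = [doc[start:end].text for _, start, end in greedy_matcher(doc)]

print(all_hits)      # ['Alice', 'Alice,', 'Alice, (']
print(longest_hits)  # ['Alice, (']
```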

Use of Regular Expressions

We can create more complex patterns by using regular expressions inside a token pattern. This unlocks a new level of rule-based matching.

Example: match every token whose text starts with a lowercase “a”, followed by a token whose part-of-speech tag starts with “V” (VERB, etc.).

matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": {"REGEX": "^a"}},{"POS": {"REGEX": "^V"}}]
matcher.add("country", [pattern])  # spaCy v3 signature: patterns passed as a list
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches][:20]))
Matches: {'and make', 'are located', 'away went', 'and found', 'about stopping', 'and finding', 'and burning', 'and cried', 'and went', 'all round', 'all seemed', 'and round', 'and noticed', 'and saying', 'all made', 'all think', 'and looked', 'all locked', 'and wander'}
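One detail worth knowing is that the REGEX operator is case-sensitive, so “^a” skips a sentence-initial “And”. A possible workaround, sketched here under the assumption of spaCy v3 and a hand-tagged invented sentence, is Python's inline (?i) flag. The helper run is a hypothetical name used only in this example.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Invented sentence with hand-assigned POS tags (illustrative only).
words = ["And", "went", "away", "and", "found", "a", "key"]
pos = ["CCONJ", "VERB", "ADV", "CCONJ", "VERB", "DET", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

def run(text_regex):
    """Match a token whose text fits `text_regex`, followed by a V* POS tag."""
    matcher = Matcher(nlp.vocab)
    pattern = [{"TEXT": {"REGEX": text_regex}}, {"POS": {"REGEX": "^V"}}]
    matcher.add("a_then_v", [pattern])
    hits = sorted(matcher(doc), key=lambda m: m[1])  # sort by start token
    return [doc[start:end].text for _, start, end in hits]

case_sensitive = run("^a")        # "And" is skipped: the regex is case-sensitive
case_insensitive = run("(?i)^a")  # the inline flag also accepts "And"

print(case_sensitive)    # ['and found']
print(case_insensitive)  # ['And went', 'and found']
```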

Add and Remove Patterns

You can add more patterns to the Matcher before running it. You only need to use a unique id for every pattern.

matcher = Matcher(nlp.vocab)

pattern = [{"TEXT": "Alice"}, {"IS_PUNCT": True,"OP":"*"}]
matcher.add("id1", [pattern])  # spaCy v3 signature: patterns passed as a list

pattern = [{"POS": "ADJ"},{"LOWER":"rabbit"}]
matcher.add("id2", [pattern])

matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
Matches: {'Alice', 'Alice,', 'Alice:', 'Alice (', 'Alice;', 'Alice!', 'Alice,)', 'Alice, (', 'large rabbit', 'Alice.'}

To remove a pattern, call the remove function with the pattern’s id, just as you did with add.

matcher.remove('id1')
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
Matches: {'large rabbit'}
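When several rules are registered, the integer match_id returned with each match can be mapped back to the rule’s id through the vocab’s string store, which shows which rule produced which span. A minimal sketch, assuming spaCy v3 and a hand-tagged invented sentence; named_matches is a hypothetical helper, and the "id1" rule is simplified to a single token.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Invented sentence with hand-assigned POS tags (illustrative only).
words = ["Alice", "saw", "a", "large", "rabbit", "."]
pos = ["PROPN", "VERB", "DET", "ADJ", "NOUN", "PUNCT"]
spaces = [True, True, True, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces, pos=pos)

matcher = Matcher(nlp.vocab)
matcher.add("id1", [[{"TEXT": "Alice"}]])
matcher.add("id2", [[{"POS": "ADJ"}, {"LOWER": "rabbit"}]])

def named_matches(matcher, doc):
    """Return sorted (rule id, matched text) pairs for all matches."""
    return sorted(
        (doc.vocab.strings[match_id], doc[start:end].text)
        for match_id, start, end in matcher(doc)
    )

before = named_matches(matcher, doc)

# Check that the rule exists before removing it.
if "id1" in matcher:
    matcher.remove("id1")

after = named_matches(matcher, doc)

print(before)        # [('id1', 'Alice'), ('id2', 'large rabbit')]
print(after)         # [('id2', 'large rabbit')]
print(len(matcher))  # 1
```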

You can learn more in spaCy’s documentation and you can experiment in the Rule-Based Matcher Explorer.
