If you are an NLP enthusiast, you surely know the spaCy library. It's a powerful library best known for its pretrained pipelines covering tokenization, part-of-speech tagging, named-entity recognition, and word vectors. Today we will show a different use of spaCy: rule-based matching with its Matcher class.
You may ask: why not just use regular expressions?
The answer is Token Attributes.
For example, with token attributes you can:
- Distinguish *shop* the noun from *shop* the verb
- Match the lemma *begin* so that *began* also matches
With just these two examples we can see the power of Matcher over regular expressions. However, we will also show you how to use them both together to create "next-level" patterns.
For this project we will use the Alice in Wonderland book.
First of all, make sure that you have installed the spaCy library and downloaded the en_core_web_sm model as follows.
pip install -U spacy
python -m spacy download en_core_web_sm
Let’s begin by reading the data and importing the libraries.
#reading the data
data = open('11-0.txt').read()
#if you get an error try the following
#data = open('11-0.txt', encoding='cp850').read()

import spacy
# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(data)
The available token attributes that we can use are the following:
| ATTRIBUTE | VALUE TYPE | DESCRIPTION |
|---|---|---|
| ORTH | unicode | The exact verbatim text of a token. |
| TEXT (v2.1+) | unicode | The exact verbatim text of a token. |
| LOWER | unicode | The lowercase form of the token text. |
| LENGTH | int | The length of the token text. |
| IS_ALPHA, IS_ASCII, IS_DIGIT | bool | Token text consists of alphabetic characters, ASCII characters, digits. |
| IS_LOWER, IS_UPPER, IS_TITLE | bool | Token text is in lowercase, uppercase, titlecase. |
| IS_PUNCT, IS_SPACE, IS_STOP | bool | Token is punctuation, whitespace, stop word. |
| IS_SENT_START | bool | Token is start of sentence. |
| SPACY | bool | Token has a trailing space. |
| LIKE_NUM, LIKE_URL, LIKE_EMAIL | bool | Token text resembles a number, URL, email. |
| POS, TAG, DEP, LEMMA, SHAPE | unicode | The token's simple and extended part-of-speech tag, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the Annotation Specifications. |
| ENT_TYPE | unicode | The token's entity label. |
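For instance, attributes such as LIKE_NUM and LOWER need no statistical model at all, so a blank pipeline is enough. A minimal sketch with an invented sentence (using the spaCy v3 `matcher.add` signature):

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline: tokenizer only, no tagger or parser needed
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match a number-like token followed by the word "miles" in any casing
pattern = [{"LIKE_NUM": True}, {"LOWER": "miles"}]
matcher.add("distance", [pattern])

doc = nlp("She had fallen four thousand miles, or maybe 15 Miles.")
matches = matcher(doc)
print([doc[start:end].text for match_id, start, end in matches])
# → ['thousand miles', '15 Miles']
```

Note that LIKE_NUM also covers spelled-out numbers like "thousand", which no reasonable regex on digits would catch.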
Examples
Let’s say we want to find phrases starting with the word Alice followed by a verb.
#initialize the matcher
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "Alice" and a verb
#TEXT is for the exact match and VERB for a verb
pattern = [{"TEXT": "Alice"}, {"POS": "VERB"}]

# Add the pattern to the matcher
#the first argument is a unique id for the pattern (alice),
#the second is a list of patterns.
#(In spaCy v2 the signature was matcher.add("alice", None, pattern).)
matcher.add("alice", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
Matches: ['Alice think', 'Alice started', 'Alice began', 'Alice opened', 'Alice ventured', 'Alice felt', 'Alice took', 'Alice thought', 'Alice went', 'Alice went', 'Alice thought', 'Alice kept', 'Alice thought', 'Alice called', 'Alice replied', 'Alice began', 'Alice guessed', 'Alice said', 'Alice went', 'Alice knew', 'Alice heard', 'Alice thought', 'Alice heard', 'Alice noticed', 'Alice dodged', 'Alice looked', 'Alice looked', 'Alice replied', 'Alice replied', 'Alice felt', 'Alice turned', 'Alice thought', 'Alice replied', 'Alice folded', 'Alice said', 'Alice waited', 'Alice remained', 'Alice crouched', 'Alice noticed', 'Alice laughed', 'Alice went', 'Alice thought', 'Alice said', 'Alice said', 'Alice glanced', 'Alice caught', 'Alice looked', 'Alice added', 'Alice felt', 'Alice remarked', 'Alice waited', 'Alice coming', 'Alice looked', 'Alice said', 'Alice thought', 'Alice considered', 'Alice replied', 'Alice felt', 'Alice replied', 'Alice sighed', 'Alice asked', 'Alice ventured', 'Alice tried', 'Alice replied', 'Alice said', 'Alice said', 'Alice thought', 'Alice looked', 'Alice recognised', 'Alice joined', 'Alice gave', 'Alice thought', 'Alice found', 'Alice began', 'Alice waited', 'Alice put', 'Alice began', 'Alice thought', 'Alice appeared', 'Alice ventured', 'Alice whispered', 'Alice thought', 'Alice remarked', 'Alice said', 'Alice said', 'Alice looked', 'Alice heard', 'Alice thought', 'Alice asked', 'Alice ventured', 'Alice went', 'Alice began', 'Alice replied', 'Alice looked', 'Alice asked', 'Alice began', 'Alice said', 'Alice said', 'Alice panted', 'Alice whispered', 'Alice began', 'Alice felt', 'Alice guessed', 'Alice watched', 'Alice looked', 'Alice got']
Find adjectives followed by a noun.
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}]
matcher.add("id1", [pattern])
matches = matcher(doc)

# We will show the set of the first 20 matches
print("Matches:", set([doc[start:end].text for match_id, start, end in matches][:20]))
Matches: {'grand words', 'hot day', 'legged table', 'dry leaves', 'great delight', 'low hall', 'own mind', 'many miles', 'little girl', 'good opportunity', 'right word', 'long passage', 'other parts', 'low curtain', 'large rabbit', 'pink eyes', 'several things', 'golden key', 'little door'}
Match the LEMMA begin followed by an adposition.
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "begin"}, {"POS": "ADP"}]
matcher.add("id1", [pattern])
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
Matches: {'began by', 'begin at', 'begins with', 'beginning with', 'beginning to', 'begin with', 'began in'}
Quantifiers
We can use quantifiers just like in regular expressions.
| OP | DESCRIPTION |
|---|---|
| ! | Negate the pattern, by requiring it to match exactly 0 times. |
| ? | Make the pattern optional, by allowing it to match 0 or 1 times. |
| + | Require the pattern to match 1 or more times. |
| * | Allow the pattern to match zero or more times. |
For example, match the exact word Alice followed by zero or more punctuation tokens:
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": "Alice"}, {"IS_PUNCT": True, "OP": "*"}]
matcher.add("id1", [pattern])
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
Matches: {'Alice', 'Alice,', 'Alice:', 'Alice (', 'Alice!', 'Alice;', 'Alice,)', 'Alice, (', 'Alice.'}
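Similarly, the "?" operator makes a token optional. A small sketch with an invented sentence and a blank pipeline (spaCy v3 `matcher.add` signature); note that the matcher returns both the short and the long span:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "up" is optional: both "looked" and "looked up" will match
pattern = [{"LOWER": "looked"}, {"LOWER": "up", "OP": "?"}]
matcher.add("looked", [pattern])

doc = nlp("Alice looked up, and then she looked again.")
matches = matcher(doc)
print(set(doc[start:end].text for match_id, start, end in matches))
# → {'looked', 'looked up'}
```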
Use of Regular Expressions
We can create more complex patterns by using regular expressions. This unlocks a new level of rule-based matching.
Example: Match all words starting with "a" followed by a part of speech that starts with "V" (VERB, etc.)
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": {"REGEX": "^a"}}, {"POS": {"REGEX": "^V"}}]
matcher.add("id1", [pattern])
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches][:20]))
Matches: {'and make', 'are located', 'away went', 'and found', 'about stopping', 'and finding', 'and burning', 'and cried', 'and went', 'all round', 'all seemed', 'and round', 'and noticed', 'and saying', 'all made', 'all think', 'and looked', 'all locked', 'and wander'}
Add and Remove Patterns
You can add more patterns to the Matcher before running it. You only need to use a unique id for every pattern.
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": "Alice"}, {"IS_PUNCT": True, "OP": "*"}]
matcher.add("id1", [pattern])

pattern = [{"POS": "ADJ"}, {"LOWER": "rabbit"}]
matcher.add("id2", [pattern])

matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
Matches: {'Alice', 'Alice,', 'Alice:', 'Alice (', 'Alice;', 'Alice!', 'Alice,)', 'Alice, (', 'large rabbit', 'Alice.'}
To remove a pattern, call the remove function with the pattern's id.
matcher.remove('id1')
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))
Matches: {'large rabbit'}
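Besides patterns, add also accepts an optional on_match callback that runs once per match, which is handy for side effects such as logging or span labelling. A minimal sketch with an invented sentence and a blank pipeline (spaCy v3 signature, where the callback is passed as a keyword argument):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

found = []

# The callback receives the matcher, the doc, the index of the current
# match, and the full list of matches
def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    found.append(doc[start:end].text)
    print("Found:", doc[start:end].text)

pattern = [{"LOWER": "white"}, {"LOWER": "rabbit"}]
matcher.add("white_rabbit", [pattern], on_match=on_match)

doc = nlp("Suddenly a White Rabbit with pink eyes ran close by her.")
matches = matcher(doc)  # prints "Found: White Rabbit"
```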
You can learn more in spaCy’s documentation and you can experiment in the Rule-Based Matcher Explorer.