Predictive Hacks

Redact Named Entities with SpaCy


When we work on NLP projects, we need to do text mining and data cleansing. A common task is to detect named entities, and sometimes it makes sense to replace the original text with the corresponding entity labels. This is a form of feature engineering. Imagine that you are modeling a collection of documents using TF-IDF on n-grams. It is often better to replace each entity with its label, because n-grams that differ only in a specific date or amount then collapse into the same feature. For example, all dates are replaced with DATE, prices with MONEY, and so on.
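
To make the n-gram argument concrete, here is a minimal sketch that uses only the standard library. The two invoice sentences and their redacted versions are hypothetical and hand-written for illustration (they are not SpaCy output), and bigrams and shared_bigrams are helper names introduced here.

```python
def bigrams(text):
    """Return the set of adjacent word pairs in a whitespace-tokenized text."""
    tokens = text.split()
    return set(zip(tokens, tokens[1:]))


def shared_bigrams(docs):
    """Return the bigrams that occur in every document."""
    return set.intersection(*(bigrams(d) for d in docs))


# Hypothetical documents that differ only in the date and the amount
raw = ["Invoice due on 3 March for $120",
       "Invoice due on 9 July for $45"]

# Hand-written redacted versions (assumed labels, not model output)
redacted = ["Invoice due on DATE for MONEY",
            "Invoice due on DATE for MONEY"]

print(len(shared_bigrams(raw)))       # 2
print(len(shared_bigrams(redacted)))  # 5
```

Before redaction the two documents share only two bigrams; after redaction every bigram is shared, which is the effect we want a TF-IDF model to see.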

We will work with the SpaCy library. Let’s provide a practical example:

import pandas as pd
import spacy
pd.set_option("display.max_colwidth", 300)

nlp = spacy.load("en_core_web_sm")

# My sample data

df = pd.DataFrame({'Documents' : ["Apple is looking at buying U.K. startup for $1 billion",
                                  "San Francisco considers banning sidewalk delivery robots",
                                  "Amazon is hiring a new vice president of global policy",
                                  "George Pipis works for Predictive Hacks",
                                  "Today is Wednesday, 18:00",
                                  "Dear George, can you please respond to my email?"]})

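Now we will create a function called replace_ner that replaces each detected entity in the original text with its entity label. The trick is to iterate over the entities with the reversed function, starting from the last detected entity. Each replacement changes the length of the text, so working from the end keeps the character offsets (start_char and end_char) of the earlier entities valid, whereas working from the start would shift them.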

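def replace_ner(mytxt):
    """Return mytxt with every detected entity replaced by its label."""
    clean_text = mytxt
    doc = nlp(mytxt)
    # Go from the last entity to the first so that earlier offsets stay valid
    for ent in reversed(doc.ents):
        clean_text = clean_text[:ent.start_char] + ent.label_ + clean_text[ent.end_char:]
    return clean_text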

df['Redacted'] = df['Documents'].apply(replace_ner)
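
To see why the order of replacement matters, here is a small sketch that does not need SpaCy at all. The spans are hand-written (start, end, label) tuples that stand in for doc.ents; the labels are assumptions chosen for illustration, and replace_spans is a helper name introduced here.

```python
def replace_spans(text, spans, back_to_front=True):
    """Replace each (start, end, label) span in text with its label."""
    ordered = reversed(spans) if back_to_front else spans
    for start, end, label in ordered:
        text = text[:start] + label + text[end:]
    return text


text = "Apple is looking at buying U.K. startup for $1 billion"

# Assumed entities, with offsets computed against the ORIGINAL text
entities = [("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY")]
spans = [(text.index(s), text.index(s) + len(s), label) for s, label in entities]

# Back to front: earlier offsets are untouched by later replacements
print(replace_spans(text, spans, back_to_front=True))
# ORG is looking at buying GPE startup for MONEY

# Front to back: "Apple" -> "ORG" shortens the text by two characters,
# so every following offset points at the wrong place
print(replace_spans(text, spans, back_to_front=False))
```

The second call prints a garbled string, because the offsets were computed on the original text and no longer match once the first replacement has changed its length.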


In the resulting Redacted column, SpaCy detected some entities but missed others: it failed to tag “George” in “Dear George” as a PERSON and “Predictive Hacks” as an ORG. However, it correctly detected $1 billion, Apple, Amazon, San Francisco, Today, Wednesday, 18:00, and George Pipis.
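
One possible way to catch entities that the statistical model misses is SpaCy's rule-based entity_ruler component (SpaCy v3 API). The sketch below is not part of the original post. It assumes a blank English pipeline so that it runs without downloading a model, and the single ORG pattern, nlp_rules and replace_ner_rules are illustrative names introduced here.

```python
import spacy

# Assumption: a blank pipeline keeps the sketch self-contained.
# With the model used in this post you could instead add the ruler to it:
#   nlp = spacy.load("en_core_web_sm")
#   ruler = nlp.add_pipe("entity_ruler", before="ner")
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# Illustrative pattern: tag the exact phrase "Predictive Hacks" as ORG
ruler.add_patterns([{"label": "ORG", "pattern": "Predictive Hacks"}])


def replace_ner_rules(mytxt):
    """Same back-to-front replacement as replace_ner, using nlp_rules."""
    clean_text = mytxt
    doc = nlp_rules(mytxt)
    for ent in reversed(doc.ents):
        clean_text = clean_text[:ent.start_char] + ent.label_ + clean_text[ent.end_char:]
    return clean_text


print(replace_ner_rules("George Pipis works for Predictive Hacks"))
```

Because the blank pipeline has no statistical NER, only the rule fires here and "George Pipis" is left as-is. Added before the ner component of a trained pipeline, the same rule would complement the model's own predictions.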
