When we work on NLP projects, text mining and data cleansing are a large part of the job. A common task is to detect named entities, and sometimes it makes sense to replace the original text with the corresponding entity labels. This is a form of feature engineering. Imagine that you try to model a list of documents using TF-IDF on n-grams. It is often better to replace the original text with its corresponding entities, so that documents mentioning different dates or amounts still share the same n-grams. For example, all the dates will be replaced with DATE, the prices and other monetary amounts with MONEY, and so on.
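To make the n-gram argument concrete, here is a minimal sketch in plain Python. The two invoice sentences and the `bigrams` helper are hypothetical and written only for illustration; the redacted versions are typed by hand rather than produced by a model.

```python
def bigrams(text):
    """Return the set of word bigrams in a whitespace-tokenized text."""
    tokens = text.split()
    return set(zip(tokens, tokens[1:]))

# Hypothetical documents that differ only in the date they mention
raw_a = "Invoice due on 5 March 2021"
raw_b = "Invoice due on 17 June 2020"

# The same documents after the dates were replaced by hand with DATE
red_a = "Invoice due on DATE"
red_b = "Invoice due on DATE"

shared_raw = bigrams(raw_a) & bigrams(raw_b)
shared_redacted = bigrams(red_a) & bigrams(red_b)

print("Shared bigrams before replacement:", sorted(shared_raw))
print("Shared bigrams after replacement: ", sorted(shared_redacted))
```

Before the replacement, the bigrams that touch the date are unique to each document. After it, both documents share the bigram ("on", "DATE"), which is the kind of signal an n-gram TF-IDF model can pick up.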
We will work with the SpaCy library. Let’s provide a practical example:
import pandas as pd
import spacy

pd.set_option("max_colwidth", 300)
nlp = spacy.load("en_core_web_sm")

# My sample data
df = pd.DataFrame({'Documents': [
    "Apple is looking at buying U.K. startup for $1 billion",
    "San Francisco considers banning sidewalk delivery robots",
    "Amazon is hiring a new vice president of global policy",
    "George Pipis works for Predictive Hacks",
    "Today is Wednesday, 18:00",
    "Dear George, can you please respond to my email?"
]})
Now, we will create a function called replace_ner which replaces each detected entity in the original text with its entity label. The trick is to use the reversed function so that we start replacing from the last detected entity. Each replacement changes the length of the string, so working from the end keeps the character offsets of the earlier entities valid; otherwise they would point to the wrong positions.
def replace_ner(mytxt):
    clean_text = mytxt
    doc = nlp(mytxt)
    # Go from the last entity to the first so that the character
    # offsets of the entities still to be replaced stay valid
    for ent in reversed(doc.ents):
        clean_text = clean_text[:ent.start_char] + ent.label_ + clean_text[ent.end_char:]
    return clean_text

df['Redacted'] = df['Documents'].apply(lambda x: replace_ner(x))
df
As we can see, SpaCy was able to detect some entities, but it failed to detect the “PERSON” in “Dear George” and the “ORG” in “Predictive Hacks”. However, it correctly detected $1 billion, Apple, Amazon, San Francisco, Today, Wednesday, 18:00, and George Pipis.
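To see why the replacement order matters, here is a small standalone sketch that does not need SpaCy. The `replace_spans` helper and the hand-written `spans` list are assumptions made for illustration; the tuples mimic the start_char, end_char and label_ values of an entity, and the labels are typed by hand rather than taken from a model.

```python
def replace_spans(text, spans, reverse=True):
    """Replace each (start, end, label) span in text with its label.

    The offsets refer to the ORIGINAL text, in the same way that
    ent.start_char and ent.end_char do.
    """
    # reverse=True processes the last span first, reverse=False the first one first
    for start, end, label in sorted(spans, reverse=reverse):
        text = text[:start] + label + text[end:]
    return text

text = "Apple is looking at buying U.K. startup for $1 billion"

# Hand-written spans for illustration: "Apple", "U.K." and "$1 billion"
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

last_to_first = replace_spans(text, spans, reverse=True)
first_to_last = replace_spans(text, spans, reverse=False)

print(last_to_first)
print(first_to_last)
```

Going from the last span to the first, every replacement only changes text to the right of the spans that are still waiting, so their offsets remain correct. Going from first to last, replacing "Apple" with the shorter "ORG" shifts the rest of the string, and the later offsets cut into the wrong characters.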