Predictive Hacks

How to Redact NER using SpaCy and Python UDF in Snowflake

python snowflake

In a previous tutorial, we showed you how to get started with Python UDFs in Python. In this tutorial, we will show you how to use SpaCy with Python UDF in Snowflake.

Download the SpaCy Model

The first that we need to do is to download the “en_core_web_sm” model and then upload it to the stage. You can download the SpaCy library using conda or pip and then you can search for the “en_core_web_sm” by running a find command like:

find . -name "en_core*" -print
 

In my case, the model is under the path:

./Anaconda3/Lib/site-packages/en_core_web_sm

Then, we can go to this path and zip the folder.

spacy ner

Finally, should upload the “en_core_web_sm.zip” file to the stage, as we will see later.

Create the Python Function that Redacts NERs

Then we need to create a Python function that redacts name entities with SpaCy. The function is the following:

import spacy
nlp = spacy.load("en_core_web_sm")

def replace_ner(mytxt):
    clean_text = mytxt
    doc = nlp(mytxt)
    for ent in reversed(doc.ents):
        clean_text = clean_text[:ent.start_char] +ent.label_ + clean_text[ent.end_char:]
    return clean_text
 

Create the UDF in Snowflake

Now we are ready to build the UDF in Snowflake. Note that the code below uses file locks to ensure that the extraction is synchronized across Python worker processes; this way, contents are unzipped only once. Also, keep in mind that the version of our model was the en_core_web_sm-3.0.0 where you can find it within the path of the folder (i.e. en_core_web_sm/en_core_web_sm-3.0.0').

Let’s build the function called redact_ner.

create or replace function redact_ner(str string)
   returns string
   language python
   runtime_version = 3.8
   handler = 'replace_ner'
   packages = ('spacy')
   imports = ('@MY_S3_TEST_SNOWFLAKE_STAGE/en_core_web_sm.zip')
   as
$$
import fcntl
import os
import spacy
import sys
import threading
import zipfile

# File lock class for synchronizing write access to /tmp
class FileLock:
   def __enter__(self):
      self._lock = threading.Lock()
      self._lock.acquire()
      self._fd = open('/tmp/lockfile.LOCK', 'w+')
      fcntl.lockf(self._fd, fcntl.LOCK_EX)

   def __exit__(self, type, value, traceback):
      self._fd.close()
      self._lock.release()

# Get the location of the import directory. Snowflake sets the import
# directory location so code can retrieve the location via sys._xoptions.
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

# Get the path to the ZIP file and set the location to extract to.
zip_file_path = import_dir + "en_core_web_sm.zip"
extracted = '/tmp/en_core_web_sm'

# Extract the contents of the ZIP. This is done under the file lock
# to ensure that only one worker process unzips the contents.
with FileLock():
   if not os.path.isdir(extracted + '/en_core_web_sm/en_core_web_sm-3.0.0'):
      with zipfile.ZipFile(zip_file_path, 'r') as myzip:
         myzip.extractall(extracted)

# Load the model from the extracted file.
nlp = spacy.load(extracted + "/en_core_web_sm/en_core_web_sm-3.0.0")

def replace_ner(mytxt):
    clean_text = mytxt
    doc = nlp(mytxt)
    for ent in reversed(doc.ents):
        clean_text = clean_text[:ent.start_char] +ent.label_ + clean_text[ent.end_char:]
    return clean_text
$$;
 

Apply the Function

We can apply the function to a dummy dataset. Let’s create the following temporary table called MY_DOCUMENTS such as:

CREATE OR REPLACE TEMP TABLE MY_DOCUMENTS
(DOCUMENT STRING)

INSERT INTO MY_DOCUMENTS values
('Apple is looking at buying U.K. startup for $1 billion'),
('San Francisco considers banning sidewalk delivery robots'),
('Amazon is hiring a new vice president of global policy'),
('George Pipis works for Predictive Hacks'),
('Today is Wednesday, 18:00'),
('Dear George, can you please respond to my email?')



select * from MY_DOCUMENTS
 

Now, let’s redact the detected NERs.

select document, redact_ner(document) as clean_document from MY_DOCUMENTS
 

As we can see, SpaCy was able to detect some entities but it failed to detect the “NAME” in “Dear George” and the “ORG” in Predictive Hacks. However, it detected correctly the $1 billionApple, Amazon, San Francisco, Today, Wednesday, 18:00, and George Pipis.

Finally, notice that when we run the query the results appear in a different order but this is not a problem since the mapping between columns is correct.

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

Leave a Comment

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

Python

Image Captioning with HuggingFace

Image captioning with AI is a fascinating application of artificial intelligence (AI) that involves generating textual descriptions for images automatically.

Python

Intro to Chatbots with HuggingFace

In this tutorial, we will show you how to use the Transformers library from HuggingFace to build chatbot pipelines. Let’s