Predictive Hacks

How to Redact PII Data using AWS Comprehend

Personal data, also known as personally identifiable information (PII), is any information relating to an identifiable person. Some examples of PII are the following:

  • National identification number (e.g., Social Security number in the U.S.)
  • Bank account numbers
  • Passport number
  • Driver’s license number
  • Debit/Credit card numbers
  • Full Name
  • Home Address
  • City
  • State
  • Postcode
  • Country
  • Telephone
  • Age, Date of Birth, especially if non-specific
  • Gender or race
  • Web cookie

To comply with regulations such as GDPR, a common task is to detect PII in text and redact it. AWS Comprehend detects PII entities and returns their positions in the text, which we can then use to redact them. In particular, it can detect the following PII entities:

PII entity category: PII entity types

  • Financial: BANK_ACCOUNT_NUMBER, BANK_ROUTING, CREDIT_DEBIT_NUMBER, CREDIT_DEBIT_CVV, CREDIT_DEBIT_EXPIRY, PIN
  • Personal: NAME, ADDRESS, PHONE, EMAIL, AGE
  • Technical security: USERNAME, PASSWORD, URL, AWS_ACCESS_KEY, AWS_SECRET_KEY, IP_ADDRESS, MAC_ADDRESS
  • National: SSN, PASSPORT_NUMBER, DRIVER_ID
  • Other: DATE_TIME
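
When only some kinds of PII need to be redacted, it can help to keep the categories above in a small lookup table. The sketch below is only illustrative: `PII_CATEGORIES` and `types_for` are hypothetical names of our own, not part of boto3, and the entity types are simply copied from the list above.

```python
# Entity types grouped by category, copied from the list above.
# PII_CATEGORIES and types_for are illustrative names, not part of boto3.
PII_CATEGORIES = {
    "Financial": [
        "BANK_ACCOUNT_NUMBER", "BANK_ROUTING", "CREDIT_DEBIT_NUMBER",
        "CREDIT_DEBIT_CVV", "CREDIT_DEBIT_EXPIRY", "PIN",
    ],
    "Personal": ["NAME", "ADDRESS", "PHONE", "EMAIL", "AGE"],
    "Technical security": [
        "USERNAME", "PASSWORD", "URL", "AWS_ACCESS_KEY",
        "AWS_SECRET_KEY", "IP_ADDRESS", "MAC_ADDRESS",
    ],
    "National": ["SSN", "PASSPORT_NUMBER", "DRIVER_ID"],
    "Other": ["DATE_TIME"],
}


def types_for(*categories):
    """Return the set of entity types that belong to the given categories."""
    return {t for c in categories for t in PII_CATEGORIES[c]}


# Example: every entity type we would treat as personal or national PII
print(sorted(types_for("Personal", "National")))
```

A set like this can later be used to decide which detected entities to redact and which to leave untouched.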

Detect and Redact PII Using Boto3

Let’s work through an example in which we detect the PII in a short text and then redact it.

import boto3

# Assumes AWS credentials and a default region are already configured
client = boto3.client('comprehend')

mytxt = """My name is Joe Smith and I was born in 1988 and I am 33 years old.
I work at Predictive Hacks and my email is joe.smith@predictivehacks.com. 
I live in Athens, Greece. My phone number is 623 12 34 567 and my bank account is 123-123-567-888"""

# Detect the PII entities in the text
response = client.detect_pii_entities(
    Text=mytxt,
    LanguageCode='en'
)

# Inspect the response
response

Output:

{'Entities': [{'Score': 0.9999890923500061,
   'Type': 'NAME',
   'BeginOffset': 11,
   'EndOffset': 20},
  {'Score': 0.9999454021453857,
   'Type': 'DATE_TIME',
   'BeginOffset': 39,
   'EndOffset': 43},
  {'Score': 0.9998961091041565,
   'Type': 'AGE',
   'BeginOffset': 53,
   'EndOffset': 61},
  {'Score': 0.9999966621398926,
   'Type': 'EMAIL',
   'BeginOffset': 110,
   'EndOffset': 139},
  {'Score': 0.999873697757721,
   'Type': 'ADDRESS',
   'BeginOffset': 152,
   'EndOffset': 166},
  {'Score': 0.9999797344207764,
   'Type': 'PHONE',
   'BeginOffset': 187,
   'EndOffset': 200},
  {'Score': 0.9999954700469971,
   'Type': 'BANK_ACCOUNT_NUMBER',
   'BeginOffset': 224,
   'EndOffset': 239}],
 'ResponseMetadata': {'RequestId': 'a3083872-a14b-41a6-ae64-893c9a6cac9a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'a3083872-a14b-41a6-ae64-893c9a6cac9a',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '570',
   'date': 'Fri, 26 Mar 2021 14:29:33 GMT'},
  'RetryAttempts': 0}}
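
Note that each entity carries only its type, a confidence score and the begin/end offsets, not the matched text itself. A minimal sketch of how the matched text could be recovered from the offsets is shown below. It runs offline on the first sentence of our example with the first three entities from the response above (scores rounded); `extract_entities` and its `min_score` parameter are hypothetical names of our own.

```python
def extract_entities(text, entities, min_score=0.5):
    """Return (type, matched text, score) for entities scoring at least min_score.

    extract_entities is an illustrative helper, not part of boto3.
    """
    return [
        (e["Type"], text[e["BeginOffset"]:e["EndOffset"]], e["Score"])
        for e in entities
        if e["Score"] >= min_score
    ]


# First sentence of the example and its entities, copied from the response above
text = "My name is Joe Smith and I was born in 1988 and I am 33 years old."
entities = [
    {"Score": 0.99999, "Type": "NAME", "BeginOffset": 11, "EndOffset": 20},
    {"Score": 0.99995, "Type": "DATE_TIME", "BeginOffset": 39, "EndOffset": 43},
    {"Score": 0.99990, "Type": "AGE", "BeginOffset": 53, "EndOffset": 61},
]

found = extract_entities(text, entities)
for entity_type, value, score in found:
    print(entity_type, repr(value), score)
```

Raising `min_score` is one way to ignore low-confidence detections; the right threshold depends on how costly a missed or a false redaction is for your use case.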

Now let’s replace each detected entity in the original text with its entity type.

clean_text = mytxt

# Substitute from the last entity to the first so that the offsets of the earlier entities stay valid
for entity in sorted(response['Entities'], key=lambda e: e['BeginOffset'], reverse=True):
    clean_text = clean_text[:entity['BeginOffset']] + entity['Type'] + clean_text[entity['EndOffset']:]

print(clean_text)

Output:

My name is NAME and I was born in DATE_TIME and I am AGE old. I work at Predictive Hacks and my email is EMAIL. I live in ADDRESS. My phone number is PHONE and my bank account is BANK_ACCOUNT_NUMBER

For comparison, the original text was:

My name is Joe Smith and I was born in 1988 and I am 33 years old. I work at Predictive Hacks and my email is joe.smith@predictivehacks.com. I live in Athens, Greece. My phone number is 623 12 34 567 and my bank account is 123-123-567-888
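
The substitution loop can also be wrapped in a small reusable function. The sketch below is one possible way to do it, not the only one: `redact`, `types` and `mask_char` are hypothetical names of our own. It optionally redacts only selected entity types and can mask each character instead of inserting the type label. To keep it runnable without an AWS call, it uses the first sentence of our example and its three entities.

```python
def redact(text, entities, types=None, mask_char=None):
    """Return text with the given entities redacted.

    types:     optional set of entity types to redact (None means all of them)
    mask_char: if given, each character of a match is replaced by it;
               otherwise the match is replaced by its entity type
    redact is an illustrative helper, not part of boto3.
    """
    selected = [e for e in entities if types is None or e["Type"] in types]
    # Work from the end of the text backwards so earlier offsets stay valid
    for e in sorted(selected, key=lambda e: e["BeginOffset"], reverse=True):
        begin, end = e["BeginOffset"], e["EndOffset"]
        replacement = mask_char * (end - begin) if mask_char else e["Type"]
        text = text[:begin] + replacement + text[end:]
    return text


# First sentence of the example and its entities, copied from the response above
text = "My name is Joe Smith and I was born in 1988 and I am 33 years old."
entities = [
    {"Score": 0.99999, "Type": "NAME", "BeginOffset": 11, "EndOffset": 20},
    {"Score": 0.99995, "Type": "DATE_TIME", "BeginOffset": 39, "EndOffset": 43},
    {"Score": 0.99990, "Type": "AGE", "BeginOffset": 53, "EndOffset": 61},
]

print(redact(text, entities))
print(redact(text, entities, types={"NAME"}, mask_char="*"))
```

Masking with a fixed character keeps the length of the text unchanged, which can be convenient when other systems rely on character positions.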

Good job!
