Personal data, also known as personally identifiable information (PII), is any information relating to an identifiable person. Some examples of PII are the following:
- National identification number (e.g., Social Security number in the U.S.)
- Bank account numbers
- Passport number
- Driver’s license number
- Debit/Credit card numbers
- Full Name
- Home Address
- City
- State
- Postcode
- Country
- Telephone
- Age, Date of Birth, especially if non-specific
- Gender or race
- Web cookie
To comply with GDPR, a common task is to apply algorithms that detect and redact PII. AWS Comprehend lets us detect PII entities and redact them. Specifically, it can detect the following PII entity types:
PII entity category | PII entity types |
Financial | BANK_ACCOUNT_NUMBER BANK_ROUTING CREDIT_DEBIT_NUMBER CREDIT_DEBIT_CVV CREDIT_DEBIT_EXPIRY PIN |
Personal | NAME ADDRESS PHONE EMAIL AGE |
Technical security | USERNAME PASSWORD URL AWS_ACCESS_KEY AWS_SECRET_KEY IP_ADDRESS MAC_ADDRESS |
National | SSN PASSPORT_NUMBER DRIVER_ID |
Other | DATE_TIME |
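To make the table easier to use in code, here is a minimal sketch that stores it as a dictionary and looks up the category of a given entity type. The names `PII_CATEGORIES`, `TYPE_TO_CATEGORY`, and `category_of` are hypothetical helpers of our own and are not part of Boto3 or Comprehend. `EMAIL` is included under Personal because it shows up as a type in the sample response later in this post.

```python
# Entity categories and types, copied from the table above.
# These names are our own helpers, not part of any AWS library.
PII_CATEGORIES = {
    "Financial": [
        "BANK_ACCOUNT_NUMBER", "BANK_ROUTING", "CREDIT_DEBIT_NUMBER",
        "CREDIT_DEBIT_CVV", "CREDIT_DEBIT_EXPIRY", "PIN",
    ],
    "Personal": ["NAME", "ADDRESS", "PHONE", "EMAIL", "AGE"],
    "Technical security": [
        "USERNAME", "PASSWORD", "URL", "AWS_ACCESS_KEY",
        "AWS_SECRET_KEY", "IP_ADDRESS", "MAC_ADDRESS",
    ],
    "National": ["SSN", "PASSPORT_NUMBER", "DRIVER_ID"],
    "Other": ["DATE_TIME"],
}

# Reverse lookup: entity type -> category
TYPE_TO_CATEGORY = {
    entity_type: category
    for category, entity_types in PII_CATEGORIES.items()
    for entity_type in entity_types
}


def category_of(entity_type: str) -> str:
    """Return the category of an entity type, or 'Unknown' if it is not in the table."""
    return TYPE_TO_CATEGORY.get(entity_type, "Unknown")


print(category_of("CREDIT_DEBIT_CVV"))
```

A lookup like this can help when you only want to redact one category, for example the Financial types.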
Detect and Redact PII Using Boto3
Let’s walk through an example where our task is to redact the PII from a short text.
import boto3

client = boto3.client('comprehend')

mytxt = """My name is Joe Smith and I was born in 1988 and I am 33 years old. I work at Predictive Hacks and my email is [email protected]. I live in Athens, Greece. My phone number is 623 12 34 567 and my bank account is 123-123-567-888"""

response = client.detect_pii_entities(
    Text=mytxt,
    LanguageCode='en'
)

# get the response
response
Output:
{'Entities': [{'Score': 0.9999890923500061,
'Type': 'NAME',
'BeginOffset': 11,
'EndOffset': 20},
{'Score': 0.9999454021453857,
'Type': 'DATE_TIME',
'BeginOffset': 39,
'EndOffset': 43},
{'Score': 0.9998961091041565,
'Type': 'AGE',
'BeginOffset': 53,
'EndOffset': 61},
{'Score': 0.9999966621398926,
'Type': 'EMAIL',
'BeginOffset': 110,
'EndOffset': 139},
{'Score': 0.999873697757721,
'Type': 'ADDRESS',
'BeginOffset': 152,
'EndOffset': 166},
{'Score': 0.9999797344207764,
'Type': 'PHONE',
'BeginOffset': 187,
'EndOffset': 200},
{'Score': 0.9999954700469971,
'Type': 'BANK_ACCOUNT_NUMBER',
'BeginOffset': 224,
'EndOffset': 239}],
'ResponseMetadata': {'RequestId': 'a3083872-a14b-41a6-ae64-893c9a6cac9a',
'HTTPStatusCode': 200,
'HTTPHeaders': {'x-amzn-requestid': 'a3083872-a14b-41a6-ae64-893c9a6cac9a',
'content-type': 'application/x-amz-json-1.1',
'content-length': '570',
'date': 'Fri, 26 Mar 2021 14:29:33 GMT'},
'RetryAttempts': 0}}
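Each entity carries only a type, a confidence score, and character offsets, so it can help to slice the text and check what was actually matched. The following is a minimal sketch under that assumption. `entity_spans` and its `min_score` parameter are hypothetical helpers of our own, and the sample data is hand-written in the shape of `response['Entities']` so that the snippet runs without calling AWS.

```python
from typing import Dict, List, Tuple


def entity_spans(text: str, entities: List[Dict], min_score: float = 0.0) -> List[Tuple[str, str]]:
    """Return (entity type, matched substring) pairs for entities scoring at least min_score."""
    return [
        (ent["Type"], text[ent["BeginOffset"]:ent["EndOffset"]])
        for ent in entities
        if ent["Score"] >= min_score
    ]


# Hand-written sample in the shape of response['Entities']; not real API output.
sample_text = "My name is Joe Smith and I was born in 1988"
sample_entities = [
    {"Score": 0.99, "Type": "NAME", "BeginOffset": 11, "EndOffset": 20},
    {"Score": 0.40, "Type": "DATE_TIME", "BeginOffset": 39, "EndOffset": 43},
]

# Keep only the entities the detector is reasonably confident about
print(entity_spans(sample_text, sample_entities, min_score=0.5))
# → [('NAME', 'Joe Smith')]
```

With the real response, you would call `entity_spans(mytxt, response['Entities'])` instead of using the sample data.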
Now let’s build the redacted text by replacing each detected entity in the original text with its entity type.
clean_text = mytxt

# Iterate in reverse so that a substitution does not shift the offsets of the entities before it
for NER in reversed(response['Entities']):
    clean_text = clean_text[:NER['BeginOffset']] + NER['Type'] + clean_text[NER['EndOffset']:]

print(clean_text)
Output:
My name is NAME and I was born in DATE_TIME and I am AGE old. I work at Predictive Hacks and my email is EMAIL. I live in ADDRESS. My phone number is PHONE and my bank account is BANK_ACCOUNT_NUMBER
Where the original text was:
My name is Joe Smith and I was born in 1988 and I am 33 years old. I work at Predictive Hacks and my email is [email protected]. I live in Athens, Greece. My phone number is 623 12 34 567 and my bank account is 123-123-567-888
Good job!
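If you need to repeat this on many texts, the same idea can be wrapped in a function. This is only a sketch: `redact`, its `types` and `min_score` parameters, and the bracketed placeholder format are our own assumptions and not part of Boto3. It sorts the entities by offset itself instead of relying on the order of the response, and it can limit redaction to selected entity types.

```python
from typing import Dict, Iterable, List, Optional


def redact(
    text: str,
    entities: List[Dict],
    types: Optional[Iterable[str]] = None,
    min_score: float = 0.0,
) -> str:
    """Replace each selected entity in text with a [TYPE] placeholder."""
    wanted = set(types) if types is not None else None
    # Process from the end of the text backwards so earlier offsets stay valid
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if wanted is not None and ent["Type"] not in wanted:
            continue
        if ent["Score"] < min_score:
            continue
        text = text[:ent["BeginOffset"]] + "[" + ent["Type"] + "]" + text[ent["EndOffset"]:]
    return text


# Hand-written sample in the shape of response['Entities']; not real API output.
sample_text = "My name is Joe Smith and I was born in 1988"
sample_entities = [
    {"Score": 0.99, "Type": "NAME", "BeginOffset": 11, "EndOffset": 20},
    {"Score": 0.98, "Type": "DATE_TIME", "BeginOffset": 39, "EndOffset": 43},
]

print(redact(sample_text, sample_entities))
# → My name is [NAME] and I was born in [DATE_TIME]
print(redact(sample_text, sample_entities, types={"NAME"}))
# → My name is [NAME] and I was born in 1988
```

With the real response, the call would be `redact(mytxt, response['Entities'])`.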