Predictive Hacks

Name Parser with Python

Many companies and individuals hold a database of names and there is a need to join them with one or more other databases or just to clean them by removing the duplicates. In most cases, their approach in this kind of task is to apply fuzzy matching using Levenshtein distance (see here). Usually, the fuzzy matching fails because every userbase has a different approach of storing and representing the data. For example, the name “George Pipis”, can be written as “George P. Pipis”, “Mr. George Prokopis Pipis”, “Dr. George Pipis Jr.”, “Pipis, George” etc. Clearly, in this case, the “fuzzy matching” does not work since we are not talking about a mi-spelling of a couple of characters.

Python Library nameparser

The python library called nameparser, gives us the chance to split the “Full Name” into “title”, “first”, “middle”, “last”, “suffix” and “nickname. For example the name “Dr. Juan Q. Xavier de la Vega III (Doc Vega)” can be decomposed as follows:

from nameparser import HumanName
name = HumanName("Dr. Juan Q. Xavier de la Vega III (Doc Vega)")
name
<HumanName : [
    title: 'Dr.'
    first: 'Juan'
    middle: 'Q. Xavier'
    last: 'de la Vega'
    suffix: 'III'
    nickname: 'Doc Vega'
]>

Join Databases on Full Names

A good approach for joining the databases on “Full Names”, would be:

  1. Split them into parts.
  2. Convert them to lower case.
  3. Remove the periods . symbols like (Dr. can be Dr) .

The following example shows how we can easily get a pandas dataframe with a column with the “Full Name” and to create another 5 columns based on name parts.

from nameparser import HumanName
import pandas as pd

df = pd.DataFrame({'Name': ["George P. Pipis", "George Pipis", "Mr. George Prokopios Pipis", "Dr. George Pipis Jr.",
                       "Pipis, George", "Dr. Juan Q. Xavier de la Vega III (Doc Vega)"]})

df["title"] = df["Name"].apply(lambda x: HumanName(x).title)
df["first"] = df["Name"].apply(lambda x: HumanName(x).first)
df["middle"] = df["Name"].apply(lambda x: HumanName(x).middle)
df["last"] = df["Name"].apply(lambda x: HumanName(x).last)
df["suffix"] = df["Name"].apply(lambda x: HumanName(x).suffix)
df["nickname"] = df["Name"].apply(lambda x: HumanName(x).nickname)

df.drop('Name', axis=1)
	title	first	middle	last	suffix	nickname
0		George	P.	Pipis		
1		George		Pipis		
2	Mr	George	Prokopios	Pipis		
3	Dr	George		Pipis	Jr.	
4		George		Pipis		
5	Dr.	Juan	Q. Xavier	de la Vega	III	Doc Vega

This means that we can define some rules for joining or to give a priority like:

  1. First match the exact match of the 5 parts.
  2. Then the exact match of the “first”, “middle” ,”last”.
  3. Then the exact match of the “first”, the first character of “middle”, “last”.
  4. Then the exact match of the “first” and “last”.

Sometimes we will need also to apply a fuzzy matching after the name parser. You get an idea of the concept of Text Distances and Fuzzy Joins.

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

4 thoughts on “Name Parser with Python”

  1. Can you please complete the code on the last example, it seems that its missing the portion on which you pass the original “Name” columns dataframe to the Function HumanName, I tried:

    df[‘Firstname’],df[‘lastname’] = HumanName(df[‘Name’]).first, HumanName(df[‘Name’]).last

    but that does not seem to work.

    Reply
  2. George, once again u saved the day! it works now, I searched the entire web for this and even on Stack Overflow they don’t have it correctly. keep up the great work!

    Reply

Leave a Comment

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

data science journey
Miscellaneous

My Journey as a Data Science Blogger

Μy Background My Studies Back in 2001, I entered university to study Statistics. During my first year, I ran my