Many companies and individuals hold a database of names and there is a need to join them with one or more other databases or just to clean them by removing the duplicates. In most cases, their approach in this kind of task is to apply fuzzy matching using Levenshtein distance (see here). Usually, the fuzzy matching fails because every userbase has a different approach of storing and representing the data. For example, the name “George Pipis”, can be written as “George P. Pipis”, “Mr. George Prokopis Pipis”, “Dr. George Pipis Jr.”, “Pipis, George” etc. Clearly, in this case, the “fuzzy matching” does not work since we are not talking about a mi-spelling of a couple of characters.
Python Library nameparser
The python library called nameparser, gives us the chance to split the “Full Name” into “title”, “first”, “middle”, “last”, “suffix” and “nickname. For example the name “Dr. Juan Q. Xavier de la Vega III (Doc Vega)” can be decomposed as follows:
from nameparser import HumanName name = HumanName("Dr. Juan Q. Xavier de la Vega III (Doc Vega)") name
<HumanName : [
title: 'Dr.'
first: 'Juan'
middle: 'Q. Xavier'
last: 'de la Vega'
suffix: 'III'
nickname: 'Doc Vega'
]>
Join Databases on Full Names
A good approach for joining the databases on “Full Names”, would be:
- Split them into parts.
- Convert them to lower case.
- Remove the periods . symbols like (Dr. can be Dr) .
The following example shows how we can easily get a pandas dataframe with a column with the “Full Name” and to create another 5 columns based on name parts.
from nameparser import HumanName import pandas as pd df = pd.DataFrame({'Name': ["George P. Pipis", "George Pipis", "Mr. George Prokopios Pipis", "Dr. George Pipis Jr.", "Pipis, George", "Dr. Juan Q. Xavier de la Vega III (Doc Vega)"]}) df["title"] = df["Name"].apply(lambda x: HumanName(x).title) df["first"] = df["Name"].apply(lambda x: HumanName(x).first) df["middle"] = df["Name"].apply(lambda x: HumanName(x).middle) df["last"] = df["Name"].apply(lambda x: HumanName(x).last) df["suffix"] = df["Name"].apply(lambda x: HumanName(x).suffix) df["nickname"] = df["Name"].apply(lambda x: HumanName(x).nickname) df.drop('Name', axis=1)
title first middle last suffix nickname
0 George P. Pipis
1 George Pipis
2 Mr George Prokopios Pipis
3 Dr George Pipis Jr.
4 George Pipis
5 Dr. Juan Q. Xavier de la Vega III Doc Vega
This means that we can define some rules for joining or to give a priority like:
- First match the exact match of the 5 parts.
- Then the exact match of the “first”, “middle” ,”last”.
- Then the exact match of the “first”, the first character of “middle”, “last”.
- Then the exact match of the “first” and “last”.
Sometimes we will need also to apply a fuzzy matching after the name parser. You get an idea of the concept of Text Distances and Fuzzy Joins.
4 thoughts on “Name Parser with Python”
Can you please complete the code on the last example, it seems that its missing the portion on which you pass the original “Name” columns dataframe to the Function HumanName, I tried:
df[‘Firstname’],df[‘lastname’] = HumanName(df[‘Name’]).first, HumanName(df[‘Name’]).last
but that does not seem to work.
Thanks for your message! I updated it
George, once again u saved the day! it works now, I searched the entire web for this and even on Stack Overflow they don’t have it correctly. keep up the great work!
Always a pleasure! I added an answer to StackOverflow: https://stackoverflow.com/questions/44605333/parsing-last-name-from-name-in-python/61142186#61142186