I am using Python/pandas. I need to split personal names so that, when a name ends in something like "Van Dyke", both words end up in the last name. So if the name is Richard Wayne Van Dyke, Wayne is the middle name and Van Dyke is the last name. Complicating matters are names like Gary Del Barco, where no middle name was entered and I want "Del Barco" as the last name. I had a working script using the csv module, but I need to move it to pandas. Below is my attempt at making that move.

import pandas as pd

df = pd.DataFrame({'Name':['Richard Wayne Van Dyke','Gary Del Barco','Dave Allen Smith']})
df = df.fillna('')
df =df.astype(unicode)
splits = df['Name'].str.split(' ', expand=True)

df['firstName'] = splits[0]
if  splits[2].notnull and splits[3].isnull:#this works for Bret Allen Cardwell

    df['lastName'] = splits[2]
    df['middleName'] = splits[1]
    print "Case 1: First: " + df['firstName'] + " middle: " +df['middleName'] + " last: " + df['lastName']
elif splits[2].all() == 'Del':#trying to get last name of "Del Barco"
    print 'del'
    df['middleName'] = ''
    df['lastName'] = splits[2] + " " + splits[3]
    print "Case 2: First: " + df['firstName'] + " middle: " +df['middleName'] + " last: " + df['lastName']

elif splits[3].notnull: #trying to get last name of "Van Dyke"
    df['middleName'] = splits[1]
    df['lastName'] = splits[2] + " " + splits[3]
    print "Case 3: First: " + df['firstName'] + " middle: " +df['middleName'] + " last: " + df['lastName']

Is there an application for lambda instead of what I have attempted?


All 3 Replies

Considering names like Gary Del Barco vs Sarah Michelle Gellar, or Richard Wayne Van Dyke vs Doris Mary Ann Kappelhoff (a.k.a. Doris Day), I suspect what you are asking is impossible without the use of human-level A.I.

(Not to mention Edda Kathleen van Heemstra Hepburn-Ruston, a.k.a. Audrey Hepburn)

What about Renaud Le Van Kim?

More seriously, I can't run your code because my Python installation has pandas 0.13.1 on Ubuntu 14.04, and that version does not understand the expand= parameter of split().

The problem can be solved before the data is handed to pandas.
As mentioned above, there are a lot of cases to think about.
I would make up rules with regex.

>>> import re

>>> s = "Richard Wayne Van Dyke"
>>> re.split(r"\s*(?:(?:[a-zA-Z]')|(Van \w.*))", s, flags=re.I)[:-1]
['Richard Wayne', 'Van Dyke']
>>> s = 'Wayne Van Dyke'
>>> re.split(r"\s*(?:(?:[a-zA-Z]')|(Van \w.*))", s, flags=re.I)[:-1]
['Wayne', 'Van Dyke']
>>> s = 'Edda Kathleen Van Heemstra Hepburn-Ruston'
>>> re.split(r"\s*(?:(?:[a-zA-Z]')|(Van \w.*))", s, flags=re.I)[:-1]
['Edda Kathleen', 'Van Heemstra Hepburn-Ruston']

>>> s = "Gary Del Barco"
>>> re.split(r"\s*(?:(?:[a-zA-Z]')|(Del \w.*))", s, flags=re.I)[:-1]
['Gary', 'Del Barco']

>>> s = 'Dave Allen Smith'
>>> re.split(r"\s*(?:(?:[a-zA-Z]')|(Del \w.*))", s, flags=re.I)[:-1]
[]
So a keyword like Van or Del in this regex takes the keyword and every word after it as the last name.
If you need more rules than this, you have to write your own rules for the problem names.
If you pass in a name that the regex does not match, an empty list is returned.
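
To tie this back to the original question, here is a rough sketch of how the keyword rule could be folded into one helper that always returns a first, middle and last name. The names split_name and PARTICLES are made up for illustration, and the fallback for names without a keyword (the final word becomes the last name) is an assumption of this sketch, not something the regex above does.

```python
import re

# Hypothetical list of surname keywords; extend it for your own data.
PARTICLES = ("Van", "Del")

# One alternation covering every keyword: a keyword preceded by whitespace,
# followed by at least one more word, running to the end of the string.
_LAST_NAME = re.compile(
    r"\s+((?:%s)\s+\S.*)$" % "|".join(map(re.escape, PARTICLES)),
    flags=re.I,
)


def split_name(name):
    """Return a (first, middle, last) tuple for a space-separated name."""
    match = _LAST_NAME.search(name)
    if match:
        # Keyword found: it and everything after it form the last name.
        given, last = name[:match.start()], match.group(1)
    else:
        # Assumption of this sketch: with no keyword, the final word
        # is treated as the last name.
        given, _, last = name.rpartition(" ")
    # The first given word is the first name; whatever is left over
    # (possibly nothing) is the middle name.
    first, _, middle = given.partition(" ")
    return first, middle, last


for example in ("Richard Wayne Van Dyke", "Gary Del Barco", "Dave Allen Smith"):
    print(split_name(example))
```

A plain function like this can be passed to pandas in place of a lambda, for instance df['Name'].apply(split_name), which should give one tuple per row. It is still only a rule of thumb: a name like Renaud Le Van Kim would come out with "Van Kim" as the last name, so problem names still need their own rules.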

commented: Thanks for the responses. I will be working with a limited set of data periodically, so I can easily write new rules for problem names.