I am working on a project where I need to parse a bunch of user text that comes in different fields. The problem is that the user input doesn't always arrive in the same fields. One user might have a name in field 3 and a date in field 5, while another might have names in fields 1 and 2 and a date in field 3. The first thing my code needs to do is figure out what type of information is in which field.

I don't want to reinvent the wheel. Has anyone come across some open source code that already does this, maybe as part of a larger program? Specifically, something like IsItAName(str) and IsItADate(str) would be awesome.

Thanks!

I would just try converting the string to each type and check for errors, such as an exception being raised.

>>> int("hello world")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'hello world'

Then just make sure that the most precise type comes first, like int before float: "123" can be converted to either an int or a float, but "123.455" can only be converted to a float.
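
A minimal sketch of that ordering, assuming every field arrives as a str. The function name guess_python_type is made up for illustration; it only reports which built-in conversion succeeds first and says nothing about whether the text is a name or a date.

```python
def guess_python_type(text):
    """Return int or float if text converts cleanly, otherwise str."""
    # Most precise type first: anything int() accepts, float() accepts too.
    for convert in (int, float):
        try:
            convert(text)
        except ValueError:
            continue
        return convert
    # Nothing numeric worked, so treat the field as plain text.
    return str
```

With this ordering, guess_python_type("123") gives int, guess_python_type("123.455") gives float, and guess_python_type("hello world") falls through to str.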

Thanks for your reply, joehms.

I think I didn't make my question clear. Unfortunately, I'm not interested in figuring out the Python variable type of user input.

All the user input is in the form of various text strings, like "Joe Schmo" or "Tuesday April 2nd" or "Dr. Smith" or "I think cheese is delicious" or "3-10-92". I'm wondering if anyone's come across any open-source code that would make a good guess as to whether a given string is a name (like "Joe Schmo" and "Dr. Smith"), or a date (like "Tuesday April 2nd" or "3-10-92"), or something completely different (like "I think cheese is delicious").

I think you should reinvent the wheel. Depending on the structure of the different fields, the code can be very short, for example:

import re
# Matches a single run of word characters (letters, digits, underscore).
name_re = re.compile(r'^\w+$')

def IsItAName(s):
    return name_re.match(s) is not None

Do you have a description of all possible inputs?

Thanks Gribouillis,

Alas, I do not have a description of all possible inputs. If I did, you're right, this could be very simple indeed!

Names could be the first and/or last names of any person, with or without titles, initials, misspellings, etc. Dates are entered as text, and could really be in any format. These are mixed together with fields of short sentences, phrases, single words, and other cruft.

I think I could whip up an IsItADate() test pretty easily that would be right most of the time.
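
Something along these lines is what I have in mind, as a rough sketch only. The format list is a made-up starting point and would need extending for real data, and the month and weekday names assume an English locale. Ordinal suffixes like "2nd" are stripped first because strptime does not understand them.

```python
import re
from datetime import datetime

# Hypothetical, deliberately short list of formats; extend it to fit the data.
DATE_FORMATS = (
    "%m-%d-%y",      # 3-10-92
    "%m/%d/%Y",      # 03/10/1992
    "%A %B %d %Y",   # Tuesday April 2 1992
    "%B %d %Y",      # April 2 1992
)

# Matches "st", "nd", "rd", "th" only when they directly follow a digit.
ORDINAL_RE = re.compile(r"(?<=\d)(st|nd|rd|th)\b", re.IGNORECASE)

def IsItADate(text):
    """Return True if text parses with one of DATE_FORMATS."""
    cleaned = ORDINAL_RE.sub("", text.strip()).replace(",", "")
    # Also try with a placeholder year so year-less input like
    # "Tuesday April 2nd" can match the formats that expect one.
    for candidate in (cleaned, cleaned + " 2000"):
        for fmt in DATE_FORMATS:
            try:
                datetime.strptime(candidate, fmt)
            except ValueError:
                continue
            return True
    return False
```

This accepts "3-10-92" and "Tuesday April 2nd" and rejects "Joe Schmo" and "I think cheese is delicious", but any format missing from the list is silently treated as not a date.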

I suspect a proper IsItAName() function couldn't be implemented without two things: 1) a really long list of first and last names, and 2) some soft-ish rules, such as whether the string has fewer than five words, whether it uses only alphabetic characters, etc.

If these were well-coded, such functions would return a number indicating the likelihood that a string is a name, or a date, rather than just a TRUE or FALSE.
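
To make that concrete, here is a toy sketch of a scoring version. Everything in it is an assumption for illustration: the name NameLikelihood, the tiny KNOWN_NAMES and TITLES sets (stand-ins for the really long lists), and the 0.2 / 0.8 weights, which are arbitrary.

```python
import re

# Hypothetical stand-ins; a real version would load long lists of names.
KNOWN_NAMES = {"joe", "schmo", "smith", "mary", "john"}
TITLES = {"dr", "mr", "mrs", "ms", "prof"}

# Letters plus apostrophe and hyphen, to allow names like O'Brien.
WORD_RE = re.compile(r"[a-z'\-]+")

def NameLikelihood(text):
    """Return a rough score between 0 and 1 that text is a person's name."""
    words = [w.strip(".,").lower() for w in text.split()]
    words = [w for w in words if w]
    # Soft rule 1: names are short (fewer than five words).
    if not words or len(words) >= 5:
        return 0.0
    # Soft rule 2: names contain no digits or other symbols.
    if not all(WORD_RE.fullmatch(w) for w in words):
        return 0.0
    # Share of words found in the name or title lists.
    hits = sum(w in KNOWN_NAMES or w in TITLES for w in words)
    # Small baseline for "short and alphabetic", rising with known words.
    return 0.2 + 0.8 * hits / len(words)
```

Under these assumptions, "Joe Schmo" and "Dr. Smith" score near 1, a single unknown word like "cheese" gets only the baseline, and "3-10-92" or "I think cheese is delicious" score 0. The caller can then pick whatever threshold works for the data instead of being handed a hard TRUE or FALSE.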

My guess is that such functions would be very useful to anyone parsing random user input in the wild, from the web for example, or as part of a natural language processing library. That's why I have a hunch that I don't have to write these from scratch. Unfortunately, I have not been able to find such code.

Perhaps the Natural Language Toolkit (NLTK) could help you: http://www.nltk.org/

Thanks Gribouillis, that's probably where I should be looking for a solution!

The learning curve for NLTK looked a little steep to me, but I bet it will pay off in the end for what I'm trying to do.

Why not just add text to every field saying what sort of input is acceptable?
Wouldn't that be more robust?
;)

Unfortunately, the data has already been collected. This program needs to make sense of it after the fact.

Can you provide a sample of the sort of data you want to work with?
