Let's say I have a text file separated by tabs containing stuff like:

1.02 \t  hello \t 01/02/2008

How do I get python to recognise that 1.02 is a number, hello is a string and 01/02/2008 is a date.
The file has no pattern, i.e it won't always be number,string,date, it could be anything.

I'm thinking perhaps using regular expressions or something.

Waiting your earliest reply. Sample code would be great.

Decide what types of objects that need to be converted. If needed, create a function to make each conversion type. An exception should occur on any failed conversion. If all conversions fail, return a string. Example:

import time

def fdim(s):
    # return a float given a fraction (EX. '1/2')
    ss = s.split('/')
    return float(ss[0])/float(ss[1])

def evalList(s):
    # return a list given a string representation
    if s.strip().startswith('[' ) and s.strip().endswith(']'):
        try: return eval(s)
        except: raise ValueError
    raise ValueError

def evalDateStr(s):
    # return a struct_time object given a string representation
    return time.strptime(s, '%m/%d/%Y')

def convertType(s):
    for func in (int, float, evalDateStr, fdim, evalList):
        try:
            n = func(s)
            return n
        except:
            pass
    return s

s = "6 \t 1.02 \t  hello \t 01/02/2008 \t [1,2,3] \t 15/16"

for obj in [convertType(item) for item in [w.strip() for w in s.split('\t')]]:
    print '%s - object type: %s' % (obj, type(obj))

Output:

>>> 6 - object type: <type 'int'>
1.02 - object type: <type 'float'>
hello - object type: <type 'str'>
(2008, 1, 2, 0, 0, 0, 2, 2, -1) - object type: <type 'time.struct_time'>
[1, 2, 3] - object type: <type 'list'>
0.9375 - object type: <type 'float'>
>>>
Comments
awesome

Wow that's awesome, one more question though how do I detect type bools, such as TRUE and FALSE I changed your code a bit to test it and got rid of the bits I don't need.

omg.txt (tab delimited)

0.0003	2.3	8	hello	000292	0.0.43
11/01/2008	13/13/1991	0	"343,4343"
0116 2346	6421-01-01	31/02/2008	"2.3"
01/02/08	1.223232223

conv.py

import time

def evalDateStr(s):
    # return a struct_time object given a string representation
    # changed it to day month year
    return time.strptime(s, '%d/%m/%Y')

def convertType(s):
    for func in (int, float, evalDateStr ):
        try:
            n = func(s)
            return n
        except:
            pass
    return s


#changed to read in file line by line
print "\nLooping through the file, line by line."
f = open("omg.txt", "r")
for line in f:
    s = line
    for obj in [convertType(item) for item in [w.strip() for w in s.split('\t')]]:
         print '%s - object type: %s' % (obj, type(obj))


f.close()

output

Looping through the file, line by line.
0.0003 - object type: <type 'float'>
2.3 - object type: <type 'float'>
8 - object type: <type 'int'>
hello - object type: <type 'str'>
292 - object type: <type 'int'>
0.0.43 - object type: <type 'str'>
(2008, 1, 11, 0, 0, 0, 4, 11, -1) - object type: <type 'time.struct_time'>
13/13/1991 - object type: <type 'str'>
0 - object type: <type 'int'>
"343,4343" - object type: <type 'str'>
0116 2346 - object type: <type 'str'>
6421-01-01 - object type: <type 'str'>
31/02/2008 - object type: <type 'str'>    
"2.3" - object type: <type 'str'>
01/02/08 - object type: <type 'str'>
1.223232223 - object type: <type 'float'>

As you can see the results are all correct, I checked it to see if it would be fooled by february having 31 days and it passed. But does it auto detect leap years?

I have one more question as well I will post later. Thanks very much for saving me time.

Part two.

Let's assume I have a tab delimited file such as

test.txt

hello	2.3	01/02/2008
2.0	there	TRUE

I want to have my python script read it in.

import pyXLWriter as xl
import datetime 

# Create a new workbook called simple.xls and add a worksheet
workbook  = xl.Writer("simple.xls")
worksheet = workbook.add_worksheet()

worksheet.write([0, 0],  "hello")          
worksheet.write([0, 1],  2.3)    
worksheet.write([0, 2],  datetime.date(2008,02,01))
worksheet.write([1, 0],  2.0) 
worksheet.write([1, 1],  "there") 
worksheet.write([1, 2],  1

workbook.close()

As you can see in my example the [x,y] represents the row, column in the tab delimited text file.

I want my code to be generic so it can handle any tab delimited file of unknowN entries. I think I need to use functions but I have no idea how to.

Sample code would be great! Awaiting your earliest reply.

Create another function to test for for boole value strings. Placing values in columns and rows is very easy in Python using enumerate().

import time

def evalBoole(s):
    if s in ('True', 'False'):
        return eval(s)
    else:
        raise ValueError

def evalDateStr(s):
    # return a struct_time object given a string representation
    # changed it to day month year
    return time.strptime(s, '%d/%m/%Y')

def convertType(s):
    for func in (int, float, evalBoole, evalDateStr):
        try:
            n = func(s)
            return n
        except:
            pass
    return s

#changed to read in file line by line
print "\nLooping through the file, line by line."
f = open("omg.txt", "r")
for i, line in enumerate(f):
    for j, obj in enumerate([convertType(item) for item in 
                             [w.strip() for w in line.strip().split('\t')]
                             ]):
         print 'Row %d, Column %d, Value: %s' % (i,j,obj)

f.close()

Thanks that is what I was thinking about booleans!

However, the next bit that I am stuck on is in my post #4.

I need to take the tab file and convert it to an excel file.

import pyXLWriter as xl
import datetime 

# Create a new workbook called simple.xls and add a worksheet
workbook  = xl.Writer("simple.xls")
worksheet = workbook.add_worksheet()

worksheet.write([0, 0],  "hello")          
worksheet.write([0, 1],  2.3)    
worksheet.write([0, 2],  datetime.date(2008,02,01))
worksheet.write([1, 0],  2.0) 
worksheet.write([1, 1],  "there") 
worksheet.write([1, 2],  1

workbook.close()

Here I have the code to write to excel using the module pyXLWriter but it is hard coded. What I want is it to read the tab file then choose what the variable is, what x,y position is it and write it to excel. so can you help me do this? I don't know how to do this? Thank you for your help so far.

I thought my use of enumerate() would give you enough information. Here's the code again with code to write to the Excel file (untested):

print "\nLooping through the file, line by line."
workbook  = xl.Writer("simple.xls")
worksheet = workbook.add_worksheet()
f = open("omg.txt", "r")
for i, line in enumerate(f):
    for j, obj in enumerate([convertType(item) for item in 
                             [w.strip() for w in line.strip().split('\t')]
                             ]):
         worksheet.write([i, j], obj)
         
workbook.close()
f.close()
Comments
Almost there

No I can't seem to get it to work! I tried two tests.

import pyXLWriter as xl
import time
import datetime


workbook  = xl.Writer("simple555.xls")
worksheet = workbook.add_worksheet()

def evalBoole(s):
    if s in ('True', 'False'):
        return eval(s)
    else:
        raise ValueError

def evalDateStr(s):
    # return a struct_time object given a string representation
    # changed it to day month year
    return time.strptime(s, '%d/%m/%Y')

def convertType(s):
    for func in (int, float, evalBoole, evalDateStr):
        try:
            n = func(s)
            return n
        except:
            pass
    return s

#changed to read in file line by line
print "\nLooping through the file, line by line."
f = open("omg.txt", "r")
for i, line in enumerate(f):
    for j, obj in enumerate([convertType(item) for item in 
                             [w.strip() for w in line.strip().split('\t')]
                             ]):
		#print 'Row %d, Column %d, Value: %s' % (i,j,obj)
            worksheet.write([i, j], datetime.date(2008,02,01))

f.close()
workbook.close()

That works when you explicitly give it a value, either date, string or intger/float

But it doesn't work when I do

import pyXLWriter as xl
import time
import datetime


workbook  = xl.Writer("simple555.xls")
worksheet = workbook.add_worksheet()

def evalDateStr(s):
    # return a struct_time object given a string representation
    # changed it to day month year
    return time.strptime(s, '%d/%m/%Y')

def convertType(s):
    for func in (int, float, evalDateStr):
        try:
            n = func(s)
            return n
        except:
            pass
    return s

#changed to read in file line by line
print "\nLooping through the file, line by line."
f = open("omg.txt", "r")
for i, line in enumerate(f):
    for j, obj in enumerate([convertType(item) for item in 
                             [w.strip() for w in line.strip().split('\t')]
                             ]):
		#print 'Row %d, Column %d, Value: %s' % (i,j,obj)
            worksheet.write([i, j], obj)

f.close()
workbook.close()

it fails, it doesn't even write to excel file!!!!


Do I need to create a function such as:

if obj = str then
   s = str(obj)
 worksheet.write ([i,j], s)

else if obj = integer then
  i = int(obj)
  worksheet.write([i,j], i)

else if obj = date then
  a = split date by commas =day
  b = split date by commas = month
  c = split date by commas = year

  worksheet.write([i,j], datetime.date(c,b,a))
end

If so how do I do this. I don't know anything about python!!!! THanks I'm almost finished.

You should run some tests on different types of data to see what is failing. Did you get any error messages? I would guess that it is failing on the date. Try this:

for i, line in enumerate(f):
    for j, obj in enumerate([convertType(item) for item in 
                             [w.strip() for w in line.strip().split('\t')]
                             ]):
        if isinstance(obj, time.struct_time):
            worksheet.write([i, j], datetime.date(*obj[:3]))
        else:
            worksheet.write([i, j], obj)
import pyXLWriter as xl
import time
import datetime


workbook  = xl.Writer("simple555.xls")
worksheet = workbook.add_worksheet()

def evalBoole(s):
    if s in ('TRUE', 'FALSE'):
        return eval(s)
    else:
        raise ValueError

def evalDateStr(s):
    # return a struct_time object given a string representation
    # changed it to day month year
    return time.strptime(s, '%d/%m/%Y')

def convertType(s):
    for func in (int, float, evalBoole, evalDateStr):
        try:
            n = func(s)
            return n
        except:
            pass
    return s

#changed to read in file line by line
print "\nLooping through the file, line by line."
f = open("omg.txt", "r")
for i, line in enumerate(f):
    for j, obj in enumerate([convertType(item) for item in 
                             [w.strip() for w in line.strip().split('\t')]
                             ]):
        if isinstance(obj, time.struct_time):
            worksheet.write([i, j], datetime.date(*obj[:3]))
        else:
            worksheet.write([i, j], obj)
f.close()
workbook.close()

Cool! Thank you so much it works now, it writes to the excel file (2000 & 2003 format) as I expect. However, it still comes up with some error messages and I am not quite sure what they mean, or what it might affect:

command prompt dump

Looping through the file, line by line.
C:\Python25\lib\site-packages\pyXLWriter\OLEWriter.py:135: DeprecationWarning: s
truct integer overflow masking is deprecated
  unknown3 = pack("<H", -2)
C:\Python25\lib\site-packages\pyXLWriter\OLEWriter.py:135: DeprecationWarning: '
H' format requires 0 <= number <= 65535
  unknown3 = pack("<H", -2)
C:\Python25\lib\site-packages\pyXLWriter\OLEWriter.py:141: DeprecationWarning: s
truct integer overflow masking is deprecated
  sbd_startblock = pack("<L", -2)
C:\Python25\lib\site-packages\pyXLWriter\OLEWriter.py:142: DeprecationWarning: s
truct integer overflow masking is deprecated
  unknown7 = pack("<LLL", 0x00, -2 ,0x00)
C:\Python25\lib\site-packages\pyXLWriter\OLEWriter.py:143: DeprecationWarning: s
truct integer overflow masking is deprecated
  unused = pack("<L", -1)
C:\Python25\lib\site-packages\pyXLWriter\OLEWriter.py:205: DeprecationWarning: s
truct integer overflow masking is deprecated
  pps_prev = pack("<L", -1)              #0x44
C:\Python25\lib\site-packages\pyXLWriter\OLEWriter.py:206: DeprecationWarning: s
truct integer overflow masking is deprecated
  pps_next = pack("<L", -1)              #0x48
C:\Python25\lib\site-packages\pyXLWriter\OLEWriter.py:213: DeprecationWarning: s
truct integer overflow masking is deprecated
  pps_sb = pack("<L", sb)                #0x74
C:\Python25\lib\site-packages\pyXLWriter\OLEWriter.py:207: DeprecationWarning: s
truct integer overflow masking is deprecated
  pps_dir = pack("<L", dir)              #0x4c
C:\Python25\lib\site-packages\pyXLWriter\OLEWriter.py:168: DeprecationWarning: s
truct integer overflow masking is deprecated
  marker = pack("<L", -3)
C:\Python25\lib\site-packages\pyXLWriter\OLEWriter.py:169: DeprecationWarning: s
truct integer overflow masking is deprecated
  end_of_chain = pack("<L", -2)
C:\Python25\lib\site-packages\pyXLWriter\OLEWriter.py:170: DeprecationWarning: s
truct integer overflow masking is deprecated
  unused = pack("<L", -1)

Also I wanted to ask what does *obj[:3]) mean?

obj[:3] is a slice of the time.struct_time object. The asterisk preceding the list returned by the slice expands the list into individual elements. Example:

>>> d = time.strptime('16/06/2010', '%d/%m/%Y')
>>> d
(2010, 6, 16, 0, 0, 0, 2, 167, -1)
>>> [d[0],d[1],d[2]]
[2010, 6, 16]
>>> datetime.date(d[0],d[1],d[2])
datetime.date(2010, 6, 16)
>>> datetime.date(*d[:3])
datetime.date(2010, 6, 16)
>>> d[:3]
(2010, 6, 16)
>>>

The DeprecationWarning you are seeing is unrelated to the code we are discussing, but originates in the pyXLWriter module. Warnings can be suppressed or turned into exceptions with a call to filterwarnings(), which adds an entry to the warning filter. See Python documentation.

Comments
Thanks again
This article has been dead for over six months. Start a new discussion instead.