I'm new to Python and need a bit of help. I have a large data set that is tab-delimited but, annoyingly, also contains some extra spaces, and I can't seem to get it into a clean array to perform computations on.

This is a simplification of what the data looks like (hundreds of lines like this):
[two spaces]6.0730000e+003[tab][one space]-9.2027000e+004[tab][two spaces]7.8891354e+01[tab]\r\n

I've tried calling .readlines() on it and splitting each line at the tab, but because of the extra spaces the result still isn't usable. On top of that, the last element of each list/line is \r\n, which I don't want, though that isn't a big deal.

I also tried a regular expression, but couldn't get it to a series of vectors of usable floating point numbers (Python seems to handle the scientific notation format fine, which is nice).

I'm sure I'm missing something painfully easy and obvious. Can someone steer me in the right direction? Any help is appreciated.

All 7 Replies

Can you live with this?

"""
[two spaces]6.0730000e+003[tab][one space]-9.2027000e+004[tab][two spaces]7.8891354e+01[tab]\r\n
"""
my_line = '  6.0730000e+003\t -9.2027000e+004\t  7.8891354e+01\t\r\n'
my_list = [eval(n) for n in my_line.split(None)]
print my_list
"""
result -->
[6073.0, -92027.0, 78.891354000000007]
"""

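str.split() with no arguments treats all whitespace (space, tab, newline) the same: any run of it counts as a single separator, and leading and trailing whitespace is ignored.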
s=" 6.0730000e+003\t -9.2027000e+004\t 7.8891354e+01\t\r\n"
print s.split()
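
To make the difference concrete, here is a small sketch comparing a tab-only split with the no-argument split on the sample line from the question. The variable names by_tab and by_whitespace are just illustrative, and the snippet uses print() as a function so it also runs on newer Python versions.

```python
# Sample line from the question: padding spaces, tabs, and a trailing \r\n.
line = "  6.0730000e+003\t -9.2027000e+004\t  7.8891354e+01\t\r\n"

# Splitting on the tab only keeps the padding spaces and leaves a final
# field that contains nothing but "\r\n".
by_tab = line.split("\t")

# Splitting with no argument collapses every run of whitespace and drops
# the empty leading/trailing fields.
by_whitespace = line.split()

print(by_tab)
print(by_whitespace)
```

The second list contains only the three number strings, which is why no extra stripping step is needed before converting them.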

Instead of:

my_list = [eval(n) for n in my_line.split(None)]

as in vegaseat's answer, it is better to keep the use of eval to an absolute minimum. If you know the file contains only floats, use float() like this:

my_list = [float(n) for n in my_line.split(None)]
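
Since the original question was about whole files rather than a single line, the same idea might be extended roughly as follows. This is only a sketch: read_float_rows is a made-up helper name, io.StringIO stands in for the real data file, and the second sample row is invented for illustration.

```python
import io

def read_float_rows(lines):
    """Convert an iterable of whitespace-separated text lines to lists of floats."""
    rows = []
    for line in lines:
        fields = line.split()
        if fields:  # skip lines that are empty or whitespace only
            rows.append([float(field) for field in fields])
    return rows

# Stand-in for an open data file; the second row is invented sample data.
sample = io.StringIO(
    "  6.0730000e+003\t -9.2027000e+004\t  7.8891354e+01\t\r\n"
    "  6.0740000e+003\t -9.2030000e+004\t  7.9000000e+01\t\r\n"
)

rows = read_float_rows(sample)

# Transpose the rows to get one tuple per column ("vector").
columns = list(zip(*rows))

print(rows)
print(columns)
```

With a real file, the open file object could be passed to read_float_rows in place of sample, since iterating over a file yields its lines.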

Ah, sorry for not responding to the comments sooner. I've been a bit busy with this and other projects.

vegaseat's eval(n) for n in my_line.split(None) comment worked as I wanted, but it was extremely slow (I have to process 3 data files each with over 60,000 lines).

paddy3118's float(n) for n in my_line.split(None) was the icing on the cake. It seems to give the same result as vegaseat's version but is much faster.
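
For anyone who wants to check the speed difference on their own machine, a rough timing sketch using the standard timeit module could look like this. The function names and the repeat count of 10000 are arbitrary choices, and the actual numbers will vary from system to system.

```python
import timeit

line = "  6.0730000e+003\t -9.2027000e+004\t  7.8891354e+01\t\r\n"

def parse_with_eval(text):
    # General-purpose: compiles and evaluates each field as a Python expression.
    return [eval(n) for n in text.split()]

def parse_with_float(text):
    # Specialised: converts each field directly to a float.
    return [float(n) for n in text.split()]

# Time 10000 calls of each parser on the same line.
eval_seconds = timeit.timeit(lambda: parse_with_eval(line), number=10000)
float_seconds = timeit.timeit(lambda: parse_with_float(line), number=10000)

print("eval:  %.3f s" % eval_seconds)
print("float: %.3f s" % float_seconds)
```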

Thanks for all the help!

Thanks for the extra work paddy3118 and grahhh. If you know the type you want, then int() or float() is faster and better than the more general eval().

And it helps when validating input data. Someone can't insert text to remove all files into the middle of your input file and have it blindly executed by eval :)
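
As a small illustration of the validation point, the sketch below shows float() refusing a field that is not a number, where eval would have tried to run it. parse_fields is a hypothetical helper, and the injected text is a harmless stand-in for something nastier.

```python
def parse_fields(line):
    """Parse whitespace-separated floats, reporting the first bad field."""
    values = []
    for position, field in enumerate(line.split()):
        try:
            values.append(float(field))
        except ValueError:
            raise ValueError("field %d is not a number: %r" % (position, field))
    return values

# A clean line parses as expected.
good = parse_fields("6.0730000e+003\t-9.2027000e+004\n")

# A line with code smuggled into it is rejected instead of executed.
try:
    parse_fields("6.0730000e+003\t__import__('os').getcwd()\n")
    rejected = False
except ValueError as error:
    rejected = True
    print(error)
```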

- Paddy.
