Parsing a text file

Hi,

I'm new to Python and am having trouble reading data into my code from a text file. The text looks like this:

>INFO> CELLID, #729,
>INFO> 20100520-035248 LightningTable (scale_1)
>INFO> LON,LAT -96.485,34.67,0

datatime, maxref, ref_-10, MaxVIL, TotalVIL, Size(km2), CGDenAVG, CGmaxden, CGCount, FlashCount, FlashDenAVG, MESH
20:03:30:05, 63.5, 59.5, 44.0613, 18091.8, 311.062, 0, 0, 0, 40.5807, 0.0332085, 27.41,
20:03:32:06, 63.5, 60.5, 48.3901, 17427.8, 270.588, 0, 0, 0, 76.9209, 0.0723621, 38.0266,

There are several "cells", and I need to pull out the FlashCount column from each cell.

Thanks.

bsh6wc
Newbie Poster
5 posts since Jul 2011
Reputation Points: 10
Solved Threads: 0
 

Drop the extra lines from the beginning and use my code snippet: http://www.daniweb.com/software-development/python/code/293490

# text based data input with data accessible
# with named fields or indexing
from __future__ import print_function  # Python 3 style printing
from collections import namedtuple
import string

filein = open("sample.dat")

# skip the '>INFO' lines and blank lines; the first remaining line is the header
for line in filein:
    if line.startswith(('>INFO', '\n')):
        continue
    # lowercase field names Python style, minus characters not allowed in identifiers
    headerline = line.lower().replace('-', '').replace('(', '').replace(')', '')
    break
# first non-letter and non-number is taken to be the separator
# (string.ascii_lowercase exists in both Python 2 and 3, string.lowercase only in 2)
separator = headerline.strip(string.ascii_lowercase + string.digits)[0]
print("Separator is '%s'" % separator)

headerline = [field.strip() for field in headerline.split(separator)]
Dataline = namedtuple('Dataline', headerline)
print('Fields are:', Dataline._fields, '\n')

for data in filein:
    if not data.strip():
        continue  # skip blank lines, which would not fill the named fields
    data = [f.strip() for f in data.rstrip('\n ' + separator).split(separator)]
    d = Dataline(*data)
    print(d.flashcount)

filein.close()
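
The separator detection above relies on `str.strip` treating its argument as a set of characters rather than a substring. A minimal sketch of that one step, assuming a cleaned header like the one in the question; the `detect_separator` name and the shortened header are made up for illustration:

```python
import string

def detect_separator(headerline):
    """Return the first character that is neither a lowercase letter nor a digit."""
    # strip() removes leading and trailing characters found in the given set,
    # so the first field name is removed and the separator is left in front
    remainder = headerline.strip(string.ascii_lowercase + string.digits)
    return remainder[0]

# shortened, already lowercased header (assumption: same layout as the sample)
header = "datatime, maxref, ref_10, maxvil, totalvil\n"
print(repr(detect_separator(header)))
```

A header with only one field would leave nothing after stripping, so the indexing would fail; the sketch does not guard against that.
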
pyTony
pyMod
Moderator
5,359 posts since Apr 2010
Reputation Points: 782
Solved Threads: 852
 

Thank you,

Each text file contains multiple cells, and I am interested in the FlashCount separated by cell. Would dropping the first few lines allow me to do that? Sorry I wasn't very clear about that before.

Also, I'm on Python 2.4.3, so I can't use namedtuple. Is there something else I could use instead?

bsh6wc
 

The named tuple is only a convenience; it lets the column position vary. If the data is always in the same column you can hard-code the index, or you can look up the correct column from the header of each cell. An additional complication is the unconventional line ending: each line ends with the separator instead of just a newline.

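filein = open("sample.dat")

# echo the '>INFO' and blank lines, then find the FlashCount column in the header
for line in filein:
    # startswith() accepts a tuple only from Python 2.5 on, so test separately
    if line.startswith('>INFO') or line == '\n':
        print(line.rstrip())
        continue
    headerline = line.split(', ')
    fieldno = headerline.index('FlashCount')
    break

for data in filein:
    if not data.strip():
        continue  # skip blank lines, which have no columns to index
    d = data.split(', ')[fieldno]
    print(d)

filein.close()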
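
For Python versions without namedtuple, another option is to pair each row with the header in a plain dict, which keeps access by field name. This is only a sketch, not the method of the snippet above: `split_fields` and `parse_row` are hypothetical helper names, the header is shortened for the example, and it assumes a comma separator with a trailing comma on each data line, as in the sample.

```python
def split_fields(line, separator=','):
    """Split one line into stripped fields, ignoring the trailing separator."""
    return [f.strip() for f in line.rstrip('\n ' + separator).split(separator)]

def parse_row(header, line):
    """Map the field names of the header to the values of one data line."""
    return dict(zip(split_fields(header), split_fields(line)))

# shortened header and row, for illustration only
header = "datatime, maxref, FlashCount, MESH\n"
row = parse_row(header, "20:03:30:05, 63.5, 40.5807, 27.41,\n")
print(row['FlashCount'])
```

The values stay strings here; convert with `float()` where numbers are needed.
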
pyTony
 

Thank you,

This has helped tremendously! I have managed to get this to work for a text file containing only 1 cell. Next, I want to get this to work with a text file containing multiple cells. If my data looks like:

>INFO> CELLID, #763,
>INFO> 20100520-035248 LightningTable (scale_1)
>INFO> LON,LAT -93.7,37.78,0

datatime, maxref, ref_-10, MaxVIL, TotalVIL, Size(km2), CGDenAVG, CGmaxden, CGCount, FlashCount, FlashDenAVG, MESH
20:03:42:29, 47, 40.5, 2.99706, 522.765, 383.863, -99900, -99900, -99900, -99900, -99900, 0.985357,
20:03:44:33, 49.5, 44, 3.88048, 807.916, 465.574, -99900, -99900, -99900, -99900, -99900, 2.5169,

>INFO> CELLID, #729,
>INFO> 20100520-035248 LightningTable (scale_1)
>INFO> LON,LAT -96.485,34.67,0

datatime, maxref, ref_-10, MaxVIL, TotalVIL, Size(km2), CGDenAVG, CGmaxden, CGCount, FlashCount, FlashDenAVG, MESH
20:03:30:05, 63.5, 59.5, 44.0613, 18091.8, 311.062, 0, 0, 0, 40.5807, 0.0332085, 27.41,
20:03:32:06, 63.5, 60.5, 48.3901, 17427.8, 270.588, 0, 0, 0, 76.9209, 0.0723621, 38.0266,

I was thinking I could somehow split the file up by searching for #'s, and then applying the bit of code I have to read a single cell. Is that a sound way of doing this? If so, how would I go about doing this?

bsh6wc
 

The code already checks for the '>INFO' lines that begin each block, and that check marks the start of a record. What remains is to put the header search and the data loop inside one proper outer loop, and to break out of the data loop correctly when the next block starts. That is your job. Of course you should save the data in the loop instead of printing it, probably in a dictionary keyed by the cell ID given in the first info line.
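
One way this could look, as a rough sketch rather than a finished solution: a single pass over the lines that starts a new dictionary entry at each CELLID line and treats the next non-INFO line as that cell's header. The function name `flashcounts_by_cell` and the inline sample lines are assumptions for illustration; the code uses only basic string methods and dicts, with no namedtuple.

```python
def flashcounts_by_cell(lines):
    """Return {cell id: [FlashCount strings]} for a multi-cell file."""
    cells = {}
    cellid = None   # id taken from the most recent '>INFO> CELLID' line
    fieldno = None  # FlashCount column, looked up again for each cell
    for line in lines:
        if not line.strip():
            continue  # blank lines only separate the parts of a cell
        if line.startswith('>INFO'):
            if 'CELLID' in line:
                # '>INFO> CELLID, #729,' -> '729'
                cellid = line.split('#')[1].strip(' ,\n')
                cells[cellid] = []
                fieldno = None  # the next non-INFO line is a header
            continue
        fields = [f.strip() for f in line.rstrip('\n ,').split(',')]
        if fieldno is None:
            fieldno = fields.index('FlashCount')  # header line
        else:
            cells[cellid].append(fields[fieldno])  # data line
    return cells

header = ("datatime, maxref, ref_-10, MaxVIL, TotalVIL, Size(km2), CGDenAVG, "
          "CGmaxden, CGCount, FlashCount, FlashDenAVG, MESH\n")
sample = [
    ">INFO> CELLID, #763,\n",
    ">INFO> 20100520-035248 LightningTable (scale_1)\n",
    ">INFO> LON,LAT -93.7,37.78,0\n",
    "\n",
    header,
    "20:03:42:29, 47, 40.5, 2.99706, 522.765, 383.863, -99900, -99900, -99900, -99900, -99900, 0.985357,\n",
    "\n",
    ">INFO> CELLID, #729,\n",
    ">INFO> 20100520-035248 LightningTable (scale_1)\n",
    ">INFO> LON,LAT -96.485,34.67,0\n",
    "\n",
    header,
    "20:03:30:05, 63.5, 59.5, 44.0613, 18091.8, 311.062, 0, 0, 0, 40.5807, 0.0332085, 27.41,\n",
    "20:03:32:06, 63.5, 60.5, 48.3901, 17427.8, 270.588, 0, 0, 0, 76.9209, 0.0723621, 38.0266,\n",
]
cells = flashcounts_by_cell(sample)
print(cells['729'])
```

Since a file object yields lines the same way a list does, passing `open("sample.dat")` instead of `sample` should work as well.
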

pyTony
 
