problems on data processing from .dat with million records

Question

clyt 0 Newbie Poster

14 Years Ago

I want to read data from a .dat file that contains millions of records with 52 fields each (most of them strings).

I am trying to store the data in 52 lists, one per field. However, it is very slow.

Here is the code:

import sys
try:
    file = open("test.dat", "r")
except IOError:
    print >> sys.stderr, "File could not be opened"
    sys.exit(1)

a = []
b = []
c = []
.
.
for record in file:
    a.append(record.split()[0])
    b.append(record.split()[1])
    c.append(record.split()[2])
    .
    .
    z.append(record.split()[51])

I also want to do some calculations on specific fields.
For example, one field is called starttime and another endtime, both in the format "yymmddhhmmss". I want to calculate the time elapsed between these two fields.
I would also like to do some data processing similar to what one does with a database.

Is there any better way to deal with this problem?
Thank you.
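
For the starttime/endtime calculation described above, a minimal sketch in Python 3 could look like the following. It assumes both fields really are 12-digit "yymmddhhmmss" strings; the names TIME_FORMAT and elapsed_seconds are hypothetical and only used for illustration.

```python
from datetime import datetime

# Assumed from the question: two-digit year, month, day, hour, minute, second.
TIME_FORMAT = "%y%m%d%H%M%S"

def elapsed_seconds(starttime, endtime):
    """Return the number of seconds between two 'yymmddhhmmss' strings."""
    start = datetime.strptime(starttime, TIME_FORMAT)
    end = datetime.strptime(endtime, TIME_FORMAT)
    # Subtracting two datetime objects yields a timedelta.
    return (end - start).total_seconds()

# Hypothetical sample values, not taken from the original data file
print(elapsed_seconds("110630123736", "110630125941"))
```

strptime raises ValueError on a malformed value, which can also help to spot bad records.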

python

Gribouillis 1,391 Programming Explorer

14 Years Ago

You are splitting each record 52 times. It's very inefficient. Perhaps you could speed things up with

rows = [ record.split() for record in file ] # warning: contains megabytes
ncols = len(rows[0])
cols = [ [record[i] for record in rows] for i in xrange(ncols) ] # doubling the used memory
# now cols should be a list of 52 lists, each representing a column
del rows # if we don't need the rows from now on

# if you want to index them on ascii letters (why would you do that ?), use a dictionary
import string
D = dict((x, cols[i]) for i, x in enumerate(string.ascii_letters))

# Now D["a"] should be the first column, etc
# If you want to create variables a, b, .. Z with the lists (why would you do that ?), use
globals().update(D)
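
As a variation on the code above, the columns can also be filled while reading, so that each record is split only once and the intermediate list of rows is never built. This is only a sketch in Python 3; read_columns, ncols and sep are hypothetical names, and the sample data stands in for the real file.

```python
import io

def read_columns(lines, ncols, sep=None):
    """Split each record once and append its fields to per-column lists."""
    cols = [[] for _ in range(ncols)]
    for lineno, line in enumerate(lines, start=1):
        fields = line.split(sep)  # sep=None splits on any whitespace
        if len(fields) != ncols:
            # Fail early with the line number instead of a bare IndexError
            raise ValueError("line %d has %d fields, expected %d"
                             % (lineno, len(fields), ncols))
        for col, value in zip(cols, fields):
            col.append(value)
    return cols

# Hypothetical three-field sample; the real file would use ncols=52
sample = io.StringIO("a1 b1 c1\na2 b2 c2\n")
cols = read_columns(sample, ncols=3)
print(cols[0])  # first column
```

With a real file, an open file object can be passed in place of the StringIO sample, since both yield one line at a time.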

Gribouillis 1,391 Programming Explorer

14 Years Ago

clyt wrote:
"Thank you for your reply
however, after trying these codes
it encounters the same problem:
Traceback (most recent call last):
IndexError: list index out of range"

IndexError may also mean that the records don't have the same number of fields, or that they apparently don't. For example if the records are separated by tab characters, it may be better to split them with record.split("\t") . It may also be useful to get rid of newline characters at the end of the line with rstrip: record.rstrip().split("\t") . Otherwise, you must add tests to know exactly where the problem happens in the input file or in the python script, and why.

For example, you could check the length of the rows with

all_length = set(len(row) for row in rows)
print all_length # should print set([52])
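
To find where the problem happens in the input file, the same check can report line numbers. The following is a sketch in Python 3, assuming tab-separated records; field_count_report and its parameters are hypothetical names, and the sample data is made up.

```python
import io
from collections import Counter

def field_count_report(lines, sep="\t", expected=52, max_examples=5):
    """Count records by number of fields and list the first bad lines."""
    counts = Counter()
    bad_lines = []
    for lineno, line in enumerate(lines, start=1):
        # Strip only the newline, so empty trailing fields are preserved
        n = len(line.rstrip("\n").split(sep))
        counts[n] += 1
        if n != expected and len(bad_lines) < max_examples:
            bad_lines.append((lineno, n))
    return counts, bad_lines

# Hypothetical sample: one good record, one empty line, one short record
sample = io.StringIO("a\tb\tc\n\nd\te\n")
counts, bad = field_count_report(sample, expected=3)
print(counts)  # how many lines have each field count
print(bad)     # (line number, field count) for the first bad lines
```

Note that an empty line counts as one (empty) field here, because splitting an empty string on an explicit separator returns a list with one element.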

clyt 0 Newbie Poster · Answer 1 · 2011-06-30T12:37:36+00:00

Thank you for your reply
however, after trying these codes
it encounters the same problem:

"Traceback (most recent call last):
IndexError: list index out of range"

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 2 · 2011-06-30T12:59:41+00:00

The only line where that could happen is line 2, if the first line of the file is empty. You could change it to

ncols = len(rows[1])

if the second line is not empty.

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 3 · 2011-06-30T13:44:59+00:00

It is also possible to get the error in line 3 if some line has fewer fields than the first line of the file.

clyt 0 Newbie Poster · Answer 4 · 2011-06-30T13:47:43+00:00

yup, i find some problems on my .dat file
thank you for your help!
