I want to read the data from a .dat file that contains millions of records (most of them strings) with 52 fields each.

I am trying to store the data in 52 lists. However, it is very slow.

Here is the code:

import sys
try:
    file = open("test.dat", "r")
except IOError:
    print >> sys.stderr, "File could not be opened"
    sys.exit(1)

a=[]
b=[]
c=[]
.
.
for record in file:
    a.append(record.split()[0])
    b.append(record.split()[1])
    c.append(record.split()[2])
    .
    .
    z.append(record.split()[51])

In addition, I want to do some calculations on specific fields.
For example, there is one field called starttime and another called endtime, both in the format "yymmddhhmmss". I want to calculate the time spent between these two fields.
I also want to do some data processing similar to what one would do with a database.

Is there any better way to deal with this problem?
Thank you.

You are splitting each record 52 times, which is very inefficient. Perhaps you could speed things up with

rows = [ record.split() for record in file ] # warning: contains megabytes
ncols = len(rows[0])
cols = [ [record[i] for record in rows] for i in xrange(ncols) ] # doubling the used memory
# now cols should be a list of 52 lists, each representing a column
del rows # if we don't need the rows from now on

# if you want to index them on ascii letters (why would you do that ?), use a dictionary
import string
D = dict((x, cols[i]) for i, x in enumerate(string.ascii_letters))

# Now D["a"] should be the first column, etc
# If you want to create variables a, b, .. Z with the lists (why would you do that ?), use
globals().update(D)
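
For the starttime/endtime calculation mentioned in the question, here is a minimal sketch. It assumes both fields really are 12-digit "yymmddhhmmss" strings and that both times are in the same time zone. The name elapsed_seconds and the sample values are made up for illustration; datetime.strptime is from the standard library.

```python
from datetime import datetime

TIME_FORMAT = "%y%m%d%H%M%S"  # yymmddhhmmss, as described in the question

def elapsed_seconds(starttime, endtime):
    """Return the number of seconds between two yymmddhhmmss strings."""
    start = datetime.strptime(starttime, TIME_FORMAT)
    end = datetime.strptime(endtime, TIME_FORMAT)
    return (end - start).total_seconds()

# Hypothetical sample values: 09:00:00 to 10:30:15 on the same day
duration = elapsed_seconds("100315090000", "100315103015")
```

Applied to two whole columns, something like [elapsed_seconds(s, e) for s, e in zip(start_col, end_col)] should give one duration per record, where start_col and end_col stand for whichever two of the 52 lists hold those fields.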

Thank you for your reply.
However, after trying this code,
I get the same problem:

"Traceback (most recent call last):
IndexError: list index out of range"

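One place this can happen is line 2 (ncols = len(rows[0])), which raises IndexError if rows is empty, for example when the file has no lines or has already been read to the end by an earlier loop. If instead the first line of the file is blank, rows[0] is an empty list and ncols becomes 0. In that case you could change it to

ncols = len(rows[1])

provided the second line is not empty.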


An IndexError may also mean that the records don't all have the same number of fields, or that they appear not to because of how they are split. For example, if the fields are separated by tab characters, it may be better to split with record.split("\t"). It may also be useful to strip the newline at the end of each line first: record.rstrip("\n").split("\t") (a bare rstrip() would also remove trailing tabs, and with them any empty trailing fields). Otherwise, you must add tests to find out exactly where the problem happens, in the input file or in the Python script, and why.

For example, you could check the length of the rows with

all_length = set(len(row) for row in rows)
print all_length # should print set([52])

It is also possible to get the error in line 3 if some line has fewer fields than the first line in the file.
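
To find exactly which lines are the problem, a small sketch along these lines may help. It assumes 52 fields per record, separated by whitespace unless a separator is given. The name find_bad_lines and the sample data are made up for illustration.

```python
def find_bad_lines(lines, expected_fields=52, sep=None):
    """Return (line_number, field_count) pairs for lines with an unexpected field count."""
    # sep=None splits on any run of whitespace; pass a tab character for tab-separated data
    bad = []
    for lineno, line in enumerate(lines, 1):
        # strip only the line ending, so empty trailing fields are still counted
        fields = line.rstrip("\r\n").split(sep)
        if len(fields) != expected_fields:
            bad.append((lineno, len(fields)))
    return bad

# Hypothetical sample: three fields expected, line 2 is blank and line 3 is short
sample = ["a b c\n", "\n", "d e\n", "f g h\n"]
bad = find_bad_lines(sample, expected_fields=3)
```

Passing the open file object instead of the sample list should report the line numbers to inspect in the .dat file without keeping the whole file in memory.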

Yup, I found some problems in my .dat file.
Thank you for your help!
