
Hello,

I need to store data in large lists (~1e7 elements, i.e. about 10 million) and I often get a memory error in code that looks like:

f = open('data.txt', 'r')
for line in f:
    fields = line.split(',')   # split once per line
    list1.append(fields[1])
    list2.append(fields[2])
    # etc.

I get the error when reading in the data, but I don't really need all the elements stored in RAM at the same time; I only ever work with chunks of that data.

So, more specifically: I have to read in ~10,000,000 entries (strings and numeric values) from 15 different columns in a text file, store them in list-like objects, do some element-wise calculations, and get summary statistics (means, standard deviations, etc.) for blocks of, say, 500,000 elements. Fast access to these blocks is essential!

I need to read everything in at once (so no f.seek() etc. to read the data a block at a time). So I'm looking for an alternative list implementation (or other list-like data structure) that would let me read in all the data, store it on disk, and load one chunk/"page" of it into RAM at a time.

Any advice on how to achieve this? Platform = Windows XP

Cheers!


Try pickling. You can store large amounts of data in (almost) native Python formats on disk, then re-load them quickly later.

The pickle module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy.

http://docs.python.org/library/pickle.html
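To make that concrete, here's a minimal sketch; the file name and the sample data are just placeholders for one of your parsed columns:

```python
import os
import pickle
import tempfile

# Stand-in for one column of values parsed from the text file.
data = list(range(1000))

# Arbitrary example path; use whatever location suits you.
path = os.path.join(tempfile.gettempdir(), 'data.pkl')

# Dump the list to disk as a byte stream...
with open(path, 'wb') as f:
    pickle.dump(data, f)

# ...and load it back into a regular Python list later.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == data)  # True
```

You could pickle each block of 500,000 entries to its own file, then unpickle only the block you need at any given moment.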

Hope it helps:
-Joe


You may want to look into a relatively new (Python 2.6+) container called namedtuple in module collections. It uses about as much memory as a regular tuple.
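A quick sketch; the Row type and its field names are just an example of what one record from your file might look like:

```python
from collections import namedtuple

# Hypothetical record type for one row of the text file.
Row = namedtuple('Row', ['name', 'value', 'count'])

r = Row('spam', 3.14, 7)
print(r.value)   # 3.14 -- fields are accessed by name...
print(r[1])      # 3.14 -- ...or by index, like a plain tuple
```

Named fields make the element-wise calculations more readable without paying a per-row dictionary overhead.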


Additionally, you can use blist, a faster implementation of lists for big data sets:

The blist is a drop-in replacement for the Python list that provides better performance when modifying large lists. The blist package also provides sortedlist, sortedset, weaksortedlist, weaksortedset, sorteddict, and btuple types.


I'm not sure how fast/efficient this would be, but you could write a custom object that overrides __setitem__, __getitem__ and __delitem__ (the item-access hooks, since the keys here are numeric indices). It would read your file, build lists, convert them to tuples every 10/20/30/40... elements, and store each block under a numeric key.

Again, I'm not sure about the speed/efficiency of this algorithm. It's just an idea.
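A rough sketch of that idea, combined with the pickling suggestion above. The PagedList class and its page size are purely illustrative: it stores fixed-size pages as pickle files and keeps only one page in RAM at a time.

```python
import os
import pickle
import tempfile

class PagedList:
    """Hypothetical disk-backed list: fixed-size pages are pickled to
    files, and only one page is held in memory at any moment."""

    def __init__(self, page_size=500000):
        self.page_size = page_size
        self.dir = tempfile.mkdtemp()
        self.length = 0
        self._cache_no = None   # which page is currently in RAM
        self._cache = []
        self._dirty = False

    def _path(self, page_no):
        return os.path.join(self.dir, 'page%d.pkl' % page_no)

    def _flush(self):
        # Write the in-RAM page back to disk if it was modified.
        if self._cache_no is not None and self._dirty:
            with open(self._path(self._cache_no), 'wb') as f:
                pickle.dump(self._cache, f)
            self._dirty = False

    def _load(self, page_no):
        # Swap the requested page into RAM, saving the old one first.
        if self._cache_no == page_no:
            return
        self._flush()
        path = self._path(page_no)
        if os.path.exists(path):
            with open(path, 'rb') as f:
                self._cache = pickle.load(f)
        else:
            self._cache = []
        self._cache_no = page_no

    def append(self, item):
        self._load(self.length // self.page_size)
        self._cache.append(item)
        self._dirty = True
        self.length += 1

    def __getitem__(self, i):
        self._load(i // self.page_size)
        return self._cache[i % self.page_size]

    def __len__(self):
        return self.length

# Quick demo with a tiny page size so several pages get exercised.
pl = PagedList(page_size=10)
for i in range(35):
    pl.append(i)
print(len(pl), pl[0], pl[23], pl[34])
```

Since your workload computes statistics over whole blocks of 500,000 anyway, a page size matching the block size would mean each summary pass touches exactly one page on disk.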
