I need to store data in large lists (~1e7 elements) and I often get a MemoryError in code that looks like:

f = open('data.txt', 'r')
for line in f:
    # etc.

I get the error when reading-in the data, but I don't really need all elements to be stored in RAM all the time. I work with chunks of that data.

So, more specifically, I have to read in ~10,000,000 entries (strings and numbers) from 15 different columns in a text file, store them in list-like objects, do some element-wise calculations and get summary statistics (means, standard deviations etc.) for blocks of, say, 500,000. Fast access to these blocks is needed!

I need to read everything in at once (so no f.seek() etc. to read the data a block at a time). So I'm looking for any alternative list implementation (or other list-like data structure) with which I could read all the data, store it on disk, and load in RAM a chunk/"page" of it at a time.

Any advice on how to achieve this? Platform: Windows XP


Try pickling. You can store large amounts of data on disk in (almost) native Python formats, then reload them quickly later.

The pickle module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy.
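As a rough illustration of how pickling could fit the block-wise workflow in the question, here is a minimal sketch that writes rows to disk in fixed-size blocks and later loads a single block back into RAM. The helper names (dump_blocks, load_block), the file naming scheme and the tiny block size are assumptions made for the example, not something from the question.

```python
import os
import pickle
import tempfile


def _write_block(block, directory, index):
    # One pickle file per block; the naming scheme is arbitrary.
    path = os.path.join(directory, "block_%05d.pkl" % index)
    with open(path, "wb") as fh:
        pickle.dump(block, fh, pickle.HIGHEST_PROTOCOL)


def dump_blocks(rows, directory, block_size):
    """Consume an iterable of rows and pickle it in blocks of block_size."""
    block, count = [], 0
    for row in rows:
        block.append(row)
        if len(block) == block_size:
            _write_block(block, directory, count)
            block, count = [], count + 1
    if block:  # final, possibly shorter, block
        _write_block(block, directory, count)
        count += 1
    return count


def load_block(directory, index):
    """Load one block back into memory."""
    path = os.path.join(directory, "block_%05d.pkl" % index)
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Tiny demo: 10 rows in blocks of 4 (a real run might use 500,000).
demo_dir = tempfile.mkdtemp()
n_blocks = dump_blocks(((i, str(i)) for i in range(10)), demo_dir, block_size=4)
print(n_blocks, load_block(demo_dir, 2))
```

Because dump_blocks consumes a generator, only one block of rows is held in memory while the file is being read.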


Hope it helps:

You may want to look into a relatively new (Python 2.6+) container called namedtuple in the collections module. It uses about as much memory as a regular tuple.
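A small sketch of what a namedtuple row could look like. The three field names and the tab separator are hypothetical stand-ins for the 15 real columns.

```python
from collections import namedtuple

# Hypothetical column names; the real file has 15 columns.
Record = namedtuple("Record", ["name", "x", "y"])


def parse_line(line, sep="\t"):
    """Turn one text line into a Record, converting the numeric columns."""
    name, x, y = line.rstrip("\n").split(sep)
    return Record(name, float(x), float(y))


rec = parse_line("a\t1.5\t2\n")
print(rec.name, rec.x + rec.y)
```

Fields can be read by name (rec.x) or by position (rec[1]), so code written for plain tuples keeps working.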

Additionally you can use blist, a faster implementation of lists for big data sets:

The blist is a drop-in replacement for the Python list that provides better performance when modifying large lists. The blist package also provides sortedlist, sortedset, weaksortedlist, weaksortedset, sorteddict, and btuple types.

I'm not sure how fast or efficient this would be, but you could write a custom object that implements __setattr__, __getattr__ and __delattr__. It would read your file, build lists, convert them into tuples every 10/20/30/40... elements, and store each tuple under a numeric key.

Again, I'm not sure about the speed/efficiency of this approach. It's just an idea.
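One possible reading of the custom-object idea, as a sketch only: an append-only container that converts every full page of items into a tuple, pickles it to disk under a numeric page index, and keeps just one loaded page in RAM. Note that it uses __getitem__ and __len__ (the hooks behind obj[i] and len(obj)) rather than the attribute hooks named above. The class name PagedList, its page method and the file layout are all assumptions for illustration.

```python
import os
import pickle
import tempfile


class PagedList(object):
    """Append-only list-like object that keeps one page in RAM at a time."""

    def __init__(self, directory, page_size=500000):
        self.directory = directory
        self.page_size = page_size
        self._length = 0
        self._tail = []            # page being filled, not yet on disk
        self._cached_index = None  # index of the page currently in RAM
        self._cached_page = None

    def _path(self, page_index):
        return os.path.join(self.directory, "page_%06d.pkl" % page_index)

    def append(self, item):
        self._tail.append(item)
        self._length += 1
        if len(self._tail) == self.page_size:
            # Page is full: store it as a tuple under its numeric index.
            page_index = self._length // self.page_size - 1
            with open(self._path(page_index), "wb") as fh:
                pickle.dump(tuple(self._tail), fh, pickle.HIGHEST_PROTOCOL)
            self._tail = []

    def __len__(self):
        return self._length

    def page(self, page_index):
        """Return a whole page as a tuple, loading it from disk if needed."""
        full_pages = self._length // self.page_size
        if page_index == full_pages and self._tail:
            return tuple(self._tail)
        if not 0 <= page_index < full_pages:
            raise IndexError("page index out of range")
        if page_index != self._cached_index:
            with open(self._path(page_index), "rb") as fh:
                self._cached_page = pickle.load(fh)
            self._cached_index = page_index
        return self._cached_page

    def __getitem__(self, index):
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("index out of range")
        page_index, offset = divmod(index, self.page_size)
        return self.page(page_index)[offset]


# Tiny demo with a page size of 4 instead of 500,000.
data = PagedList(tempfile.mkdtemp(), page_size=4)
for i in range(10):
    data.append(i * i)
print(len(data), data[5], data[-1], data.page(1))
```

Sequential access stays cheap because consecutive indexes hit the cached page; random access across pages would reload a pickle file on almost every lookup, so block-wise statistics are probably best computed from page(n) directly.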