
Read large input file to memory

I wrote a Python program that reads its input from a file. It works fine for small input files. After opening the file, I get the data like this:

data=fPtr.readlines()

Since readlines() reads the whole file and packs every line into a list, it seems clear that it won't work for large input files. The problem is that I need to extract all the data from the input file before I begin any operation.
I will be doing lots of looping in the program and I don't know whether opening and closing the file in a single loop would be efficient.
Please advise on the best option.

Happy times

sureronald
Junior Poster
139 posts since May 2008
Reputation Points: 11
Solved Threads: 19
 

So the question is:
How large is your data file?

vegaseat
DaniWeb's Hypocrite
Moderator
5,989 posts since Oct 2004
Reputation Points: 1,345
Solved Threads: 1,417
 

A Python list will hold something like 2 trillion items, but is going to be pretty slow with a very large number of records in it. If your list is going to be 100 million records or more, then consider an SQLite database instead. If it's a paltry one million records (we now live in a gigabyte world), then there should not be a problem, but you might want to consider using a dictionary or set, as both are indexed via a hash and would be much faster on lookups.
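
To illustrate the two options mentioned above, here is a minimal sketch. The function names (load_records, load_into_sqlite, in_sqlite) and the table name "records" are made up for this example, and it assumes each line of the input is one record. It shows the idea only and is not a tuned implementation.

```python
import sqlite3


def load_records(lines):
    """Build a list (keeps order and duplicates) and a set (fast membership tests)."""
    records = [line.rstrip("\n") for line in lines]
    return records, set(records)


def load_into_sqlite(lines, conn):
    """Store one record per row; the table name 'records' is hypothetical."""
    conn.execute("CREATE TABLE IF NOT EXISTS records (value TEXT)")
    conn.executemany(
        "INSERT INTO records (value) VALUES (?)",
        ((line.rstrip("\n"),) for line in lines),
    )
    # An index lets the database find a value without scanning every row.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_records_value ON records (value)")
    conn.commit()


def in_sqlite(conn, value):
    """Return True if at least one row holds the given value."""
    row = conn.execute(
        "SELECT 1 FROM records WHERE value = ? LIMIT 1", (value,)
    ).fetchone()
    return row is not None
```

With the set, `"some record" in lookup` is a hash lookup, while the same test on the list walks the items one by one. For the database, pass a file path to sqlite3.connect() instead of ":memory:" so the data does not have to fit in RAM.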

woooee
Nearly a Posting Maven
2,454 posts since Dec 2006
Reputation Points: 777
Solved Threads: 714
 

I realized that a Python list can hold as much data as the computer's memory allows. To verify this, I ran the lines below in the Python interpreter, and in another console I ran the top command to monitor the interpreter's memory consumption.

li=[]
while True:
   li.append("king")

There was no error. The list kept growing, and so did the memory consumption of the Python interpreter.
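
A bounded variant of the loop above can be sketched as follows, for anyone who wants to see the growth without exhausting memory. The function name list_growth is made up for this example. Note that sys.getsizeof reports only the list object itself (its array of references), not the strings it points to, so this is a rough indicator rather than the total seen in top.

```python
import sys


def list_growth(n_items, item="king"):
    """Append the same item n_items times and record the list's size after each append."""
    items = []
    sizes = []
    for _ in range(n_items):
        items.append(item)
        # Size in bytes of the list object only, not of the items it references.
        sizes.append(sys.getsizeof(items))
    return sizes


if __name__ == "__main__":
    sizes = list_growth(100_000)
    print("first:", sizes[0], "bytes, last:", sizes[-1], "bytes")
```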
The reason I posted this question is that I thought it was a bug in a program I had submitted to an online judge that normally tests programs with large input files.
Many thanks to all contributors!

sureronald
Junior Poster
139 posts since May 2008
Reputation Points: 11
Solved Threads: 19
 

As far as performance is concerned, reading the file from disk will be the slowest part by far!

bumsfeld
Nearly a Posting Virtuoso
1,445 posts since Jul 2005
Reputation Points: 404
Solved Threads: 184
 

I am trying to open a big file (> 1 GB), but I am getting a MemoryError.
The code is:
for line in open('data.txt', 'r').readlines():

This line worked for me when the file size was around 750 MB, but it gives an error when the file is larger than 1 GB.

Is there any remedy for this?
I don't want to read the file string-wise or character-wise, as that would mean changing my whole code.

Thanks,
Mahesh

mahesham
Newbie Poster
6 posts since May 2010
Reputation Points: 10
Solved Threads: 0
 

Does the code run without readlines(), and how fast is it for 1 GB (compared with 750 MB before)?

i.e. for line in open('data.txt', 'r'):

Could you post the main code? Maybe we could optimize it together.

It is usually best to use a generator for huge data files.
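
The generator idea can be sketched like this. The names iter_records and count_nonempty are made up for this example, and the UTF-8 encoding is an assumption about the data file. A file object yields one line at a time when iterated, so only the current line is held in memory.

```python
def iter_records(path, encoding="utf-8"):
    """Yield the file's lines one at a time, without the trailing newline."""
    # The with block closes the file once the generator is exhausted or closed.
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


def count_nonempty(path):
    """Example consumer: count non-empty lines without building a list."""
    return sum(1 for record in iter_records(path) if record)
```

Existing code that loops with `for line in ...` can usually switch to `for line in iter_records('data.txt'):` with no other changes, as long as it does not need to index into the lines or go over them twice.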

pyTony
pyMod
Moderator
5,359 posts since Apr 2010
Reputation Points: 782
Solved Threads: 852
 

I changed the code to:
for line in open('data.txt', 'r'):
and it works now.

For me, speed is not a concern.
Thanks for the help.

mahesham
Newbie Poster
6 posts since May 2010
Reputation Points: 10
Solved Threads: 0
 

This question has already been solved
