Hey all,

I ran some code today that digested a 304 MB input file, consisting of about 10 million lines with six columns apiece. The code hit an indexing error. In troubleshooting, I copied only the first 1,000,000 lines into a new data file, ran it, and the code ran fine. Google searches say that Python arrays are limited only by the machine's RAM.

My question is: how can I get a sense of the limitations of my machine? The machine I'm running the code on has several 64-bit processors and 8 GB of RAM. Is there an exact way (say, a built-in command?) to test whether a data file will be too large, without actually running the code and waiting for it to error out? Secondly, what would you recommend I do to avoid such a problem in the future? Lastly, is there a smart way to amend the code so that if it fails, it will tell me exactly at which line it failed, so I get a sense of how far it got before crashing?

Thanks


I am running file processing on larger (>1 GB) files on a far weaker machine. I find it hard to believe your case ran into a hard limitation. In my experience, a never-ending program is more likely than one that runs out of memory.

Try catching the exception that occurs and print out the line number. The other way is to write out a "rejected records" file.
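Something along these lines, as a rough sketch (the file names, the semicolon delimiter, and process_record are just placeholders for whatever your real per-line work is):

def process_record(line):
    # placeholder: split into the six expected columns and touch the last one
    cols = line.rstrip("\n").split(";")
    return cols[5]          # raises IndexError on a short record

with open("data.txt") as infile, open("rejected.txt", "w") as rejects:
    for count, line in enumerate(infile, start=1):
        try:
            process_record(line)
        except Exception:
            rejects.write("%d: %s" % (count, line))   # keep the bad line for later inspection

Afterwards you can look at rejected.txt and see exactly which records broke the parsing.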

Based on my experience, the problem will be a malformed record having fewer than six columns, maybe the last line or the header.

If possible, write the program to be line-oriented, i.e. do not read the whole file into memory at once.

for line in open(fname):
    # process the line, write other files, do aggregation, whatever
    pass

If everything else fails, and you firmly believe you have reached some hard barrier, import it into an sqlite3 database. That can reach terabytes...
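If you do go the sqlite3 route, a minimal sketch would look something like this (the table layout, file names, and the semicolon delimiter are just assumptions; adjust them to your file):

import csv
import sqlite3

conn = sqlite3.connect("bigdata.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (c1, c2, c3, c4, c5, c6)")
with open("data.txt") as infile:
    reader = csv.reader(infile, delimiter=";")
    # executemany consumes the reader row by row, so the whole file never sits in memory
    conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?, ?, ?)", reader)
conn.commit()
conn.close()

Once it is in the database you can query or export slices of it without ever loading all 10 million rows at once.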

What I don't understand is, if the code was inherently flawed, why did it run on a 1,000,000-line file perfectly fine?

Is there a smart way to amend the code so that if it fails, it will tell me exactly at which line it failed, so I get a sense of how far it got before crashing?

Yes, there is a smart way: try/except.

for count, line in enumerate(fileobject):
    try:
        do_your_stuff(line)   # your per-line processing goes here
    except Exception:
        print("Some error occurred on line %s" % count)
        print("The bad line was:")
        print(line)
        raise

That will print out the bad line and its line number, and re-raising the exception gives you the full traceback. You will know exactly which line of the file caused the failure and what the exception was.

What I don't understand is, if the code was inherently flawed, why did it run on a 1,000,000-line file perfectly fine?

Well, if you have 10**6 lines with the structure:
number;number;characters;number

and then the (10**6 + 1)th line contains data like:
1;2;asd;jkle;3

then your program will most likely crash.
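To make that concrete, here is a tiny illustration (the parsing style is only a guess at what your real code does):

# How one malformed record crashes a parser that assumes a fixed layout.
good = "1;2;asd;3"
bad = "1;2;asd;jkle;3"   # the extra column from the example above
short = "1;2;3"          # a record with fewer columns than expected

a, b, c, d = good.split(";")          # fine: exactly four fields
try:
    a, b, c, d = bad.split(";")       # too many fields to unpack
except ValueError as e:
    print("crashed on the extra column:", e)
try:
    print(short.split(";")[5])        # the sixth column does not exist
except IndexError as e:
    print("the kind of indexing error you saw:", e)

So the first million lines being clean explains why the truncated file ran fine; the crash only happens when the loop reaches the one malformed record further down.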

Thanks for the help, slate.
