Hey guys,

A while ago, with your help, I was able to create code which scans a data file for names and, whenever a new name is found, appends it to a list. For each entry in the list, the code then opens a new file. Using the same code, I am now trying to parse huge data files with over 100 names each. Python doesn't like opening such a large number of files and complains:

IOError: [Errno 24] Too many open files: '/home/hugadams/Desktop/SHANE_STUFF/Modules/Rep_Split_FULL/MER65B-int'

I was wondering if any of you knew a way around this, whether it be an import or otherwise. I'd prefer not to completely restructure my code if there is a simple import that can allow Python to open more files.

Can you post some of your code? I'm confused as to why you need to hold so many files open at one time... could you just cycle them one by one and not have them all open?

The maximum number of open files a process is allowed to have comes from the Operating System.

It's usually possible to change this (the details vary), but doing so almost certainly requires admin/root privileges.

It's also something which shouldn't be done lightly to save a bit of inconvenience on your part.

Just keep in mind for your future designs that you don't have an unlimited number of open files to play with.

My question: do you really need this? Why?
If you need to open 100+ files at once, you should perhaps refactor your code and design.

BTW, if you are on *nix (Unix, Linux), you can run the following command to see the system-wide maximum number of open file descriptors.
Note that $ is my shell prompt.

$ cat /proc/sys/fs/file-max
95141

Hence, at most 95141 file descriptors can be open at once across the whole system.
To change this, use the following, where 104854 is the maximum you want:

$ echo "104854" > /proc/sys/fs/file-max

To check how many file descriptors are currently in use:

$ cat /proc/sys/fs/file-nr
5024	0	95141

The first number is the total number of allocated file descriptors (allocated since boot), the second is the number allocated but unused, and the third is the system-wide maximum.
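
If you want to see the limit from inside Python itself, the standard-library resource module reports the per-process cap, which is the one a single script actually runs into (as opposed to the system-wide file-max above). A minimal sketch, assuming a Unix-like system:

import resource

# Per-process limit on open file descriptors: (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("per-process limit: soft=%s, hard=%s" % (soft, hard))

# Raising the soft limit up to the hard limit normally does not need root;
# going beyond the hard limit does.
try:
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
except (ValueError, OSError):
    print("could not raise the limit on this system")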

I agree with everyone else here. There should never ever be a reason to open this many files at once. Ever.

You should explain what you're actually trying to do so that we can help you come up with an intelligent solution.

Ok, here is what I am doing:

First my code scans a large data file. In column 10, the genetic data is listed by "name", and there are about 110 different names total. Whenever it gets to a new name, it stores it in a list, so I have something like:

Name_List = ['AluSp', 'AluGp', 'AluSx' ... 'ZZcta' ]

For each name in the name list, I want to store a file of the same name, eg:

AluSp.bed, AluGp.bed, AluSx.bed...ZZcta.bed

The program then rescans the data file, and each line gets written into the appropriate name's file.

Because the raw data file is so large, I can't just open one file for a name in the list, scan the whole data file appending entries for only that name (AluSp.bed), and then close it; doing so would require 110 scans of a 50-million-line file. Instead, I scan the raw data file once while all 110 name files remain open, and each line in the data file gets written to the appropriate name file.

Is that clear?

The code is in place, so if you'd prefer to look at it, I can post it.
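
Roughly, the one-pass version looks like this (the input filename and the column index here are just placeholders standing in for my real code):

# Rough sketch of the one-pass approach: keep one handle open per name and
# route each line to the matching .bed file.
handles = {}
with open('raw_data.txt') as raw:          # placeholder input filename
    for line in raw:
        name = line.split()[9]             # column 10 holds the name
        if name not in handles:
            handles[name] = open(name + '.bed', 'w')
        handles[name].write(line)

for out in handles.values():               # close everything after the scan
    out.close()

Every one of those descriptors stays open until the scan finishes, which is what runs into the limit.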

The maximum number of open files a process is allowed to have comes from the Operating System.
...

That is not all of it. In Python you have constructs like:

for line in file(filename):
    # do something with line

Here the closing of the file is left to the very eagle-eyed Python garbage collector. It is the garbage collector and its file closing algorithm that will feel overwhelmed after too many file opens.
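
If you want the file closed as soon as you're done with it, rather than whenever the collector gets around to it, a with block (Python 2.6 and later) closes it deterministically; a minimal sketch:

# Sketch: the context manager closes the file as soon as the block exits,
# instead of leaving the close to the garbage collector.
with open(filename) as handle:
    for line in handle:
        pass  # do something with line
# handle is guaranteed closed here, even if the loop raised an exception.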

There's no need to keep all the files open at the same time, though. Instead of having a list of 100 file handles, replace it with a list of the filenames; then, each time you want to write to one of them, just open the file, write to it, and close it.

Won't the process of opening and closing the file cause it to get overwritten, instead of the line being appended?

Not if you open it in append mode. That means anything previously in the file will stay there, which might not be desirable on the first write (since you presumably want the file to contain only data from the current run). To fix this, just keep track of whether you've written to the file before; if you have, open it in append mode, otherwise open it in write mode.

I'm curious now how long it takes your computer to parse this "50 million line file" with Python, seeing how slow Python is compared to things like C... unless of course you were exaggerating the number of lines.
Anyway, if you open a file with fhandle = open('myfile', 'a'), then when you call write() on it, it'll just append. Like The_Kernel said, just make sure you keep track of whether the file has been created yet, and if not, open it with the mode parameter as "w".
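
Put together, something along these lines (the input filename and column index are just placeholders for whatever your script actually uses):

# Rough sketch of the open/write/close approach: no handle stays open
# between writes; each .bed file is opened in 'w' on its first write this
# run and in 'a' afterwards.
seen = set()                               # names already written this run

with open('raw_data.txt') as raw:          # placeholder input filename
    for line in raw:
        name = line.split()[9]             # column 10 holds the name
        mode = 'a' if name in seen else 'w'
        with open(name + '.bed', mode) as out:
            out.write(line)
        seen.add(name)

Opening and closing a file for every single line is a lot of system calls over 50 million lines; if it turns out to be too slow, buffering lines per name in a dict and flushing each batch with one open/write/close is a middle ground that still keeps only one file open at a time.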

Thanks.

EDIT: Concept already demonstrated
