Splitting a large file evenly

Hello all,

I searched past threads for this problem without success, so sorry if this is a repost.

I have a large file with regular data. The file is too large for me to open all at once, so I need to split it into output files of equal size (preferred), named according to a regular pattern.

So if my full file is called 'bigfile.txt' I want to output:

subfile_1.txt, subfile_2.txt etc... where each subfile has a user-determined number of lines (say 100k). Also, if this program gets to the end of bigfile, what do you recommend I do so that it doesn't error, since the last subfile probably won't be exactly 100k lines?

Thanks.

shoemoodoshaloo
Junior Poster
168 posts since May 2009
Reputation Points: 16
Solved Threads: 6
 

Here is a function to split the file:

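from itertools import islice, count

def split_file(filename, subfile_prefix, max_lines):
    """Split filename into files <subfile_prefix>_<i>.txt of at most max_lines lines each."""
    assert max_lines > 0
    try:
        with open(filename, "rb") as src_file:
            for i in count(1): # 1, 2, 3, ... indefinitely
                # Read one line first: StopIteration is raised at the end of
                # src_file, before an empty subfile can be created.
                line = next(src_file)
                dst_filename = "{0}_{1:d}.txt".format(subfile_prefix, i)
                with open(dst_filename, "wb") as dst_file:
                    dst_file.write(line)
                    # Copy up to max_lines-1 further lines into this subfile.
                    dst_file.writelines(islice(src_file, 0, max_lines-1))
    except StopIteration:
        pass # end of the source file: the last subfile simply holds fewer lines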

You only need to add command-line argument parsing to turn this into a script :)
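As a rough illustration of that last step, here is a minimal sketch of such a script using the standard argparse module. The option names (-p/--prefix, -n/--max-lines), their defaults, and the main() wrapper are assumptions made for this example and do not come from the thread; split_file is repeated only so the block runs on its own.

```python
import argparse
from itertools import islice, count


def split_file(filename, subfile_prefix, max_lines):
    """Same function as in the post above, repeated to keep the sketch self-contained."""
    assert max_lines > 0
    try:
        with open(filename, "rb") as src_file:
            for i in count(1):
                line = next(src_file)  # raises StopIteration at end of file
                dst_filename = "{0}_{1:d}.txt".format(subfile_prefix, i)
                with open(dst_filename, "wb") as dst_file:
                    dst_file.write(line)
                    dst_file.writelines(islice(src_file, 0, max_lines - 1))
    except StopIteration:
        pass


def main(argv=None):
    # Hypothetical command-line interface; option names and defaults are illustrative.
    parser = argparse.ArgumentParser(
        description="Split a file into numbered subfiles.")
    parser.add_argument("filename", help="file to split, e.g. bigfile.txt")
    parser.add_argument("-p", "--prefix", default="subfile",
                        help="prefix of the output files (default: subfile)")
    parser.add_argument("-n", "--max-lines", type=int, default=100000,
                        help="maximum number of lines per subfile (default: 100000)")
    args = parser.parse_args(argv)
    if args.max_lines <= 0:
        parser.error("--max-lines must be a positive integer")
    split_file(args.filename, args.prefix, args.max_lines)

# In a real script, finish with the usual guard:
#     if __name__ == "__main__":
#         main()
```

Saved under a name of your choice (say split_file.py, also hypothetical) with the guard added, it could be run as: python split_file.py bigfile.txt -p subfile -n 100000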

Gribouillis
Posting Maven
Moderator
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
 

Here is a better version, which uses a generic helper function to group the items of any iterable evenly:

from itertools import islice, chain

def grouper(n, iterable):
    """split an iterable into a sequence of iterables with at most n items"""
    assert n > 0
    it = iter(iterable)
    return (chain((item,), islice(it, 0, n-1)) for item in it)

def split_file(filename, subfile_prefix, max_lines):
    with open(filename, "rb") as src_file:
        for i, group in enumerate(grouper(max_lines, src_file), 1):
            dst_filename = "{0}_{1:d}.txt".format(subfile_prefix, i)
            with open(dst_filename, "wb") as dst_file:
                dst_file.writelines(group)

Notice that there is a similar grouper() function in the itertools module's documentation for Python 3 (the function is not part of the module). This one is different (I don't like the implementation described in the doc). Also note that each group's items must be "consumed" before the next group is requested for the algorithm to work, because all the groups read from the same underlying iterator.
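To make the remark about consumption concrete, here is a small sketch using the grouper() above on range(7) (an arbitrary test input chosen for this example). It compares consuming each group as it is produced with collecting all the groups first and reading them afterwards.

```python
from itertools import islice, chain


def grouper(n, iterable):
    """split an iterable into a sequence of iterables with at most n items"""
    assert n > 0
    it = iter(iterable)
    return (chain((item,), islice(it, 0, n - 1)) for item in it)


# Each group is consumed before the next one is requested: works as intended.
consumed_in_order = [list(group) for group in grouper(3, range(7))]
print(consumed_in_order)  # [[0, 1, 2], [3, 4, 5], [6]]

# All groups are requested first and consumed later: every item starts its own
# group, and the shared iterator is already exhausted when the groups are read.
groups = list(grouper(3, range(7)))
consumed_late = [list(group) for group in groups]
print(consumed_late)  # [[0], [1], [2], [3], [4], [5], [6]]
```

split_file() is safe in this respect because writelines(group) writes out each group completely before the loop asks for the next one.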

Gribouillis

Thanks, this is great!

shoemoodoshaloo
lordspace
Junior Poster in Training
90 posts since May 2006
Reputation Points: 18
Solved Threads: 6
 
http://usage.cc/split


Nice, I didn't think to look among the Linux commands. Splitting a file is a basic task. My Python code is cross-platform, however.

Gribouillis
