Hello all,

I've tried to find this problem in past threads, so sorry if this is a repost.

I have a large file with regular data. The file is too large for me to open all at once, so I need to split it into (preferably) evenly sized output files with a regular naming scheme.

So if my full file is called 'bigfile.txt' I want to output:

subfile_1.txt, subfile_2.txt etc... where each subfile has a user-determined number of lines (say 100k). Also, if this program gets to the end of bigfile, what do you recommend I do so that it doesn't error, since the last subfile probably won't be exactly 100k lines?

Thanks.

Here is a function to split the file:

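from itertools import islice, count

def split_file(filename, subfile_prefix, max_lines):
    """Split filename into subfile_prefix_1.txt, subfile_prefix_2.txt, ...
    with at most max_lines lines each; the last subfile may be shorter."""
    assert max_lines > 0
    try:
        with open(filename, "rb") as src_file:
            for i in count(1): # 1, 2, 3, ... indefinitely
                # Read one line first: at the end of src_file this raises
                # StopIteration before an empty subfile can be created.
                line = next(src_file)
                dst_filename = "{0}_{1:d}.txt".format(subfile_prefix, i)
                with open(dst_filename, "wb") as dst_file:
                    dst_file.write(line)
                    # Copy the rest of this chunk (fewer lines at end of file).
                    dst_file.writelines(islice(src_file, 0, max_lines-1))
    except StopIteration:
        pass # normal termination: the source file is exhausted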

You only need to add command line arguments parsing to turn this into a script :)
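For what it's worth, the command line parsing could look something like the sketch below. It is only one possible way to do it: the option names (-p/--prefix, -n/--max-lines), their defaults and the main() wrapper are my own choices, not anything required by the function. split_file is repeated so that the snippet runs on its own.

```python
import argparse
from itertools import islice, count


def split_file(filename, subfile_prefix, max_lines):
    """Same function as above, repeated so this snippet is self-contained."""
    assert max_lines > 0
    try:
        with open(filename, "rb") as src_file:
            for i in count(1):
                line = next(src_file)  # StopIteration at end of file
                dst_filename = "{0}_{1:d}.txt".format(subfile_prefix, i)
                with open(dst_filename, "wb") as dst_file:
                    dst_file.write(line)
                    dst_file.writelines(islice(src_file, 0, max_lines - 1))
    except StopIteration:
        pass


def main(argv=None):
    # Option names and defaults below are illustrative assumptions.
    parser = argparse.ArgumentParser(
        description="Split a text file into numbered subfiles.")
    parser.add_argument("filename", help="file to split, e.g. bigfile.txt")
    parser.add_argument("-p", "--prefix", default="subfile",
                        help="prefix of the output files (default: subfile)")
    parser.add_argument("-n", "--max-lines", type=int, default=100000,
                        help="maximum number of lines per subfile")
    args = parser.parse_args(argv)
    if args.max_lines <= 0:
        parser.error("--max-lines must be a positive integer")
    split_file(args.filename, args.prefix, args.max_lines)
```

To use it as a script, call main() under the usual if __name__ == "__main__": guard at the bottom of the file, then run something like "python split.py bigfile.txt -n 100000" (split.py being whatever you name the file).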

Edited 5 Years Ago by Gribouillis: n/a

Here is a better version, which uses a generic helper function to group the items of any iterable into chunks of equal size:

from itertools import islice, chain

def grouper(n, iterable):
    """split an iterable into a sequence of iterables with at most n items"""
    assert n > 0
    it = iter(iterable)
    return (chain((item,), islice(it, 0, n-1)) for item in it)

def split_file(filename, subfile_prefix, max_lines):
    with open(filename, "rb") as src_file:
        for i, group in enumerate(grouper(max_lines, src_file), 1):
            dst_filename = "{0}_{1:d}.txt".format(subfile_prefix, i)
            with open(dst_filename, "wb") as dst_file:
                dst_file.writelines(group)

Note that there is a similar grouper() function in the itertools module's documentation for Python 3 (the function is a recipe, not part of the module itself). This one is different (I don't like the implementation described in the doc). Also note that each group's items must be "consumed" before the next group is requested in order for the algorithm to work, because all the groups read from the same underlying iterator.
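To illustrate the point about consuming the groups, here is a small experiment on range(8) instead of a file. The names "consumed" and "unconsumed" are just for this demonstration; grouper is repeated so that the snippet runs on its own.

```python
from itertools import chain, islice


def grouper(n, iterable):
    """split an iterable into a sequence of iterables with at most n items"""
    assert n > 0
    it = iter(iterable)
    return (chain((item,), islice(it, 0, n - 1)) for item in it)


# Each group is fully read before the next one is requested:
# this is what split_file does with writelines(group).
consumed = [list(group) for group in grouper(3, range(8))]

# Here all the groups are created first and only read afterwards.
# Every group took just its first item from the shared iterator,
# so nothing is left for the islice part when it is finally read.
unconsumed = [list(group) for group in list(grouper(3, range(8)))]

print(consumed)    # [[0, 1, 2], [3, 4, 5], [6, 7]]
print(unconsumed)  # [[0], [1], [2], [3], [4], [5], [6], [7]]
```

So if you need to keep the groups around, convert each one to a list (or tuple) as soon as you get it.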

