I have a 2 GB file with 10 million lines. Is there any way to sort the file's contents without loading the whole file into memory?

Discussion span: 6 years
Last post by WaltP

You probably need to use a merge-sort algorithm: read the file in small chunks, sort them, then write each chunk into its own file. When that is done you might wind up with 10-15 files, depending on the size of each chunk. After that, open all the files at once and begin merging them into the final destination file.
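A minimal sketch of that external merge sort, assuming a line-based text file. The chunk size and the `run*.tmp` file names are illustrative choices, not anything from the original post:

```cpp
// External merge sort sketch: split the input into sorted runs on disk,
// then k-way merge the runs with a min-heap.
#include <algorithm>
#include <fstream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Phase 1: read up to chunk_size lines at a time, sort each chunk in memory,
// and write it to its own run file. Returns the run file names.
std::vector<std::string> make_runs(const std::string& input, std::size_t chunk_size) {
    std::ifstream in(input);
    std::vector<std::string> run_files;
    std::vector<std::string> chunk;
    std::string line;
    auto flush = [&]() {
        if (chunk.empty()) return;
        std::sort(chunk.begin(), chunk.end());
        std::string name = "run" + std::to_string(run_files.size()) + ".tmp";
        std::ofstream out(name);
        for (const auto& l : chunk) out << l << '\n';
        run_files.push_back(name);
        chunk.clear();
    };
    while (std::getline(in, line)) {
        chunk.push_back(line);
        if (chunk.size() == chunk_size) flush();
    }
    flush();
    return run_files;
}

// Phase 2: open all run files at once and merge them; the heap always
// yields the smallest of the current front lines across all runs.
void merge_runs(const std::vector<std::string>& runs, const std::string& output) {
    std::vector<std::ifstream> streams;
    for (const auto& r : runs) streams.emplace_back(r);
    using Entry = std::pair<std::string, std::size_t>;  // (line, run index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::string line;
    for (std::size_t i = 0; i < streams.size(); ++i)
        if (std::getline(streams[i], line)) heap.emplace(line, i);
    std::ofstream out(output);
    while (!heap.empty()) {
        auto [smallest, idx] = heap.top();
        heap.pop();
        out << smallest << '\n';
        if (std::getline(streams[idx], line)) heap.emplace(line, idx);
    }
}
```

Only one chunk of lines is in memory during phase 1, and only one line per run during phase 2, so peak memory stays at roughly the chunk size regardless of the file size.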

You can probably find a program that already does that.


The first thing you should try is a memory-mapped file. Depending on what is in your file, you might have to copy it first: you create a memory-mapped file and "load" the content of your big file into that "memory", which in actuality has the effect of copying the file into the memory-mapped file. Then you can use that memory as if it were RAM (although much slower, of course), and at that point you can just use the standard std::sort() function.

std::sort() is a modified quick-sort-like algorithm, so the main operation is partitioning around a pivot element. Partitioning is usually done by starting at both ends of the array (or list) and swapping elements that are on the wrong side until both traversal pointers (iterators) meet in the middle. These are sequential memory-access patterns that suit pre-fetching mechanisms very well, so you can expect fairly good caching performance and a limited amount of RAM in use. Whatever you do, a memory-mapped file is most likely going to be the best of all options, performance-wise.
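A minimal POSIX sketch of the memory-mapped approach. Note the assumption baked in here: std::sort() needs random access over equally sized elements, so this only works directly if the file holds fixed-size binary records (raw ints in this sketch), not variable-length text lines:

```cpp
// Memory-map a file of raw ints and sort it in place with std::sort().
// POSIX-only (mmap); the OS pages the file in and out as needed, so the
// whole file never has to fit in RAM at once.
#include <algorithm>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

bool sort_file_in_place(const char* path) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return false;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }
    std::size_t n = st.st_size / sizeof(int);
    void* mem = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { close(fd); return false; }
    int* data = static_cast<int*>(mem);
    std::sort(data, data + n);            // treat the mapping as an ordinary array
    munmap(mem, st.st_size);              // flushes the sorted data back to the file
    close(fd);
    return true;
}
```

With MAP_SHARED, the swaps performed by std::sort() are written back to the file itself, so no separate output pass is needed.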

If you can't or don't want to use a memory-mapped file, you can emulate one with your own iterator implementation (some kind of random-access file iterator). But it will be quite difficult to make such an implementation efficient, especially one with good pre-fetching behavior.

The above two options are the ones that let you reuse the standard sort algorithm on the entire file (without having to load it all into memory). The remaining options involve coding your own "special" sorting algorithm.

One idea is to re-implement the standard sort algorithm (it's an intro-sort algorithm, not very hard to implement at all) but using a file stream instead of iterators. You can probably do a decent job of minimizing the back-and-forth reading in the file, but this won't be a walk in the park.

Another idea is to use a merge-sort strategy at the high level, with special features to handle the memory loading/unloading more conservatively and sequentially. Merge-sort is good for this because it gives you both a guaranteed O(N log N) running time and good memory-access patterns if you order the steps correctly:

Assume you define a maximum chunk size for the amount of memory you want to load at one time; let's say it's 1024 elements (better make it a power of 2). You can then load the first 1024 elements from the file, sort them (using std::sort()), save them to the output file, and repeat with the next chunk of 1024 elements. At the end, you'll have a file containing sorted 1024-element sequences one after the other.

Now, the tricky part is merging those sequences without having to construct the complete sequence in memory. One strategy here is inspired by bucket-sort. Basically, you create an individual file for each "bucket", which contains the elements in a certain range. When you take a new 1024-element sorted sequence, you merge the sub-sequence corresponding to each bucket's range into that bucket. If a bucket gets too big (more than 1024 elements), you split it in half. When you are done distributing all the sorted sequences, you dump the content of each bucket, in order, into the output file, and you're done.

This is going to work OK, assuming that your method for loading and unloading the file content isn't too slow (it should be a straight binary dump, as opposed to formatted input such as parsing strings or numbers from text). If your file is not straight binary and requires translation or formatting to load and unload, then you really have to go with a memory-mapped file.
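An in-memory sketch of the bucket-merge phase described above. To keep it self-contained, vectors stand in for the bucket files, elements are routed one at a time rather than as whole sub-sequences, and the bucket limit is a small illustrative value; only the split-on-overflow bookkeeping is the point here:

```cpp
// Buckets hold disjoint sorted ranges; sorted chunks are merged in and an
// overfull bucket is split in half, mimicking the on-disk bucket files.
#include <algorithm>
#include <iterator>
#include <list>
#include <vector>

struct Bucket {
    std::vector<int> data;   // kept sorted; stands in for one bucket file
};

class BucketStore {
    std::list<Bucket> buckets_;   // ordered by the range each bucket covers
    std::size_t limit_;
public:
    explicit BucketStore(std::size_t limit) : limit_(limit) { buckets_.emplace_back(); }

    // Merge one sorted chunk: route each element to the bucket covering its range.
    void add_sorted_chunk(const std::vector<int>& chunk) {
        for (int v : chunk) {
            auto it = buckets_.begin();
            auto next = std::next(it);
            // advance while the following bucket's smallest element is <= v
            while (next != buckets_.end() && !next->data.empty() && next->data.front() <= v) {
                it = next;
                ++next;
            }
            auto pos = std::upper_bound(it->data.begin(), it->data.end(), v);
            it->data.insert(pos, v);
            if (it->data.size() > limit_) {          // split an overfull bucket in half
                Bucket upper;
                auto mid = it->data.begin() + it->data.size() / 2;
                upper.data.assign(mid, it->data.end());
                it->data.erase(mid, it->data.end());
                buckets_.insert(std::next(it), std::move(upper));
            }
        }
    }

    // Dump the buckets in order -> fully sorted output.
    std::vector<int> dump() const {
        std::vector<int> out;
        for (const auto& b : buckets_) out.insert(out.end(), b.data.begin(), b.data.end());
        return out;
    }
};
```

In the real on-disk version, `add_sorted_chunk` would stream each bucket file through memory once per merge instead of doing per-element inserts, but the bucket-range and splitting logic is the same.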


If your sort key is short, you can:
1. Loop through the file:
1a) remember the line position (ftell)
1b) read the line
1c) extract the key
1d) put both key and position in a structure array
2. Sort the structure array.
3. Read each line from its recorded position (fseek) and write the sorted file.
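The steps above can be sketched as follows, assuming the sort key is the first whitespace-delimited field of each line (the original post doesn't specify the key format). Only the small {key, offset} pairs live in memory; the 2 GB of line data stays on disk:

```cpp
// Key + file-offset index sort: pass 1 records each line's position (ftell)
// and key; pass 2 seeks back (fseek) in sorted order and copies lines out.
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

struct IndexEntry {
    std::string key;   // short sort key extracted from the line
    long offset;       // byte position of the line, from ftell
};

void sort_by_key(const char* input, const char* output) {
    FILE* in = std::fopen(input, "rb");
    std::vector<IndexEntry> index;
    char line[4096];
    // Pass 1: remember each line's position and extract its key.
    for (;;) {
        long pos = std::ftell(in);
        if (!std::fgets(line, sizeof line, in)) break;
        std::string key(line, std::strcspn(line, " \t\r\n"));
        index.push_back({key, pos});
    }
    // Sort the small index, not the file data itself.
    std::sort(index.begin(), index.end(),
              [](const IndexEntry& a, const IndexEntry& b) { return a.key < b.key; });
    // Pass 2: seek to each line in sorted order and copy it to the output.
    FILE* out = std::fopen(output, "wb");
    for (const auto& e : index) {
        std::fseek(in, e.offset, SEEK_SET);
        if (std::fgets(line, sizeof line, in)) std::fputs(line, out);
    }
    std::fclose(out);
    std::fclose(in);
}
```

The memory cost is one IndexEntry per line rather than the line itself, which is what makes this feasible for 10 million lines with short keys.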

I used this technique and it was quite fast, even with reading the file twice.
