I have to sort around 12GB of integer data which is kept in a file. I have a 1 GB RAM. How should I do it? The problem is I cant read whole of data together and store it in a vector because it goes out of memory bound. Which sorting algorithm should I use? I have to do it quickly. When I googled for this I got a line on the page http://en.wikipedia.org/wiki/Sorting_algorithm#Memory_usage_patterns_and_index_sorting saying "combine two algorithms in a way that takes advantages of the strength of each to improve overall performance." It suggests using quicksort on small data and then merging it. But ultimately merging will also require the data to be in RAM which is impossible.

What is the solution around this problem?

## All 12 Replies

Break it into manageable chunks. Sort each chunk into a file. Merge the files into another file.

thanks arkoenig.
But I dont get one thing that ultimately when merging I have to get it into my program which will cause problem..
Example:

I have 10 values. And I have RAM to store only 5 values. OK I will take chunks and the sort 5 elements. But finally when I have to merge them I will have to compare them and then merge.
For example I have data as: 2 1 5 9 4 6 3 8 7 0
I sort it as 1 2 4 5 9 and 0 3 6 7 8 .
But then when I merge I have to compare among them and then merge. So I require to bring all the data in my program.
Can you help me on this? Please if possible give a code snippet/algorithm

Thanks

To merge n files into one, all you need is an n-element array.

Begin by reading the first element of each file into the corresponding element of the array. That is, the first array element is the first element of the first file, the second array element is the first element of the second file, and so on.

Give each array a "mark" -- a single bit that indicates whether the corresponding file has reached its end. At first, every element is unmarked.

Now, you repeat the following steps:

1) Find the array element with the smallest value that is not marked. If all elements are marked, you're done.

2) Write that element into the output file.

3) Read the next value from the input file that corresponds to the array element. If you reach end of file, mark the element.

There are clever algorithms for keeping track of which array element is smallest, but unless you have a lot of input files, you won't need them because the time spent on input/output is much greater than the time spent on working with the array.

Thank you once again. I think with the above information I could possibly make the program.
The last question I want to ask is which will be the fastest sorting algorithm that you would suggest ? Any suggestions?

Thanks

I doubt you want the fastest algorithm; what you probably want is the easiest one to use that will give you the results you want in an acceptable amount of time. For that reason, I suggest you start by using std::sort, and look elsewhere only if std::sort is unsuitable for some reason.

Why do you need to be able to sort it fast? I'm asking because it takes quite a long time to do almost anything with 12GB of data, including generate it and read it. So it sounds to me like you might have 12 GB of data on disk that you want to do something with that requires sorting, but that the data isn't going to be changing at too big a rate. In this case, you might be able to get away with sorting it once (which might be slow) and store it on disk somewhere. Then you just need to insert or remove any new values that you need to. This is much faster than a complete sort, since you can use an efficient search algorithm to find the place to put the new value or remove an old one, which should be fast even for 12 GB (a binary search on this much sorted data will get to the right place in about 33 steps).

Since you only have 1GB of RAM, I suspect you are attempting to do this on a 32-bit system. I'm a little concerned about how much of the original file you'll actually be able to read through. Because of how pointers and file reading work, a 32-bit system may have trouble getting through more than the first few GB of it. You may have to do this on a 64-bit system.

You are going to have to use a variation on a merge sort. Initially, you'll want to sub-divide the original data file into multiple segments that are small enough for your system to hold the entire contents of in memory. An int value is normally 4-bytes on a 32-bit system, thus 12GB of integer data equates to about 3.22 Billion values. If we assume that your O/S is using as much as 512MB of your 1GB, that leaves you about 512MB to work with. Based on 4-byte integers, that allows you to process about 130-million values simultaneously, if you use a sorting algorithm that sorts in-place. Based on these numbers, you'll have to divide the data into at least 25 segments. The easiest way is to just base it on a counter, when the counter reaches a pre-determined value, you sort the segment relative to itself and output the sorted segment to a file.

Once you have created all of the sub-files, you'll then have to re-open all of them simultaneously and begin re-assembling them into a single file. The order in which you add them to the final output file would be based on relative comparisons between the currently-read values from each file. After you write a value, you read the next value from the appropriate file and begin the comparison over again.

Do you see how complicated this is going to get? There really isn't an easy way to do it. At least not one that I can see...

Thanks ravenous and Fbody. And lots of thanks to arkoenig. I think you have given sufficient hints to me to code this program. Please do share any new ideas you get.

I have to do this sorting only once. I agree it will take a long time. But my meaning when I said fast was relatively. And I have to sort this data as fast as possibnly. Might be 5 hours or so..

Also I want to know about external sorting Does someone has any idea ?

It's pretty much what I described to you.

It's pretty much what I described to you.

Thanks again Fbody. I read it and I just had an idea that can I somehow implement multithreading or any of the other methods suggested on the link. Like use asynchronous I/O. Obviously I cant change the hardware. But can you suggest/innovate ideas which can make my program to use maximum resources..

How to implement multithreading in this in C++. I know I am going a bit above my level but I want to make my program as fast as possible and am ready to work as hard as possible.

I have some knowledge of multithreading in Java but I have to use C++.

Thanks

To merge n files into one, all you need is an n-element array.

There are clever algorithms for keeping track of which array element is smallest, but unless you have a lot of input files, you won't need them because the time spent on input/output is much greater than the time spent on working with the array.

arkoenig Can you pls tell me about the algorithms that you mentioned in your earlier post abt sorting using multiple files

arkoenig Can you pls tell me about the algorithms that you mentioned in your earlier post abt sorting using multiple files

Look at std::priority_queue for a good example.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.