
Can anyone suggest an algorithm to sort 500 MB of data when the system has only 2 MB of memory?
Thanks


You use a file based merge sort:

Repeat:
  Read about half the memory's worth of data and sort it any way you like
  Write the sorted run to a temporary file
Until all data has been read and sorted once
Repeat:
  Read the next two temp files and merge-write them into yet another temp file
Until you have merged all the temp files into a single sorted output

Note that this is still O(N log N) (with a large constant because of the file access): you are dealing with all N items over log(N) passes through the data.

This disk sort technique has been in use longer than there have been computer languages (except assembly, I suppose).
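The two passes above can be sketched in Python (a minimal sketch, not any poster's actual code; line-oriented text records, the chunk size, and the helper names are all assumptions for illustration):

```python
import os
import tempfile

def sort_chunks(input_path, max_lines=1000):
    """Pass 1: split the input into sorted runs, one temp file each."""
    runs = []
    with open(input_path) as f:
        while True:
            # Read at most max_lines lines -- the "half of memory" chunk.
            chunk = [line for _, line in zip(range(max_lines), f)]
            if not chunk:
                break
            chunk.sort()                      # any in-memory sort will do
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            runs.append(path)
    return runs

def merge_pair(path_a, path_b):
    """Pass 2 step: merge two sorted runs into a new sorted run."""
    fd, out_path = tempfile.mkstemp(text=True)
    with open(path_a) as a, open(path_b) as b, os.fdopen(fd, "w") as out:
        la, lb = a.readline(), b.readline()
        while la and lb:                      # emit the smaller head line
            if la <= lb:
                out.write(la)
                la = a.readline()
            else:
                out.write(lb)
                lb = b.readline()
        while la:                             # drain whichever run remains
            out.write(la)
            la = a.readline()
        while lb:
            out.write(lb)
            lb = b.readline()
    os.remove(path_a)
    os.remove(path_b)
    return out_path

def external_sort(input_path, output_path, max_lines=1000):
    runs = sort_chunks(input_path, max_lines)
    if not runs:                              # empty input: empty output
        open(output_path, "w").close()
        return
    while len(runs) > 1:                      # keep merging pairs of runs
        runs = runs[2:] + [merge_pair(runs[0], runs[1])]
    os.replace(runs[0], output_path)
```

With real 500 MB inputs you would size the chunk by bytes rather than by line count, but the structure of the two passes is the same.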

Alternatively:

  • You can merge from more than two temp files, up to the number of file handles you can deal with. This is clearly a better option because you write and then read fewer temporary files, and file access is slow; but the algorithm is easier to understand with just two.
  • You can also play various games with keeping "active" parts of the sorted values in memory.
  • You can instead use a disk-based radix sort where the temporary files are associated with each key. Depending on how many keys you use, you may need to further subdivide the temporary files until they are small enough to handle in memory.
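The first alternative above, merging from many temp files at once, can be sketched with Python's `heapq.merge`, which holds just one "current" line per run in a min-heap and repeatedly emits the smallest (this assumes each run file is already individually sorted):

```python
import heapq

def merge_runs(run_paths, output_path):
    """K-way merge: stream many sorted run files into one sorted output."""
    files = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            # heapq.merge lazily yields the smallest line across all runs,
            # so only one line per run is held in memory at a time.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
```

The practical limit on the number of runs is, as the post says, how many file handles you can keep open at once.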

There's a lot of more or less accessible information out there on the web. The Google knows...


Shortly, and briefly (just because I've got to hurry now :) ):
File-buffered activity, with the data broken up into suitably sized chunks, each overlapping the previous one by one unit. Of course you'll also have to iterate several times, and the file operations will introduce a heavy overhead to the execution of the code.


Damn I suck! :D (Too much hurry I guess)

So buffer in a file, break it up, sort the elements of each chunk, then comb sort the chunks, or just create a two-level sort where the chunks are the higher level and the individual data items are the lower one. (You could take several other approaches as well.)

How the hell could I screw that up?



I just can't believe that! It was good. I'd better go to bed for now...
Take the former (or a similar) approach for a minimum selection sort, eliminate duplicate data in the memory buffer if needed, and swap chunks in and out as they are needed.

:@ Shame... Good night (or whatever)


Selection sort has nothing to recommend it, *except* that it makes the absolute minimum number of swaps (it still makes O(N^2) comparisons). Other than those rare times when moving data is inordinately costly, it's not a sort to use.

The fastest general purpose sort on Windows (since Windows 2000) is to use the system sort, from the command line in a terminal window:

sort <Unsorted.txt >Sorted.txt

Why? Because MS set aside special resources for sorting, that your program can't begin to match. I have several versions of Quicksort (optimized with Insertion sort for small sub arrays), Radix, Flash, Merge, Comb 11, etc., in C, and none of them come close to the system sort. And they never will. Easy squeezy also, since it does the merging of the temp files for you.
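Calling the platform sort from a program might look like this (a hedged sketch: it assumes a `sort` utility on the PATH, which both Unix-likes and Windows provide, and plain text-line input):

```python
import subprocess

def system_sort(unsorted_path, sorted_path):
    """Hand the whole job to the platform's sort utility, which
    spills to temp files and merges them for you on large inputs."""
    with open(unsorted_path) as src, open(sorted_path, "w") as dst:
        # Equivalent to: sort <Unsorted.txt >Sorted.txt
        subprocess.run(["sort"], stdin=src, stdout=dst, check=True)
```

Note that the exact ordering (locale, numeric vs. lexicographic) depends on the utility's flags and environment, so check those against your data.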

After that, in general, it's faster to sort in memory than on the HD (SSD might change this a bit). Sequentially reading the data, instead of bouncing the read head back and forth, is what you want here.

So an array whose size is a multiple of 32 or 64 bytes (depending on whether your OS is 32- or 64-bit), used together with your drive's own buffer, is called for.

Fastest general purpose sorter I have is the tweaked Quicksort mentioned above. If you can get away with leaving the data unsorted, but making it appear as if it's sorted by using an index array, then use the index array method. Combine that with Quicksort, or your favorite fast sorter.**
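The index-array idea can be sketched like this in Python (the `records` list is a made-up example; instead of moving the records, you sort a list of their indices):

```python
records = ["delta", "alpha", "charlie", "bravo"]

# Build and sort an index array instead of rearranging the records.
order = sorted(range(len(records)), key=lambda i: records[i])

# The records stay put; traverse them via the index to get sorted order.
print(order)                        # indices of records in sorted order
print([records[i] for i in order])  # the records, visited in that order
```

This pays off when the records are large, because only small integers get shuffled around during the sort.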

Fill your array, sort it, write it out to a temp file (if not indexed), and repeat. When you're done, start merging your temp files. Merge either two or three files at a time (or multiples thereof), so you never end up with just one orphan file to merge.

Does a beautiful job.

**Despite claims to the contrary, none of the "oh so fast" sorting algorithms has beaten the optimized Quicksort in my tests (tested with ints). Multi-file merge sort CAN beat it, if you optimize it enough and have the right hardware, from evidence I've read on sorting contests.

Votes + Comments
Though this will be a zombie post, I'd disagree with you. User-to-system transitions are costly, which means you can create even more optimized sorting code yourself. Using the system sort is still right if getting the job done quickly has priority over performance.