Hi all,

What is the best sorting algorithm for a doubly linked list?

Insertion sort and merge sort appears to the best due to the less overhead compared to the bubble/selection sort.

Would like to know your insights on this.


For linked-lists, the merge-sort algorithm is pretty much the only decent choice. For arrays, things are a lot more subtle and interesting, but not for linked-lists, it's merge-sort that's just the clear winner over any other option (although it could be that insertion-sort is better when the list is very small).

Back in the "dark ages", when I needed to sort a linked list, I would put the nodes into an array, then use qsort to sort that, and then rebuild the list from the array. Worked well. Was simple. And it was quite fast. It also met the KISS principle.

IMAO, the best approach to sorting a linked list is not to let it get out of order in the first place. While inserting the nodes in order is slower than pushing them to the head of the list and then sorting, it has the affect of amortizing the sorting over the course of the program's operation, which can prove more important than sheer operational speed. More importantly, it means that there is never any need to repeat the sorting, which is often the case if the list is being actively changed over time.

Although I've never had a chance to benchmark my concept with a large dataset, I've implemented a slight twist to the insertion sort wherein I maintain an array of pointers to the first element. PNTR(0) to the fist alpha numerically. 1 - 26 correspond to letters a-z and PNTR(27) anything above "Z". This eliminates the need to always start from the first or last entry.

The average typist can probably hammer out 3 chars/sec which on a 3.8 gig CPU leaves us with 200 million instructions between each keystroke. My loop to cycle in either direction through the list was 194 instructions roughly to find a name, so theoretically my application should have been able to find the next occurence of 1.3 million entries between each keystroke. Factor in multi-tasking and maybe 600K would be more realistic. So extropolate this into a database of the largest city in the world being Tokoyo at 9 million and evenly distibuting the number of people for each letter of the alphabet, by at least getting to the first occurence, it would only take 50% of the time between each keystroke to find the next node.

I remember well that day in 1968 when I implemented a bubble sort on a Z80 based machine running CPM and a screaming 384 Khz. Using basic in only took 11 minutes to sort 3400 entries. This is when I learned about assembly code and on that same machine using a bubble algo, I reduced that to 34 sec. Still not satisfied, a Shell Metzner appoach was used and now we are at just over 11 seconds.

Without knowing the specifics of your application @Thomas_25, but if it is for a very large dataset, I'd not only be looking at the algo, but investigating a means by dropping into kernel code (ring(0)) disabling interrupts so the sort algo at that level had exclusive use or even better, deticate it to another core.

When I designed the framework for FACTORYworks (a major MES for semiconductor, flat-panel display, and disc drive manufacturers), I designed and wrote an insertion sort for sorted collections. It was a qsort variant, and REALLY fast. With head/tail optimizations, it could suck up over 100K nodes so fast that our QA department had to ask if there was a bug! There wasn't... :-) The head/tail optimizations were useful when loading data from a sorted SQL query. Since it was array-based, the array resizing algorithm was critical. It was a binary algorithm with a reasonal starting size (factor of 2) where each increase would be simple to compute and reallocate. IE, double size on each reallocation.

Linked lists are elegant, but inefficient to keep sorted. To find the insertion spot requires possibly a full scan of the list, so this is another situation where head/tail optimizations can be useful if you REALLY need to use a linked list. If the new node isn't at the head or tail of the list, then you have to scan the list - not an optimal situation when performance is required. At least a sorted array will use a binary search which reduces the scan time considerably. This is why when I did need to sort a linked list, the "read into an array and use qsort" method worked well.

FWIW, I also did AVL and other variants of balanced binary trees. Fast for lookup. Not so fast when needing to rebalance the tree on insertion (depending upon the number of entries allowed in each data node). Consider them a mix of a sorted array and linked list... :-) I got a lot of my inspiration for these from Knuth, volume 3.

Linked lists are elegant, but inefficient to keep sorted.

Not if you use a hybrid tree/list, hash/list or a skiplist. :D But those implementations aren't as blindingly simple as a regular linked list, so you don't see them often.

Data structures are an exercise in trade-offs. You can have simple or efficient, but rarely both unless the application is very niche.

Thomas_25: rubberman and deceptikon have, indirectly, brought up a good point: is a doubly linked list what you actually need for your purposes, or would an ordered tree structure of some kind (e.g., a priority queue, an AVL tree, a splay tree) serve your needs better? Now, obviously, if this is for a homework assignment on doubly-linked lists, there's not going to be any leeway in this regard, but otherwise, you might want to consider just what data structure fits the problem best. What is the project you are looking to apply this to?

I agree that data structures are all about placing the trade-offs where it's most appropriate. When we talk about some kind of sorted container, e.g., something with an interface similar to std::set, then there are three fundamental operations: insertion / deletion, in-order traversal, and (range) look-ups.

There are basically three fundamental data structures that cater to each of these operations:

  • Binary search tree (linked-tree) provides the fastest insertion / deletion operations (incl. efficient re-balancing through tree rotations), but is relatively slow at everything else. This is exemplified by the std::set container.
  • Ordered array / vector provides the fastest in-order traversal operations, but is relatively slow at everything else. This is exemplified by the boost::flat_set container.
  • Implicit binary search tree (e.g., breadth-first or von Emde Boas layout) provides the fastest look-up operations, but is relatively slow at everything else. Unfortunately, there aren't too many standard examples of this, but I do give a pretty thorough explanation here and reference boost-style implementations of breadth-first layout and von Emde Boas layout.

In general, when you want to play around with the trade-offs, you can simply mix those three data structures in some smart way. For example, it is quite easy to mix a binary search tree whose leafs are sorted containers, e.g., 100K elements might be split into 100 vectors of 1000 elements each, arranging those vectors into a binary search tree, which will result in much faster insertions than using a single 100K vector and much faster traversals than using a fine-grained binary search tree over the 100K individual elements.

So, like deceptikon said, it's all about trade-offs and the specifics of your own application.

When it comes to sorting and searching, you cannot beat Knuth! :-) He wrote the bible on the subject.