C++ Challenge: Natural Sorting

Question

Narue 5,707 Bad Cop

14 Years Ago

Consider the following program and input file:

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <functional>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include "natural_compare.h"

struct natural_less: std::binary_function<std::string, std::string, bool> {
  bool operator()(const std::string& a, const std::string& b)
  {
    return natural_compare(a, b) < 0;
  }
};

std::vector<std::string> get_lines(std::istream& in)
{
  std::vector<std::string> v;
  std::string line;

  while (std::getline(in, line)) {
    if (line[0] != '#')
      v.push_back(line);
  }

  return v;
}

int main()
{
  std::srand((unsigned)std::time(0));

  std::ifstream in("test.txt");

  if (in) {
    std::vector<std::string> v = get_lines(in);

    std::random_shuffle(v.begin(), v.end());
    std::sort(v.begin(), v.end(), natural_less());

    std::vector<std::string>::const_iterator it = v.begin();
    std::vector<std::string>::const_iterator end = v.end();

    while (it != end)
      std::cout<<'>'<< *it++ <<"<\n";
  }
}

# Basic tests
a1
a2
a3
a10
a11
a12
# Matching value tests
b5
b 5
b05
b  5
b 05
b005
b   5
b  05
b 005
b0005
b    5
b   05
b  005
b 0005
b00005
# Second tier matching value tests
c 1 5
c 1  5
c 1 05
c 1   5
c 1  05
c 1 005
c 1    5
c 1   05
c 1  005
c 1 0005
c 1     5
c 1    05
c 1   005
c 1  0005
c 1 00005

The code obviously won't compile because the natural_compare function isn't implemented (we'll get to that shortly). But with a basic call to std::sort using the normal character ordering rules, the output would look something like this:

>a1<
>a10<
>a11<
>a12<
>a2<
>a3<
>b    5<
>b    5 <
>b   05<
>b   5<
>b   5 <
>b  005<
>b  05<
>b  5<
>b  5 <
>b 0005<
>b 005<
>b 05<
>b 5<
>b 5 <
>b00005<
>b0005<
>b005<
>b05<
>b5<
>b5 <
>c 1     5<
>c 1     5 <
>c 1    05<
>c 1    5<
>c 1    5 <
>c 1   005<
>c 1   05<
>c 1   5<
>c 1   5 <
>c 1  0005<
>c 1  005<
>c 1  05<
>c 1  5<
>c 1  5 <
>c 1 00005<
>c 1 0005<
>c 1 005<
>c 1 05<
>c 1 5<
>c 1 5 <

Not exactly intuitive unless you're a computer. Because the value of each individual character is used unconditionally in the comparison, the result is less than desirable. A more natural ordering would take numeric values into account like so:

>a1<
>a2<
>a3<
>a10<
>a11<
>a12<
>b5<
>b5 <
>b 5<
>b 5 <
>b05<
>b  5<
>b  5 <
>b 05<
>b005<
>b   5<
>b   5 <
>b  05<
>b 005<
>b0005<
>b    5<
>b    5 <
>b   05<
>b  005<
>b 0005<
>b00005<
>c 1 5<
>c 1 5 <
>c 1  5<
>c 1  5 <
>c 1 05<
>c 1   5<
>c 1   5 <
>c 1  05<
>c 1 005<
>c 1    5<
>c 1    5 <
>c 1   05<
>c 1  005<
>c 1 0005<
>c 1     5<
>c 1     5 <
>c 1    05<
>c 1   005<
>c 1  0005<
>c 1 00005<

Your challenge, should you choose to accept it, is to fill in the blanks (write the natural_compare.h header and all implementation files) such that the natural_compare function exists and can produce the naturally sorted output.

This is a software engineering problem, so there's no right answer as long as the function works as intended. You're welcome to add functionality once the basic functionality required to produce the second output has been completed. If you choose to give up and see my attempt, I'll be happy to send it to you, but give it an honest try first. There are a number of ways to solve the problem.

algorithm c++ sort sorting

5 Contributors
18 Replies
2K Views
6 Days Discussion Span
Latest Post 14 Years Ago Latest Post by dusktreader

mrnutty commented: Hope you post more challenges. +4

All 18 Replies

NathanOliver 429 Veteran Poster

14 Years Ago

i believe i have a solution to your problem but there are more values in your sorted results than in the text file with the information to sort. for example there are two "b5"'s in your sorted results but only one in the text file. please let me know because i am interested to see if my code is correct.

Nick Evan 4,005 Industrious Poster

14 Years Ago

I'm temporarily making this thread a sticky. We'll see what comes out of it in the next month or so :)

dusktreader 137 Posting Whiz in Training

14 Years Ago

My preliminary solution works, though there's one logical step that I'm whittling down. Did you want us to post our solution here, or leave the question unanswered for others?

dusktreader 137 Posting Whiz in Training

14 Years Ago

That's fine. I'll wait a while as well (perhaps until you post your solution).

It is is an interesting problem. In particular, the partial equivalence of the '0' and ' ' character poses a nice brain tickler.

tetron 22 Junior Poster

14 Years Ago

What is it that you want the system to do when

#tiered test
a 005 b 04
a  5 b 3
a 0005 b 5
#no spaces between number and letter
y 000a 
y 00091
#spaces
b 5  abc
b  5 abcd
b   5 abc

?
As I am about to have to do a fully optimised retrieval of words I thought I'd see how close my solution was as I am rusty having put a wrapper around all my sort algorithms ages ago.

But to save me writing about another 4 solutions that I am sure someone here will beat :) is it just treating numbers separated by spaces as a whole number then sub sorting on the format?

thanks

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

Narue 5,707 Bad Cop Team Colleague · Answer 1 · 2010-02-11T06:47:32+00:00

>there are more values in your sorted results than in the text file with the information to sort
My bad, I added some more cases, but forgot to post the updated text file. Feel free to use the posted input or add the cases from the output as well. Either will be acceptable.

Narue 5,707 Bad Cop Team Colleague · Answer 2 · 2010-02-11T23:46:29+00:00

Here's the updated input file:

# Basic tests
a1
a2
a3
a10
a11
a12
# Matching value tests
b5
b5 
b 5
b 5 
b05
b  5
b  5 
b 05
b005
b   5
b   5 
b  05
b 005
b0005
b    5
b    5 
b   05
b  005
b 0005
b00005
# Second tier matching value tests
c 1 5
c 1 5 
c 1  5
c 1  5 
c 1 05
c 1   5
c 1   5 
c 1  05
c 1 005
c 1    5
c 1    5 
c 1   05
c 1  005
c 1 0005
c 1     5
c 1     5 
c 1    05
c 1   005
c 1  0005
c 1 00005

Narue 5,707 Bad Cop Team Colleague · Answer 3 · 2010-02-12T01:57:20+00:00

>Did you want us to post our solution here, or leave the question unanswered for others?
It's a personal challenge, not a contest, so you don't have to prove that you did it. However, if you want to share, feel free. I was planning on giving it a little time before posting my own solution, so as not to stifle creativity.

In all honesty, I'm hoping everyone can get together and collaborate on making all of the posted solutions better. That's the part of the challenge I think we'll learn the most from. At the very least there will be one solution (mine) posted and open to review.

Narue 5,707 Bad Cop Team Colleague · Answer 4 · 2010-02-14T22:51:36+00:00

>#tiered test
>a 005 b 04
>a 5 b 3
>a 0005 b 5

a  5 b 3
a 005 b 04
a 0005 b 5

>y 000a
>y 00091

y 00091
y 000a

>b 5 abc
>b 5 abcd
>b 5 abc

b 5  abc
b  5 abcd
b   5 abc

Narue 5,707 Bad Cop Team Colleague · Answer 5 · 2010-02-16T03:08:47+00:00

Well, it's been five days since the challenge was issued, so I'll post my shot at a solution:

#ifndef JSW_NATURAL_COMPARE
#define JSW_NATURAL_COMPARE

#include <string>

int natural_compare(const char *a, const char *b);
int natural_compare(const std::string& a, const std::string& b);

#endif

#include <cctype>

namespace {
  // Note: This is a convenience for the natural_compare 
  // function, it is *not* designed for general use
  class int_span {
    int _ws;
    int _zeros;
    const char *_value;
    const char *_end;
  public:
    int_span(const char *src)
    {
      const char *start = src;

      // Save and skip leading whitespace
      while (std::isspace(*(unsigned char*)src)) ++src;
      _ws = src - start;

      // Save and skip leading zeros
      start = src;
      while (*src == '0') ++src;
      _zeros = src - start;

      // Save the edges of the value
      _value = src;
      while (std::isdigit(*(unsigned char*)src)) ++src;
      _end = src;
    }

    bool is_int() const { return _value != _end; }
    const char *value() const { return _value; }
    int whitespace() const { return _ws; }
    int zeros() const { return _zeros; }
    int digits() const { return _end - _value; }
    int non_value() const { return whitespace() + zeros(); }
  };

  inline int safe_compare(int a, int b)
  {
    return a < b ? -1 : a > b;
  }
}

int natural_compare(const char *a, const char *b)
{
  int cmp = 0;

  while (cmp == 0 && *a != '\0' && *b != '\0') {
    int_span lhs(a), rhs(b);

    if (lhs.is_int() && rhs.is_int()) {
      if (lhs.digits() != rhs.digits()) {
        // For differing widths (excluding leading characters),
        // the value with fewer digits takes priority
        cmp = safe_compare(lhs.digits(), rhs.digits());
      }
      else {
        int digits = lhs.digits();

        a = lhs.value();
        b = rhs.value();

        // For matching widths (excluding leading characters),
        // search from MSD to LSD for the larger value
        while (--digits >= 0 && cmp == 0)
          cmp = safe_compare(*a++, *b++);
      }

      if (cmp == 0) {
        // If the values are equal, we need a tie   
        // breaker using leading whitespace and zeros
        if (lhs.non_value() != rhs.non_value()) {
          // For differing widths of combined whitespace and 
          // leading zeros, the smaller width takes priority
          cmp = safe_compare(lhs.non_value(), rhs.non_value());
        }
        else {
          // For matching widths of combined whitespace 
          // and leading zeros, more whitespace takes priority
          cmp = safe_compare(rhs.whitespace(), lhs.whitespace());
        }
      }
    }
    else {
      // No special logic unless both spans are integers
      cmp = safe_compare(*a++, *b++);
    }
  }

  // All else being equal so far, the shorter string takes priority
  return cmp == 0 ? safe_compare(*a, *b) : cmp;
}

#include <string>

int natural_compare(const std::string& a, const std::string& b)
{
  return natural_compare(a.c_str(), b.c_str());
}

Comments and criticism are welcome. :)

Nick Evan 4,005 Industrious Poster Team Colleague Featured Poster · Answer 6 · 2010-02-16T03:23:31+00:00

Comments and criticism are welcome. :)

Yeah, like that will happen ... :icon_wink:

The solution in my head, had a lot more stl in it. Perhaps, if I find the time, I'll implement and post it. But this "moderator-thing" is taking up a lot of my "make programs for fun" time. :icon_neutral:

NathanOliver 429 Veteran Poster Featured Poster · Answer 7 · 2010-02-16T05:09:34+00:00

question about the other comparisons. you said it should be

b 5  abc
b  5 abcd
b   5 abc

but shouldn't it be?

b 5  abc
b   5 abc
b  5 abcd

since at position # 4 in the middle string it is a ' ' and in last string it is a 5.

NathanOliver 429 Veteran Poster Featured Poster · Answer 8 · 2010-02-16T05:21:21+00:00

Well here is my attempt at a solution. Any feedback is welcome.

#ifndef NATURAL_COMPARE_H
#define NATURAL_COMPARE_H

#include <cmath>
#include <locale>

typedef unsigned int uint;

bool IsStringANumber(std::string);

int natural_compare(std::string first, std::string second)
{
	size_t firstLength = first.length();
	size_t secondLength = second.length();
	size_t count = 0;
	uint firstSpaces = 0, secondSpaces = 0;
	//////////////////////////////////////////////////////////////////////////////
	//  testing for testing null strings
	if (firstLength == 0 && secondLength != 0)
		return -1;
	if (secondLength == 0 && firstLength != 0)
		return 1;
	if (firstLength == 0 && secondLength == 0)
		return 0;
	/////////////////////////////////////////////////////////////////////////////////
	//  eating trialing whitespaces.  no data is modified just the size
	//  of the string to compare is changed.
	while(firstLength > 0 && first[firstLength - 1] == ' ')
	{
		firstSpaces++;
		firstLength--;
	}
	while(secondLength > 0 && second[secondLength - 1] == ' ')
	{
		secondSpaces++;
		secondLength--;
	}
	//////////////////////////////////////////////////////////////////////////////////
	//  testing of numbers.  all numbers come first so if one
	//  string is a number it will be less than the other string
	if (!IsStringANumber(first) && !IsStringANumber(second))
	{
		if (first[0] > second[0])
			return 1;
		if (first[0] < second[0])
			return -1;
	}
	if (IsStringANumber(first) && !IsStringANumber(second))
		return -1;
	if (!IsStringANumber(first) && IsStringANumber(second))
		return 1;
	////////////////////////////////////////////////////////////////////////////////////
	//  testing the length.  if the string is shorter and satisfies the other
	//  conditions than it is smaller
	if (firstLength > secondLength)
		return 1;
	if (firstLength < secondLength)
		return -1;
	/////////////////////////////////////////////////////////////////////////////////////
	//  this block checks equality over same length strings.
	while (count < firstLength)
	{
		if (first[count] == second[count])
		{
			count++;
			continue;
		}
		if (first[count] > second[count])
			return 1;
		else
			return -1;
	}
	//////////////////////////////////////////////////////////////////////////////////////
	//  is the strings are equal than the one with less trailing whitespace is smaller
	if (firstSpaces || secondSpaces)
	{
		if (firstSpaces == secondSpaces)
			return 0;
		if (firstSpaces > secondSpaces)
			return 1;
		else
			return -1;
	}
	//////////////////////////////////////////////////////////////////////////////////////
	// if string are totaly equal return 0
	return 0;
}

bool IsStringANumber(std::string test)
{
	return !!atof(test.c_str());
}

#endif

Narue 5,707 Bad Cop Team Colleague · Answer 9 · 2010-02-16T06:10:38+00:00

question about the other comparisons. you said it should be
b 5  abc
b  5 abcd
b   5 abc
but shouldn't it be?
b 5  abc
b   5 abc
b  5 abcd
since at position # 4 in the middle string it is a ' ' and in last string it is a 5.

"  5"

is shorter than

"   5"

by one space, so it should compare as less. Granted these are my rules based on what I think is more intuitive, but hopefully it makes sense. :)

tetron 22 Junior Poster · Answer 10 · 2010-02-16T23:43:56+00:00

Well, it's been five days since the challenge was issued, so I'll post my shot at a solution:

Not sure that I found all the rules very natural
but since I did a rough try I will include mine
to make others feel better:)

I changed the logic of this and have not
even tested with a compiler...

I find the first char that is different
this says that there is a precidence

special cases are

num : num
space : num
num : space

otherwise normal char comparison

methods used are:

typedef std::string::iterator str_it;
//false means no digit to find
bool skip_to_digit(str_it &cur, str_it &it_stop);
//both d1 & d2 pointing to a digit
bool first_number_bigger(str_it &it1, str_it &it1_stop,
str_it &it2, str_it &it2_stop, bool def = false);
bool is_digit(char c);

natural find

str_it it1(a.begin()), it2(b.begin());
str_it it1_stop(a.end()), it2.end());
do{
	
 char c1(*it1), c2(*it2);
 if(c1 != c2)
 {
   bool current(false);
   if(is_digit(c1))
   {
     if(c2 == ' ')
     {
      	 if(advance_to_digit(it2, it2_stop))
       	{
	   //if the same c1 bigger
	   return first_number_bigger(it1, it1_stop, it2, it2_stop, false);
       	}
       	return false;
     }
     else if(is_digit(c2))
     {
	//both numbers same ret = false
	return first_number_bigger(it1, it1_stop, it2, it2_stop);
     }
   }
   else if(c1 == ' ' && is_digit(c2))
   {
     if(advance_to_digit(it1, it1_stop))
     {
	//both are numbers if same c2 bigger
	return first_number_bigger(it1, it1_stop, it2, it2_stop);
     }
     return true;
   }

   return (c1 > c2);
 }

}
while(++it1 != it1_stop && ++it2 != it2.stop);

//so far the same final check
return (a.size() > b.size());

the one to tell when a number is bigger than another number
needs to compare length and if length is the same leading
digit
b4 is simply a bool to deal with if both numbers are the same
to act as the return
first_number_bigger

char s1(' '), s2(' ');

while(++it1 != it1_stop && true == is_digit(c1))
{
	++it2;
	if(it2 == it2_stop)
	{
	  return false;
	}
	char c2(*it2);
	if(true == is_digit(c2))
	{
	   if(s1 == ' ' && c1 != c2)
	   {
	    //first time store different digits
             s1 = c1;
	     s2 = c2;
	   }
	}
	else
	{
	   return true;
	}

}


if(true == is_digit(*c2))
{
	return false;
}


if(s1 != '0')
{
	return (s1 > s2);
}
else
{
	return b4;
}

}

Do we have a digit following a space
if so advance iterator to first digit

bool ret(false);
while(++cur < it_stop)
{
	if(*cur != ' ')
	{
		if(is_digit(cur))
		{
			ret = true;
		}
		break;
	}
}
return ret;

is digit probably how stl does it but
should check upper limit b4 lower

is digit

bool ret(false);
if(c <= '9' && temp >= '0')
{
  ret = true;
}
return ret;

for a natural sort
I would just chop the string on spaces compare words
if word starts number do a number compare
this would require first cleaning double space to single

dusktreader 137 Posting Whiz in Training · Answer 11 · 2010-02-17T01:44:49+00:00

Here is my solution:

int natural_compare( const std::string& a, const std::string& b ){

    // Remembers which input had the first leading zero
    int lt = 0;

    // Temporaries.  Used for converting natural equivalents ' ' and '0' to '~'
    char ta, tb;

    // Remembers if the previous characters composed an integer value
    bool num = false;

    // Iterate over the strings stopping at the end of the shorter one
    for( unsigned int i=0; i<std::min( a.size(), b.size() ); i++ ){

        // If there is a data missmatch.  Look for special circumstances
        if( a[i] != b[i] ){

            // Use the '~' replacement unless the input is currently numeric
            ta = a[i] == ' ' || ( a[i] == '0' && !num ) ? '~' : a[i];
            tb = b[i] == ' ' || ( b[i] == '0' && !num ) ? '~' : b[i];

            // Because the strings don't match here, the can't be numeric
            num = false;

            // There is a natural missmatch with no special circumstances, so terminate
            if( ta != tb )
                return ta - tb;
            // The strings match, so check for leading zeros
            else if( lt == 0 )
                lt = a[i] - b[i];
        }
        // Still equivalent, so check for numeric state
        else
            num = ( num && a[i] == '0' ) || a[i] > '0' && a[i] <= '9';
    }
    // The strings are naturally equivalent, but the longer is naturally greater
    if( a.size() != b.size() )
        return a.size() - b.size();

    // The strings are naturally equivalent, so we return the one with the first leading '0'
    return lt;
}

I tested it with this input:

# Basic tests
a1
a2
a3
a10
a11
a12
# Matching value tests
b5
b 5
b05
b  5
b 05
b005
b   5
b  05
b 005
b0005
b    5
b   05
b  005
b 0005
b00005
# Second tier matching value tests
c 1 5
c 1  5
c 1 05
c 1   5
c 1  05
c 1 005
c 1    5
c 1   05
c 1  005
c 1 0005
c 1     5
c 1    05
c 1   005
c 1  0005
c 1 00005
#tiered test
a 005 b 04
a  5 b 3
a 0005 b 5
#no spaces between number and letter
y 000a
y 00091
#spaces
b 5  abc
b  5 abcd
b   5 abc

And, here is the output:

>a1<
>a10<
>a11<
>a12<
>a2<
>a3<
>a  5 b 3<
>a 005 b 04<
>a 0005 b 5<
>b5<
>b 5<
>b05<
>b 5  abc<
>b  5<
>b 05<
>b005<
>b  5 abcd<
>b   5<
>b  05<
>b 005<
>b0005<
>b   5 abc<
>b    5<
>b   05<
>b  005<
>b 0005<
>b00005<
>c 1 5<
>c 1  5<
>c 1 05<
>c 1   5<
>c 1  05<
>c 1 005<
>c 1    5<
>c 1   05<
>c 1  005<
>c 1 0005<
>c 1     5<
>c 1    05<
>c 1   005<
>c 1  0005<
>c 1 00005<
>y 00091<
>y 000a<

The rules were pretty tricky. I found one of them didn't seem very natural to me. In my mind "b5" is greater than "b005", because my brain puts trailing zeros after the first. Unfortunately, its also more tricky to implement this rule than the method my brain uses.

Thanks for the challenge, and any feedback is more than welcome.

Narue 5,707 Bad Cop Team Colleague · Answer 12 · 2010-02-17T02:33:48+00:00

>Not sure that I found all the rules very natural
Well, the rules are rather arbitrary. I defined them based on what I felt was an intuitive way to organize unlikely file names. Most natural sorts I've seen don't consider leading zeros or whitespace. We can work together to come up with a definition that works better for the majority and then re-approach the challenge, if you'd like. :)

>In my mind "b5" is greater than "b005", because my brain puts trailing zeros after the first.
The idea was that leading zeros and whitespace would be a part of the number rather than part of the previous character. So your example would be comparing "5" and "005" because 'b' matches in both. I felt that it made more sense for "5" to be less than "005".

dusktreader 137 Posting Whiz in Training · Answer 13 · 2010-02-17T02:37:08+00:00

>In my mind "b5" is greater than "b005", because my brain puts trailing zeros after the first.
The idea was that leading zeros and whitespace would be a part of the number rather than part of the previous character. So your example would be comparing "5" and "005" because 'b' matches in both. I felt that it made more sense for "5" to be less than "005".

In the context of file names, this makes sense.

C++ Challenge: Natural Sorting

Recommended Answers Collapse Answers

All 18 Replies

Recommended Answers