Following up on a discussion in this thread, I ran some tests on the performance of recursion vs. iteration, and some interesting things happened, so I thought I would share the results. I'll structure this a bit like a lab report, since that's really the best way to present it. But I hope the discussion can continue with additional ideas or insights into the performance issues discussed here, or with objections to my analysis.

Experimental Method

So, I took Schol-R-LEA's coin-combination example as a test case, and timed the execution of three different versions: recursive, iterative with dynamic allocation, and iterative with a VLA (a stack-based array), the latter because of this discussion. Here is the code:

#include <vector>
#include <utility>
#include <chrono>
#include <iostream>

// recursive version
int count_combinations_rec(int sum, const int* coins) {
  if (!*coins || sum < 0) 
    return 0;
  int result = count_combinations_rec(sum, coins + 1);
  if (sum == *coins) 
    return 1 + result;
  else
    return count_combinations_rec(sum - *coins, coins) + result;
};


// equivalent iterative version (shared by other two functions):
template <typename TaskIter>
int count_combinations_iter_impl(TaskIter first_task) {
  TaskIter cur_task = first_task;
  int result = 0;
  while( cur_task >= first_task ) {
    const int cur_sum = cur_task->first; 
    const int* const cur_coins = cur_task->second;
    if( !*cur_coins || (cur_sum <= 0) ) {
      --cur_task;
      continue;
    };
    ++(cur_task->second);
    if(*cur_coins == cur_sum)
      ++result;
    else {
      ++cur_task;
      cur_task->first = cur_sum - *cur_coins;
      cur_task->second = cur_coins;
    };
  };
  return result;
};

// iterative version with dyn-alloc
int count_combinations_iter(int total_sum, int min_coin, const int* coins_first) {
  typedef std::pair<int, const int*> task_type;
  std::vector< task_type > tasks(total_sum / min_coin + 1); // <- use dyn-array
  tasks[0].first = total_sum;
  tasks[0].second = coins_first;
  return count_combinations_iter_impl(tasks.begin());
};

// iterative version with VLA
int count_combinations_VLAiter(int total_sum, int min_coin, const int* coins_first) {
  typedef std::pair<int, const int*> task_type;
  task_type tasks[total_sum / min_coin + 1]; // <- use VLA
  tasks[0].first = total_sum;
  tasks[0].second = coins_first;
  return count_combinations_iter_impl(tasks);
};

namespace ch = std::chrono;

int main() {

  int coins_array[] = {1000, 100, 50, 20, 10, 5, 2, 1, 0};
  const int num_trials = 1e3;
  volatile int non_suppressed_output = 0;
  typedef ch::high_resolution_clock hrc;

  {
    auto t1 = hrc::now();
    for(int i = 0; i < num_trials; ++i) {
      non_suppressed_output += count_combinations_rec( 100, coins_array);
    };
    auto dt = hrc::now() - t1;
    dt /= num_trials;
    std::cout << "Recursive solution took, on average     " << ch::duration_cast<ch::nanoseconds>(dt).count() << " ns to solve the problem." << std::endl;
  };

  {
    auto t1 = hrc::now();
    for(int i = 0; i < num_trials; ++i) {
      non_suppressed_output += count_combinations_iter( 100, 1, coins_array);
    };
    auto dt = hrc::now() - t1;
    dt /= num_trials;
    std::cout << "Dyn-Iterative solution took, on average " << ch::duration_cast<ch::nanoseconds>(dt).count() << " ns to solve the problem." << std::endl;
  };

  {
    auto t1 = hrc::now();
    for(int i = 0; i < num_trials; ++i) {
      non_suppressed_output += count_combinations_VLAiter( 100, 1, coins_array);
    };
    auto dt = hrc::now() - t1;
    dt /= num_trials;
    std::cout << "VLA-Iterative solution took, on average " << ch::duration_cast<ch::nanoseconds>(dt).count() << " ns to solve the problem." << std::endl;
  };

};

Results

Here are the results (GCC 4.8.1, Linux Kubuntu 13.10):

1) Compilation under highest optimization:

$ g++ -std=c++11 -Wall -O3 -fomit-frame-pointer count_coins_rec_it.cpp -o count_coins_rec_it
$ ./count_coins_rec_it
Recursive solution took, on average     289553 ns to solve the problem.
Dyn-Iterative solution took, on average 596631 ns to solve the problem.
VLA-Iterative solution took, on average 542989 ns to solve the problem.

2) Compilation under less optimization:

$ g++ -std=c++11 -Wall -O1 -fomit-frame-pointer count_coins_rec_it.cpp -o count_coins_rec_it 
$ ./count_coins_rec_it 
Recursive solution took, on average     768298 ns to solve the problem.
Dyn-Iterative solution took, on average 515090 ns to solve the problem.
VLA-Iterative solution took, on average 549333 ns to solve the problem.

Observations

Here are some observations:

1) There isn't much significant performance difference between the VLA and dynamically allocated versions. Oddly enough, more optimization seems to be slightly worse for the dynamic-allocation version; I guess that must have to do with using a more optimized (and thus more complex) heap. Also, this is not a very representative test for dynamic allocation because (1) the amount allocated is very small and (2) the heap is not fragmented at all, since this is (ostensibly) the only dynamic allocation in the program. In any case, the overhead of dynamic allocation is not expected to be large (though it should be worse than the stack). These results may also suggest that the VLA implementation generated by the compiler is suboptimal, as it should never be worse than dynamic allocation, even by a few dozen microseconds.

2) The iterative versions are not much affected by the optimization level (they perform roughly equally at all levels). This suggests that the iterative version is already nearly "optimal" in its current form, and cannot really be optimized further by the compiler.

3) The recursive version is highly affected by the optimization level. The results at the lower optimization level (-O1) are what you would generally expect in a recursive vs. iterative showdown (due to function-call overhead). What I was interested in seeing is how far, and in what direction, the compiler would go to optimize the recursive form. I hypothesized (in the last post) that it would take a route along the lines of tail-call optimization and try to turn this into the equivalent iterative form, at least to some extent, which would have brought the performance closer to the iterative versions. However, it went another route entirely, and managed to surpass the performance of the iterative functions by quite a good margin.

After inspecting the assembly code produced by the compiler at the -O3 level (which I won't post here, due to length), the answer to this mystery was quite enlightening! If you can't turn a recursion into an iteration, what can you do to optimize it? Answer: you unroll it. The compiler unrolls two levels of recursion and combines them into one function, thus minimizing the overhead of function calls. If you are confused about what I mean, here is a translation back from assembly to C++ of the highly optimized version of the recursion:

// recursive method, unrolled once:
int count_combinations_rec(int sum, const int* coins) {
  int result = 0;

  if (!*coins || sum < 0) 
    return 0;

  const int* coins2 = coins + 1;
  if(*coins2) {
    result += count_combinations_rec(sum, coins2 + 1);

    if (sum == *coins2) 
      ++result;
    else {
      result += count_combinations_rec(sum - *coins2, coins2);
    };
  };

  if (sum == *coins) 
    ++result;
  else {
    int sum2 = sum - *coins;
    if (sum2 >= 0) {
      result += count_combinations_rec(sum2, coins + 1);

      if (sum2 == *coins) 
        ++result;
      else {
        result += count_combinations_rec(sum2 - *coins, coins);
      };
    };
  };

  return result;
};


// recursive method, unrolled twice:
// this is the version produced by GCC with '-O3':
int count_combinations_rec(int sum, const int* coins) {
  int result = 0;

  if (!*coins || sum < 0) 
    return 0;

  const int* coins2 = coins + 1;
  if(*coins2) {

    const int* coins3 = coins2 + 1;
    if(*coins3) {
      result += count_combinations_rec(sum, coins3 + 1);

      if (sum == *coins3) 
        ++result;
      else {
        result += count_combinations_rec(sum - *coins3, coins3);
      };
    };

    if (sum == *coins2) 
      ++result;
    else {
      int sum2 = sum - *coins2;
      if (sum2 >= 0) {
        result += count_combinations_rec(sum2, coins2 + 1);

        if (sum2 == *coins2) 
          ++result;
        else
          result += count_combinations_rec(sum2 - *coins2, coins2);
      };
    };
  };

  if (sum == *coins) 
    ++result;
  else {
    int sum2 = sum - *coins;
    if (sum2 >= 0) {

      const int* coins3 = coins + 1;
      if(*coins3) {
        result += count_combinations_rec(sum2, coins3 + 1);

        if (sum2 == *coins3) 
          ++result;
        else {
          result += count_combinations_rec(sum2 - *coins3, coins3);
        };
      };

      if (sum2 == *coins) 
        ++result;
      else {
        int sum3 = sum2 - *coins;
        if (sum3 >= 0) {
          result += count_combinations_rec(sum3, coins + 1);

          if (sum3 == *coins) 
            ++result;
          else
            result += count_combinations_rec(sum3 - *coins, coins);
        };
      };
    };
  };

  return result;
};

By doing this unrolling (which obviously causes executables to be much bigger, as they contain lots of redundant code), the compiler can optimize away three things:

  1. Some redundant checks. For example, if you pass coins in the recursion (and not coins + 1), you don't need to check if *coins is zero again at the next recursion.
  2. Some redundant parameters. At each recursive call, only one of the two parameters changes (either sum or coins, but never both), which means that you can always re-use one of them (read-only), and thus, optimize away some space on the stack memory for that extra redundant variable.
  3. The function-call overhead. By turning 7 recursive function calls into a single function call, it can remove that overhead.

Given that (3) is what makes the non-optimized recursive function slower than the iterative functions, removing it completely (which is impossible) would at best only match the iterative functions. And (2) is not very significant, because one extra variable on the stack is nothing to break a sweat over. So, by elimination, the main optimization is the removal of redundant checks (1), and that makes sense: conditionals (if-statements) are expensive because they can stall the CPU's execution pipeline and waste many clock cycles when branch prediction fails (as it often would in this case).

Conclusions

I guess my conclusions would be the following:

  1. GCC 4.8.1 is really surprising me once again with how amazing it is at optimizing code!
  2. The overhead of dynamic memory allocation is not significant, in this case (as I would expect in most similar cases too).
  3. If you want to optimize away something that is worth it, optimize away redundant but unpredictable branches.

This last conclusion is where a lot of the trouble lies. What exactly is a "redundant but unpredictable branch"? Well, a branch is a conditional statement (if-statement); that's easy. And a conditional that is unpredictable sounds like a conditional worth evaluating (not redundant, not removable). But in a case like this one (counting coin combinations), the termination condition, i.e., (!*coins || sum < 0), has two components (one per parameter), and only one of them ever really needs to be tested, if you know where the recursive call came from (and thus which parameter was modified and is "untested"). In general, without knowing where the call came from, either component could (and often will) evaluate to false, making the branch unpredictable and thwarting branch prediction. And that's where the optimization opportunity lies: a branch that is necessary (not redundant) cannot be removed, and a branch that is predictable is not worth removing because the CPU's branch predictor will take care of it in real time. But a branch that is both redundant and unpredictable is a great target for optimization.

N.B.: One could optimize the iterative versions in exactly the same way; it's just a shame the compiler doesn't do it automatically. If you don't understand why the compiler has a harder time making that optimization on the iterative versions (and fails), you should look into "pure functional programming" and its main argument: mutability makes analysis harder.

Comments
Great job :)
Thanks Mike. This sort of practical knowledge about the tool is always good to know, and makes me look far more competent at work than I really am. Much appreciated :)


I was going to test this with Visual Studio to see how the results compared with yours, but VS won't even compile it, due to this line:

task_type tasks[total_sum / min_coin + 1]; // <- use VLA

Ancient Dragon: Are you sure you're compiling as C and not C++? There are no VLAs in C++.

Given that the code uses the C++ standard library and C++-only features such as templates, I rather suspect that this is making use of the g++ extension that allows VLAs in C++. It certainly won't compile as C :)

Yeah, VLAs are a C99 feature that was made an optional extension in C11. Many C++ compilers, like GCC, support this extension in C++ too; that is how this code is allowed to compile. In MSVC (whose C support only goes up to C90), you can use alloca to allocate stack memory instead; it's just a little less convenient. And in any case, it is not really worth making optimization comparisons with the MSVC compiler; it is well known to be dismal at optimization.

Ancient Dragon: Are you sure you're compiling as C and not C++? There are no VLAs in C++.

The code Mike posted is C++, not C. Hint: it uses <vector>

Given that the code uses the C++ standard library and C++-only features such as templates, I rather suspect that this is making use of the g++ extension that allows VLAs in C++. It certainly won't compile as C :)

You're right. I guess that's what I get for skimming over code instead of reading all of it!
