Following up on a discussion on this thread, I did some tests on the performance of recursion vs. iteration, and some interesting things happened, so I thought I would share the results. I'm gonna structure this a bit like a lab-report, because that's really the best way to put it. But I hope the discussion can continue on additional ideas or insight into the performance issues that are discussed here, or if you disagree with my analysis.

## Experimental Method

So, I took Schol-R-LEA's coin-combination example as a test case, and timed the execution of three different versions: recursive, iterative with dynamic allocation, and iterative with VLA (stack-based arrays) (because of this discussion). Here is the code:

```
#include <vector>
#include <utility>
#include <chrono>
#include <iostream>
// recursive version
int count_combinations_rec(int sum, const int* coins) {
if (!*coins || sum < 0)
return 0;
int result = count_combinations_rec(sum, coins + 1);
if (sum == *coins)
return 1 + result;
else
return count_combinations_rec(sum - *coins, coins) + result;
};
// equivalent iterative version (shared by other two functions):
template <typename TaskIter>
int count_combinations_iter_impl(TaskIter first_task) {
TaskIter cur_task = first_task;
int result = 0;
while( cur_task >= first_task ) {
const int cur_sum = cur_task->first;
const int* const cur_coins = cur_task->second;
if( !*cur_coins || (cur_sum <= 0) ) {
--cur_task;
continue;
};
++(cur_task->second);
if(*cur_coins == cur_sum)
++result;
else {
++cur_task;
cur_task->first = cur_sum - *cur_coins;
cur_task->second = cur_coins;
};
};
return result;
};
// iterative version with dyn-alloc
int count_combinations_iter(int total_sum, int min_coin, const int* coins_first) {
typedef std::pair<int, const int*> task_type;
std::vector< task_type > tasks(total_sum / min_coin + 1); // <- use dyn-array
tasks[0].first = total_sum;
tasks[0].second = coins_first;
return count_combinations_iter_impl(tasks.begin());
};
// iterative version with VLA
int count_combinations_VLAiter(int total_sum, int min_coin, const int* coins_first) {
typedef std::pair<int, const int*> task_type;
task_type tasks[total_sum / min_coin + 1]; // <- use VLA
tasks[0].first = total_sum;
tasks[0].second = coins_first;
return count_combinations_iter_impl(tasks);
};
namespace ch = std::chrono;
int main() {
int coins_array[] = {1000, 100, 50, 20, 10, 5, 2, 1, 0};
const int num_trials = 1e3;
volatile int non_suppressed_output = 0;
typedef ch::high_resolution_clock hrc;
{
auto t1 = hrc::now();
for(int i = 0; i < num_trials; ++i) {
non_suppressed_output += count_combinations_rec( 100, coins_array);
};
auto dt = hrc::now() - t1;
dt /= num_trials;
std::cout << "Recursive solution took, on average " << ch::duration_cast<ch::nanoseconds>(dt).count() << " ns to solve the problem." << std::endl;
};
{
auto t1 = hrc::now();
for(int i = 0; i < num_trials; ++i) {
non_suppressed_output += count_combinations_iter( 100, 1, coins_array);
};
auto dt = hrc::now() - t1;
dt /= num_trials;
std::cout << "Dyn-Iterative solution took, on average " << ch::duration_cast<ch::nanoseconds>(dt).count() << " ns to solve the problem." << std::endl;
};
{
auto t1 = hrc::now();
for(int i = 0; i < num_trials; ++i) {
non_suppressed_output += count_combinations_VLAiter( 100, 1, coins_array);
};
auto dt = hrc::now() - t1;
dt /= num_trials;
std::cout << "VLA-Iterative solution took, on average " << ch::duration_cast<ch::nanoseconds>(dt).count() << " ns to solve the problem." << std::endl;
};
};
```

## Results

Here are the results (GCC 4.8.1, Linux Kubuntu 13.10):

1) Compilation under highest optimization:

```
$ g++ -std=c++11 -Wall -O3 -fomit-frame-pointer count_coins_rec_it.cpp -o count_coins_rec_it
$ ./count_coins_rec_it
Recursive solution took, on average 289553 ns to solve the problem.
Dyn-Iterative solution took, on average 596631 ns to solve the problem.
VLA-Iterative solution took, on average 542989 ns to solve the problem.
```

2) Compilation under less optimization:

```
$ g++ -std=c++11 -Wall -O1 -fomit-frame-pointer count_coins_rec_it.cpp -o count_coins_rec_it
$ ./count_coins_rec_it
Recursive solution took, on average 768298 ns to solve the problem.
Dyn-Iterative solution took, on average 515090 ns to solve the problem.
VLA-Iterative solution took, on average 549333 ns to solve the problem.
```

## Observations

Here are some observations:

1) There isn't really much significant performance difference between the VLA and dynamically allocated versions. Oddly enough, it seems that more optimization is a bit worse for the dynamic allocation, I guess that must have to do with using a more optimized heap (and thus, more complex). Also, this is not a very representative test for the dynamic allocation because (1) the amount allocated is very small and (2) the heap is not fragmented at all since it is the only dynamic allocation (ostensibly). And, in any case, the overhead of dynamic allocation is not expected to be much (but should be worse than stack). These results may also suggest that the VLA implementation generated by the compiler is not optimal (as it should never be worse than dynamic allocation, even by a few dozen micro-seconds).

2) The iterative versions are not affected much by the optimization level (perform equally at all levels). This would imply that the iterative version is already "optimal" in its current form, and cannot really be optimized further by the compiler.

3) The recursive version is highly affected by the optimization level. The results at the lesser optimization level (`-O1`

) are the results expected in general in a recursive vs. iterative show-down (due to function-call overhead). What I was interested in seeing is how far and where would the compiler go to optimize the recursive form. I hypothesized (in the last post) that it would take a route along the lines of tail-call optimization and try to turn this into the equivalent iterative form, at least, to some extent, which would have brought the performance closer to the iterative form. However, it went another route in the optimization, and manage to surpass, by quite a good margin, the performance of the iterative functions.

After inspecting the assembly code produced by the compiler at the `-O3`

level (which I won't post here, due to length), the answer to this mystery was quite enlightening! If you can't turn a recursion into an iteration, what can you do to optimize it? Answer: you unroll it. The compiler unrolls a 2 levels of recursion and combines them into one function, and thus, minimizing the overhead of function calls. If you are confused about what I mean, here is the translation back from assembly to C++ of the highly-optimized version of the recursion:

```
// recursive method, unrolled once:
int count_combinations_rec(int sum, const int* coins) {
int result = 0;
if (!*coins || sum < 0)
return 0;
const int* coins2 = coins + 1;
if(*coins2) {
result += count_combinations_rec(sum, coins2 + 1);
if (sum == *coins2)
++result;
else {
result += count_combinations_rec(sum - *coins2, coins2);
};
};
if (sum == *coins)
++result;
else {
int sum2 = sum - *coins;
if (sum2 >= 0) {
result += count_combinations_rec(sum2, coins + 1);
if (sum2 == *coins)
++result;
else {
result += count_combinations_rec(sum2 - *coins, coins);
};
};
};
return result;
};
// recursive method, unrolled twice:
// this is the version produced by GCC with '-O3':
int count_combinations_rec(int sum, const int* coins) {
int result = 0;
if (!*coins || sum < 0)
return 0;
const int* coins2 = coins + 1;
if(*coins2) {
const int* coins3 = coins2 + 1;
if(*coins3) {
result += count_combinations_rec(sum, coins3 + 1);
if (sum == *coins3)
++result;
else {
result += count_combinations_rec(sum - *coins3, coins3);
};
};
if (sum == *coins2)
++result;
else {
int sum2 = sum - *coins2;
if (sum2 >= 0) {
result += count_combinations_rec(sum2, coins2 + 1);
if (sum2 == *coins2)
++result;
else
result += count_combinations_rec(sum2 - *coins2, coins2);
};
};
};
if (sum == *coins)
++result;
else {
int sum2 = sum - *coins;
if (sum2 >= 0) {
const int* coins3 = coins + 1;
if(*coins3) {
result += count_combinations_rec(sum2, coins3 + 1);
if (sum2 == *coins3)
++result;
else {
result += count_combinations_rec(sum2 - *coins3, coins3);
};
};
if (sum2 == *coins)
++result;
else {
int sum3 = sum2 - *coins;
if (sum3 >= 0) {
result += count_combinations_rec(sum3, coins + 1);
if (sum3 == *coins)
++result;
else
result += count_combinations_rec(sum3 - *coins, coins);
};
};
};
};
return result;
};
```

By doing this unrolling (which obviously causes executables to be much bigger, as they contain lots of redundant code), the compiler can optimize away three things:

- Some redundant checks. For example, if you pass
`coins`

in the recursion (and not`coins + 1`

), you don't need to check if`*coins`

is zero again at the next recursion. - Some redundant parameters. At each recursive call, only one of the two parameters changes (either sum or coins, but never both), which means that you can always re-use one of them (read-only), and thus, optimize away some space on the stack memory for that extra redundant variable.
- The function-call overhead. By turning 7 recursive function calls into a single function call, it can remove that overhead.

Given that (3) is what makes the non-optimized recursive function slower than the iterative functions, if it was removed completely (which is impossible), the performance would only equal the iterative functions. Then, (2) is not very significant because an extra variable on the stack is nothing to break a sweat about. So, by elimination, the main optimization is the removal of some redundant checks (3), and that makes sense because conditionals (if-statements) are expensive because they can obstruct the execution pipeline on the CPU and cause many wasted clock cycles when branch-prediction fails (as it often would in this case).

## Conclusions

I guess my conclusions would be the following:

- GCC 4.8.1 is really surprising me once again with how amazing it is at optimizing code!
- The overhead of dynamic memory allocation is not significant, in this case (as I would expect in most similar cases too).
- If you want to optimize away something that is worth it, optimize away redundant but unpredictable branches.

This last conclusion is where a lot of the trouble lies. What exactly is a "redundant but unpredictable branch"? Well, a branch is a conditional statement (if-statement), that's easy. But a conditional that is unpredictable seems like a conditional worth evaluating (not redundant, not removable). But in a case like the one here (counting coin combinations), the termination condition, i.e., `(!*coins || sum < 0)`

, has two components (one for each parameter), and only one component is ever really needed, if you know where the recursive-call came from (and thus, which was the modified, "untested" parameter). But in general, without knowing where the call came from, either one of these could (and often will) evaluate to false, and are thus, unpredictable, thwarting branch-prediction. And that's where the optimization opportunity lies, because a branch that is necessary (not redundant) cannot be removed, and a branch that is predictable is not worth removing because the CPU branch-predictor will take care of that in real-time, but a branch that is both redundant and unpredictable is a great opportunity for optimization.

N.B.: One could optimize the iterative version in exactly the same way, it's just a shame the compiler doesn't do it automatically. If you don't understand why the compiler has a harder time (and fails) to make that optimization on the iterative versions, you should look into "pure functional programming" and its main argument: mutability makes analysis harder.