I wrote the following code to test which is faster: performing a multiplication and division many times, or accessing the results in a vector via iterators(perform the calculations ahead of time), or hard coding the results of those calculations into the program. I actually expected storing the results in the vector to be faster than performing the calculations, but on my system I get the following relation for the times to complete each corresponding function:

Hard Coded Values < 5 * (On the fly calculations) < 50 * ( Vector Stored Values)

So my question is this: Am I stuck making these same calcuations over and over again (or hard coding their results), or is there a faster way to do it?

Note: My timer is for Windows, but I think the rest of the code is OS agnostic.

#include <windows.h>
#include <iostream>
#include <algorithm>
#include <vector>

using namespace std;

template <class Function>
__int64 time_call(Function&& f)
{
    __int64 begin = GetTickCount();
    f();
    return GetTickCount() - begin;
}

int numberOfReps(100000000);
int dotFrequency(25000000);

class PreCalcObj
{
public:
      PreCalcObj()
      {
          for (double i = 1; i < 4; ++i)
          {
              for (double j = 1; j < 4; ++j)
              {
                  TiTjsumT2.push_back(i * j / 4597);
              }
          }
      }
      vector<double> TiTjsumT2;
};

double CalcAll()
{
    double valReturn;

    for (int iCount = 0; iCount < numberOfReps; ++iCount)
    {
        double bVal(33.23);
        if (iCount % dotFrequency== 0)
        {
            cout << ".";
        }
        for (double i = 1; i < 4; ++i)
        {
            for (double j = 1; j < 4; ++j)
            {
                valReturn = bVal * i * j / 4597;
            }
        }
    }

    cout << valReturn << endl;
    return valReturn;
}

double UsePreCalced(const PreCalcObj& mObj)
{
    double valReturn;
    double bVal(33.23);

    for (int iCount = 0; iCount < numberOfReps; ++iCount)
    {
        if (iCount % dotFrequency== 0)
        {
            cout << ".";
        }
        for (auto it = mObj.TiTjsumT2.begin(); it != mObj.TiTjsumT2.end(); ++it)
        {
            valReturn = bVal * *it;
        }
    }

    cout << valReturn << endl;
    return valReturn;
}

double CalcHardCodedVAls()
{
    double valReturn;
    double bVal(33.23);

    for (int iCount = 0; iCount < numberOfReps; ++iCount)
    {
        if (iCount % dotFrequency== 0)
        {
            cout << ".";
        }
        valReturn = bVal * .000217533178;   // = bVal * 1*1 / 4597
        valReturn = bVal * .000435066356;   // = bVal * 2*1 / 4597
        valReturn = bVal * .0006525995214;  // = bVal * 3*1 / 4597
        valReturn = bVal * .000435066356;   // = bVal * 1*2 / 4597
        valReturn = bVal * .0008701326952;  // = bVal * 2*2 / 4597
        valReturn = bVal * .00130519904;    // = bVal * 3*2 / 4597
        valReturn = bVal * .0006525995214;  // = bVal * 1*3 / 4597
        valReturn = bVal * .00130519904;    // = bVal * 2*3 / 4597
        valReturn = bVal * .00195779856;    // = bVal * 3*3 / 4597
    }

    cout << valReturn << endl;
    return valReturn;
}

int main(int argc, char* argv[])
{
    PreCalcObj preCalced;

    cout << "PreCalced values time: " << time_call([&] {UsePreCalced(preCalced);}) << endl;
    cout << "All Calculated time: "   << time_call([&] {CalcAll();}) << endl;
    cout << "Hard Coded time: "       << time_call([&] {CalcHardCodedVAls();}) << endl;

    cin.get();
    return 0;
}

The results make sense -- vectors have a lot of overhead that take cpu cycles, more time than just doing the calculations. The program has to make other calculations just to access the vector item.

Well, I took NathanOliver's advice and tried it with an array instead of a vector. Turns out after repeating the test many times, the array is almost as fast as the hard coded values. Some times it took the same number of miliseconds, other times it took as much as 5% more. That's fantastic!

And more good news. The vector was not as bad as I thought (except when debugging). I think somehow I had misread the print out. Sorry, I ran the test like 5 times and misread it 5 times. I was shocked when I got back in this evening, ran the test and realized my error!

So here are my results on the times for the last run I did. I was very careful not to misread the printout!

Hard coded time = 125ms
PreCalced Array time = 130ms
PreCalced Vector time = 624ms
Repetitive Calc time = 5444ms...

So, the great news is the std::vector was pretty darn good. Only five times slower than the hard coded result, but nine times faster than doing the repetitive calculations. But even better, storing the values in an array ahead of time was almost just as fast as hard coding the values.

Now, I certainly feel silly for misreading the printout, and starting this thread, but on the positive side, I might not have thought to use an array which is about 5 times faster than the vector (in this particular function)! I've really used vector extensively over the past year, but for speed applications, I'm going to start using arrays. Now I'll just have to learn to be careful with my memory allocations. It was so nice just letting my vectors go out of scope and know that nothing was leaking. :)

Well, I meant compared to repeating the same calculations over and over again. But you are right, it's still not that good compared to using an array which is 5 times faster.

I have a couple of questions:

What happens if you increase the number of calculations you have to make per iteration?
Are you sure you're timing the hardcoded approach properly?
What optimization flags are you using?

Check this out:

#include <windows.h>
#include <iostream>
#include <algorithm>
#include <vector>

using namespace std;

template <class Function>
__int64 time_call(Function&& f)
{
    __int64 begin = GetTickCount();
    f();
    return GetTickCount() - begin;
}

int numberOfReps(25000);
int dotFrequency(7500);

const int MY_SIZE = 200;

class PreCalcObj
{
public:
      PreCalcObj()
      {
          for (int i = 0; i < MY_SIZE; ++i)
          {
              for (int j = 0; j < MY_SIZE; ++j)
              {
                  TiTjsumT2.push_back(i * j / 4597.0);
              }
          }
      }
      vector<double> TiTjsumT2;
};

class PreCalcObjArr
{
public:
      PreCalcObjArr()
      {
          for (int i = 0; i < MY_SIZE; ++i)
          {
              for (int j = 0; j < MY_SIZE; ++j)
              {
                  TiTjsumT2[i * MY_SIZE + j] = i * j / 4597.0;
              }
          }
      }
      double TiTjsumT2[MY_SIZE * MY_SIZE];
};

double CalcAll()
{
    double valReturn = 0;

    for (int iCount = 0; iCount < numberOfReps; ++iCount)
    {
        double bVal(33.23);
        if (iCount % dotFrequency== 0)
        {
            cout << ".";
        }
        for (int i = 1; i < MY_SIZE; ++i)
        {
            for (int j = 1; j < MY_SIZE; ++j)
            {
                valReturn += bVal * i * j / 4597.0;
            }
        }
    }

    cout << valReturn << endl;
    return valReturn;
}

double UsePreCalced(const PreCalcObj& mObj)
{
    double valReturn = 0;
    double bVal(33.23);

    for (int iCount = 0; iCount < numberOfReps; ++iCount)
    {
        if (iCount % dotFrequency== 0)
        {
            cout << ".";
        }
        for (auto it = mObj.TiTjsumT2.begin(), end_it = mObj.TiTjsumT2.end(); it != end_it; ++it)
        {
            valReturn += bVal * *it;
        }
    }

    cout << valReturn << endl;
    return valReturn;
}

double UsePreCalcedArr(const PreCalcObjArr & mObj)
{
    double valReturn = 0;
    double bVal(33.23);

    const double * arr = mObj.TiTjsumT2;

    for (int iCount = 0; iCount < numberOfReps; ++iCount)
    {
        if (iCount % dotFrequency== 0)
        {
            cout << ".";
        }
        for (int i = 0; i != MY_SIZE * MY_SIZE; ++i)
        {
            valReturn += bVal * arr[i];
        }
    }

    cout << valReturn << endl;
    return valReturn;
}

#define HC_1 valReturn += bVal * .00001;\
        valReturn += bVal * .00002;   \
        valReturn += bVal * .00003;  \
        valReturn += bVal * .00004;   \
        valReturn += bVal * .00005; \
        valReturn += bVal * .00006;   \
        valReturn += bVal * .00007;  \
        valReturn += bVal * .00008;   \
        valReturn += bVal * .00009; \
        valReturn += bVal * .00010;

#define HC_2 HC_1 HC_1 HC_1 HC_1 HC_1 HC_1 HC_1 HC_1 HC_1 HC_1
#define HC_3 HC_2 HC_2 HC_2 HC_2 HC_2 HC_2 HC_2 HC_2 HC_2 HC_2
#define HC_4 HC_3 HC_3 HC_3 HC_3 HC_3 HC_3 HC_3 HC_3 HC_3 HC_3
#define HC_5 HC_4 HC_4 HC_4 HC_4

double CalcHardCodedVAls()
{
    double valReturn = 0;
    double bVal(33.23);

    for (int iCount = 0; iCount < numberOfReps; ++iCount)
    {
        if (iCount % dotFrequency== 0)
        {
            cout << ".";
        }

        HC_5
    }

    cout << valReturn << endl;
    return valReturn;
}

int main(int argc, char* argv[])
{
    PreCalcObj preCalced;
    PreCalcObjArr preCalcedArr;

    cout << "Hard Coded time: "       << time_call([&] {CalcHardCodedVAls();}) << endl;
    cout << "array PreCalced values time: " << time_call([&] {UsePreCalcedArr(preCalcedArr);}) << endl;
    cout << "vector PreCalced values time: " << time_call([&] {UsePreCalced(preCalced);}) << endl;
    cout << "All Calculated time: "   << time_call([&] {CalcAll();}) << endl;

    cin.get();
    return 0;
}

When I compile with -O3, I get 2.1 seconds for the hardcoded approach, 1.3 seconds for the array and vector
approaches (actually, almost every time, the vector is slightly faster) and 6.9 seconds with on-the-fly calculations.

Edited 4 Years Ago by m4ster_r0shi

Comments
Great point about the optimization!

Well, I turned off all optimizations I could find in Visual Studio (including setting c++ optimization to Od. I suppose that is similar to O3) and got the following results with your modified code. Notably, my vectors didn't perform nearly as well as yours did, and neither did my array when compared to the hardcoded results.

Hard Coded: 2.839s
Array Vals: 2.808s
Vector Vals: 12.449s
on-the-fly: 6.037

But I don't really want to turn off all optimization for this test. I just want to make sure none of the loop iterations gets skipped-that all calculations are computed as I've written them. If the linker or compiler can make things more efficient without a shortcut, that is fine so long as those same efficiency gains are experienced when I use this in a real program.

I'm not sure what to do next. Perhaps I'll try to make two copies of the real program I'm working on. One with arrays and one with vectors and see if there is a difference.

Well I added #define _SECURE_SCL 0 to disable checked iterators, and got roughly the same results. All of the times I have posted were in release mode. When I tested in debug, the vector function was 1 or 2 orders of magnitude slower.

Hard coded time = 125ms
PreCalced Array time = 130ms
PreCalced Vector time = 624ms
Repetitive Calc time = 5444ms...

First observation, from your code, my first thought was: the compiler will optimize away the entire calculation part of the hard-coded loop. And that's gonna happen even at lowest optimization level (as long as it is in release mode). In other words, it will come down to this:

double CalcHardCodedVAls()
{
    for (int iCount = 0; iCount < numberOfReps; ++iCount)
    {
        if (iCount % dotFrequency== 0)
        {
            cout << ".";
        }
    }
    double valReturn = 33.23 * .00195779856;
    cout << valReturn << endl;
    return valReturn;
}

In the case of the array storage, it is very likely to be similar, with the calculation line as double valReturn = 33.23 * arr[8];. Which means, in effect, that all you are really measuring is the time that it takes to do the dot printing, and that's why the results are identical. A simple way to test this is to remove the dot-printing and check if you get very close to 0ms (under the same compilation conditions). This is also why r0shi added the += in the loop instead of the straight assignment, this actually forces the compiler not to optimize away the loop.

The only reason why the vector case was much slower than the array case, is because the compiler cannot infer the same things about a vector as it can about an array, thus, not allowing it to optimize away the inner-loop. I don't understand why the repetitive calculation case would be that much slower, because, in fact, it ought to be optimized away (even without optimizations enabled) just like in the first two cases. I guess your compiler must be pretty dumb (and knowing that you are using MSVC, it is not too surprising).

The results make sense -- vectors have a lot of overhead that take cpu cycles, more time than just doing the calculations. The program has to make other calculations just to access the vector item.

That's generally not true. On GNU implementations, and in my experience, accessing elements of a vector (via operator[] or iterators) is exactly equivalent to accessing an array through the same method. Usually, in most implementations, vector-iterators are just a typedef for a pointer type, and the C++ standard is worded in a way that highly suggest this should be the case. There might be one or two levels of trivial inlining to be done when using vectors, which might cause some additional overhead if compiling at the lowest optimization level or in debug mode, but none of that will show up when optimizations are turned on.

I've really used vector extensively over the past year, but for speed applications, I'm going to start using arrays.

No. Don't start using array because of the results of a flawed test. Accessing elements in a vector is just as fast as in an array, when there is a minimal level of optimization turned on (O1 will be enough for that, and of course, in release, and turning off any MSVC crap about pervasive range-checks and all that). And, overall, with the amortized constant allocations of memory in the vector class, you always turn up on top of arrays in terms of performance.

When I compile with -O3, I get 2.1 seconds for the hardcoded approach, 1.3 seconds for the array and vector approaches (actually, almost every time, the vector is slightly faster) and 6.9 seconds with on-the-fly calculations.

This is much more inline with what one would expect as a result of a proper test program for this. Well done r0shi!

Well, I turned off all optimizations I could find in Visual Studio (including setting c++ optimization to Od. I suppose that is similar to O3) and got the following results with your modified code.

Most of small performance tests like these is about what amount of optimization the compiler can do, and what is the minimum level required to get what is expected (i.e., vector performance == array performance). It makes no sense to make the tests with no optimizations, because that simply doesn't correspond to any reasonable use-case (i.e., you would never compile a performance-sensitive program without optimizations turned on, and all MSVC crap turned off).

I just want to make sure none of the loop iterations gets skipped-that all calculations are computed as I've written them.

Then make sure your loop is necessary somehow. The problem with your original loop is that all that matters is the last iteration, because it overwrites the result from all previous iterations. So, you do as r0shi did, you make a summation, or something similar, that makes all iterations relevant to the final result, meaning the compiler cannot optimize the iterations away, unless the compiler is really clever (but now we're talking Intel-compiler levels of optimizations).

Another useful trick to prevent certain things from being optimized away is to use the volatile keyword (which forces the compiler to re-read the value of the variable all the time). For example, a loop like this:

int value = 42;

int sum() {
  int result = 0;
  for(int i = 0; i < 50; ++i)
    result += value;
  return result;
};

Could get optimized away, by a clever compiler, as just:

int value = 42;

int sum() {
  return 50 * 42;
};

But, if you declare the variable value with the keyword volatile, you'll be able to prevent that, because the compiler can no longer assume that value will have a constant through the loop and will be forced to evaluated each iteration.

As a general hint for the future, if you want to do these kinds of small performance tests, I highly recommend that you get familiar with assembly listings that your compiler can produce for you. Because inspecting assembly listings is really the only reasonable way to assess what optimizations are really being done, and what the resulting code really looks like. And, don't be scared by it, it's a lot simpler than it looks at first.

Thank you. Very helpful post (once again). When I have time later this weekend I'll research assembly listings.

This question has already been answered. Start a new discussion instead.