I have an array of 3,000 floats: float *mVec = new float[3000]; and I need to take the log of each element. I am translating MATLAB code, in which everything is vectorized (i.e., log(mVec) does some parallelization behind the scenes).

There must be a better way to do this than a for loop. Any suggestions? Speed is important to me.

There are mainly two easy ways to "parallelize" the operations in a for-loop like this.

One option is to use the SSE instruction sets (SSE2, SSE3). A single SSE instruction can perform several floating-point operations at once (four, for single-precision floats). This is something that the compiler can do automatically. If you are using GCC (or ICC), these are the appropriate compilation flags:

-mfpmath=sse -Ofast -march=native -funroll-loops

If you add those to your compilation command, the compiler should optimize more heavily for your current architecture (your computer), using SSE instructions, and unrolling for-loops to further optimize things.
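For reference, here is a minimal sketch of the kind of loop those flags are aimed at. The function name log_in_place is only illustrative. Whether GCC actually vectorizes the std::log call depends on the compiler version and on the math library providing a vector version of the function, so treat this as a loop the compiler is allowed to vectorize, not one it is guaranteed to. GCC's -fopt-info-vec option reports which loops were vectorized.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Plain element-wise log, in place. Each iteration is independent of the
// others, which is what gives the compiler a chance to vectorize the loop.
void log_in_place(float* arr, std::size_t n) {
  assert(arr != nullptr || n == 0);
  for (std::size_t i = 0; i < n; ++i)
    arr[i] = std::log(arr[i]);
}
```

With the array from the question, the call would be log_in_place(mVec, 3000).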

Another easy option for parallelizing code is to use OpenMP. OpenMP allows you to tell the compiler to create multiple threads, each executing one chunk of the overall for-loop, all in parallel. It requires a few bits of mark-ups on your code, but it's easy. Here is a parallel for-loop that does a logarithm on an array using 4 threads:

#include <cmath>

void do_log_for_loop_omp_sse(float* arr, int n) {
  #pragma omp parallel num_threads(4)
  {
    #pragma omp for
    for (int i = 0; i < n; ++i)
      arr[i] = std::log(arr[i]);
  }
}

When you compile code that uses OpenMP with GCC, you need to pass the command-line option -fopenmp to enable it.

Also note that you can easily combine the two methods by using OpenMP in your code and telling the compiler to use SSE instructions.
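One possible way to write the combined version is sketched below. It uses the "parallel for simd" construct, which is an assumption on my part: it needs a compiler that supports OpenMP 4.0 or later, and it is not necessarily what the timed program used. The function name log_omp_simd and the file name in the comment are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Compile with both sets of options, e.g. (file name is hypothetical):
//   g++ -fopenmp -mfpmath=sse -Ofast -march=native -funroll-loops log_demo.cpp
void log_omp_simd(float* arr, int n) {
  assert(arr != nullptr || n == 0);
  // "parallel for" splits the index range across threads; "simd" asks the
  // compiler to vectorize each thread's chunk (OpenMP 4.0 or later).
  #pragma omp parallel for simd
  for (int i = 0; i < n; ++i)
    arr[i] = std::log(arr[i]);
}
```

Without -fopenmp the pragma is ignored and the function simply runs as an ordinary serial loop.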

Just for fun, I wrote a program that measures the time for each of the four variants (plain, SSE, OpenMP, and OpenMP + SSE) on 3000 elements, and here is the output that I got:

Time (normal) is :              30548 nanoseconds.
Time with SSE is :              19755 nanoseconds.
Time with OpenMP is :           13943 nanoseconds.
Time with OpenMP + SSE is :      9750 nanoseconds.
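If you want to reproduce this kind of measurement yourself, a timing helper along the following lines should do. This is a sketch using std::chrono, not the exact program that produced the numbers above; time_ns and time_log_loop are illustrative names.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

// Runs fn once and returns the elapsed wall-clock time in nanoseconds.
template <typename Fn>
long long time_ns(Fn&& fn) {
  const auto start = std::chrono::steady_clock::now();
  fn();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Times an in-place log over v. The results stay in v, so the caller can
// still observe them and the compiler cannot discard the loop as dead code.
long long time_log_loop(std::vector<float>& v) {
  assert(!v.empty());
  return time_ns([&v] {
    for (float& x : v) x = std::log(x);
  });
}
```

A single run at this scale (tens of microseconds) is noisy, so it is worth repeating the measurement several times on fresh input and looking at the minimum or the median.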

As you can see, you can get a pretty good speed-up by using these methods. There is virtually no doubt in my mind that what MATLAB does in terms of parallelization of the operation is exactly what I did above, i.e., combining SSE and OpenMP. And by exactly, I mean exactly. I don't think MATLAB does anything more fancy than this (there isn't much more that can be done, actually), and I'm pretty sure they use OpenMP too (maybe Intel TBB, which is very similar).

Of course, if you have access to an Intel compiler, you should be using Intel's TBB and related parallelization libraries and language extensions. I don't have access to an Intel compiler, so I can't tell if it would be better (probably, because Intel is hard to beat, performance-wise).

And the rule for SSE versus OpenMP is that OpenMP relies on threads, which, for small tasks (small arrays), cost more to create, launch, and destroy than they gain back by doing things in parallel. So OpenMP is more beneficial for very time-consuming operations. Here, with only 3000 elements, it is beneficial, but barely. If you move up to 30,000 elements, the speed-up from OpenMP is far more significant (on my computer, at 30,000 elements, OpenMP is 4 times faster than SSE alone). But if you move down to 30 or 300 elements, using SSE alone is probably better.
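One way to encode that rule in the code itself is OpenMP's if clause, which makes the parallel region run on a single thread when its condition is false. The sketch below assumes a cutoff of 3000 elements purely for illustration; kParallelThreshold and log_adaptive are hypothetical names, and the right cutoff has to be found by measuring on your own machine.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical cutoff; tune it by measurement on the target machine.
constexpr int kParallelThreshold = 3000;

void log_adaptive(float* arr, int n) {
  assert(arr != nullptr || n == 0);
  // When the condition is false, the loop runs on one thread, which avoids
  // the thread start-up cost for small arrays.
  #pragma omp parallel for if(n >= kParallelThreshold)
  for (int i = 0; i < n; ++i)
    arr[i] = std::log(arr[i]);
}
```

The serial path can still benefit from the SSE compilation flags, so small arrays get vectorization alone and large arrays get both.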
