Optimizing Matrix Multiplication

One time consuming task is multiplying large matrices. In this post we'll look at ways to improve the speed of this process. We'll be using a square matrix, but with simple modifications the code can be adapted to any type of matrix.

The straight forward way to multiply a matrix is:

for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i,j] += A[i,k] * B[k,j];
        }
    }
}

Now it's not important to us in which order the loops run (i, j, k) but the compiler might be able to optimize the code based on the order, so lets try some timings of all variations of orders (ijk, ikj, jik, jki, kij, kji) for the multiplication of two 500x500 matrices of doubles.

We'll run each test 10 times to balance for any system calls that might happen during the test:

Name   Milliseconds
ijk       2,361
ikj       2,000
jik       2,315
jki       5,775
kij       2,089
kji       5,665

It seems that the order ikj is slightly faster, while both jki and kji are significantly slower.

But what about using jagged arrays (that aren't really jagged, since all rows will have same size)? They might give better performance. Here is the same code above but using jagged arrays:

for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

And the results:

Name   Milliseconds
ijk       3,622
ikj       1,212
jik       3,641
jki       7,897
kij       1,244
kji       7,773

Well that is interesting. Most of the methods are slower, except for ikj and kij which are almost twice as fast.

This leads into the best part about using jagged arrays. With the current code every access to an element is a double index. First to the row, then to the column. By introducing some extra variables we can optimize this to a single index in each loop:

for (int i = 0; i < N; i++) {
    double[] iRowA = A[i];
    double[] iRowC = C[i];
    for (int k = 0; k < N; k++) {
        double[] kRowB = B[k];
        double ikA = iRowA[k];
        for (int j = 0; j < N; j++) {
            iRowC[j] += ikA * kRowB[j];
        }
    }
}

Running this code gives:

Name   Milliseconds
opt       410

Now we are about 5 times faster than our original method.

Can we improve on this? Of course we can. With .NET 4.0 came the introduction of PLINQ, which gives us an easy way to perform parallel tasks. By taking one of the indexes out of the loop, we can use it as a parameter to the method and calculate multiple rows at a time. Our code to do this looks like this:

double[] iRowA = A[i];
double[] iRowC = C[i];
for (int k = 0; k < N; k++) {
    double[] kRowB = B[k];
    double ikA = iRowA[k];
    for (int j = 0; j < N; j++) {
        iRowC[j] += ikA * kRowB[j];
    }
}

All we have to do now is call this method N times with i ranging from 0 to N. Again, PLINQ gives us the easy way to do this:

var source = Enumerable.Range(0, N);
var pquery = from num in source.AsParallel()
             select num;
pquery.ForAll((e) => Popt(A, B, C, e));

Where Popt is our method name taking 3 jagged arrays (C = A * B) and the row to calculate (e). How fast is this:

Name   Milliseconds
Popt       187

That's over 12 times faster than our original code! With the magic of PLINQ we are creating 500 threads in this example and don't have to manage a single one of them, everything is handled for you.

So the snippet for this post will be the final parallel multiplication code.

Hopefully this post will help you 'think outside the box' when you encounter a problem that needs some optimization. By changing the order in which we did things, to the base data structure used, to introducing parallelism we've made a huge improvement!

Edited 5 Years Ago by Momerath: n/a

Comments
Great work :)
Good work.
thanks
Excellent Post!!! Dedicated testing and good explainations
Out of this world!!!
// requires a global value N which is the dimension of the array
// Array C must be fully allocated or you'll get a null reference exception

void Topt(double[][] A, double[][] B, double[][] C) {
    var source = Enumerable.Range(0, N);
    var pquery = from num in source.AsParallel()
                    select num;
    pquery.ForAll((e) => Popt(A, B, C, e));
}

void Popt(double[][] A, double[][] B, double[][] C, int i) {
    double[] iRowA = A[i];
    double[] iRowC = C[i];
    for (int k = 0; k < N; k++) {
        double[] kRowB = B[k];
        double ikA = iRowA[k];
        for (int j = 0; j < N; j++) {
            iRowC[j] += ikA * kRowB[j];
        }
    }
}

Wow!
You must have spent some time to work this all out!
This is great, great stuff!
I'm working( well not all day ;) ) on a sort of matrix calculator, you surely gave me some ideas to rework my code.
Rep X 10, but alas I can't:'(

Would your implementation be faster than C++, who claims to be one of the fastest guys around the block?

Edited 5 Years Ago by ddanbe: n/a

Not the greatest C++ programmer, but trying to stick with the best method available I came up with:

void MatMul(double** A, double** B, double** C) {
    for (int i = 0; i < N; i++) {
        double* iRowA = A[i];
        double* iRowC = C[i];
        for (int k = 0; k < N; k++) {
            double* kRowB = B[k];
            double ikA = iRowA[k];
            for (int j = 0; j < N; j++) {
                iRowC[j] += ikA * kRowB[j];
            }
        }
    }
}

Testing this gives a runtime of 213 milliseconds, or slightly slower than the parallel version. C++ pointer math wins. Now to figure out how to write this in unsafe C# code and see if that improves anything.

Comments
Great!

Looking around I've also found the Microsoft has a Matrix class in their Infer.net. Testing it for a 500x500 matrix gives:

Infer.net 1060 milliseconds

Interesting to know that my code beats theirs even without using parallel execution :)

Hi, Could you please explain, is C the destination array (where A*B) is stored after the multiplication? Thanks

If you want to write fast MM code you need to write it in C, you need to use multliple threaded execution and you need to use the cache.

Cache optimizations are going to give you the most speed up for single threaded execution.

Tiling method - you chop the matrix into tiles small enough to fit in cache so that way there are less cache misses.

tile_size = <size of your cache line>
for(int i = 0; i < N; i+= tile_size) {
for(int j = 0; j < N; j+=tile_size) {
for(int k = 0; k < N; k+=tile_size) {
for(int ii = i; ii < min(i+tile_size, N); ++ii) {
for(int jj = j; jj < min(j+tile_size, N); ++jj) {
for(int kk = k; kk < min(k+tile_size, N); ++kk) {
c[ii*N + jj] += a[ii*N + kk] + b[kk*N + jj];


#this code will only work for square matrices

Besides that, openmp and MPI are good tools.

Edited 3 Years Ago by Nick Evan: Fixed formatting

I'm late to the party, but for the benefit of others who may be reading this, it should be noted that the optimium loop ordering for dense matrix-matrix multiplication has been known since shortly after cache memory was introduced. See any book on numerical linear algebra, such as Golub and Van Loan's “Matrix Computations”, for a thorough treatment.

The canonical solution is to use the BLAS Level 3 (matrix-matrix) routines DGEMM (for double precision) or SGEMM (for single precision) from a good BLAS implementation (e.g. Intel's MKL, OpenBLAS/GotoBLAS, or ATLAS), which will absolutely smoke your implementation unless you've put a staggering amount of effort into microoptimization. In my experience, a good BLAS DGEMM will be something like 30x faster than any manual implementation such as the above.

It's also noteworthy that a good compiler can recognize the “antipattern” of doing dense matrix multiplication with the wrong loop order, and transparently reorder the loops to the optimum order, so you will observe different timings for ijk vs. kji ordering using some compilers (such as gcc), but not others (such as icc).

If you're more interested in learning about sequential microoptimization than achieving the best performance — a worthy goal in itself — then after getting the loop order right, the next step is usually one or more levels of cache blocking, then register blocking (if your architecture has enough registers) and vectorization with SSE2+/AVX/AltiVec (or whatever SIMD instruction set is available on your architecture), possibly with some manual loop unrolling.

Comments
impressive

hello, I'm a graduating student in math and I'm dealing with linear regression, I need some functions written in C code that use loop unrolling and blocking cache, the most optimized possible to compute this function: A = (C * C) -1 C * y, where C * is the transpose of C, and -1 represents the inverse. are not practical c, can you help?

The article starter has earned a lot of community kudos, and such articles offer a bounty for quality replies.