954,505 Members — Technology Publication meets Social Media
Username:
Password:
Lost login information?

Optimizing Matrix Multiplication

By Ron Whittle on Mar 25th, 2011 12:20 pm

Optimizing Matrix Multiplication

One time consuming task is multiplying large matrices. In this post we'll look at ways to improve the speed of this process. We'll be using a square matrix, but with simple modifications the code can be adapted to any type of matrix.

The straight forward way to multiply a matrix is:

for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i,j] += A[i,k] * B[k,j];
        }
    }
}


Now it's not important to us in which order the loops run (i, j, k) but the compiler might be able to optimize the code based on the order, so lets try some timings of all variations of orders (ijk, ikj, jik, jki, kij, kji) for the multiplication of two 500x500 matrices of doubles.

We'll run each test 10 times to balance for any system calls that might happen during the test:

Name   Milliseconds
ijk       2,361
ikj       2,000
jik       2,315
jki       5,775
kij       2,089
kji       5,665


It seems that the order ikj is slightly faster, while both jki and kji are significantly slower.

But what about using jagged arrays (that aren't really jagged, since all rows will have same size)? They might give better performance. Here is the same code above but using jagged arrays:

for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}


And the results:

Name   Milliseconds
ijk       3,622
ikj       1,212
jik       3,641
jki       7,897
kij       1,244
kji       7,773


Well that is interesting. Most of the methods are slower, except for ikj and kij which are almost twice as fast.

This leads into the best part about using jagged arrays. With the current code every access to an element is a double index. First to the row, then to the column. By introducing some extra variables we can optimize this to a single index in each loop:

for (int i = 0; i < N; i++) {
    double[] iRowA = A[i];
    double[] iRowC = C[i];
    for (int k = 0; k < N; k++) {
        double[] kRowB = B[k];
        double ikA = iRowA[k];
        for (int j = 0; j < N; j++) {
            iRowC[j] += ikA * kRowB[j];
        }
    }
}


Running this code gives:

Name   Milliseconds
opt       410


Now we are about 5 times faster than our original method.

Can we improve on this? Of course we can. With .NET 4.0 came the introduction of PLINQ, which gives us an easy way to perform parallel tasks. By taking one of the indexes out of the loop, we can use it as a parameter to the method and calculate multiple rows at a time. Our code to do this looks like this:

double[] iRowA = A[i];
double[] iRowC = C[i];
for (int k = 0; k < N; k++) {
    double[] kRowB = B[k];
    double ikA = iRowA[k];
    for (int j = 0; j < N; j++) {
        iRowC[j] += ikA * kRowB[j];
    }
}


All we have to do now is call this method N times with i ranging from 0 to N. Again, PLINQ gives us the easy way to do this:

var source = Enumerable.Range(0, N);
var pquery = from num in source.AsParallel()
             select num;
pquery.ForAll((e) => Popt(A, B, C, e));

Where Popt is our method name taking 3 jagged arrays (C = A * B) and the row to calculate (e). How fast is this:

Name   Milliseconds
Popt       187


That's over 12 times faster than our original code! With the magic of PLINQ we are creating 500 threads in this example and don't have to manage a single one of them, everything is handled for you.

So the snippet for this post will be the final parallel multiplication code.

Hopefully this post will help you 'think outside the box' when you encounter a problem that needs some optimization. By changing the order in which we did things, to the base data structure used, to introducing parallelism we've made a huge improvement!

// requires a global value N which is the dimension of the array
// Array C must be fully allocated or you'll get a null reference exception

void Topt(double[][] A, double[][] B, double[][] C) {
    var source = Enumerable.Range(0, N);
    var pquery = from num in source.AsParallel()
                    select num;
    pquery.ForAll((e) => Popt(A, B, C, e));
}

void Popt(double[][] A, double[][] B, double[][] C, int i) {
    double[] iRowA = A[i];
    double[] iRowC = C[i];
    for (int k = 0; k < N; k++) {
        double[] kRowB = B[k];
        double ikA = iRowA[k];
        for (int j = 0; j < N; j++) {
            iRowC[j] += ikA * kRowB[j];
        }
    }
}

Wow!
You must have spent some time to work this all out!
This is great, great stuff!
I'm working( well not all day ;) ) on a sort of matrix calculator, you surely gave me some ideas to rework my code.
Rep X 10, but alas I can't:'(

ddanbe
Senior Poster
3,829 posts since Oct 2008
Reputation Points: 2,070
Solved Threads: 661
 

Would your implementation be faster than C++, who claims to be one of the fastest guys around the block?

ddanbe
Senior Poster
3,829 posts since Oct 2008
Reputation Points: 2,070
Solved Threads: 661
 

Not the greatest C++ programmer, but trying to stick with the best method available I came up with:

void MatMul(double** A, double** B, double** C) {
    for (int i = 0; i < N; i++) {
        double* iRowA = A[i];
        double* iRowC = C[i];
        for (int k = 0; k < N; k++) {
            double* kRowB = B[k];
            double ikA = iRowA[k];
            for (int j = 0; j < N; j++) {
                iRowC[j] += ikA * kRowB[j];
            }
        }
    }
}


Testing this gives a runtime of 213 milliseconds, or slightly slower than the parallel version. C++ pointer math wins. Now to figure out how to write this in unsafe C# code and see if that improves anything.

Momerath
Nearly a Senior Poster
3,384 posts since Aug 2010
Reputation Points: 1,232
Solved Threads: 558
 

Wow guy's u doin great job

nismac
Newbie Poster
3 posts since Nov 2010
Reputation Points: 10
Solved Threads: 0
 

Looking around I've also found the Microsoft has a Matrix class in their Infer.net . Testing it for a 500x500 matrix gives:

Infer.net 1060 milliseconds

Interesting to know that my code beats theirs even without using parallel execution :)

Momerath
Nearly a Senior Poster
3,384 posts since Aug 2010
Reputation Points: 1,232
Solved Threads: 558
 

@Momerath
A question popped up: did you use the StopWatch class to time your code?

ddanbe
Senior Poster
3,829 posts since Oct 2008
Reputation Points: 2,070
Solved Threads: 661
 

Yes I do

Momerath
Nearly a Senior Poster
3,384 posts since Aug 2010
Reputation Points: 1,232
Solved Threads: 558
 

Hi, Could you please explain, is C the destination array (where A*B) is stored after the multiplication? Thanks

alud007
Newbie Poster
1 post since Oct 2011
Reputation Points: 10
Solved Threads: 0
 

If you want to write fast MM code you need to write it in C, you need to use multliple threaded execution and you need to use the cache.

Cache optimizations are going to give you the most speed up for single threaded execution.

Tiling method - you chop the matrix into tiles small enough to fit in cache so that way there are less cache misses.

tile_size =
for(int i = 0; i < N; i+= tile_size) {
for(int j = 0; j < N; j+=tile_size) {
for(int k = 0; k < N; k+=tile_size) {
for(int ii = i; ii < min(i+tile_size, N); ++ii) {
for(int jj = j; jj < min(j+tile_size, N); ++jj) {
for(int kk = k; kk < min(k+tile_size, N); ++kk) {
c[ii*N + jj] += a[ii*N + kk] + b[kk*N + jj];

#this code will only work for square matrices

Besides that, openmp and MPI are good tools.

snizzzzle
Newbie Poster
1 post since Nov 2011
Reputation Points: 10
Solved Threads: 0
 

This article has been dead for over three months

Post: Markdown Syntax: Formatting Help
You
View similar articles that have also been tagged: