Optimizing Matrix Multiplication

One time consuming task is multiplying large matrices. In this post we'll look at ways to improve the speed of this process. We'll be using a square matrix, but with simple modifications the code can be adapted to any type of matrix.

The straight forward way to multiply a matrix is:

```
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < N; k++) {
C[i,j] += A[i,k] * B[k,j];
}
}
}
```

Now it's not important to us in which order the loops run (i, j, k) but the compiler might be able to optimize the code based on the order, so lets try some timings of all variations of orders (ijk, ikj, jik, jki, kij, kji) for the multiplication of two 500x500 matrices of doubles.

We'll run each test 10 times to balance for any system calls that might happen during the test:

```
Name Milliseconds
ijk 2,361
ikj 2,000
jik 2,315
jki 5,775
kij 2,089
kji 5,665
```

It seems that the order ikj is slightly faster, while both jki and kji are significantly slower.

But what about using jagged arrays (that aren't really jagged, since all rows will have same size)? They might give better performance. Here is the same code above but using jagged arrays:

```
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < N; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
```

And the results:

```
Name Milliseconds
ijk 3,622
ikj 1,212
jik 3,641
jki 7,897
kij 1,244
kji 7,773
```

Well that is interesting. Most of the methods are slower, except for ikj and kij which are almost twice as fast.

This leads into the best part about using jagged arrays. With the current code every access to an element is a double index. First to the row, then to the column. By introducing some extra variables we can optimize this to a single index in each loop:

```
for (int i = 0; i < N; i++) {
double[] iRowA = A[i];
double[] iRowC = C[i];
for (int k = 0; k < N; k++) {
double[] kRowB = B[k];
double ikA = iRowA[k];
for (int j = 0; j < N; j++) {
iRowC[j] += ikA * kRowB[j];
}
}
}
```

Running this code gives:

```
Name Milliseconds
opt 410
```

Now we are about 5 times faster than our original method.

Can we improve on this? Of course we can. With .NET 4.0 came the introduction of PLINQ, which gives us an easy way to perform parallel tasks. By taking one of the indexes out of the loop, we can use it as a parameter to the method and calculate multiple rows at a time. Our code to do this looks like this:

```
double[] iRowA = A[i];
double[] iRowC = C[i];
for (int k = 0; k < N; k++) {
double[] kRowB = B[k];
double ikA = iRowA[k];
for (int j = 0; j < N; j++) {
iRowC[j] += ikA * kRowB[j];
}
}
```

All we have to do now is call this method N times with i ranging from 0 to N. Again, PLINQ gives us the easy way to do this:

```
var source = Enumerable.Range(0, N);
var pquery = from num in source.AsParallel()
select num;
pquery.ForAll((e) => Popt(A, B, C, e));
```

Where Popt is our method name taking 3 jagged arrays (C = A * B) and the row to calculate (e). How fast is this:

```
Name Milliseconds
Popt 187
```

That's over 12 times faster than our original code! With the magic of PLINQ we are creating 500 threads in this example and don't have to manage a single one of them, everything is handled for you.

So the snippet for this post will be the final parallel multiplication code.

Hopefully this post will help you 'think outside the box' when you encounter a problem that needs some optimization. By changing the order in which we did things, to the base data structure used, to introducing parallelism we've made a huge improvement!