Can anyone please tell me how I can unroll this loop?

void do_block (int lda, int M, int N, int K, double* A, double* B, double* C)
{
  /* For each row i of A */
  for (int i = 0; i < M; ++i) {
    /* For each column j of B */
    for (int j = 0; j < N; ++j) {
      /* Compute C(i,j) */
      double cij = C[i+j*lda];
      for (int k = 0; k < K; ++k)
        cij += A[i+k*lda] * B[k+j*lda];
      C[i+j*lda] = cij;
    }
  }
}

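Look closely at the condition of the innermost loop: it is k < K, not k < k. C is case-sensitive, so the loop counter k and the parameter K are different variables, and the multiply-add in the loop body does execute, K times for each (i, j) pair.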

Here is a good article about loop unrolling.

One of the optimizations you could make is to calculate i+j*lda only once within the loop, save the result in another variable, then use that variable everywhere else i+j*lda appears in the loop.
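A minimal sketch of that idea, applied to the do_block routine from the question. The function name do_block_hoisted and the variables col and idx are made up for illustration; column-major storage with leading dimension lda is assumed, as in the original code. An optimizing compiler will typically do this hoisting on its own, so treat it as an illustration rather than a guaranteed speedup.

```c
/* Same computation as do_block, with the index arithmetic that does not
   depend on k computed once per (i, j) pair.
   do_block_hoisted, col and idx are hypothetical names. */
void do_block_hoisted(int lda, int M, int N, int K,
                      const double *A, const double *B, double *C)
{
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            const int col = j * lda;   /* offset of column j, shared by B and C */
            const int idx = i + col;   /* position of C(i,j) */
            double cij = C[idx];
            for (int k = 0; k < K; ++k)
                cij += A[i + k * lda] * B[k + col];
            C[idx] = cij;
        }
    }
}
```

The inner loop still recomputes k*lda on every pass; stepping a pointer through A by lda would remove that as well, at the cost of slightly less readable code.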

Have you decided what factor you are going to use to unroll the loop? For example, are you going to unroll 5 iterations at a time? 7 iterations at a time? etc.?

I posted a code snippet several months ago that does matrix multiplication:

The block that does the actual multiplication is contained in lines 40 - 48.

Say you wanted to unroll the inner loop 5 at a time.
You don't know beforehand if the number of iterations is evenly divisible by 5, so you have to check and deal with the remainder.

For example,

int remainder = nCols % 5;

/* Handle the first (nCols % 5) terms one at a time */
dummy = 0.0;
for (j = 0; j < remainder; j++) {
    dummy += A_Matrix[k][j]*B_Matrix[j][i];
}
C_Matrix[k][i] = dummy;

So now you have taken care of the iterations that would be "left over" when the loop is unrolled 5 at a time.

Now you can do the unrolling:

dummy = 0.0;
for (j = 0; j < nCols; j +=5) {
    dummy += A_Matrix[k][j]*B_Matrix[j][i];
    dummy += A_Matrix[k][j+1]*B_Matrix[j+1][i];
    dummy += A_Matrix[k][j+2]*B_Matrix[j+2][i];
    dummy += A_Matrix[k][j+3]*B_Matrix[j+3][i];
    dummy += A_Matrix[k][j+4]*B_Matrix[j+4][i];
} // End for j
C_Matrix[k][i] = dummy;

This code may not be completely accurate; I am writing off the top of my head. But I hope you get the idea. You are explicitly writing out the loops 5 at a time, incrementing the counter by five, and eliminating the test in the for-loop for many iterations (instead of testing the condition in the for-loop every time, you are only doing it once every 5 loops.)

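Actually, in the unrolled loop in my post just above, I think the loop index should have started at remainder, since the first few entries were done in the step just before; i.e., the for line should be for (j = remainder; j < nCols; j += 5) {. The second dummy = 0.0; should also be dropped, otherwise the partial sum from the remainder loop is thrown away.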
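Putting the pieces together for the do_block routine from the first post, here is a rough sketch of the k loop unrolled 5 at a time, with the leftover iterations handled first and the unrolled loop starting at remainder. The name do_block_unrolled is made up for illustration, and the factor 5 is just the example value used above, not a tuned choice.

```c
/* do_block with the innermost loop unrolled by 5.
   do_block_unrolled is a hypothetical name; storage is column-major
   with leading dimension lda, as in the original do_block. */
void do_block_unrolled(int lda, int M, int N, int K,
                       const double *A, const double *B, double *C)
{
    const int remainder = K % 5;   /* iterations that do not fill a group of 5 */

    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            double cij = C[i + j * lda];
            int k = 0;

            /* Leftover iterations, one at a time */
            for (; k < remainder; ++k)
                cij += A[i + k * lda] * B[k + j * lda];

            /* K - remainder is a multiple of 5, so k + 4 never reaches K */
            for (; k < K; k += 5) {
                cij += A[i + k * lda]       * B[k + j * lda];
                cij += A[i + (k + 1) * lda] * B[(k + 1) + j * lda];
                cij += A[i + (k + 2) * lda] * B[(k + 2) + j * lda];
                cij += A[i + (k + 3) * lda] * B[(k + 3) + j * lda];
                cij += A[i + (k + 4) * lda] * B[(k + 4) + j * lda];
            }
            C[i + j * lda] = cij;
        }
    }
}
```

The terms are still added in the same order as in the plain loop, and the running sum is not reset between the two loops.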

Thanks a lot. Can you tell me how I can use GCC flags like -funroll-loops to unroll the loop?

I just can't figure out the syntax.

gcc   -O2                   -funroll-loops   -dgemm-blocked.c
      optimization level    flag name        file name

What am I missing?
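For what it's worth, GCC treats anything that starts with a dash as an option, so the source file is normally given without one, e.g. gcc -O2 -funroll-loops -c dgemm-blocked.c. That flag applies to the whole file. As far as I know, newer GCC versions (8 and later) also accept a per-loop hint, #pragma GCC unroll n, placed directly before the loop. A small sketch, where the function name dot and the factor 4 are only for illustration:

```c
/* Possible build line (file name taken from the question):
 *   gcc -O2 -funroll-loops -c dgemm-blocked.c
 *
 * dot is a hypothetical helper; the pragma is only a request to the
 * compiler and has no visible effect when optimization is off. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
#pragma GCC unroll 4
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Whether the compiler actually unrolls, and by how much, can be checked by inspecting the assembly produced with -S.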