If you want to write fast MM code you need to write it in C, you need to use multliple threaded execution and you need to use the cache.

Cache optimizations are going to give you the most speed up for single threaded execution.

Tiling method - you chop the matrix into tiles small enough to fit in cache so that way there are less cache misses.

tile_size = <size of your cache line>
for(int i = 0; i < N; i+= tile_size) {
for(int j = 0; j < N; j+=tile_size) {
for(int k = 0; k < N; k+=tile_size) {
for(int ii = i; ii < min(i+tile_size, N); ++ii) {
for(int jj = j; jj < min(j+tile_size, N); ++jj) {
for(int kk = k; kk < min(k+tile_size, N); ++kk) {
c[ii*N + jj] += a[ii*N + kk] + b[kk*N + jj];

#this code will only work for square matrices

Besides that, openmp and MPI are good tools.