Hi everyone, I'm trying to parallelise a modified Gram-Schmidt (MGS) algorithm with OpenMP. Here's the serial code:

```
for( k = 0; k < m; k++ ) {
    for( i = 0; i < k; i++ ) {
        s = 0.0;
        for( j = 0; j < n; j++ ) {
            s += q[ i ][ j ] * q[ k ][ j ];   /* s = dot(q_i, q_k) */
        }
        for( j = 0; j < n; j++ ) {
            q[ k ][ j ] -= s * q[ i ][ j ];   /* remove the q_i component */
        }
    }
    s = 0.0;
    for( j = 0; j < n; j++ ) {
        s += q[ k ][ j ] * q[ k ][ j ];       /* squared norm of q_k */
    }
    for( j = 0; j < n; j++ ) {
        q[ k ][ j ] /= sqrt( s );             /* normalise q_k */
    }
}
```

I have tried several ways to parallelise it, and found that neither the k-loop nor the i-loop can be parallelised, because each iteration depends on the results of the previous ones. Only the inner j-loops can be parallelised, by doing something like this:

```
for( k = 0; k < m; k++ ) {
    for( i = 0; i < k; i++ ) {
        s = 0.0;
        #pragma omp parallel for default(none) shared(i, k, q, n) private(j) reduction(+:s) schedule(dynamic)
        for( j = 0; j < n; j++ ) {
            s += q[ i ][ j ] * q[ k ][ j ];
        }
        /* #pragma omp parallel for */
        for( j = 0; j < n; j++ ) {
            q[ k ][ j ] -= s * q[ i ][ j ];
        }
    }
    s = 0.0;
    #pragma omp parallel for default(none) shared(k, q, n) private(j) reduction(+:s) schedule(dynamic)
    for( j = 0; j < n; j++ ) {
        s += q[ k ][ j ] * q[ k ][ j ];
    }
    /* #pragma omp parallel for */
    for( j = 0; j < n; j++ ) {
        q[ k ][ j ] /= sqrt( s );
    }
}
```

This, however, creates a lot of overhead: every `parallel for` forks and joins the thread team (and `schedule(dynamic)` adds per-chunk synchronisation on top of that), and these costs are paid over and over as the algorithm runs, making it much slower than the serial version even for very large `m` and `n`.

I can't find a way to parallelise this code efficiently, but I have seen parallelised MGS algorithms in research papers, so it must be possible. What am I missing here?

Thanks