I am writing some PDE code, and I am just starting to think about parallelizing it. The code
is highly parallelizable, both with CUDA (at the lowest level) and with regular CPU threads (at a higher level).
Specifically, I am coding a matrix multiply, and the matrix is banded, so the multiply is just a sum over
the bands of some vector operations. In my optimistic mind, I imagine I could run the band multiplications
in parallel by putting them in separate threads and adding the results afterwards. I also thought that I
could do the vector operations themselves in CUDA. But will separate threads, all calling CUDA, interfere with one
another? Will their work get mixed together? Does CUDA know how to handle requests coming in from many different threads?
Or do I have to pick either threads or CUDA, but not both?
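
To make the structure concrete, here is a minimal CPU-only sketch of the threading part, in C++ with std::thread and no CUDA at all. The names Band, band_multiply, and banded_multiply, and the choice to store each band as a full-length vector plus an offset, are illustrative assumptions rather than my actual code, and I am assuming a matrix-vector product.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One band (diagonal) of the matrix. Hypothetical layout, for illustration only.
// offset = column - row: 0 is the main diagonal, +1 the first superdiagonal,
// -1 the first subdiagonal. values[i] holds A(i, i + offset); entries whose
// column falls outside the matrix are never read.
struct Band {
    int offset;
    std::vector<double> values;  // same length as x
};

// Contribution of a single band to y = A * x. This is a plain elementwise
// vector operation, and it is the part I imagine handing to CUDA.
void band_multiply(const Band& band, const std::vector<double>& x,
                   std::vector<double>& partial) {
    assert(band.values.size() == x.size());
    const long n = static_cast<long>(x.size());
    partial.assign(x.size(), 0.0);
    for (long i = 0; i < n; ++i) {
        const long j = i + band.offset;
        if (j >= 0 && j < n) {
            partial[i] = band.values[i] * x[j];
        }
    }
}

// One thread per band, each writing to its own partial result, so the threads
// share no mutable state. The partial results are summed after all joins.
std::vector<double> banded_multiply(const std::vector<Band>& bands,
                                    const std::vector<double>& x) {
    std::vector<std::vector<double>> partials(bands.size());
    std::vector<std::thread> workers;
    workers.reserve(bands.size());
    for (std::size_t b = 0; b < bands.size(); ++b) {
        workers.emplace_back([&bands, &x, &partials, b] {
            band_multiply(bands[b], x, partials[b]);
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    std::vector<double> y(x.size(), 0.0);
    for (const auto& p : partials) {
        for (std::size_t i = 0; i < y.size(); ++i) {
            y[i] += p[i];
        }
    }
    return y;
}
```

The loop inside band_multiply is the vector operation I would like to move to the GPU, so in the version I am asking about, each of these threads would be issuing its own CUDA calls at the same time.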