In Cuda 9.0 it is not guaranteed that threads in the warps are
convergent. We need to use syncwarp() function to reconverge
the threads and to guarantee the memory ordering among threads in the
warps.
This is the first patch to fix the problem with the test
libomptarget/deviceRTLs/nvptx/src/sync.cu on Cuda9+.
This patch just replaces calls to shfl_sync() function with the call
of __syncwarp() function where we need to reconverge the threads when we
try to modify the value of the parallel level counter.
Details
Diff Detail
- Repository
- rOMP OpenMP
- Build Status
Buildable 37194 Build 37193: arc lint + arc unit
Event Timeline
Does this behavior of CUDA >= 9.0 affect only the parallel level counter? Do we need to propagate these changes to other functions?
Seems to me, it affects only parallelLevel counter because currently only parallelLevel counter is used on per-warp level.
I'm confused, partly about the "convergent" part.
The code looks vastly different but no tests are affected?
Could you please point out how to reproduce the problem?
Where did the shuffles go?
Why is there a threadfence and syncwrap now?
Which old accesses were problematic and why?
There is a problem with at least 1 test in Cuda 9+: spmd_parallel_regions.cpp. To fix this problem we need 3 things: fix the test itself (see D65112), fix the runtime part (this patch) and fix the handling of critical sections in compiler (the 3rd patch that depends on this one).
There seems to be a problem with this "fix", not the test. At least so far, the argument was CUDA 9 semantics which should be irrelevant to the test. If there is a problem, than that the runtime doesn't implement OpenMP semantics properly for that test. Modifying the test will only hide that problem.
Reworked to fix the test spmd_parallel_regions.cpp and fix problems with SPMD mode in CUDA9+ found during testing.
Generally, this seems fine but I was hoping we could say what configuration and test file can be used to reproduce this error.
Generally speaking, we just switched to __syncwarp function instead of shfl_sync in this patch (In cuda <= 8 we just don't need to reconverge the threads because of the architecture, in Cuda 9+ there is a special function __syncwarp for this). It will just improve performance in Cuda 8 and won't affect Cuda9+ at all.
The problem is in threads divergence within the warp. We use parallel level counter on per-warp basis (because we're very limited in shared memory). Previously, it was even worse, we had parallel level counter on per-block basis.
I think, it would be better to rename the patch. Just like I said, this is not a fix, just a small improvement in the way we reconverge the threads.