This is an archive of the discontinued LLVM Phabricator instance.

[OPENMP][NVPTX]Use __syncwarp() to reconverge the threads.
Closed, Public

Authored by ABataev on Jul 19 2019, 1:17 PM.

Details

Summary

In CUDA 9.0 it is no longer guaranteed that the threads in a warp are
convergent. We need to use the __syncwarp() function to reconverge the
threads and to guarantee memory ordering among the threads in a warp.
This is the first patch to fix the problem with the test
libomptarget/deviceRTLs/nvptx/src/sync.cu on CUDA 9+.
This patch just replaces calls to the __shfl_sync() function with calls
to the __syncwarp() function where we need to reconverge the threads
when we try to modify the value of the parallel level counter.
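
For illustration, a minimal CUDA sketch of the pattern the patch moves to. The declarations (MAX_WARPS, getWarpId(), OMP_ACTIVE_PARALLEL_LEVEL) are hypothetical names chosen for the example, not the actual deviceRTL source:

  // Hypothetical sketch of incrementing the per-warp parallel level
  // counter; names are assumptions, not the actual deviceRTL code.
  #include <cstdint>

  #define MAX_WARPS 32
  #define OMP_ACTIVE_PARALLEL_LEVEL 128  // hypothetical "active" marker bit

  __device__ static uint8_t parallelLevel[MAX_WARPS];  // one counter per warp

  __device__ static unsigned getWarpId() { return threadIdx.x / warpSize; }

  __device__ void incParallelLevel(bool activeParallel) {
    unsigned mask = __activemask();         // lanes currently executing
    __syncwarp(mask);                       // reconverge those lanes
    unsigned lowestLane = __ffs(mask) - 1;  // elect one lane to update
    if (threadIdx.x % warpSize == lowestLane) {
      parallelLevel[getWarpId()] +=
          1 + (activeParallel ? OMP_ACTIVE_PARALLEL_LEVEL : 0);
      __threadfence();                      // publish the update
    }
    __syncwarp(mask);                       // all lanes see the new value
  }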

Event Timeline

ABataev created this revision. Jul 19 2019, 1:17 PM
Herald added a project: Restricted Project. Jul 19 2019, 1:17 PM

Does this behavior of CUDA >= 9.0 affect only the parallel level counter? Do we need to propagate these changes to other functions?

It seems to me that it affects only the parallelLevel counter, because currently only the parallelLevel counter is used at the per-warp level.
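
As a hedged illustration of why converged execution matters for per-warp data on CUDA 9+ (the branch and counter here are made up for the example, not runtime code):

  // Hypothetical illustration of the CUDA 9+ hazard: after a divergent
  // branch, Volta's independent thread scheduling does not guarantee
  // the lanes run in lockstep again, so a per-warp update may race.
  __device__ void divergentUpdate(int *perWarpCounter) {
    if (threadIdx.x % 2 == 0) {
      /* even lanes take one path */
    } else {
      /* odd lanes take another */
    }
    // On CUDA <= 8 the hardware reconverges the warp here; on CUDA 9+
    // an explicit __syncwarp() is required before relying on that.
    __syncwarp();
    if (threadIdx.x % warpSize == 0)
      ++perWarpCounter[threadIdx.x / warpSize];
    __syncwarp();
  }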

grokos accepted this revision. Jul 22 2019, 10:26 AM

OK, looks good.

This revision is now accepted and ready to land. Jul 22 2019, 10:26 AM
jdoerfert requested changes to this revision. Jul 22 2019, 7:49 PM

I'm confused, partly about the "convergent" part.

The code looks vastly different but no tests are affected?
Could you please point out how to reproduce the problem?
Where did the shuffles go?
Why is there a threadfence and syncwarp now?
Which old accesses were problematic and why?

This revision now requires changes to proceed. Jul 22 2019, 7:49 PM

There is a problem with at least one test on CUDA 9+: spmd_parallel_regions.cpp. To fix this problem we need three things: fix the test itself (see D65112), fix the runtime part (this patch), and fix the handling of critical sections in the compiler (the third patch, which depends on this one).

There seems to be a problem with this "fix", not the test. At least so far, the argument was CUDA 9 semantics, which should be irrelevant to the test. If there is a problem, then it is that the runtime doesn't implement OpenMP semantics properly for that test. Modifying the test will only hide that problem.

ABataev updated this revision to Diff 211746. Jul 25 2019, 7:36 AM

Reworked to fix the test spmd_parallel_regions.cpp and to fix problems with SPMD mode on CUDA 9+ found during testing.

Generally, this seems fine, but I was hoping we could say what configuration and test file can be used to reproduce this error.

Generally speaking, in this patch we just switched to the __syncwarp() function instead of __shfl_sync(). (On CUDA <= 8 we simply don't need to reconverge the threads because of the architecture; on CUDA 9+ there is a special function, __syncwarp(), for this.) It will just improve performance on CUDA 8 and won't affect CUDA 9+ at all.
The problem is in thread divergence within the warp. We use the parallel level counter on a per-warp basis (because we're very limited in shared memory). Previously it was even worse: we had the parallel level counter on a per-block basis.
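
For illustration, a hedged sketch contrasting the two approaches, reusing the hypothetical parallelLevel array and getWarpId() helper from the sketch in the summary (not the actual sources):

  // Old pattern: a full-mask shuffle broadcast doubled as the
  // synchronization point when reading the per-warp counter.
  __device__ uint8_t readLevelViaShuffle() {
    int level = 0;
    if (threadIdx.x % warpSize == 0)
      level = parallelLevel[getWarpId()];
    return (uint8_t)__shfl_sync(0xFFFFFFFFu, level, 0);  // lane 0 broadcasts
  }

  // New pattern: reconverge explicitly, then let every lane read the
  // counter directly; no cross-lane data movement is needed.
  __device__ uint8_t readLevelViaSyncwarp() {
    __syncwarp();
    return parallelLevel[getWarpId()];
  }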

I think it would be better to rename the patch. As I said, this is not a fix, just a small improvement in the way we reconverge the threads.

ABataev retitled this revision from [OPENMP][NVPTX]Fix parallel level counter in Cuda 9.0 to [OPENMP][NVPTX]Use __syncwarp() to reconverge the threads. Aug 23 2019, 9:17 AM
ABataev edited the summary of this revision.
jdoerfert accepted this revision. Aug 23 2019, 11:17 AM

thx for the explanation. LGTM.

This revision is now accepted and ready to land. Aug 23 2019, 11:17 AM

Make sure to update the commit message as well.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. Aug 23 2019, 11:33 AM