Details

Reviewers

grokos
jdoerfert

Commits

rOMP369796: [OPENMP][NVPTX]Use __syncwarp() to reconverge the threads.
rG0366168f3ac6: [OPENMP][NVPTX]Use __syncwarp() to reconverge the threads.
rL369796: [OPENMP][NVPTX]Use __syncwarp() to reconverge the threads.

Summary

In CUDA 9.0 it is not guaranteed that the threads in a warp are
convergent. We need to use the __syncwarp() function to reconverge
the threads and to guarantee memory ordering among the threads in the
warp.
This is the first patch to fix the problem with the test
libomptarget/deviceRTLs/nvptx/src/sync.cu on CUDA 9+.
This patch just replaces calls to the shfl_sync() function with calls
to the __syncwarp() function where we need to reconverge the threads when we
modify the value of the parallel level counter.
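The idea in the summary can be sketched as follows. This is a minimal illustration, not the code from the patch: the helper name `incParallelLevelSketch`, the array `levels`, and the block size are assumptions made for the example. One leader lane updates a per-warp counter, and `__syncwarp()` before and after the update reconverges the active lanes and orders the memory access among them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;         // assumed warp size
constexpr int kThreadsPerBlock = 64;  // two warps, chosen for illustration
constexpr int kWarpsPerBlock = kThreadsPerBlock / kWarpSize;

// Hypothetical helper: bump a per-warp counter exactly once per warp.
__device__ void incParallelLevelSketch(unsigned char *levels) {
  unsigned active = __activemask();     // lanes currently executing together
  __syncwarp(active);                   // reconverge before touching the counter
  unsigned lane = threadIdx.x % kWarpSize;
  unsigned leader = __ffs(active) - 1;  // lowest active lane does the update
  if (lane == leader) {
    levels[threadIdx.x / kWarpSize] += 1;
    __threadfence();                    // fence after the store, as discussed in the review
  }
  __syncwarp(active);                   // remaining lanes wait for the leader's store
}

__global__ void demoKernel(int *out) {
  __shared__ unsigned char levels[kWarpsPerBlock];
  if (threadIdx.x % kWarpSize == 0)
    levels[threadIdx.x / kWarpSize] = 0;  // one lane per warp initializes its slot
  __syncwarp();
  incParallelLevelSketch(levels);
  out[threadIdx.x] = levels[threadIdx.x / kWarpSize];
}

// Returns the number of lanes that did not observe a counter value of 1,
// or -1 if a CUDA runtime call failed.
int runParallelLevelDemo() {
  int *dOut = nullptr;
  int hOut[kThreadsPerBlock];
  if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess) return -1;
  demoKernel<<<1, kThreadsPerBlock>>>(dOut);
  cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
  cudaFree(dOut);
  if (err != cudaSuccess) return -1;
  int mismatches = 0;
  for (int v : hOut) mismatches += (v != 1);
  return mismatches;
}

int main() {
  int bad = runParallelLevelDemo();
  std::printf("mismatching lanes: %d\n", bad);
  return bad == 0 ? 0 : 1;
}
```

The sketch requires CUDA 9 or newer, since `__syncwarp()` and `__activemask()` are not available in earlier toolkits.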

Diff Detail

Repository

rOMP OpenMP

Build Status

Buildable 37194
Build 37193: arc lint + arc unit

Event Timeline

ABataev created this revision. · Jul 19 2019, 1:17 PM

Herald added a project: Restricted Project. · Jul 19 2019, 1:17 PM

Herald added subscribers: jdoerfert, jfb, guansong.

Harbormaster completed remote builds in B35383: Diff 210884. · Jul 19 2019, 1:18 PM

Does this behavior of CUDA >= 9.0 affect only the parallel level counter? Do we need to propagate these changes to other functions?

In D65013#1595977, @grokos wrote:

Does this behavior of CUDA >= 9.0 affect only the parallel level counter? Do we need to propagate these changes to other functions?

It seems to me that it affects only the parallelLevel counter, because currently only the parallelLevel counter is used at the per-warp level.

OK, looks good.

This revision is now accepted and ready to land. · Jul 22 2019, 10:26 AM

I'm confused, partly about the "convergent" part.

The code looks vastly different but no tests are affected?
Could you please point out how to reproduce the problem?
Where did the shuffles go?
Why is there a threadfence and syncwarp now?
Which old accesses were problematic and why?

This revision now requires changes to proceed. · Jul 22 2019, 7:49 PM

In D65013#1596821, @jdoerfert wrote:

I'm confused, partly about the "convergent" part.

The code looks vastly different but no tests are affected?
Could you please point out how to reproduce the problem?
Where did the shuffles go?
Why is there a threadfence and syncwarp now?
Which old accesses were problematic and why?

There is a problem with at least 1 test in Cuda 9+: spmd_parallel_regions.cpp. To fix this problem we need 3 things: fix the test itself (see D65112), fix the runtime part (this patch) and fix the handling of critical sections in compiler (the 3rd patch that depends on this one).

In D65013#1596990, @ABataev wrote:

In D65013#1596821, @jdoerfert wrote:

I'm confused, partly about the "convergent" part.

The code looks vastly different but no tests are affected?
Could you please point out how to reproduce the problem?
Where did the shuffles go?
Why is there a threadfence and syncwarp now?
Which old accesses were problematic and why?

There is a problem with at least 1 test in Cuda 9+: spmd_parallel_regions.cpp. To fix this problem we need 3 things: fix the test itself (see D65112), fix the runtime part (this patch) and fix the handling of critical sections in compiler (the 3rd patch that depends on this one).

There seems to be a problem with this "fix", not the test. At least so far, the argument was CUDA 9 semantics, which should be irrelevant to the test. If there is a problem, then it is that the runtime doesn't implement OpenMP semantics properly for that test. Modifying the test will only hide that problem.

ABataev mentioned this in D65112: [OPENMP][NVPTX]Make the test compatible with CUDA9+, NFC. · Jul 25 2019, 7:17 AM

Hahnfeld added a subscriber: Hahnfeld. · Jul 25 2019, 7:25 AM

Reworked to fix the test spmd_parallel_regions.cpp and fix problems with SPMD mode in CUDA9+ found during testing.

Harbormaster completed remote builds in B35638: Diff 211746. · Jul 25 2019, 7:38 AM

Rebase.

Harbormaster completed remote builds in B36250: Diff 213675. · Aug 6 2019, 11:57 AM

Ping

Generally, this seems fine but I was hoping we could say what configuration and test file can be used to reproduce this error.

In D65013#1643020, @jdoerfert wrote:

Generally, this seems fine but I was hoping we could say what configuration and test file can be used to reproduce this error.

Generally speaking, in this patch we just switched to the __syncwarp function instead of shfl_sync. (In CUDA <= 8 we don't need to reconverge the threads because of the architecture; in CUDA 9+ there is a special function, __syncwarp, for this.) It will just improve performance in CUDA 8 and won't affect CUDA 9+ at all.
The problem is thread divergence within the warp. We use the parallel level counter on a per-warp basis (because we're very limited in shared memory). Previously it was even worse: we had the parallel level counter on a per-block basis.

I think it would be better to rename the patch. As I said, this is not a fix, just a small improvement in the way we reconverge the threads.
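For contrast, here is a hedged sketch of the shuffle-based idiom that the comment above says was replaced. It is not the runtime's code; the kernel name `broadcastKernel` and the value 42 are made up for the illustration. `__shfl_sync` synchronizes the lanes named in the mask, but it also moves a register value between them, which is more than is needed when the goal is only to reconverge the warp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kLanes = 32;  // one full warp, chosen for illustration

__global__ void broadcastKernel(int *out) {
  unsigned mask = __activemask();               // lanes executing this instruction
  int lane = threadIdx.x % kLanes;
  int leader = __ffs(mask) - 1;                 // lowest active lane
  int value = (lane == leader) ? 42 : 0;        // only the leader holds the value
  // The shuffle synchronizes the lanes in `mask` and hands every lane
  // the leader's copy of `value`.
  out[threadIdx.x] = __shfl_sync(mask, value, leader);
}

// Returns the number of lanes that did not receive the leader's value,
// or -1 if a CUDA runtime call failed.
int runBroadcastDemo() {
  int *dOut = nullptr;
  int hOut[kLanes];
  if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess) return -1;
  broadcastKernel<<<1, kLanes>>>(dOut);
  cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
  cudaFree(dOut);
  if (err != cudaSuccess) return -1;
  int mismatches = 0;
  for (int v : hOut) mismatches += (v != 42);
  return mismatches;
}

int main() {
  int bad = runBroadcastDemo();
  std::printf("mismatching lanes: %d\n", bad);
  return bad == 0 ? 0 : 1;
}
```

When no data has to travel between lanes, `__syncwarp(mask)` expresses the same synchronization without the data exchange.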

Renamed.

ABataev retitled this revision from [OPENMP][NVPTX]Fix parallel level counter in Cuda 9.0. to [OPENMP][NVPTX]Use __syncwarp() to reconverge the threads. · Aug 23 2019, 9:17 AM

ABataev edited the summary of this revision.

Harbormaster completed remote builds in B37194: Diff 216877. · Aug 23 2019, 9:19 AM

thx for the explanation. LGTM.

This revision is now accepted and ready to land. · Aug 23 2019, 11:17 AM

Make sure to update the commit message as well.

Closed by commit rL369796: [OPENMP][NVPTX]Use __syncwarp() to reconverge the threads. (authored by ABataev). · Aug 23 2019, 11:33 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. · Aug 23 2019, 11:33 AM

Herald added a subscriber: llvm-commits.

This is an archive of the discontinued LLVM Phabricator instance.

[OPENMP][NVPTX]Use __syncwarp() to reconverge the threads.
ClosedPublic

Revision Contents

Diff 216877

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h

libomptarget/deviceRTLs/nvptx/src/supporti.h
