This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Added memory scope to atomic::inc API and used the device scope in reduction.
ClosedPublic

Authored by dhruvachak on Jun 29 2023, 5:42 PM.

Details

Summary

With https://reviews.llvm.org/D137524, memory scope and ordering
attributes are being used to generate the required instructions for
atomic inc/dec on AMDGPU. This patch adds the memory scope attribute to
the atomic::inc API and uses the device scope in reduction. Without
the device scope in atomic_inc, the default system scope leads to
unnecessary L2 write-backs/invalidates.

Diff Detail

Event Timeline

dhruvachak created this revision.Jun 29 2023, 5:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 29 2023, 5:42 PM
dhruvachak requested review of this revision.Jun 29 2023, 5:42 PM
Herald added a project: Restricted Project. · View Herald Transcript

The memscope types added are kept consistent with the attributes in OpenMP spec 6.0 TR11. I think the system, device, and team scopes should be sufficient for the OpenMP use cases. Additional scopes are supported by the AMDGPU backend but they haven't been added here.

arsenm added inline comments.Jun 29 2023, 6:10 PM
openmp/libomptarget/DeviceRTL/include/Synchronization.h
29

Why does this need a new subset abstraction for scopes? I could understand a complete enum around all the names target scopes

openmp/libomptarget/DeviceRTL/src/Reduction.cpp
226

I doubt this is target specific

dhruvachak added inline comments.Jun 29 2023, 6:32 PM
openmp/libomptarget/DeviceRTL/include/Synchronization.h
29

I tried to keep the memscope arch-independent, hence the additional abstraction. I kept the same memscopes as what the upcoming OpenMP spec is going to have. This way, the call to atomic::inc in DeviceRTL (e.g. in reduction) does not have to be arch-dependent. In other words, I did not want to pass in "agent" directly from the reduction code for amdgpu, as an example.

openmp/libomptarget/DeviceRTL/src/Reduction.cpp
226

Well, I don't know whether it is required for NVPTX for example, so I kept it around.

As an aside, I think the system fence is too conservative, but again I kept it unchanged for archs other than amdgpu.

jdoerfert added inline comments.Jun 29 2023, 6:42 PM
openmp/libomptarget/DeviceRTL/include/Synchronization.h
33

Can we out a namespace or sth around this, atomic::device and atomic::all are confusing.
What about atomic::scope::ABC?

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
161–171
207

The above is expanded below.

jdoerfert added inline comments.Jun 29 2023, 6:44 PM
openmp/libomptarget/DeviceRTL/src/Reduction.cpp
226

As an aside, I think the system fence is too conservative, but again I kept it unchanged for archs other than amdgpu.

  1. This is unrelated.
  2. Could you elaborate on why you think the change is sound (on AMDGPUs)?
dhruvachak added inline comments.Jun 30 2023, 12:17 AM
openmp/libomptarget/DeviceRTL/src/Reduction.cpp
226
  1. Agreed this is unrelated. I can separate this out if you like.
  2. With reference to https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-memory-model-gfx90a (gfx90a as an example), the guide points out the required instructions. And that's what we get with this patch for the atomic inc:

s_waitcnt vmcnt(0) lgkmcnt(0)
global_atomic_inc v0, v1, v0, s[2:3] glc
s_waitcnt vmcnt(0)
buffer_wbinvl1_vol

If we bring back the system fence, here's the resulting sequence (fence followed by the atomic inc):

buffer_wbl2
s_waitcnt vmcnt(0) lgkmcnt(0)
buffer_invl2
buffer_wbinvl1_vol
s_waitcnt vmcnt(0) lgkmcnt(0)
global_atomic_inc v0, v1, v0, s[2:3] glc
s_waitcnt vmcnt(0)
buffer_wbinvl1_vol

According to the guide, the L2 instructions are required for coherence between different agents. For the reduction use case, we need coherence within a GPU which is provided by buffer_wbinvl1_vol following the atomic inc. So we can drop the L2 instructions here. The duplicate waitcnt and the buffer_wbinvl1_vol can be dropped too.

In other words, as long as the ordering and scope are correct on the atomic operation, we should not need an additional fence. Perhaps it was required before D137524.

Addressed feedback. Removed the fence change, used the suggested macro, and used the enum scope for clarity.

dhruvachak marked 5 inline comments as done.Jun 30 2023, 11:31 AM
dhruvachak added inline comments.
openmp/libomptarget/DeviceRTL/src/Reduction.cpp
226

I removed this unrelated change, will post a separate patch for it.

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
161–171

Used the suggested macro. Thanks @jdoerfert

dhruvachak edited the summary of this revision. (Show Details)Jun 30 2023, 11:35 AM
arsenm accepted this revision.Jun 30 2023, 11:58 AM
This revision is now accepted and ready to land.Jun 30 2023, 11:58 AM