With https://reviews.llvm.org/D137524, memory scope and ordering
attributes are being used to generate the required instructions for
atomic inc/dec on AMDGPU. This patch adds the memory scope attribute to
the atomic::inc API and uses the device scope in reduction. Without
the device scope in atomic_inc, the default system scope leads to
unnecessary L2 write-backs/invalidates.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
The memscope types added are kept consistent with the attributes in OpenMP spec 6.0 TR11. I think the system, device, and team scopes should be sufficient for the OpenMP use cases. Additional scopes are supported by the AMDGPU backend but they haven't been added here.
openmp/libomptarget/DeviceRTL/include/Synchronization.h | ||
---|---|---|
29 | I tried to keep the memscope arch-independent, hence the additional abstraction. I kept the same memscopes as what the upcoming OpenMP spec is going to have. This way, the call to atomic::inc in DeviceRTL (e.g. in reduction) does not have to be arch-dependent. In other words, I did not want to pass in "agent" directly from the reduction code for amdgpu, as an example. | |
openmp/libomptarget/DeviceRTL/src/Reduction.cpp | ||
226 | Well, I don't know whether it is required for NVPTX for example, so I kept it around. As an aside, I think the system fence is too conservative, but again I kept it unchanged for archs other than amdgpu. |
openmp/libomptarget/DeviceRTL/src/Reduction.cpp | ||
---|---|---|
226 |
|
openmp/libomptarget/DeviceRTL/src/Reduction.cpp | ||
---|---|---|
226 |
s_waitcnt vmcnt(0) lgkmcnt(0) If we bring back the system fence, here's the resulting sequence (fence followed by the atomic inc): buffer_wbl2 According to the guide, the L2 instructions are required for coherence between different agents. For the reduction use case, we need coherence within a GPU which is provided by buffer_wbinvl1_vol following the atomic inc. So we can drop the L2 instructions here. The duplicate waitcnt and the buffer_wbinvl1_vol can be dropped too. In other words, as long as the ordering and scope are correct on the atomic operation, we should not need an additional fence. Perhaps it was required before D137524. |
Addressed feedback. Removed the fence change, used the suggested macro, and used the enum scope for clarity.
openmp/libomptarget/DeviceRTL/src/Reduction.cpp | ||
---|---|---|
226 | I removed this unrelated change, will post a separate patch for it. | |
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp | ||
161–171 | Used the suggested macro. Thanks @jdoerfert |
Why does this need a new subset abstraction for scopes? I could understand a complete enum around all the names target scopes