This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Ensure memory fences are created with barriers for AMDGPUs
ClosedPublic

Authored by jdoerfert on Mar 3 2023, 5:53 PM.

Details

Summary

It turns out that __builtin_amdgcn_s_barrier() alone does not emit
a fence. We somehow got away with this and assumed it would work, as it
(hopefully) is correct on the NVIDIA path, where we just emit a
__syncthreads. After talking to @arsenm we now (mostly) align with the
OpenCL barrier implementation [1] and emit explicit fences for AMDGPUs.

It seems this was the underlying cause for #59759, but I am not 100%
certain. There is a chance this simply hides the problem.

Fixes: https://github.com/llvm/llvm-project/issues/59759

[1] https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/07b347366eb2c6ebc3414af323c623cbbbafc854/opencl/src/workgroup/wgbarrier.cl#L21

Diff Detail

Event Timeline

jdoerfert created this revision. Mar 3 2023, 5:53 PM
Herald added a project: Restricted Project. · View Herald Transcript. Mar 3 2023, 5:53 PM
jdoerfert requested review of this revision. Mar 3 2023, 5:53 PM
jdoerfert added inline comments. Mar 3 2023, 6:01 PM
openmp/libomptarget/DeviceRTL/src/Kernel.cpp
89

Actually, the barriers used to form aligned regions should all use relaxed ordering, as we really only want the barrier for synchronizing the threads.

jplehr added a subscriber: jplehr. Mar 15 2023, 3:36 AM

I compared it to the linked HIP implementation, and from that point of view it looks reasonable to me, but I don't have a good understanding of the internals yet. @JonChesterfield, can you comment on the topic?

openmp/libomptarget/DeviceRTL/include/Synchronization.h
107
JonChesterfield added a comment. Edited Mar 15 2023, 1:28 PM

Do we actually have seq_cst ordering on GPUs? It means every thread sees the same ordering, which I'd guess has to be done by an RMW atomic operation, maybe a fetch_add. Plus these aren't scoped, so the memory underlying it has to be accessible from all threads, which probably means it goes to host shared memory. Aka fetch_add on CPU memory to get the ordering relation. That seems expensive to the extent that it probably isn't implemented. @arsenm, how do we usually deal with that?

@jdoerfert, what motivates seq_cst here instead of acquire/release? I commented inline that this ordering argument only makes sense when it's consistent across all threads that are participating, which might be worth approximating as constant per call site, i.e. raising it to a template parameter.

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
264–273

I don't understand this. Why is the ordering on a fence::team related to the ordering on syncThreads in this way? What about acquire/release?

In general it seems hazardous that the ordering is a runtime variable; if different threads passed in different orderings, this would turn into a horrendous mess. Perhaps we should move it to a template parameter, and maybe static_assert in the syncThreads implementation that it meets whatever the constraints on it are.

> That seems expensive to the extent that it probably isn't implemented. @arsenm how do we usually deal with that?

Why wouldn't it be implemented? The exact treatment varies per subtarget but implies setting some cache bits and using some cache flush instructions

> Why wouldn't it be implemented? The exact treatment varies per subtarget but implies setting some cache bits and using some cache flush instructions

That sounds sufficient for within a single GPU but not between GPUs. Though I'm not sure just how far the total order of events is supposed to be taken, maybe it's reasonable to observe different orders when variables are only directly visible to some subset of threads.

> That sounds sufficient for within a single GPU but not between GPUs. Though I'm not sure just how far the total order of events is supposed to be taken; maybe it's reasonable to observe different orders when variables are only directly visible to some subset of threads.

We are looking at workgroup fences here. We don't even need per-GPU fencing. I think this discussion derailed a bit.

> We are looking at workgroup fences here. We don't even need per-GPU fencing. I think this discussion derailed a bit.

It's the total order requested by choosing sequentially consistent for the memory model. I'm not sure what the semantics of that are on a system with multiple nested address spaces as that's not the C++ model.

What are the semantics you want here? All warps in a workgroup see the same total order of events, while independent workgroups could see different orders? If so, then we can ask whether that's what seq_cst is lowered to on a GPU, which it might be.

Or if the global total ordering is not necessary, should we go with acquire/release on the fences instead?

> Or if the global total ordering is not necessary, should we go with acquire/release on the fences instead?

Given that we might need to flush global memory, I went with what the OpenCL folks do.

> Given that we might need to flush global memory, I went with what the OpenCL folks do.

Fair enough, that's a heuristic I'm happy with.

Raising the argument to a compile-time template parameter is better, I think - not totally confident that, as written, it would constant fold at O0 - but we could leave that for another day / until it comes up in practice.

jdoerfert updated this revision to Diff 506836. Mar 20 2023, 8:24 PM

Add test, correct fence kinds

@JonChesterfield is this patch good to go?

LGTM but Matt's the expert here

> LGTM but Matt's the expert here

I know nothing about memory models

> I know nothing about memory models

That's exciting. I've tagged Tony and Brian as my next guess. I'm reasonably clear on memory models in general but haven't reverse engineered what they mean for AMDGPU. It's somewhere on my to-do list.

ye-luo accepted this revision. Mar 23 2023, 8:47 PM

Keep things moving.

This revision is now accepted and ready to land. Mar 23 2023, 8:47 PM
Herald added a project: Restricted Project. · View Herald Transcript. Mar 24 2023, 6:40 PM

Got a test failure:

Failed Tests (1):
  libomptarget :: x86_64-pc-linux-gnu :: offloading/barrier_fence.c
dhruvachak added inline comments.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
266

What if I want a release fence before the barrier and nothing else? As a client, I pass in atomic::release and I get a seq_cst fence before and after the barrier. Seems like overkill.

jdoerfert added inline comments. May 1 2023, 1:32 PM
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
266

If you want functionality beyond what is implemented, you need to implement it.