This patch adds support for `#pragma omp critical` regions when offloading to AMD devices.
Maybe we should define omp_lock_t as uint32_t on AMDGPU?
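If that route were taken, a minimal sketch of what it could look like (the guard macro and placement are assumptions, not something this patch does):

```cpp
#include <stdint.h>

// Hypothetical: give omp_lock_t a 32-bit representation on AMDGPU so the lock
// routines can operate on it with 32-bit atomics directly, instead of casting
// through (uint32_t *) as the code under review does.
#ifdef __AMDGCN__
typedef uint32_t omp_lock_t;
#endif
```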
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 282: This doesn't work. You check the lock first, then you set it.
Line 294: Shouldn't the release and acquire semantics be part of this CAS?
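To illustrate the line 282 comment, a minimal sketch using the atomicCAS/UNSET/SET names from this thread (the atomicLoad/atomicStore helpers in the broken variant are assumptions):

```cpp
// Broken check-then-set: two threads can both observe UNSET between the
// check and the set, and both enter the critical region.
if (atomicLoad((uint32_t *)Lock, atomic::acquire) == UNSET) // check ...
  atomicStore((uint32_t *)Lock, SET, atomic::release);      // ... then set

// Check-and-set as one atomic step: a single CAS closes that window.
while (atomicCAS((uint32_t *)Lock, UNSET, SET, atomic::acq_rel) != UNSET)
  sleep(); // back off while another thread holds the lock
```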
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 294: The idea is that the thread executes the acquire only once, when it enters the critical region, to prevent it from being executed at every iteration of the loop (which comes with a performance penalty if it happens). I would like to understand why you say there's a chance we don't see the update of another thread. The unsetting of the lock happens here:

    void unsetLock(omp_lock_t *Lock) {
      (void)atomicExchange((uint32_t *)Lock, UNSET, atomic::acq_rel);
    }
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 294: The relaxed unset doesn't make other memory effects visible to a thread that takes the lock next. I believe it should come with a proper fence, hence be atomic::release. Similarly, I am not sure the update of the lock itself is guaranteed to be observed if the update and check are relaxed.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 294: unsetLock actually has a release fence; testLock does not have proper fencing, sorry.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 294: I fixed testLock to use acq_rel.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 294: As far as I can tell, the current combination of fences and CAS atomics works; please let me know if you have any comments or if I haven't addressed something.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 276: In theory, unset could simply do an atomic write to the Lock with release memorder. Any reason that's not used here? You are not using the previous value anyway.
Line 291: I am getting confused by this pattern. You have `while (!atomicCAS(...) != UNSET)`. Can't that be written as `while (atomicCAS(...) == UNSET)`? But what does atomicCAS return, a bool or the previous value? If it is the previous value, shouldn't the condition be `while (atomicCAS(...) == SET) { sleep(); }`?
Line 292: I think the CAS should take acq_rel as the success memorder and acquire as the failure memorder. The acquire in both cases ensures this thread sees an update from another thread; the release from the other updating thread is not sufficient on its own for this thread to see that update. And when this thread succeeds, the release ensures other threads see this thread's update. With those memorders, I would think we could get rid of the fences.
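For reference, standard C++ can express the distinct success/failure orderings suggested for line 292; a sketch with GCC-style builtins (illustrative only, since the DeviceRTL atomicCAS discussed in this thread takes a single ordering):

```cpp
uint32_t Expected = UNSET;
// acq_rel on success publishes this thread's writes and observes the
// releaser's; acquire on failure still observes the lock holder's update.
while (!__atomic_compare_exchange_n((uint32_t *)Lock, &Expected, SET,
                                    /*weak=*/false,
                                    __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)) {
  Expected = UNSET; // the builtin stored the observed value into Expected
  sleep();
}
```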
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 290: Why do we need the release fence in setLock before the CAS?
Line 292: We had an offline discussion. Based on that, I am OK with the current memorders in the patch for the atomicCAS in setLock. The C++11 spec says "Implementations should make atomic stores visible to atomic loads within a reasonable amount of time," and since the CAS is done in a loop, setLock should eventually see the update made by another thread that executed unsetLock.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 290: You have to release all of the ordinary stores before you write to the lock.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 292: Sounds good. I'll mark as resolved.
I'm quite sure that locks at thread scope do not work on AMDGPU or pre-Volta NVPTX. One of the threads wins the CAS, all the others do not, and it immediately deadlocks.
Critical sections can be done by rewriting the CFG; general-purpose locks can't.
What am I missing here?
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 290: In theory, setLock should have acquire semantics and unsetLock should have release semantics, so setLock does not really need the release fence. But this patch is consistent between the set and the unset, namely both use acq_rel semantics, so I am OK with the current state of the patch on this aspect.
What's missing is the outer loop, which only allows the lowest thread in the wave to actually be active. All the other inactive wave threads wait at a synchronization point. Once the first thread in the wave has executed the critical region, it rejoins the other threads and loops around again, but this time the next thread will be active, and so on. Does this help answer your question? Some of that code requires inspection of the LLVM IR if you want to see it; this here is just the runtime bit.
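A rough sketch of that outer loop, with hypothetical helpers (activeLaneCount, laneId, nthActiveLane, syncWarp); as noted above, the real lowering is done on LLVM IR by the compiler:

```cpp
// One lane at a time enters the critical region; the others wait at syncWarp().
void criticalForWave(omp_lock_t *Lock) {
  for (unsigned I = 0, N = activeLaneCount(); I < N; ++I) {
    if (laneId() == nthActiveLane(I)) { // exactly one active lane per trip
      setCriticalLock(Lock);            // contend with other waves
      // ... critical region body ...
      unsetCriticalLock(Lock);
    }
    syncWarp(); // reconverge before the next lane takes its turn
  }
}
```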
Agreed: this patch does not address "unstructured" locks in general, where there is intra-wave/warp thread divergence on GPUs that cannot guarantee forward progress for diverging lanes within the same wave.
What this is for is to support OpenMP critical sections, whose implementation is based on setLock/unsetLock; that's why the LIT test uses critical. The implementation, as I understand it, very carefully calls setLock/unsetLock as @doru1004 indicated (one lane in the wave calls setLock/unsetLock and the others wait at a wave sync point).
OK, then as written this is definitely going to blow up on us. We shouldn't implement the general-purpose lock API if it deadlocks unless called in a very specific situation.
It's probably best to emit the CAS inline as part of the IR transform, but otherwise we could add more runtime functions specific to critical. Uses of the general-purpose omp_lock should be a compile-time error on platforms that can't do it (it's unfortunate that lock returns void), but until then __builtin_trap at least looks clearer when debugging than a deadlock.
How about introducing __kmpc_* functions for all the set/unset/test lock functions and having the compiler call the kmpc versions? The set/unset/test versions can trap for the time being. That way, this patch can go in with interface name changes.
Could you provide a small reproducer for this issue? I'd like to include it in the testing + fix.
NB: I think I understand now: you probably mean the direct use of omp_set_lock and omp_unset_lock (so no need to provide an example; I thought you meant you had some critical-region situation), in which case yes, this patch does not cater to direct usage of those API functions.
> How about introducing __kmpc_* functions for all the set/unset/test lock functions and having the compiler call the kmpc versions? The set/unset/test versions can trap for the time being. That way, this patch can go in with interface name changes.
Agreed. As Carlo stated, this patch attempts to fix critical regions, not omp_set_lock/omp_unset_lock in general. I will make the necessary changes to reflect that. I thought omp_set_lock and omp_unset_lock were internal to the runtime, but since they are not, I will return them to their original unimplemented state.
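A sketch of the agreed split, with hypothetical entry-point names: the critical-region lowering calls internal functions, while the user-facing omp_*_lock API traps until it can be made a compile-time error:

```cpp
// Internal entry points used only by the critical-region lowering.
extern "C" void __kmpc_impl_set_lock(omp_lock_t *Lock) { setCriticalLock(Lock); }
extern "C" void __kmpc_impl_unset_lock(omp_lock_t *Lock) { unsetCriticalLock(Lock); }

// User-facing API: unsupported at thread scope on this target, so trap
// loudly rather than deadlock silently.
extern "C" void omp_set_lock(omp_lock_t *) { __builtin_trap(); }
extern "C" void omp_unset_lock(omp_lock_t *) { __builtin_trap(); }
```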
@JonChesterfield I believe I have addressed your concerns and the changes I made are now confined to critical regions only.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 378: Nit: move this into the generic part, please. AMDGPU can still override it.
I'm not confident that an acq_rel exchange in combination with the fences is correct, and if it is correct, I think it must be suboptimal. Why have you gone with that implementation?
Edit: I'd remove the "requested changes" status if I could work out how to. Thanks for that change.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 279: I expected unset to be a relaxed store of UNSET, possibly surrounded by fences. I'm not totally confident an acq_rel ordering here synchronises with the fences in setCriticalLock, though it might do.
Line 290: The fences probably need to be outside the branch on the active thread, assuming the intent is for the lock to cover all the threads in the warp, though I'm not sure it'll make any difference to codegen. I think relaxed ordering on both is right here.
Line 473: These read like a copy/paste error: most functions nearby call impl:: with a similar name. Could you add a comment to the effect that NVPTX uses the same lock implementation for critical sections as for other locks?
openmp/libomptarget/test/offloading/target_critical_region.cpp
Line 5: Why not NVPTX? Also, why not x64? Though that seems less immediately worrying.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
Line 279: The acq_rel semantics apply to the lock variable (not the fences), and the atomic store assumed here synchronizes with the atomic load in setCriticalLock.
Line 290: The fence here is to ensure all other ordinary stores happen before it, so it is enough that the one lane that gets through executes it; eventually all lanes will execute it.
Line 473: I can add the comment, but I am not sure what the copy/paste error is.
openmp/libomptarget/test/offloading/target_critical_region.cpp
Line 5: The NVPTX implementation is broken, so we avoid enabling it for this test. x64 is disabled since this test is really meant for critical regions inside target regions.
OK, talked to some more people. Fences are fine inside the branch.
We need acquire/release fencing around the critical block so that code which expects to see writes from other threads going through the same critical block works.
We need mutual exclusion so that we actually have the critical semantics.
As written, this patch should do that. Taking the mutex and doing device-scope fencing for each lane in the wavefront is slow, but it should work. Better would be to take the mutex once per warp, something like:
    if (id == ffs(activemask)) {
      while (atomicCAS(...))
        builtin_sleep();
      fence_acquire(agent);
    }
    for each lane in warp {
      fence_acquire(workgroup);
      // critical-region-here
      fence_release(workgroup);
    }
    drop-mutex(agent)
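A slightly fleshed-out version of that idea, again with hypothetical helpers (firstActiveLane, syncWarp, scoped fenceAcquire/fenceRelease); this is a sketch of the once-per-wave acquisition, not the patch as written:

```cpp
// Acquire the device-scope mutex once per wavefront instead of once per lane.
void criticalOncePerWave(uint32_t *Lock) {
  if (laneId() == firstActiveLane()) {              // one lane locks for all
    while (atomicCAS(Lock, UNSET, SET, atomic::relaxed) != UNSET)
      sleep();
    fenceAcquire(scope::agent); // observe writes from other wavefronts
  }
  syncWarp();
  // ... lanes take turns running the critical region, each turn wrapped in
  //     workgroup-scope fenceAcquire/fenceRelease ...
  if (laneId() == firstActiveLane()) {
    fenceRelease(scope::agent);                   // publish before unlocking
    atomicExchange(Lock, UNSET, atomic::relaxed); // drop the mutex
  }
}
```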