This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][libomptarget] Add support for critical regions in AMD GPU device offloading
ClosedPublic

Authored by doru1004 on Mar 10 2023, 3:23 PM.

Details

Summary

This patch adds support for pragma omp critical regions when offloading to AMD devices.

Diff Detail

Event Timeline

doru1004 created this revision.Mar 10 2023, 3:23 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 10 2023, 3:23 PM
doru1004 requested review of this revision.Mar 10 2023, 3:23 PM
doru1004 updated this revision to Diff 504827.Mar 13 2023, 1:23 PM
doru1004 updated this revision to Diff 504836.Mar 13 2023, 1:40 PM

Maybe we should define omp_lock_t as uint32_t on AMDGPU?

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
281

This doesn't work. You check the lock first, then you set it.
It has to be an atomic step. Use an atomicCAS, the result tells you if it worked/if it is now set. This is the setLock code w/o the while. You can even call this in the while below.

289
293

Shouldn't the release and acquire semantics be part of this CAS?
If we run relaxed who is to say we see the update of another thread. I would have assumed on failure we want an acquire fence and on success a release fence. Not sure if we need additional ones.

doru1004 added inline comments.Mar 15 2023, 1:17 PM
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
293

The idea is that the thread only executes the acquire once when it enters the critical region to prevent it from being executed at every iteration of the loop (which comes with performance penalties if it happens).

I would like to understand why you say there's a chance we don't see the update of another thread. The unsetting of the lock is happening here:

void unsetLock(omp_lock_t *Lock) {
  (void)atomicExchange((uint32_t *)Lock, UNSET, atomic::acq_rel);
}
doru1004 updated this revision to Diff 505607.Mar 15 2023, 1:20 PM
doru1004 marked 2 inline comments as done.
jdoerfert added inline comments.Mar 15 2023, 3:48 PM
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
293

The relaxed unset doesn't make other memory effects visible to a thread that takes the lock next. I believe it should come with a proper fence, hence be atomic::release.

Similarly, I am not sure if the update of the lock itself is guaranteed to be observed if the update and check are relaxed.

jdoerfert added inline comments.Mar 15 2023, 3:49 PM
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
293

UnsetLock actually has a release fence, testLock does not have proper fencing, sorry.

doru1004 updated this revision to Diff 505648.Mar 15 2023, 4:11 PM
doru1004 added inline comments.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
293

I fixed testLock to use acq_rel.

doru1004 added inline comments.Mar 16 2023, 10:08 AM
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
293

As far as I can tell the current combination of fences and CAS atomics works, please let me know if you have any comments or if I haven't addressed something.

dhruvachak added inline comments.Mar 16 2023, 10:16 AM
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
275

In theory, unset could simply do an atomic write to the Lock with release memorder. Any reason that's not used here? You are not using the prev value anyways.

290

I am getting confused by this pattern. You have:

while (!atomicCAS(...) != UNSET)

Can't the above be written as

while (atomicCAS(...) == UNSET)

But what does atomicCAS return? Bool or the prev value? If it is the prev value, shouldn't the condition be?

while (atomicCAS(...) == SET) { sleep(); }

291

I think the CAS should take an acq_rel as the success memorder and an acq as the failure memorder.

The acq in both the cases will ensure it sees an update from another thread. The release from the other updating thread is not sufficient for this thread to see that update. And when this thread succeeds, the release will ensure other threads see the update of this thread.

With the above memorders, I would think we could get rid of the fences.

dhruvachak added inline comments.Mar 19 2023, 4:44 PM
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
289

Why do we need the release fence in setLock before the CAS?

291

I think the CAS should take an acq_rel as the success memorder and an acq as the failure memorder.

The acq in both the cases will ensure it sees an update from another thread. The release from the other updating thread is not sufficient for this thread to see that update. And when this thread succeeds, the release will ensure other threads see the update of this thread.

With the above memorders, I would think we could get rid of the fences.

We had an offline discussion. Based on that, I am ok with the current memorders in the patch for the atomicCAS in setLock. C++11 spec has this statement "Implementations should make atomic stores visible to atomic loads within a reasonable amount of time." and since the CAS is done in a loop, the setLock should eventually see the update made by another thread that executed the unsetLock.

doru1004 added inline comments.Mar 20 2023, 7:11 AM
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
289

You have to release all of the ordinary stores before you write to the lock.

doru1004 marked an inline comment as done.Mar 20 2023, 7:12 AM
doru1004 added inline comments.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
291

Sounds good. I'll mark as resolved.

doru1004 marked 2 inline comments as done.Mar 20 2023, 7:13 AM
doru1004 marked an inline comment as done.
doru1004 updated this revision to Diff 506614.Mar 20 2023, 9:03 AM
doru1004 marked an inline comment as done.

I'm really sure that locks at thread scope do not work on amdgpu or pre-volta nvptx. One of the threads wins the cas, all the others do not, and it immediately deadlocks.

Critical sections can be done by rewriting the cfg, general purpose locks can't.

What am I missing here?

dhruvachak added inline comments.Mar 20 2023, 7:13 PM
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
289

In theory, setLock should have acquire semantics and unsetLock should have release semantics. So the setLock does not really need the release fence. But this patch is consistent between the set and the unset, namely both use acq_rel semantics. So I am ok with the current state of the patch on this aspect.

I'm really sure that locks at thread scope do not work on amdgpu or pre-volta nvptx. One of the threads wins the cas, all the others do not, and it immediately deadlocks.

Critical sections can be done by rewriting the cfg, general purpose locks can't.

What am I missing here?

What's missing is the outer loop which only allows the lowest thread in the wave to actually be active. All the other inactive wave threads are waiting at a synchronization point. Once the first thread in the wave executes the critical it joins the other threads and loops around again but this time the next thread will be active and so on. Does this help answer your question? Some of that code requires inspection of LLVM IR if you want to see it. This here is just the runtime bit.

I'm really sure that locks at thread scope do not work on amdgpu or pre-volta nvptx. One of the threads wins the cas, all the others do not, and it immediately deadlocks.

Critical sections can be done by rewriting the cfg, general purpose locks can't.

What am I missing here?

Agreed: this patch does not address "unstructured" locks in general, when there is inter-wave/warp thread divergence in GPUs that cannot guarantee forward progress for diverging lanes within the same wave.

What this is for is to support OpenMP critical sections, whose implementation is based on setLock/unsetLock. That's why the LIT test uses critical. The implementation - it is my understanding - calls very carefully set/unsetLock as @doru1004 indicated (one lane in the wave calss set/unsetLock and the others are waiting at a wave sync point).

doru1004 updated this revision to Diff 507120.Mar 21 2023, 2:11 PM
JonChesterfield requested changes to this revision.EditedMar 21 2023, 3:20 PM

OK then as written this is definitely going to blow up on us. We shouldn't implement the general purpose lock API if it deadlocks unless called in a very specific situation.

Probably best to emit the CAS in line as part of the IR transform, but otherwise we could add more runtime functions specific to critical. Uses of the general purpose omp_lock should be a compile time error on platforms that can't do it (it's unfortunate that lock returns void), but until then builtin_trap at least looks clearer when debugging than deadlock.

This revision now requires changes to proceed.Mar 21 2023, 3:20 PM

OK then as written this is definitely going to blow up on us. We shouldn't implement the general purpose lock API if it deadlocks unless called in a very specific situation.

Probably best to emit the CAS in line as part of the IR transform, but otherwise we could add more runtime functions specific to critical. Uses of the general purpose omp_lock should be a compile time error on platforms that can't do it (it's unfortunate that lock returns void), but until then builtin_trap at least looks clearer when debugging than deadlock.

How about introducing __kmpc_* functions for all the set/unset/test lock functions and have the compiler call the kmpc versions? The set/unset/test versions can trap for the time being. That way, this patch can go in with interface name changes.

doru1004 added a comment.EditedMar 21 2023, 4:33 PM

OK then as written this is definitely going to blow up on us. We shouldn't implement the general purpose lock API if it deadlocks unless called in a very specific situation.

Could you provide a small reproducer for this issue? I'd like to include it in the testing + fix.

NB: I think I understand now that you probably mean the use of omp_set_lock and omp_unset_lock directly (so no need to provide an example I thought you meant you had some critical region situation) in which case yes this patch does not cater to the direct usage of those API functions.

How about introducing __kmpc_* functions for all the set/unset/test lock functions and have the compiler call the kmpc versions? The set/unset/test versions can trap for the time being. That way, this patch can go in with interface name changes.

Agreed. As Carlo stated this patch attempts to fix the critical region not omp_set_lock / omp_unset_lock in general. I will make the necessary changes to reflect that. I thought omp_set_lock and omp_unset_lock are internal to the runtime but since they are not I will return them to their original unimplemented state.

doru1004 updated this revision to Diff 507753.Mar 23 2023, 8:37 AM
doru1004 updated this revision to Diff 507755.Mar 23 2023, 8:44 AM

OK then as written this is definitely going to blow up on us. We shouldn't implement the general purpose lock API if it deadlocks unless called in a very specific situation.

Probably best to emit the CAS in line as part of the IR transform, but otherwise we could add more runtime functions specific to critical. Uses of the general purpose omp_lock should be a compile time error on platforms that can't do it (it's unfortunate that lock returns void), but until then builtin_trap at least looks clearer when debugging than deadlock.

@JonChesterfield I believe I have addressed your concerns and the changes I made are now confined to critical regions only.

jdoerfert added inline comments.Mar 23 2023, 9:54 AM
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
377

Nit: Move it into the generic part, plz. AMDGPU can overwrite it still.

doru1004 updated this revision to Diff 507908.Mar 23 2023, 3:51 PM
JonChesterfield added a comment.EditedMar 23 2023, 5:16 PM

I'm not confident that an acq_rel exchange in combination with the fences is correct. If it is correct, I think it must be suboptimal. Why have you gone with that implementation?

edit: I'd remove the requested changes if I could work out how to. Thanks for that change

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
278

I expected unset to be a relaxed store of unset, possibly surrounded by fences. I'm not totally confident an acq_rel ordering here synchronises with the fences in setCriticalLock, though it might do.

289

The fences probably need to be outside the branch on active thread, assuming the intent is for the lock to cover all the threads in the warp. Though I'm not sure it'll make any difference to codegen.

I think relaxed ordering on both is right here.

480

These read like a copy/paste error - most functions nearby are calling impl:: with a similar name. Could you add a comment to the effect that nvptx uses the same lock implementation for critical sections as for other locks?

openmp/libomptarget/test/offloading/target_critical_region.cpp
6

why not nvptx? also why not x64, but that seems less immediately worrying

doru1004 added inline comments.Mar 24 2023, 6:35 AM
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
278

The acq_rel semantics apply to the lock variable (not the fences) and the atomic store assumed here synchronizes with the atomic load in setCriticalLock.

289

The fence here is to ensure all other ordinary stores happen before the fence so it is enough that the lane that gets through to execute it. Eventually all lanes will execute it.

480

I can add the comment I am not sure what the copy / paste error is.

openmp/libomptarget/test/offloading/target_critical_region.cpp
6

The nvptx implementation is broken so we avoid enabling it for this test. x64 is disabled since this test is really meant to be for critical regions in target regions.

JonChesterfield accepted this revision.Mar 24 2023, 10:34 AM

OK, talked to some more people. Fences are fine inside the branch.

We need acquire/release fencing around the critical block so that code which expects to see writes from other threads going through the same critical block works.

We need mutual exclusion so that we actually have the critical semantics.

As written this patch should do that. Taking the mutex & doing device scope fencing for each lane in the wavefront is a slow thing to do but should work. Better would be to take the mutex once per warp, something like:

if (id == ffs(activemask)) {
  while (atomicCAS(...)) builtin_sleep();
  fence_acquire(agent)
}
for each lane in warp {
  fence_acquire(workgroup);
  critical-region-here
  fence_release(workgroup);
}
drop-mutex(agent)
This revision is now accepted and ready to land.Mar 24 2023, 10:34 AM
dhruvachak accepted this revision.Mar 24 2023, 11:12 AM

LG. All of my concerns have been resolved.

doru1004 marked 5 inline comments as done.Mar 24 2023, 11:18 AM