Page MenuHomePhabricator

[libomptarget] Implement locks for amdgcn
ClosedPublic

Authored by JonChesterfield on Mar 3 2020, 10:52 AM.

Details

Summary

[libomptarget] Implement locks for amdgcn

The nvptx implementation deadlocks on amdgcn. atomic_cas with multiple
active lanes can deadlock - if one lane succeeds, all the others are locked
out. The set_lock implementation therefore runs on a single lane.

Also uses a sleep intrinsic instead of the system clock for a probably
minor performance improvement. The unset/test implementations may be revised
later, based on code size / performance or similar concerns.

This implements the lock at a per-wavefront scope. That's not strictly as
specified, since openmp describes locks in terms of threads. I think the
nvptx implementation provides true per-thread locking on volta and the same
per-warp locking on other architectures.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptMar 3 2020, 10:52 AM
  • s/warp/wavefront
Harbormaster completed remote builds in B47942: Diff 247961.
jdoerfert added inline comments.Mar 4 2020, 1:47 PM
openmp/libomptarget/deviceRTLs/amdgcn/src/amdgcn_locks.hip
16

Doesn't this mean we cannot implement locks properly at all?

If we say thread == lane of simd, and every lane executes the same instruction with some masked off, then we can 'lock' a thread with respect to the rest of the device if and only if it is the only active one in the warp.

It's a symptom of defining thread to have finer granularity than instruction pointer, which I believe is a serious design mistake passed down from cuda. I'd like openmp to map SIMD onto the warp instead, at which point we can lock the newly defined 'thread' using an implementation like this.

If we continue doing thread==lane, and want to support this API, I think it can be done by rewriting the cfg. The end result will perform horrendously, but that's fairly likely of any code using spin locks.

The above is also a spin lock, which is not traditionally a great idea, so I'd like to add a futex syscall equivalent to the kernel driver. That's a longer term goal.

In the meantime, this is the most useful functionality I have been able to work out under the openmp lock API.

Can we have a print("Locks are not supported in this thread mapping model") instead?

  • Warn instead of providing a partial implementation

Can we have a print("Locks are not supported in this thread mapping model") instead?

Absolutely. Updated the diff to do so.

The CAS lock implementation will work correctly for a simd=>warp model, so can resurrect this once that's online.

This revision is now accepted and ready to land.Mar 5 2020, 11:59 AM
This revision was automatically updated to reflect the committed changes.