This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Add intrinsics to support named barriers.
ClosedPublic

Authored by arpith-jacob on Feb 26 2016, 2:07 PM.

Details

Summary

Support for barrier synchronization between a subset of threads in a CTA through one of sixteen explicitly specified barriers. These intrinsics are not directly exposed in CUDA but are critical for forthcoming Clang/LLVM support of OpenMP on NVPTX GPUs.

The intrinsics allow the synchronization of an arbitrary (multiple of 32) number of threads in a CTA at one of 16 distinct barriers. The two intrinsics added are as follows:

call void @llvm.nvvm.barrier.n(i32 10)
waits for all threads in a CTA to arrive at named barrier #10.

call void @llvm.nvvm.barrier(i32 15, i32 992)
waits for 992 threads in a CTA to arrive at barrier #15.
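
For illustration, here is a minimal sketch of how the two calls above might appear in a complete IR function. The function name @named_barrier_example is hypothetical, and the comments about lowering are an assumption based on the PTX bar.sync instruction forms, not something quoted from this patch. Note that 992 = 31 * 32, so the thread count satisfies the multiple-of-32 requirement.

```llvm
; Declarations of the two intrinsics described in the summary.
declare void @llvm.nvvm.barrier.n(i32)
declare void @llvm.nvvm.barrier(i32, i32)

; Hypothetical function name, used only for this sketch.
define void @named_barrier_example() {
entry:
  ; All threads in the CTA wait at named barrier 10.
  ; Assumed to lower to something like: bar.sync 10;
  call void @llvm.nvvm.barrier.n(i32 10)

  ; 992 threads (31 warps of 32) wait at named barrier 15.
  ; Assumed to lower to something like: bar.sync 15, 992;
  call void @llvm.nvvm.barrier(i32 15, i32 992)
  ret void
}
```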

A detailed description of these intrinsics is available in the PTX manual.
http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions

Diff Detail

Repository
rL LLVM

Event Timeline

arpith-jacob retitled this revision to [NVPTX] Add intrinsics to support named barriers.
arpith-jacob updated this object.
arpith-jacob added reviewers: jholewinski, hfinkel.
jlebar added a subscriber: jlebar. Feb 27 2016, 9:43 AM
jlebar added inline comments.
include/llvm/IR/IntrinsicsNVVM.td
743 ↗(On Diff #49234)

I think you want convergent here, in addition to noduplicate. (I have patches which let us finally remove noduplicate for the other barriers, but I'm still waiting on reviews.)

Convergent is necessary to prevent the compiler from splitting an instruction such that some threads may run one copy, while others may run another copy. This matters because "barriers are executed on a per-warp basis as if all the threads in a warp are active", so if some threads are currently inactive (because we added a control-flow dependency to the convergent op), then the barrier will never complete.

I'm not 100% sure, because this is for intra-CTA (not merely intra-warp) synchronization, but I don't think you'll need noduplicate once we can remove it elsewhere. The main way that noduplicate is stricter than convergent is that you can't inline functions that contain a noduplicate instruction (unless you're only inlining into one place), and you can't unroll loops that contain noduplicate instructions. Neither of those should be a problem here.
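
To make the convergent concern concrete, here is a rough sketch of the kind of transformation the attribute is meant to rule out. The function names @before and @after_sinking are hypothetical, and the second function is shown only as the shape a barrier call must not be rewritten into; it is not output from any particular pass.

```llvm
declare void @llvm.nvvm.barrier.n(i32)

; Original shape: the barrier executes before the branch, so it does not
; depend on %cond.
define void @before(i1 %cond) {
entry:
  call void @llvm.nvvm.barrier.n(i32 1)
  br i1 %cond, label %then, label %else
then:
  br label %exit
else:
  br label %exit
exit:
  ret void
}

; Shape that convergent should prevent: the call has been sunk into both
; successors, so it is now control-dependent on %cond, and threads of one
; warp that disagree on %cond would reach different copies of the barrier.
define void @after_sinking(i1 %cond) {
entry:
  br i1 %cond, label %then, label %else
then:
  call void @llvm.nvvm.barrier.n(i32 1)
  br label %exit
else:
  call void @llvm.nvvm.barrier.n(i32 1)
  br label %exit
exit:
  ret void
}
```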

Added convergent to the barrier intrinsics.

arpith-jacob marked an inline comment as done. Feb 27 2016, 10:03 AM
arpith-jacob added inline comments.
include/llvm/IR/IntrinsicsNVVM.td
743 ↗(On Diff #49296)

Done. Thanks for the explanation; makes sense. When you do remove noduplicate, please go ahead and remove it for named barriers as well. Thanks!

hfinkel accepted this revision. Apr 26 2016, 6:23 PM
hfinkel edited edge metadata.

LGTM.

include/llvm/IR/IntrinsicsNVVM.td
743 ↗(On Diff #49296)

This happened in r264107. Update this patch before submitting.

This revision is now accepted and ready to land. Apr 26 2016, 6:23 PM

Looks like the patch was not committed.

mkuron added a subscriber: mkuron. Oct 18 2016, 6:31 AM
arpith-jacob marked an inline comment as done.

Looks like this patch slipped through the cracks :( I've made the requested changes.

You can go ahead and commit then.

jlebar accepted this revision. Jan 27 2017, 6:28 PM
This revision was automatically updated to reflect the committed changes.