Support for barrier synchronization among a subset of the threads in a CTA through one of sixteen explicitly specified barriers. The new intrinsics are not directly exposed in CUDA but are critical for forthcoming Clang/LLVM support of OpenMP on NVPTX GPUs.
The intrinsics allow synchronization of an arbitrary number of threads (a multiple of 32) in a CTA at one of 16 distinct barriers. The two intrinsics added are as follows:
call void @llvm.nvvm.barrier.n(i32 10)
waits for all threads in a CTA to arrive at named barrier #10.
call void @llvm.nvvm.barrier(i32 15, i32 992)
waits for 992 threads in a CTA to arrive at barrier #15.
A detailed description of these intrinsics is available in the PTX manual:
http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions
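As a rough sketch of how the intrinsics might appear in IR (the kernel wrapper and its name are made up for illustration; only the intrinsic names, signatures, and arguments come from the examples above):

declare void @llvm.nvvm.barrier.n(i32)
declare void @llvm.nvvm.barrier(i32, i32)

define void @example_kernel() {
entry:
  ; All threads in the CTA wait at named barrier #10.
  call void @llvm.nvvm.barrier.n(i32 10)
  ; 992 threads in the CTA wait at barrier #15.
  call void @llvm.nvvm.barrier(i32 15, i32 992)
  ret void
}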
I think you want convergent here, in addition to noduplicate. (I have patches which let us finally remove noduplicate for the other barriers, but I'm still waiting on reviews.)
Convergent is necessary to prevent the compiler from splitting an instruction such that some threads run one copy while others run the other copy. This matters because "barriers are executed on a per-warp basis as if all the threads in a warp are active", so if some threads are currently inactive (because we added a control-flow dependency to the convergent op), the barrier will never complete.
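To illustrate the hazard (a hypothetical sketch; the tid read and the branch are made up, not taken from this patch), here is what a function could look like after an invalid transform gives the barrier a new control dependency on a thread-dependent value:

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier.n(i32)

define void @after_bad_transform() {
entry:
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %p = icmp ult i32 %tid, 32
  br i1 %p, label %then, label %exit

then:
  ; Only threads with %tid < 32 reach the barrier; the others never
  ; arrive, so the threads waiting here never make progress.
  call void @llvm.nvvm.barrier.n(i32 10)
  br label %exit

exit:
  ret void
}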
I'm not 100% sure, because this is for intra-CTA -- not merely intra-warp -- synchronization, but I don't think you'll need noduplicate once we can remove it elsewhere. The main way that noduplicate is stricter than convergent is that you can't inline functions that contain a noduplicate instruction (unless you're only inlining into one place), and you can't unroll loops that contain noduplicate instructions. Neither of those should be a problem here.
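Concretely, the declarations I'm suggesting would look something like this (a sketch of the attributes only, not the patch's actual definitions) -- convergent in addition to noduplicate for now, with noduplicate possibly dropped later:

; Sketch: mark both barrier intrinsics convergent so transforms can't
; add thread-dependent control dependencies; keep noduplicate for now.
declare void @llvm.nvvm.barrier.n(i32) #0
declare void @llvm.nvvm.barrier(i32, i32) #0

attributes #0 = { convergent noduplicate }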