Support for barrier synchronization among a subset of the threads in a CTA through one of sixteen explicitly specified barriers. The new intrinsics are not directly exposed in CUDA but are critical for forthcoming Clang/LLVM support of OpenMP on NVPTX GPUs.
The intrinsics allow synchronization of an arbitrary number of threads (a multiple of 32) in a CTA at one of 16 distinct barriers. The two intrinsics added are as follows:
call void @llvm.nvvm.barrier.n(i32 10)
waits for all threads in a CTA to arrive at named barrier #10.
call void @llvm.nvvm.barrier(i32 15, i32 992)
waits for 992 threads in a CTA to arrive at barrier #15.
A detailed description of these intrinsics is available in the PTX manual:
http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions
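As a rough sketch of how the intrinsics might appear in IR (the kernel wrapper and its name are made up for illustration; only the intrinsic names, signatures, and arguments come from the examples above):

declare void @llvm.nvvm.barrier.n(i32)
declare void @llvm.nvvm.barrier(i32, i32)

define void @example_kernel() {
entry:
  ; All threads in the CTA wait at named barrier #10.
  call void @llvm.nvvm.barrier.n(i32 10)
  ; 992 threads in the CTA wait at barrier #15.
  call void @llvm.nvvm.barrier(i32 15, i32 992)
  ret void
}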
I think you want convergent here, in addition to noduplicate. (I have patches which let us finally remove noduplicate for the other barriers, but I'm still waiting on reviews.)
Convergent is necessary to prevent the compiler from splitting an instruction such that some threads run one copy while others run the other copy. This matters because "barriers are executed on a per-warp basis as if all the threads in a warp are active", so if some threads are currently inactive (because we added a control-flow dependency to the convergent op), the barrier will never complete.
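To illustrate the hazard (a hypothetical sketch; the tid read and the branch are made up, not taken from this patch), here is what a function could look like after an invalid transform gives the barrier a new control dependency on a thread-dependent value:

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
declare void @llvm.nvvm.barrier.n(i32)

define void @after_bad_transform() {
entry:
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %p = icmp ult i32 %tid, 32
  br i1 %p, label %then, label %exit

then:
  ; Only threads with %tid < 32 reach the barrier; the others never
  ; arrive, so the threads waiting here never make progress.
  call void @llvm.nvvm.barrier.n(i32 10)
  br label %exit

exit:
  ret void
}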
I'm not 100% sure, because this is for intra-CTA -- not merely intra-warp -- synchronization, but I don't think you'll need noduplicate once we can remove it elsewhere. The main way that noduplicate is stricter than convergent is that you can't inline functions that contain a noduplicate instruction (unless you're only inlining into one place), and you can't unroll loops that contain noduplicate instructions. Neither of those should be a problem here.
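Concretely, the declarations I'm suggesting would look something like this (a sketch of the attributes only, not the patch's actual definitions) -- convergent in addition to noduplicate for now, with noduplicate possibly dropped later:

; Sketch: mark both barrier intrinsics convergent so transforms can't
; add thread-dependent control dependencies; keep noduplicate for now.
declare void @llvm.nvvm.barrier.n(i32) #0
declare void @llvm.nvvm.barrier(i32, i32) #0

attributes #0 = { convergent noduplicate }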