This is an archive of the discontinued LLVM Phabricator instance.

clang/include/clang/Basic/BuiltinsNVPTX.def
460–468	Instead of creating one builtin per integer variant, can we use a more generic builtin `__nvvm_redux_sync_add_i`, similar to how we handle `__nvvm_atom_add_gen_i` ?
llvm/include/llvm/IR/IntrinsicsNVVM.td
4103	This could also be consolidated into an overloaded intrinsic operating on `llvm_anyint_ty`
4105	Similar to `shfl`, these intrinsics probably need `IntrInaccessibleMemOnly` as they exchange data with other threads.

tra added a reviewer: tra.Apr 8 2021, 10:34 AM

Harbormaster completed remote builds in B97765: Diff 336137.Apr 8 2021, 11:09 AM

@tra Thank you for the feedback! I think I see what you're getting at, but I am not quite understanding how it would work for these builtins and intrinsics. I have added some comments to the corresponding feedback about my confusion and/or concerns.

The comment about IntrInaccessibleMemOnly I agree with completely, so I will replace IntrNoMem with it in the updated version I'm getting ready. Good call. :)

clang/include/clang/Basic/BuiltinsNVPTX.def
460–468	What gives me pause is that a for atomic minimum there are both `__nvvm_atom_min_gen_i` and `__nvvm_atom_min_gen_ui` to distinguish between signed and unsigned. What makes the difference? That noted, I'll happily rename the builtins to be more in line with the other builtins. `__nvvm_redux_sync__i` and `__nvvm_redux_sync__ui` maybe?
llvm/include/llvm/IR/IntrinsicsNVVM.td
4103	I am having a look at other uses of this, but I'm having difficulty wrapping my head around how to map these overloads to the PTX instructions in llvm/lib/Target/NVPTX/NVPTXIntrinsics.td. Though it would be nice, it just seems overly complicated to get a signed and an unsigned 32-bit integer version of each of these intrinsics.

Following changes:

Changed the type in the names of the intrinsics and builtins.
Changed use of IntrNoMem to IntrInaccessibleMemOnly.
Added PTX70 as a requirement to the builtins.

Harbormaster completed remote builds in B97988: Diff 336454.Apr 9 2021, 8:54 AM

tra added inline comments.Apr 12 2021, 1:22 PM

clang/include/clang/Basic/BuiltinsNVPTX.def
460–468	What gives me pause is that a for atomic minimum there are both nvvm_atom_min_gen_i and nvvm_atom_min_gen_ui to distinguish between signed and unsigned. What makes the difference? Good point. We do not need unsigned variant for `add`. We do need explicit signed and unsigned variants ad LLVM IR integer types do not take signedness into account, and the underlying min/max instructions do. Maybe, rename min_i/min_ui -> min/umin as LLVM does with atomics? We may skip the `_i` suffix on logical ops as they only apply to integers anyways.
llvm/include/llvm/IR/IntrinsicsNVVM.td
4103	Considering that `redux` only supports 32-bit integers, we do not need to get into it. `llvm_i32_ty` will do for now. We'll cross the bridge if/when we get to support multiple integer types.

Interesting. Reduction across lanes in warp? If so, this is probably a way to handle the last step reduction for openmp reductions

In D100124#2684303, @JonChesterfield wrote:

Interesting. Reduction across lanes in warp? If so, this is probably a way to handle the last step reduction for openmp reductions

It is! I can imagine that it would be useful for OpenMP reductions, though it is limited to few, albeit common, operators on 32-bit integers.

steffenlarsen marked 3 inline comments as done.Apr 20 2021, 10:18 AM

steffenlarsen added inline comments.

clang/include/clang/Basic/BuiltinsNVPTX.def
460–468	Sorry, I completely missed your responses. Maybe, rename min_i/min_ui -> min/umin as LLVM does with atomics? Sounds good to me. Would there also be umax and uadd? We may skip the _i suffix on logical ops as they only apply to integers anyways. Absolutely. I'll make that happen!
llvm/include/llvm/IR/IntrinsicsNVVM.td
4103	Perfect, thank you!

tra added inline comments.Apr 20 2021, 10:52 AM

clang/include/clang/Basic/BuiltinsNVPTX.def
460–468	Would there also be umax and uadd? You will need `umax`, but there's no need for `uadd` as 2-complement addition is the same for signed/unsigned. E.g `umax(0xffffffff, 1) -> 0xffffffff`, `max(-1,1) -> 1`, give different answers, but `uadd(0xffffffff, 1) -> 0` and `add(-1,1) -> 0`.

steffenlarsen added inline comments.Apr 21 2021, 1:57 AM

clang/include/clang/Basic/BuiltinsNVPTX.def
460–468	Ah, of course. Though I do wonder as to the motivation of having signed and unsigned add variants in PTX. I'll drop the unsigned variant.

Changes:

Removed integer type from builtin and intrinsic names.
Signedness in builtin and intrinsic names moved to operator name, i.e. umin and umax.
Removed redundant addition variant.

Harbormaster completed remote builds in B99937: Diff 339169.Apr 21 2021, 4:36 AM

Do you know if any existing code already uses the __nvvm_* builtins for cp.async? In other words, does nvcc provide them already or is it something we're free to name as we wish?
I do not see any relevant intrinsics mentioned in NVVM IR spec: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html and I don't think NVCC's builtins are publicly documented anywhere.

clang/include/clang/Basic/BuiltinsNVPTX.def
460–468	It's for uniformity sake, I guess. All arithmetic ops in PTX operate on sXX/uXX arguments, though not all of them have to.

Do you know if any existing code already uses the __nvvm_* builtins for cp.async? In other words, does nvcc provide them already or is it something we're free to name as we wish? I do not see any relevant intrinsics mentioned in NVVM IR spec: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html and I don't think NVCC's builtins are publicly documented anywhere.

I don't know of any yet. We will be using these in the relatively near future, but we can still change them no problem. However, the intrinsic and builtin naming for NVVM and NVPTX seems a bit inconsistent so it may be a long discussion (or maybe not.)

clang/include/clang/Basic/BuiltinsNVPTX.def
460–468	I bet you're right. Thanks for the help. 😄

In D100124#2707731, @steffenlarsen wrote:

Do you know if any existing code already uses the __nvvm_* builtins for cp.async? In other words, does nvcc provide them already or is it something we're free to name as we wish? I do not see any relevant intrinsics mentioned in NVVM IR spec: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html and I don't think NVCC's builtins are publicly documented anywhere.

I don't know of any yet. We will be using these in the relatively near future, but we can still change them no problem. However, the intrinsic and builtin naming for NVVM and NVPTX seems a bit inconsistent so it may be a long discussion (or maybe not.)

LLVM uses different intrinsic names, mostly for historic reasons -- NVVM implements their own without upstreaming them back to LLVM and meanwhile LLVM grew its own set. So far I haven't seen any practical cases where that might've been an issue. I think NVIDIA folks popped up on a review *once* when they thought that an intrinsic we were about to introduce might've clashed with one of theirs, but they prompty disappeared when it turned out not to be the case. The bottom line is that effectively intrinsic names in LLVM and NVVM are independent, though we should take care not to introduce identically named ones with different parameters or functionality.

Clang builtins are a bit different. Clang needs to compile CUDA headers and those do use __nvvm builtinns, so clang must also provide those. NVIDIA does not document NVCC's compiler builtins, so if they are not used in CUDA headers, we have no idea whether relevant ones already exist. It would be great to stay in sync and make end-users code more portable across clang/NVCC, but there's not much we can do about that at the moment. The risk there is that if NVCC eventually introduces a builtin with the name we've used, but with a different arguments or functionality, that would be a bit of an annoyance for the users.

This revision is now accepted and ready to land.Apr 22 2021, 10:19 AM

vchuravy added a subscriber: vchuravy.May 4 2021, 11:37 AM

@tra Thanks a ton for the review! This is my first LLVM patch so I only know as much as the Code Review documentation tells me. Is there a process for chasing up additional reviews?

In D100124#2757251, @steffenlarsen wrote:

@tra Thanks a ton for the review! This is my first LLVM patch so I only know as much as the Code Review documentation tells me. Is there a process for chasing up additional reviews?

Generally, you don't need approvals from *all* the reviewers on the list. My rule of thumb is to give the patch few days, and wait for the LGTM from someone who owns the code (this is hard to establish sometimes) or from someone familiar with the code.
In this case my LGTM is sufficient and the patch has been out long enough for the interested parties to raise concerns if there were any.

Do you have ability to commit to LLVM? If not, I can land the patch on your behalf.

In D100124#2757306, @tra wrote:

Do you have ability to commit to LLVM? If not, I can land the patch on your behalf.

Not to my knowledge, so please do. Thanks again!

https://reviews.llvm.org/D100394 is from my colleagues and it also looks ready. Would you mind landing that one as well? 😄

tra mentioned this in D100394: [Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX cp.async instructions.May 17 2021, 9:40 AM

Closed by commit rGf226e28a880f: [Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX redux.sync… (authored by steffenlarsen, committed by tra). · Explain WhyMay 17 2021, 9:47 AM

This revision was automatically updated to reflect the committed changes.

tra added a commit: rGf226e28a880f: [Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX redux.sync….

Revision Contents

Path

Size

clang/

include/

clang/

Basic/

BuiltinsNVPTX.def

11 lines

test/

CodeGenCUDA/

redux-builtins.cu

47 lines

llvm/

include/

llvm/

IR/

IntrinsicsNVVM.td

48 lines

lib/

Target/

NVPTX/

NVPTXInstrInfo.td

2 lines

NVPTXIntrinsics.td

17 lines

test/

CodeGen/

NVPTX/

redux-sync.ll

73 lines

Diff 336137

clang/include/clang/Basic/BuiltinsNVPTX.def

	Show First 20 Lines • Show All 450 Lines • ▼ Show 20 Lines

	// Match			// Match
	TARGET_BUILTIN(__nvvm_match_any_sync_i32, "UiUiUi", "", PTX60)			TARGET_BUILTIN(__nvvm_match_any_sync_i32, "UiUiUi", "", PTX60)
	TARGET_BUILTIN(__nvvm_match_any_sync_i64, "WiUiWi", "", PTX60)			TARGET_BUILTIN(__nvvm_match_any_sync_i64, "WiUiWi", "", PTX60)
	// These return a pair {value, predicate}, which requires custom lowering.			// These return a pair {value, predicate}, which requires custom lowering.
	TARGET_BUILTIN(__nvvm_match_all_sync_i32p, "UiUiUii*", "", PTX60)			TARGET_BUILTIN(__nvvm_match_all_sync_i32p, "UiUiUii*", "", PTX60)
	TARGET_BUILTIN(__nvvm_match_all_sync_i64p, "WiUiWii*", "", PTX60)			TARGET_BUILTIN(__nvvm_match_all_sync_i64p, "WiUiWii*", "", PTX60)

				// Redux
				TARGET_BUILTIN(__nvvm_redux_sync_add_s32, "SiSii", "", SM_80)
				TARGET_BUILTIN(__nvvm_redux_sync_min_s32, "SiSii", "", SM_80)
				TARGET_BUILTIN(__nvvm_redux_sync_max_s32, "SiSii", "", SM_80)
				TARGET_BUILTIN(__nvvm_redux_sync_add_u32, "UiUii", "", SM_80)
				TARGET_BUILTIN(__nvvm_redux_sync_min_u32, "UiUii", "", SM_80)
				TARGET_BUILTIN(__nvvm_redux_sync_max_u32, "UiUii", "", SM_80)
				TARGET_BUILTIN(__nvvm_redux_sync_and_b32, "iii", "", SM_80)
				TARGET_BUILTIN(__nvvm_redux_sync_xor_b32, "iii", "", SM_80)
				TARGET_BUILTIN(__nvvm_redux_sync_or_b32, "iii", "", SM_80)
				traUnsubmitted Not Done Reply Inline Actions Instead of creating one builtin per integer variant, can we use a more generic builtin `__nvvm_redux_sync_add_i`, similar to how we handle `__nvvm_atom_add_gen_i` ? tra: Instead of creating one builtin per integer variant, can we use a more generic builtin…
				steffenlarsenAuthorUnsubmitted Not Done Reply Inline Actions What gives me pause is that a for atomic minimum there are both `__nvvm_atom_min_gen_i` and `__nvvm_atom_min_gen_ui` to distinguish between signed and unsigned. What makes the difference? That noted, I'll happily rename the builtins to be more in line with the other builtins. `__nvvm_redux_sync__i` and `__nvvm_redux_sync__ui` maybe? steffenlarsen: What gives me pause is that a for atomic minimum there are both `__nvvm_atom_min_gen_i` and…
				traUnsubmitted Not Done Reply Inline Actions What gives me pause is that a for atomic minimum there are both nvvm_atom_min_gen_i and nvvm_atom_min_gen_ui to distinguish between signed and unsigned. What makes the difference? Good point. We do not need unsigned variant for `add`. We do need explicit signed and unsigned variants ad LLVM IR integer types do not take signedness into account, and the underlying min/max instructions do. Maybe, rename min_i/min_ui -> min/umin as LLVM does with atomics? We may skip the `_i` suffix on logical ops as they only apply to integers anyways. tra: > What gives me pause is that a for atomic minimum there are both __nvvm_atom_min_gen_i and…
				steffenlarsenAuthorUnsubmitted Not Done Reply Inline Actions Sorry, I completely missed your responses. Maybe, rename min_i/min_ui -> min/umin as LLVM does with atomics? Sounds good to me. Would there also be umax and uadd? We may skip the _i suffix on logical ops as they only apply to integers anyways. Absolutely. I'll make that happen! steffenlarsen: Sorry, I completely missed your responses. > Maybe, rename min_i/min_ui -> min/umin as LLVM…
				traUnsubmitted Not Done Reply Inline Actions Would there also be umax and uadd? You will need `umax`, but there's no need for `uadd` as 2-complement addition is the same for signed/unsigned. E.g `umax(0xffffffff, 1) -> 0xffffffff`, `max(-1,1) -> 1`, give different answers, but `uadd(0xffffffff, 1) -> 0` and `add(-1,1) -> 0`. tra: > Would there also be umax and uadd? You will need `umax`, but there's no need for `uadd` as 2…
				steffenlarsenAuthorUnsubmitted Not Done Reply Inline Actions Ah, of course. Though I do wonder as to the motivation of having signed and unsigned add variants in PTX. I'll drop the unsigned variant. steffenlarsen: Ah, of course. Though I do wonder as to the motivation of having signed and unsigned add…
				traUnsubmitted Not Done Reply Inline Actions It's for uniformity sake, I guess. All arithmetic ops in PTX operate on sXX/uXX arguments, though not all of them have to. tra: It's for uniformity sake, I guess. All arithmetic ops in PTX operate on sXX/uXX arguments…
				steffenlarsenAuthorUnsubmitted Done Reply Inline Actions I bet you're right. Thanks for the help. 😄 steffenlarsen: I bet you're right. Thanks for the help. 😄

	// Membar			// Membar

	BUILTIN(__nvvm_membar_cta, "v", "")			BUILTIN(__nvvm_membar_cta, "v", "")
	BUILTIN(__nvvm_membar_gl, "v", "")			BUILTIN(__nvvm_membar_gl, "v", "")
	BUILTIN(__nvvm_membar_sys, "v", "")			BUILTIN(__nvvm_membar_sys, "v", "")

	// Memcpy, Memset			// Memcpy, Memset

	▲ Show 20 Lines • Show All 279 Lines • Show Last 20 Lines

clang/test/CodeGenCUDA/redux-builtins.cu

This file was added.

				// RUN: %clang_cc1 "-triple" "nvptx-nvidia-cuda" "-target-feature" "+ptx70" "-target-cpu" "sm_80" -emit-llvm -fcuda-is-device -o - %s \| FileCheck %s
				// RUN: %clang_cc1 "-triple" "nvptx64-nvidia-cuda" "-target-feature" "+ptx70" "-target-cpu" "sm_80" -emit-llvm -fcuda-is-device -o - %s \| FileCheck %s

				// CHECK: define{{.}} void @_Z6kernelPi(i32 %out)
				__attribute__((global)) void kernel(int *out) {
				int a = 1;
				unsigned int b = 5;
				int i = 0;

				out[i++] = __nvvm_redux_sync_add_s32(a, 0xFF);
				// CHECK: call i32 @llvm.nvvm.redux.sync.add.s32

				out[i++] = __nvvm_redux_sync_min_s32(a, 0x0F);
				// CHECK: call i32 @llvm.nvvm.redux.sync.min.s32

				out[i++] = __nvvm_redux_sync_max_s32(a, 0xF0);
				// CHECK: call i32 @llvm.nvvm.redux.sync.max.s32

				out[i++] = __nvvm_redux_sync_add_u32(b, 0x01);
				// CHECK: call i32 @llvm.nvvm.redux.sync.add.u32

				out[i++] = __nvvm_redux_sync_min_u32(b, 0xF0);
				// CHECK: call i32 @llvm.nvvm.redux.sync.min.u32

				out[i++] = __nvvm_redux_sync_max_u32(b, 0x0F);
				// CHECK: call i32 @llvm.nvvm.redux.sync.max.u32

				out[i++] = __nvvm_redux_sync_and_b32(a, 0xF0);
				// CHECK: call i32 @llvm.nvvm.redux.sync.and.b32

				out[i++] = __nvvm_redux_sync_and_b32(b, 0x0F);
				// CHECK: call i32 @llvm.nvvm.redux.sync.and.b32

				out[i++] = __nvvm_redux_sync_xor_b32(a, 0x10);
				// CHECK: call i32 @llvm.nvvm.redux.sync.xor.b32

				out[i++] = __nvvm_redux_sync_xor_b32(b, 0x01);
				// CHECK: call i32 @llvm.nvvm.redux.sync.xor.b32

				out[i++] = __nvvm_redux_sync_or_b32(a, 0xFF);
				// CHECK: call i32 @llvm.nvvm.redux.sync.or.b32

				out[i++] = __nvvm_redux_sync_or_b32(b, 0xFF);
				// CHECK: call i32 @llvm.nvvm.redux.sync.or.b32

				// CHECK: ret void
				}

llvm/include/llvm/IR/IntrinsicsNVVM.td

Show First 20 Lines • Show All 4,091 Lines • ▼ Show 20 Lines	def int_nvvm_match_all_sync_i32p :
Intrinsic<[llvm_i32_ty, llvm_i1_ty], [llvm_i32_ty, llvm_i32_ty],		Intrinsic<[llvm_i32_ty, llvm_i1_ty], [llvm_i32_ty, llvm_i32_ty],
[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.match.all.sync.i32p">;		[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.match.all.sync.i32p">;
// match.all.sync.b64p mask, value		// match.all.sync.b64p mask, value
def int_nvvm_match_all_sync_i64p :		def int_nvvm_match_all_sync_i64p :
Intrinsic<[llvm_i64_ty, llvm_i1_ty], [llvm_i32_ty, llvm_i64_ty],		Intrinsic<[llvm_i64_ty, llvm_i1_ty], [llvm_i32_ty, llvm_i64_ty],
[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.match.all.sync.i64p">;		[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.match.all.sync.i64p">;

//		//
		// REDUX.SYNC
		//
		// redux.sync.add.u32 dst, src, membermask;
		def int_nvvm_redux_sync_add_u32 : GCCBuiltin<"__nvvm_redux_sync_add_u32">,
		traUnsubmitted Done Reply Inline Actions This could also be consolidated into an overloaded intrinsic operating on `llvm_anyint_ty` tra: This could also be consolidated into an overloaded intrinsic operating on `llvm_anyint_ty`
		steffenlarsenAuthorUnsubmitted Done Reply Inline Actions I am having a look at other uses of this, but I'm having difficulty wrapping my head around how to map these overloads to the PTX instructions in llvm/lib/Target/NVPTX/NVPTXIntrinsics.td. Though it would be nice, it just seems overly complicated to get a signed and an unsigned 32-bit integer version of each of these intrinsics. steffenlarsen: I am having a look at other uses of this, but I'm having difficulty wrapping my head around how…
		traUnsubmitted Not Done Reply Inline Actions Considering that `redux` only supports 32-bit integers, we do not need to get into it. `llvm_i32_ty` will do for now. We'll cross the bridge if/when we get to support multiple integer types. tra: Considering that `redux` only supports 32-bit integers, we do not need to get into it.
		steffenlarsenAuthorUnsubmitted Done Reply Inline Actions Perfect, thank you! steffenlarsen: Perfect, thank you!
		Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
		[IntrConvergent, IntrNoMem]>;
		traUnsubmitted Done Reply Inline Actions Similar to `shfl`, these intrinsics probably need `IntrInaccessibleMemOnly` as they exchange data with other threads. tra: Similar to `shfl`, these intrinsics probably need `IntrInaccessibleMemOnly` as they exchange…

		// redux.sync.min.u32 dst, src, membermask;
		def int_nvvm_redux_sync_min_u32 : GCCBuiltin<"__nvvm_redux_sync_min_u32">,
		Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
		[IntrConvergent, IntrNoMem]>;

		// redux.sync.max.u32 dst, src, membermask;
		def int_nvvm_redux_sync_max_u32 : GCCBuiltin<"__nvvm_redux_sync_max_u32">,
		Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
		[IntrConvergent, IntrNoMem]>;

		// redux.sync.add.s32 dst, src, membermask;
		def int_nvvm_redux_sync_add_s32 : GCCBuiltin<"__nvvm_redux_sync_add_s32">,
		Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
		[IntrConvergent, IntrNoMem]>;

		// redux.sync.min.s32 dst, src, membermask;
		def int_nvvm_redux_sync_min_s32 : GCCBuiltin<"__nvvm_redux_sync_min_s32">,
		Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
		[IntrConvergent, IntrNoMem]>;

		// redux.sync.max.s32 dst, src, membermask;
		def int_nvvm_redux_sync_max_s32 : GCCBuiltin<"__nvvm_redux_sync_max_s32">,
		Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
		[IntrConvergent, IntrNoMem]>;

		// redux.sync.and.b32 dst, src, membermask;
		def int_nvvm_redux_sync_and_b32 : GCCBuiltin<"__nvvm_redux_sync_and_b32">,
		Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
		[IntrConvergent, IntrNoMem]>;

		// redux.sync.xor.b32 dst, src, membermask;
		def int_nvvm_redux_sync_xor_b32 : GCCBuiltin<"__nvvm_redux_sync_xor_b32">,
		Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
		[IntrConvergent, IntrNoMem]>;

		// redux.sync.or.b32 dst, src, membermask;
		def int_nvvm_redux_sync_or_b32 : GCCBuiltin<"__nvvm_redux_sync_or_b32">,
		Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
		[IntrConvergent, IntrNoMem]>;

		//
// WMMA instructions		// WMMA instructions
//		//
// WMMA.LOAD		// WMMA.LOAD
class NVVM_WMMA_LD<WMMA_REGS Frag, string Layout, int WithStride>		class NVVM_WMMA_LD<WMMA_REGS Frag, string Layout, int WithStride>
: Intrinsic<Frag.regs,		: Intrinsic<Frag.regs,
!if(WithStride, [llvm_anyptr_ty, llvm_i32_ty], [llvm_anyptr_ty]),		!if(WithStride, [llvm_anyptr_ty, llvm_i32_ty], [llvm_anyptr_ty]),
[IntrReadMem, IntrArgMemOnly, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>],		[IntrReadMem, IntrArgMemOnly, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>],
WMMA_NAME_LDST<"load", Frag, Layout, WithStride>.intr>;		WMMA_NAME_LDST<"load", Frag, Layout, WithStride>.intr>;
▲ Show 20 Lines • Show All 50 Lines • Show Last 20 Lines

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td

	Show First 20 Lines • Show All 138 Lines • ▼ Show 20 Lines

	def True : Predicate<"true">;			def True : Predicate<"true">;

	def hasPTX31 : Predicate<"Subtarget->getPTXVersion() >= 31">;			def hasPTX31 : Predicate<"Subtarget->getPTXVersion() >= 31">;
	def hasPTX60 : Predicate<"Subtarget->getPTXVersion() >= 60">;			def hasPTX60 : Predicate<"Subtarget->getPTXVersion() >= 60">;
	def hasPTX61 : Predicate<"Subtarget->getPTXVersion() >= 61">;			def hasPTX61 : Predicate<"Subtarget->getPTXVersion() >= 61">;
	def hasPTX63 : Predicate<"Subtarget->getPTXVersion() >= 63">;			def hasPTX63 : Predicate<"Subtarget->getPTXVersion() >= 63">;
	def hasPTX64 : Predicate<"Subtarget->getPTXVersion() >= 64">;			def hasPTX64 : Predicate<"Subtarget->getPTXVersion() >= 64">;
				def hasPTX70 : Predicate<"Subtarget->getPTXVersion() >= 70">;

	def hasSM30 : Predicate<"Subtarget->getSmVersion() >= 30">;			def hasSM30 : Predicate<"Subtarget->getSmVersion() >= 30">;
	def hasSM70 : Predicate<"Subtarget->getSmVersion() >= 70">;			def hasSM70 : Predicate<"Subtarget->getSmVersion() >= 70">;
	def hasSM72 : Predicate<"Subtarget->getSmVersion() >= 72">;			def hasSM72 : Predicate<"Subtarget->getSmVersion() >= 72">;
	def hasSM75 : Predicate<"Subtarget->getSmVersion() >= 75">;			def hasSM75 : Predicate<"Subtarget->getSmVersion() >= 75">;
				def hasSM80 : Predicate<"Subtarget->getSmVersion() >= 80">;

	// non-sync shfl instructions are not available on sm_70+ in PTX6.4+			// non-sync shfl instructions are not available on sm_70+ in PTX6.4+
	def hasSHFL : Predicate<"!(Subtarget->getSmVersion() >= 70"			def hasSHFL : Predicate<"!(Subtarget->getSmVersion() >= 70"
	"&& Subtarget->getPTXVersion() >= 64)">;			"&& Subtarget->getPTXVersion() >= 64)">;

	def useShortPtr : Predicate<"useShortPointers()">;			def useShortPtr : Predicate<"useShortPointers()">;
	def useFP16Math: Predicate<"Subtarget->allowFP16Math()">;			def useFP16Math: Predicate<"Subtarget->allowFP16Math()">;

	▲ Show 20 Lines • Show All 2,984 Lines • Show Last 20 Lines

llvm/lib/Target/NVPTX/NVPTXIntrinsics.td

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 268 Lines • ▼ Show 20 Lines	def rr : NVPTXInst<(outs regclass:$dest, Int1Regs:$pred),
[(set regclass:$dest, Int1Regs:$pred, (IntOp Int32Regs:$mask, regclass:$value))]>,		[(set regclass:$dest, Int1Regs:$pred, (IntOp Int32Regs:$mask, regclass:$value))]>,
Requires<[hasPTX60, hasSM70]>;		Requires<[hasPTX60, hasSM70]>;
}		}
defm MATCH_ALLP_SYNC_32 : MATCH_ALLP_SYNC<Int32Regs, "b32", int_nvvm_match_all_sync_i32p,		defm MATCH_ALLP_SYNC_32 : MATCH_ALLP_SYNC<Int32Regs, "b32", int_nvvm_match_all_sync_i32p,
i32imm>;		i32imm>;
defm MATCH_ALLP_SYNC_64 : MATCH_ALLP_SYNC<Int64Regs, "b64", int_nvvm_match_all_sync_i64p,		defm MATCH_ALLP_SYNC_64 : MATCH_ALLP_SYNC<Int64Regs, "b64", int_nvvm_match_all_sync_i64p,
i64imm>;		i64imm>;

		multiclass REDUX_SYNC<string BinOp, string PTXType, Intrinsic Intrin> {
		def : NVPTXInst<(outs Int32Regs:$dst), (ins Int32Regs:$src, Int32Regs:$mask),
		"redux.sync." # BinOp # "." # PTXType # " $dst, $src, $mask;",
		[(set Int32Regs:$dst, (Intrin Int32Regs:$src, Int32Regs:$mask))]>,
		Requires<[hasPTX70, hasSM80]>;
		}

		defm REDUX_SYNC_ADD_U32 : REDUX_SYNC<"add", "u32", int_nvvm_redux_sync_add_u32>;
		defm REDUX_SYNC_MIN_U32 : REDUX_SYNC<"min", "u32", int_nvvm_redux_sync_min_u32>;
		defm REDUX_SYNC_MAX_U32 : REDUX_SYNC<"max", "u32", int_nvvm_redux_sync_max_u32>;
		defm REDUX_SYNC_ADD_S32 : REDUX_SYNC<"add", "s32", int_nvvm_redux_sync_add_s32>;
		defm REDUX_SYNC_MIN_S32 : REDUX_SYNC<"min", "s32", int_nvvm_redux_sync_min_s32>;
		defm REDUX_SYNC_MAX_S32 : REDUX_SYNC<"max", "s32", int_nvvm_redux_sync_max_s32>;
		defm REDUX_SYNC_AND_B32 : REDUX_SYNC<"and", "b32", int_nvvm_redux_sync_and_b32>;
		defm REDUX_SYNC_XOR_B32 : REDUX_SYNC<"xor", "b32", int_nvvm_redux_sync_xor_b32>;
		defm REDUX_SYNC_OR_B32 : REDUX_SYNC<"or", "b32", int_nvvm_redux_sync_or_b32>;

} // isConvergent = true		} // isConvergent = true

//-----------------------------------		//-----------------------------------
// Explicit Memory Fence Functions		// Explicit Memory Fence Functions
//-----------------------------------		//-----------------------------------
class MEMBAR<string StrOp, Intrinsic IntOP> :		class MEMBAR<string StrOp, Intrinsic IntOP> :
NVPTXInst<(outs), (ins),		NVPTXInst<(outs), (ins),
StrOp, [(IntOP)]>;		StrOp, [(IntOP)]>;
▲ Show 20 Lines • Show All 7,328 Lines • Show Last 20 Lines

llvm/test/CodeGen/NVPTX/redux-sync.ll

This file was added.

				; RUN: llc < %s -march=nvptx64 -mcpu=sm_80 -mattr=+ptx70 \| FileCheck %s

				declare i32 @llvm.nvvm.redux.sync.add.u32(i32, i32)
				; CHECK-LABEL: .func{{.*}}redux_sync_add_u32
				define i32 @redux_sync_add_u32(i32 %src, i32 %mask) {
				; CHECK: redux.sync.add.u32
				%val = call i32 @llvm.nvvm.redux.sync.add.u32(i32 %src, i32 %mask)
				ret i32 %val
				}

				declare i32 @llvm.nvvm.redux.sync.min.u32(i32, i32)
				; CHECK-LABEL: .func{{.*}}redux_sync_min_u32
				define i32 @redux_sync_min_u32(i32 %src, i32 %mask) {
				; CHECK: redux.sync.min.u32
				%val = call i32 @llvm.nvvm.redux.sync.min.u32(i32 %src, i32 %mask)
				ret i32 %val
				}

				declare i32 @llvm.nvvm.redux.sync.max.u32(i32, i32)
				; CHECK-LABEL: .func{{.*}}redux_sync_max_u32
				define i32 @redux_sync_max_u32(i32 %src, i32 %mask) {
				; CHECK: redux.sync.max.u32
				%val = call i32 @llvm.nvvm.redux.sync.max.u32(i32 %src, i32 %mask)
				ret i32 %val
				}

				declare i32 @llvm.nvvm.redux.sync.add.s32(i32, i32)
				; CHECK-LABEL: .func{{.*}}redux_sync_add_s32
				define i32 @redux_sync_add_s32(i32 %src, i32 %mask) {
				; CHECK: redux.sync.add.s32
				%val = call i32 @llvm.nvvm.redux.sync.add.s32(i32 %src, i32 %mask)
				ret i32 %val
				}

				declare i32 @llvm.nvvm.redux.sync.min.s32(i32, i32)
				; CHECK-LABEL: .func{{.*}}redux_sync_min_s32
				define i32 @redux_sync_min_s32(i32 %src, i32 %mask) {
				; CHECK: redux.sync.min.s32
				%val = call i32 @llvm.nvvm.redux.sync.min.s32(i32 %src, i32 %mask)
				ret i32 %val
				}

				declare i32 @llvm.nvvm.redux.sync.max.s32(i32, i32)
				; CHECK-LABEL: .func{{.*}}redux_sync_max_s32
				define i32 @redux_sync_max_s32(i32 %src, i32 %mask) {
				; CHECK: redux.sync.max.s32
				%val = call i32 @llvm.nvvm.redux.sync.max.s32(i32 %src, i32 %mask)
				ret i32 %val
				}

				declare i32 @llvm.nvvm.redux.sync.and.b32(i32, i32)
				; CHECK-LABEL: .func{{.*}}redux_sync_and_b32
				define i32 @redux_sync_and_b32(i32 %src, i32 %mask) {
				; CHECK: redux.sync.and.b32
				%val = call i32 @llvm.nvvm.redux.sync.and.b32(i32 %src, i32 %mask)
				ret i32 %val
				}

				declare i32 @llvm.nvvm.redux.sync.xor.b32(i32, i32)
				; CHECK-LABEL: .func{{.*}}redux_sync_xor_b32
				define i32 @redux_sync_xor_b32(i32 %src, i32 %mask) {
				; CHECK: redux.sync.xor.b32
				%val = call i32 @llvm.nvvm.redux.sync.xor.b32(i32 %src, i32 %mask)
				ret i32 %val
				}

				declare i32 @llvm.nvvm.redux.sync.or.b32(i32, i32)
				; CHECK-LABEL: .func{{.*}}redux_sync_or_b32
				define i32 @redux_sync_or_b32(i32 %src, i32 %mask) {
				; CHECK: redux.sync.or.b32
				%val = call i32 @llvm.nvvm.redux.sync.or.b32(i32 %src, i32 %mask)
				ret i32 %val
				}

This is an archive of the discontinued LLVM Phabricator instance.

[Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX redux.sync instructionsClosedPublic

Details

Diff Detail

Event Timeline