This is an archive of the discontinued LLVM Phabricator instance.

Differential D39822

[NVPTX] Model (some) side effects of warp-synchronous data exchange intrinsics.
Closed · Public

Authored by tra on Nov 8 2017, 4:18 PM.

Details

Reviewers

jlebar

Summary

Exchanging data across threads in a warp does not access memory, but it does have side effects: each call reads and writes other threads' state.
Previously these intrinsics were marked IntrNoMem, which allowed identical calls to be CSE'd away (PR35249).

This patch marks all such intrinsics as IntrInaccessibleMemOnly, which prevents CSE.
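To illustrate why treating these calls as pure is wrong, here is a minimal host-side C++ model of vote.ballot on a 32-lane warp. It is a sketch under stated assumptions: ballot, CseOutcome, simulate_cse and check_cse_example are hypothetical names, not NVVM or CUDA APIs, and the active mask is passed explicitly instead of being tracked by hardware. It compares a ballot executed inside "if (lane < 16)" with what CSE would produce by reusing an identical call that dominates the branch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of one 32-lane warp; not an NVVM or CUDA API.
constexpr int kWarpSize = 32;
constexpr uint32_t kFullMask = 0xFFFFFFFFu;

// Models vote.ballot.b32: bit i of the result is set iff lane i is active
// and its predicate is true. Inactive lanes contribute 0.
uint32_t ballot(uint32_t active_mask, const bool (&pred)[kWarpSize]) {
  uint32_t result = 0;
  for (int lane = 0; lane < kWarpSize; ++lane)
    if (((active_mask >> lane) & 1u) != 0 && pred[lane])
      result |= 1u << lane;
  return result;
}

struct CseOutcome {
  uint32_t dominating;  // first call, executed by the converged warp
  uint32_t in_branch;   // correct result of the call inside the branch
  uint32_t after_cse;   // result if CSE reuses the dominating call instead
};

// Models the pattern:
//   m1 = ballot(true);                 // whole warp converged
//   if (lane < 16) m2 = ballot(true);  // identical call, divergent branch
// A pass that treats ballot as IntrNoMem may rewrite m2 to m1.
CseOutcome simulate_cse() {
  bool all_true[kWarpSize];
  for (bool &p : all_true) p = true;

  const uint32_t branch_mask = 0x0000FFFFu;  // lanes 0..15 take the branch
  CseOutcome out;
  out.dominating = ballot(kFullMask, all_true);
  out.in_branch = ballot(branch_mask, all_true);
  out.after_cse = out.dominating;  // what the CSE'd program would observe
  return out;
}

// Usage check: the CSE'd value disagrees with the real in-branch result.
void check_cse_example() {
  const CseOutcome r = simulate_cse();
  assert(r.dominating == kFullMask);
  assert(r.in_branch == 0x0000FFFFu);
  assert(r.after_cse != r.in_branch);
}
```

In this model the in-branch call yields 0x0000FFFF, while the CSE'd program would see 0xFFFFFFFF. Marking the intrinsic as reading and writing (inaccessible) memory stops a pass from assuming the two calls return the same value.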

That only fixes part of the problem, though.
A call to @llvm.nvvm.vote.ballot(i1 true) returns the mask of active threads, so it can effectively observe preceding branching decisions. Its behavior for inactive threads is also specified, and on pre-sm_70 GPUs it is not required to execute in a non-divergent context. If two identical calls were hoisted out of the branches of a divergent if, the result returned by the call would change. More generally, the same problem affects any transformation that moves a call to this intrinsic across a divergent conditional branch. It is not yet clear how best to deal with this.
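The hoisting hazard can be sketched with the same kind of host-side model. The assumptions are a 32-lane warp, a ballot helper that zeroes inactive lanes, and the branch condition lane < 16. run_divergent computes per-lane results of the original code; run_hoisted computes them after the two identical calls are merged above the branch. All names here are hypothetical illustrations, not NVVM or CUDA APIs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of one 32-lane warp; not an NVVM or CUDA API.
constexpr int kWarpSize = 32;
constexpr uint32_t kFullMask = 0xFFFFFFFFu;
using LaneBools = std::array<bool, kWarpSize>;
using LaneWords = std::array<uint32_t, kWarpSize>;

// Models vote.ballot.b32: bit i is set iff lane i is active and its
// predicate is true; inactive lanes contribute 0.
uint32_t ballot(uint32_t active_mask, const LaneBools &pred) {
  uint32_t result = 0;
  for (int lane = 0; lane < kWarpSize; ++lane)
    if (((active_mask >> lane) & 1u) != 0 && pred[lane])
      result |= 1u << lane;
  return result;
}

// Every lane votes true.
LaneBools all_true() {
  LaneBools p;
  p.fill(true);
  return p;
}

// As written:  if (lane < 16) r = ballot(p); else r = ballot(p);
// Each side of the divergent branch runs with only its own lanes active.
LaneWords run_divergent(const LaneBools &pred) {
  const uint32_t then_mask = 0x0000FFFFu;             // lanes 0..15
  const uint32_t else_mask = kFullMask & ~then_mask;  // lanes 16..31
  assert((then_mask & else_mask) == 0 && (then_mask | else_mask) == kFullMask);
  const uint32_t then_result = ballot(then_mask, pred);
  const uint32_t else_result = ballot(else_mask, pred);
  LaneWords r{};
  for (int lane = 0; lane < kWarpSize; ++lane)
    r[lane] = ((then_mask >> lane) & 1u) != 0 ? then_result : else_result;
  return r;
}

// After hoisting the two identical calls above the branch: a single call
// executed while the whole warp is still converged.
LaneWords run_hoisted(const LaneBools &pred) {
  LaneWords r{};
  r.fill(ballot(kFullMask, pred));
  return r;
}
```

With all predicates true, lane 0 sees 0x0000FFFF and lane 31 sees 0xFFFF0000 in the original code, but both see 0xFFFFFFFF after hoisting. According to the summary above, this case is not addressed by the patch.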

Diff Detail

Build Status

Buildable 11983
Build 11983: arc lint + arc unit

Event Timeline

tra created this revision. Nov 8 2017, 4:18 PM

Herald added subscribers: hiraditya, sanjoy, jholewinski. Nov 8 2017, 4:18 PM

In the commit message, did you mean CSE (Common Subexpression Elimination) instead of CSI?

In D39822#919996, @sanjoy wrote:

In the commit message, did you mean CSE (Common Subexpression Elimination) instead of CSI?

Yes. Fixed. I blame it on TV. :-)

LGTM, but can we expand on the limitations of this in the commit message, and/or point to the bug?

This revision is now accepted and ready to land. Nov 8 2017, 6:03 PM

I was not sure if the *_sync intrinsics required preventing CSE since these intrinsics capture all state as arguments (lanes in a warp to sync as an argument). However, on Volta, I think different lanes in a warp can execute the intrinsic from different syntactic locations (i.e., different program counters). If true, then we do indeed have to model the data exchanged.

tra edited the summary of this revision. Nov 9 2017, 9:43 AM

In D39822#920550, @arpith-jacob wrote:

I was not sure if the *_sync intrinsics required preventing CSE since these intrinsics capture all state as arguments (lanes in a warp to sync as an argument). However, on Volta, I think different lanes in a warp can execute the intrinsic from different syntactic locations (i.e., different program counters). If true, then we do indeed have to model the data exchanged.

The PTX spec says: "wait until all non-exited threads corresponding to membermask have executed vote.sync with the same qualifiers and same membermask value", followed by a caveat: "For .target sm_6x or below, all threads in membermask must execute the same vote.sync instruction in convergence, and only threads belonging to some membermask can be active when the vote.sync instruction is executed. Otherwise, the behavior is undefined."

My reading of this matches yours: the requirement to execute the same instruction in convergence does not apply to sm_70.

Landed in r318173

Revision Contents

Path | Size
llvm/include/llvm/IR/IntrinsicsNVVM.td | 56 lines
llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp | 21 lines
llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp | 10 lines

Diff 122174

llvm/include/llvm/IR/IntrinsicsNVVM.td

	//			//
	// SHUFFLE			// SHUFFLE
	//			//

	// shfl.down.b32 dest, val, offset, mask_and_clamp			// shfl.down.b32 dest, val, offset, mask_and_clamp
	def int_nvvm_shfl_down_i32 :			def int_nvvm_shfl_down_i32 :
	Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.down.i32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.down.i32">,
	GCCBuiltin<"__nvvm_shfl_down_i32">;			GCCBuiltin<"__nvvm_shfl_down_i32">;
	def int_nvvm_shfl_down_f32 :			def int_nvvm_shfl_down_f32 :
	Intrinsic<[llvm_float_ty], [llvm_float_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_float_ty], [llvm_float_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.down.f32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.down.f32">,
	GCCBuiltin<"__nvvm_shfl_down_f32">;			GCCBuiltin<"__nvvm_shfl_down_f32">;

	// shfl.up.b32 dest, val, offset, mask_and_clamp			// shfl.up.b32 dest, val, offset, mask_and_clamp
	def int_nvvm_shfl_up_i32 :			def int_nvvm_shfl_up_i32 :
	Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.up.i32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.up.i32">,
	GCCBuiltin<"__nvvm_shfl_up_i32">;			GCCBuiltin<"__nvvm_shfl_up_i32">;
	def int_nvvm_shfl_up_f32 :			def int_nvvm_shfl_up_f32 :
	Intrinsic<[llvm_float_ty], [llvm_float_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_float_ty], [llvm_float_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.up.f32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.up.f32">,
	GCCBuiltin<"__nvvm_shfl_up_f32">;			GCCBuiltin<"__nvvm_shfl_up_f32">;

	// shfl.bfly.b32 dest, val, offset, mask_and_clamp			// shfl.bfly.b32 dest, val, offset, mask_and_clamp
	def int_nvvm_shfl_bfly_i32 :			def int_nvvm_shfl_bfly_i32 :
	Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.bfly.i32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.bfly.i32">,
	GCCBuiltin<"__nvvm_shfl_bfly_i32">;			GCCBuiltin<"__nvvm_shfl_bfly_i32">;
	def int_nvvm_shfl_bfly_f32 :			def int_nvvm_shfl_bfly_f32 :
	Intrinsic<[llvm_float_ty], [llvm_float_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_float_ty], [llvm_float_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.bfly.f32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.bfly.f32">,
	GCCBuiltin<"__nvvm_shfl_bfly_f32">;			GCCBuiltin<"__nvvm_shfl_bfly_f32">;

	// shfl.idx.b32 dest, val, lane, mask_and_clamp			// shfl.idx.b32 dest, val, lane, mask_and_clamp
	def int_nvvm_shfl_idx_i32 :			def int_nvvm_shfl_idx_i32 :
	Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.idx.i32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.idx.i32">,
	GCCBuiltin<"__nvvm_shfl_idx_i32">;			GCCBuiltin<"__nvvm_shfl_idx_i32">;
	def int_nvvm_shfl_idx_f32 :			def int_nvvm_shfl_idx_f32 :
	Intrinsic<[llvm_float_ty], [llvm_float_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_float_ty], [llvm_float_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.idx.f32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.idx.f32">,
	GCCBuiltin<"__nvvm_shfl_idx_f32">;			GCCBuiltin<"__nvvm_shfl_idx_f32">;

	// Synchronizing shfl variants available in CUDA-9.			// Synchronizing shfl variants available in CUDA-9.
	// On sm_70 these don't have to be convergent, so we may eventually want to			// On sm_70 these don't have to be convergent, so we may eventually want to
	// implement non-convergent variant of this intrinsic.			// implement non-convergent variant of this intrinsic.

	// shfl.sync.down.b32 dest, threadmask, val, offset , mask_and_clamp			// shfl.sync.down.b32 dest, threadmask, val, offset , mask_and_clamp
	def int_nvvm_shfl_sync_down_i32 :			def int_nvvm_shfl_sync_down_i32 :
	Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.sync.down.i32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.sync.down.i32">,
	GCCBuiltin<"__nvvm_shfl_sync_down_i32">;			GCCBuiltin<"__nvvm_shfl_sync_down_i32">;
	def int_nvvm_shfl_sync_down_f32 :			def int_nvvm_shfl_sync_down_f32 :
	Intrinsic<[llvm_float_ty], [llvm_i32_ty, llvm_float_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_float_ty], [llvm_i32_ty, llvm_float_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.sync.down.f32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.sync.down.f32">,
	GCCBuiltin<"__nvvm_shfl_sync_down_f32">;			GCCBuiltin<"__nvvm_shfl_sync_down_f32">;

	// shfl.sync.up.b32 dest, threadmask, val, offset, mask_and_clamp			// shfl.sync.up.b32 dest, threadmask, val, offset, mask_and_clamp
	def int_nvvm_shfl_sync_up_i32 :			def int_nvvm_shfl_sync_up_i32 :
	Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.sync.up.i32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.sync.up.i32">,
	GCCBuiltin<"__nvvm_shfl_sync_up_i32">;			GCCBuiltin<"__nvvm_shfl_sync_up_i32">;
	def int_nvvm_shfl_sync_up_f32 :			def int_nvvm_shfl_sync_up_f32 :
	Intrinsic<[llvm_float_ty], [llvm_i32_ty, llvm_float_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_float_ty], [llvm_i32_ty, llvm_float_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.sync.up.f32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.sync.up.f32">,
	GCCBuiltin<"__nvvm_shfl_sync_up_f32">;			GCCBuiltin<"__nvvm_shfl_sync_up_f32">;

	// shfl.sync.bfly.b32 dest, threadmask, val, offset, mask_and_clamp			// shfl.sync.bfly.b32 dest, threadmask, val, offset, mask_and_clamp
	def int_nvvm_shfl_sync_bfly_i32 :			def int_nvvm_shfl_sync_bfly_i32 :
	Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.sync.bfly.i32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.sync.bfly.i32">,
	GCCBuiltin<"__nvvm_shfl_sync_bfly_i32">;			GCCBuiltin<"__nvvm_shfl_sync_bfly_i32">;
	def int_nvvm_shfl_sync_bfly_f32 :			def int_nvvm_shfl_sync_bfly_f32 :
	Intrinsic<[llvm_float_ty], [llvm_i32_ty, llvm_float_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_float_ty], [llvm_i32_ty, llvm_float_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.sync.bfly.f32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.sync.bfly.f32">,
	GCCBuiltin<"__nvvm_shfl_sync_bfly_f32">;			GCCBuiltin<"__nvvm_shfl_sync_bfly_f32">;

	// shfl.sync.idx.b32 dest, threadmask, val, lane, mask_and_clamp			// shfl.sync.idx.b32 dest, threadmask, val, lane, mask_and_clamp
	def int_nvvm_shfl_sync_idx_i32 :			def int_nvvm_shfl_sync_idx_i32 :
	Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.sync.idx.i32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.sync.idx.i32">,
	GCCBuiltin<"__nvvm_shfl_sync_idx_i32">;			GCCBuiltin<"__nvvm_shfl_sync_idx_i32">;
	def int_nvvm_shfl_sync_idx_f32 :			def int_nvvm_shfl_sync_idx_f32 :
	Intrinsic<[llvm_float_ty], [llvm_i32_ty, llvm_float_ty, llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_float_ty], [llvm_i32_ty, llvm_float_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.shfl.sync.idx.f32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.shfl.sync.idx.f32">,
	GCCBuiltin<"__nvvm_shfl_sync_idx_f32">;			GCCBuiltin<"__nvvm_shfl_sync_idx_f32">;

	//			//
	// VOTE			// VOTE
	//			//

	// vote.all pred			// vote.all pred
	def int_nvvm_vote_all :			def int_nvvm_vote_all :
	Intrinsic<[llvm_i1_ty], [llvm_i1_ty],			Intrinsic<[llvm_i1_ty], [llvm_i1_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.vote.all">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.vote.all">,
	GCCBuiltin<"__nvvm_vote_all">;			GCCBuiltin<"__nvvm_vote_all">;
	// vote.any pred			// vote.any pred
	def int_nvvm_vote_any :			def int_nvvm_vote_any :
	Intrinsic<[llvm_i1_ty], [llvm_i1_ty],			Intrinsic<[llvm_i1_ty], [llvm_i1_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.vote.any">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.vote.any">,
	GCCBuiltin<"__nvvm_vote_any">;			GCCBuiltin<"__nvvm_vote_any">;
	// vote.uni pred			// vote.uni pred
	def int_nvvm_vote_uni :			def int_nvvm_vote_uni :
	Intrinsic<[llvm_i1_ty], [llvm_i1_ty],			Intrinsic<[llvm_i1_ty], [llvm_i1_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.vote.uni">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.vote.uni">,
	GCCBuiltin<"__nvvm_vote_uni">;			GCCBuiltin<"__nvvm_vote_uni">;
	// vote.ballot pred			// vote.ballot pred
	def int_nvvm_vote_ballot :			def int_nvvm_vote_ballot :
	Intrinsic<[llvm_i32_ty], [llvm_i1_ty],			Intrinsic<[llvm_i32_ty], [llvm_i1_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.vote.ballot">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.vote.ballot">,
	GCCBuiltin<"__nvvm_vote_ballot">;			GCCBuiltin<"__nvvm_vote_ballot">;

	//			//
	// VOTE.SYNC			// VOTE.SYNC
	//			//

	// vote.sync.all mask, pred			// vote.sync.all mask, pred
	def int_nvvm_vote_all_sync :			def int_nvvm_vote_all_sync :
	Intrinsic<[llvm_i1_ty], [llvm_i32_ty, llvm_i1_ty],			Intrinsic<[llvm_i1_ty], [llvm_i32_ty, llvm_i1_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.vote.all.sync">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.vote.all.sync">,
	GCCBuiltin<"__nvvm_vote_all_sync">;			GCCBuiltin<"__nvvm_vote_all_sync">;
	// vote.sync.any mask, pred			// vote.sync.any mask, pred
	def int_nvvm_vote_any_sync :			def int_nvvm_vote_any_sync :
	Intrinsic<[llvm_i1_ty], [llvm_i32_ty, llvm_i1_ty],			Intrinsic<[llvm_i1_ty], [llvm_i32_ty, llvm_i1_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.vote.any.sync">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.vote.any.sync">,
	GCCBuiltin<"__nvvm_vote_any_sync">;			GCCBuiltin<"__nvvm_vote_any_sync">;
	// vote.sync.uni mask, pred			// vote.sync.uni mask, pred
	def int_nvvm_vote_uni_sync :			def int_nvvm_vote_uni_sync :
	Intrinsic<[llvm_i1_ty], [llvm_i32_ty, llvm_i1_ty],			Intrinsic<[llvm_i1_ty], [llvm_i32_ty, llvm_i1_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.vote.uni.sync">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.vote.uni.sync">,
	GCCBuiltin<"__nvvm_vote_uni_sync">;			GCCBuiltin<"__nvvm_vote_uni_sync">;
	// vote.sync.ballot mask, pred			// vote.sync.ballot mask, pred
	def int_nvvm_vote_ballot_sync :			def int_nvvm_vote_ballot_sync :
	Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i1_ty],			Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i1_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.vote.ballot.sync">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.vote.ballot.sync">,
	GCCBuiltin<"__nvvm_vote_ballot_sync">;			GCCBuiltin<"__nvvm_vote_ballot_sync">;

	//			//
	// MATCH.SYNC			// MATCH.SYNC
	//			//
	// match.any.sync.b32 mask, value			// match.any.sync.b32 mask, value
	def int_nvvm_match_any_sync_i32 :			def int_nvvm_match_any_sync_i32 :
	Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.match.any.sync.i32">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.match.any.sync.i32">,
	GCCBuiltin<"__nvvm_match_any_sync_i32">;			GCCBuiltin<"__nvvm_match_any_sync_i32">;
	// match.any.sync.b64 mask, value			// match.any.sync.b64 mask, value
	def int_nvvm_match_any_sync_i64 :			def int_nvvm_match_any_sync_i64 :
	Intrinsic<[llvm_i64_ty], [llvm_i32_ty, llvm_i64_ty],			Intrinsic<[llvm_i64_ty], [llvm_i32_ty, llvm_i64_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.match.any.sync.i64">,			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.match.any.sync.i64">,
	GCCBuiltin<"__nvvm_match_any_sync_i64">;			GCCBuiltin<"__nvvm_match_any_sync_i64">;

	// match.all instruction have two variants -- one returns a single value, another			// match.all instruction have two variants -- one returns a single value, another
	// returns a pair {value, predicate}. We currently only implement the latter as			// returns a pair {value, predicate}. We currently only implement the latter as
	// that's the variant exposed by CUDA API.			// that's the variant exposed by CUDA API.

	// match.all.sync.b32p mask, value			// match.all.sync.b32p mask, value
	def int_nvvm_match_all_sync_i32p :			def int_nvvm_match_all_sync_i32p :
	Intrinsic<[llvm_i32_ty, llvm_i1_ty], [llvm_i32_ty, llvm_i32_ty],			Intrinsic<[llvm_i32_ty, llvm_i1_ty], [llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.match.all.sync.i32p">;			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.match.all.sync.i32p">;
	// match.all.sync.b64p mask, value			// match.all.sync.b64p mask, value
	def int_nvvm_match_all_sync_i64p :			def int_nvvm_match_all_sync_i64p :
	Intrinsic<[llvm_i64_ty, llvm_i1_ty], [llvm_i32_ty, llvm_i64_ty],			Intrinsic<[llvm_i64_ty, llvm_i1_ty], [llvm_i32_ty, llvm_i64_ty],
	[IntrNoMem, IntrConvergent], "llvm.nvvm.match.all.sync.i64p">;			[IntrInaccessibleMemOnly, IntrConvergent], "llvm.nvvm.match.all.sync.i64p">;

	//			//
	// WMMA instructions			// WMMA instructions
	//			//

	// WMMA.LOAD			// WMMA.LOAD
	class NVVM_WMMA_LD_ALSTS<string Abc, string Layout, string Space,			class NVVM_WMMA_LD_ALSTS<string Abc, string Layout, string Space,
	string Type, LLVMType regty, int WithStride>			string Type, LLVMType regty, int WithStride>
	...

llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp

bool NVPTXDAGToDAGISel::tryIntrinsicChain(SDNode *N) {		bool NVPTXDAGToDAGISel::tryIntrinsicChain(SDNode *N) {
unsigned IID = cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();		unsigned IID = cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();
if (getWmmaLdStOpcode(IID))		if (getWmmaLdStOpcode(IID))
return tryWMMA_LDST(N);		return tryWMMA_LDST(N);

switch (IID) {		switch (IID) {
default:		default:
return false;		return false;
		case Intrinsic::nvvm_match_all_sync_i32p:
		case Intrinsic::nvvm_match_all_sync_i64p:
		SelectMatchAll(N);
		return true;
case Intrinsic::nvvm_ldg_global_f:		case Intrinsic::nvvm_ldg_global_f:
case Intrinsic::nvvm_ldg_global_i:		case Intrinsic::nvvm_ldg_global_i:
case Intrinsic::nvvm_ldg_global_p:		case Intrinsic::nvvm_ldg_global_p:
case Intrinsic::nvvm_ldu_global_f:		case Intrinsic::nvvm_ldu_global_f:
case Intrinsic::nvvm_ldu_global_i:		case Intrinsic::nvvm_ldu_global_i:
case Intrinsic::nvvm_ldu_global_p:		case Intrinsic::nvvm_ldu_global_p:
return tryLDGLDU(N);		return tryLDGLDU(N);
}		}
...
bool NVPTXDAGToDAGISel::tryIntrinsicNoChain(SDNode *N) {		bool NVPTXDAGToDAGISel::tryIntrinsicNoChain(SDNode *N) {
unsigned IID = cast<ConstantSDNode>(N->getOperand(0))->getZExtValue();		unsigned IID = cast<ConstantSDNode>(N->getOperand(0))->getZExtValue();
switch (IID) {		switch (IID) {
default:		default:
return false;		return false;
case Intrinsic::nvvm_texsurf_handle_internal:		case Intrinsic::nvvm_texsurf_handle_internal:
SelectTexSurfHandle(N);		SelectTexSurfHandle(N);
return true;		return true;
case Intrinsic::nvvm_match_all_sync_i32p:
case Intrinsic::nvvm_match_all_sync_i64p:
SelectMatchAll(N);
return true;
case Intrinsic::nvvm_wmma_mma_sync_col_col_f16_f16:		case Intrinsic::nvvm_wmma_mma_sync_col_col_f16_f16:
case Intrinsic::nvvm_wmma_mma_sync_col_col_f16_f16_satfinite:		case Intrinsic::nvvm_wmma_mma_sync_col_col_f16_f16_satfinite:
case Intrinsic::nvvm_wmma_mma_sync_col_col_f16_f32:		case Intrinsic::nvvm_wmma_mma_sync_col_col_f16_f32:
case Intrinsic::nvvm_wmma_mma_sync_col_col_f16_f32_satfinite:		case Intrinsic::nvvm_wmma_mma_sync_col_col_f16_f32_satfinite:
case Intrinsic::nvvm_wmma_mma_sync_col_col_f32_f16:		case Intrinsic::nvvm_wmma_mma_sync_col_col_f32_f16:
case Intrinsic::nvvm_wmma_mma_sync_col_col_f32_f16_satfinite:		case Intrinsic::nvvm_wmma_mma_sync_col_col_f32_f16_satfinite:
case Intrinsic::nvvm_wmma_mma_sync_col_col_f32_f32:		case Intrinsic::nvvm_wmma_mma_sync_col_col_f32_f32:
case Intrinsic::nvvm_wmma_mma_sync_col_col_f32_f32_satfinite:		case Intrinsic::nvvm_wmma_mma_sync_col_col_f32_f32_satfinite:
...
void NVPTXDAGToDAGISel::SelectTexSurfHandle(SDNode *N) {		void NVPTXDAGToDAGISel::SelectTexSurfHandle(SDNode *N) {
SDValue Wrapper = N->getOperand(1);		SDValue Wrapper = N->getOperand(1);
SDValue GlobalVal = Wrapper.getOperand(0);		SDValue GlobalVal = Wrapper.getOperand(0);
ReplaceNode(N, CurDAG->getMachineNode(NVPTX::texsurf_handles, SDLoc(N),		ReplaceNode(N, CurDAG->getMachineNode(NVPTX::texsurf_handles, SDLoc(N),
MVT::i64, GlobalVal));		MVT::i64, GlobalVal));
}		}

void NVPTXDAGToDAGISel::SelectMatchAll(SDNode *N) {		void NVPTXDAGToDAGISel::SelectMatchAll(SDNode *N) {
SDLoc DL(N);		SDLoc DL(N);
		SDValue Chain = N->getOperand(0);
enum { IS_I64 = 4, HAS_CONST_VALUE = 2, HAS_CONST_MASK = 1 };		enum { IS_I64 = 4, HAS_CONST_VALUE = 2, HAS_CONST_MASK = 1 };
unsigned IID = cast<ConstantSDNode>(N->getOperand(0))->getZExtValue();		unsigned IID = cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();
unsigned OpcodeIndex =		unsigned OpcodeIndex =
(IID == Intrinsic::nvvm_match_all_sync_i64p) ? IS_I64 : 0;		(IID == Intrinsic::nvvm_match_all_sync_i64p) ? IS_I64 : 0;
SDValue MaskOp = N->getOperand(1);		SDValue MaskOp = N->getOperand(2);
SDValue ValueOp = N->getOperand(2);		SDValue ValueOp = N->getOperand(3);
if (ConstantSDNode *ValueConst = dyn_cast<ConstantSDNode>(ValueOp)) {		if (ConstantSDNode *ValueConst = dyn_cast<ConstantSDNode>(ValueOp)) {
OpcodeIndex |= HAS_CONST_VALUE;		OpcodeIndex |= HAS_CONST_VALUE;
ValueOp = CurDAG->getTargetConstant(ValueConst->getZExtValue(), DL,		ValueOp = CurDAG->getTargetConstant(ValueConst->getZExtValue(), DL,
ValueConst->getValueType(0));		ValueConst->getValueType(0));
}		}
if (ConstantSDNode *MaskConst = dyn_cast<ConstantSDNode>(MaskOp)) {		if (ConstantSDNode *MaskConst = dyn_cast<ConstantSDNode>(MaskOp)) {
OpcodeIndex |= HAS_CONST_MASK;		OpcodeIndex |= HAS_CONST_MASK;
MaskOp = CurDAG->getTargetConstant(MaskConst->getZExtValue(), DL,		MaskOp = CurDAG->getTargetConstant(MaskConst->getZExtValue(), DL,
MaskConst->getValueType(0));		MaskConst->getValueType(0));
}		}
// Maps {IS_I64, HAS_CONST_VALUE, HAS_CONST_MASK} -> opcode		// Maps {IS_I64, HAS_CONST_VALUE, HAS_CONST_MASK} -> opcode
unsigned Opcodes[8] = {		unsigned Opcodes[8] = {
NVPTX::MATCH_ALLP_SYNC_32rr, NVPTX::MATCH_ALLP_SYNC_32ri,		NVPTX::MATCH_ALLP_SYNC_32rr, NVPTX::MATCH_ALLP_SYNC_32ri,
NVPTX::MATCH_ALLP_SYNC_32ir, NVPTX::MATCH_ALLP_SYNC_32ii,		NVPTX::MATCH_ALLP_SYNC_32ir, NVPTX::MATCH_ALLP_SYNC_32ii,
NVPTX::MATCH_ALLP_SYNC_64rr, NVPTX::MATCH_ALLP_SYNC_64ri,		NVPTX::MATCH_ALLP_SYNC_64rr, NVPTX::MATCH_ALLP_SYNC_64ri,
NVPTX::MATCH_ALLP_SYNC_64ir, NVPTX::MATCH_ALLP_SYNC_64ii};		NVPTX::MATCH_ALLP_SYNC_64ir, NVPTX::MATCH_ALLP_SYNC_64ii};
SDNode *NewNode = CurDAG->getMachineNode(Opcodes[OpcodeIndex], DL,		SDNode *NewNode = CurDAG->getMachineNode(
{ValueOp->getValueType(0), MVT::i1},		Opcodes[OpcodeIndex], DL, {ValueOp->getValueType(0), MVT::i1, MVT::Other},
{MaskOp, ValueOp});		{MaskOp, ValueOp});
ReplaceNode(N, NewNode);		ReplaceNode(N, NewNode);
}		}

void NVPTXDAGToDAGISel::SelectAddrSpaceCast(SDNode *N) {		void NVPTXDAGToDAGISel::SelectAddrSpaceCast(SDNode *N) {
SDValue Src = N->getOperand(0);		SDValue Src = N->getOperand(0);
AddrSpaceCastSDNode *CastN = cast<AddrSpaceCastSDNode>(N);		AddrSpaceCastSDNode *CastN = cast<AddrSpaceCastSDNode>(N);
unsigned SrcAddrSpace = CastN->getSrcAddressSpace();		unsigned SrcAddrSpace = CastN->getSrcAddressSpace();
unsigned DstAddrSpace = CastN->getDestAddressSpace();		unsigned DstAddrSpace = CastN->getDestAddressSpace();
...
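SelectMatchAll picks one of eight opcodes by OR-ing three flag bits into OpcodeIndex. The host-side sketch below reproduces that lookup so the mapping can be checked by hand. match_all_suffix is a hypothetical helper, and the opcode names are reduced to their suffixes. Reading the table, the first letter of each suffix corresponds to the value operand and the second to the mask operand (r for register, i for immediate): index 1 (HAS_CONST_MASK only) selects 32ri, and index 2 (HAS_CONST_VALUE only) selects 32ir.

```cpp
#include <cassert>
#include <string>

// Flag values copied from the enum in SelectMatchAll.
enum : unsigned { IS_I64 = 4, HAS_CONST_VALUE = 2, HAS_CONST_MASK = 1 };

// Hypothetical helper mirroring the Opcodes[8] table in SelectMatchAll:
// returns the suffix of the MATCH_ALLP_SYNC_* opcode chosen for a given
// combination of operand width and constant operands.
std::string match_all_suffix(bool is_i64, bool const_value, bool const_mask) {
  static const char *const kSuffixes[8] = {"32rr", "32ri", "32ir", "32ii",
                                           "64rr", "64ri", "64ir", "64ii"};
  const unsigned index = (is_i64 ? IS_I64 : 0u) |
                         (const_value ? HAS_CONST_VALUE : 0u) |
                         (const_mask ? HAS_CONST_MASK : 0u);
  assert(index < 8);  // three flag bits can only encode 0..7
  return kSuffixes[index];
}
```

For example, a 64-bit match.all with a constant value and a register mask gives index 4 | 2 = 6, which selects MATCH_ALLP_SYNC_64ir.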

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

	// because we need the information that is only available in the "Value" type			// because we need the information that is only available in the "Value" type
	// of destination			// of destination
	// pointer. In particular, the address space information.			// pointer. In particular, the address space information.
	bool NVPTXTargetLowering::getTgtMemIntrinsic(			bool NVPTXTargetLowering::getTgtMemIntrinsic(
	IntrinsicInfo &Info, const CallInst &I, unsigned Intrinsic) const {			IntrinsicInfo &Info, const CallInst &I, unsigned Intrinsic) const {
	switch (Intrinsic) {			switch (Intrinsic) {
	default:			default:
	return false;			return false;
				case Intrinsic::nvvm_match_all_sync_i32p:
				case Intrinsic::nvvm_match_all_sync_i64p:
				Info.opc = ISD::INTRINSIC_W_CHAIN;
				// memVT is bogus. These intrinsics have IntrInaccessibleMemOnly attribute
				// in order to model data exchange with other threads, but perform no real
				// memory accesses.
				Info.memVT = MVT::i1;
				Info.readMem = true; // Our result depends on other thread's arguments.
				Info.writeMem = true; // Other threads depend on our thread's argument.
				return true;
	case Intrinsic::nvvm_wmma_load_a_f16_col:			case Intrinsic::nvvm_wmma_load_a_f16_col:
	case Intrinsic::nvvm_wmma_load_a_f16_row:			case Intrinsic::nvvm_wmma_load_a_f16_row:
	case Intrinsic::nvvm_wmma_load_a_f16_col_stride:			case Intrinsic::nvvm_wmma_load_a_f16_col_stride:
	case Intrinsic::nvvm_wmma_load_a_f16_row_stride:			case Intrinsic::nvvm_wmma_load_a_f16_row_stride:
	case Intrinsic::nvvm_wmma_load_a_f16_col_shared:			case Intrinsic::nvvm_wmma_load_a_f16_col_shared:
	case Intrinsic::nvvm_wmma_load_a_f16_row_shared:			case Intrinsic::nvvm_wmma_load_a_f16_row_shared:
	case Intrinsic::nvvm_wmma_load_a_f16_col_shared_stride:			case Intrinsic::nvvm_wmma_load_a_f16_col_shared_stride:
	case Intrinsic::nvvm_wmma_load_a_f16_row_shared_stride:			case Intrinsic::nvvm_wmma_load_a_f16_row_shared_stride:
	...