This is an archive of the discontinued LLVM Phabricator instance.

Differential D31762

AMDGPU: Add new amdgcn.init.exec intrinsics
ClosedPublic

Authored by mareko on Apr 6 2017, 7:37 AM.

Details

Reviewers

nhaehnle

Commits

rG2d82590f640e: AMDGPU: Add new amdgcn.init.exec intrinsics
rL301677: AMDGPU: Add new amdgcn.init.exec intrinsics

Diff Detail

Build Status

Buildable 5568
Build 5568: arc lint + arc unit

Event Timeline

mareko created this revision. Apr 6 2017, 7:37 AM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Apr 6 2017, 7:37 AM

arsenm added inline comments. Apr 6 2017, 9:47 AM

include/llvm/IR/IntrinsicsAMDGPU.td
117–121	Why can't you emit this sequence and feed that into the first intrinsic?
lib/Target/AMDGPU/SIInstructions.td
290	SSrc_b64:$src0
test/CodeGen/AMDGPU/set-initial-exec.ll
1 ↗	(On Diff #94369)	s/CHECK/GCN/

mareko added inline comments. Apr 6 2017, 10:06 AM

include/llvm/IR/IntrinsicsAMDGPU.td
117–121	There are several reasons:
- It's easier this way, because the custom inserter only has to move the COPY opcode to the beginning instead of the whole expression.
- LLVM can't select S_BFM_B64.
- LLVM likely can't select S_CMP_EQ_U32 in this case.
- LLVM can't select S_CMOV_B64.
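
The first reason above (only the COPY has to be moved to the front of the block) can be illustrated with a small host-side model. This is only a sketch in standard C++, not LLVM code: `Instr` and `hoistInputCopy` are hypothetical names, a `std::list` stands in for the machine basic block, and unlike the patch the search covers the whole block rather than stopping at the pseudo instruction.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <list>
#include <string>

// Hypothetical stand-in for a machine instruction: an opcode name and the
// register it defines.
struct Instr {
  std::string opcode;
  int defReg;
};

// Move the COPY that defines `inputReg` to the front of `block` and return
// the position right after it, i.e. where the EXEC setup would be inserted.
inline std::list<Instr>::iterator hoistInputCopy(std::list<Instr> &block,
                                                 int inputReg) {
  auto it = std::find_if(block.begin(), block.end(), [&](const Instr &i) {
    return i.opcode == "COPY" && i.defReg == inputReg;
  });
  assert(it != block.end() && "input COPY not found");
  // splice relinks the node without copying; it does nothing if the COPY
  // is already the first instruction.
  block.splice(block.begin(), block, it);
  return std::next(block.begin());
}
```

Relinking a single node keeps every other instruction in its original order, which is the property the custom inserter relies on when it moves only the COPY.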

mareko added inline comments. Apr 6 2017, 10:08 AM

lib/Target/AMDGPU/SIInstructions.td
290	Why SSrc_b64 and not i64imm? It doesn't support non-immediate operands.

arsenm added inline comments. Apr 6 2017, 4:37 PM

include/llvm/IR/IntrinsicsAMDGPU.td
117–121	I don't think we should be adding intrinsics for the sake of working around codegen defects. A better workaround would be to only ever use init_exec and then have AMDGPUCodeGenPrepare insert calls to the second intrinsic until we fix the various SCC handling issues.
lib/Target/AMDGPU/SIInstructions.td
290	To have the one intrinsic

mareko added inline comments. Apr 7 2017, 6:41 AM

include/llvm/IR/IntrinsicsAMDGPU.td
117–121	That makes sense; however, if we look at it from the hw perspective, amdgcn.init.exec.from.input is all we need. We don't need any SCC handling, S_CMOV_B64 selection, etc. It would be desirable to have those, but not for init.exec. In fact, the init.exec intrinsic is already too flexible. The driver will only ever pass -1 into it. With all that in mind, it's unnecessary for init.exec to be more flexible. There is no use case for such flexibility. Would you implement something that nobody would ever use?

Bug fixes, cosmetic changes, more tests.

Ping

I don't see anything wrong with the code.

I agree that the design is a bit iffy. It's almost like these intrinsics are something that is part of the calling convention. But even these intrinsics cannot quite lead to optimal code for merged monolithic shaders, because there's an unnecessary initialization of EXEC in the first part of the shader.

Since what we need to do here in general really doesn't fit well into LLVM IR semantics, I suspect that no matter what we come up with, it's bound to be ugly. So we might as well go with this particular solution here.

This revision is now accepted and ready to land. Apr 24 2017, 9:19 AM

In D31762#735505, @nhaehnle wrote:

I don't see anything wrong with the code.

I agree that the design is a bit iffy. It's almost like these intrinsics are something that is part of the calling convention. But even these intrinsics cannot quite lead to optimal code for merged monolithic shaders, because there's an unnecessary initialization of EXEC in the first part of the shader.

Since what we need to do here in general really doesn't fit well into LLVM IR semantics, I suspect that no matter what we come up with, it's bound to be ugly. So we might as well go with this particular solution here.

Merged monolithic shaders set exec = -1 and then they use "if (tid < thread_count) ...". I think that's the only way to jump over the conditional code right now if the wave has thread_count == 0. If we don't want v_mbcnt+v_cmp, we could do something like "if (amdgcn.set.thread_count(n)) ..." that sets EXEC regardless of current EXEC and skips the conditional for thread_count == 0. The performance of that solution is unlikely to justify the implementation effort.
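
The "if (tid < thread_count)" pattern described above can be modelled on the host as follows. This is a sketch in standard C++, not driver or compiler code: `laneId` and `activeLanes` are hypothetical names, a wave is assumed to have 64 lanes, and the v_mbcnt step is modelled as a count of the mask bits below the lane.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Model of v_mbcnt: count the bits of `mask` that lie below `lane`.
// With an all-ones mask this is simply the lane index within the wave.
inline unsigned laneId(uint64_t mask, unsigned lane) {
  assert(lane < 64);
  uint64_t below = (lane == 0) ? 0 : (~0ull >> (64 - lane));
  return static_cast<unsigned>(std::bitset<64>(mask & below).count());
}

// Lanes that enter the body of "if (tid < thread_count)" when the wave
// starts with EXEC = -1 (illustration only).
inline uint64_t activeLanes(unsigned threadCount) {
  uint64_t exec = 0;
  for (unsigned lane = 0; lane < 64; ++lane)
    if (laneId(~0ull, lane) < threadCount) // v_mbcnt + v_cmp
      exec |= 1ull << lane;
  return exec;
}
```

With `threadCount == 0` no lane passes the comparison, so the resulting mask is empty and the conditional code can be jumped over, which is the case discussed in this comment.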

In D31762#735528, @mareko wrote:

Merged monolithic shaders set exec = -1 and then they use "if (tid < thread_count) ...". I think that's the only way to jump over the conditional code right now if the wave has thread_count == 0. If we don't want v_mbcnt+v_cmp, we could do something like "if (amdgcn.set.thread_count(n)) ..." that sets EXEC regardless of current EXEC and skips the conditional for thread_count == 0. The performance of that solution is unlikely to justify the implementation effort.

Is it at all possible to get merged shaders where either part has thread_count == 0? We might want a way to annotate branches so that the skip-jump for EXEC=0 is not introduced.

Yeah, I looked at the monolithic shader stuff briefly. I think the LLVM IR is fine. Adding another intrinsic for switching EXEC in the middle of a shader is bound to run into lots of problems in CodeGen around scheduling and such.

The LLVM CodeGen could generally grow some more smarts around EXEC and perhaps even pattern-match the v_mbcnt+v_cmp. I also think it's pretty low priority though.

In D31762#735585, @nhaehnle wrote:

Is it at all possible to get merged shaders where either part has thread_count == 0? We might want a way to annotate branches so that the skip-jump for EXEC=0 is not introduced.

Yes, thread_count == 0 is possible, and it's explained here: https://patchwork.freedesktop.org/patch/152356/

In D31762#735590, @mareko wrote:

Yes, thread_count == 0 is possible, and it's explained here: https://patchwork.freedesktop.org/patch/152356/

Ah, so that's why the barrier instruction is needed between shader parts? Interesting.

In D31762#735630, @nhaehnle wrote:

Ah, so that's why the barrier instruction is needed between shader parts? Interesting.

There are two cases when the barrier isn't needed: 1) When GS is processing points without any amplification. 2) When HS has input control points == output control points, and each HS thread doesn't access other threads' inputs. In both cases, the barrier and the LDS traffic can be removed and the previous shader can put outputs into VGPRs to get a fully merged shader.

Closed by commit rL301677: AMDGPU: Add new amdgcn.init.exec intrinsics (authored by mareko). Apr 28 2017, 1:35 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                            Size
include/llvm/IR/IntrinsicsAMDGPU.td             15 lines
lib/Target/AMDGPU/AMDGPUISelLowering.h          2 lines
lib/Target/AMDGPU/AMDGPUISelLowering.cpp        2 lines
lib/Target/AMDGPU/AMDGPUInstrInfo.td            9 lines
lib/Target/AMDGPU/SIISelLowering.cpp            65 lines
lib/Target/AMDGPU/SIInstructions.td             23 lines
test/CodeGen/AMDGPU/llvm.amdgcn.init.exec.ll    80 lines

Diff 95346

include/llvm/IR/IntrinsicsAMDGPU.td

[102 lines not shown]
 def int_amdgcn_dispatch_id :
   GCCBuiltin<"__builtin_amdgcn_dispatch_id">,
   Intrinsic<[llvm_i64_ty], [], [IntrNoMem]>;

 def int_amdgcn_implicit_buffer_ptr :
   GCCBuiltin<"__builtin_amdgcn_implicit_buffer_ptr">,
   Intrinsic<[LLVMQualPointerType<llvm_i8_ty, 2>], [], [IntrNoMem]>;

+// Set EXEC to the 64-bit value given.
+// This is always moved to the beginning of the basic block.
+def int_amdgcn_init_exec : Intrinsic<[],
+  [llvm_i64_ty], // 64-bit literal constant
+  [IntrConvergent]>;
+
+// Set EXEC according to a thread count packed in an SGPR input:
+//    thread_count = (input >> bitoffset) & 0x7f;
+// This is always moved to the beginning of the basic block.
+def int_amdgcn_init_exec_from_input : Intrinsic<[],
+  [llvm_i32_ty, // 32-bit SGPR input
+   llvm_i32_ty], // bit offset of the thread count
+  [IntrConvergent]>;
+

 //===----------------------------------------------------------------------===//
 // Instruction Intrinsics
 //===----------------------------------------------------------------------===//

 // The first parameter is s_sendmsg immediate (i16),
 // the second one is copied to m0
 def int_amdgcn_s_sendmsg : GCCBuiltin<"__builtin_amdgcn_s_sendmsg">,
   Intrinsic <[], [llvm_i32_ty, llvm_i32_ty], []>;
[601 lines not shown]

lib/Target/AMDGPU/AMDGPUISelLowering.h

[359 lines not shown] enum NodeType : unsigned {
   ///   |X  |Y|Z|W|
   /// T0|v.x| | | |
   /// T1|v.y| | | |
   /// T2|v.z| | | |
   /// T3|v.w| | | |
   BUILD_VERTICAL_VECTOR,
   /// Pointer to the start of the shader's constant data.
   CONST_DATA_PTR,
+  INIT_EXEC,
+  INIT_EXEC_FROM_INPUT,
   SENDMSG,
   SENDMSGHALT,
   INTERP_MOV,
   INTERP_P1,
   INTERP_P2,
   PC_ADD_REL_OFFSET,
   KILL,
   DUMMY_CHAIN,
[18 lines not shown]

lib/Target/AMDGPU/AMDGPUISelLowering.cpp

[3,508 lines not shown] const char* AMDGPUTargetLowering::getTargetNodeName(unsigned Opcode) const {
   NODE_NAME_CASE(FP_TO_FP16)
   NODE_NAME_CASE(FP16_ZEXT)
   NODE_NAME_CASE(BUILD_VERTICAL_VECTOR)
   NODE_NAME_CASE(CONST_DATA_PTR)
   NODE_NAME_CASE(PC_ADD_REL_OFFSET)
   NODE_NAME_CASE(KILL)
   NODE_NAME_CASE(DUMMY_CHAIN)
   case AMDGPUISD::FIRST_MEM_OPCODE_NUMBER: break;
+  NODE_NAME_CASE(INIT_EXEC)
+  NODE_NAME_CASE(INIT_EXEC_FROM_INPUT)
   NODE_NAME_CASE(SENDMSG)
   NODE_NAME_CASE(SENDMSGHALT)
   NODE_NAME_CASE(INTERP_MOV)
   NODE_NAME_CASE(INTERP_P1)
   NODE_NAME_CASE(INTERP_P2)
   NODE_NAME_CASE(STORE_MSKOR)
   NODE_NAME_CASE(LOAD_CONSTANT)
   NODE_NAME_CASE(TBUFFER_STORE_FORMAT)
[126 lines not shown]

lib/Target/AMDGPU/AMDGPUInstrInfo.td

[288 lines not shown]
 >;

 def AMDGPUumed3 : SDNode<"AMDGPUISD::UMED3", AMDGPUDTIntTernaryOp,
   []
 >;

 def AMDGPUfmed3 : SDNode<"AMDGPUISD::FMED3", SDTFPTernaryOp, []>;

+def AMDGPUinit_exec : SDNode<"AMDGPUISD::INIT_EXEC",
+    SDTypeProfile<0, 1, [SDTCisInt<0>]>,
+    [SDNPHasChain, SDNPInGlue]>;
+
+def AMDGPUinit_exec_from_input : SDNode<"AMDGPUISD::INIT_EXEC_FROM_INPUT",
+    SDTypeProfile<0, 2,
+    [SDTCisInt<0>, SDTCisInt<1>]>,
+    [SDNPHasChain, SDNPInGlue]>;
+
 def AMDGPUsendmsg : SDNode<"AMDGPUISD::SENDMSG",
     SDTypeProfile<0, 1, [SDTCisInt<0>]>,
     [SDNPHasChain, SDNPInGlue]>;

 def AMDGPUsendmsghalt : SDNode<"AMDGPUISD::SENDMSGHALT",
     SDTypeProfile<0, 1, [SDTCisInt<0>]>,
     [SDNPHasChain, SDNPInGlue]>;

[67 lines not shown]

lib/Target/AMDGPU/SIISelLowering.cpp

[1,986 lines not shown] MachineBasicBlock *SITargetLowering::EmitInstrWithCustomInserter(
   }
   case AMDGPU::SI_INIT_M0:
     BuildMI(*BB, MI.getIterator(), MI.getDebugLoc(),
             TII->get(AMDGPU::S_MOV_B32), AMDGPU::M0)
         .add(MI.getOperand(0));
     MI.eraseFromParent();
     return BB;

+  case AMDGPU::SI_INIT_EXEC:
+    // This should be before all vector instructions.
+    BuildMI(*BB, &*BB->begin(), MI.getDebugLoc(), TII->get(AMDGPU::S_MOV_B64),
+            AMDGPU::EXEC)
+        .addImm(MI.getOperand(0).getImm());
+    MI.eraseFromParent();
+    return BB;
+
+  case AMDGPU::SI_INIT_EXEC_FROM_INPUT: {
+    // Extract the thread count from an SGPR input and set EXEC accordingly.
+    // Since BFM can't shift by 64, handle that case with CMP + CMOV.
+    //
+    // S_BFE_U32 count, input, {shift, 7}
+    // S_BFM_B64 exec, count, 0
+    // S_CMP_EQ_U32 count, 64
+    // S_CMOV_B64 exec, -1
+    MachineInstr *FirstMI = &*BB->begin();
+    MachineRegisterInfo &MRI = MF->getRegInfo();
+    unsigned InputReg = MI.getOperand(0).getReg();
+    unsigned CountReg = MRI.createVirtualRegister(&AMDGPU::SGPR_32RegClass);
+    bool Found = false;
+
+    // Move the COPY of the input reg to the beginning, so that we can use it.
+    for (auto I = BB->begin(); I != &MI; I++) {
+      if (I->getOpcode() != TargetOpcode::COPY ||
+          I->getOperand(0).getReg() != InputReg)
+        continue;
+
+      if (I == FirstMI) {
+        FirstMI = &*++BB->begin();
+      } else {
+        I->removeFromParent();
+        BB->insert(FirstMI, &*I);
+      }
+      Found = true;
+      break;
+    }
+    assert(Found);
+
+    // This should be before all vector instructions.
+    BuildMI(*BB, FirstMI, DebugLoc(), TII->get(AMDGPU::S_BFE_U32), CountReg)
+        .addReg(InputReg)
+        .addImm((MI.getOperand(1).getImm() & 0x7f) | 0x70000);
+    BuildMI(*BB, FirstMI, DebugLoc(), TII->get(AMDGPU::S_BFM_B64),
+            AMDGPU::EXEC)
+        .addReg(CountReg)
+        .addImm(0);
+    BuildMI(*BB, FirstMI, DebugLoc(), TII->get(AMDGPU::S_CMP_EQ_U32))
+        .addReg(CountReg, RegState::Kill)
+        .addImm(64);
+    BuildMI(*BB, FirstMI, DebugLoc(), TII->get(AMDGPU::S_CMOV_B64),
+            AMDGPU::EXEC)
+        .addImm(-1);
+    MI.eraseFromParent();
+    return BB;
+  }
+
   case AMDGPU::GET_GROUPSTATICSIZE: {
     DebugLoc DL = MI.getDebugLoc();
     BuildMI(*BB, MI, DL, TII->get(AMDGPU::S_MOV_B32))
         .add(MI.getOperand(0))
         .addImm(MFI->getLDSSize());
     MI.eraseFromParent();
     return BB;
   }
[1,174 lines not shown] SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
   case Intrinsic::amdgcn_s_sendmsghalt: {
     unsigned NodeOp = (IntrinsicID == Intrinsic::amdgcn_s_sendmsg) ?
       AMDGPUISD::SENDMSG : AMDGPUISD::SENDMSGHALT;
     Chain = copyToM0(DAG, Chain, DL, Op.getOperand(3));
     SDValue Glue = Chain.getValue(1);
     return DAG.getNode(NodeOp, DL, MVT::Other, Chain,
                        Op.getOperand(2), Glue);
   }
+  case Intrinsic::amdgcn_init_exec: {
+    return DAG.getNode(AMDGPUISD::INIT_EXEC, DL, MVT::Other, Chain,
+                       Op.getOperand(2));
+  }
+  case Intrinsic::amdgcn_init_exec_from_input: {
+    return DAG.getNode(AMDGPUISD::INIT_EXEC_FROM_INPUT, DL, MVT::Other, Chain,
+                       Op.getOperand(2), Op.getOperand(3));
+  }
   case AMDGPUIntrinsic::SI_tbuffer_store: {
     SDValue Ops[] = {
       Chain,
       Op.getOperand(2),
       Op.getOperand(3),
       Op.getOperand(4),
       Op.getOperand(5),
       Op.getOperand(6),
[1,995 lines not shown]
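
The four-instruction sequence listed in the SI_INIT_EXEC_FROM_INPUT comment can be traced with a small host-side model. This is a sketch in standard C++ under stated assumptions, not the backend's code: `bfeU32`, `bfmB64` and `initExecFromInput` are hypothetical helper names, and S_BFM_B64 is assumed to use only the low six bits of the count, which is consistent with the comment that BFM can't shift by 64.

```cpp
#include <cassert>
#include <cstdint>

// S_BFE_U32 model: extract `width` bits of `src` starting at bit `offset`.
inline uint32_t bfeU32(uint32_t src, unsigned offset, unsigned width) {
  assert(offset < 32 && width > 0 && width < 32);
  return (src >> offset) & ((1u << width) - 1u);
}

// S_BFM_B64 model with a zero offset: a mask of `count` low bits.
// Assumption: only the low six bits of `count` are used, so 64 yields 0.
inline uint64_t bfmB64(uint32_t count) {
  return (1ull << (count & 63u)) - 1ull;
}

// EXEC value produced by the sequence for a packed SGPR input.
inline uint64_t initExecFromInput(uint32_t input, unsigned shift) {
  uint32_t count = bfeU32(input, shift, 7); // S_BFE_U32 count, input, {shift, 7}
  uint64_t exec = bfmB64(count);            // S_BFM_B64 exec, count, 0
  if (count == 64)                          // S_CMP_EQ_U32 count, 64
    exec = ~0ull;                           // S_CMOV_B64 exec, -1
  return exec;
}
```

The compare and conditional move exist only for the full-wave case: without them a count of 64 would leave EXEC empty in this model instead of enabling all lanes.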

lib/Target/AMDGPU/SIInstructions.td

[280 lines not shown]
 // fold operands before it runs.
 def SI_INIT_M0 : SPseudoInstSI <(outs), (ins SSrc_b32:$src)> {
   let Defs = [M0];
   let usesCustomInserter = 1;
   let isAsCheapAsAMove = 1;
   let isReMaterializable = 1;
 }

+def SI_INIT_EXEC : SPseudoInstSI <
+  (outs), (ins i64imm:$src), []> {
+  let Defs = [EXEC];
+  let usesCustomInserter = 1;
+  let isAsCheapAsAMove = 1;
+}
+
+def SI_INIT_EXEC_FROM_INPUT : SPseudoInstSI <
+  (outs), (ins SSrc_b32:$input, i32imm:$shift), []> {
+  let Defs = [EXEC];
+  let usesCustomInserter = 1;
+}
+
 // Return for returning shaders to a shader variant epilog.
 def SI_RETURN_TO_EPILOG : SPseudoInstSI <
   (outs), (ins variable_ops), [(AMDGPUreturn_to_epilog)]> {
   let isTerminator = 1;
   let isBarrier = 1;
   let isReturn = 1;
   let hasNoSchedulingInfo = 1;
   let DisableWQM = 1;
[97 lines not shown] def SI_PC_ADD_REL_OFFSET : SPseudoInstSI <
   [(set SReg_64:$dst,
       (i64 (SIpc_add_rel_offset (tglobaladdr:$ptr_lo), (tglobaladdr:$ptr_hi))))]> {
   let Defs = [SCC];
 }

 } // End SubtargetPredicate = isGCN

 let Predicates = [isGCN] in {
+def : Pat <
+  (AMDGPUinit_exec i64:$src),
+  (SI_INIT_EXEC (as_i64imm $src))
+>;
+
+def : Pat <
+  (AMDGPUinit_exec_from_input i32:$input, i32:$shift),
+  (SI_INIT_EXEC_FROM_INPUT (i32 $input), (as_i32imm $shift))
+>;
+
 def : Pat<
   (trap),
   (S_TRAP_PSEUDO TRAPID.LLVM_TRAP)
 >;

 def : Pat<
   (debugtrap),
   (S_TRAP_PSEUDO TRAPID.LLVM_DEBUG_TRAP)
 >;
[865 lines not shown]

test/CodeGen/AMDGPU/llvm.amdgcn.init.exec.ll

This file was added.

				;RUN: llc < %s -march=amdgcn -mcpu=gfx900 -verify-machineinstrs | FileCheck %s --check-prefix=GCN

				; GCN-LABEL: {{^}}full_mask:
				; GCN: s_mov_b64 exec, -1
				; GCN: v_add_f32_e32 v0,
				define amdgpu_ps float @full_mask(float %a, float %b) {
				main_body:
				%s = fadd float %a, %b
				call void @llvm.amdgcn.init.exec(i64 -1)
				ret float %s
				}

				; GCN-LABEL: {{^}}partial_mask:
				; GCN: s_mov_b64 exec, 0x1e240
				; GCN: v_add_f32_e32 v0,
				define amdgpu_ps float @partial_mask(float %a, float %b) {
				main_body:
				%s = fadd float %a, %b
				call void @llvm.amdgcn.init.exec(i64 123456)
				ret float %s
				}

				; GCN-LABEL: {{^}}input_s3off8:
				; GCN: s_bfe_u32 s0, s3, 0x70008
				; GCN: s_bfm_b64 exec, s0, 0
				; GCN: s_cmp_eq_u32 s0, 64
				; GCN: s_cmov_b64 exec, -1
				; GCN: v_add_f32_e32 v0,
				define amdgpu_ps float @input_s3off8(i32 inreg, i32 inreg, i32 inreg, i32 inreg %count, float %a, float %b) {
				main_body:
				%s = fadd float %a, %b
				call void @llvm.amdgcn.init.exec.from.input(i32 %count, i32 8)
				ret float %s
				}

				; GCN-LABEL: {{^}}input_s0off19:
				; GCN: s_bfe_u32 s0, s0, 0x70013
				; GCN: s_bfm_b64 exec, s0, 0
				; GCN: s_cmp_eq_u32 s0, 64
				; GCN: s_cmov_b64 exec, -1
				; GCN: v_add_f32_e32 v0,
				define amdgpu_ps float @input_s0off19(i32 inreg %count, float %a, float %b) {
				main_body:
				%s = fadd float %a, %b
				call void @llvm.amdgcn.init.exec.from.input(i32 %count, i32 19)
				ret float %s
				}

				; GCN-LABEL: {{^}}reuse_input:
				; GCN: s_bfe_u32 s1, s0, 0x70013
				; GCN: s_bfm_b64 exec, s1, 0
				; GCN: s_cmp_eq_u32 s1, 64
				; GCN: s_cmov_b64 exec, -1
				; GCN: v_add_i32_e32 v0, vcc, s0, v0
				define amdgpu_ps float @reuse_input(i32 inreg %count, i32 %a) {
				main_body:
				call void @llvm.amdgcn.init.exec.from.input(i32 %count, i32 19)
				%s = add i32 %a, %count
				%f = sitofp i32 %s to float
				ret float %f
				}

				; GCN-LABEL: {{^}}reuse_input2:
				; GCN: s_bfe_u32 s1, s0, 0x70013
				; GCN: s_bfm_b64 exec, s1, 0
				; GCN: s_cmp_eq_u32 s1, 64
				; GCN: s_cmov_b64 exec, -1
				; GCN: v_add_i32_e32 v0, vcc, s0, v0
				define amdgpu_ps float @reuse_input2(i32 inreg %count, i32 %a) {
				main_body:
				%s = add i32 %a, %count
				%f = sitofp i32 %s to float
				call void @llvm.amdgcn.init.exec.from.input(i32 %count, i32 19)
				ret float %f
				}

				declare void @llvm.amdgcn.init.exec(i64) #1
				declare void @llvm.amdgcn.init.exec.from.input(i32, i32) #1

				attributes #1 = { convergent }
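
The immediates that the checks in this test expect can be reproduced on the host. The sketch below, in standard C++, mirrors the expression `(shift & 0x7f) | 0x70000` from the custom inserter; `bfeControl`, `bfeOffset` and `bfeWidth` are hypothetical helper names, and the decode masks simply mirror that expression rather than quoting the ISA manual.

```cpp
#include <cassert>
#include <cstdint>

// Pack the S_BFE_U32 control operand the way the custom inserter does:
// the bit offset in the low bits and the field width (7) from bit 16 up.
constexpr uint32_t bfeControl(uint32_t shift) {
  return (shift & 0x7fu) | 0x70000u;
}

// Unpack the two fields again (masks chosen to mirror bfeControl).
constexpr uint32_t bfeOffset(uint32_t control) { return control & 0x7fu; }
constexpr uint32_t bfeWidth(uint32_t control) { return (control >> 16) & 0x7fu; }

// Values checked by input_s3off8 and input_s0off19.
static_assert(bfeControl(8) == 0x70008u, "offset 8, width 7");
static_assert(bfeControl(19) == 0x70013u, "offset 19, width 7");

// Literal checked by partial_mask: 123456 printed in hexadecimal.
static_assert(123456 == 0x1e240, "s_mov_b64 exec, 0x1e240");
```

Because everything is `constexpr`, the checks run at compile time; a mismatch between a test expectation and the packing expression would show up as a failed `static_assert`.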