Details

Reviewers

mareko
arsenm
ruiling

Commits

rGa80ebd01798c: [AMDGPU] Fix llvm.amdgcn.init.exec and frame materialization

Summary

Frame-base materialization may insert vector instructions before EXEC is initialised.
Fix this by moving the lowering of llvm.amdgcn.init.exec later in the backend.
Also remove the SI_INIT_EXEC_LO pseudo, as it is not necessary.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

critson created this revision. Jan 13 2021, 7:06 PM

Herald added subscribers: kerbowa, hiraditya, t-tye and 6 others. Jan 13 2021, 7:06 PM

critson requested review of this revision. Jan 13 2021, 7:06 PM

Herald added a project: Restricted Project. Jan 13 2021, 7:06 PM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B85109: Diff 316551. Jan 13 2021, 7:53 PM

This looks good to me, but I'm not very familiar with LLVM.

arsenm added inline comments. Jan 14 2021, 8:19 AM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
1887–1893 ↗	(On Diff #316551)	We shouldn't need to do ad-hoc use/def searches. Can you do this pre-RA, post-SSA like SILowerControlFlow?

critson added inline comments. Jan 14 2021, 4:09 PM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
1887–1893 ↗	(On Diff #316551)	We definitely could do. Do you have an opinion about what pass it should go in? It looks to me that we would need to add one.

ruiling added inline comments. Jan 14 2021, 10:11 PM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
1916 ↗	(On Diff #316551)	If FirstMI is not null, which means a COPY from the input register was not eliminated for some reason, we need to move the COPY instruction to the beginning of the block and insert the exec initialization right after it, because the frame-base materialization instructions may be inserted before the COPY instruction. We need to make sure the exec initialization is always inserted at the entry of the block.

critson added inline comments. Jan 14 2021, 10:51 PM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
1916 ↗	(On Diff #316551)	I only partially agree with this. In theory the input to the intrinsic can be any operation -- unless we refine its definition (which I am not against doing). So in the general case we cannot move this operation before the instruction which generates its input. If we are saying that the input to this register is always a live-in, then we shouldn't be moving the COPY, rather we should be using the source of the COPY as the source of this operation directly.

ruiling added inline comments. Jan 15 2021, 12:52 AM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
1916 ↗	(On Diff #316551)	What I described is the most common case in MachineIR: a COPY from a function input, then init.exec.from.input using the COPY destination. I don't think any frontend will generate the argument to init.exec.from.input with an arbitrary operation. I am not suggesting moving the operation before the instruction that generates its input. The general idea is that the instructions that set EXEC must come before any vector instruction. Based on this, we should move all instructions in the def-use chain of the input to init.exec.from.input to the block entry, before all vector instructions. In practice, your suggestion to use the COPY's source should work, because that is the only usage the frontend will generate. The existing implementation before your change already assumes COPY is the only possible operation that generates the input to this intrinsic.

BTW, I feel the "input" in the name llvm.amdgcn.init.exec.from.input suggests that the argument is a function input argument in LLVM IR. I think we can clarify that the value passed to this intrinsic should be a function input argument. That is the only requirement I can see currently from the frontend, so we don't need to consider various hard-to-solve situations. @mareko What do you think?

In D94645#2500467, @ruiling wrote:

BTW, I feel the "input" in the name llvm.amdgcn.init.exec.from.input suggests that the argument is a function input argument in LLVM IR. I think we can clarify that the value passed to this intrinsic should be a function input argument. That is the only requirement I can see currently from the frontend, so we don't need to consider various hard-to-solve situations. @mareko What do you think?

Yes, the input can only be an SGPR function argument.

ruiling added inline comments. Jan 18 2021, 4:44 PM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
1887–1893 ↗	(On Diff #316551)	We definitely could do. Do you have an opinion about what pass it should go in? It looks to me that we would need to add one.

Can we just implement this in SILowerControlFlow?

Add text refining definition of llvm.amdgcn.init.exec.from.input.
Move code to SILowerControlFlow, this is not really a good place for this, but sufficient for now.

In D94645#2511686, @critson wrote:

Add text refining definition of llvm.amdgcn.init.exec.from.input.

Move code to SILowerControlFlow, this is not really a good place for this, but sufficient for now.

There is still an issue if the SGPR used to hold the input llvm.amdgcn.init.exec.from.input is spilt; however, this is not a new issue.
From my testing llvm.amdgcn.init.exec.from.input actually only worked in the entry block previous to this change, so we could tighten its description even further.

Harbormaster completed remote builds in B86032: Diff 318099. Jan 20 2021, 11:17 PM

There is still an issue if the SGPR used to hold the input llvm.amdgcn.init.exec.from.input is spilt; however, this is not a new issue.
From my testing llvm.amdgcn.init.exec.from.input actually only worked in the entry block previous to this change, so we could tighten its description even further.

I am not sure what the problem is. Maybe we can fix it later. But I don't want to restrict it to the entry block for now, unless we later find that it is really hard to make it correct. We may possibly use it for the second part of the merged shader.

llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp

690–715

I think this piece of code can be simplified to something like the following:

Register InputReg = MI.getOperand(0).getReg();
MachineInstr *FirstMI = &*MBB->begin();
if (InputReg.isVirtual()) {
  MachineInstr *DefInstr = MRI->getVRegDef(InputReg);
  assert(DefInstr && DefInstr->isCopy());
  if (DefInstr->getParent() == MBB && DefInstr != FirstMI) {
    // If the `InputReg` is defined in current block, we also need to
    // move that instruction to the beginning of the block.
    DefInstr->removeFromParent();
    MBB->insert(FirstMI, DefInstr);
    if (LIS)
      LIS->handleMove(*DefInstr);
  }
}

754

Maybe also LIS->removeAllRegUnitsForPhysReg(AMDGPU::EXEC);? This also applies to SI_INIT_EXEC.

Address review comments.

critson marked 2 inline comments as done. Jan 21 2021, 5:57 AM

critson added inline comments.

llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
690–715	Sure. Although I had to update the code to handle the case where the definition is the first instruction.
754	Yes, I was thinking that myself.

ruiling added inline comments. Jan 21 2021, 6:33 AM

llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
690–715	Yes, that needs to be fixed. I have not tested the code.
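
A toy model of the two cases discussed here (the definition is already the first instruction, or it sits later in the block) may help. This is only a sketch under stated assumptions: it uses a `std::list<std::string>` in place of a MachineBasicBlock, and `placeDefFirst` is a hypothetical name, not an LLVM API.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

// Stand-in for a basic block: an ordered list of instruction names.
using Block = std::list<std::string>;

// Returns the position before which the EXEC-initialization sequence
// should be inserted. `def` must point into `block` (hypothetical helper).
Block::iterator placeDefFirst(Block &block, Block::iterator def) {
  Block::iterator first = block.begin();
  if (def != first) {
    // The definition is later in the block: move it to the front.
    // splice() keeps `first` valid, so inserting before `first`
    // now lands right after the moved definition.
    block.splice(first, block, def);
    return first;
  }
  // The definition is already first: insert after it instead.
  return std::next(first);
}
```

In both cases the returned position is directly after the defining instruction, so anything inserted there follows the definition and precedes the rest of the block.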

Harbormaster completed remote builds in B86074: Diff 318173. Jan 21 2021, 6:57 AM

Thanks for the patch. Basically LGTM with some minor comments.

Fix this by moving lowering of llvm.amdgcn.init.exec post-RA.

The commit message needs a slight update, as this is no longer post-RA.

llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
746	I am not sure whether you would like to put this under an `if (InputReg.isVirtual())` condition. I guess we would only hit an issue if this piece of code or the pass is moved around.
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.exec.ll
171–172	There are some test failures on Windows; it seems dynamic `alloca` is not handled correctly on the GlobalISel path under Windows. I think we can simply move these two `alloca`s to the entry block. The test will then only check that `llvm.amdgcn.init.exec.from.input` works correctly when placed in a non-entry block. Do you have any other ideas?

Update test to avoid GlobalISel issue on Windows
Tighten tests

LGTM. Let's wait a while to see if anybody else has more comments, and make sure to update the commit message before pushing.

Fix this by moving lowering of llvm.amdgcn.init.exec post-RA

This revision is now accepted and ready to land. Jan 21 2021, 8:11 PM

critson edited the summary of this revision. Jan 21 2021, 8:23 PM

Harbormaster completed remote builds in B86220: Diff 318402. Jan 21 2021, 10:55 PM

Closed by commit rGa80ebd01798c: [AMDGPU] Fix llvm.amdgcn.init.exec and frame materialization (authored by critson). Jan 24 2021, 3:50 PM

This revision was automatically updated to reflect the committed changes.

critson added a commit: rGa80ebd01798c: [AMDGPU] Fix llvm.amdgcn.init.exec and frame materialization.

Diff 318873

llvm/include/llvm/IR/IntrinsicsAMDGPU.td

	// FIXME: Should be mangled for wave size.			// FIXME: Should be mangled for wave size.
	def int_amdgcn_init_exec : Intrinsic<[],			def int_amdgcn_init_exec : Intrinsic<[],
	[llvm_i64_ty], // 64-bit literal constant			[llvm_i64_ty], // 64-bit literal constant
	[IntrConvergent, ImmArg<ArgIndex<0>>]>;			[IntrConvergent, ImmArg<ArgIndex<0>>]>;

	// Set EXEC according to a thread count packed in an SGPR input:			// Set EXEC according to a thread count packed in an SGPR input:
	// thread_count = (input >> bitoffset) & 0x7f;			// thread_count = (input >> bitoffset) & 0x7f;
	// This is always moved to the beginning of the basic block.			// This is always moved to the beginning of the basic block.
				// Note: only inreg arguments to the parent function are valid as
				// inputs to this intrinsic, computed values cannot be used.
	def int_amdgcn_init_exec_from_input : Intrinsic<[],			def int_amdgcn_init_exec_from_input : Intrinsic<[],
	[llvm_i32_ty, // 32-bit SGPR input			[llvm_i32_ty, // 32-bit SGPR input
	llvm_i32_ty], // bit offset of the thread count			llvm_i32_ty], // bit offset of the thread count
	[IntrConvergent, ImmArg<ArgIndex<1>>]>;			[IntrConvergent, ImmArg<ArgIndex<1>>]>;

	def int_amdgcn_wavefrontsize :			def int_amdgcn_wavefrontsize :
	GCCBuiltin<"__builtin_amdgcn_wavefrontsize">,			GCCBuiltin<"__builtin_amdgcn_wavefrontsize">,
	Intrinsic<[llvm_i32_ty], [], [IntrNoMem, IntrSpeculatable, IntrWillReturn]>;			Intrinsic<[llvm_i32_ty], [], [IntrNoMem, IntrSpeculatable, IntrWillReturn]>;
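
The comment on `int_amdgcn_init_exec_from_input` above gives the thread-count formula; the mask it leads to can be sketched as a small wave64 model. This is an illustration only, not the backend's code: `initExecFromInput` is a hypothetical name, and the assumption that only the low 6 bits of the count feed the shift is mine, chosen to mirror the "BFM can't shift by 64" note in the lowering.

```cpp
#include <cassert>
#include <cstdint>

// Toy wave64 model of the intrinsic's documented behaviour.
// initExecFromInput is a hypothetical name, not part of LLVM.
uint64_t initExecFromInput(uint32_t input, unsigned bitOffset) {
  // thread_count = (input >> bitoffset) & 0x7f
  // (bitOffset is reduced modulo 32 here to keep the C++ shift defined).
  const uint32_t count = (input >> (bitOffset & 31u)) & 0x7fu;

  // Enable the low `count` lanes. Only the low 6 bits of the count are
  // used for the shift (an assumption of this model), so 64 wraps to 0.
  uint64_t exec = (uint64_t{1} << (count & 63u)) - 1;

  // Full wave: a shift cannot express 64 set bits, so select all lanes.
  if (count == 64u)
    exec = ~uint64_t{0};
  return exec;
}
```

For example, with the count packed at bit offset 8, an input of `5u << 8` yields a mask with the five lowest lanes enabled, and `64u << 8` yields all lanes.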

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

MachineBasicBlock *SITargetLowering::EmitInstrWithCustomInserter(
}		}
case AMDGPU::SI_INIT_M0: {		case AMDGPU::SI_INIT_M0: {
BuildMI(*BB, MI.getIterator(), MI.getDebugLoc(),		BuildMI(*BB, MI.getIterator(), MI.getDebugLoc(),
TII->get(AMDGPU::S_MOV_B32), AMDGPU::M0)		TII->get(AMDGPU::S_MOV_B32), AMDGPU::M0)
.add(MI.getOperand(0));		.add(MI.getOperand(0));
MI.eraseFromParent();		MI.eraseFromParent();
return BB;		return BB;
}		}
case AMDGPU::SI_INIT_EXEC:
// This should be before all vector instructions.
BuildMI(*BB, &*BB->begin(), MI.getDebugLoc(), TII->get(AMDGPU::S_MOV_B64),
AMDGPU::EXEC)
.addImm(MI.getOperand(0).getImm());
MI.eraseFromParent();
return BB;

case AMDGPU::SI_INIT_EXEC_LO:
// This should be before all vector instructions.
BuildMI(*BB, &*BB->begin(), MI.getDebugLoc(), TII->get(AMDGPU::S_MOV_B32),
AMDGPU::EXEC_LO)
.addImm(MI.getOperand(0).getImm());
MI.eraseFromParent();
return BB;

case AMDGPU::SI_INIT_EXEC_FROM_INPUT: {
// Extract the thread count from an SGPR input and set EXEC accordingly.
// Since BFM can't shift by 64, handle that case with CMP + CMOV.
//
// S_BFE_U32 count, input, {shift, 7}
// S_BFM_B64 exec, count, 0
// S_CMP_EQ_U32 count, 64
// S_CMOV_B64 exec, -1
MachineInstr *FirstMI = &*BB->begin();
MachineRegisterInfo &MRI = MF->getRegInfo();
Register InputReg = MI.getOperand(0).getReg();
Register CountReg = MRI.createVirtualRegister(&AMDGPU::SGPR_32RegClass);
bool Found = false;

// Move the COPY of the input reg to the beginning, so that we can use it.
for (auto I = BB->begin(); I != &MI; I++) {
if (I->getOpcode() != TargetOpcode::COPY ||
I->getOperand(0).getReg() != InputReg)
continue;

if (I == FirstMI) {
FirstMI = &*++BB->begin();
} else {
I->removeFromParent();
BB->insert(FirstMI, &*I);
}
Found = true;
break;
}
assert(Found);
(void)Found;

// This should be before all vector instructions.
unsigned Mask = (getSubtarget()->getWavefrontSize() << 1) - 1;
bool isWave32 = getSubtarget()->isWave32();
unsigned Exec = isWave32 ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
BuildMI(*BB, FirstMI, DebugLoc(), TII->get(AMDGPU::S_BFE_U32), CountReg)
.addReg(InputReg)
.addImm((MI.getOperand(1).getImm() & Mask) | 0x70000);
BuildMI(*BB, FirstMI, DebugLoc(),
TII->get(isWave32 ? AMDGPU::S_BFM_B32 : AMDGPU::S_BFM_B64),
Exec)
.addReg(CountReg)
.addImm(0);
BuildMI(*BB, FirstMI, DebugLoc(), TII->get(AMDGPU::S_CMP_EQ_U32))
.addReg(CountReg, RegState::Kill)
.addImm(getSubtarget()->getWavefrontSize());
BuildMI(*BB, FirstMI, DebugLoc(),
TII->get(isWave32 ? AMDGPU::S_CMOV_B32 : AMDGPU::S_CMOV_B64),
Exec)
.addImm(-1);
MI.eraseFromParent();
return BB;
}

case AMDGPU::GET_GROUPSTATICSIZE: {		case AMDGPU::GET_GROUPSTATICSIZE: {
assert(getTargetMachine().getTargetTriple().getOS() == Triple::AMDHSA ||		assert(getTargetMachine().getTargetTriple().getOS() == Triple::AMDHSA ||
getTargetMachine().getTargetTriple().getOS() == Triple::AMDPAL);		getTargetMachine().getTargetTriple().getOS() == Triple::AMDPAL);
DebugLoc DL = MI.getDebugLoc();		DebugLoc DL = MI.getDebugLoc();
BuildMI(*BB, MI, DL, TII->get(AMDGPU::S_MOV_B32))		BuildMI(*BB, MI, DL, TII->get(AMDGPU::S_MOV_B32))
.add(MI.getOperand(0))		.add(MI.getOperand(0))
.addImm(MFI->getLDSSize());		.addImm(MFI->getLDSSize());
MI.eraseFromParent();		MI.eraseFromParent();

llvm/lib/Target/AMDGPU/SIInstructions.td

def SI_INIT_M0 : SPseudoInstSI <(outs), (ins SSrc_b32:$src)> {
let isAsCheapAsAMove = 1;		let isAsCheapAsAMove = 1;
let isReMaterializable = 1;		let isReMaterializable = 1;
}		}

def SI_INIT_EXEC : SPseudoInstSI <		def SI_INIT_EXEC : SPseudoInstSI <
(outs), (ins i64imm:$src),		(outs), (ins i64imm:$src),
[(int_amdgcn_init_exec (i64 timm:$src))]> {		[(int_amdgcn_init_exec (i64 timm:$src))]> {
let Defs = [EXEC];		let Defs = [EXEC];
let usesCustomInserter = 1;
let isAsCheapAsAMove = 1;
let WaveSizePredicate = isWave64;
}

// FIXME: Intrinsic should be mangled for wave size.
def SI_INIT_EXEC_LO : SPseudoInstSI <
(outs), (ins i32imm:$src), []> {
let Defs = [EXEC_LO];
let usesCustomInserter = 1;
let isAsCheapAsAMove = 1;		let isAsCheapAsAMove = 1;
let WaveSizePredicate = isWave32;
}		}

// FIXME: Wave32 version
def SI_INIT_EXEC_FROM_INPUT : SPseudoInstSI <		def SI_INIT_EXEC_FROM_INPUT : SPseudoInstSI <
(outs), (ins SSrc_b32:$input, i32imm:$shift),		(outs), (ins SSrc_b32:$input, i32imm:$shift),
[(int_amdgcn_init_exec_from_input i32:$input, (i32 timm:$shift))]> {		[(int_amdgcn_init_exec_from_input i32:$input, (i32 timm:$shift))]> {
let Defs = [EXEC];		let Defs = [EXEC];
let usesCustomInserter = 1;
}

def : GCNPat <
(int_amdgcn_init_exec timm:$src),
(SI_INIT_EXEC_LO (as_i32timm timm:$src))> {
let WaveSizePredicate = isWave32;
}		}

// Return for returning shaders to a shader variant epilog.		// Return for returning shaders to a shader variant epilog.
def SI_RETURN_TO_EPILOG : SPseudoInstSI <		def SI_RETURN_TO_EPILOG : SPseudoInstSI <
(outs), (ins variable_ops), [(AMDGPUreturn_to_epilog)]> {		(outs), (ins variable_ops), [(AMDGPUreturn_to_epilog)]> {
let isTerminator = 1;		let isTerminator = 1;
let isBarrier = 1;		let isBarrier = 1;
let isReturn = 1;		let isReturn = 1;

llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp

private:

void emitIf(MachineInstr &MI);		void emitIf(MachineInstr &MI);
void emitElse(MachineInstr &MI);		void emitElse(MachineInstr &MI);
void emitIfBreak(MachineInstr &MI);		void emitIfBreak(MachineInstr &MI);
void emitLoop(MachineInstr &MI);		void emitLoop(MachineInstr &MI);

MachineBasicBlock *emitEndCf(MachineInstr &MI);		MachineBasicBlock *emitEndCf(MachineInstr &MI);

		void lowerInitExec(MachineBasicBlock *MBB, MachineInstr &MI);

void findMaskOperands(MachineInstr &MI, unsigned OpNo,		void findMaskOperands(MachineInstr &MI, unsigned OpNo,
SmallVectorImpl<MachineOperand> &Src) const;		SmallVectorImpl<MachineOperand> &Src) const;

void combineMasks(MachineInstr &MI);		void combineMasks(MachineInstr &MI);

bool removeMBBifRedundant(MachineBasicBlock &MBB);		bool removeMBBifRedundant(MachineBasicBlock &MBB);

MachineBasicBlock *process(MachineInstr &MI);		MachineBasicBlock *process(MachineInstr &MI);
default:
I = MBB.end();		I = MBB.end();
break;		break;
}		}
}		}

return SplitBB;		return SplitBB;
}		}

		void SILowerControlFlow::lowerInitExec(MachineBasicBlock *MBB,
		MachineInstr &MI) {
		MachineFunction &MF = *MBB->getParent();
		const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
		bool IsWave32 = ST.isWave32();

		if (MI.getOpcode() == AMDGPU::SI_INIT_EXEC) {
		// This should be before all vector instructions.
		BuildMI(*MBB, MBB->begin(), MI.getDebugLoc(),
		TII->get(IsWave32 ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64), Exec)
		.addImm(MI.getOperand(0).getImm());
		if (LIS)
		LIS->RemoveMachineInstrFromMaps(MI);
		MI.eraseFromParent();
		return;
		}

		// Extract the thread count from an SGPR input and set EXEC accordingly.
		// Since BFM can't shift by 64, handle that case with CMP + CMOV.
		//
		// S_BFE_U32 count, input, {shift, 7}
		// S_BFM_B64 exec, count, 0
		// S_CMP_EQ_U32 count, 64
		// S_CMOV_B64 exec, -1
		Register InputReg = MI.getOperand(0).getReg();
		MachineInstr *FirstMI = &*MBB->begin();
		if (InputReg.isVirtual()) {
		MachineInstr *DefInstr = MRI->getVRegDef(InputReg);
		assert(DefInstr && DefInstr->isCopy());
		if (DefInstr->getParent() == MBB) {
		if (DefInstr != FirstMI) {
		// If the `InputReg` is defined in current block, we also need to
		// move that instruction to the beginning of the block.
		DefInstr->removeFromParent();
		MBB->insert(FirstMI, DefInstr);
		if (LIS)
		LIS->handleMove(*DefInstr);
		} else {
		// If first instruction is definition then move pointer after it.
		FirstMI = &*std::next(FirstMI->getIterator());
		}
		}
		}

		// Insert instruction sequence at block beginning (before vector operations).
		const DebugLoc DL = MI.getDebugLoc();
		const unsigned WavefrontSize = ST.getWavefrontSize();
		const unsigned Mask = (WavefrontSize << 1) - 1;
		Register CountReg = MRI->createVirtualRegister(&AMDGPU::SGPR_32RegClass);
		auto BfeMI = BuildMI(*MBB, FirstMI, DL, TII->get(AMDGPU::S_BFE_U32), CountReg)
		.addReg(InputReg)
		.addImm((MI.getOperand(1).getImm() & Mask) | 0x70000);
		auto BfmMI =
		BuildMI(*MBB, FirstMI, DL,
		TII->get(IsWave32 ? AMDGPU::S_BFM_B32 : AMDGPU::S_BFM_B64), Exec)
		.addReg(CountReg)
		.addImm(0);
		auto CmpMI = BuildMI(*MBB, FirstMI, DL, TII->get(AMDGPU::S_CMP_EQ_U32))
		.addReg(CountReg, RegState::Kill)
		.addImm(WavefrontSize);
		auto CmovMI =
		BuildMI(*MBB, FirstMI, DL,
		TII->get(IsWave32 ? AMDGPU::S_CMOV_B32 : AMDGPU::S_CMOV_B64),
		Exec)
		.addImm(-1);

		if (!LIS) {
		MI.eraseFromParent();
		return;
		}

		LIS->RemoveMachineInstrFromMaps(MI);
		MI.eraseFromParent();

		LIS->InsertMachineInstrInMaps(*BfeMI);
		LIS->InsertMachineInstrInMaps(*BfmMI);
		LIS->InsertMachineInstrInMaps(*CmpMI);
		LIS->InsertMachineInstrInMaps(*CmovMI);

		LIS->removeInterval(InputReg);
		LIS->createAndComputeVirtRegInterval(InputReg);
		LIS->createAndComputeVirtRegInterval(CountReg);
		}

bool SILowerControlFlow::removeMBBifRedundant(MachineBasicBlock &MBB) {		bool SILowerControlFlow::removeMBBifRedundant(MachineBasicBlock &MBB) {
auto GetFallThroughSucc = [=](MachineBasicBlock *B) -> MachineBasicBlock * {		auto GetFallThroughSucc = [=](MachineBasicBlock *B) -> MachineBasicBlock * {
auto *S = B->getNextNode();		auto *S = B->getNextNode();
if (!S)		if (!S)
return nullptr;		return nullptr;
if (B->isSuccessor(S)) {		if (B->isSuccessor(S)) {
// The only fallthrough candidate		// The only fallthrough candidate
MachineBasicBlock::iterator I(B->getFirstInstrTerminator());		MachineBasicBlock::iterator I(B->getFirstInstrTerminator());
MachineBasicBlock::iterator E = B->end();		MachineBasicBlock::iterator E = B->end();
for (; I != E; I++) {		for (; I != E; I++) {
if (I->isBranch() && TII->getBranchDestBlock(*I) == S)		if (I->isBranch() && TII->getBranchDestBlock(*I) == S)
// We have unoptimized branch to layout successor		// We have unoptimized branch to layout successor
return nullptr;		return nullptr;
for (I = MBB->begin(); I != E; I = Next) {
case AMDGPU::SI_END_CF:		case AMDGPU::SI_END_CF:
// Only build worklist if SI_IF instructions must be processed first.		// Only build worklist if SI_IF instructions must be processed first.
if (InsertKillCleanups)		if (InsertKillCleanups)
Worklist.push_back(&MI);		Worklist.push_back(&MI);
else		else
SplitMBB = process(MI);		SplitMBB = process(MI);
break;		break;

		// FIXME: find a better place for this
		case AMDGPU::SI_INIT_EXEC:
		case AMDGPU::SI_INIT_EXEC_FROM_INPUT:
		lowerInitExec(MBB, MI);
		if (LIS)
		LIS->removeAllRegUnitsForPhysReg(AMDGPU::EXEC);
		break;

default:		default:
break;		break;
}		}

if (SplitMBB != MBB) {		if (SplitMBB != MBB) {
MBB = Next->getParent();		MBB = Next->getParent();
E = MBB->end();		E = MBB->end();
}		}
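
The S_BFE_U32 operand built in `lowerInitExec` packs the bit offset and the field width (`{shift, 7}` in the comment) into one immediate. A minimal sketch of that arithmetic follows; `bfeImmediate` is a hypothetical helper written for this illustration, mirroring the expression `(shift & Mask) | 0x70000` with `Mask = (WavefrontSize << 1) - 1`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the immediate computed in lowerInitExec.
constexpr uint32_t bfeImmediate(uint32_t shift, uint32_t wavefrontSize) {
  // Mask = (WavefrontSize << 1) - 1, e.g. 127 for wave64.
  const uint32_t mask = (wavefrontSize << 1) - 1;
  // 0x70000 places the width 7 in the upper half of the immediate;
  // the masked shift occupies the low bits.
  return (shift & mask) | 0x70000u;
}

// Bit offset 8 on wave64 gives the immediate checked by the test file.
static_assert(bfeImmediate(8, 64) == 0x70008u, "offset 8, width 7");
```

The value `0x70008` is the operand that the `s_bfe_u32` check lines in llvm.amdgcn.init.exec.ll expect for a bit offset of 8.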

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.exec.ll

	;			;
	; This used to crash.			; This used to crash.
	define amdgpu_ps void @init_unreachable() {			define amdgpu_ps void @init_unreachable() {
	main_body:			main_body:
	call void @llvm.amdgcn.init.exec(i64 -1)			call void @llvm.amdgcn.init.exec(i64 -1)
	unreachable			unreachable
	}			}

				; GCN-LABEL: {{^}}init_exec_before_frame_materialize:
				; GCN-NOT: {{^}}v_
				; GCN: s_mov_b64 exec, -1
				; GCN: v_mov
				; GCN: v_add
				define amdgpu_ps float @init_exec_before_frame_materialize(i32 inreg %a, i32 inreg %b) {
				main_body:
				%array0 = alloca [1024 x i32], align 16, addrspace(5)
				%array1 = alloca [20 x i32], align 16, addrspace(5)
				call void @llvm.amdgcn.init.exec(i64 -1)

				%ptr0 = getelementptr inbounds [1024 x i32], [1024 x i32] addrspace(5)* %array0, i32 0, i32 1
				store i32 %a, i32 addrspace(5)* %ptr0, align 4

				%ptr1 = getelementptr inbounds [20 x i32], [20 x i32] addrspace(5)* %array1, i32 0, i32 1
				store i32 %a, i32 addrspace(5)* %ptr1, align 4

				%ptr2 = getelementptr inbounds [20 x i32], [20 x i32] addrspace(5)* %array1, i32 0, i32 2
				store i32 %b, i32 addrspace(5)* %ptr2, align 4

				%ptr3 = getelementptr inbounds [20 x i32], [20 x i32] addrspace(5)* %array1, i32 0, i32 %b
				%v3 = load i32, i32 addrspace(5)* %ptr3, align 4

				%ptr4 = getelementptr inbounds [1024 x i32], [1024 x i32] addrspace(5)* %array0, i32 0, i32 %b
				%v4 = load i32, i32 addrspace(5)* %ptr4, align 4

				%v5 = add i32 %v3, %v4
				%v = bitcast i32 %v5 to float
				ret float %v
				}

				; GCN-LABEL: {{^}}init_exec_input_before_frame_materialize:
				; GCN-NOT: {{^}}v_
				; GCN: s_bfe_u32 s2, s2, 0x70008
				; GCN-NEXT: s_bfm_b64 exec, s2, 0
				; GCN-NEXT: s_cmp_eq_u32 s2, 64
				; GCN-NEXT: s_cmov_b64 exec, -1
				; GCN: v_mov
				; GCN: v_add
				define amdgpu_ps float @init_exec_input_before_frame_materialize(i32 inreg %a, i32 inreg %b, i32 inreg %count) {
				main_body:
				%array0 = alloca [1024 x i32], align 16, addrspace(5)
				%array1 = alloca [20 x i32], align 16, addrspace(5)
				call void @llvm.amdgcn.init.exec.from.input(i32 %count, i32 8)

				%ptr0 = getelementptr inbounds [1024 x i32], [1024 x i32] addrspace(5)* %array0, i32 0, i32 1
				store i32 %a, i32 addrspace(5)* %ptr0, align 4

				%ptr1 = getelementptr inbounds [20 x i32], [20 x i32] addrspace(5)* %array1, i32 0, i32 1
				store i32 %a, i32 addrspace(5)* %ptr1, align 4

				%ptr2 = getelementptr inbounds [20 x i32], [20 x i32] addrspace(5)* %array1, i32 0, i32 2
				store i32 %b, i32 addrspace(5)* %ptr2, align 4

				%ptr3 = getelementptr inbounds [20 x i32], [20 x i32] addrspace(5)* %array1, i32 0, i32 %b
				%v3 = load i32, i32 addrspace(5)* %ptr3, align 4

				%ptr4 = getelementptr inbounds [1024 x i32], [1024 x i32] addrspace(5)* %array0, i32 0, i32 %b
				%v4 = load i32, i32 addrspace(5)* %ptr4, align 4

				%v5 = add i32 %v3, %v4
				%v = bitcast i32 %v5 to float
				ret float %v
				}

				; GCN-LABEL: {{^}}init_exec_input_before_frame_materialize_nonentry:
				; GCN-NOT: {{^}}v_
				; GCN: %endif
				; GCN: s_bfe_u32 s3, s2, 0x70008
				; GCN-NEXT: s_bfm_b64 exec, s3, 0
				; GCN-NEXT: s_cmp_eq_u32 s3, 64
				; GCN-NEXT: s_cmov_b64 exec, -1
				; GCN: v_mov
				; GCN: v_add
				define amdgpu_ps float @init_exec_input_before_frame_materialize_nonentry(i32 inreg %a, i32 inreg %b, i32 inreg %count) {
				main_body:
				; ideally these alloca would be in %endif, but this causes problems on Windows GlobalISel
				%array0 = alloca [1024 x i32], align 16, addrspace(5)
				%array1 = alloca [20 x i32], align 16, addrspace(5)

				%cc = icmp uge i32 %count, 32
				br i1 %cc, label %endif, label %if

				if:
				call void asm sideeffect "", ""()
				br label %endif

				endif:
				call void @llvm.amdgcn.init.exec.from.input(i32 %count, i32 8)

				%ptr0 = getelementptr inbounds [1024 x i32], [1024 x i32] addrspace(5)* %array0, i32 0, i32 1
				store i32 %a, i32 addrspace(5)* %ptr0, align 4

				%ptr1 = getelementptr inbounds [20 x i32], [20 x i32] addrspace(5)* %array1, i32 0, i32 1
				store i32 %a, i32 addrspace(5)* %ptr1, align 4

				%ptr2 = getelementptr inbounds [20 x i32], [20 x i32] addrspace(5)* %array1, i32 0, i32 2
				store i32 %b, i32 addrspace(5)* %ptr2, align 4

				%ptr3 = getelementptr inbounds [20 x i32], [20 x i32] addrspace(5)* %array1, i32 0, i32 %b
				%v3 = load i32, i32 addrspace(5)* %ptr3, align 4

				%ptr4 = getelementptr inbounds [1024 x i32], [1024 x i32] addrspace(5)* %array0, i32 0, i32 %b
				%v4 = load i32, i32 addrspace(5)* %ptr4, align 4

				%v5 = add i32 %v3, %v4
				%v6 = add i32 %v5, %count
				%v = bitcast i32 %v6 to float
				ret float %v
				}

	declare void @llvm.amdgcn.init.exec(i64) #1			declare void @llvm.amdgcn.init.exec(i64) #1
	declare void @llvm.amdgcn.init.exec.from.input(i32, i32) #1			declare void @llvm.amdgcn.init.exec.from.input(i32, i32) #1

	attributes #1 = { convergent }			attributes #1 = { convergent }

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix llvm.amdgcn.init.exec and frame materialization
ClosedPublic
