Details

Reviewers

nhaustov
tstellarAMD
nhaehnle
vpykhtin
arsenm

Commits

rG0ee250eee84d: [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition…
rL288053: [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition…
rL286171: [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition…

Summary

CodeGenPrepare sinks comparisons close to their users if the target has only one register for conditions. AMDGPU has many SGPRs capable of holding vector conditions, so the backend now reports that it has multiple condition registers. That way the IR LICM pass hoists an invariant comparison out of a loop and CodeGenPrepare does not sink it back.

With that done, a condition is calculated in one block and used in another. The current behavior is to store each workitem's condition in a VGPR using v_cndmask_b32 and then restore it with yet another v_cmp instruction applied to that v_cndmask's result. To mitigate this, the source SGPR pair is propagated in place of the v_cmp. As a side effect, we may consume fewer VGPRs at the cost of more SGPRs when multiple conditions have to be held, which is a clear win in most cases.
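The sink-or-keep decision described in the summary can be illustrated with a toy model. This is only a sketch: `ToyTarget` and `shouldSinkCompareToUser` are hypothetical names invented for illustration and are not LLVM APIs; the one real hook touched by the patch is `setHasMultipleConditionRegisters(true)`.

```cpp
#include <cassert>

// Hypothetical stand-in for the single target property this patch flips.
struct ToyTarget {
  bool HasMultipleConditionRegisters;
};

// Toy version of the decision: a compare is duplicated next to its user only
// when condition registers are scarce and the user lives in another block.
bool shouldSinkCompareToUser(const ToyTarget &T, bool UserInOtherBlock) {
  if (T.HasMultipleConditionRegisters)
    return false;          // leave the hoisted compare where LICM put it
  return UserInOtherBlock; // otherwise sink a copy next to the user
}
```

With the property set, a loop-invariant compare stays in the preheader; with it cleared, the compare is sunk back into the loop body.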

Diff Detail

Repository: rL LLVM

Event Timeline

rampitec updated this revision to Diff 76290. · Oct 28 2016, 8:44 PM

rampitec retitled this revision to [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition copies.

rampitec updated this object.

rampitec added a reviewer: arsenm.

rampitec set the repository for this revision to rL LLVM.

rampitec added a project: Restricted Project.

rampitec added subscribers: Restricted Project, llvm-commits.

Herald added a reviewer: tstellarAMD. · Oct 28 2016, 8:44 PM

Herald added subscribers: tony-tye, yaxunl, nhaehnle and 2 others.

Ping.

rampitec added a reviewer: nhaustov. · Nov 2 2016, 3:06 PM

tstellarAMD added inline comments. · Nov 3 2016, 6:54 AM

test/CodeGen/AMDGPU/host-cond.ll
1 ↗	(On Diff #76290)	You should pass -verify-machineinstrs to llc.
44 ↗	(On Diff #76290)	This function is missing its attributes.

Mostly LGTM

lib/Target/AMDGPU/SILowerI1Copies.cpp
133	If we already know that Use is a copy and that it uses Dst, why is it necessary to check that Operand(1) is Dst? Maybe an assert is enough here.
134	The FromReg variable name is a bit confusing here. It seems that the name comes from "void replaceRegWith(unsigned FromReg, unsigned ToReg)", but in fact this register is a copy destination register and by its nature a duplicate of the original SGPR64 we would like to propagate. Maybe it is worth naming it something like DuplicateReg.

rampitec added inline comments. · Nov 3 2016, 10:09 AM

lib/Target/AMDGPU/SILowerI1Copies.cpp
133	I'm changing registers in this loop and after the initial use list collection. This is to be on the safe side to check it still uses the same register.

Fixed attributes in test.
Changed variable name from FromReg to RegCopy.

Herald edited edge metadata. · Nov 3 2016, 10:25 AM

rampitec marked 3 inline comments as done. · Nov 3 2016, 10:26 AM

vpykhtin accepted this revision. · Nov 3 2016, 10:48 AM

vpykhtin added a reviewer: vpykhtin.

vpykhtin added inline comments.

lib/Target/AMDGPU/SILowerI1Copies.cpp
133	I'm still unsure it can happen but it's too minor to continue.

This revision is now accepted and ready to land. · Nov 3 2016, 10:48 AM

Rebased to master.
Updated test/CodeGen/AMDGPU/branch-relaxation.ll which started to fail since initial patch creation.

Herald edited edge metadata. · Nov 4 2016, 12:05 PM

Only virtual register use can be forward propagated. Previous version in some cases was trying to replace phys reg vcc uses.
Rebased to master, updated branch-relaxation.ll test.

Herald edited edge metadata. · Nov 7 2016, 3:03 PM

rampitec added a commit: rL286171: [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition…. · Nov 7 2016, 3:17 PM

r286171

Previous version was reverted due to error in GL piglit test fs-discard-exit-2.

The v_cmp_* instruction does not preserve result bits for inactive lanes, but rather sets them to 0. This is in fact equivalent to EXEC[n] & compare[n]. The corrected propagation starts not with the v_cndmask_b32 which saves the condition, but with the v_cmp instruction which restores it. If the pattern is matched, we can emit an s_and_b64 of the original scalar result with EXEC instead of the v_cmp. The first v_cndmask_b32 then has a chance to be eliminated as dead code.

The next step (in a separate change) will be to combine the newly created s_and_b64 with the following s_and_saveexec_b64, if any.
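The equivalence argued above can be checked on a small lane-level simulation. This is a minimal sketch under stated assumptions, not the hardware definition: a wave is modeled as 64 lanes, the helper names are hypothetical, and EXEC at the restore point is assumed to be a subset of EXEC at the save point, so every lane that is read was written earlier.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using LaneMask = std::uint64_t;             // one bit per lane, like an SGPR pair
using VGPR = std::array<std::uint32_t, 64>; // one 32-bit value per lane

// Toy v_cndmask_b32: active lanes pick Src1 where the condition bit is set and
// Src0 otherwise; inactive lanes keep their old value.
void vCndmaskB32(VGPR &Dst, LaneMask Exec, std::uint32_t Src0,
                 std::uint32_t Src1, LaneMask Cond) {
  for (unsigned Lane = 0; Lane < 64; ++Lane)
    if ((Exec >> Lane) & 1)
      Dst[Lane] = ((Cond >> Lane) & 1) ? Src1 : Src0;
}

// Toy v_cmp_ne_u32: active lanes report Src != Imm; bits of inactive lanes
// are written as 0.
LaneMask vCmpNeU32(const VGPR &Src, LaneMask Exec, std::uint32_t Imm) {
  LaneMask Result = 0;
  for (unsigned Lane = 0; Lane < 64; ++Lane)
    if (((Exec >> Lane) & 1) && Src[Lane] != Imm)
      Result |= LaneMask(1) << Lane;
  return Result;
}

// Old sequence: save the condition into a VGPR, restore it later with v_cmp.
LaneMask roundTripThroughVGPR(LaneMask Cond, LaneMask ExecAtSave,
                              LaneMask ExecAtRestore) {
  assert((ExecAtRestore & ~ExecAtSave) == 0 && "toy assumption: EXEC only shrinks");
  VGPR Tmp;
  Tmp.fill(0xDEADBEEF); // stale contents of lanes that are never written
  vCndmaskB32(Tmp, ExecAtSave, 0, 0xFFFFFFFFu, Cond);
  return vCmpNeU32(Tmp, ExecAtRestore, 0);
}

// New sequence: a scalar AND of the original condition with EXEC.
LaneMask andWithExec(LaneMask Cond, LaneMask ExecAtRestore) {
  return ExecAtRestore & Cond;
}
```

Under that assumption both functions return the same mask, which is why the scalar AND with EXEC (S_AND_B64 in the diff) can stand in for the v_cmp and leave the v_cndmask_b32 without users.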

Herald edited edge metadata. · Nov 17 2016, 4:38 PM

Reopened for corrected version. Previous commit was reverted.

This revision is now accepted and ready to land. · Nov 17 2016, 4:40 PM

rampitec requested a review of this revision. · Nov 17 2016, 4:41 PM

rampitec edited edge metadata.

This seems correct to me.

It could be quite beneficial to have a general pass running quite late that optimizes away s[i:i+1] & EXEC instructions. This would allow lowering PHIs of i1 as straightforward and-with-exec in the predecessor blocks + bitwise-or in the block containing the PHI, and it would help with some of the WholeQuadMode changes that I still need to get around to.

In D26114#599905, @nhaehnle wrote:

This seems correct to me.

It could be quite beneficial to have a general pass running quite late that optimizes away s[i:i+1] & EXEC instructions. This would allow lowering PHIs of i1 as straightforward and-with-exec in the predecessor blocks + bitwise-or in the block containing the PHI, and it would help with some of the WholeQuadMode changes that I still need to get around to.

I'm thinking of two places to perform such optimization: SIFoldOperands.cpp and SIOptimizeExecMasking.cpp.

The first is preferable because that happens before register allocation so we may save a pair of SGPRs in a good case. The other may be also beneficial because PHIs are eliminated post RA. At the end of the day both may be needed.
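The i1 PHI lowering suggested in the quoted comment (and-with-exec in each predecessor, bitwise-or in the block containing the PHI) can be sketched on plain lane masks. This is an illustration only: the function names are hypothetical, the PHI is assumed to have two predecessors, each lane is assumed to arrive from at most one of them, and lanes arriving from neither get 0.

```cpp
#include <cassert>
#include <cstdint>

using LaneMask = std::uint64_t; // one bit per lane

// Reference semantics: each lane takes the incoming value of the predecessor
// it actually came from.
LaneMask phiPerLane(LaneMask ValA, LaneMask ExecA, LaneMask ValB,
                    LaneMask ExecB) {
  assert((ExecA & ExecB) == 0 && "a lane arrives from one predecessor only");
  LaneMask Result = 0;
  for (unsigned Lane = 0; Lane < 64; ++Lane) {
    LaneMask Bit = LaneMask(1) << Lane;
    if (ExecA & Bit)
      Result |= ValA & Bit;
    else if (ExecB & Bit)
      Result |= ValB & Bit;
  }
  return Result;
}

// Lowered form: mask each incoming value with EXEC in its predecessor, then
// OR the two masked values in the block containing the PHI.
LaneMask phiLowered(LaneMask ValA, LaneMask ExecA, LaneMask ValB,
                    LaneMask ExecB) {
  LaneMask FromA = ValA & ExecA; // and-with-exec in predecessor A
  LaneMask FromB = ValB & ExecB; // and-with-exec in predecessor B
  return FromA | FromB;          // bitwise-or at the join
}
```

A late pass that removes redundant and-with-exec instructions would then clean up the cases where the incoming value is already known to be zero in inactive lanes.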

rampitec added a reviewer: nhaehnle. · Nov 18 2016, 9:58 AM

It looks like it is too early to combine EXEC bit logic before the SI lower control flow pseudo instructions pass. Before that point the code is:

%vreg20<def> = S_AND_B64 %EXEC, %vreg18, %SCC<imp-def>; SReg_64:%vreg20,%vreg18
%vreg3<def> = SI_IF %vreg20, <BB#3>, %EXEC<imp-def,dead>, %SCC<imp-def,dead>, %EXEC<imp-use>; SReg_64:%vreg3,%vreg20

After it we have two S_AND_B64 instructions which can be combined:

%vreg20<def> = S_AND_B64 %EXEC, %vreg18, %SCC<imp-def,dead>; SReg_64:%vreg20,%vreg18
%vreg46<def> = COPY %vreg19<kill>; VGPR_32:%vreg46,%vreg19
%vreg3<def> = COPY %EXEC, %EXEC<imp-def>; SReg_64:%vreg3
%vreg47<def> = S_AND_B64 %vreg3, %vreg20, %SCC<imp-def,dead>; SReg_64:%vreg47,%vreg3,%vreg20

Not the easiest transformation because of the COPY in between, but doable.

It looks like I have a patch to combine s_and_b64 and s_or_b64. I will bring it to review after this one is in place.

Ping

Added cleanup of logical operations with the exec mask introduced in SILowerI1Copies. The inserted S_AND_B64 will be combined with another S_AND_B64 produced by the lowering of SI_IF/SI_ELSE to finally form a single S_AND_SAVEEXEC_B64.

Herald edited edge metadata. · Nov 22 2016, 6:16 PM
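The cleanup relies on AND and OR being idempotent, so a nested operation that repeats one operand collapses to a single one. The sketch below checks that identity on plain bit masks; the names are hypothetical and it does not model the MachineInstr rewriting done by combineMasks in the diff.

```cpp
#include <cassert>
#include <cstdint>

enum class Op { And, Or };

std::uint64_t apply(Op O, std::uint64_t A, std::uint64_t B) {
  return O == Op::And ? (A & B) : (A | B);
}

// Nested form seen before the cleanup: outer(x, inner(x, y)).
std::uint64_t nested(Op O, std::uint64_t X, std::uint64_t Y) {
  return apply(O, X, apply(O, X, Y));
}

// Single operation left after the cleanup: op(x, y).
std::uint64_t combined(Op O, std::uint64_t X, std::uint64_t Y) {
  return apply(O, X, Y);
}

// Exhaustive check of the identity over all pairs of 8-bit masks.
bool cleanupIsSafeFor8BitMasks(Op O) {
  for (std::uint64_t X = 0; X < 256; ++X)
    for (std::uint64_t Y = 0; Y < 256; ++Y)
      if (nested(O, X, Y) != combined(O, X, Y))
        return false;
  return true;
}
```

Here x plays the role of the exec mask (or its copy) and y the hoisted condition.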

LGTM.

Please add diff with full context (-U999999) next time :)

lib/Target/AMDGPU/SILowerControlFlow.cpp
367	I would use const auto &SrcOp to avoid MachineOperand copy.

This revision is now accepted and ready to land. · Nov 25 2016, 6:15 AM

rampitec marked an inline comment as done. · Nov 28 2016, 11:02 AM

Updated per Valery's comment - added const auto &SrcOp.

Herald edited edge metadata. · Nov 28 2016, 11:05 AM

Closed by commit rL288053: [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition… (authored by rampitec). · Nov 28 2016, 11:08 AM

This revision was automatically updated to reflect the committed changes.

Diff 79426

lib/Target/AMDGPU/AMDGPUISelLowering.cpp

Context not available.

	setSchedulingPreference(Sched::RegPressure);	setSchedulingPreference(Sched::RegPressure);
	setJumpIsExpensive(true);	setJumpIsExpensive(true);
		setHasMultipleConditionRegisters(true);

	// SI at least has hardware support for floating point exceptions, but no way	// SI at least has hardware support for floating point exceptions, but no way
	// of using or handling them is implemented. They are also optional in OpenCL	// of using or handling them is implemented. They are also optional in OpenCL
Context not available.

lib/Target/AMDGPU/SILowerControlFlow.cpp

Context not available.
	void emitLoop(MachineInstr &MI);	void emitLoop(MachineInstr &MI);
	void emitEndCf(MachineInstr &MI);	void emitEndCf(MachineInstr &MI);

		void findMaskOperands(MachineInstr &MI, unsigned OpNo,
		SmallVectorImpl<MachineOperand> &Src) const;

		void combineMasks(MachineInstr &MI);

	public:	public:
	static char ID;	static char ID;

Context not available.
	LIS->handleMove(*NewMI);	LIS->handleMove(*NewMI);
	}	}

		// Returns replace operands for a logical operation, either single result
		// for exec or two operands if source was another equivalent operation.
		void SILowerControlFlow::findMaskOperands(MachineInstr &MI, unsigned OpNo,
		SmallVectorImpl<MachineOperand> &Src) const {
		MachineOperand &Op = MI.getOperand(OpNo);
		if (!Op.isReg() || !TargetRegisterInfo::isVirtualRegister(Op.getReg())) {
		Src.push_back(Op);
		return;
		}

		MachineInstr *Def = MRI->getUniqueVRegDef(Op.getReg());
		if (!Def || Def->getParent() != MI.getParent() ||
		!(Def->isFullCopy() || (Def->getOpcode() == MI.getOpcode())))
		return;

		// Make sure we do not modify exec between def and use.
		// A copy with implicitly defined exec inserted earlier is an exclusion, it
		// does not really modify exec.
		for (auto I = Def->getIterator(); I != MI.getIterator(); ++I)
		if (I->modifiesRegister(AMDGPU::EXEC, TRI) &&
		!(I->isCopy() && I->getOperand(0).getReg() != AMDGPU::EXEC))
		return;

		for (const auto &SrcOp : Def->explicit_operands())
		if (SrcOp.isUse() && (!SrcOp.isReg() ||
		TargetRegisterInfo::isVirtualRegister(SrcOp.getReg()) ||
		SrcOp.getReg() == AMDGPU::EXEC))
		Src.push_back(SrcOp);
		}

		// Search and combine pairs of equivalent instructions, like
		// S_AND_B64 x, (S_AND_B64 x, y) => S_AND_B64 x, y
		// S_OR_B64 x, (S_OR_B64 x, y) => S_OR_B64 x, y
		// One of the operands is exec mask.
		void SILowerControlFlow::combineMasks(MachineInstr &MI) {
		assert(MI.getNumExplicitOperands() == 3);
		SmallVector<MachineOperand, 4> Ops;
		unsigned OpToReplace = 1;
		findMaskOperands(MI, 1, Ops);
		if (Ops.size() == 1) OpToReplace = 2; // First operand can be exec or its copy
		findMaskOperands(MI, 2, Ops);
		if (Ops.size() != 3) return;

		unsigned UniqueOpndIdx;
		if (Ops[0].isIdenticalTo(Ops[1])) UniqueOpndIdx = 2;
		else if (Ops[0].isIdenticalTo(Ops[2])) UniqueOpndIdx = 1;
		else if (Ops[1].isIdenticalTo(Ops[2])) UniqueOpndIdx = 1;
		else return;

		unsigned Reg = MI.getOperand(OpToReplace).getReg();
		MI.RemoveOperand(OpToReplace);
		MI.addOperand(Ops[UniqueOpndIdx]);
		if (MRI->use_empty(Reg))
		MRI->getUniqueVRegDef(Reg)->eraseFromParent();
		}

	bool SILowerControlFlow::runOnMachineFunction(MachineFunction &MF) {	bool SILowerControlFlow::runOnMachineFunction(MachineFunction &MF) {
	const SISubtarget &ST = MF.getSubtarget<SISubtarget>();	const SISubtarget &ST = MF.getSubtarget<SISubtarget>();
	TII = ST.getInstrInfo();	TII = ST.getInstrInfo();
Context not available.
	NextBB = std::next(BI);	NextBB = std::next(BI);
	MachineBasicBlock &MBB = *BI;	MachineBasicBlock &MBB = *BI;

	MachineBasicBlock::iterator I, Next;	MachineBasicBlock::iterator I, Next, Last;

	for (I = MBB.begin(); I != MBB.end(); I = Next) {	for (I = MBB.begin(), Last = MBB.end(); I != MBB.end(); I = Next) {
	Next = std::next(I);	Next = std::next(I);
	MachineInstr &MI = *I;	MachineInstr &MI = *I;

Context not available.
	emitEndCf(MI);	emitEndCf(MI);
	break;	break;

		case AMDGPU::S_AND_B64:
		case AMDGPU::S_OR_B64:
		// Cleanup bit manipulations on exec mask
		combineMasks(MI);
		Last = I;
		continue;

	default:	default:
	break;	Last = I;
		continue;
	}	}

		// Replay newly inserted code to combine masks
		Next = (Last == MBB.end()) ? MBB.begin() : Last;
	}	}
	}	}

Context not available.

lib/Target/AMDGPU/SILowerI1Copies.cpp

Context not available.
	const TargetRegisterClass *DstRC = MRI.getRegClass(Dst.getReg());	const TargetRegisterClass *DstRC = MRI.getRegClass(Dst.getReg());
	const TargetRegisterClass *SrcRC = MRI.getRegClass(Src.getReg());	const TargetRegisterClass *SrcRC = MRI.getRegClass(Src.getReg());

		DebugLoc DL = MI.getDebugLoc();
		MachineInstr *DefInst = MRI.getUniqueVRegDef(Src.getReg());
	if (DstRC == &AMDGPU::VReg_1RegClass &&	if (DstRC == &AMDGPU::VReg_1RegClass &&
	TRI->getCommonSubClass(SrcRC, &AMDGPU::SGPR_64RegClass)) {	TRI->getCommonSubClass(SrcRC, &AMDGPU::SGPR_64RegClass)) {
	I1Defs.push_back(Dst.getReg());	I1Defs.push_back(Dst.getReg());
	DebugLoc DL = MI.getDebugLoc();

	MachineInstr *DefInst = MRI.getUniqueVRegDef(Src.getReg());
	if (DefInst->getOpcode() == AMDGPU::S_MOV_B64) {	if (DefInst->getOpcode() == AMDGPU::S_MOV_B64) {
	if (DefInst->getOperand(1).isImm()) {	if (DefInst->getOperand(1).isImm()) {
	I1Defs.push_back(Dst.getReg());	I1Defs.push_back(Dst.getReg());
Context not available.
	MI.eraseFromParent();	MI.eraseFromParent();
	} else if (TRI->getCommonSubClass(DstRC, &AMDGPU::SGPR_64RegClass) &&	} else if (TRI->getCommonSubClass(DstRC, &AMDGPU::SGPR_64RegClass) &&
	SrcRC == &AMDGPU::VReg_1RegClass) {	SrcRC == &AMDGPU::VReg_1RegClass) {
	BuildMI(MBB, &MI, MI.getDebugLoc(), TII->get(AMDGPU::V_CMP_NE_U32_e64))	if (DefInst->getOpcode() == AMDGPU::V_CNDMASK_B32_e64 &&
	.addOperand(Dst)	DefInst->getOperand(1).isImm() && DefInst->getOperand(2).isImm() &&
	.addOperand(Src)	DefInst->getOperand(1).getImm() == 0 &&
	.addImm(0);	DefInst->getOperand(2).getImm() != 0 &&
		DefInst->getOperand(3).isReg() &&
		TargetRegisterInfo::isVirtualRegister(
		DefInst->getOperand(3).getReg()) &&
		TRI->getCommonSubClass(
		MRI.getRegClass(DefInst->getOperand(3).getReg()),
		&AMDGPU::SGPR_64RegClass)) {
		BuildMI(MBB, &MI, DL, TII->get(AMDGPU::S_AND_B64))
		.addOperand(Dst)
		.addReg(AMDGPU::EXEC)
		.addOperand(DefInst->getOperand(3));
		} else {
		BuildMI(MBB, &MI, DL, TII->get(AMDGPU::V_CMP_NE_U32_e64))
		.addOperand(Dst)
		.addOperand(Src)
		.addImm(0);
		}
	MI.eraseFromParent();	MI.eraseFromParent();
	}	}
	}	}
Context not available.

test/CodeGen/AMDGPU/branch-relaxation.ll

Context not available.
	; GCN: s_setpc_b64	; GCN: s_setpc_b64

	; GCN: [[LONG_BR_DEST0]]	; GCN: [[LONG_BR_DEST0]]
	; GCN: s_cmp_eq_u32	; GCN: v_cmp_ne_u32_e32
	; GCN-NEXT: ; implicit-def	; GCN-NEXT: ; implicit-def
	; GCN-NEXT: s_cbranch_scc0	; GCN-NEXT: s_cbranch_vccz
	; GCN: s_setpc_b64	; GCN: s_setpc_b64

	; GCN: s_endpgm	; GCN: s_endpgm
Context not available.

test/CodeGen/AMDGPU/hoist-cond.ll

This file was added.

				; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck %s

				; Check that invariant compare is hoisted out of the loop.
				; At the same time condition shall not be serialized into a VGPR and deserialized later
				; using another v_cmp + v_cndmask, but used directly in s_and_saveexec_b64.

				; CHECK: v_cmp_{{..}}_u32_e64 [[COND:s\[[0-9]+:[0-9]+\]]]
				; CHECK: BB0_1:
				; CHECK-NOT: v_cmp
				; CHECK_NOT: v_cndmask
				; CHECK: s_and_saveexec_b64 s[{{[[0-9]+:[0-9]+}}], [[COND]]
				; CHECK: BB0_2:

				define amdgpu_kernel void @hoist_cond(float addrspace(1)* nocapture %arg, float addrspace(1)* noalias nocapture readonly %arg1, i32 %arg3, i32 %arg4) {
				bb:
				%tmp = tail call i32 @llvm.amdgcn.workitem.id.x() #0
				%tmp5 = icmp ult i32 %tmp, %arg3
				br label %bb1

				bb1: ; preds = %bb3, %bb
				%tmp7 = phi i32 [ %arg4, %bb ], [ %tmp16, %bb3 ]
				%tmp8 = phi float [ 0.000000e+00, %bb ], [ %tmp15, %bb3 ]
				br i1 %tmp5, label %bb2, label %bb3

				bb2: ; preds = %bb1
				%tmp10 = zext i32 %tmp7 to i64
				%tmp11 = getelementptr inbounds float, float addrspace(1)* %arg1, i64 %tmp10
				%tmp12 = load float, float addrspace(1)* %tmp11, align 4
				br label %bb3

				bb3: ; preds = %bb2, %bb1
				%tmp14 = phi float [ %tmp12, %bb2 ], [ 0.000000e+00, %bb1 ]
				%tmp15 = fadd float %tmp8, %tmp14
				%tmp16 = add i32 %tmp7, -1
				%tmp17 = icmp eq i32 %tmp16, 0
				br i1 %tmp17, label %bb4, label %bb1

				bb4: ; preds = %bb3
				store float %tmp15, float addrspace(1)* %arg, align 4
				ret void
				}

				; Function Attrs: nounwind readnone
				declare i32 @llvm.amdgcn.workitem.id.x() #0

				attributes #0 = { nounwind readnone }

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition copies
ClosedPublic
