Details

Reviewers

rtereshin
dsanders

Summary

At the moment, MachineCSE allows CSE-ing convergent instrs that are non-local to each other. This can cause illegal codegen, because convergent instrs are control-flow dependent. The patch prevents non-local CSE of convergent instrs by adding a check in isProfitableToCSE that rejects the CSE when the two convergent instrs are in different basic blocks. Convergent instrs in the same control flow scope can still be CSE-d, so the patch purposely does not reject all convergent instrs in isCSECandidate.
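
As a rough illustration of the rule described in the summary, the check can be modelled outside LLVM. This is only a sketch: `ToyInstr` and `mayReuseAcrossBlocks` are hypothetical names invented here, and basic blocks are reduced to integer ids. It is not the code from the patch.

```cpp
#include <cassert>

// Hypothetical stand-in for the parts of MachineInstr this sketch needs.
struct ToyInstr {
  bool Convergent; // models MI->isConvergent()
  int ParentBlock; // models MI->getParent(), reduced to a block id
};

// Returns true if the toy CSE may replace MI with an identical instruction
// that lives in block CSBlock.
bool mayReuseAcrossBlocks(const ToyInstr &MI, int CSBlock) {
  assert(CSBlock >= 0 && MI.ParentBlock >= 0 && "block ids are non-negative");
  // A convergent instruction may only be merged with a copy in its own block.
  if (MI.Convergent && CSBlock != MI.ParentBlock)
    return false;
  return true;
}
```

In this model a convergent instruction in block 2 cannot reuse a copy from block 1, while a non-convergent one can, and a convergent one can still reuse a copy from its own block.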

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mkitzan created this revision. Apr 23 2021, 11:02 AM

Herald added a subscriber: hiraditya. Apr 23 2021, 11:02 AM

mkitzan requested review of this revision. Apr 23 2021, 11:02 AM

Herald added a project: Restricted Project. Apr 23 2021, 11:02 AM

Herald added a subscriber: llvm-commits.

The change LGTM but could you add a test case? There probably aren't many convergent instructions upstream but it should be possible to make a test case using AMDGPU instructions or via G_INTRINSIC

Harbormaster completed remote builds in B100630: Diff 340107. Apr 23 2021, 1:39 PM

Update: added handwritten MIR unit test for the MachineCSE change using AMDGPU's DS_SWIZZLE_B32 instr (which is marked isConvergent in llvm/lib/Target/AMDGPU/DSInstructions.td)

Herald added subscribers: kerbowa, nhaehnle, jvesely. Apr 23 2021, 3:09 PM

dsanders accepted this revision. Apr 23 2021, 4:15 PM

dsanders added inline comments.

llvm/test/CodeGen/AMDGPU/GlobalISel/no-cse-nonlocal-convergent-instrs.mir
Line 9: CHECK-LABEL is about partitioning the input into multiple pieces that can be checked independently rather than about labels. LGTM with this and the bb.2 one below as either CHECK/CHECK-NEXT

This revision is now accepted and ready to land. Apr 23 2021, 4:15 PM

Ah ok, good to know. Thanks for the review! Changing them to CHECK.

Update: changed basic block checks from CHECK-LABEL to CHECK

mkitzan mentioned this in rG59f2dd5f1acd: [MachineCSE] Prevent CSE of non-local convergent instrs. Apr 23 2021, 4:53 PM

Harbormaster completed remote builds in B100679: Diff 340176. Apr 23 2021, 5:25 PM

dsanders requested changes to this revision. Apr 23 2021, 6:17 PM

dsanders added inline comments.

llvm/test/CodeGen/AMDGPU/GlobalISel/no-cse-nonlocal-convergent-instrs.mir
Line 54: It's been pointed out to me off-list that CSE'ing to here isn't actually banned by isConvergent; it's just one of the cases we conservatively decline to CSE in the change. To be covered by isConvergent, it'd have to be CSE'd into a more/differently predicated block (less is ok). Furthermore, the other cases where we wouldn't be conservative are already prevented by other checks in CSE. If we can find the field we actually mean, this patch will only need a small change. I haven't been able to find it though; it doesn't seem to exist in the backend, and that's probably what's gotten me confused (I don't think this is the first time either :-)). That actually reminded me of something else to double check: Does this CSE without the change too?

This revision now requires changes to proceed. Apr 23 2021, 6:17 PM

arsenm added a subscriber: arsenm. Apr 23 2021, 6:22 PM

arsenm added inline comments.

llvm/test/CodeGen/AMDGPU/GlobalISel/no-cse-nonlocal-convergent-instrs.mir
Line 54: The definition of convergent is pretty broken. For AMDGPU, in the MIR the control flow as represented by basic blocks no longer expresses the lane-level CFG, which is what we're concerned with for convergent ops. In the future, when we have convergence tokens, it's not clear to me if we'll somehow preserve those through codegen. It's best to just not CSE convergent operations. It's really unlikely it would be worthwhile, if it's even legal.

Harbormaster completed remote builds in B100696: Diff 340193. Apr 23 2021, 6:44 PM

rtereshin added inline comments. Apr 23 2021, 6:59 PM

llvm/test/CodeGen/AMDGPU/GlobalISel/no-cse-nonlocal-convergent-instrs.mir
Line 54: Maybe we should at least put a comment saying just that, Matt, on the change made in MachineCSE? Otherwise I'm afraid it's way too easy to remove the check and be technically right about it. Thoughts? A green light from AMDGPU for this patch is very helpful though, thank you.

lkail added a subscriber: lkail. Apr 24 2021, 1:17 AM

foad added a subscriber: foad. Apr 26 2021, 1:40 AM

foad added inline comments.

llvm/lib/CodeGen/MachineCSE.cpp
Line 437: If this is a correctness issue then surely it should not be done inside "is profitable to cse"?

dsanders added inline comments. Apr 26 2021, 2:58 PM

llvm/test/CodeGen/AMDGPU/GlobalISel/no-cse-nonlocal-convergent-instrs.mir
Line 54:
> It's best to just not CSE convergent operations. It's really unlikely it would be worthwhile if it's even legal

I'd be ok with that (with an explanatory comment). I could believe that any that were legal and worthwhile probably already happened during LLVM-IR.

dsanders added inline comments. Apr 26 2021, 3:01 PM

llvm/test/CodeGen/AMDGPU/GlobalISel/no-cse-nonlocal-convergent-instrs.mir
Line 54:
> I'd be ok with that (with an explanatory comment). I could believe that any that were legal and worthwhile probably already happened during LLVM-IR.

Just to clarify, I don't mean to prevent cases within the same BB there. Those happen and are legal and worthwhile.

My suggestion is to keep making progress here:

1. move the check out of "is profitable" to the top level of processBlock
2. put a comprehensive comment on it outlining the issues discussed here (and off Phabricator) so far
3. do (2) in the test as well (and keep the test otherwise as is)

The issues include:
a) isConvergent, as currently defined in LLVM, does not prove cross-block MachineCSE illegal. However, with this change the MachineCSE pass takes the liberty of extending the definition of isConvergent as a practical necessity. The extension is: "assume it is illegal to make a convergent operation dependent not only on additional conditions, but also on fewer conditions than originally".
b) The current open-source GPU backends, as is, do not appear to allow a reasonably simple test case that provably and undeniably breaks functionally without the proposed MachineCSE change. As a result, the test being added is merely a coverage test for the change, not a reproducer of an actual (execution) problem in the AMDGPU backend.

And we merge it from there. This is a conditional LGTM from me, conditions are above. Thanks!
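
The difference between the two readings in (a) can be sketched with a toy model in which each operation is described by the set of branch conditions it depends on. Everything here is an assumption made for illustration: the `ToyConditions` type, both function names, and the idea of representing control dependence as a set of strings are not part of LLVM.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>

// Hypothetical model: the branch conditions an operation is control-dependent on.
using ToyConditions = std::set<std::string>;

// Reading of the current definition: a convergent op must not become dependent
// on additional conditions, so After has to be a subset of Before.
bool legalUnderCurrentDefinition(const ToyConditions &Before,
                                 const ToyConditions &After) {
  return std::includes(Before.begin(), Before.end(),
                       After.begin(), After.end());
}

// Extension taken by the patch: fewer conditions are rejected as well, so the
// two sets have to match exactly.
bool legalUnderExtendedDefinition(const ToyConditions &Before,
                                  const ToyConditions &After) {
  return Before == After;
}
```

In the test from this revision, the swizzle in bb.2 sits under two branches and the one in bb.1 under one, so reusing the bb.1 result corresponds to going from two conditions to one: accepted by the first function in this model, rejected by the second.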

Following Roman's suggestions, the update:

Moves the code preventing CSE of isConvergent instrs from isProfitableToCSE into ProcessBlockCSE
Adds comments in MachineCSE and the test explaining why isConvergent is checked to prevent CSE
Adds a comment in the test explaining that the test does not reproduce an AMDGPU backend bug, but rather is a coverage test for the MachineCSE change

Thanks for all the feedback!
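
The restructuring in this update, with the legality check running before any profitability heuristic, can be sketched in isolation. This is a toy model only: `ToyCandidate`, `toyIsLegal`, `toyIsProfitable` and `decide` are hypothetical names, and the profitability rule is a heavily simplified stand-in for the heuristics in MachineCSE.

```cpp
#include <cassert>

// Hypothetical summary of one CSE opportunity; not an LLVM type.
struct ToyCandidate {
  bool Convergent; // the instruction is marked convergent
  bool SameBlock;  // the reusable copy lives in the same basic block
  bool Cheap;      // the instruction is as cheap as a move
};

enum class Decision { Illegal, Unprofitable, DoCSE };

// Correctness: convergent instructions are only merged within one block.
bool toyIsLegal(const ToyCandidate &C) { return !C.Convergent || C.SameBlock; }

// Heuristic (simplified): do not reuse a cheap computation from another block.
bool toyIsProfitable(const ToyCandidate &C) { return !C.Cheap || C.SameBlock; }

// Legality is decided first, so tuning the heuristic cannot re-enable an
// illegal transformation.
Decision decide(const ToyCandidate &C) {
  if (!toyIsLegal(C))
    return Decision::Illegal;
  if (!toyIsProfitable(C))
    return Decision::Unprofitable;
  return Decision::DoCSE;
}
```

Keeping the two predicates separate is one way to address the concern that a correctness check hidden inside a profitability function could be removed by someone who is only adjusting heuristics.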

Harbormaster completed remote builds in B101502: Diff 341316. Apr 28 2021, 4:38 PM

foad added inline comments. Apr 29 2021, 2:40 AM

llvm/lib/CodeGen/MachineCSE.cpp
Line 604: Do we also need this check in ProcessBlockPRE?

LGTM with the check in ProcessBlockPRE

llvm/lib/CodeGen/MachineCSE.cpp
Line 604: I think it's needed there too

This revision is now accepted and ready to land. Apr 29 2021, 9:57 AM

rtereshin added inline comments. Apr 29 2021, 2:03 PM

llvm/lib/CodeGen/MachineCSE.cpp
Line 604: @mkitzan IIUC (which might not be the case) PRE not checking for isConvergent is a genuine bug, unlike the CSE part: PRE moves ops into predicated blocks, making them more predicated than before, which is illegal for isConvergent. If that's the case, perhaps for PRE the `isConvergent` check could be a part of `isPRECandidate`.

Update:

Added isConvergent check in ProcessBlockPRE

Note: @rtereshin and I talked off-list about whether PRE not checking for isConvergent is a bug, and it was determined that for MachineCSE's implementation of PRE it is not a bug.

Harbormaster completed remote builds in B102365: Diff 342509. May 3 2021, 3:48 PM

LGTM

Forgot to link the differential before pushing, but latest update is in a11489ae3e36063c64921439cbab89d1f3280f4a

Diff 340176

llvm/lib/CodeGen/MachineCSE.cpp

 /// isProfitableToCSE - Return true if it's profitable to eliminate MI with a
 /// common expression that defines Reg. CSBB is basic block where CSReg is
 /// defined.
 bool MachineCSE::isProfitableToCSE(Register CSReg, Register Reg,
                                    MachineBasicBlock *CSBB, MachineInstr *MI) {
   // FIXME: Heuristics that works around the lack the live range splitting.

+  MachineBasicBlock *BB = MI->getParent();
+  // Prevent CSE-ing non-local convergent instructions.
+  if (MI->isConvergent() && CSBB != BB)
+    return false;
+
   // If CSReg is used at all uses of Reg, CSE should not increase register
   // pressure of CSReg.
   bool MayIncreasePressure = true;
   if (Register::isVirtualRegister(CSReg) && Register::isVirtualRegister(Reg)) {
     MayIncreasePressure = false;
     SmallPtrSet<MachineInstr*, 8> CSUses;
     for (MachineInstr &MI : MRI->use_nodbg_instructions(CSReg)) {
       CSUses.insert(&MI);
     }
     for (MachineInstr &MI : MRI->use_nodbg_instructions(Reg)) {
       if (!CSUses.count(&MI)) {
         MayIncreasePressure = true;
         break;
       }
     }
   }
   if (!MayIncreasePressure) return true;

   // Heuristics #1: Don't CSE "cheap" computation if the def is not local or in
   // an immediate predecessor. We don't want to increase register pressure and
   // end up causing other computation to be spilled.
   if (TII->isAsCheapAsAMove(*MI)) {
-    MachineBasicBlock *BB = MI->getParent();
     if (CSBB != BB && !CSBB->isSuccessor(BB))
       return false;
   }

   // Heuristics #2: If the expression doesn't not use a vr and the only use
   // of the redundant computation are copies, do not cse.
   bool HasVRegUse = false;
   for (const MachineOperand &MO : MI->operands()) {
[... 125 lines not shown; enclosing loop: for (MachineBasicBlock::iterator I = MBB->begin(), E = MBB->end(); I != E; ) { ...]
    bool DoCSE = true;
    unsigned NumDefs = MI->getNumDefs();

    for (unsigned i = 0, e = MI->getNumOperands(); NumDefs && i != e; ++i) {
      MachineOperand &MO = MI->getOperand(i);
      if (!MO.isReg() || !MO.isDef())
        continue;
      Register OldReg = MO.getReg();
      Register NewReg = CSMI->getOperand(i).getReg();

      // Go through implicit defs of CSMI and MI, if a def is not dead at MI,
      // we should make sure it is not dead at CSMI.
      if (MO.isImplicit() && !MO.isDead() && CSMI->getOperand(i).isDead())
        ImplicitDefsToUpdate.push_back(i);

      // Keep track of implicit defs of CSMI and MI, to clear possibly
      // made-redundant kill flags.
[...]

llvm/test/CodeGen/AMDGPU/GlobalISel/no-cse-nonlocal-convergent-instrs.mir

This file was added.

				# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1010 -o - -run-pass=machine-cse %s | FileCheck %s

				# Check that we don't CSE non-local convergent instrs. Otherwise, reusing defs
				# of convergent instrs from different control flow scopes can cause illegal
				# codegen. Previously, the swizzle in bb2 would be CSE-ed in favor of using the
				# swizzle in bb1 despite bb2 being a different control flow scope.

				# CHECK-LABEL: name: no_cse
				# CHECK-LABEL: bb.1.if.then
				# CHECK: [[SWIZZLE1:%[0-9]+]]:vgpr_32 = DS_SWIZZLE_B32 [[SRC:%[0-9]+]], 100, 0, implicit $exec
				# CHECK-NEXT: V_ADD_CO_U32_e64 [[SWIZZLE1]], {{%[0-9]+}}, 0, implicit $exec
				# CHECK-NEXT: S_CMP_LT_I32 {{.*}} implicit-def $scc
				# CHECK-NEXT: S_CBRANCH_SCC1 %bb.3, implicit $scc
				# CHECK-NEXT: S_BRANCH %bb.2
				# CHECK-LABEL: bb.2.if.then.if.then
				# CHECK: [[SWIZZLE2:%[0-9]+]]:vgpr_32 = DS_SWIZZLE_B32 [[SRC]], 100, 0, implicit $exec
				# CHECK-NEXT: V_ADD_CO_U32_e64 [[SWIZZLE2]], {{%[0-9]+}}, 0, implicit $exec

				--- |
				define amdgpu_kernel void @no_cse(i32 addrspace(1)*, i32, i1) {
				entry:
				unreachable
				if.then:
				unreachable
				if.then.if.then:
				unreachable
				if.then.phi:
				unreachable
				exit:
				unreachable
				}
				...
				---
				name: no_cse
				tracksRegLiveness: true
				body: |
				bb.0.entry:
				liveins: $sgpr4_sgpr5
				%0:sgpr_64(p4) = COPY $sgpr4_sgpr5
				%1:sreg_64_xexec = S_LOAD_DWORDX2_IMM %0(p4), 0, 0
				%2:sreg_64_xexec = S_LOAD_DWORDX2_IMM %0(p4), 2, 0
				%3:sreg_64 = COPY %1
				%4:sreg_32 = COPY %2.sub1
				%5:sreg_32 = S_MOV_B32 42
				S_CMP_EQ_U32 %4, %5, implicit-def $scc
				%6:vgpr_32 = COPY %5, implicit $exec
				S_CBRANCH_SCC1 %bb.4, implicit $scc
				S_BRANCH %bb.1

				bb.1.if.then:
				%7:sreg_32 = COPY %2.sub0
				%8:vgpr_32 = COPY %7
				%9:vgpr_32 = DS_SWIZZLE_B32 %8, 100, 0, implicit $exec
				%10:vgpr_32, %21:sreg_32 = V_ADD_CO_U32_e64 %9, %5, 0, implicit $exec
				S_CMP_LT_I32 %7, %5, implicit-def $scc
				S_CBRANCH_SCC1 %bb.3, implicit $scc
				S_BRANCH %bb.2

				bb.2.if.then.if.then:
				%11:sreg_32 = S_MOV_B32 64
				%12:vgpr_32 = DS_SWIZZLE_B32 %8, 100, 0, implicit $exec
				%13:vgpr_32, %24:sreg_32 = V_ADD_CO_U32_e64 %12, %11, 0, implicit $exec

				bb.3.if.then.phi:
				%14:vgpr_32 = PHI %10, %bb.1, %13, %bb.2

				bb.4.exit:
				%15:vgpr_32 = PHI %6, %bb.0, %14, %bb.3
				%16:vreg_64 = COPY %3
				FLAT_STORE_DWORD %16, %15, 0, 0, implicit $exec, implicit $flat_scr
				S_ENDPGM 0

				...

This is an archive of the discontinued LLVM Phabricator instance.

[MachineCSE] Prevent CSE of non-local convergent instrs
ClosedPublic
