This is an archive of the discontinued LLVM Phabricator instance.

Fix for the OCL/LC to failure on some OCLPerf tests
Needs ReviewPublic

Authored by alex-t on May 29 2019, 11:25 AM.

Download Raw Diff

Details

Reviewers

rampitec
nhaehnle

Summary

The reason is: uniform value that is defined inside the loop body and used outside the loop must be treated as divergent if loop has at least one divergent exit.

Diff Detail

Event Timeline

alex-t created this revision.May 29 2019, 11:25 AM

Herald added subscribers: nhaehnle, jvesely. · View Herald TranscriptMay 29 2019, 11:25 AM

alex-t added a reviewer: rampitec.May 29 2019, 11:26 AM

alex-t added a subscriber: llvm-commits.

rampitec added inline comments.May 29 2019, 11:51 AM

lib/Analysis/LegacyDivergenceAnalysis.cpp
112 ↗	(On Diff #202000)	Describe the return value.
180 ↗	(On Diff #202000)	Given that pass only runs when TTI.hasBranchDivergence() this change seems reasonable to me. I do not really believe an index of a loop with the divergent condition is uniform. It is only uniform across the enabled lanes, but not across of the whole wave. Basically you make it divergent or uniform depending on uses, which makes it a semi-uniform value, more or less. I think this logic needs a better description though.

Added comments describing the reason for the change.

Herald added a project: Restricted Project. · View Herald TranscriptMay 30 2019, 4:37 AM

Herald added a subscriber: hiraditya. · View Herald Transcript

alex-t added inline comments.May 30 2019, 4:37 AM

lib/Analysis/LegacyDivergenceAnalysis.cpp
180 ↗	(On Diff #202000)	Since all threads of the wavefront that are live in the loop body observe same value. Thus, formally the value is uniform. The user outside the loop in LCSSA is, again formally, uses another value - LCSSA PHI-node, that is divergent. The problem arises when converting LCSSA back to normal form we insert the copy SGPR to VGPR. The truth is that the value inside the loop is uniform but NOT scalar! The proper fix would be to split the decision by two independent parts: Is the value uniform - !DA->isDivergent(V) Is it SALU or VALU: TargetLowering::isVALU(V) - this should check out-of-divergent-loop-users and as much other conditions as is necessary for the target.

alex-t marked 2 inline comments as done.May 30 2019, 4:37 AM

alex-t added a reviewer: nhaehnle.May 30 2019, 6:59 AM

LGTM

This revision is now accepted and ready to land.May 30 2019, 8:41 AM

In general I see the issue this way: there is no binary answer if value is divergent or uniform. It can be uniform across some scope: some number of lanes, active lanes, wave, workgroup, or even a whole program. So you need a wave uniform value, where DA returns uniformness across active lanes, which is less.

So in the future we need to consider returning uniformness scope from isUniform() instead of a bool.

Changing the meaning of "uniform" in this way is just bound to lead to confusion further down the line, so please can we not do that?

I agree with Stas that the notion of "uniform" alone is simply not enough, though I'd phrase it this way: we need to distinguish between uniform at definition and uniform at use. The value inside the loop is uniform at definition, but it is not uniform when used outside the loop with divergent exit.

This revision now requires changes to proceed.May 31 2019, 12:28 AM

See D60834 for what I think is the right direction to fix this.

In D62614#1524597, @nhaehnle wrote:

See D60834 for what I think is the right direction to fix this.

Once again:

LCSSA is not sufficient for the following reason:
Since definition is considered uniform it is selected to SALU and produces SGPR.
The value in this SGPR is correct for all laves leave in loop body.
LCSSA PHI node that inserted in loop exit block will turn into the copy from SGPR to VGPR – this is incorrect.

I don't agree that the enhancement the definition of the "divergence" to the scope is correct way at all.
literally, the value is uniform if all threads observe same value. Nothing about exec mask, lanes or GPU :)
All threads in our case are all executing loop body. That's it.

My goal is to select instructions to SALU or VALU form in dependence of the both DA results and context.
Assigning correct register classes to the cross block values is part of this task. The problem is that current implementation only consults to DA making the decision.
All I need is to augment the DA results with the context.

Also please note: this review is about temporary solution that allow to not revert https://reviews.llvm.org/D59990#1521409

Incremental changes a easier to integrate.

In D62614#1524852, @alex-t wrote:

In D62614#1524597, @nhaehnle wrote:

See D60834 for what I think is the right direction to fix this.

Once again:

LCSSA is not sufficient for the following reason:
Since definition is considered uniform it is selected to SALU and produces SGPR.
The value in this SGPR is correct for all laves leave in loop body.
LCSSA PHI node that inserted in loop exit block will turn into the copy from SGPR to VGPR – this is incorrect.

Why isn't sgpr to vgpr copy correct? If it is done still inside the loop it is suboptimal but correct. When done outside of the loop it is incorrect, that's true.

added the Divergent Analysis test update that was missed

Herald added a subscriber: jholewinski. · View Herald TranscriptMay 31 2019, 8:45 AM

In D62614#1524592, @nhaehnle wrote:

Changing the meaning of "uniform" in this way is just bound to lead to confusion further down the line, so please can we not do that?

I agree with Stas that the notion of "uniform" alone is simply not enough, though I'd phrase it this way: we need to distinguish between uniform at definition and uniform at use. The value inside the loop is uniform at definition, but it is not uniform when used outside the loop with divergent exit.

To me "uniform at def" vs "uniform at use" it to vague. Essentially what is different between use and def is the context in terms of the threads affected, i.e. exactly the scope. You value can be uniform at def, and them suddenly become divergent even before the use, just because you have changed the exec mask. That is not use makes it divergent, but the control flow context.

In D62614#1525150, @rampitec wrote:

In D62614#1524852, @alex-t wrote:

In D62614#1524597, @nhaehnle wrote:

See D60834 for what I think is the right direction to fix this.

Once again:

LCSSA is not sufficient for the following reason:
Since definition is considered uniform it is selected to SALU and produces SGPR.
The value in this SGPR is correct for all laves leave in loop body.
LCSSA PHI node that inserted in loop exit block will turn into the copy from SGPR to VGPR – this is incorrect.

Why isn't sgpr to vgpr copy correct? If it is done still inside the loop it is suboptimal but correct. When done outside of the loop it is incorrect, that's true.

I obviously meant outside :)
I answered to Nicolai about LCSSA. I not yet sure why but the move is always placed in "Flow" bock that is outside the loop.

Alternative fix.

Herald added a subscriber: arsenm. · View Herald TranscriptJun 2 2019, 9:26 AM

Needs test.

nhaehnle added inline comments.Jun 3 2019, 12:10 AM

lib/Target/AMDGPU/SIInstrInfo.cpp
4174–4180	Can you give an example of why this is necessary? Assuming we have LCSSA phis, I don't understand the reason for this.
5969–5979	shouldSink should never be used for correctness. See the comment: /// MachineSink determines on its own whether the instruction is safe to sink; /// this gives the target a hook to override the default behavior with regards /// to which instructions should be sunk. I suspect the right fix is to ensure that COPY instructions inserted by legalizeGenericOperands are marked with an implicit use of EXEC when a VGPR is involved.

alex-t marked 2 inline comments as done.Jun 3 2019, 7:59 AM

alex-t added inline comments.

lib/Target/AMDGPU/SIInstrInfo.cpp
4174–4180	The copy may be placed to the block produced by the edge splitting like below: the copy of %14 in the example below will be placed to the bb.4.Flow that is already out of the loop body. bb.2.Flow7: ; predecessors: %bb.0, %bb.4 successors: %bb.5(0x80000000); %bb.5(100.00%) %7:vgpr_32 = PHI %108:vgpr_32, %bb.0, %14:sreg_32_xm0, %bb.4 SI_END_CF %6:sreg_64, implicit-def dead $exec, implicit-def dead $scc, implicit $exec S_BRANCH %bb.5 bb.3.for.body: ; predecessors: %bb.1, %bb.3 successors: %bb.4(0x04000000), %bb.3(0x7c000000); %bb.4(3.12%), %bb.3(96.88%) %14:sreg_32_xm0 = S_ADD_I32 %10:sreg_32_xm0, killed %86:sreg_32_xm0, implicit-def dead $scc SI_LOOP %15:sreg_64, %bb.3, implicit-def dead $exec, implicit-def dead $scc, implicit $exec S_BRANCH %bb.4 bb.4.Flow: ; predecessors: %bb.3 successors: %bb.2(0x80000000); %bb.2(100.00%) SI_END_CF %15:sreg_64, implicit-def dead $exec, implicit-def dead $scc, implicit $exec S_BRANCH %bb.2 bb.5.for.end: ; predecessors: %bb.2 %16:vgpr_32 = PHI %7:vgpr_32, %bb.2
5969–5979	Comment says: "this gives the target a hook to override the default behavior with regards to which instructions should be sunk.". So, I don't understand why it "should never be used for correctness". Also, adding extra operands to the copies leads Peephole optimizer to crash in assert Assertion failed: Def->getNumOperands() == 2 && "Invalid number of operands", file llvm\lib\CodeGen\PeepholeOptimizer.cpp, line 1811 The Si Fix VGPR Copies pass adds implicit uses of EXEC later on...

rampitec added inline comments.Jun 3 2019, 9:04 AM

lib/Target/AMDGPU/SIInstrInfo.cpp
5974	I think even V to V copy is illegal to move into a non-control flow equivalent region.

hakzsam mentioned this in D59990: AMDGPU. Divergence driven ISel. Assign register class for cross block values according to the divergence..Jun 5 2019, 12:14 AM

nhaehnle added inline comments.Jun 5 2019, 12:30 AM

lib/Target/AMDGPU/SIInstrInfo.cpp
4174–4180	That PHI is not an LCSSA phi. Is this with a compilation that has D60834?
5969–5979	It should never be used for correctness because the comment says so: MachineSink determines on its own whether the instruction is safe to sink. This change makes that comment incorrect. IMO it would be better to remove the assertion in the ValueTracker. Sure, it claims that bad things will happen everywhere, but we're already using COPY with an implicit EXEC use through much of the compiler pipeline without bad things happening, and this is the semantically right way forward.

Revision Contents

Path

Size

lib/

Target/

AMDGPU/

SIFixSGPRCopies.cpp

2 lines

SIInstrInfo.h

2 lines

SIInstrInfo.cpp

20 lines

Diff 202607

lib/Target/AMDGPU/SIFixSGPRCopies.cpp

Show First 20 Lines • Show All 625 Lines • ▼ Show 20 Lines	for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
break;		break;
}		}
}		}
unsigned PHIRes = MI.getOperand(0).getReg();		unsigned PHIRes = MI.getOperand(0).getReg();
const TargetRegisterClass *RC0 = MRI.getRegClass(PHIRes);		const TargetRegisterClass *RC0 = MRI.getRegClass(PHIRes);

if ((!TRI->isVGPR(MRI, PHIRes) && RC0 != &AMDGPU::VReg_1RegClass) &&		if ((!TRI->isVGPR(MRI, PHIRes) && RC0 != &AMDGPU::VReg_1RegClass) &&
(hasVGPRInput \|\| hasVGPRUses > 1)) {		(hasVGPRInput \|\| hasVGPRUses > 1)) {
TII->moveToVALU(MI);		TII->moveToVALU(MI, MDT);
} else {		} else {
TII->legalizeOperands(MI, MDT);		TII->legalizeOperands(MI, MDT);
}		}

break;		break;
}		}
case AMDGPU::REG_SEQUENCE:		case AMDGPU::REG_SEQUENCE:
if (TRI->hasVGPRs(TII->getOpRegClass(MI, 0)) \|\|		if (TRI->hasVGPRs(TII->getOpRegClass(MI, 0)) \|\|
Show All 30 Lines

lib/Target/AMDGPU/SIInstrInfo.h

Show First 20 Lines • Show All 933 Lines • ▼ Show 20 Lines	public:
static bool isLegalMUBUFImmOffset(unsigned Imm) {		static bool isLegalMUBUFImmOffset(unsigned Imm) {
return isUInt<12>(Imm);		return isUInt<12>(Imm);
}		}

/// \brief Return a target-specific opcode if Opcode is a pseudo instruction.		/// \brief Return a target-specific opcode if Opcode is a pseudo instruction.
/// Return -1 if the target-specific opcode for the pseudo instruction does		/// Return -1 if the target-specific opcode for the pseudo instruction does
/// not exist. If Opcode is not a pseudo instruction, this is identity.		/// not exist. If Opcode is not a pseudo instruction, this is identity.
int pseudoToMCOpcode(int Opcode) const;		int pseudoToMCOpcode(int Opcode) const;

		bool shouldSink(const MachineInstr &MI) const override;
};		};

/// \brief Returns true if a reg:subreg pair P has a TRC class		/// \brief Returns true if a reg:subreg pair P has a TRC class
inline bool isOfRegClass(const TargetInstrInfo::RegSubRegPair &P,		inline bool isOfRegClass(const TargetInstrInfo::RegSubRegPair &P,
const TargetRegisterClass &TRC,		const TargetRegisterClass &TRC,
MachineRegisterInfo &MRI) {		MachineRegisterInfo &MRI) {
auto *RC = MRI.getRegClass(P.Reg);		auto *RC = MRI.getRegClass(P.Reg);
if (!P.SubReg)		if (!P.SubReg)
▲ Show 20 Lines • Show All 113 Lines • Show Last 20 Lines

lib/Target/AMDGPU/SIInstrInfo.cpp

Show First 20 Lines • Show All 4,165 Lines • ▼ Show 20 Lines	for (unsigned I = 1, E = MI.getNumOperands(); I != E; I += 2) {
MachineOperand &Op = MI.getOperand(I);		MachineOperand &Op = MI.getOperand(I);
if (!Op.isReg() \|\| !TargetRegisterInfo::isVirtualRegister(Op.getReg()))		if (!Op.isReg() \|\| !TargetRegisterInfo::isVirtualRegister(Op.getReg()))
continue;		continue;

// MI is a PHI instruction.		// MI is a PHI instruction.
MachineBasicBlock *InsertBB = MI.getOperand(I + 1).getMBB();		MachineBasicBlock *InsertBB = MI.getOperand(I + 1).getMBB();
MachineBasicBlock::iterator Insert = InsertBB->getFirstTerminator();		MachineBasicBlock::iterator Insert = InsertBB->getFirstTerminator();

		unsigned Reg = Op.getReg();
		MachineInstr *DefMI = MRI.getUniqueVRegDef(Reg);
		if (DefMI && DefMI->getParent() != InsertBB &&
		MDT->dominates(DefMI->getParent(), InsertBB)) {
		InsertBB = DefMI->getParent();
		Insert = InsertBB->getFirstTerminator();
		}
		nhaehnleUnsubmitted Not Done Reply Inline Actions Can you give an example of why this is necessary? Assuming we have LCSSA phis, I don't understand the reason for this. nhaehnle: Can you give an example of why this is necessary? Assuming we have LCSSA phis, I don't…
		alex-tAuthorUnsubmitted Done Reply Inline Actions The copy may be placed to the block produced by the edge splitting like below: the copy of %14 in the example below will be placed to the bb.4.Flow that is already out of the loop body. bb.2.Flow7: ; predecessors: %bb.0, %bb.4 successors: %bb.5(0x80000000); %bb.5(100.00%) %7:vgpr_32 = PHI %108:vgpr_32, %bb.0, %14:sreg_32_xm0, %bb.4 SI_END_CF %6:sreg_64, implicit-def dead $exec, implicit-def dead $scc, implicit $exec S_BRANCH %bb.5 bb.3.for.body: ; predecessors: %bb.1, %bb.3 successors: %bb.4(0x04000000), %bb.3(0x7c000000); %bb.4(3.12%), %bb.3(96.88%) %14:sreg_32_xm0 = S_ADD_I32 %10:sreg_32_xm0, killed %86:sreg_32_xm0, implicit-def dead $scc SI_LOOP %15:sreg_64, %bb.3, implicit-def dead $exec, implicit-def dead $scc, implicit $exec S_BRANCH %bb.4 bb.4.Flow: ; predecessors: %bb.3 successors: %bb.2(0x80000000); %bb.2(100.00%) SI_END_CF %15:sreg_64, implicit-def dead $exec, implicit-def dead $scc, implicit $exec S_BRANCH %bb.2 bb.5.for.end: ; predecessors: %bb.2 %16:vgpr_32 = PHI %7:vgpr_32, %bb.2 alex-t: The copy may be placed to the block produced by the edge splitting like below: the copy of…
		nhaehnleUnsubmitted Not Done Reply Inline Actions That PHI is not an LCSSA phi. Is this with a compilation that has D60834? nhaehnle: That PHI is not an LCSSA phi. Is this with a compilation that has D60834?

// Avoid creating no-op copies with the same src and dst reg class. These		// Avoid creating no-op copies with the same src and dst reg class. These
// confuse some of the machine passes.		// confuse some of the machine passes.
legalizeGenericOperand(*InsertBB, Insert, RC, Op, MRI, MI.getDebugLoc());		legalizeGenericOperand(*InsertBB, Insert, RC, Op, MRI, MI.getDebugLoc());
}		}
}		}

// REG_SEQUENCE doesn't really require operand legalization, but if one has a		// REG_SEQUENCE doesn't really require operand legalization, but if one has a
// VGPR dest type and SGPR sources, insert copies so all operands are		// VGPR dest type and SGPR sources, insert copies so all operands are
▲ Show 20 Lines • Show All 1,770 Lines • ▼ Show 20 Lines	bool llvm::isEXECMaskConstantBetweenDefAndUses(unsigned VReg,
for (auto I = std::next(DefI->getIterator()); I != E; ++I) {		for (auto I = std::next(DefI->getIterator()); I != E; ++I) {
Uses.erase(&*I);		Uses.erase(&*I);
// don't check the last use		// don't check the last use
if (Uses.empty() \|\| I->modifiesRegister(AMDGPU::EXEC, TRI))		if (Uses.empty() \|\| I->modifiesRegister(AMDGPU::EXEC, TRI))
break;		break;
}		}
return Uses.empty();		return Uses.empty();
}		}

		bool SIInstrInfo::shouldSink(const MachineInstr &MI) const {
		const MachineRegisterInfo &MRI = MI.getParent()->getParent()->getRegInfo();
		if (MI.isCopy()) {
		unsigned SrcReg = MI.getOperand(1).getReg();
		unsigned DstReg = MI.getOperand(0).getReg();
		if (RI.isSGPRReg(MRI, SrcReg) && !RI.isSGPRReg(MRI, DstReg)) {
		rampitecUnsubmitted Not Done Reply Inline Actions I think even V to V copy is illegal to move into a non-control flow equivalent region. rampitec: I think even V to V copy is illegal to move into a non-control flow equivalent region.
		return false;
		}
		}
		return true;
		}
		nhaehnleUnsubmitted Not Done Reply Inline Actions shouldSink should never be used for correctness. See the comment: /// MachineSink determines on its own whether the instruction is safe to sink; /// this gives the target a hook to override the default behavior with regards /// to which instructions should be sunk. I suspect the right fix is to ensure that COPY instructions inserted by legalizeGenericOperands are marked with an implicit use of EXEC when a VGPR is involved. nhaehnle: shouldSink should never be used for correctness. See the comment: ``` /// MachineSink…
		alex-tAuthorUnsubmitted Done Reply Inline Actions Comment says: "this gives the target a hook to override the default behavior with regards to which instructions should be sunk.". So, I don't understand why it "should never be used for correctness". Also, adding extra operands to the copies leads Peephole optimizer to crash in assert Assertion failed: Def->getNumOperands() == 2 && "Invalid number of operands", file llvm\lib\CodeGen\PeepholeOptimizer.cpp, line 1811 The Si Fix VGPR Copies pass adds implicit uses of EXEC later on... alex-t: Comment says: "this gives the target a hook to override the default behavior with regards to…
		nhaehnleUnsubmitted Not Done Reply Inline Actions It should never be used for correctness because the comment says so: MachineSink determines on its own whether the instruction is safe to sink. This change makes that comment incorrect. IMO it would be better to remove the assertion in the ValueTracker. Sure, it claims that bad things will happen everywhere, but we're already using COPY with an implicit EXEC use through much of the compiler pipeline without bad things happening, and this is the semantically right way forward. nhaehnle: It should never be used for correctness because the comment says so: **MachineSink determines…
		No newline at end of file