Details

Reviewers

rampitec
tpr
arsenm
nhaehnle

Commits

rG2d6a2303f83d: [AMDGPU] Fix-up cases where writelane has 2 SGPR operands
rL375004: [AMDGPU] Fix-up cases where writelane has 2 SGPR operands

Summary

Even though writelane doesn't have the same constraints as other VALU
instructions, it still can't use more than one SGPR operand.

Because of later register propagation (e.g. fixing up VGPR operands via
readfirstlane), changing writelane to only have a single SGPR is tricky.

This implementation puts a new check after SIFixSGPRCopies that prevents
multiple SGPRs being used in any writelane instructions.

The algorithm checks for trivial copy propagation of a suitable constant into
one of the SGPR operands and performs it if possible. If this isn't possible,
it puts an explicit copy of the Src1 SGPR into M0 and uses that instead (this is
allowable for writelane as the constraint is on the SGPR read port and not
on constant-bus access).
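
The two-step fix-up described in the summary can be sketched outside of LLVM as follows. This is a toy model under stated assumptions, not the patch itself: `Operand`, `Kind`, `isInlineInteger` and `fixWritelane` are hypothetical names, and the inline-constant check only models the integer range -16..64.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for machine operands; none of these are LLVM types.
enum class Kind { SGPR, VGPR, M0, Imm };

struct Operand {
  Kind kind;
  bool hasKnownConst; // SGPR is trivially known to hold `value`
  long value;         // known constant, or the immediate when kind == Imm
};

// Assumption: only the integer inline-constant range is modelled here.
inline bool isInlineInteger(long v) { return v >= -16 && v <= 64; }

// Rewrites src0/src1 so that at most one plain SGPR is read.
// Returns the (textual) instructions inserted before the writelane.
std::vector<std::string> fixWritelane(Operand &src0, Operand &src1) {
  std::vector<std::string> inserted;
  // M0 and immediates do not count against the one-SGPR limit.
  if (src0.kind != Kind::SGPR || src1.kind != Kind::SGPR)
    return inserted;

  // Step 1: trivial constant propagation into either operand.
  Operand *candidates[] = {&src0, &src1};
  for (Operand *op : candidates) {
    if (op->hasKnownConst && isInlineInteger(op->value)) {
      op->kind = Kind::Imm;
      op->hasKnownConst = false;
      return inserted;
    }
  }

  // Step 2: route src1 through M0 instead.
  inserted.push_back("m0 = COPY src1");
  src1.kind = Kind::M0;
  src1.hasKnownConst = false;
  assert(src0.kind == Kind::SGPR); // exactly one plain SGPR remains
  return inserted;
}
```

Calling `fixWritelane` on two plain SGPR operands with no known constants takes the M0 path; if either operand is known to hold an inline constant, it becomes an immediate and nothing is inserted.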

Diff Detail

Repository

rL LLVM

Build Status

Buildable 22595
Build 22595: arc lint + arc unit

Event Timeline

dstuttard created this revision. Sep 11 2018, 7:40 AM

Harbormaster completed remote builds in B22481: Diff 164878. Sep 11 2018, 7:40 AM

Herald added subscribers: llvm-commits, t-tye, tpr and 7 others. Sep 11 2018, 7:40 AM

dstuttard added reviewers: rampitec, tpr. Sep 11 2018, 7:41 AM

rampitec added inline comments. Sep 11 2018, 12:39 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
456	Why cannot you use loop in the runOnMachineFunction()?
481	You do not need vector here. for (auto MO : {&Src0, &Src1})

dstuttard added inline comments. Sep 11 2018, 2:01 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
456	I think we need all the sgpr to vgpr moves to have been completed before applying this fix since in some cases it might not be necessary. I guess there's an argument for this to be done in a separate pass, or a later pass, then it could go into runOnMachineFunction - any suggestions?
481	Good point, I'll make this one.

rampitec added inline comments. Sep 11 2018, 2:15 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
456	Given the semantics of writelane it is hard to believe its sources will be moved to VALU. Also if that is going to happen in general, it should have already happened by the time iterator would reach the instruction.

Made suggested changes

Harbormaster completed remote builds in B22549: Diff 165104. Sep 12 2018, 9:38 AM

dstuttard marked 5 inline comments as done. Sep 12 2018, 9:38 AM

dstuttard added inline comments.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
456	Yes - I think you're right. I've changed it to be part of the main runOnMachineFunction.

rampitec added inline comments. Sep 12 2018, 12:01 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
750	missing break.

Added missing break

Harbormaster completed remote builds in B22583: Diff 165215. Sep 13 2018, 1:33 AM

dstuttard added inline comments. Sep 13 2018, 1:34 AM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
750	Doh

LGTM

This revision is now accepted and ready to land. Sep 13 2018, 1:37 AM

arsenm requested changes to this revision. Sep 13 2018, 1:54 AM

arsenm added inline comments.

lib/Target/AMDGPU/Utils/AMDGPUMCUtils.cpp
23–25 ↗	(On Diff #165215)	This seems more like something for TII

This revision now requires changes to proceed. Sep 13 2018, 1:54 AM

Should have a special check in the verifier

Moved foldToImm into SIInstrInfo as suggested
Implemented check in verifyInstruction and checked that it worked when the fix was removed
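
A minimal sketch of the kind of verifier check described here, assuming a simplified operand model (`SrcOperand`, `kM0` and `verifyWritelane` are hypothetical names, not LLVM API). The check in the diff tracks the last SGPR seen; this version collects distinct registers in a set instead.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical operand model; not an LLVM type.
struct SrcOperand {
  bool isSGPR;  // operand is a scalar register
  unsigned reg; // register number (kM0 denotes M0)
};

const unsigned kM0 = 0xFFFFFFFFu; // placeholder id for M0

// Returns false and sets err if more than one distinct non-M0 SGPR is read.
bool verifyWritelane(const std::vector<SrcOperand> &srcs, std::string &err) {
  std::set<unsigned> sgprs;
  for (const SrcOperand &s : srcs)
    if (s.isSGPR && s.reg != kM0)
      sgprs.insert(s.reg);
  if (sgprs.size() > 1) {
    err = "WRITELANE instruction uses more than one SGPR";
    return false;
  }
  return true;
}
```

Reading the same SGPR twice, or one SGPR plus M0, passes in this model; two different SGPRs fail with the error string used in the diff.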

Harbormaster completed remote builds in B22595: Diff 165274. Sep 13 2018, 6:51 AM

dstuttard marked 3 inline comments as done. Sep 13 2018, 6:52 AM

I don't actually understand why this code is where it is? Why is SIFixSGPRCopies doing this? To clarify is this just an optimization? My initial reaction was that it was a fix, but looking at it again it seems like an optimization to me

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
744–746	This should be a COPY?

Updated a mov to a copy as per review comment

In D51932#1236779, @arsenm wrote:

I don't actually understand why this code is where it is? Why is SIFixSGPRCopies doing this? To clarify is this just an optimization? My initial reaction was that it was a fix, but looking at it again it seems like an optimization to me

It isn't an optimization - it's a bug. We encountered this in graphics shaders - hence the requirement for the fix.

SIFixSGPRCopies does feel like a strange place to put this fix. It has to be somewhere late enough to catch the issue since it's a transformation that happens after isel that causes the problem in this case. Have you got any other suggestions as to where it could go instead? FixSGPR copies isn't perfect, but doesn't seem too unreasonable as it is an SGPR related fix-up.

LGTM, to be honest. Matt?

@arsenm Matt - good to go?

ping

LGTM

This revision is now accepted and ready to land. Jun 14 2019, 7:20 AM

Herald added a project: Restricted Project. Jun 14 2019, 7:20 AM

GFX10 support has gone in since this change was approved - gfx10 allows 2 sgprs
on the constant bus. Implementation updated to allow for this.

Also updated a MIR test that had an incorrect WRITELANE instruction with 2 SGPR
accesses (one of which was VCC_LO).
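
The GFX10 update can be pictured as making the limit depend on the hardware generation. The sketch below is illustrative only: `Gen`, `sgprLimit` and `needsFixup` are hypothetical names, and the limits (1 before GFX10, 2 on GFX10) are taken from the update note rather than from the LLVM sources.

```cpp
#include <cassert>

// Hypothetical generation tag; not an LLVM type.
enum class Gen { PreGFX10, GFX10 };

// SGPR reads allowed per instruction: 1 before GFX10, 2 on GFX10.
inline unsigned sgprLimit(Gen g) { return g == Gen::GFX10 ? 2u : 1u; }

// True if a writelane reading `distinctSGPRs` non-M0 SGPRs needs the fix-up.
inline bool needsFixup(unsigned distinctSGPRs, Gen g) {
  assert(distinctSGPRs <= 2); // writelane has two source operands
  return distinctSGPRs > sgprLimit(g);
}
```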

Harbormaster completed remote builds in B38254: Diff 220667. Sep 18 2019, 7:49 AM

arsenm added inline comments. Sep 18 2019, 8:12 AM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
715–716	I think there's a missing word here; " as the lane selector and doesn't"
717	Missing period
733–735	Is this really necessary? Will SIFoldOperands not get this for some reason?
lib/Target/AMDGPU/SIInstrInfo.cpp
2993	Extra empty line
2995	Should use Register
5353	I would expect this to handle only a single vreg def

Made some changes based on review

Harbormaster completed remote builds in B38299: Diff 220882. Sep 19 2019, 10:12 AM

dstuttard marked 9 inline comments as done. Sep 19 2019, 10:17 AM

dstuttard added inline comments.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
715–716	I think this might be a difference in language - so I've changed it to make it less confusing
733–735	Yes, SIFoldOperands can replace the operand with the immediate, but the problem is that this code is forced to change one of the SGPR operands to M0 if it doesn't know if the other can be immediate. If we just detect that an immediate will be folded, but don't do it, then the verification step will fail. So, the simplest thing to do is detect trivial cases and change the operand to an immediate. In some cases it is possible that it will not detect an immediate, replace the second SGPR with M0 and then the SIFoldOperands will replace the other SGPR operand with an immediate. This case is rare and functionally correct, just slightly less efficient. I've reverted moving the foldToImm function from the SDWA peephole pass and implemented a simpler detection of immediates.
lib/Target/AMDGPU/SIInstrInfo.cpp
5353	I reverted moving the foldToImm function to SIInstrInfo - it was simpler to write a new one that attempted to do less.

dstuttard marked 3 inline comments as done. Sep 20 2019, 1:29 AM

arsenm added inline comments. Sep 20 2019, 9:49 AM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
737	No auto
738–739	The hasOneDef check is suspicious. You should be able to check getVRegDef and just a null check. This is missing a guard for virtual registers
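
The suggestion above (guard for virtual registers, then rely on a null check of the single definition) can be illustrated with a toy model. Everything here is hypothetical and only mimics the shape of the idea: `RegInfo`, `Def` and `findImmediate` are not LLVM API, and treating ids of 1000 and above as virtual registers is an assumption made for the example.

```cpp
#include <cassert>
#include <map>

// Toy model; the names below are hypothetical, not LLVM API.
struct Def {
  bool isMoveImm; // defining instruction is a move of an immediate
  long imm;
};

struct RegInfo {
  std::map<unsigned, Def> defs; // at most one recorded def per register

  // Assumption for this toy: ids >= 1000 are virtual, lower ids are physical.
  static bool isVirtual(unsigned reg) { return reg >= 1000; }

  // Returns the recorded def of a register, or nullptr if there is none.
  const Def *getVRegDef(unsigned reg) const {
    std::map<unsigned, Def>::const_iterator it = defs.find(reg);
    return it == defs.end() ? nullptr : &it->second;
  }
};

// Writes the constant to `out` and returns true if `reg` trivially holds one.
bool findImmediate(const RegInfo &mri, unsigned reg, long &out) {
  if (!RegInfo::isVirtual(reg))
    return false; // guard: only virtual registers are looked up
  const Def *def = mri.getVRegDef(reg);
  if (!def || !def->isMoveImm)
    return false; // a null check stands in for a separate hasOneDef query
  out = def->imm;
  return true;
}
```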

Updates in light of review comments

Harbormaster completed remote builds in B38413: Diff 221267. Sep 23 2019, 1:59 AM

Made suggested changes - is this more in line with what you were thinking Matt?

Matt - are you now happy for me to submit this? (It is tagged as approved, but since you've made some extra comments I'm waiting for you to agree with the latest changes).

ping

I think this should be good to go.

LGTM

arsenm added inline comments. Oct 8 2019, 12:41 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
732–733	I'm working on a patch to stop reserving m0; I suspect this will avoid the need for the special case propagation

Closed by commit rG2d6a2303f83d: [AMDGPU] Fix-up cases where writelane has 2 SGPR operands (authored by dstuttard). Oct 16 2019, 7:40 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: hiraditya. Oct 16 2019, 7:40 AM

Diff 165274

lib/Target/AMDGPU/SIFixSGPRCopies.cpp

Show First 20 Lines • Show All 447 Lines • ▼ Show 20 Lines	static bool hoistAndMergeSGPRInits(unsigned Reg,
const MachineRegisterInfo &MRI,		const MachineRegisterInfo &MRI,
MachineDominatorTree &MDT) {		MachineDominatorTree &MDT) {
// List of inits by immediate value.		// List of inits by immediate value.
using InitListMap = std::map<unsigned, std::list<MachineInstr *>>;		using InitListMap = std::map<unsigned, std::list<MachineInstr *>>;
InitListMap Inits;		InitListMap Inits;
// List of clobbering instructions.		// List of clobbering instructions.
SmallVector<MachineInstr*, 8> Clobbers;		SmallVector<MachineInstr*, 8> Clobbers;
bool Changed = false;		bool Changed = false;

for (auto &MI : MRI.def_instructions(Reg)) {		for (auto &MI : MRI.def_instructions(Reg)) {
MachineOperand *Imm = nullptr;		MachineOperand *Imm = nullptr;
for (auto &MO: MI.operands()) {		for (auto &MO: MI.operands()) {
if ((MO.isReg() && ((MO.isDef() && MO.getReg() != Reg) || !MO.isDef())) ||		if ((MO.isReg() && ((MO.isDef() && MO.getReg() != Reg) || !MO.isDef())) ||
(!MO.isImm() && !MO.isReg()) || (MO.isImm() && Imm)) {		(!MO.isImm() && !MO.isReg()) || (MO.isImm() && Imm)) {
Imm = nullptr;		Imm = nullptr;
break;		break;
} else if (MO.isImm())		} else if (MO.isImm())
Imm = &MO;		Imm = &MO;
}		}
if (Imm)		if (Imm)
Inits[Imm->getImm()].push_front(&MI);		Inits[Imm->getImm()].push_front(&MI);
else		else
Clobbers.push_back(&MI);		Clobbers.push_back(&MI);
}		}

for (auto &Init : Inits) {		for (auto &Init : Inits) {
auto &Defs = Init.second;		auto &Defs = Init.second;

for (auto I1 = Defs.begin(), E = Defs.end(); I1 != E; ) {		for (auto I1 = Defs.begin(), E = Defs.end(); I1 != E; ) {
MachineInstr *MI1 = *I1;		MachineInstr *MI1 = *I1;

for (auto I2 = std::next(I1); I2 != E; ) {		for (auto I2 = std::next(I1); I2 != E; ) {
MachineInstr *MI2 = *I2;		MachineInstr *MI2 = *I2;

// Check any possible interference		// Check any possible interference
auto intereferes = [&](MachineBasicBlock::iterator From,		auto intereferes = [&](MachineBasicBlock::iterator From,
MachineBasicBlock::iterator To) -> bool {		MachineBasicBlock::iterator To) -> bool {

assert(MDT.dominates(&*To, &*From));		assert(MDT.dominates(&*To, &*From));

auto interferes = [&MDT, From, To](MachineInstr* &Clobber) -> bool {		auto interferes = [&MDT, From, To](MachineInstr* &Clobber) -> bool {
const MachineBasicBlock *MBBFrom = From->getParent();		const MachineBasicBlock *MBBFrom = From->getParent();
▲ Show 20 Lines • Show All 209 Lines • ▼ Show 20 Lines	for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
Src1RC = MRI.getRegClass(MI.getOperand(2).getReg());		Src1RC = MRI.getRegClass(MI.getOperand(2).getReg());
if (TRI->isSGPRClass(DstRC) &&		if (TRI->isSGPRClass(DstRC) &&
(TRI->hasVGPRs(Src0RC) || TRI->hasVGPRs(Src1RC))) {		(TRI->hasVGPRs(Src0RC) || TRI->hasVGPRs(Src1RC))) {
LLVM_DEBUG(dbgs() << " Fixing INSERT_SUBREG: " << MI);		LLVM_DEBUG(dbgs() << " Fixing INSERT_SUBREG: " << MI);
TII->moveToVALU(MI);		TII->moveToVALU(MI);
}		}
break;		break;
}		}
		case AMDGPU::V_WRITELANE_B32: {
		// Writelane is special in that it can use SGPR and M0 (which would
		// normally
		// count as using the constant bus twice - but in this case it is
		// allowed as the lane selector doesn't count as a use of the constant
		// bus). However, it is still required to abide by the 1 SGPR rule Apply
		// a fix here as we might have multiple SGPRs after legalizing VGPRs to
		// SGPRs
		int Src0Idx =
		AMDGPU::getNamedOperandIdx(MI.getOpcode(), AMDGPU::OpName::src0);
		int Src1Idx =
		AMDGPU::getNamedOperandIdx(MI.getOpcode(), AMDGPU::OpName::src1);
		MachineOperand &Src0 = MI.getOperand(Src0Idx);
		MachineOperand &Src1 = MI.getOperand(Src1Idx);

		// Check to see if the instruction violates the 1 SGPR rule
		if ((Src0.isReg() && TRI->isSGPRReg(MRI, Src0.getReg()) &&
		Src0.getReg() != AMDGPU::M0) &&
		(Src1.isReg() && TRI->isSGPRReg(MRI, Src1.getReg()) &&
		Src1.getReg() != AMDGPU::M0)) {

		// Check for trivially easy constant prop into one of the operands
		// If this is the case then perform the operation now to resolve SGPR
		// issue
		bool Resolved = false;
		for (auto MO : {&Src0, &Src1}) {
		auto Imm = TII->foldToImm(*MO, &MRI);
		if (Imm && TII->isInlineConstant(APInt(64, *Imm, true))) {
		MO->ChangeToImmediate(*Imm);
		Resolved = true;
		break;
		}
		}

		if (!Resolved) {
		// Haven't managed to resolve by replacing an SGPR with an immediate
		// Move src1 to be in M0
		BuildMI(*MI.getParent(), MI, MI.getDebugLoc(),
		TII->get(AMDGPU::S_MOV_B32), AMDGPU::M0)
		.add(Src1);
		Src1.ChangeToRegister(AMDGPU::M0, false);
		}
		}
		break;
		}
}		}
}		}
}		}

if (MF.getTarget().getOptLevel() > CodeGenOpt::None && EnableM0Merge)		if (MF.getTarget().getOptLevel() > CodeGenOpt::None && EnableM0Merge)
hoistAndMergeSGPRInits(AMDGPU::M0, MRI, *MDT);		hoistAndMergeSGPRInits(AMDGPU::M0, MRI, *MDT);

return true;		return true;
}		}

lib/Target/AMDGPU/SIInstrInfo.h

Show First 20 Lines • Show All 891 Lines • ▼ Show 20 Lines	static bool isLegalMUBUFImmOffset(unsigned Imm) {
return isUInt<12>(Imm);		return isUInt<12>(Imm);
}		}

/// \brief Return a target-specific opcode if Opcode is a pseudo instruction.		/// \brief Return a target-specific opcode if Opcode is a pseudo instruction.
/// Return -1 if the target-specific opcode for the pseudo instruction does		/// Return -1 if the target-specific opcode for the pseudo instruction does
/// not exist. If Opcode is not a pseudo instruction, this is identity.		/// not exist. If Opcode is not a pseudo instruction, this is identity.
int pseudoToMCOpcode(int Opcode) const;		int pseudoToMCOpcode(int Opcode) const;

		/// \brief Return immediate value of operand if possible to do so
		Optional<int64_t> foldToImm(const MachineOperand &Op,
		const MachineRegisterInfo *MRI) const;

};		};

namespace AMDGPU {		namespace AMDGPU {

LLVM_READONLY		LLVM_READONLY
int getVOPe64(uint16_t Opcode);		int getVOPe64(uint16_t Opcode);

LLVM_READONLY		LLVM_READONLY

lib/Target/AMDGPU/SIInstrInfo.cpp

Show First 20 Lines • Show All 2,981 Lines • ▼ Show 20 Lines	if (Desc.getOpcode() != AMDGPU::V_WRITELANE_B32
}		}

if (isVOP3(MI) && LiteralCount) {		if (isVOP3(MI) && LiteralCount) {
ErrInfo = "VOP3 instruction uses literal";		ErrInfo = "VOP3 instruction uses literal";
return false;		return false;
}		}
}		}

		// Special case for writelane - this can break the multiple constant bus rule,
		// but still can't use more than one SGPR register
		if (Desc.getOpcode() == AMDGPU::V_WRITELANE_B32) {

		unsigned SGPRCount = 0;
		unsigned SGPRUsed = AMDGPU::NoRegister;

		for (int OpIdx : {Src0Idx, Src1Idx, Src2Idx}) {
		if (OpIdx == -1)
		break;

		const MachineOperand &MO = MI.getOperand(OpIdx);

		if (usesConstantBus(MRI, MO, MI.getDesc().OpInfo[OpIdx])) {
		if (MO.isReg() && MO.getReg() != AMDGPU::M0) {
		if (MO.getReg() != SGPRUsed)
		++SGPRCount;
		SGPRUsed = MO.getReg();
		}
		}
		if (SGPRCount > 1) {
		ErrInfo = "WRITELANE instruction uses more than one SGPR";
		return false;
		}
		}
		}

// Verify misc. restrictions on specific instructions.		// Verify misc. restrictions on specific instructions.
if (Desc.getOpcode() == AMDGPU::V_DIV_SCALE_F32 ||		if (Desc.getOpcode() == AMDGPU::V_DIV_SCALE_F32 ||
Desc.getOpcode() == AMDGPU::V_DIV_SCALE_F64) {		Desc.getOpcode() == AMDGPU::V_DIV_SCALE_F64) {
const MachineOperand &Src0 = MI.getOperand(Src0Idx);		const MachineOperand &Src0 = MI.getOperand(Src0Idx);
const MachineOperand &Src1 = MI.getOperand(Src1Idx);		const MachineOperand &Src1 = MI.getOperand(Src1Idx);
const MachineOperand &Src2 = MI.getOperand(Src2Idx);		const MachineOperand &Src2 = MI.getOperand(Src2Idx);
if (Src0.isReg() && Src1.isReg() && Src2.isReg()) {		if (Src0.isReg() && Src1.isReg() && Src2.isReg()) {
if (!compareMachineOp(Src0, Src1) &&		if (!compareMachineOp(Src0, Src1) &&
▲ Show 20 Lines • Show All 2,303 Lines • ▼ Show 20 Lines	int SIInstrInfo::pseudoToMCOpcode(int Opcode) const {

// (uint16_t)-1 means that Opcode is a pseudo instruction that has		// (uint16_t)-1 means that Opcode is a pseudo instruction that has
// no encoding in the given subtarget generation.		// no encoding in the given subtarget generation.
if (MCOp == (uint16_t)-1)		if (MCOp == (uint16_t)-1)
return -1;		return -1;

return MCOp;		return MCOp;
}		}

		static bool isSameReg(const MachineOperand &LHS, const MachineOperand &RHS) {
		return LHS.isReg() &&
		RHS.isReg() &&
		LHS.getReg() == RHS.getReg() &&
		LHS.getSubReg() == RHS.getSubReg();
		}

		Optional<int64_t> SIInstrInfo::foldToImm(const MachineOperand &Op,
		const MachineRegisterInfo *MRI) const {
		if (Op.isImm()) {
		return Op.getImm();
		}

		// If this is not immediate then it can be copy of immediate value, e.g.:
		// %1 = S_MOV_B32 255;
		if (Op.isReg()) {
		for (const MachineOperand &Def : MRI->def_operands(Op.getReg())) {
		if (!isSameReg(Op, Def))
		continue;

		const MachineInstr *DefInst = Def.getParent();
		if (!isFoldableCopy(*DefInst))
		return None;

		const MachineOperand &Copied = DefInst->getOperand(1);
		if (!Copied.isImm())
		return None;

		return Copied.getImm();
		}
		}

		return None;
		}

lib/Target/AMDGPU/SIPeepholeSDWA.cpp

Show First 20 Lines • Show All 72 Lines • ▼ Show 20 Lines	private:
MachineRegisterInfo *MRI;		MachineRegisterInfo *MRI;
const SIRegisterInfo *TRI;		const SIRegisterInfo *TRI;
const SIInstrInfo *TII;		const SIInstrInfo *TII;

std::unordered_map<MachineInstr *, std::unique_ptr<SDWAOperand>> SDWAOperands;		std::unordered_map<MachineInstr *, std::unique_ptr<SDWAOperand>> SDWAOperands;
std::unordered_map<MachineInstr *, SDWAOperandsVector> PotentialMatches;		std::unordered_map<MachineInstr *, SDWAOperandsVector> PotentialMatches;
SmallVector<MachineInstr *, 8> ConvertedInstructions;		SmallVector<MachineInstr *, 8> ConvertedInstructions;

Optional<int64_t> foldToImm(const MachineOperand &Op) const;

public:		public:
static char ID;		static char ID;

SIPeepholeSDWA() : MachineFunctionPass(ID) {		SIPeepholeSDWA() : MachineFunctionPass(ID) {
initializeSIPeepholeSDWAPass(*PassRegistry::getPassRegistry());		initializeSIPeepholeSDWAPass(*PassRegistry::getPassRegistry());
}		}

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;
▲ Show 20 Lines • Show All 418 Lines • ▼ Show 20 Lines	bool SDWADstPreserveOperand::convertToSDWA(MachineInstr &MI,
// Tie dst to implicit use		// Tie dst to implicit use
MI.tieOperands(AMDGPU::getNamedOperandIdx(MI.getOpcode(), AMDGPU::OpName::vdst),		MI.tieOperands(AMDGPU::getNamedOperandIdx(MI.getOpcode(), AMDGPU::OpName::vdst),
MI.getNumOperands() - 1);		MI.getNumOperands() - 1);

// Convert MI as any other SDWADstOperand and remove v_or_b32		// Convert MI as any other SDWADstOperand and remove v_or_b32
return SDWADstOperand::convertToSDWA(MI, TII);		return SDWADstOperand::convertToSDWA(MI, TII);
}		}

Optional<int64_t> SIPeepholeSDWA::foldToImm(const MachineOperand &Op) const {
if (Op.isImm()) {
return Op.getImm();
}

// If this is not immediate then it can be copy of immediate value, e.g.:
// %1 = S_MOV_B32 255;
if (Op.isReg()) {
for (const MachineOperand &Def : MRI->def_operands(Op.getReg())) {
if (!isSameReg(Op, Def))
continue;

const MachineInstr *DefInst = Def.getParent();
if (!TII->isFoldableCopy(*DefInst))
return None;

const MachineOperand &Copied = DefInst->getOperand(1);
if (!Copied.isImm())
return None;

return Copied.getImm();
}
}

return None;
}

std::unique_ptr<SDWAOperand>		std::unique_ptr<SDWAOperand>
SIPeepholeSDWA::matchSDWAOperand(MachineInstr &MI) {		SIPeepholeSDWA::matchSDWAOperand(MachineInstr &MI) {
unsigned Opcode = MI.getOpcode();		unsigned Opcode = MI.getOpcode();
switch (Opcode) {		switch (Opcode) {
case AMDGPU::V_LSHRREV_B32_e32:		case AMDGPU::V_LSHRREV_B32_e32:
case AMDGPU::V_ASHRREV_I32_e32:		case AMDGPU::V_ASHRREV_I32_e32:
case AMDGPU::V_LSHLREV_B32_e32:		case AMDGPU::V_LSHLREV_B32_e32:
case AMDGPU::V_LSHRREV_B32_e64:		case AMDGPU::V_LSHRREV_B32_e64:
case AMDGPU::V_ASHRREV_I32_e64:		case AMDGPU::V_ASHRREV_I32_e64:
case AMDGPU::V_LSHLREV_B32_e64: {		case AMDGPU::V_LSHLREV_B32_e64: {
// from: v_lshrrev_b32_e32 v1, 16/24, v0		// from: v_lshrrev_b32_e32 v1, 16/24, v0
// to SDWA src:v0 src_sel:WORD_1/BYTE_3		// to SDWA src:v0 src_sel:WORD_1/BYTE_3

// from: v_ashrrev_i32_e32 v1, 16/24, v0		// from: v_ashrrev_i32_e32 v1, 16/24, v0
// to SDWA src:v0 src_sel:WORD_1/BYTE_3 sext:1		// to SDWA src:v0 src_sel:WORD_1/BYTE_3 sext:1

// from: v_lshlrev_b32_e32 v1, 16/24, v0		// from: v_lshlrev_b32_e32 v1, 16/24, v0
// to SDWA dst:v1 dst_sel:WORD_1/BYTE_3 dst_unused:UNUSED_PAD		// to SDWA dst:v1 dst_sel:WORD_1/BYTE_3 dst_unused:UNUSED_PAD
MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);		MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
auto Imm = foldToImm(*Src0);		auto Imm = TII->foldToImm(*Src0, MRI);
if (!Imm)		if (!Imm)
break;		break;

if (Imm != 16 && Imm != 24)		if (Imm != 16 && Imm != 24)
break;		break;

MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);		MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);
Show All 24 Lines	case AMDGPU::V_LSHLREV_B16_e64: {
// to SDWA src:v0 src_sel:BYTE_1		// to SDWA src:v0 src_sel:BYTE_1

// from: v_ashrrev_i16_e32 v1, 8, v0		// from: v_ashrrev_i16_e32 v1, 8, v0
// to SDWA src:v0 src_sel:BYTE_1 sext:1		// to SDWA src:v0 src_sel:BYTE_1 sext:1

// from: v_lshlrev_b16_e32 v1, 8, v0		// from: v_lshlrev_b16_e32 v1, 8, v0
// to SDWA dst:v1 dst_sel:BYTE_1 dst_unused:UNUSED_PAD		// to SDWA dst:v1 dst_sel:BYTE_1 dst_unused:UNUSED_PAD
MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);		MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
auto Imm = foldToImm(*Src0);		auto Imm = TII->foldToImm(*Src0, MRI);
if (!Imm || *Imm != 8)		if (!Imm || *Imm != 8)
break;		break;

MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);		MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);

if (TRI->isPhysicalRegister(Src1->getReg()) ||		if (TRI->isPhysicalRegister(Src1->getReg()) ||
TRI->isPhysicalRegister(Dst->getReg()))		TRI->isPhysicalRegister(Dst->getReg()))
Show All 23 Lines	case AMDGPU::V_BFE_U32: {
// 0 | 16 | WORD_0		// 0 | 16 | WORD_0
// 0 | 32 | DWORD ?		// 0 | 32 | DWORD ?
// 8 | 8 | BYTE_1		// 8 | 8 | BYTE_1
// 16 | 8 | BYTE_2		// 16 | 8 | BYTE_2
// 16 | 16 | WORD_1		// 16 | 16 | WORD_1
// 24 | 8 | BYTE_3		// 24 | 8 | BYTE_3

MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
auto Offset = foldToImm(*Src1);		auto Offset = TII->foldToImm(*Src1, MRI);
if (!Offset)		if (!Offset)
break;		break;

MachineOperand *Src2 = TII->getNamedOperand(MI, AMDGPU::OpName::src2);		MachineOperand *Src2 = TII->getNamedOperand(MI, AMDGPU::OpName::src2);
auto Width = foldToImm(*Src2);		auto Width = TII->foldToImm(*Src2, MRI);
if (!Width)		if (!Width)
break;		break;

SdwaSel SrcSel = DWORD;		SdwaSel SrcSel = DWORD;

if (Offset == 0 && Width == 8)		if (Offset == 0 && Width == 8)
SrcSel = BYTE_0;		SrcSel = BYTE_0;
else if (Offset == 0 && Width == 16)		else if (Offset == 0 && Width == 16)
Show All 26 Lines	SIPeepholeSDWA::matchSDWAOperand(MachineInstr &MI) {
case AMDGPU::V_AND_B32_e64: {		case AMDGPU::V_AND_B32_e64: {
// e.g.:		// e.g.:
// from: v_and_b32_e32 v1, 0x0000ffff/0x000000ff, v0		// from: v_and_b32_e32 v1, 0x0000ffff/0x000000ff, v0
// to SDWA src:v0 src_sel:WORD_0/BYTE_0		// to SDWA src:v0 src_sel:WORD_0/BYTE_0

MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);		MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
auto ValSrc = Src1;		auto ValSrc = Src1;
auto Imm = foldToImm(*Src0);		auto Imm = TII->foldToImm(*Src0, MRI);

if (!Imm) {		if (!Imm) {
Imm = foldToImm(*Src1);		Imm = TII->foldToImm(*Src1, MRI);
ValSrc = Src0;		ValSrc = Src0;
}		}

if (!Imm || (Imm != 0x0000ffff && Imm != 0x000000ff))		if (!Imm || (Imm != 0x0000ffff && Imm != 0x000000ff))
break;		break;

MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);		MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);


test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll

; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx700 -verify-machineinstrs < %s | FileCheck %s		; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx700 -verify-machineinstrs < %s | FileCheck %s
; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx802 -verify-machineinstrs < %s | FileCheck %s		; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx802 -verify-machineinstrs < %s | FileCheck %s

declare i32 @llvm.amdgcn.writelane(i32, i32, i32) #0		declare i32 @llvm.amdgcn.writelane(i32, i32, i32) #0

; CHECK-LABEL: {{^}}test_writelane_sreg:		; CHECK-LABEL: {{^}}test_writelane_sreg:
; CHECK: v_writelane_b32 v{{[0-9]+}}, s{{[0-9]+}}, s{{[0-9]+}}		; CHECK: v_writelane_b32 v{{[0-9]+}}, s{{[0-9]+}}, m0
define amdgpu_kernel void @test_writelane_sreg(i32 addrspace(1)* %out, i32 %src0, i32 %src1) #1 {		define amdgpu_kernel void @test_writelane_sreg(i32 addrspace(1)* %out, i32 %src0, i32 %src1) #1 {
%oldval = load i32, i32 addrspace(1)* %out		%oldval = load i32, i32 addrspace(1)* %out
%writelane = call i32 @llvm.amdgcn.writelane(i32 %src0, i32 %src1, i32 %oldval)		%writelane = call i32 @llvm.amdgcn.writelane(i32 %src0, i32 %src1, i32 %oldval)
store i32 %writelane, i32 addrspace(1)* %out, align 4		store i32 %writelane, i32 addrspace(1)* %out, align 4
ret void		ret void
}		}

; CHECK-LABEL: {{^}}test_writelane_imm_sreg:		; CHECK-LABEL: {{^}}test_writelane_imm_sreg:
Show All 18 Lines	define amdgpu_kernel void @test_writelane_vreg_lane(i32 addrspace(1)* %out, <2 x i32> addrspace(1)* %in) #1 {
store i32 %writelane, i32 addrspace(1)* %out, align 4		store i32 %writelane, i32 addrspace(1)* %out, align 4
ret void		ret void
}		}

; TODO: m0 should be folded.		; TODO: m0 should be folded.
; CHECK-LABEL: {{^}}test_writelane_m0_sreg:		; CHECK-LABEL: {{^}}test_writelane_m0_sreg:
; CHECK: s_mov_b32 m0, -1		; CHECK: s_mov_b32 m0, -1
; CHECK: s_mov_b32 [[COPY_M0:s[0-9]+]], m0		; CHECK: s_mov_b32 [[COPY_M0:s[0-9]+]], m0
; CHECK: v_writelane_b32 v{{[0-9]+}}, [[COPY_M0]], s{{[0-9]+}}		; CHECK: v_writelane_b32 v{{[0-9]+}}, [[COPY_M0]], m0
define amdgpu_kernel void @test_writelane_m0_sreg(i32 addrspace(1)* %out, i32 %src1) #1 {		define amdgpu_kernel void @test_writelane_m0_sreg(i32 addrspace(1)* %out, i32 %src1) #1 {
%oldval = load i32, i32 addrspace(1)* %out		%oldval = load i32, i32 addrspace(1)* %out
%m0 = call i32 asm "s_mov_b32 m0, -1", "={M0}"()		%m0 = call i32 asm "s_mov_b32 m0, -1", "={M0}"()
%writelane = call i32 @llvm.amdgcn.writelane(i32 %m0, i32 %src1, i32 %oldval)		%writelane = call i32 @llvm.amdgcn.writelane(i32 %m0, i32 %src1, i32 %oldval)
store i32 %writelane, i32 addrspace(1)* %out, align 4		store i32 %writelane, i32 addrspace(1)* %out, align 4
ret void		ret void
}		}

; CHECK-LABEL: {{^}}test_writelane_imm:		; CHECK-LABEL: {{^}}test_writelane_imm:
; CHECK: v_writelane_b32 v{{[0-9]+}}, s{{[0-9]+}}, 32		; CHECK: v_writelane_b32 v{{[0-9]+}}, s{{[0-9]+}}, 32
define amdgpu_kernel void @test_writelane_imm(i32 addrspace(1)* %out, i32 %src0) #1 {		define amdgpu_kernel void @test_writelane_imm(i32 addrspace(1)* %out, i32 %src0) #1 {
%oldval = load i32, i32 addrspace(1)* %out		%oldval = load i32, i32 addrspace(1)* %out
%writelane = call i32 @llvm.amdgcn.writelane(i32 %src0, i32 32, i32 %oldval) #0		%writelane = call i32 @llvm.amdgcn.writelane(i32 %src0, i32 32, i32 %oldval) #0
store i32 %writelane, i32 addrspace(1)* %out, align 4		store i32 %writelane, i32 addrspace(1)* %out, align 4
ret void		ret void
}		}

; CHECK-LABEL: {{^}}test_writelane_sreg_oldval:		; CHECK-LABEL: {{^}}test_writelane_sreg_oldval:
; CHECK: v_mov_b32_e32 [[OLDVAL:v[0-9]+]], s{{[0-9]+}}		; CHECK: v_mov_b32_e32 [[OLDVAL:v[0-9]+]], s{{[0-9]+}}
; CHECK: v_writelane_b32 [[OLDVAL]], s{{[0-9]+}}, s{{[0-9]+}}		; CHECK: v_writelane_b32 [[OLDVAL]], s{{[0-9]+}}, m0
define amdgpu_kernel void @test_writelane_sreg_oldval(i32 inreg %oldval, i32 addrspace(1)* %out, i32 %src0, i32 %src1) #1 {		define amdgpu_kernel void @test_writelane_sreg_oldval(i32 inreg %oldval, i32 addrspace(1)* %out, i32 %src0, i32 %src1) #1 {
%writelane = call i32 @llvm.amdgcn.writelane(i32 %src0, i32 %src1, i32 %oldval)		%writelane = call i32 @llvm.amdgcn.writelane(i32 %src0, i32 %src1, i32 %oldval)
store i32 %writelane, i32 addrspace(1)* %out, align 4		store i32 %writelane, i32 addrspace(1)* %out, align 4
ret void		ret void
}		}

; CHECK-LABEL: {{^}}test_writelane_imm_oldval:		; CHECK-LABEL: {{^}}test_writelane_imm_oldval:
; CHECK: v_mov_b32_e32 [[OLDVAL:v[0-9]+]], 42		; CHECK: v_mov_b32_e32 [[OLDVAL:v[0-9]+]], 42
; CHECK: v_writelane_b32 [[OLDVAL]], s{{[0-9]+}}, s{{[0-9]+}}		; CHECK: v_writelane_b32 [[OLDVAL]], s{{[0-9]+}}, m0
define amdgpu_kernel void @test_writelane_imm_oldval(i32 addrspace(1)* %out, i32 %src0, i32 %src1) #1 {		define amdgpu_kernel void @test_writelane_imm_oldval(i32 addrspace(1)* %out, i32 %src0, i32 %src1) #1 {
%writelane = call i32 @llvm.amdgcn.writelane(i32 %src0, i32 %src1, i32 42)		%writelane = call i32 @llvm.amdgcn.writelane(i32 %src0, i32 %src1, i32 42)
store i32 %writelane, i32 addrspace(1)* %out, align 4		store i32 %writelane, i32 addrspace(1)* %out, align 4
ret void		ret void
}		}

declare i32 @llvm.amdgcn.workitem.id.x() #2		declare i32 @llvm.amdgcn.workitem.id.x() #2

attributes #0 = { nounwind readnone convergent }		attributes #0 = { nounwind readnone convergent }
attributes #1 = { nounwind }		attributes #1 = { nounwind }
attributes #2 = { nounwind readnone }		attributes #2 = { nounwind readnone }
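
The updated CHECK lines above follow from the fix-up described in the revision summary: when both source operands of a writelane are SGPRs, first try to fold a known constant into one of them, otherwise copy Src1 into M0. The sketch below models that decision on a toy operand type. It is not the code from SIFixSGPRCopies.cpp: `Operand`, `WriteLane`, `knownConst`, `needsM0Copy`, and `fixWriteLane` are hypothetical names invented for illustration, and "suitable constant" is assumed here to mean a value already known to be foldable.

```cpp
#include <cassert>
#include <initializer_list>
#include <optional>
#include <string>
#include <utility>

// Toy operand model (hypothetical; this is not LLVM's MachineOperand).
enum class Kind { SGPR, M0, Imm };

struct Operand {
  Kind kind;
  std::string reg;               // register name for SGPR / M0 operands
  int imm;                       // value for Imm operands
  std::optional<int> knownConst; // assumption: set when the SGPR is a trivial
                                 // copy of a constant that may be folded
};

Operand sgpr(std::string name, std::optional<int> knownConst = std::nullopt) {
  return Operand{Kind::SGPR, std::move(name), 0, knownConst};
}
Operand immediate(int value) { return Operand{Kind::Imm, "", value, std::nullopt}; }
Operand m0() { return Operand{Kind::M0, "m0", 0, std::nullopt}; }

struct WriteLane {
  Operand src0;     // value to write
  Operand src1;     // lane select
  bool needsM0Copy; // true if an "s_mov_b32 m0, <old src1>" must be emitted first
};

// Returns true if the instruction was changed.
bool fixWriteLane(WriteLane &mi) {
  // Only the case with two SGPR sources violates the constraint.
  if (mi.src0.kind != Kind::SGPR || mi.src1.kind != Kind::SGPR)
    return false;

  // Step 1: trivial copy propagation of a constant into either operand.
  for (Operand *op : {&mi.src0, &mi.src1}) {
    if (op->knownConst) {
      *op = immediate(*op->knownConst);
      return true;
    }
  }

  // Step 2: fall back to routing Src1 through M0.
  mi.src1 = m0();
  mi.needsM0Copy = true;
  // Postcondition: at most one plain SGPR source remains.
  assert(!(mi.src0.kind == Kind::SGPR && mi.src1.kind == Kind::SGPR));
  return true;
}
```

Under this model, a case such as test_writelane_sreg (two SGPR sources, no known constant) takes the M0 path, which matches the new `m0` operand in its CHECK line, while test_writelane_imm already has an immediate lane select and is left unchanged.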

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix-up cases where writelane has 2 SGPR operands
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 165274

lib/Target/AMDGPU/SIFixSGPRCopies.cpp

lib/Target/AMDGPU/SIInstrInfo.h

lib/Target/AMDGPU/SIInstrInfo.cpp

lib/Target/AMDGPU/SIPeepholeSDWA.cpp

test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll
