This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] SDWA: add support for PRESERVE into SDWA peephole. Add new merge SDWA preserve pass
ClosedPublic

Authored by SamWot on Sep 13 2017, 10:42 AM.

Details

Reviewers

arsenm
vpykhtin
rampitec

Commits

rG5f7f32c3826e: [AMDGPU] SDWA: add support for PRESERVE into SDWA peephole.
rL319662: [AMDGPU] SDWA: add support for PRESERVE into SDWA peephole.

Summary

SDWA instructions support several values of the dst_unused operand. One of these is UNUSED_PRESERVE, which means that the parts of the destination register not written by the SDWA instruction are left unchanged. Currently the SDWA peephole pass doesn't generate UNUSED_PRESERVE. It only generates UNUSED_PAD, which means that the unused parts of the dst register are set to 0.
The big problem with UNUSED_PRESERVE is that by its nature it can't be represented in SSA form. PRESERVE assumes that the register it writes into was already written by some other instruction, and our SDWA instruction keeps that value intact.
Another problem is that in the AMDGPU backend the smallest sub-reg is 32 bits wide and SDWA needs smaller ones, so support for PRESERVE can't be implemented with subregs.
For those reasons support for PRESERVE is split into 2 major parts. First: changes in the SDWA peephole pass that allow it to recognize patterns for PRESERVE and generate the corresponding instruction. This pass works on SSA machine code and generates SSA-compatible code. Second: a new pass that works on non-SSA code and converts the code generated by the SDWA peephole into correct code.
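
To make the difference between UNUSED_PAD and UNUSED_PRESERVE concrete, here is a minimal sketch in plain C++. It is a toy model and not LLVM code: the names DstSel, DstUnused and sdwaWrite16 are hypothetical, and only the two modes described in the summary are modelled.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the SDWA destination controls (hypothetical names).
enum class DstSel { WORD_0, WORD_1 };
enum class DstUnused { PAD, PRESERVE };

// Write a 16-bit result into one half of a 32-bit register and return the
// new register value. oldDst is what the register held before the write.
inline uint32_t sdwaWrite16(uint32_t oldDst, uint32_t result, DstSel sel,
                            DstUnused unused) {
  assert(result <= 0xFFFFu && "result must fit in 16 bits");
  const unsigned shift = (sel == DstSel::WORD_1) ? 16u : 0u;
  const uint32_t mask = 0xFFFFu << shift;   // bits the instruction writes
  const uint32_t written = result << shift;
  if (unused == DstUnused::PAD)
    return written;                         // unwritten bits become 0
  return (oldDst & ~mask) | written;        // unwritten bits keep old value
}
```

With PRESERVE the result depends on oldDst, which is the reason the summary says the operation cannot be expressed directly in SSA form: the destination is both read and written.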

Changes in SDWA peephole:

There were several changes in the SDWA peephole.
a. First of all, a new pattern was added to match the PRESERVE operand. This pattern looks for a V_OR_B32 instruction one of whose operands is the result of an SDWA instruction. The second operand of V_OR_B32 should be an instruction that is bit-wise compatible with the SDWA instruction (their destinations don't overlap). E.g. match:

v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src1_sel:WORD_1 src2_sel:WORD_1
v_add_f16_e32 v3, v1, v2
v_or_b32_e32 v4, v0, v3

Into: SDWA preserve dst:v4 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE preserve:v3

Then this matched SDWA preserve pattern is converted into SDWA with preserve. During conversion the V_OR_B32 instruction is replaced by the SDWA instruction with dst_unused set to UNUSED_PRESERVE. The original instruction is removed, and the new instruction gets an additional implicit use-operand, which is the destination of the second operand of V_OR_B32 (the register that should be preserved):

v_add_f16_e32 v3, v1, v2
v_add_f16_sdwa v4, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src1_sel:WORD_1 src2_sel:WORD_1, implicit v3
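
The reason the V_OR_B32 can be folded away can be checked with ordinary bit arithmetic. The sketch below is a toy model under the assumption that the SDWA result lands in WORD_1 and the other operand only occupies WORD_0; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// v0: a 16-bit result placed in WORD_1 with UNUSED_PAD (low word zeroed).
inline uint32_t padToWord1(uint16_t hi) {
  return static_cast<uint32_t>(hi) << 16;
}

// Original sequence: v4 = v0 | v3.
inline uint32_t viaOr(uint16_t hi, uint32_t v3) {
  return padToWord1(hi) | v3;
}

// Folded form: write hi into WORD_1 of a register that already holds v3,
// keeping WORD_0 (UNUSED_PRESERVE).
inline uint32_t viaPreserve(uint16_t hi, uint32_t v3) {
  return (v3 & 0x0000FFFFu) | padToWord1(hi);
}

// The two forms agree only when v3 has no bits set in WORD_1.
inline bool foldIsSafe(uint32_t v3) { return (v3 & 0xFFFF0000u) == 0; }

// Folded computation guarded by the compatibility condition.
inline uint32_t foldOr(uint16_t hi, uint32_t v3) {
  assert(foldIsSafe(v3) && "operands overlap in WORD_1");
  return viaPreserve(hi, v3);
}
```

When the second operand does have bits in the high word, the two forms differ, which is why the pattern has to prove that the destinations do not overlap before rewriting.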

The problem with this match process is that currently it only works if both instructions are SDWA instructions. The reason is that, to match two instructions, we must check that they are compatible, meaning that they write different parts of the destination register. But currently there is no way to determine whether a regular instruction writes less than the whole destination register. E.g. we can't tell that V_ADD_F16 only writes the low 16 bits of the destination and that the high 16 bits are irrelevant. This can be determined only for an SDWA instruction, by looking at its dst_sel operand. So for now this pattern only matches 2 SDWA instructions. The ability to match regular instructions will be added later.
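
One way to picture the compatibility check is as a bit-mask test on the two dst_sel values. The sketch below is an assumption about how such a check could look, not the code from this patch; selMask and writesDisjointParts are hypothetical helpers, and the enumerator order simply follows the SdwaSel printer in the diff.

```cpp
#include <cassert>
#include <cstdint>

enum SdwaSel { BYTE_0, BYTE_1, BYTE_2, BYTE_3, WORD_0, WORD_1, DWORD };

// Bits of a 32-bit register covered by a given selector.
inline uint32_t selMask(SdwaSel sel) {
  switch (sel) {
  case BYTE_0: return 0x000000FFu;
  case BYTE_1: return 0x0000FF00u;
  case BYTE_2: return 0x00FF0000u;
  case BYTE_3: return 0xFF000000u;
  case WORD_0: return 0x0000FFFFu;
  case WORD_1: return 0xFFFF0000u;
  case DWORD:  return 0xFFFFFFFFu;
  }
  assert(false && "unknown selector");
  return 0;
}

// Two SDWA writes are compatible for the PRESERVE pattern if they touch
// different parts of the destination register.
inline bool writesDisjointParts(SdwaSel a, SdwaSel b) {
  return (selMask(a) & selMask(b)) == 0;
}
```

A regular (non-SDWA) instruction has no dst_sel, so under this model it would have to be treated as DWORD, which overlaps everything. That matches the limitation described above.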

b. The second big change in the SDWA peephole pass is that it now tries to apply the match patterns repeatedly until it can't convert any new instruction. This is needed because (as said earlier) the PRESERVE pattern needs to match SDWA instructions, but SDWA instructions appear (in most cases) only after the SDWA peephole. So to be able to match the PRESERVE pattern we first apply all other patterns that generate regular SDWA instructions, and then on the second try we apply the PRESERVE pattern to the SDWA instructions generated on the first try.
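
The "repeat until nothing changes" structure can be sketched as a small fixed-point loop. This is a deliberately simplified toy: Kind, runOneRound and runToFixedPoint are hypothetical, and the rule that every SDWA instruction can be upgraded to PRESERVE form is an assumption made only to keep the example short.

```cpp
#include <cassert>
#include <vector>

enum class Kind { Regular, Sdwa, SdwaPreserve };

// One toy round of "match, then convert". An instruction that is already
// SDWA at the start of the round may become PRESERVE; a regular one may
// become SDWA. Returns true if anything changed.
inline bool runOneRound(std::vector<Kind> &instrs) {
  bool changed = false;
  for (Kind &k : instrs) {
    if (k == Kind::Sdwa) {
      k = Kind::SdwaPreserve;
      changed = true;
    } else if (k == Kind::Regular) {
      k = Kind::Sdwa;
      changed = true;
    }
  }
  return changed;
}

// Apply rounds until a round converts nothing; returns how many rounds
// actually changed something.
inline unsigned runToFixedPoint(std::vector<Kind> &instrs) {
  unsigned productiveRounds = 0;
  while (runOneRound(instrs)) {
    ++productiveRounds;
    // In this toy each instruction advances at most twice.
    assert(productiveRounds <= 2 && "toy rounds must terminate");
  }
  return productiveRounds;
}
```

Starting from a regular instruction, the first round produces the SDWA form and only the second round can produce the PRESERVE form, which mirrors the two-try behaviour described in item b.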

New pass - merge SDWA preserve pass:

This pass is needed to convert the SSA code generated by the SDWA peephole pass into correct non-SSA code. It runs after the PHI-elimination pass, where it is possible to generate non-SSA code.
This pass looks for SDWA instructions with dst_unused set to UNUSED_PRESERVE. In those instructions it looks for the implicit register operand (which is added by the SDWA peephole pass). This register is the one that should be preserved. If such an instruction is found, the pass changes the destination register of the SDWA instruction to the implicit register and creates a copy from the implicit register to the original destination of the SDWA instruction. E.g. the instruction generated by the SDWA peephole:

v_add_f16_sdwa v4, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src1_sel:WORD_1 src2_sel:WORD_1, implicit v3

Would be converted into:

v_add_f16_sdwa v3, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src1_sel:WORD_1 src2_sel:WORD_1, implicit v3
v_mov_b32 v4, v3

Putting it all together, the original sequence of instructions:

v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src1_sel:WORD_1 src2_sel:WORD_1
v_add_f16_e32 v3, v1, v2
v_or_b32_e32 v4, v0, v3

Would be converted into:

v_add_f16_e32 v3, v1, v2
v_add_f16_sdwa v3, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src1_sel:WORD_1 src2_sel:WORD_1, implicit v3
v_mov_b32 v4, v3
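
The rewrite performed by the merge pass, as described in this summary, can be sketched on a toy instruction list. ToyInstr and mergePreserve are hypothetical and only model register names as strings; note that later in this review the separate pass was dropped in favour of tied operands.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy instruction: just enough fields to show the rewrite.
struct ToyInstr {
  std::string opcode;
  std::string dst;
  std::string src;          // first source, used here only by the copy
  bool preserve;            // dst_unused == UNUSED_PRESERVE
  std::string implicitUse;  // register to preserve, empty if none
};

// Retarget each PRESERVE instruction to the preserved register, then copy
// that register into the original destination.
inline std::vector<ToyInstr> mergePreserve(const std::vector<ToyInstr> &in) {
  std::vector<ToyInstr> out;
  for (ToyInstr mi : in) {
    // A PRESERVE instruction without an implicit use is malformed here.
    assert(!mi.preserve || !mi.implicitUse.empty());
    if (mi.preserve && mi.dst != mi.implicitUse) {
      const std::string originalDst = mi.dst;
      mi.dst = mi.implicitUse;
      out.push_back(mi);
      out.push_back(ToyInstr{"v_mov_b32", originalDst, mi.dst, false, ""});
    } else {
      out.push_back(mi);
    }
  }
  return out;
}
```

Applied to the two-instruction sequence produced by the peephole, this yields the three-instruction sequence shown above, ending in the v_mov_b32 copy.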

Diff Detail

Repository: rL LLVM

Event Timeline

SamWot created this revision. Sep 13 2017, 10:42 AM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Sep 13 2017, 10:42 AM

Harbormaster completed remote builds in B10192: Diff 115070. Sep 13 2017, 10:42 AM

I think we need to decide an overall strategy for dealing with instructions that only partially update the registers. GFX9 really complicated this issue by changing new instructions to preserve the high bits, and adding a control bit to some instructions to change the high bit behavior.

I started dealing with this to add the d16 loads and stores. We can still do this SSA by adding variants of the instructions with tied operands that preserve one half of the instructions, which would probably be less painful than adding another post-SSA pass that needs to deal with liveness. One issue is we still get suboptimal regalloc in some cases, so I'm debating adding new 16-bit subregister classes so subrange liveness tracking works.

In D37817#869898, @arsenm wrote:

I think we need to decide an overall strategy for dealing with instructions that only partially update the registers. GFX9 really complicated this issue by changing new instructions to preserve the high bits, and adding a control bit to some instructions to change the high bit behavior.

I started dealing with this to add the d16 loads and stores. We can still do this SSA by adding variants of the instructions with tied operands that preserve one half of the instructions, which would probably be less painful than adding another post-SSA pass that needs to deal with liveness. One issue is we still get suboptimal regalloc in some cases, so I'm debating adding new 16-bit subregister classes so subrange liveness tracking works.

Don't we want to add bits to the instruction describing whether it preserves the low or high half?
Adding new subregs would be quite painful, as we already have too many registers for RA and LIS to work fast and optimally.

rampitec added inline comments. Sep 13 2017, 3:40 PM

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
821 ↗	(On Diff #115070)	I do not think we need it under fast RA.
lib/Target/AMDGPU/SIMergeSDWAPreserve.cpp
115 ↗	(On Diff #115070)	We can have more than EXEC here. Check it is a virtual register instead, or even a VGPR?
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
283 ↗	(On Diff #115070)	Such things shall be functions, function templates, anything but defines. In particular that is very hard to debug.
853 ↗	(On Diff #115070)	Needs cast from unsigned or use unsigned for SDWAOpcode/Opcode.
1052 ↗	(On Diff #115070)	This loop itself probably deserves a separate change.
test/CodeGen/AMDGPU/sdwa-merge-preserve.mir
1 ↗	(On Diff #115070)	Add -verify-machineinstrs to run lines.
test/CodeGen/AMDGPU/sdwa-preserve.mir
1 ↗	(On Diff #115070)	Add -verify-machineinstrs

In D37817#870219, @rampitec wrote:

Don't we want to add bits to the instruction describing whether it preserves the low or high half?
Adding new subregs would be quite painful, as we already have too many registers for RA and LIS to work fast and optimally.

That's another partial option, but won't solve the suboptimal RA. I think we need to try and see what the impact actually ends up being. These are more constrained since you sort of can't directly address the high component usually (i.e. the high component isn't actually separately allocatable).

Resolved some issues

Harbormaster completed remote builds in B10226: Diff 115219. Sep 14 2017, 7:35 AM

In any case independent of sub register questions, I think this would be better off done in the existing pass by adding variants with tied operands. This is how I am handling this problem currently in D38070/D38071 for mad_mix

In D37817#876063, @arsenm wrote:

In any case independent of sub register questions, I think this would be better off done in the existing pass by adding variants with tied operands. This is how I am handling this problem currently in D38070/D38071 for mad_mix

This is actually a better idea than using an implicit operand and having the IR potentially broken between two passes.

Removed SIMergeSDWAPreserve pass.
Use tied registers to achieve same results

Ping

arsenm added inline comments. Oct 27 2017, 2:30 AM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
132 ↗	(On Diff #117831)	Probably should use LLVM_ENABLE_DUMP instead of NDEBUG. Also should add the matching dump() with LLVM_DUMP_METHOD.
292 ↗	(On Diff #117831)	use_nodbg_operands? Also space before :
293 ↗	(On Diff #117831)	C++ style comments
294–295 ↗	(On Diff #117831)	These various checks shouldn't be necessary in SSA. You can't have a def of a specific subregister (unless maybe there is a physical register which should probably be skipped anyway).
313 ↗	(On Diff #117831)	This seems to be re-inventing MRI.getVRegDef/MRI.getUniqueVRegDef?
467 ↗	(On Diff #117831)	This whole loop is just MRI.clearKillFlags()
487–488 ↗	(On Diff #117831)	This is still manually tying the result operand. I was expecting another set of _sdwa opcodes with the preserve behavior with the tied operand statically known in the instruction definition. Manually tying this way is potentially hazardous because the verifier won't check it, and it makes it easier for another pass to accidentally drop the tied operand. I would expect there to be an InstrMapping table between the SDWA opcode and the SDWA with preserve set versions.
lib/Target/AMDGPU/SIRegisterInfo.cpp
1317–1318 ↗	(On Diff #117831)	How are these different from the various MCRegisterInfo functions for checking if registers alias or have subreg relationships?

Fixed latest issues from arsenm

SamWot added inline comments. Nov 2 2017, 3:40 AM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
487–488 ↗	(On Diff #117831)	I thought that having separate instruction definition for just preserve case would be overkill. I didn't want to introduce another new kind of SDWA instruction that would only bloat already huge set of instruction definitions. But if you think this would be better I will introduce new instructions definition.

Ping.
Matt, what do you think about the latest changes in the review?

At the very least there needs to be a verifier check for the tied operands if this is set. With the separate tied opcodes you get that for free

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
999 ↗	(On Diff #121268)	Extra newline

Added verification for tied register for UNUSED_PRESERVE

arsenm added inline comments. Nov 29 2017, 3:05 PM

lib/Target/AMDGPU/SIInstrInfo.cpp
2702–2705 ↗	(On Diff #124747)	Could use some of the stricter checks the normal verifier has, i.e. else if (TargetRegisterInfo::isPhysicalRegister(MOTied.getReg()) && MO->getReg() != MOTied.getReg())

Stronger verification for UNUSED_PRESERVE
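
The three verifier conditions added for UNUSED_PRESERVE (see the SIInstrInfo.cpp hunk in the diff) can be exercised in isolation with a small stand-alone model. ToyOperand and verifyPreserve are hypothetical stand-ins for MachineOperand and the verifier code; only the error strings are taken from the patch.

```cpp
#include <cassert>
#include <string>

// Minimal stand-in for the operand properties the check looks at.
struct ToyOperand {
  bool isReg;
  bool isTied;
  bool isImplicit;
  bool isUse;
  bool isPhysical;
  unsigned reg;
};

// Returns an empty string when the dst/tied pair is acceptable for
// UNUSED_PRESERVE, otherwise the corresponding error message.
inline std::string verifyPreserve(const ToyOperand &dst,
                                  const ToyOperand &tied) {
  if (!dst.isReg || !dst.isTied)
    return "Dst register should have tied register";
  if (!tied.isReg || !tied.isImplicit || !tied.isUse)
    return "Dst register should be tied to implicit use of preserved register";
  if (tied.isPhysical && dst.reg != tied.reg)
    return "Dst register should use same physical register as preserved";
  return "";
}
```

The last condition only applies to physical registers: while the registers are still virtual, dst and the preserved register may differ, and the tie is what later forces them to be the same register.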

LGTM

This revision is now accepted and ready to land. Dec 1 2017, 1:51 PM

Closed by commit rL319662: [AMDGPU] SDWA: add support for PRESERVE into SDWA peephole. (authored by skolton). Dec 4 2017, 8:23 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

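Path	Size
llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp	22 lines
llvm/trunk/lib/Target/AMDGPU/SIPeepholeSDWA.cpp	799 lines
llvm/trunk/test/CodeGen/AMDGPU/fabs.f16.ll	3 lines
llvm/trunk/test/CodeGen/AMDGPU/fcanonicalize.f16.ll	7 lines
llvm/trunk/test/CodeGen/AMDGPU/fneg.f16.ll	3 lines
llvm/trunk/test/CodeGen/AMDGPU/sdwa-peephole-instr.mir	4 lines
llvm/trunk/test/CodeGen/AMDGPU/sdwa-preserve.mir	56 lines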

Diff 125349

llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp

[... 2,681 lines not shown; context: if (isVOPC(BasicOpcode)) { ...]
         // No omod allowed on GFX9 for VOPC
         const MachineOperand *OMod = getNamedOperand(MI, AMDGPU::OpName::omod);
         if (OMod && (!OMod->isImm() || OMod->getImm() != 0)) {
           ErrInfo = "OMod not allowed in VOPC SDWA instructions on VI";
           return false;
         }
       }
     }

+    const MachineOperand *DstUnused = getNamedOperand(MI, AMDGPU::OpName::dst_unused);
+    if (DstUnused && DstUnused->isImm() &&
+        DstUnused->getImm() == AMDGPU::SDWA::UNUSED_PRESERVE) {
+      const MachineOperand &Dst = MI.getOperand(DstIdx);
+      if (!Dst.isReg() || !Dst.isTied()) {
+        ErrInfo = "Dst register should have tied register";
+        return false;
+      }
+
+      const MachineOperand &TiedMO =
+          MI.getOperand(MI.findTiedOperandIdx(DstIdx));
+      if (!TiedMO.isReg() || !TiedMO.isImplicit() || !TiedMO.isUse()) {
+        ErrInfo =
+            "Dst register should be tied to implicit use of preserved register";
+        return false;
+      } else if (TargetRegisterInfo::isPhysicalRegister(TiedMO.getReg()) &&
+                 Dst.getReg() != TiedMO.getReg()) {
+        ErrInfo = "Dst register should use same physical register as preserved";
+        return false;
+      }
+    }
   }

   // Verify VOP*
   if (isVOP1(MI) || isVOP2(MI) || isVOP3(MI) || isVOPC(MI) || isSDWA(MI)) {
     // Only look at the true operands. Only a real operand can use the constant
     // bus, and we don't want to check pseudo-operands like the source modifier
     // flags.
     const int OpIndices[] = { Src0Idx, Src1Idx, Src2Idx };
[... 2,100 lines not shown ...]

llvm/trunk/lib/Target/AMDGPU/SIPeepholeSDWA.cpp

[... 55 lines not shown ...]

 STATISTIC(NumSDWAPatternsFound, "Number of SDWA patterns found.");
 STATISTIC(NumSDWAInstructionsPeepholed,
           "Number of instruction converted to SDWA.");

 namespace {

 class SDWAOperand;
+class SDWADstOperand;

 class SIPeepholeSDWA : public MachineFunctionPass {
 public:
   using SDWAOperandsVector = SmallVector<SDWAOperand *, 4>;

 private:
   MachineRegisterInfo *MRI;
   const SIRegisterInfo *TRI;
[... 9 lines not shown; context: public: ...]
   static char ID;

   SIPeepholeSDWA() : MachineFunctionPass(ID) {
     initializeSIPeepholeSDWAPass(*PassRegistry::getPassRegistry());
   }

   bool runOnMachineFunction(MachineFunction &MF) override;
   void matchSDWAOperands(MachineFunction &MF);
+  std::unique_ptr<SDWAOperand> matchSDWAOperand(MachineInstr &MI);
   bool isConvertibleToSDWA(const MachineInstr &MI, const SISubtarget &ST) const;
   bool convertToSDWA(MachineInstr &MI, const SDWAOperandsVector &SDWAOperands);
   void legalizeScalarOperands(MachineInstr &MI, const SISubtarget &ST) const;

   StringRef getPassName() const override { return "SI Peephole SDWA"; }

   void getAnalysisUsage(AnalysisUsage &AU) const override {
     AU.setPreservesCFG();
[... 20 lines not shown; context: public: ...]

   MachineOperand *getTargetOperand() const { return Target; }
   MachineOperand *getReplacedOperand() const { return Replaced; }
   MachineInstr *getParentInst() const { return Target->getParent(); }

   MachineRegisterInfo *getMRI() const {
     return &getParentInst()->getParent()->getParent()->getRegInfo();
   }

+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  virtual void print(raw_ostream& OS) const = 0;
+  void dump() const { print(dbgs()); }
+#endif
 };

 using namespace AMDGPU::SDWA;

 class SDWASrcOperand : public SDWAOperand {
 private:
   SdwaSel SrcSel;
   bool Abs;
   bool Neg;
   bool Sext;

 public:
   SDWASrcOperand(MachineOperand *TargetOp, MachineOperand *ReplacedOp,
                  SdwaSel SrcSel_ = DWORD, bool Abs_ = false, bool Neg_ = false,
                  bool Sext_ = false)
-      : SDWAOperand(TargetOp, ReplacedOp), SrcSel(SrcSel_), Abs(Abs_),
-        Neg(Neg_), Sext(Sext_) {}
+    : SDWAOperand(TargetOp, ReplacedOp),
+      SrcSel(SrcSel_), Abs(Abs_), Neg(Neg_), Sext(Sext_) {}

   MachineInstr *potentialToConvert(const SIInstrInfo *TII) override;
   bool convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) override;

   SdwaSel getSrcSel() const { return SrcSel; }
   bool getAbs() const { return Abs; }
   bool getNeg() const { return Neg; }
   bool getSext() const { return Sext; }

   uint64_t getSrcMods(const SIInstrInfo *TII,
                       const MachineOperand *SrcOp) const;

+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  void print(raw_ostream& OS) const override;
+#endif
 };

 class SDWADstOperand : public SDWAOperand {
 private:
   SdwaSel DstSel;
   DstUnused DstUn;

 public:

   SDWADstOperand(MachineOperand *TargetOp, MachineOperand *ReplacedOp,
                  SdwaSel DstSel_ = DWORD, DstUnused DstUn_ = UNUSED_PAD)
     : SDWAOperand(TargetOp, ReplacedOp), DstSel(DstSel_), DstUn(DstUn_) {}

   MachineInstr *potentialToConvert(const SIInstrInfo *TII) override;
   bool convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) override;

   SdwaSel getDstSel() const { return DstSel; }
   DstUnused getDstUnused() const { return DstUn; }

+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  void print(raw_ostream& OS) const override;
+#endif
+};
+
+class SDWADstPreserveOperand : public SDWADstOperand {
+private:
+  MachineOperand *Preserve;
+
+public:
+  SDWADstPreserveOperand(MachineOperand *TargetOp, MachineOperand *ReplacedOp,
+                         MachineOperand *PreserveOp, SdwaSel DstSel_ = DWORD)
+    : SDWADstOperand(TargetOp, ReplacedOp, DstSel_, UNUSED_PRESERVE),
+      Preserve(PreserveOp) {}
+
+  bool convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) override;
+
+  MachineOperand *getPreservedOperand() const { return Preserve; }
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  void print(raw_ostream& OS) const override;
+#endif
 };

 } // end anonymous namespace

 INITIALIZE_PASS(SIPeepholeSDWA, DEBUG_TYPE, "SI Peephole SDWA", false, false)

 char SIPeepholeSDWA::ID = 0;

 char &llvm::SIPeepholeSDWAID = SIPeepholeSDWA::ID;

 FunctionPass *llvm::createSIPeepholeSDWAPass() {
   return new SIPeepholeSDWA();
 }

-#ifndef NDEBUG
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
 static raw_ostream& operator<<(raw_ostream &OS, const SdwaSel &Sel) {
   switch(Sel) {
   case BYTE_0: OS << "BYTE_0"; break;
   case BYTE_1: OS << "BYTE_1"; break;
   case BYTE_2: OS << "BYTE_2"; break;
   case BYTE_3: OS << "BYTE_3"; break;
   case WORD_0: OS << "WORD_0"; break;
   case WORD_1: OS << "WORD_1"; break;
   case DWORD: OS << "DWORD"; break;
   }
   return OS;
 }

 static raw_ostream& operator<<(raw_ostream &OS, const DstUnused &Un) {
   switch(Un) {
   case UNUSED_PAD: OS << "UNUSED_PAD"; break;
   case UNUSED_SEXT: OS << "UNUSED_SEXT"; break;
   case UNUSED_PRESERVE: OS << "UNUSED_PRESERVE"; break;
   }
   return OS;
 }

-static raw_ostream& operator<<(raw_ostream &OS, const SDWASrcOperand &Src) {
-  OS << "SDWA src: " << *Src.getTargetOperand()
-     << " src_sel:" << Src.getSrcSel()
-     << " abs:" << Src.getAbs() << " neg:" << Src.getNeg()
-     << " sext:" << Src.getSext() << '\n';
+static raw_ostream& operator<<(raw_ostream &OS, const SDWAOperand &Operand) {
+  Operand.print(OS);
   return OS;
 }

-static raw_ostream& operator<<(raw_ostream &OS, const SDWADstOperand &Dst) {
-  OS << "SDWA dst: " << *Dst.getTargetOperand()
-     << " dst_sel:" << Dst.getDstSel()
-     << " dst_unused:" << Dst.getDstUnused() << '\n';
-  return OS;
+LLVM_DUMP_METHOD
+void SDWASrcOperand::print(raw_ostream& OS) const {
+  OS << "SDWA src: " << *getTargetOperand()
+     << " src_sel:" << getSrcSel()
+     << " abs:" << getAbs() << " neg:" << getNeg()
+     << " sext:" << getSext() << '\n';
+}
+
+LLVM_DUMP_METHOD
+void SDWADstOperand::print(raw_ostream& OS) const {
+  OS << "SDWA dst: " << *getTargetOperand()
+     << " dst_sel:" << getDstSel()
+     << " dst_unused:" << getDstUnused() << '\n';
+}
+
+LLVM_DUMP_METHOD
+void SDWADstPreserveOperand::print(raw_ostream& OS) const {
+  OS << "SDWA preserve dst: " << *getTargetOperand()
+     << " dst_sel:" << getDstSel()
+     << " preserve:" << *getPreservedOperand() << '\n';
 }

 #endif

 static void copyRegOperand(MachineOperand &To, const MachineOperand &From) {
   assert(To.isReg() && From.isReg());
   To.setReg(From.getReg());
   To.setSubReg(From.getSubReg());
   To.setIsUndef(From.isUndef());
   if (To.isUse()) {
     To.setIsKill(From.isKill());
   } else {
     To.setIsDead(From.isDead());
   }
 }

 static bool isSameReg(const MachineOperand &LHS, const MachineOperand &RHS) {
   return LHS.isReg() &&
          RHS.isReg() &&
          LHS.getReg() == RHS.getReg() &&
          LHS.getSubReg() == RHS.getSubReg();
 }

-static bool isSubregOf(const MachineOperand &SubReg,
-                       const MachineOperand &SuperReg,
-                       const TargetRegisterInfo *TRI) {
-
-  if (!SuperReg.isReg() || !SubReg.isReg())
-    return false;
-
-  if (isSameReg(SuperReg, SubReg))
-    return true;
-
-  if (SuperReg.getReg() != SubReg.getReg())
-    return false;
-
-  LaneBitmask SuperMask = TRI->getSubRegIndexLaneMask(SuperReg.getSubReg());
-  LaneBitmask SubMask = TRI->getSubRegIndexLaneMask(SubReg.getSubReg());
-  SuperMask |= ~SubMask;
-  return SuperMask.all();
+static MachineOperand *findSingleRegUse(const MachineOperand *Reg,
+                                        const MachineRegisterInfo *MRI) {
+  if (!Reg->isReg() || !Reg->isDef())
+    return nullptr;
+
+  MachineOperand *ResMO = nullptr;
+  for (MachineOperand &UseMO : MRI->use_nodbg_operands(Reg->getReg())) {
+    // If there exist use of subreg of Reg then return nullptr
+    if (!isSameReg(UseMO, *Reg))
+      return nullptr;
+
+    // Check that there is only one instruction that uses Reg
+    if (!ResMO) {
+      ResMO = &UseMO;
+    } else if (ResMO->getParent() != UseMO.getParent()) {
+      return nullptr;
+    }
+  }
+
+  return ResMO;
+}
+
+static MachineOperand *findSingleRegDef(const MachineOperand *Reg,
+                                        const MachineRegisterInfo *MRI) {
+  if (!Reg->isReg())
+    return nullptr;
+
+  MachineInstr *DefInstr = MRI->getUniqueVRegDef(Reg->getReg());
+  if (!DefInstr)
+    return nullptr;
+
+  for (auto &DefMO : DefInstr->defs()) {
+    if (DefMO.isReg() && DefMO.getReg() == Reg->getReg())
+      return &DefMO;
+  }
+
+  llvm_unreachable("invalid reg");
 }

 uint64_t SDWASrcOperand::getSrcMods(const SIInstrInfo *TII,
                                     const MachineOperand *SrcOp) const {
   uint64_t Mods = 0;
   const auto *MI = SrcOp->getParent();
   if (TII->getNamedOperand(*MI, AMDGPU::OpName::src0) == SrcOp) {
     if (auto *Mod = TII->getNamedOperand(*MI, AMDGPU::OpName::src0_modifiers)) {
[... 14 lines not shown; context: uint64_t SDWASrcOperand::getSrcMods ...]
   }

   return Mods;
 }

 MachineInstr *SDWASrcOperand::potentialToConvert(const SIInstrInfo *TII) {
   // For SDWA src operand potential instruction is one that use register
   // defined by parent instruction
-  MachineRegisterInfo *MRI = getMRI();
-  MachineOperand *Replaced = getReplacedOperand();
-  assert(Replaced->isReg());
-
-  MachineInstr *PotentialMI = nullptr;
-  for (MachineOperand &PotentialMO : MRI->use_operands(Replaced->getReg())) {
-    // If this is use of another subreg of dst reg then do nothing
-    if (!isSubregOf(*Replaced, PotentialMO, MRI->getTargetRegisterInfo()))
-      continue;
-
-    // If there exist use of superreg of dst then we should not combine this
-    // opernad
-    if (!isSameReg(PotentialMO, *Replaced))
-      return nullptr;
-
-    // Check that PotentialMI is only instruction that uses dst reg
-    if (PotentialMI == nullptr) {
-      PotentialMI = PotentialMO.getParent();
-    } else if (PotentialMI != PotentialMO.getParent()) {
-      return nullptr;
-    }
-  }
-
-  return PotentialMI;
+  MachineOperand *PotentialMO = findSingleRegUse(getReplacedOperand(), getMRI());
+  if (!PotentialMO)
+    return nullptr;
+
+  return PotentialMO->getParent();
 }

 bool SDWASrcOperand::convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) {
   // Find operand in instruction that matches source operand and replace it with
   // target operand. Set corresponding src_sel

   MachineOperand *Src = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
   MachineOperand *SrcSel = TII->getNamedOperand(MI, AMDGPU::OpName::src0_sel);
   MachineOperand *SrcMods =
       TII->getNamedOperand(MI, AMDGPU::OpName::src0_modifiers);
   assert(Src && (Src->isReg() || Src->isImm()));
   if (!isSameReg(*Src, *getReplacedOperand())) {
     // If this is not src0 then it should be src1
     Src = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
     SrcSel = TII->getNamedOperand(MI, AMDGPU::OpName::src1_sel);
     SrcMods = TII->getNamedOperand(MI, AMDGPU::OpName::src1_modifiers);

     assert(Src && Src->isReg());

     if ((MI.getOpcode() == AMDGPU::V_MAC_F16_sdwa ||
          MI.getOpcode() == AMDGPU::V_MAC_F32_sdwa) &&
         !isSameReg(*Src, *getReplacedOperand())) {
       // In case of v_mac_f16/32_sdwa this pass can try to apply src operand to
       // src2. This is not allowed.
       return false;
     }

     assert(isSameReg(*Src, *getReplacedOperand()) && SrcSel && SrcMods);
   }
   copyRegOperand(*Src, *getTargetOperand());
   SrcSel->setImm(getSrcSel());
   SrcMods->setImm(getSrcMods(TII, Src));
   getTargetOperand()->setIsKill(false);
   return true;
 }

MachineInstr SDWADstOperand::potentialToConvert(const SIInstrInfo TII) {		MachineInstr SDWADstOperand::potentialToConvert(const SIInstrInfo TII) {
// For SDWA dst operand potential instruction is one that defines register		// For SDWA dst operand potential instruction is one that defines register
// that this operand uses		// that this operand uses
MachineRegisterInfo *MRI = getMRI();		MachineRegisterInfo *MRI = getMRI();
MachineInstr *ParentMI = getParentInst();		MachineInstr *ParentMI = getParentInst();
MachineOperand *Replaced = getReplacedOperand();
assert(Replaced->isReg());

for (MachineOperand &PotentialMO : MRI->def_operands(Replaced->getReg())) {		MachineOperand *PotentialMO = findSingleRegDef(getReplacedOperand(), MRI);
if (!isSubregOf(*Replaced, PotentialMO, MRI->getTargetRegisterInfo()))		if (!PotentialMO)
continue;

if (!isSameReg(*Replaced, PotentialMO))
return nullptr;		return nullptr;

// Check that ParentMI is the only instruction that uses replaced register		// Check that ParentMI is the only instruction that uses replaced register
for (MachineOperand &UseMO : MRI->use_operands(PotentialMO.getReg())) {		for (MachineInstr &UseInst : MRI->use_nodbg_instructions(PotentialMO->getReg())) {
if (isSubregOf(UseMO, PotentialMO, MRI->getTargetRegisterInfo()) &&		if (&UseInst != ParentMI)
UseMO.getParent() != ParentMI) {
return nullptr;		return nullptr;
}		}
}

// Due to SSA this should be only def of replaced register, so return it		return PotentialMO->getParent();
return PotentialMO.getParent();
}

return nullptr;
}		}

bool SDWADstOperand::convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) {		bool SDWADstOperand::convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) {
// Replace vdst operand in MI with target operand. Set dst_sel and dst_unused		// Replace vdst operand in MI with target operand. Set dst_sel and dst_unused

if ((MI.getOpcode() == AMDGPU::V_MAC_F16_sdwa ||		if ((MI.getOpcode() == AMDGPU::V_MAC_F16_sdwa ||
MI.getOpcode() == AMDGPU::V_MAC_F32_sdwa) &&		MI.getOpcode() == AMDGPU::V_MAC_F32_sdwa) &&
getDstSel() != AMDGPU::SDWA::DWORD) {		getDstSel() != AMDGPU::SDWA::DWORD) {
// ... (14 unchanged lines not shown)
DstUnused->setImm(getDstUnused());		DstUnused->setImm(getDstUnused());

// Remove original instruction because it would conflict with our new		// Remove original instruction because it would conflict with our new
// instruction by register definition		// instruction by register definition
getParentInst()->eraseFromParent();		getParentInst()->eraseFromParent();
return true;		return true;
}		}

		bool SDWADstPreserveOperand::convertToSDWA(MachineInstr &MI,
		const SIInstrInfo *TII) {
		// MI should be moved right before v_or_b32.
		// For this we should clear all kill flags on uses of MI src-operands or else
		// we can encounter problem with use of killed operand.
		for (MachineOperand &MO : MI.uses()) {
		if (!MO.isReg())
		continue;
		getMRI()->clearKillFlags(MO.getReg());
		}

		// Move MI before v_or_b32
		auto MBB = MI.getParent();
		MBB->remove(&MI);
		MBB->insert(getParentInst(), &MI);

		// Add Implicit use of preserved register
		MachineInstrBuilder MIB(*MBB->getParent(), MI);
		MIB.addReg(getPreservedOperand()->getReg(),
		RegState::ImplicitKill,
		getPreservedOperand()->getSubReg());

		// Tie dst to implicit use
		MI.tieOperands(AMDGPU::getNamedOperandIdx(MI.getOpcode(), AMDGPU::OpName::vdst),
		MI.getNumOperands() - 1);

		// Convert MI as any other SDWADstOperand and remove v_or_b32
		return SDWADstOperand::convertToSDWA(MI, TII);
		}

Optional<int64_t> SIPeepholeSDWA::foldToImm(const MachineOperand &Op) const {		Optional<int64_t> SIPeepholeSDWA::foldToImm(const MachineOperand &Op) const {
if (Op.isImm()) {		if (Op.isImm()) {
return Op.getImm();		return Op.getImm();
}		}

// If this is not immediate then it can be copy of immediate value, e.g.:		// If this is not immediate then it can be copy of immediate value, e.g.:
// %1<def> = S_MOV_B32 255;		// %1<def> = S_MOV_B32 255;
if (Op.isReg()) {		if (Op.isReg()) {
// ... (11 unchanged lines not shown; context: for (const MachineOperand &Def : MRI->def_operands(Op.getReg())) {)

return Copied.getImm();		return Copied.getImm();
}		}
}		}

return None;		return None;
}		}

void SIPeepholeSDWA::matchSDWAOperands(MachineFunction &MF) {		std::unique_ptr<SDWAOperand>
for (MachineBasicBlock &MBB : MF) {		SIPeepholeSDWA::matchSDWAOperand(MachineInstr &MI) {
for (MachineInstr &MI : MBB) {
unsigned Opcode = MI.getOpcode();		unsigned Opcode = MI.getOpcode();
switch (Opcode) {		switch (Opcode) {
case AMDGPU::V_LSHRREV_B32_e32:		case AMDGPU::V_LSHRREV_B32_e32:
case AMDGPU::V_ASHRREV_I32_e32:		case AMDGPU::V_ASHRREV_I32_e32:
case AMDGPU::V_LSHLREV_B32_e32:		case AMDGPU::V_LSHLREV_B32_e32:
case AMDGPU::V_LSHRREV_B32_e64:		case AMDGPU::V_LSHRREV_B32_e64:
case AMDGPU::V_ASHRREV_I32_e64:		case AMDGPU::V_ASHRREV_I32_e64:
case AMDGPU::V_LSHLREV_B32_e64: {		case AMDGPU::V_LSHLREV_B32_e64: {
// from: v_lshrrev_b32_e32 v1, 16/24, v0		// from: v_lshrrev_b32_e32 v1, 16/24, v0
// to SDWA src:v0 src_sel:WORD_1/BYTE_3		// to SDWA src:v0 src_sel:WORD_1/BYTE_3

// from: v_ashrrev_i32_e32 v1, 16/24, v0		// from: v_ashrrev_i32_e32 v1, 16/24, v0
// to SDWA src:v0 src_sel:WORD_1/BYTE_3 sext:1		// to SDWA src:v0 src_sel:WORD_1/BYTE_3 sext:1

// from: v_lshlrev_b32_e32 v1, 16/24, v0		// from: v_lshlrev_b32_e32 v1, 16/24, v0
// to SDWA dst:v1 dst_sel:WORD_1/BYTE_3 dst_unused:UNUSED_PAD		// to SDWA dst:v1 dst_sel:WORD_1/BYTE_3 dst_unused:UNUSED_PAD
MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);		MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
auto Imm = foldToImm(*Src0);		auto Imm = foldToImm(*Src0);
if (!Imm)		if (!Imm)
break;		break;

if (Imm != 16 && Imm != 24)		if (Imm != 16 && Imm != 24)
break;		break;

MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);		MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);
if (TRI->isPhysicalRegister(Src1->getReg()) ||		if (TRI->isPhysicalRegister(Src1->getReg()) ||
TRI->isPhysicalRegister(Dst->getReg()))		TRI->isPhysicalRegister(Dst->getReg()))
break;		break;

if (Opcode == AMDGPU::V_LSHLREV_B32_e32 ||		if (Opcode == AMDGPU::V_LSHLREV_B32_e32 ||
Opcode == AMDGPU::V_LSHLREV_B32_e64) {		Opcode == AMDGPU::V_LSHLREV_B32_e64) {
auto SDWADst = make_unique<SDWADstOperand>(		return make_unique<SDWADstOperand>(
Dst, Src1, *Imm == 16 ? WORD_1 : BYTE_3, UNUSED_PAD);		Dst, Src1, *Imm == 16 ? WORD_1 : BYTE_3, UNUSED_PAD);
DEBUG(dbgs() << "Match: " << MI << "To: " << *SDWADst << '\n');
SDWAOperands[&MI] = std::move(SDWADst);
++NumSDWAPatternsFound;
} else {		} else {
auto SDWASrc = make_unique<SDWASrcOperand>(		return make_unique<SDWASrcOperand>(
Src1, Dst, *Imm == 16 ? WORD_1 : BYTE_3, false, false,		Src1, Dst, *Imm == 16 ? WORD_1 : BYTE_3, false, false,
Opcode != AMDGPU::V_LSHRREV_B32_e32 &&		Opcode != AMDGPU::V_LSHRREV_B32_e32 &&
Opcode != AMDGPU::V_LSHRREV_B32_e64);		Opcode != AMDGPU::V_LSHRREV_B32_e64);
DEBUG(dbgs() << "Match: " << MI << "To: " << *SDWASrc << '\n');
SDWAOperands[&MI] = std::move(SDWASrc);
++NumSDWAPatternsFound;
}		}
break;		break;
}		}

case AMDGPU::V_LSHRREV_B16_e32:		case AMDGPU::V_LSHRREV_B16_e32:
case AMDGPU::V_ASHRREV_I16_e32:		case AMDGPU::V_ASHRREV_I16_e32:
case AMDGPU::V_LSHLREV_B16_e32:		case AMDGPU::V_LSHLREV_B16_e32:
case AMDGPU::V_LSHRREV_B16_e64:		case AMDGPU::V_LSHRREV_B16_e64:
case AMDGPU::V_ASHRREV_I16_e64:		case AMDGPU::V_ASHRREV_I16_e64:
case AMDGPU::V_LSHLREV_B16_e64: {		case AMDGPU::V_LSHLREV_B16_e64: {
// from: v_lshrrev_b16_e32 v1, 8, v0		// from: v_lshrrev_b16_e32 v1, 8, v0
// to SDWA src:v0 src_sel:BYTE_1		// to SDWA src:v0 src_sel:BYTE_1

// from: v_ashrrev_i16_e32 v1, 8, v0		// from: v_ashrrev_i16_e32 v1, 8, v0
// to SDWA src:v0 src_sel:BYTE_1 sext:1		// to SDWA src:v0 src_sel:BYTE_1 sext:1

// from: v_lshlrev_b16_e32 v1, 8, v0		// from: v_lshlrev_b16_e32 v1, 8, v0
// to SDWA dst:v1 dst_sel:BYTE_1 dst_unused:UNUSED_PAD		// to SDWA dst:v1 dst_sel:BYTE_1 dst_unused:UNUSED_PAD
MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);		MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
auto Imm = foldToImm(*Src0);		auto Imm = foldToImm(*Src0);
if (!Imm || *Imm != 8)		if (!Imm || *Imm != 8)
break;		break;

MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);		MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);

if (TRI->isPhysicalRegister(Src1->getReg()) ||		if (TRI->isPhysicalRegister(Src1->getReg()) ||
TRI->isPhysicalRegister(Dst->getReg()))		TRI->isPhysicalRegister(Dst->getReg()))
break;		break;

if (Opcode == AMDGPU::V_LSHLREV_B16_e32 ||		if (Opcode == AMDGPU::V_LSHLREV_B16_e32 ||
Opcode == AMDGPU::V_LSHLREV_B16_e64) {		Opcode == AMDGPU::V_LSHLREV_B16_e64) {
auto SDWADst =		return make_unique<SDWADstOperand>(Dst, Src1, BYTE_1, UNUSED_PAD);
make_unique<SDWADstOperand>(Dst, Src1, BYTE_1, UNUSED_PAD);
DEBUG(dbgs() << "Match: " << MI << "To: " << *SDWADst << '\n');
SDWAOperands[&MI] = std::move(SDWADst);
++NumSDWAPatternsFound;
} else {		} else {
auto SDWASrc = make_unique<SDWASrcOperand>(		return make_unique<SDWASrcOperand>(
Src1, Dst, BYTE_1, false, false,		Src1, Dst, BYTE_1, false, false,
Opcode != AMDGPU::V_LSHRREV_B16_e32 &&		Opcode != AMDGPU::V_LSHRREV_B16_e32 &&
Opcode != AMDGPU::V_LSHRREV_B16_e64);		Opcode != AMDGPU::V_LSHRREV_B16_e64);
DEBUG(dbgs() << "Match: " << MI << "To: " << *SDWASrc << '\n');
SDWAOperands[&MI] = std::move(SDWASrc);
++NumSDWAPatternsFound;
}		}
break;		break;
}		}

case AMDGPU::V_BFE_I32:		case AMDGPU::V_BFE_I32:
case AMDGPU::V_BFE_U32: {		case AMDGPU::V_BFE_U32: {
// e.g.:		// e.g.:
// from: v_bfe_u32 v1, v0, 8, 8		// from: v_bfe_u32 v1, v0, 8, 8
// to SDWA src:v0 src_sel:BYTE_1		// to SDWA src:v0 src_sel:BYTE_1

// offset | width | src_sel		// offset | width | src_sel
// ------------------------		// ------------------------
// 0      | 8     | BYTE_0		// 0      | 8     | BYTE_0
// 0      | 16    | WORD_0		// 0      | 16    | WORD_0
// 0      | 32    | DWORD ?		// 0      | 32    | DWORD ?
// 8      | 8     | BYTE_1		// 8      | 8     | BYTE_1
// 16     | 8     | BYTE_2		// 16     | 8     | BYTE_2
// 16     | 16    | WORD_1		// 16     | 16    | WORD_1
// 24     | 8     | BYTE_3		// 24     | 8     | BYTE_3

MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
auto Offset = foldToImm(*Src1);		auto Offset = foldToImm(*Src1);
if (!Offset)		if (!Offset)
break;		break;

MachineOperand *Src2 = TII->getNamedOperand(MI, AMDGPU::OpName::src2);		MachineOperand *Src2 = TII->getNamedOperand(MI, AMDGPU::OpName::src2);
auto Width = foldToImm(*Src2);		auto Width = foldToImm(*Src2);
if (!Width)		if (!Width)
break;		break;

SdwaSel SrcSel = DWORD;		SdwaSel SrcSel = DWORD;

if (Offset == 0 && Width == 8)		if (Offset == 0 && Width == 8)
SrcSel = BYTE_0;		SrcSel = BYTE_0;
else if (Offset == 0 && Width == 16)		else if (Offset == 0 && Width == 16)
SrcSel = WORD_0;		SrcSel = WORD_0;
else if (Offset == 0 && Width == 32)		else if (Offset == 0 && Width == 32)
SrcSel = DWORD;		SrcSel = DWORD;
else if (Offset == 8 && Width == 8)		else if (Offset == 8 && Width == 8)
SrcSel = BYTE_1;		SrcSel = BYTE_1;
else if (Offset == 16 && Width == 8)		else if (Offset == 16 && Width == 8)
SrcSel = BYTE_2;		SrcSel = BYTE_2;
else if (Offset == 16 && Width == 16)		else if (Offset == 16 && Width == 16)
SrcSel = WORD_1;		SrcSel = WORD_1;
else if (Offset == 24 && Width == 8)		else if (Offset == 24 && Width == 8)
SrcSel = BYTE_3;		SrcSel = BYTE_3;
else		else
break;		break;

MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);		MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);		MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);

if (TRI->isPhysicalRegister(Src0->getReg()) ||		if (TRI->isPhysicalRegister(Src0->getReg()) ||
TRI->isPhysicalRegister(Dst->getReg()))		TRI->isPhysicalRegister(Dst->getReg()))
break;		break;

auto SDWASrc = make_unique<SDWASrcOperand>(		return make_unique<SDWASrcOperand>(
Src0, Dst, SrcSel, false, false,		Src0, Dst, SrcSel, false, false, Opcode != AMDGPU::V_BFE_U32);
Opcode != AMDGPU::V_BFE_U32);
DEBUG(dbgs() << "Match: " << MI << "To: " << *SDWASrc << '\n');
SDWAOperands[&MI] = std::move(SDWASrc);
++NumSDWAPatternsFound;
break;
}		}

case AMDGPU::V_AND_B32_e32:		case AMDGPU::V_AND_B32_e32:
case AMDGPU::V_AND_B32_e64: {		case AMDGPU::V_AND_B32_e64: {
// e.g.:		// e.g.:
// from: v_and_b32_e32 v1, 0x0000ffff/0x000000ff, v0		// from: v_and_b32_e32 v1, 0x0000ffff/0x000000ff, v0
// to SDWA src:v0 src_sel:WORD_0/BYTE_0		// to SDWA src:v0 src_sel:WORD_0/BYTE_0

MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);		MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
auto ValSrc = Src1;		auto ValSrc = Src1;
auto Imm = foldToImm(*Src0);		auto Imm = foldToImm(*Src0);

if (!Imm) {		if (!Imm) {
Imm = foldToImm(*Src1);		Imm = foldToImm(*Src1);
ValSrc = Src0;		ValSrc = Src0;
}		}

if (!Imm || (Imm != 0x0000ffff && Imm != 0x000000ff))		if (!Imm || (Imm != 0x0000ffff && Imm != 0x000000ff))
break;		break;

MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);		MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);

if (TRI->isPhysicalRegister(Src1->getReg()) ||		if (TRI->isPhysicalRegister(Src1->getReg()) ||
TRI->isPhysicalRegister(Dst->getReg()))		TRI->isPhysicalRegister(Dst->getReg()))
break;		break;

auto SDWASrc = make_unique<SDWASrcOperand>(		return make_unique<SDWASrcOperand>(
ValSrc, Dst, *Imm == 0x0000ffff ? WORD_0 : BYTE_0);		ValSrc, Dst, *Imm == 0x0000ffff ? WORD_0 : BYTE_0);
DEBUG(dbgs() << "Match: " << MI << "To: " << *SDWASrc << '\n');		}
SDWAOperands[&MI] = std::move(SDWASrc);
++NumSDWAPatternsFound;		case AMDGPU::V_OR_B32_e32:
		case AMDGPU::V_OR_B32_e64: {
		// Patterns for dst_unused:UNUSED_PRESERVE.
		// e.g., from:
		// v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD
		// src1_sel:WORD_1 src2_sel:WORD_1
		// v_add_f16_e32 v3, v1, v2
		// v_or_b32_e32 v4, v0, v3
		// to SDWA preserve dst:v4 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE preserve:v3

		// Check if one of operands of v_or_b32 is SDWA instruction
		using CheckRetType = Optional<std::pair<MachineOperand *, MachineOperand *>>;
		auto CheckOROperandsForSDWA =
		[&](const MachineOperand *Op1, const MachineOperand *Op2) -> CheckRetType {
		if (!Op1 || !Op1->isReg() || !Op2 || !Op2->isReg())
		return CheckRetType(None);

		MachineOperand *Op1Def = findSingleRegDef(Op1, MRI);
		if (!Op1Def)
		return CheckRetType(None);

		MachineInstr *Op1Inst = Op1Def->getParent();
		if (!TII->isSDWA(*Op1Inst))
		return CheckRetType(None);

		MachineOperand *Op2Def = findSingleRegDef(Op2, MRI);
		if (!Op2Def)
		return CheckRetType(None);

		return CheckRetType(std::make_pair(Op1Def, Op2Def));
		};

		MachineOperand *OrSDWA = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
		MachineOperand *OrOther = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
		assert(OrSDWA && OrOther);
		auto Res = CheckOROperandsForSDWA(OrSDWA, OrOther);
		if (!Res) {
		OrSDWA = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
		OrOther = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
		assert(OrSDWA && OrOther);
		Res = CheckOROperandsForSDWA(OrSDWA, OrOther);
		if (!Res)
break;		break;
}		}

		MachineOperand *OrSDWADef = Res->first;
		MachineOperand *OrOtherDef = Res->second;
		assert(OrSDWADef && OrOtherDef);

		MachineInstr *SDWAInst = OrSDWADef->getParent();
		MachineInstr *OtherInst = OrOtherDef->getParent();

		// Check that OtherInst is actually bitwise compatible with SDWAInst, i.e.
		// their destination patterns don't overlap. A compatible instruction can be
		// either a regular instruction with compatible bitness or an SDWA
		// instruction with a correct dst_sel.
		// SDWAInst | OtherInst bitness     / OtherInst dst_sel
		// -----------------------------------------------------
		// DWORD    | no                    / no
		// WORD_0   | no                    / BYTE_2/3, WORD_1
		// WORD_1   | 8/16-bit instructions / BYTE_0/1, WORD_0
		// BYTE_0   | no                    / BYTE_1/2/3, WORD_1
		// BYTE_1   | 8-bit                 / BYTE_0/2/3, WORD_1
		// BYTE_2   | 8/16-bit              / BYTE_0/1/3, WORD_0
		// BYTE_3   | 8/16/24-bit           / BYTE_0/1/2, WORD_0
		// E.g. if SDWAInst is v_add_f16_sdwa dst_sel:WORD_1 then v_add_f16 is OK
		// but v_add_f32 is not.

		// TODO: add support for non-SDWA instructions as OtherInst.
		// For now this only works with SDWA instructions. For regular instructions
		// there is no way to determine whether an instruction writes only 8/16/24 bits
		// out of the full register size, and all registers are at least 32 bits wide.
		if (!TII->isSDWA(*OtherInst))
		break;

		SdwaSel DstSel = static_cast<SdwaSel>(
		TII->getNamedImmOperand(*SDWAInst, AMDGPU::OpName::dst_sel));
		SdwaSel OtherDstSel = static_cast<SdwaSel>(
		TII->getNamedImmOperand(*OtherInst, AMDGPU::OpName::dst_sel));

		bool DstSelAgree = false;
		switch (DstSel) {
		case WORD_0: DstSelAgree = ((OtherDstSel == BYTE_2) ||
		(OtherDstSel == BYTE_3) ||
		(OtherDstSel == WORD_1));
		break;
		case WORD_1: DstSelAgree = ((OtherDstSel == BYTE_0) ||
		(OtherDstSel == BYTE_1) ||
		(OtherDstSel == WORD_0));
		break;
		case BYTE_0: DstSelAgree = ((OtherDstSel == BYTE_1) ||
		(OtherDstSel == BYTE_2) ||
		(OtherDstSel == BYTE_3) ||
		(OtherDstSel == WORD_1));
		break;
		case BYTE_1: DstSelAgree = ((OtherDstSel == BYTE_0) ||
		(OtherDstSel == BYTE_2) ||
		(OtherDstSel == BYTE_3) ||
		(OtherDstSel == WORD_1));
		break;
		case BYTE_2: DstSelAgree = ((OtherDstSel == BYTE_0) ||
		(OtherDstSel == BYTE_1) ||
		(OtherDstSel == BYTE_3) ||
		(OtherDstSel == WORD_0));
		break;
		case BYTE_3: DstSelAgree = ((OtherDstSel == BYTE_0) ||
		(OtherDstSel == BYTE_1) ||
		(OtherDstSel == BYTE_2) ||
		(OtherDstSel == WORD_0));
		break;
		default: DstSelAgree = false;
		}

		if (!DstSelAgree)
		break;

		// Also OtherInst dst_unused should be UNUSED_PAD
		DstUnused OtherDstUnused = static_cast<DstUnused>(
		TII->getNamedImmOperand(*OtherInst, AMDGPU::OpName::dst_unused));
		if (OtherDstUnused != DstUnused::UNUSED_PAD)
		break;

		// Create DstPreserveOperand
		MachineOperand *OrDst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);
		assert(OrDst && OrDst->isReg());

		return make_unique<SDWADstPreserveOperand>(
		OrDst, OrSDWADef, OrOtherDef, DstSel);

		}
		}

		return std::unique_ptr<SDWAOperand>(nullptr);
		}
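
Note on the dst_sel compatibility switch above: the same table can be read as "two selectors agree when they cover disjoint bytes of the 32-bit destination". The following is a minimal standalone sketch of that reading, not code from this patch. The `SdwaSel` enum here is a local stand-in for `AMDGPU::SDWA::SdwaSel`, and `byteMask` and `dstSelAgree` are hypothetical helper names introduced only for illustration.

```cpp
#include <cstdint>

// Local stand-in for AMDGPU::SDWA::SdwaSel (assumption: only the names matter
// for this sketch, the numeric values are not relied upon).
enum SdwaSel { BYTE_0, BYTE_1, BYTE_2, BYTE_3, WORD_0, WORD_1, DWORD };

// Hypothetical helper: one bit per destination byte, bit 0 is the lowest byte.
constexpr std::uint8_t byteMask(SdwaSel Sel) {
  switch (Sel) {
  case BYTE_0: return 0x1;
  case BYTE_1: return 0x2;
  case BYTE_2: return 0x4;
  case BYTE_3: return 0x8;
  case WORD_0: return 0x3;
  case WORD_1: return 0xC;
  case DWORD:  return 0xF;
  }
  return 0xF; // Unknown selector: treat as covering the whole register.
}

// Hypothetical helper: two destination selectors agree when they write
// disjoint bytes, which is what the switch in the diff enumerates case by case.
constexpr bool dstSelAgree(SdwaSel A, SdwaSel B) {
  return (byteMask(A) & byteMask(B)) == 0;
}

static_assert(dstSelAgree(WORD_1, WORD_0), "upper and lower words are disjoint");
static_assert(!dstSelAgree(DWORD, BYTE_0), "DWORD overlaps everything");
```

Under these assumptions the mask test gives the same answers as the switch for every pair in the comment table, including DWORD never agreeing with anything.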

		void SIPeepholeSDWA::matchSDWAOperands(MachineFunction &MF) {
		for (MachineBasicBlock &MBB : MF) {
		for (MachineInstr &MI : MBB) {
		if (auto Operand = matchSDWAOperand(MI)) {
		DEBUG(dbgs() << "Match: " << MI << "To: " << *Operand << '\n');
		SDWAOperands[&MI] = std::move(Operand);
		++NumSDWAPatternsFound;
}		}
}		}
}		}
}		}
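
Note on the source patterns matched above: the shift, BFE and AND cases all reduce to choosing a selector for a bit field given by an (offset, width) pair. A 32-bit right shift by 16 or 24 selects the field (16, 16) or (24, 8), a 16-bit right shift by 8 selects (8, 8), and the masks 0xffff and 0xff select (0, 16) and (0, 8). The sketch below illustrates that mapping in isolation. It is not part of the patch; `selForBitfield` is a hypothetical name, the enum is a local stand-in for `AMDGPU::SDWA::SdwaSel`, and `std::optional` (C++17) is used in place of `llvm::Optional`.

```cpp
#include <cstdint>
#include <optional>

// Local stand-in for AMDGPU::SDWA::SdwaSel (assumption for this sketch).
enum SdwaSel { BYTE_0, BYTE_1, BYTE_2, BYTE_3, WORD_0, WORD_1, DWORD };

// Hypothetical helper: map a bit field (offset, width) to the selector that
// covers exactly that field, following the offset/width table in the
// V_BFE case. Returns std::nullopt for fields SDWA cannot select.
std::optional<SdwaSel> selForBitfield(std::int64_t Offset, std::int64_t Width) {
  if (Width == 32)
    return Offset == 0 ? std::optional<SdwaSel>(DWORD) : std::nullopt;
  if (Width == 16) {
    if (Offset == 0)
      return WORD_0;
    if (Offset == 16)
      return WORD_1;
    return std::nullopt;
  }
  if (Width == 8) {
    switch (Offset) {
    case 0:  return BYTE_0;
    case 8:  return BYTE_1;
    case 16: return BYTE_2;
    case 24: return BYTE_3;
    default: return std::nullopt;
    }
  }
  return std::nullopt;
}
```

For example, `selForBitfield(16, 16)` yields `WORD_1`, matching the `v_lshrrev_b32 ... 16` pattern, while an unaligned field such as `(4, 8)` yields no selector and would leave the instruction unmatched.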

bool SIPeepholeSDWA::isConvertibleToSDWA(const MachineInstr &MI,		bool SIPeepholeSDWA::isConvertibleToSDWA(const MachineInstr &MI,
const SISubtarget &ST) const {		const SISubtarget &ST) const {
		// Check if this is already an SDWA instruction
		unsigned Opc = MI.getOpcode();
		if (TII->isSDWA(Opc))
		return true;

// Check if this instruction has opcode that supports SDWA		// Check if this instruction has opcode that supports SDWA
int Opc = MI.getOpcode();
if (AMDGPU::getSDWAOp(Opc) == -1)		if (AMDGPU::getSDWAOp(Opc) == -1)
Opc = AMDGPU::getVOPe32(Opc);		Opc = AMDGPU::getVOPe32(Opc);

if (Opc == -1 || AMDGPU::getSDWAOp(Opc) == -1)		if (AMDGPU::getSDWAOp(Opc) == -1)
return false;		return false;

if (!ST.hasSDWAOmod() && TII->hasModifiersSet(MI, AMDGPU::OpName::omod))		if (!ST.hasSDWAOmod() && TII->hasModifiersSet(MI, AMDGPU::OpName::omod))
return false;		return false;

if (TII->isVOPC(Opc)) {		if (TII->isVOPC(Opc)) {
if (!ST.hasSDWASdst()) {		if (!ST.hasSDWASdst()) {
const MachineOperand *SDst = TII->getNamedOperand(MI, AMDGPU::OpName::sdst);		const MachineOperand *SDst = TII->getNamedOperand(MI, AMDGPU::OpName::sdst);
// ... (16 unchanged lines not shown; context: if (!ST.hasSDWAMac() && (Opc == AMDGPU::V_MAC_F16_e32 ||)
return false;		return false;

return true;		return true;
}		}

bool SIPeepholeSDWA::convertToSDWA(MachineInstr &MI,		bool SIPeepholeSDWA::convertToSDWA(MachineInstr &MI,
const SDWAOperandsVector &SDWAOperands) {		const SDWAOperandsVector &SDWAOperands) {
// Convert to sdwa		// Convert to sdwa
int SDWAOpcode = AMDGPU::getSDWAOp(MI.getOpcode());		int SDWAOpcode;
		unsigned Opcode = MI.getOpcode();
		if (TII->isSDWA(Opcode)) {
		SDWAOpcode = Opcode;
		} else {
		SDWAOpcode = AMDGPU::getSDWAOp(Opcode);
if (SDWAOpcode == -1)		if (SDWAOpcode == -1)
SDWAOpcode = AMDGPU::getSDWAOp(AMDGPU::getVOPe32(MI.getOpcode()));		SDWAOpcode = AMDGPU::getSDWAOp(AMDGPU::getVOPe32(Opcode));
		}
assert(SDWAOpcode != -1);		assert(SDWAOpcode != -1);

const MCInstrDesc &SDWADesc = TII->get(SDWAOpcode);		const MCInstrDesc &SDWADesc = TII->get(SDWAOpcode);

// Create SDWA version of instruction MI and initialize its operands		// Create SDWA version of instruction MI and initialize its operands
MachineInstrBuilder SDWAInst =		MachineInstrBuilder SDWAInst =
BuildMI(*MI.getParent(), MI, MI.getDebugLoc(), SDWADesc);		BuildMI(*MI.getParent(), MI, MI.getDebugLoc(), SDWADesc);

// ... (59 unchanged lines not shown; context: if (AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::omod) != -1) {)
MachineOperand *OMod = TII->getNamedOperand(MI, AMDGPU::OpName::omod);		MachineOperand *OMod = TII->getNamedOperand(MI, AMDGPU::OpName::omod);
if (OMod) {		if (OMod) {
SDWAInst.add(*OMod);		SDWAInst.add(*OMod);
} else {		} else {
SDWAInst.addImm(0);		SDWAInst.addImm(0);
}		}
}		}

// Initialize dst_sel if present		// Copy dst_sel if present, initialize otherwise if needed
if (AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::dst_sel) != -1) {		if (AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::dst_sel) != -1) {
		MachineOperand *DstSel = TII->getNamedOperand(MI, AMDGPU::OpName::dst_sel);
		if (DstSel) {
		SDWAInst.add(*DstSel);
		} else {
SDWAInst.addImm(AMDGPU::SDWA::SdwaSel::DWORD);		SDWAInst.addImm(AMDGPU::SDWA::SdwaSel::DWORD);
}		}
		}

// Initialize dst_unused if present		// Copy dst_unused if present, initialize otherwise if needed
if (AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::dst_unused) != -1) {		if (AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::dst_unused) != -1) {
		MachineOperand *DstUnused = TII->getNamedOperand(MI, AMDGPU::OpName::dst_unused);
		if (DstUnused) {
		SDWAInst.add(*DstUnused);
		} else {
SDWAInst.addImm(AMDGPU::SDWA::DstUnused::UNUSED_PAD);		SDWAInst.addImm(AMDGPU::SDWA::DstUnused::UNUSED_PAD);
}		}
		}

// Initialize src0_sel		// Copy src0_sel if present, initialize otherwise
assert(AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::src0_sel) != -1);		assert(AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::src0_sel) != -1);
		MachineOperand *Src0Sel = TII->getNamedOperand(MI, AMDGPU::OpName::src0_sel);
		if (Src0Sel) {
		SDWAInst.add(*Src0Sel);
		} else {
SDWAInst.addImm(AMDGPU::SDWA::SdwaSel::DWORD);		SDWAInst.addImm(AMDGPU::SDWA::SdwaSel::DWORD);
		}

		// Copy src1_sel if present, initialize otherwise if needed
// Initialize src1_sel if present
if (Src1) {		if (Src1) {
assert(AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::src1_sel) != -1);		assert(AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::src1_sel) != -1);
		MachineOperand *Src1Sel = TII->getNamedOperand(MI, AMDGPU::OpName::src1_sel);
		if (Src1Sel) {
		SDWAInst.add(*Src1Sel);
		} else {
SDWAInst.addImm(AMDGPU::SDWA::SdwaSel::DWORD);		SDWAInst.addImm(AMDGPU::SDWA::SdwaSel::DWORD);
}		}
		}

// Apply all sdwa operand patterns		// Apply all sdwa operand patterns
bool Converted = false;		bool Converted = false;
for (auto &Operand : SDWAOperands) {		for (auto &Operand : SDWAOperands) {
// There should be no intersection between SDWA operands and potential MIs		// There should be no intersection between SDWA operands and potential MIs
// e.g.:		// e.g.:
// v_and_b32 v0, 0xff, v1 -> src:v1 sel:BYTE_0		// v_and_b32 v0, 0xff, v1 -> src:v1 sel:BYTE_0
// v_and_b32 v2, 0xff, v0 -> src:v0 sel:BYTE_0		// v_and_b32 v2, 0xff, v0 -> src:v0 sel:BYTE_0
// ... (21 unchanged lines not shown)
return true;		return true;
}		}

// If an instruction was converted to SDWA it should not have immediates or SGPR		// If an instruction was converted to SDWA it should not have immediates or SGPR
// operands (allowed one SGPR on GFX9). Copy its scalar operands into VGPRs.		// operands (allowed one SGPR on GFX9). Copy its scalar operands into VGPRs.
void SIPeepholeSDWA::legalizeScalarOperands(MachineInstr &MI, const SISubtarget &ST) const {		void SIPeepholeSDWA::legalizeScalarOperands(MachineInstr &MI, const SISubtarget &ST) const {
const MCInstrDesc &Desc = TII->get(MI.getOpcode());		const MCInstrDesc &Desc = TII->get(MI.getOpcode());
unsigned ConstantBusCount = 0;		unsigned ConstantBusCount = 0;
for (MachineOperand &Op: MI.explicit_uses()) {		for (MachineOperand &Op : MI.explicit_uses()) {
if (!Op.isImm() && !(Op.isReg() && !TRI->isVGPR(*MRI, Op.getReg())))		if (!Op.isImm() && !(Op.isReg() && !TRI->isVGPR(*MRI, Op.getReg())))
continue;		continue;

unsigned I = MI.getOperandNo(&Op);		unsigned I = MI.getOperandNo(&Op);
if (Desc.OpInfo[I].RegClass == -1 ||		if (Desc.OpInfo[I].RegClass == -1 ||
!TRI->hasVGPRs(TRI->getRegClass(Desc.OpInfo[I].RegClass)))		!TRI->hasVGPRs(TRI->getRegClass(Desc.OpInfo[I].RegClass)))
continue;		continue;

// ... (21 unchanged lines not shown; context: bool SIPeepholeSDWA::runOnMachineFunction(MachineFunction &MF) {)
if (!ST.hasSDWA() || skipFunction(*MF.getFunction()))		if (!ST.hasSDWA() || skipFunction(*MF.getFunction()))
return false;		return false;

MRI = &MF.getRegInfo();		MRI = &MF.getRegInfo();
TRI = ST.getRegisterInfo();		TRI = ST.getRegisterInfo();
TII = ST.getInstrInfo();		TII = ST.getInstrInfo();

// Find all SDWA operands in MF.		// Find all SDWA operands in MF.
		bool Changed = false;
		bool Ret = false;
		do {
matchSDWAOperands(MF);		matchSDWAOperands(MF);

for (const auto &OperandPair : SDWAOperands) {		for (const auto &OperandPair : SDWAOperands) {
const auto &Operand = OperandPair.second;		const auto &Operand = OperandPair.second;
MachineInstr *PotentialMI = Operand->potentialToConvert(TII);		MachineInstr *PotentialMI = Operand->potentialToConvert(TII);
if (PotentialMI && isConvertibleToSDWA(*PotentialMI, ST)) {		if (PotentialMI && isConvertibleToSDWA(*PotentialMI, ST)) {
PotentialMatches[PotentialMI].push_back(Operand.get());		PotentialMatches[PotentialMI].push_back(Operand.get());
}		}
}		}

for (auto &PotentialPair : PotentialMatches) {		for (auto &PotentialPair : PotentialMatches) {
MachineInstr &PotentialMI = *PotentialPair.first;		MachineInstr &PotentialMI = *PotentialPair.first;
convertToSDWA(PotentialMI, PotentialPair.second);		convertToSDWA(PotentialMI, PotentialPair.second);
}		}

PotentialMatches.clear();		PotentialMatches.clear();
SDWAOperands.clear();		SDWAOperands.clear();

bool Ret = !ConvertedInstructions.empty();		Changed = !ConvertedInstructions.empty();

		if (Changed)
		Ret = true;

while (!ConvertedInstructions.empty())		while (!ConvertedInstructions.empty())
legalizeScalarOperands(*ConvertedInstructions.pop_back_val(), ST);		legalizeScalarOperands(*ConvertedInstructions.pop_back_val(), ST);
		} while (Changed);

return Ret;		return Ret;
}		}
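
Note on the new do/while in runOnMachineFunction: the pass now repeats match-and-convert rounds until a round converts nothing, and reports a change if any round did. A plausible reason, judging from the diff, is that the V_OR_B32 preserve pattern only matches once its operands are already SDWA instructions produced by an earlier round. The control flow can be sketched on its own as below. This is a generic illustration, not the pass itself; `runToFixedPoint` and `RunOnce` are hypothetical names.

```cpp
#include <functional>

// Hypothetical helper: call RunOnce until it reports no change. RunOnce stands
// in for one "match operands, convert, legalize" round and returns true when
// that round converted at least one instruction.
bool runToFixedPoint(const std::function<bool()> &RunOnce) {
  bool Ret = false;     // True if any round changed something.
  bool Changed = false; // True if the most recent round changed something.
  do {
    Changed = RunOnce();
    if (Changed)
      Ret = true;
  } while (Changed);
  return Ret;
}
```

The loop terminates only if each round eventually stops finding work, which in the pass relies on every conversion consuming the instruction it matched.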

llvm/trunk/test/CodeGen/AMDGPU/fabs.f16.ll

	; ... (121 unchanged lines not shown)

	; CI: v_cvt_f32_f16_e32			; CI: v_cvt_f32_f16_e32
	; CI: v_cvt_f32_f16_e32			; CI: v_cvt_f32_f16_e32
	; CI: v_mul_f32_e64 v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}}			; CI: v_mul_f32_e64 v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}}
	; CI: v_cvt_f16_f32			; CI: v_cvt_f16_f32
	; CI: v_mul_f32_e64 v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}}			; CI: v_mul_f32_e64 v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}}
	; CI: v_cvt_f16_f32			; CI: v_cvt_f16_f32

	; VI: v_lshrrev_b32_e32 v{{[0-9]+}}, 16,			; VI: v_mul_f16_sdwa v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
	; VI: v_mul_f16_sdwa v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
	; VI: v_mul_f16_e64 v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}}			; VI: v_mul_f16_e64 v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}}

	; GFX9: v_and_b32_e32 [[FABS:v[0-9]+]], 0x7fff7fff, [[VAL]]			; GFX9: v_and_b32_e32 [[FABS:v[0-9]+]], 0x7fff7fff, [[VAL]]
	; GFX9: v_pk_mul_f16 v{{[0-9]+}}, [[FABS]], v{{[0-9]+$}}			; GFX9: v_pk_mul_f16 v{{[0-9]+}}, [[FABS]], v{{[0-9]+$}}
	define amdgpu_kernel void @v_fabs_fold_v2f16(<2 x half> addrspace(1)* %out, <2 x half> addrspace(1)* %in) #0 {			define amdgpu_kernel void @v_fabs_fold_v2f16(<2 x half> addrspace(1)* %out, <2 x half> addrspace(1)* %in) #0 {
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %in, i32 %tid			%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %in, i32 %tid
	%val = load <2 x half>, <2 x half> addrspace(1)* %gep			%val = load <2 x half>, <2 x half> addrspace(1)* %gep
	; ... (13 unchanged lines not shown)

llvm/trunk/test/CodeGen/AMDGPU/fcanonicalize.f16.ll

; ... (201 unchanged lines not shown)
; GCN: buffer_store_short [[REG]]		; GCN: buffer_store_short [[REG]]
define amdgpu_kernel void @test_fold_canonicalize_snan3_value_f16(half addrspace(1)* %out) #1 {		define amdgpu_kernel void @test_fold_canonicalize_snan3_value_f16(half addrspace(1)* %out) #1 {
%canonicalized = call half @llvm.canonicalize.f16(half 0xHFC01)		%canonicalized = call half @llvm.canonicalize.f16(half 0xHFC01)
store half %canonicalized, half addrspace(1)* %out		store half %canonicalized, half addrspace(1)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}v_test_canonicalize_var_v2f16:		; GCN-LABEL: {{^}}v_test_canonicalize_var_v2f16:
; VI-DAG: v_max_f16_sdwa [[REG0:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD		; VI-DAG: v_max_f16_sdwa [[REG0:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; VI-DAG: v_max_f16_e32 [[REG1:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+}}		; VI-DAG: v_max_f16_e32 [[REG1:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+}}
; VI-NOT: v_and_b32		; VI-NOT: v_and_b32

; GFX9: v_pk_max_f16 [[REG:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+$}}		; GFX9: v_pk_max_f16 [[REG:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+$}}
; GFX9: buffer_store_dword [[REG]]		; GFX9: buffer_store_dword [[REG]]
define amdgpu_kernel void @v_test_canonicalize_var_v2f16(<2 x half> addrspace(1)* %out) #1 {		define amdgpu_kernel void @v_test_canonicalize_var_v2f16(<2 x half> addrspace(1)* %out) #1 {
%tid = call i32 @llvm.amdgcn.workitem.id.x()		%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %out, i32 %tid		%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %out, i32 %tid
; ... (22 unchanged lines not shown; context: define amdgpu_kernel void @v_test_canonicalize_fabs_var_v2f16(<2 x half> addrspace(1)* %out) #1 {)
%val.fabs = call <2 x half> @llvm.fabs.v2f16(<2 x half> %val)		%val.fabs = call <2 x half> @llvm.fabs.v2f16(<2 x half> %val)
%canonicalized = call <2 x half> @llvm.canonicalize.v2f16(<2 x half> %val.fabs)		%canonicalized = call <2 x half> @llvm.canonicalize.v2f16(<2 x half> %val.fabs)
store <2 x half> %canonicalized, <2 x half> addrspace(1)* %out		store <2 x half> %canonicalized, <2 x half> addrspace(1)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}v_test_canonicalize_fneg_fabs_var_v2f16:		; GCN-LABEL: {{^}}v_test_canonicalize_fneg_fabs_var_v2f16:
; VI-DAG: v_or_b32_e32 v{{[0-9]+}}, 0x80008000, v{{[0-9]+}}		; VI-DAG: v_or_b32_e32 v{{[0-9]+}}, 0x80008000, v{{[0-9]+}}
; VI-DAG: v_max_f16_sdwa [[REG0:v[0-9]+]], v{{[0-9]+}}, v{{[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD		; VI-DAG: v_max_f16_sdwa [[REG0:v[0-9]+]], v{{[0-9]+}}, v{{[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; VI-DAG: v_max_f16_e32 [[REG1:v[0-9]+]], v{{[0-9]+}}, v{{[0-9]+}}		; VI-DAG: v_max_f16_e32 [[REG1:v[0-9]+]], v{{[0-9]+}}, v{{[0-9]+}}
; VI: v_or_b32		; VI: v_or_b32

; GFX9: v_and_b32_e32 [[ABS:v[0-9]+]], 0x7fff7fff, v{{[0-9]+}}		; GFX9: v_and_b32_e32 [[ABS:v[0-9]+]], 0x7fff7fff, v{{[0-9]+}}
; GFX9: v_pk_max_f16 [[REG:v[0-9]+]], [[ABS]], [[ABS]] neg_lo:[1,1] neg_hi:[1,1]{{$}}		; GFX9: v_pk_max_f16 [[REG:v[0-9]+]], [[ABS]], [[ABS]] neg_lo:[1,1] neg_hi:[1,1]{{$}}
; GCN: buffer_store_dword		; GCN: buffer_store_dword
define amdgpu_kernel void @v_test_canonicalize_fneg_fabs_var_v2f16(<2 x half> addrspace(1)* %out) #1 {		define amdgpu_kernel void @v_test_canonicalize_fneg_fabs_var_v2f16(<2 x half> addrspace(1)* %out) #1 {
%tid = call i32 @llvm.amdgcn.workitem.id.x()		%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %out, i32 %tid		%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %out, i32 %tid
%val = load <2 x half>, <2 x half> addrspace(1)* %gep		%val = load <2 x half>, <2 x half> addrspace(1)* %gep
%val.fabs = call <2 x half> @llvm.fabs.v2f16(<2 x half> %val)		%val.fabs = call <2 x half> @llvm.fabs.v2f16(<2 x half> %val)
%val.fabs.fneg = fsub <2 x half> <half -0.0, half -0.0>, %val.fabs		%val.fabs.fneg = fsub <2 x half> <half -0.0, half -0.0>, %val.fabs
%canonicalized = call <2 x half> @llvm.canonicalize.v2f16(<2 x half> %val.fabs.fneg)		%canonicalized = call <2 x half> @llvm.canonicalize.v2f16(<2 x half> %val.fabs.fneg)
store <2 x half> %canonicalized, <2 x half> addrspace(1)* %out		store <2 x half> %canonicalized, <2 x half> addrspace(1)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}v_test_canonicalize_fneg_var_v2f16:		; GCN-LABEL: {{^}}v_test_canonicalize_fneg_var_v2f16:
; VI: v_xor_b32_e32 [[FNEG:v[0-9]+]], 0x80008000, v{{[0-9]+}}		; VI: v_xor_b32_e32 [[FNEG:v[0-9]+]], 0x80008000, v{{[0-9]+}}
; VI: v_lshrrev_b32_e32 [[FNEGHI:v[0-9]+]], 16, [[FNEG]]		; VI-DAG: v_max_f16_sdwa [[REG1:v[0-9]+]], [[FNEG]], [[FNEG]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; VI-DAG: v_max_f16_sdwa [[REG1:v[0-9]+]], [[FNEG]], [[FNEGHI]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
; VI-DAG: v_max_f16_e32 [[REG0:v[0-9]+]], [[FNEG]], [[FNEG]]		; VI-DAG: v_max_f16_e32 [[REG0:v[0-9]+]], [[FNEG]], [[FNEG]]
; VI-NOT: 0xffff		; VI-NOT: 0xffff

; GFX9: v_pk_max_f16 [[REG:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+}} neg_lo:[1,1] neg_hi:[1,1]{{$}}		; GFX9: v_pk_max_f16 [[REG:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+}} neg_lo:[1,1] neg_hi:[1,1]{{$}}
; GFX9: buffer_store_dword [[REG]]		; GFX9: buffer_store_dword [[REG]]
define amdgpu_kernel void @v_test_canonicalize_fneg_var_v2f16(<2 x half> addrspace(1)* %out) #1 {		define amdgpu_kernel void @v_test_canonicalize_fneg_var_v2f16(<2 x half> addrspace(1)* %out) #1 {
%tid = call i32 @llvm.amdgcn.workitem.id.x()		%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %out, i32 %tid		%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %out, i32 %tid

llvm/trunk/test/CodeGen/AMDGPU/fneg.f16.ll


	; CI: v_cvt_f32_f16_e64 v{{[0-9]+}}, -v{{[0-9]+}}			; CI: v_cvt_f32_f16_e64 v{{[0-9]+}}, -v{{[0-9]+}}
	; CI: v_cvt_f32_f16_e64 v{{[0-9]+}}, -v{{[0-9]+}}			; CI: v_cvt_f32_f16_e64 v{{[0-9]+}}, -v{{[0-9]+}}
	; CI: v_mul_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}			; CI: v_mul_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
	; CI: v_cvt_f16_f32			; CI: v_cvt_f16_f32
	; CI: v_mul_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}			; CI: v_mul_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
	; CI: v_cvt_f16_f32			; CI: v_cvt_f16_f32

	; VI: v_lshrrev_b32_e32 v{{[0-9]+}}, 16,			; VI: v_mul_f16_sdwa v{{[0-9]+}}, -v{{[0-9]+}}, v{{[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
	; VI: v_mul_f16_sdwa v{{[0-9]+}}, -v{{[0-9]+}}, v{{[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
	; VI: v_mul_f16_e64 v{{[0-9]+}}, -v{{[0-9]+}}, v{{[0-9]+}}			; VI: v_mul_f16_e64 v{{[0-9]+}}, -v{{[0-9]+}}, v{{[0-9]+}}

	; GFX9: v_pk_mul_f16 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} neg_lo:[1,0] neg_hi:[1,0]{{$}}			; GFX9: v_pk_mul_f16 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} neg_lo:[1,0] neg_hi:[1,0]{{$}}
	define amdgpu_kernel void @v_fneg_fold_v2f16(<2 x half> addrspace(1)* %out, <2 x half> addrspace(1)* %in) #0 {			define amdgpu_kernel void @v_fneg_fold_v2f16(<2 x half> addrspace(1)* %out, <2 x half> addrspace(1)* %in) #0 {
	%val = load <2 x half>, <2 x half> addrspace(1)* %in			%val = load <2 x half>, <2 x half> addrspace(1)* %in
	%fsub = fsub <2 x half> <half -0.0, half -0.0>, %val			%fsub = fsub <2 x half> <half -0.0, half -0.0>, %val
	%fmul = fmul <2 x half> %fsub, %val			%fmul = fmul <2 x half> %fsub, %val
	store <2 x half> %fmul, <2 x half> addrspace(1)* %out			store <2 x half> %fmul, <2 x half> addrspace(1)* %out

llvm/trunk/test/CodeGen/AMDGPU/sdwa-peephole-instr.mir

bb.0:
%sgpr30_sgpr31 = COPY %2		%sgpr30_sgpr31 = COPY %2
S_SETPC_B64_return %sgpr30_sgpr31		S_SETPC_B64_return %sgpr30_sgpr31

...		...
---		---
# GCN-LABEL: {{^}}name: vop2_instructions		# GCN-LABEL: {{^}}name: vop2_instructions


# VI: %{{[0-9]+}}:vgpr_32 = V_AND_B32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 6, 0, 6, 5, implicit %exec		# VI: %{{[0-9]+}}:vgpr_32 = V_AND_B32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 5, 0, 6, 5, implicit %exec
# VI: %{{[0-9]+}}:vgpr_32 = V_ADD_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 5, 0, 5, 1, implicit %exec		# VI: %{{[0-9]+}}:vgpr_32 = V_ADD_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 5, 0, 5, 1, implicit %exec
# VI: %{{[0-9]+}}:vgpr_32 = V_SUB_F16_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 6, 0, 5, 1, implicit %exec		# VI: %{{[0-9]+}}:vgpr_32 = V_SUB_F16_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 6, 0, 5, 1, implicit %exec
# VI: %{{[0-9]+}}:vgpr_32 = V_MAC_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, %{{[0-9]+}}, 0, 0, 6, 0, 6, 1, implicit %exec		# VI: %{{[0-9]+}}:vgpr_32 = V_MAC_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, %{{[0-9]+}}, 0, 0, 6, 0, 6, 1, implicit %exec
# VI: %{{[0-9]+}}:vgpr_32 = V_MAC_F16_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, %{{[0-9]+}}, 0, 0, 6, 0, 5, 1, implicit %exec		# VI: %{{[0-9]+}}:vgpr_32 = V_MAC_F16_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, %{{[0-9]+}}, 0, 0, 6, 0, 5, 1, implicit %exec

# GFX9: %{{[0-9]+}}:vgpr_32 = V_AND_B32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 6, 0, 6, 5, implicit %exec		# GFX9: %{{[0-9]+}}:vgpr_32 = V_AND_B32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 5, 0, 6, 5, implicit %exec
# GFX9: %{{[0-9]+}}:vgpr_32 = V_ADD_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 5, 0, 5, 1, implicit %exec		# GFX9: %{{[0-9]+}}:vgpr_32 = V_ADD_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 5, 0, 5, 1, implicit %exec
# GFX9: %{{[0-9]+}}:vgpr_32 = V_SUB_F16_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 6, 0, 5, 1, implicit %exec		# GFX9: %{{[0-9]+}}:vgpr_32 = V_SUB_F16_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 6, 0, 5, 1, implicit %exec
# GFX9: %{{[0-9]+}}:vgpr_32 = V_MAC_F32_e32 %{{[0-9]+}}, %{{[0-9]+}}, %{{[0-9]+}}, implicit %exec		# GFX9: %{{[0-9]+}}:vgpr_32 = V_MAC_F32_e32 %{{[0-9]+}}, %{{[0-9]+}}, %{{[0-9]+}}, implicit %exec
# GFX9: %{{[0-9]+}}:vgpr_32 = V_MAC_F16_e32 %{{[0-9]+}}, %{{[0-9]+}}, %{{[0-9]+}}, implicit %exec		# GFX9: %{{[0-9]+}}:vgpr_32 = V_MAC_F16_e32 %{{[0-9]+}}, %{{[0-9]+}}, %{{[0-9]+}}, implicit %exec


# VI: %{{[0-9]+}}:vgpr_32 = V_AND_B32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 5, 0, 6, 5, implicit %exec		# VI: %{{[0-9]+}}:vgpr_32 = V_AND_B32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 5, 0, 6, 5, implicit %exec
# VI: %{{[0-9]+}}:vgpr_32 = V_ADD_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 5, 0, 5, 1, implicit %exec		# VI: %{{[0-9]+}}:vgpr_32 = V_ADD_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 5, 0, 5, 1, implicit %exec

llvm/trunk/test/CodeGen/AMDGPU/sdwa-preserve.mir

				# RUN: llc -march=amdgcn -mcpu=fiji -start-before=si-peephole-sdwa -verify-machineinstrs -o - %s | FileCheck -check-prefix=SDWA %s
				# RUN: llc -march=amdgcn -mcpu=gfx900 -start-before=si-peephole-sdwa -verify-machineinstrs -o - %s | FileCheck -check-prefix=SDWA %s

				# SDWA-LABEL: {{^}}add_f16_u32_preserve

				# SDWA: flat_load_dword [[FIRST:v[0-9]+]], v[{{[0-9]+}}:{{[0-9]+}}]
				# SDWA: flat_load_dword [[SECOND:v[0-9]+]], v[{{[0-9]+}}:{{[0-9]+}}]

				# SDWA: v_mul_f32_sdwa [[RES:v[0-9]+]], [[FIRST]], [[SECOND]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:BYTE_3
				# SDWA: v_add_f16_sdwa [[RES:v[0-9]+]], [[FIRST]], [[SECOND]] dst_sel:BYTE_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_0 src1_sel:WORD_1

				# SDWA: flat_store_dword v[{{[0-9]+}}:{{[0-9]+}}], [[RES]]

				---
				name: add_f16_u32_preserve
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vreg_64 }
				- { id: 1, class: vreg_64 }
				- { id: 2, class: sreg_64 }
				- { id: 3, class: vgpr_32 }
				- { id: 4, class: vgpr_32 }
				- { id: 5, class: vgpr_32 }
				- { id: 6, class: vgpr_32 }
				- { id: 7, class: vgpr_32 }
				- { id: 8, class: vgpr_32 }
				- { id: 9, class: vgpr_32 }
				- { id: 10, class: vgpr_32 }
				- { id: 11, class: vgpr_32 }
				- { id: 12, class: vgpr_32 }
				- { id: 13, class: vgpr_32 }
				body: |
				bb.0:
				liveins: %vgpr0_vgpr1, %vgpr2_vgpr3, %sgpr30_sgpr31

				%2 = COPY %sgpr30_sgpr31
				%1 = COPY %vgpr2_vgpr3
				%0 = COPY %vgpr0_vgpr1
				%3 = FLAT_LOAD_DWORD %0, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4)
				%4 = FLAT_LOAD_DWORD %1, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4)

				%5 = V_AND_B32_e32 65535, %3, implicit %exec
				%6 = V_LSHRREV_B32_e64 16, %4, implicit %exec
				%7 = V_BFE_U32 %3, 8, 8, implicit %exec
				%8 = V_LSHRREV_B32_e32 24, %4, implicit %exec

				%9 = V_ADD_F16_e64 0, %5, 0, %6, 0, 0, implicit %exec
				%10 = V_LSHLREV_B16_e64 8, %9, implicit %exec
				%11 = V_MUL_F32_e64 0, %7, 0, %8, 0, 0, implicit %exec
				%12 = V_LSHLREV_B32_e64 16, %11, implicit %exec

				%13 = V_OR_B32_e64 %10, %12, implicit %exec

				FLAT_STORE_DWORD %0, %13, 0, 0, 0, implicit %exec, implicit %flat_scr :: (store 4)
				%sgpr30_sgpr31 = COPY %2
				S_SETPC_B64_return %sgpr30_sgpr31
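
Note on what the sdwa-preserve.mir test above relies on: the V_OR_B32 of two results that occupy disjoint bit ranges is expected to collapse into one UNUSED_PAD write followed by one UNUSED_PRESERVE write into the same register. The sketch below is a minimal bit-level model of that equivalence, not code from the patch. The helper names (`sdwa_write`, `or_of_padded`, `preserve_chain`) are hypothetical, operand values are treated as plain integers rather than floating-point results, and UNUSED_SEXT is left out.

```python
# Bit ranges selected by each SDWA dst_sel value: (shift, width).
SDWA_SEL = {
    "BYTE_0": (0, 8),
    "BYTE_1": (8, 8),
    "BYTE_2": (16, 8),
    "BYTE_3": (24, 8),
    "WORD_0": (0, 16),
    "WORD_1": (16, 16),
    "DWORD": (0, 32),
}


def sdwa_write(old_dst, value, dst_sel, dst_unused):
    """Model the 32-bit destination after an SDWA write (illustrative only)."""
    shift, width = SDWA_SEL[dst_sel]
    field_mask = ((1 << width) - 1) << shift
    field = (value << shift) & field_mask
    if dst_unused == "UNUSED_PAD":
        # Bits outside the selected field are zeroed.
        return field
    if dst_unused == "UNUSED_PRESERVE":
        # Bits outside the selected field keep their previous value.
        return (old_dst & ~field_mask & 0xFFFFFFFF) | field
    raise ValueError("unsupported dst_unused: " + dst_unused)


def or_of_padded(hi_value, lo_value):
    """Shape of the input: two zero-padded results combined with OR."""
    hi = sdwa_write(0, hi_value, "WORD_1", "UNUSED_PAD")
    lo = sdwa_write(0, lo_value, "BYTE_1", "UNUSED_PAD")
    return hi | lo


def preserve_chain(hi_value, lo_value):
    """Shape of the expected output: a PAD write, then a PRESERVE write."""
    first = sdwa_write(0, hi_value, "WORD_1", "UNUSED_PAD")
    return sdwa_write(first, lo_value, "BYTE_1", "UNUSED_PRESERVE")
```

The two forms agree only because WORD_1 and BYTE_1 do not overlap. If the second write targeted a range that intersects the first, the OR would mix bits while the PRESERVE write would overwrite them, which is why the peephole has to check that the destinations are disjoint before combining.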