This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] SDWA: add support for PRESERVE into SDWA peephole. Add new merge SDWA preserve pass
ClosedPublic

Authored by SamWot on Sep 13 2017, 10:42 AM.

Details

Reviewers

arsenm
vpykhtin
rampitec

Commits

rG5f7f32c3826e: [AMDGPU] SDWA: add support for PRESERVE into SDWA peephole.
rL319662: [AMDGPU] SDWA: add support for PRESERVE into SDWA peephole.

Summary

SDWA instructions support several values of the dst_unused operand. One of these is UNUSED_PRESERVE. This value means that the parts of the destination register that are not written by the SDWA instruction are left unchanged. Currently the SDWA peephole pass doesn't generate UNUSED_PRESERVE. It only generates UNUSED_PAD, which means that the unused parts of the dst register are set to 0.
The big problem with UNUSED_PRESERVE is that by its nature it can't be represented in SSA form. PRESERVE assumes that the register it writes into was already written by some other instruction, and our SDWA instruction keeps that value intact.
Another problem is that in the AMDGPU backend the smallest sub-register is 32 bits wide and SDWA needs smaller ones, so support for PRESERVE can't be done with subregs.
For those reasons, support for PRESERVE is split into 2 major parts. First: changes in the SDWA peephole pass that allow it to recognize patterns for PRESERVE and generate the corresponding instruction. This pass works on SSA machine code and generates SSA-compatible code. Second: a new pass that works on non-SSA code and converts the code generated by the SDWA peephole into correct code.
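
The difference between the two dst_unused values described above can be illustrated with a small bit-level model. This is only a sketch: the enum and function names are made up for illustration and are not LLVM's, and it models just WORD_0/WORD_1 selection with PAD and PRESERVE on a 32-bit register.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; not taken from the AMDGPU backend.
enum class DstSel { Word0, Word1 };
enum class DstUnused { Pad, Preserve };

// Bits of the 32-bit destination that dst_sel selects.
inline uint32_t selMask(DstSel Sel) {
  return Sel == DstSel::Word0 ? 0x0000FFFFu : 0xFFFF0000u;
}

// Model of an SDWA write of a 16-bit Result into a register that held OldDst.
inline uint32_t sdwaWrite(uint32_t OldDst, uint16_t Result, DstSel Sel,
                          DstUnused Unused) {
  const uint32_t Placed = Sel == DstSel::Word0
                              ? static_cast<uint32_t>(Result)
                              : static_cast<uint32_t>(Result) << 16;
  // PAD zeroes the unselected bits; PRESERVE keeps what was there before.
  const uint32_t Rest =
      Unused == DstUnused::Preserve ? (OldDst & ~selMask(Sel)) : 0u;
  return Placed | Rest;
}

inline void exampleDstUnused() {
  // Same write into WORD_1, differing only in dst_unused.
  assert(sdwaWrite(0x1234ABCDu, 0x5678, DstSel::Word1, DstUnused::Pad) ==
         0x56780000u);
  assert(sdwaWrite(0x1234ABCDu, 0x5678, DstSel::Word1, DstUnused::Preserve) ==
         0x5678ABCDu);
}
```

The PRESERVE case needs the old destination value as an input, which is why it does not fit a pure SSA definition.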

Changes in SDWA peephole:

Several changes were made in the SDWA peephole.
a. First, a new pattern was added to match the PRESERVE operand. This pattern looks for a V_OR_B32 instruction where one of the operands is the result of an SDWA instruction. The second operand of V_OR_B32 should be the result of an instruction that is bit-wise compatible with the SDWA instruction (their destinations don't overlap). E.g. match:

v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
v_add_f16_e32 v3, v1, v2
v_or_b32_e32 v4, v0, v3

Into: SDWA preserve dst:v4 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE preserve:v3

Then this matched SDWA preserve pattern is converted into SDWA with preserve. During conversion the V_OR_B32 instruction is replaced by an SDWA instruction with dst_unused set to UNUSED_PRESERVE. The original SDWA instruction is removed, and the new instruction gets an additional implicit use operand, which is the destination of the second operand of V_OR_B32 (the register that should be preserved):

v_add_f16_e32 v3, v1, v2
v_add_f16_sdwa v4, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1, implicit v3

The problem with this matching process is that currently it only works if both instructions are SDWA instructions. The reason is that, to be able to match two instructions, we have to check that they are compatible, meaning that they write different parts of the destination register. But currently there is no way to determine whether a regular instruction writes less than the whole destination register. E.g. we can't tell that V_ADD_F16 only writes the low 16 bits of the destination and that the high 16 bits are irrelevant. This can be determined only for an SDWA instruction, by looking at its dst_sel operand. So for now this pattern only matches 2 SDWA instructions. The ability to match regular instructions will be added later.

b. The second big change in the SDWA peephole pass is that it now tries to apply the match patterns repeatedly until it can't convert any new instruction. This is needed because (as said earlier) the PRESERVE pattern needs to match SDWA instructions, but SDWA instructions appear (in most cases) only after the SDWA peephole. So to be able to match the PRESERVE pattern we first apply all other patterns, which generate regular SDWA instructions, and then on the second try we apply the PRESERVE pattern to the SDWA instructions generated on the first try.
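
The two changes above (the non-overlap check on dst_sel and the repeat-until-no-change loop) can be sketched together in a toy model. All types and names below are hypothetical stand-ins, not the classes from the patch: instructions are reduced to a kind, a destination selector, and, for the OR, the indices of the instructions defining its operands.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-ins for this sketch; none of these names come from the patch.
enum class Sel { Word0, Word1, Dword };
enum class Kind { Plain, Sdwa, Or, SdwaPreserve };

struct ToyInst {
  Kind K;
  Sel DstSel; // Part of the destination the instruction writes once it is SDWA.
  int Lhs;    // For Or: index of the instruction defining each operand.
  int Rhs;
};

inline uint32_t dstMask(Sel S) {
  switch (S) {
  case Sel::Word0: return 0x0000FFFFu;
  case Sel::Word1: return 0xFFFF0000u;
  case Sel::Dword: return 0xFFFFFFFFu;
  }
  return 0xFFFFFFFFu;
}

// Two SDWA results are compatible when their written bits do not overlap.
inline bool writeDisjointBits(Sel A, Sel B) {
  return (dstMask(A) & dstMask(B)) == 0;
}

// One sweep: match against a snapshot, then convert. Returns true on change.
inline bool sweep(std::vector<ToyInst> &Insts) {
  const std::vector<ToyInst> Before = Insts;
  bool Changed = false;
  for (std::size_t I = 0; I < Before.size(); ++I) {
    const ToyInst &In = Before[I];
    if (In.K == Kind::Plain) {
      Insts[I].K = Kind::Sdwa; // Stand-in for the ordinary SDWA patterns.
      Changed = true;
    } else if (In.K == Kind::Or) {
      const ToyInst &L = Before[static_cast<std::size_t>(In.Lhs)];
      const ToyInst &R = Before[static_cast<std::size_t>(In.Rhs)];
      if (L.K == Kind::Sdwa && R.K == Kind::Sdwa &&
          writeDisjointBits(L.DstSel, R.DstSel)) {
        Insts[I].K = Kind::SdwaPreserve; // Stand-in for the PRESERVE pattern.
        Changed = true;
      }
    }
  }
  return Changed;
}

// Repeat sweeps until nothing changes; returns the number of changing sweeps.
inline int runToFixedPoint(std::vector<ToyInst> &Insts) {
  int Sweeps = 0;
  while (sweep(Insts))
    ++Sweeps;
  return Sweeps;
}
```

In this model an OR of two plain instructions is only rewritten on the second sweep, because its operands become SDWA during the first one.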

New pass - merge SDWA preserve pass:

This pass is needed to convert the SSA code generated by the SDWA peephole pass into correct non-SSA code. It runs after the PHI-elimination pass, where it is possible to generate non-SSA code.
This pass looks for SDWA instructions with dst_unused set to UNUSED_PRESERVE. In those instructions it looks for the implicit register operand (which is added by the SDWA peephole pass). This register is the one that should be preserved. If such an instruction is found, the pass changes the destination register of the SDWA instruction to the implicit register and creates a copy from the implicit register to the original destination of the SDWA instruction. E.g. the instruction generated by the SDWA peephole:

v_add_f16_sdwa v4, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1, implicit v3

Would be converted into:

v_add_f16_sdwa v3, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1, implicit v3
v_mov_b32 v4, v3
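
The rewrite performed by this pass can be sketched on a toy instruction record. The struct and function names are hypothetical and registers are plain integers; the pass in the diff works on MachineInstr operands instead.

```cpp
#include <cassert>

// Toy records for this sketch only.
struct ToySdwa {
  int Dst;       // Destination register number.
  int Preserved; // Register carried as the implicit use operand.
};
struct ToyCopy {
  int Dst;
  int Src;
};

// Retarget the SDWA instruction at the preserved register and return the copy
// that moves the merged value into the original destination.
inline ToyCopy mergePreserve(ToySdwa &MI) {
  const ToyCopy Copy{MI.Dst, MI.Preserved};
  MI.Dst = MI.Preserved;
  return Copy;
}

inline void exampleMerge() {
  ToySdwa MI{4, 3}; // v_add_f16_sdwa v4, ..., implicit v3
  const ToyCopy Copy = mergePreserve(MI);
  assert(MI.Dst == 3);                     // v_add_f16_sdwa v3, ...
  assert(Copy.Dst == 4 && Copy.Src == 3);  // v_mov_b32 v4, v3
}
```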

Putting it all together, the original sequence of instructions:

v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
v_add_f16_e32 v3, v1, v2
v_or_b32_e32 v4, v0, v3

Would be converted into:

v_add_f16_e32 v3, v1, v2
v_add_f16_sdwa v3, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1, implicit v3
v_mov_b32 v4, v3
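
As a sanity check on the end-to-end example, the two sequences can be compared at the bit level. This is a sketch under a stated assumption: the instruction producing v3 is taken to leave the high 16 bits of v3 zero. Lo and Hi are hypothetical names for the 16-bit results of the two adds.

```cpp
#include <cassert>
#include <cstdint>

// Original sequence: SDWA with UNUSED_PAD into v0, then v_or_b32 with v3.
inline uint32_t viaOr(uint16_t Lo, uint16_t Hi) {
  const uint32_t V0 = static_cast<uint32_t>(Hi) << 16; // WORD_1, rest padded
  const uint32_t V3 = Lo; // Assumption: high half of v3 is zero.
  return V0 | V3;         // v4
}

// Converted sequence: SDWA with UNUSED_PRESERVE writes into v3, then a copy.
inline uint32_t viaPreserve(uint16_t Lo, uint16_t Hi) {
  uint32_t V3 = Lo;
  // Only WORD_1 is overwritten; the low half of v3 is kept.
  V3 = (V3 & 0x0000FFFFu) | (static_cast<uint32_t>(Hi) << 16);
  const uint32_t V4 = V3; // v_mov_b32 v4, v3
  return V4;
}
```

Under that assumption both functions produce the same v4 for any pair of inputs, which is what makes the OR removable.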

Diff Detail

Build Status

Buildable 10226
Build 10226: arc lint + arc unit

Event Timeline

SamWot created this revision. Sep 13 2017, 10:42 AM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Sep 13 2017, 10:42 AM

Harbormaster completed remote builds in B10192: Diff 115070. Sep 13 2017, 10:42 AM

I think we need to decide an overall strategy for dealing with instructions that only partially update the registers. GFX9 really complicated this issue by changing new instructions to preserve the high bits, and adding a control bit to some instructions to change the high bit behavior.

I started dealing with this to add the d16 loads and stores. We can still do this SSA by adding variants of the instructions with tied operands that preserve one half of the instructions, which would probably be less painful than adding another post-SSA pass that needs to deal with liveness. One issue is we still get suboptimal regalloc in some cases, so I'm debating adding new 16-bit subregister classes so subrange liveness tracking works.

In D37817#869898, @arsenm wrote:

I think we need to decide an overall strategy for dealing with instructions that only partially update the registers. GFX9 really complicated this issue by changing new instructions to preserve the high bits, and adding a control bit to some instructions to change the high bit behavior.

I started dealing with this to add the d16 loads and stores. We can still do this SSA by adding variants of the instructions with tied operands that preserve one half of the instructions, which would probably be less painful than adding another post-SSA pass that needs to deal with liveness. One issue is we still get suboptimal regalloc in some cases, so I'm debating adding new 16-bit subregister classes so subrange liveness tracking works.

Don't we want to add bits to the instruction describing whether it preserves the low or high half?
Adding new subregs would be quite painful, as we already have too many registers for RA and LIS to work fast and optimally.

rampitec added inline comments. Sep 13 2017, 3:40 PM

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
821	I do not think we need it under fast RA.
lib/Target/AMDGPU/SIMergeSDWAPreserve.cpp
116	We can have more than EXEC here. Check it is a virtual register instead, or even a VGPR?
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
283–309	Such things shall be functions, function templates, anything but defines. In particular that is very hard to debug.
876	Needs cast from unsigned or use unsigned for SDWAOpcode/Opcode.
1077	This loop itself probably deserves a separate change.
test/CodeGen/AMDGPU/sdwa-merge-preserve.mir
2	Add -verify-machineinstrs to run lines.
test/CodeGen/AMDGPU/sdwa-preserve.mir
2	Add -verify-machineinstrs

In D37817#870219, @rampitec wrote:

In D37817#869898, @arsenm wrote:

I think we need to decide an overall strategy for dealing with instructions that only partially update the registers. GFX9 really complicated this issue by changing new instructions to preserve the high bits, and adding a control bit to some instructions to change the high bit behavior.

I started dealing with this to add the d16 loads and stores. We can still do this SSA by adding variants of the instructions with tied operands that preserve one half of the instructions, which would probably be less painful than adding another post-SSA pass that needs to deal with liveness. One issue is we still get suboptimal regalloc in some cases, so I'm debating adding new 16-bit subregister classes so subrange liveness tracking works.

Don't we want to add bits to the instruction describing whether it preserves the low or high half?
Adding new subregs would be quite painful, as we already have too many registers for RA and LIS to work fast and optimally.

That's another partial option, but won't solve the suboptimal RA. I think we need to try and see what the impact actually ends up being. These are more constrained since you sort of can't directly address the high component usually (i.e. the high component isn't actually separately allocatable).

Resolved some issues

Harbormaster completed remote builds in B10226: Diff 115219. Sep 14 2017, 7:35 AM

In any case independent of sub register questions, I think this would be better off done in the existing pass by adding variants with tied operands. This is how I am handling this problem currently in D38070/D38071 for mad_mix

In D37817#876063, @arsenm wrote:

In any case independent of sub register questions, I think this would be better off done in the existing pass by adding variants with tied operands. This is how I am handling this problem currently in D38070/D38071 for mad_mix

This is actually a better idea than using an implicit operand and having the IR potentially broken in between the two passes.

Removed SIMergeSDWAPreserve pass.
Use tied registers to achieve same results

Ping

arsenm added inline comments. Oct 27 2017, 2:30 AM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
132	Probably should use LLVM_ENABLE_DUMP instead of NDEBUG. Also should add the matching dump() with LLVM_DUMP_METHOD.
292	use_nodbg_operands? Also space before :
293	C++ style comments
294–295	These various checks shouldn't be necessary in SSA. You can't have a def of a specific subregister (unless maybe there is a physical register which should probably be skipped anyway).
313–316	This seems to be re-inventing MRI.getVRegDef/MRI.getUniqueVRegDef?
467	This whole loop is just MRI.clearKillFlags()
487–488	This is still manually tying the result operand. I was expecting another set of _sdwa opcodes with the preserve behavior with the tied operand statically known in the instruction definition. Manually tying this way is potentially hazardous because the verifier won't check it, and it makes it easier for another pass to accidentally drop the tied operand. I would expect there to be an InstrMapping table between the SDWA opcode and the SDWA with preserve set versions.
lib/Target/AMDGPU/SIRegisterInfo.cpp
1317–1318	How are these different from the various MCRegisterInfo functions for checking if registers alias or have subreg relationships?

Fixed latest issues from arsenm

SamWot added inline comments. Nov 2 2017, 3:40 AM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
487–488	I thought that having a separate instruction definition just for the preserve case would be overkill. I didn't want to introduce another new kind of SDWA instruction that would only bloat the already huge set of instruction definitions. But if you think this would be better, I will introduce new instruction definitions.

Ping.
Matt, what do you think about the latest changes in the review?

At the very least there needs to be a verifier check for the tied operands if this is set. With the separate tied opcodes you get that for free

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
1013	Extra newline

Added verification for tied register for UNUSED_PRESERVE

arsenm added inline comments. Nov 29 2017, 3:05 PM

lib/Target/AMDGPU/SIInstrInfo.cpp
2702–2705	(On Diff #124747)	Could use some of the stricter checks the normal verifier has, i.e. else if (TargetRegisterInfo::isPhysicalRegister(MOTied.getReg()) && MO->getReg() != MOTied.getReg())

Stronger verification for UNUSED_PRESERVE

LGTM

This revision is now accepted and ready to land. Dec 1 2017, 1:51 PM

Closed by commit rL319662: [AMDGPU] SDWA: add support for PRESERVE into SDWA peephole. (authored by skolton). Dec 4 2017, 8:23 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/AMDGPU/AMDGPU.h	4 lines
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	13 lines
lib/Target/AMDGPU/CMakeLists.txt	1 line
lib/Target/AMDGPU/SIMergeSDWAPreserve.cpp	151 lines
lib/Target/AMDGPU/SIPeepholeSDWA.cpp	822 lines
lib/Target/AMDGPU/SIRegisterInfo.h	5 lines
lib/Target/AMDGPU/SIRegisterInfo.cpp	25 lines
test/CodeGen/AMDGPU/fabs.f16.ll	3 lines
test/CodeGen/AMDGPU/fcanonicalize.f16.ll	7 lines
test/CodeGen/AMDGPU/fneg.f16.ll	3 lines
test/CodeGen/AMDGPU/sdwa-merge-preserve.mir	98 lines
test/CodeGen/AMDGPU/sdwa-peephole-instr.mir	4 lines
test/CodeGen/AMDGPU/sdwa-preserve.mir	56 lines

Diff 115219

lib/Target/AMDGPU/AMDGPU.h

	[34 lines not shown]
	FunctionPass *createR600ControlFlowFinalizer();			FunctionPass *createR600ControlFlowFinalizer();
	FunctionPass *createAMDGPUCFGStructurizerPass();			FunctionPass *createAMDGPUCFGStructurizerPass();
	FunctionPass *createR600ISelDag(TargetMachine *TM, CodeGenOpt::Level OptLevel);			FunctionPass *createR600ISelDag(TargetMachine *TM, CodeGenOpt::Level OptLevel);

	// SI Passes			// SI Passes
	FunctionPass *createSIAnnotateControlFlowPass();			FunctionPass *createSIAnnotateControlFlowPass();
	FunctionPass *createSIFoldOperandsPass();			FunctionPass *createSIFoldOperandsPass();
	FunctionPass *createSIPeepholeSDWAPass();			FunctionPass *createSIPeepholeSDWAPass();
				FunctionPass *createSIMergeSDWAPreservePass();
	FunctionPass *createSILowerI1CopiesPass();			FunctionPass *createSILowerI1CopiesPass();
	FunctionPass *createSIShrinkInstructionsPass();			FunctionPass *createSIShrinkInstructionsPass();
	FunctionPass *createSILoadStoreOptimizerPass();			FunctionPass *createSILoadStoreOptimizerPass();
	FunctionPass *createSIWholeQuadModePass();			FunctionPass *createSIWholeQuadModePass();
	FunctionPass *createSIFixControlFlowLiveIntervalsPass();			FunctionPass *createSIFixControlFlowLiveIntervalsPass();
	FunctionPass *createSIOptimizeExecMaskingPreRAPass();			FunctionPass *createSIOptimizeExecMaskingPreRAPass();
	FunctionPass *createSIFixSGPRCopiesPass();			FunctionPass *createSIFixSGPRCopiesPass();
	FunctionPass *createSIMemoryLegalizerPass();			FunctionPass *createSIMemoryLegalizerPass();
	[41 lines not shown]
	extern char &R600PacketizerID;			extern char &R600PacketizerID;

	void initializeSIFoldOperandsPass(PassRegistry &);			void initializeSIFoldOperandsPass(PassRegistry &);
	extern char &SIFoldOperandsID;			extern char &SIFoldOperandsID;

	void initializeSIPeepholeSDWAPass(PassRegistry &);			void initializeSIPeepholeSDWAPass(PassRegistry &);
	extern char &SIPeepholeSDWAID;			extern char &SIPeepholeSDWAID;

				void initializeSIMergeSDWAPreservePass(PassRegistry &);
				extern char &SIMergeSDWAPreserveID;

	void initializeSIShrinkInstructionsPass(PassRegistry&);			void initializeSIShrinkInstructionsPass(PassRegistry&);
	extern char &SIShrinkInstructionsID;			extern char &SIShrinkInstructionsID;

	void initializeSIFixSGPRCopiesPass(PassRegistry &);			void initializeSIFixSGPRCopiesPass(PassRegistry &);
	extern char &SIFixSGPRCopiesID;			extern char &SIFixSGPRCopiesID;

	void initializeSIFixVGPRCopiesPass(PassRegistry &);			void initializeSIFixVGPRCopiesPass(PassRegistry &);
	extern char &SIFixVGPRCopiesID;			extern char &SIFixVGPRCopiesID;
	[148 lines not shown]

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

[147 lines not shown]	extern "C" void LLVMInitializeAMDGPUTarget() {
initializeR600ExpandSpecialInstrsPassPass(*PR);		initializeR600ExpandSpecialInstrsPassPass(*PR);
initializeR600VectorRegMergerPass(*PR);		initializeR600VectorRegMergerPass(*PR);
initializeAMDGPUDAGToDAGISelPass(*PR);		initializeAMDGPUDAGToDAGISelPass(*PR);
initializeSILowerI1CopiesPass(*PR);		initializeSILowerI1CopiesPass(*PR);
initializeSIFixSGPRCopiesPass(*PR);		initializeSIFixSGPRCopiesPass(*PR);
initializeSIFixVGPRCopiesPass(*PR);		initializeSIFixVGPRCopiesPass(*PR);
initializeSIFoldOperandsPass(*PR);		initializeSIFoldOperandsPass(*PR);
initializeSIPeepholeSDWAPass(*PR);		initializeSIPeepholeSDWAPass(*PR);
		initializeSIMergeSDWAPreservePass(*PR);
initializeSIShrinkInstructionsPass(*PR);		initializeSIShrinkInstructionsPass(*PR);
initializeSIOptimizeExecMaskingPreRAPass(*PR);		initializeSIOptimizeExecMaskingPreRAPass(*PR);
initializeSILoadStoreOptimizerPass(*PR);		initializeSILoadStoreOptimizerPass(*PR);
initializeAMDGPUAlwaysInlinePass(*PR);		initializeAMDGPUAlwaysInlinePass(*PR);
initializeAMDGPUAnnotateKernelFeaturesPass(*PR);		initializeAMDGPUAnnotateKernelFeaturesPass(*PR);
initializeAMDGPUAnnotateUniformValuesPass(*PR);		initializeAMDGPUAnnotateUniformValuesPass(*PR);
initializeAMDGPUArgumentUsageInfoPass(*PR);		initializeAMDGPUArgumentUsageInfoPass(*PR);
initializeAMDGPULowerIntrinsicsPass(*PR);		initializeAMDGPULowerIntrinsicsPass(*PR);
[645 lines not shown]	void GCNPassConfig::addFastRegAlloc(FunctionPass *RegAllocPass) {
// TwoAddressInstructions, otherwise the processing of the tied operand of		// TwoAddressInstructions, otherwise the processing of the tied operand of
// SI_ELSE will introduce a copy of the tied operand source after the else.		// SI_ELSE will introduce a copy of the tied operand source after the else.
insertPass(&PHIEliminationID, &SILowerControlFlowID, false);		insertPass(&PHIEliminationID, &SILowerControlFlowID, false);

// This must be run after SILowerControlFlow, since it needs to use the		// This must be run after SILowerControlFlow, since it needs to use the
// machine-level CFG, but before register allocation.		// machine-level CFG, but before register allocation.
insertPass(&SILowerControlFlowID, &SIFixWWMLivenessID, false);		insertPass(&SILowerControlFlowID, &SIFixWWMLivenessID, false);

		// This must be run before TwoAddressInstructions, otherwise it will introduce
		// a copy of merged register after SDWA instruction
		if (EnableSDWAPeephole) {
		insertPass(&SIFixWWMLivenessID, &SIMergeSDWAPreserveID, false);
		[inline comment] rampitec: I do not think we need it under fast RA.
		}

TargetPassConfig::addFastRegAlloc(RegAllocPass);		TargetPassConfig::addFastRegAlloc(RegAllocPass);
}		}

void GCNPassConfig::addOptimizedRegAlloc(FunctionPass *RegAllocPass) {		void GCNPassConfig::addOptimizedRegAlloc(FunctionPass *RegAllocPass) {
insertPass(&MachineSchedulerID, &SIOptimizeExecMaskingPreRAID);		insertPass(&MachineSchedulerID, &SIOptimizeExecMaskingPreRAID);

// This must be run immediately after phi elimination and before		// This must be run immediately after phi elimination and before
// TwoAddressInstructions, otherwise the processing of the tied operand of		// TwoAddressInstructions, otherwise the processing of the tied operand of
// SI_ELSE will introduce a copy of the tied operand source after the else.		// SI_ELSE will introduce a copy of the tied operand source after the else.
insertPass(&PHIEliminationID, &SILowerControlFlowID, false);		insertPass(&PHIEliminationID, &SILowerControlFlowID, false);

// This must be run after SILowerControlFlow, since it needs to use the		// This must be run after SILowerControlFlow, since it needs to use the
// machine-level CFG, but before register allocation.		// machine-level CFG, but before register allocation.
insertPass(&SILowerControlFlowID, &SIFixWWMLivenessID, false);		insertPass(&SILowerControlFlowID, &SIFixWWMLivenessID, false);

		// This must be run before TwoAddressInstructions, otherwise it will introduce
		// a copy of merged register after SDWA instruction
		if (EnableSDWAPeephole) {
		insertPass(&SIFixWWMLivenessID, &SIMergeSDWAPreserveID, false);
		}

TargetPassConfig::addOptimizedRegAlloc(RegAllocPass);		TargetPassConfig::addOptimizedRegAlloc(RegAllocPass);
}		}

void GCNPassConfig::addPostRegAlloc() {		void GCNPassConfig::addPostRegAlloc() {
addPass(&SIFixVGPRCopiesID);		addPass(&SIFixVGPRCopiesID);
addPass(&SIOptimizeExecMaskingID);		addPass(&SIOptimizeExecMaskingID);
TargetPassConfig::addPostRegAlloc();		TargetPassConfig::addPostRegAlloc();
}		}
[30 lines not shown]

lib/Target/AMDGPU/CMakeLists.txt

[81 lines not shown]	add_llvm_target(AMDGPUCodeGen
SIInstrInfo.cpp		SIInstrInfo.cpp
SIISelLowering.cpp		SIISelLowering.cpp
SILoadStoreOptimizer.cpp		SILoadStoreOptimizer.cpp
SILowerControlFlow.cpp		SILowerControlFlow.cpp
SILowerI1Copies.cpp		SILowerI1Copies.cpp
SIMachineFunctionInfo.cpp		SIMachineFunctionInfo.cpp
SIMachineScheduler.cpp		SIMachineScheduler.cpp
SIMemoryLegalizer.cpp		SIMemoryLegalizer.cpp
		SIMergeSDWAPreserve.cpp
SIOptimizeExecMasking.cpp		SIOptimizeExecMasking.cpp
SIOptimizeExecMaskingPreRA.cpp		SIOptimizeExecMaskingPreRA.cpp
SIPeepholeSDWA.cpp		SIPeepholeSDWA.cpp
SIRegisterInfo.cpp		SIRegisterInfo.cpp
SIShrinkInstructions.cpp		SIShrinkInstructions.cpp
SIWholeQuadMode.cpp		SIWholeQuadMode.cpp
)		)

add_subdirectory(AsmParser)		add_subdirectory(AsmParser)
add_subdirectory(InstPrinter)		add_subdirectory(InstPrinter)
add_subdirectory(Disassembler)		add_subdirectory(Disassembler)
add_subdirectory(TargetInfo)		add_subdirectory(TargetInfo)
add_subdirectory(MCTargetDesc)		add_subdirectory(MCTargetDesc)
add_subdirectory(Utils)		add_subdirectory(Utils)

lib/Target/AMDGPU/SIMergeSDWAPreserve.cpp

This file was added.

				//===-- SIMergeSDWAPreserve.cpp - Merge SDWA dst_unused:PRESERVE operand --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file This pass merges SDWA dst_unused:PRESERVE operand
				///
				/// E.g. original:
				/// %vreg2 = V_ADD_F16 %vreg0, %vreg1
				/// %vreg3 = V_ADD_F16_sdwa %vreg0, %vreg1
				/// dst_sel:WORD_1 dst_unused:UNUSED_RESERVE src0_sel:WORD_1 src1_sel:WORD_1
				/// %vreg2<imp-use>
				///
				/// Replace:
				/// %vreg2 = V_ADD_F16 %vreg0, %vreg1
				/// %vreg2 = V_ADD_F16_sdwa %vreg0, %vreg1
				/// dst_sel:WORD_1 dst_unused:UNUSED_RESERVE src0_sel:WORD_1 src1_sel:WORD_1
				/// %vreg3<imp-use>
				/// %vreg3 = V_MOV_B32 %vreg2
				///
				/// This pass should be placed after PHI elimination because it can't be
				/// implemented in SSA form
				//
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "AMDGPUSubtarget.h"
				#include "SIDefines.h"
				#include "SIInstrInfo.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"
				#include "llvm/CodeGen/MachineInstrBuilder.h"

				using namespace llvm;

				#define DEBUG_TYPE "si-merge-sdwa-preserve"

				STATISTIC(NumMergeSDWAPreserve,
				"Number of merges of SDWA dst_unused:PRESERVE.");

				namespace {

				class SIMergeSDWAPreserve : public MachineFunctionPass {
				public:
				static char ID;

				SIMergeSDWAPreserve() : MachineFunctionPass(ID) {
				initializeSIMergeSDWAPreservePass(*PassRegistry::getPassRegistry());
				}

				bool runOnMachineFunction(MachineFunction &MF) override;

				StringRef getPassName() const override { return "SI Merge SDWA Preserve"; }

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				// Should preserve the same set that TwoAddressInstructions does.
				AU.addPreserved<SlotIndexes>();
				AU.addPreserved<LiveIntervals>();
				AU.addPreservedID(LiveVariablesID);
				AU.addPreservedID(MachineLoopInfoID);
				AU.addPreservedID(MachineDominatorsID);
				AU.setPreservesCFG();
				MachineFunctionPass::getAnalysisUsage(AU);
				}
				};

				} // End anonymous namespace.

				INITIALIZE_PASS(SIMergeSDWAPreserve, DEBUG_TYPE, "SI Merge SDWA Preserve", false, false)

				char SIMergeSDWAPreserve::ID = 0;

				char &llvm::SIMergeSDWAPreserveID = SIMergeSDWAPreserve::ID;

				FunctionPass *llvm::createSIMergeSDWAPreservePass() {
				return new SIMergeSDWAPreserve();
				}

				bool SIMergeSDWAPreserve::runOnMachineFunction(MachineFunction &MF) {
				LiveIntervals *LIS = getAnalysisIfAvailable<LiveIntervals>();

				const SISubtarget &ST = MF.getSubtarget<SISubtarget>();

				if (!ST.hasSDWA())
				return false;

				bool Merged = false;

				const MachineRegisterInfo *MRI = &MF.getRegInfo();
				const SIRegisterInfo *TRI = ST.getRegisterInfo();
				const SIInstrInfo *TII = ST.getInstrInfo();


				for (MachineBasicBlock &MBB : MF) {
				for (MachineInstr &MI : MBB) {
				if (!TII->isSDWA(MI))
				continue;

				auto *DstUnused = TII->getNamedOperand(MI, AMDGPU::OpName::dst_unused);

				if (!DstUnused || DstUnused->getImm() != AMDGPU::SDWA::UNUSED_PRESERVE)
				continue;

				// VOPC doesn't support PRESERVE
				assert(!TII->isVOPC(AMDGPU::getBasicFromSDWAOp(MI.getOpcode())));

				// Operand to copy should be last implicit operand in SDWA with
				// preserve instruction
				auto &SrcReg = MI.getOperand(MI.getNumOperands() - 1);
				if (!SrcReg.isImplicit() || !SrcReg.isUse() ||
				!TRI->isVirtualRegister(SrcReg.getReg()) ||
				!TRI->hasVGPRs(TII->getOpRegClass(MI, MI.getNumOperands() - 1)))
				[inline comment] rampitec: We can have more than EXEC here. Check it is a virtual register instead, or even a VGPR?
				continue;


				auto *DstReg = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);
				assert(DstReg &&
				DstReg->isReg() &&
				TRI->hasVGPRs(TII->getOpRegClass(MI, MI.getOperandNo(DstReg))));

				// Check that MI is the only instruction that uses src register
				for (MachineOperand &UseMO : MRI->use_operands(SrcReg.getReg())) {
				if (TRI->isSubregOf(SrcReg, UseMO)) {
				// TODO: in that case we should unfold this SDWA-with-PRESERVE back
				// into v_or_b32 + SDWA-without-PRESERVE
				assert(UseMO.getParent() == &MI);
				}
				}

				MachineInstrBuilder Copy = BuildMI(MBB, MI.getNextNode(), MI.getDebugLoc(),
				TII->get(AMDGPU::COPY),
				DstReg->getReg())
				.addReg(SrcReg.getReg());

				DstReg->setReg(SrcReg.getReg());

				if (LIS)
				LIS->repairIntervalsInRange(&MBB, &MI, Copy.getInstr(),
				{ SrcReg.getReg(), DstReg->getReg() });

				++NumMergeSDWAPreserve;
				Merged = true;
				}
				}

				return Merged;
				}

lib/Target/AMDGPU/SIPeepholeSDWA.cpp

[55 lines not shown]

STATISTIC(NumSDWAPatternsFound, "Number of SDWA patterns found.");		STATISTIC(NumSDWAPatternsFound, "Number of SDWA patterns found.");
STATISTIC(NumSDWAInstructionsPeepholed,		STATISTIC(NumSDWAInstructionsPeepholed,
"Number of instruction converted to SDWA.");		"Number of instruction converted to SDWA.");

namespace {		namespace {

class SDWAOperand;		class SDWAOperand;
		class SDWADstOperand;

class SIPeepholeSDWA : public MachineFunctionPass {		class SIPeepholeSDWA : public MachineFunctionPass {
public:		public:
using SDWAOperandsVector = SmallVector<SDWAOperand *, 4>;		using SDWAOperandsVector = SmallVector<SDWAOperand *, 4>;

private:		private:
MachineRegisterInfo *MRI;		MachineRegisterInfo *MRI;
const SIRegisterInfo *TRI;		const SIRegisterInfo *TRI;
[9 lines not shown]	public:
static char ID;		static char ID;

SIPeepholeSDWA() : MachineFunctionPass(ID) {		SIPeepholeSDWA() : MachineFunctionPass(ID) {
initializeSIPeepholeSDWAPass(*PassRegistry::getPassRegistry());		initializeSIPeepholeSDWAPass(*PassRegistry::getPassRegistry());
}		}

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;
void matchSDWAOperands(MachineFunction &MF);		void matchSDWAOperands(MachineFunction &MF);
		std::unique_ptr<SDWAOperand> matchSDWAOperand(MachineInstr &MI);
bool isConvertibleToSDWA(const MachineInstr &MI, const SISubtarget &ST) const;		bool isConvertibleToSDWA(const MachineInstr &MI, const SISubtarget &ST) const;
bool convertToSDWA(MachineInstr &MI, const SDWAOperandsVector &SDWAOperands);		bool convertToSDWA(MachineInstr &MI, const SDWAOperandsVector &SDWAOperands);
void legalizeScalarOperands(MachineInstr &MI, const SISubtarget &ST) const;		void legalizeScalarOperands(MachineInstr &MI, const SISubtarget &ST) const;

StringRef getPassName() const override { return "SI Peephole SDWA"; }		StringRef getPassName() const override { return "SI Peephole SDWA"; }

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesCFG();		AU.setPreservesCFG();
[20 lines not shown]	public:

MachineOperand *getTargetOperand() const { return Target; }		MachineOperand *getTargetOperand() const { return Target; }
MachineOperand *getReplacedOperand() const { return Replaced; }		MachineOperand *getReplacedOperand() const { return Replaced; }
MachineInstr *getParentInst() const { return Target->getParent(); }		MachineInstr *getParentInst() const { return Target->getParent(); }

MachineRegisterInfo *getMRI() const {		MachineRegisterInfo *getMRI() const {
return &getParentInst()->getParent()->getParent()->getRegInfo();		return &getParentInst()->getParent()->getParent()->getRegInfo();
}		}

		const SIRegisterInfo *getTRI() const {
		return static_cast<const SIRegisterInfo *>(getMRI()->getTargetRegisterInfo());
		}

		#ifndef NDEBUG
		[inline comment] arsenm: Probably should use LLVM_ENABLE_DUMP instead of NDEBUG. Also should add the matching dump() with LLVM_DUMP_METHOD.
		virtual void print(raw_ostream& OS) const = 0;
		#endif
};		};

using namespace AMDGPU::SDWA;		using namespace AMDGPU::SDWA;

class SDWASrcOperand : public SDWAOperand {		class SDWASrcOperand : public SDWAOperand {
private:		private:
SdwaSel SrcSel;		SdwaSel SrcSel;
bool Abs;		bool Abs;
bool Neg;		bool Neg;
bool Sext;		bool Sext;

public:		public:
SDWASrcOperand(MachineOperand *TargetOp, MachineOperand *ReplacedOp,		SDWASrcOperand(MachineOperand *TargetOp, MachineOperand *ReplacedOp,
SdwaSel SrcSel_ = DWORD, bool Abs_ = false, bool Neg_ = false,		SdwaSel SrcSel_ = DWORD, bool Abs_ = false, bool Neg_ = false,
bool Sext_ = false)		bool Sext_ = false)
: SDWAOperand(TargetOp, ReplacedOp), SrcSel(SrcSel_), Abs(Abs_),		: SDWAOperand(TargetOp, ReplacedOp),
Neg(Neg_), Sext(Sext_) {}		SrcSel(SrcSel_), Abs(Abs_), Neg(Neg_), Sext(Sext_) {}

MachineInstr *potentialToConvert(const SIInstrInfo *TII) override;		MachineInstr *potentialToConvert(const SIInstrInfo *TII) override;
bool convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) override;		bool convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) override;

SdwaSel getSrcSel() const { return SrcSel; }		SdwaSel getSrcSel() const { return SrcSel; }
bool getAbs() const { return Abs; }		bool getAbs() const { return Abs; }
bool getNeg() const { return Neg; }		bool getNeg() const { return Neg; }
bool getSext() const { return Sext; }		bool getSext() const { return Sext; }

uint64_t getSrcMods(const SIInstrInfo *TII,		uint64_t getSrcMods(const SIInstrInfo *TII,
const MachineOperand *SrcOp) const;		const MachineOperand *SrcOp) const;

		#ifndef NDEBUG
		void print(raw_ostream& OS) const override;
		#endif
};		};

class SDWADstOperand : public SDWAOperand {		class SDWADstOperand : public SDWAOperand {
private:		private:
SdwaSel DstSel;		SdwaSel DstSel;
DstUnused DstUn;		DstUnused DstUn;

public:		public:

SDWADstOperand(MachineOperand TargetOp, MachineOperand ReplacedOp,		SDWADstOperand(MachineOperand TargetOp, MachineOperand ReplacedOp,
SdwaSel DstSel_ = DWORD, DstUnused DstUn_ = UNUSED_PAD)		SdwaSel DstSel_ = DWORD, DstUnused DstUn_ = UNUSED_PAD)
: SDWAOperand(TargetOp, ReplacedOp), DstSel(DstSel_), DstUn(DstUn_) {}		: SDWAOperand(TargetOp, ReplacedOp), DstSel(DstSel_), DstUn(DstUn_) {}

MachineInstr *potentialToConvert(const SIInstrInfo *TII) override;		MachineInstr *potentialToConvert(const SIInstrInfo *TII) override;
bool convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) override;		bool convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) override;

SdwaSel getDstSel() const { return DstSel; }		SdwaSel getDstSel() const { return DstSel; }
DstUnused getDstUnused() const { return DstUn; }		DstUnused getDstUnused() const { return DstUn; }

		#ifndef NDEBUG
		void print(raw_ostream& OS) const override;
		#endif
		};

		class SDWADstPreserveOperand : public SDWADstOperand {
		private:
		MachineOperand *Preserve;

		public:
		SDWADstPreserveOperand(MachineOperand TargetOp, MachineOperand ReplacedOp,
		MachineOperand *PreserveOp, SdwaSel DstSel_ = DWORD)
		: SDWADstOperand(TargetOp, ReplacedOp, DstSel_, UNUSED_PRESERVE),
		Preserve(PreserveOp) {}

		bool convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) override;

		MachineOperand *getPreservedOperand() const { return Preserve; }

		#ifndef NDEBUG
		void print(raw_ostream& OS) const override;
		#endif
};		};

} // end anonymous namespace		} // end anonymous namespace

INITIALIZE_PASS(SIPeepholeSDWA, DEBUG_TYPE, "SI Peephole SDWA", false, false)		INITIALIZE_PASS(SIPeepholeSDWA, DEBUG_TYPE, "SI Peephole SDWA", false, false)

char SIPeepholeSDWA::ID = 0;		char SIPeepholeSDWA::ID = 0;

Show All 21 Lines	static raw_ostream& operator<<(raw_ostream &OS, const DstUnused &Un) {
switch(Un) {		switch(Un) {
case UNUSED_PAD: OS << "UNUSED_PAD"; break;		case UNUSED_PAD: OS << "UNUSED_PAD"; break;
case UNUSED_SEXT: OS << "UNUSED_SEXT"; break;		case UNUSED_SEXT: OS << "UNUSED_SEXT"; break;
case UNUSED_PRESERVE: OS << "UNUSED_PRESERVE"; break;		case UNUSED_PRESERVE: OS << "UNUSED_PRESERVE"; break;
}		}
return OS;		return OS;
}		}

static raw_ostream& operator<<(raw_ostream &OS, const SDWASrcOperand &Src) {		static raw_ostream& operator<<(raw_ostream &OS, const SDWAOperand &Operand) {
OS << "SDWA src: " << *Src.getTargetOperand()		Operand.print(OS);
<< " src_sel:" << Src.getSrcSel()
<< " abs:" << Src.getAbs() << " neg:" << Src.getNeg()
<< " sext:" << Src.getSext() << '\n';
return OS;		return OS;
}		}

static raw_ostream& operator<<(raw_ostream &OS, const SDWADstOperand &Dst) {		void SDWASrcOperand::print(raw_ostream& OS) const {
OS << "SDWA dst: " << *Dst.getTargetOperand()		OS << "SDWA src: " << *getTargetOperand()
<< " dst_sel:" << Dst.getDstSel()		<< " src_sel:" << getSrcSel()
<< " dst_unused:" << Dst.getDstUnused() << '\n';		<< " abs:" << getAbs() << " neg:" << getNeg()
return OS;		<< " sext:" << getSext() << '\n';
		}

		void SDWADstOperand::print(raw_ostream& OS) const {
		OS << "SDWA dst: " << *getTargetOperand()
		<< " dst_sel:" << getDstSel()
		<< " dst_unused:" << getDstUnused() << '\n';
		}

		void SDWADstPreserveOperand::print(raw_ostream& OS) const {
		OS << "SDWA preserve dst: " << *getTargetOperand()
		<< " dst_sel:" << getDstSel()
		<< " preserve:" << *getPreservedOperand() << '\n';
}		}

#endif		#endif

static void copyRegOperand(MachineOperand &To, const MachineOperand &From) {		static void copyRegOperand(MachineOperand &To, const MachineOperand &From) {
assert(To.isReg() && From.isReg());		assert(To.isReg() && From.isReg());
To.setReg(From.getReg());		To.setReg(From.getReg());
To.setSubReg(From.getSubReg());		To.setSubReg(From.getSubReg());
To.setIsUndef(From.isUndef());		To.setIsUndef(From.isUndef());
if (To.isUse()) {		if (To.isUse()) {
To.setIsKill(From.isKill());		To.setIsKill(From.isKill());
} else {		} else {
To.setIsDead(From.isDead());		To.setIsDead(From.isDead());
}		}
}		}

static bool isSameReg(const MachineOperand &LHS, const MachineOperand &RHS) {		static MachineOperand *findSingleRegUse(const MachineOperand *Reg,
return LHS.isReg() &&		const MachineRegisterInfo *MRI) {
RHS.isReg() &&		if (!Reg->isReg())
LHS.getReg() == RHS.getReg() &&		return nullptr;
LHS.getSubReg() == RHS.getSubReg();
		auto TRI = static_cast<const SIRegisterInfo *>(
		MRI->getTargetRegisterInfo());

		MachineOperand *DefUseMO = nullptr;
		for (MachineOperand &DefUse: MRI->use_operands(Reg->getReg())) {
		arsenm: use_nodbg_operands? Also space before :
		/* If this is def/use of another subreg of reg then do nothing */
		arsenm: C++ style comments
		if (!TRI->isSubregOf(DefUse, *Reg) ||
		!TRI->isSubregOf(*Reg, DefUse))
		arsenm: These various checks shouldn't be necessary in SSA. You can't have a def of a specific subregister (unless maybe there is a physical register which should probably be skipped anyway).
		continue;

		/* If there exist def/use of superreg of reg then return nullptr*/
		if (!TRI->isSameReg(DefUse, *Reg))
		return nullptr;

		/* Check that UseMI is only instruction that defs/uses reg */
		if (!DefUseMO) {
		DefUseMO = &DefUse;
		} else if (DefUseMO->getParent() != DefUse.getParent()) {
		return nullptr;
		}
		}
		return DefUseMO;
		rampitec: Such things shall be functions, function templates, anything but defines. In particular that is very hard to debug.
}		}

static bool isSubregOf(const MachineOperand &SubReg,
const MachineOperand &SuperReg,
const TargetRegisterInfo *TRI) {

if (!SuperReg.isReg() || !SubReg.isReg())		static MachineOperand *findSingleRegDef(const MachineOperand *Reg,
return false;		const MachineRegisterInfo *MRI) {
		if (!Reg->isReg())
		return nullptr;
		arsenm: This seems to be re-inventing MRI.getVRegDef/MRI.getUniqueVRegDef?

if (isSameReg(SuperReg, SubReg))		auto TRI = static_cast<const SIRegisterInfo *>(
return true;		MRI->getTargetRegisterInfo());

if (SuperReg.getReg() != SubReg.getReg())		MachineOperand *DefUseMO = nullptr;
return false;		for (MachineOperand &DefUse: MRI->def_operands(Reg->getReg())) {
		/* If this is def/use of another subreg of reg then do nothing */
		if (!TRI->isSubregOf(DefUse, *Reg) ||
		!TRI->isSubregOf(*Reg, DefUse))
		continue;

		/* If there exist def/use of superreg of reg then return nullptr*/
		if (!TRI->isSameReg(DefUse, *Reg))
		return nullptr;

LaneBitmask SuperMask = TRI->getSubRegIndexLaneMask(SuperReg.getSubReg());		/* Check that UseMI is only instruction that defs/uses reg */
LaneBitmask SubMask = TRI->getSubRegIndexLaneMask(SubReg.getSubReg());		if (!DefUseMO) {
SuperMask |= ~SubMask;		DefUseMO = &DefUse;
return SuperMask.all();		} else if (DefUseMO->getParent() != DefUse.getParent()) {
		return nullptr;
		}
		}
		return DefUseMO;
}		}

uint64_t SDWASrcOperand::getSrcMods(const SIInstrInfo *TII,		uint64_t SDWASrcOperand::getSrcMods(const SIInstrInfo *TII,
const MachineOperand *SrcOp) const {		const MachineOperand *SrcOp) const {
uint64_t Mods = 0;		uint64_t Mods = 0;
const auto *MI = SrcOp->getParent();		const auto *MI = SrcOp->getParent();
if (TII->getNamedOperand(*MI, AMDGPU::OpName::src0) == SrcOp) {		if (TII->getNamedOperand(*MI, AMDGPU::OpName::src0) == SrcOp) {
if (auto Mod = TII->getNamedOperand(*MI, AMDGPU::OpName::src0_modifiers)) {		if (auto Mod = TII->getNamedOperand(*MI, AMDGPU::OpName::src0_modifiers)) {
Show All 14 Lines	uint64_t SDWASrcOperand::getSrcMods(const SIInstrInfo *TII,
}		}

return Mods;		return Mods;
}		}

MachineInstr *SDWASrcOperand::potentialToConvert(const SIInstrInfo *TII) {		MachineInstr *SDWASrcOperand::potentialToConvert(const SIInstrInfo *TII) {
// For SDWA src operand potential instruction is one that use register		// For SDWA src operand potential instruction is one that use register
// defined by parent instruction		// defined by parent instruction
MachineRegisterInfo *MRI = getMRI();		MachineOperand *PotentialMO = findSingleRegUse(getReplacedOperand(), getMRI());
MachineOperand *Replaced = getReplacedOperand();		if (!PotentialMO)
assert(Replaced->isReg());

MachineInstr *PotentialMI = nullptr;
for (MachineOperand &PotentialMO : MRI->use_operands(Replaced->getReg())) {
// If this is use of another subreg of dst reg then do nothing
if (!isSubregOf(*Replaced, PotentialMO, MRI->getTargetRegisterInfo()))
continue;

// If there exist use of superreg of dst then we should not combine this
// opernad
if (!isSameReg(PotentialMO, *Replaced))
return nullptr;

// Check that PotentialMI is only instruction that uses dst reg
if (PotentialMI == nullptr) {
PotentialMI = PotentialMO.getParent();
} else if (PotentialMI != PotentialMO.getParent()) {
return nullptr;		return nullptr;
}
}

return PotentialMI;		return PotentialMO->getParent();
}		}

bool SDWASrcOperand::convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) {		bool SDWASrcOperand::convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) {
// Find operand in instruction that matches source operand and replace it with		// Find operand in instruction that matches source operand and replace it with
// target operand. Set corresponding src_sel		// target operand. Set corresponding src_sel

		const SIRegisterInfo *TRI = getTRI();

MachineOperand *Src = TII->getNamedOperand(MI, AMDGPU::OpName::src0);		MachineOperand *Src = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
MachineOperand *SrcSel = TII->getNamedOperand(MI, AMDGPU::OpName::src0_sel);		MachineOperand *SrcSel = TII->getNamedOperand(MI, AMDGPU::OpName::src0_sel);
MachineOperand *SrcMods =		MachineOperand *SrcMods =
TII->getNamedOperand(MI, AMDGPU::OpName::src0_modifiers);		TII->getNamedOperand(MI, AMDGPU::OpName::src0_modifiers);
assert(Src && (Src->isReg() || Src->isImm()));		assert(Src && (Src->isReg() || Src->isImm()));
if (!isSameReg(Src, getReplacedOperand())) {		if (!TRI->isSameReg(Src, getReplacedOperand())) {
// If this is not src0 then it should be src1		// If this is not src0 then it should be src1
Src = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		Src = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
SrcSel = TII->getNamedOperand(MI, AMDGPU::OpName::src1_sel);		SrcSel = TII->getNamedOperand(MI, AMDGPU::OpName::src1_sel);
SrcMods = TII->getNamedOperand(MI, AMDGPU::OpName::src1_modifiers);		SrcMods = TII->getNamedOperand(MI, AMDGPU::OpName::src1_modifiers);

assert(Src && Src->isReg());		assert(Src && Src->isReg());

if ((MI.getOpcode() == AMDGPU::V_MAC_F16_sdwa ||		if ((MI.getOpcode() == AMDGPU::V_MAC_F16_sdwa ||
MI.getOpcode() == AMDGPU::V_MAC_F32_sdwa) &&		MI.getOpcode() == AMDGPU::V_MAC_F32_sdwa) &&
!isSameReg(Src, getReplacedOperand())) {		!TRI->isSameReg(Src, getReplacedOperand())) {
// In case of v_mac_f16/32_sdwa this pass can try to apply src operand to		// In case of v_mac_f16/32_sdwa this pass can try to apply src operand to
// src2. This is not allowed.		// src2. This is not allowed.
return false;		return false;
}		}

assert(isSameReg(Src, getReplacedOperand()) && SrcSel && SrcMods);		assert(TRI->isSameReg(Src, getReplacedOperand()) && SrcSel && SrcMods);
}		}
copyRegOperand(Src, getTargetOperand());		copyRegOperand(Src, getTargetOperand());
SrcSel->setImm(getSrcSel());		SrcSel->setImm(getSrcSel());
SrcMods->setImm(getSrcMods(TII, Src));		SrcMods->setImm(getSrcMods(TII, Src));
getTargetOperand()->setIsKill(false);		getTargetOperand()->setIsKill(false);
return true;		return true;
}		}

MachineInstr *SDWADstOperand::potentialToConvert(const SIInstrInfo *TII) {		MachineInstr *SDWADstOperand::potentialToConvert(const SIInstrInfo *TII) {
// For SDWA dst operand potential instruction is one that defines register		// For SDWA dst operand potential instruction is one that defines register
// that this operand uses		// that this operand uses
MachineRegisterInfo *MRI = getMRI();		MachineRegisterInfo *MRI = getMRI();
MachineInstr *ParentMI = getParentInst();		MachineInstr *ParentMI = getParentInst();
MachineOperand *Replaced = getReplacedOperand();
assert(Replaced->isReg());

for (MachineOperand &PotentialMO : MRI->def_operands(Replaced->getReg())) {
if (!isSubregOf(*Replaced, PotentialMO, MRI->getTargetRegisterInfo()))
continue;

if (!isSameReg(*Replaced, PotentialMO))		MachineOperand *PotentialMO = findSingleRegDef(getReplacedOperand(), MRI);
		if (!PotentialMO)
return nullptr;		return nullptr;

// Check that ParentMI is the only instruction that uses replaced register		// Check that ParentMI is the only instruction that uses replaced register
for (MachineOperand &UseMO : MRI->use_operands(PotentialMO.getReg())) {		for (MachineOperand &UseMO : MRI->use_operands(PotentialMO->getReg())) {
if (isSubregOf(UseMO, PotentialMO, MRI->getTargetRegisterInfo()) &&		if (getTRI()->isSubregOf(UseMO, *PotentialMO) &&
UseMO.getParent() != ParentMI) {		UseMO.getParent() != ParentMI) {
return nullptr;		return nullptr;
}		}
}		}

// Due to SSA this should be onle def of replaced register, so return it		return PotentialMO->getParent();
return PotentialMO.getParent();
}

return nullptr;
}		}

bool SDWADstOperand::convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) {		bool SDWADstOperand::convertToSDWA(MachineInstr &MI, const SIInstrInfo *TII) {
// Replace vdst operand in MI with target operand. Set dst_sel and dst_unused		// Replace vdst operand in MI with target operand. Set dst_sel and dst_unused

if ((MI.getOpcode() == AMDGPU::V_MAC_F16_sdwa ||		if ((MI.getOpcode() == AMDGPU::V_MAC_F16_sdwa ||
MI.getOpcode() == AMDGPU::V_MAC_F32_sdwa) &&		MI.getOpcode() == AMDGPU::V_MAC_F32_sdwa) &&
getDstSel() != AMDGPU::SDWA::DWORD) {		getDstSel() != AMDGPU::SDWA::DWORD) {
// v_mac_f16/32_sdwa allow dst_sel to be equal only to DWORD		// v_mac_f16/32_sdwa allow dst_sel to be equal only to DWORD
return false;		return false;
}		}

MachineOperand *Operand = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);		MachineOperand *Operand = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);
assert(Operand &&		assert(Operand &&
Operand->isReg() &&		Operand->isReg() &&
isSameReg(Operand, getReplacedOperand()));		getTRI()->isSameReg(Operand, getReplacedOperand()));
copyRegOperand(Operand, getTargetOperand());		copyRegOperand(Operand, getTargetOperand());
MachineOperand *DstSel= TII->getNamedOperand(MI, AMDGPU::OpName::dst_sel);		MachineOperand *DstSel= TII->getNamedOperand(MI, AMDGPU::OpName::dst_sel);
assert(DstSel);		assert(DstSel);
DstSel->setImm(getDstSel());		DstSel->setImm(getDstSel());
MachineOperand *DstUnused= TII->getNamedOperand(MI, AMDGPU::OpName::dst_unused);		MachineOperand *DstUnused= TII->getNamedOperand(MI, AMDGPU::OpName::dst_unused);
assert(DstUnused);		assert(DstUnused);
DstUnused->setImm(getDstUnused());		DstUnused->setImm(getDstUnused());

// Remove original instruction because it would conflict with our new		// Remove original instruction because it would conflict with our new
// instruction by register definition		// instruction by register definition
getParentInst()->eraseFromParent();		getParentInst()->eraseFromParent();
return true;		return true;
}		}

		bool SDWADstPreserveOperand::convertToSDWA(MachineInstr &MI,
		const SIInstrInfo *TII) {
		// MI should be moved right before v_or_b32.
		// For this we should clear all kill flags on uses of MI src-operands or else
		// we can encounter problem with use of killed operand.
		for (MachineOperand &MO: MI.uses()) {
		arsenm: This whole loop is just MRI.clearKillFlags()
		if (!MO.isReg())
		continue;
		for (MachineOperand &Use: getMRI()->use_operands(MO.getReg())) {
		Use.setIsKill(false);
		}
		}

		// Move MI before v_or_b32
		auto MBB = MI.getParent();
		MBB->remove(&MI);
		MBB->insert(getParentInst(), &MI);

		// Add Implicit use of preserved register
		MachineInstrBuilder MIB(*MBB->getParent(), MI);
		MIB.addReg(getPreservedOperand()->getReg(),
		RegState::ImplicitKill,
		getPreservedOperand()->getSubReg());

		// Convert MI as any other SDWADstOperand and remove v_or_b32
		return SDWADstOperand::convertToSDWA(MI, TII);
		}
		arsenm: This is still manually tying the result operand. I was expecting another set of _sdwa opcodes with the preserve behavior with the tied operand statically known in the instruction definition. Manually tying this way is potentially hazardous because the verifier won't check it, and it makes it easier for another pass to accidentally drop the tied operand. I would expect there to be an InstrMapping table between the SDWA opcode and the SDWA with preserve set versions.
		SamWot (author): I thought that having a separate instruction definition for just the preserve case would be overkill. I didn't want to introduce another new kind of SDWA instruction that would only bloat the already huge set of instruction definitions. But if you think this would be better, I will introduce new instruction definitions.

Optional<int64_t> SIPeepholeSDWA::foldToImm(const MachineOperand &Op) const {		Optional<int64_t> SIPeepholeSDWA::foldToImm(const MachineOperand &Op) const {
if (Op.isImm()) {		if (Op.isImm()) {
return Op.getImm();		return Op.getImm();
}		}

// If this is not immediate then it can be copy of immediate value, e.g.:		// If this is not immediate then it can be copy of immediate value, e.g.:
// %vreg1<def> = S_MOV_B32 255;		// %vreg1<def> = S_MOV_B32 255;
if (Op.isReg()) {		if (Op.isReg()) {
for (const MachineOperand &Def : MRI->def_operands(Op.getReg())) {		for (const MachineOperand &Def : MRI->def_operands(Op.getReg())) {
if (!isSameReg(Op, Def))		if (!TRI->isSameReg(Op, Def))
continue;		continue;

const MachineInstr *DefInst = Def.getParent();		const MachineInstr *DefInst = Def.getParent();
if (!TII->isFoldableCopy(*DefInst))		if (!TII->isFoldableCopy(*DefInst))
return None;		return None;

const MachineOperand &Copied = DefInst->getOperand(1);		const MachineOperand &Copied = DefInst->getOperand(1);
if (!Copied.isImm())		if (!Copied.isImm())
return None;		return None;

return Copied.getImm();		return Copied.getImm();
}		}
}		}

return None;		return None;
}		}

void SIPeepholeSDWA::matchSDWAOperands(MachineFunction &MF) {		std::unique_ptr<SDWAOperand>
for (MachineBasicBlock &MBB : MF) {		SIPeepholeSDWA::matchSDWAOperand(MachineInstr &MI) {
for (MachineInstr &MI : MBB) {
unsigned Opcode = MI.getOpcode();		unsigned Opcode = MI.getOpcode();
switch (Opcode) {		switch (Opcode) {
case AMDGPU::V_LSHRREV_B32_e32:		case AMDGPU::V_LSHRREV_B32_e32:
case AMDGPU::V_ASHRREV_I32_e32:		case AMDGPU::V_ASHRREV_I32_e32:
case AMDGPU::V_LSHLREV_B32_e32:		case AMDGPU::V_LSHLREV_B32_e32:
case AMDGPU::V_LSHRREV_B32_e64:		case AMDGPU::V_LSHRREV_B32_e64:
case AMDGPU::V_ASHRREV_I32_e64:		case AMDGPU::V_ASHRREV_I32_e64:
case AMDGPU::V_LSHLREV_B32_e64: {		case AMDGPU::V_LSHLREV_B32_e64: {
// from: v_lshrrev_b32_e32 v1, 16/24, v0		// from: v_lshrrev_b32_e32 v1, 16/24, v0
// to SDWA src:v0 src_sel:WORD_1/BYTE_3		// to SDWA src:v0 src_sel:WORD_1/BYTE_3

// from: v_ashrrev_i32_e32 v1, 16/24, v0		// from: v_ashrrev_i32_e32 v1, 16/24, v0
// to SDWA src:v0 src_sel:WORD_1/BYTE_3 sext:1		// to SDWA src:v0 src_sel:WORD_1/BYTE_3 sext:1

// from: v_lshlrev_b32_e32 v1, 16/24, v0		// from: v_lshlrev_b32_e32 v1, 16/24, v0
// to SDWA dst:v1 dst_sel:WORD_1/BYTE_3 dst_unused:UNUSED_PAD		// to SDWA dst:v1 dst_sel:WORD_1/BYTE_3 dst_unused:UNUSED_PAD
MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);		MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
auto Imm = foldToImm(*Src0);		auto Imm = foldToImm(*Src0);
if (!Imm)		if (!Imm)
break;		break;

if (Imm != 16 && Imm != 24)		if (Imm != 16 && Imm != 24)
break;		break;

MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);		MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);
if (TRI->isPhysicalRegister(Src1->getReg()) ||		if (TRI->isPhysicalRegister(Src1->getReg()) ||
TRI->isPhysicalRegister(Dst->getReg()))		TRI->isPhysicalRegister(Dst->getReg()))
break;		break;

if (Opcode == AMDGPU::V_LSHLREV_B32_e32 ||		if (Opcode == AMDGPU::V_LSHLREV_B32_e32 ||
Opcode == AMDGPU::V_LSHLREV_B32_e64) {		Opcode == AMDGPU::V_LSHLREV_B32_e64) {
auto SDWADst = make_unique<SDWADstOperand>(		return make_unique<SDWADstOperand>(
Dst, Src1, *Imm == 16 ? WORD_1 : BYTE_3, UNUSED_PAD);		Dst, Src1, *Imm == 16 ? WORD_1 : BYTE_3, UNUSED_PAD);
DEBUG(dbgs() << "Match: " << MI << "To: " << *SDWADst << '\n');
SDWAOperands[&MI] = std::move(SDWADst);
++NumSDWAPatternsFound;
} else {		} else {
auto SDWASrc = make_unique<SDWASrcOperand>(		return make_unique<SDWASrcOperand>(
Src1, Dst, *Imm == 16 ? WORD_1 : BYTE_3, false, false,		Src1, Dst, *Imm == 16 ? WORD_1 : BYTE_3, false, false,
Opcode != AMDGPU::V_LSHRREV_B32_e32 &&		Opcode != AMDGPU::V_LSHRREV_B32_e32 &&
Opcode != AMDGPU::V_LSHRREV_B32_e64);		Opcode != AMDGPU::V_LSHRREV_B32_e64);
DEBUG(dbgs() << "Match: " << MI << "To: " << *SDWASrc << '\n');
SDWAOperands[&MI] = std::move(SDWASrc);
++NumSDWAPatternsFound;
}		}
break;		break;
}		}

case AMDGPU::V_LSHRREV_B16_e32:		case AMDGPU::V_LSHRREV_B16_e32:
case AMDGPU::V_ASHRREV_I16_e32:		case AMDGPU::V_ASHRREV_I16_e32:
case AMDGPU::V_LSHLREV_B16_e32:		case AMDGPU::V_LSHLREV_B16_e32:
case AMDGPU::V_LSHRREV_B16_e64:		case AMDGPU::V_LSHRREV_B16_e64:
case AMDGPU::V_ASHRREV_I16_e64:		case AMDGPU::V_ASHRREV_I16_e64:
case AMDGPU::V_LSHLREV_B16_e64: {		case AMDGPU::V_LSHLREV_B16_e64: {
// from: v_lshrrev_b16_e32 v1, 8, v0		// from: v_lshrrev_b16_e32 v1, 8, v0
// to SDWA src:v0 src_sel:BYTE_1		// to SDWA src:v0 src_sel:BYTE_1

// from: v_ashrrev_i16_e32 v1, 8, v0		// from: v_ashrrev_i16_e32 v1, 8, v0
// to SDWA src:v0 src_sel:BYTE_1 sext:1		// to SDWA src:v0 src_sel:BYTE_1 sext:1

// from: v_lshlrev_b16_e32 v1, 8, v0		// from: v_lshlrev_b16_e32 v1, 8, v0
// to SDWA dst:v1 dst_sel:BYTE_1 dst_unused:UNUSED_PAD		// to SDWA dst:v1 dst_sel:BYTE_1 dst_unused:UNUSED_PAD
MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);		MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
auto Imm = foldToImm(*Src0);		auto Imm = foldToImm(*Src0);
if (!Imm || *Imm != 8)		if (!Imm || *Imm != 8)
break;		break;

MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);		MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);

if (TRI->isPhysicalRegister(Src1->getReg()) ||		if (TRI->isPhysicalRegister(Src1->getReg()) ||
TRI->isPhysicalRegister(Dst->getReg()))		TRI->isPhysicalRegister(Dst->getReg()))
break;		break;

if (Opcode == AMDGPU::V_LSHLREV_B16_e32 ||		if (Opcode == AMDGPU::V_LSHLREV_B16_e32 ||
Opcode == AMDGPU::V_LSHLREV_B16_e64) {		Opcode == AMDGPU::V_LSHLREV_B16_e64) {
auto SDWADst =		return make_unique<SDWADstOperand>(Dst, Src1, BYTE_1, UNUSED_PAD);
make_unique<SDWADstOperand>(Dst, Src1, BYTE_1, UNUSED_PAD);
DEBUG(dbgs() << "Match: " << MI << "To: " << *SDWADst << '\n');
SDWAOperands[&MI] = std::move(SDWADst);
++NumSDWAPatternsFound;
} else {		} else {
auto SDWASrc = make_unique<SDWASrcOperand>(		return make_unique<SDWASrcOperand>(
Src1, Dst, BYTE_1, false, false,		Src1, Dst, BYTE_1, false, false,
Opcode != AMDGPU::V_LSHRREV_B16_e32 &&		Opcode != AMDGPU::V_LSHRREV_B16_e32 &&
Opcode != AMDGPU::V_LSHRREV_B16_e64);		Opcode != AMDGPU::V_LSHRREV_B16_e64);
DEBUG(dbgs() << "Match: " << MI << "To: " << *SDWASrc << '\n');
SDWAOperands[&MI] = std::move(SDWASrc);
++NumSDWAPatternsFound;
}		}
break;		break;
}		}

case AMDGPU::V_BFE_I32:		case AMDGPU::V_BFE_I32:
case AMDGPU::V_BFE_U32: {		case AMDGPU::V_BFE_U32: {
// e.g.:		// e.g.:
// from: v_bfe_u32 v1, v0, 8, 8		// from: v_bfe_u32 v1, v0, 8, 8
// to SDWA src:v0 src_sel:BYTE_1		// to SDWA src:v0 src_sel:BYTE_1

// offset | width | src_sel		// offset | width | src_sel
// ------------------------		// ------------------------
// 0      | 8     | BYTE_0		// 0      | 8     | BYTE_0
// 0      | 16    | WORD_0		// 0      | 16    | WORD_0
// 0      | 32    | DWORD ?		// 0      | 32    | DWORD ?
// 8      | 8     | BYTE_1		// 8      | 8     | BYTE_1
// 16     | 8     | BYTE_2		// 16     | 8     | BYTE_2
// 16     | 16    | WORD_1		// 16     | 16    | WORD_1
// 24     | 8     | BYTE_3		// 24     | 8     | BYTE_3

MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
auto Offset = foldToImm(*Src1);		auto Offset = foldToImm(*Src1);
if (!Offset)		if (!Offset)
break;		break;

MachineOperand *Src2 = TII->getNamedOperand(MI, AMDGPU::OpName::src2);		MachineOperand *Src2 = TII->getNamedOperand(MI, AMDGPU::OpName::src2);
auto Width = foldToImm(*Src2);		auto Width = foldToImm(*Src2);
if (!Width)		if (!Width)
break;		break;

SdwaSel SrcSel = DWORD;		SdwaSel SrcSel = DWORD;

if (Offset == 0 && Width == 8)		if (Offset == 0 && Width == 8)
SrcSel = BYTE_0;		SrcSel = BYTE_0;
else if (Offset == 0 && Width == 16)		else if (Offset == 0 && Width == 16)
SrcSel = WORD_0;		SrcSel = WORD_0;
else if (Offset == 0 && Width == 32)		else if (Offset == 0 && Width == 32)
SrcSel = DWORD;		SrcSel = DWORD;
else if (Offset == 8 && Width == 8)		else if (Offset == 8 && Width == 8)
SrcSel = BYTE_1;		SrcSel = BYTE_1;
else if (Offset == 16 && Width == 8)		else if (Offset == 16 && Width == 8)
SrcSel = BYTE_2;		SrcSel = BYTE_2;
else if (Offset == 16 && Width == 16)		else if (Offset == 16 && Width == 16)
SrcSel = WORD_1;		SrcSel = WORD_1;
else if (Offset == 24 && Width == 8)		else if (Offset == 24 && Width == 8)
SrcSel = BYTE_3;		SrcSel = BYTE_3;
else		else
break;		break;

MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);		MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);		MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);

if (TRI->isPhysicalRegister(Src0->getReg()) ||		if (TRI->isPhysicalRegister(Src0->getReg()) ||
TRI->isPhysicalRegister(Dst->getReg()))		TRI->isPhysicalRegister(Dst->getReg()))
break;		break;

auto SDWASrc = make_unique<SDWASrcOperand>(		return make_unique<SDWASrcOperand>(
Src0, Dst, SrcSel, false, false,		Src0, Dst, SrcSel, false, false, Opcode != AMDGPU::V_BFE_U32);
Opcode != AMDGPU::V_BFE_U32);
DEBUG(dbgs() << "Match: " << MI << "To: " << *SDWASrc << '\n');
SDWAOperands[&MI] = std::move(SDWASrc);
++NumSDWAPatternsFound;
break;
}		}

case AMDGPU::V_AND_B32_e32:		case AMDGPU::V_AND_B32_e32:
case AMDGPU::V_AND_B32_e64: {		case AMDGPU::V_AND_B32_e64: {
// e.g.:		// e.g.:
// from: v_and_b32_e32 v1, 0x0000ffff/0x000000ff, v0		// from: v_and_b32_e32 v1, 0x0000ffff/0x000000ff, v0
// to SDWA src:v0 src_sel:WORD_0/BYTE_0		// to SDWA src:v0 src_sel:WORD_0/BYTE_0

MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);		MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);		MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
auto ValSrc = Src1;		auto ValSrc = Src1;
auto Imm = foldToImm(*Src0);		auto Imm = foldToImm(*Src0);

if (!Imm) {		if (!Imm) {
Imm = foldToImm(*Src1);		Imm = foldToImm(*Src1);
ValSrc = Src0;		ValSrc = Src0;
}		}

if (!Imm || (Imm != 0x0000ffff && Imm != 0x000000ff))		if (!Imm || (Imm != 0x0000ffff && Imm != 0x000000ff))
break;		break;

MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);		MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);

if (TRI->isPhysicalRegister(Src1->getReg()) ||		if (TRI->isPhysicalRegister(Src1->getReg()) ||
TRI->isPhysicalRegister(Dst->getReg()))		TRI->isPhysicalRegister(Dst->getReg()))
break;		break;

auto SDWASrc = make_unique<SDWASrcOperand>(		return make_unique<SDWASrcOperand>(
ValSrc, Dst, *Imm == 0x0000ffff ? WORD_0 : BYTE_0);		ValSrc, Dst, *Imm == 0x0000ffff ? WORD_0 : BYTE_0);
DEBUG(dbgs() << "Match: " << MI << "To: " << *SDWASrc << '\n');		}
SDWAOperands[&MI] = std::move(SDWASrc);
++NumSDWAPatternsFound;		case AMDGPU::V_OR_B32_e32:
		case AMDGPU::V_OR_B32_e64: {
		// Patterns for dst_unused:UNUSED_PRESERVE.
		// e.g., from:
		// v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD
		// src1_sel:WORD_1 src2_sel:WORD_1
		// v_add_f16_e32 v3, v1, v2
		// v_or_b32_e32 v4, v0, v3
		// to SDWA preserve dst:v4 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE preserve:v3

		// Check if one of operands of v_or_b32 is SDWA instruction
		using CheckRetType = Optional<std::pair<MachineOperand *, MachineOperand *>>;
		auto CheckOROperandsForSDWA =
		[&](const MachineOperand *Op1, const MachineOperand *Op2) -> CheckRetType {
		if (!Op1 || !Op1->isReg() || !Op2 || !Op2->isReg())
		return CheckRetType(None);

		MachineOperand *Op1Def = findSingleRegDef(Op1, MRI);
		if (!Op1Def)
		return CheckRetType(None);

		MachineInstr *Op1Inst = Op1Def->getParent();
		if (!TII->isSDWA(*Op1Inst))
		return CheckRetType(None);

		MachineOperand *Op2Def = findSingleRegDef(Op2, MRI);
		if (!Op2Def)
		return CheckRetType(None);

		return CheckRetType(std::make_pair(Op1Def, Op2Def));
		};

		MachineOperand *OrSDWA = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
		MachineOperand *OrOther = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
		assert(OrSDWA && OrOther);
		auto Res = CheckOROperandsForSDWA(OrSDWA, OrOther);
		if (!Res) {
		OrSDWA = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
		OrOther = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
		assert(OrSDWA && OrOther);
		Res = CheckOROperandsForSDWA(OrSDWA, OrOther);
		if (!Res)
break;		break;
}		}

		MachineOperand *OrSDWADef = Res->first;
		MachineOperand *OrOtherDef = Res->second;
		assert(OrSDWADef && OrOtherDef);

		MachineInstr *SDWAInst = OrSDWADef->getParent();
		MachineInstr *OtherInst = OrOtherDef->getParent();

		// Check that OtherInstr is actually bitwise compatible with SDWAInst = their
		// destination patterns don't overlap. Compatible instruction can be either
		// regular instruction with compatible bitness or SDWA instruction with
		// correct dst_sel
		// SDWAInst | OtherInst bitness     / OtherInst dst_sel
		// -----------------------------------------------------
		// DWORD    | no                    / no
		// WORD_0   | no                    / BYTE_2/3, WORD_1
		// WORD_1   | 8/16-bit instructions / BYTE_0/1, WORD_0
		// BYTE_0   | no                    / BYTE_1/2/3, WORD_1
		// BYTE_1   | 8-bit                 / BYTE_0/2/3, WORD_1
		// BYTE_2   | 8/16-bit              / BYTE_0/1/3, WORD_0
		// BYTE_3   | 8/16/24-bit           / BYTE_0/1/2, WORD_0
		// E.g. if SDWAInst is v_add_f16_sdwa dst_sel:WORD_1 then v_add_f16 is OK
		// but v_add_f32 is not.

		// TODO: add support for non-SDWA instructions as OtherInst.
		// For now this only works with SDWA instructions. For regular instructions
		// there is no way to determine if instruction write only 8/16/24-bit out of
		// full register size and all registers are at min 32-bit wide.
		if (!TII->isSDWA(*OtherInst))
		break;

		SdwaSel DstSel = static_cast<SdwaSel>(
		TII->getNamedImmOperand(*SDWAInst, AMDGPU::OpName::dst_sel));
		SdwaSel OtherDstSel = static_cast<SdwaSel>(
		TII->getNamedImmOperand(*OtherInst, AMDGPU::OpName::dst_sel));

		bool DstSelAgree = false;
		switch (DstSel) {
		case WORD_0: DstSelAgree = ((OtherDstSel == BYTE_2) ||
		(OtherDstSel == BYTE_3) ||
		(OtherDstSel == WORD_1));
		break;
		case WORD_1: DstSelAgree = ((OtherDstSel == BYTE_0) ||
		(OtherDstSel == BYTE_1) ||
		(OtherDstSel == WORD_0));
		break;
		case BYTE_0: DstSelAgree = ((OtherDstSel == BYTE_1) ||
		(OtherDstSel == BYTE_2) ||
		(OtherDstSel == BYTE_3) ||
		(OtherDstSel == WORD_1));
		break;
		case BYTE_1: DstSelAgree = ((OtherDstSel == BYTE_0) ||
		(OtherDstSel == BYTE_2) ||
		(OtherDstSel == BYTE_3) ||
		(OtherDstSel == WORD_1));
		break;
		case BYTE_2: DstSelAgree = ((OtherDstSel == BYTE_0) ||
		(OtherDstSel == BYTE_1) ||
		(OtherDstSel == BYTE_3) ||
		(OtherDstSel == WORD_0));
		break;
		case BYTE_3: DstSelAgree = ((OtherDstSel == BYTE_0) ||
		(OtherDstSel == BYTE_1) ||
		(OtherDstSel == BYTE_2) ||
		(OtherDstSel == WORD_0));
		break;
		default: DstSelAgree = false;
		}

		if (!DstSelAgree)
		break;

		// Also OtherInst dst_unused should be UNUSED_PAD
		DstUnused OtherDstUnused = static_cast<DstUnused>(
		TII->getNamedImmOperand(*OtherInst, AMDGPU::OpName::dst_unused));
		if (OtherDstUnused != DstUnused::UNUSED_PAD)
		break;

		// Create DstPreserveOperand
		MachineOperand *OrDst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);
		assert(OrDst && OrDst->isReg());

		return make_unique<SDWADstPreserveOperand>(
		OrDst, OrSDWADef, OrOtherDef, DstSel);

		}
		}

		return std::unique_ptr<SDWAOperand>(nullptr);
		}
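
The dst_sel compatibility table in matchSDWAOperand above can also be read as a byte-mask disjointness test. The following is a minimal standalone sketch of that reading, not code from the patch: the local SdwaSel enum is a stand-in whose values are assumed to mirror the encodings visible in the MIR tests (e.g. 5 for WORD_1), and selByteMask/dstSelAgree are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for the SDWA selector enum (assumed encodings, not the
// backend's own header).
enum SdwaSel : unsigned {
  BYTE_0 = 0, BYTE_1 = 1, BYTE_2 = 2, BYTE_3 = 3,
  WORD_0 = 4, WORD_1 = 5, DWORD = 6
};

// Hypothetical helper: one bit per byte of the 32-bit destination register
// (bit 0 is the lowest byte).
inline uint8_t selByteMask(SdwaSel Sel) {
  switch (Sel) {
  case BYTE_0: return 0x1;
  case BYTE_1: return 0x2;
  case BYTE_2: return 0x4;
  case BYTE_3: return 0x8;
  case WORD_0: return 0x3;
  case WORD_1: return 0xC;
  case DWORD:  return 0xF;
  }
  return 0xF; // Unknown selector: conservatively treat as a full-register write.
}

// Two destination selectors agree when the bytes they write do not overlap.
inline bool dstSelAgree(SdwaSel A, SdwaSel B) {
  return (selByteMask(A) & selByteMask(B)) == 0;
}
```

Under these assumptions dstSelAgree(WORD_0, WORD_1) holds while dstSelAgree(WORD_1, BYTE_2) does not, which matches the corresponding rows of the table; DWORD never agrees with anything because its mask covers every byte.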

		void SIPeepholeSDWA::matchSDWAOperands(MachineFunction &MF) {
		for (MachineBasicBlock &MBB : MF) {
		for (MachineInstr &MI : MBB) {
		if (auto Operand = matchSDWAOperand(MI)) {
		DEBUG(dbgs() << "Match: " << MI << "To: " << *Operand << '\n');
		SDWAOperands[&MI] = std::move(Operand);
		++NumSDWAPatternsFound;
}		}
}		}
}		}
}		}

bool SIPeepholeSDWA::isConvertibleToSDWA(const MachineInstr &MI,		bool SIPeepholeSDWA::isConvertibleToSDWA(const MachineInstr &MI,
const SISubtarget &ST) const {		const SISubtarget &ST) const {
		// Check if this is already an SDWA instruction
		unsigned Opc = MI.getOpcode();
		if (TII->isSDWA(Opc))
		return true;

// Check if this instruction has opcode that supports SDWA		// Check if this instruction has opcode that supports SDWA
int Opc = MI.getOpcode();
if (AMDGPU::getSDWAOp(Opc) == -1)		if (AMDGPU::getSDWAOp(Opc) == -1)
Opc = AMDGPU::getVOPe32(Opc);		Opc = AMDGPU::getVOPe32(Opc);

if (Opc == -1 || AMDGPU::getSDWAOp(Opc) == -1)		if (Opc == -1 || AMDGPU::getSDWAOp(Opc) == -1)
return false;		return false;

if (!ST.hasSDWAOmod() && TII->hasModifiersSet(MI, AMDGPU::OpName::omod))		if (!ST.hasSDWAOmod() && TII->hasModifiersSet(MI, AMDGPU::OpName::omod))
return false;		return false;
Show All 20 Lines	if (!ST.hasSDWAMac() && (Opc == AMDGPU::V_MAC_F16_e32 ||
return false;		return false;

return true;		return true;
}		}

bool SIPeepholeSDWA::convertToSDWA(MachineInstr &MI,		bool SIPeepholeSDWA::convertToSDWA(MachineInstr &MI,
const SDWAOperandsVector &SDWAOperands) {		const SDWAOperandsVector &SDWAOperands) {
// Convert to sdwa		// Convert to sdwa
int SDWAOpcode = AMDGPU::getSDWAOp(MI.getOpcode());		int SDWAOpcode;
		rampitec: Needs cast from unsigned or use unsigned for SDWAOpcode/Opcode.
		unsigned Opcode = MI.getOpcode();
		if (TII->isSDWA(Opcode)) {
		SDWAOpcode = Opcode;
		} else {
		SDWAOpcode = AMDGPU::getSDWAOp(Opcode);
if (SDWAOpcode == -1)		if (SDWAOpcode == -1)
SDWAOpcode = AMDGPU::getSDWAOp(AMDGPU::getVOPe32(MI.getOpcode()));		SDWAOpcode = AMDGPU::getSDWAOp(AMDGPU::getVOPe32(Opcode));
		}
assert(SDWAOpcode != -1);		assert(SDWAOpcode != -1);

const MCInstrDesc &SDWADesc = TII->get(SDWAOpcode);		const MCInstrDesc &SDWADesc = TII->get(SDWAOpcode);

// Create SDWA version of instruction MI and initialize its operands		// Create SDWA version of instruction MI and initialize its operands
MachineInstrBuilder SDWAInst =		MachineInstrBuilder SDWAInst =
BuildMI(*MI.getParent(), MI, MI.getDebugLoc(), SDWADesc);		BuildMI(*MI.getParent(), MI, MI.getDebugLoc(), SDWADesc);

▲ Show 20 Lines • Show All 59 Lines • ▼ Show 20 Lines	if (AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::omod) != -1) {
MachineOperand *OMod = TII->getNamedOperand(MI, AMDGPU::OpName::omod);		MachineOperand *OMod = TII->getNamedOperand(MI, AMDGPU::OpName::omod);
if (OMod) {		if (OMod) {
SDWAInst.add(*OMod);		SDWAInst.add(*OMod);
} else {		} else {
SDWAInst.addImm(0);		SDWAInst.addImm(0);
}		}
}		}

// Initialize dst_sel if present		// Copy dst_sel if present, initialize otherwise if needed
if (AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::dst_sel) != -1) {		if (AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::dst_sel) != -1) {
		MachineOperand *DstSel = TII->getNamedOperand(MI, AMDGPU::OpName::dst_sel);
		if (DstSel) {
		SDWAInst.add(*DstSel);
		} else {
SDWAInst.addImm(AMDGPU::SDWA::SdwaSel::DWORD);		SDWAInst.addImm(AMDGPU::SDWA::SdwaSel::DWORD);
}		}
		}

// Initialize dst_unused if present		// Copy dst_unused if present, initialize otherwise if needed
if (AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::dst_unused) != -1) {		if (AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::dst_unused) != -1) {
		MachineOperand *DstUnused = TII->getNamedOperand(MI, AMDGPU::OpName::dst_unused);
		if (DstUnused) {
		SDWAInst.add(*DstUnused);
		} else {
SDWAInst.addImm(AMDGPU::SDWA::DstUnused::UNUSED_PAD);		SDWAInst.addImm(AMDGPU::SDWA::DstUnused::UNUSED_PAD);
}		}
		}

// Initialize src0_sel		// Copy src0_sel if present, initialize otherwise
assert(AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::src0_sel) != -1);		assert(AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::src0_sel) != -1);
		MachineOperand *Src0Sel = TII->getNamedOperand(MI, AMDGPU::OpName::src0_sel);
		if (Src0Sel) {
		SDWAInst.add(*Src0Sel);
		} else {
SDWAInst.addImm(AMDGPU::SDWA::SdwaSel::DWORD);		SDWAInst.addImm(AMDGPU::SDWA::SdwaSel::DWORD);
		}

		// Copy src1_sel if present, initialize otherwise if needed
// Initialize src1_sel if present
if (Src1) {		if (Src1) {
assert(AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::src1_sel) != -1);		assert(AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::src1_sel) != -1);
		MachineOperand *Src1Sel = TII->getNamedOperand(MI, AMDGPU::OpName::src1_sel);
		if (Src1Sel) {
		SDWAInst.add(*Src1Sel);
		} else {
SDWAInst.addImm(AMDGPU::SDWA::SdwaSel::DWORD);		SDWAInst.addImm(AMDGPU::SDWA::SdwaSel::DWORD);
}		}
		}

// Apply all sdwa operand patterns		// Apply all sdwa operand patterns
bool Converted = false;		bool Converted = false;
for (auto &Operand : SDWAOperands) {		for (auto &Operand : SDWAOperands) {
// There should be no intersection between SDWA operands and potential MIs		// There should be no intersection between SDWA operands and potential MIs
// e.g.:		// e.g.:
// v_and_b32 v0, 0xff, v1 -> src:v1 sel:BYTE_0		// v_and_b32 v0, 0xff, v1 -> src:v1 sel:BYTE_0
// v_and_b32 v2, 0xff, v0 -> src:v0 sel:BYTE_0		// v_and_b32 v2, 0xff, v0 -> src:v0 sel:BYTE_0
// v_add_u32 v3, v4, v2		// v_add_u32 v3, v4, v2
//		//
// In that example it is possible that we would fold 2nd instruction into 3rd		// In that example it is possible that we would fold 2nd instruction into 3rd
// (v_add_u32_sdwa) and then try to fold 1st instruction into 2nd (that was		// (v_add_u32_sdwa) and then try to fold 1st instruction into 2nd (that was
// already destroyed). So if SDWAOperand is also a potential MI then do not		// already destroyed). So if SDWAOperand is also a potential MI then do not
// apply it.		// apply it.

		arsenm: Extra newline
if (PotentialMatches.count(Operand->getParentInst()) == 0)		if (PotentialMatches.count(Operand->getParentInst()) == 0)
Converted |= Operand->convertToSDWA(*SDWAInst, TII);		Converted |= Operand->convertToSDWA(*SDWAInst, TII);
}		}
if (Converted) {		if (Converted) {
ConvertedInstructions.push_back(SDWAInst);		ConvertedInstructions.push_back(SDWAInst);
} else {		} else {
SDWAInst->eraseFromParent();		SDWAInst->eraseFromParent();
return false;		return false;
▲ Show 20 Lines • Show All 45 Lines • ▼ Show 20 Lines	bool SIPeepholeSDWA::runOnMachineFunction(MachineFunction &MF) {
if (!ST.hasSDWA())		if (!ST.hasSDWA())
return false;		return false;

MRI = &MF.getRegInfo();		MRI = &MF.getRegInfo();
TRI = ST.getRegisterInfo();		TRI = ST.getRegisterInfo();
TII = ST.getInstrInfo();		TII = ST.getInstrInfo();

// Find all SDWA operands in MF.		// Find all SDWA operands in MF.
		bool Changed = false;
		bool Ret = false;
		do {
		rampitec: This loop itself probably deserves a separate change.
matchSDWAOperands(MF);		matchSDWAOperands(MF);

for (const auto &OperandPair : SDWAOperands) {		for (const auto &OperandPair : SDWAOperands) {
const auto &Operand = OperandPair.second;		const auto &Operand = OperandPair.second;
MachineInstr *PotentialMI = Operand->potentialToConvert(TII);		MachineInstr *PotentialMI = Operand->potentialToConvert(TII);
if (PotentialMI && isConvertibleToSDWA(*PotentialMI, ST)) {		if (PotentialMI && isConvertibleToSDWA(*PotentialMI, ST)) {
PotentialMatches[PotentialMI].push_back(Operand.get());		PotentialMatches[PotentialMI].push_back(Operand.get());
}		}
}		}

for (auto &PotentialPair : PotentialMatches) {		for (auto &PotentialPair : PotentialMatches) {
MachineInstr &PotentialMI = *PotentialPair.first;		MachineInstr &PotentialMI = *PotentialPair.first;
convertToSDWA(PotentialMI, PotentialPair.second);		convertToSDWA(PotentialMI, PotentialPair.second);
}		}

PotentialMatches.clear();		PotentialMatches.clear();
SDWAOperands.clear();		SDWAOperands.clear();

bool Ret = !ConvertedInstructions.empty();		Changed = !ConvertedInstructions.empty();

		if (Changed)
		Ret = true;

while (!ConvertedInstructions.empty())		while (!ConvertedInstructions.empty())
legalizeScalarOperands(*ConvertedInstructions.pop_back_val(), ST);		legalizeScalarOperands(*ConvertedInstructions.pop_back_val(), ST);
		} while (Changed);

return Ret;		return Ret;
}		}
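
The do/while loop added to runOnMachineFunction above reruns matching and conversion until a round converts nothing, and reports whether any round changed the function. A minimal sketch of that control flow in isolation follows; runToFixpoint and the Round callback are hypothetical names used only for illustration and are not part of the patch.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: calls Round repeatedly until it reports no change.
// Round returns true when it modified something in that iteration.
// The return value is true if any iteration made a change.
inline bool runToFixpoint(const std::function<bool()> &Round) {
  bool Ret = false;
  bool Changed = false;
  do {
    Changed = Round();
    if (Changed)
      Ret = true;
  } while (Changed);
  return Ret;
}
```

Note that the final round always does its work and finds nothing to change, so a function with N productive rounds costs N + 1 scans. That extra scan is part of the price of iterating in the pass itself.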

lib/Target/AMDGPU/SIRegisterInfo.h

Show First 20 Lines • Show All 159 Lines • ▼ Show 20 Lines	const TargetRegisterClass *getEquivalentSGPRClass(
const TargetRegisterClass *VRC) const;		const TargetRegisterClass *VRC) const;

/// \returns The register class that is used for a sub-register of \p RC for		/// \returns The register class that is used for a sub-register of \p RC for
/// the given \p SubIdx. If \p SubIdx equals NoSubRegister, \p RC will		/// the given \p SubIdx. If \p SubIdx equals NoSubRegister, \p RC will
/// be returned.		/// be returned.
const TargetRegisterClass *getSubRegClass(const TargetRegisterClass *RC,		const TargetRegisterClass *getSubRegClass(const TargetRegisterClass *RC,
unsigned SubIdx) const;		unsigned SubIdx) const;

		bool isSameReg(const MachineOperand &LHS, const MachineOperand &RHS) const;

		bool isSubregOf(const MachineOperand &SubReg,
		const MachineOperand &SuperReg) const;

bool shouldRewriteCopySrc(const TargetRegisterClass *DefRC,		bool shouldRewriteCopySrc(const TargetRegisterClass *DefRC,
unsigned DefSubReg,		unsigned DefSubReg,
const TargetRegisterClass *SrcRC,		const TargetRegisterClass *SrcRC,
unsigned SrcSubReg) const override;		unsigned SrcSubReg) const override;

/// \returns True if operands defined with this operand type can accept		/// \returns True if operands defined with this operand type can accept
/// a literal constant (i.e. any 32-bit immediate).		/// a literal constant (i.e. any 32-bit immediate).
bool opCanUseLiteralConstant(unsigned OpType) const {		bool opCanUseLiteralConstant(unsigned OpType) const {
▲ Show 20 Lines • Show All 70 Lines • Show Last 20 Lines

lib/Target/AMDGPU/SIRegisterInfo.cpp

Show First 20 Lines • Show All 1,308 Lines • ▼ Show 20 Lines	case 8:
return &AMDGPU::VReg_256RegClass;		return &AMDGPU::VReg_256RegClass;
case 16: /* fall-through */		case 16: /* fall-through */
default:		default:
llvm_unreachable("Invalid sub-register class size");		llvm_unreachable("Invalid sub-register class size");
}		}
}		}
}		}

		bool SIRegisterInfo::isSameReg(const MachineOperand &LHS,
		const MachineOperand &RHS) const {
		arsenm: How are these different from the various MCRegisterInfo functions for checking if registers alias or have subreg relationships?
		return LHS.isReg() &&
		RHS.isReg() &&
		LHS.getReg() == RHS.getReg() &&
		LHS.getSubReg() == RHS.getSubReg();
		}

		bool SIRegisterInfo::isSubregOf(const MachineOperand &SubReg,
		const MachineOperand &SuperReg) const {
		if (!SuperReg.isReg() || !SubReg.isReg())
		return false;

		if (isSameReg(SuperReg, SubReg))
		return true;

		if (SuperReg.getReg() != SubReg.getReg())
		return false;

		LaneBitmask SuperMask = getSubRegIndexLaneMask(SuperReg.getSubReg());
		LaneBitmask SubMask = getSubRegIndexLaneMask(SubReg.getSubReg());
		SuperMask |= ~SubMask;
		return SuperMask.all();
		}
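
The last three lines of isSubregOf above are a subset test on lane masks: OR-ing the super mask with the complement of the sub mask yields all ones exactly when every lane of the sub-register is also covered by the super-register. A minimal sketch with plain 32-bit integers follows; coversLanes is a hypothetical name, and uint32_t is only a stand-in for the backend's lane mask type.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the lane-mask check: returns true when every
// lane set in SubMask is also set in SuperMask.
inline bool coversLanes(uint32_t SuperMask, uint32_t SubMask) {
  const uint32_t AllLanes = ~uint32_t(0);
  // (Super | ~Sub) is all ones iff no lane is in Sub but missing from Super,
  // i.e. iff (Sub & ~Super) == 0.
  return (SuperMask | ~SubMask) == AllLanes;
}
```

For example, coversLanes(0x3, 0x1) is true and coversLanes(0x1, 0x3) is false. A mask with every lane set covers any sub mask.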

bool SIRegisterInfo::shouldRewriteCopySrc(		bool SIRegisterInfo::shouldRewriteCopySrc(
const TargetRegisterClass *DefRC,		const TargetRegisterClass *DefRC,
unsigned DefSubReg,		unsigned DefSubReg,
const TargetRegisterClass *SrcRC,		const TargetRegisterClass *SrcRC,
unsigned SrcSubReg) const {		unsigned SrcSubReg) const {
// We want to prefer the smallest register class possible, so we don't want to		// We want to prefer the smallest register class possible, so we don't want to
// stop and rewrite on anything that looks like a subregister		// stop and rewrite on anything that looks like a subregister
// extract. Operations mostly don't care about the super register class, so we		// extract. Operations mostly don't care about the super register class, so we
▲ Show 20 Lines • Show All 205 Lines • Show Last 20 Lines

test/CodeGen/AMDGPU/fabs.f16.ll

	Show First 20 Lines • Show All 121 Lines • ▼ Show 20 Lines

	; CI: v_cvt_f32_f16_e32			; CI: v_cvt_f32_f16_e32
	; CI: v_cvt_f32_f16_e32			; CI: v_cvt_f32_f16_e32
	; CI: v_mul_f32_e64 v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}}			; CI: v_mul_f32_e64 v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}}
	; CI: v_cvt_f16_f32			; CI: v_cvt_f16_f32
	; CI: v_mul_f32_e64 v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}}			; CI: v_mul_f32_e64 v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}}
	; CI: v_cvt_f16_f32			; CI: v_cvt_f16_f32

	; VI: v_lshrrev_b32_e32 v{{[0-9]+}}, 16,			; VI: v_mul_f16_sdwa v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
	; VI: v_mul_f16_sdwa v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
	; VI: v_mul_f16_e64 v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}}			; VI: v_mul_f16_e64 v{{[0-9]+}}, |v{{[0-9]+}}|, v{{[0-9]+}}

	; GFX9: v_and_b32_e32 [[FABS:v[0-9]+]], 0x7fff7fff, [[VAL]]			; GFX9: v_and_b32_e32 [[FABS:v[0-9]+]], 0x7fff7fff, [[VAL]]
	; GFX9: v_pk_mul_f16 v{{[0-9]+}}, [[FABS]], v{{[0-9]+$}}			; GFX9: v_pk_mul_f16 v{{[0-9]+}}, [[FABS]], v{{[0-9]+$}}
	define amdgpu_kernel void @v_fabs_fold_v2f16(<2 x half> addrspace(1)* %out, <2 x half> addrspace(1)* %in) #0 {			define amdgpu_kernel void @v_fabs_fold_v2f16(<2 x half> addrspace(1)* %out, <2 x half> addrspace(1)* %in) #0 {
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %in, i32 %tid			%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %in, i32 %tid
	%val = load <2 x half>, <2 x half> addrspace(1)* %gep			%val = load <2 x half>, <2 x half> addrspace(1)* %gep
	Show All 13 Lines

test/CodeGen/AMDGPU/fcanonicalize.f16.ll

Show First 20 Lines • Show All 201 Lines • ▼ Show 20 Lines
; GCN: buffer_store_short [[REG]]		; GCN: buffer_store_short [[REG]]
define amdgpu_kernel void @test_fold_canonicalize_snan3_value_f16(half addrspace(1)* %out) #1 {		define amdgpu_kernel void @test_fold_canonicalize_snan3_value_f16(half addrspace(1)* %out) #1 {
%canonicalized = call half @llvm.canonicalize.f16(half 0xHFC01)		%canonicalized = call half @llvm.canonicalize.f16(half 0xHFC01)
store half %canonicalized, half addrspace(1)* %out		store half %canonicalized, half addrspace(1)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}v_test_canonicalize_var_v2f16:		; GCN-LABEL: {{^}}v_test_canonicalize_var_v2f16:
; VI-DAG: v_max_f16_sdwa [[REG0:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD		; VI-DAG: v_max_f16_sdwa [[REG0:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; VI-DAG: v_max_f16_e32 [[REG1:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+}}		; VI-DAG: v_max_f16_e32 [[REG1:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+}}
; VI-NOT: v_and_b32		; VI-NOT: v_and_b32

; GFX9: v_pk_max_f16 [[REG:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+$}}		; GFX9: v_pk_max_f16 [[REG:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+$}}
; GFX9: buffer_store_dword [[REG]]		; GFX9: buffer_store_dword [[REG]]
define amdgpu_kernel void @v_test_canonicalize_var_v2f16(<2 x half> addrspace(1)* %out) #1 {		define amdgpu_kernel void @v_test_canonicalize_var_v2f16(<2 x half> addrspace(1)* %out) #1 {
%tid = call i32 @llvm.amdgcn.workitem.id.x()		%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %out, i32 %tid		%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %out, i32 %tid
Show All 22 Lines	define amdgpu_kernel void @v_test_canonicalize_fabs_var_v2f16(<2 x half> addrspace(1)* %out) #1 {
%val.fabs = call <2 x half> @llvm.fabs.v2f16(<2 x half> %val)		%val.fabs = call <2 x half> @llvm.fabs.v2f16(<2 x half> %val)
%canonicalized = call <2 x half> @llvm.canonicalize.v2f16(<2 x half> %val.fabs)		%canonicalized = call <2 x half> @llvm.canonicalize.v2f16(<2 x half> %val.fabs)
store <2 x half> %canonicalized, <2 x half> addrspace(1)* %out		store <2 x half> %canonicalized, <2 x half> addrspace(1)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}v_test_canonicalize_fneg_fabs_var_v2f16:		; GCN-LABEL: {{^}}v_test_canonicalize_fneg_fabs_var_v2f16:
; VI-DAG: v_or_b32_e32 v{{[0-9]+}}, 0x80008000, v{{[0-9]+}}		; VI-DAG: v_or_b32_e32 v{{[0-9]+}}, 0x80008000, v{{[0-9]+}}
; VI-DAG: v_max_f16_sdwa [[REG0:v[0-9]+]], v{{[0-9]+}}, v{{[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD		; VI-DAG: v_max_f16_sdwa [[REG0:v[0-9]+]], v{{[0-9]+}}, v{{[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; VI-DAG: v_max_f16_e32 [[REG1:v[0-9]+]], v{{[0-9]+}}, v{{[0-9]+}}		; VI-DAG: v_max_f16_e32 [[REG1:v[0-9]+]], v{{[0-9]+}}, v{{[0-9]+}}
; VI: v_or_b32		; VI: v_or_b32

; GFX9: v_and_b32_e32 [[ABS:v[0-9]+]], 0x7fff7fff, v{{[0-9]+}}		; GFX9: v_and_b32_e32 [[ABS:v[0-9]+]], 0x7fff7fff, v{{[0-9]+}}
; GFX9: v_pk_max_f16 [[REG:v[0-9]+]], [[ABS]], [[ABS]] neg_lo:[1,1] neg_hi:[1,1]{{$}}		; GFX9: v_pk_max_f16 [[REG:v[0-9]+]], [[ABS]], [[ABS]] neg_lo:[1,1] neg_hi:[1,1]{{$}}
; GCN: buffer_store_dword		; GCN: buffer_store_dword
define amdgpu_kernel void @v_test_canonicalize_fneg_fabs_var_v2f16(<2 x half> addrspace(1)* %out) #1 {		define amdgpu_kernel void @v_test_canonicalize_fneg_fabs_var_v2f16(<2 x half> addrspace(1)* %out) #1 {
%tid = call i32 @llvm.amdgcn.workitem.id.x()		%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %out, i32 %tid		%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %out, i32 %tid
%val = load <2 x half>, <2 x half> addrspace(1)* %gep		%val = load <2 x half>, <2 x half> addrspace(1)* %gep
%val.fabs = call <2 x half> @llvm.fabs.v2f16(<2 x half> %val)		%val.fabs = call <2 x half> @llvm.fabs.v2f16(<2 x half> %val)
%val.fabs.fneg = fsub <2 x half> <half -0.0, half -0.0>, %val.fabs		%val.fabs.fneg = fsub <2 x half> <half -0.0, half -0.0>, %val.fabs
%canonicalized = call <2 x half> @llvm.canonicalize.v2f16(<2 x half> %val.fabs.fneg)		%canonicalized = call <2 x half> @llvm.canonicalize.v2f16(<2 x half> %val.fabs.fneg)
store <2 x half> %canonicalized, <2 x half> addrspace(1)* %out		store <2 x half> %canonicalized, <2 x half> addrspace(1)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}v_test_canonicalize_fneg_var_v2f16:		; GCN-LABEL: {{^}}v_test_canonicalize_fneg_var_v2f16:
; VI: v_xor_b32_e32 [[FNEG:v[0-9]+]], 0x80008000, v{{[0-9]+}}		; VI: v_xor_b32_e32 [[FNEG:v[0-9]+]], 0x80008000, v{{[0-9]+}}
; VI: v_lshrrev_b32_e32 [[FNEGHI:v[0-9]+]], 16, [[FNEG]]		; VI-DAG: v_max_f16_sdwa [[REG1:v[0-9]+]], [[FNEG]], [[FNEG]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
; VI-DAG: v_max_f16_sdwa [[REG1:v[0-9]+]], [[FNEG]], [[FNEGHI]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
; VI-DAG: v_max_f16_e32 [[REG0:v[0-9]+]], [[FNEG]], [[FNEG]]		; VI-DAG: v_max_f16_e32 [[REG0:v[0-9]+]], [[FNEG]], [[FNEG]]
; VI-NOT: 0xffff		; VI-NOT: 0xffff

; GFX9: v_pk_max_f16 [[REG:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+}} neg_lo:[1,1] neg_hi:[1,1]{{$}}		; GFX9: v_pk_max_f16 [[REG:v[0-9]+]], {{v[0-9]+}}, {{v[0-9]+}} neg_lo:[1,1] neg_hi:[1,1]{{$}}
; GFX9: buffer_store_dword [[REG]]		; GFX9: buffer_store_dword [[REG]]
define amdgpu_kernel void @v_test_canonicalize_fneg_var_v2f16(<2 x half> addrspace(1)* %out) #1 {		define amdgpu_kernel void @v_test_canonicalize_fneg_var_v2f16(<2 x half> addrspace(1)* %out) #1 {
%tid = call i32 @llvm.amdgcn.workitem.id.x()		%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %out, i32 %tid		%gep = getelementptr <2 x half>, <2 x half> addrspace(1)* %out, i32 %tid
▲ Show 20 Lines • Show All 169 Lines • Show Last 20 Lines

test/CodeGen/AMDGPU/fneg.f16.ll

	Show First 20 Lines • Show All 110 Lines • ▼ Show 20 Lines

	; CI: v_cvt_f32_f16_e64 v{{[0-9]+}}, -v{{[0-9]+}}			; CI: v_cvt_f32_f16_e64 v{{[0-9]+}}, -v{{[0-9]+}}
	; CI: v_cvt_f32_f16_e64 v{{[0-9]+}}, -v{{[0-9]+}}			; CI: v_cvt_f32_f16_e64 v{{[0-9]+}}, -v{{[0-9]+}}
	; CI: v_mul_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}			; CI: v_mul_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
	; CI: v_cvt_f16_f32			; CI: v_cvt_f16_f32
	; CI: v_mul_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}			; CI: v_mul_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
	; CI: v_cvt_f16_f32			; CI: v_cvt_f16_f32

	; VI: v_lshrrev_b32_e32 v{{[0-9]+}}, 16,			; VI: v_mul_f16_sdwa v{{[0-9]+}}, -v{{[0-9]+}}, v{{[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
	; VI: v_mul_f16_sdwa v{{[0-9]+}}, -v{{[0-9]+}}, v{{[0-9]+}} dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
	; VI: v_mul_f16_e64 v{{[0-9]+}}, -v{{[0-9]+}}, v{{[0-9]+}}			; VI: v_mul_f16_e64 v{{[0-9]+}}, -v{{[0-9]+}}, v{{[0-9]+}}

	; GFX9: v_pk_mul_f16 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} neg_lo:[1,0] neg_hi:[1,0]{{$}}			; GFX9: v_pk_mul_f16 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} neg_lo:[1,0] neg_hi:[1,0]{{$}}
	define amdgpu_kernel void @v_fneg_fold_v2f16(<2 x half> addrspace(1)* %out, <2 x half> addrspace(1)* %in) #0 {			define amdgpu_kernel void @v_fneg_fold_v2f16(<2 x half> addrspace(1)* %out, <2 x half> addrspace(1)* %in) #0 {
	%val = load <2 x half>, <2 x half> addrspace(1)* %in			%val = load <2 x half>, <2 x half> addrspace(1)* %in
	%fsub = fsub <2 x half> <half -0.0, half -0.0>, %val			%fsub = fsub <2 x half> <half -0.0, half -0.0>, %val
	%fmul = fmul <2 x half> %fsub, %val			%fmul = fmul <2 x half> %fsub, %val
	store <2 x half> %fmul, <2 x half> addrspace(1)* %out			store <2 x half> %fmul, <2 x half> addrspace(1)* %out
	▲ Show 20 Lines • Show All 43 Lines • Show Last 20 Lines

test/CodeGen/AMDGPU/sdwa-merge-preserve.mir

This file was added.

				# RUN: llc -march=amdgcn -mcpu=fiji -run-pass=si-merge-sdwa-preserve -verify-machineinstrs -o - %s | FileCheck -check-prefix=SDWA %s
				# RUN: llc -march=amdgcn -mcpu=gfx900 -run-pass=si-merge-sdwa-preserve -verify-machineinstrs -o - %s | FileCheck -check-prefix=SDWA %s
				rampitec: Add -verify-machineinstrs to run lines.

				# SDWA-LABEL: {{^}}name: add_f16

				# SDWA: [[FIRST:%[0-9]+]] = FLAT_LOAD_DWORD %{{[0-9]+}}, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4)
				# SDWA: [[SECOND:%[0-9]+]] = FLAT_LOAD_DWORD %{{[0-9]+}}, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4)

				# SDWA: [[RES:%[0-9]+]] = V_ADD_F16_e64 0, [[FIRST]], 0, [[SECOND]], 0, 0, implicit %exec
				# SDWA: [[RES:%[0-9]+]] = V_ADD_F16_sdwa 0, [[FIRST]], 0, [[SECOND]], 0, 0, 5, 2, 5, 5, implicit %exec, implicit [[RES]]
				# SDWA: [[COPY:%[0-9]+]] = COPY [[RES]]

				# SDWA: FLAT_STORE_DWORD %{{[0-9]+}}, [[COPY:%[0-9]+]], 0, 0, 0, implicit %exec, implicit %flat_scr :: (store 4)

				---
				name: add_f16
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vreg_64 }
				- { id: 1, class: vreg_64 }
				- { id: 2, class: sreg_64 }
				- { id: 3, class: vgpr_32 }
				- { id: 4, class: vgpr_32 }
				- { id: 5, class: sreg_32_xm0 }
				- { id: 6, class: sreg_32 }
				- { id: 7, class: sreg_32_xm0 }
				- { id: 8, class: sreg_32 }
				- { id: 9, class: vgpr_32 }
				- { id: 10, class: vgpr_32 }
				- { id: 11, class: vgpr_32 }
				- { id: 12, class: sreg_32_xm0 }
				body: |
				bb.0:
				liveins: %vgpr0_vgpr1, %vgpr2_vgpr3, %sgpr30_sgpr31

				%2 = COPY %sgpr30_sgpr31
				%1 = COPY %vgpr2_vgpr3
				%0 = COPY %vgpr0_vgpr1
				; This is needed to force MIR to treat this as non-SSA
				%0 = COPY %vgpr0_vgpr1
				%3 = FLAT_LOAD_DWORD %0, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4)
				%4 = FLAT_LOAD_DWORD %1, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4)

				%10 = V_ADD_F16_e64 0, %3, 0, %4, 0, 0, implicit %exec
				%11 = V_ADD_F16_sdwa 0, %3, 0, %4, 0, 0, 5, 2, 5, 5, implicit %exec, implicit %10

				FLAT_STORE_DWORD %0, %11, 0, 0, 0, implicit %exec, implicit %flat_scr :: (store 4)
				%sgpr30_sgpr31 = COPY %2
				S_SETPC_B64_return %sgpr30_sgpr31

				...

				# SDWA-LABEL: {{^}}name: add_f16_no_merge

				# SDWA: [[FIRST:%[0-9]+]] = FLAT_LOAD_DWORD %{{[0-9]+}}, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4)
				# SDWA: [[SECOND:%[0-9]+]] = FLAT_LOAD_DWORD %{{[0-9]+}}, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4)

				# SDWA: [[LOW:%[0-9]+]] = V_ADD_F16_e64 0, [[FIRST]], 0, [[SECOND]], 0, 0, implicit %exec
				# SDWA: [[HIGH:%[0-9]+]] = V_ADD_F16_sdwa 0, [[FIRST]], 0, [[SECOND]], 0, 0, 5, 2, 5, 5, implicit %exec

				# SDWA: FLAT_STORE_DWORD %{{[0-9]+}}, [[HIGH:%[0-9]+]], 0, 0, 0, implicit %exec, implicit %flat_scr :: (store 4)

				---
				name: add_f16_no_merge
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vreg_64 }
				- { id: 1, class: vreg_64 }
				- { id: 2, class: sreg_64 }
				- { id: 3, class: vgpr_32 }
				- { id: 4, class: vgpr_32 }
				- { id: 5, class: sreg_32_xm0 }
				- { id: 6, class: sreg_32 }
				- { id: 7, class: sreg_32_xm0 }
				- { id: 8, class: sreg_32 }
				- { id: 9, class: vgpr_32 }
				- { id: 10, class: vgpr_32 }
				- { id: 11, class: vgpr_32 }
				- { id: 12, class: sreg_32_xm0 }
				body: |
				bb.0:
				liveins: %vgpr0_vgpr1, %vgpr2_vgpr3, %sgpr30_sgpr31

				%2 = COPY %sgpr30_sgpr31
				%1 = COPY %vgpr2_vgpr3
				%0 = COPY %vgpr0_vgpr1
				; This is needed to force MIR to treat this as non-SSA
				%0 = COPY %vgpr0_vgpr1
				%3 = FLAT_LOAD_DWORD %0, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4)
				%4 = FLAT_LOAD_DWORD %1, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4)

				%10 = V_ADD_F16_e64 0, %3, 0, %4, 0, 0, implicit %exec
				%11 = V_ADD_F16_sdwa 0, %3, 0, %4, 0, 0, 5, 2, 5, 5, implicit %exec

				FLAT_STORE_DWORD %0, %11, 0, 0, 0, implicit %exec, implicit %flat_scr :: (store 4)
				%sgpr30_sgpr31 = COPY %2
				S_SETPC_B64_return %sgpr30_sgpr31

test/CodeGen/AMDGPU/sdwa-peephole-instr.mir

Show First 20 Lines • Show All 142 Lines • ▼ Show 20 Lines	bb.0:
%sgpr30_sgpr31 = COPY %2		%sgpr30_sgpr31 = COPY %2
S_SETPC_B64_return %sgpr30_sgpr31		S_SETPC_B64_return %sgpr30_sgpr31

...		...
---		---
# GCN-LABEL: {{^}}name: vop2_instructions		# GCN-LABEL: {{^}}name: vop2_instructions


# VI: %{{[0-9]+}} = V_AND_B32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 6, 0, 6, 5, implicit %exec		# VI: %{{[0-9]+}} = V_AND_B32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 5, 0, 6, 5, implicit %exec
# VI: %{{[0-9]+}} = V_ADD_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 5, 0, 5, 1, implicit %exec		# VI: %{{[0-9]+}} = V_ADD_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 5, 0, 5, 1, implicit %exec
# VI: %{{[0-9]+}} = V_SUB_F16_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 6, 0, 5, 1, implicit %exec		# VI: %{{[0-9]+}} = V_SUB_F16_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 6, 0, 5, 1, implicit %exec
# VI: %{{[0-9]+}} = V_MAC_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, %{{[0-9]+}}, 0, 0, 6, 0, 6, 1, implicit %exec		# VI: %{{[0-9]+}} = V_MAC_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, %{{[0-9]+}}, 0, 0, 6, 0, 6, 1, implicit %exec
# VI: %{{[0-9]+}} = V_MAC_F16_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, %{{[0-9]+}}, 0, 0, 6, 0, 5, 1, implicit %exec		# VI: %{{[0-9]+}} = V_MAC_F16_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, %{{[0-9]+}}, 0, 0, 6, 0, 5, 1, implicit %exec

# GFX9: %{{[0-9]+}} = V_AND_B32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 6, 0, 6, 5, implicit %exec		# GFX9: %{{[0-9]+}} = V_AND_B32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 5, 0, 6, 5, implicit %exec
# GFX9: %{{[0-9]+}} = V_ADD_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 5, 0, 5, 1, implicit %exec		# GFX9: %{{[0-9]+}} = V_ADD_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 5, 0, 5, 1, implicit %exec
# GFX9: %{{[0-9]+}} = V_SUB_F16_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 6, 0, 5, 1, implicit %exec		# GFX9: %{{[0-9]+}} = V_SUB_F16_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 6, 0, 5, 1, implicit %exec
# GFX9: %{{[0-9]+}} = V_MAC_F32_e32 %{{[0-9]+}}, %{{[0-9]+}}, %{{[0-9]+}}, implicit %exec		# GFX9: %{{[0-9]+}} = V_MAC_F32_e32 %{{[0-9]+}}, %{{[0-9]+}}, %{{[0-9]+}}, implicit %exec
# GFX9: %{{[0-9]+}} = V_MAC_F16_e32 %{{[0-9]+}}, %{{[0-9]+}}, %{{[0-9]+}}, implicit %exec		# GFX9: %{{[0-9]+}} = V_MAC_F16_e32 %{{[0-9]+}}, %{{[0-9]+}}, %{{[0-9]+}}, implicit %exec


# VI: %{{[0-9]+}} = V_AND_B32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 5, 0, 6, 5, implicit %exec		# VI: %{{[0-9]+}} = V_AND_B32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 5, 0, 6, 5, implicit %exec
# VI: %{{[0-9]+}} = V_ADD_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 5, 0, 5, 1, implicit %exec		# VI: %{{[0-9]+}} = V_ADD_F32_sdwa 0, %{{[0-9]+}}, 0, %{{[0-9]+}}, 0, 0, 5, 0, 5, 1, implicit %exec
▲ Show 20 Lines • Show All 282 Lines • Show Last 20 Lines

test/CodeGen/AMDGPU/sdwa-preserve.mir

This file was added.

				# RUN: llc -march=amdgcn -mcpu=fiji -start-before=si-peephole-sdwa -verify-machineinstrs -o - %s | FileCheck -check-prefix=SDWA %s
				# RUN: llc -march=amdgcn -mcpu=gfx900 -start-before=si-peephole-sdwa -verify-machineinstrs -o - %s | FileCheck -check-prefix=SDWA %s
				rampitec: Add -verify-machineinstrs

				# SDWA-LABEL: {{^}}add_f16_u32_preserve

				# SDWA: flat_load_dword [[FIRST:v[0-9]+]], v[{{[0-9]+}}:{{[0-9]+}}]
				# SDWA: flat_load_dword [[SECOND:v[0-9]+]], v[{{[0-9]+}}:{{[0-9]+}}]

				# SDWA: v_add_u32_sdwa [[RES:v[0-9]+]], [[FIRST]], [[SECOND]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:BYTE_3
				# SDWA: v_add_f16_sdwa [[RES:v[0-9]+]], [[FIRST]], [[SECOND]] dst_sel:BYTE_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_0 src1_sel:WORD_1

				# SDWA: flat_store_dword v[{{[0-9]+}}:{{[0-9]+}}], [[RES]]

				---
				name: add_f16_u32_preserve
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vreg_64 }
				- { id: 1, class: vreg_64 }
				- { id: 2, class: sreg_64 }
				- { id: 3, class: vgpr_32 }
				- { id: 4, class: vgpr_32 }
				- { id: 5, class: vgpr_32 }
				- { id: 6, class: vgpr_32 }
				- { id: 7, class: vgpr_32 }
				- { id: 8, class: vgpr_32 }
				- { id: 9, class: vgpr_32 }
				- { id: 10, class: vgpr_32 }
				- { id: 11, class: vgpr_32 }
				- { id: 12, class: vgpr_32 }
				- { id: 13, class: vgpr_32 }
				body: |
				bb.0:
				liveins: %vgpr0_vgpr1, %vgpr2_vgpr3, %sgpr30_sgpr31

				%2 = COPY %sgpr30_sgpr31
				%1 = COPY %vgpr2_vgpr3
				%0 = COPY %vgpr0_vgpr1
				%3 = FLAT_LOAD_DWORD %0, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4)
				%4 = FLAT_LOAD_DWORD %1, 0, 0, 0, implicit %exec, implicit %flat_scr :: (load 4)

				%5 = V_AND_B32_e32 65535, %3, implicit %exec
				%6 = V_LSHRREV_B32_e64 16, %4, implicit %exec
				%7 = V_BFE_U32 %3, 8, 8, implicit %exec
				%8 = V_LSHRREV_B32_e32 24, %4, implicit %exec

				%9 = V_ADD_F16_e64 0, %5, 0, %6, 0, 0, implicit %exec
				%10 = V_LSHLREV_B16_e64 8, %9, implicit %exec
				%11 = V_ADD_U32_e64 %7, %8, implicit %exec
				%12 = V_LSHLREV_B32_e64 16, %11, implicit %exec

				%13 = V_OR_B32_e64 %10, %12, implicit %exec

				FLAT_STORE_DWORD %0, %13, 0, 0, 0, implicit %exec, implicit %flat_scr :: (store 4)
				%sgpr30_sgpr31 = COPY %2
				S_SETPC_B64_return %sgpr30_sgpr31
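
The add_f16_u32_preserve test above relies on one identity: OR-ing two UNUSED_PAD results whose destination fields do not overlap gives the same 32-bit value as writing the first field with UNUSED_PAD and then writing the second field into the same register with UNUSED_PRESERVE. A minimal sketch of that identity on plain integers follows; Field, writePad and writePreserve are hypothetical names, and the code models only the destination write, not the arithmetic of the instructions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a destination field inside a 32-bit register.
struct Field {
  uint32_t Mask;  // Bits of the register covered by the field.
  unsigned Shift; // Position of the field's lowest bit.
};

const Field Word1 = {0xFFFF0000u, 16}; // Models dst_sel:WORD_1
const Field Byte1 = {0x0000FF00u, 8};  // Models dst_sel:BYTE_1

// UNUSED_PAD: bits outside the field are set to 0.
inline uint32_t writePad(uint32_t Value, Field F) {
  return (Value << F.Shift) & F.Mask;
}

// UNUSED_PRESERVE: bits outside the field keep the register's old contents.
inline uint32_t writePreserve(uint32_t OldDst, uint32_t Value, Field F) {
  return (OldDst & ~F.Mask) | ((Value << F.Shift) & F.Mask);
}
```

With these definitions, writePad(0x1234, Word1) | writePad(0xAB, Byte1) and writePreserve(writePad(0x1234, Word1), 0xAB, Byte1) both evaluate to 0x1234AB00. The identity only holds when the two fields are disjoint, which is why the peephole checks dst_sel compatibility before matching the V_OR_B32 pattern.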