This is an archive of the discontinued LLVM Phabricator instance.

lib/Target/AMDGPU/GCNDPPCombine.cpp
257	I think you have AND and OR mixed up... the identity for AND should be -1, and OR should be 0, right?
261	If the old operand is the identity in the original mov, then in the transformed instruction the old operand should be the un-swizzled source, not the identity again. I think this is the reason the add tests still fail on radv, since it generates something like:
	v_mov_b32_e32 v5, 0
	s_nop 1
	v_add_u32_dpp v5, vcc, v4, v4 row_shr:4 row_mask:0xf bank_mask:0xe
from
	%88 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %87, i32 276, i32 15, i32 14, i1 false) #2
	%89 = add i32 %87, %88
when what we really want is
	v_add_u32_dpp v4, vcc, v4, v4 row_shr:4 row_mask:0xf bank_mask:0xe

Connor, indeed, my bad, I'll try to fix this in a couple of days.

Thank you,
Valery

Fixed issue with identity values and other cases, f32/f16 identity values to be added later.

fma/mac instructions are disabled for now.

Test is fully reworked, added comments.

Connor, can you please take a look at the test in this update (first half)? I described the different cases to combine there. Could you also run the patch on the radv tests (the pass has to be switched on manually)?

The pass is still disabled by default.

If there are no strong objections I would like to submit my latest patch here, since the pass is disabled and the patch looks better anyway. Otherwise I'll return from NY holidays only on Jan 11.

This revision was not accepted when it landed; it landed in state Needs Review. Jan 9 2019, 5:49 AM

Closed by commit rL350721: [AMDGPU] Fix DPP combiner (authored by vpykhtin).

This revision was automatically updated to reflect the committed changes.

Hi Connor,

I submitted my latest patch, the pass is still disabled by default. Can you please run the RADV tests now (with the pass manually enabled)?

The patch was reverted as not reviewed. Uploaded rebased diff.

reopening revision.

Sorry, I just got back from break this week. I've run CTS with the pass enabled, and it now passes, although it seems most of the patterns we use don't get folded. Firstly AND, XOR, unsigned max, and unsigned min are most troubling, since the code that gets generated looks like it should be optimized:

	v_mov_b32_e32 v8, -1                                          ; 7E1002C1
	s_nop 1                                                       ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
	v_and_b32_e32 v2, v2, v8                                      ; 26041102

as well as

	v_mov_b32_e32 v8, 0                                                         ; 7E100280
	s_nop 1                                                                     ; BF800001
	v_mov_b32_dpp v8, v2  row_shr:8 row_mask:0xf bank_mask:0xc                  ; 7E1002FA FC011802
	v_max_u32_e32 v2, v2, v8                                                    ; 1E041102

and

	v_mov_b32_e32 v8, -1                                          ; 7E1002C1
	s_nop 1                                                       ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
	v_min_u32_e32 v2, v2, v8                                      ; 1C041102

and finally

	v_mov_b32_e32 v8, 0                                                         ; 7E100280
	s_nop 1                                                                     ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf               ; 7E1002FA AF014202
	v_xor_b32_e32 v2, v2, v8                                                    ; 2A041102

Anyways, the other cases look like maybe some other clever optimization for the immediate is hindering this one, for example this with signed minimum:

	v_bfrev_b32_e32 v8, -2                                        ; 7E1058C2
	s_nop 1                                                       ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
	v_min_i32_e32 v2, v2, v8                                      ; 18041102

Maybe this pass needs to be moved earlier in the pipeline?

I'll take a more detailed look soon. But I'd also like to say that I'm a little worried that there are currently no integration tests that use DPP besides the Vulkan CTS, so there is currently no way for anyone working on this from the ROCm/compute side to test this pass properly, which means that none of you can work on this with any confidence. I think this exchange has clearly shown that unit tests aren't enough, since they only show that the code does what you think it does, not whether what you think it does is correct :) Maybe we should fix the atomic optimizer first so that it does something similar to what radv and AMDVLK do, to make it easier to test? Or maybe you just need to go write some tests.

Thank you, Connor.

I'll go through the cases you described; there may be issues with instruction commutation (the dpp src reg should be src0) or other problems (I already saw e64 instructions that cannot be converted to e32). I would appreciate it if you could attach a .ll file for your cases.

Speaking about testing, the test I added isn't just a unit test but it contains situations (each commented) the pass is supposed to combine. I think we should review the test and agree if it does the right thing first.

@cwabbott I planned to do a followup once this DPP change had landed to add the missing dpp/codegen patterns to the atomic optimizer - so watch this space!

I do agree though that we should probably add the DPP patterns that we want to be 100% certain get combined into a test case, I'll work with @vpykhtin to add this.

Anyways, the other cases look like maybe some other clever optimization for the immediate is hindering this one, for example this with signed minimum:
	v_bfrev_b32_e32 v8, -2                                        ; 7E1058C2
	s_nop 1                                                       ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
	v_min_i32_e32 v2, v2, v8                                      ; 18041102
Maybe this pass needs to be moved earlier in the pipeline?

I'm not sure I can insert the pass that high, I'll think of how it can be skipped.

In D55444#1352866, @vpykhtin wrote:
I'm not sure I can insert the pass that high, I'll think of how it can be skipped.

This kind of immediate init optimization is performed by the SIShrinkInstructions pass, which runs after the DPP combiner pass, so the issue here is different. It would be nice to have the .ll file.

I figured it would be a little easier if I looked at these cases by myself. It turns out there are more problems with isIdentityValue, including some correctness issues. After fixing these, everything works correctly now.

lib/Target/AMDGPU/GCNDPPCombine.cpp
260	You have min and max mixed up... the identity for min is the maximum possible value, not the minimum. The same goes for signed and unsigned min/max. Also, you can add V_XOR with an identity value of 0 here.
261–263	It turns out that LLVM sign-extends the immediate to 64 bits sometime during instruction selection, which means this won't work for any MIR generated by LLVM IR. We should be ignoring the high 32 bits when doing all these comparisons, since they're irrelevant.
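
As a rough illustration of the two points above (the min/max identities and ignoring the high 32 bits), here is a minimal standalone sketch. It does not use LLVM types: the `Op` enum, `isIdentity32` and `checkIdentityExamples` are hypothetical names for this example only, and truncating every opcode's immediate to 32 bits is an assumption about how the comparison could be written, not the committed code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-ins for the VALU opcodes discussed in the review.
enum class Op { Add, Or, Xor, MaxU32, And, MinU32, MinI32, MaxI32, Mul24 };

// Returns true if imm, truncated to 32 bits, is the identity value for op.
bool isIdentity32(Op op, int64_t imm) {
  const uint32_t bits = static_cast<uint32_t>(imm); // drop the high 32 bits
  switch (op) {
  case Op::Add:
  case Op::Or:
  case Op::Xor:
  case Op::MaxU32:
    return bits == 0;
  case Op::And:
  case Op::MinU32:
    return bits == std::numeric_limits<uint32_t>::max();
  case Op::MinI32: // identity for min is the largest value
    return static_cast<int32_t>(bits) == std::numeric_limits<int32_t>::max();
  case Op::MaxI32: // identity for max is the smallest value
    return static_cast<int32_t>(bits) == std::numeric_limits<int32_t>::min();
  case Op::Mul24:
    return bits == 1;
  }
  return false;
}

void checkIdentityExamples() {
  // A sign-extended -1 and a zero-extended 0xFFFFFFFF are treated alike.
  assert(isIdentity32(Op::And, -1));
  assert(isIdentity32(Op::And, 0xFFFFFFFFLL));
  assert(isIdentity32(Op::MinU32, -1));
  assert(isIdentity32(Op::MaxU32, 0));
  assert(isIdentity32(Op::MinI32, 0x7FFFFFFFLL));
  // INT32_MIN, both sign-extended and zero-extended to 64 bits.
  assert(isIdentity32(Op::MaxI32, -2147483648LL));
  assert(isIdentity32(Op::MaxI32, 0x80000000LL));
  // -1 is not the identity for OR.
  assert(!isIdentity32(Op::Or, -1));
}
```

Comparing the immediate against 64-bit constants directly would reject one of the two encodings of the same 32-bit value, which is the failure mode described in the comment on lines 261–263.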

Thanks! It's surprising how easy it is to get confused there. I updated the diff with the latest found problems fixed.

vpykhtin marked 2 inline comments as done. Jan 11 2019, 4:00 AM

I think we have reached a state where this can be submitted (and probably enabled with a subsequent patch). This would allow any of us to make other fixes if required.

cwabbott added inline comments. Jan 14 2019, 10:27 AM

lib/Target/AMDGPU/GCNDPPCombine.cpp
362	Indentation is off here, seems like your editor is using hard tabs
363	I don't think this is correct. A lane can be set to 0 for the shift DPP controls (e.g. `DPP_ROW_SL1`) or if EXEC is zero for the lane to be read, in which case the user will do something non-trivial (e.g. OR will pass through the other operand) which we can't emulate in general.
368	Also here
396	This seems incorrect to me. In the lanes where the source is invalid or the row/bank mask is 0, the DPP move will act as a no-op here, so if that lane is then added, or'd, etc. with something else, we can't emulate that with a single instruction.

Hi Valery, I really like the way the different cases are listed in the explanatory comment at the top of the file, and I believe those cases are correct. Would it be possible to restructure the code in a way that follows those cases? I think that would make it much easier to follow.

That is, in combineDPPMov, can you restructure the first do/while construct so that it mirrors the cases of the top of file comment and defines new variables CombinedOld and CombinedBoundCtrl? Come to think of it, it may then be wise to move these top of file comments into the function to increase the chances that they will be kept up-to-date in the future.
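
A restructuring along these lines could look roughly like the following standalone sketch, which follows the combining rules in the header comment of the final diff. All names here (`DppMov`, `Combined`, `OldKind`, `combineRules`, `checkCombineRules`) are hypothetical, plain integers stand in for MachineOperands, and reading the first rule as "masks fully enabled and (bound_ctrl is zero or old is 0)" is an assumption about the comment's intended precedence.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

enum class BoundCtrl { Off, Zero };
enum class OldKind { Undef, Src1 }; // what the combined instruction uses as old

// Hypothetical summary of the V_MOV_B32_dpp being combined.
struct DppMov {
  uint32_t rowMask;
  uint32_t bankMask;
  BoundCtrl boundCtrl;
  std::optional<uint32_t> oldImm; // nullopt: old is not a known immediate
};

struct Combined {
  OldKind old;
  BoundCtrl boundCtrl;
};

// Returns the combined old/bound_ctrl, or nullopt when combining is cancelled.
// oldIsIdentity says whether the old immediate is the identity for the VALU op.
std::optional<Combined> combineRules(const DppMov &mov, bool valuIsBinary,
                                     bool oldIsIdentity) {
  const bool allLanes = mov.rowMask == 0xF && mov.bankMask == 0xF;
  const bool oldIsZero = mov.oldImm.has_value() && *mov.oldImm == 0;

  // Rule 1: all lanes enabled and the invalid lanes read (or already hold) 0.
  if (allLanes && (mov.boundCtrl == BoundCtrl::Zero || oldIsZero))
    return Combined{OldKind::Undef, BoundCtrl::Zero};

  // Rule 2: binary op, bound_ctrl off, old is the op's identity immediate.
  if (valuIsBinary && mov.boundCtrl == BoundCtrl::Off &&
      mov.oldImm.has_value() && oldIsIdentity)
    return Combined{OldKind::Src1, BoundCtrl::Off};

  return std::nullopt; // otherwise cancel
}

void checkCombineRules() {
  // bound_ctrl:0 with full masks: old may be anything.
  auto a = combineRules(DppMov{0xF, 0xF, BoundCtrl::Zero, std::nullopt},
                        true, false);
  assert(a && a->old == OldKind::Undef && a->boundCtrl == BoundCtrl::Zero);

  // AND with old = 0xFFFFFFFF and a partial row mask, as in the radv output.
  auto b = combineRules(DppMov{0xA, 0xF, BoundCtrl::Off, 0xFFFFFFFFu},
                        true, true);
  assert(b && b->old == OldKind::Src1 && b->boundCtrl == BoundCtrl::Off);

  // bound_ctrl:0 with a partial mask and a non-zero old: cancel.
  auto c = combineRules(DppMov{0xF, 0xE, BoundCtrl::Zero, 5u}, true, false);
  assert(!c);
}
```

Keeping the decision in one function that returns either a combined pair or "cancel" makes it easy to check the code against the comment case by case.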

lib/Target/AMDGPU/GCNDPPCombine.cpp
352–353	Having both `OldOpndVGPR` and `OldOpndValue` seems redundant. You could pass `OldOpndValue` as a MachineOperand to `createDPPInst` and have it assert on `!CombinedOld || CombinedOld->Reg()`. That should help cut down the code size here somewhat and make the code easier to follow.
393–394	These should be SmallVectors.
448	Can we simplify the code here by just commuting the OrigMI unconditionally? That is, first check whether `&Use != src0 && &Use == src1`; in that case, try commuting the instruction. If that fails, break (but I don't think there'd be a need to rollback the commutation). If it succeeds, continue down the same path that is used for the `&Use == src0` case. You end up with only a single call to `createDPPInst` and less code duplication in general.

vpykhtin marked 4 inline comments as done. Jan 15 2019, 4:04 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
362	Yea, thanks. I reinstalled my editor recently and forgot to turn off tabs. It should be ok with the latest diff.
363	Could you please show an example of how it wouldn't work? Note that the EXEC mask remains the same (by the check above) and no combining is performed when the result of the DPP mov is used in any way other than being consumed by a DPP-capable VALU instruction. I think the result of the DPP mov should be the same as the DPP src operand used in the DPP-capable VALU instruction.
396	For example:
	mov_dpp v1, v1, ...	(v1 of the other lane is stored in v1 of this lane)
	add_u32 v0, v1, ...
This is the case when the DPP src register is stored in the VGPR with the same name in the issuing lane. This way v1 would contain the same value after an unsuccessful DPP mov (no-op) and can therefore be used in the combined VALU op.

Hi Nikolai,

thank you for the review, I'll think of how to restructure the code to match top comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
352–353	There was a reason I did it this way initially, though I need some time to recall it. If I remember correctly, the value is tracked through instructions like mov/copy and subreg manipulation, and having only the value isn't enough to obtain the reg to store in the DPP instruction.
393–394	Ok.
448	This is a good idea: I don't like leaving commuted instruction on rollback at some intuitive level but really don't have arguments against it at the moment :-)

cwabbott added inline comments. Jan 15 2019, 7:44 AM

lib/Target/AMDGPU/GCNDPPCombine.cpp
363	Yes, you're right. I was misremembering what bound_ctrl:0 does. Sorry!
396	I still don't see how this can work. For something like:
	mov_dpp v1, v1, ...
	add_u32 v0, v1, v2
lanes where the shared data is invalid based on the DPP ctrl or EXEC will return v1 (same lane) + v2, whereas this will transform it to something like
	add_u32_dpp v0, v1, v2, ...
which will give you v0 (undef). What's an example of a transform you're trying to accomplish?

vpykhtin marked an inline comment as done. Jan 15 2019, 8:00 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
396	Sorry, this should look like:
	mov v1, X
	mov v2, Y
	mov_dpp v1, v1, some_DPP_ctrl
	add_u32 v0, v1, v2
transformed to
	mov v1, X
	mov v2, Y
	add_u32_dpp v0, v1, v2, some_DPP_ctrl
v1 should contain X on an invalid DPP access, or X from the other lane on a valid one.

vpykhtin marked an inline comment as done. Jan 15 2019, 8:35 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
396	I forgot to mention that some_DPP_ctrl should have all masks fully enabled; otherwise the add wouldn't write its result and v0 is indeed undef.

cwabbott added inline comments. Jan 16 2019, 1:13 AM

lib/Target/AMDGPU/GCNDPPCombine.cpp
396	This still seems wrong. For the first sequence you gave, the add_u32 will return X (same lane) + Y on invalid DPP access while add_u32_dpp will return undef (assuming v0 isn't initialized, since you set old to undef in this patch).

vpykhtin marked an inline comment as done. Jan 16 2019, 6:09 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
396	It looks like the documentation is a bit unclear here. I thought when bound ctrl is off it means hardware just uses the value of issuing lane instead of DPP read but if it disables writing the result then you're right - v0 would be undef. I'll find out and fix this.
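
To make the disagreement above concrete, here is a toy lane-by-lane model. It assumes the behaviour Connor describes: with bound_ctrl off, a lane whose DPP source is invalid does not write its result and so keeps `old`, while with bound_ctrl:0 it reads 0 instead. The 8-lane wave, the `dppOp`, `movThenAdd`, `addDpp` and `checkLaneModel` helpers, and the use of 0xDEADBEEF for undef are all hypothetical simplifications (all masks enabled, EXEC all ones, a single row_shr control).

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>

constexpr int kLanes = 8;               // toy wave; real DPP rows have 16 lanes
constexpr uint32_t kUndef = 0xDEADBEEF; // stand-in for an undefined register
using Vec = std::array<uint32_t, kLanes>;
using BinOp = std::function<uint32_t(uint32_t, uint32_t)>;

// Toy model of `dst = op_dpp old, src0, src1 row_shr:shift` (assumed semantics).
Vec dppOp(const Vec &old, const Vec &src0, const Vec &src1, const BinOp &op,
          int shift, bool boundCtrlZero) {
  Vec dst{};
  for (int lane = 0; lane < kLanes; ++lane) {
    const int from = lane - shift; // lane whose src0 is read
    if (from >= 0)
      dst[lane] = op(src0[from], src1[lane]);
    else if (boundCtrlZero)
      dst[lane] = op(0, src1[lane]); // invalid source reads as 0
    else
      dst[lane] = old[lane]; // lane is not written, keeps old
  }
  return dst;
}

// Original sequence: mov_dpp into tmp (old = immediate), then a plain add.
Vec movThenAdd(const Vec &x, const Vec &y, uint32_t oldImm, int shift) {
  Vec old;
  old.fill(oldImm);
  const Vec tmp = dppOp(
      old, x, x, [](uint32_t a, uint32_t) { return a; }, shift, false);
  Vec res{};
  for (int lane = 0; lane < kLanes; ++lane)
    res[lane] = tmp[lane] + y[lane];
  return res;
}

// Combined sequence: a single add_dpp with the given old operand.
Vec addDpp(const Vec &old, const Vec &x, const Vec &y, int shift) {
  return dppOp(
      old, x, y, [](uint32_t a, uint32_t b) { return a + b; }, shift, false);
}

void checkLaneModel() {
  const Vec x{1, 2, 3, 4, 5, 6, 7, 8};
  const Vec y{10, 20, 30, 40, 50, 60, 70, 80};
  Vec undef;
  undef.fill(kUndef);
  // old = 0 is the identity for add, so using src1 as old gives the same result.
  assert(movThenAdd(x, y, 0, 2) == addDpp(y, x, y, 2));
  // With old = undef, the lanes with an invalid source differ.
  assert(movThenAdd(x, y, 0, 2) != addDpp(undef, x, y, 2));
}
```

Under these assumptions the combined instruction only matches the original sequence when its `old` operand is the other source, which is the point made in the comment on line 396.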

vpykhtin marked an inline comment as done. Jan 18 2019, 5:02 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
448	Actually, a pass returns a boolean indicating whether it changed something in the IR. What should it return when it does a rollback? It may create new commuted instructions, which need to be mapped in SlotIndexes, for example. I don't think it's a good idea to return true only when a rollback took place.

arsenm added inline comments. Jan 18 2019, 12:08 PM

lib/Target/AMDGPU/GCNDPPCombine.cpp
448	Can rollback be avoided? I think the usual purpose of knowing something changed is to know if iterators are still valid, which probably isn't the case if anything was modified at any point

arsenm added inline comments. Jan 18 2019, 12:38 PM

lib/Target/AMDGPU/GCNDPPCombine.cpp
448	I suppose if it's just commuting, it's ok to report no change

vpykhtin marked an inline comment as done. Jan 21 2019, 6:34 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
448	The commutation is the whole problem. I cannot know beforehand whether the instruction can be commuted. Also, an instruction copy has to be created for the commutation, which means that the previous analysis should be invalidated and the IR change should be reported.

vpykhtin marked an inline comment as done. Jan 21 2019, 6:42 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
448	I think there is no way to avoid new instructions on commutation because of reversed instructions

Fixed issue with old = dpp src register when bound ctrl is off.

Slightly refactored and simplified. The description is corrected, though it doesn't map one-to-one onto the code, as there are other things to do (such as reusing IMPLICIT_DEF instructions).

Variable names changed to be more consistent with the description.

I decided to leave the current commutation as is because it doesn't come for free: it may create new instructions (at least reversed instructions), and for that reason the previous analysis (like SlotIndexes) should be invalidated.

vpykhtin added a reviewer: nhaehnle. Feb 4 2019, 8:40 AM

Herald added a project: Restricted Project. Feb 4 2019, 8:40 AM

LGTM

This revision is now accepted and ready to land. Feb 7 2019, 12:43 AM

rebased diff.

Thanks Nikolai!

Connor, can you please try this patch and, if it looks good, accept it?

I'm not going to read everything in detail, but the combining rules look correct to me and everything passes with this pass enabled. Feel free to re-enable it.

Closed by commit rL353513: [AMDGPU] Fix DPP combiner (authored by vpykhtin). Feb 8 2019, 3:59 AM

This revision was automatically updated to reflect the committed changes.

Thank you Connor! I really appreciate your effort on this DPP work.

I'll enable the pass with the subsequent submit.

The DPP combiner pass has been enabled since rL353691, https://reviews.llvm.org/rGded96df01e95

foad mentioned this in D124182: [AMDGPU] Combine DPP mov even if old reg def is in different BB. Apr 21 2022, 11:11 AM

Revision Contents

Path	Size
include/llvm/CodeGen/TargetInstrInfo.h	7 lines
lib/Target/AMDGPU/GCNDPPCombine.cpp	239 lines
lib/Target/AMDGPU/SIInstrInfo.h	6 lines
lib/Target/AMDGPU/SIInstrInfo.cpp	26 lines
test/CodeGen/AMDGPU/dpp_combine.ll
test/CodeGen/AMDGPU/dpp_combine.mir	472 lines
test/CodeGen/AMDGPU/dpp_combine_subregs.mir

Diff 183315

include/llvm/CodeGen/TargetInstrInfo.h

[422 lines not shown; context: public:]
   /// A pair composed of a register and a sub-register index.
   /// Used to give some type checking when modeling Reg:SubReg.
   struct RegSubRegPair {
     unsigned Reg;
     unsigned SubReg;

     RegSubRegPair(unsigned Reg = 0, unsigned SubReg = 0)
         : Reg(Reg), SubReg(SubReg) {}
+
+    bool operator==(const RegSubRegPair& P) const {
+      return Reg == P.Reg && SubReg == P.SubReg;
+    }
+    bool operator!=(const RegSubRegPair& P) const {
+      return !(*this == P);
+    }
   };

   /// A pair composed of a pair of a register and a sub-register index,
   /// and another sub-register index.
   /// Used to give some type checking when modeling Reg:SubReg1, SubReg2.
   struct RegSubRegPairAndIdx : RegSubRegPair {
     unsigned SubIdx;

[1,275 lines not shown]

lib/Target/AMDGPU/GCNDPPCombine.cpp

 //=======- GCNDPPCombine.cpp - optimization for DPP instructions ---==========//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 // The pass combines V_MOV_B32_dpp instruction with its VALU uses as a DPP src0
-// operand.If any of the use instruction cannot be combined with the mov the
+// operand. If any of the use instruction cannot be combined with the mov the
 // whole sequence is reverted.
 //
 // $old = ...
 // $dpp_value = V_MOV_B32_dpp $old, $vgpr_to_be_read_from_other_lane,
-// dpp_controls..., $bound_ctrl
-// $res = VALU $dpp_value, ...
+// dpp_controls..., $row_mask, $bank_mask, $bound_ctrl
+// $res = VALU $dpp_value [, src1]
 //
 // to
 //
-// $res = VALU_DPP $folded_old, $vgpr_to_be_read_from_other_lane, ...,
-// dpp_controls..., $folded_bound_ctrl
+// $res = VALU_DPP $combined_old, $vgpr_to_be_read_from_other_lane, [src1,]
+// dpp_controls..., $row_mask, $bank_mask, $combined_bound_ctrl
 //
 // Combining rules :
 //
-// $bound_ctrl is DPP_BOUND_ZERO, $old is any
-// $bound_ctrl is DPP_BOUND_OFF, $old is 0
+// if $row_mask and $bank_mask are fully enabled (0xF) and
+// $bound_ctrl==DPP_BOUND_ZERO or $old==0
+// -> $combined_old = undef,
+// $combined_bound_ctrl = DPP_BOUND_ZERO
 //
-// ->$folded_old = undef, $folded_bound_ctrl = DPP_BOUND_ZERO
-// $bound_ctrl is DPP_BOUND_OFF, $old is undef
+// if the VALU op is binary and
+// $bound_ctrl==DPP_BOUND_OFF and
+// $old==identity value (immediate) for the VALU op
+// -> $combined_old = src1,
+// $combined_bound_ctrl = DPP_BOUND_OFF
 //
-// ->$folded_old = undef, $folded_bound_ctrl = DPP_BOUND_OFF
-// $bound_ctrl is DPP_BOUND_OFF, $old is foldable
+// Othervise cancel.
 //
-// ->$folded_old = folded value, $folded_bound_ctrl = DPP_BOUND_OFF
+// The mov_dpp instruction should recide in the same BB as all it's uses
 //===----------------------------------------------------------------------===//

 #include "AMDGPU.h"
 #include "AMDGPUSubtarget.h"
 #include "SIInstrInfo.h"
 #include "MCTargetDesc/AMDGPUMCTargetDesc.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Statistic.h"
[19 lines not shown]
 class GCNDPPCombine : public MachineFunctionPass {
   MachineRegisterInfo *MRI;
   const SIInstrInfo *TII;

   using RegSubRegPair = TargetInstrInfo::RegSubRegPair;

   MachineOperand *getOldOpndValue(MachineOperand &OldOpnd) const;

-  RegSubRegPair foldOldOpnd(MachineInstr &OrigMI,
-                            RegSubRegPair OldOpndVGPR,
-                            MachineOperand &OldOpndValue) const;
-
   MachineInstr *createDPPInst(MachineInstr &OrigMI,
                               MachineInstr &MovMI,
-                              RegSubRegPair OldOpndVGPR,
+                              RegSubRegPair CombOldVGPR,
                               MachineOperand *OldOpnd,
-                              bool BoundCtrlZero) const;
+                              bool CombBCZ) const;

   MachineInstr *createDPPInst(MachineInstr &OrigMI,
                               MachineInstr &MovMI,
-                              RegSubRegPair OldOpndVGPR,
-                              bool BoundCtrlZero) const;
+                              RegSubRegPair CombOldVGPR,
+                              bool CombBCZ) const;

   bool hasNoImmOrEqual(MachineInstr &MI,
                        unsigned OpndName,
                        int64_t Value,
                        int64_t Mask = -1) const;

   bool combineDPPMov(MachineInstr &MI) const;

[56 lines not shown; context: case AMDGPU::V_MOV_B32_e32: {]
     break;
   }
   }
   return &OldOpnd;
 }

 MachineInstr *GCNDPPCombine::createDPPInst(MachineInstr &OrigMI,
                                            MachineInstr &MovMI,
-                                           RegSubRegPair OldOpndVGPR,
-                                           bool BoundCtrlZero) const {
+                                           RegSubRegPair CombOldVGPR,
+                                           bool CombBCZ) const {
   assert(MovMI.getOpcode() == AMDGPU::V_MOV_B32_dpp);
   assert(TII->getNamedOperand(MovMI, AMDGPU::OpName::vdst)->getReg() ==
          TII->getNamedOperand(OrigMI, AMDGPU::OpName::src0)->getReg());

   auto OrigOp = OrigMI.getOpcode();
   auto DPPOp = getDPPOp(OrigOp);
   if (DPPOp == -1) {
     LLVM_DEBUG(dbgs() << " failed: no DPP opcode\n");
     return nullptr;
   }

   auto DPPInst = BuildMI(*OrigMI.getParent(), OrigMI,
                          OrigMI.getDebugLoc(), TII->get(DPPOp));
   bool Fail = false;
   do {
     auto *Dst = TII->getNamedOperand(OrigMI, AMDGPU::OpName::vdst);
     assert(Dst);
     DPPInst.add(*Dst);
     int NumOperands = 1;

     const int OldIdx = AMDGPU::getNamedOperandIdx(DPPOp, AMDGPU::OpName::old);
     if (OldIdx != -1) {
       assert(OldIdx == NumOperands);
-      assert(isOfRegClass(OldOpndVGPR, AMDGPU::VGPR_32RegClass, *MRI));
-      DPPInst.addReg(OldOpndVGPR.Reg, 0, OldOpndVGPR.SubReg);
+      assert(isOfRegClass(CombOldVGPR, AMDGPU::VGPR_32RegClass, *MRI));
+      DPPInst.addReg(CombOldVGPR.Reg, 0, CombOldVGPR.SubReg);
       ++NumOperands;
+    } else {
+      // TODO: this discards MAC/FMA instructions for now, let's add it later
+      LLVM_DEBUG(dbgs() << " failed: no old operand in DPP instruction,"
+                           " TBD\n");
+      Fail = true;
+      break;
     }

     if (auto *Mod0 = TII->getNamedOperand(OrigMI,
                                           AMDGPU::OpName::src0_modifiers)) {
       assert(NumOperands == AMDGPU::getNamedOperandIdx(DPPOp,
                                           AMDGPU::OpName::src0_modifiers));
       assert(0LL == (Mod0->getImm() & ~(SISrcMods::ABS | SISrcMods::NEG)));
       DPPInst.addImm(Mod0->getImm());
       ++NumOperands;
     }
     auto *Src0 = TII->getNamedOperand(MovMI, AMDGPU::OpName::src0);
     assert(Src0);
     if (!TII->isOperandLegal(*DPPInst.getInstr(), NumOperands, Src0)) {
       LLVM_DEBUG(dbgs() << " failed: src0 is illegal\n");
       Fail = true;
       break;
     }
     DPPInst.add(*Src0);
+    DPPInst->getOperand(NumOperands).setIsKill(false);
     ++NumOperands;

     if (auto *Mod1 = TII->getNamedOperand(OrigMI,
                                           AMDGPU::OpName::src1_modifiers)) {
       assert(NumOperands == AMDGPU::getNamedOperandIdx(DPPOp,
                                           AMDGPU::OpName::src1_modifiers));
       assert(0LL == (Mod1->getImm() & ~(SISrcMods::ABS | SISrcMods::NEG)));
       DPPInst.addImm(Mod1->getImm());
[16 lines not shown; context: if (auto *Src2 = TII->getNamedOperand(OrigMI, AMDGPU::OpName::src2)) {]
         break;
       }
       DPPInst.add(*Src2);
     }

     DPPInst.add(*TII->getNamedOperand(MovMI, AMDGPU::OpName::dpp_ctrl));
     DPPInst.add(*TII->getNamedOperand(MovMI, AMDGPU::OpName::row_mask));
     DPPInst.add(*TII->getNamedOperand(MovMI, AMDGPU::OpName::bank_mask));
-    DPPInst.addImm(BoundCtrlZero ? 1 : 0);
+    DPPInst.addImm(CombBCZ ? 1 : 0);
   } while (false);

   if (Fail) {
     DPPInst.getInstr()->eraseFromParent();
     return nullptr;
   }
   LLVM_DEBUG(dbgs() << " combined: " << *DPPInst.getInstr());
   return DPPInst.getInstr();
 }

-GCNDPPCombine::RegSubRegPair
-GCNDPPCombine::foldOldOpnd(MachineInstr &OrigMI,
-                           RegSubRegPair OldOpndVGPR,
-                           MachineOperand &OldOpndValue) const {
-  assert(OldOpndValue.isImm());
-  switch (OrigMI.getOpcode()) {
+static bool isIdentityValue(unsigned OrigMIOp, MachineOperand *OldOpnd) {
+  assert(OldOpnd->isImm());
+  switch (OrigMIOp) {
   default: break;
+  case AMDGPU::V_ADD_U32_e32:
+  case AMDGPU::V_ADD_I32_e32:
+  case AMDGPU::V_OR_B32_e32:
+  case AMDGPU::V_SUBREV_U32_e32:
+  case AMDGPU::V_SUBREV_I32_e32:
   case AMDGPU::V_MAX_U32_e32:
-    if (OldOpndValue.getImm() == std::numeric_limits<uint32_t>::max())
-      return OldOpndVGPR;
+  case AMDGPU::V_XOR_B32_e32:
+    if (OldOpnd->getImm() == 0)
+      return true;
     break;
-  case AMDGPU::V_MAX_I32_e32:
-    if (OldOpndValue.getImm() == std::numeric_limits<int32_t>::max())
-      return OldOpndVGPR;
+  case AMDGPU::V_AND_B32_e32:
+  case AMDGPU::V_MIN_U32_e32:
+    if (static_cast<uint32_t>(OldOpnd->getImm()) ==
+        std::numeric_limits<uint32_t>::max())
+      return true;
     break;
   case AMDGPU::V_MIN_I32_e32:
-    if (OldOpndValue.getImm() == std::numeric_limits<int32_t>::min())
-      return OldOpndVGPR;
+    if (static_cast<int32_t>(OldOpnd->getImm()) ==
+        std::numeric_limits<int32_t>::max())
+      return true;
+    break;
+  case AMDGPU::V_MAX_I32_e32:
+    if (static_cast<int32_t>(OldOpnd->getImm()) ==
+        std::numeric_limits<int32_t>::min())
+      return true;
     break;

   case AMDGPU::V_MUL_I32_I24_e32:
   case AMDGPU::V_MUL_U32_U24_e32:
-    if (OldOpndValue.getImm() == 1) {
-      auto *Src1 = TII->getNamedOperand(OrigMI, AMDGPU::OpName::src1);
-      assert(Src1 && Src1->isReg());
-      return getRegSubRegPair(*Src1);
-    }
+    if (OldOpnd->getImm() == 1)
+      return true;
     break;
   }
-  return RegSubRegPair();
+  return false;
 }

// Cases to combine:
// $bound_ctrl is DPP_BOUND_ZERO, $old is any
// $bound_ctrl is DPP_BOUND_OFF, $old is 0
// -> $old = undef, $bound_ctrl = DPP_BOUND_ZERO

// $bound_ctrl is DPP_BOUND_OFF, $old is undef
// -> $old = undef, $bound_ctrl = DPP_BOUND_OFF

// $bound_ctrl is DPP_BOUND_OFF, $old is foldable
// -> $old = folded value, $bound_ctrl = DPP_BOUND_OFF

MachineInstr *GCNDPPCombine::createDPPInst(MachineInstr &OrigMI,		MachineInstr *GCNDPPCombine::createDPPInst(MachineInstr &OrigMI,
MachineInstr &MovMI,		MachineInstr &MovMI,
RegSubRegPair OldOpndVGPR,		RegSubRegPair CombOldVGPR,
MachineOperand *OldOpndValue,		MachineOperand *OldOpndValue,
bool BoundCtrlZero) const {		bool CombBCZ) const {
assert(OldOpndVGPR.Reg);		assert(CombOldVGPR.Reg);
if (!BoundCtrlZero && OldOpndValue) {		if (!CombBCZ && OldOpndValue && OldOpndValue->isImm()) {
assert(OldOpndValue->isImm());		auto *Src1 = TII->getNamedOperand(OrigMI, AMDGPU::OpName::src1);
OldOpndVGPR = foldOldOpnd(OrigMI, OldOpndVGPR, *OldOpndValue);		if (!Src1 || !Src1->isReg()) {
if (!OldOpndVGPR.Reg) {		LLVM_DEBUG(dbgs() << " failed: no src1 or it isn't a register\n");
LLVM_DEBUG(dbgs() << " failed: old immediate cannot be folded\n");		return nullptr;
		}
		if (!isIdentityValue(OrigMI.getOpcode(), OldOpndValue)) {
		LLVM_DEBUG(dbgs() << " failed: old immediate isn't an identity\n");
		return nullptr;
		}
		CombOldVGPR = getRegSubRegPair(*Src1);
		if (!isOfRegClass(CombOldVGPR, AMDGPU::VGPR_32RegClass, *MRI)) {
		LLVM_DEBUG(dbgs() << " failed: src1 isn't a VGPR32 register\n");
return nullptr;		return nullptr;
}		}
}		}
return createDPPInst(OrigMI, MovMI, OldOpndVGPR, BoundCtrlZero);		return createDPPInst(OrigMI, MovMI, CombOldVGPR, CombBCZ);
}		}
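Why the combined instruction takes src1 as its old operand, rather than the identity again, can be checked with a small lane-level model. This is a sketch under simplifying assumptions, not hardware-accurate: four lanes, a shift-right-by-one DPP control, bound_ctrl off, and a single per-lane `WriteEnabled` mask standing in for row_mask/bank_mask. All names here are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t NumLanes = 4;
using Lanes = std::array<uint32_t, NumLanes>;
using LaneMask = std::array<bool, NumLanes>;

// A lane is updated only if its write is enabled and the shifted source
// lane exists (lane 0 has no neighbour to read from).
static bool laneIsUpdated(std::size_t Lane, const LaneMask &WriteEnabled) {
  return WriteEnabled[Lane] && Lane >= 1;
}

// Model of v_mov_dpp: updated lanes read Src from lane-1, others keep Old.
Lanes movDpp(const Lanes &Old, const Lanes &Src, const LaneMask &WriteEnabled) {
  Lanes Dst = Old;
  for (std::size_t I = 0; I < NumLanes; ++I)
    if (laneIsUpdated(I, WriteEnabled))
      Dst[I] = Src[I - 1];
  return Dst;
}

// Plain per-lane add, modelling the VALU user of the DPP mov.
Lanes add(const Lanes &A, const Lanes &B) {
  Lanes Dst{};
  for (std::size_t I = 0; I < NumLanes; ++I)
    Dst[I] = A[I] + B[I];
  return Dst;
}

// Model of the combined v_add_dpp: updated lanes compute
// Src0[lane-1] + Src1[lane], others keep Old.
Lanes addDpp(const Lanes &Old, const Lanes &Src0, const Lanes &Src1,
             const LaneMask &WriteEnabled) {
  Lanes Dst = Old;
  for (std::size_t I = 0; I < NumLanes; ++I)
    if (laneIsUpdated(I, WriteEnabled))
      Dst[I] = Src0[I - 1] + Src1[I];
  return Dst;
}
```

In this model, `add(movDpp(Zero, X, En), Y)` matches `addDpp(Y, X, Y, En)` lane for lane, because a lane that the mov left at 0 contributes `0 + Y`. Passing `Zero` as the old operand of the combined instruction instead leaves those lanes at 0, which is the mismatch described in the review.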

// returns true if MI doesn't have OpndName immediate operand or the		// returns true if MI doesn't have OpndName immediate operand or the
// operand has Value		// operand has Value
bool GCNDPPCombine::hasNoImmOrEqual(MachineInstr &MI, unsigned OpndName,		bool GCNDPPCombine::hasNoImmOrEqual(MachineInstr &MI, unsigned OpndName,
int64_t Value, int64_t Mask) const {		int64_t Value, int64_t Mask) const {
auto *Imm = TII->getNamedOperand(MI, OpndName);		auto *Imm = TII->getNamedOperand(MI, OpndName);
if (!Imm)		if (!Imm)
return true;		return true;

assert(Imm->isImm());		assert(Imm->isImm());
return (Imm->getImm() & Mask) == Value;		return (Imm->getImm() & Mask) == Value;
}		}

bool GCNDPPCombine::combineDPPMov(MachineInstr &MovMI) const {		bool GCNDPPCombine::combineDPPMov(MachineInstr &MovMI) const {
assert(MovMI.getOpcode() == AMDGPU::V_MOV_B32_dpp);		assert(MovMI.getOpcode() == AMDGPU::V_MOV_B32_dpp);
		LLVM_DEBUG(dbgs() << "\nDPP combine: " << MovMI);

		auto *DstOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::vdst);
		assert(DstOpnd && DstOpnd->isReg());
		auto DPPMovReg = DstOpnd->getReg();
		if (!isEXECMaskConstantBetweenDefAndUses(DPPMovReg, *MRI)) {
		LLVM_DEBUG(dbgs() << " failed: EXEC mask should remain the same"
		" for all uses\n");
		return false;
		}

		auto *RowMaskOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::row_mask);
		assert(RowMaskOpnd && RowMaskOpnd->isImm());
		auto *BankMaskOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::bank_mask);
		assert(BankMaskOpnd && BankMaskOpnd->isImm());
		const bool MaskAllLanes = RowMaskOpnd->getImm() == 0xF &&
		BankMaskOpnd->getImm() == 0xF;

auto *BCZOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::bound_ctrl);		auto *BCZOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::bound_ctrl);
assert(BCZOpnd && BCZOpnd->isImm());		assert(BCZOpnd && BCZOpnd->isImm());
bool BoundCtrlZero = 0 != BCZOpnd->getImm();		bool BoundCtrlZero = BCZOpnd->getImm();

LLVM_DEBUG(dbgs() << "\nDPP combine: " << MovMI);

auto *OldOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::old);		auto *OldOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::old);
assert(OldOpnd && OldOpnd->isReg());		assert(OldOpnd && OldOpnd->isReg());
auto OldOpndVGPR = getRegSubRegPair(*OldOpnd);
		nhaehnle: Having both `OldOpndVGPR` and `OldOpndValue` seems redundant. You could pass `OldOpndValue` as a MachineOperand to `createDPPInst` and have it assert on `!CombinedOld || CombinedOld->Reg()`. That should help cut down the code size here somewhat and make the code easier to follow.
		vpykhtin: There was a reason I did it this way initially, though after all this time I need to recall it. If I remember correctly, the value is tracked through instructions like mov/copy and subreg manipulation, and having only the value isn't enough to obtain the reg to store in the DPP instruction.
auto OldOpndValue = getOldOpndValue(OldOpnd);		auto * const OldOpndValue = getOldOpndValue(*OldOpnd);
		// OldOpndValue is either undef (IMPLICIT_DEF) or immediate or something else
		// We could use: assert(!OldOpndValue || OldOpndValue->isImm())
		// but the third option is used to distinguish undef from non-immediate
		// to reuse IMPLICIT_DEF instruction later
assert(!OldOpndValue || OldOpndValue->isImm() || OldOpndValue == OldOpnd);		assert(!OldOpndValue || OldOpndValue->isImm() || OldOpndValue == OldOpnd);
if (OldOpndValue) {
if (BoundCtrlZero) {		bool CombBCZ = false;
OldOpndVGPR.Reg = AMDGPU::NoRegister; // should be undef, ignore old opnd
		cwabbott: Indentation is off here, seems like your editor is using hard tabs.
		vpykhtin: Yeah, thanks. I reinstalled my editor recently and forgot to turn off tabs. It should be OK with the latest diff.
OldOpndValue = nullptr;		if (MaskAllLanes && BoundCtrlZero) { // [1]
		cwabbott: I don't think this is correct. A lane can be set to 0 for the shift DPP controls (e.g. `DPP_ROW_SL1`) or if EXEC is zero for the lane to be read, in which case the user will do something non-trivial (e.g. OR will pass through the other operand) which we can't emulate in general.
		vpykhtin: Could you please show an example of how it wouldn't work? Note that the EXEC mask remains the same (by the check above) and no combining is performed when the result of the DPP mov is used in any way other than being consumed by a DPP-capable VALU instruction. I think the result of the DPP mov should be the same as the DPP src operand used in the DPP-capable VALU instruction.
		cwabbott: Yes, you're right. I was misremembering what bound_ctrl:0 does. Sorry!
		CombBCZ = true;
} else {		} else {
if (!OldOpndValue->isImm()) {		if (!OldOpndValue || !OldOpndValue->isImm()) {
LLVM_DEBUG(dbgs() << " failed: old operand isn't an imm or undef\n");		LLVM_DEBUG(dbgs() << " failed: the DPP mov isn't combinable\n");
return false;		return false;
		cwabbott: Also here.
}		}
if (OldOpndValue->getImm() == 0) {
OldOpndVGPR.Reg = AMDGPU::NoRegister; // should be undef		if (OldOpndValue->getParent()->getParent() != MovMI.getParent()) {
OldOpndValue = nullptr;		LLVM_DEBUG(dbgs() <<
BoundCtrlZero = true;		" failed: old reg def and mov should be in the same BB\n");
		return false;
}		}

		if (OldOpndValue->getImm() == 0) {
		if (MaskAllLanes) {
		assert(!BoundCtrlZero); // by check [1]
		CombBCZ = true;
		}
		} else if (BoundCtrlZero) {
		assert(!MaskAllLanes); // by check [1]
		LLVM_DEBUG(dbgs() <<
		" failed: old!=0 and bctrl:0 and not all lanes isn't combinable\n");
		return false;
}		}
}		}

LLVM_DEBUG(dbgs() << " old=";		LLVM_DEBUG(dbgs() << " old=";
if (!OldOpndValue)		if (!OldOpndValue)
dbgs() << "undef";		dbgs() << "undef";
else		else
dbgs() << OldOpndValue->getImm();		dbgs() << *OldOpndValue;
		nhaehnle: These should be SmallVectors.
		vpykhtin: Ok.
dbgs() << ", bound_ctrl=" << BoundCtrlZero << '\n');		dbgs() << ", bound_ctrl=" << CombBCZ << '\n');

		cwabbott: This seems incorrect to me. In the lanes where the source is invalid or the row/bank mask is 0, the DPP move will act as a no-op, so if that lane is then added, or'd, etc. with something else, we can't emulate that with a single instruction.
		vpykhtin:
		  mov_dpp v1, v1, ... (v1 of other lane is stored in v1 of this lane)
		  add_u32 v0, v1, ...
		This is the case when the DPP src register is stored in the VGPR with the same name in the issuing lane. This way v1 would contain the same value after an unsuccessful DPP mov (no-op) and therefore can be used in the combined VALU op.
		cwabbott: I still don't see how this can work. For something like:
		  mov_dpp v1, v1, ...
		  add_u32 v0, v1, v2
		lanes where the shared data is invalid based on the DPP ctrl or EXEC will return v1 (same lane) + v2, whereas this will transform it to something like
		  add_u32_dpp v0, v1, v2, ...
		which will give you v0 (undef). What's an example of a transform you're trying to accomplish?
		vpykhtin: Sorry, this should look like:
		  mov v1, X
		  mov v2, Y
		  mov_dpp v1, v1, some_DPP_ctrl
		  add_u32 v0, v1, v2
		transformed to
		  mov v1, X
		  mov v2, Y
		  add_u32_dpp v0, v1, v2, some_DPP_ctrl
		v1 should contain X on invalid DPP access or X from the other lane on valid access.
		vpykhtin: I forgot to mention that some_DPP_ctrl should have all masks fully enabled, otherwise the add wouldn't write its result and v0 is undef indeed.
		cwabbott: This still seems wrong. For the first sequence you gave, the add_u32 will return X (same lane) + Y on invalid DPP access while add_u32_dpp will return undef (assuming v0 isn't initialized, since you set old to undef in this patch).
		vpykhtin: It looks like the documentation is a bit unclear here. I thought that when bound ctrl is off the hardware just uses the value of the issuing lane instead of the DPP read, but if it disables writing the result then you're right - v0 would be undef. I'll find out and fix this.
std::vector<MachineInstr*> OrigMIs, DPPMIs;		SmallVector<MachineInstr*, 4> OrigMIs, DPPMIs;
if (!OldOpndVGPR.Reg) { // OldOpndVGPR = undef		auto CombOldVGPR = getRegSubRegPair(*OldOpnd);
OldOpndVGPR = RegSubRegPair(		// try to reuse previous old reg if its undefined (IMPLICIT_DEF)
		if (CombBCZ && OldOpndValue) { // CombOldVGPR should be undef
		CombOldVGPR = RegSubRegPair(
MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass));		MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass));
auto UndefInst = BuildMI(*MovMI.getParent(), MovMI, MovMI.getDebugLoc(),		auto UndefInst = BuildMI(*MovMI.getParent(), MovMI, MovMI.getDebugLoc(),
TII->get(AMDGPU::IMPLICIT_DEF), OldOpndVGPR.Reg);		TII->get(AMDGPU::IMPLICIT_DEF), CombOldVGPR.Reg);
DPPMIs.push_back(UndefInst.getInstr());		DPPMIs.push_back(UndefInst.getInstr());
}		}

OrigMIs.push_back(&MovMI);		OrigMIs.push_back(&MovMI);
bool Rollback = true;		bool Rollback = true;
for (auto &Use : MRI->use_nodbg_operands(		for (auto &Use : MRI->use_nodbg_operands(DPPMovReg)) {
TII->getNamedOperand(MovMI, AMDGPU::OpName::vdst)->getReg())) {
Rollback = true;		Rollback = true;

auto &OrigMI = *Use.getParent();		auto &OrigMI = *Use.getParent();
		LLVM_DEBUG(dbgs() << " try: " << OrigMI);

auto OrigOp = OrigMI.getOpcode();		auto OrigOp = OrigMI.getOpcode();
if (TII->isVOP3(OrigOp)) {		if (TII->isVOP3(OrigOp)) {
if (!TII->hasVALU32BitEncoding(OrigOp)) {		if (!TII->hasVALU32BitEncoding(OrigOp)) {
LLVM_DEBUG(dbgs() << " failed: VOP3 hasn't e32 equivalent\n");		LLVM_DEBUG(dbgs() << " failed: VOP3 hasn't e32 equivalent\n");
break;		break;
}		}
// check if other than abs|neg modifiers are set (opsel for example)		// check if other than abs|neg modifiers are set (opsel for example)
const int64_t Mask = ~(SISrcMods::ABS | SISrcMods::NEG);		const int64_t Mask = ~(SISrcMods::ABS | SISrcMods::NEG);
if (!hasNoImmOrEqual(OrigMI, AMDGPU::OpName::src0_modifiers, 0, Mask) ||		if (!hasNoImmOrEqual(OrigMI, AMDGPU::OpName::src0_modifiers, 0, Mask) ||
!hasNoImmOrEqual(OrigMI, AMDGPU::OpName::src1_modifiers, 0, Mask) ||		!hasNoImmOrEqual(OrigMI, AMDGPU::OpName::src1_modifiers, 0, Mask) ||
!hasNoImmOrEqual(OrigMI, AMDGPU::OpName::clamp, 0) ||		!hasNoImmOrEqual(OrigMI, AMDGPU::OpName::clamp, 0) ||
!hasNoImmOrEqual(OrigMI, AMDGPU::OpName::omod, 0)) {		!hasNoImmOrEqual(OrigMI, AMDGPU::OpName::omod, 0)) {
LLVM_DEBUG(dbgs() << " failed: VOP3 has non-default modifiers\n");		LLVM_DEBUG(dbgs() << " failed: VOP3 has non-default modifiers\n");
break;		break;
}		}
} else if (!TII->isVOP1(OrigOp) && !TII->isVOP2(OrigOp)) {		} else if (!TII->isVOP1(OrigOp) && !TII->isVOP2(OrigOp)) {
LLVM_DEBUG(dbgs() << " failed: not VOP1/2/3\n");		LLVM_DEBUG(dbgs() << " failed: not VOP1/2/3\n");
break;		break;
}		}

LLVM_DEBUG(dbgs() << " combining: " << OrigMI);		LLVM_DEBUG(dbgs() << " combining: " << OrigMI);
if (&Use == TII->getNamedOperand(OrigMI, AMDGPU::OpName::src0)) {		if (&Use == TII->getNamedOperand(OrigMI, AMDGPU::OpName::src0)) {
if (auto *DPPInst = createDPPInst(OrigMI, MovMI, OldOpndVGPR,		if (auto *DPPInst = createDPPInst(OrigMI, MovMI, CombOldVGPR,
OldOpndValue, BoundCtrlZero)) {		OldOpndValue, CombBCZ)) {
DPPMIs.push_back(DPPInst);		DPPMIs.push_back(DPPInst);
Rollback = false;		Rollback = false;
}		}
} else if (OrigMI.isCommutable() &&		} else if (OrigMI.isCommutable() &&
&Use == TII->getNamedOperand(OrigMI, AMDGPU::OpName::src1)) {		&Use == TII->getNamedOperand(OrigMI, AMDGPU::OpName::src1)) {
auto *BB = OrigMI.getParent();		auto *BB = OrigMI.getParent();
auto *NewMI = BB->getParent()->CloneMachineInstr(&OrigMI);		auto *NewMI = BB->getParent()->CloneMachineInstr(&OrigMI);
BB->insert(OrigMI, NewMI);		BB->insert(OrigMI, NewMI);
if (TII->commuteInstruction(*NewMI)) {		if (TII->commuteInstruction(*NewMI)) {
		nhaehnle: Can we simplify the code here by just commuting the OrigMI unconditionally? That is, first check whether `&Use != src0 && &Use == src1`; in that case, try commuting the instruction. If that fails, break (but I don't think there'd be a need to roll back the commutation). If it succeeds, continue down the same path that is used for the `&Use == src0` case. You end up with only a single call to `createDPPInst` and less code duplication in general.
		vpykhtin: This is a good idea: I don't like leaving a commuted instruction behind on rollback at some intuitive level, but I really don't have arguments against it at the moment :-)
		vpykhtin: Actually a pass returns a boolean meaning it changed something in the IR. What should it return when it does a rollback? It may create new commuted instructions which need to be mapped in SlotIndexes, for example. I don't think it's a good idea to return true when only a rollback took place.
		arsenm: Can rollback be avoided? I think the usual purpose of knowing something changed is to know if iterators are still valid, which probably isn't the case if anything was modified at any point.
		arsenm: I suppose if it's just commuting, it's OK to report no change.
		vpykhtin: The commutation is the whole problem. I cannot know beforehand whether the instruction can be commuted. Also, an instruction copy has to be created for the commutation - this means that previous analysis should be invalidated and the IR change should be reported.
		vpykhtin: I think there is no way to avoid new instructions on commutation because of reversed instructions.
LLVM_DEBUG(dbgs() << " commuted: " << *NewMI);		LLVM_DEBUG(dbgs() << " commuted: " << *NewMI);
if (auto DPPInst = createDPPInst(NewMI, MovMI, OldOpndVGPR,		if (auto DPPInst = createDPPInst(NewMI, MovMI, CombOldVGPR,
OldOpndValue, BoundCtrlZero)) {		OldOpndValue, CombBCZ)) {
DPPMIs.push_back(DPPInst);		DPPMIs.push_back(DPPInst);
Rollback = false;		Rollback = false;
}		}
} else		} else
LLVM_DEBUG(dbgs() << " failed: cannot be commuted\n");		LLVM_DEBUG(dbgs() << " failed: cannot be commuted\n");
NewMI->eraseFromParent();		NewMI->eraseFromParent();
} else		} else
LLVM_DEBUG(dbgs() << " failed: no suitable operands\n");		LLVM_DEBUG(dbgs() << " failed: no suitable operands\n");
Show All 33 Lines
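The decision logic in `combineDPPMov` together with the identity fallback in `createDPPInst` can be summarised as a small standalone function. This is a sketch for reading the diff, not part of the patch: `OldKind`, `CombinedOld`, `DppMov`, `Plan` and `planCombine` are hypothetical names, and the same-basic-block and EXEC checks are left out.

```cpp
#include <cstdint>
#include <optional>

// What the tracked value of the mov's old operand turned out to be.
enum class OldKind { Undef, Imm, Other };

// What the combined DPP instruction uses as its old operand.
enum class CombinedOld {
  ReuseOld,       // keep the existing (undefined) old register
  FreshUndef,     // emit a new IMPLICIT_DEF and use it as old
  Src1IfIdentity  // use src1, valid only if the imm is an identity value
};

struct DppMov {
  OldKind Old;
  int64_t OldImm;     // meaningful only when Old == OldKind::Imm
  bool MaskAllLanes;  // row_mask == 0xF && bank_mask == 0xF
  bool BoundCtrlZero; // bound_ctrl:0 is set on the mov
};

struct Plan {
  bool CombBCZ;    // bound_ctrl:0 on the combined instruction
  CombinedOld Old;
};

// Returns std::nullopt when the mov is not combinable.
std::optional<Plan> planCombine(const DppMov &M) {
  if (M.MaskAllLanes && M.BoundCtrlZero) // check [1] in the patch
    return Plan{true, M.Old == OldKind::Undef ? CombinedOld::ReuseOld
                                              : CombinedOld::FreshUndef};
  if (M.Old != OldKind::Imm)
    return std::nullopt; // undef or untracked old without check [1]
  if (M.OldImm == 0 && M.MaskAllLanes)
    return Plan{true, CombinedOld::FreshUndef};
  if (M.OldImm != 0 && M.BoundCtrlZero)
    return std::nullopt; // old != 0, bound_ctrl:0, not all lanes
  return Plan{false, CombinedOld::Src1IfIdentity};
}
```

The `Src1IfIdentity` outcome still depends on the later identity check for the user's opcode, so a plan returned here is a necessary condition rather than a guarantee that the combine succeeds.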

lib/Target/AMDGPU/SIInstrInfo.h

	TargetInstrInfo::RegSubRegPair getRegSequenceSubReg(MachineInstr &MI,
unsigned SubReg);		unsigned SubReg);

/// \brief Return the defining instruction for a given reg:subreg pair		/// \brief Return the defining instruction for a given reg:subreg pair
/// skipping copy like instructions and subreg-manipulation pseudos.		/// skipping copy like instructions and subreg-manipulation pseudos.
/// Following another subreg of a reg:subreg isn't supported.		/// Following another subreg of a reg:subreg isn't supported.
MachineInstr *getVRegSubRegDef(const TargetInstrInfo::RegSubRegPair &P,		MachineInstr *getVRegSubRegDef(const TargetInstrInfo::RegSubRegPair &P,
MachineRegisterInfo &MRI);		MachineRegisterInfo &MRI);

		/// \brief Return true if EXEC mask isn't changed between the def and
		/// all uses of VReg. Currently if def and uses are in different BBs -
		/// simply return false. Should be run on SSA.
		bool isEXECMaskConstantBetweenDefAndUses(unsigned VReg,
		MachineRegisterInfo &MRI);

namespace AMDGPU {		namespace AMDGPU {

LLVM_READONLY		LLVM_READONLY
int getVOPe64(uint16_t Opcode);		int getVOPe64(uint16_t Opcode);

LLVM_READONLY		LLVM_READONLY
int getVOPe32(uint16_t Opcode);		int getVOPe32(uint16_t Opcode);


lib/Target/AMDGPU/SIInstrInfo.cpp

	default:
DefInst = MRI.getVRegDef(RSR.Reg);		DefInst = MRI.getVRegDef(RSR.Reg);
}		}
}		}
if (!DefInst)		if (!DefInst)
return MI;		return MI;
}		}
return nullptr;		return nullptr;
}		}

		bool llvm::isEXECMaskConstantBetweenDefAndUses(unsigned VReg,
		MachineRegisterInfo &MRI) {
		assert(MRI.isSSA() && "Must be run on SSA");
		auto *TRI = MRI.getTargetRegisterInfo();

		auto *DefI = MRI.getVRegDef(VReg);
		auto *BB = DefI->getParent();

		DenseSet<MachineInstr*> Uses;
		for (auto &Use : MRI.use_nodbg_operands(VReg)) {
		auto *I = Use.getParent();
		if (I->getParent() != BB)
		return false;
		Uses.insert(I);
		}

		auto E = BB->end();
		for (auto I = std::next(DefI->getIterator()); I != E; ++I) {
		Uses.erase(&*I);
		// don't check the last use
		if (Uses.empty() || I->modifiesRegister(AMDGPU::EXEC, TRI))
		break;
		}
		return Uses.empty();
		}
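The scan performed by `isEXECMaskConstantBetweenDefAndUses` can be modelled without any LLVM types. The sketch below assumes the instructions following the def are given as a flat list within one basic block; `Inst` and `execConstantUntilLastUse` are hypothetical names, and the cross-block bail-out of the real function is not modelled.

```cpp
#include <cstddef>
#include <vector>

// One instruction following the def inside the same basic block.
struct Inst {
  bool UsesReg;    // reads the register defined by the DPP mov
  bool WritesExec; // modifies the EXEC mask
};

// Returns true if every use is reached before EXEC is modified.
// The last use may itself write EXEC: it is not checked, mirroring
// the "don't check the last use" comment in the patch.
bool execConstantUntilLastUse(const std::vector<Inst> &AfterDef) {
  std::size_t Remaining = 0;
  for (const Inst &I : AfterDef)
    if (I.UsesReg)
      ++Remaining;

  for (const Inst &I : AfterDef) {
    if (I.UsesReg)
      --Remaining;
    if (Remaining == 0)
      return true;  // all uses seen with EXEC unchanged so far
    if (I.WritesExec)
      return false; // EXEC changes while uses are still pending
  }
  return Remaining == 0;
}
```

A use that writes EXEC while other uses are still pending makes the function return false, since the remaining uses would execute under a different mask than the def.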

test/CodeGen/AMDGPU/dpp_combine.ll

This file was deleted.

	; RUN: llc -march=amdgcn -mcpu=tonga -amdgpu-dpp-combine -verify-machineinstrs < %s | FileCheck %s

	; VOP2 with literal cannot be combined
	; CHECK-LABEL: {{^}}dpp_combine_i32_literal:
	; CHECK: v_mov_b32_dpp [[OLD:v[0-9]+]], {{v[0-9]+}} quad_perm:[1,0,0,0] row_mask:0x2 bank_mask:0x1 bound_ctrl:0
	; CHECK: v_add_u32_e32 {{v[0-9]+}}, vcc, 42, [[OLD]]
	define amdgpu_kernel void @dpp_combine_i32_literal(i32 addrspace(1)* %out, i32 %in) {
	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 2, i32 1, i1 1) #0
	%res = add nsw i32 %dpp, 42
	store i32 %res, i32 addrspace(1)* %out
	ret void
	}

	; CHECK-LABEL: {{^}}dpp_combine_i32_bz:
	; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0
	define amdgpu_kernel void @dpp_combine_i32_bz(i32 addrspace(1)* %out, i32 %in) {
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 1, i32 1, i1 1) #0
	%res = add nsw i32 %dpp, %x
	store i32 %res, i32 addrspace(1)* %out
	ret void
	}

	; CHECK-LABEL: {{^}}dpp_combine_i32_boff_undef:
	; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
	define amdgpu_kernel void @dpp_combine_i32_boff_undef(i32 addrspace(1)* %out, i32 %in) {
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 1, i32 1, i1 0) #0
	%res = add nsw i32 %dpp, %x
	store i32 %res, i32 addrspace(1)* %out
	ret void
	}

	; CHECK-LABEL: {{^}}dpp_combine_i32_boff_0:
	; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0
	define amdgpu_kernel void @dpp_combine_i32_boff_0(i32 addrspace(1)* %out, i32 %in) {
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %in, i32 1, i32 1, i32 1, i1 0) #0
	%res = add nsw i32 %dpp, %x
	store i32 %res, i32 addrspace(1)* %out
	ret void
	}

	; CHECK-LABEL: {{^}}dpp_combine_i32_boff_max:
	; CHECK: v_bfrev_b32_e32 [[OLD:v[0-9]+]], -2
	; CHECK: v_max_i32_dpp [[OLD]], {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
	define amdgpu_kernel void @dpp_combine_i32_boff_max(i32 addrspace(1)* %out, i32 %in) {
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 2147483647, i32 %in, i32 1, i32 1, i32 1, i1 0) #0
	%cmp = icmp sge i32 %dpp, %x
	%res = select i1 %cmp, i32 %dpp, i32 %x
	store i32 %res, i32 addrspace(1)* %out
	ret void
	}

	; CHECK-LABEL: {{^}}dpp_combine_i32_boff_min:
	; CHECK: v_bfrev_b32_e32 [[OLD:v[0-9]+]], 1
	; CHECK: v_min_i32_dpp [[OLD]], {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
	define amdgpu_kernel void @dpp_combine_i32_boff_min(i32 addrspace(1)* %out, i32 %in) {
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 -2147483648, i32 %in, i32 1, i32 1, i32 1, i1 0) #0
	%cmp = icmp sle i32 %dpp, %x
	%res = select i1 %cmp, i32 %dpp, i32 %x
	store i32 %res, i32 addrspace(1)* %out
	ret void
	}

	; CHECK-LABEL: {{^}}dpp_combine_i32_boff_mul:
	; CHECK: v_mul_i32_i24_dpp v0, v3, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
	define amdgpu_kernel void @dpp_combine_i32_boff_mul(i32 addrspace(1)* %out, i32 %in) {
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 1, i32 %in, i32 1, i32 1, i32 1, i1 0) #0

	%dpp.shl = shl i32 %dpp, 8
	%dpp.24 = ashr i32 %dpp.shl, 8
	%x.shl = shl i32 %x, 8
	%x.24 = ashr i32 %x.shl, 8
	%res = mul i32 %dpp.24, %x.24
	store i32 %res, i32 addrspace(1)* %out
	ret void
	}

	; CHECK-LABEL: {{^}}dpp_combine_i32_commute:
	; CHECK: v_subrev_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[2,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0
	define amdgpu_kernel void @dpp_combine_i32_commute(i32 addrspace(1)* %out, i32 %in) {
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 2, i32 1, i32 1, i1 1) #0
	%res = sub nsw i32 %x, %dpp
	store i32 %res, i32 addrspace(1)* %out
	ret void
	}

	; CHECK-LABEL: {{^}}dpp_combine_f32:
	; CHECK: v_add_f32_dpp {{v[0-9]+}}, {{v[0-9]+}}, v0 quad_perm:[3,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0
	define amdgpu_kernel void @dpp_combine_f32(i32 addrspace(1)* %out, i32 %in) {
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()

	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 3, i32 1, i32 1, i1 1) #0
	%dpp.f32 = bitcast i32 %dpp to float
	%x.f32 = bitcast i32 %x to float
	%res.f32 = fadd float %x.f32, %dpp.f32
	%res = bitcast float %res.f32 to i32
	store i32 %res, i32 addrspace(1)* %out
	ret void
	}

	; CHECK-LABEL: {{^}}dpp_combine_test_f32_mods:
	; CHECK: v_mul_f32_dpp {{v[0-9]+}}, |{{v[0-9]+}}|, -v0 quad_perm:[0,1,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0
	define amdgpu_kernel void @dpp_combine_test_f32_mods(i32 addrspace(1)* %out, i32 %in) {
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()

	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 4, i32 1, i32 1, i1 1) #0

	%x.f32 = bitcast i32 %x to float
	%x.f32.neg = fsub float -0.000000e+00, %x.f32

	%dpp.f32 = bitcast i32 %dpp to float
	%dpp.f32.cmp = fcmp fast olt float %dpp.f32, 0.000000e+00
	%dpp.f32.sign = select i1 %dpp.f32.cmp, float -1.000000e+00, float 1.000000e+00
	%dpp.f32.abs = fmul fast float %dpp.f32, %dpp.f32.sign

	%res.f32 = fmul float %x.f32.neg, %dpp.f32.abs
	%res = bitcast float %res.f32 to i32
	store i32 %res, i32 addrspace(1)* %out
	ret void
	}

	; CHECK-LABEL: {{^}}dpp_combine_mac:
	; CHECK: v_mac_f32_dpp v0, {{v[0-9]+}}, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0
	define amdgpu_kernel void @dpp_combine_mac(float addrspace(1)* %out, i32 %in) {
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%y = tail call i32 @llvm.amdgcn.workitem.id.y()
	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 1, i32 1, i1 1) #0
	%dpp.f32 = bitcast i32 %dpp to float
	%x.f32 = bitcast i32 %x to float
	%y.f32 = bitcast i32 %y to float

	%mult = fmul float %dpp.f32, %y.f32
	%res = fadd float %mult, %x.f32
	store float %res, float addrspace(1)* %out
	ret void
	}

	; CHECK-LABEL: {{^}}dpp_combine_sequence:
	define amdgpu_kernel void @dpp_combine_sequence(i32 addrspace(1)* %out, i32 %in, i1 %cmp) {
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 1, i32 1, i1 1) #0
	br i1 %cmp, label %bb1, label %bb2
	bb1:
	; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0
	%resadd = add nsw i32 %dpp, %x
	br label %bb3
	bb2:
	; CHECK: v_subrev_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0
	%ressub = sub nsw i32 %x, %dpp
	br label %bb3
	bb3:
	%res = phi i32 [%resadd, %bb1], [%ressub, %bb2]
	store i32 %res, i32 addrspace(1)* %out
	ret void
	}

	; CHECK-LABEL: {{^}}dpp_combine_sequence_negative:
	; CHECK: v_mov_b32_dpp v1, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0
	define amdgpu_kernel void @dpp_combine_sequence_negative(i32 addrspace(1)* %out, i32 %in, i1 %cmp) {
	%x = tail call i32 @llvm.amdgcn.workitem.id.x()
	%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 1, i32 1, i1 1) #0
	br i1 %cmp, label %bb1, label %bb2
	bb1:
	%resadd = add nsw i32 %dpp, %x
	br label %bb3
	bb2:
	%ressub = sub nsw i32 2, %dpp ; break seq
	br label %bb3
	bb3:
	%res = phi i32 [%resadd, %bb1], [%ressub, %bb2]
	store i32 %res, i32 addrspace(1)* %out
	ret void
	}

	declare i32 @llvm.amdgcn.workitem.id.x()
	declare i32 @llvm.amdgcn.workitem.id.y()
	declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32, i32, i32, i1) #0

	attributes #0 = { nounwind readnone convergent }

test/CodeGen/AMDGPU/dpp_combine.mir

This file was added.

				# RUN: llc -march=amdgcn -mcpu=tonga -run-pass=gcn-dpp-combine -o - %s | FileCheck %s

				---
				# old is undefined: only combine when masks are fully enabled and
				# bound_ctrl:0 is set, otherwise the result of DPP VALU op can be undefined.
				# CHECK-LABEL: name: old_is_undef
				# CHECK: %2:vgpr_32 = IMPLICIT_DEF
				# VOP2:
				# CHECK: %4:vgpr_32 = V_ADD_U32_dpp %2, %0, %1, 1, 15, 15, 1, implicit $exec
				# CHECK: %6:vgpr_32 = V_ADD_U32_e32 %5, %1, implicit $exec
				# CHECK: %8:vgpr_32 = V_ADD_U32_e32 %7, %1, implicit $exec
				# CHECK: %10:vgpr_32 = V_ADD_U32_e32 %9, %1, implicit $exec
				# VOP1:
				# CHECK: %12:vgpr_32 = V_NOT_B32_dpp %2, %0, 1, 15, 15, 1, implicit $exec
				# CHECK: %14:vgpr_32 = V_NOT_B32_e32 %13, implicit $exec
				# CHECK: %16:vgpr_32 = V_NOT_B32_e32 %15, implicit $exec
				# CHECK: %18:vgpr_32 = V_NOT_B32_e32 %17, implicit $exec
				name: old_is_undef
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $vgpr0, $vgpr1
				%0:vgpr_32 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1
				%2:vgpr_32 = IMPLICIT_DEF

				; VOP2
				%3:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 15, 15, 1, implicit $exec
				%4:vgpr_32 = V_ADD_U32_e32 %3, %1, implicit $exec

				%5:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 15, 15, 0, implicit $exec
				%6:vgpr_32 = V_ADD_U32_e32 %5, %1, implicit $exec

				%7:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 1, implicit $exec
				%8:vgpr_32 = V_ADD_U32_e32 %7, %1, implicit $exec

				%9:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 0, implicit $exec
				%10:vgpr_32 = V_ADD_U32_e32 %9, %1, implicit $exec

				; VOP1
				%11:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 15, 15, 1, implicit $exec
				%12:vgpr_32 = V_NOT_B32_e32 %11, implicit $exec

				%13:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 15, 15, 0, implicit $exec
				%14:vgpr_32 = V_NOT_B32_e32 %13, implicit $exec

				%15:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 1, implicit $exec
				%16:vgpr_32 = V_NOT_B32_e32 %15, implicit $exec

				%17:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 0, implicit $exec
				%18:vgpr_32 = V_NOT_B32_e32 %17, implicit $exec
				...

				# old is zero cases:

				# CHECK-LABEL: name: old_is_0

				# VOP2:
				# case 1: old is zero, masks are fully enabled, bound_ctrl:0 is on:
				# the DPP mov result would be either zero ({src lane disabled}|{src lane is
				# out of range}) or active src lane result - can combine with old = undef.
				# undef is preferred as it makes life easier for the regalloc.
				# CHECK: [[U1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
				# CHECK: %4:vgpr_32 = V_ADD_U32_dpp [[U1]], %0, %1, 1, 15, 15, 1, implicit $exec

				# case 2: old is zero, masks are fully enabled, bound_ctrl:0 is off:
				# as the DPP mov old is zero this case is no different from case 1 - combine it
				# setting bound_ctrl0 on for the combined DPP VALU op to make old undefined
				# CHECK: [[U2:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
				# CHECK: %6:vgpr_32 = V_ADD_U32_dpp [[U2]], %0, %1, 1, 15, 15, 1, implicit $exec

				# case 3: masks are partially disabled, bound_ctrl:0 is on:
				# the DPP mov result would be either zero ({src lane disabled}|{src lane is
				# out of range} or {the DPP mov's dest VGPR write is disabled by masks}) or
				# active src lane result - can combine with old = src1 of the VALU op.
				# The VALU op should have the same masks as DPP mov as they select lanes
				# with identity value.
				# Special case: the bound_ctrl for the combined DPP VALU op isn't important
				# here but let's make it off to keep the combiner's logic simpler.
				# CHECK: %8:vgpr_32 = V_ADD_U32_dpp %1, %0, %1, 1, 14, 15, 0, implicit $exec

				# case 4: masks are partially disabled, bound_ctrl:0 is off:
				# the DPP mov result would be either zero ({src lane disabled}|{src lane is
				# out of range} or {the DPP mov's dest VGPR write is disabled by masks}) or
				# active src lane result - can combine with old = src1 of the VALU op.
				# The VALU op should have the same masks as DPP mov as they select
				# lanes with identity value
				# CHECK: %10:vgpr_32 = V_ADD_U32_dpp %1, %0, %1, 1, 14, 15, 0, implicit $exec

				# VOP1:
				# see case 1
				# CHECK: [[U3:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
				# CHECK: %12:vgpr_32 = V_NOT_B32_dpp [[U3]], %0, 1, 15, 15, 1, implicit $exec
				# see case 2
				# CHECK: [[U4:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
				# CHECK: %14:vgpr_32 = V_NOT_B32_dpp [[U4]], %0, 1, 15, 15, 1, implicit $exec
				# cases 3 and 4 are not applicable as there is no way to specify an unchanged
				# result for the unary VALU op
				# CHECK: %16:vgpr_32 = V_NOT_B32_e32 %15, implicit $exec
				# CHECK: %18:vgpr_32 = V_NOT_B32_e32 %17, implicit $exec

				name: old_is_0
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $vgpr0, $vgpr1
				%0:vgpr_32 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1
				%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec

				; VOP2
				%3:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 15, 15, 1, implicit $exec
				%4:vgpr_32 = V_ADD_U32_e32 %3, %1, implicit $exec

				%5:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 15, 15, 0, implicit $exec
				%6:vgpr_32 = V_ADD_U32_e32 %5, %1, implicit $exec

				%7:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 1, implicit $exec
				%8:vgpr_32 = V_ADD_U32_e32 %7, %1, implicit $exec

				%9:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 0, implicit $exec
				%10:vgpr_32 = V_ADD_U32_e32 %9, %1, implicit $exec

				; VOP1
				%11:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 15, 15, 1, implicit $exec
				%12:vgpr_32 = V_NOT_B32_e32 %11, implicit $exec

				%13:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 15, 15, 0, implicit $exec
				%14:vgpr_32 = V_NOT_B32_e32 %13, implicit $exec

				%15:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 1, implicit $exec
				%16:vgpr_32 = V_NOT_B32_e32 %15, implicit $exec

				%17:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 0, implicit $exec
				%18:vgpr_32 = V_NOT_B32_e32 %17, implicit $exec
				...

				# old is nonzero identity cases:

				# old is nonzero identity, masks are fully enabled, bound_ctrl:0 is off:
				# the DPP mov result would be either identity ({src lane disabled}|{out of
				# range}) or src lane result - can combine with old = src1 of the VALU op
				# The DPP VALU op should have the same masks (and bctrl) as DPP mov as they
				# select lanes with identity value

				# CHECK-LABEL: name: nonzero_old_is_identity_masks_enabled_bctl_off
				# CHECK: %4:vgpr_32 = V_MUL_U32_U24_dpp %1, %0, %1, 1, 15, 15, 0, implicit $exec
				# CHECK: %7:vgpr_32 = V_AND_B32_dpp %1, %0, %1, 1, 15, 15, 0, implicit $exec
				# CHECK: %10:vgpr_32 = V_MAX_I32_dpp %1, %0, %1, 1, 15, 15, 0, implicit $exec
				# CHECK: %13:vgpr_32 = V_MIN_I32_dpp %1, %0, %1, 1, 15, 15, 0, implicit $exec

				name: nonzero_old_is_identity_masks_enabled_bctl_off
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $vgpr0, $vgpr1
				%0:vgpr_32 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1

				%2:vgpr_32 = V_MOV_B32_e32 1, implicit $exec
				%3:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 15, 15, 0, implicit $exec
				%4:vgpr_32 = V_MUL_U32_U24_e32 %3, %1, implicit $exec

				%5:vgpr_32 = V_MOV_B32_e32 4294967295, implicit $exec
				%6:vgpr_32 = V_MOV_B32_dpp %5, %0, 1, 15, 15, 0, implicit $exec
				%7:vgpr_32 = V_AND_B32_e32 %6, %1, implicit $exec

				%8:vgpr_32 = V_MOV_B32_e32 -2147483648, implicit $exec
				%9:vgpr_32 = V_MOV_B32_dpp %8, %0, 1, 15, 15, 0, implicit $exec
				%10:vgpr_32 = V_MAX_I32_e32 %9, %1, implicit $exec

				%11:vgpr_32 = V_MOV_B32_e32 2147483647, implicit $exec
				%12:vgpr_32 = V_MOV_B32_dpp %11, %0, 1, 15, 15, 0, implicit $exec
				%13:vgpr_32 = V_MIN_I32_e32 %12, %1, implicit $exec
				...

				# old is nonzero identity, masks are partially enabled, bound_ctrl:0 is off:
				# the DPP mov result would be either identity ({src lane disabled}|{src lane is
				# out of range} or {the DPP mov's dest VGPR write is disabled by masks}) or
				# active src lane result - can combine with old = src1 of the VALU op.
				# The DPP VALU op should have the same masks (and bctrl) as DPP mov as they
				# select lanes with identity value

				# CHECK-LABEL: name: nonzero_old_is_identity_masks_partially_disabled_bctl_off
				# CHECK: %4:vgpr_32 = V_MUL_U32_U24_dpp %1, %0, %1, 1, 14, 15, 0, implicit $exec
				# CHECK: %7:vgpr_32 = V_AND_B32_dpp %1, %0, %1, 1, 15, 14, 0, implicit $exec
				# CHECK: %10:vgpr_32 = V_MAX_I32_dpp %1, %0, %1, 1, 14, 15, 0, implicit $exec
				# CHECK: %13:vgpr_32 = V_MIN_I32_dpp %1, %0, %1, 1, 15, 14, 0, implicit $exec

				name: nonzero_old_is_identity_masks_partially_disabled_bctl_off
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $vgpr0, $vgpr1
				%0:vgpr_32 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1

				%2:vgpr_32 = V_MOV_B32_e32 1, implicit $exec
				%3:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 0, implicit $exec
				%4:vgpr_32 = V_MUL_U32_U24_e32 %3, %1, implicit $exec

				%5:vgpr_32 = V_MOV_B32_e32 4294967295, implicit $exec
				%6:vgpr_32 = V_MOV_B32_dpp %5, %0, 1, 15, 14, 0, implicit $exec
				%7:vgpr_32 = V_AND_B32_e32 %6, %1, implicit $exec

				%8:vgpr_32 = V_MOV_B32_e32 -2147483648, implicit $exec
				%9:vgpr_32 = V_MOV_B32_dpp %8, %0, 1, 14, 15, 0, implicit $exec
				%10:vgpr_32 = V_MAX_I32_e32 %9, %1, implicit $exec

				%11:vgpr_32 = V_MOV_B32_e32 2147483647, implicit $exec
				%12:vgpr_32 = V_MOV_B32_dpp %11, %0, 1, 15, 14, 0, implicit $exec
				%13:vgpr_32 = V_MIN_I32_e32 %12, %1, implicit $exec
				...

				# old is nonzero identity, masks are partially enabled, bound_ctrl:0 is on:
				# the DPP mov result may have 3 different values:
				# 1. the active src lane result
				# 2. 0 if the src lane is disabled|out of range
				# 3. DPP mov's old value if the mov's dest VGPR write is disabled by masks
				# can't combine

				# CHECK-LABEL: name: nonzero_old_is_identity_masks_partially_disabled_bctl0
				# CHECK: %4:vgpr_32 = V_MUL_U32_U24_e32 %3, %1, implicit $exec
				# CHECK: %7:vgpr_32 = V_AND_B32_e32 %6, %1, implicit $exec
				# CHECK: %10:vgpr_32 = V_MAX_I32_e32 %9, %1, implicit $exec
				# CHECK: %13:vgpr_32 = V_MIN_I32_e32 %12, %1, implicit $exec

				name: nonzero_old_is_identity_masks_partially_disabled_bctl0
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $vgpr0, $vgpr1
				%0:vgpr_32 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1

				%2:vgpr_32 = V_MOV_B32_e32 1, implicit $exec
				%3:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 1, implicit $exec
				%4:vgpr_32 = V_MUL_U32_U24_e32 %3, %1, implicit $exec

				%5:vgpr_32 = V_MOV_B32_e32 4294967295, implicit $exec
				%6:vgpr_32 = V_MOV_B32_dpp %5, %0, 1, 15, 14, 1, implicit $exec
				%7:vgpr_32 = V_AND_B32_e32 %6, %1, implicit $exec

				%8:vgpr_32 = V_MOV_B32_e32 -2147483648, implicit $exec
				%9:vgpr_32 = V_MOV_B32_dpp %8, %0, 1, 14, 15, 1, implicit $exec
				%10:vgpr_32 = V_MAX_I32_e32 %9, %1, implicit $exec

				%11:vgpr_32 = V_MOV_B32_e32 2147483647, implicit $exec
				%12:vgpr_32 = V_MOV_B32_dpp %11, %0, 1, 15, 14, 1, implicit $exec
				%13:vgpr_32 = V_MIN_I32_e32 %12, %1, implicit $exec
				...

				# when the DPP source isn't the src0 operand, the operation should be commuted if possible
				# CHECK-LABEL: name: dpp_commute
				# CHECK: %4:vgpr_32 = V_MUL_U32_U24_dpp %1, %0, %1, 1, 14, 15, 0, implicit $exec
				# CHECK: %7:vgpr_32 = V_AND_B32_dpp %1, %0, %1, 1, 15, 14, 0, implicit $exec
				# CHECK: %10:vgpr_32 = V_MAX_I32_dpp %1, %0, %1, 1, 14, 15, 0, implicit $exec
				# CHECK: %13:vgpr_32 = V_MIN_I32_dpp %1, %0, %1, 1, 15, 14, 0, implicit $exec
				# CHECK: %16:vgpr_32 = V_SUBREV_I32_dpp %1, %0, %1, 1, 14, 15, 0, implicit-def $vcc, implicit $exec
				# CHECK: %19:vgpr_32 = V_ADD_I32_e32 5, %18, implicit-def $vcc, implicit $exec
				name: dpp_commute
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $vgpr0, $vgpr1

				%0:vgpr_32 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1

				%2:vgpr_32 = V_MOV_B32_e32 1, implicit $exec
				%3:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 0, implicit $exec
				%4:vgpr_32 = V_MUL_U32_U24_e32 %1, %3, implicit $exec

				%5:vgpr_32 = V_MOV_B32_e32 4294967295, implicit $exec
				%6:vgpr_32 = V_MOV_B32_dpp %5, %0, 1, 15, 14, 0, implicit $exec
				%7:vgpr_32 = V_AND_B32_e32 %1, %6, implicit $exec

				%8:vgpr_32 = V_MOV_B32_e32 -2147483648, implicit $exec
				%9:vgpr_32 = V_MOV_B32_dpp %8, %0, 1, 14, 15, 0, implicit $exec
				%10:vgpr_32 = V_MAX_I32_e32 %1, %9, implicit $exec

				%11:vgpr_32 = V_MOV_B32_e32 2147483647, implicit $exec
				%12:vgpr_32 = V_MOV_B32_dpp %11, %0, 1, 15, 14, 0, implicit $exec
				%13:vgpr_32 = V_MIN_I32_e32 %1, %12, implicit $exec

				%14:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
				%15:vgpr_32 = V_MOV_B32_dpp %14, %0, 1, 14, 15, 0, implicit $exec
				%16:vgpr_32 = V_SUB_I32_e32 %1, %15, implicit-def $vcc, implicit $exec

				; this cannot be combined because an immediate in src0 can't be commuted
				%17:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
				%18:vgpr_32 = V_MOV_B32_dpp %17, %0, 1, 14, 15, 0, implicit $exec
				%19:vgpr_32 = V_ADD_I32_e32 5, %18, implicit-def $vcc, implicit $exec
				...

				# check for floating point modifiers
				# CHECK-LABEL: name: add_f32_e64
				# CHECK: %3:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 15, 15, 1, implicit $exec
				# CHECK: %4:vgpr_32 = V_ADD_F32_e64 0, %3, 0, %0, 0, 1, implicit $exec
				# CHECK: %6:vgpr_32 = V_ADD_F32_dpp %2, 0, %1, 0, %0, 1, 15, 15, 1, implicit $exec
				# CHECK: %8:vgpr_32 = V_ADD_F32_dpp %2, 1, %1, 2, %0, 1, 15, 15, 1, implicit $exec
				# CHECK: %10:vgpr_32 = V_ADD_F32_e64 4, %9, 8, %0, 0, 0, implicit $exec

				name: add_f32_e64
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $vgpr0, $vgpr1

				%0:vgpr_32 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1
				%2:vgpr_32 = IMPLICIT_DEF

				; this shouldn't be combined as omod is set
				%3:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 15, 15, 1, implicit $exec
				%4:vgpr_32 = V_ADD_F32_e64 0, %3, 0, %0, 0, 1, implicit $exec

				; this should be combined as all modifiers are default
				%5:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 15, 15, 1, implicit $exec
				%6:vgpr_32 = V_ADD_F32_e64 0, %5, 0, %0, 0, 0, implicit $exec

				; this should be combined as modifiers other than abs|neg are default
				%7:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 15, 15, 1, implicit $exec
				%8:vgpr_32 = V_ADD_F32_e64 1, %7, 2, %0, 0, 0, implicit $exec

				; this shouldn't be combined as modifiers aren't abs|neg
				%9:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 15, 15, 1, implicit $exec
				%10:vgpr_32 = V_ADD_F32_e64 4, %9, 8, %0, 0, 0, implicit $exec
				...

				# tests on sequences of dpp consumers
				# CHECK-LABEL: name: dpp_seq
				# CHECK: %4:vgpr_32 = V_ADD_I32_dpp %1, %0, %1, 1, 14, 15, 0, implicit-def $vcc, implicit $exec
				# CHECK: %5:vgpr_32 = V_SUBREV_I32_dpp %1, %0, %1, 1, 14, 15, 0, implicit-def $vcc, implicit $exec
				# CHECK: %6:vgpr_32 = V_OR_B32_dpp %1, %0, %1, 1, 14, 15, 0, implicit $exec
				# broken sequence:
				# CHECK: %7:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 0, implicit $exec

				name: dpp_seq
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $vgpr0, $vgpr1
				%0:vgpr_32 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1
				%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec

				%3:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 0, implicit $exec
				%4:vgpr_32 = V_ADD_I32_e32 %3, %1, implicit-def $vcc, implicit $exec
				%5:vgpr_32 = V_SUB_I32_e32 %1, %3, implicit-def $vcc, implicit $exec
				%6:vgpr_32 = V_OR_B32_e32 %3, %1, implicit $exec

				%7:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 14, 15, 0, implicit $exec
				%8:vgpr_32 = V_ADD_I32_e32 %7, %1, implicit-def $vcc, implicit $exec
				; this breaks the sequence
				%9:vgpr_32 = V_SUB_I32_e32 5, %7, implicit-def $vcc, implicit $exec
				...

				# old reg def is in a different BB - cannot combine
				# CHECK-LABEL: name: old_in_diff_bb
				# CHECK: %3:vgpr_32 = V_MOV_B32_dpp %2, %1, 1, 1, 1, 0, implicit $exec

				name: old_in_diff_bb
				tracksRegLiveness: true
				body: |
				bb.0:
				successors: %bb.1
				liveins: $vgpr0, $vgpr1

				%0:vgpr_32 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1
				%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
				S_BRANCH %bb.1

				bb.1:
				%3:vgpr_32 = V_MOV_B32_dpp %2, %1, 1, 1, 1, 0, implicit $exec
				%4:vgpr_32 = V_ADD_U32_e32 %3, %0, implicit $exec
				...

				# old reg def is in a different BB but bound_ctrl:0 is set - can combine
				# CHECK-LABEL: name: old_in_diff_bb_bctrl_zero
				# CHECK: %4:vgpr_32 = V_ADD_U32_dpp {{%[0-9]}}, %0, %1, 1, 15, 15, 1, implicit $exec

				name: old_in_diff_bb_bctrl_zero
				tracksRegLiveness: true
				body: |
				bb.0:
				successors: %bb.1
				liveins: $vgpr0, $vgpr1

				%0:vgpr_32 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1
				%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
				S_BRANCH %bb.1

				bb.1:
				%3:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 15, 15, 1, implicit $exec
				%4:vgpr_32 = V_ADD_U32_e32 %3, %1, implicit $exec
				...

				# EXEC mask changed between def and use - cannot combine
				# CHECK-LABEL: name: exec_changed
				# CHECK: %3:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 15, 15, 1, implicit $exec

				name: exec_changed
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $vgpr0, $vgpr1

				%0:vgpr_32 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1
				%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
				%3:vgpr_32 = V_MOV_B32_dpp %2, %0, 1, 15, 15, 1, implicit $exec
				%4:vgpr_32 = V_ADD_U32_e32 %3, %1, implicit $exec
				%5:sreg_64 = COPY $exec, implicit-def $exec
				%6:vgpr_32 = V_ADD_U32_e32 %3, %1, implicit $exec
				...

				# test if $old definition is correctly tracked through subreg manipulation pseudos

				# CHECK-LABEL: name: mul_old_subreg
				# CHECK: %7:vgpr_32 = V_MUL_I32_I24_dpp %0.sub1, %1, %0.sub1, 1, 1, 1, 0, implicit $exec

				name: mul_old_subreg
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $vgpr0, $vgpr1

				%0:vreg_64 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1
				%2:vgpr_32 = V_MOV_B32_e32 1, implicit $exec
				%3:vgpr_32 = V_MOV_B32_e32 42, implicit $exec
				%4:vreg_64 = REG_SEQUENCE %2, %subreg.sub0, %3, %subreg.sub1
				%5:vreg_64 = INSERT_SUBREG %4, %1, %subreg.sub1 ; %5.sub0 is taken from %4
				%6:vgpr_32 = V_MOV_B32_dpp %5.sub0, %1, 1, 1, 1, 0, implicit $exec
				%7:vgpr_32 = V_MUL_I32_I24_e32 %6, %0.sub1, implicit $exec
				...

				# CHECK-LABEL: name: add_old_subreg
				# CHECK: %5:vgpr_32 = V_ADD_U32_dpp %0.sub1, %1, %0.sub1, 1, 1, 1, 0, implicit $exec

				name: add_old_subreg
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $vgpr0, $vgpr1

				%0:vreg_64 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1
				%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
				%3:vreg_64 = INSERT_SUBREG %0, %2, %subreg.sub1 ; %3.sub1 is inserted
				%4:vgpr_32 = V_MOV_B32_dpp %3.sub1, %1, 1, 1, 1, 0, implicit $exec
				%5:vgpr_32 = V_ADD_U32_e32 %4, %0.sub1, implicit $exec
				...

				# CHECK-LABEL: name: add_old_subreg_undef
				# CHECK: %5:vgpr_32 = V_ADD_U32_dpp %3.sub1, %1, %0.sub1, 1, 15, 15, 1, implicit $exec

				name: add_old_subreg_undef
				tracksRegLiveness: true
				body: |
				bb.0:
				liveins: $vgpr0, $vgpr1

				%0:vreg_64 = COPY $vgpr0
				%1:vgpr_32 = COPY $vgpr1
				%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
				%3:vreg_64 = REG_SEQUENCE %2, %subreg.sub0 ; %3.sub1 is undef
				%4:vgpr_32 = V_MOV_B32_dpp %3.sub1, %1, 1, 15, 15, 1, implicit $exec
				%5:vgpr_32 = V_ADD_U32_e32 %4, %0.sub1, implicit $exec
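
The cases described in the comments of dpp_combine.mir above can be summarized as a small decision table. The following Python sketch is only an illustration of those comments, not the pass's actual implementation: the function name `combine`, the `Combined` tuple and the `IDENTITY` table are hypothetical, and the table lists only the opcodes and immediates that appear in the tests shown here. Two branches are assumptions of the sketch rather than something these tests state: a zero old value that is not the identity of the consumer is rejected when masks are partial, and a nonzero identity with bound_ctrl:0 on and full masks is conservatively rejected as well.

```python
from typing import NamedTuple, Optional

FULL_MASK = 0xF  # all rows / all banks enabled

class Combined(NamedTuple):
    old: str          # "undef" or "src1" (hypothetical labels for the new old operand)
    bound_ctrl: bool  # bound_ctrl:0 setting of the combined DPP VALU op

# Identity immediates as 32-bit patterns, taken from the tests above.
IDENTITY = {
    "V_ADD_U32": 0,
    "V_OR_B32": 0,
    "V_MUL_U32_U24": 1,
    "V_AND_B32": 0xFFFFFFFF,
    "V_MAX_I32": 0x80000000,  # INT32_MIN
    "V_MIN_I32": 0x7FFFFFFF,  # INT32_MAX
}

def combine(op: str, old_imm: int, row_mask: int, bank_mask: int,
            bound_ctrl: bool, is_vop2: bool = True) -> Optional[Combined]:
    """Return how the DPP mov may be folded into its consumer, or None."""
    old_imm &= 0xFFFFFFFF
    masks_full = row_mask == FULL_MASK and bank_mask == FULL_MASK

    # Cases 1 and 2: old is zero and nothing is masked off, so old can
    # become undef with bound_ctrl:0 switched on.
    if old_imm == 0 and masks_full:
        return Combined("undef", True)

    # A unary op has no second source that could stand in for "unchanged".
    if not is_vop2:
        return None

    # Assumption of this sketch: old must be the identity of the consumer.
    if old_imm != IDENTITY.get(op):
        return None

    # Cases 3 and 4: zero identity with partial masks, old becomes src1.
    if old_imm == 0:
        return Combined("src1", False)

    # Nonzero identity: combinable only while bound_ctrl:0 is off.
    if bound_ctrl:
        return None
    return Combined("src1", False)
```

For example, `combine("V_ADD_U32", 0, 14, 15, True)` corresponds to case 3 and yields old = src1 with bound_ctrl off, which matches the CHECK line for %8 in old_is_0.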

test/CodeGen/AMDGPU/dpp_combine_subregs.mir

This file was deleted.

	# RUN: llc -march=amdgcn -mcpu=tonga -run-pass=gcn-dpp-combine -o - %s | FileCheck %s

	# test if $old definition is correctly tracked through subreg manipulation pseudos

	---
	# CHECK-LABEL: name: mul_old_subreg
	# CHECK: %7:vgpr_32 = V_MUL_I32_I24_dpp %0.sub1, %1, %0.sub1, 1, 1, 1, 0, implicit $exec

	name: mul_old_subreg
	tracksRegLiveness: true
	registers:
	- { id: 0, class: vreg_64 }
	- { id: 1, class: vgpr_32 }
	- { id: 2, class: vgpr_32 }
	- { id: 3, class: vgpr_32 }
	- { id: 4, class: vreg_64 }
	- { id: 5, class: vreg_64 }
	- { id: 6, class: vgpr_32 }
	- { id: 7, class: vgpr_32 }

	liveins:
	- { reg: '$vgpr0', virtual-reg: '%0' }
	- { reg: '$vgpr1', virtual-reg: '%1' }
	body: |
	bb.0:
	liveins: $vgpr0, $vgpr1

	%0:vreg_64 = COPY $vgpr0
	%1:vgpr_32 = COPY $vgpr1
	%2:vgpr_32 = V_MOV_B32_e32 1, implicit $exec
	%3:vgpr_32 = V_MOV_B32_e32 42, implicit $exec
	%4 = REG_SEQUENCE %2, %subreg.sub0, %3, %subreg.sub1
	%5 = INSERT_SUBREG %4, %1, %subreg.sub1 ; %5.sub0 is taken from %4
	%6:vgpr_32 = V_MOV_B32_dpp %5.sub0, %1, 1, 1, 1, 0, implicit $exec
	%7:vgpr_32 = V_MUL_I32_I24_e32 %6, %0.sub1, implicit $exec
	...

	# CHECK-LABEL: name: add_old_subreg
	# CHECK: [[OLD:\%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
	# CHECK: %5:vgpr_32 = V_ADD_U32_dpp [[OLD]], %1, %0.sub1, 1, 1, 1, 1, implicit $exec

	name: add_old_subreg
	tracksRegLiveness: true
	registers:
	- { id: 0, class: vreg_64 }
	- { id: 1, class: vgpr_32 }
	- { id: 2, class: vgpr_32 }
	- { id: 3, class: vreg_64 }
	- { id: 4, class: vgpr_32 }
	- { id: 5, class: vgpr_32 }

	liveins:
	- { reg: '$vgpr0', virtual-reg: '%0' }
	- { reg: '$vgpr1', virtual-reg: '%1' }
	body: |
	bb.0:
	liveins: $vgpr0, $vgpr1

	%0:vreg_64 = COPY $vgpr0
	%1:vgpr_32 = COPY $vgpr1
	%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
	%3:vreg_64 = INSERT_SUBREG %0, %2, %subreg.sub1 ; %3.sub1 is inserted
	%4:vgpr_32 = V_MOV_B32_dpp %3.sub1, %1, 1, 1, 1, 0, implicit $exec
	%5:vgpr_32 = V_ADD_U32_e32 %4, %0.sub1, implicit $exec
	...

	# CHECK-LABEL: name: add_old_subreg_undef
	# CHECK: %5:vgpr_32 = V_ADD_U32_dpp %3.sub1, %1, %0.sub1, 1, 1, 1, 0, implicit $exec

	name: add_old_subreg_undef
	tracksRegLiveness: true
	registers:
	- { id: 0, class: vreg_64 }
	- { id: 1, class: vgpr_32 }
	- { id: 2, class: vgpr_32 }
	- { id: 3, class: vreg_64 }
	- { id: 4, class: vgpr_32 }
	- { id: 5, class: vgpr_32 }

	liveins:
	- { reg: '$vgpr0', virtual-reg: '%0' }
	- { reg: '$vgpr1', virtual-reg: '%1' }
	body: |
	bb.0:
	liveins: $vgpr0, $vgpr1

	%0:vreg_64 = COPY $vgpr0
	%1:vgpr_32 = COPY $vgpr1
	%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
	%3:vreg_64 = REG_SEQUENCE %2, %subreg.sub0 ; %3.sub1 is undef
	%4:vgpr_32 = V_MOV_B32_dpp %3.sub1, %1, 1, 1, 1, 0, implicit $exec
	%5:vgpr_32 = V_ADD_U32_e32 %4, %0.sub1, implicit $exec
	...

	# CHECK-LABEL: name: add_f32_e64
	# CHECK: %3:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 1, 1, 1, implicit $exec
	# CHECK: %4:vgpr_32 = V_ADD_F32_e64 0, %3, 0, %0, 0, 1, implicit $exec
	# CHECK: %6:vgpr_32 = V_ADD_F32_dpp %2, 0, %1, 0, %0, 1, 1, 1, 1, implicit $exec
	# CHECK: %7:vgpr_32 = V_ADD_F32_dpp %2, 1, %1, 2, %0, 1, 1, 1, 1, implicit $exec
	# CHECK: %9:vgpr_32 = V_ADD_F32_e64 4, %8, 8, %0, 0, 0, implicit $exec

	name: add_f32_e64
	tracksRegLiveness: true
	registers:
	- { id: 0, class: vgpr_32 }
	- { id: 1, class: vgpr_32 }
	- { id: 2, class: vgpr_32 }
	- { id: 3, class: vgpr_32 }
	- { id: 4, class: vgpr_32 }
	- { id: 5, class: vgpr_32 }
	- { id: 6, class: vgpr_32 }
	- { id: 7, class: vgpr_32 }
	- { id: 8, class: vgpr_32 }
	- { id: 9, class: vgpr_32 }

	liveins:
	- { reg: '$vgpr0', virtual-reg: '%0' }
	- { reg: '$vgpr1', virtual-reg: '%1' }
	body: |
	bb.0:
	liveins: $vgpr0, $vgpr1

	%0:vgpr_32 = COPY $vgpr0
	%1:vgpr_32 = COPY $vgpr1
	%2:vgpr_32 = IMPLICIT_DEF
	%3:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 1, 1, 1, implicit $exec

	; this shouldn't be combined as omod is set
	%4:vgpr_32 = V_ADD_F32_e64 0, %3, 0, %0, 0, 1, implicit $exec

	%5:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 1, 1, 1, implicit $exec

	; this should be combined as all modifiers are default
	%6:vgpr_32 = V_ADD_F32_e64 0, %5, 0, %0, 0, 0, implicit $exec

	; this should be combined as modifiers other than abs|neg are default
	%7:vgpr_32 = V_ADD_F32_e64 1, %5, 2, %0, 0, 0, implicit $exec

	%8:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 1, 1, 1, implicit $exec

	; this shouldn't be combined as modifiers aren't abs|neg
	%9:vgpr_32 = V_ADD_F32_e64 4, %8, 8, %0, 0, 0, implicit $exec
	...
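
The add_f32_e64 test above exercises the floating point modifier check: the consumer is combined only when its source modifiers are limited to abs|neg and the output modifier is left at its default. A minimal Python sketch of that rule follows. The function name `modifiers_allow_dpp` is hypothetical, the bit values NEG = 1 and ABS = 2 are assumed from the operand values used in the test (1 and 2 combine, 4 and 8 do not), and treating clamp the same way as omod is an assumption of the sketch, since the test only sets omod.

```python
NEG = 1 << 0  # assumed encoding of the neg source modifier
ABS = 1 << 1  # assumed encoding of the abs source modifier

def modifiers_allow_dpp(src0_mods: int, src1_mods: int,
                        clamp: int, omod: int) -> bool:
    """True if a VOP3 (e64) consumer could be rewritten as a DPP op."""
    allowed = NEG | ABS
    # Any source modifier bit other than abs/neg blocks the combine.
    if (src0_mods | src1_mods) & ~allowed:
        return False
    # Output modifiers must be at their defaults (clamp is an assumption here).
    return clamp == 0 and omod == 0
```

The arguments follow the operand order used by V_ADD_F32_e64 in the test: src0 modifiers, src1 modifiers, clamp, omod.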

AMDGPU: Fix DPP combiner (Closed, Public)

Diff 183315

include/llvm/CodeGen/TargetInstrInfo.h

lib/Target/AMDGPU/GCNDPPCombine.cpp

lib/Target/AMDGPU/SIInstrInfo.h

lib/Target/AMDGPU/SIInstrInfo.cpp

test/CodeGen/AMDGPU/dpp_combine.ll

test/CodeGen/AMDGPU/dpp_combine.mir

test/CodeGen/AMDGPU/dpp_combine_subregs.mir
