This is an archive of the discontinued LLVM Phabricator instance.

lib/Target/AMDGPU/GCNDPPCombine.cpp
261	I think you have AND and OR mixed up... the identity for AND should be -1, and OR should be 0, right?
265	If the old operand is the identity in the original mov, then in the transformed instruction the old operand should be the un-swizzled source, not the identity again. I think this is the reason the add tests still fail on radv, since it generates something like:

	v_mov_b32_e32 v5, 0
	s_nop 1
	v_add_u32_dpp v5, vcc, v4, v4 row_shr:4 row_mask:0xf bank_mask:0xe

from

	%88 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %87, i32 276, i32 15, i32 14, i1 false) #2
	%89 = add i32 %87, %88

when what we really want is

	v_add_u32_dpp v4, vcc, v4, v4 row_shr:4 row_mask:0xf bank_mask:0xe

Connor, indeed, my bad, I'll try to fix this in a couple of days.

Thank you,
Valery

Fixed the issue with identity values and other cases; f32/f16 identity values will be added later.

fma/mac instructions are disabled for now.

The test is fully reworked, with comments added.

Connor, can you please take a look at the test in this update (first half)? I described the different cases to combine there. If possible, please also run the patch on the radv tests (the pass has to be switched on manually).

The pass is still disabled by default.

If there are no strong objections, I would like to submit my latest patch here, since the pass is disabled and the patch looks better anyway. Otherwise I'll return from NY holidays only on Jan 11.

This revision was not accepted when it landed; it landed in state Needs Review. Jan 9 2019, 5:49 AM

Closed by commit rL350721: [AMDGPU] Fix DPP combiner (authored by vpykhtin).

This revision was automatically updated to reflect the committed changes.

Hi Connor,

I submitted my latest patch; the pass is still disabled by default. Can you please run the RADV tests now (with the pass manually enabled)?

The patch was reverted as not reviewed. Uploaded rebased diff.

reopening revision.

Sorry, I just got back from break this week. I've run CTS with the pass enabled, and it now passes, although it seems most of the patterns we use don't get folded. Firstly AND, XOR, unsigned max, and unsigned min are most troubling, since the code that gets generated looks like it should be optimized:

	v_mov_b32_e32 v8, -1                                          ; 7E1002C1
	s_nop 1                                                       ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
	v_and_b32_e32 v2, v2, v8                                      ; 26041102

as well as

	v_mov_b32_e32 v8, 0                                                         ; 7E100280
	s_nop 1                                                                     ; BF800001
	v_mov_b32_dpp v8, v2  row_shr:8 row_mask:0xf bank_mask:0xc                  ; 7E1002FA FC011802
	v_max_u32_e32 v2, v2, v8                                                    ; 1E041102

and

	v_mov_b32_e32 v8, -1                                          ; 7E1002C1
	s_nop 1                                                       ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
	v_min_u32_e32 v2, v2, v8                                      ; 1C041102

and finally

	v_mov_b32_e32 v8, 0                                                         ; 7E100280
	s_nop 1                                                                     ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf               ; 7E1002FA AF014202
	v_xor_b32_e32 v2, v2, v8                                                    ; 2A041102

Anyways, the other cases look like maybe some other clever optimization for the immediate is hindering this one, for example this with signed minimum:

	v_bfrev_b32_e32 v8, -2                                        ; 7E1058C2
	s_nop 1                                                       ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
	v_min_i32_e32 v2, v2, v8                                      ; 18041102

Maybe this pass needs to be moved earlier in the pipeline?

I'll take a more detailed look soon. But I'd also like to say that I'm a little worried that there are currently no integration tests that use DPP besides the Vulkan CTS, so there is currently no way for anyone working on this from the ROCm/compute side to test this pass properly, which means that none of you can work on this with any confidence. I think this exchange has clearly shown that unit tests aren't enough, since they only show that the code does what you think it does, not whether what you think it does is correct :) Maybe we should fix the atomic optimizer first so that it does something similar to what radv and AMDVLK do, to make it easier to test? Or maybe you just need to go write some tests.

Thank you, Connor.

I'll go through the cases you described. There may be issues with instruction commutation (the DPP src reg should be src0) or other problems (I have already seen e64 instructions that cannot be converted to e32). I would appreciate it if you could attach the .ll file for your cases.

Speaking about testing, the test I added isn't just a unit test but it contains situations (each commented) the pass is supposed to combine. I think we should review the test and agree if it does the right thing first.

@cwabbott I planned to do a followup once this DPP change had landed to add the missing dpp/codegen patterns to the atomic optimizer - so watch this space!

I do agree though that we should probably add the DPP patterns that we want to be 100% certain get combined into a test case, I'll work with @vpykhtin to add this.

Anyways, the other cases look like maybe some other clever optimization for the immediate is hindering this one, for example this with signed minimum:
	v_bfrev_b32_e32 v8, -2                                        ; 7E1058C2
	s_nop 1                                                       ; BF800001
	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
	v_min_i32_e32 v2, v2, v8                                      ; 18041102
Maybe this pass needs to be moved earlier in the pipeline?

I'm not sure I can insert the pass that high, I'll think of how it can be skipped.

This kind of immediate initialization optimization is performed by the SIShrinkInstructions pass, which runs after the DPP combiner pass, so the issue here is different. It would be nice to have the .ll file.

I figured it would be a little easier if I looked at these cases by myself. It turns out there are more problems with isIdentityValue, including some correctness issues. After fixing these, everything works correctly now.

lib/Target/AMDGPU/GCNDPPCombine.cpp
264	You have min and max mixed up... the identity for min is the maximum possible value, not the minimum. The same goes for signed and unsigned min/max. Also, you can add V_XOR with an identity value of 0 here.
271	It turns out that LLVM sign-extends the immediate to 64 bits sometime during instruction selection, which means this won't work for any MIR generated by LLVM IR. We should be ignoring the high 32 bits when doing all these comparisons, since they're irrelevant.
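
For reference, the identities discussed in these comments can be written down as a small standalone check. This is only a sketch: the `Op` enum and the `identityBits` and `isIdentityImm` helpers are hypothetical names for illustration and are not part of GCNDPPCombine.cpp. It compares only the low 32 bits of the immediate, on the assumption (taken from the comment above) that the immediate may arrive sign-extended to 64 bits.

```cpp
#include <cstdint>

// Hypothetical operation tags for illustration; not the AMDGPU opcode enum.
enum class Op { Add, Or, Xor, And, UMax, UMin, SMax, SMin };

// Bit pattern of the identity element e of each 32-bit operation,
// i.e. the value for which op(x, e) == x for every x.
constexpr uint32_t identityBits(Op op) {
  return (op == Op::And || op == Op::UMin) ? 0xFFFFFFFFu  // all ones
         : (op == Op::SMax)                ? 0x80000000u  // INT32_MIN
         : (op == Op::SMin)                ? 0x7FFFFFFFu  // INT32_MAX
                                           : 0u;  // add, or, xor, umax
}

// Compare only the low 32 bits: the high half of a sign-extended
// immediate carries no information for a 32-bit operation.
constexpr bool isIdentityImm(Op op, int64_t imm) {
  return static_cast<uint32_t>(imm) == identityBits(op);
}

static_assert(isIdentityImm(Op::And, -1), "AND identity is all ones");
static_assert(isIdentityImm(Op::Or, 0), "OR identity is zero");
static_assert(isIdentityImm(Op::UMin, -1), "sign-extended all-ones matches");
static_assert(!isIdentityImm(Op::SMin, INT32_MIN),
              "the identity for min is the maximum value");
```

The `static_assert` lines double as the usage example: a 64-bit `-1` and a zero-extended `0xFFFFFFFF` are both accepted as the AND identity because the comparison truncates first.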

Thanks! It's surprising how easy it is to get confused there. I updated the diff with the latest found problems fixed.

vpykhtin marked 2 inline comments as done. Jan 11 2019, 4:00 AM

I think we have reached the state where this can be submitted (and probably enabled with a subsequent patch). This would allow any of us to make other fixes if required.

cwabbott added inline comments. Jan 14 2019, 10:27 AM

lib/Target/AMDGPU/GCNDPPCombine.cpp
363	Indentation is off here, seems like your editor is using hard tabs
364	I don't think this is correct. A lane can be set to 0 for the shift DPP controls (e.g. `DPP_ROW_SL1`) or if EXEC is zero for the lane to be read, in which case the user will do something non-trivial (e.g. OR will pass through the other operand) which we can't emulate in general.
369	Also here
384	This seems incorrect to me. In the lanes where the source is invalid or the row/bank mask is 0, the DPP move will act as a no-op here, so if that lane is then added, or'd, etc. with something else, we can't emulate that with a single instruction.

Hi Valery, I really like the way the different cases are listed in the explanatory comment at the top of the file, and I believe those cases are correct. Would it be possible to restructure the code in a way that follows those cases? I think that would make it much easier to follow.

That is, in combineDPPMov, can you restructure the first do/while construct so that it mirrors the cases of the top of file comment and defines new variables CombinedOld and CombinedBoundCtrl? Come to think of it, it may then be wise to move these top of file comments into the function to increase the chances that they will be kept up-to-date in the future.
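
As a rough illustration of the shape such a restructuring could take, the sketch below transcribes the combining rules from the top-of-file comment (right-hand column of the diff further down) into a single decision function. It is not the code from the patch: `MovInfo`, `Decision`, `CombinedOld` and `decide` are hypothetical names, and the boolean fields stand in for the operand analysis that the real pass performs on MachineInstrs.

```cpp
// Hypothetical summary of what is known about the V_MOV_B32_dpp operands.
struct MovInfo {
  bool maskAllLanes;   // row_mask == 0xF && bank_mask == 0xF
  bool boundCtrlZero;  // $bound_ctrl == DPP_BOUND_ZERO
  bool oldIsUndef;     // $old is undef
  bool oldIsZero;      // $old is the immediate 0
  bool oldIsIdentity;  // $old is the identity value of the VALU op
  bool oldIsDppSrc;    // $old == $vgpr_to_be_read_from_other_lane
};

enum class CombinedOld { KeepOld, Undef, FoldedIdentity };

struct Decision {
  bool combine;                // false means "cancel"
  CombinedOld combinedOld;     // meaningful only if combine is true
  bool combinedBoundCtrlZero;  // meaningful only if combine is true
};

// The rules are checked in the order in which the comment lists them.
inline Decision decide(const MovInfo &m) {
  // Rule 1: all lanes enabled and $old is the DPP source itself.
  if (m.maskAllLanes && m.oldIsDppSrc)
    return {true, CombinedOld::KeepOld, m.boundCtrlZero};
  // Rule 2: all lanes enabled and either bound_ctrl is zero or $old == 0.
  if (m.maskAllLanes && (m.boundCtrlZero || m.oldIsZero))
    return {true, CombinedOld::Undef, true};
  // Rule 3: $old is undef, or the identity value for the VALU op.
  if (m.oldIsUndef)
    return {true, CombinedOld::Undef, false};
  if (m.oldIsIdentity)
    return {true, CombinedOld::FoldedIdentity, false};
  // Otherwise cancel.
  return {false, CombinedOld::Undef, false};
}
```

With this layout each `if` corresponds to one case of the comment, so a change to the comment and a change to the code would naturally be made together.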

lib/Target/AMDGPU/GCNDPPCombine.cpp
348	Having both `OldOpndVGPR` and `OldOpndValue` seems redundant. You could pass `OldOpndValue` as a MachineOperand to `createDPPInst` and have it assert on `!CombinedOld \|\| CombinedOld->Reg()`. That should help cut down the code size here somewhat and make the code easier to follow.
385	These should be SmallVectors.
434	Can we simplify the code here by just commuting the OrigMI unconditionally? That is, first check whether `&Use != src0 && &Use == src1`; in that case, try commuting the instruction. If that fails, break (but I don't think there'd be a need to rollback the commutation). If it succeeds, continue down the same path that is used for the `&Use == src0` case. You end up with only a single call to `createDPPInst` and less code duplication in general.

vpykhtin marked 4 inline comments as done. Jan 15 2019, 4:04 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
363	Yea, thanks. I reinstalled my editor recently and forgot to turn off tabs. It should be ok with the latest diff.
364	Could you please show an example of how it wouldn't work? Note that the EXEC mask remains the same (by the check above) and no combining is performed when the result of the DPP mov is used in any way other than being consumed by a DPP-capable VALU instruction. I think the result of the DPP mov should be the same as the DPP src operand used in the DPP-capable VALU instruction.
384	Consider:

	mov_dpp v1, v1, ...   (v1 of the other lane is stored in v1 of this lane)
	add_u32 v0, v1, ...

This is the case when the DPP src register is stored in the VGPR with the same name in the issuing lane. This way v1 would contain the same value after an unsuccessful DPP mov (no-op) and can therefore be used in the combined VALU op.

Hi Nikolai,

thank you for the review, I'll think of how to restructure the code to match top comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
348	There was a reason I did it this way initially, though after this much time I need to recall it. If I remember correctly, the value is tracked through instructions like mov/copy and subreg manipulation, and having only the value isn't enough to obtain the reg to store in the DPP instruction.
385	Ok.
434	This is a good idea: I don't like leaving commuted instruction on rollback at some intuitive level but really don't have arguments against it at the moment :-)

cwabbott added inline comments. Jan 15 2019, 7:44 AM

lib/Target/AMDGPU/GCNDPPCombine.cpp
364	Yes, you're right. I was misremembering what bound_ctrl:0 does. Sorry!
384	I still don't see how this can work. For something like:

	mov_dpp v1, v1, ...
	add_u32 v0, v1, v2

lanes where the shared data is invalid based on the DPP ctrl or EXEC will return v1 (same lane) + v2, whereas this will transform it to something like

	add_u32_dpp v0, v1, v2, ...

which will give you v0 (undef). What's an example of a transform you're trying to accomplish?

vpykhtin marked an inline comment as done. Jan 15 2019, 8:00 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
384	Sorry, this should look like:

	mov v1, X
	mov v2, Y
	mov_dpp v1, v1, some_DPP_ctrl
	add_u32 v0, v1, v2

transformed to

	mov v1, X
	mov v2, Y
	add_u32_dpp v0, v1, v2, some_DPP_ctrl

v1 should contain X on an invalid DPP access, or X from the other lane on a valid one.

vpykhtin marked an inline comment as done. Jan 15 2019, 8:35 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
384	I forgot to mention that some_DPP_ctrl should have all masks fully enabled; otherwise the add wouldn't write its result and v0 would indeed be undef.

cwabbott added inline comments. Jan 16 2019, 1:13 AM

lib/Target/AMDGPU/GCNDPPCombine.cpp
384	This still seems wrong. For the first sequence you gave, the add_u32 will return X (same lane) + Y on invalid DPP access while add_u32_dpp will return undef (assuming v0 isn't initialized, since you set old to undef in this patch).

vpykhtin marked an inline comment as done. Jan 16 2019, 6:09 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
384	It looks like the documentation is a bit unclear here. I thought when bound ctrl is off it means hardware just uses the value of issuing lane instead of DPP read but if it disables writing the result then you're right - v0 would be undef. I'll find out and fix this.
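
The disagreement above can be made concrete with a toy lane-level simulation. This is a sketch under stated assumptions, not a model of real hardware: it covers a single 16-lane row with `row_shr`, all masks enabled and bound_ctrl off, and it assumes (as argued in the comments above) that a lane with no valid source is simply not written and so keeps the value of the tied `old` operand. All names (`Row`, `movDpp`, `addDpp`, `kUndef`, and so on) are hypothetical.

```cpp
#include <array>
#include <cstdint>

constexpr int kLanes = 16;  // a single DPP row
using Row = std::array<uint32_t, kLanes>;
constexpr uint32_t kUndef = 0xDEADBEEFu;  // sentinel standing in for "undef"

// Assumed model: with row_shr:N and bound_ctrl off, lane i reads lane i - N.
// If that lane does not exist, lane i is not written and keeps `old`.
inline bool hasSource(int lane, int shr) { return lane - shr >= 0; }

// mov_dpp dst, src row_shr:shr (dst is tied to old)
inline Row movDpp(const Row &old, const Row &src, int shr) {
  Row dst = old;
  for (int i = 0; i < kLanes; ++i)
    if (hasSource(i, shr)) dst[i] = src[i - shr];
  return dst;
}

// add_u32 dst, a, b
inline Row add(const Row &a, const Row &b) {
  Row dst{};
  for (int i = 0; i < kLanes; ++i) dst[i] = a[i] + b[i];
  return dst;
}

// add_u32_dpp dst, src0, src1 row_shr:shr (src0 is swizzled, dst tied to old)
inline Row addDpp(const Row &old, const Row &src0, const Row &src1, int shr) {
  Row dst = old;
  for (int i = 0; i < kLanes; ++i)
    if (hasSource(i, shr)) dst[i] = src0[i - shr] + src1[i];
  return dst;
}

// Case from this thread: old == DPP src (v1 = X), fused with old = undef.
inline bool sameSrcCaseMatches(const Row &x, const Row &y, int shr) {
  Row undef;
  undef.fill(kUndef);
  return add(movDpp(x, x, shr), y) == addDpp(undef, x, y, shr);
}

// Identity case: old = 0 for add, fused with old = the un-swizzled operand.
inline bool identityCaseMatches(const Row &x, const Row &y, int shr) {
  Row zero{};
  return add(movDpp(zero, x, shr), y) == addDpp(y, x, y, shr);
}
```

Under this model, `sameSrcCaseMatches` fails for any non-zero shift because the lanes without a source hold X + Y in the unfused sequence but keep the undef `old` in the fused one, while `identityCaseMatches` holds because those lanes end up with the un-swizzled operand either way, which is the situation described in the earlier comment on line 265.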

vpykhtin marked an inline comment as done. Jan 18 2019, 5:02 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
434	Actually, a pass returns a boolean indicating whether it changed something in the IR. What should it return when it does a rollback? It may create new commuted instructions, which need to be mapped in SlotIndexes, for example. I don't think it's a good idea to return true only when a rollback took place.

arsenm added inline comments. Jan 18 2019, 12:08 PM

lib/Target/AMDGPU/GCNDPPCombine.cpp
434	Can rollback be avoided? I think the usual purpose of knowing something changed is to know if iterators are still valid, which probably isn't the case if anything was modified at any point

arsenm added inline comments. Jan 18 2019, 12:38 PM

lib/Target/AMDGPU/GCNDPPCombine.cpp
434	I suppose if it's just commuting, it's ok to report no change

vpykhtin marked an inline comment as done. Jan 21 2019, 6:34 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
434	The commutation is the whole problem. I cannot know beforehand whether the instruction can be commuted. Also, an instruction copy should be created for the commutation, which means that previous analysis should be invalidated and the IR change should be reported.

vpykhtin marked an inline comment as done. Jan 21 2019, 6:42 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/GCNDPPCombine.cpp
434	I think there is no way to avoid new instructions on commutation because of reversed instructions

Fixed issue with old = dpp src register when bound ctrl is off.

Slightly refactored and simplified. The description is corrected, though it doesn't map one-to-one onto the code, as there are other things to do (such as reusing IMPLICIT_DEF instructions).

Variable names changed to be more consistent with the description.

I decided to leave the current commutation as is because it doesn't come for free: it may create new instructions (at least reversed instructions), and for that reason previous analysis (like SlotIndexes) should be invalidated.

vpykhtin added a reviewer: nhaehnle. Feb 4 2019, 8:40 AM

Herald added a project: Restricted Project. Feb 4 2019, 8:40 AM

LGTM

This revision is now accepted and ready to land. Feb 7 2019, 12:43 AM

rebased diff.

Thanks Nikolai!

Connor can you please try this patch and probably accept it?

I'm not going to read everything in detail, but the combining rules look correct to me and everything passes with this pass enabled. Feel free to re-enable it.

Closed by commit rL353513: [AMDGPU] Fix DPP combiner (authored by vpykhtin). Feb 8 2019, 3:59 AM

This revision was automatically updated to reflect the committed changes.

Thank you Connor! I really appreciate your effort on this DPP work.

I'll enable the pass with the subsequent submit.

The DPP combiner pass is enabled since rL353691, https://reviews.llvm.org/rGded96df01e95

foad mentioned this in D124182: [AMDGPU] Combine DPP mov even if old reg def is in different BB. Apr 21 2022, 11:11 AM

Revision Contents

Path	Size
include/llvm/CodeGen/TargetInstrInfo.h	7 lines
lib/Target/AMDGPU/GCNDPPCombine.cpp	104 lines
lib/Target/AMDGPU/SIInstrInfo.h	6 lines
lib/Target/AMDGPU/SIInstrInfo.cpp	26 lines
test/CodeGen/AMDGPU/dpp_combine.ll	86 lines
test/CodeGen/AMDGPU/dpp_combine_subregs.mir	109 lines

Diff 177866

include/llvm/CodeGen/TargetInstrInfo.h

... (423 lines not shown) ...
public:
/// A pair composed of a register and a sub-register index.		/// A pair composed of a register and a sub-register index.
/// Used to give some type checking when modeling Reg:SubReg.		/// Used to give some type checking when modeling Reg:SubReg.
struct RegSubRegPair {		struct RegSubRegPair {
unsigned Reg;		unsigned Reg;
unsigned SubReg;		unsigned SubReg;

RegSubRegPair(unsigned Reg = 0, unsigned SubReg = 0)		RegSubRegPair(unsigned Reg = 0, unsigned SubReg = 0)
: Reg(Reg), SubReg(SubReg) {}		: Reg(Reg), SubReg(SubReg) {}

		bool operator==(const RegSubRegPair& P) const {
		return Reg == P.Reg && SubReg == P.SubReg;
		}
		bool operator!=(const RegSubRegPair& P) const {
		return !(*this == P);
		}
};		};

/// A pair composed of a pair of a register and a sub-register index,		/// A pair composed of a pair of a register and a sub-register index,
/// and another sub-register index.		/// and another sub-register index.
/// Used to give some type checking when modeling Reg:SubReg1, SubReg2.		/// Used to give some type checking when modeling Reg:SubReg1, SubReg2.
struct RegSubRegPairAndIdx : RegSubRegPair {		struct RegSubRegPairAndIdx : RegSubRegPair {
unsigned SubIdx;		unsigned SubIdx;

... (1,275 lines not shown) ...

lib/Target/AMDGPU/GCNDPPCombine.cpp

//=======- GCNDPPCombine.cpp - optimization for DPP instructions ---==========//		//=======- GCNDPPCombine.cpp - optimization for DPP instructions ---==========//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// The pass combines V_MOV_B32_dpp instruction with its VALU uses as a DPP src0		// The pass combines V_MOV_B32_dpp instruction with its VALU uses as a DPP src0
// operand.If any of the use instruction cannot be combined with the mov the		// operand.If any of the use instruction cannot be combined with the mov the
// whole sequence is reverted.		// whole sequence is reverted.
//		//
// $old = ...		// $old = ...
// $dpp_value = V_MOV_B32_dpp $old, $vgpr_to_be_read_from_other_lane,		// $dpp_value = V_MOV_B32_dpp $old, $vgpr_to_be_read_from_other_lane,
// dpp_controls..., $bound_ctrl		// dpp_controls, $row_mask, $bank_mask, $bound_ctrl
// $res = VALU $dpp_value, ...		// $res = VALU $dpp_value, ...
//		//
// to		// to
//		//
// $res = VALU_DPP $folded_old, $vgpr_to_be_read_from_other_lane, ...,		// $res = VALU_DPP $combined_old, $vgpr_to_be_read_from_other_lane, ...,
// dpp_controls..., $folded_bound_ctrl		// dpp_controls..., $row_mask, $bank_mask, $combined_bound_ctrl
//		//
// Combining rules :		// Combining rules :
//		//
// $bound_ctrl is DPP_BOUND_ZERO, $old is any		// if $row_mask and $bank_mask are all enabled (0xF) and
// $bound_ctrl is DPP_BOUND_OFF, $old is 0		// $old == $vgpr_to_be_read_from_other_lane
		// -> $combined_old = $old
		// $combined_bound_ctrl = $bound_ctrl
		//
		// if $row_mask and $bank_mask are all enabled (0xF) and
		// $bound_ctrl==DPP_BOUND_ZERO and $old==any or
		// $bound_ctrl==DPP_BOUND_OFF and $old==0
		// -> $combined_old = undef,
		// $combined_bound_ctrl = DPP_BOUND_ZERO
		//
		// if $old==undef or $old==identity value for the VALU op
		// -> $combined_old = undef/folded identity value,
		// $combined_bound_ctrl = DPP_BOUND_OFF
//		//
// ->$folded_old = undef, $folded_bound_ctrl = DPP_BOUND_ZERO		// Othervise cancel.
// $bound_ctrl is DPP_BOUND_OFF, $old is undef
//
// ->$folded_old = undef, $folded_bound_ctrl = DPP_BOUND_OFF
// $bound_ctrl is DPP_BOUND_OFF, $old is foldable
//
// ->$folded_old = folded value, $folded_bound_ctrl = DPP_BOUND_OFF
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "AMDGPU.h"		#include "AMDGPU.h"
#include "AMDGPUSubtarget.h"		#include "AMDGPUSubtarget.h"
#include "SIInstrInfo.h"		#include "SIInstrInfo.h"
#include "MCTargetDesc/AMDGPUMCTargetDesc.h"		#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
... (151 lines not shown) ...
do {
auto *Src0 = TII->getNamedOperand(MovMI, AMDGPU::OpName::src0);		auto *Src0 = TII->getNamedOperand(MovMI, AMDGPU::OpName::src0);
assert(Src0);		assert(Src0);
if (!TII->isOperandLegal(*DPPInst.getInstr(), NumOperands, Src0)) {		if (!TII->isOperandLegal(*DPPInst.getInstr(), NumOperands, Src0)) {
LLVM_DEBUG(dbgs() << " failed: src0 is illegal\n");		LLVM_DEBUG(dbgs() << " failed: src0 is illegal\n");
Fail = true;		Fail = true;
break;		break;
}		}
DPPInst.add(*Src0);		DPPInst.add(*Src0);
		DPPInst->getOperand(NumOperands).setIsKill(false);
++NumOperands;		++NumOperands;

if (auto *Mod1 = TII->getNamedOperand(OrigMI,		if (auto *Mod1 = TII->getNamedOperand(OrigMI,
AMDGPU::OpName::src1_modifiers)) {		AMDGPU::OpName::src1_modifiers)) {
assert(NumOperands == AMDGPU::getNamedOperandIdx(DPPOp,		assert(NumOperands == AMDGPU::getNamedOperandIdx(DPPOp,
AMDGPU::OpName::src1_modifiers));		AMDGPU::OpName::src1_modifiers));
assert(0LL == (Mod1->getImm() & ~(SISrcMods::ABS \| SISrcMods::NEG)));		assert(0LL == (Mod1->getImm() & ~(SISrcMods::ABS \| SISrcMods::NEG)));
DPPInst.addImm(Mod1->getImm());		DPPInst.addImm(Mod1->getImm());
... (34 lines not shown) ...

GCNDPPCombine::RegSubRegPair		GCNDPPCombine::RegSubRegPair
GCNDPPCombine::foldOldOpnd(MachineInstr &OrigMI,		GCNDPPCombine::foldOldOpnd(MachineInstr &OrigMI,
RegSubRegPair OldOpndVGPR,		RegSubRegPair OldOpndVGPR,
MachineOperand &OldOpndValue) const {		MachineOperand &OldOpndValue) const {
assert(OldOpndValue.isImm());		assert(OldOpndValue.isImm());
switch (OrigMI.getOpcode()) {		switch (OrigMI.getOpcode()) {
default: break;		default: break;
		case AMDGPU::V_ADD_U32_e32:
		case AMDGPU::V_ADD_I32_e32:
		case AMDGPU::V_AND_B32_e32:
		case AMDGPU::V_SUBREV_U32_e32:
		case AMDGPU::V_SUBREV_I32_e32:
		if (OldOpndValue.getImm() == 0)
		return OldOpndVGPR;
		break;
		case AMDGPU::V_OR_B32_e32:
case AMDGPU::V_MAX_U32_e32:		case AMDGPU::V_MAX_U32_e32:
if (OldOpndValue.getImm() == std::numeric_limits<uint32_t>::max())		if (static_cast<uint32_t>(OldOpndValue.getImm()) ==
		std::numeric_limits<uint32_t>::max())
return OldOpndVGPR;		return OldOpndVGPR;
break;		break;
case AMDGPU::V_MAX_I32_e32:		case AMDGPU::V_MAX_I32_e32:
if (OldOpndValue.getImm() == std::numeric_limits<int32_t>::max())		if (OldOpndValue.getImm() == std::numeric_limits<int32_t>::max())
return OldOpndVGPR;		return OldOpndVGPR;
break;		break;
case AMDGPU::V_MIN_I32_e32:		case AMDGPU::V_MIN_I32_e32:
if (OldOpndValue.getImm() == std::numeric_limits<int32_t>::min())		if (OldOpndValue.getImm() == std::numeric_limits<int32_t>::min())
return OldOpndVGPR;		return OldOpndVGPR;
break;		break;

case AMDGPU::V_MUL_I32_I24_e32:		case AMDGPU::V_MUL_I32_I24_e32:
case AMDGPU::V_MUL_U32_U24_e32:		case AMDGPU::V_MUL_U32_U24_e32:
if (OldOpndValue.getImm() == 1) {		if (OldOpndValue.getImm() == 1) {
auto *Src1 = TII->getNamedOperand(OrigMI, AMDGPU::OpName::src1);		auto *Src1 = TII->getNamedOperand(OrigMI, AMDGPU::OpName::src1);
assert(Src1 && Src1->isReg());		assert(Src1 && Src1->isReg());
return getRegSubRegPair(*Src1);		return getRegSubRegPair(*Src1);
}		}
break;		break;
}		}
return RegSubRegPair();		return RegSubRegPair();
}		}

// Cases to combine:
// $bound_ctrl is DPP_BOUND_ZERO, $old is any
// $bound_ctrl is DPP_BOUND_OFF, $old is 0
// -> $old = undef, $bound_ctrl = DPP_BOUND_ZERO

// $bound_ctrl is DPP_BOUND_OFF, $old is undef
// -> $old = undef, $bound_ctrl = DPP_BOUND_OFF

// $bound_ctrl is DPP_BOUND_OFF, $old is foldable
// -> $old = folded value, $bound_ctrl = DPP_BOUND_OFF

MachineInstr *GCNDPPCombine::createDPPInst(MachineInstr &OrigMI,		MachineInstr *GCNDPPCombine::createDPPInst(MachineInstr &OrigMI,
MachineInstr &MovMI,		MachineInstr &MovMI,
RegSubRegPair OldOpndVGPR,		RegSubRegPair OldOpndVGPR,
MachineOperand *OldOpndValue,		MachineOperand *OldOpndValue,
bool BoundCtrlZero) const {		bool BoundCtrlZero) const {
assert(OldOpndVGPR.Reg);		assert(OldOpndVGPR.Reg);
if (!BoundCtrlZero && OldOpndValue) {		if (!BoundCtrlZero && OldOpndValue && OldOpndValue->isImm()) {
assert(OldOpndValue->isImm());
OldOpndVGPR = foldOldOpnd(OrigMI, OldOpndVGPR, *OldOpndValue);		OldOpndVGPR = foldOldOpnd(OrigMI, OldOpndVGPR, *OldOpndValue);
if (!OldOpndVGPR.Reg) {		if (!OldOpndVGPR.Reg) {
LLVM_DEBUG(dbgs() << " failed: old immediate cannot be folded\n");		LLVM_DEBUG(dbgs() << " failed: old immediate cannot be folded\n");
return nullptr;		return nullptr;
}		}
}		}
return createDPPInst(OrigMI, MovMI, OldOpndVGPR, BoundCtrlZero);		return createDPPInst(OrigMI, MovMI, OldOpndVGPR, BoundCtrlZero);
}		}

// returns true if MI doesn't have OpndName immediate operand or the		// returns true if MI doesn't have OpndName immediate operand or the
// operand has Value		// operand has Value
bool GCNDPPCombine::hasNoImmOrEqual(MachineInstr &MI, unsigned OpndName,		bool GCNDPPCombine::hasNoImmOrEqual(MachineInstr &MI, unsigned OpndName,
int64_t Value, int64_t Mask) const {		int64_t Value, int64_t Mask) const {
auto *Imm = TII->getNamedOperand(MI, OpndName);		auto *Imm = TII->getNamedOperand(MI, OpndName);
if (!Imm)		if (!Imm)
return true;		return true;

assert(Imm->isImm());		assert(Imm->isImm());
return (Imm->getImm() & Mask) == Value;		return (Imm->getImm() & Mask) == Value;
}		}

bool GCNDPPCombine::combineDPPMov(MachineInstr &MovMI) const {		bool GCNDPPCombine::combineDPPMov(MachineInstr &MovMI) const {
assert(MovMI.getOpcode() == AMDGPU::V_MOV_B32_dpp);		assert(MovMI.getOpcode() == AMDGPU::V_MOV_B32_dpp);
		LLVM_DEBUG(dbgs() << "\nDPP combine: " << MovMI);

		auto *DstOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::vdst);
		assert(DstOpnd && DstOpnd->isReg());
		auto DPPMovReg = DstOpnd->getReg();
		if (!isEXECMaskConstantBetweenDefAndUses(DPPMovReg, *MRI)) {
		LLVM_DEBUG(dbgs() << " failed: EXEC mask should remain the same"
		" for all uses\n");
		return false;
		}

		auto *RowMaskOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::row_mask);
		assert(RowMaskOpnd && RowMaskOpnd->isImm());
		auto *BankMaskOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::bank_mask);
		assert(BankMaskOpnd && BankMaskOpnd->isImm());
		const bool MaskAllLanes = RowMaskOpnd->getImm() == 0xF &&
		BankMaskOpnd->getImm() == 0xF;

auto *BCZOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::bound_ctrl);		auto *BCZOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::bound_ctrl);
assert(BCZOpnd && BCZOpnd->isImm());		assert(BCZOpnd && BCZOpnd->isImm());
bool BoundCtrlZero = 0 != BCZOpnd->getImm();		bool BoundCtrlZero = MaskAllLanes && 0 != BCZOpnd->getImm();

LLVM_DEBUG(dbgs() << "\nDPP combine: " << MovMI);

auto *OldOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::old);		auto *OldOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::old);
assert(OldOpnd && OldOpnd->isReg());		assert(OldOpnd && OldOpnd->isReg());
auto OldOpndVGPR = getRegSubRegPair(*OldOpnd);		auto OldOpndVGPR = getRegSubRegPair(*OldOpnd);
auto OldOpndValue = getOldOpndValue(OldOpnd);		auto OldOpndValue = getOldOpndValue(OldOpnd);
assert(!OldOpndValue \|\| OldOpndValue->isImm() \|\| OldOpndValue == OldOpnd);		assert(!OldOpndValue \|\| OldOpndValue->isImm() \|\| OldOpndValue == OldOpnd);
if (OldOpndValue) {		if (OldOpndValue) {
if (BoundCtrlZero) {		if (BoundCtrlZero) {
OldOpndVGPR.Reg = AMDGPU::NoRegister; // should be undef, ignore old opnd		OldOpndVGPR.Reg = AMDGPU::NoRegister; // should be undef, ignore old opnd
OldOpndValue = nullptr;		OldOpndValue = nullptr;
} else {		} else {
if (!OldOpndValue->isImm()) {		if (OldOpndValue->getParent()->getParent() != MovMI.getParent()) {
LLVM_DEBUG(dbgs() << " failed: old operand isn't an imm or undef\n");		LLVM_DEBUG(dbgs() << " failed: old reg def and mov should"
		" be in the same BB\n");
return false;		return false;
}		}
if (OldOpndValue->getImm() == 0) {		if (OldOpndValue->isImm()) {
		if (MaskAllLanes && OldOpndValue->getImm() == 0) {
OldOpndVGPR.Reg = AMDGPU::NoRegister; // should be undef		OldOpndVGPR.Reg = AMDGPU::NoRegister; // should be undef
cwabbott: Indentation is off here; it seems your editor is using hard tabs.
vpykhtin: Yes, thanks. I reinstalled my editor recently and forgot to turn off tabs. It should be OK in the latest diff.
OldOpndValue = nullptr;		OldOpndValue = nullptr;
cwabbott: I don't think this is correct. A lane can be set to 0 for the shift DPP controls (e.g. `DPP_ROW_SL1`) or if EXEC is zero for the lane to be read, in which case the user will do something non-trivial (e.g. OR will pass through the other operand) which we can't emulate in general.
vpykhtin: Could you please show an example of how it wouldn't work? Note that the EXEC mask remains the same (by the check above) and no combining is performed when the result of the DPP mov is used in any way other than being consumed by a DPP-capable VALU instruction. I think the result of the DPP mov should be the same as the DPP src operand used in the DPP-capable VALU instruction.
cwabbott: Yes, you're right. I was misremembering what bound_ctrl:0 does. Sorry!
BoundCtrlZero = true;		BoundCtrlZero = true;
}		}
		} else {
		auto *DPPSrcOpnd = TII->getNamedOperand(MovMI, AMDGPU::OpName::src0);
		assert(DPPSrcOpnd && DPPSrcOpnd->isReg());
cwabbott: Also here.
		if (!MaskAllLanes \|\| getRegSubRegPair(*DPPSrcOpnd) != OldOpndVGPR) {
		LLVM_DEBUG(dbgs() << " failed: the DPP mov isn't combinable\n");
		return false;
		}
		}
}		}
}		}

LLVM_DEBUG(dbgs() << " old=";		LLVM_DEBUG(dbgs() << " old=";
if (!OldOpndValue)		if (!OldOpndValue)
dbgs() << "undef";		dbgs() << "undef";
else		else
dbgs() << OldOpndValue->getImm();		dbgs() << *OldOpndValue;
dbgs() << ", bound_ctrl=" << BoundCtrlZero << '\n');		dbgs() << ", bound_ctrl=" << BoundCtrlZero << '\n');

cwabbott: This seems incorrect to me. In the lanes where the source is invalid or the row/bank mask is 0, the DPP move will act as a no-op here, so if that lane is then added, or'd, etc. with something else, we can't emulate that with a single instruction.
vpykhtin:
    mov_dpp v1, v1, ...   (v1 of other lane is stored in v1 of this lane)
    add_u32 v0, v1, ...
This is the case when the DPP src register is stored in the VGPR with the same name in the issuing lane. This way v1 would contain the same value after an unsuccessful DPP mov (no-op) and can therefore be used in the combined VALU op.
cwabbott: I still don't see how this can work. For something like:
    mov_dpp v1, v1, ...
    add_u32 v0, v1, v2
lanes where the shared data is invalid based on the DPP ctrl or EXEC will return v1 (same lane) + v2, whereas this will transform it to something like
    add_u32_dpp v0, v1, v2, ...
which will give you v0 (undef). What's an example of a transform you're trying to accomplish?
vpykhtin: Sorry, this should look like:
    mov v1, X
    mov v2, Y
    mov_dpp v1, v1, some_DPP_ctrl
    add_u32 v0, v1, v2
transformed to
    mov v1, X
    mov v2, Y
    add_u32_dpp v0, v1, v2, some_DPP_ctrl
v1 should contain X on an invalid DPP access, or X from the other lane on a valid one.
vpykhtin: I forgot to mention that some_DPP_ctrl should have all masks fully enabled; otherwise the add wouldn't write its result and v0 is undef indeed.
cwabbott: This still seems wrong. For the first sequence you gave, the add_u32 will return X (same lane) + Y on an invalid DPP access, while add_u32_dpp will return undef (assuming v0 isn't initialized, since you set old to undef in this patch).
vpykhtin: It looks like the documentation is a bit unclear here. I thought that when bound ctrl is off, the hardware just uses the value of the issuing lane instead of the DPP read, but if it disables writing the result then you're right: v0 would be undef. I'll find out and fix this.
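The disagreement in this thread can be checked with a small per-lane model. The sketch below is not LLVM code and not a hardware specification: it assumes a hypothetical 4-lane row, a single "shift right by one lane" DPP control, bound_ctrl off, and the behaviour the reviewers settle on (an invalid read leaves the destination holding `old`). All names (`Lanes`, `movDpp`, `addDpp`, `movThenAdd`, `kSrc`, `kOther`, `kZero`) are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

constexpr int kLanes = 4;
using Lanes = std::array<uint32_t, kLanes>;

// Toy DPP control: each lane reads from the lane to its left.
// Lane 0 has no source lane, so its read is invalid.
std::optional<int> srcLaneShr1(int Lane) {
  if (Lane == 0)
    return std::nullopt;
  return Lane - 1;
}

// Models "mov_dpp dst, src" with bound_ctrl off: lanes with an invalid
// read are not written and keep the value of Old.
Lanes movDpp(const Lanes &Old, const Lanes &Src) {
  Lanes Dst = Old;
  for (int L = 0; L < kLanes; ++L)
    if (auto S = srcLaneShr1(L))
      Dst[L] = Src[*S];
  return Dst;
}

// Models "add_dpp dst, src0, src1" with bound_ctrl off: src0 is read
// through the DPP control; lanes with an invalid read keep Old.
Lanes addDpp(const Lanes &Old, const Lanes &Src0, const Lanes &Src1) {
  Lanes Dst = Old;
  for (int L = 0; L < kLanes; ++L)
    if (auto S = srcLaneShr1(L))
      Dst[L] = Src0[*S] + Src1[L];
  return Dst;
}

// The uncombined sequence: mov_dpp followed by a plain per-lane add.
Lanes movThenAdd(const Lanes &MovOld, const Lanes &Src, const Lanes &Other) {
  Lanes Tmp = movDpp(MovOld, Src);
  for (int L = 0; L < kLanes; ++L)
    Tmp[L] += Other[L];
  return Tmp;
}

// Sample per-lane register contents (arbitrary values).
const Lanes kSrc = {10, 20, 30, 40};  // X in the thread
const Lanes kOther = {1, 2, 3, 4};    // Y in the thread
const Lanes kZero = {0, 0, 0, 0};
```

In this model, when the mov's `old` is 0 (the identity for add), the sequence matches a combined `addDpp` only if the combined instruction's `old` is the non-DPP operand: `movThenAdd(kZero, kSrc, kOther)` equals `addDpp(kOther, kSrc, kOther)`. When the mov's `old` is the source itself (the `mov_dpp v1, v1` case), lane 0 of the sequence holds X + Y, which no existing register supplies as `old`, so the combined form differs.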
std::vector<MachineInstr*> OrigMIs, DPPMIs;		std::vector<MachineInstr*> OrigMIs, DPPMIs;
nhaehnle: These should be SmallVectors.
vpykhtin: Ok.
if (!OldOpndVGPR.Reg) { // OldOpndVGPR = undef		if (!OldOpndVGPR.Reg) { // OldOpndVGPR = undef
OldOpndVGPR = RegSubRegPair(		OldOpndVGPR = RegSubRegPair(
MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass));		MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass));
auto UndefInst = BuildMI(*MovMI.getParent(), MovMI, MovMI.getDebugLoc(),		auto UndefInst = BuildMI(*MovMI.getParent(), MovMI, MovMI.getDebugLoc(),
TII->get(AMDGPU::IMPLICIT_DEF), OldOpndVGPR.Reg);		TII->get(AMDGPU::IMPLICIT_DEF), OldOpndVGPR.Reg);
DPPMIs.push_back(UndefInst.getInstr());		DPPMIs.push_back(UndefInst.getInstr());
}		}

OrigMIs.push_back(&MovMI);		OrigMIs.push_back(&MovMI);
bool Rollback = true;		bool Rollback = true;
for (auto &Use : MRI->use_nodbg_operands(		for (auto &Use : MRI->use_nodbg_operands(DPPMovReg)) {
TII->getNamedOperand(MovMI, AMDGPU::OpName::vdst)->getReg())) {
Rollback = true;		Rollback = true;

auto &OrigMI = *Use.getParent();		auto &OrigMI = *Use.getParent();
		LLVM_DEBUG(dbgs() << " try: " << OrigMI);

auto OrigOp = OrigMI.getOpcode();		auto OrigOp = OrigMI.getOpcode();
if (TII->isVOP3(OrigOp)) {		if (TII->isVOP3(OrigOp)) {
if (!TII->hasVALU32BitEncoding(OrigOp)) {		if (!TII->hasVALU32BitEncoding(OrigOp)) {
LLVM_DEBUG(dbgs() << " failed: VOP3 hasn't e32 equivalent\n");		LLVM_DEBUG(dbgs() << " failed: VOP3 hasn't e32 equivalent\n");
break;		break;
}		}
// check if other than abs\|neg modifiers are set (opsel for example)		// check if other than abs\|neg modifiers are set (opsel for example)
const int64_t Mask = ~(SISrcMods::ABS \| SISrcMods::NEG);		const int64_t Mask = ~(SISrcMods::ABS \| SISrcMods::NEG);
[16 lines not shown]
if (&Use == TII->getNamedOperand(OrigMI, AMDGPU::OpName::src0)) {
DPPMIs.push_back(DPPInst);		DPPMIs.push_back(DPPInst);
Rollback = false;		Rollback = false;
}		}
} else if (OrigMI.isCommutable() &&		} else if (OrigMI.isCommutable() &&
&Use == TII->getNamedOperand(OrigMI, AMDGPU::OpName::src1)) {		&Use == TII->getNamedOperand(OrigMI, AMDGPU::OpName::src1)) {
auto *BB = OrigMI.getParent();		auto *BB = OrigMI.getParent();
auto *NewMI = BB->getParent()->CloneMachineInstr(&OrigMI);		auto *NewMI = BB->getParent()->CloneMachineInstr(&OrigMI);
BB->insert(OrigMI, NewMI);		BB->insert(OrigMI, NewMI);
if (TII->commuteInstruction(*NewMI)) {		if (TII->commuteInstruction(*NewMI)) {
nhaehnle: Can we simplify the code here by just commuting the OrigMI unconditionally? That is, first check whether `&Use != src0 && &Use == src1`; in that case, try commuting the instruction. If that fails, break (but I don't think there'd be a need to roll back the commutation). If it succeeds, continue down the same path that is used for the `&Use == src0` case. You end up with only a single call to `createDPPInst` and less code duplication in general.
vpykhtin: This is a good idea. At some intuitive level I don't like leaving a commuted instruction behind on rollback, but I really don't have arguments against it at the moment :-)
vpykhtin: Actually, a pass returns a boolean indicating whether it changed something in the IR. What should it return when it does a rollback? It may create new commuted instructions which need to be mapped in SlotIndexes, for example. I don't think it's a good idea to return true only when a rollback took place.
arsenm: Can rollback be avoided? I think the usual purpose of knowing something changed is to know if iterators are still valid, which probably isn't the case if anything was modified at any point.
arsenm: I suppose if it's just commuting, it's ok to report no change.
vpykhtin: The commutation is the whole problem. I cannot know beforehand whether the instruction can be commuted. Also, an instruction copy has to be created for the commutation; this means that previous analysis should be invalidated and the IR change should be reported.
vpykhtin: I think there is no way to avoid new instructions on commutation because of reversed instructions.
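The "commute a clone, keep the original untouched" approach debated above can be sketched without any LLVM types. Everything below is a hypothetical toy model (`ToyInst`, `commutedCopy`, `placeDppOperandInSrc0` are illustrative names, not LLVM API); it only assumes that the DPP-read value has to end up in `src0` and that commuting a subtraction yields its reversed opcode, as the `v_subrev_u32_dpp` check in the commute test suggests.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Toy instruction: an opcode name plus two source virtual-register ids.
struct ToyInst {
  std::string Opcode;
  int Src0;
  int Src1;
};

// Returns a commuted copy of I, or nullopt if the toy opcode cannot be
// commuted. Subtraction commutes into its "reversed" form, which is why a
// new instruction is produced rather than just swapping operands in place.
std::optional<ToyInst> commutedCopy(const ToyInst &I) {
  if (I.Opcode == "add")
    return ToyInst{"add", I.Src1, I.Src0};
  if (I.Opcode == "sub")
    return ToyInst{"subrev", I.Src1, I.Src0};
  if (I.Opcode == "subrev")
    return ToyInst{"sub", I.Src1, I.Src0};
  return std::nullopt;
}

// Returns an instruction equivalent to Orig with DppReg in Src0, or nullopt
// if that is impossible. Orig itself is never modified, so a failed attempt
// needs no rollback of the original.
std::optional<ToyInst> placeDppOperandInSrc0(const ToyInst &Orig, int DppReg) {
  if (Orig.Src0 == DppReg)
    return Orig;
  if (Orig.Src1 != DppReg)
    return std::nullopt;
  return commutedCopy(Orig);
}
```

For example, `placeDppOperandInSrc0(ToyInst{"sub", 7, 3}, 3)` yields a `subrev` with register 3 in `Src0`, while a non-commutable opcode yields no result and leaves the input as it was. Whether a discarded clone should count as "the pass changed something" is exactly the open question in the thread; the toy model does not settle it.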
LLVM_DEBUG(dbgs() << " commuted: " << *NewMI);		LLVM_DEBUG(dbgs() << " commuted: " << *NewMI);
if (auto DPPInst = createDPPInst(NewMI, MovMI, OldOpndVGPR,		if (auto DPPInst = createDPPInst(NewMI, MovMI, OldOpndVGPR,
OldOpndValue, BoundCtrlZero)) {		OldOpndValue, BoundCtrlZero)) {
DPPMIs.push_back(DPPInst);		DPPMIs.push_back(DPPInst);
Rollback = false;		Rollback = false;
}		}
} else		} else
LLVM_DEBUG(dbgs() << " failed: cannot be commuted\n");		LLVM_DEBUG(dbgs() << " failed: cannot be commuted\n");
[36 lines not shown]

lib/Target/AMDGPU/SIInstrInfo.h

[950 lines not shown]
TargetInstrInfo::RegSubRegPair getRegSequenceSubReg(MachineInstr &MI,
unsigned SubReg);		unsigned SubReg);

/// \brief Return the defining instruction for a given reg:subreg pair		/// \brief Return the defining instruction for a given reg:subreg pair
/// skipping copy like instructions and subreg-manipulation pseudos.		/// skipping copy like instructions and subreg-manipulation pseudos.
/// Following another subreg of a reg:subreg isn't supported.		/// Following another subreg of a reg:subreg isn't supported.
MachineInstr *getVRegSubRegDef(const TargetInstrInfo::RegSubRegPair &P,		MachineInstr *getVRegSubRegDef(const TargetInstrInfo::RegSubRegPair &P,
MachineRegisterInfo &MRI);		MachineRegisterInfo &MRI);

		/// \brief Return true if EXEC mask isn't changed between the def and
		/// all uses of VReg. Currently if def and uses are in different BBs -
		/// simply return false. Should be run on SSA.
		bool isEXECMaskConstantBetweenDefAndUses(unsigned VReg,
		MachineRegisterInfo &MRI);

namespace AMDGPU {		namespace AMDGPU {

LLVM_READONLY		LLVM_READONLY
int getVOPe64(uint16_t Opcode);		int getVOPe64(uint16_t Opcode);

LLVM_READONLY		LLVM_READONLY
int getVOPe32(uint16_t Opcode);		int getVOPe32(uint16_t Opcode);

[74 lines not shown]

lib/Target/AMDGPU/SIInstrInfo.cpp

[5,585 lines not shown]
default:
DefInst = MRI.getVRegDef(RSR.Reg);		DefInst = MRI.getVRegDef(RSR.Reg);
}		}
}		}
if (!DefInst)		if (!DefInst)
return MI;		return MI;
}		}
return nullptr;		return nullptr;
}		}

		bool llvm::isEXECMaskConstantBetweenDefAndUses(unsigned VReg,
		MachineRegisterInfo &MRI) {
		assert(MRI.isSSA() && "Must be run on SSA");
		auto *TRI = MRI.getTargetRegisterInfo();

		auto *DefI = MRI.getVRegDef(VReg);
		auto *BB = DefI->getParent();

		DenseSet<MachineInstr*> Uses;
		for (auto &Use : MRI.use_nodbg_operands(VReg)) {
		auto *I = Use.getParent();
		if (I->getParent() != BB)
		return false;
		Uses.insert(I);
		}

		auto E = BB->end();
		for (auto I = std::next(DefI->getIterator()); I != E; ++I) {
		Uses.erase(&*I);
		// don't check the last use
		if (Uses.empty() \|\| I->modifiesRegister(AMDGPU::EXEC, TRI))
		break;
		}
		return Uses.empty();
		}
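The scan in `isEXECMaskConstantBetweenDefAndUses` can be illustrated on a toy basic block. This is only a sketch under stated assumptions: `ToyMI`, `execConstantBetweenDefAndUses`, and `kBlock` are hypothetical names, instructions are identified by their index in the block, and a use in another block is modelled as an index outside the block (which, like the early `return false` above, makes the result false).

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy instruction: all that matters here is whether it writes EXEC.
struct ToyMI {
  bool WritesExec;
};

// Walks forward from the instruction after the def. Each visited
// instruction is removed from the pending-use set first, so the last use
// itself is allowed to modify EXEC. The walk stops at the first EXEC write
// or once every use has been seen; any use still pending means EXEC may
// have changed before it (or that it lies outside the block).
bool execConstantBetweenDefAndUses(const std::vector<ToyMI> &Block,
                                   std::size_t DefIdx,
                                   std::set<std::size_t> Uses) {
  for (std::size_t I = DefIdx + 1; I < Block.size(); ++I) {
    Uses.erase(I);
    if (Uses.empty() || Block[I].WritesExec)
      break;
  }
  return Uses.empty();
}

// Sample block: index 0 is the def, index 2 writes EXEC.
const std::vector<ToyMI> kBlock = {{false}, {false}, {true}, {false}};
```

With this block, a single use at index 1 passes, uses at indices 1 and 2 pass (the EXEC write is the last use), and any use at index 3 fails because EXEC is written at index 2 before it is reached.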

test/CodeGen/AMDGPU/dpp_combine.ll

; RUN: llc -march=amdgcn -mcpu=tonga -amdgpu-dpp-combine -verify-machineinstrs < %s \| FileCheck %s		; RUN: llc -march=amdgcn -mcpu=tonga -amdgpu-dpp-combine -verify-machineinstrs < %s \| FileCheck %s

; VOP2 with literal cannot be combined		; VOP2 with literal cannot be combined
; CHECK-LABEL: {{^}}dpp_combine_i32_literal:		; CHECK-LABEL: {{^}}dpp_combine_i32_literal:
; CHECK: v_mov_b32_dpp [[OLD:v[0-9]+]], {{v[0-9]+}} quad_perm:[1,0,0,0] row_mask:0x2 bank_mask:0x1 bound_ctrl:0		; CHECK: v_mov_b32_dpp [[OLD:v[0-9]+]], {{v[0-9]+}} quad_perm:[1,0,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:0
; CHECK: v_add_u32_e32 {{v[0-9]+}}, vcc, 42, [[OLD]]		; CHECK: v_add_u32_e32 {{v[0-9]+}}, vcc, 42, [[OLD]]
define amdgpu_kernel void @dpp_combine_i32_literal(i32 addrspace(1)* %out, i32 %in) {		define amdgpu_kernel void @dpp_combine_i32_literal(i32 addrspace(1)* %out, i32 %in) {
%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 2, i32 1, i1 1) #0		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 15, i32 15, i1 1) #0
%res = add nsw i32 %dpp, 42		%res = add nsw i32 %dpp, 42
store i32 %res, i32 addrspace(1)* %out		store i32 %res, i32 addrspace(1)* %out
ret void		ret void
}		}

; CHECK-LABEL: {{^}}dpp_combine_i32_bz:		; CHECK-LABEL: {{^}}dpp_combine_i32_bz:
; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0		; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:0
define amdgpu_kernel void @dpp_combine_i32_bz(i32 addrspace(1)* %out, i32 %in) {		define amdgpu_kernel void @dpp_combine_i32_bz(i32 addrspace(1)* %out, i32 %in) {
%x = tail call i32 @llvm.amdgcn.workitem.id.x()		%x = tail call i32 @llvm.amdgcn.workitem.id.x()
%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 1, i32 1, i1 1) #0		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 15, i32 15, i1 1) #0
%res = add nsw i32 %dpp, %x		%res = add nsw i32 %dpp, %x
store i32 %res, i32 addrspace(1)* %out		store i32 %res, i32 addrspace(1)* %out
ret void		ret void
}		}

		; CHECK-LABEL: {{^}}dpp_combine_i32_mov:
		; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0xf bank_mask:0xf
		define amdgpu_kernel void @dpp_combine_i32_mov(i32 addrspace(1)* %out, i32 %in) {
		%x = tail call i32 @llvm.amdgcn.workitem.id.x()
		%dpp = call i32 @llvm.amdgcn.mov.dpp.i32(i32 %in, i32 1, i32 15, i32 15, i1 1) #0
		%res = add nsw i32 %dpp, %x
		store i32 %res, i32 addrspace(1)* %out
		ret void
		}


; CHECK-LABEL: {{^}}dpp_combine_i32_boff_undef:		; CHECK-LABEL: {{^}}dpp_combine_i32_boff_undef:
; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1		; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
define amdgpu_kernel void @dpp_combine_i32_boff_undef(i32 addrspace(1)* %out, i32 %in) {		define amdgpu_kernel void @dpp_combine_i32_boff_undef(i32 addrspace(1)* %out, i32 %in) {
%x = tail call i32 @llvm.amdgcn.workitem.id.x()		%x = tail call i32 @llvm.amdgcn.workitem.id.x()
%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 1, i32 1, i1 0) #0		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 1, i32 1, i1 0) #0
%res = add nsw i32 %dpp, %x		%res = add nsw i32 %dpp, %x
store i32 %res, i32 addrspace(1)* %out		store i32 %res, i32 addrspace(1)* %out
ret void		ret void
}		}

; CHECK-LABEL: {{^}}dpp_combine_i32_boff_0:		; CHECK-LABEL: {{^}}dpp_combine_i32_boff_0:
; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0		; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:0
define amdgpu_kernel void @dpp_combine_i32_boff_0(i32 addrspace(1)* %out, i32 %in) {		define amdgpu_kernel void @dpp_combine_i32_boff_0(i32 addrspace(1)* %out, i32 %in) {
%x = tail call i32 @llvm.amdgcn.workitem.id.x()		%x = tail call i32 @llvm.amdgcn.workitem.id.x()
%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %in, i32 1, i32 1, i32 1, i1 0) #0		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %in, i32 1, i32 15, i32 15, i1 0) #0
%res = add nsw i32 %dpp, %x		%res = add nsw i32 %dpp, %x
store i32 %res, i32 addrspace(1)* %out		store i32 %res, i32 addrspace(1)* %out
ret void		ret void
}		}

		; CHECK-LABEL: {{^}}dpp_combine_i32_boff_or:
		; CHECK: v_mov_b32_e32 [[OLD:v[0-9]+]], -1
		; CHECK: v_or_b32_dpp [[OLD]], {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
		define amdgpu_kernel void @dpp_combine_i32_boff_or(i32 addrspace(1)* %out, i32 %in) {
		%x = tail call i32 @llvm.amdgcn.workitem.id.x()
		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 4294967295, i32 %in, i32 1, i32 1, i32 1, i1 0) #0
		%res = or i32 %dpp, %x
		store i32 %res, i32 addrspace(1)* %out
		ret void
		}

		; CHECK-LABEL: {{^}}dpp_combine_i32_boff_and:
		; CHECK: v_and_b32_dpp {{v[0-9]+}}, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
		define amdgpu_kernel void @dpp_combine_i32_boff_and(i32 addrspace(1)* %out, i32 %in) {
		%x = tail call i32 @llvm.amdgcn.workitem.id.x()
		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %in, i32 1, i32 1, i32 1, i1 0) #0
		%res = and i32 %dpp, %x
		store i32 %res, i32 addrspace(1)* %out
		ret void
		}

; CHECK-LABEL: {{^}}dpp_combine_i32_boff_max:		; CHECK-LABEL: {{^}}dpp_combine_i32_boff_max:
; CHECK: v_bfrev_b32_e32 [[OLD:v[0-9]+]], -2		; CHECK: v_bfrev_b32_e32 [[OLD:v[0-9]+]], -2
; CHECK: v_max_i32_dpp [[OLD]], {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1		; CHECK: v_max_i32_dpp [[OLD]], {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1
define amdgpu_kernel void @dpp_combine_i32_boff_max(i32 addrspace(1)* %out, i32 %in) {		define amdgpu_kernel void @dpp_combine_i32_boff_max(i32 addrspace(1)* %out, i32 %in) {
%x = tail call i32 @llvm.amdgcn.workitem.id.x()		%x = tail call i32 @llvm.amdgcn.workitem.id.x()
%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 2147483647, i32 %in, i32 1, i32 1, i32 1, i1 0) #0		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 2147483647, i32 %in, i32 1, i32 1, i32 1, i1 0) #0
%cmp = icmp sge i32 %dpp, %x		%cmp = icmp sge i32 %dpp, %x
%res = select i1 %cmp, i32 %dpp, i32 %x		%res = select i1 %cmp, i32 %dpp, i32 %x
[24 lines not shown]
define amdgpu_kernel void @dpp_combine_i32_boff_mul(i32 addrspace(1)* %out, i32 %in) {
%x.shl = shl i32 %x, 8		%x.shl = shl i32 %x, 8
%x.24 = ashr i32 %x.shl, 8		%x.24 = ashr i32 %x.shl, 8
%res = mul i32 %dpp.24, %x.24		%res = mul i32 %dpp.24, %x.24
store i32 %res, i32 addrspace(1)* %out		store i32 %res, i32 addrspace(1)* %out
ret void		ret void
}		}

; CHECK-LABEL: {{^}}dpp_combine_i32_commute:		; CHECK-LABEL: {{^}}dpp_combine_i32_commute:
; CHECK: v_subrev_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[2,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0		; CHECK: v_subrev_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[2,0,0,0] row_mask:0x1 bank_mask:0x1
define amdgpu_kernel void @dpp_combine_i32_commute(i32 addrspace(1)* %out, i32 %in) {		define amdgpu_kernel void @dpp_combine_i32_commute(i32 addrspace(1)* %out, i32 %in) {
%x = tail call i32 @llvm.amdgcn.workitem.id.x()		%x = tail call i32 @llvm.amdgcn.workitem.id.x()
%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 2, i32 1, i32 1, i1 1) #0		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %in, i32 2, i32 1, i32 1, i1 1) #0
%res = sub nsw i32 %x, %dpp		%res = sub nsw i32 %x, %dpp
store i32 %res, i32 addrspace(1)* %out		store i32 %res, i32 addrspace(1)* %out
ret void		ret void
}		}

; CHECK-LABEL: {{^}}dpp_combine_f32:		; CHECK-LABEL: {{^}}dpp_combine_f32:
; CHECK: v_add_f32_dpp {{v[0-9]+}}, {{v[0-9]+}}, v0 quad_perm:[3,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0		; CHECK: v_add_f32_dpp {{v[0-9]+}}, {{v[0-9]+}}, v0 quad_perm:[3,0,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:0
define amdgpu_kernel void @dpp_combine_f32(i32 addrspace(1)* %out, i32 %in) {		define amdgpu_kernel void @dpp_combine_f32(i32 addrspace(1)* %out, i32 %in) {
%x = tail call i32 @llvm.amdgcn.workitem.id.x()		%x = tail call i32 @llvm.amdgcn.workitem.id.x()

%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 3, i32 1, i32 1, i1 1) #0		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 3, i32 15, i32 15, i1 1) #0
%dpp.f32 = bitcast i32 %dpp to float		%dpp.f32 = bitcast i32 %dpp to float
%x.f32 = bitcast i32 %x to float		%x.f32 = bitcast i32 %x to float
%res.f32 = fadd float %x.f32, %dpp.f32		%res.f32 = fadd float %x.f32, %dpp.f32
%res = bitcast float %res.f32 to i32		%res = bitcast float %res.f32 to i32
store i32 %res, i32 addrspace(1)* %out		store i32 %res, i32 addrspace(1)* %out
ret void		ret void
}		}

; CHECK-LABEL: {{^}}dpp_combine_test_f32_mods:		; CHECK-LABEL: {{^}}dpp_combine_test_f32_mods:
; CHECK: v_mul_f32_dpp {{v[0-9]+}}, \|{{v[0-9]+}}\|, -v0 quad_perm:[0,1,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0		; CHECK: v_mul_f32_dpp {{v[0-9]+}}, \|{{v[0-9]+}}\|, -v0 quad_perm:[0,1,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:0
define amdgpu_kernel void @dpp_combine_test_f32_mods(i32 addrspace(1)* %out, i32 %in) {		define amdgpu_kernel void @dpp_combine_test_f32_mods(i32 addrspace(1)* %out, i32 %in) {
%x = tail call i32 @llvm.amdgcn.workitem.id.x()		%x = tail call i32 @llvm.amdgcn.workitem.id.x()

%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 4, i32 1, i32 1, i1 1) #0		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 4, i32 15, i32 15, i1 1) #0

%x.f32 = bitcast i32 %x to float		%x.f32 = bitcast i32 %x to float
%x.f32.neg = fsub float -0.000000e+00, %x.f32		%x.f32.neg = fsub float -0.000000e+00, %x.f32

%dpp.f32 = bitcast i32 %dpp to float		%dpp.f32 = bitcast i32 %dpp to float
%dpp.f32.cmp = fcmp fast olt float %dpp.f32, 0.000000e+00		%dpp.f32.cmp = fcmp fast olt float %dpp.f32, 0.000000e+00
%dpp.f32.sign = select i1 %dpp.f32.cmp, float -1.000000e+00, float 1.000000e+00		%dpp.f32.sign = select i1 %dpp.f32.cmp, float -1.000000e+00, float 1.000000e+00
%dpp.f32.abs = fmul fast float %dpp.f32, %dpp.f32.sign		%dpp.f32.abs = fmul fast float %dpp.f32, %dpp.f32.sign

%res.f32 = fmul float %x.f32.neg, %dpp.f32.abs		%res.f32 = fmul float %x.f32.neg, %dpp.f32.abs
%res = bitcast float %res.f32 to i32		%res = bitcast float %res.f32 to i32
store i32 %res, i32 addrspace(1)* %out		store i32 %res, i32 addrspace(1)* %out
ret void		ret void
}		}

; CHECK-LABEL: {{^}}dpp_combine_mac:		; CHECK-LABEL: {{^}}dpp_combine_mac:
; CHECK: v_mac_f32_dpp v0, {{v[0-9]+}}, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0		; CHECK: v_mac_f32_dpp v0, {{v[0-9]+}}, v1 quad_perm:[1,0,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:0
define amdgpu_kernel void @dpp_combine_mac(float addrspace(1)* %out, i32 %in) {		define amdgpu_kernel void @dpp_combine_mac(float addrspace(1)* %out, i32 %in) {
%x = tail call i32 @llvm.amdgcn.workitem.id.x()		%x = tail call i32 @llvm.amdgcn.workitem.id.x()
%y = tail call i32 @llvm.amdgcn.workitem.id.y()		%y = tail call i32 @llvm.amdgcn.workitem.id.y()
%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 1, i32 1, i1 1) #0		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 15, i32 15, i1 1) #0
%dpp.f32 = bitcast i32 %dpp to float		%dpp.f32 = bitcast i32 %dpp to float
%x.f32 = bitcast i32 %x to float		%x.f32 = bitcast i32 %x to float
%y.f32 = bitcast i32 %y to float		%y.f32 = bitcast i32 %y to float

%mult = fmul float %dpp.f32, %y.f32		%mult = fmul float %dpp.f32, %y.f32
%res = fadd float %mult, %x.f32		%res = fadd float %mult, %x.f32
store float %res, float addrspace(1)* %out		store float %res, float addrspace(1)* %out
ret void		ret void
}		}

; CHECK-LABEL: {{^}}dpp_combine_sequence:		; CHECK-LABEL: {{^}}dpp_combine_sequence:
define amdgpu_kernel void @dpp_combine_sequence(i32 addrspace(1)* %out, i32 %in, i1 %cmp) {		define amdgpu_kernel void @dpp_combine_sequence(i32 addrspace(1)* %out1, i32 addrspace(1)* %out2, i32 %in) {
%x = tail call i32 @llvm.amdgcn.workitem.id.x()		%x = tail call i32 @llvm.amdgcn.workitem.id.x()
%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 1, i32 1, i1 1) #0		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 15, i32 15, i1 1) #0
br i1 %cmp, label %bb1, label %bb2
bb1:		; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:0
; CHECK: v_add_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0
%resadd = add nsw i32 %dpp, %x		%resadd = add nsw i32 %dpp, %x
br label %bb3
bb2:		; CHECK: v_subrev_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:0
; CHECK: v_subrev_u32_dpp {{v[0-9]+}}, vcc, {{v[0-9]+}}, v0 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0
%ressub = sub nsw i32 %x, %dpp		%ressub = sub nsw i32 %x, %dpp
br label %bb3
bb3:		store i32 %resadd, i32 addrspace(1)* %out1
%res = phi i32 [%resadd, %bb1], [%ressub, %bb2]		store i32 %ressub, i32 addrspace(1)* %out2
store i32 %res, i32 addrspace(1)* %out
ret void		ret void
}		}

; CHECK-LABEL: {{^}}dpp_combine_sequence_negative:		; CHECK-LABEL: {{^}}dpp_combine_sequence_negative:
; CHECK: v_mov_b32_dpp v1, v1 quad_perm:[1,0,0,0] row_mask:0x1 bank_mask:0x1 bound_ctrl:0		; CHECK: v_mov_b32_dpp v1, v1 quad_perm:[1,0,0,0] row_mask:0xf bank_mask:0xf bound_ctrl:0
define amdgpu_kernel void @dpp_combine_sequence_negative(i32 addrspace(1)* %out, i32 %in, i1 %cmp) {		define amdgpu_kernel void @dpp_combine_sequence_negative(i32 addrspace(1)* %out, i32 %in, i1 %cmp) {
%x = tail call i32 @llvm.amdgcn.workitem.id.x()		%x = tail call i32 @llvm.amdgcn.workitem.id.x()
%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 1, i32 1, i1 1) #0		%dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 undef, i32 %in, i32 1, i32 15, i32 15, i1 1) #0
br i1 %cmp, label %bb1, label %bb2		br i1 %cmp, label %bb1, label %bb2
bb1:		bb1:
%resadd = add nsw i32 %dpp, %x		%resadd = add nsw i32 %dpp, %x
br label %bb3		br label %bb3
bb2:		bb2:
%ressub = sub nsw i32 2, %dpp ; break seq		%ressub = sub nsw i32 2, %dpp ; break seq
br label %bb3		br label %bb3
bb3:		bb3:
%res = phi i32 [%resadd, %bb1], [%ressub, %bb2]		%res = phi i32 [%resadd, %bb1], [%ressub, %bb2]
store i32 %res, i32 addrspace(1)* %out		store i32 %res, i32 addrspace(1)* %out
ret void		ret void
}		}

declare i32 @llvm.amdgcn.workitem.id.x()		declare i32 @llvm.amdgcn.workitem.id.x()
declare i32 @llvm.amdgcn.workitem.id.y()		declare i32 @llvm.amdgcn.workitem.id.y()
		declare i32 @llvm.amdgcn.mov.dpp.i32(i32, i32, i32, i32, i1) #0
declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32, i32, i32, i1) #0		declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32, i32, i32, i1) #0

attributes #0 = { nounwind readnone convergent }		attributes #0 = { nounwind readnone convergent }

test/CodeGen/AMDGPU/dpp_combine_subregs.mir

[30 lines not shown]
bb.0:
%3:vgpr_32 = V_MOV_B32_e32 42, implicit $exec		%3:vgpr_32 = V_MOV_B32_e32 42, implicit $exec
%4 = REG_SEQUENCE %2, %subreg.sub0, %3, %subreg.sub1		%4 = REG_SEQUENCE %2, %subreg.sub0, %3, %subreg.sub1
%5 = INSERT_SUBREG %4, %1, %subreg.sub1 ; %5.sub0 is taken from %4		%5 = INSERT_SUBREG %4, %1, %subreg.sub1 ; %5.sub0 is taken from %4
%6:vgpr_32 = V_MOV_B32_dpp %5.sub0, %1, 1, 1, 1, 0, implicit $exec		%6:vgpr_32 = V_MOV_B32_dpp %5.sub0, %1, 1, 1, 1, 0, implicit $exec
%7:vgpr_32 = V_MUL_I32_I24_e32 %6, %0.sub1, implicit $exec		%7:vgpr_32 = V_MUL_I32_I24_e32 %6, %0.sub1, implicit $exec
...		...

# CHECK-LABEL: name: add_old_subreg		# CHECK-LABEL: name: add_old_subreg
# CHECK: [[OLD:\%[0-9]+]]:vgpr_32 = IMPLICIT_DEF		# CHECK: %5:vgpr_32 = V_ADD_U32_dpp %3.sub1, %1, %0.sub1, 1, 1, 1, 0, implicit $exec
# CHECK: %5:vgpr_32 = V_ADD_U32_dpp [[OLD]], %1, %0.sub1, 1, 1, 1, 1, implicit $exec

name: add_old_subreg		name: add_old_subreg
tracksRegLiveness: true		tracksRegLiveness: true
registers:		registers:
- { id: 0, class: vreg_64 }		- { id: 0, class: vreg_64 }
- { id: 1, class: vgpr_32 }		- { id: 1, class: vgpr_32 }
- { id: 2, class: vgpr_32 }		- { id: 2, class: vgpr_32 }
- { id: 3, class: vreg_64 }		- { id: 3, class: vreg_64 }
[39 lines not shown]
bb.0:
%1:vgpr_32 = COPY $vgpr1		%1:vgpr_32 = COPY $vgpr1
%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec		%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
%3:vreg_64 = REG_SEQUENCE %2, %subreg.sub0 ; %3.sub1 is undef		%3:vreg_64 = REG_SEQUENCE %2, %subreg.sub0 ; %3.sub1 is undef
%4:vgpr_32 = V_MOV_B32_dpp %3.sub1, %1, 1, 1, 1, 0, implicit $exec		%4:vgpr_32 = V_MOV_B32_dpp %3.sub1, %1, 1, 1, 1, 0, implicit $exec
%5:vgpr_32 = V_ADD_U32_e32 %4, %0.sub1, implicit $exec		%5:vgpr_32 = V_ADD_U32_e32 %4, %0.sub1, implicit $exec
...		...

# CHECK-LABEL: name: add_f32_e64		# CHECK-LABEL: name: add_f32_e64
# CHECK: %3:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 1, 1, 1, implicit $exec		# CHECK: %3:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 15, 15, 1, implicit $exec
# CHECK: %4:vgpr_32 = V_ADD_F32_e64 0, %3, 0, %0, 0, 1, implicit $exec		# CHECK: %4:vgpr_32 = V_ADD_F32_e64 0, %3, 0, %0, 0, 1, implicit $exec
# CHECK: %6:vgpr_32 = V_ADD_F32_dpp %2, 0, %1, 0, %0, 1, 1, 1, 1, implicit $exec		# CHECK: %6:vgpr_32 = V_ADD_F32_dpp %2, 0, %1, 0, %0, 1, 15, 15, 1, implicit $exec
# CHECK: %7:vgpr_32 = V_ADD_F32_dpp %2, 1, %1, 2, %0, 1, 1, 1, 1, implicit $exec		# CHECK: %7:vgpr_32 = V_ADD_F32_dpp %2, 1, %1, 2, %0, 1, 15, 15, 1, implicit $exec
# CHECK: %9:vgpr_32 = V_ADD_F32_e64 4, %8, 8, %0, 0, 0, implicit $exec		# CHECK: %9:vgpr_32 = V_ADD_F32_e64 4, %8, 8, %0, 0, 0, implicit $exec

name: add_f32_e64		name: add_f32_e64
tracksRegLiveness: true		tracksRegLiveness: true
registers:		registers:
- { id: 0, class: vgpr_32 }		- { id: 0, class: vgpr_32 }
- { id: 1, class: vgpr_32 }		- { id: 1, class: vgpr_32 }
- { id: 2, class: vgpr_32 }		- { id: 2, class: vgpr_32 }
[10 lines not shown]
liveins:
- { reg: '$vgpr1', virtual-reg: '%1' }		- { reg: '$vgpr1', virtual-reg: '%1' }
body: \|		body: \|
bb.0:		bb.0:
liveins: $vgpr0, $vgpr1		liveins: $vgpr0, $vgpr1

%0:vgpr_32 = COPY $vgpr0		%0:vgpr_32 = COPY $vgpr0
%1:vgpr_32 = COPY $vgpr1		%1:vgpr_32 = COPY $vgpr1
%2:vgpr_32 = IMPLICIT_DEF		%2:vgpr_32 = IMPLICIT_DEF
%3:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 1, 1, 1, implicit $exec		%3:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 15, 15, 1, implicit $exec

; this shouldn't be combined as omod is set		; this shouldn't be combined as omod is set
%4:vgpr_32 = V_ADD_F32_e64 0, %3, 0, %0, 0, 1, implicit $exec		%4:vgpr_32 = V_ADD_F32_e64 0, %3, 0, %0, 0, 1, implicit $exec

%5:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 1, 1, 1, implicit $exec		%5:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 15, 15, 1, implicit $exec

; this should be combined as all modifiers are default		; this should be combined as all modifiers are default
%6:vgpr_32 = V_ADD_F32_e64 0, %5, 0, %0, 0, 0, implicit $exec		%6:vgpr_32 = V_ADD_F32_e64 0, %5, 0, %0, 0, 0, implicit $exec

; this should be combined as modifiers other than abs|neg are default		; this should be combined as modifiers other than abs|neg are default
%7:vgpr_32 = V_ADD_F32_e64 1, %5, 2, %0, 0, 0, implicit $exec		%7:vgpr_32 = V_ADD_F32_e64 1, %5, 2, %0, 0, 0, implicit $exec

%8:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 1, 1, 1, implicit $exec		%8:vgpr_32 = V_MOV_B32_dpp undef %2, %1, 1, 15, 15, 1, implicit $exec

; this shouldn't be combined as modifiers aren't abs|neg		; this shouldn't be combined as modifiers aren't abs|neg
%9:vgpr_32 = V_ADD_F32_e64 4, %8, 8, %0, 0, 0, implicit $exec		%9:vgpr_32 = V_ADD_F32_e64 4, %8, 8, %0, 0, 0, implicit $exec
...		...
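
The four V_ADD_F32_e64 cases above follow one rule: the VOP3 form is folded into a DPP instruction only when clamp and omod are zero and the source modifiers use nothing beyond abs and neg. A minimal C++ sketch of that predicate follows. It is an illustration, not the code from GCNDPPCombine.cpp: the namespace and the names canFoldE64ToDPP and checkAgainstTestCases are hypothetical, the operand order (src0_modifiers, src0, src1_modifiers, src1, clamp, omod) is assumed from the test, and the bit values for neg (1) and abs (2) are assumed from the modifier immediates the test uses.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Assumed source-modifier bits, matching the immediates 1 and 2 in the test.
constexpr uint32_t NEG = 1u << 0;
constexpr uint32_t ABS = 1u << 1;

// Hypothetical predicate: true when a VOP3 (e64) instruction carries only
// modifiers that the DPP encoding can also express (abs/neg, no clamp/omod).
constexpr bool canFoldE64ToDPP(uint32_t src0Mods, uint32_t src1Mods,
                               uint32_t clamp, uint32_t omod) {
  return (src0Mods & ~(NEG | ABS)) == 0 &&
         (src1Mods & ~(NEG | ABS)) == 0 &&
         clamp == 0 && omod == 0;
}

// Replays the four cases from add_f32_e64 against the predicate.
inline void checkAgainstTestCases() {
  assert(!canFoldE64ToDPP(0, 0, 0, 1)); // %4: omod is set
  assert(canFoldE64ToDPP(0, 0, 0, 0));  // %6: all modifiers default
  assert(canFoldE64ToDPP(1, 2, 0, 0));  // %7: only neg / abs
  assert(!canFoldE64ToDPP(4, 8, 0, 0)); // %9: modifiers other than abs|neg
}

} // namespace sketch
```

Calling sketch::checkAgainstTestCases() from any test driver exercises the same combinations as the CHECK lines for %4, %6, %7 and %9.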

		# old reg def is in diff BB - cannot combine
		# CHECK-LABEL: name: add_negative_old_in_diff_bb
		# CHECK: %3:vgpr_32 = V_MOV_B32_dpp %2, %1, 1, 1, 1, 0, implicit $exec

		name: add_negative_old_in_diff_bb
		tracksRegLiveness: true
		registers:
		- { id: 0, class: vreg_64 }
		- { id: 1, class: vgpr_32 }
		- { id: 2, class: vgpr_32 }
		- { id: 3, class: vgpr_32 }
		- { id: 4, class: vgpr_32 }

		liveins:
		- { reg: '$vgpr0', virtual-reg: '%0' }
		- { reg: '$vgpr1', virtual-reg: '%1' }
		body: |
		bb.0:
		successors: %bb.1
		liveins: $vgpr0, $vgpr1

		%0:vreg_64 = COPY $vgpr0
		%1:vgpr_32 = COPY $vgpr1
		%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
		S_BRANCH %bb.1

		bb.1:
		%3:vgpr_32 = V_MOV_B32_dpp %2, %1, 1, 1, 1, 0, implicit $exec
		%4:vgpr_32 = V_ADD_U32_e32 %3, %0.sub1, implicit $exec
		...

		# old reg def is in diff BB but bound_ctrl:0 - can combine
		# CHECK-LABEL: name: add_positive_old_in_diff_bb_bctrl_zero
		# CHECK: %4:vgpr_32 = V_ADD_U32_dpp %5, %1, %0.sub1, 1, 15, 15, 1, implicit $exec

		name: add_positive_old_in_diff_bb_bctrl_zero
		tracksRegLiveness: true
		registers:
		- { id: 0, class: vreg_64 }
		- { id: 1, class: vgpr_32 }
		- { id: 2, class: vgpr_32 }
		- { id: 3, class: vgpr_32 }
		- { id: 4, class: vgpr_32 }

		liveins:
		- { reg: '$vgpr0', virtual-reg: '%0' }
		- { reg: '$vgpr1', virtual-reg: '%1' }
		body: |
		bb.0:
		successors: %bb.1
		liveins: $vgpr0, $vgpr1

		%0:vreg_64 = COPY $vgpr0
		%1:vgpr_32 = COPY $vgpr1
		%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
		S_BRANCH %bb.1

		bb.1:
		%3:vgpr_32 = V_MOV_B32_dpp %2, %1, 1, 15, 15, 1, implicit $exec
		%4:vgpr_32 = V_ADD_U32_e32 %3, %0.sub1, implicit $exec

		...
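
The two tests above differ only in the DPP control operands of the mov. One reading consistent with them, assumed here rather than taken from the patch: when bound_ctrl:0 is set and row_mask and bank_mask are both 0xF, no lane is masked off and invalid source lanes read zero, so the old operand is never read and it no longer matters where it is defined. The C++ sketch below encodes that assumption. The names oldOperandIsDead and checkOldOperandCases are hypothetical, and the operand order (old, src0, dpp_ctrl, row_mask, bank_mask, bound_ctrl) is inferred from the test.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Hypothetical predicate: true when the DPP controls guarantee that the
// 'old' operand cannot contribute to the result in any lane.
constexpr bool oldOperandIsDead(uint32_t rowMask, uint32_t bankMask,
                                bool boundCtrlZero) {
  return boundCtrlZero && rowMask == 0xF && bankMask == 0xF;
}

// Replays the control operands of the two tests above.
inline void checkOldOperandCases() {
  // add_negative_old_in_diff_bb: row_mask 1, bank_mask 1, bound_ctrl 0
  assert(!oldOperandIsDead(1, 1, false));
  // add_positive_old_in_diff_bb_bctrl_zero: row_mask 15, bank_mask 15, bound_ctrl 1
  assert(oldOperandIsDead(15, 15, true));
}

} // namespace sketch
```

This matches the CHECK line of the positive test, where the combined V_ADD_U32_dpp takes a fresh register (%5) as old instead of %2 from bb.0.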

		# EXEC mask changed between def and use - cannot combine
		# CHECK-LABEL: name: negative_exec_changed
		# CHECK: %4:vgpr_32 = V_MOV_B32_dpp %2, %1, 1, 1, 1, 0, implicit $exec

		name: negative_exec_changed
		tracksRegLiveness: true
		registers:
		- { id: 0, class: vreg_64 }
		- { id: 1, class: vgpr_32 }
		- { id: 2, class: vgpr_32 }
		- { id: 3, class: vreg_64 }
		- { id: 4, class: vgpr_32 }
		- { id: 5, class: vgpr_32 }
		- { id: 6, class: vgpr_32 }
		- { id: 7, class: sreg_64 }

		liveins:
		- { reg: '$vgpr0', virtual-reg: '%0' }
		- { reg: '$vgpr1', virtual-reg: '%1' }
		body: |
		bb.0:
		liveins: $vgpr0, $vgpr1

		%0:vreg_64 = COPY $vgpr0
		%1:vgpr_32 = COPY $vgpr1
		%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
		%4:vgpr_32 = V_MOV_B32_dpp %2, %1, 1, 1, 1, 0, implicit $exec
		%5:vgpr_32 = V_ADD_U32_e32 %4, %0.sub0, implicit $exec
		%7:sreg_64 = COPY $exec, implicit-def $exec
		%6:vgpr_32 = V_ADD_U32_e32 %4, %0.sub1, implicit $exec


AMDGPU: Fix DPP combiner (Closed, Public)

Diff 177866

include/llvm/CodeGen/TargetInstrInfo.h

lib/Target/AMDGPU/GCNDPPCombine.cpp

lib/Target/AMDGPU/SIInstrInfo.h

lib/Target/AMDGPU/SIInstrInfo.cpp

test/CodeGen/AMDGPU/dpp_combine.ll

test/CodeGen/AMDGPU/dpp_combine_subregs.mir