This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

llvm/lib/Target/AMDGPU/: SIFoldOperands.cpp

llvm/test/CodeGen/AMDGPU/GlobalISel/: add.v2i16.ll, addo.ll, andn2.ll, ashr.ll, bswap.ll, combine-fma-add-mul.ll, combine-fma-sub-ext-neg-mul.ll, combine-fma-sub-mul.ll, combine-fma-sub-neg-mul.ll, extractelement.i8.ll, fdiv.f32.ll, flat-scratch.ll, fmed3.ll, fpow.ll, fshl.ll, fshr.ll, insertelement.i16.ll, insertelement.i8.ll, llvm.amdgcn.div.scale.ll, llvm.amdgcn.image.atomic.dim.a16.ll, llvm.amdgcn.image.gather4.a16.dim.ll, llvm.amdgcn.image.load.1d.d16.ll, llvm.amdgcn.image.load.2darraymsaa.a16.ll, llvm.amdgcn.image.load.3d.a16.ll, llvm.amdgcn.image.sample.g16.ll, llvm.amdgcn.intersect_ray.ll, llvm.amdgcn.sbfe.ll, llvm.amdgcn.sdot4.ll, llvm.amdgcn.udot4.ll, lshr.ll, mul.ll, orn2.ll, roundeven.ll, saddsat.ll, sdiv.i32.ll, sdiv.i64.ll, sdivrem.ll, shl-ext-reduce.ll, shl.ll, srem.i32.ll, srem.i64.ll, ssubsat.ll, store-local.128.ll, store-local.96.ll, subo.ll, trunc.ll, uaddsat.ll, udivrem.ll, urem.i32.ll, urem.i64.ll, usubsat.ll, xnor.ll

llvm/test/CodeGen/AMDGPU/: add.v2i16.ll, amdgpu-codegenprepare-fold-binop-select.ll, amdgpu-codegenprepare-idiv.ll, and.ll, atomic_optimizations_local_pointer.ll, bypass-div.ll, combine-reg-or-const.ll, constant-address-space-32bit.ll, ctlz.ll, cttz.ll, cvt_f32_ubyte.ll, extract-subvector-16bit.ll, fabs.f16.ll, fabs.f64.ll, fabs.ll, fexp.ll, flat-scratch.ll, fmed3.ll, fneg-fabs.f16.ll, fneg-fabs.f64.ll, fneg-fabs.ll, fneg.ll, fold-immediate-operand-shrink-with-carry.mir, frem.ll, fshr.ll, idiv-licm.ll, idot2.ll, idot4u.ll, idot8s.ll, idot8u.ll, immv216.ll, insert_vector_dynelt.ll, insert_vector_elt.ll, insert_vector_elt.v2i16.ll, llvm.amdgcn.buffer.store.format.d16.ll, llvm.amdgcn.image.sample.a16.dim.ll, llvm.amdgcn.image.sample.g16.a16.dim.ll, llvm.amdgcn.image.sample.g16.encode.ll, llvm.amdgcn.image.sample.g16.ll, llvm.amdgcn.raw.buffer.store.format.d16.ll, llvm.amdgcn.raw.tbuffer.store.d16.ll, llvm.amdgcn.struct.buffer.store.format.d16.ll, llvm.amdgcn.struct.tbuffer.store.d16.ll, llvm.amdgcn.tbuffer.store.d16.ll, llvm.log.f16.ll, llvm.log10.f16.ll, llvm.round.f64.ll, load-constant-i16.ll, load-global-i16.ll, madak.ll, max.ll, mul.ll, mul_uint24-amdgcn.ll, or.ll, packed-fp32.ll, promote-constOffset-to-imm.ll, s_addk_i32.ll, salu-to-valu.ll, scratch-buffer.ll, sdiv64.ll, sdwa-peephole.ll, setcc-opt.ll, shift-i128.ll, shl.v2i16.ll, shrink-add-sub-constant.ll, splitkit-getsubrangeformask.ll, srem64.ll, strict_fadd.f16.ll, strict_fma.f16.ll, strict_fmul.f16.ll, strict_fsub.f16.ll, sub.v2i16.ll, uaddsat.ll, udiv.ll, udiv64.ll, udivrem24.ll, urem64.ll, usubsat.ll, v_pack.ll, vector_shuffle.packed.ll, xor.ll, zero_extend.ll

Differential D114643

[AMDGPU] Aggressively fold immediates in SIFoldOperands
Closed, Public

Authored by foad on Nov 26 2021, 7:56 AM.


Details

Reviewers

arsenm
rampitec
nhaehnle
tsymalla
piotr
sebastian-ne

Commits

rG3eb2281bc067: [AMDGPU] Aggressively fold immediates in SIFoldOperands

Summary

Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.
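To put rough numbers on that concern, here is a back-of-the-envelope sketch. It assumes, as a simplification not taken from the patch, that every user is a 4-byte VOP1/VOP2 (`_e32`) instruction, that a non-inline 32-bit literal adds one 4-byte dword to each instruction carrying it, and that the defining `v_mov_b32` is deleted once all of its uses are folded. The helper `foldSizeDelta` is hypothetical.

```cpp
#include <cassert>

// Simplified encoding sizes (assumption): a VOP1/VOP2 (_e32) instruction is
// one 4-byte dword, and a non-inline 32-bit literal adds one more dword.
constexpr int kE32Bytes = 4;
constexpr int kLiteralBytes = 4;

// Hypothetical helper: code-size change in bytes (new minus old) when the
// constant materialized by "v_mov_b32 vN, C" is folded into all NumUses
// users and the v_mov_b32 is then deleted.
int foldSizeDelta(int NumUses, bool IsInline) {
  // Bytes saved by deleting the v_mov_b32 (and its literal, if any).
  const int MovBytes = kE32Bytes + (IsInline ? 0 : kLiteralBytes);
  // Each user grows by one literal dword unless C is an inline constant.
  const int GrowthPerUse = IsInline ? 0 : kLiteralBytes;
  return NumUses * GrowthPerUse - MovBytes;
}
```

Under these assumptions, folding a literal into one user saves 4 bytes, folding it into two users breaks even, and each further user costs 4 bytes; folding an inline constant never grows the code.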

This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:

- It reduces the number of registers used for holding constant values, which might increase occupancy. (On the other hand, many of these registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant into a register and the instruction that uses it.
The above benefits are expected to outweigh any increase in code size.
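The policy change can be modelled in a few lines. This sketch only counts how many uses of a constant are handed to `foldOperand` under each policy; `UseInfo` and the `usesOffered*` helpers are hypothetical, and whether a fold is actually legal is still decided later (in `tryAddToFoldList`, per the review discussion).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary of one use of a materialized constant: true if the
// value is free to fold there (an inline constant, or a frame index that
// frameIndexMayFold accepts); false if it would need a 32-bit literal.
struct UseInfo {
  bool FreeToFold;
};

// Old policy: every free use is offered to foldOperand, but literal uses
// are offered only when there is exactly one of them.
std::size_t usesOfferedOld(const std::vector<UseInfo> &Uses) {
  std::size_t Offered = 0;
  std::size_t NumLiteralUses = 0;
  for (const UseInfo &U : Uses) {
    if (U.FreeToFold)
      ++Offered;
    else
      ++NumLiteralUses;
  }
  if (NumLiteralUses == 1)
    ++Offered;
  return Offered;
}

// New policy: every use is offered to foldOperand, literal or not.
std::size_t usesOfferedNew(const std::vector<UseInfo> &Uses) {
  return Uses.size();
}
```

With one inline use and two literal uses, the old policy offers only the inline use, while the new one offers all three.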

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

foad created this revision. Nov 26 2021, 7:56 AM

Herald added subscribers: wenlei, kerbowa, asbirlea and 9 others. Nov 26 2021, 7:56 AM

foad requested review of this revision. Nov 26 2021, 7:56 AM

Herald added a project: Restricted Project. Nov 26 2021, 7:56 AM

Herald added subscribers: llvm-commits, wdng.

foad added a child revision: D114644: [AMDGPU] Aggressively fold immediates in SIShrinkInstructions. Nov 26 2021, 7:59 AM

foad mentioned this in D114644: [AMDGPU] Aggressively fold immediates in SIShrinkInstructions. Nov 26 2021, 8:05 AM

foad added inline comments. Nov 26 2021, 8:09 AM

llvm/test/CodeGen/AMDGPU/madak.ll
54–55	Regression here: we are no longer forming madak/fmaak instructions. I think this is just bad luck. madak/fmaak formation is only implemented when PeepholeOptimizer calls SIInstrInfo::FoldImmediate. I think it would be much more reliable to do it as part of SIFoldOperands / SIShrinkInstructions.
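For readers unfamiliar with these forms: per the AMDGPU ISA documentation, `v_madak_f32` computes `D = S0 * S1 + K` and `v_madmk_f32` computes `D = S0 * K + S1`, where `K` is a 32-bit literal encoded in the instruction; `v_fmaak_f32`/`v_fmamk_f32` are the fused counterparts. The sketch below only shows which form a multiply-add with a single literal operand could map to. `KForm` and `pickKForm` are hypothetical names, and register-class and encoding constraints are ignored.

```cpp
#include <cassert>

// Which literal-carrying form a mad/fma could become (hypothetical names).
enum class KForm { None, AddK, MulK };

// Given which operands of D = Src0 * Src1 + Src2 hold a non-inline literal,
// pick a form. AddK stands for madak/fmaak (the literal is the addend);
// MulK stands for madmk/fmamk (the literal is a multiplicand, and since the
// multiply commutes, either Src0 or Src1 may hold it).
KForm pickKForm(bool Src0Lit, bool Src1Lit, bool Src2Lit) {
  const int NumLits = Src0Lit + Src1Lit + Src2Lit;
  if (NumLits != 1)
    return KForm::None; // this sketch only handles exactly one literal
  return Src2Lit ? KForm::AddK : KForm::MulK;
}

// Reference semantics of the two forms (rounding/denormal details ignored).
float madak(float S0, float S1, float K) { return S0 * S1 + K; }
float madmk(float S0, float K, float S1) { return S0 * K + S1; }
```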

sebastian-ne added a subscriber: sebastian-ne. Nov 26 2021, 8:31 AM

sebastian-ne added inline comments.

llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
99–104	Not really related to this patch, but shouldn’t we be able to inline v2 (15) into the scratch_store?

foad added inline comments. Nov 26 2021, 8:42 AM

llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
99–104	No, v2 is the value being stored, and it has to be in a vgpr for that instruction (not even an inline constant is allowed).

Harbormaster completed remote builds in B136243: Diff 390065. Nov 26 2021, 8:42 AM

Maybe we should start considering optsize here?

Rebase.

Herald added a project: Restricted Project. Mar 2 2022, 3:40 AM

Harbormaster completed remote builds in B152132: Diff 412371. Mar 2 2022, 4:47 AM

foad mentioned this in D77804: [DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits. May 16 2022, 6:35 AM

Rebase.

Herald added subscribers: kosarev, jsilvanus, hsmhsm. May 16 2022, 8:55 AM

foad added inline comments. May 16 2022, 8:58 AM

llvm/test/CodeGen/AMDGPU/madak.ll
54–55	The regression got fixed by D125567.

Harbormaster completed remote builds in B164659: Diff 429734. May 16 2022, 9:50 AM

foad added reviewers: arsenm, rampitec, nhaehnle, tsymalla, piotr, sebastian-ne. May 17 2022, 3:21 AM

Both patches look good to me!

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1246–1253	Can this use `make_early_inc_range` instead of caching the small vector like in the if-case? (if so, this probably makes more sense as an NFC patch afterwards)

foad added inline comments. May 17 2022, 5:49 AM

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1246–1253	Not sure. I can try that as a follow-up.

arsenm added inline comments. May 17 2022, 9:02 AM

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
149–151	Don't we still need to consider this for deciding to fold fma/fmak for f16?

foad added inline comments. May 17 2022, 9:25 AM

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
149–151	I don't see why. "isInline" functions are just asking whether it's free to fold a constant (i.e. no increase in code size). The actual legality checks are done later in tryAddToFoldList.
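As a rough illustration of what "free to fold" means: a 32-bit operand can encode small integers and a handful of floating-point values without a literal dword, and 16-bit operands have their own table. The sketch below is a simplified stand-in, not LLVM's `SIInstrInfo::isInlineConstant`. It assumes a target that supports the 1/(2π) inline constant, and the rule that a value must fit in 16 bits before it is considered for a 16-bit operand is an assumption. It also shows why the removed code comment insists on checking against the use operand: the bit pattern `0x3f800000` (1.0f) is inline for a 32-bit operand, whereas half-precision 1.0 is `0x3c00`.

```cpp
#include <cassert>
#include <cstdint>

// Simplified 32-bit inline-constant check (sketch): integers -16..64, or
// the float bit patterns of +-0.5, +-1.0, +-2.0, +-4.0 and 1/(2*pi).
bool isInline32(uint32_t Bits) {
  const int32_t S = static_cast<int32_t>(Bits);
  if (S >= -16 && S <= 64)
    return true;
  switch (Bits) {
  case 0x3F000000: case 0xBF000000: // +-0.5
  case 0x3F800000: case 0xBF800000: // +-1.0
  case 0x40000000: case 0xC0000000: // +-2.0
  case 0x40800000: case 0xC0800000: // +-4.0
  case 0x3E22F983:                  // 1/(2*pi)
    return true;
  default:
    return false;
  }
}

// Same idea for a 16-bit operand, using half-precision bit patterns.
bool isInline16(uint16_t Bits) {
  const int16_t S = static_cast<int16_t>(Bits);
  if (S >= -16 && S <= 64)
    return true;
  switch (Bits) {
  case 0x3800: case 0xB800: // +-0.5
  case 0x3C00: case 0xBC00: // +-1.0
  case 0x4000: case 0xC000: // +-2.0
  case 0x4400: case 0xC400: // +-4.0
  case 0x3118:              // 1/(2*pi)
    return true;
  default:
    return false;
  }
}

// Check a value against the width of the operand it would be folded into.
// Assumption: a value that does not fit in 16 bits (signed or unsigned) is
// never inline for a 16-bit operand.
bool isInlineForOperand(int64_t Value, unsigned OperandBits) {
  if (OperandBits == 16) {
    if (Value < INT16_MIN || Value > UINT16_MAX)
      return false;
    return isInline16(static_cast<uint16_t>(Value));
  }
  return isInline32(static_cast<uint32_t>(Value));
}
```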

arsenm accepted this revision. May 17 2022, 1:20 PM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1246–1253	The iterators in this pass are kind of a mess. I've wanted to rewrite this pass to work more like how PeepholeOpt works, collecting defs, visiting uses and looking for collected defs.
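A toy sketch of the structure arsenm describes: a single forward walk that collects defs materializing immediates and checks each later use against them. It is not PeepholeOptimizer's code; `Operand`, `Inst` and `foldCollectedDefs` are hypothetical. The sketch assumes SSA form within one basic block; across blocks, the walk would need to visit each def before the uses it dominates.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical toy machine IR in SSA form: every virtual register has a
// single def, and defs appear before their uses within the block.
struct Operand {
  bool IsReg;
  int Reg;     // meaningful when IsReg
  int64_t Imm; // meaningful when !IsReg
};

struct Inst {
  bool IsMovImm;            // "Def = mov Imm"
  int Def;                  // defined register, or -1 for none
  std::vector<Operand> Ops; // source operands
};

// One forward pass: remember registers defined by a mov of an immediate,
// and rewrite any later use of such a register into that immediate.
// Legality checks, code-size heuristics and dead-def removal are omitted.
int foldCollectedDefs(std::vector<Inst> &Block) {
  std::unordered_map<int, int64_t> ConstDefs;
  int NumFolded = 0;
  for (Inst &I : Block) {
    for (Operand &Op : I.Ops) {
      if (!Op.IsReg)
        continue;
      auto It = ConstDefs.find(Op.Reg);
      if (It == ConstDefs.end())
        continue;
      Op = Operand{false, -1, It->second};
      ++NumFolded;
    }
    if (I.IsMovImm && I.Def >= 0 && I.Ops.size() == 1 && !I.Ops[0].IsReg)
      ConstDefs[I.Def] = I.Ops[0].Imm;
  }
  return NumFolded;
}
```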

This revision is now accepted and ready to land. May 17 2022, 1:20 PM

This revision was landed with ongoing or failed builds. May 18 2022, 2:22 AM

Closed by commit rG3eb2281bc067: [AMDGPU] Aggressively fold immediates in SIFoldOperands (authored by foad).

This revision was automatically updated to reflect the committed changes.

foad added a commit: rG3eb2281bc067: [AMDGPU] Aggressively fold immediates in SIFoldOperands.

foad added inline comments. May 18 2022, 2:40 AM

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp

1246–1253

I tried this, but it seems to get stuck in infinite loops in several lit tests. Not sure why:

diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index 99aa8a60b04f..3159693a2b6e 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -1243,12 +1243,9 @@ bool SIFoldOperands::foldInstOperand(MachineInstr &MI,
     }
   }
 
-  SmallVector<MachineOperand *, 4> UsesToProcess;
-  for (auto &Use : MRI->use_nodbg_operands(Dst.getReg()))
-    UsesToProcess.push_back(&Use);
-  for (auto U : UsesToProcess) {
-    MachineInstr *UseMI = U->getParent();
-    foldOperand(OpToFold, UseMI, UseMI->getOperandNo(U), FoldList,
+  for (auto &U : make_early_inc_range(MRI->use_nodbg_operands(Dst.getReg()))) {
+    MachineInstr *UseMI = U.getParent();
+    foldOperand(OpToFold, UseMI, UseMI->getOperandNo(&U), FoldList,
                 CopiesToReplace);
   }
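The thread leaves the cause of the hang open. One possible mechanism, offered only as a hypothesis: `make_early_inc_range` advances the iterator before the loop body runs, which makes it safe to erase the current element, but it does not snapshot the list, so if folding adds new uses of `Dst.getReg()`, the walk could go on to visit them. The toy below uses a `std::vector<int>` in place of the register's use list; `foldUse`, `visitSnapshot` and `visitLive` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "fold" on a toy use list: folding a use whose value is below
// Limit appends one new use (Value + 1), standing in for a fold that
// creates fresh uses of the same register.
static void foldUse(std::vector<int> &Uses, int Value, int Limit) {
  if (Value < Limit)
    Uses.push_back(Value + 1);
}

// Pattern kept in SIFoldOperands: copy the uses first, then fold.
// Only the uses that existed up front are visited.
std::size_t visitSnapshot(std::vector<int> Uses, int Limit) {
  const std::vector<int> ToProcess = Uses;
  for (int V : ToProcess)
    foldUse(Uses, V, Limit);
  return ToProcess.size();
}

// Walking the live list: uses appended during the walk are visited too, so
// if every fold appended a new use (no Limit), this loop would never end.
std::size_t visitLive(std::vector<int> Uses, int Limit) {
  std::size_t Visited = 0;
  for (std::size_t I = 0; I < Uses.size(); ++I) {
    ++Visited;
    foldUse(Uses, Uses[I], Limit);
  }
  return Visited;
}
```

Starting from a single use, the snapshot walk visits one use, while the live walk keeps visiting the uses it has just created.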

sebastian-ne added inline comments. May 18 2022, 3:21 AM

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1246–1253	Ok, thanks for trying!

piotr mentioned this in D126064: [AMDGPU] Handle mandatory literals in isOperandLegal. May 20 2022, 7:13 AM

foad mentioned this in D114232: [AMDGPU] Fold more inline constant operands by commuting instructions. Sep 7 2022, 3:19 AM

Large Diff

This large diff affects 135 files. Files without inline comments have been collapsed.

Revision Contents

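Path (Size)

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (84 lines)

llvm/test/CodeGen/AMDGPU/GlobalISel/
  add.v2i16.ll (88 lines)
  addo.ll (7 lines)
  andn2.ll (112 lines)
  ashr.ll (124 lines)
  bswap.ll (48 lines)
  combine-fma-add-mul.ll (76 lines)
  combine-fma-sub-ext-neg-mul.ll (20 lines)
  combine-fma-sub-mul.ll (20 lines)
  combine-fma-sub-neg-mul.ll (15 lines)
  extractelement.i8.ll (490 lines)
  fdiv.f32.ll (42 lines)
  flat-scratch.ll (76 lines)
  fmed3.ll (14 lines)
  fpow.ll (5 lines)
  fshl.ll (785 lines)
  fshr.ll (930 lines)
  insertelement.i16.ll (812 lines)
  insertelement.i8.ll (4111 lines)
  llvm.amdgcn.div.scale.ll (3 lines)
  llvm.amdgcn.image.atomic.dim.a16.ll (50 lines)
  llvm.amdgcn.image.gather4.a16.dim.ll (54 lines)
  llvm.amdgcn.image.load.1d.d16.ll (5 lines)
  llvm.amdgcn.image.load.2darraymsaa.a16.ll (19 lines)
  llvm.amdgcn.image.load.3d.a16.ll (17 lines)
  llvm.amdgcn.image.sample.g16.ll (111 lines)
  llvm.amdgcn.intersect_ray.ll (54 lines)
  llvm.amdgcn.sbfe.ll (5 lines)
  llvm.amdgcn.sdot4.ll (13 lines)
  llvm.amdgcn.udot4.ll (13 lines)
  lshr.ll (173 lines)
  mul.ll (51 lines)
  orn2.ll (112 lines)
  roundeven.ll (11 lines)
  saddsat.ll (1244 lines)
  sdiv.i32.ll (48 lines)
  sdiv.i64.ll (10 lines)
  sdivrem.ll (234 lines)
  shl-ext-reduce.ll (15 lines)
  shl.ll (96 lines)
  srem.i32.ll (8 lines)
  srem.i64.ll (10 lines)
  ssubsat.ll (1242 lines)
  store-local.128.ll (27 lines)
  store-local.96.ll (19 lines)
  subo.ll (7 lines)
  trunc.ll (10 lines)
  uaddsat.ll (140 lines)
  udivrem.ll (193 lines)
  urem.i32.ll (4 lines)
  urem.i64.ll (279 lines)
  usubsat.ll (140 lines)
  xnor.ll (26 lines)

llvm/test/CodeGen/AMDGPU/
  add.v2i16.ll (3 lines)
  amdgpu-codegenprepare-fold-binop-select.ll (4 lines)
  amdgpu-codegenprepare-idiv.ll (2911 lines)
  and.ll (10 lines)
  atomic_optimizations_local_pointer.ll (164 lines)
  bypass-div.ll (108 lines)
  combine-reg-or-const.ll (2 lines)
  constant-address-space-32bit.ll (2 lines)
  ctlz.ll (11 lines)
  cttz.ll (11 lines)
  cvt_f32_ubyte.ll (10 lines)
  extract-subvector-16bit.ll (19 lines)
  fabs.f16.ll (5 lines)
  fabs.f64.ll (12 lines)
  fabs.ll (12 lines)
  fexp.ll (7 lines)
  flat-scratch.ll (162 lines)
  fmed3.ll (5 lines)
  fneg-fabs.f16.ll (5 lines)
  fneg-fabs.f64.ll (14 lines)
  fneg-fabs.ll (14 lines)
  fneg.ll (13 lines)
  fold-immediate-operand-shrink-with-carry.mir (7 lines)
  frem.ll (179 lines)
  fshr.ll (64 lines)
  idiv-licm.ll (38 lines)
  idot2.ll (5 lines)
  idot4u.ll (81 lines)
  idot8s.ll (242 lines)
  idot8u.ll (158 lines)
  immv216.ll (5 lines)
  insert_vector_dynelt.ll (5 lines)
  insert_vector_elt.ll (15 lines)
  insert_vector_elt.v2i16.ll (8 lines)
  llvm.amdgcn.buffer.store.format.d16.ll (5 lines)
  llvm.amdgcn.image.sample.a16.dim.ll (95 lines)
  llvm.amdgcn.image.sample.g16.a16.dim.ll (223 lines)
  llvm.amdgcn.image.sample.g16.encode.ll (77 lines)
  llvm.amdgcn.image.sample.g16.ll (77 lines)
  llvm.amdgcn.raw.buffer.store.format.d16.ll (10 lines)
  llvm.amdgcn.raw.tbuffer.store.d16.ll (10 lines)
  llvm.amdgcn.struct.buffer.store.format.d16.ll (10 lines)
  llvm.amdgcn.struct.tbuffer.store.d16.ll (10 lines)
  llvm.amdgcn.tbuffer.store.d16.ll (10 lines)
  llvm.log.f16.ll (7 lines)
  llvm.log10.f16.ll (7 lines)
  llvm.round.f64.ll (114 lines)
  load-constant-i16.ll (2063 lines)
  load-global-i16.ll (282 lines)
  madak.ll (4 lines)
  max.ll (4 lines)
  mul.ll (2 lines)
  mul_uint24-amdgcn.ll (99 lines)
  or.ll (4 lines)
  packed-fp32.ll (3 lines)
  promote-constOffset-to-imm.ll (26 lines)
  s_addk_i32.ll (6 lines)
  salu-to-valu.ll (3 lines)
  scratch-buffer.ll (3 lines)
  sdiv64.ll (347 lines)
  sdwa-peephole.ll (6 lines)
  setcc-opt.ll (13 lines)
  shift-i128.ll (2 lines)
  shl.v2i16.ll (11 lines)
  shrink-add-sub-constant.ll (20 lines)
  splitkit-getsubrangeformask.ll (9 lines)
  srem64.ll (537 lines)
  strict_fadd.f16.ll (9 lines)
  strict_fma.f16.ll (9 lines)
  strict_fmul.f16.ll (9 lines)
  strict_fsub.f16.ll (9 lines)
  sub.v2i16.ll (2 lines)
  uaddsat.ll (7 lines)
  udiv.ll (133 lines)
  udiv64.ll (541 lines)
  udivrem24.ll (10 lines)
  urem64.ll (451 lines)
  usubsat.ll (5 lines)
  v_pack.ll (15 lines)
  vector_shuffle.packed.ll (13 lines)
  xor.ll (4 lines)
  zero_extend.ll (5 lines)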

Diff 430296

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp

[140 unchanged lines hidden; context: static unsigned macToMad(unsigned Opc) {]
   case AMDGPU::V_FMAC_LEGACY_F32_e64:
     return AMDGPU::V_FMA_LEGACY_F32_e64;
   case AMDGPU::V_FMAC_F64_e64:
     return AMDGPU::V_FMA_F64_e64;
   }
   return AMDGPU::INSTRUCTION_LIST_END;
 }
 
-// Wrapper around isInlineConstant that understands special cases when
-// instruction types are replaced during operand folding.
-static bool isInlineConstantIfFolded(const SIInstrInfo *TII,
-                                     const MachineInstr &UseMI,
-                                     unsigned OpNo,
-                                     const MachineOperand &OpToFold) {
-  if (TII->isInlineConstant(UseMI, OpNo, OpToFold))
-    return true;
-
-  unsigned Opc = UseMI.getOpcode();
-  unsigned NewOpc = macToMad(Opc);
-  if (NewOpc != AMDGPU::INSTRUCTION_LIST_END) {
-    // Special case for mac. Since this is replaced with mad when folded into
-    // src2, we need to check the legality for the final instruction.
-    int Src2Idx = AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::src2);
-    if (static_cast<int>(OpNo) == Src2Idx) {
-      const MCInstrDesc &MadDesc = TII->get(NewOpc);
-      return TII->isInlineConstant(OpToFold, MadDesc.OpInfo[OpNo].OperandType);
-    }
-  }
-
-  return false;
-}
-
 // TODO: Add heuristic that the frame index might not fit in the addressing mode
 // immediate offset to avoid materializing in loops.
 static bool frameIndexMayFold(const SIInstrInfo *TII,
                               const MachineInstr &UseMI,
                               int OpNo,
                               const MachineOperand &OpToFold) {
   if (!OpToFold.isFI())
     return false;
[1,081 unchanged lines hidden; context: for (auto &UseMI :]
       // be folded due to multiple uses or operand constraints.
       if (tryConstantFoldOp(*MRI, TII, &UseMI)) {
         LLVM_DEBUG(dbgs() << "Constant folded " << UseMI);
         Changed = true;
       }
     }
   }
 
-  bool FoldingImm = OpToFold.isImm() || OpToFold.isFI() || OpToFold.isGlobal();
-  if (FoldingImm) {
-    unsigned NumLiteralUses = 0;
-    MachineOperand *NonInlineUse = nullptr;
-    int NonInlineUseOpNo = -1;
-
-    for (auto &Use :
-         make_early_inc_range(MRI->use_nodbg_operands(Dst.getReg()))) {
-      MachineInstr *UseMI = Use.getParent();
-      unsigned OpNo = UseMI->getOperandNo(&Use);
-
-      // Try to fold any inline immediate uses, and then only fold other
-      // constants if they have one use.
-      //
-      // The legality of the inline immediate must be checked based on the use
-      // operand, not the defining instruction, because 32-bit instructions
-      // with 32-bit inline immediate sources may be used to materialize
-      // constants used in 16-bit operands.
-      //
-      // e.g. it is unsafe to fold:
-      // s_mov_b32 s0, 1.0 // materializes 0x3f800000
-      // v_add_f16 v0, v1, s0 // 1.0 f16 inline immediate sees 0x00003c00
-
-      // Folding immediates with more than one use will increase program size.
-      // FIXME: This will also reduce register usage, which may be better
-      // in some cases. A better heuristic is needed.
-      if (isInlineConstantIfFolded(TII, *UseMI, OpNo, OpToFold)) {
-        foldOperand(OpToFold, UseMI, OpNo, FoldList, CopiesToReplace);
-      } else if (frameIndexMayFold(TII, *UseMI, OpNo, OpToFold)) {
-        foldOperand(OpToFold, UseMI, OpNo, FoldList, CopiesToReplace);
-      } else {
-        if (++NumLiteralUses == 1) {
-          NonInlineUse = &Use;
-          NonInlineUseOpNo = OpNo;
-        }
-      }
-    }
-
-    if (NumLiteralUses == 1) {
-      MachineInstr *UseMI = NonInlineUse->getParent();
-      foldOperand(OpToFold, UseMI, NonInlineUseOpNo, FoldList, CopiesToReplace);
-    }
-  } else {
-    // Folding register.
-    SmallVector <MachineOperand *, 4> UsesToProcess;
+  SmallVector<MachineOperand *, 4> UsesToProcess;
   for (auto &Use : MRI->use_nodbg_operands(Dst.getReg()))
     UsesToProcess.push_back(&Use);
   for (auto U : UsesToProcess) {
     MachineInstr *UseMI = U->getParent();
-
-      foldOperand(OpToFold, UseMI, UseMI->getOperandNo(U),
-        FoldList, CopiesToReplace);
-    }
+    foldOperand(OpToFold, UseMI, UseMI->getOperandNo(U), FoldList,
+                CopiesToReplace);
   }
 
   if (CopiesToReplace.empty() && FoldList.empty())
     return Changed;
 
   MachineFunction *MF = MI.getParent()->getParent();
   // Make sure we add EXEC uses to any new v_mov instructions created.
   for (MachineInstr *Copy : CopiesToReplace)
     Copy->addImplicitDefUseOperands(*MF);
[557 unchanged lines hidden]

llvm/test/CodeGen/AMDGPU/GlobalISel/add.v2i16.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/addo.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/andn2.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/ashr.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/bswap.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-mul.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-sub-ext-neg-mul.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-sub-mul.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-sub-neg-mul.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i8.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f32.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll

[90 unchanged lines hidden]
 ; GFX10-LABEL: store_load_vindex_kernel:
 ; GFX10: ; %bb.0: ; %bb
 ; GFX10-NEXT: s_add_u32 s0, s0, s3
 ; GFX10-NEXT: s_addc_u32 s1, s1, 0
 ; GFX10-NEXT: s_setreg_b32 hwreg(HW_REG_FLAT_SCR_LO), s0
 ; GFX10-NEXT: s_setreg_b32 hwreg(HW_REG_FLAT_SCR_HI), s1
 ; GFX10-NEXT: v_sub_nc_u32_e32 v1, 0, v0
 ; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
-; GFX10-NEXT: v_mov_b32_e32 v2, 4
-; GFX10-NEXT: v_mov_b32_e32 v3, 15
+; GFX10-NEXT: v_mov_b32_e32 v2, 15
 ; GFX10-NEXT: v_lshlrev_b32_e32 v1, 2, v1
-; GFX10-NEXT: v_add_nc_u32_e32 v0, v2, v0
-; GFX10-NEXT: v_add_nc_u32_e32 v1, v2, v1
-; GFX10-NEXT: scratch_store_dword v0, v3, off
+; GFX10-NEXT: v_add_nc_u32_e32 v0, 4, v0
+; GFX10-NEXT: v_add_nc_u32_e32 v1, 4, v1
+; GFX10-NEXT: scratch_store_dword v0, v2, off
 ; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT: scratch_load_dword v0, v1, off offset:124 glc dlc
 ; GFX10-NEXT: s_waitcnt vmcnt(0)
 ; GFX10-NEXT: s_endpgm
 ;
 ; GFX940-LABEL: store_load_vindex_kernel:
 ; GFX940: ; %bb.0: ; %bb
 ; GFX940-NEXT: v_lshlrev_b32_e32 v1, 2, v0
 ; GFX940-NEXT: v_mov_b32_e32 v2, 15
[37 unchanged lines hidden]
 ; GFX9-NEXT: s_setpc_b64 s[30:31]
 ;
 ; GFX10-LABEL: store_load_vindex_foo:
 ; GFX10: ; %bb.0: ; %bb
 ; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT: v_and_b32_e32 v1, 15, v0
 ; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
-; GFX10-NEXT: v_mov_b32_e32 v2, s32
-; GFX10-NEXT: v_mov_b32_e32 v3, 15
+; GFX10-NEXT: v_mov_b32_e32 v2, 15
 ; GFX10-NEXT: v_lshlrev_b32_e32 v1, 2, v1
-; GFX10-NEXT: v_add_nc_u32_e32 v0, v2, v0
-; GFX10-NEXT: v_add_nc_u32_e32 v1, v2, v1
-; GFX10-NEXT: scratch_store_dword v0, v3, off
+; GFX10-NEXT: v_add_nc_u32_e32 v0, s32, v0
+; GFX10-NEXT: v_add_nc_u32_e32 v1, s32, v1
+; GFX10-NEXT: scratch_store_dword v0, v2, off
 ; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT: scratch_load_dword v0, v1, off glc dlc
 ; GFX10-NEXT: s_waitcnt vmcnt(0)
 ; GFX10-NEXT: s_setpc_b64 s[30:31]
 ;
 ; GFX940-LABEL: store_load_vindex_foo:
 ; GFX940: ; %bb.0: ; %bb
 ; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[149 unchanged lines hidden]
 ; GFX10-LABEL: store_load_vindex_small_offset_kernel:
 ; GFX10: ; %bb.0: ; %bb
 ; GFX10-NEXT: s_add_u32 s0, s0, s3
 ; GFX10-NEXT: s_addc_u32 s1, s1, 0
 ; GFX10-NEXT: s_setreg_b32 hwreg(HW_REG_FLAT_SCR_LO), s0
 ; GFX10-NEXT: s_setreg_b32 hwreg(HW_REG_FLAT_SCR_HI), s1
 ; GFX10-NEXT: v_sub_nc_u32_e32 v1, 0, v0
 ; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
-; GFX10-NEXT: v_mov_b32_e32 v2, 0x104
-; GFX10-NEXT: v_mov_b32_e32 v3, 15
-; GFX10-NEXT: v_lshlrev_b32_e32 v1, 2, v1
-; GFX10-NEXT: v_add_nc_u32_e32 v0, v2, v0
-; GFX10-NEXT: v_add_nc_u32_e32 v1, v2, v1
-; GFX10-NEXT: scratch_load_dword v2, off, off offset:4 glc dlc
+; GFX10-NEXT: v_mov_b32_e32 v2, 15
+; GFX10-NEXT: scratch_load_dword v3, off, off offset:4 glc dlc
 ; GFX10-NEXT: s_waitcnt vmcnt(0)
-; GFX10-NEXT: scratch_store_dword v0, v3, off
+; GFX10-NEXT: v_lshlrev_b32_e32 v1, 2, v1
+; GFX10-NEXT: v_add_nc_u32_e32 v0, 0x104, v0
+; GFX10-NEXT: v_add_nc_u32_e32 v1, 0x104, v1
+; GFX10-NEXT: scratch_store_dword v0, v2, off
 ; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT: scratch_load_dword v0, v1, off offset:124 glc dlc
 ; GFX10-NEXT: s_waitcnt vmcnt(0)
 ; GFX10-NEXT: s_endpgm
 ;
 ; GFX940-LABEL: store_load_vindex_small_offset_kernel:
 ; GFX940: ; %bb.0: ; %bb
 ; GFX940-NEXT: scratch_load_dword v1, off, off offset:4 sc0 sc1
[45 unchanged lines hidden]
 ; GFX9-NEXT: s_waitcnt vmcnt(0)
 ; GFX9-NEXT: s_setpc_b64 s[30:31]
 ;
 ; GFX10-LABEL: store_load_vindex_small_offset_foo:
 ; GFX10: ; %bb.0: ; %bb
 ; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT: v_and_b32_e32 v1, 15, v0
-; GFX10-NEXT: s_add_i32 vcc_lo, s32, 0x100
 ; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
-; GFX10-NEXT: v_mov_b32_e32 v2, vcc_lo
-; GFX10-NEXT: v_mov_b32_e32 v3, 15
-; GFX10-NEXT: v_lshlrev_b32_e32 v1, 2, v1
-; GFX10-NEXT: v_add_nc_u32_e32 v0, v2, v0
-; GFX10-NEXT: v_add_nc_u32_e32 v1, v2, v1
-; GFX10-NEXT: scratch_load_dword v2, off, s32 glc dlc
+; GFX10-NEXT: s_add_i32 vcc_lo, s32, 0x100
+; GFX10-NEXT: v_mov_b32_e32 v2, 15
+; GFX10-NEXT: scratch_load_dword v3, off, s32 glc dlc
 ; GFX10-NEXT: s_waitcnt vmcnt(0)
-; GFX10-NEXT: scratch_store_dword v0, v3, off
+; GFX10-NEXT: v_lshlrev_b32_e32 v1, 2, v1
+; GFX10-NEXT: v_add_nc_u32_e32 v0, vcc_lo, v0
+; GFX10-NEXT: s_add_i32 vcc_lo, s32, 0x100
+; GFX10-NEXT: v_add_nc_u32_e32 v1, vcc_lo, v1
+; GFX10-NEXT: scratch_store_dword v0, v2, off
 ; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT: scratch_load_dword v0, v1, off glc dlc
 ; GFX10-NEXT: s_waitcnt vmcnt(0)
 ; GFX10-NEXT: s_setpc_b64 s[30:31]
 ;
 ; GFX940-LABEL: store_load_vindex_small_offset_foo:
 ; GFX940: ; %bb.0: ; %bb
 ; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[126 unchanged lines hidden]
 ; GFX10-LABEL: store_load_vindex_large_offset_kernel:
 ; GFX10: ; %bb.0: ; %bb
 ; GFX10-NEXT: s_add_u32 s0, s0, s3
 ; GFX10-NEXT: s_addc_u32 s1, s1, 0
 ; GFX10-NEXT: s_setreg_b32 hwreg(HW_REG_FLAT_SCR_LO), s0
 ; GFX10-NEXT: s_setreg_b32 hwreg(HW_REG_FLAT_SCR_HI), s1
 ; GFX10-NEXT: v_sub_nc_u32_e32 v1, 0, v0
 ; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
-; GFX10-NEXT: v_mov_b32_e32 v2, 0x4004
-; GFX10-NEXT: v_mov_b32_e32 v3, 15
-; GFX10-NEXT: v_lshlrev_b32_e32 v1, 2, v1
-; GFX10-NEXT: v_add_nc_u32_e32 v0, v2, v0
-; GFX10-NEXT: v_add_nc_u32_e32 v1, v2, v1
-; GFX10-NEXT: scratch_load_dword v2, off, off offset:4 glc dlc
+; GFX10-NEXT: v_mov_b32_e32 v2, 15
+; GFX10-NEXT: scratch_load_dword v3, off, off offset:4 glc dlc
 ; GFX10-NEXT: s_waitcnt vmcnt(0)
-; GFX10-NEXT: scratch_store_dword v0, v3, off
+; GFX10-NEXT: v_lshlrev_b32_e32 v1, 2, v1
+; GFX10-NEXT: v_add_nc_u32_e32 v0, 0x4004, v0
+; GFX10-NEXT: v_add_nc_u32_e32 v1, 0x4004, v1
+; GFX10-NEXT: scratch_store_dword v0, v2, off
 ; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT: scratch_load_dword v0, v1, off offset:124 glc dlc
 ; GFX10-NEXT: s_waitcnt vmcnt(0)
 ; GFX10-NEXT: s_endpgm
 ;
 ; GFX940-LABEL: store_load_vindex_large_offset_kernel:
 ; GFX940: ; %bb.0: ; %bb
 ; GFX940-NEXT: scratch_load_dword v1, off, off offset:4 sc0 sc1
	▲ Show 20 Lines • Show All 47 Lines • ▼ Show 20 Lines
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX10-LABEL: store_load_vindex_large_offset_foo:			; GFX10-LABEL: store_load_vindex_large_offset_foo:
	; GFX10: ; %bb.0: ; %bb			; GFX10: ; %bb.0: ; %bb
	; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX10-NEXT: s_waitcnt_vscnt null, 0x0			; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX10-NEXT: v_and_b32_e32 v1, 15, v0			; GFX10-NEXT: v_and_b32_e32 v1, 15, v0
	; GFX10-NEXT: s_add_i32 vcc_lo, s32, 0x4004
	; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX10-NEXT: v_mov_b32_e32 v2, vcc_lo			; GFX10-NEXT: s_add_i32 vcc_lo, s32, 0x4004
	; GFX10-NEXT: v_mov_b32_e32 v3, 15			; GFX10-NEXT: v_mov_b32_e32 v2, 15
	; GFX10-NEXT: v_lshlrev_b32_e32 v1, 2, v1			; GFX10-NEXT: scratch_load_dword v3, off, s32 offset:4 glc dlc
	; GFX10-NEXT: v_add_nc_u32_e32 v0, v2, v0
	; GFX10-NEXT: v_add_nc_u32_e32 v1, v2, v1
	; GFX10-NEXT: scratch_load_dword v2, off, s32 offset:4 glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: scratch_store_dword v0, v3, off			; GFX10-NEXT: v_lshlrev_b32_e32 v1, 2, v1
				; GFX10-NEXT: v_add_nc_u32_e32 v0, vcc_lo, v0
				; GFX10-NEXT: s_add_i32 vcc_lo, s32, 0x4004
				; GFX10-NEXT: v_add_nc_u32_e32 v1, vcc_lo, v1
				; GFX10-NEXT: scratch_store_dword v0, v2, off
	; GFX10-NEXT: s_waitcnt_vscnt null, 0x0			; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX10-NEXT: scratch_load_dword v0, v1, off glc dlc			; GFX10-NEXT: scratch_load_dword v0, v1, off glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: s_setpc_b64 s[30:31]			; GFX10-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX940-LABEL: store_load_vindex_large_offset_foo:			; GFX940-LABEL: store_load_vindex_large_offset_foo:
	; GFX940: ; %bb.0: ; %bb			; GFX940: ; %bb.0: ; %bb
	; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)

llvm/test/CodeGen/AMDGPU/GlobalISel/fmed3.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/fpow.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/fshr.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i16.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i8.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.scale.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.atomic.dim.a16.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.gather4.a16.dim.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.1d.d16.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.2darraymsaa.a16.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.3d.a16.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.sample.g16.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.sbfe.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.sdot4.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.udot4.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/lshr.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/orn2.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/roundeven.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i32.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/shl-ext-reduce.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/shl.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i32.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/store-local.128.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/store-local.96.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/subo.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/trunc.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/uaddsat.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i32.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/usubsat.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/xnor.ll

llvm/test/CodeGen/AMDGPU/add.v2i16.ll

llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fold-binop-select.ll

llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll

llvm/test/CodeGen/AMDGPU/and.ll

llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll

llvm/test/CodeGen/AMDGPU/bypass-div.ll

llvm/test/CodeGen/AMDGPU/combine-reg-or-const.ll

llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll

llvm/test/CodeGen/AMDGPU/ctlz.ll

llvm/test/CodeGen/AMDGPU/cttz.ll

llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll

llvm/test/CodeGen/AMDGPU/extract-subvector-16bit.ll

llvm/test/CodeGen/AMDGPU/fabs.f16.ll

llvm/test/CodeGen/AMDGPU/fabs.f64.ll

llvm/test/CodeGen/AMDGPU/fabs.ll

llvm/test/CodeGen/AMDGPU/fexp.ll

llvm/test/CodeGen/AMDGPU/flat-scratch.ll

llvm/test/CodeGen/AMDGPU/fmed3.ll

llvm/test/CodeGen/AMDGPU/fneg-fabs.f16.ll

llvm/test/CodeGen/AMDGPU/fneg-fabs.f64.ll

llvm/test/CodeGen/AMDGPU/fneg-fabs.ll

llvm/test/CodeGen/AMDGPU/fneg.ll

llvm/test/CodeGen/AMDGPU/fold-immediate-operand-shrink-with-carry.mir

llvm/test/CodeGen/AMDGPU/frem.ll

llvm/test/CodeGen/AMDGPU/fshr.ll

llvm/test/CodeGen/AMDGPU/idiv-licm.ll

llvm/test/CodeGen/AMDGPU/idot2.ll

llvm/test/CodeGen/AMDGPU/idot4u.ll

llvm/test/CodeGen/AMDGPU/idot8s.ll

llvm/test/CodeGen/AMDGPU/idot8u.ll

llvm/test/CodeGen/AMDGPU/immv216.ll

llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll

llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll

llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.buffer.store.format.d16.ll

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.sample.a16.dim.ll

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.sample.g16.a16.dim.ll

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.sample.g16.encode.ll

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.sample.g16.ll

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.store.format.d16.ll

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.tbuffer.store.d16.ll

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.store.format.d16.ll

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.tbuffer.store.d16.ll

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.tbuffer.store.d16.ll

llvm/test/CodeGen/AMDGPU/llvm.log.f16.ll

llvm/test/CodeGen/AMDGPU/llvm.log10.f16.ll

llvm/test/CodeGen/AMDGPU/llvm.round.f64.ll

llvm/test/CodeGen/AMDGPU/load-constant-i16.ll

llvm/test/CodeGen/AMDGPU/load-global-i16.ll

llvm/test/CodeGen/AMDGPU/madak.ll

	}			}

	; Make sure this is only folded with one use. This is a code size			; Make sure this is only folded with one use. This is a code size
	; optimization and if we fold the immediate multiple times, we'll undo			; optimization and if we fold the immediate multiple times, we'll undo
	; it.			; it.

	; GCN-LABEL: {{^}}madak_2_use_f32:			; GCN-LABEL: {{^}}madak_2_use_f32:
	; GFX9: v_mov_b32_e32 [[VK:v[0-9]+]], 0x41200000			; GFX9: v_mov_b32_e32 [[VK:v[0-9]+]], 0x41200000
	; GFX10: v_mov_b32_e32 [[VK:v[0-9]+]], 0x41200000
	; GFX6-DAG: buffer_load_dword [[VA:v[0-9]+]], {{v\[[0-9]+:[0-9]+\]}}, {{s\[[0-9]+:[0-9]+\]}}, 0 addr64 glc{{$}}			; GFX6-DAG: buffer_load_dword [[VA:v[0-9]+]], {{v\[[0-9]+:[0-9]+\]}}, {{s\[[0-9]+:[0-9]+\]}}, 0 addr64 glc{{$}}
	; GFX6-DAG: buffer_load_dword [[VB:v[0-9]+]], {{v\[[0-9]+:[0-9]+\]}}, {{s\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4			; GFX6-DAG: buffer_load_dword [[VB:v[0-9]+]], {{v\[[0-9]+:[0-9]+\]}}, {{s\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4
	; GFX6-DAG: buffer_load_dword [[VC:v[0-9]+]], {{v\[[0-9]+:[0-9]+\]}}, {{s\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8			; GFX6-DAG: buffer_load_dword [[VC:v[0-9]+]], {{v\[[0-9]+:[0-9]+\]}}, {{s\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8
	; GFX8_9_10: {{flat\|global}}_load_dword [[VA:v[0-9]+]],			; GFX8_9_10: {{flat\|global}}_load_dword [[VA:v[0-9]+]],
	; GFX8_9_10: {{flat\|global}}_load_dword [[VB:v[0-9]+]],			; GFX8_9_10: {{flat\|global}}_load_dword [[VB:v[0-9]+]],
	; GFX8_9_10: {{flat\|global}}_load_dword [[VC:v[0-9]+]],			; GFX8_9_10: {{flat\|global}}_load_dword [[VC:v[0-9]+]],
	; GFX6-DAG: v_mov_b32_e32 [[VK:v[0-9]+]], 0x41200000			; GFX6-DAG: v_mov_b32_e32 [[VK:v[0-9]+]], 0x41200000
	; GFX8-DAG: v_mov_b32_e32 [[VK:v[0-9]+]], 0x41200000			; GFX8-DAG: v_mov_b32_e32 [[VK:v[0-9]+]], 0x41200000
	; GFX6_8_9-DAG: v_madak_f32 {{v[0-9]+}}, [[VA]], [[VB]], 0x41200000			; GFX6_8_9-DAG: v_madak_f32 {{v[0-9]+}}, [[VA]], [[VB]], 0x41200000
	; GFX10-MAD-DAG:v_madak_f32 {{v[0-9]+}}, [[VA]], [[VB]], 0x41200000			; GFX10-MAD-DAG:v_madak_f32 {{v[0-9]+}}, [[VA]], [[VB]], 0x41200000
	; FMA-DAG: v_fmaak_f32 {{v[0-9]+}}, [[VA]], [[VB]], 0x41200000			; FMA-DAG: v_fmaak_f32 {{v[0-9]+}}, [[VA]], [[VB]], 0x41200000
	; MAD-DAG: v_mac_f32_e32 [[VK]], [[VA]], [[VC]]			; MAD-DAG: v_mac_f32_e32 [[VK]], [[VA]], [[VC]]
				foad (Author): Regression here: we are no longer forming madak/fmaak instructions. I think this is just bad luck. madak/fmaak formation is only implemented when PeepholeOptimizer calls SIInstrInfo::FoldImmediate. I think it would be much more reliable to do it as part of SIFoldOperands / SIShrinkInstructions.
				foad (Author): The regression got fixed by D125567.
	; FMA-DAG: v_fmac_f32_e32 [[VK]], [[VA]], [[VC]]			; GFX10-FMA-DAG:v_fmaak_f32 {{v[0-9]+}}, [[VA]], [[VC]], 0x41200000
				; GFX940-FMA-DAG:v_fmac_f32_e32 [[VK]], [[VA]], [[VC]]
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @madak_2_use_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #0 {			define amdgpu_kernel void @madak_2_use_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #0 {
	%tid = tail call i32 @llvm.amdgcn.workitem.id.x() nounwind readnone			%tid = tail call i32 @llvm.amdgcn.workitem.id.x() nounwind readnone

	%in.gep.0 = getelementptr float, float addrspace(1)* %in, i32 %tid			%in.gep.0 = getelementptr float, float addrspace(1)* %in, i32 %tid
	%in.gep.1 = getelementptr float, float addrspace(1)* %in.gep.0, i32 1			%in.gep.1 = getelementptr float, float addrspace(1)* %in.gep.0, i32 1
	%in.gep.2 = getelementptr float, float addrspace(1)* %in.gep.0, i32 2			%in.gep.2 = getelementptr float, float addrspace(1)* %in.gep.0, i32 2
