This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Aggressively fold immediates in SIFoldOperands
ClosedPublic

Authored by foad on Nov 26 2021, 7:56 AM.

Details

Summary

Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.

This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:

  • It reduces the number of registers used for holding constant values, which might increase occupancy. (On the other hand, many of these registers are SGPRs which no longer affect occupancy on GFX10+.)
  • It reduces ALU stalls between the instruction that loads a constant into a register, and the instruction that uses it.
  • The above benefits are expected to outweigh any increase in code size.

Diff Detail

Event Timeline

foad created this revision.Nov 26 2021, 7:56 AM
foad requested review of this revision.Nov 26 2021, 7:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 26 2021, 7:56 AM
foad added inline comments.Nov 26 2021, 8:09 AM
llvm/test/CodeGen/AMDGPU/madak.ll
54–55

Regression here: we are no longer forming madak/fmaak instructions. I think this is just bad luck. madak/fmaak formation is only implemented when PeepholeOptimizer calls SIInstrInfo::FoldImmediate. I think it would be much more reliable to do it as part of SIFoldOperands / SIShrinkInstructions.

sebastian-ne added inline comments.
llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
99–104

Not really related to this patch, but shouldn’t we be able to inline v2 (15) into the scratch_store?

foad added inline comments.Nov 26 2021, 8:42 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
99–104

No, v2 is the value being stored, and it has to be in a vgpr for that instruction (not even an inline constant is allowed).

Maybe we should start considering optsize here?

foad updated this revision to Diff 412371.Mar 2 2022, 3:40 AM

Rebase.

Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2022, 3:40 AM
foad updated this revision to Diff 429734.May 16 2022, 8:55 AM

Rebase.

foad added inline comments.May 16 2022, 8:58 AM
llvm/test/CodeGen/AMDGPU/madak.ll
54–55

The regression got fixed by D125567.

Both patches look good to me!

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1246–1253

Can this use make_early_inc_range instead of caching the small vector like in the if-case? (if so, this probably makes more sense as an NFC patch afterwards)

foad added inline comments.May 17 2022, 5:49 AM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1246–1253

Not sure. I can try that as a follow-up.

arsenm added inline comments.May 17 2022, 9:02 AM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
149–151

Don't we still need to consider this for deciding to fold fma/fmak for f16?

foad added inline comments.May 17 2022, 9:25 AM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
149–151

I don't see why. "isInline" functions are just asking whether it's free to fold a constant (i.e. no increase in code size). The actual legality checks are done later in tryAddToFoldList.

arsenm accepted this revision.May 17 2022, 1:20 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1246–1253

The iterators in this pass are kind of a mess. I've wanted to rewrite this pass to work more like how PeepholeOpt works, collecting defs, visiting uses and looking for collected defs.

This revision is now accepted and ready to land.May 17 2022, 1:20 PM
This revision was landed with ongoing or failed builds.May 18 2022, 2:22 AM
This revision was automatically updated to reflect the committed changes.
foad added inline comments.May 18 2022, 2:40 AM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1246–1253

I tried this, but it seems to get stuck in infinite loops in several lit tests. Not sure why:

diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index 99aa8a60b04f..3159693a2b6e 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -1243,12 +1243,9 @@ bool SIFoldOperands::foldInstOperand(MachineInstr &MI,
     }
   }
 
-  SmallVector<MachineOperand *, 4> UsesToProcess;
-  for (auto &Use : MRI->use_nodbg_operands(Dst.getReg()))
-    UsesToProcess.push_back(&Use);
-  for (auto U : UsesToProcess) {
-    MachineInstr *UseMI = U->getParent();
-    foldOperand(OpToFold, UseMI, UseMI->getOperandNo(U), FoldList,
+  for (auto &U : make_early_inc_range(MRI->use_nodbg_operands(Dst.getReg()))) {
+    MachineInstr *UseMI = U.getParent();
+    foldOperand(OpToFold, UseMI, UseMI->getOperandNo(&U), FoldList,
                 CopiesToReplace);
   }
sebastian-ne added inline comments.May 18 2022, 3:21 AM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1246–1253

Ok, thanks for trying!