This is an archive of the discontinued LLVM Phabricator instance.

[MachineLICM] Handle subloops
ClosedPublic

Authored by jaykang10 on Jun 30 2023, 5:09 AM.

Details

Summary

It looks MachineLICM pass handles only outmost loops even though there are loop invariant codes in inner loops.
As an example, I have pre-committed a test llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll.
In the test, after isel, there is DUPv8i16gpr in vectorized loop and it is loop invariant. However, MachineLICM does not hoist it to the preheader of vectorized loop because it does not consider inner loops.
I think MachineLICM pass could handle the inner loops.

Diff Detail

Event Timeline

jaykang10 created this revision.Jun 30 2023, 5:09 AM
jaykang10 requested review of this revision.Jun 30 2023, 5:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 30 2023, 5:09 AM
jaykang10 edited the summary of this revision. (Show Details)Jun 30 2023, 5:38 AM

It has always been a bit strange that this only tries to sink out of outermost loops. This looks like it will try the largest loop first, which sound like it makes sense.

Are there other tests that need to be updated?

llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
4

I tend to remove the Function Attrs along with dso_local and the local_unnamed_addr #0

159

And can you remove as much of this as you can.

159

I don't believe these are used for anything.

It has always been a bit strange that this only tries to sink out of outermost loops. This looks like it will try the largest loop first, which sound like it makes sense.

Are there other tests that need to be updated?

This change makes MachineLICM handle the loops from inner-most to out-most. If there is loop invariant code in inner-most loop, it will be hoisted to the pre-header of the inner-most loop first and then to outer loop gradually.
MachineLICM is target independent CodeGen pass so I am checking tests of other targets too.

llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
4

Sorry, I added this test as an experimental example.
Let me tidy up this test.

159

ditto

159

ditto

Additionally, if possible, I would like to get feedback from other target people.

jaykang10 updated this revision to Diff 538657.Jul 10 2023, 8:36 AM
jaykang10 added subscribers: sunfish, arsenm.

Updated test files.

For AMDGPU target, after hoisting some MIRs, SIOptimizeExecMaskingPreRA pass fails to remove them.
On CodeGen/AMDGPU/agpr-copy-no-free-registers.ll, it has below loop in MIR level.

Loop at depth 1 containing: %bb.1<header>,%bb.3,%bb.5,%bb.6,%bb.7,%bb.8,%bb.11,%bb.12,%bb.4,%bb.2,%bb.9<latch><exiting>
    Loop at depth 2 containing: %bb.5<header>,%bb.6,%bb.7,%bb.8,%bb.11<latch><exiting>

With this patch, below MIRs are hoisted from bb.5 to bb.3 in inner loop.

%155:vgpr_32 = V_CNDMASK_B32_e64 0, 0, 0, 1, %13:sreg_64_xexec, implicit $exec
%258:sreg_64_xexec = V_CMP_NE_U32_e64 %155:vgpr_32, %90:sreg_32, implicit $ex

After that, SIOptimizeExecMaskingPreRA pass fails to optimize the MIRs rather than original one so it looks there are more instructions with this patch. I have not checked the pass in detail but I guess the pass could handle the case. Other AMDGPU regressions have same issue.

A comment on SIOptimizeExecMaskingPreRA 
// Optimize sequence
//    %sel = V_CNDMASK_B32_e64 0, 1, %cc
//    %cmp = V_CMP_NE_U32 1, %sel
//    $vcc = S_AND_B64 $exec, %cmp
//    S_CBRANCH_VCC[N]Z
// =>
//    $vcc = S_ANDN2_B64 $exec, %cc
//    S_CBRANCH_VCC[N]Z

@arsenm If this change causes something wrong for AMDGPU target, please let me know.

For Webassembly target, on CodeGen/WebAssembly/reg-stackify.ll, I can see below MIR is hoisted to inner loop's preheader and it looks ok.

%3:fr64 = ADDSDrr %1:fr64(tied-def 0), %28:fr64, implicit $mxcsr

@sunfish If this change causes something wrong for WebAssembly target, please let me know.

If I remember right, the original commit that added this cited compile time as the reason. Have you measured the impact to compile time?

If I remember right, the original commit that added this cited compile time as the reason. Have you measured the impact to compile time?

Thanks for comment. @craig.topper
I have not checked the compile time yet. Maybe, interpreter or compiler workloads could cause more compile time because they usually have big nested loops. Let me check the compile time.
Additionally, if the compile time is issue, we could skip to visit basic blocks which are already visited with inner loop because the loop invariant code of the inner loop has already been hoisted to inner loop's preheader. Let me check it too.

"When loops are nested, we generally optimize the inner loops before the outer loops. For one, inner loops are likely to be executed more often. For another, it could move computation to an outer loop from which it is hoisted further when the outer loop is optimized and so on."

https://www.cs.cmu.edu/~fp/courses/15411-f13/lectures/17-loopopt.pdf

I have checked the compile time with llvm-testsuite/CTMark. It looks it is not too bad.

workload	org_inst_count	patch_inst_count	diff(%)
ClamAV	69508780509	69630988308	0.175816347
7zip	2.32E+11	2.32E+11	0.006923268
tramp3d-v4	1.05E+11	1.05E+11	0.000903823
kimwitu++	49770088564	49763466457	-0.013305395
sqlite3	46693620206	46759200740	0.140448596
mafft	42673608370	42731240751	0.13505392
SPASS	56670401414	56809872194	0.246108686
lencod	79886302869	80002255900	0.145147575
consumer-typeset	43685066260	43697382603	0.028193486
Bullet	1.15E+11	1.15E+11	-0.025679361
jaykang10 added a comment.EditedJul 11 2023, 5:33 AM

@craig.topper If you are not happy with the compile time number, please let me know.
Additionally, it seems it is not simple to skip the basic blocks of the inner loops while handling outer loop.

I have checked the compile time with llvm-testsuite/CTMark. It looks it is not too bad.

workload	org_inst_count	patch_inst_count	diff(%)
ClamAV	69508780509	69630988308	0.175816347
7zip	2.32E+11	2.32E+11	0.006923268
tramp3d-v4	1.05E+11	1.05E+11	0.000903823
kimwitu++	49770088564	49763466457	-0.013305395
sqlite3	46693620206	46759200740	0.140448596
mafft	42673608370	42731240751	0.13505392
SPASS	56670401414	56809872194	0.246108686
lencod	79886302869	80002255900	0.145147575
consumer-typeset	43685066260	43697382603	0.028193486
Bullet	1.15E+11	1.15E+11	-0.025679361

The headings org_inst_count and patch_inst_count don't sound like compile time to me. That sounds like sizes of the resulting binary?

The headings org_inst_count and patch_inst_count don't sound like compile time to me. That sounds like sizes of the resulting binary?

Those will likely be instruction counts, as per https://llvm-compile-time-tracker.com.

I have checked the compile time with llvm-testsuite/CTMark. It looks it is not too bad.

workload	org_inst_count	patch_inst_count	diff(%)
ClamAV	69508780509	69630988308	0.175816347
7zip	2.32E+11	2.32E+11	0.006923268
tramp3d-v4	1.05E+11	1.05E+11	0.000903823
kimwitu++	49770088564	49763466457	-0.013305395
sqlite3	46693620206	46759200740	0.140448596
mafft	42673608370	42731240751	0.13505392
SPASS	56670401414	56809872194	0.246108686
lencod	79886302869	80002255900	0.145147575
consumer-typeset	43685066260	43697382603	0.028193486
Bullet	1.15E+11	1.15E+11	-0.025679361

The headings org_inst_count and patch_inst_count don't sound like compile time to me. That sounds like sizes of the resulting binary?

Ah, sorry for poor explanation.
It means the instruction count from perf on linux host during compilation of the workloads so bigger number of instruction count means longer compile time.
I used the TEST_SUITE_USE_PERF option with llvm-test-suite and checked the number of instructions from the perfstats file.

Yep, I followed the llvm-compile-time-tracker.
@dmgreen let me know the llvm-compile-time-tracker and the scripts.

I have checked the compile time with llvm-testsuite/CTMark. It looks it is not too bad.

workload	org_inst_count	patch_inst_count	diff(%)
ClamAV	69508780509	69630988308	0.175816347
7zip	2.32E+11	2.32E+11	0.006923268
tramp3d-v4	1.05E+11	1.05E+11	0.000903823
kimwitu++	49770088564	49763466457	-0.013305395
sqlite3	46693620206	46759200740	0.140448596
mafft	42673608370	42731240751	0.13505392
SPASS	56670401414	56809872194	0.246108686
lencod	79886302869	80002255900	0.145147575
consumer-typeset	43685066260	43697382603	0.028193486
Bullet	1.15E+11	1.15E+11	-0.025679361

The headings org_inst_count and patch_inst_count don't sound like compile time to me. That sounds like sizes of the resulting binary?

Ah, sorry for poor explanation.
It means the instruction count from perf on linux host during compilation of the workloads so bigger number of instruction count means longer compile time.
I used the TEST_SUITE_USE_PERF option with llvm-test-suite and checked the number of instructions from the perfstats file.

Thanks for the clarification.

wxiao3 added inline comments.Jul 12 2023, 12:24 AM
llvm/lib/CodeGen/MachineLICM.cpp
335

2 questions:

  1. why we don't require that the outer-most loop that has a unique predecessor?
  2. can we push the innermost loops into the worklist first?
jaykang10 added inline comments.Jul 12 2023, 1:49 AM
llvm/lib/CodeGen/MachineLICM.cpp
335

Thanks for questions.

  1. why we don't require that the outer-most loop that has a unique predecessor?

As you can see, current implementation handles inner loops when outmost loop does not have unique predecessor. If loops have preheader, I think we can hoist loop invariant code into the preheader. The HoistOutOfLoop function checks it so I think we do not need to check the outer-most loop that has a unique predecessor.

  1. can we push the innermost loops into the worklist first?

We use Worklist.pop_back_val() and it means we handles last element of worklist first. In order to handle inner-most loop first, the inner-most loop is pushed into last element of the worklist.
If you feel something wrong, please let me know.

wxiao3 accepted this revision.Jul 12 2023, 6:31 AM
This revision is now accepted and ready to land.Jul 12 2023, 6:31 AM

Not all tests are updated? X86/licm-nested.ll

Not all tests are updated? X86/licm-nested.ll

The test has ; REQUIRES: asserts and it is enabled in the build with asserts.
Let me update the test.

jaykang10 updated this revision to Diff 539567.Jul 12 2023, 8:04 AM

Updated test/CodeGen/X86/licm-nested.ll.

This revision was landed with ongoing or failed builds.Jul 12 2023, 8:33 AM
This revision was automatically updated to reflect the committed changes.

Why do we want to visit inner loops first? Doesn't the algorithm visit all blocks of the inner loops when its on an outer loop anyway?

Why do we want to visit inner loops first? Doesn't the algorithm visit all blocks of the inner loops when its on an outer loop anyway?

As you can see on MachineLoop::isLoopInvariant(), if the loop contains the definition of MI's operands, the MI is not loop invariant.
Let's see below loop form.

outer loop:
    definition of operand of below loop invariant code
    inner loop:
        loop invariant code

If we visit only outer loop, the loop invariant code can not be hoisted to outer loop's preheader because its operand's definition is in outer loop.
If we visit inner loop, the loop invariant code can be hoisted to the inner loop's preheader because the inner loop does not contain the definition of the invariant code's operand.
That's the reason why I want to visit the inner loops.
If you feel something wrong, please let me know.

nikic added a subscriber: nikic.Jul 12 2023, 10:00 AM

we could skip to visit basic blocks which are already visited with inner loop because the loop invariant code of the inner loop has already been hoisted to inner loop's preheader.

Please do implement this (and revert this patch in the meantime if non-trivial). LICM should visit each block only once, not once per parent loop.

Why do we want to visit inner loops first? Doesn't the algorithm visit all blocks of the inner loops when its on an outer loop anyway?

As you can see on MachineLoop::isLoopInvariant(), if the loop contains the definition of MI's operands, the MI is not loop invariant.
Let's see below loop form.

outer loop:
    definition of operand of below loop invariant code
    inner loop:
        loop invariant code

If we visit only outer loop, the loop invariant code can not be hoisted to outer loop's preheader because its operand's definition is in outer loop.
If we visit inner loop, the loop invariant code can be hoisted to the inner loop's preheader because the inner loop does not contain the definition of the invariant code's operand.
That's the reason why I want to visit the inner loops.
If you feel something wrong, please let me know.

My question wasn't about visiting the inner loops it was about the order of visiting. Since I think it visits all blocks in child loops, I was wondering if we could aggressively hoist everything relative to the outer loop first instead of gradually. Then visit the inner loops to get anything that can't be hoisted all the way out. Though if we do as @nikic says and stop visiting blocks of inner loops when visiting outer loops, this won't work.

we could skip to visit basic blocks which are already visited with inner loop because the loop invariant code of the inner loop has already been hoisted to inner loop's preheader.

Please do implement this (and revert this patch in the meantime if non-trivial). LICM should visit each block only once, not once per parent loop.

Experimentally, I tried it but it was not simple and caused regressions... Let me try it again.
In the meantime, if possible, I would like to keep this patch in the meantime.

Why do we want to visit inner loops first? Doesn't the algorithm visit all blocks of the inner loops when its on an outer loop anyway?

As you can see on MachineLoop::isLoopInvariant(), if the loop contains the definition of MI's operands, the MI is not loop invariant.
Let's see below loop form.

outer loop:
    definition of operand of below loop invariant code
    inner loop:
        loop invariant code

If we visit only outer loop, the loop invariant code can not be hoisted to outer loop's preheader because its operand's definition is in outer loop.
If we visit inner loop, the loop invariant code can be hoisted to the inner loop's preheader because the inner loop does not contain the definition of the invariant code's operand.
That's the reason why I want to visit the inner loops.
If you feel something wrong, please let me know.

My question wasn't about visiting the inner loops it was about the order of visiting. Since I think it visits all blocks in child loops, I was wondering if we could aggressively hoist everything relative to the outer loop first instead of gradually. Then visit the inner loops to get anything that can't be hoisted all the way out. Though if we do as @nikic says and stop visiting blocks of inner loops when visiting outer loops, this won't work.

Ah, sorry... I misunderstood your question.
I think it would be worth to try the order you suggest.
Let me try to implement what @nikic suggested first.

My question wasn't about visiting the inner loops it was about the order of visiting. Since I think it visits all blocks in child loops, I was wondering if we could aggressively hoist everything relative to the outer loop first instead of gradually. Then visit the inner loops to get anything that can't be hoisted all the way out. Though if we do as @nikic says and stop visiting blocks of inner loops when visiting outer loops, this won't work.

There are multiple ways to go about this. We could only visit the outermost loop, but hoist to an inner loop preheader while doing so (i.e. perform the invariance check against the current inner-most loop to find candidates, and then hoist them to the outer-most parent loop that is still invariant). It's possible that doing this fits better into the MachineLICM model, which also needs to keep track of register pressure.

My question wasn't about visiting the inner loops it was about the order of visiting. Since I think it visits all blocks in child loops, I was wondering if we could aggressively hoist everything relative to the outer loop first instead of gradually. Then visit the inner loops to get anything that can't be hoisted all the way out. Though if we do as @nikic says and stop visiting blocks of inner loops when visiting outer loops, this won't work.

There are multiple ways to go about this. We could only visit the outermost loop, but hoist to an inner loop preheader while doing so (i.e. perform the invariance check against the current inner-most loop to find candidates, and then hoist them to the outer-most parent loop that is still invariant). It's possible that doing this fits better into the MachineLICM model, which also needs to keep track of register pressure.

Yep, I think it is good idea.
As below patch, I tried to skip the blocks which have already visited with inner loops but the different register pressure caused regression. In order to keep track of the register pressure properly, we need to visit all blocks in loop again...
Maybe, your idea could avoid the issue with keeping track of the register pressure properly...

diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp
index 1b84c0218867..c7ab88a3af22 100644
--- a/llvm/lib/CodeGen/MachineLICM.cpp
+++ b/llvm/lib/CodeGen/MachineLICM.cpp
@@ -239,9 +239,11 @@ namespace {
     void ExitScopeIfDone(
         MachineDomTreeNode *Node,
         DenseMap<MachineDomTreeNode *, unsigned> &OpenChildren,
-        const DenseMap<MachineDomTreeNode *, MachineDomTreeNode *> &ParentMap);
+        const DenseMap<MachineDomTreeNode *, MachineDomTreeNode *> &ParentMap,
+        SmallPtrSetImpl<MachineBasicBlock *> &VisitedLoopBBs);
 
-    void HoistOutOfLoop(MachineDomTreeNode *HeaderN);
+    void HoistOutOfLoop(MachineDomTreeNode *HeaderN,
+                        SmallPtrSetImpl<MachineBasicBlock *> &VisitedLoopBBs);
 
     void InitRegPressure(MachineBasicBlock *BB);
 
@@ -375,6 +377,7 @@ bool MachineLICMBase::runOnMachineFunction(MachineFunction &MF) {
   for (; MLII != MLIE; ++MLII)
     addSubLoopsToWorkList(*MLII, Worklist, PreRegAlloc);
 
+  SmallPtrSet<MachineBasicBlock *, 32> VisitedLoopBBs;
   while (!Worklist.empty()) {
     CurLoop = Worklist.pop_back_val();
     CurPreheader = nullptr;
@@ -389,8 +392,11 @@ bool MachineLICMBase::runOnMachineFunction(MachineFunction &MF) {
       // being hoisted.
       MachineDomTreeNode *N = DT->getNode(CurLoop->getHeader());
       FirstInLoop = true;
-      HoistOutOfLoop(N);
+      HoistOutOfLoop(N, VisitedLoopBBs);
       CSEMap.clear();
+      // Keep track of visited loop's blocks.
+      VisitedLoopBBs.insert(CurLoop->getBlocksVector().begin(),
+                            CurLoop->getBlocksVector().end());
     }
   }
 
@@ -694,14 +700,18 @@ void MachineLICMBase::ExitScope(MachineBasicBlock *MBB) {
 /// Destroy scope for the MBB that corresponds to the given dominator tree node
 /// if its a leaf or all of its children are done. Walk up the dominator tree to
 /// destroy ancestors which are now done.
-void MachineLICMBase::ExitScopeIfDone(MachineDomTreeNode *Node,
-    DenseMap<MachineDomTreeNode*, unsigned> &OpenChildren,
-    const DenseMap<MachineDomTreeNode*, MachineDomTreeNode*> &ParentMap) {
+void MachineLICMBase::ExitScopeIfDone(
+    MachineDomTreeNode *Node,
+    DenseMap<MachineDomTreeNode *, unsigned> &OpenChildren,
+    const DenseMap<MachineDomTreeNode *, MachineDomTreeNode *> &ParentMap,
+    SmallPtrSetImpl<MachineBasicBlock *> &VisitedLoopBBs) {
   if (OpenChildren[Node])
     return;
 
-  for(;;) {
-    ExitScope(Node->getBlock());
+  for (;;) {
+    // If the block was visited previously, do not process the block.
+    if (!VisitedLoopBBs.contains(Node->getBlock()))
+      ExitScope(Node->getBlock());
     // Now traverse upwards to pop ancestors whose offsprings are all done.
     MachineDomTreeNode *Parent = ParentMap.lookup(Node);
     if (!Parent || --OpenChildren[Parent] != 0)
@@ -714,7 +724,9 @@ void MachineLICMBase::ExitScopeIfDone(MachineDomTreeNode *Node,
 /// specified header block, and that are in the current loop) in depth first
 /// order w.r.t the DominatorTree. This allows us to visit definitions before
 /// uses, allowing us to hoist a loop body in one pass without iteration.
-void MachineLICMBase::HoistOutOfLoop(MachineDomTreeNode *HeaderN) {
+void MachineLICMBase::HoistOutOfLoop(
+    MachineDomTreeNode *HeaderN,
+    SmallPtrSetImpl<MachineBasicBlock *> &VisitedLoopBBs) {
   MachineBasicBlock *Preheader = getCurPreheader();
   if (!Preheader)
     return;
@@ -741,7 +753,10 @@ void MachineLICMBase::HoistOutOfLoop(MachineDomTreeNode *HeaderN) {
     if (!CurLoop->contains(BB))
       continue;
 
-    Scopes.push_back(Node);
+    // If the block was visited previously, do not process the block.
+    if (VisitedLoopBBs.empty() || !VisitedLoopBBs.contains(BB))
+      Scopes.push_back(Node);
+
     unsigned NumChildren = Node->getNumChildren();
 
     // Don't hoist things out of a large switch statement.  This often causes
@@ -786,7 +801,7 @@ void MachineLICMBase::HoistOutOfLoop(MachineDomTreeNode *HeaderN) {
     }
 
     // If it's a leaf node, it's done. Traverse upwards to pop ancestors.
-    ExitScopeIfDone(Node, OpenChildren, ParentMap);
+    ExitScopeIfDone(Node, OpenChildren, ParentMap, VisitedLoopBBs);
   }
 }
nikic added a comment.Jul 19 2023, 1:25 AM

@jaykang10 Can you please revert this patch in the meantime? I'd like to make sure it does not make it into LLVM 17 in the current form.

@jaykang10 Can you please revert this patch in the meantime? I'd like to make sure it does not make it into LLVM 17 in the current form.

@nikic let me revert this commit.

jaykang10 updated this revision to Diff 541951.Jul 19 2023, 4:21 AM

Following @nikic's suggestion, If we fail to hoist MI to outmost loop and the MI is in subloop, try to hoist it to subloop's preheader.

@nikic Following your suggestion, I have updated the patch.
If possible, can I ask you some comments for the update please?

nikic reopened this revision.Jul 20 2023, 8:20 AM
This revision is now accepted and ready to land.Jul 20 2023, 8:20 AM
This revision was landed with ongoing or failed builds.Jul 20 2023, 8:46 AM
This revision was automatically updated to reflect the committed changes.
nikic added a comment.Jul 20 2023, 9:01 AM

That wasn't intended as an approval -- it's a phabricator quirk that if a revision is reopened it will show as accepted. One usually has to do "reopen" and then "request review", but it looks like only you can do the second part.

llvm/lib/CodeGen/MachineLICM.cpp
788

This is going to hoist to the inner-most preheader, but it may be that the instruction is invariant wrt a number of a loops (just not the outer-most one). Shouldn't we be going up the loop chain to find the highest invariant loop?

That wasn't intended as an approval -- it's a phabricator quirk that if a revision is reopened it will show as accepted. One usually has to do "reopen" and then "request review", but it looks like only you can do the second part.

Sorry... I have reverted the commit.

jaykang10 added inline comments.Jul 20 2023, 9:22 AM
llvm/lib/CodeGen/MachineLICM.cpp
788

I wanted to check that this approach is acceptable first.
We can go up the loop chain. Let me update the code.

jaykang10 updated this revision to Diff 542862.EditedJul 21 2023, 4:35 AM

Fixed a bug.
With this patch, we can hoist MIs to inner loop's preheader and the preheader can not dominate the CSE candidate MI's block. In this case, avoid CSE.

jaykang10 updated this revision to Diff 542892.Jul 21 2023, 6:51 AM

Following @nikic's comment, visit loop chain from outer one to inner one.

jaykang10 updated this revision to Diff 543575.Jul 24 2023, 9:01 AM

Removed LoopIsOuterMostWithPredecessor.

jaykang10 updated this revision to Diff 543576.Jul 24 2023, 9:03 AM

@nikic I have updated this patch following your comment.
If you need something more, please let me know.

@nikic Can I push this change please?

@nikic or anyone can review this updated patch please?

The perf of this version looks OK. The previous one seemed to be more aggressive, causing more changes both positive and negative, but from the perf I ran this looks OK.

llvm/lib/CodeGen/MachineLICM.cpp
783–784

Do you know what this refers to? I'm not sure I understand what it means. It might be worth just removing it.

784

outmost -> outermost
is in subloop -> is in a subloop

llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll
0–1

This doesn't really look autogenerated to me.

41

Should all these lines be removed, or should they be updated for the new codegen?

79

This shouldn't be needed.

nikic added inline comments.Aug 1 2023, 6:14 AM
llvm/lib/CodeGen/MachineLICM.cpp
800

Uff, this looks like a pretty big hack. It is viable to pass the loop as a parameter instead of temporarily changing "global" state?

1345

This seems to work around a larger issue. The problem is that CSEMap will get initialized for whichever preheader we happen to hoist into first. Thanks to this check, we will at least not make invalid replacements, but it means we will miss CSE opportunities if we are hoisting into any other preheader.

Probably the CSE map should be by preheader, instead of having only the one.

jaykang10 added inline comments.Aug 1 2023, 7:16 AM
llvm/lib/CodeGen/MachineLICM.cpp
783–784

Let me check it.

784

Yep, let me update them.

llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll
0–1

Sorry... it looks it did not use the script first time...
For negated_cond function, SIOptimizeExecMaskingPreRA pass fails to fold mask operations V_CNDMASK_B32_e64 and V_CMP_NE_U32 because they are hoisted.
Let me update it manually.

41

It looks these test lines are correct.
Let me keep these lines in this patch.

79

Yep, let me remove it.

jaykang10 added inline comments.Aug 1 2023, 7:30 AM
llvm/lib/CodeGen/MachineLICM.cpp
800

um... if possible, I did not want to change a lot in this pass... but I agree it is big hack.
Let me try to pass the global ones as parameters.

1345

It is good point.
Let me try to keep multiple CSE maps for multiple preheaders.

jaykang10 updated this revision to Diff 548173.Aug 8 2023, 5:45 AM

Following @nikic's comment, updated code.

  • Pass CurLoop and CurPreheader as function parameters
  • Keep CSEMap per preheader

@nikic If possible, can you check the update please?

dmgreen added inline comments.Sep 14 2023, 2:50 AM
llvm/lib/CodeGen/MachineLICM.cpp
134–135

Will ExitBlocks be incorrect now?

jaykang10 added inline comments.Sep 14 2023, 4:00 AM
llvm/lib/CodeGen/MachineLICM.cpp
134–135

Ah, that is good point!
They are out-most loop's ExitBlocks.
Let me fix it.
Thanks for checking it.

jaykang10 updated this revision to Diff 556767.Sep 14 2023, 4:05 AM

Following @dmgreen's comment, checked ExitBlocks per each loop.

dmgreen added inline comments.Sep 14 2023, 5:59 AM
llvm/lib/CodeGen/MachineLICM.cpp
134–135

Could this use CurLoop->isLoopExiting(ExitBlocks) instead? It might be quicker for larger loops.

jaykang10 added inline comments.Sep 14 2023, 6:56 AM
llvm/lib/CodeGen/MachineLICM.cpp
134–135

This function checks exit blocks which are outside loop and have predecessor inside loop.
isLoopExiting checks exiting blocks which are inside loop and have successor outside loop.
I think we need exit blocks here.
Let me try to keep the exit blocks for each loop in order to avoid re-calculation.

dmgreen added inline comments.Sep 14 2023, 7:10 AM
llvm/lib/CodeGen/MachineLICM.cpp
134–135

Oh I see. A different type of Exit Block.
Could it check !CurLoop->contains(ExitBlocks) && any_of(ExitBlocks->predecessors, is in CurLoop)?

jaykang10 added inline comments.Sep 14 2023, 7:41 AM
llvm/lib/CodeGen/MachineLICM.cpp
134–135

CurLoop->getExitBlocks collects blocks which are outside CurLoop and has predecessor in CurLoop.
This function checks the MBB parameter is in the blocks.

jaykang10 updated this revision to Diff 556789.Sep 14 2023, 8:51 AM

Created a map to keep exit blocks for each loop.

dmgreen reopened this revision.Sep 14 2023, 9:00 AM

Thanks. LGTM

llvm/lib/CodeGen/MachineLICM.cpp
134–135

Yeah - That was what I was aiming to capture. The ExitBlocks is outside the loop, but one of it's predecessors is inside.

This revision is now accepted and ready to land.Sep 14 2023, 9:00 AM
dmgreen accepted this revision.Sep 14 2023, 9:00 AM

Thanks. LGTM

Thanks for review.
Let me push this patch.
If @nikic or other people want to change this patch more, please let me know.

This revision was automatically updated to reflect the committed changes.
bkramer added inline comments.
llvm/lib/CodeGen/MachineLICM.cpp
783–784

When MI is hoisted the pointer is no longer valid. I'm seeing use after frees with asan after this change, so reverted in 3454cf6

jaykang10 added inline comments.Sep 15 2023, 5:59 AM
llvm/lib/CodeGen/MachineLICM.cpp
783–784

Ah, I am so sorry... Let me check it again.
Thanks for reverting the commit.

jaykang10 added inline comments.Sep 15 2023, 9:15 AM
llvm/lib/CodeGen/MachineLICM.cpp
783–784

It seems I need to check the MI is erased because CSE can be erased.
@bkramer If possible, can you let me know how I can reproduce the case you saw with asan please?

jaykang10 updated this revision to Diff 556931.Sep 18 2023, 1:16 AM

Checked MI erased ahead of update register pressure because CSE can be removed after hoisting it.

I think it would still be invalid to access MI.getParent if MI has been erased. It would be a use after delete. From looking at the old logic it appeared to run UpdateRegPressure only if Hoist was false. Can we do the same thing here?

I think it would still be invalid to access MI.getParent if MI has been erased. It would be a use after delete.

You are right. I was confused with remove which sets its parent nullptr... I am not sure how we can check the erased MI...

From looking at the old logic it appeared to run UpdateRegPressure only if Hoist was false. Can we do the same thing here?

I am not sure why the original code did not update the register pressure with the hoisted loop invariant code... I think this pass needs to update the register pressure with the loop invariant code, which is hoisted to preheader, because preheader dominates the blocks in the loop and the invariant code makes a live out of preheader. I could miss something...

nikic added a comment.Sep 18 2023, 2:44 AM

Maybe change Hoist() to return an enum that represents not hoisted / hoisted / CSEd?

Maybe change Hoist() to return an enum that represents not hoisted / hoisted / CSEd?

Thanks for good suggestion.
Let me try it.

jaykang10 updated this revision to Diff 556948.Sep 18 2023, 7:10 AM

Following @nikic's comment, updated return type of Hoist function.

Leaving a comment here as well. The commit caused an ASAN issue downstream. Cherry picking the revert fixed the asan issue., https://github.com/llvm/llvm-project/commit/5ec9699c4d1f165364586d825baef434e2c110b4#commitcomment-127784939 for more details. Please account for this during the resubmit.

Leaving a comment here as well. The commit caused an ASAN issue downstream. Cherry picking the revert fixed the asan issue., https://github.com/llvm/llvm-project/commit/5ec9699c4d1f165364586d825baef434e2c110b4#commitcomment-127784939 for more details. Please account for this during the resubmit.

Thanks for the asan output.
I have updated the patch in this review to fix the asan error.
If possible, can you check the updated patch fixes the asan error in your side please?
I am also checking it and I have not seen asan error yet.

nikic added inline comments.Sep 19 2023, 11:59 AM
llvm/lib/CodeGen/MachineLICM.cpp
134–135

Any reason not to name this enum and then use it instead of unsigned?

jaykang10 added inline comments.Sep 19 2023, 12:25 PM
llvm/lib/CodeGen/MachineLICM.cpp
134–135

Ah sorry.
Let me update the code with named enum tomorrow.

Following @nikic's comment, used named enum.

Sorry... with latest update, I can see asan errors from sanitizer bot locally.
Let me fix them again.

jaykang10 updated this revision to Diff 557289.EditedSep 25 2023, 12:40 AM

Fixed a bug.

  • ExtractHoistableLoad can erase MI. It creates new MI for unfolding load and assigns it to MI but the MI is not updated with new MI. In this case, the MI is not valid.
  • ExtractHoistableLoad updates the register pressure for the new MI so we do not need to update the register pressure for it outside the function.

I have run sanitizer bot locally and there is no failed tests from sanitizer bot with this patch.
If there is no objection, let me push this updated patch again.

Thanks. I think the changes still look OK.