It looks MachineLICM pass handles only outmost loops even though there are loop invariant codes in inner loops.
As an example, I have pre-committed a test llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll.
In the test, after isel, there is DUPv8i16gpr in vectorized loop and it is loop invariant. However, MachineLICM does not hoist it to the preheader of vectorized loop because it does not consider inner loops.
I think MachineLICM pass could handle the inner loops.
Details
Diff Detail
Event Timeline
It has always been a bit strange that this only tries to sink out of outermost loops. This looks like it will try the largest loop first, which sound like it makes sense.
Are there other tests that need to be updated?
llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll | ||
---|---|---|
4 | I tend to remove the Function Attrs along with dso_local and the local_unnamed_addr #0 | |
159 | And can you remove as much of this as you can. | |
159 | I don't believe these are used for anything. |
This change makes MachineLICM handle the loops from inner-most to out-most. If there is loop invariant code in inner-most loop, it will be hoisted to the pre-header of the inner-most loop first and then to outer loop gradually.
MachineLICM is target independent CodeGen pass so I am checking tests of other targets too.
llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll | ||
---|---|---|
4 | Sorry, I added this test as an experimental example. | |
159 | ditto | |
159 | ditto |
Updated test files.
For AMDGPU target, after hoisting some MIRs, SIOptimizeExecMaskingPreRA pass fails to remove them.
On CodeGen/AMDGPU/agpr-copy-no-free-registers.ll, it has below loop in MIR level.
Loop at depth 1 containing: %bb.1<header>,%bb.3,%bb.5,%bb.6,%bb.7,%bb.8,%bb.11,%bb.12,%bb.4,%bb.2,%bb.9<latch><exiting> Loop at depth 2 containing: %bb.5<header>,%bb.6,%bb.7,%bb.8,%bb.11<latch><exiting>
With this patch, below MIRs are hoisted from bb.5 to bb.3 in inner loop.
%155:vgpr_32 = V_CNDMASK_B32_e64 0, 0, 0, 1, %13:sreg_64_xexec, implicit $exec %258:sreg_64_xexec = V_CMP_NE_U32_e64 %155:vgpr_32, %90:sreg_32, implicit $ex
After that, SIOptimizeExecMaskingPreRA pass fails to optimize the MIRs rather than original one so it looks there are more instructions with this patch. I have not checked the pass in detail but I guess the pass could handle the case. Other AMDGPU regressions have same issue.
A comment on SIOptimizeExecMaskingPreRA // Optimize sequence // %sel = V_CNDMASK_B32_e64 0, 1, %cc // %cmp = V_CMP_NE_U32 1, %sel // $vcc = S_AND_B64 $exec, %cmp // S_CBRANCH_VCC[N]Z // => // $vcc = S_ANDN2_B64 $exec, %cc // S_CBRANCH_VCC[N]Z
@arsenm If this change causes something wrong for AMDGPU target, please let me know.
For Webassembly target, on CodeGen/WebAssembly/reg-stackify.ll, I can see below MIR is hoisted to inner loop's preheader and it looks ok.
%3:fr64 = ADDSDrr %1:fr64(tied-def 0), %28:fr64, implicit $mxcsr
@sunfish If this change causes something wrong for WebAssembly target, please let me know.
If I remember right, the original commit that added this cited compile time as the reason. Have you measured the impact to compile time?
Thanks for comment. @craig.topper
I have not checked the compile time yet. Maybe, interpreter or compiler workloads could cause more compile time because they usually have big nested loops. Let me check the compile time.
Additionally, if the compile time is issue, we could skip to visit basic blocks which are already visited with inner loop because the loop invariant code of the inner loop has already been hoisted to inner loop's preheader. Let me check it too.
"When loops are nested, we generally optimize the inner loops before the outer loops. For one, inner loops are likely to be executed more often. For another, it could move computation to an outer loop from which it is hoisted further when the outer loop is optimized and so on."
https://www.cs.cmu.edu/~fp/courses/15411-f13/lectures/17-loopopt.pdf
I have checked the compile time with llvm-testsuite/CTMark. It looks it is not too bad.
workload org_inst_count patch_inst_count diff(%) ClamAV 69508780509 69630988308 0.175816347 7zip 2.32E+11 2.32E+11 0.006923268 tramp3d-v4 1.05E+11 1.05E+11 0.000903823 kimwitu++ 49770088564 49763466457 -0.013305395 sqlite3 46693620206 46759200740 0.140448596 mafft 42673608370 42731240751 0.13505392 SPASS 56670401414 56809872194 0.246108686 lencod 79886302869 80002255900 0.145147575 consumer-typeset 43685066260 43697382603 0.028193486 Bullet 1.15E+11 1.15E+11 -0.025679361
@craig.topper If you are not happy with the compile time number, please let me know.
Additionally, it seems it is not simple to skip the basic blocks of the inner loops while handling outer loop.
The headings org_inst_count and patch_inst_count don't sound like compile time to me. That sounds like sizes of the resulting binary?
The headings org_inst_count and patch_inst_count don't sound like compile time to me. That sounds like sizes of the resulting binary?
Those will likely be instruction counts, as per https://llvm-compile-time-tracker.com.
Ah, sorry for poor explanation.
It means the instruction count from perf on linux host during compilation of the workloads so bigger number of instruction count means longer compile time.
I used the TEST_SUITE_USE_PERF option with llvm-test-suite and checked the number of instructions from the perfstats file.
Yep, I followed the llvm-compile-time-tracker.
@dmgreen let me know the llvm-compile-time-tracker and the scripts.
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
335 | 2 questions:
|
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
335 | Thanks for questions.
As you can see, current implementation handles inner loops when outmost loop does not have unique predecessor. If loops have preheader, I think we can hoist loop invariant code into the preheader. The HoistOutOfLoop function checks it so I think we do not need to check the outer-most loop that has a unique predecessor.
We use Worklist.pop_back_val() and it means we handles last element of worklist first. In order to handle inner-most loop first, the inner-most loop is pushed into last element of the worklist. |
The test has ; REQUIRES: asserts and it is enabled in the build with asserts.
Let me update the test.
Why do we want to visit inner loops first? Doesn't the algorithm visit all blocks of the inner loops when its on an outer loop anyway?
As you can see on MachineLoop::isLoopInvariant(), if the loop contains the definition of MI's operands, the MI is not loop invariant.
Let's see below loop form.
outer loop: definition of operand of below loop invariant code inner loop: loop invariant code
If we visit only outer loop, the loop invariant code can not be hoisted to outer loop's preheader because its operand's definition is in outer loop.
If we visit inner loop, the loop invariant code can be hoisted to the inner loop's preheader because the inner loop does not contain the definition of the invariant code's operand.
That's the reason why I want to visit the inner loops.
If you feel something wrong, please let me know.
we could skip to visit basic blocks which are already visited with inner loop because the loop invariant code of the inner loop has already been hoisted to inner loop's preheader.
Please do implement this (and revert this patch in the meantime if non-trivial). LICM should visit each block only once, not once per parent loop.
My question wasn't about visiting the inner loops it was about the order of visiting. Since I think it visits all blocks in child loops, I was wondering if we could aggressively hoist everything relative to the outer loop first instead of gradually. Then visit the inner loops to get anything that can't be hoisted all the way out. Though if we do as @nikic says and stop visiting blocks of inner loops when visiting outer loops, this won't work.
Experimentally, I tried it but it was not simple and caused regressions... Let me try it again.
In the meantime, if possible, I would like to keep this patch in the meantime.
Ah, sorry... I misunderstood your question.
I think it would be worth to try the order you suggest.
Let me try to implement what @nikic suggested first.
There are multiple ways to go about this. We could only visit the outermost loop, but hoist to an inner loop preheader while doing so (i.e. perform the invariance check against the current inner-most loop to find candidates, and then hoist them to the outer-most parent loop that is still invariant). It's possible that doing this fits better into the MachineLICM model, which also needs to keep track of register pressure.
Yep, I think it is good idea.
As below patch, I tried to skip the blocks which have already visited with inner loops but the different register pressure caused regression. In order to keep track of the register pressure properly, we need to visit all blocks in loop again...
Maybe, your idea could avoid the issue with keeping track of the register pressure properly...
diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp index 1b84c0218867..c7ab88a3af22 100644 --- a/llvm/lib/CodeGen/MachineLICM.cpp +++ b/llvm/lib/CodeGen/MachineLICM.cpp @@ -239,9 +239,11 @@ namespace { void ExitScopeIfDone( MachineDomTreeNode *Node, DenseMap<MachineDomTreeNode *, unsigned> &OpenChildren, - const DenseMap<MachineDomTreeNode *, MachineDomTreeNode *> &ParentMap); + const DenseMap<MachineDomTreeNode *, MachineDomTreeNode *> &ParentMap, + SmallPtrSetImpl<MachineBasicBlock *> &VisitedLoopBBs); - void HoistOutOfLoop(MachineDomTreeNode *HeaderN); + void HoistOutOfLoop(MachineDomTreeNode *HeaderN, + SmallPtrSetImpl<MachineBasicBlock *> &VisitedLoopBBs); void InitRegPressure(MachineBasicBlock *BB); @@ -375,6 +377,7 @@ bool MachineLICMBase::runOnMachineFunction(MachineFunction &MF) { for (; MLII != MLIE; ++MLII) addSubLoopsToWorkList(*MLII, Worklist, PreRegAlloc); + SmallPtrSet<MachineBasicBlock *, 32> VisitedLoopBBs; while (!Worklist.empty()) { CurLoop = Worklist.pop_back_val(); CurPreheader = nullptr; @@ -389,8 +392,11 @@ bool MachineLICMBase::runOnMachineFunction(MachineFunction &MF) { // being hoisted. MachineDomTreeNode *N = DT->getNode(CurLoop->getHeader()); FirstInLoop = true; - HoistOutOfLoop(N); + HoistOutOfLoop(N, VisitedLoopBBs); CSEMap.clear(); + // Keep track of visited loop's blocks. + VisitedLoopBBs.insert(CurLoop->getBlocksVector().begin(), + CurLoop->getBlocksVector().end()); } } @@ -694,14 +700,18 @@ void MachineLICMBase::ExitScope(MachineBasicBlock *MBB) { /// Destroy scope for the MBB that corresponds to the given dominator tree node /// if its a leaf or all of its children are done. Walk up the dominator tree to /// destroy ancestors which are now done. -void MachineLICMBase::ExitScopeIfDone(MachineDomTreeNode *Node, - DenseMap<MachineDomTreeNode*, unsigned> &OpenChildren, - const DenseMap<MachineDomTreeNode*, MachineDomTreeNode*> &ParentMap) { +void MachineLICMBase::ExitScopeIfDone( + MachineDomTreeNode *Node, + DenseMap<MachineDomTreeNode *, unsigned> &OpenChildren, + const DenseMap<MachineDomTreeNode *, MachineDomTreeNode *> &ParentMap, + SmallPtrSetImpl<MachineBasicBlock *> &VisitedLoopBBs) { if (OpenChildren[Node]) return; - for(;;) { - ExitScope(Node->getBlock()); + for (;;) { + // If the block was visited previously, do not process the block. + if (!VisitedLoopBBs.contains(Node->getBlock())) + ExitScope(Node->getBlock()); // Now traverse upwards to pop ancestors whose offsprings are all done. MachineDomTreeNode *Parent = ParentMap.lookup(Node); if (!Parent || --OpenChildren[Parent] != 0) @@ -714,7 +724,9 @@ void MachineLICMBase::ExitScopeIfDone(MachineDomTreeNode *Node, /// specified header block, and that are in the current loop) in depth first /// order w.r.t the DominatorTree. This allows us to visit definitions before /// uses, allowing us to hoist a loop body in one pass without iteration. -void MachineLICMBase::HoistOutOfLoop(MachineDomTreeNode *HeaderN) { +void MachineLICMBase::HoistOutOfLoop( + MachineDomTreeNode *HeaderN, + SmallPtrSetImpl<MachineBasicBlock *> &VisitedLoopBBs) { MachineBasicBlock *Preheader = getCurPreheader(); if (!Preheader) return; @@ -741,7 +753,10 @@ void MachineLICMBase::HoistOutOfLoop(MachineDomTreeNode *HeaderN) { if (!CurLoop->contains(BB)) continue; - Scopes.push_back(Node); + // If the block was visited previously, do not process the block. + if (VisitedLoopBBs.empty() || !VisitedLoopBBs.contains(BB)) + Scopes.push_back(Node); + unsigned NumChildren = Node->getNumChildren(); // Don't hoist things out of a large switch statement. This often causes @@ -786,7 +801,7 @@ void MachineLICMBase::HoistOutOfLoop(MachineDomTreeNode *HeaderN) { } // If it's a leaf node, it's done. Traverse upwards to pop ancestors. - ExitScopeIfDone(Node, OpenChildren, ParentMap); + ExitScopeIfDone(Node, OpenChildren, ParentMap, VisitedLoopBBs); } }
@jaykang10 Can you please revert this patch in the meantime? I'd like to make sure it does not make it into LLVM 17 in the current form.
Following @nikic's suggestion, If we fail to hoist MI to outmost loop and the MI is in subloop, try to hoist it to subloop's preheader.
@nikic Following your suggestion, I have updated the patch.
If possible, can I ask you some comments for the update please?
That wasn't intended as an approval -- it's a phabricator quirk that if a revision is reopened it will show as accepted. One usually has to do "reopen" and then "request review", but it looks like only you can do the second part.
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
789 | This is going to hoist to the inner-most preheader, but it may be that the instruction is invariant wrt a number of a loops (just not the outer-most one). Shouldn't we be going up the loop chain to find the highest invariant loop? |
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
789 | I wanted to check that this approach is acceptable first. |
Fixed a bug.
With this patch, we can hoist MIs to inner loop's preheader and the preheader can not dominate the CSE candidate MI's block. In this case, avoid CSE.
@nikic I have updated this patch following your comment.
If you need something more, please let me know.
The perf of this version looks OK. The previous one seemed to be more aggressive, causing more changes both positive and negative, but from the perf I ran this looks OK.
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
785 | outmost -> outermost | |
806–807 | Do you know what this refers to? I'm not sure I understand what it means. It might be worth just removing it. | |
llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll | ||
0–1 | This doesn't really look autogenerated to me. | |
41 | Should all these lines be removed, or should they be updated for the new codegen? | |
79 | This shouldn't be needed. |
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
801 | Uff, this looks like a pretty big hack. It is viable to pass the loop as a parameter instead of temporarily changing "global" state? | |
1343 | This seems to work around a larger issue. The problem is that CSEMap will get initialized for whichever preheader we happen to hoist into first. Thanks to this check, we will at least not make invalid replacements, but it means we will miss CSE opportunities if we are hoisting into any other preheader. Probably the CSE map should be by preheader, instead of having only the one. |
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
785 | Yep, let me update them. | |
806–807 | Let me check it. | |
llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll | ||
0–1 | Sorry... it looks it did not use the script first time... | |
41 | It looks these test lines are correct. | |
79 | Yep, let me remove it. |
Following @nikic's comment, updated code.
- Pass CurLoop and CurPreheader as function parameters
- Keep CSEMap per preheader
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
134–135 | Will ExitBlocks be incorrect now? |
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
134–135 | Ah, that is good point! |
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
134–135 | Could this use CurLoop->isLoopExiting(ExitBlocks) instead? It might be quicker for larger loops. |
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
134–135 | This function checks exit blocks which are outside loop and have predecessor inside loop. |
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
134–135 | Oh I see. A different type of Exit Block. |
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
134–135 | CurLoop->getExitBlocks collects blocks which are outside CurLoop and has predecessor in CurLoop. |
Thanks. LGTM
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
134–135 | Yeah - That was what I was aiming to capture. The ExitBlocks is outside the loop, but one of it's predecessors is inside. |
Thanks for review.
Let me push this patch.
If @nikic or other people want to change this patch more, please let me know.
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
806–807 | Ah, I am so sorry... Let me check it again. |
Checked MI erased ahead of update register pressure because CSE can be removed after hoisting it.
I think it would still be invalid to access MI.getParent if MI has been erased. It would be a use after delete. From looking at the old logic it appeared to run UpdateRegPressure only if Hoist was false. Can we do the same thing here?
You are right. I was confused with remove which sets its parent nullptr... I am not sure how we can check the erased MI...
From looking at the old logic it appeared to run UpdateRegPressure only if Hoist was false. Can we do the same thing here?
I am not sure why the original code did not update the register pressure with the hoisted loop invariant code... I think this pass needs to update the register pressure with the loop invariant code, which is hoisted to preheader, because preheader dominates the blocks in the loop and the invariant code makes a live out of preheader. I could miss something...
Leaving a comment here as well. The commit caused an ASAN issue downstream. Cherry picking the revert fixed the asan issue., https://github.com/llvm/llvm-project/commit/5ec9699c4d1f165364586d825baef434e2c110b4#commitcomment-127784939 for more details. Please account for this during the resubmit.
Thanks for the asan output.
I have updated the patch in this review to fix the asan error.
If possible, can you check the updated patch fixes the asan error in your side please?
I am also checking it and I have not seen asan error yet.
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
134–135 | Any reason not to name this enum and then use it instead of unsigned? |
llvm/lib/CodeGen/MachineLICM.cpp | ||
---|---|---|
134–135 | Ah sorry. |
Sorry... with latest update, I can see asan errors from sanitizer bot locally.
Let me fix them again.
Fixed a bug.
- ExtractHoistableLoad can erase MI. It creates new MI for unfolding load and assigns it to MI but the MI is not updated with new MI. In this case, the MI is not valid.
- ExtractHoistableLoad updates the register pressure for the new MI so we do not need to update the register pressure for it outside the function.
I have run sanitizer bot locally and there is no failed tests from sanitizer bot with this patch.
If there is no objection, let me push this updated patch again.
Will ExitBlocks be incorrect now?