This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

-
llvm/
-
lib/CodeGen/
-
CodeGen/
12/24
MachineLICM.cpp
-
test/CodeGen/
-
CodeGen/
-
AArch64/
3/6
machine-licm-sub-loop.ll
-
AMDGPU/
-
agpr-copy-no-free-registers.ll
-
exec-mask-opt-cannot-create-empty-or-backward-segment.ll
3/6
optimize-negated-cond.ll
-
tuple-allocation-failure.ll
-
Thumb2/
-
mve-gather-scatter-optimisation.ll
-
WebAssembly/
-
reg-stackify.ll

Differential D154205

[MachineLICM] Handle subloops
ClosedPublic

Authored by jaykang10 on Jun 30 2023, 5:09 AM.


Details

Reviewers

efriedma
t.p.northover
craig.topper
dmgreen
wxiao3

Commits

rGff68e43c811e: [MachineLICM] Handle Subloops
rG5ec9699c4d1f: [MachineLICM] Handle Subloops
rG50dd383d0867: [MachineLICM] Handle Subloops
rG33e60484d750: [MachineLICM] Handle Subloops

Summary

It looks like the MachineLICM pass handles only outermost loops, even when there is loop-invariant code in inner loops.
As an example, I have pre-committed a test, llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll.
In the test, after isel, there is a DUPv8i16gpr in the vectorized loop and it is loop invariant. However, MachineLICM does not hoist it to the preheader of the vectorized loop because it does not consider inner loops.
I think the MachineLICM pass could handle inner loops.

Diff Detail

Event Timeline

jaykang10 created this revision.Jun 30 2023, 5:09 AM

Herald added a project: Restricted Project. · View Herald TranscriptJun 30 2023, 5:09 AM

Herald added subscribers: asbirlea, hiraditya, kristof.beyls. · View Herald Transcript

jaykang10 requested review of this revision.Jun 30 2023, 5:09 AM

Herald added a project: Restricted Project. · View Herald TranscriptJun 30 2023, 5:09 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

xbolva00 added a subscriber: xbolva00.Jun 30 2023, 5:24 AM

jaykang10 edited the summary of this revision. (Show Details)Jun 30 2023, 5:38 AM

Harbormaster completed remote builds in B242379: Diff 536184.Jun 30 2023, 6:40 AM

It has always been a bit strange that this only tries to sink out of outermost loops. This looks like it will try the largest loop first, which sounds like it makes sense.

Are there other tests that need to be updated?

llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
4	I tend to remove the `Function Attrs` along with `dso_local` and the `local_unnamed_addr #0`
159	And can you remove as much of this as you can.
159	I don't believe these are used for anything.

In D154205#4463750, @dmgreen wrote:

It has always been a bit strange that this only tries to sink out of outermost loops. This looks like it will try the largest loop first, which sounds like it makes sense.

Are there other tests that need to be updated?

This change makes MachineLICM handle loops from innermost to outermost. If there is loop-invariant code in the innermost loop, it is first hoisted to the preheader of the innermost loop and then gradually out to the outer loops.
MachineLICM is a target-independent CodeGen pass, so I am checking the tests of other targets too.

llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll
4	Sorry, I added this test as an experimental example. Let me tidy up this test.
159	ditto
159	ditto

Additionally, if possible, I would like to get feedback from other target people.

jaykang10 mentioned this in rG041db60bc1d2: [tests] Update precommit test for MachineLICM subloops.Jul 10 2023, 6:39 AM

Updated test files.

For the AMDGPU target, after some MIR instructions are hoisted, the SIOptimizeExecMaskingPreRA pass fails to remove them.
CodeGen/AMDGPU/agpr-copy-no-free-registers.ll has the loop below at the MIR level.

Loop at depth 1 containing: %bb.1<header>,%bb.3,%bb.5,%bb.6,%bb.7,%bb.8,%bb.11,%bb.12,%bb.4,%bb.2,%bb.9<latch><exiting>
    Loop at depth 2 containing: %bb.5<header>,%bb.6,%bb.7,%bb.8,%bb.11<latch><exiting>

With this patch, the MIR instructions below are hoisted out of the inner loop, from bb.5 to bb.3.

%155:vgpr_32 = V_CNDMASK_B32_e64 0, 0, 0, 1, %13:sreg_64_xexec, implicit $exec
%258:sreg_64_xexec = V_CMP_NE_U32_e64 %155:vgpr_32, %90:sreg_32, implicit $ex

After that, the SIOptimizeExecMaskingPreRA pass fails to optimize these instructions as it did originally, so there are more instructions with this patch. I have not checked the pass in detail, but I guess it could be taught to handle this case. The other AMDGPU regressions have the same issue.

A comment in SIOptimizeExecMaskingPreRA describes the pattern it optimizes:
// Optimize sequence
//    %sel = V_CNDMASK_B32_e64 0, 1, %cc
//    %cmp = V_CMP_NE_U32 1, %sel
//    $vcc = S_AND_B64 $exec, %cmp
//    S_CBRANCH_VCC[N]Z
// =>
//    $vcc = S_ANDN2_B64 $exec, %cc
//    S_CBRANCH_VCC[N]Z

@arsenm If this change causes a problem for the AMDGPU target, please let me know.

For the WebAssembly target, in CodeGen/WebAssembly/reg-stackify.ll, I can see that the MIR instruction below is hoisted to the inner loop's preheader, and it looks OK.

%3:fr64 = ADDSDrr %1:fr64(tied-def 0), %28:fr64, implicit $mxcsr

@sunfish If this change causes a problem for the WebAssembly target, please let me know.

Herald added subscribers: wangpc, pmatos, asb and 6 others. · View Herald TranscriptJul 10 2023, 8:36 AM

Harbormaster completed remote builds in B244157: Diff 538657.Jul 10 2023, 9:21 AM

If I remember right, the original commit that added this cited compile time as the reason. Have you measured the impact to compile time?

pengfei added a reviewer: wxiao3.Jul 10 2023, 5:49 PM

In D154205#4486754, @craig.topper wrote:

If I remember right, the original commit that added this cited compile time as the reason. Have you measured the impact to compile time?

Thanks for the comment, @craig.topper.
I have not checked the compile time yet. Interpreter or compiler workloads might take more compile time because they usually have big nested loops. Let me check the compile time.
Additionally, if compile time is an issue, we could skip basic blocks that were already visited with an inner loop, because the loop-invariant code of the inner loop has already been hoisted to the inner loop's preheader. Let me check that too.

"When loops are nested, we generally optimize the inner loops before the outer loops. For one, inner loops are likely to be executed more often. For another, it could move computation to an outer loop from which it is hoisted further when the outer loop is optimized and so on."

https://www.cs.cmu.edu/~fp/courses/15411-f13/lectures/17-loopopt.pdf

I have checked the compile time with llvm-test-suite/CTMark. It does not look too bad.

workload	org_inst_count	patch_inst_count	diff(%)
ClamAV	69508780509	69630988308	0.175816347
7zip	2.32E+11	2.32E+11	0.006923268
tramp3d-v4	1.05E+11	1.05E+11	0.000903823
kimwitu++	49770088564	49763466457	-0.013305395
sqlite3	46693620206	46759200740	0.140448596
mafft	42673608370	42731240751	0.13505392
SPASS	56670401414	56809872194	0.246108686
lencod	79886302869	80002255900	0.145147575
consumer-typeset	43685066260	43697382603	0.028193486
Bullet	1.15E+11	1.15E+11	-0.025679361

@craig.topper If you are not happy with the compile-time numbers, please let me know.
Additionally, it does not seem simple to skip the basic blocks of inner loops while handling an outer loop.

In D154205#4488424, @jaykang10 wrote:

I have checked the compile time with llvm-test-suite/CTMark. It does not look too bad.

workload	org_inst_count	patch_inst_count	diff(%)
ClamAV	69508780509	69630988308	0.175816347
7zip	2.32E+11	2.32E+11	0.006923268
tramp3d-v4	1.05E+11	1.05E+11	0.000903823
kimwitu++	49770088564	49763466457	-0.013305395
sqlite3	46693620206	46759200740	0.140448596
mafft	42673608370	42731240751	0.13505392
SPASS	56670401414	56809872194	0.246108686
lencod	79886302869	80002255900	0.145147575
consumer-typeset	43685066260	43697382603	0.028193486
Bullet	1.15E+11	1.15E+11	-0.025679361

The headings org_inst_count and patch_inst_count don't sound like compile time to me. That sounds like sizes of the resulting binary?

The headings org_inst_count and patch_inst_count don't sound like compile time to me. That sounds like sizes of the resulting binary?

Those will likely be instruction counts, as per https://llvm-compile-time-tracker.com.

In D154205#4490081, @craig.topper wrote:
The headings org_inst_count and patch_inst_count don't sound like compile time to me. That sounds like sizes of the resulting binary?

Ah, sorry for the poor explanation.
These are the instruction counts reported by perf on a Linux host while compiling the workloads, so a larger instruction count means a longer compile time.
I used the TEST_SUITE_USE_PERF option with llvm-test-suite and took the instruction counts from the perfstats files.
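
As a rough check of how the diff(%) column relates to the two instruction counts, the sketch below recomputes it. It assumes the column is the relative change (patched count minus original count, divided by the original count, times 100), which is consistent with the ClamAV row of the table. `percentDiff` is a hypothetical helper name and is not part of llvm-test-suite.

```cpp
#include <cassert>
#include <cstdint>

// Relative change in instruction count, in percent.
// A positive value means the patched compiler executed more instructions.
// Assumption: this mirrors how the diff(%) column was derived.
double percentDiff(std::uint64_t OrgInstCount, std::uint64_t PatchInstCount) {
  assert(OrgInstCount != 0 && "baseline count must be non-zero");
  double Delta = static_cast<double>(PatchInstCount) -
                 static_cast<double>(OrgInstCount);
  return Delta / static_cast<double>(OrgInstCount) * 100.0;
}
```

For the ClamAV row, `percentDiff(69508780509, 69630988308)` comes out at roughly 0.1758, and for kimwitu++ the result is slightly negative, matching the signs in the table.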

Try
https://llvm-compile-time-tracker.com/about.php

In D154205#4490240, @xbolva00 wrote:

Try
https://llvm-compile-time-tracker.com/about.php

Yep, I followed llvm-compile-time-tracker.
@dmgreen told me about llvm-compile-time-tracker and its scripts.

In D154205#4490194, @jaykang10 wrote:
Ah, sorry for poor explanation.
It means the instruction count from perf on linux host during compilation of the workloads so bigger number of instruction count means longer compile time.
I used the TEST_SUITE_USE_PERF option with llvm-test-suite and checked the number of instructions from the perfstats file.

Thanks for the clarification.

wxiao3 added inline comments.Jul 12 2023, 12:24 AM

llvm/lib/CodeGen/MachineLICM.cpp
335	Two questions: why don't we require that the outermost loop has a unique predecessor? And can we push the innermost loops into the worklist first?

jaykang10 added inline comments.Jul 12 2023, 1:49 AM

llvm/lib/CodeGen/MachineLICM.cpp
335	Thanks for the questions. "Why don't we require that the outermost loop has a unique predecessor?" As you can see, the current implementation handles inner loops when the outermost loop does not have a unique predecessor. If a loop has a preheader, I think we can hoist loop-invariant code into that preheader. The `HoistOutOfLoop` function checks this, so I think we do not need to check that the outermost loop has a unique predecessor. "Can we push the innermost loops into the worklist first?" We use `Worklist.pop_back_val()`, which means the last element of the worklist is handled first. In order to handle the innermost loop first, the innermost loop is pushed as the last element of the worklist. If you see something wrong, please let me know.
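
The worklist ordering described in this reply can be illustrated with a small standalone sketch. `ToyLoop`, `addLoopAndSubLoops` and `visitOrder` are made-up names for illustration only. This is not the code in MachineLICM.cpp, just a model of pushing a loop before its subloops and then popping from the back.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy loop-tree node; a stand-in for MachineLoop, not the LLVM class.
struct ToyLoop {
  std::string Name;
  std::vector<ToyLoop *> SubLoops;
};

// Push a loop first, then its subloops, so deeper loops end up nearer
// the back of the worklist.
void addLoopAndSubLoops(ToyLoop *L, std::vector<ToyLoop *> &Worklist) {
  Worklist.push_back(L);
  for (ToyLoop *Sub : L->SubLoops)
    addLoopAndSubLoops(Sub, Worklist);
}

// Drain the worklist from the back and record the visiting order.
std::vector<std::string> visitOrder(ToyLoop *Root) {
  std::vector<ToyLoop *> Worklist;
  addLoopAndSubLoops(Root, Worklist);
  std::vector<std::string> Order;
  while (!Worklist.empty()) {
    Order.push_back(Worklist.back()->Name);
    Worklist.pop_back();
  }
  return Order;
}
```

Because a parent is always pushed before any of its descendants, popping from the back visits every loop before its parent in this model.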

wxiao3 accepted this revision.Jul 12 2023, 6:31 AM

This revision is now accepted and ready to land.Jul 12 2023, 6:31 AM

Not all tests are updated? X86/licm-nested.ll

In D154205#4493464, @xbolva00 wrote:

Not all tests are updated? X86/licm-nested.ll

The test has `; REQUIRES: asserts`, so it only runs in builds with assertions enabled.
Let me update the test.

Updated test/CodeGen/X86/licm-nested.ll.

Herald added a subscriber: pengfei. · View Herald TranscriptJul 12 2023, 8:04 AM

This revision was landed with ongoing or failed builds.Jul 12 2023, 8:33 AM

Closed by commit rG33e60484d750: [MachineLICM] Handle Subloops (authored by jaykang10). · Explain Why

This revision was automatically updated to reflect the committed changes.

jaykang10 added a commit: rG33e60484d750: [MachineLICM] Handle Subloops.

Why do we want to visit inner loops first? Doesn't the algorithm visit all blocks of the inner loops when it's on an outer loop anyway?

In D154205#4494018, @craig.topper wrote:

Why do we want to visit inner loops first? Doesn't the algorithm visit all blocks of the inner loops when it's on an outer loop anyway?

As you can see in MachineLoop::isLoopInvariant(), if the loop contains the definition of one of MI's operands, the MI is not loop invariant.
Consider the loop form below.

outer loop:
    definition of an operand of the loop-invariant code below
    inner loop:
        loop-invariant code

If we visit only the outer loop, the loop-invariant code cannot be hoisted to the outer loop's preheader because its operand's definition is in the outer loop.
If we visit the inner loop, the loop-invariant code can be hoisted to the inner loop's preheader because the inner loop does not contain the definition of the invariant code's operand.
That is why I want to visit the inner loops.
If you see something wrong, please let me know.
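
The loop form above can be modelled with a minimal sketch. It assumes a much simplified invariance rule (an instruction is invariant in a loop if none of its operands is defined in a block of that loop); the real check in MachineLoop::isLoopInvariant() has more conditions. `ToyLoop` and `isInvariantIn` are hypothetical names.

```cpp
#include <cassert>
#include <set>

// Toy loop: only the set of block ids it contains (subloop blocks included).
struct ToyLoop {
  std::set<int> Blocks;
  bool contains(int Block) const { return Blocks.count(Block) != 0; }
};

// Simplified rule: invariant in L when no operand is defined inside L.
bool isInvariantIn(const ToyLoop &L, const std::set<int> &OperandDefBlocks) {
  for (int DefBlock : OperandDefBlocks)
    if (L.contains(DefBlock))
      return false;
  return true;
}
```

With an outer loop made of blocks {1, 2, 3}, an inner loop made of blocks {2, 3}, and an operand defined in block 1, the instruction is not invariant in the outer loop but is invariant in the inner loop, which is the case described above.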

we could skip basic blocks that were already visited with an inner loop, because the loop-invariant code of the inner loop has already been hoisted to the inner loop's preheader.

Please do implement this (and revert this patch in the meantime if non-trivial). LICM should visit each block only once, not once per parent loop.

In D154205#4494083, @jaykang10 wrote:

As you can see on MachineLoop::isLoopInvariant(), if the loop contains the definition of MI's operands, the MI is not loop invariant.
Let's see below loop form.
outer loop:
    definition of operand of below loop invariant code
    inner loop:
        loop invariant code
If we visit only outer loop, the loop invariant code can not be hoisted to outer loop's preheader because its operand's definition is in outer loop.
If we visit inner loop, the loop invariant code can be hoisted to the inner loop's preheader because the inner loop does not contain the definition of the invariant code's operand.
That's the reason why I want to visit the inner loops.
If you feel something wrong, please let me know.

My question wasn't about visiting the inner loops; it was about the order of visiting. Since I think it visits all blocks in child loops, I was wondering if we could aggressively hoist everything relative to the outer loop first instead of gradually, and then visit the inner loops to get anything that can't be hoisted all the way out. Though if we do as @nikic says and stop visiting blocks of inner loops when visiting outer loops, this won't work.

In D154205#4494140, @nikic wrote:

we could skip basic blocks that were already visited with an inner loop, because the loop-invariant code of the inner loop has already been hoisted to the inner loop's preheader.

Please do implement this (and revert this patch in the meantime if non-trivial). LICM should visit each block only once, not once per parent loop.

I tried it experimentally, but it was not simple and caused regressions... Let me try it again.
If possible, I would like to keep this patch in the meantime.

In D154205#4494166, @craig.topper wrote:
My question wasn't about visiting the inner loops it was about the order of visiting. Since I think it visits all blocks in child loops, I was wondering if we could aggressively hoist everything relative to the outer loop first instead of gradually. Then visit the inner loops to get anything that can't be hoisted all the way out. Though if we do as @nikic says and stop visiting blocks of inner loops when visiting outer loops, this won't work.

Ah, sorry... I misunderstood your question.
I think it would be worth trying the order you suggest.
Let me try to implement what @nikic suggested first.

In D154205#4494166, @craig.topper wrote:

My question wasn't about visiting the inner loops it was about the order of visiting. Since I think it visits all blocks in child loops, I was wondering if we could aggressively hoist everything relative to the outer loop first instead of gradually. Then visit the inner loops to get anything that can't be hoisted all the way out. Though if we do as @nikic says and stop visiting blocks of inner loops when visiting outer loops, this won't work.

There are multiple ways to go about this. We could only visit the outermost loop, but hoist to an inner loop preheader while doing so (i.e. perform the invariance check against the current inner-most loop to find candidates, and then hoist them to the outer-most parent loop that is still invariant). It's possible that doing this fits better into the MachineLICM model, which also needs to keep track of register pressure.
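
One way to read this suggestion is sketched below: start from the innermost loop that contains the instruction and climb the parent chain while the instruction stays invariant. The types and the simplified invariance rule are assumptions made for illustration; `ToyLoop`, `definesNoOperand` and `outermostInvariantLoop` are hypothetical names and not functions from MachineLICM.cpp. Register pressure is ignored here.

```cpp
#include <cassert>
#include <set>

struct ToyLoop {
  ToyLoop *Parent;      // nullptr for an outermost loop
  std::set<int> Blocks; // block ids, including those of subloops
};

// Simplified invariance rule: no operand is defined inside the loop.
bool definesNoOperand(const ToyLoop &L, const std::set<int> &OperandDefBlocks) {
  for (int B : OperandDefBlocks)
    if (L.Blocks.count(B))
      return false;
  return true;
}

// Climb from the innermost loop while the instruction remains invariant.
// Returns nullptr if it is not invariant even in the innermost loop.
const ToyLoop *outermostInvariantLoop(const ToyLoop *Innermost,
                                      const std::set<int> &OperandDefBlocks) {
  const ToyLoop *Best = nullptr;
  for (const ToyLoop *L = Innermost; L; L = L->Parent) {
    if (!definesNoOperand(*L, OperandDefBlocks))
      break; // every enclosing loop also contains this definition
    Best = L;
  }
  return Best;
}
```

The hoist target would then be the preheader of the returned loop, if that loop has one.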

Harbormaster completed remote builds in B244799: Diff 539567.Jul 12 2023, 12:48 PM

In D154205#4494458, @nikic wrote:

In D154205#4494166, @craig.topper wrote:

My question wasn't about visiting the inner loops it was about the order of visiting. Since I think it visits all blocks in child loops, I was wondering if we could aggressively hoist everything relative to the outer loop first instead of gradually. Then visit the inner loops to get anything that can't be hoisted all the way out. Though if we do as @nikic says and stop visiting blocks of inner loops when visiting outer loops, this won't work.

There are multiple ways to go about this. We could only visit the outermost loop, but hoist to an inner loop preheader while doing so (i.e. perform the invariance check against the current inner-most loop to find candidates, and then hoist them to the outer-most parent loop that is still invariant). It's possible that doing this fits better into the MachineLICM model, which also needs to keep track of register pressure.

Yep, I think it is a good idea.
As in the patch below, I tried to skip the blocks that had already been visited with inner loops, but the different register pressure caused regressions. In order to keep track of the register pressure properly, we need to visit all blocks in the loop again...
Maybe your idea could avoid the issue by keeping track of the register pressure properly...

diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp
index 1b84c0218867..c7ab88a3af22 100644
--- a/llvm/lib/CodeGen/MachineLICM.cpp
+++ b/llvm/lib/CodeGen/MachineLICM.cpp
@@ -239,9 +239,11 @@ namespace {
     void ExitScopeIfDone(
         MachineDomTreeNode *Node,
         DenseMap<MachineDomTreeNode *, unsigned> &OpenChildren,
-        const DenseMap<MachineDomTreeNode *, MachineDomTreeNode *> &ParentMap);
+        const DenseMap<MachineDomTreeNode *, MachineDomTreeNode *> &ParentMap,
+        SmallPtrSetImpl<MachineBasicBlock *> &VisitedLoopBBs);
 
-    void HoistOutOfLoop(MachineDomTreeNode *HeaderN);
+    void HoistOutOfLoop(MachineDomTreeNode *HeaderN,
+                        SmallPtrSetImpl<MachineBasicBlock *> &VisitedLoopBBs);
 
     void InitRegPressure(MachineBasicBlock *BB);
 
@@ -375,6 +377,7 @@ bool MachineLICMBase::runOnMachineFunction(MachineFunction &MF) {
   for (; MLII != MLIE; ++MLII)
     addSubLoopsToWorkList(*MLII, Worklist, PreRegAlloc);
 
+  SmallPtrSet<MachineBasicBlock *, 32> VisitedLoopBBs;
   while (!Worklist.empty()) {
     CurLoop = Worklist.pop_back_val();
     CurPreheader = nullptr;
@@ -389,8 +392,11 @@ bool MachineLICMBase::runOnMachineFunction(MachineFunction &MF) {
       // being hoisted.
       MachineDomTreeNode *N = DT->getNode(CurLoop->getHeader());
       FirstInLoop = true;
-      HoistOutOfLoop(N);
+      HoistOutOfLoop(N, VisitedLoopBBs);
       CSEMap.clear();
+      // Keep track of visited loop's blocks.
+      VisitedLoopBBs.insert(CurLoop->getBlocksVector().begin(),
+                            CurLoop->getBlocksVector().end());
     }
   }
 
@@ -694,14 +700,18 @@ void MachineLICMBase::ExitScope(MachineBasicBlock *MBB) {
 /// Destroy scope for the MBB that corresponds to the given dominator tree node
 /// if its a leaf or all of its children are done. Walk up the dominator tree to
 /// destroy ancestors which are now done.
-void MachineLICMBase::ExitScopeIfDone(MachineDomTreeNode *Node,
-    DenseMap<MachineDomTreeNode*, unsigned> &OpenChildren,
-    const DenseMap<MachineDomTreeNode*, MachineDomTreeNode*> &ParentMap) {
+void MachineLICMBase::ExitScopeIfDone(
+    MachineDomTreeNode *Node,
+    DenseMap<MachineDomTreeNode *, unsigned> &OpenChildren,
+    const DenseMap<MachineDomTreeNode *, MachineDomTreeNode *> &ParentMap,
+    SmallPtrSetImpl<MachineBasicBlock *> &VisitedLoopBBs) {
   if (OpenChildren[Node])
     return;
 
-  for(;;) {
-    ExitScope(Node->getBlock());
+  for (;;) {
+    // If the block was visited previously, do not process the block.
+    if (!VisitedLoopBBs.contains(Node->getBlock()))
+      ExitScope(Node->getBlock());
     // Now traverse upwards to pop ancestors whose offsprings are all done.
     MachineDomTreeNode *Parent = ParentMap.lookup(Node);
     if (!Parent || --OpenChildren[Parent] != 0)
@@ -714,7 +724,9 @@ void MachineLICMBase::ExitScopeIfDone(MachineDomTreeNode *Node,
 /// specified header block, and that are in the current loop) in depth first
 /// order w.r.t the DominatorTree. This allows us to visit definitions before
 /// uses, allowing us to hoist a loop body in one pass without iteration.
-void MachineLICMBase::HoistOutOfLoop(MachineDomTreeNode *HeaderN) {
+void MachineLICMBase::HoistOutOfLoop(
+    MachineDomTreeNode *HeaderN,
+    SmallPtrSetImpl<MachineBasicBlock *> &VisitedLoopBBs) {
   MachineBasicBlock *Preheader = getCurPreheader();
   if (!Preheader)
     return;
@@ -741,7 +753,10 @@ void MachineLICMBase::HoistOutOfLoop(MachineDomTreeNode *HeaderN) {
     if (!CurLoop->contains(BB))
       continue;
 
-    Scopes.push_back(Node);
+    // If the block was visited previously, do not process the block.
+    if (VisitedLoopBBs.empty() || !VisitedLoopBBs.contains(BB))
+      Scopes.push_back(Node);
+
     unsigned NumChildren = Node->getNumChildren();
 
     // Don't hoist things out of a large switch statement.  This often causes
@@ -786,7 +801,7 @@ void MachineLICMBase::HoistOutOfLoop(MachineDomTreeNode *HeaderN) {
     }
 
     // If it's a leaf node, it's done. Traverse upwards to pop ancestors.
-    ExitScopeIfDone(Node, OpenChildren, ParentMap);
+    ExitScopeIfDone(Node, OpenChildren, ParentMap, VisitedLoopBBs);
   }
 }

@jaykang10 Can you please revert this patch in the meantime? I'd like to make sure it does not make it into LLVM 17 in the current form.

In D154205#4513747, @nikic wrote:

@jaykang10 Can you please revert this patch in the meantime? I'd like to make sure it does not make it into LLVM 17 in the current form.

@nikic let me revert this commit.

jaykang10 added a reverting change: rG62ed3ff4bbdb: Revert "[MachineLICM] Handle Subloops".Jul 19 2023, 2:32 AM

Following @nikic's suggestion, if we fail to hoist an MI out of the outermost loop and the MI is in a subloop, try to hoist it to the subloop's preheader.

Harbormaster completed remote builds in B246504: Diff 541951.Jul 19 2023, 6:23 AM

@nikic Following your suggestion, I have updated the patch.
If possible, could you please comment on the update?

nikic reopened this revision.Jul 20 2023, 8:20 AM

This revision is now accepted and ready to land.Jul 20 2023, 8:20 AM

This revision was landed with ongoing or failed builds.Jul 20 2023, 8:46 AM

Closed by commit rG50dd383d0867: [MachineLICM] Handle Subloops (authored by jaykang10). · Explain Why

This revision was automatically updated to reflect the committed changes.

jaykang10 added a commit: rG50dd383d0867: [MachineLICM] Handle Subloops.

That wasn't intended as an approval -- it's a phabricator quirk that if a revision is reopened it will show as accepted. One usually has to do "reopen" and then "request review", but it looks like only you can do the second part.

llvm/lib/CodeGen/MachineLICM.cpp
789	This is going to hoist to the inner-most preheader, but it may be that the instruction is invariant wrt a number of loops (just not the outer-most one). Shouldn't we be going up the loop chain to find the highest invariant loop?

jaykang10 added a reverting change: rG351b4c17ddc0: Revert "[MachineLICM] Handle Subloops".Jul 20 2023, 9:13 AM

In D154205#4519284, @nikic wrote:

That wasn't intended as an approval -- it's a phabricator quirk that if a revision is reopened it will show as accepted. One usually has to do "reopen" and then "request review", but it looks like only you can do the second part.

Sorry... I have reverted the commit.

jaykang10 added inline comments.Jul 20 2023, 9:22 AM

llvm/lib/CodeGen/MachineLICM.cpp
789	I wanted to check that this approach is acceptable first. We can go up the loop chain. Let me update the code.

Fixed a bug.
With this patch, MIs can be hoisted into an inner loop's preheader, and that preheader may not dominate the block of a CSE candidate MI. In that case, CSE is skipped.

Following @nikic's comment, visit the loop chain from the outer loop to the inner one.

Harbormaster completed remote builds in B247194: Diff 542892.Jul 21 2023, 10:07 AM

Removed LoopIsOuterMostWithPredecessor.

Harbormaster completed remote builds in B247701: Diff 543575.Jul 24 2023, 9:01 AM

jaykang10 updated this revision to Diff 543576.Jul 24 2023, 9:03 AM

Harbormaster completed remote builds in B247702: Diff 543576.Jul 24 2023, 2:14 PM

@nikic I have updated this patch following your comment.
If you need something more, please let me know.

@nikic Can I push this change please?

@nikic, or anyone else, could you review this updated patch please?

The perf of this version looks OK. The previous one seemed to be more aggressive, causing more changes both positive and negative, but from the perf I ran this looks OK.

llvm/lib/CodeGen/MachineLICM.cpp
785	outmost -> outermost; is in subloop -> is in a subloop
806–807	Do you know what this refers to? I'm not sure I understand what it means. It might be worth just removing it.
llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll
0–1	This doesn't really look autogenerated to me.
41	Should all these lines be removed, or should they be updated for the new codegen?
79	This shouldn't be needed.

nikic added inline comments.Aug 1 2023, 6:14 AM

llvm/lib/CodeGen/MachineLICM.cpp
801	Uff, this looks like a pretty big hack. Is it viable to pass the loop as a parameter instead of temporarily changing "global" state?
1343	This seems to work around a larger issue. The problem is that CSEMap will get initialized for whichever preheader we happen to hoist into first. Thanks to this check, we will at least not make invalid replacements, but it means we will miss CSE opportunities if we are hoisting into any other preheader. Probably the CSE map should be by preheader, instead of having only the one.

jaykang10 added inline comments.Aug 1 2023, 7:16 AM

llvm/lib/CodeGen/MachineLICM.cpp
785	Yep, let me update them.
806–807	Let me check it.
llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll
0–1	Sorry... it looks like the script was not used the first time... For the `negated_cond` function, the `SIOptimizeExecMaskingPreRA` pass fails to fold the mask operations `V_CNDMASK_B32_e64` and `V_CMP_NE_U32` because they are hoisted. Let me update it manually.
41	It looks like these test lines are correct. Let me keep them in this patch.
79	Yep, let me remove it.

jaykang10 added inline comments.Aug 1 2023, 7:30 AM

llvm/lib/CodeGen/MachineLICM.cpp
801	Um... if possible, I did not want to change a lot in this pass... but I agree it is a big hack. Let me try to pass the global ones as parameters.
1343	That is a good point. Let me try to keep multiple CSE maps for multiple preheaders.

Following @nikic's comment, updated code.

Pass CurLoop and CurPreheader as function parameters
Keep CSEMap per preheader
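
As a sketch of what "CSEMap per preheader" could look like, the toy class below keeps one opcode-keyed map per preheader, so that CSE candidates are only looked up among instructions already hoisted into the same block. All types here are simplified stand-ins (plain integers instead of MachineBasicBlock and MachineInstr pointers), and `PerPreheaderCSE` is a hypothetical name; the container layout in the actual patch may differ.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy stand-ins: blocks and instructions are plain ids.
using BlockId = int;
using InstrId = int;
using Opcode = unsigned;
using OpcodeMap = std::map<Opcode, std::vector<InstrId>>;

class PerPreheaderCSE {
  std::map<BlockId, OpcodeMap> Maps; // one opcode map per preheader

public:
  // Record an instruction that was hoisted into Preheader.
  void record(BlockId Preheader, Opcode Opc, InstrId MI) {
    Maps[Preheader][Opc].push_back(MI);
  }

  // Candidates come only from the same preheader, so any match is in the
  // block we are hoisting into. Returns nullptr when there is none.
  const std::vector<InstrId> *candidates(BlockId Preheader, Opcode Opc) const {
    auto BlockIt = Maps.find(Preheader);
    if (BlockIt == Maps.end())
      return nullptr;
    auto OpcIt = BlockIt->second.find(Opc);
    return OpcIt == BlockIt->second.end() ? nullptr : &OpcIt->second;
  }
};
```

Keying by preheader avoids both problems raised in the review: a candidate from another preheader is never returned, and hoists into a second preheader still get their own CSE candidates.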

Harbormaster completed remote builds in B251072: Diff 548173.Aug 8 2023, 8:12 AM

@nikic If possible, could you check the update please?

@nikic Ping

@nikic ping

dmgreen added inline comments.Sep 14 2023, 2:50 AM

llvm/lib/CodeGen/MachineLICM.cpp
134–135	Will ExitBlocks be incorrect now?

jaykang10 added inline comments.Sep 14 2023, 4:00 AM

llvm/lib/CodeGen/MachineLICM.cpp
134–135	Ah, that is a good point! They are the outermost loop's ExitBlocks. Let me fix it. Thanks for checking.

Following @dmgreen's comment, checked ExitBlocks for each loop.

Harbormaster completed remote builds in B257215: Diff 556767.Sep 14 2023, 4:43 AM

dmgreen added inline comments.Sep 14 2023, 5:59 AM

llvm/lib/CodeGen/MachineLICM.cpp
134–135	Could this use CurLoop->isLoopExiting(ExitBlocks) instead? It might be quicker for larger loops.

jaykang10 added inline comments.Sep 14 2023, 6:56 AM

llvm/lib/CodeGen/MachineLICM.cpp
134–135	This function checks exit blocks, which are outside the loop and have a predecessor inside the loop. isLoopExiting checks exiting blocks, which are inside the loop and have a successor outside the loop. I think we need exit blocks here. Let me try to keep the exit blocks for each loop in order to avoid recalculation.

dmgreen added inline comments.Sep 14 2023, 7:10 AM

llvm/lib/CodeGen/MachineLICM.cpp
134–135	Oh I see. A different type of Exit Block. Could it check `!CurLoop->contains(ExitBlocks) && any_of(ExitBlocks->predecessors, is in CurLoop)`?

jaykang10 added inline comments.Sep 14 2023, 7:41 AM

llvm/lib/CodeGen/MachineLICM.cpp
134–135	`CurLoop->getExitBlocks` collects the blocks that are outside CurLoop and have a predecessor in CurLoop. This function checks whether the MBB parameter is among those blocks.

Created a map to keep exit blocks for each loop.
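
The distinction between exit blocks and exiting blocks discussed in these inline comments can be shown on a toy CFG. The sketch assumes a plain adjacency map instead of LLVM's MachineFunction, and `isExiting` and `isExit` are hypothetical helpers written for this illustration, not the LLVM member functions.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Toy CFG: block id -> successor ids.
using CFG = std::map<int, std::vector<int>>;

// Exiting block: inside the loop, with at least one successor outside it.
bool isExiting(const CFG &G, const std::set<int> &Loop, int BB) {
  if (!Loop.count(BB))
    return false;
  auto It = G.find(BB);
  if (It == G.end())
    return false;
  return std::any_of(It->second.begin(), It->second.end(),
                     [&](int Succ) { return Loop.count(Succ) == 0; });
}

// Exit block: outside the loop, with at least one predecessor inside it.
bool isExit(const CFG &G, const std::set<int> &Loop, int BB) {
  if (Loop.count(BB))
    return false;
  for (const auto &Entry : G) {
    const std::vector<int> &Succs = Entry.second;
    if (Loop.count(Entry.first) &&
        std::find(Succs.begin(), Succs.end(), BB) != Succs.end())
      return true;
  }
  return false;
}
```

For the CFG 0 -> 1 -> 2 -> {1, 3} with the loop {1, 2}, block 2 is the exiting block and block 3 is the exit block.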

Thanks. LGTM

llvm/lib/CodeGen/MachineLICM.cpp
134–135	Yeah - That was what I was aiming to capture. The ExitBlocks is outside the loop, but one of its predecessors is inside.

This revision is now accepted and ready to land.Sep 14 2023, 9:00 AM

dmgreen accepted this revision.Sep 14 2023, 9:00 AM

In D154205#4646027, @dmgreen wrote:

Thanks. LGTM

Thanks for review.
Let me push this patch.
If @nikic or other people want further changes to this patch, please let me know.

Harbormaster completed remote builds in B257231: Diff 556789.Sep 14 2023, 9:58 AM

Closed by commit rG5ec9699c4d1f: [MachineLICM] Handle Subloops (authored by jaykang10). · Explain WhySep 14 2023, 10:08 AM

This revision was automatically updated to reflect the committed changes.

jaykang10 added a commit: rG5ec9699c4d1f: [MachineLICM] Handle Subloops.

bkramer added a reverting change: rG3454cf67bd0a: Revert "[MachineLICM] Handle Subloops".Sep 15 2023, 4:26 AM

bkramer added a subscriber: bkramer.Sep 15 2023, 4:27 AM

bkramer added inline comments.

llvm/lib/CodeGen/MachineLICM.cpp
806–807	When MI is hoisted the pointer is no longer valid. I'm seeing use after frees with asan after this change, so reverted in 3454cf6

jaykang10 added inline comments.Sep 15 2023, 5:59 AM

llvm/lib/CodeGen/MachineLICM.cpp
806–807	Ah, I am so sorry... Let me check it again. Thanks for reverting the commit.

jaykang10 added inline comments.Sep 15 2023, 9:15 AM

llvm/lib/CodeGen/MachineLICM.cpp
806–807	It seems I need to check whether MI has been erased, because CSE can erase it. @bkramer If possible, can you let me know how I can reproduce the case you saw with asan please?

Checked whether MI has been erased before updating the register pressure, because CSE can remove MI after it is hoisted.

I think it would still be invalid to access MI.getParent if MI has been erased. It would be a use after delete. From looking at the old logic it appeared to run UpdateRegPressure only if Hoist was false. Can we do the same thing here?

In D154205#4647254, @dmgreen wrote:

I think it would still be invalid to access MI.getParent if MI has been erased. It would be a use after delete.

You are right. I was confusing it with remove, which sets the parent to nullptr... I am not sure how we can check for an erased MI...

From looking at the old logic it appeared to run UpdateRegPressure only if Hoist was false. Can we do the same thing here?

I am not sure why the original code did not update the register pressure for the hoisted loop invariant code... I think this pass needs to update the register pressure for loop invariant code which is hoisted to the preheader, because the preheader dominates the blocks in the loop and the invariant code creates a live-out of the preheader. I could be missing something...

Maybe change Hoist() to return an enum that represents not hoisted / hoisted / CSEd?

Harbormaster completed remote builds in B257331: Diff 556931.Sep 18 2023, 2:44 AM

In D154205#4647273, @nikic wrote:

Maybe change Hoist() to return an enum that represents not hoisted / hoisted / CSEd?

Thanks for the good suggestion.
Let me try it.

Following @nikic's comment, updated the return type of the Hoist function.
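
The benefit of a three-way result can be sketched with a toy model. This is an illustration only, not the code from the patch: `Instr`, `hoist` and `processBlock` are hypothetical, a `std::list` stands in for a basic block, and the enum merely reuses the three outcomes named in the suggestion. The point is that the caller can tell "moved but still alive" apart from "erased by CSE" and never touches an erased instruction.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>
#include <vector>

// The three outcomes named in the review suggestion.
enum class HoistResult { NotHoisted, Hoisted, CSEd };

// Hypothetical stand-in for a machine instruction.
struct Instr {
  std::string Text;
  bool Invariant;
};

using InstrList = std::list<Instr>;

// Try to move *It from Body into Preheader. If an identical instruction is
// already in Preheader, *It is erased instead (a toy version of CSE).
HoistResult hoist(InstrList &Body, InstrList::iterator It,
                  InstrList &Preheader) {
  assert(It != Body.end() && "hoist needs a valid instruction");
  if (!It->Invariant)
    return HoistResult::NotHoisted;
  for (const Instr &P : Preheader)
    if (P.Text == It->Text) {
      Body.erase(It); // The instruction is destroyed here.
      return HoistResult::CSEd;
    }
  // splice moves the node; It stays valid and now refers into Preheader.
  Preheader.splice(Preheader.end(), Body, It);
  return HoistResult::Hoisted;
}

// Hypothetical caller: records the text of every instruction that is still
// alive after the hoist attempt, and never dereferences an erased one.
std::vector<std::string> processBlock(InstrList &Body, InstrList &Preheader) {
  std::vector<std::string> Alive;
  for (auto It = Body.begin(); It != Body.end();) {
    auto Next = std::next(It); // Grab the successor before It can die.
    switch (hoist(Body, It, Preheader)) {
    case HoistResult::NotHoisted:
    case HoistResult::Hoisted:
      Alive.push_back(It->Text); // Safe: the instruction still exists.
      break;
    case HoistResult::CSEd:
      break; // It was erased; touching it would be a use after free.
    }
    It = Next;
  }
  return Alive;
}
```

With a plain `bool` return, the caller could not distinguish the `Hoisted` and `CSEd` cases, which is the situation that led to the use-after-free reports in this thread.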

Harbormaster completed remote builds in B257342: Diff 556948.Sep 18 2023, 7:35 AM

Leaving a comment here as well. The commit caused an ASAN issue downstream. Cherry-picking the revert fixed the ASAN issue. See https://github.com/llvm/llvm-project/commit/5ec9699c4d1f165364586d825baef434e2c110b4#commitcomment-127784939 for more details. Please account for this during the resubmit.

In D154205#4648315, @mravishankar wrote:

Leaving a comment here as well. The commit caused an ASAN issue downstream. Cherry-picking the revert fixed the ASAN issue. See https://github.com/llvm/llvm-project/commit/5ec9699c4d1f165364586d825baef434e2c110b4#commitcomment-127784939 for more details. Please account for this during the resubmit.

Thanks for the asan output.
I have updated the patch in this review to fix the asan error.
If possible, can you check whether the updated patch fixes the asan error on your side please?
I am also checking it and I have not seen the asan error yet.

nikic added inline comments.Sep 19 2023, 11:59 AM

llvm/lib/CodeGen/MachineLICM.cpp
134–135	Any reason not to name this enum and then use it instead of `unsigned`?

jaykang10 added inline comments.Sep 19 2023, 12:25 PM

llvm/lib/CodeGen/MachineLICM.cpp
134–135	Ah sorry. Let me update the code with a named enum tomorrow.

Following @nikic's comment, used a named enum.

Harbormaster completed remote builds in B257441: Diff 557099.Sep 20 2023, 12:50 AM

Sorry... with the latest update, I can see asan errors from the sanitizer bot locally.
Let me fix them again.

Fixed a bug.

ExtractHoistableLoad can erase MI. It creates a new MI for the unfolded load and assigns it to MI, but the caller's MI is not updated to the new MI. In that case, the caller's MI is not valid.
ExtractHoistableLoad updates the register pressure for the new MI, so we do not need to update the register pressure for it outside the function.

Harbormaster completed remote builds in B257566: Diff 557289.Sep 25 2023, 12:41 AM

I have run the sanitizer bot locally and there are no failed tests from the sanitizer bot with this patch.
If there is no objection, let me push this updated patch again.

Rebased

Thanks. I think the changes still look OK.

Harbormaster completed remote builds in B257570: Diff 557304.Sep 25 2023, 8:53 AM

jaykang10 added a commit: rGff68e43c811e: [MachineLICM] Handle Subloops.Sep 26 2023, 6:26 AM

Revision Contents

Path	Size
llvm/lib/CodeGen/MachineLICM.cpp	258 lines
llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll	6 lines
llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll	44 lines
llvm/test/CodeGen/AMDGPU/exec-mask-opt-cannot-create-empty-or-backward-segment.ll	8 lines
llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll	12 lines
llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll	100 lines
llvm/test/CodeGen/Thumb2/mve-gather-scatter-optimisation.ll	146 lines
llvm/test/CodeGen/WebAssembly/reg-stackify.ll	3 lines

Diff 557099

llvm/lib/CodeGen/MachineLICM.cpp

[104 lines not shown]
STATISTIC(NumPostRAHoisted,		STATISTIC(NumPostRAHoisted,
"Number of machine instructions hoisted out of loops post regalloc");		"Number of machine instructions hoisted out of loops post regalloc");
STATISTIC(NumStoreConst,		STATISTIC(NumStoreConst,
"Number of stores of const phys reg hoisted out of loops");		"Number of stores of const phys reg hoisted out of loops");
STATISTIC(NumNotHoistedDueToHotness,		STATISTIC(NumNotHoistedDueToHotness,
"Number of instructions not hoisted due to block frequency");		"Number of instructions not hoisted due to block frequency");

namespace {		namespace {
		enum HoistResult { NotHoisted = 0, Hoisted = 1, CSEd = 2 };

class MachineLICMBase : public MachineFunctionPass {		class MachineLICMBase : public MachineFunctionPass {
const TargetInstrInfo *TII = nullptr;		const TargetInstrInfo *TII = nullptr;
const TargetLoweringBase *TLI = nullptr;		const TargetLoweringBase *TLI = nullptr;
const TargetRegisterInfo *TRI = nullptr;		const TargetRegisterInfo *TRI = nullptr;
const MachineFrameInfo *MFI = nullptr;		const MachineFrameInfo *MFI = nullptr;
MachineRegisterInfo *MRI = nullptr;		MachineRegisterInfo *MRI = nullptr;
TargetSchedModel SchedModel;		TargetSchedModel SchedModel;
bool PreRegAlloc = false;		bool PreRegAlloc = false;
bool HasProfileData = false;		bool HasProfileData = false;

// Various analyses that we use...		// Various analyses that we use...
AliasAnalysis *AA = nullptr; // Alias analysis info.		AliasAnalysis *AA = nullptr; // Alias analysis info.
MachineBlockFrequencyInfo *MBFI = nullptr; // Machine block frequncy info		MachineBlockFrequencyInfo *MBFI = nullptr; // Machine block frequncy info
MachineLoopInfo *MLI = nullptr; // Current MachineLoopInfo		MachineLoopInfo *MLI = nullptr; // Current MachineLoopInfo
MachineDominatorTree *DT = nullptr; // Machine dominator tree for the cur loop		MachineDominatorTree *DT = nullptr; // Machine dominator tree for the cur loop

// State that is updated as we process loops		// State that is updated as we process loops
bool Changed = false; // True if a loop is changed.		bool Changed = false; // True if a loop is changed.
bool FirstInLoop = false; // True if it's the first LICM in the loop.		bool FirstInLoop = false; // True if it's the first LICM in the loop.
MachineLoop *CurLoop = nullptr; // The current loop we are working on.
MachineBasicBlock *CurPreheader = nullptr; // The preheader for CurLoop.

// Exit blocks for CurLoop.
SmallVector<MachineBasicBlock *, 8> ExitBlocks;

		dmgreen: Will ExitBlocks be incorrect now?
		jaykang10 (author): Ah, that is a good point! They are the outermost loop's ExitBlocks. Let me fix it. Thanks for checking it.
		dmgreen: Could this use CurLoop->isLoopExiting(ExitBlocks) instead? It might be quicker for larger loops.
		jaykang10 (author): This function checks for exit blocks, which are outside the loop and have a predecessor inside the loop. isLoopExiting checks for exiting blocks, which are inside the loop and have a successor outside the loop. I think we need exit blocks here. Let me try keeping the exit blocks for each loop in order to avoid recalculation.
		dmgreen: Oh I see. A different type of Exit Block. Could it check `!CurLoop->contains(ExitBlocks) && any_of(ExitBlocks->predecessors, is in CurLoop)`?
		jaykang10 (author): `CurLoop->getExitBlocks` collects the blocks which are outside CurLoop and have a predecessor in CurLoop. This function checks whether the MBB parameter is among those blocks.
		dmgreen: Yeah - That was what I was aiming to capture. The ExitBlocks is outside the loop, but one of its predecessors is inside.
		nikic: Any reason not to name this enum and then use it instead of `unsigned`?
		jaykang10 (author): Ah sorry. Let me update the code with a named enum tomorrow.
bool isExitBlock(const MachineBasicBlock *MBB) const {		// Exit blocks of each Loop.
		DenseMap<MachineLoop *, SmallVector<MachineBasicBlock *, 8>> ExitBlockMap;

		bool isExitBlock(MachineLoop *CurLoop, const MachineBasicBlock *MBB) {
		if (ExitBlockMap.contains(CurLoop))
		return is_contained(ExitBlockMap[CurLoop], MBB);

		SmallVector<MachineBasicBlock *, 8> ExitBlocks;
		CurLoop->getExitBlocks(ExitBlocks);
		ExitBlockMap[CurLoop] = ExitBlocks;
return is_contained(ExitBlocks, MBB);		return is_contained(ExitBlocks, MBB);
}		}

// Track 'estimated' register pressure.		// Track 'estimated' register pressure.
SmallSet<Register, 32> RegSeen;		SmallSet<Register, 32> RegSeen;
SmallVector<unsigned, 8> RegPressure;		SmallVector<unsigned, 8> RegPressure;

// Register pressure "limit" per register pressure set. If the pressure		// Register pressure "limit" per register pressure set. If the pressure
// is higher than the limit, then it's considered high.		// is higher than the limit, then it's considered high.
SmallVector<unsigned, 8> RegLimit;		SmallVector<unsigned, 8> RegLimit;

// Register pressure on path leading from loop preheader to current BB.		// Register pressure on path leading from loop preheader to current BB.
SmallVector<SmallVector<unsigned, 8>, 16> BackTrace;		SmallVector<SmallVector<unsigned, 8>, 16> BackTrace;

// For each opcode, keep a list of potential CSE instructions.		// For each opcode per preheader, keep a list of potential CSE instructions.
DenseMap<unsigned, std::vector<MachineInstr *>> CSEMap;		DenseMap<MachineBasicBlock *,
		DenseMap<unsigned, std::vector<MachineInstr *>>>
		CSEMap;

enum {		enum {
SpeculateFalse = 0,		SpeculateFalse = 0,
SpeculateTrue = 1,		SpeculateTrue = 1,
SpeculateUnknown = 2		SpeculateUnknown = 2
};		};

// If a MBB does not dominate loop exiting blocks then it may not safe		// If a MBB does not dominate loop exiting blocks then it may not safe
[18 lines not shown]	public:
}		}

void releaseMemory() override {		void releaseMemory() override {
RegSeen.clear();		RegSeen.clear();
RegPressure.clear();		RegPressure.clear();
RegLimit.clear();		RegLimit.clear();
BackTrace.clear();		BackTrace.clear();
CSEMap.clear();		CSEMap.clear();
		ExitBlockMap.clear();
}		}

private:		private:
/// Keep track of information about hoisting candidates.		/// Keep track of information about hoisting candidates.
struct CandidateInfo {		struct CandidateInfo {
MachineInstr *MI;		MachineInstr *MI;
unsigned Def;		unsigned Def;
int FI;		int FI;

CandidateInfo(MachineInstr *mi, unsigned def, int fi)		CandidateInfo(MachineInstr *mi, unsigned def, int fi)
: MI(mi), Def(def), FI(fi) {}		: MI(mi), Def(def), FI(fi) {}
};		};

void HoistRegionPostRA();		void HoistRegionPostRA(MachineLoop *CurLoop,
		MachineBasicBlock *CurPreheader);

void HoistPostRA(MachineInstr *MI, unsigned Def);		void HoistPostRA(MachineInstr *MI, unsigned Def, MachineLoop *CurLoop,
		MachineBasicBlock *CurPreheader);

void ProcessMI(MachineInstr *MI, BitVector &PhysRegDefs,		void ProcessMI(MachineInstr *MI, BitVector &PhysRegDefs,
BitVector &PhysRegClobbers, SmallSet<int, 32> &StoredFIs,		BitVector &PhysRegClobbers, SmallSet<int, 32> &StoredFIs,
SmallVectorImpl<CandidateInfo> &Candidates);		SmallVectorImpl<CandidateInfo> &Candidates,
		MachineLoop *CurLoop);

void AddToLiveIns(MCRegister Reg);		void AddToLiveIns(MCRegister Reg, MachineLoop *CurLoop);

bool IsLICMCandidate(MachineInstr &I);		bool IsLICMCandidate(MachineInstr &I, MachineLoop *CurLoop);

bool IsLoopInvariantInst(MachineInstr &I);		bool IsLoopInvariantInst(MachineInstr &I, MachineLoop *CurLoop);

bool HasLoopPHIUse(const MachineInstr *MI) const;		bool HasLoopPHIUse(const MachineInstr *MI, MachineLoop *CurLoop);

bool HasHighOperandLatency(MachineInstr &MI, unsigned DefIdx,		bool HasHighOperandLatency(MachineInstr &MI, unsigned DefIdx, Register Reg,
Register Reg) const;		MachineLoop *CurLoop) const;

bool IsCheapInstruction(MachineInstr &MI) const;		bool IsCheapInstruction(MachineInstr &MI) const;

bool CanCauseHighRegPressure(const DenseMap<unsigned, int> &Cost,		bool CanCauseHighRegPressure(const DenseMap<unsigned, int> &Cost,
bool Cheap);		bool Cheap);

void UpdateBackTraceRegPressure(const MachineInstr *MI);		void UpdateBackTraceRegPressure(const MachineInstr *MI);

bool IsProfitableToHoist(MachineInstr &MI);		bool IsProfitableToHoist(MachineInstr &MI, MachineLoop *CurLoop);

bool IsGuaranteedToExecute(MachineBasicBlock *BB);		bool IsGuaranteedToExecute(MachineBasicBlock *BB, MachineLoop *CurLoop);

bool isTriviallyReMaterializable(const MachineInstr &MI) const;		bool isTriviallyReMaterializable(const MachineInstr &MI) const;

void EnterScope(MachineBasicBlock *MBB);		void EnterScope(MachineBasicBlock *MBB);

void ExitScope(MachineBasicBlock *MBB);		void ExitScope(MachineBasicBlock *MBB);

void ExitScopeIfDone(		void ExitScopeIfDone(
MachineDomTreeNode *Node,		MachineDomTreeNode *Node,
DenseMap<MachineDomTreeNode *, unsigned> &OpenChildren,		DenseMap<MachineDomTreeNode *, unsigned> &OpenChildren,
const DenseMap<MachineDomTreeNode *, MachineDomTreeNode *> &ParentMap);		const DenseMap<MachineDomTreeNode *, MachineDomTreeNode *> &ParentMap);

void HoistOutOfLoop(MachineDomTreeNode *HeaderN);		void HoistOutOfLoop(MachineDomTreeNode *HeaderN, MachineLoop *CurLoop,
		MachineBasicBlock *CurPreheader);

void InitRegPressure(MachineBasicBlock *BB);		void InitRegPressure(MachineBasicBlock *BB);

DenseMap<unsigned, int> calcRegisterCost(const MachineInstr *MI,		DenseMap<unsigned, int> calcRegisterCost(const MachineInstr *MI,
bool ConsiderSeen,		bool ConsiderSeen,
bool ConsiderUnseenAsDef);		bool ConsiderUnseenAsDef);

void UpdateRegPressure(const MachineInstr *MI,		void UpdateRegPressure(const MachineInstr *MI,
bool ConsiderUnseenAsDef = false);		bool ConsiderUnseenAsDef = false);

MachineInstr *ExtractHoistableLoad(MachineInstr *MI);		MachineInstr *ExtractHoistableLoad(MachineInstr *MI, MachineLoop *CurLoop);

MachineInstr *LookForDuplicate(const MachineInstr *MI,		MachineInstr *LookForDuplicate(const MachineInstr *MI,
std::vector<MachineInstr *> &PrevMIs);		std::vector<MachineInstr *> &PrevMIs);

bool		bool
EliminateCSE(MachineInstr *MI,		EliminateCSE(MachineInstr *MI,
DenseMap<unsigned, std::vector<MachineInstr *>>::iterator &CI);		DenseMap<unsigned, std::vector<MachineInstr *>>::iterator &CI);

bool MayCSE(MachineInstr *MI);		bool MayCSE(MachineInstr *MI);

bool Hoist(MachineInstr *MI, MachineBasicBlock *Preheader);		HoistResult Hoist(MachineInstr *MI, MachineBasicBlock *Preheader,
		MachineLoop *CurLoop);

void InitCSEMap(MachineBasicBlock *BB);		void InitCSEMap(MachineBasicBlock *BB);

bool isTgtHotterThanSrc(MachineBasicBlock *SrcBlock,		bool isTgtHotterThanSrc(MachineBasicBlock *SrcBlock,
MachineBasicBlock *TgtBlock);		MachineBasicBlock *TgtBlock);
MachineBasicBlock *getCurPreheader();		MachineBasicBlock *getCurPreheader(MachineLoop *CurLoop,
		MachineBasicBlock *CurPreheader);
};		};

class MachineLICM : public MachineLICMBase {		class MachineLICM : public MachineLICMBase {
public:		public:
static char ID;		static char ID;
MachineLICM() : MachineLICMBase(ID, false) {		MachineLICM() : MachineLICMBase(ID, false) {
initializeMachineLICMPass(*PassRegistry::getPassRegistry());		initializeMachineLICMPass(*PassRegistry::getPassRegistry());
}		}
[28 lines not shown]	INITIALIZE_PASS_BEGIN(EarlyMachineLICM, "early-machinelicm",
"Early Machine Loop Invariant Code Motion", false, false)		"Early Machine Loop Invariant Code Motion", false, false)
INITIALIZE_PASS_DEPENDENCY(MachineLoopInfo)		INITIALIZE_PASS_DEPENDENCY(MachineLoopInfo)
INITIALIZE_PASS_DEPENDENCY(MachineBlockFrequencyInfo)		INITIALIZE_PASS_DEPENDENCY(MachineBlockFrequencyInfo)
INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)		INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)
INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass)		INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass)
INITIALIZE_PASS_END(EarlyMachineLICM, "early-machinelicm",		INITIALIZE_PASS_END(EarlyMachineLICM, "early-machinelicm",
"Early Machine Loop Invariant Code Motion", false, false)		"Early Machine Loop Invariant Code Motion", false, false)

/// Test if the given loop is the outer-most loop that has a unique predecessor.
static bool LoopIsOuterMostWithPredecessor(MachineLoop *CurLoop) {
// Check whether this loop even has a unique predecessor.
if (!CurLoop->getLoopPredecessor())
return false;
// Ok, now check to see if any of its outer loops do.
for (MachineLoop *L = CurLoop->getParentLoop(); L; L = L->getParentLoop())
if (L->getLoopPredecessor())
return false;
// None of them did, so this is the outermost with a unique predecessor.
return true;
}

bool MachineLICMBase::runOnMachineFunction(MachineFunction &MF) {		bool MachineLICMBase::runOnMachineFunction(MachineFunction &MF) {
if (skipFunction(MF.getFunction()))		if (skipFunction(MF.getFunction()))
return false;		return false;

		wxiao3: 2 questions: 1) Why don't we require the outer-most loop that has a unique predecessor? 2) Can we push the innermost loops into the worklist first?
		jaykang10 (author): Thanks for the questions. On 1): as you can see, the current implementation handles inner loops when the outermost loop does not have a unique predecessor. If a loop has a preheader, I think we can hoist loop invariant code into that preheader. The `HoistOutOfLoop` function checks this, so I think we do not need to check for the outer-most loop that has a unique predecessor. On 2): we use `Worklist.pop_back_val()`, which means the last element of the worklist is handled first. In order to handle the inner-most loop first, the inner-most loop is pushed as the last element of the worklist. If you feel something is wrong, please let me know.
Changed = FirstInLoop = false;		Changed = FirstInLoop = false;
const TargetSubtargetInfo &ST = MF.getSubtarget();		const TargetSubtargetInfo &ST = MF.getSubtarget();
TII = ST.getInstrInfo();		TII = ST.getInstrInfo();
TLI = ST.getTargetLowering();		TLI = ST.getTargetLowering();
TRI = ST.getRegisterInfo();		TRI = ST.getRegisterInfo();
MFI = &MF.getFrameInfo();		MFI = &MF.getFrameInfo();
MRI = &MF.getRegInfo();		MRI = &MF.getRegInfo();
SchedModel.init(&ST);		SchedModel.init(&ST);
[21 lines not shown]	bool MachineLICMBase::runOnMachineFunction(MachineFunction &MF) {
if (DisableHoistingToHotterBlocks != UseBFI::None)		if (DisableHoistingToHotterBlocks != UseBFI::None)
MBFI = &getAnalysis<MachineBlockFrequencyInfo>();		MBFI = &getAnalysis<MachineBlockFrequencyInfo>();
MLI = &getAnalysis<MachineLoopInfo>();		MLI = &getAnalysis<MachineLoopInfo>();
DT = &getAnalysis<MachineDominatorTree>();		DT = &getAnalysis<MachineDominatorTree>();
AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();		AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();

SmallVector<MachineLoop *, 8> Worklist(MLI->begin(), MLI->end());		SmallVector<MachineLoop *, 8> Worklist(MLI->begin(), MLI->end());
while (!Worklist.empty()) {		while (!Worklist.empty()) {
CurLoop = Worklist.pop_back_val();		MachineLoop *CurLoop = Worklist.pop_back_val();
CurPreheader = nullptr;		MachineBasicBlock *CurPreheader = nullptr;
ExitBlocks.clear();

// If this is done before regalloc, only visit outer-most preheader-sporting
// loops.
if (PreRegAlloc && !LoopIsOuterMostWithPredecessor(CurLoop)) {
Worklist.append(CurLoop->begin(), CurLoop->end());
continue;
}

CurLoop->getExitBlocks(ExitBlocks);

if (!PreRegAlloc)		if (!PreRegAlloc)
HoistRegionPostRA();		HoistRegionPostRA(CurLoop, CurPreheader);
else {		else {
// CSEMap is initialized for loop header when the first instruction is		// CSEMap is initialized for loop header when the first instruction is
// being hoisted.		// being hoisted.
MachineDomTreeNode *N = DT->getNode(CurLoop->getHeader());		MachineDomTreeNode *N = DT->getNode(CurLoop->getHeader());
FirstInLoop = true;		FirstInLoop = true;
HoistOutOfLoop(N);		HoistOutOfLoop(N, CurLoop, CurPreheader);
CSEMap.clear();		CSEMap.clear();
}		}
}		}

return Changed;		return Changed;
}		}

/// Return true if instruction stores to the specified frame.		/// Return true if instruction stores to the specified frame.
[15 lines not shown]	if (const FixedStackPseudoSourceValue *Value =
return true;		return true;
}		}
}		}
return false;		return false;
}		}

/// Examine the instruction for potentai LICM candidate. Also		/// Examine the instruction for potentai LICM candidate. Also
/// gather register def and frame object update information.		/// gather register def and frame object update information.
void MachineLICMBase::ProcessMI(MachineInstr *MI,		void MachineLICMBase::ProcessMI(MachineInstr *MI, BitVector &PhysRegDefs,
BitVector &PhysRegDefs,
BitVector &PhysRegClobbers,		BitVector &PhysRegClobbers,
SmallSet<int, 32> &StoredFIs,		SmallSet<int, 32> &StoredFIs,
SmallVectorImpl<CandidateInfo> &Candidates) {		SmallVectorImpl<CandidateInfo> &Candidates,
		MachineLoop *CurLoop) {
bool RuledOut = false;		bool RuledOut = false;
bool HasNonInvariantUse = false;		bool HasNonInvariantUse = false;
unsigned Def = 0;		unsigned Def = 0;
for (const MachineOperand &MO : MI->operands()) {		for (const MachineOperand &MO : MI->operands()) {
if (MO.isFI()) {		if (MO.isFI()) {
// Remember if the instruction stores to the frame index.		// Remember if the instruction stores to the frame index.
int FI = MO.getIndex();		int FI = MO.getIndex();
if (!StoredFIs.count(FI) &&		if (!StoredFIs.count(FI) &&
[61 lines not shown]	if (PhysRegClobbers.test(Reg))
// the loop, it cannot be a LICM candidate.		// the loop, it cannot be a LICM candidate.
RuledOut = true;		RuledOut = true;
}		}

// Only consider reloads for now and remats which do not have register		// Only consider reloads for now and remats which do not have register
// operands. FIXME: Consider unfold load folding instructions.		// operands. FIXME: Consider unfold load folding instructions.
if (Def && !RuledOut) {		if (Def && !RuledOut) {
int FI = std::numeric_limits<int>::min();		int FI = std::numeric_limits<int>::min();
if ((!HasNonInvariantUse && IsLICMCandidate(*MI)) ||		if ((!HasNonInvariantUse && IsLICMCandidate(*MI, CurLoop)) ||
(TII->isLoadFromStackSlot(*MI, FI) && MFI->isSpillSlotObjectIndex(FI)))		(TII->isLoadFromStackSlot(*MI, FI) && MFI->isSpillSlotObjectIndex(FI)))
Candidates.push_back(CandidateInfo(MI, Def, FI));		Candidates.push_back(CandidateInfo(MI, Def, FI));
}		}
}		}

/// Walk the specified region of the CFG and hoist loop invariants out to the		/// Walk the specified region of the CFG and hoist loop invariants out to the
/// preheader.		/// preheader.
void MachineLICMBase::HoistRegionPostRA() {		void MachineLICMBase::HoistRegionPostRA(MachineLoop *CurLoop,
MachineBasicBlock *Preheader = getCurPreheader();		MachineBasicBlock *CurPreheader) {
		MachineBasicBlock *Preheader = getCurPreheader(CurLoop, CurPreheader);
if (!Preheader)		if (!Preheader)
return;		return;

unsigned NumRegs = TRI->getNumRegs();		unsigned NumRegs = TRI->getNumRegs();
BitVector PhysRegDefs(NumRegs); // Regs defined once in the loop.		BitVector PhysRegDefs(NumRegs); // Regs defined once in the loop.
BitVector PhysRegClobbers(NumRegs); // Regs defined more than once.		BitVector PhysRegClobbers(NumRegs); // Regs defined more than once.

SmallVector<CandidateInfo, 32> Candidates;		SmallVector<CandidateInfo, 32> Candidates;
[16 lines not shown]	for (MachineBasicBlock *BB : CurLoop->getBlocks()) {
}		}

// Funclet entry blocks will clobber all registers		// Funclet entry blocks will clobber all registers
if (const uint32_t *Mask = BB->getBeginClobberMask(TRI))		if (const uint32_t *Mask = BB->getBeginClobberMask(TRI))
PhysRegClobbers.setBitsNotInMask(Mask);		PhysRegClobbers.setBitsNotInMask(Mask);

SpeculationState = SpeculateUnknown;		SpeculationState = SpeculateUnknown;
for (MachineInstr &MI : *BB)		for (MachineInstr &MI : *BB)
ProcessMI(&MI, PhysRegDefs, PhysRegClobbers, StoredFIs, Candidates);		ProcessMI(&MI, PhysRegDefs, PhysRegClobbers, StoredFIs, Candidates,
		CurLoop);
}		}

// Gather the registers read / clobbered by the terminator.		// Gather the registers read / clobbered by the terminator.
BitVector TermRegs(NumRegs);		BitVector TermRegs(NumRegs);
MachineBasicBlock::iterator TI = Preheader->getFirstTerminator();		MachineBasicBlock::iterator TI = Preheader->getFirstTerminator();
if (TI != Preheader->end()) {		if (TI != Preheader->end()) {
for (const MachineOperand &MO : TI->operands()) {		for (const MachineOperand &MO : TI->operands()) {
if (!MO.isReg())		if (!MO.isReg())
[31 lines not shown]	if (!PhysRegClobbers.test(Def) && !TermRegs.test(Def)) {
PhysRegClobbers.test(Reg)) {		PhysRegClobbers.test(Reg)) {
// If it's using a non-loop-invariant register, then it's obviously		// If it's using a non-loop-invariant register, then it's obviously
// not safe to hoist.		// not safe to hoist.
Safe = false;		Safe = false;
break;		break;
}		}
}		}
if (Safe)		if (Safe)
HoistPostRA(MI, Candidate.Def);		HoistPostRA(MI, Candidate.Def, CurLoop, CurPreheader);
}		}
}		}
}		}

/// Add register 'Reg' to the livein sets of BBs in the current loop, and make		/// Add register 'Reg' to the livein sets of BBs in the current loop, and make
/// sure it is not killed by any instructions in the loop.		/// sure it is not killed by any instructions in the loop.
void MachineLICMBase::AddToLiveIns(MCRegister Reg) {		void MachineLICMBase::AddToLiveIns(MCRegister Reg, MachineLoop *CurLoop) {
for (MachineBasicBlock *BB : CurLoop->getBlocks()) {		for (MachineBasicBlock *BB : CurLoop->getBlocks()) {
if (!BB->isLiveIn(Reg))		if (!BB->isLiveIn(Reg))
BB->addLiveIn(Reg);		BB->addLiveIn(Reg);
for (MachineInstr &MI : *BB) {		for (MachineInstr &MI : *BB) {
for (MachineOperand &MO : MI.all_uses()) {		for (MachineOperand &MO : MI.all_uses()) {
if (!MO.getReg())		if (!MO.getReg())
continue;		continue;
if (TRI->isSuperRegisterEq(Reg, MO.getReg()))		if (TRI->isSuperRegisterEq(Reg, MO.getReg()))
MO.setIsKill(false);		MO.setIsKill(false);
}		}
}		}
}		}
}		}

/// When an instruction is found to only use loop invariant operands that is		/// When an instruction is found to only use loop invariant operands that is
/// safe to hoist, this instruction is called to do the dirty work.		/// safe to hoist, this instruction is called to do the dirty work.
void MachineLICMBase::HoistPostRA(MachineInstr *MI, unsigned Def) {		void MachineLICMBase::HoistPostRA(MachineInstr *MI, unsigned Def,
MachineBasicBlock *Preheader = getCurPreheader();		MachineLoop *CurLoop,
		MachineBasicBlock *CurPreheader) {
		MachineBasicBlock *Preheader = getCurPreheader(CurLoop, CurPreheader);

// Now move the instructions to the predecessor, inserting it before any		// Now move the instructions to the predecessor, inserting it before any
// terminator instructions.		// terminator instructions.
LLVM_DEBUG(dbgs() << "Hoisting to " << printMBBReference(*Preheader)		LLVM_DEBUG(dbgs() << "Hoisting to " << printMBBReference(*Preheader)
<< " from " << printMBBReference(*MI->getParent()) << ": "		<< " from " << printMBBReference(*MI->getParent()) << ": "
<< *MI);		<< *MI);

// Splice the instruction to the preheader.		// Splice the instruction to the preheader.
MachineBasicBlock *MBB = MI->getParent();		MachineBasicBlock *MBB = MI->getParent();
Preheader->splice(Preheader->getFirstTerminator(), MBB, MI);		Preheader->splice(Preheader->getFirstTerminator(), MBB, MI);

// Since we are moving the instruction out of its basic block, we do not		// Since we are moving the instruction out of its basic block, we do not
// retain its debug location. Doing so would degrade the debugging		// retain its debug location. Doing so would degrade the debugging
// experience and adversely affect the accuracy of profiling information.		// experience and adversely affect the accuracy of profiling information.
assert(!MI->isDebugInstr() && "Should not hoist debug inst");		assert(!MI->isDebugInstr() && "Should not hoist debug inst");
MI->setDebugLoc(DebugLoc());		MI->setDebugLoc(DebugLoc());

// Add register to livein list to all the BBs in the current loop since a		// Add register to livein list to all the BBs in the current loop since a
// loop invariant must be kept live throughout the whole loop. This is		// loop invariant must be kept live throughout the whole loop. This is
// important to ensure later passes do not scavenge the def register.		// important to ensure later passes do not scavenge the def register.
AddToLiveIns(Def);		AddToLiveIns(Def, CurLoop);

++NumPostRAHoisted;		++NumPostRAHoisted;
Changed = true;		Changed = true;
}		}

/// Check if this mbb is guaranteed to execute. If not then a load from this mbb		/// Check if this mbb is guaranteed to execute. If not then a load from this mbb
/// may not be safe to hoist.		/// may not be safe to hoist.
bool MachineLICMBase::IsGuaranteedToExecute(MachineBasicBlock *BB) {		bool MachineLICMBase::IsGuaranteedToExecute(MachineBasicBlock *BB,
		MachineLoop *CurLoop) {
if (SpeculationState != SpeculateUnknown)		if (SpeculationState != SpeculateUnknown)
return SpeculationState == SpeculateFalse;		return SpeculationState == SpeculateFalse;

if (BB != CurLoop->getHeader()) {		if (BB != CurLoop->getHeader()) {
// Check loop exiting blocks.		// Check loop exiting blocks.
SmallVector<MachineBasicBlock*, 8> CurrentLoopExitingBlocks;		SmallVector<MachineBasicBlock*, 8> CurrentLoopExitingBlocks;
CurLoop->getExitingBlocks(CurrentLoopExitingBlocks);		CurLoop->getExitingBlocks(CurrentLoopExitingBlocks);
for (MachineBasicBlock *CurrentLoopExitingBlock : CurrentLoopExitingBlocks)		for (MachineBasicBlock *CurrentLoopExitingBlock : CurrentLoopExitingBlocks)
[54 lines not shown]	for(;;) {
Node = Parent;		Node = Parent;
}		}
}		}

/// Walk the specified loop in the CFG (defined by all blocks dominated by the		/// Walk the specified loop in the CFG (defined by all blocks dominated by the
/// specified header block, and that are in the current loop) in depth first		/// specified header block, and that are in the current loop) in depth first
/// order w.r.t the DominatorTree. This allows us to visit definitions before		/// order w.r.t the DominatorTree. This allows us to visit definitions before
/// uses, allowing us to hoist a loop body in one pass without iteration.		/// uses, allowing us to hoist a loop body in one pass without iteration.
void MachineLICMBase::HoistOutOfLoop(MachineDomTreeNode *HeaderN) {		void MachineLICMBase::HoistOutOfLoop(MachineDomTreeNode *HeaderN,
MachineBasicBlock *Preheader = getCurPreheader();		MachineLoop *CurLoop,
		MachineBasicBlock *CurPreheader) {
		MachineBasicBlock *Preheader = getCurPreheader(CurLoop, CurPreheader);
if (!Preheader)		if (!Preheader)
return;		return;

SmallVector<MachineDomTreeNode*, 32> Scopes;		SmallVector<MachineDomTreeNode*, 32> Scopes;
SmallVector<MachineDomTreeNode*, 8> WorkList;		SmallVector<MachineDomTreeNode*, 8> WorkList;
DenseMap<MachineDomTreeNode*, MachineDomTreeNode*> ParentMap;		DenseMap<MachineDomTreeNode*, MachineDomTreeNode*> ParentMap;
DenseMap<MachineDomTreeNode*, unsigned> OpenChildren;		DenseMap<MachineDomTreeNode*, unsigned> OpenChildren;

[47 lines not shown]	void MachineLICMBase::HoistOutOfLoop(MachineDomTreeNode *HeaderN,
for (MachineDomTreeNode *Node : Scopes) {		for (MachineDomTreeNode *Node : Scopes) {
MachineBasicBlock *MBB = Node->getBlock();		MachineBasicBlock *MBB = Node->getBlock();

EnterScope(MBB);		EnterScope(MBB);

// Process the block		// Process the block
SpeculationState = SpeculateUnknown;		SpeculationState = SpeculateUnknown;
for (MachineInstr &MI : llvm::make_early_inc_range(*MBB)) {		for (MachineInstr &MI : llvm::make_early_inc_range(*MBB)) {
if (!Hoist(&MI, Preheader))		HoistResult HoistRes = HoistResult::NotHoisted;
		HoistRes = Hoist(&MI, Preheader, CurLoop);
		dmgreen: outmost -> outermost; is in subloop -> is in a subloop
		jaykang10: Yep, let me update them.
		if (HoistRes == HoistResult::NotHoisted) {
		// We have failed to hoist MI to outermost loop's preheader. If MI is in
		// a subloop, try to hoist it to subloop's preheader.
		SmallVector<MachineLoop *> InnerLoopWorkList;
		nikic: This is going to hoist to the inner-most preheader, but it may be that the instruction is invariant wrt a number of loops (just not the outer-most one). Shouldn't we be going up the loop chain to find the highest invariant loop?
		jaykang10: I wanted to check that this approach is acceptable first. We can go up the loop chain. Let me update the code.
		for (MachineLoop *L = MLI->getLoopFor(MI.getParent()); L != CurLoop;
		L = L->getParentLoop())
		InnerLoopWorkList.push_back(L);

		while (!InnerLoopWorkList.empty()) {
		MachineLoop *InnerLoop = InnerLoopWorkList.pop_back_val();
		MachineBasicBlock *InnerLoopPreheader = InnerLoop->getLoopPreheader();
		if (InnerLoopPreheader) {
		HoistRes = Hoist(&MI, InnerLoopPreheader, InnerLoop);
		if (HoistRes != HoistResult::NotHoisted)
		break;
		}
		nikic: Uff, this looks like a pretty big hack. It is viable to pass the loop as a parameter instead of temporarily changing "global" state?
		jaykang10: um... if possible, I did not want to change a lot in this pass... but I agree it is big hack. Let me try to pass the global ones as parameters.
		}
		}

		if (HoistRes != HoistResult::CSEd)
UpdateRegPressure(&MI);		UpdateRegPressure(&MI);
// If we have hoisted an instruction that may store, it can only be a
// constant store.
}		}
		dmgreen: Do you know what this refers to? I'm not sure I understand what it means. It might be worth just removing it.
		jaykang10: Let me check it.
		bkramer: When MI is hoisted the pointer is no longer valid. I'm seeing use after frees with asan after this change, so reverted in 3454cf6
		jaykang10: Ah, I am so sorry... Let me check it again. Thanks for reverting the commit.
		jaykang10: It seems I need to check the MI is erased because CSE can be erased. @bkramer If possible, can you let me know how I can reproduce the case you saw with asan please?

// If it's a leaf node, it's done. Traverse upwards to pop ancestors.		// If it's a leaf node, it's done. Traverse upwards to pop ancestors.
ExitScopeIfDone(Node, OpenChildren, ParentMap);		ExitScopeIfDone(Node, OpenChildren, ParentMap);
}		}
}		}
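
The sub-loop handling in HoistOutOfLoop above (collect the loops between the instruction's innermost loop and CurLoop, then try them from the outermost candidate inwards) can be sketched in isolation. This is only a sketch under assumptions: `Loop`, `HasPreheader`, `pickHoistLoop` and the `CanHoist` callback are hypothetical stand-ins and are not part of MachineLICM or MachineLoopInfo.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for MachineLoop: only the parent link and a flag
// saying whether the loop has a preheader are modelled.
struct Loop {
  Loop *Parent = nullptr;
  bool HasPreheader = true;
};

// Collect the loops strictly inside Outer that enclose the instruction,
// innermost first. Then pop them back off, so the candidate closest to
// Outer is tried first, and return the first loop that has a preheader and
// is accepted by CanHoist. Returns nullptr if no loop qualifies.
Loop *pickHoistLoop(Loop *Innermost, Loop *Outer,
                    const std::function<bool(Loop *)> &CanHoist) {
  std::vector<Loop *> WorkList;
  for (Loop *L = Innermost; L && L != Outer; L = L->Parent)
    WorkList.push_back(L);

  while (!WorkList.empty()) {
    Loop *L = WorkList.back();
    WorkList.pop_back();
    if (L->HasPreheader && CanHoist(L))
      return L;
  }
  return nullptr;
}
```

Because the work list is filled innermost-first and drained from the back, an instruction that is invariant in several nested loops ends up in the preheader of the highest one that accepts it, which is the behaviour nikic asked about.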

static bool isOperandKill(const MachineOperand &MO, MachineRegisterInfo *MRI) {		static bool isOperandKill(const MachineOperand &MO, MachineRegisterInfo *MRI) {
return MO.isKill() || MRI->hasOneNonDBGUse(MO.getReg());		return MO.isKill() || MRI->hasOneNonDBGUse(MO.getReg());
[167 lines not shown]	for (MachineInstr &UseMI : MRI->use_instructions(CopyDstReg)) {
if (UseMI.mayStore() && isInvariantStore(UseMI, TRI, MRI))		if (UseMI.mayStore() && isInvariantStore(UseMI, TRI, MRI))
return true;		return true;
}		}
return false;		return false;
}		}

/// Returns true if the instruction may be a suitable candidate for LICM.		/// Returns true if the instruction may be a suitable candidate for LICM.
/// e.g. If the instruction is a call, then it's obviously not safe to hoist it.		/// e.g. If the instruction is a call, then it's obviously not safe to hoist it.
bool MachineLICMBase::IsLICMCandidate(MachineInstr &I) {		bool MachineLICMBase::IsLICMCandidate(MachineInstr &I, MachineLoop *CurLoop) {
// Check if it's safe to move the instruction.		// Check if it's safe to move the instruction.
bool DontMoveAcrossStore = true;		bool DontMoveAcrossStore = true;
if ((!I.isSafeToMove(AA, DontMoveAcrossStore)) &&		if ((!I.isSafeToMove(AA, DontMoveAcrossStore)) &&
!(HoistConstStores && isInvariantStore(I, TRI, MRI))) {		!(HoistConstStores && isInvariantStore(I, TRI, MRI))) {
LLVM_DEBUG(dbgs() << "LICM: Instruction not safe to move.\n");		LLVM_DEBUG(dbgs() << "LICM: Instruction not safe to move.\n");
return false;		return false;
}		}

// If it is a load then check if it is guaranteed to execute by making sure		// If it is a load then check if it is guaranteed to execute by making sure
// that it dominates all exiting blocks. If it doesn't, then there is a path		// that it dominates all exiting blocks. If it doesn't, then there is a path
// out of the loop which does not execute this load, so we can't hoist it.		// out of the loop which does not execute this load, so we can't hoist it.
// Loads from constant memory are safe to speculate, for example indexed load		// Loads from constant memory are safe to speculate, for example indexed load
// from a jump table.		// from a jump table.
// Stores and side effects are already checked by isSafeToMove.		// Stores and side effects are already checked by isSafeToMove.
if (I.mayLoad() && !mayLoadFromGOTOrConstantPool(I) &&		if (I.mayLoad() && !mayLoadFromGOTOrConstantPool(I) &&
!IsGuaranteedToExecute(I.getParent())) {		!IsGuaranteedToExecute(I.getParent(), CurLoop)) {
LLVM_DEBUG(dbgs() << "LICM: Load not guaranteed to execute.\n");		LLVM_DEBUG(dbgs() << "LICM: Load not guaranteed to execute.\n");
return false;		return false;
}		}

// Convergent attribute has been used on operations that involve inter-thread		// Convergent attribute has been used on operations that involve inter-thread
// communication which results are implicitly affected by the enclosing		// communication which results are implicitly affected by the enclosing
// control flows. It is not safe to hoist or sink such operations across		// control flows. It is not safe to hoist or sink such operations across
// control flow.		// control flow.
if (I.isConvergent())		if (I.isConvergent())
return false;		return false;

if (!TII->shouldHoist(I, CurLoop))		if (!TII->shouldHoist(I, CurLoop))
return false;		return false;

return true;		return true;
}		}

/// Returns true if the instruction is loop invariant.		/// Returns true if the instruction is loop invariant.
bool MachineLICMBase::IsLoopInvariantInst(MachineInstr &I) {		bool MachineLICMBase::IsLoopInvariantInst(MachineInstr &I,
if (!IsLICMCandidate(I)) {		MachineLoop *CurLoop) {
		if (!IsLICMCandidate(I, CurLoop)) {
LLVM_DEBUG(dbgs() << "LICM: Instruction not a LICM candidate\n");		LLVM_DEBUG(dbgs() << "LICM: Instruction not a LICM candidate\n");
return false;		return false;
}		}
return CurLoop->isLoopInvariant(I);		return CurLoop->isLoopInvariant(I);
}		}

/// Return true if the specified instruction is used by a phi node and hoisting		/// Return true if the specified instruction is used by a phi node and hoisting
/// it could cause a copy to be inserted.		/// it could cause a copy to be inserted.
bool MachineLICMBase::HasLoopPHIUse(const MachineInstr *MI) const {		bool MachineLICMBase::HasLoopPHIUse(const MachineInstr *MI,
		MachineLoop *CurLoop) {
SmallVector<const MachineInstr*, 8> Work(1, MI);		SmallVector<const MachineInstr *, 8> Work(1, MI);
do {		do {
MI = Work.pop_back_val();		MI = Work.pop_back_val();
for (const MachineOperand &MO : MI->all_defs()) {		for (const MachineOperand &MO : MI->all_defs()) {
Register Reg = MO.getReg();		Register Reg = MO.getReg();
if (!Reg.isVirtual())		if (!Reg.isVirtual())
continue;		continue;
for (MachineInstr &UseMI : MRI->use_instructions(Reg)) {		for (MachineInstr &UseMI : MRI->use_instructions(Reg)) {
// A PHI may cause a copy to be inserted.		// A PHI may cause a copy to be inserted.
if (UseMI.isPHI()) {		if (UseMI.isPHI()) {
// A PHI inside the loop causes a copy because the live range of Reg is		// A PHI inside the loop causes a copy because the live range of Reg is
// extended across the PHI.		// extended across the PHI.
if (CurLoop->contains(&UseMI))		if (CurLoop->contains(&UseMI))
return true;		return true;
// A PHI in an exit block can cause a copy to be inserted if the PHI		// A PHI in an exit block can cause a copy to be inserted if the PHI
// has multiple predecessors in the loop with different values.		// has multiple predecessors in the loop with different values.
// For now, approximate by rejecting all exit blocks.		// For now, approximate by rejecting all exit blocks.
if (isExitBlock(UseMI.getParent()))		if (isExitBlock(CurLoop, UseMI.getParent()))
return true;		return true;
continue;		continue;
}		}
// Look past copies as well.		// Look past copies as well.
if (UseMI.isCopy() && CurLoop->contains(&UseMI))		if (UseMI.isCopy() && CurLoop->contains(&UseMI))
Work.push_back(&UseMI);		Work.push_back(&UseMI);
}		}
}		}
} while (!Work.empty());		} while (!Work.empty());
return false;		return false;
}		}

/// Compute operand latency between a def of 'Reg' and an use in the current		/// Compute operand latency between a def of 'Reg' and an use in the current
/// loop, return true if the target considered it high.		/// loop, return true if the target considered it high.
bool MachineLICMBase::HasHighOperandLatency(MachineInstr &MI, unsigned DefIdx,		bool MachineLICMBase::HasHighOperandLatency(MachineInstr &MI, unsigned DefIdx,
Register Reg) const {		Register Reg,
		MachineLoop *CurLoop) const {
if (MRI->use_nodbg_empty(Reg))		if (MRI->use_nodbg_empty(Reg))
return false;		return false;

for (MachineInstr &UseMI : MRI->use_nodbg_instructions(Reg)) {		for (MachineInstr &UseMI : MRI->use_nodbg_instructions(Reg)) {
if (UseMI.isCopyLike())		if (UseMI.isCopyLike())
continue;		continue;
if (!CurLoop->contains(UseMI.getParent()))		if (!CurLoop->contains(UseMI.getParent()))
continue;		continue;
[78 lines not shown]	void MachineLICMBase::UpdateBackTraceRegPressure(const MachineInstr *MI) {
// Update register pressure of blocks from loop header to current block.		// Update register pressure of blocks from loop header to current block.
for (auto &RP : BackTrace)		for (auto &RP : BackTrace)
for (const auto &RPIdAndCost : Cost)		for (const auto &RPIdAndCost : Cost)
RP[RPIdAndCost.first] += RPIdAndCost.second;		RP[RPIdAndCost.first] += RPIdAndCost.second;
}		}

/// Return true if it is potentially profitable to hoist the given loop		/// Return true if it is potentially profitable to hoist the given loop
/// invariant.		/// invariant.
bool MachineLICMBase::IsProfitableToHoist(MachineInstr &MI) {		bool MachineLICMBase::IsProfitableToHoist(MachineInstr &MI,
		MachineLoop *CurLoop) {
if (MI.isImplicitDef())		if (MI.isImplicitDef())
return true;		return true;

// Besides removing computation from the loop, hoisting an instruction has		// Besides removing computation from the loop, hoisting an instruction has
// these effects:		// these effects:
//		//
// - The value defined by the instruction becomes live across the entire		// - The value defined by the instruction becomes live across the entire
// loop. This increases register pressure in the loop.		// loop. This increases register pressure in the loop.
//		//
// - If the value is used by a PHI in the loop, a copy will be required for		// - If the value is used by a PHI in the loop, a copy will be required for
// lowering the PHI after extending the live range.		// lowering the PHI after extending the live range.
//		//
// - When hoisting the last use of a value in the loop, that value no longer		// - When hoisting the last use of a value in the loop, that value no longer
// needs to be live in the loop. This lowers register pressure in the loop.		// needs to be live in the loop. This lowers register pressure in the loop.

if (HoistConstStores && isCopyFeedingInvariantStore(MI, MRI, TRI))		if (HoistConstStores && isCopyFeedingInvariantStore(MI, MRI, TRI))
return true;		return true;

bool CheapInstr = IsCheapInstruction(MI);		bool CheapInstr = IsCheapInstruction(MI);
bool CreatesCopy = HasLoopPHIUse(&MI);		bool CreatesCopy = HasLoopPHIUse(&MI, CurLoop);

// Don't hoist a cheap instruction if it would create a copy in the loop.		// Don't hoist a cheap instruction if it would create a copy in the loop.
if (CheapInstr && CreatesCopy) {		if (CheapInstr && CreatesCopy) {
LLVM_DEBUG(dbgs() << "Won't hoist cheap instr with loop PHI use: " << MI);		LLVM_DEBUG(dbgs() << "Won't hoist cheap instr with loop PHI use: " << MI);
return false;		return false;
}		}

// Rematerializable instructions should always be hoisted providing the		// Rematerializable instructions should always be hoisted providing the
// register allocator can just pull them down again when needed.		// register allocator can just pull them down again when needed.
if (isTriviallyReMaterializable(MI))		if (isTriviallyReMaterializable(MI))
return true;		return true;

// FIXME: If there are long latency loop-invariant instructions inside the		// FIXME: If there are long latency loop-invariant instructions inside the
// loop at this point, why didn't the optimizer's LICM hoist them?		// loop at this point, why didn't the optimizer's LICM hoist them?
for (unsigned i = 0, e = MI.getDesc().getNumOperands(); i != e; ++i) {		for (unsigned i = 0, e = MI.getDesc().getNumOperands(); i != e; ++i) {
const MachineOperand &MO = MI.getOperand(i);		const MachineOperand &MO = MI.getOperand(i);
if (!MO.isReg() || MO.isImplicit())		if (!MO.isReg() || MO.isImplicit())
continue;		continue;
Register Reg = MO.getReg();		Register Reg = MO.getReg();
if (!Reg.isVirtual())		if (!Reg.isVirtual())
continue;		continue;
if (MO.isDef() && HasHighOperandLatency(MI, i, Reg)) {		if (MO.isDef() && HasHighOperandLatency(MI, i, Reg, CurLoop)) {
LLVM_DEBUG(dbgs() << "Hoist High Latency: " << MI);		LLVM_DEBUG(dbgs() << "Hoist High Latency: " << MI);
++NumHighLatency;		++NumHighLatency;
return true;		return true;
}		}
}		}

// Estimate register pressure to determine whether to LICM the instruction.		// Estimate register pressure to determine whether to LICM the instruction.
// In low register pressure situation, we can be more aggressive about		// In low register pressure situation, we can be more aggressive about
[17 lines not shown]	if (CreatesCopy) {
LLVM_DEBUG(dbgs() << "Won't hoist instr with loop PHI use: " << MI);		LLVM_DEBUG(dbgs() << "Won't hoist instr with loop PHI use: " << MI);
return false;		return false;
}		}

// Do not "speculate" in high register pressure situation. If an		// Do not "speculate" in high register pressure situation. If an
// instruction is not guaranteed to be executed in the loop, it's best to be		// instruction is not guaranteed to be executed in the loop, it's best to be
// conservative.		// conservative.
if (AvoidSpeculation &&		if (AvoidSpeculation &&
(!IsGuaranteedToExecute(MI.getParent()) && !MayCSE(&MI))) {		(!IsGuaranteedToExecute(MI.getParent(), CurLoop) && !MayCSE(&MI))) {
LLVM_DEBUG(dbgs() << "Won't speculate: " << MI);		LLVM_DEBUG(dbgs() << "Won't speculate: " << MI);
return false;		return false;
}		}

// High register pressure situation, only hoist if the instruction is going		// High register pressure situation, only hoist if the instruction is going
// to be remat'ed.		// to be remat'ed.
if (!isTriviallyReMaterializable(MI) &&		if (!isTriviallyReMaterializable(MI) &&
!MI.isDereferenceableInvariantLoad()) {		!MI.isDereferenceableInvariantLoad()) {
LLVM_DEBUG(dbgs() << "Can't remat / high reg-pressure: " << MI);		LLVM_DEBUG(dbgs() << "Can't remat / high reg-pressure: " << MI);
return false;		return false;
}		}

return true;		return true;
}		}

/// Unfold a load from the given machineinstr if the load itself could be		/// Unfold a load from the given machineinstr if the load itself could be
/// hoisted. Return the unfolded and hoistable load, or null if the load		/// hoisted. Return the unfolded and hoistable load, or null if the load
/// couldn't be unfolded or if it wouldn't be hoistable.		/// couldn't be unfolded or if it wouldn't be hoistable.
MachineInstr *MachineLICMBase::ExtractHoistableLoad(MachineInstr *MI) {		MachineInstr *MachineLICMBase::ExtractHoistableLoad(MachineInstr *MI,
		MachineLoop *CurLoop) {
// Don't unfold simple loads.		// Don't unfold simple loads.
if (MI->canFoldAsLoad())		if (MI->canFoldAsLoad())
return nullptr;		return nullptr;

// If not, we may be able to unfold a load and hoist that.		// If not, we may be able to unfold a load and hoist that.
// First test whether the instruction is loading from an amenable		// First test whether the instruction is loading from an amenable
// memory location.		// memory location.
if (!MI->isDereferenceableInvariantLoad())		if (!MI->isDereferenceableInvariantLoad())
[24 lines not shown]	MachineInstr *MachineLICMBase::ExtractHoistableLoad(MachineInstr *MI,
assert(NewMIs.size() == 2 &&		assert(NewMIs.size() == 2 &&
"Unfolded a load into multiple instructions!");		"Unfolded a load into multiple instructions!");
MachineBasicBlock *MBB = MI->getParent();		MachineBasicBlock *MBB = MI->getParent();
MachineBasicBlock::iterator Pos = MI;		MachineBasicBlock::iterator Pos = MI;
MBB->insert(Pos, NewMIs[0]);		MBB->insert(Pos, NewMIs[0]);
MBB->insert(Pos, NewMIs[1]);		MBB->insert(Pos, NewMIs[1]);
// If unfolding produced a load that wasn't loop-invariant or profitable to		// If unfolding produced a load that wasn't loop-invariant or profitable to
// hoist, discard the new instructions and bail.		// hoist, discard the new instructions and bail.
if (!IsLoopInvariantInst(*NewMIs[0]) || !IsProfitableToHoist(*NewMIs[0])) {		if (!IsLoopInvariantInst(*NewMIs[0], CurLoop) ||
		!IsProfitableToHoist(*NewMIs[0], CurLoop)) {
NewMIs[0]->eraseFromParent();		NewMIs[0]->eraseFromParent();
NewMIs[1]->eraseFromParent();		NewMIs[1]->eraseFromParent();
return nullptr;		return nullptr;
}		}

// Update register pressure for the unfolded instruction.		// Update register pressure for the unfolded instruction.
UpdateRegPressure(NewMIs[1]);		UpdateRegPressure(NewMIs[1]);

// Otherwise we successfully unfolded a load that we can hoist.		// Otherwise we successfully unfolded a load that we can hoist.

// Update the call site info.		// Update the call site info.
if (MI->shouldUpdateCallSiteInfo())		if (MI->shouldUpdateCallSiteInfo())
MF.eraseCallSiteInfo(MI);		MF.eraseCallSiteInfo(MI);

MI->eraseFromParent();		MI->eraseFromParent();
return NewMIs[0];		return NewMIs[0];
}		}

/// Initialize the CSE map with instructions that are in the current loop		/// Initialize the CSE map with instructions that are in the current loop
/// preheader that may become duplicates of instructions that are hoisted		/// preheader that may become duplicates of instructions that are hoisted
/// out of the loop.		/// out of the loop.
void MachineLICMBase::InitCSEMap(MachineBasicBlock *BB) {		void MachineLICMBase::InitCSEMap(MachineBasicBlock *BB) {
for (MachineInstr &MI : *BB)		for (MachineInstr &MI : *BB)
CSEMap[MI.getOpcode()].push_back(&MI);		CSEMap[BB][MI.getOpcode()].push_back(&MI);
}		}

/// Find an instruction amount PrevMIs that is a duplicate of MI.		/// Find an instruction amount PrevMIs that is a duplicate of MI.
/// Return this instruction if it's found.		/// Return this instruction if it's found.
MachineInstr *		MachineInstr *
MachineLICMBase::LookForDuplicate(const MachineInstr *MI,		MachineLICMBase::LookForDuplicate(const MachineInstr *MI,
std::vector<MachineInstr *> &PrevMIs) {		std::vector<MachineInstr *> &PrevMIs) {
for (MachineInstr *PrevMI : PrevMIs)		for (MachineInstr *PrevMI : PrevMIs)
if (TII->produceSameValue(MI, PrevMI, (PreRegAlloc ? MRI : nullptr)))		if (TII->produceSameValue(MI, PrevMI, (PreRegAlloc ? MRI : nullptr)))
return PrevMI;		return PrevMI;

return nullptr;		return nullptr;
		nikic: This seems to work around a larger issue. The problem is that CSEMap will get initialized for whichever preheader we happen to hoist into first. Thanks to this check, we will at least not make invalid replacements, but it means we will miss CSE opportunities if we are hoisting into any other preheader. Probably the CSE map should be by preheader, instead of having only the one.
		jaykang10: It is good point. Let me try to keep multiple CSE maps for multiple preheaders.
}		}

/// Given a LICM'ed instruction, look for an instruction on the preheader that		/// Given a LICM'ed instruction, look for an instruction on the preheader that
/// computes the same value. If it's found, do a RAU on with the definition of		/// computes the same value. If it's found, do a RAU on with the definition of
/// the existing instruction rather than hoisting the instruction to the		/// the existing instruction rather than hoisting the instruction to the
/// preheader.		/// preheader.
bool MachineLICMBase::EliminateCSE(		bool MachineLICMBase::EliminateCSE(
MachineInstr *MI,		MachineInstr *MI,
DenseMap<unsigned, std::vector<MachineInstr *>>::iterator &CI) {		DenseMap<unsigned, std::vector<MachineInstr *>>::iterator &CI) {
// Do not CSE implicit_def so ProcessImplicitDefs can properly propagate		// Do not CSE implicit_def so ProcessImplicitDefs can properly propagate
// the undef property onto uses.		// the undef property onto uses.
if (CI == CSEMap.end() || MI->isImplicitDef())		if (MI->isImplicitDef())
return false;		return false;

if (MachineInstr *Dup = LookForDuplicate(MI, CI->second)) {		if (MachineInstr *Dup = LookForDuplicate(MI, CI->second)) {
LLVM_DEBUG(dbgs() << "CSEing " << MI << " with " << Dup);		LLVM_DEBUG(dbgs() << "CSEing " << MI << " with " << Dup);

// Replace virtual registers defined by MI by their counterparts defined		// Replace virtual registers defined by MI by their counterparts defined
// by Dup.		// by Dup.
SmallVector<unsigned, 2> Defs;		SmallVector<unsigned, 2> Defs;
[40 lines not shown]	bool MachineLICMBase::EliminateCSE(
}		}
return false;		return false;
}		}

/// Return true if the given instruction will be CSE'd if it's hoisted out of		/// Return true if the given instruction will be CSE'd if it's hoisted out of
/// the loop.		/// the loop.
bool MachineLICMBase::MayCSE(MachineInstr *MI) {		bool MachineLICMBase::MayCSE(MachineInstr *MI) {
unsigned Opcode = MI->getOpcode();		unsigned Opcode = MI->getOpcode();
		for (auto &Map : CSEMap) {
		// Check this CSEMap's preheader dominates MI's basic block.
		if (DT->dominates(Map.first, MI->getParent())) {
DenseMap<unsigned, std::vector<MachineInstr *>>::iterator CI =		DenseMap<unsigned, std::vector<MachineInstr *>>::iterator CI =
CSEMap.find(Opcode);		Map.second.find(Opcode);
// Do not CSE implicit_def so ProcessImplicitDefs can properly propagate		// Do not CSE implicit_def so ProcessImplicitDefs can properly propagate
// the undef property onto uses.		// the undef property onto uses.
if (CI == CSEMap.end() || MI->isImplicitDef())		if (CI == Map.second.end() || MI->isImplicitDef())
return false;		continue;
		if (LookForDuplicate(MI, CI->second) != nullptr)
		return true;
		}
		}

return LookForDuplicate(MI, CI->second) != nullptr;		return false;
}		}
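
The per-preheader CSE map that nikic suggested, combined with the dominance check used in MayCSE above, can be modelled roughly as follows. This is a simplified sketch, not the pass's data structures: `Instr`, `Block`, `CSEMapTy`, `findDuplicate` and the `Dominates` callback are hypothetical, and "same opcode and same operand" is an assumed stand-in for `TII->produceSameValue`.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical, heavily simplified instruction: an opcode plus one operand.
struct Instr {
  unsigned Opcode;
  int Operand;
};

using Block = std::string; // hypothetical block identifier
using OpcodeMap = std::map<unsigned, std::vector<Instr>>;
using CSEMapTy = std::map<Block, OpcodeMap>; // one opcode map per preheader

// Search only the maps whose preheader dominates MI's block. Within such a
// map, look up MI's opcode and return an entry with the same operand, or
// nullptr if there is none.
const Instr *findDuplicate(
    const CSEMapTy &CSEMap, const Instr &MI, const Block &MIBlock,
    const std::function<bool(const Block &, const Block &)> &Dominates) {
  for (const auto &Entry : CSEMap) {
    if (!Dominates(Entry.first, MIBlock))
      continue;
    auto CI = Entry.second.find(MI.Opcode);
    if (CI == Entry.second.end())
      continue;
    for (const Instr &Prev : CI->second)
      if (Prev.Operand == MI.Operand)
        return &Prev;
  }
  return nullptr;
}
```

Keying the outer map by preheader means a candidate is never matched against an instruction in a block that does not dominate it, while still allowing CSE against any dominating preheader rather than only the first one hoisted into.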

/// When an instruction is found to use only loop invariant operands		/// When an instruction is found to use only loop invariant operands
/// that are safe to hoist, this instruction is called to do the dirty work.		/// that are safe to hoist, this instruction is called to do the dirty work.
/// It returns true if the instruction is hoisted.		/// It returns true if the instruction is hoisted.
bool MachineLICMBase::Hoist(MachineInstr *MI, MachineBasicBlock *Preheader) {		HoistResult MachineLICMBase::Hoist(MachineInstr *MI,
		MachineBasicBlock *Preheader,
		MachineLoop *CurLoop) {
MachineBasicBlock *SrcBlock = MI->getParent();		MachineBasicBlock *SrcBlock = MI->getParent();

// Disable the instruction hoisting due to block hotness		// Disable the instruction hoisting due to block hotness
if ((DisableHoistingToHotterBlocks == UseBFI::All ||		if ((DisableHoistingToHotterBlocks == UseBFI::All ||
(DisableHoistingToHotterBlocks == UseBFI::PGO && HasProfileData)) &&		(DisableHoistingToHotterBlocks == UseBFI::PGO && HasProfileData)) &&
isTgtHotterThanSrc(SrcBlock, Preheader)) {		isTgtHotterThanSrc(SrcBlock, Preheader)) {
++NumNotHoistedDueToHotness;		++NumNotHoistedDueToHotness;
return false;		return HoistResult::NotHoisted;
}		}
// First check whether we should hoist this instruction.		// First check whether we should hoist this instruction.
if (!IsLoopInvariantInst(*MI) || !IsProfitableToHoist(*MI)) {		if (!IsLoopInvariantInst(*MI, CurLoop) ||
		!IsProfitableToHoist(*MI, CurLoop)) {
// If not, try unfolding a hoistable load.		// If not, try unfolding a hoistable load.
MI = ExtractHoistableLoad(MI);		MI = ExtractHoistableLoad(MI, CurLoop);
if (!MI) return false;		if (!MI)
		return HoistResult::NotHoisted;
}		}

// If we have hoisted an instruction that may store, it can only be a constant		// If we have hoisted an instruction that may store, it can only be a constant
// store.		// store.
if (MI->mayStore())		if (MI->mayStore())
NumStoreConst++;		NumStoreConst++;

// Now move the instructions to the predecessor, inserting it before any		// Now move the instructions to the predecessor, inserting it before any
[11 lines not shown]	HoistResult MachineLICMBase::Hoist(MachineInstr *MI,
// initialize the CSE map with potential common expressions.		// initialize the CSE map with potential common expressions.
if (FirstInLoop) {		if (FirstInLoop) {
InitCSEMap(Preheader);		InitCSEMap(Preheader);
FirstInLoop = false;		FirstInLoop = false;
}		}

// Look for opportunity to CSE the hoisted instruction.		// Look for opportunity to CSE the hoisted instruction.
unsigned Opcode = MI->getOpcode();		unsigned Opcode = MI->getOpcode();
		bool HasCSEDone = false;
		for (auto &Map : CSEMap) {
		// Check this CSEMap's preheader dominates MI's basic block.
		if (DT->dominates(Map.first, MI->getParent())) {
DenseMap<unsigned, std::vector<MachineInstr *>>::iterator CI =		DenseMap<unsigned, std::vector<MachineInstr *>>::iterator CI =
CSEMap.find(Opcode);		Map.second.find(Opcode);
if (!EliminateCSE(MI, CI)) {		if (CI != Map.second.end()) {
		if (EliminateCSE(MI, CI)) {
		HasCSEDone = true;
		break;
		}
		}
		}
		}

		if (!HasCSEDone) {
// Otherwise, splice the instruction to the preheader.		// Otherwise, splice the instruction to the preheader.
Preheader->splice(Preheader->getFirstTerminator(),MI->getParent(),MI);		Preheader->splice(Preheader->getFirstTerminator(),MI->getParent(),MI);

// Since we are moving the instruction out of its basic block, we do not		// Since we are moving the instruction out of its basic block, we do not
// retain its debug location. Doing so would degrade the debugging		// retain its debug location. Doing so would degrade the debugging
// experience and adversely affect the accuracy of profiling information.		// experience and adversely affect the accuracy of profiling information.
assert(!MI->isDebugInstr() && "Should not hoist debug inst");		assert(!MI->isDebugInstr() && "Should not hoist debug inst");
MI->setDebugLoc(DebugLoc());		MI->setDebugLoc(DebugLoc());

// Update register pressure for BBs from header to this block.		// Update register pressure for BBs from header to this block.
UpdateBackTraceRegPressure(MI);		UpdateBackTraceRegPressure(MI);

// Clear the kill flags of any register this instruction defines,		// Clear the kill flags of any register this instruction defines,
// since they may need to be live throughout the entire loop		// since they may need to be live throughout the entire loop
// rather than just live for part of it.		// rather than just live for part of it.
for (MachineOperand &MO : MI->all_defs())		for (MachineOperand &MO : MI->all_defs())
if (!MO.isDead())		if (!MO.isDead())
MRI->clearKillFlags(MO.getReg());		MRI->clearKillFlags(MO.getReg());

// Add to the CSE map.		CSEMap[Preheader][Opcode].push_back(MI);
if (CI != CSEMap.end())
CI->second.push_back(MI);
else
CSEMap[Opcode].push_back(MI);
}		}

++NumHoisted;		++NumHoisted;
Changed = true;		Changed = true;

return true;		if (HasCSEDone)
		return HoistResult::CSEd;
		return HoistResult::Hoisted;
}		}
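
The three-way HoistResult returned above appears to address the use-after-free that bkramer reported: once CSE has erased the instruction, the caller must not touch it again. A minimal model of that contract, assuming a `std::list<int>` as a stand-in for a basic block (`Block` and `hoist` here are hypothetical and not LLVM API):

```cpp
#include <cassert>
#include <list>

enum class HoistResult { NotHoisted, Hoisted, CSEd };

// Hypothetical basic block: just an ordered list of integer "instructions".
struct Block {
  std::list<int> Instrs;
};

// Try to hoist the instruction at It from Body into Preheader.
// - NotHoisted: nothing changed, It is still valid and still in Body.
// - Hoisted:    the node was spliced into Preheader, It is still valid.
// - CSEd:       an equal value already existed in Preheader, so the node
//               was erased and It must not be dereferenced any more.
HoistResult hoist(Block &Body, std::list<int>::iterator It, Block &Preheader,
                  bool Invariant) {
  if (!Invariant)
    return HoistResult::NotHoisted;
  for (int Existing : Preheader.Instrs) {
    if (Existing == *It) {
      Body.Instrs.erase(It);
      return HoistResult::CSEd;
    }
  }
  Preheader.Instrs.splice(Preheader.Instrs.end(), Body.Instrs, It);
  return HoistResult::Hoisted;
}
```

A caller that wants to keep using the instruction afterwards, as HoistOutOfLoop does for UpdateRegPressure, checks for `HoistResult::CSEd` first and skips the access in that case.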

/// Get the preheader for the current loop, splitting a critical edge if needed.		/// Get the preheader for the current loop, splitting a critical edge if needed.
MachineBasicBlock *MachineLICMBase::getCurPreheader() {		MachineBasicBlock *
		MachineLICMBase::getCurPreheader(MachineLoop *CurLoop,
		MachineBasicBlock *CurPreheader) {
// Determine the block to which to hoist instructions. If we can't find a		// Determine the block to which to hoist instructions. If we can't find a
// suitable loop predecessor, we can't do any hoisting.		// suitable loop predecessor, we can't do any hoisting.

// If we've tried to get a preheader and failed, don't try again.		// If we've tried to get a preheader and failed, don't try again.
if (CurPreheader == reinterpret_cast<MachineBasicBlock *>(-1))		if (CurPreheader == reinterpret_cast<MachineBasicBlock *>(-1))
return nullptr;		return nullptr;

if (!CurPreheader) {		if (!CurPreheader) {
Show All 35 Lines

llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
	; RUN: llc -mtriple aarch64-none-linux-gnu < %s \| FileCheck %s			; RUN: llc -mtriple aarch64-none-linux-gnu < %s \| FileCheck %s

	define void @foo(i32 noundef %limit, ptr %out, ptr %y) {			define void @foo(i32 noundef %limit, ptr %out, ptr %y) {
				dmgreen: I tend to remove the `Function Attrs` along with `dso_local` and the `local_unnamed_addr #0`
				jaykang10: Sorry, I added this test as an experimental example. Let me tidy up this test.
	; CHECK-LABEL: foo:			; CHECK-LABEL: foo:
	; CHECK: // %bb.0: // %entry			; CHECK: // %bb.0: // %entry
	; CHECK-NEXT: // kill: def $w0 killed $w0 def $x0			; CHECK-NEXT: // kill: def $w0 killed $w0 def $x0
	; CHECK-NEXT: cmp w0, #1			; CHECK-NEXT: cmp w0, #1
	; CHECK-NEXT: b.lt .LBB0_10			; CHECK-NEXT: b.lt .LBB0_10
	; CHECK-NEXT: // %bb.1: // %for.cond1.preheader.us.preheader			; CHECK-NEXT: // %bb.1: // %for.cond1.preheader.us.preheader
	; CHECK-NEXT: mov w10, w0			; CHECK-NEXT: mov w10, w0
	; CHECK-NEXT: ubfiz x11, x0, #2, #32			; CHECK-NEXT: ubfiz x11, x0, #2, #32
	Show All 17 Lines
	; CHECK-NEXT: ldrsh w15, [x2, x9, lsl #1]			; CHECK-NEXT: ldrsh w15, [x2, x9, lsl #1]
	; CHECK-NEXT: cmp w0, #16			; CHECK-NEXT: cmp w0, #16
	; CHECK-NEXT: b.hs .LBB0_5			; CHECK-NEXT: b.hs .LBB0_5
	; CHECK-NEXT: // %bb.4: // in Loop: Header=BB0_3 Depth=1			; CHECK-NEXT: // %bb.4: // in Loop: Header=BB0_3 Depth=1
	; CHECK-NEXT: mov x18, xzr			; CHECK-NEXT: mov x18, xzr
	; CHECK-NEXT: b .LBB0_8			; CHECK-NEXT: b .LBB0_8
	; CHECK-NEXT: .LBB0_5: // %vector.ph			; CHECK-NEXT: .LBB0_5: // %vector.ph
	; CHECK-NEXT: // in Loop: Header=BB0_3 Depth=1			; CHECK-NEXT: // in Loop: Header=BB0_3 Depth=1
				; CHECK-NEXT: dup v0.8h, w15
	; CHECK-NEXT: mov x16, x14			; CHECK-NEXT: mov x16, x14
	; CHECK-NEXT: mov x17, x13			; CHECK-NEXT: mov x17, x13
	; CHECK-NEXT: mov x18, x12			; CHECK-NEXT: mov x18, x12
	; CHECK-NEXT: .LBB0_6: // %vector.body			; CHECK-NEXT: .LBB0_6: // %vector.body
	; CHECK-NEXT: // Parent Loop BB0_3 Depth=1			; CHECK-NEXT: // Parent Loop BB0_3 Depth=1
	; CHECK-NEXT: // => This Inner Loop Header: Depth=2			; CHECK-NEXT: // => This Inner Loop Header: Depth=2
	; CHECK-NEXT: dup v0.8h, w15
	; CHECK-NEXT: ldp q1, q4, [x16, #-16]			; CHECK-NEXT: ldp q1, q4, [x16, #-16]
	; CHECK-NEXT: ldp q3, q2, [x17, #-32]
	; CHECK-NEXT: subs x18, x18, #16			; CHECK-NEXT: subs x18, x18, #16
	; CHECK-NEXT: ldp q6, q5, [x17]			; CHECK-NEXT: ldp q3, q2, [x17, #-32]
	; CHECK-NEXT: add x16, x16, #32			; CHECK-NEXT: add x16, x16, #32
				; CHECK-NEXT: ldp q6, q5, [x17]
	; CHECK-NEXT: smlal2 v2.4s, v0.8h, v1.8h			; CHECK-NEXT: smlal2 v2.4s, v0.8h, v1.8h
	; CHECK-NEXT: smlal v3.4s, v0.4h, v1.4h			; CHECK-NEXT: smlal v3.4s, v0.4h, v1.4h
	; CHECK-NEXT: smlal2 v5.4s, v0.8h, v4.8h			; CHECK-NEXT: smlal2 v5.4s, v0.8h, v4.8h
	; CHECK-NEXT: smlal v6.4s, v0.4h, v4.4h			; CHECK-NEXT: smlal v6.4s, v0.4h, v4.4h
	; CHECK-NEXT: stp q3, q2, [x17, #-32]			; CHECK-NEXT: stp q3, q2, [x17, #-32]
	; CHECK-NEXT: stp q6, q5, [x17], #64			; CHECK-NEXT: stp q6, q5, [x17], #64
	; CHECK-NEXT: b.ne .LBB0_6			; CHECK-NEXT: b.ne .LBB0_6
	; CHECK-NEXT: // %bb.7: // %middle.block			; CHECK-NEXT: // %bb.7: // %middle.block
	▲ Show 20 Lines • Show All 93 Lines • ▼ Show 20 Lines

	for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us, %middle.block			for.cond1.for.cond.cleanup3_crit_edge.us: ; preds = %for.body4.us, %middle.block
	%indvars.iv.next31 = add nuw nsw i64 %indvars.iv30, 1			%indvars.iv.next31 = add nuw nsw i64 %indvars.iv30, 1
	%exitcond35.not = icmp eq i64 %indvars.iv.next31, %wide.trip.count34			%exitcond35.not = icmp eq i64 %indvars.iv.next31, %wide.trip.count34
	br i1 %exitcond35.not, label %for.cond.cleanup, label %for.cond1.preheader.us			br i1 %exitcond35.not, label %for.cond.cleanup, label %for.cond1.preheader.us

	for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %entry			for.cond.cleanup: ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %entry
	ret void			ret void
	}			}
				dmgreen: And can you remove as much of this as you can.
				jaykang10: ditto
				dmgreen: I don't believe these are used for anything.
				jaykang10: ditto

llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll

	Show First 20 Lines • Show All 551 Lines • ▼ Show 20 Lines
	; GFX908-NEXT: s_or_b32 s10, s10, 28			; GFX908-NEXT: s_or_b32 s10, s10, 28
	; GFX908-NEXT: s_waitcnt vmcnt(0)			; GFX908-NEXT: s_waitcnt vmcnt(0)
	; GFX908-NEXT: v_readfirstlane_b32 s5, v16			; GFX908-NEXT: v_readfirstlane_b32 s5, v16
	; GFX908-NEXT: s_and_b32 s5, 0xffff, s5			; GFX908-NEXT: s_and_b32 s5, 0xffff, s5
	; GFX908-NEXT: s_mul_i32 s1, s1, s5			; GFX908-NEXT: s_mul_i32 s1, s1, s5
	; GFX908-NEXT: s_mul_hi_u32 s9, s0, s5			; GFX908-NEXT: s_mul_hi_u32 s9, s0, s5
	; GFX908-NEXT: s_mul_i32 s0, s0, s5			; GFX908-NEXT: s_mul_i32 s0, s0, s5
	; GFX908-NEXT: s_add_i32 s1, s9, s1			; GFX908-NEXT: s_add_i32 s1, s9, s1
	; GFX908-NEXT: s_lshl_b64 s[0:1], s[0:1], 5			; GFX908-NEXT: s_lshl_b64 s[14:15], s[0:1], 5
	; GFX908-NEXT: s_branch .LBB3_2			; GFX908-NEXT: s_branch .LBB3_2
	; GFX908-NEXT: .LBB3_1: ; %Flow20			; GFX908-NEXT: .LBB3_1: ; %Flow20
	; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1			; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1
	; GFX908-NEXT: s_andn2_b64 vcc, exec, s[14:15]			; GFX908-NEXT: s_andn2_b64 vcc, exec, s[0:1]
	; GFX908-NEXT: s_cbranch_vccz .LBB3_12			; GFX908-NEXT: s_cbranch_vccz .LBB3_12
	; GFX908-NEXT: .LBB3_2: ; %bb9			; GFX908-NEXT: .LBB3_2: ; %bb9
	; GFX908-NEXT: ; =>This Loop Header: Depth=1			; GFX908-NEXT: ; =>This Loop Header: Depth=1
	; GFX908-NEXT: ; Child Loop BB3_5 Depth 2			; GFX908-NEXT: ; Child Loop BB3_5 Depth 2
	; GFX908-NEXT: s_mov_b64 s[16:17], -1			; GFX908-NEXT: s_mov_b64 s[16:17], -1
	; GFX908-NEXT: s_cbranch_scc0 .LBB3_10			; GFX908-NEXT: s_cbranch_scc0 .LBB3_10
	; GFX908-NEXT: ; %bb.3: ; %bb14			; GFX908-NEXT: ; %bb.3: ; %bb14
	; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1			; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1
	; GFX908-NEXT: global_load_dwordx2 v[2:3], v[0:1], off			; GFX908-NEXT: global_load_dwordx2 v[2:3], v[0:1], off
				; GFX908-NEXT: v_cmp_gt_i64_e64 s[0:1], s[6:7], -1
	; GFX908-NEXT: s_mov_b32 s9, s8			; GFX908-NEXT: s_mov_b32 s9, s8
				; GFX908-NEXT: v_cndmask_b32_e64 v6, 0, 1, s[0:1]
	; GFX908-NEXT: v_mov_b32_e32 v4, s8			; GFX908-NEXT: v_mov_b32_e32 v4, s8
				; GFX908-NEXT: v_cmp_ne_u32_e64 s[0:1], 1, v6
	; GFX908-NEXT: v_mov_b32_e32 v8, s8			; GFX908-NEXT: v_mov_b32_e32 v8, s8
	; GFX908-NEXT: v_mov_b32_e32 v6, s8			; GFX908-NEXT: v_mov_b32_e32 v6, s8
	; GFX908-NEXT: v_mov_b32_e32 v5, s9			; GFX908-NEXT: v_mov_b32_e32 v5, s9
	; GFX908-NEXT: v_mov_b32_e32 v9, s9			; GFX908-NEXT: v_mov_b32_e32 v9, s9
	; GFX908-NEXT: v_mov_b32_e32 v7, s9			; GFX908-NEXT: v_mov_b32_e32 v7, s9
	; GFX908-NEXT: v_cmp_lt_i64_e64 s[14:15], s[6:7], 0			; GFX908-NEXT: v_cmp_lt_i64_e64 s[16:17], s[6:7], 0
	; GFX908-NEXT: v_cmp_gt_i64_e64 s[16:17], s[6:7], -1
	; GFX908-NEXT: v_mov_b32_e32 v11, v5			; GFX908-NEXT: v_mov_b32_e32 v11, v5
	; GFX908-NEXT: s_mov_b64 s[20:21], s[10:11]			; GFX908-NEXT: s_mov_b64 s[20:21], s[10:11]
	; GFX908-NEXT: v_mov_b32_e32 v10, v4			; GFX908-NEXT: v_mov_b32_e32 v10, v4
	; GFX908-NEXT: s_waitcnt vmcnt(0)			; GFX908-NEXT: s_waitcnt vmcnt(0)
	; GFX908-NEXT: v_readfirstlane_b32 s5, v2			; GFX908-NEXT: v_readfirstlane_b32 s5, v2
	; GFX908-NEXT: v_readfirstlane_b32 s9, v3			; GFX908-NEXT: v_readfirstlane_b32 s9, v3
	; GFX908-NEXT: s_add_u32 s5, s5, 1			; GFX908-NEXT: s_add_u32 s5, s5, 1
	; GFX908-NEXT: s_addc_u32 s9, s9, 0			; GFX908-NEXT: s_addc_u32 s9, s9, 0
	; GFX908-NEXT: s_mul_hi_u32 s19, s2, s5			; GFX908-NEXT: s_mul_hi_u32 s19, s2, s5
	; GFX908-NEXT: s_mul_i32 s22, s3, s5			; GFX908-NEXT: s_mul_i32 s22, s3, s5
	; GFX908-NEXT: s_mul_i32 s18, s2, s5			; GFX908-NEXT: s_mul_i32 s18, s2, s5
	; GFX908-NEXT: s_mul_i32 s5, s2, s9			; GFX908-NEXT: s_mul_i32 s5, s2, s9
	; GFX908-NEXT: s_add_i32 s5, s19, s5			; GFX908-NEXT: s_add_i32 s5, s19, s5
	; GFX908-NEXT: s_add_i32 s5, s5, s22			; GFX908-NEXT: s_add_i32 s5, s5, s22
	; GFX908-NEXT: s_branch .LBB3_5			; GFX908-NEXT: s_branch .LBB3_5
	; GFX908-NEXT: .LBB3_4: ; %bb58			; GFX908-NEXT: .LBB3_4: ; %bb58
	; GFX908-NEXT: ; in Loop: Header=BB3_5 Depth=2			; GFX908-NEXT: ; in Loop: Header=BB3_5 Depth=2
	; GFX908-NEXT: v_add_co_u32_sdwa v2, vcc, v2, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0			; GFX908-NEXT: v_add_co_u32_sdwa v2, vcc, v2, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
	; GFX908-NEXT: v_addc_co_u32_e32 v3, vcc, 0, v3, vcc			; GFX908-NEXT: v_addc_co_u32_e32 v3, vcc, 0, v3, vcc
	; GFX908-NEXT: s_add_u32 s20, s20, s0			; GFX908-NEXT: s_add_u32 s20, s20, s14
	; GFX908-NEXT: v_cmp_lt_i64_e64 s[24:25], -1, v[2:3]			; GFX908-NEXT: v_cmp_lt_i64_e64 s[24:25], -1, v[2:3]
	; GFX908-NEXT: s_addc_u32 s21, s21, s1			; GFX908-NEXT: s_addc_u32 s21, s21, s15
	; GFX908-NEXT: s_mov_b64 s[22:23], 0			; GFX908-NEXT: s_mov_b64 s[22:23], 0
	; GFX908-NEXT: s_andn2_b64 vcc, exec, s[24:25]			; GFX908-NEXT: s_andn2_b64 vcc, exec, s[24:25]
	; GFX908-NEXT: s_cbranch_vccz .LBB3_9			; GFX908-NEXT: s_cbranch_vccz .LBB3_9
	; GFX908-NEXT: .LBB3_5: ; %bb16			; GFX908-NEXT: .LBB3_5: ; %bb16
	; GFX908-NEXT: ; Parent Loop BB3_2 Depth=1			; GFX908-NEXT: ; Parent Loop BB3_2 Depth=1
	; GFX908-NEXT: ; => This Inner Loop Header: Depth=2			; GFX908-NEXT: ; => This Inner Loop Header: Depth=2
	; GFX908-NEXT: s_add_u32 s22, s20, s18			; GFX908-NEXT: s_add_u32 s22, s20, s18
	; GFX908-NEXT: s_addc_u32 s23, s21, s5			; GFX908-NEXT: s_addc_u32 s23, s21, s5
	; GFX908-NEXT: global_load_dword v21, v19, s[22:23] offset:-12 glc			; GFX908-NEXT: global_load_dword v21, v19, s[22:23] offset:-12 glc
	; GFX908-NEXT: s_waitcnt vmcnt(0)			; GFX908-NEXT: s_waitcnt vmcnt(0)
	; GFX908-NEXT: global_load_dword v20, v19, s[22:23] offset:-8 glc			; GFX908-NEXT: global_load_dword v20, v19, s[22:23] offset:-8 glc
	; GFX908-NEXT: s_waitcnt vmcnt(0)			; GFX908-NEXT: s_waitcnt vmcnt(0)
	; GFX908-NEXT: global_load_dword v12, v19, s[22:23] offset:-4 glc			; GFX908-NEXT: global_load_dword v12, v19, s[22:23] offset:-4 glc
	; GFX908-NEXT: s_waitcnt vmcnt(0)			; GFX908-NEXT: s_waitcnt vmcnt(0)
	; GFX908-NEXT: global_load_dword v12, v19, s[22:23] glc			; GFX908-NEXT: global_load_dword v12, v19, s[22:23] glc
	; GFX908-NEXT: s_waitcnt vmcnt(0)			; GFX908-NEXT: s_waitcnt vmcnt(0)
	; GFX908-NEXT: ds_read_b64 v[12:13], v19			; GFX908-NEXT: ds_read_b64 v[12:13], v19
	; GFX908-NEXT: ds_read_b64 v[14:15], v0			; GFX908-NEXT: ds_read_b64 v[14:15], v0
	; GFX908-NEXT: s_andn2_b64 vcc, exec, s[16:17]			; GFX908-NEXT: s_and_b64 vcc, exec, s[0:1]
	; GFX908-NEXT: s_waitcnt lgkmcnt(0)			; GFX908-NEXT: s_waitcnt lgkmcnt(0)
	; GFX908-NEXT: s_cbranch_vccnz .LBB3_7			; GFX908-NEXT: s_cbranch_vccnz .LBB3_7
	; GFX908-NEXT: ; %bb.6: ; %bb51			; GFX908-NEXT: ; %bb.6: ; %bb51
	; GFX908-NEXT: ; in Loop: Header=BB3_5 Depth=2			; GFX908-NEXT: ; in Loop: Header=BB3_5 Depth=2
	; GFX908-NEXT: v_cvt_f32_f16_sdwa v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX908-NEXT: v_cvt_f32_f16_sdwa v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX908-NEXT: v_cvt_f32_f16_e32 v21, v21			; GFX908-NEXT: v_cvt_f32_f16_e32 v21, v21
	; GFX908-NEXT: v_cvt_f32_f16_sdwa v23, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX908-NEXT: v_cvt_f32_f16_sdwa v23, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX908-NEXT: v_cvt_f32_f16_e32 v20, v20			; GFX908-NEXT: v_cvt_f32_f16_e32 v20, v20
	Show All 11 Lines
	; GFX908-NEXT: v_add_f32_e32 v8, v8, v26			; GFX908-NEXT: v_add_f32_e32 v8, v8, v26
	; GFX908-NEXT: v_add_f32_e32 v6, v6, v14			; GFX908-NEXT: v_add_f32_e32 v6, v6, v14
	; GFX908-NEXT: v_add_f32_e32 v7, v7, v15			; GFX908-NEXT: v_add_f32_e32 v7, v7, v15
	; GFX908-NEXT: v_add_f32_e32 v10, v10, v12			; GFX908-NEXT: v_add_f32_e32 v10, v10, v12
	; GFX908-NEXT: v_add_f32_e32 v11, v11, v13			; GFX908-NEXT: v_add_f32_e32 v11, v11, v13
	; GFX908-NEXT: s_mov_b64 s[22:23], -1			; GFX908-NEXT: s_mov_b64 s[22:23], -1
	; GFX908-NEXT: s_branch .LBB3_4			; GFX908-NEXT: s_branch .LBB3_4
	; GFX908-NEXT: .LBB3_7: ; in Loop: Header=BB3_5 Depth=2			; GFX908-NEXT: .LBB3_7: ; in Loop: Header=BB3_5 Depth=2
	; GFX908-NEXT: s_mov_b64 s[22:23], s[14:15]			; GFX908-NEXT: s_mov_b64 s[22:23], s[16:17]
	; GFX908-NEXT: s_andn2_b64 vcc, exec, s[22:23]			; GFX908-NEXT: s_andn2_b64 vcc, exec, s[22:23]
	; GFX908-NEXT: s_cbranch_vccz .LBB3_4			; GFX908-NEXT: s_cbranch_vccz .LBB3_4
	; GFX908-NEXT: ; %bb.8: ; in Loop: Header=BB3_2 Depth=1			; GFX908-NEXT: ; %bb.8: ; in Loop: Header=BB3_2 Depth=1
	; GFX908-NEXT: ; implicit-def: $vgpr10_vgpr11			; GFX908-NEXT: ; implicit-def: $vgpr10_vgpr11
	; GFX908-NEXT: ; implicit-def: $vgpr6_vgpr7			; GFX908-NEXT: ; implicit-def: $vgpr6_vgpr7
	; GFX908-NEXT: ; implicit-def: $vgpr8_vgpr9			; GFX908-NEXT: ; implicit-def: $vgpr8_vgpr9
	; GFX908-NEXT: ; implicit-def: $vgpr4_vgpr5			; GFX908-NEXT: ; implicit-def: $vgpr4_vgpr5
	; GFX908-NEXT: ; implicit-def: $vgpr2_vgpr3			; GFX908-NEXT: ; implicit-def: $vgpr2_vgpr3
	; GFX908-NEXT: ; implicit-def: $sgpr20_sgpr21			; GFX908-NEXT: ; implicit-def: $sgpr20_sgpr21
	; GFX908-NEXT: .LBB3_9: ; %loop.exit.guard			; GFX908-NEXT: .LBB3_9: ; %loop.exit.guard
	; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1			; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1
	; GFX908-NEXT: s_xor_b64 s[16:17], s[22:23], -1			; GFX908-NEXT: s_xor_b64 s[16:17], s[22:23], -1
	; GFX908-NEXT: .LBB3_10: ; %Flow19			; GFX908-NEXT: .LBB3_10: ; %Flow19
	; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1			; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1
	; GFX908-NEXT: s_mov_b64 s[14:15], -1			; GFX908-NEXT: s_mov_b64 s[0:1], -1
	; GFX908-NEXT: s_and_b64 vcc, exec, s[16:17]			; GFX908-NEXT: s_and_b64 vcc, exec, s[16:17]
	; GFX908-NEXT: s_cbranch_vccz .LBB3_1			; GFX908-NEXT: s_cbranch_vccz .LBB3_1
	; GFX908-NEXT: ; %bb.11: ; %bb12			; GFX908-NEXT: ; %bb.11: ; %bb12
	; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1			; GFX908-NEXT: ; in Loop: Header=BB3_2 Depth=1
	; GFX908-NEXT: s_add_u32 s6, s6, s4			; GFX908-NEXT: s_add_u32 s6, s6, s4
	; GFX908-NEXT: s_addc_u32 s7, s7, 0			; GFX908-NEXT: s_addc_u32 s7, s7, 0
	; GFX908-NEXT: s_add_u32 s10, s10, s12			; GFX908-NEXT: s_add_u32 s10, s10, s12
	; GFX908-NEXT: s_addc_u32 s11, s11, s13			; GFX908-NEXT: s_addc_u32 s11, s11, s13
	; GFX908-NEXT: s_mov_b64 s[14:15], 0			; GFX908-NEXT: s_mov_b64 s[0:1], 0
	; GFX908-NEXT: s_branch .LBB3_1			; GFX908-NEXT: s_branch .LBB3_1
	; GFX908-NEXT: .LBB3_12: ; %DummyReturnBlock			; GFX908-NEXT: .LBB3_12: ; %DummyReturnBlock
	; GFX908-NEXT: s_endpgm			; GFX908-NEXT: s_endpgm
	;			;
	; GFX90A-LABEL: introduced_copy_to_sgpr:			; GFX90A-LABEL: introduced_copy_to_sgpr:
	; GFX90A: ; %bb.0: ; %bb			; GFX90A: ; %bb.0: ; %bb
	; GFX90A-NEXT: global_load_ushort v18, v[0:1], off glc			; GFX90A-NEXT: global_load_ushort v18, v[0:1], off glc
	; GFX90A-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; GFX90A-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	Show All 33 Lines
	; GFX90A-NEXT: s_or_b32 s10, s10, 28			; GFX90A-NEXT: s_or_b32 s10, s10, 28
	; GFX90A-NEXT: s_waitcnt vmcnt(0)			; GFX90A-NEXT: s_waitcnt vmcnt(0)
	; GFX90A-NEXT: v_readfirstlane_b32 s5, v18			; GFX90A-NEXT: v_readfirstlane_b32 s5, v18
	; GFX90A-NEXT: s_and_b32 s5, 0xffff, s5			; GFX90A-NEXT: s_and_b32 s5, 0xffff, s5
	; GFX90A-NEXT: s_mul_i32 s1, s1, s5			; GFX90A-NEXT: s_mul_i32 s1, s1, s5
	; GFX90A-NEXT: s_mul_hi_u32 s9, s0, s5			; GFX90A-NEXT: s_mul_hi_u32 s9, s0, s5
	; GFX90A-NEXT: s_mul_i32 s0, s0, s5			; GFX90A-NEXT: s_mul_i32 s0, s0, s5
	; GFX90A-NEXT: s_add_i32 s1, s9, s1			; GFX90A-NEXT: s_add_i32 s1, s9, s1
	; GFX90A-NEXT: s_lshl_b64 s[0:1], s[0:1], 5			; GFX90A-NEXT: s_lshl_b64 s[14:15], s[0:1], 5
	; GFX90A-NEXT: s_branch .LBB3_2			; GFX90A-NEXT: s_branch .LBB3_2
	; GFX90A-NEXT: .LBB3_1: ; %Flow20			; GFX90A-NEXT: .LBB3_1: ; %Flow20
	; GFX90A-NEXT: ; in Loop: Header=BB3_2 Depth=1			; GFX90A-NEXT: ; in Loop: Header=BB3_2 Depth=1
	; GFX90A-NEXT: s_andn2_b64 vcc, exec, s[14:15]			; GFX90A-NEXT: s_andn2_b64 vcc, exec, s[0:1]
	; GFX90A-NEXT: s_cbranch_vccz .LBB3_12			; GFX90A-NEXT: s_cbranch_vccz .LBB3_12
	; GFX90A-NEXT: .LBB3_2: ; %bb9			; GFX90A-NEXT: .LBB3_2: ; %bb9
	; GFX90A-NEXT: ; =>This Loop Header: Depth=1			; GFX90A-NEXT: ; =>This Loop Header: Depth=1
	; GFX90A-NEXT: ; Child Loop BB3_5 Depth 2			; GFX90A-NEXT: ; Child Loop BB3_5 Depth 2
	; GFX90A-NEXT: s_mov_b64 s[16:17], -1			; GFX90A-NEXT: s_mov_b64 s[16:17], -1
	; GFX90A-NEXT: s_cbranch_scc0 .LBB3_10			; GFX90A-NEXT: s_cbranch_scc0 .LBB3_10
	; GFX90A-NEXT: ; %bb.3: ; %bb14			; GFX90A-NEXT: ; %bb.3: ; %bb14
	; GFX90A-NEXT: ; in Loop: Header=BB3_2 Depth=1			; GFX90A-NEXT: ; in Loop: Header=BB3_2 Depth=1
	; GFX90A-NEXT: global_load_dwordx2 v[4:5], v[2:3], off			; GFX90A-NEXT: global_load_dwordx2 v[4:5], v[2:3], off
				; GFX90A-NEXT: v_cmp_gt_i64_e64 s[0:1], s[6:7], -1
	; GFX90A-NEXT: s_mov_b32 s9, s8			; GFX90A-NEXT: s_mov_b32 s9, s8
				; GFX90A-NEXT: v_cndmask_b32_e64 v8, 0, 1, s[0:1]
	; GFX90A-NEXT: v_pk_mov_b32 v[6:7], s[8:9], s[8:9] op_sel:[0,1]			; GFX90A-NEXT: v_pk_mov_b32 v[6:7], s[8:9], s[8:9] op_sel:[0,1]
				; GFX90A-NEXT: v_cmp_ne_u32_e64 s[0:1], 1, v8
	; GFX90A-NEXT: v_pk_mov_b32 v[10:11], s[8:9], s[8:9] op_sel:[0,1]			; GFX90A-NEXT: v_pk_mov_b32 v[10:11], s[8:9], s[8:9] op_sel:[0,1]
	; GFX90A-NEXT: v_pk_mov_b32 v[8:9], s[8:9], s[8:9] op_sel:[0,1]			; GFX90A-NEXT: v_pk_mov_b32 v[8:9], s[8:9], s[8:9] op_sel:[0,1]
	; GFX90A-NEXT: v_cmp_lt_i64_e64 s[14:15], s[6:7], 0			; GFX90A-NEXT: v_cmp_lt_i64_e64 s[16:17], s[6:7], 0
	; GFX90A-NEXT: v_cmp_gt_i64_e64 s[16:17], s[6:7], -1
	; GFX90A-NEXT: s_mov_b64 s[20:21], s[10:11]			; GFX90A-NEXT: s_mov_b64 s[20:21], s[10:11]
	; GFX90A-NEXT: v_pk_mov_b32 v[12:13], v[6:7], v[6:7] op_sel:[0,1]			; GFX90A-NEXT: v_pk_mov_b32 v[12:13], v[6:7], v[6:7] op_sel:[0,1]
	; GFX90A-NEXT: s_waitcnt vmcnt(0)			; GFX90A-NEXT: s_waitcnt vmcnt(0)
	; GFX90A-NEXT: v_readfirstlane_b32 s5, v4			; GFX90A-NEXT: v_readfirstlane_b32 s5, v4
	; GFX90A-NEXT: v_readfirstlane_b32 s9, v5			; GFX90A-NEXT: v_readfirstlane_b32 s9, v5
	; GFX90A-NEXT: s_add_u32 s5, s5, 1			; GFX90A-NEXT: s_add_u32 s5, s5, 1
	; GFX90A-NEXT: s_addc_u32 s9, s9, 0			; GFX90A-NEXT: s_addc_u32 s9, s9, 0
	; GFX90A-NEXT: s_mul_hi_u32 s19, s2, s5			; GFX90A-NEXT: s_mul_hi_u32 s19, s2, s5
	; GFX90A-NEXT: s_mul_i32 s22, s3, s5			; GFX90A-NEXT: s_mul_i32 s22, s3, s5
	; GFX90A-NEXT: s_mul_i32 s18, s2, s5			; GFX90A-NEXT: s_mul_i32 s18, s2, s5
	; GFX90A-NEXT: s_mul_i32 s5, s2, s9			; GFX90A-NEXT: s_mul_i32 s5, s2, s9
	; GFX90A-NEXT: s_add_i32 s5, s19, s5			; GFX90A-NEXT: s_add_i32 s5, s19, s5
	; GFX90A-NEXT: s_add_i32 s5, s5, s22			; GFX90A-NEXT: s_add_i32 s5, s5, s22
	; GFX90A-NEXT: s_branch .LBB3_5			; GFX90A-NEXT: s_branch .LBB3_5
	; GFX90A-NEXT: .LBB3_4: ; %bb58			; GFX90A-NEXT: .LBB3_4: ; %bb58
	; GFX90A-NEXT: ; in Loop: Header=BB3_5 Depth=2			; GFX90A-NEXT: ; in Loop: Header=BB3_5 Depth=2
	; GFX90A-NEXT: v_add_co_u32_sdwa v4, vcc, v4, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0			; GFX90A-NEXT: v_add_co_u32_sdwa v4, vcc, v4, v18 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
	; GFX90A-NEXT: v_addc_co_u32_e32 v5, vcc, 0, v5, vcc			; GFX90A-NEXT: v_addc_co_u32_e32 v5, vcc, 0, v5, vcc
	; GFX90A-NEXT: s_add_u32 s20, s20, s0			; GFX90A-NEXT: s_add_u32 s20, s20, s14
	; GFX90A-NEXT: s_addc_u32 s21, s21, s1			; GFX90A-NEXT: s_addc_u32 s21, s21, s15
	; GFX90A-NEXT: v_cmp_lt_i64_e64 s[24:25], -1, v[4:5]			; GFX90A-NEXT: v_cmp_lt_i64_e64 s[24:25], -1, v[4:5]
	; GFX90A-NEXT: s_mov_b64 s[22:23], 0			; GFX90A-NEXT: s_mov_b64 s[22:23], 0
	; GFX90A-NEXT: s_andn2_b64 vcc, exec, s[24:25]			; GFX90A-NEXT: s_andn2_b64 vcc, exec, s[24:25]
	; GFX90A-NEXT: s_cbranch_vccz .LBB3_9			; GFX90A-NEXT: s_cbranch_vccz .LBB3_9
	; GFX90A-NEXT: .LBB3_5: ; %bb16			; GFX90A-NEXT: .LBB3_5: ; %bb16
	; GFX90A-NEXT: ; Parent Loop BB3_2 Depth=1			; GFX90A-NEXT: ; Parent Loop BB3_2 Depth=1
	; GFX90A-NEXT: ; => This Inner Loop Header: Depth=2			; GFX90A-NEXT: ; => This Inner Loop Header: Depth=2
	; GFX90A-NEXT: s_add_u32 s22, s20, s18			; GFX90A-NEXT: s_add_u32 s22, s20, s18
	; GFX90A-NEXT: s_addc_u32 s23, s21, s5			; GFX90A-NEXT: s_addc_u32 s23, s21, s5
	; GFX90A-NEXT: global_load_dword v21, v19, s[22:23] offset:-12 glc			; GFX90A-NEXT: global_load_dword v21, v19, s[22:23] offset:-12 glc
	; GFX90A-NEXT: s_waitcnt vmcnt(0)			; GFX90A-NEXT: s_waitcnt vmcnt(0)
	; GFX90A-NEXT: global_load_dword v20, v19, s[22:23] offset:-8 glc			; GFX90A-NEXT: global_load_dword v20, v19, s[22:23] offset:-8 glc
	; GFX90A-NEXT: s_waitcnt vmcnt(0)			; GFX90A-NEXT: s_waitcnt vmcnt(0)
	; GFX90A-NEXT: global_load_dword v14, v19, s[22:23] offset:-4 glc			; GFX90A-NEXT: global_load_dword v14, v19, s[22:23] offset:-4 glc
	; GFX90A-NEXT: s_waitcnt vmcnt(0)			; GFX90A-NEXT: s_waitcnt vmcnt(0)
	; GFX90A-NEXT: global_load_dword v14, v19, s[22:23] glc			; GFX90A-NEXT: global_load_dword v14, v19, s[22:23] glc
	; GFX90A-NEXT: s_waitcnt vmcnt(0)			; GFX90A-NEXT: s_waitcnt vmcnt(0)
	; GFX90A-NEXT: ds_read_b64 v[14:15], v19			; GFX90A-NEXT: ds_read_b64 v[14:15], v19
	; GFX90A-NEXT: ds_read_b64 v[16:17], v0			; GFX90A-NEXT: ds_read_b64 v[16:17], v0
	; GFX90A-NEXT: s_andn2_b64 vcc, exec, s[16:17]			; GFX90A-NEXT: s_and_b64 vcc, exec, s[0:1]
	; GFX90A-NEXT: ; kill: killed $sgpr22 killed $sgpr23			; GFX90A-NEXT: ; kill: killed $sgpr22 killed $sgpr23
	; GFX90A-NEXT: s_waitcnt lgkmcnt(0)			; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
	; GFX90A-NEXT: s_cbranch_vccnz .LBB3_7			; GFX90A-NEXT: s_cbranch_vccnz .LBB3_7
	; GFX90A-NEXT: ; %bb.6: ; %bb51			; GFX90A-NEXT: ; %bb.6: ; %bb51
	; GFX90A-NEXT: ; in Loop: Header=BB3_5 Depth=2			; GFX90A-NEXT: ; in Loop: Header=BB3_5 Depth=2
	; GFX90A-NEXT: v_cvt_f32_f16_sdwa v23, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX90A-NEXT: v_cvt_f32_f16_sdwa v23, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX90A-NEXT: v_cvt_f32_f16_e32 v22, v21			; GFX90A-NEXT: v_cvt_f32_f16_e32 v22, v21
	; GFX90A-NEXT: v_cvt_f32_f16_sdwa v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX90A-NEXT: v_cvt_f32_f16_sdwa v21, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX90A-NEXT: v_cvt_f32_f16_e32 v20, v20			; GFX90A-NEXT: v_cvt_f32_f16_e32 v20, v20
	; GFX90A-NEXT: v_pk_add_f32 v[24:25], v[0:1], v[14:15]			; GFX90A-NEXT: v_pk_add_f32 v[24:25], v[0:1], v[14:15]
	; GFX90A-NEXT: v_pk_add_f32 v[26:27], v[14:15], 0 op_sel_hi:[1,0]			; GFX90A-NEXT: v_pk_add_f32 v[26:27], v[14:15], 0 op_sel_hi:[1,0]
	; GFX90A-NEXT: v_pk_add_f32 v[16:17], v[22:23], v[16:17]			; GFX90A-NEXT: v_pk_add_f32 v[16:17], v[22:23], v[16:17]
	; GFX90A-NEXT: v_pk_add_f32 v[14:15], v[20:21], v[14:15]			; GFX90A-NEXT: v_pk_add_f32 v[14:15], v[20:21], v[14:15]
	; GFX90A-NEXT: v_pk_add_f32 v[6:7], v[6:7], v[24:25]			; GFX90A-NEXT: v_pk_add_f32 v[6:7], v[6:7], v[24:25]
	; GFX90A-NEXT: v_pk_add_f32 v[10:11], v[10:11], v[26:27]			; GFX90A-NEXT: v_pk_add_f32 v[10:11], v[10:11], v[26:27]
	; GFX90A-NEXT: v_pk_add_f32 v[8:9], v[8:9], v[16:17]			; GFX90A-NEXT: v_pk_add_f32 v[8:9], v[8:9], v[16:17]
	; GFX90A-NEXT: v_pk_add_f32 v[12:13], v[12:13], v[14:15]			; GFX90A-NEXT: v_pk_add_f32 v[12:13], v[12:13], v[14:15]
	; GFX90A-NEXT: s_mov_b64 s[22:23], -1			; GFX90A-NEXT: s_mov_b64 s[22:23], -1
	; GFX90A-NEXT: s_branch .LBB3_4			; GFX90A-NEXT: s_branch .LBB3_4
	; GFX90A-NEXT: .LBB3_7: ; in Loop: Header=BB3_5 Depth=2			; GFX90A-NEXT: .LBB3_7: ; in Loop: Header=BB3_5 Depth=2
	; GFX90A-NEXT: s_mov_b64 s[22:23], s[14:15]			; GFX90A-NEXT: s_mov_b64 s[22:23], s[16:17]
	; GFX90A-NEXT: s_andn2_b64 vcc, exec, s[22:23]			; GFX90A-NEXT: s_andn2_b64 vcc, exec, s[22:23]
	; GFX90A-NEXT: s_cbranch_vccz .LBB3_4			; GFX90A-NEXT: s_cbranch_vccz .LBB3_4
	; GFX90A-NEXT: ; %bb.8: ; in Loop: Header=BB3_2 Depth=1			; GFX90A-NEXT: ; %bb.8: ; in Loop: Header=BB3_2 Depth=1
	; GFX90A-NEXT: ; implicit-def: $vgpr12_vgpr13			; GFX90A-NEXT: ; implicit-def: $vgpr12_vgpr13
	; GFX90A-NEXT: ; implicit-def: $vgpr8_vgpr9			; GFX90A-NEXT: ; implicit-def: $vgpr8_vgpr9
	; GFX90A-NEXT: ; implicit-def: $vgpr10_vgpr11			; GFX90A-NEXT: ; implicit-def: $vgpr10_vgpr11
	; GFX90A-NEXT: ; implicit-def: $vgpr6_vgpr7			; GFX90A-NEXT: ; implicit-def: $vgpr6_vgpr7
	; GFX90A-NEXT: ; implicit-def: $vgpr4_vgpr5			; GFX90A-NEXT: ; implicit-def: $vgpr4_vgpr5
	; GFX90A-NEXT: ; implicit-def: $sgpr20_sgpr21			; GFX90A-NEXT: ; implicit-def: $sgpr20_sgpr21
	; GFX90A-NEXT: .LBB3_9: ; %loop.exit.guard			; GFX90A-NEXT: .LBB3_9: ; %loop.exit.guard
	; GFX90A-NEXT: ; in Loop: Header=BB3_2 Depth=1			; GFX90A-NEXT: ; in Loop: Header=BB3_2 Depth=1
	; GFX90A-NEXT: s_xor_b64 s[16:17], s[22:23], -1			; GFX90A-NEXT: s_xor_b64 s[16:17], s[22:23], -1
	; GFX90A-NEXT: .LBB3_10: ; %Flow19			; GFX90A-NEXT: .LBB3_10: ; %Flow19
	; GFX90A-NEXT: ; in Loop: Header=BB3_2 Depth=1			; GFX90A-NEXT: ; in Loop: Header=BB3_2 Depth=1
	; GFX90A-NEXT: s_mov_b64 s[14:15], -1			; GFX90A-NEXT: s_mov_b64 s[0:1], -1
	; GFX90A-NEXT: s_and_b64 vcc, exec, s[16:17]			; GFX90A-NEXT: s_and_b64 vcc, exec, s[16:17]
	; GFX90A-NEXT: s_cbranch_vccz .LBB3_1			; GFX90A-NEXT: s_cbranch_vccz .LBB3_1
	; GFX90A-NEXT: ; %bb.11: ; %bb12			; GFX90A-NEXT: ; %bb.11: ; %bb12
	; GFX90A-NEXT: ; in Loop: Header=BB3_2 Depth=1			; GFX90A-NEXT: ; in Loop: Header=BB3_2 Depth=1
	; GFX90A-NEXT: s_add_u32 s6, s6, s4			; GFX90A-NEXT: s_add_u32 s6, s6, s4
	; GFX90A-NEXT: s_addc_u32 s7, s7, 0			; GFX90A-NEXT: s_addc_u32 s7, s7, 0
	; GFX90A-NEXT: s_add_u32 s10, s10, s12			; GFX90A-NEXT: s_add_u32 s10, s10, s12
	; GFX90A-NEXT: s_addc_u32 s11, s11, s13			; GFX90A-NEXT: s_addc_u32 s11, s11, s13
	; GFX90A-NEXT: s_mov_b64 s[14:15], 0			; GFX90A-NEXT: s_mov_b64 s[0:1], 0
	; GFX90A-NEXT: s_branch .LBB3_1			; GFX90A-NEXT: s_branch .LBB3_1
	; GFX90A-NEXT: .LBB3_12: ; %DummyReturnBlock			; GFX90A-NEXT: .LBB3_12: ; %DummyReturnBlock
	; GFX90A-NEXT: s_endpgm			; GFX90A-NEXT: s_endpgm
	bb:			bb:
	%i = load volatile i16, ptr addrspace(4) undef, align 2			%i = load volatile i16, ptr addrspace(4) undef, align 2
	%i6 = zext i16 %i to i64			%i6 = zext i16 %i to i64
	%i7 = udiv i32 %arg1, %arg2			%i7 = udiv i32 %arg1, %arg2
	%i8 = zext i32 %i7 to i64			%i8 = zext i32 %i7 to i64
	▲ Show 20 Lines • Show All 317 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/exec-mask-opt-cannot-create-empty-or-backward-segment.ll

	Show All 16 Lines
	; CHECK-NEXT: s_bitcmp1_b32 s2, 8			; CHECK-NEXT: s_bitcmp1_b32 s2, 8
	; CHECK-NEXT: s_cselect_b64 s[10:11], -1, 0			; CHECK-NEXT: s_cselect_b64 s[10:11], -1, 0
	; CHECK-NEXT: s_bitcmp1_b32 s2, 16			; CHECK-NEXT: s_bitcmp1_b32 s2, 16
	; CHECK-NEXT: s_cselect_b64 s[2:3], -1, 0			; CHECK-NEXT: s_cselect_b64 s[2:3], -1, 0
	; CHECK-NEXT: s_bitcmp1_b32 s0, 24			; CHECK-NEXT: s_bitcmp1_b32 s0, 24
	; CHECK-NEXT: s_cselect_b64 s[8:9], -1, 0			; CHECK-NEXT: s_cselect_b64 s[8:9], -1, 0
	; CHECK-NEXT: s_xor_b64 s[4:5], s[8:9], -1			; CHECK-NEXT: s_xor_b64 s[4:5], s[8:9], -1
	; CHECK-NEXT: s_bitcmp1_b32 s1, 0			; CHECK-NEXT: s_bitcmp1_b32 s1, 0
				; CHECK-NEXT: v_cndmask_b32_e64 v0, 0, 1, s[2:3]
	; CHECK-NEXT: s_cselect_b64 s[12:13], -1, 0			; CHECK-NEXT: s_cselect_b64 s[12:13], -1, 0
	; CHECK-NEXT: s_bitcmp1_b32 s6, 8			; CHECK-NEXT: s_bitcmp1_b32 s6, 8
	; CHECK-NEXT: v_cndmask_b32_e64 v0, 0, 1, s[2:3]
	; CHECK-NEXT: v_cndmask_b32_e64 v1, 0, 1, s[16:17]
	; CHECK-NEXT: s_cselect_b64 s[14:15], -1, 0
	; CHECK-NEXT: v_cmp_ne_u32_e64 s[2:3], 1, v0			; CHECK-NEXT: v_cmp_ne_u32_e64 s[2:3], 1, v0
				; CHECK-NEXT: v_cndmask_b32_e64 v0, 0, 1, s[16:17]
				; CHECK-NEXT: s_cselect_b64 s[14:15], -1, 0
	; CHECK-NEXT: s_and_b64 s[4:5], exec, s[4:5]			; CHECK-NEXT: s_and_b64 s[4:5], exec, s[4:5]
	; CHECK-NEXT: s_and_b64 s[6:7], exec, s[10:11]			; CHECK-NEXT: s_and_b64 s[6:7], exec, s[10:11]
				; CHECK-NEXT: v_cmp_ne_u32_e64 s[0:1], 1, v0
	; CHECK-NEXT: v_mov_b32_e32 v0, 0			; CHECK-NEXT: v_mov_b32_e32 v0, 0
	; CHECK-NEXT: v_cmp_ne_u32_e64 s[0:1], 1, v1
	; CHECK-NEXT: s_branch .LBB0_3			; CHECK-NEXT: s_branch .LBB0_3
	; CHECK-NEXT: .LBB0_1: ; in Loop: Header=BB0_3 Depth=1			; CHECK-NEXT: .LBB0_1: ; in Loop: Header=BB0_3 Depth=1
	; CHECK-NEXT: s_mov_b64 s[18:19], 0			; CHECK-NEXT: s_mov_b64 s[18:19], 0
	; CHECK-NEXT: s_mov_b64 s[20:21], -1			; CHECK-NEXT: s_mov_b64 s[20:21], -1
	; CHECK-NEXT: s_mov_b64 s[16:17], -1			; CHECK-NEXT: s_mov_b64 s[16:17], -1
	; CHECK-NEXT: s_mov_b64 s[22:23], -1			; CHECK-NEXT: s_mov_b64 s[22:23], -1
	; CHECK-NEXT: .LBB0_2: ; %Flow7			; CHECK-NEXT: .LBB0_2: ; %Flow7
	; CHECK-NEXT: ; in Loop: Header=BB0_3 Depth=1			; CHECK-NEXT: ; in Loop: Header=BB0_3 Depth=1
	▲ Show 20 Lines • Show All 136 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s		; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s
		dmgreen: This doesn't really look autogenerated to me.
		jaykang10: Sorry, it looks like the script was not used the first time. For the `negated_cond` function, the `SIOptimizeExecMaskingPreRA` pass fails to fold the mask operations `V_CNDMASK_B32_e64` and `V_CMP_NE_U32` because they are hoisted. Let me update it manually.

; GCN-LABEL: {{^}}negated_cond:		; GCN-LABEL: {{^}}negated_cond:
; GCN: .LBB0_1:		; GCN: .LBB0_2:
; GCN: v_cmp_eq_u32_e64 [[CC:[^,]+]],		; GCN: v_cndmask_b32_e64
; GCN: .LBB0_3:		; GCN: v_cmp_ne_u32_e64
; GCN-NOT: v_cndmask_b32
; GCN-NOT: v_cmp
; GCN: s_andn2_b64 vcc, exec, [[CC]]
; GCN: s_lshl_b32 s12, s12, 5
; GCN: s_cbranch_vccz .LBB0_6
define amdgpu_kernel void @negated_cond(ptr addrspace(1) %arg1) {		define amdgpu_kernel void @negated_cond(ptr addrspace(1) %arg1) {
bb:		bb:
br label %bb1		br label %bb1

bb1:		bb1:
%tmp1 = load i32, ptr addrspace(1) %arg1		%tmp1 = load i32, ptr addrspace(1) %arg1
%tmp2 = icmp eq i32 %tmp1, 0		%tmp2 = icmp eq i32 %tmp1, 0
br label %bb2		br label %bb2
Show All 12 Lines	bb4:
%gep = getelementptr inbounds i32, ptr addrspace(1) %arg1, i32 %tmp6		%gep = getelementptr inbounds i32, ptr addrspace(1) %arg1, i32 %tmp6
store i32 0, ptr addrspace(1) %gep		store i32 0, ptr addrspace(1) %gep
%tmp7 = icmp eq i32 %tmp6, 32		%tmp7 = icmp eq i32 %tmp6, 32
br i1 %tmp7, label %bb1, label %bb2		br i1 %tmp7, label %bb1, label %bb2
}		}

; GCN-LABEL: {{^}}negated_cond_dominated_blocks:		; GCN-LABEL: {{^}}negated_cond_dominated_blocks:
; GCN: s_cmp_lg_u32		; GCN: s_cmp_lg_u32
; GCN: s_cselect_b64 [[CC1:[^,]+]], -1, 0		; GCN: s_cselect_b64 [[CC1:[^,]+]], -1, 0
dmgreen: Should all these lines be removed, or should they be updated for the new codegen?
jaykang10: It looks like these test lines are correct. Let me keep these lines in this patch.
; GCN: s_branch [[BB1:.LBB[0-9]+_[0-9]+]]		; GCN: s_branch [[BB1:.LBB[0-9]+_[0-9]+]]
; GCN: [[BB0:.LBB[0-9]+_[0-9]+]]		; GCN: [[BB0:.LBB[0-9]+_[0-9]+]]
; GCN-NOT: v_cndmask_b32		; GCN-NOT: v_cndmask_b32
; GCN-NOT: v_cmp		; GCN-NOT: v_cmp
; GCN: [[BB1]]:		; GCN: [[BB1]]:
; GCN: s_mov_b64 vcc, [[CC1]]		; GCN: s_mov_b64 vcc, [[CC1]]
; GCN: s_cbranch_vccz [[BB2:.LBB[0-9]+_[0-9]+]]		; GCN: s_cbranch_vccz [[BB2:.LBB[0-9]+_[0-9]+]]
; GCN: s_mov_b64 vcc, exec		; GCN: s_mov_b64 vcc, exec
[25 lines hidden]
bb6:		bb6:
br label %bb7		br label %bb7

bb7:		bb7:
%tmp7 = phi i32 [ %tmp5, %bb5 ], [ %tmp6, %bb6 ]		%tmp7 = phi i32 [ %tmp5, %bb5 ], [ %tmp6, %bb6 ]
%gep = getelementptr inbounds i32, ptr addrspace(1) %arg1, i32 %tmp7		%gep = getelementptr inbounds i32, ptr addrspace(1) %arg1, i32 %tmp7
store i32 0, ptr addrspace(1) %gep		store i32 0, ptr addrspace(1) %gep
%tmp8 = icmp eq i32 %tmp7, 32		%tmp8 = icmp eq i32 %tmp7, 32
br i1 %tmp8, label %bb3, label %bb4		br i1 %tmp8, label %bb3, label %bb4
}		}
		dmgreen: This shouldn't be needed.
		jaykang10 (Author): Yep, let me remove it.

llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll

	[57 lines hidden]
	; GLOBALNESS1-NEXT: s_xor_b64 s[4:5], s[4:5], -1			; GLOBALNESS1-NEXT: s_xor_b64 s[4:5], s[4:5], -1
	; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v1, 0, 1, vcc			; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v1, 0, 1, vcc
	; GLOBALNESS1-NEXT: s_bitcmp1_b32 s6, 0			; GLOBALNESS1-NEXT: s_bitcmp1_b32 s6, 0
	; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[42:43], 1, v1			; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[42:43], 1, v1
	; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v1, 0, 1, s[4:5]			; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v1, 0, 1, s[4:5]
	; GLOBALNESS1-NEXT: s_cselect_b64 s[4:5], -1, 0			; GLOBALNESS1-NEXT: s_cselect_b64 s[4:5], -1, 0
	; GLOBALNESS1-NEXT: s_xor_b64 s[4:5], s[4:5], -1			; GLOBALNESS1-NEXT: s_xor_b64 s[4:5], s[4:5], -1
	; GLOBALNESS1-NEXT: s_bitcmp1_b32 s7, 0			; GLOBALNESS1-NEXT: s_bitcmp1_b32 s7, 0
	; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[48:49], 1, v1			; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[44:45], 1, v2
	; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v1, 0, 1, s[4:5]			; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v2, 0, 1, s[4:5]
	; GLOBALNESS1-NEXT: s_cselect_b64 s[4:5], -1, 0			; GLOBALNESS1-NEXT: s_cselect_b64 s[4:5], -1, 0
	; GLOBALNESS1-NEXT: s_getpc_b64 s[6:7]			; GLOBALNESS1-NEXT: s_getpc_b64 s[6:7]
	; GLOBALNESS1-NEXT: s_add_u32 s6, s6, wobble@gotpcrel32@lo+4			; GLOBALNESS1-NEXT: s_add_u32 s6, s6, wobble@gotpcrel32@lo+4
	; GLOBALNESS1-NEXT: s_addc_u32 s7, s7, wobble@gotpcrel32@hi+12			; GLOBALNESS1-NEXT: s_addc_u32 s7, s7, wobble@gotpcrel32@hi+12
	; GLOBALNESS1-NEXT: s_xor_b64 s[4:5], s[4:5], -1			; GLOBALNESS1-NEXT: s_xor_b64 s[4:5], s[4:5], -1
	; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[50:51], 1, v1			; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[48:49], 1, v2
	; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v1, 0, 1, s[4:5]			; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v2, 0, 1, s[4:5]
	; GLOBALNESS1-NEXT: s_load_dwordx2 s[76:77], s[6:7], 0x0			; GLOBALNESS1-NEXT: s_load_dwordx2 s[76:77], s[6:7], 0x0
	; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[52:53], 1, v1			; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[50:51], 1, v2
	; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[44:45], 1, v2
	; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[46:47], 1, v3			; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[46:47], 1, v3
	; GLOBALNESS1-NEXT: s_mov_b32 s70, s16			; GLOBALNESS1-NEXT: s_mov_b32 s70, s16
	; GLOBALNESS1-NEXT: s_mov_b64 s[38:39], s[8:9]			; GLOBALNESS1-NEXT: s_mov_b64 s[38:39], s[8:9]
	; GLOBALNESS1-NEXT: s_mov_b32 s71, s15			; GLOBALNESS1-NEXT: s_mov_b32 s71, s15
	; GLOBALNESS1-NEXT: s_mov_b32 s72, s14			; GLOBALNESS1-NEXT: s_mov_b32 s72, s14
	; GLOBALNESS1-NEXT: s_mov_b64 s[34:35], s[10:11]			; GLOBALNESS1-NEXT: s_mov_b64 s[34:35], s[10:11]
	; GLOBALNESS1-NEXT: s_mov_b64 s[74:75], 0x80			; GLOBALNESS1-NEXT: s_mov_b64 s[74:75], 0x80
				; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[60:61], 1, v1
	; GLOBALNESS1-NEXT: s_mov_b32 s32, 0			; GLOBALNESS1-NEXT: s_mov_b32 s32, 0
	; GLOBALNESS1-NEXT: ; implicit-def: $vgpr44_vgpr45			; GLOBALNESS1-NEXT: ; implicit-def: $vgpr44_vgpr45
	; GLOBALNESS1-NEXT: s_waitcnt vmcnt(0)			; GLOBALNESS1-NEXT: s_waitcnt vmcnt(0)
	; GLOBALNESS1-NEXT: v_cmp_gt_i32_e32 vcc, 0, v0			; GLOBALNESS1-NEXT: v_cmp_gt_i32_e32 vcc, 0, v0
	; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v1, 0, 1, vcc
	; GLOBALNESS1-NEXT: v_cmp_gt_i32_e32 vcc, 1, v0
	; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v2, 0, 1, vcc			; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v2, 0, 1, vcc
	; GLOBALNESS1-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0			; GLOBALNESS1-NEXT: v_cmp_gt_i32_e32 vcc, 1, v0
	; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v3, 0, 1, vcc			; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v3, 0, 1, vcc
				; GLOBALNESS1-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0
				; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v4, 0, 1, vcc
	; GLOBALNESS1-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0			; GLOBALNESS1-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
	; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc			; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc
	; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[54:55], 1, v1			; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[52:53], 1, v2
	; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[56:57], 1, v2			; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[54:55], 1, v3
	; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[58:59], 1, v3			; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[56:57], 1, v4
	; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[60:61], 1, v0			; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[58:59], 1, v0
	; GLOBALNESS1-NEXT: s_branch .LBB1_4			; GLOBALNESS1-NEXT: s_branch .LBB1_4
	; GLOBALNESS1-NEXT: .LBB1_1: ; %bb70.i			; GLOBALNESS1-NEXT: .LBB1_1: ; %bb70.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[60:61]			; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[58:59]
	; GLOBALNESS1-NEXT: s_cbranch_vccz .LBB1_29			; GLOBALNESS1-NEXT: s_cbranch_vccz .LBB1_29
	; GLOBALNESS1-NEXT: .LBB1_2: ; %Flow15			; GLOBALNESS1-NEXT: .LBB1_2: ; %Flow15
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS1-NEXT: s_or_b64 exec, exec, s[4:5]			; GLOBALNESS1-NEXT: s_or_b64 exec, exec, s[4:5]
	; GLOBALNESS1-NEXT: s_mov_b64 s[6:7], 0			; GLOBALNESS1-NEXT: s_mov_b64 s[6:7], 0
	; GLOBALNESS1-NEXT: ; implicit-def: $sgpr4_sgpr5			; GLOBALNESS1-NEXT: ; implicit-def: $sgpr4_sgpr5
	; GLOBALNESS1-NEXT: .LBB1_3: ; %Flow28			; GLOBALNESS1-NEXT: .LBB1_3: ; %Flow28
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1
	[49 lines hidden]
	; GLOBALNESS1-NEXT: v_cmp_gt_i32_e64 s[62:63], 0, v0			; GLOBALNESS1-NEXT: v_cmp_gt_i32_e64 s[62:63], 0, v0
	; GLOBALNESS1-NEXT: v_mov_b32_e32 v0, 0			; GLOBALNESS1-NEXT: v_mov_b32_e32 v0, 0
	; GLOBALNESS1-NEXT: v_mov_b32_e32 v1, 0x3ff00000			; GLOBALNESS1-NEXT: v_mov_b32_e32 v1, 0x3ff00000
	; GLOBALNESS1-NEXT: s_and_saveexec_b64 s[80:81], s[62:63]			; GLOBALNESS1-NEXT: s_and_saveexec_b64 s[80:81], s[62:63]
	; GLOBALNESS1-NEXT: s_cbranch_execz .LBB1_26			; GLOBALNESS1-NEXT: s_cbranch_execz .LBB1_26
	; GLOBALNESS1-NEXT: ; %bb.10: ; %bb33.i			; GLOBALNESS1-NEXT: ; %bb.10: ; %bb33.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS1-NEXT: global_load_dwordx2 v[0:1], v[2:3], off			; GLOBALNESS1-NEXT: global_load_dwordx2 v[0:1], v[2:3], off
	; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[54:55]			; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[52:53]
	; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_12			; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_12
	; GLOBALNESS1-NEXT: ; %bb.11: ; %bb39.i			; GLOBALNESS1-NEXT: ; %bb.11: ; %bb39.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS1-NEXT: v_mov_b32_e32 v43, v42			; GLOBALNESS1-NEXT: v_mov_b32_e32 v43, v42
	; GLOBALNESS1-NEXT: v_pk_mov_b32 v[2:3], 0, 0			; GLOBALNESS1-NEXT: v_pk_mov_b32 v[2:3], 0, 0
	; GLOBALNESS1-NEXT: global_store_dwordx2 v[2:3], v[42:43], off			; GLOBALNESS1-NEXT: global_store_dwordx2 v[2:3], v[42:43], off
	; GLOBALNESS1-NEXT: .LBB1_12: ; %bb44.lr.ph.i			; GLOBALNESS1-NEXT: .LBB1_12: ; %bb44.lr.ph.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS1-NEXT: v_cmp_ne_u32_e32 vcc, 0, v46			; GLOBALNESS1-NEXT: v_cmp_ne_u32_e32 vcc, 0, v46
	; GLOBALNESS1-NEXT: v_cndmask_b32_e32 v2, 0, v40, vcc			; GLOBALNESS1-NEXT: v_cndmask_b32_e32 v2, 0, v40, vcc
	; GLOBALNESS1-NEXT: s_waitcnt vmcnt(0)			; GLOBALNESS1-NEXT: s_waitcnt vmcnt(0)
	; GLOBALNESS1-NEXT: v_cmp_nlt_f64_e64 s[64:65], 0, v[0:1]			; GLOBALNESS1-NEXT: v_cmp_nlt_f64_e32 vcc, 0, v[0:1]
	; GLOBALNESS1-NEXT: v_cmp_eq_u32_e64 s[66:67], 0, v2			; GLOBALNESS1-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc
				; GLOBALNESS1-NEXT: v_cmp_eq_u32_e64 s[64:65], 0, v2
				; GLOBALNESS1-NEXT: v_cmp_ne_u32_e64 s[66:67], 1, v0
	; GLOBALNESS1-NEXT: s_branch .LBB1_15			; GLOBALNESS1-NEXT: s_branch .LBB1_15
	; GLOBALNESS1-NEXT: .LBB1_13: ; %Flow16			; GLOBALNESS1-NEXT: .LBB1_13: ; %Flow16
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS1-NEXT: s_or_b64 exec, exec, s[4:5]			; GLOBALNESS1-NEXT: s_or_b64 exec, exec, s[4:5]
	; GLOBALNESS1-NEXT: .LBB1_14: ; %bb63.i			; GLOBALNESS1-NEXT: .LBB1_14: ; %bb63.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[52:53]			; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[50:51]
	; GLOBALNESS1-NEXT: s_cbranch_vccz .LBB1_25			; GLOBALNESS1-NEXT: s_cbranch_vccz .LBB1_25
	; GLOBALNESS1-NEXT: .LBB1_15: ; %bb44.i			; GLOBALNESS1-NEXT: .LBB1_15: ; %bb44.i
	; GLOBALNESS1-NEXT: ; Parent Loop BB1_4 Depth=1			; GLOBALNESS1-NEXT: ; Parent Loop BB1_4 Depth=1
	; GLOBALNESS1-NEXT: ; => This Inner Loop Header: Depth=2			; GLOBALNESS1-NEXT: ; => This Inner Loop Header: Depth=2
	; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[48:49]			; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[60:61]
	; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_14			; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_14
	; GLOBALNESS1-NEXT: ; %bb.16: ; %bb46.i			; GLOBALNESS1-NEXT: ; %bb.16: ; %bb46.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[50:51]			; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[48:49]
	; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_14			; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_14
	; GLOBALNESS1-NEXT: ; %bb.17: ; %bb50.i			; GLOBALNESS1-NEXT: ; %bb.17: ; %bb50.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[42:43]			; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[42:43]
	; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_20			; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_20
	; GLOBALNESS1-NEXT: ; %bb.18: ; %bb3.i.i			; GLOBALNESS1-NEXT: ; %bb.18: ; %bb3.i.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[44:45]			; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[44:45]
	; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_20			; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_20
	; GLOBALNESS1-NEXT: ; %bb.19: ; %bb6.i.i			; GLOBALNESS1-NEXT: ; %bb.19: ; %bb6.i.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS1-NEXT: s_andn2_b64 vcc, exec, s[64:65]			; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[66:67]
	; GLOBALNESS1-NEXT: .LBB1_20: ; %spam.exit.i			; GLOBALNESS1-NEXT: .LBB1_20: ; %spam.exit.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[56:57]			; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[54:55]
	; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_14			; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_14
	; GLOBALNESS1-NEXT: ; %bb.21: ; %bb55.i			; GLOBALNESS1-NEXT: ; %bb.21: ; %bb55.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS1-NEXT: s_add_u32 s68, s38, 40			; GLOBALNESS1-NEXT: s_add_u32 s68, s38, 40
	; GLOBALNESS1-NEXT: s_addc_u32 s69, s39, 0			; GLOBALNESS1-NEXT: s_addc_u32 s69, s39, 0
	; GLOBALNESS1-NEXT: s_mov_b64 s[4:5], s[40:41]			; GLOBALNESS1-NEXT: s_mov_b64 s[4:5], s[40:41]
	; GLOBALNESS1-NEXT: s_mov_b64 s[6:7], s[36:37]			; GLOBALNESS1-NEXT: s_mov_b64 s[6:7], s[36:37]
	; GLOBALNESS1-NEXT: s_mov_b64 s[8:9], s[68:69]			; GLOBALNESS1-NEXT: s_mov_b64 s[8:9], s[68:69]
	[9 lines hidden]
	; GLOBALNESS1-NEXT: s_mov_b64 s[8:9], s[68:69]			; GLOBALNESS1-NEXT: s_mov_b64 s[8:9], s[68:69]
	; GLOBALNESS1-NEXT: s_mov_b64 s[10:11], s[34:35]			; GLOBALNESS1-NEXT: s_mov_b64 s[10:11], s[34:35]
	; GLOBALNESS1-NEXT: s_mov_b32 s12, s72			; GLOBALNESS1-NEXT: s_mov_b32 s12, s72
	; GLOBALNESS1-NEXT: s_mov_b32 s13, s71			; GLOBALNESS1-NEXT: s_mov_b32 s13, s71
	; GLOBALNESS1-NEXT: s_mov_b32 s14, s70			; GLOBALNESS1-NEXT: s_mov_b32 s14, s70
	; GLOBALNESS1-NEXT: v_mov_b32_e32 v31, v41			; GLOBALNESS1-NEXT: v_mov_b32_e32 v31, v41
	; GLOBALNESS1-NEXT: global_store_dwordx2 v[46:47], v[44:45], off			; GLOBALNESS1-NEXT: global_store_dwordx2 v[46:47], v[44:45], off
	; GLOBALNESS1-NEXT: s_swappc_b64 s[30:31], s[76:77]			; GLOBALNESS1-NEXT: s_swappc_b64 s[30:31], s[76:77]
	; GLOBALNESS1-NEXT: s_and_saveexec_b64 s[4:5], s[66:67]			; GLOBALNESS1-NEXT: s_and_saveexec_b64 s[4:5], s[64:65]
	; GLOBALNESS1-NEXT: s_cbranch_execz .LBB1_13			; GLOBALNESS1-NEXT: s_cbranch_execz .LBB1_13
	; GLOBALNESS1-NEXT: ; %bb.22: ; %bb62.i			; GLOBALNESS1-NEXT: ; %bb.22: ; %bb62.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS1-NEXT: v_mov_b32_e32 v43, v42			; GLOBALNESS1-NEXT: v_mov_b32_e32 v43, v42
	; GLOBALNESS1-NEXT: global_store_dwordx2 v[46:47], v[42:43], off			; GLOBALNESS1-NEXT: global_store_dwordx2 v[46:47], v[42:43], off
	; GLOBALNESS1-NEXT: s_branch .LBB1_13			; GLOBALNESS1-NEXT: s_branch .LBB1_13
	; GLOBALNESS1-NEXT: .LBB1_23: ; %LeafBlock			; GLOBALNESS1-NEXT: .LBB1_23: ; %LeafBlock
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1
	[11 lines hidden]
	; GLOBALNESS1-NEXT: v_pk_mov_b32 v[0:1], 0, 0			; GLOBALNESS1-NEXT: v_pk_mov_b32 v[0:1], 0, 0
	; GLOBALNESS1-NEXT: .LBB1_26: ; %Flow24			; GLOBALNESS1-NEXT: .LBB1_26: ; %Flow24
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS1-NEXT: s_or_b64 exec, exec, s[80:81]			; GLOBALNESS1-NEXT: s_or_b64 exec, exec, s[80:81]
	; GLOBALNESS1-NEXT: s_and_saveexec_b64 s[4:5], s[62:63]			; GLOBALNESS1-NEXT: s_and_saveexec_b64 s[4:5], s[62:63]
	; GLOBALNESS1-NEXT: s_cbranch_execz .LBB1_2			; GLOBALNESS1-NEXT: s_cbranch_execz .LBB1_2
	; GLOBALNESS1-NEXT: ; %bb.27: ; %bb67.i			; GLOBALNESS1-NEXT: ; %bb.27: ; %bb67.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[58:59]			; GLOBALNESS1-NEXT: s_and_b64 vcc, exec, s[56:57]
	; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_1			; GLOBALNESS1-NEXT: s_cbranch_vccnz .LBB1_1
	; GLOBALNESS1-NEXT: ; %bb.28: ; %bb69.i			; GLOBALNESS1-NEXT: ; %bb.28: ; %bb69.i
	; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS1-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS1-NEXT: v_mov_b32_e32 v43, v42			; GLOBALNESS1-NEXT: v_mov_b32_e32 v43, v42
	; GLOBALNESS1-NEXT: v_pk_mov_b32 v[2:3], 0, 0			; GLOBALNESS1-NEXT: v_pk_mov_b32 v[2:3], 0, 0
	; GLOBALNESS1-NEXT: global_store_dwordx2 v[2:3], v[42:43], off			; GLOBALNESS1-NEXT: global_store_dwordx2 v[2:3], v[42:43], off
	; GLOBALNESS1-NEXT: s_branch .LBB1_1			; GLOBALNESS1-NEXT: s_branch .LBB1_1
	; GLOBALNESS1-NEXT: .LBB1_29: ; %bb73.i			; GLOBALNESS1-NEXT: .LBB1_29: ; %bb73.i
	[69 lines hidden]
	; GLOBALNESS0-NEXT: s_xor_b64 s[4:5], s[4:5], -1			; GLOBALNESS0-NEXT: s_xor_b64 s[4:5], s[4:5], -1
	; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v1, 0, 1, vcc			; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v1, 0, 1, vcc
	; GLOBALNESS0-NEXT: s_bitcmp1_b32 s6, 0			; GLOBALNESS0-NEXT: s_bitcmp1_b32 s6, 0
	; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[42:43], 1, v1			; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[42:43], 1, v1
	; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v1, 0, 1, s[4:5]			; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v1, 0, 1, s[4:5]
	; GLOBALNESS0-NEXT: s_cselect_b64 s[4:5], -1, 0			; GLOBALNESS0-NEXT: s_cselect_b64 s[4:5], -1, 0
	; GLOBALNESS0-NEXT: s_xor_b64 s[4:5], s[4:5], -1			; GLOBALNESS0-NEXT: s_xor_b64 s[4:5], s[4:5], -1
	; GLOBALNESS0-NEXT: s_bitcmp1_b32 s7, 0			; GLOBALNESS0-NEXT: s_bitcmp1_b32 s7, 0
	; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[48:49], 1, v1			; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[44:45], 1, v2
	; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v1, 0, 1, s[4:5]			; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v2, 0, 1, s[4:5]
	; GLOBALNESS0-NEXT: s_cselect_b64 s[4:5], -1, 0			; GLOBALNESS0-NEXT: s_cselect_b64 s[4:5], -1, 0
	; GLOBALNESS0-NEXT: s_getpc_b64 s[6:7]			; GLOBALNESS0-NEXT: s_getpc_b64 s[6:7]
	; GLOBALNESS0-NEXT: s_add_u32 s6, s6, wobble@gotpcrel32@lo+4			; GLOBALNESS0-NEXT: s_add_u32 s6, s6, wobble@gotpcrel32@lo+4
	; GLOBALNESS0-NEXT: s_addc_u32 s7, s7, wobble@gotpcrel32@hi+12			; GLOBALNESS0-NEXT: s_addc_u32 s7, s7, wobble@gotpcrel32@hi+12
	; GLOBALNESS0-NEXT: s_xor_b64 s[4:5], s[4:5], -1			; GLOBALNESS0-NEXT: s_xor_b64 s[4:5], s[4:5], -1
	; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[50:51], 1, v1			; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[48:49], 1, v2
	; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v1, 0, 1, s[4:5]			; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v2, 0, 1, s[4:5]
	; GLOBALNESS0-NEXT: s_load_dwordx2 s[78:79], s[6:7], 0x0			; GLOBALNESS0-NEXT: s_load_dwordx2 s[78:79], s[6:7], 0x0
	; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[52:53], 1, v1			; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[50:51], 1, v2
	; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[44:45], 1, v2
	; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[46:47], 1, v3			; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[46:47], 1, v3
	; GLOBALNESS0-NEXT: s_mov_b32 s68, s16			; GLOBALNESS0-NEXT: s_mov_b32 s68, s16
	; GLOBALNESS0-NEXT: s_mov_b64 s[38:39], s[8:9]			; GLOBALNESS0-NEXT: s_mov_b64 s[38:39], s[8:9]
	; GLOBALNESS0-NEXT: s_mov_b32 s69, s15			; GLOBALNESS0-NEXT: s_mov_b32 s69, s15
	; GLOBALNESS0-NEXT: s_mov_b32 s70, s14			; GLOBALNESS0-NEXT: s_mov_b32 s70, s14
	; GLOBALNESS0-NEXT: s_mov_b64 s[34:35], s[10:11]			; GLOBALNESS0-NEXT: s_mov_b64 s[34:35], s[10:11]
	; GLOBALNESS0-NEXT: s_mov_b64 s[76:77], 0x80			; GLOBALNESS0-NEXT: s_mov_b64 s[76:77], 0x80
				; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[60:61], 1, v1
	; GLOBALNESS0-NEXT: s_mov_b32 s32, 0			; GLOBALNESS0-NEXT: s_mov_b32 s32, 0
	; GLOBALNESS0-NEXT: ; implicit-def: $vgpr44_vgpr45			; GLOBALNESS0-NEXT: ; implicit-def: $vgpr44_vgpr45
	; GLOBALNESS0-NEXT: s_waitcnt vmcnt(0)			; GLOBALNESS0-NEXT: s_waitcnt vmcnt(0)
	; GLOBALNESS0-NEXT: v_cmp_gt_i32_e32 vcc, 0, v0			; GLOBALNESS0-NEXT: v_cmp_gt_i32_e32 vcc, 0, v0
	; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v1, 0, 1, vcc
	; GLOBALNESS0-NEXT: v_cmp_gt_i32_e32 vcc, 1, v0
	; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v2, 0, 1, vcc			; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v2, 0, 1, vcc
	; GLOBALNESS0-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0			; GLOBALNESS0-NEXT: v_cmp_gt_i32_e32 vcc, 1, v0
	; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v3, 0, 1, vcc			; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v3, 0, 1, vcc
				; GLOBALNESS0-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0
				; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v4, 0, 1, vcc
	; GLOBALNESS0-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0			; GLOBALNESS0-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
	; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc			; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc
	; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[54:55], 1, v1			; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[52:53], 1, v2
	; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[56:57], 1, v2			; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[54:55], 1, v3
	; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[58:59], 1, v3			; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[56:57], 1, v4
	; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[60:61], 1, v0			; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[58:59], 1, v0
	; GLOBALNESS0-NEXT: s_branch .LBB1_4			; GLOBALNESS0-NEXT: s_branch .LBB1_4
	; GLOBALNESS0-NEXT: .LBB1_1: ; %bb70.i			; GLOBALNESS0-NEXT: .LBB1_1: ; %bb70.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[60:61]			; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[58:59]
	; GLOBALNESS0-NEXT: s_cbranch_vccz .LBB1_29			; GLOBALNESS0-NEXT: s_cbranch_vccz .LBB1_29
	; GLOBALNESS0-NEXT: .LBB1_2: ; %Flow15			; GLOBALNESS0-NEXT: .LBB1_2: ; %Flow15
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS0-NEXT: s_or_b64 exec, exec, s[4:5]			; GLOBALNESS0-NEXT: s_or_b64 exec, exec, s[4:5]
	; GLOBALNESS0-NEXT: s_mov_b64 s[6:7], 0			; GLOBALNESS0-NEXT: s_mov_b64 s[6:7], 0
	; GLOBALNESS0-NEXT: ; implicit-def: $sgpr4_sgpr5			; GLOBALNESS0-NEXT: ; implicit-def: $sgpr4_sgpr5
	; GLOBALNESS0-NEXT: .LBB1_3: ; %Flow28			; GLOBALNESS0-NEXT: .LBB1_3: ; %Flow28
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1
	[49 lines hidden]
	; GLOBALNESS0-NEXT: v_cmp_gt_i32_e64 s[62:63], 0, v0			; GLOBALNESS0-NEXT: v_cmp_gt_i32_e64 s[62:63], 0, v0
	; GLOBALNESS0-NEXT: v_mov_b32_e32 v0, 0			; GLOBALNESS0-NEXT: v_mov_b32_e32 v0, 0
	; GLOBALNESS0-NEXT: v_mov_b32_e32 v1, 0x3ff00000			; GLOBALNESS0-NEXT: v_mov_b32_e32 v1, 0x3ff00000
	; GLOBALNESS0-NEXT: s_and_saveexec_b64 s[80:81], s[62:63]			; GLOBALNESS0-NEXT: s_and_saveexec_b64 s[80:81], s[62:63]
	; GLOBALNESS0-NEXT: s_cbranch_execz .LBB1_26			; GLOBALNESS0-NEXT: s_cbranch_execz .LBB1_26
	; GLOBALNESS0-NEXT: ; %bb.10: ; %bb33.i			; GLOBALNESS0-NEXT: ; %bb.10: ; %bb33.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS0-NEXT: global_load_dwordx2 v[0:1], v[2:3], off			; GLOBALNESS0-NEXT: global_load_dwordx2 v[0:1], v[2:3], off
	; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[54:55]			; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[52:53]
	; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_12			; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_12
	; GLOBALNESS0-NEXT: ; %bb.11: ; %bb39.i			; GLOBALNESS0-NEXT: ; %bb.11: ; %bb39.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS0-NEXT: v_mov_b32_e32 v43, v42			; GLOBALNESS0-NEXT: v_mov_b32_e32 v43, v42
	; GLOBALNESS0-NEXT: v_pk_mov_b32 v[2:3], 0, 0			; GLOBALNESS0-NEXT: v_pk_mov_b32 v[2:3], 0, 0
	; GLOBALNESS0-NEXT: global_store_dwordx2 v[2:3], v[42:43], off			; GLOBALNESS0-NEXT: global_store_dwordx2 v[2:3], v[42:43], off
	; GLOBALNESS0-NEXT: .LBB1_12: ; %bb44.lr.ph.i			; GLOBALNESS0-NEXT: .LBB1_12: ; %bb44.lr.ph.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS0-NEXT: v_cmp_ne_u32_e32 vcc, 0, v46			; GLOBALNESS0-NEXT: v_cmp_ne_u32_e32 vcc, 0, v46
	; GLOBALNESS0-NEXT: v_cndmask_b32_e32 v2, 0, v40, vcc			; GLOBALNESS0-NEXT: v_cndmask_b32_e32 v2, 0, v40, vcc
	; GLOBALNESS0-NEXT: s_waitcnt vmcnt(0)			; GLOBALNESS0-NEXT: s_waitcnt vmcnt(0)
	; GLOBALNESS0-NEXT: v_cmp_nlt_f64_e64 s[64:65], 0, v[0:1]			; GLOBALNESS0-NEXT: v_cmp_nlt_f64_e32 vcc, 0, v[0:1]
	; GLOBALNESS0-NEXT: v_cmp_eq_u32_e64 s[66:67], 0, v2			; GLOBALNESS0-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc
				; GLOBALNESS0-NEXT: v_cmp_eq_u32_e64 s[64:65], 0, v2
				; GLOBALNESS0-NEXT: v_cmp_ne_u32_e64 s[66:67], 1, v0
	; GLOBALNESS0-NEXT: s_branch .LBB1_15			; GLOBALNESS0-NEXT: s_branch .LBB1_15
	; GLOBALNESS0-NEXT: .LBB1_13: ; %Flow16			; GLOBALNESS0-NEXT: .LBB1_13: ; %Flow16
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS0-NEXT: s_or_b64 exec, exec, s[4:5]			; GLOBALNESS0-NEXT: s_or_b64 exec, exec, s[4:5]
	; GLOBALNESS0-NEXT: .LBB1_14: ; %bb63.i			; GLOBALNESS0-NEXT: .LBB1_14: ; %bb63.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[52:53]			; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[50:51]
	; GLOBALNESS0-NEXT: s_cbranch_vccz .LBB1_25			; GLOBALNESS0-NEXT: s_cbranch_vccz .LBB1_25
	; GLOBALNESS0-NEXT: .LBB1_15: ; %bb44.i			; GLOBALNESS0-NEXT: .LBB1_15: ; %bb44.i
	; GLOBALNESS0-NEXT: ; Parent Loop BB1_4 Depth=1			; GLOBALNESS0-NEXT: ; Parent Loop BB1_4 Depth=1
	; GLOBALNESS0-NEXT: ; => This Inner Loop Header: Depth=2			; GLOBALNESS0-NEXT: ; => This Inner Loop Header: Depth=2
	; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[48:49]			; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[60:61]
	; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_14			; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_14
	; GLOBALNESS0-NEXT: ; %bb.16: ; %bb46.i			; GLOBALNESS0-NEXT: ; %bb.16: ; %bb46.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[50:51]			; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[48:49]
	; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_14			; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_14
	; GLOBALNESS0-NEXT: ; %bb.17: ; %bb50.i			; GLOBALNESS0-NEXT: ; %bb.17: ; %bb50.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[42:43]			; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[42:43]
	; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_20			; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_20
	; GLOBALNESS0-NEXT: ; %bb.18: ; %bb3.i.i			; GLOBALNESS0-NEXT: ; %bb.18: ; %bb3.i.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[44:45]			; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[44:45]
	; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_20			; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_20
	; GLOBALNESS0-NEXT: ; %bb.19: ; %bb6.i.i			; GLOBALNESS0-NEXT: ; %bb.19: ; %bb6.i.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS0-NEXT: s_andn2_b64 vcc, exec, s[64:65]			; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[66:67]
	; GLOBALNESS0-NEXT: .LBB1_20: ; %spam.exit.i			; GLOBALNESS0-NEXT: .LBB1_20: ; %spam.exit.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[56:57]			; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[54:55]
	; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_14			; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_14
	; GLOBALNESS0-NEXT: ; %bb.21: ; %bb55.i			; GLOBALNESS0-NEXT: ; %bb.21: ; %bb55.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS0-NEXT: s_add_u32 s72, s38, 40			; GLOBALNESS0-NEXT: s_add_u32 s72, s38, 40
	; GLOBALNESS0-NEXT: s_addc_u32 s73, s39, 0			; GLOBALNESS0-NEXT: s_addc_u32 s73, s39, 0
	; GLOBALNESS0-NEXT: s_mov_b64 s[4:5], s[40:41]			; GLOBALNESS0-NEXT: s_mov_b64 s[4:5], s[40:41]
	; GLOBALNESS0-NEXT: s_mov_b64 s[6:7], s[36:37]			; GLOBALNESS0-NEXT: s_mov_b64 s[6:7], s[36:37]
	; GLOBALNESS0-NEXT: s_mov_b64 s[8:9], s[72:73]			; GLOBALNESS0-NEXT: s_mov_b64 s[8:9], s[72:73]
	[9 lines hidden]
	; GLOBALNESS0-NEXT: s_mov_b64 s[8:9], s[72:73]			; GLOBALNESS0-NEXT: s_mov_b64 s[8:9], s[72:73]
	; GLOBALNESS0-NEXT: s_mov_b64 s[10:11], s[34:35]			; GLOBALNESS0-NEXT: s_mov_b64 s[10:11], s[34:35]
	; GLOBALNESS0-NEXT: s_mov_b32 s12, s70			; GLOBALNESS0-NEXT: s_mov_b32 s12, s70
	; GLOBALNESS0-NEXT: s_mov_b32 s13, s69			; GLOBALNESS0-NEXT: s_mov_b32 s13, s69
	; GLOBALNESS0-NEXT: s_mov_b32 s14, s68			; GLOBALNESS0-NEXT: s_mov_b32 s14, s68
	; GLOBALNESS0-NEXT: v_mov_b32_e32 v31, v41			; GLOBALNESS0-NEXT: v_mov_b32_e32 v31, v41
	; GLOBALNESS0-NEXT: global_store_dwordx2 v[46:47], v[44:45], off			; GLOBALNESS0-NEXT: global_store_dwordx2 v[46:47], v[44:45], off
	; GLOBALNESS0-NEXT: s_swappc_b64 s[30:31], s[78:79]			; GLOBALNESS0-NEXT: s_swappc_b64 s[30:31], s[78:79]
	; GLOBALNESS0-NEXT: s_and_saveexec_b64 s[4:5], s[66:67]			; GLOBALNESS0-NEXT: s_and_saveexec_b64 s[4:5], s[64:65]
	; GLOBALNESS0-NEXT: s_cbranch_execz .LBB1_13			; GLOBALNESS0-NEXT: s_cbranch_execz .LBB1_13
	; GLOBALNESS0-NEXT: ; %bb.22: ; %bb62.i			; GLOBALNESS0-NEXT: ; %bb.22: ; %bb62.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_15 Depth=2
	; GLOBALNESS0-NEXT: v_mov_b32_e32 v43, v42			; GLOBALNESS0-NEXT: v_mov_b32_e32 v43, v42
	; GLOBALNESS0-NEXT: global_store_dwordx2 v[46:47], v[42:43], off			; GLOBALNESS0-NEXT: global_store_dwordx2 v[46:47], v[42:43], off
	; GLOBALNESS0-NEXT: s_branch .LBB1_13			; GLOBALNESS0-NEXT: s_branch .LBB1_13
	; GLOBALNESS0-NEXT: .LBB1_23: ; %LeafBlock			; GLOBALNESS0-NEXT: .LBB1_23: ; %LeafBlock
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1
	[11 lines hidden]
	; GLOBALNESS0-NEXT: v_pk_mov_b32 v[0:1], 0, 0			; GLOBALNESS0-NEXT: v_pk_mov_b32 v[0:1], 0, 0
	; GLOBALNESS0-NEXT: .LBB1_26: ; %Flow24			; GLOBALNESS0-NEXT: .LBB1_26: ; %Flow24
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS0-NEXT: s_or_b64 exec, exec, s[80:81]			; GLOBALNESS0-NEXT: s_or_b64 exec, exec, s[80:81]
	; GLOBALNESS0-NEXT: s_and_saveexec_b64 s[4:5], s[62:63]			; GLOBALNESS0-NEXT: s_and_saveexec_b64 s[4:5], s[62:63]
	; GLOBALNESS0-NEXT: s_cbranch_execz .LBB1_2			; GLOBALNESS0-NEXT: s_cbranch_execz .LBB1_2
	; GLOBALNESS0-NEXT: ; %bb.27: ; %bb67.i			; GLOBALNESS0-NEXT: ; %bb.27: ; %bb67.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[58:59]			; GLOBALNESS0-NEXT: s_and_b64 vcc, exec, s[56:57]
	; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_1			; GLOBALNESS0-NEXT: s_cbranch_vccnz .LBB1_1
	; GLOBALNESS0-NEXT: ; %bb.28: ; %bb69.i			; GLOBALNESS0-NEXT: ; %bb.28: ; %bb69.i
	; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1			; GLOBALNESS0-NEXT: ; in Loop: Header=BB1_4 Depth=1
	; GLOBALNESS0-NEXT: v_mov_b32_e32 v43, v42			; GLOBALNESS0-NEXT: v_mov_b32_e32 v43, v42
	; GLOBALNESS0-NEXT: v_pk_mov_b32 v[2:3], 0, 0			; GLOBALNESS0-NEXT: v_pk_mov_b32 v[2:3], 0, 0
	; GLOBALNESS0-NEXT: global_store_dwordx2 v[2:3], v[42:43], off			; GLOBALNESS0-NEXT: global_store_dwordx2 v[2:3], v[42:43], off
	; GLOBALNESS0-NEXT: s_branch .LBB1_1			; GLOBALNESS0-NEXT: s_branch .LBB1_1
	; GLOBALNESS0-NEXT: .LBB1_29: ; %bb73.i			; GLOBALNESS0-NEXT: .LBB1_29: ; %bb73.i
	[157 lines hidden]

llvm/test/CodeGen/Thumb2/mve-gather-scatter-optimisation.ll

	[441 lines hidden]

	end:			end:
	ret void;			ret void;
	}			}

	define dso_local void @arm_mat_mult_q31(i32* noalias nocapture readonly %A, i32* noalias nocapture readonly %B, i32* noalias nocapture %C, i32 %n, i32 %m, i32 %l) local_unnamed_addr #0 {			define dso_local void @arm_mat_mult_q31(i32* noalias nocapture readonly %A, i32* noalias nocapture readonly %B, i32* noalias nocapture %C, i32 %n, i32 %m, i32 %l) local_unnamed_addr #0 {
	; CHECK-LABEL: arm_mat_mult_q31:			; CHECK-LABEL: arm_mat_mult_q31:
	; CHECK: @ %bb.0: @ %for.cond8.preheader.us.us.preheader.preheader			; CHECK: @ %bb.0: @ %for.cond8.preheader.us.us.preheader.preheader
	; CHECK-NEXT: .save {r4, r5, r6, r7, r8, r9, r10, r11, lr}			; CHECK-NEXT: .save {r4, r5, r6, r7, r8, r9, r10, lr}
	; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}			; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, lr}
	; CHECK-NEXT: .pad #4
	; CHECK-NEXT: sub sp, #4
	; CHECK-NEXT: .vsave {d8, d9, d10, d11, d12, d13, d14, d15}			; CHECK-NEXT: .vsave {d8, d9, d10, d11, d12, d13, d14, d15}
	; CHECK-NEXT: vpush {d8, d9, d10, d11, d12, d13, d14, d15}			; CHECK-NEXT: vpush {d8, d9, d10, d11, d12, d13, d14, d15}
	; CHECK-NEXT: .pad #16			; CHECK-NEXT: .pad #32
	; CHECK-NEXT: sub sp, #16			; CHECK-NEXT: sub sp, #32
	; CHECK-NEXT: ldrd r9, r12, [sp, #120]			; CHECK-NEXT: ldrd r9, r12, [sp, #128]
	; CHECK-NEXT: sub.w r7, r12, #1			; CHECK-NEXT: sub.w r7, r12, #1
	; CHECK-NEXT: movs r6, #1			; CHECK-NEXT: movs r6, #1
	; CHECK-NEXT: mov.w r8, #0			; CHECK-NEXT: mov.w r8, #0
	; CHECK-NEXT: add.w r7, r6, r7, lsr #1			; CHECK-NEXT: add.w r7, r6, r7, lsr #1
	; CHECK-NEXT: vdup.32 q1, r9
	; CHECK-NEXT: bic r7, r7, #3			; CHECK-NEXT: bic r7, r7, #3
	; CHECK-NEXT: vshl.i32 q3, q1, #3
	; CHECK-NEXT: subs r7, #4			; CHECK-NEXT: subs r7, #4
	; CHECK-NEXT: add.w r10, r6, r7, lsr #2			; CHECK-NEXT: add.w r10, r6, r7, lsr #2
	; CHECK-NEXT: adr r7, .LCPI9_0
	; CHECK-NEXT: adr r6, .LCPI9_1			; CHECK-NEXT: adr r6, .LCPI9_1
	; CHECK-NEXT: vldrw.u32 q2, [r7]
	; CHECK-NEXT: vldrw.u32 q0, [r6]			; CHECK-NEXT: vldrw.u32 q0, [r6]
				; CHECK-NEXT: adr r7, .LCPI9_0
				; CHECK-NEXT: vldrw.u32 q1, [r7]
	; CHECK-NEXT: vstrw.32 q0, [sp] @ 16-byte Spill			; CHECK-NEXT: vstrw.32 q0, [sp] @ 16-byte Spill
				; CHECK-NEXT: vdup.32 q0, r9
				; CHECK-NEXT: vmov q2, q0
				; CHECK-NEXT: vshl.i32 q3, q0, #3
				; CHECK-NEXT: vstrw.32 q1, [sp, #16] @ 16-byte Spill
	; CHECK-NEXT: .LBB9_1: @ %for.cond8.preheader.us.us.preheader			; CHECK-NEXT: .LBB9_1: @ %for.cond8.preheader.us.us.preheader
	; CHECK-NEXT: @ =>This Loop Header: Depth=1			; CHECK-NEXT: @ =>This Loop Header: Depth=1
	; CHECK-NEXT: @ Child Loop BB9_2 Depth 2			; CHECK-NEXT: @ Child Loop BB9_2 Depth 2
	; CHECK-NEXT: @ Child Loop BB9_3 Depth 3			; CHECK-NEXT: @ Child Loop BB9_3 Depth 3
	; CHECK-NEXT: mul r11, r8, r9			; CHECK-NEXT: mul lr, r8, r12
	; CHECK-NEXT: movs r5, #0			; CHECK-NEXT: vldrw.u32 q0, [sp] @ 16-byte Reload
	; CHECK-NEXT: mul r7, r8, r12			; CHECK-NEXT: movs r7, #0
				; CHECK-NEXT: mul r6, r8, r9
				; CHECK-NEXT: vdup.32 q4, lr
				; CHECK-NEXT: vshl.i32 q4, q4, #2
				; CHECK-NEXT: vadd.i32 q4, q4, r0
				; CHECK-NEXT: vadd.i32 q4, q4, q0
	; CHECK-NEXT: .LBB9_2: @ %vector.ph			; CHECK-NEXT: .LBB9_2: @ %vector.ph
	; CHECK-NEXT: @ Parent Loop BB9_1 Depth=1			; CHECK-NEXT: @ Parent Loop BB9_1 Depth=1
	; CHECK-NEXT: @ => This Loop Header: Depth=2			; CHECK-NEXT: @ => This Loop Header: Depth=2
	; CHECK-NEXT: @ Child Loop BB9_3 Depth 3			; CHECK-NEXT: @ Child Loop BB9_3 Depth 3
	; CHECK-NEXT: vdup.32 q5, r7			; CHECK-NEXT: vldrw.u32 q0, [sp, #16] @ 16-byte Reload
	; CHECK-NEXT: vldrw.u32 q0, [sp] @ 16-byte Reload			; CHECK-NEXT: vmov q7, q2
	; CHECK-NEXT: vshl.i32 q5, q5, #2
	; CHECK-NEXT: vmov q6, q1
	; CHECK-NEXT: vadd.i32 q5, q5, r0
	; CHECK-NEXT: dls lr, r10			; CHECK-NEXT: dls lr, r10
	; CHECK-NEXT: vmov.i32 q4, #0x0			; CHECK-NEXT: vmov.i32 q5, #0x0
	; CHECK-NEXT: vadd.i32 q5, q5, q0			; CHECK-NEXT: vmlas.i32 q7, q0, r7
	; CHECK-NEXT: vmlas.i32 q6, q2, r5			; CHECK-NEXT: vmov q6, q4
	; CHECK-NEXT: .LBB9_3: @ %vector.body			; CHECK-NEXT: .LBB9_3: @ %vector.body
	; CHECK-NEXT: @ Parent Loop BB9_1 Depth=1			; CHECK-NEXT: @ Parent Loop BB9_1 Depth=1
	; CHECK-NEXT: @ Parent Loop BB9_2 Depth=2			; CHECK-NEXT: @ Parent Loop BB9_2 Depth=2
	; CHECK-NEXT: @ => This Inner Loop Header: Depth=3			; CHECK-NEXT: @ => This Inner Loop Header: Depth=3
	; CHECK-NEXT: vadd.i32 q7, q6, q3			; CHECK-NEXT: vadd.i32 q0, q7, q3
	; CHECK-NEXT: vldrw.u32 q0, [r1, q6, uxtw #2]			; CHECK-NEXT: vldrw.u32 q1, [r1, q7, uxtw #2]
	; CHECK-NEXT: vldrw.u32 q6, [q5, #32]!			; CHECK-NEXT: vldrw.u32 q7, [q6, #32]!
	; CHECK-NEXT: vmul.i32 q0, q0, q6			; CHECK-NEXT: vmul.i32 q1, q1, q7
	; CHECK-NEXT: vmov q6, q7			; CHECK-NEXT: vmov q7, q0
	; CHECK-NEXT: vadd.i32 q4, q0, q4			; CHECK-NEXT: vadd.i32 q5, q1, q5
	; CHECK-NEXT: le lr, .LBB9_3			; CHECK-NEXT: le lr, .LBB9_3
	; CHECK-NEXT: @ %bb.4: @ %middle.block			; CHECK-NEXT: @ %bb.4: @ %middle.block
	; CHECK-NEXT: @ in Loop: Header=BB9_2 Depth=2			; CHECK-NEXT: @ in Loop: Header=BB9_2 Depth=2
	; CHECK-NEXT: add.w r4, r5, r11			; CHECK-NEXT: adds r5, r7, r6
	; CHECK-NEXT: adds r5, #1			; CHECK-NEXT: adds r7, #1
	; CHECK-NEXT: vaddv.u32 r6, q4			; CHECK-NEXT: vaddv.u32 r4, q5
	; CHECK-NEXT: cmp r5, r9			; CHECK-NEXT: cmp r7, r9
	; CHECK-NEXT: str.w r6, [r2, r4, lsl #2]			; CHECK-NEXT: str.w r4, [r2, r5, lsl #2]
	; CHECK-NEXT: bne .LBB9_2			; CHECK-NEXT: bne .LBB9_2
	; CHECK-NEXT: @ %bb.5: @ %for.cond4.for.cond.cleanup6_crit_edge.us			; CHECK-NEXT: @ %bb.5: @ %for.cond4.for.cond.cleanup6_crit_edge.us
	; CHECK-NEXT: @ in Loop: Header=BB9_1 Depth=1			; CHECK-NEXT: @ in Loop: Header=BB9_1 Depth=1
	; CHECK-NEXT: add.w r8, r8, #1			; CHECK-NEXT: add.w r8, r8, #1
	; CHECK-NEXT: cmp r8, r3			; CHECK-NEXT: cmp r8, r3
	; CHECK-NEXT: bne .LBB9_1			; CHECK-NEXT: bne .LBB9_1
	; CHECK-NEXT: @ %bb.6: @ %for.end25			; CHECK-NEXT: @ %bb.6: @ %for.end25
	; CHECK-NEXT: add sp, #16			; CHECK-NEXT: add sp, #32
	; CHECK-NEXT: vpop {d8, d9, d10, d11, d12, d13, d14, d15}			; CHECK-NEXT: vpop {d8, d9, d10, d11, d12, d13, d14, d15}
	; CHECK-NEXT: add sp, #4			; CHECK-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, pc}
	; CHECK-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}
	; CHECK-NEXT: .p2align 4			; CHECK-NEXT: .p2align 4
	; CHECK-NEXT: @ %bb.7:			; CHECK-NEXT: @ %bb.7:
	; CHECK-NEXT: .LCPI9_0:			; CHECK-NEXT: .LCPI9_0:
	; CHECK-NEXT: .long 0 @ 0x0			; CHECK-NEXT: .long 0 @ 0x0
	; CHECK-NEXT: .long 2 @ 0x2			; CHECK-NEXT: .long 2 @ 0x2
	; CHECK-NEXT: .long 4 @ 0x4			; CHECK-NEXT: .long 4 @ 0x4
	; CHECK-NEXT: .long 6 @ 0x6			; CHECK-NEXT: .long 6 @ 0x6
	; CHECK-NEXT: .LCPI9_1:			; CHECK-NEXT: .LCPI9_1:

	define hidden arm_aapcs_vfpcc i32 @arm_depthwise_conv_s8(i8* nocapture readonly %input, i16 zeroext %input_x, i16 zeroext %input_y, i16 zeroext %input_ch, i8* nocapture readonly %kernel, i16 zeroext %output_ch, i16 zeroext %ch_mult, i16 zeroext %kernel_x, i16 zeroext %kernel_y, i16 zeroext %pad_x, i16 zeroext %pad_y, i16 zeroext %stride_x, i16 zeroext %stride_y, i32* nocapture readonly %bias, i8* nocapture %output, i32* nocapture readonly %output_shift, i32* nocapture readonly %output_mult, i16 zeroext %output_x, i16 zeroext %output_y, i32 %output_offset, i32 %input_offset, i32 %output_activation_min, i32 %output_activation_max, i16 zeroext %dilation_x, i16 zeroext %dilation_y, i16* nocapture readnone %buffer_a) local_unnamed_addr #0 {			define hidden arm_aapcs_vfpcc i32 @arm_depthwise_conv_s8(i8* nocapture readonly %input, i16 zeroext %input_x, i16 zeroext %input_y, i16 zeroext %input_ch, i8* nocapture readonly %kernel, i16 zeroext %output_ch, i16 zeroext %ch_mult, i16 zeroext %kernel_x, i16 zeroext %kernel_y, i16 zeroext %pad_x, i16 zeroext %pad_y, i16 zeroext %stride_x, i16 zeroext %stride_y, i32* nocapture readonly %bias, i8* nocapture %output, i32* nocapture readonly %output_shift, i32* nocapture readonly %output_mult, i16 zeroext %output_x, i16 zeroext %output_y, i32 %output_offset, i32 %input_offset, i32 %output_activation_min, i32 %output_activation_max, i16 zeroext %dilation_x, i16 zeroext %dilation_y, i16* nocapture readnone %buffer_a) local_unnamed_addr #0 {
	; CHECK-LABEL: arm_depthwise_conv_s8:			; CHECK-LABEL: arm_depthwise_conv_s8:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r4, r5, r6, r7, r8, r9, r10, r11, lr}			; CHECK-NEXT: .save {r4, r5, r6, r7, r8, r9, r10, r11, lr}
	; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}			; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}
	; CHECK-NEXT: .pad #4			; CHECK-NEXT: .pad #4
	; CHECK-NEXT: sub sp, #4			; CHECK-NEXT: sub sp, #4
	; CHECK-NEXT: .vsave {d8, d9, d10, d11, d12, d13}			; CHECK-NEXT: .vsave {d8, d9, d10, d11, d12, d13, d14, d15}
	; CHECK-NEXT: vpush {d8, d9, d10, d11, d12, d13}			; CHECK-NEXT: vpush {d8, d9, d10, d11, d12, d13, d14, d15}
	; CHECK-NEXT: .pad #8			; CHECK-NEXT: .pad #24
	; CHECK-NEXT: sub sp, #8			; CHECK-NEXT: sub sp, #24
	; CHECK-NEXT: ldrd r2, r7, [sp, #104]			; CHECK-NEXT: ldrd r2, r7, [sp, #136]
	; CHECK-NEXT: add.w r8, r7, #10			; CHECK-NEXT: add.w r8, r7, #10
	; CHECK-NEXT: adr r7, .LCPI11_0			; CHECK-NEXT: adr r7, .LCPI11_0
	; CHECK-NEXT: ldr r1, [sp, #96]			; CHECK-NEXT: ldr r1, [sp, #128]
	; CHECK-NEXT: vdup.32 q0, r2			; CHECK-NEXT: vdup.32 q0, r2
	; CHECK-NEXT: vldrw.u32 q1, [r7]			; CHECK-NEXT: vldrw.u32 q1, [r7]
	; CHECK-NEXT: mov.w r10, #0			; CHECK-NEXT: movs r4, #0
	; CHECK-NEXT: mov.w r9, #6			; CHECK-NEXT: mov.w r10, #6
	; CHECK-NEXT: movs r6, #11			; CHECK-NEXT: movs r6, #11
	; CHECK-NEXT: vshl.i32 q0, q0, #2			; CHECK-NEXT: vshl.i32 q0, q0, #2
	; CHECK-NEXT: movs r5, #0			; CHECK-NEXT: movs r5, #0
	; CHECK-NEXT: .LBB11_1: @ %for.body10.i			; CHECK-NEXT: .LBB11_1: @ %for.body10.i
	; CHECK-NEXT: @ =>This Loop Header: Depth=1			; CHECK-NEXT: @ =>This Loop Header: Depth=1
	; CHECK-NEXT: @ Child Loop BB11_2 Depth 2			; CHECK-NEXT: @ Child Loop BB11_2 Depth 2
	; CHECK-NEXT: @ Child Loop BB11_3 Depth 3			; CHECK-NEXT: @ Child Loop BB11_3 Depth 3
	; CHECK-NEXT: @ Child Loop BB11_4 Depth 4			; CHECK-NEXT: @ Child Loop BB11_4 Depth 4
	; CHECK-NEXT: @ Child Loop BB11_5 Depth 5			; CHECK-NEXT: @ Child Loop BB11_5 Depth 5
	; CHECK-NEXT: movs r7, #0			; CHECK-NEXT: mov.w r9, #0
	; CHECK-NEXT: str r5, [sp, #4] @ 4-byte Spill			; CHECK-NEXT: str r5, [sp, #4] @ 4-byte Spill
	; CHECK-NEXT: .LBB11_2: @ %for.cond22.preheader.i			; CHECK-NEXT: .LBB11_2: @ %for.cond22.preheader.i
	; CHECK-NEXT: @ Parent Loop BB11_1 Depth=1			; CHECK-NEXT: @ Parent Loop BB11_1 Depth=1
	; CHECK-NEXT: @ => This Loop Header: Depth=2			; CHECK-NEXT: @ => This Loop Header: Depth=2
	; CHECK-NEXT: @ Child Loop BB11_3 Depth 3			; CHECK-NEXT: @ Child Loop BB11_3 Depth 3
	; CHECK-NEXT: @ Child Loop BB11_4 Depth 4			; CHECK-NEXT: @ Child Loop BB11_4 Depth 4
	; CHECK-NEXT: @ Child Loop BB11_5 Depth 5			; CHECK-NEXT: @ Child Loop BB11_5 Depth 5
	; CHECK-NEXT: movs r5, #0			; CHECK-NEXT: movs r7, #0
				; CHECK-NEXT: vdup.32 q2, r9
				; CHECK-NEXT: vstrw.32 q2, [sp, #8] @ 16-byte Spill
	; CHECK-NEXT: .LBB11_3: @ %for.body27.i			; CHECK-NEXT: .LBB11_3: @ %for.body27.i
	; CHECK-NEXT: @ Parent Loop BB11_1 Depth=1			; CHECK-NEXT: @ Parent Loop BB11_1 Depth=1
	; CHECK-NEXT: @ Parent Loop BB11_2 Depth=2			; CHECK-NEXT: @ Parent Loop BB11_2 Depth=2
	; CHECK-NEXT: @ => This Loop Header: Depth=3			; CHECK-NEXT: @ => This Loop Header: Depth=3
	; CHECK-NEXT: @ Child Loop BB11_4 Depth 4			; CHECK-NEXT: @ Child Loop BB11_4 Depth 4
	; CHECK-NEXT: @ Child Loop BB11_5 Depth 5			; CHECK-NEXT: @ Child Loop BB11_5 Depth 5
	; CHECK-NEXT: dls lr, r9			; CHECK-NEXT: dls lr, r10
	; CHECK-NEXT: mov.w r12, #0			; CHECK-NEXT: mov.w r12, #0
	; CHECK-NEXT: mov.w r11, #4			; CHECK-NEXT: mov.w r11, #4
				; CHECK-NEXT: vdup.32 q3, r7
	; CHECK-NEXT: .LBB11_4: @ %for.body78.us.i			; CHECK-NEXT: .LBB11_4: @ %for.body78.us.i
	; CHECK-NEXT: @ Parent Loop BB11_1 Depth=1			; CHECK-NEXT: @ Parent Loop BB11_1 Depth=1
	; CHECK-NEXT: @ Parent Loop BB11_2 Depth=2			; CHECK-NEXT: @ Parent Loop BB11_2 Depth=2
	; CHECK-NEXT: @ Parent Loop BB11_3 Depth=3			; CHECK-NEXT: @ Parent Loop BB11_3 Depth=3
	; CHECK-NEXT: @ => This Loop Header: Depth=4			; CHECK-NEXT: @ => This Loop Header: Depth=4
	; CHECK-NEXT: @ Child Loop BB11_5 Depth 5			; CHECK-NEXT: @ Child Loop BB11_5 Depth 5
	; CHECK-NEXT: mul r4, r11, r6			; CHECK-NEXT: mul r5, r11, r6
	; CHECK-NEXT: vdup.32 q3, r5			; CHECK-NEXT: vmov q4, q3
	; CHECK-NEXT: vdup.32 q2, r7			; CHECK-NEXT: vadd.i32 q5, q1, r5
	; CHECK-NEXT: vadd.i32 q4, q1, r4			; CHECK-NEXT: vmla.i32 q4, q5, r2
	; CHECK-NEXT: vmla.i32 q3, q4, r2			; CHECK-NEXT: vldrw.u32 q5, [sp, #8] @ 16-byte Reload
	; CHECK-NEXT: adds r4, #113			; CHECK-NEXT: adds r5, #113
	; CHECK-NEXT: vadd.i32 q4, q1, r4			; CHECK-NEXT: vadd.i32 q6, q1, r5
	; CHECK-NEXT: mov r4, r8			; CHECK-NEXT: mov r5, r8
	; CHECK-NEXT: vmla.i32 q2, q4, r2			; CHECK-NEXT: vmla.i32 q5, q6, r2
	; CHECK-NEXT: .LBB11_5: @ %vector.body			; CHECK-NEXT: .LBB11_5: @ %vector.body
	; CHECK-NEXT: @ Parent Loop BB11_1 Depth=1			; CHECK-NEXT: @ Parent Loop BB11_1 Depth=1
	; CHECK-NEXT: @ Parent Loop BB11_2 Depth=2			; CHECK-NEXT: @ Parent Loop BB11_2 Depth=2
	; CHECK-NEXT: @ Parent Loop BB11_3 Depth=3			; CHECK-NEXT: @ Parent Loop BB11_3 Depth=3
	; CHECK-NEXT: @ Parent Loop BB11_4 Depth=4			; CHECK-NEXT: @ Parent Loop BB11_4 Depth=4
	; CHECK-NEXT: @ => This Inner Loop Header: Depth=5			; CHECK-NEXT: @ => This Inner Loop Header: Depth=5
	; CHECK-NEXT: vldrb.s32 q6, [r0, q2]			; CHECK-NEXT: vldrb.s32 q2, [r0, q5]
	; CHECK-NEXT: vadd.i32 q5, q2, q0			; CHECK-NEXT: vadd.i32 q7, q5, q0
	; CHECK-NEXT: vadd.i32 q4, q3, q0			; CHECK-NEXT: vldrb.s32 q5, [r1, q4]
	; CHECK-NEXT: subs r4, #4			; CHECK-NEXT: vadd.i32 q6, q4, q0
	; CHECK-NEXT: vadd.i32 q2, q6, r2			; CHECK-NEXT: vadd.i32 q2, q2, r2
	; CHECK-NEXT: vldrb.s32 q6, [r1, q3]			; CHECK-NEXT: subs r5, #4
	; CHECK-NEXT: vmov q3, q4			; CHECK-NEXT: vmlava.u32 r12, q2, q5
	; CHECK-NEXT: vmlava.u32 r12, q2, q6			; CHECK-NEXT: vmov q5, q7
	; CHECK-NEXT: vmov q2, q5			; CHECK-NEXT: vmov q4, q6
	; CHECK-NEXT: bne .LBB11_5			; CHECK-NEXT: bne .LBB11_5
	; CHECK-NEXT: @ %bb.6: @ %middle.block			; CHECK-NEXT: @ %bb.6: @ %middle.block
	; CHECK-NEXT: @ in Loop: Header=BB11_4 Depth=4			; CHECK-NEXT: @ in Loop: Header=BB11_4 Depth=4
	; CHECK-NEXT: add.w r11, r11, #1			; CHECK-NEXT: add.w r11, r11, #1
	; CHECK-NEXT: le lr, .LBB11_4			; CHECK-NEXT: le lr, .LBB11_4
	; CHECK-NEXT: @ %bb.7: @ %for.cond.cleanup77.i			; CHECK-NEXT: @ %bb.7: @ %for.cond.cleanup77.i
	; CHECK-NEXT: @ in Loop: Header=BB11_3 Depth=3			; CHECK-NEXT: @ in Loop: Header=BB11_3 Depth=3
	; CHECK-NEXT: adds r5, #1			; CHECK-NEXT: adds r7, #1
	; CHECK-NEXT: add.w r10, r10, #1			; CHECK-NEXT: adds r4, #1
	; CHECK-NEXT: cmp r5, r2			; CHECK-NEXT: cmp r7, r2
	; CHECK-NEXT: bne .LBB11_3			; CHECK-NEXT: bne .LBB11_3
	; CHECK-NEXT: @ %bb.8: @ %for.cond.cleanup26.i			; CHECK-NEXT: @ %bb.8: @ %for.cond.cleanup26.i
	; CHECK-NEXT: @ in Loop: Header=BB11_2 Depth=2			; CHECK-NEXT: @ in Loop: Header=BB11_2 Depth=2
	; CHECK-NEXT: adds r7, #1			; CHECK-NEXT: add.w r9, r9, #1
	; CHECK-NEXT: cmp r7, r3			; CHECK-NEXT: cmp r9, r3
	; CHECK-NEXT: bne .LBB11_2			; CHECK-NEXT: bne .LBB11_2
	; CHECK-NEXT: @ %bb.9: @ %for.cond.cleanup20.i			; CHECK-NEXT: @ %bb.9: @ %for.cond.cleanup20.i
	; CHECK-NEXT: @ in Loop: Header=BB11_1 Depth=1			; CHECK-NEXT: @ in Loop: Header=BB11_1 Depth=1
	; CHECK-NEXT: ldr r5, [sp, #4] @ 4-byte Reload			; CHECK-NEXT: ldr r5, [sp, #4] @ 4-byte Reload
	; CHECK-NEXT: ldr r7, [sp, #148]			; CHECK-NEXT: ldr r7, [sp, #180]
	; CHECK-NEXT: adds r5, #1			; CHECK-NEXT: adds r5, #1
	; CHECK-NEXT: cmp r5, r7			; CHECK-NEXT: cmp r5, r7
	; CHECK-NEXT: it eq			; CHECK-NEXT: it eq
	; CHECK-NEXT: moveq r5, #0			; CHECK-NEXT: moveq r5, #0
	; CHECK-NEXT: b .LBB11_1			; CHECK-NEXT: b .LBB11_1
	; CHECK-NEXT: .p2align 4			; CHECK-NEXT: .p2align 4
	; CHECK-NEXT: @ %bb.10:			; CHECK-NEXT: @ %bb.10:
	; CHECK-NEXT: .LCPI11_0:			; CHECK-NEXT: .LCPI11_0:

llvm/test/CodeGen/WebAssembly/reg-stackify.ll

define i32 @commute_to_fix_ordering(i32 %arg) {
ret i32 %tmp6		ret i32 %tmp6
}		}

; Stackify individual defs of virtual registers with multiple defs.		; Stackify individual defs of virtual registers with multiple defs.

; CHECK-LABEL: multiple_defs:		; CHECK-LABEL: multiple_defs:
; CHECK: f64.add $push[[NUM0:[0-9]+]]=, ${{[0-9]+}}, $pop{{[0-9]+}}{{$}}		; CHECK: f64.add $push[[NUM0:[0-9]+]]=, ${{[0-9]+}}, $pop{{[0-9]+}}{{$}}
; CHECK-NEXT: local.tee $push[[NUM1:[0-9]+]]=, $[[NUM2:[0-9]+]]=, $pop[[NUM0]]{{$}}		; CHECK-NEXT: local.tee $push[[NUM1:[0-9]+]]=, $[[NUM2:[0-9]+]]=, $pop[[NUM0]]{{$}}
; CHECK-NEXT: f64.select $push{{[0-9]+}}=, $pop{{[0-9]+}}, $pop[[NUM1]], ${{[0-9]+}}{{$}}		; CHECK-NEXT: f64.select ${{[0-9]+}}=, $pop{{[0-9]+}}, $pop[[NUM1]], ${{[0-9]+}}{{$}}
; CHECK: $[[NUM2]]=,
; NOREGS-LABEL: multiple_defs:		; NOREGS-LABEL: multiple_defs:
; NOREGS: f64.add		; NOREGS: f64.add
; NOREGS: local.tee		; NOREGS: local.tee
; NOREGS: f64.select		; NOREGS: f64.select
define void @multiple_defs(i32 %arg, i32 %arg1, i1 %arg2, i1 %arg3, i1 %arg4) {		define void @multiple_defs(i32 %arg, i32 %arg1, i1 %arg2, i1 %arg3, i1 %arg4) {
bb:		bb:
br label %bb5		br label %bb5

Revision Contents

Diff 557099

llvm/lib/CodeGen/MachineLICM.cpp

llvm/test/CodeGen/AArch64/machine-licm-sub-loop.ll

llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll

llvm/test/CodeGen/AMDGPU/exec-mask-opt-cannot-create-empty-or-backward-segment.ll

llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll

llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll

llvm/test/CodeGen/Thumb2/mve-gather-scatter-optimisation.ll

llvm/test/CodeGen/WebAssembly/reg-stackify.ll
