This is an archive of the discontinued LLVM Phabricator instance.

[MachineLICM] don't always hoist rematerializable instructions
AbandonedPublic

Authored by shchenz on Jun 28 2020, 1:43 AM.

Details

Reviewers
hfinkel
jsji
nemanjai
efriedma
arsenm
qcolombet
dmgreen
Group Reviewers
Restricted Project
Summary

MachineLICM has an issue when hoisting rematerializable instructions.

%64:g8rc = LIS8 585
%65:g8rc = ORI8 killed %64:g8rc, 61440
%66:g8rc_and_g8rc_nox0 = STDUX %21:g8rc, %18:g8rc_and_g8rc_nox0(tied-def 0), killed %65:g8rc :: (store 8 into %ir.49, !tbaa !2)

%64:g8rc = LIS8 585 is rematerializable, so MachineLICM hoists it out of the loop without considering register pressure.

After %64:g8rc = LIS8 585 is hoisted out, %65:g8rc = ORI8 killed %64:g8rc, 61440 is also hoisted out, because hoisting it lowers register pressure:

// - When hoisting the last use of a value in the loop, that value no longer
//   needs to be live in the loop. This lowers register pressure in the loop

So no matter how many pattern groups like the above there are, MachineLICM will hoist all of them. When we have more than 32 such groups on PowerPC, hoisting all of them out of the loop causes spilling in RA.

This patch tries to fix the above issue:
1: Override the target hook shouldHoistCheapInstructions so that MachineLICM hoists cheap rematerializable instructions based on register pressure.
2: When MachineLICM sees a cheap rematerializable instruction, it also checks the instruction's user inside the loop. If the user is also loop invariant, it does not hoist the instruction blindly; instead it hoists it only after evaluating the register pressure.

Diff Detail

Repository
rL LLVM

Event Timeline

shchenz created this revision.Jun 28 2020, 1:43 AM
shchenz edited the summary of this revision. (Show Details)Jun 28 2020, 1:49 AM

When MachineLICM sees a cheap rematerializable instruction, it also checks the instruction's user inside the loop. If the user is also loop invariant, it does not hoist the instruction blindly; instead it hoists it only after evaluating the register pressure.

Wouldn't it be simpler to just change the "When hoisting the last use of a value in the loop, that value no longer needs to be live in the loop" check? We could check if the value is trivially rematerializable; if it is, we're not really helping register pressure by hoisting.

llvm/lib/Target/PowerPC/PPCInstrInfo.h
318

I'm not really happy about adding target-specific heuristics to MachineLICM. Each target has its own cost model to some extent; we might want to use different rules for specific instructions, or maybe specific register files if they're particularly tiny. But I'd like to avoid knobs that universally change the way the algorithm works. If the core algorithm changes depending on the target, that makes it much harder to understand the way the code is supposed to work, or make any changes in the future, or implement the hook appropriately.

When MachineLICM sees a cheap rematerializable instruction, it also checks the instruction's user inside the loop. If the user is also loop invariant, it does not hoist the instruction blindly; instead it hoists it only after evaluating the register pressure.

Wouldn't it be simpler to just change the "When hoisting the last use of a value in the loop, that value no longer needs to be live in the loop" check? We could check if the value is trivially rematerializable; if it is, we're not really helping register pressure by hoisting.

@efriedma Do you mean changing the register pressure estimate model in the function calcRegisterCost and hoisting rematerializable instructions all the time in MachineLICM, but keeping their last uses inside the loop, so RA can sink the rematerializable instructions down when necessary?

I made another fix like:

diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp
index 98638b9..88e382c 100644
--- a/llvm/lib/CodeGen/MachineLICM.cpp
+++ b/llvm/lib/CodeGen/MachineLICM.cpp
@@ -911,14 +911,18 @@ MachineLICMBase::calcRegisterCost(const MachineInstr *MI, bool ConsiderSeen,
 
     RegClassWeight W = TRI->getRegClassWeight(RC);
     int RCCost = 0;
-    if (MO.isDef())
+    // If MI is a def of a rematerializable instruction and it has only one use,
+    // we can treat it as no cost because RA can sink MI right before its user.
+    if (MO.isDef() && !(TII->isTriviallyReMaterializable(*MI, AA) && 
+                        MRI->hasOneNonDBGUse(Reg)))
       RCCost = W.RegWeight;
     else {
       bool isKill = isOperandKill(MO, MRI);
+      bool isRemat = TII->isTriviallyReMaterializable(*MRI->getVRegDef(Reg), AA); 
       if (isNew && !isKill && ConsiderUnseenAsDef)
         // Haven't seen this, it must be a livein.
         RCCost = W.RegWeight;
-      else if (!isNew && isKill)
+      else if (!isNew && isKill && !isRemat)
         RCCost = -W.RegWeight;
     }

This change is simpler than what I did in this patch (hoisting cheap instructions based on register pressure). And with the above change, MachineLICM no longer hoists all the pattern groups, as expected. But unfortunately, RA cannot sink the hoisted rematerializable instructions (LIS) back down, so there are still many spills.

I am thinking this may not be a good solution either, as it ties the two passes, MachineLICM and RA, together. It may also be hard to maintain: if we change one pass, we must also consider the other.

I prefer to keep all the logic in MachineLICM: we check the register pressure for the rematerializable instruction directly and do not hoist it under high register pressure, so the rematerializable instruction's last use in the loop will not be hoisted automatically either.

What do you think?

shchenz marked an inline comment as done.Jun 30 2020, 12:13 AM
shchenz added inline comments.
llvm/lib/Target/PowerPC/PPCInstrInfo.h
318

Yeah, I agree with you on the policy for adding hooks. There is a conflicting place in MachineLICM on PowerPC:
1: The hasLowDefLatency override on PowerPC in commit https://reviews.llvm.org/rL258142 indicates that all instructions, including cheap ones, should be hoisted out of the loop.
2: In the function MachineLICMBase::CanCauseHighRegPressure, there are these statements:

// Don't hoist cheap instructions if they would increase register pressure,
// even if we're under the limit.
if (CheapInstr && !HoistCheapInsts)
  return true;

Here I just want to make it consistent on the PowerPC target. But it seems strange to have two different hooks... Maybe we need another patch to improve this.

@efriedma Hi, could you please take another look at the comment https://reviews.llvm.org/D82709#2121904? Do you still prefer to implement this through the collaboration of MachineLICM and RA? If so, does the fix to MachineLICM in the above comment make sense to you? Thanks.

But unfortunately, RA cannot sink the hoisted rematerializable instructions (LIS) back down, so there are still many spills.

"Cannot", as in the target hooks for remat forbid it somehow? Or are the heuristics somehow favoring spilling over remat?

If we trust the register allocator to remat appropriately, we can just hoist everything without worrying about it. If we can't trust the register allocator, that means we're assuming the instruction won't be rematerialized, so we shouldn't be checking if it's rematerializable in the first place.

I am thinking this may not be a good solution either, as it ties the two passes, MachineLICM and RA, together. It may also be hard to maintain: if we change one pass, we must also consider the other.

They're sort of tied together anyway in a general sense: given we have a register pressure heuristic, it needs to be aware of what the register allocator is actually going to do.

llvm/lib/Target/PowerPC/PPCInstrInfo.h
318

I assume you meant to link https://reviews.llvm.org/rL225471 ?

I think this is all tied together; it doesn't really make sense to push it off to later.

But unfortunately, RA cannot sink the hoisted rematerializable instructions (LIS) back down, so there are still many spills.

"Cannot", as in the target hooks for remat forbid it somehow? Or are the heuristics somehow favoring spilling over remat?

If we trust the register allocator to remat appropriately, we can just hoist everything without worrying about it. If we can't trust the register allocator, that means we're assuming the instruction won't be rematerialized, so we shouldn't be checking if it's rematerializable in the first place.

Rereading this, I should probably say a bit more. I think the patch in https://reviews.llvm.org/D82709#2121904 makes sense in a world where we trust the register allocator. In a world where we don't trust the register allocator, we should just delete the isTriviallyReMaterializable() check completely. Either way, we need a consistent model. The original patch is essentially saying remat only works for instructions that have non-loop-invariant operands, and that doesn't really make sense.


Not really related to the contents of this patch, but some targets, like ARM, use a pseudo-instruction for integer immediates, and expand it after register allocation; this makes remat more effective.

shchenz added a comment.EditedJul 7 2020, 5:35 PM

OK, I will look into RA and see what's wrong here on the PowerPC target. It seems RA cannot sink any remat instructions at all on PowerPC, as the case here should be very common.

Also thanks very much for the info about how ARM handles big immediates. To be honest, that was also my favorite solution when I first came to this issue. On PowerPC we expand a big immediate in ISel in different ways for different immediates, which expands one remat instruction into 2 or more non-remat instructions.

Using a pseudo for loading a big immediate also depends on RA being able to sink rematerializable instructions well. I will look at the issue in RA first.

Thanks very much for your good comments.

shchenz planned changes to this revision.Jul 7 2020, 5:35 PM
shchenz added a comment.EditedAug 20 2020, 9:17 AM

Hi @efriedma, after a long investigation into greedy register allocation, I have some findings. I think the reason why the remat LIS is not sunk down by RA as we expected is a limitation of the current greedy register allocation. Hi @qcolombet, sorry to bother you; if I am wrong in the comments about greedy RA below, please correct me. ^-^

After MachineLICM (with all LIS and some ORI hoisted up), for the newly added test case, we get:

bb0:        ; outter loop preheader
    outteruse1 = 
    outteruse2 = 
    ....
    outteruseN = 
    ....
    lisvar1 = LIS
    orivar1 = ORI lisvar1
    lisvar2 = LIS
    orivar2 = ORI lisvar2
    ....
    lisvarm  = LIS
    orivarm = ORI lisvarm        <------ m ORI (together with related LIS) are hoisted out under register pressure.
    lisvarm+1 = LIS
    lisvarm+2 = LIS
    ...
    lisvarN = LIS        <------ all LIS are hoisted out because of remat.
bb1:            ;  inner loop preheader
    MTCTR8loop    <-------hardware loop, set loop count
bb2:
    std orivar1
    std orivar2
    ......
    std orivarm
    orivarm+1 = ORI lisvarm+1
    std orivarm+1
    orivarm+2 = ORI lisvarm+2
    std orivarm+2
    ......
    orivarN = ORI lisvarN
    std orivarN
    bdnz bb2   <--------hardware loop, test count register and branch
bb3:
    std outteruse1
    ....
    std outteruseN
    conditional-branch bb1, bb4
bb4:
  ret

In greedyRA, all live intervals are put inside a priority queue, and live intervals with high priority are assigned physical registers first. The bigger a live interval's size, the higher its priority. So in the above code sequence, outteruse1, ..., outteruseN are assigned physical registers earlier than the lisvar and orivar values.

So after the greedyRA stages RS_Assign and RS_Split, the outteruseN intervals are the first to enter the RS_Spill stage. The issue here is that when we try to spill for outteruseN, greedyRA will not try to rematerialize the low-priority remat LIS instructions in advance. I think maybe this is why it is called greedy register allocation: it always handles live intervals one by one. After spilling for outteruseN, greedy register allocation marks allocation for this live interval as done; it won't be changed later.

(When a remat instruction needs to be spilled, it is rematerialized in front of its uses as expected; see InlineSpiller::spill() -> reMaterializeAll().)

After greedy register allocation, code sequence is like:

bb0:        ; outter loop preheader
    outteruse1 = 
    spill outteruse1 to stack.1   <------ spill; these spills can be saved if we rematerialize all the below LIS to their uses.
    outteruse2 = 
    spill outteruse2 to stack.2   <------ spill
    ....
    outteruseN = 
    spill outteruseN to stack.N  <------ spill
    ....
    lisvar1 = LIS
    orivar1 = ORI lisvar1
    lisvar2 = LIS
    orivar2 = ORI lisvar2
    ....
    lisvarm  = LIS
    orivarm = ORI lisvarm
    ...
    lisvarN = LIS        <------ not all of the remat LIS are rematerialized because there is no need to do that, outteruse are already spilled.
bb1:            ;  inner loop preheader
    MTCTR8loop
bb2:
    std orivar1
    std orivar2
    ......
    std orivarm
     lisvarm+1 = LIS     <------ rematerialized
    orivarm+1 = ORI lisvarm+1
    std orivarm+1
    lisvarm+2 = LIS       <------rematerialized
    orivarm+2 = ORI lisvarm+2
    std orivarm+2
    ......
    orivarN = ORI lisvarN
    std orivarN
    bdnz bb2
bb3:
    reload outteruse1 from stack.1 <------reload
    std outteruse1
    ....
    reload outteruseN from stack.N  <------reload
    std outteruseN
    conditional-branch bb1, bb4
bb4:
  ret

greedyRA cannot foresee, when it tries to spill some high-priority non-remat registers, that there are many low-priority remat instructions still in its priority queue. This is a limitation of greedy register allocation. So I think maybe the best way is for MachineLICM to hoist the LIS instructions based on register pressure as well.

Sorry for the long comments, @efriedma. Your comments are quite welcome. BTW: we found noticeable improvements in some benchmarks with this change on the PowerPC target.

shchenz requested review of this revision.Aug 20 2020, 9:18 AM

Hi,

I am a bit confused by what is expected of RA in this case.

So after the greedyRA stages RS_Assign and RS_Split, the outteruseN intervals are the first to enter the RS_Spill stage. The issue here is that when we try to spill for outteruseN, greedyRA will not try to rematerialize the low-priority remat LIS instructions in advance

If LIS is low priority in your example, doesn't that mean it has not been assigned anything at the point we're looking at outteruseN?
In other words, if we don't have any register left for outteruseN at this point, that means that rematerializing LIS won't help (since it is not assigned).

Could you share the debug output of regalloc? (-debug-only regalloc)

Cheers,
-Quentin

lkail added a subscriber: lkail.Aug 21 2020, 3:10 PM
shchenz added a comment.EditedAug 21 2020, 7:38 PM

Hi @qcolombet, the log for the newly added case (with the change in https://reviews.llvm.org/D82709#2121904) is posted at https://reviews.llvm.org/P8231

We expect that outteruse1 should not be spilled; for example, you can check virtual register %0 in the log file.
This can be achieved by rematerializing all LIS to their uses inside the loop. My first try was to do this in MachineLICM, and I found it works. After hoisting LIS based on register pressure, there are no/fewer spills both inside the inner loop and in the outer loop.

But we want to seek a fix inside RA without changing the remat-instruction hoisting logic inside MachineLICM.

MachineLICM expects greedy register allocation to remat all the LIS, so MachineLICM hoists these remat instructions without considering register pressure.

But inside register allocation, that is not always the case.
For now, %0 has high priority (big live interval size) but low spill weight (used in the outer loop), and it is spilled before all LIS are rematerialized. So the spill for %0 is kept in the outer loop after RA.
But as we can see, after RA there are some remat instructions, like %267 ~ %309, that could be rematerialized into the inner loop and reuse %x9 like the other LIS instructions already inside that loop.

Do you have any idea about how to fix this in greedy RA? Thanks.

Hi @qcolombet, for the log in https://reviews.llvm.org/P8231, you can just look at the RA result at the end. Ideally, we could remat registers %267 ~ %309 (2132B - 2216B) to their uses, freeing some physical registers and assigning them to %166 ~ %190 to save the spills.

I checked the log: when we spill for %0, the LIS instructions, for example %157, are assigned physical registers. But it seems RA will only remat instructions when the remat instructions themselves are about to be spilled?

We could add logic inside selectOrSplitImpl like this: when we try to spill a non-remat instruction, we check whether there is any remat instruction assigned a physical register. If there is one, we first remat that instruction and assign its physical register to the one being spilled. Is this a reasonable change in RA?

Thanks for your comments in advance.

Hi @qcolombet, I saw you commented on another patch about greedy register pressure, saying something like "The greedy allocator is already very complicated". I am not sure whether my proposed change to greedyRA in the above comment would increase that complexity, or whether it is the right way to go. Could you please help confirm this? Thanks.

We could add logic inside selectOrSplitImpl like this: when we try to spill a non-remat instruction, we check whether there is any remat instruction assigned a physical register. If there is one, we first remat that instruction and assign its physical register to the one being spilled?

If we cannot depend on the register allocator to rematerialize all required rematerializable instructions, the approach proposed in this patch should be reasonable. Instead of hoisting all rematerializable instructions without considering register pressure in MachineLICM and letting the register allocator rematerialize them afterwards, we now hoist the rematerializable instructions based on register pressure as well.

Sorry for pinging this patch after such a long time. We need this patch on the PowerPC target, and I think it should also benefit other targets.

Hi @shchenz,

Sorry for the late reply; I missed your update.

I checked the log: when we spill for %0, the LIS instructions, for example %157, are assigned physical registers. But it seems RA will only remat instructions when the remat instructions themselves are about to be spilled?

That's correct, but a live-range can still evict another one, e.g., when we assign %88, we evict %0.

We should add logic inside selectOrSplitImpl like: when we try to spill a non-remat instruction, we should check if there is any remat instruction assigned with a physical register. If there is one, we should first remat the remat instruction and assign the physical register to the spilled one? Is this a reasonable change in RA?

That part should actually be covered. If you look at RAGreedy::shouldEvict, a live interval can evict another one as long as its weight is bigger than the other's. Generally, the weight of rematerializable live intervals is small enough that they can be evicted pretty much all the time.
It doesn't happen in your case because the weight of %0 is pretty small and, in particular, smaller than that of the rematerializable live intervals (e.g., %157).

At this point, I would check two things:

  1. Are the weights accurate? E.g., maybe the frequency estimate missed a loop?
  2. Is %157 (or whatever rematerializable interval) considered when we call shouldEvict for %0?

If the answer to #2 is yes, then I think your problem is with #1. If it is no, we should check why.

Now, if the weights for #1 are accurate, then the cost model means that it is more expensive to rematerialize than to spill.

Cheers,
-Quentin

shchenz added a comment.EditedSep 10 2020, 6:02 PM

It doesn't happen in your case because the weight of %0 is pretty small and, in particular, smaller than that of the rematerializable live intervals (e.g., %157).

Yes, %0 is a live interval in the outer loop while the rematerializable live intervals are in the inner loop, so I think that is why the weight of %0 is smaller than that of the rematerializable live intervals (e.g., %157). The weights for %0 and %157 should therefore be accurate? I tried a hack:

if (isRematerializable(li, LIS, VRM, *MF.getSubtarget().getInstrInfo()))
  totalWeight *= 0.5F;

Changing 0.5 to a smaller value reduces the spill count for the new lit case in the inner loop, but the result is not as good as with this patch: more spills are generated in the outer loop, while this patch also reduces the spills in the outer loop a lot.

2: Is %157 (or whatever rematerializable interval) considered when we call shouldEvict for %0?

I will do more investigation for this.

Thanks again for your confirmation. @qcolombet

shchenz planned changes to this revision.Sep 17 2020, 7:35 PM

@qcolombet @efriedma, sorry for the late response and the long mail. I made a summary of this issue.

Before machine LICM:

entry:
    outer-non-remat-def

outer loop:
    inner loop:
        inner-remat-def
        inner-use
        b inner loop
    outer-use
    b outer loop

After machine LICM:

entry:
    outer-non-remat-def
    inner-remat-def          ———>all remat definitions are hoisted out
outer loop:
    inner loop:
        inner-use
        b inner loop
    outer-use
    b outer loop

MachineLICM depends on the RA (register allocator) rematerializing the inner-remat-def down to its inner-loop users. This is OK when there is a single-level loop. But here we have a two-level loop, and RA does not work as MachineLICM expects.

If we hoist all remat definitions in the MachineLICM pass:
RA can make sure the loop containing the inner users has no spills, but it cannot make sure the outer loop has no spills.

The process in greedy register allocation after we hoist all remat instructions to the entry block is like this:
1: Outer defs have higher allocation priority than inner defs because outer defs have larger live intervals. But outer defs have smaller spill weight because the inner defs' users are in the inner loop.
2.1: First, RA allocates physical registers to the outer defs.
2.2: Then RA allocates physical registers to the inner defs, evicting the physical registers assigned to the outer defs when it finds there are not enough physical registers for the inner defs (outer defs have smaller spill weight).
2.3: The outer-def virtual registers are then in the split stage, and in the second allocation round, since they are all in the split stage, RA will spill them without trying to evict any virtual registers.
2.4: For the inner-remat-def virtual registers, RA will try its best to assign physical registers to some of them and split the others into the outer loop in the second round.
2.5: Finally, when RA allocates registers for the inner users, it remats the remat instructions down in front of the inner users to make sure there is no spill/reload in the inner loop.

After greedy register allocation:

entry:
    outer-non-remat-def
    some of inner-remat-def 
outer loop:
    some inner-remat-def       // split for inner-remat-defs
    inner loop:
        some inner-remat-def // remat when allocates for inner-users
        inner-use
        b inner loop
    ;;;;;; reload happens in outer loop
    outer-use
    b outer loop

So the issue here is: MachineLICM expects RA to remat the inner-remat-defs into the inner loop, but in fact RA will first try to split the inner-remat-defs into the outer loop, and then, when it allocates registers for the inner loop, it remats the defs in front of their uses to make sure the inner loop has no spills. There is no problem for the inner loop; we can make sure there are no spills there via remat. But to avoid spills in the inner loop, we do not need to remat all the inner-remat-defs that were split into the outer loop. The inner-remat-defs left in the header of the outer loop increase the outer loop's register pressure for sure. MachineLICM is not aware of this register pressure increase in the outer loop when it hoists all remat instructions to the outer loop preheader.

shchenz updated this revision to Diff 313058.Dec 21 2020, 2:59 AM

Since @efriedma has concerns about adding more hooks to the MachineLICM pass, and RA works as expected (@qcolombet, please correct me if you find any issue in the RA process described in the above comments), I made a new solution for this issue:
Instead of hoisting all remat instructions, we now do:

// For remat instructions which are inside current working loop, we should
// always hoist them.
// For remat instructions which intend to be hoisted to outer parent loop, we
// only hoist non-cheap ones as RA can not pull all remat instructions down to
// inner loop as it will first try to split them in outer loop.

Now for the newly added case rematerializable-instruction-machine-licm.ll, we don't have any spills.
But we keep some cheap instructions inside the loop, as expected. This is aligned with the logic in CanCauseHighRegPressure in the MachineLICM pass; see the flag HoistCheapInsts.

shchenz retitled this revision from [MachineLICM] [PowerPC] hoisting rematerializable cheap instructions based on register pressure. to [MachineLICM] don't always hoist rematerializable instructions.Dec 21 2020, 3:46 AM
shchenz planned changes to this revision.EditedDec 21 2020, 4:36 AM

Perf testing shows some degradations together with some improvements on the PowerPC target. Planning changes for now to do more tuning.

Matt added a subscriber: Matt.Aug 5 2021, 1:08 PM
lkail added a comment.EditedSep 3 2021, 1:11 AM

IIUC, the motivation of this patch is to have part of the immediate-materialization code reside in the loop, i.e., to have code like

1B  %0 = LIS8 64
2B  %1 = ORI8 %0, 17
loop:
8B  <use %1>

transformed to

1B  %0 = LIS 64
2B  %1 = ORI %0, 17
loop:
4B  %2 = ORI %0, 17
8B  <use %2>

I think this is currently beyond the ability of InlineSpiller when performing re-materialization. I set ORI8 to be trivially rematerializable for PPC, and LiveRangeEdit fails the allUsesAvailableAt check, since %0 in ORI8 %0, 17 is no longer live in the loop (%0's live interval is [1:2)).

shchenz abandoned this revision.May 29 2023, 6:01 PM

I think using a pseudo-instruction for integer immediates and expanding it after RA is a better solution.

Abandon this patch.

Herald added a project: Restricted Project.May 29 2023, 6:01 PM
Herald added subscribers: pmatos, asb.