Page MenuHomePhabricator

[MachineLICM] don't always hoist rematerializable instructions
Changes PlannedPublic

Authored by shchenz on Jun 28 2020, 1:43 AM.

Details

Reviewers
hfinkel
jsji
nemanjai
efriedma
arsenm
qcolombet
dmgreen
Group Reviewers
Restricted Project
Summary

There is an issue for MachineLICM for rematerializable instructions hoisting.

%64:g8rc = LIS8 585
%65:g8rc = ORI8 killed %64:g8rc, 61440
%66:g8rc_and_g8rc_nox0 = STDUX %21:g8rc, %18:g8rc_and_g8rc_nox0(tied-def 0), killed %65:g8rc :: (store 8 into %ir.49, !tbaa !2)

%64:g8rc = LIS8 585 is a rematerializable instruction and it will be hoisted outside of loop without considering register pressure in MachineLICM.

After %64:g8rc = LIS8 585 is hoisted out, %65:g8rc = ORI8 killed %64:g8rc, 61440 is also hoisted out because it will lower the register pressure:

// - When hoisting the last use of a value in the loop, that value no longer
//   needs to be live in the loop. This lowers register pressure in the loop

So no matter how many pattern groups like above, MachineLICM will hoist all of them. When we have more than 32 group like above on PowerPC, hoisting all of them outside of loop will cause RA spilling.

This patch tries to fix the above issue:
1: override target hook shouldHoistCheapInstructions to make machine licm hoist cheap rematerializable based on register pressure.
2: when machine licm gets a cheap rematerializable instruction, it will also check the user inside the loop. if the user is also a loop invariant, do not hoist it blindly, instead it hoists the instruction after evaluating the register pressure.

Diff Detail

Event Timeline

shchenz created this revision.Jun 28 2020, 1:43 AM
shchenz edited the summary of this revision. (Show Details)Jun 28 2020, 1:49 AM

when machine licm gets a cheap rematerializable instruction, it will also check the user inside the loop. if the user is also a loop invariant, do not hoist it blindly, instead it hoists the instruction after evaluating the register pressure

Wouldn't it be simpler to just change the "When hoisting the last use of a value in the loop, that value no longer needs to be live in the loop" check? We could check if the value is trivially rematerializable; if it is, we're not really helping register pressure by hoisting.

llvm/lib/Target/PowerPC/PPCInstrInfo.h
318 ↗(On Diff #273909)

I'm not really happy about adding target-specific heuristics to MachineLICM. Each target has its own cost model to some extent; we might want to use different rules for specific instructions, or maybe specific register files if they're particularly tiny. But I'd like to avoid knobs that universally change the way the algorithm works. If the core algorithm changes depending on the target, that makes it much harder to understand the the way the code is supposed to work, or make any changes in the future, or implement the hook appropriately.

when machine licm gets a cheap rematerializable instruction, it will also check the user inside the loop. if the user is also a loop invariant, do not hoist it blindly, instead it hoists the instruction after evaluating the register pressure

Wouldn't it be simpler to just change the "When hoisting the last use of a value in the loop, that value no longer needs to be live in the loop" check? We could check if the value is trivially rematerializable; if it is, we're not really helping register pressure by hoisting.

@efriedma Do you mean changing the register pressure estimate model in function calcRegisterCost and hoisting rematerializable instruction all the time in machine LICM but keeping its last use inside the loop, so RA can sink the rematerializable instruction down when necessary?

I made another fix like:

diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp
index 98638b9..88e382c 100644
--- a/llvm/lib/CodeGen/MachineLICM.cpp
+++ b/llvm/lib/CodeGen/MachineLICM.cpp
@@ -911,14 +911,18 @@ MachineLICMBase::calcRegisterCost(const MachineInstr *MI, bool ConsiderSeen,
 
     RegClassWeight W = TRI->getRegClassWeight(RC);
     int RCCost = 0;
-    if (MO.isDef())
+    // If MI is a def of a rematerializable instruction and it has only one use,
+    // we can treat it as no cost because RA can sink MI right before its user.
+    if (MO.isDef() && !(TII->isTriviallyReMaterializable(*MI, AA) && 
+                        MRI->hasOneNonDBGUse(Reg)))
       RCCost = W.RegWeight;
     else {
       bool isKill = isOperandKill(MO, MRI);
+      bool isRemat = TII->isTriviallyReMaterializable(*MRI->getVRegDef(Reg), AA); 
       if (isNew && !isKill && ConsiderUnseenAsDef)
         // Haven't seen this, it must be a livein.
         RCCost = W.RegWeight;
-      else if (!isNew && isKill)
+      else if (!isNew && isKill && !isRemat)
         RCCost = -W.RegWeight;
     }

The change is simpler than what I did in this patch(hoisting cheap instructions based on register pressure). And with above change, machine LICM will not hoist all pattern groups as expected. But unfortunately, RA can not sink down the hoisted rematerializable instruction(LIS), so there are still many spills.

I am thinking this may be not a good solution either as it ties two passes Machine LICM and RA together. It may also hard to maintain, if we change one pass we must also consider another pass.

I prefer to keep all the logic in Machine LICM, we check the register pressure for the rematerializable instruction directly and not hoist it when in high register pressure so the rematerializable instruction's last use in loop will not be hoisted automatically.

What do you think?

shchenz marked an inline comment as done.Jun 30 2020, 12:13 AM
shchenz added inline comments.
llvm/lib/Target/PowerPC/PPCInstrInfo.h
318 ↗(On Diff #273909)

yeah, agree with you for the hook adding policy. There is a conflict place in MachineLICM on PowerPC:
1: hasLowDefLatency overriding on PowerPC in commit https://reviews.llvm.org/rL258142, it indicates that all instructions including cheap instructions should be hoisted outside of loop.
2: In function MachineLICMBase::CanCauseHighRegPressure, there are statements:

// Don't hoist cheap instructions if they would increase register pressure,
// even if we're under the limit.
if (CheapInstr && !HoistCheapInsts)
  return true;

Here I just want to make it consistent on PowerPC target. But it seems strange with two different hooks...Maybe we need another patch to improve this.

@efriedma Hi, could you please help to have another look at the comment https://reviews.llvm.org/D82709#2121904? Do you still prefer to implement this by the collaboration of MachineLICM and RA? If so, does the fix in MachineLICM in the above comment make sense to you? Thanks.

But unfortunately, RA can not sink down the hoisted rematerializable instruction(LIS), so there are still many spills.

"Cannot", as in the target hooks for remat forbid it somehow? Or are the heuristics somehow favoring spilling over remat?

If we trust the register allocator to remat appropriately, we can just hoist everything without worrying about it. If we can't trust the register allocator, that means we're assuming the instruction won't be rematerialized, so we shouldn't be checking if it's rematerializable in the first place.

I am thinking this may be not a good solution either as it ties two passes Machine LICM and RA together. It may also hard to maintain, if we change one pass we must also consider another pass.

They're sort of tied together anyway in a general sense: given we have a register pressure heuristic, it needs to be aware of what the register allocator is actually going to do.

llvm/lib/Target/PowerPC/PPCInstrInfo.h
318 ↗(On Diff #273909)

I assume you meant to link https://reviews.llvm.org/rL225471 ?

I think this is all tied together; it doesn't really make sense to push it off to later.

But unfortunately, RA can not sink down the hoisted rematerializable instruction(LIS), so there are still many spills.

"Cannot", as in the target hooks for remat forbid it somehow? Or are the heuristics somehow favoring spilling over remat?

If we trust the register allocator to remat appropriately, we can just hoist everything without worrying about it. If we can't trust the register allocator, that means we're assuming the instruction won't be rematerialized, so we shouldn't be checking if it's rematerializable in the first place.

Rereading this, I should probably say a bit more. I think the patch in https://reviews.llvm.org/D82709#2121904 makes sense in a world where we trust the register allocator. In a world where we don't trust the register allocator, we should just delete the isTriviallyReMaterializable() check completely. Either way, we need a consistent model. The original patch is essentially saying remat only works for instructions that have non-loop-invariant operands, and that doesn't really make sense.


Not really related to the contents of this patch, but some targets, like ARM, use a pseudo-instruction for integer immediates, and expand it after register allocation; this makes remat more effective.

shchenz added a comment.EditedJul 7 2020, 5:35 PM

OK, I will look into RA and see what's wrong here on PowerPC target. It seems RA can not sink any remat instructions at all on PowerPC as the case here should be very common.

Also thanks very much for your info about how ARM handles big IMM. To be honest, that's my most favorite solution either when I come to this issue. On PowerPC we expand the big IMM in ISEL in different ways for different IMM. This will expand a remat instruction to 2 or more non-remat instructions.

Using a pseudo for loading big imm also depends on RA can work well for sinking remateralizable instruction. I will look at the issue in RA first.

Thanks very much for your good comments.

shchenz planned changes to this revision.Jul 7 2020, 5:35 PM
shchenz added a comment.EditedAug 20 2020, 9:17 AM

Hi @efriedma after a long time investigation about greedy register allocation, I have some findings. I think the reason why the remat lis is not sinked down by RA as our expected is the limitation of current greedy register allocation. Hi @qcolombet sorry to bother you, If I am wrong at the comment about greedyRA, please correct me. ^-^

after MachineLICM (with all LIS and some ORI hoisted up), for the new added testcase, we get:

bb0:        ; outter loop preheader
    outteruse1 = 
    outteruse2 = 
    ....
    outteruseN = 
    ....
    lisvar1 = LIS
    orivar1 = ORI lisvar1
    lisvar2 = LIS
    orivar2 = ORI lisvar2
    ....
    lisvarm  = LIS
    orivarm = ORI lisvarm        <------ m ORI (together with related LIS) are hoisted out under register pressure.
    lisvarm+1 = LIS
    lisvarm+2 = LIS
    ...
    lisvarN = LIS        <------ all LIS are hoisted out because of remat.
bb1:            ;  inner loop preheader
    MTCTR8loop    <-------hardware loop, set loop count
bb2:
    std orivar1
    std orivar2
    ......
    std orivarm
    orivarm+1 = ORI lisvarm+1
    std orivarm+1
    orivarm+2 = ORI lisvarm+2
    std orivarm+2
    ......
    orivarN = ORI lisvarN
    std orivarN
    bdnz bb2   <--------hardware loop, test count register and branch
bb3:
    std outteruse1
    ....
    std outteruseN
    conditional-branch bb1, bb4
bb4:
  ret

In greedyRA, all live intervals are put inside a priority queue. And live interval with high priority will be assigned with physical register first. The bigger the live interval's size, the higher priority the live interval has. So in above code sequence, outteruse1, ... outteruseN will be assigned with physical register earlier than lisvar and orisvar.

So after greedyRA stage RS_Assign, RS_Split, outteruseNare the first to enter RS_Spill stage. Issue here is when we try to spill for outteruseN, greedyRA will not try to do rematerialize for low priority remat LIS instructions in advance. I think maybe this is why it is called greedy register allocation. It always handles live interval one by one? After spilling for outteruseN, greedy register allocation marks allocation for this live interval as done. It won't be changed later.

(When some remat instruction needs to be spilled, they will be rematerialized to front of their use as expected. see void InlineSpiller::spill() -> reMaterializeAll())

After greedy register allocation, code sequence is like:

bb0:        ; outter loop preheader
    outteruse1 = 
    spill outteruse1 to stack.1   <------ spill; these spills can be saved if we rematerialize all the below LIS to their uses.
    outteruse2 = 
    spill outteruse2 to stack.2   <------ spill
    ....
    outteruseN = 
    spill outteruseN to stack.N  <------ spill
    ....
    lisvar1 = LIS
    orivar1 = ORI lisvar1
    lisvar2 = LIS
    orivar2 = ORI lisvar2
    ....
    lisvarm  = LIS
    orivarm = ORI lisvarm
    ...
    lisvarN = LIS        <------ not all of the remat LIS are rematerialized because there is no need to do that, outteruse are already spilled.
bb1:            ;  inner loop preheader
    MTCTR8loop
bb2:
    std orivar1
    std orivar2
    ......
    std orivarm
     lisvarm+1 = LIS     <------ rematerialized
    orivarm+1 = ORI lisvarm+1
    std orivarm+1
    lisvarm+2 = LIS       <------rematerialized
    orivarm+2 = ORI lisvarm+2
    std orivarm+2
    ......
    orivarN = ORI lisvarN
    std orivarN
    bdnz bb2
bb3:
    reload outteruse1 from stack.1 <------reload
    std outteruse1
    ....
    reload outteruseN from stack.N  <------reload
    std outteruseN
    conditional-branch bb1, bb4
bb4:
  ret

greedyRA can not foresee that there are many remat instruction but with low priority in greedyRA priority queue when it tries to do spill for some non-remat registers but with high priority. This should be greedy register allocation's limitation. So I think maybe the best way is machinelicm hoist the LIS also based on register pressure.

Sorry for the long comments @efriedma . You comments are quite welcome. BTW: We found some obvious improvement for some benchmarks with this change on PowerPC target.

shchenz requested review of this revision.Aug 20 2020, 9:18 AM

Hi,

I am a bit confused as by what is expected by RA in this case.

So after greedyRA stage RS_Assign, RS_Split, outteruseNare the first to enter RS_Spill stage. Issue here is when we try to spill for outteruseN, greedyRA will not try to do rematerialize for low priority remat LIS instructions in advance

If LIS is low priority in your example doesn't that mean it was not been assigned anything at the point we're looking at outterresuNare?
In other words, if we don't have any register left for outterresuNare at this point, that means that rematerializing LIS won't help (since it is not assigned).

Could you share the debug output of regalloc? (-debug-only regalloc)

Cheers,
-Quentin

lkail added a subscriber: lkail.Aug 21 2020, 3:10 PM
shchenz added a comment.EditedAug 21 2020, 7:38 PM

Hi @qcolombet the log for the new added case(With the change in https://reviews.llvm.org/D82709#2121904) is put https://reviews.llvm.org/P8231

We expect that outteruse1 should not be spilled, for example, you can check virtual register %0 in the log file.
This can be achieved by remat all LIS to its use inside the loop. My first try is to do this in MachineLICM and find it works. After hoisting LIS based on register pressure, there is no/less spill both inside the inner loop and outter loop.

But we want to seek for a fix inside the ra and don't change the remat instruction hoisting logic inside machinelicm.

machine LICM expects greedy register allocation will remat all the LIS and so machine LICM hoists these remat instructions without considering register pressure.

But inside register allocation, that is not always the case.
for now %0 has high priority (big live interval size) but low spill weight (used in outer loop), and it is spilled before remat all LIS. So spill for %0 is kept in outer loop after the RA.
But as we can see, after RA, there are some remat instructions like %267 ~ %309 can be remat to the inner loop and reuse %x9 like other LIS instructions already inside that loop.

Do you have any idea about how to fix this in greedy RA? Thanks.

Hi @qcolombet, for the log in https://reviews.llvm.org/P8231, you can just see the RA result at the end. Ideally, we can remat registers %267 ~ %309(2132B - 2216B) to their use, and then we can free some physical registers and assign the physical registers to %166 ~ %190 to save the spills.

I checked the log, when we do spill for %0, LIS instruction, for example %157, is assigned with physical register, but seems RA will only remat instructions when the remat instructions are going to be spilled?

We should add logic inside selectOrSplitImpl like: when we try to spill a non-remat instruction, we should check if there is any remat instruction assigned with a physical register. If there is one, we should first remat the remat instruction and assign the physical register to the spilled one? Is this a reasonable change in RA?

Thanks for your comments in advance.

Hi @qcolombet , I saw you added comments for other patch about Greedy Register Pressure like "The greedy allocator is already very complicated". I am not sure my proposal change in above comment in greedyRA will increase the complexity or if it is the right way to go. Could you please help to confirm this? Thanks.

We should add logic inside selectOrSplitImpl like: when we try to spill a non-remat instruction, we should check if there is any remat instruction assigned with a physical register. If there is one, we should first remat the remat instruction and assign the physical register to the spilled one?

If we can not depend on register allocator to rematerialize all required rematerializable instructions, the way proposed in this PR should be reasonable. Instead of letting the register allocator rematerialize the instructions after we hoist all rematerializable instructions without considering register pressure in machineLICM, now we hoist the rematerializable instructions also based on register pressure.

Sorry for pinging this patch for a long time. We need this patch on PowerPC target. I think it should also benefit other targets.

Hi @shchenz,

Sorry for the late reply I missed your update.

I checked the log, when we do spill for %0, LIS instruction, for example %157, is assigned with physical register, but seems RA will only remat instructions when the remat instructions are going to be spilled?

That's correct, but a live-range can still evict another one, e.g., when we assign %88, we evict %0.

We should add logic inside selectOrSplitImpl like: when we try to spill a non-remat instruction, we should check if there is any remat instruction assigned with a physical register. If there is one, we should first remat the remat instruction and assign the physical register to the spilled one? Is this a reasonable change in RA?

That part should actually be covered. If you look at RAGreedy::shouldEvict, a live-interval can evict another one as long as its weight is bigger than the other one. Generally, the weight of rematerializable live intervals is small enough that it can be evicted pretty much all the time.
It doesn't happen in your case because the weight of %0 is pretty small and in particular, smaller than the rematerializable live-intervals (e.g., %157).

At this point, I would check two things:

  1. Are the weights accurate? E.g., maybe the frequency estimate missed a loop?
  2. Is %157 (or whatever rematerializable interval) considered when we call shouldEvict for %0?

If the answer to #2 is yes, then I think your problem is with #1. If it is no, we should check why.

Now, if the weights for #1 are accurate, then the cost model means that it is more expensive to rematerialize than to spill.

Cheers,
-Quentin

shchenz added a comment.EditedSep 10 2020, 6:02 PM

It doesn't happen in your case because the weight of %0 is pretty small and in particular, smaller than the rematerializable live-intervals (e.g., %157).

Yes, %0 is a live interval in the outer loop while the rematerializable live-intervals are in the inner loop, so I think that is the reason why weight of %0 is smaller than the rematerializable live-intervals (e.g., %157). The weight for %0 and %157 should be accurate? I did some hack:

if (isRematerializable(li, LIS, VRM, *MF.getSubtarget().getInstrInfo()))
  totalWeight *= 0.5F;

changed 0.5 to a smaller value, it can reduce the spill size for the new lit case in inner loop. But it is not as good as this patch. More spills are generated in outer loop. With this patch the spills in outer loop are also reduced a lot.

2: Is %157 (or whatever rematerializable interval) considered when we call shouldEvict for %0?

I will do more investigation for this.

Thanks again for your confirmation. @qcolombet

shchenz planned changes to this revision.Sep 17 2020, 7:35 PM

@qcolombet @efriedma , sorry for the late response and the long main. I made a summary for this issue.

Before machine LICM:

entry:
    outer-non-remat-def

outer loop:
    inner loop:
        inner-remat-def
        inner-use
        b inner loop
    outer-use
    b outer loop

After machine LICM:

entry:
    outer-non-remat-def
    inner-remat-def          ———>all remat definitions are hoisted out
outer loop:
    inner loop:
        inner-use
        b inner loop
    outer-use
    b outer loop

machineLICM depends on RA(register allocator) remat the inner-remat-def down to its inner loop users. This is ok when there is level-one loop. But here we have a level-two loop, RA does not work as machine LICM expects.

If we hoist all remat definitions in machine LICM pass:
RA can make sure the loop where the inner-users are have no spills but it can not make sure the outer loop has no spills.

The process in Greedy Register Allocation after we hoist all remat instructions to entry block is like this:
1: outer defs have higher allocation priority than inner def because outer defs have larger live interval. But outer defs have smaller spill weight because inner def users are in the inner loop.
2.1: firstly RA allocates physical registers to outer def
2.2: then RA allocates physical register to inner def and evicts the assigned physical registers for outer defs when RA found there are not enough physical registers for inner defs. (outer defs have smaller spill weight).
2.3: then the outer def virtual registers are in stage split and in the second round for outer def virtual registers allocation, they are all in split stage, so RA will do spilling for them without trying to elicit any virtual registers.
2.4: for the inner-remat-def virtual registers, RA will try its best to assign physical registers to some of them and split some of them in the outer loop in the second round.
2.5: and at last, when RA allocates registers for inner-users, it will remat the use of remat-instructions down to the front of the inner-users to make sure there is no spill/reload in the inner loop.

After greedy register allocation:

entry:
    outer-non-remat-def
    some of inner-remat-def 
outer loop:
    some inner-remat-def       // split for inner-remat-defs
    inner loop:
        some inner-remat-def // remat when allocates for inner-users
        inner-use
        b inner loop
    ;;;;;; reload happens in outer loop
    outer-use
    b outer loop

So the issue here is, machine licm expects RA remat the inner-remat-def to the inner loop, but in fact, RA will first try to split the inner-remat-def to the outer loop and then when does allocation for inner loop register, it will remat the def to the front of the uses to make sure inner loop has no spill. There is no problem for the inner loop, we can make sure there is no spill by remat. But to avoid spill in the inner loop, we don’t need to remat all inner-remat-def splitted in the outer loop. The left inner-remat-defs in the header of the outer loop will increase outer loop register pressure for sure. MachineLICM is not aware of the register pressure increasing in the outer loop when it hoists all remat instructions to the outer loop preheader.

shchenz updated this revision to Diff 313058.Dec 21 2020, 2:59 AM

Since @efriedma has concerns about adding more hook in machineLICM pass, and the RA works as expected(@qcolombet please correct me if you find there is any issue in RA process in above comments), I made a new solution for this issue:
instead of hoisting all remat instructions, we now do:

// For remat instructions which are inside current working loop, we should
// always hoist them.
// For remat instructions which intend to be hoisted to outer parent loop, we
// only hoist non-cheap ones as RA can not pull all remat instructions down to
// inner loop as it will first try to split them in outer loop.

Now for the new added case rematerializable-instruction-machine-licm.ll, we don't have any spills.
But we get some cheap instructions kept inside the loop as expected. This should be alignment with the logic in CanCauseHighRegPressure in machinelicm pass. See flag HoistCheapInsts.

shchenz retitled this revision from [MachineLICM] [PowerPC] hoisting rematerializable cheap instructions based on register pressure. to [MachineLICM] don't always hoist rematerializable instructions.Dec 21 2020, 3:46 AM
shchenz planned changes to this revision.EditedDec 21 2020, 4:36 AM

perf test shows some degradations together with some improvements on PowerPC target. Plan change for now to get more tuning.

Matt added a subscriber: Matt.Aug 5 2021, 1:08 PM
lkail added a comment.EditedSep 3 2021, 1:11 AM

IIUC, the motivation of this patch is to have part of immediate materialization code reside in the loop, i.e., have code like

1B  %0 = LIS8 64
2B  %1 = ORI8 %0, 17
loop:
8B  <use %1>

transformed to

1B  %0 = LIS 64
2B  %1 = ORI %0, 17
loop:
4B  %2 = ORI %0, 17
8B  <use %2>

I think it's currently beyond the ability of InlineSpiller when perform re-materialization. I set ORI8 to be trivially rematerilizable for PPC, the LiveRangeEdit will fail the check allUsesAvailableAt since %0 in ORI8 %0, 17 is no longer live(%0's live interval is [1:2)) in the loop.