
[MachineLICM] [PowerPC] hoisting rematerializable cheap instructions based on register pressure.
Changes Planned · Public

Authored by shchenz on Jun 28 2020, 1:43 AM.

Details

Reviewers
hfinkel
jsji
nemanjai
efriedma
arsenm
qcolombet
dmgreen
Group Reviewers
Restricted Project
Summary

There is an issue in MachineLICM with the hoisting of rematerializable instructions.

%64:g8rc = LIS8 585
%65:g8rc = ORI8 killed %64:g8rc, 61440
%66:g8rc_and_g8rc_nox0 = STDUX %21:g8rc, %18:g8rc_and_g8rc_nox0(tied-def 0), killed %65:g8rc :: (store 8 into %ir.49, !tbaa !2)

%64:g8rc = LIS8 585 is a rematerializable instruction, and MachineLICM hoists it out of the loop without considering register pressure.

After %64:g8rc = LIS8 585 is hoisted out, %65:g8rc = ORI8 killed %64:g8rc, 61440 is also hoisted out, because doing so lowers register pressure:

// - When hoisting the last use of a value in the loop, that value no longer
//   needs to be live in the loop. This lowers register pressure in the loop

So no matter how many pattern groups like the above there are, MachineLICM hoists all of them. When there are more than 32 such groups on PowerPC, hoisting all of them out of the loop causes spilling in register allocation.

This patch tries to fix the above issue:
1: Override the target hook shouldHoistCheapInstructions so that MachineLICM hoists cheap rematerializable instructions based on register pressure.
2: When MachineLICM sees a cheap rematerializable instruction, it also checks the users inside the loop. If a user is also loop invariant, the instruction is not hoisted blindly; instead it is hoisted only after evaluating the register pressure.
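As a rough illustration of item 2, here is a minimal, self-contained sketch (not the actual patch code; the function names and the 32-register budget used below are illustrative assumptions, roughly matching PowerPC's 32 GPRs) of hoisting cheap rematerializable defs only while the predicted loop register pressure stays under a limit:

```cpp
// Hypothetical model: each hoisted def keeps one value live through the
// loop. A cheap rematerializable def is hoisted only while the predicted
// pressure stays under the register-file limit; past that point it is
// left inside the loop, so the allocator is not forced to spill.
bool shouldHoistRematDef(int liveThroughLoop, int regLimit) {
  return liveThroughLoop + 1 <= regLimit;
}

// Count how many of `candidateGroups` LIS/ORI-style groups get hoisted
// under this policy.
int countHoisted(int candidateGroups, int regLimit) {
  int live = 0, hoisted = 0;
  for (int i = 0; i < candidateGroups; ++i) {
    if (shouldHoistRematDef(live, regLimit)) {
      ++live; // the hoisted value is now live through the loop
      ++hoisted;
    }
  }
  return hoisted;
}
```

With 40 candidate groups and a 32-register budget, only 32 are hoisted and the rest stay in the loop, which is exactly the behavior the unconditional hoisting lacks.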

Diff Detail

Repository
rL LLVM

Unit Tests: Failed

Time | Test
10,410 ms | linux > libomp.env::Unknown Unit Message ("")
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -I /mnt/disks/ssd0/agent/llvm-project/openmp/runtime/test -I /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/runtime/src -L /mnt/disks/ssd0/agent/llvm-project/build/lib -I /mnt/disks/ssd0/agent/llvm-project/openmp/runtime/test/ompt /mnt/disks/ssd0/agent/llvm-project/openmp/runtime/test/env/kmp_set_dispatch_buf.c -o /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/runtime/test/env/Output/kmp_set_dispatch_buf.c.tmp -lm -latomic && env KMP_DISP_NUM_BUFFERS=0 /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/runtime/test/env/Output/kmp_set_dispatch_buf.c.tmp
1,440 ms | linux > libomp.worksharing/for::Unknown Unit Message ("")
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -I /mnt/disks/ssd0/agent/llvm-project/openmp/runtime/test -I /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/runtime/src -L /mnt/disks/ssd0/agent/llvm-project/build/lib -I /mnt/disks/ssd0/agent/llvm-project/openmp/runtime/test/ompt /mnt/disks/ssd0/agent/llvm-project/openmp/runtime/test/worksharing/for/kmp_set_dispatch_buf.c -o /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/runtime/test/worksharing/for/Output/kmp_set_dispatch_buf.c.tmp -lm -latomic && /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/runtime/test/worksharing/for/Output/kmp_set_dispatch_buf.c.tmp 7

Event Timeline

shchenz created this revision. Jun 28 2020, 1:43 AM
shchenz edited the summary of this revision. Jun 28 2020, 1:49 AM

When MachineLICM sees a cheap rematerializable instruction, it also checks the users inside the loop. If a user is also loop invariant, the instruction is not hoisted blindly; instead it is hoisted only after evaluating the register pressure.

Wouldn't it be simpler to just change the "When hoisting the last use of a value in the loop, that value no longer needs to be live in the loop" check? We could check if the value is trivially rematerializable; if it is, we're not really helping register pressure by hoisting.

llvm/lib/Target/PowerPC/PPCInstrInfo.h
318

I'm not really happy about adding target-specific heuristics to MachineLICM. Each target has its own cost model to some extent; we might want to use different rules for specific instructions, or maybe specific register files if they're particularly tiny. But I'd like to avoid knobs that universally change the way the algorithm works. If the core algorithm changes depending on the target, that makes it much harder to understand the way the code is supposed to work, to make any changes in the future, or to implement the hook appropriately.

When MachineLICM sees a cheap rematerializable instruction, it also checks the users inside the loop. If a user is also loop invariant, the instruction is not hoisted blindly; instead it is hoisted only after evaluating the register pressure.

Wouldn't it be simpler to just change the "When hoisting the last use of a value in the loop, that value no longer needs to be live in the loop" check? We could check if the value is trivially rematerializable; if it is, we're not really helping register pressure by hoisting.

@efriedma Do you mean changing the register pressure estimate model in the function calcRegisterCost, and hoisting rematerializable instructions all the time in MachineLICM but keeping their last use inside the loop, so that RA can sink the rematerializable instruction back down when necessary?

I made another fix like:

diff --git a/llvm/lib/CodeGen/MachineLICM.cpp b/llvm/lib/CodeGen/MachineLICM.cpp
index 98638b9..88e382c 100644
--- a/llvm/lib/CodeGen/MachineLICM.cpp
+++ b/llvm/lib/CodeGen/MachineLICM.cpp
@@ -911,14 +911,18 @@ MachineLICMBase::calcRegisterCost(const MachineInstr *MI, bool ConsiderSeen,
 
     RegClassWeight W = TRI->getRegClassWeight(RC);
     int RCCost = 0;
-    if (MO.isDef())
+    // If MI is a def of a rematerializable instruction and it has only one use,
+    // we can treat it as no cost because RA can sink MI right before its user.
+    if (MO.isDef() && !(TII->isTriviallyReMaterializable(*MI, AA) && 
+                        MRI->hasOneNonDBGUse(Reg)))
       RCCost = W.RegWeight;
     else {
       bool isKill = isOperandKill(MO, MRI);
+      bool isRemat = TII->isTriviallyReMaterializable(*MRI->getVRegDef(Reg), AA); 
       if (isNew && !isKill && ConsiderUnseenAsDef)
         // Haven't seen this, it must be a livein.
         RCCost = W.RegWeight;
-      else if (!isNew && isKill)
+      else if (!isNew && isKill && !isRemat)
         RCCost = -W.RegWeight;
     }

This change is simpler than what I did in this patch (hoisting cheap instructions based on register pressure), and with it, MachineLICM no longer hoists all the pattern groups, as expected. But unfortunately, RA cannot sink the hoisted rematerializable instruction (LIS) back down, so there are still many spills.

I am thinking this may not be a good solution either, as it ties the two passes, MachineLICM and RA, together. It may also be hard to maintain: if we change one pass, we must also consider the other.

I prefer to keep all the logic in MachineLICM: check the register pressure for the rematerializable instruction directly and do not hoist it under high register pressure, so that the rematerializable instruction's last use in the loop is not hoisted automatically either.

What do you think?

shchenz marked an inline comment as done. Jun 30 2020, 12:13 AM
shchenz added inline comments.
llvm/lib/Target/PowerPC/PPCInstrInfo.h
318

Yeah, I agree with you on the policy for adding hooks. There is a conflicting place in MachineLICM on PowerPC:
1: hasLowDefLatency is overridden on PowerPC in commit https://reviews.llvm.org/rL258142; it indicates that all instructions, including cheap ones, should be hoisted out of the loop.
2: In function MachineLICMBase::CanCauseHighRegPressure, there are statements:

// Don't hoist cheap instructions if they would increase register pressure,
// even if we're under the limit.
if (CheapInstr && !HoistCheapInsts)
  return true;

Here I just want to make it consistent on the PowerPC target. But it does seem strange to have two different hooks... Maybe we need another patch to improve this.

@efriedma Hi, could you please take another look at the comment https://reviews.llvm.org/D82709#2121904? Do you still prefer to implement this through collaboration between MachineLICM and RA? If so, does the fix in MachineLICM in the above comment make sense to you? Thanks.

But unfortunately, RA cannot sink the hoisted rematerializable instruction (LIS) back down, so there are still many spills.

"Cannot", as in the target hooks for remat forbid it somehow? Or are the heuristics somehow favoring spilling over remat?

If we trust the register allocator to remat appropriately, we can just hoist everything without worrying about it. If we can't trust the register allocator, that means we're assuming the instruction won't be rematerialized, so we shouldn't be checking if it's rematerializable in the first place.

I am thinking this may be not a good solution either as it ties two passes Machine LICM and RA together. It may also hard to maintain, if we change one pass we must also consider another pass.

They're sort of tied together anyway in a general sense: given we have a register pressure heuristic, it needs to be aware of what the register allocator is actually going to do.

llvm/lib/Target/PowerPC/PPCInstrInfo.h
318

I assume you meant to link https://reviews.llvm.org/rL225471 ?

I think this is all tied together; it doesn't really make sense to push it off to later.

But unfortunately, RA cannot sink the hoisted rematerializable instruction (LIS) back down, so there are still many spills.

"Cannot", as in the target hooks for remat forbid it somehow? Or are the heuristics somehow favoring spilling over remat?

If we trust the register allocator to remat appropriately, we can just hoist everything without worrying about it. If we can't trust the register allocator, that means we're assuming the instruction won't be rematerialized, so we shouldn't be checking if it's rematerializable in the first place.

Rereading this, I should probably say a bit more. I think the patch in https://reviews.llvm.org/D82709#2121904 makes sense in a world where we trust the register allocator. In a world where we don't trust the register allocator, we should just delete the isTriviallyReMaterializable() check completely. Either way, we need a consistent model. The original patch is essentially saying remat only works for instructions that have non-loop-invariant operands, and that doesn't really make sense.


Not really related to the contents of this patch, but some targets, like ARM, use a pseudo-instruction for integer immediates, and expand it after register allocation; this makes remat more effective.

shchenz added a comment (edited). Jul 7 2020, 5:35 PM

OK, I will look into RA and see what's wrong here on the PowerPC target. It seems RA cannot sink any remat instructions at all on PowerPC, since the case here should be very common.

Also, thanks very much for the info about how ARM handles big immediates. To be honest, that would be my favorite solution too, now that I have come to this issue. On PowerPC we expand big immediates in ISel in different ways for different values, which expands a remat instruction into two or more non-remat instructions.

Using a pseudo for loading a big immediate also depends on RA working well at sinking rematerializable instructions, so I will look at the issue in RA first.

Thanks very much for your good comments.

shchenz planned changes to this revision. Jul 7 2020, 5:35 PM
shchenz added a comment (edited). Aug 20 2020, 9:17 AM

Hi @efriedma, after a long investigation of greedy register allocation, I have some findings. I think the reason the remat LIS is not sunk back down by RA as we expected is a limitation of the current greedy register allocator. Hi @qcolombet, sorry to bother you; if I am wrong in the comments about greedy RA below, please correct me. ^-^

After MachineLICM (with all LIS and some ORI hoisted), for the newly added test case, we get:

bb0:        ; outer loop preheader
    outteruse1 = 
    outteruse2 = 
    ....
    outteruseN = 
    ....
    lisvar1 = LIS
    orivar1 = ORI lisvar1
    lisvar2 = LIS
    orivar2 = ORI lisvar2
    ....
    lisvarm  = LIS
    orivarm = ORI lisvarm        <------ m ORI (together with related LIS) are hoisted out under register pressure.
    lisvarm+1 = LIS
    lisvarm+2 = LIS
    ...
    lisvarN = LIS        <------ all LIS are hoisted out because of remat.
bb1:            ;  inner loop preheader
    MTCTR8loop    <-------hardware loop, set loop count
bb2:
    std orivar1
    std orivar2
    ......
    std orivarm
    orivarm+1 = ORI lisvarm+1
    std orivarm+1
    orivarm+2 = ORI lisvarm+2
    std orivarm+2
    ......
    orivarN = ORI lisvarN
    std orivarN
    bdnz bb2   <--------hardware loop, test count register and branch
bb3:
    std outteruse1
    ....
    std outteruseN
    conditional-branch bb1, bb4
bb4:
  ret

In greedy RA, all live intervals are put into a priority queue, and live intervals with high priority are assigned physical registers first. The bigger the live interval's size, the higher its priority. So in the above code sequence, outteruse1, ..., outteruseN are assigned physical registers earlier than the lisvar and orivar intervals.

So after the greedy RA stages RS_Assign and RS_Split, the outteruse intervals are the first to enter the RS_Spill stage. The issue here is that when we try to spill for outteruseN, greedy RA will not try to rematerialize the low-priority remat LIS instructions in advance. I think maybe this is why it is called greedy register allocation: it always handles live intervals one by one? After spilling for outteruseN, greedy register allocation marks allocation for this live interval as done; it won't be changed later.

(When a remat instruction needs to be spilled, it is rematerialized right before its uses, as expected; see InlineSpiller::spill() -> reMaterializeAll().)
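The queue ordering described above can be sketched with a tiny toy model (all names and sizes are made up; this is not RAGreedy code): intervals come off the queue largest live range first, so the long outer-loop intervals are processed, and possibly spilled, before the short remat LIS intervals are even looked at.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a live interval in the priority queue.
struct Interval {
  std::string name;
  unsigned size; // live-range size; a bigger size means a higher priority
};

// Return the names in the order greedy RA would dequeue them:
// largest live range first.
std::vector<std::string> dequeueOrder(std::vector<Interval> q) {
  std::stable_sort(q.begin(), q.end(),
                   [](const Interval &a, const Interval &b) {
                     return a.size > b.size;
                   });
  std::vector<std::string> order;
  for (const Interval &iv : q)
    order.push_back(iv.name);
  return order;
}
```

Since decisions for already-processed intervals are final, by the time the small remat LIS intervals are dequeued, a spill already chosen for an outteruse interval cannot be revisited.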

After greedy register allocation, the code sequence looks like:

bb0:        ; outer loop preheader
    outteruse1 = 
    spill outteruse1 to stack.1   <------ spill; these spills can be saved if we rematerialize all the below LIS to their uses.
    outteruse2 = 
    spill outteruse2 to stack.2   <------ spill
    ....
    outteruseN = 
    spill outteruseN to stack.N  <------ spill
    ....
    lisvar1 = LIS
    orivar1 = ORI lisvar1
    lisvar2 = LIS
    orivar2 = ORI lisvar2
    ....
    lisvarm  = LIS
    orivarm = ORI lisvarm
    ...
    lisvarN = LIS        <------ not all of the remat LIS are rematerialized because there is no need to do that, outteruse are already spilled.
bb1:            ;  inner loop preheader
    MTCTR8loop
bb2:
    std orivar1
    std orivar2
    ......
    std orivarm
     lisvarm+1 = LIS     <------ rematerialized
    orivarm+1 = ORI lisvarm+1
    std orivarm+1
    lisvarm+2 = LIS       <------rematerialized
    orivarm+2 = ORI lisvarm+2
    std orivarm+2
    ......
    orivarN = ORI lisvarN
    std orivarN
    bdnz bb2
bb3:
    reload outteruse1 from stack.1 <------reload
    std outteruse1
    ....
    reload outteruseN from stack.N  <------reload
    std outteruseN
    conditional-branch bb1, bb4
bb4:
  ret

Greedy RA cannot foresee, when it spills some high-priority non-remat registers, that there are many remat instructions with low priority still in its priority queue. This should be a limitation of greedy register allocation. So I think maybe the best way is for MachineLICM to hoist the LIS also based on register pressure.

Sorry for the long comments, @efriedma. Your comments are quite welcome. BTW: we found obvious improvements in some benchmarks with this change on the PowerPC target.

shchenz requested review of this revision. Aug 20 2020, 9:18 AM

Hi,

I am a bit confused as to what is expected from RA in this case.

So after the greedy RA stages RS_Assign and RS_Split, the outteruse intervals are the first to enter the RS_Spill stage. The issue here is that when we try to spill for outteruseN, greedy RA will not try to rematerialize the low-priority remat LIS instructions in advance.

If LIS is low priority in your example, doesn't that mean it has not been assigned anything at the point we're looking at outteruseN?
In other words, if we don't have any register left for outteruseN at this point, that means that rematerializing LIS won't help (since it is not assigned).

Could you share the debug output of regalloc? (-debug-only regalloc)

Cheers,
-Quentin

lkail added a subscriber: lkail. Aug 21 2020, 3:10 PM
shchenz added a comment (edited). Aug 21 2020, 7:38 PM

Hi @qcolombet, the log for the newly added case (with the change in https://reviews.llvm.org/D82709#2121904) is posted at https://reviews.llvm.org/P8231.

We expect that outteruse1 should not be spilled; for example, you can check virtual register %0 in the log file.
This can be achieved by rematting all LIS to their uses inside the loop. My first try was to do this in MachineLICM, and I found it works: after hoisting LIS based on register pressure, there are no/fewer spills both inside the inner loop and in the outer loop.

But we want to seek a fix inside RA and not change the remat-instruction-hoisting logic inside MachineLICM.

MachineLICM expects that greedy register allocation will remat all the LIS, and so it hoists these remat instructions without considering register pressure.

But inside register allocation, that is not always the case.
Right now, %0 has high priority (big live interval size) but low spill weight (it is used in the outer loop), and it is spilled before all the LIS are rematted. So the spill for %0 is kept in the outer loop after RA.
But as we can see after RA, some remat instructions like %267 ~ %309 could be rematted into the inner loop and reuse %x9, like the other LIS instructions already inside that loop.

Do you have any idea about how to fix this in greedy RA? Thanks.

Hi @qcolombet, for the log in https://reviews.llvm.org/P8231, you can just look at the RA result at the end. Ideally, we could remat registers %267 ~ %309 (2132B - 2216B) to their uses, and then free some physical registers and assign them to %166 ~ %190 to save the spills.

I checked the log: when we spill for %0, a LIS instruction, for example %157, is assigned a physical register, but it seems RA will only remat instructions when they are themselves about to be spilled?

We could add logic inside selectOrSplitImpl like this: when we try to spill a non-remat instruction, check whether any remat instruction is assigned a physical register; if there is one, first remat that instruction and assign its physical register to the one about to be spilled. Is this a reasonable change in RA?

Thanks for your comments in advance.

Hi @qcolombet, I saw you added comments on another patch about greedy register allocation, like "The greedy allocator is already very complicated". I am not sure whether my proposed change in the above comment would increase greedy RA's complexity, or whether it is the right way to go. Could you please help confirm this? Thanks.

We could add logic inside selectOrSplitImpl like this: when we try to spill a non-remat instruction, check whether any remat instruction is assigned a physical register; if there is one, first remat that instruction and assign its physical register to the one about to be spilled.

If we cannot depend on the register allocator to rematerialize all required rematerializable instructions, the way proposed in this patch should be reasonable: instead of hoisting all rematerializable instructions in MachineLICM without considering register pressure and letting the register allocator rematerialize them, we now hoist rematerializable instructions also based on register pressure.

Sorry for pinging this patch over such a long time. We need this patch on the PowerPC target, and I think it should also benefit other targets.

Hi @shchenz,

Sorry for the late reply; I missed your update.

I checked the log: when we spill for %0, a LIS instruction, for example %157, is assigned a physical register, but it seems RA will only remat instructions when they are themselves about to be spilled?

That's correct, but a live-range can still evict another one, e.g., when we assign %88, we evict %0.

We could add logic inside selectOrSplitImpl like this: when we try to spill a non-remat instruction, check whether any remat instruction is assigned a physical register; if there is one, first remat that instruction and assign its physical register to the one about to be spilled. Is this a reasonable change in RA?

That part should actually be covered. If you look at RAGreedy::shouldEvict, a live interval can evict another one as long as its weight is bigger than the other's. Generally, the weight of rematerializable live intervals is small enough that they can be evicted pretty much all the time.
It doesn't happen in your case because the weight of %0 is pretty small and, in particular, smaller than that of the rematerializable live intervals (e.g., %157).
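The eviction rule described here can be sketched as a toy check (not RAGreedy's actual implementation; the names and weights in the example mirror the %0 vs. %157 situation with invented numbers):

```cpp
#include <string>

// Hypothetical live-interval summary: just a name and a spill weight.
struct LiveIntervalInfo {
  std::string name;
  double weight;
};

// Toy version of the eviction check: a candidate may evict the current
// occupant of a register only if the candidate's weight is strictly bigger.
bool shouldEvict(const LiveIntervalInfo &candidate,
                 const LiveIntervalInfo &occupant) {
  return candidate.weight > occupant.weight;
}
```

With an outer-loop interval of small weight and a remat interval whose uses sit in a hot inner loop, the remat interval wins the register and keeps it, which is the behavior observed in the log.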

At this point, I would check two things:

  1. Are the weights accurate? E.g., maybe the frequency estimate missed a loop?
  2. Is %157 (or whatever rematerializable interval) considered when we call shouldEvict for %0?

If the answer to #2 is yes, then I think your problem is with #1. If it is no, we should check why.

Now, if the weights for #1 are accurate, then the cost model means that it is more expensive to rematerialize than to spill.

Cheers,
-Quentin

shchenz added a comment (edited). Sep 10 2020, 6:02 PM

It doesn't happen in your case because the weight of %0 is pretty small and, in particular, smaller than that of the rematerializable live intervals (e.g., %157).

Yes, %0 is a live interval in the outer loop while the rematerializable live intervals are in the inner loop, so I think that is why the weight of %0 is smaller than that of the rematerializable live intervals (e.g., %157). The weights for %0 and %157 should be accurate? I did a hack:

if (isRematerializable(li, LIS, VRM, *MF.getSubtarget().getInstrInfo()))
  totalWeight *= 0.5F;

Changing 0.5 to a smaller value reduces the spill size for the new lit case in the inner loop, but it is not as good as this patch: more spills are generated in the outer loop. With this patch, the spills in the outer loop are also reduced a lot.
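The hack above can be viewed as tuning the remat discount in the spill-weight computation. A small sketch of the arithmetic (the 0.5 factor is the one in the quoted snippet; the frequencies are invented) shows why halving is not enough when the remat interval lives in a much hotter inner loop:

```cpp
// Toy spill weight: use frequency, scaled by a discount when the interval
// is trivially rematerializable (the quoted LLVM code multiplies by 0.5).
double spillWeight(double useFreq, bool remat, double rematDiscount) {
  return remat ? useFreq * rematDiscount : useFreq;
}
```

With an inner-loop frequency of 100 against an outer-loop frequency of 10, the remat interval's weight is 50 at a 0.5 discount, so it still outweighs the outer-loop interval and is never evicted; only a much smaller factor (e.g., 0.05) flips the comparison.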

2: Is %157 (or whatever rematerializable interval) considered when we call shouldEvict for %0?

I will investigate this further.

Thanks again for your confirmation. @qcolombet

shchenz planned changes to this revision. Sep 17 2020, 7:35 PM