
[Greedy RegAlloc] Add logic to greedy reg alloc to avoid bad eviction chains
ClosedPublic

Authored by myatsina on Jul 24 2017, 2:01 PM.

Details

Summary

This fixes Bugzilla 26810:
https://bugs.llvm.org/show_bug.cgi?id=26810

This is intended to prevent sequences like:
movl %ebp, 8(%esp) # 4-byte Spill
movl %ecx, %ebp
movl %ebx, %ecx
movl %edi, %ebx
movl %edx, %edi
cltd
idivl %esi
movl %edi, %edx
movl %ebx, %edi
movl %ecx, %ebx
movl %ebp, %ecx
movl 16(%esp), %ebp # 4-byte Reload

Such sequences are created in 2 scenarios:

Scenario #1:
vreg0 is evicted from physreg0 by vreg1
Evictee vreg0 is intended for region splitting with split candidate physreg0 (the reg vreg0 was evicted from)
Region splitting creates a local interval because of interference with the evictor vreg1 (normally region splitting creates two intervals, the "by reg" and "by stack" intervals; a local interval is created when interference occurs)
one of the split intervals ends up evicting vreg2 from physreg1
Evictee vreg2 is intended for region splitting with split candidate physreg1
one of the split intervals ends up evicting vreg3 from physreg2, etc., until someone spills

Scenario #2
vreg0 is evicted from physreg0 by vreg1
vreg2 is evicted from physreg2 by vreg3, etc.
Evictee vreg0 is intended for region splitting with split candidate physreg1
Region splitting creates a local interval because of interference with the evictor vreg1
one of the split intervals ends up evicting back original evictor vreg1 from physreg0 (the reg vreg0 was evicted from)
Another evictee vreg2 is intended for region splitting with split candidate physreg1
one of the split intervals ends up evicting vreg3 from physreg2, etc., until someone spills
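The cascade in both scenarios can be sketched with a toy model (this is illustrative Python, not LLVM code; the vreg/physreg names mirror the scenario descriptions and are invented): each split's local interval evicts the occupant of the next register, and the chain only ends at a free register or at a spill.

```python
# Toy model of an eviction chain: every region split creates a local interval
# that evicts the occupant of its split candidate, which then itself needs
# splitting, and so on until either a free register or a spill stops the chain.

def eviction_chain(assignments, first_evictee, num_physregs):
    """assignments: dict physreg -> vreg currently holding it (None/absent = free)."""
    chain = []
    evictee = first_evictee
    for physreg in range(num_physregs):
        occupant = assignments.get(physreg)
        if occupant is None:
            # A free register ends the chain without a spill.
            return chain, None
        # Splitting `evictee` around `physreg` evicts its occupant,
        # which becomes the next evictee.
        chain.append((evictee, physreg, occupant))
        evictee = occupant
    # No candidate left: the last evictee must spill.
    return chain, evictee

# With every register occupied, each split just shifts the problem one
# register over, producing the shuffle of copies shown in the assembly above.
chain, spilled = eviction_chain({0: "vreg1", 1: "vreg2", 2: "vreg3"}, "vreg0", 3)
print(chain)    # [('vreg0', 0, 'vreg1'), ('vreg1', 1, 'vreg2'), ('vreg2', 2, 'vreg3')]
print(spilled)  # vreg3
```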

As compile time was a concern, I've added a flag to control whether we do cost calculations for local intervals we expect to be created (it's on by default for the X86 target, off for the rest).

Diff Detail

Repository
rL LLVM

Event Timeline

myatsina created this revision.Jul 24 2017, 2:01 PM
qcolombet requested changes to this revision.Jul 24 2017, 3:44 PM

Hi,

The greedy allocator is already very complicated and I am not sure the additional complexity of the eviction tracking is worth it.
Is it something that could be cleaned up in machine copy propagation? The problem is very local so that sounds doable.

I will have a closer look at the patch because fixing the problem from the start is obviously better than patching up later, but given how rare that problem is, I really believe exploring other, less complex avenues is worthwhile.

Cheers,
Quentin

test/CodeGen/X86/bug26810.ll
1 ↗(On Diff #107959)

Could you use a .mir test to make the test more robust?

3 ↗(On Diff #107959)

That sounds wrong for a new test.

Testing should be positive as much as possible IMO.

This revision now requires changes to proceed.Jul 24 2017, 3:44 PM

Thank you for suggesting the machine copy propagation, I've started working on this direction, it definitely seems easier to implement it there.
On the other hand, if I understood correctly, one of the issues with the old LLVM register allocator (linear scan) was that it made a lot of decisions that the rewriter had to clean up afterwards, and Greedy was intended to avoid such decisions. I'm not sure if this eviction chain falls under this category or not.

Thanks,
Marina

test/CodeGen/X86/bug26810.ll
1 ↗(On Diff #107959)

Will do.

3 ↗(On Diff #107959)

I wasn't very satisfied with this check as well.
I'll make it into a positive test indeed.

I've checked the copy propagation pass feasibility -
I was able to catch a few new cases (probably because the increase in weight I did in my Greedy patch wasn't high enough, but that's a heuristic and we might be able to tune it).
On the other hand, I'm now failing to catch all the cases that cross basic blocks, because this pass works at a basic-block level.

Based on this I think the solution should probably be kept in Greedy (+ possibly additional cleanup in the copy propagation pass).

Thanks,
Marina

Based on this I think the solution should probably be kept in Greedy (+ possibly additional cleanup in the copy propagation pass).

Would a super block copy propagation pass work?
I believe the code in that pass should just work in such configuration.

When you say "super blocks" do you refer to restructuring the CFG (using tail duplication) and making the common path linear so that it can be combined into one large basic block?
I haven't really seen this concept in llvm (except for some "SuperBlock" in debug info, which seems to be unrelated), so if you have some references for me that would be great.

If we're talking about somehow flattening and "ordering" the BBs, control flow and loops, and scanning them looking for cross-block chains, then I don't think it's something trivial.
It's not always legal to replace such chains (if someone uses or clobbers one of the registers in the middle of the chain I can no longer do the replacement).
Here's one example: I cannot do the replacement "xmm0 = copy xmm3" + "xmm3 = copy xmm0", because if I reach bb2 from bb1 then xmm0 is part of the copy chain, but if I reach bb2 from bb3, then it is not.

bb1:
xmm0 = copy xmm1
// fall through bb2

bb2:
xmm1 = copy xmm2
xmm2 = copy xmm3
...
xmm3 = copy xmm2
xmm2 = copy xmm1
xmm1 = copy xmm0
test
je bb3

bb3:
xmm0 = /* something */
test
je bb2

In order to properly identify this I need to do liveness analysis for each reg suspected to be in the copy chain. I need to check if any clobbering (or even use of an "intermediate" value) might reach one of the BBs the chain is spread across - if so, I cannot do replacement.

  • Also, I may have several suspected chains in parallel which complicates it even more.
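The path sensitivity of this legality check can be sketched as follows (illustrative Python, not the actual MachineCopyPropagation pass; the CFG encoding and instruction tuples are invented). The idea: if more than one definition of a chain register can reach the top of a block, the copies cannot be removed.

```python
# Sketch of the cross-block check: a copy chain is only removable if *every*
# path into the block carries the chain's value, so a single clobbering
# predecessor kills the transformation.

def defs_reaching(cfg, instrs, block, reg):
    """Return the set of instructions defining `reg` that can reach the top
    of `block`, walking the predecessor edges in `cfg`."""
    seen, out, work = set(), set(), list(cfg[block])
    while work:
        pred = work.pop()
        if pred in seen:
            continue
        seen.add(pred)
        last = next((i for i in reversed(instrs[pred]) if i[1] == reg), None)
        if last is not None:
            out.add(last)           # this predecessor defines reg itself
        else:
            work.extend(cfg[pred])  # keep walking up
    return out

# The example above: bb2 is reached from bb1 (where xmm0 holds the chained
# value) and from bb3 (where xmm0 is clobbered).
cfg = {"bb1": [], "bb2": ["bb1", "bb3"], "bb3": ["bb2"]}
instrs = {
    "bb1": [("copy", "xmm0", "xmm1")],
    "bb2": [("copy", "xmm1", "xmm0")],
    "bb3": [("def", "xmm0", None)],
}
reaching = defs_reaching(cfg, instrs, "bb2", "xmm0")
# Two different definitions of xmm0 reach bb2, so the chain is not removable.
print(len(reaching))  # 2
```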

Please let me know if I understood correctly the "super block copy propagation".

Thanks,
Marina

As far as I read (http://www.eecs.umich.edu/~mahlke/papers/1993/hwu_jsuper93.pdf), in order to create superblocks, we need to identify traces using execution profile information, and then do tail duplication to avoid multiple entrances.
According to the authors of this technique, this transformation by itself takes a significant amount of code and compile time.
I don’t think this transformation is something we should do only for the sake of machine copy propagation pass, as it adds significant complexity.
The decision to support this transformation, and the possible optimizations that can benefit from it, seems like an orthogonal discussion that is not directly related to the bad eviction chains I'm trying to solve.

Even if I do have some transformation to superblocks that have a single entry and multiple exits, I will still need to do liveness tracking over all possible paths to maintain correctness:

bb1:
xmm0 = copy xmm1
xmm1 = copy xmm2
xmm2 = copy xmm3
...

test
je bb3

bb2:
xmm3 = copy xmm2
xmm2 = copy xmm1
xmm1 = copy xmm0
return

bb3:
return

The path bb1->bb2 can benefit from the change, but it is not legal for me to make this change if a path like bb1->bb3 exists – I need to scan all paths.

For each suspected copy chain I will need to track a whole subtree in this superblock CFG which begins with the first copy of the chain.
I will need to make sure all possible paths from that first copy contain the whole chain and that there is no path that clobbers one of the registers in the middle of that chain.
So I find myself doing some sort of liveness tracking here too.
I know my original solution added complexity to Greedy, but Greedy’s decisions are the source of this issue, and it doesn’t seem like we have an elegant way to clean up the consequences of those decisions when we’re talking about cross-BB chains.

Thanks,
Marina

Have you had a chance to look at it yet?

Thanks,
Marina

Hi Marina,

Thanks for reminding me about this patch.

I was not able to look at it yet.
I will try to get to it in the next two weeks.

Cheers,
-Quentin

Hi Marina,

I had a quick look at the patch and I am not sure this is the right approach.
The patch tries to avoid splitting when it might be part of a bad eviction chain, but I would argue there is no such thing as a bad eviction chain. The evictions happened to relax the constraints on the allocation problem, and blocking the splitting won't help.
Now, unless Phabricator is not showing everything (it is acting weird since the patch is quite big), my understanding of the patch is that it actually does not prevent eviction chains; it just resorts to less fancy splitting heuristics, which happen to spread splitting decisions around instead of having them localized at region boundaries. Thus, we may still have eviction chains, but they may be harder to spot.

Generally speaking, split points are not bad. What is bad, though, is the fact that we make poor allocation decisions that prevent us from getting rid of them later. I would focus my effort on that front if I were you.
For instance, in the example from the PR, I believe the bunch of copies don't get coalesced because we choose the color for the less constrained live-range first. That is, if we were to allocate vreg79 first, then I believe vreg80 could use that as a hint, eliminating one of the copies. Same for vreg81 and vreg82 and so on.
However, what happened is that we allocated vreg80 first, and the allocation order makes it such that vreg79 won't be able to satisfy the hint, since what we pick interferes with what vreg79 can use. Given we have the same structure before and after the split point, the live-ranges get allocated in the same order. Thus, the first mistake propagates through all the live-ranges (vreg79 prevents vreg81 from satisfying its hint, and so on).
For instance, if we were to delay vreg79, I believe we would satisfy all hints.
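The ordering effect described here can be reproduced with a small toy allocator (illustrative Python; the vreg names echo the PR discussion but the interference, hints, and register order are invented to make the point): the same greedy assignment either satisfies or breaks a copy hint depending purely on which live-range is colored first.

```python
# Toy greedy assignment: try the copy partner's register first (if it is
# already assigned and not forbidden by interference), else the first
# register in the allocation order.

def allocate(order, hints, interference, assigned, physregs):
    assign = dict(assigned)  # pre-colored ranges (e.g. fixed interferences)
    for v in order:
        forbidden = {assign[u] for u in interference.get(v, ()) if u in assign}
        prefs = [assign[h] for h in (hints.get(v),) if h in assign]
        prefs += [r for r in physregs if r not in prefs]
        assign[v] = next(r for r in prefs if r not in forbidden)
    return assign

physregs = ["r1", "r0"]                  # default allocation order: r1 first
interference = {"vreg79": ["fixed"]}     # vreg79 cannot share with "fixed"
assigned = {"fixed": "r1"}               # so vreg79 can only use r0
hints = {"vreg79": "vreg80", "vreg80": "vreg79"}  # a copy connects them

# vreg80 first: it greedily grabs r1, and vreg79 (forbidden r1) ends up in
# r0 -> the hint is broken and the copy survives.
print(allocate(["vreg80", "vreg79"], hints, interference, assigned, physregs))

# vreg79 (the constrained range) first: it takes r0, and vreg80 follows the
# hint into r0 -> the copy can be coalesced.
print(allocate(["vreg79", "vreg80"], hints, interference, assigned, physregs))
```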

There is a lot of speculation in what I described for the example from the PR. I will try to spend some time verifying if any of those changes would indeed fix this problem.
In particular, I believe the problem can be solved with some tweaks in:

  • TargetRegisterInfo::getRegAllocationHints (e.g., we could give a hint to vreg80 so that it avoids vreg79 interferences)
  • Priority when enqueueing live-ranges
  • Consider the cost of using a register that is going to create a broken hint down the road when assigning a color (similar idea to the first item)
  • Try to reconcile hints that are in the same region instead of one at a time

The last point is probably the one that is going to least affect existing code and thus would probably be the easiest to qualify.

Cheers,
-Quentin

For the record, playing with the order helps but to do the optimal (local) coloring here I would need to spend more time on the heuristic.
Again the easiest fix is probably the hints reconciling by region.

Hi Quentin,

I wouldn’t say my patch tries to avoid splitting, but rather that it tries to improve the calculation of the spill weight of split candidates:
When the register allocator decides to do a region split, it looks for the best physical register candidate for the split.
The best candidate is the one that will cause the minimal spill cost.
When calculating the spill cost of each candidate, the algorithm takes into account interferences at the entry/exit of the basic blocks.
However, there may be interference local to a basic block which, later, during the split itself, will cause the creation of a new local interval (local to the basic block) on top of the “by reg” and “by stack” intervals that are created during the split.
The algorithm currently ignores the fact that this local interval may cause spills (and thus may increase the spill weight of this candidate for the split).

My solution is to try to predict whether this split candidate will cause the creation of local intervals and whether they in turn will cause spills, and to add their spill weights to the total weight.
By doing so, I try to make the spill weight calculation of each candidate more accurate and allow the algorithm to choose a more suitable candidate.

If a local interval is created then we have a few options for its allocation:

  1. The interval will be allocated to some free reg – no additional spill cost needed.
  2. The interval may cause an eviction – in some cases this eviction is "bad" and guaranteed to cause a spill (it's "bad" when you're evicting the interval that evicted you, kind of like a cat-and-mouse game - somebody must lose here) - in this patch I'm trying to predict whether it's "bad" or not, and incorporate the spill weight of this interval.
  3. The interval may spill – I've already encountered a case where the new local interval is in a hot loop and ends up spilling around all uses – this spill cost wasn't considered when the candidate was chosen. I have a solution for this case which is based on parts of this patch.
  4. The interval may split – I guess there might be some spill cost to consider here as well, but I didn’t explore this case yet.
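The adjustment described above could be folded into the candidate choice roughly as follows (a simplified sketch, not the actual RAGreedy cost code; the numbers, dictionary layout, and field names are invented for illustration):

```python
# Sketch: add the predicted spill weight of local intervals a split would
# create into each candidate's cost before picking the cheapest one.

def candidate_cost(boundary_cost, local_intervals, predict_local=True):
    cost = boundary_cost
    if predict_local:
        for itv in local_intervals:
            # Options 2 and 3 above: a local interval that triggers a "bad"
            # eviction, or that must spill (e.g. in a hot loop), adds its
            # own spill weight to the candidate.
            if itv["causes_bad_eviction"] or itv["must_spill"]:
                cost += itv["spill_weight"]
    return cost

candidates = {
    # physreg0 looks cheaper at the block boundaries, but its split would
    # create a hot local interval that is predicted to trigger a bad eviction.
    "physreg0": (10.0, [{"causes_bad_eviction": True, "must_spill": False,
                         "spill_weight": 50.0}]),
    "physreg1": (15.0, []),
}

def best(candidates, predict_local):
    return min(candidates,
               key=lambda r: candidate_cost(*candidates[r], predict_local))

print(best(candidates, predict_local=False))  # physreg0 (boundary cost only)
print(best(candidates, predict_local=True))   # physreg1 (15.0 < 10.0 + 50.0)
```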

I did see nice performance results with my current solution.
I will try to look into the hint reconciling as well, but I do think that the current spill weight calculation of the split candidates is not accurate enough and we need to consider the effects of those local intervals.

Thanks,
Marina

gberry added a subscriber: gberry.Oct 11 2017, 1:16 PM
myatsina updated this revision to Diff 119158.Oct 16 2017, 7:41 AM
myatsina edited edge metadata.
myatsina edited the summary of this revision. (Show Details)

As compile time was a concern, I've added a flag to control whether we do cost calculations for local intervals we expect to be created (it's on by default for the X86 target, off for the rest).
I've fixed the tests and some comments.

Do you have additional comments?

qcolombet accepted this revision.Oct 16 2017, 8:57 AM

Hi Marina,

Looks reasonable to me.
Do a second pass before committing to make sure everything follows the LLVM coding standards. I've highlighted a few problems.

I can help with that if you wish, but I figured you probably don't want to wait for me to do that :P.

Cheers,
-Quentin

lib/CodeGen/RegAllocGreedy.cpp
302 ↗(On Diff #119158)

brief*

303 ↗(On Diff #119158)

Use lower case for the first letter for methods.

310 ↗(On Diff #119158)

Ditto

1377 ↗(On Diff #119158)

Check*

This revision is now accepted and ready to land.Oct 16 2017, 8:57 AM
This revision was automatically updated to reflect the committed changes.
myatsina marked 4 inline comments as done.