This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC][WIP] Provide context-sensitive cost to the Greedy Allocator to favour splitting over CSR first use
AbandonedPublic

Authored by nemanjai on Dec 2 2016, 3:48 PM.

Details

Summary

PowerPC is a target that can see great benefits from shrink-wrapping due to a high overhead for non-leaf calls (spilling SPR's and CSR's). However, the register allocator currently favours allocating a CSR over splitting a region. This in effect results in copies of parameter registers into CSR's in the entry block when the parameter is live across any calls in the function. And of course, this disables shrink-wrapping because the save point then must be the entry block.
Just providing a cost in TargetRegisterInfo::getCSRFirstUseCost() is not all that effective in alleviating this issue because it is a global setting and there are situations where allocating a CSR is better (i.e. less spilling around calls).

As the title mentions, this is a work in progress. The cost function probably needs a fair bit of tuning both for the actual values for the cost and for conditions under which a non-zero cost is returned. The current value is arbitrary (just seems to work well) and the condition is simple:

  • The live range of the value spans any blocks that have no calls

However, with this very coarse cost function, we see a doubling in the number of shrink-wrapped functions as well as significant performance improvement. There are 4 SPEC INT benchmarks that degrade by 0.07% - 1.07%. Everything else improves by 2.7% - 4.7% (INT) and 0.16% - 15.0% (FP).

So hopefully through this patch, we can have a productive discussion of how to proceed with a more fine-grained cost model for allocating CSR's.

Diff Detail

Repository
rL LLVM

Event Timeline

nemanjai updated this revision to Diff 80139.Dec 2 2016, 3:48 PM
nemanjai retitled this revision from to [PowerPC][WIP] Provide context-sensitive cost to the Greedy Allocator to favour splitting over CSR first use.
nemanjai updated this object.
nemanjai set the repository for this revision to rL LLVM.
nemanjai added subscribers: sfertile, lei, syzaara, jtony.
echristo edited edge metadata.Dec 8 2016, 3:01 PM

In general this patch might be a good idea. I'm currently wondering what the impact is on other targets if we just make the implementation the same for all architectures? Can you explore the tradeoffs that we're looking at on, say, aarch64 and x86? Basically I'm not sure why we allocate a CSR rather than split the range - I'm guessing register pressure, but that's less of an issue I think.

Quentin?

-eric

qcolombet edited edge metadata.Dec 8 2016, 4:39 PM

Hi,

In general this patch might be a good idea.

I agree. Moreover, I would do something similar for the spilling case and make sure the cost is different for copies and spills.

I'm currently wondering what the impact is on other targets if we just make the implementation the same for all architectures?

My guess is this will depend on whether or not shrink-wrapping is enabled and the cost of the spill/split instruction for the target. That being said, that's possible that we could derive something sensible for all architectures.

Basically I'm not sure why we allocate a CSR rather than split the range - I'm guessing register pressure, but that's less of an issue I think.

We allocate a CSR rather than splitting a live-range when the cost model says it is more expensive to split... that helps right :).
To be concrete here, IIRC, the cost model assumes that the split points we add won't be coalesced and thus accounts for the frequencies of the copies at the related spot. If that cost is bigger than a spill of the CSR in the entry block, we spill the CSR. E.g., we trade a spill of a CSR against copies in a loop.

Cheers,
-Quentin

Also, that's possible that the right fix/simple fix is to have one CSRCost for split and one for spill. Indeed, IIRC, right now the returned cost for both spilling and splitting is going to be the sum of the frequencies where the split/spill happen and given the spill and copy have different cost, we may want to have different comparison.

E.g., CSRCostForSpill = 5 (ok to trade against more than 5 executed spill/reload) but CSRCostForSpilt = 20 (ok to trade against more than 20 executed copies)

Hi,

In general this patch might be a good idea.

I agree. Moreover, I would do something similar for the spilling case and make sure the cost is different for copies and spills.

I'm currently wondering what the impact is on other targets if we just make the implementation the same for all architectures?

My guess is this will depend on whether or not shrink-wrapping is enabled and the cost of the spill/split instruction for the target. That being said, that's possible that we could derive something sensible for all architectures.

Basically I'm not sure why we allocate a CSR rather than split the range - I'm guessing register pressure, but that's less of an issue I think.

We allocate a CSR rather than splitting a live-range when the cost model says it is more expensive to split... that helps right :).
To be concrete here, IIRC, the cost model assumes that the split points we add won't be coalesced and thus accounts for the frequencies of the copies at the related spot. If that cost is bigger than a spill of the CSR in the entry block, we spill the CSR. E.g., we trade a spill of a CSR against copies in a loop.

That's a lot more "heuristic" than I'm comfortable with in general, but OK. :)

Next step:

I think we should work on a more unified interface and then get this committed.

Thanks!

-eric

wmi added a subscriber: wmi.Dec 9 2016, 3:54 PM

Ah, raising CSR cost to be high enough and using existing splitting is indeed a simpler approach. I didn't think of the idea even after you hinted this in PR. I just tried https://reviews.llvm.org/D27366 and it can cover the simple testcases I have. I can help to evaluate the performance on x86 and see how the overall performance looks. Thanks!

wmi added a comment.Dec 9 2016, 4:01 PM

Ah, very sorry, I was intended to reply to https://reviews.llvm.org/D27596.

I proposed a patch in https://reviews.llvm.org/D27596 to address the same issue and Quentin suggested D27366 is simpler. I agree it is a better approach. The patch already covers my simple testcases when I tried it on x86. I can help to evaluate its performance on x86 side with internal benchmarks.

Forgive me if the answer to this was meant to be evident from the comments, but I am not clear on it:
Do we agree that the cost should be context sensitive or are we suggesting that a global cost should be provided by the target for the spill/split (i.e. two separate costs that are properties of the subtarget irrespective of the MachineFunction)?

wmi added a comment.Dec 13 2016, 3:11 PM

Just providing a cost in TargetRegisterInfo::getCSRFirstUseCost() is not all that effective in alleviating this issue because it is a global setting and there are situations where allocating a CSR is better (i.e. less spilling around calls).

Could you be more specific about the case when providing a global cost in TargetRegisterInfo::getCSRFirstUseCost() is not enough? I am asking because if there is case using CSR is better than splitting or spilling, maybe calculateRegionSplitCost or calcSpillCost are to be adjusted, not CSRCost. I think CSRCost is only related with prolog/epilog so it is relatively fixed?

hfinkel edited edge metadata.Dec 13 2016, 3:19 PM
In D27366#621609, @wmi wrote:

Just providing a cost in TargetRegisterInfo::getCSRFirstUseCost() is not all that effective in alleviating this issue because it is a global setting and there are situations where allocating a CSR is better (i.e. less spilling around calls).

Could you be more specific about the case when providing a global cost in TargetRegisterInfo::getCSRFirstUseCost() is not enough? I am asking because if there is case using CSR is better than splitting or spilling, maybe calculateRegionSplitCost or calcSpillCost are to be adjusted, not CSRCost. I think CSRCost is only related with prolog/epilog so it is relatively fixed?

In general, yes. If you have a lot of high-register-pressure computation, then you want to use a CSR instead of repeatedly spilling. In other words, shrink-wrapping aside, using the CSR is only cost-equivalent to the first spill, if you'd need multiple spills, then using a CSR is better.

nemanjai updated this revision to Diff 84859.Jan 18 2017, 10:41 AM

Rebasing to current trunk.

gberry added a subscriber: gberry.Jan 20 2017, 9:38 AM
wmi added a comment.Feb 8 2017, 11:21 AM

nemanjai, do you have any update about the patch? We also depend on it on x86 side so want to know the status.

Short version is that I am stuck in terms of how to proceed as none of my numerous attempts to do this differently bore fruit.

TL; DR version:
Unfortunately, I have not been able to make much progress either by providing two separate context-insensitive costs or by asking for direction on llvm-dev. I'm not sure if the goal for X86 is the same as it is for us on PPC, but I'll try to clarify what we're after.

The motivation for this patch was to facilitate shrink-wrapping (of course, that much is obvious). On PPC, the most prevalent reason by far that functions don't get shrink-wrapped (and could be) is that a CSR is used too early with respect to blocks that contain calls. From the standpoint of shrink-wrapping, the register allocator would ideally allocate a CSR in a block only if:

  • It is dominated by a block that contains a call (notice, not properly dominated)
  • Avoiding CSR use would require a spill

This patch does not quite achieve this, but none of the other approaches I've tried come as close as this patch does. I suppose that the solution would be to add these heuristics into the register allocator somehow, but I don't understand the RA code well enough to do this. I also don't know if anyone agrees with that assessment :).

Another observation that I made is that on PPC, this is particularly prevalent in functions that take parameters and have calls. The pattern in those functions is that a parameter register is copied into a CSR in the entry block so that the value is available in the same register in all blocks where it is used (regardless of whether the block contains a call). So perhaps it would be possible to tweak whatever it is that associates a physical (parameter) register with a virtual register to associate the parameter register with a different vreg in blocks that have calls and those that don't.

wmi added a comment.Feb 10 2017, 2:59 PM

Hi Nemanja,

I tried your testcase with my experimental patch (got from here: http://lists.llvm.org/pipermail/llvm-dev/2017-February/109977.html) and saw that the testcase was not shrinkwrap optimized (cmd I used: clang -O2 -target powerpc64le-grtev4-linux-gnu -S 1.c).

Existing reg splitting for live range across function calls took effect. After splitting, the sub vreg across call got CSR register assigned, and the other sub vreg got a non-CSR register. These all work as we expect. However, during tryHintRecoloring, the two sub vregs are coalesced again.

I add a simple logic in tryHIntRecoloring: If we are going to switch from a non-CSR reg to a CSR reg, only when the recoloring cost difference is larger than CSRCost, we will do such recoloring. In other words, to justify the planning recoloring, the benefit must be at least larger than the potential negative impact on shrinkwrapping.

With the change, the testcase is shrinkwrap optimized. I attach the changed experimental patch.

Thanks,
Wei.

In D27366#674099, @wmi wrote:

Hi Nemanja,

I tried your testcase with my experimental patch (got from here: http://lists.llvm.org/pipermail/llvm-dev/2017-February/109977.html) and saw that the testcase was not shrinkwrap optimized (cmd I used: clang -O2 -target powerpc64le-grtev4-linux-gnu -S 1.c).

Existing reg splitting for live range across function calls took effect. After splitting, the sub vreg across call got CSR register assigned, and the other sub vreg got a non-CSR register. These all work as we expect. However, during tryHintRecoloring, the two sub vregs are coalesced again.

I add a simple logic in tryHIntRecoloring: If we are going to switch from a non-CSR reg to a CSR reg, only when the recoloring cost difference is larger than CSRCost, we will do such recoloring. In other words, to justify the planning recoloring, the benefit must be at least larger than the potential negative impact on shrinkwrapping.

With the change, the testcase is shrinkwrap optimized. I attach the changed experimental patch.

Thanks,
Wei.

Hi Wei,
sorry for the delay in getting back to you. I think we should try to make some progress here and the patch you proposed certainly seems less intrusive and like a reasonable first step forward. Do you want to post it for review and I can then abandon this patch?
My experiments with your proposed patch definitely increase the number of functions we shrink wrap and provide a performance improvement (albeit slightly smaller than this context-sensitive patch). However, I think that your patch is definitely needed so we can prevent regalloc from reversing decisions it has made.
So what I propose is that we move forward with your proposed patch and abandon this.

For further improvements, I added a (quick-and-dirty) analysis to output all functions that we could theoretically shrink-wrap but don't. As you may expect, this produces a whole lot of data. I plan to spend some time distilling that data into usable reproducers so we can see if we are missing many opportunities that would actually be profitable to shrink-wrap.

nemanjai updated this revision to Diff 92186.Mar 17 2017, 1:22 PM

Rebasing to ToT.

junbuml added a subscriber: junbuml.Jun 9 2017, 7:32 AM

Do we have any update about the patch? I tried this on AArch64 for Spec2006, and found 8% performance improvement in astar due to the shrink wrapping happened in one of the hot functions; no regressions in other benchmarks. I will be happy to help evaluating this change on AArch64 side.

Kindly ping.
Do we have any plan about the patch? If nobody is really pursuing this, I can take a closer look at this if this is right fix.

nemanjai abandoned this revision.Feb 6 2018, 4:23 AM

We decided not to pursue this further as there are better approaches to aiding shrink wrapping.

Herald added a project: Restricted Project. · View Herald TranscriptAug 12 2019, 11:50 PM