This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] LSR tunings for PowerPC
ClosedPublic

Authored by stefanp on Dec 5 2017, 2:23 PM.

Details

Summary

The purpose of this patch is to make LSR generate better code for Power. This is done by overriding isLSRCostLess to check instruction counts like X86 does. This was shown to show a 0.10% improvement in the geometric mean of SPEC 2006 tests on Power systems.

Diff Detail

Repository
rL LLVM

Event Timeline

bnemanich created this revision.Dec 5 2017, 2:23 PM
echristo added inline comments.Dec 5 2017, 2:30 PM
lib/Target/PowerPC/PPCTargetTransformInfo.cpp
249–250 ↗(On Diff #125613)

Probably best to explain why this is useful and you can remove the "PowerPC specific" bit.

test/Transforms/LoopStrengthReduce/PowerPC/lsr-insns-3.ll
3 ↗(On Diff #125613)

Good to have a comment here of exactly what you're looking for as part of the test to make updating it easier. Right now I'd have no idea.

69 ↗(On Diff #125613)

Can limit the attributes of all the functions as well?

hfinkel added inline comments.Dec 5 2017, 2:36 PM
lib/Target/PowerPC/PPCTargetTransformInfo.cpp
249–250 ↗(On Diff #125613)

Also, why this formula? If C1.Insns < C2.Insns, we're deferring to the base implementation, but not if C1.Insns > C2.Insns. That seems different from what X86 does, where it defers whenever the instruction counts aren't equal:

ol X86TTIImpl::isLSRCostLess(TargetTransformInfo::LSRCost &C1,

                           TargetTransformInfo::LSRCost &C2) {
// X86 specific here are "instruction number 1st priority".
return std::tie(C1.Insns, C1.NumRegs, C1.AddRecCost,
                C1.NumIVMuls, C1.NumBaseAdds,
                C1.ScaleCost, C1.ImmCost, C1.SetupCost) <
       std::tie(C2.Insns, C2.NumRegs, C2.AddRecCost,
                C2.NumIVMuls, C2.NumBaseAdds,
                C2.ScaleCost, C2.ImmCost, C2.SetupCost);

}

jtony added a subscriber: jtony.Dec 6 2017, 8:14 AM
jtony added inline comments.
test/Transforms/LoopStrengthReduce/PowerPC/lsr-insns-3.ll
56 ↗(On Diff #125613)

We could remove this as well.

70 ↗(On Diff #125613)

I agree with Eric, most of these attribute # could be cleaned up (removed). We want to only keep the necessary info in the test to make the test case as short as possible.

bnemanich updated this revision to Diff 126570.Dec 12 2017, 9:58 AM

I added some more comments about the cost calculation and updated the test to simplify it and explain what it was looking for.

bnemanich marked 6 inline comments as done.Dec 12 2017, 9:59 AM
hfinkel added inline comments.Dec 12 2017, 9:14 PM
lib/Target/PowerPC/PPCTargetTransformInfo.cpp
253 ↗(On Diff #126570)

This is a proper partial ordering: isLSRCostLess(C1, C2) && isLSRCostLess(C2, C1) is always false. However, I'm not sure the meaning is obvious...

For example, imagine that BaseT::isLSRCostLess(C1, C2) is true and BaseT::isLSRCostLess(C2, C1) is false. In this case, isLSRCostLess is either equivalent to BaseT::isLSRCostLess(C1, C2), if (C1.Insns <= C2.Insns), or isLSRCostLess implies equality (if C1.Insns > C2.Insns), because in this latter case, isLSRCostLess(C1, C2) is false and isLSRCostLess(C2, C1) is also false. In other words, you've taken all {C1, C2} for which C1 < C2 (by BaseT::isLSRCostLess) and C1.Insns > C2.Insns, and made them equal in LSR cost. You say that this is a more conservative selection, but it just seems more arbitrary to me (you've enlarged the class of equal things). If this is what you intended, it should have a specific explanation.

stefanp commandeered this revision.Jan 12 2018, 10:51 AM
stefanp added a reviewer: bnemanich.
stefanp added a subscriber: stefanp.

I'm going to start looking at this patch now.

@hfinkel
Hi Hal,
Yes that is correct. We are increasing the size of the class of equal elements.
The idea is that LSR will only replace one formula with another if it finds something with a lower cost. By returning false more often LSR will do fewer replacements and that is what we mean by more conservative. Maybe that's not really clear in the comment and I can update the comment to make it more clear.

I think you are looking for a theoretical reason as to why we picked this formula over others. However, the reason we settled on this formula as opposed to what is on x86 for example was based on the performance testing that we did. We tried a number of options and this one came out the best. We used this formula as an opportunity to tune LSR.

stefanp updated this revision to Diff 132449.Feb 1 2018, 12:28 PM

Modified the tuning formula to take into consideration the idea that we shouldn’t be enlarging the class of equal items.
After looking into the LSR algorithm we discovered that if we enlarge the space of equal items we do not favor unchanged code as we had initially desired.
Therefore, this change does almost exactly what the default isLSRCostLess function does (from the base class) except that number of instructions has been added as the most important category to compare.

hfinkel accepted this revision.Mar 7 2018, 1:53 AM

LGTM

This revision is now accepted and ready to land.Mar 7 2018, 1:53 AM
This revision was automatically updated to reflect the committed changes.

One inline grammar-o.

That said, I'm seeing some failures internally here after this - I'm trying to get a testcase for you though. Almost assuredly a latent bug.

llvm/trunk/test/Transforms/LoopStrengthReduce/PowerPC/lsr-insns-3.ll
3

less -> fewer :)

@echristo
Hi Eric,
Would you like me to pull this change out and fix the bug before I put it back in?

It's be great if you wouldn't mind. Still working on getting you a test
case.

I've reverted the change in rL327143.