The purpose of this patch is to make LSR generate better code for Power. This is done by overriding isLSRCostLess to check instruction counts like X86 does. This was shown to show a 0.10% improvement in the geometric mean of SPEC 2006 tests on Power systems.
Details
Diff Detail
Event Timeline
lib/Target/PowerPC/PPCTargetTransformInfo.cpp | ||
---|---|---|
249–250 | Probably best to explain why this is useful and you can remove the "PowerPC specific" bit. | |
test/Transforms/LoopStrengthReduce/PowerPC/lsr-insns-3.ll | ||
3 | Good to have a comment here of exactly what you're looking for as part of the test to make updating it easier. Right now I'd have no idea. | |
69 | Can limit the attributes of all the functions as well? |
lib/Target/PowerPC/PPCTargetTransformInfo.cpp | ||
---|---|---|
249–250 | Also, why this formula? If C1.Insns < C2.Insns, we're deferring to the base implementation, but not if C1.Insns > C2.Insns. That seems different from what X86 does, where it defers whenever the instruction counts aren't equal: ol X86TTIImpl::isLSRCostLess(TargetTransformInfo::LSRCost &C1, TargetTransformInfo::LSRCost &C2) { // X86 specific here are "instruction number 1st priority". return std::tie(C1.Insns, C1.NumRegs, C1.AddRecCost, C1.NumIVMuls, C1.NumBaseAdds, C1.ScaleCost, C1.ImmCost, C1.SetupCost) < std::tie(C2.Insns, C2.NumRegs, C2.AddRecCost, C2.NumIVMuls, C2.NumBaseAdds, C2.ScaleCost, C2.ImmCost, C2.SetupCost); } |
I added some more comments about the cost calculation and updated the test to simplify it and explain what it was looking for.
lib/Target/PowerPC/PPCTargetTransformInfo.cpp | ||
---|---|---|
253 | This is a proper partial ordering: isLSRCostLess(C1, C2) && isLSRCostLess(C2, C1) is always false. However, I'm not sure the meaning is obvious... For example, imagine that BaseT::isLSRCostLess(C1, C2) is true and BaseT::isLSRCostLess(C2, C1) is false. In this case, isLSRCostLess is either equivalent to BaseT::isLSRCostLess(C1, C2), if (C1.Insns <= C2.Insns), or isLSRCostLess implies equality (if C1.Insns > C2.Insns), because in this latter case, isLSRCostLess(C1, C2) is false and isLSRCostLess(C2, C1) is also false. In other words, you've taken all {C1, C2} for which C1 < C2 (by BaseT::isLSRCostLess) and C1.Insns > C2.Insns, and made them equal in LSR cost. You say that this is a more conservative selection, but it just seems more arbitrary to me (you've enlarged the class of equal things). If this is what you intended, it should have a specific explanation. |
@hfinkel
Hi Hal,
Yes that is correct. We are increasing the size of the class of equal elements.
The idea is that LSR will only replace one formula with another if it finds something with a lower cost. By returning false more often LSR will do fewer replacements and that is what we mean by more conservative. Maybe that's not really clear in the comment and I can update the comment to make it more clear.
I think you are looking for a theoretical reason as to why we picked this formula over others. However, the reason we settled on this formula as opposed to what is on x86 for example was based on the performance testing that we did. We tried a number of options and this one came out the best. We used this formula as an opportunity to tune LSR.
Modified the tuning formula to take into consideration the idea that we shouldn’t be enlarging the class of equal items.
After looking into the LSR algorithm we discovered that if we enlarge the space of equal items we do not favor unchanged code as we had initially desired.
Therefore, this change does almost exactly what the default isLSRCostLess function does (from the base class) except that number of instructions has been added as the most important category to compare.
One inline grammar-o.
That said, I'm seeing some failures internally here after this - I'm trying to get a testcase for you though. Almost assuredly a latent bug.
llvm/trunk/test/Transforms/LoopStrengthReduce/PowerPC/lsr-insns-3.ll | ||
---|---|---|
3 ↗ | (On Diff #137405) | less -> fewer :) |
@echristo
Hi Eric,
Would you like me to pull this change out and fix the bug before I put it back in?
Probably best to explain why this is useful and you can remove the "PowerPC specific" bit.