This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] Be more aggressive
ClosedPublic

Authored by jmolloy on Feb 9 2015, 9:09 AM.

Details

Reviewers
andreadb
hfinkel
Summary

Up the phi node folding threshold from a cheap "1" to a meagre "2".

Update tests for extra added selects and slight code churn.

Diff Detail

Repository
rL LLVM

Event Timeline

jmolloy updated this revision to Diff 19589.Feb 9 2015, 9:09 AM
jmolloy retitled this revision from to [SimplifyCFG] Be more aggressive.
jmolloy updated this object.
jmolloy edited the test plan for this revision. (Show Details)
jmolloy added a reviewer: hfinkel.
jmolloy set the repository for this revision to rL LLVM.
jmolloy added a subscriber: Unknown Object (MLST).
jmolloy updated this revision to Diff 19647.Feb 10 2015, 1:02 AM
jmolloy removed rL LLVM as the repository for this revision.

Update to add a new test, checking that the idiomatic "clamp()" function is correctly optimized to two selects.

hfinkel added inline comments.Feb 10 2015, 10:51 AM
test/Transforms/SimplifyCFG/clamp.ll
6

Please actually check for the clamp code you expect.

hfinkel added inline comments.Feb 10 2015, 10:52 AM
lib/Transforms/Utils/SimplifyCFG.cpp
60

And please add a comment here explaining why the number is 2 (so that we'll get the clamp pattern, etc.).

jmolloy updated this revision to Diff 19746.Feb 11 2015, 5:57 AM
jmolloy set the repository for this revision to rL LLVM.

Hi Hal,

Thanks for the review. Changes made.

Cheers,

James

hfinkel edited edge metadata.Feb 11 2015, 6:03 AM

Can you please provide a summary of how this affects our normal performance benchmarks?

lib/Transforms/Utils/SimplifyCFG.cpp
57

Do you mean two selects?

Hi Hal,

Sure. There is no difference in any of the industry-standard test suites I have. In LNT, I see two small regressions and 9 small improvements (on AArch64):

Regressions:

  • telecomm-gsm: 4.0%
  • Nodesplitting-dbl: 1.04%

Improvements:

  • aha: -7.8%
  • lambda: -6.16%
  • covariance (from polybench): -3.65%
  • gemver: -2.65%
  • +5 more sub-2% improvements.

Cheers,

James

lib/Transforms/Utils/SimplifyCFG.cpp
57

No; the heuristic has to be enough that it will hoist *one* select, to then remove the branch and in the end cause two selects.

When I run this change on my POWER7 box, I see no improvements and one major regression:

MultiSource/Benchmarks/Olden/power/power
23.3258% +/- 9.53904%

I'll attempt to figure out what is going on here.

Generally speaking, I'd like to discuss a bit more, from a modeling perspective, what makes this a good idea? And, should we be using the same threshold for both FoldTwoEntryPHINode and SpeculativelyExecuteBB? We have costs for the instructions, do we need a cost for the branch? Do we need to consider whether or not we're speculating multiple instructions that are dependent on each other vs. independent?

lib/Transforms/Utils/SimplifyCFG.cpp
57

Okay, please explain that in the comment (there is no need to be terse).

Hi Hal,

I can make the comment change, no problem.

With regards modelling, I think this is going to be a very difficult problem if we want to model more accurately. The penalty for speculating instructions is a function of the number and type of instructions speculated, the in-orderness of the CPU, the predictability of the branch condition, and probably a bunch of other factors too.

I feel that, given how much better the optimizer can reason about things when they're in a single basic block, that for small sequences like this the speculated version should be the canonical form. Then CodeGenPrepare or something similarly nearer the backend can make a target-specific decision whether to expand it or not. So really, the heuristic here is mainly a cutoff to stop horribly expensive stuff from being speculated, but the backend should have the final say.

That is, the indirect benefits outweigh the direct benefits, so more accurately modelling the direct cost in SimplifyCFG we may end up making the wrong decision.

James

To provide one more data point: On an Intel Sandy Bridge box, I see no regressions and one speedup:

SingleSource/Benchmarks/CoyoteBench/huffbench
-11.4659% +/- 3.75373%

hfinkel accepted this revision.Feb 12 2015, 4:29 PM
hfinkel edited edge metadata.

Hi Hal,

I can make the comment change, no problem.

With regards modelling, I think this is going to be a very difficult problem if we want to model more accurately. The penalty for speculating instructions is a function of the number and type of instructions speculated, the in-orderness of the CPU, the predictability of the branch condition, and probably a bunch of other factors too.

I feel that, given how much better the optimizer can reason about things when they're in a single basic block, that for small sequences like this the speculated version should be the canonical form. Then CodeGenPrepare or something similarly nearer the backend can make a target-specific decision whether to expand it or not. So really, the heuristic here is mainly a cutoff to stop horribly expensive stuff from being speculated, but the backend should have the final say.

That is, the indirect benefits outweigh the direct benefits, so more accurately modelling the direct cost in SimplifyCFG we may end up making the wrong decision.

Fair enough, but sometimes we need to take the "do no harm" approach. Nevertheless, it turns out my P7 performance regression was a combination of a faulty script and a compiler crash, so we can ignore that (it is a good thing that I investigated it, however, because it turns out it was a serious regression). I think that we may need to return to the modeling question here, but I think we can move forward with this for now (all of my testing is neutral, both on PPC and on X86, with a speedup on x86). LGTM.

James

This revision is now accepted and ready to land.Feb 12 2015, 4:29 PM

Thanks Hal,

Landed in r229099.

James

jmolloy closed this revision.Feb 13 2015, 2:51 AM