Up the phi node folding threshold from a cheap "1" to a meagre "2".
Update tests for extra added selects and slight code churn.
Differential D7507
[SimplifyCFG] Be more aggressive jmolloy on Feb 9 2015, 9:09 AM. Authored by
Details
Diff Detail
Event TimelineComment Actions Update to add a new test, checking that the idiomatic "clamp()" function is correctly optimized to two selects.
Comment Actions Can you please provide a summary of how this affects our normal performance benchmarks?
Comment Actions Hi Hal, Sure. There is no difference in any of the industry-standard test suites I have. In LNT, I see two small regressions and 9 small improvements (on AArch64): Regressions:
Improvements:
Cheers, James
Comment Actions When I run this change on my POWER7 box, I see no improvements and one major regression: MultiSource/Benchmarks/Olden/power/power I'll attempt to figure out what is going on here. Generally speaking, I'd like to discuss a bit more, from a modeling perspective, what makes this a good idea? And, should we be using the same threshold for both FoldTwoEntryPHINode and SpeculativelyExecuteBB? We have costs for the instructions, do we need a cost for the branch? Do we need to consider whether or not we're speculating multiple instructions that are dependent on each other vs. independent?
Comment Actions Hi Hal, I can make the comment change, no problem. With regards modelling, I think this is going to be a very difficult problem if we want to model more accurately. The penalty for speculating instructions is a function of the number and type of instructions speculated, the in-orderness of the CPU, the predictability of the branch condition, and probably a bunch of other factors too. I feel that, given how much better the optimizer can reason about things when they're in a single basic block, that for small sequences like this the speculated version should be the canonical form. Then CodeGenPrepare or something similarly nearer the backend can make a target-specific decision whether to expand it or not. So really, the heuristic here is mainly a cutoff to stop horribly expensive stuff from being speculated, but the backend should have the final say. That is, the indirect benefits outweigh the direct benefits, so more accurately modelling the direct cost in SimplifyCFG we may end up making the wrong decision. James Comment Actions To provide one more data point: On an Intel Sandy Bridge box, I see no regressions and one speedup: SingleSource/Benchmarks/CoyoteBench/huffbench Comment Actions Fair enough, but sometimes we need to take the "do no harm" approach. Nevertheless, it turns out my P7 performance regression was a combination of a faulty script and a compiler crash, so we can ignore that (it is a good thing that I investigated it, however, because it turns out it was a serious regression). I think that we may need to return to the modeling question here, but I think we can move forward with this for now (all of my testing is neutral, both on PPC and on X86, with a speedup on x86). LGTM.
|
Do you mean two selects?