We generically canonicalize commutative binary operation nodes so that if
only one operand is a constant, it will be on the RHS. However, if both
operands are comparisons against constants, it is good to move the most
likely taken condition to the RHS if the binary operation is Instruction::And
and move the less likely taken condition to the RHS if the binary operation
is Instruction::Or.
Details
Diff Detail
Event Timeline
The code itself looks good, does what says on the tin. :)
But I wonder if this generic assumption works well (or better) for all targets...
Can you share a bit of the motivation behind this? Like benchmark numbers, etc?
Thanks Renato,
The main motivation for this patch is to fix the regressions we see in spec2006/mcf when D20030 is enabled. I have so far tested this patch in combination with D20030 on spec2006 on Kryo and overall the results are positive(better) for LTO config:
Benchmark Diff --------------------------------------------- spec2006/astar:ref/BigLakes2048.cfg 0.47% spec2006/gcc:ref/g23.in 0.41% spec2006/gobmk:ref/nngs.tst -0.33% spec2006/gobmk:ref/score2.tst -1.14% spec2006/gobmk:ref/trevorc.tst -0.50% spec2006/gobmk:ref/trevord.tst -0.47% spec2006/mcf:ref 8.14% spec2006/omnetpp:ref -1.75% spec2006/perlbench:ref/checkspam.pl -2.34% spec2006/perlbench:ref/splitmail.pl -1.89% spec2006/povray:ref 1.02% spec2006/sjeng:ref 0.65% spec2006/xalancbmk:ref 2.07%
Right, but this is a change in the generic codegen prepare, not even AArch64 specific. So, testing only on Kryo seems very dangerous.
I encourage you to run the same benchmarks on other architectures (AArch64 A57, ARMv7, x86_64 at least) before concluding that this is an overall winning strategy.
Or, you lower this piece of the code into the AArch64-specific part, and make sure to wrap around a target feature flag if it's only beneficial to Kryo.
cheers,
--renato
In principle, this change is target independent because it reassociates binary operands to simplify branches. The reassociation pass is designed for transformations that will help down the line optimizations such as constant propagation, GCSE, LICM, PRE etc.. so I moved it down to CGP.
I can certainly verify for A57 and know for a fact that it improves spec2006/mcf on A57 as well. However, I am uncertain of reliably testing and verifying on other targets.
If we want to move this to AArch64 backend only, this needs to be done at pre-ISel stage. AArch64PromoteConstantPass and Arch64AddressTypePromotionPass are the only pre-ISel passes in AArch64 backend but their purpose is different to what this change tries to accomplish. I'm not sure if it is reasonable to create another pass just to do this transformation.
Although this is target independent, I have added a feature flag to guard this change. It is currently enabled only for Kryo because I tested only on this target. If this is profitable for other targets, we can add the feature flag to those targets.
Renato, is this along the lines of what you were suggesting?
The re-association is target independent, but guessing which branch will be taken probably isn't, as it depends on the branch-predictor, which are wildly different on some targets / workloads.
I can certainly verify for A57 and know for a fact that it improves spec2006/mcf on A57 as well. However, I am uncertain of reliably testing and verifying on other targets.
At least for A57 would be nice.
The new flag should suffice. It can also allow other target maintainers to test on their arches by adding the feature and benchmarking, and then commit the change if profitable. Only then, if this is universally true, we could remove the flag and make it a generic pass.
cheers,
--renato
Thanks Renato,
Performance runs on A57 are ongoing, and I will update the results once I get them.
The only clear performance differences on A57 are:
Benchmark Diff ---------------------------------------------------------------- spec2006/mcf:ref 3.21 spec2006/hmmer:ref 1.22 spec2006/sjeng:ref 0.87 spec2006/sphinx3:ref 0.71 spec2006/perlbench:ref -2.51 spec2006/povray:ref -2.44
Overall this has a minor impact on performance on A57. Is it reasonable to turn it on for A57 too?
It seems like there is no objection to turn this on for A57 but there is no official approval yet. If it helps to pursuade the community, I have completed running tests on A53 and this change is performance neutral on A53 with no noise regressions or gains. IMHO this seems to be good for AArch64 targets, but I am inclined to leave it enabled only for Kryo because there is no approval from the community. Any thoughts?
If all cores that have FeaturePredictableSelectIsExpensive can also have the new flag, and it makes sense, it could be coalesced into a single flag?
It does look a bit redundant as it is, but I haven't really looked closely how the two features are linked.
@t.p.northover,any comments?
FWIW, PredictableSelectIsExpensive is also set in X86ISelLowering.cpp: PredictableSelectIsExpensive = Subtarget.getSchedModel().isOutOfOrder()
I created a new flag because I could not verify the profitability of this patch on x86 target. I agree if it makes sense to enable it for x86, we could coalesce into a single flag.
Yuck, I hate heuristics like this. It's not even particularly clear that "range size" correlates well with probability in real code, let alone with what any given branch predictor thinks of that probability.
When you remove the 'mcf' test that this patch was specifically written to target, it turns into a net regression on geomean even on Kryo. I'm not entirely sure how fair that is (I avoided statistics like the plague), but isn't using separate data to derive and test code/hypotheses generally considered a good thing?
With all due respect, I think there are some facts that need some clarification. First, for targets which have cheap jump instructions, currently LLVM splits a conditional branch like:
%0 = icmp ne i32 %a, 0 %1 = icmp eq i32 %b, 2 %or.cond = and i1 %0, %1 br i1 %or.cond, label %TrueBB, label %FalseBB
into multiple branch instructions like:
BB1: %0 = icmp ne i32 %a, 0 br i1 %0, label %SplitBB, label %FalseBB SplitBB: %1 = icmp eq i32 %b, 2 br i1 %1, label %TrueBB, label %FalseBB
This requires creation of SplitBB after BB1 and currently we update branch weights taking liberty under the assumption that
FalseProb for BB1 == TrueProb for BB1 * FalseProb for SplitBB.
This is very fragile because it assumes that the source order correlates well with the probability in real code which is not true as seen in mcf case here. Furthermore, the codegen doesn't end up always with source order because earlier transformations such as Reassociation and JumpThreading sometimes commutate the binary operands resulting in big swings in performance for reasons unrelated to the original change as seen in here: http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20141117/244920.html
What this patch tries to achieve is to address this issue by ranking the commutative binary operands based on their range sizes, so that we have a consistent codegen that doesn't fluctuate the performance with any minor changes in earlier optimization passes and at the same time gives better performance overall.
Second, it is very hard to correlate something with probability in real code. Even when using PGO for SPEC, we generate profile using train input and use it for ref input although for most benchmarks train input does not correlate to ref input. FWIW, this heuristic is not assigning any branch probabilities based on the range size, it only ranks the commutative binary operands based on the generic assumption that if we have two conditions a != 0 and b == 2, it is more likely that a != 0 than b == 2, because for all possible values 'a' and 'b' can take it is more likely that it is not equal to some value than it is equal to some other value. Conservatively, I scaled the comparision to logarithmic so that small differences in range sizes can be ignored. One situation this assumption might not hold is when 'a' or 'b' are enums or macros, although I haven't checked if this is the cause for some of the regressions we see.
When you remove the 'mcf' test that this patch was specifically written to target, it turns into a net regression on geomean even on Kryo. I'm not entirely sure how fair that is (I avoided statistics like the plague), but isn't using separate data to derive and test code/hypotheses generally considered a good thing?
Earnestly, the regressions were minor (except for perlbench which was -2.5%) compared to the gains that I did not spend time looking for the reason. I am sorry that you found this heuristic distasteful, one alternative I can think of to fix this issue is to canonicalize icmp ne to RHS of the AND expression or to LHS of the OR expression which might be less yuck but still strange.
FWIW, this heuristic is not assigning any branch probabilities based on the range size, it only ranks the commutative binary operands based on the generic assumption that if we have two conditions a != 0 and b == 2, it is more likely that a != 0 than b == 2
And I still think that's not obviously true. Integers actually used often take a very limited number of values, and this seems like a common idiom to me:
int res = some_func(); if (res < 0) llvm_unreachable("WTF happened"); else if (res == 0) [...]
Earnestly, the regressions were minor (except for perlbench which was -2.5%) compared to the gains that I did not spend time looking for the reason.
I don't mind that you didn't investigate the other regressions, I worry about the experimental soundness of including mcf in the analysis deciding whether this actually is an optimization. For example https://en.wikipedia.org/wiki/Test_set.
<rant>
But this is actually why I try to stay well clear of benchmarking discussions (unless asked, as here). I strongly suspect most of what we do around data gathering is fundamentally unprincipled and probably unsound, and would make actual statisticians and experimental designers claw their eyes out with a rusty spoon.
Unfortunately, I don't know enough to say precisely where we're going wrong or how to fix it. So I mostly just pretend it's not happening and keep out of those areas.
</rant>
To be clear, I'm not really trying to block this. I was asked what I thought and replied.
This is exactly the idiom I am trying to clarify. This change does not influence the branch direction for the idiom like above at all. All this change targets is code like
if (cond1 && cond2) do_something
or
if (cond1 || cond2) do_something
Currently, the codegen decides to split the condition into branches for targets which have cheap jump instructions and there is some flexibility in the order of the blocks containing cond1 and cond2. This patch is trying to rank cond1 and cond2 in such a way that it is profitable in most cases and is always consistent independent of earlier optimizations.
Thanks,
Balaram
For the giggles (really only that, as you might guess I view all numbers in this thread and any other LLVM benchmarking attempts with the deepest suspicion, I'm certainly not an experimental designer), I ran 3 tests on a Cyclone-like processor:
- Enable the PredictableSelectIsExpensive feature (this seems reasonable, like Kryo predictable branches are very very cheap on Cyclone).
- That plus the suggested heuristic
- The first with exactly the opposite of the suggested heuristic.
The SPEC2006 results are below (> 1 => improvement)
Benchmark | suggested speedup | opposite speedup |
433.milc/433.milc | 1.00235569837 | 0.996451239064 |
444.namd/444.namd | 1.00085232996 | 1.00266512107 |
447.dealII/447.dealII | 1.00232774224 | 1.00139244117 |
450.soplex/450.soplex | 0.991283676704 | 1.0 |
470.lbm/470.lbm | 1.00033869904 | 0.992413122292 |
400.perlbench/400.perlbench | 1.0124059534 | 1.00875815688 |
401.bzip2/401.bzip2 | 0.996946748407 | 0.999879439635 |
403.gcc/403.gcc | 1.00149565217 | 1.01734859727 |
429.mcf/429.mcf | 1.00949811937 | 0.997803188194 |
445.gobmk/445.gobmk | 0.973701955496 | 0.996549344375 |
456.hmmer/456.hmmer | 1.00178618018 | 1.00164546294 |
458.sjeng/458.sjeng | 1.00405505452 | 0.994367699255 |
462.libquantum/462.libquantum | 0.993502343417 | 0.989287229529 |
464.h264ref/464.h264ref | 0.996045116438 | 1.00229694506 |
471.omnetpp/471.omnetpp | 0.998019595581 | 1.00524934383 |
473.astar/473.astar | 1.00208689727 | 0.998415807973 |
483.xalancbmk/483.xalancbmk | 1.00291218638 | 0.987646150452 |
Geomeans were 0.99935572086 for the patch, 0.99951568293 for the complete opposite. Both worse than the status-quo, but whether in a statistically significant way, who knows?
I think the only conclusion we can really draw from this is that the LLVM project really needs to hire an actual scientist who specializes in designing experiments (not a computer scientist, not a mathematician who dabbles in programming) and give them some clout. We shouldn't be making these kinds of decisions based on ad-hoc runs of a handful of 10 year old benchmarks on ${RANDOM_HARDWARE}.
As for this patch, meh.
Tim.
Thanks for testing this patch out, Tim.
Now that I have lured you into benchmarking :) I will go out on a limb and claim that if you run this patch along with its dependent patch D20030 and enable the feature by turning on the flag "-aarch64-ccmp-disable-triangle-latch" you will see the numbers that will be comparable to the numbers I have shown on this thread. This patch by itself is not very interesting and it will get interesting once D20030 is enabled. All the numbers I have presented in this thread are using this suggested heuristic in combination with D20030 enabled. Appreciate your help.
Tim, both sets of numbers are statistically equivalent (and both move down to 0.999...), so the change is probably completely innocuous at run time, and surely adds some compile time.
As Balaram said, this helps D20030, which Kristof has seen some improvements, so there could be some merit underneath it all.
My take away points from this discussion are:
- The two feature flags are very close to each other, and I think more investigation is needed to make sure you must create a new flag instead of using the current one.
- The choice does look arbitrary, but I think Balaram's argument to make it more predictable interesting. I have no idea how this could play on the vast different (sub-)architectures out there, but it shouldn't be worse, on average, than a non-predictable output.
- Maybe what we need is to merge these two patches and test on all architectures that have the FeaturePredictableSelectIsExpensive to make sure this is not the same effect showing up in different ways.
If the predictability makes it converge to a reasonable scenario, then I think the change is positive. If not, it's probably just noise.
Renato,
I agree to all your points. FWIW, the idea of using value range for static branch prediction is not new. The idea was first introduced by Jason Patterson in his SIGPLAN'95 paper: "Accurate Static Branch Prediction by Value Range Propagation". So I think there is definitely some value in this. However, I have dropped this from my plate because I have already spent a lot of time trying to improve this past several months, so I would rather spend my efforts elsewhere. Thanks for the review.