I have been running some experiments with the select optimization pass
for AArch64 cores. It is trying to solve a problem that is difficult
for the compiler to get right: whether a csel is better or worse than a
branch depends heavily on whether the branch is well predicted and on
the amount of ILP in the loop (as well as on other factors such as the
core in question and the relative performance of its branch predictor).
The pass seems to do a decent job though: the inner-loop heuristics are
well implemented and in general do a better job than I had expected,
even without PGO information.
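To make the tradeoff concrete, here is a rough sketch of the kind of
code involved (my own example, not taken from the patch or the
benchmarks):

  struct node { int val; struct node *next; };

  // A select in a hot loop that can be lowered either as cmp+csel feeding
  // the add, or as a conditional branch around the add. The csel form adds
  // compare/select latency between the load and the accumulator on every
  // iteration; the branch form avoids that as long as the branch predicts
  // well, but pays a pipeline flush when it does not.
  int sum_positive(const struct node *p) {
    int sum = 0;
    for (; p; p = p->next)
      sum += p->val > 0 ? p->val : 0;
    return sum;
  }

Whether the branch form actually wins depends on how predictable the
condition is for the real data, which is exactly what the heuristics
have to estimate without seeing that data.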
I've been doing quite a bit of benchmarking. The headline numbers for
SPEC2017 on a Neoverse N1 are:
  500.perlbench_r   -0.12%
  502.gcc_r          0.02%
  505.mcf_r          6.02%
  520.omnetpp_r      0.32%
  523.xalancbmk_r    0.20%
  525.x264_r         0.02%
  531.deepsjeng_r    0.00%
  541.leela_r       -0.09%
  548.exchange2_r    0.00%
  557.xz_r          -0.20%
Running benchmarks with a combination of the llvm-test-suite plus
several versions of SPEC gave between a 0.2% and 0.4% geomean
improvement depending on the core/run. The instruction count went down
by 0.1% too, which is a good sign, but the results can be a little noisy.
Some issues from other benchmarks I had run were improved in
rGca78b5601466f8515f5f958ef8e63d787d9d812e. In summary, well-predicted
branches will see an improvement, badly predicted branches may get
worse, and on average performance seems to be a little better overall.
This patch enables the pass for AArch64 under -O3 for cores that will
benefit from it, i.e. not for in-order cores, which do not fit the
"Assume infinite resources that allow to fully exploit the available
instruction-level parallelism" cost model. It uses a subtarget feature
to specify when the pass is enabled, and I have turned it on under
cpu=generic, as the performance increases for out-of-order cores seem
larger than any decreases for in-order ones.
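For anyone wanting to check where the pass kicks in, comparing the code
generated for different cores is a simple test. The specific -mcpu
choices below are my assumption, not part of the patch: Neoverse N1
(the out-of-order core used for the numbers above) versus an in-order
core such as Cortex-A55, where the feature would presumably stay off.

  // select.c - hypothetical kernel for comparing csel vs branch lowering:
  //   clang -O3 -mcpu=neoverse-n1 -S select.c -o n1.s   (out-of-order)
  //   clang -O3 -mcpu=cortex-a55  -S select.c -o a55.s  (in-order)
  // and look at whether the ternary in the loop stays a csel or becomes
  // a compare-and-branch.
  struct link { int val; struct link *next; };

  int clamp_sum(const struct link *p, int hi) {
    int sum = 0;
    for (; p; p = p->next)
      sum += p->val < hi ? p->val : hi;  // select candidate in the inner loop
    return sum;
  }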