This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Enable the select optimize pass for AArch64
Closed · Public

Authored by dmgreen on Nov 30 2022, 1:42 AM.

Details

Summary

I have been running some experiments with the select optimization pass
for ARM AArch64 cores. It tries to solve a problem that is difficult for
the compiler to get right. The criteria for when a csel is better or
worse than a branch depend heavily on whether the branch is well
predicted and on the amount of ILP in the loop (as well as other factors
such as the core in question and the relative performance of its branch
predictor). The pass seems to do a decent job though: the inner-loop
heuristics are well implemented and, in general, do a better job than I
had expected, even without PGO information.
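
To make the trade-off concrete, here is a small, hypothetical C++ example
(not taken from the patch or its tests). The ternary below is normally
lowered to a csel on AArch64, and the select-optimize pass decides whether
turning it back into a branch is likely to pay off, based on how predictable
the condition is and how much ILP the loop otherwise has. (In a real -O3 run
other passes, e.g. the vectorizer, may claim this loop first; this is purely
an illustration of the csel-vs-branch trade-off.)

  // Hypothetical example, not from the patch: a select feeding a
  // loop-carried dependency.
  long thresholdSum(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i) {
      // Typically lowered to a compare + csel on AArch64. If the condition
      // is well predicted, a branch can be faster because the core can
      // speculate past it; if it is poorly predicted, the csel tends to win.
      int v = a[i] > 128 ? a[i] : 0;
      sum += v;
    }
    return sum;
  }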

I've been doing quite a bit of benchmarking. The headline numbers are
these for SPEC2017 on a Neoverse N1:

500.perlbench_r   -0.12%
502.gcc_r         0.02%
505.mcf_r         6.02%
520.omnetpp_r     0.32%
523.xalancbmk_r   0.20%
525.x264_r        0.02%
531.deepsjeng_r   0.00%
541.leela_r       -0.09%
548.exchange2_r   0.00%
557.xz_r          -0.20%

Running benchmarks with a combination of the llvm-test-suite plus
several versions of SPEC gave between a 0.2% and 0.4% geomean
improvement, depending on the core/run. The instruction count went down
by 0.1% too, which is a good sign, but the results can be a little noisy.
Some issues from other benchmarks I had run were improved in
rGca78b5601466f8515f5f958ef8e63d787d9d812e. In summary, well-predicted
branches will see an improvement, badly predicted branches may get
worse, and on average performance seems to be a little better overall.

This patch enables the pass for AArch64 under -O3 for cores that will
benefit from it, i.e. not for in-order cores, which do not fit the
"Assume infinite resources that allow to fully exploit the available
instruction-level parallelism" cost model. It uses a subtarget feature
to specify when the pass is enabled, and I have enabled it under
cpu=generic, as the performance increases for out-of-order cores seem
larger than any decreases for in-order cores.
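
For readers unfamiliar with how such gating is usually wired up, the rough
shape is something like the sketch below. The placement and the names of the
guard and feature are my assumptions for illustration rather than a copy of
this patch; createSelectOptimizePass() is, to my knowledge, the existing
factory function for the pass.

  // Sketch only: the guard name and its wiring to the new subtarget feature
  // are assumptions, not copied from the patch.
  void AArch64PassConfig::addIRPasses() {
    // ... existing IR-level passes ...
    if (getOptLevel() == CodeGenOpt::Aggressive &&
        EnableSelectOptimize) // assumed: backed by the subtarget feature
      addPass(createSelectOptimizePass());
    // ... remaining passes ...
  }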

Diff Detail

Event Timeline

dmgreen created this revision. Nov 30 2022, 1:42 AM
Herald added a project: Restricted Project. · View Herald Transcript · Nov 30 2022, 1:42 AM
dmgreen requested review of this revision. Nov 30 2022, 1:42 AM

Nice one Dave!
I am going to take this patch and run some benchmarking on a different AArch64 system. I will report back soon.

SjoerdMeijer added a comment (edited). Nov 30 2022, 2:08 AM

In the meantime, how about compile times? Did you measure that by any chance?

Nice one Dave!
I am going to take this patch and run some benchmarking on a different AArch64 system. I will report back soon.

Sounds good, let us know how they look.

In the meantime, how about compile times? Did you measure that by any chance?

Hmm. I have not, but this is only a scan over the instructions in each function, not anything that isn't done many times in the compiler already. I can check that that is true, though, and that there isn't anything costly hiding in it.

In the meantime, how about compile times? Did you measure that by any chance?

Hmm. I have not, but this is only a scan over the instructions in each function, not anything that isn't done many times in the compiler already. I can check that that is true, though, and that there isn't anything costly hiding in it.

Thanks, and yes, that's what I was expecting, i.e. that this is a simple scan and thus does not contribute much. I thought it would be good to check, to avoid being surprised for a reason we had not thought about.

The compile time was a little higher than I expected, at around a 0.25% increase. I was expecting it to be closer to 0%, but some of the tests in CTMark are around 0.5% (with others at 0%). Considering this is only enabled at -O3, that is probably an acceptable amount.

Thanks Dave for this extensive evaluation and reporting!

The performance improvements even for non-PGO builds are somewhat expected, given that the AArch64 backend currently has almost no logic to make this decision (unlike x86) and just aggressively prefers predication.
So, even without profile information, the loop-level heuristics, albeit conservative, allow some obvious cases to be converted to branches.

Note that internally at Google we have already enabled the select-optimize pass for all instrPGO-optimized builds, including AArch64. The performance improvements for AArch64 appear to be even more significant than on x86.
For non-PGO builds, we have also seen significant improvements for some microbenchmarks on AArch64, but I have not had the time to investigate further. So, your efforts are more than welcome.

Regarding compilation time, the impact should be small. For non-PGO builds, we essentially only have the loop-level heuristic, which does two passes over all the instructions in each loop and, for each instruction, iterates over its operands. So the cost per loop is roughly 2*N*K, where N is the number of instructions in the loop and K is the operand count of each instruction; this is essentially O(N), given that the operand count is a small bounded number.
In practice, the constant factors might be noticeable for some programs with big loops.
Enabling it for -O3 only sounds reasonable. Note that you do not want to enable it for size-optimizing builds (although checks within the pass already prevent it from running in those cases).
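
For intuition about why that stays cheap, here is a small self-contained
sketch (not the pass's actual code) of the shape of the work: two linear
sweeps over a loop body, the second of which also touches each instruction's
operands, for about 2*N*K steps in total.

  // Self-contained illustration of the O(N*K) loop scan; the types and the
  // per-instruction/per-operand "work" are placeholders, not the real pass.
  #include <vector>

  struct Instr {
    std::vector<int> Operands; // stand-in for the instruction's operands
    int Latency = 1;
  };

  int estimateLoopCost(const std::vector<Instr> &LoopBody) {
    int Cost = 0;
    // Sweep 1: per-instruction work (N steps).
    for (const Instr &I : LoopBody)
      Cost += I.Latency;
    // Sweep 2: per-operand work, e.g. following dependence chains (N*K steps).
    for (const Instr &I : LoopBody)
      for (int Op : I.Operands)
        Cost += (Op != 0); // placeholder per-operand work
    return Cost; // ~2*N*K total: linear in loop size for bounded K
  }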

My first SPEC performance results are in line with yours. The gain for MCF is bigger, but the gap is also bigger than on the N1, so the trend is the same. I would like to do a bit more testing though and will report back tomorrow.

Thanks for checking the compile times. Yeah, perhaps surprisingly, a bit higher than expected, but agreed that it looks reasonable for -O3.

And the confirmation from Google is nice. I think it shows we are on the right track with this.

SjoerdMeijer accepted this revision. Dec 1 2022, 9:43 AM

I have done a bit more testing, and results are okay. This looks good to me.

One nit: the tests are a bit big. At least some comments explaining what's going on would be nice.

This revision is now accepted and ready to land. Dec 1 2022, 9:43 AM
This revision was landed with ongoing or failed builds. Dec 3 2022, 8:09 AM
This revision was automatically updated to reflect the committed changes.

I am wondering if we could enable it for X86 too. Any known blockers?