This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add PredictableSelectIsExpensive feature to all the cpus that have FeatureEnableSelectOptimize
ClosedPublic

Authored by aleksandr.popov on Feb 2 2023, 2:50 AM.

Details

Summary

In the revision https://reviews.llvm.org/D138990 was enabled select
optimize pass for AArch64.

We were doing some benchmarking on the Neoverse V1 and were
experimenting with select optimize heuristics. We found out that there
are some additional profitable transformations to predictable branches
(with prediction rate > 75% according to Agner Fog's rule of thumb) can
be done by base heuristic from SelectOptimize pass or by
optimizeSelectInst form CodeGenPrepare pass. But they are blocked on the
Neoverse V1, since PredictableSelectIsExpensive feature is not set for
that subtarget.

Note that to achieve this results we also changed predictable branch
threshold from 99% to 75%: https://reviews.llvm.org/D143060

Looks like it makes sense to add this feature to all targets where was
enabled select optimize pass in the https://reviews.llvm.org/D138990.

Diff Detail

Event Timeline

aleksandr.popov created this revision.Feb 2 2023, 2:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 2 2023, 2:50 AM
aleksandr.popov requested review of this revision.Feb 2 2023, 2:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 2 2023, 2:50 AM
aleksandr.popov edited the summary of this revision. (Show Details)Feb 2 2023, 2:58 AM

So ideally, FeatureEnableSelectOptimize and FeaturePredictableSelectIsExpensive would be a single target feature that control both same options. But when I was trying it, FeaturePredictableSelectIsExpensive was leading to some performance degradations compared to just the inner loop heuristic controlled by FeatureEnableSelectOptimize. I chose to be conservative and only enable the one, although the performance differences could have been more noise than real regressions, and I don't remember them being very large. Predictable Selects are not really slow on AArch64, they are just CSEL instructions which have a latency of 1 and a decent throughput. Branches can be quicker in situation, but it is very depent on a lot of factors that can be difficult for the compiler to guess at.

In the end of the day it is performance of the particular heuristics that matters. Do you have performance measurements to share? Are they with the 99% or 75%? Did you plan to change that? And are you running with or without PGO? Thanks.

aleksandr.popov added a comment.EditedFeb 7 2023, 3:52 AM

In the end of the day it is performance of the particular heuristics that matters. Do you have performance measurements to share? Are they with the 99% or 75%? Did you plan to change that? And are you running with or without PGO? Thanks.

Yep, we have been running benchmarks on java VM with llvm based JIT compiler. So that SelectOptimize pass relied on the profile information about branches weights.

The most runs with were made on the following benchmarks:
Renaissance Benchmark Suite 0.0%
SPECjvm2008 +2,5%
We've done a big amount of repeats to ensure that there is no nosie in the results.

During results analysis we've found out that there were cases when it's reasonable to convert select to branch and to get better score on the benchmark, but SelectOptimize pass made a decision not to do that.
There were 2 approaches to fix that:

  • to update SelectOptimize's loop level heuristic;
  • to use another one heuristic on the top of SelectOptimize pass;

The second approach turned out to be simple and effective: we just converted all selects with prediction rate > 75%.
Anger Fog's rule of thumb for x86 seems to be working fine for AArch64 also (https://discourse.llvm.org/t/rfc-cmov-vs-branch-optimization/6040)

In the end of the day it is performance of the particular heuristics that matters. Do you have performance measurements to share? Are they with the 99% or 75%? Did you plan to change that? And are you running with or without PGO? Thanks.

Yep, we have been running benchmarks on java VM with llvm based JIT compiler. So that SelectOptimize pass relied on the profile information about branches weights.

The most runs with were made on the following benchmarks:
Scala Benchmark Suite 0.0%
SPECjvm2008 +2,5%
We've done a big amount of repeats to ensure that there is no nosie in the results.

During results analysis we've found out that there were cases when it's reasonable to convert select to branch and to get better score on the benchmark, but SelectOptimize pass made a decision not to do that.
There were 2 approaches to fix that:

  • to update SelectOptimize's loop level heuristic;
  • to use another one heuristic on the top of SelectOptimize pass;

The second approach turned out to be simple and effective: we just converted all selects with prediction rate > 75%.
Anger Fog's rule of thumb for x86 seems to be working fine for AArch64 also (https://discourse.llvm.org/t/rfc-cmov-vs-branch-optimization/6040)

OK Thanks. Just to check:

  • This is java
  • Has accurate profiling info
  • Is using the 75% prediction rate threshold, and this patch? Or just this patch?

I can try an few benchmarks and see how it behaves with PredictableSelectIsExpensive from C, with and without PGO info.

Yes that's right: java, accurate profiling info, 75% prediction rate threshold and this patch.

Thank you, Dave!

Hi, Dave! @dmgreen
Did you have a chance to try some benchmarks with this patch?

Hi sorry for the delay. This fell off my radar after some benchmarking failed and the reruns took a while. I think the results are probably fine overall. None of the changes I saw were very large and at least in some cases it was a small improvement.

I think it is worth adding FeaturePredictableSelectIsExpensive to all the cpus that have FeatureEnableSelectOptimize to keep them consistent. But maybe not Generic if it is not applicable to all cpus. We could think about combining FeaturePredictableSelectIsExpensive and FeatureEnableSelectOptimize into a single feature at some point, but it is probably useful to have them separate for the time being, in case we receive any reports of performance getting worse.

@SjoerdMeijer any opinion from your end?

Sorry, I also forgot about this, but thanks for pinging me.

I will also do a few benchmarks runs. Shouldn't take too long, will report back in a day.

I am also interested in the neoverse-v2 while we are at it. But I need a little bit more time to get numbers for that, so maybe we can do the v1 first and then I will follow up for the v2.

I did some performance runs, nothing stands out, so LGTM.

I agree with this and will let @dmgreen handle that:

I think it is worth adding FeaturePredictableSelectIsExpensive to all the cpus that have FeatureEnableSelectOptimize to keep them consistent. But maybe not Generic if it is not applicable to all cpus. We could think about combining FeaturePredictableSelectIsExpensive and FeatureEnableSelectOptimize into a single feature at some point, but it is probably useful to have them separate for the time being, ..

I have no opinion whether we do that now or later.

Thanks, guys!
@dmgreen could you please clarify, did you run benchmarks with PredictableSelectIsExpensive only or did you use 75% prediction rate threshold as well?

Thanks, guys!
@dmgreen could you please clarify, did you run benchmarks with PredictableSelectIsExpensive only or did you use 75% prediction rate threshold as well?

Only with PredictableSelectIsExpensive, keeping the default threshold. Without any PGO data it might be difficult to change the predication rate to something so low.

khchen added a subscriber: khchen.Apr 12 2023, 11:37 PM

I am also interested in the neoverse-v2 while we are at it. But I need a little bit more time to get numbers for that, so maybe we can do the v1 first and then I will follow up for the v2.

Hi @SjoerdMeijer! Did you have a chance to test it with neoverse-v2?
If not, could we now land this patch with update for neoverse-v1 only?

Sorry, didn't have time to look into the V2.
If @dmgreen is happy with this change, then please land this and then we can follow up later.

I'm not a huge fan of this being neoverse-v1 only, I'm afraid. I don't know of any reason why it would be. I think we should follow the comment in https://reviews.llvm.org/D143162#4212859 and make the same change for all similar cpus.

I'm not a huge fan of this being neoverse-v1 only, I'm afraid. I don't know of any reason why it would be. I think we should follow the comment in https://reviews.llvm.org/D143162#4212859 and make the same change for all similar cpus.

Make sense. I don't want to hold up this patch. So I am okay to just add it for v2 too then.
Then I will try to benchmark this later. I don't have time this week, maybe towards the end of next.

aleksandr.popov edited the summary of this revision. (Show Details)
aleksandr.popov retitled this revision from [AArch64] Add PredictableSelectIsExpensive feature for the Neoverse V1 to [AArch64] Add PredictableSelectIsExpensive feature to all the cpus that have FeatureEnableSelectOptimize.Apr 26 2023, 2:09 AM
aleksandr.popov edited the summary of this revision. (Show Details)

I've updated Diff according to the comment in https://reviews.llvm.org/D143162#4212859

Thanks this looks good. Is it possible to write a test for a standard case where it is beneficial to convert a select to branches, and check that at least some of the cpus here enable it?

Matt added a subscriber: Matt.Apr 26 2023, 3:39 PM

Is it possible to write a test for a standard case where it is beneficial to convert a select to branches, and check that at least some of the cpus here enable it?

Sure, done!

dmgreen accepted this revision.May 9 2023, 1:14 AM

Thanks. LGTM

This revision is now accepted and ready to land.May 9 2023, 1:14 AM