This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add PredictableSelectIsExpensive feature to all the cpus that have FeatureEnableSelectOptimize
ClosedPublic

Authored by aleksandr.popov on Feb 2 2023, 2:50 AM.

Details

Reviewers

dmgreen
SjoerdMeijer
apostolakis
skatkov
apilipenko

Commits

rG22f21738370c: [AArch64] Add PredictableSelectIsExpensive feature to all the cpus that have…

Summary

Revision https://reviews.llvm.org/D138990 enabled the select optimize
pass for AArch64.

While benchmarking on the Neoverse V1 and experimenting with the select
optimize heuristics, we found additional profitable conversions of
selects to predictable branches (prediction rate > 75%, following Agner
Fog's rule of thumb). These conversions can be done by the base heuristic
of the SelectOptimize pass or by optimizeSelectInst from the
CodeGenPrepare pass, but they are blocked on the Neoverse V1 because the
PredictableSelectIsExpensive feature is not set for that subtarget.

Note that to achieve these results we also changed the predictable branch
threshold from 99% to 75%: https://reviews.llvm.org/D143060

It makes sense to add this feature to all targets for which the select
optimize pass was enabled in https://reviews.llvm.org/D138990.
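
The predictability check that the summary relies on can be sketched as follows. This is a simplified Python model, not LLVM code: the helper name `is_highly_predictable` is hypothetical, and the sketch assumes that the prediction rate is the larger of the two profile weights divided by their sum, compared against the threshold with a strict `>`. The 99% and 75% values are the old and proposed thresholds mentioned in the summary.

```python
from fractions import Fraction


def is_highly_predictable(true_weight: int, false_weight: int,
                          threshold: Fraction) -> bool:
    """Return True if the likelier side of a select exceeds `threshold`.

    Hypothetical helper: models the prediction rate as
    max(weight) / sum(weights), which is an assumption of this sketch.
    """
    total = true_weight + false_weight
    if total == 0:
        # No usable profile data: treat the select as unpredictable.
        return False
    rate = Fraction(max(true_weight, false_weight), total)
    return rate > threshold


OLD_THRESHOLD = Fraction(99, 100)  # threshold before D143060
NEW_THRESHOLD = Fraction(75, 100)  # threshold proposed in D143060

# A select that goes one way 90% of the time qualifies only under the 75% rule.
print(is_highly_predictable(90, 10, OLD_THRESHOLD))  # False
print(is_highly_predictable(90, 10, NEW_THRESHOLD))  # True
```

With the 99% threshold only almost fully biased selects qualify, which is why the feature alone changes little unless the profile is very skewed.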

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

aleksandr.popov created this revision. Feb 2 2023, 2:50 AM

Herald added a project: Restricted Project. Feb 2 2023, 2:50 AM

Herald added subscribers: hiraditya, kristof.beyls.

aleksandr.popov requested review of this revision. Feb 2 2023, 2:50 AM

Herald added a project: Restricted Project. Feb 2 2023, 2:50 AM

Herald added a subscriber: llvm-commits.

aleksandr.popov edited the summary of this revision. Feb 2 2023, 2:58 AM

aleksandr.popov added reviewers: dmgreen, SjoerdMeijer, apostolakis, skatkov, apilipenko. Feb 2 2023, 3:02 AM

Harbormaster completed remote builds in B211432: Diff 494227. Feb 2 2023, 3:45 AM

So ideally, FeatureEnableSelectOptimize and FeaturePredictableSelectIsExpensive would be a single target feature that controls both options. But when I was trying it, FeaturePredictableSelectIsExpensive was leading to some performance degradations compared to just the inner loop heuristic controlled by FeatureEnableSelectOptimize. I chose to be conservative and only enable the one, although the performance differences could have been more noise than real regressions, and I don't remember them being very large. Predictable selects are not really slow on AArch64; they are just CSEL instructions, which have a latency of 1 and decent throughput. Branches can be quicker in some situations, but that is very dependent on a lot of factors that can be difficult for the compiler to guess at.

At the end of the day it is the performance of the particular heuristics that matters. Do you have performance measurements to share? Are they with the 99% or the 75% threshold? Did you plan to change that? And are you running with or without PGO? Thanks.

Yep, we have been running benchmarks on a Java VM with an LLVM-based JIT compiler, so the SelectOptimize pass relied on profile information about branch weights.

Most of the runs were made on the following benchmarks:
Renaissance Benchmark Suite 0.0%
SPECjvm2008 +2.5%
We did a large number of repeats to ensure that there is no noise in the results.

While analyzing the results we found cases where converting a select to a branch is reasonable and gives a better benchmark score, but the SelectOptimize pass decided not to do it.
There were 2 approaches to fix that:

1. update SelectOptimize's loop-level heuristic;
2. use one more heuristic on top of the SelectOptimize pass.

The second approach turned out to be simple and effective: we just converted all selects with prediction rate > 75%.
Agner Fog's rule of thumb for x86 seems to work fine for AArch64 as well (https://discourse.llvm.org/t/rfc-cmov-vs-branch-optimization/6040)

OK Thanks. Just to check:

This is java
Has accurate profiling info
Is using the 75% prediction rate threshold, and this patch? Or just this patch?

I can try a few benchmarks and see how it behaves with PredictableSelectIsExpensive from C, with and without PGO info.

Yes that's right: java, accurate profiling info, 75% prediction rate threshold and this patch.

Thank you, Dave!

Hi, Dave! @dmgreen
Did you have a chance to try some benchmarks with this patch?

Hi sorry for the delay. This fell off my radar after some benchmarking failed and the reruns took a while. I think the results are probably fine overall. None of the changes I saw were very large and at least in some cases it was a small improvement.

I think it is worth adding FeaturePredictableSelectIsExpensive to all the cpus that have FeatureEnableSelectOptimize to keep them consistent. But maybe not Generic if it is not applicable to all cpus. We could think about combining FeaturePredictableSelectIsExpensive and FeatureEnableSelectOptimize into a single feature at some point, but it is probably useful to have them separate for the time being, in case we receive any reports of performance getting worse.

@SjoerdMeijer any opinion from your end?

Sorry, I also forgot about this, but thanks for pinging me.

I will also do a few benchmark runs. Shouldn't take too long, will report back in a day.

I am also interested in the neoverse-v2 while we are at it. But I need a little bit more time to get numbers for that, so maybe we can do the v1 first and then I will follow up for the v2.

I did some performance runs, nothing stands out, so LGTM.

I agree with this and will let @dmgreen handle that:

I think it is worth adding FeaturePredictableSelectIsExpensive to all the cpus that have FeatureEnableSelectOptimize to keep them consistent. But maybe not Generic if it is not applicable to all cpus. We could think about combining FeaturePredictableSelectIsExpensive and FeatureEnableSelectOptimize into a single feature at some point, but it is probably useful to have them separate for the time being, ..

I have no opinion whether we do that now or later.

Thanks, guys!
@dmgreen could you please clarify, did you run benchmarks with PredictableSelectIsExpensive only or did you use 75% prediction rate threshold as well?

Only with PredictableSelectIsExpensive, keeping the default threshold. Without any PGO data it might be difficult to change the prediction rate to something so low.

khchen added a subscriber: khchen. Apr 12 2023, 11:37 PM

I am also interested in the neoverse-v2 while we are at it. But I need a little bit more time to get numbers for that, so maybe we can do the v1 first and then I will follow up for the v2.

Hi @SjoerdMeijer! Did you have a chance to test it with neoverse-v2?
If not, could we land this patch now with the update for neoverse-v1 only?

Sorry, didn't have time to look into the V2.
If @dmgreen is happy with this change, then please land this and then we can follow up later.

I'm not a huge fan of this being neoverse-v1 only, I'm afraid. I don't know of any reason why it would be. I think we should follow the comment in https://reviews.llvm.org/D143162#4212859 and make the same change for all similar cpus.

Makes sense. I don't want to hold up this patch, so I am okay with just adding it for v2 too then.
Then I will try to benchmark this later. I don't have time this week, maybe towards the end of next.

aleksandr.popov updated this revision to Diff 517100. Apr 26 2023, 2:06 AM

aleksandr.popov edited the summary of this revision.

I've updated the diff according to the comment in https://reviews.llvm.org/D143162#4212859

Thanks this looks good. Is it possible to write a test for a standard case where it is beneficial to convert a select to branches, and check that at least some of the cpus here enable it?

Matt added a subscriber: Matt. Apr 26 2023, 3:39 PM

Is it possible to write a test for a standard case where it is beneficial to convert a select to branches, and check that at least some of the cpus here enable it?

Sure, done!

Harbormaster completed remote builds in B230273: Diff 519898. May 5 2023, 10:06 AM

Thanks. LGTM

This revision is now accepted and ready to land. May 9 2023, 1:14 AM

Closed by commit rG22f21738370c: [AArch64] Add PredictableSelectIsExpensive feature to all the cpus that have… (authored by aleksandr.popov). Jul 3 2023, 3:33 AM

This revision was automatically updated to reflect the committed changes.

aleksandr.popov added a commit: rG22f21738370c: [AArch64] Add PredictableSelectIsExpensive feature to all the cpus that have….

Revision Contents

Path | Size
llvm/lib/Target/AArch64/AArch64.td | 54 lines
llvm/test/CodeGen/AArch64/convert-highly-predictable-select-to-branch.ll | 60 lines

Diff 536706

llvm/lib/Target/AArch64/AArch64.td

@@ 786 unchanged lines hidden @@ def TuneA57 : SubtargetFeature<"a57", "ARMProcFamily", "CortexA57",
     FeaturePredictableSelectIsExpensive]>;

 def TuneA65 : SubtargetFeature<"a65", "ARMProcFamily", "CortexA65",
     "Cortex-A65 ARM processors", [
     FeatureFuseAES,
     FeatureFuseAddress,
     FeatureFuseAdrpAdd,
     FeatureFuseLiterals,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneA72 : SubtargetFeature<"a72", "ARMProcFamily", "CortexA72",
     "Cortex-A72 ARM processors", [
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
     FeatureFuseLiterals,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneA73 : SubtargetFeature<"a73", "ARMProcFamily", "CortexA73",
     "Cortex-A73 ARM processors", [
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneA75 : SubtargetFeature<"a75", "ARMProcFamily", "CortexA75",
     "Cortex-A75 ARM processors", [
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneA76 : SubtargetFeature<"a76", "ARMProcFamily", "CortexA76",
     "Cortex-A76 ARM processors", [
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
     FeatureLSLFast,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneA77 : SubtargetFeature<"a77", "ARMProcFamily", "CortexA77",
     "Cortex-A77 ARM processors", [
     FeatureCmpBccFusion,
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
     FeatureLSLFast,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneA78 : SubtargetFeature<"a78", "ARMProcFamily", "CortexA78",
     "Cortex-A78 ARM processors", [
     FeatureCmpBccFusion,
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
     FeatureLSLFast,
     FeaturePostRAScheduler,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneA78C : SubtargetFeature<"a78c", "ARMProcFamily",
     "CortexA78C",
     "Cortex-A78C ARM processors", [
     FeatureCmpBccFusion,
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
     FeatureLSLFast,
     FeaturePostRAScheduler,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneA710 : SubtargetFeature<"a710", "ARMProcFamily", "CortexA710",
     "Cortex-A710 ARM processors", [
     FeatureCmpBccFusion,
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
     FeatureLSLFast,
     FeaturePostRAScheduler,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneA715 : SubtargetFeature<"a715", "ARMProcFamily", "CortexA715",
     "Cortex-A715 ARM processors", [
     FeatureFuseAES,
     FeaturePostRAScheduler,
     FeatureCmpBccFusion,
     FeatureLSLFast,
     FeatureFuseAdrpAdd,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneR82 : SubtargetFeature<"cortex-r82", "ARMProcFamily",
     "CortexR82",
     "Cortex-R82 ARM processors", [
     FeaturePostRAScheduler]>;

 def TuneX1 : SubtargetFeature<"cortex-x1", "ARMProcFamily", "CortexX1",
     "Cortex-X1 ARM processors", [
     FeatureCmpBccFusion,
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
     FeatureLSLFast,
     FeaturePostRAScheduler,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneX2 : SubtargetFeature<"cortex-x2", "ARMProcFamily", "CortexX2",
     "Cortex-X2 ARM processors", [
     FeatureCmpBccFusion,
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
     FeatureLSLFast,
     FeaturePostRAScheduler,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneX3 : SubtargetFeature<"cortex-x3", "ARMProcFamily", "CortexX3",
     "Cortex-X3 ARM processors", [
     FeatureLSLFast,
     FeatureFuseAdrpAdd,
     FeatureFuseAES,
     FeaturePostRAScheduler,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneA64FX : SubtargetFeature<"a64fx", "ARMProcFamily", "A64FX",
     "Fujitsu A64FX processors", [
     FeaturePostRAScheduler,
     FeatureAggressiveFMA,
     FeatureArithmeticBccFusion,
     FeaturePredictableSelectIsExpensive
     ]>;
@@ 168 unchanged lines hidden @@ def TuneNeoverseE1 : SubtargetFeature<"neoversee1", "ARMProcFamily", "NeoverseE1",
     FeaturePostRAScheduler]>;

 def TuneNeoverseN1 : SubtargetFeature<"neoversen1", "ARMProcFamily", "NeoverseN1",
     "Neoverse N1 ARM processors", [
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
     FeatureLSLFast,
     FeaturePostRAScheduler,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneNeoverseN2 : SubtargetFeature<"neoversen2", "ARMProcFamily", "NeoverseN2",
     "Neoverse N2 ARM processors", [
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
     FeatureLSLFast,
     FeaturePostRAScheduler,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneNeoverse512TVB : SubtargetFeature<"neoverse512tvb", "ARMProcFamily", "Neoverse512TVB",
     "Neoverse 512-TVB ARM processors", [
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
     FeatureLSLFast,
     FeaturePostRAScheduler,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneNeoverseV1 : SubtargetFeature<"neoversev1", "ARMProcFamily", "NeoverseV1",
     "Neoverse V1 ARM processors", [
     FeatureFuseAES,
     FeatureFuseAdrpAdd,
     FeatureLSLFast,
     FeaturePostRAScheduler,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2",
     "Neoverse V2 ARM processors", [
     FeatureFuseAES,
     FeatureLSLFast,
     FeaturePostRAScheduler,
-    FeatureEnableSelectOptimize]>;
+    FeatureEnableSelectOptimize,
+    FeaturePredictableSelectIsExpensive]>;

 def TuneSaphira : SubtargetFeature<"saphira", "ARMProcFamily", "Saphira",
     "Qualcomm Saphira processors", [
     FeatureCustomCheapAsMoveHandling,
     FeaturePostRAScheduler,
     FeaturePredictableSelectIsExpensive,
     FeatureZCZeroing,
     FeatureLSLFast]>;
@@ 419 unchanged lines hidden @@
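
The consistency goal discussed in the review (every tuning definition that has FeatureEnableSelectOptimize should also have FeaturePredictableSelectIsExpensive) can be checked mechanically. The following is a rough sketch only: the names `TUNE_DEF`, `missing_predictable_select` and `SAMPLE` are hypothetical, the regular expression handles just the simple `def Tune... : SubtargetFeature<..., [ ... ]>;` shape shown in this diff (no comments, no nested brackets), and the sample deliberately mixes a pre-patch TuneA72 with a post-patch TuneNeoverseV1.

```python
import re
from typing import List

# Matches one `def TuneX : SubtargetFeature<..., [ features ]>;` definition
# and captures the definition name and the raw feature list.
TUNE_DEF = re.compile(
    r"def\s+(Tune\w+)\s*:\s*SubtargetFeature<.*?\[(.*?)\]>;", re.DOTALL)


def missing_predictable_select(td_text: str) -> List[str]:
    """List Tune defs that enable select optimize but lack the second feature."""
    names = []
    for name, body in TUNE_DEF.findall(td_text):
        features = {f.strip() for f in body.split(",") if f.strip()}
        if ("FeatureEnableSelectOptimize" in features
                and "FeaturePredictableSelectIsExpensive" not in features):
            names.append(name)
    return names


# Illustrative input: TuneA72 as it was before the patch, the others after it.
SAMPLE = '''
def TuneA72 : SubtargetFeature<"a72", "ARMProcFamily", "CortexA72",
    "Cortex-A72 ARM processors", [
    FeatureFuseAES,
    FeatureFuseAdrpAdd,
    FeatureFuseLiterals,
    FeatureEnableSelectOptimize]>;

def TuneNeoverseV1 : SubtargetFeature<"neoversev1", "ARMProcFamily", "NeoverseV1",
    "Neoverse V1 ARM processors", [
    FeatureFuseAES,
    FeatureFuseAdrpAdd,
    FeatureLSLFast,
    FeaturePostRAScheduler,
    FeatureEnableSelectOptimize,
    FeaturePredictableSelectIsExpensive]>;

def TuneR82 : SubtargetFeature<"cortex-r82", "ARMProcFamily",
    "CortexR82",
    "Cortex-R82 ARM processors", [
    FeaturePostRAScheduler]>;
'''

print(missing_predictable_select(SAMPLE))  # ['TuneA72']
```

A check like this would only be a convenience; the authoritative source remains the TableGen file itself.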

llvm/test/CodeGen/AArch64/convert-highly-predictable-select-to-branch.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -select-optimize -mtriple=aarch64-linux-gnu -mcpu=generic -S < %s | FileCheck %s --check-prefix=CHECK-GENERIC
; RUN: opt -select-optimize -mtriple=aarch64-linux-gnu -mcpu=neoverse-n1 -S < %s | FileCheck %s
; RUN: opt -select-optimize -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 -S < %s | FileCheck %s
; RUN: opt -select-optimize -mtriple=aarch64-linux-gnu -mcpu=cortex-a72 -S < %s | FileCheck %s

; Test has not predictable select, which should not be transformed to a branch
define i32 @test1(i32 %a) {
; CHECK-GENERIC-LABEL: @test1(
; CHECK-GENERIC-NEXT:  entry:
; CHECK-GENERIC-NEXT:    [[CMP:%.*]] = icmp slt i32 [[A:%.*]], 1
; CHECK-GENERIC-NEXT:    [[DEC:%.*]] = sub i32 [[A]], 1
; CHECK-GENERIC-NEXT:    [[RES:%.*]] = select i1 [[CMP]], i32 0, i32 [[DEC]], !prof [[PROF0:![0-9]+]]
; CHECK-GENERIC-NEXT:    ret i32 [[RES]]
;
; CHECK-LABEL: @test1(
; CHECK-NEXT:  entry:
; CHECK-NEXT:    [[CMP:%.*]] = icmp slt i32 [[A:%.*]], 1
; CHECK-NEXT:    [[DEC:%.*]] = sub i32 [[A]], 1
; CHECK-NEXT:    [[RES:%.*]] = select i1 [[CMP]], i32 0, i32 [[DEC]], !prof [[PROF0:![0-9]+]]
; CHECK-NEXT:    ret i32 [[RES]]
;
entry:
  %cmp = icmp slt i32 %a, 1
  %dec = sub i32 %a, 1
  %res = select i1 %cmp, i32 0, i32 %dec, !prof !0
  ret i32 %res
}

; Test has highly predictable select according to profile data,
; which should be transformed to a branch on cores with enabled FeaturePredictableSelectIsExpensive
define i32 @test2(i32 %a) {
; CHECK-GENERIC-LABEL: @test2(
; CHECK-GENERIC-NEXT:  entry:
; CHECK-GENERIC-NEXT:    [[CMP:%.*]] = icmp slt i32 [[A:%.*]], 1
; CHECK-GENERIC-NEXT:    [[DEC:%.*]] = sub i32 [[A]], 1
; CHECK-GENERIC-NEXT:    [[RES:%.*]] = select i1 [[CMP]], i32 0, i32 [[DEC]], !prof [[PROF1:![0-9]+]]
; CHECK-GENERIC-NEXT:    ret i32 [[RES]]
;
; CHECK-LABEL: @test2(
; CHECK-NEXT:  entry:
; CHECK-NEXT:    [[CMP:%.*]] = icmp slt i32 [[A:%.*]], 1
; CHECK-NEXT:    [[RES_FROZEN:%.*]] = freeze i1 [[CMP]]
; CHECK-NEXT:    br i1 [[RES_FROZEN]], label [[SELECT_END:%.*]], label [[SELECT_FALSE_SINK:%.*]], !prof [[PROF1:![0-9]+]]
; CHECK:       select.false.sink:
; CHECK-NEXT:    [[DEC:%.*]] = sub i32 [[A]], 1
; CHECK-NEXT:    br label [[SELECT_END]]
; CHECK:       select.end:
; CHECK-NEXT:    [[RES:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[DEC]], [[SELECT_FALSE_SINK]] ]
; CHECK-NEXT:    ret i32 [[RES]]
;
entry:
  %cmp = icmp slt i32 %a, 1
  %dec = sub i32 %a, 1
  %res = select i1 %cmp, i32 0, i32 %dec, !prof !1
  ret i32 %res
}

!0 = !{!"branch_weights", i32 1, i32 1}
!1 = !{!"branch_weights", i32 1, i32 1000}
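
The expectations encoded in this test can be summarized with a small model. It is a sketch under stated assumptions, not a description of the pass: the table `PREDICTABLE_SELECT_IS_EXPENSIVE` and the function `expected_form` are hypothetical names, the per-CPU values are read off the CHECK prefixes above rather than from AArch64.td, and the 99% value is assumed to be the default threshold mentioned in the review.

```python
from fractions import Fraction

# CPUs from the RUN lines, mapped to whether the test expects them to treat
# predictable selects as expensive (inferred from the CHECK prefixes).
PREDICTABLE_SELECT_IS_EXPENSIVE = {
    "generic": False,
    "neoverse-n1": True,
    "neoverse-v2": True,
    "cortex-a72": True,
}

THRESHOLD = Fraction(99, 100)  # assumed default predictable-branch threshold


def expected_form(cpu: str, weights: tuple) -> str:
    """Return 'branch' or 'select' for a select carrying the given weights."""
    rate = Fraction(max(weights), sum(weights))
    if PREDICTABLE_SELECT_IS_EXPENSIVE[cpu] and rate > THRESHOLD:
        return "branch"
    return "select"


TEST1_WEIGHTS = (1, 1)     # !0: 50% either way, not predictable
TEST2_WEIGHTS = (1, 1000)  # !1: 1000/1001, above the 99% threshold

for cpu in PREDICTABLE_SELECT_IS_EXPENSIVE:
    print(cpu, expected_form(cpu, TEST1_WEIGHTS), expected_form(cpu, TEST2_WEIGHTS))
```

The weights 1 and 1000 give a rate of 1000/1001, so test2 is expected to be converted even with the stricter 99% threshold, while the even split in test1 stays a select everywhere.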