This is an archive of the discontinued LLVM Phabricator instance.

Differential D120230

[SelectOpti][1/5] Setup new select-optimize pass
ClosedPublic

Authored by apostolakis on Feb 20 2022, 10:25 PM.

Details

Reviewers

spatel
Carrot
reames
tejohnson

Commits

rGca7c307d1816: [SelectOpti][1/5] Setup new select-optimize pass

Summary

This is the first commit for the cmov-vs-branch optimization pass.
The goal is to develop a new profile-guided and target-independent cost/benefit analysis
for selecting conditional moves over branches when optimizing for performance.

Initially, this new pass is expected to be enabled only for instrumentation-based PGO.

RFC: https://discourse.llvm.org/t/rfc-cmov-vs-branch-optimization/6040

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

apostolakis created this revision. Feb 20 2022, 10:25 PM

Herald added subscribers: hiraditya, mgorny. Feb 20 2022, 10:25 PM

apostolakis edited the summary of this revision. Feb 20 2022, 10:33 PM

apostolakis added a child revision: D120231: [SelectOpti][3/5] Base Heuristics. Feb 20 2022, 10:44 PM

Harbormaster completed remote builds in B150634: Diff 410218. Feb 20 2022, 11:38 PM

apostolakis edited the summary of this revision. Feb 21 2022, 1:32 PM

Herald added a subscriber: wenlei. Feb 21 2022, 1:32 PM

apostolakis published this revision for review. Feb 23 2022, 10:44 AM

Herald added a project: Restricted Project. Feb 23 2022, 10:44 AM

Herald added a subscriber: llvm-commits.

apostolakis added reviewers: spatel, Carrot, reames. Feb 23 2022, 10:45 AM

apostolakis added subscribers: chandlerc, davidxl.

Leave enabling the pass for instr-PGO only to a separate patch, after this patch series gets approved.

Harbormaster completed remote builds in B151548: Diff 411522. Feb 25 2022, 3:45 PM

lkail added a subscriber: lkail. Feb 25 2022, 4:20 PM

shchenz added a subscriber: shchenz. Feb 27 2022, 10:25 PM

Ping

Herald added a project: Restricted Project. Mar 7 2022, 9:00 AM

davidxl added inline comments. Mar 7 2022, 9:41 PM

llvm/include/llvm/CodeGen/CodeGenPassBuilder.h
Line 668: Do we need a canonicalization phase before this to convert jumps to selects when possible?
llvm/lib/CodeGen/TargetPassConfig.cpp
Line 943: Not suitable when size optimization is on.

apostolakis added inline comments. Mar 9 2022, 4:53 PM

llvm/include/llvm/CodeGen/CodeGenPassBuilder.h
Line 668: SimplifyCFG canonicalizes jumps to selects when possible. Currently, SimplifyCFG prevents this canonicalization in a few cases where it is deemed very unprofitable. These cases seem reasonable and are probably not worth converting back to selects just before this pass and reconsidering, although it might be worth empirically evaluating the effect of that. What is more important, in my opinion, is for SimplifyCFG to canonicalize to selects without trying to optimize, so as to maximize its enabling effect for subsequent LLVM IR optimizations (some preliminary experiments showed a small benefit from such a change). Then, at the end, this new pass will decide which form is more profitable. This matter was briefly discussed recently in https://reviews.llvm.org/D118066. Once this new pass is enabled for all PGO builds, I will advocate for changes in SimplifyCFG to aggressively canonicalize to selects when profile information is available, so that an informed decision can be made later without blocking any LLVM IR optimizations.
llvm/lib/CodeGen/TargetPassConfig.cpp
Line 943: The code generation optimization level that is checked here (llvm::CodeGenOpt::Level) contains only four levels (None, Less, Default, Aggressive), none of which is meant for optimize-for-size. I also do not see any other API at this level that would enable a check for size. Instead, the check for size optimization currently happens at the beginning of the new pass (see the next patch in this series, D120231), where it checks the function attributes for OptSize and also invokes the shouldOptimizeForSize function, which uses PGSO heuristics.

The patch looks good to me. Adding Teresa to look at the pass ordering change.

llvm/include/llvm/CodeGen/CodeGenPassBuilder.h
Line 668: SG.
llvm/lib/CodeGen/TargetPassConfig.cpp
Line 943: ok.

lkail mentioned this in D117861: [SimplifyCFG] Enhance costmodel of FoldTwoEntryPHINode while consider branch misprediction. Mar 10 2022, 8:47 PM

In D120230#3374100, @davidxl wrote:

The patch looks good to me. Adding Teresa to look at the pass ordering change.

Do you mean the placement of the new pass? Because I don't see any changes to ordering other than the addition.

One comment about the patch is that it would be good to remove the llvm_unreachable, and test for the pass in one of the pass ordering tests. E.g. llvm/test/CodeGen/X86/opt-pipeline.ll (there are similar ones for other archs too).

Thanks for the patches. We've noticed similar problems as you described in RFC which often leads to way too aggressive cmov (in general and in comparison to gcc). Is the current version of this stack complete and functional? If so, we'd be happy to give it a try to see how it handles the suboptimal cases we spotted.

Right. I mean pass placement.

One comment about the patch is that it would be good to remove the llvm_unreachable, and test for the pass in one of the pass ordering tests. E.g. llvm/test/CodeGen/X86/opt-pipeline.ll (there are similar ones for other archs too).

In this patch series the pass is disabled by default, and I was actually planning a separate follow-up patch (D121547) that enables this pass by default for x86 instr-PGO. In D121547 I had to change X86/opt-pipeline.ll, and there you can see more clearly the placement of this pass (almost just before the CodeGenPrepare pass). If it is preferable, I can move these changes into this patch.

In D120230#3374674, @wenlei wrote:

Thanks for the patches. We've noticed similar problems as you described in RFC which often leads to way too aggressive cmov (in general and in comparison to gcc). Is the current version of this stack complete and functional? If so, we'd be happy to give it a try to see how it handles the suboptimal cases we spotted.

This stack is complete and functional for x86 instr-PGO. It is not yet tuned for Sample-PGO and there are some Sample-PGO-specific improvements that are yet to be made, most notably leveraging LBR data to capture misprediction rates and incorporating them in the heuristics (as discussed with @modimo in the RFC). So, if you are interested in Sample-PGO it is better to wait for the next patch series that will tailor this pass for that. At that point, we can iterate and refine to make sure that the pass addresses these suboptimal cases and avoids regressions for your workloads.

tschuett added a subscriber: tschuett. Mar 13 2022, 5:21 AM

In D120230#3377783, @apostolakis wrote:

In D120230#3374674, @wenlei wrote:

Thanks for the patches. We've noticed similar problems as you described in RFC which often leads to way too aggressive cmov (in general and in comparison to gcc). Is the current version of this stack complete and functional? If so, we'd be happy to give it a try to see how it handles the suboptimal cases we spotted.

This stack is complete and functional for x86 instr-PGO.

Thanks. I'm trying this with IRPGO on a large internal workload now, should have results soon. Will also take a closer look at the patch set.

It is not yet tuned for Sample-PGO and there are some Sample-PGO-specific improvements that are yet to be made, most notably leveraging LBR data to capture misprediction rates and incorporating them in the heuristics (as discussed with @modimo in the RFC).

While we can have branch miss reported by perf tools and feedback into compiler sample PGO, it would be challenging to accurately correlate the actual branch miss to the unoptimized IR at profile loading time (it is a general challenge to apply any low level PMU info for compiler PGO).

So, if you are interested in Sample-PGO it is better to wait for the next patch series that will tailor this pass for that. At that point, we can iterate and refine to make sure that the pass addresses these suboptimal cases and avoids regressions for your workloads.

For sample PGO, we're going to experiment with deferring some compiler if-convert to PLO (BOLT in our case), where the correlation of branch miss won't be a problem, and also to avoid the problem of being stuck with cmov. cc @Amir @maksfb

FSAFDO allows late profile loading so that the matching of branches should be less of an issue.

In D120230#3378380, @wenlei wrote:

In D120230#3377783, @apostolakis wrote:

In D120230#3374674, @wenlei wrote:

Thanks for the patches. We've noticed similar problems as you described in RFC which often leads to way too aggressive cmov (in general and in comparison to gcc). Is the current version of this stack complete and functional? If so, we'd be happy to give it a try to see how it handles the suboptimal cases we spotted.

This stack is complete and functional for x86 instr-PGO.

Thanks. I'm trying this with IRPGO on a large internal workload now, should have results soon. Will also take a closer look at the patch set.

Great! Let me know if you see any regressions or issues.

It is not yet tuned for Sample-PGO and there are some Sample-PGO-specific improvements that are yet to be made, most notably leveraging LBR data to capture misprediction rates and incorporating them in the heuristics (as discussed with @modimo in the RFC).

While we can have branch miss reported by perf tools and feedback into compiler sample PGO, it would be challenging to accurately correlate the actual branch miss to the unoptimized IR at profile loading time (it is a general challenge to apply any low level PMU info for compiler PGO).

As David said, FSAFDO might alleviate this problem.

So, if you are interested in Sample-PGO it is better to wait for the next patch series that will tailor this pass for that. At that point, we can iterate and refine to make sure that the pass addresses these suboptimal cases and avoids regressions for your workloads.

For sample PGO, we're going to experiment with deferring some compiler if-convert to PLO (BOLT in our case), where the correlation of branch miss won't be a problem, and also to avoid the problem of being stuck with cmov. cc @Amir @maksfb

The lack of mispredict data for cmovs will be a problem but we will not be stuck with a cmov decision but rather we might observe some oscillations (which is still problematic but not atypical of SampleFDO settings and there are some known remedies). The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.
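The oscillation argument above can be made concrete with a toy cost comparison. This is a minimal sketch, not the pass's actual cost model: only the 25% default rate comes from the comment above, while the function names, the 20-cycle misprediction penalty, and the per-form cycle counts are hypothetical values chosen for illustration.

```python
# Toy cost model; every constant except the 25% default is an assumption.
DEFAULT_MISPREDICT_RATE = 0.25   # default quoted in the discussion
MISPREDICT_PENALTY = 20          # hypothetical pipeline-flush cost, in cycles
BRANCH_PATH_CYCLES = 2           # hypothetical cost of the executed branch arm
CMOV_CYCLES = 9                  # hypothetical cost of computing both arms + cmov


def branch_cost(mispredict_rate):
    """Expected cycles for the branch form: useful work plus expected penalty."""
    return BRANCH_PATH_CYCLES + mispredict_rate * MISPREDICT_PENALTY


def prefers_cmov(mispredict_rate=None):
    """Pick cmov when it is cheaper than the branch's expected cost.

    A rate of None models a select that was already lowered to cmov in the
    profiled binary, so no mispredict sample exists and the default is used.
    """
    rate = DEFAULT_MISPREDICT_RATE if mispredict_rate is None else mispredict_rate
    return CMOV_CYCLES < branch_cost(rate)


# Iteration 1: the branch is measured at 50% mispredicts, so cmov wins (9 < 12).
first = prefers_cmov(0.50)
# Iteration 2: the cmov produces no mispredict data, so the 25% default applies
# and the branch looks cheaper again (9 > 7).
second = prefers_cmov(None)
print(first, second)
```

With these assumed numbers the decision flips from one iteration to the next, which is the oscillation described above. Whether it flips in practice depends on the remaining heuristics.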

In terms of making this decision at the BOLT level, it might have more limited applicability compared to a LLVM IR pass since it is a bit harder to find which branches are eligible to be converted to cmovs and employing dataflow-based heuristics as the ones possible in LLVM IR seem quite tricky.

In D120230#3376746, @davidxl wrote:

Right. I mean pass placement.

It seems reasonable since iiuc we want to do this very late, and it looks like it will be the last IR pass.

In D120230#3377781, @apostolakis wrote:

One comment about the patch is that it would be good to remove the llvm_unreachable, and test for the pass in one of the pass ordering tests. E.g. llvm/test/CodeGen/X86/opt-pipeline.ll (there are similar ones for other archs too).

In this patch series the pass is disabled by default, and I was actually planning a separate follow-up patch (D121547) that enables this pass by default for x86 instr-PGO. In D121547 I had to change X86/opt-pipeline.ll, and there you can see more clearly the placement of this pass (almost just before the CodeGenPrepare pass). If it is preferable, I can move these changes into this patch.

I see, ok it seems reasonable to leave the pass pipeline testing until then. But generally it is good to have tests with each patch - another option is to merge this with the next patch which I assume has part of the implementation in it, and add an opt based test with that. Oh, I see that patches 2 and 3 don't have a test, only patch 4. IMO if at all possible it is better to split up the patches into pieces that can each be tested.

In D120230#3380072, @apostolakis wrote:

In D120230#3378380, @wenlei wrote:

In D120230#3377783, @apostolakis wrote:

In D120230#3374674, @wenlei wrote:

Thanks for the patches. We've noticed similar problems as you described in RFC which often leads to way too aggressive cmov (in general and in comparison to gcc). Is the current version of this stack complete and functional? If so, we'd be happy to give it a try to see how it handles the suboptimal cases we spotted.

This stack is complete and functional for x86 instr-PGO.

Thanks. I'm trying this with IRPGO on a large internal workload now, should have results soon. Will also take a closer look at the patch set.

Great! Let me know if you see any regressions or issues.

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

It is not yet tuned for Sample-PGO and there are some Sample-PGO-specific improvements that are yet to be made, most notably leveraging LBR data to capture misprediction rates and incorporating them in the heuristics (as discussed with @modimo in the RFC).

While we can have branch miss reported by perf tools and feedback into compiler sample PGO, it would be challenging to accurately correlate the actual branch miss to the unoptimized IR at profile loading time (it is a general challenge to apply any low level PMU info for compiler PGO).

As David said, FSAFDO might alleviate this problem.

So, if you are interested in Sample-PGO it is better to wait for the next patch series that will tailor this pass for that. At that point, we can iterate and refine to make sure that the pass addresses these suboptimal cases and avoids regressions for your workloads.

For sample PGO, we're going to experiment with deferring some compiler if-convert to PLO (BOLT in our case), where the correlation of branch miss won't be a problem, and also to avoid the problem of being stuck with cmov. cc @Amir @maksfb

The lack of mispredict data for cmovs will be a problem but we will not be stuck with a cmov decision but rather we might observe some oscillations (which is still problematic but not atypical of SampleFDO settings and there are some known remedies).

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branches without profile, that turns this problem into a typical sample PGO oscillation. That is not the case before this patch set; are we changing the behavior now? I'm also not sure whether such oscillation is as easily mitigated as other oscillations, like those from speculative ICP.
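The profile-loss scenario above can be sketched as a small simulation. This is an illustrative model only: the function name, the 0.8 bias threshold, and the per-iteration bias values are hypothetical and do not come from the patch.

```python
def simulate(true_bias_per_iteration, cmov_without_profile, bias_threshold=0.8):
    """Return the form chosen for one select in each sample-PGO iteration.

    true_bias_per_iteration: fraction of executions taking the hot side of the
    original branch in each iteration (hypothetical numbers).
    cmov_without_profile: policy applied when no control-flow sample exists.
    bias_threshold: hypothetical cutoff; less biased branches become cmov.
    """
    form = "branch"  # assume the very first profiled binary kept the branch
    chosen = []
    for true_bias in true_bias_per_iteration:
        # Only a branch in the profiled binary yields a bias sample.
        observed = true_bias if form == "branch" else None
        if observed is None:
            form = "cmov" if cmov_without_profile else "branch"
        else:
            form = "cmov" if observed < bias_threshold else "branch"
        chosen.append(form)
    return chosen


# The branch starts unbiased and becomes heavily biased afterwards.
biases = [0.5, 0.95, 0.95, 0.95]
print(simulate(biases, cmov_without_profile=True))
print(simulate(biases, cmov_without_profile=False))
print(simulate([0.5] * 4, cmov_without_profile=False))
```

Under the first policy the select stays a cmov in every iteration even though the branch became biased, because the bias is never observed again. Under the second policy the decision recovers once the branch is biased, and alternates between the two forms while it remains unbiased.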

The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

nit: saying misprediction rate here and in the RFC is a bit confusing because today we don't have that data in profile. that threshold is how biased a branch is, which is a proxy for branch miss. But branch predictor could still do well (low branch miss) for unbiased branches.
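The difference between bias and misprediction can be illustrated with a toy predictor. This is a minimal sketch under stated assumptions: the one-bit local-history predictor below is a deliberately simple stand-in, not a model of any real BPU, and all names are hypothetical.

```python
import random


def taken_fraction(outcomes):
    """Bias as seen in an edge profile: the fraction of executions that were taken."""
    return sum(outcomes) / len(outcomes)


def mispredict_rate(outcomes, history_bits=1):
    """Toy local-history predictor.

    For each recent-history pattern it remembers the outcome that followed the
    pattern last time and predicts the same outcome again.
    """
    table = {}                             # history tuple -> last outcome seen after it
    history = (False,) * history_bits
    misses = 0
    for taken in outcomes:
        predicted = table.get(history, False)  # predict not-taken for unseen history
        if predicted != taken:
            misses += 1
        table[history] = taken
        history = history[1:] + (taken,)
    return misses / len(outcomes)


# Strictly alternating branch: unbiased in the profile, yet trivially predictable.
alternating = [True, False] * 500
# Fair coin flips: equally unbiased, but with no pattern for the predictor to learn.
rng = random.Random(0)
coin_flips = [rng.random() < 0.5 for _ in range(10000)]

print(taken_fraction(alternating), mispredict_rate(alternating))
print(round(taken_fraction(coin_flips), 2), round(mispredict_rate(coin_flips), 2))
```

Both sequences look about 50% biased to a profile that records only taken counts, but the alternating one mispredicts only on its first execution in this model, while the random one mispredicts roughly half the time.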

In terms of making this decision at the BOLT level, it might have more limited applicability compared to a LLVM IR pass since it is a bit harder to find which branches are eligible to be converted to cmovs and employing dataflow-based heuristics as the ones possible in LLVM IR seem quite tricky.

Yes, that is a different challenge.

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

Does BOLT's cmov optimization improve performance for this workload?

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branches without profile, that turns this problem into a typical sample PGO oscillation. That is not the case before this patch set; are we changing the behavior now? I'm also not sure whether such oscillation is as easily mitigated as other oscillations, like those from speculative ICP.

Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from the production binary but collected using the pre-BOLT binary in a training run? If this is the setup, the compiler can choose to minimize cmov generation for the sake of better profiling.

David

In D120230#3381552, @davidxl wrote:

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

Does BOLT's cmov optimization improve performance for this workload?

This is being worked on now and we don't have data yet. The numbers above didn't have BOLT interfering with cmov.

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branches without profile, that turns this problem into a typical sample PGO oscillation. That is not the case before this patch set; are we changing the behavior now? I'm also not sure whether such oscillation is as easily mitigated as other oscillations, like those from speculative ICP.

Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from the production binary but collected using the pre-BOLT binary in a training run? If this is the setup, the compiler can choose to minimize cmov generation for the sake of better profiling.

David

Right, making the compiler conservative to preserve branches, so that we have a control-flow profile for BOLT to make the final decision, is the experiment we're doing. Pseudo-probe for sample PGO can also be tuned to be a bit more intrusive and disallow cmov, for a better profile in that setup.

In D120230#3381552, @davidxl wrote:

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

Does BOLT's cmov optimization improve performance for this workload?

I didn't measure it yet, but unlikely (see comment below).

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branches without profile, that turns this problem into a typical sample PGO oscillation. That is not the case before this patch set; are we changing the behavior now? I'm also not sure whether such oscillation is as easily mitigated as other oscillations, like those from speculative ICP.

Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from the production binary but collected using the pre-BOLT binary in a training run?

Yes, the profile data should be collected from pre-BOLT binary.

If this is the setup, the compiler can choose to minimize cmov generation for the sake of better profiling.

The compiler can indeed choose to minimize cmov generation – I've recently added an LLVM knob to force-expand all cmov's in D119777 (x86-cmov-converter-force-all).

However, the data collected with (non-PGO, non-LTO) clang binary suggests that x86-cmov-converter-force-all introduces a significant perf regression that BOLT's CMOV conversion with default heuristics is unable to recover from. BOLT converts back a minor percentage (~5%) of eligible hammocks based on execution and misprediction heuristics (>5% misprediction rate, >1% biased condition). The hypothesis is that force-expanding cmov's results in 1) a code size increase and 2) more branches => higher pressure on BPU structures, and given that BOLT converts only a small part of the hammocks back, these factors result in a net regression.

In other words, misprediction rate may not be the most important factor in hammock-vs-cmov tradeoff for large code footprint workloads. I believe that a holistic approach (criticality + misprediction rate + code size) may yield better performance.

In D120230#3381563, @wenlei wrote:

In D120230#3381552, @davidxl wrote:

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

Does BOLT's cmov optimization improve performance for this workload?

This is being worked on now and we don't have data yet. The numbers above didn't have BOLT interfering with cmov.

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branches without profile, that turns this problem into a typical sample PGO oscillation. That is not the case before this patch set; are we changing the behavior now? I'm also not sure whether such oscillation is as easily mitigated as other oscillations, like those from speculative ICP.

Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from the production binary but collected using the pre-BOLT binary in a training run? If this is the setup, the compiler can choose to minimize cmov generation for the sake of better profiling.

David

Right, making the compiler conservative to preserve branches, so that we have a control-flow profile for BOLT to make the final decision, is the experiment we're doing. Pseudo-probe for sample PGO can also be tuned to be a bit more intrusive and disallow cmov, for a better profile in that setup.

But in this case, the binary (from BOLT) used to generate the profile for the compiler still has cmov, so some loss of profile data is unavoidable. The oscillation issue will mostly be gone, though.

In D120230#3381568, @Amir wrote:

In D120230#3381552, @davidxl wrote:

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

Does BOLT's cmov optimization improve performance for this workload?

I didn't measure it yet, but unlikely (see comment below).

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branches without profile, that turns this problem into a typical sample PGO oscillation. That is not the case before this patch set; are we changing the behavior now? I'm also not sure whether such oscillation is as easily mitigated as other oscillations, like those from speculative ICP.

Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from the production binary but collected using the pre-BOLT binary in a training run?

Yes, the profile data should be collected from pre-BOLT binary.

If this is the setup, the compiler can choose to minimize cmov generation for the sake of better profiling.

The compiler can indeed choose to minimize cmov generation – I've recently added an LLVM knob to force-expand all cmov's in D119777 (x86-cmov-converter-force-all).

However, the data collected with (non-PGO, non-LTO) clang binary suggests that x86-cmov-converter-force-all introduces a significant perf regression that BOLT's CMOV conversion with default heuristics is unable to recover from.

I assume BOLT's block layout lays out that branchy code properly, right?

BOLT converts back a minor percentage (~5%) of eligible hammocks based on execution and misprediction heuristics (>5% misprediction rate, >1% biased condition).

Only 5% of the hammock based execution meets the conversion criteria or 5% of the candidates matching the criteria can be converted back ?

The hypothesis is that force-expanding cmov's results in 1) a code size increase, 2) more branches => higher pressure on BPU structures, and given that BOLT converts back only a small part of hammocks back, these factors result in a net regression.

In other words, misprediction rate may not be the most important factor in hammock-vs-cmov tradeoff for large code footprint workloads. I believe that a holistic approach (criticality + misprediction rate + code size) may yield better performance.

Yes, modelling the global effect as well as branch interactions will be a useful thing to do. Note that newly introduced branches can change BPU behavior and thus lead to a different branch misprediction distribution too.

David

The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

nit: saying misprediction rate here and in the RFC is a bit confusing because today we don't have that data in profile. that threshold is how biased a branch is, which is a proxy for branch miss. But branch predictor could still do well (low branch miss) for unbiased branches.

In terms of making this decision at the BOLT level, it might have more limited applicability compared to a LLVM IR pass since it is a bit harder to find which branches are eligible to be converted to cmovs and employing dataflow-based heuristics as the ones possible in LLVM IR seem quite tricky.

Yes, that is a different challenge.

But in this case, the binary (from BOLT) used to generate profile for the compiler still has cmov, so some loss of profile data is unavoidable. The oscillating issue will mostly be gone though.

We have some special setup where a dedicated tier is used for sample profiling for both compiler and BOLT; in that setup we're not profiling the post-BOLT binary. It's not the typical prod profiling setup for sample PGO.

OTOH for cmov, since the compiler is going to be conservative and defer that to BOLT, not having a compiler profile is okay (just from the cmov perspective) as long as BOLT sees the profile from a pre-BOLT binary which doesn't do cmov.

In D120230#3381581, @davidxl wrote:

In D120230#3381568, @Amir wrote:

In D120230#3381552, @davidxl wrote:

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

Does BOLT's cmov optimization improve performance for this workload?

I didn't measure it yet, but unlikely (see comment below).

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branches without profile, that turns this problem into typical sample PGO oscillation. That is not the case before this patch set; are we changing the behavior now? I'm also not sure if such oscillation is as easily mitigable as other oscillations like those from speculative ICP.

Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from production binary but collected using pre-BOLT binary in a training run?

Yes, the profile data should be collected from pre-BOLT binary.

If this is the setup, the compiler can choose to minimize cmov generation for the sake of better profiling.

The compiler can indeed choose to minimize cmov generation – I've recently added an LLVM knob to force-expand all cmov's in D119777 (x86-cmov-converter-force-all).

However, the data collected with (non-PGO, non-LTO) clang binary suggests that x86-cmov-converter-force-all introduces a significant perf regression that BOLT's CMOV conversion with default heuristics is unable to recover from.

I assume BOLT's block layout lays out that branchy code properly, right?

No, function splitting and layout were turned off, so only baseline BOLT with and w/o CMOV conversion.

BOLT converts back a minor percentage (~5%) of eligible hammocks based on execution and misprediction heuristics (>5% misprediction rate, >1% biased condition).

Only 5% of the hammock based execution meets the conversion criteria, or 5% of the candidates matching the criteria can be converted back?

The former: 5% of the hammock based execution meets the conversion criteria (with default CMOV conversion heuristics).

The hypothesis is that force-expanding cmov's results in 1) a code size increase, 2) more branches => higher pressure on BPU structures, and given that BOLT converts only a small part of the hammocks back, these factors result in a net regression.

In other words, misprediction rate may not be the most important factor in hammock-vs-cmov tradeoff for large code footprint workloads. I believe that a holistic approach (criticality + misprediction rate + code size) may yield better performance.

Yes, modelling the global effect as well as branch interactions will be a useful thing to do.

I'm hoping Sotiris's pass can achieve that eventually.

Note that newly introduced branches can change BPU behavior and thus lead to a different branch misprediction distribution too.

David

The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

nit: saying misprediction rate here and in the RFC is a bit confusing because today we don't have that data in profile. that threshold is how biased a branch is, which is a proxy for branch miss. But branch predictor could still do well (low branch miss) for unbiased branches.

In terms of making this decision at the BOLT level, it might have more limited applicability compared to a LLVM IR pass since it is a bit harder to find which branches are eligible to be converted to cmovs and employing dataflow-based heuristics as the ones possible in LLVM IR seem quite tricky.

Yes, that is a different challenge.

In D120230#3381291, @tejohnson wrote:

In D120230#3377781, @apostolakis wrote:

One comment about the patch is that it would be good to remove the llvm_unreachable, and test for the pass in one of the pass ordering tests. E.g. llvm/test/CodeGen/X86/opt-pipeline.ll (there are similar ones for other archs too).

In this patch series the pass is disabled by default, and I was actually planning on having a separate follow-up patch (D121547) where I will enable this pass by default for x86 instr-PGO. In the D121547 patch I had to change X86/opt-pipeline.ll, and there you can see more clearly the placement of this pass (almost just before the CodeGenPrepare pass). If it is preferable, I can move these changes into this patch.

I see, ok it seems reasonable to leave the pass pipeline testing until then. But generally it is good to have tests with each patch - another option is to merge this with the next patch which I assume has part of the implementation in it, and add an opt based test with that. Oh, I see that patches 2 and 3 don't have a test, only patch 4. IMO if at all possible it is better to split up the patches into pieces that can each be tested.

The reason that all the tests are in the 4th patch is that this patch involves the actual transformation (converts selects to branches for the cases that the conversion was deemed profitable based on the profitability heuristics of patches 2 and 3). The 2nd patch has the base (non-loop) heuristics and the 3rd patch has the loop-level heuristics. Patches 2 and 3 do not change the IR.

I agree though that it is better to split it up in a way that allows testing of each patch. So, I will re-organize the patches to enable more per-patch testing. I will move the base of the actual transformation of the code in a 2nd patch (and some testing by temporarily assuming that all selects should be converted), then the 3rd patch will be the base heuristics and testing, 4th patch loop heuristics and testing, and a 5th patch that optimizes the transformation (maximizes the sinking of one-use slices in the true/false blocks and interleaving of slices).

In D120230#3381419, @wenlei wrote:

So, if you are interested in Sample-PGO it is better to wait for the next patch series that will tailor this pass for that. At that point, we can iterate and refine to make sure that the pass addresses these suboptimal cases and avoids regressions for your workloads.

The lack of mispredict data for cmovs will be a problem but we will not be stuck with a cmov decision but rather we might observe some oscillations (which is still problematic but not atypical of SampleFDO settings and there are some known remedies).

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

We will not get stuck if the non-profile logic and default values for branch weights and mispredict rates are set in a way that favors branches. We already talked about mispredict rates. Another example is the base heuristic that converts selects to branches when the computation of the cold operand is expensive (cold currently means the operand is selected less than 20% of the time). If this select is currently a branch and the profile-based weight is 25% for the expensive operand (so not cold), then we might convert that to a cmov. Then we will not have profile information, but we can allow all selects with expensive operands but no profile data to be conservatively converted to branches; then, if the expensive operand becomes cold, the select will become a branch, otherwise there will be oscillation.

In general, I think some reasonable non-profile-based heuristics that favor branches will be better than blindly converting all cases to branches even for cases where a cmov seems preferable even with unfavorable branch weights and mispredict rate.
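A minimal sketch of the cold-operand example above, assuming a single standalone decision function. The names `Form`, `chooseForm`, and `ColdOperandThreshold` are hypothetical and are not taken from the patch series; only the 20% threshold and the "no profile favors a branch" rule come from the comment.

```cpp
#include <optional>

// Hypothetical result type: keep the select (cmov) or expand it to a branch.
enum class Form { Select, Branch };

// Threshold quoted in the discussion: an operand selected less than 20% of
// the time is treated as cold.
constexpr double ColdOperandThreshold = 0.20;

// ExpensiveOperandWeight is the profile probability that the expensive
// operand is the one selected; std::nullopt models "no profile data".
Form chooseForm(bool HasExpensiveOperand,
                std::optional<double> ExpensiveOperandWeight) {
  // This heuristic only concerns selects with an expensive operand. In a
  // real pass, other heuristics would still get a say here.
  if (!HasExpensiveOperand)
    return Form::Select;
  // No profile: conservatively use a branch, so that a later profiling run
  // can observe the control flow again.
  if (!ExpensiveOperandWeight)
    return Form::Branch;
  // Expensive operand is cold: a branch skips the expensive computation
  // most of the time.
  return *ExpensiveOperandWeight < ColdOperandThreshold ? Form::Branch
                                                        : Form::Select;
}
```

With a 25% weight the sketch keeps the select; once the profile is lost, the next build falls back to a branch, which is what makes the decision revertible and also what can cause the oscillation mentioned above.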

In D120230#3381419, @wenlei wrote:

In D120230#3380072, @apostolakis wrote:

The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

nit: saying misprediction rate here and in the RFC is a bit confusing because today we don't have that data in profile. that threshold is how biased a branch is, which is a proxy for branch miss. But branch predictor could still do well (low branch miss) for unbiased branches.

Just to clarify since you misunderstood me, I was referring (both here and in the RFC) to actual misprediction rate and not branch weights. If you look at the 3rd patch (D120232) that includes the loop-level heuristics, you will see that a mispredict rate (that conservatively defaults to 25%) is taken into account to calculate the cost of branches. Branch weights are used to compute the branch cost for correctly predicted cases (BranchCost = PredictedPathCost + MispredictCost * MispredictRate, where PredictedPathCost = TrueOpCost * TrueWeight + FalseOpCost * FalseWeight). Unbiased branches indeed do not necessarily lead to mispredictions (as discussed in the RFC), and the only case where branch weights are used for predictability for this optimization is for cases where the branch is entirely biased to one direction and thus it is somewhat expected to be predictable (part of the base heuristics). If profile-based mispredict rates were available, there would be at least one extra base heuristic that handles some extreme cases (highly or poorly predicted), and then for in-between cases the existing heuristics would be refined to make the decision.
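The cost formula in this comment can be written out as a short sketch. The struct and function names are illustrative only and are not the ones used in D120232; the sketch also assumes the two branch weights are normalized probabilities, so FalseWeight = 1 - TrueWeight.

```cpp
#include <cassert>

// Hypothetical per-select inputs; field names mirror the formula above.
struct SelectCosts {
  double TrueOpCost;     // cost of computing the true operand
  double FalseOpCost;    // cost of computing the false operand
  double TrueWeight;     // probability that the condition is true
  double MispredictCost; // penalty of one branch misprediction
};

// Conservative default mentioned in the discussion (25%).
constexpr double DefaultMispredictRate = 0.25;

// PredictedPathCost = TrueOpCost * TrueWeight + FalseOpCost * FalseWeight
double predictedPathCost(const SelectCosts &C) {
  assert(C.TrueWeight >= 0.0 && C.TrueWeight <= 1.0 &&
         "weight must be a probability");
  // Assumption: weights are normalized, so the false weight is the complement.
  double FalseWeight = 1.0 - C.TrueWeight;
  return C.TrueOpCost * C.TrueWeight + C.FalseOpCost * FalseWeight;
}

// BranchCost = PredictedPathCost + MispredictCost * MispredictRate
double branchCost(const SelectCosts &C,
                  double MispredictRate = DefaultMispredictRate) {
  return predictedPathCost(C) + C.MispredictCost * MispredictRate;
}
```

For operand costs 4 and 8, equal weights, and a misprediction penalty of 16 (all made-up numbers), the predicted-path cost is 6; the branch cost is 10 at the default 25% rate and 14 at a 50% rate, which illustrates why the default rate makes a branch look cheaper than real mispredict data might.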

bsmith added a subscriber: bsmith.Mar 21 2022, 7:59 AM

apostolakis added a child revision: D122259: [SelectOpti][2/5] Select-to-branch base transformation.Mar 22 2022, 1:52 PM

apostolakis retitled this revision from [SelectOpti][1/4] Setup new select-optimize pass to [SelectOpti][1/5] Setup new select-optimize pass.Mar 22 2022, 2:06 PM

As discussed, re-organized the subsequent patches to allow for per-patch testing.

Use LegacyPM by default for this pass.

Harbormaster completed remote builds in B155792: Diff 417517.Mar 23 2022, 2:26 AM

@tejohnson does the re-organization of subsequent patches to allow for per-patch testing look okay to you? Is there anything else to note for this first patch? Apart from the decision of where to place the new pass, the rest of the code is mostly boilerplate.

Sorry for the delay. I like the new patch organization with the additional testing, thanks!

LGTM

This revision is now accepted and ready to land.Apr 4 2022, 9:41 PM

lkail mentioned this in D113872: [CGP] Handle select instructions relying on the same condition.Apr 5 2022, 2:02 AM

Rebase

apostolakis edited the summary of this revision. (Show Details)May 13 2022, 5:10 PM

Harbormaster completed remote builds in B164411: Diff 429382.May 13 2022, 5:51 PM

apostolakis removed a child revision: D120231: [SelectOpti][3/5] Base Heuristics.May 14 2022, 2:05 PM

Init DisableSelectOptimize to true when declaring it.

This revision was landed with ongoing or failed builds.May 19 2022, 9:42 AM

Closed by commit rGca7c307d1816: [SelectOpti][1/5] Setup new select-optimize pass (authored by apostolakis).

This revision was automatically updated to reflect the committed changes.

apostolakis added a commit: rGca7c307d1816: [SelectOpti][1/5] Setup new select-optimize pass.

Harbormaster completed remote builds in B165350: Diff 430719.May 19 2022, 10:10 AM

junaire added a subscriber: junaire.May 21 2022, 8:02 PM

This comment was removed by junaire.

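Revision Contents

Path                                                 Size
llvm/include/llvm/CodeGen/CodeGenPassBuilder.h       4 lines
llvm/include/llvm/CodeGen/MachinePassRegistry.def    1 line
llvm/include/llvm/CodeGen/Passes.h                   3 lines
llvm/include/llvm/InitializePasses.h                 1 line
llvm/include/llvm/LinkAllPasses.h                    1 line
llvm/include/llvm/Target/CGPassBuilderOption.h       1 line
llvm/lib/CodeGen/ (file name not preserved)          1 line
llvm/lib/CodeGen/ (file name not preserved)          1 line
llvm/lib/CodeGen/ (file name not preserved)          43 lines
llvm/lib/CodeGen/ (file name not preserved)          10 lines
llvm/tools/opt/opt.cpp                               4 lines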

Diff 430722

llvm/include/llvm/CodeGen/CodeGenPassBuilder.h

@@ void CodeGenPassBuilder<Derived>::addIRPasses(AddIRPass &addPass) const {

   // Add scalarization of target's unsupported masked memory intrinsics pass.
   // the unsupported intrinsic will be replaced with a chain of basic blocks,
   // that stores/loads element one-by-one if the appropriate mask bit is set.
   addPass(ScalarizeMaskedMemIntrinPass());

   // Expand reduction intrinsics into shuffle sequences if the target wants to.
   addPass(ExpandReductionsPass());
+
+  // Convert conditional moves to conditional jumps when profitable.
+  if (getOptLevel() != CodeGenOpt::None && !Opt.DisableSelectOptimize)
+    addPass(SelectOptimizePass());
 }

 /// Turn exception handling constructs into something the code generators can
 /// handle.
 template <typename Derived>
 void CodeGenPassBuilder<Derived>::addPassesToHandleExceptions(
     AddIRPass &addPass) const {
   const MCAsmInfo *MCAI = TM.getMCAsmInfo();

Inline comments on the added if-condition line:

davidxl: Do we need a canonicalization phase before this to convert jumps to select when possible?

apostolakis: SimplifyCFG canonicalizes jumps to selects when possible. Currently SimplifyCFG prevents this canonicalization for a few cases where it is deemed very unprofitable. These cases seem reasonable and are probably not worth converting back to selects just before this pass and reconsidering, although it might be worth empirically evaluating the effect of that. What is more important, in my opinion, is for SimplifyCFG to canonicalize to selects without trying to optimize, so as to maximize its enabling effect on subsequent LLVM IR optimizations (some preliminary experiments showed a small benefit from such a change). Then at the end, this new pass will decide what form is more profitable. This matter was briefly discussed recently in https://reviews.llvm.org/D118066. Once this new pass is enabled for all PGO builds, I will advocate for changes in SimplifyCFG to aggressively canonicalize to selects when profile information is available, so that an informed decision can be made later without blocking any LLVM IR optimizations.

davidxl: SG.

llvm/include/llvm/CodeGen/MachinePassRegistry.def

 DUMMY_FUNCTION_PASS("safe-stack", SafeStackPass, ())
 DUMMY_FUNCTION_PASS("stack-protector", StackProtectorPass, ())
 DUMMY_FUNCTION_PASS("atomic-expand", AtomicExpandPass, ())
 DUMMY_FUNCTION_PASS("interleaved-access", InterleavedAccessPass, ())
 DUMMY_FUNCTION_PASS("indirectbr-expand", IndirectBrExpandPass, ())
 DUMMY_FUNCTION_PASS("cfguard-dispatch", CFGuardDispatchPass, ())
 DUMMY_FUNCTION_PASS("cfguard-check", CFGuardCheckPass, ())
 DUMMY_FUNCTION_PASS("gc-info-printer", GCInfoPrinterPass, ())
+DUMMY_FUNCTION_PASS("select-optimize", SelectOptimizePass, ())
 #undef DUMMY_FUNCTION_PASS

 #ifndef DUMMY_MODULE_PASS
 #define DUMMY_MODULE_PASS(NAME, PASS_NAME, CONSTRUCTOR)
 #endif
 DUMMY_MODULE_PASS("lower-emutls", LowerEmuTLSPass, ())
 #undef DUMMY_MODULE_PASS

llvm/include/llvm/CodeGen/Passes.h

@@ namespace llvm {
   FunctionPass *createX86LowerAMXIntrinsicsPass();

   /// When learning an eviction policy, extract score(reward) information,
   /// otherwise this does nothing
   FunctionPass *createRegAllocScoringPass();

   /// JMC instrument pass.
   ModulePass *createJMCInstrumenterPass();
+
+  /// This pass converts conditional moves to conditional jumps when profitable.
+  FunctionPass *createSelectOptimizePass();
 } // End llvm namespace

 #endif

llvm/include/llvm/InitializePasses.h

 void initializeRewriteSymbolsLegacyPassPass(PassRegistry&);
 void initializeSCCPLegacyPassPass(PassRegistry&);
 void initializeSCEVAAWrapperPassPass(PassRegistry&);
 void initializeSLPVectorizerPass(PassRegistry&);
 void initializeSROALegacyPassPass(PassRegistry&);
 void initializeSafeStackLegacyPassPass(PassRegistry&);
 void initializeSafepointIRVerifierPass(PassRegistry&);
 void initializeSampleProfileLoaderLegacyPassPass(PassRegistry&);
+void initializeSelectOptimizePass(PassRegistry &);
 void initializeModuleSanitizerCoverageLegacyPassPass(PassRegistry &);
 void initializeScalarEvolutionWrapperPassPass(PassRegistry&);
 void initializeScalarizeMaskedMemIntrinLegacyPassPass(PassRegistry &);
 void initializeScalarizerLegacyPassPass(PassRegistry&);
 void initializeScavengerTestPass(PassRegistry&);
 void initializeScopedNoAliasAAWrapperPassPass(PassRegistry&);
 void initializeSeparateConstOffsetFromGEPLegacyPassPass(PassRegistry &);
 void initializeShadowStackGCLoweringPass(PassRegistry&);

llvm/include/llvm/LinkAllPasses.h

@@ ForcePassLinking() {
 (void) llvm::createEliminateAvailableExternallyPass();
 (void)llvm::createScalarizeMaskedMemIntrinLegacyPass();
 (void) llvm::createWarnMissedTransformationsPass();
 (void) llvm::createHardwareLoopsPass();
 (void) llvm::createInjectTLIMappingsLegacyPass();
 (void) llvm::createUnifyLoopExitsPass();
 (void) llvm::createFixIrreduciblePass();
 (void)llvm::createFunctionSpecializationPass();
+(void)llvm::createSelectOptimizePass();

 (void)new llvm::IntervalPartition();
 (void)new llvm::ScalarEvolutionWrapperPass();
 llvm::Function::Create(nullptr, llvm::GlobalValue::ExternalLinkage)->viewCFGOnly();
 llvm::RGPassManager RGM;
 llvm::TargetLibraryInfoImpl TLII;
 llvm::TargetLibraryInfo TLI(TLII);
 llvm::AliasAnalysis AA(TLI);

llvm/include/llvm/Target/CGPassBuilderOption.h

Show All 36 Lines	struct CGPassBuilderOption {
bool EarlyLiveIntervals = false;		bool EarlyLiveIntervals = false;

bool DisableLSR = false;		bool DisableLSR = false;
bool DisableCGP = false;		bool DisableCGP = false;
bool PrintLSR = false;		bool PrintLSR = false;
bool DisableMergeICmps = false;		bool DisableMergeICmps = false;
bool DisablePartialLibcallInlining = false;		bool DisablePartialLibcallInlining = false;
bool DisableConstantHoisting = false;		bool DisableConstantHoisting = false;
		bool DisableSelectOptimize = true;
bool PrintISelInput = false;		bool PrintISelInput = false;
bool PrintGCInfo = false;		bool PrintGCInfo = false;
bool RequiresCodeGenSCCOrder = false;		bool RequiresCodeGenSCCOrder = false;

RunOutliner EnableMachineOutliner = RunOutliner::TargetDefault;		RunOutliner EnableMachineOutliner = RunOutliner::TargetDefault;
RegAllocType RegAlloc = RegAllocType::Default;		RegAllocType RegAlloc = RegAllocType::Default;
CFLAAType UseCFLAA = CFLAAType::None;		CFLAAType UseCFLAA = CFLAAType::None;
Optional<GlobalISelAbortMode> EnableGlobalISelAbort;		Optional<GlobalISelAbortMode> EnableGlobalISelAbort;
Show All 11 Lines

llvm/lib/CodeGen/CMakeLists.txt

Show First 20 Lines • Show All 191 Lines • ▼ Show 20 Lines	add_llvm_component_library(LLVMCodeGen
RegisterBank.cpp		RegisterBank.cpp
RegisterBankInfo.cpp		RegisterBankInfo.cpp
SafeStack.cpp		SafeStack.cpp
SafeStackLayout.cpp		SafeStackLayout.cpp
ScheduleDAG.cpp		ScheduleDAG.cpp
ScheduleDAGInstrs.cpp		ScheduleDAGInstrs.cpp
ScheduleDAGPrinter.cpp		ScheduleDAGPrinter.cpp
ScoreboardHazardRecognizer.cpp		ScoreboardHazardRecognizer.cpp
		SelectOptimize.cpp
ShadowStackGCLowering.cpp		ShadowStackGCLowering.cpp
ShrinkWrap.cpp		ShrinkWrap.cpp
SjLjEHPrepare.cpp		SjLjEHPrepare.cpp
SlotIndexes.cpp		SlotIndexes.cpp
SpillPlacement.cpp		SpillPlacement.cpp
SplitKit.cpp		SplitKit.cpp
StackColoring.cpp		StackColoring.cpp
StackMapLivenessAnalysis.cpp		StackMapLivenessAnalysis.cpp
▲ Show 20 Lines • Show All 58 Lines • Show Last 20 Lines

llvm/lib/CodeGen/CodeGen.cpp

Show First 20 Lines • Show All 101 Lines • ▼ Show 20 Lines	void llvm::initializeCodeGen(PassRegistry &Registry) {
initializeRAGreedyPass(Registry);		initializeRAGreedyPass(Registry);
initializeRegAllocFastPass(Registry);		initializeRegAllocFastPass(Registry);
initializeRegUsageInfoCollectorPass(Registry);		initializeRegUsageInfoCollectorPass(Registry);
initializeRegUsageInfoPropagationPass(Registry);		initializeRegUsageInfoPropagationPass(Registry);
initializeRegisterCoalescerPass(Registry);		initializeRegisterCoalescerPass(Registry);
initializeRemoveRedundantDebugValuesPass(Registry);		initializeRemoveRedundantDebugValuesPass(Registry);
initializeRenameIndependentSubregsPass(Registry);		initializeRenameIndependentSubregsPass(Registry);
initializeSafeStackLegacyPassPass(Registry);		initializeSafeStackLegacyPassPass(Registry);
		initializeSelectOptimizePass(Registry);
initializeShadowStackGCLoweringPass(Registry);		initializeShadowStackGCLoweringPass(Registry);
initializeShrinkWrapPass(Registry);		initializeShrinkWrapPass(Registry);
initializeSjLjEHPreparePass(Registry);		initializeSjLjEHPreparePass(Registry);
initializeSlotIndexesPass(Registry);		initializeSlotIndexesPass(Registry);
initializeStackColoringPass(Registry);		initializeStackColoringPass(Registry);
initializeStackMapLivenessPass(Registry);		initializeStackMapLivenessPass(Registry);
initializeStackProtectorPass(Registry);		initializeStackProtectorPass(Registry);
initializeStackSlotColoringPass(Registry);		initializeStackSlotColoringPass(Registry);
Show All 18 Lines

llvm/lib/CodeGen/SelectOptimize.cpp

This file was added.

				//===--- SelectOptimize.cpp - Convert select to branches if profitable ---===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This pass converts selects to conditional jumps when profitable.
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/CodeGen/Passes.h"
				#include "llvm/IR/Function.h"
				#include "llvm/InitializePasses.h"
				#include "llvm/Pass.h"

				using namespace llvm;

				namespace {

				class SelectOptimize : public FunctionPass {
				public:
				  static char ID;
				  SelectOptimize() : FunctionPass(ID) {
				    initializeSelectOptimizePass(*PassRegistry::getPassRegistry());
				  }

				  bool runOnFunction(Function &F) override;

				  void getAnalysisUsage(AnalysisUsage &AU) const override {}
				};
				} // namespace

				char SelectOptimize::ID = 0;
				INITIALIZE_PASS(SelectOptimize, "select-optimize", "Optimize selects", false,
				                false)

				FunctionPass *llvm::createSelectOptimizePass() { return new SelectOptimize(); }

				bool SelectOptimize::runOnFunction(Function &F) {
				  llvm_unreachable("Unimplemented");
				}
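
Editor's note on the skeleton above: `SelectOptimize::ID` is initialized to 0, which is fine because the legacy pass manager identifies a pass by the address of its `ID` member, not by its value. The following is a minimal standalone sketch of that idea, not LLVM code. `ToyRegistry`, `ToySelectOptimize`, `ToyOtherPass`, and `makeRegistry` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Toy stand-in for a pass registry (hypothetical, NOT llvm::PassRegistry).
// Passes are keyed by the ADDRESS of their static ID member.
class ToyRegistry {
public:
  void registerPass(const void *ID, std::string Arg) {
    // Two registrations under one address would mean two passes share an ID.
    assert(Names.count(ID) == 0 && "pass registered twice");
    Names.emplace(ID, std::move(Arg));
  }

  // Returns the registered name, or an empty string if the ID is unknown.
  std::string lookup(const void *ID) const {
    auto It = Names.find(ID);
    return It == Names.end() ? std::string() : It->second;
  }

private:
  std::map<const void *, std::string> Names;
};

// Both toy passes store the same VALUE (0) in ID; only the addresses differ.
struct ToySelectOptimize {
  static char ID;
};
char ToySelectOptimize::ID = 0;

struct ToyOtherPass {
  static char ID;
};
char ToyOtherPass::ID = 0;

// Builds a registry holding both toy passes under their command-line names.
ToyRegistry makeRegistry() {
  ToyRegistry R;
  R.registerPass(&ToySelectOptimize::ID, "select-optimize");
  R.registerPass(&ToyOtherPass::ID, "other-pass");
  return R;
}
```

Looking up `&ToySelectOptimize::ID` in the registry returned by `makeRegistry()` yields `"select-optimize"` even though both IDs hold the same value, which is why the real pass can leave `ID` at 0.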

llvm/lib/CodeGen/TargetPassConfig.cpp

Show First 20 Lines • Show All 254 Lines • ▼ Show 20 Lines	static cl::opt<bool> EnableMachineFunctionSplitter(
cl::desc("Split out cold blocks from machine functions based on profile "		cl::desc("Split out cold blocks from machine functions based on profile "
"information."));		"information."));

/// Disable the expand reductions pass for testing.		/// Disable the expand reductions pass for testing.
static cl::opt<bool> DisableExpandReductions(		static cl::opt<bool> DisableExpandReductions(
"disable-expand-reductions", cl::init(false), cl::Hidden,		"disable-expand-reductions", cl::init(false), cl::Hidden,
cl::desc("Disable the expand reduction intrinsics pass from running"));		cl::desc("Disable the expand reduction intrinsics pass from running"));

		/// Disable the select optimization pass.
		static cl::opt<bool> DisableSelectOptimize(
		"disable-select-optimize", cl::init(true), cl::Hidden,
		cl::desc("Disable the select-optimization pass from running"));

/// Allow standard passes to be disabled by command line options. This supports		/// Allow standard passes to be disabled by command line options. This supports
/// simple binary flags that either suppress the pass or do nothing.		/// simple binary flags that either suppress the pass or do nothing.
/// i.e. -disable-mypass=false has no effect.		/// i.e. -disable-mypass=false has no effect.
/// These should be converted to boolOrDefault in order to use applyOverride.		/// These should be converted to boolOrDefault in order to use applyOverride.
static IdentifyingPassPtr applyDisable(IdentifyingPassPtr PassID,		static IdentifyingPassPtr applyDisable(IdentifyingPassPtr PassID,
bool Override) {		bool Override) {
if (Override)		if (Override)
return IdentifyingPassPtr();		return IdentifyingPassPtr();
▲ Show 20 Lines • Show All 218 Lines • ▼ Show 20 Lines	#define SET_BOOLEAN_OPTION(Option) Opt.Option = Option;
SET_BOOLEAN_OPTION(EnableMachineOutliner)		SET_BOOLEAN_OPTION(EnableMachineOutliner)
SET_BOOLEAN_OPTION(MISchedPostRA)		SET_BOOLEAN_OPTION(MISchedPostRA)
SET_BOOLEAN_OPTION(UseCFLAA)		SET_BOOLEAN_OPTION(UseCFLAA)
SET_BOOLEAN_OPTION(DisableMergeICmps)		SET_BOOLEAN_OPTION(DisableMergeICmps)
SET_BOOLEAN_OPTION(DisableLSR)		SET_BOOLEAN_OPTION(DisableLSR)
SET_BOOLEAN_OPTION(DisableConstantHoisting)		SET_BOOLEAN_OPTION(DisableConstantHoisting)
SET_BOOLEAN_OPTION(DisableCGP)		SET_BOOLEAN_OPTION(DisableCGP)
SET_BOOLEAN_OPTION(DisablePartialLibcallInlining)		SET_BOOLEAN_OPTION(DisablePartialLibcallInlining)
		SET_BOOLEAN_OPTION(DisableSelectOptimize)
SET_BOOLEAN_OPTION(PrintLSR)		SET_BOOLEAN_OPTION(PrintLSR)
SET_BOOLEAN_OPTION(PrintISelInput)		SET_BOOLEAN_OPTION(PrintISelInput)
SET_BOOLEAN_OPTION(PrintGCInfo)		SET_BOOLEAN_OPTION(PrintGCInfo)

return Opt;		return Opt;
}		}

static void registerPartialPipelineCallback(PassInstrumentationCallbacks &PIC,		static void registerPartialPipelineCallback(PassInstrumentationCallbacks &PIC,
▲ Show 20 Lines • Show All 424 Lines • ▼ Show 20 Lines	void TargetPassConfig::addIRPasses() {
addPass(createScalarizeMaskedMemIntrinLegacyPass());		addPass(createScalarizeMaskedMemIntrinLegacyPass());

// Expand reduction intrinsics into shuffle sequences if the target wants to.		// Expand reduction intrinsics into shuffle sequences if the target wants to.
// Allow disabling it for testing purposes.		// Allow disabling it for testing purposes.
if (!DisableExpandReductions)		if (!DisableExpandReductions)
addPass(createExpandReductionsPass());		addPass(createExpandReductionsPass());

if (getOptLevel() != CodeGenOpt::None)		if (getOptLevel() != CodeGenOpt::None)
addPass(createTLSVariableHoistPass());		addPass(createTLSVariableHoistPass());
		davidxl (inline comment): Not suitable when size optimization is on.
		apostolakis (author, inline reply): The code generation optimization level checked here (llvm::CodeGenOpt::Level) has only four levels (None, Less, Default, Aggressive), none of which is meant for optimize-for-size. I also do not see any other API at this level that would enable a check for size. Instead, the size-opts check currently happens at the beginning of the new pass (see the next patch in this series: D120231), where it checks the function's attributes for OptSize and also invokes the shouldOptimizeForSize function, which uses PGSO heuristics.
		davidxl (inline reply): ok.

		// Convert conditional moves to conditional jumps when profitable.
		if (getOptLevel() != CodeGenOpt::None && !DisableSelectOptimize)
		addPass(createSelectOptimizePass());
}		}

/// Turn exception handling constructs into something the code generators can		/// Turn exception handling constructs into something the code generators can
/// handle.		/// handle.
void TargetPassConfig::addPassesToHandleExceptions() {		void TargetPassConfig::addPassesToHandleExceptions() {
const MCAsmInfo *MCAI = TM->getMCAsmInfo();		const MCAsmInfo *MCAI = TM->getMCAsmInfo();
assert(MCAI && "No MCAsmInfo");		assert(MCAI && "No MCAsmInfo");
switch (MCAI->getExceptionHandlingType()) {		switch (MCAI->getExceptionHandlingType()) {
▲ Show 20 Lines • Show All 614 Lines • Show Last 20 Lines
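
Editor's note on the inline discussion in `addIRPasses()` above: the gating is split in two. The pipeline decides whether to add the pass from the optimization level and the `-disable-select-optimize` flag, while the size check is described as living inside the pass itself (in D120231, which is not shown here). The sketch below models that split in standalone C++. `ToyFunction`, `shouldSkipForSize`, and `wouldTransform` are hypothetical names, and the in-pass check is an assumed shape based only on the author's reply.

```cpp
#include <cassert>

// The four levels named in the review discussion; none of them means "size".
enum class OptLevel { None, Less, Default, Aggressive };

// Hypothetical per-function facts, standing in for the OptSize attribute and
// for a profile-guided size heuristic (PGSO).
struct ToyFunction {
  bool HasOptSizeAttr = false;
  bool ProfileSaysOptimizeForSize = false;
};

// Pipeline-level gate: same shape as the condition this patch adds to
// addIRPasses(). It cannot see size information.
bool shouldAddSelectOptimize(OptLevel Level, bool DisableSelectOptimize) {
  return Level != OptLevel::None && !DisableSelectOptimize;
}

// In-pass gate (ASSUMED shape of the check described for D120231): bail out
// when the function is being optimized for size.
bool shouldSkipForSize(const ToyFunction &F) {
  return F.HasOptSizeAttr || F.ProfileSaysOptimizeForSize;
}

// The pass only does work if it was scheduled AND the function is not
// size-optimized.
bool wouldTransform(OptLevel Level, bool DisableSelectOptimize,
                    const ToyFunction &F) {
  return shouldAddSelectOptimize(Level, DisableSelectOptimize) &&
         !shouldSkipForSize(F);
}
```

With the flag's default from this patch (`cl::init(true)`), `shouldAddSelectOptimize` is false at every level, so the pass stays out of the pipeline unless `-disable-select-optimize=false` is passed.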

llvm/tools/opt/opt.cpp

Show First 20 Lines • Show All 487 Lines • ▼ Show 20 Lines	std::vector<StringRef> PassNameExact = {
"global-merge", "pre-isel-intrinsic-lowering",		"global-merge", "pre-isel-intrinsic-lowering",
"expand-reductions", "indirectbr-expand",		"expand-reductions", "indirectbr-expand",
"generic-to-nvvm", "expandmemcmp",		"generic-to-nvvm", "expandmemcmp",
"loop-reduce", "lower-amx-type",		"loop-reduce", "lower-amx-type",
"pre-amx-config", "lower-amx-intrinsics",		"pre-amx-config", "lower-amx-intrinsics",
"polyhedral-info", "print-polyhedral-info",		"polyhedral-info", "print-polyhedral-info",
"replace-with-veclib", "jmc-instrument",		"replace-with-veclib", "jmc-instrument",
"dot-regions", "dot-regions-only",		"dot-regions", "dot-regions-only",
"view-regions", "view-regions-only"};		"view-regions", "view-regions-only",
		"select-optimize"};
for (const auto &P : PassNamePrefix)		for (const auto &P : PassNamePrefix)
if (Pass.startswith(P))		if (Pass.startswith(P))
return true;		return true;
for (const auto &P : PassNameContain)		for (const auto &P : PassNameContain)
if (Pass.contains(P))		if (Pass.contains(P))
return true;		return true;
return llvm::is_contained(PassNameExact, Pass);		return llvm::is_contained(PassNameExact, Pass);
}		}
Show All 34 Lines	int main(int argc, char **argv) {
initializeInstCombine(Registry);		initializeInstCombine(Registry);
initializeAggressiveInstCombine(Registry);		initializeAggressiveInstCombine(Registry);
initializeInstrumentation(Registry);		initializeInstrumentation(Registry);
initializeTarget(Registry);		initializeTarget(Registry);
// For codegen passes, only passes that do IR to IR transformation are		// For codegen passes, only passes that do IR to IR transformation are
// supported.		// supported.
initializeExpandMemCmpPassPass(Registry);		initializeExpandMemCmpPassPass(Registry);
initializeScalarizeMaskedMemIntrinLegacyPassPass(Registry);		initializeScalarizeMaskedMemIntrinLegacyPassPass(Registry);
		initializeSelectOptimizePass(Registry);
initializeCodeGenPreparePass(Registry);		initializeCodeGenPreparePass(Registry);
initializeAtomicExpandPass(Registry);		initializeAtomicExpandPass(Registry);
initializeRewriteSymbolsLegacyPassPass(Registry);		initializeRewriteSymbolsLegacyPassPass(Registry);
initializeWinEHPreparePass(Registry);		initializeWinEHPreparePass(Registry);
initializeDwarfEHPrepareLegacyPassPass(Registry);		initializeDwarfEHPrepareLegacyPassPass(Registry);
initializeSafeStackLegacyPassPass(Registry);		initializeSafeStackLegacyPassPass(Registry);
initializeSjLjEHPreparePass(Registry);		initializeSjLjEHPreparePass(Registry);
initializePreISelIntrinsicLoweringLegacyPassPass(Registry);		initializePreISelIntrinsicLoweringLegacyPassPass(Registry);
▲ Show 20 Lines • Show All 517 Lines • Show Last 20 Lines
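
Editor's note on the `opt.cpp` hunk above: the patched function classifies a pass name by three rules, checked in order: prefix match, substring match, then exact match. Adding `"select-optimize"` to `PassNameExact` therefore matches only that exact name. The sketch below reproduces the three-step lookup in standalone C++17. `matchesPassName` is a hypothetical name, and the prefix and substring lists in the usage note are made-up sample data, since those lists are not visible in this diff.

```cpp
#include <algorithm>
#include <cassert>
#include <string_view>
#include <vector>

// Returns true if Pass starts with any entry of Prefixes, contains any entry
// of Substrings, or is equal to any entry of Exact (checked in that order).
bool matchesPassName(std::string_view Pass,
                     const std::vector<std::string_view> &Prefixes,
                     const std::vector<std::string_view> &Substrings,
                     const std::vector<std::string_view> &Exact) {
  for (std::string_view P : Prefixes)
    if (Pass.substr(0, P.size()) == P) // prefix test without C++20 starts_with
      return true;
  for (std::string_view S : Substrings)
    if (Pass.find(S) != std::string_view::npos)
      return true;
  return std::find(Exact.begin(), Exact.end(), Pass) != Exact.end();
}
```

For example, with `Exact = {"view-regions-only", "select-optimize"}`, the name `"select-optimize"` matches but `"select-optimize-extra"` does not, unless one of the prefix or substring lists also covers it.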

Revision Contents

Diff 430722

llvm/include/llvm/CodeGen/CodeGenPassBuilder.h

llvm/include/llvm/CodeGen/MachinePassRegistry.def

llvm/include/llvm/CodeGen/Passes.h

llvm/include/llvm/InitializePasses.h

llvm/include/llvm/LinkAllPasses.h

llvm/include/llvm/Target/CGPassBuilderOption.h

llvm/lib/CodeGen/CMakeLists.txt

llvm/lib/CodeGen/CodeGen.cpp

llvm/lib/CodeGen/SelectOptimize.cpp

llvm/lib/CodeGen/TargetPassConfig.cpp

llvm/tools/opt/opt.cpp
