Page MenuHomePhabricator

[SelectOpti][1/5] Setup new select-optimize pass
ClosedPublic

Authored by apostolakis on Feb 20 2022, 10:25 PM.

Details

Summary

This is the first commit for the cmov-vs-branch optimization pass.
The goal is to develop a new profile-guided and target-independent cost/benefit analysis
for selecting conditional moves over branches when optimizing for performance.

Initially, this new pass is expected to be enabled only for instrumentation-based PGO.

RFC: https://discourse.llvm.org/t/rfc-cmov-vs-branch-optimization/6040

Diff Detail

Event Timeline

apostolakis created this revision.Feb 20 2022, 10:25 PM
apostolakis edited the summary of this revision. (Show Details)Feb 20 2022, 10:33 PM
apostolakis edited the summary of this revision. (Show Details)Feb 21 2022, 1:32 PM
apostolakis published this revision for review.Feb 23 2022, 10:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 23 2022, 10:44 AM
apostolakis edited the summary of this revision. (Show Details)

Leave the enabling for instrPGO-only for a separate patch after this patch series gets approved.

lkail added a subscriber: lkail.Feb 25 2022, 4:20 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 7 2022, 9:00 AM
davidxl added inline comments.Mar 7 2022, 9:41 PM
llvm/include/llvm/CodeGen/CodeGenPassBuilder.h
668

Do we need a canonicalization phase before this to convert jumps to select when possible?

llvm/lib/CodeGen/TargetPassConfig.cpp
943

Not suitable when size optimization is on.

apostolakis added inline comments.Mar 9 2022, 4:53 PM
llvm/include/llvm/CodeGen/CodeGenPassBuilder.h
668

SimplifyCFG canonicalizes jumps to selects when possible. Currently SimplifyCFG prevents this canonicalization for a few cases where it is deemed very unprofitable. These cases seem reasonable and are probably not worth converting back to selects just before this pass and reconsidering; although it might be worth empirically evaluating the effect of that. What is more important in my opinion is for SimplifyCFG to canonicalize to selects without trying to optimize to maximize its enabling effect for subsequent LLVM IR optimizations (some preliminary experiments showed a small benefit from such a change). Then at the end, this new pass will decide what form is more profitable. This matter was briefly discussed recently in https://reviews.llvm.org/D118066. Once this new pass is enabled for all PGO builds, I will advocate for changes in SimplifyCFG to aggressively canonicalize to selects when profile information are available and an informed decision can be made later without blocking any LLVM IR optimizations.

llvm/lib/CodeGen/TargetPassConfig.cpp
943

The Code generation optimization level that is checked here (llvm::CodeGenOpt::Level) only contains 4 levels (None, Less, Default, Aggressive), none of which is meant for optimize-for-size. I also do not see any other API at this level that would enable a check for size. Instead, the check for size-opts right now is happening in the beginning of the new pass (see the next patch in this series: D120231), where it checks the attributes of the function for OptSize and also invokes the shouldOptimizeForSize function that uses PGSO heuristics.

The patch looks good to me. Adding Teresa to look at the pass ordering change.

llvm/include/llvm/CodeGen/CodeGenPassBuilder.h
668

SG.

llvm/lib/CodeGen/TargetPassConfig.cpp
943

ok.

The patch looks good to me. Adding Teresa to look at the pass ordering change.

Do you mean the placement of the new pass? Because I don't see any changes to ordering other than the addition.

One comment about the patch is that it would be good to remove the llvm_unreachable, and test for the pass in one of the pass ordering tests. E.g. llvm/test/CodeGen/X86/opt-pipeline.ll (there are similar ones for other archs too).

Thanks for the patches. We've noticed similar problems as you described in RFC which often leads to way too aggressive cmov (in general and in comparison to gcc). Is the current version of this stack complete and functional? If so, we'd be happy to give it a try to see how it handles the suboptimal cases we spotted.

Right. I mean pass placement.

apostolakis added a comment.EditedMar 13 2022, 3:25 AM

One comment about the patch is that it would be good to remove the llvm_unreachable, and test for the pass in one of the pass ordering tests. E.g. llvm/test/CodeGen/X86/opt-pipeline.ll (there are similar ones for other archs too).

In this patch series the pass is disabled by default and I was actually planning on having a separate follow-up patch (D121547) where I will enable by default this pass for x86 instr-PGO. In the D121547 patch I had to change the X86/opt-pipeline.ll and you can see more clearly the placement of this pass (almost just before the CodeGenPrepare pass). If it is preferable I can move these changes in this patch.

Thanks for the patches. We've noticed similar problems as you described in RFC which often leads to way too aggressive cmov (in general and in comparison to gcc). Is the current version of this stack complete and functional? If so, we'd be happy to give it a try to see how it handles the suboptimal cases we spotted.

This stack is complete and functional for x86 instr-PGO. It is not yet tuned for Sample-PGO and there are some Sample-PGO-specific improvements that are yet to be made, most notably leveraging LBR data to capture misprediction rates and incorporating them in the heuristics (as discussed with @modimo in the RFC). So, if you are interested in Sample-PGO it is better to wait for the next patch series that will tailor this pass for that. At that point, we can iterate and refine to make sure that the pass addresses these suboptimal cases and avoids regressions for your workloads.

Thanks for the patches. We've noticed similar problems as you described in RFC which often leads to way too aggressive cmov (in general and in comparison to gcc). Is the current version of this stack complete and functional? If so, we'd be happy to give it a try to see how it handles the suboptimal cases we spotted.

This stack is complete and functional for x86 instr-PGO.

Thanks. I'm trying this with IRPGO on a large internal workload now, should have results soon. Will also take a closer look at the patch set.

It is not yet tuned for Sample-PGO and there are some Sample-PGO-specific improvements that are yet to be made, most notably leveraging LBR data to capture misprediction rates and incorporating them in the heuristics (as discussed with @modimo in the RFC).

While we can have branch miss reported by perf tools and feedback into compiler sample PGO, it would be challenging to accurately correlate the actual branch miss to the unoptimized IR at profile loading time (it is a general challenge to apply any low level PMU info for compiler PGO).

So, if you are interested in Sample-PGO it is better to wait for the next patch series that will tailor this pass for that. At that point, we can iterate and refine to make sure that the pass addresses these suboptimal cases and avoids regressions for your workloads.

For sample PGO, we're going to experiment with deferring some compiler if-convert to PLO (BOLT in our case), where the correlation of branch miss won't be a problem, and also to avoid the problem of being stuck with cmov. cc @Amir @maksfb

FSAFDO allows late profile loading so that the matching of branches should be less of an issue.

Thanks for the patches. We've noticed similar problems as you described in RFC which often leads to way too aggressive cmov (in general and in comparison to gcc). Is the current version of this stack complete and functional? If so, we'd be happy to give it a try to see how it handles the suboptimal cases we spotted.

This stack is complete and functional for x86 instr-PGO.

Thanks. I'm trying this with IRPGO on a large internal workload now, should have results soon. Will also take a closer look at the patch set.

Great! Let me know if you see any regressions or issues.

It is not yet tuned for Sample-PGO and there are some Sample-PGO-specific improvements that are yet to be made, most notably leveraging LBR data to capture misprediction rates and incorporating them in the heuristics (as discussed with @modimo in the RFC).

While we can have branch miss reported by perf tools and feedback into compiler sample PGO, it would be challenging to accurately correlate the actual branch miss to the unoptimized IR at profile loading time (it is a general challenge to apply any low level PMU info for compiler PGO).

As David said, FSAFDO might alleviate this problem.

So, if you are interested in Sample-PGO it is better to wait for the next patch series that will tailor this pass for that. At that point, we can iterate and refine to make sure that the pass addresses these suboptimal cases and avoids regressions for your workloads.

For sample PGO, we're going to experiment with deferring some compiler if-convert to PLO (BOLT in our case), where the correlation of branch miss won't be a problem, and also to avoid the problem of being stuck with cmov. cc @Amir @maksfb

The lack of mispredict data for cmovs will be a problem but we will not be stuck with a cmov decision but rather we might observe some oscillations (which is still problematic but not atypical of SampleFDO settings and there are some known remedies). The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

In terms of making this decision at the BOLT level, it might have more limited applicability compared to a LLVM IR pass since it is a bit harder to find which branches are eligible to be converted to cmovs and employing dataflow-based heuristics as the ones possible in LLVM IR seem quite tricky.

Right. I mean pass placement.

It seems reasonable since iiuc we want to do this very late, and it looks like it will be the last IR pass.

One comment about the patch is that it would be good to remove the llvm_unreachable, and test for the pass in one of the pass ordering tests. E.g. llvm/test/CodeGen/X86/opt-pipeline.ll (there are similar ones for other archs too).

In this patch series the pass is disabled by default and I was actually planning on having a separate follow-up patch (D121547) where I will enable by default this pass for x86 instr-PGO. In the D121547 patch I had to change the X86/opt-pipeline.ll and you can see more clearly the placement of this pass (almost just before the CodeGenPrepare pass). If it is preferable I can move these changes in this patch.

I see, ok it seems reasonable to leave the pass pipeline testing until then. But generally it is good to have tests with each patch - another option is to merge this with the next patch which I assume has part of the implementation in it, and add an opt based test with that. Oh, I see that patches 2 and 3 don't have a test, only patch 4. IMO if at all possible it is better to split up the patches into pieces that can each be tested.

Thanks for the patches. We've noticed similar problems as you described in RFC which often leads to way too aggressive cmov (in general and in comparison to gcc). Is the current version of this stack complete and functional? If so, we'd be happy to give it a try to see how it handles the suboptimal cases we spotted.

This stack is complete and functional for x86 instr-PGO.

Thanks. I'm trying this with IRPGO on a large internal workload now, should have results soon. Will also take a closer look at the patch set.

Great! Let me know if you see any regressions or issues.

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

It is not yet tuned for Sample-PGO and there are some Sample-PGO-specific improvements that are yet to be made, most notably leveraging LBR data to capture misprediction rates and incorporating them in the heuristics (as discussed with @modimo in the RFC).

While we can have branch miss reported by perf tools and feedback into compiler sample PGO, it would be challenging to accurately correlate the actual branch miss to the unoptimized IR at profile loading time (it is a general challenge to apply any low level PMU info for compiler PGO).

As David said, FSAFDO might alleviate this problem.

So, if you are interested in Sample-PGO it is better to wait for the next patch series that will tailor this pass for that. At that point, we can iterate and refine to make sure that the pass addresses these suboptimal cases and avoids regressions for your workloads.

For sample PGO, we're going to experiment with deferring some compiler if-convert to PLO (BOLT in our case), where the correlation of branch miss won't be a problem, and also to avoid the problem of being stuck with cmov. cc @Amir @maksfb

The lack of mispredict data for cmovs will be a problem but we will not be stuck with a cmov decision but rather we might observe some oscillations (which is still problematic but not atypical of SampleFDO settings and there are some known remedies).

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branch without profile, that turn this problem into a typical sample PGO oscillations. That is not the case before this patch set, are we changing the behavior now? I'm also not sure if such oscillation is as easily mitigable as other oscillations like those from speculative ICP.

The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

nit: saying misprediction rate here and in the RFC is a bit confusing because today we don't have that data in profile. that threshold is how biased a branch is, which is a proxy for branch miss. But branch predictor could still do well (low branch miss) for unbiased branches.

In terms of making this decision at the BOLT level, it might have more limited applicability compared to a LLVM IR pass since it is a bit harder to find which branches are eligible to be converted to cmovs and employing dataflow-based heuristics as the ones possible in LLVM IR seem quite tricky.

Yes, that is a different challenge.

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

Does BOLT's cmov optimization improve performance for this workload?

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branch without profile, that turn this problem into a typical sample PGO oscillations. That is not the case before this patch set, are we changing the behavior now? I'm also not sure if such oscillation is as easily mitigable as other oscillations like those from speculative ICP.

Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from production binary but collected using pre-BOLD binary in a training run? If this is the setup, compiler can choose to minimze cmov generation for the sake of better profilling.

David

The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

nit: saying misprediction rate here and in the RFC is a bit confusing because today we don't have that data in profile. that threshold is how biased a branch is, which is a proxy for branch miss. But branch predictor could still do well (low branch miss) for unbiased branches.

In terms of making this decision at the BOLT level, it might have more limited applicability compared to a LLVM IR pass since it is a bit harder to find which branches are eligible to be converted to cmovs and employing dataflow-based heuristics as the ones possible in LLVM IR seem quite tricky.

Yes, that is a different challenge.

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

Does BOLT's cmov optimization improve performance for this workload?

This is being worked on now and we don't have data yet. The numbers above didn't have BOLT interfering with cmov.

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branch without profile, that turn this problem into a typical sample PGO oscillations. That is not the case before this patch set, are we changing the behavior now? I'm also not sure if such oscillation is as easily mitigable as other oscillations like those from speculative ICP.

Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from production binary but collected using pre-BOLD binary in a training run? If this is the setup, compiler can choose to minimze cmov generation for the sake of better profilling.

David

Right, making compiler conservative to preserve branch so we can have control flow profile for BOLT to make final decision is the experiment we're doing. Pseudo-probe for sample PGO can also be tuned a bit more intrusive to disallow cmov for better profile in that setup.

The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

nit: saying misprediction rate here and in the RFC is a bit confusing because today we don't have that data in profile. that threshold is how biased a branch is, which is a proxy for branch miss. But branch predictor could still do well (low branch miss) for unbiased branches.

In terms of making this decision at the BOLT level, it might have more limited applicability compared to a LLVM IR pass since it is a bit harder to find which branches are eligible to be converted to cmovs and employing dataflow-based heuristics as the ones possible in LLVM IR seem quite tricky.

Yes, that is a different challenge.

Amir added a comment.Mar 14 2022, 10:34 PM

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

Does BOLT's cmov optimization improve performance for this workload?

I didn't measure it yet, but unlikely (see comment below).

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branch without profile, that turn this problem into a typical sample PGO oscillations. That is not the case before this patch set, are we changing the behavior now? I'm also not sure if such oscillation is as easily mitigable as other oscillations like those from speculative ICP.

Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from production binary but collected using pre-BOLD binary in a training run?

Yes, the profile data should be collected from pre-BOLT binary.

If this is the setup, compiler can choose to minimze cmov generation for the sake of better profilling.

The compiler can indeed choose to minimize cmov generation – I've recently added an LLVM knob to force-expand all cmov's in D119777 (x86-cmov-converter-force-all).

However, the data collected with (non-PGO, non-LTO) clang binary suggests that x86-cmov-converter-force-all introduces a significant perf regression that BOLT's CMOV conversion with default heuristics is unable to recover from. BOLT converts back a minor percentage (~5%) of eligible hammocks based on execution and misprediction heuristics (>5% misprediction rate, >1% biased condition). The hypothesis is that force-expanding cmov's results in 1) a code size increase, 2) more branches => higher pressure on BPU structures, and given that BOLT converts back only a small part of hammocks back, these factors result in a net regression.

In other words, misprediction rate may not be the most important factor in hammock-vs-cmov tradeoff for large code footprint workloads. I believe that a holistic approach (criticality + misprediction rate + code size) may yield better performance.

David

The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

nit: saying misprediction rate here and in the RFC is a bit confusing because today we don't have that data in profile. that threshold is how biased a branch is, which is a proxy for branch miss. But branch predictor could still do well (low branch miss) for unbiased branches.

In terms of making this decision at the BOLT level, it might have more limited applicability compared to a LLVM IR pass since it is a bit harder to find which branches are eligible to be converted to cmovs and employing dataflow-based heuristics as the ones possible in LLVM IR seem quite tricky.

Yes, that is a different challenge.

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

Does BOLT's cmov optimization improve performance for this workload?

This is being worked on now and we don't have data yet. The numbers above didn't have BOLT interfering with cmov.

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branch without profile, that turn this problem into a typical sample PGO oscillations. That is not the case before this patch set, are we changing the behavior now? I'm also not sure if such oscillation is as easily mitigable as other oscillations like those from speculative ICP.

Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from production binary but collected using pre-BOLD binary in a training run? If this is the setup, compiler can choose to minimze cmov generation for the sake of better profilling.

David

Right, making compiler conservative to preserve branch so we can have control flow profile for BOLT to make final decision is the experiment we're doing. Pseudo-probe for sample PGO can also be tuned a bit more intrusive to disallow cmov for better profile in that setup.

But in this case, the binary (from BOLT) used to generate profile for the compiler still have cmov, so some loss of profile data is unavoidable. The oscillating issue will mostly be gone though.

The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

nit: saying misprediction rate here and in the RFC is a bit confusing because today we don't have that data in profile. that threshold is how biased a branch is, which is a proxy for branch miss. But branch predictor could still do well (low branch miss) for unbiased branches.

In terms of making this decision at the BOLT level, it might have more limited applicability compared to a LLVM IR pass since it is a bit harder to find which branches are eligible to be converted to cmovs and employing dataflow-based heuristics as the ones possible in LLVM IR seem quite tricky.

Yes, that is a different challenge.

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

Does BOLT's cmov optimization improve performance for this workload?

I didn't measure it yet, but unlikely (see comment below).

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branch without profile, that turn this problem into a typical sample PGO oscillations. That is not the case before this patch set, are we changing the behavior now? I'm also not sure if such oscillation is as easily mitigable as other oscillations like those from speculative ICP.

Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from production binary but collected using pre-BOLD binary in a training run?

Yes, the profile data should be collected from pre-BOLT binary.

If this is the setup, compiler can choose to minimze cmov generation for the sake of better profilling.

The compiler can indeed choose to minimize cmov generation – I've recently added an LLVM knob to force-expand all cmov's in D119777 (x86-cmov-converter-force-all).

However, the data collected with (non-PGO, non-LTO) clang binary suggests that x86-cmov-converter-force-all introduces a significant perf regression that BOLT's CMOV conversion with default heuristics is unable to recover from.

I assume BOLT's block layout lays out those branchy code properly, right?

BOLT converts back a minor percentage (~5%) of eligible hammocks based on execution and misprediction heuristics (>5% misprediction rate, >1% biased condition).

Only 5% of the hammock based execution meets the conversion criteria or 5% of the candidates matching the criteria can be converted back ?

The hypothesis is that force-expanding cmov's results in 1) a code size increase, 2) more branches => higher pressure on BPU structures, and given that BOLT converts back only a small part of hammocks back, these factors result in a net regression.

In other words, misprediction rate may not be the most important factor in hammock-vs-cmov tradeoff for large code footprint workloads. I believe that a holistic approach (criticality + misprediction rate + code size) may yield better performance.

yes, modelling the global effect as well as branch interactions will be useful thing to do. Note that newly introduced branches can change BPU behavior thus lead to different branch misprediction distribution too.

David

The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

nit: saying misprediction rate here and in the RFC is a bit confusing because today we don't have that data in profile. that threshold is how biased a branch is, which is a proxy for branch miss. But branch predictor could still do well (low branch miss) for unbiased branches.

In terms of making this decision at the BOLT level, it might have more limited applicability compared to a LLVM IR pass since it is a bit harder to find which branches are eligible to be converted to cmovs and employing dataflow-based heuristics as the ones possible in LLVM IR seem quite tricky.

Yes, that is a different challenge.

But in this case, the binary (from BOLT) used to generate profile for the compiler still have cmov, so some loss of profile data is unavoidable. The oscillating issue will mostly be gone though.

We have some special setup where dedicated tier is used for sample profiling for both compiler and BOLT, in that setup we're not profiling post-BOLT binary. It's not the typical prod profiling setup for sample PGO.

OTOH for cmov, since compiler is going to be conservative and deferring that to BOLT, not have compiler profile is okay (just from cmov perspective) as long as BOLT sees the profile from pre-BOLT binary which doesn't do cmov.

Amir added a comment.Mar 14 2022, 11:01 PM

On that internal workload, we've got 6% less cmov with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (we can measure 0.2% perf movement on that workload with high confidence).

Does BOLT's cmov optimization improve performance for this workload?

I didn't measure it yet, but unlikely (see comment below).

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

If we indeed never use cmov for branch without profile, that turn this problem into a typical sample PGO oscillations. That is not the case before this patch set, are we changing the behavior now? I'm also not sure if such oscillation is as easily mitigable as other oscillations like those from speculative ICP.

Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from production binary but collected using pre-BOLD binary in a training run?

Yes, the profile data should be collected from pre-BOLT binary.

If this is the setup, compiler can choose to minimze cmov generation for the sake of better profilling.

The compiler can indeed choose to minimize cmov generation – I've recently added an LLVM knob to force-expand all cmov's in D119777 (x86-cmov-converter-force-all).

However, the data collected with (non-PGO, non-LTO) clang binary suggests that x86-cmov-converter-force-all introduces a significant perf regression that BOLT's CMOV conversion with default heuristics is unable to recover from.

I assume BOLT's block layout lays out those branchy code properly, right?

No, function splitting and layout were turned off, so only baseline BOLT with and w/o CMOV conversion.

BOLT converts back a minor percentage (~5%) of eligible hammocks based on execution and misprediction heuristics (>5% misprediction rate, >1% biased condition).

Only 5% of the hammock based execution meets the conversion criteria or 5% of the candidates matching the criteria can be converted back ?

The former: 5% of the hammock based execution meets the conversion criteria (with default CMOV conversion heuristics).

The hypothesis is that force-expanding cmov's results in 1) a code size increase, 2) more branches => higher pressure on BPU structures, and given that BOLT converts back only a small part of hammocks back, these factors result in a net regression.

In other words, misprediction rate may not be the most important factor in hammock-vs-cmov tradeoff for large code footprint workloads. I believe that a holistic approach (criticality + misprediction rate + code size) may yield better performance.

yes, modelling the global effect as well as branch interactions will be useful thing to do.

I'm hoping Sotiris's pass can achieve that eventually.

Note that newly introduced branches can change BPU behavior thus lead to different branch misprediction distribution too.

David

The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

nit: saying misprediction rate here and in the RFC is a bit confusing because today we don't have that data in profile. that threshold is how biased a branch is, which is a proxy for branch miss. But branch predictor could still do well (low branch miss) for unbiased branches.

In terms of making this decision at the BOLT level, it might have more limited applicability compared to a LLVM IR pass since it is a bit harder to find which branches are eligible to be converted to cmovs and employing dataflow-based heuristics as the ones possible in LLVM IR seem quite tricky.

Yes, that is a different challenge.

One comment about the patch is that it would be good to remove the llvm_unreachable, and test for the pass in one of the pass ordering tests. E.g. llvm/test/CodeGen/X86/opt-pipeline.ll (there are similar ones for other archs too).

In this patch series the pass is disabled by default and I was actually planning on having a separate follow-up patch (D121547) where I will enable by default this pass for x86 instr-PGO. In the D121547 patch I had to change the X86/opt-pipeline.ll and you can see more clearly the placement of this pass (almost just before the CodeGenPrepare pass). If it is preferable I can move these changes in this patch.

I see, ok it seems reasonable to leave the pass pipeline testing until then. But generally it is good to have tests with each patch - another option is to merge this with the next patch which I assume has part of the implementation in it, and add an opt based test with that. Oh, I see that patches 2 and 3 don't have a test, only patch 4. IMO if at all possible it is better to split up the patches into pieces that can each be tested.

The reason that all the tests are in the 4th patch is that this patch involves the actual transformation (converts selects to branches for the cases that the conversion was deemed profitable based on the profitability heuristics of patches 2 and 3). The 2nd patch has the base (non-loop) heuristics and the 3rd patch has the loop-level heuristics. Patches 2 and 3 do not change the IR.

I agree though that it is better to split it up in a way that allows testing of each patch. So, I will re-organize the patches to enable more per-patch testing. I will move the base of the actual transformation of the code in a 2nd patch (and some testing by temporarily assuming that all selects should be converted), then the 3rd patch will be the base heuristics and testing, 4th patch loop heuristics and testing, and a 5th patch that optimizes the transformation (maximizes the sinking of one-use slices in the true/false blocks and interleaving of slices).

So, if you are interested in Sample-PGO it is better to wait for the next patch series that will tailor this pass for that. At that point, we can iterate and refine to make sure that the pass addresses these suboptimal cases and avoids regressions for your workloads.

The lack of mispredict data for cmovs will be a problem but we will not be stuck with a cmov decision but rather we might observe some oscillations (which is still problematic but not atypical of SampleFDO settings and there are some known remedies).

Say we end up with cmov in one of the sample PGO iterations (either due to lack of profile, or profile indicating branch being unbiased), we would lose the control flow profile that is needed to tell how biased that original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmov in future iterations even if branch becomes more biased later because we will never get control flow profile again.

We will not get stuck if the non-profile logic and default values for branch weights and mispredict rates are set in a way that favors branches. Already talked about mispredict rates. Another example is the base heuristic that converts selects to branches when there is an expensive operand in the computation of the cold operand (currently less than 20% selected operand). If this select is currently a branch and the profile-based weight is 25% for the expensive operand (so not cold) then we might convert that to a cmov. Then we will not have profile information but we can allow all selects with expensive operands but no profile data to be conservatively converted to branches, and then if the expensive operand becomes cold the select will become a branch, otherwise there will be oscillation.

In general, I think some reasonable non-profile-based heuristics that favor branches will be better than blindly converting all cases to branches even for cases where a cmov seems preferable even with unfavorable branch weights and mispredict rate.

The default misprediction rate used by the compiler (currently 25%) is expected to be less than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example if a branch mispredicts 50% of the time, we could convert that to a cmov. Then the cmov will get compared with a branch that mispredicts 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. It is not necessary that the rest of the heuristics will allow a conversion back to a branch, but the cmov decision will be for sure revertible.

nit: saying misprediction rate here and in the RFC is a bit confusing because today we don't have that data in profile. that threshold is how biased a branch is, which is a proxy for branch miss. But branch predictor could still do well (low branch miss) for unbiased branches.

Just to clarify since you misunderstood me, I was referring (both here and in the RFC) to actual misprediction rate and not branch weights. If you look at the 3rd patch (D120232) that includes the loop-level heuristics you will see that a mispredict rate (that conservatively defaults to 25%) is taken into account to calculate the cost of branches. Branch weights are used to compute the branch cost for correctly predicted cases (BranchCost = PredictedPathCost + MispredictCost * MispredictRate, where PredictedPathCost = TrueOpCost * TrueWeight + FalseOpCost * FalseWeight). Unbiased branches indeed do not necessarily lead to mispredictions (as discussed in the RFC) and the only case where branch weights are used for predictability for this optimization is for cases where the branch is entirely biased to one direction and thus it is somewhat expected to be predictable (part of the base heuristics). If profile-based mispredict rates were available, there would at least one extra base heuristic that handles some extreme cases (highly or poorly predicted) and then for in-between cases the existing heuristics will be refined to make the decision.

bsmith added a subscriber: bsmith.Mar 21 2022, 7:59 AM
apostolakis retitled this revision from [SelectOpti][1/4] Setup new select-optimize pass to [SelectOpti][1/5] Setup new select-optimize pass.Mar 22 2022, 2:06 PM

One comment about the patch is that it would be good to remove the llvm_unreachable, and test for the pass in one of the pass ordering tests. E.g. llvm/test/CodeGen/X86/opt-pipeline.ll (there are similar ones for other archs too).

In this patch series the pass is disabled by default and I was actually planning on having a separate follow-up patch (D121547) where I will enable by default this pass for x86 instr-PGO. In the D121547 patch I had to change the X86/opt-pipeline.ll and you can see more clearly the placement of this pass (almost just before the CodeGenPrepare pass). If it is preferable I can move these changes in this patch.

I see, ok it seems reasonable to leave the pass pipeline testing until then. But generally it is good to have tests with each patch - another option is to merge this with the next patch which I assume has part of the implementation in it, and add an opt based test with that. Oh, I see that patches 2 and 3 don't have a test, only patch 4. IMO if at all possible it is better to split up the patches into pieces that can each be tested.

The reason that all the tests are in the 4th patch is that this patch involves the actual transformation (converts selects to branches for the cases that the conversion was deemed profitable based on the profitability heuristics of patches 2 and 3). The 2nd patch has the base (non-loop) heuristics and the 3rd patch has the loop-level heuristics. Patches 2 and 3 do not change the IR.

I agree though that it is better to split it up in a way that allows testing of each patch. So, I will re-organize the patches to enable more per-patch testing. I will move the base of the actual transformation of the code in a 2nd patch (and some testing by temporarily assuming that all selects should be converted), then the 3rd patch will be the base heuristics and testing, 4th patch loop heuristics and testing, and a 5th patch that optimizes the transformation (maximizes the sinking of one-use slices in the true/false blocks and interleaving of slices).

As discussed, re-organized the subsequent patches to allow for per-patch testing.

Use LegacyPM by default for this pass.

One comment about the patch is that it would be good to remove the llvm_unreachable, and test for the pass in one of the pass ordering tests. E.g. llvm/test/CodeGen/X86/opt-pipeline.ll (there are similar ones for other archs too).

In this patch series the pass is disabled by default and I was actually planning on having a separate follow-up patch (D121547) where I will enable by default this pass for x86 instr-PGO. In the D121547 patch I had to change the X86/opt-pipeline.ll and you can see more clearly the placement of this pass (almost just before the CodeGenPrepare pass). If it is preferable I can move these changes in this patch.

I see, ok it seems reasonable to leave the pass pipeline testing until then. But generally it is good to have tests with each patch - another option is to merge this with the next patch which I assume has part of the implementation in it, and add an opt based test with that. Oh, I see that patches 2 and 3 don't have a test, only patch 4. IMO if at all possible it is better to split up the patches into pieces that can each be tested.

The reason that all the tests are in the 4th patch is that this patch involves the actual transformation (converts selects to branches for the cases that the conversion was deemed profitable based on the profitability heuristics of patches 2 and 3). The 2nd patch has the base (non-loop) heuristics and the 3rd patch has the loop-level heuristics. Patches 2 and 3 do not change the IR.

I agree though that it is better to split it up in a way that allows testing of each patch. So, I will re-organize the patches to enable more per-patch testing. I will move the base of the actual transformation of the code in a 2nd patch (and some testing by temporarily assuming that all selects should be converted), then the 3rd patch will be the base heuristics and testing, 4th patch loop heuristics and testing, and a 5th patch that optimizes the transformation (maximizes the sinking of one-use slices in the true/false blocks and interleaving of slices).

As discussed, re-organized the subsequent patches to allow for per-patch testing.

@tejohnson does the re-organization of subsequent patches to allow for per-patch testing look okay to you? Is there anything else to note for this first patch? Apart from the decision of where to place the new pass, the rest of the code is mostly boilerplate.

tejohnson accepted this revision.Apr 4 2022, 9:41 PM

One comment about the patch is that it would be good to remove the llvm_unreachable, and test for the pass in one of the pass ordering tests. E.g. llvm/test/CodeGen/X86/opt-pipeline.ll (there are similar ones for other archs too).

In this patch series the pass is disabled by default and I was actually planning on having a separate follow-up patch (D121547) where I will enable by default this pass for x86 instr-PGO. In the D121547 patch I had to change the X86/opt-pipeline.ll and you can see more clearly the placement of this pass (almost just before the CodeGenPrepare pass). If it is preferable I can move these changes in this patch.

I see, ok it seems reasonable to leave the pass pipeline testing until then. But generally it is good to have tests with each patch - another option is to merge this with the next patch which I assume has part of the implementation in it, and add an opt based test with that. Oh, I see that patches 2 and 3 don't have a test, only patch 4. IMO if at all possible it is better to split up the patches into pieces that can each be tested.

The reason that all the tests are in the 4th patch is that this patch involves the actual transformation (converts selects to branches for the cases that the conversion was deemed profitable based on the profitability heuristics of patches 2 and 3). The 2nd patch has the base (non-loop) heuristics and the 3rd patch has the loop-level heuristics. Patches 2 and 3 do not change the IR.

I agree though that it is better to split it up in a way that allows testing of each patch. So, I will re-organize the patches to enable more per-patch testing. I will move the base of the actual transformation of the code in a 2nd patch (and some testing by temporarily assuming that all selects should be converted), then the 3rd patch will be the base heuristics and testing, 4th patch loop heuristics and testing, and a 5th patch that optimizes the transformation (maximizes the sinking of one-use slices in the true/false blocks and interleaving of slices).

As discussed, re-organized the subsequent patches to allow for per-patch testing.

@tejohnson does the re-organization of subsequent patches to allow for per-patch testing look okay to you? Is there anything else to note for this first patch? Apart from the decision of where to place the new pass, the rest of the code is mostly boilerplate.

Sorry for the delay. I like the new patch organization with the additional testing, thanks!

LGTM

This revision is now accepted and ready to land.Apr 4 2022, 9:41 PM
apostolakis edited the summary of this revision. (Show Details)May 13 2022, 5:10 PM

Init DisableSelectOptimize to true when declaring it.

This revision was landed with ongoing or failed builds.May 19 2022, 9:42 AM
This revision was automatically updated to reflect the committed changes.
This comment was removed by junaire.