This is an archive of the discontinued LLVM Phabricator instance.

[CGP] Handle select instructions relying on the same condition
AbandonedPublic

Authored by lkail on Nov 14 2021, 11:45 PM.

Details

Reviewers
spatel
davidxl
bkramer
jsji
apostolakis
shchenz
lebedev.ri
Group Reviewers
Restricted Project
Summary

In SimplifyCFGPass we have transformed specific branches to select instruction and we sometimes want undo this transformation in CGP.
Currently, CGP only transform select i1 %cond to branches if %cond has one use. https://reviews.llvm.org/D24147 relaxes it a bit, but it relies on the branch weight on the select instruction and SimplifyCFGPass looks not maintaining this branch weight when perform the transformation.
Some of our internal workload do some computation like

loop:
...
if (cond) {
  // update max/min values
  // update indexes of max/min values
}

and will be like

loop:
  select cond ...
  select cond ...
  select cond ...
  ...

after SimplifyCFGPass.
If these select are retained after CGP, we will have redundant computations of select's operands in the loop and hurt the performance(even the compuation of the operand is cheap, but it's in the loop, sill has big impact on the performance). This patch tries to mitigate this situation.
Introduced new flag cgp-sink-select-operand-ratio-against-misprediction to model such situation roughly. Say computing %t's cost is Cost(%t), the probability of taking %t is P(%t), misprediction probability is PM, misprediction penalty is PENALTY. We want Cost(%t) * P(%t) + PM * PENALTY < Cost(%t) aka PENALTY < Cost(%t)*(1-P(%t))/PM when sinking select operand.

Diff Detail

Event Timeline

lkail created this revision.Nov 14 2021, 11:45 PM
lkail requested review of this revision.Nov 14 2021, 11:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 14 2021, 11:45 PM
lkail updated this revision to Diff 387530.Nov 16 2021, 2:01 AM
lkail edited the summary of this revision. (Show Details)
lkail added reviewers: Restricted Project, spatel, davidxl, bkramer, jsji.

+ Sotiris who is working in general improvement in this area.

apostolakis added inline comments.Nov 20 2021, 10:19 AM
llvm/lib/CodeGen/CodeGenPrepare.cpp
6578–6582

This is not sound. Checking for just one use was conservative but it ensured that the instruction can be sinked. If other select instructions use the same operand, the operand can be sinked only if it is always used on the same path (i.e., always true or false operand).

Here is an example:

%x = ...
%y = ...
%a = select %cond, %x, %y
%b = select %cond, %y, %x

If converted to branch, %x and %y cannot be sinked since they are needed in both paths:

%x = ...
%y = ...
if (%cond)
   %a = %x
   %b = %y
else
   %a = %y
   %b = %x

So, you will need to change the check here to account for such scenarios.

6584

Enabling this option would make conversion to branches very aggressive without enough information to make such a decision. Sure in some cases it might be profitable to convert even selects with cheap operands but surely not always. For example, if the branch mispredicts once in a while, then the misprediction cost could outweigh the cost of computing all the operands of the select. Having this check only for expensive operands raised a bit the bar of how much misprediction would be acceptable.

6627–6628

You mentioned in the description that SimplifyCFGPass sometimes does not maintain branch metadata. Have you seen cases where the metadata is preserved only for a subset of a group of selects? If so, this change is useful. The ideal solution would be to address the issue in SimplifyCFGPass and avoid here the extra compile time and complexity of looking over all the select instructions (and avoid cases where none of the selects have branch info). But for now, at least it is worth adding a comment that explains why you need to check all of them so that this change can be reverted if the SimplifyCFGPass becomes better at maintaining metadata.

6631–6632

This comment needs to be changed if the proposed change below goes through.

lkail updated this revision to Diff 389103.Nov 23 2021, 12:15 AM
  • Address comments.
  • Add test of inconsistent use of operand in select group
lkail marked 2 inline comments as done.Nov 23 2021, 12:30 AM
lkail added inline comments.
llvm/lib/CodeGen/CodeGenPrepare.cpp
6578–6582

Good catch. Added test for it.

6584

Good point! But I'm afraid CGP currently lacks facility exposing misprediction cost which should be defined by MCSchedModel. I happened to find X86 implements its own X86CmovConversion.cpp which uses more target-specific info to make the decision. The FIXME in isFormingBranchFromSelectProfitable also says

// FIXME: This should use the same heuristics as IfConversion to determine
// whether a select is better represented as a branch.

I starts doubting is it appropriate to get the task done in CGP since we don't have accurate target info.

6627–6628

Yes, SimplifyCFG keeps all !prof metadata. What I meant in the summary is SimplifyCFG doesn't maintain branch probability which is calculated by BPI statically.

lkail marked an inline comment as done.Nov 23 2021, 12:35 AM

@jsji @shchenz Do you think it is good idea we implement isel->branch transformation by extending current PPCExpandISel which was mentioned a bit by hal in https://reviews.llvm.org/D34769? Thus we can have enough target info to solve the concern raised by @apostolakis that trading between misprediction penalty and number of instructions executed.

If we make the SimplifyCFG pass keep the statically calculated branch weights(if any) for the select instructions, will our case be optimized like expected in the CGP pass? If so, I think letting SimplifyCFG pass keep the branch weights for the select instructions should make more sense. I saw there are some functions like setBranchWeights in SimplifyCFG pass that will update the branch weights, but not sure why it does not work for our case.

I am actually in the process of porting some of the ideas of X86CmovConversion.cpp in CGP. Mainly the idea of doing a loop-based critical path analysis, and potentially make this x86-specific pass obsolete.
In general, it is quite tricky to get this conversion right and changes can easily lead to regressions. Even with the existing shallow and conservative heuristics in CGP (that rarely convert selects), I find several cases where CGP harms performance by converting a select to a branch. So, I am reconsidering every heuristic.
X86CmovConversion.cpp is towards the right direction but it cannot leverage profile information which are available in CGP. It also aggressively converts conditional moves with load operands which is often not ideal.
IfConversion passes do not currently convert branches for all architectures (AFAIK not for x86 and AArch64, not sure about PowerPC).

By the way, in CGP you can get an architecture-specific misprediction penalty (you can get a TargetSchedModel using the TargetSubtargetInfo field of CodeGenPrepare). Note though that without any PMU, the compiler cannot really guess the misprediction rate. Therefore, the actual cost of misspeculation is not really computable. In X86CmovConversion.cpp, the misprediction rate is conservatively assumed to be 25%.
That said, a target-specific pass would give you access to other forms of information; for example one could better model the backend pressure caused by the lengthy dependence chains of predicated code.

lkail updated this revision to Diff 389432.Nov 24 2021, 1:59 AM

Having this check only for expensive operands raised a bit the bar of how much misprediction would be acceptable.

I don't think I have caught your idea exactly. I've updated the patch, please correct me if I'm wrong.

lkail added a comment.Nov 24 2021, 2:05 AM

If we make the SimplifyCFG pass keep the statically calculated branch weights(if any) for the select instructions, will our case be optimized like expected in the CGP pass? If so, I think letting SimplifyCFG pass keep the branch weights for the select instructions should make more sense. I saw there are some functions like setBranchWeights in SimplifyCFG pass that will update the branch weights, but not sure why it does not work for our case.

Looks we can't avoid if we use static analysis result from BPI by default when profiling data is missing, but the static result can't reflect actual misprediction and cause degradation

lkail updated this revision to Diff 390238.Nov 28 2021, 7:18 PM
lkail updated this revision to Diff 390242.Nov 28 2021, 7:27 PM
apostolakis added a comment.EditedDec 7 2021, 10:23 AM

I was waiting for someone else to take a look as well since I am new in the reviewing process in llvm.
Maybe it would help to state if you have any motivating case that led to these changes and especially to the introduction of a new flag.

-The updated flag is a bit better than the initial one and points out to the users that they need to think about misprediction penalty when tuning branch vs select heuristics.
-Looking into all the selects of the group when considering the sinkSelectOperand heuristic are fine with me and in accordance with the logic behind these heuristics.
-Cmp->hasOneUse() allowed only groups with one select to be optimized in CGP. This was a bit of an odd choice. The proposed change lifts this restriction. This is not necessarily a bad idea (the perf impact is unclear). I wouldn’t agree though that selects are considered expensive if the Cmp is only used within the select group. I would be surprised if the compare instruction is ever used outside the group.

apostolakis added inline comments.Dec 7 2021, 10:27 AM
llvm/lib/CodeGen/CodeGenPrepare.cpp
6588–6590

Not certain but wouldn't TTI->getInstructionCost(I, TargetTransformInfo::TCK_Latency) return cycle latency of an instruction?

lkail updated this revision to Diff 397444.EditedJan 4 2022, 6:49 PM

Use TTI->getInstructionCost(I, TargetTransformInfo::TCK_Latency) to estimate cheap instruction's latency and gentle ping.

lkail edited the summary of this revision. (Show Details)Jan 4 2022, 7:10 PM
lkail edited the summary of this revision. (Show Details)Jan 4 2022, 7:16 PM
lkail updated this revision to Diff 397447.Jan 4 2022, 7:25 PM
lkail added a comment.Jan 11 2022, 6:07 PM

Gentle ping.

I think this was asked earlier, but I don't see an answer: could we prevent SimplifyCFG from forming these selects in the first place? Is the cost model broken in some way or just not accurate enough?

I don't object to adding/moving logic to CGP, but it's a bit strange that we are doing that while we are trying to move away from SDAG to GISel (CGP was intended to be a temporary pass to overcome the limits of one-block-at-a-time SDAG).

I sympathize with the previous comments about x86's cmov pass. AFAIK, we don't have all of the needed target/profile info in one place, so it always causes complaints.
There used to be a bunch of cmov bugs linked together in bugzilla, and I'm not sure how to find/link those again with github issues. Here's one:
https://github.com/llvm/llvm-project/issues/42246

lkail added a comment.Jan 13 2022, 2:04 AM

could we prevent SimplifyCFG from forming these selects in the first place? Is the cost model broken in some way or just not accurate enough?

Sorry I have missed this one. IIUC, FoldTwoEntryPHINode should be the place performing branch to select transformation. FoldTwoEntryPHINode checks

// Okay, we found that we can merge this two-entry phi node into a select.                                                                                                                  
// Doing so would require us to fold *all* two entry phi nodes in this block.                                                                                                               
// At some point this becomes non-profitable (particularly if the target                                                                                                                    
// doesn't support cmov's).  Only do this transformation if there are two or                                                                                                                
// fewer PHI nodes in this block.                                                                                                                                                           
unsigned NumPhis = 0;                                                                                                                                                                       
for (BasicBlock::iterator I = BB->begin(); isa<PHINode>(I); ++NumPhis, ++I)                                                                                                                 
  if (NumPhis > 2)                                                                                                                                                                          
    return false;

which looks not accurate enough.

Is it appropriate to improve this heuristic rather than in CGP, since SimplifyCFGPass looks like a canonical pass to me and also we lack subtarget info(which includes mispredict penalty) in SimplifyCFGPass?

could we prevent SimplifyCFG from forming these selects in the first place? Is the cost model broken in some way or just not accurate enough?

Sorry I have missed this one. IIUC, FoldTwoEntryPHINode should be the place performing branch to select transformation. FoldTwoEntryPHINode checks

// Okay, we found that we can merge this two-entry phi node into a select.                                                                                                                  
// Doing so would require us to fold *all* two entry phi nodes in this block.                                                                                                               
// At some point this becomes non-profitable (particularly if the target                                                                                                                    
// doesn't support cmov's).  Only do this transformation if there are two or                                                                                                                
// fewer PHI nodes in this block.                                                                                                                                                           
unsigned NumPhis = 0;                                                                                                                                                                       
for (BasicBlock::iterator I = BB->begin(); isa<PHINode>(I); ++NumPhis, ++I)                                                                                                                 
  if (NumPhis > 2)                                                                                                                                                                          
    return false;

which looks not accurate enough.

Is it appropriate to improve this heuristic rather than in CGP, since SimplifyCFGPass looks like a canonical pass to me and also we lack subtarget info(which includes mispredict penalty) in SimplifyCFGPass?

cc @lebedev.ri who has also made adjustments on enabling the phi->select transforms in SimplifyCFG.

If we can adjust SimplifyCFG in some way to do the right thing for your motivating case and not break anything else, that would be great. :)
The SimplifyCFG transform is predicated on a mix of heuristics (like the hard limit of 2 phis) and costs (grep for "TTI.getUserCost(I, TargetTransformInfo::TCK_SizeAndLatency)").
If we can rely more on the cost model and less on the hard limits, that seems like a win.
We will never get it completely right for everyone though because the cost modeling in IR will never be accurate enough (profile data should help, but even that may not capture enough info to get the ideal CFG for all targets).

Are we trying to prevent simplifycfg from making transformations, or to instead trick it into transforming the patteren?
FWIW, FoldTwoEntryPHINode is fundamentally broken. I keep meaning to post a rewrite, but well...

Are we trying to prevent simplifycfg from making transformations, or to instead trick it into transforming the patteren?

Trying to prevent simplifycfg from making branch -> select transformation. Sometimes, the cost to calculate both true operand and false operand of a select instruction is larger than branching.

lebedev.ri added a comment.EditedJan 14 2022, 2:59 PM

I see. Well, i actually thought about that before. Basic idea is to rely on MCSchedModel's
MispredictPenalty instead of "hardcoded" TwoEntryPHINodeFoldingThreshold. That sounds easy.

I guess the back-of-the-napkin math suggests that (assuming unpredictable branch),

  • in 50% of cases, we pay the average cost of both of the blocks
  • in other 50% of cases, we pay the average cost of both of the blocks, then the misprediction penalty, then again the average cost of both of the blocks

So roughly the actual profitability check should be

cost(bb0) + cost(bb1) + cost(selects) <= ((1*((cost(bb0) + cost(bb1))/2) + 0*MispredictPenalty) + (2*((cost(bb0) + cost(bb1))/2) + MispredictPenalty))/2
cost(bb0) + cost(bb1) + cost(selects) <= 0.75*(cost(bb0) + cost(bb1)) + 0.5*MispredictPenalty
(cost(bb0) + cost(bb1) + 4*cost(selects))/2 <= MispredictPenalty

I think the main problem is that we'd effectively need a MCA-like thingy
for IR instructions, because it is not generally correct to consider that the latency of two instructions
is a sum of their latencies, because they could be independent and could have been executed in parallel.

Basic idea is to rely on MCSchedModel's MispredictPenalty instead of "hardcoded" TwoEntryPHINodeFoldingThreshold.

Is it appropriate to introduce MCSchedModel in simplifycfg?

because it is not generally correct to consider that the latency of two instructions is a sum of their latencies, because they could be independent and could have been executed in parallel.

Good point.

Another point I have missed in the patch is what if some selects can't be lowered to corresponding target's conditional move finally, like cmove in x86, isel in ppc. In ppc, we don't have legal operation for ISD::Select on float types. This looks breaking the consistency between costmodel in LLVM IR and final assembly. Does this also imply we can't handle branch -> select precisely in simplifycfg?

Basic idea is to rely on MCSchedModel's MispredictPenalty instead of "hardcoded" TwoEntryPHINodeFoldingThreshold.

Is it appropriate to introduce MCSchedModel in simplifycfg?

If it's properly exposed via TTI, should be fine i guess.
I'll try to look into simplifycfg stuff.

because it is not generally correct to consider that the latency of two instructions is a sum of their latencies, because they could be independent and could have been executed in parallel.

Good point.

Another point I have missed in the patch is what if some selects can't be lowered to corresponding target's conditional move finally, like cmove in x86, isel in ppc. In ppc, we don't have legal operation for ISD::Select on float types. This looks breaking the consistency between costmodel in LLVM IR and final assembly. Does this also imply we can't handle branch -> select precisely in simplifycfg?

I would guess TTI::getSelectCost() should have a parameter with type of the operands being selected.

But regardless of what happens in simplifycfg, it sounds like ppc should have it's own version of X86CmovConversion, if it doesn't already.

But regardless of what happens in simplifycfg, it sounds like ppc should have it's own version of X86CmovConversion, if it doesn't already.

There is PPCExpandISel, but it doesn't serve as an optimization pass. It tries to transform isel instruction to branch post-RA for those targets don't have isel support.

lkail abandoned this revision.Apr 5 2022, 2:02 AM

What this patch intended to do is covered by https://reviews.llvm.org/D120230.

Herald added a project: Restricted Project. · View Herald TranscriptApr 5 2022, 2:02 AM