This is an archive of the discontinued LLVM Phabricator instance.


Differential D113872

[CGP] Handle select instructions relying on the same condition
AbandonedPublic

Authored by lkail on Nov 14 2021, 11:45 PM.


Details

Reviewers

spatel
davidxl
bkramer
jsji
apostolakis
shchenz
lebedev.ri

Group Reviewers

Restricted Project

Summary

SimplifyCFGPass transforms certain branches into select instructions, and we sometimes want to undo this transformation in CGP.
Currently, CGP only transforms select i1 %cond into branches if %cond has one use. https://reviews.llvm.org/D24147 relaxes this a bit, but it relies on the branch weights on the select instruction, and SimplifyCFGPass does not appear to maintain these branch weights when it performs the transformation.
Some of our internal workloads do computations like

loop:
...
if (cond) {
  // update max/min values
  // update indexes of max/min values
}

and will be like

loop:
  select cond ...
  select cond ...
  select cond ...
  ...

after SimplifyCFGPass.
If these selects are retained after CGP, the loop redundantly computes the selects' operands, which hurts performance (even if computing an operand is cheap, it sits in the loop and still has a big impact on performance). This patch tries to mitigate this situation.
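To make the pattern concrete, here is a minimal C++ sketch of a loop of this shape. The function name `findMaxAndIndex` and the `MaxResult` struct are hypothetical and are not taken from the internal workload mentioned above; the only point is that both updates are guarded by the same condition, so branch-to-select conversion yields several selects on one condition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: the running maximum and where it was found.
struct MaxResult {
  int value;
  std::size_t index;
};

// Both assignments in the loop body are guarded by the same condition, so
// after branch-to-select conversion each one becomes a select on it.
MaxResult findMaxAndIndex(const std::vector<int> &data) {
  assert(!data.empty() && "need at least one element");
  MaxResult result{data[0], 0};
  for (std::size_t i = 1; i < data.size(); ++i) {
    if (data[i] > result.value) {
      result.value = data[i]; // update max value
      result.index = i;       // update index of max value
    }
  }
  return result;
}
```

Because the comparison is strict, ties keep the earliest index.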
Introduced a new flag, cgp-sink-select-operand-ratio-against-misprediction, to model this situation roughly. Say the cost of computing %t is Cost(%t), the probability of taking %t is P(%t), the misprediction probability is PM, and the misprediction penalty is PENALTY. When sinking a select operand we want Cost(%t) * P(%t) + PM * PENALTY < Cost(%t), i.e. PENALTY < Cost(%t) * (1 - P(%t)) / PM.
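The inequality can be checked numerically with a small standalone sketch. The helper names `isSinkingProfitable` and `maxProfitablePenalty` are hypothetical and are not part of the patch or of LLVM; the inputs are plain numbers, not values queried from a cost model.

```cpp
#include <cassert>

// Hypothetical helper mirroring the inequality in the summary:
//   Cost * P + PM * PENALTY < Cost
// All parameters are illustrative inputs, not values read from LLVM.
bool isSinkingProfitable(double cost, double probTaken, double probMispredict,
                         double penalty) {
  double branchCost = cost * probTaken + probMispredict * penalty;
  return branchCost < cost;
}

// Equivalent rearranged form: sinking pays off only while the penalty stays
// strictly below Cost * (1 - P) / PM.
double maxProfitablePenalty(double cost, double probTaken,
                            double probMispredict) {
  assert(probMispredict > 0.0 && "misprediction probability must be positive");
  return cost * (1.0 - probTaken) / probMispredict;
}
```

For example, with Cost = 4, P = 0.5 and PM = 0.25 the bound is a penalty of 8: a penalty of 4 satisfies the inequality, while a penalty of exactly 8 does not because the comparison is strict.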

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

lkail created this revision. · Nov 14 2021, 11:45 PM

Herald added a subscriber: hiraditya. · Nov 14 2021, 11:45 PM

lkail requested review of this revision. · Nov 14 2021, 11:45 PM

Herald added a project: Restricted Project. · Nov 14 2021, 11:45 PM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B134193: Diff 387161. · Nov 15 2021, 12:26 AM

zhaozhengpeng added a subscriber: zhaozhengpeng. · Nov 15 2021, 1:09 AM

HLJ2009 added a subscriber: HLJ2009. · Nov 15 2021, 4:48 AM

lkail mentioned this in rGc0da8a4e40dc: [CGP][PowerPC] Pre-commit test case for D113872. NFC. · Nov 16 2021, 1:20 AM

lkail updated this revision to Diff 387530. · Nov 16 2021, 2:01 AM

lkail edited the summary of this revision.

lkail added reviewers: Restricted Project, spatel, davidxl, bkramer, jsji.

Herald added a subscriber: nemanjai. · Nov 16 2021, 2:01 AM

Harbormaster completed remote builds in B134457: Diff 387530. · Nov 16 2021, 2:43 AM

+ Sotiris, who is working on general improvements in this area.

apostolakis added inline comments. · Nov 20 2021, 10:19 AM

llvm/lib/CodeGen/CodeGenPrepare.cpp
6578–6582	This is not sound. Checking for just one use was conservative, but it ensured that the instruction can be sunk. If other select instructions use the same operand, the operand can be sunk only if it is always used on the same path (i.e., always as the true operand or always as the false operand). Here is an example:

%x = ...
%y = ...
%a = select %cond, %x, %y
%b = select %cond, %y, %x

If converted to a branch, %x and %y cannot be sunk since they are needed on both paths:

%x = ...
%y = ...
if (%cond)
  %a = %x
  %b = %y
else
  %a = %y
  %b = %x

So, you will need to change the check here to account for such scenarios.
6584	Enabling this option would make conversion to branches very aggressive without enough information to make such a decision. Sure in some cases it might be profitable to convert even selects with cheap operands but surely not always. For example, if the branch mispredicts once in a while, then the misprediction cost could outweigh the cost of computing all the operands of the select. Having this check only for expensive operands raised a bit the bar of how much misprediction would be acceptable.
6627–6628	You mentioned in the description that SimplifyCFGPass sometimes does not maintain branch metadata. Have you seen cases where the metadata is preserved only for a subset of a group of selects? If so, this change is useful. The ideal solution would be to address the issue in SimplifyCFGPass and avoid here the extra compile time and complexity of looking over all the select instructions (and avoid cases where none of the selects have branch info). But for now, at least it is worth adding a comment that explains why you need to check all of them so that this change can be reverted if the SimplifyCFGPass becomes better at maintaining metadata.
6631–6632	This comment needs to be changed if the proposed change below goes through.
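The soundness condition from the first inline comment (an operand may be sunk only if every select in the group uses it on the same side) can be sketched outside of LLVM as follows. `ToySelect` and `isUsedOnOneSideOnly` are hypothetical names for illustration; this toy model identifies values by name and ignores users outside the select group, which a real check would also have to reject.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a select instruction; values are identified by name.
struct ToySelect {
  std::string trueValue;
  std::string falseValue;
};

// Returns true if `operand` is used by the group and appears only as the true
// operand, or only as the false operand, across all selects in the group.
bool isUsedOnOneSideOnly(const std::vector<ToySelect> &group,
                         const std::string &operand) {
  assert(!operand.empty() && "operand name must be non-empty");
  bool usedAsTrue = false;
  bool usedAsFalse = false;
  for (const ToySelect &sel : group) {
    usedAsTrue = usedAsTrue || sel.trueValue == operand;
    usedAsFalse = usedAsFalse || sel.falseValue == operand;
  }
  // Exactly one of the two flags may be set: used, but never on both sides.
  return usedAsTrue != usedAsFalse;
}
```

With the group from the example above, `{x, y}` and `{y, x}`, both `x` and `y` are rejected because each one is needed on both paths.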

Address comments.
Add test of inconsistent use of operand in select group

lkail marked 2 inline comments as done. · Nov 23 2021, 12:30 AM

lkail added inline comments.

llvm/lib/CodeGen/CodeGenPrepare.cpp
6578–6582	Good catch. Added test for it.
6584	Good point! But I'm afraid CGP currently lacks a facility exposing the misprediction cost, which should be defined by MCSchedModel. I happened to find that X86 implements its own `X86CmovConversion.cpp`, which uses more target-specific info to make the decision. The FIXME in `isFormingBranchFromSelectProfitable` also says

// FIXME: This should use the same heuristics as IfConversion to determine
// whether a select is better represented as a branch.

I'm starting to doubt whether it is appropriate to get this task done in CGP, since we don't have accurate target info.
6627–6628	Yes, SimplifyCFG keeps all `!prof` metadata. What I meant in the summary is that SimplifyCFG doesn't maintain the branch probability that BPI calculates statically.

lkail added a reviewer: shchenz. · Nov 23 2021, 12:31 AM

@jsji @shchenz Do you think it is a good idea to implement the isel->branch transformation by extending the current PPCExpandISel, which was mentioned briefly by hal in https://reviews.llvm.org/D34769? That way we would have enough target info to address the concern raised by @apostolakis about trading off the misprediction penalty against the number of instructions executed.

Harbormaster completed remote builds in B135560: Diff 389103. · Nov 23 2021, 2:10 AM

If we make the SimplifyCFG pass keep the statically calculated branch weights (if any) for the select instructions, will our case be optimized as expected in the CGP pass? If so, I think letting the SimplifyCFG pass keep the branch weights for the select instructions would make more sense. I saw there are some functions like setBranchWeights in the SimplifyCFG pass that update the branch weights, but I'm not sure why they do not work for our case.

I am actually in the process of porting some of the ideas of X86CmovConversion.cpp to CGP, mainly the idea of doing a loop-based critical path analysis, and potentially making this x86-specific pass obsolete.
In general, it is quite tricky to get this conversion right and changes can easily lead to regressions. Even with the existing shallow and conservative heuristics in CGP (that rarely convert selects), I find several cases where CGP harms performance by converting a select to a branch. So, I am reconsidering every heuristic.
X86CmovConversion.cpp is a step in the right direction, but it cannot leverage the profile information that is available in CGP. It also aggressively converts conditional moves with load operands, which is often not ideal.
IfConversion passes do not currently convert branches for all architectures (AFAIK not for x86 and AArch64, not sure about PowerPC).

By the way, in CGP you can get an architecture-specific misprediction penalty (you can get a TargetSchedModel using the TargetSubtargetInfo field of CodeGenPrepare). Note though that without any PMU, the compiler cannot really guess the misprediction rate. Therefore, the actual cost of misspeculation is not really computable. In X86CmovConversion.cpp, the misprediction rate is conservatively assumed to be 25%.
That said, a target-specific pass would give you access to other forms of information; for example one could better model the backend pressure caused by the lengthy dependence chains of predicated code.

Having this check only for expensive operands raised a bit the bar of how much misprediction would be acceptable.

I'm not sure I have understood your idea exactly. I've updated the patch; please correct me if I'm wrong.

In D113872#3148705, @shchenz wrote:

If we make the SimplifyCFG pass keep the statically calculated branch weights (if any) for the select instructions, will our case be optimized as expected in the CGP pass? If so, I think letting the SimplifyCFG pass keep the branch weights for the select instructions would make more sense. I saw there are some functions like setBranchWeights in the SimplifyCFG pass that update the branch weights, but I'm not sure why they do not work for our case.

It looks like we can't avoid this if we use the static analysis result from BPI by default when profiling data is missing; however, the static result can't reflect the actual misprediction rate and may cause degradation.

Harbormaster completed remote builds in B135805: Diff 389432. · Nov 24 2021, 4:26 AM

lkail updated this revision to Diff 390238. · Nov 28 2021, 7:18 PM

lkail updated this revision to Diff 390242. · Nov 28 2021, 7:27 PM

Harbormaster completed remote builds in B136365: Diff 390242. · Nov 28 2021, 8:07 PM

Ping.

I was waiting for someone else to take a look as well, since I am new to the review process in LLVM.
Maybe it would help to state whether you have a motivating case that led to these changes, and especially to the introduction of a new flag.

- The updated flag is a bit better than the initial one and points out to users that they need to think about the misprediction penalty when tuning branch vs. select heuristics.
- Looking into all the selects of the group when considering the sinkSelectOperand heuristic is fine with me and in accordance with the logic behind these heuristics.
- Cmp->hasOneUse() allowed only groups with one select to be optimized in CGP. This was a bit of an odd choice. The proposed change lifts this restriction. This is not necessarily a bad idea (the perf impact is unclear). I wouldn’t agree, though, that selects are considered expensive if the Cmp is only used within the select group. I would be surprised if the compare instruction is ever used outside the group.

apostolakis added inline comments. · Dec 7 2021, 10:27 AM

llvm/lib/CodeGen/CodeGenPrepare.cpp
6588–6590	Not certain but wouldn't `TTI->getInstructionCost(I, TargetTransformInfo::TCK_Latency)` return cycle latency of an instruction?

Use TTI->getInstructionCost(I, TargetTransformInfo::TCK_Latency) to estimate cheap instruction's latency and gentle ping.

lkail edited the summary of this revision. · Jan 4 2022, 7:10 PM

lkail edited the summary of this revision. · Jan 4 2022, 7:16 PM

lkail updated this revision to Diff 397447. · Jan 4 2022, 7:25 PM

Harbormaster completed remote builds in B141602: Diff 397447. · Jan 4 2022, 8:09 PM

Gentle ping.

I think this was asked earlier, but I don't see an answer: could we prevent SimplifyCFG from forming these selects in the first place? Is the cost model broken in some way or just not accurate enough?

I don't object to adding/moving logic to CGP, but it's a bit strange that we are doing that while we are trying to move away from SDAG to GISel (CGP was intended to be a temporary pass to overcome the limits of one-block-at-a-time SDAG).

I sympathize with the previous comments about x86's cmov pass. AFAIK, we don't have all of the needed target/profile info in one place, so it always causes complaints.
There used to be a bunch of cmov bugs linked together in bugzilla, and I'm not sure how to find/link those again with github issues. Here's one:
https://github.com/llvm/llvm-project/issues/42246

could we prevent SimplifyCFG from forming these selects in the first place? Is the cost model broken in some way or just not accurate enough?

Sorry, I missed this one. IIUC, FoldTwoEntryPHINode is the place that performs the branch-to-select transformation. FoldTwoEntryPHINode checks

// Okay, we found that we can merge this two-entry phi node into a select.
// Doing so would require us to fold *all* two entry phi nodes in this block.
// At some point this becomes non-profitable (particularly if the target
// doesn't support cmov's).  Only do this transformation if there are two or
// fewer PHI nodes in this block.
unsigned NumPhis = 0;
for (BasicBlock::iterator I = BB->begin(); isa<PHINode>(I); ++NumPhis, ++I)
  if (NumPhis > 2)
    return false;

which does not look accurate enough.
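To see how the quoted counting loop behaves, here is a standalone toy model of it. `passesPhiCountCheck` is a hypothetical name, and a basic block is reduced to a list of flags; this only mirrors the loop as quoted and says nothing about the rest of FoldTwoEntryPHINode.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: a basic block is a sequence of flags, true meaning "this
// instruction is a PHI node". PHIs are assumed to lead the block.
bool passesPhiCountCheck(const std::vector<bool> &isPhi) {
  assert(!isPhi.empty() && "a block has at least a terminator");
  unsigned numPhis = 0;
  // Same shape as the quoted loop: the counter is tested before it is
  // incremented for the current PHI.
  for (std::size_t i = 0; i < isPhi.size() && isPhi[i]; ++numPhis, ++i)
    if (numPhis > 2)
      return false;
  return true;
}
```

In this model the bail-out only triggers on reaching a fourth PHI, since the counter is compared before the increment. That is an observation about the snippet as quoted, and it is a fixed count rather than a cost estimate either way.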

Is it appropriate to improve this heuristic there rather than in CGP, given that SimplifyCFGPass looks like a canonical pass to me and that we lack subtarget info (which includes the mispredict penalty) in SimplifyCFGPass?

cc @lebedev.ri who has also made adjustments on enabling the phi->select transforms in SimplifyCFG.

If we can adjust SimplifyCFG in some way to do the right thing for your motivating case and not break anything else, that would be great. :)
The SimplifyCFG transform is predicated on a mix of heuristics (like the hard limit of 2 phis) and costs (grep for "TTI.getUserCost(I, TargetTransformInfo::TCK_SizeAndLatency)").
If we can rely more on the cost model and less on the hard limits, that seems like a win.
We will never get it completely right for everyone though because the cost modeling in IR will never be accurate enough (profile data should help, but even that may not capture enough info to get the ideal CFG for all targets).

Are we trying to prevent simplifycfg from making transformations, or to instead trick it into transforming the pattern?
FWIW, FoldTwoEntryPHINode is fundamentally broken. I keep meaning to post a rewrite, but well...

Are we trying to prevent simplifycfg from making transformations, or to instead trick it into transforming the pattern?

Trying to prevent simplifycfg from making the branch -> select transformation. Sometimes the cost of calculating both the true operand and the false operand of a select instruction is larger than the cost of branching.

I see. Well, I actually thought about that before. The basic idea is to rely on MCSchedModel's MispredictPenalty instead of the "hardcoded" TwoEntryPHINodeFoldingThreshold. That sounds easy.

I guess the back-of-the-napkin math suggests that (assuming an unpredictable branch):

- in 50% of cases, we pay the average cost of both of the blocks
- in the other 50% of cases, we pay the average cost of both of the blocks, then the misprediction penalty, then again the average cost of both of the blocks

So roughly the actual profitability check should be

cost(bb0) + cost(bb1) + cost(selects) <= ((1*((cost(bb0) + cost(bb1))/2) + 0*MispredictPenalty) + (2*((cost(bb0) + cost(bb1))/2) + MispredictPenalty))/2
cost(bb0) + cost(bb1) + cost(selects) <= 0.75*(cost(bb0) + cost(bb1)) + 0.5*MispredictPenalty
(cost(bb0) + cost(bb1) + 4*cost(selects))/2 <= MispredictPenalty
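The three lines above can be evaluated with a small standalone sketch. The names `isSelectFormProfitable` and `minPenaltyForSelects` are hypothetical; costs are abstract numbers supplied by the caller, and, like the napkin math itself, the model simply sums latencies.

```cpp
#include <cassert>

// Hypothetical helper for the profitability check: compares the cost of the
// select form against the expected cost of the branch form under a 50%
// misprediction assumption. None of these names exist in LLVM.
bool isSelectFormProfitable(double costBB0, double costBB1, double costSelects,
                            double mispredictPenalty) {
  assert(mispredictPenalty >= 0.0 && "penalty must be non-negative");
  double selectForm = costBB0 + costBB1 + costSelects;
  double avgBlock = (costBB0 + costBB1) / 2.0;
  // Half the time: one average block. Other half: two average blocks plus
  // the misprediction penalty.
  double branchForm = (avgBlock + (2.0 * avgBlock + mispredictPenalty)) / 2.0;
  return selectForm <= branchForm;
}

// Rearranged form of the same inequality: the smallest penalty for which the
// select form is at least as cheap as the branch form.
double minPenaltyForSelects(double costBB0, double costBB1,
                            double costSelects) {
  return (costBB0 + costBB1 + 4.0 * costSelects) / 2.0;
}
```

For instance, with cost(bb0) = cost(bb1) = 2 and cost(selects) = 1, the threshold is a penalty of 4: at 4 both forms cost 5, and at 3 the branch form is cheaper.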

I think the main problem is that we'd effectively need a MCA-like thingy
for IR instructions, because it is not generally correct to consider that the latency of two instructions
is a sum of their latencies, because they could be independent and could have been executed in parallel.

Basic idea is to rely on MCSchedModel's MispredictPenalty instead of "hardcoded" TwoEntryPHINodeFoldingThreshold.

Is it appropriate to introduce MCSchedModel in simplifycfg?

because it is not generally correct to consider that the latency of two instructions is a sum of their latencies, because they could be independent and could have been executed in parallel.

Good point.

Another point I missed in the patch: what if some selects can't ultimately be lowered to the target's conditional move, like cmove in x86 or isel in ppc? In ppc, we don't have a legal operation for ISD::Select on float types. This looks like it breaks the consistency between the cost model in LLVM IR and the final assembly. Does this also imply we can't handle branch -> select precisely in simplifycfg?

lkail added a reviewer: lebedev.ri. · Jan 16 2022, 11:13 PM

In D113872#3247529, @lkail wrote:

Basic idea is to rely on MCSchedModel's MispredictPenalty instead of "hardcoded" TwoEntryPHINodeFoldingThreshold.

Is it appropriate to introduce MCSchedModel in simplifycfg?

If it's properly exposed via TTI, it should be fine, I guess.
I'll try to look into simplifycfg stuff.

because it is not generally correct to consider that the latency of two instructions is a sum of their latencies, because they could be independent and could have been executed in parallel.

Good point.

Another point I missed in the patch: what if some selects can't ultimately be lowered to the target's conditional move, like cmove in x86 or isel in ppc? In ppc, we don't have a legal operation for ISD::Select on float types. This looks like it breaks the consistency between the cost model in LLVM IR and the final assembly. Does this also imply we can't handle branch -> select precisely in simplifycfg?

I would guess TTI::getSelectCost() should have a parameter with type of the operands being selected.

But regardless of what happens in simplifycfg, it sounds like ppc should have its own version of X86CmovConversion, if it doesn't already.

But regardless of what happens in simplifycfg, it sounds like ppc should have its own version of X86CmovConversion, if it doesn't already.

There is PPCExpandISel, but it doesn't serve as an optimization pass. It transforms isel instructions into branches post-RA for targets that don't have isel support.

lkail mentioned this in D117861: [SimplifyCFG] Enhance costmodel of FoldTwoEntryPHINode while consider branch misprediction. · Jan 21 2022, 12:54 AM

What this patch intended to do is covered by https://reviews.llvm.org/D120230.

Herald added a project: Restricted Project. · Apr 5 2022, 2:02 AM

Revision Contents

Path	Size
llvm/lib/CodeGen/CodeGenPrepare.cpp	63 lines
llvm/test/CodeGen/PowerPC/cgp-select.ll	86 lines

Diff 389103

llvm/lib/CodeGen/CodeGenPrepare.cpp

[... 137 lines not shown ...]
static cl::opt<bool>		static cl::opt<bool>
DisableGCOpts("disable-cgp-gc-opts", cl::Hidden, cl::init(false),		DisableGCOpts("disable-cgp-gc-opts", cl::Hidden, cl::init(false),
cl::desc("Disable GC optimizations in CodeGenPrepare"));		cl::desc("Disable GC optimizations in CodeGenPrepare"));

static cl::opt<bool> DisableSelectToBranch(		static cl::opt<bool> DisableSelectToBranch(
"disable-cgp-select2branch", cl::Hidden, cl::init(false),		"disable-cgp-select2branch", cl::Hidden, cl::init(false),
cl::desc("Disable select to branch conversion."));		cl::desc("Disable select to branch conversion."));

		static cl::opt<bool>
		SinkCheapSelectOperand("enable-cgp-sink-cheap-select-operand", cl::Hidden,
		cl::init(false),
		cl::desc("Sink select's cheap operand"));

static cl::opt<bool> AddrSinkUsingGEPs(		static cl::opt<bool> AddrSinkUsingGEPs(
"addr-sink-using-gep", cl::Hidden, cl::init(true),		"addr-sink-using-gep", cl::Hidden, cl::init(true),
cl::desc("Address sinking in CGP using GEPs."));		cl::desc("Address sinking in CGP using GEPs."));

static cl::opt<bool> EnableAndCmpSinking(		static cl::opt<bool> EnableAndCmpSinking(
"enable-andcmp-sinking", cl::Hidden, cl::init(true),		"enable-andcmp-sinking", cl::Hidden, cl::init(true),
cl::desc("Enable sinkinig and/cmp into branches."));		cl::desc("Enable sinkinig and/cmp into branches."));

[... 6,405 lines not shown ...]	for (auto *And : AndsToMaybeRemove)
}		}

++NumAndsAdded;		++NumAndsAdded;
return true;		return true;
}		}

/// Check if V (an operand of a select instruction) is an expensive instruction		/// Check if V (an operand of a select instruction) is an expensive instruction
/// that is only used once.		/// that is only used once.
static bool sinkSelectOperand(const TargetTransformInfo *TTI, Value *V) {		static bool sinkSelectOperand(const TargetTransformInfo *TTI,
		ArrayRef<SelectInst *> ASI, Value *V) {
auto *I = dyn_cast<Instruction>(V);		auto *I = dyn_cast<Instruction>(V);
// If it's safe to speculatively execute, then it should not have side		// If it's safe to speculatively execute, then it should not have side
// effects; therefore, it's safe to sink and possibly not execute.		// effects; therefore, it's safe to sink and possibly not execute.
return I && I->hasOneUse() && isSafeToSpeculativelyExecute(I) &&		if (!I || I->user_empty())
		return false;
		SelectInst *SI = dyn_cast<SelectInst>(I->user_back());
		if (!SI)
		return false;
		assert((V == SI->getTrueValue() || V == SI->getFalseValue()) &&
		"V should be used as either true or false operand");
		bool UsedAsTrueOperand = V == SI->getTrueValue();
		return all_of(I->users(),
		[&](User *user) {
		// Check if all users of V in ASI consistently use V as true
		// or false operand.
		SelectInst *U = dyn_cast<SelectInst>(user);
		return U &&
		std::find(ASI.begin(), ASI.end(), U) != ASI.end() &&
		((UsedAsTrueOperand &&
		SI->getTrueValue() == U->getTrueValue()) ||
		(!UsedAsTrueOperand &&
		(SI->getFalseValue() == U->getFalseValue())));
		}) &&
		isSafeToSpeculativelyExecute(I) &&
		(SinkCheapSelectOperand ||
TTI->getUserCost(I, TargetTransformInfo::TCK_SizeAndLatency) >=		TTI->getUserCost(I, TargetTransformInfo::TCK_SizeAndLatency) >=
TargetTransformInfo::TCC_Expensive;		TargetTransformInfo::TCC_Expensive);
}		}

/// Returns true if a SelectInst should be turned into an explicit branch.		/// Returns true if a SelectInst should be turned into an explicit branch.
static bool isFormingBranchFromSelectProfitable(const TargetTransformInfo *TTI,		static bool isFormingBranchFromSelectProfitable(const TargetTransformInfo *TTI,
const TargetLowering *TLI,		const TargetLowering *TLI,
SelectInst *SI) {		ArrayRef<SelectInst *> ASI) {
// If even a predictable select is cheap, then a branch can't be cheaper.		// If even a predictable select is cheap, then a branch can't be cheaper.
if (!TLI->isPredictableSelectExpensive())		if (!TLI->isPredictableSelectExpensive())
return false;		return false;

// FIXME: This should use the same heuristics as IfConversion to determine		// FIXME: This should use the same heuristics as IfConversion to determine
// whether a select is better represented as a branch.		// whether a select is better represented as a branch.

// If metadata tells us that the select condition is obviously predictable,		// If metadata tells us that the select condition is obviously predictable,
// then we want to replace the select with a branch.		// then we want to replace the select with a branch.
		assert(!ASI.empty() && "ASI should have at least one SelectInst.");
		SelectInst *SI = ASI.back();
uint64_t TrueWeight, FalseWeight;		uint64_t TrueWeight, FalseWeight;
if (SI->extractProfMetadata(TrueWeight, FalseWeight)) {		if (SI->extractProfMetadata(TrueWeight, FalseWeight)) {
uint64_t Max = std::max(TrueWeight, FalseWeight);		uint64_t Max = std::max(TrueWeight, FalseWeight);
uint64_t Sum = TrueWeight + FalseWeight;		uint64_t Sum = TrueWeight + FalseWeight;
if (Sum != 0) {		if (Sum != 0) {
auto Probability = BranchProbability::getBranchProbability(Max, Sum);		auto Probability = BranchProbability::getBranchProbability(Max, Sum);
if (Probability > TTI->getPredictableBranchThreshold())		if (Probability > TTI->getPredictableBranchThreshold())
return true;		return true;
}		}
}		}

CmpInst *Cmp = dyn_cast<CmpInst>(SI->getCondition());		CmpInst *Cmp = dyn_cast<CmpInst>(SI->getCondition());

// If a branch is predictable, an out-of-order CPU can avoid blocking on its		// If a branch is predictable, an out-of-order CPU can avoid blocking on its
// comparison condition. If the compare has more than one use, there's		// comparison condition. If the compare's uses are all selects in the same
// probably another cmov or setcc around, so it's not worth emitting a branch.		// basic block, we try to form branch since select is considered expensive at
if (!Cmp || !Cmp->hasOneUse())		// this point.
		if (!Cmp || !all_of(Cmp->uses(), [&](const Use &use) {
		SelectInst *SI = dyn_cast<SelectInst>(use.getUser());
		return SI && std::find(ASI.begin(), ASI.end(), SI) != ASI.end();
		}))
return false;		return false;

// If either operand of the select is expensive and only needed on one side		// If either operand of the select is expensive and only needed on one side
// of the select, we should form a branch.		// of the select, we should form a branch.
if (sinkSelectOperand(TTI, SI->getTrueValue()) ||		for (SelectInst *SI : ASI) {
sinkSelectOperand(TTI, SI->getFalseValue()))		if (sinkSelectOperand(TTI, ASI, SI->getTrueValue()) ||
		sinkSelectOperand(TTI, ASI, SI->getFalseValue()))
return true;		return true;
		}

return false;		return false;
}		}

/// If \p isTrue is true, return the true value of \p SI, otherwise return		/// If \p isTrue is true, return the true value of \p SI, otherwise return
/// false value of \p SI. If the true/false value of \p SI is defined by any		/// false value of \p SI. If the true/false value of \p SI is defined by any
/// select instructions in \p Selects, look through the defining select		/// select instructions in \p Selects, look through the defining select
/// instruction until the true/false value is not defined in \p Selects.		/// instruction until the true/false value is not defined in \p Selects.
[... 115 lines elided ...]	bool CodeGenPrepare::optimizeSelectInst(SelectInst *SI) {
if (VectorCond)		if (VectorCond)
SelectKind = TargetLowering::VectorMaskSelect;		SelectKind = TargetLowering::VectorMaskSelect;
else if (SI->getType()->isVectorTy())		else if (SI->getType()->isVectorTy())
SelectKind = TargetLowering::ScalarCondVectorVal;		SelectKind = TargetLowering::ScalarCondVectorVal;
else		else
SelectKind = TargetLowering::ScalarValSelect;		SelectKind = TargetLowering::ScalarValSelect;

if (TLI->isSelectSupported(SelectKind) &&		if (TLI->isSelectSupported(SelectKind) &&
(!isFormingBranchFromSelectProfitable(TTI, TLI, SI) || OptSize ||		(!isFormingBranchFromSelectProfitable(TTI, TLI, ASI) || OptSize ||
llvm::shouldOptimizeForSize(SI->getParent(), PSI, BFI.get())))		llvm::shouldOptimizeForSize(SI->getParent(), PSI, BFI.get())))
return false;		return false;

// The DominatorTree needs to be rebuilt by any consumers after this		// The DominatorTree needs to be rebuilt by any consumers after this
// transformation. We simply reset here rather than setting the ModifiedDT		// transformation. We simply reset here rather than setting the ModifiedDT
// flag to avoid restarting the function walk in runOnFunction for each		// flag to avoid restarting the function walk in runOnFunction for each
// select optimized.		// select optimized.
DT.reset();		DT.reset();
[... 37 lines elided ...]	bool CodeGenPrepare::optimizeSelectInst(SelectInst *SI) {
BasicBlock *TrueBlock = nullptr;		BasicBlock *TrueBlock = nullptr;
BasicBlock *FalseBlock = nullptr;		BasicBlock *FalseBlock = nullptr;
BranchInst *TrueBranch = nullptr;		BranchInst *TrueBranch = nullptr;
BranchInst *FalseBranch = nullptr;		BranchInst *FalseBranch = nullptr;

// Sink expensive instructions into the conditional blocks to avoid executing		// Sink expensive instructions into the conditional blocks to avoid executing
// them speculatively.		// them speculatively.
for (SelectInst *SI : ASI) {		for (SelectInst *SI : ASI) {
if (sinkSelectOperand(TTI, SI->getTrueValue())) {		if (sinkSelectOperand(TTI, ASI, SI->getTrueValue())) {
if (TrueBlock == nullptr) {		if (TrueBlock == nullptr) {
TrueBlock = BasicBlock::Create(SI->getContext(), "select.true.sink",		TrueBlock = BasicBlock::Create(SI->getContext(), "select.true.sink",
EndBlock->getParent(), EndBlock);		EndBlock->getParent(), EndBlock);
TrueBranch = BranchInst::Create(EndBlock, TrueBlock);		TrueBranch = BranchInst::Create(EndBlock, TrueBlock);
TrueBranch->setDebugLoc(SI->getDebugLoc());		TrueBranch->setDebugLoc(SI->getDebugLoc());
}		}
auto *TrueInst = cast<Instruction>(SI->getTrueValue());		auto *TrueInst = cast<Instruction>(SI->getTrueValue());
TrueInst->moveBefore(TrueBranch);		TrueInst->moveBefore(TrueBranch);
}		}
if (sinkSelectOperand(TTI, SI->getFalseValue())) {		if (sinkSelectOperand(TTI, ASI, SI->getFalseValue())) {
if (FalseBlock == nullptr) {		if (FalseBlock == nullptr) {
FalseBlock = BasicBlock::Create(SI->getContext(), "select.false.sink",		FalseBlock = BasicBlock::Create(SI->getContext(), "select.false.sink",
EndBlock->getParent(), EndBlock);		EndBlock->getParent(), EndBlock);
FalseBranch = BranchInst::Create(EndBlock, FalseBlock);		FalseBranch = BranchInst::Create(EndBlock, FalseBlock);
FalseBranch->setDebugLoc(SI->getDebugLoc());		FalseBranch->setDebugLoc(SI->getDebugLoc());
}		}
auto *FalseInst = cast<Instruction>(SI->getFalseValue());		auto *FalseInst = cast<Instruction>(SI->getFalseValue());
FalseInst->moveBefore(FalseBranch);		FalseInst->moveBefore(FalseBranch);
[... 1,487 lines elided ...]
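
The relaxed use check in isFormingBranchFromSelectProfitable (every use of the compare must be a select from the group being handled) can be illustrated outside LLVM with a minimal sketch. `User` and `allUsersAreSelectsInGroup` are hypothetical stand-ins invented for this illustration; they are not LLVM classes or functions, and the sketch only mirrors the shape of the `all_of` test in the patch.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for an IR user; NOT the real llvm::User.
struct User {
  bool IsSelect; // true if this user is a select instruction
};

// Returns true when every user of the compare is a select that is also a
// member of Group. This mirrors the all_of(Cmp->uses(), ...) test in the
// patch, where Group plays the role of ASI.
bool allUsersAreSelectsInGroup(const std::vector<const User *> &CmpUsers,
                               const std::vector<const User *> &Group) {
  return std::all_of(CmpUsers.begin(), CmpUsers.end(), [&](const User *U) {
    return U->IsSelect &&
           std::find(Group.begin(), Group.end(), U) != Group.end();
  });
}
```

With this shape, a compare used by two selects of the same group passes the check, while a compare that also feeds a branch, or a select outside the group, fails it, and the function returns false before any operand is considered for sinking.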

llvm/test/CodeGen/PowerPC/cgp-select.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -O3 -mcpu=pwr9 -verify-machineinstrs -mtriple=powerpc64le-unknown-unknown < %s | FileCheck %s		; RUN: llc -O3 -mcpu=pwr9 -verify-machineinstrs -mtriple=powerpc64le-unknown-unknown \
		; RUN: -enable-cgp-sink-cheap-select-operand < %s | FileCheck %s

define dso_local void @wibble(float* nocapture readonly %arg, i32 signext %arg1, i32* nocapture %arg2, float* nocapture %arg3) {		define dso_local void @wibble(float* nocapture readonly %arg, i32 signext %arg1, i32* nocapture %arg2, float* nocapture %arg3) {
; CHECK-LABEL: wibble:		; CHECK-LABEL: wibble:
; CHECK: # %bb.0: # %bb		; CHECK: # %bb.0: # %bb
; CHECK-NEXT: lfs 0, 0(3)		; CHECK-NEXT: lfs 0, 0(3)
; CHECK-NEXT: li 7, 7		; CHECK-NEXT: li 7, 7
; CHECK-NEXT: cmpwi 4, 2		; CHECK-NEXT: cmpwi 4, 2
; CHECK-NEXT: xsaddsp 0, 0, 0		; CHECK-NEXT: xsaddsp 0, 0, 0
; CHECK-NEXT: blt 0, .LBB0_5		; CHECK-NEXT: blt 0, .LBB0_5
; CHECK-NEXT: # %bb.1: # %bb6		; CHECK-NEXT: # %bb.1: # %bb6
; CHECK-NEXT: clrldi 4, 4, 32		; CHECK-NEXT: clrldi 4, 4, 32
; CHECK-NEXT: li 7, 7		; CHECK-NEXT: li 7, 7
; CHECK-NEXT: addi 4, 4, -1		; CHECK-NEXT: addi 4, 4, -1
; CHECK-NEXT: mtctr 4		; CHECK-NEXT: mtctr 4
; CHECK-NEXT: li 4, 8		; CHECK-NEXT: li 4, 8
; CHECK-NEXT: b .LBB0_3		; CHECK-NEXT: b .LBB0_3
; CHECK-NEXT: .p2align 5		; CHECK-NEXT: .p2align 5
; CHECK-NEXT: .LBB0_2: # %bb11		; CHECK-NEXT: .LBB0_2: # %select.end
; CHECK-NEXT: #		; CHECK-NEXT: #
; CHECK-NEXT: iselgt 7, 4, 7
; CHECK-NEXT: addi 4, 4, 1		; CHECK-NEXT: addi 4, 4, 1
; CHECK-NEXT: bdz .LBB0_5		; CHECK-NEXT: bdz .LBB0_5
; CHECK-NEXT: .LBB0_3: # %bb11		; CHECK-NEXT: .LBB0_3: # %bb11
; CHECK-NEXT: #		; CHECK-NEXT: #
; CHECK-NEXT: lfsu 1, 4(3)		; CHECK-NEXT: lfsu 1, 4(3)
; CHECK-NEXT: fcmpu 0, 1, 0		; CHECK-NEXT: fcmpu 0, 1, 0
; CHECK-NEXT: ble 0, .LBB0_2		; CHECK-NEXT: bc 4, 1, .LBB0_2
; CHECK-NEXT: # %bb.4:		; CHECK-NEXT: # %bb.4: # %select.true.sink
		; CHECK-NEXT: #
; CHECK-NEXT: xsaddsp 0, 1, 1		; CHECK-NEXT: xsaddsp 0, 1, 1
		; CHECK-NEXT: mr 7, 4
; CHECK-NEXT: b .LBB0_2		; CHECK-NEXT: b .LBB0_2
; CHECK-NEXT: .LBB0_5: # %bb8		; CHECK-NEXT: .LBB0_5: # %bb8
; CHECK-NEXT: stw 7, 0(5)		; CHECK-NEXT: stw 7, 0(5)
; CHECK-NEXT: stfs 0, 0(6)		; CHECK-NEXT: stfs 0, 0(6)
; CHECK-NEXT: blr		; CHECK-NEXT: blr
bb:		bb:
%tmp = load float, float* %arg, align 4		%tmp = load float, float* %arg, align 4
%tmp4 = fmul float %tmp, 2.000000e+00		%tmp4 = fmul float %tmp, 2.000000e+00
[... 22 lines elided ...]	bb11: ; preds = %bb11, %bb6
%tmp19 = select i1 %tmp17, float %tmp18, float %tmp14		%tmp19 = select i1 %tmp17, float %tmp18, float %tmp14
%tmp20 = trunc i64 %tmp12 to i32		%tmp20 = trunc i64 %tmp12 to i32
%tmp21 = add i32 %tmp20, 7		%tmp21 = add i32 %tmp20, 7
%tmp22 = select i1 %tmp17, i32 %tmp21, i32 %tmp13		%tmp22 = select i1 %tmp17, i32 %tmp21, i32 %tmp13
%tmp23 = add nuw nsw i64 %tmp12, 1		%tmp23 = add nuw nsw i64 %tmp12, 1
%tmp24 = icmp eq i64 %tmp23, %tmp7		%tmp24 = icmp eq i64 %tmp23, %tmp7
br i1 %tmp24, label %bb8, label %bb11		br i1 %tmp24, label %bb8, label %bb11
}		}

		define dso_local void @foo(float* nocapture readonly %arg, i32 signext %arg1, i32* nocapture %arg2, float* nocapture %arg3) {
		; CHECK-LABEL: foo:
		; CHECK: # %bb.0: # %bb
		; CHECK-NEXT: lfs 0, 0(3)
		; CHECK-NEXT: li 9, 7
		; CHECK-NEXT: cmpwi 4, 2
		; CHECK-NEXT: xsaddsp 0, 0, 0
		; CHECK-NEXT: blt 0, .LBB1_5
		; CHECK-NEXT: # %bb.1: # %bb6
		; CHECK-NEXT: clrldi 4, 4, 32
		; CHECK-NEXT: li 8, 7
		; CHECK-NEXT: li 7, 1
		; CHECK-NEXT: b .LBB1_3
		; CHECK-NEXT: .p2align 4
		; CHECK-NEXT: .LBB1_2: # %bb11
		; CHECK-NEXT: #
		; CHECK-NEXT: addi 10, 7, 7
		; CHECK-NEXT: addi 7, 7, 1
		; CHECK-NEXT: iselgt 9, 10, 8
		; CHECK-NEXT: iselgt 8, 8, 10
		; CHECK-NEXT: clrldi 8, 8, 32
		; CHECK-NEXT: sub 8, 4, 8
		; CHECK-NEXT: cmpld 8, 7
		; CHECK-NEXT: mr 8, 9
		; CHECK-NEXT: beq 0, .LBB1_5
		; CHECK-NEXT: .LBB1_3: # %bb11
		; CHECK-NEXT: #
		; CHECK-NEXT: lfsu 1, 4(3)
		; CHECK-NEXT: fcmpu 0, 1, 0
		; CHECK-NEXT: ble 0, .LBB1_2
		; CHECK-NEXT: # %bb.4:
		; CHECK-NEXT: xsaddsp 0, 1, 1
		; CHECK-NEXT: b .LBB1_2
		; CHECK-NEXT: .LBB1_5: # %bb8
		; CHECK-NEXT: stw 9, 0(5)
		; CHECK-NEXT: stfs 0, 0(6)
		; CHECK-NEXT: blr
		bb:
		%tmp = load float, float* %arg, align 4
		%tmp4 = fmul float %tmp, 2.000000e+00
		%tmp5 = icmp sgt i32 %arg1, 1
		br i1 %tmp5, label %bb6, label %bb8

		bb6: ; preds = %bb
		%tmp7 = zext i32 %arg1 to i64
		br label %bb11

		bb8: ; preds = %bb11, %bb
		%tmp9 = phi float [ %tmp4, %bb ], [ %tmp19, %bb11 ]
		%tmp10 = phi i32 [ 7, %bb ], [ %tmp22, %bb11 ]
		store i32 %tmp10, i32* %arg2, align 4
		store float %tmp9, float* %arg3, align 4
		ret void

		bb11: ; preds = %bb11, %bb6
		%tmp12 = phi i64 [ 1, %bb6 ], [ %tmp23, %bb11 ]
		%tmp13 = phi i32 [ 7, %bb6 ], [ %tmp22, %bb11 ]
		%tmp14 = phi float [ %tmp4, %bb6 ], [ %tmp19, %bb11 ]
		%tmp15 = getelementptr inbounds float, float* %arg, i64 %tmp12
		%tmp16 = load float, float* %tmp15, align 4
		%tmp17 = fcmp ogt float %tmp16, %tmp14
		%tmp18 = fmul float %tmp16, 2.000000e+00
		%tmp19 = select i1 %tmp17, float %tmp18, float %tmp14
		%tmp20 = trunc i64 %tmp12 to i32
		%tmp21 = add i32 %tmp20, 7
		%tmp22 = select i1 %tmp17, i32 %tmp21, i32 %tmp13
		%tmp23 = add nuw nsw i64 %tmp12, 1
		%tmp24 = select i1 %tmp17, i32 %tmp13, i32 %tmp21
		%tmp24.ext = zext i32 %tmp24 to i64
		%tmp25 = add nuw nsw i64 %tmp23, %tmp24.ext
		%tmp26 = icmp eq i64 %tmp25, %tmp7
		br i1 %tmp26, label %bb8, label %bb11
		}
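
For readers who prefer source over IR, the loop these tests exercise can be approximated by the hand-written C++ below. It is a sketch reconstructed from the fully visible IR of @foo with the extra %tmp24 select and its modified exit condition left out; treating it as the shape of @wibble assumes that the elided lines of @wibble match the corresponding lines of @foo. The function and variable names are illustrative and do not come from the patch.

```cpp
#include <cstdint>

// Running maximum with a matching index value: the "update max, update index
// under the same condition" pattern that SimplifyCFG turns into two selects
// on one compare.
void runningMax(const float *In, int32_t N, int32_t *OutIdx, float *OutMax) {
  float Max = In[0] * 2.0f; // %tmp4
  int32_t Idx = 7;          // initial value of the %tmp13 phi
  for (int64_t I = 1; I < N; ++I) {
    float X = In[I];                     // %tmp16
    if (X > Max) {                       // %tmp17 = fcmp ogt
      Max = X * 2.0f;                    // %tmp18, chosen by select %tmp19
      Idx = static_cast<int32_t>(I) + 7; // %tmp21, chosen by select %tmp22
    }
  }
  *OutIdx = Idx; // store i32 %tmp10
  *OutMax = Max; // store float %tmp9
}
```

Both assignments inside the `if` depend on the same condition, which is why the new CHECK lines for @wibble show the multiply and the index move together in the %select.true.sink block.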