
[CodeGenPrepare] Do select to branch transform when cmp's operand is expensive.
Abandoned · Public

Authored by flyingforyou on Feb 16 2016, 2:05 AM.

Details

Reviewers
spatel
Summary

Emit "cmov on compare with an expensive operand" as a branch to avoid stalls while executing expensive instructions such as division.
This patch shows some improvements on a commercial benchmark.

There are no notable regressions on Core i5 or Cortex-A57.

Diff Detail

Event Timeline

flyingforyou retitled this revision from to [CodeGenPrepare] Do select to branch transform when cmp's operand is expensive..
flyingforyou updated this object.
flyingforyou added a reviewer: spatel.
spatel edited edge metadata. Feb 16 2016, 9:28 AM
spatel added subscribers: hfinkel, zansari, mehdi_amini and 2 others.

[ cc'ing people who are subscribed to D16836 ]

I thought there was agreement that it is better to handle this at a later stage of compilation. This patch is going to form a branch for any target where the compare operand is expensive, but this hurts SelectionDAG because of its block-level scope.

This isn't the same as sinking a speculatively executed expensive op - in this case, we have to execute the expensive op regardless of the comparison result. Am I not understanding the motivation?

Thanks for the comment, Sanjay.

We discussed the load-cmp heuristic in D16836. This patch applies a similar idea.

The main idea was: if a load takes many cycles due to a cache miss, the cmp's result is only available after the load completes, and in that case transforming the select into a branch can be a big win. But recent OoO cores have large caches (considering L1/L2/L3...) and h/w prefetchers, so cache misses are usually avoided. This is what we discussed in D16836.

But this patch targets the div(expensive instruction)-cmp case. Division is a very expensive instruction that can take roughly 17/30 cycles (float/double case), while a recent core's branch-predictor miss penalty is about 14 cycles (of course, it depends on the uarch).
I think this is a case where we really can hide the expensive instruction's cost.

I realized that the test-case was too simple and didn't show what I really intended. Sorry for the misleading example.

%div = fdiv float %a, %b                              ; --> takes many cycles to execute.
%cmp = fcmp ogt float %div, %b
%sel = select i1 %cmp, float %div, float 8.0          ; --> users of %sel stall on the fdiv.

; After this patch, the select is changed to a branch (the phi below shows
; where the select's value ends up; %entry is the block holding the compare):

%div = fdiv float %a, %b                              ; --> takes many cycles to execute.
%cmp = fcmp ogt float %div, %b
br i1 %cmp, label %select.end, label %select.false    ; --> but execution can continue speculatively via the branch predictor.

select.false:
  br label %select.end

select.end:
  %sel = phi float [ %div, %entry ], [ 8.0, %select.false ]

Even if the branch prediction fails, we may not lose anything, because a float/double fdiv takes more cycles than the branch misprediction penalty.
And if the branch prediction is correct, we can hide the fdiv's execution cycles.

I think this is logically correct, but we need to test more (test-suite, SPEC, commercial benchmarks...).

Junmo.

Thanks for the comment, Sanjay.

We discussed the load-cmp heuristic in D16836. This patch applies a similar idea.

Yes - I can see that this would be a similar heuristic. The question we have is whether the load-cmp heuristic itself can be improved by moving it lower or using more data to make it smarter. Therefore, I don't think we should add another heuristic-based transform here right now.

As an example, I don't think this transform or the load transform would help an in-order CPU of any architecture. But we're still doing the transform for those subtargets.

Even if the branch prediction fails, we may not lose anything, because a float/double fdiv takes more cycles than the branch misprediction penalty.
And if the branch prediction is correct, we can hide the fdiv's execution cycles.

I think this is logically correct, but we need to test more (test-suite, SPEC, commercial benchmarks...).

I think I understand your point, and certainly it looks logically correct. But unless there is some good benchmark evidence to support the heuristic, I don't think we should add it here. If others disagree or if I'm misunderstanding, please let me know.

flyingforyou added a comment (edited). Feb 17 2016, 5:43 PM

Sanjay,

As an example, I don't think this transform or the load transform would help an in-order CPU of any architecture. But we're still doing the transform for those subtargets.

We already check which targets the "select to branch" optimization should apply to.

// Do we have efficient codegen support for this kind of 'selects' ?
if (TLI->isSelectSupported(SelectKind)) {
  // We have efficient codegen support for the select instruction.
  // Check if it is profitable to keep this 'select'.
  if (!TLI->isPredictableSelectExpensive() ||
      !isFormingBranchFromSelectProfitable(TTI, SI))
    return false;
}

This is already checked via TLI->isPredictableSelectExpensive(), which can be set in X86ISelLowering.cpp.

// A predictable cmov does not hurt on an in-order CPU.
// FIXME: Use a CPU attribute to trigger this, not a CPU model.
PredictableSelectIsExpensive = !Subtarget.isAtom();

If PredictableSelectIsExpensive is not set, we don't perform the "select to branch" transform.

And I think we can use getSchedModel().isOutOfOrder() to set the PredictableSelectIsExpensive flag.

Junmo.

flyingforyou updated this object.
flyingforyou edited edge metadata.

Rebase patch & modify test-case.

Test Env: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz

I didn't find any notable regressions on SPEC/test-suite.

Even if the branch prediction fails, we may not lose anything, because a float/double fdiv takes more cycles than the branch misprediction penalty.
And if the branch prediction is correct, we can hide the fdiv's execution cycles.

This may happen to be a correct general statement to make today, but not one that is guaranteed to hold for future architectures. In my opinion, it's just too low level of a heuristic to make here.

In general, I think the heuristics are good for x86 (and perhaps other architectures), and I'd really like to see something like this, but I have to agree with Sanjay with regards to wanting to do this later.

It's also easier to hold on to selects and break them up later than to break them up early and change our minds and hope to reassemble them.

Thanks for the comment, Zia.

This may happen to be a correct general statement to make today, but not one that is guaranteed to hold for future architectures. In my opinion, it's just too low level of a heuristic to make here.

I agree that a future architecture might have a powerful divider that doesn't need this heuristic. But when such architectures arrive, we can easily turn this optimization off by setting PredictableSelectIsExpensive to false, or by passing more information (something like the MCSchedModel) to isFormingBranchFromSelectProfitable to adjust the heuristic.

It's also easier to hold on to selects and break them up later than to break them up early and change our minds and hope to reassemble them.

Yes, you're right. But it's easier to create the branch at this level than later.

Junmo.

spatel added a comment. Mar 1 2016, 7:36 AM

I just had a quick look at the attached perf data. Are there any wins from the change? Is there a motivating example outside of SPEC CPU or test-suite that you are trying to improve?

Modified the patch to take uArch information into account.

Hi Sanjay.
I already explained by e-mail that we saw an improvement on a commercial benchmark.

This may happen to be a correct general statement to make today, but not one that is guaranteed to hold for future architectures.

There is a concern that this heuristic may not apply to future architectures, so I added FdivLatency to SchedMachineModel for reference.
I think fdiv is special on most architectures: it usually isn't pipelined and has very high latency, so this information could be used by other optimizations as well.

Also, this is a very different heuristic from the load-cmp heuristic. Most loads take 4 cycles when the target is in the cache, but a division takes over 18~30 cycles. This heuristic could be helpful on most architectures.

And there is an optimization pass, EarlyIfConversion, which will undo the change when the cost of TBB/FBB makes the branch unprofitable.

So, how about this approach?

There is a concern that this heuristic may not apply to future architectures, so I added FdivLatency to SchedMachineModel for reference.
I think fdiv is special on most architectures: it usually isn't pipelined and has very high latency, so this information could be used by other optimizations as well.

Also, this is a very different heuristic from the load-cmp heuristic. Most loads take 4 cycles when the target is in the cache, but a division takes over 18~30 cycles. This heuristic could be helpful on most architectures.

And there is an optimization pass, EarlyIfConversion, which will undo the change when the cost of TBB/FBB makes the branch unprofitable.

So, how about this approach?

Hi Junmo -
Thanks for following up. You've answered my concerns about limiting the transform based on actual subtarget features, so I think this is a reasonable way to solve the problem. I'm not sure if we want the default behavior to be opt-out though, but that's simple to change.
I'd certainly like to hear from others if they think this is a good approach.

evandro added a comment (edited). Mar 11 2016, 8:59 AM

I don't like the idea of FdivLatency having a fixed value in cycles. Firstly, which FP division does this apply to? Secondly, it fails to consider its throughput. Thirdly, the default value may be convenient to some targets, but is far from universally acceptable. Fourthly, it's not a good practice to have this value in one place and also elsewhere, like in the pipeline model, where the same information is richly described in all its variations.

Rather, I'd be more comfortable with a simple boolean value indicating that divisions (not only FP ones) are expensive, defaulting to false, or a hook into the target to examine the instr and return the actual cycle count or whether it's indeed expensive.

I don't like the idea of FdivLatency having a fixed value in cycles. Firstly, which FP division does this apply to? Secondly, it fails to consider its throughput. Thirdly, the default value may be convenient to some targets, but is far from universally acceptable. Fourthly, it's not a good practice to have this value in one place and also elsewhere, like in the pipeline model, where the same information is richly described in all its variations.

Rather, I'd be more comfortable with a simple boolean value indicating that divisions (not only FP ones) are expensive, defaulting to false, or a hook into the target to examine the instr and return the actual cycle count or whether it's indeed expensive.

These are good points. If we're going to use detailed models, we shouldn't duplicate just a single point from those models. We do have a simplified cost model in TTI for use at this level:
https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/TargetTransformInfoImpl.h#L47

And this is what this patch used originally, but now we have introduced the branch mispredict penalty into the equation too.

By design I think, the TTI cost model is limited in the information it can provide. But it does simulate (quite poorly in some cases) information that exists in detail in the SchedMachineModel. Note that in PR26837:
https://llvm.org/bugs/show_bug.cgi?id=26837
...we've discussed making latency and throughput more explicit in the TTI cost model because that would be helpful for the vectorizer, inliner, unroller, and others. But I think that does raise the question: should the TTI cost model inherit its data from the more detailed SchedMachineModel instead of duplicating it?

flyingforyou added a comment (edited). Mar 14 2016, 3:19 AM

Hi Evandro.

I really appreciate your comment. There is some history behind why I chose this approach. Please refer to the previous patch (D16836).

I don't like the idea of FdivLatency having a fixed value in cycles. Firstly, which FP division does this apply to?

This can easily be handled by changing the name. For now, I only consider single-precision FP division, which has the minimum latency among the division variants.

Secondly, it fails to consider its throughput.

Currently, there is no way to expose its throughput, but we don't need to consider throughput in this heuristic.

Thirdly, the default value may be convenient to some targets, but is far from universally acceptable. Fourthly, it's not a good practice to have this value in one place and also elsewhere, like in the pipeline model, where the same information is richly described in all its variations.

I think these points are not only related to the div-cmp-sel heuristic. We can discuss this issue in PR26837 or on the llvm-dev mailing list.

Rather, I'd be more comfortable with a simple boolean value indicating that divisions (not only FP ones) are expensive, defaulting to false, or a hook into the target to examine the instr and return the actual cycle count or whether it's indeed expensive.

Currently, divisions are already treated as expensive by default, as shown below. Why do you think divisions should be treated as cheap? If there are good reasons, we can also discuss this on llvm-dev.

enum TargetCostConstants {
  TCC_Free = 0,     ///< Expected to fold away in lowering.
  TCC_Basic = 1,    ///< The cost of a typical 'add' instruction.
  TCC_Expensive = 4 ///< The cost of a 'div' instruction on x86.
};

unsigned getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy) {
  switch (Opcode) {
  default:
    // By default, just classify everything as 'basic'.
    return TTI::TCC_Basic;

  case Instruction::FDiv:
  case Instruction::FRem:
  case Instruction::SDiv:
  case Instruction::SRem:
  case Instruction::UDiv:
  case Instruction::URem:
    return TTI::TCC_Expensive;

Rebase patch against latest trunk change & modify comments.

tkn added a subscriber: tkn. Mar 21 2016, 7:55 AM

Kindly ping ...

Evandro, we have been waiting for your reply for over two weeks.
Could you please answer or review our questions?

Junmo.

Hi Gerolf.

I really appreciate your comment.

For the record, could you write your comment on D17288, please?

BRs,
Junmo Park.

Gerolf added a subscriber: Gerolf. Mar 30 2016, 9:12 PM

In response to Sanjay's question: "so I think this is a reasonable way to solve the problem. I'm not sure if we want the default behavior to be opt-out though, but that's simple to change.
I'd certainly like to hear from others if they think this is a good approach."

It seems the major motivation here is the benchmark gain, and no, it won't necessarily be simple to change. This optimization could give a sizable gain on multiple architectures, and once it's in, people will not easily accept a loss when someone attempts to change the default. If we decide to go for it, we should be conscious about it.
I think the machine combiner (in this case it would split the conditional move instructions) would be a more natural place for this optimization. Eventually I see that pass being capable of both latency and throughput estimates for possibly a small increase in compile time. Perhaps a compromise is reasonable: enable a version of the current patch (assuming no regressions) and start the work towards a more general solution. I see related questions coming up in other places; for example, for some DAG combines the machine combiner could come up with faster code sequences.

-Gerolf

Hi Gerolf.

Thanks for your valuable comments and suggestions.

Junmo.

A little more elaboration on the combiner idea:
In its current form the machine combiner only evaluates a few instruction patterns and picks the "best". Generalizing this to regions - in particular, to code regions with if-converted code - would be a necessary step to make better code generation decisions in that case. Whether this is a good or not-so-good idea depends on the architecture/uArchitecture and the compile-time budget. With a select, multiple parameters come into play, and the scheduler is in the best position to evaluate different code sequences: the parameters that must be evaluated include resources (in general more instructions have to execute in parallel in if-converted code), branch predictability, scheduling gains (for example, on architectures w/o control speculation the select might enable it), etc. From the compile-time perspective not all combinations can be tried. So a hierarchical approach - where simple heuristics (e.g. filter branches that are highly predictable) catch most/many cases and the combiner only evaluates some of the "hard" ones - will likely result in the best code quality.

lib/CodeGen/CodeGenPrepare.cpp
4554

Is that really a good heuristic? Even when the divide latency is less than or equal to the branch misprediction penalty, issuing a branch can be the better choice. That depends on the program behavior. I believe the reasoning you are looking for is this: in the presence of a long-latency instruction, assume the dependent branch is well predicted most of the time. Practically, the long latency of the divide covers for the (dynamic) instances when that assumption is wrong.

4561

In the case where both paths consume the long-latency result, a select is still the better choice.

4569

Why? The cmp could feed multiple selects from PHI nodes and still a branch would be preferable.

test/Transforms/CodeGenPrepare/X86/select.ll
145

I find this example misleading. The use of the %div in the select is irrelevant. The only issue is whether or not the branch is predictable.

flyingforyou added inline comments. Mar 31 2016, 8:42 PM
lib/CodeGen/CodeGenPrepare.cpp
4554

Even when the divide latency is less than or equal to the branch misprediction penalty, issuing a branch can be the better choice. That depends on the program behavior.

I also agree with this idea. But what can we do about it in this patch?

Is that really a good heuristic?

If you think this is not good, what heuristic do you recommend?

4561

Why do you think so?

4569

@bkramer or Sanjay, what do you think about this?

test/Transforms/CodeGenPrepare/X86/select.ll
145

Will fix this.

flyingforyou added inline comments. Apr 3 2016, 11:10 PM
lib/CodeGen/CodeGenPrepare.cpp
4554

I believe the reasoning you are looking for is this: in the presence of a long-latency instruction, assume the dependent branch is well predicted most of the time. Practically, the long latency of the divide covers for the (dynamic) instances when that assumption is wrong.

My point is this: when we removed the load-cmp-csel heuristic, the key issue was the load's execution time. That heuristic assumed a load could take many cycles on a cache miss. But recent uArchitectures have large caches, especially those with OoO execution, so cache misses can be avoided in most cases.

div-cmp-csel applies almost the same idea as the cache-miss case above. Most uArchitectures execute floating-point division with high latency, so applying this heuristic can yield a big benefit by hiding the division's execution cycles.

in the presence of a long-latency instruction, assume the dependent branch is well predicted most of the time.

About this: I think branch prediction works well even when the instruction's execution time is small. But if the prediction fails while executing short-latency instructions, something like "add-cmp-branch", the transformation is clearly a loss. So we only try the "div-cmp-branch" case.

Gerolf added inline comments. Apr 5 2016, 12:32 AM
lib/CodeGen/CodeGenPrepare.cpp
4554

When the branch is well predicted, I don't see a reason to generate a csel (except for code size). The crux is that the compiler has to model two unknowns: is there a hot path? And is there a branch misprediction penalty? Profiling helps, but is not always (or, better perhaps, rarely) available. I think a reasonable heuristic, akin to what you are pursuing, is this: conceptually, a csel merges two paths. When the paths are unbalanced, don't generate a csel. The paths are unbalanced when their execution times differ "a lot". For example, if one path consumes a long-latency operation but the other does not, consider the paths unbalanced and don't issue a csel. Or, if you know that on your uArch branches are rarely mispredicted across a wide range of apps, a csel should only be generated when there is a very specific reason for it.

4561

Both paths require the result of the long-latency instruction, so it is at least less likely that your optimization helps.

flyingforyou updated this revision to Diff 52981 (edited). Apr 7 2016, 5:12 PM

Addressed Gerolf's comments.

Gerolf, I still don't understand your comment: "In the case where both paths consume the long-latency result, a select is still the better choice."

Why is choosing a csel better than a branch when both paths consume the long latency?
I think we can hide the long latency by using a branch.

Hi Chad.

// Prefer likely predicted branches to selects on out-of-order cores.
if (Subtarget->isCortexA57() || Subtarget->isKryo())
  PredictableSelectIsExpensive = true;

Recently, you turned the PredictableSelectIsExpensive flag on for Kryo. What do you think about this change?

Hi Chad.

// Prefer likely predicted branches to selects on out-of-order cores.
if (Subtarget->isCortexA57() || Subtarget->isKryo())
  PredictableSelectIsExpensive = true;

Recently, you turned the PredictableSelectIsExpensive flag on for Kryo. What do you think about this change?

Honestly, I was upstreaming an old internal patch and didn't do the analysis for the change.

My understanding is that this flag allows CGP to more aggressively convert selects to a series of branches. This makes a great deal of sense on out-of-order cores with good branch predictors, which Kryo has.

Unfortunately, I haven't been following this review and I don't completely understand the problem you're trying to solve. It seems Sanjay and Gerolf are providing good feedback, so I'm going to defer to their judgement.

mcrosier removed a subscriber: mcrosier. Apr 15 2016, 9:51 AM
spatel resigned from this revision. Sep 26 2017, 3:51 PM
flyingforyou abandoned this revision. Sep 26 2017, 4:00 PM