
[CodeGenPrepare] Do select to branch transform when cmp's operand is expensive.
Abandoned · Public

Authored by flyingforyou on Feb 16 2016, 2:05 AM.

Details

Reviewers
spatel
Summary

Emit "cmov on compare with an expensive operand" as a branch to avoid stalls while executing expensive instructions such as division.
This patch shows some improvements on a commercial benchmark.

There are no notable regressions on Core i5 or Cortex-A57.

Diff Detail

Event Timeline

flyingforyou retitled this revision from to [CodeGenPrepare] Do select to branch transform when cmp's operand is expensive..
flyingforyou updated this object.
flyingforyou added a reviewer: spatel.
spatel edited edge metadata. Feb 16 2016, 9:28 AM
spatel added subscribers: hfinkel, zansari, mehdi_amini and 2 others.

[ cc'ing people who are subscribed to D16836 ]

I thought there was agreement that it is better to handle this at a later stage of compilation. This patch is going to form a branch for any target where the compare operand is expensive, but this hurts SelectionDAG because of its block-level scope.

This isn't the same as sinking a speculatively executed expensive op - in this case, we have to execute the expensive op regardless of the comparison result. Am I not understanding the motivation?

Thanks for the comment, Sanjay.

We discussed the load-cmp heuristic in D16836. This patch applies a similar idea.

The main idea was: if a load takes many cycles due to a cache miss, the cmp's result is only available after the load completes, and in that case transforming the select into a branch can be a big win. But recent OoO cores have large caches (considering L1/L2/L3...) and h/w prefetchers, so cache misses are usually avoided. This is what we discussed in D16836.

But this patch targets the div(expensive instruction)-cmp case. Division is a very expensive instruction that can take roughly 17/30 cycles (float/double case), while a recent core's branch-predictor miss penalty is about 14 cycles (of course, it depends on the uarch).
I think this is a case where we really can hide the expensive instruction's cost.

I realized that the test-case was too simple and didn't show what I really intended. Sorry for the misleading example.

%div = fdiv float %a, %b                              ; --> takes many cycles to execute.
%cmp = fcmp ogt float %div, %b
%sel = select i1 %cmp, float %div, float 8.0          ; --> users of %sel stall on the fdiv.

; After this patch, the select is changed to a branch (the phi below shows
; where the select's value ends up; %entry is the block holding the compare):

%div = fdiv float %a, %b                              ; --> takes many cycles to execute.
%cmp = fcmp ogt float %div, %b
br i1 %cmp, label %select.end, label %select.false    ; --> but execution can continue speculatively via the branch predictor.

select.false:
  br label %select.end

select.end:
  %sel = phi float [ %div, %entry ], [ 8.0, %select.false ]

Even if the branch prediction fails, we may not lose anything, because a float/double fdiv takes more cycles than the branch misprediction penalty.
And if the branch prediction is correct, we can hide the fdiv's execution cycles.

I think this is logically correct, but we need to test more (test-suite, SPEC, commercial benchmarks...).

Junmo.

Thanks for the comment, Sanjay.

We discussed the load-cmp heuristic in D16836. This patch applies a similar idea.

Yes - I can see that this would be a similar heuristic. The question we have is whether the load-cmp heuristic itself can be improved by moving it lower or using more data to make it smarter. Therefore, I don't think we should add another heuristic-based transform here right now.

As an example, I don't think this transform or the load transform would help an in-order CPU of any architecture. But we're still doing the transform for those subtargets.

Even if the branch prediction fails, we may not lose anything, because a float/double fdiv takes more cycles than the branch misprediction penalty.
And if the branch prediction is correct, we can hide the fdiv's execution cycles.

I think this is logically correct, but we need to test more (test-suite, SPEC, commercial benchmarks...).

I think I understand your point, and certainly it looks logically correct. But unless there is some good benchmark evidence to support the heuristic, I don't think we should add it here. If others disagree or if I'm misunderstanding, please let me know.

flyingforyou added a comment (edited). Feb 17 2016, 5:43 PM

Sanjay,

As an example, I don't think this transform or the load transform would help an in-order CPU of any architecture. But we're still doing the transform for those subtargets.

We already check which targets the "select to branch" optimization should apply to.

// Do we have efficient codegen support for this kind of 'selects' ?
if (TLI->isSelectSupported(SelectKind)) {
  // We have efficient codegen support for the select instruction.
  // Check if it is profitable to keep this 'select'.
  if (!TLI->isPredictableSelectExpensive() ||
      !isFormingBranchFromSelectProfitable(TTI, SI))
    return false;
}

This is already checked via TLI->isPredictableSelectExpensive(), which can be set in X86ISelLowering.cpp.

// A predictable cmov does not hurt on an in-order CPU.
// FIXME: Use a CPU attribute to trigger this, not a CPU model.
PredictableSelectIsExpensive = !Subtarget.isAtom();

If PredictableSelectIsExpensive is not set, we don't perform the "select to branch" transform.

And I think we can use getSchedModel().isOutOfOrder() to set the PredictableSelectIsExpensive flag.

Junmo.

flyingforyou updated this object.
flyingforyou edited edge metadata.

Rebase patch & modify test-case.

Test Env: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz

I didn't find any notable regressions on SPEC/test-suite.

Even if the branch prediction fails, we may not lose anything, because a float/double fdiv takes more cycles than the branch misprediction penalty.
And if the branch prediction is correct, we can hide the fdiv's execution cycles.

This may happen to be a correct general statement to make today, but not one that is guaranteed to hold for future architectures. In my opinion, it's just too low level of a heuristic to make here.

In general, I think the heuristics are good for x86 (and perhaps other architectures), and I'd really like to see something like this, but I have to agree with Sanjay with regards to wanting to do this later.

It's also easier to hold on to selects and break them up later than to break them up early and change our minds and hope to reassemble them.

Thanks for the comment, Zia.

This may happen to be a correct general statement to make today, but not one that is guaranteed to hold for future architectures. In my opinion, it's just too low level of a heuristic to make here.

I agree that a future architecture might have a powerful divider that doesn't need this heuristic. But when such architectures arrive, we can easily turn this optimization off by setting PredictableSelectIsExpensive to false, or by passing more information (something like the MCSchedModel) to isFormingBranchFromSelectProfitable to adjust the heuristic.

It's also easier to hold on to selects and break them up later than to break them up early and change our minds and hope to reassemble them.

Yes, you're right. But it's easier to create the branch at this level than later.

Junmo.

spatel added a comment. Mar 1 2016, 7:36 AM

I just had a quick look at the attached perf data. Are there any wins from the change? Is there a motivating example outside of SPEC CPU or test-suite that you are trying to improve?

Modified the patch to take uArch information into account.

Hi Sanjay.
I already explained by e-mail that we saw an improvement on a commercial benchmark.

This may happen to be a correct general statement to make today, but not one that is guaranteed to hold for future architectures.

There is a concern that this heuristic may not apply to future architectures, so I added FdivLatency to SchedMachineModel for reference.
I think fdiv is special on most architectures: it usually isn't pipelined and has very high latency, so this information could be used by other optimizations as well.

Also, this is a very different heuristic from the load-cmp heuristic. Most loads take 4 cycles when the target is in the cache, but a division takes over 18~30 cycles. This heuristic could be helpful on most architectures.

And there is an optimization pass, EarlyIfConversion, which will undo the change when the cost of TBB/FBB makes the branch unprofitable.

So, how about this approach?

There is a concern that this heuristic may not apply to future architectures, so I added FdivLatency to SchedMachineModel for reference.
I think fdiv is special on most architectures: it usually isn't pipelined and has very high latency, so this information could be used by other optimizations as well.

Also, this is a very different heuristic from the load-cmp heuristic. Most loads take 4 cycles when the target is in the cache, but a division takes over 18~30 cycles. This heuristic could be helpful on most architectures.

And there is an optimization pass, EarlyIfConversion, which will undo the change when the cost of TBB/FBB makes the branch unprofitable.

So, how about this approach?

Hi Junmo -
Thanks for following up. You've answered my concerns about limiting the transform based on actual subtarget features, so I think this is a reasonable way to solve the problem. I'm not sure if we want the default behavior to be opt-out though, but that's simple to change.
I'd certainly like to hear from others if they think this is a good approach.

evandro added a comment (edited). Mar 11 2016, 8:59 AM

I don't like the idea of FdivLatency having a fixed value in cycles. Firstly, which FP division does this apply to? Secondly, it fails to consider its throughput. Thirdly, the default value may be convenient to some targets, but is far from universally acceptable. Fourthly, it's not a good practice to have this value in one place and also elsewhere, like in the pipeline model, where the same information is richly described in all its variations.

Rather, I'd be more comfortable with a simple boolean value indicating that divisions (not only FP ones) are expensive, defaulting to false, or a hook into the target to examine the instr and return the actual cycle count or whether it's indeed expensive.

I don't like the idea of FdivLatency having a fixed value in cycles. Firstly, which FP division does this apply to? Secondly, it fails to consider its throughput. Thirdly, the default value may be convenient to some targets, but is far from universally acceptable. Fourthly, it's not a good practice to have this value in one place and also elsewhere, like in the pipeline model, where the same information is richly described in all its variations.

Rather, I'd be more comfortable with a simple boolean value indicating that divisions (not only FP ones) are expensive, defaulting to false, or a hook into the target to examine the instr and return the actual cycle count or whether it's indeed expensive.

These are good points. If we're going to use detailed models, we shouldn't duplicate just a single point from those models. We do have a simplified cost model in TTI for use at this level:
https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/TargetTransformInfoImpl.h#L47

And this is what this patch used originally, but now we have introduced the branch mispredict penalty into the equation too.

By design I think, the TTI cost model is limited in the information it can provide. But it does simulate (quite poorly in some cases) information that exists in detail in the SchedMachineModel. Note that in PR26837:
https://llvm.org/bugs/show_bug.cgi?id=26837
...we've discussed making latency and throughput more explicit in the TTI cost model because that would be helpful for the vectorizer, inliner, unroller, and others. But I think that does raise the question: should the TTI cost model inherit its data from the more detailed SchedMachineModel instead of duplicating it?

flyingforyou added a comment (edited). Mar 14 2016, 3:19 AM

Hi Evandro.

I really appreciate your comment. There is some history behind why I chose this approach. Please refer to the previous patch (D16836).

I don't like the idea of FdivLatency having a fixed value in cycles. Firstly, which FP division does this apply to?

This can easily be handled by changing the name. For now, I only consider single-precision FP division, which has the minimum latency among the division variants.

Secondly, it fails to consider its throughput.

Currently, there is no way to expose its throughput, but we don't need to consider throughput in this heuristic.

Thirdly, the default value may be convenient to some targets, but is far from universally acceptable. Fourthly, it's not a good practice to have this value in one place and also elsewhere, like in the pipeline model, where the same information is richly described in all its variations.

I think these points are not only related to the div-cmp-sel heuristic. We can discuss this issue in PR26837 or on the llvm-dev mailing list.

Rather, I'd be more comfortable with a simple boolean value indicating that divisions (not only FP ones) are expensive, defaulting to false, or a hook into the target to examine the instr and return the actual cycle count or whether it's indeed expensive.

Currently, divisions are already treated as expensive by default, as shown below. Why do you think divisions should be treated as cheap? If there are good reasons, we can also discuss this on llvm-dev.

enum TargetCostConstants {
  TCC_Free = 0,     ///< Expected to fold away in lowering.
  TCC_Basic = 1,    ///< The cost of a typical 'add' instruction.
  TCC_Expensive = 4 ///< The cost of a 'div' instruction on x86.
};

unsigned getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy) {
  switch (Opcode) {
  default:
    // By default, just classify everything as 'basic'.
    return TTI::TCC_Basic;

  case Instruction::FDiv:
  case Instruction::FRem:
  case Instruction::SDiv:
  case Instruction::SRem:
  case Instruction::UDiv:
  case Instruction::URem:
    return TTI::TCC_Expensive;

Rebase patch against latest trunk change & modify comments.

tkn added a subscriber: tkn. Mar 21 2016, 7:55 AM

Kindly ping ...

Evandro, we have been waiting for your reply for over two weeks.
Could you please answer or review our questions?

Junmo.

Hi Gerolf.

I really appreciate your comment.

For the record, could you write your comment on D17288, please?

BRs,
Junmo Park.

Gerolf added a subscriber: Gerolf. Mar 30 2016, 9:12 PM

In response to Sanjay's question: "so I think this is a reasonable way to solve the problem. I'm not sure if we want the default behavior to be opt-out though, but that's simple to change.
I'd certainly like to hear from others if they think this is a good approach."

It seems the major motivation here is the benchmark gain, and no, it won't necessarily be simple to change. This optimization could give a sizable gain on multiple architectures, and once it's in, people will not easily accept a loss when someone attempts to change the default. If we decide to go for it, we should be conscious about it.
I think the machine combiner (in this case it would split the conditional move instructions) would be a more natural place for this optimization. Eventually I see that pass being capable of both latency and throughput estimates for possibly a small increase in compile time. Perhaps a compromise is reasonable: enable a version of the current patch (assuming no regressions) and start the work towards a more general solution. I see related questions coming up in other places; for example, for some DAG combines the machine combiner could come up with faster code sequences.

-Gerolf

Hi Gerolf.

Thanks for your valuable comments and suggestions.

Junmo.

A little more elaboration on the combiner idea:
In its current form the machine combiner only evaluates a few instruction patterns and picks the "best". Generalizing this to regions - in particular, to code regions with if-converted code - would be a necessary step to make better code generation decisions in that case. Whether this is a good or not-so-good idea depends on the architecture/uArchitecture and the compile-time budget. With a select, multiple parameters come into play, and the scheduler is in the best position to evaluate different code sequences: the parameters that must be evaluated include resources (in general more instructions have to execute in parallel in if-converted code), branch predictability, scheduling gains (for example, on architectures w/o control speculation the select might enable it), etc. From the compile-time perspective not all combinations can be tried. So a hierarchical approach - where simple heuristics (e.g. filter branches that are highly predictable) catch most/many cases and the combiner only evaluates some of the "hard" ones - will likely result in the best code quality.

lib/CodeGen/CodeGenPrepare.cpp
4554

Is that really a good heuristic? Even when the divide latency is less than or equal to the branch misprediction penalty, issuing a branch can be the better choice. That depends on the program behavior. I believe the reasoning you are looking for is this: in the presence of a long-latency instruction, assume the dependent branch is well predicted most of the time. Practically, the long latency of the divide covers for the (dynamic) instances when that assumption is wrong.

4561

In the case where both paths consume the long-latency result, a select is still the better choice.

4569

Why? The cmp could feed multiple selects from PHI nodes and still a branch would be preferable.

test/Transforms/CodeGenPrepare/X86/select.ll
145

I find this example misleading. The use of the %div in the select is irrelevant. The only issue is whether or not the branch is predictable.

flyingforyou added inline comments. Mar 31 2016, 8:42 PM
lib/CodeGen/CodeGenPrepare.cpp
4554

Even when the divide latency is less than or equal to the branch misprediction penalty, issuing a branch can be the better choice. That depends on the program behavior.

I also agree with this idea. But what can we do about it in this patch?

Is that really a good heuristic?

If you think this is not good, what heuristic do you recommend?

4561

Why do you think so?

4569

@bkramer or Sanjay, what do you think about this?

test/Transforms/CodeGenPrepare/X86/select.ll
145

Will fix this.

flyingforyou added inline comments. Apr 3 2016, 11:10 PM
lib/CodeGen/CodeGenPrepare.cpp
4554

I believe the reasoning you are looking for is this: in the presence of a long-latency instruction, assume the dependent branch is well predicted most of the time. Practically, the long latency of the divide covers for the (dynamic) instances when that assumption is wrong.

My point is this: when we removed the load-cmp-csel heuristic, the key issue was the load's execution time. That heuristic assumed a load could take many cycles on a cache miss. But recent uArchitectures have large caches, especially those with OoO execution, so cache misses can be avoided in most cases.

div-cmp-csel applies almost the same idea as the cache-miss case above. Most uArchitectures execute floating-point division with high latency, so applying this heuristic can yield a big benefit by hiding the division's execution cycles.

in the presence of a long-latency instruction, assume the dependent branch is well predicted most of the time.

About this: I think branch prediction works well even when the instruction's execution time is small. But if the prediction fails while executing short-latency instructions, something like "add-cmp-branch", the transformation is clearly a loss. So we only try the "div-cmp-branch" case.

Gerolf added inline comments. Apr 5 2016, 12:32 AM
lib/CodeGen/CodeGenPrepare.cpp
4554

When the branch is well predicted, I don't see a reason to generate a csel (except for code size). The crux is that the compiler has to model two unknowns: is there a hot path? And is there a branch misprediction penalty? Profiling helps, but is not always (or, better perhaps, rarely) available. I think a reasonable heuristic, akin to what you are pursuing, is this: conceptually, a csel merges two paths. When the paths are unbalanced, don't generate a csel. The paths are unbalanced when their execution times differ "a lot". For example, if one path consumes a long-latency operation but the other does not, consider the paths unbalanced and don't issue a csel. Or, if you know that on your uArch branches are rarely mispredicted across a wide range of apps, a csel should only be generated when there is a very specific reason for it.

4561

Both paths require the result of the long-latency instruction, so it is at least less likely that your optimization helps.

flyingforyou updated this revision to Diff 52981 (edited). Apr 7 2016, 5:12 PM

Addressed Gerolf's comments.

Gerolf, I still don't understand your comment: "In the case where both paths consume the long-latency result, a select is still the better choice."

Why is choosing a csel better than a branch when both paths consume the long latency?
I think we can hide the long latency by using a branch.

Hi Chad.

// Prefer likely predicted branches to selects on out-of-order cores.
if (Subtarget->isCortexA57() || Subtarget->isKryo())
  PredictableSelectIsExpensive = true;

Recently, you turned the PredictableSelectIsExpensive flag on for Kryo. What do you think about this change?

Hi Chad.

// Prefer likely predicted branches to selects on out-of-order cores.
if (Subtarget->isCortexA57() || Subtarget->isKryo())
  PredictableSelectIsExpensive = true;

Recently, you turned the PredictableSelectIsExpensive flag on for Kryo. What do you think about this change?

Honestly, I was upstreaming an old internal patch and didn't do the analysis for the change.

My understanding is that this flag allows CGP to more aggressively convert selects to a series of branches. This makes a great deal of sense on out-of-order cores with good branch predictors, which Kryo has.

Unfortunately, I haven't been following this review and I don't completely understand the problem you're trying to solve. It seems Sanjay and Gerolf are providing good feedback, so I'm going to defer to their judgement.

mcrosier removed a subscriber: mcrosier. Apr 15 2016, 9:51 AM
spatel resigned from this revision. Sep 26 2017, 3:51 PM
flyingforyou abandoned this revision. Sep 26 2017, 4:00 PM