Emit a "cmov on a compare with an expensive operand" as a branch instead, to avoid stalling on expensive instructions such as division.
This patch shows some improvements on a commercial benchmark.
There are no notable regressions on Core i5 or Cortex-A57.
Differential D17288
[CodeGenPrepare] Do the select-to-branch transform when the cmp's operand is expensive.
flyingforyou on Feb 16 2016, 2:05 AM.
Comment Actions
[ cc'ing people who are subscribed to D16836 ]

I thought there was agreement that it is better to handle this at a later stage of compilation. This patch is going to form a branch for any target where the compare operand is expensive, but this hurts SelectionDAG because of its block-level scope. This isn't the same as sinking a speculatively executed expensive op - in this case, we have to execute the expensive op regardless of the comparison result. Am I not understanding the motivation?

Comment Actions
Thanks for the comment, Sanjay. We discussed the load-cmp heuristic in D16836, and this is similar to it. The main idea there was: if a load takes many cycles due to a cache miss, the cmp's result is only available after the load completes, so transforming the select into a branch can be a big win. But recent out-of-order cores have large caches (L1/L2/L3) and hardware prefetchers, so cache misses can often be avoided; that is what we discussed in D16836.

This patch instead considers the div (expensive instruction)-cmp case. Division is a very expensive instruction that can take roughly 17/30 cycles (float/double). A recent core's branch-predictor miss penalty is about 14 cycles (of course, it depends on the microarchitecture).

I realized that the test case was too simple and didn't show what I really wanted. Sorry for the misleading example.

    %div = fdiv float %a, %b                      ; --> takes a long time to execute.
    %cmp = fcmp ogt float %div, %b
    %sel = select i1 %cmp, float %div, float 8.0  ; --> we can't execute further.

After this patch, the select is changed to a branch:

    %div = fdiv float %a, %b                      ; --> takes a long time to execute.
    %cmp = fcmp ogt float %div, %b
    br i1 %cmp, label %select.end, label %select.false  ; --> we can execute further via the branch predictor.

Even if the branch prediction fails, we may not lose anything, because an fdiv on float/double takes more cycles than the branch misprediction penalty. I think this is logically correct, but we need to test more (test-suite, SPEC, commercial benchmarks...).

Junmo.

Comment Actions
Yes - I can see that this would be a similar heuristic. The question we have is whether the load-cmp heuristic itself can be improved by moving it lower or using more data to make it smarter. Therefore, I don't think we should add another heuristic-based transform here right now. As an example, I don't think this transform or the load transform would help an in-order CPU of any architecture. But we're still doing the transform for those subtargets.
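For reference, source code along the following lines produces the fdiv/fcmp/select pattern shown in Junmo's example above; this is an illustrative reconstruction, not the patch's actual test case.

```cpp
// Illustrative source-level equivalent of the IR discussed above: the
// result of the division feeds both the compare and one arm of the
// select, so a cmov would have to wait out the full divide latency.
float selectAfterDivide(float a, float b) {
  float d = a / b;          // expensive: fdiv
  return d > b ? d : 8.0f;  // fcmp + select (or a branch, after the patch)
}
```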
I think I understand your point, and certainly it looks logically correct. But unless there is good benchmark evidence to support the heuristic, I don't think we should add it here. If others disagree or if I'm misunderstanding, please let me know.

Comment Actions
Sanjay,
We already consider which selects the "select to branch" optimization applies to:

    // Do we have efficient codegen support for this kind of 'selects' ?
    if (TLI->isSelectSupported(SelectKind)) {
      // We have efficient codegen support for the select instruction.
      // Check if it is profitable to keep this 'select'.
      if (!TLI->isPredictableSelectExpensive() ||
          !isFormingBranchFromSelectProfitable(TTI, SI))
        return false;
    }

We already check TLI->isPredictableSelectExpensive(), which can be set in X86ISelLowering.cpp:

    // A predictable cmov does not hurt on an in-order CPU.
    // FIXME: Use a CPU attribute to trigger this, not a CPU model.
    PredictableSelectIsExpensive = !Subtarget.isAtom();

If PredictableSelectIsExpensive is not true, we don't do the "select to branch" transform. And I think we can use getSchedModel().isOutOfOrder() for setting the PredictableSelectIsExpensive flag.

Junmo.

Comment Actions
Test Env:
    CPU: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz

I didn't find notable regressions on SPEC/test-suite.

Comment Actions
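As a standalone sketch, the gate quoted from CodeGenPrepare amounts to the following; the function and parameter names here are hypothetical stand-ins for the TLI/TTI queries, not actual LLVM API.

```cpp
// Simplified model of CodeGenPrepare's decision. Hypothetical parameter
// names stand in for TLI->isSelectSupported(), the
// PredictableSelectIsExpensive flag, and
// isFormingBranchFromSelectProfitable().
bool shouldFormBranchFromSelect(bool selectSupported,
                                bool predictableSelectExpensive,
                                bool formingBranchProfitable) {
  if (selectSupported) {
    // Codegen handles this select efficiently; keep it unless selects are
    // considered expensive AND the branch form looks profitable.
    if (!predictableSelectExpensive || !formingBranchProfitable)
      return false;  // keep the select
  }
  return true;  // turn the select into a branch
}
```

The proposal in the comment is that the second input could be derived from getSchedModel().isOutOfOrder() rather than from per-CPU checks.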
This may happen to be a correct general statement to make today, but it is not guaranteed to hold for future architectures. In my opinion, it's just too low-level a heuristic to make here. In general, I think the heuristics are good for x86 (and perhaps other architectures), and I'd really like to see something like this, but I have to agree with Sanjay about wanting to do this later. It's also easier to hold on to selects and break them up later than to break them up early, change our minds, and hope to reassemble them.

Comment Actions
Thanks for the comment, Zia.
I agree with your point that a future architecture might have a powerful divider that doesn't need this heuristic. But when such architectures arrive, we can easily turn this optimization off by setting PredictableSelectIsExpensive to false, or by passing more information (something like MCSchedModel) to isFormingBranchFromSelectProfitable to change the heuristic.
Yes, you're right. But it's easier to form the branch at this level than later.

Junmo.

Comment Actions
I just had a quick look at the attached perf data. Are there any wins from the change? Is there a motivating example outside of SPEC CPU or the test-suite that you are trying to improve?

Comment Actions
Hi Sanjay.
There was a concern that this heuristic would not apply to future architectures, so I added FdivLatency to SchedMachineModel for reference. Also, this heuristic is quite different from the load-cmp heuristic: most loads take about 4 cycles when the target is in cache, but division takes over 18~30 cycles, so this heuristic could be helpful for most architectures. And there is an optimization pass, EarlyIfConversion, which will undo the change when the cost of TBB/FBB makes the branch a bad choice. So, how about this approach?

Comment Actions
Hi Junmo -

Comment Actions
I don't like the idea of FdivLatency having a fixed value in cycles. Firstly, which FP division does this apply to? Secondly, it fails to consider its throughput. Thirdly, the default value may be convenient for some targets, but is far from universally acceptable. Fourthly, it's not good practice to have this value in one place and also elsewhere, like in the pipeline model, where the same information is richly described in all its variations. Rather, I'd be more comfortable with a simple boolean value indicating that divisions (not only FP ones) are expensive, defaulting to false, or a hook into the target to examine the instruction and return the actual cycle count or whether it's indeed expensive.

Comment Actions
These are good points. If we're going to use detailed models, we shouldn't duplicate just a single point from those models. We do have a simplified cost model in TTI for use at this level, and that is what this patch used originally, but now we have introduced the branch mispredict penalty into the equation too. By design, I think, the TTI cost model is limited in the information it can provide. But it does simulate (quite poorly in some cases) information that exists in detail in the SchedMachineModel. Note that in PR26837:

Comment Actions
Hi Evandro. I really appreciate your comment. There is a history of why I chose this approach; please refer to the previous patch (D16836).
This can be easily handled by changing the name. For now, I only consider single-precision FP division, which has the minimum cycle count for division.
Currently, there is no way to express its throughput. But we don't need to consider throughput in this heuristic.
I think these concerns are not only related to the div-cmp-sel heuristic. We can discuss this issue in PR26837 or on the llvm-dev mailing list.
Currently, divisions are already treated as expensive by default, as shown below. Why do you think divisions should be treated as cheap? If there are good reasons, we can also discuss that on llvm-dev.

    enum TargetCostConstants {
      TCC_Free = 0,       ///< Expected to fold away in lowering.
      TCC_Basic = 1,      ///< The cost of a typical 'add' instruction.
      TCC_Expensive = 4   ///< The cost of a 'div' instruction on x86.
    };

    unsigned getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy) {
      switch (Opcode) {
      default:
        // By default, just classify everything as 'basic'.
        return TTI::TCC_Basic;
      case Instruction::FDiv:
      case Instruction::FRem:
      case Instruction::SDiv:
      case Instruction::SRem:
      case Instruction::UDiv:
      case Instruction::URem:
        return TTI::TCC_Expensive;

Comment Actions
Evandro, we are still waiting for your reply after over two weeks.

Junmo.

Comment Actions
Hi Gerolf. I really appreciate your comment. For the record, could you write your comment on D17288, please?

BRs,

Comment Actions
In response to Sanjay's question: "so I think this is a reasonable way to solve the problem. I'm not sure if we want the default behavior to be opt-out though, but that's simple to change."

It seems the major motivation here is the benchmark gain, and no, it won't necessarily be simple to change. This optimization could give a sizable gain on multiple architectures, and once it's in, people will not easily accept a loss when someone attempts to change the default. If we decide to go for it, we should be conscious of that.

-Gerolf

Comment Actions
A little more elaboration on the combiner idea:
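A self-contained rendering of that default classification, mirroring the quoted enum and switch, might look like the following; the OpKind enum here is a hypothetical stand-in for LLVM's Instruction opcodes.

```cpp
// Standalone sketch of TTI's default operation-cost classification.
// OpKind is a stand-in for LLVM's Instruction opcode values.
enum TargetCostConstants {
  TCC_Free = 0,      // Expected to fold away in lowering.
  TCC_Basic = 1,     // The cost of a typical 'add' instruction.
  TCC_Expensive = 4  // The cost of a 'div' instruction on x86.
};

enum class OpKind { Add, Mul, FDiv, FRem, SDiv, SRem, UDiv, URem };

unsigned getOperationCost(OpKind Op) {
  switch (Op) {
  case OpKind::FDiv:
  case OpKind::FRem:
  case OpKind::SDiv:
  case OpKind::SRem:
  case OpKind::UDiv:
  case OpKind::URem:
    return TCC_Expensive;  // all division/remainder ops default to expensive
  default:
    return TCC_Basic;      // everything else is classified as 'basic'
  }
}
```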
Comment Actions
Addressed Gerolf's comments.

Gerolf, I still don't understand what you meant by "In the case both paths consume the long latency, the select is still the better choice." Why is choosing csel better than a branch when both paths consume the long latency?

Comment Actions
Hi Chad.

    // Prefer likely predicted branches to selects on out-of-order cores.
    if (Subtarget->isCortexA57() || Subtarget->isKryo())
      PredictableSelectIsExpensive = true;

Recently, you turned the PredictableSelectIsExpensive flag on for Kryo. What do you think about this change?

Comment Actions
Honestly, I was upstreaming an old internal patch and didn't do the analysis for the change. My understanding is that this flag allows CGP to more aggressively convert selects to a series of branches. This makes a great deal of sense on out-of-order cores with good branch predictors, which Kryo has. Unfortunately, I haven't been following this review and I don't completely understand the problem you're trying to solve. It seems Sanjay and Gerolf are providing good feedback, so I'm going to defer to their judgement.
Is that really a good heuristic? Even when the divide latency is less than or equal to the branch misprediction penalty, issuing a branch can be the better choice; that depends on the program's behavior. I believe the reasoning you are looking for is this: in the presence of a long-latency instruction, assume the dependent branch is well predicted most of the time. Practically, the long latency of the divide covers for the (dynamic) instances when that assumption is wrong.
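To put rough numbers on that reasoning, here is a purely illustrative expected-cost comparison. The 17-cycle fdiv latency and 14-cycle misprediction penalty are the figures quoted earlier in the thread; the assumption that the misprediction penalty overlaps with the in-flight divide is a simplification, not a claim about any particular microarchitecture.

```cpp
// Illustrative model only: extra cycles on the critical path when keeping
// a select (cmov) versus converting it to a branch, given a long-latency
// compare operand such as fdiv.

// With a select, dependent work serializes behind the divide.
double selectCostCycles(double divLatency) {
  return divLatency;
}

// With a branch, correctly predicted executions continue speculatively at
// no extra cost; on a misprediction, the penalty is assumed to overlap
// with the divide that must complete anyway, so the critical path is
// max(divLatency, missPenalty).
double branchCostCycles(double divLatency, double missRate,
                        double missPenalty) {
  double predicted = 0.0;
  double mispredicted = (missPenalty > divLatency) ? missPenalty : divLatency;
  return (1.0 - missRate) * predicted + missRate * mispredicted;
}
```

Under these assumptions, because the 17-cycle divide latency exceeds the 14-cycle penalty, the branch never costs more than the select even at a pessimistic 50% miss rate, which matches Gerolf's "the divide covers for the mispredictions" argument.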