This is an archive of the discontinued LLVM Phabricator instance.

[CodeGenPrepare] Remove load-based heuristic
ClosedPublic

Authored by flyingforyou on Feb 2 2016, 7:27 PM.

Details

Summary

Both the hardware and LLVM have changed since 2012.
Now, the load-based heuristic doesn't show big differences any more on OoO cores.

There are no regressions or improvements on SPEC2000/2006 (Cortex-A57, Core i5).

But there are two big improvements in the test-suite:

Benchmark Name                                             Exe time Opt/Ori
MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl    87.18%
MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt    83.82%

And I didn't see any notable regressions in the test-suite.

Diff Detail

Event Timeline

flyingforyou retitled this revision from to [CodeGenPrepare] Don't transform select instructions into branches when both of operands are cheap.
flyingforyou updated this object.
flyingforyou added reviewers: spatel, jmolloy.
flyingforyou added a subscriber: llvm-commits.
hfinkel added a subscriber: hfinkel.Feb 2 2016, 8:19 PM
hfinkel added inline comments.
lib/CodeGen/CodeGenPrepare.cpp
4479

I don't understand what's going on here. So what if they fold away? They could be bitcasts, or free extensions/truncations, etc. Wouldn't you need to recurse up the use/def chain to find the non-free operand and make some determination based on that?

flyingforyou added inline comments.Feb 2 2016, 9:37 PM
lib/CodeGen/CodeGenPrepare.cpp
4479

Thanks Hal.

This check is similar to sinkSelectOperand.

  // If either operand of the select is expensive and only needed on one side
  // of the select, we should form a branch.
  if (sinkSelectOperand(TTI, SI->getTrueValue()) ||
      sinkSelectOperand(TTI, SI->getFalseValue()))
    return true;
}

/// Check if V (an operand of a select instruction) is an expensive instruction
/// that is only used once.
static bool sinkSelectOperand(const TargetTransformInfo *TTI, Value *V) {
  auto *I = dyn_cast<Instruction>(V);
  // If it's safe to speculatively execute, then it should not have side
  // effects; therefore, it's safe to sink and possibly *not* execute.
  return I && I->hasOneUse() && isSafeToSpeculativelyExecute(I) &&
         TTI->getUserCost(I) >= TargetTransformInfo::TCC_Expensive;
}

What I want to do is: if the instruction is expected to fold away or is cheap, we don't transform the conditional select into a branch.

The main reason for transforming a select into a branch is to hide the execution cycles of an expensive instruction like division.
For example, a division can take 19~32 cycles to execute (on Cortex-A57). But if branch prediction is mostly correct and we don't choose the result of the division, we can hide the division's extra cycles.

isFormingBranchFromSelectProfitable consists of heuristic checks.

I don't think we need to recurse up the use/def chain to find a non-free operand. We can expect that free extensions/truncations are unlikely to be related to expensive instructions like division. And if we had to check the use/def chain here, we would need to change the sinkSelectOperand function as well.
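
To make the intent concrete, here is a minimal sketch of the kind of check I have in mind (the helper name and exact conditions are illustrative only, not the actual diff):

static bool isSelectOperandCheap(const TargetTransformInfo *TTI, Value *V) {
  // Constants, arguments and PHIs are (to a first approximation) already
  // available in registers, so there is nothing expensive to hide.
  if (isa<Constant>(V) || isa<Argument>(V) || isa<PHINode>(V))
    return true;
  // Instructions that TTI considers free (e.g. a trunc that folds away)
  // are also treated as cheap.
  auto *I = dyn_cast<Instruction>(V);
  return I && TTI->getUserCost(I) == TargetTransformInfo::TCC_Free;
}

// In isFormingBranchFromSelectProfitable (sketch):
//   if (isSelectOperandCheap(TTI, SI->getTrueValue()) &&
//       isSelectOperandCheap(TTI, SI->getFalseValue()))
//     return false; // nothing expensive to hide behind a branch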

hfinkel added inline comments.Feb 2 2016, 9:54 PM
lib/CodeGen/CodeGenPrepare.cpp
4479

I don't think we need to recurse up the use/def chain to find a non-free operand. We can expect that free extensions/truncations are unlikely to be related to expensive instructions like division. And if we had to check the use/def chain here, we would need to change the sinkSelectOperand function as well.

This is similar to the code in sinkSelectOperand, but also different. In sinkSelectOperand, we explicitly check that the operand we've found *is* expensive. In your proposed change, we have the opposite: We know only that the operand is irrelevant to performance (to first approximation). You need to look through that operand to find the IR instruction that is relevant, and then you can make a determination (or you might find a constant, argument, etc.).

Have you looked at the TSVC case? What change is it making that matters?

flyingforyou added a comment.EditedFeb 2 2016, 11:07 PM
for.body4:                                        ; preds = %for.body4, %for.cond1.preheader
  %indvars.iv = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next, %for.body4 ]
  %j.126 = phi i32 [ -1, %for.cond1.preheader ], [ %j.2, %for.body4 ]
  %scevgep = getelementptr double, double* getelementptr inbounds (%struct.GlobalData, %struct.GlobalData* @global_data, i32 0, i32 0, i32 0), i64 %indvars.iv
  %3 = load double, double* %scevgep, align 8, !tbaa !5
  %cmp5 = fcmp olt double %3, 0.000000e+00
  %tmp = trunc i64 %indvars.iv to i32
  %j.2 = select i1 %cmp5, i32 %tmp, i32 %j.126
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 32000, %indvars.iv.next
  br i1 %exitcond, label %for.cond.cleanup3, label %for.body4

In this case, %tmp and %j.126 are the targets for sinking.

This code translates to arm64 as below.
After the patch:

40b684:	8b080e69 	add	x9, x19, x8, lsl #3
40b688:	6d400520 	ldp	d0, d1, [x9]
40b68c:	1e602008 	fcmp	d0, #0.0
40b690:	1a94410a 	csel	w10, w8, w20, mi
40b694:	1e602028 	fcmp	d1, #0.0
40b698:	6d410520 	ldp	d0, d1, [x9,#16]
40b69c:	91000508 	add	x8, x8, #0x1
40b6a0:	1a8a410a 	csel	w10, w8, w10, mi
40b6a4:	1e602008 	fcmp	d0, #0.0
40b6a8:	fd401120 	ldr	d0, [x9,#32]
40b6ac:	1a88554a 	csinc	w10, w10, w8, pl
40b6b0:	1e602028 	fcmp	d1, #0.0
40b6b4:	91000908 	add	x8, x8, #0x2
40b6b8:	1a8a410a 	csel	w10, w8, w10, mi
40b6bc:	1e602008 	fcmp	d0, #0.0
40b6c0:	1a885554 	csinc	w20, w10, w8, pl
40b6c4:	91000908 	add	x8, x8, #0x2
40b6c8:	eb1b011f 	cmp	x8, x27
40b6cc:	54fffdc1 	b.ne	40b684 <s331+0xb4>

Before the patch:

40b684:	fc697a60 	ldr	d0, [x19,x9,lsl #3]
40b688:	2a0903e0 	mov	w0, w9
40b68c:	1e602008 	fcmp	d0, #0.0
40b690:	54000044 	b.mi	40b698 <s331+0xc8>
40b694:	2a1403e0 	mov	w0, w20
40b698:	8b090e68 	add	x8, x19, x9, lsl #3
40b69c:	91000529 	add	x9, x9, #0x1
40b6a0:	2a0903e1 	mov	w1, w9
40b6a4:	fd400500 	ldr	d0, [x8,#8]
40b6a8:	1e602008 	fcmp	d0, #0.0
40b6ac:	54000044 	b.mi	40b6b4 <s331+0xe4>
40b6b0:	2a0003e1 	mov	w1, w0
40b6b4:	fd400900 	ldr	d0, [x8,#16]
40b6b8:	91000529 	add	x9, x9, #0x1
40b6bc:	2a0903e0 	mov	w0, w9
40b6c0:	1e602008 	fcmp	d0, #0.0
40b6c4:	54000044 	b.mi	40b6cc <s331+0xfc>
40b6c8:	2a0103e0 	mov	w0, w1
40b6cc:	fd400d00 	ldr	d0, [x8,#24]
40b6d0:	91000529 	add	x9, x9, #0x1
40b6d4:	2a0903e1 	mov	w1, w9
40b6d8:	1e602008 	fcmp	d0, #0.0
40b6dc:	54000044 	b.mi	40b6e4 <s331+0x114>
40b6e0:	2a0003e1 	mov	w1, w0
40b6e4:	fd401100 	ldr	d0, [x8,#32]
40b6e8:	91000528 	add	x8, x9, #0x1
40b6ec:	2a0803f4 	mov	w20, w8
40b6f0:	1e602008 	fcmp	d0, #0.0
40b6f4:	54000044 	b.mi	40b6fc <s331+0x12c>
40b6f8:	2a0103f4 	mov	w20, w1
40b6fc:	91000509 	add	x9, x8, #0x1
40b700:	eb1b013f 	cmp	x9, x27
40b704:	54fffc01 	b.ne	40b684 <s331+0xb4>

We do the transform just to hide a mov instruction. I don't think that is a proper transformation from a performance standpoint.

But I will consider adding some limitation, as you said, Hal. If you have any suggestions, please let me know.

// Sink expensive instructions into the conditional blocks to avoid executing
// them speculatively.
if (sinkSelectOperand(TTI, SI->getTrueValue())) {
  TrueBlock = BasicBlock::Create(SI->getContext(), "select.true.sink",
                                 EndBlock->getParent(), EndBlock);
  auto *TrueBranch = BranchInst::Create(EndBlock, TrueBlock);
  auto *TrueInst = cast<Instruction>(SI->getTrueValue());
  TrueInst->moveBefore(TrueBranch);
}
if (sinkSelectOperand(TTI, SI->getFalseValue())) {
  FalseBlock = BasicBlock::Create(SI->getContext(), "select.false.sink",
                                  EndBlock->getParent(), EndBlock);
  auto *FalseBranch = BranchInst::Create(EndBlock, FalseBlock);
  auto *FalseInst = cast<Instruction>(SI->getFalseValue());
  FalseInst->moveBefore(FalseBranch);
}

After investigating more, the select optimization only considers a single instruction. So without this patch, we do the transform even if there is no instruction to sink.
I'd like to hear from Sanjay, who made this change, first.

for.body4:                                        ; preds = %for.body4, %for.cond1.preheader
  %indvars.iv = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next, %for.body4 ]
  %j.126 = phi i32 [ -1, %for.cond1.preheader ], [ %j.2, %for.body4 ]
  %scevgep = getelementptr double, double* getelementptr inbounds (%struct.GlobalData, %struct.GlobalData* @global_data, i32 0, i32 0, i32 0), i64 %indvars.iv
  %3 = load double, double* %scevgep, align 8, !tbaa !5
  %cmp5 = fcmp olt double %3, 0.000000e+00
  %tmp = trunc i64 %indvars.iv to i32
  %j.2 = select i1 %cmp5, i32 %tmp, i32 %j.126
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 32000, %indvars.iv.next
  br i1 %exitcond, label %for.cond.cleanup3, label %for.body4

In this case, %tmp and %j.126 are the targets for sinking.

This code translates to arm64 as below.
After the patch:

...

We do the transform just to hide a mov instruction. I don't think that is a proper transformation from a performance standpoint.

This makes perfect sense, thanks! Here's the thing: There are several reasons why the cost of an IR instruction might be "free". One reason is that the instruction represents something we already have (like a PHI or an Argument). Another reason is that it represents something that can be implicitly folded (like a truncation). Constants that can be used directly as operands also fall into this category.

In this loop, the select has two operands. One is a PHI (free because it represents results already in registers), and the other is a truncation. This truncation is free because it can be folded with its user. The truncation, however, is also of a PHI. And that's the important part.

I think that what you want to do here is, for each operand, walk back through "free" CastInsts, and see if you find a PHI or an Argument. These things should already be in registers, and it is not worth branching.
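
A minimal sketch of that walk, assuming illustrative helper names (this is not code from the patch):

// Walk back through casts that TTI considers free to find the value that
// actually produces the operand.
static Value *lookThroughFreeCasts(const TargetTransformInfo *TTI, Value *V) {
  while (auto *CI = dyn_cast<CastInst>(V)) {
    if (TTI->getUserCost(CI) != TargetTransformInfo::TCC_Free)
      break;
    V = CI->getOperand(0);
  }
  return V;
}

// If the underlying value is a PHI, an Argument or a Constant, it should
// already be in a register, so forming a branch to avoid it buys nothing:
//   Value *Base = lookThroughFreeCasts(TTI, SI->getTrueValue());
//   bool AlreadyAvailable =
//       isa<PHINode>(Base) || isa<Argument>(Base) || isa<Constant>(Base);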

spatel edited edge metadata.Feb 3 2016, 8:14 AM
spatel added subscribers: mehdi_amini, bkramer.

After investigating more, the select optimization only considers a single instruction. So without this patch, we do the transform even if there is no instruction to sink.
I'd like to hear from Sanjay, who made this change, first.

Hi Junmo -

You're correct that the select optimization only considers one expensive instruction rather than following a chain. This was a hack to get the bug ( https://llvm.org/bugs/show_bug.cgi?id=24818 ) fixed quickly. Expanding that transform to look at more than one instruction was mentioned as a follow-on here:
http://reviews.llvm.org/D13297#257684

It was my mistake to not include a 'TODO' comment in the code about that.

I agree with Hal's point: the sinking of an expensive operand is not the converse of not sinking a cheap operand.

If I'm understanding this patch, it is bypassing the transform created by http://reviews.llvm.org/rL156234 and could regress whatever test motivated that patch.

After thinking about what that patch is trying to do and considering Mehdi's comments about hampering DAG transforms in D13297, I wonder if the correct solution is to redo r156234 at the machine-level (MachineCombiner pass?). I think that would solve the problem you're seeing. Ie, the problem is that the sinking transform is firing because of the *load* in your test case, but in your benchmarks, the loads are cheap, so we're causing a slowdown by removing selects.
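
For reference, the single-use-load check being discussed has roughly this shape (a paraphrase of the load-feeding-cmp part of isFormingBranchFromSelectProfitable, reconstructed here for context rather than quoted from the patch):

CmpInst *Cmp = dyn_cast<CmpInst>(SI->getCondition());
// If the compare has more than one use, there is probably another cmov or
// setcc around, so forming a branch is not worthwhile.
if (!Cmp || !Cmp->hasOneUse())
  return false;

Value *CmpOp0 = Cmp->getOperand(0);
Value *CmpOp1 = Cmp->getOperand(1);

// Emit "cmov on compare with a memory operand" as a branch to avoid stalling
// on the load, but only if the load is single-use (the compare itself).
if ((isa<LoadInst>(CmpOp0) && CmpOp0->hasOneUse()) ||
    (isa<LoadInst>(CmpOp1) && CmpOp1->hasOneUse()))
  return true;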

zansari added a subscriber: zansari.Feb 3 2016, 1:20 PM
flyingforyou edited edge metadata.

Addressed Hal's comment.

Thanks, Hal. It is great advice.

Sanjay, thank you for sharing the long history of the related commit.

I agree with your concern also. But the first commit was merged in 2012, almost 4 years ago. Recent OoO cores have more logic for avoiding cache misses (something like a HW prefetcher).
And I also don't think your previous patch is a hack. Even if we can avoid the cache-miss penalty, without enough heavy instructions like division it can't get much improvement. (Of course, it depends on the instruction stream after the branch.)

As you said, this approach can bypass the load-cmp heuristic. But it should only affect a very small portion of cases.

I added more limitations to this patch based on Hal's suggestion. Can we try to apply this heuristic as well?

I also get some improvements on a commercial benchmark on both X86 and AArch64.

spatel added a comment.Feb 5 2016, 9:14 AM

Addressed Hal's comment.

I don't think the patch implements what Hal suggested. You're checking if the direct operands of a "free" instruction are not Arg/Phi/Const. The suggestion was to recurse/loop until we are sure that every operand in the chain is free.

Sanjay, thank you for sharing the long history of the related commit.

I agree with your concern also. But the first commit was merged in 2012, almost 4 years ago. Recent OoO cores have more logic for avoiding cache misses (something like a HW prefetcher).

Please correct me if I am not understanding: the fact that HW may have gotten *better* at caching is more reason to remove the heuristic-based load "optimization"; it means that a load is more likely to be cheap.

In any case, I don't think we should be reasoning about target-specific optimizations at this level (IR) without better target hooks to restrict the transform. And in this particular case, I think it is impossible for the compiler to know whether the load is expensive or not. This is why I suggest pushing that transform down to the DAG or machine-level where it could be limited more easily. Maybe this is something that only OoO x86 would like to do?

And I also don't think your previous patch is a hack. Even if we can avoid the cache-miss penalty, without enough heavy instructions like division it can't get much improvement. (Of course, it depends on the instruction stream after the branch.)

I think we're mixing up 2 independent issues now. The load transform and the TCC_Expensive transform have the same goal: don't execute an expensive instruction unnecessarily. But the first is driven by a heuristic (a single use load is expensive) while the second is driven by the fixed and known TTI cost of an instruction such as division.

I also get some improvements on a commercial benchmark on both X86 and AArch64.

I would be very interested to know the perf results of completely removing the heuristic-based load transform. I'm not sure what LLVM policy is for this type of situation. I've cc'd @bkramer in case it is possible to know which app/benchmark was improved by r156234. Clearly, it was driven by an improvement for OoO x86-64.

Addressed Hal's comment.

I don't think the patch implements what Hal suggested. You're checking if the direct operands of a "free" instruction are not Arg/Phi/Const. The suggestion was to recurse/loop until we are sure that every operand in the chain is free.

...

True, although it occurs to me that we actually can't be more sophisticated here than the sinking logic. The sinking logic should also sink the whole chain of free instructions (bitcasts, truncs, etc.) to get to the expensive one (should one exist). But it doesn't.

Also, now I feel like I'm missing something. isFormingBranchFromSelectProfitable should only return true if at least one of the operands is expensive or a load. Here you're checking that both operands are free. In that state, isFormingBranchFromSelectProfitable should always have returned false (unless, perhaps, one of them was a "free" load).

I'd like to echo Sanjay's concern regarding bypassing of the CMP heuristics.

Whether we agree that the CMP heuristics are useful or not, they are there, and the proposed changes preempt it based on the assumption that "if the true and false side of the branch are cheap, then the compare doesn't matter." I think this is fundamentally wrong. Even if both sides of the branch are NOPs, if we have high latency (yes, it's debatable whether a load falls into this category) instructions feeding into the comparison, the new conditional move has to sit and wait until the compare is resolved. With branches, the latency is completely hidden by the prediction (assuming it's correct). I'm not sure about ARM, but this is true for X86, at least.

I think the two options are: (1) If you feel the compare heuristics are wrong, then move/change them, or (2) Move some of these smarts down into the machine level, as Sanjay suggested.

I'd lean more toward (2), personally, since I'd be a little uncomfortable with (1), unless we can come up with some decent heuristics. I'd actually argue that the CMP heuristics need to be beefed up more to have more checks for high latency instructions (and I don't fully agree with the single-use check on the load), and that's always better to do lower down.

I'd like to echo Sanjay's concern regarding bypassing of the CMP heuristics.

Whether we agree that the CMP heuristics are useful or not, they are there, and the proposed changes preempt it based on the assumption that "if the true and false side of the branch are cheap, then the compare doesn't matter." I think this is fundamentally wrong. Even if both sides of the branch are NOPs, if we have high latency (yes, it's debatable whether a load falls into this category) instructions feeding into the comparison, the new conditional move has to sit and wait until the compare is resolved. With branches, the latency is completely hidden by the prediction (assuming it's correct). I'm not sure about ARM, but this is true for X86, at least.

I think the two options are: (1) If you feel the compare heuristics are wrong, then move/change them, or (2) Move some of these smarts down into the machine level, as Sanjay suggested.

I'd lean more toward (2), personally, since I'd be a little uncomfortable with (1), unless we can come up with some decent heuristics. I'd actually argue that the CMP heuristics need to be beefed up more to have more checks for high latency instructions (and I don't fully agree with the single-use check on the load), and that's always better to do lower down.

I think that (2) is the right answer.

Consider the case highlighted: The values were phis (or truncs of phis where the truncation is implicit and thus free). Function arguments are the same (at least those passed in registers). The fact is that the value is probably already in a register, so this conversion is not profitable. However, if this happens to be a spilled-value reload, then it might be a good idea, and we won't know that until quite late. For function arguments, at the IR level, we don't yet know if they'll be in registers or loaded from the stack (and even for the register arguments, we don't know if the register value was spilled).

And furthermore, since we don't (yet) have global instruction selection, breaking apart blocks like this early is almost always a bad idea (because it hampers instruction selection and scheduling, both of which are block-local).

Thanks, Sanjay, Hal, Zia.

I think I couldn't explain what I really wanted to do. What I wanted to do is avoid creating a branch (breaking apart blocks) in cases like the one shown above.

Zia,

"if the true and false side of the branch are cheap, then the compare doesn't matter." I think this is fundamentally wrong.

I agree with your point regarding the CMP heuristics, which aim to hide the load's execution cycles.

Even if both sides of the branch are NOPs, if we have high latency (yes, it's debatable whether a load falls into this category) instructions feeding into the comparison, the new conditional move has to sit and wait until the compare is resolved. With branches, the latency is completely hidden by the prediction (assuming it's correct). I'm not sure about ARM, but this is true for X86, at least.

I think recent OoO cores' pipelines are deep enough to execute other uops that have no dependency on the conditional move.

And I think Sanjay's second suggestion is only about sinking expensive instructions. I also think this is a good idea, but the main problem is that I don't want to create branches.

Hal,

And furthermore, since we don't (yet) have global instruction selection, breaking apart blocks like this early is almost always a bad idea (because it hampers instruction selection and scheduling, both of which are block-local).

I also agree with your opinion. With this patch, we don't break the basic block (create a branch) to hide the select's stall when both of the select's operands are cheap enough.
Even if it is not logically perfect, this patch shows no regressions in the test-suite on a Core i5 (Sandy Bridge).


I saw a little improvement on lua.

Sanjay,

In any case, I don't think we should be reasoning about target-specific optimizations at this level (IR) without better target hooks to restrict the transform. And in this particular case, I think it is impossible for the compiler to know whether the load is expensive or not. This is why I suggest pushing that transform down to the DAG or machine-level where it could be limited more easily. Maybe this is something that only OoO x86 would like to do?

-> in AArch64
// Prefer likely predicted branches to selects on out-of-order cores.
if (Subtarget->isCortexA57())
  PredictableSelectIsExpensive = true;

-> X86
// A predictable cmov does not hurt on an in-order CPU.
// FIXME: Use a CPU attribute to trigger this, not a CPU model.
PredictableSelectIsExpensive = !Subtarget.isAtom();

As you said, most OoO X86 CPUs turn on the flag.

When I ran "llc -mcpu=corei7 -mtriple=x86_64-linux < test/CodeGen/AArch64/arm64-select.ll", I get the result below.
Without this patch:

.Ltmp3:
        .cfi_offset %rbx, -24
.Ltmp4:
        .cfi_offset %r14, -16
        movq    %rsi, %r14
        movq    %rdi, %rbx
        callq   _Z6getvalv
        movss   (%r14), %xmm1           # xmm1 = mem[0],zero,zero,zero
        ucomiss %xmm0, %xmm1
        jbe     .LBB0_2
# BB#1:
        addq    $4, %rbx
        jmp     .LBB0_3
.LBB0_2:                                # %select.false
        addq    $8, %rbx
.LBB0_3:                                # %select.end
        movl    (%rbx), %eax
        addq    $8, %rsp
        popq    %rbx
        popq    %r14
        retq
.Lfunc_end0:
        .size   test, .Lfunc_end0-test
        .cfi_endproc

With this patch:

.Ltmp3:
        .cfi_offset %rbx, -24
.Ltmp4:
        .cfi_offset %r14, -16
        movq    %rsi, %r14
        movq    %rdi, %rbx
        callq   _Z6getvalv
        movss   (%r14), %xmm1           # xmm1 = mem[0],zero,zero,zero
        leaq    4(%rbx), %rax
        addq    $8, %rbx
        ucomiss %xmm0, %xmm1
        cmovaq  %rax, %rbx
        movl    (%rbx), %eax
        addq    $8, %rsp
        popq    %rbx
        popq    %r14
        retq
.Lfunc_end0:
        .size   test, .Lfunc_end0-test
        .cfi_endproc

I am not sure that creating a branch is better than using cmov.

I would be very interested to know the perf results of completely removing the heuristic-based load transform. I'm not sure what LLVM policy is for this type of situation.

Thanks for suggesting this. I will run several benchmarks and share the data soon.

Anyway, if we can't modify the CMP heuristic, how can we modify EarlyIfConversion's logic below?

bool EarlyIfConverter::shouldConvertIf() {
....
  // Look at all the tail phis, and compute the critical path extension caused
  // by inserting select instructions.
  MachineTraceMetrics::Trace TailTrace = MinInstr->getTrace(IfConv.Tail);
  for (unsigned i = 0, e = IfConv.PHIs.size(); i != e; ++i) {
    SSAIfConv::PHIInfo &PI = IfConv.PHIs[i];
    unsigned Slack = TailTrace.getInstrSlack(PI.PHI);
    unsigned MaxDepth = Slack + TailTrace.getInstrCycles(PI.PHI).Depth;
    DEBUG(dbgs() << "Slack " << Slack << ":\t" << *PI.PHI);

    // The condition is pulled into the critical path.
    unsigned CondDepth = adjCycles(BranchDepth, PI.CondCycles);
    if (CondDepth > MaxDepth) {
      unsigned Extra = CondDepth - MaxDepth;
      DEBUG(dbgs() << "Condition adds " << Extra << " cycles.\n");
      if (Extra > CritLimit) {
        DEBUG(dbgs() << "Exceeds limit of " << CritLimit << '\n');
        return false;
      }
    }

    // The TBB value is pulled into the critical path.
    unsigned TDepth = adjCycles(TBBTrace.getPHIDepth(PI.PHI), PI.TCycles);
    if (TDepth > MaxDepth) {
      unsigned Extra = TDepth - MaxDepth;
      DEBUG(dbgs() << "TBB data adds " << Extra << " cycles.\n");
      if (Extra > CritLimit) {
        DEBUG(dbgs() << "Exceeds limit of " << CritLimit << '\n');
        return false;
      }
    }

    // The FBB value is pulled into the critical path.
    unsigned FDepth = adjCycles(FBBTrace.getPHIDepth(PI.PHI), PI.FCycles);
    if (FDepth > MaxDepth) {
      unsigned Extra = FDepth - MaxDepth;
      DEBUG(dbgs() << "FBB data adds " << Extra << " cycles.\n");
      if (Extra > CritLimit) {
        DEBUG(dbgs() << "Exceeds limit of " << CritLimit << '\n');
        return false;
      }
    }
  }
  return true;
}

This is the opposite transformation of "select to branch".

-Junmo.

I would be very interested to know the perf results of completely removing the heuristic-based load transform. I'm not sure what LLVM policy is for this type of situation. I've cc'd @bkramer in case it is possible to know which app/benchmark was improved by r156234. Clearly, it was driven by an improvement for OoO x86-64.

Back then this had significant effects on the LLVM test suite on 2012-class x86_64 hardware. The heuristic is extremely dumb though because LLVM was/is lacking infrastructure to do it properly.

  1. Introducing branches at the SelectionDAG level is extremely messy.
  2. No profile info available (not even heuristically) on selects.
  3. No info on the cost of an instruction. This is somewhat solved by TTI but I'm not sure if it handles loads.

If I were to implement it properly I would start by adding branch weights to selects, we currently lose a lot of profile data in SimplifyCFG. Then we can drive the transform by transforming biased selects into branches first, maybe with a TTI-based version of the current heuristic as a fall back if no info is there.

If you don't see any regressions from disabling the load-based heuristic feel free to throw it out. Both the hardware and LLVM have changed since 2012 so maybe the problems I was papering over with the load-based heuristic aren't there anymore.
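
A rough sketch of what driving the transform from biased selects could look like (the metadata handling and the 90% threshold here are illustrative assumptions, not an existing implementation):

// Treat a select as a branch candidate only if its branch_weights metadata
// shows one side is taken the vast majority of the time.
static bool isHeavilyBiasedSelect(const SelectInst *SI) {
  MDNode *Prof = SI->getMetadata(LLVMContext::MD_prof);
  if (!Prof || Prof->getNumOperands() != 3)
    return false; // no usable profile info on this select
  auto *TrueW = mdconst::dyn_extract<ConstantInt>(Prof->getOperand(1));
  auto *FalseW = mdconst::dyn_extract<ConstantInt>(Prof->getOperand(2));
  if (!TrueW || !FalseW)
    return false;
  uint64_t T = TrueW->getZExtValue(), F = FalseW->getZExtValue();
  uint64_t Total = T + F;
  // Illustrative threshold: one side taken at least 90% of the time.
  return Total != 0 && (T * 10 >= Total * 9 || F * 10 >= Total * 9);
}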

I would be very interested to know the perf results of completely removing the heuristic-based load transform. I'm not sure what LLVM policy is for this type of situation. I've cc'd @bkramer in case it is possible to know which app/benchmark was improved by r156234. Clearly, it was driven by an improvement for OoO x86-64.

Back then this had significant effects on the LLVM test suite on 2012-class x86_64 hardware. The heuristic is extremely dumb though because LLVM was/is lacking infrastructure to do it properly.

  1. Introducing branches at the SelectionDAG level is extremely messy.
  2. No profile info available (not even heuristically) on selects.
  3. No info on the cost of an instruction. This is somewhat solved by TTI but I'm not sure if it handles loads.

If I were to implement it properly I would start by adding branch weights to selects, we currently lose a lot of profile data in SimplifyCFG. Then we can drive the transform by transforming biased selects into branches first, maybe with a TTI-based version of the current heuristic as a fall back if no info is there.

If you don't see any regressions from disabling the load-based heuristic feel free to throw it out. Both the hardware and LLVM have changed since 2012 so maybe the problems I was papering over with the load-based heuristic aren't there anymore.

Thank you for the background! I'll do some perf testing on AMD Jaguar (2013-class small core) with the current test-suite and report back.

I filed the branch weight metadata propagation enhancement here:
https://llvm.org/bugs/show_bug.cgi?id=26636

I'm looking into resurrecting the nonlocal dead store elimination in http://reviews.llvm.org/D13363,
but one of the snags is a performance issue related to select-vs-branch choices. That change enables
more selects, which unfortunately incur a 10 to 20% perf drop on some benchmarks in the EEMBC automotive
suite, on various X86 architectures.

One technique for recouping most of these drops is to relax the load-feeding-cmp heuristic in
isFormingBranchFromSelectProfitable, to allow the load to have multiple users, all of which are cmp's.
Yes, this is subject to the same "is this a reasonable heuristic at this point in compilation" questions
as discussed in this review, and a target-specific lowering may be preferable. But such lowering does
not yet exist, which is perhaps why we're seeing continual tuning in this function.
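
A minimal sketch of that relaxation (illustrative only, not the actual change being evaluated):

// Allow the compared load to have multiple users, as long as every user is
// a compare instruction.
static bool isLoadOnlyUsedByCmps(const Value *V) {
  const auto *LI = dyn_cast<LoadInst>(V);
  if (!LI)
    return false;
  for (const User *U : LI->users())
    if (!isa<CmpInst>(U))
      return false;
  return true;
}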

Is it worth pursuing more tuning in isFormingBranchFromSelectProfitable,
or is that likely to be rejected in favor of option 2?

regards,

  • mitch
Gerolf added a subscriber: Gerolf.Feb 17 2016, 5:25 PM

In this case there could be another issue coming into play. Some of the discussions centering around the IR heuristics assume but don’t check that the target is OoO. What if the core is in-order? Or is this a non-issue anymore for x86 even in the embedded space?

Overall I think Sanjay's thought about pushing the optimization/tuning down to the MachineCombiner goes in the right direction. As a local peephole, this kind of tuning always seems to end up giving mixed performance results over time. I have seen this before on Itanium and more recently on ARM64, where for some time a local patch tuning the isFormingBranchFromSelectProfitable heuristics gave some gains on "commercial benchmarks" for a few months before they vanished again because of changes to the instruction mix.

-Gerolf

Hi Gerolf.

Below is what I answered regarding Sanjay's opinion in D17288. I think this could answer your point as well.

We already consider which targets the "select to branch" optimization should apply to.

// Do we have efficient codegen support for this kind of 'selects' ?
if (TLI->isSelectSupported(SelectKind)) {
  // We have efficient codegen support for the select instruction.
  // Check if it is profitable to keep this 'select'.
  if (!TLI->isPredictableSelectExpensive() ||
      !isFormingBranchFromSelectProfitable(TTI, SI))
    return false;
}

We already check through TLI->isPredictableSelectExpensive(). This can be set in X86ISelLowering.cpp.

// A predictable cmov does not hurt on an in-order CPU.
// FIXME: Use a CPU attribute to trigger this, not a CPU model.
PredictableSelectIsExpensive = !Subtarget.isAtom();

If PredictableSelectIsExpensive is not true, we don't do the "select to branch" transform.

And I think we can use getSchedModel().isOutOfOrder() for setting the flag PredictableSelectIsExpensive.

Junmo.

Results after removing the Load-CMP heuristic.
Test Env: CPU: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz

I didn't find any notable regressions.

And I think we can use getSchedModel().isOutOfOrder() for setting the flag PredictableSelectIsExpensive.

Thanks, Junmo. I didn't realize that was possible, but that solves a problem. We shouldn't be using bogus predicates which approximate the necessary condition...the correct predicate is already available!

Index: lib/Target/ARM/ARMISelLowering.cpp
===================================================================
--- lib/Target/ARM/ARMISelLowering.cpp	(revision 261182)
+++ lib/Target/ARM/ARMISelLowering.cpp	(working copy)
@@ -1042,7 +1042,7 @@
   setMinStackArgumentAlignment(4);
 
   // Prefer likely predicted branches to selects on out-of-order cores.
-  PredictableSelectIsExpensive = Subtarget->isLikeA9();
+  PredictableSelectIsExpensive = Subtarget->getSchedModel().isOutOfOrder();

   setMinFunctionAlignment(Subtarget->isThumb() ? 1 : 2);
 }

And I see now that PredictableSelectIsExpensive defaults to 'false', so that means targets must opt-in to this transform. That makes me feel better about doing this transform in general, but it doesn't solve the problem raised by this patch.

The transform is based on a heuristic, and this patch may limit when that heuristic fires, but Zia and Mitch have suggested the opposite: they'd like to do this transform more often for their targets.

If we're going to solve that tension here in CGP, we're going to need another target hook or make PredictableSelectIsExpensive more than a bool value. To solve this at the machine-level, you'd either need to customize the transform per target or add that same target hook to be used from a common codegen pass.

I'm not sure which is the better option. As Mitch pointed out, there's certainly demand to do the transform(s) here because it's easier. And now that I realize we have access to the same SchedModel info that we would use at the machine-level anyway, it seems more reasonable. However, as Hal and Mehdi have noted, breaking up a basic block may harm isel/scheduling.

Here's test-suite raw data after removing the branch-on-load heuristic completely while running on AMD Jaguar 1.5 GHz. I ran each config 3 times. I'm filtering out anything with more than 1% relative std dev in a google spreadsheet while reviewing the numbers.

Like Junmo's SandyBridge experiment summary (it looks like the actual data is encrypted?), I don't see any real regressions.

There was a similar (+14%) win for MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt as Junmo reported when testing with Cortex-A57; however, there was only a slight (+1.5%) win for MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl.

The average perf of the filtered tests is +0.2% after removing the branch-on-load transform.

Thanks Sanjay.

As Mitch pointed out, there's certainly demand to do the transform(s) here because it's easier. And now that I realize we have access to the same SchedModel info that we would use at the machine-level anyway, it seems more reasonable. However, as Hal and Mehdi have noted, breaking up a basic block may harm isel/scheduling.

I totally agree with this. I will change this patch soon.

Thanks for pointing out ARMISelLowering.cpp.

flyingforyou retitled this revision from [CodeGenPrepare] Don't transform select instructions into branches when both of operands are cheap to [CodeGenPrepare] Remove load-based heuristic.
flyingforyou updated this object.

Addressed Sanjay's suggestion.

Thanks.

Junmo - your patch will cause these existing regression tests to fail:

LLVM :: CodeGen/AArch64/a57-csel.ll
LLVM :: CodeGen/X86/cmov-into-branch.ll
LLVM :: Transforms/CodeGenPrepare/X86/select.ll

Please update the check-lines in those tests and add those diffs to this patch. The existing tests cover the functionality that you are proposing to change, so I don't think there is a need to add a new test file.

We've covered a lot of ground in this review, so let me try to summarize:

  1. http://reviews.llvm.org/rL156234 - this added a TLI (isFormingBranchFromSelectProfitable) and heuristic (single-use load) based transform that would convert a select into a branch. It improved test-suite and possibly other benchmarks running on x86 OoO (likely SandyBridge at the time).
  2. Benjamin noted some of the potential improvements in the commit message and again earlier in this review. I filed https://llvm.org/bugs/show_bug.cgi?id=26636 to track one of those suggestions.
  3. Junmo proposed limiting the transform based on "free" operands because this improved 2 tests in test-suite for ARM.
  4. I was concerned that we didn't have the right target information at this level, so we might harm in-order cores or GPUs, but that was wrong. I fixed the x86 predicate in http://reviews.llvm.org/rL261275 . The ARM version is a bigger mess (it won't be NFC), so I'll let someone who knows that backend better make the identical change for that target.
  5. Current test-suite data from SandyBridge, Jaguar, and Cortex-A57 show an improvement from *not* using the select-to-branch transform with loads.
  6. Therefore, this patch has been updated to effectively revert r156234.
  7. There is concern that this will hurt some cases in EEMBC or other benchmarks running on x86.
  8. There is general agreement that if this patch is committed, but we want to resurrect/improve the select-to-branch transform, it should be done after isel. This is because adding branches to IR limits isel/scheduling effectiveness because those are currently limited to basic block scope.
  9. If we had global isel, that limitation would be removed, so we would probably recreate the select-to-branch transform here in CGP again because it's easier to work on IR than MI.

Did I miss anything? :)
I would like to hear from Zia and Mitch before proceeding with this patch because of #7. It would be great to have more perf data. But like I said earlier in the thread, I don't know what our policy is in a situation where a heuristic has lost value for test-suite but still has benefits for other cases.

Addressed Sanjay's comment.

Thanks, Sanjay, for the great summary!

Hi, Mitch.

I'm looking into resurrecting the nonlocal dead store elimination in http://reviews.llvm.org/D13363,
but one of the snags is a performance issue related to select-vs-branch choices. That change enables
more selects, which unfortunately incur a 10 to 20% perf drop on some benchmarks in the EEMBC automotive
suite, on various X86 architectures.

This patch will make the select-to-branch transform happen less often. But I uploaded another patch (D17288) to apply the select-to-branch optimization in obvious cases.

If you see any regressions from this patch (D16836), could you let us know, please?

Junmo.

I would like to hear from Zia and Mitch before proceeding with this patch because of #7. It would be great to have more perf data. But like I said earlier in the thread, I don't know what our policy is in a situation where a heuristic has lost value for test-suite but still has benefits for other cases.

Just back from a break, so I apologize for getting to the party late. Looks like we see some minor perf drops on our side from this, but I think I'm willing to deal with that through plans of beefing up our heuristics after isel. At a high level, the changes do look like they have potential to introduce cases that may cause regressions, but I don't think that the existing heuristic made much sense to begin with... So, lgtm.

zansari accepted this revision.Feb 24 2016, 3:47 PM
zansari added a reviewer: zansari.
This revision is now accepted and ready to land.Feb 24 2016, 3:47 PM
This revision was automatically updated to reflect the committed changes.

Thanks, all.
It was a great time for me.

Especially, Sanjay, I really appreciate your review & comments with great summary.

Committed in r261809.

Junmo.

tkn added a subscriber: tkn.Mar 21 2016, 7:55 AM