This is an archive of the discontinued LLVM Phabricator instance.

[X86] X86::CMOV to Branch heuristic based optimization
ClosedPublic

Authored by aaboud on Jun 28 2017, 11:40 AM.

Details

Summary

The LLVM compiler recognizes opportunities to transform a branch into IR select instruction(s) - these are later lowered into an X86::CMOV instruction, assuming no other optimization eliminates the SelectInst.
However, it is not always profitable to emit an X86::CMOV instruction. For example, a branch is preferable to an X86::CMOV instruction when:

  • The branch is well predicted
  • The condition operand is expensive compared to the True-value and False-value operands

The CodeGenPrepare pass contains a shallow optimization that tries to convert a SelectInst into a branch, but it is not sufficient.
This patch implements a machine optimization pass that converts X86::CMOV instruction(s) into branches, based on a conservative heuristic.
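
For illustration, a minimal (hypothetical) source pattern of the kind discussed here; the ternary typically becomes an IR select, which the X86 backend lowers to a CMOV:

// The ternary becomes an IR select and then, usually, an X86::CMOV.
// If the condition is well predicted, a branch can be cheaper than
// unconditionally computing both values and paying the CMOV's data
// dependency on the flags.
int clamp(int X, int Limit) { return X > Limit ? Limit : X; }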

Diff Detail

Event Timeline

aaboud created this revision.Jun 28 2017, 11:40 AM

Interestingly, we just recently introduced a pass into the PowerPC backend that does essentially the same thing; lib/Target/PowerPC/PPCExpandISEL.cpp. Could you please take a look at that and see how it compares to this implementation? I'm obviously curious as to whether we could add some target callbacks and unify the underlying logic in a target-independent pass. FWIW, both implementations are around 500 lines.

One thing that's not clear to me: how are you determining whether or not the branch is predictable?

One outstanding TODO item we have from the PPC functionality is to support __builtin_unpredictable (which turns into metadata at the IR level). Have you looked at that?

hfinkel added a subscriber: jtony.Jun 28 2017, 11:54 AM

Interestingly, we just recently introduced a pass into the PowerPC backend that does essentially the same thing; lib/Target/PowerPC/PPCExpandISEL.cpp. Could you please take a look at that and see how it compares to this implementation? I'm obviously curious as to whether we could add some target callbacks and unify the underlying logic in a target-independent pass. FWIW, both implementations are around 500 lines.

Thanks for the information; I will take a look.

One thing that's not clear to me: how are you determining whether or not the branch is predictable?

I did not determine whether it is predictable; instead, I perform the transformation only when I can assure a gain of more than 25% of the misprediction penalty.
On the other hand, during development of this pass I considered trying to detect one case where the branch is not predicted: tree-search-like algorithms.
I did that roughly by looking at how the CMOV result value was used; if it was used for another "pointer-like" load, I considered it a possible tree-search case.
Note that this is not part of this patch, but we might consider adding it if it is found reasonable.
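
For intuition, here is a rough sketch of the 25% profitability gate described above (hypothetical names; the actual pass derives the depths from the target schedule model):

// Convert CMOVs to a branch only when the critical-path depth saved
// exceeds 25% of the misprediction penalty, i.e., the conversion still
// wins even if the branch mispredicts once every four executions.
static bool worthConverting(unsigned DepthWithCmov, unsigned DepthWithBranch,
                            unsigned MispredictPenalty) {
  if (DepthWithBranch >= DepthWithCmov)
    return false;
  return (DepthWithCmov - DepthWithBranch) * 4 > MispredictPenalty;
}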

One outstanding TODO item we have from the PPC functionality is to support __builtin_unpredictable (which turns into metadata at the IR level). Have you looked at that?

I know about this built-in, but as you said, it is not lowered into machine IR. I did not try to figure out how to preserve it, though.

Interestingly, we just recently introduced a pass into the PowerPC backend that does essentially the same thing; lib/Target/PowerPC/PPCExpandISEL.cpp. Could you please take a look at that and see how it compares to this implementation? I'm obviously curious as to whether we could add some target callbacks and unify the underlying logic in a target-independent pass. FWIW, both implementations are around 500 lines.

Thanks for the information; I will take a look.

One thing that's not clear to me: how are you determining whether or not the branch is predictable?

I did not determine whether it is predictable; instead, I perform the transformation only when I can assure a gain of more than 25% of the misprediction penalty.
On the other hand, during development of this pass I considered trying to detect one case where the branch is not predicted: tree-search-like algorithms.
I did that roughly by looking at how the CMOV result value was used; if it was used for another "pointer-like" load, I considered it a possible tree-search case.
Note that this is not part of this patch, but we might consider adding it if it is found reasonable.

Thanks. That heuristic sounds like an interesting idea. Hopefully, we can investigate that in the future.

One outstanding TODO item we have from the PPC functionality is to support __builtin_unpredictable (which turns into metadata at the IR level). Have you looked at that?

I know about this built-in, but as you said, it is not lowered into machine IR. I did not try to figure out how to preserve it, though.

One way of doing this might be to have the branch instructions get the metadata from the IR (there are MI metadata operands, although I think they're only currently used for debug-info pseudo instructions).
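
For reference, a small example of the builtin at the source level; Clang turns it into !unpredictable metadata on the IR branch or select:

// __builtin_unpredictable hints that the condition follows no pattern a
// branch predictor can exploit; as noted above, the hint currently stops
// at the IR level and does not reach machine IR.
int pick(int A, int B, int Mask) {
  if (__builtin_unpredictable(Mask & 1))
    return A;
  return B;
}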

Interestingly, we just recently introduced a pass into the PowerPC backend that does essentially the same thing; lib/Target/PowerPC/PPCExpandISEL.cpp. Could you please take a look at that and see how it compares to this implementation? I'm obviously curious as to whether we could add some target callbacks and unify the underlying logic in a target-independent pass. FWIW, both implementations are around 500 lines.

Thanks for the information; I will take a look.

I reviewed the PowerPC pass in lib/Target/PowerPC/PPCExpandISEL.cpp; it tries to address the same issue as this patch, but it differs:

PPCExpandISEL

  1. Runs after register allocation.
  2. Driven by a user flag rather than a compilation heuristic, i.e., once it is decided to expand ISELs, it expands them all.
  3. Used as a functionality pass, to cover cases where the HW does not support the ISEL instruction.

X86CmovConversion

  1. Runs on SSA machine IR, before register allocation.
  2. Heuristic-based optimization, i.e., it converts only the CMOV instructions that are worth converting while keeping the others.
  3. Used only as a performance pass; the functionality pass exists in X86TargetLowering::EmitLoweredSelect, which lowers pseudo CMOV instructions.
spatel edited edge metadata.Jul 1 2017, 6:50 AM

I haven't looked at the patch yet, but for reference I filed:
https://bugs.llvm.org//show_bug.cgi?id=33013 (although the comments veered off to a different topic)
...and mentioned:
https://reviews.llvm.org/rL292154

If the example(s) in the bug report are already here, then great. If not, you might want to consider those cases.

lsaba added a subscriber: lsaba.Jul 1 2017, 11:44 PM
zvi edited edge metadata.Jul 2 2017, 2:04 AM

A few quick comments in the body of the implementation.

About the tests:
The added tests are definitely interesting cases, but they are system tests, as they check the full flow in which we handle branches: CodeGenPrepare, ISel, EarlyIfConversion (which is not enabled for X86), ...
Have you considered adding MIR tests, which would serve as unit tests specific to this pass?

lib/Target/X86/X86CmovConversion.cpp
94

Please consider making the variable and type names consistent. Use either Cmov or CMOV.

127

Define as a reference instead of a pointer.

131

Could we save the overhead of initializing the resource tables from one visited function to another? If two visited functions share the same subtarget, could we reuse the same schedule model configuration?

156

This variable is not initialized in the c-tor, and IIUC it will be assigned here for the first time. IMHO, it would be better to reduce its scope to this loop and pass a reference into collectCmovCandidates() and checkForProfitableCmovCandidates().

170

To be consistent with collectCmovCandidates() and checkForProfitableCmovCandidates(), I suggest you move this loop into convertCmovInstsToBranches().

Another option that IMHO is better:
Define CmovInstGroups in the scope of the MF iteration loop in this function and pass it by reference to the functions called in this loop. That will save the need to call CmovInstGroups.clear() in numerous functions; see the sketch below.
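
Something like the following (the element type and callee signatures are assumptions for illustration; only the function names are taken from this patch):

// CmovInstGroups is local to the per-loop iteration and passed by
// reference, so no callee needs to call CmovInstGroups.clear().
for (MachineLoop *Loop : Loops) {
  SmallVector<SmallVector<MachineInstr *, 2>, 2> CmovInstGroups;
  if (!collectCmovCandidates(Loop, CmovInstGroups))
    continue;
  if (!checkForProfitableCmovCandidates(Loop, CmovInstGroups))
    continue;
  convertCmovInstsToBranches(CmovInstGroups);
}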

204

*FoundNonCMOVInst

211

Unless I missed it, can you please add a comment here, or update the comments at the top, explaining why we skip cmovs with memory operands? My naive thought is that if we can speculate the load in the cmov, then we should also be able to put the load under a branch.

226

Would this allow making the code below simpler?

else if (Group.empty()) continue;

It would probably be faster, as the typical case is that no interesting instruction has been encountered yet (or ever).

Another option is to search for the first interesting cmov instruction in the BB before entering this loop, and bail out if none is found.

243

Consider:

if (Group.empty()) continue;
aaboud marked 7 inline comments as done.Jul 5 2017, 10:15 AM

Thanks, Zvi, for the helpful comments.
I fixed most of them and answered the others; see below.

lib/Target/X86/X86CmovConversion.cpp
131

MachineCombiner, MachineLICM, and IfConverter initialize it in runOnMachineFunction, for every machine function.
Do you think this function is too expensive?
Either way, it is out of scope for this patch.

211

The reason is implementation effort.
There is a "TODO" at line 191, so this can be added in a future patch.
I do not see a need to add a comment; the "TODO" should be enough, don't you think?

aaboud updated this revision to Diff 105300.Jul 5 2017, 10:17 AM

Addressed Zvi's comments.

I haven't looked at the patch yet, but for reference I filed:
https://bugs.llvm.org//show_bug.cgi?id=33013 (although the comments veered off to a different topic)
...and mentioned:
https://reviews.llvm.org/rL292154

If the example(s) in the bug report are already here, then great. If not, you might want to consider those cases.

The reason it veered off topic is that the FPToUI bug that I reported on the mailing list was dismissed, supposedly because LLVM doesn't support correct floating-point accrued exception state even for very simple intrinsic conversions; e.g., ISD::FP_TO_UINT is broken on X86. The issue was isolated to a SELECT created by FPToUI being lowered as a conditional move; one side of the SELECT had a SUBSS floating-point instruction that clobbers the floating-point accrued exception state (sets FE_INEXACT), so we discussed exact vs. inexact conversions. The fix for the FPToUI bug is to lower the SELECT as a branch, although the compiler also needs to model the MXCSR flags. That aside, FPToUI is trivially fixable by emitting a branch. It's a bug.

The bug I reported was "FE_INEXACT being set for an exact conversion from float to unsigned long long", and it resulted in you raising a bug about the compiler generating more branchy code (ignoring the issue I reported), so we were not, in fact, going off topic; rather, we were discussing the original bug reported on the mailing list.

This code may indeed fix the FE_INEXACT issue, but long term the compiler needs to consider the floating-point accrued flag side effects. However, whether or not FENV_ACCESS is enabled, FPToUI should trivially be able to generate correct flags by using a branch instead of a cmov. GCC preserves the floating-point accrued exception flags for simple signed and unsigned floating-point conversions even when FENV_ACCESS is not enabled. It's not a complex DAG. This is not especially complex FP code. It's simply a broken ISD::FP_TO_UINT, i.e., a buggy intrinsic.

"SelectionDAGLegalize::ExpandNode() in LegalizeDAG.cpp. The X86 backend sets a table entry indicating that FP_TO_UINT should be expanded for these value types, but the actual expansion is in target-independent code. " from Andrew K. It's a SELECT.

case ISD::FP_TO_UINT: {
  SDValue True, False;
  EVT VT = Node->getOperand(0).getValueType();
  EVT NVT = Node->getValueType(0);
  APFloat apf(DAG.EVTToAPFloatSemantics(VT),
              APInt::getNullValue(VT.getSizeInBits()));
  APInt x = APInt::getSignBit(NVT.getSizeInBits());
  (void)apf.convertFromAPInt(x, false, APFloat::rmNearestTiesToEven);
  Tmp1 = DAG.getConstantFP(apf, dl, VT);
  Tmp2 = DAG.getSetCC(dl, getSetCCResultType(VT),
                      Node->getOperand(0),
                      Tmp1, ISD::SETLT);
  True = DAG.getNode(ISD::FP_TO_SINT, dl, NVT, Node->getOperand(0));
  // TODO: Should any fast-math-flags be set for the FSUB?
  False = DAG.getNode(ISD::FP_TO_SINT, dl, NVT,
                      DAG.getNode(ISD::FSUB, dl, VT,
                                  Node->getOperand(0), Tmp1));
  False = DAG.getNode(ISD::XOR, dl, NVT, False,
                      DAG.getConstant(x, dl, NVT));
  Tmp1 = DAG.getSelect(dl, NVT, Tmp2, True, False);
  Results.push_back(Tmp1);
  break;
}

Note that the second form with CMOV has a bug, as the SUBSS clobbers the flags, resulting in FE_INEXACT for exact conversions. In any case, this patch may fix the issue in a roundabout way.

"
$ more llvm/lib/Target/X86//README-X86-64.txt

Are we better off using branches instead of cmove to implement FP to
unsigned i64?

_conv:

ucomiss LC0(%rip), %xmm0
cvttss2siq      %xmm0, %rdx
jb      L3
subss   LC0(%rip), %xmm0
movabsq $-9223372036854775808, %rax
cvttss2siq      %xmm0, %rdx
xorq    %rax, %rdx

L3:

movq    %rdx, %rax
ret

instead of

_conv:

movss LCPI1_0(%rip), %xmm1
cvttss2siq %xmm0, %rcx
movaps %xmm0, %xmm2
subss %xmm1, %xmm2
cvttss2siq %xmm2, %rax
movabsq $-9223372036854775808, %rdx
xorq %rdx, %rax
ucomiss %xmm1, %xmm0
cmovb %rcx, %rax
ret

"

zvi added inline comments.Jul 9 2017, 1:57 PM
lib/Target/X86/X86CmovConversion.cpp
131

I don't know what the potential saving is :)
I'm OK with keeping it as-is.

211

I agree. I missed the TODO.

344

Please avoid using operator[] for lookups - use DenseMap::lookup() instead.
Also, maybe mapping operands to nulls is just inefficient? The non-existence of the mapping can also be used for choosing depth = 0.
So this code could instead be something like:

if (MachineInstr *DefMI = RDM.lookup(Reg)) {
  OperandToDefMap[&MO] = DefMI;
  MIDepth = std::max(MIDepth, DepthMap.lookup(DefMI).Depth);
  if (!IsCMOV)
    MIDepthOpt = std::max(MIDepthOpt, DepthMap.lookup(DefMI).OptDepth);
}

This also applies to more uses of operator[] below. If I'm not mistaken, only these two statements should remain using operator[]:

RegDefMaps[TargetRegisterInfo::isVirtualRegister(Reg)][Reg] = &MI;
DepthMap[&MI] = {MIDepth += Latency, MIDepthOpt += Latency};
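
For context, the behavioral difference between the two accessors (a small self-contained sketch using the DenseMap API):

#include "llvm/ADT/DenseMap.h"
#include "llvm/CodeGen/MachineInstr.h"
using namespace llvm;

void query(DenseMap<unsigned, MachineInstr *> &RDM, unsigned Reg) {
  // operator[] default-constructs and inserts a (Reg -> nullptr) entry
  // when Reg is absent, mutating the map as a side effect of the lookup.
  MachineInstr *A = RDM[Reg];
  // lookup() returns a value-initialized result (nullptr here) without
  // touching the map, which makes it the right tool for pure queries.
  MachineInstr *B = RDM.lookup(Reg);
  (void)A;
  (void)B;
}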
test/CodeGen/X86/x86-cmov-converter.ll
322

Can these attributes be dropped or minimized if they don't add anything?

zvi added inline comments.Jul 9 2017, 2:07 PM
lib/Target/X86/X86CmovConversion.cpp
339

Are these two equivalent?

for (auto &MO : MI.operands())
  if (!MO.isReg() || !MO.isUse())
    continue;

and

for (auto &MO : MI.uses())
  if (!MO.isReg())
    continue?
356

Are these two equivalent?

for (auto &MO: MI.operands())
  if (!MO.isReg() || !MO.isDef())
    continue;

and

for (auto &MO: MI.defs())
aaboud marked 2 inline comments as done.Jul 11 2017, 10:11 AM

Thanks Zvi,
I answered your comments below.

lib/Target/X86/X86CmovConversion.cpp
339

Could be, but to keep the code consistent with the loop below, I preferred this format. I do not mind changing it to your suggestion if you think it is better.

344

Fixed most places.
In the other places where I kept the "[]" operator, we know for sure that the entry we are looking for exists in the container.

356

No, they are not; "defs()" does not include implicit definition operands.

iterator_range<mop_iterator> operands() {
  return make_range(operands_begin(), operands_end());
}

/// Returns a range over all explicit operands that are register definitions.
/// Implicit definition are not included!
iterator_range<mop_iterator> defs() {
  return make_range(operands_begin(),
                    operands_begin() + getDesc().getNumDefs());
}

/// Returns a range that includes all operands that are register uses.
/// This may include unrelated operands which are not register uses.
iterator_range<mop_iterator> uses() {
  return make_range(operands_begin() + getDesc().getNumDefs(),
                    operands_end());
}
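
As a concrete illustration of the difference (X86-specific example: ADD32rr implicitly defines EFLAGS, which defs() skips):

// Sketch: the operands()-based loop used in the patch visits both the
// explicit GPR definition and the implicit EFLAGS definition of an
// instruction like ADD32rr; defs() would yield only the former.
for (MachineOperand &MO : MI.operands()) {
  if (!MO.isReg() || !MO.isDef())
    continue;
  // MO may be an explicit def (e.g., the GPR result) or an implicit
  // def (e.g., EFLAGS).
}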
aaboud updated this revision to Diff 106059.Jul 11 2017, 10:12 AM
aaboud marked an inline comment as done.

Addressed more comments by Zvi.

I haven't looked at the patch yet, but for reference I filed:
https://bugs.llvm.org//show_bug.cgi?id=33013 (although the comments veered off to a different topic)
...and mentioned:
https://reviews.llvm.org/rL292154

If the example(s) in the bug report are already here, then great. If not, you might want to consider those cases.

The optimization in this patch is triggered on inner loops only, assuming that hotspots will exist in these inner loops.
Moreover, if I modify the example in the bug report by calling the function from a loop like this:

static int foo(float x) {
  if (x < 42.0f)
    return x;
  return 12;
}

int bar(float *a, float *b, int n) {
  int sum = 0;
#pragma clang loop vectorize(disable)
  for (int i = 0; i < n; ++i) {
    float c = a[i] + b[i];
    sum += foo(c);
  }
  return sum;
}

Then the patch will indeed convert the CMOV into a branch.

LBB0_4:                                 # %for.body
                                        # =>This Inner Loop Header: Depth=1
        movss   (%esi), %xmm1           # xmm1 = mem[0],zero,zero,zero
        addss   (%edx), %xmm1
        ucomiss %xmm1, %xmm0
        ja      LBB0_5
# BB#6:                                 # %for.body
                                        #   in Loop: Header=BB0_4 Depth=1
        movl    $12, %eax
        jmp     LBB0_7
        .p2align        4, 0x90
LBB0_5:                                 #   in Loop: Header=BB0_4 Depth=1
        cvttss2si       %xmm1, %eax
LBB0_7:                                 # %for.body
                                        #   in Loop: Header=BB0_4 Depth=1
        addl    %edi, %eax
        addl    $4, %esi
        addl    $4, %edx
        decl    %ecx
        movl    %eax, %edi
        jne     LBB0_4

It would be great if I could get approval for this patch before the next release branch is created (July 19, 2017: branch point).
Please let me know as soon as possible if you have any more comments, so I can address them as well.

Thanks,
Amjad

zvi added inline comments.Jul 11 2017, 12:13 PM
lib/Target/X86/X86CmovConversion.cpp
339

Iterating over uses() instead of operands() is more efficient, right?

565

Please add a comment explaining the data structure.

578
auto I = RegRewriteTable.find(Op1Reg);
if (I != RegRewriteTable.end())
  Op1Reg = I->second.first;
aaboud updated this revision to Diff 106214.Jul 12 2017, 8:00 AM
aaboud marked 2 inline comments as done.

Addressed more comments by Zvi.

lib/Target/X86/X86CmovConversion.cpp
565

There is a comment explaining this map and even explaining the motivation behind it:

//  ...   and that the code must maintain a mapping from earlier PHI's
// destination registers, and the registers that went into the PHI.
zvi accepted this revision.Jul 12 2017, 1:09 PM

LGTM. Thanks, Amjad!

lib/Target/X86/X86CmovConversion.cpp
565

That comment explaining why we need to maintain the mapping is great. I thought it would be helpful to add something like:

// Maps Phi's Def VirtReg -> (Incoming0, Incoming1)

I'm fine with keeping it as-is.

This revision is now accepted and ready to land.Jul 12 2017, 1:09 PM
This revision was automatically updated to reflect the committed changes.
anna added a subscriber: anna.Jul 26 2017, 1:58 PM

Hi @aaboud,

We are seeing a couple of regressions caused by this optimization in our internal benchmarks on Skylake hardware.
They are fixed when we turn off the optimization using -x86-cmov-converter=false.

Could it be something in the heuristic that gates the cmov-to-branch conversion on the gain being greater than 25% of the branch misprediction penalty? I analyzed the assembly of the internal benchmark and can definitely see a higher number of branches in an MBB that was earlier all in one block with a cmov.

I have filed a bug for this (with a reduced IR): https://bugs.llvm.org/show_bug.cgi?id=33954

Could you please disable this optimization while debugging the problem offline?

Also, I was wondering if you ran this optimization on any LLVM performance suite; I couldn't find the numbers in the review.

Thanks,
Anna

In D34769#822047, @anna wrote:

Also, I was wondering if you ran this optimization on any LLVM performance suite; I couldn't find the numbers in the review.

I measured gains on both Spec2006 and the EEMBC benchmarks, with no regressions directly related to the CMOV conversion into a branch.
The most noticeable improvements (but there are more), measured on Skylake:

Spec2006
  429.mcf:     +6%
  456.hmmer:   +12%

Telecom
  fbital00:    +30%