This is an archive of the discontinued LLVM Phabricator instance.

[X86] Combining CMOVs with [ANY,SIGN,ZERO]_EXTEND for cases where CMOV has constant arguments
ClosedPublic

Authored by aivchenk on Aug 14 2017, 1:52 PM.

Details

Summary

Combine CMOV[i16]<-[SIGN,ZERO,ANY]_EXTEND to [i32,i64] into CMOV[i32,i64].
One example of where it is useful is:

before (20 bytes)
<foo>:
test $0x1,%dil
mov $0x307e,%ax
mov $0xffff,%cx
cmovne %ax,%cx
movzwl %cx,%eax
retq

after (18 bytes)
<foo>:
test $0x1,%dil
mov $0x307e,%ecx
mov $0xffff,%eax
cmovne %ecx,%eax
retq

Diff Detail

Event Timeline

aivchenk created this revision.Aug 14 2017, 1:52 PM
RKSimon edited edge metadata.Aug 14 2017, 2:00 PM
RKSimon added a subscriber: llvm-commits.

Please don't forget to include llvm-commits as a subscriber

aaboud added a reviewer: zvi.Aug 14 2017, 2:02 PM
spatel added inline comments.Aug 15 2017, 8:11 AM
test/CodeGen/X86/cmov-promotion.ll
2–3
  1. Please add this file to trunk with the utils script auto-generated baseline checks, so we just see the diffs in this patch.
  2. It's not clear to me what the 2nd run is showing. That target has cmov, but it gets converted to test+branch later?
  3. Assuming the 2nd run is necessary, why are there no MCU checks for the 2nd test?
  4. Should there be an spromotion_16_to_64 for completeness?
spatel edited edge metadata.Aug 15 2017, 8:17 AM

It's worth noting that we could convert any select of constants to math ops. I was planning a follow-up to https://reviews.llvm.org/rL310717 that would do that, but there might be cases where we would prefer to select cmov? If not, that would make this patch obsolete.

It's worth noting that we could convert any select of constants to math ops. I was planning a follow-up to https://reviews.llvm.org/rL310717 that would do that, but there might be cases where we would prefer to select cmov? If not, that would make this patch obsolete.

GCC has already implemented this:
https://godbolt.org/g/i5DVLe

aivchenk updated this revision to Diff 111228.Aug 15 2017, 12:23 PM
aivchenk marked an inline comment as done.Aug 15 2017, 12:42 PM

Thank you for your comments, I updated the patch to address them:

  1. Done
  2. MCU target does not have CMOV instruction so it is always converted into test+branch. I firstly cared about it - that's the reason why it is here
  3. Just a miss. It is fixed
  4. Ditto

About converting into math ops.. I think that in most cases my patch would be obsolete then. The only reason I see why it would be not obsolete is: assume that CMOV with constants is part of a CMOV-group (as defined in X86CmovConversion.cpp: consecutive/having the same CC) and the decision taken by x86-cmov-converter is to transform this CMOV-group into branch. Since branch is generated anyways now, basically this "constant" CMOV is transformed into two addition MOVs, which should be more effective than converting it to three ops math logic

Thank you for your comments, I updated the patch to address them:

  1. Done
  2. MCU target does not have CMOV instruction so it is always converted into test+branch. I firstly cared about it - that's the reason why it is here

I don't have a good understanding of this path then. Why are we generating x86-specific cmov nodes for a target that doesn't support cmov? I would think these should remain as select nodes to avoid creating a flags result that won't exist, but I must be missing something?

About converting into math ops.. I think that in most cases my patch would be obsolete then. The only reason I see why it would be not obsolete is: assume that CMOV with constants is part of a CMOV-group (as defined in X86CmovConversion.cpp: consecutive/having the same CC) and the decision taken by x86-cmov-converter is to transform this CMOV-group into branch. Since branch is generated anyways now, basically this "constant" CMOV is transformed into two addition MOVs, which should be more effective than converting it to three ops math logic

That's a good point - if the compare has multiple uses, we probably shouldn't convert to math. This may already be wrong in the existing transforms. It's also possible that a cmov sequence is faster (the dependency chain is shorter), but I think a CPU would have to be able to execute cmov at the same speed/throughput as the simple ops for that to be true.

craig.topper edited edge metadata.Aug 15 2017, 2:46 PM

I believe X86 lowers all selects of any type to X86ISD::CMOV. If the target supports CMOV for a type, isel will emit a real CMOV instruction. If the target doesn't support CMOV or the type isn't supported by CMOV we emit a CMOV peusdo instruction which gets custom inserted to control flow via EmitLoweredSelect.

I checked in the tests from cmov-promotion.ll at rL311052, so we'd see the baseline codegen. Can you rebase after that?

I put a couple of style nit comments inline. I think this patch makes sense, but would appreciate if another reviewer can take a look too to make sure the non-cmov code is better too.

lib/Target/X86/X86ISelLowering.cpp
34359–34361

It's a matter of taste, but seems more typical to shrink the function signature to just (SDNode *N, SelectionDAG &DAG) and then extract the VT, DL, and opcode in local variables.

Another style note: it's completely inconsistent, but I think we prefer "DL" now that variables are supposed to be capitalized.

34379–34382

Could just use the 4 operand override of getNode here to avoid the explicit SmallVector?

aivchenk updated this revision to Diff 111491.Aug 17 2017, 4:00 AM
aivchenk marked 2 inline comments as done.

Thank you very much for your comments and for submitting the baseline.

I did the rebase and addressed your comments in the new diff

Thanks for the updates. I have no other suggestions.

Kind ping :)

Minor style comments.

lib/Target/X86/X86ISelLowering.cpp
34362

(style)

if (CMovN.getOpcode() != X86ISD::CMOV)
  return SDValue();
34379

(style) early-out:

  if (!DoPromoteCMOV)
    return SDValue();

  CMovOp0 = DAG.getNode(ExtendOpcode, DL, TargetVT, CMovOp0);
  CMovOp1 = DAG.getNode(ExtendOpcode, DL, TargetVT, CMovOp1);
  return DAG.getNode(X86ISD::CMOV, DL, TargetVT, CMovOp0, CMovOp1,
                                        CMovN.getOperand(2), CMovN.getOperand(3));
}
test/CodeGen/X86/cmov-promotion.ll
98

Please can you add 32->64 tests to check that they aren't happening? Maybe 8->X test as well?

aivchenk updated this revision to Diff 114162.Sep 7 2017, 5:29 AM
aivchenk marked 2 inline comments as done.

Thanks for your comments, I attached the updated patch

No more comments from me - anybody else (@spatel)?

spatel accepted this revision.Sep 10 2017, 7:03 AM

LGTM.

Side note about my earlier comment about transforming select-of-constants to bitwise logic: according to Agner's tables, AMD Zen can do cmov in a single uop and at full throughput, so that transform won't make sense in general.

This revision is now accepted and ready to land.Sep 10 2017, 7:03 AM

@aivchenk Are you intending to commit this?

This revision was automatically updated to reflect the committed changes.

yep, sorry for the delay