This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Optimize cmp chain before legalization
ClosedPublic

Authored by Allen on Nov 14 2022, 4:47 AM.

Details

Summary
  • For case bcmp9, there are extra AND and ANY_EXTEND nodes in the OR/XOR chain, which prevent the transform, so enable the optimization before legalization.
  • The key IR frag related:
      t37: i32,ch = load<(load (s8) from %ir.4), anyext from i8> t0, t11, undef:i64
        t12: i64 = add t4, Constant:i64<8>
      t38: i32,ch = load<(load (s8) from %ir.5), anyext from i8> t0, t12, undef:i64
    t39: i32 = xor t37, t38
  t40: i64 = any_extend t39
t42: i64 = and t40, Constant:i64<255>
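The DAG fragment above can be reduced to a small C++ sketch (a hypothetical reduction of the bcmp9 pattern — the function name and layout here are assumptions, not the actual test from llvm/test/CodeGen/AArch64/bcmp.ll): comparing 9 bytes splits into an 8-byte XOR plus a 1-byte XOR, and the trailing i8 compare is where the extra AND/ANY_EXTEND pair comes from.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical reduction of a 9-byte bcmp: the first 8 bytes are compared
// as one 64-bit XOR, the 9th byte as an 8-bit XOR that gets any-extended
// to i64 and masked with 255 -- the AND/EXTEND nodes seen in the DAG dump.
bool bcmp9_eq(const unsigned char *a, const unsigned char *b) {
  uint64_t x, y;
  std::memcpy(&x, a, 8);
  std::memcpy(&y, b, 8);
  uint64_t lo = x ^ y;
  uint64_t hi = (uint64_t)(a[8] ^ b[8]) & 0xff; // any_extend + and(...,255)
  return (lo | hi) == 0;
}
```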

Depends on D138398 to fix combine_setcc_glue

Diff Detail

Event Timeline

Allen created this revision.Nov 14 2022, 4:47 AM
Allen requested review of this revision.Nov 14 2022, 4:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 14 2022, 4:47 AM
Allen edited the summary of this revision. (Show Details)Nov 14 2022, 4:48 AM

There is code in DAGCombiner::BackwardsPropagateMask that can propagate ANDs back to loads, and would usually handle patterns like this, but it can't look through any_extends. It has seemed to be useful in the past though.
You could also imagine transforming i64 and(any_extend(i32 x), mask) into i64 zext(i32 and(x, mask)) under AArch64, as we know the zext will be free. I think that would run into other problems though, as a zext between the Ands isn't handled for all the BFI cases. Without improving BFI at the same time it would lead to other regressions.

So I'm not sure either of those methods would be better than this, even if they are more general. I think it would be useful to add deliberate tests for this though if we can, especially for the edge cases.
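The suggested rewrite relies on a simple identity: the high bits an any_extend leaves unspecified are ANDed away whenever the mask fits in the narrow type. A minimal C++ check of that equivalence (modeling the unspecified high bits with an arbitrary `garbage` value — the function names are illustrative, not LLVM APIs):

```cpp
#include <cstdint>

// and(any_extend(x), mask): any_extend leaves the high 32 bits
// unspecified, modeled here by an arbitrary 'garbage' value.
uint64_t and_anyext(uint32_t x, uint32_t garbage, uint64_t mask) {
  uint64_t ext = ((uint64_t)garbage << 32) | x; // any_extend: high bits arbitrary
  return ext & mask;
}

// zext(and(x, mask)): the form the transform would produce; the zext is
// free on AArch64 since 32-bit ops zero the top half of the register.
uint64_t zext_and(uint32_t x, uint32_t mask) {
  return (uint64_t)(x & mask);
}
```

As long as `mask` has no bits above bit 31 set, both forms produce the same value regardless of the garbage.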

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
11982 ↗(On Diff #475108)

Why "Dup" in the name? Because there are two loads? When I see Dup I think of vector splats.

12013 ↗(On Diff #475108)

Does it not need to check other bits about the Mask?

12030 ↗(On Diff #475108)

Should XOR be Logicop.getOpcode()?

20666 ↗(On Diff #475108)

This can be done separately.

Allen updated this revision to Diff 475825.Nov 16 2022, 7:38 AM
Allen marked an inline comment as done.
Allen marked 2 inline comments as done.Nov 16 2022, 7:46 AM

There is code in DAGCombiner::BackwardsPropagateMask that can propagate ANDs back to loads, and would usually handle patterns like this, but it can't look through any_extends. It has seemed to be useful in the past though.
You could also imagine transforming i64 and(any_extend(i32 x), mask) into i64 zext(i32 and(x, mask)) under AArch64, as we know the zext will be free. I think that would run into other problems though, as a zext between the Ands isn't handled for all the BFI cases. Without improving BFI at the same time it would lead to other regressions.

So I'm not sure either of those methods would be better than this, even if they are more general. I think it would be useful to add deliberate tests for this though if we can, especially for the edge cases.

a) visitAND already tries the i64 and (any_extend(i32 x), mask) into i64 zext(i32 and(x, mask)) transform, but it fails the MaskedValueIsZero check, https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L6302
b) I also tried adjusting ISD::ANY_EXTEND --> ISD::ZERO_EXTEND in DAGTypeLegalizer::PromoteIntOp_ZERO_EXTEND; then we also get the expected zext without the AND, but some x86 cases regressed for chains of ANDs, so I limited it to the case where both operands are loads.
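The MaskedValueIsZero failure can be illustrated with a toy known-bits model (this is a simplified sketch, not the real llvm::KnownBits API): an any_extend makes the high bits unknown rather than known-zero, so the combine in visitAND cannot prove the AND is redundant.

```cpp
#include <cstdint>

// Toy model of known-bits reasoning: Zero has a bit set where the value
// is provably 0.  Not the real LLVM KnownBits interface.
struct Known { uint64_t Zero; };

// any_extend(i32 -> i64): the top 32 bits are undefined, so nothing
// above bit 31 is known zero.
Known anyExtend32to64(Known K32) {
  return {K32.Zero & 0xffffffffu};
}

// zero_extend(i32 -> i64): the top 32 bits become known zero.
Known zeroExtend32to64(Known K32) {
  return {(K32.Zero & 0xffffffffu) | 0xffffffff00000000u};
}

// MaskedValueIsZero(V, Mask): every bit of Mask must be known zero in V.
bool maskedValueIsZero(Known K, uint64_t Mask) {
  return (Mask & ~K.Zero) == 0;
}
```

For an i32 that holds a zero-extended i8 (bits 8-31 known zero), the check `MaskedValueIsZero(ext, ~0xff)` succeeds after a zero_extend but fails after an any_extend, which is why the existing visitAND combine bails.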

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
11982 ↗(On Diff #475108)

Yes, both operands are loads. I'll rename it to CombineZExtLogicopDoubleExtLoad

12013 ↗(On Diff #475108)

Done, thanks

12030 ↗(On Diff #475108)

Oh, good catch. thanks

20666 ↗(On Diff #475108)

Maybe you can still enable the optimization before legalization to fix these issues, like:

if ((Cond == ISD::SETEQ || Cond == ISD::SETNE) && isNullConstant(RHS) &&
    LHS->getOpcode() == ISD::OR && LHS->hasOneUse() &&
    isOrXorChain(LHS, NumXors, WorkList)) {
  SDValue XOR0, XOR1;
  std::tie(XOR0, XOR1) = WorkList[0];
  SDValue Cmp = DAG.getSetCC(DL, VT, XOR0, XOR1, ISD::SETNE);
  for (unsigned I = 1; I < WorkList.size(); I++) {
    std::tie(XOR0, XOR1) = WorkList[I];
    SDValue CmpChain = DAG.getSetCC(DL, VT, XOR0, XOR1, ISD::SETNE);
    Cmp = DAG.getNode(ISD::OR, DL, VT, Cmp, CmpChain);
  }

  // Exit early by inverting the condition, which helps reduce indentation.
  return DAG.getSetCC(DL, VT, Cmp, DAG.getConstant(0, DL, VT), Cond);
}

After that you need to remove the if (!DCI.isBeforeLegalize()) check for performOrXorChainCombine and remove the function call in lowerSetCC to make the code cleaner.
As @dmgreen mentioned before, maybe you can try to combine to AArch64ISD::SUBS + AArch64ISD::CCMP. But you may need to fix some type legalization issues to make it work.
The current patch looks too specific, I think.

Allen updated this revision to Diff 476451.Nov 18 2022, 6:29 AM
Allen marked 2 inline comments as done.
Allen retitled this revision from [AArch64] Optimize cmp chain when the result is tested for [in]equality with 0 to [AArch64] Optimize cmp chain before legalization.
Allen edited the summary of this revision. (Show Details)
Allen added a comment.Nov 18 2022, 6:32 AM

Maybe you can still enable the optimization before legalization to fix these issues, like:

if ((Cond == ISD::SETEQ || Cond == ISD::SETNE) && isNullConstant(RHS) &&
    LHS->getOpcode() == ISD::OR && LHS->hasOneUse() &&
    isOrXorChain(LHS, NumXors, WorkList)) {
  SDValue XOR0, XOR1;
  std::tie(XOR0, XOR1) = WorkList[0];
  SDValue Cmp = DAG.getSetCC(DL, VT, XOR0, XOR1, ISD::SETNE);
  for (unsigned I = 1; I < WorkList.size(); I++) {
    std::tie(XOR0, XOR1) = WorkList[I];
    SDValue CmpChain = DAG.getSetCC(DL, VT, XOR0, XOR1, ISD::SETNE);
    Cmp = DAG.getNode(ISD::OR, DL, VT, Cmp, CmpChain);
  }

  // Exit early by inverting the condition, which helps reduce indentation.
  return DAG.getSetCC(DL, VT, Cmp, DAG.getConstant(0, DL, VT), Cond);
}

After that you need to remove the if (!DCI.isBeforeLegalize()) check for performOrXorChainCombine and remove the function call in lowerSetCC to make the code cleaner.
As @dmgreen mentioned before, maybe you can try to combine to AArch64ISD::SUBS + AArch64ISD::CCMP. But you may need to fix some type legalization issues to make it work.
The current patch looks too specific, I think.

Thanks @bcl5980 for your idea; applied your comment.

bcl5980 added inline comments.Nov 19 2022, 3:31 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8592

Maybe the ISD::ZERO_EXTEND can remain here?

llvm/test/CodeGen/AArch64/dag-combine-setcc.ll
224

This looks like a regression for this case?

bcl5980 added inline comments.Nov 20 2022, 10:37 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8633

It looks like this can be simplified further to

unsigned LogicOp = (Cond == ISD::SETEQ) ? ISD::AND : ISD::OR;
SDValue Cmp = DAG.getSetCC(DL, VT, XOR0, XOR1, Cond);
for (unsigned I = 1; I < WorkList.size(); I++) {
  std::tie(XOR0, XOR1) = WorkList[I];
  SDValue CmpChain = DAG.getSetCC(DL, VT, XOR0, XOR1, Cond);
  Cmp = DAG.getNode(LogicOp, DL, VT, Cmp, CmpChain);
}

return Cmp;

It looks like more cases can benefit from it.
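The LogicOp selection in the snippet above rests on a De Morgan identity: an or-of-xors compared against 0 with SETEQ is a conjunction of per-pair equalities, and with SETNE a disjunction of inequalities. A scalar C++ check of that identity (function names are illustrative):

```cpp
#include <cstdint>

// The chain form the combine starts from: ((a^b)|(c^d)) == 0.
bool orXorChainEq(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
  return ((a ^ b) | (c ^ d)) == 0;
}

// SETEQ case: per-pair setcc results joined with AND.
bool setccChainEq(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
  return (a == b) & (c == d); // LogicOp = ISD::AND for SETEQ
}

// SETNE case: per-pair setcc results joined with OR.
bool setccChainNe(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
  return (a != b) | (c != d); // LogicOp = ISD::OR for SETNE
}
```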

llvm/test/CodeGen/AArch64/dag-combine-setcc.ll
224

The cmp x3, x10 part of this regression is a little tricky. It depends on the tree search order: if we searched the RHS first it would disappear. But I don't know how to fix it in an elegant way.

Allen added inline comments.Nov 21 2022, 6:26 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8592

In the current implementation, we generate the AND and ANY_EXTEND sequence, so the ISD::ZERO_EXTEND is not required?

8633

Yes, this change fixes the case in PR58675, but regresses the case combine_setcc_glue, so I'll need to do more work on it.

+++ b/llvm/test/CodeGen/AArch64/dag-combine-setcc.ll
@@ -191,9 +191,11 @@ define i32 @combine_setcc_glue(i128 noundef %x, i128 noundef %y) {
 ; CHECK-LABEL: combine_setcc_glue:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    cmp x1, x3
-; CHECK-NEXT:    ccmp x0, x2, #0, eq
-; CHECK-NEXT:    ccmp x0, x2, #4, ne
-; CHECK-NEXT:    cset w0, eq
+; CHECK-NEXT:    cset w8, eq
+; CHECK-NEXT:    cmp x0, x2
+; CHECK-NEXT:    cset w9, eq
+; CHECK-NEXT:    and w8, w9, w8
+; CHECK-NEXT:    orr w0, w8, w9
 ; CHECK-NEXT:    ret
bcl5980 added inline comments.Nov 21 2022, 5:54 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8592

A case like that can trigger the zext part: https://www.godbolt.org/z/bPrT3G7bh
And I think the cost is very low, so we can add it.

8633

Don't worry about the combine_setcc_glue regression. D138401 already fixed that.

bcl5980 added inline comments.Nov 21 2022, 5:57 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8633

Sorry, the change that fixes combine_setcc_glue is D138398

Allen updated this revision to Diff 477061.Nov 21 2022, 10:11 PM
Allen edited the summary of this revision. (Show Details)

Rebase on D138398

Allen marked 5 inline comments as done.Nov 21 2022, 10:14 PM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8633

Thanks for the fix; I rebased on that.

bcl5980 accepted this revision.Nov 21 2022, 10:48 PM

LGTM, but please wait for @dmgreen or someone else to approve it.

This revision is now accepted and ready to land.Nov 21 2022, 10:48 PM

Can you add tests for chains of or/xor with some different types? Maybe i8/i16/i128 (assuming we already have them for i32/i64). Then something strange like an i42. If we are allowing the transform in many more places then it would be good to test them.

Allen updated this revision to Diff 477169.Nov 22 2022, 6:25 AM
Allen marked an inline comment as done.

Add 5 more cases with zero_extend and different types (i8/i16/i128/i42)
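An i42 compare chain can be modeled in C++ to see what the new test exercises (a sketch under the assumption that type legalization promotes i42 to i64, so only the low 42 bits are significant; names here are illustrative):

```cpp
#include <cstdint>

// i42 is promoted to i64 during type legalization; only the low 42 bits
// are significant, so an i42 or/xor compare chain is a masked 64-bit chain.
constexpr uint64_t kMask42 = (1ull << 42) - 1;

bool cmpChainI42(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
  return (((a ^ b) | (c ^ d)) & kMask42) == 0;
}
```

Bits above bit 41 are don't-care, which is exactly the kind of edge the transform has to stay correct for.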

dmgreen added inline comments.Nov 23 2022, 1:38 AM
llvm/test/CodeGen/AArch64/bcmp.ll
432

We don't usually put godbolt links in the source.

I see the new code is a little larger. They wouldn't be if the inputs were loaded or known to be smaller already though. And they don't fail, which was the main point of adding them.

dmgreen accepted this revision.Nov 23 2022, 2:10 AM

OK, let's go with this patch. It's not quite perfect, but the gains are good to see and we can fix the other issues if they cause problems. If you remove the godbolt link then LGTM.

Remember to upload with context. Thanks.

Allen updated this revision to Diff 477442.Nov 23 2022, 3:38 AM

Remove the godbolt link from the comment

Allen marked an inline comment as done.Nov 23 2022, 3:39 AM
Allen added inline comments.
llvm/test/CodeGen/AArch64/bcmp.ll
432

Deleted, thanks

This revision was landed with ongoing or failed builds.Nov 23 2022, 3:48 AM
This revision was automatically updated to reflect the committed changes.
Allen marked an inline comment as done.