This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG] don't split branch on logic-of-vector-compares
ClosedPublic

Authored by spatel on Jun 25 2020, 2:14 PM.

Details

Summary

SelectionDAGBuilder converts logic-of-compares into multiple branches based on a boolean TLI setting queried through isJumpExpensive(). But that heuristic probably never considered the pattern of bools extracted from a vector compare - it seems unlikely that we would want to turn vector logic into control flow.

The motivating x86 reduction case is shown in PR44565:
https://bugs.llvm.org/show_bug.cgi?id=44565
...and that test shows the expected improvement from using pmovmsk codegen.
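For reference, this is a reduced sketch (hypothetical function and value names) of the kind of IR pattern the patch is about: a branch on the logical 'and' of two bools extracted from one vector compare.

```llvm
define void @logic_of_vec_cmp(<2 x double> %a, <2 x double> %b) {
entry:
  %cmp = fcmp oeq <2 x double> %a, %b
  %e0 = extractelement <2 x i1> %cmp, i32 0
  %e1 = extractelement <2 x i1> %cmp, i32 1
  ; Depending on the TLI isJumpExpensive() setting, SelectionDAGBuilder
  ; may split this 'and' into two branches; keeping it whole allows
  ; movmsk-style codegen on x86.
  %and = and i1 %e0, %e1
  br i1 %and, label %t, label %f
t:
  ret void
f:
  ret void
}
```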

For AArch64, I modified the test to include an extra op because the simpler test gets transformed by a codegen invocation of SimplifyCFG. I think what we see currently is an improvement, but it might be better if the 'and' was done on the vector unit. Potentially this could use 'addv' or 'addp' instead?

Diff Detail

Event Timeline

spatel created this revision.Jun 25 2020, 2:14 PM

Would it make sense to treat this as an SLP issue? i.e. we should transform and(extractelement(x,0), extractelement(x,1)) to vector.reduce.and(x)? If we're not extracting two elements from the same vector, the transform doesn't look as good.

On AArch64, we probably want to use addp for the result of a <2 x double> compare, given the limited set of available operations on 64-bit integers, sure. But I doubt it's really that relevant in practice.
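As a hedged sketch, the suggested reduction form of that check would look something like the following (at the time of this review the intrinsic was still spelled llvm.experimental.vector.reduce.and):

```llvm
%cmp = fcmp oeq <2 x double> %a, %b
; Collapse all compare lanes with a single reduction instead of
; extracting the lanes and 'and'ing them as scalars.
%all = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %cmp)
br i1 %all, label %t, label %f
```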

We've tried several times to get 2-element vector support into SLP, and it's always blown up in our faces.

Yes - however, my attempt within SLP failed (D59710). We should now get this simplest case in VectorCombine.
I am still trying to solve another variation of the pattern more generally in VectorCombine with D82474, but cases like this may escape, as shown in that patch description. If the cost model does not account for codegen turning the pattern into movmsk with a scalar compare + branch, it may say the transform is not profitable. So we need to look at larger patterns to see what tricks codegen can perform, but that gets more and more complicated as we lengthen the match in IR.

I think it's fine to limit the bailout here by matching a common source vector, so I can add that clause.

spatel updated this revision to Diff 273693.Jun 26 2020, 5:41 AM
spatel added reviewers: uweigand, jonpa.

Patch updated:

  1. Constrained the match to require extracting from a single source vector.
  2. I missed a SystemZ test change in the earlier draft; adding SystemZ reviewers to confirm if that's an improvement.

Interesting. The compiler now completely optimizes out all operations on one of the two vector lanes, apparently because it recognizes their value is constant and computable at compile time. So that's an improvement (even though not one this test case was expecting to happen ...).

On the other hand, in the operations on the remaining lane, it seems the compiler is now no longer able to perform the known-bits optimization that this test is actually written for -- see the comment that says we should optimize out a redundant AND to get a compare instead of a TM (test-under-mask), but the code now actually does contain a TM (tmll). That seems a regression at first glance.

Thanks. I can't tell if there's some generic transform that would help or if something target-specific is missing.
We have this after type-legalization:

      t50: v4i32 = BUILD_VECTOR undef:i32, t32, undef:i32, undef:i32
    t51: v2i64 = bitcast t50
    t30: v2i64 = BUILD_VECTOR Constant:i64<1>, Constant:i64<1>
  t36: v2i64 = xor t51, t30
t45: v4i32 = bitcast t36

Would converting the xor to v4i32 help? computeKnownBits probably has problems looking through bitcasts.

The bitcasts shouldn't be a problem, but the undefs will cause it to bail if any of those undef elements are demanded.

That would largely obsolete this test, but I think SystemZ should override the TLI hook shouldScalarizeBinop(), which defaults to returning 'false'.

X86 does this:

bool shouldScalarizeBinop(SDValue VecOp) const {
  unsigned Opc = VecOp.getOpcode();

  // Assume target opcodes can't be scalarized.
  // TODO - do we have any exceptions?
  if (Opc >= ISD::BUILTIN_OP_END)
    return false;

  // If the vector op is not supported, try to convert to scalar.
  EVT VecVT = VecOp.getValueType();
  if (!isOperationLegalOrCustomOrPromote(Opc, VecVT))
    return true;

  // If the vector op is supported, but the scalar op is not, the transform may
  // not be worthwhile.
  EVT ScalarVT = VecVT.getScalarType();
  return isOperationLegalOrCustomOrPromote(Opc, ScalarVT);
}

If I add that on top of this patch, this test collapses to:

 $ llc -o - sysz.ll -mtriple=s390x-linux-gnu -mcpu=z13  -jump-is-expensive=1
...
# %bb.0:
	clhhsi	0, 0
	je	.LBB0_2
# %bb.1:
.LBB0_2:
.Lfunc_end0:

If the override seems reasonable, I can post that for review. Here's a draft with current trunk test changes:

diff --git a/llvm/lib/Target/SystemZ/SystemZISelLowering.h b/llvm/lib/Target/SystemZ/SystemZISelLowering.h
index e60deaedbdf..3e1cdba9906 100644
--- a/llvm/lib/Target/SystemZ/SystemZISelLowering.h
+++ b/llvm/lib/Target/SystemZ/SystemZISelLowering.h
@@ -452,6 +452,25 @@ public:
     return VT == MVT::i32 || VT == MVT::i64;
   }
 
+  bool shouldScalarizeBinop(SDValue VecOp) const override {
+    unsigned Opc = VecOp.getOpcode();
+
+    // Assume target opcodes can't be scalarized.
+    // TODO - do we have any exceptions?
+    if (Opc >= ISD::BUILTIN_OP_END)
+      return false;
+
+    // If the vector op is not supported, try to convert to scalar.
+    EVT VecVT = VecOp.getValueType();
+    if (!isOperationLegalOrCustomOrPromote(Opc, VecVT))
+      return true;
+
+    // If the vector op is supported, but the scalar op is not, the transform may
+    // not be worthwhile.
+    EVT ScalarVT = VecVT.getScalarType();
+    return isOperationLegalOrCustomOrPromote(Opc, ScalarVT);
+  }
+
   const char *getTargetNodeName(unsigned Opcode) const override;
   std::pair<unsigned, const TargetRegisterClass *>
   getRegForInlineAsmConstraint(const TargetRegisterInfo *TRI,
diff --git a/llvm/test/CodeGen/SystemZ/knownbits.ll b/llvm/test/CodeGen/SystemZ/knownbits.ll
index 08694d8e699..021c939dcfa 100644
--- a/llvm/test/CodeGen/SystemZ/knownbits.ll
+++ b/llvm/test/CodeGen/SystemZ/knownbits.ll
@@ -9,9 +9,9 @@ define i32 @f0(<4 x i32> %a0) {
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vgbm %v0, 0
 ; CHECK-NEXT:    vceqf %v0, %v24, %v0
-; CHECK-NEXT:    vrepif %v1, 1
-; CHECK-NEXT:    vnc %v0, %v1, %v0
+; CHECK-NEXT:    vno %v0, %v0, %v0
 ; CHECK-NEXT:    vlgvf %r2, %v0, 3
+; CHECK-NEXT:    nilf %r2, 1
 ; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
 ; CHECK-NEXT:    br %r14
   %cmp0 = icmp ne <4 x i32> %a0, zeroinitializer
diff --git a/llvm/test/CodeGen/SystemZ/vec-trunc-to-i1.ll b/llvm/test/CodeGen/SystemZ/vec-trunc-to-i1.ll
index 278f0bf2a30..a6bc5763b25 100644
--- a/llvm/test/CodeGen/SystemZ/vec-trunc-to-i1.ll
+++ b/llvm/test/CodeGen/SystemZ/vec-trunc-to-i1.ll
@@ -7,13 +7,10 @@ define void @pr32275(<4 x i8> %B15) {
 ; CHECK-LABEL: pr32275:
 ; CHECK:       # %bb.0: # %BB
 ; CHECK-NEXT:    vlgvb %r0, %v24, 3
-; CHECK-NEXT:    vlvgp %v0, %r0, %r0
-; CHECK-NEXT:    vrepif %v1, 1
-; CHECK-NEXT:    vn %v0, %v0, %v1
-; CHECK-NEXT:    vlgvf %r0, %v0, 3
 ; CHECK-NEXT:  .LBB0_1: # %CF34
 ; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    cijlh %r0, 0, .LBB0_1
+; CHECK-NEXT:    tmll %r0, 1
+; CHECK-NEXT:    jne .LBB0_1
 ; CHECK-NEXT:  # %bb.2: # %CF36
 ; CHECK-NEXT:    br %r14
 BB:

I've now updated the knownbits.ll test case to make it less fragile and still test what it is supposed to test (commit e9c6b63). The problem was use of "undef" (which makes it susceptible to collapse as generic optimizations improve), and the fact that knownBits was operating at the very edge of MaxRecursionDepth, so changes in common code could push it above the limit.

Can you retry with current mainline? Hopefully, the test case now no longer changes with your patch.

As to shouldScalarizeBinop, thanks for pointing this out! I agree we probably ought to define this, but I think I'd like to evaluate the changes this is causing to real-world code before checking it in. @jonpa, can you have a look?

spatel updated this revision to Diff 274565.Jun 30 2020, 11:59 AM

Patch updated:
Rebased after rGe9c6b63d4a16c795 - this patch doesn't affect any SystemZ tests now (thanks!).

RKSimon accepted this revision.Jun 30 2020, 11:55 PM

LGTM - cheers

This revision is now accepted and ready to land.Jun 30 2020, 11:55 PM
This revision was automatically updated to reflect the committed changes.