This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG] don't split branch on logic-of-vector-compares
ClosedPublic

Authored by spatel on Jun 25 2020, 2:14 PM.

Details

Summary

SelectionDAGBuilder converts logic-of-compares into multiple branches based on a boolean TLI setting queried through isJumpExpensive(). But that heuristic probably never considered the pattern of bools extracted from a vector compare - it seems unlikely that we would want to turn vector logic into control flow.

The motivating x86 reduction case is shown in PR44565:
https://bugs.llvm.org/show_bug.cgi?id=44565
...and that test shows the expected improvement from using pmovmsk codegen.
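For reference, this is a reduced sketch (hypothetical function and value names) of the kind of IR pattern the patch is about: a branch on the logical 'and' of two bools extracted from one vector compare.

```llvm
define void @logic_of_vec_cmp(<2 x double> %a, <2 x double> %b) {
entry:
  %cmp = fcmp oeq <2 x double> %a, %b
  %e0 = extractelement <2 x i1> %cmp, i32 0
  %e1 = extractelement <2 x i1> %cmp, i32 1
  ; Depending on the TLI isJumpExpensive() setting, SelectionDAGBuilder
  ; may split this 'and' into two branches; keeping it whole allows
  ; movmsk-style codegen on x86.
  %and = and i1 %e0, %e1
  br i1 %and, label %t, label %f
t:
  ret void
f:
  ret void
}
```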

For AArch64, I modified the test to include an extra op because the simpler test gets transformed by a codegen invocation of SimplifyCFG. I think what we see currently is an improvement, but it might be better if the 'and' was done on the vector unit. Potentially this could use 'addv' or 'addp' instead?

Diff Detail

Event Timeline

spatel created this revision.Jun 25 2020, 2:14 PM

Would it make sense to treat this as an SLP issue? i.e. we should transform and(extractelement(x,0), extractelement(x,1)) to vector.reduce.and(x)? If we're not extracting two elements from the same vector, the transform doesn't look as good.

On AArch64, we probably want to use addp for the result of a <2 x double> compare, given the limited set of available operations on 64-bit integers, sure. But I doubt it's really that relevant in practice.
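As a hedged sketch, the suggested reduction form of that check would look something like the following (at the time of this review the intrinsic was still spelled llvm.experimental.vector.reduce.and):

```llvm
%cmp = fcmp oeq <2 x double> %a, %b
; Collapse all compare lanes with a single reduction instead of
; extracting the lanes and 'and'ing them as scalars.
%all = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %cmp)
br i1 %all, label %t, label %f
```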

We've tried several times to get 2-element vector support into SLP, and it's always blown up in our faces.

Yes - however, my attempt within SLP failed (D59710). We should now get this simplest case in VectorCombine.
I am still trying to solve another variation of the pattern more generally in VectorCombine with D82474, but cases like this may escape, as shown in that patch description. If the cost model does not account for codegen turning the pattern into movmsk with a scalar compare + branch, it may say the transform is not profitable. So we need to look at larger patterns to see what tricks codegen can perform, but that gets more and more complicated as we lengthen the match in IR.

I think it's fine to limit the bailout here by matching a common source vector, so I can add that clause.

spatel updated this revision to Diff 273693.Jun 26 2020, 5:41 AM
spatel added reviewers: uweigand, jonpa.

Patch updated:

  1. Constrained the match to require extracting from a single source vector.
  2. I missed a SystemZ test change in the earlier draft; adding SystemZ reviewers to confirm if that's an improvement.

Interesting. The compiler now completely optimizes out all operations on one of the two vector lanes, apparently because it recognizes their value is constant and computable at compile time. So that's an improvement (even though not one this test case was expecting to happen ...).

On the other hand, in the operations on the remaining lane, it seems the compiler is now no longer able to perform the known-bits optimization that this test is actually written for -- see the comment that says we should optimize out a redundant AND to get a compare instead of a TM (test-under-mask), but the code now actually does contain a TM (tmll). That seems a regression at first glance.

Thanks. I can't tell if there's some generic transform that would help or if something target-specific is missing.
We have this after type-legalization:

      t50: v4i32 = BUILD_VECTOR undef:i32, t32, undef:i32, undef:i32
    t51: v2i64 = bitcast t50
    t30: v2i64 = BUILD_VECTOR Constant:i64<1>, Constant:i64<1>
  t36: v2i64 = xor t51, t30
t45: v4i32 = bitcast t36

Would converting the xor to v4i32 help? computeKnownBits probably has problems looking through bitcasts.

The bitcasts shouldn't be a problem, but the undefs will cause it to bail if any of those undef elements are demanded.

That would largely obsolete this test, but I think SystemZ should override the TLI hook shouldScalarizeBinop(), which defaults to returning 'false'.

X86 does this:

bool shouldScalarizeBinop(SDValue VecOp) const {
  unsigned Opc = VecOp.getOpcode();

  // Assume target opcodes can't be scalarized.
  // TODO - do we have any exceptions?
  if (Opc >= ISD::BUILTIN_OP_END)
    return false;

  // If the vector op is not supported, try to convert to scalar.
  EVT VecVT = VecOp.getValueType();
  if (!isOperationLegalOrCustomOrPromote(Opc, VecVT))
    return true;

  // If the vector op is supported, but the scalar op is not, the transform may
  // not be worthwhile.
  EVT ScalarVT = VecVT.getScalarType();
  return isOperationLegalOrCustomOrPromote(Opc, ScalarVT);
}

If I add that on top of this patch, this test collapses to:

 $ llc -o - sysz.ll -mtriple=s390x-linux-gnu -mcpu=z13  -jump-is-expensive=1
...
# %bb.0:
	clhhsi	0, 0
	je	.LBB0_2
# %bb.1:
.LBB0_2:
.Lfunc_end0:

If the override seems reasonable, I can post that for review. Here's a draft with current trunk test changes:

diff --git a/llvm/lib/Target/SystemZ/SystemZISelLowering.h b/llvm/lib/Target/SystemZ/SystemZISelLowering.h
index e60deaedbdf..3e1cdba9906 100644
--- a/llvm/lib/Target/SystemZ/SystemZISelLowering.h
+++ b/llvm/lib/Target/SystemZ/SystemZISelLowering.h
@@ -452,6 +452,25 @@ public:
     return VT == MVT::i32 || VT == MVT::i64;
   }
 
+  bool shouldScalarizeBinop(SDValue VecOp) const override {
+    unsigned Opc = VecOp.getOpcode();
+
+    // Assume target opcodes can't be scalarized.
+    // TODO - do we have any exceptions?
+    if (Opc >= ISD::BUILTIN_OP_END)
+      return false;
+
+    // If the vector op is not supported, try to convert to scalar.
+    EVT VecVT = VecOp.getValueType();
+    if (!isOperationLegalOrCustomOrPromote(Opc, VecVT))
+      return true;
+
+    // If the vector op is supported, but the scalar op is not, the transform may
+    // not be worthwhile.
+    EVT ScalarVT = VecVT.getScalarType();
+    return isOperationLegalOrCustomOrPromote(Opc, ScalarVT);
+  }
+
   const char *getTargetNodeName(unsigned Opcode) const override;
   std::pair<unsigned, const TargetRegisterClass *>
   getRegForInlineAsmConstraint(const TargetRegisterInfo *TRI,
diff --git a/llvm/test/CodeGen/SystemZ/knownbits.ll b/llvm/test/CodeGen/SystemZ/knownbits.ll
index 08694d8e699..021c939dcfa 100644
--- a/llvm/test/CodeGen/SystemZ/knownbits.ll
+++ b/llvm/test/CodeGen/SystemZ/knownbits.ll
@@ -9,9 +9,9 @@ define i32 @f0(<4 x i32> %a0) {
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vgbm %v0, 0
 ; CHECK-NEXT:    vceqf %v0, %v24, %v0
-; CHECK-NEXT:    vrepif %v1, 1
-; CHECK-NEXT:    vnc %v0, %v1, %v0
+; CHECK-NEXT:    vno %v0, %v0, %v0
 ; CHECK-NEXT:    vlgvf %r2, %v0, 3
+; CHECK-NEXT:    nilf %r2, 1
 ; CHECK-NEXT:    # kill: def $r2l killed $r2l killed $r2d
 ; CHECK-NEXT:    br %r14
   %cmp0 = icmp ne <4 x i32> %a0, zeroinitializer
diff --git a/llvm/test/CodeGen/SystemZ/vec-trunc-to-i1.ll b/llvm/test/CodeGen/SystemZ/vec-trunc-to-i1.ll
index 278f0bf2a30..a6bc5763b25 100644
--- a/llvm/test/CodeGen/SystemZ/vec-trunc-to-i1.ll
+++ b/llvm/test/CodeGen/SystemZ/vec-trunc-to-i1.ll
@@ -7,13 +7,10 @@ define void @pr32275(<4 x i8> %B15) {
 ; CHECK-LABEL: pr32275:
 ; CHECK:       # %bb.0: # %BB
 ; CHECK-NEXT:    vlgvb %r0, %v24, 3
-; CHECK-NEXT:    vlvgp %v0, %r0, %r0
-; CHECK-NEXT:    vrepif %v1, 1
-; CHECK-NEXT:    vn %v0, %v0, %v1
-; CHECK-NEXT:    vlgvf %r0, %v0, 3
 ; CHECK-NEXT:  .LBB0_1: # %CF34
 ; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    cijlh %r0, 0, .LBB0_1
+; CHECK-NEXT:    tmll %r0, 1
+; CHECK-NEXT:    jne .LBB0_1
 ; CHECK-NEXT:  # %bb.2: # %CF36
 ; CHECK-NEXT:    br %r14
 BB:

I've now updated the knownbits.ll test case to make it less fragile and still test what it is supposed to test (commit e9c6b63). The problem was use of "undef" (which makes it susceptible to collapse as generic optimizations improve), and the fact that knownBits was operating at the very edge of MaxRecursionDepth, so changes in common code could push it above the limit.

Can you retry with current mainline? Hopefully, the test case now no longer changes with your patch.

As to shouldScalarizeBinop, thanks for pointing this out! I agree we probably ought to define this, but I think I'd like to evaluate the changes this is causing to real-world code before checking it in. @jonpa, can you have a look?

spatel updated this revision to Diff 274565.Jun 30 2020, 11:59 AM

Patch updated:
Rebased after rGe9c6b63d4a16c795 - this patch doesn't affect any SystemZ tests now (thanks!).

RKSimon accepted this revision.Jun 30 2020, 11:55 PM

LGTM - cheers

This revision is now accepted and ready to land.Jun 30 2020, 11:55 PM
This revision was automatically updated to reflect the committed changes.