This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG] Remove special call to LHS computeKnownBits for ANDs with constant RHS.
AbandonedPublic

Authored by craig.topper on Apr 5 2017, 1:56 PM.

Details

Reviewers

spatel
RKSimon
rengolin
nemanjai
nhaehnle
aemerson

Summary

This code seems largely unnecessary. Most of what it currently fixes are differences between computeKnownBits and SimplifyDemandedBits in their handling of SETCC and AssertZExt. I've submitted a separate patch to clean that up.

This leaves us with the few test issues seen here.

The AArch64/fast-isel-select.ll test changes because we now try to create -1 as the constant for the XOR when we descend through SimplifyDemandedBits for the LHS of the AND. Previously we descended down the LHS through computeKnownBits, then came back up, determined the AND was useless, and deleted it. Now we go through SimplifyDemandedBits first, and the XOR sees that the upper bits aren't demanded, so it creates a -1. With the normal isel this gets folded into an ORN (or-not). The fast-isel case doesn't go through SimplifyDemandedBits, so it doesn't get optimized. The AND instruction was always there in the fast-isel case before this patch; it just wasn't being checked for by FileCheck.
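The effect on the XOR constant can be illustrated outside LLVM with plain integers. This is only a sketch: the helper names are hypothetical and are not SelectionDAG APIs. It shows that when only bit 0 is demanded, `xor X, 1` and `xor X, -1` produce the same demanded bits, which is why the XOR constant can be widened to -1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an LLVM API): models "and (xor X, XorC), Demanded"
// on a 32-bit value.
constexpr uint32_t xorThenMask(uint32_t X, uint32_t XorC, uint32_t Demanded) {
  return (X ^ XorC) & Demanded;
}

// True if xor-by-1 and xor-by-all-ones agree on every demanded bit of X.
constexpr bool xorConstantsAgree(uint32_t X, uint32_t Demanded) {
  return xorThenMask(X, 1u, Demanded) == xorThenMask(X, ~0u, Demanded);
}

// Only bit 0 demanded: the two constants are interchangeable.
static_assert(xorConstantsAgree(0u, 1u), "bit 0 only");
static_assert(xorConstantsAgree(0xDEADBEEFu, 1u), "bit 0 only");
// Bit 1 also demanded: the constants are no longer interchangeable.
static_assert(!xorConstantsAgree(0u, 3u), "bits 0 and 1");
```

Once the constant is -1 the node is a bitwise-not, which is the shape the ORN pattern matches.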

The PowerPC/rlwimi-and.ll test appears to be a case where there are multiple ANDs separated by other operations in a chain. Previously we removed a late AND in the chain because the earlier AND made it redundant. But now we remove the AND earlier in the chain because we see the later AND made it redundant.
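A plain-integer sketch of how either AND in such a chain can be the redundant one. The masks and the shift are made up for illustration and are not the actual DAG from rlwimi-and.ll.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical chain: an early mask, an intervening shift, then a late mask.
constexpr uint32_t bothAnds(uint32_t X) { return ((X & 0xFFu) >> 4) & 0x0Fu; }

// Drop the late AND: known bits from the early AND already prove that
// everything above bit 3 is zero after the shift.
constexpr uint32_t earlyOnly(uint32_t X) { return (X & 0xFFu) >> 4; }

// Drop the early AND: the late AND only demands bits 4..7 of X, all of which
// the early mask would have kept anyway.
constexpr uint32_t lateOnly(uint32_t X) { return (X >> 4) & 0x0Fu; }

constexpr bool chainsAgree(uint32_t X) {
  return bothAnds(X) == earlyOnly(X) && bothAnds(X) == lateOnly(X);
}

static_assert(chainsAgree(0x12345678u), "all three forms agree");
static_assert(chainsAgree(0xFFFFFFFFu), "all three forms agree");
```

Which AND survives depends only on whether the known-bits argument or the demanded-bits argument is applied first.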

The AMDGPU/fneg.f16.ll test is similar to the PowerPC test in that there are multiple ANDs in a chain. In the original code we removed the late AND and kept an earlier AND. The earlier AND got folded with an anyext load to create a zext load. Now we remove the earlier AND and fail to create the zext load because there is an XOR between the load and the AND. We could recover this if we had a DAG combine to move an AND with a constant above an XOR with a constant when all the bits in the XOR constant are set in the AND constant.
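The identity behind that proposed combine can be checked on plain integers. This is a sketch, not a DAG combine; the function names are hypothetical, and the constants 0x8000 and 0xffff are taken from the CI check lines in fneg.f16.ll.

```cpp
#include <cassert>
#include <cstdint>

// (X ^ XorC) & AndC: the shape where the XOR sits between the load and the AND.
constexpr uint32_t andAfterXor(uint32_t X, uint32_t XorC, uint32_t AndC) {
  return (X ^ XorC) & AndC;
}

// (X & AndC) ^ XorC: the AND moved next to the load.
constexpr uint32_t xorAfterAnd(uint32_t X, uint32_t XorC, uint32_t AndC) {
  return (X & AndC) ^ XorC;
}

// Precondition from the summary: every bit set in XorC is also set in AndC.
constexpr bool xorBitsWithinAnd(uint32_t XorC, uint32_t AndC) {
  return (XorC & ~AndC) == 0;
}

static_assert(xorBitsWithinAnd(0x8000u, 0xFFFFu), "precondition holds");
static_assert(andAfterXor(0x12345678u, 0x8000u, 0xFFFFu) ==
                  xorAfterAnd(0x12345678u, 0x8000u, 0xFFFFu),
              "the two orders agree when the precondition holds");

// Counterexample when the precondition fails (XOR bit 16 is outside the mask).
static_assert(!xorBitsWithinAnd(0x10000u, 0xFFFFu), "precondition fails");
static_assert(andAfterXor(0u, 0x10000u, 0xFFFFu) !=
                  xorAfterAnd(0u, 0x10000u, 0xFFFFu),
              "the two orders differ without the precondition");
```

In general (X ^ C1) & C2 equals (X & C2) ^ (C1 & C2), so the precondition is exactly what lets the XOR constant stay unchanged.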

Diff Detail

Event Timeline

craig.topper created this revision. Apr 5 2017, 1:56 PM

Herald added subscribers: tpr, nhaehnle, nemanjai and 2 others. Apr 5 2017, 1:56 PM

Ping

Adding target specialists as reviewers

nemanjai added inline comments. May 4 2017, 5:22 PM

test/CodeGen/PowerPC/rlwimi-and.ll
33	It's hard to tell exactly what's going on without context (i.e. being able to produce the code with this patch). I tried applying this patch to do so but it appears this patch may depend on the previous patches (not upstream yet). Semantically these instruction sequences are the same. Take bits 23 and 31 from two different inputs (with the former input being shifted right 8 bits). Performance wise there isn't any real difference either. So this LGTM.

craig.topper added inline comments. May 4 2017, 5:30 PM

test/CodeGen/PowerPC/rlwimi-and.ll
33	I think I may need to rebase the patch due to a lot of KnownBits-related changes lately. I'll do that tonight or tomorrow.

Rebase for recent changes.

I got a new regression failure in X86 for combine-and.ll. I suspect there's some missing known bits support on some target node, but I haven't looked into it yet.

Herald added a subscriber: javed.absar. May 10 2017, 12:43 PM

Correction, the combine-and problem seems to be because SimplifyDemandedBits only considers scalar constants for shift amount, not constant splats from a build_vector.
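The redundancy of the extra `pand` in ashr_mask1_v8i16 can be seen on a single 16-bit lane. The sketch below is scalar C++ with hypothetical helper names, not the vector code or any LLVM API: masking an arithmetic shift by 15 with 1 gives the same value as a logical shift by 15.

```cpp
#include <cassert>
#include <cstdint>

// One 16-bit lane of "and (ashr X, 15), 1". The sign fill is written out by
// hand so the sketch does not depend on signed right-shift behavior.
constexpr uint16_t ashrThenMask(uint16_t X) {
  return static_cast<uint16_t>(((X & 0x8000u) ? 0xFFFFu : 0x0000u) & 1u);
}

// One 16-bit lane of "lshr X, 15", which is what psrlw $15 computes.
constexpr uint16_t lshrOnly(uint16_t X) {
  return static_cast<uint16_t>(X >> 15);
}

// Exhaustively compares the two forms over every 16-bit input.
inline bool lanesAgreeForAllInputs() {
  for (uint32_t V = 0; V <= 0xFFFFu; ++V) {
    const uint16_t X = static_cast<uint16_t>(V);
    if (ashrThenMask(X) != lshrOnly(X))
      return false;
  }
  return true;
}
```

Proving this inside SimplifyDemandedBits requires reading the shift amount, which here is a splat build_vector rather than a scalar constant.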

In D31724#751377, @craig.topper wrote:

Correction, the combine-and problem seems to be because SimplifyDemandedBits only considers scalar constants for shift amount, not constant splats from a build_vector.

This is an efficiency improvement, but shouldn't we fix that first to avoid the regression?

Yeah I think we should. That failure only showed up when I was rebasing. Is that something I should look into?

In D31724#753822, @craig.topper wrote:

Yeah I think we should. That failure only showed up when I was rebasing. Is that something I should look into?

Sure. I fixed some of the opcodes to be splat-aware a few weeks ago, but I'll take any help I can get to complete the job. :)
Beware: I think I saw a miscompile from one of those opcodes when I changed it to use 'isConstOrConstSplat'. There may be something ugly going on here. I'll see if I left any notes for myself about what went wrong.

craig.topper planned changes to this revision. May 22 2017, 11:50 AM

aemerson resigned from this revision. Sep 13 2017, 11:49 AM

craig.topper mentioned this in D38967: [SelectionDAG] Don't subject ISD:Constant to the depth limit in TargetLowering::SimplifyDemandedBits.. Oct 16 2017, 11:55 AM

Diffusion mentioned this in rL316255: [SelectionDAG] Don't subject ISD:Constant to the depth limit in TargetLowering…. Oct 20 2017, 7:27 PM

@craig.topper Is this patch still relevant?

Herald added subscribers: jsji, kristof.beyls, jvesely. Jan 2 2019, 9:41 AM

It's producing more test changes now on X86 and other targets. It's producing several regressions in the combine-sdiv.ll test by failing to delete some AND instructions. The other X86 changes look pretty neutral. I haven't looked at the other targets yet.

@craig.topper Is this still relevant? At least some of these changes have been fixed by improvements to SimplifyDemandedBits.

Herald added subscribers: ychen, steven.zhang, wuzish, MaskRay. Sep 6 2019, 7:06 AM

I still think it's odd that we do something different than what InstCombine does, but it's probably not worth fixing.

Revision Contents

Path                                           Size
lib/CodeGen/SelectionDAG/TargetLowering.cpp    46 lines
test/CodeGen/AArch64/fast-isel-select.ll       15 lines
test/CodeGen/AMDGPU/fneg.f16.ll                3 lines
test/CodeGen/PowerPC/rlwimi-and.ll             4 lines
test/CodeGen/X86/combine-and.ll                1 line

Diff 98505

lib/CodeGen/SelectionDAG/TargetLowering.cpp

@@ ... @@ for (SDValue SrcOp : Op->ops()) {
       // Known bits are the values that are shared by every element.
       // TODO: support per-element known bits.
       Known.One &= Known2.One;
       Known.Zero &= Known2.Zero;
     }
     return false; // Don't fall through, will infinitely loop.
   case ISD::AND:
-    // If the RHS is a constant, check to see if the LHS would be zero without
-    // using the bits from the RHS. Below, we use knowledge about the RHS to
-    // simplify the LHS, here we're using information from the LHS to simplify
-    // the RHS.
-    if (ConstantSDNode *RHSC = isConstOrConstSplat(Op.getOperand(1))) {
-      SDValue Op0 = Op.getOperand(0);
-      KnownBits LHSKnown;
-      // Do not increment Depth here; that can cause an infinite loop.
-      TLO.DAG.computeKnownBits(Op0, LHSKnown, Depth);
-      // If the LHS already has zeros where RHSC does, this and is dead.
-      if ((LHSKnown.Zero & NewMask) == (~RHSC->getAPIntValue() & NewMask))
-        return TLO.CombineTo(Op, Op0);
-
-      // If any of the set bits in the RHS are known zero on the LHS, shrink
-      // the constant.
-      if (ShrinkDemandedConstant(Op, ~LHSKnown.Zero & NewMask, TLO))
-        return true;
-
-      // Bitwise-not (xor X, -1) is a special case: we don't usually shrink its
-      // constant, but if this 'and' is only clearing bits that were just set by
-      // the xor, then this 'and' can be eliminated by shrinking the mask of
-      // the xor. For example, for a 32-bit X:
-      // and (xor (srl X, 31), -1), 1 --> xor (srl X, 31), 1
-      if (isBitwiseNot(Op0) && Op0.hasOneUse() &&
-          LHSKnown.One == ~RHSC->getAPIntValue()) {
-        SDValue Xor = TLO.DAG.getNode(ISD::XOR, dl, Op.getValueType(),
-                                      Op0.getOperand(0), Op.getOperand(1));
-        return TLO.CombineTo(Op, Xor);
-      }
-    }
-
     if (SimplifyDemandedBits(Op.getOperand(1), NewMask, Known, TLO, Depth+1))
       return true;
     assert((Known.Zero & Known.One) == 0 && "Bits known to be one AND zero?");
     if (SimplifyDemandedBits(Op.getOperand(0), ~Known.Zero & NewMask,
                              Known2, TLO, Depth+1))
       return true;
     assert((Known2.Zero & Known2.One) == 0 && "Bits known to be one AND zero?");
 
     // If all of the demanded bits are known one on one side, return the other.
     // These bits cannot contribute to the result of the 'and'.
     if (NewMask.isSubsetOf(Known2.Zero | Known.One))
       return TLO.CombineTo(Op, Op.getOperand(0));
     if (NewMask.isSubsetOf(Known.Zero | Known2.One))
       return TLO.CombineTo(Op, Op.getOperand(1));
     // If all of the demanded bits in the inputs are known zeros, return zero.
     if (NewMask.isSubsetOf(Known.Zero | Known2.Zero))
       return TLO.CombineTo(Op, TLO.DAG.getConstant(0, dl, Op.getValueType()));
     // If the RHS is a constant, see if we can simplify it.
     if (ShrinkDemandedConstant(Op, ~Known2.Zero & NewMask, TLO))
       return true;
     // If the operation can be done in a smaller type, do so.
     if (ShrinkDemandedOp(Op, BitWidth, NewMask, TLO))
       return true;
 
+    if (ConstantSDNode *RHSC = dyn_cast<ConstantSDNode>(Op.getOperand(1))) {
+      SDValue Op0 = Op.getOperand(0);
+      // Bitwise-not (xor X, -1) is a special case: we don't usually shrink its
+      // constant, but if this 'and' is only clearing bits that were just set by
+      // the xor, then this 'and' can be eliminated by shrinking the mask of
+      // the xor. For example, for a 32-bit X:
+      // and (xor (srl X, 31), -1), 1 --> xor (srl X, 31), 1
+      if (isBitwiseNot(Op0) && Op0.hasOneUse() &&
+          Known2.One == ~RHSC->getAPIntValue()) {
+        SDValue Xor = TLO.DAG.getNode(ISD::XOR, dl, Op.getValueType(),
+                                      Op0.getOperand(0), Op.getOperand(1));
+        return TLO.CombineTo(Op, Xor);
+      }
+    }
+
     // Output known-1 bits are only known if set in both the LHS & RHS.
     Known.One &= Known2.One;
     // Output known-0 are known to be clear if zero in either the LHS | RHS.
     Known.Zero |= Known2.Zero;
     break;
   case ISD::OR:
     if (SimplifyDemandedBits(Op.getOperand(1), NewMask, Known, TLO, Depth+1))
       return true;

test/CodeGen/AArch64/fast-isel-select.ll

-; RUN: llc -mtriple=aarch64-apple-darwin -verify-machineinstrs < %s | FileCheck %s
-; RUN: llc -mtriple=aarch64-apple-darwin -fast-isel -fast-isel-abort=1 -verify-machineinstrs < %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-apple-darwin -verify-machineinstrs < %s | FileCheck %s --check-prefix=CHECK --check-prefix=SLOWISEL
+; RUN: llc -mtriple=aarch64-apple-darwin -fast-isel -fast-isel-abort=1 -verify-machineinstrs < %s | FileCheck %s --check-prefix=CHECK -check-prefix=FASTISEL
 
 ; First test the different supported value types for select.
 define zeroext i1 @select_i1(i1 zeroext %c, i1 zeroext %a, i1 zeroext %b) {
 ; CHECK-LABEL: select_i1
 ; CHECK: {{cmp w0, #0|tst w0, #0x1}}
 ; CHECK-NEXT: csel {{w[0-9]+}}, w1, w2, ne
   %1 = select i1 %c, i1 %a, i1 %b
   ret i1 %1
@@ ... @@
 define zeroext i1 @select_opt1(i1 zeroext %c, i1 zeroext %a) {
 ; CHECK-LABEL: select_opt1
 ; CHECK: orr {{w[0-9]+}}, w0, w1
   %1 = select i1 %c, i1 true, i1 %a
   ret i1 %1
 }
 
 define zeroext i1 @select_opt2(i1 zeroext %c, i1 zeroext %a) {
-; CHECK-LABEL: select_opt2
-; CHECK: eor [[REG:w[0-9]+]], w0, #0x1
-; CHECK: orr {{w[0-9]+}}, [[REG]], w1
+; SLOWISEL-LABEL: select_opt2
+; SLOWISEL: orn [[REG:w[0-9]+]], w1, w0
+; SLOWISEL: and {{w[0-9]+}}, [[REG]], #0x1
+;
+; FASTISEL-LABEL: select_opt2
+; FASTISEL: eor [[REG:w[0-9]+]], w0, #0x1
+; FASTISEL: orr [[REG2:w[0-9]+]], [[REG]], w1
+; FASTISEL: and {{w[0-9]+}}, [[REG2]], #0x1
   %1 = select i1 %c, i1 %a, i1 true
   ret i1 %1
 }
 
 define zeroext i1 @select_opt3(i1 zeroext %c, i1 zeroext %a) {
 ; CHECK-LABEL: select_opt3
 ; CHECK: bic {{w[0-9]+}}, w1, w0
   %1 = select i1 %c, i1 false, i1 %a

test/CodeGen/AMDGPU/fneg.f16.ll

@@ ... @@ define amdgpu_kernel void @v_fneg_fold_f16(half addrspace(1)* %out, half addrspace(1)* %in) #0 {
   store half %fmul, half addrspace(1)* %out
   ret void
 }
 
 ; FIXME: Terrible code with VI and even worse with SI/CI
 ; GCN-LABEL: {{^}}s_fneg_v2f16:
 ; CI: s_mov_b32 [[MASK:s[0-9]+]], 0x8000{{$}}
 ; CI: v_xor_b32_e32 v{{[0-9]+}}, [[MASK]], v{{[0-9]+}}
-; CI: v_lshlrev_b32_e32 v{{[0-9]+}}, 16, v{{[0-9]+}}
+; CI: v_and_b32_e32 v{{[0-9]+}}, 0xffff, v{{[0-9]+}}
 ; CI: v_xor_b32_e32 v{{[0-9]+}}, [[MASK]], v{{[0-9]+}}
+; CI: v_lshlrev_b32_e32 v{{[0-9]+}}, 16, v{{[0-9]+}}
 ; CI: v_or_b32_e32
 
 ; VI: v_mov_b32_e32 [[MASK:v[0-9]+]], 0x8000{{$}}
 ; VI-DAG: v_xor_b32_sdwa v{{[0-9]+}}, v{{[0-9]+}}, [[MASK]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; VI-DAG: v_xor_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}, [[MASK]]
 
 ; GFX9: v_xor_b32_e32 v{{[0-9]+}}, 0x80008000, v{{[0-9]+}}

test/CodeGen/PowerPC/rlwimi-and.ll

@@ ... @@ codeRepl17: ; preds = %codeRepl4
   %4 = and i8 %3, 1
   %not.tobool.i.1.i.i = icmp eq i8 %4, 0
   %rvml38.sroa.1.1.insert.ext = select i1 %not.tobool.i.1.i.i, i16 0, i16 1
   %rvml38.sroa.0.0.insert.insert = or i16 %rvml38.sroa.1.1.insert.ext, %2
   store i16 %rvml38.sroa.0.0.insert.insert, i16* undef, align 2
   unreachable
 
 ; CHECK: @test
-; CHECK: clrlwi [[R1:[0-9]+]], {{[0-9]+}}, 31
-; CHECK: rlwimi [[R1]], {{[0-9]+}}, 8, 23, 23
+; CHECK: rlwimi [[R1:[0-9]+]], {{[0-9]+}}, 8, 16, 23
+; CHECK: andi. {{[0-9]+}}, [[R1]], 257
 
 codeRepl29: ; preds = %codeRepl1
   unreachable
 
 codeRepl31: ; preds = %entry
   ret void
 }

test/CodeGen/X86/combine-and.ll

@@ ... @@
 ;
 ; known sign bits folding
 ;
 
 define <8 x i16> @ashr_mask1_v8i16(<8 x i16> %a0) {
 ; CHECK-LABEL: ashr_mask1_v8i16:
 ; CHECK: # BB#0:
 ; CHECK-NEXT: psrlw $15, %xmm0
+; CHECK-NEXT: pand {{.*}}(%rip), %xmm0
 ; CHECK-NEXT: retq
   %1 = ashr <8 x i16> %a0, <i16 15, i16 15, i16 15, i16 15, i16 15, i16 15, i16 15, i16 15>
   %2 = and <8 x i16> %1, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
   ret <8 x i16> %2
 }
 
 define <4 x i32> @ashr_mask7_v4i32(<4 x i32> %a0) {
 ; CHECK-LABEL: ashr_mask7_v4i32:
 ; CHECK: # BB#0:
 ; CHECK-NEXT: psrad $31, %xmm0
 ; CHECK-NEXT: psrld $29, %xmm0
 ; CHECK-NEXT: retq
   %1 = ashr <4 x i32> %a0, <i32 31, i32 31, i32 31, i32 31>
   %2 = and <4 x i32> %1, <i32 7, i32 7, i32 7, i32 7>
   ret <4 x i32> %2
 }