
Details

Reviewers

dmgreen
SjoerdMeijer
samparker
simon_tatham
ostannard

Commits

rGd5eb7ffa3372: [Target][ARM] Fold or(A, B) more aggressively for I1 vectors

Summary

This is a change to the ARM backend that folds or(A, B) into not(and(not(A), not(B))) more often.
This only affects vectors of i1 (v4i1, v8i1 and v16i1), because PerformORCombine only calls PerformORCombine_i1 when the following conditions are met:

if (Subtarget->hasMVEIntegerOps() &&
    (VT == MVT::v4i1 || VT == MVT::v8i1 || VT == MVT::v16i1))
  return PerformORCombine_i1(N, DCI, Subtarget);

This generates better code in my tests, as NOT and AND are essentially free compared to OR when manipulating the VPR register:

AND becomes a VPT block (no extra instruction).
NOT becomes a vpnot, which is often removed by the MVE VPT Block Insertion pass to create VPT blocks (no extra instructions).
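As a quick sanity check of the identity behind the fold, here is a minimal standalone sketch. It models a v16i1 predicate as a 16-bit mask, which is an assumption made for illustration only; the names Pred16, orDirect, orViaNotAnd and formsAgreeOnAllLaneInputs are hypothetical and are not part of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Toy model (an assumption for illustration): a v16i1 predicate is a 16-bit
// mask with one bit per lane.
using Pred16 = std::uint16_t;

// Direct form: or(A, B).
inline Pred16 orDirect(Pred16 A, Pred16 B) {
  return static_cast<Pred16>(A | B);
}

// Folded form: not(and(not(A), not(B))).
inline Pred16 orViaNotAnd(Pred16 A, Pred16 B) {
  const Pred16 NotA = static_cast<Pred16>(~A);
  const Pred16 NotB = static_cast<Pred16>(~B);
  return static_cast<Pred16>(~(NotA & NotB));
}

// Lanes are independent, so the masks 0b0011 and 0b0101 cover all four
// per-lane input combinations (0/0, 0/1, 1/0, 1/1).
inline bool formsAgreeOnAllLaneInputs() {
  const Pred16 A = 0x3, B = 0x5;
  return orDirect(A, B) == orViaNotAnd(A, B);
}
```

The sketch only shows that the two forms compute the same predicate; whether the folded form is cheaper depends on how the NOTs and the AND are lowered, which is what the rest of this review discusses.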

However, I'm not fully sure it's a good change, and I believe the implementation could be better, which is why I need some help to review this.

Diff Detail

Event Timeline

Pierre-vh created this revision. Apr 1 2020, 2:20 AM

Herald added a project: Restricted Project. Apr 1 2020, 2:20 AM

Herald added subscribers: llvm-commits, danielkiss, hiraditya, kristof.beyls.

Pierre-vh added a parent revision: D77201: [CodeGen][SelectionDAG] Flip Booleans More Often. Apr 1 2020, 2:21 AM

Pierre-vh added reviewers: SjoerdMeijer, samparker, simon_tatham, ostannard. Apr 1 2020, 2:29 AM

Harbormaster failed remote builds in B51262: Diff 254124! Apr 1 2020, 2:46 AM

I have no thoughts on the patch itself, but the commit message looks quite alarming out of context. Perhaps it should mention that you're doing this specifically for i1 and vectors of i1, and not for bitwise OR of ordinary integers?

In D77202#1954493, @simon_tatham wrote:

I have no thoughts on the patch itself, but the commit message looks quite alarming out of context. Perhaps it should mention that you're doing this specifically for i1 and vectors of i1, and not for bitwise OR of ordinary integers?

Sure, I've updated the commit message and the description of the change.

The code that already exists in the function you are changing is essentially doing the same thing as what you add here, just in a more constrained set of circumstances. It says that if the operands are obviously invertible, then the VPNOT will be really free and we can go ahead and invert the and to an or. The question is whether that is true for all cases. For the test cases you have here the operands are easily invertible, but that won't be true for everything.

With VPT blocks the NOTs will also be free in a lot of cases, but again not all, and it's difficult to see where in practice it would be better.

I think that we should either always be doing this, or doing this when the operands are (recursively) freely invertible. I'm not sure which will be better in practice.
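One possible reading of "(recursively) freely invertible" can be sketched on a toy expression tree. This is a standalone illustration, not LLVM code: the Kind enum, the Node type and the cost assumptions (a compare can flip its condition, a NOT cancels, an OR becomes a cheap AND when both operands invert freely, an AND would turn into an expensive OR) are all hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical node kinds for a toy predicate expression tree.
enum class Kind { Compare, Opaque, Not, Or, And };

struct Node {
  Kind K;
  std::vector<std::shared_ptr<Node>> Ops;
};
using NodePtr = std::shared_ptr<Node>;

inline NodePtr make(Kind K, std::vector<NodePtr> Ops = {}) {
  return std::make_shared<Node>(Node{K, std::move(Ops)});
}

// Returns true if inverting N is assumed to cost no extra instruction.
inline bool isFreelyInvertible(const Node &N) {
  switch (N.K) {
  case Kind::Compare:
    return true;  // assumed: the condition code can be flipped
  case Kind::Opaque:
    return false; // e.g. a predicate of unknown origin; needs an explicit NOT
  case Kind::Not:
    return true;  // inverting simply drops the NOT
  case Kind::Or:
    // De Morgan: not(or(A, B)) == and(not(A), not(B)); free only if every
    // operand is itself freely invertible.
    for (const NodePtr &Op : N.Ops)
      if (!isFreelyInvertible(*Op))
        return false;
    return true;
  case Kind::And:
    return false; // would produce an OR, which this toy model treats as costly
  }
  return false;
}
```

Under these assumptions, or(compare, not(x)) counts as freely invertible while or(compare, x) does not when x is an opaque predicate.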

Rebased the patch

Pierre-vh added a parent revision: D77712: [Target][ARM] Add PerformVSELECTCombine for MVE Integer Ops. Apr 8 2020, 12:29 AM

Pierre-vh removed a parent revision: D77201: [CodeGen][SelectionDAG] Flip Booleans More Often.

Pierre-vh added a child revision: D78201: [Target][ARM] Replace outdated getARMVPTBlockMask function. Apr 15 2020, 6:30 AM

I reworked the implementation of the patch; it should be cleaner now.

The tests here look like an improvement, but I'm not sure this would be true for all cases.

The old code said something like "if we know that both operands can be inverted, invert the OR."
The new code looks more like "if we know the operands can be inverted, invert them. Else if we know either can't be inverted, don't invert. Else, if we know _nothing_ about the operands, invert them and hope for the best".
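The two paraphrased policies can be written down as a small decision function. This is only an encoding of the description above, not the code from either revision of the patch; the Invertibility enum and both function names are hypothetical.

```cpp
#include <cassert>

// What is known about one operand of the OR (hypothetical tri-state).
enum class Invertibility { Yes, No, Unknown };

// Old behaviour as paraphrased: fold only when both operands are known to be
// invertible.
inline bool oldPolicyFolds(Invertibility A, Invertibility B) {
  return A == Invertibility::Yes && B == Invertibility::Yes;
}

// Reworked behaviour as paraphrased: refuse only when an operand is known
// not to be invertible; otherwise fold, including when nothing is known.
inline bool reworkedPolicyFolds(Invertibility A, Invertibility B) {
  if (A == Invertibility::No || B == Invertibility::No)
    return false;
  return true;
}
```

The difference shows up in the Unknown cases: the reworked policy folds or(Unknown, Unknown), which is the "hope for the best" situation the comment is worried about.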

If we are going to go this route (if we think it's generally profitable; it's hard to tell with how many other little problems we have in predicate code generation), I think it might make more sense to improve the checks for when we invert or not. We could just always try to invert it, but I doubt that works very well. What do you think about testing if at least one side of the 'and' can be inverted? I think that might be enough to justify the transform. That should at least either remove a NOT or convert the AND to an OR. And so long as one of the sides is a VCMP, we will be able to fold the and into the compare.

It doesn't look like we have a fold for "(not vcmp) -> vcmp". It might be better to have PerformORCombine_i1 just produce not(and(not, not)) (providing it looks profitable) and have other combines fold the NOTs into vcmps/anything else.

Updated the patch: now the transformation only happens if one of the operands is a condition that can be immediately inverted.
It isn't as good as the other version (in terms of improvements) but it's safer (there is less risk of generating terrible code in some situations).

This looks like an improvement on its own, but I think it would be cleaner if there was a different fold for doing (not vcmp) -> vcmp, so that it needn't be done here. Then we can convert to not(and(not, not)) here. Also we can handle swapped operands then too.

Moved the (not(vcmp)) -> !vcmp fold to PerformXORCombine
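The idea of the (not(vcmp)) fold can be sketched as a standalone toy model: flip the condition code, and only fold if the flipped condition is one the compare can encode. The Cond enum, the function names and most of the encodable set are assumptions for illustration; only the "HI is valid for integers only, everything else defaults to invalid" part is visible in the diff context for isValidMVECond below. As in the patch's lambda, AL is used as the "no fold" result.

```cpp
#include <cassert>

// Toy condition codes; AL doubles as "no valid condition" in this sketch.
enum class Cond { EQ, NE, HS, LO, HI, LS, GE, LT, GT, LE, AL };

// Logical opposite of each condition.
inline Cond opposite(Cond C) {
  switch (C) {
  case Cond::EQ: return Cond::NE;
  case Cond::NE: return Cond::EQ;
  case Cond::HS: return Cond::LO;
  case Cond::LO: return Cond::HS;
  case Cond::HI: return Cond::LS;
  case Cond::LS: return Cond::HI;
  case Cond::GE: return Cond::LT;
  case Cond::LT: return Cond::GE;
  case Cond::GT: return Cond::LE;
  case Cond::LE: return Cond::GT;
  case Cond::AL: return Cond::AL; // no opposite in this toy model
  }
  return Cond::AL;
}

// Assumed set of conditions a vector compare can encode.
inline bool isEncodable(Cond C, bool IsFloat) {
  switch (C) {
  case Cond::EQ: case Cond::NE: case Cond::GE:
  case Cond::LT: case Cond::GT: case Cond::LE:
    return true;
  case Cond::HS: case Cond::HI:
    return !IsFloat; // unsigned conditions only make sense for integers
  default:
    return false;
  }
}

// Condition for the folded compare, or AL when an explicit NOT must stay.
inline Cond invertedCompareCond(Cond C, bool IsFloat) {
  const Cond Opp = opposite(C);
  return isEncodable(Opp, IsFloat) ? Opp : Cond::AL;
}
```

In this model an eq compare folds to ne, while cs (HS) and hi have no encodable opposite, which is consistent with the updated tests keeping a vpnot after vcmp.u32 cs and vcmp.u32 hi.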

Nice. One last round of me nitpicking details I think.

What happened to the test changes in mve-pred-or.ll?

llvm/lib/Target/ARM/ARMISelLowering.cpp
12687	This can just be `return (ARMCC::CondCodes)N->getConstantOperandVal(2)`
12699	N->getOperand(0).getValueType(),...
12718–12719	Maybe make an IsFreelyInvertable() function/lambda. Then this will just be if (IsFreelyInvertable(N0) || IsFreelyInvertable(N1)). We can then add things like swapping operands to it, if we teach it those tricks.
12879	Just create an SDLoc for N0. Same above in the other function.

Refactorings (see comments marked "done")
Fold even when only one side is free to invert. This brings back the mve-pred-or changes.
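The resulting rule (fold as soon as one side is free to invert) can be illustrated with a toy rewriter over strings. This is a sketch under stated assumptions, not the SelectionDAG code: the Operand struct, its Inverted field and the textual output format are hypothetical.

```cpp
#include <cassert>
#include <string>

// Toy operand: Inverted holds a cheap inverted form, or is empty when the
// only way to invert the operand is an explicit NOT.
struct Operand {
  std::string Text;
  std::string Inverted;
};

inline bool isFreelyInvertible(const Operand &Op) {
  return !Op.Inverted.empty();
}

// Inverted form of an operand, using an explicit not(...) as a fallback.
inline std::string invert(const Operand &Op) {
  return isFreelyInvertible(Op) ? Op.Inverted : "not(" + Op.Text + ")";
}

// Fold or(A, B) into not(and(~A, ~B)) when at least one side inverts freely;
// otherwise leave the OR untouched.
inline std::string foldOr(const Operand &A, const Operand &B) {
  if (!isFreelyInvertible(A) && !isFreelyInvertible(B))
    return "or(" + A.Text + ", " + B.Text + ")";
  return "not(and(" + invert(A) + ", " + invert(B) + "))";
}
```

For example, with A = {"eq(a)", "ne(a)"} and B = {"cs(b)", ""}, foldOr yields not(and(ne(a), not(cs(b)))), which has the same shape as the vcmp cs / vpnot / vcmpt ne sequence in the updated mve-pred-or.ll checks.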

Thanks. LGTM, with one extra comment.

llvm/lib/Target/ARM/ARMISelLowering.cpp
12686	These can use DL too. The not is in a way coming from the Or.

This revision is now accepted and ready to land. May 4 2020, 11:42 PM

Pierre-vh updated this revision to Diff 262018. May 5 2020, 1:11 AM

Pierre-vh marked an inline comment as done.

Pierre-vh removed a child revision: D78201: [Target][ARM] Replace outdated getARMVPTBlockMask function.

Closed by commit rGd5eb7ffa3372: [Target][ARM] Fold or(A, B) more aggressively for I1 vectors (authored by Pierre-vh). May 5 2020, 2:07 AM

This revision was automatically updated to reflect the committed changes.

Diff 260252

llvm/lib/Target/ARM/ARMISelLowering.cpp


[12,677 lines not shown]	case ARMCC::HI:
return !IsFloat;		return !IsFloat;
default:		default:
return false;		return false;
};		};
}		}

static SDValue PerformORCombine_i1(SDNode *N,		static SDValue PerformORCombine_i1(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,		TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {		const ARMSubtarget *Subtarget) {
// Try to invert "or A, B" -> "and ~A, ~B", as the "and" is easier to chain		// Try to invert "or A, B" -> "and ~A, ~B", as the "and" is easier to chain
// together with predicates		// together with predicates
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);

ARMCC::CondCodes CondCode0 = ARMCC::AL;		auto getOppositeCondition = [](SDValue Value, unsigned Idx) {
ARMCC::CondCodes CondCode1 = ARMCC::AL;		const ConstantSDNode *Const =
		cast<const ConstantSDNode>(Value->getOperand(Idx));
		ARMCC::CondCodes Result =
		ARMCC::getOppositeCondition((ARMCC::CondCodes)Const->getZExtValue());
		if (isValidMVECond(Result,
		Value->getOperand(0)->getValueType(0).isFloatingPoint()))
		return Result;
		return ARMCC::AL;
		};

		ARMCC::CondCodes Opposite0 = ARMCC::AL;
if (N0->getOpcode() == ARMISD::VCMP)		if (N0->getOpcode() == ARMISD::VCMP)
CondCode0 = (ARMCC::CondCodes)cast<const ConstantSDNode>(N0->getOperand(2))		Opposite0 = getOppositeCondition(N0, 2);
->getZExtValue();
else if (N0->getOpcode() == ARMISD::VCMPZ)		else if (N0->getOpcode() == ARMISD::VCMPZ)
CondCode0 = (ARMCC::CondCodes)cast<const ConstantSDNode>(N0->getOperand(1))		Opposite0 = getOppositeCondition(N0, 1);
->getZExtValue();
		ARMCC::CondCodes Opposite1 = ARMCC::AL;
if (N1->getOpcode() == ARMISD::VCMP)		if (N1->getOpcode() == ARMISD::VCMP)
CondCode1 = (ARMCC::CondCodes)cast<const ConstantSDNode>(N1->getOperand(2))		Opposite1 = getOppositeCondition(N1, 2);
->getZExtValue();
else if (N1->getOpcode() == ARMISD::VCMPZ)		else if (N1->getOpcode() == ARMISD::VCMPZ)
CondCode1 = (ARMCC::CondCodes)cast<const ConstantSDNode>(N1->getOperand(1))		Opposite1 = getOppositeCondition(N1, 1);
->getZExtValue();

if (CondCode0 == ARMCC::AL || CondCode1 == ARMCC::AL)		if (Opposite0 == ARMCC::AL && Opposite1 == ARMCC::AL)
return SDValue();		return SDValue();

unsigned Opposite0 = ARMCC::getOppositeCondition(CondCode0);		SDValue NewN0, NewN1;
unsigned Opposite1 = ARMCC::getOppositeCondition(CondCode1);		if (Opposite0 != ARMCC::AL) {
		SmallVector<SDValue, 4> Ops;
if (!isValidMVECond(Opposite0,		Ops.push_back(N0->getOperand(0));
N0->getOperand(0)->getValueType(0).isFloatingPoint()) ||
!isValidMVECond(Opposite1,
N1->getOperand(0)->getValueType(0).isFloatingPoint()))
return SDValue();

SmallVector<SDValue, 4> Ops0;
Ops0.push_back(N0->getOperand(0));
if (N0->getOpcode() == ARMISD::VCMP)		if (N0->getOpcode() == ARMISD::VCMP)
Ops0.push_back(N0->getOperand(1));		Ops.push_back(N0->getOperand(1));
Ops0.push_back(DCI.DAG.getConstant(Opposite0, SDLoc(N0), MVT::i32));		Ops.push_back(DCI.DAG.getConstant(Opposite0, SDLoc(N0), MVT::i32));
SmallVector<SDValue, 4> Ops1;		NewN0 = DCI.DAG.getNode(N0->getOpcode(), SDLoc(N0), VT, Ops);
Ops1.push_back(N1->getOperand(0));		} else
		NewN0 = DCI.DAG.getLogicalNOT({N0}, N0, VT);

		if (Opposite1 != ARMCC::AL) {
		SmallVector<SDValue, 4> Ops;
		Ops.push_back(N1->getOperand(0));
if (N1->getOpcode() == ARMISD::VCMP)		if (N1->getOpcode() == ARMISD::VCMP)
Ops1.push_back(N1->getOperand(1));		Ops.push_back(N1->getOperand(1));
Ops1.push_back(DCI.DAG.getConstant(Opposite1, SDLoc(N1), MVT::i32));		Ops.push_back(DCI.DAG.getConstant(Opposite1, SDLoc(N1), MVT::i32));
		NewN1 = DCI.DAG.getNode(N1->getOpcode(), SDLoc(N1), VT, Ops);
		} else
		NewN1 = DCI.DAG.getLogicalNOT({N1}, N1, VT);

SDValue NewN0 = DCI.DAG.getNode(N0->getOpcode(), SDLoc(N0), VT, Ops0);
SDValue NewN1 = DCI.DAG.getNode(N1->getOpcode(), SDLoc(N1), VT, Ops1);
SDValue And = DCI.DAG.getNode(ISD::AND, SDLoc(N), VT, NewN0, NewN1);		SDValue And = DCI.DAG.getNode(ISD::AND, SDLoc(N), VT, NewN0, NewN1);
return DCI.DAG.getNode(ISD::XOR, SDLoc(N), VT, And,		return DCI.DAG.getNode(ISD::XOR, SDLoc(N), VT, And,
DCI.DAG.getAllOnesConstant(SDLoc(N), VT));		DCI.DAG.getAllOnesConstant(SDLoc(N), VT));
}		}

/// PerformORCombine - Target-specific dag combine xforms for ISD::OR		/// PerformORCombine - Target-specific dag combine xforms for ISD::OR
static SDValue PerformORCombine(SDNode *N,		static SDValue PerformORCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,		TargetLowering::DAGCombinerInfo &DCI,
[123 lines not shown]	static SDValue ParseBFI(SDNode *N, APInt &ToMask, APInt &FromMask) {

SDValue From = N->getOperand(1);		SDValue From = N->getOperand(1);
ToMask = ~cast<ConstantSDNode>(N->getOperand(2))->getAPIntValue();		ToMask = ~cast<ConstantSDNode>(N->getOperand(2))->getAPIntValue();
FromMask = APInt::getLowBitsSet(ToMask.getBitWidth(), ToMask.countPopulation());		FromMask = APInt::getLowBitsSet(ToMask.getBitWidth(), ToMask.countPopulation());

// If the Base came from a SHR #C, we can deduce that it is really testing bit		// If the Base came from a SHR #C, we can deduce that it is really testing bit
// #C in the base of the SHR.		// #C in the base of the SHR.
if (From->getOpcode() == ISD::SRL &&		if (From->getOpcode() == ISD::SRL &&
isa<ConstantSDNode>(From->getOperand(1))) {		isa<ConstantSDNode>(From->getOperand(1))) {
APInt Shift = cast<ConstantSDNode>(From->getOperand(1))->getAPIntValue();		APInt Shift = cast<ConstantSDNode>(From->getOperand(1))->getAPIntValue();
assert(Shift.getLimitedValue() < 32 && "Shift too large!");		assert(Shift.getLimitedValue() < 32 && "Shift too large!");
FromMask <<= Shift.getLimitedValue(31);		FromMask <<= Shift.getLimitedValue(31);
From = From->getOperand(0);		From = From->getOperand(0);
}		}

return From;		return From;
}		}
[5,358 lines not shown]

llvm/test/CodeGen/Thumb2/LowOverheadLoops/cond-vector-reduce-mve-codegen.ll

	[290 lines not shown]
	for.cond.cleanup: ; preds = %middle.block, %entry			for.cond.cleanup: ; preds = %middle.block, %entry
	%res.0.lcssa = phi i32 [ 0, %entry ], [ %reduce, %middle.block ]			%res.0.lcssa = phi i32 [ 0, %entry ], [ %reduce, %middle.block ]
	ret i32 %res.0.lcssa			ret i32 %res.0.lcssa
	}			}

	define dso_local i32 @or_mul_reduce_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture readonly %c, i32* noalias nocapture readonly %d, i32 %N) {			define dso_local i32 @or_mul_reduce_add(i32* noalias nocapture readonly %a, i32* noalias nocapture readonly %b, i32* noalias nocapture readonly %c, i32* noalias nocapture readonly %d, i32 %N) {
	; CHECK-LABEL: or_mul_reduce_add:			; CHECK-LABEL: or_mul_reduce_add:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: push {r4, r5, r6, lr}			; CHECK-NEXT: push {r4, r5, r7, lr}
	; CHECK-NEXT: sub sp, #4			; CHECK-NEXT: ldr.w r12, [sp, #16]
	; CHECK-NEXT: ldr.w r12, [sp, #20]
	; CHECK-NEXT: cmp.w r12, #0			; CHECK-NEXT: cmp.w r12, #0
	; CHECK-NEXT: beq .LBB3_4			; CHECK-NEXT: beq .LBB3_4
	; CHECK-NEXT: @ %bb.1: @ %vector.ph			; CHECK-NEXT: @ %bb.1: @ %vector.ph
	; CHECK-NEXT: add.w r4, r12, #3			; CHECK-NEXT: add.w r4, r12, #3
	; CHECK-NEXT: vmov.i32 q1, #0x0			; CHECK-NEXT: vmov.i32 q1, #0x0
	; CHECK-NEXT: bic r4, r4, #3			; CHECK-NEXT: bic r4, r4, #3
	; CHECK-NEXT: subs r5, r4, #4			; CHECK-NEXT: subs r5, r4, #4
	; CHECK-NEXT: movs r4, #1			; CHECK-NEXT: movs r4, #1
	; CHECK-NEXT: add.w lr, r4, r5, lsr #2			; CHECK-NEXT: add.w lr, r4, r5, lsr #2
	; CHECK-NEXT: lsrs r4, r5, #2			; CHECK-NEXT: lsrs r4, r5, #2
	; CHECK-NEXT: sub.w r4, r12, r4, lsl #2			; CHECK-NEXT: sub.w r4, r12, r4, lsl #2
	; CHECK-NEXT: dls lr, lr			; CHECK-NEXT: dls lr, lr
	; CHECK-NEXT: .LBB3_2: @ %vector.body			; CHECK-NEXT: .LBB3_2: @ %vector.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: vctp.32 r12			; CHECK-NEXT: vctp.32 r12
	; CHECK-NEXT: vmov q0, q1			; CHECK-NEXT: vmov q0, q1
	; CHECK-NEXT: vstr p0, [sp] @ 4-byte Spill
	; CHECK-NEXT: sub.w r12, r12, #4
	; CHECK-NEXT: vpstt			; CHECK-NEXT: vpstt
	; CHECK-NEXT: vldrwt.u32 q1, [r1], #16			; CHECK-NEXT: vldrwt.u32 q1, [r1], #16
	; CHECK-NEXT: vldrwt.u32 q2, [r0], #16			; CHECK-NEXT: vldrwt.u32 q2, [r0], #16
				; CHECK-NEXT: vpnot
	; CHECK-NEXT: vsub.i32 q1, q2, q1			; CHECK-NEXT: vsub.i32 q1, q2, q1
	; CHECK-NEXT: vcmp.i32 eq, q1, zr			; CHECK-NEXT: sub.w r12, r12, #4
	; CHECK-NEXT: vmrs r5, p0			; CHECK-NEXT: vpstee
	; CHECK-NEXT: vldr p0, [sp] @ 4-byte Reload			; CHECK-NEXT: vcmpt.i32 ne, q1, zr
	; CHECK-NEXT: vmrs r6, p0			; CHECK-NEXT: vldrwe.u32 q1, [r3], #16
	; CHECK-NEXT: orrs r5, r6			; CHECK-NEXT: vldrwe.u32 q2, [r2], #16
	; CHECK-NEXT: vmsr p0, r5
	; CHECK-NEXT: vpstt
	; CHECK-NEXT: vldrwt.u32 q1, [r3], #16
	; CHECK-NEXT: vldrwt.u32 q2, [r2], #16
	; CHECK-NEXT: vmul.i32 q1, q2, q1			; CHECK-NEXT: vmul.i32 q1, q2, q1
	; CHECK-NEXT: vadd.i32 q1, q1, q0			; CHECK-NEXT: vadd.i32 q1, q1, q0
	; CHECK-NEXT: le lr, .LBB3_2			; CHECK-NEXT: le lr, .LBB3_2
	; CHECK-NEXT: @ %bb.3: @ %middle.block			; CHECK-NEXT: @ %bb.3: @ %middle.block
	; CHECK-NEXT: vctp.32 r4			; CHECK-NEXT: vctp.32 r4
	; CHECK-NEXT: vpsel q0, q1, q0			; CHECK-NEXT: vpsel q0, q1, q0
	; CHECK-NEXT: vaddv.u32 r0, q0			; CHECK-NEXT: vaddv.u32 r0, q0
	; CHECK-NEXT: add sp, #4			; CHECK-NEXT: pop {r4, r5, r7, pc}
	; CHECK-NEXT: pop {r4, r5, r6, pc}
	; CHECK-NEXT: .LBB3_4:			; CHECK-NEXT: .LBB3_4:
	; CHECK-NEXT: movs r0, #0			; CHECK-NEXT: movs r0, #0
	; CHECK-NEXT: add sp, #4			; CHECK-NEXT: pop {r4, r5, r7, pc}
	; CHECK-NEXT: pop {r4, r5, r6, pc}
	entry:			entry:
	%cmp8 = icmp eq i32 %N, 0			%cmp8 = icmp eq i32 %N, 0
	br i1 %cmp8, label %for.cond.cleanup, label %vector.ph			br i1 %cmp8, label %for.cond.cleanup, label %vector.ph

	vector.ph: ; preds = %entry			vector.ph: ; preds = %entry
	%n.rnd.up = add i32 %N, 3			%n.rnd.up = add i32 %N, 3
	%n.vec = and i32 %n.rnd.up, -4			%n.vec = and i32 %n.rnd.up, -4
	%trip.count.minus.1 = add i32 %N, -1			%trip.count.minus.1 = add i32 %N, -1
	[175 lines not shown]

llvm/test/CodeGen/Thumb2/mve-pred-or.ll

[118 lines not shown]	entry:
%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b		%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b
ret <4 x i32> %s		ret <4 x i32> %s
}		}

define arm_aapcs_vfpcc <4 x i32> @cmpulez_v4i1(<4 x i32> %a, <4 x i32> %b) {		define arm_aapcs_vfpcc <4 x i32> @cmpulez_v4i1(<4 x i32> %a, <4 x i32> %b) {
; CHECK-LABEL: cmpulez_v4i1:		; CHECK-LABEL: cmpulez_v4i1:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vcmp.u32 cs, q1, zr		; CHECK-NEXT: vcmp.u32 cs, q1, zr
; CHECK-NEXT: vmrs r0, p0		; CHECK-NEXT: vpnot
; CHECK-NEXT: vcmp.i32 eq, q0, zr		; CHECK-NEXT: vpst
; CHECK-NEXT: vmrs r1, p0		; CHECK-NEXT: vcmpt.i32 ne, q0, zr
; CHECK-NEXT: orrs r0, r1		; CHECK-NEXT: vpsel q0, q1, q0
; CHECK-NEXT: vmsr p0, r0
; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: bx lr		; CHECK-NEXT: bx lr
entry:		entry:
%c1 = icmp eq <4 x i32> %a, zeroinitializer		%c1 = icmp eq <4 x i32> %a, zeroinitializer
%c2 = icmp ule <4 x i32> %b, zeroinitializer		%c2 = icmp ule <4 x i32> %b, zeroinitializer
%o = or <4 x i1> %c1, %c2		%o = or <4 x i1> %c1, %c2
%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b		%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b
ret <4 x i32> %s		ret <4 x i32> %s
}		}
[101 lines not shown]	entry:
%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b		%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b
ret <4 x i32> %s		ret <4 x i32> %s
}		}

define arm_aapcs_vfpcc <4 x i32> @cmpult_v4i1(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {		define arm_aapcs_vfpcc <4 x i32> @cmpult_v4i1(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {
; CHECK-LABEL: cmpult_v4i1:		; CHECK-LABEL: cmpult_v4i1:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vcmp.u32 hi, q2, q1		; CHECK-NEXT: vcmp.u32 hi, q2, q1
; CHECK-NEXT: vmrs r0, p0		; CHECK-NEXT: vpnot
; CHECK-NEXT: vcmp.i32 eq, q0, zr		; CHECK-NEXT: vpst
; CHECK-NEXT: vmrs r1, p0		; CHECK-NEXT: vcmpt.i32 ne, q0, zr
; CHECK-NEXT: orrs r0, r1		; CHECK-NEXT: vpsel q0, q1, q0
; CHECK-NEXT: vmsr p0, r0
; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: bx lr		; CHECK-NEXT: bx lr
entry:		entry:
%c1 = icmp eq <4 x i32> %a, zeroinitializer		%c1 = icmp eq <4 x i32> %a, zeroinitializer
%c2 = icmp ult <4 x i32> %b, %c		%c2 = icmp ult <4 x i32> %b, %c
%o = or <4 x i1> %c1, %c2		%o = or <4 x i1> %c1, %c2
%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b		%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b
ret <4 x i32> %s		ret <4 x i32> %s
}		}

define arm_aapcs_vfpcc <4 x i32> @cmpugt_v4i1(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {		define arm_aapcs_vfpcc <4 x i32> @cmpugt_v4i1(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {
; CHECK-LABEL: cmpugt_v4i1:		; CHECK-LABEL: cmpugt_v4i1:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vcmp.u32 hi, q1, q2		; CHECK-NEXT: vcmp.u32 hi, q1, q2
; CHECK-NEXT: vmrs r0, p0		; CHECK-NEXT: vpnot
; CHECK-NEXT: vcmp.i32 eq, q0, zr		; CHECK-NEXT: vpst
; CHECK-NEXT: vmrs r1, p0		; CHECK-NEXT: vcmpt.i32 ne, q0, zr
; CHECK-NEXT: orrs r0, r1		; CHECK-NEXT: vpsel q0, q1, q0
; CHECK-NEXT: vmsr p0, r0
; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: bx lr		; CHECK-NEXT: bx lr
entry:		entry:
%c1 = icmp eq <4 x i32> %a, zeroinitializer		%c1 = icmp eq <4 x i32> %a, zeroinitializer
%c2 = icmp ugt <4 x i32> %b, %c		%c2 = icmp ugt <4 x i32> %b, %c
%o = or <4 x i1> %c1, %c2		%o = or <4 x i1> %c1, %c2
%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b		%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b
ret <4 x i32> %s		ret <4 x i32> %s
}		}

define arm_aapcs_vfpcc <4 x i32> @cmpule_v4i1(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {		define arm_aapcs_vfpcc <4 x i32> @cmpule_v4i1(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {
; CHECK-LABEL: cmpule_v4i1:		; CHECK-LABEL: cmpule_v4i1:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vcmp.u32 cs, q2, q1		; CHECK-NEXT: vcmp.u32 cs, q2, q1
; CHECK-NEXT: vmrs r0, p0		; CHECK-NEXT: vpnot
; CHECK-NEXT: vcmp.i32 eq, q0, zr		; CHECK-NEXT: vpst
; CHECK-NEXT: vmrs r1, p0		; CHECK-NEXT: vcmpt.i32 ne, q0, zr
; CHECK-NEXT: orrs r0, r1		; CHECK-NEXT: vpsel q0, q1, q0
; CHECK-NEXT: vmsr p0, r0
; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: bx lr		; CHECK-NEXT: bx lr
entry:		entry:
%c1 = icmp eq <4 x i32> %a, zeroinitializer		%c1 = icmp eq <4 x i32> %a, zeroinitializer
%c2 = icmp ule <4 x i32> %b, %c		%c2 = icmp ule <4 x i32> %b, %c
%o = or <4 x i1> %c1, %c2		%o = or <4 x i1> %c1, %c2
%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b		%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b
ret <4 x i32> %s		ret <4 x i32> %s
}		}

define arm_aapcs_vfpcc <4 x i32> @cmpuge_v4i1(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {		define arm_aapcs_vfpcc <4 x i32> @cmpuge_v4i1(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {
; CHECK-LABEL: cmpuge_v4i1:		; CHECK-LABEL: cmpuge_v4i1:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vcmp.u32 cs, q1, q2		; CHECK-NEXT: vcmp.u32 cs, q1, q2
; CHECK-NEXT: vmrs r0, p0		; CHECK-NEXT: vpnot
; CHECK-NEXT: vcmp.i32 eq, q0, zr		; CHECK-NEXT: vpst
; CHECK-NEXT: vmrs r1, p0		; CHECK-NEXT: vcmpt.i32 ne, q0, zr
; CHECK-NEXT: orrs r0, r1		; CHECK-NEXT: vpsel q0, q1, q0
; CHECK-NEXT: vmsr p0, r0
; CHECK-NEXT: vpsel q0, q0, q1
; CHECK-NEXT: bx lr		; CHECK-NEXT: bx lr
entry:		entry:
%c1 = icmp eq <4 x i32> %a, zeroinitializer		%c1 = icmp eq <4 x i32> %a, zeroinitializer
%c2 = icmp uge <4 x i32> %b, %c		%c2 = icmp uge <4 x i32> %b, %c
%o = or <4 x i1> %c1, %c2		%o = or <4 x i1> %c1, %c2
%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b		%s = select <4 x i1> %o, <4 x i32> %a, <4 x i32> %b
ret <4 x i32> %s		ret <4 x i32> %s
}		}
[171 lines not shown]

This is an archive of the discontinued LLVM Phabricator instance.

[Target][ARM] Fold or(A, B) more aggressively for I1 Vectors
Closed, Public

