This is an archive of the discontinued LLVM Phabricator instance.

Preliminary: add the tests with baseline trunk checks.
We should have test coverage for v8i16, v16i16, and v32i8.
This doesn't solve the addusb part of PR40053, right? That needs some enhancement around MatchADDUS I think.

Diffusion mentioned this in rL349416: [X86] Add baseline tests for D55780. (Dec 17 2018, 3:23 PM)

Rebase after committing baseline tests for all types.

nikic added a subscriber: nikic. (Dec 17 2018, 3:57 PM)

andreadb added a subscriber: andreadb. (Dec 18 2018, 3:26 AM)

Hi Craig,

Your patch addresses the issue with the subs example from PR40053. However, as soon as we change the code to something like this, your rule no longer triggers:

unsigned long long test_sub_2(__m128i x) {
    __m128i c = _mm_set1_epi8(70);
    return _mm_subs_epu8(x, c)[0];
}

This is similar to the example from PR40053, with the difference that only element zero is effectively used.
If I build this with clang, the optimizer performs a simplify demanded elts-like optimization that propagates undefs to the constant vector.
That undef propagation later on breaks the UMAX pattern. So, it won't appear in the "Initial Selection DAG".
As a consequence, your new pattern would not work, and we end up with this codegen (avx):

vpmaxub .LCPI1_0(%rip), %xmm0, %xmm1
vpcmpeqb        %xmm1, %xmm0, %xmm1
vpaddb  .LCPI1_1(%rip), %xmm0, %xmm0
vpand   %xmm0, %xmm1, %xmm0
vmovq   %xmm0, %rax

Your patch improves the matching of SUBUS, but more work would need to be done in this area.
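
As a rough illustration of what the vpmaxub/vpcmpeqb/vpaddb/vpand sequence above computes, here is a minimal scalar sketch of one byte lane in C++. It assumes that the used lane of .LCPI1_0 holds the constant C and that the used lane of .LCPI1_1 holds -C (the constant pool contents are not shown in this thread). The helper names subus8, expandedLane, and expandedMatchesSubus are made up for this example and are not LLVM APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Reference semantics of one psubusb lane: unsigned subtract, clamped at 0.
inline std::uint8_t subus8(std::uint8_t x, std::uint8_t c) {
  return x > c ? static_cast<std::uint8_t>(x - c) : std::uint8_t{0};
}

// One byte lane of the expanded sequence. ASSUMPTION: the two constant-pool
// operands hold C and -C respectively in this lane.
inline std::uint8_t expandedLane(std::uint8_t x, std::uint8_t c) {
  const std::uint8_t mx = std::max(x, c);                      // vpmaxub
  const std::uint8_t mask = (mx == x) ? 0xFF : 0x00;           // vpcmpeqb
  const std::uint8_t negC = static_cast<std::uint8_t>(-c);     // -C mod 256
  const std::uint8_t diff = static_cast<std::uint8_t>(x + negC); // vpaddb
  return static_cast<std::uint8_t>(mask & diff);               // vpand
}

// Exhaustively compare the expanded sequence with a saturating subtract.
inline bool expandedMatchesSubus() {
  for (int x = 0; x < 256; ++x)
    for (int c = 0; c < 256; ++c)
      if (expandedLane(static_cast<std::uint8_t>(x),
                       static_cast<std::uint8_t>(c)) !=
          subus8(static_cast<std::uint8_t>(x), static_cast<std::uint8_t>(c)))
        return false;
  return true;
}
```

Under those assumptions the four vector instructions produce the same lane value as a single saturating subtract, which is why a missed psubus match shows up as extra instructions rather than as wrong code.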

There is also another problem caused by undef being aggressively propagated by the logic that simplifies demanded vector elements. See my comment below.

lib/Target/X86/X86ISelLowering.cpp
40522	My understanding is that `matchBinaryPredicate()` bails out unless every element of the input vectors is a ConstantSDNode. In the presence of constant build vectors with some undefs, that logic would not be able to match a SUBUS. However, the presence of undef values shouldn't really matter in this particular case. It should be safe to ignore them. Recently `simplify demanded vector elts` has become more effective (and a bit more aggressive). Being able to propagate `undef` to unused lanes of an input vector is great. However, in one particular case (D55600) it caused a regression. The root cause of that regression was the inability of `matchBinaryPredicate()` to handle undefs. I am worried that the more we improve that simplification logic, the more likely it is to end up propagating `undef` values to constant vectors. I think that we need a version of `matchBinaryPredicate` that knows how to skip undefs. Alternatively, `matchBinaryPredicate` could accept an extra (optional) bool argument named `IgnoreUndefs`.
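
To make the suggestion above concrete, here is a minimal standalone sketch in C++ of a lane-wise matcher with an opt-in flag for skipping undef lanes. It is a toy model, not the LLVM implementation: lanes are modeled as std::optional values (std::nullopt standing in for undef), and the names Lane, matchLanes, and isNegatedPair are hypothetical. Only the `IgnoreUndefs` flag name is taken from the comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// A lane is either a known 8-bit constant or undef (std::nullopt).
using Lane = std::optional<std::uint8_t>;

// Hypothetical stand-in for a lane-wise binary predicate matcher.
// With IgnoreUndefs == false, any undef lane makes the match fail.
// With IgnoreUndefs == true, lanes where either side is undef are skipped.
inline bool matchLanes(
    const std::vector<Lane> &LHS, const std::vector<Lane> &RHS,
    const std::function<bool(std::uint8_t, std::uint8_t)> &Match,
    bool IgnoreUndefs = false) {
  if (LHS.size() != RHS.size())
    return false;
  for (std::size_t I = 0; I != LHS.size(); ++I) {
    if (!LHS[I] || !RHS[I]) {
      if (IgnoreUndefs)
        continue; // Undef lane: nothing to check.
      return false;
    }
    if (!Match(*LHS[I], *RHS[I]))
      return false;
  }
  return true;
}

// The SUBUS condition on 8-bit lanes: Max == -Op (mod 256).
inline bool isNegatedPair(std::uint8_t Max, std::uint8_t Op) {
  return Max == static_cast<std::uint8_t>(-Op);
}
```

With constant vectors such as {70, undef, 70, 70} and {186, undef, 186, 186}, the strict form rejects the pair while the undef-skipping form accepts it, which mirrors the behaviour the comment asks for.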

RKSimon added inline comments. (Dec 18 2018, 6:06 AM)

lib/Target/X86/X86ISelLowering.cpp
40522	I'm looking at this now - D55819 does the same for matchUnaryPredicate and I should have the equivalent matchBinaryPredicate patch available soon

This change handles psubus cases with no undefs, so LGTM.
Depending on how the patches land, we could add the undef capability and tests for it directly here or make that an enhancement.
But as I mentioned, I don't think this patch alone is enough to close PR40053 because that's at least 2 independent bugs in 1 report.

This revision is now accepted and ready to land. (Dec 18 2018, 6:15 AM)

RKSimon added inline comments. (Dec 18 2018, 6:53 AM)

lib/Target/X86/X86ISelLowering.cpp
40522	D55822 is for matchBinaryPredicate undef handling - I have no objections for this patch to go in first and I can add support for it to D55822

In D55780#1334383, @spatel wrote:
> This change handles psubus cases with no undefs, so LGTM.
> Depending on how the patches land, we could add the undef capability and tests for it directly here or make that an enhancement.
> But as I mentioned, I don't think this patch alone is enough to close PR40053 because that's at least 2 independent bugs in 1 report.

Don't get me wrong: I am not suggesting that we shouldn't commit this patch.

I just wanted to point out that it is not true that all the psubus cases are fixed as you wrote. To fully fix psubus we need to make sure that we match UMAX even in the presence of undefs.
As I wrote in my previous comment, if you slightly change the example from the bugzilla, then we no longer get a UMAX in the DAG, and we lose this optimization (see below):

unsigned long long test_sub_2(__m128i x) {
    __m128i c = _mm_set1_epi8(70);
    return _mm_subs_epu8(x, c)[0];
}

In D55780#1334563, @andreadb wrote:
> In D55780#1334383, @spatel wrote:
>> This change handles psubus cases with no undefs, so LGTM.
>> Depending on how the patches land, we could add the undef capability and tests for it directly here or make that an enhancement.
>> But as I mentioned, I don't think this patch alone is enough to close PR40053 because that's at least 2 independent bugs in 1 report.
>
> Don't get me wrong: I am not suggesting that we shouldn't commit this patch.
>
> I just wanted to point out that it is not true that all the psubus cases are fixed as you wrote. To fully fix psubus we need to make sure that we match UMAX even in the presence of undefs.
> As I wrote in my previous comment, if you slightly change the example from the bugzilla, then we no longer get a UMAX in the DAG, and we lose this optimization (see below):
>
> unsigned long long test_sub_2(__m128i x) {
>     __m128i c = _mm_set1_epi8(70);
>     return _mm_subs_epu8(x, c)[0];
> }

Ah, sorry I forgot to address that example. I filed it here:
https://bugs.llvm.org/show_bug.cgi?id=40083

And I agree, we're missing many saturating math patterns. I started looking into that with:
https://bugs.llvm.org/show_bug.cgi?id=14613
...but now that we have IR intrinsics and DAG nodes for saturating math, I think it needs to be revisited so we use those ops.

In D55780#1334596, @spatel wrote:
> In D55780#1334563, @andreadb wrote:
>> In D55780#1334383, @spatel wrote:
>>> This change handles psubus cases with no undefs, so LGTM.
>>> Depending on how the patches land, we could add the undef capability and tests for it directly here or make that an enhancement.
>>> But as I mentioned, I don't think this patch alone is enough to close PR40053 because that's at least 2 independent bugs in 1 report.
>>
>> Don't get me wrong: I am not suggesting that we shouldn't commit this patch.
>>
>> I just wanted to point out that it is not true that all the psubus cases are fixed as you wrote. To fully fix psubus we need to make sure that we match UMAX even in the presence of undefs.
>> As I wrote in my previous comment, if you slightly change the example from the bugzilla, then we no longer get a UMAX in the DAG, and we lose this optimization (see below):
>>
>> unsigned long long test_sub_2(__m128i x) {
>>     __m128i c = _mm_set1_epi8(70);
>>     return _mm_subs_epu8(x, c)[0];
>> }
>
> Ah, sorry I forgot to address that example. I filed it here:
> https://bugs.llvm.org/show_bug.cgi?id=40083

No problem. Thanks for raising that bug!

> And I agree, we're missing many saturating math patterns. I started looking into that with:
> https://bugs.llvm.org/show_bug.cgi?id=14613
> ...but now that we have IR intrinsics and DAG nodes for saturating math, I think it needs to be revisited so we use those ops.

As Simon wrote, we could wait for D55822, so that we get support for undefs too. That being said, I am okay even if this patch is committed first.

-Andrea

Closed by commit rL349519: [X86] Create PSUBUS from (add (umax X, C), -C) (authored by ctopper). (Dec 18 2018, 10:29 AM)

This revision was automatically updated to reflect the committed changes.
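
For readers who want to check the identity in the commit title, the sketch below compares (add (umax X, C), -C) with an unsigned saturating subtract for every pair of 8-bit values. It is a standalone scalar model in C++, not LLVM code, and the names subus8, addOfUmax, and identityHoldsForAllI8 are invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Unsigned saturating subtract on one 8-bit lane (psubusb semantics).
inline std::uint8_t subus8(std::uint8_t x, std::uint8_t c) {
  return x > c ? static_cast<std::uint8_t>(x - c) : std::uint8_t{0};
}

// The matched pattern on one lane: add (umax X, C), -C, wrapping mod 256.
inline std::uint8_t addOfUmax(std::uint8_t x, std::uint8_t c) {
  const std::uint8_t mx = std::max(x, c);
  const std::uint8_t negC = static_cast<std::uint8_t>(-c);
  return static_cast<std::uint8_t>(mx + negC);
}

// Exhaustive check over all 256 * 256 lane values.
inline bool identityHoldsForAllI8() {
  for (int x = 0; x < 256; ++x)
    for (int c = 0; c < 256; ++c)
      if (addOfUmax(static_cast<std::uint8_t>(x),
                    static_cast<std::uint8_t>(c)) !=
          subus8(static_cast<std::uint8_t>(x), static_cast<std::uint8_t>(c)))
        return false;
  return true;
}
```

The same reasoning covers the non-splat constants in test20: a lane written as i8 -22 is 234 when read as unsigned, and the matching add constant is i8 22, so umax(x, 234) + 22 wraps to the saturating difference.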

Revision Contents

Path                                   Size
lib/Target/X86/X86ISelLowering.cpp     42 lines
test/CodeGen/X86/psubus.ll             34 lines

Diff 178500

lib/Target/X86/X86ISelLowering.cpp

[... 40,489 lines hidden ...]
       return DAG.getNode(X86ISD::VPMADDWD, DL, ResVT,
                          DAG.getNode(ISD::TRUNCATE, DL, TruncVT, Ops[0]),
                          DAG.getNode(ISD::TRUNCATE, DL, TruncVT, Ops[1]));
   };
   return SplitOpsAndApply(DAG, Subtarget, DL, VT,
                           { Mul.getOperand(0), Mul.getOperand(1) },
                           PMADDBuilder);
 }

+// Try to turn (add (umax X, C), -C) into (psubus X, C)
+static SDValue combineAddToSUBUS(SDNode *N, SelectionDAG &DAG,
+                                 const X86Subtarget &Subtarget) {
+  if (!Subtarget.hasSSE2())
+    return SDValue();
+
+  EVT VT = N->getValueType(0);
+
+  // psubus is available in SSE2 for i8 and i16 vectors.
+  if (!VT.isVector() || VT.getVectorNumElements() < 2 ||
+      !isPowerOf2_32(VT.getVectorNumElements()) ||
+      !(VT.getVectorElementType() == MVT::i8 ||
+        VT.getVectorElementType() == MVT::i16))
+    return SDValue();
+
+  SDValue Op0 = N->getOperand(0);
+  SDValue Op1 = N->getOperand(1);
+  if (Op0.getOpcode() != ISD::UMAX)
+    return SDValue();
+
+  // The add should have a constant that is the negative of the max.
+  auto MatchSUBUS = [](ConstantSDNode *Max, ConstantSDNode *Op) {
+    return Max->getAPIntValue() == (-Op->getAPIntValue());
+  };
+  if (!ISD::matchBinaryPredicate(Op0.getOperand(1), Op1, MatchSUBUS))
+    return SDValue();
+
+  auto SUBUSBuilder = [](SelectionDAG &DAG, const SDLoc &DL,
+                         ArrayRef<SDValue> Ops) {
+    return DAG.getNode(X86ISD::SUBUS, DL, Ops[0].getValueType(), Ops);
+  };
+
+  // Take both operands from the umax node.
+  SDLoc DL(N);
+  return SplitOpsAndApply(DAG, Subtarget, DL, VT,
+                          { Op0.getOperand(0), Op0.getOperand(1) },
+                          SUBUSBuilder);
+}

 // Attempt to turn this pattern into PMADDWD.
 // (mul (add (zext (build_vector)), (zext (build_vector))),
 //      (add (zext (build_vector)), (zext (build_vector)))
 static SDValue matchPMADDWD_2(SelectionDAG &DAG, SDValue N0, SDValue N1,
                               const SDLoc &DL, EVT VT,
                               const X86Subtarget &Subtarget) {
   if (!Subtarget.hasSSE2())
     return SDValue();
[... 139 lines hidden; enclosing context: if ((VT == MVT::v8i16 || VT == MVT::v4i32 || VT == MVT::v16i16 || ...]
     };
     return SplitOpsAndApply(DAG, Subtarget, SDLoc(N), VT, {Op0, Op1},
                             HADDBuilder);
   }

   if (SDValue V = combineIncDecVector(N, DAG))
     return V;

+  if (SDValue V = combineAddToSUBUS(N, DAG, Subtarget))
+    return V;
+
   return combineAddOrSubToADCOrSBB(N, DAG);
 }

 static SDValue combineSubToSubus(SDNode *N, SelectionDAG &DAG,
                                  const X86Subtarget &Subtarget) {
   SDValue Op0 = N->getOperand(0);
   SDValue Op1 = N->getOperand(1);
   EVT VT = N->getValueType(0);
[... 1,692 lines hidden ...]

test/CodeGen/X86/psubus.ll

[... 2,405 lines hidden; enclosing context: ; AVX512-NEXT: retq ...]
   %ld2 = load <2 x i16>, <2 x i16>* %p2, align 8
   %1 = sub <2 x i16> %ld1, %ld2
   %2 = icmp ugt <2 x i16> %ld1, %ld2
   %sh3 = select <2 x i1> %2, <2 x i16> %1, <2 x i16> zeroinitializer
   store <2 x i16> %sh3, <2 x i16>* %p1, align 8
   ret void
 }

+define <16 x i8> @test19(<16 x i8> %x) {
+; SSE-LABEL: test19:
+; SSE:       # %bb.0: # %entry
+; SSE-NEXT:    psubusb {{.*}}(%rip), %xmm0
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: test19:
+; AVX:       # %bb.0: # %entry
+; AVX-NEXT:    vpsubusb {{.*}}(%rip), %xmm0, %xmm0
+; AVX-NEXT:    retq
+entry:
+  %0 = icmp ugt <16 x i8> %x, <i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70>
+  %1 = select <16 x i1> %0, <16 x i8> %x, <16 x i8> <i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70, i8 70>
+  %2 = add <16 x i8> %1, <i8 -70, i8 -70, i8 -70, i8 -70, i8 -70, i8 -70, i8 -70, i8 -70, i8 -70, i8 -70, i8 -70, i8 -70, i8 -70, i8 -70, i8 -70, i8 -70>
+  ret <16 x i8> %2
+}
+
+define <16 x i8> @test20(<16 x i8> %x) {
+; SSE-LABEL: test20:
+; SSE:       # %bb.0: # %entry
+; SSE-NEXT:    psubusb {{.*}}(%rip), %xmm0
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: test20:
+; AVX:       # %bb.0: # %entry
+; AVX-NEXT:    vpsubusb {{.*}}(%rip), %xmm0, %xmm0
+; AVX-NEXT:    retq
+entry:
+  %0 = icmp ugt <16 x i8> %x, <i8 1, i8 -22, i8 -50, i8 -114, i8 -77, i8 -70, i8 123, i8 98, i8 63, i8 19, i8 -22, i8 100, i8 25, i8 34, i8 55, i8 70>
+  %1 = select <16 x i1> %0, <16 x i8> %x, <16 x i8> <i8 1, i8 -22, i8 -50, i8 -114, i8 -77, i8 -70, i8 123, i8 98, i8 63, i8 19, i8 -22, i8 100, i8 25, i8 34, i8 55, i8 70>
+  %2 = add <16 x i8> %1, <i8 -1, i8 22, i8 50, i8 114, i8 77, i8 70, i8 -123, i8 -98, i8 -63, i8 -19, i8 22, i8 -100, i8 -25, i8 -34, i8 -55, i8 -70>
+  ret <16 x i8> %2
+}

[X86] Create PSUBUS from (add (umax X, C), -C) (Closed, Public)
