Details

Reviewers

Commits

rG132fbf5476e1: [X86] Don't turn (c?-v:v) into (c?-v:0) by blindly using PSIGN.
rL261023: [X86] Don't turn (c?-v:v) into (c?-v:0) by blindly using PSIGN.

Summary

Currently, we sometimes miscompile this vector pattern:

(c ? -v : v)

We lower it to (because "c" is <4 x i1>, lowered as a vector mask):

(~c & v) | (c & -v)

When we have SSSE3, we incorrectly lower that to PSIGN, which does:

(c < 0 ? -v : c > 0 ? v : 0)

in other words, when c is either all-ones or all-zero:

(c ? -v : 0)

While this is an old bug, it rarely triggers because the PSIGN combine
is too sensitive to operand order. This will be improved separately.

Note that the PSIGN tests are also incorrect.
Consider test/CodeGen/X86/vec-sign.ll:

%b.lobit = ashr <4 x i32> %b, <i32 31, i32 31, i32 31, i32 31>
%sub = sub nsw <4 x i32> zeroinitializer, %a
%0 = xor <4 x i32> %b.lobit, <i32 -1, i32 -1, i32 -1, i32 -1>
%1 = and <4 x i32> %a, %0
%2 = and <4 x i32> %b.lobit, %sub
%cond = or <4 x i32> %1, %2
ret <4 x i32> %cond

if %b is zero:

%b.lobit = <4 x i32> zeroinitializer
%sub = sub nsw <4 x i32> zeroinitializer, %a
%0 = <4 x i32> <i32 -1, i32 -1, i32 -1, i32 -1>
%1 = <4 x i32> %a
%2 = <4 x i32> zeroinitializer
%cond = or <4 x i32> %a, zeroinitializer
ret <4 x i32> %a

whereas we currently generate:

psignd %xmm1, %xmm0
retq

which returns 0, as %xmm1 is 0.
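
The mismatch can be reproduced with a scalar model of a single 32-bit lane. This is only a sketch: `wrapNeg`, `psignLane`, `selectNegLane`, `psignMiscompiles` and `checkPsignModels` are hypothetical helper names made up for illustration (they are not LLVM or intrinsic APIs), and `psignLane` simply encodes the PSIGN behaviour quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Wrapping negation, so that INT32_MIN does not trigger signed overflow.
// (Hypothetical helper; the unsigned-to-signed conversion is modular on
// mainstream compilers and guaranteed to be since C++20.)
inline int32_t wrapNeg(int32_t v) {
  return static_cast<int32_t>(0u - static_cast<uint32_t>(v));
}

// One lane of PSIGND: negate v if c < 0, keep v if c > 0, produce 0 if c == 0.
inline int32_t psignLane(int32_t v, int32_t c) {
  if (c < 0) return wrapNeg(v);
  if (c > 0) return v;
  return 0;
}

// One lane of the pattern from the summary: (~c & v) | (c & -v),
// where c is expected to be all-ones (-1) or all-zero (0).
inline int32_t selectNegLane(int32_t v, int32_t c) {
  return (~c & v) | (c & wrapNeg(v));
}

// Returns true when the two models disagree for the given inputs.
inline bool psignMiscompiles(int32_t v, int32_t c) {
  return psignLane(v, c) != selectNegLane(v, c);
}

// Spot checks: the models agree for an all-ones mask and differ for a zero mask.
inline void checkPsignModels() {
  assert(psignLane(5, -1) == -5 && selectNegLane(5, -1) == -5);
  assert(psignLane(5, 0) == 0);      // what PSIGN computes
  assert(selectNegLane(5, 0) == 5);  // what the IR asks for
}
```

Calling `checkPsignModels()` exercises both cases; the zero-mask case is the one the existing test expectations got wrong.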

Instead of directly using c as a mask, avoid the zero case by setting
any bit (other than the sign bit). This lets PSIGN default to the
positive case, while not changing the "negative" (all ones) case.
With that, the generated sequence correctly implements:

(c ? -v : v)

Fixes PR26110.
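
As a rough scalar illustration of the workaround described in this summary (the approach of the initial diff, which the later "Use logic instead of PSIGN" update replaces), the control lane can be OR-ed with a non-sign bit before it reaches PSIGN. The helper names are hypothetical, and the choice of bit 0 is an assumption; the summary only requires some bit other than the sign bit.

```cpp
#include <cassert>
#include <cstdint>

// Wrapping negation, so that INT32_MIN does not trigger signed overflow.
inline int32_t wrapNeg(int32_t v) {
  return static_cast<int32_t>(0u - static_cast<uint32_t>(v));
}

// One lane of PSIGND: negate v if c < 0, keep v if c > 0, produce 0 if c == 0.
inline int32_t psignLane(int32_t v, int32_t c) {
  if (c < 0) return wrapNeg(v);
  if (c > 0) return v;
  return 0;
}

// Force a non-sign bit on so the control lane is never zero.
// Assumption: bit 0 is used here; any non-sign bit would behave the same.
inline int32_t psignWithNonZeroMask(int32_t v, int32_t c) {
  return psignLane(v, c | 1);
}

// With c restricted to 0 or -1, this now matches (c ? -v : v).
inline void checkWorkaround() {
  assert(psignWithNonZeroMask(7, 0) == 7);    // 0 | 1 == 1  -> positive case
  assert(psignWithNonZeroMask(7, -1) == -7);  // -1 | 1 == -1 -> negative case
}
```

An all-ones mask is unchanged by the OR, so only the zero case is redirected to the positive branch.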

Diff Detail

Repository: rL LLVM

Event Timeline

ab updated this revision to Diff 47760. Feb 11 2016, 6:08 PM

ab retitled this revision to "[X86] Don't turn (c?-v:v) into (c?-v:0) by blindly using PSIGN."

ab updated this object.

ab added subscribers: RKSimon, spatel, craig.topper and 3 others.

ab added inline comments. Feb 11 2016, 6:12 PM

test/CodeGen/X86/vec-sign.ll
Line 7 (On Diff #47760): This isn't very helpful; I'll try to look into getting some CP verbose printing.

spatel mentioned this in D17176: [CodeGen] Add getBuildVector and getSplatBuildVector helpers. Feb 12 2016, 7:00 AM

(c ? -v : v)

I think that we can improve the SSE2 (no psign) codegen and have the SSSE3 solution avoid psign completely by using a variant of:
https://graphics.stanford.edu/~seander/bithacks.html#ConditionalNegate

From what I can tell, the SSE2 savings are one integer logic op + a move. The SSSE3 case would have 3 simple integer logic ops rather than load/or/psign. I don't think it's worth chasing. Side note: it's been 10 years since SSSE3 (Merom) came out...can we change the default x86 subtarget now from Yonah/SSE2?

lib/Target/X86/X86ISelLowering.cpp
Line 26334 (On Diff #47760): This comment is misleading. We know we have a 'select' kind of operation, but if we don't have SSE4, then we're going to bail out because we don't actually have the x86 blendv.
Line 26354 (On Diff #47760): getConstant() is magic. There are no code comments to tell you this, but it can do the splat for you. :)

Use logic instead of PSIGN.

Simplify (srl c, 31) to reuse intermediate result: (srl (sra c, 31), 31).

Noticed after uploading the patch, of course ;)
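
One reading of the "(srl c, 31)" simplification noted above, sketched on a single 32-bit lane: a logical shift right by 31 gives the same 0-or-1 value whether it is applied to c itself or to the sign-splat (sra c, 31), so the existing mask node can be reused as the shift input. The helper names are hypothetical and the sketch only checks the arithmetic, not the DAG code.

```cpp
#include <cassert>
#include <cstdint>

// (sra c, 31): splat the sign bit across the lane. Written as a select so the
// sketch does not depend on how the compiler shifts negative signed values.
inline uint32_t sra31(int32_t c) { return c < 0 ? 0xFFFFFFFFu : 0u; }

// (srl x, 31): logical shift, leaving only the former sign bit in bit 0.
inline uint32_t srl31(uint32_t x) { return x >> 31; }

// True when (srl c, 31) == (srl (sra c, 31), 31) for this value of c.
inline bool sameLowBit(int32_t c) {
  return srl31(static_cast<uint32_t>(c)) == srl31(sra31(c));
}

inline void checkShiftReuse() {
  assert(sameLowBit(0));
  assert(sameLowBit(-1));
  assert(sameLowBit(123456));
  assert(sameLowBit(-123456));
}
```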

In D17181#351771, @spatel wrote:

(c ? -v : v)

I think that we can improve the SSE2 (no psign) codegen and have the SSSE3 solution avoid psign completely by using a variant of:
https://graphics.stanford.edu/~seander/bithacks.html#ConditionalNegate

From what I can tell, the SSE2 savings are one integer logic op + a move. The SSSE3 case would have 3 simple integer logic ops rather than load/or/psign. I don't think it's worth chasing.

You're right! Here's a simple adaptation of that.

This also means that we can completely remove PSIGN, as this was the only user in the codebase. I'll do that when we agree.

Side note: it's been 10 years since SSSE3 (Merom) came out...can we change the default x86 subtarget now from Yonah/SSE2?

Heh, I would love to see that, but I guess that the hardware vendors (Sony, Apple?) changed it already, and most of the others (distros?) want to stick to the minimal meaning of "x86_64".

-Ahmed

spatel added inline comments. Feb 15 2016, 4:07 PM

lib/Target/X86/X86ISelLowering.cpp
Lines 26481–26484 (On Diff #48009): Double-check me to make sure, but we can do one better I think:
((X ^ M) + (M & 1))
((X ^ M) - (M)) <--- since we know that M is all 1s (i.e., -1), change the 'add 1' to 'sub -1'

Simplify further by subtracting the mask.
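
The equivalence behind this step can be checked on one lane with a small sketch. It assumes, as the review does, that the mask M is either all zeros or all ones; the function names are hypothetical, and unsigned arithmetic is used so that wraparound is well defined.

```cpp
#include <cassert>
#include <cstdint>

// (add (xor X, M), (and M, 1)): the direct adaptation of the bithacks formula.
inline uint32_t negateAddForm(uint32_t x, uint32_t m) {
  return (x ^ m) + (m & 1u);
}

// (sub (xor X, M), M): subtracting an all-ones mask is the same as adding 1,
// and subtracting an all-zero mask is a no-op.
inline uint32_t negateSubForm(uint32_t x, uint32_t m) {
  return (x ^ m) - m;
}

// Reference semantics: (M ? -X : X), with two's-complement negation.
inline uint32_t negateSelect(uint32_t x, uint32_t m) {
  return m ? 0u - x : x;
}

// True when both forms match the reference for both legal mask values.
inline bool formsAgree(uint32_t x) {
  const uint32_t masks[] = {0u, 0xFFFFFFFFu};
  for (uint32_t m : masks) {
    uint32_t ref = negateSelect(x, m);
    if (negateAddForm(x, m) != ref || negateSubForm(x, m) != ref)
      return false;
  }
  return true;
}
```

The subtract form needs no separate (M & 1) value, which is why the final sequence is just a shift, an xor and a subtract.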

ab marked 4 inline comments as done. Feb 15 2016, 4:30 PM

ab added inline comments.

lib/Target/X86/X86ISelLowering.cpp
Lines 26475–26478 (On Diff #48038): Ah yes, beautiful!

In D17181#353035, @ab wrote:

This also means that we can completely remove PSIGN, as this was the only user in the codebase. I'll do that when we agree.

Great! I didn't think it was worth the effort to optimize the codegen, but we did better than I initially thought we could. And deleting an X86-specific node is better still. psign...quite an instruction. :)

Side note: it's been 10 years since SSSE3 (Merom) came out...can we change the default x86 subtarget now from Yonah/SSE2?

Heh, I would love to see that, but I guess that the hardware vendors (Sony, Apple?) changed it already, and most of the others (distros?) want to stick to the minimal meaning of "x86_64".

My memory was off - Yonah had SSE3:
https://en.wikipedia.org/wiki/Yonah_%28microprocessor%29

In any case, it looks like we default to "Core2" as our CPU model which actually is Merom, but then we limit it to SSE2. Seems odd. A Darwin x86 target required SSE3 from the start. There might be something to chase down here.

LGTM.

This revision is now accepted and ready to land. Feb 15 2016, 4:52 PM

Closed by commit rL261023: [X86] Don't turn (c?-v:v) into (c?-v:0) by blindly using PSIGN. (authored by ab). Feb 16 2016, 2:18 PM

This revision was automatically updated to reflect the committed changes.

ab marked an inline comment as done.

Diff 48110

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp


 // Try to fold:
 // (or (and (m, y), (pandn m, x)))
 // into:
 // (vselect m, x, y)
 // As a special case, try to fold:
 // (or (and (m, (sub 0, x)), (pandn m, x)))
 // into:
-// (psign m, x)
+// (sub (xor X, M), M)
 static SDValue combineLogicBlendIntoPBLENDV(SDNode *N, SelectionDAG &DAG,
                                             const X86Subtarget &Subtarget) {
   assert(N->getOpcode() == ISD::OR);

   SDValue N0 = N->getOperand(0);
   SDValue N1 = N->getOperand(1);
   EVT VT = N->getValueType(0);

-  if (!((VT == MVT::v2i64 && Subtarget.hasSSSE3()) ||
-        (VT == MVT::v4i64 && Subtarget.hasInt256())))
+  if (!((VT == MVT::v2i64) || (VT == MVT::v4i64 && Subtarget.hasInt256())))
     return SDValue();
+  assert(Subtarget.hasSSE2() && "Unexpected i64 vector without SSE2!");

   // Canonicalize pandn to RHS
   if (N0.getOpcode() == X86ISD::ANDNP)
     std::swap(N0, N1);

   if (N0.getOpcode() != ISD::AND || N1.getOpcode() != X86ISD::ANDNP)
     return SDValue();

[32 unchanged lines not shown; nearest context: if (Mask.getOpcode() == ISD::SRA) {]
     SDValue SraC = Mask.getOperand(1);
     SraAmt = cast<ConstantSDNode>(SraC)->getZExtValue();
   }
   if ((SraAmt + 1) != EltBits)
     return SDValue();

   SDLoc DL(N);

-  // Now we know we at least have a plendvb with the mask val. See if
-  // we can form a psignb/w/d.
-  // psign = x.type == y.type == mask.type && y = sub(0, x);
+  // Try to match:
+  // (or (and (M, (sub 0, X)), (pandn M, X)))
+  // which is a special case of vselect:
+  // (vselect M, (sub 0, X), X)
+  // Per:
+  // http://graphics.stanford.edu/~seander/bithacks.html#ConditionalNegate
+  // We know that, if fNegate is 0 or 1:
+  // (fNegate ? -v : v) == ((v ^ -fNegate) + fNegate)
+  //
+  // Here, we have a mask, M (all 1s or 0), and, similarly, we know that:
+  // ((M & 1) ? -X : X) == ((X ^ -(M & 1)) + (M & 1))
+  // ( M ? -X : X) == ((X ^ M ) + (M & 1))
+  // This lets us transform our vselect to:
+  // (add (xor X, M), (and M, 1))
+  // And further to:
+  // (sub (xor X, M), M)
   if (Y.getOpcode() == ISD::SUB && Y.getOperand(1) == X &&
       ISD::isBuildVectorAllZeros(Y.getOperand(0).getNode()) &&
       X.getValueType() == MaskVT && Y.getValueType() == MaskVT) {
-    assert((EltBits == 8 || EltBits == 16 || EltBits == 32) &&
-           "Unsupported VT for PSIGN");
-    Mask = DAG.getNode(X86ISD::PSIGN, DL, MaskVT, X, Mask.getOperand(0));
-    return DAG.getBitcast(VT, Mask);
+    assert(EltBits == 8 || EltBits == 16 || EltBits == 32);
+    return DAG.getBitcast(
+        VT, DAG.getNode(ISD::SUB, DL, MaskVT,
+                        DAG.getNode(ISD::XOR, DL, MaskVT, X, Mask), Mask));
   }

   // PBLENDVB is only available on SSE 4.1.
   if (!Subtarget.hasSSE41())
     return SDValue();

   MVT BlendVT = (VT == MVT::v4i64) ? MVT::v32i8 : MVT::v16i8;


llvm/trunk/test/CodeGen/X86/vector-blend.ll

[902 preceding lines not shown; nearest context: entry:]
   %cond = or <8 x i32> %1, %2
   ret <8 x i32> %cond
 }

 define <4 x i32> @blend_neg_logic_v4i32(<4 x i32> %a, <4 x i32> %b) {
 ; SSE2-LABEL: blend_neg_logic_v4i32:
 ; SSE2: # BB#0: # %entry
 ; SSE2-NEXT: psrad $31, %xmm1
-; SSE2-NEXT: pxor %xmm2, %xmm2
-; SSE2-NEXT: psubd %xmm0, %xmm2
-; SSE2-NEXT: pand %xmm1, %xmm2
-; SSE2-NEXT: pandn %xmm0, %xmm1
-; SSE2-NEXT: por %xmm1, %xmm2
-; SSE2-NEXT: movdqa %xmm2, %xmm0
+; SSE2-NEXT: pxor %xmm1, %xmm0
+; SSE2-NEXT: psubd %xmm1, %xmm0
 ; SSE2-NEXT: retq
 ;
 ; SSSE3-LABEL: blend_neg_logic_v4i32:
 ; SSSE3: # BB#0: # %entry
-; SSSE3-NEXT: psignd %xmm1, %xmm0
+; SSSE3-NEXT: psrad $31, %xmm1
+; SSSE3-NEXT: pxor %xmm1, %xmm0
+; SSSE3-NEXT: psubd %xmm1, %xmm0
 ; SSSE3-NEXT: retq
 ;
 ; SSE41-LABEL: blend_neg_logic_v4i32:
 ; SSE41: # BB#0: # %entry
-; SSE41-NEXT: psignd %xmm1, %xmm0
+; SSE41-NEXT: psrad $31, %xmm1
+; SSE41-NEXT: pxor %xmm1, %xmm0
+; SSE41-NEXT: psubd %xmm1, %xmm0
 ; SSE41-NEXT: retq
 ;
 ; AVX-LABEL: blend_neg_logic_v4i32:
 ; AVX: # BB#0: # %entry
-; AVX-NEXT: vpsignd %xmm1, %xmm0, %xmm0
+; AVX-NEXT: vpsrad $31, %xmm1, %xmm1
+; AVX-NEXT: vpxor %xmm1, %xmm0, %xmm0
+; AVX-NEXT: vpsubd %xmm1, %xmm0, %xmm0
 ; AVX-NEXT: retq
 entry:
   %b.lobit = ashr <4 x i32> %b, <i32 31, i32 31, i32 31, i32 31>
   %sub = sub nsw <4 x i32> zeroinitializer, %a
   %0 = xor <4 x i32> %b.lobit, <i32 -1, i32 -1, i32 -1, i32 -1>
   %1 = and <4 x i32> %a, %0
   %2 = and <4 x i32> %b.lobit, %sub
   %cond = or <4 x i32> %1, %2
   ret <4 x i32> %cond
 }

 define <8 x i32> @blend_neg_logic_v8i32(<8 x i32> %a, <8 x i32> %b) {
 ; SSE2-LABEL: blend_neg_logic_v8i32:
 ; SSE2: # BB#0: # %entry
-; SSE2-NEXT: psrad $31, %xmm2
 ; SSE2-NEXT: psrad $31, %xmm3
-; SSE2-NEXT: pxor %xmm4, %xmm4
-; SSE2-NEXT: pxor %xmm5, %xmm5
-; SSE2-NEXT: psubd %xmm0, %xmm5
-; SSE2-NEXT: psubd %xmm1, %xmm4
-; SSE2-NEXT: pand %xmm3, %xmm4
-; SSE2-NEXT: pandn %xmm1, %xmm3
-; SSE2-NEXT: pand %xmm2, %xmm5
-; SSE2-NEXT: pandn %xmm0, %xmm2
-; SSE2-NEXT: por %xmm2, %xmm5
-; SSE2-NEXT: por %xmm3, %xmm4
-; SSE2-NEXT: movdqa %xmm5, %xmm0
-; SSE2-NEXT: movdqa %xmm4, %xmm1
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pxor %xmm2, %xmm0
+; SSE2-NEXT: psubd %xmm2, %xmm0
+; SSE2-NEXT: pxor %xmm3, %xmm1
+; SSE2-NEXT: psubd %xmm3, %xmm1
 ; SSE2-NEXT: retq
 ;
 ; SSSE3-LABEL: blend_neg_logic_v8i32:
 ; SSSE3: # BB#0: # %entry
-; SSSE3-NEXT: psignd %xmm2, %xmm0
-; SSSE3-NEXT: psignd %xmm3, %xmm1
+; SSSE3-NEXT: psrad $31, %xmm3
+; SSSE3-NEXT: psrad $31, %xmm2
+; SSSE3-NEXT: pxor %xmm2, %xmm0
+; SSSE3-NEXT: psubd %xmm2, %xmm0
+; SSSE3-NEXT: pxor %xmm3, %xmm1
+; SSSE3-NEXT: psubd %xmm3, %xmm1
 ; SSSE3-NEXT: retq
 ;
 ; SSE41-LABEL: blend_neg_logic_v8i32:
 ; SSE41: # BB#0: # %entry
-; SSE41-NEXT: psignd %xmm2, %xmm0
-; SSE41-NEXT: psignd %xmm3, %xmm1
+; SSE41-NEXT: psrad $31, %xmm3
+; SSE41-NEXT: psrad $31, %xmm2
+; SSE41-NEXT: pxor %xmm2, %xmm0
+; SSE41-NEXT: psubd %xmm2, %xmm0
+; SSE41-NEXT: pxor %xmm3, %xmm1
+; SSE41-NEXT: psubd %xmm3, %xmm1
 ; SSE41-NEXT: retq
 ;
 ; AVX1-LABEL: blend_neg_logic_v8i32:
 ; AVX1: # BB#0: # %entry
 ; AVX1-NEXT: vpsrad $31, %xmm1, %xmm2
 ; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm1
 ; AVX1-NEXT: vpsrad $31, %xmm1, %xmm1
 ; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm2, %ymm1
 ; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2
 ; AVX1-NEXT: vpxor %xmm3, %xmm3, %xmm3
 ; AVX1-NEXT: vpsubd %xmm2, %xmm3, %xmm2
 ; AVX1-NEXT: vpsubd %xmm0, %xmm3, %xmm3
 ; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm3, %ymm2
 ; AVX1-NEXT: vandnps %ymm0, %ymm1, %ymm0
 ; AVX1-NEXT: vandps %ymm2, %ymm1, %ymm1
 ; AVX1-NEXT: vorps %ymm1, %ymm0, %ymm0
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: blend_neg_logic_v8i32:
 ; AVX2: # BB#0: # %entry
-; AVX2-NEXT: vpsignd %ymm1, %ymm0, %ymm0
+; AVX2-NEXT: vpsrad $31, %ymm1, %ymm1
+; AVX2-NEXT: vpxor %ymm1, %ymm0, %ymm0
+; AVX2-NEXT: vpsubd %ymm1, %ymm0, %ymm0
 ; AVX2-NEXT: retq
 entry:
   %b.lobit = ashr <8 x i32> %b, <i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31>
   %sub = sub nsw <8 x i32> zeroinitializer, %a
   %0 = xor <8 x i32> %b.lobit, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
   %1 = and <8 x i32> %a, %0
   %2 = and <8 x i32> %b.lobit, %sub
   %cond = or <8 x i32> %1, %2

This is an archive of the discontinued LLVM Phabricator instance.

[X86] Don't turn (c?-v:v) into (c?-v:0) by blindly using PSIGN.
ClosedPublic
