This is an archive of the discontinued LLVM Phabricator instance.


Differential D3475

Optimization for certain shufflevector by using insertps.
ClosedPublic

Authored by filcab on Apr 23 2014, 2:46 PM.


Details

Reviewers

nadav
andreadb

Commits

rG363b570d2a8b: Optimization for certain shufflevector by using insertps.
rL207291: Optimization for certain shufflevector by using insertps.

Summary

On x86 with SSE4.1, a v4f32 shufflevector can be lowered to a single insertps
instruction when all but one of the result's elements come from one vector
(and keep their index) and the remaining element comes from another vector or
from a memory operand.

Added tests for insertps optimizations on shufflevector.
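
The condition in the summary can be sketched outside of LLVM as a small standalone check. The function name `looksLikeInsertpsMask` and the use of `std::array` are assumptions made for illustration; the patch itself works on `ArrayRef<int>` and `MVT`. Mask entries 0..3 are taken to select from the first vector and 4..7 from the second, as in shufflevector.

```cpp
#include <array>

// Hypothetical standalone version of the mask test described in the summary.
// Entries 0..3 select from the first vector, 4..7 from the second.
bool looksLikeInsertpsMask(const std::array<int, 4> &Mask) {
  unsigned InPlaceV1 = 0;
  unsigned InPlaceV2 = 0;
  for (int i = 0, e = static_cast<int>(Mask.size()); i != e; ++i) {
    if (Mask[i] == i)
      ++InPlaceV1; // lane i keeps its index and comes from the first vector
    else if (Mask[i] == i + 4)
      ++InPlaceV2; // lane i keeps its index and comes from the second vector
  }
  // Exactly three lanes stay in place in one of the two vectors.
  return InPlaceV1 == 3 || InPlaceV2 == 3;
}
```

With this sketch, the masks used in the new tests, `<0, 1, 2, 4>` and `<0, 1, 5, 3>`, are accepted, while an identity mask or a two-and-two split is rejected.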

Diff Detail

Event Timeline

filcab updated this revision to Diff 8781. Apr 23 2014, 2:46 PM

filcab retitled this revision from to Optimization for certain shufflevector by using insertps..

filcab updated this object.

filcab edited the test plan for this revision.

filcab added a reviewer: nadav.

filcab added a subscriber: Unknown Object (MLST).

silvas added a subscriber: silvas. Apr 23 2014, 5:49 PM

silvas added inline comments.

lib/Target/X86/X86ISelLowering.cpp
3938	Tiny style nit: the canonical LLVM counted loop is for (int i = 0, e = Foo.size(); i != e; ++i)
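
For readers unfamiliar with the idiom, here is a minimal sketch of the loop shape the comment refers to. The function `sumAll` and the `std::vector` container are made up for illustration; only the loop header follows the form quoted above.

```cpp
#include <vector>

// Illustrative only: the end value is evaluated once, stored in `e`,
// and compared with != instead of <.
int sumAll(const std::vector<int> &Foo) {
  int Sum = 0;
  for (int i = 0, e = static_cast<int>(Foo.size()); i != e; ++i)
    Sum += Foo[i];
  return Sum;
}
```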

Fixed the LLVM style nit pointed out by Sean Silva.

I did not review the patch carefully but from a quick look it looks fine. Andrea, what do you say?

Hi Filipe,

You can check the INSERTPS mask outside NormalizeVectorShuffle(), where all the other masks are checked.
You can use insertps for v4i32 as well.

I think that folding the load into insertps is not fully correct.

You translate this IR into an (add + insertps) sequence:

%0 = load <4 x float>* %pb, align 16
%vecinit6 = shufflevector <4 x float> %a, <4 x float> %0, <4 x i32> <i32 0, i32 1, i32 2, i32 4>

The memory form of insertps loads 4 bytes instead of 16, so you lose exceptions. That is OK for OpenCL, but other compilers can't ignore exceptions.

And in general, I'm not sure that add + insertps (load form) is better than load + insertps.
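
To make the size difference concrete, the following sketch models the narrowed access as plain C++: instead of reading all four floats, it reads one float at a byte offset from the vector's address. The helper names are hypothetical, and the sketch assumes the offset is derived from the lane that is read from the in-memory vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Byte offset of one lane inside a <4 x float> stored in memory.
std::size_t narrowedLoadOffset(unsigned SrcLane) {
  assert(SrcLane < 4 && "a v4f32 has only four lanes");
  return SrcLane * sizeof(float);
}

// Reads a single float (4 bytes) rather than the whole 16-byte vector.
float loadOneLane(const float *Vec, unsigned SrcLane) {
  float Out;
  std::memcpy(&Out,
              reinterpret_cast<const char *>(Vec) + narrowedLoadOffset(SrcLane),
              sizeof(float));
  return Out;
}
```

The narrowed form touches only 4 of the 16 bytes, which is why a fault that the full vector load would raise may no longer occur.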

About tests: Why do you check X32 and X64 separately?

Elena

Hi Filipe,

I think your patch in general looks good except for a few things (see comments below).

lib/Target/X86/X86ISelLowering.cpp
3933	I would change this into: if (VT != MVT::v4i32 && VT != MVT::v4f32)
7283	It probably makes sense to add a comment explaining that it is safe to call this function only when the shuffle mask is a valid INSERTPS mask (according to your new isINSERTPSMask function). Otherwise, DestIndex would be wrongly computed.
7284	I would remove the bool 'HasAVX' since it is not used. That extra argument can be added in future once we decide to optimize for AVX cases too.
7293	You should also support MVT::v4i32.
7315–7336	As Elena pointed out, it is unsafe to transform the load+insertps into an add+insertps with memory operand. If you force that transformation, then the alignment constraint would not be honored. According to the instruction set reference, no alignment is required for the insertps instruction (unless alignment checking is enabled). If alignment checking is enabled, the check would ensure that the address is aligned to 4 (and not 16). The good thing is that, even without that if-stmt, your transformation would still improve all the test cases you added (i.e. for the load+shuffle case, you would get a `movaps+insertps` instead of `movaps+shufps+shufps`).
test/CodeGen/X86/sse41.ll
252–265	Without the 'load+insertps -> add+insertps' transformation, this test would still be improved by your patch. With your patch (excluding the if-stmt between lines 7315:7336), the generated code is better than it was before applying your patch. We now get a movaps+insertps; before, we produced a movaps followed by two shufps instructions. I would change the test by adding explicit checks for the aligned load (movaps) and the insertps.
252–279	Elena is right. Since you are matching the same code for X64 and X32, I would suggest adding a common check-prefix (example: --check-prefix=CHECK) to both RUN lines. You can use a single CHECK-LABEL in each test function (rather than having two identical checks [X32-LABEL and X64-LABEL] for every function). Your new tests would only use CHECK (and not X32/X64). I would add extra tests to verify that your lowering rule works fine with <4 x i32> vectors.
252–279	Your patch also improves the code generation in the following case:

define <4 x float> @foo(<4 x float> %a, float* %b) {
  %1 = load float* %b, align 4
  %2 = insertelement <4 x float> undef, float %1, i32 0
  %result = shufflevector <4 x float> %a, <4 x float> %2, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
  ret <4 x float> %result
}

That is because your fix helps match the pattern defined for `def rm : SS4AI8`, which selects an INSERTPS with a memory operand (see X86InstrSSE.td, multiclass SS41I_insertf32). With your patch, the backend would produce a single insertps for the test above. Before (i.e. without your patch), we produced the following long sequence of instructions (on X64):

movss (%rdi), %xmm1
movaps %xmm0, %xmm2
movss %xmm1, %xmm2
shufps $36, %xmm2, %xmm0

I suggest adding that test as well (and another test to verify that the code is simplified even if we have <4 x i32> vector types).
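
The comments above about DestIndex and the insertps immediate are easier to follow with the encoding spelled out. The sketch below is a scalar model, not LLVM code: `insertpsImm` and `emulateInsertps` are hypothetical names, and the layout used is the documented one for the instruction's 8-bit immediate (bits 7:6 select the source lane, bits 5:4 the destination lane, bits 3:0 are a zero mask).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V4F = std::array<float, 4>;

// Builds the 8-bit immediate: source lane in bits 7:6, destination lane in
// bits 5:4, zero mask in bits 3:0.
std::uint8_t insertpsImm(unsigned SrcLane, unsigned DstLane,
                         unsigned ZeroMask = 0) {
  assert(SrcLane < 4 && DstLane < 4 && ZeroMask < 16);
  return static_cast<std::uint8_t>((SrcLane << 6) | (DstLane << 4) | ZeroMask);
}

// Scalar model of the register form: copy one lane of Src into one lane of
// Dst, then clear every lane whose bit is set in the zero mask.
V4F emulateInsertps(V4F Dst, const V4F &Src, std::uint8_t Imm) {
  Dst[(Imm >> 4) & 3] = Src[(Imm >> 6) & 3];
  for (unsigned i = 0; i != 4; ++i)
    if (Imm & (1u << i))
      Dst[i] = 0.0f;
  return Dst;
}
```

For the mask `<0, 1, 5, 3>` the foreign element lands in lane 2 and comes from lane 1 of the second vector, so the immediate is `insertpsImm(1, 2)`, which is 0x60. The `insertps $16` checked in the existing buildvector test corresponds to source lane 0 and destination lane 1.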

Hi Elena and Andrea,

1 - I've changed the check + change to NormalizeVectorShuffle(), I will update the patch soon;

2 - This won't be as easy as it looks (see below);

3 - I don't understand what the problem would be in making the load smaller. Especially since we do these kinds of load reductions in other places (including target independent opts like in DAGCombiner::ReduceLoadOpStoreWidth, and others). The alignment requirements don't get stricter, so we shouldn't output code that might break when the original wouldn't. And if I'm not mistaken, the alignment exceptions would fall under undefined behaviour, which wouldn't make this optimization invalid;

By the way, with the current transform, we would never generate an add instruction for the insertps. We just generate the insertps with the address calculation in it (just to be sure we're on the same page).
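
The alignment argument in point 3 can be checked with a little arithmetic. The sketch below uses hypothetical helpers: `minAlign` returns the largest power of two that divides both of its arguments, and `narrowedAlign` applies it to a base alignment and a lane offset, assuming 4-byte elements.

```cpp
#include <cassert>
#include <cstdint>

// Largest power of two dividing both A and B (lowest set bit of A | B).
std::uint64_t minAlign(std::uint64_t A, std::uint64_t B) {
  return (A | B) & (~(A | B) + 1);
}

// Alignment that can be assumed for a 4-byte element at lane `Lane` of a
// vector whose address is known to be aligned to `BaseAlign`.
std::uint64_t narrowedAlign(std::uint64_t BaseAlign, unsigned Lane) {
  assert(Lane < 4 && "a four-element vector has lanes 0..3");
  return minAlign(BaseAlign, static_cast<std::uint64_t>(Lane) * 4);
}
```

For a 16-byte aligned vector, every lane address is aligned to at least 4 bytes, so a 4-byte access never needs more alignment than the original 16-byte load already guaranteed.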

4 - I will change the test. After this fix I'll also change other tests in the same file that similarly check for the same text on X32 and X64.

About supporting i32 besides f32:

I can simply do the transform for MVT::v4i32 too, but it will not optimize as much as the current one does for f32.
It will optimize some cases but, for example, we might still get movsd+insertps when insertps should suffice.

To fix this problem, we add a bunch of new ones:

I don't know that much about the backend, but there doesn't seem to be a way to do it without creating a new SDNode type for INSERTPS (I don't know what it would be called, since INSERTPS is already used). (X86InstrFragmentsSIMD.td:84)
I can't find a way to have, on the type profile, an OR condition, which would allow us to define X86insertps as getting a f32 or i32 from memory.
Without this, the pattern we generate in X86InstrSSE.td:6522 won't match due to its types (even if we define a whole new multiclass for the i32 case).

I think the best course of action would be to:

Add some more tests, for the i32 cases (and collapse the tests to a common check-prefix);
Do the transform on both i32 and f32, for now, even if the generated instructions aren't perfect;
Later figure out how to better express the insertps for i32 in the backend (either with a new instruction, or a simpler one-time hack for this one).

What do you think?

Filipe

Just a small addition that I will investigate tomorrow:
The whole “get an i32 from memory into an xmm element” gets much easier if my transform emits a pinsrd for that case. I will change it to use pinsrd (for the i32 from mem case) tomorrow.

Filipe

Added the optimization for v4i32 (but we emit pinsrd instead of insertps when loading from memory).
Addresses the concerns in the review, except for the load minimization, due to discussion on IRC, existing optimizations that reduce load sizes, and lack of response to my question.
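
As a rough model of the integer case, pinsrd writes one 32-bit value, taken from a register or from memory, into the lane selected by the low two bits of its immediate. The sketch below is a scalar emulation with hypothetical names and is not taken from the patch.

```cpp
#include <array>
#include <cstdint>

using V4I = std::array<std::int32_t, 4>;

// Scalar model of pinsrd with a memory source: load one i32 and place it in
// the lane selected by the low two bits of the immediate.
V4I emulatePinsrd(V4I Dst, const std::int32_t *Mem, unsigned Imm) {
  Dst[Imm & 3] = *Mem;
  return Dst;
}
```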

andreadb accepted this revision. Apr 25 2014, 4:51 PM

andreadb edited edge metadata.

This revision is now accepted and ready to land. Apr 25 2014, 4:51 PM

filcab closed this revision. Apr 25 2014, 7:09 PM

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp	88 lines
test/CodeGen/X86/sse41.ll	28 lines

Diff 8785

lib/Target/X86/X86ISelLowering.cpp


@@ ... @@ static bool isMOVLHPSMask(ArrayRef<int> Mask, MVT VT) {

   for (unsigned i = 0, e = NumElems/2; i != e; ++i)
     if (!isUndefOrEqual(Mask[i + e], i + NumElems))
       return false;

   return true;
 }

+/// isINSERTPSMask - Return true if the specified VECTOR_SHUFFLE operand
+/// specifies a shuffle of elements that is suitable for input to INSERTPS.
+/// i. e: If all but one element come from the same vector.
+static bool isINSERTPSMask(ArrayRef<int> Mask, MVT VT) {
+  // TODO: Deal with AVX's VINSERTPS
+  if (!VT.is128BitVector() || VT != MVT::v4f32)
+    return false;
+
+  unsigned CorrectPosV1 = 0;
+  unsigned CorrectPosV2 = 0;
+  for (int i = 0, e = (int)VT.getVectorNumElements(); i != e; ++i)
+    if (Mask[i] == i)
+      ++CorrectPosV1;
+    else if (Mask[i] == i + 4)
+      ++CorrectPosV2;
+
+  if (CorrectPosV1 == 3 || CorrectPosV2 == 3)
+    // We have 3 elements from one vector, and one from another.
+    return true;
+
+  return false;
+}

 //
 // Some special combinations that can be optimized.
 //
 static
 SDValue Compact8x32ShuffleNode(ShuffleVectorSDNode *SVOp,
                                SelectionDAG &DAG) {
   MVT VT = SVOp->getSimpleValueType(0);
   SDLoc dl(SVOp);
@@ ... @@ SDValue getMOVLP(SDValue &Op, SDLoc &dl, SelectionDAG &DAG, bool HasSSE2) {

   assert(VT != MVT::v4i32 && "unsupported shuffle type");

   // Invert the operand order and use SHUFPS to match it.
   return getTargetShuffleNode(X86ISD::SHUFP, dl, VT, V2, V1,
                               getShuffleSHUFImmediate(SVOp), DAG);
 }

+static SDValue getINSERTPS(ShuffleVectorSDNode *SVOp, SDLoc &dl,
+                           SelectionDAG &DAG, bool HasAVX) {
+  // Generate an insertps instruction when inserting an f32 from memory onto a
+  // v4f32 or when copying a member from one v4f32 to another.
+  // TODO: Optimize for AVX cases too (VINSERTPS)
+  MVT VT = SVOp->getSimpleValueType(0);
+  MVT EVT = VT.getVectorElementType();
+  SDValue V1 = SVOp->getOperand(0);
+  SDValue V2 = SVOp->getOperand(1);
+  auto Mask = SVOp->getMask();
+  assert(VT == MVT::v4f32 && "unsupported vector type for insertps");
+
+  int FromV1 = std::count_if(Mask.begin(), Mask.end(),
+                             [](const int &i) { return i < 4; });
+
+  SDValue From;
+  SDValue To;
+  unsigned DestIndex;
+  if (FromV1 == 1) {
+    From = V1;
+    To = V2;
+    DestIndex = std::find_if(Mask.begin(), Mask.end(),
+                             [](const int &i) { return i < 4; }) -
+                Mask.begin();
+  } else {
+    From = V2;
+    To = V1;
+    DestIndex = std::find_if(Mask.begin(), Mask.end(),
+                             [](const int &i) { return i >= 4; }) -
+                Mask.begin();
+  }
+
+  if (MayFoldLoad(From)) {
+    // Trivial case, when From comes from a load and is only used by the
+    // shuffle. Make it use insertps from the vector that we need from that
+    // load.
+    SDValue Addr = From.getOperand(1);
+    SDValue NewAddr =
+        DAG.getNode(ISD::ADD, dl, Addr.getSimpleValueType(), Addr,
+                    DAG.getConstant(DestIndex * EVT.getStoreSize(),
+                                    Addr.getSimpleValueType()));
+
+    LoadSDNode *Load = cast<LoadSDNode>(From);
+    SDValue Ld = DAG.getLoad(EVT, dl, Load->getChain(), NewAddr,
+                             DAG.getMachineFunction().getMachineMemOperand(
+                                 Load->getMemOperand(), 0, EVT.getStoreSize()));
+
+    // Create this as a scalar to vector to match the instruction pattern.
+    SDValue LoadScalarToVector =
+        DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4f32, Ld);
+    SDValue InsertpsMask = DAG.getIntPtrConstant(DestIndex << 4);
+    return DAG.getNode(X86ISD::INSERTPS, dl, VT, To, LoadScalarToVector,
+                       InsertpsMask);
+  }
+
+  // Vector-element-to-vector
+  unsigned SrcIndex = Mask[DestIndex] % 4;
+  SDValue InsertpsMask = DAG.getIntPtrConstant(DestIndex << 4 | SrcIndex << 6);
+  return DAG.getNode(X86ISD::INSERTPS, dl, VT, To, From, InsertpsMask);
+}

 // Reduce a vector shuffle to zext.
 static SDValue LowerVectorIntExtend(SDValue Op, const X86Subtarget *Subtarget,
                                     SelectionDAG &DAG) {
   // PMOVZX is only available from SSE41.
   if (!Subtarget->hasSSE41())
     return SDValue();

   MVT VT = Op.getSimpleValueType();
@@ ... @@ X86TargetLowering::LowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG) const {
   MVT VT = Op.getSimpleValueType();
   SDLoc dl(Op);
   unsigned NumElems = VT.getVectorNumElements();
   bool V1IsUndef = V1.getOpcode() == ISD::UNDEF;
   bool V2IsUndef = V2.getOpcode() == ISD::UNDEF;
   bool V1IsSplat = false;
   bool V2IsSplat = false;
   bool HasSSE2 = Subtarget->hasSSE2();
+  bool HasSSE4 = Subtarget->hasSSE41();
   bool HasFp256 = Subtarget->hasFp256();
   bool HasInt256 = Subtarget->hasInt256();
   MachineFunction &MF = DAG.getMachineFunction();
   bool OptForSize = MF.getFunction()->getAttributes().
     hasAttribute(AttributeSet::FunctionIndex, Attribute::OptimizeForSize);

   assert(VT.getSizeInBits() != 64 && "Can't lower MMX shuffles");

@@ ... @@ if (isSHUFPMask(M, VT))
     return getTargetShuffleNode(X86ISD::SHUFP, dl, VT, V1, V2,
                                 getShuffleSHUFImmediate(SVOp), DAG);

   if (isUNPCKL_v_undef_Mask(M, VT, HasInt256))
     return getTargetShuffleNode(X86ISD::UNPCKL, dl, VT, V1, V1, DAG);
   if (isUNPCKH_v_undef_Mask(M, VT, HasInt256))
     return getTargetShuffleNode(X86ISD::UNPCKH, dl, VT, V1, V1, DAG);

+  if (HasSSE4 && isINSERTPSMask(M, VT))
+    return getINSERTPS(SVOp, dl, DAG, Subtarget->hasAVX());

   //===--------------------------------------------------------------------===//
   // Generate target specific nodes for 128 or 256-bit shuffles only
   // supported in the AVX instruction set.
   //

   // Handle VMOVDDUPY permutations
   if (V2IsUndef && isMOVDDUPYMask(M, VT, HasFp256))
     return getTargetShuffleNode(X86ISD::MOVDDUP, dl, VT, V1, DAG);

test/CodeGen/X86/sse41.ll

@@ ... @@
 ; X32: ret
 ; X64-LABEL: buildvector:
 ; X64-NOT: insertps $0
 ; X64: insertps $16
 ; X64-NOT: insertps $0
 ; X64: ret
 }

+define <4 x float> @insertps_from_shufflevector_1(<4 x float> %a, <4 x float>* nocapture readonly %pb) {
+entry:
+  %0 = load <4 x float>* %pb, align 16
+  %vecinit6 = shufflevector <4 x float> %a, <4 x float> %0, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
+  ret <4 x float> %vecinit6
+; X32-LABEL: insertps_from_shufflevector_1:
+; X32-NOT: shufps
+; X32: insertps
+; X32: ret
+; X64-LABEL: insertps_from_shufflevector_1:
+; X64-NOT: shufps
+; X64: insertps
+; X64: ret
+}
+
+define <4 x float> @insertps_from_shufflevector_2(<4 x float> %a, <4 x float> %b) {
+entry:
+  %vecinit6 = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 1, i32 5, i32 3>
+  ret <4 x float> %vecinit6
+; X32-LABEL: insertps_from_shufflevector_2:
+; X32: insertps
+; X32-NOT: shufps
+; X32: ret
+; X64-LABEL: insertps_from_shufflevector_2:
+; X64: insertps
+; X64-NOT: shufps
+; X64: ret
+}