This is an archive of the discontinued LLVM Phabricator instance.

[X86] Don't assume that a shuffle operand is #0: it isn't for VPERMV.
ClosedPublic

Authored by ab on Feb 9 2016, 2:13 PM.

Details

Summary

Since:

r246981 AVX-512: Lowering for 512-bit vector shuffles.

VPERMV is recognized in getTargetShuffleMask.

This breaks assumptions in most callers, as they expect N->getOperand(0) to be (one of) the vector operand(s). It isn't, as VPERMV has the mask as operand #0 (I can't think of another shuffle-like instruction that works the same).
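
To illustrate the broken assumption, here is a minimal sketch (the helper is hypothetical, not code from this patch, and it assumes the X86 backend's internal headers):

#include "X86ISelLowering.h"                 // X86ISD opcodes (backend-internal)
#include "llvm/CodeGen/SelectionDAGNodes.h"  // SDNode, SDValue
using namespace llvm;

// Most target shuffle nodes put their vector source(s) first, so callers of
// getTargetShuffleMask habitually write N->getOperand(0) for "the input".
// X86ISD::VPERMV is built as (mask, vector), so that habit hands back the mask.
static SDValue getShuffleDataOperand(SDNode *N) {
  if (N->getOpcode() == X86ISD::VPERMV)
    return N->getOperand(1); // the data vector; operand 0 is the shuffle mask
  return N->getOperand(0);   // the common case
}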

In the added testcase, this leads to the funny-looking sequence:

vmovdqa .LCPI0_0(%rip), %ymm0   # ymm0 = [0,1,2,3,4,5,6,4]
vpshufb .LCPI0_1(%rip), %ymm0, %ymm0 # ymm0 = ymm0[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,16,17,18,18]

In my original testcase (s/i32 4>)/i32 1>)/ should do the trick), the VPSHUFB lane restriction was another problem, but Simon fixed that in r260063.

I can think of two obvious solutions:

  • swap the X86ISD::VPERMV operands, commenting in X86ISelLowering.h that it's different from the instructions. IMO, it's confusing either way.
  • return the operands and fix the users. There are many users, some of which (e.g., setTargetShuffleZeroElements) only return a mask themselves. This doesn't seem perfect either.

This (very rough, WIP) patch implements the latter.

What do you think? We might improve this by having a struct wrap <Mask, IsUnary, Ops>, and hopefully avoid computing slightly different things in different places.
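
For concreteness, a hypothetical sketch of such a wrapper (the name and fields here are illustrative, not part of the patch):

#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/SelectionDAGNodes.h" // SDValue

// Bundle everything a decoded target shuffle needs, so callers stop
// hard-coding operand positions such as N->getOperand(0).
struct DecodedTargetShuffle {
  llvm::SmallVector<int, 16> Mask;         // decoded shuffle mask indices
  llvm::SmallVector<llvm::SDValue, 2> Ops; // the actual vector source operand(s)
  bool IsUnary = false;                    // both sources are the same value
};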

Diff Detail

Repository
rL LLVM

Event Timeline

ab updated this revision to Diff 47363.Feb 9 2016, 2:13 PM
ab retitled this revision from to [X86] Don't assume that a shuffle operand is #0: it isn't for VPERMV..
ab updated this object.
ab added reviewers: RKSimon, spatel.
ab added a subscriber: llvm-commits.
spatel edited edge metadata.Feb 10 2016, 9:59 AM

Hi Ahmed -

I'm still blissfully ignorant of AVX-512, so my opinion shouldn't have as much weight as people who are working on that (cc some of the others that were on D10683?).

But I would lean towards the first solution: swap the operands for the X86ISD::VPERMV node. If I'm understanding the problem, this would (mostly?) limit the changes to the td defs. We barely document the DAG node operands or their orders anyway, so adding that kind of info seems fair to me. I think it's better to preserve the software uniformity as long as possible, even if the hardware instructions are a mess. I.e., the C intrinsics keep the expected order:
https://software.intel.com/en-us/node/524011
...so let's preserve that illusion as long as we can.

Hi Ahmed -

I'm still blissfully ignorant of AVX-512, so my opinion shouldn't have as much weight as people who are working on that (cc some of the others that were on D10683?).

But I would lean towards the first solution: swap the operands for the X86ISD::VPERMV node. If I'm understanding the problem, this would (mostly?) limit the changes to the td defs.

Yep

We barely document the DAG node operands or their orders anyway, so adding that kind of info seems fair to me. I think it's better to preserve the software uniformity as long as possible, even if the hardware instructions are a mess. I.e., the C intrinsics keep the expected order:
https://software.intel.com/en-us/node/524011
...so let's preserve that illusion as long as we can.

I agree; I'm mostly worried about us developers expecting the SD op to match the instruction.
I didn't realize the C intrinsics were swapped. IMO that's enough to justify #1.

..but now that I look it up, the AVX-512 intrinsics use the instruction order *sigh*
Back to square one.

-Ahmed
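
For reference, the ordering difference being discussed, as the intrinsics are documented in the Intel Intrinsics Guide (illustrative C++ only; building it needs AVX2 and AVX-512F enabled):

#include <immintrin.h>

// AVX2: data vector first, indices second -- the "software" order, which is
// the reverse of the vpermd instruction's operand order.
__m256i permute_avx2(__m256i Data, __m256i Idx) {
  return _mm256_permutevar8x32_epi32(Data, Idx);
}

// AVX-512: indices first, data second -- matching the instruction order.
__m512i permute_avx512(__m512i Idx, __m512i Data) {
  return _mm512_permutexvar_epi32(Idx, Data);
}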

ab updated this revision to Diff 47486.Feb 10 2016, 10:40 AM
ab edited edge metadata.

Add back context to original patch.

In D17041#348716, @ab wrote:

..but now that I look it up, the AVX-512 intrinsics use the instruction order *sigh*
Back to square one.

Wow. Is it too late for Intel to fix/deprecate/rename those intrinsics? If the argument is that the intrinsics should match the asm, then what happened with the AVX2 vperm variants?

Reply via the llvm-commits mailing list:

It is really too late. There are several other tool chains that already implemented the intrinsics as defined, and have been in use for quite a while.


RKSimon added inline comments.Feb 10 2016, 2:13 PM
lib/Target/X86/X86ISelLowering.cpp
23658 ↗(On Diff #47486)

Was there a need to add this? We already have all-undef and all-undef/zero handling below and in combineX86ShuffleChain.

24335 ↗(On Diff #47486)

Don't these need dealing with as well? Also, at the moment the getOperand(1) calls mean that unary target shuffles can't use this combine, as they would assert...

ab added inline comments.Feb 11 2016, 12:56 PM
lib/Target/X86/X86ISelLowering.cpp
23658 ↗(On Diff #47486)

Yes, but before we can figure that out, we only check for undef (see the TODO about zero/undef below), and we crash trying to call getOpcode() on SDValue().

But to be clear: the whole patch is very mechanical, and this is a hack. I'm mainly interested in whether I should proceed with this or swap the operands.

24335 ↗(On Diff #47486)

This actually doesn't get called for all shuffle opcodes, though it probably should.

24352 ↗(On Diff #47486)

And this looks like it should fall through to the other cases if PerformShuffleCombine256 didn't succeed.

RKSimon edited edge metadata.Feb 20 2016, 8:28 AM

Hi Ahmed, I ended up using some of the same code to fix PR26667 - please can you rebase this patch?

ab updated this revision to Diff 48718.Feb 22 2016, 12:25 PM
ab edited edge metadata.

Rebase away fixes covered by r261433-4.

Ping of sorts: Elena, all, how do you think we should fix this: swap the VPERMV operands, or change all of our code to stop assuming the mask is the last operand (this patch)? I don't do much AVX512, so I'm not comfortable making that decision.

Ping of sorts: Elena, all, how do you think we should fix this: swap the VPERMV operands, or change all of our code to stop assuming the mask is the last operand (this patch)? I don't do much AVX512, so I'm not comfortable making that decision.

In my opinion, the mask operand does not have to always be the last. I don't think that you need to swap the VPERMV operands. As for AVX-512, the form is the same:
vpermd %vec_op, %mask_op, %vec_res.

I'll tell you more - in AVX-512 we have two forms of 3-source shuffles:
VPERMT2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst Permute double-words from two tables in zmm3/m512/m32bcst and zmm1 using indices in zmm2 and store the result in zmm1 using writemask k1

VPERMI2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst Permute double-words from two tables in zmm3/m512/m32bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.

So, it's OK that the mask is not the last operand in the SDNode.

ab updated this revision to Diff 49433.Feb 29 2016, 4:37 PM

Simplify getTargetShuffleMask and rebase.

ab added a comment.Feb 29 2016, 4:52 PM

Ping of sorts: Elena, all, how do you think we should fix this: swap the VPERMV operands, or change all of our code to stop assuming the mask is the last operand (this patch)? I don't do much AVX512, so I'm not comfortable making that decision.

In my opinion, the mask operand does not have to always be the last. I don't think that you need to swap the VPERMV operands. As for AVX-512, the form is the same:
vpermd %vec_op, %mask_op, %vec_res.

I'll tell you more - in AVX-512 we have two forms of 3-source shuffles:
VPERMT2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst Permute double-words from two tables in zmm3/m512/m32bcst and zmm1 using indices in zmm2 and store the result in zmm1 using writemask k1

VPERMI2D zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst Permute double-words from two tables in zmm3/m512/m32bcst and zmm2 using indices in zmm1 and store the result in zmm1 using writemask k1.

So, it's OK that the mask is not the last operand in the SDNode.

All else being equal, I'd actually really prefer for the mask to always be last, because it helps preserve sanity in the backend :(

So, you say it's OK for VPERMV. But I think I'm really asking the opposite: is it OK if we make the mask always be the last operand?

Otherwise, I rebased and cleaned up the patch, so reviews are appreciated!

Some comments - they might be moot if altering the intrinsic argument ordering is going to happen.

lib/Target/X86/X86ISelLowering.cpp
4908 ↗(On Diff #49433)

If you're going to add this, maybe add an assertion for Mask.empty() as well?
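
A one-line sketch of the suggested assertion inside getTargetShuffleMask (exact message assumed):

assert(Mask.empty() && "getTargetShuffleMask expects an empty Mask vector");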

test/CodeGen/X86/avx2-vperm-combining.ll
3 ↗(On Diff #49433)

Possibly rename this to vector-shuffle-combining-avx2.ll? That would match the other test files that have already been added for SSSE3/AVX-only shuffle intrinsic combine tests.

16 ↗(On Diff #49433)

Add a permps test as well.

lib/Target/X86/X86ISelLowering.cpp
5014 ↗(On Diff #49433)

Make this comment explicit? Something like:
"Unlike most shuffle nodes, VPERMV's mask operand is operand 0."
I think we should also note this in the header file definition for any nodes that are "special".
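
A sketch of what such a header note could look like (wording and placement assumed; excerpt of the X86ISD opcode enum in X86ISelLowering.h):

// VPERMV - Variable permute.
// NOTE: Unlike most shuffle nodes, the mask is operand 0 and the vector
// source is operand 1, matching the vpermd/vpermps operand order, so don't
// assume getOperand(0) is a data vector.
VPERMV,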

Given that you've already done the work, I think we might as well move forward on this path.
There's really no hope of maintaining sanity. I just saw page 5-600 (p. 706) of:
https://software.intel.com/sites/default/files/managed/b4/3a/319433-024.pdf

Embrace the madness. :)

ab updated this revision to Diff 49558.Mar 1 2016, 3:43 PM
  • Cleanup and expand tests
  • Return the correct operands for VPERMV3 (can't get it to fire)
  • Document VPERMV weirdness in X86ISD
  • Assert that Mask is clear in getTargetShuffleMask; document that.
ab marked 6 inline comments as done.Mar 1 2016, 3:46 PM
ab added inline comments.
lib/Target/X86/X86ISelLowering.cpp
5038 ↗(On Diff #49558)

Yeah, Elena also mentioned VPERMT2 and (the worst of all) VPERMI2. I have lost all hope!

RKSimon accepted this revision.Mar 2 2016, 3:22 AM
RKSimon edited edge metadata.

VPERMIV3 isn't being used yet but you should be able to create some VPERMV3 tests for a vector-shuffle-combining-avx512.ll test file - check avx512-intrinsics.ll for direct calls to llvm.x86.avx512.mask.vpermt2var.*

Other than that - LGTM.

This revision is now accepted and ready to land.Mar 2 2016, 3:22 AM
ab marked an inline comment as done.Mar 3 2016, 8:58 AM

So, I don't think the VPERMV3 case is currently reachable (and boy did I try):

  • VPERMV3 is only formed during lowering
  • one getTargetShuffleMask caller is XFormVExtractWithShuffleIntoLoad, which operates on extractelt nodes, which are lowered into extractelt (extract_subvector) earlier.
  • extract_subvector prevents the BLENDI and INSERTPS combines as well
  • 512-bit shuffles are only lowered into VPERMV, VPERMV3, and UNPCK, none of which is unary, so combineX86ShufflesRecursively doesn't fire either
  • getShuffleScalarElt and combineShuffles can't be called for VPERMV3, as they're only called for a select few shuffle opcodes.

I might have missed something though, so ideas welcome!

This does expose a couple problems:

  • should we delete the VPERMV3 code, and add it back once it can actually fire?
  • should we try to run combineShuffle for all of the opcodes?
This revision was automatically updated to reflect the committed changes.
ab added a comment.Mar 3 2016, 8:59 AM

Also, thanks for the reviews everyone! r262627

In D17041#367330, @ab wrote:

So, I don't think the VPERMV3 case is currently reachable (and boy did I try):

Thanks for trying, Ahmed! You may have more luck after applying D17858 - not guaranteeing anything though.

I'd recommend keeping the code in for now - it's going to be needed sooner rather than later.

RKSimon added inline comments.Mar 3 2016, 2:22 PM
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
5076

Just noticed that we're not attempting to detect unary shuffles (which probably explains why you've had so much trouble getting combines to fire):

IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(2);
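
A standalone sketch of how that check could slot into the VPERMV3 handling in getTargetShuffleMask (names and surrounding structure are assumptions, not the committed code):

// Record VPERMV3's two vector sources and detect the (fake-)unary case.
// The mask sits between the sources, so the sources are operands 0 and 2.
static bool decodeVPERMV3Sources(SDNode *N, SmallVectorImpl<SDValue> &Ops,
                                 bool &IsUnary, bool &IsFakeUnary) {
  if (N->getOpcode() != X86ISD::VPERMV3)
    return false;
  IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(2);
  Ops.push_back(N->getOperand(0));
  Ops.push_back(N->getOperand(2));
  return true;
}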