This is an archive of the discontinued LLVM Phabricator instance.

Differential D34160

[Power9] Exploit vinserth instruction
Closed, Public

Authored by gyiu on Jun 13 2017, 12:51 PM.

Details

Reviewers

kbarton
nemanjai
inouehrs
sfertile
jtony
lei
hfinkel
stefanp
syzaara

Commits

rG671526148c54: Adds code to PPC ISEL lowering to recognize half-word inserts from…
rL317111: Adds code to PPC ISEL lowering to recognize half-word inserts from…

Summary

This patch adds code to PPC ISEL lowering to recognize half-word inserts from vector_shuffles, and use P9 shift and vector insert instructions instead of vperm.
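To make the summary concrete: with two defined operands, a shuffle qualifies when its half-word mask is the identity order of one operand except for a single lane taken from the other operand. The following is a minimal, self-contained sketch of that check and is not the code in the patch. `HalfwordInsert` and `findHalfwordInsert` are hypothetical names, and the mask is modelled as eight half-word indices in [0, 15], where 0-7 select the first operand and 8-15 the second. The single-operand (undef) case is not covered.

```cpp
#include <array>
#include <cassert>

// Hypothetical result type for this sketch.
struct HalfwordInsert {
  unsigned DestLane; // Lane of the result that receives the inserted element.
  unsigned SrcIndex; // Mask value at that lane (0-7: operand 0, 8-15: operand 1).
};

// Returns true and fills Out if exactly one lane deviates from the identity
// order of a base operand and that lane is taken from the other operand.
bool findHalfwordInsert(const std::array<unsigned, 8> &Mask,
                        HalfwordInsert &Out) {
  const unsigned Bases[] = {0, 8}; // Try operand 0, then operand 1, as the base.
  for (unsigned Base : Bases) {
    unsigned Mismatches = 0, Lane = 0;
    for (unsigned i = 0; i < 8; ++i) {
      assert(Mask[i] < 16 && "half-word index out of range");
      if (Mask[i] != Base + i) {
        ++Mismatches;
        Lane = i;
      }
    }
    // The deviating lane must name an element of the non-base operand.
    bool FromOtherOperand = (Mask[Lane] < 8) != (Base == 0);
    if (Mismatches == 1 && FromOtherOperand) {
      Out = HalfwordInsert{Lane, Mask[Lane]};
      return true;
    }
  }
  return false;
}
```

For example, the mask 0, 1, 2, 11, 4, 5, 6, 7 is accepted (lane 3 receives element 11, i.e. half-word 3 of the second operand), while the identity mask is rejected because no lane deviates.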

Diff Detail

Repository: rL LLVM

Event Timeline

gyiu created this revision. Jun 13 2017, 12:51 PM

This patch (potentially) increases the number of vector instructions (permutation -> shift + insert). Is my understanding correct?

lei added inline comments. Jun 14 2017, 12:03 PM

lib/Target/PowerPC/PPCISelLowering.cpp
1663	nit: don't forget the "." :)
test/CodeGen/PowerPC/p9-vinsert-vextract.ll
6	Can you add a short description of what each of these functions are testing?

In D34160#779972, @inouehrs wrote:

This patch (potentially) increases the number of vector instructions (permutation -> shift + insert). Is my understanding correct?

Yep. Though I think with a vperm you still need to load the mask into a vector register first, whereas with vshift + vinsert we're saving on the load.

In D34160#781261, @gyiu wrote:

Yep. Though I think with a vperm you still need to load the mask into a vector register first, whereas with vshift + vinsert we're saving on the load.

I feel we should not increase the number of vector instructions within a loop (i.e. a common case for vector code) if we can load the mask into a vector register before the loop.
When no additional shift is needed, it is nice to do this optimization in a loop, since it frees up one vector register.

jtony added inline comments. Jun 15 2017, 1:55 PM

lib/Target/PowerPC/PPCISelLowering.h
527	The community doesn't like the bool parameters here. But I am not sure whether we must remove them or it is just nice to do.
test/CodeGen/PowerPC/p9-vinsert-vextract.ll
9	I would prefer to use non-mangled function names to make it more readable. I think you can just regenerate the IR from c source file instead of cpp file.

Added comments and demangled function names for LIT tests
Added period to comment
Fixed issue when one operand of the shufflevector is 'undef', in which case the PPCISDs we generate will use only the defined one.
Initialize 'Swap' boolean

Go ahead and submit the rename separately (if you don't have commit access send me a patch and I'll do it). I would prefer to try to minimize boolean arguments - in this case you're adding a few more including a true/false and a return value one. Feel free to refactor the code around it to require fewer booleans or have multiple functions with helper functions that end up being called.

Thanks!

lib/Target/PowerPC/PPCISelLowering.h
527	Please do.

jtony added inline comments. Jun 16 2017, 7:55 AM

lib/Target/PowerPC/PPCISelLowering.cpp
1640	Is there any reason why we use uint32_t? If not, I would use unsigned instead to make it consistent. We are using both unsigned and uint32_t in this file.

gyiu marked 2 inline comments as done. Jun 16 2017, 10:13 AM

gyiu added inline comments. Jun 19 2017, 8:05 AM

lib/Target/PowerPC/PPCISelLowering.cpp
1640	I think using uint32_t is appropriate here because I want to illustrate that I'm using 32-bits exactly. Using 'unsigned int' would be fine as well, but I don't think it's as clear that I'm looking for 32-bits exactly.
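The reason exactly 32 bits matter is that the patch packs eight 4-bit half-word indices into one word and compares it against a reference order with one nibble masked out. The following is a rough sketch of that idea under my own naming: `packMask` and `matchesExceptLane` are hypothetical helpers, not functions from the patch. The reference constants correspond to the patch's `OriginalOrderLow` and `OriginalOrderHigh`. The nibble mask is built from an unsigned 32-bit value here so that the shift by 28 bits stays well defined.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Pack eight half-word indices (each in [0, 15]) into one 32-bit word,
// with lane 0 in the most significant nibble.
uint32_t packMask(const std::array<unsigned, 8> &Lanes) {
  uint32_t Packed = 0;
  for (unsigned i = 0; i < 8; ++i) {
    assert(Lanes[i] < 16 && "each lane must fit in a nibble");
    Packed |= static_cast<uint32_t>(Lanes[i]) << ((7 - i) * 4);
  }
  return Packed;
}

// True if every lane except Lane matches the corresponding nibble of Order.
bool matchesExceptLane(uint32_t Packed, uint32_t Order, unsigned Lane) {
  assert(Lane < 8 && "lane out of range");
  uint32_t OtherLanes = ~(static_cast<uint32_t>(0xF) << ((7 - Lane) * 4));
  return (Packed & OtherLanes) == (Order & OtherLanes);
}
```

With this layout the identity order of the first operand packs to 0x01234567 and that of the second operand to 0x89ABCDEF, so eight lanes fill the 32 bits with none left over.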

Added -O0 to LIT tests to test corner case of undef 2nd operand of vector shuffle.
Refactored VINSERTH code to avoid boolean parameters and return value.
Merged loops for 2nd operand undefined case and both operands defined.

echristo added inline comments. Jun 19 2017, 11:19 AM

lib/Target/PowerPC/PPCISelLowering.cpp
7954	Nit: (Here and other places) Comments are complete sentences including punctuation.

In D34160#781301, @inouehrs wrote:

I feel we should not increase the number of vector instructions within a loop (i.e. a common case for vector code) if we can load the mask into a vector register before the loop.
When no additional shift is needed, it is nice to do this optimization in a loop, since it frees up one vector register.

Although I definitely agree that we should take steps to ensure we don't introduce further instructions in loops, I'm not sure that avoiding a 2-instruction sequence for a shuffle is necessarily the right thing to do. This statement is predicated on the fact that we can hoist the constant pool load out of a loop. If register pressure prevents this, we will have a load in the loop. Furthermore, if the loop is large enough and has other memory operations, it is conceivable that the constant pool load could be a cache miss on every iteration. And it is conceivable that such large loops will be the ones for which register pressure prevents the hoisting of the load. Furthermore, if the GPR register pressure is also high, we might not even be able to hoist the address calculations outside the loop, which would make the vperm sequence 3-4 instructions.
I think that at ISEL time, we should favour shorter instruction sequences that don't involve loads. And perhaps if we can show that multi-instruction permute sequences in loops appear enough in real code, we might want to have a loop pass that simplifies them into a load outside the loop with a vperm in the loop in general.

lib/Target/PowerPC/PPCISelLowering.h
1079	This is probably a remnant of a previous implementation. Please rewrite the comment.

nemanjai added inline comments. Jun 21 2017, 8:35 PM

lib/Target/PowerPC/PPCISelLowering.cpp
7948	So we don't want to shift if we're within the same register? Is there a specific reason for this?
7959	Isn't this already guaranteed to only have the low order 3 bits set?
7964	Why would we continue to search if we've already confirmed that: (1) we have an element from vector A, and (2) all other elements are from vector B in the correct order?

Addressed comments about my comments (grammar, periods, etc).
Removed irrelevant comments in PPCISelLowering.h

gyiu marked 2 inline comments as done. Jun 21 2017, 9:58 PM

gyiu added inline comments.

lib/Target/PowerPC/PPCISelLowering.cpp
7948	I believe we have to add an xxlor to another VR if we want to shift the vector, since we can't shift if both operands of the vector shuffle are the same vector. Adding another two cycles to VECSHL+VECINSERT seems to diminish its value versus load+vperm.
7959	MaskOneElt could actually be >= 8, since the mask is in range [0, 15].
7964	Yep, you're correct. Need a break here since we can't find more than one candidate.
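The point about line 7959 can be illustrated with a small sketch: a half-word index in the mask ranges over [0, 15], where 8-15 name the second operand, so only the low three bits pick the lane within a vector and therefore the entry in an eight-element shift table. The two tables below are copied from the patch; `shiftForElement` is a hypothetical wrapper written only for this illustration.

```cpp
#include <cassert>

// Shift tables as they appear in the patch: the number of half-word shifts
// required to get the wanted half-word to element 3.
constexpr unsigned LittleEndianShifts[] = {4, 3, 2, 1, 0, 7, 6, 5};
constexpr unsigned BigEndianShifts[] = {5, 6, 7, 0, 1, 2, 3, 4};

// MaskElt is a half-word index in [0, 15]. Without the "& 0x7", an index of
// 8 or more would read past the end of the eight-entry tables.
unsigned shiftForElement(unsigned MaskElt, bool IsLE) {
  assert(MaskElt < 16 && "half-word index out of range");
  unsigned Lane = MaskElt & 0x7;
  return IsLE ? LittleEndianShifts[Lane] : BigEndianShifts[Lane];
}
```

For instance, indices 3 and 11 both map to lane 3 and need no shift in big-endian mode, while index 8 maps to lane 0 and needs a shift of 5.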

nemanjai added inline comments. Jun 23 2017, 6:17 AM

lib/Target/PowerPC/PPCISelLowering.cpp
7948	I don't really see why. Assume that you have something like this:

vector unsigned short test(vector unsigned short a) {
a[5] = a[2];
}

I don't see why we can't codegen something like this for it:

vsldoi 3, 2, 2, 4
vinserth 2, 3, 4

Forgive me if I didn't work out the immediates exactly correctly, but the point is the [lack of] need for the XXLOR. Of course, this does use an extra register, but so does the alternative (vperm).
7959	Ah, right. I didn't think of that. Sorry about that.

gyiu added inline comments. Jun 23 2017, 11:49 AM

lib/Target/PowerPC/PPCISelLowering.cpp
7948	Hmmm... Yep, you're right. I guess I can simplify my code even further now. I think this also means I have to fix up the code for the original xxinsertw lowering in a separate patch.

nemanjai added inline comments. Jun 23 2017, 1:08 PM

lib/Target/PowerPC/PPCISelLowering.cpp
7948	Yes, as @echristo mentioned, you should do all the renaming of things in a separate patch that doesn't really require a review. You're just renaming stuff.

I'll open a separate item to address Nemanja's comments, as I will not get a chance to do another enhancement.

I don't really see why. Assume that you have something like this:

vector unsigned short test(vector unsigned short a) {
a[5] = a[2];
}

I don't see why we can't codegen something like this for it:

vsldoi 3, 2, 2, 4
vinserth 2, 3, 4

Forgive me if I didn't work out the immediates exactly correctly, but the point is the [lack of] need for the XXLOR. Of course, this does use an extra register, but so does the alternative (vperm).
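The single-input case quoted above can be modelled on plain arrays to see that a rotate followed by an insert is enough and no extra copy is needed. This is only a sketch under stated assumptions: the vector is treated as eight half-words in big-endian element order, the shift of a vector concatenated with itself is modelled as a rotate, and the insert step is assumed to copy element 3 of its source to a given byte offset (the patch's comment describes getting the wanted half-word to element 3). `V8i16`, `rotateLeft`, `insertElement3` and `copyLane` are hypothetical names, and the sketch does not establish the exact `vsldoi` or `vinserth` immediates for either endianness.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V8i16 = std::array<uint16_t, 8>; // Half-words, big-endian element order.

// Models a left shift of V concatenated with itself, i.e. a rotate, by N
// half-words.
V8i16 rotateLeft(const V8i16 &V, unsigned N) {
  V8i16 R{};
  for (unsigned i = 0; i < 8; ++i)
    R[i] = V[(i + N) % 8];
  return R;
}

// Models the insert step: element 3 of Src is written into Dst at byte offset
// InsertAtByte, which is assumed to be even and at most 14.
V8i16 insertElement3(V8i16 Dst, const V8i16 &Src, unsigned InsertAtByte) {
  assert(InsertAtByte % 2 == 0 && InsertAtByte <= 14);
  Dst[InsertAtByte / 2] = Src[3];
  return Dst;
}

// a[DstLane] = a[SrcLane], expressed as rotate + insert on the same input.
V8i16 copyLane(const V8i16 &A, unsigned SrcLane, unsigned DstLane) {
  assert(SrcLane < 8 && DstLane < 8);
  unsigned Shift = (SrcLane + 8 - 3) % 8; // Brings SrcLane to element 3.
  return insertElement3(A, rotateLeft(A, Shift), DstLane * 2);
}
```

In this model, `copyLane(a, 2, 5)` rotates by 7 half-words (matching entry 2 of the patch's `BigEndianShifts` table) and inserts at byte 10. The rotated value lives in a temporary, which mirrors the observation that one extra register is used.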

lib/Target/PowerPC/PPCISelLowering.cpp
7948	Actually, I'm not quite sure what you mean here. The original code for xxinsertw has the limitation of only being able to insert element 3 if both input vectors to the vector_shuffle are the same. I'll need to change that in a separate patch. I'm not sure where the 'renaming of things' comes into play?

nemanjai added inline comments. Jun 26 2017, 9:14 AM

lib/Target/PowerPC/PPCISelLowering.cpp
1142	By "renaming stuff", I mean things like this. Kind of orthogonal to the patch and should go in as a separate NFC change.

Added breaks to stop searching for the pattern once I've found a candidate.

gyiu marked an inline comment as done. Jun 26 2017, 9:15 AM

nemanjai added inline comments. Jun 29 2017, 1:45 AM

lib/Target/PowerPC/PPCISelLowering.cpp
7941	You should be able to get rid of this condition here.
- Move the assignment `if (V2.isUndef()) V2 = V1;` above here.
- Use the `OriginalOrderLow` if the two vectors are the same.
The rest should fall out naturally and we'll do the shift for the single-input case as well. And the code will also be simpler.

gyiu added inline comments. Aug 23 2017, 10:35 AM

lib/Target/PowerPC/PPCISelLowering.cpp
7941	@nemanjai I created Issue #410 on github to address the issue when using vector shifts in the case when both inputs are the same vector. There's further investigation that's required as it's not clear which input/output registers the (vector shift + vector extract) sequence uses in this case. I would rather do this change as part of that work item instead.

Refactored NFCs to another patch to be committed.
Made changes to remove restriction on only recognizing shuffles of halfword element 3 (4 in LE mode) when both input vectors are the same vector. That is, we can now recognize all single element shuffles in this situation.

Note that I was able to re-implement Nemanja's suggestion of generalizing the case when both inputs are the same vector because the registers used in code-gen are now consistent. Not sure if it was a real problem that I saw previously, or a transient issue that was fixed with newer levels of LLVM.

Changed my mind, removed changes related to this comment:

"Made changes to remove restriction on only recognizing shuffles of halfword element 3 (4 in LE mode) when both input vectors are the same vector. That is, we can now recognize all single element shuffles in this situation."

Will use a different patch to remove the restriction instead, as the contents of this patch are still functionally correct.

kbarton added inline comments. Oct 23 2017, 8:03 PM

lib/Target/PowerPC/PPCISelLowering.cpp
117	Is this still necessary? I don't see any calls to it - only to the 3-parameter version of it.

gyiu marked an inline comment as done. Oct 24 2017, 8:29 AM

gyiu added inline comments.

lib/Target/PowerPC/PPCISelLowering.cpp
117	Yeah, this patch is old so the declaration is based on the version that had two parameters. I'll update it when I merge with the latest code.

LGTM

This revision is now accepted and ready to land. Oct 24 2017, 3:27 PM

Closed by commit rL317111: Adds code to PPC ISEL lowering to recognize half-word inserts from… (authored by gyiu). Nov 1 2017, 11:07 AM

This revision was automatically updated to reflect the committed changes.

gyiu marked an inline comment as done.

Revision Contents

Path	Size
lib/Target/PowerPC/PPCISelLowering.h	9 lines
lib/Target/PowerPC/PPCISelLowering.cpp	119 lines
lib/Target/PowerPC/PPCInstrAltivec.td	16 lines
test/CodeGen/PowerPC/p9-vinsert-vextract.ll	300 lines

Diff 113259

lib/Target/PowerPC/PPCISelLowering.h

(518 lines not shown)	namespace PPC {
/// getVSPLTImmediate - Return the appropriate VSPLT* immediate to splat the		/// getVSPLTImmediate - Return the appropriate VSPLT* immediate to splat the
/// specified isSplatShuffleMask VECTOR_SHUFFLE mask.		/// specified isSplatShuffleMask VECTOR_SHUFFLE mask.
unsigned getVSPLTImmediate(SDNode *N, unsigned EltSize, SelectionDAG &DAG);		unsigned getVSPLTImmediate(SDNode *N, unsigned EltSize, SelectionDAG &DAG);

/// get_VSPLTI_elt - If this is a build_vector of constants which can be		/// get_VSPLTI_elt - If this is a build_vector of constants which can be
/// formed by using a vspltis[bhw] instruction of the specified element		/// formed by using a vspltis[bhw] instruction of the specified element
/// size, return the constant being splatted. The ByteSize field indicates		/// size, return the constant being splatted. The ByteSize field indicates
/// the number of bytes of each element [124] -> [bhw].		/// the number of bytes of each element [124] -> [bhw].
SDValue get_VSPLTI_elt(SDNode *N, unsigned ByteSize, SelectionDAG &DAG);		SDValue get_VSPLTI_elt(SDNode *N, unsigned ByteSize, SelectionDAG &DAG);

/// If this is a qvaligni shuffle mask, return the shift		/// If this is a qvaligni shuffle mask, return the shift
/// amount, otherwise return -1.		/// amount, otherwise return -1.
int isQVALIGNIShuffleMask(SDNode *N);		int isQVALIGNIShuffleMask(SDNode *N);

} // end namespace PPC		} // end namespace PPC

class PPCTargetLowering : public TargetLowering {		class PPCTargetLowering : public TargetLowering {
(531 lines not shown)	SDValue getRecipEstimate(SDValue Operand, SelectionDAG &DAG, int Enabled,
int &RefinementSteps) const override;		int &RefinementSteps) const override;
unsigned combineRepeatedFPDivisors() const override;		unsigned combineRepeatedFPDivisors() const override;

CCAssignFn *useFastISelCCs(unsigned Flag) const;		CCAssignFn *useFastISelCCs(unsigned Flag) const;

SDValue		SDValue
combineElementTruncationToVectorTruncation(SDNode *N,		combineElementTruncationToVectorTruncation(SDNode *N,
DAGCombinerInfo &DCI) const;		DAGCombinerInfo &DCI) const;
};
		/// lowerToVINSERTH - Return the SDValue if this VECTOR_SHUFFLE can be
		/// handled by the VINSERTH instruction introduced in ISA 3.0. This is
		/// essentially any shuffle of v8i16 vectors that just inserts one element
		/// from one vector into the other.
		SDValue lowerToVINSERTH(ShuffleVectorSDNode *N, SelectionDAG &DAG) const;

		}; // end class PPCTargetLowering

namespace PPC {		namespace PPC {

FastISel *createFastISel(FunctionLoweringInfo &FuncInfo,		FastISel *createFastISel(FunctionLoweringInfo &FuncInfo,
const TargetLibraryInfo *LibInfo);		const TargetLibraryInfo *LibInfo);

} // end namespace PPC		} // end namespace PPC

(30 lines not shown)

lib/Target/PowerPC/PPCISelLowering.cpp

(108 lines not shown)
cl::desc("disable unaligned load/store generation on PPC"), cl::Hidden);		cl::desc("disable unaligned load/store generation on PPC"), cl::Hidden);

static cl::opt<bool> DisableSCO("disable-ppc-sco",		static cl::opt<bool> DisableSCO("disable-ppc-sco",
cl::desc("disable sibling call optimization on ppc"), cl::Hidden);		cl::desc("disable sibling call optimization on ppc"), cl::Hidden);

STATISTIC(NumTailCalls, "Number of tail calls");		STATISTIC(NumTailCalls, "Number of tail calls");
STATISTIC(NumSiblingCalls, "Number of sibling calls");		STATISTIC(NumSiblingCalls, "Number of sibling calls");

		static bool isNByteElemShuffleMask(ShuffleVectorSDNode *, unsigned, int);

// FIXME: Remove this once the bug has been fixed!		// FIXME: Remove this once the bug has been fixed!
extern cl::opt<bool> ANDIGlueBug;		extern cl::opt<bool> ANDIGlueBug;

PPCTargetLowering::PPCTargetLowering(const PPCTargetMachine &TM,		PPCTargetLowering::PPCTargetLowering(const PPCTargetMachine &TM,
const PPCSubtarget &STI)		const PPCSubtarget &STI)
: TargetLowering(TM), Subtarget(STI) {		: TargetLowering(TM), Subtarget(STI) {
// Use _setjmp/_longjmp instead of setjmp/longjmp.		// Use _setjmp/_longjmp instead of setjmp/longjmp.
setUseUnderscoreSetJmp(true);		setUseUnderscoreSetJmp(true);
(1,007 lines not shown)	const char *PPCTargetLowering::getTargetNodeName(unsigned Opcode) const {
case PPCISD::FCTIWUZ: return "PPCISD::FCTIWUZ";		case PPCISD::FCTIWUZ: return "PPCISD::FCTIWUZ";
case PPCISD::FRE: return "PPCISD::FRE";		case PPCISD::FRE: return "PPCISD::FRE";
case PPCISD::FRSQRTE: return "PPCISD::FRSQRTE";		case PPCISD::FRSQRTE: return "PPCISD::FRSQRTE";
case PPCISD::STFIWX: return "PPCISD::STFIWX";		case PPCISD::STFIWX: return "PPCISD::STFIWX";
case PPCISD::VMADDFP: return "PPCISD::VMADDFP";		case PPCISD::VMADDFP: return "PPCISD::VMADDFP";
case PPCISD::VNMSUBFP: return "PPCISD::VNMSUBFP";		case PPCISD::VNMSUBFP: return "PPCISD::VNMSUBFP";
case PPCISD::VPERM: return "PPCISD::VPERM";		case PPCISD::VPERM: return "PPCISD::VPERM";
case PPCISD::XXSPLT: return "PPCISD::XXSPLT";		case PPCISD::XXSPLT: return "PPCISD::XXSPLT";
case PPCISD::VECINSERT: return "PPCISD::VECINSERT";		case PPCISD::VECINSERT: return "PPCISD::VECINSERT";
case PPCISD::XXREVERSE: return "PPCISD::XXREVERSE";		case PPCISD::XXREVERSE: return "PPCISD::XXREVERSE";
case PPCISD::XXPERMDI: return "PPCISD::XXPERMDI";		case PPCISD::XXPERMDI: return "PPCISD::XXPERMDI";
case PPCISD::VECSHL: return "PPCISD::VECSHL";		case PPCISD::VECSHL: return "PPCISD::VECSHL";
case PPCISD::CMPB: return "PPCISD::CMPB";		case PPCISD::CMPB: return "PPCISD::CMPB";
case PPCISD::Hi: return "PPCISD::Hi";		case PPCISD::Hi: return "PPCISD::Hi";
case PPCISD::Lo: return "PPCISD::Lo";		case PPCISD::Lo: return "PPCISD::Lo";
case PPCISD::TOC_ENTRY: return "PPCISD::TOC_ENTRY";		case PPCISD::TOC_ENTRY: return "PPCISD::TOC_ENTRY";
case PPCISD::DYNALLOC: return "PPCISD::DYNALLOC";		case PPCISD::DYNALLOC: return "PPCISD::DYNALLOC";
(481 lines not shown)
/// Word/DoubleWord/QuadWord).		/// Word/DoubleWord/QuadWord).
/// \param[in] StepLen the delta indices number among the N byte element, if		/// \param[in] StepLen the delta indices number among the N byte element, if
/// the mask is in increasing/decreasing order then it is 1/-1.		/// the mask is in increasing/decreasing order then it is 1/-1.
/// \return true iff the mask is shuffling N byte elements.		/// \return true iff the mask is shuffling N byte elements.
static bool isNByteElemShuffleMask(ShuffleVectorSDNode *N, unsigned Width,		static bool isNByteElemShuffleMask(ShuffleVectorSDNode *N, unsigned Width,
int StepLen) {		int StepLen) {
assert((Width == 2 || Width == 4 || Width == 8 || Width == 16) &&		assert((Width == 2 || Width == 4 || Width == 8 || Width == 16) &&
"Unexpected element width.");		"Unexpected element width.");
assert((StepLen == 1 || StepLen == -1) && "Unexpected element width.");		assert((StepLen == 1 || StepLen == -1) && "Unexpected element width.");

unsigned NumOfElem = 16 / Width;		unsigned NumOfElem = 16 / Width;
unsigned MaskVal[16]; // Width is never greater than 16		unsigned MaskVal[16]; // Width is never greater than 16
for (unsigned i = 0; i < NumOfElem; ++i) {		for (unsigned i = 0; i < NumOfElem; ++i) {
MaskVal[0] = N->getMaskElt(i * Width);		MaskVal[0] = N->getMaskElt(i * Width);
if ((StepLen == 1) && (MaskVal[0] % Width)) {		if ((StepLen == 1) && (MaskVal[0] % Width)) {
return false;		return false;
} else if ((StepLen == -1) && ((MaskVal[0] + 1) % Width)) {		} else if ((StepLen == -1) && ((MaskVal[0] + 1) % Width)) {
return false;		return false;
}		}

for (unsigned int j = 1; j < Width; ++j) {		for (unsigned int j = 1; j < Width; ++j) {
MaskVal[j] = N->getMaskElt(i * Width + j);		MaskVal[j] = N->getMaskElt(i * Width + j);
if (MaskVal[j] != MaskVal[j-1] + StepLen) {		if (MaskVal[j] != MaskVal[j-1] + StepLen) {
return false;		return false;
}		}
}		}
}		}

return true;		return true;
}		}

bool PPC::isXXINSERTWMask(ShuffleVectorSDNode *N, unsigned &ShiftElts,		bool PPC::isXXINSERTWMask(ShuffleVectorSDNode *N, unsigned &ShiftElts,
unsigned &InsertAtByte, bool &Swap, bool IsLE) {		unsigned &InsertAtByte, bool &Swap, bool IsLE) {
if (!isNByteElemShuffleMask(N, 4, 1))		if (!isNByteElemShuffleMask(N, 4, 1))
return false;		return false;

// Now we look at mask elements 0,4,8,12		// Now we look at mask elements 0,4,8,12
unsigned M0 = N->getMaskElt(0) / 4;		unsigned M0 = N->getMaskElt(0) / 4;
unsigned M1 = N->getMaskElt(4) / 4;		unsigned M1 = N->getMaskElt(4) / 4;
unsigned M2 = N->getMaskElt(8) / 4;		unsigned M2 = N->getMaskElt(8) / 4;
(6,204 lines not shown)	static SDValue GeneratePerfectShuffle(unsigned PFEntry, SDValue LHS,
}		}
EVT VT = OpLHS.getValueType();		EVT VT = OpLHS.getValueType();
OpLHS = DAG.getNode(ISD::BITCAST, dl, MVT::v16i8, OpLHS);		OpLHS = DAG.getNode(ISD::BITCAST, dl, MVT::v16i8, OpLHS);
OpRHS = DAG.getNode(ISD::BITCAST, dl, MVT::v16i8, OpRHS);		OpRHS = DAG.getNode(ISD::BITCAST, dl, MVT::v16i8, OpRHS);
SDValue T = DAG.getVectorShuffle(MVT::v16i8, dl, OpLHS, OpRHS, ShufIdxs);		SDValue T = DAG.getVectorShuffle(MVT::v16i8, dl, OpLHS, OpRHS, ShufIdxs);
return DAG.getNode(ISD::BITCAST, dl, VT, T);		return DAG.getNode(ISD::BITCAST, dl, VT, T);
}		}

		/// lowerToVINSERTH - Return the SDValue if this VECTOR_SHUFFLE can be handled
		/// by the VINSERTH instruction introduced in ISA 3.0, else just return default
		/// SDValue.
		SDValue PPCTargetLowering::lowerToVINSERTH(ShuffleVectorSDNode *N,
		SelectionDAG &DAG) const {
		const unsigned NumHalfWords = 8;
		const unsigned BytesInVector = NumHalfWords * 2;
		// Check that the shuffle is on half-words.
		if (!isNByteElemShuffleMask(N, 2, 1))
		return SDValue();

		bool IsLE = Subtarget.isLittleEndian();
		SDLoc dl(N);
		SDValue V1 = N->getOperand(0);
		SDValue V2 = N->getOperand(1);
		unsigned ShiftElts = 0, InsertAtByte = 0;
		bool Swap = false;

		// Shifts required to get the half-word we want at element 3.
		unsigned LittleEndianShifts[] = {4, 3, 2, 1, 0, 7, 6, 5};
		unsigned BigEndianShifts[] = {5, 6, 7, 0, 1, 2, 3, 4};

		uint32_t Mask = 0;
		uint32_t OriginalOrderLow = 0x1234567;
		uint32_t OriginalOrderHigh = 0x89ABCDEF;
		// Now we look at mask elements 0,2,4,6,8,10,12,14. Pack the mask into a
		// 32-bit space, only need 4-bit nibbles per element.
		for (unsigned i = 0; i < NumHalfWords; ++i) {
		unsigned MaskShift = (NumHalfWords - 1 - i) * 4;
		Mask |= ((uint32_t)(N->getMaskElt(i * 2) / 2) << MaskShift);
		}

		// For each mask element, find out if we're just inserting something
		// from V2 into V1 or vice versa. Possible permutations inserting an element
		// from V2 into V1:
		// X, 1, 2, 3, 4, 5, 6, 7
		// 0, X, 2, 3, 4, 5, 6, 7
		// 0, 1, X, 3, 4, 5, 6, 7
		// 0, 1, 2, X, 4, 5, 6, 7
		// 0, 1, 2, 3, X, 5, 6, 7
		// 0, 1, 2, 3, 4, X, 6, 7
		// 0, 1, 2, 3, 4, 5, X, 7
		// 0, 1, 2, 3, 4, 5, 6, X
		// Inserting from V1 into V2 will be similar, except mask range will be [8,15].

		bool FoundCandidate = false;
		// Go through the mask of half-words to find an element that's being moved
		// from one vector to the other.
		for (unsigned i = 0; i < NumHalfWords; ++i) {
		unsigned MaskShift = (NumHalfWords - 1 - i) * 4;
		uint32_t MaskOneElt = (Mask >> MaskShift) & 0xF;
		uint32_t MaskOtherElts = ~(0xF << MaskShift);
		uint32_t TargetOrder = 0x0;

		// If both vector operands for the shuffle are the same vector, the mask
		// will contain only elements from the first one and the second one will be
		// undef.
		if (V2.isUndef()) {
		ShiftElts = 0;
		unsigned VINSERTHSrcElem = IsLE ? 4 : 3;
		TargetOrder = OriginalOrderLow;
		Swap = false;
		// Skip if not the correct element or mask of other elements don't equal
		// to our expected order.
		if (MaskOneElt == VINSERTHSrcElem &&
		(Mask & MaskOtherElts) == (TargetOrder & MaskOtherElts)) {
		InsertAtByte = IsLE ? BytesInVector - (i + 1) * 2 : i * 2;
		FoundCandidate = true;
		break;
		}
		} else { // If both operands are defined.
		// Target order is [8,15] if the current mask is between [0,7].
		TargetOrder =
		(MaskOneElt < NumHalfWords) ? OriginalOrderHigh : OriginalOrderLow;
		// Skip if mask of other elements don't equal our expected order.
		if ((Mask & MaskOtherElts) == (TargetOrder & MaskOtherElts)) {
		// We only need the last 3 bits for the number of shifts.
		ShiftElts = IsLE ? LittleEndianShifts[MaskOneElt & 0x7]
		: BigEndianShifts[MaskOneElt & 0x7];
		InsertAtByte = IsLE ? BytesInVector - (i + 1) * 2 : i * 2;
		Swap = MaskOneElt < NumHalfWords;
		FoundCandidate = true;
		break;
		}
		}
		}

		if (!FoundCandidate)
		return SDValue();

		// Candidate found, construct the proper SDAG sequence with VINSERTH,
		// optionally with VECSHL if shift is required.
		if (Swap)
		std::swap(V1, V2);
		if (V2.isUndef())
		V2 = V1;
		SDValue Conv1 = DAG.getNode(ISD::BITCAST, dl, MVT::v8i16, V1);
		if (ShiftElts) {
		// Double ShiftElts because we're left shifting on v16i8 type.
		SDValue Shl = DAG.getNode(PPCISD::VECSHL, dl, MVT::v16i8, V2, V2,
		DAG.getConstant(2 * ShiftElts, dl, MVT::i32));
		SDValue Conv2 = DAG.getNode(ISD::BITCAST, dl, MVT::v8i16, Shl);
		SDValue Ins = DAG.getNode(PPCISD::VECINSERT, dl, MVT::v8i16, Conv1, Conv2,
		DAG.getConstant(InsertAtByte, dl, MVT::i32));
		return DAG.getNode(ISD::BITCAST, dl, MVT::v16i8, Ins);
		}
		SDValue Conv2 = DAG.getNode(ISD::BITCAST, dl, MVT::v8i16, V2);
		SDValue Ins = DAG.getNode(PPCISD::VECINSERT, dl, MVT::v8i16, Conv1, Conv2,
		DAG.getConstant(InsertAtByte, dl, MVT::i32));
		return DAG.getNode(ISD::BITCAST, dl, MVT::v16i8, Ins);
		}
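
The candidate search and operand computation for the two-operand case can be modeled outside LLVM at halfword granularity. The following is a minimal sketch under stated assumptions, not the patch's code: VInsertHPlan and planTwoOperandInsert are hypothetical names, the mask is given as eight halfword indices in [0, 15] rather than the packed byte mask the patch works with, and the two shift tables are inferred from the vsldoi immediates in the test expectations (the patch defines its own tables outside this excerpt).

```cpp
#include <array>
#include <cassert>

// Hypothetical result type; the patch keeps these values in local variables.
struct VInsertHPlan {
  bool Found;            // False when the mask is not a single-halfword insert.
  bool Swap;             // True when the inserted halfword comes from V1.
  unsigned ShiftElts;    // Halfwords to rotate the source vector left.
  unsigned InsertAtByte; // Byte offset passed to vinserth.
};

// Assumed tables, indexed by (source element & 0x7). They are inferred from
// the vsldoi immediates in p9-vinsert-vextract.ll (immediate / 2).
constexpr unsigned LittleEndianShifts[8] = {4, 3, 2, 1, 0, 7, 6, 5};
constexpr unsigned BigEndianShifts[8] = {5, 6, 7, 0, 1, 2, 3, 4};

VInsertHPlan planTwoOperandInsert(const std::array<int, 8> &Mask, bool IsLE) {
  constexpr unsigned NumHalfWords = 8, BytesInVector = 16;
  for (unsigned i = 0; i < NumHalfWords; ++i) {
    assert(Mask[i] >= 0 && Mask[i] < 16 && "sketch does not handle undef lanes");
    unsigned MaskOneElt = static_cast<unsigned>(Mask[i]);
    // If lane i reads V1 ([0,7]), every other lane must read V2 in order
    // ([8,15]), and vice versa.
    unsigned OtherBase = (MaskOneElt < NumHalfWords) ? NumHalfWords : 0;
    bool OthersInOrder = true;
    for (unsigned j = 0; j < NumHalfWords; ++j)
      if (j != i && Mask[j] != static_cast<int>(OtherBase + j))
        OthersInOrder = false;
    if (!OthersInOrder)
      continue;
    // Only the low 3 bits select the shift; the element may be >= 8.
    unsigned ShiftElts = IsLE ? LittleEndianShifts[MaskOneElt & 0x7]
                              : BigEndianShifts[MaskOneElt & 0x7];
    unsigned InsertAtByte = IsLE ? BytesInVector - (i + 1) * 2 : i * 2;
    // At most one lane can qualify, so stop at the first match.
    return {true, MaskOneElt < NumHalfWords, ShiftElts, InsertAtByte};
  }
  return {false, false, 0, 0};
}
```

For the mask <8, 1, 2, 3, 4, 5, 6, 7> in little-endian mode, this sketch yields a rotate of 4 halfwords (8 bytes) and an insert offset of 14, which agrees with the "vsldoi 3, 3, 3, 8" and "vinserth 2, 3, 14" expectations in the test file.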

/// LowerVECTOR_SHUFFLE - Return the code we lower for VECTOR_SHUFFLE. If this
/// is a shuffle we can handle in a single instruction, return it. Otherwise,
/// return the code it can be lowered into. Worst case, it can always be
/// lowered into a vperm.
SDValue PPCTargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,
SelectionDAG &DAG) const {
SDLoc dl(Op);
SDValue V1 = Op.getOperand(0);
[... 18 lines not shown; enclosing scope: if (ShiftElts) { ...]
DAG.getConstant(InsertAtByte, dl, MVT::i32));
return DAG.getNode(ISD::BITCAST, dl, MVT::v16i8, Ins);
}
SDValue Ins = DAG.getNode(PPCISD::VECINSERT, dl, MVT::v4i32, Conv1, Conv2,
DAG.getConstant(InsertAtByte, dl, MVT::i32));
return DAG.getNode(ISD::BITCAST, dl, MVT::v16i8, Ins);
}

		if (Subtarget.hasP9Altivec()) {
		SDValue NewISDNode = lowerToVINSERTH(SVOp, DAG);
		if (NewISDNode)
		return NewISDNode;
		}

if (Subtarget.hasVSX() &&
PPC::isXXSLDWIShuffleMask(SVOp, ShiftElts, Swap, isLittleEndian)) {
if (Swap)
std::swap(V1, V2);
SDValue Conv1 = DAG.getNode(ISD::BITCAST, dl, MVT::v4i32, V1);
SDValue Conv2 =
DAG.getNode(ISD::BITCAST, dl, MVT::v4i32, V2.isUndef() ? V1 : V2);
[... 5,600 lines not shown ...]

lib/Target/PowerPC/PPCInstrAltivec.td

	[... 471 lines not shown ...]
	def VMLADDUHM : VA1a_Int_Ty<34, "vmladduhm", int_ppc_altivec_vmladduhm, v8i16>;
	} // isCommutable

	def VPERM : VA1a_Int_Ty3<43, "vperm", int_ppc_altivec_vperm,
	v4i32, v4i32, v16i8>;
	def VSEL : VA1a_Int_Ty<42, "vsel", int_ppc_altivec_vsel, v4i32>;

	// Shuffles.
	def VSLDOI : VAForm_2<44, (outs vrrc:$vD), (ins vrrc:$vA, vrrc:$vB, u5imm:$SH),			def VSLDOI : VAForm_2<44, (outs vrrc:$vD), (ins vrrc:$vA, vrrc:$vB, u4imm:$SH),
	"vsldoi $vD, $vA, $vB, $SH", IIC_VecFP,			"vsldoi $vD, $vA, $vB, $SH", IIC_VecFP,
	[(set v16i8:$vD,			[(set v16i8:$vD,
	(vsldoi_shuffle:$SH v16i8:$vA, v16i8:$vB))]>;			(PPCvecshl v16i8:$vA, v16i8:$vB, imm32SExt16:$SH))]>;
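
For reference, the byte-level effect of vsldoi can be modeled in a few lines. This is a hedged sketch rather than LLVM code: Vec16 and the function name are illustrative, and byte index 0 is assumed to be the leftmost (most significant) byte of the register. The shift amount ranges over 0 to 15, which fits the 4-bit immediate used by the new definition.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec16 = std::array<uint8_t, 16>; // Byte 0 is the leftmost register byte.

// Model of vsldoi: select 16 contiguous bytes, starting at byte SH, from the
// 32-byte concatenation A || B.
Vec16 vsldoi(const Vec16 &A, const Vec16 &B, unsigned SH) {
  assert(SH < 16 && "SH is a 4-bit immediate");
  Vec16 D{};
  for (unsigned i = 0; i < 16; ++i) {
    unsigned Src = i + SH;
    D[i] = Src < 16 ? A[Src] : B[Src - 16];
  }
  return D;
}
```

When both inputs are the same register, as in the sequences this patch emits, the operation is a left rotate by SH bytes.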

	// VX-Form instructions. AltiVec arithmetic ops.
	let isCommutable = 1 in {
	def VADDFP : VXForm_1<10, (outs vrrc:$vD), (ins vrrc:$vA, vrrc:$vB),
	"vaddfp $vD, $vA, $vB", IIC_VecFP,
	[(set v4f32:$vD, (fadd v4f32:$vA, v4f32:$vB))]>;

	def VADDUBM : VXForm_1<0, (outs vrrc:$vD), (ins vrrc:$vA, vrrc:$vB),
	[... 411 lines not shown ...]

	// Match vsldoi(x,x), vpkuwum(x,x), vpkuhum(x,x)
	def:Pat<(vsldoi_unary_shuffle:$in v16i8:$vA, undef),
	(VSLDOI $vA, $vA, (VSLDOI_unary_get_imm $in))>;
	def:Pat<(vpkuwum_unary_shuffle v16i8:$vA, undef),
	(VPKUWUM $vA, $vA)>;
	def:Pat<(vpkuhum_unary_shuffle v16i8:$vA, undef),
	(VPKUHUM $vA, $vA)>;
				def:Pat<(vsldoi_shuffle:$SH v16i8:$vA, v16i8:$vB),
				(VSLDOI v16i8:$vA, v16i8:$vB, (VSLDOI_get_imm $SH))>;


	// Match vsldoi(y,x), vpkuwum(y,x), vpkuhum(y,x), i.e., swapped operands.
	// These fragments are matched for little-endian, where the inputs must
	// be swapped for correct semantics.
	def:Pat<(vsldoi_swapped_shuffle:$in v16i8:$vA, v16i8:$vB),
	(VSLDOI $vB, $vA, (VSLDOI_swapped_get_imm $in))>;
	def:Pat<(vpkuwum_swapped_shuffle v16i8:$vA, v16i8:$vB),
	(VPKUWUM $vB, $vA)>;
	[... 386 lines not shown ...]
	def VEXTUBRX : VX1_RT5_RA5_VB5<1805, "vextubrx", []>;
	def VEXTUHLX : VX1_RT5_RA5_VB5<1613, "vextuhlx", []>;
	def VEXTUHRX : VX1_RT5_RA5_VB5<1869, "vextuhrx", []>;
	def VEXTUWLX : VX1_RT5_RA5_VB5<1677, "vextuwlx", []>;
	def VEXTUWRX : VX1_RT5_RA5_VB5<1933, "vextuwrx", []>;

	// Vector Insert Element Instructions
	def VINSERTB : VX1_VT5_UIM5_VB5<781, "vinsertb", []>;
	def VINSERTH : VX1_VT5_UIM5_VB5<845, "vinserth", []>;			def VINSERTH : VXForm_1<845, (outs vrrc:$vD),
				(ins vrrc:$vDi, u4imm:$UIM, vrrc:$vB),
				"vinserth $vD, $vB, $UIM", IIC_VecGeneral,
				[(set v8i16:$vD, (PPCvecinsert v8i16:$vDi, v8i16:$vB,
				imm32SExt16:$UIM))]>,
				RegConstraint<"$vDi = $vD">, NoEncode<"$vDi">;
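
The tied operand ($vDi = $vD) reflects that vinserth reads and modifies its destination: only two bytes change and the rest of the register is preserved. A minimal sketch of that behavior follows, assuming the Power ISA 3.0 semantics in which halfword element 3 of the source (bytes 6 and 7 in big-endian numbering) is copied to bytes UIM and UIM+1 of the target. Vec16 and the function signature are illustrative, not LLVM APIs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec16 = std::array<uint8_t, 16>; // Byte 0 is the leftmost register byte.

// Model of vinserth: VT is taken by value to mirror the tied input/output
// operand, and the updated copy is returned.
Vec16 vinserth(Vec16 VT, const Vec16 &VB, unsigned UIM) {
  assert(UIM <= 14 && "sketch only models offsets that keep both bytes in range");
  VT[UIM] = VB[6];     // High byte of source halfword 3.
  VT[UIM + 1] = VB[7]; // Low byte of source halfword 3.
  return VT;
}
```

Because the source halfword is fixed, any other element has to be rotated into bytes 6 and 7 first, which is the role of the preceding vsldoi in the generated sequences.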
	def VINSERTW : VX1_VT5_UIM5_VB5<909, "vinsertw", []>;
	def VINSERTD : VX1_VT5_UIM5_VB5<973, "vinsertd", []>;

	class VX_VT5_EO5_VB5<bits<11> xo, bits<5> eo, string opc, list<dag> pattern>
	: VXForm_RD5_XO5_RS5<xo, eo, (outs vrrc:$vD), (ins vrrc:$vB),
	!strconcat(opc, " $vD, $vB"), IIC_VecGeneral, pattern>;
	class VX_VT5_EO5_VB5s<bits<11> xo, bits<5> eo, string opc, list<dag> pattern>
	: VXForm_RD5_XO5_RS5<xo, eo, (outs vfrc:$vD), (ins vfrc:$vB),
	[... 185 lines not shown ...]

test/CodeGen/PowerPC/p9-vinsert-vextract.ll

This file was added.

				; RUN: llc -mcpu=pwr9 -mtriple=powerpc64le-unknown-linux-gnu \
				; RUN: -verify-machineinstrs < %s | FileCheck %s
				; RUN: llc -O0 -mcpu=pwr9 -mtriple=powerpc64le-unknown-linux-gnu \
				; RUN: -verify-machineinstrs < %s | FileCheck %s
				; RUN: llc -mcpu=pwr9 -mtriple=powerpc64-unknown-linux-gnu \
				; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefix=CHECK-BE
				lei: Can you add a short description of what each of these functions are testing?
				; RUN: llc -O0 -mcpu=pwr9 -mtriple=powerpc64-unknown-linux-gnu \
				; RUN: -verify-machineinstrs < %s | FileCheck %s --check-prefix=CHECK-BE

				jtony: I would prefer to use non-mangled function names to make it more readable. I think you can just regenerate the IR from a C source file instead of a C++ file.
				; The following testcases take one halfword element from the second vector and
				; insert it at various locations in the first vector.
				define <8 x i16> @shuffle_vector_halfword_0_8(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_0_8
				; CHECK: vsldoi 3, 3, 3, 8
				; CHECK: vinserth 2, 3, 14
				; CHECK-BE-LABEL: shuffle_vector_halfword_0_8
				; CHECK-BE: vsldoi 3, 3, 3, 10
				; CHECK-BE: vinserth 2, 3, 0
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				ret <8 x i16> %vecins
				}
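
The expectations in this first test can be checked by simulating both instructions at the byte level. The sketch below is illustrative only: all names are hypothetical, the instruction models assume byte 0 is the leftmost register byte and that vinserth copies source bytes 6 and 7, and v2 and v3 are assumed to hold %a and %b, as the register numbers in the CHECK lines suggest. IR element e is placed in register halfword slot e on big-endian and slot 7 - e on little-endian.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec16 = std::array<uint8_t, 16>; // Register image, byte 0 leftmost.
using HVec = std::array<uint16_t, 8>;  // <8 x i16> in IR element order.

// Place IR element e in the register halfword slot the target uses for it.
Vec16 toReg(const HVec &V, bool IsLE) {
  Vec16 R{};
  for (unsigned e = 0; e < 8; ++e) {
    unsigned Slot = IsLE ? 7 - e : e;
    R[2 * Slot] = static_cast<uint8_t>(V[e] >> 8);
    R[2 * Slot + 1] = static_cast<uint8_t>(V[e] & 0xff);
  }
  return R;
}

// Inverse of toReg.
HVec fromReg(const Vec16 &R, bool IsLE) {
  HVec V{};
  for (unsigned e = 0; e < 8; ++e) {
    unsigned Slot = IsLE ? 7 - e : e;
    V[e] = static_cast<uint16_t>((R[2 * Slot] << 8) | R[2 * Slot + 1]);
  }
  return V;
}

// vsldoi model: 16 bytes starting at byte SH of A || B.
Vec16 vsldoi(const Vec16 &A, const Vec16 &B, unsigned SH) {
  assert(SH < 16);
  Vec16 D{};
  for (unsigned i = 0; i < 16; ++i)
    D[i] = (i + SH < 16) ? A[i + SH] : B[i + SH - 16];
  return D;
}

// vinserth model: copy source bytes 6..7 to target bytes UIM..UIM+1.
Vec16 vinserth(Vec16 VT, const Vec16 &VB, unsigned UIM) {
  assert(UIM <= 14);
  VT[UIM] = VB[6];
  VT[UIM + 1] = VB[7];
  return VT;
}

// Run "vsldoi 3, 3, 3, ShiftBytes; vinserth 2, 3, InsertAtByte" with
// v2 = A and v3 = B, then read v2 back as <8 x i16>.
HVec runInsertSequence(const HVec &A, const HVec &B, bool IsLE,
                       unsigned ShiftBytes, unsigned InsertAtByte) {
  Vec16 V2 = toReg(A, IsLE), V3 = toReg(B, IsLE);
  V3 = vsldoi(V3, V3, ShiftBytes);
  V2 = vinserth(V2, V3, InsertAtByte);
  return fromReg(V2, IsLE);
}
```

Under these assumptions, both the little-endian pair (shift 8, insert at 14) and the big-endian pair (shift 10, insert at 0) reproduce the shuffle mask <8, 1, 2, 3, 4, 5, 6, 7>.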

				define <8 x i16> @shuffle_vector_halfword_1_15(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_1_15
				; CHECK: vsldoi 3, 3, 3, 10
				; CHECK: vinserth 2, 3, 12
				; CHECK-BE-LABEL: shuffle_vector_halfword_1_15
				; CHECK-BE: vsldoi 3, 3, 3, 8
				; CHECK-BE: vinserth 2, 3, 2
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 15, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_2_9(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_2_9
				; CHECK: vsldoi 3, 3, 3, 6
				; CHECK: vinserth 2, 3, 10
				; CHECK-BE-LABEL: shuffle_vector_halfword_2_9
				; CHECK-BE: vsldoi 3, 3, 3, 12
				; CHECK-BE: vinserth 2, 3, 4
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 1, i32 9, i32 3, i32 4, i32 5, i32 6, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_3_13(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_3_13
				; CHECK: vsldoi 3, 3, 3, 14
				; CHECK: vinserth 2, 3, 8
				; CHECK-BE-LABEL: shuffle_vector_halfword_3_13
				; CHECK-BE: vsldoi 3, 3, 3, 4
				; CHECK-BE: vinserth 2, 3, 6
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 13, i32 4, i32 5, i32 6, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_4_10(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_4_10
				; CHECK: vsldoi 3, 3, 3, 4
				; CHECK: vinserth 2, 3, 6
				; CHECK-BE-LABEL: shuffle_vector_halfword_4_10
				; CHECK-BE: vsldoi 3, 3, 3, 14
				; CHECK-BE: vinserth 2, 3, 8
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 10, i32 5, i32 6, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_5_14(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_5_14
				; CHECK: vsldoi 3, 3, 3, 12
				; CHECK: vinserth 2, 3, 4
				; CHECK-BE-LABEL: shuffle_vector_halfword_5_14
				; CHECK-BE: vsldoi 3, 3, 3, 6
				; CHECK-BE: vinserth 2, 3, 10
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 14, i32 6, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_6_11(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_6_11
				; CHECK: vsldoi 3, 3, 3, 2
				; CHECK: vinserth 2, 3, 2
				; CHECK-BE-LABEL: shuffle_vector_halfword_6_11
				; CHECK-BE: vinserth 2, 3, 12
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 11, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_7_12(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_7_12
				; CHECK: vinserth 2, 3, 0
				; CHECK-BE-LABEL: shuffle_vector_halfword_7_12
				; CHECK-BE: vsldoi 3, 3, 3, 2
				; CHECK-BE: vinserth 2, 3, 14
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 12>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_8_1(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_8_1
				; CHECK: vsldoi 2, 2, 2, 6
				; CHECK: vinserth 3, 2, 14
				; CHECK: vmr 2, 3
				; CHECK-BE-LABEL: shuffle_vector_halfword_8_1
				; CHECK-BE: vsldoi 2, 2, 2, 12
				; CHECK-BE: vinserth 3, 2, 0
				; CHECK-BE: vmr 2, 3
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 1, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				ret <8 x i16> %vecins
				}

				; The following testcases take one halfword element from the first vector and
				; insert it at various locations in the second vector.
				define <8 x i16> @shuffle_vector_halfword_9_7(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_9_7
				; CHECK: vsldoi 2, 2, 2, 10
				; CHECK: vinserth 3, 2, 12
				; CHECK: vmr 2, 3
				; CHECK-BE-LABEL: shuffle_vector_halfword_9_7
				; CHECK-BE: vsldoi 2, 2, 2, 8
				; CHECK-BE: vinserth 3, 2, 2
				; CHECK-BE: vmr 2, 3
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 8, i32 7, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_10_4(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_10_4
				; CHECK: vinserth 3, 2, 10
				; CHECK: vmr 2, 3
				; CHECK-BE-LABEL: shuffle_vector_halfword_10_4
				; CHECK-BE: vsldoi 2, 2, 2, 2
				; CHECK-BE: vinserth 3, 2, 4
				; CHECK-BE: vmr 2, 3
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 8, i32 9, i32 4, i32 11, i32 12, i32 13, i32 14, i32 15>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_11_2(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_11_2
				; CHECK: vsldoi 2, 2, 2, 4
				; CHECK: vinserth 3, 2, 8
				; CHECK: vmr 2, 3
				; CHECK-BE-LABEL: shuffle_vector_halfword_11_2
				; CHECK-BE: vsldoi 2, 2, 2, 14
				; CHECK-BE: vinserth 3, 2, 6
				; CHECK-BE: vmr 2, 3
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 8, i32 9, i32 10, i32 2, i32 12, i32 13, i32 14, i32 15>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_12_6(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_12_6
				; CHECK: vsldoi 2, 2, 2, 12
				; CHECK: vinserth 3, 2, 6
				; CHECK: vmr 2, 3
				; CHECK-BE-LABEL: shuffle_vector_halfword_12_6
				; CHECK-BE: vsldoi 2, 2, 2, 6
				; CHECK-BE: vinserth 3, 2, 8
				; CHECK-BE: vmr 2, 3
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 6, i32 13, i32 14, i32 15>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_13_3(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_13_3
				; CHECK: vsldoi 2, 2, 2, 2
				; CHECK: vinserth 3, 2, 4
				; CHECK: vmr 2, 3
				; CHECK-BE-LABEL: shuffle_vector_halfword_13_3
				; CHECK-BE: vinserth 3, 2, 10
				; CHECK-BE: vmr 2, 3
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 3, i32 14, i32 15>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_14_5(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_14_5
				; CHECK: vsldoi 2, 2, 2, 14
				; CHECK: vinserth 3, 2, 2
				; CHECK: vmr 2, 3
				; CHECK-BE-LABEL: shuffle_vector_halfword_14_5
				; CHECK-BE: vsldoi 2, 2, 2, 4
				; CHECK-BE: vinserth 3, 2, 12
				; CHECK-BE: vmr 2, 3
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 5, i32 15>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_15_0(<8 x i16> %a, <8 x i16> %b) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_15_0
				; CHECK: vsldoi 2, 2, 2, 8
				; CHECK: vinserth 3, 2, 0
				; CHECK: vmr 2, 3
				; CHECK-BE-LABEL: shuffle_vector_halfword_15_0
				; CHECK-BE: vsldoi 2, 2, 2, 10
				; CHECK-BE: vinserth 3, 2, 14
				; CHECK-BE: vmr 2, 3
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 0>
				ret <8 x i16> %vecins
				}

				; The following testcases use the same vector in both arguments of the
				; shufflevector. If halfword element 3 in BE mode (or 4 in LE mode) is the one
				; we're attempting to insert, then we can use the vector insert instruction.
				define <8 x i16> @shuffle_vector_halfword_0_4(<8 x i16> %a) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_0_4
				; CHECK: vinserth 2, 2, 14
				; CHECK-BE-LABEL: shuffle_vector_halfword_0_4
				; CHECK-BE-NOT: vinserth
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %a, <8 x i32> <i32 4, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_1_3(<8 x i16> %a) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_1_3
				; CHECK-NOT: vinserth
				; CHECK-BE-LABEL: shuffle_vector_halfword_1_3
				; CHECK-BE: vinserth 2, 2, 2
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %a, <8 x i32> <i32 0, i32 3, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_2_3(<8 x i16> %a) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_2_3
				; CHECK-NOT: vinserth
				; CHECK-BE-LABEL: shuffle_vector_halfword_2_3
				; CHECK-BE: vinserth 2, 2, 4
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %a, <8 x i32> <i32 0, i32 1, i32 3, i32 3, i32 4, i32 5, i32 6, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_3_4(<8 x i16> %a) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_3_4
				; CHECK: vinserth 2, 2, 8
				; CHECK-BE-LABEL: shuffle_vector_halfword_3_4
				; CHECK-BE-NOT: vinserth
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %a, <8 x i32> <i32 0, i32 1, i32 2, i32 4, i32 4, i32 5, i32 6, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_4_3(<8 x i16> %a) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_4_3
				; CHECK-NOT: vinserth
				; CHECK-BE-LABEL: shuffle_vector_halfword_4_3
				; CHECK-BE: vinserth 2, 2, 8
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %a, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 3, i32 5, i32 6, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_5_3(<8 x i16> %a) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_5_3
				; CHECK-NOT: vinserth
				; CHECK-BE-LABEL: shuffle_vector_halfword_5_3
				; CHECK-BE: vinserth 2, 2, 10
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %a, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 3, i32 6, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_6_4(<8 x i16> %a) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_6_4
				; CHECK: vinserth 2, 2, 2
				; CHECK-BE-LABEL: shuffle_vector_halfword_6_4
				; CHECK-BE-NOT: vinserth
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %a, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 4, i32 7>
				ret <8 x i16> %vecins
				}

				define <8 x i16> @shuffle_vector_halfword_7_4(<8 x i16> %a) {
				entry:
				; CHECK-LABEL: shuffle_vector_halfword_7_4
				; CHECK: vinserth 2, 2, 0
				; CHECK-BE-LABEL: shuffle_vector_halfword_7_4
				; CHECK-BE-NOT: vinserth
				%vecins = shufflevector <8 x i16> %a, <8 x i16> %a, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 4>
				ret <8 x i16> %vecins
				}
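
The single-operand expectations above follow from vinserth always reading register halfword 3 in big-endian numbering: that slot holds IR element 3 on big-endian and element 7 - 3 = 4 on little-endian, so only that element can be inserted without a preceding shift. The sketch below models this rule under stated assumptions. The function name is hypothetical, mask entries are assumed to be normalized to [0, 7], and the patch's own single-operand check is outside this excerpt and may differ in detail.

```cpp
#include <array>
#include <cassert>

// Returns the vinserth byte offset for a single-operand halfword insert, or
// -1 when the mask cannot be handled without a shift.
int unaryInsertAtByte(const std::array<int, 8> &Mask, bool IsLE) {
  constexpr int NumHalfWords = 8, BytesInVector = 16;
  // Element that already sits in register halfword 3 (big-endian numbering).
  const int SrcElem = IsLE ? 4 : 3;
  for (int i = 0; i < NumHalfWords; ++i) {
    if (Mask[i] != SrcElem)
      continue;
    // Every other lane must keep its original element.
    bool OthersInOrder = true;
    for (int j = 0; j < NumHalfWords; ++j)
      if (j != i && Mask[j] != j)
        OthersInOrder = false;
    if (OthersInOrder)
      return IsLE ? BytesInVector - (i + 1) * 2 : i * 2;
  }
  return -1;
}
```

Note that this simplified model also accepts the identity mask (as a no-op insert of the element onto itself), a case a real lowering would not normally reach.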