This is an archive of the discontinued LLVM Phabricator instance.

[x86] Begin a significant overhaul of how vector lowering is done in the x86 backend.
ClosedPublic

Authored by chandlerc on Jun 19 2014, 6:21 PM.

Details

Reviewers

grosbach
filcab

Commits

rG83860cfcfa16: [x86] Begin a significant overhaul of how vector lowering is done in the x86…
rL211888: [x86] Begin a significant overhaul of how vector lowering is done in the

Summary

This sketches out a new code path for vector lowering, hidden behind an
off-by-default flag while it is under development. The fundamental idea
behind the new code path is to aggressively break down the problem space
in ways that ease selecting the odd set of instructions available on
x86, and carefully avoid scalarizing code even when forced to use older
ISAs. Notably, this starts off restricting itself to SSE2 and implements
the complete vector shuffle and blend space for 128-bit vectors in SSE2
without scalarizing. The plan is to layer on top of this ISA extensions
where we can bail out of the complex SSE2 lowering and opt for
a cheaper, specialized instruction (or set of instructions). It also
needs to be generalized to AVX and AVX512 vector widths.

Currently, this does a decent but not perfect job for SSE2. There are
some specific shortcomings that I plan to address:

We need a peephole combine to fold together shuffles where possible. There are cases where a previous shuffle could be modified slightly to arrange for elements to be in the correct position and a later shuffle eliminated. Doing this eagerly added quite a bit of complexity, and so my plan is to combine away these redundancies afterward.
There are a lot more clever ways to use unpck and pack that need to be added. This is essential for real world shuffles as it turns out...

Once SSE2 is polished a bit I should be able to get interesting numbers
on performance improvements on benchmarks conducive to vectorization.
All of this will be off by default until it is functionally equivalent
of course.

Diff Detail

Event Timeline

chandlerc updated this revision to Diff 10669. Jun 19 2014, 6:21 PM

chandlerc retitled this revision to [x86] Begin a significant overhaul of how vector lowering is done in the x86 backend.

chandlerc updated this object.

chandlerc edited the test plan for this revision. (Show Details)

chandlerc added reviewers: grosbach, filcab.

chandlerc set the repository for this revision to rL LLVM.

chandlerc added a subscriber: Unknown Object (MLST).

This is fantastic. Some comments inline. Mostly of the shed color choice variety that you can ignore freely if you prefer.

I really like the overall direction of this patch. It takes a very systematic approach to breaking down the problem.

In the commit message, you mention needing a later peephole to combine some combinations of instructions. Would it make sense to call some of those sequences out explicitly in the code comments as well? Since this code is going to be relying on other things in order to get performant results, I'd like to see that called out so that anyone later reading this code can see where else they should look to fully understand how things are happening.

lib/Target/X86/X86ISelLowering.cpp
7348	s/test/tests/
7383	Same thing as above. Function comment describing the approach will go a long way.
7708	Only need to "try to collapse" once. ;)

grosbach added inline comments. Jun 20 2014, 12:55 PM

lib/Target/X86/X86ISelLowering.cpp
6890	Style nit. I prefer not to refer to things as "new" in code comments, as the comments tend to live on that way long after the new code becomes the old code.
6898	You comment on this below, but IMO it's worth calling out explicitly here that UNDEF masks are required to be implemented as preserving the input lane and thus are considered a NOP. This and isSingleInputShuffleMask() should probably be static methods of ShuffleVectorSDNode instead. That would be similar to how isSplatMask() is implemented (side note, that should probably switch to use an ArrayRef instead of an EVT). Then add simple non-static wrappers isNoopShuffle() and isSingleInputShuff() to the class which call down to the static versions using the instance's mask.
6910	"NB?"
6932	Likewise, it may be useful to expose a direct method on ShuffleVectorSDNode for getting the lane count. int getNumElements() const { return getValueType(0).getVectorNumElements(); }
6934	Example of the above. This would be a bit clearer if written as SVOp->isSingleInput(). Ditto on similar constructs later in the patch.
6936	does not parse: "by passing sharing the"?
6942	Comment strings on the asserts.
6960	s/But we have to map the mask as it/We have to map the mask, as it/
6988	space after "{" and before "}"
6993	Why not use the helper function you defined above to check for the shuffle being single input? I guess because you want to use the element count directly in the following code?
7002	std::find_if instead?
7003	This is a neat trick for getting the other lane of the high half, but it isn't immediately obvious that's what it's doing. Either a ?: to make the selection explicit or a comment explaining the bitwise trick would be good.
7015	Big ask, but I'd love some ASCII art type examples of the transforms, in particular for those like this one that need more than one instruction. Not the complete set of possibilities; just a single example that's illustrative showing arrows for which lanes get moved where by which insn.
7046	Spacing for "{" "}".
7094	This is a really tricky part of the patch. Can you add a comment describing the approach and algorithm to the function? Without a lot of very explicit exposition, I'm concerned about future "improvements" to this code path inadvertently breaking things.
7125	An example in the comment would really help here.

majnemer added a subscriber: majnemer. Jun 20 2014, 2:56 PM

majnemer added inline comments.

lib/Target/X86/X86ISelLowering.cpp
6910	It is probably an abbreviation for nota bene

Bike-sheddy comments addressed below. ;] I seem to like a somewhat different shade than you. Lemme know if it's too different.

To the combine stuff, I'm not sure where would be a great place to put it. The point of the combines being needed is that there *isn't* one place where we make a potentially bad decision, but that two distant decisions could (in theory) collude to save an operation. Hard to comment that. The test cases should cover it very well though...

Will attach an updated patch momentarily...

lib/Target/X86/X86ISelLowering.cpp
6890	OK, "experimental" maybe?
6898	Regarding the undef mask contract below, it doesn't impact this at all. An undef lane, by its nature, is a NOP? I really dislike static methods for these at the moment. These have nothing to do with an SDNode and everything to do with the specific ways in which masks are used in this lowering code. I'm hesitant to over-generalize here. Also, static methods are horrible to actually use. If there is useful common logic to refactor into common locations, I'd rather do it if and when it is actually needed to implement common functionality. As it stands, it wouldn't save anything?
6910	Correct. I don't care particularly about comment style; I'll probably clean this up along with adding the comment above.
6932	I don't know that this makes the code any better. By directly asserting properties of the Mask, to me, it makes the following code that assumes the properties on the mask much more clear.
6934	I think this method might make sense on SVOp as it really isn't used anywhere on raw / hypothetical masks. But at the same time, it was actually rather convenient to write all of this code in terms of the mask. I don't find the ShuffleVectorSDNode to add any utility. :: shrug ::
6936	English is hard. Will reword.
6942	When there is a generally useful bit of information that is being asserted, I try to do so. However, these asserts are just documenting the mathematical invariants established by the above condition. There isn't any useful comment to give here IMO.
6960	Done.
6988	See the coding guidelines and clang-format -- this is how it formats braces specifically to interoperate cleanly with braced init lists in C++11.
6993	Exactly. When I have to count anyways, I may as well re-use that.
7002	Sure. The first time I wrote this I was annoyed by re-computing the index from find_if and switched it to a while loop, but maybe the loop is too high cost.
7003	I have a distaste for comments which merely explain what the standard says the C++ in question does. I tried to help this by naming the variable after the "adjacent" index. Is there a better variable name that would give the context? If not I can do something else...
7015	I'm hesitant to do this. Every time I drew an ascii art picture (and I drew several) or something on a white board, it actually messed me up because I started to silently assume that the one example I had was actually representative of the other ways to hit the same scenario. Unless this is breaking your ability to reason about the code, I'm going to hold off investing a ton of time in crafting these...
7046	Same comment, this is the clang-format enforced style and matches the coding standards. I'll give a citation to help: http://llvm.org/docs/CodingStandards.html#braced-initializer-lists
7094	Yea, I didn't have that comment because there were many tens of different approaches tried before one worked. I'm also still hoping to simplify some parts of it. Also, all the details are already documented in each branch of the hybrid approach.
7125	Hmm, ok. Not sure it'll actually help, but done. =] This trick took me weeks to figure out.
7348	Done.
7383	I've added them to most. Note that these comments will only get more confusing as time goes on. =/ I'm not sure of a good way to document these. Each new ISA extension or trick I add will require updates. Hopefully I won't miss any. =]
7708	Done.

Updates from review thus far...

Many thanks for this patch! Making the shuffle code saner and tidier is awesome!

I'm posting a few comments for v2* and v4* for now. Will check the other ones later. Most of these are nitpicks, but the ones on lines 6816 and 6836 look like bugs (either in the code or in my brain).

I also don't like the phrasing “We rely heavily on "undef" masks preserving the input lane.”
From what I've seen, it seems that we will treat undefs as if the corresponding element for the source vector was specified. But the code doesn't assume that anywhere (maybe it does in places further down). My first reading parsed it as “we need to be sure that undefs are treated this way”.

I would also echo most of grosbach's comments.

lib/Target/X86/X86ISelLowering.cpp
6911	This is similar to a (DAG) shuffle mask, but not quite (it only goes up (non-inclusive) to 4, not 8, which you would have on shuffles for two v4* vectors). Could you add some word (Shufp?) to the name, to make it easier to spot the difference?
6932	Shouldn't we test, at a higher level, that the shuffle's type is v2f64? I don't think we can get a malformed SDValue that's a v2f64 with more (or fewer) than 2 elements in the mask. The same for the other functions.
6937	Since the possible values are UNDEF (-1), 0, and 1, wouldn't simply Mask[x] == 1u be easier to understand (This would invert how we handle UNDEFs in this case)
6945	I might be missing something. How does this handle an <i32 2, i32 0> mask? Wouldn't it generate { V1[0], V2[0] } instead of { V2[0], V1[0] } ?
6965	What about bitcasting the PSHUFD output back to v2i64?
6993	Since we use NumV2Elements a bunch of times, I'd be ok with not going through the mask two times when we can just go one time.
7084	Nitpick: I would change the phrasing. If we only use one shufps, shufps is smaller (2 opcode bytes vs. 3 opcode bytes). If we need two shufps, then it's better to just use shufpd. The comment, as it is, seems, to me, like it suggests we still need to check if there's any case where this is helpful.

My overarching concern is to make sure the code is understandable a year or three or five from now. There's no getting around the necessary complexity of the code due to the ISAs we have to work with today, but as new ISA extensions come out, it's only going to get more so. As such, it's immensely important that the code be understandable to a level that guides someone doing that work on where and how to insert those new tricks. Thus my hounding more about comments than the actual details of the code. The code itself looks great. I want to make sure it stays that way after barbarians like me get their hands on it in the coming years. :)

Looks like one of my comments from before got eaten. Trying again, but in case it doesn't go through there this time either, you're missing a "#include <numeric>". Otherwise you don't have a declaration for std::accumulate.

lib/Target/X86/X86ISelLowering.cpp
6898	I'm more concerned that it's inconsistent with what we already have (isSplatMask()). I don't care much which idiom we use (static method vs. static function), but do think we should be consistent. I disagree that they have nothing to do with the SDNode. They're for asking questions about a mask either of or for an SDNode. Now, strictly speaking, the mask is a distinct thing, it's true, and we could have a dedicated class or namespace for that, I suppose. Static methods are no harder to use than the static functions you already have. It's just a scope. On the other hand, do other targets actually want answers to the questions these helpers ask? If not, maybe you're right that they're more closely related to the X86 lowering code than the generic stuff. So I suggest we get an answer pragmatically. Can you have a look at the other targets that do shuffles and see if they have something similar? If so, combine the use cases with something generic and if not, leave it as you've written it currently. Sound reasonable?
6910	Cool. Just not used to latin in source code. ;)
6942	I'd still prefer there be something. As a matter of consistent style, if nothing else. Just something which tells me what other properties I should be looking at to see what happened. For example, distinguishing between a mask that's invalid on its face for the vector type vs. one that's valid but shouldn't have gotten this far because the code above was supposed to handle it.
6988	clang-format agrees with me. That's why I made the comment, actually.
7003	It's not just explaining what the standard says, though. It explains why. I spent a couple minutes looking at that code wondering what you were up to before I realized and I had to work through the bit patterns in my head to do it. Maybe I'm just being dense and it should have been obvious, but it wasn't. I really think a comment explaining the intent (not the language semantics) is very useful here. The abbreviation certainly didn't help. I thought it meant "Adjusted" not "Adjacent".
7015	I'm drawing lots of stuff on paper to be able to reason about the code. Some comments to help guide that thought process would help. From another perspective, comments breaking down what the code is intended to cover help review to make sure it matches what it actually does cover. I understand that a mathematically sound exposition of all the cases and how they transpose into one another is likely not practical, but I still think a few examples would be useful. Doesn't have to be ASCII art. I just like pictures. Simple examples, even expressed in terms of __builtin_shufflevector() .c code perhaps, would go a long way.
7046	Then clang-format should be fixed. That's what I used to verify I was correct. I just tried again and got the same result.
7094	Maybe something about "there be dragons here, and not the friendly kind" then? If it was that tricky to get right, I'm more concerned about someone helpfully coming along and "fixing" it later without understanding all the cases. Tests go a long way there, of course, but I don't like relying on that exclusively.
7129	llvm/lib/Target/X86/X86ISelLowering.cpp:7000:24: error: no member named 'accumulate' in namespace 'std' Need to add "#include <numeric>" I think.
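The build error quoted above follows from where the standard library declares the algorithm: `std::accumulate` lives in `<numeric>`, not `<algorithm>`. A minimal standalone sketch of the kind of mask-counting fold that needs the header is shown here. The helper name `countSecondInputLanes` is hypothetical and is not taken from the patch, and `std::vector<int>` stands in for LLVM's `ArrayRef<int>`.

```cpp
#include <cassert>
#include <numeric>  // declares std::accumulate
#include <vector>

// Hypothetical helper (not from the patch): count mask entries that select
// from the second shuffle operand, i.e. indices >= the number of lanes.
// Undef entries are encoded as -1 and are never counted.
int countSecondInputLanes(const std::vector<int> &Mask) {
  const int Size = static_cast<int>(Mask.size());
  return std::accumulate(Mask.begin(), Mask.end(), 0,
                         [Size](int Count, int M) {
                           return Count + (M >= Size ? 1 : 0);
                         });
}
```

Without the `<numeric>` include, whether this compiles depends on which other headers happen to pull it in, which is consistent with the error showing up on only some builds.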

OK, I think I've responded to all the comments reasonably at this point. Will upload an updated patch momentarily. Please let me know if this LG to submit at this point! =D

lib/Target/X86/X86ISelLowering.cpp
6898	Chatted about this on IRC some and I've added lots of comments which will even explain the specific peculiarities of these routines. The essence is that these seem pretty special-purpose today. They work on raw masks, and those masks sometimes have fairly specific meaning to this code that might not generalize well. I actually tried to move the single input test and it is just wrong in general; it only works here with this code path. Hopefully with the documentation this helps clarify what's going on.
6911	Do you have a concrete suggestion here? I don't have any ideas other than V4. We use this for SHUFP, PSHUF, PSHUFH, and PSHUFL instructions. =/ The way I think about it is that this is just the way that x86 always represents a 4-lane shuffle, regardless of where the 4 lanes come from.
6932	Yea, testing this stuff is a great idea. I've added lots of asserts to this effect. Also fixed a bunch of places where we were playing fast and loose with types to instead go back through the full vector shuffle code to ensure type correctness throughout.
6937	Sure.
6942	I've added strings to every assert. I've no idea if many of these really add value, but hey. =D
6945	How? There is a test case that covers this. ;] I think the key is that we canonicalize the mask first, so we know that by the time we get here there is exactly one entry from each input, and when we have equal numbers of inputs, we canonicalize s.t. the first input provides more low-half inputs than the second input.
6965	Done. I think the DAG stuff was fixing this for me. This code is definitely covered by tests. =/
6988	To record here what took place on IRC -- Jim's build of clang-format was stale for truly mysterious reasons... no idea why...
7003	Comments added.
7015	Ok. However, this comment was just bad. Bordering on terrible. I've rewritten it to be less confusing. =] I'm not sure after that exactly where diagrams or examples would most help? Could you point at the specific places that need more illustration in light of the added comments from the last couple of iterations?
7046	As noted above, we sorted this out...
7084	Not sure what you're really saying. First, we're talking about shufps vs. pshufd. Second, we would never need more or fewer of shufps vs. pshufd, they are both equally powerful when using a single input. However, yes, I've now looked, and there is no reason for this. I'll nuke the FIXME. If anything, we might want a combine somewhere very late that can see when we don't need the free copy pshufd provides and we don't pay a domain crossing penalty, and so it is an easy code size win to rewrite it as a shufps.
7094	I've tried to do a decent job of commenting the strategy with appropriate warnings and cross references. I think it ended up being simpler to document than I feared, so hopefully it's good now.
7129	Done.
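The canonicalization argument given for the `<i32 2, i32 0>` mask (line 6945) can be illustrated with a small standalone sketch. It assumes that canonicalizing means swapping the two shuffle operands and rewriting the mask to match; `commuteShuffleMask` is a hypothetical name, `std::vector<int>` stands in for the mask type, and the heuristic that decides when to swap is not shown.

```cpp
#include <cassert>
#include <vector>

// Rewrites a two-input shuffle mask so that it stays correct after the two
// operands are swapped. Indices in [0, N) refer to the first operand and
// indices in [N, 2N) to the second; -1 marks an undef lane and is left alone.
std::vector<int> commuteShuffleMask(std::vector<int> Mask) {
  const int N = static_cast<int>(Mask.size());
  for (int &M : Mask) {
    if (M < 0)
      continue;  // undef stays undef
    M = (M < N) ? M + N : M - N;
  }
  return Mask;
}
```

With two lanes, `commuteShuffleMask({2, 0})` yields `{0, 2}`. After the swap, lane 0 of the result comes from the new first operand (the original V2) and lane 1 from the new second operand (the original V1), so the lowered shuffle still produces { V2[0], V1[0] }.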

New patch with significant updates based on review comments by Jim and Filipe.

filcab added inline comments. Jun 25 2014, 6:04 PM

lib/Target/X86/X86ISelLowering.cpp
7135	This seems wrong. How can you have [{0,1}{2,7}{4,5}{6,3}] -> [{0,1}{4,7}{2,3}{6,5}]? How did the {2,7} get mixed with the {4,5}? Wouldn't pshufd just shuffle double words around but not break them? Shouldn't it become [{0,1}{4,5}{2,7}{6,3}]?
7143	Commenting how that subtraction gets us the index that's not used would be nice.
7203	Wouldn't InPlaceInputs[0] ^ 1 do the same thing? And it would be the same trick you used in the previous function.
7531	NumV1Inputs == 0 && NumV2Inputs == 4 => assert in lowerV8I16BasicBlendVectorShuffle:7293 The previous if should have an else if (NumV1Inputs == 0) ...SingleInput...(...V2...) If this condition doesn't arise because it's been handled earlier, I'd like to have an assert here.

chandlerc added inline comments. Jun 25 2014, 6:07 PM

lib/Target/X86/X86ISelLowering.cpp
7135	Yea, this is why examples aren't clear. The PSHUFD applies to the input to this shuffle. The result here is the new mask needed after the pshufd to produce the same final output. How can I make this more clear?
7531	Correct, we can't hit this because if there is only a single input, it is always V1. Adding assert here.

chandlerc added inline comments. Jun 25 2014, 6:20 PM

lib/Target/X86/X86ISelLowering.cpp
7135	Think I have a better way to illustrate it...
7143	Yea, this is way down the clever rabbit hole.
7203	Good idea. ;] I should be way more consistent with these tricks.
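The `^ 1` trick discussed at lines 7003 and 7203 is easier to follow with a standalone sketch. The function names below are hypothetical and only illustrate the bit manipulation; they are not the identifiers used in the patch.

```cpp
#include <cassert>

// Lanes are grouped in aligned pairs: (0,1), (2,3), (4,5), ...
// Flipping the lowest bit maps an index to the other member of its pair.
constexpr int adjacentLane(int Lane) { return Lane ^ 1; }

// The same selection written out explicitly, for comparison.
constexpr int adjacentLaneExplicit(int Lane) {
  return (Lane % 2 == 0) ? Lane + 1 : Lane - 1;
}

static_assert(adjacentLane(4) == 5, "even lane pairs with the next lane");
static_assert(adjacentLane(5) == 4, "odd lane pairs with the previous lane");
```

The XOR form is shorter, but as the thread notes, its intent is not obvious without a comment or a descriptive variable name.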

Fix up comments and add another assert based on Filipe's comments.

LGTM unless Filipe has any additional requests.

lib/Target/X86/X86ISelLowering.cpp
7321	Different from what? I don't have a better suggestion for the message, unfortunately.

LGTM, too. I'll still have another look at some comments, but if there's any minor issue with those, it can be resolved on IRC.

lib/Target/X86/X86ISelLowering.cpp
6911	Maybe V4X86Shuffle? At least it shows it might not be a “regular” DAG shuffle mask.
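To make the difference from a regular DAG shuffle mask concrete, here is a minimal sketch of the 4-lane immediate encoding under discussion: two bits per result lane, lane 0 in the lowest bits, so every entry must be below 4. The function name `packV4ShuffleImm` and the use of `std::vector<int>` are assumptions for illustration. The undef handling follows the "preserving the input lane" wording quoted earlier in the review.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Packs a 4-lane mask into the 8-bit immediate layout used by instructions
// such as PSHUFD and SHUFPS. Entries must be -1 (undef) or in [0, 4).
// Assumption for this sketch: an undef lane is encoded as "keep lane i".
std::uint8_t packV4ShuffleImm(const std::vector<int> &Mask) {
  assert(Mask.size() == 4 && "expected exactly four lanes");
  unsigned Imm = 0;
  for (int i = 0; i < 4; ++i) {
    int M = Mask[i];
    assert(M >= -1 && M < 4 && "lane index out of range for 4-lane immediate");
    if (M == -1)
      M = i;  // undef: preserve the input lane
    Imm |= static_cast<unsigned>(M) << (2 * i);
  }
  return static_cast<std::uint8_t>(Imm);
}
```

Under this encoding the identity mask `{0, 1, 2, 3}` and the all-undef mask both pack to `0xE4`, and the full reversal `{3, 2, 1, 0}` packs to `0x1B`.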

This revision is now accepted and ready to land. Jun 26 2014, 3:08 PM

zinovy.nis added a subscriber: zinovy.nis. Jun 27 2014, 4:27 AM

Closed by commit rL211888 (authored by @chandlerc).

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp	1027 lines
test/CodeGen/X86/vector-shuffle-128-v16.ll	85 lines
test/CodeGen/X86/vector-shuffle-128-v2.ll	219 lines
test/CodeGen/X86/vector-shuffle-128-v4.ll	174 lines
test/CodeGen/X86/vector-shuffle-128-v8.ll	499 lines

Diff 10873

lib/Target/X86/X86ISelLowering.cpp

#include "llvm/IR/GlobalAlias.h"		#include "llvm/IR/GlobalAlias.h"
#include "llvm/IR/GlobalVariable.h"		#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Instructions.h"		#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"		#include "llvm/IR/Intrinsics.h"
#include "llvm/MC/MCAsmInfo.h"		#include "llvm/MC/MCAsmInfo.h"
#include "llvm/MC/MCContext.h"		#include "llvm/MC/MCContext.h"
#include "llvm/MC/MCExpr.h"		#include "llvm/MC/MCExpr.h"
#include "llvm/MC/MCSymbol.h"		#include "llvm/MC/MCSymbol.h"
		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/ErrorHandling.h"		#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/MathExtras.h"		#include "llvm/Support/MathExtras.h"
#include "llvm/Target/TargetOptions.h"		#include "llvm/Target/TargetOptions.h"
#include <bitset>		#include <bitset>
		#include <numeric>
#include <cctype>		#include <cctype>
using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "x86-isel"		#define DEBUG_TYPE "x86-isel"

STATISTIC(NumTailCalls, "Number of tail calls");		STATISTIC(NumTailCalls, "Number of tail calls");

		static cl::opt<bool> ExperimentalVectorShuffleLowering(
		"x86-experimental-vector-shuffle-lowering", cl::init(false),
		cl::desc("Enable an experimental vector shuffle lowering code path."),
		cl::Hidden);

// Forward declarations.		// Forward declarations.
static SDValue getMOVL(SelectionDAG &DAG, SDLoc dl, EVT VT, SDValue V1,		static SDValue getMOVL(SelectionDAG &DAG, SDLoc dl, EVT VT, SDValue V1,
SDValue V2);		SDValue V2);

static SDValue ExtractSubVector(SDValue Vec, unsigned IdxVal,		static SDValue ExtractSubVector(SDValue Vec, unsigned IdxVal,
SelectionDAG &DAG, SDLoc dl,		SelectionDAG &DAG, SDLoc dl,
unsigned vectorWidth) {		unsigned vectorWidth) {
assert((vectorWidth == 128 || vectorWidth == 256) &&		assert((vectorWidth == 128 || vectorWidth == 256) &&
(6,803 lines not shown)	static SDValue LowerCONCAT_VECTORS(SDValue Op, SelectionDAG &DAG) {

// AVX can use the vinsertf128 instruction to create 256-bit vectors		// AVX can use the vinsertf128 instruction to create 256-bit vectors
// from two other 128-bit ones.		// from two other 128-bit ones.

// 512-bit vector may contain 2 256-bit vectors or 4 128-bit vectors		// 512-bit vector may contain 2 256-bit vectors or 4 128-bit vectors
return LowerAVXCONCAT_VECTORS(Op, DAG);		return LowerAVXCONCAT_VECTORS(Op, DAG);
}		}


		//===----------------------------------------------------------------------===//
		// Vector shuffle lowering
		//
		// This is an experimental code path for lowering vector shuffles on x86. It is
		// designed to handle arbitrary vector shuffles and blends, gracefully
		// degrading performance as necessary. It works hard to recognize idiomatic
		// shuffles and lower them to optimal instruction patterns without leaving
		// a framework that allows reasonably efficient handling of all vector shuffle
		// patterns.
		//===----------------------------------------------------------------------===//

		/// \brief Tiny helper function to identify a no-op mask.
		///
		/// This is a somewhat boring predicate function. It checks whether the mask
		/// array input, which is assumed to be a single-input shuffle mask of the kind
		/// used by the X86 shuffle instructions (not a fully general
		/// ShuffleVectorSDNode mask) requires any shuffles to occur. Both undef and an
		/// in-place shuffle are 'no-op's.
		static bool isNoopShuffleMask(ArrayRef<int> Mask) {
		for (int i = 0, Size = Mask.size(); i < Size; ++i)
		if (Mask[i] != -1 && Mask[i] != i)
		return false;
		return true;
		}

		/// \brief Helper function to classify a mask as a single-input mask.
		///
		/// This isn't a generic single-input test because in the vector shuffle
		/// lowering we canonicalize single inputs to be the first input operand. This
		/// means we can more quickly test for a single input by only checking whether
		/// an input from the second operand exists. We also assume that the size of
		/// mask corresponds to the size of the input vectors which isn't true in the
		/// fully general case.
		static bool isSingleInputShuffleMask(ArrayRef<int> Mask) {
		for (int M : Mask)
		if (M >= (int)Mask.size())
		return false;
		return true;
		}

		/// \brief Get a 4-lane 8-bit shuffle immediate for a mask.
		///
		/// This helper function produces an 8-bit shuffle immediate corresponding to
		/// the ubiquitous shuffle encoding scheme used in x86 instructions for
		/// shuffling 4 lanes. It can be used with most of the PSHUF instructions for
		/// example.
		///
		grosbach: Likewise, it may be useful to expose a direct method on ShuffleVectorSDNode for getting the lane count. int getNumElements() const { return getValueType(0).getVectorNumElements(); }
		chandlerc: I don't know that this makes the code any better. By directly asserting properties of the Mask, to me, it makes the following code that assumes the properties on the mask much more clear.
		filcab: Shouldn't we test, at a higher level, that the shuffle's type is v2f64? I don't think we can get a malformed SDValue that's a v2f64 with more (or fewer) than 2 elements in the mask. The same for the other functions.
		chandlerc: Yea, testing this stuff is a great idea. I've added lots of asserts to this effect. Also fixed a bunch of places where we were playing fast and loose with types to instead go back through the full vector shuffle code to ensure type correctness throughout.
		/// NB: We rely heavily on "undef" masks preserving the input lane.
		static SDValue getV4ShuffleImmForMask(ArrayRef<int> Mask, SelectionDAG &DAG) {
		grosbach: Example of the above. This would be a bit clearer if written as SVOp->isSingleInput(). Ditto on similar constructs later in the patch.
		chandlerc: I think this method might make sense on SVOp as it really isn't used anywhere on raw / hypothetical masks. But at the same time, it was actually rather convenient to write all of this code in terms of the mask. I don't find the ShuffleVectorSDNode to add any utility. :: shrug ::
		assert(Mask.size() == 4 && "Only 4-lane shuffle masks");
		assert(Mask[0] >= -1 && Mask[0] < 4 && "Out of bound mask element!");
		grosbach: does not parse: "by passing sharing the"?
		chandlerc: English is hard. Will reword.
		assert(Mask[1] >= -1 && Mask[1] < 4 && "Out of bound mask element!");
		filcab: Since the possible values are UNDEF (-1), 0, and 1, wouldn't simply Mask[x] == 1u be easier to understand? (This would invert how we handle UNDEFs in this case.)
		chandlerc: Sure.
		assert(Mask[2] >= -1 && Mask[2] < 4 && "Out of bound mask element!");
		assert(Mask[3] >= -1 && Mask[3] < 4 && "Out of bound mask element!");

		unsigned Imm = 0;
		Imm |= (Mask[0] == -1 ? 0 : Mask[0]) << 0;
		grosbach: Comment strings on the asserts.
		chandlerc: When there is a generally useful bit of information that is being asserted, I try to do so. However, these asserts are just documenting the mathematical invariants established by the above condition. There isn't any useful comment to give here IMO.
		grosbach: I'd still prefer there be something. As a matter of consistent style, if nothing else. Just something that tells me what other properties I should be looking at to see what happened. For example, distinguishing between a mask that's invalid on its face for the vector type vs. one that's valid but shouldn't have gotten this far because the code above was supposed to handle it.
		chandlerc: I've added strings to every assert. I've no idea if many of these really add value, but hey. =D
		Imm |= (Mask[1] == -1 ? 1 : Mask[1]) << 2;
		Imm |= (Mask[2] == -1 ? 2 : Mask[2]) << 4;
		Imm |= (Mask[3] == -1 ? 3 : Mask[3]) << 6;
		filcab: I might be missing something. How does this handle an <i32 2, i32 0> mask? Wouldn't it generate { V1[0], V2[0] } instead of { V2[0], V1[0] } ?
		chandlerc: How? There is a test case that covers this. ;] I think the key is that we canonicalize the mask first, so we know that by the time we get here there is exactly one entry from each input, and when we have equal numbers of inputs, we canonicalize such that the first input provides more low-half inputs than the second input.
		return DAG.getConstant(Imm, MVT::i8);
		}
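
The immediate encoding used by `getV4ShuffleImmForMask` packs four 2-bit lane selectors into one byte, with lane 0 in the lowest bits. Below is a minimal standalone sketch, assuming a hypothetical `v4ShuffleImm` that returns the raw byte instead of a DAG constant; it mirrors the patch's choice of encoding an undef (-1) entry as the lane's own index.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: same bit layout as the patch, but returns a plain byte.
std::uint8_t v4ShuffleImm(const std::array<int, 4> &Mask) {
  unsigned Imm = 0;
  for (int Lane = 0; Lane < 4; ++Lane) {
    // An undef entry keeps the lane in place by selecting its own index.
    int Sel = Mask[Lane] == -1 ? Lane : Mask[Lane];
    Imm |= static_cast<unsigned>(Sel) << (2 * Lane);
  }
  return static_cast<std::uint8_t>(Imm);
}
```

With this layout the identity mask `{0, 1, 2, 3}` encodes as 0xE4 and the full reversal `{3, 2, 1, 0}` as 0x1B; an all-undef mask also yields 0xE4.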

		/// \brief Handle lowering of 2-lane 64-bit floating point shuffles.
		///
		/// This is the basis function for the 2-lane 64-bit shuffles as we have full
		/// support for floating point shuffles but not integer shuffles. These
		/// instructions will incur a domain crossing penalty on some chips though so
		/// it is better to avoid lowering through this for integer vectors where
		/// possible.
		static SDValue lowerV2F64VectorShuffle(SDValue Op, SDValue V1, SDValue V2,
		const X86Subtarget *Subtarget,
		SelectionDAG &DAG) {
		SDLoc DL(Op);
		assert(Op.getSimpleValueType() == MVT::v2f64 && "Bad shuffle type!");
		grosbach: s/But we have to map the mask as it/We have to map the mask, as it/
		chandlerc: Done.
		assert(V1.getSimpleValueType() == MVT::v2f64 && "Bad operand type!");
		assert(V2.getSimpleValueType() == MVT::v2f64 && "Bad operand type!");
		ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(Op);
		ArrayRef<int> Mask = SVOp->getMask();
		assert(Mask.size() == 2 && "Unexpected mask size for v2 shuffle!");
		filcab: What about bitcasting the PSHUFD output back to v2i64?
		chandlerc: Done. I think the DAG stuff was fixing this for me. This code is definitely covered by tests. =/

		if (isSingleInputShuffleMask(Mask)) {
		// Straight shuffle of a single input vector. Simulate this by using the
		// single input as both of the "inputs" to this instruction.
		unsigned SHUFPDMask = (Mask[0] == 1) | ((Mask[1] == 1) << 1);
		return DAG.getNode(X86ISD::SHUFP, SDLoc(Op), MVT::v2f64, V1, V1,
		DAG.getConstant(SHUFPDMask, MVT::i8));
		}
		assert(Mask[0] >= 0 && Mask[0] < 2 && "Non-canonicalized blend!");
		assert(Mask[1] >= 2 && "Non-canonicalized blend!");

		unsigned SHUFPDMask = (Mask[0] == 1) | (((Mask[1] - 2) == 1) << 1);
		return DAG.getNode(X86ISD::SHUFP, SDLoc(Op), MVT::v2f64, V1, V2,
		DAG.getConstant(SHUFPDMask, MVT::i8));
		}
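
The two SHUFPD immediates computed in `lowerV2F64VectorShuffle` use one bit per result lane. A minimal sketch of both computations follows, assuming hypothetical helper names `shufpdImmSingleInput` and `shufpdImmBlend` and a plain `std::array<int, 2>` mask; for the blend form it relies on the same canonicalization the asserts check (lane 0 from V1, lane 1 from V2).

```cpp
#include <array>

// Single input: bit i is set when result lane i takes element 1 of the input.
// Undef (-1) entries fall through to element 0.
unsigned shufpdImmSingleInput(const std::array<int, 2> &Mask) {
  unsigned Lo = Mask[0] == 1 ? 1u : 0u;
  unsigned Hi = Mask[1] == 1 ? 1u : 0u;
  return Lo | (Hi << 1);
}

// Canonical blend: lane 0 comes from V1 (indices 0..1), lane 1 from V2
// (indices 2..3), so the V2 index is rebased by subtracting 2.
unsigned shufpdImmBlend(const std::array<int, 2> &Mask) {
  unsigned Lo = Mask[0] == 1 ? 1u : 0u;
  unsigned Hi = (Mask[1] - 2) == 1 ? 1u : 0u;
  return Lo | (Hi << 1);
}
```

For example, the single-input swap `{1, 0}` gives immediate 1, and the blend `{0, 3}` (V1[0], V2[1]) gives immediate 2.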

		/// \brief Handle lowering of 2-lane 64-bit integer shuffles.
		///
		/// Tries to lower a 2-lane 64-bit shuffle using shuffle operations provided by
		/// the integer unit to minimize domain crossing penalties. However, for blends
		/// it falls back to the floating point shuffle operation with appropriate bit
		/// casting.
		static SDValue lowerV2I64VectorShuffle(SDValue Op, SDValue V1, SDValue V2,
		grosbach: space after "{" and before "}"
		chandlerc: See the coding guidelines and clang-format -- this is how it formats braces specifically to interoperate cleanly with braced init lists in C++11.
		grosbach: clang-format agrees with me. That's why I made the comment, actually.
		chandlerc: To record here what took place on IRC -- Jim's build of clang-format was stale for truly mysterious reasons... no idea why...
		const X86Subtarget *Subtarget,
		SelectionDAG &DAG) {
		SDLoc DL(Op);
		assert(Op.getSimpleValueType() == MVT::v2i64 && "Bad shuffle type!");
		assert(V1.getSimpleValueType() == MVT::v2i64 && "Bad operand type!");
		grosbach: Why not use the helper function you defined above to check for the shuffle being single input? I guess because you want to use the element count directly in the following code?
		chandlerc: Exactly. When I have to count anyways, I may as well re-use that.
		filcab: Since we use NumV2Elements a bunch of times, I'd be ok with not going through the mask two times when we can just go one time.
		assert(V2.getSimpleValueType() == MVT::v2i64 && "Bad operand type!");
		ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(Op);
		ArrayRef<int> Mask = SVOp->getMask();
		assert(Mask.size() == 2 && "Unexpected mask size for v2 shuffle!");

		if (isSingleInputShuffleMask(Mask)) {
		// Straight shuffle of a single input vector. For everything from SSE2
		// onward this has a single fast instruction with no scary immediates.
		// We have to map the mask as it is actually a v4i32 shuffle instruction.
		grosbach: std::find_if instead?
		chandlerc: Sure. The first time I wrote this I was annoyed by re-computing the index from find_if and switched it to a while loop, but maybe the loop is too high cost.
		V1 = DAG.getNode(ISD::BITCAST, DL, MVT::v4i32, V1);
		grosbach: This is a neat trick for getting the other lane of the high half, but it isn't immediately obvious that's what it's doing. Either a ?: to make the selection explicit or a comment explaining the bitwise trick would be good.
		chandlerc: I have a distaste for comments which merely explain what the standard says the C++ in question does. I tried to help this by naming the variable after the "adjacent" index. Is there a better variable name that would give the context? If not I can do something else...
		grosbach: It's not just explaining what the standard says, though. It explains why. I spent a couple minutes looking at that code wondering what you were up to before I realized and I had to work through the bit patterns in my head to do it. Maybe I'm just being dense and it should have been obvious, but it wasn't. I really think a comment explaining the intent (not the language semantics) is very useful here. The abbreviation certainly didn't help. I thought it meant "Adjusted" not "Adjacent".
		chandlerc: Comments added.
		int WidenedMask[4] = {
		std::max(Mask[0], 0) * 2, std::max(Mask[0], 0) * 2 + 1,
		std::max(Mask[1], 0) * 2, std::max(Mask[1], 0) * 2 + 1};
		return DAG.getNode(ISD::BITCAST, DL, MVT::v2i64,
		DAG.getNode(X86ISD::PSHUFD, SDLoc(Op), MVT::v4i32, V1,
		getV4ShuffleImmForMask(WidenedMask, DAG)));
		}

		// We implement this with SHUFPD which is pretty lame because it will likely
		// incur 2 cycles of stall for integer vectors on Nehalem and older chips.
		// However, all the alternatives are still more cycles and newer chips don't
		// have this problem. It would be really nice if x86 had better shuffles here.
		grosbach: Big ask, but I'd love some ASCII art type examples of the transforms, in particular for those like this one that need more than one instruction. Not the complete set of possibilities; just a single example that's illustrative showing arrows for which lanes get moved where by which insn.
		chandlerc: I'm hesitant to do this. Every time I drew an ascii art picture (and I drew several) or something on a white board, it actually messed me up because I started to silently assume that the one example I had was actually representative of the other ways to hit the same scenario. Unless this is breaking your ability to reason about the code, I'm going to hold off investing a ton of time in crafting these...
		grosbach: I'm drawing lots of stuff on paper to be able to reason about the code. Some comments to help guide that thought process would help. From another perspective, comments breaking down what the code is intended to cover help review to make sure it matches what it actually does cover. I understand that a mathematically sound exposition of all the cases and how they transpose into one another is likely not practical, but I still think a few examples would be useful. Doesn't have to be ASCII art. I just like pictures. Simple examples, even expressed in terms of __builtin_shufflevector() .c code perhaps, would go a long way.
		chandlerc: Ok. However, this comment was just bad. Bordering on terrible. I've rewritten it to be less confusing. =] I'm not sure after that exactly where diagrams or examples would most help? Could you point at the specific places that need more illustration in light of the added comments from the last couple of iterations?
		V1 = DAG.getNode(ISD::BITCAST, DL, MVT::v2f64, V1);
		V2 = DAG.getNode(ISD::BITCAST, DL, MVT::v2f64, V2);
		return DAG.getNode(ISD::BITCAST, DL, MVT::v2i64,
		DAG.getVectorShuffle(MVT::v2f64, DL, V1, V2, Mask));
		}
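
The single-input path of `lowerV2I64VectorShuffle` works because each 64-bit lane is a pair of adjacent 32-bit lanes, so a v2i64 mask can be rewritten as a v4i32 mask for PSHUFD. A minimal sketch of that widening, assuming a hypothetical `widenV2MaskToV4` on `std::array`; as in the patch, an undef (-1) entry is clamped to lane 0 via `std::max`.

```cpp
#include <algorithm>
#include <array>

// Hypothetical helper: expand each 64-bit lane index into the two 32-bit
// lane indices that cover it.
std::array<int, 4> widenV2MaskToV4(const std::array<int, 2> &Mask) {
  std::array<int, 4> Wide{};
  for (int I = 0; I < 2; ++I) {
    int Src = std::max(Mask[I], 0); // Undef is treated as lane 0.
    Wide[2 * I] = 2 * Src;          // Low dword of the 64-bit lane.
    Wide[2 * I + 1] = 2 * Src + 1;  // High dword of the 64-bit lane.
  }
  return Wide;
}
```

Swapping the two 64-bit lanes, `{1, 0}`, becomes the dword mask `{2, 3, 0, 1}`.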

		/// \brief Lower 4-lane 32-bit floating point shuffles.
		///
		/// Uses instructions exclusively from the floating point unit to minimize
		/// domain crossing penalties, as these are sufficient to implement all v4f32
		/// shuffles.
		static SDValue lowerV4F32VectorShuffle(SDValue Op, SDValue V1, SDValue V2,
		const X86Subtarget *Subtarget,
		SelectionDAG &DAG) {
		SDLoc DL(Op);
		assert(Op.getSimpleValueType() == MVT::v4f32 && "Bad shuffle type!");
		assert(V1.getSimpleValueType() == MVT::v4f32 && "Bad operand type!");
		assert(V2.getSimpleValueType() == MVT::v4f32 && "Bad operand type!");
		ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(Op);
		ArrayRef<int> Mask = SVOp->getMask();
		assert(Mask.size() == 4 && "Unexpected mask size for v4 shuffle!");

		SDValue LowV = V1, HighV = V2;
		int NewMask[4] = {Mask[0], Mask[1], Mask[2], Mask[3]};

		int NumV2Elements =
		std::count_if(Mask.begin(), Mask.end(), [](int M) { return M >= 4; });

		if (NumV2Elements == 0)
		// Straight shuffle of a single input vector. We pass the input vector to
		// both operands to simulate this with a SHUFPS.
		grosbach: Spacing for "{" "}".
		chandlerc: Same comment, this is the clang-format enforced style and matches the coding standards. I'll give a citation to help: http://llvm.org/docs/CodingStandards.html#braced-initializer-lists
		grosbach: Then clang-format should be fixed. That's what I used to verify I was correct. I just tried again and got the same result.
		chandlerc: As noted above, we sorted this out...
		return DAG.getNode(X86ISD::SHUFP, DL, MVT::v4f32, V1, V1,
		getV4ShuffleImmForMask(Mask, DAG));

		if (NumV2Elements == 1) {
		int V2Index =
		std::find_if(Mask.begin(), Mask.end(), [](int M) { return M >= 4; }) -
		Mask.begin();
		// Compute the index adjacent to V2Index and in the same half by toggling
		// the low bit.
		int V2AdjIndex = V2Index ^ 1;

		if (Mask[V2AdjIndex] == -1) {
		// Handles all the cases where we have a single V2 element and an undef.
		// This will only ever happen in the high lanes because we commute the
		// vector otherwise.
		if (V2Index < 2)
		std::swap(LowV, HighV);
		NewMask[V2Index] -= 4;
		} else {
		// Handle the case where the V2 element ends up adjacent to a V1 element.
		// To make this work, blend them together as the first step.
		int V1Index = V2AdjIndex;
		int BlendMask[4] = {Mask[V2Index] - 4, 0, Mask[V1Index], 0};
		V2 = DAG.getNode(X86ISD::SHUFP, DL, MVT::v4f32, V2, V1,
		getV4ShuffleImmForMask(BlendMask, DAG));

		// Now proceed to reconstruct the final blend as we have the necessary
		// high or low half formed.
		if (V2Index < 2) {
		LowV = V2;
		HighV = V1;
		} else {
		HighV = V2;
		}
		NewMask[V1Index] = 2; // We put the V1 element in V2[2].
		NewMask[V2Index] = 0; // We shifted the V2 element into V2[0].
		}
		} else if (NumV2Elements == 2) {
		filcab: Nitpick: I would change the phrasing. If we only use one shufps, shufps is smaller (2 opcode bytes vs. 3 opcode bytes). If we need two shufps, then it's better to just use shufpd. The comment, as it is, seems, to me, like it suggests we still need to check if there's any case where this is helpful.
		chandlerc: Not sure what you're really saying. First, we're talking about shufps vs. pshufd. Second, we would never need more or fewer of shufps vs. pshufd, they are both equally powerful when using a single input. However, yes, I've now looked, and there is no reason for this. I'll nuke the FIXME. If anything, we might want a combine somewhere very late that can see when we don't need the free copy pshufd provides and we don't pay a domain crossing penalty, and so it is an easy code size win to rewrite it as a shufps.
		if (Mask[0] < 4 && Mask[1] < 4) {
		// Handle the easy case where we have V1 in the low lanes and V2 in the
		// high lanes. We never see this reversed because we sort the shuffle.
		NewMask[2] -= 4;
		NewMask[3] -= 4;
		} else {
		// We have a mixture of V1 and V2 in both low and high lanes. Rather than
		// trying to place elements directly, just blend them and set up the final
		// shuffle to place them.

		grosbach: This is a really tricky part of the patch. Can you add a comment describing the approach and algorithm to the function? Without a lot of very explicit exposition, I'm concerned about future "improvements" to this code path inadvertently breaking things.
		chandlerc: Yea, I didn't have that comment because there were many tens of different approaches tried before one worked. I'm also still hoping to simplify some parts of it. Also, all the details are already documented in each branch of the hybrid approach.
		grosbach: Maybe something about "there be dragons here, and not the friendly kind" then? If it was that tricky to get right, I'm more concerned about someone helpfully coming along and "fixing" it later without understanding all the cases. Tests go a long way there, of course, but I don't like relying on that exclusively.
		chandlerc: I've tried to do a decent job of commenting the strategy with appropriate warnings and cross references. I think it ended up being simpler to document than I feared, so hopefully it's good now.
		// The first two blend mask elements are for V1, the second two are for
		// V2.
		int BlendMask[4] = {Mask[0] < 4 ? Mask[0] : Mask[1],
		Mask[2] < 4 ? Mask[2] : Mask[3],
		(Mask[0] >= 4 ? Mask[0] : Mask[1]) - 4,
		(Mask[2] >= 4 ? Mask[2] : Mask[3]) - 4};
		V1 = DAG.getNode(X86ISD::SHUFP, DL, MVT::v4f32, V1, V2,
		getV4ShuffleImmForMask(BlendMask, DAG));

		// Now we do a normal shuffle of V1 by giving V1 as both operands to
		// a blend.
		LowV = HighV = V1;
		NewMask[0] = Mask[0] < 4 ? 0 : 2;
		NewMask[1] = Mask[0] < 4 ? 2 : 0;
		NewMask[2] = Mask[2] < 4 ? 1 : 3;
		NewMask[3] = Mask[2] < 4 ? 3 : 1;
		}
		}
		return DAG.getNode(X86ISD::SHUFP, DL, MVT::v4f32, LowV, HighV,
		getV4ShuffleImmForMask(NewMask, DAG));
		}
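
The mixed case in `lowerV4F32VectorShuffle` (one V1 element and one V2 element in each half) can be checked with a small scalar model. The sketch below is an illustration only: `shufps`, `lowerMixedBlend`, and `referenceShuffle` are hypothetical names, vectors are modeled as `std::array<int, 4>`, undef entries are not handled, and the second step feeds the blended vector to both SHUFPS operands as the comment in the patch describes.

```cpp
#include <array>

using V4 = std::array<int, 4>;

// Scalar model of SHUFPS: the low two result lanes are selected from A and
// the high two from B, each selector being a lane index in 0..3.
V4 shufps(const V4 &A, const V4 &B, const V4 &Sel) {
  return {A[Sel[0]], A[Sel[1]], B[Sel[2]], B[Sel[3]]};
}

// Two-step lowering for masks with exactly one V1 and one V2 element in each
// half (no undef entries in this sketch).
V4 lowerMixedBlend(const V4 &V1, const V4 &V2, const V4 &Mask) {
  // Step 1: gather the two V1 elements into lanes 0-1 and the two V2
  // elements into lanes 2-3.
  V4 BlendMask = {Mask[0] < 4 ? Mask[0] : Mask[1],
                  Mask[2] < 4 ? Mask[2] : Mask[3],
                  (Mask[0] >= 4 ? Mask[0] : Mask[1]) - 4,
                  (Mask[2] >= 4 ? Mask[2] : Mask[3]) - 4};
  V4 Blended = shufps(V1, V2, BlendMask);

  // Step 2: shuffle the blended vector against itself to place each element.
  V4 NewMask = {Mask[0] < 4 ? 0 : 2, Mask[0] < 4 ? 2 : 0,
                Mask[2] < 4 ? 1 : 3, Mask[2] < 4 ? 3 : 1};
  return shufps(Blended, Blended, NewMask);
}

// Direct evaluation of a two-input shuffle mask, used as the expected answer.
V4 referenceShuffle(const V4 &V1, const V4 &V2, const V4 &Mask) {
  V4 R{};
  for (int I = 0; I < 4; ++I)
    R[I] = Mask[I] < 4 ? V1[Mask[I]] : V2[Mask[I] - 4];
  return R;
}
```

For the interleaving mask `{0, 4, 1, 5}`, step 1 produces {V1[0], V1[1], V2[0], V2[1]} and step 2 reorders it with selectors `{0, 2, 1, 3}`, which matches `referenceShuffle`.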

		/// \brief Lower 4-lane i32 vector shuffles.
		///
		/// We try to handle these with integer-domain shuffles where we can, but for
		/// blends we use the floating point domain blend instructions.
		static SDValue lowerV4I32VectorShuffle(SDValue Op, SDValue V1, SDValue V2,
		const X86Subtarget *Subtarget,
		SelectionDAG &DAG) {
		SDLoc DL(Op);
		assert(Op.getSimpleValueType() == MVT::v4i32 && "Bad shuffle type!");
		grosbach: An example in the comment would really help here.
		chandlerc: Hmm, ok. Not sure it'll actually help, but done. =] This trick took me weeks to figure out.
		assert(V1.getSimpleValueType() == MVT::v4i32 && "Bad operand type!");
		assert(V2.getSimpleValueType() == MVT::v4i32 && "Bad operand type!");
		ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(Op);
		ArrayRef<int> Mask = SVOp->getMask();
		grosbach: llvm/lib/Target/X86/X86ISelLowering.cpp:7000:24: error: no member named 'accumulate' in namespace 'std' Need to add "#include <numeric>" I think.
		chandlerc: Done.
		assert(Mask.size() == 4 && "Unexpected mask size for v4 shuffle!");

		if (isSingleInputShuffleMask(Mask))
		// Straight shuffle of a single input vector. For everything from SSE2
		// onward this has a single fast instruction with no scary immediates.
		return DAG.getNode(X86ISD::PSHUFD, DL, MVT::v4i32, V1,
		filcab: This seems wrong. How can you have [{0,1}{2,7}{4,5}{6,3}] -> [{0,1}{4,7}{2,3}{6,5}]? How did the {2,7} get mixed with the {4,5}? Wouldn't pshufd just shuffle double words around but not break them? Shouldn't it become [{0,1}{4,5}{2,7}{6,3}]?
		chandlerc: Yea, this is why examples aren't clear. The PSHUFD applies to the input to this shuffle. The result here is the new mask needed after the pshufd to produce the same final output. How can I make this more clear?
		chandlerc: Think I have a better way to illustrate it...
		getV4ShuffleImmForMask(Mask, DAG));

		// We implement this with SHUFPS because it can blend from two vectors.
		// Because we're going to eventually use SHUFPS, we use SHUFPS even to build
		// up the inputs, bypassing domain shift penalties that we would incur if we
		// directly used PSHUFD on Nehalem and older. For newer chips, this isn't
		// relevant.
		return DAG.getNode(ISD::BITCAST, DL, MVT::v4i32,
		filcab: Commenting how that subtraction gets us the index that's not used would be nice.
		chandlerc: Yea, this is way down the clever rabbit hole.
		DAG.getVectorShuffle(
		MVT::v4f32, DL,
		DAG.getNode(ISD::BITCAST, DL, MVT::v4f32, V1),
		DAG.getNode(ISD::BITCAST, DL, MVT::v4f32, V2), Mask));
		}

		/// \brief Lowering of single-input v8i16 shuffles is the cornerstone of SSE2
		/// shuffle lowering, and the most complex part.
		///
		/// The lowering strategy is to try to form pairs of input lanes which are
		/// targeted at the same half of the final vector, and then use a dword shuffle
		/// to place them onto the right half, and finally unpack the paired lanes into
		/// their final position.
		///
		/// The exact breakdown of how to form these dword pairs and align them on the
		/// correct sides is really tricky. See the comments within the function for
		/// more of the details.
		static SDValue lowerV8I16SingleInputVectorShuffle(
		SDLoc DL, SDValue V, MutableArrayRef<int> Mask,
		const X86Subtarget *Subtarget, SelectionDAG &DAG) {
		assert(V.getSimpleValueType() == MVT::v8i16 && "Bad input type!");
		MutableArrayRef<int> LoMask = Mask.slice(0, 4);
		MutableArrayRef<int> HiMask = Mask.slice(4, 4);

		auto isLo = [](int M) { return M >= 0 && M < 4; };
		auto isHi = [](int M) { return M >= 4; };

		SmallVector<int, 4> LoInputs;
		std::copy_if(LoMask.begin(), LoMask.end(), std::back_inserter(LoInputs),
		[](int M) { return M >= 0; });
		std::sort(LoInputs.begin(), LoInputs.end());
		LoInputs.erase(std::unique(LoInputs.begin(), LoInputs.end()), LoInputs.end());
		SmallVector<int, 4> HiInputs;
		std::copy_if(HiMask.begin(), HiMask.end(), std::back_inserter(HiInputs),
		[](int M) { return M >= 0; });
		std::sort(HiInputs.begin(), HiInputs.end());
		HiInputs.erase(std::unique(HiInputs.begin(), HiInputs.end()), HiInputs.end());
		int NumLToL =
		std::lower_bound(LoInputs.begin(), LoInputs.end(), 4) - LoInputs.begin();
		int NumHToL = LoInputs.size() - NumLToL;
		int NumLToH =
		std::lower_bound(HiInputs.begin(), HiInputs.end(), 4) - HiInputs.begin();
		int NumHToH = HiInputs.size() - NumLToH;
		MutableArrayRef<int> LToLInputs(LoInputs.data(), NumLToL);
		MutableArrayRef<int> LToHInputs(HiInputs.data(), NumLToH);
		MutableArrayRef<int> HToLInputs(LoInputs.data() + NumLToL, NumHToL);
		MutableArrayRef<int> HToHInputs(HiInputs.data() + NumLToH, NumHToH);

		// Simplify the 1-into-3 and 3-into-1 cases with a single pshufd. For all
		// such inputs we can swap two of the dwords across the half mark and end up
		// with <=2 inputs to each half from each half. Once there, we can fall through
		// to the generic code below. For example:
		//
		// Input: [a, b, c, d, e, f, g, h] -PSHUFD[0,2,1,3]-> [a, b, e, f, c, d, g, h]
		// Mask: [0, 1, 2, 7, 4, 5, 6, 3] -----------------> [0, 1, 4, 7, 2, 3, 6, 5]
		//
		// Before we had 3-1 in the low half and 3-1 in the high half. Afterward, 2-2
		// and 2-2.
		auto balanceSides = [&](ArrayRef<int> ThreeInputs, int OneInput,
		int ThreeInputHalfSum, int OneInputHalfOffset) {
		filcab: Wouldn't InPlaceInputs[0] ^ 1 do the same thing? And it would be the same trick you used in the previous function.
		chandlerc: Good idea. ;] I should be way more consistent with these tricks.
		// Compute the index of the dword with only one word among the three inputs in
		// a half by taking the sum of the half with three inputs and subtracting
		// the sum of the actual three inputs. The difference is the remaining
		// slot.
		int DWordA = (ThreeInputHalfSum -
		std::accumulate(ThreeInputs.begin(), ThreeInputs.end(), 0)) /
		2;
		int DWordB = OneInputHalfOffset / 2 + (OneInput / 2 + 1) % 2;

		int PSHUFDMask[] = {0, 1, 2, 3};
		PSHUFDMask[DWordA] = DWordB;
		PSHUFDMask[DWordB] = DWordA;
		V = DAG.getNode(ISD::BITCAST, DL, MVT::v8i16,
		DAG.getNode(X86ISD::PSHUFD, DL, MVT::v4i32,
		DAG.getNode(ISD::BITCAST, DL, MVT::v4i32, V),
		getV4ShuffleImmForMask(PSHUFDMask, DAG)));

		// Adjust the mask to match the new locations of A and B.
		for (int &M : Mask)
		if (M != -1 && M/2 == DWordA)
		M = 2 * DWordB + M % 2;
		else if (M != -1 && M/2 == DWordB)
		M = 2 * DWordA + M % 2;

		// Recurse back into this routine to re-compute state now that this isn't
		// a 3 and 1 problem.
		return DAG.getVectorShuffle(MVT::v8i16, DL, V, DAG.getUNDEF(MVT::v8i16),
		Mask);
		};
		if (NumLToL == 3 && NumHToL == 1)
		return balanceSides(LToLInputs, HToLInputs[0], 0 + 1 + 2 + 3, 4);
		else if (NumLToL == 1 && NumHToL == 3)
		return balanceSides(HToLInputs, LToLInputs[0], 4 + 5 + 6 + 7, 0);
		else if (NumLToH == 1 && NumHToH == 3)
		return balanceSides(HToHInputs, LToHInputs[0], 4 + 5 + 6 + 7, 0);
		else if (NumLToH == 3 && NumHToH == 1)
		return balanceSides(LToHInputs, HToHInputs[0], 0 + 1 + 2 + 3, 4);
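
The dword selection and mask remapping inside `balanceSides` can be replayed on plain integers. This is a minimal sketch, assuming a hypothetical free function that returns the chosen dwords and the rewritten mask instead of building DAG nodes; the arithmetic is copied from the lambda above.

```cpp
#include <array>
#include <numeric>
#include <vector>

// Hypothetical result type: the two dwords to swap and the adjusted mask.
struct Balanced {
  int DWordA;
  int DWordB;
  std::array<int, 8> Mask;
};

Balanced balanceSides(std::array<int, 8> Mask,
                      const std::vector<int> &ThreeInputs, int OneInput,
                      int ThreeInputHalfSum, int OneInputHalfOffset) {
  // The word missing from the three inputs identifies the dword that holds
  // only one of them.
  int DWordA = (ThreeInputHalfSum -
                std::accumulate(ThreeInputs.begin(), ThreeInputs.end(), 0)) /
               2;
  // In the other half, pick the dword that does not hold the lone input.
  int DWordB = OneInputHalfOffset / 2 + (OneInput / 2 + 1) % 2;

  // Rewrite mask entries so they follow the two swapped dwords.
  for (int &M : Mask) {
    if (M == -1)
      continue;
    if (M / 2 == DWordA)
      M = 2 * DWordB + M % 2;
    else if (M / 2 == DWordB)
      M = 2 * DWordA + M % 2;
  }
  return {DWordA, DWordB, Mask};
}
```

Replaying the example from the comment, the mask `[0, 1, 2, 7, 4, 5, 6, 3]` with low-half inputs `{0, 1, 2}` and lone input 7 selects dwords 1 and 2 (the PSHUFD `[0, 2, 1, 3]`) and yields the new mask `[0, 1, 4, 7, 2, 3, 6, 5]`.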

		// At this point there are at most two inputs to the low and high halves from
		// each half. That means the inputs can always be grouped into dwords and
		// those dwords can then be moved to the correct half with a dword shuffle.
		// We use at most one low and one high word shuffle to collect these paired
		// inputs into dwords, and finally a dword shuffle to place them.
		int PSHUFLMask[4] = {-1, -1, -1, -1};
		int PSHUFHMask[4] = {-1, -1, -1, -1};
		int PSHUFDMask[4] = {-1, -1, -1, -1};

		// First fix the masks for all the inputs that are staying in their
		// original halves. This will then dictate the targets of the cross-half
		// shuffles.
		auto fixInPlaceInputs = [&PSHUFDMask](
		ArrayRef<int> InPlaceInputs, MutableArrayRef<int> SourceHalfMask,
		MutableArrayRef<int> HalfMask, int HalfOffset) {
		if (InPlaceInputs.empty())
		return;
		if (InPlaceInputs.size() == 1) {
		SourceHalfMask[InPlaceInputs[0] - HalfOffset] =
		InPlaceInputs[0] - HalfOffset;
		PSHUFDMask[InPlaceInputs[0] / 2] = InPlaceInputs[0] / 2;
		return;
		}

		assert(InPlaceInputs.size() == 2 && "Cannot handle 3 or 4 inputs!");
		SourceHalfMask[InPlaceInputs[0] - HalfOffset] =
		InPlaceInputs[0] - HalfOffset;
		// Put the second input next to the first so that they are packed into
		// a dword. We find the adjacent index by toggling the low bit.
		int AdjIndex = InPlaceInputs[0] ^ 1;
		SourceHalfMask[AdjIndex - HalfOffset] = InPlaceInputs[1] - HalfOffset;
		std::replace(HalfMask.begin(), HalfMask.end(), InPlaceInputs[1], AdjIndex);
		PSHUFDMask[AdjIndex / 2] = AdjIndex / 2;
		};
		if (!HToLInputs.empty())
		fixInPlaceInputs(LToLInputs, PSHUFLMask, LoMask, 0);
		if (!LToHInputs.empty())
		fixInPlaceInputs(HToHInputs, PSHUFHMask, HiMask, 4);

		// Now gather the cross-half inputs and place them into a free dword of
		// their target half.
		// FIXME: This operation could almost certainly be simplified dramatically to
		// look more like the 3-1 fixing operation.
		auto moveInputsToRightHalf = [&PSHUFDMask](
		MutableArrayRef<int> IncomingInputs, ArrayRef<int> ExistingInputs,
		MutableArrayRef<int> SourceHalfMask, MutableArrayRef<int> HalfMask,
		int SourceOffset, int DestOffset) {
		auto isWordClobbered = [](ArrayRef<int> SourceHalfMask, int Word) {
		return SourceHalfMask[Word] != -1 && SourceHalfMask[Word] != Word;
		};
		auto isDWordClobbered = [&isWordClobbered](ArrayRef<int> SourceHalfMask,
		int Word) {
		int LowWord = Word & ~1;
		int HighWord = Word | 1;
		return isWordClobbered(SourceHalfMask, LowWord) ||
		isWordClobbered(SourceHalfMask, HighWord);
		};

		if (IncomingInputs.empty())
		return;

		if (ExistingInputs.empty()) {
		// Map any dwords with inputs from them into the right half.
		for (int Input : IncomingInputs) {
		// If the source half mask maps over the inputs, turn those into
		// swaps and use the swapped lane.
		if (isWordClobbered(SourceHalfMask, Input - SourceOffset)) {
		if (SourceHalfMask[SourceHalfMask[Input - SourceOffset]] == -1) {
		SourceHalfMask[SourceHalfMask[Input - SourceOffset]] =
		Input - SourceOffset;
		// We have to swap the uses in our half mask in one sweep.
		for (int &M : HalfMask)
		if (M == SourceHalfMask[Input - SourceOffset])
		M = Input;
		else if (M == Input)
		M = SourceHalfMask[Input - SourceOffset] + SourceOffset;
		} else {
		assert(SourceHalfMask[SourceHalfMask[Input - SourceOffset]] ==
		Input - SourceOffset &&
		"Previous placement was different!");
		grosbach: Different from what? I don't have a better suggestion for the message, unfortunately.
		}
		// Note that this correctly re-maps both when we do a swap and when
		// we observe the other side of the swap above. We rely on that to
		// avoid swapping the members of the input list directly.
		Input = SourceHalfMask[Input - SourceOffset] + SourceOffset;
		}

		// Map the input's dword into the correct half.
		if (PSHUFDMask[(Input - SourceOffset + DestOffset) / 2] == -1)
		PSHUFDMask[(Input - SourceOffset + DestOffset) / 2] = Input / 2;
		else
		assert(PSHUFDMask[(Input - SourceOffset + DestOffset) / 2] ==
		Input / 2 &&
		"Previous placement was different!");
		}

		// And just directly shift any other-half mask elements to be same-half
		// as we will have mirrored the dword containing the element into the
		// same position within that half.
		for (int &M : HalfMask)
		if (M >= SourceOffset && M < SourceOffset + 4) {
		M = M - SourceOffset + DestOffset;
		assert(M >= 0 && "This should never wrap below zero!");
		}
		return;
		}

		grosbach: s/test/tests/
		chandlerc: Done.
		// Ensure we have the input in a viable dword of its current half. This
		// is particularly tricky because the original position may be clobbered
		// by inputs being moved and staying in that half.
		if (IncomingInputs.size() == 1) {
		if (isWordClobbered(SourceHalfMask, IncomingInputs[0] - SourceOffset)) {
		int InputFixed = std::find(std::begin(SourceHalfMask),
		std::end(SourceHalfMask), -1) -
		std::begin(SourceHalfMask) + SourceOffset;
		SourceHalfMask[InputFixed - SourceOffset] =
		IncomingInputs[0] - SourceOffset;
		std::replace(HalfMask.begin(), HalfMask.end(), IncomingInputs[0],
		InputFixed);
		IncomingInputs[0] = InputFixed;
		}
		} else if (IncomingInputs.size() == 2) {
		if (IncomingInputs[0] / 2 != IncomingInputs[1] / 2 ||
		isDWordClobbered(SourceHalfMask, IncomingInputs[0] - SourceOffset)) {
		int SourceDWordBase = !isDWordClobbered(SourceHalfMask, 0) ? 0 : 2;
		assert(!isDWordClobbered(SourceHalfMask, SourceDWordBase) &&
		"Not all dwords can be clobbered!");
		SourceHalfMask[SourceDWordBase] = IncomingInputs[0] - SourceOffset;
		SourceHalfMask[SourceDWordBase + 1] = IncomingInputs[1] - SourceOffset;
		for (int &M : HalfMask)
		if (M == IncomingInputs[0])
		M = SourceDWordBase + SourceOffset;
		else if (M == IncomingInputs[1])
		M = SourceDWordBase + 1 + SourceOffset;
		IncomingInputs[0] = SourceDWordBase + SourceOffset;
		IncomingInputs[1] = SourceDWordBase + 1 + SourceOffset;
		}
		} else {
		llvm_unreachable("Unhandled input size!");
		}

		// Now hoist the DWord down to the right half.
		grosbach: Same thing as above. Function comment describing the approach will go a long way.
		chandlerc: I've added them to most. Note that these comments will only get more confusing as time goes on. =/ I'm not sure of a good way to document these. Each new ISA extension or trick I add will require updates. Hopefully I won't miss any. =]
		int FreeDWord = (PSHUFDMask[DestOffset / 2] == -1 ? 0 : 1) + DestOffset / 2;
		assert(PSHUFDMask[FreeDWord] == -1 && "DWord not free");
		PSHUFDMask[FreeDWord] = IncomingInputs[0] / 2;
		for (int Input : IncomingInputs)
		std::replace(HalfMask.begin(), HalfMask.end(), Input,
		FreeDWord * 2 + Input % 2);
		};
		moveInputsToRightHalf(HToLInputs, LToLInputs, PSHUFHMask, LoMask,
		/*SourceOffset*/ 4, /*DestOffset*/ 0);
		moveInputsToRightHalf(LToHInputs, HToHInputs, PSHUFLMask, HiMask,
		/*SourceOffset*/ 0, /*DestOffset*/ 4);

		// Now enact all the shuffles we've computed to move the inputs into their
		// target half.
		if (!isNoopShuffleMask(PSHUFLMask))
		V = DAG.getNode(X86ISD::PSHUFLW, DL, MVT::v8i16, V,
		getV4ShuffleImmForMask(PSHUFLMask, DAG));
		if (!isNoopShuffleMask(PSHUFHMask))
		V = DAG.getNode(X86ISD::PSHUFHW, DL, MVT::v8i16, V,
		getV4ShuffleImmForMask(PSHUFHMask, DAG));
		if (!isNoopShuffleMask(PSHUFDMask))
		V = DAG.getNode(ISD::BITCAST, DL, MVT::v8i16,
		DAG.getNode(X86ISD::PSHUFD, DL, MVT::v4i32,
		DAG.getNode(ISD::BITCAST, DL, MVT::v4i32, V),
		getV4ShuffleImmForMask(PSHUFDMask, DAG)));

		// At this point, each half should contain all its inputs, and we can then
		// just shuffle them into their final position.
		assert(std::count_if(LoMask.begin(), LoMask.end(), isHi) == 0 &&
		"Failed to lift all the high half inputs to the low mask!");
		assert(std::count_if(HiMask.begin(), HiMask.end(), isLo) == 0 &&
		"Failed to lift all the low half inputs to the high mask!");

		// Do a half shuffle for the low mask.
		if (!isNoopShuffleMask(LoMask))
		V = DAG.getNode(X86ISD::PSHUFLW, DL, MVT::v8i16, V,
		getV4ShuffleImmForMask(LoMask, DAG));

		// Do a half shuffle with the high mask after shifting its values down.
		for (int &M : HiMask)
		if (M >= 0)
		M -= 4;
		if (!isNoopShuffleMask(HiMask))
		V = DAG.getNode(X86ISD::PSHUFHW, DL, MVT::v8i16, V,
		getV4ShuffleImmForMask(HiMask, DAG));

		return V;
		}
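
The sum trick that balanceSides uses in the function above to locate the two dwords to swap can be checked in isolation. This is a minimal sketch, assuming plain std::vector in place of LLVM's ArrayRef; the helper names dwordWithFreeSlot and dwordWithoutInput are hypothetical and do not appear in the patch.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper (not in the patch): given the three word indices used
// in one half of a v8i16 vector, return the dword holding the unused word.
// HalfSum is 0+1+2+3 for the low half and 4+5+6+7 for the high half.
int dwordWithFreeSlot(const std::vector<int> &ThreeInputs, int HalfSum) {
  assert(ThreeInputs.size() == 3 && "Expected exactly three inputs");
  int FreeWord =
      HalfSum - std::accumulate(ThreeInputs.begin(), ThreeInputs.end(), 0);
  return FreeWord / 2;
}

// Hypothetical helper (not in the patch): the dword of the one-input half
// that does not contain OneInput. HalfOffset is 0 for the low half and 4 for
// the high half.
int dwordWithoutInput(int OneInput, int HalfOffset) {
  return HalfOffset / 2 + (OneInput / 2 + 1) % 2;
}
```

With inputs {0, 1, 3} in the low half and a lone input 5 in the high half, the two helpers give dwords 1 and 3. Swapping those two dwords leaves two inputs in each half, which is the state the rest of the routine expects.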

		/// \brief Detect whether the mask pattern should be lowered through
		/// interleaving.
		///
		/// This essentially tests whether viewing the mask as an interleaving of two
		/// sub-sequences reduces the cross-input traffic of a blend operation. If so,
		/// lowering it through interleaving is a significantly better strategy.
		static bool shouldLowerAsInterleaving(ArrayRef<int> Mask) {
		int NumEvenInputs[2] = {0, 0};
		int NumOddInputs[2] = {0, 0};
		int NumLoInputs[2] = {0, 0};
		int NumHiInputs[2] = {0, 0};
		for (int i = 0, Size = Mask.size(); i < Size; ++i) {
		if (Mask[i] < 0)
		continue;

		int InputIdx = Mask[i] >= Size;

		if (i < Size / 2)
		++NumLoInputs[InputIdx];
		else
		++NumHiInputs[InputIdx];

		if ((i % 2) == 0)
		++NumEvenInputs[InputIdx];
		else
		++NumOddInputs[InputIdx];
		}

		// The minimum number of cross-input results for both the interleaved and
		// split cases. If interleaving results in fewer cross-input results, return
		// true.
		int InterleavedCrosses = std::min(NumEvenInputs[1] + NumOddInputs[0],
		NumEvenInputs[0] + NumOddInputs[1]);
		int SplitCrosses = std::min(NumLoInputs[1] + NumHiInputs[0],
		NumLoInputs[0] + NumHiInputs[1]);
		return InterleavedCrosses < SplitCrosses;
		}
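
To see how the heuristic above behaves on concrete masks, here is a standalone sketch of the same counting logic. It assumes std::vector instead of ArrayRef, and the name shouldInterleave is chosen for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Standalone model of the cross-input counting heuristic. Mask entries below
// zero are undef; entries >= Mask.size() select from the second input.
bool shouldInterleave(const std::vector<int> &Mask) {
  int Size = static_cast<int>(Mask.size());
  assert(Size % 2 == 0 && "Expected an even number of lanes");
  int NumEven[2] = {0, 0}, NumOdd[2] = {0, 0};
  int NumLo[2] = {0, 0}, NumHi[2] = {0, 0};
  for (int i = 0; i < Size; ++i) {
    if (Mask[i] < 0)
      continue;
    int InputIdx = Mask[i] >= Size ? 1 : 0;
    if (i < Size / 2)
      ++NumLo[InputIdx];
    else
      ++NumHi[InputIdx];
    if (i % 2 == 0)
      ++NumEven[InputIdx];
    else
      ++NumOdd[InputIdx];
  }
  // Fewest elements that must cross inputs under each decomposition.
  int InterleavedCrosses =
      std::min(NumEven[1] + NumOdd[0], NumEven[0] + NumOdd[1]);
  int SplitCrosses = std::min(NumLo[1] + NumHi[0], NumLo[0] + NumHi[1]);
  return InterleavedCrosses < SplitCrosses;
}
```

A mask such as {0, 8, 1, 9, 2, 10, 3, 11} alternates inputs lane by lane, so the interleaved view has no crossings and the function returns true. A mask such as {0, 1, 2, 3, 8, 9, 10, 11} keeps each input in its own half, so the split view wins and the function returns false.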

		/// \brief Blend two v8i16 vectors using a naive unpack strategy.
		///
		/// This strategy only works when the inputs from each vector fit into a single
		/// half of that vector, and generally there are not so many inputs as to leave
		/// the in-place shuffles required highly constrained (and thus expensive). It
		/// shifts all the inputs into a single side of both input vectors and then
		/// uses an unpack to interleave these inputs in a single vector. At that
		/// point, we will fall back on the generic single input shuffle lowering.
		static SDValue lowerV8I16BasicBlendVectorShuffle(SDLoc DL, SDValue V1,
		SDValue V2,
		MutableArrayRef<int> Mask,
		const X86Subtarget *Subtarget,
		SelectionDAG &DAG) {
		assert(V1.getSimpleValueType() == MVT::v8i16 && "Bad input type!");
		assert(V2.getSimpleValueType() == MVT::v8i16 && "Bad input type!");
		SmallVector<int, 3> LoV1Inputs, HiV1Inputs, LoV2Inputs, HiV2Inputs;
		for (int i = 0; i < 8; ++i)
		if (Mask[i] >= 0 && Mask[i] < 4)
		LoV1Inputs.push_back(i);
		else if (Mask[i] >= 4 && Mask[i] < 8)
		HiV1Inputs.push_back(i);
		else if (Mask[i] >= 8 && Mask[i] < 12)
		LoV2Inputs.push_back(i);
		else if (Mask[i] >= 12)
		HiV2Inputs.push_back(i);

		int NumV1Inputs = LoV1Inputs.size() + HiV1Inputs.size();
		int NumV2Inputs = LoV2Inputs.size() + HiV2Inputs.size();

		assert(NumV1Inputs > 0 && NumV1Inputs <= 3 && "At most 3 inputs supported");
		assert(NumV2Inputs > 0 && NumV2Inputs <= 3 && "At most 3 inputs supported");
		assert(NumV1Inputs + NumV2Inputs <= 4 && "At most 4 combined inputs");

		bool MergeFromLo = LoV1Inputs.size() + LoV2Inputs.size() >=
		HiV1Inputs.size() + HiV2Inputs.size();

		auto moveInputsToHalf = [&](SDValue V, ArrayRef<int> LoInputs,
		ArrayRef<int> HiInputs, bool MoveToLo,
		int MaskOffset) {
		ArrayRef<int> GoodInputs = MoveToLo ? LoInputs : HiInputs;
		ArrayRef<int> BadInputs = MoveToLo ? HiInputs : LoInputs;
		if (BadInputs.empty())
		return V;

		int MoveMask[] = {-1, -1, -1, -1, -1, -1, -1, -1};
		int MoveOffset = MoveToLo ? 0 : 4;

		if (GoodInputs.empty()) {
		for (int BadInput : BadInputs) {
		MoveMask[Mask[BadInput] % 4 + MoveOffset] = Mask[BadInput] - MaskOffset;
		Mask[BadInput] = Mask[BadInput] % 4 + MoveOffset + MaskOffset;
		}
		} else {
		if (GoodInputs.size() == 2) {
		// If the low inputs are spread across two dwords, pack them into
		// a single dword.
		MoveMask[Mask[GoodInputs[0]] % 2 + MoveOffset] =
		Mask[GoodInputs[0]] - MaskOffset;
		MoveMask[Mask[GoodInputs[1]] % 2 + MoveOffset] =
		Mask[GoodInputs[1]] - MaskOffset;
		Mask[GoodInputs[0]] = Mask[GoodInputs[0]] % 2 + MoveOffset + MaskOffset;
		filcab: NumV1Inputs == 0 && NumV2Inputs == 4 => assert in lowerV8I16BasicBlendVectorShuffle:7293. The previous if should have an else if (NumV1Inputs == 0) ...SingleInput...(...V2...). If this condition doesn't arise because it's been handled earlier, I'd like to have an assert here.
		chandlerc: Correct, we can't hit this because if there is only a single input, it is always V1. Adding assert here.
		Mask[GoodInputs[1]] = Mask[GoodInputs[0]] % 2 + MoveOffset + MaskOffset;
		} else {
		// Otherwise pin the low inputs.
		for (int GoodInput : GoodInputs)
		MoveMask[Mask[GoodInput]] = Mask[GoodInput] - MaskOffset;
		}

		int MoveMaskIdx =
		std::find(std::begin(MoveMask) + MoveOffset, std::end(MoveMask), -1) -
		std::begin(MoveMask);
		assert(MoveMaskIdx >= MoveOffset && "Established above");

		if (BadInputs.size() == 2) {
		assert(MoveMask[MoveMaskIdx] == -1 && "Expected empty slot");
		assert(MoveMask[MoveMaskIdx + 1] == -1 && "Expected empty slot");
		MoveMask[MoveMaskIdx + Mask[BadInputs[0]] % 2] =
		Mask[BadInputs[0]] - MaskOffset;
		MoveMask[MoveMaskIdx + Mask[BadInputs[1]] % 2] =
		Mask[BadInputs[1]] - MaskOffset;
		Mask[BadInputs[0]] = MoveMaskIdx + Mask[BadInputs[0]] % 2 + MaskOffset;
		Mask[BadInputs[1]] = MoveMaskIdx + Mask[BadInputs[1]] % 2 + MaskOffset;
		} else {
		assert(BadInputs.size() == 1 && "All sizes handled");
		MoveMask[MoveMaskIdx] = Mask[BadInputs[0]] - MaskOffset;
		Mask[BadInputs[0]] = MoveMaskIdx + MaskOffset;
		}
		}

		return DAG.getVectorShuffle(MVT::v8i16, DL, V, DAG.getUNDEF(MVT::v8i16),
		MoveMask);
		};
		V1 = moveInputsToHalf(V1, LoV1Inputs, HiV1Inputs, MergeFromLo,
		/*MaskOffset*/ 0);
		V2 = moveInputsToHalf(V2, LoV2Inputs, HiV2Inputs, MergeFromLo,
		/*MaskOffset*/ 8);

		// FIXME: Select an interleaving of the merge of V1 and V2 that minimizes
		// cross-half traffic in the final shuffle.

		// Munge the mask to be a single-input mask after the unpack merges the
		// results.
		for (int &M : Mask)
		if (M != -1)
		M = 2 * (M % 4) + (M / 8);

		return DAG.getVectorShuffle(
		MVT::v8i16, DL, DAG.getNode(MergeFromLo ? X86ISD::UNPCKL : X86ISD::UNPCKH,
		DL, MVT::v8i16, V1, V2),
		DAG.getUNDEF(MVT::v8i16), Mask);
		}
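
The mask rewrite M = 2 * (M % 4) + (M / 8) in the function above follows from how an unpack interleaves one half of each input. This is a small sketch that models the unpack on plain arrays so the remapping can be checked; the names unpack and remapForUnpack are hypothetical, and the lane model assumes the usual PUNPCKLWD/PUNPCKHWD behavior.

```cpp
#include <array>
#include <cassert>

using V8 = std::array<int, 8>;

// Model of an 8 x i16 unpack: interleave the low (Lo = true) or high
// (Lo = false) four lanes of A and B as A0, B0, A1, B1, ...
V8 unpack(const V8 &A, const V8 &B, bool Lo) {
  V8 R{};
  int Base = Lo ? 0 : 4;
  for (int i = 0; i < 4; ++i) {
    R[2 * i] = A[Base + i];
    R[2 * i + 1] = B[Base + i];
  }
  return R;
}

// Remap a two-input mask index (0..7 from A, 8..15 from B) to the lane that
// holds the same element after the unpack. Undef (-1) stays undef. This is
// only meaningful when all used elements sit in the half being unpacked.
int remapForUnpack(int M) {
  assert(M >= -1 && M < 16 && "Mask index out of range");
  return M < 0 ? -1 : 2 * (M % 4) + (M / 8);
}
```

For example, index 9 (element 1 of the second input) maps to lane 3 of the low unpack, and index 6 (element 6 of the first input) maps to lane 4 of the high unpack.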

		/// \brief Generic lowering of 8-lane i16 shuffles.
		///
		/// This handles both single-input shuffles and combined shuffle/blends with
		/// two inputs. The single input shuffles are immediately delegated to
		/// a dedicated lowering routine.
		///
		/// The blends are lowered in one of three fundamental ways. If there are few
		/// enough inputs, it delegates to a basic UNPCK-based strategy. If the shuffle
		/// of the input is significantly cheaper when lowered as an interleaving of
		/// the two inputs, try to interleave them. Otherwise, blend the low and high
		/// halves of the inputs separately (making them have relatively few inputs)
		/// and then concatenate them.
		static SDValue lowerV8I16VectorShuffle(SDValue Op, SDValue V1, SDValue V2,
		const X86Subtarget *Subtarget,
		SelectionDAG &DAG) {
		SDLoc DL(Op);
		assert(Op.getSimpleValueType() == MVT::v8i16 && "Bad shuffle type!");
		assert(V1.getSimpleValueType() == MVT::v8i16 && "Bad operand type!");
		assert(V2.getSimpleValueType() == MVT::v8i16 && "Bad operand type!");
		ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(Op);
		ArrayRef<int> OrigMask = SVOp->getMask();
		int MaskStorage[8] = {OrigMask[0], OrigMask[1], OrigMask[2], OrigMask[3],
		OrigMask[4], OrigMask[5], OrigMask[6], OrigMask[7]};
		MutableArrayRef<int> Mask(MaskStorage);

		assert(Mask.size() == 8 && "Unexpected mask size for v8 shuffle!");

		int Size = Mask.size();
		assert(Size == 8 && "Unexpected mask size for v8 shuffle!");

		auto isV1 = [](int M) { return M >= 0 && M < 8; };
		auto isV2 = [](int M) { return M >= 8; };

		int NumV1Inputs = std::count_if(Mask.begin(), Mask.end(), isV1);
		int NumV2Inputs = std::count_if(Mask.begin(), Mask.end(), isV2);

		if (NumV2Inputs == 0)
		return lowerV8I16SingleInputVectorShuffle(DL, V1, Mask, Subtarget, DAG);

		assert(NumV1Inputs > 0 && "All single-input shuffles should be canonicalized "
		"to be V1-input shuffles.");

		if (NumV1Inputs + NumV2Inputs <= 4)
		return lowerV8I16BasicBlendVectorShuffle(DL, V1, V2, Mask, Subtarget, DAG);

		// Check whether an interleaving lowering is likely to be more efficient.
		// This isn't perfect but it is a strong heuristic that tends to work well on
		// the kinds of shuffles that show up in practice.
		//
		// FIXME: Handle 1x, 2x, and 4x interleaving.
		if (shouldLowerAsInterleaving(Mask)) {
		// FIXME: Figure out whether we should pack these into the low or high
		// halves.

		int EMask[8], OMask[8];
		for (int i = 0; i < 4; ++i) {
		EMask[i] = Mask[2*i];
		OMask[i] = Mask[2*i + 1];
		EMask[i + 4] = -1;
		OMask[i + 4] = -1;
		}

		SDValue Evens = DAG.getVectorShuffle(MVT::v8i16, DL, V1, V2, EMask);
		SDValue Odds = DAG.getVectorShuffle(MVT::v8i16, DL, V1, V2, OMask);

		return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v8i16, Evens, Odds);
		}

		int LoBlendMask[8] = {-1, -1, -1, -1, -1, -1, -1, -1};
		int HiBlendMask[8] = {-1, -1, -1, -1, -1, -1, -1, -1};

		for (int i = 0; i < 4; ++i) {
		LoBlendMask[i] = Mask[i];
		HiBlendMask[i] = Mask[i + 4];
		}

		SDValue LoV = DAG.getVectorShuffle(MVT::v8i16, DL, V1, V2, LoBlendMask);
		SDValue HiV = DAG.getVectorShuffle(MVT::v8i16, DL, V1, V2, HiBlendMask);
		LoV = DAG.getNode(ISD::BITCAST, DL, MVT::v2i64, LoV);
		HiV = DAG.getNode(ISD::BITCAST, DL, MVT::v2i64, HiV);

		return DAG.getNode(ISD::BITCAST, DL, MVT::v8i16,
		DAG.getNode(X86ISD::UNPCKL, DL, MVT::v2i64, LoV, HiV));
		}

		/// \brief Generic lowering of v16i8 shuffles.
		///
		/// This is a hybrid strategy to lower v16i8 vectors. It first attempts to
		/// detect any complexity reducing interleaving. If that doesn't help, it uses
		/// UNPCK to spread the i8 elements across two i16-element vectors, and uses
		/// the existing lowering for v8i16 blends on each half, finally PACK-ing them
		/// back together.
		static SDValue lowerV16I8VectorShuffle(SDValue Op, SDValue V1, SDValue V2,
		const X86Subtarget *Subtarget,
		SelectionDAG &DAG) {
		SDLoc DL(Op);
		assert(Op.getSimpleValueType() == MVT::v16i8 && "Bad shuffle type!");
		assert(V1.getSimpleValueType() == MVT::v16i8 && "Bad operand type!");
		assert(V2.getSimpleValueType() == MVT::v16i8 && "Bad operand type!");
		ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(Op);
		ArrayRef<int> OrigMask = SVOp->getMask();
		assert(OrigMask.size() == 16 && "Unexpected mask size for v16 shuffle!");
		int MaskStorage[16] = {
		OrigMask[0], OrigMask[1], OrigMask[2], OrigMask[3],
		OrigMask[4], OrigMask[5], OrigMask[6], OrigMask[7],
		OrigMask[8], OrigMask[9], OrigMask[10], OrigMask[11],
		OrigMask[12], OrigMask[13], OrigMask[14], OrigMask[15]};
		MutableArrayRef<int> Mask(MaskStorage);
		MutableArrayRef<int> LoMask = Mask.slice(0, 8);
		MutableArrayRef<int> HiMask = Mask.slice(8, 8);

		// Check whether an interleaving lowering is likely to be more efficient.
		// This isn't perfect but it is a strong heuristic that tends to work well on
		// the kinds of shuffles that show up in practice.
		//
		// FIXME: We need to handle other interleaving widths (i16, i32, ...).
		if (shouldLowerAsInterleaving(Mask)) {
		// FIXME: Figure out whether we should pack these into the low or high
		// halves.

		int EMask[16], OMask[16];
		for (int i = 0; i < 8; ++i) {
		EMask[i] = Mask[2*i];
		OMask[i] = Mask[2*i + 1];
		EMask[i + 8] = -1;
		OMask[i + 8] = -1;
		grosbach: Only need to "try to collapse" once. ;)
		chandlerc: Done.
		}

		SDValue Evens = DAG.getVectorShuffle(MVT::v16i8, DL, V1, V2, EMask);
		SDValue Odds = DAG.getVectorShuffle(MVT::v16i8, DL, V1, V2, OMask);

		return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v16i8, Evens, Odds);
		}

		SDValue LoV1 = DAG.getNode(ISD::BITCAST, DL, MVT::v8i16,
		DAG.getNode(X86ISD::UNPCKL, DL, MVT::v16i8, V1,
		DAG.getUNDEF(MVT::v8i16)));
		SDValue HiV1 = DAG.getNode(ISD::BITCAST, DL, MVT::v8i16,
		DAG.getNode(X86ISD::UNPCKH, DL, MVT::v16i8, V1,
		DAG.getUNDEF(MVT::v8i16)));
		SDValue LoV2 = DAG.getNode(ISD::BITCAST, DL, MVT::v8i16,
		DAG.getNode(X86ISD::UNPCKL, DL, MVT::v16i8, V2,
		DAG.getUNDEF(MVT::v8i16)));
		SDValue HiV2 = DAG.getNode(ISD::BITCAST, DL, MVT::v8i16,
		DAG.getNode(X86ISD::UNPCKH, DL, MVT::v16i8, V2,
		DAG.getUNDEF(MVT::v8i16)));

		int V1LoBlendMask[8] = {-1, -1, -1, -1, -1, -1, -1, -1};
		int V1HiBlendMask[8] = {-1, -1, -1, -1, -1, -1, -1, -1};
		int V2LoBlendMask[8] = {-1, -1, -1, -1, -1, -1, -1, -1};
		int V2HiBlendMask[8] = {-1, -1, -1, -1, -1, -1, -1, -1};

		auto buildBlendMasks = [](MutableArrayRef<int> HalfMask,
		MutableArrayRef<int> V1HalfBlendMask,
		MutableArrayRef<int> V2HalfBlendMask) {
		for (int i = 0; i < 8; ++i)
		if (HalfMask[i] >= 0 && HalfMask[i] < 16) {
		V1HalfBlendMask[i] = HalfMask[i];
		HalfMask[i] = i;
		} else if (HalfMask[i] >= 16) {
		V2HalfBlendMask[i] = HalfMask[i] - 16;
		HalfMask[i] = i + 8;
		}
		};
		buildBlendMasks(LoMask, V1LoBlendMask, V2LoBlendMask);
		buildBlendMasks(HiMask, V1HiBlendMask, V2HiBlendMask);

		SDValue V1Lo = DAG.getVectorShuffle(MVT::v8i16, DL, LoV1, HiV1, V1LoBlendMask);
		SDValue V2Lo = DAG.getVectorShuffle(MVT::v8i16, DL, LoV2, HiV2, V2LoBlendMask);
		SDValue V1Hi = DAG.getVectorShuffle(MVT::v8i16, DL, LoV1, HiV1, V1HiBlendMask);
		SDValue V2Hi = DAG.getVectorShuffle(MVT::v8i16, DL, LoV2, HiV2, V2HiBlendMask);

		SDValue LoV = DAG.getVectorShuffle(MVT::v8i16, DL, V1Lo, V2Lo, LoMask);
		SDValue HiV = DAG.getVectorShuffle(MVT::v8i16, DL, V1Hi, V2Hi, HiMask);

		return DAG.getNode(X86ISD::PACKUS, DL, MVT::v16i8, LoV, HiV);
		}

		/// \brief Dispatching routine to lower various 128-bit x86 vector shuffles.
		///
		/// This routine breaks down the specific type of 128-bit shuffle and
		/// dispatches to the lowering routines accordingly.
		static SDValue lower128BitVectorShuffle(SDValue Op, SDValue V1, SDValue V2,
		MVT VT, const X86Subtarget *Subtarget,
		SelectionDAG &DAG) {
		switch (VT.SimpleTy) {
		case MVT::v2i64:
		return lowerV2I64VectorShuffle(Op, V1, V2, Subtarget, DAG);
		case MVT::v2f64:
		return lowerV2F64VectorShuffle(Op, V1, V2, Subtarget, DAG);
		case MVT::v4i32:
		return lowerV4I32VectorShuffle(Op, V1, V2, Subtarget, DAG);
		case MVT::v4f32:
		return lowerV4F32VectorShuffle(Op, V1, V2, Subtarget, DAG);
		case MVT::v8i16:
		return lowerV8I16VectorShuffle(Op, V1, V2, Subtarget, DAG);
		case MVT::v16i8:
		return lowerV16I8VectorShuffle(Op, V1, V2, Subtarget, DAG);

		default:
		llvm_unreachable("Unimplemented!");
		}
		}

		/// \brief Tiny helper function to test whether adjacent masks are sequential.
		static bool areAdjacentMasksSequential(ArrayRef<int> Mask) {
		for (int i = 0, Size = Mask.size(); i < Size; i += 2)
		if (Mask[i] + 1 != Mask[i+1])
		return false;

		return true;
		}
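
The helper above feeds the lane-widening step in lowerVectorShuffle, where each adjacent pair of mask entries becomes one entry of a mask over elements twice as wide. A minimal sketch of that idea on std::vector follows. Note one deliberate difference, which is an assumption of this sketch and not something the patch does: it also requires the first index of each pair to be even and non-negative, since a pair such as (1, 2) is sequential but straddles two wider elements. The name widenMask is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Try to rewrite Mask as a mask over elements twice as wide. On success,
// returns true and stores the widened mask in Wide; on failure, returns
// false and leaves Wide empty.
bool widenMask(const std::vector<int> &Mask, std::vector<int> &Wide) {
  assert(Mask.size() % 2 == 0 && "Expected an even number of lanes");
  Wide.clear();
  for (std::size_t i = 0; i < Mask.size(); i += 2) {
    // Sketch-only extra checks: the pair must start on an even index and
    // must not be undef, so that it selects exactly one wider element.
    if (Mask[i] < 0 || Mask[i] % 2 != 0 || Mask[i] + 1 != Mask[i + 1]) {
      Wide.clear();
      return false;
    }
    Wide.push_back(Mask[i] / 2);
  }
  return true;
}
```

Under this sketch, the v8i16 mask {0, 1, 4, 5, 2, 3, 6, 7} widens to the v4i32 mask {0, 2, 1, 3}, while {1, 2, 4, 5} is rejected.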

		/// \brief Top-level lowering for x86 vector shuffles.
		///
		/// This handles decomposition, canonicalization, and lowering of all x86
		/// vector shuffles. Most of the specific lowering strategies are encapsulated
		/// above in helper routines. The canonicalization attempts to widen shuffles
		/// to involve fewer lanes of wider elements, consolidate symmetric patterns
		/// s.t. only one of the two inputs needs to be tested, etc.
		static SDValue lowerVectorShuffle(SDValue Op, const X86Subtarget *Subtarget,
		SelectionDAG &DAG) {
		ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(Op);
		ArrayRef<int> Mask = SVOp->getMask();
		SDValue V1 = Op.getOperand(0);
		SDValue V2 = Op.getOperand(1);
		MVT VT = Op.getSimpleValueType();
		int NumElements = VT.getVectorNumElements();
		SDLoc dl(Op);

		assert(VT.getSizeInBits() != 64 && "Can't lower MMX shuffles");

		bool V1IsUndef = V1.getOpcode() == ISD::UNDEF;
		bool V2IsUndef = V2.getOpcode() == ISD::UNDEF;
		if (V1IsUndef && V2IsUndef)
		return DAG.getUNDEF(VT);

		// When we create a shuffle node we put the UNDEF node to second operand,
		// but in some cases the first operand may be transformed to UNDEF.
		// In this case we should just commute the node.
		if (V1IsUndef)
		return CommuteVectorShuffle(SVOp, DAG);

		// Check for non-undef masks pointing at an undef vector and make the masks
		// undef as well. This makes it easier to match the shuffle based solely on
		// the mask.
		if (V2IsUndef)
		for (int M : Mask)
		if (M >= NumElements) {
		SmallVector<int, 8> NewMask(Mask.begin(), Mask.end());
		for (int &M : NewMask)
		if (M >= NumElements)
		M = -1;
		return DAG.getVectorShuffle(VT, dl, V1, V2, NewMask);
		}

		// For integer vector shuffles, try to collapse them into a shuffle of fewer
		// lanes but wider integers. We cap this to not form integers larger than i64
		// but it might be interesting to form i128 integers to handle flipping the
		// low and high halves of AVX 256-bit vectors.
		if (VT.isInteger() && VT.getScalarSizeInBits() < 64 &&
		areAdjacentMasksSequential(Mask)) {
		SmallVector<int, 8> NewMask;
		for (int i = 0, Size = Mask.size(); i < Size; i += 2)
		NewMask.push_back(Mask[i] / 2);
		MVT NewVT =
		MVT::getVectorVT(MVT::getIntegerVT(VT.getScalarSizeInBits() * 2),
		VT.getVectorNumElements() / 2);
		V1 = DAG.getNode(ISD::BITCAST, dl, NewVT, V1);
		V2 = DAG.getNode(ISD::BITCAST, dl, NewVT, V2);
		return DAG.getNode(ISD::BITCAST, dl, VT,
		DAG.getVectorShuffle(NewVT, dl, V1, V2, NewMask));
		}

		int NumV1Elements = 0, NumUndefElements = 0, NumV2Elements = 0;
		for (int M : SVOp->getMask())
		if (M < 0)
		++NumUndefElements;
		else if (M < NumElements)
		++NumV1Elements;
		else
		++NumV2Elements;

		// Commute the shuffle as needed such that more elements come from V1 than
		// V2. This allows us to match the shuffle pattern strictly on how many
		// elements come from V1 without handling the symmetric cases.
		if (NumV2Elements > NumV1Elements)
		return CommuteVectorShuffle(SVOp, DAG);

		// When the number of V1 and V2 elements are the same, try to minimize the
		// number of uses of V2 in the low half of the vector.
		if (NumV1Elements == NumV2Elements) {
		int LowV1Elements = 0, LowV2Elements = 0;
		for (int M : SVOp->getMask().slice(0, NumElements / 2))
		if (M >= NumElements)
		++LowV2Elements;
		else if (M >= 0)
		++LowV1Elements;
		if (LowV2Elements > LowV1Elements)
		return CommuteVectorShuffle(SVOp, DAG);
		}

		// For each vector width, delegate to a specialized lowering routine.
		if (VT.getSizeInBits() == 128)
		return lower128BitVectorShuffle(Op, V1, V2, VT, Subtarget, DAG);

		llvm_unreachable("Unimplemented!");
		}
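
The commute decisions in lowerVectorShuffle above can be sketched without SelectionDAG types. The sketch assumes that commuting a shuffle swaps the two operands and moves every mask index to the other input; CommuteVectorShuffle itself is not shown in this section, so treat commuteMask as a model of its effect on the mask, not its implementation. Both function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Decide whether the shuffle should be commuted so that V1 supplies at least
// as many elements as V2, breaking ties by preferring V1 in the low half.
bool shouldCommute(const std::vector<int> &Mask) {
  int N = static_cast<int>(Mask.size());
  int NumV1 = 0, NumV2 = 0;
  for (int M : Mask) {
    if (M < 0)
      continue; // undef lane
    if (M < N)
      ++NumV1;
    else
      ++NumV2;
  }
  if (NumV2 > NumV1)
    return true;
  if (NumV1 != NumV2)
    return false;
  int LowV1 = 0, LowV2 = 0;
  for (int i = 0; i < N / 2; ++i) {
    if (Mask[i] >= N)
      ++LowV2;
    else if (Mask[i] >= 0)
      ++LowV1;
  }
  return LowV2 > LowV1;
}

// Model of what swapping the operands does to the mask: indices into V1 now
// point into V2 and vice versa. Undef lanes are left alone.
std::vector<int> commuteMask(std::vector<int> Mask) {
  int N = static_cast<int>(Mask.size());
  for (int &M : Mask)
    if (M >= 0)
      M = M < N ? M + N : M - N;
  return Mask;
}
```

For a four-lane mask {8, 9, 10, 0} this sketch treats lanes as indices into an eight-element pair only if N is 8, so the checks below use four-lane masks with indices 0..7: {4, 5, 6, 0} is commuted to {0, 1, 2, 4}.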


		//===----------------------------------------------------------------------===//
		// Legacy vector shuffle lowering
		//
		// This code is the legacy code handling vector shuffles until the above
		// replaces its functionality and performance.
		//===----------------------------------------------------------------------===//

		static bool isBlendMask(ArrayRef<int> MaskVals, MVT VT, bool hasSSE41,
		bool hasInt256, unsigned *MaskOut = nullptr) {
		MVT EltVT = VT.getVectorElementType();

		// There is no blend with immediate in AVX-512.
		if (VT.is512BitVector())
		return false;

		[... 1,236 lines not shown ...]
		X86TargetLowering::LowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG) const {
		bool V2IsSplat = false;
		bool HasSSE2 = Subtarget->hasSSE2();
		bool HasFp256 = Subtarget->hasFp256();
		bool HasInt256 = Subtarget->hasInt256();
		MachineFunction &MF = DAG.getMachineFunction();
		bool OptForSize = MF.getFunction()->getAttributes().
		hasAttribute(AttributeSet::FunctionIndex, Attribute::OptimizeForSize);

		// Check if we should use the experimental vector shuffle lowering. If so,
		// delegate completely to that code path.
		if (ExperimentalVectorShuffleLowering)
		return lowerVectorShuffle(Op, Subtarget, DAG);

		assert(VT.getSizeInBits() != 64 && "Can't lower MMX shuffles");

		if (V1IsUndef && V2IsUndef)
		return DAG.getUNDEF(VT);

		// When we create a shuffle node we put the UNDEF node to second operand,
		// but in some cases the first operand may be transformed to UNDEF.
		// In this case we should just commute the node.
		[... 13,882 lines not shown ...]

test/CodeGen/X86/vector-shuffle-128-v16.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -x86-experimental-vector-shuffle-lowering | FileCheck %s --check-prefix=CHECK-SSE2

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-unknown"

				define <16 x i8> @shuffle_v16i8_0101010101010101(<16 x i8> %a, <16 x i8> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v16i8_0101010101010101
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm0[0,1,0,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,0,0,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,4,4,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <16 x i8> %a, <16 x i8> %b, <16 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>
				ret <16 x i8> %shuffle
				}

				define <16 x i8> @shuffle_v16i8_00_16_01_17_02_18_03_19_04_20_05_21_06_22_07_23(<16 x i8> %a, <16 x i8> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v16i8_00_16_01_17_02_18_03_19_04_20_05_21_06_22_07_23
				; CHECK-SSE2: punpcklbw %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <16 x i8> %a, <16 x i8> %b, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
				ret <16 x i8> %shuffle
				}

				define <16 x i8> @shuffle_v16i8_16_00_16_01_16_02_16_03_16_04_16_05_16_06_16_07(<16 x i8> %a, <16 x i8> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v16i8_16_00_16_01_16_02_16_03_16_04_16_05_16_06_16_07
				; CHECK-SSE2: punpcklbw %xmm0, %xmm1
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm1 = xmm1[0,1,0,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm1 = xmm1[0,0,0,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm1 = xmm1[0,1,2,3,4,4,4,4]
				; CHECK-SSE2-NEXT: packuswb %xmm0, %xmm1
				; CHECK-SSE2-NEXT: punpcklbw %xmm0, %xmm1
				; CHECK-SSE2-NEXT: movdqa %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <16 x i8> %a, <16 x i8> %b, <16 x i32> <i32 16, i32 0, i32 16, i32 1, i32 16, i32 2, i32 16, i32 3, i32 16, i32 4, i32 16, i32 5, i32 16, i32 6, i32 16, i32 7>
				ret <16 x i8> %shuffle
				}

				define <16 x i8> @shuffle_v16i8_03_02_01_00_07_06_05_04_11_10_09_08_15_14_13_12(<16 x i8> %a, <16 x i8> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v16i8_03_02_01_00_07_06_05_04_11_10_09_08_15_14_13_12
				; CHECK-SSE2: movdqa %xmm0, %xmm1
				; CHECK-SSE2-NEXT: punpckhbw %xmm0, %xmm1
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm1 = xmm1[3,2,1,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm1 = xmm1[0,1,2,3,7,6,5,4]
				; CHECK-SSE2-NEXT: punpcklbw %xmm0, %xmm0
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[3,2,1,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,7,6,5,4]
				; CHECK-SSE2-NEXT: packuswb %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <16 x i8> %a, <16 x i8> %b, <16 x i32> <i32 3, i32 2, i32 1, i32 0, i32 7, i32 6, i32 5, i32 4, i32 11, i32 10, i32 9, i32 8, i32 15, i32 14, i32 13, i32 12>
				ret <16 x i8> %shuffle
				}

				define <16 x i8> @shuffle_v16i8_03_02_01_00_07_06_05_04_19_18_17_16_23_22_21_20(<16 x i8> %a, <16 x i8> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v16i8_03_02_01_00_07_06_05_04_19_18_17_16_23_22_21_20
				; CHECK-SSE2: punpcklbw %xmm0, %xmm1
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm1 = xmm1[3,2,1,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm1 = xmm1[0,1,2,3,7,6,5,4]
				; CHECK-SSE2-NEXT: punpcklbw %xmm0, %xmm0
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[3,2,1,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,7,6,5,4]
				; CHECK-SSE2-NEXT: packuswb %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <16 x i8> %a, <16 x i8> %b, <16 x i32> <i32 3, i32 2, i32 1, i32 0, i32 7, i32 6, i32 5, i32 4, i32 19, i32 18, i32 17, i32 16, i32 23, i32 22, i32 21, i32 20>
				ret <16 x i8> %shuffle
				}

				define <16 x i8> @shuffle_v16i8_03_02_01_00_31_30_29_28_11_10_09_08_23_22_21_20(<16 x i8> %a, <16 x i8> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v16i8_03_02_01_00_31_30_29_28_11_10_09_08_23_22_21_20
				; CHECK-SSE2: movdqa %xmm1, %xmm2
				; CHECK-SSE2-NEXT: punpcklbw %xmm0, %xmm2
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm2 = xmm2[0,1,2,3,7,6,5,4]
				; CHECK-SSE2-NEXT: movdqa %xmm0, %xmm3
				; CHECK-SSE2-NEXT: punpckhbw %xmm0, %xmm3
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm3 = xmm3[3,2,1,0,4,5,6,7]
				; CHECK-SSE2-NEXT: shufpd {{.*}} # xmm3 = xmm3[0],xmm2[1]
				; CHECK-SSE2-NEXT: punpckhbw %xmm0, %xmm1
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm1 = xmm1[0,1,2,3,7,6,5,4]
				; CHECK-SSE2-NEXT: punpcklbw %xmm0, %xmm0
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[3,2,1,0,4,5,6,7]
				; CHECK-SSE2-NEXT: shufpd {{.*}} # xmm0 = xmm0[0],xmm1[1]
				; CHECK-SSE2-NEXT: packuswb %xmm3, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <16 x i8> %a, <16 x i8> %b, <16 x i32> <i32 3, i32 2, i32 1, i32 0, i32 31, i32 30, i32 29, i32 28, i32 11, i32 10, i32 9, i32 8, i32 23, i32 22, i32 21, i32 20>
				ret <16 x i8> %shuffle
				}
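
The three-instruction sequence checked for `@shuffle_v16i8_0101010101010101` can be traced by hand with a small model. The Python sketch below is only an illustration: `permute_lanes` and `lower_splat_0101` are hypothetical helper names, not LLVM functions, and the model assumes the bracketed indices in the assembly comments select source lanes in order (32-bit lanes for `pshufd`, 16-bit lanes for `pshuflw` and `pshufhw`).

```python
# Illustrative model of the checked sequence; helper names are hypothetical.

def permute_lanes(reg, sel, lane_bytes):
    """Return a copy of the 16-byte register with lanes reordered by sel.

    reg is a list of 16 byte values; sel lists source lane indices, one per
    destination lane, as printed in the assembly comments.
    """
    out = []
    for i in sel:
        out.extend(reg[i * lane_bytes:(i + 1) * lane_bytes])
    return out


def lower_splat_0101(a):
    """Apply the pshufd / pshuflw / pshufhw sequence from the CHECK lines."""
    t = permute_lanes(a, [0, 1, 0, 3], 4)              # pshufd  xmm0[0,1,0,3]
    t = permute_lanes(t, [0, 0, 0, 0, 4, 5, 6, 7], 2)  # pshuflw xmm0[0,0,0,0,4,5,6,7]
    t = permute_lanes(t, [0, 1, 2, 3, 4, 4, 4, 4], 2)  # pshufhw xmm0[0,1,2,3,4,4,4,4]
    return t


a = list(range(16))                   # byte i of %a holds the value i
mask = [0, 1] * 8                     # the shufflevector mask in the test
expected = [a[i] for i in mask]       # what the IR asks for
result = lower_splat_0101(a)
print(result == expected)
```

Under these assumptions, the `pshufd` copies the low dword into the third dword so that the later `pshufhw` can broadcast it within the high half.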

test/CodeGen/X86/vector-shuffle-128-v2.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -x86-experimental-vector-shuffle-lowering | FileCheck %s --check-prefix=CHECK-SSE2

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-unknown"

				define <2 x i64> @shuffle_v2i64_00(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_00
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm0[0,1,0,1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 0, i32 0>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_10(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_10
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm0[2,3,0,1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 1, i32 0>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_11(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_11
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm0[2,3,2,3]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 1, i32 1>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_22(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_22
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm1[0,1,0,1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 2, i32 2>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_32(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_32
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm1[2,3,0,1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 3, i32 2>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_33(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_33
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm1[2,3,2,3]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 3, i32 3>
				ret <2 x i64> %shuffle
				}

				define <2 x double> @shuffle_v2f64_00(<2 x double> %a, <2 x double> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2f64_00
				; CHECK-SSE2: shufpd {{.*}} # xmm0 = xmm0[0,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x double> %a, <2 x double> %b, <2 x i32> <i32 0, i32 0>
				ret <2 x double> %shuffle
				}
				define <2 x double> @shuffle_v2f64_10(<2 x double> %a, <2 x double> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2f64_10
				; CHECK-SSE2: shufpd {{.*}} # xmm0 = xmm0[1,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x double> %a, <2 x double> %b, <2 x i32> <i32 1, i32 0>
				ret <2 x double> %shuffle
				}
				define <2 x double> @shuffle_v2f64_11(<2 x double> %a, <2 x double> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2f64_11
				; CHECK-SSE2: shufpd {{.*}} # xmm0 = xmm0[1,1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x double> %a, <2 x double> %b, <2 x i32> <i32 1, i32 1>
				ret <2 x double> %shuffle
				}
				define <2 x double> @shuffle_v2f64_22(<2 x double> %a, <2 x double> %b) {
				; FIXME: Should these use movapd + shufpd to remove a domain change at the cost
				; of a mov?
				;
				; CHECK-SSE2-LABEL: @shuffle_v2f64_22
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm1[0,1,0,1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x double> %a, <2 x double> %b, <2 x i32> <i32 2, i32 2>
				ret <2 x double> %shuffle
				}
				define <2 x double> @shuffle_v2f64_32(<2 x double> %a, <2 x double> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2f64_32
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm1[2,3,0,1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x double> %a, <2 x double> %b, <2 x i32> <i32 3, i32 2>
				ret <2 x double> %shuffle
				}
				define <2 x double> @shuffle_v2f64_33(<2 x double> %a, <2 x double> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2f64_33
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm1[2,3,2,3]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x double> %a, <2 x double> %b, <2 x i32> <i32 3, i32 3>
				ret <2 x double> %shuffle
				}


				define <2 x i64> @shuffle_v2i64_02(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_02
				; CHECK-SSE2: shufpd {{.*}} # xmm0 = xmm0[0],xmm1[0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 0, i32 2>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_02_copy(<2 x i64> %nonce, <2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_02_copy
				; CHECK-SSE2: shufpd {{.*}} # xmm1 = xmm1[0],xmm2[0]
				; CHECK-SSE2-NEXT: movapd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 0, i32 2>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_03(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_03
				; CHECK-SSE2: shufpd {{.*}} # xmm0 = xmm0[0],xmm1[1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 0, i32 3>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_03_copy(<2 x i64> %nonce, <2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_03_copy
				; CHECK-SSE2: shufpd {{.*}} # xmm1 = xmm1[0],xmm2[1]
				; CHECK-SSE2-NEXT: movapd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 0, i32 3>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_12(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_12
				; CHECK-SSE2: shufpd {{.*}} # xmm0 = xmm0[1],xmm1[0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 1, i32 2>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_12_copy(<2 x i64> %nonce, <2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_12_copy
				; CHECK-SSE2: shufpd {{.*}} # xmm1 = xmm1[1],xmm2[0]
				; CHECK-SSE2-NEXT: movapd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 1, i32 2>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_13(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_13
				; CHECK-SSE2: shufpd {{.*}} # xmm0 = xmm0[1],xmm1[1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 1, i32 3>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_13_copy(<2 x i64> %nonce, <2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_13_copy
				; CHECK-SSE2: shufpd {{.*}} # xmm1 = xmm1[1],xmm2[1]
				; CHECK-SSE2-NEXT: movapd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 1, i32 3>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_20(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_20
				; CHECK-SSE2: shufpd {{.*}} # xmm1 = xmm1[0],xmm0[0]
				; CHECK-SSE2-NEXT: movapd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 2, i32 0>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_20_copy(<2 x i64> %nonce, <2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_20_copy
				; CHECK-SSE2: shufpd {{.*}} # xmm2 = xmm2[0],xmm1[0]
				; CHECK-SSE2-NEXT: movapd %xmm2, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 2, i32 0>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_21(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_21
				; CHECK-SSE2: shufpd {{.*}} # xmm1 = xmm1[0],xmm0[1]
				; CHECK-SSE2-NEXT: movapd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 2, i32 1>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_21_copy(<2 x i64> %nonce, <2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_21_copy
				; CHECK-SSE2: shufpd {{.*}} # xmm2 = xmm2[0],xmm1[1]
				; CHECK-SSE2-NEXT: movapd %xmm2, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 2, i32 1>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_30(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_30
				; CHECK-SSE2: shufpd {{.*}} # xmm1 = xmm1[1],xmm0[0]
				; CHECK-SSE2-NEXT: movapd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 3, i32 0>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_30_copy(<2 x i64> %nonce, <2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_30_copy
				; CHECK-SSE2: shufpd {{.*}} # xmm2 = xmm2[1],xmm1[0]
				; CHECK-SSE2-NEXT: movapd %xmm2, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 3, i32 0>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_31(<2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_31
				; CHECK-SSE2: shufpd {{.*}} # xmm1 = xmm1[1],xmm0[1]
				; CHECK-SSE2-NEXT: movapd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 3, i32 1>
				ret <2 x i64> %shuffle
				}
				define <2 x i64> @shuffle_v2i64_31_copy(<2 x i64> %nonce, <2 x i64> %a, <2 x i64> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v2i64_31_copy
				; CHECK-SSE2: shufpd {{.*}} # xmm2 = xmm2[1],xmm1[1]
				; CHECK-SSE2-NEXT: movapd %xmm2, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 3, i32 1>
				ret <2 x i64> %shuffle
				}
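
The two-input `v2i64` checks above can be cross-checked against the IR masks with a small model. This Python sketch is an illustration, not the lowering code from the patch: `shufflevector`, `shufpd`, and the `CASES` table are hypothetical names, and the table is simply transcribed from the non-`_copy` CHECK lines, with `a` standing for `%xmm0` and `b` for `%xmm1`.

```python
# Illustrative model; names are hypothetical and the table is transcribed
# from the CHECK lines of the non-_copy v2i64 tests.

def shufflevector(a, b, mask):
    """IR semantics: index into the concatenation of both inputs."""
    both = a + b
    return [both[i] for i in mask]


def shufpd(dst, src, i, j):
    """Model of two-operand SHUFPD: low lane is dst[i], high lane is src[j]."""
    return [dst[i], src[j]]


# IR mask -> (first operand, lane, second operand, lane) as printed in the
# assembly comments, e.g. "xmm1 = xmm1[0],xmm0[1]" for mask <2, 1>.
CASES = {
    (0, 2): ("a", 0, "b", 0),
    (0, 3): ("a", 0, "b", 1),
    (1, 2): ("a", 1, "b", 0),
    (1, 3): ("a", 1, "b", 1),
    (2, 0): ("b", 0, "a", 0),
    (2, 1): ("b", 0, "a", 1),
    (3, 0): ("b", 1, "a", 0),
    (3, 1): ("b", 1, "a", 1),
}


def check_all():
    """Return True if every checked shufpd matches its shufflevector mask."""
    regs = {"a": ["a0", "a1"], "b": ["b0", "b1"]}
    for mask, (r1, i, r2, j) in CASES.items():
        want = shufflevector(regs["a"], regs["b"], list(mask))
        got = shufpd(regs[r1], regs[r2], i, j)
        if want != got:
            return False
    return True


print(check_all())
```

When the low lane comes from `%b`, the checks use `%xmm1` as the destination operand, which is why those tests end with a `movapd` into `%xmm0`.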

test/CodeGen/X86/vector-shuffle-128-v4.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -x86-experimental-vector-shuffle-lowering | FileCheck %s --check-prefix=CHECK-SSE2

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-unknown"

				define <4 x i32> @shuffle_v4i32_0001(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_0001
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm0[0,0,0,1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 0, i32 0, i32 1>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_0020(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_0020
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm0[0,0,2,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 0, i32 2, i32 0>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_0300(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_0300
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm0[0,3,0,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 3, i32 0, i32 0>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_1000(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_1000
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm0[1,0,0,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 1, i32 0, i32 0, i32 0>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_2200(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_2200
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm0[2,2,0,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 2, i32 2, i32 0, i32 0>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_3330(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_3330
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm0[3,3,3,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 3, i32 3, i32 3, i32 0>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_3210(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_3210
				; CHECK-SSE2: pshufd {{.*}} # xmm0 = xmm0[3,2,1,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
				ret <4 x i32> %shuffle
				}

				define <4 x float> @shuffle_v4f32_0001(<4 x float> %a, <4 x float> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4f32_0001
				; CHECK-SSE2: shufps {{.*}} # xmm0 = xmm0[0,0,0,1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 0, i32 0, i32 1>
				ret <4 x float> %shuffle
				}
				define <4 x float> @shuffle_v4f32_0020(<4 x float> %a, <4 x float> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4f32_0020
				; CHECK-SSE2: shufps {{.*}} # xmm0 = xmm0[0,0,2,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 0, i32 2, i32 0>
				ret <4 x float> %shuffle
				}
				define <4 x float> @shuffle_v4f32_0300(<4 x float> %a, <4 x float> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4f32_0300
				; CHECK-SSE2: shufps {{.*}} # xmm0 = xmm0[0,3,0,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 3, i32 0, i32 0>
				ret <4 x float> %shuffle
				}
				define <4 x float> @shuffle_v4f32_1000(<4 x float> %a, <4 x float> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4f32_1000
				; CHECK-SSE2: shufps {{.*}} # xmm0 = xmm0[1,0,0,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 1, i32 0, i32 0, i32 0>
				ret <4 x float> %shuffle
				}
				define <4 x float> @shuffle_v4f32_2200(<4 x float> %a, <4 x float> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4f32_2200
				; CHECK-SSE2: shufps {{.*}} # xmm0 = xmm0[2,2,0,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 2, i32 2, i32 0, i32 0>
				ret <4 x float> %shuffle
				}
				define <4 x float> @shuffle_v4f32_3330(<4 x float> %a, <4 x float> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4f32_3330
				; CHECK-SSE2: shufps {{.*}} # xmm0 = xmm0[3,3,3,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 3, i32 3, i32 3, i32 0>
				ret <4 x float> %shuffle
				}
				define <4 x float> @shuffle_v4f32_3210(<4 x float> %a, <4 x float> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4f32_3210
				; CHECK-SSE2: shufps {{.*}} # xmm0 = xmm0[3,2,1,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
				ret <4 x float> %shuffle
				}

				define <4 x i32> @shuffle_v4i32_0124(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_0124
				; CHECK-SSE2: shufps {{.*}} # xmm1 = xmm1[0,0],xmm0[2,0]
				; CHECK-SSE2-NEXT: shufps {{.*}} # xmm0 = xmm0[0,1],xmm1[2,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_0142(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_0142
				; CHECK-SSE2: shufps {{.*}} # xmm1 = xmm1[0,0],xmm0[2,0]
				; CHECK-SSE2-NEXT: shufps {{.*}} # xmm0 = xmm0[0,1],xmm1[0,2]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 1, i32 4, i32 2>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_0412(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_0412
				; CHECK-SSE2: shufps {{.*}} # xmm1 = xmm1[0,0],xmm0[0,0]
				; CHECK-SSE2-NEXT: shufps {{.*}} # xmm1 = xmm1[2,0],xmm0[1,2]
				; CHECK-SSE2-NEXT: movaps %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 4, i32 1, i32 2>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_4012(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_4012
				; CHECK-SSE2: shufps {{.*}} # xmm1 = xmm1[0,0],xmm0[0,0]
				; CHECK-SSE2-NEXT: shufps {{.*}} # xmm1 = xmm1[0,2],xmm0[1,2]
				; CHECK-SSE2-NEXT: movaps %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 4, i32 0, i32 1, i32 2>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_0145(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_0145
				; CHECK-SSE2: shufpd {{.*}} # xmm0 = xmm0[0],xmm1[0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_0451(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_0451
				; CHECK-SSE2: movaps %xmm0, %xmm2
				; CHECK-SSE2-NEXT: shufps {{.*}} # xmm2 = xmm2[0,1],xmm1[0,1]
				; FIXME: This is wrong!!! xmm0 = xmm2[0,2],xmm2[3,1] would be correct....
				; CHECK-SSE2-NEXT: shufps {{.*}} # xmm0 = xmm0[0,2],xmm2[3,1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 4, i32 5, i32 1>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_4501(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_4501
				; CHECK-SSE2: shufpd {{.*}} # xmm1 = xmm1[0],xmm0[0]
				; CHECK-SSE2-NEXT: movapd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 4, i32 5, i32 0, i32 1>
				ret <4 x i32> %shuffle
				}
				define <4 x i32> @shuffle_v4i32_4015(<4 x i32> %a, <4 x i32> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v4i32_4015
				; CHECK-SSE2: movaps %xmm0, %xmm2
				; CHECK-SSE2-NEXT: shufps {{.*}} # xmm2 = xmm2[0,1],xmm1[0,1]
				; FIXME: This is wrong!!! xmm0 = xmm2[2,0],xmm2[1,3] would be correct....
				; CHECK-SSE2-NEXT: shufps {{.*}} # xmm0 = xmm0[2,0],xmm2[1,3]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 4, i32 0, i32 1, i32 5>
				ret <4 x i32> %shuffle
				}
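
The two FIXME notes in `@shuffle_v4i32_0451` and `@shuffle_v4i32_4015` can be checked with a small model of `shufps`. The Python sketch below is an illustration under stated assumptions: `shufflevector` and `shufps` are hypothetical helper names, and the model takes the first bracketed pair in the assembly comment as lanes of the first operand and the second pair as lanes of the second operand. By this model, both checked sequences read lane 2 of `%a` where a lane of `%b` is needed, and for the `4015` mask the selection that matches the IR is `xmm2[2,0],xmm2[1,3]`.

```python
# Illustrative model; helper names are hypothetical, not LLVM APIs.

def shufflevector(a, b, mask):
    """IR semantics: index into the concatenation of both inputs."""
    both = a + b
    return [both[i] for i in mask]


def shufps(first, second, sel_first, sel_second):
    """Model of SHUFPS: two lanes picked from each operand, in order."""
    return [first[i] for i in sel_first] + [second[i] for i in sel_second]


a = ["a0", "a1", "a2", "a3"]          # %xmm0 on entry
b = ["b0", "b1", "b2", "b3"]          # %xmm1 on entry

# movaps %xmm0, %xmm2 ; shufps xmm2 = xmm2[0,1],xmm1[0,1]
xmm2 = shufps(a, b, [0, 1], [0, 1])   # [a0, a1, b0, b1]

# @shuffle_v4i32_0451
want_0451 = shufflevector(a, b, [0, 4, 5, 1])
checked_0451 = shufps(a, xmm2, [0, 2], [3, 1])     # sequence in the CHECK line
fixed_0451 = shufps(xmm2, xmm2, [0, 2], [3, 1])    # selection named in the FIXME

# @shuffle_v4i32_4015
want_4015 = shufflevector(a, b, [4, 0, 1, 5])
checked_4015 = shufps(a, xmm2, [2, 0], [1, 3])     # sequence in the CHECK line
fixed_4015 = shufps(xmm2, xmm2, [2, 0], [1, 3])    # selection matching this mask

print(checked_0451 == want_0451, fixed_0451 == want_0451)
print(checked_4015 == want_4015, fixed_4015 == want_4015)
```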

test/CodeGen/X86/vector-shuffle-128-v8.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -x86-experimental-vector-shuffle-lowering | FileCheck %s --check-prefix=CHECK-SSE2

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-unknown"

				define <8 x i16> @shuffle_v8i16_01012323(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_01012323
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,0,1,1]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 2, i32 3, i32 2, i32 3>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_67452301(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_67452301
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[3,2,1,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 6, i32 7, i32 4, i32 5, i32 2, i32 3, i32 0, i32 1>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_456789AB(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_456789AB
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2: shufpd {{.*}} # xmm0 = xmm0[1],xmm1[0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_00000000(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_00000000
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,1,0,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,0,0,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,4,4,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_00004444(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_00004444
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,0,0,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,4,4,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 4, i32 4, i32 4, i32 4>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_31206745(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_31206745
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[3,1,2,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,6,7,4,5]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 3, i32 1, i32 2, i32 0, i32 6, i32 7, i32 4, i32 5>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_44440000(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_44440000
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[2,1,0,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,0,0,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,4,4,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 4, i32 4, i32 4, i32 4, i32 0, i32 0, i32 0, i32 0>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_75643120(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_75643120
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[2,3,0,1]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[3,1,2,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,7,5,6,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 7, i32 5, i32 6, i32 4, i32 3, i32 1, i32 2, i32 0>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_10545410(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_10545410
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,0]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[1,0,3,2,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,5,4,7,6]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 1, i32 0, i32 5, i32 4, i32 5, i32 4, i32 1, i32 0>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_54105410(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_54105410
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,0]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[3,2,1,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,5,4,7,6]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 5, i32 4, i32 1, i32 0, i32 5, i32 4, i32 1, i32 0>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_54101054(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_54101054
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,0]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[3,2,1,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,7,6,5,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 5, i32 4, i32 1, i32 0, i32 1, i32 0, i32 5, i32 4>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_04400440(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_04400440
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,0]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,2,2,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,6,4,4,6]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 4, i32 4, i32 0, i32 0, i32 4, i32 4, i32 0>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_40044004(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_40044004
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,0]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[2,0,0,2,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,6,6,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 4, i32 0, i32 0, i32 4, i32 4, i32 0, i32 0, i32 4>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_26405173(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_26405173
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,2,1,3,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,7,5,4,6]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,3,2,1]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[1,3,2,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,5,6,4,7]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 2, i32 6, i32 4, i32 0, i32 5, i32 1, i32 7, i32 3>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_20645173(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_20645173
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,2,1,3,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,7,5,4,6]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,3,2,1]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[1,0,3,2,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,5,6,4,7]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 2, i32 0, i32 6, i32 4, i32 5, i32 1, i32 7, i32 3>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_26401375(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_26401375
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,2,1,3,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,7,5,4,6]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,3,2,1]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[1,3,2,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,6,7,4,5]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 2, i32 6, i32 4, i32 0, i32 1, i32 3, i32 7, i32 5>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_00444444(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_00444444
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,0,2,2,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,4,4,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 0, i32 4, i32 4, i32 4, i32 4, i32 4, i32 4>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_44004444(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_44004444
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[2,2,0,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,4,4,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 4, i32 4, i32 0, i32 0, i32 4, i32 4, i32 4, i32 4>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_04404444(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_04404444
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,2,2,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,4,4,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 4, i32 4, i32 0, i32 4, i32 4, i32 4, i32 4>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_04400000(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_04400000
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,0,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,2,2,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,4,4,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 4, i32 4, i32 0, i32 0, i32 0, i32 0, i32 0>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_04404567(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_04404567
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,2,2,0,4,5,6,7]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 4, i32 4, i32 0, i32 4, i32 5, i32 6, i32 7>
				ret <8 x i16> %shuffle
				}

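				; Note (editorial sketch, inferred from the masks in this group): an 'X' in a
				; test name stands for an undef mask element, so the lowering is free to put
				; any value in that lane. The expected shuffles below only pin down the
				; defined lanes.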
				define <8 x i16> @shuffle_v8i16_0X444444(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_0X444444
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,1,2,2,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,4,4,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 undef, i32 4, i32 4, i32 4, i32 4, i32 4, i32 4>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_44X04444(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_44X04444
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[2,2,2,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,4,4,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 4, i32 4, i32 undef, i32 0, i32 4, i32 4, i32 4, i32 4>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_X4404444(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_X4404444
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,2,2,0,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,4,4,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 undef, i32 4, i32 4, i32 0, i32 4, i32 4, i32 4, i32 4>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_0127XXXX(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_0127XXXX
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,1,3]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,7,6,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,3]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_XXXX4563(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_XXXX4563
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[3,1,2,0]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,3,2,3,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,1,2,0]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 4, i32 5, i32 6, i32 3>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_4563XXXX(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_4563XXXX
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[3,1,2,0]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,3,2,3,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[2,3,0,1,4,5,6,7]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 4, i32 5, i32 6, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_01274563(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_01274563
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,1,3]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,6,5,4,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,3,2,1]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,6,7,4,5]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 7, i32 4, i32 5, i32 6, i32 3>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_45630127(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_45630127
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[3,1,2,0]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,3,1,2,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,1,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[2,3,0,1,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,6,7,5,4]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 4, i32 5, i32 6, i32 3, i32 0, i32 1, i32 2, i32 7>
				ret <8 x i16> %shuffle
				}

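				; Note (editorial sketch, inferred from the masks that follow): these are
				; two-input shuffles. Mask indices 0-7 select lanes of %a and indices 8-15
				; select lanes 0-7 of %b; the test names spell the latter as the hex digits
				; 8-f, e.g. "08192a3b" is the mask <0, 8, 1, 9, 2, 10, 3, 11>.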
				define <8 x i16> @shuffle_v8i16_08192a3b(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_08192a3b
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: punpcklwd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_0c1d2e3f(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_0c1d2e3f
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm1 = xmm1[2,3,2,3]
				; CHECK-SSE2-NEXT: punpcklwd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 12, i32 1, i32 13, i32 2, i32 14, i32 3, i32 15>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_4c5d6e7f(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_4c5d6e7f
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm1 = xmm1[2,3,2,3]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[2,3,2,3]
				; CHECK-SSE2-NEXT: punpcklwd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_48596a7b(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_48596a7b
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[2,3,2,3]
				; CHECK-SSE2-NEXT: punpcklwd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 4, i32 8, i32 5, i32 9, i32 6, i32 10, i32 7, i32 11>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_08196e7f(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_08196e7f
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm1 = xmm1[0,3,2,3]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,3,2,3]
				; CHECK-SSE2-NEXT: punpcklwd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 6, i32 14, i32 7, i32 15>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_0c1d6879(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_0c1d6879
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,3,2,3]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm1 = xmm1[0,2,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm1 = xmm1[2,3,0,1,4,5,6,7]
				; CHECK-SSE2-NEXT: punpcklwd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 12, i32 1, i32 13, i32 6, i32 8, i32 7, i32 9>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_109832ba(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_109832ba
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: punpcklwd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm1 = xmm0[2,0,3,1,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[2,3,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[2,0,3,1,4,5,6,7]
				; CHECK-SSE2-NEXT: punpcklqdq %xmm0, %xmm1
				; CHECK-SSE2-NEXT: movdqa %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 1, i32 0, i32 9, i32 8, i32 3, i32 2, i32 11, i32 10>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_8091a2b3(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_8091a2b3
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: punpcklwd %xmm0, %xmm1
				; CHECK-SSE2-NEXT: movdqa %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 8, i32 0, i32 9, i32 1, i32 10, i32 2, i32 11, i32 3>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_c4d5e6f7(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_c4d5e6f7
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm2 = xmm0[2,3,2,3]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm1[2,3,2,3]
				; CHECK-SSE2-NEXT: punpcklwd %xmm2, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 12, i32 4, i32 13, i32 5, i32 14, i32 6, i32 15, i32 7>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_0213cedf(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_0213cedf
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,2,1,3,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm1 = xmm1[2,3,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm1 = xmm1[0,2,1,3,4,5,6,7]
				; CHECK-SSE2-NEXT: punpcklqdq %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 2, i32 1, i32 3, i32 12, i32 14, i32 13, i32 15>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_032dXXXX(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_032dXXXX
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm1 = xmm1[2,1,2,3]
				; CHECK-SSE2-NEXT: punpcklwd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,3,2,3,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,6,6,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,3,2,1,4,5,6,7]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 3, i32 2, i32 13, i32 undef, i32 undef, i32 undef, i32 undef>
				ret <8 x i16> %shuffle
				}
				define <8 x i16> @shuffle_v8i16_XXXdXXXX(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_XXXdXXXX
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm1[2,1,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[0,1,2,1,4,5,6,7]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 13, i32 undef, i32 undef, i32 undef, i32 undef>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_012dXXXX(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_012dXXXX
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm1 = xmm1[2,1,2,3]
				; CHECK-SSE2-NEXT: punpcklwd %xmm1, %xmm0
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[3,1,2,0]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,6,6,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[2,1,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[1,2,0,3,4,5,6,7]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 13, i32 undef, i32 undef, i32 undef, i32 undef>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_XXXXcde3(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_XXXXcde3
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,1,2,1]
				; CHECK-SSE2-NEXT: punpckhwd %xmm0, %xmm1
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm1[0,2,2,3,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,7,6,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,1,2,0]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,6,7,4,5]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 12, i32 13, i32 14, i32 3>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_cde3XXXX(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_cde3XXXX
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,1,2,1]
				; CHECK-SSE2-NEXT: punpckhwd %xmm0, %xmm1
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm1[0,2,2,3,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,7,6,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[0,2,2,3]
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 12, i32 13, i32 14, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
				ret <8 x i16> %shuffle
				}

				define <8 x i16> @shuffle_v8i16_012dcde3(<8 x i16> %a, <8 x i16> %b) {
				; CHECK-SSE2-LABEL: @shuffle_v8i16_012dcde3
				; CHECK-SSE2: # BB#0:
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm2 = xmm0[0,1,2,1]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm3 = xmm1[2,1,2,3]
				; CHECK-SSE2-NEXT: punpckhwd %xmm2, %xmm1
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm1 = xmm1[0,2,2,3,4,5,6,7]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm1 = xmm1[0,1,2,3,4,7,6,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm1 = xmm1[0,2,2,3]
				; CHECK-SSE2-NEXT: punpcklwd %xmm3, %xmm0
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[3,1,2,0]
				; CHECK-SSE2-NEXT: pshufhw {{.*}} # xmm0 = xmm0[0,1,2,3,4,6,6,7]
				; CHECK-SSE2-NEXT: pshufd {{.*}} # xmm0 = xmm0[2,1,2,3]
				; CHECK-SSE2-NEXT: pshuflw {{.*}} # xmm0 = xmm0[1,2,0,3,4,5,6,7]
				; CHECK-SSE2-NEXT: punpcklqdq %xmm1, %xmm0
				; CHECK-SSE2-NEXT: retq
				%shuffle = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 13, i32 12, i32 13, i32 14, i32 3>
				ret <8 x i16> %shuffle
				}