This is an archive of the discontinued LLVM Phabricator instance.

Differential D50074

[X86][AVX2] Prefer VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles
ClosedPublic

Authored by RKSimon on Jul 31 2018, 8:21 AM.

Details

Reviewers

craig.topper
zvi
delena
lebedev.ri
pcordes

Commits

rG6b9bf7ecbc15: [X86][AVX] Prefer VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles
rL340913: [X86][AVX] Prefer VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles

Summary

Noticed while looking at D49562 codegen - we can avoid a large constant mask load and a slow VPBLENDVB select op by using VPBLENDW+VPBLENDD instead.

TODO: We should investigate adding VPBLENDVB handling to target shuffle combining as well.

Should we be preferring VPBLENDVB/VSELECT for AVX512 targets?

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon created this revision. Jul 31 2018, 8:21 AM

Nice idea to chain vpblendw + vpblendd, those are both single-uop instructions on AVX2 CPUs, and vpblendd can run on any port.

Skylake: vpblendvb is 2 uops for any of p015, vpblendw is 1 uop for p5. If you can hoist the vector constant, vpblendvb is equal to 2 immediate blends, and better than 3.

Fun fact: Skylake's legacy-SSE pblendvb x,x,xmm0 is a single uop for p015. The 4-explicit-operand VEX encoding makes it cost 2 uops.

Haswell's vpblendvb is 2 uops for p5 only, so it and shuffles can easily bottleneck on port 5. vpblendw is also port 5 only. vpblendw+vpblendd is better, but depending on port pressure, 2x vpblendw+vpblendd is worse (again assuming you can hoist the vector constant).

Agner Fog strangely doesn't have numbers for pblendvb on Piledriver or Ryzen, not even the SSE4 version. http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt says vpblendvb xmm is single-cycle latency on Ryzen, but vpblendvb ymm is 2c latency. (With throughput = latency, so I guess only one port.) VPBLENDW ymm is single-cycle latency, with 0.67c throughput, so I guess it's 1 uop per lane, on 3 ports. So Ryzen's vpblendw is better than Intel's for avoiding port bottlenecks. But vpblendvb is also only 1 uop per lane, so it's definitely efficient when we can hoist the mask out of the loop and register pressure is low enough to spend a register on it.

My instinct here is that for Intel tunings and probably also generic, we should replace vpblendvb with up to 2 uops of vpblendw + vpblendd, but not 3.

If we can analyze the situation and figure out that vpblendvb will definitely have to reload the mask every time, we should consider replacing it even if it takes 3 immediate blends. Ideally we can check the loop for port 5 pressure.
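The rule of thumb in this comment can be written down as a small cost sketch. This is only an illustration under stated assumptions: `BlendCost`, `immediateBlendCost`, `vpblendvbCost` and `preferImmediateBlends` are hypothetical names, not LLVM APIs, and the uop numbers are the ones quoted in the comment (vpblendw: 1 uop on port 5; vpblendd: 1 uop on any port; vpblendvb: 2 uops, port 5 only on Haswell, any of p015 on Skylake).

```cpp
#include <cassert>

// Hypothetical cost record: total uops and how many of them can only
// issue on port 5 (numbers taken from the review comment, not measured here).
struct BlendCost {
  int uops;
  int port5Uops;
};

// Cost of a chain of immediate blends: each vpblendw is 1 uop restricted to
// port 5, each vpblendd is 1 uop that can run on any vector ALU port.
inline BlendCost immediateBlendCost(int numVpblendw, int numVpblendd) {
  return {numVpblendw + numVpblendd, numVpblendw};
}

// Cost of one vpblendvb with the mask already in a register.
inline BlendCost vpblendvbCost(bool isHaswell) {
  return {2, isHaswell ? 2 : 0};
}

// Suggested policy: take the immediate blends when they cost at most 2 uops;
// accept 3 only if vpblendvb would have to reload its mask every time.
inline bool preferImmediateBlends(int numVpblendw, int numVpblendd,
                                  bool maskReloadedEveryTime) {
  int uops = immediateBlendCost(numVpblendw, numVpblendd).uops;
  return uops <= 2 || (uops == 3 && maskReloadedEveryTime);
}
```

For example, `preferImmediateBlends(1, 1, false)` accepts vpblendw+vpblendd, while `preferImmediateBlends(2, 1, false)` rejects the 3-instruction sequence when the mask can stay in a register. The sketch ignores surrounding port pressure, which the comment notes can still make the 2-uop replacement a loss.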

Stuff like this makes clang hard to use when hand-tuning a loop, though. I know I'd be very annoyed if I was using a vpblendvb intrinsic, and clang replaced it with 2x vpblendw + vpblendd and created a port 5 bottleneck on Skylake, plus costing more uops. So we should be very cautious about 3-instruction replacements.

Replacing it with 2 uops can obviously be harmful too in some cases, because vpblendw only runs on port 5. It would be great if there was an option that asked clang to use instructions more closely matching the intrinsics for hand-tuned loops, but we can always write asm by hand to tune for a specific uarch.

With AVX512BW, we should *definitely* be using vpblendmb zmm0{k1}{z}, zmm1, zmm2. According to IACA for SKX, it's single-uop, 1c latency, and runs on any vector ALU port. (i.e. port 0 or port 5 when port 1 is shut down because 512-bit uops are in flight.)

Even if setting up the constant takes extra uops, the critical path latency for the vectors is only 1 cycle.

To get a 64-bit constant into a k register, we need a movabs rcx, 0x12346... / kmovq k1, rcx, or load it from memory. I think normally k registers aren't under a lot of pressure in most functions, so we should normally be able to hoist it if there is a loop.

Loading a k register from memory costs 2 uops according to IACA, one of them being a micro-fused load+ALU. That sounds weird; IDK why you'd need an ALU uop other than the integer->k port-5-only uop. It might be correct, though; k-register store+load has 0.67c or 0.5c throughput (http://users.atw.hu/instlatx64/GenuineIntel0050654_SkylakeX_InstLatX64.txt). Anyway, I think probably a mov-immediate is a good choice even for 64-bit integers, because it apparently doesn't save uops to load straight into mask registers.

With only AVX512F, we can do bit/byte blends using vpternlogd, using a vector control mask (in a zmm reg, not a k reg). Given the right truth table, one source can select the corresponding bit from either of the other two operands, so we can replace one of the inputs or replace the selector.
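To make "the right truth table" concrete, here is a scalar sketch of how the vpternlogd immediate for a bitwise select can be derived. It assumes the usual index convention for the instruction, index = (a << 2) | (b << 1) | c with a being the destination/first source; `ternlogImm`, `bitSelectImm` and `ternlog32` are hypothetical helper names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Build the 8-bit truth table for a three-input boolean function f(a, b, c).
// Bit number ((a << 2) | (b << 1) | c) of the result holds f(a, b, c).
template <typename F>
inline uint8_t ternlogImm(F f) {
  uint8_t imm = 0;
  for (unsigned idx = 0; idx < 8; ++idx) {
    bool a = (idx >> 2) & 1;
    bool b = (idx >> 1) & 1;
    bool c = idx & 1;
    if (f(a, b, c))
      imm |= uint8_t(1u << idx);
  }
  return imm;
}

// Immediate for "a selects between b and c": result = a ? b : c (0xCA).
inline uint8_t bitSelectImm() {
  return ternlogImm([](bool a, bool b, bool c) { return a ? b : c; });
}

// Scalar model of one 32-bit element of vpternlogd: every result bit is a
// lookup into imm using the corresponding bits of a, b and c.
inline uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm) {
  uint32_t r = 0;
  for (unsigned bit = 0; bit < 32; ++bit) {
    unsigned idx = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) |
                   ((c >> bit) & 1);
    r |= uint32_t((imm >> idx) & 1) << bit;
  }
  return r;
}
```

Passing a different lambda to `ternlogImm` (for instance `c ? b : a`) gives the immediate for the case where the selector sits in another operand, which is what lets the instruction overwrite either a data input or the selector.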

vpternlogd is single-uop on AVX512 CPUs, including KNL.

This could be an interesting option for byte blends of 256-bit vectors when used with AVX2 compare results (that put the result in a vector instead of mask reg). e.g. building manually-vectorized code with 256-bit vectors with -march=knl, where we have AVX512F but not BW. (And not VL, so we'd actually have to use a ZMM instruction. That's fine on KNL, but very bad on SKX if no other 512-bit instructions were in flight. We'd like to avoid -mtune=generic -mavx512f being a pitfall of nasty code-gen compared to -march=skylake-avx512)

I don't think AVX512 has any immediate blends. Even vblendpd doesn't have an EVEX encoding, only VEX using only the low 4 bits of the imm8. If there are any, they don't have blend or select in the mnemonic or short description.

I guess you're meant to use k registers, even though it's a 2-step 2-uop process to get an immediate into a k reg. (But one of those uops can run on any port, including port 6). Both those extra uops are off the critical path of vectors in -> vector out, unlike with multi-uop vpblendvb.

Of course, VEX vpblendd is still excellent, and should be used on 256-bit vectors whenever possible. e.g. for _mm256_mask_blend_epi32 with a compile-time constant mask, if register allocation has the operands in the low 16 registers.

(Fun fact: using only ymm16..31 avoids the need for vzeroupper, because their low lanes aren't accessible with legacy SSE instructions. But missing out on VEX instructions / short-encodings when doing 256-bit vectorization with AVX512 available is a downside to that.)

Cheers Peter, I'm going to look at adding combining shuffles to VPBLENDVB/VPBLENDMB in the target shuffle combiner. We already have a 'variable mask' threshold mechanism that allows recent Intel CPUs to merge >2 shuffles to a single variable mask shuffle so the 2*VPBLENDW+VPBLENDD regression case can be avoided on those targets (see the 'SLOW' vs 'FAST' codegen checks above).

I can look at combining shuffles to VPTERNLOG in the future if/when it's requested.

In D50074#1189210, @RKSimon wrote:

Cheers Peter, I'm going to look at adding combining shuffles to VPBLENDVB/VPBLENDMB in the target shuffle combiner.

Don't forget that a 32-bit mask is cheaper to create with a mov r32, imm32, so look for the chance to use vpblendmw.
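As a rough sketch of that check: a per-byte blend mask (one bit per byte, as vpblendmb would take) can be narrowed to a per-word mask (one bit per 16-bit element, as vpblendmw would take) only when both bytes of every word make the same choice. `narrowByteMaskToWordMask` is a hypothetical helper written for this illustration, not an existing LLVM function.

```cpp
#include <cassert>
#include <cstdint>

// Try to narrow a 64-bit per-byte mask to a 32-bit per-word mask.
// Returns false if any word has its two bytes selected differently,
// in which case wordMask is left untouched.
inline bool narrowByteMaskToWordMask(uint64_t byteMask, uint32_t &wordMask) {
  uint32_t result = 0;
  for (unsigned w = 0; w < 32; ++w) {
    unsigned pair = unsigned((byteMask >> (2 * w)) & 0x3);
    if (pair == 0x3)
      result |= uint32_t(1) << w; // both bytes taken from the second source
    else if (pair != 0)
      return false;               // the bytes of this word disagree
  }
  wordMask = result;
  return true;
}
```

A byte mask of 3 (the low two bytes) narrows to a word mask of 1, which fits a 32-bit `mov` immediate; a byte mask of 1 cannot be narrowed and would still need the byte-granularity blend.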

We already have a 'variable mask' threshold mechanism that allows recent Intel CPUs to merge >2 shuffles to a single variable mask shuffle so the 2*VPBLENDW+VPBLENDD regression case can be avoided on those targets (see the 'SLOW' vs 'FAST' codegen checks above).

From that code-gen, I hope KNL is the only CPU in the "AVX2-SLOW" category; its vpblendw/d are efficient but vpblendvb is 4 uops (and thus a front-end bottleneck). But KNL's vpshuflw/hw ymm are horrible too, 8c throughput vs. 12c for vpshufb ymm, so I'm not convinced that multiple vpshufl/hw + combine is the way to go vs. just using vpshufb ymm. Given the way its front-end works, one huge instruction that gets a lot of uops from microcode ROM in one go is probably better than multiple multi-uop instructions that stall the decoders multiple times. (But this is just based on Agner Fog's guide, not any real testing. Still, the per-instruction throughput numbers can be misleading because nearly every multi-uop instruction's throughput is based on the resulting front-end bottleneck. IDK if microcode can be read fast enough to fill that bubble for later insns...)

If KNL can load a mask for vpternlogd, that's probably your best bet for efficient byte blends if AVX512F isn't disabled. But maybe not a high priority to implement because AVX2 byte-manipulation code is generally going to suck on KNL anyway.

On anything other than KNL that supports AVX2, it comes down to whether the blend mask can be hoisted out of a loop.

Haswell is almost always better off with 1 vpblendvb (2p5) than 3 separate instructions (2p5 + p015), when we have the mask in a reg already.

Ryzen is much better off with vpblendvb ymm (2 uops) than 6 uops.

(IDK about Excavator).

RKSimon mentioned this in D41436: [X86][AVX512] Enable variable shuffle combining by default on AVX512 targets. Aug 9 2018, 7:52 AM

rebased - still investigating how best to start including vselect inside shuffle combining - trying to optimize for everything from SSE41 to AVX512BWVL isn't straightforward - especially as we don't do much to optimize vselect nodes most of the time as their behaviour is target specific after legalization.

TBH I reckon this could go in as it is and we improve VSELECT combines later on.

In D50074#1214328, @RKSimon wrote:

TBH I reckon this could go in as it is and we improve VSELECT combines later on.

Sounds reasonable as long as we aren't pessimizing Skylake by turning vpblendvb into 3 uops (including 2 for port 5) instead of 2 for any port, inside a loop.

AVX2-FAST-LABEL: PR24935: seems to be doing that still.

Especially in manually-vectorized code, I think it would be bad to compile _mm256_blendv_epi8 with a constant into 2x vpblendw + vpblendd. Could easily cause a performance regression in some code.

Can we add a check that only at most 2 immediate blends will be needed, as a conservative option to get the improvements in place for the cases where it is a win?

Limit to only one PBLENDW until we have shuffle combining support for VSELECT.
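The restriction can be modelled in scalar code. The sketch below is an illustration, not the patch itself: it mirrors the `LoMask`/`HiMask` test from the X86ISelLowering.cpp change and checks that blending each 128-bit lane with its own 8-bit immediate and then taking the low lane from one result and the high lane from the other reproduces the full 16-element blend. `V16`, `pblendw`, `blend16`, `needsOnlyOnePblendw` and `blendViaLanes` are hypothetical names. When one of the two immediates is 0 or 255, that "blend" is just a copy of one input, which is presumably why the AVX2 checks in this revision show a single vpblendw followed by vpblendd.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V16 = std::array<uint16_t, 16>;

// Scalar model of 256-bit VPBLENDW: the same 8-bit immediate is applied to
// both 128-bit lanes; a set bit picks the element from v2.
inline V16 pblendw(const V16 &v1, const V16 &v2, uint8_t imm) {
  V16 r{};
  for (int i = 0; i < 16; ++i)
    r[i] = ((imm >> (i % 8)) & 1) ? v2[i] : v1[i];
  return r;
}

// Reference blend: bit i of mask set means element i comes from v2.
inline V16 blend16(const V16 &v1, const V16 &v2, uint16_t mask) {
  V16 r{};
  for (int i = 0; i < 16; ++i)
    r[i] = ((mask >> i) & 1) ? v2[i] : v1[i];
  return r;
}

// Same condition as the patch: at least one lane's immediate is trivial.
inline bool needsOnlyOnePblendw(uint16_t blendMask) {
  unsigned lo = blendMask & 0xFF;
  unsigned hi = (blendMask >> 8) & 0xFF;
  return lo == 0 || lo == 255 || hi == 0 || hi == 255;
}

// Blend with the low-lane and high-lane immediates separately, then keep
// elements 0..7 of the first result and 8..15 of the second (the lane blend
// that VPBLENDD performs).
inline V16 blendViaLanes(const V16 &v1, const V16 &v2, uint16_t blendMask) {
  V16 lo = pblendw(v1, v2, uint8_t(blendMask & 0xFF));
  V16 hi = pblendw(v1, v2, uint8_t((blendMask >> 8) & 0xFF));
  V16 r{};
  for (int i = 0; i < 16; ++i)
    r[i] = i < 8 ? lo[i] : hi[i];
  return r;
}
```

A mask such as 0x8000 (only element 15 from the second input) passes `needsOnlyOnePblendw`, whereas 0x5A3C has non-trivial immediates in both lanes and would need the 2x vpblendw + vpblendd sequence that this update avoids.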

This is looking pretty good as far as just the optimization you're aiming for. Most of my comments are on only semi-related things that happen to be in the diffs. e.g. we end up picking vpermw now instead of blending.

test/CodeGen/X86/insertelement-ones.ll
312 ↗	(On Diff #162716)	We should have another testcase that blends something other than -1, because the best way to do this blend with three `-1` elements is with an OR. OR with -1 produces -1 regardless of the previous contents, and OR with 0 is a no-op, thus it's a blend.

I don't have numbers on loading a constant vs. a couple extra uops outside a loop. Obviously any time we have a loop that will either keep a constant hot in cache, or let us hoist into a reg, this is a very nice win.

https://godbolt.org/z/JNv5VZ shows that this works: a manually optimized version of the function gives the same result for constant-propagation.

    vpor .LCPI1_0(%rip), %ymm0, %ymm0
    ret

clang actually used `vorps`, but that can only run on port 5 before Skylake. I used `-march=haswell`, so compiling `_mm256_or_si256` to `vorps` (port 5) instead of `vpor` (port 0/1/5) is really silly for an integer vector. (SKL lets `vorps` run on any port, with latency between FP instructions dependent on which port it actually picks. But I compiled with `-march=haswell`, and this is a poor choice for HSW.)

Without AVX, `por` is 1 byte longer than `orps`, but even then `por` can be worth it on pre-Skylake depending on the surrounding code (port 5 pressure, and/or if there's any ILP for this blend). Also with Hyperthreading, uops that can be assigned to any port are more likely to be able to take full advantage of the extra ILP exposed by SMT, vs. potentially having both threads together bottleneck on the same port.
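The OR trick from this comment is easy to check in scalar code. The sketch below is an illustration with hypothetical helper names (`V16`, `insertOnes`, `insertOnesViaOr`): inserting -1 into chosen 16-bit elements gives the same vector as a single OR with a constant that is 0xFFFF in those elements and 0 elsewhere.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V16 = std::array<uint16_t, 16>;

// Reference: overwrite every element whose bit is set in laneMask with -1.
inline V16 insertOnes(V16 v, uint16_t laneMask) {
  for (int i = 0; i < 16; ++i)
    if ((laneMask >> i) & 1)
      v[i] = 0xFFFF;
  return v;
}

// Same result with one OR: x | 0xFFFF == 0xFFFF and x | 0 == x, so OR-ing
// with the constant acts as a blend of -1 into the selected elements.
inline V16 insertOnesViaOr(const V16 &v, uint16_t laneMask) {
  V16 r{};
  for (int i = 0; i < 16; ++i) {
    uint16_t c = ((laneMask >> i) & 1) ? 0xFFFF : 0;
    r[i] = uint16_t(v[i] | c);
  }
  return r;
}
```

For the `insert_v16i16_x12345x789ABCDEx` test, which inserts -1 at elements 0, 6 and 15, the lane mask is 0x8041. The identity only holds for -1 (and, with AND, for 0), which is why the comment asks for an additional test that inserts some other value.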
test/CodeGen/X86/vector-shuffle-256-v32.ll
436 ↗	(On Diff #162716)	`vpermw` costs 2 shuffle uops and 4c latency on SKX, so it's implemented internally as a lane-crossing + in-lane shuffle. Some future CPU might make it single-uop, though. If we need a `vpshufb` anyway, can we use a wider-granularity shuffle like `vpermd` (single uop), using a vector constant there? I guess immediate `vpermq` isn't super helpful.
437 ↗	(On Diff #162716)	`movw $1, %ax` has a false dependency on the old value of RAX. Saving 1 byte of code size is probably not worth it, unless benchmarks show that the code density to help the uop-cache / front-end is more important than back-end false deps. But the false-dep effect can just randomly happen to suck a lot (last use of RAX = cache miss or long dep chain), vs. negligible, so it's hard to bench. (But at least this doesn't cause a decode stall, even though the operand-size prefix changes the length of the rest of the instruction because of the imm16. `mov` is exempt from LCP stalls on Sandybridge-family.) I've noticed clang using narrow operand sizes (like 8-bit) in other cases when it's not gaining anything, and potentially hurting. (TODO: report this as a separate bug; it's obviously unrelated to this.)
440 ↗	(On Diff #162716)	Could we have used merge-masking on the `vpshufb control, %ymm0, %ymm1{%k1}` with a mask of `eax=3` (2 bytes) instead of `eax=1` (1 word)? (I'm falling asleep here, so I can't tell right now.)

pcordes accepted this revision. Aug 28 2018, 11:01 PM

This revision is now accepted and ready to land. Aug 28 2018, 11:01 PM

Closed by commit rL340913: [X86][AVX] Prefer VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles (authored by RKSimon). Aug 29 2018, 3:52 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                     Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp            16 lines
llvm/trunk/test/CodeGen/X86/insertelement-ones.ll        48 lines
llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v16.ll    42 lines
llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v32.ll    24 lines
llvm/trunk/test/CodeGen/X86/vector-shuffle-512-v32.ll    20 lines
llvm/trunk/test/CodeGen/X86/vector-shuffle-v48.ll        4 lines

Diff 163039

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

 [9,856 lines hidden] static SDValue lowerVectorShuffleAsBlend(const SDLoc &DL, MVT VT, SDValue V1,

   switch (VT.SimpleTy) {
   case MVT::v2f64:
   case MVT::v4f32:
   case MVT::v4f64:
   case MVT::v8f32:
     return DAG.getNode(X86ISD::BLENDI, DL, VT, V1, V2,
                        DAG.getConstant(BlendMask, DL, MVT::i8));

   case MVT::v4i64:
   case MVT::v8i32:
     assert(Subtarget.hasAVX2() && "256-bit integer blends require AVX2!");
     LLVM_FALLTHROUGH;
   case MVT::v2i64:
   case MVT::v4i32:
     // If we have AVX2 it is faster to use VPBLENDD when the shuffle fits into
     // that instruction.
 [15 lines hidden] case MVT::v8i16: {
     int Scale = 8 / VT.getVectorNumElements();
     BlendMask = scaleVectorShuffleBlendMask(BlendMask, Mask.size(), Scale);
     V1 = DAG.getBitcast(MVT::v8i16, V1);
     V2 = DAG.getBitcast(MVT::v8i16, V2);
     return DAG.getBitcast(VT,
                           DAG.getNode(X86ISD::BLENDI, DL, MVT::v8i16, V1, V2,
                                       DAG.getConstant(BlendMask, DL, MVT::i8)));
   }

   case MVT::v16i16: {
     assert(Subtarget.hasAVX2() && "256-bit integer blends require AVX2!");
     SmallVector<int, 8> RepeatedMask;
     if (is128BitLaneRepeatedShuffleMask(MVT::v16i16, Mask, RepeatedMask)) {
       // We can lower these with PBLENDW which is mirrored across 128-bit lanes.
       assert(RepeatedMask.size() == 8 && "Repeated mask size doesn't match!");
       BlendMask = 0;
       for (int i = 0; i < 8; ++i)
         if (RepeatedMask[i] >= 8)
           BlendMask |= 1ull << i;
       return DAG.getNode(X86ISD::BLENDI, DL, MVT::v16i16, V1, V2,
                          DAG.getConstant(BlendMask, DL, MVT::i8));
     }
+    // Use PBLENDW for lower/upper lanes and then blend lanes.
+    // TODO - we should allow 2 PBLENDW here and leave shuffle combine to
+    // merge to VSELECT where useful.
+    uint64_t LoMask = BlendMask & 0xFF;
+    uint64_t HiMask = (BlendMask >> 8) & 0xFF;
+    if (LoMask == 0 || LoMask == 255 || HiMask == 0 || HiMask == 255) {
+      SDValue Lo = DAG.getNode(X86ISD::BLENDI, DL, MVT::v16i16, V1, V2,
+                               DAG.getConstant(LoMask, DL, MVT::i8));
+      SDValue Hi = DAG.getNode(X86ISD::BLENDI, DL, MVT::v16i16, V1, V2,
+                               DAG.getConstant(HiMask, DL, MVT::i8));
+      return DAG.getVectorShuffle(
+          MVT::v16i16, DL, Lo, Hi,
+          {0, 1, 2, 3, 4, 5, 6, 7, 24, 25, 26, 27, 28, 29, 30, 31});
+    }
     LLVM_FALLTHROUGH;
   }
   case MVT::v16i8:
   case MVT::v32i8: {
     assert((VT.is128BitVector() || Subtarget.hasAVX2()) &&
            "256-bit byte-blends require AVX2 support!");

     if (Subtarget.hasBWI() && Subtarget.hasVLX()) {
 [31,350 lines hidden]

llvm/trunk/test/CodeGen/X86/insertelement-ones.ll

	[284 lines hidden]
	; AVX1-NEXT: vandps {{.*}}(%rip), %ymm0, %ymm0			; AVX1-NEXT: vandps {{.*}}(%rip), %ymm0, %ymm0
	; AVX1-NEXT: vorps {{.*}}(%rip), %ymm0, %ymm0			; AVX1-NEXT: vorps {{.*}}(%rip), %ymm0, %ymm0
	; AVX1-NEXT: vandps {{.*}}(%rip), %ymm0, %ymm0			; AVX1-NEXT: vandps {{.*}}(%rip), %ymm0, %ymm0
	; AVX1-NEXT: vorps {{.*}}(%rip), %ymm0, %ymm0			; AVX1-NEXT: vorps {{.*}}(%rip), %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: insert_v16i16_x12345x789ABCDEx:			; AVX2-LABEL: insert_v16i16_x12345x789ABCDEx:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vmovdqa {{.*#+}} ymm1 = [0,0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]			; AVX2-NEXT: vpcmpeqd %ymm1, %ymm1, %ymm1
	; AVX2-NEXT: vpcmpeqd %ymm2, %ymm2, %ymm2			; AVX2-NEXT: vpblendw {{.*#+}} ymm2 = ymm1[0],ymm0[1,2,3,4,5,6,7],ymm1[8],ymm0[9,10,11,12,13,14,15]
	; AVX2-NEXT: vpblendvb %ymm1, %ymm0, %ymm2, %ymm0			; AVX2-NEXT: vpblendd {{.*#+}} ymm2 = ymm2[0,1,2,3],ymm0[4,5,6,7]
	; AVX2-NEXT: vmovdqa {{.*#+}} ymm1 = [255,255,255,255,255,255,255,255,255,255,255,255,0,0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]			; AVX2-NEXT: vpblendw {{.*#+}} ymm2 = ymm2[0,1,2,3,4,5],ymm1[6],ymm2[7,8,9,10,11,12,13],ymm1[14],ymm2[15]
	; AVX2-NEXT: vpblendvb %ymm1, %ymm0, %ymm2, %ymm0			; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm2[0,1,2,3],ymm0[4,5,6,7]
	; AVX2-NEXT: vmovdqa {{.*#+}} ymm1 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,0,0]			; AVX2-NEXT: vpblendw {{.*#+}} ymm0 = ymm0[0,1,2,3,4,5,6],ymm1[7],ymm0[8,9,10,11,12,13,14],ymm1[15]
	; AVX2-NEXT: vpblendvb %ymm1, %ymm0, %ymm2, %ymm0			; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm2[0,1,2,3],ymm0[4,5,6,7]
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512F-LABEL: insert_v16i16_x12345x789ABCDEx:			; AVX512-LABEL: insert_v16i16_x12345x789ABCDEx:
	; AVX512F: # %bb.0:			; AVX512: # %bb.0:
	; AVX512F-NEXT: vmovdqa {{.*#+}} ymm1 = [0,0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]			; AVX512-NEXT: vpcmpeqd %ymm1, %ymm1, %ymm1
	; AVX512F-NEXT: vpcmpeqd %ymm2, %ymm2, %ymm2			; AVX512-NEXT: vpblendw {{.*#+}} ymm2 = ymm1[0],ymm0[1,2,3,4,5,6,7],ymm1[8],ymm0[9,10,11,12,13,14,15]
	; AVX512F-NEXT: vpblendvb %ymm1, %ymm0, %ymm2, %ymm0			; AVX512-NEXT: vpblendd {{.*#+}} ymm2 = ymm2[0,1,2,3],ymm0[4,5,6,7]
	; AVX512F-NEXT: vmovdqa {{.*#+}} ymm1 = [255,255,255,255,255,255,255,255,255,255,255,255,0,0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]			; AVX512-NEXT: vpblendw {{.*#+}} ymm2 = ymm2[0,1,2,3,4,5],ymm1[6],ymm2[7,8,9,10,11,12,13],ymm1[14],ymm2[15]
	; AVX512F-NEXT: vpblendvb %ymm1, %ymm0, %ymm2, %ymm0			; AVX512-NEXT: vpblendd {{.*#+}} ymm0 = ymm2[0,1,2,3],ymm0[4,5,6,7]
	; AVX512F-NEXT: vmovdqa {{.*#+}} ymm1 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,0,0]			; AVX512-NEXT: vpblendw {{.*#+}} ymm0 = ymm0[0,1,2,3,4,5,6],ymm1[7],ymm0[8,9,10,11,12,13,14],ymm1[15]
	; AVX512F-NEXT: vpblendvb %ymm1, %ymm0, %ymm2, %ymm0			; AVX512-NEXT: vpblendd {{.*#+}} ymm0 = ymm2[0,1,2,3],ymm0[4,5,6,7]
	; AVX512F-NEXT: retq			; AVX512-NEXT: retq
	;
	; AVX512VL-LABEL: insert_v16i16_x12345x789ABCDEx:
	; AVX512VL: # %bb.0:
	; AVX512VL-NEXT: vpcmpeqd %ymm1, %ymm1, %ymm1
	; AVX512VL-NEXT: movw $1, %ax
	; AVX512VL-NEXT: kmovd %eax, %k1
	; AVX512VL-NEXT: vmovdqu16 %ymm1, %ymm0 {%k1}
	; AVX512VL-NEXT: movw $64, %ax
	; AVX512VL-NEXT: kmovd %eax, %k1
	; AVX512VL-NEXT: vmovdqu16 %ymm1, %ymm0 {%k1}
	; AVX512VL-NEXT: movw $-32768, %ax # imm = 0x8000
	; AVX512VL-NEXT: kmovd %eax, %k1
	; AVX512VL-NEXT: vmovdqu16 %ymm1, %ymm0 {%k1}
	; AVX512VL-NEXT: retq
	%1 = insertelement <16 x i16> %a, i16 -1, i32 0			%1 = insertelement <16 x i16> %a, i16 -1, i32 0
	%2 = insertelement <16 x i16> %1, i16 -1, i32 6			%2 = insertelement <16 x i16> %1, i16 -1, i32 6
	%3 = insertelement <16 x i16> %2, i16 -1, i32 15			%3 = insertelement <16 x i16> %2, i16 -1, i32 15
	ret <16 x i16> %3			ret <16 x i16> %3
	}			}

	define <16 x i8> @insert_v16i8_x123456789ABCDEx(<16 x i8> %a) {			define <16 x i8> @insert_v16i8_x123456789ABCDEx(<16 x i8> %a) {
	; SSE2-LABEL: insert_v16i8_x123456789ABCDEx:			; SSE2-LABEL: insert_v16i8_x123456789ABCDEx:
	[171 lines hidden]

llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v16.ll

	[259 lines hidden]
	; AVX1-NEXT: vpshuflw {{.*#+}} xmm0 = xmm0[0,0,2,3,4,5,6,7]			; AVX1-NEXT: vpshuflw {{.*#+}} xmm0 = xmm0[0,0,2,3,4,5,6,7]
	; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,0,0,0]			; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,0,0,0]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: shuffle_v16i16_00_00_00_00_00_00_09_00_00_00_00_00_00_00_00_00:			; AVX2-LABEL: shuffle_v16i16_00_00_00_00_00_00_09_00_00_00_00_00_00_00_00_00:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm0[2,3,0,1]			; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm0[2,3,0,1]
	; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = <0,0,255,255,u,u,u,u,u,u,u,u,u,u,u,u,255,255,u,u,u,u,u,u,u,u,u,u,u,u,u,u>			; AVX2-NEXT: vpblendw {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3,4,5,6,7],ymm0[8],ymm1[9,10,11,12,13,14,15]
	; AVX2-NEXT: vpblendvb %ymm2, %ymm1, %ymm0, %ymm0			; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm1[4,5,6,7]
	; AVX2-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[0,1,0,1,0,1,0,1,0,1,0,1,2,3,0,1,16,17,16,17,16,17,16,17,16,17,16,17,16,17,16,17]			; AVX2-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[0,1,0,1,0,1,0,1,0,1,0,1,2,3,0,1,16,17,16,17,16,17,16,17,16,17,16,17,16,17,16,17]
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512VL-LABEL: shuffle_v16i16_00_00_00_00_00_00_09_00_00_00_00_00_00_00_00_00:			; AVX512VL-LABEL: shuffle_v16i16_00_00_00_00_00_00_09_00_00_00_00_00_00_00_00_00:
	; AVX512VL: # %bb.0:			; AVX512VL: # %bb.0:
	; AVX512VL-NEXT: vmovdqa {{.*#+}} ymm1 = [0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0]			; AVX512VL-NEXT: vmovdqa {{.*#+}} ymm1 = [0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0]
	; AVX512VL-NEXT: vpermw %ymm0, %ymm1, %ymm0			; AVX512VL-NEXT: vpermw %ymm0, %ymm1, %ymm0
	; AVX512VL-NEXT: retq			; AVX512VL-NEXT: retq
	[621 lines hidden]
	; AVX1-LABEL: shuffle_v16i16_00_01_02_03_04_05_06_07_08_09_10_11_12_13_14_31:			; AVX1-LABEL: shuffle_v16i16_00_01_02_03_04_05_06_07_08_09_10_11_12_13_14_31:
	; AVX1: # %bb.0:			; AVX1: # %bb.0:
	; AVX1-NEXT: vmovaps {{.*#+}} ymm2 = [65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,0]			; AVX1-NEXT: vmovaps {{.*#+}} ymm2 = [65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,0]
	; AVX1-NEXT: vandnps %ymm1, %ymm2, %ymm1			; AVX1-NEXT: vandnps %ymm1, %ymm2, %ymm1
	; AVX1-NEXT: vandps %ymm2, %ymm0, %ymm0			; AVX1-NEXT: vandps %ymm2, %ymm0, %ymm0
	; AVX1-NEXT: vorps %ymm1, %ymm0, %ymm0			; AVX1-NEXT: vorps %ymm1, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: shuffle_v16i16_00_01_02_03_04_05_06_07_08_09_10_11_12_13_14_31:			; AVX2OR512VL-LABEL: shuffle_v16i16_00_01_02_03_04_05_06_07_08_09_10_11_12_13_14_31:
	; AVX2: # %bb.0:			; AVX2OR512VL: # %bb.0:
	; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,0,0]			; AVX2OR512VL-NEXT: vpblendw {{.*#+}} ymm1 = ymm0[0,1,2,3,4,5,6],ymm1[7],ymm0[8,9,10,11,12,13,14],ymm1[15]
	; AVX2-NEXT: vpblendvb %ymm2, %ymm0, %ymm1, %ymm0			; AVX2OR512VL-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm1[4,5,6,7]
	; AVX2-NEXT: retq			; AVX2OR512VL-NEXT: retq
	;
	; AVX512VL-LABEL: shuffle_v16i16_00_01_02_03_04_05_06_07_08_09_10_11_12_13_14_31:
	; AVX512VL: # %bb.0:
	; AVX512VL-NEXT: movw $-32768, %ax # imm = 0x8000
	; AVX512VL-NEXT: kmovd %eax, %k1
	; AVX512VL-NEXT: vmovdqu16 %ymm1, %ymm0 {%k1}
	; AVX512VL-NEXT: retq
	%shuffle = shufflevector <16 x i16> %a, <16 x i16> %b, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 31>			%shuffle = shufflevector <16 x i16> %a, <16 x i16> %b, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 31>
	ret <16 x i16> %shuffle			ret <16 x i16> %shuffle
	}			}

	define <16 x i16> @shuffle_v16i16_16_01_02_03_04_05_06_07_08_09_10_11_12_13_14_15(<16 x i16> %a, <16 x i16> %b) {			define <16 x i16> @shuffle_v16i16_16_01_02_03_04_05_06_07_08_09_10_11_12_13_14_15(<16 x i16> %a, <16 x i16> %b) {
	; AVX1-LABEL: shuffle_v16i16_16_01_02_03_04_05_06_07_08_09_10_11_12_13_14_15:			; AVX1-LABEL: shuffle_v16i16_16_01_02_03_04_05_06_07_08_09_10_11_12_13_14_15:
	; AVX1: # %bb.0:			; AVX1: # %bb.0:
	; AVX1-NEXT: vmovaps {{.*#+}} ymm2 = [0,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535]			; AVX1-NEXT: vmovaps {{.*#+}} ymm2 = [0,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535]
	; AVX1-NEXT: vandnps %ymm1, %ymm2, %ymm1			; AVX1-NEXT: vandnps %ymm1, %ymm2, %ymm1
	; AVX1-NEXT: vandps %ymm2, %ymm0, %ymm0			; AVX1-NEXT: vandps %ymm2, %ymm0, %ymm0
	; AVX1-NEXT: vorps %ymm1, %ymm0, %ymm0			; AVX1-NEXT: vorps %ymm1, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: shuffle_v16i16_16_01_02_03_04_05_06_07_08_09_10_11_12_13_14_15:			; AVX2OR512VL-LABEL: shuffle_v16i16_16_01_02_03_04_05_06_07_08_09_10_11_12_13_14_15:
	; AVX2: # %bb.0:			; AVX2OR512VL: # %bb.0:
	; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = [0,0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]			; AVX2OR512VL-NEXT: vpblendw {{.*#+}} ymm1 = ymm1[0],ymm0[1,2,3,4,5,6,7],ymm1[8],ymm0[9,10,11,12,13,14,15]
	; AVX2-NEXT: vpblendvb %ymm2, %ymm0, %ymm1, %ymm0			; AVX2OR512VL-NEXT: vpblendd {{.*#+}} ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7]
	; AVX2-NEXT: retq			; AVX2OR512VL-NEXT: retq
	;
	; AVX512VL-LABEL: shuffle_v16i16_16_01_02_03_04_05_06_07_08_09_10_11_12_13_14_15:
	; AVX512VL: # %bb.0:
	; AVX512VL-NEXT: movw $1, %ax
	; AVX512VL-NEXT: kmovd %eax, %k1
	; AVX512VL-NEXT: vmovdqu16 %ymm1, %ymm0 {%k1}
	; AVX512VL-NEXT: retq
	%shuffle = shufflevector <16 x i16> %a, <16 x i16> %b, <16 x i32> <i32 16, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>			%shuffle = shufflevector <16 x i16> %a, <16 x i16> %b, <16 x i32> <i32 16, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
	ret <16 x i16> %shuffle			ret <16 x i16> %shuffle
	}			}

	define <16 x i16> @shuffle_v16i16_00_17_02_19_04_21_06_23_24_09_26_11_28_13_30_15(<16 x i16> %a, <16 x i16> %b) {			define <16 x i16> @shuffle_v16i16_00_17_02_19_04_21_06_23_24_09_26_11_28_13_30_15(<16 x i16> %a, <16 x i16> %b) {
	; AVX1-LABEL: shuffle_v16i16_00_17_02_19_04_21_06_23_24_09_26_11_28_13_30_15:			; AVX1-LABEL: shuffle_v16i16_00_17_02_19_04_21_06_23_24_09_26_11_28_13_30_15:
	; AVX1: # %bb.0:			; AVX1: # %bb.0:
	; AVX1-NEXT: vmovaps {{.*#+}} ymm2 = [65535,0,65535,0,65535,0,65535,0,0,65535,0,65535,0,65535,0,65535]			; AVX1-NEXT: vmovaps {{.*#+}} ymm2 = [65535,0,65535,0,65535,0,65535,0,0,65535,0,65535,0,65535,0,65535]
	[3,548 lines hidden]
	; AVX1-NEXT: vandps %ymm0, %ymm1, %ymm0			; AVX1-NEXT: vandps %ymm0, %ymm1, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: PR34369:			; AVX2-LABEL: PR34369:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vpermq {{.*#+}} ymm2 = ymm0[2,3,0,1]			; AVX2-NEXT: vpermq {{.*#+}} ymm2 = ymm0[2,3,0,1]
	; AVX2-NEXT: vpshufb {{.*#+}} xmm2 = xmm2[8,9,10,11,4,5,10,11,8,9,10,11,4,5,4,5]			; AVX2-NEXT: vpshufb {{.*#+}} xmm2 = xmm2[8,9,10,11,4,5,10,11,8,9,10,11,4,5,4,5]
	; AVX2-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[6,7,0,1,0,1,u,u,10,11,4,5,4,5,u,u,30,31,16,17,28,29,16,17,18,19,20,21,24,25,24,25]			; AVX2-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[6,7,0,1,0,1,u,u,10,11,4,5,4,5,u,u,30,31,16,17,28,29,16,17,18,19,20,21,24,25,24,25]
	; AVX2-NEXT: vmovdqa {{.*#+}} ymm3 = [255,255,255,255,255,255,0,0,255,255,255,255,255,255,0,0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]			; AVX2-NEXT: vpblendw {{.*#+}} ymm2 = ymm0[0,1,2],ymm2[3],ymm0[4,5,6],ymm2[7],ymm0[8,9,10],ymm2[11],ymm0[12,13,14],ymm2[15]
	; AVX2-NEXT: vpblendvb %ymm3, %ymm0, %ymm2, %ymm0			; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm2[0,1,2,3],ymm0[4,5,6,7]
	; AVX2-NEXT: vpxor %xmm2, %xmm2, %xmm2			; AVX2-NEXT: vpxor %xmm2, %xmm2, %xmm2
	; AVX2-NEXT: vpcmpeqw %ymm2, %ymm1, %ymm1			; AVX2-NEXT: vpcmpeqw %ymm2, %ymm1, %ymm1
	; AVX2-NEXT: vpand %ymm0, %ymm1, %ymm0			; AVX2-NEXT: vpand %ymm0, %ymm1, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512VL-LABEL: PR34369:			; AVX512VL-LABEL: PR34369:
	; AVX512VL: # %bb.0:			; AVX512VL: # %bb.0:
	; AVX512VL-NEXT: vmovdqa {{.*#+}} ymm2 = [3,0,0,13,5,2,2,10,15,8,14,8,9,10,12,12]			; AVX512VL-NEXT: vmovdqa {{.*#+}} ymm2 = [3,0,0,13,5,2,2,10,15,8,14,8,9,10,12,12]
	[98 lines hidden]

llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v32.ll

	[383 lines hidden]
	; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]			; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
	; AVX1-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0]			; AVX1-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0]
	; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: shuffle_v32i8_00_00_00_00_00_00_00_00_00_00_00_00_00_18_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00:			; AVX2-LABEL: shuffle_v32i8_00_00_00_00_00_00_00_00_00_00_00_00_00_18_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm0[2,3,0,1]			; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm0[2,3,0,1]
	; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = <0,0,255,255,u,u,u,u,u,u,u,u,u,u,u,u,255,255,u,u,u,u,u,u,u,u,u,u,u,u,u,u>			; AVX2-NEXT: vpblendw {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3,4,5,6,7],ymm0[8],ymm1[9,10,11,12,13,14,15]
	; AVX2-NEXT: vpblendvb %ymm2, %ymm1, %ymm0, %ymm0			; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm1[4,5,6,7]
	; AVX2-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]			; AVX2-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512VLBW-LABEL: shuffle_v32i8_00_00_00_00_00_00_00_00_00_00_00_00_00_18_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00:			; AVX512VLBW-LABEL: shuffle_v32i8_00_00_00_00_00_00_00_00_00_00_00_00_00_18_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00:
	; AVX512VLBW: # %bb.0:			; AVX512VLBW: # %bb.0:
	; AVX512VLBW-NEXT: vpermq {{.*#+}} ymm1 = ymm0[2,3,0,1]			; AVX512VLBW-NEXT: vmovdqa {{.*#+}} ymm1 = [0,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7]
	; AVX512VLBW-NEXT: movw $1, %ax			; AVX512VLBW-NEXT: vpermw %ymm0, %ymm1, %ymm0
	; AVX512VLBW-NEXT: kmovd %eax, %k1			; AVX512VLBW-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]
	; AVX512VLBW-NEXT: vmovdqu16 %ymm0, %ymm1 {%k1}
	; AVX512VLBW-NEXT: vpshufb {{.*#+}} ymm0 = ymm1[0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]
	; AVX512VLBW-NEXT: retq			; AVX512VLBW-NEXT: retq
	;			;
	; AVX512VLVBMI-LABEL: shuffle_v32i8_00_00_00_00_00_00_00_00_00_00_00_00_00_18_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00:			; AVX512VLVBMI-LABEL: shuffle_v32i8_00_00_00_00_00_00_00_00_00_00_00_00_00_18_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00:
	; AVX512VLVBMI: # %bb.0:			; AVX512VLVBMI: # %bb.0:
	; AVX512VLVBMI-NEXT: vmovdqa {{.*#+}} ymm1 = [0,0,0,0,0,0,0,0,0,0,0,0,0,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]			; AVX512VLVBMI-NEXT: vmovdqa {{.*#+}} ymm1 = [0,0,0,0,0,0,0,0,0,0,0,0,0,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
	; AVX512VLVBMI-NEXT: vpermb %ymm0, %ymm1, %ymm0			; AVX512VLVBMI-NEXT: vpermb %ymm0, %ymm1, %ymm0
	; AVX512VLVBMI-NEXT: retq			; AVX512VLVBMI-NEXT: retq
	%shuffle = shufflevector <32 x i8> %a, <32 x i8> %b, <32 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 18, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>			%shuffle = shufflevector <32 x i8> %a, <32 x i8> %b, <32 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 18, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
	[... 9 lines not shown ...]
	; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]			; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
	; AVX1-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,0,0,0,0,0,0,0,0,0,0,0,7,0,0,0]			; AVX1-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,0,0,0,0,0,0,0,0,0,0,0,7,0,0,0]
	; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: shuffle_v32i8_00_00_00_00_00_00_00_00_00_00_00_00_19_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00:			; AVX2-LABEL: shuffle_v32i8_00_00_00_00_00_00_00_00_00_00_00_00_19_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm0[2,3,0,1]			; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm0[2,3,0,1]
	; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = <0,0,255,255,u,u,u,u,u,u,u,u,u,u,u,u,255,255,u,u,u,u,u,u,u,u,u,u,u,u,u,u>			; AVX2-NEXT: vpblendw {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3,4,5,6,7],ymm0[8],ymm1[9,10,11,12,13,14,15]
	; AVX2-NEXT: vpblendvb %ymm2, %ymm1, %ymm0, %ymm0			; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm1[4,5,6,7]
	; AVX2-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]			; AVX2-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512VLBW-LABEL: shuffle_v32i8_00_00_00_00_00_00_00_00_00_00_00_00_19_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00:			; AVX512VLBW-LABEL: shuffle_v32i8_00_00_00_00_00_00_00_00_00_00_00_00_19_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00:
	; AVX512VLBW: # %bb.0:			; AVX512VLBW: # %bb.0:
	; AVX512VLBW-NEXT: vpermq {{.*#+}} ymm1 = ymm0[2,3,0,1]			; AVX512VLBW-NEXT: vmovdqa {{.*#+}} ymm1 = [0,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7]
	; AVX512VLBW-NEXT: movw $1, %ax			; AVX512VLBW-NEXT: vpermw %ymm0, %ymm1, %ymm0
	; AVX512VLBW-NEXT: kmovd %eax, %k1			; AVX512VLBW-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]
	; AVX512VLBW-NEXT: vmovdqu16 %ymm0, %ymm1 {%k1}
	; AVX512VLBW-NEXT: vpshufb {{.*#+}} ymm0 = ymm1[0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]
	; AVX512VLBW-NEXT: retq			; AVX512VLBW-NEXT: retq
	;			;
	; AVX512VLVBMI-LABEL: shuffle_v32i8_00_00_00_00_00_00_00_00_00_00_00_00_19_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00:			; AVX512VLVBMI-LABEL: shuffle_v32i8_00_00_00_00_00_00_00_00_00_00_00_00_19_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00:
	; AVX512VLVBMI: # %bb.0:			; AVX512VLVBMI: # %bb.0:
	; AVX512VLVBMI-NEXT: vmovdqa {{.*#+}} ymm1 = [0,0,0,0,0,0,0,0,0,0,0,0,19,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]			; AVX512VLVBMI-NEXT: vmovdqa {{.*#+}} ymm1 = [0,0,0,0,0,0,0,0,0,0,0,0,19,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
	; AVX512VLVBMI-NEXT: vpermb %ymm0, %ymm1, %ymm0			; AVX512VLVBMI-NEXT: vpermb %ymm0, %ymm1, %ymm0
	; AVX512VLVBMI-NEXT: retq			; AVX512VLVBMI-NEXT: retq
	%shuffle = shufflevector <32 x i8> %a, <32 x i8> %b, <32 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 19, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>			%shuffle = shufflevector <32 x i8> %a, <32 x i8> %b, <32 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 19, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
	[... 2,453 lines not shown ...]

llvm/trunk/test/CodeGen/X86/vector-shuffle-512-v32.ll

[... 34 lines not shown ...]
; SKX-NEXT: retq
%c = shufflevector <32 x i16> %a, <32 x i16> undef, <32 x i32> <i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8>		%c = shufflevector <32 x i16> %a, <32 x i16> undef, <32 x i32> <i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8>
ret <32 x i16> %c		ret <32 x i16> %c
}		}

define <32 x i16> @shuffle_v32i16_02_05_u_u_07_u_0a_01_00_05_u_04_07_u_0a_01_02_05_u_u_07_u_0a_01_00_05_u_04_07_u_0a_1f(<32 x i16> %a) {		define <32 x i16> @shuffle_v32i16_02_05_u_u_07_u_0a_01_00_05_u_04_07_u_0a_01_02_05_u_u_07_u_0a_01_00_05_u_04_07_u_0a_1f(<32 x i16> %a) {
; KNL-LABEL: shuffle_v32i16_02_05_u_u_07_u_0a_01_00_05_u_04_07_u_0a_01_02_05_u_u_07_u_0a_01_00_05_u_04_07_u_0a_1f:		; KNL-LABEL: shuffle_v32i16_02_05_u_u_07_u_0a_01_00_05_u_04_07_u_0a_01_02_05_u_u_07_u_0a_01_00_05_u_04_07_u_0a_1f:
; KNL: ## %bb.0:		; KNL: ## %bb.0:
; KNL-NEXT: vpshufb {{.*#+}} ymm2 = ymm0[4,5,10,11,4,5,6,7,14,15,2,3,4,5,2,3,20,21,26,27,20,21,22,23,30,31,18,19,20,21,18,19]		; KNL-NEXT: vpshufb {{.*#+}} ymm2 = ymm0[4,5,10,11,4,5,6,7,14,15,2,3,4,5,2,3,20,21,26,27,20,21,22,23,30,31,18,19,20,21,18,19]
; KNL-NEXT: vpermq {{.*#+}} ymm3 = ymm0[2,3,0,1]		; KNL-NEXT: vpermq {{.*#+}} ymm0 = ymm0[2,3,0,1]
; KNL-NEXT: vpshufb {{.*#+}} ymm0 = ymm3[0,1,10,11,8,9,8,9,14,15,2,3,4,5,2,3,16,17,26,27,24,25,24,25,30,31,18,19,20,21,18,19]		; KNL-NEXT: vpshufb {{.*#+}} ymm3 = ymm0[0,1,10,11,8,9,8,9,14,15,6,7,4,5,14,15,16,17,26,27,24,25,24,25,30,31,22,23,20,21,30,31]
; KNL-NEXT: vmovdqa {{.*#+}} ymm4 = <0,0,0,0,u,u,u,u,0,0,u,u,255,255,0,0,255,255,255,255,u,u,255,255,255,255,u,u,0,0,255,255>
; KNL-NEXT: vpblendvb %ymm4, %ymm0, %ymm2, %ymm0
; KNL-NEXT: vpshufb {{.*#+}} ymm3 = ymm3[0,1,10,11,8,9,8,9,14,15,6,7,4,5,14,15,16,17,26,27,24,25,24,25,30,31,22,23,20,21,30,31]
; KNL-NEXT: vmovdqa {{.*#+}} ymm4 = <255,255,255,255,u,u,u,u,255,255,u,u,0,0,255,255,0,0,0,0,u,u,0,0,0,0,u,u,255,255,u,u>		; KNL-NEXT: vmovdqa {{.*#+}} ymm4 = <255,255,255,255,u,u,u,u,255,255,u,u,0,0,255,255,0,0,0,0,u,u,0,0,0,0,u,u,255,255,u,u>
; KNL-NEXT: vpblendvb %ymm4, %ymm2, %ymm3, %ymm2		; KNL-NEXT: vpblendvb %ymm4, %ymm2, %ymm3, %ymm3
; KNL-NEXT: vmovdqa {{.*#+}} ymm3 = <255,255,255,255,u,u,u,u,255,255,u,u,255,255,255,255,255,255,255,255,u,u,255,255,255,255,u,u,255,255,0,0>		; KNL-NEXT: vpblendw {{.*#+}} ymm1 = ymm3[0,1,2,3,4,5,6],ymm1[7],ymm3[8,9,10,11,12,13,14],ymm1[15]
; KNL-NEXT: vpblendvb %ymm3, %ymm2, %ymm1, %ymm1		; KNL-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[0,1,10,11,8,9,8,9,14,15,2,3,4,5,2,3,16,17,26,27,24,25,24,25,30,31,18,19,20,21,18,19]
		; KNL-NEXT: vpblendd {{.*#+}} ymm1 = ymm3[0,1,2,3],ymm1[4,5,6,7]
		; KNL-NEXT: vmovdqa {{.*#+}} ymm3 = <0,0,0,0,u,u,u,u,0,0,u,u,255,255,0,0,255,255,255,255,u,u,255,255,255,255,u,u,0,0,255,255>
		; KNL-NEXT: vpblendvb %ymm3, %ymm0, %ymm2, %ymm0
; KNL-NEXT: retq		; KNL-NEXT: retq
;		;
; SKX-LABEL: shuffle_v32i16_02_05_u_u_07_u_0a_01_00_05_u_04_07_u_0a_01_02_05_u_u_07_u_0a_01_00_05_u_04_07_u_0a_1f:		; SKX-LABEL: shuffle_v32i16_02_05_u_u_07_u_0a_01_00_05_u_04_07_u_0a_01_02_05_u_u_07_u_0a_01_00_05_u_04_07_u_0a_1f:
; SKX: ## %bb.0:		; SKX: ## %bb.0:
; SKX-NEXT: vmovdqa64 {{.*#+}} zmm1 = <2,5,u,u,7,u,10,1,0,5,u,4,7,u,10,1,2,5,u,u,7,u,10,1,0,5,u,4,7,u,10,31>		; SKX-NEXT: vmovdqa64 {{.*#+}} zmm1 = <2,5,u,u,7,u,10,1,0,5,u,4,7,u,10,1,2,5,u,u,7,u,10,1,0,5,u,4,7,u,10,31>
; SKX-NEXT: vpermw %zmm0, %zmm1, %zmm0		; SKX-NEXT: vpermw %zmm0, %zmm1, %zmm0
; SKX-NEXT: retq		; SKX-NEXT: retq
%c = shufflevector <32 x i16> %a, <32 x i16> undef, <32 x i32> <i32 2, i32 5, i32 undef, i32 undef, i32 7, i32 undef, i32 10, i32 1, i32 0, i32 5, i32 undef, i32 4, i32 7, i32 undef, i32 10, i32 1, i32 2, i32 5, i32 undef, i32 undef, i32 7, i32 undef, i32 10, i32 1, i32 0, i32 5, i32 undef, i32 4, i32 7, i32 undef, i32 10, i32 31>		%c = shufflevector <32 x i16> %a, <32 x i16> undef, <32 x i32> <i32 2, i32 5, i32 undef, i32 undef, i32 7, i32 undef, i32 10, i32 1, i32 0, i32 5, i32 undef, i32 4, i32 7, i32 undef, i32 10, i32 1, i32 2, i32 5, i32 undef, i32 undef, i32 7, i32 undef, i32 10, i32 1, i32 0, i32 5, i32 undef, i32 4, i32 7, i32 undef, i32 10, i32 31>
ret <32 x i16> %c		ret <32 x i16> %c
}		}

define <32 x i16> @shuffle_v32i16_0f_1f_0e_16_0d_1d_04_1e_0b_1b_0a_1a_09_19_08_18_0f_1f_0e_16_0d_1d_04_1e_0b_1b_0a_1a_09_19_08_38(<32 x i16> %a, <32 x i16> %b) {		define <32 x i16> @shuffle_v32i16_0f_1f_0e_16_0d_1d_04_1e_0b_1b_0a_1a_09_19_08_18_0f_1f_0e_16_0d_1d_04_1e_0b_1b_0a_1a_09_19_08_38(<32 x i16> %a, <32 x i16> %b) {
; KNL-LABEL: shuffle_v32i16_0f_1f_0e_16_0d_1d_04_1e_0b_1b_0a_1a_09_19_08_18_0f_1f_0e_16_0d_1d_04_1e_0b_1b_0a_1a_09_19_08_38:		; KNL-LABEL: shuffle_v32i16_0f_1f_0e_16_0d_1d_04_1e_0b_1b_0a_1a_09_19_08_18_0f_1f_0e_16_0d_1d_04_1e_0b_1b_0a_1a_09_19_08_38:
; KNL: ## %bb.0:		; KNL: ## %bb.0:
; KNL-NEXT: vpermq {{.*#+}} ymm2 = ymm1[2,3,0,1]		; KNL-NEXT: vpermq {{.*#+}} ymm2 = ymm1[2,3,0,1]
; KNL-NEXT: vpblendw {{.*#+}} ymm2 = ymm1[0,1,2,3],ymm2[4,5],ymm1[6],ymm2[7],ymm1[8,9,10,11],ymm2[12,13],ymm1[14],ymm2[15]		; KNL-NEXT: vpblendw {{.*#+}} ymm2 = ymm1[0,1,2,3],ymm2[4,5],ymm1[6],ymm2[7],ymm1[8,9,10,11],ymm2[12,13],ymm1[14],ymm2[15]
; KNL-NEXT: vpshufb {{.*#+}} ymm1 = ymm2[u,u,14,15,u,u,12,13,u,u,10,11,u,u,8,9,u,u,22,23,u,u,20,21,u,u,18,19,u,u,u,u]		; KNL-NEXT: vpshufb {{.*#+}} ymm1 = ymm2[u,u,14,15,u,u,12,13,u,u,10,11,u,u,8,9,u,u,22,23,u,u,20,21,u,u,18,19,u,u,u,u]
; KNL-NEXT: vpermq {{.*#+}} ymm4 = ymm0[2,3,0,1]		; KNL-NEXT: vpermq {{.*#+}} ymm4 = ymm0[2,3,0,1]
; KNL-NEXT: vpblendw {{.*#+}} ymm0 = ymm0[0,1,2,3,4],ymm4[5,6,7],ymm0[8,9,10,11,12],ymm4[13,14,15]		; KNL-NEXT: vpblendw {{.*#+}} ymm0 = ymm0[0,1,2,3,4],ymm4[5,6,7],ymm0[8,9,10,11,12],ymm4[13,14,15]
; KNL-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[14,15,u,u,12,13,u,u,10,11,u,u,8,9,u,u,22,23,u,u,20,21,u,u,18,19,u,u,16,17,u,u]		; KNL-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[14,15,u,u,12,13,u,u,10,11,u,u,8,9,u,u,22,23,u,u,20,21,u,u,18,19,u,u,16,17,u,u]
; KNL-NEXT: vpblendw {{.*#+}} ymm1 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7],ymm0[8],ymm1[9],ymm0[10],ymm1[11],ymm0[12],ymm1[13],ymm0[14],ymm1[15]		; KNL-NEXT: vpblendw {{.*#+}} ymm1 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7],ymm0[8],ymm1[9],ymm0[10],ymm1[11],ymm0[12],ymm1[13],ymm0[14],ymm1[15]
; KNL-NEXT: vextracti128 $1, %ymm3, %xmm3		; KNL-NEXT: vextracti128 $1, %ymm3, %xmm3
; KNL-NEXT: vpbroadcastw %xmm3, %ymm3		; KNL-NEXT: vpbroadcastw %xmm3, %ymm3
; KNL-NEXT: vmovdqa {{.*#+}} ymm4 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,0,0]		; KNL-NEXT: vpblendw {{.*#+}} ymm3 = ymm1[0,1,2,3,4,5,6],ymm3[7],ymm1[8,9,10,11,12,13,14],ymm3[15]
; KNL-NEXT: vpblendvb %ymm4, %ymm1, %ymm3, %ymm1		; KNL-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2,3],ymm3[4,5,6,7]
; KNL-NEXT: vpshufb {{.*#+}} ymm2 = ymm2[u,u,14,15,u,u,12,13,u,u,10,11,u,u,8,9,u,u,22,23,u,u,20,21,u,u,18,19,u,u,16,17]		; KNL-NEXT: vpshufb {{.*#+}} ymm2 = ymm2[u,u,14,15,u,u,12,13,u,u,10,11,u,u,8,9,u,u,22,23,u,u,20,21,u,u,18,19,u,u,16,17]
; KNL-NEXT: vpblendw {{.*#+}} ymm0 = ymm0[0],ymm2[1],ymm0[2],ymm2[3],ymm0[4],ymm2[5],ymm0[6],ymm2[7],ymm0[8],ymm2[9],ymm0[10],ymm2[11],ymm0[12],ymm2[13],ymm0[14],ymm2[15]		; KNL-NEXT: vpblendw {{.*#+}} ymm0 = ymm0[0],ymm2[1],ymm0[2],ymm2[3],ymm0[4],ymm2[5],ymm0[6],ymm2[7],ymm0[8],ymm2[9],ymm0[10],ymm2[11],ymm0[12],ymm2[13],ymm0[14],ymm2[15]
; KNL-NEXT: retq		; KNL-NEXT: retq
;		;
; SKX-LABEL: shuffle_v32i16_0f_1f_0e_16_0d_1d_04_1e_0b_1b_0a_1a_09_19_08_18_0f_1f_0e_16_0d_1d_04_1e_0b_1b_0a_1a_09_19_08_38:		; SKX-LABEL: shuffle_v32i16_0f_1f_0e_16_0d_1d_04_1e_0b_1b_0a_1a_09_19_08_18_0f_1f_0e_16_0d_1d_04_1e_0b_1b_0a_1a_09_19_08_38:
; SKX: ## %bb.0:		; SKX: ## %bb.0:
; SKX-NEXT: vmovdqa64 {{.*#+}} zmm2 = [15,31,14,22,13,29,4,28,11,27,10,26,9,25,8,24,15,31,14,22,13,29,4,28,11,27,10,26,9,25,8,56]		; SKX-NEXT: vmovdqa64 {{.*#+}} zmm2 = [15,31,14,22,13,29,4,28,11,27,10,26,9,25,8,24,15,31,14,22,13,29,4,28,11,27,10,26,9,25,8,56]
; SKX-NEXT: vpermt2w %zmm1, %zmm2, %zmm0		; SKX-NEXT: vpermt2w %zmm1, %zmm2, %zmm0
[... 286 lines not shown ...]

llvm/trunk/test/CodeGen/X86/vector-shuffle-v48.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=x86_64-pc-linux -mattr=+avx2 < %s | FileCheck %s			; RUN: llc -mtriple=x86_64-pc-linux -mattr=+avx2 < %s | FileCheck %s
	define <32 x i8> @foo(<48 x i8>* %x0, <16 x i32> %x1, <16 x i32> %x2) {			define <32 x i8> @foo(<48 x i8>* %x0, <16 x i32> %x1, <16 x i32> %x2) {
	; CHECK-LABEL: foo:			; CHECK-LABEL: foo:
	; CHECK: # %bb.0:			; CHECK: # %bb.0:
	; CHECK-NEXT: vmovdqu 32(%rdi), %xmm0			; CHECK-NEXT: vmovdqu 32(%rdi), %xmm0
	; CHECK-NEXT: vmovdqu (%rdi), %ymm1			; CHECK-NEXT: vmovdqu (%rdi), %ymm1
	; CHECK-NEXT: vpermq {{.*#+}} ymm2 = ymm1[2,3,0,1]			; CHECK-NEXT: vpermq {{.*#+}} ymm2 = ymm1[2,3,0,1]
	; CHECK-NEXT: vpshufb {{.*#+}} xmm2 = xmm2[u,u,u,u,u,u,u,u,u,u,u,0,2,3,5,6]			; CHECK-NEXT: vpshufb {{.*#+}} xmm2 = xmm2[u,u,u,u,u,u,u,u,u,u,u,0,2,3,5,6]
	; CHECK-NEXT: vpshufb {{.*#+}} ymm1 = ymm1[0,1,3,4,6,7,9,10,12,13,15,u,u,u,u,u,24,25,27,28,30,31,u,u,u,u,u,u,u,u,u,u]			; CHECK-NEXT: vpshufb {{.*#+}} ymm1 = ymm1[0,1,3,4,6,7,9,10,12,13,15,u,u,u,u,u,24,25,27,28,30,31,u,u,u,u,u,u,u,u,u,u]
	; CHECK-NEXT: vmovdqa {{.*#+}} ymm3 = <255,255,255,255,255,255,255,255,255,255,255,0,0,0,0,0,255,255,255,255,255,255,u,u,u,u,u,u,u,u,u,u>			; CHECK-NEXT: vmovdqa {{.*#+}} ymm3 = <255,255,255,255,255,255,255,255,255,255,255,0,0,0,0,0,255,255,255,255,255,255,u,u,u,u,u,u,u,u,u,u>
	; CHECK-NEXT: vpblendvb %ymm3, %ymm1, %ymm2, %ymm1			; CHECK-NEXT: vpblendvb %ymm3, %ymm1, %ymm2, %ymm1
	; CHECK-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[u,u,u,u,u,u,1,2,4,5,7,8,10,11,13,14]			; CHECK-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[u,u,u,u,u,u,1,2,4,5,7,8,10,11,13,14]
	; CHECK-NEXT: vinserti128 $1, %xmm0, %ymm0, %ymm0			; CHECK-NEXT: vinserti128 $1, %xmm0, %ymm0, %ymm0
	; CHECK-NEXT: vmovdqa {{.*#+}} ymm2 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,0,0,0,0,0,0,0,0,0,0]			; CHECK-NEXT: vpblendw {{.*#+}} ymm0 = ymm1[0,1,2],ymm0[3,4,5,6,7],ymm1[8,9,10],ymm0[11,12,13,14,15]
	; CHECK-NEXT: vpblendvb %ymm2, %ymm1, %ymm0, %ymm0			; CHECK-NEXT: vpblendd {{.*#+}} ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7]
	; CHECK-NEXT: retq			; CHECK-NEXT: retq
	%1 = load <48 x i8>, <48 x i8>* %x0, align 1			%1 = load <48 x i8>, <48 x i8>* %x0, align 1
	%2 = shufflevector <48 x i8> %1, <48 x i8> undef, <32 x i32> <i32 0, i32 1, i32 3, i32 4, i32 6, i32 7, i32 9, i32 10, i32 12, i32 13, i32 15, i32 16, i32 18, i32 19, i32 21, i32 22, i32 24, i32 25, i32 27, i32 28, i32 30, i32 31, i32 33, i32 34, i32 36, i32 37, i32 39, i32 40, i32 42, i32 43, i32 45, i32 46>			%2 = shufflevector <48 x i8> %1, <48 x i8> undef, <32 x i32> <i32 0, i32 1, i32 3, i32 4, i32 6, i32 7, i32 9, i32 10, i32 12, i32 13, i32 15, i32 16, i32 18, i32 19, i32 21, i32 22, i32 24, i32 25, i32 27, i32 28, i32 30, i32 31, i32 33, i32 34, i32 36, i32 37, i32 39, i32 40, i32 42, i32 43, i32 45, i32 46>
	ret <32 x i8> %2			ret <32 x i8> %2
	}			}
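
The updated AVX2/KNL checks above replace a constant-mask vpblendvb with a vpblendw followed by a vpblendd. The following is a small scalar model of why that pair can reproduce a word-granular v16i16 blend. It is only a sketch, not the code in X86ISelLowering.cpp: the names `blendw`, `blendd`, `blendRef`, `blendTwoStep` and `verifyAllMasks` are hypothetical, and the applicability condition (one 128-bit lane takes all of its words from a single source) is an assumption inferred from the check lines, not quoted from the patch.

```cpp
#include <array>
#include <cstdint>
#include <optional>

using V16 = std::array<uint16_t, 16>; // one ymm register viewed as 16 words

// Model of VPBLENDW ymm: the 8-bit immediate is reused for both 128-bit
// lanes. A set bit selects the word from b, a clear bit from a.
V16 blendw(const V16 &a, const V16 &b, uint8_t imm) {
  V16 r{};
  for (int i = 0; i < 16; ++i)
    r[i] = ((imm >> (i % 8)) & 1) ? b[i] : a[i];
  return r;
}

// Model of VPBLENDD ymm: bit k of the immediate selects dword k
// (words 2k and 2k+1) from b.
V16 blendd(const V16 &a, const V16 &b, uint8_t imm) {
  V16 r{};
  for (int i = 0; i < 16; ++i)
    r[i] = ((imm >> (i / 2)) & 1) ? b[i] : a[i];
  return r;
}

// Reference blend with an independent select bit per word, i.e. what a
// VPBLENDVB with a word-granular constant mask computes.
V16 blendRef(const V16 &a, const V16 &b, uint16_t mask) {
  V16 r{};
  for (int i = 0; i < 16; ++i)
    r[i] = ((mask >> i) & 1) ? b[i] : a[i];
  return r;
}

// Immediate-blend lowering (hypothetical helper). Returns nullopt when
// neither lane is taken wholesale from one source and the lanes differ.
std::optional<V16> blendTwoStep(const V16 &a, const V16 &b, uint16_t mask) {
  const uint8_t lo = mask & 0xFF; // select bits for words 0..7
  const uint8_t hi = mask >> 8;   // select bits for words 8..15
  if (lo == hi)   return blendw(a, b, lo);                     // one vpblendw
  if (hi == 0x00) return blendd(blendw(a, b, lo), a, 0xF0);    // high lane = a
  if (hi == 0xFF) return blendd(blendw(a, b, lo), b, 0xF0);    // high lane = b
  if (lo == 0x00) return blendd(a, blendw(a, b, hi), 0xF0);    // low lane = a
  if (lo == 0xFF) return blendd(b, blendw(a, b, hi), 0xF0);    // low lane = b
  return std::nullopt;
}

// Exhaustively compare the two-step result with the reference blend for
// every 16-bit mask the two-step form accepts.
bool verifyAllMasks() {
  V16 a{}, b{};
  for (int i = 0; i < 16; ++i) {
    a[i] = static_cast<uint16_t>(i);
    b[i] = static_cast<uint16_t>(100 + i);
  }
  for (uint32_t m = 0; m <= 0xFFFF; ++m) {
    const auto r = blendTwoStep(a, b, static_cast<uint16_t>(m));
    if (r && *r != blendRef(a, b, static_cast<uint16_t>(m)))
      return false;
  }
  return true;
}
```

In this model the mask 0x0088 corresponds to the shape of the PR34369 AVX2 checks: words 3 and 7 of the low lane come from the second source and the whole high lane comes from the first, so one vpblendw with immediate 0x88 plus one vpblendd with immediate 0xF0 is enough and no mask constant has to be loaded.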