This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Teach how to convert SSSE3/AVX2 byte shuffles to builtin shuffles if the shuffle mask is constant.
Closed · Public

Authored by andreadb on Sep 29 2015, 9:51 AM.

Details

Reviewers

spatel
qcolombet
RKSimon
chandlerc

Commits

rG0594e2a1e948: [InstCombine] Teach how to convert SSSE3/AVX2 byte shuffles to builtin shuffles…
rL248913: [InstCombine] Teach how to convert SSSE3/AVX2 byte shuffles to builtin…

Summary

This patch teaches InstCombiner how to convert an SSSE3/AVX2 byte shuffle to a builtin shuffle if the mask is constant.

Converting byte shuffle intrinsic calls to builtin shuffles can help find more opportunities for combining shuffles later on in the selection DAG.

We may end up with byte shuffles with constant masks as the result of inlining.
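
For readers following along, here is a minimal scalar sketch of why the conversion is valid for the 128-bit case. It models pshufb and a two-input shuffle byte by byte and derives a shuffle mask from a constant control vector. The names `pshufb128`, `shuffle2`, and `toShuffleMask` are hypothetical helpers for illustration only; they are not part of the patch or of LLVM.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec16 = std::array<uint8_t, 16>;

// Scalar model of the 128-bit pshufb: if bit 7 of a control byte is set the
// result byte is zero; otherwise the low 4 bits select a source byte.
Vec16 pshufb128(const Vec16 &Src, const std::array<int8_t, 16> &Ctrl) {
  Vec16 Out{};
  for (unsigned I = 0; I < 16; ++I)
    Out[I] = (Ctrl[I] < 0) ? 0 : Src[Ctrl[I] & 0xF];
  return Out;
}

// Scalar model of a two-input shuffle: indices 0-15 read from A and
// indices 16-31 read from B.
Vec16 shuffle2(const Vec16 &A, const Vec16 &B,
               const std::array<uint32_t, 16> &Mask) {
  Vec16 Out{};
  for (unsigned I = 0; I < 16; ++I) {
    assert(Mask[I] < 32 && "shuffle index out of range");
    Out[I] = (Mask[I] < 16) ? A[Mask[I]] : B[Mask[I] - 16];
  }
  return Out;
}

// Translate a constant pshufb control vector into a shuffle mask whose
// second operand is assumed to be an all-zero vector: a zeroing control
// byte becomes index 16, i.e. element 0 of the zero vector.
std::array<uint32_t, 16> toShuffleMask(const std::array<int8_t, 16> &Ctrl) {
  std::array<uint32_t, 16> Mask{};
  for (unsigned I = 0; I < 16; ++I)
    Mask[I] = (Ctrl[I] < 0) ? 16 : (Ctrl[I] & 0xF);
  return Mask;
}
```

Under this model, `pshufb128(Src, Ctrl)` and `shuffle2(Src, Zero, toShuffleMask(Ctrl))` produce the same bytes for any control vector, which is the equivalence the patch relies on.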

Added test x86-pshufb.ll.

Please let me know if ok to submit.

Thanks,
Andrea

Diff Detail

Repository: rL LLVM

Event Timeline

andreadb updated this revision to Diff 35990. Sep 29 2015, 9:51 AM

andreadb retitled this revision to [InstCombine] Teach how to convert SSSE3/AVX2 byte shuffles to builtin shuffles if the shuffle mask is constant.

andreadb updated this object.

andreadb added reviewers: RKSimon, spatel, qcolombet.

andreadb added a subscriber: llvm-commits.

Query about AVX2 instruction.

lib/Transforms/InstCombine/InstCombineCalls.cpp
1184 ↗	(On Diff #35990)	I don't think this is correct: AVX2 pshufb doesn't reuse the lower 128-bit lane; it uses its own lane but with 'local' index values (0-15 representing the 16-31 of a shufflevector). See https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=shuffle_epi8&expand=4700
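
To make the lane-local behaviour concrete, here is a small scalar sketch of the 256-bit byte shuffle as described in the comment above. `pshufb256` is a hypothetical name used only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec32 = std::array<uint8_t, 32>;

// Scalar model of the 256-bit byte shuffle: each 128-bit lane is shuffled
// independently, so the low 4 bits of a control byte index into the lane
// that the destination byte belongs to.
Vec32 pshufb256(const Vec32 &Src, const std::array<int8_t, 32> &Ctrl) {
  Vec32 Out{};
  for (unsigned I = 0; I < 32; ++I) {
    unsigned LaneBase = I & ~0xFu; // 0 for bytes 0-15, 16 for bytes 16-31
    assert(LaneBase == 0 || LaneBase == 16);
    Out[I] = (Ctrl[I] < 0) ? 0 : Src[LaneBase + (Ctrl[I] & 0xF)];
  }
  return Out;
}
```

With an all-zero control vector this model returns byte 0 of the source in the low lane and byte 16 in the high lane, rather than byte 0 everywhere.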

andreadb added inline comments. Sep 29 2015, 10:26 AM

lib/Transforms/InstCombine/InstCombineCalls.cpp
1184 ↗	(On Diff #35990)	Right. This is wrong. I will upload a new patch. Thanks Simon.

Patch updated.
Fixed the AVX2 byte shuffle mask computation (thanks Simon!).

Please let me know if ok to commit.

-Andrea

spatel added inline comments. Sep 29 2015, 1:09 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
1185–1194 ↗	(On Diff #36019)	If you use an int8_t instead of APInt for Index, does this become simpler? Indexes[I] = (Index < 0) ? NumElts : Index & 0xF;
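
A quick way to see why the signed-byte form works: a negative `int8_t` is exactly a byte with bit 7 set, so the sign test and the explicit bit test agree for every control byte. The sketch below compares the two; `indexFromSigned` and `indexFromBits` are hypothetical helpers, not code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Index computation in the suggested form: a negative control byte selects
// the zero vector (index NumElts); otherwise keep only the low 4 bits.
uint32_t indexFromSigned(int8_t Index, unsigned NumElts) {
  assert(NumElts == 16 || NumElts == 32);
  return (Index < 0) ? NumElts : (Index & 0xF);
}

// The same computation written as an explicit test of bit 7.
uint32_t indexFromBits(uint8_t Raw, unsigned NumElts) {
  assert(NumElts == 16 || NumElts == 32);
  return (Raw & 0x80) ? NumElts : (Raw & 0xF);
}
```

Control bytes outside 0-15 that have bit 7 clear, such as 17 or 66, are reduced to their low 4 bits by both forms.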

andreadb added inline comments. Sep 29 2015, 1:32 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
1185–1194 ↗	(On Diff #36019)	Yep. I will change it. Thanks.

CCing chandler as he spent quite a while on shuffles.

Patch Updated.

Simplified code as suggested by Sanjay.

Ok to commit?

A couple of minor things but otherwise LGTM. Thanks Andrea!

lib/Transforms/InstCombine/InstCombineCalls.cpp
1197 ↗	(On Diff #36080)	I'm worried this is hardcoded to the AVX2 '32 element' implementation (so if the AVX512 intrinsic ever gets supported here...) You could possibly merge this into the code above: Indexes[I] = (I & ~0xF) + ((Index < 0) ? NumElts : (Index & 0xF))
test/Transforms/InstCombine/x86-pshufb.ll
2 ↗	(On Diff #36080)	You have a decent range of tests here already - the only addition I can think of would be to show that some 'strange value' pshufb indices (i.e. not -128 or 0-15 or 0-31) get correctly masked based on their bits.
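
As an illustration of the merged formula suggested for line 1197, the sketch below applies it to a control vector of any whole number of 128-bit lanes. `laneAwareMask` is a hypothetical helper, and the 64-element case is only an assumption about how a wider variant might look; the patch itself handles 16 and 32 elements only.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Merged form: the lane base (I & ~0xF) is added in the same expression
// that picks either the zero vector (NumElts) or the low 4 bits.
std::vector<uint32_t> laneAwareMask(const std::vector<int8_t> &Ctrl) {
  const unsigned NumElts = static_cast<unsigned>(Ctrl.size());
  assert(NumElts % 16 == 0 && "expected whole 128-bit lanes");
  std::vector<uint32_t> Mask(NumElts);
  for (unsigned I = 0; I < NumElts; ++I) {
    int8_t Index = Ctrl[I];
    Mask[I] = (I & ~0xFu) + ((Index < 0) ? NumElts : (Index & 0xF));
  }
  return Mask;
}
```

Zeroing indices stay inside the second shuffle operand in this form, because the largest lane base is NumElts - 16 and the sum is therefore below 2 * NumElts.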

This revision is now accepted and ready to land. Sep 30 2015, 6:31 AM

andreadb added inline comments. Sep 30 2015, 6:40 AM

lib/Transforms/InstCombine/InstCombineCalls.cpp
1197 ↗	(On Diff #36080)	Yes, this is done specifically to support the AVX2 implementation. However, I don't think we can merge that into the code above because we also have to handle ConstantAggregateZero pshufb masks.
test/Transforms/InstCombine/x86-pshufb.ll
2 ↗	(On Diff #36080)	Sure, I will add extra tests for it.
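
A small sketch of the ConstantAggregateZero point, assuming the structure described in this thread: a zeroinitializer mask does not take the per-element loop, so the index array stays all zero and only a separate high-lane fix-up runs. `maskForAllZeroControl` is a hypothetical stand-in for that path, not a function in the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Models the zeroinitializer path: no per-element loop runs, so every index
// starts at 0 and only the high-lane fix-up adjusts the result.
std::array<uint32_t, 32> maskForAllZeroControl(unsigned NumElts) {
  assert(NumElts == 16 || NumElts == 32);
  std::array<uint32_t, 32> Indexes{};
  // Add the lane base (16) to every index in the high 128-bit lane.
  for (unsigned I = 16; I < NumElts; ++I)
    Indexes[I] += I & 0xF0;
  return Indexes;
}
```

For 32 elements this gives sixteen 0s followed by sixteen 16s, and for 16 elements it gives all 0s. If the lane offset lived only inside the per-element loop, the 32-element zeroinitializer case would miss it.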

Closed by commit rL248913: [InstCombine] Teach how to convert SSSE3/AVX2 byte shuffles to builtin… (authored by adibiagio). Sep 30 2015, 9:46 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp	41 lines
llvm/trunk/test/Transforms/InstCombine/x86-pshufb.ll	267 lines

Diff 36121

llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp

(1,157 lines not shown)
    if (auto C = dyn_cast<ConstantDataVector>(Mask)) {
        Selectors.push_back(ConstantInt::get(Tyi1, Selector >> (BitWidth - 1)));
      }
      auto NewSelector = ConstantVector::get(Selectors);
      return SelectInst::Create(NewSelector, Op1, Op0, "blendv");
    }
    break;
  }

  case Intrinsic::x86_ssse3_pshuf_b_128:
  case Intrinsic::x86_avx2_pshuf_b: {
    // Turn pshufb(V1,mask) -> shuffle(V1,Zero,mask) if mask is a constant.
    auto *V = II->getArgOperand(1);
    auto *VTy = cast<VectorType>(V->getType());
    unsigned NumElts = VTy->getNumElements();
    assert((NumElts == 16 || NumElts == 32) &&
           "Unexpected number of elements in shuffle mask!");
    // Initialize the resulting shuffle mask to all zeroes.
    uint32_t Indexes[32] = {0};

    if (auto *Mask = dyn_cast<ConstantDataVector>(V)) {
      // Each byte in the shuffle control mask forms an index to permute the
      // corresponding byte in the destination operand.
      for (unsigned I = 0; I < NumElts; ++I) {
        int8_t Index = Mask->getElementAsInteger(I);
        // If the most significant bit (bit[7]) of each byte of the shuffle
        // control mask is set, then zero is written in the result byte.
        // The zero vector is in the right-hand side of the resulting
        // shufflevector.

        // The value of each index is the least significant 4 bits of the
        // shuffle control byte.
        Indexes[I] = (Index < 0) ? NumElts : Index & 0xF;
      }
    } else if (!isa<ConstantAggregateZero>(V))
      break;

    // The value of each index for the high 128-bit lane is the least
    // significant 4 bits of the respective shuffle control byte.
    // Adding the lane base (I & 0xF0) turns that lane-local value into an
    // absolute shufflevector index.
    for (unsigned I = 16; I < NumElts; ++I)
      Indexes[I] += I & 0xF0;

    auto NewC = ConstantDataVector::get(V->getContext(),
                                        makeArrayRef(Indexes, NumElts));
    auto V1 = II->getArgOperand(0);
    auto V2 = Constant::getNullValue(II->getType());
    auto Shuffle = Builder->CreateShuffleVector(V1, V2, NewC);
    return ReplaceInstUsesWith(CI, Shuffle);
  }

  case Intrinsic::x86_avx_vpermilvar_ps:
  case Intrinsic::x86_avx_vpermilvar_ps_256:
  case Intrinsic::x86_avx_vpermilvar_pd:
  case Intrinsic::x86_avx_vpermilvar_pd_256: {
    // Convert vpermil* to shufflevector if the mask is constant.
    Value *V = II->getArgOperand(1);
    unsigned Size = cast<VectorType>(V->getType())->getNumElements();
    assert(Size == 8 || Size == 4 || Size == 2);
(1,007 lines not shown)

llvm/trunk/test/Transforms/InstCombine/x86-pshufb.ll

				; RUN: opt < %s -instcombine -S | FileCheck %s

				; Verify that instcombine is able to fold identity shuffles.

				define <16 x i8> @identity_test(<16 x i8> %InVec) {
				; CHECK-LABEL: @identity_test
				; CHECK: ret <16 x i8> %InVec

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15>)
				ret <16 x i8> %1
				}

				define <32 x i8> @identity_test_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @identity_test_avx2
				; CHECK: ret <32 x i8> %InVec

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15>)
				ret <32 x i8> %1
				}


				; Verify that instcombine is able to fold byte shuffles with zero masks.

				define <16 x i8> @fold_to_zero_vector(<16 x i8> %InVec) {
				; CHECK-LABEL: @fold_to_zero_vector
				; CHECK: ret <16 x i8> zeroinitializer

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>)
				ret <16 x i8> %1
				}

				define <32 x i8> @fold_to_zero_vector_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @fold_to_zero_vector_avx2
				; CHECK: ret <32 x i8> zeroinitializer

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>)
				ret <32 x i8> %1
				}

				; Instcombine should be able to fold the following byte shuffle to a builtin shufflevector
				; with a shuffle mask of all zeroes.

				define <16 x i8> @splat_test(<16 x i8> %InVec) {
				; CHECK-LABEL: @splat_test
				; CHECK: shufflevector <16 x i8> %InVec, <16 x i8> undef, <16 x i32> zeroinitializer

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> zeroinitializer)
				ret <16 x i8> %1
				}

				; In the test case below, elements in the low 128-bit lane of the result
				; vector are equal to the lower byte of %InVec (shuffle index 0).
				; Elements in the high 128-bit lane of the result vector are equal to
				; the lower byte in the high 128-bit lane of %InVec (shuffle index 16).

				define <32 x i8> @splat_test_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @splat_test_avx2
				; CHECK: shufflevector <32 x i8> %InVec, <32 x i8> undef, <32 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> zeroinitializer)
				ret <32 x i8> %1
				}

				; Each of the byte shuffles in the following tests is equivalent to a blend between
				; vector %InVec and a vector of all zeroes.

				define <16 x i8> @blend1(<16 x i8> %InVec) {
				; CHECK-LABEL: @blend1
				; CHECK: shufflevector <16 x i8> %InVec, {{.*}}, <16 x i32> <i32 16, i32 1, i32 16, i32 3, i32 16, i32 5, i32 16, i32 7, i32 16, i32 9, i32 16, i32 11, i32 16, i32 13, i32 16, i32 15>

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 -128, i8 1, i8 -128, i8 3, i8 -128, i8 5, i8 -128, i8 7, i8 -128, i8 9, i8 -128, i8 11, i8 -128, i8 13, i8 -128, i8 15>)
				ret <16 x i8> %1
				}

				define <16 x i8> @blend2(<16 x i8> %InVec) {
				; CHECK-LABEL: @blend2
				; CHECK: shufflevector <16 x i8> %InVec, {{.*}}, <16 x i32> <i32 16, i32 16, i32 2, i32 3, i32 16, i32 16, i32 6, i32 7, i32 16, i32 16, i32 10, i32 11, i32 16, i32 16, i32 14, i32 15>

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 -128, i8 -128, i8 2, i8 3, i8 -128, i8 -128, i8 6, i8 7, i8 -128, i8 -128, i8 10, i8 11, i8 -128, i8 -128, i8 14, i8 15>)
				ret <16 x i8> %1
				}

				define <16 x i8> @blend3(<16 x i8> %InVec) {
				; CHECK-LABEL: @blend3
				; CHECK: shufflevector <16 x i8> %InVec, {{.*}}, <16 x i32> <i32 16, i32 16, i32 16, i32 16, i32 4, i32 5, i32 6, i32 7, i32 16, i32 16, i32 16, i32 16, i32 12, i32 13, i32 14, i32 15>

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 -128, i8 -128, i8 -128, i8 -128, i8 4, i8 5, i8 6, i8 7, i8 -128, i8 -128, i8 -128, i8 -128, i8 12, i8 13, i8 14, i8 15>)
				ret <16 x i8> %1
				}

				define <16 x i8> @blend4(<16 x i8> %InVec) {
				; CHECK-LABEL: @blend4
				; CHECK: shufflevector <16 x i8> %InVec, {{.*}}, <16 x i32> <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15>)
				ret <16 x i8> %1
				}

				define <16 x i8> @blend5(<16 x i8> %InVec) {
				; CHECK-LABEL: @blend5
				; CHECK: shufflevector <16 x i8> %InVec, {{.*}}, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 0, i8 1, i8 2, i8 3, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>)
				ret <16 x i8> %1
				}

				define <16 x i8> @blend6(<16 x i8> %InVec) {
				; CHECK-LABEL: @blend6
				; CHECK: shufflevector <16 x i8> %InVec, {{.*}}, <16 x i32> <i32 0, i32 1, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 0, i8 1, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>)
				ret <16 x i8> %1
				}

				define <32 x i8> @blend1_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @blend1_avx2
				; CHECK: shufflevector <32 x i8> %InVec, {{.*}}, <32 x i32> <i32 32, i32 1, i32 32, i32 3, i32 32, i32 5, i32 32, i32 7, i32 32, i32 9, i32 32, i32 11, i32 32, i32 13, i32 32, i32 15, i32 48, i32 17, i32 48, i32 19, i32 48, i32 21, i32 48, i32 23, i32 48, i32 25, i32 48, i32 27, i32 48, i32 29, i32 48, i32 31>

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 -128, i8 1, i8 -128, i8 3, i8 -128, i8 5, i8 -128, i8 7, i8 -128, i8 9, i8 -128, i8 11, i8 -128, i8 13, i8 -128, i8 15, i8 -128, i8 1, i8 -128, i8 3, i8 -128, i8 5, i8 -128, i8 7, i8 -128, i8 9, i8 -128, i8 11, i8 -128, i8 13, i8 -128, i8 15>)
				ret <32 x i8> %1
				}

				define <32 x i8> @blend2_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @blend2_avx2
				; CHECK: shufflevector <32 x i8> %InVec, {{.*}}, <32 x i32> <i32 32, i32 32, i32 2, i32 3, i32 32, i32 32, i32 6, i32 7, i32 32, i32 32, i32 10, i32 11, i32 32, i32 32, i32 14, i32 15, i32 48, i32 48, i32 18, i32 19, i32 48, i32 48, i32 22, i32 23, i32 48, i32 48, i32 26, i32 27, i32 48, i32 48, i32 30, i32 31>

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 -128, i8 -128, i8 2, i8 3, i8 -128, i8 -128, i8 6, i8 7, i8 -128, i8 -128, i8 10, i8 11, i8 -128, i8 -128, i8 14, i8 15, i8 -128, i8 -128, i8 2, i8 3, i8 -128, i8 -128, i8 6, i8 7, i8 -128, i8 -128, i8 10, i8 11, i8 -128, i8 -128, i8 14, i8 15>)
				ret <32 x i8> %1
				}

				define <32 x i8> @blend3_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @blend3_avx2
				; CHECK: shufflevector <32 x i8> %InVec, {{.*}}, <32 x i32> <i32 32, i32 32, i32 32, i32 32, i32 4, i32 5, i32 6, i32 7, i32 32, i32 32, i32 32, i32 32, i32 12, i32 13, i32 14, i32 15, i32 48, i32 48, i32 48, i32 48, i32 20, i32 21, i32 22, i32 23, i32 48, i32 48, i32 48, i32 48, i32 28, i32 29, i32 30, i32 31>

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 -128, i8 -128, i8 -128, i8 -128, i8 4, i8 5, i8 6, i8 7, i8 -128, i8 -128, i8 -128, i8 -128, i8 12, i8 13, i8 14, i8 15, i8 -128, i8 -128, i8 -128, i8 -128, i8 4, i8 5, i8 6, i8 7, i8 -128, i8 -128, i8 -128, i8 -128, i8 12, i8 13, i8 14, i8 15>)
				ret <32 x i8> %1
				}

				define <32 x i8> @blend4_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @blend4_avx2
				; CHECK: shufflevector <32 x i8> %InVec, {{.*}}, <32 x i32> <i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15>)
				ret <32 x i8> %1
				}

				define <32 x i8> @blend5_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @blend5_avx2
				; CHECK: shufflevector <32 x i8> %InVec, {{.*}}, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 16, i32 17, i32 18, i32 19, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48>

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 0, i8 1, i8 2, i8 3, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 0, i8 1, i8 2, i8 3, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>)
				ret <32 x i8> %1
				}

				define <32 x i8> @blend6_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @blend6_avx2
				; CHECK: shufflevector <32 x i8> %InVec, {{.*}}, <32 x i32> <i32 0, i32 1, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 16, i32 17, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48>

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 0, i8 1, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 0, i8 1, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>)
				ret <32 x i8> %1
				}

				; movq idiom.
				define <16 x i8> @movq_idiom(<16 x i8> %InVec) {
				; CHECK-LABEL: @movq_idiom
				; CHECK: shufflevector <16 x i8> %InVec, <16 x i8> <i8 0, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef>, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>)
				ret <16 x i8> %1
				}

				define <32 x i8> @movq_idiom_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @movq_idiom_avx2
				; CHECK: shufflevector <32 x i8> %InVec, <32 x i8> <i8 0, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 0, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef>, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48, i32 48>

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>)
				ret <32 x i8> %1
				}

				; Vector permutations using byte shuffles.

				define <16 x i8> @permute1(<16 x i8> %InVec) {
				; CHECK-LABEL: @permute1
				; CHECK: shufflevector <16 x i8> %InVec, <16 x i8> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15, i32 12, i32 13, i32 14, i32 15>

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 4, i8 5, i8 6, i8 7, i8 4, i8 5, i8 6, i8 7, i8 12, i8 13, i8 14, i8 15, i8 12, i8 13, i8 14, i8 15>)
				ret <16 x i8> %1
				}

				define <16 x i8> @permute2(<16 x i8> %InVec) {
				; CHECK-LABEL: @permute2
				; CHECK: shufflevector <16 x i8> %InVec, <16 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7>)
				ret <16 x i8> %1
				}

				define <32 x i8> @permute1_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @permute1_avx2
				; CHECK: shufflevector <32 x i8> %InVec, <32 x i8> undef, <32 x i32> <i32 4, i32 5, i32 6, i32 7, i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15, i32 12, i32 13, i32 14, i32 15, i32 20, i32 21, i32 22, i32 23, i32 20, i32 21, i32 22, i32 23, i32 28, i32 29, i32 30, i32 31, i32 28, i32 29, i32 30, i32 31>

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 4, i8 5, i8 6, i8 7, i8 4, i8 5, i8 6, i8 7, i8 12, i8 13, i8 14, i8 15, i8 12, i8 13, i8 14, i8 15, i8 4, i8 5, i8 6, i8 7, i8 4, i8 5, i8 6, i8 7, i8 12, i8 13, i8 14, i8 15, i8 12, i8 13, i8 14, i8 15>)
				ret <32 x i8> %1
				}

				define <32 x i8> @permute2_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @permute2_avx2
				; CHECK: shufflevector <32 x i8> %InVec, <32 x i8> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7>)
				ret <32 x i8> %1
				}

				; Test that instcombine correctly folds a pshufb with values that
				; are not -128 and that are not encoded in four bits.

				define <16 x i8> @identity_test2_2(<16 x i8> %InVec) {
				; CHECK-LABEL: @identity_test2_2
				; CHECK: ret <16 x i8> %InVec

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 16, i8 17, i8 18, i8 19, i8 20, i8 21, i8 22, i8 23, i8 24, i8 25, i8 26, i8 27, i8 28, i8 29, i8 30, i8 31>)
				ret <16 x i8> %1
				}

				define <32 x i8> @identity_test_avx2_2(<32 x i8> %InVec) {
				; CHECK-LABEL: @identity_test_avx2_2
				; CHECK: ret <32 x i8> %InVec

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 16, i8 33, i8 66, i8 19, i8 36, i8 69, i8 22, i8 39, i8 72, i8 25, i8 42, i8 75, i8 28, i8 45, i8 78, i8 31, i8 48, i8 81, i8 34, i8 51, i8 84, i8 37, i8 54, i8 87, i8 40, i8 57, i8 90, i8 43, i8 60, i8 93, i8 46, i8 63>)
				ret <32 x i8> %1
				}

				define <16 x i8> @fold_to_zero_vector_2(<16 x i8> %InVec) {
				; CHECK-LABEL: @fold_to_zero_vector_2
				; CHECK: ret <16 x i8> zeroinitializer

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 -125, i8 -1, i8 -53, i8 -32, i8 -4, i8 -7, i8 -33, i8 -66, i8 -99, i8 -120, i8 -100, i8 -22, i8 -17, i8 -1, i8 -11, i8 -15>)
				ret <16 x i8> %1
				}

				define <32 x i8> @fold_to_zero_vector_avx2_2(<32 x i8> %InVec) {
				; CHECK-LABEL: @fold_to_zero_vector_avx2_2
				; CHECK: ret <32 x i8> zeroinitializer

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 -127, i8 -1, i8 -53, i8 -32, i8 -4, i8 -7, i8 -33, i8 -66, i8 -99, i8 -120, i8 -100, i8 -22, i8 -17, i8 -1, i8 -11, i8 -15, i8 -126, i8 -2, i8 -52, i8 -31, i8 -5, i8 -8, i8 -34, i8 -67, i8 -100, i8 -119, i8 -101, i8 -23, i8 -16, i8 -2, i8 -12, i8 -16>)
				ret <32 x i8> %1
				}

				define <16 x i8> @permute3(<16 x i8> %InVec) {
				; CHECK-LABEL: @permute3
				; CHECK: shufflevector <16 x i8> %InVec, <16 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

				%1 = tail call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %InVec, <16 x i8> <i8 48, i8 17, i8 34, i8 51, i8 20, i8 37, i8 54, i8 23, i8 16, i8 49, i8 66, i8 19, i8 52, i8 69, i8 22, i8 55>)
				ret <16 x i8> %1
				}

				define <32 x i8> @permute3_avx2(<32 x i8> %InVec) {
				; CHECK-LABEL: @permute3_avx2
				; CHECK: shufflevector <32 x i8> %InVec, <32 x i8> undef, <32 x i32> <i32 4, i32 5, i32 6, i32 7, i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15, i32 12, i32 13, i32 14, i32 15, i32 20, i32 21, i32 22, i32 23, i32 20, i32 21, i32 22, i32 23, i32 28, i32 29, i32 30, i32 31, i32 28, i32 29, i32 30, i32 31>

				%1 = tail call <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8> %InVec, <32 x i8> <i8 52, i8 21, i8 38, i8 55, i8 20, i8 37, i8 54, i8 23, i8 28, i8 61, i8 78, i8 31, i8 60, i8 29, i8 30, i8 79, i8 52, i8 21, i8 38, i8 55, i8 20, i8 53, i8 102, i8 23, i8 92, i8 93, i8 94, i8 95, i8 108, i8 109, i8 110, i8 111>)
				ret <32 x i8> %1
				}


				declare <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8>, <16 x i8>)
				declare <32 x i8> @llvm.x86.avx2.pshuf.b(<32 x i8>, <32 x i8>)