This is an archive of the discontinued LLVM Phabricator instance.

[X86, AVX] instcombine common cases of vperm2* intrinsics into shuffles
ClosedPublic

Authored by spatel on Mar 20 2015, 10:23 AM.


Details

Reviewers

RKSimon
chandlerc
andreadb
craig.topper

Commits

rGccf5f24b7b90: [X86, AVX] instcombine common cases of vperm2* intrinsics into shuffles
rL232852: [X86, AVX] instcombine common cases of vperm2* intrinsics into shuffles

Summary

vperm2* intrinsics are just shuffles unless a zero mask bit is set. In a few special cases, they're not even shuffles.
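To make the "just shuffles" claim concrete, here is a minimal standalone sketch (plain C++, no LLVM headers) of how the immediate could map to a shufflevector mask when neither zero bit is set. The function name `perm2ShuffleMask` and the use of `std::vector<int>` are assumptions made for illustration; they are not part of the patch.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: build the shufflevector mask for a vperm2f128-style
// immediate, assuming neither zero bit (bit 3 or bit 7) is set.
// Indices 0..numElts-1 refer to the first source vector and
// numElts..2*numElts-1 refer to the second source vector.
std::vector<int> perm2ShuffleMask(std::uint8_t imm, unsigned numElts) {
  const unsigned half = numElts / 2;
  const unsigned lowBegin = (imm & 0x3) * half;          // bits [1:0]
  const unsigned highBegin = ((imm >> 4) & 0x3) * half;  // bits [5:4]
  std::vector<int> mask(numElts);
  for (unsigned i = 0; i != half; ++i) {
    mask[i] = static_cast<int>(lowBegin + i);          // low half of result
    mask[half + i] = static_cast<int>(highBegin + i);  // high half of result
  }
  return mask;
}
```

For example, `perm2ShuffleMask(0x31, 8)` yields 4, 5, 6, 7, 12, 13, 14, 15, which matches the `perm2ps_0x31` test in this revision. Immediates 0x10 and 0x32 give the identity masks for the first and second source respectively, consistent with the tests that expect a plain `ret` of `%a0` or `%a1`.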

Optimizing intrinsics in InstCombine is better than handling this in the front-end for at least two reasons:

Optimizing custom-written SSE intrinsic code at -O0 makes vector coders really angry (and so I have some regrets about some patches from last week).
Doing mask conversion logic in header files is hard to write and subsequently read.

Unfortunately, we use a magic number (generally assumed to be -1) to specify undef values in shufflevector masks in IR. And apparently, that magic has led to lax coding where we just check <0 for undef. If we had a proper enum for shufflevector mask special values, we could do like the x86 backend has done and easily transform the zero mask bit cases here too. Fixing that could be a follow-on patch. Otherwise, we'll try to deal with matching a 2-shuffle sequence in the x86 backend. But again, that's a separate patch (see the TODO comment in this one).
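The problem with checking `< 0` for undef can be illustrated with a small standalone sketch. It assumes a hypothetical pair of sentinel values (the names `kSentinelUndef` and `kSentinelZero` are invented here, as is `decodePerm2Mask`); the point is only that a lax `< 0` test cannot tell "zero this lane" apart from "undef".

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sentinels for this sketch only.
constexpr int kSentinelUndef = -1;  // lane value does not matter
constexpr int kSentinelZero = -2;   // lane must be zero

// Decode a vperm2f128-style immediate into a per-lane mask, marking lanes of
// a zeroed half with kSentinelZero instead of a source index.
std::vector<int> decodePerm2Mask(std::uint8_t imm, unsigned numElts) {
  const unsigned half = numElts / 2;
  std::vector<int> mask(numElts);
  for (unsigned i = 0; i != half; ++i) {
    mask[i] = (imm & 0x08) ? kSentinelZero
                           : static_cast<int>((imm & 0x3) * half + i);
    mask[half + i] = (imm & 0x80)
                         ? kSentinelZero
                         : static_cast<int>(((imm >> 4) & 0x3) * half + i);
  }
  return mask;
}

// The lax check treats every negative value as undef.
bool laxIsUndef(int m) { return m < 0; }

// A strict check only accepts the dedicated undef sentinel.
bool strictIsUndef(int m) { return m == kSentinelUndef; }
```

With the lax check, a lane marked `kSentinelZero` would be misread as undef, so any code relying on `< 0` would need auditing before a second sentinel could be introduced.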

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 22354. Mar 20 2015, 10:23 AM

spatel retitled this revision from to [X86, AVX] instcombine common cases of vperm2* intrinsics into shuffles.

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: andreadb, RKSimon, craig.topper, chandlerc.

spatel added a subscriber: Unknown Object (MLST).

Hi Sanjay,

lib/Transforms/InstCombine/InstCombineCalls.cpp
203 ↗	(On Diff #22354)	Why not have a static function instead? If you use a static function, you can avoid changing 'InstCombineInternal.h'.
245–248 ↗	(On Diff #22354)	I don't think you need two shuffles for this case. If either b[3] or b[7] is set, you can always convert the 'vperm2f128' into a single shuffle instruction where one of the operands is a zero vector and the other operand is either Op0 or Op1.

Example: with mask 'Imm' equal to '0x08', the low 128-bit lane of the result is always going to be zero. The upper 128-bit lane is going to be Op0[127:0]. So, you can expand the input 'x86_avx_vperm2f128_ps_256' into:

shufflevector <8 x float> zeroinitializer, <8 x float> Op0, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>

Basically, what I am trying to say is that if a 'zero bit' is set, then one of the input vectors becomes redundant. You can reuse the same logic you added for the case where 'zero mask bits are not set' to handle _all_ the possible cases. You would only need to check in advance if zero bits are set, and replace either Op0 or Op1 (or both) with a zero vector if needed.
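The suggestion above can be checked with a small standalone model. This is only a sketch under stated assumptions: vectors are modelled as `std::vector<int>` lanes, and all names (`referencePerm2`, `shuffleVector`, `perm2AsSingleShuffle`, `allImmediatesAgree`) are hypothetical and not taken from the patch or from LLVM. It compares the intrinsic's behaviour, as described by the control-byte layout, against a single two-operand shuffle in which a zeroed half is fed from a zero vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec = std::vector<int>;  // one source vector, modelled as integer lanes

// Reference behaviour, following the control-byte layout:
// [1:0]/[5:4] select a 128-bit half, [3]/[7] zero the low/high result half.
Vec referencePerm2(const Vec &a, const Vec &b, std::uint8_t imm) {
  const std::size_t half = a.size() / 2;
  Vec result(a.size(), 0);  // zeroed halves simply stay 0
  auto copyHalf = [&](unsigned sel, std::size_t dst) {
    const Vec &src = (sel & 2) ? b : a;              // which operand
    const std::size_t begin = (sel & 1) ? half : 0;  // which half of it
    for (std::size_t i = 0; i != half; ++i)
      result[dst + i] = src[begin + i];
  };
  if (!(imm & 0x08)) copyHalf(imm & 0x3, 0);
  if (!(imm & 0x80)) copyHalf((imm >> 4) & 0x3, half);
  return result;
}

// Model of a two-operand shuffle with a constant, fully defined mask.
Vec shuffleVector(const Vec &v0, const Vec &v1,
                  const std::vector<std::size_t> &mask) {
  Vec result(mask.size());
  for (std::size_t i = 0; i != mask.size(); ++i)
    result[i] = mask[i] < v0.size() ? v0[mask[i]] : v1[mask[i] - v0.size()];
  return result;
}

// Single-shuffle form: the first shuffle operand feeds the low half of the
// result and the second feeds the high half; either may be a zero vector.
Vec perm2AsSingleShuffle(const Vec &a, const Vec &b, std::uint8_t imm) {
  const std::size_t n = a.size(), half = n / 2;
  const Vec zero(n, 0);
  const Vec &v0 = (imm & 0x08) ? zero : ((imm & 0x02) ? b : a);
  const Vec &v1 = (imm & 0x80) ? zero : ((imm & 0x20) ? b : a);
  const std::size_t lowBegin = (imm & 0x01) ? half : 0;
  const std::size_t highBegin = n + ((imm & 0x10) ? half : 0);
  std::vector<std::size_t> mask(n);
  for (std::size_t i = 0; i != half; ++i) {
    mask[i] = lowBegin + i;
    mask[half + i] = highBegin + i;
  }
  return shuffleVector(v0, v1, mask);
}

// Exhaustively compare both forms over every 8-bit immediate.
bool allImmediatesAgree() {
  const Vec a{1, 2, 3, 4, 5, 6, 7, 8}, b{11, 12, 13, 14, 15, 16, 17, 18};
  for (unsigned imm = 0; imm != 256; ++imm)
    if (referencePerm2(a, b, static_cast<std::uint8_t>(imm)) !=
        perm2AsSingleShuffle(a, b, static_cast<std::uint8_t>(imm)))
      return false;
  return true;
}
```

For Imm = 0x08 with eight lanes, this sketch selects a zero vector and the first source with mask 0, 1, 2, 3, 8, 9, 10, 11, which is the expansion given in the comment above.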

spatel added inline comments. Mar 20 2015, 12:45 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
203 ↗	(On Diff #22354)	No reason other than cut and paste from the code around this. If this is static, we'd have to pass in a reference to the InstCombiner in order to use ReplaceInstUsesWith(), right? Do you think that's cleaner?
245–248 ↗	(On Diff #22354)	Ah, nice catch. I'll adjust the comment and make that a small follow-on patch if it's ok with you.

spatel added inline comments. Mar 20 2015, 12:50 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
203 ↗	(On Diff #22354)	I think the use of 'Builder' would also require the InstCombiner reference.

andreadb added inline comments. Mar 20 2015, 1:00 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
203 ↗	(On Diff #22354)	I was thinking about something like this:

if (Value *V = SimplifyX86vperm2(II, Builder)) return ReplaceInstUsesWith(*II, V);

Basically, the call to 'ReplaceInstUsesWith' could be done by the caller rather than by function 'SimplifyX86vperm2'. Your function can then take a pointer to the IRBuilder as its second argument. So, something like:

static Value *simplifyX86Vperm2f128(IntrinsicInst &II, InstCombiner::BuilderTy *Builder) {...}

That said, this was just an idea. I don't know if this is a better/cleaner way of doing it, but at least (I thought) we avoid changing the header file. :-)

spatel added inline comments. Mar 20 2015, 1:18 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
203 ↗	(On Diff #22354)	Ok, as long as I'm adding the minimal number of lines to this already 900 line switch statement. :) I'll update the patch shortly.

Patch updated based on feedback from Andrea:

Made the simplify function static - avoids header changes w/o adding any extra lines to the switch
Fixed comment regarding the zero mask case - will add that in a future patch
Moved some local variables closer to their usage
Added TODO for the equivalent AVX2 (integer) intrinsic

LGTM. Thanks Sanjay.

This revision is now accepted and ready to land. Mar 20 2015, 2:21 PM

Closed by commit rL232852: [X86, AVX] instcombine common cases of vperm2* intrinsics into shuffles (authored by spatel). Mar 20 2015, 2:50 PM

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in D8567: [X86, AVX] instcombine vperm2 intrinsics with zero inputs into shuffles. Mar 23 2015, 2:49 PM

spatel mentioned this in rL233110: [X86, AVX] instcombine vperm2 intrinsics with zero inputs into shuffles. Mar 24 2015, 1:39 PM

spatel mentioned this in D8833: [X86, SSE] instcombine common cases of insertps intrinsics into shuffles. Apr 4 2015, 2:21 PM

spatel mentioned this in rL235124: [X86, SSE] instcombine common cases of insertps intrinsics into shuffles. Apr 16 2015, 10:55 AM

spatel mentioned this in D10555: [X86] Replace avx2.pbroadcast intrinsics with native IR. Jun 19 2015, 8:00 AM

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp	59 lines
llvm/trunk/test/Transforms/InstCombine/x86-vperm2.ll	236 lines

Diff 22381

llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp

[... 191 lines not shown; context: if (Len <= 8 && isPowerOf2_32((uint32_t)Len)) { ...]
    // Set the size of the copy to 0, it will be deleted on the next iteration.
    MI->setLength(Constant::getNullValue(LenC->getType()));
    return MI;
  }

  return nullptr;
}

/// The shuffle mask for a perm2*128 selects any two halves of two 256-bit
/// source vectors, unless a zero bit is set. If a zero bit is set,
/// then ignore that half of the mask and clear that half of the vector.
static Value *SimplifyX86vperm2(const IntrinsicInst &II,
                                InstCombiner::BuilderTy &Builder) {
  if (auto CInt = dyn_cast<ConstantInt>(II.getArgOperand(2))) {
    VectorType *VecTy = cast<VectorType>(II.getType());
    uint8_t Imm = CInt->getZExtValue();

    // The immediate permute control byte looks like this:
    //    [1:0] - select 128 bits from sources for low half of destination
    //    [2]   - ignore
    //    [3]   - zero low half of destination
    //    [5:4] - select 128 bits from sources for high half of destination
    //    [6]   - ignore
    //    [7]   - zero high half of destination

    if ((Imm & 0x88) == 0x88) {
      // If both zero mask bits are set, this was just a weird way to
      // generate a zero vector.
      return ConstantAggregateZero::get(VecTy);
    }

    // TODO: If a single zero bit is set, replace one of the source operands
    // with a zero vector and use the same mask generation logic as below.

    if ((Imm & 0x88) == 0x00) {
      // If neither zero mask bit is set, this is a simple shuffle.
      unsigned NumElts = VecTy->getNumElements();
      unsigned HalfSize = NumElts / 2;
      unsigned HalfBegin;
      SmallVector<int, 8> ShuffleMask(NumElts);

      // Permute low half of result.
      HalfBegin = (Imm & 0x3) * HalfSize;
      for (unsigned i = 0; i != HalfSize; ++i)
        ShuffleMask[i] = HalfBegin + i;

      // Permute high half of result.
      HalfBegin = ((Imm >> 4) & 0x3) * HalfSize;
      for (unsigned i = HalfSize; i != NumElts; ++i)
        ShuffleMask[i] = HalfBegin + i - HalfSize;

      Value *Op0 = II.getArgOperand(0);
      Value *Op1 = II.getArgOperand(1);
      return Builder.CreateShuffleVector(Op0, Op1, ShuffleMask);
    }
  }
  return nullptr;
}

/// visitCallInst - CallInst simplification. This mostly only handles folding
/// of intrinsic instructions. For normal calls, it allows visitCallSite to do
/// the heavy lifting.
///
Instruction *InstCombiner::visitCallInst(CallInst &CI) {
  if (isFreeCall(&CI, TLI))
    return visitFree(CI);

[... 691 lines not shown; context: case Intrinsic::x86_avx_vpermilvar_pd_256: { ...]
    auto NewC =
        ConstantDataVector::get(V->getContext(), makeArrayRef(Indexes, Size));
    auto V1 = II->getArgOperand(0);
    auto V2 = UndefValue::get(V1->getType());
    auto Shuffle = Builder->CreateShuffleVector(V1, V2, NewC);
    return ReplaceInstUsesWith(CI, Shuffle);
  }

  case Intrinsic::x86_avx_vperm2f128_pd_256:
  case Intrinsic::x86_avx_vperm2f128_ps_256:
  case Intrinsic::x86_avx_vperm2f128_si_256:
    // TODO: Add the AVX2 version of this instruction.
    if (Value *V = SimplifyX86vperm2(*II, *Builder))
      return ReplaceInstUsesWith(*II, V);
    break;

  case Intrinsic::ppc_altivec_vperm:
    // Turn vperm(V1,V2,mask) -> shuffle(V1,V2,mask) if mask is a constant.
    // Note that ppc_altivec_vperm has a big-endian bias, so when creating
    // a vectorshuffle for little endian, we must undo the transformation
    // performed on vec_perm in altivec.h. That is, we must complement
    // the permutation mask with respect to 31 and reverse the order of
    // V1 and V2.
    if (Constant *Mask = dyn_cast<Constant>(II->getArgOperand(2))) {
[... 943 lines not shown ...]

llvm/trunk/test/Transforms/InstCombine/x86-vperm2.ll

				; RUN: opt < %s -instcombine -S | FileCheck %s

				; This should never happen, but make sure we don't crash handling a non-constant immediate byte.

				define <4 x double> @perm2pd_non_const_imm(<4 x double> %a0, <4 x double> %a1, i8 %b) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 %b)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_non_const_imm
				; CHECK-NEXT: call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 %b)
				; CHECK-NEXT: ret <4 x double>
				}


				; In the following 3 tests, both zero mask bits of the immediate are set.

				define <4 x double> @perm2pd_0x88(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 136)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x88
				; CHECK-NEXT: ret <4 x double> zeroinitializer
				}

				define <8 x float> @perm2ps_0x88(<8 x float> %a0, <8 x float> %a1) {
				%res = call <8 x float> @llvm.x86.avx.vperm2f128.ps.256(<8 x float> %a0, <8 x float> %a1, i8 136)
				ret <8 x float> %res

				; CHECK-LABEL: @perm2ps_0x88
				; CHECK-NEXT: ret <8 x float> zeroinitializer
				}

				define <8 x i32> @perm2si_0x88(<8 x i32> %a0, <8 x i32> %a1) {
				%res = call <8 x i32> @llvm.x86.avx.vperm2f128.si.256(<8 x i32> %a0, <8 x i32> %a1, i8 136)
				ret <8 x i32> %res

				; CHECK-LABEL: @perm2si_0x88
				; CHECK-NEXT: ret <8 x i32> zeroinitializer
				}


				; The other control bits are ignored when zero mask bits of the immediate are set.

				define <4 x double> @perm2pd_0xff(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 255)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0xff
				; CHECK-NEXT: ret <4 x double> zeroinitializer
				}


				; The following 16 tests are simple shuffles, except for 2 cases where we can just return one of the
				; source vectors. Verify that we generate the right shuffle masks and undef source operand where possible..

				define <4 x double> @perm2pd_0x00(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 0)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x00
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a0, <4 x double> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x01(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 1)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x01
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a0, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x02(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 2)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x02
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 4, i32 5, i32 0, i32 1>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x03(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 3)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x03
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 6, i32 7, i32 0, i32 1>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x10(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 16)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x10
				; CHECK-NEXT: ret <4 x double> %a0
				}

				define <4 x double> @perm2pd_0x11(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 17)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x11
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a0, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 2, i32 3>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x12(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 18)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x12
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 4, i32 5, i32 2, i32 3>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x13(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 19)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x13
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 6, i32 7, i32 2, i32 3>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x20(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 32)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x20
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x21(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 33)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x21
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x22(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 34)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x22
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a1, <4 x double> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x23(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 35)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x23
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a1, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x30(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 48)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x30
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x31(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 49)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x31
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 2, i32 3, i32 6, i32 7>
				; CHECK-NEXT: ret <4 x double> %1
				}

				define <4 x double> @perm2pd_0x32(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 50)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x32
				; CHECK-NEXT: ret <4 x double> %a1
				}

				define <4 x double> @perm2pd_0x33(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 51)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x33
				; CHECK-NEXT: %1 = shufflevector <4 x double> %a1, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 2, i32 3>
				; CHECK-NEXT: ret <4 x double> %1
				}

				; Confirm that a mask for 32-bit elements is also correct.

				define <8 x float> @perm2ps_0x31(<8 x float> %a0, <8 x float> %a1) {
				%res = call <8 x float> @llvm.x86.avx.vperm2f128.ps.256(<8 x float> %a0, <8 x float> %a1, i8 49)
				ret <8 x float> %res

				; CHECK-LABEL: @perm2ps_0x31
				; CHECK-NEXT: %1 = shufflevector <8 x float> %a0, <8 x float> %a1, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15>
				; CHECK-NEXT: ret <8 x float> %1
				}


				; Confirm that when a single zero mask bit is set, we do nothing.

				define <4 x double> @perm2pd_0x83(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 131)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x83
				; CHECK-NEXT: call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 -125)
				; CHECK-NEXT: ret <4 x double>
				}


				; Confirm that when the other zero mask bit is set, we do nothing. Also confirm that an ignored bit has no effect.

				define <4 x double> @perm2pd_0x48(<4 x double> %a0, <4 x double> %a1) {
				%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 72)
				ret <4 x double> %res

				; CHECK-LABEL: @perm2pd_0x48
				; CHECK-NEXT: call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 72)
				; CHECK-NEXT: ret <4 x double>
				}

				declare <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double>, <4 x double>, i8) nounwind readnone
				declare <8 x float> @llvm.x86.avx.vperm2f128.ps.256(<8 x float>, <8 x float>, i8) nounwind readnone
				declare <8 x i32> @llvm.x86.avx.vperm2f128.si.256(<8 x i32>, <8 x i32>, i8) nounwind readnone