This is an archive of the discontinued LLVM Phabricator instance.

[X86][XOP] Support for VPPERM byte shuffle instruction
Closed, Public

Authored by RKSimon on Mar 15 2016, 9:21 AM.

Details

Reviewers

spatel
delena
silvas
andreadb

Commits

rG572ca71573e6: [X86][XOP] Support for VPPERM byte shuffle instruction
rL264260: [X86][XOP] Support for VPPERM byte shuffle instruction

Summary

This patch begins adding support for lowering to the XOP VPPERM instruction - adding the X86ISD::VPPERM opcode and shuffle mask decoding (for the more basic shuffle operations that the instruction supports).
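As a rough illustration of what decoding only "the more basic shuffle operations" could look like, the sketch below turns 16 VPPERM control bytes into shuffle indices. It assumes the control byte layout of index in bits [4:0] and operation in bits [7:5], with operation 0 meaning a plain copy and operation 4 meaning zero fill. The function name, the `kZeroByte` sentinel and the bail-out behaviour are hypothetical and are not taken from the patch.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical sentinel for "this result byte is zero"; not an LLVM constant.
constexpr int kZeroByte = -1;

// Decode 16 VPPERM control bytes into shuffle indices over the 32 bytes of
// the two concatenated sources. Returns false (and clears Mask) as soon as a
// control byte asks for an operation a plain shuffle mask cannot express.
bool decodeBasicVPPERMMask(const std::array<uint8_t, 16> &Control,
                           std::vector<int> &Mask) {
  Mask.clear();
  for (uint8_t C : Control) {
    const unsigned Index = C & 0x1F;    // bits [4:0]: source byte, 0..31
    const unsigned Op = (C >> 5) & 0x7; // bits [7:5]: permute operation
    if (Op == 0) {
      Mask.push_back(static_cast<int>(Index)); // plain byte copy
    } else if (Op == 4) {
      Mask.push_back(kZeroByte); // zero fill
    } else {
      Mask.clear(); // invert, bit-reverse, etc. are not simple shuffles
      return false;
    }
  }
  return true;
}
```

A caller would treat a `false` return as "no shuffle comment or combine is possible for this constant".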

The mask decoding required the existing MCInst lowering code (X86MCInstLower.cpp) to be updated to support binary shuffles - the implementation now matches what is done in X86InstComments.cpp. This should be useful for some AVX512 binary shuffles (VPERMT2 etc.) as well.
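To make "binary shuffle" comments concrete, here is a minimal sketch of rendering a decoded mask that draws from two sources. It is a simplified illustration, not the code in either LLVM file: the function name and output format are assumptions, and a negative index is assumed to mean a zeroed element.

```cpp
#include <string>
#include <vector>

// Render a decoded shuffle mask as a human-readable comment. Indices below
// the element count name elements of Src1, the remaining indices name
// elements of Src2, and a negative index is printed as "zero".
std::string formatShuffleComment(const std::string &Dst,
                                 const std::string &Src1,
                                 const std::string &Src2,
                                 const std::vector<int> &Mask) {
  const int NumElts = static_cast<int>(Mask.size());
  std::string Out = Dst + " = ";
  for (int i = 0; i < NumElts; ++i) {
    if (i != 0)
      Out += ",";
    const int M = Mask[i];
    if (M < 0) {
      Out += "zero";
      continue;
    }
    const bool FromSrc1 = M < NumElts;
    Out += (FromSrc1 ? Src1 : Src2) + "[" +
           std::to_string(FromSrc1 ? M : M - NumElts) + "]";
  }
  return Out;
}
```

A unary shuffle is the special case where every index stays below the element count, so only the first source name ever appears.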

A followup patch will enable VPPERM as a target shuffle for combining.

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 50744 (Mar 15 2016, 9:21 AM).

RKSimon retitled this revision to [X86][XOP] Support for VPPERM byte shuffle instruction.

RKSimon updated this object.

RKSimon added reviewers: delena, spatel, andreadb, silvas.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: llvm-commits.

ping - I'm primarily interested in whether anybody has any concerns about the changes to X86MCInstLower.cpp - I realise not many people are up on the workings of XOP ;-)

In D18189#379265, @RKSimon wrote:

ping - I'm primarily interested in whether anybody has any concerns about the changes to X86MCInstLower.cpp - I realise not many people are up on the workings of XOP ;-)

I'm not a good reviewer for XOP or AVX512... :)
But can this patch be split up as:

Add support for VPPERM without changing getShuffleComment()
Enhance getShuffleComment() to support VPPERM (and VPERMT2?)

In D18189#381588, @spatel wrote:

I'm not a good reviewer for XOP or AVX512... :)
But can this patch be split up as:

Add support for VPPERM without changing getShuffleComment()

Enhance getShuffleComment() to support VPPERM (and VPERMT2?)

Yes, that's pretty trivial - I was really trying to avoid overloading the list with patches about instruction sets that not many people are interested in ;-) The split is very clean already:

1 - Add support for VPPERM:
X86ISelLowering.h
X86ISelLowering.cpp
X86InstrFragmentsSIMD.td
X86InstrXOP.td
X86IntrinsicsInfo.h

2 - Add support for constant pool decoding of 2 input shuffles:
X86MCInstLower.cpp

3 - Enable VPPERM constant pool decoding:
X86ShuffleDecodeConstantPool.h
X86ShuffleDecodeConstantPool.cpp
vector-shuffle-combining-xop.ll

Whether we merge 2 + 3 depends on whether we want immediate test cases for (2); without them, all we show is that it doesn't break existing unary shuffle decodes.

In D18189#381652, @RKSimon wrote:

Yes, that's pretty trivial - I was really trying to avoid overloading the list with patches about instruction sets that not many people are interested in ;-) The split is very clean already:

1 - Add support for VPPERM:
X86ISelLowering.h
X86ISelLowering.cpp
X86InstrFragmentsSIMD.td
X86InstrXOP.td
X86IntrinsicsInfo.h

Can you limit this patch to only this part? I have my AMD Vol. 4 open to the vpperm page, so I can review that part first. AMD managed to redo Altivec vperm and add extra magic via the unused control bits...who knew? :)
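For readers without the manual open, the "extra magic" can be sketched as a scalar model of one VPPERM. The operation table (0 copy, 1 invert, 2 bit-reverse, 3 bit-reverse of inverted, 4 zero fill, 5 ones fill, 6 replicate MSB, 7 replicate inverted MSB) is given here as recalled from the AMD documentation and should be checked against it. Which source the low 16 indices address is an assumption: the model follows the intrinsic's operand order, with indices 0-15 reading the first source and 16-31 the second. All names are hypothetical.

```cpp
#include <array>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Reverse the bit order within one byte.
static uint8_t reverseBits(uint8_t B) {
  uint8_t R = 0;
  for (int i = 0; i < 8; ++i)
    if (B & (1u << i))
      R = static_cast<uint8_t>(R | (0x80u >> i));
  return R;
}

// Scalar model of VPPERM: each control byte picks one of 32 source bytes
// (bits [4:0]) and one of eight transformations (bits [7:5]).
Bytes16 emulateVPPERM(const Bytes16 &A0, const Bytes16 &A1,
                      const Bytes16 &Control) {
  Bytes16 Result{};
  for (int i = 0; i < 16; ++i) {
    const unsigned Index = Control[i] & 0x1F;
    const unsigned Op = Control[i] >> 5;
    // Assumption: indices 0..15 read A0, indices 16..31 read A1.
    const uint8_t Src = Index < 16 ? A0[Index] : A1[Index - 16];
    const uint8_t Msb = (Src & 0x80) ? 0xFF : 0x00;
    switch (Op) {
    case 0: Result[i] = Src; break;                        // copy
    case 1: Result[i] = static_cast<uint8_t>(~Src); break; // invert
    case 2: Result[i] = reverseBits(Src); break;           // bit reverse
    case 3: Result[i] = reverseBits(static_cast<uint8_t>(~Src)); break;
    case 4: Result[i] = 0x00; break;                       // zero fill
    case 5: Result[i] = 0xFF; break;                       // ones fill
    case 6: Result[i] = Msb; break;                        // replicate MSB
    default: Result[i] = static_cast<uint8_t>(~Msb); break; // inverted MSB
    }
  }
  return Result;
}
```

Only operations 0 and 4 correspond to something an ordinary shuffle mask can describe, which is why a shuffle decoder would have to give up on the rest.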

Reduced the patch to just the plumbing for the X86ISD::VPPERM opcode.

Yeah, VPPERM is pretty nifty - I'll probably try to make use of at least some of its features (bitreverse comes to mind) in future patches, but for now I'm just interested in binary shuffle combining.
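The bitreverse idea can be sketched as nothing more than a choice of control bytes. The snippet below builds a control vector that asks for "bit reverse of source byte" on every byte of the first source. It assumes the operation lives in bits [7:5] with value 2 meaning bit reverse; the function name is hypothetical and this is not code from any follow-up patch.

```cpp
#include <array>
#include <cstdint>

// Build VPPERM control bytes that bit-reverse each byte of the first source
// in place: byte i selects source byte i (bits [4:0]) with operation 2
// (bits [7:5]), assumed here to mean "bit reverse of source byte".
std::array<uint8_t, 16> makeByteBitReverseControl() {
  constexpr unsigned BitReverseOp = 2u << 5; // 0x40
  std::array<uint8_t, 16> Control{};
  for (unsigned i = 0; i < 16; ++i)
    Control[i] = static_cast<uint8_t>(BitReverseOp | i);
  return Control;
}
```

Reversing the bits of a wider element would also need the bytes within the element swapped, which the index field of the same control byte could express.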

LGTM. See inline comments for a couple of small changes.

lib/Target/X86/X86InstrXOP.td, lines 225–244 (on Diff #51493): The reg/mem suffixes here confused me at first. I realize this is copying existing code, but I'd prefer if these were more accurate for the 3 input operands: "rrr", "rrm", "rmr". It's fine if that's a separate commit for that NFC change.

test/CodeGen/X86/vector-shuffle-combining-xop.ll, lines 33–40 (on Diff #51493): Add 2 tests for the load folding variants?

This revision is now accepted and ready to land (Mar 23 2016, 5:59 PM).

RKSimon added inline comments (Mar 24 2016, 2:08 AM).

test/CodeGen/X86/vector-shuffle-combining-xop.ll, lines 33–40 (on Diff #51493): Those are already tested in xop-intrinsics-x86_64.ll

Closed by commit rL264260: [X86][XOP] Support for VPPERM byte shuffle instruction (authored by RKSimon) (Mar 24 2016, 4:58 AM).

This revision was automatically updated to reflect the committed changes.

RKSimon mentioned this in D18441: [X86][XOP] Support for VPPERM shuffle mask decoding (Mar 24 2016, 5:11 AM).

RKSimon mentioned this in rL264305: [X86][XOP] Fixed instruction postfixes to more closely match operands (Mar 24 2016, 9:36 AM).

Revision Contents

Path                                                          Size
llvm/trunk/lib/Target/X86/X86ISelLowering.h                   2 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp                 1 line
llvm/trunk/lib/Target/X86/X86InstrFragmentsSIMD.td            4 lines
llvm/trunk/lib/Target/X86/X86InstrXOP.td                      44 lines
llvm/trunk/lib/Target/X86/X86IntrinsicsInfo.h                 1 line
llvm/trunk/test/CodeGen/X86/vector-shuffle-combining-xop.ll   13 lines

Diff 51538

llvm/trunk/lib/Target/X86/X86ISelLowering.h

[435 lines collapsed; context: enum NodeType : unsigned {]
   EXTRQI, INSERTQI,

   // XOP variable/immediate rotations
   VPROT, VPROTI,
   // XOP arithmetic/logical shifts
   VPSHA, VPSHL,
   // XOP signed/unsigned integer comparisons
   VPCOM, VPCOMU,
+  // XOP packed permute bytes
+  VPPERM,

   // Vector multiply packed unsigned doubleword integers
   PMULUDQ,
   // Vector multiply packed signed doubleword integers
   PMULDQ,
   // Vector Multiply Packed UnsignedIntegers with Round and Scale
   MULHRS,
   // Multiply and Add Packed Integers
[757 lines collapsed]

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

[21,508 lines collapsed; context: const char *X86TargetLowering::getTargetNodeName(unsigned Opcode) const {]
   case X86ISD::VSHL: return "X86ISD::VSHL";
   case X86ISD::VSRL: return "X86ISD::VSRL";
   case X86ISD::VSRA: return "X86ISD::VSRA";
   case X86ISD::VSHLI: return "X86ISD::VSHLI";
   case X86ISD::VSRLI: return "X86ISD::VSRLI";
   case X86ISD::VSRAI: return "X86ISD::VSRAI";
   case X86ISD::VROTLI: return "X86ISD::VROTLI";
   case X86ISD::VROTRI: return "X86ISD::VROTRI";
+  case X86ISD::VPPERM: return "X86ISD::VPPERM";
   case X86ISD::CMPP: return "X86ISD::CMPP";
   case X86ISD::PCMPEQ: return "X86ISD::PCMPEQ";
   case X86ISD::PCMPGT: return "X86ISD::PCMPGT";
   case X86ISD::PCMPEQM: return "X86ISD::PCMPEQM";
   case X86ISD::PCMPGTM: return "X86ISD::PCMPGTM";
   case X86ISD::ADD: return "X86ISD::ADD";
   case X86ISD::SUB: return "X86ISD::SUB";
   case X86ISD::ADC: return "X86ISD::ADC";
[8,658 lines collapsed]

llvm/trunk/lib/Target/X86/X86InstrFragmentsSIMD.td

[245 lines collapsed; context: def X86vpcom : SDNode<"X86ISD::VPCOM",]
     SDTypeProfile<1, 3, [SDTCisVec<0>, SDTCisSameAs<0,1>,
     SDTCisSameAs<0,2>,
     SDTCisVT<3, i8>]>>;
 def X86vpcomu : SDNode<"X86ISD::VPCOMU",
     SDTypeProfile<1, 3, [SDTCisVec<0>, SDTCisSameAs<0,1>,
     SDTCisSameAs<0,2>,
     SDTCisVT<3, i8>]>>;

+def X86vpperm : SDNode<"X86ISD::VPPERM",
+    SDTypeProfile<1, 3, [SDTCisVT<0, v16i8>, SDTCisSameAs<0,1>,
+    SDTCisSameAs<0,2>]>>;
+
 def SDTX86CmpPTest : SDTypeProfile<1, 2, [SDTCisVT<0, i32>,
     SDTCisVec<1>,
     SDTCisSameAs<2, 1>]>;

 def SDTX86Testm : SDTypeProfile<1, 2, [SDTCisVec<0>, SDTCisVec<1>,
     SDTCisSameAs<2, 1>, SDTCVecEltisVT<0, i1>,
     SDTCisSameNumEltsAs<0, 1>]>;

[788 lines collapsed]

llvm/trunk/lib/Target/X86/X86InstrXOP.td

[216 lines collapsed; context: let ExeDomain = SSEPackedInt in { // SSE integer instructions]
   defm VPCOMD : xopvpcom<0xCE, "d", X86vpcom, v4i32>;
   defm VPCOMQ : xopvpcom<0xCF, "q", X86vpcom, v2i64>;
   defm VPCOMUB : xopvpcom<0xEC, "ub", X86vpcomu, v16i8>;
   defm VPCOMUW : xopvpcom<0xED, "uw", X86vpcomu, v8i16>;
   defm VPCOMUD : xopvpcom<0xEE, "ud", X86vpcomu, v4i32>;
   defm VPCOMUQ : xopvpcom<0xEF, "uq", X86vpcomu, v2i64>;
 }

+multiclass xop4op<bits<8> opc, string OpcodeStr, SDNode OpNode,
+                  ValueType vt128> {
+  def rr : IXOPi8<opc, MRMSrcReg, (outs VR128:$dst),
+      (ins VR128:$src1, VR128:$src2, VR128:$src3),
+      !strconcat(OpcodeStr,
+      "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
+      [(set VR128:$dst,
+      (vt128 (OpNode (vt128 VR128:$src1), (vt128 VR128:$src2),
+      (vt128 VR128:$src3))))]>,
+      XOP_4V, VEX_I8IMM;
+  def rm : IXOPi8<opc, MRMSrcMem, (outs VR128:$dst),
+      (ins VR128:$src1, VR128:$src2, i128mem:$src3),
+      !strconcat(OpcodeStr,
+      "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
+      [(set VR128:$dst,
+      (vt128 (OpNode (vt128 VR128:$src1), (vt128 VR128:$src2),
+      (vt128 (bitconvert (loadv2i64 addr:$src3))))))]>,
+      XOP_4V, VEX_I8IMM, VEX_W, MemOp4;
+  def mr : IXOPi8<opc, MRMSrcMem, (outs VR128:$dst),
+      (ins VR128:$src1, i128mem:$src2, VR128:$src3),
+      !strconcat(OpcodeStr,
+      "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
+      [(set VR128:$dst,
+      (v16i8 (OpNode (vt128 VR128:$src1), (vt128 (bitconvert (loadv2i64 addr:$src2))),
+      (vt128 VR128:$src3))))]>,
+      XOP_4V, VEX_I8IMM;
+  // For disassembler
+  let isCodeGenOnly = 1, ForceDisassemble = 1, hasSideEffects = 0 in
+  def rr_REV : IXOPi8<opc, MRMSrcReg, (outs VR128:$dst),
+      (ins VR128:$src1, VR128:$src2, VR128:$src3),
+      !strconcat(OpcodeStr,
+      "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
+      []>, XOP_4V, VEX_I8IMM, VEX_W, MemOp4;
+}
+
+let ExeDomain = SSEPackedInt in {
+  defm VPPERM : xop4op<0xA3, "vpperm", X86vpperm, v16i8>;
+}
+
 // Instruction where either second or third source can be memory
-multiclass xop4op<bits<8> opc, string OpcodeStr, Intrinsic Int> {
+multiclass xop4op_int<bits<8> opc, string OpcodeStr, Intrinsic Int> {
   def rr : IXOPi8<opc, MRMSrcReg, (outs VR128:$dst),
       (ins VR128:$src1, VR128:$src2, VR128:$src3),
       !strconcat(OpcodeStr,
       "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
       [(set VR128:$dst, (Int VR128:$src1, VR128:$src2, VR128:$src3))]>,
       XOP_4V, VEX_I8IMM;
   def rm : IXOPi8<opc, MRMSrcMem, (outs VR128:$dst),
       (ins VR128:$src1, VR128:$src2, i128mem:$src3),
[16 lines collapsed; context: multiclass xop4op_int<bits<8> opc, string OpcodeStr, Intrinsic Int> {]
   def rr_REV : IXOPi8<opc, MRMSrcReg, (outs VR128:$dst),
       (ins VR128:$src1, VR128:$src2, VR128:$src3),
       !strconcat(OpcodeStr,
       "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
       []>, XOP_4V, VEX_I8IMM, VEX_W, MemOp4;
 }

 let ExeDomain = SSEPackedInt in {
-  defm VPPERM : xop4op<0xA3, "vpperm", int_x86_xop_vpperm>;
-  defm VPCMOV : xop4op<0xA2, "vpcmov", int_x86_xop_vpcmov>;
+  defm VPCMOV : xop4op_int<0xA2, "vpcmov", int_x86_xop_vpcmov>;
 }

 multiclass xop4op256<bits<8> opc, string OpcodeStr, Intrinsic Int> {
   def rrY : IXOPi8<opc, MRMSrcReg, (outs VR256:$dst),
       (ins VR256:$src1, VR256:$src2, VR256:$src3),
       !strconcat(OpcodeStr,
       "\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
       [(set VR256:$dst, (Int VR256:$src1, VR256:$src2, VR256:$src3))]>,
[105 lines collapsed]

llvm/trunk/lib/Target/X86/X86IntrinsicsInfo.h

[2,272 lines collapsed]
   X86_INTRINSIC_DATA(xop_vpcomb, INTR_TYPE_3OP, X86ISD::VPCOM, 0),
   X86_INTRINSIC_DATA(xop_vpcomd, INTR_TYPE_3OP, X86ISD::VPCOM, 0),
   X86_INTRINSIC_DATA(xop_vpcomq, INTR_TYPE_3OP, X86ISD::VPCOM, 0),
   X86_INTRINSIC_DATA(xop_vpcomub, INTR_TYPE_3OP, X86ISD::VPCOMU, 0),
   X86_INTRINSIC_DATA(xop_vpcomud, INTR_TYPE_3OP, X86ISD::VPCOMU, 0),
   X86_INTRINSIC_DATA(xop_vpcomuq, INTR_TYPE_3OP, X86ISD::VPCOMU, 0),
   X86_INTRINSIC_DATA(xop_vpcomuw, INTR_TYPE_3OP, X86ISD::VPCOMU, 0),
   X86_INTRINSIC_DATA(xop_vpcomw, INTR_TYPE_3OP, X86ISD::VPCOM, 0),
+  X86_INTRINSIC_DATA(xop_vpperm, INTR_TYPE_3OP, X86ISD::VPPERM, 0),
   X86_INTRINSIC_DATA(xop_vprotb, INTR_TYPE_2OP, X86ISD::VPROT, 0),
   X86_INTRINSIC_DATA(xop_vprotbi, INTR_TYPE_2OP, X86ISD::VPROTI, 0),
   X86_INTRINSIC_DATA(xop_vprotd, INTR_TYPE_2OP, X86ISD::VPROT, 0),
   X86_INTRINSIC_DATA(xop_vprotdi, INTR_TYPE_2OP, X86ISD::VPROTI, 0),
   X86_INTRINSIC_DATA(xop_vprotq, INTR_TYPE_2OP, X86ISD::VPROT, 0),
   X86_INTRINSIC_DATA(xop_vprotqi, INTR_TYPE_2OP, X86ISD::VPROTI, 0),
   X86_INTRINSIC_DATA(xop_vprotw, INTR_TYPE_2OP, X86ISD::VPROT, 0),
   X86_INTRINSIC_DATA(xop_vprotwi, INTR_TYPE_2OP, X86ISD::VPROTI, 0),
[131 lines collapsed]

llvm/trunk/test/CodeGen/X86/vector-shuffle-combining-xop.ll

[15 lines collapsed]
 ; CHECK-NEXT: vpperm {{.*}}(%rip), %xmm1, %xmm0, %xmm0
 ; CHECK-NEXT: vpperm {{.*}}(%rip), %xmm0, %xmm0, %xmm0
 ; CHECK-NEXT: retq
   %res0 = call <16 x i8> @llvm.x86.xop.vpperm(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> <i8 31, i8 30, i8 29, i8 28, i8 27, i8 26, i8 25, i8 24, i8 23, i8 22, i8 21, i8 20, i8 19, i8 18, i8 17, i8 16>)
   %res1 = call <16 x i8> @llvm.x86.xop.vpperm(<16 x i8> %res0, <16 x i8> undef, <16 x i8> <i8 15, i8 14, i8 13, i8 12, i8 11, i8 10, i8 9, i8 8, i8 7, i8 6, i8 5, i8 4, i8 3, i8 2, i8 1, i8 0>)
   ret <16 x i8> %res1
 }

-define <16 x i8> @combine_vpperm_as_unpckhwd(<16 x i8> %a0, <16 x i8> %a1) {
-; CHECK-LABEL: combine_vpperm_as_unpckhwd:
+define <16 x i8> @combine_vpperm_as_unary_unpckhwd(<16 x i8> %a0, <16 x i8> %a1) {
+; CHECK-LABEL: combine_vpperm_as_unary_unpckhwd:
 ; CHECK: # BB#0:
 ; CHECK-NEXT: vpperm {{.*}}(%rip), %xmm0, %xmm0, %xmm0
 ; CHECK-NEXT: retq
   %res0 = call <16 x i8> @llvm.x86.xop.vpperm(<16 x i8> %a0, <16 x i8> %a0, <16 x i8> <i8 8, i8 24, i8 9, i8 25, i8 10, i8 26, i8 11, i8 27, i8 12, i8 28, i8 13, i8 29, i8 14, i8 30, i8 15, i8 31>)
   ret <16 x i8> %res0
 }
+
+define <16 x i8> @combine_vpperm_as_unpckhwd(<16 x i8> %a0, <16 x i8> %a1) {
+; CHECK-LABEL: combine_vpperm_as_unpckhwd:
+; CHECK: # BB#0:
+; CHECK-NEXT: vpperm {{.*}}(%rip), %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: retq
+  %res0 = call <16 x i8> @llvm.x86.xop.vpperm(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> <i8 8, i8 24, i8 9, i8 25, i8 10, i8 26, i8 11, i8 27, i8 12, i8 28, i8 13, i8 29, i8 14, i8 30, i8 15, i8 31>)
+  ret <16 x i8> %res0
+}
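The first test in this file still expects two vpperm instructions, since combining is left to a follow-up patch. As a sketch of the mask arithmetic such a combine would rely on, the snippet below composes two byte shuffles where the outer shuffle reads the inner result as its first source and undef as its second. The function name and the `kUndef` sentinel are hypothetical, and all control bytes are assumed to be plain copies (operation 0).

```cpp
#include <vector>

// Hypothetical sentinel for "this result byte is undefined".
constexpr int kUndef = -1;

// Compose two shuffles. Outer indices below the element count read the inner
// shuffle's result; any other index reads the undef second source.
std::vector<int> composeWithUndefSecondSource(const std::vector<int> &Inner,
                                              const std::vector<int> &Outer) {
  const int NumElts = static_cast<int>(Inner.size());
  std::vector<int> Combined;
  Combined.reserve(Outer.size());
  for (int M : Outer)
    Combined.push_back(M >= 0 && M < NumElts ? Inner[M] : kUndef);
  return Combined;
}
```

With the masks from that test (inner 31 down to 16, outer 15 down to 0), the composed mask is 16 up to 31, that is, the second source's bytes in their original order.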