This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add pattern matching for PMADDUBSW
ClosedPublic

Authored by craig.topper on Jul 25 2018, 5:21 PM.

Details

Reviewers

RKSimon
spatel
zvi

Commits

rGbef126fb710c: [X86] Add pattern matching for PMADDUBSW
rL338402: [X86] Add pattern matching for PMADDUBSW

Summary

Similar to D49636, but for PMADDUBSW. This instruction has the additional complexity that the addition of the two products saturates to 16-bits rather than wrapping around. And one operand is treated as signed and the other as unsigned.

I changed the madd.ll test command line from sse2 to ssse3 to ensure this instruction was available, which also caused some test changes for phadd. I can commit that separately if desired, add a new run line, or add a new test file, whichever is preferable.

A C example that triggers this pattern

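#include <stdint.h>

/* An enum constant keeps the file-scope array sizes valid in C as well as C++. */
enum { N = 128 };

int8_t A[2*N];
uint8_t B[2*N];
int16_t C[N];

/* Fully parenthesized so the macros are safe inside larger expressions. */
#define MIN(x, y) (((x) < (y)) ? (x) : (y))
#define MAX(x, y) (((x) > (y)) ? (x) : (y))

void foo() {
  for (int i = 0; i != N; ++i)
    C[i] = MIN(MAX((int16_t)A[2*i]*(int16_t)B[2*i] + (int16_t)A[2*i+1]*(int16_t)B[2*i+1], -32768), 32767);
}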

Diff Detail

Event Timeline

craig.topper created this revision. Jul 25 2018, 5:21 PM

craig.topper edited the summary of this revision.

LGTM. Regarding the SSE2->SSSE3 test change, I think it's fine. Can you update the --check-prefix to SSSE3 in a follow-up commit? I think it's convenient to review as-is, but in the longer term it would be misleading to leave it as-is.

zvi accepted this revision. Jul 27 2018, 12:58 PM

This revision is now accepted and ready to land. Jul 27 2018, 12:58 PM

RKSimon requested changes to this revision. Jul 30 2018, 3:27 AM

RKSimon added inline comments.

lib/Target/X86/X86ISelLowering.cpp
36821	Couldn't you merge all these sets of canonicalization early-outs together to save space?
36837	I always get nervous when we add vectorization detection in the DAG.
36839	for (unsigned i = 0, e = N00.getNumOperands(); i != e; ++i) { or use NumElems?
36865	element
test/CodeGen/X86/madd.ll
2 ↗	(On Diff #157406)	prefix doesn't match the sse level - please can you put the pmaddusbw tests in pmaddusbw.ll - I don't think it's worth adding SSSE3 tests to madd.ll tbh

This revision now requires changes to proceed. Jul 30 2018, 3:27 AM

craig.topper added inline comments. Jul 30 2018, 10:34 AM

lib/Target/X86/X86ISelLowering.cpp
36837	Nervous how?

Address Simon's comments. Spell the mnemonic correctly in the test cases.

RKSimon added inline comments. Jul 30 2018, 11:54 AM

lib/Target/X86/X86ISelLowering.cpp
36837	In this case it looks necessary, but it encourages the belief that it's safe to perform vectorization in the DAG. In general we should be relying on the vectorizers to perform this using a proper cost analysis. See PR35732 as well.

LGTM thanks - would be nice to commit pmaddubsw.ll first so we see the codegen diff from this patch.

This revision is now accepted and ready to land. Jul 31 2018, 2:17 AM

Closed by commit rL338402: [X86] Add pattern matching for PMADDUBSW (authored by ctopper). Jul 31 2018, 10:12 AM

This revision was automatically updated to reflect the committed changes.

Sorry for the late response to the already closed review.

In this case it looks necessary, but it encourages the belief that it's safe to perform vectorization in the DAG. In general we should be relying on the vectorizers to perform this using a proper cost analysis.

On the vectorizer performing a proper cost analysis, I agree 100%.

On the vectorizer performing this part: we can't do that unless we want the vectorizer to emit target-(in)dependent intrinsics all over the place. As long as we want to keep using sequences of IR instructions, what's more important is to build up common and robust (TTI-based?) pattern matchers that can be used at the IR level as well as at the DAG level, so that what we see in the vectorizer can also be captured by the DAG optimizer.

Revision Contents

Path                                   Size
lib/Target/X86/X86ISelLowering.cpp     143 lines
test/CodeGen/X86/pmaddubsw.ll          553 lines

Diff 158014

lib/Target/X86/X86ISelLowering.cpp

static SDValue combinePMULH(SDValue Src, EVT VT, const SDLoc &DL,
[...]
  // Ensure the input types match.
  if (LHS.getValueType() != VT || RHS.getValueType() != VT)
    return SDValue();

  unsigned Opc = ExtOpc == ISD::SIGN_EXTEND ? ISD::MULHS : ISD::MULHU;
  return DAG.getNode(Opc, DL, VT, LHS, RHS);
}

// Attempt to match PMADDUBSW, which multiplies corresponding unsigned bytes
// from one vector with signed bytes from another vector, adds together
// adjacent pairs of 16-bit products, and saturates the result before
// truncating to 16-bits.
//
// Which looks something like this:
// (i16 (ssat (add (mul (zext (even elts (i8 A))), (sext (even elts (i8 B)))),
//                 (mul (zext (odd elts (i8 A)), (sext (odd elts (i8 B))))))))
static SDValue detectPMADDUBSW(SDValue In, EVT VT, SelectionDAG &DAG,
                               const X86Subtarget &Subtarget,
                               const SDLoc &DL) {
  if (!VT.isVector() || !Subtarget.hasSSSE3())
    return SDValue();

  unsigned NumElems = VT.getVectorNumElements();
  EVT ScalarVT = VT.getVectorElementType();
  if (ScalarVT != MVT::i16 || NumElems < 8 || !isPowerOf2_32(NumElems))
    return SDValue();

  SDValue SSatVal = detectSSatPattern(In, VT);
  if (!SSatVal || SSatVal.getOpcode() != ISD::ADD)
    return SDValue();

  // Ok this is a signed saturation of an ADD. See if this ADD is adding pairs
  // of multiplies from even/odd elements.
  SDValue N0 = SSatVal.getOperand(0);
  SDValue N1 = SSatVal.getOperand(1);

  if (N0.getOpcode() != ISD::MUL || N1.getOpcode() != ISD::MUL)
    return SDValue();

  SDValue N00 = N0.getOperand(0);
  SDValue N01 = N0.getOperand(1);
  SDValue N10 = N1.getOperand(0);
  SDValue N11 = N1.getOperand(1);

  // TODO: Handle constant vectors and use knownbits/computenumsignbits?
  // Canonicalize zero_extend to LHS.
  if (N01.getOpcode() == ISD::ZERO_EXTEND)
    std::swap(N00, N01);
  if (N11.getOpcode() == ISD::ZERO_EXTEND)
    std::swap(N10, N11);

  // Ensure we have a zero_extend and a sign_extend.
  if (N00.getOpcode() != ISD::ZERO_EXTEND ||
      N01.getOpcode() != ISD::SIGN_EXTEND ||
      N10.getOpcode() != ISD::ZERO_EXTEND ||
      N11.getOpcode() != ISD::SIGN_EXTEND)
    return SDValue();

  // Peek through the extends.
  N00 = N00.getOperand(0);
  N01 = N01.getOperand(0);
  N10 = N10.getOperand(0);
  N11 = N11.getOperand(0);

  // Ensure the extend is from vXi8.
  if (N00.getValueType().getVectorElementType() != MVT::i8 ||
      N01.getValueType().getVectorElementType() != MVT::i8 ||
      N10.getValueType().getVectorElementType() != MVT::i8 ||
      N11.getValueType().getVectorElementType() != MVT::i8)
    return SDValue();

  // All inputs should be build_vectors.
  if (N00.getOpcode() != ISD::BUILD_VECTOR ||
      N01.getOpcode() != ISD::BUILD_VECTOR ||
      N10.getOpcode() != ISD::BUILD_VECTOR ||
      N11.getOpcode() != ISD::BUILD_VECTOR)
    return SDValue();

  // N00/N10 are zero extended. N01/N11 are sign extended.

  // For each element, we need to ensure we have an odd element from one vector
  // multiplied by the odd element of another vector and the even element from
  // one of the same vectors being multiplied by the even element from the
  // other vector. So we need to make sure for each element i, this operator
  // is being performed:
  // A[2 * i] * B[2 * i] + A[2 * i + 1] * B[2 * i + 1]
  SDValue ZExtIn, SExtIn;
  for (unsigned i = 0; i != NumElems; ++i) {
    SDValue N00Elt = N00.getOperand(i);
    SDValue N01Elt = N01.getOperand(i);
    SDValue N10Elt = N10.getOperand(i);
    SDValue N11Elt = N11.getOperand(i);
    // TODO: Be more tolerant to undefs.
    if (N00Elt.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
        N01Elt.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
        N10Elt.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
        N11Elt.getOpcode() != ISD::EXTRACT_VECTOR_ELT)
      return SDValue();
    auto *ConstN00Elt = dyn_cast<ConstantSDNode>(N00Elt.getOperand(1));
    auto *ConstN01Elt = dyn_cast<ConstantSDNode>(N01Elt.getOperand(1));
    auto *ConstN10Elt = dyn_cast<ConstantSDNode>(N10Elt.getOperand(1));
    auto *ConstN11Elt = dyn_cast<ConstantSDNode>(N11Elt.getOperand(1));
    if (!ConstN00Elt || !ConstN01Elt || !ConstN10Elt || !ConstN11Elt)
      return SDValue();
    unsigned IdxN00 = ConstN00Elt->getZExtValue();
    unsigned IdxN01 = ConstN01Elt->getZExtValue();
    unsigned IdxN10 = ConstN10Elt->getZExtValue();
    unsigned IdxN11 = ConstN11Elt->getZExtValue();
    // Add is commutative so indices can be reordered.
    if (IdxN00 > IdxN10) {
      std::swap(IdxN00, IdxN10);
      std::swap(IdxN01, IdxN11);
    }
    // N0 indices be the even element. N1 indices must be the next odd element.
    if (IdxN00 != 2 * i || IdxN10 != 2 * i + 1 ||
        IdxN01 != 2 * i || IdxN11 != 2 * i + 1)
      return SDValue();
    SDValue N00In = N00Elt.getOperand(0);
    SDValue N01In = N01Elt.getOperand(0);
    SDValue N10In = N10Elt.getOperand(0);
    SDValue N11In = N11Elt.getOperand(0);
    // First time we find an input capture it.
    if (!ZExtIn) {
      ZExtIn = N00In;
      SExtIn = N01In;
    }
    if (ZExtIn != N00In || SExtIn != N01In ||
        ZExtIn != N10In || SExtIn != N11In)
      return SDValue();
  }

  auto PMADDBuilder = [](SelectionDAG &DAG, const SDLoc &DL,
                         ArrayRef<SDValue> Ops) {
    // Shrink by adding truncate nodes and let DAGCombine fold with the
    // sources.
    EVT InVT = Ops[0].getValueType();
    assert(InVT.getScalarType() == MVT::i8 &&
           "Unexpected scalar element type");
    assert(InVT == Ops[1].getValueType() && "Operands' types mismatch");
    EVT ResVT = EVT::getVectorVT(*DAG.getContext(), MVT::i16,
                                 InVT.getVectorNumElements() / 2);
    return DAG.getNode(X86ISD::VPMADDUBSW, DL, ResVT, Ops[0], Ops[1]);
  };
  return SplitOpsAndApply(DAG, Subtarget, DL, VT, { ZExtIn, SExtIn },
                          PMADDBuilder);
}

static SDValue combineTruncate(SDNode *N, SelectionDAG &DAG,
                               const X86Subtarget &Subtarget) {
  EVT VT = N->getValueType(0);
  SDValue Src = N->getOperand(0);
  SDLoc DL(N);

  // Attempt to pre-truncate inputs to arithmetic ops instead.
  if (SDValue V = combineTruncatedArithmetic(N, DAG, Subtarget, DL))
    return V;

  // Try to detect AVG pattern first.
  if (SDValue Avg = detectAVGPattern(Src, VT, DAG, Subtarget, DL))
    return Avg;

  // Try to detect PMADD
  if (SDValue PMAdd = detectPMADDUBSW(Src, VT, DAG, Subtarget, DL))
    return PMAdd;

  // Try to combine truncation with signed/unsigned saturation.
  if (SDValue Val = combineTruncateWithSat(Src, VT, DL, DAG, Subtarget))
    return Val;

  // Try to combine PMULHUW/PMULHW for vXi16.
  if (SDValue V = combinePMULH(Src, VT, DL, DAG, Subtarget))
    return V;

[...]

test/CodeGen/X86/pmaddubsw.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+ssse3 | FileCheck %s --check-prefix=SSE
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefixes=AVX,AVX1
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefixes=AVX,AVX256,AVX2
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f | FileCheck %s --check-prefixes=AVX,AVX256,AVX512,AVX512F
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw | FileCheck %s --check-prefixes=AVX,AVX256,AVX512,AVX512BW

				; NOTE: We're testing with loads because ABI lowering creates a concat_vectors that extract_vector_elt creation can see through.
				; This would require the combine to recreate the concat_vectors.
				define <8 x i16> @pmaddubsw_128(<16 x i8>* %Aptr, <16 x i8>* %Bptr) {
				; SSE-LABEL: pmaddubsw_128:
				; SSE: # %bb.0:
				; SSE-NEXT: movdqa (%rsi), %xmm0
				; SSE-NEXT: pmaddubsw (%rdi), %xmm0
				; SSE-NEXT: retq
				;
				; AVX-LABEL: pmaddubsw_128:
				; AVX: # %bb.0:
				; AVX-NEXT: vmovdqa (%rsi), %xmm0
				; AVX-NEXT: vpmaddubsw (%rdi), %xmm0, %xmm0
				; AVX-NEXT: retq
				%A = load <16 x i8>, <16 x i8>* %Aptr
				%B = load <16 x i8>, <16 x i8>* %Bptr
				%A_even = shufflevector <16 x i8> %A, <16 x i8> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%A_odd = shufflevector <16 x i8> %A, <16 x i8> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%B_even = shufflevector <16 x i8> %B, <16 x i8> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%B_odd = shufflevector <16 x i8> %B, <16 x i8> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%A_even_ext = sext <8 x i8> %A_even to <8 x i32>
				%B_even_ext = zext <8 x i8> %B_even to <8 x i32>
				%A_odd_ext = sext <8 x i8> %A_odd to <8 x i32>
				%B_odd_ext = zext <8 x i8> %B_odd to <8 x i32>
				%even_mul = mul <8 x i32> %A_even_ext, %B_even_ext
				%odd_mul = mul <8 x i32> %A_odd_ext, %B_odd_ext
				%add = add <8 x i32> %even_mul, %odd_mul
				%cmp_max = icmp sgt <8 x i32> %add, <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%max = select <8 x i1> %cmp_max, <8 x i32> %add, <8 x i32> <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%cmp_min = icmp slt <8 x i32> %max, <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%min = select <8 x i1> %cmp_min, <8 x i32> %max, <8 x i32> <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%trunc = trunc <8 x i32> %min to <8 x i16>
				ret <8 x i16> %trunc
				}

				define <16 x i16> @pmaddubsw_256(<32 x i8>* %Aptr, <32 x i8>* %Bptr) {
				; SSE-LABEL: pmaddubsw_256:
				; SSE: # %bb.0:
				; SSE-NEXT: movdqa (%rsi), %xmm0
				; SSE-NEXT: movdqa 16(%rsi), %xmm1
				; SSE-NEXT: pmaddubsw (%rdi), %xmm0
				; SSE-NEXT: pmaddubsw 16(%rdi), %xmm1
				; SSE-NEXT: retq
				;
				; AVX1-LABEL: pmaddubsw_256:
				; AVX1: # %bb.0:
				; AVX1-NEXT: vmovdqa (%rdi), %ymm0
				; AVX1-NEXT: vmovdqa (%rsi), %ymm1
				; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2
				; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm3
				; AVX1-NEXT: vpmaddubsw %xmm2, %xmm3, %xmm2
				; AVX1-NEXT: vpmaddubsw %xmm0, %xmm1, %xmm0
				; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
				; AVX1-NEXT: retq
				;
				; AVX256-LABEL: pmaddubsw_256:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vmovdqa (%rsi), %ymm0
				; AVX256-NEXT: vpmaddubsw (%rdi), %ymm0, %ymm0
				; AVX256-NEXT: retq
				%A = load <32 x i8>, <32 x i8>* %Aptr
				%B = load <32 x i8>, <32 x i8>* %Bptr
				%A_even = shufflevector <32 x i8> %A, <32 x i8> undef, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
				%A_odd = shufflevector <32 x i8> %A, <32 x i8> undef, <16 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31>
				%B_even = shufflevector <32 x i8> %B, <32 x i8> undef, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
				%B_odd = shufflevector <32 x i8> %B, <32 x i8> undef, <16 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31>
				%A_even_ext = sext <16 x i8> %A_even to <16 x i32>
				%B_even_ext = zext <16 x i8> %B_even to <16 x i32>
				%A_odd_ext = sext <16 x i8> %A_odd to <16 x i32>
				%B_odd_ext = zext <16 x i8> %B_odd to <16 x i32>
				%even_mul = mul <16 x i32> %A_even_ext, %B_even_ext
				%odd_mul = mul <16 x i32> %A_odd_ext, %B_odd_ext
				%add = add <16 x i32> %even_mul, %odd_mul
				%cmp_max = icmp sgt <16 x i32> %add, <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%max = select <16 x i1> %cmp_max, <16 x i32> %add, <16 x i32> <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%cmp_min = icmp slt <16 x i32> %max, <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%min = select <16 x i1> %cmp_min, <16 x i32> %max, <16 x i32> <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%trunc = trunc <16 x i32> %min to <16 x i16>
				ret <16 x i16> %trunc
				}

				define <64 x i16> @pmaddubsw_512(<128 x i8>* %Aptr, <128 x i8>* %Bptr) {
				; SSE-LABEL: pmaddubsw_512:
				; SSE: # %bb.0:
				; SSE-NEXT: movdqa 112(%rdx), %xmm0
				; SSE-NEXT: movdqa 96(%rdx), %xmm1
				; SSE-NEXT: movdqa 80(%rdx), %xmm2
				; SSE-NEXT: movdqa 64(%rdx), %xmm3
				; SSE-NEXT: movdqa (%rdx), %xmm4
				; SSE-NEXT: movdqa 16(%rdx), %xmm5
				; SSE-NEXT: movdqa 32(%rdx), %xmm6
				; SSE-NEXT: movdqa 48(%rdx), %xmm7
				; SSE-NEXT: pmaddubsw (%rsi), %xmm4
				; SSE-NEXT: pmaddubsw 16(%rsi), %xmm5
				; SSE-NEXT: pmaddubsw 32(%rsi), %xmm6
				; SSE-NEXT: pmaddubsw 48(%rsi), %xmm7
				; SSE-NEXT: pmaddubsw 64(%rsi), %xmm3
				; SSE-NEXT: pmaddubsw 80(%rsi), %xmm2
				; SSE-NEXT: pmaddubsw 96(%rsi), %xmm1
				; SSE-NEXT: pmaddubsw 112(%rsi), %xmm0
				; SSE-NEXT: movdqa %xmm0, 112(%rdi)
				; SSE-NEXT: movdqa %xmm1, 96(%rdi)
				; SSE-NEXT: movdqa %xmm2, 80(%rdi)
				; SSE-NEXT: movdqa %xmm3, 64(%rdi)
				; SSE-NEXT: movdqa %xmm7, 48(%rdi)
				; SSE-NEXT: movdqa %xmm6, 32(%rdi)
				; SSE-NEXT: movdqa %xmm5, 16(%rdi)
				; SSE-NEXT: movdqa %xmm4, (%rdi)
				; SSE-NEXT: movq %rdi, %rax
				; SSE-NEXT: retq
				;
				; AVX1-LABEL: pmaddubsw_512:
				; AVX1: # %bb.0:
				; AVX1-NEXT: vmovdqa (%rdi), %ymm0
				; AVX1-NEXT: vmovdqa 32(%rdi), %ymm1
				; AVX1-NEXT: vmovdqa 64(%rdi), %ymm2
				; AVX1-NEXT: vmovdqa 96(%rdi), %ymm8
				; AVX1-NEXT: vmovdqa (%rsi), %ymm4
				; AVX1-NEXT: vmovdqa 32(%rsi), %ymm5
				; AVX1-NEXT: vmovdqa 64(%rsi), %ymm6
				; AVX1-NEXT: vmovdqa 96(%rsi), %ymm9
				; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm3
				; AVX1-NEXT: vextractf128 $1, %ymm4, %xmm7
				; AVX1-NEXT: vpmaddubsw %xmm3, %xmm7, %xmm3
				; AVX1-NEXT: vpmaddubsw %xmm0, %xmm4, %xmm0
				; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm0
				; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm3
				; AVX1-NEXT: vextractf128 $1, %ymm5, %xmm4
				; AVX1-NEXT: vpmaddubsw %xmm3, %xmm4, %xmm3
				; AVX1-NEXT: vpmaddubsw %xmm1, %xmm5, %xmm1
				; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm1, %ymm1
				; AVX1-NEXT: vextractf128 $1, %ymm2, %xmm3
				; AVX1-NEXT: vextractf128 $1, %ymm6, %xmm4
				; AVX1-NEXT: vpmaddubsw %xmm3, %xmm4, %xmm3
				; AVX1-NEXT: vpmaddubsw %xmm2, %xmm6, %xmm2
				; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm2, %ymm2
				; AVX1-NEXT: vextractf128 $1, %ymm8, %xmm3
				; AVX1-NEXT: vextractf128 $1, %ymm9, %xmm4
				; AVX1-NEXT: vpmaddubsw %xmm3, %xmm4, %xmm3
				; AVX1-NEXT: vpmaddubsw %xmm8, %xmm9, %xmm4
				; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm4, %ymm3
				; AVX1-NEXT: retq
				;
				; AVX2-LABEL: pmaddubsw_512:
				; AVX2: # %bb.0:
				; AVX2-NEXT: vmovdqa (%rsi), %ymm0
				; AVX2-NEXT: vmovdqa 32(%rsi), %ymm1
				; AVX2-NEXT: vmovdqa 64(%rsi), %ymm2
				; AVX2-NEXT: vmovdqa 96(%rsi), %ymm3
				; AVX2-NEXT: vpmaddubsw (%rdi), %ymm0, %ymm0
				; AVX2-NEXT: vpmaddubsw 32(%rdi), %ymm1, %ymm1
				; AVX2-NEXT: vpmaddubsw 64(%rdi), %ymm2, %ymm2
				; AVX2-NEXT: vpmaddubsw 96(%rdi), %ymm3, %ymm3
				; AVX2-NEXT: retq
				;
				; AVX512F-LABEL: pmaddubsw_512:
				; AVX512F: # %bb.0:
				; AVX512F-NEXT: vmovdqa (%rsi), %ymm0
				; AVX512F-NEXT: vmovdqa 32(%rsi), %ymm1
				; AVX512F-NEXT: vmovdqa 64(%rsi), %ymm2
				; AVX512F-NEXT: vmovdqa 96(%rsi), %ymm3
				; AVX512F-NEXT: vpmaddubsw (%rdi), %ymm0, %ymm0
				; AVX512F-NEXT: vpmaddubsw 32(%rdi), %ymm1, %ymm1
				; AVX512F-NEXT: vpmaddubsw 64(%rdi), %ymm2, %ymm2
				; AVX512F-NEXT: vpmaddubsw 96(%rdi), %ymm3, %ymm3
				; AVX512F-NEXT: retq
				;
				; AVX512BW-LABEL: pmaddubsw_512:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vmovdqa64 (%rsi), %zmm0
				; AVX512BW-NEXT: vmovdqa64 64(%rsi), %zmm1
				; AVX512BW-NEXT: vpmaddubsw (%rdi), %zmm0, %zmm0
				; AVX512BW-NEXT: vpmaddubsw 64(%rdi), %zmm1, %zmm1
				; AVX512BW-NEXT: retq
				%A = load <128 x i8>, <128 x i8>* %Aptr
				%B = load <128 x i8>, <128 x i8>* %Bptr
				%A_even = shufflevector <128 x i8> %A, <128 x i8> undef, <64 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30, i32 32, i32 34, i32 36, i32 38, i32 40, i32 42, i32 44, i32 46, i32 48, i32 50, i32 52, i32 54, i32 56, i32 58, i32 60, i32 62, i32 64, i32 66, i32 68, i32 70, i32 72, i32 74, i32 76, i32 78, i32 80, i32 82, i32 84, i32 86, i32 88, i32 90, i32 92, i32 94, i32 96, i32 98, i32 100, i32 102, i32 104, i32 106, i32 108, i32 110, i32 112, i32 114, i32 116, i32 118, i32 120, i32 122, i32 124, i32 126>
				%A_odd = shufflevector <128 x i8> %A, <128 x i8> undef, <64 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31, i32 33, i32 35, i32 37, i32 39, i32 41, i32 43, i32 45, i32 47, i32 49, i32 51, i32 53, i32 55, i32 57, i32 59, i32 61, i32 63, i32 65, i32 67, i32 69, i32 71, i32 73, i32 75, i32 77, i32 79, i32 81, i32 83, i32 85, i32 87, i32 89, i32 91, i32 93, i32 95, i32 97, i32 99, i32 101, i32 103, i32 105, i32 107, i32 109, i32 111, i32 113, i32 115, i32 117, i32 119, i32 121, i32 123, i32 125, i32 127>
				%B_even = shufflevector <128 x i8> %B, <128 x i8> undef, <64 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30, i32 32, i32 34, i32 36, i32 38, i32 40, i32 42, i32 44, i32 46, i32 48, i32 50, i32 52, i32 54, i32 56, i32 58, i32 60, i32 62, i32 64, i32 66, i32 68, i32 70, i32 72, i32 74, i32 76, i32 78, i32 80, i32 82, i32 84, i32 86, i32 88, i32 90, i32 92, i32 94, i32 96, i32 98, i32 100, i32 102, i32 104, i32 106, i32 108, i32 110, i32 112, i32 114, i32 116, i32 118, i32 120, i32 122, i32 124, i32 126>
				%B_odd = shufflevector <128 x i8> %B, <128 x i8> undef, <64 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31, i32 33, i32 35, i32 37, i32 39, i32 41, i32 43, i32 45, i32 47, i32 49, i32 51, i32 53, i32 55, i32 57, i32 59, i32 61, i32 63, i32 65, i32 67, i32 69, i32 71, i32 73, i32 75, i32 77, i32 79, i32 81, i32 83, i32 85, i32 87, i32 89, i32 91, i32 93, i32 95, i32 97, i32 99, i32 101, i32 103, i32 105, i32 107, i32 109, i32 111, i32 113, i32 115, i32 117, i32 119, i32 121, i32 123, i32 125, i32 127>
				%A_even_ext = sext <64 x i8> %A_even to <64 x i32>
				%B_even_ext = zext <64 x i8> %B_even to <64 x i32>
				%A_odd_ext = sext <64 x i8> %A_odd to <64 x i32>
				%B_odd_ext = zext <64 x i8> %B_odd to <64 x i32>
				%even_mul = mul <64 x i32> %A_even_ext, %B_even_ext
				%odd_mul = mul <64 x i32> %A_odd_ext, %B_odd_ext
				%add = add <64 x i32> %even_mul, %odd_mul
				%cmp_max = icmp sgt <64 x i32> %add, <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%max = select <64 x i1> %cmp_max, <64 x i32> %add, <64 x i32> <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%cmp_min = icmp slt <64 x i32> %max, <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%min = select <64 x i1> %cmp_min, <64 x i32> %max, <64 x i32> <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%trunc = trunc <64 x i32> %min to <64 x i16>
				ret <64 x i16> %trunc
				}

				define <8 x i16> @pmaddubsw_swapped_indices(<16 x i8>* %Aptr, <16 x i8>* %Bptr) {
				; SSE-LABEL: pmaddubsw_swapped_indices:
				; SSE: # %bb.0:
				; SSE-NEXT: movdqa (%rsi), %xmm0
				; SSE-NEXT: pmaddubsw (%rdi), %xmm0
				; SSE-NEXT: retq
				;
				; AVX-LABEL: pmaddubsw_swapped_indices:
				; AVX: # %bb.0:
				; AVX-NEXT: vmovdqa (%rsi), %xmm0
				; AVX-NEXT: vpmaddubsw (%rdi), %xmm0, %xmm0
				; AVX-NEXT: retq
				%A = load <16 x i8>, <16 x i8>* %Aptr
				%B = load <16 x i8>, <16 x i8>* %Bptr
				%A_even = shufflevector <16 x i8> %A, <16 x i8> undef, <8 x i32> <i32 1, i32 2, i32 5, i32 6, i32 9, i32 10, i32 13, i32 14> ;indices aren't all even
				%A_odd = shufflevector <16 x i8> %A, <16 x i8> undef, <8 x i32> <i32 0, i32 3, i32 4, i32 7, i32 8, i32 11, i32 12, i32 15> ;indices aren't all odd
				%B_even = shufflevector <16 x i8> %B, <16 x i8> undef, <8 x i32> <i32 1, i32 2, i32 5, i32 6, i32 9, i32 10, i32 13, i32 14> ;same indices as A
				%B_odd = shufflevector <16 x i8> %B, <16 x i8> undef, <8 x i32> <i32 0, i32 3, i32 4, i32 7, i32 8, i32 11, i32 12, i32 15> ;same indices as A
				%A_even_ext = sext <8 x i8> %A_even to <8 x i32>
				%B_even_ext = zext <8 x i8> %B_even to <8 x i32>
				%A_odd_ext = sext <8 x i8> %A_odd to <8 x i32>
				%B_odd_ext = zext <8 x i8> %B_odd to <8 x i32>
				%even_mul = mul <8 x i32> %A_even_ext, %B_even_ext
				%odd_mul = mul <8 x i32> %A_odd_ext, %B_odd_ext
				%add = add <8 x i32> %even_mul, %odd_mul
				%cmp_max = icmp sgt <8 x i32> %add, <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%max = select <8 x i1> %cmp_max, <8 x i32> %add, <8 x i32> <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%cmp_min = icmp slt <8 x i32> %max, <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%min = select <8 x i1> %cmp_min, <8 x i32> %max, <8 x i32> <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%trunc = trunc <8 x i32> %min to <8 x i16>
				ret <8 x i16> %trunc
				}

				define <8 x i16> @pmaddubsw_swapped_extend(<16 x i8>* %Aptr, <16 x i8>* %Bptr) {
				; SSE-LABEL: pmaddubsw_swapped_extend:
				; SSE: # %bb.0:
				; SSE-NEXT: movdqa (%rdi), %xmm0
				; SSE-NEXT: pmaddubsw (%rsi), %xmm0
				; SSE-NEXT: retq
				;
				; AVX-LABEL: pmaddubsw_swapped_extend:
				; AVX: # %bb.0:
				; AVX-NEXT: vmovdqa (%rdi), %xmm0
				; AVX-NEXT: vpmaddubsw (%rsi), %xmm0, %xmm0
				; AVX-NEXT: retq
				%A = load <16 x i8>, <16 x i8>* %Aptr
				%B = load <16 x i8>, <16 x i8>* %Bptr
				%A_even = shufflevector <16 x i8> %A, <16 x i8> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%A_odd = shufflevector <16 x i8> %A, <16 x i8> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%B_even = shufflevector <16 x i8> %B, <16 x i8> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%B_odd = shufflevector <16 x i8> %B, <16 x i8> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%A_even_ext = zext <8 x i8> %A_even to <8 x i32>
				%B_even_ext = sext <8 x i8> %B_even to <8 x i32>
				%A_odd_ext = zext <8 x i8> %A_odd to <8 x i32>
				%B_odd_ext = sext <8 x i8> %B_odd to <8 x i32>
				%even_mul = mul <8 x i32> %A_even_ext, %B_even_ext
				%odd_mul = mul <8 x i32> %A_odd_ext, %B_odd_ext
				%add = add <8 x i32> %even_mul, %odd_mul
				%cmp_max = icmp sgt <8 x i32> %add, <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%max = select <8 x i1> %cmp_max, <8 x i32> %add, <8 x i32> <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%cmp_min = icmp slt <8 x i32> %max, <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%min = select <8 x i1> %cmp_min, <8 x i32> %max, <8 x i32> <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%trunc = trunc <8 x i32> %min to <8 x i16>
				ret <8 x i16> %trunc
				}

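				; The even-element multiply has its operands commuted (B * A instead of
				; A * B); the pattern should still be recognized.
				define <8 x i16> @pmaddubsw_commuted_mul(<16 x i8>* %Aptr, <16 x i8>* %Bptr) {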
				; SSE-LABEL: pmaddubsw_commuted_mul:
				; SSE: # %bb.0:
				; SSE-NEXT: movdqa (%rsi), %xmm0
				; SSE-NEXT: pmaddubsw (%rdi), %xmm0
				; SSE-NEXT: retq
				;
				; AVX-LABEL: pmaddubsw_commuted_mul:
				; AVX: # %bb.0:
				; AVX-NEXT: vmovdqa (%rsi), %xmm0
				; AVX-NEXT: vpmaddubsw (%rdi), %xmm0, %xmm0
				; AVX-NEXT: retq
				%A = load <16 x i8>, <16 x i8>* %Aptr
				%B = load <16 x i8>, <16 x i8>* %Bptr
				%A_even = shufflevector <16 x i8> %A, <16 x i8> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%A_odd = shufflevector <16 x i8> %A, <16 x i8> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%B_even = shufflevector <16 x i8> %B, <16 x i8> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%B_odd = shufflevector <16 x i8> %B, <16 x i8> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%A_even_ext = sext <8 x i8> %A_even to <8 x i32>
				%B_even_ext = zext <8 x i8> %B_even to <8 x i32>
				%A_odd_ext = sext <8 x i8> %A_odd to <8 x i32>
				%B_odd_ext = zext <8 x i8> %B_odd to <8 x i32>
				%even_mul = mul <8 x i32> %B_even_ext, %A_even_ext
				%odd_mul = mul <8 x i32> %A_odd_ext, %B_odd_ext
				%add = add <8 x i32> %even_mul, %odd_mul
				%cmp_max = icmp sgt <8 x i32> %add, <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%max = select <8 x i1> %cmp_max, <8 x i32> %add, <8 x i32> <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%cmp_min = icmp slt <8 x i32> %max, <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%min = select <8 x i1> %cmp_min, <8 x i32> %max, <8 x i32> <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%trunc = trunc <8 x i32> %min to <8 x i16>
				ret <8 x i16> %trunc
				}

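				; Negative test: the even elements compute sext(A) * zext(B) but the odd
				; elements compute zext(A) * sext(B), so neither input is consistently
				; signed or unsigned and no pmaddubsw is expected.
				define <8 x i16> @pmaddubsw_bad_extend(<16 x i8>* %Aptr, <16 x i8>* %Bptr) {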
				; SSE-LABEL: pmaddubsw_bad_extend:
				; SSE: # %bb.0:
				; SSE-NEXT: movdqa (%rdi), %xmm1
				; SSE-NEXT: movdqa (%rsi), %xmm0
				; SSE-NEXT: movdqa {{.*#+}} xmm2 = [255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0]
				; SSE-NEXT: pand %xmm0, %xmm2
				; SSE-NEXT: movdqa %xmm1, %xmm3
				; SSE-NEXT: psllw $8, %xmm3
				; SSE-NEXT: psraw $8, %xmm3
				; SSE-NEXT: movdqa %xmm3, %xmm4
				; SSE-NEXT: pmulhw %xmm2, %xmm4
				; SSE-NEXT: pmullw %xmm2, %xmm3
				; SSE-NEXT: movdqa %xmm3, %xmm2
				; SSE-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
				; SSE-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
				; SSE-NEXT: psraw $8, %xmm0
				; SSE-NEXT: psrlw $8, %xmm1
				; SSE-NEXT: movdqa %xmm1, %xmm4
				; SSE-NEXT: pmulhw %xmm0, %xmm4
				; SSE-NEXT: pmullw %xmm0, %xmm1
				; SSE-NEXT: movdqa %xmm1, %xmm0
				; SSE-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
				; SSE-NEXT: paddd %xmm2, %xmm0
				; SSE-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm4[4],xmm1[5],xmm4[5],xmm1[6],xmm4[6],xmm1[7],xmm4[7]
				; SSE-NEXT: paddd %xmm3, %xmm1
				; SSE-NEXT: packssdw %xmm1, %xmm0
				; SSE-NEXT: retq
				;
				; AVX1-LABEL: pmaddubsw_bad_extend:
				; AVX1: # %bb.0:
				; AVX1-NEXT: vmovdqa (%rdi), %xmm0
				; AVX1-NEXT: vmovdqa (%rsi), %xmm1
				; AVX1-NEXT: vmovdqa {{.*#+}} xmm2 = <8,10,12,14,u,u,u,u,u,u,u,u,u,u,u,u>
				; AVX1-NEXT: vpshufb %xmm2, %xmm0, %xmm3
				; AVX1-NEXT: vpmovsxbd %xmm3, %xmm3
				; AVX1-NEXT: vmovdqa {{.*#+}} xmm4 = <0,2,4,6,u,u,u,u,u,u,u,u,u,u,u,u>
				; AVX1-NEXT: vpshufb %xmm4, %xmm0, %xmm5
				; AVX1-NEXT: vpmovsxbd %xmm5, %xmm5
				; AVX1-NEXT: vpshufb %xmm2, %xmm1, %xmm2
				; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero
				; AVX1-NEXT: vpmulld %xmm2, %xmm3, %xmm2
				; AVX1-NEXT: vpshufb %xmm4, %xmm1, %xmm3
				; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm3 = xmm3[0],zero,zero,zero,xmm3[1],zero,zero,zero,xmm3[2],zero,zero,zero,xmm3[3],zero,zero,zero
				; AVX1-NEXT: vpmulld %xmm3, %xmm5, %xmm3
				; AVX1-NEXT: vmovdqa {{.*#+}} xmm4 = <9,11,13,15,u,u,u,u,u,u,u,u,u,u,u,u>
				; AVX1-NEXT: vpshufb %xmm4, %xmm0, %xmm5
				; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm5 = xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero,xmm5[2],zero,zero,zero,xmm5[3],zero,zero,zero
				; AVX1-NEXT: vmovdqa {{.*#+}} xmm6 = <1,3,5,7,u,u,u,u,u,u,u,u,u,u,u,u>
				; AVX1-NEXT: vpshufb %xmm6, %xmm0, %xmm0
				; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
				; AVX1-NEXT: vpshufb %xmm4, %xmm1, %xmm4
				; AVX1-NEXT: vpmovsxbd %xmm4, %xmm4
				; AVX1-NEXT: vpmulld %xmm4, %xmm5, %xmm4
				; AVX1-NEXT: vpaddd %xmm4, %xmm2, %xmm2
				; AVX1-NEXT: vpshufb %xmm6, %xmm1, %xmm1
				; AVX1-NEXT: vpmovsxbd %xmm1, %xmm1
				; AVX1-NEXT: vpmulld %xmm1, %xmm0, %xmm0
				; AVX1-NEXT: vpaddd %xmm0, %xmm3, %xmm0
				; AVX1-NEXT: vpackssdw %xmm2, %xmm0, %xmm0
				; AVX1-NEXT: retq
				;
				; AVX2-LABEL: pmaddubsw_bad_extend:
				; AVX2: # %bb.0:
				; AVX2-NEXT: vmovdqa (%rdi), %xmm0
				; AVX2-NEXT: vmovdqa (%rsi), %xmm1
				; AVX2-NEXT: vmovdqa {{.*#+}} xmm2 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
				; AVX2-NEXT: vpshufb %xmm2, %xmm0, %xmm3
				; AVX2-NEXT: vpmovsxbd %xmm3, %ymm3
				; AVX2-NEXT: vpshufb %xmm2, %xmm1, %xmm2
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero,xmm2[4],zero,zero,zero,xmm2[5],zero,zero,zero,xmm2[6],zero,zero,zero,xmm2[7],zero,zero,zero
				; AVX2-NEXT: vpmulld %ymm2, %ymm3, %ymm2
				; AVX2-NEXT: vmovdqa {{.*#+}} xmm3 = <1,3,5,7,9,11,13,15,u,u,u,u,u,u,u,u>
				; AVX2-NEXT: vpshufb %xmm3, %xmm0, %xmm0
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
				; AVX2-NEXT: vpshufb %xmm3, %xmm1, %xmm1
				; AVX2-NEXT: vpmovsxbd %xmm1, %ymm1
				; AVX2-NEXT: vpmulld %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vpaddd %ymm0, %ymm2, %ymm0
				; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX2-NEXT: vpackssdw %xmm1, %xmm0, %xmm0
				; AVX2-NEXT: vzeroupper
				; AVX2-NEXT: retq
				;
				; AVX512-LABEL: pmaddubsw_bad_extend:
				; AVX512: # %bb.0:
				; AVX512-NEXT: vmovdqa (%rdi), %xmm0
				; AVX512-NEXT: vmovdqa (%rsi), %xmm1
				; AVX512-NEXT: vmovdqa {{.*#+}} xmm2 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
				; AVX512-NEXT: vpshufb %xmm2, %xmm0, %xmm3
				; AVX512-NEXT: vpmovsxbd %xmm3, %ymm3
				; AVX512-NEXT: vpshufb %xmm2, %xmm1, %xmm2
				; AVX512-NEXT: vpmovzxbd {{.*#+}} ymm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero,xmm2[4],zero,zero,zero,xmm2[5],zero,zero,zero,xmm2[6],zero,zero,zero,xmm2[7],zero,zero,zero
				; AVX512-NEXT: vpmulld %ymm2, %ymm3, %ymm2
				; AVX512-NEXT: vmovdqa {{.*#+}} xmm3 = <1,3,5,7,9,11,13,15,u,u,u,u,u,u,u,u>
				; AVX512-NEXT: vpshufb %xmm3, %xmm0, %xmm0
				; AVX512-NEXT: vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
				; AVX512-NEXT: vpshufb %xmm3, %xmm1, %xmm1
				; AVX512-NEXT: vpmovsxbd %xmm1, %ymm1
				; AVX512-NEXT: vpmulld %ymm1, %ymm0, %ymm0
				; AVX512-NEXT: vpaddd %ymm0, %ymm2, %ymm0
				; AVX512-NEXT: vpbroadcastd {{.*#+}} ymm1 = [4294934528,4294934528,4294934528,4294934528,4294934528,4294934528,4294934528,4294934528]
				; AVX512-NEXT: vpmaxsd %ymm1, %ymm0, %ymm0
				; AVX512-NEXT: vpbroadcastd {{.*#+}} ymm1 = [32767,32767,32767,32767,32767,32767,32767,32767]
				; AVX512-NEXT: vpminsd %ymm1, %ymm0, %ymm0
				; AVX512-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512-NEXT: # kill: def $xmm0 killed $xmm0 killed $ymm0
				; AVX512-NEXT: vzeroupper
				; AVX512-NEXT: retq
				%A = load <16 x i8>, <16 x i8>* %Aptr
				%B = load <16 x i8>, <16 x i8>* %Bptr
				%A_even = shufflevector <16 x i8> %A, <16 x i8> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%A_odd = shufflevector <16 x i8> %A, <16 x i8> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%B_even = shufflevector <16 x i8> %B, <16 x i8> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%B_odd = shufflevector <16 x i8> %B, <16 x i8> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%A_even_ext = sext <8 x i8> %A_even to <8 x i32>
				%B_even_ext = zext <8 x i8> %B_even to <8 x i32>
				%A_odd_ext = zext <8 x i8> %A_odd to <8 x i32>
				%B_odd_ext = sext <8 x i8> %B_odd to <8 x i32>
				%even_mul = mul <8 x i32> %A_even_ext, %B_even_ext
				%odd_mul = mul <8 x i32> %A_odd_ext, %B_odd_ext
				%add = add <8 x i32> %even_mul, %odd_mul
				%cmp_max = icmp sgt <8 x i32> %add, <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%max = select <8 x i1> %cmp_max, <8 x i32> %add, <8 x i32> <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%cmp_min = icmp slt <8 x i32> %max, <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%min = select <8 x i1> %cmp_min, <8 x i32> %max, <8 x i32> <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%trunc = trunc <8 x i32> %min to <8 x i16>
				ret <8 x i16> %trunc
				}

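				; Negative test: the shuffles of A select different element positions than
				; the corresponding shuffles of B, so no pmaddubsw is expected.
				define <8 x i16> @pmaddubsw_bad_indices(<16 x i8>* %Aptr, <16 x i8>* %Bptr) {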
				; SSE-LABEL: pmaddubsw_bad_indices:
				; SSE: # %bb.0:
				; SSE-NEXT: movdqa (%rdi), %xmm1
				; SSE-NEXT: movdqa (%rsi), %xmm0
				; SSE-NEXT: movdqa {{.*#+}} xmm2 = [255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0]
				; SSE-NEXT: pand %xmm0, %xmm2
				; SSE-NEXT: movdqa %xmm1, %xmm3
				; SSE-NEXT: pshufb {{.*#+}} xmm3 = xmm3[u,1,u,2,u,5,u,6,u,9,u,10,u,13,u,14]
				; SSE-NEXT: psraw $8, %xmm3
				; SSE-NEXT: movdqa %xmm3, %xmm4
				; SSE-NEXT: pmulhw %xmm2, %xmm4
				; SSE-NEXT: pmullw %xmm2, %xmm3
				; SSE-NEXT: movdqa %xmm3, %xmm2
				; SSE-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
				; SSE-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
				; SSE-NEXT: psrlw $8, %xmm0
				; SSE-NEXT: pshufb {{.*#+}} xmm1 = xmm1[u,0,u,3,u,4,u,7,u,8,u,11,u,12,u,15]
				; SSE-NEXT: psraw $8, %xmm1
				; SSE-NEXT: movdqa %xmm1, %xmm4
				; SSE-NEXT: pmulhw %xmm0, %xmm4
				; SSE-NEXT: pmullw %xmm0, %xmm1
				; SSE-NEXT: movdqa %xmm1, %xmm0
				; SSE-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
				; SSE-NEXT: paddd %xmm2, %xmm0
				; SSE-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm4[4],xmm1[5],xmm4[5],xmm1[6],xmm4[6],xmm1[7],xmm4[7]
				; SSE-NEXT: paddd %xmm3, %xmm1
				; SSE-NEXT: packssdw %xmm1, %xmm0
				; SSE-NEXT: retq
				;
				; AVX1-LABEL: pmaddubsw_bad_indices:
				; AVX1: # %bb.0:
				; AVX1-NEXT: vmovdqa (%rdi), %xmm0
				; AVX1-NEXT: vmovdqa (%rsi), %xmm1
				; AVX1-NEXT: vpshufb {{.*#+}} xmm2 = xmm0[9,10,13,14,u,u,u,u,u,u,u,u,u,u,u,u]
				; AVX1-NEXT: vpmovsxbd %xmm2, %xmm2
				; AVX1-NEXT: vpshufb {{.*#+}} xmm3 = xmm0[1,2,5,6,u,u,u,u,u,u,u,u,u,u,u,u]
				; AVX1-NEXT: vpmovsxbd %xmm3, %xmm3
				; AVX1-NEXT: vpshufb {{.*#+}} xmm4 = xmm1[8,10,12,14,u,u,u,u,u,u,u,u,u,u,u,u]
				; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm4 = xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero,xmm4[2],zero,zero,zero,xmm4[3],zero,zero,zero
				; AVX1-NEXT: vpmulld %xmm4, %xmm2, %xmm2
				; AVX1-NEXT: vpshufb {{.*#+}} xmm4 = xmm1[0,2,4,6,u,u,u,u,u,u,u,u,u,u,u,u]
				; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm4 = xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero,xmm4[2],zero,zero,zero,xmm4[3],zero,zero,zero
				; AVX1-NEXT: vpmulld %xmm4, %xmm3, %xmm3
				; AVX1-NEXT: vpshufb {{.*#+}} xmm4 = xmm0[8,11,12,15,u,u,u,u,u,u,u,u,u,u,u,u]
				; AVX1-NEXT: vpmovsxbd %xmm4, %xmm4
				; AVX1-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,3,4,7,u,u,u,u,u,u,u,u,u,u,u,u]
				; AVX1-NEXT: vpmovsxbd %xmm0, %xmm0
				; AVX1-NEXT: vpshufb {{.*#+}} xmm5 = xmm1[9,11,13,15,u,u,u,u,u,u,u,u,u,u,u,u]
				; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm5 = xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero,xmm5[2],zero,zero,zero,xmm5[3],zero,zero,zero
				; AVX1-NEXT: vpmulld %xmm5, %xmm4, %xmm4
				; AVX1-NEXT: vpaddd %xmm4, %xmm2, %xmm2
				; AVX1-NEXT: vpshufb {{.*#+}} xmm1 = xmm1[1,3,5,7,u,u,u,u,u,u,u,u,u,u,u,u]
				; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
				; AVX1-NEXT: vpmulld %xmm1, %xmm0, %xmm0
				; AVX1-NEXT: vpaddd %xmm0, %xmm3, %xmm0
				; AVX1-NEXT: vpackssdw %xmm2, %xmm0, %xmm0
				; AVX1-NEXT: retq
				;
				; AVX2-LABEL: pmaddubsw_bad_indices:
				; AVX2: # %bb.0:
				; AVX2-NEXT: vmovdqa (%rdi), %xmm0
				; AVX2-NEXT: vmovdqa (%rsi), %xmm1
				; AVX2-NEXT: vpshufb {{.*#+}} xmm2 = xmm0[1,2,5,6,9,10,13,14,u,u,u,u,u,u,u,u]
				; AVX2-NEXT: vpmovsxbd %xmm2, %ymm2
				; AVX2-NEXT: vpshufb {{.*#+}} xmm3 = xmm1[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm3 = xmm3[0],zero,zero,zero,xmm3[1],zero,zero,zero,xmm3[2],zero,zero,zero,xmm3[3],zero,zero,zero,xmm3[4],zero,zero,zero,xmm3[5],zero,zero,zero,xmm3[6],zero,zero,zero,xmm3[7],zero,zero,zero
				; AVX2-NEXT: vpmulld %ymm3, %ymm2, %ymm2
				; AVX2-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,3,4,7,8,11,12,15,u,u,u,u,u,u,u,u]
				; AVX2-NEXT: vpmovsxbd %xmm0, %ymm0
				; AVX2-NEXT: vpshufb {{.*#+}} xmm1 = xmm1[1,3,5,7,9,11,13,15,u,u,u,u,u,u,u,u]
				; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
				; AVX2-NEXT: vpmulld %ymm1, %ymm0, %ymm0
				; AVX2-NEXT: vpaddd %ymm0, %ymm2, %ymm0
				; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX2-NEXT: vpackssdw %xmm1, %xmm0, %xmm0
				; AVX2-NEXT: vzeroupper
				; AVX2-NEXT: retq
				;
				; AVX512-LABEL: pmaddubsw_bad_indices:
				; AVX512: # %bb.0:
				; AVX512-NEXT: vmovdqa (%rdi), %xmm0
				; AVX512-NEXT: vmovdqa (%rsi), %xmm1
				; AVX512-NEXT: vpshufb {{.*#+}} xmm2 = xmm0[1,2,5,6,9,10,13,14,u,u,u,u,u,u,u,u]
				; AVX512-NEXT: vpmovsxbd %xmm2, %ymm2
				; AVX512-NEXT: vpshufb {{.*#+}} xmm3 = xmm1[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
				; AVX512-NEXT: vpmovzxbd {{.*#+}} ymm3 = xmm3[0],zero,zero,zero,xmm3[1],zero,zero,zero,xmm3[2],zero,zero,zero,xmm3[3],zero,zero,zero,xmm3[4],zero,zero,zero,xmm3[5],zero,zero,zero,xmm3[6],zero,zero,zero,xmm3[7],zero,zero,zero
				; AVX512-NEXT: vpmulld %ymm3, %ymm2, %ymm2
				; AVX512-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,3,4,7,8,11,12,15,u,u,u,u,u,u,u,u]
				; AVX512-NEXT: vpmovsxbd %xmm0, %ymm0
				; AVX512-NEXT: vpshufb {{.*#+}} xmm1 = xmm1[1,3,5,7,9,11,13,15,u,u,u,u,u,u,u,u]
				; AVX512-NEXT: vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
				; AVX512-NEXT: vpmulld %ymm1, %ymm0, %ymm0
				; AVX512-NEXT: vpaddd %ymm0, %ymm2, %ymm0
				; AVX512-NEXT: vpbroadcastd {{.*#+}} ymm1 = [4294934528,4294934528,4294934528,4294934528,4294934528,4294934528,4294934528,4294934528]
				; AVX512-NEXT: vpmaxsd %ymm1, %ymm0, %ymm0
				; AVX512-NEXT: vpbroadcastd {{.*#+}} ymm1 = [32767,32767,32767,32767,32767,32767,32767,32767]
				; AVX512-NEXT: vpminsd %ymm1, %ymm0, %ymm0
				; AVX512-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512-NEXT: # kill: def $xmm0 killed $xmm0 killed $ymm0
				; AVX512-NEXT: vzeroupper
				; AVX512-NEXT: retq
				%A = load <16 x i8>, <16 x i8>* %Aptr
				%B = load <16 x i8>, <16 x i8>* %Bptr
				%A_even = shufflevector <16 x i8> %A, <16 x i8> undef, <8 x i32> <i32 1, i32 2, i32 5, i32 6, i32 9, i32 10, i32 13, i32 14> ; indices aren't all even
				%A_odd = shufflevector <16 x i8> %A, <16 x i8> undef, <8 x i32> <i32 0, i32 3, i32 4, i32 7, i32 8, i32 11, i32 12, i32 15> ; indices aren't all odd
				%B_even = shufflevector <16 x i8> %B, <16 x i8> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14> ; differs from A's indices
				%B_odd = shufflevector <16 x i8> %B, <16 x i8> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15> ; differs from A's indices
				%A_even_ext = sext <8 x i8> %A_even to <8 x i32>
				%B_even_ext = zext <8 x i8> %B_even to <8 x i32>
				%A_odd_ext = sext <8 x i8> %A_odd to <8 x i32>
				%B_odd_ext = zext <8 x i8> %B_odd to <8 x i32>
				%even_mul = mul <8 x i32> %A_even_ext, %B_even_ext
				%odd_mul = mul <8 x i32> %A_odd_ext, %B_odd_ext
				%add = add <8 x i32> %even_mul, %odd_mul
				%cmp_max = icmp sgt <8 x i32> %add, <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%max = select <8 x i1> %cmp_max, <8 x i32> %add, <8 x i32> <i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768, i32 -32768>
				%cmp_min = icmp slt <8 x i32> %max, <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%min = select <8 x i1> %cmp_min, <8 x i32> %max, <8 x i32> <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>
				%trunc = trunc <8 x i32> %min to <8 x i16>
				ret <8 x i16> %trunc
				}