This is an archive of the discontinued LLVM Phabricator instance.

Differential D20931

[X86] Reduce the width of multiplification when its operands are extended from i8 or i16
ClosedPublic

Authored by wmi on Jun 2 2016, 2:36 PM.


Details

Reviewers

RKSimon
mkuper
congh
hfinkel

Commits

rGb799a625f922: [X86] Reduce the width of multiplification when its operands are extended from…
rL272694: [X86] Reduce the width of multiplification when its operands are extended…

Summary

For a mul of type <N x i32>, pmuludq is used on targets without SSE41. This often introduces many extra pack and unpack instructions in a vectorized loop body, because pmuludq generates a value of type <N/2 x i64>. However, when the operands of the <N x i32> mul are extended from smaller values such as i8 and i16, the mul can be shrunk to use pmullw + pmulhw/pmulhuw instead of pmuludq, which generates better code. Targets with SSE41 support pmulld, so no shrinking is needed there.
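The identity the summary relies on can be checked per lane in scalar code. The following is a minimal sketch, not code from the patch: the helper names (`mulLo16`, `mulHiS16`, `mulHiU16`, `interleave`, `narrowMulS16`, `narrowMulU16`) are hypothetical, and each function models a single lane of the named instruction.

```cpp
#include <cassert>
#include <cstdint>

// Low 16 bits of the 16x16 product; models one lane of pmullw.
inline uint16_t mulLo16(int16_t a, int16_t b) {
  int32_t p = static_cast<int32_t>(a) * static_cast<int32_t>(b);
  return static_cast<uint16_t>(static_cast<uint32_t>(p));
}

// High 16 bits of the signed 16x16 product; models one lane of pmulhw.
inline uint16_t mulHiS16(int16_t a, int16_t b) {
  int32_t p = static_cast<int32_t>(a) * static_cast<int32_t>(b);
  return static_cast<uint16_t>(static_cast<uint32_t>(p) >> 16);
}

// High 16 bits of the unsigned 16x16 product; models one lane of pmulhuw.
inline uint16_t mulHiU16(uint16_t a, uint16_t b) {
  uint32_t p = static_cast<uint32_t>(a) * static_cast<uint32_t>(b);
  return static_cast<uint16_t>(p >> 16);
}

// Interleave the low and high halves into one 32-bit lane, as
// punpcklwd/punpckhwd do for a pair of 16-bit lanes.
inline uint32_t interleave(uint16_t lo, uint16_t hi) {
  return (static_cast<uint32_t>(hi) << 16) | lo;
}

// Bit pattern of the 32-bit signed product, rebuilt from 16-bit multiplies.
inline uint32_t narrowMulS16(int16_t a, int16_t b) {
  return interleave(mulLo16(a, b), mulHiS16(a, b));
}

// Bit pattern of the 32-bit unsigned product, rebuilt from 16-bit multiplies.
inline uint32_t narrowMulU16(uint16_t a, uint16_t b) {
  uint16_t lo = static_cast<uint16_t>(static_cast<uint32_t>(a) * b);
  return interleave(lo, mulHiU16(a, b));
}
```

The results are returned as unsigned bit patterns so that the comparison against the full 32-bit product does not depend on signed overflow or narrowing behavior.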

Diff Detail

Repository: rL LLVM

Event Timeline

wmi updated this revision to Diff 59458.Jun 2 2016, 2:36 PM

wmi retitled this revision from to [X86] Reduce the width of multiplification when its operands are extended from i8 or i16.

wmi updated this object.

wmi added reviewers: hfinkel, RKSimon, congh.

wmi set the repository for this revision to rL LLVM.

wmi added subscribers: llvm-commits, davidxl, mkuper.

The testcases are a bit more complicated than they need to be; could you reduce them to just the minimal IR? In particular, including the loops makes it harder to read.

Also, missing testcase for the multiply-by-constant case.

lib/Target/X86/X86ISelLowering.cpp
26450 ↗	(On Diff #59458)	"shrunk", not "shrinked"... but you might want to use "narrowed" here instead.
26477 ↗	(On Diff #59458)	What if one side is signed, and the other is unsigned?
26493 ↗	(On Diff #59458)	Repeated code; should be refactored.
26507 ↗	(On Diff #59458)	This is not what you want. The constant vector is "zero-extended" if ZEXT(TRUNC(vec))==vec, and "sign-extended" if SEXT(TRUNC(vec)) == vec. Computing whether the vector is a splat has no relation to either of those properties.
26612 ↗	(On Diff #59458)	There's massive code duplication here; needs to be refactored. Maybe make splitting the inputs if necessary and actually computing the product separate steps?
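The ZEXT(TRUNC(vec)) == vec and SEXT(TRUNC(vec)) == vec conditions from the comment on line 26507 can be illustrated on plain integers. This is a sketch assuming 32-bit elements narrowed to 16 bits; `isZExtFromI16` and `isSExtFromI16` are hypothetical names, not functions in LLVM.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// True if every element equals zext32(trunc16(element)).
inline bool isZExtFromI16(const std::vector<int32_t> &vec) {
  for (int32_t x : vec) {
    uint32_t bits = static_cast<uint32_t>(x);
    if (bits != static_cast<uint32_t>(static_cast<uint16_t>(bits)))
      return false;
  }
  return true;
}

// True if every element equals sext32(trunc16(element)). For 32-bit values
// this is the same as a range check against the i16 limits.
inline bool isSExtFromI16(const std::vector<int32_t> &vec) {
  for (int32_t x : vec)
    if (x < INT16_MIN || x > INT16_MAX)
      return false;
  return true;
}
```

Neither check looks at whether the elements are equal to each other, which is the point of the review comment: a non-splat vector such as `{1, 2, 3, 4}` passes both.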

mkuper added inline comments.Jun 2 2016, 3:13 PM

lib/Target/X86/X86ISelLowering.cpp
26452 ↗	(On Diff #59458)	Any chance to split this function up? Or is there no logical way to do that?
26468 ↗	(On Diff #59458)	I'm not sure DONTCARE is a good name for this - it's not a "top" value. Maybe UNKNOWN?
26475 ↗	(On Diff #59458)	ISD::ANY_EXTEND would work too, right? (Not sure how to test this, but if both ZERO and SIGN work, ANY should definitely work.)
26481 ↗	(On Diff #59458)	What if it's MVT::i1? You may want an "else return SDValue()" here as well.
26486 ↗	(On Diff #59458)	Same two comments as above.
26494 ↗	(On Diff #59458)	What happens if one of the extends is a sext and the other is a zext? It seems like if N0 is a sext and N1 is a zext we'll get IsSigned == true, and if it's the other way around, we'll get IsSigned == false, which seems oddly asymmetrical. Or is there a canonicalization earlier that guarantees the order of a sext and zext?
26499 ↗	(On Diff #59458)	Are you guaranteed that if one of {N0, N1} is an extend, and the other is a BuildVector, N0 is the extend? I can imagine something canonicalizes it that way - and if that's the case, documenting this here would probably be a good idea.
26519 ↗	(On Diff #59458)	Why not use N0->getOperand(0), at least when the type is i16? Although I guess we're pretty much guaranteed DAGCombine will clean this up, and this may be a bit cleaner as is. So, I'm not really sure which is better.
26523 ↗	(On Diff #59458)	This is necessary because this runs before legalization, right? What if it ran after legalization? Does that help? Or is that too late?
26544 ↗	(On Diff #59458)	Why not use a generic shuffle here? Is the lowering further down not good enough to get the right unpacks?
29777 ↗	(On Diff #59458)	Why the linebreak?
test/CodeGen/X86/shrink_vmul.ll
1 ↗	(On Diff #59458)	Could you add sext tests as well? I'm not sure you need the whole type x {zext, sext} matrix, but one sext test would be good.

Eli and Michael, thanks for your comments. Will post a new patch after I address the issues mentioned.

lib/Target/X86/X86ISelLowering.cpp
26450 ↗	(On Diff #59458)	Will fix it.
26452 ↗	(On Diff #59458)	I can extract the pattern matching part to a separate func. Do you think it is enough?
26468 ↗	(On Diff #59458)	That is better. Will fix it.
26475 ↗	(On Diff #59458)	Yes, it should work. I think I can generate the same code for it with ISD::ZERO_EXTEND.
26477 ↗	(On Diff #59458)	Nice catch. I think then the type of the mul should be signed.
26481 ↗	(On Diff #59458)	You are right. Will fix it.
26493 ↗	(On Diff #59458)	Will fix it.
26494 ↗	(On Diff #59458)	Eli asked a similar question. If one side is sext, and the other is zext, then the type of mult should be signed. I will fix it.
26499 ↗	(On Diff #59458)	Yes, InstCombine will canonicalize it. Will add comment for it.
26507 ↗	(On Diff #59458)	Maybe I can use SplatValue.isNegative() to get the signedness of the constant? Then use it together with the signedness of N0 to determine the signedness of the multiplication.
26519 ↗	(On Diff #59458)	Yes, I tried to make the code simpler here.
26523 ↗	(On Diff #59458)	Yes, it is necessary to run before type legalization. Suppose the original mul is of type <16 x i32>: it will be split into four muls of type <4 x i32> in legalization. What we actually need after the transformation is two muls of type <8 x i16>, and it is more difficult to merge pairs of muls if this is done after legalization.
26544 ↗	(On Diff #59458)	Because I feel using generic shuffle doesn't make the code simpler. We still need two shuffles, two bitcast and one concat_vectors.
26612 ↗	(On Diff #59458)	I did see there was some code duplication. For example, I can merge the code for "VT.getVectorNumElements() > OpsVT.getVectorNumElements()" and that for "VT.getVectorNumElements() == OpsVT.getVectorNumElements()", but I felt the merge made the logic a little less clear, because it mixed the case requiring a split with the case requiring no split. Probably still worth it. I will do it and add some comments to clarify the logic.
29777 ↗	(On Diff #59458)	It is done by clang-format. I think it is more consistent without the linebreak. I will fix it.
test/CodeGen/X86/shrink_vmul.ll
1 ↗	(On Diff #59458)	Will add it.
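The register counts in the reply about line 26523 follow from simple arithmetic on 128-bit SSE registers. A small sketch, where `numLegalParts` is a hypothetical helper and the 128-bit register width is taken from the RegSize used in the patch:

```cpp
#include <cassert>

// Number of 128-bit registers needed to hold numElts elements of eltBits
// each, rounding up for vectors narrower than one register.
constexpr unsigned numLegalParts(unsigned numElts, unsigned eltBits,
                                 unsigned regBits = 128) {
  return (numElts * eltBits + regBits - 1) / regBits;
}
```

For a <16 x i32> mul this gives four parts, while the narrowed <16 x i16> form needs only two, which is why the combine is easier to perform before the wide mul has been split.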

mkuper added inline comments.Jun 2 2016, 5:59 PM

lib/Target/X86/X86ISelLowering.cpp
26452 ↗	(On Diff #59458)	Yes, together with the refactoring Eli suggested, that should be good.
26507 ↗	(On Diff #59458)	I think what Eli meant is that there's no reason to look for a splat, specifically. If every element is small enough, it doesn't matter whether the vector is a splat or not.
26519 ↗	(On Diff #59458)	As long as this gets cleaned up, that sounds reasonable.
26523 ↗	(On Diff #59458)	Ah, ok, got it, thanks.
26544 ↗	(On Diff #59458)	It will actually make the code a bit more complex, really. What I had in mind was that if the result of the mul itself feeds a shuffle, leaving generic shuffles here may expose more dag combines further down the road.

Addressed Eli and Michael's comments.

A major change is: The previous way to choose shrinking modes by only looking at sext and zext is incorrect. I should analyze and use the value ranges of mul operands to determine which shrinking mode should be chosen. Different shrinking modes and their allowed value ranges are described in the comment of reduceVMULWidth.

This is looking a lot better overall.

lib/Target/X86/X86ISelLowering.cpp
26537 ↗	(On Diff #59745)	It feels like you should be able to use ComputeNumSignBits/computeKnownBits here; I'm not sure how much shorter that actually ends up, though.
26675 ↗	(On Diff #59745)	It's not obvious to me why you're explicitly legalizing this here; you could just generate a MUL on, for example, <4 x i16> and legalization should do the right thing from there.
test/CodeGen/X86/shrink_vmul.ll
751 ↗	(On Diff #59745)	It would probably be more clear to write this as `mul nuw nsw <2 x i32> %tmp8, <i32 -32768, i32 32767>`.

wmi added inline comments.Jun 7 2016, 10:44 AM

lib/Target/X86/X86ISelLowering.cpp
26537 ↗	(On Diff #59745)	Thanks for the suggestion. I changed the value range check of integer constants to use ComputeNumSignBits. The code length doesn't change much, but ComputeNumSignBits is more powerful, and I can use it to extend this further in the future, for example:
; %val1 = load <2 x i8>
; %op1 = zext<2 x i32> %val1
; %val2 = load <2 x i8>
; %op2 = zext<2 x i32> %val2
; %add = add <2 x i32> %op1, %op2
; %rst = mul <2 x i32> %add, %op2
ComputeNumSignBits may know that %add's value range is within 0 ~ 32767 (actually 0 ~ 255*2).
26675 ↗	(On Diff #59745)	I choose to explicitly legalize here because implicit legalization will generate different results: Suppose the input is <4 x i16>, for implicit legalization, it will be converted to <4 x i64> then bitcast to <8 x i16> before being used as the input of pmullw. If the input is a vector load + sext/zext, then the input needs to be unpck twice to get <4 x i64>. For explicit legalization, I choose to concat <4 x i16> with vector undef to get <8 x i16>. If the input is a vector load + sext/zext, then the input can be directly used as the input of pmullw.
test/CodeGen/X86/shrink_vmul.ll
751 ↗	(On Diff #59745)	That is better. Fixed.
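To make the value-range remark concrete, the sketch below counts sign bits of a 32-bit value in plain C++. It mirrors the meaning of a sign-bit count (the number of leading bits equal to the sign bit, including the sign bit itself) but is not LLVM's implementation; `numSignBits` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Number of leading bits that are copies of the sign bit, counting the sign
// bit itself. A value fits in a signed N-bit integer when this is >= 33 - N.
inline unsigned numSignBits(int32_t v) {
  uint32_t u = static_cast<uint32_t>(v);
  uint32_t sign = u >> 31;
  unsigned n = 1;
  for (int bit = 30; bit >= 0 && ((u >> bit) & 1u) == sign; --bit)
    ++n;
  return n;
}
```

The largest sum of two zero-extended i8 values is 255 * 2 = 510, which has 23 sign bits, comfortably above the 17 needed for the value to fit in a signed 16-bit lane.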

wmi updated this revision to Diff 59912.Jun 7 2016, 10:45 AM

eli.friedman added inline comments.Jun 7 2016, 5:22 PM

lib/Target/X86/X86ISelLowering.cpp
26506 ↗	(On Diff #59912)	It's probably more clear to express this in terms of the number of sign bits and the number of leading zeros (APInt::countLeadingZeros). Actually, you could probably get rid of the ValRange enumeration altogether in favor of those two numbers. For example, return `MULS8` if `std::min(signbits1, signbits2) > 24`.
26675 ↗	(On Diff #59912)	I'm not following... how do you get to <4 x i64>? Legalization of a `<4 x i16>` multiply will widen it to an `<8 x i16>` multiply; this codepath already gets used for IR like `mul <4 x i16> %a, %b`.
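The suggestion to drive the mode choice from sign-bit counts can be sketched as a standalone function. The thresholds follow the committed canReduceVMulWidth shown later in this revision; the `NoShrink` enumerator and the `chooseMode` name are hypothetical additions for illustration (the patch returns false instead).

```cpp
#include <algorithm>
#include <cassert>

enum class ShrinkMode { NoShrink, MULS8, MULU8, MULS16, MULU16 };

// Pick a shrinking mode from the sign-bit counts of both 32-bit operands.
// allPositive means both operands are known to be non-negative.
inline ShrinkMode chooseMode(unsigned signBits0, unsigned signBits1,
                             bool allPositive) {
  unsigned minSignBits = std::min(signBits0, signBits1);
  if (minSignBits >= 25) // both operands fit in -128 ~ 127
    return ShrinkMode::MULS8;
  if (allPositive && minSignBits >= 24) // both fit in 0 ~ 255
    return ShrinkMode::MULU8;
  if (minSignBits >= 17) // both fit in -32768 ~ 32767
    return ShrinkMode::MULS16;
  if (allPositive && minSignBits >= 16) // both fit in 0 ~ 65535
    return ShrinkMode::MULU16;
  return ShrinkMode::NoShrink;
}
```

Using the minimum of the two counts also settles the mixed sext/zext question raised earlier in the review: the operand with the wider range decides the mode, regardless of operand order.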

wmi added inline comments.Jun 7 2016, 11:26 PM

lib/Target/X86/X86ISelLowering.cpp
26506 ↗	(On Diff #59912)	I rewrite it and the code is much shorter. Thanks for the suggestion.
26675 ↗	(On Diff #59912)	Sorry, I meant to say <4 x i32> instead of <4 x i64>. When legalizing <4 x i16> to <4 x i32>, it will use a punpcklwd instruction if the input is a load of <4 x i16>. That is different from widening <4 x i16> to <8 x i16> by filling the higher bits with undef.
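Widening by padding with undef is safe for pmullw because each 16-bit lane is computed independently. The sketch below emulates that on arrays; `widen`, `pmullw`, and `mulWidened` are hypothetical helpers, and an arbitrary filler value stands in for undef.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V4 = std::array<uint16_t, 4>;
using V8 = std::array<uint16_t, 8>;

// Widen 4 lanes to 8. The upper lanes stand in for undef: any filler works.
inline V8 widen(const V4 &v, uint16_t filler) {
  V8 r;
  r.fill(filler);
  for (std::size_t i = 0; i < 4; ++i)
    r[i] = v[i];
  return r;
}

// Lane-wise low-16-bit multiply, emulating pmullw on one 128-bit register.
inline V8 pmullw(const V8 &a, const V8 &b) {
  V8 r{};
  for (std::size_t i = 0; i < 8; ++i)
    r[i] = static_cast<uint16_t>(static_cast<uint32_t>(a[i]) * b[i]);
  return r;
}

// Multiply two <4 x i16> values through the widened <8 x i16> form and keep
// only the low four lanes.
inline V4 mulWidened(const V4 &x, const V4 &y, uint16_t filler) {
  V8 p = pmullw(widen(x, filler), widen(y, filler));
  return V4{{p[0], p[1], p[2], p[3]}};
}
```

The low four result lanes are the same for any filler, so no extra unpack is needed to prepare the inputs.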

wmi updated this revision to Diff 59999.Jun 7 2016, 11:27 PM

eli.friedman added inline comments.Jun 8 2016, 12:26 AM

lib/Target/X86/X86ISelLowering.cpp
26675 ↗	(On Diff #59999)	Ah, I see what you mean. That's how we end up in an awful mess with the following:
typedef short a __attribute((ext_vector_type(4)));
void g(a x);
a f(a x, a y, int c) { a z = x*y; if (c) g(z+x); return z; }
That doesn't explain why you need to explicitly legalize the inputs in the case where you split the nodes, though.

wmi added inline comments.Jun 8 2016, 12:02 PM

lib/Target/X86/X86ISelLowering.cpp
26675 ↗	(On Diff #59999)	I chose to explicitly legalize because I used X86ISD::UNPCKL and X86ISD::UNPCKH instead of vector_shuffle, so no mask setting was needed. It seems legalization only works for generic ISD nodes, not X86ISD nodes. Actually, explicit legalization isn't needed there. I changed the unpck to a vector shuffle and removed the splitting. The code looks simpler even with the additional mask setting code. Thanks for the suggestion.
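The "mask setting code" mentioned here builds shuffle masks that act like punpcklwd and punpckhwd. The sketch below reproduces the index arithmetic from the committed reduceVMULWidth in standalone form; `unpackLoMask` and `unpackHiMask` are hypothetical names. Indices below numElts select from the first shuffle operand (MulLo) and indices at or above numElts select from the second (MulHi).

```cpp
#include <cassert>
#include <vector>

// Mask interleaving the low halves of the two operands: lo[0], hi[0], lo[1], ...
inline std::vector<int> unpackLoMask(unsigned numElts) {
  std::vector<int> mask(numElts);
  for (unsigned i = 0; i < numElts / 2; ++i) {
    mask[2 * i] = static_cast<int>(i);
    mask[2 * i + 1] = static_cast<int>(i + numElts);
  }
  return mask;
}

// Mask interleaving the high halves of the two operands.
inline std::vector<int> unpackHiMask(unsigned numElts) {
  std::vector<int> mask(numElts);
  for (unsigned i = 0; i < numElts / 2; ++i) {
    mask[2 * i] = static_cast<int>(i + numElts / 2);
    mask[2 * i + 1] = static_cast<int>(i + numElts * 3 / 2);
  }
  return mask;
}
```

For eight elements the low mask is 0, 8, 1, 9, 2, 10, 3, 11 and the high mask is 4, 12, 5, 13, 6, 14, 7, 15, matching the lane order of punpcklwd and punpckhwd on <8 x i16>.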

wmi updated this revision to Diff 60077.Jun 8 2016, 12:04 PM

eli.friedman added inline comments.Jun 8 2016, 12:53 PM

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
673 ↗	(On Diff #60077)	Do you need MULHS here too? (Maybe missing a test for this?)
lib/Target/X86/X86ISelLowering.cpp
26453 ↗	(On Diff #60077)	This comment is confusing; the operand isn't actually guaranteed to be between 0 and 127. (The transform is safe because we can assume an appropriate number of leading sign/zero bits.)
26478 ↗	(On Diff #60077)	Probably better to use APInt::getNumSignBits here.

wmi added inline comments.Jun 8 2016, 1:33 PM

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
673 ↗	(On Diff #60077)	Ah sorry. Fixed. Add a test mul_16xi16_sext for it.
lib/Target/X86/X86ISelLowering.cpp
26453 ↗	(On Diff #60077)	Fixed.
26478 ↗	(On Diff #60077)	Fixed.

wmi updated this revision to Diff 60097.Jun 8 2016, 1:33 PM

LGTM, but this has been through a lot of revisions, so it's probably a good idea for someone else to double-check that I didn't miss something obvious.

Eli, thanks for your many helpful suggestions.

LGTM too.
I have a couple of comments, but they're rather half-baked, feel free to ignore them if they don't make sense to you.

lib/Target/X86/X86ISelLowering.cpp
26446 ↗	(On Diff #60097)	Nit - it's a bit misleading to have N->getNumOperands() here, and then index into an array of size 2. Maybe assert N->getNumOperands() == 2 (or just use 2 here, but that's less self-documenting, I guess).
26470 ↗	(On Diff #60097)	Perhaps use ComputeNumSignBits on the BUILD_VECTOR elements as well, instead of checking for const/undef? Although I'm really not sure if that gains anything in practice, and it won't make the code significantly smaller, so it may not be worth the overhead.
26480 ↗	(On Diff #60097)	Maybe computeKnownBits instead of special-casing ZERO_EXTEND? Although, the same as above applies, not at all sure it's worth the overhead.

This revision is now accepted and ready to land.Jun 13 2016, 11:51 AM

wmi added inline comments.Jun 13 2016, 4:42 PM

lib/Target/X86/X86ISelLowering.cpp
26446 ↗	(On Diff #60097)	That is misleading indeed. Fixed and Added an assertion.
26470 ↗	(On Diff #60097)	ComputeNumSignBits doesn't work for BUILD_VECTOR for now. It always returns 1.
26480 ↗	(On Diff #60097)	The code seems longer and costlier that way, so I choose to special-case ZERO_EXTEND here.
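Since the generic query is not usable for BUILD_VECTOR here, the committed code walks the constant elements itself. A standalone sketch of that loop follows, under the assumption that elements are modeled as a plain struct; `Elt`, `Range`, `numSignBits`, and `analyzeConstants` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One BUILD_VECTOR element: either undef or a 32-bit constant.
struct Elt {
  bool isUndef;
  int32_t value;
};

// Summary of a constant vector: minimum sign-bit count and non-negativity.
struct Range {
  unsigned signBits;
  bool isPositive;
};

// Leading bits equal to the sign bit, counting the sign bit itself.
inline unsigned numSignBits(int32_t v) {
  uint32_t u = static_cast<uint32_t>(v);
  uint32_t sign = u >> 31;
  unsigned n = 1;
  for (int bit = 30; bit >= 0 && ((u >> bit) & 1u) == sign; --bit)
    ++n;
  return n;
}

// Find the smallest value range covering all defined elements. Undef
// elements are skipped because they place no constraint on the range.
inline Range analyzeConstants(const std::vector<Elt> &elts) {
  Range r{32, true};
  for (const Elt &e : elts) {
    if (e.isUndef)
      continue;
    if (e.value < 0)
      r.isPositive = false;
    r.signBits = std::min(r.signBits, numSignBits(e.value));
  }
  return r;
}
```

For the constants -32768 and 32767 used in the test file, this yields 17 sign bits and a vector that is not all-positive, which selects the signed 16-bit mode.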

wmi updated this revision to Diff 60628.Jun 13 2016, 4:43 PM

wmi edited edge metadata.

Closed by commit rL272694: [X86] Reduce the width of multiplification when its operands are extended… (authored by wmi). Jun 14 2016, 12:00 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp	2 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	211 lines
llvm/trunk/test/CodeGen/X86/shrink_vmul.ll	864 lines

Diff 60716

llvm/trunk/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

@@ ... @@ #endif
   case ISD::SIGN_EXTEND:
   case ISD::ZERO_EXTEND:
     SplitVecRes_ExtendOp(N, Lo, Hi);
     break;

   case ISD::ADD:
   case ISD::SUB:
   case ISD::MUL:
+  case ISD::MULHS:
+  case ISD::MULHU:
   case ISD::FADD:
   case ISD::FSUB:
   case ISD::FMUL:
   case ISD::FMINNUM:
   case ISD::FMAXNUM:
   case ISD::FMINNAN:
   case ISD::FMAXNAN:
   case ISD::SDIV:

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp


@@ ... @@ if (checkBoolTestAndOrSetCCCombine(Cond, CC0, CC1, Flags, isAndSetCC)) {
         DAG.ReplaceAllUsesOfValueWith(SDValue(N, 1), SDValue(CMOV.getNode(), 1));
         return CMOV;
       }
     }

   return SDValue();
 }

+/// Different mul shrinking modes.
+enum ShrinkMode { MULS8, MULU8, MULS16, MULU16 };
+
+static bool canReduceVMulWidth(SDNode *N, SelectionDAG &DAG, ShrinkMode &Mode) {
+  EVT VT = N->getOperand(0).getValueType();
+  if (VT.getScalarSizeInBits() != 32)
+    return false;
+
+  assert(N->getNumOperands() == 2 && "NumOperands of Mul are 2");
+  unsigned SignBits[2] = {1, 1};
+  bool IsPositive[2] = {false, false};
+  for (unsigned i = 0; i < 2; i++) {
+    SDValue Opd = N->getOperand(i);
+
+    // DAG.ComputeNumSignBits return 1 for ISD::ANY_EXTEND, so we need to
+    // compute signbits for it separately.
+    if (Opd.getOpcode() == ISD::ANY_EXTEND) {
+      // For anyextend, it is safe to assume an appropriate number of leading
+      // sign/zero bits.
+      if (Opd.getOperand(0).getValueType().getVectorElementType() == MVT::i8)
+        SignBits[i] = 25;
+      else if (Opd.getOperand(0).getValueType().getVectorElementType() ==
+               MVT::i16)
+        SignBits[i] = 17;
+      else
+        return false;
+      IsPositive[i] = true;
+    } else if (Opd.getOpcode() == ISD::BUILD_VECTOR) {
+      // All the operands of BUILD_VECTOR need to be int constant.
+      // Find the smallest value range which all the operands belong to.
+      SignBits[i] = 32;
+      IsPositive[i] = true;
+      for (const SDValue &SubOp : Opd.getNode()->op_values()) {
+        if (SubOp.isUndef())
+          continue;
+        auto *CN = dyn_cast<ConstantSDNode>(SubOp);
+        if (!CN)
+          return false;
+        APInt IntVal = CN->getAPIntValue();
+        if (IntVal.isNegative())
+          IsPositive[i] = false;
+        SignBits[i] = std::min(SignBits[i], IntVal.getNumSignBits());
+      }
+    } else {
+      SignBits[i] = DAG.ComputeNumSignBits(Opd);
+      if (Opd.getOpcode() == ISD::ZERO_EXTEND)
+        IsPositive[i] = true;
+    }
+  }
+
+  bool AllPositive = IsPositive[0] && IsPositive[1];
+  unsigned MinSignBits = std::min(SignBits[0], SignBits[1]);
+  // When ranges are from -128 ~ 127, use MULS8 mode.
+  if (MinSignBits >= 25)
+    Mode = MULS8;
+  // When ranges are from 0 ~ 255, use MULU8 mode.
+  else if (AllPositive && MinSignBits >= 24)
+    Mode = MULU8;
+  // When ranges are from -32768 ~ 32767, use MULS16 mode.
+  else if (MinSignBits >= 17)
+    Mode = MULS16;
+  // When ranges are from 0 ~ 65535, use MULU16 mode.
+  else if (AllPositive && MinSignBits >= 16)
+    Mode = MULU16;
+  else
+    return false;
+  return true;
+}

+/// When the operands of vector mul are extended from smaller size values,
+/// like i8 and i16, the type of mul may be shrinked to generate more
+/// efficient code. Two typical patterns are handled:
+/// Pattern1:
+/// %2 = sext/zext <N x i8> %1 to <N x i32>
+/// %4 = sext/zext <N x i8> %3 to <N x i32>
+// or %4 = build_vector <N x i32> %C1, ..., %CN (%C1..%CN are constants)
+/// %5 = mul <N x i32> %2, %4
+///
+/// Pattern2:
+/// %2 = zext/sext <N x i16> %1 to <N x i32>
+/// %4 = zext/sext <N x i16> %3 to <N x i32>
+/// or %4 = build_vector <N x i32> %C1, ..., %CN (%C1..%CN are constants)
+/// %5 = mul <N x i32> %2, %4
+///
+/// There are four mul shrinking modes:
+/// If %2 == sext32(trunc8(%2)), i.e., the scalar value range of %2 is
+/// -128 to 128, and the scalar value range of %4 is also -128 to 128,
+/// generate pmullw+sext32 for it (MULS8 mode).
+/// If %2 == zext32(trunc8(%2)), i.e., the scalar value range of %2 is
+/// 0 to 255, and the scalar value range of %4 is also 0 to 255,
+/// generate pmullw+zext32 for it (MULU8 mode).
+/// If %2 == sext32(trunc16(%2)), i.e., the scalar value range of %2 is
+/// -32768 to 32767, and the scalar value range of %4 is also -32768 to 32767,
+/// generate pmullw+pmulhw for it (MULS16 mode).
+/// If %2 == zext32(trunc16(%2)), i.e., the scalar value range of %2 is
+/// 0 to 65535, and the scalar value range of %4 is also 0 to 65535,
+/// generate pmullw+pmulhuw for it (MULU16 mode).
+static SDValue reduceVMULWidth(SDNode *N, SelectionDAG &DAG,
+                               const X86Subtarget &Subtarget) {
+  // pmulld is supported since SSE41. It is better to use pmulld
+  // instead of pmullw+pmulhw.
+  if (Subtarget.hasSSE41())
+    return SDValue();
+
+  ShrinkMode Mode;
+  if (!canReduceVMulWidth(N, DAG, Mode))
+    return SDValue();
+
+  SDLoc DL(N);
+  SDValue N0 = N->getOperand(0);
+  SDValue N1 = N->getOperand(1);
+  EVT VT = N->getOperand(0).getValueType();
+  unsigned RegSize = 128;
+  MVT OpsVT = MVT::getVectorVT(MVT::i16, RegSize / 16);
+  EVT ReducedVT =
+      EVT::getVectorVT(*DAG.getContext(), MVT::i16, VT.getVectorNumElements());
+  // Shrink the operands of mul.
+  SDValue NewN0 = DAG.getNode(ISD::TRUNCATE, DL, ReducedVT, N0);
+  SDValue NewN1 = DAG.getNode(ISD::TRUNCATE, DL, ReducedVT, N1);
+
+  if (VT.getVectorNumElements() >= OpsVT.getVectorNumElements()) {
+    // Generate the lower part of mul: pmullw. For MULU8/MULS8, only the
+    // lower part is needed.
+    SDValue MulLo = DAG.getNode(ISD::MUL, DL, ReducedVT, NewN0, NewN1);
+    if (Mode == MULU8 || Mode == MULS8) {
+      return DAG.getNode((Mode == MULU8) ? ISD::ZERO_EXTEND : ISD::SIGN_EXTEND,
+                         DL, VT, MulLo);
+    } else {
+      MVT ResVT = MVT::getVectorVT(MVT::i32, VT.getVectorNumElements() / 2);
+      // Generate the higher part of mul: pmulhw/pmulhuw. For MULU16/MULS16,
+      // the higher part is also needed.
+      SDValue MulHi = DAG.getNode(Mode == MULS16 ? ISD::MULHS : ISD::MULHU, DL,
+                                  ReducedVT, NewN0, NewN1);
+
+      // Repack the lower part and higher part result of mul into a wider
+      // result.
+      // Generate shuffle functioning as punpcklwd.
+      SmallVector<int, 16> ShuffleMask(VT.getVectorNumElements());
+      for (unsigned i = 0; i < VT.getVectorNumElements() / 2; i++) {
+        ShuffleMask[2 * i] = i;
+        ShuffleMask[2 * i + 1] = i + VT.getVectorNumElements();
+      }
+      SDValue ResLo =
+          DAG.getVectorShuffle(ReducedVT, DL, MulLo, MulHi, &ShuffleMask[0]);
+      ResLo = DAG.getNode(ISD::BITCAST, DL, ResVT, ResLo);
+      // Generate shuffle functioning as punpckhwd.
+      for (unsigned i = 0; i < VT.getVectorNumElements() / 2; i++) {
+        ShuffleMask[2 * i] = i + VT.getVectorNumElements() / 2;
+        ShuffleMask[2 * i + 1] = i + VT.getVectorNumElements() * 3 / 2;
+      }
+      SDValue ResHi =
+          DAG.getVectorShuffle(ReducedVT, DL, MulLo, MulHi, &ShuffleMask[0]);
+      ResHi = DAG.getNode(ISD::BITCAST, DL, ResVT, ResHi);
+      return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, ResLo, ResHi);
+    }
+  } else {
+    // When VT.getVectorNumElements() < OpsVT.getVectorNumElements(), we want
+    // to legalize the mul explicitly because implicit legalization for type
+    // <4 x i16> to <4 x i32> sometimes involves unnecessary unpack
+    // instructions which will not exist when we explicitly legalize it by
+    // extending <4 x i16> to <8 x i16> (concatenating the <4 x i16> val with
+    // <4 x i16> undef).
+    //
+    // Legalize the operands of mul.
+    SmallVector<SDValue, 16> Ops(RegSize / ReducedVT.getSizeInBits(),
+                                 DAG.getUNDEF(ReducedVT));
+    Ops[0] = NewN0;
+    NewN0 = DAG.getNode(ISD::CONCAT_VECTORS, DL, OpsVT, Ops);
+    Ops[0] = NewN1;
+    NewN1 = DAG.getNode(ISD::CONCAT_VECTORS, DL, OpsVT, Ops);
+
+    if (Mode == MULU8 || Mode == MULS8) {
+      // Generate lower part of mul: pmullw. For MULU8/MULS8, only the lower
+      // part is needed.
+      SDValue Mul = DAG.getNode(ISD::MUL, DL, OpsVT, NewN0, NewN1);
+
+      // convert the type of mul result to VT.
+      MVT ResVT = MVT::getVectorVT(MVT::i32, RegSize / 32);
+      SDValue Res = DAG.getNode(Mode == MULU8 ? ISD::ZERO_EXTEND_VECTOR_INREG
+                                              : ISD::SIGN_EXTEND_VECTOR_INREG,
+                                DL, ResVT, Mul);
+      return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, Res,
+                         DAG.getIntPtrConstant(0, DL));
+    } else {
+      // Generate the lower and higher part of mul: pmulhw/pmulhuw. For
+      // MULU16/MULS16, both parts are needed.
+      SDValue MulLo = DAG.getNode(ISD::MUL, DL, OpsVT, NewN0, NewN1);
+      SDValue MulHi = DAG.getNode(Mode == MULS16 ? ISD::MULHS : ISD::MULHU, DL,
+                                  OpsVT, NewN0, NewN1);
+
+      // Repack the lower part and higher part result of mul into a wider
+      // result. Make sure the type of mul result is VT.
+      MVT ResVT = MVT::getVectorVT(MVT::i32, RegSize / 32);
+      SDValue Res = DAG.getNode(X86ISD::UNPCKL, DL, OpsVT, MulLo, MulHi);
+      Res = DAG.getNode(ISD::BITCAST, DL, ResVT, Res);
+      return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, Res,
+                         DAG.getIntPtrConstant(0, DL));
+    }
+  }
+}

 /// Optimize a single multiply with constant into two operations in order to
 /// implement it with two cheaper instructions, e.g. LEA + SHL, LEA + LEA.
 static SDValue combineMul(SDNode *N, SelectionDAG &DAG,
-                          TargetLowering::DAGCombinerInfo &DCI) {
+                          TargetLowering::DAGCombinerInfo &DCI,
+                          const X86Subtarget &Subtarget) {
+  EVT VT = N->getValueType(0);
+  if (DCI.isBeforeLegalize() && VT.isVector())
+    return reduceVMULWidth(N, DAG, Subtarget);
+
   // An imul is usually smaller than the alternative sequence.
   if (DAG.getMachineFunction().getFunction()->optForMinSize())
     return SDValue();

   if (DCI.isBeforeLegalize() || DCI.isCalledByLegalizer())
     return SDValue();

-  EVT VT = N->getValueType(0);
   if (VT != MVT::i64 && VT != MVT::i32)
     return SDValue();

   ConstantSDNode *C = dyn_cast<ConstantSDNode>(N->getOperand(1));
   if (!C)
     return SDValue();
   uint64_t MulAmt = C->getZExtValue();
   if (isPowerOf2_64(MulAmt) || MulAmt == 3 || MulAmt == 5 || MulAmt == 9)
@@ ... @@ SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
   case ISD::VSELECT:
   case ISD::SELECT:
   case X86ISD::SHRUNKBLEND: return combineSelect(N, DAG, DCI, Subtarget);
   case ISD::BITCAST: return combineBitcast(N, DAG, Subtarget);
   case X86ISD::CMOV: return combineCMov(N, DAG, DCI, Subtarget);
   case ISD::ADD: return combineAdd(N, DAG, Subtarget);
   case ISD::SUB: return combineSub(N, DAG, Subtarget);
   case X86ISD::ADC: return combineADC(N, DAG, DCI);
-  case ISD::MUL: return combineMul(N, DAG, DCI);
+  case ISD::MUL: return combineMul(N, DAG, DCI, Subtarget);
   case ISD::SHL:
   case ISD::SRA:
   case ISD::SRL: return combineShift(N, DAG, DCI, Subtarget);
   case ISD::AND: return combineAnd(N, DAG, DCI, Subtarget);
   case ISD::OR: return combineOr(N, DAG, DCI, Subtarget);
   case ISD::XOR: return combineXor(N, DAG, DCI, Subtarget);
   case ISD::LOAD: return combineLoad(N, DAG, DCI, Subtarget);
   case ISD::MLOAD: return combineMaskedLoad(N, DAG, DCI, Subtarget);

llvm/trunk/test/CodeGen/X86/shrink_vmul.ll

				; NOTE: Assertions have been autogenerated by update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s

				@c = external global i32*, align 8

				; %val1 = load <2 x i8>
				; %op1 = zext<2 x i32> %val1
				; %val2 = load <2 x i8>
				; %op2 = zext<2 x i32> %val2
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_2xi8:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rdx), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: movzwl (%rsi,%rdx), %ecx
				; CHECK-NEXT: movd %ecx, %xmm1
				; CHECK-NEXT: pxor %xmm2, %xmm2
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
				; CHECK-NEXT: movq %xmm1, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = zext <2 x i8> %wide.load to <2 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <2 x i8>*
				%wide.load17 = load <2 x i8>, <2 x i8>* %tmp11, align 1
				%tmp12 = zext <2 x i8> %wide.load17 to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <4 x i8>
				; %op1 = zext<4 x i32> %val1
				; %val2 = load <4 x i8>
				; %op2 = zext<4 x i32> %val2
				; %rst = mul <4 x i32> %op1, %op2
				;
				define void @mul_4xi8(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_4xi8:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; CHECK-NEXT: pxor %xmm2, %xmm2
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
				; CHECK-NEXT: movdqu %xmm1, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <4 x i8>*
				%wide.load = load <4 x i8>, <4 x i8>* %tmp7, align 1
				%tmp8 = zext <4 x i8> %wide.load to <4 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <4 x i8>*
				%wide.load17 = load <4 x i8>, <4 x i8>* %tmp11, align 1
				%tmp12 = zext <4 x i8> %wide.load17 to <4 x i32>
				%tmp13 = mul nuw nsw <4 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <4 x i32>*
				store <4 x i32> %tmp13, <4 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <8 x i8>
				; %op1 = zext<8 x i32> %val1
				; %val2 = load <8 x i8>
				; %op2 = zext<8 x i32> %val2
				; %rst = mul <8 x i32> %op1, %op2
				;
				define void @mul_8xi8(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_8xi8:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
				; CHECK-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
				; CHECK-NEXT: pxor %xmm2, %xmm2
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: movdqa %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: movdqu %xmm1, 16(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <8 x i8>*
				%wide.load = load <8 x i8>, <8 x i8>* %tmp7, align 1
				%tmp8 = zext <8 x i8> %wide.load to <8 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <8 x i8>*
				%wide.load17 = load <8 x i8>, <8 x i8>* %tmp11, align 1
				%tmp12 = zext <8 x i8> %wide.load17 to <8 x i32>
				%tmp13 = mul nuw nsw <8 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <8 x i32>*
				store <8 x i32> %tmp13, <8 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <16 x i8>
				; %op1 = zext<16 x i32> %val1
				; %val2 = load <16 x i8>
				; %op2 = zext<16 x i32> %val2
				; %rst = mul <16 x i32> %op1, %op2
				;
				define void @mul_16xi8(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_16xi8:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movdqu (%rdi,%rdx), %xmm0
				; CHECK-NEXT: movdqu (%rsi,%rdx), %xmm1
				; CHECK-NEXT: pxor %xmm2, %xmm2
				; CHECK-NEXT: movdqa %xmm0, %xmm3
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3],xmm3[4],xmm2[4],xmm3[5],xmm2[5],xmm3[6],xmm2[6],xmm3[7],xmm2[7]
				; CHECK-NEXT: movdqa %xmm1, %xmm4
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm4 = xmm4[0],xmm2[0],xmm4[1],xmm2[1],xmm4[2],xmm2[2],xmm4[3],xmm2[3],xmm4[4],xmm2[4],xmm4[5],xmm2[5],xmm4[6],xmm2[6],xmm4[7],xmm2[7]
				; CHECK-NEXT: pmullw %xmm3, %xmm4
				; CHECK-NEXT: movdqa %xmm4, %xmm3
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm2[4],xmm4[5],xmm2[5],xmm4[6],xmm2[6],xmm4[7],xmm2[7]
				; CHECK-NEXT: punpckhbw {{.*#+}} xmm0 = xmm0[8],xmm2[8],xmm0[9],xmm2[9],xmm0[10],xmm2[10],xmm0[11],xmm2[11],xmm0[12],xmm2[12],xmm0[13],xmm2[13],xmm0[14],xmm2[14],xmm0[15],xmm2[15]
				; CHECK-NEXT: punpckhbw {{.*#+}} xmm1 = xmm1[8],xmm2[8],xmm1[9],xmm2[9],xmm1[10],xmm2[10],xmm1[11],xmm2[11],xmm1[12],xmm2[12],xmm1[13],xmm2[13],xmm1[14],xmm2[14],xmm1[15],xmm2[15]
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: movdqa %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: movdqu %xmm1, 48(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm0, 32(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm4, 16(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm3, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <16 x i8>*
				%wide.load = load <16 x i8>, <16 x i8>* %tmp7, align 1
				%tmp8 = zext <16 x i8> %wide.load to <16 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <16 x i8>*
				%wide.load17 = load <16 x i8>, <16 x i8>* %tmp11, align 1
				%tmp12 = zext <16 x i8> %wide.load17 to <16 x i32>
				%tmp13 = mul nuw nsw <16 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <16 x i32>*
				store <16 x i32> %tmp13, <16 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <2 x i16>
				; %op1 = zext<2 x i32> %val1
				; %val2 = load <2 x i16>
				; %op2 = zext<2 x i32> %val2
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_2xi16:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; CHECK-NEXT: movdqa %xmm1, %xmm2
				; CHECK-NEXT: pmulhuw %xmm0, %xmm2
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
				; CHECK-NEXT: movq %xmm1, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = zext <2 x i16> %wide.load to <2 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <2 x i16>*
				%wide.load17 = load <2 x i16>, <2 x i16>* %tmp11, align 1
				%tmp12 = zext <2 x i16> %wide.load17 to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <4 x i16>
				; %op1 = zext<4 x i32> %val1
				; %val2 = load <4 x i16>
				; %op2 = zext<4 x i32> %val2
				; %rst = mul <4 x i32> %op1, %op2
				;
				define void @mul_4xi16(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_4xi16:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
				; CHECK-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
				; CHECK-NEXT: movdqa %xmm1, %xmm2
				; CHECK-NEXT: pmulhuw %xmm0, %xmm2
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
				; CHECK-NEXT: movdqu %xmm1, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <4 x i16>*
				%wide.load = load <4 x i16>, <4 x i16>* %tmp7, align 1
				%tmp8 = zext <4 x i16> %wide.load to <4 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <4 x i16>*
				%wide.load17 = load <4 x i16>, <4 x i16>* %tmp11, align 1
				%tmp12 = zext <4 x i16> %wide.load17 to <4 x i32>
				%tmp13 = mul nuw nsw <4 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <4 x i32>*
				store <4 x i32> %tmp13, <4 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <8 x i16>
				; %op1 = zext<8 x i32> %val1
				; %val2 = load <8 x i16>
				; %op2 = zext<8 x i32> %val2
				; %rst = mul <8 x i32> %op1, %op2
				;
				define void @mul_8xi16(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_8xi16:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movdqu (%rdi,%rdx), %xmm0
				; CHECK-NEXT: movdqu (%rsi,%rdx), %xmm1
				; CHECK-NEXT: movdqa %xmm1, %xmm2
				; CHECK-NEXT: pmulhuw %xmm0, %xmm2
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: movdqa %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: movdqu %xmm1, 16(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <8 x i16>*
				%wide.load = load <8 x i16>, <8 x i16>* %tmp7, align 1
				%tmp8 = zext <8 x i16> %wide.load to <8 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <8 x i16>*
				%wide.load17 = load <8 x i16>, <8 x i16>* %tmp11, align 1
				%tmp12 = zext <8 x i16> %wide.load17 to <8 x i32>
				%tmp13 = mul nuw nsw <8 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <8 x i32>*
				store <8 x i32> %tmp13, <8 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <16 x i16>
				; %op1 = zext<16 x i32> %val1
				; %val2 = load <16 x i16>
				; %op2 = zext<16 x i32> %val2
				; %rst = mul <16 x i32> %op1, %op2
				;
				define void @mul_16xi16(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_16xi16:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movdqu (%rdi,%rdx), %xmm0
				; CHECK-NEXT: movdqu 16(%rdi,%rdx), %xmm1
				; CHECK-NEXT: movdqu (%rsi,%rdx), %xmm2
				; CHECK-NEXT: movdqu 16(%rsi,%rdx), %xmm3
				; CHECK-NEXT: movdqa %xmm2, %xmm4
				; CHECK-NEXT: pmulhuw %xmm0, %xmm4
				; CHECK-NEXT: pmullw %xmm0, %xmm2
				; CHECK-NEXT: movdqa %xmm2, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm4[4],xmm2[5],xmm4[5],xmm2[6],xmm4[6],xmm2[7],xmm4[7]
				; CHECK-NEXT: movdqa %xmm3, %xmm4
				; CHECK-NEXT: pmulhuw %xmm1, %xmm4
				; CHECK-NEXT: pmullw %xmm1, %xmm3
				; CHECK-NEXT: movdqa %xmm3, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
				; CHECK-NEXT: movdqu %xmm3, 48(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm1, 32(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm2, 16(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <16 x i16>*
				%wide.load = load <16 x i16>, <16 x i16>* %tmp7, align 1
				%tmp8 = zext <16 x i16> %wide.load to <16 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <16 x i16>*
				%wide.load17 = load <16 x i16>, <16 x i16>* %tmp11, align 1
				%tmp12 = zext <16 x i16> %wide.load17 to <16 x i32>
				%tmp13 = mul nuw nsw <16 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <16 x i32>*
				store <16 x i32> %tmp13, <16 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <2 x i8>
				; %op1 = sext<2 x i32> %val1
				; %val2 = load <2 x i8>
				; %op2 = sext<2 x i32> %val2
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_sext(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_2xi8_sext:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rdx), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: movzwl (%rsi,%rdx), %ecx
				; CHECK-NEXT: movd %ecx, %xmm1
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; CHECK-NEXT: psraw $8, %xmm0
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; CHECK-NEXT: psraw $8, %xmm1
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
				; CHECK-NEXT: psrad $16, %xmm0
				; CHECK-NEXT: movq %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = sext <2 x i8> %wide.load to <2 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <2 x i8>*
				%wide.load17 = load <2 x i8>, <2 x i8>* %tmp11, align 1
				%tmp12 = sext <2 x i8> %wide.load17 to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <2 x i8>
				; %op1 = sext<2 x i32> %val1
				; %val2 = load <2 x i8>
				; %op2 = zext<2 x i32> %val2
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_sext_zext(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_2xi8_sext_zext:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rdx), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: movzwl (%rsi,%rdx), %ecx
				; CHECK-NEXT: movd %ecx, %xmm1
				; CHECK-NEXT: pxor %xmm2, %xmm2
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; CHECK-NEXT: psraw $8, %xmm0
				; CHECK-NEXT: movdqa %xmm1, %xmm2
				; CHECK-NEXT: pmulhw %xmm0, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = sext <2 x i8> %wide.load to <2 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <2 x i8>*
				%wide.load17 = load <2 x i8>, <2 x i8>* %tmp11, align 1
				%tmp12 = zext <2 x i8> %wide.load17 to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <2 x i16>
				; %op1 = sext<2 x i32> %val1
				; %val2 = load <2 x i16>
				; %op2 = sext<2 x i32> %val2
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16_sext(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_2xi16_sext:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; CHECK-NEXT: movdqa %xmm1, %xmm2
				; CHECK-NEXT: pmulhw %xmm0, %xmm2
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
				; CHECK-NEXT: movq %xmm1, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = sext <2 x i16> %wide.load to <2 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <2 x i16>*
				%wide.load17 = load <2 x i16>, <2 x i16>* %tmp11, align 1
				%tmp12 = sext <2 x i16> %wide.load17 to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <2 x i16>
				; %op1 = sext<2 x i32> %val1
				; %val2 = load <2 x i16>
				; %op2 = zext<2 x i32> %val2
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16_sext_zext(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_2xi16_sext_zext:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
				; CHECK-NEXT: psrad $16, %xmm0
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
				; CHECK-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; CHECK-NEXT: pxor %xmm2, %xmm2
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
				; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm1[0,1,1,3]
				; CHECK-NEXT: movdqa %xmm1, %xmm2
				; CHECK-NEXT: pmuludq %xmm0, %xmm2
				; CHECK-NEXT: movdqa %xmm0, %xmm3
				; CHECK-NEXT: psrlq $32, %xmm3
				; CHECK-NEXT: pmuludq %xmm1, %xmm3
				; CHECK-NEXT: psllq $32, %xmm3
				; CHECK-NEXT: paddq %xmm2, %xmm3
				; CHECK-NEXT: psrlq $32, %xmm1
				; CHECK-NEXT: pmuludq %xmm0, %xmm1
				; CHECK-NEXT: psllq $32, %xmm1
				; CHECK-NEXT: paddq %xmm3, %xmm1
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = sext <2 x i16> %wide.load to <2 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <2 x i16>*
				%wide.load17 = load <2 x i16>, <2 x i16>* %tmp11, align 1
				%tmp12 = zext <2 x i16> %wide.load17 to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <16 x i16>
				; %op1 = sext<16 x i32> %val1
				; %val2 = load <16 x i16>
				; %op2 = sext<16 x i32> %val2
				; %rst = mul <16 x i32> %op1, %op2
				;
				define void @mul_16xi16_sext(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_16xi16_sext:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movdqu (%rdi,%rdx), %xmm0
				; CHECK-NEXT: movdqu 16(%rdi,%rdx), %xmm1
				; CHECK-NEXT: movdqu (%rsi,%rdx), %xmm2
				; CHECK-NEXT: movdqu 16(%rsi,%rdx), %xmm3
				; CHECK-NEXT: movdqa %xmm2, %xmm4
				; CHECK-NEXT: pmulhw %xmm0, %xmm4
				; CHECK-NEXT: pmullw %xmm0, %xmm2
				; CHECK-NEXT: movdqa %xmm2, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm4[4],xmm2[5],xmm4[5],xmm2[6],xmm4[6],xmm2[7],xmm4[7]
				; CHECK-NEXT: movdqa %xmm3, %xmm4
				; CHECK-NEXT: pmulhw %xmm1, %xmm4
				; CHECK-NEXT: pmullw %xmm1, %xmm3
				; CHECK-NEXT: movdqa %xmm3, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
				; CHECK-NEXT: movdqu %xmm3, 48(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm1, 32(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm2, 16(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <16 x i16>*
				%wide.load = load <16 x i16>, <16 x i16>* %tmp7, align 1
				%tmp8 = sext <16 x i16> %wide.load to <16 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <16 x i16>*
				%wide.load17 = load <16 x i16>, <16 x i16>* %tmp11, align 1
				%tmp12 = sext <16 x i16> %wide.load17 to <16 x i32>
				%tmp13 = mul nuw nsw <16 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <16 x i32>*
				store <16 x i32> %tmp13, <16 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i8>
				; %op1 = zext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (0 ~ 255)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_varconst1(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi8_varconst1:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rsi), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: pxor %xmm1, %xmm1
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
				; CHECK-NEXT: pmullw {{.*}}(%rip), %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = zext <2 x i8> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 0, i32 255>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i8>
				; %op1 = sext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (-128 ~ 127)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_varconst2(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi8_varconst2:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rsi), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; CHECK-NEXT: psraw $8, %xmm0
				; CHECK-NEXT: pmullw {{.*}}(%rip), %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
				; CHECK-NEXT: psrad $16, %xmm0
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = sext <2 x i8> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 -128, i32 127>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i8>
				; %op1 = zext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (0 ~ 256)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_varconst3(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi8_varconst3:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rsi), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: pxor %xmm1, %xmm1
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
				; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <0,256,u,u,u,u,u,u>
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmulhw %xmm1, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = zext <2 x i8> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 0, i32 256>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i8>
				; %op1 = zext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (-1 ~ 255)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_varconst4(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi8_varconst4:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rsi), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: pxor %xmm1, %xmm1
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
				; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <65535,255,u,u,u,u,u,u>
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmulhw %xmm1, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = zext <2 x i8> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 -1, i32 255>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i8>
				; %op1 = sext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (-129 ~ 127)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_varconst5(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi8_varconst5:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rsi), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; CHECK-NEXT: psraw $8, %xmm0
				; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <65407,127,u,u,u,u,u,u>
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmulhw %xmm1, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = sext <2 x i8> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 -129, i32 127>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i8>
				; %op1 = sext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (-128 ~ 128)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_varconst6(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi8_varconst6:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rsi), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; CHECK-NEXT: psraw $8, %xmm0
				; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <65408,128,u,u,u,u,u,u>
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmulhw %xmm1, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = sext <2 x i8> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 -128, i32 128>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i16>
				; %op1 = zext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (0 ~ 65535)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16_varconst1(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi16_varconst1:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <0,65535,u,u,u,u,u,u>
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmulhuw %xmm1, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = zext <2 x i16> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 0, i32 65535>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i16>
				; %op1 = sext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (-32768 ~ 32767)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16_varconst2(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi16_varconst2:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <32768,32767,u,u,u,u,u,u>
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmulhw %xmm1, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = sext <2 x i16> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 -32768, i32 32767>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i16>
				; %op1 = zext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (0 ~ 65536)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16_varconst3(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi16_varconst3:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: pxor %xmm1, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
				; CHECK-NEXT: movl $65536, %ecx # imm = 0x10000
				; CHECK-NEXT: movd %rcx, %xmm1
				; CHECK-NEXT: pslldq {{.*#+}} xmm1 = zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0,1,2,3,4,5,6,7]
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmuludq %xmm1, %xmm2
				; CHECK-NEXT: psrlq $32, %xmm0
				; CHECK-NEXT: pmuludq %xmm1, %xmm0
				; CHECK-NEXT: psllq $32, %xmm0
				; CHECK-NEXT: paddq %xmm2, %xmm0
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = zext <2 x i16> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 0, i32 65536>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i16>
				; %op1 = sext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (0 ~ 32768); 32768 does not fit in a signed i16, so the mul is not narrowed
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16_varconst4(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi16_varconst4:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
				; CHECK-NEXT: psrad $16, %xmm0
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
				; CHECK-NEXT: movl $32768, %ecx # imm = 0x8000
				; CHECK-NEXT: movd %rcx, %xmm1
				; CHECK-NEXT: pslldq {{.*#+}} xmm1 = zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0,1,2,3,4,5,6,7]
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmuludq %xmm1, %xmm2
				; CHECK-NEXT: psrlq $32, %xmm0
				; CHECK-NEXT: pmuludq %xmm1, %xmm0
				; CHECK-NEXT: psllq $32, %xmm0
				; CHECK-NEXT: paddq %xmm2, %xmm0
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = sext <2 x i16> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 0, i32 32768>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}