This is an archive of the discontinued LLVM Phabricator instance.

Paths

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp (2)
lib/Target/X86/X86ISelLowering.cpp (56)
test/CodeGen/X86/shrink_vmul.ll (4)

Differential D20931

[X86] Reduce the width of multiplification when its operands are extended from i8 or i16
ClosedPublic

Authored by wmi on Jun 2 2016, 2:36 PM.


Details

Reviewers

RKSimon
mkuper
congh
hfinkel

Commits

rGb799a625f922: [X86] Reduce the width of multiplification when its operands are extended from…
rL272694: [X86] Reduce the width of multiplification when its operands are extended…

Summary

For a mul of type <N x i32>, targets without SSE41 use pmuludq, which often introduces many extra pack and unpack instructions in a vectorized loop body because pmuludq generates a value of type <N/2 x i64>. However, when the operands of the <N x i32> mul are extended from smaller values such as i8 or i16, the mul can be shrunk to use pmullw + pmulhw/pmulhuw instead of pmuludq, which generates better code. Targets with SSE41 support pmulld, so no shrinking is needed.
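
As a scalar sketch of why the narrowing in the summary is sound: the low and high 16-bit halves of a 16x16 product, which is what pmullw and pmulhw/pmulhuw produce per lane, can be interleaved back into the full 32-bit result. This models a single vector lane in plain C++, not the actual DAG nodes, and all helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Low 16 bits of the product, modeling one lane of pmullw.
inline uint16_t mulLo16(uint16_t a, uint16_t b) {
  return static_cast<uint16_t>(static_cast<uint32_t>(a) * b);
}

// High 16 bits of the unsigned product, modeling one lane of pmulhuw.
inline uint16_t mulHiU16(uint16_t a, uint16_t b) {
  return static_cast<uint16_t>((static_cast<uint32_t>(a) * b) >> 16);
}

// High 16 bits of the signed product, modeling one lane of pmulhw.
inline uint16_t mulHiS16(int16_t a, int16_t b) {
  int32_t p = static_cast<int32_t>(a) * b; // |p| <= 2^30, so no overflow
  return static_cast<uint16_t>(static_cast<uint32_t>(p) >> 16);
}

// Interleave the two halves into one 32-bit lane, modeling the unpack step.
inline uint32_t combine(uint16_t lo, uint16_t hi) {
  return (static_cast<uint32_t>(hi) << 16) | lo;
}

// MULU16-style product of two values known to lie in 0..65535.
inline uint32_t mulU16(uint32_t a, uint32_t b) {
  assert(a <= 0xFFFFu && b <= 0xFFFFu);
  uint16_t x = static_cast<uint16_t>(a), y = static_cast<uint16_t>(b);
  return combine(mulLo16(x, y), mulHiU16(x, y));
}

// MULS16-style product of two values known to lie in -32768..32767.
// The final cast assumes the usual two's complement conversion.
inline int32_t mulS16(int32_t a, int32_t b) {
  assert(a >= -32768 && a <= 32767 && b >= -32768 && b <= 32767);
  uint16_t x = static_cast<uint16_t>(a), y = static_cast<uint16_t>(b);
  uint16_t hi = mulHiS16(static_cast<int16_t>(a), static_cast<int16_t>(b));
  return static_cast<int32_t>(combine(mulLo16(x, y), hi));
}
```

The low half is the same for signed and unsigned inputs; only the high half depends on signedness, which is why the choice is between pmulhw and pmulhuw.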

Diff Detail

Repository: rL LLVM

Event Timeline

wmi updated this revision to Diff 59458.Jun 2 2016, 2:36 PM

wmi retitled this revision from to [X86] Reduce the width of multiplification when its operands are extended from i8 or i16.

wmi updated this object.

wmi added reviewers: hfinkel, RKSimon, congh.

wmi set the repository for this revision to rL LLVM.

wmi added subscribers: llvm-commits, davidxl, mkuper.

The testcases are a bit more complicated than they need to be; could you reduce them to just the minimal IR? In particular, including the loops makes it harder to read.

Also, missing testcase for the multiply-by-constant case.

lib/Target/X86/X86ISelLowering.cpp
26450	"shrunk", not "shrinked"... but you might want to use "narrowed" here instead.
26477	What if one side is signed, and the other is unsigned?
26493	Repeated code; should be refactored.
26507	This is not what you want. The constant vector is "zero-extended" if ZEXT(TRUNC(vec))==vec, and "sign-extended" if SEXT(TRUNC(vec)) == vec. Computing whether the vector is a splat has no relation to either of those properties.
26612	There's massive code duplication here; needs to be refactored. Maybe make splitting the inputs if necessary and actually computing the product separate steps?
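
To make the ZEXT(TRUNC(vec)) == vec and SEXT(TRUNC(vec)) == vec conditions from the comment on line 26507 concrete, here is a minimal per-element sketch in plain C++ for the i8 case. It is only an illustration with hypothetical helper names and does not use the SelectionDAG API; note that the check never asks whether the vector is a splat.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// zext32(trunc8(x)): keep the low 8 bits and zero the rest.
inline int32_t zextTrunc8(int32_t x) {
  return static_cast<int32_t>(static_cast<uint32_t>(x) & 0xFFu);
}

// sext32(trunc8(x)): keep the low 8 bits and replicate bit 7 upward.
inline int32_t sextTrunc8(int32_t x) {
  int32_t low = static_cast<int32_t>(static_cast<uint32_t>(x) & 0xFFu);
  return (low ^ 0x80) - 0x80;
}

// True if every element round-trips through trunc8 + zext32 (range 0..255).
bool isZeroExtendedFromI8(const std::vector<int32_t> &vec) {
  for (int32_t x : vec)
    if (zextTrunc8(x) != x)
      return false;
  return true;
}

// True if every element round-trips through trunc8 + sext32 (range -128..127).
bool isSignExtendedFromI8(const std::vector<int32_t> &vec) {
  for (int32_t x : vec)
    if (sextTrunc8(x) != x)
      return false;
  return true;
}
```

For example, the non-splat vector {3, 250} counts as zero-extended but not sign-extended, while {-128, 127, -1} is sign-extended but not zero-extended.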

mkuper added inline comments.Jun 2 2016, 3:13 PM

lib/Target/X86/X86ISelLowering.cpp
26452	Any chance to split this function up? Or is there no logical way to do that?
26468	I'm not sure DONTCARE is a good name for this - it's not a "top" value. Maybe UNKNOWN?
26475	ISD::ANY_EXTEND would work too, right? (Not sure how to test this, but if both ZERO and SIGN work, ANY should definitely work.)
26481	What if it's MVT::i1? You may want an "else return SDValue()" here as well.
26486	Same two comments as above.
26494	What happens if one of the extends is a sext and the other is a zext? It seems like if N0 is a sext and N1 is a zext we'll get IsSigned == true, and if it's the other way around, we'll get IsSigned == false, which seems oddly asymmetrical. Or is there a canonicalization earlier that guarantees the order of a sext and zext?
26499	Are you guaranteed that if one of {N0, N1} is a extend, and the other is a BuildVector, N0 is the extend? I can imagine something canonizes it that way - and if that's the case, documenting this here would probably be a good idea.
26519	Why not use N0->getOperand(0), at least when the type is i16? Although I guess we're pretty much guaranteed DAGCombine will clean this up, and this may be a bit cleaner as is. So, I'm not really sure which is better.
26523	This is necessary because this runs before legalization, right? What if it ran after legalization? Does that help? Or is that too late?
26544	Why not use a generic shuffle here? Is the lowering further down not good enough to get the right unpacks?
29798	Why the linebreak?
test/CodeGen/X86/shrink_vmul.ll
2	Could you add sext tests as well? I'm not sure you need the whole type x {zext, sext} matrix, but one sext test would be good.

Eli and Michael, thanks for your comments. Will post a new patch after I address the issues mentioned.

lib/Target/X86/X86ISelLowering.cpp
26450	Will fix it.
26452	I can extract the pattern matching part to a separate func. Do you think it is enough?
26468	That is better. Will fix it.
26475	Yes, it should work. I think I can generate the same code for it with ISD::ZERO_EXTEND.
26477	Nice catch. I think then the type of the mul should be signed.
26481	You are right. Will fix it.
26493	Will fix it.
26494	Eli asked a similar question. If one side is sext, and the other is zext, then the type of mult should be signed. I will fix it.
26499	Yes, InstCombine will canonicalize it. Will add comment for it.
26507	Maybe I can use SplatValue.isNegative() to get the signedness of constant? Then use it together with the signedness of N0 to determine the signedness of multiplification.
26519	Yes, I tried to make the code simpler here.
26523	Yes, it is necessary to run before type legalization. Suppose the original mul is of type <16 x i32>: it will be split into four muls of type <4 x i32> in legalization. What we actually need after the transformation is two muls of type <8 x i16>, and it is more difficult to merge pairs of muls if this is done after legalization.
26544	Because I feel using generic shuffle doesn't make the code simpler. We still need two shuffles, two bitcast and one concat_vectors.
26612	I did see there was some code duplication. For example, I can merge the code for "VT.getVectorNumElements() > OpsVT.getVectorNumElements()" and that for "VT.getVectorNumElements() == OpsVT.getVectorNumElements()", but I felt it made the logic a little less clear after the merge, since it mixed the case requiring a split with the case requiring no split. It is probably still worth it. I will do it and add some comments to clarify the logic.
29798	It is done by clang-format. I think it is more consistent without the linebreak. I will fix it.
test/CodeGen/X86/shrink_vmul.ll
2	Will add it.

mkuper added inline comments.Jun 2 2016, 5:59 PM

lib/Target/X86/X86ISelLowering.cpp
26452	Yes, together with the refactoring Eli suggested, that should be good.
26507	I think what Eli meant is that there's no reason to look for a splat, specifically. If every element is small enough, it doesn't matter whether the vector is a splat or not.
26519	As long as this gets cleaned up, that sounds reasonable.
26523	Ah, ok, got it, thanks.
26544	It will actually make the code a bit more complex, really. What I had in mind was that if the result of the mul itself feeds a shuffle, leaving generic shuffles here may expose more dag combines further down the road.

Addressed Eli and Michael's comments.

A major change is: The previous way to choose shrinking modes by only looking at sext and zext is incorrect. I should analyze and use the value ranges of mul operands to determine which shrinking mode should be chosen. Different shrinking modes and their allowed value ranges are described in the comment of reduceVMULWidth.

This is looking a lot better overall.

lib/Target/X86/X86ISelLowering.cpp
26537	It feels like you should be able to use ComputeNumSignBits/computeKnownBits here; I'm not sure how much shorter that actually ends up, though.
26675	It's not obvious to me why you're explicitly legalizing this here; you could just generate a MUL on, for example, <4 x i16> and legalization should do the right thing from there.
test/CodeGen/X86/shrink_vmul.ll
752	It would probably be more clear to write this as `mul nuw nsw <2 x i32> %tmp8, <i32 -32768, i32 32767>`.

wmi added inline comments.Jun 7 2016, 10:44 AM

lib/Target/X86/X86ISelLowering.cpp
26537	Thanks for the suggestion. I changed the value range check of intconst to use ComputeNumSignBits. The code length doesn't change much, but ComputeNumSignBits is more powerful, and I can use it to extend this further in the future, for example:
; %val1 = load <2 x i8>
; %op1 = zext<2 x i32> %val1
; %val2 = load <2 x i8>
; %op2 = zext<2 x i32> %val2
; %add = add <2 x i32> %op1, %op2
; %rst = mul <2 x i32> %add, %op2
ComputeNumSignBits may know %add's value range is within 0 ~ 32767 (actually 0 ~ 255*2).
26675	I choose to explicitly legalize here because implicit legalization will generate different results: Suppose the input is <4 x i16>, for implicit legalization, it will be converted to <4 x i64> then bitcast to <8 x i16> before being used as the input of pmullw. If the input is a vector load + sext/zext, then the input needs to be unpck twice to get <4 x i64>. For explicit legalization, I choose to concat <4 x i16> with vector undef to get <8 x i16>. If the input is a vector load + sext/zext, then the input can be directly used as the input of pmullw.
test/CodeGen/X86/shrink_vmul.ll
752	That is better. Fixed.
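
The range argument in the reply on line 26537 can be checked with a small standalone sketch. The function below counts sign bits of a plain 32-bit constant, in the spirit of APInt::getNumSignBits; it is not the DAG-level ComputeNumSignBits, which reasons conservatively about nodes rather than known values, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of leading bits equal to the sign bit, counting the sign bit itself.
unsigned numSignBits(int32_t v) {
  uint32_t u = static_cast<uint32_t>(v);
  if (v < 0)
    u = ~u; // turn leading ones into leading zeros
  unsigned n = 0;
  for (uint32_t mask = 0x80000000u; mask != 0 && (u & mask) == 0; mask >>= 1)
    ++n;
  return n;
}

// Sum of two zero-extended i8 values, like %add in the example above.
inline int32_t addZextI8(uint8_t a, uint8_t b) {
  return static_cast<int32_t>(a) + static_cast<int32_t>(b);
}
```

The largest possible sum, 255 + 255 = 510, still has 23 sign bits, which is at least the 17 needed for a value to fit the signed 16-bit range.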

wmi updated this revision to Diff 59912.Jun 7 2016, 10:45 AM

eli.friedman added inline comments.Jun 7 2016, 5:22 PM

lib/Target/X86/X86ISelLowering.cpp
26506	It's probably more clear to express this in terms of the number of sign bits and the number of leading zeros (APInt::countLeadingZeros). Actually, you could probably get rid of the ValRange enumeration altogether in favor of those two numbers. For example, return `MULS8` if `std::min(signbits1, signbits2) > 24`.
26675	I'm not following... how do you get to <4 x i64>? Legalization of a `<4 x i16>` multiply will widen it to an `<8 x i16>` multiply; this codepath already gets used for IR like `mul <4 x i16> %a, %b`.
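
The suggestion on line 26506, choosing the mode from the operands' sign-bit counts instead of a value-range enumeration, can be sketched as follows. The thresholds mirror the ones in the posted diff; the NoShrink enumerator and the function name are hypothetical additions so that the sketch can return a value instead of using an out-parameter.

```cpp
#include <algorithm>
#include <cassert>

// NoShrink is a hypothetical extra value meaning "no mode applies".
enum ShrinkMode { MULS8, MULU8, MULS16, MULU16, NoShrink };

// signBitsN: known number of sign bits of operand N (32-bit elements).
// isPositiveN: operand N is known to be non-negative.
ShrinkMode selectShrinkMode(unsigned signBits0, bool isPositive0,
                            unsigned signBits1, bool isPositive1) {
  bool allPositive = isPositive0 && isPositive1;
  unsigned minSignBits = std::min(signBits0, signBits1);
  if (minSignBits >= 25)
    return MULS8; // both operands in -128..127
  if (allPositive && minSignBits >= 24)
    return MULU8; // both operands in 0..255
  if (minSignBits >= 17)
    return MULS16; // both operands in -32768..32767
  if (allPositive && minSignBits >= 16)
    return MULU16; // both operands in 0..65535
  return NoShrink;
}
```

With this formulation, a zext-from-i8 operand (24 sign bits, non-negative) multiplied by a sext-from-i8 operand (25 sign bits) lands in MULS16 regardless of operand order, which addresses the asymmetry raised earlier in the review.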

wmi added inline comments.Jun 7 2016, 11:26 PM

lib/Target/X86/X86ISelLowering.cpp
26506	I rewrite it and the code is much shorter. Thanks for the suggestion.
26675	Sorry. I wanted to say <4 x i32> instead of <4 x i64>. When legalizing <4 x i16> to <4 x i32>, it will use a punpcklwd instruction if the input is load <4 x i16>. It is different from widening <4 x i16> to <8 x i16> by filling undef in the higher bits.

wmi updated this revision to Diff 59999.Jun 7 2016, 11:27 PM

eli.friedman added inline comments.Jun 8 2016, 12:26 AM

lib/Target/X86/X86ISelLowering.cpp
26672	Ah, I see what you mean. That's how we end up in an awful mess with the following:
typedef short a __attribute((ext_vector_type(4)));
void g(a x);
a f(a x, a y, int c) { a z = x*y; if (c) g(z+x); return z; }
That doesn't explain why you need to explicitly legalize the inputs in the case where you split the nodes, though.

wmi added inline comments.Jun 8 2016, 12:02 PM

lib/Target/X86/X86ISelLowering.cpp
26672	I chose to explicitly legalize because I used X86ISD::UNPCKL and X86ISD::UNPCKH instead of vector_shuffle, so no mask setting was needed. It seems legalization only works for generic ISD nodes, not X86ISD nodes. Actually it doesn't need to legalize. I changed unpck to vector shuffle and removed the splitting. The code looks simpler even with the additional mask-setting code. Thanks for the suggestion.
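
The mask setting mentioned above can be illustrated in isolation. The sketch below builds the two interleaving masks the same way the posted diff does (functioning as punpcklwd and punpckhwd) and applies them to plain arrays. It assumes an even element count, models the bitcast as little-endian, and uses hypothetical helper names rather than the DAG API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mask interleaving the low halves of two n-element inputs (punpcklwd-like).
std::vector<int> unpackLoMask(unsigned n) {
  std::vector<int> mask(n);
  for (unsigned i = 0; i < n / 2; ++i) {
    mask[2 * i] = static_cast<int>(i);
    mask[2 * i + 1] = static_cast<int>(i + n);
  }
  return mask;
}

// Mask interleaving the high halves of two n-element inputs (punpckhwd-like).
std::vector<int> unpackHiMask(unsigned n) {
  std::vector<int> mask(n);
  for (unsigned i = 0; i < n / 2; ++i) {
    mask[2 * i] = static_cast<int>(i + n / 2);
    mask[2 * i + 1] = static_cast<int>(i + n * 3 / 2);
  }
  return mask;
}

// Two-input shuffle: index k < a.size() selects a[k], otherwise b[k - a.size()].
std::vector<uint16_t> shuffle(const std::vector<uint16_t> &a,
                              const std::vector<uint16_t> &b,
                              const std::vector<int> &mask) {
  std::vector<uint16_t> out;
  for (int k : mask) {
    std::size_t idx = static_cast<std::size_t>(k);
    out.push_back(idx < a.size() ? a[idx] : b[idx - a.size()]);
  }
  return out;
}

// Little-endian bitcast of adjacent i16 pairs to i32 lanes.
std::vector<uint32_t> bitcastToI32(const std::vector<uint16_t> &v) {
  std::vector<uint32_t> out;
  for (std::size_t i = 0; i + 1 < v.size(); i += 2)
    out.push_back(static_cast<uint32_t>(v[i]) |
                  (static_cast<uint32_t>(v[i + 1]) << 16));
  return out;
}
```

Passing the low products as the first input and the high products as the second pairs each low half with its high half, so the two bitcast results concatenated give the full vector of 32-bit products.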

wmi updated this revision to Diff 60077.Jun 8 2016, 12:04 PM

eli.friedman added inline comments.Jun 8 2016, 12:53 PM

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
673	Do you need MULHS here too? (Maybe missing a test for this?)
lib/Target/X86/X86ISelLowering.cpp
26453	This comment is confusing; the operand isn't actually guaranteed to be between 0 and 127. (The transform is safe because we can assume an appropriate number of leading sign/zero bits.)
26478	Probably better to use APInt::getNumSignBits here.

wmi added inline comments.Jun 8 2016, 1:33 PM

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
673	Ah sorry. Fixed. Add a test mul_16xi16_sext for it.
lib/Target/X86/X86ISelLowering.cpp
26453	Fixed.
26478	Fixed.

wmi updated this revision to Diff 60097.Jun 8 2016, 1:33 PM

LGTM, but this has been through a lot of revisions, so it's probably a good idea for someone else to double-check I didn't miss something obvious.

Eli, thanks for your many helpful suggestions.

LGTM too.
I have a couple of comments, but they're rather half-baked, feel free to ignore them if they don't make sense to you.

lib/Target/X86/X86ISelLowering.cpp
26446	Nit - it's a bit misleading to have N->getNumOperands() here, and then index into an array of size 2. Maybe assert N->getNumOperands() == 2 (or just use 2 here, but that's less self-documenting, I guess).
26470	Perhaps use ComputeNumSignBits on the BUILD_VECTOR elements as well, instead of checking for const/undef? Although I'm really not sure if that gains anything in practice, and it won't make the code significantly smaller, so it may not be worth the overhead.
26480	Maybe computeKnownBits instead of special-casing ZERO_EXTEND? Although, the same as above applies, not at all sure it's worth the overhead.

This revision is now accepted and ready to land.Jun 13 2016, 11:51 AM

wmi added inline comments.Jun 13 2016, 4:42 PM

lib/Target/X86/X86ISelLowering.cpp
26446	That is misleading indeed. Fixed and added an assertion.
26470	ComputeNumSignBits doesn't work for BUILD_VECTOR for now. It always returns 1.
26480	The code seems longer and costlier that way, so I choose to special-case ZERO_EXTEND here.

wmi updated this revision to Diff 60628.Jun 13 2016, 4:43 PM

wmi edited edge metadata.

Closed by commit rL272694: [X86] Reduce the width of multiplification when its operands are extended… (authored by wmi). Jun 14 2016, 12:00 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp	2 lines
lib/Target/X86/X86ISelLowering.cpp	210 lines
test/CodeGen/X86/shrink_vmul.ll	864 lines

Diff 60097

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

  #endif
  case ISD::SIGN_EXTEND:
  case ISD::ZERO_EXTEND:
    SplitVecRes_ExtendOp(N, Lo, Hi);
    break;

  case ISD::ADD:
  case ISD::SUB:
  case ISD::MUL:
+ case ISD::MULHS:
+ case ISD::MULHU:
  case ISD::FADD:
  case ISD::FSUB:
  case ISD::FMUL:
  case ISD::FMINNUM:
  case ISD::FMAXNUM:
  case ISD::FMINNAN:
  case ISD::FMAXNAN:
  case ISD::SDIV:

lib/Target/X86/X86ISelLowering.cpp

  if (checkBoolTestAndOrSetCCCombine(Cond, CC0, CC1, Flags, isAndSetCC)) {
    DAG.ReplaceAllUsesOfValueWith(SDValue(N, 1), SDValue(CMOV.getNode(), 1));
    return CMOV;
  }
  }

  return SDValue();
  }

		/// Different mul shrinking modes.
		enum ShrinkMode { MULS8, MULU8, MULS16, MULU16 };

		static bool canReduceVMulWidth(SDNode *N, SelectionDAG &DAG, ShrinkMode &Mode) {
		EVT VT = N->getOperand(0).getValueType();
		if (VT.getScalarSizeInBits() != 32)
		return false;

		unsigned SignBits[2] = {1, 1};
		bool IsPositive[2] = {false, false};
		for (unsigned i = 0, e = N->getNumOperands(); i < e; i++) {
		SDValue Opd = N->getOperand(i);

		// DAG.ComputeNumSignBits returns 1 for ISD::ANY_EXTEND, so we need to
		// compute signbits for it separately.
		if (Opd.getOpcode() == ISD::ANY_EXTEND) {
		// For anyextend, it is safe to assume an appropriate number of leading
		// sign/zero bits.
		if (Opd.getOperand(0).getValueType().getVectorElementType() == MVT::i8)
		SignBits[i] = 25;
		else if (Opd.getOperand(0).getValueType().getVectorElementType() ==
		MVT::i16)
		SignBits[i] = 17;
		else
		return false;
		IsPositive[i] = true;
		} else if (Opd.getOpcode() == ISD::BUILD_VECTOR) {
		// All the operands of BUILD_VECTOR need to be int constant.
		// Find the smallest value range which all the operands belong to.
		SignBits[i] = 32;
		IsPositive[i] = true;
		for (const SDValue &SubOp : Opd.getNode()->op_values()) {
		if (SubOp.isUndef())
		continue;
		auto *CN = dyn_cast<ConstantSDNode>(SubOp);
		if (!CN)
		return false;
		APInt IntVal = CN->getAPIntValue();
		if (IntVal.isNegative())
		IsPositive[i] = false;
		SignBits[i] = std::min(SignBits[i], IntVal.getNumSignBits());
		}
		} else {
		SignBits[i] = DAG.ComputeNumSignBits(Opd);
		if (Opd.getOpcode() == ISD::ZERO_EXTEND)
		IsPositive[i] = true;
		}
		}

		bool AllPositive = IsPositive[0] && IsPositive[1];
		unsigned MinSignBits = std::min(SignBits[0], SignBits[1]);
		// When ranges are from -128 ~ 127, use MULS8 mode.
		if (MinSignBits >= 25)
		Mode = MULS8;
		// When ranges are from 0 ~ 255, use MULU8 mode.
		else if (AllPositive && MinSignBits >= 24)
		Mode = MULU8;
		// When ranges are from -32768 ~ 32767, use MULS16 mode.
		else if (MinSignBits >= 17)
		Mode = MULS16;
		// When ranges are from 0 ~ 65535, use MULU16 mode.
		else if (AllPositive && MinSignBits >= 16)
		Mode = MULU16;
		else
		return false;
		return true;
		}

		/// When the operands of vector mul are extended from smaller size values,
		/// like i8 and i16, the type of mul may be shrunk to generate more
		/// efficient code. Two typical patterns are handled:
		/// Pattern1:
		/// %2 = sext/zext <N x i8> %1 to <N x i32>
		/// %4 = sext/zext <N x i8> %3 to <N x i32>
		/// or %4 = build_vector <N x i32> %C1, ..., %CN (%C1..%CN are constants)
		/// %5 = mul <N x i32> %2, %4
		///
		/// Pattern2:
		/// %2 = zext/sext <N x i16> %1 to <N x i32>
		/// %4 = zext/sext <N x i16> %3 to <N x i32>
		/// or %4 = build_vector <N x i32> %C1, ..., %CN (%C1..%CN are constants)
		/// %5 = mul <N x i32> %2, %4
		///
		/// There are four mul shrinking modes:
		/// If %2 == sext32(trunc8(%2)), i.e., the scalar value range of %2 is
		/// -128 to 127, and the scalar value range of %4 is also -128 to 127,
		/// generate pmullw+sext32 for it (MULS8 mode).
		/// If %2 == zext32(trunc8(%2)), i.e., the scalar value range of %2 is
		/// 0 to 255, and the scalar value range of %4 is also 0 to 255,
		/// generate pmullw+zext32 for it (MULU8 mode).
		/// If %2 == sext32(trunc16(%2)), i.e., the scalar value range of %2 is
		/// -32768 to 32767, and the scalar value range of %4 is also -32768 to 32767,
		/// generate pmullw+pmulhw for it (MULS16 mode).
		/// If %2 == zext32(trunc16(%2)), i.e., the scalar value range of %2 is
		/// 0 to 65535, and the scalar value range of %4 is also 0 to 65535,
		/// generate pmullw+pmulhuw for it (MULU16 mode).
		static SDValue reduceVMULWidth(SDNode *N, SelectionDAG &DAG,
		const X86Subtarget &Subtarget) {
		// pmulld is supported since SSE41. It is better to use pmulld
		// instead of pmullw+pmulhw.
		if (Subtarget.hasSSE41())
		return SDValue();

		ShrinkMode Mode;
		if (!canReduceVMulWidth(N, DAG, Mode))
		return SDValue();

		SDLoc DL(N);
		SDValue N0 = N->getOperand(0);
		SDValue N1 = N->getOperand(1);
		EVT VT = N->getOperand(0).getValueType();
		unsigned RegSize = 128;
		MVT OpsVT = MVT::getVectorVT(MVT::i16, RegSize / 16);
		EVT ReducedVT =
		EVT::getVectorVT(*DAG.getContext(), MVT::i16, VT.getVectorNumElements());
		// Shrink the operands of mul.
		SDValue NewN0 = DAG.getNode(ISD::TRUNCATE, DL, ReducedVT, N0);
		SDValue NewN1 = DAG.getNode(ISD::TRUNCATE, DL, ReducedVT, N1);

		if (VT.getVectorNumElements() >= OpsVT.getVectorNumElements()) {
		// Generate the lower part of mul: pmullw. For MULU8/MULS8, only the
		// lower part is needed.
		SDValue MulLo = DAG.getNode(ISD::MUL, DL, ReducedVT, NewN0, NewN1);
		if (Mode == MULU8 || Mode == MULS8) {
		return DAG.getNode((Mode == MULU8) ? ISD::ZERO_EXTEND : ISD::SIGN_EXTEND,
		DL, VT, MulLo);
		} else {
		MVT ResVT = MVT::getVectorVT(MVT::i32, VT.getVectorNumElements() / 2);
		// Generate the higher part of mul: pmulhw/pmulhuw. For MULU16/MULS16,
		// the higher part is also needed.
		SDValue MulHi = DAG.getNode(Mode == MULS16 ? ISD::MULHS : ISD::MULHU, DL,
		ReducedVT, NewN0, NewN1);

		// Repack the lower part and higher part result of mul into a wider
		// result.
		// Generate shuffle functioning as punpcklwd.
		SmallVector<int, 16> ShuffleMask(VT.getVectorNumElements());
		for (unsigned i = 0; i < VT.getVectorNumElements() / 2; i++) {
		ShuffleMask[2 * i] = i;
		ShuffleMask[2 * i + 1] = i + VT.getVectorNumElements();
		}
		SDValue ResLo =
		DAG.getVectorShuffle(ReducedVT, DL, MulLo, MulHi, &ShuffleMask[0]);
		ResLo = DAG.getNode(ISD::BITCAST, DL, ResVT, ResLo);
		// Generate shuffle functioning as punpckhwd.
		for (unsigned i = 0; i < VT.getVectorNumElements() / 2; i++) {
		ShuffleMask[2 * i] = i + VT.getVectorNumElements() / 2;
		ShuffleMask[2 * i + 1] = i + VT.getVectorNumElements() * 3 / 2;
		}
		SDValue ResHi =
		DAG.getVectorShuffle(ReducedVT, DL, MulLo, MulHi, &ShuffleMask[0]);
		ResHi = DAG.getNode(ISD::BITCAST, DL, ResVT, ResHi);
		return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, ResLo, ResHi);
		}
		} else {
		// When VT.getVectorNumElements() < OpsVT.getVectorNumElements(), we want
		// to legalize the mul explicitly because implicit legalization for type
		// <4 x i16> to <4 x i32> sometimes involves unnecessary unpack
		// instructions which will not exist when we explicitly legalize it by
		// extending <4 x i16> to <8 x i16> (concatenating the <4 x i16> val with
		// <4 x i16> undef).
		//
		// Legalize the operands of mul.
		SmallVector<SDValue, 16> Ops(RegSize / ReducedVT.getSizeInBits(),
		DAG.getUNDEF(ReducedVT));
		Ops[0] = NewN0;
		NewN0 = DAG.getNode(ISD::CONCAT_VECTORS, DL, OpsVT, Ops);
		Ops[0] = NewN1;
		NewN1 = DAG.getNode(ISD::CONCAT_VECTORS, DL, OpsVT, Ops);

		if (Mode == MULU8 || Mode == MULS8) {
		// Generate lower part of mul: pmullw. For MULU8/MULS8, only the lower
		// part is needed.
		SDValue Mul = DAG.getNode(ISD::MUL, DL, OpsVT, NewN0, NewN1);

		// convert the type of mul result to VT.
		MVT ResVT = MVT::getVectorVT(MVT::i32, RegSize / 32);
		eli.friedman: There's massive code duplication here; this needs to be refactored. Maybe make splitting the inputs (if necessary) and actually computing the product separate steps?
		wmi: I did see that there was some code duplication. For example, I can merge the code for "VT.getVectorNumElements() > OpsVT.getVectorNumElements()" with that for "VT.getVectorNumElements() == OpsVT.getVectorNumElements()", but I felt that the merge made the logic a little less clear, because it mixes the case requiring a split with the case requiring no split. It is probably still worth it. I will do it and add some comments to clarify the logic.
		SDValue Res = DAG.getNode(Mode == MULU8 ? ISD::ZERO_EXTEND_VECTOR_INREG
		: ISD::SIGN_EXTEND_VECTOR_INREG,
		DL, ResVT, Mul);
		return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, Res,
		DAG.getIntPtrConstant(0, DL));
		} else {
		// Generate the lower and higher part of mul: pmulhw/pmulhuw. For
		// MULU16/MULS16, both parts are needed.
		SDValue MulLo = DAG.getNode(ISD::MUL, DL, OpsVT, NewN0, NewN1);
		SDValue MulHi = DAG.getNode(Mode == MULS16 ? ISD::MULHS : ISD::MULHU, DL,
		OpsVT, NewN0, NewN1);

		// Repack the lower part and higher part result of mul into a wider
		// result. Make sure the type of mul result is VT.
		MVT ResVT = MVT::getVectorVT(MVT::i32, RegSize / 32);
		SDValue Res = DAG.getNode(X86ISD::UNPCKL, DL, OpsVT, MulLo, MulHi);
		Res = DAG.getNode(ISD::BITCAST, DL, ResVT, Res);
		return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, Res,
		DAG.getIntPtrConstant(0, DL));
		}
		}
		}
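		The MULU16/MULS16 repacking in the function above (low halves from pmullw, high halves from pmulhw/pmulhuw, then two shuffles acting as punpcklwd and punpckhwd, a bitcast, and a concat) can be checked with a scalar model. This is only a sketch: `vectorShuffle` and `mulViaHalves` are hypothetical helpers, vectors are modeled as `std::vector`, and the bitcast assumes little-endian lane order as on x86. The mask formulas are the ones from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec16 = std::vector<uint16_t>;

// Model of a two-input vector shuffle: a mask index below N selects A[index],
// otherwise B[index - N].
static Vec16 vectorShuffle(const Vec16 &A, const Vec16 &B,
                           const std::vector<int> &Mask) {
  const int N = static_cast<int>(A.size());
  Vec16 R(Mask.size());
  for (std::size_t K = 0; K < Mask.size(); ++K)
    R[K] = Mask[K] < N ? A[Mask[K]] : B[Mask[K] - N];
  return R;
}

// Multiplies N 16-bit lanes and rebuilds the N 32-bit products from the
// separate low and high 16-bit halves.
std::vector<uint32_t> mulViaHalves(const Vec16 &X, const Vec16 &Y,
                                   bool IsSigned) {
  assert(X.size() == Y.size() && X.size() % 2 == 0);
  const int N = static_cast<int>(X.size());
  Vec16 MulLo(N), MulHi(N);
  for (int I = 0; I < N; ++I) {
    const uint32_t P =
        IsSigned ? static_cast<uint32_t>(int32_t(int16_t(X[I])) *
                                         int32_t(int16_t(Y[I])))
                 : uint32_t(X[I]) * uint32_t(Y[I]);
    MulLo[I] = static_cast<uint16_t>(P);       // pmullw
    MulHi[I] = static_cast<uint16_t>(P >> 16); // pmulhw / pmulhuw
  }
  // Same mask construction as in the diff.
  std::vector<int> LoMask(N), HiMask(N);
  for (int I = 0; I < N / 2; ++I) {
    LoMask[2 * I] = I;
    LoMask[2 * I + 1] = I + N;
    HiMask[2 * I] = I + N / 2;
    HiMask[2 * I + 1] = I + N * 3 / 2;
  }
  const Vec16 ResLo = vectorShuffle(MulLo, MulHi, LoMask); // punpcklwd
  const Vec16 ResHi = vectorShuffle(MulLo, MulHi, HiMask); // punpckhwd
  // Bitcast each pair of i16 lanes to one i32 lane, then concatenate.
  std::vector<uint32_t> Out;
  for (const Vec16 *Half : {&ResLo, &ResHi})
    for (int I = 0; I < N; I += 2)
      Out.push_back(uint32_t((*Half)[I]) | (uint32_t((*Half)[I + 1]) << 16));
  return Out;
}
```

		In this model, lane `i` of the output equals the full 32-bit product of lane `i` of the inputs, which is the property the shuffle masks are meant to preserve.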

/// Optimize a single multiply with constant into two operations in order to		/// Optimize a single multiply with constant into two operations in order to
/// implement it with two cheaper instructions, e.g. LEA + SHL, LEA + LEA.		/// implement it with two cheaper instructions, e.g. LEA + SHL, LEA + LEA.
static SDValue combineMul(SDNode *N, SelectionDAG &DAG,		static SDValue combineMul(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI) {		TargetLowering::DAGCombinerInfo &DCI,
		const X86Subtarget &Subtarget) {
		EVT VT = N->getValueType(0);
		if (DCI.isBeforeLegalize() && VT.isVector())
		return reduceVMULWidth(N, DAG, Subtarget);

// An imul is usually smaller than the alternative sequence.		// An imul is usually smaller than the alternative sequence.
if (DAG.getMachineFunction().getFunction()->optForMinSize())		if (DAG.getMachineFunction().getFunction()->optForMinSize())
return SDValue();		return SDValue();

if (DCI.isBeforeLegalize() || DCI.isCalledByLegalizer())		if (DCI.isBeforeLegalize() || DCI.isCalledByLegalizer())
return SDValue();		return SDValue();

EVT VT = N->getValueType(0);
if (VT != MVT::i64 && VT != MVT::i32)		if (VT != MVT::i64 && VT != MVT::i32)
return SDValue();		return SDValue();

ConstantSDNode *C = dyn_cast<ConstantSDNode>(N->getOperand(1));		ConstantSDNode *C = dyn_cast<ConstantSDNode>(N->getOperand(1));
if (!C)		if (!C)
return SDValue();		return SDValue();
uint64_t MulAmt = C->getZExtValue();		uint64_t MulAmt = C->getZExtValue();
if (isPowerOf2_64(MulAmt) || MulAmt == 3 || MulAmt == 5 || MulAmt == 9)		if (isPowerOf2_64(MulAmt) || MulAmt == 3 || MulAmt == 5 || MulAmt == 9)
return SDValue();		return SDValue();

uint64_t MulAmt1 = 0;		uint64_t MulAmt1 = 0;
uint64_t MulAmt2 = 0;		uint64_t MulAmt2 = 0;
if ((MulAmt % 9) == 0) {		if ((MulAmt % 9) == 0) {
MulAmt1 = 9;		MulAmt1 = 9;
MulAmt2 = MulAmt / 9;		MulAmt2 = MulAmt / 9;
} else if ((MulAmt % 5) == 0) {		} else if ((MulAmt % 5) == 0) {
MulAmt1 = 5;		MulAmt1 = 5;
MulAmt2 = MulAmt / 5;		MulAmt2 = MulAmt / 5;
} else if ((MulAmt % 3) == 0) {		} else if ((MulAmt % 3) == 0) {
MulAmt1 = 3;		MulAmt1 = 3;
MulAmt2 = MulAmt / 3;		MulAmt2 = MulAmt / 3;
		eli.friedman: Ah, I see what you mean. That's how we end up in an awful mess with the following:
		typedef short a __attribute((ext_vector_type(4)));
		void g(a x);
		a f(a x, a y, int c) { a z = x*y; if (c) g(z+x); return z; }
		That doesn't explain why you need to explicitly legalize the inputs in the case where you split the nodes, though.
		wmi: I chose to explicitly legalize because I used X86ISD::UNPCKL and X86ISD::UNPCKH instead of vector_shuffle, so no mask setting was needed. It seems legalization only works for generic ISD nodes, not for X86ISD nodes. Actually, it doesn't need to be legalized. I changed the unpck nodes to vector shuffles and removed the splitting. The code looks simpler even with the additional mask-setting code. Thanks for the suggestion.
}		}

SDLoc DL(N);		SDLoc DL(N);
		eli.friedman: It's not obvious to me why you're explicitly legalizing this here; you could just generate a MUL on, for example, <4 x i16>, and legalization should do the right thing from there.
		wmi: I chose to explicitly legalize here because implicit legalization generates different results. Suppose the input is <4 x i16>. With implicit legalization, it will be converted to <4 x i64> and then bitcast to <8 x i16> before being used as the input of pmullw. If the input is a vector load + sext/zext, then the input needs to be unpacked twice to get <4 x i64>. With explicit legalization, I concat <4 x i16> with an undef vector to get <8 x i16>. If the input is a vector load + sext/zext, then the input can be used directly as the input of pmullw.
		eli.friedman: I'm not following... how do you get to <4 x i64>? Legalization of a `<4 x i16>` multiply will widen it to an `<8 x i16>` multiply; this codepath already gets used for IR like `mul <4 x i16> %a, %b`.
		wmi: Sorry, I meant <4 x i32> instead of <4 x i64>. When legalizing <4 x i16> to <4 x i32>, it will use a punpcklwd instruction if the input is load <4 x i16>. That is different from widening <4 x i16> to <8 x i16> by filling the higher bits with undef.
SDValue NewMul;		SDValue NewMul;
if (MulAmt2 &&		if (MulAmt2 &&
(isPowerOf2_64(MulAmt2) || MulAmt2 == 3 || MulAmt2 == 5 || MulAmt2 == 9)){		(isPowerOf2_64(MulAmt2) || MulAmt2 == 3 || MulAmt2 == 5 || MulAmt2 == 9)){

if (isPowerOf2_64(MulAmt2) &&		if (isPowerOf2_64(MulAmt2) &&
!(N->hasOneUse() && N->use_begin()->getOpcode() == ISD::ADD))		!(N->hasOneUse() && N->use_begin()->getOpcode() == ISD::ADD))
// If second multiplifer is pow2, issue it first. We want the multiply by		// If second multiplifer is pow2, issue it first. We want the multiply by
// 3, 5, or 9 to be folded into the addressing mode unless the lone use		// 3, 5, or 9 to be folded into the addressing mode unless the lone use
SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
case ISD::VSELECT:		case ISD::VSELECT:
case ISD::SELECT:		case ISD::SELECT:
case X86ISD::SHRUNKBLEND: return combineSelect(N, DAG, DCI, Subtarget);		case X86ISD::SHRUNKBLEND: return combineSelect(N, DAG, DCI, Subtarget);
case ISD::BITCAST: return combineBitcast(N, DAG, Subtarget);		case ISD::BITCAST: return combineBitcast(N, DAG, Subtarget);
case X86ISD::CMOV: return combineCMov(N, DAG, DCI, Subtarget);		case X86ISD::CMOV: return combineCMov(N, DAG, DCI, Subtarget);
case ISD::ADD: return combineAdd(N, DAG, Subtarget);		case ISD::ADD: return combineAdd(N, DAG, Subtarget);
case ISD::SUB: return combineSub(N, DAG, Subtarget);		case ISD::SUB: return combineSub(N, DAG, Subtarget);
case X86ISD::ADC: return combineADC(N, DAG, DCI);		case X86ISD::ADC: return combineADC(N, DAG, DCI);
case ISD::MUL: return combineMul(N, DAG, DCI);		case ISD::MUL: return combineMul(N, DAG, DCI, Subtarget);
case ISD::SHL:		case ISD::SHL:
		mkuper: Why the linebreak?
		wmi: It is done by clang-format. I think it is more consistent without the linebreak. I will fix it.
case ISD::SRA:		case ISD::SRA:
case ISD::SRL: return combineShift(N, DAG, DCI, Subtarget);		case ISD::SRL: return combineShift(N, DAG, DCI, Subtarget);
case ISD::AND: return combineAnd(N, DAG, DCI, Subtarget);		case ISD::AND: return combineAnd(N, DAG, DCI, Subtarget);
case ISD::OR: return combineOr(N, DAG, DCI, Subtarget);		case ISD::OR: return combineOr(N, DAG, DCI, Subtarget);
case ISD::XOR: return combineXor(N, DAG, DCI, Subtarget);		case ISD::XOR: return combineXor(N, DAG, DCI, Subtarget);
case ISD::LOAD: return combineLoad(N, DAG, DCI, Subtarget);		case ISD::LOAD: return combineLoad(N, DAG, DCI, Subtarget);
case ISD::MLOAD: return combineMaskedLoad(N, DAG, DCI, Subtarget);		case ISD::MLOAD: return combineMaskedLoad(N, DAG, DCI, Subtarget);
case ISD::STORE: return combineStore(N, DAG, Subtarget);		case ISD::STORE: return combineStore(N, DAG, Subtarget);
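The scalar path of combineMul shown above splits a constant multiplier into two cheaper factors (for example LEA + SHL or LEA + LEA). A minimal sketch of just the factor selection visible in the hunk follows; `MulSplit`, `isPow2`, `isLeaScale` and `splitMulAmount` are hypothetical names introduced for illustration, and any handling in the elided part of the function is not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Result of the split; Valid is false when no two-factor form was found.
struct MulSplit {
  bool Valid;
  uint64_t Amt1; // 3, 5 or 9
  uint64_t Amt2; // a power of two, or 3, 5 or 9
};

static bool isPow2(uint64_t V) { return V != 0 && (V & (V - 1)) == 0; }
static bool isLeaScale(uint64_t V) { return V == 3 || V == 5 || V == 9; }

MulSplit splitMulAmount(uint64_t MulAmt) {
  const MulSplit NoSplit = {false, 0, 0};
  // Multipliers that already map to one shift or one LEA are left alone.
  if (isPow2(MulAmt) || isLeaScale(MulAmt))
    return NoSplit;
  // Try the factors in the same order as the hunk: 9, then 5, then 3.
  uint64_t Amt1 = 0;
  if (MulAmt % 9 == 0)
    Amt1 = 9;
  else if (MulAmt % 5 == 0)
    Amt1 = 5;
  else if (MulAmt % 3 == 0)
    Amt1 = 3;
  if (Amt1 == 0)
    return NoSplit;
  const uint64_t Amt2 = MulAmt / Amt1;
  // The remaining factor must itself be a shift or an LEA scale.
  if (isPow2(Amt2) || isLeaScale(Amt2))
    return {true, Amt1, Amt2};
  return NoSplit;
}
```

With this sketch, 45 splits into 9 and 5 (two LEAs) and 40 splits into 5 and 8 (LEA plus shift), while 21 finds no split because the remaining factor 7 is neither a power of two nor an LEA scale.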

test/CodeGen/X86/shrink_vmul.ll

				; NOTE: Assertions have been autogenerated by update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s
				mkuper: Could you add sext tests as well? I'm not sure you need the whole type x {zext, sext} matrix, but one sext test would be good.
				wmi: Will add it.

				@c = external global i32*, align 8

				; %val1 = load <2 x i8>
				; %op1 = zext<2 x i32> %val1
				; %val2 = load <2 x i8>
				; %op2 = zext<2 x i32> %val2
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_2xi8:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rdx), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: movzwl (%rsi,%rdx), %ecx
				; CHECK-NEXT: movd %ecx, %xmm1
				; CHECK-NEXT: pxor %xmm2, %xmm2
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
				; CHECK-NEXT: movq %xmm1, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = zext <2 x i8> %wide.load to <2 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <2 x i8>*
				%wide.load17 = load <2 x i8>, <2 x i8>* %tmp11, align 1
				%tmp12 = zext <2 x i8> %wide.load17 to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <4 x i8>
				; %op1 = zext<4 x i32> %val1
				; %val2 = load <4 x i8>
				; %op2 = zext<4 x i32> %val2
				; %rst = mul <4 x i32> %op1, %op2
				;
				define void @mul_4xi8(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_4xi8:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; CHECK-NEXT: pxor %xmm2, %xmm2
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
				; CHECK-NEXT: movdqu %xmm1, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <4 x i8>*
				%wide.load = load <4 x i8>, <4 x i8>* %tmp7, align 1
				%tmp8 = zext <4 x i8> %wide.load to <4 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <4 x i8>*
				%wide.load17 = load <4 x i8>, <4 x i8>* %tmp11, align 1
				%tmp12 = zext <4 x i8> %wide.load17 to <4 x i32>
				%tmp13 = mul nuw nsw <4 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <4 x i32>*
				store <4 x i32> %tmp13, <4 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <8 x i8>
				; %op1 = zext<8 x i32> %val1
				; %val2 = load <8 x i8>
				; %op2 = zext<8 x i32> %val2
				; %rst = mul <8 x i32> %op1, %op2
				;
				define void @mul_8xi8(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_8xi8:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
				; CHECK-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
				; CHECK-NEXT: pxor %xmm2, %xmm2
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: movdqa %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: movdqu %xmm1, 16(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <8 x i8>*
				%wide.load = load <8 x i8>, <8 x i8>* %tmp7, align 1
				%tmp8 = zext <8 x i8> %wide.load to <8 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <8 x i8>*
				%wide.load17 = load <8 x i8>, <8 x i8>* %tmp11, align 1
				%tmp12 = zext <8 x i8> %wide.load17 to <8 x i32>
				%tmp13 = mul nuw nsw <8 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <8 x i32>*
				store <8 x i32> %tmp13, <8 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <16 x i8>
				; %op1 = zext<16 x i32> %val1
				; %val2 = load <16 x i8>
				; %op2 = zext<16 x i32> %val2
				; %rst = mul <16 x i32> %op1, %op2
				;
				define void @mul_16xi8(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_16xi8:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movdqu (%rdi,%rdx), %xmm0
				; CHECK-NEXT: movdqu (%rsi,%rdx), %xmm1
				; CHECK-NEXT: pxor %xmm2, %xmm2
				; CHECK-NEXT: movdqa %xmm0, %xmm3
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3],xmm3[4],xmm2[4],xmm3[5],xmm2[5],xmm3[6],xmm2[6],xmm3[7],xmm2[7]
				; CHECK-NEXT: movdqa %xmm1, %xmm4
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm4 = xmm4[0],xmm2[0],xmm4[1],xmm2[1],xmm4[2],xmm2[2],xmm4[3],xmm2[3],xmm4[4],xmm2[4],xmm4[5],xmm2[5],xmm4[6],xmm2[6],xmm4[7],xmm2[7]
				; CHECK-NEXT: pmullw %xmm3, %xmm4
				; CHECK-NEXT: movdqa %xmm4, %xmm3
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm2[4],xmm4[5],xmm2[5],xmm4[6],xmm2[6],xmm4[7],xmm2[7]
				; CHECK-NEXT: punpckhbw {{.*#+}} xmm0 = xmm0[8],xmm2[8],xmm0[9],xmm2[9],xmm0[10],xmm2[10],xmm0[11],xmm2[11],xmm0[12],xmm2[12],xmm0[13],xmm2[13],xmm0[14],xmm2[14],xmm0[15],xmm2[15]
				; CHECK-NEXT: punpckhbw {{.*#+}} xmm1 = xmm1[8],xmm2[8],xmm1[9],xmm2[9],xmm1[10],xmm2[10],xmm1[11],xmm2[11],xmm1[12],xmm2[12],xmm1[13],xmm2[13],xmm1[14],xmm2[14],xmm1[15],xmm2[15]
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: movdqa %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: movdqu %xmm1, 48(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm0, 32(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm4, 16(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm3, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <16 x i8>*
				%wide.load = load <16 x i8>, <16 x i8>* %tmp7, align 1
				%tmp8 = zext <16 x i8> %wide.load to <16 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <16 x i8>*
				%wide.load17 = load <16 x i8>, <16 x i8>* %tmp11, align 1
				%tmp12 = zext <16 x i8> %wide.load17 to <16 x i32>
				%tmp13 = mul nuw nsw <16 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <16 x i32>*
				store <16 x i32> %tmp13, <16 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <2 x i16>
				; %op1 = zext<2 x i32> %val1
				; %val2 = load <2 x i16>
				; %op2 = zext<2 x i32> %val2
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_2xi16:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; CHECK-NEXT: movdqa %xmm1, %xmm2
				; CHECK-NEXT: pmulhuw %xmm0, %xmm2
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
				; CHECK-NEXT: movq %xmm1, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = zext <2 x i16> %wide.load to <2 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <2 x i16>*
				%wide.load17 = load <2 x i16>, <2 x i16>* %tmp11, align 1
				%tmp12 = zext <2 x i16> %wide.load17 to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <4 x i16>
				; %op1 = zext<4 x i32> %val1
				; %val2 = load <4 x i16>
				; %op2 = zext<4 x i32> %val2
				; %rst = mul <4 x i32> %op1, %op2
				;
				define void @mul_4xi16(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_4xi16:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
				; CHECK-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
				; CHECK-NEXT: movdqa %xmm1, %xmm2
				; CHECK-NEXT: pmulhuw %xmm0, %xmm2
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
				; CHECK-NEXT: movdqu %xmm1, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <4 x i16>*
				%wide.load = load <4 x i16>, <4 x i16>* %tmp7, align 1
				%tmp8 = zext <4 x i16> %wide.load to <4 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <4 x i16>*
				%wide.load17 = load <4 x i16>, <4 x i16>* %tmp11, align 1
				%tmp12 = zext <4 x i16> %wide.load17 to <4 x i32>
				%tmp13 = mul nuw nsw <4 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <4 x i32>*
				store <4 x i32> %tmp13, <4 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <8 x i16>
				; %op1 = zext<8 x i32> %val1
				; %val2 = load <8 x i16>
				; %op2 = zext<8 x i32> %val2
				; %rst = mul <8 x i32> %op1, %op2
				;
				define void @mul_8xi16(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_8xi16:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movdqu (%rdi,%rdx), %xmm0
				; CHECK-NEXT: movdqu (%rsi,%rdx), %xmm1
				; CHECK-NEXT: movdqa %xmm1, %xmm2
				; CHECK-NEXT: pmulhuw %xmm0, %xmm2
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: movdqa %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: movdqu %xmm1, 16(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <8 x i16>*
				%wide.load = load <8 x i16>, <8 x i16>* %tmp7, align 1
				%tmp8 = zext <8 x i16> %wide.load to <8 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <8 x i16>*
				%wide.load17 = load <8 x i16>, <8 x i16>* %tmp11, align 1
				%tmp12 = zext <8 x i16> %wide.load17 to <8 x i32>
				%tmp13 = mul nuw nsw <8 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <8 x i32>*
				store <8 x i32> %tmp13, <8 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <16 x i16>
				; %op1 = zext<16 x i32> %val1
				; %val2 = load <16 x i16>
				; %op2 = zext<16 x i32> %val2
				; %rst = mul <16 x i32> %op1, %op2
				;
				define void @mul_16xi16(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_16xi16:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movdqu (%rdi,%rdx), %xmm0
				; CHECK-NEXT: movdqu 16(%rdi,%rdx), %xmm1
				; CHECK-NEXT: movdqu (%rsi,%rdx), %xmm2
				; CHECK-NEXT: movdqu 16(%rsi,%rdx), %xmm3
				; CHECK-NEXT: movdqa %xmm2, %xmm4
				; CHECK-NEXT: pmulhuw %xmm0, %xmm4
				; CHECK-NEXT: pmullw %xmm0, %xmm2
				; CHECK-NEXT: movdqa %xmm2, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm4[4],xmm2[5],xmm4[5],xmm2[6],xmm4[6],xmm2[7],xmm4[7]
				; CHECK-NEXT: movdqa %xmm3, %xmm4
				; CHECK-NEXT: pmulhuw %xmm1, %xmm4
				; CHECK-NEXT: pmullw %xmm1, %xmm3
				; CHECK-NEXT: movdqa %xmm3, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
				; CHECK-NEXT: movdqu %xmm3, 48(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm1, 32(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm2, 16(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <16 x i16>*
				%wide.load = load <16 x i16>, <16 x i16>* %tmp7, align 1
				%tmp8 = zext <16 x i16> %wide.load to <16 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <16 x i16>*
				%wide.load17 = load <16 x i16>, <16 x i16>* %tmp11, align 1
				%tmp12 = zext <16 x i16> %wide.load17 to <16 x i32>
				%tmp13 = mul nuw nsw <16 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <16 x i32>*
				store <16 x i32> %tmp13, <16 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <2 x i8>
				; %op1 = sext<2 x i32> %val1
				; %val2 = load <2 x i8>
				; %op2 = sext<2 x i32> %val2
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_sext(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_2xi8_sext:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rdx), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: movzwl (%rsi,%rdx), %ecx
				; CHECK-NEXT: movd %ecx, %xmm1
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; CHECK-NEXT: psraw $8, %xmm0
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; CHECK-NEXT: psraw $8, %xmm1
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
				; CHECK-NEXT: psrad $16, %xmm0
				; CHECK-NEXT: movq %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = sext <2 x i8> %wide.load to <2 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <2 x i8>*
				%wide.load17 = load <2 x i8>, <2 x i8>* %tmp11, align 1
				%tmp12 = sext <2 x i8> %wide.load17 to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <2 x i8>
				; %op1 = sext<2 x i32> %val1
				; %val2 = load <2 x i8>
				; %op2 = zext<2 x i32> %val2
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_sext_zext(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_2xi8_sext_zext:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rdx), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: movzwl (%rsi,%rdx), %ecx
				; CHECK-NEXT: movd %ecx, %xmm1
				; CHECK-NEXT: pxor %xmm2, %xmm2
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; CHECK-NEXT: psraw $8, %xmm0
				; CHECK-NEXT: movdqa %xmm1, %xmm2
				; CHECK-NEXT: pmulhw %xmm0, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32*, i32** @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = sext <2 x i8> %wide.load to <2 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <2 x i8>*
				%wide.load17 = load <2 x i8>, <2 x i8>* %tmp11, align 1
				%tmp12 = zext <2 x i8> %wide.load17 to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val1 = load <2 x i16>
				; %op1 = sext<2 x i32> %val1
				; %val2 = load <2 x i16>
				; %op2 = sext<2 x i32> %val2
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16_sext(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_2xi16_sext:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; CHECK-NEXT: movdqa %xmm1, %xmm2
				; CHECK-NEXT: pmulhw %xmm0, %xmm2
				; CHECK-NEXT: pmullw %xmm0, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
				; CHECK-NEXT: movq %xmm1, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = sext <2 x i16> %wide.load to <2 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <2 x i16>*
				%wide.load17 = load <2 x i16>, <2 x i16>* %tmp11, align 1
				%tmp12 = sext <2 x i16> %wide.load17 to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}
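The `pmulhw` / `pmullw` / `punpcklwd` sequence checked above can be modelled one lane at a time. The following is a minimal Python sketch, not code from the patch; the helper names (`to_s16`, `pmullw`, `pmulhw`, `combine`, `mul_sext_i16`) are hypothetical stand-ins for a single 16-bit lane of each instruction. It illustrates that interleaving the low and high words gives the full 32-bit product of two sign-extended i16 values.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def to_s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= MASK16
    return x - 0x10000 if x & 0x8000 else x

def pmullw(a, b):
    """One lane of pmullw: low 16 bits of the product."""
    return (to_s16(a) * to_s16(b)) & MASK16

def pmulhw(a, b):
    """One lane of pmulhw: high 16 bits of the signed 32-bit product."""
    return ((to_s16(a) * to_s16(b)) >> 16) & MASK16

def combine(lo, hi):
    """One 32-bit lane after punpcklwd: low word first, high word second."""
    return (hi << 16) | lo

def mul_sext_i16(a, b):
    """32-bit product of two sign-extended i16 lanes, as a bit pattern."""
    return combine(pmullw(a, b), pmulhw(a, b))
```

The low word is the same for signed and unsigned multiplication; only the high word depends on signedness, which is why the zext tests in this file check for `pmulhuw` instead of `pmulhw`.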

				; %val1 = load <2 x i16>
				; %op1 = sext<2 x i32> %val1
				; %val2 = load <2 x i16>
				; %op2 = zext<2 x i32> %val2
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16_sext_zext(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_2xi16_sext_zext:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
				; CHECK-NEXT: psrad $16, %xmm0
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
				; CHECK-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
				; CHECK-NEXT: pxor %xmm2, %xmm2
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
				; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm1[0,1,1,3]
				; CHECK-NEXT: movdqa %xmm1, %xmm2
				; CHECK-NEXT: pmuludq %xmm0, %xmm2
				; CHECK-NEXT: movdqa %xmm0, %xmm3
				; CHECK-NEXT: psrlq $32, %xmm3
				; CHECK-NEXT: pmuludq %xmm1, %xmm3
				; CHECK-NEXT: psllq $32, %xmm3
				; CHECK-NEXT: paddq %xmm2, %xmm3
				; CHECK-NEXT: psrlq $32, %xmm1
				; CHECK-NEXT: pmuludq %xmm0, %xmm1
				; CHECK-NEXT: psllq $32, %xmm1
				; CHECK-NEXT: paddq %xmm3, %xmm1
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = sext <2 x i16> %wide.load to <2 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <2 x i16>*
				%wide.load17 = load <2 x i16>, <2 x i16>* %tmp11, align 1
				%tmp12 = zext <2 x i16> %wide.load17 to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}
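The `pmuludq` fallback in the CHECK lines above is consistent with the fact that neither high-half instruction fits a mixed sext/zext pair at i16 width: `pmulhw` treats both lanes as signed, and `pmulhuw` treats both as unsigned. A small Python sketch (single-lane models with hypothetical helper names, not code from the patch) shows both producing the wrong high word for one such pair.

```python
MASK16 = 0xFFFF

def to_s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= MASK16
    return x - 0x10000 if x & 0x8000 else x

def pmulhw(a, b):
    """High 16 bits of the product with both lanes read as signed."""
    return ((to_s16(a) * to_s16(b)) >> 16) & MASK16

def pmulhuw(a, b):
    """High 16 bits of the product with both lanes read as unsigned."""
    return (((a & MASK16) * (b & MASK16)) >> 16) & MASK16

def true_high(a_sext, b_zext):
    """High 16 bits of sext(a) * zext(b), the value the IR asks for."""
    return ((to_s16(a_sext) * (b_zext & MASK16)) >> 16) & MASK16

# Both lanes hold the bit pattern 0xFFFF: -1 when sign-extended,
# 65535 when zero-extended, so the 32-bit product is -65535.
a, b = 0xFFFF, 0xFFFF
wanted = true_high(a, b)
signed_guess = pmulhw(a, b)
unsigned_guess = pmulhuw(a, b)
```

For this input the required high word is `0xFFFF`, while the signed and unsigned variants give `0x0000` and `0xFFFE` respectively.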

				; %val1 = load <16 x i16>
				; %op1 = sext<16 x i32> %val1
				; %val2 = load <16 x i16>
				; %op2 = sext<16 x i32> %val2
				; %rst = mul <16 x i32> %op1, %op2
				;
				define void @mul_16xi16_sext(i8* nocapture readonly %a, i8* nocapture readonly %b, i64 %index) {
				; CHECK-LABEL: mul_16xi16_sext:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movdqu (%rdi,%rdx), %xmm0
				; CHECK-NEXT: movdqu 16(%rdi,%rdx), %xmm1
				; CHECK-NEXT: movdqu (%rsi,%rdx), %xmm2
				; CHECK-NEXT: movdqu 16(%rsi,%rdx), %xmm3
				; CHECK-NEXT: movdqa %xmm2, %xmm4
				; CHECK-NEXT: pmulhw %xmm0, %xmm4
				; CHECK-NEXT: pmullw %xmm0, %xmm2
				; CHECK-NEXT: movdqa %xmm2, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm4[4],xmm2[5],xmm4[5],xmm2[6],xmm4[6],xmm2[7],xmm4[7]
				; CHECK-NEXT: movdqa %xmm3, %xmm4
				; CHECK-NEXT: pmulhw %xmm1, %xmm4
				; CHECK-NEXT: pmullw %xmm1, %xmm3
				; CHECK-NEXT: movdqa %xmm3, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
				; CHECK-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
				; CHECK-NEXT: movdqu %xmm3, 48(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm1, 32(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm2, 16(%rax,%rdx,4)
				; CHECK-NEXT: movdqu %xmm0, (%rax,%rdx,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <16 x i16>*
				%wide.load = load <16 x i16>, <16 x i16>* %tmp7, align 1
				%tmp8 = sext <16 x i16> %wide.load to <16 x i32>
				%tmp10 = getelementptr inbounds i8, i8* %b, i64 %index
				%tmp11 = bitcast i8* %tmp10 to <16 x i16>*
				%wide.load17 = load <16 x i16>, <16 x i16>* %tmp11, align 1
				%tmp12 = sext <16 x i16> %wide.load17 to <16 x i32>
				%tmp13 = mul nuw nsw <16 x i32> %tmp12, %tmp8
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <16 x i32>*
				store <16 x i32> %tmp13, <16 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i8>
				; %op1 = zext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (0 ~ 255)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_varconst1(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi8_varconst1:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rsi), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: pxor %xmm1, %xmm1
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
				; CHECK-NEXT: pmullw {{.*}}(%rip), %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = zext <2 x i8> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 0, i32 255>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i8>
				; %op1 = sext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (-128 ~ 127)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_varconst2(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi8_varconst2:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rsi), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; CHECK-NEXT: psraw $8, %xmm0
				; CHECK-NEXT: pmullw {{.*}}(%rip), %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
				; CHECK-NEXT: psrad $16, %xmm0
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = sext <2 x i8> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 -128, i32 127>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i8>
				; %op1 = zext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (0 ~ 256)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_varconst3(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi8_varconst3:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rsi), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: pxor %xmm1, %xmm1
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
				; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <0,256,u,u,u,u,u,u>
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmulhw %xmm1, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = zext <2 x i8> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 0, i32 256>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i8>
				; %op1 = zext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (-1 ~ 255)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_varconst4(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi8_varconst4:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rsi), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: pxor %xmm1, %xmm1
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
				; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <65535,255,u,u,u,u,u,u>
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmulhw %xmm1, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = zext <2 x i8> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 -1, i32 255>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i8>
				; %op1 = sext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (-129 ~ 127)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_varconst5(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi8_varconst5:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rsi), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; CHECK-NEXT: psraw $8, %xmm0
				; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <65407,127,u,u,u,u,u,u>
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmulhw %xmm1, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = sext <2 x i8> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 -129, i32 127>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i8>
				; %op1 = sext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (-128 ~ 128)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi8_varconst6(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi8_varconst6:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movzwl (%rdi,%rsi), %ecx
				; CHECK-NEXT: movd %ecx, %xmm0
				; CHECK-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; CHECK-NEXT: psraw $8, %xmm0
				; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <65408,128,u,u,u,u,u,u>
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmulhw %xmm1, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i8>*
				%wide.load = load <2 x i8>, <2 x i8>* %tmp7, align 1
				%tmp8 = sext <2 x i8> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 -128, i32 128>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i16>
				; %op1 = zext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (0 ~ 65535)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16_varconst1(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi16_varconst1:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <0,65535,u,u,u,u,u,u>
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmulhuw %xmm1, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = zext <2 x i16> %wide.load to <2 x i32>
				eli.friedman: It would probably be more clear to write this as `mul nuw nsw <2 x i32> %tmp8, <i32 -32768, i32 32767>`.
				wmi (Author): That is better. Fixed.
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 0, i32 65535>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i16>
				; %op1 = sext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (-32768 ~ 32767)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16_varconst2(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi16_varconst2:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <32768,32767,u,u,u,u,u,u>
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmulhw %xmm1, %xmm2
				; CHECK-NEXT: pmullw %xmm1, %xmm0
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = sext <2 x i16> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 -32768, i32 32767>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i16>
				; %op1 = zext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (0 ~ 65536)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16_varconst3(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi16_varconst3:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: pxor %xmm1, %xmm1
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
				; CHECK-NEXT: movl $65536, %ecx # imm = 0x10000
				; CHECK-NEXT: movd %rcx, %xmm1
				; CHECK-NEXT: pslldq {{.*#+}} xmm1 = zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0,1,2,3,4,5,6,7]
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmuludq %xmm1, %xmm2
				; CHECK-NEXT: psrlq $32, %xmm0
				; CHECK-NEXT: pmuludq %xmm1, %xmm0
				; CHECK-NEXT: psllq $32, %xmm0
				; CHECK-NEXT: paddq %xmm2, %xmm0
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = zext <2 x i16> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 0, i32 65536>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}

				; %val = load <2 x i16>
				; %op1 = sext<2 x i32> %val
				; %op2 = const <2 x i32> {c1, c2} // c1 and c2 are within (0 ~ 32768)
				; %rst = mul <2 x i32> %op1, %op2
				;
				define void @mul_2xi16_varconst4(i8* nocapture readonly %a, i64 %index) {
				; CHECK-LABEL: mul_2xi16_varconst4:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: movq {{.*}}(%rip), %rax
				; CHECK-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
				; CHECK-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
				; CHECK-NEXT: psrad $16, %xmm0
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
				; CHECK-NEXT: movl $32768, %ecx # imm = 0x8000
				; CHECK-NEXT: movd %rcx, %xmm1
				; CHECK-NEXT: pslldq {{.*#+}} xmm1 = zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0,1,2,3,4,5,6,7]
				; CHECK-NEXT: movdqa %xmm0, %xmm2
				; CHECK-NEXT: pmuludq %xmm1, %xmm2
				; CHECK-NEXT: psrlq $32, %xmm0
				; CHECK-NEXT: pmuludq %xmm1, %xmm0
				; CHECK-NEXT: psllq $32, %xmm0
				; CHECK-NEXT: paddq %xmm2, %xmm0
				; CHECK-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
				; CHECK-NEXT: movq %xmm0, (%rax,%rsi,4)
				; CHECK-NEXT: retq
				entry:
				%pre = load i32, i32* @c
				%tmp6 = getelementptr inbounds i8, i8* %a, i64 %index
				%tmp7 = bitcast i8* %tmp6 to <2 x i16>*
				%wide.load = load <2 x i16>, <2 x i16>* %tmp7, align 1
				%tmp8 = sext <2 x i16> %wide.load to <2 x i32>
				%tmp13 = mul nuw nsw <2 x i32> %tmp8, <i32 0, i32 32768>
				%tmp14 = getelementptr inbounds i32, i32* %pre, i64 %index
				%tmp15 = bitcast i32* %tmp14 to <2 x i32>*
				store <2 x i32> %tmp13, <2 x i32>* %tmp15, align 4
				ret void
				}
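The multiply-by-constant tests above probe the boundaries of each value range. The Python sketch below is one reading of the CHECK lines in this file, not the logic in X86ISelLowering.cpp, which may be structured differently; `ext_range`, `fits`, and `expected_lowering` are hypothetical names. It assumes the choice depends only on the narrowest range that holds both the extended operand and the constants.

```python
def ext_range(kind, bits):
    """Value range of an operand extended from `bits` bits ('sext' or 'zext')."""
    if kind == "sext":
        return (-(1 << (bits - 1)), (1 << (bits - 1)) - 1)
    return (0, (1 << bits) - 1)

def fits(lo, hi, values):
    """True if every value lies in the closed interval [lo, hi]."""
    return all(lo <= v <= hi for v in values)

def expected_lowering(op_range, consts):
    """Instruction pattern the CHECK lines show for this operand/constant pair."""
    values = list(op_range) + list(consts)
    if fits(0, 255, values):
        return "pmullw + zero-extend"   # product fits in unsigned 16 bits
    if fits(-128, 127, values):
        return "pmullw + sign-extend"   # product fits in signed 16 bits
    if fits(-32768, 32767, values):
        return "pmullw + pmulhw"        # signed 16 x 16 -> 32
    if fits(0, 65535, values):
        return "pmullw + pmulhuw"       # unsigned 16 x 16 -> 32
    return "pmuludq"                    # no narrowing
```

For example, `expected_lowering(ext_range("zext", 8), [0, 256])` falls through to the signed 16-bit case, matching `mul_2xi8_varconst3`, and `expected_lowering(ext_range("sext", 16), [0, 32768])` matches no narrow range, matching the `pmuludq` sequence in `mul_2xi16_varconst4`.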