This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Add hooks/combine to form vector extloads, and enable it on X86.
ClosedPublic

Authored by ab on Jan 9 2015, 1:08 PM.


Details

Reviewers

chandlerc
hfinkel

Commits

rGe892d13d905c: [CodeGen] Add hook/combine to form vector extloads, enabled on X86.
rL228325: [CodeGen] Add hook/combine to form vector extloads, enabled on X86.

Summary

Moved from the ML:

As the last change in my extload series, here are 3 squashed (WIP) patches
to actually form extloads on vector types.
They used to be disabled, because "None of the supported targets knows
how to perform load and sign extend on vectors in one instruction."

The first part enables the combine on legal vectors, but hides it
behind a profitability callback.
For instance, on ARM, several instructions have folded extload forms,
so it's not always beneficial to create an extload node (and trying to
match extloads is a whole 'nother can of worms).
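
The shape of that profitability callback can be illustrated with a minimal, self-contained C++ sketch. This is not LLVM code: `ExtNode`, `TargetLoweringModel`, `X86LikeModel`, and `shouldFormExtLoad` are hypothetical stand-ins, and the gating logic is simplified. Only the hook name `isVectorLoadExtDesirable` and its default of `false` are taken from the patch.

```cpp
// Hypothetical stand-in for an extend node; it only records whether the
// extended value is a vector.
struct ExtNode {
  bool IsVector;
};

// Simplified model of the target hook: vector extloads are off by default.
struct TargetLoweringModel {
  virtual ~TargetLoweringModel() = default;
  virtual bool isVectorLoadExtDesirable(const ExtNode &) const { return false; }
};

// Model of a target that opts in unconditionally, as the X86 part does.
struct X86LikeModel : TargetLoweringModel {
  bool isVectorLoadExtDesirable(const ExtNode &) const override { return true; }
};

// Simplified gating (an assumption, not the real combine): the extload must
// be legal, and for vectors the target must also call it desirable.
bool shouldFormExtLoad(const TargetLoweringModel &TLI, const ExtNode &Ext,
                       bool ExtLoadIsLegal) {
  if (!ExtLoadIsLegal)
    return false;
  if (Ext.IsVector)
    return TLI.isVectorLoadExtDesirable(Ext);
  return true;
}
```

A target such as ARM could override the hook with a per-node decision rather than a constant, which is the reason the hook takes the extend node as its argument.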

The second part adds a combine to fold s/zextloads of illegal
(splittable) vector types, to replace it directly by multiple smaller
extloads. I'm not a big fan of this kind of pseudo-legalization in
combines. However, I tried the alternative: form illegal extloads, and
later try to split them up, but then, you sometimes generate extloads
that can't be split up, but have a valid ext+load expansion. At
vector-op legalization time, it's too late to generate this kind of
thing, so you end up forced to scalarize. It's better to just avoid
creating egregiously illegal nodes.
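
As a rough illustration of the "split until legal" idea, here is a self-contained C++ sketch. The names `VecTy`, `fitsIn128Bits`, and `findSplitNumElts` are hypothetical, and the 128-bit legality rule is an assumption chosen only to make the example concrete. The real combine queries the target for legal or custom extloads instead.

```cpp
#include <cassert>

// Hypothetical, simplified vector type: element count and element width.
struct VecTy {
  unsigned NumElts;
  unsigned EltBits;
};

// Example legality rule (an assumption): the extended result must fit in
// a 128-bit register.
bool fitsIn128Bits(VecTy Dst, VecTy /*Src*/) {
  return Dst.NumElts * Dst.EltBits <= 128;
}

// Halve source and destination together until the (Dst, Src) extload pair is
// accepted. Returns the element count of each piece, or 0 if even a
// one-element piece is not accepted, in which case no split extloads are
// formed.
unsigned findSplitNumElts(VecTy Dst, VecTy Src,
                          bool (*IsLegal)(VecTy, VecTy)) {
  assert(Dst.NumElts == Src.NumElts && "an extend keeps the element count");
  while (!IsLegal(Dst, Src) && Src.NumElts > 1) {
    Dst.NumElts /= 2;
    Src.NumElts /= 2;
  }
  return IsLegal(Dst, Src) ? Dst.NumElts : 0;
}
```

Under that assumed rule, extending v8i16 to v8i32 gives pieces of 4 elements, so the original load would be replaced by two smaller extloads whose results are concatenated.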

Finally, the last part enables this all, unconditionally, on X86.

Note that the splitting combine is happy with "custom" extloads. As
is, this bypasses the actual custom lowering, and just unrolls the
extload. But from what I've seen, this is still much better than the
current custom lowering, which does some kind of unrolling at the end
anyway (see for instance load_sext_4i8_to_4i64 on SSE2, and the added
FIXME).

Also note that there's a regression in widen_load-2.ll, where
we can no longer fold the load. I'll look into that later.

Anyway: as can be seen from the nice testcase cleanups, there's
something to be done here. The combines feel a bit dirty, but I don't
see a better alternative. Finally, I didn't see changes on the
testsuite (SSE2 X86-64, I'll try SSE4.1 and AVX2 as well.)

P.S.: I squashed all three because I don't think it makes it harder to
review, just longer. The three changes are nicely isolated, and will
be committed separately. I'll split into 3 threads if desired.

Depends on D6533.

Thanks,
-Ahmed

Diff Detail

Repository: rL LLVM

Event Timeline

ab updated this revision to Diff 17942. Jan 9 2015, 1:08 PM

ab retitled this revision to "[CodeGen] Add hooks/combine to form vector extloads, and enable it on X86."

ab updated this object.

ab edited the test plan for this revision.

ab added a reviewer: chandlerc.

ab added a parent revision: D6533: [X86] Declare SSE4.1/AVX2 vector extloads covered by the *PMOV*X instructions legal.

ab added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. Jan 9 2015, 1:08 PM

arsenm added a subscriber: arsenm. Jan 9 2015, 1:36 PM

arsenm added inline comments.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5334–5335 ↗	(On Diff #17942)	I don't think this will work correctly for 3-element vectors, but it probably doesn't matter.
5336 ↗	(On Diff #17942)	This should use .getStoreSize()
5671–5682 ↗	(On Diff #17942)	This looks like the same in the sextload handling. Should probably be moved to a new function
5688 ↗	(On Diff #17942)	This should also use getStoreSize()
5694–5698 ↗	(On Diff #17942)	It might be helpful to add a new version of getExtLoad based on the version that takes the MMO and an offset rather than having to pass all of the separate arguments
lib/Target/X86/X86ISelLowering.cpp
20109 ↗	(On Diff #17942)	Missing override

Wow this looks amazing. I'm glad you were able to make it all the way through to fixing this properly (I wasn't when I looked here, I ran away screaming).

Some comments inline...

include/llvm/Target/TargetLowering.h
1512 ↗	(On Diff #17942)	Do we use "ExtLd" in APIs elsewhere? This kind of extreme abbreviation, especially if deviating from other abbreviations, makes the APIs really hard to understand and find.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5660 ↗	(On Diff #17942)	This code is extremely similar to the sext case. Can it be factored out with an appropriate predicate or input instruction tag or something?
5665–5671 ↗	(On Diff #17942)	Even if you can't put this into a common function with sext, at least pull both into their own functions so that you can use early exit for all of this rather than boolean variables.
test/CodeGen/X86/widen_load-2.ll
76–79 ↗	(On Diff #17942)	This is both weird and unfortunate... Do you understand why this happens?

ab updated this revision to Diff 18019. Jan 12 2015, 8:23 AM

ab edited edge metadata.

ab added inline comments.

include/llvm/Target/TargetLowering.h
1512 ↗	(On Diff #17942)	I agree, and originally wanted to use the closer-to-ubiquitous LoadExt. However, there's enableExtLdPromotion, which is the most similar API so should be similarly named. I guess I could rename both to LoadExt instead, how does that sound? That would stay consistent with ISD::LoadExtType, and the various LoadExt legality functions.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5334–5335 ↗	(On Diff #17942)	You're right, I added a check for isPow2VectorType.
5660 ↗	(On Diff #17942)	It should be trivial to factor it out. Other similar combines as well (I'll take care of it separately.)
5694–5698 ↗	(On Diff #17942)	I thought so as well, but two things would make it confusing IMO: 1) the offset operand is for indexed loads only, which is a concept different from an ADDed base pointer. Also, less importantly, 2) we're not sure we can reuse the previous MMO, as the types may differ. Or maybe you meant something else?
test/CodeGen/X86/widen_load-2.ll
76–79 ↗	(On Diff #17942)	This was actually a rebasing mistake, and doesn't happen on a clean apply of the patch. Sorry for the noise.

chandlerc added inline comments. Jan 12 2015, 9:24 AM

include/llvm/Target/TargetLowering.h
1512 ↗	(On Diff #17942)	We're so inconsistent here that I don't really care about what the old code happened to name things provided we're consistent going forward and converge on that. Given that, I'd prefer the minimal amount of abbreviation that is sane, and I think spelling out "load" but abbreviating "ext" is a fine compromise there.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5660 ↗	(On Diff #17942)	Please don't add duplicated code and then factor it afterward. Either do the minimal amount of factoring in this change so that we don't have to review 2x the code, or do the factoring first, and then the functional change.

Ping!

-Ahmed

hfinkel added a subscriber: hfinkel. Jan 27 2015, 6:37 PM

hfinkel added inline comments.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5227 ↗	(On Diff #18019)	Please make this comment more verbose. We might generate multiple loads and then combine them together, and it is not clear here whether this combining happens to the results of the sext or the sext's operand.
5274 ↗	(On Diff #18019)	Is this alignment correct? Aren't these loads smaller than the original one?

My only comments beyond what Hal has already mentioned are real nits:

s/dl/DL/

Can you use DL everywhere? It's odd that in one place you use SDLoc(N0). It may be correct, but if so, it's likely not obvious to the reader why we need to do something different here.

Finally, I'd find the indentation more consistent with clang-format. But now we're waaay down into minor details. This is generally looking much better.

Use smaller alignment for the split loads.
Improve formatting, names.
Don't accept ANYEXTs, since it's not testable or supported yet.
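
The alignment point can be made concrete with a small, self-contained C++ sketch. `minAlign` is written here to mirror the usual "largest power of two dividing both values" helper that the patch calls as `MinAlign`; `splitLoadAlignments` is a hypothetical function used only for illustration, and the byte counts in the usage note are example inputs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Largest power of two that divides both A and B (at least one must be
// nonzero). With B == 0 this returns the lowest set bit of A.
uint64_t minAlign(uint64_t A, uint64_t B) {
  return (A | B) & (1 + ~(A | B));
}

// Alignment that can be claimed for each piece when a load with alignment
// BaseAlign is split into NumSplits pieces placed StrideBytes apart.
std::vector<uint64_t> splitLoadAlignments(uint64_t BaseAlign,
                                          uint64_t StrideBytes,
                                          unsigned NumSplits) {
  assert(BaseAlign != 0 && (BaseAlign & (BaseAlign - 1)) == 0 &&
         "alignment must be a power of two");
  std::vector<uint64_t> Aligns;
  for (unsigned Idx = 0; Idx < NumSplits; ++Idx)
    Aligns.push_back(minAlign(BaseAlign, Idx * StrideBytes));
  return Aligns;
}
```

For a 16-byte-aligned load split into two pieces 8 bytes apart, the sketch yields alignments 16 and 8: the first piece keeps the original alignment, while the second can only claim what its offset guarantees.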

LGTM.

This revision is now accepted and ready to land. Feb 2 2015, 7:24 PM

ab mentioned this in D7423: [ARM] Enable vector extload combine for legal types. Feb 4 2015, 6:35 PM

Closed by commit rL228325: [CodeGen] Add hook/combine to form vector extloads, enabled on X86. (authored by ab). Feb 5 2015, 10:32 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                   Size
llvm/trunk/include/llvm/Target/TargetLowering.h        4 lines
llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp    133 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.h            4 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp          3 lines
llvm/trunk/test/CodeGen/X86/vector-sext.ll             231 lines
llvm/trunk/test/CodeGen/X86/vector-zext.ll             42 lines

Diff 19417

llvm/trunk/include/llvm/Target/TargetLowering.h

@@ 1,512 lines not shown @@ public:
   /// Return true if an fpext operation is free (for instance, because
   /// single-precision floating-point numbers are implicitly extended to
   /// double-precision).
   virtual bool isFPExtFree(EVT VT) const {
     assert(VT.isFloatingPoint());
     return false;
   }

+  /// Return true if folding a vector load into ExtVal (a sign, zero, or any
+  /// extend node) is profitable.
+  virtual bool isVectorLoadExtDesirable(SDValue ExtVal) const { return false; }
+
   /// Return true if an fneg operation is free to the point where it is never
   /// worthwhile to replace it with a bitwise operation.
   virtual bool isFNegFree(EVT VT) const {
     assert(VT.isFloatingPoint());
     return false;
   }

   /// Return true if an fabs operation is free to the point where it is never
@@ 1,293 lines not shown @@

llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp


@@ 321 lines not shown @@ private:

     bool isSetCCEquivalent(SDValue N, SDValue &LHS, SDValue &RHS,
                            SDValue &CC) const;
     bool isOneUseSetCC(SDValue N) const;

     SDValue SimplifyNodeWithTwoResults(SDNode *N, unsigned LoOp,
                                        unsigned HiOp);
     SDValue CombineConsecutiveLoads(SDNode *N, EVT VT);
+    SDValue CombineExtLoad(SDNode *N);
     SDValue ConstantFoldBITCASTofBUILD_VECTOR(SDNode *, EVT);
     SDValue BuildSDIV(SDNode *N);
     SDValue BuildSDIVPow2(SDNode *N);
     SDValue BuildUDIV(SDNode *N);
     SDValue BuildReciprocalEstimate(SDValue Op);
     SDValue BuildRsqrtEstimate(SDValue Op);
     SDValue BuildRsqrtNROneConst(SDValue Op, SDValue Est, unsigned Iterations);
     SDValue BuildRsqrtNRTwoConst(SDValue Op, SDValue Est, unsigned Iterations);
@@ 4,964 lines not shown @@ for (unsigned j = 0; j != 2; ++j) {
         Ops.push_back(DAG.getNode(ExtType, DL, ExtLoad->getValueType(0), SOp));
       }

       Ops.push_back(SetCC->getOperand(2));
       CombineTo(SetCC, DAG.getNode(ISD::SETCC, DL, SetCC->getValueType(0), Ops));
     }
   }

+// FIXME: Bring more similar combines here, common to sext/zext (maybe aext?).
+SDValue DAGCombiner::CombineExtLoad(SDNode *N) {
+  SDValue N0 = N->getOperand(0);
+  EVT DstVT = N->getValueType(0);
+  EVT SrcVT = N0.getValueType();
+
+  assert((N->getOpcode() == ISD::SIGN_EXTEND ||
+          N->getOpcode() == ISD::ZERO_EXTEND) &&
+         "Unexpected node type (not an extend)!");
+
+  // fold (sext (load x)) to multiple smaller sextloads; same for zext.
+  // For example, on a target with legal v4i32, but illegal v8i32, turn:
+  //   (v8i32 (sext (v8i16 (load x))))
+  // into:
+  //   (v8i32 (concat_vectors (v4i32 (sextload x)),
+  //                          (v4i32 (sextload (x + 16)))))
+  // Where uses of the original load, i.e.:
+  //   (v8i16 (load x))
+  // are replaced with:
+  //   (v8i16 (truncate
+  //     (v8i32 (concat_vectors (v4i32 (sextload x)),
+  //                            (v4i32 (sextload (x + 16)))))))
+  //
+  // This combine is only applicable to illegal, but splittable, vectors.
+  // All legal types, and illegal non-vector types, are handled elsewhere.
+  // This combine is controlled by TargetLowering::isVectorLoadExtDesirable.
+  //
+  if (N0->getOpcode() != ISD::LOAD)
+    return SDValue();
+
+  LoadSDNode *LN0 = cast<LoadSDNode>(N0);
+
+  if (!ISD::isNON_EXTLoad(LN0) || !ISD::isUNINDEXEDLoad(LN0) ||
+      !N0.hasOneUse() || LN0->isVolatile() || !DstVT.isVector() ||
+      !DstVT.isPow2VectorType() || !TLI.isVectorLoadExtDesirable(SDValue(N, 0)))
+    return SDValue();
+
+  SmallVector<SDNode *, 4> SetCCs;
+  if (!ExtendUsesToFormExtLoad(N, N0, N->getOpcode(), SetCCs, TLI))
+    return SDValue();
+
+  ISD::LoadExtType ExtType =
+      N->getOpcode() == ISD::SIGN_EXTEND ? ISD::SEXTLOAD : ISD::ZEXTLOAD;
+
+  // Try to split the vector types to get down to legal types.
+  EVT SplitSrcVT = SrcVT;
+  EVT SplitDstVT = DstVT;
+  while (!TLI.isLoadExtLegalOrCustom(ExtType, SplitDstVT, SplitSrcVT) &&
+         SplitSrcVT.getVectorNumElements() > 1) {
+    SplitDstVT = DAG.GetSplitDestVTs(SplitDstVT).first;
+    SplitSrcVT = DAG.GetSplitDestVTs(SplitSrcVT).first;
+  }
+
+  if (!TLI.isLoadExtLegalOrCustom(ExtType, SplitDstVT, SplitSrcVT))
+    return SDValue();
+
+  SDLoc DL(N);
+  const unsigned NumSplits =
+      DstVT.getVectorNumElements() / SplitDstVT.getVectorNumElements();
+  const unsigned Stride = SplitSrcVT.getStoreSize();
+  SmallVector<SDValue, 4> Loads;
+  SmallVector<SDValue, 4> Chains;
+
+  SDValue BasePtr = LN0->getBasePtr();
+  for (unsigned Idx = 0; Idx < NumSplits; Idx++) {
+    const unsigned Offset = Idx * Stride;
+    const unsigned Align = MinAlign(LN0->getAlignment(), Offset);
+
+    SDValue SplitLoad = DAG.getExtLoad(
+        ExtType, DL, SplitDstVT, LN0->getChain(), BasePtr,
+        LN0->getPointerInfo().getWithOffset(Offset), SplitSrcVT,
+        LN0->isVolatile(), LN0->isNonTemporal(), LN0->isInvariant(),
+        Align, LN0->getAAInfo());
+
+    BasePtr = DAG.getNode(ISD::ADD, DL, BasePtr.getValueType(), BasePtr,
+                          DAG.getConstant(Stride, BasePtr.getValueType()));
+
+    Loads.push_back(SplitLoad.getValue(0));
+    Chains.push_back(SplitLoad.getValue(1));
+  }
+
+  SDValue NewChain = DAG.getNode(ISD::TokenFactor, DL, MVT::Other, Chains);
+  SDValue NewValue = DAG.getNode(ISD::CONCAT_VECTORS, DL, DstVT, Loads);
+
+  CombineTo(N, NewValue);
+
+  // Replace uses of the original load (before extension)
+  // with a truncate of the concatenated sextloaded vectors.
+  SDValue Trunc =
+      DAG.getNode(ISD::TRUNCATE, SDLoc(N0), N0.getValueType(), NewValue);
+  CombineTo(N0.getNode(), Trunc, NewChain);
+  ExtendSetCCUses(SetCCs, Trunc, NewValue, DL,
+                  (ISD::NodeType)N->getOpcode());
+  return SDValue(N, 0); // Return N so it doesn't get rechecked!
+}

 SDValue DAGCombiner::visitSIGN_EXTEND(SDNode *N) {
   SDValue N0 = N->getOperand(0);
   EVT VT = N->getValueType(0);

   if (SDNode *Res = tryToFoldExtendOfConstant(N, TLI, DAG, LegalTypes,
                                               LegalOperations))
     return SDValue(Res, 0);

@@ 50 lines not shown @@ if (!LegalOperations || TLI.isOperationLegal(ISD::SIGN_EXTEND_INREG,
       else if (OpBits > DestBits)
         Op = DAG.getNode(ISD::TRUNCATE, SDLoc(N0), VT, Op);
       return DAG.getNode(ISD::SIGN_EXTEND_INREG, SDLoc(N), VT, Op,
                          DAG.getValueType(N0.getValueType()));
     }
   }

   // fold (sext (load x)) -> (sext (truncate (sextload x)))
-  // None of the supported targets knows how to perform load and sign extend
-  // on vectors in one instruction. We only perform this transformation on
-  // scalars.
-  if (ISD::isNON_EXTLoad(N0.getNode()) && !VT.isVector() &&
-      ISD::isUNINDEXEDLoad(N0.getNode()) &&
-      ((!LegalOperations && !cast<LoadSDNode>(N0)->isVolatile()) ||
+  // Only generate vector extloads when 1) they're legal, and 2) they are
+  // deemed desirable by the target.
+  if (ISD::isNON_EXTLoad(N0.getNode()) && ISD::isUNINDEXEDLoad(N0.getNode()) &&
+      ((!LegalOperations && !VT.isVector() &&
+        !cast<LoadSDNode>(N0)->isVolatile()) ||
        TLI.isLoadExtLegal(ISD::SEXTLOAD, VT, N0.getValueType()))) {
     bool DoXform = true;
     SmallVector<SDNode*, 4> SetCCs;
     if (!N0.hasOneUse())
       DoXform = ExtendUsesToFormExtLoad(N, N0, ISD::SIGN_EXTEND, SetCCs, TLI);
+    if (VT.isVector())
+      DoXform &= TLI.isVectorLoadExtDesirable(SDValue(N, 0));
     if (DoXform) {
       LoadSDNode *LN0 = cast<LoadSDNode>(N0);
       SDValue ExtLoad = DAG.getExtLoad(ISD::SEXTLOAD, SDLoc(N), VT,
                                        LN0->getChain(),
                                        LN0->getBasePtr(), N0.getValueType(),
                                        LN0->getMemOperand());
       CombineTo(N, ExtLoad);
       SDValue Trunc = DAG.getNode(ISD::TRUNCATE, SDLoc(N0),
                                   N0.getValueType(), ExtLoad);
       CombineTo(N0.getNode(), Trunc, ExtLoad.getValue(1));
       ExtendSetCCUses(SetCCs, Trunc, ExtLoad, SDLoc(N),
                       ISD::SIGN_EXTEND);
       return SDValue(N, 0); // Return N so it doesn't get rechecked!
     }
   }

+  // fold (sext (load x)) to multiple smaller sextloads.
+  // Only on illegal but splittable vectors.
+  if (SDValue ExtLoad = CombineExtLoad(N))
+    return ExtLoad;
+
   // fold (sext (sextload x)) -> (sext (truncate (sextload x)))
   // fold (sext ( extload x)) -> (sext (truncate (sextload x)))
   if ((ISD::isSEXTLoad(N0.getNode()) || ISD::isEXTLoad(N0.getNode())) &&
       ISD::isUNINDEXEDLoad(N0.getNode()) && N0.hasOneUse()) {
     LoadSDNode *LN0 = cast<LoadSDNode>(N0);
     EVT MemVT = LN0->getMemoryVT();
     if ((!LegalOperations && !LN0->isVolatile()) ||
         TLI.isLoadExtLegal(ISD::SEXTLOAD, VT, MemVT)) {
@@ 247 lines not shown @@ if (N0.getOpcode() == ISD::AND &&
     }
     APInt Mask = cast<ConstantSDNode>(N0.getOperand(1))->getAPIntValue();
     Mask = Mask.zext(VT.getSizeInBits());
     return DAG.getNode(ISD::AND, SDLoc(N), VT,
                        X, DAG.getConstant(Mask, VT));
   }

   // fold (zext (load x)) -> (zext (truncate (zextload x)))
-  // None of the supported targets knows how to perform load and vector_zext
-  // on vectors in one instruction. We only perform this transformation on
-  // scalars.
-  if (ISD::isNON_EXTLoad(N0.getNode()) && !VT.isVector() &&
-      ISD::isUNINDEXEDLoad(N0.getNode()) &&
-      ((!LegalOperations && !cast<LoadSDNode>(N0)->isVolatile()) ||
+  // Only generate vector extloads when 1) they're legal, and 2) they are
+  // deemed desirable by the target.
+  if (ISD::isNON_EXTLoad(N0.getNode()) && ISD::isUNINDEXEDLoad(N0.getNode()) &&
+      ((!LegalOperations && !VT.isVector() &&
+        !cast<LoadSDNode>(N0)->isVolatile()) ||
        TLI.isLoadExtLegal(ISD::ZEXTLOAD, VT, N0.getValueType()))) {
     bool DoXform = true;
     SmallVector<SDNode*, 4> SetCCs;
     if (!N0.hasOneUse())
       DoXform = ExtendUsesToFormExtLoad(N, N0, ISD::ZERO_EXTEND, SetCCs, TLI);
+    if (VT.isVector())
+      DoXform &= TLI.isVectorLoadExtDesirable(SDValue(N, 0));
     if (DoXform) {
       LoadSDNode *LN0 = cast<LoadSDNode>(N0);
       SDValue ExtLoad = DAG.getExtLoad(ISD::ZEXTLOAD, SDLoc(N), VT,
                                        LN0->getChain(),
                                        LN0->getBasePtr(), N0.getValueType(),
                                        LN0->getMemOperand());
       CombineTo(N, ExtLoad);
       SDValue Trunc = DAG.getNode(ISD::TRUNCATE, SDLoc(N0),
                                   N0.getValueType(), ExtLoad);
       CombineTo(N0.getNode(), Trunc, ExtLoad.getValue(1));

       ExtendSetCCUses(SetCCs, Trunc, ExtLoad, SDLoc(N),
                       ISD::ZERO_EXTEND);
       return SDValue(N, 0); // Return N so it doesn't get rechecked!
     }
   }

+  // fold (zext (load x)) to multiple smaller zextloads.
+  // Only on illegal but splittable vectors.
+  if (SDValue ExtLoad = CombineExtLoad(N))
+    return ExtLoad;
+
   // fold (zext (and/or/xor (load x), cst)) ->
   //      (and/or/xor (zextload x), (zext cst))
   if ((N0.getOpcode() == ISD::AND || N0.getOpcode() == ISD::OR ||
        N0.getOpcode() == ISD::XOR) &&
       isa<LoadSDNode>(N0.getOperand(0)) &&
       N0.getOperand(1).getOpcode() == ISD::Constant &&
       TLI.isLoadExtLegal(ISD::ZEXTLOAD, VT, N0.getValueType()) &&
       (!LegalOperations && TLI.isOperationLegal(N0.getOpcode(), VT))) {
@@ 7,204 lines not shown @@

llvm/trunk/lib/Target/X86/X86ISelLowering.h

@@ 738 lines not shown @@ public:
     /// virtual registers. Also, if isTruncateFree(Ty2, Ty1) is true, this
     /// does not necessarily apply to truncate instructions. e.g. on x86-64,
     /// all instructions that define 32-bit values implicit zero-extend the
     /// result out to 64 bits.
     bool isZExtFree(Type *Ty1, Type *Ty2) const override;
     bool isZExtFree(EVT VT1, EVT VT2) const override;
     bool isZExtFree(SDValue Val, EVT VT2) const override;

+    /// Return true if folding a vector load into ExtVal (a sign, zero, or any
+    /// extend node) is profitable.
+    bool isVectorLoadExtDesirable(SDValue) const override;
+
     /// Return true if an FMA operation is faster than a pair of fmul and fadd
     /// instructions. fmuladd intrinsics will be expanded to FMAs when this
     /// method returns true, otherwise fmuladd is expanded to fmul + fadd.
     bool isFMAFasterThanFMulAndFAdd(EVT VT) const override;

     /// Return true if it's profitable to narrow
     /// operations of type VT1 to VT2. e.g. on x86, it's profitable to narrow
     /// from i32 to i8 but not from i32 to i16.
@@ 313 lines not shown @@

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp


@@ 16,288 lines not shown @@ static SDValue LowerSIGN_EXTEND(SDValue Op, const X86Subtarget *Subtarget,

   return DAG.getNode(ISD::CONCAT_VECTORS, dl, VT, OpLo, OpHi);
 }

 // Lower vector extended loads using a shuffle. If SSSE3 is not available we
 // may emit an illegal shuffle but the expansion is still better than scalar
 // code. We generate X86ISD::VSEXT for SEXTLOADs if it's available, otherwise
 // we'll emit a shuffle and a arithmetic shift.
+// FIXME: Is the expansion actually better than scalar code? It doesn't seem so.
 // TODO: It is possible to support ZExt by zeroing the undef values during
 // the shuffle phase or after the shuffle.
 static SDValue LowerExtendedLoad(SDValue Op, const X86Subtarget *Subtarget,
                                  SelectionDAG &DAG) {
   MVT RegVT = Op.getSimpleValueType();
   assert(RegVT.isVector() && "We only custom lower vector sext loads.");
   assert(RegVT.isInteger() &&
          "We only custom lower integer vector sext loads.");
@@ 4,089 lines not shown @@ bool X86TargetLowering::isZExtFree(SDValue Val, EVT VT2) const {
   case MVT::i32:
     // X86 has 8, 16, and 32-bit zero-extending loads.
     return true;
   }

   return false;
 }

+bool X86TargetLowering::isVectorLoadExtDesirable(SDValue) const { return true; }
+
 bool
 X86TargetLowering::isFMAFasterThanFMulAndFAdd(EVT VT) const {
   if (!(Subtarget->hasFMA() || Subtarget->hasFMA4()))
     return false;

   VT = VT.getScalarType();

   if (!VT.isSimple())
@@ 6,622 lines not shown @@

llvm/trunk/test/CodeGen/X86/vector-sext.ll

@@ 517 lines not shown @@
 ; X32-SSE41-NEXT: retl
   %extmask = sext <4 x i1> %mask to <4 x i64>
   ret <4 x i64> %extmask
 }

 define <16 x i16> @sext_16i8_to_16i16(<16 x i8> *%ptr) {
 ; SSE2-LABEL: sext_16i8_to_16i16:
 ; SSE2: # BB#0: # %entry
-; SSE2-NEXT: movdqa (%rdi), %xmm1
-; SSE2-NEXT: movdqa %xmm1, %xmm0
+; SSE2-NEXT: movq (%rdi), %xmm0
 ; SSE2-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
-; SSE2-NEXT: psllw $8, %xmm0
 ; SSE2-NEXT: psraw $8, %xmm0
-; SSE2-NEXT: punpckhbw {{.*#+}} xmm1 = xmm1[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; SSE2-NEXT: psllw $8, %xmm1
+; SSE2-NEXT: movq 8(%rdi), %xmm1
+; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
 ; SSE2-NEXT: psraw $8, %xmm1
 ; SSE2-NEXT: retq
 ;
 ; SSSE3-LABEL: sext_16i8_to_16i16:
 ; SSSE3: # BB#0: # %entry
-; SSSE3-NEXT: movdqa (%rdi), %xmm1
-; SSSE3-NEXT: movdqa %xmm1, %xmm0
+; SSSE3-NEXT: movq (%rdi), %xmm0
 ; SSSE3-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
-; SSSE3-NEXT: psllw $8, %xmm0
 ; SSSE3-NEXT: psraw $8, %xmm0
-; SSSE3-NEXT: punpckhbw {{.*#+}} xmm1 = xmm1[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; SSSE3-NEXT: psllw $8, %xmm1
+; SSSE3-NEXT: movq 8(%rdi), %xmm1
+; SSSE3-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
 ; SSSE3-NEXT: psraw $8, %xmm1
 ; SSSE3-NEXT: retq
 ;
 ; SSE41-LABEL: sext_16i8_to_16i16:
 ; SSE41: # BB#0: # %entry
-; SSE41-NEXT: movdqa (%rdi), %xmm1
-; SSE41-NEXT: pmovzxbw %xmm1, %xmm0
-; SSE41-NEXT: psllw $8, %xmm0
-; SSE41-NEXT: psraw $8, %xmm0
-; SSE41-NEXT: punpckhbw {{.*#+}} xmm1 = xmm1[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; SSE41-NEXT: psllw $8, %xmm1
-; SSE41-NEXT: psraw $8, %xmm1
+; SSE41-NEXT: pmovsxbw (%rdi), %xmm0
+; SSE41-NEXT: pmovsxbw 8(%rdi), %xmm1
 ; SSE41-NEXT: retq
 ;
 ; AVX1-LABEL: sext_16i8_to_16i16:
 ; AVX1: # BB#0: # %entry
-; AVX1-NEXT: vmovdqa (%rdi), %xmm0
-; AVX1-NEXT: vpmovsxbw %xmm0, %xmm1
-; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
-; AVX1-NEXT: vpmovsxbw %xmm0, %xmm0
-; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
+; AVX1-NEXT: vpmovsxbw (%rdi), %xmm0
+; AVX1-NEXT: vpmovsxbw 8(%rdi), %xmm1
+; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: sext_16i8_to_16i16:
 ; AVX2: # BB#0: # %entry
 ; AVX2-NEXT: vpmovsxbw (%rdi), %ymm0
 ; AVX2-NEXT: retq
 ;
 ; X32-SSE41-LABEL: sext_16i8_to_16i16:
 ; X32-SSE41: # BB#0: # %entry
 ; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax
-; X32-SSE41-NEXT: movdqa (%eax), %xmm1
-; X32-SSE41-NEXT: pmovzxbw %xmm1, %xmm0
-; X32-SSE41-NEXT: psllw $8, %xmm0
-; X32-SSE41-NEXT: psraw $8, %xmm0
-; X32-SSE41-NEXT: punpckhbw {{.*#+}} xmm1 = xmm1[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; X32-SSE41-NEXT: psllw $8, %xmm1
-; X32-SSE41-NEXT: psraw $8, %xmm1
+; X32-SSE41-NEXT: pmovsxbw (%eax), %xmm0
+; X32-SSE41-NEXT: pmovsxbw 8(%eax), %xmm1
 ; X32-SSE41-NEXT: retl
 entry:
   %X = load <16 x i8>* %ptr
   %Y = sext <16 x i8> %X to <16 x i16>
   ret <16 x i16> %Y
 }

 define <4 x i64> @sext_4i8_to_4i64(<4 x i8> %mask) {
	[109 lines not shown]
	; X32-SSE41-NEXT: retl			; X32-SSE41-NEXT: retl
	%extmask = sext <4 x i8> %mask to <4 x i64>			%extmask = sext <4 x i8> %mask to <4 x i64>
	ret <4 x i64> %extmask			ret <4 x i64> %extmask
	}			}

	define <4 x i64> @load_sext_4i8_to_4i64(<4 x i8> *%ptr) {			define <4 x i64> @load_sext_4i8_to_4i64(<4 x i8> *%ptr) {
	; SSE2-LABEL: load_sext_4i8_to_4i64:			; SSE2-LABEL: load_sext_4i8_to_4i64:
	; SSE2: # BB#0: # %entry			; SSE2: # BB#0: # %entry
	; SSE2-NEXT: movd (%rdi), %xmm1			; SSE2-NEXT: movsbq 1(%rdi), %rax
	; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]			; SSE2-NEXT: movd %rax, %xmm1
	; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]			; SSE2-NEXT: movsbq (%rdi), %rax
	; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[0,1,1,3]
	; SSE2-NEXT: movd %xmm2, %rax
	; SSE2-NEXT: movsbq %al, %rax
	; SSE2-NEXT: movd %rax, %xmm0			; SSE2-NEXT: movd %rax, %xmm0
	; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]			; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
	; SSE2-NEXT: movd %xmm2, %rax			; SSE2-NEXT: movsbq 3(%rdi), %rax
	; SSE2-NEXT: movsbq %al, %rax
	; SSE2-NEXT: movd %rax, %xmm2
	; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,2,3,3]
	; SSE2-NEXT: movd %xmm2, %rax
	; SSE2-NEXT: movsbq %al, %rax
	; SSE2-NEXT: movd %rax, %xmm1
	; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]
	; SSE2-NEXT: movd %xmm2, %rax
	; SSE2-NEXT: movsbq %al, %rax
	; SSE2-NEXT: movd %rax, %xmm2			; SSE2-NEXT: movd %rax, %xmm2
				; SSE2-NEXT: movsbq 2(%rdi), %rax
				; SSE2-NEXT: movd %rax, %xmm1
	; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]			; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; SSSE3-LABEL: load_sext_4i8_to_4i64:			; SSSE3-LABEL: load_sext_4i8_to_4i64:
	; SSSE3: # BB#0: # %entry			; SSSE3: # BB#0: # %entry
	; SSSE3-NEXT: movd (%rdi), %xmm1			; SSSE3-NEXT: movsbq 1(%rdi), %rax
	; SSSE3-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]			; SSSE3-NEXT: movd %rax, %xmm1
	; SSSE3-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]			; SSSE3-NEXT: movsbq (%rdi), %rax
	; SSSE3-NEXT: pshufd {{.*#+}} xmm2 = xmm1[0,1,1,3]
	; SSSE3-NEXT: movd %xmm2, %rax
	; SSSE3-NEXT: movsbq %al, %rax
	; SSSE3-NEXT: movd %rax, %xmm0			; SSSE3-NEXT: movd %rax, %xmm0
	; SSSE3-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]			; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
	; SSSE3-NEXT: movd %xmm2, %rax			; SSSE3-NEXT: movsbq 3(%rdi), %rax
	; SSSE3-NEXT: movsbq %al, %rax
	; SSSE3-NEXT: movd %rax, %xmm2
	; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; SSSE3-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,2,3,3]
	; SSSE3-NEXT: movd %xmm2, %rax
	; SSSE3-NEXT: movsbq %al, %rax
	; SSSE3-NEXT: movd %rax, %xmm1
	; SSSE3-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]
	; SSSE3-NEXT: movd %xmm2, %rax
	; SSSE3-NEXT: movsbq %al, %rax
	; SSSE3-NEXT: movd %rax, %xmm2			; SSSE3-NEXT: movd %rax, %xmm2
				; SSSE3-NEXT: movsbq 2(%rdi), %rax
				; SSSE3-NEXT: movd %rax, %xmm1
	; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]			; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
	; SSSE3-NEXT: retq			; SSSE3-NEXT: retq
	;			;
	; SSE41-LABEL: load_sext_4i8_to_4i64:			; SSE41-LABEL: load_sext_4i8_to_4i64:
	; SSE41: # BB#0: # %entry			; SSE41: # BB#0: # %entry
	; SSE41-NEXT: pmovzxbd (%rdi), %xmm1			; SSE41-NEXT: pmovsxbq (%rdi), %xmm0
	; SSE41-NEXT: pmovzxdq %xmm1, %xmm0			; SSE41-NEXT: pmovsxbq 2(%rdi), %xmm1
	; SSE41-NEXT: pextrq $1, %xmm0, %rax
	; SSE41-NEXT: movsbq %al, %rax
	; SSE41-NEXT: movd %rax, %xmm2
	; SSE41-NEXT: movd %xmm0, %rax
	; SSE41-NEXT: movsbq %al, %rax
	; SSE41-NEXT: movd %rax, %xmm0
	; SSE41-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,2,3,3]
	; SSE41-NEXT: pextrq $1, %xmm1, %rax
	; SSE41-NEXT: movsbq %al, %rax
	; SSE41-NEXT: movd %rax, %xmm2
	; SSE41-NEXT: movd %xmm1, %rax
	; SSE41-NEXT: movsbq %al, %rax
	; SSE41-NEXT: movd %rax, %xmm1
	; SSE41-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
	; SSE41-NEXT: retq			; SSE41-NEXT: retq
	;			;
	; AVX1-LABEL: load_sext_4i8_to_4i64:			; AVX1-LABEL: load_sext_4i8_to_4i64:
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vpmovsxbd (%rdi), %xmm0			; AVX1-NEXT: vpmovsxbd (%rdi), %xmm0
	; AVX1-NEXT: vpmovsxdq %xmm0, %xmm1			; AVX1-NEXT: vpmovsxdq %xmm0, %xmm1
	; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]			; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
	; AVX1-NEXT: vpmovsxdq %xmm0, %xmm0			; AVX1-NEXT: vpmovsxdq %xmm0, %xmm0
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: load_sext_4i8_to_4i64:			; AVX2-LABEL: load_sext_4i8_to_4i64:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vpmovsxbq (%rdi), %ymm0			; AVX2-NEXT: vpmovsxbq (%rdi), %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; X32-SSE41-LABEL: load_sext_4i8_to_4i64:			; X32-SSE41-LABEL: load_sext_4i8_to_4i64:
	; X32-SSE41: # BB#0: # %entry			; X32-SSE41: # BB#0: # %entry
	; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE41-NEXT: movd (%eax), %xmm0			; X32-SSE41-NEXT: pmovsxbq (%eax), %xmm0
	; X32-SSE41-NEXT: pmovzxbd %xmm0, %xmm1			; X32-SSE41-NEXT: pmovsxbq 2(%eax), %xmm1
	; X32-SSE41-NEXT: pmovzxbq %xmm0, %xmm2
	; X32-SSE41-NEXT: movd %xmm2, %eax
	; X32-SSE41-NEXT: movsbl %al, %eax
	; X32-SSE41-NEXT: movd %eax, %xmm0
	; X32-SSE41-NEXT: sarl $31, %eax
	; X32-SSE41-NEXT: pinsrd $1, %eax, %xmm0
	; X32-SSE41-NEXT: pextrd $2, %xmm2, %eax
	; X32-SSE41-NEXT: movsbl %al, %eax
	; X32-SSE41-NEXT: pinsrd $2, %eax, %xmm0
	; X32-SSE41-NEXT: sarl $31, %eax
	; X32-SSE41-NEXT: pinsrd $3, %eax, %xmm0
	; X32-SSE41-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,2,3,3]
	; X32-SSE41-NEXT: movd %xmm2, %eax
	; X32-SSE41-NEXT: movsbl %al, %eax
	; X32-SSE41-NEXT: movd %eax, %xmm1
	; X32-SSE41-NEXT: sarl $31, %eax
	; X32-SSE41-NEXT: pinsrd $1, %eax, %xmm1
	; X32-SSE41-NEXT: pextrd $2, %xmm2, %eax
	; X32-SSE41-NEXT: movsbl %al, %eax
	; X32-SSE41-NEXT: pinsrd $2, %eax, %xmm1
	; X32-SSE41-NEXT: sarl $31, %eax
	; X32-SSE41-NEXT: pinsrd $3, %eax, %xmm1
	; X32-SSE41-NEXT: retl			; X32-SSE41-NEXT: retl
	entry:			entry:
	%X = load <4 x i8>* %ptr			%X = load <4 x i8>* %ptr
	%Y = sext <4 x i8> %X to <4 x i64>			%Y = sext <4 x i8> %X to <4 x i64>
	ret <4 x i64>%Y			ret <4 x i64>%Y
	}			}

	define <4 x i64> @load_sext_4i16_to_4i64(<4 x i16> *%ptr) {			define <4 x i64> @load_sext_4i16_to_4i64(<4 x i16> *%ptr) {
	; SSE2-LABEL: load_sext_4i16_to_4i64:			; SSE2-LABEL: load_sext_4i16_to_4i64:
	; SSE2: # BB#0: # %entry			; SSE2: # BB#0: # %entry
	; SSE2-NEXT: movq (%rdi), %xmm1			; SSE2-NEXT: movswq 2(%rdi), %rax
	; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]			; SSE2-NEXT: movd %rax, %xmm1
	; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[0,1,1,3]			; SSE2-NEXT: movswq (%rdi), %rax
	; SSE2-NEXT: movd %xmm2, %rax
	; SSE2-NEXT: movswq %ax, %rax
	; SSE2-NEXT: movd %rax, %xmm0			; SSE2-NEXT: movd %rax, %xmm0
	; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]			; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
	; SSE2-NEXT: movd %xmm2, %rax			; SSE2-NEXT: movswq 6(%rdi), %rax
	; SSE2-NEXT: movswq %ax, %rax
	; SSE2-NEXT: movd %rax, %xmm2
	; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,2,3,3]
	; SSE2-NEXT: movd %xmm2, %rax
	; SSE2-NEXT: movswq %ax, %rax
	; SSE2-NEXT: movd %rax, %xmm1
	; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]
	; SSE2-NEXT: movd %xmm2, %rax
	; SSE2-NEXT: movswq %ax, %rax
	; SSE2-NEXT: movd %rax, %xmm2			; SSE2-NEXT: movd %rax, %xmm2
				; SSE2-NEXT: movswq 4(%rdi), %rax
				; SSE2-NEXT: movd %rax, %xmm1
	; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]			; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; SSSE3-LABEL: load_sext_4i16_to_4i64:			; SSSE3-LABEL: load_sext_4i16_to_4i64:
	; SSSE3: # BB#0: # %entry			; SSSE3: # BB#0: # %entry
	; SSSE3-NEXT: movq (%rdi), %xmm1			; SSSE3-NEXT: movswq 2(%rdi), %rax
	; SSSE3-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]			; SSSE3-NEXT: movd %rax, %xmm1
	; SSSE3-NEXT: pshufd {{.*#+}} xmm2 = xmm1[0,1,1,3]			; SSSE3-NEXT: movswq (%rdi), %rax
	; SSSE3-NEXT: movd %xmm2, %rax
	; SSSE3-NEXT: movswq %ax, %rax
	; SSSE3-NEXT: movd %rax, %xmm0			; SSSE3-NEXT: movd %rax, %xmm0
	; SSSE3-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]			; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
	; SSSE3-NEXT: movd %xmm2, %rax			; SSSE3-NEXT: movswq 6(%rdi), %rax
	; SSSE3-NEXT: movswq %ax, %rax
	; SSSE3-NEXT: movd %rax, %xmm2
	; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; SSSE3-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,2,3,3]
	; SSSE3-NEXT: movd %xmm2, %rax
	; SSSE3-NEXT: movswq %ax, %rax
	; SSSE3-NEXT: movd %rax, %xmm1
	; SSSE3-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]
	; SSSE3-NEXT: movd %xmm2, %rax
	; SSSE3-NEXT: movswq %ax, %rax
	; SSSE3-NEXT: movd %rax, %xmm2			; SSSE3-NEXT: movd %rax, %xmm2
				; SSSE3-NEXT: movswq 4(%rdi), %rax
				; SSSE3-NEXT: movd %rax, %xmm1
	; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]			; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
	; SSSE3-NEXT: retq			; SSSE3-NEXT: retq
	;			;
	; SSE41-LABEL: load_sext_4i16_to_4i64:			; SSE41-LABEL: load_sext_4i16_to_4i64:
	; SSE41: # BB#0: # %entry			; SSE41: # BB#0: # %entry
	; SSE41-NEXT: movq (%rdi), %xmm0			; SSE41-NEXT: pmovsxwq (%rdi), %xmm0
	; SSE41-NEXT: pmovzxwd %xmm0, %xmm1			; SSE41-NEXT: pmovsxwq 4(%rdi), %xmm1
	; SSE41-NEXT: pmovzxwq %xmm0, %xmm0
	; SSE41-NEXT: pextrq $1, %xmm0, %rax
	; SSE41-NEXT: movswq %ax, %rax
	; SSE41-NEXT: movd %rax, %xmm2
	; SSE41-NEXT: movd %xmm0, %rax
	; SSE41-NEXT: movswq %ax, %rax
	; SSE41-NEXT: movd %rax, %xmm0
	; SSE41-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,2,3,3]
	; SSE41-NEXT: pextrq $1, %xmm1, %rax
	; SSE41-NEXT: movswq %ax, %rax
	; SSE41-NEXT: movd %rax, %xmm2
	; SSE41-NEXT: movd %xmm1, %rax
	; SSE41-NEXT: movswq %ax, %rax
	; SSE41-NEXT: movd %rax, %xmm1
	; SSE41-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
	; SSE41-NEXT: retq			; SSE41-NEXT: retq
	;			;
	; AVX1-LABEL: load_sext_4i16_to_4i64:			; AVX1-LABEL: load_sext_4i16_to_4i64:
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vpmovsxwd (%rdi), %xmm0			; AVX1-NEXT: vpmovsxwd (%rdi), %xmm0
	; AVX1-NEXT: vpmovsxdq %xmm0, %xmm1			; AVX1-NEXT: vpmovsxdq %xmm0, %xmm1
	; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]			; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
	; AVX1-NEXT: vpmovsxdq %xmm0, %xmm0			; AVX1-NEXT: vpmovsxdq %xmm0, %xmm0
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: load_sext_4i16_to_4i64:			; AVX2-LABEL: load_sext_4i16_to_4i64:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vpmovsxwq (%rdi), %ymm0			; AVX2-NEXT: vpmovsxwq (%rdi), %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; X32-SSE41-LABEL: load_sext_4i16_to_4i64:			; X32-SSE41-LABEL: load_sext_4i16_to_4i64:
	; X32-SSE41: # BB#0: # %entry			; X32-SSE41: # BB#0: # %entry
	; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax			; X32-SSE41-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X32-SSE41-NEXT: movsd (%eax), %xmm0			; X32-SSE41-NEXT: pmovsxwq (%eax), %xmm0
	; X32-SSE41-NEXT: pmovzxwd %xmm0, %xmm1			; X32-SSE41-NEXT: pmovsxwq 4(%eax), %xmm1
	; X32-SSE41-NEXT: pmovzxwq %xmm0, %xmm2
	; X32-SSE41-NEXT: movd %xmm2, %eax
	; X32-SSE41-NEXT: cwtl
	; X32-SSE41-NEXT: movd %eax, %xmm0
	; X32-SSE41-NEXT: sarl $31, %eax
	; X32-SSE41-NEXT: pinsrd $1, %eax, %xmm0
	; X32-SSE41-NEXT: pextrd $2, %xmm2, %eax
	; X32-SSE41-NEXT: cwtl
	; X32-SSE41-NEXT: pinsrd $2, %eax, %xmm0
	; X32-SSE41-NEXT: sarl $31, %eax
	; X32-SSE41-NEXT: pinsrd $3, %eax, %xmm0
	; X32-SSE41-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,2,3,3]
	; X32-SSE41-NEXT: movd %xmm2, %eax
	; X32-SSE41-NEXT: cwtl
	; X32-SSE41-NEXT: movd %eax, %xmm1
	; X32-SSE41-NEXT: sarl $31, %eax
	; X32-SSE41-NEXT: pinsrd $1, %eax, %xmm1
	; X32-SSE41-NEXT: pextrd $2, %xmm2, %eax
	; X32-SSE41-NEXT: cwtl
	; X32-SSE41-NEXT: pinsrd $2, %eax, %xmm1
	; X32-SSE41-NEXT: sarl $31, %eax
	; X32-SSE41-NEXT: pinsrd $3, %eax, %xmm1
	; X32-SSE41-NEXT: retl			; X32-SSE41-NEXT: retl
	entry:			entry:
	%X = load <4 x i16>* %ptr			%X = load <4 x i16>* %ptr
	%Y = sext <4 x i16> %X to <4 x i64>			%Y = sext <4 x i16> %X to <4 x i64>
	ret <4 x i64>%Y			ret <4 x i64>%Y
	}			}

llvm/trunk/test/CodeGen/X86/vector-zext.ll

	[224 lines not shown]
	; SSSE3-NEXT: movdqa {{.*#+}} xmm2 = [255,255,255,255,255,255,255,255]			; SSSE3-NEXT: movdqa {{.*#+}} xmm2 = [255,255,255,255,255,255,255,255]
	; SSSE3-NEXT: pand %xmm2, %xmm0			; SSSE3-NEXT: pand %xmm2, %xmm0
	; SSSE3-NEXT: punpckhbw %xmm1, %xmm1 # xmm1 = xmm1[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]			; SSSE3-NEXT: punpckhbw %xmm1, %xmm1 # xmm1 = xmm1[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
	; SSSE3-NEXT: pand %xmm2, %xmm1			; SSSE3-NEXT: pand %xmm2, %xmm1
	; SSSE3-NEXT: retq			; SSSE3-NEXT: retq

	; SSE41-LABEL: load_zext_16i8_to_16i16:			; SSE41-LABEL: load_zext_16i8_to_16i16:
	; SSE41: # BB#0: # %entry			; SSE41: # BB#0: # %entry
	; SSE41-NEXT: movdqa (%rdi), %xmm1			; SSE41-NEXT: pmovzxbw (%rdi), %xmm0
	; SSE41-NEXT: pmovzxbw %xmm1, %xmm0			; SSE41-NEXT: pmovzxbw 8(%rdi), %xmm1
	; SSE41-NEXT: movdqa {{.*#+}} xmm2 = [255,255,255,255,255,255,255,255]
	; SSE41-NEXT: pand %xmm2, %xmm0
	; SSE41-NEXT: punpckhbw %xmm1, %xmm1 # xmm1 = xmm1[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
	; SSE41-NEXT: pand %xmm2, %xmm1
	; SSE41-NEXT: retq			; SSE41-NEXT: retq

	; AVX1-LABEL: load_zext_16i8_to_16i16:			; AVX1-LABEL: load_zext_16i8_to_16i16:
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vmovdqa (%rdi), %xmm0			; AVX1-NEXT: vpmovzxbw (%rdi), %xmm0
	; AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1			; AVX1-NEXT: vpmovzxbw 8(%rdi), %xmm1
	; AVX1-NEXT: vpunpckhbw %xmm1, %xmm0, %xmm1 # xmm1 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
	; AVX1-NEXT: vpmovzxbw %xmm0, %xmm0
	; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq

	; AVX2-LABEL: load_zext_16i8_to_16i16:			; AVX2-LABEL: load_zext_16i8_to_16i16:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vpmovzxbw (%rdi), %ymm0			; AVX2-NEXT: vpmovzxbw (%rdi), %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	entry:			entry:
	[22 lines not shown]
	; SSSE3-NEXT: movdqa {{.*#+}} xmm2 = [65535,65535,65535,65535]			; SSSE3-NEXT: movdqa {{.*#+}} xmm2 = [65535,65535,65535,65535]
	; SSSE3-NEXT: pand %xmm2, %xmm0			; SSSE3-NEXT: pand %xmm2, %xmm0
	; SSSE3-NEXT: punpckhwd %xmm1, %xmm1 # xmm1 = xmm1[4,4,5,5,6,6,7,7]			; SSSE3-NEXT: punpckhwd %xmm1, %xmm1 # xmm1 = xmm1[4,4,5,5,6,6,7,7]
	; SSSE3-NEXT: pand %xmm2, %xmm1			; SSSE3-NEXT: pand %xmm2, %xmm1
	; SSSE3-NEXT: retq			; SSSE3-NEXT: retq

	; SSE41-LABEL: load_zext_8i16_to_8i32:			; SSE41-LABEL: load_zext_8i16_to_8i32:
	; SSE41: # BB#0: # %entry			; SSE41: # BB#0: # %entry
	; SSE41-NEXT: movdqa (%rdi), %xmm1			; SSE41-NEXT: pmovzxwd (%rdi), %xmm0
	; SSE41-NEXT: pmovzxwd %xmm1, %xmm0			; SSE41-NEXT: pmovzxwd 8(%rdi), %xmm1
	; SSE41-NEXT: movdqa {{.*#+}} xmm2 = [65535,65535,65535,65535]
	; SSE41-NEXT: pand %xmm2, %xmm0
	; SSE41-NEXT: punpckhwd %xmm1, %xmm1 # xmm1 = xmm1[4,4,5,5,6,6,7,7]
	; SSE41-NEXT: pand %xmm2, %xmm1
	; SSE41-NEXT: retq			; SSE41-NEXT: retq

	; AVX1-LABEL: load_zext_8i16_to_8i32:			; AVX1-LABEL: load_zext_8i16_to_8i32:
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vmovdqa (%rdi), %xmm0			; AVX1-NEXT: vpmovzxwd (%rdi), %xmm0
	; AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1			; AVX1-NEXT: vpmovzxwd 8(%rdi), %xmm1
	; AVX1-NEXT: vpunpckhwd %xmm1, %xmm0, %xmm1 # xmm1 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
	; AVX1-NEXT: vpmovzxwd %xmm0, %xmm0
	; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq

	; AVX2-LABEL: load_zext_8i16_to_8i32:			; AVX2-LABEL: load_zext_8i16_to_8i32:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vpmovzxwd (%rdi), %ymm0			; AVX2-NEXT: vpmovzxwd (%rdi), %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	entry:			entry:
	[20 lines not shown]
	; SSSE3-NEXT: movdqa {{.*#+}} xmm2 = [4294967295,4294967295]			; SSSE3-NEXT: movdqa {{.*#+}} xmm2 = [4294967295,4294967295]
	; SSSE3-NEXT: pand %xmm2, %xmm0			; SSSE3-NEXT: pand %xmm2, %xmm0
	; SSSE3-NEXT: pshufd $250, %xmm1, %xmm1 # xmm1 = xmm1[2,2,3,3]			; SSSE3-NEXT: pshufd $250, %xmm1, %xmm1 # xmm1 = xmm1[2,2,3,3]
	; SSSE3-NEXT: pand %xmm2, %xmm1			; SSSE3-NEXT: pand %xmm2, %xmm1
	; SSSE3-NEXT: retq			; SSSE3-NEXT: retq

	; SSE41-LABEL: load_zext_4i32_to_4i64:			; SSE41-LABEL: load_zext_4i32_to_4i64:
	; SSE41: # BB#0: # %entry			; SSE41: # BB#0: # %entry
	; SSE41-NEXT: movdqa (%rdi), %xmm1			; SSE41-NEXT: pmovzxdq (%rdi), %xmm0
	; SSE41-NEXT: pmovzxdq %xmm1, %xmm0			; SSE41-NEXT: pmovzxdq 8(%rdi), %xmm1
	; SSE41-NEXT: movdqa {{.*#+}} xmm2 = [4294967295,4294967295]
	; SSE41-NEXT: pand %xmm2, %xmm0
	; SSE41-NEXT: pshufd $250, %xmm1, %xmm1 # xmm1 = xmm1[2,2,3,3]
	; SSE41-NEXT: pand %xmm2, %xmm1
	; SSE41-NEXT: retq			; SSE41-NEXT: retq

	; AVX1-LABEL: load_zext_4i32_to_4i64:			; AVX1-LABEL: load_zext_4i32_to_4i64:
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vmovdqa (%rdi), %xmm0			; AVX1-NEXT: vpmovzxdq (%rdi), %xmm0
	; AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1			; AVX1-NEXT: vpmovzxdq 8(%rdi), %xmm1
	; AVX1-NEXT: vpunpckhdq %xmm1, %xmm0, %xmm1 # xmm1 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]
	; AVX1-NEXT: vpmovzxdq %xmm0, %xmm0
	; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq

	; AVX2-LABEL: load_zext_4i32_to_4i64:			; AVX2-LABEL: load_zext_4i32_to_4i64:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vpmovzxdq (%rdi), %ymm0			; AVX2-NEXT: vpmovzxdq (%rdi), %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	entry:			entry:
	%X = load <4 x i32>* %ptr			%X = load <4 x i32>* %ptr
	%Y = zext <4 x i32> %X to <4 x i64>			%Y = zext <4 x i32> %X to <4 x i64>
	ret <4 x i64>%Y			ret <4 x i64>%Y
	}			}