This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Vectorized i64 uniform constant SRA shifts
ClosedPublic

Authored by RKSimon on May 10 2015, 4:47 AM.

Details

Reviewers

qcolombet
delena
andreadb

Commits

rG8fbf1c1f4a6f: [X86][SSE] Vectorized i64 uniform constant SRA shifts
rL241514: [X86][SSE] Vectorized i64 uniform constant SRA shifts

Summary

This patch adds vectorization support for uniform constant i64 arithmetic shift right operators.

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 25426. May 10 2015, 4:47 AM

RKSimon retitled this revision from to [X86][SSE] Vectorized i64 uniform constant SRA shifts.

RKSimon updated this object.

RKSimon edited the test plan for this revision.

RKSimon added reviewers: andreadb, delena, qcolombet.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: Unknown Object (MLST).

delena added inline comments. May 10 2015, 6:50 AM

lib/Target/X86/X86ISelLowering.cpp
16559–16561	I think that I fixed a bug here and removed AVX512. Could you please check?
test/CodeGen/X86/vector-sext.ll
111–120	I see this code as 2 pmovsxdq instructions and one shuffle between them. For Windows, it looks like:
	pmovsxdq (%rcx), %xmm0
	pmovsxdq 8(%rcx), %xmm1
	retq
For Linux your parameter is in xmm, so you need one shuffle with <2, 3, undef, undef>.

Thanks Elena, I'll get extra AVX512 tests added for review.

lib/Target/X86/X86ISelLowering.cpp
16559–16561	Yes I'll add a proper AVX512 test.
test/CodeGen/X86/vector-sext.ll
111–120	I have some work-in-progress patches to improve pmovsx* support. The problem is that SIGN_EXTEND_INREG / SIGN_EXTEND_VECTOR_INREG is a mess and will take a while to sort out. Making this i64 SRA patch was an easy first step so at least we're not transferring between xmm registers and GPRs so much.

I think that you did not update the code before creating the patch.
See 235993.

Elena

Dropped AVX512 support for i64 SRA in LowerScalarVariableShift as noted by Elena - added FIXME note.

You should generate different code for SSE 4.1 and AVX, according to my previous comment.

In D9645#169914, @delena wrote:

You should generate different code for SSE 4.1 and AVX, according to my previous comment.

Thanks - I have better support for pmovsx* coming in a later patch, which will be much easier to implement once this patch is done.

I disagree with this approach. You can prepare a separate patch for "shift-right-64bits" optimization.
But you should not use shift-right in SEXT and generate 10 instructions instead of 3.

In D9645#170039, @delena wrote:

I disagree with this approach. You can prepare a separate patch for "shift-right-64bits" optimization.
But you should not use shift-right in SEXT and generate 10 instructions instead of 3.

I haven't intentionally added these shift-rights in the sext tests; they are just a result of the default sign-extension expansion from SIGN_EXTEND_INREG. I still think this is an improvement over where we are now, but I'll push the pmovsx* patch up for review as soon as I can and revisit this afterwards.
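The sext sequences under discussion come from sign-extending the low 32 bits of each 64-bit lane "in register", which is commonly expanded as a left shift by 32 followed by an arithmetic right shift by 32 (visible as `psllq $32` followed by `psrad $31` plus shuffles in the updated tests). The following is a minimal scalar sketch of that identity, not the LLVM expansion itself; the helper name `sext_inreg_32` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: sign-extend the low 32 bits of a 64-bit lane.
// Step 1: shift left by 32 so that bit 31 becomes the sign bit (bit 63).
// Step 2: shift right by 32 arithmetically, filling with that sign bit.
static int64_t sext_inreg_32(uint64_t lane) {
  uint64_t shl = lane << 32;
  uint64_t res = shl >> 32;          // logical shift back
  if (shl >> 63)                     // emulate the arithmetic fill
    res |= 0xFFFFFFFF00000000ull;
  return static_cast<int64_t>(res);
}
```

Whatever was in the upper 32 bits of the lane is discarded, which is why the shift amount of 32 lands in the `ShiftAmt >= 32` path of the new lowering.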

RKSimon mentioned this in D9923: Adjust the cost of vectorized SHL/SRL/SRA. May 22 2015, 3:46 AM

Refreshed this patch now that the previous edge cases (sint_to_fp, and sext) have been dealt with properly.

I've updated the patch to work with Elena's 'SupporteVector' tests and moved the shift lowering code into LowerScalarImmediateShift directly.

delena added inline comments. Jul 4 2015, 11:27 PM

test/CodeGen/X86/vector-shift-ashr-128.ll
988 ↗	(On Diff #29008)	Hi Simon, I think that the result here will be incorrect. Let's take a positive 64-bit number (2^34)-1. After the arithmetic shift-right-7 you should receive (2^27)-1. But "vpsrad" will take the source as negative 32-bit and you'll see (2^32)-1 in %xmm1 and the correct result will be after "vpsrlq" in %xmm0.
	Upper = (2^32)-1
	Lower = (2^27)-1
	Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower, {4, 1, 6, 3}); <= the result is incorrect

RKSimon added inline comments. Jul 5 2015, 7:56 AM

test/CodeGen/X86/vector-shift-ashr-128.ll
988 ↗	(On Diff #29008)	Hi Elena, taking your example (as a v1i64 for simplicity):
	in: 00000003ffffffff (ffffffff 00000003) as v2i32
	upper: ashr32(in, 7): 00000000ffffffff (ffffffff 00000000), 2nd 32-bit lane used
	lower: lshr64(in, 7): 0000000007ffffff (07ffffff 00000000), 1st 32-bit lane used
	shuffle32(ashr32, lshr64, 4,1,3,2): 0000000007ffffff (07ffffff 00000000)
Which I believe is correct, no?

delena added inline comments. Jul 5 2015, 11:55 PM

test/CodeGen/X86/vector-shift-ashr-128.ll
988 ↗	(On Diff #29008)	Yes, you are right. I checked again and I don't see any problem. In my opinion, you can commit this code.
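The lane arithmetic in this thread can be checked with a scalar model. The following is a minimal sketch, not the LLVM lowering itself: it treats one i64 element as two 32-bit lanes and mirrors the `ShiftAmt < 32` path (a 32-bit arithmetic shift supplies the upper lane, a 64-bit logical shift supplies the lower lane, and the two are then combined). The helper names `ashr32`, `sra64_split`, and `sra64_reference` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Arithmetic shift right of one 32-bit lane (models psrad on that lane),
// written on unsigned values so the behaviour is fully defined.
static uint32_t ashr32(uint32_t lane, unsigned amt) {
  uint32_t shifted = lane >> amt;
  if ((lane & 0x80000000u) && amt != 0)
    shifted |= ~(0xFFFFFFFFu >> amt); // fill vacated bits with the sign bit
  return shifted;
}

// Model of the ShiftAmt < 32 path for a single i64 element.
static uint64_t sra64_split(uint64_t in, unsigned amt) {
  assert(amt < 32);
  uint32_t hi = static_cast<uint32_t>(in >> 32);
  uint32_t upper = ashr32(hi, amt);                  // psrad: upper lane is kept
  uint32_t lower = static_cast<uint32_t>(in >> amt); // psrlq: lower lane is kept
  return (static_cast<uint64_t>(upper) << 32) | lower;
}

// Reference 64-bit arithmetic shift right.
static uint64_t sra64_reference(uint64_t in, unsigned amt) {
  uint64_t shifted = in >> amt;
  if ((in >> 63) && amt != 0)
    shifted |= ~(~0ull >> amt);
  return shifted;
}
```

With the value from the thread, `sra64_split(0x3FFFFFFFFull, 7)` gives `0x7FFFFFF`, that is (2^27)-1. The (2^32)-1 that `psrad` produces sits in the lower lane of the `psrad` result, and that lane is not selected.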

Closed by commit rL241514: [X86][SSE] Vectorized i64 uniform constant SRA shifts (authored by RKSimon). Jul 6 2015, 3:35 PM

This revision was automatically updated to reflect the committed changes.

Thanks Elena

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp	33 lines
lib/Target/X86/X86TargetTransformInfo.cpp	29 lines
test/Analysis/CostModel/X86/testshiftashr.ll	32 lines
test/CodeGen/X86/ (file names not preserved)	49 lines, 300 lines, 5 lines, 7 lines

Diff 25433

lib/Target/X86/X86ISelLowering.cpp

[1,010 lines not shown]	if (Subtarget->hasSSE2()) {
// In the customized shift lowering, the legal cases in AVX2 will be		// In the customized shift lowering, the legal cases in AVX2 will be
// recognized.		// recognized.
setOperationAction(ISD::SRL, MVT::v2i64, Custom);		setOperationAction(ISD::SRL, MVT::v2i64, Custom);
setOperationAction(ISD::SRL, MVT::v4i32, Custom);		setOperationAction(ISD::SRL, MVT::v4i32, Custom);

setOperationAction(ISD::SHL, MVT::v2i64, Custom);		setOperationAction(ISD::SHL, MVT::v2i64, Custom);
setOperationAction(ISD::SHL, MVT::v4i32, Custom);		setOperationAction(ISD::SHL, MVT::v4i32, Custom);

		setOperationAction(ISD::SRA, MVT::v2i64, Custom);
setOperationAction(ISD::SRA, MVT::v4i32, Custom);		setOperationAction(ISD::SRA, MVT::v4i32, Custom);
}		}

if (!TM.Options.UseSoftFloat && Subtarget->hasFp256()) {		if (!TM.Options.UseSoftFloat && Subtarget->hasFp256()) {
addRegisterClass(MVT::v32i8, &X86::VR256RegClass);		addRegisterClass(MVT::v32i8, &X86::VR256RegClass);
addRegisterClass(MVT::v16i16, &X86::VR256RegClass);		addRegisterClass(MVT::v16i16, &X86::VR256RegClass);
addRegisterClass(MVT::v8i32, &X86::VR256RegClass);		addRegisterClass(MVT::v8i32, &X86::VR256RegClass);
addRegisterClass(MVT::v8f32, &X86::VR256RegClass);		addRegisterClass(MVT::v8f32, &X86::VR256RegClass);
[13,682 lines not shown]	case X86ISD::VSRAI:
Elts.push_back(DAG.getConstant(C.ashr(ShiftAmt), dl, ElementType));		Elts.push_back(DAG.getConstant(C.ashr(ShiftAmt), dl, ElementType));
}		}
break;		break;
}		}

return DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Elts);		return DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Elts);
}		}

		// pre-AVX512 i64 SRA needs to be performed as partial shifts.
		const X86Subtarget &Subtarget =
		static_cast<const X86Subtarget &>(DAG.getSubtarget());
		if (VT == MVT::v2i64 && Opc == X86ISD::VSRAI && !Subtarget.hasAVX512()) {
		MVT ExVT = MVT::getVectorVT(MVT::i32, VT.getVectorNumElements() * 2);
		SDValue Ex = DAG.getNode(ISD::BITCAST, dl, ExVT, SrcOp);

		if (ShiftAmt >= 32) {
		// Splat sign to upper i32 dst, and SRA upper i32 src to lower i32.
		SDValue Upper =
		getTargetVShiftByConstNode(X86ISD::VSRAI, dl, ExVT, Ex, 31, DAG);
		SDValue Lower = getTargetVShiftByConstNode(X86ISD::VSRAI, dl, ExVT, Ex,
		ShiftAmt - 32, DAG);
		Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower, {5, 1, 7, 3});
		} else {
		// SRA upper i32, SRL whole i64 and select lower i32.
		SDValue Upper = getTargetVShiftByConstNode(X86ISD::VSRAI, dl, ExVT, Ex,
		ShiftAmt, DAG);
		SDValue Lower = getTargetVShiftByConstNode(X86ISD::VSRLI, dl, VT, SrcOp,
		ShiftAmt, DAG);
		Lower = DAG.getNode(ISD::BITCAST, dl, ExVT, Lower);
		Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower, {4, 1, 6, 3});
		}
		return DAG.getNode(ISD::BITCAST, dl, VT, Ex);
		}

return DAG.getNode(Opc, dl, VT, SrcOp,		return DAG.getNode(Opc, dl, VT, SrcOp,
DAG.getConstant(ShiftAmt, dl, MVT::i8));		DAG.getConstant(ShiftAmt, dl, MVT::i8));
}		}

// getTargetVShiftNode - Handle vector element shifts where the shift amount		// getTargetVShiftNode - Handle vector element shifts where the shift amount
// may or may not be a constant. Takes immediate version of shift as input.		// may or may not be a constant. Takes immediate version of shift as input.
static SDValue getTargetVShiftNode(unsigned Opc, SDLoc dl, MVT VT,		static SDValue getTargetVShiftNode(unsigned Opc, SDLoc dl, MVT VT,
SDValue SrcOp, SDValue ShAmt,		SDValue SrcOp, SDValue ShAmt,
[1,596 lines not shown]	if (auto *ShiftConst = BVAmt->getConstantSplatNode()) {
(Subtarget->hasAVX512() &&		(Subtarget->hasAVX512() &&
(VT == MVT::v8i64 || VT == MVT::v16i32))) {		(VT == MVT::v8i64 || VT == MVT::v16i32))) {
if (Op.getOpcode() == ISD::SHL)		if (Op.getOpcode() == ISD::SHL)
return getTargetVShiftByConstNode(X86ISD::VSHLI, dl, VT, R, ShiftAmt,		return getTargetVShiftByConstNode(X86ISD::VSHLI, dl, VT, R, ShiftAmt,
DAG);		DAG);
if (Op.getOpcode() == ISD::SRL)		if (Op.getOpcode() == ISD::SRL)
return getTargetVShiftByConstNode(X86ISD::VSRLI, dl, VT, R, ShiftAmt,		return getTargetVShiftByConstNode(X86ISD::VSRLI, dl, VT, R, ShiftAmt,
DAG);		DAG);
if (Op.getOpcode() == ISD::SRA && VT != MVT::v2i64 && VT != MVT::v4i64)		if (Op.getOpcode() == ISD::SRA && VT != MVT::v4i64)
return getTargetVShiftByConstNode(X86ISD::VSRAI, dl, VT, R, ShiftAmt,		return getTargetVShiftByConstNode(X86ISD::VSRAI, dl, VT, R, ShiftAmt,
DAG);		DAG);
}		}

if (VT == MVT::v16i8 || (Subtarget->hasInt256() && VT == MVT::v32i8)) {		if (VT == MVT::v16i8 || (Subtarget->hasInt256() && VT == MVT::v32i8)) {
unsigned NumElts = VT.getVectorNumElements();		unsigned NumElts = VT.getVectorNumElements();
MVT ShiftVT = MVT::getVectorVT(MVT::i16, NumElts / 2);		MVT ShiftVT = MVT::getVectorVT(MVT::i16, NumElts / 2);

[186 lines not shown]	if (BaseShAmt.getNode()) {
case MVT::v8i64:		case MVT::v8i64:
return getTargetVShiftNode(X86ISD::VSRLI, dl, VT, R, BaseShAmt, DAG);		return getTargetVShiftNode(X86ISD::VSRLI, dl, VT, R, BaseShAmt, DAG);
}		}
}		}
}		}
}		}

// Special case in 32-bit mode, where i64 is expanded into high and low parts.		// Special case in 32-bit mode, where i64 is expanded into high and low parts.
		// FIXME: AVX512 can support i64 SRA by scalar variable.
if (!Subtarget->is64Bit() && VT == MVT::v2i64 &&		if (!Subtarget->is64Bit() && VT == MVT::v2i64 &&
		Op.getOpcode() != ISD::SRA &&
		delena: I think that I fixed a bug here and removed AVX512. Could you please check?
		RKSimon: Yes, I'll add a proper AVX512 test.
Amt.getOpcode() == ISD::BITCAST &&		Amt.getOpcode() == ISD::BITCAST &&
Amt.getOperand(0).getOpcode() == ISD::BUILD_VECTOR) {		Amt.getOperand(0).getOpcode() == ISD::BUILD_VECTOR) {
Amt = Amt.getOperand(0);		Amt = Amt.getOperand(0);
unsigned Ratio = Amt.getSimpleValueType().getVectorNumElements() /		unsigned Ratio = Amt.getSimpleValueType().getVectorNumElements() /
VT.getVectorNumElements();		VT.getVectorNumElements();
std::vector<SDValue> Vals(Ratio);		std::vector<SDValue> Vals(Ratio);
for (unsigned i = 0; i != Ratio; ++i)		for (unsigned i = 0; i != Ratio; ++i)
Vals[i] = Amt.getOperand(i);		Vals[i] = Amt.getOperand(i);
[8,564 lines not shown]
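For shift amounts of 32 or more, the hunk above takes the first branch: the upper i32 of each result is a splat of the sign bit (`VSRAI` by 31), and the lower i32 is the source's upper i32 shifted arithmetically by `ShiftAmt - 32`. The shuffle mask `{5, 1, 7, 3}` then picks the odd lanes of both intermediate values. A minimal scalar sketch of that branch for one i64 element follows; it is a model for checking the arithmetic, not the DAG code, and the helper names `ashr32` and `sra64_ge32` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Arithmetic shift right of one 32-bit lane, defined on unsigned values.
static uint32_t ashr32(uint32_t lane, unsigned amt) {
  uint32_t shifted = lane >> amt;
  if ((lane & 0x80000000u) && amt != 0)
    shifted |= ~(0xFFFFFFFFu >> amt); // fill vacated bits with the sign bit
  return shifted;
}

// Model of the ShiftAmt >= 32 path for a single i64 element.
static uint64_t sra64_ge32(uint64_t in, unsigned amt) {
  assert(amt >= 32 && amt < 64);
  uint32_t hi = static_cast<uint32_t>(in >> 32);
  uint32_t upper = ashr32(hi, 31);       // all zeros or all ones (sign splat)
  uint32_t lower = ashr32(hi, amt - 32); // old upper lane moves to the lower lane
  return (static_cast<uint64_t>(upper) << 32) | lower;
}
```

The original lower 32 bits never contribute here, because a shift of 32 or more moves them out of the result entirely.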

lib/Target/X86/X86TargetTransformInfo.cpp

[144 lines not shown]	static const CostTblEntry<MVT::SimpleValueType> AVX2CostTable[] = {
{ ISD::SHL, MVT::v4i32, 1 },		{ ISD::SHL, MVT::v4i32, 1 },
{ ISD::SRL, MVT::v4i32, 1 },		{ ISD::SRL, MVT::v4i32, 1 },
{ ISD::SRA, MVT::v4i32, 1 },		{ ISD::SRA, MVT::v4i32, 1 },
{ ISD::SHL, MVT::v8i32, 1 },		{ ISD::SHL, MVT::v8i32, 1 },
{ ISD::SRL, MVT::v8i32, 1 },		{ ISD::SRL, MVT::v8i32, 1 },
{ ISD::SRA, MVT::v8i32, 1 },		{ ISD::SRA, MVT::v8i32, 1 },
{ ISD::SHL, MVT::v2i64, 1 },		{ ISD::SHL, MVT::v2i64, 1 },
{ ISD::SRL, MVT::v2i64, 1 },		{ ISD::SRL, MVT::v2i64, 1 },
{ ISD::SHL, MVT::v4i64, 1 },		{ ISD::SHL, MVT::v4i64, 1 },
{ ISD::SRL, MVT::v4i64, 1 },		{ ISD::SRL, MVT::v4i64, 1 },

{ ISD::SHL, MVT::v32i8, 42 }, // cmpeqb sequence.		{ ISD::SHL, MVT::v32i8, 42 }, // cmpeqb sequence.
{ ISD::SHL, MVT::v16i16, 16*10 }, // Scalarized.		{ ISD::SHL, MVT::v16i16, 16*10 }, // Scalarized.

{ ISD::SRL, MVT::v32i8, 32*10 }, // Scalarized.		{ ISD::SRL, MVT::v32i8, 32*10 }, // Scalarized.
{ ISD::SRL, MVT::v16i16, 8*10 }, // Scalarized.		{ ISD::SRL, MVT::v16i16, 8*10 }, // Scalarized.

{ ISD::SRA, MVT::v32i8, 32*10 }, // Scalarized.		{ ISD::SRA, MVT::v32i8, 32*10 }, // Scalarized.
[40 lines not shown]	SSE2UniformConstCostTable[] = {
{ ISD::SHL, MVT::v4i32, 1 }, // pslld		{ ISD::SHL, MVT::v4i32, 1 }, // pslld
{ ISD::SHL, MVT::v2i64, 1 }, // psllq.		{ ISD::SHL, MVT::v2i64, 1 }, // psllq.

{ ISD::SRL, MVT::v16i8, 1 }, // psrlw.		{ ISD::SRL, MVT::v16i8, 1 }, // psrlw.
{ ISD::SRL, MVT::v8i16, 1 }, // psrlw.		{ ISD::SRL, MVT::v8i16, 1 }, // psrlw.
{ ISD::SRL, MVT::v4i32, 1 }, // psrld.		{ ISD::SRL, MVT::v4i32, 1 }, // psrld.
{ ISD::SRL, MVT::v2i64, 1 }, // psrlq.		{ ISD::SRL, MVT::v2i64, 1 }, // psrlq.

{ ISD::SRA, MVT::v16i8, 4 }, // psrlw, pand, pxor, psubb.		{ ISD::SRA, MVT::v16i8, 4 }, // psrlw, pand, pxor, psubb.
{ ISD::SRA, MVT::v8i16, 1 }, // psraw.		{ ISD::SRA, MVT::v8i16, 1 }, // psraw.
{ ISD::SRA, MVT::v4i32, 1 }, // psrad.		{ ISD::SRA, MVT::v4i32, 1 }, // psrad.
		{ ISD::SRA, MVT::v2i64, 4 }, // 2 x psrad + shuffle.
{ ISD::SDIV, MVT::v8i16, 6 }, // pmulhw sequence
{ ISD::UDIV, MVT::v8i16, 6 }, // pmulhuw sequence		{ ISD::SDIV, MVT::v8i16, 6 }, // pmulhw sequence
		{ ISD::UDIV, MVT::v8i16, 6 }, // pmulhuw sequence
{ ISD::SDIV, MVT::v4i32, 19 }, // pmuludq sequence		{ ISD::SDIV, MVT::v4i32, 19 }, // pmuludq sequence
{ ISD::UDIV, MVT::v4i32, 15 }, // pmuludq sequence		{ ISD::UDIV, MVT::v4i32, 15 }, // pmuludq sequence
};		};

if (Op2Info == TargetTransformInfo::OK_UniformConstantValue &&		if (Op2Info == TargetTransformInfo::OK_UniformConstantValue &&
ST->hasSSE2()) {		ST->hasSSE2()) {
// pmuldq sequence.		// pmuldq sequence.
if (ISD == ISD::SDIV && LT.second == MVT::v4i32 && ST->hasSSE41())		if (ISD == ISD::SDIV && LT.second == MVT::v4i32 && ST->hasSSE41())
[20 lines not shown]	unsigned X86TTIImpl::getArithmeticInstrCost(
}		}

static const CostTblEntry<MVT::SimpleValueType> SSE2CostTable[] = {		static const CostTblEntry<MVT::SimpleValueType> SSE2CostTable[] = {
// We don't correctly identify costs of casts because they are marked as		// We don't correctly identify costs of casts because they are marked as
// custom.		// custom.
// For some cases, where the shift amount is a scalar we would be able		// For some cases, where the shift amount is a scalar we would be able
// to generate better code. Unfortunately, when this is the case the value		// to generate better code. Unfortunately, when this is the case the value
// (the splat) will get hoisted out of the loop, thereby making it invisible		// (the splat) will get hoisted out of the loop, thereby making it invisible
// to ISel. The cost model must return worst case assumptions because it is		// to ISel. The cost model must return worst case assumptions because it is
// used for vectorization and we don't want to make vectorized code worse		// used for vectorization and we don't want to make vectorized code worse
// than scalar code.		// than scalar code.
{ ISD::SHL, MVT::v16i8, 30 }, // cmpeqb sequence.		{ ISD::SHL, MVT::v16i8, 30 }, // cmpeqb sequence.
{ ISD::SHL, MVT::v8i16, 8*10 }, // Scalarized.		{ ISD::SHL, MVT::v8i16, 8*10 }, // Scalarized.
{ ISD::SHL, MVT::v4i32, 2*5 }, // We optimized this using mul.		{ ISD::SHL, MVT::v4i32, 2*5 }, // We optimized this using mul.
{ ISD::SHL, MVT::v2i64, 2*10 }, // Scalarized.		{ ISD::SHL, MVT::v2i64, 2*10 }, // Scalarized.
{ ISD::SHL, MVT::v4i64, 4*10 }, // Scalarized.		{ ISD::SHL, MVT::v4i64, 4*10 }, // Scalarized.

{ ISD::SRL, MVT::v16i8, 16*10 }, // Scalarized.		{ ISD::SRL, MVT::v16i8, 16*10 }, // Scalarized.
{ ISD::SRL, MVT::v8i16, 8*10 }, // Scalarized.		{ ISD::SRL, MVT::v8i16, 8*10 }, // Scalarized.
{ ISD::SRL, MVT::v4i32, 4*10 }, // Scalarized.		{ ISD::SRL, MVT::v4i32, 4*10 }, // Scalarized.
{ ISD::SRL, MVT::v2i64, 2*10 }, // Scalarized.		{ ISD::SRL, MVT::v2i64, 2*10 }, // Scalarized.

{ ISD::SRA, MVT::v16i8, 16*10 }, // Scalarized.		{ ISD::SRA, MVT::v16i8, 16*10 }, // Scalarized.
{ ISD::SRA, MVT::v8i16, 8*10 }, // Scalarized.		{ ISD::SRA, MVT::v8i16, 8*10 }, // Scalarized.
{ ISD::SRA, MVT::v4i32, 4*10 }, // Scalarized.		{ ISD::SRA, MVT::v4i32, 4*10 }, // Scalarized.
{ ISD::SRA, MVT::v2i64, 2*10 }, // Scalarized.		{ ISD::SRA, MVT::v2i64, 2*10 }, // Scalarized.

// It is not a good idea to vectorize division. We have to scalarize it and		// It is not a good idea to vectorize division. We have to scalarize it and
// in the process we will often end up having to spill regular		// in the process we will often end up having to spill regular
// registers. The overhead of division is going to dominate most kernels		// registers. The overhead of division is going to dominate most kernels
// anyways so try hard to prevent vectorization of division - it is		// anyways so try hard to prevent vectorization of division - it is
// generally a bad idea. Assume somewhat arbitrarily that we have to be able		// generally a bad idea. Assume somewhat arbitrarily that we have to be able
// to hide "20 cycles" for each lane.		// to hide "20 cycles" for each lane.
{ ISD::SDIV, MVT::v16i8, 16*20 },		{ ISD::SDIV, MVT::v16i8, 16*20 },
{ ISD::SDIV, MVT::v8i16, 8*20 },		{ ISD::SDIV, MVT::v8i16, 8*20 },
{ ISD::SDIV, MVT::v4i32, 4*20 },		{ ISD::SDIV, MVT::v4i32, 4*20 },
{ ISD::SDIV, MVT::v2i64, 2*20 },		{ ISD::SDIV, MVT::v2i64, 2*20 },
[851 lines not shown]

test/Analysis/CostModel/X86/testshiftashr.ll

[241 lines not shown]
}		}

; Test shift by a constant value.		; Test shift by a constant value.

%shifttypec = type <2 x i16>		%shifttypec = type <2 x i16>
define %shifttypec @shift2i16const(%shifttypec %a, %shifttypec %b) {		define %shifttypec @shift2i16const(%shifttypec %a, %shifttypec %b) {
entry:		entry:
; SSE2: shift2i16const		; SSE2: shift2i16const
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 4 {{.*}} ashr
; SSE2-CODEGEN: shift2i16const		; SSE2-CODEGEN: shift2i16const
; SSE2-CODEGEN: sarq $		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec %a , <i16 3, i16 3>		%0 = ashr %shifttypec %a , <i16 3, i16 3>
ret %shifttypec %0		ret %shifttypec %0
}		}

%shifttypec4i16 = type <4 x i16>		%shifttypec4i16 = type <4 x i16>
define %shifttypec4i16 @shift4i16const(%shifttypec4i16 %a, %shifttypec4i16 %b) {		define %shifttypec4i16 @shift4i16const(%shifttypec4i16 %a, %shifttypec4i16 %b) {
entry:		entry:
[54 lines not shown]	%0 = ashr %shifttypec32i16 %a , <i16 3, i16 3, i16 3, i16 3,
i16 3, i16 3, i16 3, i16 3>		i16 3, i16 3, i16 3, i16 3>
ret %shifttypec32i16 %0		ret %shifttypec32i16 %0
}		}

%shifttypec2i32 = type <2 x i32>		%shifttypec2i32 = type <2 x i32>
define %shifttypec2i32 @shift2i32c(%shifttypec2i32 %a, %shifttypec2i32 %b) {		define %shifttypec2i32 @shift2i32c(%shifttypec2i32 %a, %shifttypec2i32 %b) {
entry:		entry:
; SSE2: shift2i32c		; SSE2: shift2i32c
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 4 {{.*}} ashr
; SSE2-CODEGEN: shift2i32c		; SSE2-CODEGEN: shift2i32c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec2i32 %a , <i32 3, i32 3>		%0 = ashr %shifttypec2i32 %a , <i32 3, i32 3>
ret %shifttypec2i32 %0		ret %shifttypec2i32 %0
}		}

%shifttypec4i32 = type <4 x i32>		%shifttypec4i32 = type <4 x i32>
define %shifttypec4i32 @shift4i32c(%shifttypec4i32 %a, %shifttypec4i32 %b) {		define %shifttypec4i32 @shift4i32c(%shifttypec4i32 %a, %shifttypec4i32 %b) {
entry:		entry:
[52 lines not shown]	%0 = ashr %shifttypec32i32 %a , <i32 3, i32 3, i32 3, i32 3,
i32 3, i32 3, i32 3, i32 3>		i32 3, i32 3, i32 3, i32 3>
ret %shifttypec32i32 %0		ret %shifttypec32i32 %0
}		}

%shifttypec2i64 = type <2 x i64>		%shifttypec2i64 = type <2 x i64>
define %shifttypec2i64 @shift2i64c(%shifttypec2i64 %a, %shifttypec2i64 %b) {		define %shifttypec2i64 @shift2i64c(%shifttypec2i64 %a, %shifttypec2i64 %b) {
entry:		entry:
; SSE2: shift2i64c		; SSE2: shift2i64c
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 4 {{.*}} ashr
; SSE2-CODEGEN: shift2i64c		; SSE2-CODEGEN: shift2i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec2i64 %a , <i64 3, i64 3>		%0 = ashr %shifttypec2i64 %a , <i64 3, i64 3>
ret %shifttypec2i64 %0		ret %shifttypec2i64 %0
}		}

%shifttypec4i64 = type <4 x i64>		%shifttypec4i64 = type <4 x i64>
define %shifttypec4i64 @shift4i64c(%shifttypec4i64 %a, %shifttypec4i64 %b) {		define %shifttypec4i64 @shift4i64c(%shifttypec4i64 %a, %shifttypec4i64 %b) {
entry:		entry:
; SSE2: shift4i64c		; SSE2: shift4i64c
; SSE2: cost of 40 {{.*}} ashr		; SSE2: cost of 8 {{.*}} ashr
; SSE2-CODEGEN: shift4i64c		; SSE2-CODEGEN: shift4i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec4i64 %a , <i64 3, i64 3, i64 3, i64 3>		%0 = ashr %shifttypec4i64 %a , <i64 3, i64 3, i64 3, i64 3>
ret %shifttypec4i64 %0		ret %shifttypec4i64 %0
}		}

%shifttypec8i64 = type <8 x i64>		%shifttypec8i64 = type <8 x i64>
define %shifttypec8i64 @shift8i64c(%shifttypec8i64 %a, %shifttypec8i64 %b) {		define %shifttypec8i64 @shift8i64c(%shifttypec8i64 %a, %shifttypec8i64 %b) {
entry:		entry:
; SSE2: shift8i64c		; SSE2: shift8i64c
; SSE2: cost of 80 {{.*}} ashr		; SSE2: cost of 16 {{.*}} ashr
; SSE2-CODEGEN: shift8i64c		; SSE2-CODEGEN: shift8i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec8i64 %a , <i64 3, i64 3, i64 3, i64 3,		%0 = ashr %shifttypec8i64 %a , <i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3>		i64 3, i64 3, i64 3, i64 3>
ret %shifttypec8i64 %0		ret %shifttypec8i64 %0
}		}

%shifttypec16i64 = type <16 x i64>		%shifttypec16i64 = type <16 x i64>
define %shifttypec16i64 @shift16i64c(%shifttypec16i64 %a, %shifttypec16i64 %b) {		define %shifttypec16i64 @shift16i64c(%shifttypec16i64 %a, %shifttypec16i64 %b) {
entry:		entry:
; SSE2: shift16i64c		; SSE2: shift16i64c
; SSE2: cost of 160 {{.*}} ashr		; SSE2: cost of 32 {{.*}} ashr
; SSE2-CODEGEN: shift16i64c		; SSE2-CODEGEN: shift16i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec16i64 %a , <i64 3, i64 3, i64 3, i64 3,		%0 = ashr %shifttypec16i64 %a , <i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3>		i64 3, i64 3, i64 3, i64 3>
ret %shifttypec16i64 %0		ret %shifttypec16i64 %0
}		}

%shifttypec32i64 = type <32 x i64>		%shifttypec32i64 = type <32 x i64>
define %shifttypec32i64 @shift32i64c(%shifttypec32i64 %a, %shifttypec32i64 %b) {		define %shifttypec32i64 @shift32i64c(%shifttypec32i64 %a, %shifttypec32i64 %b) {
entry:		entry:
; SSE2: shift32i64c		; SSE2: shift32i64c
; SSE2: cost of 320 {{.*}} ashr		; SSE2: cost of 64 {{.*}} ashr
; SSE2-CODEGEN: shift32i64c		; SSE2-CODEGEN: shift32i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec32i64 %a ,<i64 3, i64 3, i64 3, i64 3,		%0 = ashr %shifttypec32i64 %a ,<i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3>		i64 3, i64 3, i64 3, i64 3>
ret %shifttypec32i64 %0		ret %shifttypec32i64 %0
}		}

%shifttypec2i8 = type <2 x i8>		%shifttypec2i8 = type <2 x i8>
define %shifttypec2i8 @shift2i8c(%shifttypec2i8 %a, %shifttypec2i8 %b) {		define %shifttypec2i8 @shift2i8c(%shifttypec2i8 %a, %shifttypec2i8 %b) {
entry:		entry:
; SSE2: shift2i8c		; SSE2: shift2i8c
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 4 {{.*}} ashr
; SSE2-CODEGEN: shift2i8c		; SSE2-CODEGEN: shift2i8c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec2i8 %a , <i8 3, i8 3>		%0 = ashr %shifttypec2i8 %a , <i8 3, i8 3>
ret %shifttypec2i8 %0		ret %shifttypec2i8 %0
}		}

%shifttypec4i8 = type <4 x i8>		%shifttypec4i8 = type <4 x i8>
define %shifttypec4i8 @shift4i8c(%shifttypec4i8 %a, %shifttypec4i8 %b) {		define %shifttypec4i8 @shift4i8c(%shifttypec4i8 %a, %shifttypec4i8 %b) {
entry:		entry:
[56 lines not shown]
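The updated expectations in this test follow a simple pattern: with the new SSE2 uniform-constant table entry of 4 for `ISD::SRA` on `v2i64`, each wider i64 vector appears to cost 4 per `v2i64` register it is split into (4, 8, 16, 32, 64 for 2 to 32 elements). The sketch below only reproduces that arithmetic. The names `kSraV2i64Cost` and `sse2UniformConstSraCost` are hypothetical, and the split-and-multiply rule is an inference from the test values rather than a quote of the cost-model code.

```cpp
#include <cassert>

// Per-register cost taken from the new SSE2UniformConstCostTable entry
// { ISD::SRA, MVT::v2i64, 4 }.
constexpr unsigned kSraV2i64Cost = 4;

// Hypothetical helper: expected cost of a uniform-constant ashr on
// <numElts x i64>, assuming the vector is split into v2i64 registers
// and each register pays the table cost.
constexpr unsigned sse2UniformConstSraCost(unsigned numElts) {
  return ((numElts + 1) / 2) * kSraV2i64Cost;
}
```

Under the same assumption, the previous expectations (20, 40, 80, ...) correspond to the scalarized `2*10` entry per `v2i64` register.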

test/CodeGen/X86/vec_int_to_fp.ll

[29 lines not shown]	; AVX-NEXT: retq
%cvt = sitofp <2 x i64> %a to <2 x double>		%cvt = sitofp <2 x i64> %a to <2 x double>
ret <2 x double> %cvt		ret <2 x double> %cvt
}		}

define <2 x double> @sitofp_2vf64_i32(<4 x i32> %a) {		define <2 x double> @sitofp_2vf64_i32(<4 x i32> %a) {
; SSE2-LABEL: sitofp_2vf64_i32:		; SSE2-LABEL: sitofp_2vf64_i32:
; SSE2: # BB#0:		; SSE2: # BB#0:
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]		; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]		; SSE2-NEXT: psllq $32, %xmm0
		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,3,2,3]
		; SSE2-NEXT: psrad $31, %xmm0
		; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
		; SSE2-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
; SSE2-NEXT: movd %xmm1, %rax		; SSE2-NEXT: movd %xmm1, %rax
; SSE2-NEXT: cltq
; SSE2-NEXT: movd %xmm0, %rcx
; SSE2-NEXT: movslq %ecx, %rcx
; SSE2-NEXT: xorps %xmm0, %xmm0		; SSE2-NEXT: xorps %xmm0, %xmm0
; SSE2-NEXT: cvtsi2sdq %rcx, %xmm0		; SSE2-NEXT: cvtsi2sdq %rax, %xmm0
		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]
		; SSE2-NEXT: movd %xmm1, %rax
; SSE2-NEXT: xorps %xmm1, %xmm1		; SSE2-NEXT: xorps %xmm1, %xmm1
; SSE2-NEXT: cvtsi2sdq %rax, %xmm1		; SSE2-NEXT: cvtsi2sdq %rax, %xmm1
; SSE2-NEXT: unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]		; SSE2-NEXT: unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; SSE2-NEXT: retq		; SSE2-NEXT: retq
;		;
; AVX-LABEL: sitofp_2vf64_i32:		; AVX-LABEL: sitofp_2vf64_i32:
; AVX: # BB#0:		; AVX: # BB#0:
; AVX-NEXT: vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero		; AVX-NEXT: vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
		; AVX-NEXT: vpsllq $32, %xmm0, %xmm0
		; AVX-NEXT: vpsrad $31, %xmm0, %xmm1
		; AVX-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
		; AVX-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
		; AVX-NEXT: vpextrq $1, %xmm0, %rax
		; AVX-NEXT: vcvtsi2sdq %rax, %xmm0, %xmm1
; AVX-NEXT: vmovq %xmm0, %rax		; AVX-NEXT: vmovq %xmm0, %rax
; AVX-NEXT: cltq
; AVX-NEXT: vpextrq $1, %xmm0, %rcx
; AVX-NEXT: movslq %ecx, %rcx
; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0		; AVX-NEXT: vxorps %xmm0, %xmm0, %xmm0
; AVX-NEXT: vcvtsi2sdq %rcx, %xmm0, %xmm0		; AVX-NEXT: vcvtsi2sdq %rax, %xmm0, %xmm0
; AVX-NEXT: vcvtsi2sdq %rax, %xmm0, %xmm1		; AVX-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; AVX-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm1[0],xmm0[0]
; AVX-NEXT: retq		; AVX-NEXT: retq
%shuf = shufflevector <4 x i32> %a, <4 x i32> undef, <2 x i32> <i32 0, i32 1>		%shuf = shufflevector <4 x i32> %a, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
%cvt = sitofp <2 x i32> %shuf to <2 x double>		%cvt = sitofp <2 x i32> %shuf to <2 x double>
ret <2 x double> %cvt		ret <2 x double> %cvt
}		}

define <4 x double> @sitofp_4vf64(<4 x i64> %a) {		define <4 x double> @sitofp_4vf64(<4 x i64> %a) {
; SSE2-LABEL: sitofp_4vf64:		; SSE2-LABEL: sitofp_4vf64:
[35 lines not shown]	; AVX-NEXT: retq
%cvt = sitofp <4 x i64> %a to <4 x double>		%cvt = sitofp <4 x i64> %a to <4 x double>
ret <4 x double> %cvt		ret <4 x double> %cvt
}		}

define <4 x double> @sitofp_4vf64_i32(<4 x i32> %a) {		define <4 x double> @sitofp_4vf64_i32(<4 x i32> %a) {
; SSE2-LABEL: sitofp_4vf64_i32:		; SSE2-LABEL: sitofp_4vf64_i32:
; SSE2: # BB#0:		; SSE2: # BB#0:
; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]
; SSE2-NEXT: movd %xmm1, %rax		; SSE2-NEXT: psllq $32, %xmm1
; SSE2-NEXT: cltq		; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm1[1,3,2,3]
		; SSE2-NEXT: psrad $31, %xmm1
		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
		; SSE2-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm1[0],xmm3[1],xmm1[1]
		; SSE2-NEXT: movd %xmm3, %rax
; SSE2-NEXT: cvtsi2sdq %rax, %xmm2		; SSE2-NEXT: cvtsi2sdq %rax, %xmm2
; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm3[2,3,0,1]
; SSE2-NEXT: movd %xmm1, %rax		; SSE2-NEXT: movd %xmm1, %rax
; SSE2-NEXT: cltq
; SSE2-NEXT: xorps %xmm1, %xmm1		; SSE2-NEXT: xorps %xmm1, %xmm1
; SSE2-NEXT: cvtsi2sdq %rax, %xmm1		; SSE2-NEXT: cvtsi2sdq %rax, %xmm1
; SSE2-NEXT: unpcklpd {{.*#+}} xmm2 = xmm2[0],xmm1[0]		; SSE2-NEXT: unpcklpd {{.*#+}} xmm2 = xmm2[0],xmm1[0]
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]		; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]
; SSE2-NEXT: movd %xmm0, %rax		; SSE2-NEXT: psllq $32, %xmm0
; SSE2-NEXT: cltq		; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm0[1,3,2,3]
		; SSE2-NEXT: psrad $31, %xmm0
		; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
		; SSE2-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1]
		; SSE2-NEXT: movd %xmm3, %rax
; SSE2-NEXT: xorps %xmm1, %xmm1		; SSE2-NEXT: xorps %xmm1, %xmm1
; SSE2-NEXT: cvtsi2sdq %rax, %xmm1		; SSE2-NEXT: cvtsi2sdq %rax, %xmm1
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm3[2,3,0,1]
; SSE2-NEXT: movd %xmm0, %rax		; SSE2-NEXT: movd %xmm0, %rax
; SSE2-NEXT: cltq
; SSE2-NEXT: xorps %xmm0, %xmm0		; SSE2-NEXT: xorps %xmm0, %xmm0
; SSE2-NEXT: cvtsi2sdq %rax, %xmm0		; SSE2-NEXT: cvtsi2sdq %rax, %xmm0
; SSE2-NEXT: unpcklpd {{.*#+}} xmm1 = xmm1[0],xmm0[0]		; SSE2-NEXT: unpcklpd {{.*#+}} xmm1 = xmm1[0],xmm0[0]
; SSE2-NEXT: movapd %xmm2, %xmm0		; SSE2-NEXT: movapd %xmm2, %xmm0
; SSE2-NEXT: retq		; SSE2-NEXT: retq
;		;
; AVX-LABEL: sitofp_4vf64_i32:		; AVX-LABEL: sitofp_4vf64_i32:
; AVX: # BB#0:		; AVX: # BB#0:
[578 lines not shown]

test/CodeGen/X86/vector-sext.ll

[69 lines not shown]	entry:
%B = sext <8 x i16> %A to <8 x i32>		%B = sext <8 x i16> %A to <8 x i32>
ret <8 x i32>%B		ret <8 x i32>%B
}		}

define <4 x i64> @sext_4i32_to_4i64(<4 x i32> %A) nounwind uwtable readnone ssp {		define <4 x i64> @sext_4i32_to_4i64(<4 x i32> %A) nounwind uwtable readnone ssp {
; SSE2-LABEL: sext_4i32_to_4i64:		; SSE2-LABEL: sext_4i32_to_4i64:
; SSE2: # BB#0: # %entry		; SSE2: # BB#0: # %entry
; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]
; SSE2-NEXT: movd %xmm1, %rax		; SSE2-NEXT: psllq $32, %xmm1
; SSE2-NEXT: cltq		; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,3,2,3]
; SSE2-NEXT: movd %rax, %xmm2		; SSE2-NEXT: psrad $31, %xmm1
; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
; SSE2-NEXT: movd %xmm1, %rax		; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
; SSE2-NEXT: cltq
; SSE2-NEXT: movd %rax, %xmm1
; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]		; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]
; SSE2-NEXT: movd %xmm0, %rax		; SSE2-NEXT: psllq $32, %xmm0
; SSE2-NEXT: cltq		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,3,2,3]
; SSE2-NEXT: movd %rax, %xmm1		; SSE2-NEXT: psrad $31, %xmm0
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
; SSE2-NEXT: movd %xmm0, %rax		; SSE2-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
; SSE2-NEXT: cltq
; SSE2-NEXT: movd %rax, %xmm0
; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
; SSE2-NEXT: movdqa %xmm2, %xmm0		; SSE2-NEXT: movdqa %xmm2, %xmm0
; SSE2-NEXT: retq		; SSE2-NEXT: retq
;		;
; SSSE3-LABEL: sext_4i32_to_4i64:		; SSSE3-LABEL: sext_4i32_to_4i64:
; SSSE3: # BB#0: # %entry		; SSSE3: # BB#0: # %entry
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]		; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]
; SSSE3-NEXT: movd %xmm1, %rax		; SSSE3-NEXT: psllq $32, %xmm1
; SSSE3-NEXT: cltq		; SSSE3-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,3,2,3]
; SSSE3-NEXT: movd %rax, %xmm2		; SSSE3-NEXT: psrad $31, %xmm1
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]		; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
; SSSE3-NEXT: movd %xmm1, %rax		; SSSE3-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
; SSSE3-NEXT: cltq
; SSSE3-NEXT: movd %rax, %xmm1
; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]		; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]
; SSSE3-NEXT: movd %xmm0, %rax		; SSSE3-NEXT: psllq $32, %xmm0
; SSSE3-NEXT: cltq		; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,3,2,3]
; SSSE3-NEXT: movd %rax, %xmm1		; SSSE3-NEXT: psrad $31, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
; SSSE3-NEXT: movd %xmm0, %rax		; SSSE3-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
; SSSE3-NEXT: cltq
; SSSE3-NEXT: movd %rax, %xmm0
; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
; SSSE3-NEXT: movdqa %xmm2, %xmm0		; SSSE3-NEXT: movdqa %xmm2, %xmm0
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; SSE41-LABEL: sext_4i32_to_4i64:		; SSE41-LABEL: sext_4i32_to_4i64:
; SSE41: # BB#0: # %entry		; SSE41: # BB#0: # %entry
; SSE41-NEXT: pmovzxdq %xmm0, %xmm1		; SSE41-NEXT: pmovzxdq {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero
; SSE41-NEXT: pextrq $1, %xmm1, %rax		; SSE41-NEXT: psllq $32, %xmm2
; SSE41-NEXT: cltq		; SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm2[1,1,3,3]
; SSE41-NEXT: movd %rax, %xmm3		; SSE41-NEXT: psrad $31, %xmm2
; SSE41-NEXT: movd %xmm1, %rax		; SSE41-NEXT: pblendw {{.*#+}} xmm2 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
; SSE41-NEXT: cltq		; SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,2,3,3]
; SSE41-NEXT: movd %rax, %xmm2		; SSE41-NEXT: psllq $32, %xmm1
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]		; SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,1,3,3]
; SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]		; SSE41-NEXT: psrad $31, %xmm1
; SSE41-NEXT: pextrq $1, %xmm0, %rax		; SSE41-NEXT: pblendw {{.*#+}} xmm1 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
		delena: I see this code as two pmovsxdq instructions with one shuffle between them. For Windows, it looks like:
		    pmovsxdq (%rcx), %xmm0
		    pmovsxdq 8(%rcx), %xmm1
		    retq
		For Linux your parameter is in xmm, so you need one shuffle with <2, 3, undef, undef>.
		RKSimon (author): I have some work-in-progress patches to improve pmovsx* support. The problem is that SIGN_EXTEND_INREG / SIGN_EXTEND_VECTOR_INREG is a mess and will take a while to sort out. Making this i64 SRA patch was an easy first step, so at least we're not transferring between xmm and GPRs so much.
; SSE41-NEXT: cltq
; SSE41-NEXT: movd %rax, %xmm3
; SSE41-NEXT: movd %xmm0, %rax
; SSE41-NEXT: cltq
; SSE41-NEXT: movd %rax, %xmm1
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
; SSE41-NEXT: movdqa %xmm2, %xmm0		; SSE41-NEXT: movdqa %xmm2, %xmm0
; SSE41-NEXT: retq		; SSE41-NEXT: retq
;		;
; AVX1-LABEL: sext_4i32_to_4i64:		; AVX1-LABEL: sext_4i32_to_4i64:
; AVX1: # BB#0: # %entry		; AVX1: # BB#0: # %entry
; AVX1-NEXT: vpmovsxdq %xmm0, %xmm1		; AVX1-NEXT: vpmovsxdq %xmm0, %xmm1
; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
; AVX1-NEXT: vpmovsxdq %xmm0, %xmm0		; AVX1-NEXT: vpmovsxdq %xmm0, %xmm0
; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0		; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
; AVX1-NEXT: retq		; AVX1-NEXT: retq
;		;
; AVX2-LABEL: sext_4i32_to_4i64:		; AVX2-LABEL: sext_4i32_to_4i64:
; AVX2: # BB#0: # %entry		; AVX2: # BB#0: # %entry
; AVX2-NEXT: vpmovsxdq %xmm0, %ymm0		; AVX2-NEXT: vpmovsxdq %xmm0, %ymm0
; AVX2-NEXT: retq		; AVX2-NEXT: retq
;		;
; X32-SSE41-LABEL: sext_4i32_to_4i64:		; X32-SSE41-LABEL: sext_4i32_to_4i64:
; X32-SSE41: # BB#0: # %entry		; X32-SSE41: # BB#0: # %entry
; X32-SSE41-NEXT: pmovzxdq %xmm0, %xmm2		; X32-SSE41-NEXT: pmovzxdq {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero
; X32-SSE41-NEXT: movd %xmm2, %eax		; X32-SSE41-NEXT: psllq $32, %xmm2
; X32-SSE41-NEXT: sarl $31, %eax		; X32-SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm2[1,1,3,3]
; X32-SSE41-NEXT: pextrd $2, %xmm2, %ecx		; X32-SSE41-NEXT: psrad $31, %xmm2
; X32-SSE41-NEXT: pinsrd $1, %eax, %xmm2		; X32-SSE41-NEXT: pblendw {{.*#+}} xmm2 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
; X32-SSE41-NEXT: sarl $31, %ecx
; X32-SSE41-NEXT: pinsrd $3, %ecx, %xmm2
; X32-SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,2,3,3]		; X32-SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,2,3,3]
; X32-SSE41-NEXT: movd %xmm1, %eax		; X32-SSE41-NEXT: psllq $32, %xmm1
; X32-SSE41-NEXT: sarl $31, %eax		; X32-SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,1,3,3]
; X32-SSE41-NEXT: pextrd $2, %xmm1, %ecx		; X32-SSE41-NEXT: psrad $31, %xmm1
; X32-SSE41-NEXT: pinsrd $1, %eax, %xmm1		; X32-SSE41-NEXT: pblendw {{.*#+}} xmm1 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
; X32-SSE41-NEXT: sarl $31, %ecx
; X32-SSE41-NEXT: pinsrd $3, %ecx, %xmm1
; X32-SSE41-NEXT: movdqa %xmm2, %xmm0		; X32-SSE41-NEXT: movdqa %xmm2, %xmm0
; X32-SSE41-NEXT: retl		; X32-SSE41-NEXT: retl
entry:		entry:
%B = sext <4 x i32> %A to <4 x i64>		%B = sext <4 x i32> %A to <4 x i64>
ret <4 x i64>%B		ret <4 x i64>%B
}		}
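
To make the new SSE2/SSSE3 check lines above easier to follow, here is a minimal Python sketch that models each instruction on a list of four 32-bit lanes and replays the sequence `pshufd`, `psllq $32`, `pshufd`, `psrad $31`, `pshufd`, `punpckldq`. The helper names (`sext_half`, `as_i64`, and the lane-model functions) are illustrative assumptions, not LLVM code; the sketch only shows that this instruction order yields a sign extension of each i32 into an i64.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def pshufd(v, idx):
    """Permute four 32-bit lanes: result[i] = v[idx[i]]."""
    return [v[i] for i in idx]

def psllq(v, n):
    """Shift each 64-bit lane (two 32-bit lanes, low lane first) left by n."""
    out = []
    for k in (0, 2):
        q = (((v[k + 1] << 32) | v[k]) << n) & MASK64
        out += [q & MASK32, q >> 32]
    return out

def psrad(v, n):
    """Arithmetic right shift of each 32-bit lane by n."""
    out = []
    for x in v:
        signed = x - (1 << 32) if x & 0x80000000 else x
        out.append((signed >> n) & MASK32)
    return out

def punpckldq(a, b):
    """Interleave the two low 32-bit lanes of a and b."""
    return [a[0], b[0], a[1], b[1]]

def sext_half(x, first_shuffle):
    """Sign-extend two of the four i32 lanes of x to two i64 lanes."""
    t = psllq(pshufd(x, first_shuffle), 32)   # each i32 now sits in the top half of a qword
    lo = pshufd(t, [1, 3, 2, 3])              # the original i32 values
    hi = pshufd(psrad(t, 31), [1, 3, 2, 3])   # all-ones or zero sign words
    return punpckldq(lo, hi)                  # value/sign pairs form the i64 lanes

def as_i64(lanes):
    """Read four 32-bit lanes as two signed 64-bit integers."""
    out = []
    for k in (0, 2):
        q = (lanes[k + 1] << 32) | lanes[k]
        out.append(q - (1 << 64) if q >> 63 else q)
    return out

def sext_4i32_to_4i64(x):
    low = sext_half(x, [0, 1, 1, 3])    # first shuffle used for the low half
    high = sext_half(x, [2, 2, 3, 3])   # first shuffle used for the high half
    return as_i64(low) + as_i64(high)
```

Everything stays in vector lanes in this model, which matches the intent stated in the review of avoiding round trips through general-purpose registers.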

define <4 x i32> @load_sext_test1(<4 x i16> *%ptr) {		define <4 x i32> @load_sext_test1(<4 x i16> *%ptr) {
[227 lines not shown]
}		}

define <4 x i64> @sext_4i1_to_4i64(<4 x i1> %mask) {		define <4 x i64> @sext_4i1_to_4i64(<4 x i1> %mask) {
; SSE2-LABEL: sext_4i1_to_4i64:		; SSE2-LABEL: sext_4i1_to_4i64:
; SSE2: # BB#0:		; SSE2: # BB#0:
; SSE2-NEXT: pslld $31, %xmm0		; SSE2-NEXT: pslld $31, %xmm0
; SSE2-NEXT: psrad $31, %xmm0		; SSE2-NEXT: psrad $31, %xmm0
; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]
; SSE2-NEXT: movd %xmm1, %rax		; SSE2-NEXT: psllq $32, %xmm1
; SSE2-NEXT: cltq		; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,3,2,3]
; SSE2-NEXT: movd %rax, %xmm2		; SSE2-NEXT: psrad $31, %xmm1
; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
; SSE2-NEXT: movd %xmm1, %rax		; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
; SSE2-NEXT: cltq
; SSE2-NEXT: movd %rax, %xmm1
; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]		; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]
; SSE2-NEXT: movd %xmm0, %rax		; SSE2-NEXT: psllq $32, %xmm0
; SSE2-NEXT: cltq		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,3,2,3]
; SSE2-NEXT: movd %rax, %xmm1		; SSE2-NEXT: psrad $31, %xmm0
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
; SSE2-NEXT: movd %xmm0, %rax		; SSE2-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
; SSE2-NEXT: cltq
; SSE2-NEXT: movd %rax, %xmm0
; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
; SSE2-NEXT: movdqa %xmm2, %xmm0		; SSE2-NEXT: movdqa %xmm2, %xmm0
; SSE2-NEXT: retq		; SSE2-NEXT: retq
;		;
; SSSE3-LABEL: sext_4i1_to_4i64:		; SSSE3-LABEL: sext_4i1_to_4i64:
; SSSE3: # BB#0:		; SSSE3: # BB#0:
; SSSE3-NEXT: pslld $31, %xmm0		; SSSE3-NEXT: pslld $31, %xmm0
; SSSE3-NEXT: psrad $31, %xmm0		; SSSE3-NEXT: psrad $31, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]		; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]
; SSSE3-NEXT: movd %xmm1, %rax		; SSSE3-NEXT: psllq $32, %xmm1
; SSSE3-NEXT: cltq		; SSSE3-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,3,2,3]
; SSSE3-NEXT: movd %rax, %xmm2		; SSSE3-NEXT: psrad $31, %xmm1
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]		; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
; SSSE3-NEXT: movd %xmm1, %rax		; SSSE3-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
; SSSE3-NEXT: cltq
; SSSE3-NEXT: movd %rax, %xmm1
; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]		; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]
; SSSE3-NEXT: movd %xmm0, %rax		; SSSE3-NEXT: psllq $32, %xmm0
; SSSE3-NEXT: cltq		; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,3,2,3]
; SSSE3-NEXT: movd %rax, %xmm1		; SSSE3-NEXT: psrad $31, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
; SSSE3-NEXT: movd %xmm0, %rax		; SSSE3-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
; SSSE3-NEXT: cltq
; SSSE3-NEXT: movd %rax, %xmm0
; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
; SSSE3-NEXT: movdqa %xmm2, %xmm0		; SSSE3-NEXT: movdqa %xmm2, %xmm0
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; SSE41-LABEL: sext_4i1_to_4i64:		; SSE41-LABEL: sext_4i1_to_4i64:
; SSE41: # BB#0:		; SSE41: # BB#0:
; SSE41-NEXT: pslld $31, %xmm0		; SSE41-NEXT: pslld $31, %xmm0
; SSE41-NEXT: psrad $31, %xmm0		; SSE41-NEXT: psrad $31, %xmm0
; SSE41-NEXT: pmovzxdq %xmm0, %xmm1		; SSE41-NEXT: pmovzxdq {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero
; SSE41-NEXT: pextrq $1, %xmm1, %rax		; SSE41-NEXT: psllq $32, %xmm2
; SSE41-NEXT: cltq		; SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm2[1,1,3,3]
; SSE41-NEXT: movd %rax, %xmm3		; SSE41-NEXT: psrad $31, %xmm2
; SSE41-NEXT: movd %xmm1, %rax		; SSE41-NEXT: pblendw {{.*#+}} xmm2 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
; SSE41-NEXT: cltq		; SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,2,3,3]
; SSE41-NEXT: movd %rax, %xmm2		; SSE41-NEXT: psllq $32, %xmm1
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]		; SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,1,3,3]
; SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]		; SSE41-NEXT: psrad $31, %xmm1
; SSE41-NEXT: pextrq $1, %xmm0, %rax		; SSE41-NEXT: pblendw {{.*#+}} xmm1 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
; SSE41-NEXT: cltq
; SSE41-NEXT: movd %rax, %xmm3
; SSE41-NEXT: movd %xmm0, %rax
; SSE41-NEXT: cltq
; SSE41-NEXT: movd %rax, %xmm1
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
; SSE41-NEXT: movdqa %xmm2, %xmm0		; SSE41-NEXT: movdqa %xmm2, %xmm0
; SSE41-NEXT: retq		; SSE41-NEXT: retq
;		;
; AVX1-LABEL: sext_4i1_to_4i64:		; AVX1-LABEL: sext_4i1_to_4i64:
; AVX1: # BB#0:		; AVX1: # BB#0:
; AVX1-NEXT: vpslld $31, %xmm0, %xmm0		; AVX1-NEXT: vpslld $31, %xmm0, %xmm0
; AVX1-NEXT: vpsrad $31, %xmm0, %xmm0		; AVX1-NEXT: vpsrad $31, %xmm0, %xmm0
; AVX1-NEXT: vpmovsxdq %xmm0, %xmm1		; AVX1-NEXT: vpmovsxdq %xmm0, %xmm1
; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
; AVX1-NEXT: vpmovsxdq %xmm0, %xmm0		; AVX1-NEXT: vpmovsxdq %xmm0, %xmm0
; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0		; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
; AVX1-NEXT: retq		; AVX1-NEXT: retq
;		;
; AVX2-LABEL: sext_4i1_to_4i64:		; AVX2-LABEL: sext_4i1_to_4i64:
; AVX2: # BB#0:		; AVX2: # BB#0:
; AVX2-NEXT: vpslld $31, %xmm0, %xmm0		; AVX2-NEXT: vpslld $31, %xmm0, %xmm0
; AVX2-NEXT: vpsrad $31, %xmm0, %xmm0		; AVX2-NEXT: vpsrad $31, %xmm0, %xmm0
; AVX2-NEXT: vpmovsxdq %xmm0, %ymm0		; AVX2-NEXT: vpmovsxdq %xmm0, %ymm0
; AVX2-NEXT: retq		; AVX2-NEXT: retq
;		;
; X32-SSE41-LABEL: sext_4i1_to_4i64:		; X32-SSE41-LABEL: sext_4i1_to_4i64:
; X32-SSE41: # BB#0:		; X32-SSE41: # BB#0:
; X32-SSE41-NEXT: pslld $31, %xmm0		; X32-SSE41-NEXT: pslld $31, %xmm0
; X32-SSE41-NEXT: psrad $31, %xmm0		; X32-SSE41-NEXT: psrad $31, %xmm0
; X32-SSE41-NEXT: pmovzxdq %xmm0, %xmm2		; X32-SSE41-NEXT: pmovzxdq {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero
; X32-SSE41-NEXT: movd %xmm2, %eax		; X32-SSE41-NEXT: psllq $32, %xmm2
; X32-SSE41-NEXT: sarl $31, %eax		; X32-SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm2[1,1,3,3]
; X32-SSE41-NEXT: pextrd $2, %xmm2, %ecx		; X32-SSE41-NEXT: psrad $31, %xmm2
; X32-SSE41-NEXT: pinsrd $1, %eax, %xmm2		; X32-SSE41-NEXT: pblendw {{.*#+}} xmm2 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
; X32-SSE41-NEXT: sarl $31, %ecx
; X32-SSE41-NEXT: pinsrd $3, %ecx, %xmm2
; X32-SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,2,3,3]		; X32-SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,2,3,3]
; X32-SSE41-NEXT: movd %xmm1, %eax		; X32-SSE41-NEXT: psllq $32, %xmm1
; X32-SSE41-NEXT: sarl $31, %eax		; X32-SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,1,3,3]
; X32-SSE41-NEXT: pextrd $2, %xmm1, %ecx		; X32-SSE41-NEXT: psrad $31, %xmm1
; X32-SSE41-NEXT: pinsrd $1, %eax, %xmm1		; X32-SSE41-NEXT: pblendw {{.*#+}} xmm1 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
; X32-SSE41-NEXT: sarl $31, %ecx
; X32-SSE41-NEXT: pinsrd $3, %ecx, %xmm1
; X32-SSE41-NEXT: movdqa %xmm2, %xmm0		; X32-SSE41-NEXT: movdqa %xmm2, %xmm0
; X32-SSE41-NEXT: retl		; X32-SSE41-NEXT: retl
%extmask = sext <4 x i1> %mask to <4 x i64>		%extmask = sext <4 x i1> %mask to <4 x i64>
ret <4 x i64> %extmask		ret <4 x i64> %extmask
}		}
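
The leading `pslld $31` / `psrad $31` pair in this test (and `pslld $24` / `psrad $24` in the i8 variant) is the usual in-register sign extension: shift the narrow value to the top of the 32-bit lane, then shift it back arithmetically. A small Python sketch of one lane, with the hypothetical helper name `sext_inreg_32`:

```python
MASK32 = 0xFFFFFFFF

def sext_inreg_32(x, bits):
    """Sign-extend the low `bits` bits of a 32-bit lane.

    Models `pslld $(32 - bits)` followed by `psrad $(32 - bits)`.
    Bits above `bits` in the input are discarded by the left shift.
    """
    sh = 32 - bits
    shifted = (x << sh) & MASK32                                  # pslld
    signed = shifted - (1 << 32) if shifted & 0x80000000 else shifted
    return (signed >> sh) & MASK32                                # psrad
```

With `bits=1` this turns an i1 held in bit 0 into an all-zeros or all-ones mask; with `bits=8` it sign-extends an i8 held in the low byte.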

define <16 x i16> @sext_16i8_to_16i16(<16 x i8> *%ptr) {		define <16 x i16> @sext_16i8_to_16i16(<16 x i8> *%ptr) {
; SSE2-LABEL: sext_16i8_to_16i16:		; SSE2-LABEL: sext_16i8_to_16i16:
[47 lines not shown]
}		}

define <4 x i64> @sext_4i8_to_4i64(<4 x i8> %mask) {		define <4 x i64> @sext_4i8_to_4i64(<4 x i8> %mask) {
; SSE2-LABEL: sext_4i8_to_4i64:		; SSE2-LABEL: sext_4i8_to_4i64:
; SSE2: # BB#0:		; SSE2: # BB#0:
; SSE2-NEXT: pslld $24, %xmm0		; SSE2-NEXT: pslld $24, %xmm0
; SSE2-NEXT: psrad $24, %xmm0		; SSE2-NEXT: psrad $24, %xmm0
; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]
; SSE2-NEXT: movd %xmm1, %rax		; SSE2-NEXT: psllq $32, %xmm1
; SSE2-NEXT: cltq		; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,3,2,3]
; SSE2-NEXT: movd %rax, %xmm2		; SSE2-NEXT: psrad $31, %xmm1
; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
; SSE2-NEXT: movd %xmm1, %rax		; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
; SSE2-NEXT: cltq
; SSE2-NEXT: movd %rax, %xmm1
; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]		; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]
; SSE2-NEXT: movd %xmm0, %rax		; SSE2-NEXT: psllq $32, %xmm0
; SSE2-NEXT: cltq		; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,3,2,3]
; SSE2-NEXT: movd %rax, %xmm1		; SSE2-NEXT: psrad $31, %xmm0
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
; SSE2-NEXT: movd %xmm0, %rax		; SSE2-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
; SSE2-NEXT: cltq
; SSE2-NEXT: movd %rax, %xmm0
; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
; SSE2-NEXT: movdqa %xmm2, %xmm0		; SSE2-NEXT: movdqa %xmm2, %xmm0
; SSE2-NEXT: retq		; SSE2-NEXT: retq
;		;
; SSSE3-LABEL: sext_4i8_to_4i64:		; SSSE3-LABEL: sext_4i8_to_4i64:
; SSSE3: # BB#0:		; SSSE3: # BB#0:
; SSSE3-NEXT: pslld $24, %xmm0		; SSSE3-NEXT: pslld $24, %xmm0
; SSSE3-NEXT: psrad $24, %xmm0		; SSSE3-NEXT: psrad $24, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]		; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[0,1,1,3]
; SSSE3-NEXT: movd %xmm1, %rax		; SSSE3-NEXT: psllq $32, %xmm1
; SSSE3-NEXT: cltq		; SSSE3-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,3,2,3]
; SSSE3-NEXT: movd %rax, %xmm2		; SSSE3-NEXT: psrad $31, %xmm1
; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]		; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
; SSSE3-NEXT: movd %xmm1, %rax		; SSSE3-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
; SSSE3-NEXT: cltq
; SSSE3-NEXT: movd %rax, %xmm1
; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]		; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]
; SSSE3-NEXT: movd %xmm0, %rax		; SSSE3-NEXT: psllq $32, %xmm0
; SSSE3-NEXT: cltq		; SSSE3-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,3,2,3]
; SSSE3-NEXT: movd %rax, %xmm1		; SSSE3-NEXT: psrad $31, %xmm0
; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; SSSE3-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
; SSSE3-NEXT: movd %xmm0, %rax		; SSSE3-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
; SSSE3-NEXT: cltq
; SSSE3-NEXT: movd %rax, %xmm0
; SSSE3-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
; SSSE3-NEXT: movdqa %xmm2, %xmm0		; SSSE3-NEXT: movdqa %xmm2, %xmm0
; SSSE3-NEXT: retq		; SSSE3-NEXT: retq
;		;
; SSE41-LABEL: sext_4i8_to_4i64:		; SSE41-LABEL: sext_4i8_to_4i64:
; SSE41: # BB#0:		; SSE41: # BB#0:
; SSE41-NEXT: pslld $24, %xmm0		; SSE41-NEXT: pslld $24, %xmm0
; SSE41-NEXT: psrad $24, %xmm0		; SSE41-NEXT: psrad $24, %xmm0
; SSE41-NEXT: pmovzxdq %xmm0, %xmm1		; SSE41-NEXT: pmovzxdq {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero
; SSE41-NEXT: pextrq $1, %xmm1, %rax		; SSE41-NEXT: psllq $32, %xmm2
; SSE41-NEXT: cltq		; SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm2[1,1,3,3]
; SSE41-NEXT: movd %rax, %xmm3		; SSE41-NEXT: psrad $31, %xmm2
; SSE41-NEXT: movd %xmm1, %rax		; SSE41-NEXT: pblendw {{.*#+}} xmm2 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
; SSE41-NEXT: cltq		; SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,2,3,3]
; SSE41-NEXT: movd %rax, %xmm2		; SSE41-NEXT: psllq $32, %xmm1
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]		; SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,1,3,3]
; SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,2,3,3]		; SSE41-NEXT: psrad $31, %xmm1
; SSE41-NEXT: pextrq $1, %xmm0, %rax		; SSE41-NEXT: pblendw {{.*#+}} xmm1 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
; SSE41-NEXT: cltq
; SSE41-NEXT: movd %rax, %xmm3
; SSE41-NEXT: movd %xmm0, %rax
; SSE41-NEXT: cltq
; SSE41-NEXT: movd %rax, %xmm1
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
; SSE41-NEXT: movdqa %xmm2, %xmm0		; SSE41-NEXT: movdqa %xmm2, %xmm0
; SSE41-NEXT: retq		; SSE41-NEXT: retq
;		;
; AVX1-LABEL: sext_4i8_to_4i64:		; AVX1-LABEL: sext_4i8_to_4i64:
; AVX1: # BB#0:		; AVX1: # BB#0:
; AVX1-NEXT: vpslld $24, %xmm0, %xmm0		; AVX1-NEXT: vpslld $24, %xmm0, %xmm0
; AVX1-NEXT: vpsrad $24, %xmm0, %xmm0		; AVX1-NEXT: vpsrad $24, %xmm0, %xmm0
; AVX1-NEXT: vpmovsxdq %xmm0, %xmm1		; AVX1-NEXT: vpmovsxdq %xmm0, %xmm1
; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
; AVX1-NEXT: vpmovsxdq %xmm0, %xmm0		; AVX1-NEXT: vpmovsxdq %xmm0, %xmm0
; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0		; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
; AVX1-NEXT: retq		; AVX1-NEXT: retq
;		;
; AVX2-LABEL: sext_4i8_to_4i64:		; AVX2-LABEL: sext_4i8_to_4i64:
; AVX2: # BB#0:		; AVX2: # BB#0:
; AVX2-NEXT: vpslld $24, %xmm0, %xmm0		; AVX2-NEXT: vpslld $24, %xmm0, %xmm0
; AVX2-NEXT: vpsrad $24, %xmm0, %xmm0		; AVX2-NEXT: vpsrad $24, %xmm0, %xmm0
; AVX2-NEXT: vpmovsxdq %xmm0, %ymm0		; AVX2-NEXT: vpmovsxdq %xmm0, %ymm0
; AVX2-NEXT: retq		; AVX2-NEXT: retq
;		;
; X32-SSE41-LABEL: sext_4i8_to_4i64:		; X32-SSE41-LABEL: sext_4i8_to_4i64:
; X32-SSE41: # BB#0:		; X32-SSE41: # BB#0:
; X32-SSE41-NEXT: pslld $24, %xmm0		; X32-SSE41-NEXT: pslld $24, %xmm0
; X32-SSE41-NEXT: psrad $24, %xmm0		; X32-SSE41-NEXT: psrad $24, %xmm0
; X32-SSE41-NEXT: pmovzxdq %xmm0, %xmm2		; X32-SSE41-NEXT: pmovzxdq {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero
; X32-SSE41-NEXT: movd %xmm2, %eax		; X32-SSE41-NEXT: psllq $32, %xmm2
; X32-SSE41-NEXT: sarl $31, %eax		; X32-SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm2[1,1,3,3]
; X32-SSE41-NEXT: pextrd $2, %xmm2, %ecx		; X32-SSE41-NEXT: psrad $31, %xmm2
; X32-SSE41-NEXT: pinsrd $1, %eax, %xmm2		; X32-SSE41-NEXT: pblendw {{.*#+}} xmm2 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
; X32-SSE41-NEXT: sarl $31, %ecx
; X32-SSE41-NEXT: pinsrd $3, %ecx, %xmm2
; X32-SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,2,3,3]		; X32-SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,2,3,3]
; X32-SSE41-NEXT: movd %xmm1, %eax		; X32-SSE41-NEXT: psllq $32, %xmm1
; X32-SSE41-NEXT: sarl $31, %eax		; X32-SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,1,3,3]
; X32-SSE41-NEXT: pextrd $2, %xmm1, %ecx		; X32-SSE41-NEXT: psrad $31, %xmm1
; X32-SSE41-NEXT: pinsrd $1, %eax, %xmm1		; X32-SSE41-NEXT: pblendw {{.*#+}} xmm1 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
; X32-SSE41-NEXT: sarl $31, %ecx
; X32-SSE41-NEXT: pinsrd $3, %ecx, %xmm1
; X32-SSE41-NEXT: movdqa %xmm2, %xmm0		; X32-SSE41-NEXT: movdqa %xmm2, %xmm0
; X32-SSE41-NEXT: retl		; X32-SSE41-NEXT: retl
%extmask = sext <4 x i8> %mask to <4 x i64>		%extmask = sext <4 x i8> %mask to <4 x i64>
ret <4 x i64> %extmask		ret <4 x i64> %extmask
}		}

define <4 x i64> @load_sext_4i8_to_4i64(<4 x i8> *%ptr) {		define <4 x i64> @load_sext_4i8_to_4i64(<4 x i8> *%ptr) {
; SSE2-LABEL: load_sext_4i8_to_4i64:		; SSE2-LABEL: load_sext_4i8_to_4i64:
[119 lines not shown]

test/CodeGen/X86/vshift-3.ll

	; RUN: llc < %s -march=x86 -mattr=+sse2 | FileCheck %s			; RUN: llc < %s -march=x86 -mattr=+sse2 | FileCheck %s

	; test vector shifts converted to proper SSE2 vector shifts when the shift			; test vector shifts converted to proper SSE2 vector shifts when the shift
	; amounts are the same.			; amounts are the same.

	; Note that x86 does have ashr			; Note that x86 does have ashr

	; shift1a can't use a packed shift
	define void @shift1a(<2 x i64> %val, <2 x i64>* %dst) nounwind {			define void @shift1a(<2 x i64> %val, <2 x i64>* %dst) nounwind {
	entry:			entry:
	; CHECK-LABEL: shift1a:			; CHECK-LABEL: shift1a:
	; CHECK: sarl			; CHECK: psrad $31
	%ashr = ashr <2 x i64> %val, < i64 32, i64 32 >			%ashr = ashr <2 x i64> %val, < i64 32, i64 32 >
	store <2 x i64> %ashr, <2 x i64>* %dst			store <2 x i64> %ashr, <2 x i64>* %dst
	ret void			ret void
	}			}
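
SSE2 has no 64-bit arithmetic right shift, so the `psrad $31` in the updated check suggests the i64 shift is assembled from 32-bit pieces. The Python sketch below shows one way to build a uniform constant 64-bit `ashr` from a 32-bit arithmetic shift and a 64-bit logical shift. It is an illustration under that assumption, not a copy of the code in X86ISelLowering.cpp, and the names `sra32`, `sra64_via_32`, and `reference_sra64` are made up for the example.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def sra32(x, n):
    """Arithmetic right shift of one 32-bit lane (models psrad)."""
    signed = x - (1 << 32) if x & 0x80000000 else x
    return (signed >> n) & MASK32

def sra64_via_32(q, amt):
    """ashr of one 64-bit lane by a constant amt in 0..63, using 32-bit pieces."""
    hi = q >> 32
    if amt >= 32:
        new_lo = sra32(hi, amt - 32)   # the upper dword moves down
        new_hi = sra32(hi, 31)         # the upper dword becomes pure sign bits
    else:
        new_hi = sra32(hi, amt)        # psrad gives the correct upper dword
        new_lo = (q >> amt) & MASK32   # a logical 64-bit shift (psrlq) gives the lower dword
    return (new_hi << 32) | new_lo

def reference_sra64(q, amt):
    """Direct 64-bit arithmetic shift, for comparison."""
    signed = q - (1 << 64) if q >> 63 else q
    return (signed >> amt) & MASK64
```

For the shift amount of 32 used in `shift1a`, the first branch reduces to copying the upper dword into the lower one and filling the upper dword with `psrad $31`, which is consistent with the new CHECK line.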

	define void @shift2a(<4 x i32> %val, <4 x i32>* %dst) nounwind {			define void @shift2a(<4 x i32> %val, <4 x i32>* %dst) nounwind {
	entry:			entry:
	; CHECK-LABEL: shift2a:			; CHECK-LABEL: shift2a:
	[47 lines not shown]

test/CodeGen/X86/widen_conv-2.ll

	; RUN: llc < %s -march=x86 -mattr=+sse4.2 | FileCheck %s			; RUN: llc < %s -march=x86 -mattr=+sse4.2 | FileCheck %s
	; CHECK: {{cwtl|movswl}}			; CHECK: psllq $48, %xmm0
	; CHECK: {{cwtl|movswl}}			; CHECK: psrad $16, %xmm0
				; CHECK: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]

	; sign extension v2i32 to v2i16			; sign extension v2i16 to v2i32

	define void @convert(<2 x i32>* %dst.addr, <2 x i16> %src) nounwind {			define void @convert(<2 x i32>* %dst.addr, <2 x i16> %src) nounwind {
	entry:			entry:
	%signext = sext <2 x i16> %src to <2 x i32> ; <<12 x i8>> [#uses=1]			%signext = sext <2 x i16> %src to <2 x i32> ; <<12 x i8>> [#uses=1]
	store <2 x i32> %signext, <2 x i32>* %dst.addr			store <2 x i32> %signext, <2 x i32>* %dst.addr
	ret void			ret void
	}			}
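
The three new CHECK lines can be read as: move each i16 to the top of its 64-bit lane, arithmetic-shift the 32-bit lanes back down, then gather the two useful dwords. The Python sketch below models that reading. It assumes each `<2 x i16>` element arrives in the low 16 bits of a 64-bit lane, which is an assumption about how the type is legalized here rather than something the diff states.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def psllq(qwords, n):
    """Logical left shift of each 64-bit lane."""
    return [(q << n) & MASK64 for q in qwords]

def to_dwords(qwords):
    """View two 64-bit lanes as four 32-bit lanes, low lane first."""
    out = []
    for q in qwords:
        out += [q & MASK32, q >> 32]
    return out

def psrad(dwords, n):
    """Arithmetic right shift of each 32-bit lane."""
    out = []
    for x in dwords:
        signed = x - (1 << 32) if x & 0x80000000 else x
        out.append((signed >> n) & MASK32)
    return out

def convert_2i16_to_2i32(src):
    """src: two 64-bit lanes, each carrying an i16 in its low 16 bits."""
    t = psllq(src, 48)                   # psllq $48: i16 now in bits 48..63
    d = psrad(to_dwords(t), 16)          # psrad $16: odd dwords hold sext(i16) as i32
    d = [d[i] for i in (1, 3, 2, 3)]     # pshufd [1,3,2,3]: gather dwords 1 and 3
    return d[:2]                         # the two i32 results to be stored
```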

Revision Contents

Diff 25433

lib/Target/X86/X86ISelLowering.cpp

lib/Target/X86/X86TargetTransformInfo.cpp

test/Analysis/CostModel/X86/testshiftashr.ll

test/CodeGen/X86/vec_int_to_fp.ll

test/CodeGen/X86/vector-sext.ll

test/CodeGen/X86/vshift-3.ll

test/CodeGen/X86/widen_conv-2.ll
