This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Vectorized i64 uniform constant SRA shifts
Closed, Public

Authored by RKSimon on May 10 2015, 4:47 AM.

Details

Reviewers

qcolombet
delena
andreadb

Commits

rG8fbf1c1f4a6f: [X86][SSE] Vectorized i64 uniform constant SRA shifts
rL241514: [X86][SSE] Vectorized i64 uniform constant SRA shifts

Summary

This patch adds vectorization support for uniform constant i64 arithmetic shift right operators.

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 25426. May 10 2015, 4:47 AM

RKSimon retitled this revision to [X86][SSE] Vectorized i64 uniform constant SRA shifts.

RKSimon updated this object.

RKSimon edited the test plan for this revision.

RKSimon added reviewers: andreadb, delena, qcolombet.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: Unknown Object (MLST).

delena added inline comments. May 10 2015, 6:50 AM

lib/Target/X86/X86ISelLowering.cpp
16552 ↗	(On Diff #25426)	I think I fixed a bug here and removed AVX512. Could you please check?
test/CodeGen/X86/vector-sext.ll
111 ↗	(On Diff #25426)	I see this code as 2 pmovsxdq instructions and one shuffle between them. For Windows, it looks like:
    pmovsxdq (%rcx), %xmm0
    pmovsxdq 8(%rcx), %xmm1
    retq
For Linux your parameter is in xmm, so you need one shuffle with <2, 3, undef, undef>.

Thanks Elena, I'll get extra AVX512 tests added for review.

lib/Target/X86/X86ISelLowering.cpp
16552 ↗	(On Diff #25426)	Yes I'll add a proper AVX512 test.
test/CodeGen/X86/vector-sext.ll
111 ↗	(On Diff #25426)	I have some work in progress patches to improve pmovsx* support - the problem is that SIGN_EXTEND_INREG / SIGN_EXTEND_VECTOR_INREG is a mess and will take a while to sort out. Making this i64 SRA patch was an easy first step so at least we're not transferring between xmm and gprs so much.

I think that you did not update the code before creating the patch.
See 235993.

Elena

Dropped AVX512 support for i64 SRA in LowerScalarVariableShift as noted by Elena - added FIXME note.

You should generate different code for SSE 4.1 and AVX, according to my previous comment.

In D9645#169914, @delena wrote:

You should generate different code for SSE 4.1 and AVX, according to my previous comment.

Thanks - I have better support for pmovsx* coming in a later patch - it will be much easier to implement once this patch is done.

I disagree with this approach. You can prepare a separate patch for "shift-right-64bits" optimization.
But you should not use shift-right in SEXT and generate 10 instructions instead of 3.

In D9645#170039, @delena wrote:

I disagree with this approach. You can prepare a separate patch for "shift-right-64bits" optimization.
But you should not use shift-right in SEXT and generate 10 instructions instead of 3.

I haven't intentionally added these shift-rights to the sext tests - it's just a result of the default sign-ext expansion from SIGN_EXTEND_INREG. I still think this is an improvement over where we are now, but I'll push the pmovsx* patch up for review as soon as I can and revisit this afterwards.
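
As background for the sext discussion above: the widen_conv-2.ll expectations in this revision show the sign extension being performed as a left shift followed by an arithmetic right shift (psllq $48, then psrad $16). The following is a minimal scalar sketch of that shift pair, assuming a 64-bit container. The names ashr64 and signExtendInReg are hypothetical; the code only models the arithmetic and is not LLVM code.

```cpp
#include <cassert>
#include <cstdint>

// Arithmetic shift right on a 64-bit pattern, written with unsigned
// operations only so the result does not depend on the compiler.
inline uint64_t ashr64(uint64_t x, unsigned amt) {
  assert(amt < 64 && "shift amount out of range");
  const uint64_t logical = x >> amt;
  // Replicate the sign bit into the amt vacated top bits.
  const uint64_t fill =
      (amt != 0 && (x >> 63) != 0) ? (~uint64_t{0} << (64 - amt)) : 0;
  return logical | fill;
}

// Sign-extend the low `bits` bits of x to 64 bits: move the field's sign
// bit up to bit 63, then shift back down arithmetically.
inline uint64_t signExtendInReg(uint64_t x, unsigned bits) {
  assert(bits >= 1 && bits <= 64 && "field width out of range");
  const unsigned sh = 64 - bits;
  return ashr64(x << sh, sh);
}
```

For example, signExtendInReg(0x8000, 16) yields 0xFFFFFFFFFFFF8000, while signExtendInReg(0x7FFF, 16) leaves 0x7FFF unchanged.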

RKSimon mentioned this in D9923: Adjust the cost of vectorized SHL/SRL/SRA. May 22 2015, 3:46 AM

Refreshed this patch now that the previous edge cases (sint_to_fp, and sext) have been dealt with properly.

I've updated the patch to work with Elena's 'SupportedVector' tests and moved the shift lowering code into LowerScalarImmediateShift directly.

delena added inline comments. Jul 4 2015, 11:27 PM

test/CodeGen/X86/vector-shift-ashr-128.ll
988 ↗	(On Diff #29008)	Hi Simon, I think that the result here will be incorrect. Let's take a positive 64-bit number (2^34)-1. After the arithmetic shift-right-7 you should receive (2^27)-1. But "vpsrad" will take the source as negative 32-bit and you'll see (2^32)-1 in %xmm1 and the correct result will be after "vpsrlq" in %xmm0.
Upper = (2^32)-1
Lower = (2^27)-1
Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower, {4, 1, 6, 3}); <= the result is incorrect

RKSimon added inline comments. Jul 5 2015, 7:56 AM

test/CodeGen/X86/vector-shift-ashr-128.ll
988 ↗	(On Diff #29008)	Hi Elena, taking your example (as a v1i64 for simplicity):
in: 00000003ffffffff (ffffffff 00000003) as v2i32
upper: ashr32(in, 7): 00000000ffffffff (ffffffff 00000000) 2nd 32-bit lane used
lower: lshr64(in, 7): 0000000007ffffff (07ffffff 00000000) 1st 32-bit lane used
shuffle32(ashr32, lshr64, 4,1,3,2): 0000000007ffffff (07ffffff 00000000)
Which I believe is correct, no?
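
The worked example above can be checked with a small scalar model. This is only a sketch, assuming a single i64 element and a shift amount below 32. The names ashrLane32, PartialShift and ashr64ViaPartialShifts are hypothetical and do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Models psrad on one 32-bit lane, using unsigned operations only.
inline uint32_t ashrLane32(uint32_t lane, unsigned amt) {
  assert(amt < 32 && "lane shift amount out of range");
  const uint32_t logical = lane >> amt;
  const uint32_t fill =
      (amt != 0 && (lane >> 31) != 0) ? (~uint32_t{0} << (32 - amt)) : 0;
  return logical | fill;
}

// Intermediate values of the partial-shift sequence, as 64-bit patterns.
struct PartialShift {
  uint64_t upper;  // both 32-bit lanes shifted arithmetically
  uint64_t lower;  // whole i64 shifted logically
  uint64_t merged; // high lane from upper, low lane from lower
};

inline PartialShift ashr64ViaPartialShifts(uint64_t in, unsigned amt) {
  assert(amt < 32 && "this sketch covers the ShiftAmt < 32 path only");
  const uint32_t lo = static_cast<uint32_t>(in);
  const uint32_t hi = static_cast<uint32_t>(in >> 32);
  PartialShift r;
  r.upper = (static_cast<uint64_t>(ashrLane32(hi, amt)) << 32) |
            ashrLane32(lo, amt);
  r.lower = in >> amt;
  // The low lane of upper (the value Elena was worried about) is discarded.
  r.merged = (r.upper & 0xFFFFFFFF00000000ULL) |
             (r.lower & 0x00000000FFFFFFFFULL);
  return r;
}
```

With in = 0x00000003ffffffff and a shift of 7, the model gives upper = 0x00000000ffffffff, lower = 0x0000000007ffffff and merged = 0x0000000007ffffff, matching the values quoted in the comment.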

delena added inline comments. Jul 5 2015, 11:55 PM

test/CodeGen/X86/vector-shift-ashr-128.ll
988 ↗	(On Diff #29008)	yes, you are right. I checked again and I don't see any problem. In my opinion, you can commit this code.

Closed by commit rL241514: [X86][SSE] Vectorized i64 uniform constant SRA shifts (authored by RKSimon). Jul 6 2015, 3:35 PM

This revision was automatically updated to reflect the committed changes.

Thanks Elena
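
For readers following the shuffle masks in the X86ISelLowering.cpp change of this revision, here is a lane-level sketch of both paths for the v2i64 case. It assumes the usual two-input shuffle convention, where mask indices 0 to 3 select from the first operand and 4 to 7 from the second. V4i32, Mask4, ashrLane, shuffleLanes and ashrV2i64 are hypothetical names for this illustration, not LLVM APIs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V4i32 = std::array<uint32_t, 4>; // a v2i64 viewed as four i32 lanes
using Mask4 = std::array<int, 4>;

// Models psrad on one lane, using unsigned operations only.
inline uint32_t ashrLane(uint32_t lane, unsigned amt) {
  assert(amt < 32 && "lane shift amount out of range");
  const uint32_t logical = lane >> amt;
  const uint32_t fill =
      (amt != 0 && (lane >> 31) != 0) ? (~uint32_t{0} << (32 - amt)) : 0;
  return logical | fill;
}

// Two-input shuffle: index < 4 picks from a, otherwise from b.
inline V4i32 shuffleLanes(const V4i32 &a, const V4i32 &b, const Mask4 &mask) {
  V4i32 out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    const std::size_t idx = static_cast<std::size_t>(mask[i]);
    out[i] = idx < 4 ? a[idx] : b[idx - 4];
  }
  return out;
}

inline V4i32 ashrV2i64(const V4i32 &ex, unsigned amt) {
  assert(amt < 64 && "shift amount out of range");
  V4i32 upper{}, lower{};
  if (amt >= 32) {
    // Sign splat goes to the high lane; the source high lane, shifted by
    // amt - 32, becomes the low lane.
    for (std::size_t i = 0; i < 4; ++i) {
      upper[i] = ashrLane(ex[i], 31);
      lower[i] = ashrLane(ex[i], amt - 32);
    }
    return shuffleLanes(upper, lower, Mask4{5, 1, 7, 3});
  }
  // Arithmetic shift per lane for the high lanes, logical 64-bit shift
  // per element for the low lanes.
  for (std::size_t i = 0; i < 4; ++i)
    upper[i] = ashrLane(ex[i], amt);
  for (std::size_t e = 0; e < 2; ++e) {
    const uint64_t elt =
        (static_cast<uint64_t>(ex[2 * e + 1]) << 32) | ex[2 * e];
    const uint64_t shifted = elt >> amt;
    lower[2 * e] = static_cast<uint32_t>(shifted);
    lower[2 * e + 1] = static_cast<uint32_t>(shifted >> 32);
  }
  return shuffleLanes(upper, lower, Mask4{4, 1, 6, 3});
}
```

In both paths the odd (high) result lanes come from Upper and the even (low) result lanes come from Lower; the only difference is which Lower lane feeds the low half.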

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	50 lines
llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp	3 lines
llvm/trunk/test/Analysis/CostModel/X86/testshiftashr.ll	32 lines
llvm/trunk/test/CodeGen/X86/vector-shift-ashr-128.ll	49 lines
llvm/trunk/test/CodeGen/X86/vector-shift-ashr-256.ll	41 lines
llvm/trunk/test/CodeGen/X86/vshift-3.ll	5 lines
llvm/trunk/test/CodeGen/X86/widen_conv-2.ll	7 lines

Diff 29133

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

[1,026 lines not shown]	if (Subtarget->hasSSE2()) {
// In the customized shift lowering, the legal cases in AVX2 will be		// In the customized shift lowering, the legal cases in AVX2 will be
// recognized.		// recognized.
setOperationAction(ISD::SRL, MVT::v2i64, Custom);		setOperationAction(ISD::SRL, MVT::v2i64, Custom);
setOperationAction(ISD::SRL, MVT::v4i32, Custom);		setOperationAction(ISD::SRL, MVT::v4i32, Custom);

setOperationAction(ISD::SHL, MVT::v2i64, Custom);		setOperationAction(ISD::SHL, MVT::v2i64, Custom);
setOperationAction(ISD::SHL, MVT::v4i32, Custom);		setOperationAction(ISD::SHL, MVT::v4i32, Custom);

		setOperationAction(ISD::SRA, MVT::v2i64, Custom);
setOperationAction(ISD::SRA, MVT::v4i32, Custom);		setOperationAction(ISD::SRA, MVT::v4i32, Custom);
}		}

if (!Subtarget->useSoftFloat() && Subtarget->hasFp256()) {		if (!Subtarget->useSoftFloat() && Subtarget->hasFp256()) {
addRegisterClass(MVT::v32i8, &X86::VR256RegClass);		addRegisterClass(MVT::v32i8, &X86::VR256RegClass);
addRegisterClass(MVT::v16i16, &X86::VR256RegClass);		addRegisterClass(MVT::v16i16, &X86::VR256RegClass);
addRegisterClass(MVT::v8i32, &X86::VR256RegClass);		addRegisterClass(MVT::v8i32, &X86::VR256RegClass);
addRegisterClass(MVT::v8f32, &X86::VR256RegClass);		addRegisterClass(MVT::v8f32, &X86::VR256RegClass);
[163 lines not shown]	if (!Subtarget->useSoftFloat() && Subtarget->hasFp256()) {
// In the customized shift lowering, the legal cases in AVX2 will be		// In the customized shift lowering, the legal cases in AVX2 will be
// recognized.		// recognized.
setOperationAction(ISD::SRL, MVT::v4i64, Custom);		setOperationAction(ISD::SRL, MVT::v4i64, Custom);
setOperationAction(ISD::SRL, MVT::v8i32, Custom);		setOperationAction(ISD::SRL, MVT::v8i32, Custom);

setOperationAction(ISD::SHL, MVT::v4i64, Custom);		setOperationAction(ISD::SHL, MVT::v4i64, Custom);
setOperationAction(ISD::SHL, MVT::v8i32, Custom);		setOperationAction(ISD::SHL, MVT::v8i32, Custom);

		setOperationAction(ISD::SRA, MVT::v4i64, Custom);
setOperationAction(ISD::SRA, MVT::v8i32, Custom);		setOperationAction(ISD::SRA, MVT::v8i32, Custom);

// Custom lower several nodes for 256-bit types.		// Custom lower several nodes for 256-bit types.
for (MVT VT : MVT::vector_valuetypes()) {		for (MVT VT : MVT::vector_valuetypes()) {
if (VT.getScalarSizeInBits() >= 32) {		if (VT.getScalarSizeInBits() >= 32) {
setOperationAction(ISD::MLOAD, VT, Legal);		setOperationAction(ISD::MLOAD, VT, Legal);
setOperationAction(ISD::MSTORE, VT, Legal);		setOperationAction(ISD::MSTORE, VT, Legal);
}		}
[15,721 lines not shown]	static SDValue LowerScalarImmediateShift(SDValue Op, SelectionDAG &DAG,
MVT VT = Op.getSimpleValueType();		MVT VT = Op.getSimpleValueType();
SDLoc dl(Op);		SDLoc dl(Op);
SDValue R = Op.getOperand(0);		SDValue R = Op.getOperand(0);
SDValue Amt = Op.getOperand(1);		SDValue Amt = Op.getOperand(1);

unsigned X86Opc = (Op.getOpcode() == ISD::SHL) ? X86ISD::VSHLI :		unsigned X86Opc = (Op.getOpcode() == ISD::SHL) ? X86ISD::VSHLI :
(Op.getOpcode() == ISD::SRL) ? X86ISD::VSRLI : X86ISD::VSRAI;		(Op.getOpcode() == ISD::SRL) ? X86ISD::VSRLI : X86ISD::VSRAI;

		auto ArithmeticShiftRight64 = [&](uint64_t ShiftAmt) {
		assert((VT == MVT::v2i64 \|\| VT == MVT::v4i64) && "Unexpected SRA type");
		MVT ExVT = MVT::getVectorVT(MVT::i32, VT.getVectorNumElements() * 2);
		SDValue Ex = DAG.getBitcast(ExVT, R);

		if (ShiftAmt >= 32) {
		// Splat sign to upper i32 dst, and SRA upper i32 src to lower i32.
		SDValue Upper =
		getTargetVShiftByConstNode(X86ISD::VSRAI, dl, ExVT, Ex, 31, DAG);
		SDValue Lower = getTargetVShiftByConstNode(X86ISD::VSRAI, dl, ExVT, Ex,
		ShiftAmt - 32, DAG);
		if (VT == MVT::v2i64)
		Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower, {5, 1, 7, 3});
		if (VT == MVT::v4i64)
		Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower,
		{9, 1, 11, 3, 13, 5, 15, 7});
		} else {
		// SRA upper i32, SRL whole i64 and select lower i32.
		SDValue Upper = getTargetVShiftByConstNode(X86ISD::VSRAI, dl, ExVT, Ex,
		ShiftAmt, DAG);
		SDValue Lower =
		getTargetVShiftByConstNode(X86ISD::VSRLI, dl, VT, R, ShiftAmt, DAG);
		Lower = DAG.getBitcast(ExVT, Lower);
		if (VT == MVT::v2i64)
		Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower, {4, 1, 6, 3});
		if (VT == MVT::v4i64)
		Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower,
		{8, 1, 10, 3, 12, 5, 14, 7});
		}
		return DAG.getBitcast(VT, Ex);
		};

// Optimize shl/srl/sra with constant shift amount.		// Optimize shl/srl/sra with constant shift amount.
if (auto *BVAmt = dyn_cast<BuildVectorSDNode>(Amt)) {		if (auto *BVAmt = dyn_cast<BuildVectorSDNode>(Amt)) {
if (auto *ShiftConst = BVAmt->getConstantSplatNode()) {		if (auto *ShiftConst = BVAmt->getConstantSplatNode()) {
uint64_t ShiftAmt = ShiftConst->getZExtValue();		uint64_t ShiftAmt = ShiftConst->getZExtValue();

if (SupportedVectorShiftWithImm(VT, Subtarget, Op.getOpcode()))		if (SupportedVectorShiftWithImm(VT, Subtarget, Op.getOpcode()))
return getTargetVShiftByConstNode(X86Opc, dl, VT, R, ShiftAmt, DAG);		return getTargetVShiftByConstNode(X86Opc, dl, VT, R, ShiftAmt, DAG);

		// i64 SRA needs to be performed as partial shifts.
		if ((VT == MVT::v2i64 \|\| (Subtarget->hasInt256() && VT == MVT::v4i64)) &&
		Op.getOpcode() == ISD::SRA)
		return ArithmeticShiftRight64(ShiftAmt);

if (VT == MVT::v16i8 \|\| (Subtarget->hasInt256() && VT == MVT::v32i8)) {		if (VT == MVT::v16i8 \|\| (Subtarget->hasInt256() && VT == MVT::v32i8)) {
unsigned NumElts = VT.getVectorNumElements();		unsigned NumElts = VT.getVectorNumElements();
MVT ShiftVT = MVT::getVectorVT(MVT::i16, NumElts / 2);		MVT ShiftVT = MVT::getVectorVT(MVT::i16, NumElts / 2);

if (Op.getOpcode() == ISD::SHL) {		if (Op.getOpcode() == ISD::SHL) {
// Simple i8 add case		// Simple i8 add case
if (ShiftAmt == 1)		if (ShiftAmt == 1)
return DAG.getNode(ISD::ADD, dl, VT, R, R);		return DAG.getNode(ISD::ADD, dl, VT, R, R);
[67 lines not shown]	for (unsigned i = Ratio; i != Amt.getNumOperands(); i += Ratio) {
if (!C)		if (!C)
return SDValue();		return SDValue();
// 6 == Log2(64)		// 6 == Log2(64)
ShAmt \|= C->getZExtValue() << (j * (1 << (6 - RatioInLog2)));		ShAmt \|= C->getZExtValue() << (j * (1 << (6 - RatioInLog2)));
}		}
if (ShAmt != ShiftAmt)		if (ShAmt != ShiftAmt)
return SDValue();		return SDValue();
}		}

		if (SupportedVectorShiftWithImm(VT, Subtarget, Op.getOpcode()))
return getTargetVShiftByConstNode(X86Opc, dl, VT, R, ShiftAmt, DAG);		return getTargetVShiftByConstNode(X86Opc, dl, VT, R, ShiftAmt, DAG);

		if (Op.getOpcode() == ISD::SRA)
		return ArithmeticShiftRight64(ShiftAmt);
}		}

return SDValue();		return SDValue();
}		}

static SDValue LowerScalarVariableShift(SDValue Op, SelectionDAG &DAG,		static SDValue LowerScalarVariableShift(SDValue Op, SelectionDAG &DAG,
const X86Subtarget* Subtarget) {		const X86Subtarget* Subtarget) {
MVT VT = Op.getSimpleValueType();		MVT VT = Op.getSimpleValueType();
[65 lines not shown]	if (!Subtarget->is64Bit() && VT == MVT::v2i64 &&
std::vector<SDValue> Vals(Ratio);		std::vector<SDValue> Vals(Ratio);
for (unsigned i = 0; i != Ratio; ++i)		for (unsigned i = 0; i != Ratio; ++i)
Vals[i] = Amt.getOperand(i);		Vals[i] = Amt.getOperand(i);
for (unsigned i = Ratio; i != Amt.getNumOperands(); i += Ratio) {		for (unsigned i = Ratio; i != Amt.getNumOperands(); i += Ratio) {
for (unsigned j = 0; j != Ratio; ++j)		for (unsigned j = 0; j != Ratio; ++j)
if (Vals[j] != Amt.getOperand(i + j))		if (Vals[j] != Amt.getOperand(i + j))
return SDValue();		return SDValue();
}		}

		if (SupportedVectorShiftWithBaseAmnt(VT, Subtarget, Op.getOpcode()))
return DAG.getNode(X86OpcV, dl, VT, R, Op.getOperand(1));		return DAG.getNode(X86OpcV, dl, VT, R, Op.getOperand(1));
}		}
return SDValue();		return SDValue();
}		}

static SDValue LowerShift(SDValue Op, const X86Subtarget* Subtarget,		static SDValue LowerShift(SDValue Op, const X86Subtarget* Subtarget,
SelectionDAG &DAG) {		SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();		MVT VT = Op.getSimpleValueType();
SDLoc dl(Op);		SDLoc dl(Op);
[8,874 lines not shown]

llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp

[111 lines not shown]	Cost += getArithmeticInstrCost(Instruction::Add, Ty, Op1Info, Op2Info,
TargetTransformInfo::OP_None,		TargetTransformInfo::OP_None,
TargetTransformInfo::OP_None);		TargetTransformInfo::OP_None);

return Cost;		return Cost;
}		}

static const CostTblEntry<MVT::SimpleValueType>		static const CostTblEntry<MVT::SimpleValueType>
AVX2UniformConstCostTable[] = {		AVX2UniformConstCostTable[] = {
		{ ISD::SRA, MVT::v4i64, 4 }, // 2 x psrad + shuffle.

{ ISD::SDIV, MVT::v16i16, 6 }, // vpmulhw sequence		{ ISD::SDIV, MVT::v16i16, 6 }, // vpmulhw sequence
{ ISD::UDIV, MVT::v16i16, 6 }, // vpmulhuw sequence		{ ISD::UDIV, MVT::v16i16, 6 }, // vpmulhuw sequence
{ ISD::SDIV, MVT::v8i32, 15 }, // vpmuldq sequence		{ ISD::SDIV, MVT::v8i32, 15 }, // vpmuldq sequence
{ ISD::UDIV, MVT::v8i32, 15 }, // vpmuludq sequence		{ ISD::UDIV, MVT::v8i32, 15 }, // vpmuludq sequence
};		};

if (Op2Info == TargetTransformInfo::OK_UniformConstantValue &&		if (Op2Info == TargetTransformInfo::OK_UniformConstantValue &&
ST->hasAVX2()) {		ST->hasAVX2()) {
[78 lines not shown]	SSE2UniformConstCostTable[] = {
{ ISD::SRL, MVT::v16i8, 1 }, // psrlw.		{ ISD::SRL, MVT::v16i8, 1 }, // psrlw.
{ ISD::SRL, MVT::v8i16, 1 }, // psrlw.		{ ISD::SRL, MVT::v8i16, 1 }, // psrlw.
{ ISD::SRL, MVT::v4i32, 1 }, // psrld.		{ ISD::SRL, MVT::v4i32, 1 }, // psrld.
{ ISD::SRL, MVT::v2i64, 1 }, // psrlq.		{ ISD::SRL, MVT::v2i64, 1 }, // psrlq.

{ ISD::SRA, MVT::v16i8, 4 }, // psrlw, pand, pxor, psubb.		{ ISD::SRA, MVT::v16i8, 4 }, // psrlw, pand, pxor, psubb.
{ ISD::SRA, MVT::v8i16, 1 }, // psraw.		{ ISD::SRA, MVT::v8i16, 1 }, // psraw.
{ ISD::SRA, MVT::v4i32, 1 }, // psrad.		{ ISD::SRA, MVT::v4i32, 1 }, // psrad.
		{ ISD::SRA, MVT::v2i64, 4 }, // 2 x psrad + shuffle.

{ ISD::SDIV, MVT::v8i16, 6 }, // pmulhw sequence		{ ISD::SDIV, MVT::v8i16, 6 }, // pmulhw sequence
{ ISD::UDIV, MVT::v8i16, 6 }, // pmulhuw sequence		{ ISD::UDIV, MVT::v8i16, 6 }, // pmulhuw sequence
{ ISD::SDIV, MVT::v4i32, 19 }, // pmuludq sequence		{ ISD::SDIV, MVT::v4i32, 19 }, // pmuludq sequence
{ ISD::UDIV, MVT::v4i32, 15 }, // pmuludq sequence		{ ISD::UDIV, MVT::v4i32, 15 }, // pmuludq sequence
};		};

if (Op2Info == TargetTransformInfo::OK_UniformConstantValue &&		if (Op2Info == TargetTransformInfo::OK_UniformConstantValue &&
[926 lines not shown]

llvm/trunk/test/Analysis/CostModel/X86/testshiftashr.ll

[241 lines not shown]
}		}

; Test shift by a constant a value.		; Test shift by a constant a value.

%shifttypec = type <2 x i16>		%shifttypec = type <2 x i16>
define %shifttypec @shift2i16const(%shifttypec %a, %shifttypec %b) {		define %shifttypec @shift2i16const(%shifttypec %a, %shifttypec %b) {
entry:		entry:
; SSE2: shift2i16const		; SSE2: shift2i16const
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 4 {{.*}} ashr
; SSE2-CODEGEN: shift2i16const		; SSE2-CODEGEN: shift2i16const
; SSE2-CODEGEN: sarq $		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec %a , <i16 3, i16 3>		%0 = ashr %shifttypec %a , <i16 3, i16 3>
ret %shifttypec %0		ret %shifttypec %0
}		}

%shifttypec4i16 = type <4 x i16>		%shifttypec4i16 = type <4 x i16>
define %shifttypec4i16 @shift4i16const(%shifttypec4i16 %a, %shifttypec4i16 %b) {		define %shifttypec4i16 @shift4i16const(%shifttypec4i16 %a, %shifttypec4i16 %b) {
entry:		entry:
[54 lines not shown]	%0 = ashr %shifttypec32i16 %a , <i16 3, i16 3, i16 3, i16 3,
i16 3, i16 3, i16 3, i16 3>		i16 3, i16 3, i16 3, i16 3>
ret %shifttypec32i16 %0		ret %shifttypec32i16 %0
}		}

%shifttypec2i32 = type <2 x i32>		%shifttypec2i32 = type <2 x i32>
define %shifttypec2i32 @shift2i32c(%shifttypec2i32 %a, %shifttypec2i32 %b) {		define %shifttypec2i32 @shift2i32c(%shifttypec2i32 %a, %shifttypec2i32 %b) {
entry:		entry:
; SSE2: shift2i32c		; SSE2: shift2i32c
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 4 {{.*}} ashr
; SSE2-CODEGEN: shift2i32c		; SSE2-CODEGEN: shift2i32c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec2i32 %a , <i32 3, i32 3>		%0 = ashr %shifttypec2i32 %a , <i32 3, i32 3>
ret %shifttypec2i32 %0		ret %shifttypec2i32 %0
}		}

%shifttypec4i32 = type <4 x i32>		%shifttypec4i32 = type <4 x i32>
define %shifttypec4i32 @shift4i32c(%shifttypec4i32 %a, %shifttypec4i32 %b) {		define %shifttypec4i32 @shift4i32c(%shifttypec4i32 %a, %shifttypec4i32 %b) {
entry:		entry:
[52 lines not shown]	%0 = ashr %shifttypec32i32 %a , <i32 3, i32 3, i32 3, i32 3,
i32 3, i32 3, i32 3, i32 3>		i32 3, i32 3, i32 3, i32 3>
ret %shifttypec32i32 %0		ret %shifttypec32i32 %0
}		}

%shifttypec2i64 = type <2 x i64>		%shifttypec2i64 = type <2 x i64>
define %shifttypec2i64 @shift2i64c(%shifttypec2i64 %a, %shifttypec2i64 %b) {		define %shifttypec2i64 @shift2i64c(%shifttypec2i64 %a, %shifttypec2i64 %b) {
entry:		entry:
; SSE2: shift2i64c		; SSE2: shift2i64c
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 4 {{.*}} ashr
; SSE2-CODEGEN: shift2i64c		; SSE2-CODEGEN: shift2i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec2i64 %a , <i64 3, i64 3>		%0 = ashr %shifttypec2i64 %a , <i64 3, i64 3>
ret %shifttypec2i64 %0		ret %shifttypec2i64 %0
}		}

%shifttypec4i64 = type <4 x i64>		%shifttypec4i64 = type <4 x i64>
define %shifttypec4i64 @shift4i64c(%shifttypec4i64 %a, %shifttypec4i64 %b) {		define %shifttypec4i64 @shift4i64c(%shifttypec4i64 %a, %shifttypec4i64 %b) {
entry:		entry:
; SSE2: shift4i64c		; SSE2: shift4i64c
; SSE2: cost of 40 {{.*}} ashr		; SSE2: cost of 8 {{.*}} ashr
; SSE2-CODEGEN: shift4i64c		; SSE2-CODEGEN: shift4i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec4i64 %a , <i64 3, i64 3, i64 3, i64 3>		%0 = ashr %shifttypec4i64 %a , <i64 3, i64 3, i64 3, i64 3>
ret %shifttypec4i64 %0		ret %shifttypec4i64 %0
}		}

%shifttypec8i64 = type <8 x i64>		%shifttypec8i64 = type <8 x i64>
define %shifttypec8i64 @shift8i64c(%shifttypec8i64 %a, %shifttypec8i64 %b) {		define %shifttypec8i64 @shift8i64c(%shifttypec8i64 %a, %shifttypec8i64 %b) {
entry:		entry:
; SSE2: shift8i64c		; SSE2: shift8i64c
; SSE2: cost of 80 {{.*}} ashr		; SSE2: cost of 16 {{.*}} ashr
; SSE2-CODEGEN: shift8i64c		; SSE2-CODEGEN: shift8i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec8i64 %a , <i64 3, i64 3, i64 3, i64 3,		%0 = ashr %shifttypec8i64 %a , <i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3>		i64 3, i64 3, i64 3, i64 3>
ret %shifttypec8i64 %0		ret %shifttypec8i64 %0
}		}

%shifttypec16i64 = type <16 x i64>		%shifttypec16i64 = type <16 x i64>
define %shifttypec16i64 @shift16i64c(%shifttypec16i64 %a, %shifttypec16i64 %b) {		define %shifttypec16i64 @shift16i64c(%shifttypec16i64 %a, %shifttypec16i64 %b) {
entry:		entry:
; SSE2: shift16i64c		; SSE2: shift16i64c
; SSE2: cost of 160 {{.*}} ashr		; SSE2: cost of 32 {{.*}} ashr
; SSE2-CODEGEN: shift16i64c		; SSE2-CODEGEN: shift16i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec16i64 %a , <i64 3, i64 3, i64 3, i64 3,		%0 = ashr %shifttypec16i64 %a , <i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3>		i64 3, i64 3, i64 3, i64 3>
ret %shifttypec16i64 %0		ret %shifttypec16i64 %0
}		}

%shifttypec32i64 = type <32 x i64>		%shifttypec32i64 = type <32 x i64>
define %shifttypec32i64 @shift32i64c(%shifttypec32i64 %a, %shifttypec32i64 %b) {		define %shifttypec32i64 @shift32i64c(%shifttypec32i64 %a, %shifttypec32i64 %b) {
entry:		entry:
; SSE2: shift32i64c		; SSE2: shift32i64c
; SSE2: cost of 320 {{.*}} ashr		; SSE2: cost of 64 {{.*}} ashr
; SSE2-CODEGEN: shift32i64c		; SSE2-CODEGEN: shift32i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec32i64 %a ,<i64 3, i64 3, i64 3, i64 3,		%0 = ashr %shifttypec32i64 %a ,<i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3>		i64 3, i64 3, i64 3, i64 3>
ret %shifttypec32i64 %0		ret %shifttypec32i64 %0
}		}

%shifttypec2i8 = type <2 x i8>		%shifttypec2i8 = type <2 x i8>
define %shifttypec2i8 @shift2i8c(%shifttypec2i8 %a, %shifttypec2i8 %b) {		define %shifttypec2i8 @shift2i8c(%shifttypec2i8 %a, %shifttypec2i8 %b) {
entry:		entry:
; SSE2: shift2i8c		; SSE2: shift2i8c
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 4 {{.*}} ashr
; SSE2-CODEGEN: shift2i8c		; SSE2-CODEGEN: shift2i8c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec2i8 %a , <i8 3, i8 3>		%0 = ashr %shifttypec2i8 %a , <i8 3, i8 3>
ret %shifttypec2i8 %0		ret %shifttypec2i8 %0
}		}

%shifttypec4i8 = type <4 x i8>		%shifttypec4i8 = type <4 x i8>
define %shifttypec4i8 @shift4i8c(%shifttypec4i8 %a, %shifttypec4i8 %b) {		define %shifttypec4i8 @shift4i8c(%shifttypec4i8 %a, %shifttypec4i8 %b) {
entry:		entry:
[56 lines not shown]
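
The updated SSE2 cost expectations for the i64 cases above (4, 8, 16, 32 and 64 for 2, 4, 8, 16 and 32 elements) are consistent with the new table entry of 4 for v2i64 being multiplied by the number of 128-bit pieces the vector is split into. The sketch below illustrates that arithmetic only; it assumes this splitting behaviour, and the function name is hypothetical rather than part of X86TargetTransformInfo.cpp.

```cpp
#include <cassert>

// Hypothetical helper: estimated SSE2 cost of a uniform-constant ashr on
// <numElts x i64>, assuming the type is split into v2i64 pieces and each
// piece costs the table value { ISD::SRA, MVT::v2i64, 4 }.
inline unsigned sse2UniformConstAshrI64Cost(unsigned numElts) {
  assert(numElts >= 2 && numElts % 2 == 0 && "expects whole v2i64 pieces");
  const unsigned costPerV2i64 = 4;
  const unsigned numPieces = numElts / 2;
  return numPieces * costPerV2i64;
}
```

Under that assumption, sse2UniformConstAshrI64Cost(32) reproduces the cost of 64 expected for shift32i64c.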

llvm/trunk/test/CodeGen/X86/vector-shift-ashr-128.ll

	[948 lines not shown]

	;			;
	; Uniform Constant Shifts			; Uniform Constant Shifts
	;			;

	define <2 x i64> @splatconstant_shift_v2i64(<2 x i64> %a) {			define <2 x i64> @splatconstant_shift_v2i64(<2 x i64> %a) {
	; SSE2-LABEL: splatconstant_shift_v2i64:			; SSE2-LABEL: splatconstant_shift_v2i64:
	; SSE2: # BB#0:			; SSE2: # BB#0:
	; SSE2-NEXT: movd %xmm0, %rax			; SSE2-NEXT: movdqa %xmm0, %xmm1
	; SSE2-NEXT: sarq $7, %rax			; SSE2-NEXT: psrad $7, %xmm1
	; SSE2-NEXT: movd %rax, %xmm1			; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
	; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]			; SSE2-NEXT: psrlq $7, %xmm0
	; SSE2-NEXT: movd %xmm0, %rax			; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
	; SSE2-NEXT: sarq $7, %rax			; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
	; SSE2-NEXT: movd %rax, %xmm0
	; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
	; SSE2-NEXT: movdqa %xmm1, %xmm0
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; SSE41-LABEL: splatconstant_shift_v2i64:			; SSE41-LABEL: splatconstant_shift_v2i64:
	; SSE41: # BB#0:			; SSE41: # BB#0:
	; SSE41-NEXT: pextrq $1, %xmm0, %rax			; SSE41-NEXT: movdqa %xmm0, %xmm1
	; SSE41-NEXT: sarq $7, %rax			; SSE41-NEXT: psrad $7, %xmm1
	; SSE41-NEXT: movd %rax, %xmm1			; SSE41-NEXT: psrlq $7, %xmm0
	; SSE41-NEXT: movd %xmm0, %rax			; SSE41-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
	; SSE41-NEXT: sarq $7, %rax
	; SSE41-NEXT: movd %rax, %xmm0
	; SSE41-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
	; SSE41-NEXT: retq			; SSE41-NEXT: retq
	;			;
	; AVX-LABEL: splatconstant_shift_v2i64:			; AVX1-LABEL: splatconstant_shift_v2i64:
	; AVX: # BB#0:			; AVX1: # BB#0:
	; AVX-NEXT: vpextrq $1, %xmm0, %rax			; AVX1-NEXT: vpsrad $7, %xmm0, %xmm1
	; AVX-NEXT: sarq $7, %rax			; AVX1-NEXT: vpsrlq $7, %xmm0, %xmm0
	; AVX-NEXT: vmovq %rax, %xmm1			; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
	; AVX-NEXT: vmovq %xmm0, %rax			; AVX1-NEXT: retq
	; AVX-NEXT: sarq $7, %rax			;
	; AVX-NEXT: vmovq %rax, %xmm0			; AVX2-LABEL: splatconstant_shift_v2i64:
	; AVX-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]			; AVX2: # BB#0:
	; AVX-NEXT: retq			; AVX2-NEXT: vpsrad $7, %xmm0, %xmm1
				; AVX2-NEXT: vpsrlq $7, %xmm0, %xmm0
				; AVX2-NEXT: vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
				; AVX2-NEXT: retq
	%shift = ashr <2 x i64> %a, <i64 7, i64 7>			%shift = ashr <2 x i64> %a, <i64 7, i64 7>
	ret <2 x i64> %shift			ret <2 x i64> %shift
	}			}

	define <4 x i32> @splatconstant_shift_v4i32(<4 x i32> %a) {			define <4 x i32> @splatconstant_shift_v4i32(<4 x i32> %a) {
	; SSE-LABEL: splatconstant_shift_v4i32:			; SSE-LABEL: splatconstant_shift_v4i32:
	; SSE: # BB#0:			; SSE: # BB#0:
	; SSE-NEXT: psrad $5, %xmm0			; SSE-NEXT: psrad $5, %xmm0
	[45 lines not shown]

llvm/trunk/test/CodeGen/X86/vector-shift-ashr-256.ll

	[657 lines not shown]
	;			;
	; Uniform Constant Shifts			; Uniform Constant Shifts
	;			;

	define <4 x i64> @splatconstant_shift_v4i64(<4 x i64> %a) {			define <4 x i64> @splatconstant_shift_v4i64(<4 x i64> %a) {
	; AVX1-LABEL: splatconstant_shift_v4i64:			; AVX1-LABEL: splatconstant_shift_v4i64:
	; AVX1: # BB#0:			; AVX1: # BB#0:
	; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm1			; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm1
	; AVX1-NEXT: vpextrq $1, %xmm1, %rax			; AVX1-NEXT: vpsrad $7, %xmm1, %xmm2
	; AVX1-NEXT: sarq $7, %rax			; AVX1-NEXT: vpsrlq $7, %xmm1, %xmm1
	; AVX1-NEXT: vmovq %rax, %xmm2			; AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
	; AVX1-NEXT: vmovq %xmm1, %rax			; AVX1-NEXT: vpsrad $7, %xmm0, %xmm2
	; AVX1-NEXT: sarq $7, %rax			; AVX1-NEXT: vpsrlq $7, %xmm0, %xmm0
	; AVX1-NEXT: vmovq %rax, %xmm1			; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
	; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
	; AVX1-NEXT: vpextrq $1, %xmm0, %rax
	; AVX1-NEXT: sarq $7, %rax
	; AVX1-NEXT: vmovq %rax, %xmm2
	; AVX1-NEXT: vmovq %xmm0, %rax
	; AVX1-NEXT: sarq $7, %rax
	; AVX1-NEXT: vmovq %rax, %xmm0
	; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: splatconstant_shift_v4i64:			; AVX2-LABEL: splatconstant_shift_v4i64:
	; AVX2: # BB#0:			; AVX2: # BB#0:
	; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX2-NEXT: vpsrad $7, %ymm0, %ymm1
	; AVX2-NEXT: vpextrq $1, %xmm1, %rax			; AVX2-NEXT: vpsrlq $7, %ymm0, %ymm0
	; AVX2-NEXT: sarq $7, %rax			; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7]
	; AVX2-NEXT: vmovq %rax, %xmm2
	; AVX2-NEXT: vmovq %xmm1, %rax
	; AVX2-NEXT: sarq $7, %rax
	; AVX2-NEXT: vmovq %rax, %xmm1
	; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
	; AVX2-NEXT: vpextrq $1, %xmm0, %rax
	; AVX2-NEXT: sarq $7, %rax
	; AVX2-NEXT: vmovq %rax, %xmm2
	; AVX2-NEXT: vmovq %xmm0, %rax
	; AVX2-NEXT: sarq $7, %rax
	; AVX2-NEXT: vmovq %rax, %xmm0
	; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%shift = ashr <4 x i64> %a, <i64 7, i64 7, i64 7, i64 7>			%shift = ashr <4 x i64> %a, <i64 7, i64 7, i64 7, i64 7>
	ret <4 x i64> %shift			ret <4 x i64> %shift
	}			}

	define <8 x i32> @splatconstant_shift_v8i32(<8 x i32> %a) {			define <8 x i32> @splatconstant_shift_v8i32(<8 x i32> %a) {
	; AVX1-LABEL: splatconstant_shift_v8i32:			; AVX1-LABEL: splatconstant_shift_v8i32:
	; AVX1: # BB#0:			; AVX1: # BB#0:
	[59 lines not shown]

llvm/trunk/test/CodeGen/X86/vshift-3.ll

	; RUN: llc < %s -march=x86 -mattr=+sse2 \| FileCheck %s			; RUN: llc < %s -march=x86 -mattr=+sse2 \| FileCheck %s

	; test vector shifts converted to proper SSE2 vector shifts when the shift			; test vector shifts converted to proper SSE2 vector shifts when the shift
	; amounts are the same.			; amounts are the same.

	; Note that x86 does have ashr			; Note that x86 does have ashr

	; shift1a can't use a packed shift
	define void @shift1a(<2 x i64> %val, <2 x i64>* %dst) nounwind {			define void @shift1a(<2 x i64> %val, <2 x i64>* %dst) nounwind {
	entry:			entry:
	; CHECK-LABEL: shift1a:			; CHECK-LABEL: shift1a:
	; CHECK: sarl			; CHECK: psrad $31
	%ashr = ashr <2 x i64> %val, < i64 32, i64 32 >			%ashr = ashr <2 x i64> %val, < i64 32, i64 32 >
	store <2 x i64> %ashr, <2 x i64>* %dst			store <2 x i64> %ashr, <2 x i64>* %dst
	ret void			ret void
	}			}

	define void @shift2a(<4 x i32> %val, <4 x i32>* %dst) nounwind {			define void @shift2a(<4 x i32> %val, <4 x i32>* %dst) nounwind {
	entry:			entry:
	; CHECK-LABEL: shift2a:			; CHECK-LABEL: shift2a:
	[47 lines not shown]

llvm/trunk/test/CodeGen/X86/widen_conv-2.ll

	; RUN: llc < %s -march=x86 -mattr=+sse4.2 \| FileCheck %s			; RUN: llc < %s -march=x86 -mattr=+sse4.2 \| FileCheck %s
	; CHECK: {{cwtl\|movswl}}			; CHECK: psllq $48, %xmm0
	; CHECK: {{cwtl\|movswl}}			; CHECK: psrad $16, %xmm0
				; CHECK: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]

	; sign extension v2i32 to v2i16			; sign extension v2i16 to v2i32

	define void @convert(<2 x i32>* %dst.addr, <2 x i16> %src) nounwind {			define void @convert(<2 x i32>* %dst.addr, <2 x i16> %src) nounwind {
	entry:			entry:
	%signext = sext <2 x i16> %src to <2 x i32> ; <<12 x i8>> [#uses=1]			%signext = sext <2 x i16> %src to <2 x i32> ; <<12 x i8>> [#uses=1]
	store <2 x i32> %signext, <2 x i32>* %dst.addr			store <2 x i32> %signext, <2 x i32>* %dst.addr
	ret void			ret void
	}			}
