This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Vectorized i64 uniform constant SRA shifts
ClosedPublic

Authored by RKSimon on May 10 2015, 4:47 AM.

Details

Reviewers

qcolombet
delena
andreadb

Commits

rG8fbf1c1f4a6f: [X86][SSE] Vectorized i64 uniform constant SRA shifts
rL241514: [X86][SSE] Vectorized i64 uniform constant SRA shifts

Summary

This patch adds vectorization support for uniform constant i64 arithmetic shift right operators.
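
As a rough illustration of the idea behind the patch, the scalar sketch below emulates a 64-bit arithmetic shift right using only a 32-bit arithmetic shift (what psrad provides per lane) and a 64-bit logical shift (psrlq). The helper name `sra64_from_parts` is hypothetical and does not appear in the patch; the two branches mirror the `ShiftAmt >= 32` and `ShiftAmt < 32` cases of the lowering. It assumes that right-shifting a negative signed integer is an arithmetic shift, which is guaranteed from C++20 and is the behaviour of mainstream compilers before that.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the patch): computes a 64-bit arithmetic
// shift right from 32-bit arithmetic shifts and a 64-bit logical shift.
inline uint64_t sra64_from_parts(uint64_t x, unsigned amt) {
  assert(amt < 64 && "shift amount must be less than the element width");
  // Upper 32 bits of the input, viewed as a signed lane.
  int32_t hi = static_cast<int32_t>(static_cast<uint32_t>(x >> 32));

  uint32_t resHi, resLo;
  if (amt >= 32) {
    // Upper half becomes a splat of the sign bit; the lower half is the
    // original upper half arithmetically shifted by (amt - 32).
    resHi = static_cast<uint32_t>(hi >> 31);
    resLo = static_cast<uint32_t>(hi >> (amt - 32));
  } else {
    // Upper half: 32-bit arithmetic shift of the original upper half.
    // Lower half: low 32 bits of the 64-bit logical shift.
    resHi = static_cast<uint32_t>(hi >> amt);
    resLo = static_cast<uint32_t>(x >> amt);
  }
  return (static_cast<uint64_t>(resHi) << 32) | resLo;
}
```

For example, `sra64_from_parts(0x00000003ffffffff, 7)` gives `0x0000000007ffffff`, the value discussed later in the review.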

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 25426. May 10 2015, 4:47 AM

RKSimon retitled this revision from to [X86][SSE] Vectorized i64 uniform constant SRA shifts.

RKSimon updated this object.

RKSimon edited the test plan for this revision.

RKSimon added reviewers: andreadb, delena, qcolombet.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: Unknown Object (MLST).

delena added inline comments. May 10 2015, 6:50 AM

lib/Target/X86/X86ISelLowering.cpp
45–1	I think I fixed a bug here and removed AVX512. Could you please check?
test/CodeGen/X86/vector-sext.ll
111 ↗	(On Diff #25426)	I see this code as two pmovsxdq instructions with one shuffle between them. On Windows, it looks like:
	pmovsxdq (%rcx), %xmm0
	pmovsxdq 8(%rcx), %xmm1
	retq
On Linux your parameter is in an xmm register, so you need one shuffle with <2, 3, undef, undef>.

Thanks Elena, I'll get extra AVX512 tests added for review.

lib/Target/X86/X86ISelLowering.cpp
45–1	Yes I'll add a proper AVX512 test.
test/CodeGen/X86/vector-sext.ll
111 ↗	(On Diff #25426)	I have some work in progress patches to improve pmovsx* support - the problem is that SIGN_EXTEND_INREG / SIGN_EXTEND_VECTOR_INREG is a mess and will take a while to sort out. Making this i64 SRA patch was an easy first step so at least we're not transferring between xmm and gprs so much.

I think that you did not update the code before creating the patch.
See 235993.

Elena

Dropped AVX512 support for i64 SRA in LowerScalarVariableShift as noted by Elena - added FIXME note.

You should generate different code for SSE 4.1 and AVX, according to my previous comment.

In D9645#169914, @delena wrote:

You should generate different code for SSE 4.1 and AVX, according to my previous comment.

Thanks - I have better support for pmovsx* coming in a later patch - it will be much easier to implement once this patch is in.

I disagree with this approach. You can prepare a separate patch for "shift-right-64bits" optimization.
But you should not use shift-right in SEXT and generate 10 instructions instead of 3.

In D9645#170039, @delena wrote:

I disagree with this approach. You can prepare a separate patch for "shift-right-64bits" optimization.
But you should not use shift-right in SEXT and generate 10 instructions instead of 3.

I haven't intentionally added these shift-rights to the sext tests - they're just a result of the default sign-extension expansion from SIGN_EXTEND_INREG. I still think this is an improvement over where we are now, but I'll push the pmovsx* patch up for review as soon as I can and revisit this afterwards.

RKSimon mentioned this in D9923: Adjust the cost of vectorized SHL/SRL/SRA. May 22 2015, 3:46 AM

Refreshed this patch now that the previous edge cases (sint_to_fp, and sext) have been dealt with properly.

I've updated the patch to work with Elena's 'SupportedVector' tests and moved the shift lowering code into LowerScalarImmediateShift directly.

delena added inline comments. Jul 4 2015, 11:27 PM

test/CodeGen/X86/vector-shift-ashr-128.ll
988	Hi Simon, I think that the result here will be incorrect. Take a positive 64-bit number, (2^34)-1. After an arithmetic shift right by 7 you should get (2^27)-1. But "vpsrad" will treat the source as a negative 32-bit value, so you'll see (2^32)-1 in %xmm1, while the correct result will be in %xmm0 after "vpsrlq".
	Upper = (2^32)-1
	Lower = (2^27)-1
	Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower, {4, 1, 6, 3}); <= the result is incorrect

RKSimon added inline comments. Jul 5 2015, 7:56 AM

test/CodeGen/X86/vector-shift-ashr-128.ll
988	Hi Elena, taking your example (as a v1i64 for simplicity):
	in: 00000003ffffffff (ffffffff 00000003) as v2i32
	upper: ashr32(in, 7): 00000000ffffffff (ffffffff 00000000), 2nd 32-bit lane used
	lower: lshr64(in, 7): 0000000007ffffff (07ffffff 00000000), 1st 32-bit lane used
	shuffle32(ashr32, lshr64, 4,1,3,2): 0000000007ffffff (07ffffff 00000000)
Which I believe is correct, no?

delena added inline comments. Jul 5 2015, 11:55 PM

test/CodeGen/X86/vector-shift-ashr-128.ll
988	yes, you are right. I checked again and I don't see any problem. In my opinion, you can commit this code.
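
The exchange above can be checked with a small lane-level model. The sketch below is only an illustration, not code from the patch: `V4i32`, `shuffle`, and `sra_v2i64` are hypothetical names, and the model assumes little-endian lane order and an arithmetic right shift for negative signed values. It follows the `ShiftAmt < 32` path: a per-lane 32-bit arithmetic shift (Upper), a per-element 64-bit logical shift (Lower), and the `{4, 1, 6, 3}` shuffle, where indices 0-3 select from the first operand and 4-7 from the second.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V4i32 = std::array<uint32_t, 4>;

// Hypothetical stand-in for a two-operand shuffle: indices 0-3 pick lanes
// from `a`, indices 4-7 pick lanes from `b`.
inline V4i32 shuffle(const V4i32 &a, const V4i32 &b,
                     const std::array<int, 4> &mask) {
  V4i32 r{};
  for (int i = 0; i < 4; ++i)
    r[i] = mask[i] < 4 ? a[mask[i]] : b[mask[i] - 4];
  return r;
}

// Models the ShiftAmt < 32 path for a v2i64 viewed as four 32-bit lanes.
inline std::array<uint64_t, 2> sra_v2i64(const std::array<uint64_t, 2> &in,
                                         unsigned amt) {
  assert(amt < 32 && "this sketch only models the ShiftAmt < 32 path");
  V4i32 upper{}, lower{};
  for (int e = 0; e < 2; ++e) {
    uint32_t lo = static_cast<uint32_t>(in[e]);
    uint32_t hi = static_cast<uint32_t>(in[e] >> 32);
    // psrad-like: every 32-bit lane is shifted arithmetically, including the
    // low lanes whose results are discarded by the shuffle.
    upper[2 * e] = static_cast<uint32_t>(static_cast<int32_t>(lo) >> amt);
    upper[2 * e + 1] = static_cast<uint32_t>(static_cast<int32_t>(hi) >> amt);
    // psrlq-like: each 64-bit element is shifted logically.
    uint64_t q = in[e] >> amt;
    lower[2 * e] = static_cast<uint32_t>(q);
    lower[2 * e + 1] = static_cast<uint32_t>(q >> 32);
  }
  // Even lanes come from Lower (indices 4 and 6), odd lanes from Upper.
  V4i32 ex = shuffle(upper, lower, {4, 1, 6, 3});
  return {(static_cast<uint64_t>(ex[1]) << 32) | ex[0],
          (static_cast<uint64_t>(ex[3]) << 32) | ex[2]};
}
```

In this model the low lane of Upper does hold ffffffff for the input 00000003ffffffff, as Elena noted, but the mask never selects it, so the combined element is 0000000007ffffff.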

Closed by commit rL241514: [X86][SSE] Vectorized i64 uniform constant SRA shifts (authored by RKSimon). Jul 6 2015, 3:35 PM

This revision was automatically updated to reflect the committed changes.

Thanks Elena

Revision Contents

Path	Base revision	Size
lib/Target/X86/X86ISelLowering.cpp	241335	49 lines
lib/Target/X86/X86TargetTransformInfo.cpp	241335	1 line
test/Analysis/CostModel/X86/testshiftashr.ll	241335	32 lines
test/CodeGen/X86/vector-shift-ashr-128.ll	241335	49 lines
test/CodeGen/X86/vector-shift-ashr-256.ll	241335	41 lines
test/CodeGen/X86/vshift-3.ll	241335	5 lines
test/CodeGen/X86/widen_conv-2.ll	241335	7 lines

Diff 29008

lib/Target/X86/X86ISelLowering.cpp

...
if (Subtarget->hasSSE2()) {
// In the customized shift lowering, the legal cases in AVX2 will be		// In the customized shift lowering, the legal cases in AVX2 will be
// recognized.		// recognized.
setOperationAction(ISD::SRL, MVT::v2i64, Custom);		setOperationAction(ISD::SRL, MVT::v2i64, Custom);
setOperationAction(ISD::SRL, MVT::v4i32, Custom);		setOperationAction(ISD::SRL, MVT::v4i32, Custom);

setOperationAction(ISD::SHL, MVT::v2i64, Custom);		setOperationAction(ISD::SHL, MVT::v2i64, Custom);
setOperationAction(ISD::SHL, MVT::v4i32, Custom);		setOperationAction(ISD::SHL, MVT::v4i32, Custom);

		setOperationAction(ISD::SRA, MVT::v2i64, Custom);
setOperationAction(ISD::SRA, MVT::v4i32, Custom);		setOperationAction(ISD::SRA, MVT::v4i32, Custom);
}		}

if (!Subtarget->useSoftFloat() && Subtarget->hasFp256()) {		if (!Subtarget->useSoftFloat() && Subtarget->hasFp256()) {
addRegisterClass(MVT::v32i8, &X86::VR256RegClass);		addRegisterClass(MVT::v32i8, &X86::VR256RegClass);
addRegisterClass(MVT::v16i16, &X86::VR256RegClass);		addRegisterClass(MVT::v16i16, &X86::VR256RegClass);
addRegisterClass(MVT::v8i32, &X86::VR256RegClass);		addRegisterClass(MVT::v8i32, &X86::VR256RegClass);
addRegisterClass(MVT::v8f32, &X86::VR256RegClass);		addRegisterClass(MVT::v8f32, &X86::VR256RegClass);
...
if (!Subtarget->useSoftFloat() && Subtarget->hasFp256()) {
// In the customized shift lowering, the legal cases in AVX2 will be		// In the customized shift lowering, the legal cases in AVX2 will be
// recognized.		// recognized.
setOperationAction(ISD::SRL, MVT::v4i64, Custom);		setOperationAction(ISD::SRL, MVT::v4i64, Custom);
setOperationAction(ISD::SRL, MVT::v8i32, Custom);		setOperationAction(ISD::SRL, MVT::v8i32, Custom);

setOperationAction(ISD::SHL, MVT::v4i64, Custom);		setOperationAction(ISD::SHL, MVT::v4i64, Custom);
setOperationAction(ISD::SHL, MVT::v8i32, Custom);		setOperationAction(ISD::SHL, MVT::v8i32, Custom);

		setOperationAction(ISD::SRA, MVT::v4i64, Custom);
setOperationAction(ISD::SRA, MVT::v8i32, Custom);		setOperationAction(ISD::SRA, MVT::v8i32, Custom);

// Custom lower several nodes for 256-bit types.		// Custom lower several nodes for 256-bit types.
for (MVT VT : MVT::vector_valuetypes()) {		for (MVT VT : MVT::vector_valuetypes()) {
if (VT.getScalarSizeInBits() >= 32) {		if (VT.getScalarSizeInBits() >= 32) {
setOperationAction(ISD::MLOAD, VT, Legal);		setOperationAction(ISD::MLOAD, VT, Legal);
setOperationAction(ISD::MSTORE, VT, Legal);		setOperationAction(ISD::MSTORE, VT, Legal);
}		}
...
static SDValue LowerScalarImmediateShift(SDValue Op, SelectionDAG &DAG,
MVT VT = Op.getSimpleValueType();		MVT VT = Op.getSimpleValueType();
SDLoc dl(Op);		SDLoc dl(Op);
SDValue R = Op.getOperand(0);		SDValue R = Op.getOperand(0);
SDValue Amt = Op.getOperand(1);		SDValue Amt = Op.getOperand(1);

unsigned X86Opc = (Op.getOpcode() == ISD::SHL) ? X86ISD::VSHLI :		unsigned X86Opc = (Op.getOpcode() == ISD::SHL) ? X86ISD::VSHLI :
(Op.getOpcode() == ISD::SRL) ? X86ISD::VSRLI : X86ISD::VSRAI;		(Op.getOpcode() == ISD::SRL) ? X86ISD::VSRLI : X86ISD::VSRAI;

		auto ArithmeticShiftRight64 = [&](uint64_t ShiftAmt) {
		  assert((VT == MVT::v2i64 || VT == MVT::v4i64) && "Unexpected SRA type");
		  MVT ExVT = MVT::getVectorVT(MVT::i32, VT.getVectorNumElements() * 2);
		  SDValue Ex = DAG.getBitcast(ExVT, R);

		  if (ShiftAmt >= 32) {
		    // Splat sign to upper i32 dst, and SRA upper i32 src to lower i32.
		    SDValue Upper =
		        getTargetVShiftByConstNode(X86ISD::VSRAI, dl, ExVT, Ex, 31, DAG);
		    SDValue Lower = getTargetVShiftByConstNode(X86ISD::VSRAI, dl, ExVT, Ex,
		                                               ShiftAmt - 32, DAG);
		    if (VT == MVT::v2i64)
		      Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower, {5, 1, 7, 3});
		    if (VT == MVT::v4i64)
		      Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower,
		                                {9, 1, 11, 3, 13, 5, 15, 7});
		  } else {
		    // SRA upper i32, SRL whole i64 and select lower i32.
		    SDValue Upper = getTargetVShiftByConstNode(X86ISD::VSRAI, dl, ExVT, Ex,
		                                               ShiftAmt, DAG);
		    SDValue Lower =
		        getTargetVShiftByConstNode(X86ISD::VSRLI, dl, VT, R, ShiftAmt, DAG);
		    Lower = DAG.getBitcast(ExVT, Lower);
		    if (VT == MVT::v2i64)
		      Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower, {4, 1, 6, 3});
		    if (VT == MVT::v4i64)
		      Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower,
		                                {8, 1, 10, 3, 12, 5, 14, 7});
		  }
		  return DAG.getBitcast(VT, Ex);
		};

// Optimize shl/srl/sra with constant shift amount.		// Optimize shl/srl/sra with constant shift amount.
if (auto *BVAmt = dyn_cast<BuildVectorSDNode>(Amt)) {		if (auto *BVAmt = dyn_cast<BuildVectorSDNode>(Amt)) {
if (auto *ShiftConst = BVAmt->getConstantSplatNode()) {		if (auto *ShiftConst = BVAmt->getConstantSplatNode()) {
uint64_t ShiftAmt = ShiftConst->getZExtValue();		uint64_t ShiftAmt = ShiftConst->getZExtValue();

if (SupportedVectorShiftWithImm(VT, Subtarget, Op.getOpcode()))		if (SupportedVectorShiftWithImm(VT, Subtarget, Op.getOpcode()))
return getTargetVShiftByConstNode(X86Opc, dl, VT, R, ShiftAmt, DAG);		return getTargetVShiftByConstNode(X86Opc, dl, VT, R, ShiftAmt, DAG);

		// i64 SRA needs to be performed as partial shifts.
		if ((VT == MVT::v2i64 || (Subtarget->hasInt256() && VT == MVT::v4i64)) &&
		    Op.getOpcode() == ISD::SRA)
		  return ArithmeticShiftRight64(ShiftAmt);

if (VT == MVT::v16i8 || (Subtarget->hasInt256() && VT == MVT::v32i8)) {		if (VT == MVT::v16i8 || (Subtarget->hasInt256() && VT == MVT::v32i8)) {
unsigned NumElts = VT.getVectorNumElements();		unsigned NumElts = VT.getVectorNumElements();
MVT ShiftVT = MVT::getVectorVT(MVT::i16, NumElts / 2);		MVT ShiftVT = MVT::getVectorVT(MVT::i16, NumElts / 2);

if (Op.getOpcode() == ISD::SHL) {		if (Op.getOpcode() == ISD::SHL) {
// Simple i8 add case		// Simple i8 add case
if (ShiftAmt == 1)		if (ShiftAmt == 1)
return DAG.getNode(ISD::ADD, dl, VT, R, R);		return DAG.getNode(ISD::ADD, dl, VT, R, R);
...
for (unsigned i = Ratio; i != Amt.getNumOperands(); i += Ratio) {
if (!C)		if (!C)
return SDValue();		return SDValue();
// 6 == Log2(64)		// 6 == Log2(64)
ShAmt |= C->getZExtValue() << (j * (1 << (6 - RatioInLog2)));		ShAmt |= C->getZExtValue() << (j * (1 << (6 - RatioInLog2)));
}		}
if (ShAmt != ShiftAmt)		if (ShAmt != ShiftAmt)
return SDValue();		return SDValue();
}		}

		if (SupportedVectorShiftWithImm(VT, Subtarget, Op.getOpcode()))
return getTargetVShiftByConstNode(X86Opc, dl, VT, R, ShiftAmt, DAG);		return getTargetVShiftByConstNode(X86Opc, dl, VT, R, ShiftAmt, DAG);

		if (Op.getOpcode() == ISD::SRA)
		return ArithmeticShiftRight64(ShiftAmt);
}		}

return SDValue();		return SDValue();
}		}

static SDValue LowerScalarVariableShift(SDValue Op, SelectionDAG &DAG,		static SDValue LowerScalarVariableShift(SDValue Op, SelectionDAG &DAG,
const X86Subtarget* Subtarget) {		const X86Subtarget* Subtarget) {
MVT VT = Op.getSimpleValueType();		MVT VT = Op.getSimpleValueType();
...
if (!Subtarget->is64Bit() && VT == MVT::v2i64 &&
std::vector<SDValue> Vals(Ratio);		std::vector<SDValue> Vals(Ratio);
for (unsigned i = 0; i != Ratio; ++i)		for (unsigned i = 0; i != Ratio; ++i)
Vals[i] = Amt.getOperand(i);		Vals[i] = Amt.getOperand(i);
for (unsigned i = Ratio; i != Amt.getNumOperands(); i += Ratio) {		for (unsigned i = Ratio; i != Amt.getNumOperands(); i += Ratio) {
for (unsigned j = 0; j != Ratio; ++j)		for (unsigned j = 0; j != Ratio; ++j)
if (Vals[j] != Amt.getOperand(i + j))		if (Vals[j] != Amt.getOperand(i + j))
return SDValue();		return SDValue();
}		}
		if (SupportedVectorShiftWithBaseAmnt(VT, Subtarget, Op.getOpcode()))
return DAG.getNode(X86OpcV, dl, VT, R, Op.getOperand(1));		return DAG.getNode(X86OpcV, dl, VT, R, Op.getOperand(1));
}		}
return SDValue();		return SDValue();
}		}

static SDValue LowerShift(SDValue Op, const X86Subtarget* Subtarget,		static SDValue LowerShift(SDValue Op, const X86Subtarget* Subtarget,
SelectionDAG &DAG) {		SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();		MVT VT = Op.getSimpleValueType();
SDLoc dl(Op);		SDLoc dl(Op);
...

lib/Target/X86/X86TargetTransformInfo.cpp

...
SSE2UniformConstCostTable[] = {
{ ISD::SRL, MVT::v16i8, 1 }, // psrlw.		{ ISD::SRL, MVT::v16i8, 1 }, // psrlw.
{ ISD::SRL, MVT::v8i16, 1 }, // psrlw.		{ ISD::SRL, MVT::v8i16, 1 }, // psrlw.
{ ISD::SRL, MVT::v4i32, 1 }, // psrld.		{ ISD::SRL, MVT::v4i32, 1 }, // psrld.
{ ISD::SRL, MVT::v2i64, 1 }, // psrlq.		{ ISD::SRL, MVT::v2i64, 1 }, // psrlq.

{ ISD::SRA, MVT::v16i8, 4 }, // psrlw, pand, pxor, psubb.		{ ISD::SRA, MVT::v16i8, 4 }, // psrlw, pand, pxor, psubb.
{ ISD::SRA, MVT::v8i16, 1 }, // psraw.		{ ISD::SRA, MVT::v8i16, 1 }, // psraw.
{ ISD::SRA, MVT::v4i32, 1 }, // psrad.		{ ISD::SRA, MVT::v4i32, 1 }, // psrad.
		{ ISD::SRA, MVT::v2i64, 4 }, // 2 x psrad + shuffle.

{ ISD::SDIV, MVT::v8i16, 6 }, // pmulhw sequence		{ ISD::SDIV, MVT::v8i16, 6 }, // pmulhw sequence
{ ISD::UDIV, MVT::v8i16, 6 }, // pmulhuw sequence		{ ISD::UDIV, MVT::v8i16, 6 }, // pmulhuw sequence
{ ISD::SDIV, MVT::v4i32, 19 }, // pmuludq sequence		{ ISD::SDIV, MVT::v4i32, 19 }, // pmuludq sequence
{ ISD::UDIV, MVT::v4i32, 15 }, // pmuludq sequence		{ ISD::UDIV, MVT::v4i32, 15 }, // pmuludq sequence
};		};

if (Op2Info == TargetTransformInfo::OK_UniformConstantValue &&		if (Op2Info == TargetTransformInfo::OK_UniformConstantValue &&
...

test/Analysis/CostModel/X86/testshiftashr.ll

...
}		}

; Test shift by a constant a value.		; Test shift by a constant a value.

%shifttypec = type <2 x i16>		%shifttypec = type <2 x i16>
define %shifttypec @shift2i16const(%shifttypec %a, %shifttypec %b) {		define %shifttypec @shift2i16const(%shifttypec %a, %shifttypec %b) {
entry:		entry:
; SSE2: shift2i16const		; SSE2: shift2i16const
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 4 {{.*}} ashr
; SSE2-CODEGEN: shift2i16const		; SSE2-CODEGEN: shift2i16const
; SSE2-CODEGEN: sarq $		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec %a , <i16 3, i16 3>		%0 = ashr %shifttypec %a , <i16 3, i16 3>
ret %shifttypec %0		ret %shifttypec %0
}		}

%shifttypec4i16 = type <4 x i16>		%shifttypec4i16 = type <4 x i16>
define %shifttypec4i16 @shift4i16const(%shifttypec4i16 %a, %shifttypec4i16 %b) {		define %shifttypec4i16 @shift4i16const(%shifttypec4i16 %a, %shifttypec4i16 %b) {
entry:		entry:
...
%0 = ashr %shifttypec32i16 %a , <i16 3, i16 3, i16 3, i16 3,
i16 3, i16 3, i16 3, i16 3>		i16 3, i16 3, i16 3, i16 3>
ret %shifttypec32i16 %0		ret %shifttypec32i16 %0
}		}

%shifttypec2i32 = type <2 x i32>		%shifttypec2i32 = type <2 x i32>
define %shifttypec2i32 @shift2i32c(%shifttypec2i32 %a, %shifttypec2i32 %b) {		define %shifttypec2i32 @shift2i32c(%shifttypec2i32 %a, %shifttypec2i32 %b) {
entry:		entry:
; SSE2: shift2i32c		; SSE2: shift2i32c
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 4 {{.*}} ashr
; SSE2-CODEGEN: shift2i32c		; SSE2-CODEGEN: shift2i32c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec2i32 %a , <i32 3, i32 3>		%0 = ashr %shifttypec2i32 %a , <i32 3, i32 3>
ret %shifttypec2i32 %0		ret %shifttypec2i32 %0
}		}

%shifttypec4i32 = type <4 x i32>		%shifttypec4i32 = type <4 x i32>
define %shifttypec4i32 @shift4i32c(%shifttypec4i32 %a, %shifttypec4i32 %b) {		define %shifttypec4i32 @shift4i32c(%shifttypec4i32 %a, %shifttypec4i32 %b) {
entry:		entry:
...
%0 = ashr %shifttypec32i32 %a , <i32 3, i32 3, i32 3, i32 3,
i32 3, i32 3, i32 3, i32 3>		i32 3, i32 3, i32 3, i32 3>
ret %shifttypec32i32 %0		ret %shifttypec32i32 %0
}		}

%shifttypec2i64 = type <2 x i64>		%shifttypec2i64 = type <2 x i64>
define %shifttypec2i64 @shift2i64c(%shifttypec2i64 %a, %shifttypec2i64 %b) {		define %shifttypec2i64 @shift2i64c(%shifttypec2i64 %a, %shifttypec2i64 %b) {
entry:		entry:
; SSE2: shift2i64c		; SSE2: shift2i64c
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 4 {{.*}} ashr
; SSE2-CODEGEN: shift2i64c		; SSE2-CODEGEN: shift2i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec2i64 %a , <i64 3, i64 3>		%0 = ashr %shifttypec2i64 %a , <i64 3, i64 3>
ret %shifttypec2i64 %0		ret %shifttypec2i64 %0
}		}

%shifttypec4i64 = type <4 x i64>		%shifttypec4i64 = type <4 x i64>
define %shifttypec4i64 @shift4i64c(%shifttypec4i64 %a, %shifttypec4i64 %b) {		define %shifttypec4i64 @shift4i64c(%shifttypec4i64 %a, %shifttypec4i64 %b) {
entry:		entry:
; SSE2: shift4i64c		; SSE2: shift4i64c
; SSE2: cost of 40 {{.*}} ashr		; SSE2: cost of 8 {{.*}} ashr
; SSE2-CODEGEN: shift4i64c		; SSE2-CODEGEN: shift4i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec4i64 %a , <i64 3, i64 3, i64 3, i64 3>		%0 = ashr %shifttypec4i64 %a , <i64 3, i64 3, i64 3, i64 3>
ret %shifttypec4i64 %0		ret %shifttypec4i64 %0
}		}

%shifttypec8i64 = type <8 x i64>		%shifttypec8i64 = type <8 x i64>
define %shifttypec8i64 @shift8i64c(%shifttypec8i64 %a, %shifttypec8i64 %b) {		define %shifttypec8i64 @shift8i64c(%shifttypec8i64 %a, %shifttypec8i64 %b) {
entry:		entry:
; SSE2: shift8i64c		; SSE2: shift8i64c
; SSE2: cost of 80 {{.*}} ashr		; SSE2: cost of 16 {{.*}} ashr
; SSE2-CODEGEN: shift8i64c		; SSE2-CODEGEN: shift8i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec8i64 %a , <i64 3, i64 3, i64 3, i64 3,		%0 = ashr %shifttypec8i64 %a , <i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3>		i64 3, i64 3, i64 3, i64 3>
ret %shifttypec8i64 %0		ret %shifttypec8i64 %0
}		}

%shifttypec16i64 = type <16 x i64>		%shifttypec16i64 = type <16 x i64>
define %shifttypec16i64 @shift16i64c(%shifttypec16i64 %a, %shifttypec16i64 %b) {		define %shifttypec16i64 @shift16i64c(%shifttypec16i64 %a, %shifttypec16i64 %b) {
entry:		entry:
; SSE2: shift16i64c		; SSE2: shift16i64c
; SSE2: cost of 160 {{.*}} ashr		; SSE2: cost of 32 {{.*}} ashr
; SSE2-CODEGEN: shift16i64c		; SSE2-CODEGEN: shift16i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec16i64 %a , <i64 3, i64 3, i64 3, i64 3,		%0 = ashr %shifttypec16i64 %a , <i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3>		i64 3, i64 3, i64 3, i64 3>
ret %shifttypec16i64 %0		ret %shifttypec16i64 %0
}		}

%shifttypec32i64 = type <32 x i64>		%shifttypec32i64 = type <32 x i64>
define %shifttypec32i64 @shift32i64c(%shifttypec32i64 %a, %shifttypec32i64 %b) {		define %shifttypec32i64 @shift32i64c(%shifttypec32i64 %a, %shifttypec32i64 %b) {
entry:		entry:
; SSE2: shift32i64c		; SSE2: shift32i64c
; SSE2: cost of 320 {{.*}} ashr		; SSE2: cost of 64 {{.*}} ashr
; SSE2-CODEGEN: shift32i64c		; SSE2-CODEGEN: shift32i64c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec32i64 %a ,<i64 3, i64 3, i64 3, i64 3,		%0 = ashr %shifttypec32i64 %a ,<i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3,		i64 3, i64 3, i64 3, i64 3,
i64 3, i64 3, i64 3, i64 3>		i64 3, i64 3, i64 3, i64 3>
ret %shifttypec32i64 %0		ret %shifttypec32i64 %0
}		}

%shifttypec2i8 = type <2 x i8>		%shifttypec2i8 = type <2 x i8>
define %shifttypec2i8 @shift2i8c(%shifttypec2i8 %a, %shifttypec2i8 %b) {		define %shifttypec2i8 @shift2i8c(%shifttypec2i8 %a, %shifttypec2i8 %b) {
entry:		entry:
; SSE2: shift2i8c		; SSE2: shift2i8c
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 4 {{.*}} ashr
; SSE2-CODEGEN: shift2i8c		; SSE2-CODEGEN: shift2i8c
; SSE2-CODEGEN: sarq $3		; SSE2-CODEGEN: psrad $3

%0 = ashr %shifttypec2i8 %a , <i8 3, i8 3>		%0 = ashr %shifttypec2i8 %a , <i8 3, i8 3>
ret %shifttypec2i8 %0		ret %shifttypec2i8 %0
}		}

%shifttypec4i8 = type <4 x i8>		%shifttypec4i8 = type <4 x i8>
define %shifttypec4i8 @shift4i8c(%shifttypec4i8 %a, %shifttypec4i8 %b) {		define %shifttypec4i8 @shift4i8c(%shifttypec4i8 %a, %shifttypec4i8 %b) {
entry:		entry:
...

test/CodeGen/X86/vector-shift-ashr-128.ll

	...

	;			;
	; Uniform Constant Shifts			; Uniform Constant Shifts
	;			;

	define <2 x i64> @splatconstant_shift_v2i64(<2 x i64> %a) {			define <2 x i64> @splatconstant_shift_v2i64(<2 x i64> %a) {
	; SSE2-LABEL: splatconstant_shift_v2i64:			; SSE2-LABEL: splatconstant_shift_v2i64:
	; SSE2: # BB#0:			; SSE2: # BB#0:
	; SSE2-NEXT: movd %xmm0, %rax			; SSE2-NEXT: movdqa %xmm0, %xmm1
	; SSE2-NEXT: sarq $7, %rax			; SSE2-NEXT: psrad $7, %xmm1
	; SSE2-NEXT: movd %rax, %xmm1			; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
	; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]			; SSE2-NEXT: psrlq $7, %xmm0
	; SSE2-NEXT: movd %xmm0, %rax			; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
	; SSE2-NEXT: sarq $7, %rax			; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
	; SSE2-NEXT: movd %rax, %xmm0
	; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
	; SSE2-NEXT: movdqa %xmm1, %xmm0
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; SSE41-LABEL: splatconstant_shift_v2i64:			; SSE41-LABEL: splatconstant_shift_v2i64:
	; SSE41: # BB#0:			; SSE41: # BB#0:
	; SSE41-NEXT: pextrq $1, %xmm0, %rax			; SSE41-NEXT: movdqa %xmm0, %xmm1
	; SSE41-NEXT: sarq $7, %rax			; SSE41-NEXT: psrad $7, %xmm1
	; SSE41-NEXT: movd %rax, %xmm1			; SSE41-NEXT: psrlq $7, %xmm0
	; SSE41-NEXT: movd %xmm0, %rax			; SSE41-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
	; SSE41-NEXT: sarq $7, %rax
	; SSE41-NEXT: movd %rax, %xmm0
	; SSE41-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
	; SSE41-NEXT: retq			; SSE41-NEXT: retq
	;			;
	; AVX-LABEL: splatconstant_shift_v2i64:			; AVX1-LABEL: splatconstant_shift_v2i64:
	; AVX: # BB#0:			; AVX1: # BB#0:
	; AVX-NEXT: vpextrq $1, %xmm0, %rax			; AVX1-NEXT: vpsrad $7, %xmm0, %xmm1
	; AVX-NEXT: sarq $7, %rax			; AVX1-NEXT: vpsrlq $7, %xmm0, %xmm0
	; AVX-NEXT: vmovq %rax, %xmm1			; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
	; AVX-NEXT: vmovq %xmm0, %rax			; AVX1-NEXT: retq
	; AVX-NEXT: sarq $7, %rax			;
	; AVX-NEXT: vmovq %rax, %xmm0			; AVX2-LABEL: splatconstant_shift_v2i64:
	; AVX-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]			; AVX2: # BB#0:
	; AVX-NEXT: retq			; AVX2-NEXT: vpsrad $7, %xmm0, %xmm1
				; AVX2-NEXT: vpsrlq $7, %xmm0, %xmm0
				; AVX2-NEXT: vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
				; AVX2-NEXT: retq
	%shift = ashr <2 x i64> %a, <i64 7, i64 7>			%shift = ashr <2 x i64> %a, <i64 7, i64 7>
	ret <2 x i64> %shift			ret <2 x i64> %shift
	}			}

	define <4 x i32> @splatconstant_shift_v4i32(<4 x i32> %a) {			define <4 x i32> @splatconstant_shift_v4i32(<4 x i32> %a) {
	; SSE-LABEL: splatconstant_shift_v4i32:			; SSE-LABEL: splatconstant_shift_v4i32:
	; SSE: # BB#0:			; SSE: # BB#0:
	; SSE-NEXT: psrad $5, %xmm0			; SSE-NEXT: psrad $5, %xmm0
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: splatconstant_shift_v4i32:			; AVX-LABEL: splatconstant_shift_v4i32:
	...

test/CodeGen/X86/vector-shift-ashr-256.ll

	...
	;			;
	; Uniform Constant Shifts			; Uniform Constant Shifts
	;			;

	define <4 x i64> @splatconstant_shift_v4i64(<4 x i64> %a) {			define <4 x i64> @splatconstant_shift_v4i64(<4 x i64> %a) {
	; AVX1-LABEL: splatconstant_shift_v4i64:			; AVX1-LABEL: splatconstant_shift_v4i64:
	; AVX1: # BB#0:			; AVX1: # BB#0:
	; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm1			; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm1
	; AVX1-NEXT: vpextrq $1, %xmm1, %rax			; AVX1-NEXT: vpsrad $7, %xmm1, %xmm2
	; AVX1-NEXT: sarq $7, %rax			; AVX1-NEXT: vpsrlq $7, %xmm1, %xmm1
	; AVX1-NEXT: vmovq %rax, %xmm2			; AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
	; AVX1-NEXT: vmovq %xmm1, %rax			; AVX1-NEXT: vpsrad $7, %xmm0, %xmm2
	; AVX1-NEXT: sarq $7, %rax			; AVX1-NEXT: vpsrlq $7, %xmm0, %xmm0
	; AVX1-NEXT: vmovq %rax, %xmm1			; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
	; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
	; AVX1-NEXT: vpextrq $1, %xmm0, %rax
	; AVX1-NEXT: sarq $7, %rax
	; AVX1-NEXT: vmovq %rax, %xmm2
	; AVX1-NEXT: vmovq %xmm0, %rax
	; AVX1-NEXT: sarq $7, %rax
	; AVX1-NEXT: vmovq %rax, %xmm0
	; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: splatconstant_shift_v4i64:			; AVX2-LABEL: splatconstant_shift_v4i64:
	; AVX2: # BB#0:			; AVX2: # BB#0:
	; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX2-NEXT: vpsrad $7, %ymm0, %ymm1
	; AVX2-NEXT: vpextrq $1, %xmm1, %rax			; AVX2-NEXT: vpsrlq $7, %ymm0, %ymm0
	; AVX2-NEXT: sarq $7, %rax			; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7]
	; AVX2-NEXT: vmovq %rax, %xmm2
	; AVX2-NEXT: vmovq %xmm1, %rax
	; AVX2-NEXT: sarq $7, %rax
	; AVX2-NEXT: vmovq %rax, %xmm1
	; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
	; AVX2-NEXT: vpextrq $1, %xmm0, %rax
	; AVX2-NEXT: sarq $7, %rax
	; AVX2-NEXT: vmovq %rax, %xmm2
	; AVX2-NEXT: vmovq %xmm0, %rax
	; AVX2-NEXT: sarq $7, %rax
	; AVX2-NEXT: vmovq %rax, %xmm0
	; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%shift = ashr <4 x i64> %a, <i64 7, i64 7, i64 7, i64 7>			%shift = ashr <4 x i64> %a, <i64 7, i64 7, i64 7, i64 7>
	ret <4 x i64> %shift			ret <4 x i64> %shift
	}			}

	define <8 x i32> @splatconstant_shift_v8i32(<8 x i32> %a) {			define <8 x i32> @splatconstant_shift_v8i32(<8 x i32> %a) {
	; AVX1-LABEL: splatconstant_shift_v8i32:			; AVX1-LABEL: splatconstant_shift_v8i32:
	; AVX1: # BB#0:			; AVX1: # BB#0:
	...

test/CodeGen/X86/vshift-3.ll

	; RUN: llc < %s -march=x86 -mattr=+sse2 | FileCheck %s			; RUN: llc < %s -march=x86 -mattr=+sse2 | FileCheck %s

	; test vector shifts converted to proper SSE2 vector shifts when the shift			; test vector shifts converted to proper SSE2 vector shifts when the shift
	; amounts are the same.			; amounts are the same.

	; Note that x86 does have ashr			; Note that x86 does have ashr

	; shift1a can't use a packed shift
	define void @shift1a(<2 x i64> %val, <2 x i64>* %dst) nounwind {			define void @shift1a(<2 x i64> %val, <2 x i64>* %dst) nounwind {
	entry:			entry:
	; CHECK-LABEL: shift1a:			; CHECK-LABEL: shift1a:
	; CHECK: sarl			; CHECK: psrad $31
	%ashr = ashr <2 x i64> %val, < i64 32, i64 32 >			%ashr = ashr <2 x i64> %val, < i64 32, i64 32 >
	store <2 x i64> %ashr, <2 x i64>* %dst			store <2 x i64> %ashr, <2 x i64>* %dst
	ret void			ret void
	}			}

	define void @shift2a(<4 x i32> %val, <4 x i32>* %dst) nounwind {			define void @shift2a(<4 x i32> %val, <4 x i32>* %dst) nounwind {
	entry:			entry:
	; CHECK-LABEL: shift2a:			; CHECK-LABEL: shift2a:
	...

test/CodeGen/X86/widen_conv-2.ll

	; RUN: llc < %s -march=x86 -mattr=+sse4.2 | FileCheck %s			; RUN: llc < %s -march=x86 -mattr=+sse4.2 | FileCheck %s
	; CHECK: {{cwtl|movswl}}			; CHECK: psllq $48, %xmm0
	; CHECK: {{cwtl|movswl}}			; CHECK: psrad $16, %xmm0
				; CHECK: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]

	; sign extension v2i32 to v2i16			; sign extension v2i16 to v2i32

	define void @convert(<2 x i32>* %dst.addr, <2 x i16> %src) nounwind {			define void @convert(<2 x i32>* %dst.addr, <2 x i16> %src) nounwind {
	entry:			entry:
	%signext = sext <2 x i16> %src to <2 x i32> ; <<12 x i8>> [#uses=1]			%signext = sext <2 x i16> %src to <2 x i32> ; <<12 x i8>> [#uses=1]
	store <2 x i32> %signext, <2 x i32>* %dst.addr			store <2 x i32> %signext, <2 x i32>* %dst.addr
	ret void			ret void
	}			}
