This is an archive of the discontinued LLVM Phabricator instance.

Differential D8416

[X86][SSE] Avoid scalarization of v2i64 vector shifts
Closed, Public

Authored by RKSimon on Mar 18 2015, 8:09 AM.

Details

Reviewers

spatel
qcolombet
andreadb
mkuper

Commits

rG5ec5c9cafef2: [X86][SSE] Avoid scalarization of v2i64 vector shifts (REAPPLIED)
rG5c837edc2a3a: [X86][SSE] Avoid scalarization of v2i64 vector shifts
rL232682: [X86][SSE] Avoid scalarization of v2i64 vector shifts (REAPPLIED)
rL232660: [X86][SSE] Avoid scalarization of v2i64 vector shifts

Summary

Currently v2i64 vector shifts with non-equal shift amounts are scalarized, costing 4 x extract, 2 x x86-shifts and 2 x insert instructions - and it gets even more awkward on 32-bit targets.

This patch separately shifts the vector by both shift amounts and then shuffles the partial results back together, costing 2 x shuffles and 2 x sse-shifts instructions (+ 2 movs on pre-AVX hardware).
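
As a rough illustration of the lowering described above, the sketch below models a v2i64 value as two 64-bit lanes in plain C++ and compares the scalarized form with the shift-twice-then-shuffle form. The names `V2`, `shuffle`, `shlUniform`, `shlScalarized` and `shlPerLane` are hypothetical and exist only for this example. The shuffle mask convention (indices 0-1 select from the first operand, 2-3 from the second) follows the masks used in the patch, and shift amounts are assumed to be below 64.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model of a v2i64 vector: lane 0 is the low element.
using V2 = std::array<std::uint64_t, 2>;

// Models a two-operand vector shuffle: mask indices 0-1 pick from a,
// indices 2-3 pick from b.
V2 shuffle(const V2 &a, const V2 &b, int m0, int m1) {
  auto pick = [&](int m) { return m < 2 ? a[m] : b[m - 2]; };
  return {pick(m0), pick(m1)};
}

// Models an SSE-style shift (e.g. psllq) where one count applies to both
// lanes. Assumes count < 64.
V2 shlUniform(const V2 &v, std::uint64_t count) {
  return {v[0] << count, v[1] << count};
}

// Scalarized reference: extract each lane, shift it, insert it back.
V2 shlScalarized(const V2 &v, const V2 &amt) {
  V2 r;
  for (int i = 0; i < 2; ++i)
    r[i] = v[i] << amt[i];
  return r;
}

// The approach from the summary: shift the whole vector by each amount,
// then keep lane 0 of the first result and lane 1 of the second.
V2 shlPerLane(const V2 &v, const V2 &amt) {
  V2 r0 = shlUniform(v, amt[0]);
  V2 r1 = shlUniform(v, amt[1]);
  return shuffle(r0, r1, 0, 3);
}
```

With `v = {0x1, 0xF0}` and `amt = {4, 1}`, both `shlScalarized` and `shlPerLane` produce `{0x10, 0x1E0}`.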

Note - this patch only improves the SHL / LSHR shifts, as ASHR v2i64 shifts aren't currently supported in hardware - I'm looking at fixing this in a future patch.
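
For context on the ASHR limitation: SSE2 has 64-bit logical shifts (psllq, psrlq) but no 64-bit arithmetic right shift. The sketch below shows one well-known identity for building an arithmetic shift from a logical one, modeled on a single lane in portable C++. It is not part of this patch and is not necessarily what the follow-up patch does. `ashrViaLshr` is a hypothetical name, and the shift amount is assumed to be below 64.

```cpp
#include <cstdint>

// Arithmetic shift right of a 64-bit lane built from logical shifts only.
// m marks where the sign bit lands after the logical shift; (x ^ m) - m
// then sign-extends from that bit. Assumes n < 64.
std::int64_t ashrViaLshr(std::int64_t value, unsigned n) {
  std::uint64_t x = static_cast<std::uint64_t>(value) >> n; // logical shift
  std::uint64_t m = (std::uint64_t{1} << 63) >> n;          // shifted sign bit
  // The final cast relies on two's complement conversion, which is
  // guaranteed from C++20 and is the behavior of common compilers before it.
  return static_cast<std::int64_t>((x ^ m) - m);
}
```

For example, `ashrViaLshr(-16, 2)` yields -4, while a plain logical shift of the same bits would yield a large positive value.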

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 22186. Mar 18 2015, 8:09 AM

RKSimon retitled this revision from to [X86][SSE] Avoid scalarization of v2i64 vector shifts.

RKSimon updated this object.

RKSimon edited the test plan for this revision.

RKSimon added reviewers: qcolombet, mkuper, andreadb, spatel.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: Unknown Object (MLST).

Hi Simon,

please see comments below.

lib/Target/X86/X86ISelLowering.cpp
16194 (On Diff #22186): This would generate worse code if `Op.getOpcode() == ISD::SRA`. You should check that the opcode is not ISD::SRA. Otherwise, you would end up scalarizing two shifts.
test/CodeGen/X86/x86-shifts.ll
122-123 (On Diff #22186): Could you please add a check for the shift count? Something like:
  CHECK-DAG: psrlq $8
  CHECK-DAG: psrlq $1
It would be nice to also have checks for the two extra 'punpcklqdq' shuffles that would be generated by your patch.

Thanks Andrea, I'll upload a new version of the patch later today.

lib/Target/X86/X86ISelLowering.cpp
16194 (On Diff #22186): Yes - i64 SRA isn't currently supported (and doesn't go through LowerShift at all atm) but I will add the check. Initial tests indicate that SRA would be faster for constant shifts (and all AVX2 implementations) - but as I said I'll deal with SRA properly in a future patch.
test/CodeGen/X86/x86-shifts.ll
122-123 (On Diff #22186): Yes - I'll add more complete CHECK lines for all my changes.

Updated patch based on Andrea's comments - checked against ashr and improved tests

LGTM. Thanks Simon!

This revision is now accepted and ready to land. Mar 18 2015, 12:10 PM

Closed by commit rL232660: [X86][SSE] Avoid scalarization of v2i64 vector shifts (authored by RKSimon). Mar 18 2015, 12:38 PM

This revision was automatically updated to reflect the committed changes.

Thanks Andrea - I'll prepare an i64 ASHR patch in the next few days.

RKSimon mentioned this in D11327: [X86][SSE] Keep 32-bit target i64 vector shifts on SSE unit. Jul 18 2015, 7:27 AM

RKSimon mentioned this in rL242621: [X86][SSE] Updated SHL/LSHR i64 vectorization costs. Jul 18 2015, 1:06 PM

RKSimon mentioned this in rL243577: [X86][SSE] Keep 32-bit target i64 vector shifts on SSE unit. Jul 29 2015, 2:45 PM

Revision Contents

Path                                             Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp    37 lines
llvm/trunk/test/CodeGen/X86/vshift-4.ll          9 lines
llvm/trunk/test/CodeGen/X86/x86-shifts.ll        14 lines

Diff 22205

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

[5,900 lines not shown] static SDValue LowerCONCAT_VECTORSvXi1(SDValue Op,
   // Zero the upper bits of V1
   V1 = DAG.getNode(X86ISD::VSHLI, dl, ResVT, V1, ShiftBits);
   V1 = DAG.getNode(X86ISD::VSRLI, dl, ResVT, V1, ShiftBits);
   if (IsZeroV2)
     return V1;
   return DAG.getNode(ISD::OR, dl, ResVT, V1, V2);
 }

 static SDValue LowerCONCAT_VECTORS(SDValue Op,
                                    const X86Subtarget *Subtarget,
                                    SelectionDAG &DAG) {
   MVT VT = Op.getSimpleValueType();
   if (VT.getVectorElementType() == MVT::i1)
     return LowerCONCAT_VECTORSvXi1(Op, Subtarget, DAG);

   assert((VT.is256BitVector() && Op.getNumOperands() == 2) ||
          (VT.is512BitVector() && (Op.getNumOperands() == 2 ||
[7,332 lines not shown] if (SSECC != 8) {
       }

       SDValue Cmp = DAG.getNode(X86ISD::FSETCC, DL, VT, CondOp0, CondOp1,
                                 DAG.getConstant(SSECC, MVT::i8));

       // If we have AVX, we can use a variable vector select (VBLENDV) instead
       // of 3 logic instructions for size savings and potentially speed.
       // Unfortunately, there is no scalar form of VBLENDV.

       // If either operand is a constant, don't try this. We can expect to
       // optimize away at least one of the logic instructions later in that
       // case, so that sequence would be faster than a variable blend.

       // BLENDV was introduced with SSE 4.1, but the 2 register form implicitly
       // uses XMM0 as the selection register. That may need just as many
       // instructions as the AND/ANDN/OR sequence due to register moves, so
       // don't bother.

       if (Subtarget->hasAVX() &&
           !isa<ConstantFPSDNode>(Op1) && !isa<ConstantFPSDNode>(Op2)) {

         // Convert to vectors, do a VSELECT, and convert back to scalar.
         // All of the conversions should be optimized away.

         EVT VecVT = VT == MVT::f32 ? MVT::v4f32 : MVT::v2f64;
         SDValue VOp1 = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, VecVT, Op1);
         SDValue VOp2 = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, VecVT, Op2);
         SDValue VCmp = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, VecVT, Cmp);

         EVT VCmpVT = VT == MVT::f32 ? MVT::v4i32 : MVT::v2i64;
         VCmp = DAG.getNode(ISD::BITCAST, DL, VCmpVT, VCmp);

         SDValue VSel = DAG.getNode(ISD::VSELECT, DL, VecVT, VCmp, VOp1, VOp2);

         return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT,
                            VSel, DAG.getIntPtrConstant(0));
       }
       SDValue AndN = DAG.getNode(X86ISD::FANDN, DL, VT, Cmp, Op2);
       SDValue And = DAG.getNode(X86ISD::FAND, DL, VT, Cmp, Op1);
       return DAG.getNode(X86ISD::FOR, DL, VT, AndN, And);
     }
   }
[2,892 lines not shown] if (Subtarget->hasInt256()) {
     if (Op.getOpcode() == ISD::SHL &&
         (VT == MVT::v2i64 || VT == MVT::v4i32 ||
          VT == MVT::v4i64 || VT == MVT::v8i32))
       return Op;
     if (Op.getOpcode() == ISD::SRA && (VT == MVT::v4i32 || VT == MVT::v8i32))
       return Op;
   }

+  // 2i64 vector logical shifts can efficiently avoid scalarization - do the
+  // shifts per-lane and then shuffle the partial results back together.
+  if (VT == MVT::v2i64 && Op.getOpcode() != ISD::SRA) {
+    // Splat the shift amounts so the scalar shifts above will catch it.
+    SDValue Amt0 = DAG.getVectorShuffle(VT, dl, Amt, Amt, {0, 0});
+    SDValue Amt1 = DAG.getVectorShuffle(VT, dl, Amt, Amt, {1, 1});
+    SDValue R0 = DAG.getNode(Op->getOpcode(), dl, VT, R, Amt0);
+    SDValue R1 = DAG.getNode(Op->getOpcode(), dl, VT, R, Amt1);
+    return DAG.getVectorShuffle(VT, dl, R0, R1, {0, 3});
+  }
+
   // If possible, lower this packed shift into a vector multiply instead of
   // expanding it into a sequence of scalar shifts.
   // Do this only if the vector shift count is a constant build_vector.
   if (Op.getOpcode() == ISD::SHL &&
       (VT == MVT::v8i16 || VT == MVT::v4i32 ||
        (Subtarget->hasInt256() && VT == MVT::v16i16)) &&
       ISD::isBuildVectorOfConstantSDNodes(Amt.getNode())) {
     SmallVector<SDValue, 8> Elts;
[5,755 lines not shown] static SDValue VectorZextCombine(SDNode *N, SelectionDAG &DAG,
   SDValue N1 = N->getOperand(1);
   SDLoc DL(N);

   // A vector zext_in_reg may be represented as a shuffle,
   // feeding into a bitcast (this represents anyext) feeding into
   // an and with a mask.
   // We'd like to try to combine that into a shuffle with zero
   // plus a bitcast, removing the and.
   if (N0.getOpcode() != ISD::BITCAST ||
       N0.getOperand(0).getOpcode() != ISD::VECTOR_SHUFFLE)
     return SDValue();

   // The other side of the AND should be a splat of 2^C, where C
   // is the number of bits in the source type.
   if (N1.getOpcode() == ISD::BITCAST)
     N1 = N1.getOperand(0);
   if (N1.getOpcode() != ISD::BUILD_VECTOR)
[13 lines not shown] static SDValue VectorZextCombine(SDNode *N, SelectionDAG &DAG,
   unsigned SplatBitSize;
   bool HasAnyUndefs;
   if (!Vector->isConstantSplat(SplatValue, SplatUndef,
                                SplatBitSize, HasAnyUndefs))
     return SDValue();

   unsigned ResSize = N1.getValueType().getScalarSizeInBits();
   // Make sure the splat matches the mask we expect
   if (SplatBitSize > ResSize ||
       (SplatValue + 1).exactLogBase2() != (int)SrcSize)
     return SDValue();

   // Make sure the input and output size make sense
   if (SrcSize >= ResSize || ResSize % SrcSize)
     return SDValue();

   // We expect a shuffle of the form <0, u, u, u, 1, u, u, u...>
[941 lines not shown] static SDValue PerformFANDCombine(SDNode *N, SelectionDAG &DAG) {
   if (ConstantFPSDNode *C = dyn_cast<ConstantFPSDNode>(N->getOperand(0)))
     if (C->getValueAPF().isPosZero())
       return N->getOperand(0);

   // FAND(x, 0.0) -> 0.0
   if (ConstantFPSDNode *C = dyn_cast<ConstantFPSDNode>(N->getOperand(1)))
     if (C->getValueAPF().isPosZero())
       return N->getOperand(1);

   return SDValue();
 }

 /// Do target-specific dag combines on X86ISD::FANDN nodes
 static SDValue PerformFANDNCombine(SDNode *N, SelectionDAG &DAG) {
   // FANDN(0.0, x) -> x
   if (ConstantFPSDNode *C = dyn_cast<ConstantFPSDNode>(N->getOperand(0)))
     if (C->getValueAPF().isPosZero())
[257 lines not shown] if (IsSEXT0 && IsVZero1) {
       assert(VT == LHS.getOperand(0).getValueType() &&
              "Uexpected operand type");
       if (CC == ISD::SETGT)
         return DAG.getConstant(0, VT);
       if (CC == ISD::SETLE)
         return DAG.getConstant(1, VT);
       if (CC == ISD::SETEQ || CC == ISD::SETGE)
         return DAG.getNOT(DL, LHS.getOperand(0), VT);

       assert((CC == ISD::SETNE || CC == ISD::SETLT) &&
              "Unexpected condition code!");
       return LHS.getOperand(0);
     }
   }

   return SDValue();
 }
[25 lines not shown] static SDValue PerformINSERTPSCombine(SDNode *N, SelectionDAG &DAG,
   SDValue Ld = N->getOperand(1);
   if (MayFoldLoad(Ld)) {
     // Extract the countS bits from the immediate so we can get the proper
     // address when narrowing the vector load to a specific element.
     // When the second source op is a memory address, insertps doesn't use
     // countS and just gets an f32 from that address.
     unsigned DestIndex =
         cast<ConstantSDNode>(N->getOperand(2))->getZExtValue() >> 6;

     Ld = NarrowVectorLoadToElement(cast<LoadSDNode>(Ld), DestIndex, DAG);

     // Create this as a scalar to vector to match the instruction pattern.
     SDValue LoadScalarToVector = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, Ld);
     // countS bits are ignored when loading from memory on insertps, which
     // means we don't need to explicitly set them to 0.
     return DAG.getNode(X86ISD::INSERTPS, dl, VT, N->getOperand(0),
                        LoadScalarToVector, N->getOperand(2));
   }
   return SDValue();
 }

 static SDValue PerformBLENDICombine(SDNode *N, SelectionDAG &DAG) {
   SDValue V0 = N->getOperand(0);
   SDValue V1 = N->getOperand(1);
   SDLoc DL(N);
   EVT VT = N->getValueType(0);

   // Canonicalize a v2f64 blend with a mask of 2 by swapping the vector
   // operands and changing the mask to 1. This saves us a bunch of
   // pattern-matching possibilities related to scalar math ops in SSE/AVX.
   // x86InstrInfo knows how to commute this back after instruction selection
   // if it would help register allocation.

   // TODO: If optimizing for size or a processor that doesn't suffer from
   // partial register update stalls, this should be transformed into a MOVSD
   // instruction because a MOVSD is 1-2 bytes smaller than a BLENDPD.

   if (VT == MVT::v2f64)
     if (auto *Mask = dyn_cast<ConstantSDNode>(N->getOperand(2)))
       if (Mask->getZExtValue() == 2 && !isShuffleFoldableLoad(V0)) {
         SDValue NewMask = DAG.getConstant(1, MVT::i8);
[1,233 lines not shown]

llvm/trunk/test/CodeGen/X86/vshift-4.ll

 ; RUN: llc < %s -march=x86 -mcpu=core2 | FileCheck %s

 ; test vector shifts converted to proper SSE2 vector shifts when the shift
 ; amounts are the same when using a shuffle splat.

 define void @shift1a(<2 x i64> %val, <2 x i64>* %dst, <2 x i64> %sh) nounwind {
 entry:
 ; CHECK-LABEL: shift1a:
 ; CHECK: psllq
   %shamt = shufflevector <2 x i64> %sh, <2 x i64> undef, <2 x i32> <i32 0, i32 0>
   %shl = shl <2 x i64> %val, %shamt
   store <2 x i64> %shl, <2 x i64>* %dst
   ret void
 }

-; shift1b can't use a packed shift
+; shift1b can't use a packed shift but can shift lanes separately and shuffle back together
 define void @shift1b(<2 x i64> %val, <2 x i64>* %dst, <2 x i64> %sh) nounwind {
 entry:
 ; CHECK-LABEL: shift1b:
-; CHECK: shll
+; CHECK: pshufd {{.*#+}} xmm2 = xmm1[2,3,0,1]
+; CHECK-NEXT: movdqa %xmm0, %xmm3
+; CHECK-NEXT: psllq %xmm2, %xmm3
+; CHECK-NEXT: movq {{.*#+}} xmm1 = xmm1[0],zero
+; CHECK-NEXT: psllq %xmm1, %xmm0
+; CHECK-NEXT: movsd {{.*#+}} xmm3 = xmm0[0],xmm3[1]
   %shamt = shufflevector <2 x i64> %sh, <2 x i64> undef, <2 x i32> <i32 0, i32 1>
   %shl = shl <2 x i64> %val, %shamt
   store <2 x i64> %shl, <2 x i64>* %dst
   ret void
 }

 define void @shift2a(<4 x i32> %val, <4 x i32>* %dst, <2 x i32> %amt) nounwind {
 entry:
[57 lines not shown]

llvm/trunk/test/CodeGen/X86/x86-shifts.ll

[112 lines not shown] ; CHECK: ret
   %C = shl <8 x i16> %A, < i16 9, i16 7, i16 5, i16 1, i16 4, i16 1, i16 1, i16 1>
   %K = xor <8 x i16> %B, %C
   ret <8 x i16> %K
 }


 define <2 x i64> @shr2_nosplat(<2 x i64> %A) nounwind {
 entry:
-; CHECK: shr2_nosplat
-; CHECK-NOT: psrlq
-; CHECK-NOT: psrlq
-; CHECK: ret
+; CHECK-LABEL: shr2_nosplat
+; CHECK: movdqa (%rcx), %xmm1
+; CHECK-NEXT: movdqa %xmm1, %xmm2
+; CHECK-NEXT: psrlq $8, %xmm2
+; CHECK-NEXT: movdqa %xmm1, %xmm0
+; CHECK-NEXT: psrlq $1, %xmm0
+; CHECK-NEXT: movsd {{.*#+}} xmm1 = xmm0[0],xmm1[1]
+; CHECK-NEXT: movsd {{.*#+}} xmm0 = xmm2[0],xmm0[1]
+; CHECK-NEXT: xorpd %xmm1, %xmm0
+; CHECK-NEXT: ret
   %B = lshr <2 x i64> %A, < i64 8, i64 1>
   %C = lshr <2 x i64> %A, < i64 1, i64 0>
   %K = xor <2 x i64> %B, %C
   ret <2 x i64> %K
 }


 ; Other shifts
[60 lines not shown]
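
As a worked reading of the new shr2_nosplat CHECK lines, the sketch below replays the expected instruction sequence on a two-lane model in portable C++ and compares it with the per-lane result taken from the IR. The helper names (`V2`, `psrlq`, `movsd`, `shr2NosplatLowered`, `shr2NosplatReference`) are hypothetical stand-ins chosen to mirror the instructions; they are not LLVM or intrinsic APIs. Note how the `psrlq $1` result is used for both %B (high lane) and %C (low lane), and how the shift by 0 in %C simply keeps the original high lane.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model of an xmm register holding v2i64: lane 0 is the low element.
using V2 = std::array<std::uint64_t, 2>;

// Models psrlq with an immediate count applied to both lanes (count < 64).
V2 psrlq(const V2 &v, unsigned count) {
  return {v[0] >> count, v[1] >> count};
}

// Models the movsd blend in the CHECK lines: low lane from lo, high lane from hi.
V2 movsd(const V2 &lo, const V2 &hi) {
  return {lo[0], hi[1]};
}

// Mirrors the expected instruction sequence for shr2_nosplat.
V2 shr2NosplatLowered(const V2 &a) {
  V2 by8 = psrlq(a, 8);   // psrlq $8, %xmm2
  V2 by1 = psrlq(a, 1);   // psrlq $1, %xmm0
  V2 c = movsd(by1, a);   // %C = lshr %A, <1, 0>: lane 1 is unshifted
  V2 b = movsd(by8, by1); // %B = lshr %A, <8, 1>
  return {b[0] ^ c[0], b[1] ^ c[1]}; // xorpd
}

// Per-lane reference taken directly from the IR.
V2 shr2NosplatReference(const V2 &a) {
  V2 b = {a[0] >> 8, a[1] >> 1};
  V2 c = {a[0] >> 1, a[1] >> 0};
  return {b[0] ^ c[0], b[1] ^ c[1]};
}
```

For any input, `shr2NosplatLowered` and `shr2NosplatReference` should agree, which is the property the two shifts plus two blends in the CHECK lines rely on.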