This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Vectorize i64 ASHR operations
ClosedPublic

Authored by RKSimon on Jul 22 2015, 4:20 PM.

Details

Reviewers

spatel
qcolombet
delena
andreadb

Commits

rG86478c6909eb: [X86][SSE] Vectorize i64 ASHR operations
rL243569: [X86][SSE] Vectorize i64 ASHR operations

Summary

This patch vectorizes the v2i64/v4i64 ASHR shift operations, the only remaining integer vector shifts that were still being transferred to and from the scalar unit to be completed.
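The lowering in this patch rests on the identity quoted in its code comment: with M = lshr(SIGN_BIT, A), ashr(R, A) equals (lshr(R, A) ^ M) - M. The following is a minimal scalar sketch of that identity, not the patch's code. The names `ashr_via_lshr`, `ashr_reference` and `matches_reference` are hypothetical, and shift amounts are assumed to be in the range 0..63.

```cpp
#include <cstdint>

// Sign bit of a 64-bit lane, matching APInt::getSignBit(64) in the patch.
constexpr std::uint64_t kSignBit = std::uint64_t{1} << 63;

// Arithmetic shift right built from a logical shift, an XOR and a subtract.
// Assumes 0 <= amt <= 63.
constexpr std::uint64_t ashr_via_lshr(std::uint64_t r, unsigned amt) {
  const std::uint64_t m = kSignBit >> amt;  // M = lshr(SIGN_BIT, A)
  return ((r >> amt) ^ m) - m;              // (lshr(R, A) ^ M) - M
}

// Hand-written reference: logical shift, then fill the vacated top bits
// with ones when the input has its sign bit set.
constexpr std::uint64_t ashr_reference(std::uint64_t r, unsigned amt) {
  const bool negative = (r >> 63) != 0;
  const std::uint64_t fill = (negative && amt != 0)
                                 ? (~std::uint64_t{0} << (64 - amt))
                                 : std::uint64_t{0};
  return (r >> amt) | fill;
}

// Checks the identity for every shift amount 0..63 on one input value.
constexpr bool matches_reference(std::uint64_t r) {
  for (unsigned amt = 0; amt < 64; ++amt) {
    if (ashr_via_lshr(r, amt) != ashr_reference(r, amt)) {
      return false;
    }
  }
  return true;
}

// Compile-time spot checks on a positive and a negative input.
static_assert(matches_reference(0x123456789ABCDEF0ULL), "positive input");
static_assert(matches_reference(0xFEDCBA9876543210ULL), "negative input");
```

The XOR flips the shifted-down sign bit and the subtract then either restores it (non-negative input) or borrows through the upper bits to fill them with ones (negative input), which is why only logical vector shifts are needed.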

Note: The poor code gen for the X32 tests will be improved by D11327.

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 30419. Jul 22 2015, 4:20 PM

RKSimon retitled this revision from to [X86][SSE] Vectorize i64 ASHR operations.

RKSimon updated this object.

RKSimon added reviewers: qcolombet, delena, andreadb, spatel.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: llvm-commits.

rebased.

qcolombet added inline comments. Jul 28 2015, 9:35 AM

lib/Target/X86/X86ISelLowering.cpp
17365	It wasn't immediately clear to me that s>> and u>> referred to signed and unsigned shifts. Use lshr and ashr instead, as in LLVM IR (or the SD name variant if you prefer).
test/CodeGen/X86/vector-shift-ashr-128.ll
27	Is this sequence actually better? I guess the GPR to vector and vector to GPR copies are quite expensive so that is the case. Just double checking.
44	Same question here (this time I guess not using pextr is good).

Thanks Quentin. If it's not too much trouble, could you please check the sibling patch to this one (D11327)?

lib/Target/X86/X86ISelLowering.cpp
17365	No problem.
test/CodeGen/X86/vector-shift-ashr-128.ll
27	It's very target dependent. On my old Penryn, avoiding GPR/SSE transfers is by far the best option. Jaguar/Sandy Bridge don't care that much at the 64-bit integer level (it will probably come down to register pressure issues). Haswell has AVX2, so we can do per-lane v2i64 logical shifts, which is where this patch really flies. In 32-bit mode it's always best to avoid trying to do (split) 64-bit shifts on GPRs (see D11327). Overall, the general per-lane ashr v2i64 lowering is the weakest improvement, but the splat and constant cases gain a lot more.
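One reason the constant case gains more is that M = lshr(SIGN_BIT, A) folds to a constant vector when the shift amounts are constants. The sketch below models a two-lane vector with `std::array` (an illustration only, not how the DAG represents it) and uses the amounts 1 and 7 that appear as `psrlq $1` / `psrlq $7` in the constant_shift_v2i64 checks. The names `V2i64`, `sign_masks` and `ashr_v2i64` are hypothetical.

```cpp
#include <array>
#include <cstdint>

// Two 64-bit lanes, standing in for a v2i64 vector.
using V2i64 = std::array<std::uint64_t, 2>;

constexpr std::uint64_t kLaneSignBit = std::uint64_t{1} << 63;

// Per-lane M = lshr(SIGN_BIT, A). Assumes every amount is in 0..63.
constexpr V2i64 sign_masks(const V2i64& amt) {
  return V2i64{{kLaneSignBit >> amt[0], kLaneSignBit >> amt[1]}};
}

// Per-lane ashr via the srl/xor/sub sequence.
constexpr V2i64 ashr_v2i64(const V2i64& r, const V2i64& amt) {
  const V2i64 m = sign_masks(amt);
  return V2i64{{((r[0] >> amt[0]) ^ m[0]) - m[0],
                ((r[1] >> amt[1]) ^ m[1]) - m[1]}};
}

// Constant shift amounts taken from the constant_shift_v2i64 test.
constexpr V2i64 kAmounts = {1, 7};

// With constant amounts the mask is itself a constant: 2^62 and 2^56,
// the values loaded from the constant pool in the updated checks.
constexpr V2i64 kMasks = sign_masks(kAmounts);
static_assert(kMasks[0] == 4611686018427387904ULL, "SIGN_BIT >> 1");
static_assert(kMasks[1] == 72057594037927936ULL, "SIGN_BIT >> 7");

// Example input lanes: -16 and -256 reinterpreted as unsigned.
constexpr V2i64 kInput = {0xFFFFFFFFFFFFFFF0ULL, 0xFFFFFFFFFFFFFF00ULL};
constexpr V2i64 kShifted = ashr_v2i64(kInput, kAmounts);
static_assert(kShifted[0] == 0xFFFFFFFFFFFFFFF8ULL, "-16 >> 1 == -8");
static_assert(kShifted[1] == 0xFFFFFFFFFFFFFFFEULL, "-256 >> 7 == -2");
```

For constant amounts this leaves only the logical shifts of R, one XOR and one subtract to execute at run time.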

Updated pseudocode comments to use LLVM IR. Also updated the comment that I copied this from.

Hi Simon,

LGTM.

Thanks,
-Quentin

This revision is now accepted and ready to land. Jul 29 2015, 10:06 AM

Closed by commit rL243569: [X86][SSE] Vectorize i64 ASHR operations (authored by RKSimon). Jul 29 2015, 1:32 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp	13 lines
lib/Target/X86/X86TargetTransformInfo.cpp	5 lines
test/Analysis/CostModel/X86/arith.ll	4 lines
test/Analysis/CostModel/X86/testshiftashr.ll	32 lines
test/CodeGen/X86/vector-shift-ashr-128.ll	342 lines
test/CodeGen/X86/vector-shift-ashr-256.ll	173 lines
utils/update_llc_test_checks.py	2 lines

Diff 30419

lib/Target/X86/X86ISelLowering.cpp

Show First 20 Lines • Show All 17,354 Lines • ▼ Show 20 Lines	if (VT == MVT::v2i64 && Op.getOpcode() != ISD::SRA) {
// Splat the shift amounts so the scalar shifts above will catch it.		// Splat the shift amounts so the scalar shifts above will catch it.
SDValue Amt0 = DAG.getVectorShuffle(VT, dl, Amt, Amt, {0, 0});		SDValue Amt0 = DAG.getVectorShuffle(VT, dl, Amt, Amt, {0, 0});
SDValue Amt1 = DAG.getVectorShuffle(VT, dl, Amt, Amt, {1, 1});		SDValue Amt1 = DAG.getVectorShuffle(VT, dl, Amt, Amt, {1, 1});
SDValue R0 = DAG.getNode(Op->getOpcode(), dl, VT, R, Amt0);		SDValue R0 = DAG.getNode(Op->getOpcode(), dl, VT, R, Amt0);
SDValue R1 = DAG.getNode(Op->getOpcode(), dl, VT, R, Amt1);		SDValue R1 = DAG.getNode(Op->getOpcode(), dl, VT, R, Amt1);
return DAG.getVectorShuffle(VT, dl, R0, R1, {0, 3});		return DAG.getVectorShuffle(VT, dl, R0, R1, {0, 3});
}		}

		// i64 vector arithmetic shift can be emulated with the transform:
		// M = SIGN_BIT u>> A
		// R s>> a === ((R u>> A) ^ M) - M
		if ((VT == MVT::v2i64 || (VT == MVT::v4i64 && Subtarget->hasInt256())) &&
		Op.getOpcode() == ISD::SRA) {
		SDValue S = DAG.getConstant(APInt::getSignBit(64), dl, VT);
		SDValue M = DAG.getNode(ISD::SRL, dl, VT, S, Amt);
		R = DAG.getNode(ISD::SRL, dl, VT, R, Amt);
		R = DAG.getNode(ISD::XOR, dl, VT, R, M);
		R = DAG.getNode(ISD::SUB, dl, VT, R, M);
		return R;
		}

// If possible, lower this packed shift into a vector multiply instead of		// If possible, lower this packed shift into a vector multiply instead of
// expanding it into a sequence of scalar shifts.		// expanding it into a sequence of scalar shifts.
// Do this only if the vector shift count is a constant build_vector.		// Do this only if the vector shift count is a constant build_vector.
if (Op.getOpcode() == ISD::SHL &&		if (Op.getOpcode() == ISD::SHL &&
(VT == MVT::v8i16 || VT == MVT::v4i32 ||		(VT == MVT::v8i16 || VT == MVT::v4i32 ||
(Subtarget->hasInt256() && VT == MVT::v16i16)) &&		(Subtarget->hasInt256() && VT == MVT::v16i16)) &&
ISD::isBuildVectorOfConstantSDNodes(Amt.getNode())) {		ISD::isBuildVectorOfConstantSDNodes(Amt.getNode())) {
SmallVector<SDValue, 8> Elts;		SmallVector<SDValue, 8> Elts;

lib/Target/X86/X86TargetTransformInfo.cpp

Show First 20 Lines • Show All 157 Lines • ▼ Show 20 Lines	static const CostTblEntry<MVT::SimpleValueType> AVX2CostTable[] = {
{ ISD::SHL, MVT::v32i8, 11 }, // vpblendvb sequence.		{ ISD::SHL, MVT::v32i8, 11 }, // vpblendvb sequence.
{ ISD::SHL, MVT::v16i16, 10 }, // extend/vpsrlvd/pack sequence.		{ ISD::SHL, MVT::v16i16, 10 }, // extend/vpsrlvd/pack sequence.

{ ISD::SRL, MVT::v32i8, 11 }, // vpblendvb sequence.		{ ISD::SRL, MVT::v32i8, 11 }, // vpblendvb sequence.
{ ISD::SRL, MVT::v16i16, 10 }, // extend/vpsrlvd/pack sequence.		{ ISD::SRL, MVT::v16i16, 10 }, // extend/vpsrlvd/pack sequence.

{ ISD::SRA, MVT::v32i8, 24 }, // vpblendvb sequence.		{ ISD::SRA, MVT::v32i8, 24 }, // vpblendvb sequence.
{ ISD::SRA, MVT::v16i16, 10 }, // extend/vpsravd/pack sequence.		{ ISD::SRA, MVT::v16i16, 10 }, // extend/vpsravd/pack sequence.
{ ISD::SRA, MVT::v4i64, 4*10 }, // Scalarized.		{ ISD::SRA, MVT::v2i64, 4 }, // srl/xor/sub sequence.
		{ ISD::SRA, MVT::v4i64, 4 }, // srl/xor/sub sequence.

// Vectorizing division is a bad idea. See the SSE2 table for more comments.		// Vectorizing division is a bad idea. See the SSE2 table for more comments.
{ ISD::SDIV, MVT::v32i8, 32*20 },		{ ISD::SDIV, MVT::v32i8, 32*20 },
{ ISD::SDIV, MVT::v16i16, 16*20 },		{ ISD::SDIV, MVT::v16i16, 16*20 },
{ ISD::SDIV, MVT::v8i32, 8*20 },		{ ISD::SDIV, MVT::v8i32, 8*20 },
{ ISD::SDIV, MVT::v4i64, 4*20 },		{ ISD::SDIV, MVT::v4i64, 4*20 },
{ ISD::UDIV, MVT::v32i8, 32*20 },		{ ISD::UDIV, MVT::v32i8, 32*20 },
{ ISD::UDIV, MVT::v16i16, 16*20 },		{ ISD::UDIV, MVT::v16i16, 16*20 },
▲ Show 20 Lines • Show All 90 Lines • ▼ Show 20 Lines	static const CostTblEntry<MVT::SimpleValueType> SSE2CostTable[] = {
{ ISD::SRL, MVT::v16i8, 26 }, // cmpgtb sequence.		{ ISD::SRL, MVT::v16i8, 26 }, // cmpgtb sequence.
{ ISD::SRL, MVT::v8i16, 32 }, // cmpgtb sequence.		{ ISD::SRL, MVT::v8i16, 32 }, // cmpgtb sequence.
{ ISD::SRL, MVT::v4i32, 16 }, // Shift each lane + blend.		{ ISD::SRL, MVT::v4i32, 16 }, // Shift each lane + blend.
{ ISD::SRL, MVT::v2i64, 4 }, // splat+shuffle sequence.		{ ISD::SRL, MVT::v2i64, 4 }, // splat+shuffle sequence.

{ ISD::SRA, MVT::v16i8, 54 }, // unpacked cmpgtb sequence.		{ ISD::SRA, MVT::v16i8, 54 }, // unpacked cmpgtb sequence.
{ ISD::SRA, MVT::v8i16, 32 }, // cmpgtb sequence.		{ ISD::SRA, MVT::v8i16, 32 }, // cmpgtb sequence.
{ ISD::SRA, MVT::v4i32, 16 }, // Shift each lane + blend.		{ ISD::SRA, MVT::v4i32, 16 }, // Shift each lane + blend.
{ ISD::SRA, MVT::v2i64, 2*10 }, // Scalarized.		{ ISD::SRA, MVT::v2i64, 12 }, // srl/xor/sub sequence.

// It is not a good idea to vectorize division. We have to scalarize it and		// It is not a good idea to vectorize division. We have to scalarize it and
// in the process we will often end up having to spilling regular		// in the process we will often end up having to spilling regular
// registers. The overhead of division is going to dominate most kernels		// registers. The overhead of division is going to dominate most kernels
// anyways so try hard to prevent vectorization of division - it is		// anyways so try hard to prevent vectorization of division - it is
// generally a bad idea. Assume somewhat arbitrarily that we have to be able		// generally a bad idea. Assume somewhat arbitrarily that we have to be able
// to hide "20 cycles" for each lane.		// to hide "20 cycles" for each lane.
{ ISD::SDIV, MVT::v16i8, 16*20 },		{ ISD::SDIV, MVT::v16i8, 16*20 },

test/Analysis/CostModel/X86/arith.ll

Show First 20 Lines • Show All 88 Lines • ▼ Show 20 Lines	define void @shift() {
; AVX: cost of 2 {{.*}} lshr		; AVX: cost of 2 {{.*}} lshr
; AVX2: cost of 1 {{.*}} lshr		; AVX2: cost of 1 {{.*}} lshr
%B1 = lshr <2 x i64> undef, undef		%B1 = lshr <2 x i64> undef, undef

; AVX: cost of 2 {{.*}} ashr		; AVX: cost of 2 {{.*}} ashr
; AVX2: cost of 1 {{.*}} ashr		; AVX2: cost of 1 {{.*}} ashr
%C0 = ashr <4 x i32> undef, undef		%C0 = ashr <4 x i32> undef, undef
; AVX: cost of 6 {{.*}} ashr		; AVX: cost of 6 {{.*}} ashr
; AVX2: cost of 20 {{.*}} ashr		; AVX2: cost of 4 {{.*}} ashr
%C1 = ashr <2 x i64> undef, undef		%C1 = ashr <2 x i64> undef, undef

ret void		ret void
}		}

; AVX: avx2shift		; AVX: avx2shift
; AVX2: avx2shift		; AVX2: avx2shift
define void @avx2shift() {		define void @avx2shift() {
Show All 10 Lines	define void @avx2shift() {
; AVX: cost of 2 {{.*}} lshr		; AVX: cost of 2 {{.*}} lshr
; AVX2: cost of 1 {{.*}} lshr		; AVX2: cost of 1 {{.*}} lshr
%B1 = lshr <4 x i64> undef, undef		%B1 = lshr <4 x i64> undef, undef

; AVX: cost of 2 {{.*}} ashr		; AVX: cost of 2 {{.*}} ashr
; AVX2: cost of 1 {{.*}} ashr		; AVX2: cost of 1 {{.*}} ashr
%C0 = ashr <8 x i32> undef, undef		%C0 = ashr <8 x i32> undef, undef
; AVX: cost of 12 {{.*}} ashr		; AVX: cost of 12 {{.*}} ashr
; AVX2: cost of 40 {{.*}} ashr		; AVX2: cost of 4 {{.*}} ashr
%C1 = ashr <4 x i64> undef, undef		%C1 = ashr <4 x i64> undef, undef

ret void		ret void
}		}

test/Analysis/CostModel/X86/testshiftashr.ll

; RUN: llc -mtriple=x86_64-apple-darwin -mcpu=core2 < %s | FileCheck --check-prefix=SSE2-CODEGEN %s		; RUN: llc -mtriple=x86_64-apple-darwin -mcpu=core2 < %s | FileCheck --check-prefix=SSE2-CODEGEN %s
; RUN: opt -mtriple=x86_64-apple-darwin -mcpu=core2 -cost-model -analyze < %s | FileCheck --check-prefix=SSE2 %s		; RUN: opt -mtriple=x86_64-apple-darwin -mcpu=core2 -cost-model -analyze < %s | FileCheck --check-prefix=SSE2 %s

%shifttype = type <2 x i16>		%shifttype = type <2 x i16>
define %shifttype @shift2i16(%shifttype %a, %shifttype %b) {		define %shifttype @shift2i16(%shifttype %a, %shifttype %b) {
entry:		entry:
; SSE2: shift2i16		; SSE2: shift2i16
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 12 {{.*}} ashr
; SSE2-CODEGEN: shift2i16		; SSE2-CODEGEN: shift2i16
; SSE2-CODEGEN: sarq %cl		; SSE2-CODEGEN: psrlq

%0 = ashr %shifttype %a , %b		%0 = ashr %shifttype %a , %b
ret %shifttype %0		ret %shifttype %0
}		}

%shifttype4i16 = type <4 x i16>		%shifttype4i16 = type <4 x i16>
define %shifttype4i16 @shift4i16(%shifttype4i16 %a, %shifttype4i16 %b) {		define %shifttype4i16 @shift4i16(%shifttype4i16 %a, %shifttype4i16 %b) {
entry:		entry:
▲ Show 20 Lines • Show All 41 Lines • ▼ Show 20 Lines	entry:
%0 = ashr %shifttype32i16 %a , %b		%0 = ashr %shifttype32i16 %a , %b
ret %shifttype32i16 %0		ret %shifttype32i16 %0
}		}

%shifttype2i32 = type <2 x i32>		%shifttype2i32 = type <2 x i32>
define %shifttype2i32 @shift2i32(%shifttype2i32 %a, %shifttype2i32 %b) {		define %shifttype2i32 @shift2i32(%shifttype2i32 %a, %shifttype2i32 %b) {
entry:		entry:
; SSE2: shift2i32		; SSE2: shift2i32
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 12 {{.*}} ashr
; SSE2-CODEGEN: shift2i32		; SSE2-CODEGEN: shift2i32
; SSE2-CODEGEN: sarq %cl		; SSE2-CODEGEN: psrlq

%0 = ashr %shifttype2i32 %a , %b		%0 = ashr %shifttype2i32 %a , %b
ret %shifttype2i32 %0		ret %shifttype2i32 %0
}		}

%shifttype4i32 = type <4 x i32>		%shifttype4i32 = type <4 x i32>
define %shifttype4i32 @shift4i32(%shifttype4i32 %a, %shifttype4i32 %b) {		define %shifttype4i32 @shift4i32(%shifttype4i32 %a, %shifttype4i32 %b) {
entry:		entry:
▲ Show 20 Lines • Show All 41 Lines • ▼ Show 20 Lines	entry:
%0 = ashr %shifttype32i32 %a , %b		%0 = ashr %shifttype32i32 %a , %b
ret %shifttype32i32 %0		ret %shifttype32i32 %0
}		}

%shifttype2i64 = type <2 x i64>		%shifttype2i64 = type <2 x i64>
define %shifttype2i64 @shift2i64(%shifttype2i64 %a, %shifttype2i64 %b) {		define %shifttype2i64 @shift2i64(%shifttype2i64 %a, %shifttype2i64 %b) {
entry:		entry:
; SSE2: shift2i64		; SSE2: shift2i64
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 12 {{.*}} ashr
; SSE2-CODEGEN: shift2i64		; SSE2-CODEGEN: shift2i64
; SSE2-CODEGEN: sarq %cl		; SSE2-CODEGEN: psrlq

%0 = ashr %shifttype2i64 %a , %b		%0 = ashr %shifttype2i64 %a , %b
ret %shifttype2i64 %0		ret %shifttype2i64 %0
}		}

%shifttype4i64 = type <4 x i64>		%shifttype4i64 = type <4 x i64>
define %shifttype4i64 @shift4i64(%shifttype4i64 %a, %shifttype4i64 %b) {		define %shifttype4i64 @shift4i64(%shifttype4i64 %a, %shifttype4i64 %b) {
entry:		entry:
; SSE2: shift4i64		; SSE2: shift4i64
; SSE2: cost of 40 {{.*}} ashr		; SSE2: cost of 24 {{.*}} ashr
; SSE2-CODEGEN: shift4i64		; SSE2-CODEGEN: shift4i64
; SSE2-CODEGEN: sarq %cl		; SSE2-CODEGEN: psrlq

%0 = ashr %shifttype4i64 %a , %b		%0 = ashr %shifttype4i64 %a , %b
ret %shifttype4i64 %0		ret %shifttype4i64 %0
}		}

%shifttype8i64 = type <8 x i64>		%shifttype8i64 = type <8 x i64>
define %shifttype8i64 @shift8i64(%shifttype8i64 %a, %shifttype8i64 %b) {		define %shifttype8i64 @shift8i64(%shifttype8i64 %a, %shifttype8i64 %b) {
entry:		entry:
; SSE2: shift8i64		; SSE2: shift8i64
; SSE2: cost of 80 {{.*}} ashr		; SSE2: cost of 48 {{.*}} ashr
; SSE2-CODEGEN: shift8i64		; SSE2-CODEGEN: shift8i64
; SSE2-CODEGEN: sarq %cl		; SSE2-CODEGEN: psrlq

%0 = ashr %shifttype8i64 %a , %b		%0 = ashr %shifttype8i64 %a , %b
ret %shifttype8i64 %0		ret %shifttype8i64 %0
}		}

%shifttype16i64 = type <16 x i64>		%shifttype16i64 = type <16 x i64>
define %shifttype16i64 @shift16i64(%shifttype16i64 %a, %shifttype16i64 %b) {		define %shifttype16i64 @shift16i64(%shifttype16i64 %a, %shifttype16i64 %b) {
entry:		entry:
; SSE2: shift16i64		; SSE2: shift16i64
; SSE2: cost of 160 {{.*}} ashr		; SSE2: cost of 96 {{.*}} ashr
; SSE2-CODEGEN: shift16i64		; SSE2-CODEGEN: shift16i64
; SSE2-CODEGEN: sarq %cl		; SSE2-CODEGEN: psrlq

%0 = ashr %shifttype16i64 %a , %b		%0 = ashr %shifttype16i64 %a , %b
ret %shifttype16i64 %0		ret %shifttype16i64 %0
}		}

%shifttype32i64 = type <32 x i64>		%shifttype32i64 = type <32 x i64>
define %shifttype32i64 @shift32i64(%shifttype32i64 %a, %shifttype32i64 %b) {		define %shifttype32i64 @shift32i64(%shifttype32i64 %a, %shifttype32i64 %b) {
entry:		entry:
; SSE2: shift32i64		; SSE2: shift32i64
; SSE2: cost of 320 {{.*}} ashr		; SSE2: cost of 192 {{.*}} ashr
; SSE2-CODEGEN: shift32i64		; SSE2-CODEGEN: shift32i64
; SSE2-CODEGEN: sarq %cl		; SSE2-CODEGEN: psrlq

%0 = ashr %shifttype32i64 %a , %b		%0 = ashr %shifttype32i64 %a , %b
ret %shifttype32i64 %0		ret %shifttype32i64 %0
}		}

%shifttype2i8 = type <2 x i8>		%shifttype2i8 = type <2 x i8>
define %shifttype2i8 @shift2i8(%shifttype2i8 %a, %shifttype2i8 %b) {		define %shifttype2i8 @shift2i8(%shifttype2i8 %a, %shifttype2i8 %b) {
entry:		entry:
; SSE2: shift2i8		; SSE2: shift2i8
; SSE2: cost of 20 {{.*}} ashr		; SSE2: cost of 12 {{.*}} ashr
; SSE2-CODEGEN: shift2i8		; SSE2-CODEGEN: shift2i8
; SSE2-CODEGEN: sarq %cl		; SSE2-CODEGEN: psrlq

%0 = ashr %shifttype2i8 %a , %b		%0 = ashr %shifttype2i8 %a , %b
ret %shifttype2i8 %0		ret %shifttype2i8 %0
}		}

%shifttype4i8 = type <4 x i8>		%shifttype4i8 = type <4 x i8>
define %shifttype4i8 @shift4i8(%shifttype4i8 %a, %shifttype4i8 %b) {		define %shifttype4i8 @shift4i8(%shifttype4i8 %a, %shifttype4i8 %b) {
entry:		entry:

test/CodeGen/X86/vector-shift-ashr-128.ll

; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 | FileCheck %s --check-prefix=ALL --check-prefix=SSE --check-prefix=SSE2		; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 | FileCheck %s --check-prefix=ALL --check-prefix=SSE --check-prefix=SSE2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+sse4.1 | FileCheck %s --check-prefix=ALL --check-prefix=SSE --check-prefix=SSE41		; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+sse4.1 | FileCheck %s --check-prefix=ALL --check-prefix=SSE --check-prefix=SSE41
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx | FileCheck %s --check-prefix=ALL --check-prefix=AVX --check-prefix=AVX1		; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx | FileCheck %s --check-prefix=ALL --check-prefix=AVX --check-prefix=AVX1
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx2 | FileCheck %s --check-prefix=ALL --check-prefix=AVX --check-prefix=AVX2		; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx2 | FileCheck %s --check-prefix=ALL --check-prefix=AVX --check-prefix=AVX2
;		;
; Just one 32-bit run to make sure we do reasonable things for i64 shifts.		; Just one 32-bit run to make sure we do reasonable things for i64 shifts.
; RUN: llc < %s -mtriple=i686-unknown-unknown -mcpu=x86-64 | FileCheck %s --check-prefix=ALL --check-prefix=X32-SSE --check-prefix=X32-SSE2		; RUN: llc < %s -mtriple=i686-unknown-unknown -mcpu=x86-64 | FileCheck %s --check-prefix=ALL --check-prefix=X32-SSE --check-prefix=X32-SSE2

;		;
; Variable Shifts		; Variable Shifts
;		;

define <2 x i64> @var_shift_v2i64(<2 x i64> %a, <2 x i64> %b) nounwind {		define <2 x i64> @var_shift_v2i64(<2 x i64> %a, <2 x i64> %b) nounwind {
; SSE2-LABEL: var_shift_v2i64:		; SSE2-LABEL: var_shift_v2i64:
; SSE2: # BB#0:		; SSE2: # BB#0:
; SSE2-NEXT: movd %xmm0, %rax		; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,0,1]
; SSE2-NEXT: movd %xmm1, %rcx		; SSE2-NEXT: movdqa {{.*#+}} xmm2 = [9223372036854775808,9223372036854775808]
; SSE2-NEXT: sarq %cl, %rax		; SSE2-NEXT: movdqa %xmm2, %xmm4
; SSE2-NEXT: movd %rax, %xmm2		; SSE2-NEXT: psrlq %xmm3, %xmm4
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; SSE2-NEXT: psrlq %xmm1, %xmm2
; SSE2-NEXT: movd %xmm0, %rax		; SSE2-NEXT: movsd {{.*#+}} xmm4 = xmm2[0],xmm4[1]
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]		; SSE2-NEXT: movdqa %xmm0, %xmm2
; SSE2-NEXT: movd %xmm0, %rcx		; SSE2-NEXT: psrlq %xmm3, %xmm2
; SSE2-NEXT: sarq %cl, %rax		; SSE2-NEXT: psrlq %xmm1, %xmm0
; SSE2-NEXT: movd %rax, %xmm0		; SSE2-NEXT: movsd {{.*#+}} xmm2 = xmm0[0],xmm2[1]
; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm0[0]		; SSE2-NEXT: xorpd %xmm4, %xmm2
		; SSE2-NEXT: psubq %xmm4, %xmm2
; SSE2-NEXT: movdqa %xmm2, %xmm0		; SSE2-NEXT: movdqa %xmm2, %xmm0
; SSE2-NEXT: retq		; SSE2-NEXT: retq
;		;
; SSE41-LABEL: var_shift_v2i64:		; SSE41-LABEL: var_shift_v2i64:
; SSE41: # BB#0:		; SSE41: # BB#0:
; SSE41-NEXT: pextrq $1, %xmm0, %rax		; SSE41-NEXT: movdqa {{.*#+}} xmm2 = [9223372036854775808,9223372036854775808]
; SSE41-NEXT: pextrq $1, %xmm1, %rcx		; SSE41-NEXT: movdqa %xmm2, %xmm3
; SSE41-NEXT: sarq %cl, %rax		; SSE41-NEXT: psrlq %xmm1, %xmm3
; SSE41-NEXT: movd %rax, %xmm2		; SSE41-NEXT: pshufd {{.*#+}} xmm4 = xmm1[2,3,0,1]
; SSE41-NEXT: movd %xmm0, %rax		; SSE41-NEXT: psrlq %xmm4, %xmm2
; SSE41-NEXT: movd %xmm1, %rcx		; SSE41-NEXT: pblendw {{.*#+}} xmm2 = xmm3[0,1,2,3],xmm2[4,5,6,7]
; SSE41-NEXT: sarq %cl, %rax		; SSE41-NEXT: movdqa %xmm0, %xmm3
; SSE41-NEXT: movd %rax, %xmm0		; SSE41-NEXT: psrlq %xmm1, %xmm3
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]		; SSE41-NEXT: psrlq %xmm4, %xmm0
		; SSE41-NEXT: pblendw {{.*#+}} xmm0 = xmm3[0,1,2,3],xmm0[4,5,6,7]
		; SSE41-NEXT: pxor %xmm2, %xmm0
		; SSE41-NEXT: psubq %xmm2, %xmm0
; SSE41-NEXT: retq		; SSE41-NEXT: retq
;		;
; AVX-LABEL: var_shift_v2i64:		; AVX1-LABEL: var_shift_v2i64:
; AVX: # BB#0:		; AVX1: # BB#0:
; AVX-NEXT: vpextrq $1, %xmm0, %rax		; AVX1-NEXT: vmovdqa {{.*#+}} xmm2 = [9223372036854775808,9223372036854775808]
; AVX-NEXT: vpextrq $1, %xmm1, %rcx		; AVX1-NEXT: vpsrlq %xmm1, %xmm2, %xmm3
; AVX-NEXT: sarq %cl, %rax		; AVX1-NEXT: vpshufd {{.*#+}} xmm4 = xmm1[2,3,0,1]
; AVX-NEXT: vmovq %rax, %xmm2		; AVX1-NEXT: vpsrlq %xmm4, %xmm2, %xmm2
; AVX-NEXT: vmovq %xmm0, %rax		; AVX1-NEXT: vpblendw {{.*#+}} xmm2 = xmm3[0,1,2,3],xmm2[4,5,6,7]
; AVX-NEXT: vmovq %xmm1, %rcx		; AVX1-NEXT: vpsrlq %xmm1, %xmm0, %xmm1
; AVX-NEXT: sarq %cl, %rax		; AVX1-NEXT: vpsrlq %xmm4, %xmm0, %xmm0
; AVX-NEXT: vmovq %rax, %xmm0		; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm1[0,1,2,3],xmm0[4,5,6,7]
; AVX-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]		; AVX1-NEXT: vpxor %xmm2, %xmm0, %xmm0
; AVX-NEXT: retq		; AVX1-NEXT: vpsubq %xmm2, %xmm0, %xmm0
		; AVX1-NEXT: retq
		;
		; AVX2-LABEL: var_shift_v2i64:
		; AVX2: # BB#0:
		; AVX2-NEXT: vmovdqa {{.*#+}} xmm2 = [9223372036854775808,9223372036854775808]
		; AVX2-NEXT: vpsrlvq %xmm1, %xmm2, %xmm3
		; AVX2-NEXT: vpxor %xmm2, %xmm0, %xmm0
		; AVX2-NEXT: vpsrlvq %xmm1, %xmm0, %xmm0
		; AVX2-NEXT: vpsubq %xmm3, %xmm0, %xmm0
		; AVX2-NEXT: retq
;		;
; X32-SSE-LABEL: var_shift_v2i64:		; X32-SSE-LABEL: var_shift_v2i64:
; X32-SSE: # BB#0:		; X32-SSE: # BB#0:
; X32-SSE-NEXT: pushl %ebp
; X32-SSE-NEXT: pushl %ebx
; X32-SSE-NEXT: pushl %edi
; X32-SSE-NEXT: pushl %esi
; X32-SSE-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,1,2,3]
; X32-SSE-NEXT: movd %xmm2, %edx
; X32-SSE-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,0,1]
; X32-SSE-NEXT: movd %xmm2, %esi
; X32-SSE-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,0,1]		; X32-SSE-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,0,1]
; X32-SSE-NEXT: movd %xmm2, %eax		; X32-SSE-NEXT: movdqa {{.*#+}} xmm3 = [0,2147483648,0,2147483648]
; X32-SSE-NEXT: movb %al, %cl		; X32-SSE-NEXT: movdqa %xmm3, %xmm4
; X32-SSE-NEXT: shrdl %cl, %edx, %esi		; X32-SSE-NEXT: psrlq %xmm2, %xmm4
; X32-SSE-NEXT: movd %xmm0, %edi		; X32-SSE-NEXT: movq {{.*#+}} xmm5 = xmm1[0],zero
; X32-SSE-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]		; X32-SSE-NEXT: psrlq %xmm5, %xmm3
; X32-SSE-NEXT: movd %xmm0, %ebx		; X32-SSE-NEXT: movsd {{.*#+}} xmm4 = xmm3[0],xmm4[1]
; X32-SSE-NEXT: movd %xmm1, %ecx		; X32-SSE-NEXT: movdqa %xmm0, %xmm1
; X32-SSE-NEXT: shrdl %cl, %ebx, %edi		; X32-SSE-NEXT: psrlq %xmm2, %xmm1
; X32-SSE-NEXT: movl %ebx, %ebp		; X32-SSE-NEXT: psrlq %xmm5, %xmm0
; X32-SSE-NEXT: sarl %cl, %ebp		; X32-SSE-NEXT: movsd {{.*#+}} xmm1 = xmm0[0],xmm1[1]
; X32-SSE-NEXT: sarl $31, %ebx		; X32-SSE-NEXT: xorpd %xmm4, %xmm1
; X32-SSE-NEXT: testb $32, %cl		; X32-SSE-NEXT: psubq %xmm4, %xmm1
; X32-SSE-NEXT: cmovnel %ebp, %edi		; X32-SSE-NEXT: movdqa %xmm1, %xmm0
; X32-SSE-NEXT: movd %edi, %xmm0
; X32-SSE-NEXT: cmovel %ebp, %ebx
; X32-SSE-NEXT: movl %edx, %edi
; X32-SSE-NEXT: movb %al, %cl
; X32-SSE-NEXT: sarl %cl, %edi
; X32-SSE-NEXT: sarl $31, %edx
; X32-SSE-NEXT: testb $32, %al
; X32-SSE-NEXT: cmovnel %edi, %esi
; X32-SSE-NEXT: movd %esi, %xmm1
; X32-SSE-NEXT: movd %ebx, %xmm2
; X32-SSE-NEXT: cmovel %edi, %edx
; X32-SSE-NEXT: movd %edx, %xmm3
; X32-SSE-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
; X32-SSE-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
; X32-SSE-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
; X32-SSE-NEXT: popl %esi
; X32-SSE-NEXT: popl %edi
; X32-SSE-NEXT: popl %ebx
; X32-SSE-NEXT: popl %ebp
; X32-SSE-NEXT: retl		; X32-SSE-NEXT: retl
%shift = ashr <2 x i64> %a, %b		%shift = ashr <2 x i64> %a, %b
ret <2 x i64> %shift		ret <2 x i64> %shift
}		}

define <4 x i32> @var_shift_v4i32(<4 x i32> %a, <4 x i32> %b) nounwind {		define <4 x i32> @var_shift_v4i32(<4 x i32> %a, <4 x i32> %b) nounwind {
; SSE2-LABEL: var_shift_v4i32:		; SSE2-LABEL: var_shift_v4i32:
; SSE2: # BB#0:		; SSE2: # BB#0:
▲ Show 20 Lines • Show All 404 Lines • ▼ Show 20 Lines	; X32-SSE-NEXT: retl
ret <16 x i8> %shift		ret <16 x i8> %shift
}		}

;		;
; Uniform Variable Shifts		; Uniform Variable Shifts
;		;

define <2 x i64> @splatvar_shift_v2i64(<2 x i64> %a, <2 x i64> %b) nounwind {		define <2 x i64> @splatvar_shift_v2i64(<2 x i64> %a, <2 x i64> %b) nounwind {
; SSE2-LABEL: splatvar_shift_v2i64:		; SSE-LABEL: splatvar_shift_v2i64:
; SSE2: # BB#0:		; SSE: # BB#0:
; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[0,1,0,1]		; SSE-NEXT: movdqa {{.*#+}} xmm2 = [9223372036854775808,9223372036854775808]
; SSE2-NEXT: movd %xmm0, %rax		; SSE-NEXT: psrlq %xmm1, %xmm2
; SSE2-NEXT: movd %xmm2, %rcx		; SSE-NEXT: psrlq %xmm1, %xmm0
; SSE2-NEXT: sarq %cl, %rax		; SSE-NEXT: pxor %xmm2, %xmm0
; SSE2-NEXT: movd %rax, %xmm1		; SSE-NEXT: psubq %xmm2, %xmm0
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; SSE-NEXT: retq
; SSE2-NEXT: movd %xmm0, %rax
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm2[2,3,0,1]
; SSE2-NEXT: movd %xmm0, %rcx
; SSE2-NEXT: sarq %cl, %rax
; SSE2-NEXT: movd %rax, %xmm0
; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
; SSE2-NEXT: movdqa %xmm1, %xmm0
; SSE2-NEXT: retq
;
; SSE41-LABEL: splatvar_shift_v2i64:
; SSE41: # BB#0:
; SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm1[0,1,0,1]
; SSE41-NEXT: pextrq $1, %xmm0, %rax
; SSE41-NEXT: pextrq $1, %xmm1, %rcx
; SSE41-NEXT: sarq %cl, %rax
; SSE41-NEXT: movd %rax, %xmm2
; SSE41-NEXT: movd %xmm0, %rax
; SSE41-NEXT: movd %xmm1, %rcx
; SSE41-NEXT: sarq %cl, %rax
; SSE41-NEXT: movd %rax, %xmm0
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
; SSE41-NEXT: retq
;
; AVX1-LABEL: splatvar_shift_v2i64:
; AVX1: # BB#0:
; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,1]
; AVX1-NEXT: vpextrq $1, %xmm0, %rax
; AVX1-NEXT: vpextrq $1, %xmm1, %rcx
; AVX1-NEXT: sarq %cl, %rax
; AVX1-NEXT: vmovq %rax, %xmm2
; AVX1-NEXT: vmovq %xmm0, %rax
; AVX1-NEXT: vmovq %xmm1, %rcx
; AVX1-NEXT: sarq %cl, %rax
; AVX1-NEXT: vmovq %rax, %xmm0
; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
; AVX1-NEXT: retq
;		;
; AVX2-LABEL: splatvar_shift_v2i64:		; AVX-LABEL: splatvar_shift_v2i64:
; AVX2: # BB#0:		; AVX: # BB#0:
; AVX2-NEXT: vpbroadcastq %xmm1, %xmm1		; AVX-NEXT: vmovdqa {{.*#+}} xmm2 = [9223372036854775808,9223372036854775808]
; AVX2-NEXT: vpextrq $1, %xmm0, %rax		; AVX-NEXT: vpsrlq %xmm1, %xmm2, %xmm2
; AVX2-NEXT: vpextrq $1, %xmm1, %rcx		; AVX-NEXT: vpsrlq %xmm1, %xmm0, %xmm0
; AVX2-NEXT: sarq %cl, %rax		; AVX-NEXT: vpxor %xmm2, %xmm0, %xmm0
; AVX2-NEXT: vmovq %rax, %xmm2		; AVX-NEXT: vpsubq %xmm2, %xmm0, %xmm0
; AVX2-NEXT: vmovq %xmm0, %rax		; AVX-NEXT: retq
; AVX2-NEXT: vmovq %xmm1, %rcx
; AVX2-NEXT: sarq %cl, %rax
; AVX2-NEXT: vmovq %rax, %xmm0
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
; AVX2-NEXT: retq
;		;
; X32-SSE-LABEL: splatvar_shift_v2i64:		; X32-SSE-LABEL: splatvar_shift_v2i64:
; X32-SSE: # BB#0:		; X32-SSE: # BB#0:
; X32-SSE-NEXT: pushl %ebp		; X32-SSE-NEXT: movq {{.*#+}} xmm1 = xmm1[0],zero
; X32-SSE-NEXT: pushl %ebx		; X32-SSE-NEXT: movdqa {{.*#+}} xmm2 = [0,2147483648,0,2147483648]
; X32-SSE-NEXT: pushl %edi		; X32-SSE-NEXT: psrlq %xmm1, %xmm2
; X32-SSE-NEXT: pushl %esi		; X32-SSE-NEXT: psrlq %xmm1, %xmm0
; X32-SSE-NEXT: pshufd {{.*#+}} xmm1 = xmm1[0,1,0,1]		; X32-SSE-NEXT: pxor %xmm2, %xmm0
; X32-SSE-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,1,2,3]		; X32-SSE-NEXT: psubq %xmm2, %xmm0
; X32-SSE-NEXT: movd %xmm2, %edx
; X32-SSE-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,0,1]
; X32-SSE-NEXT: movd %xmm2, %esi
; X32-SSE-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,0,1]
; X32-SSE-NEXT: movd %xmm2, %eax
; X32-SSE-NEXT: movb %al, %cl
; X32-SSE-NEXT: shrdl %cl, %edx, %esi
; X32-SSE-NEXT: movd %xmm0, %edi
; X32-SSE-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
; X32-SSE-NEXT: movd %xmm0, %ebx
; X32-SSE-NEXT: movd %xmm1, %ecx
; X32-SSE-NEXT: shrdl %cl, %ebx, %edi
; X32-SSE-NEXT: movl %ebx, %ebp
; X32-SSE-NEXT: sarl %cl, %ebp
; X32-SSE-NEXT: sarl $31, %ebx
; X32-SSE-NEXT: testb $32, %cl
; X32-SSE-NEXT: cmovnel %ebp, %edi
; X32-SSE-NEXT: movd %edi, %xmm0
; X32-SSE-NEXT: cmovel %ebp, %ebx
; X32-SSE-NEXT: movl %edx, %edi
; X32-SSE-NEXT: movb %al, %cl
; X32-SSE-NEXT: sarl %cl, %edi
; X32-SSE-NEXT: sarl $31, %edx
; X32-SSE-NEXT: testb $32, %al
; X32-SSE-NEXT: cmovnel %edi, %esi
; X32-SSE-NEXT: movd %esi, %xmm1
; X32-SSE-NEXT: movd %ebx, %xmm2
; X32-SSE-NEXT: cmovel %edi, %edx
; X32-SSE-NEXT: movd %edx, %xmm3
; X32-SSE-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
; X32-SSE-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
; X32-SSE-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
; X32-SSE-NEXT: popl %esi
; X32-SSE-NEXT: popl %edi
; X32-SSE-NEXT: popl %ebx
; X32-SSE-NEXT: popl %ebp
; X32-SSE-NEXT: retl		; X32-SSE-NEXT: retl
%splat = shufflevector <2 x i64> %b, <2 x i64> undef, <2 x i32> zeroinitializer		%splat = shufflevector <2 x i64> %b, <2 x i64> undef, <2 x i32> zeroinitializer
%shift = ashr <2 x i64> %a, %splat		%shift = ashr <2 x i64> %a, %splat
ret <2 x i64> %shift		ret <2 x i64> %shift
}		}

define <4 x i32> @splatvar_shift_v4i32(<4 x i32> %a, <4 x i32> %b) nounwind {		define <4 x i32> @splatvar_shift_v4i32(<4 x i32> %a, <4 x i32> %b) nounwind {
; SSE2-LABEL: splatvar_shift_v4i32:		; SSE2-LABEL: splatvar_shift_v4i32:
▲ Show 20 Lines • Show All 291 Lines • ▼ Show 20 Lines

;		;
; Constant Shifts		; Constant Shifts
;		;

define <2 x i64> @constant_shift_v2i64(<2 x i64> %a) nounwind {		define <2 x i64> @constant_shift_v2i64(<2 x i64> %a) nounwind {
; SSE2-LABEL: constant_shift_v2i64:		; SSE2-LABEL: constant_shift_v2i64:
; SSE2: # BB#0:		; SSE2: # BB#0:
; SSE2-NEXT: movd %xmm0, %rax		; SSE2-NEXT: movdqa %xmm0, %xmm1
; SSE2-NEXT: sarq %rax		; SSE2-NEXT: psrlq $7, %xmm1
; SSE2-NEXT: movd %rax, %xmm1		; SSE2-NEXT: psrlq $1, %xmm0
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]		; SSE2-NEXT: movsd {{.*#+}} xmm1 = xmm0[0],xmm1[1]
; SSE2-NEXT: movd %xmm0, %rax		; SSE2-NEXT: movapd {{.*#+}} xmm0 = [4611686018427387904,72057594037927936]
; SSE2-NEXT: sarq $7, %rax		; SSE2-NEXT: xorpd %xmm0, %xmm1
; SSE2-NEXT: movd %rax, %xmm0		; SSE2-NEXT: psubq %xmm0, %xmm1
; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
; SSE2-NEXT: movdqa %xmm1, %xmm0		; SSE2-NEXT: movdqa %xmm1, %xmm0
; SSE2-NEXT: retq		; SSE2-NEXT: retq
;		;
; SSE41-LABEL: constant_shift_v2i64:		; SSE41-LABEL: constant_shift_v2i64:
; SSE41: # BB#0:		; SSE41: # BB#0:
; SSE41-NEXT: pextrq $1, %xmm0, %rax		; SSE41-NEXT: movdqa %xmm0, %xmm1
; SSE41-NEXT: sarq $7, %rax		; SSE41-NEXT: psrlq $7, %xmm1
; SSE41-NEXT: movd %rax, %xmm1		; SSE41-NEXT: psrlq $1, %xmm0
; SSE41-NEXT: movd %xmm0, %rax		; SSE41-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]
; SSE41-NEXT: sarq %rax		; SSE41-NEXT: movdqa {{.*#+}} xmm1 = [4611686018427387904,72057594037927936]
; SSE41-NEXT: movd %rax, %xmm0		; SSE41-NEXT: pxor %xmm1, %xmm0
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]		; SSE41-NEXT: psubq %xmm1, %xmm0
; SSE41-NEXT: retq		; SSE41-NEXT: retq
;		;
; AVX-LABEL: constant_shift_v2i64:		; AVX1-LABEL: constant_shift_v2i64:
; AVX: # BB#0:		; AVX1: # BB#0:
; AVX-NEXT: vpextrq $1, %xmm0, %rax		; AVX1-NEXT: vpsrlq $7, %xmm0, %xmm1
; AVX-NEXT: sarq $7, %rax		; AVX1-NEXT: vpsrlq $1, %xmm0, %xmm0
; AVX-NEXT: vmovq %rax, %xmm1		; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]
; AVX-NEXT: vmovq %xmm0, %rax		; AVX1-NEXT: vmovdqa {{.*#+}} xmm1 = [4611686018427387904,72057594037927936]
; AVX-NEXT: sarq %rax		; AVX1-NEXT: vpxor %xmm1, %xmm0, %xmm0
; AVX-NEXT: vmovq %rax, %xmm0		; AVX1-NEXT: vpsubq %xmm1, %xmm0, %xmm0
; AVX-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]		; AVX1-NEXT: retq
; AVX-NEXT: retq		;
		; AVX2-LABEL: constant_shift_v2i64:
		; AVX2: # BB#0:
		; AVX2-NEXT: vpsrlvq {{.*}}(%rip), %xmm0, %xmm0
		; AVX2-NEXT: vmovdqa {{.*#+}} xmm1 = [4611686018427387904,72057594037927936]
		; AVX2-NEXT: vpxor %xmm1, %xmm0, %xmm0
		; AVX2-NEXT: vpsubq %xmm1, %xmm0, %xmm0
		; AVX2-NEXT: retq
;		;
; X32-SSE-LABEL: constant_shift_v2i64:		; X32-SSE-LABEL: constant_shift_v2i64:
; X32-SSE: # BB#0:		; X32-SSE: # BB#0:
; X32-SSE-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]		; X32-SSE-NEXT: movl $7, %eax
; X32-SSE-NEXT: movd %xmm1, %eax		; X32-SSE-NEXT: movd %eax, %xmm2
; X32-SSE-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,1,2,3]		; X32-SSE-NEXT: movdqa {{.*#+}} xmm1 = [0,2147483648,0,2147483648]
; X32-SSE-NEXT: movd %xmm1, %ecx		; X32-SSE-NEXT: movdqa %xmm1, %xmm3
; X32-SSE-NEXT: shrdl $7, %ecx, %eax		; X32-SSE-NEXT: psrlq %xmm2, %xmm3
; X32-SSE-NEXT: movd %eax, %xmm1		; X32-SSE-NEXT: movl $1, %eax
; X32-SSE-NEXT: movd %xmm0, %eax		; X32-SSE-NEXT: movd %eax, %xmm4
; X32-SSE-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]		; X32-SSE-NEXT: psrlq %xmm4, %xmm1
; X32-SSE-NEXT: movd %xmm0, %edx		; X32-SSE-NEXT: movsd {{.*#+}} xmm3 = xmm1[0],xmm3[1]
; X32-SSE-NEXT: shrdl $1, %edx, %eax		; X32-SSE-NEXT: movdqa %xmm0, %xmm1
; X32-SSE-NEXT: movd %eax, %xmm0		; X32-SSE-NEXT: psrlq %xmm2, %xmm1
; X32-SSE-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]		; X32-SSE-NEXT: psrlq %xmm4, %xmm0
; X32-SSE-NEXT: sarl $7, %ecx		; X32-SSE-NEXT: movsd {{.*#+}} xmm1 = xmm0[0],xmm1[1]
; X32-SSE-NEXT: movd %ecx, %xmm1		; X32-SSE-NEXT: xorpd %xmm3, %xmm1
; X32-SSE-NEXT: sarl %edx		; X32-SSE-NEXT: psubq %xmm3, %xmm1
; X32-SSE-NEXT: movd %edx, %xmm2		; X32-SSE-NEXT: movdqa %xmm1, %xmm0
; X32-SSE-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
; X32-SSE-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
; X32-SSE-NEXT: retl		; X32-SSE-NEXT: retl
%shift = ashr <2 x i64> %a, <i64 1, i64 7>		%shift = ashr <2 x i64> %a, <i64 1, i64 7>
ret <2 x i64> %shift		ret <2 x i64> %shift
}		}
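
The new check lines for `constant_shift_v2i64` all follow one pattern: a logical shift (`psrlq`), an xor with a constant, and a subtract of the same constant. A minimal Python sketch of that pattern is given here, assuming the constant is the sign bit shifted right by the same per-lane amount. The helper names (`lshr64`, `ashr64_via_lshr`, `ashr64_reference`) are hypothetical and only model 64-bit lanes with Python integers; they are not part of the patch.

```python
MASK64 = (1 << 64) - 1   # keep results inside one 64-bit lane
SIGN_BIT = 1 << 63       # 9223372036854775808


def lshr64(x, s):
    """Logical shift right of one 64-bit lane (what psrlq computes)."""
    return (x & MASK64) >> s


def ashr64_via_lshr(x, s):
    """Arithmetic shift right built from a logical shift, an xor and a subtract."""
    m = lshr64(SIGN_BIT, s)                  # where the sign bit lands after the shift
    return ((lshr64(x, s) ^ m) - m) & MASK64


def ashr64_reference(x, s):
    """Reference result: reinterpret the lane as signed, then use Python's >>."""
    signed = x - (1 << 64) if x & SIGN_BIT else x
    return (signed >> s) & MASK64


# The two lanes of the test use shift amounts 1 and 7.
lane_masks = [lshr64(SIGN_BIT, 1), lshr64(SIGN_BIT, 7)]
```

Under that assumption, `lane_masks` reproduces the `[4611686018427387904,72057594037927936]` constant in the SSE2, SSE41 and AVX1 checks, and the xor/subtract pair re-creates the sign extension that a logical shift drops.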

define <4 x i32> @constant_shift_v4i32(<4 x i32> %a) nounwind {		define <4 x i32> @constant_shift_v4i32(<4 x i32> %a) nounwind {
; SSE2-LABEL: constant_shift_v4i32:		; SSE2-LABEL: constant_shift_v4i32:
; SSE2: # BB#0:		; SSE2: # BB#0:
[... 455 lines not shown ...]

test/CodeGen/X86/vector-shift-ashr-256.ll

	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx \| FileCheck %s --check-prefix=ALL --check-prefix=AVX --check-prefix=AVX1			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx \| FileCheck %s --check-prefix=ALL --check-prefix=AVX --check-prefix=AVX1
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx2 \| FileCheck %s --check-prefix=ALL --check-prefix=AVX --check-prefix=AVX2			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx2 \| FileCheck %s --check-prefix=ALL --check-prefix=AVX --check-prefix=AVX2

	;			;
	; Variable Shifts			; Variable Shifts
	;			;

	define <4 x i64> @var_shift_v4i64(<4 x i64> %a, <4 x i64> %b) nounwind {			define <4 x i64> @var_shift_v4i64(<4 x i64> %a, <4 x i64> %b) nounwind {
	; AVX1-LABEL: var_shift_v4i64:			; AVX1-LABEL: var_shift_v4i64:
	; AVX1: # BB#0:			; AVX1: # BB#0:
	; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2			; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm2
	; AVX1-NEXT: vpextrq $1, %xmm2, %rax			; AVX1-NEXT: vmovdqa {{.*#+}} xmm3 = [9223372036854775808,9223372036854775808]
	; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm3			; AVX1-NEXT: vpsrlq %xmm2, %xmm3, %xmm4
	; AVX1-NEXT: vpextrq $1, %xmm3, %rcx			; AVX1-NEXT: vpshufd {{.*#+}} xmm5 = xmm2[2,3,0,1]
	; AVX1-NEXT: sarq %cl, %rax			; AVX1-NEXT: vpsrlq %xmm5, %xmm3, %xmm6
	; AVX1-NEXT: vmovq %rax, %xmm4			; AVX1-NEXT: vpblendw {{.*#+}} xmm4 = xmm4[0,1,2,3],xmm6[4,5,6,7]
	; AVX1-NEXT: vmovq %xmm2, %rax			; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm6
	; AVX1-NEXT: vmovq %xmm3, %rcx			; AVX1-NEXT: vpsrlq %xmm2, %xmm6, %xmm2
	; AVX1-NEXT: sarq %cl, %rax			; AVX1-NEXT: vpsrlq %xmm5, %xmm6, %xmm5
	; AVX1-NEXT: vmovq %rax, %xmm2			; AVX1-NEXT: vpblendw {{.*#+}} xmm2 = xmm2[0,1,2,3],xmm5[4,5,6,7]
	; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm4[0]			; AVX1-NEXT: vpxor %xmm4, %xmm2, %xmm2
	; AVX1-NEXT: vpextrq $1, %xmm0, %rax			; AVX1-NEXT: vpsubq %xmm4, %xmm2, %xmm2
	; AVX1-NEXT: vpextrq $1, %xmm1, %rcx			; AVX1-NEXT: vpsrlq %xmm1, %xmm3, %xmm4
	; AVX1-NEXT: sarq %cl, %rax			; AVX1-NEXT: vpshufd {{.*#+}} xmm5 = xmm1[2,3,0,1]
	; AVX1-NEXT: vmovq %rax, %xmm3			; AVX1-NEXT: vpsrlq %xmm5, %xmm3, %xmm3
	; AVX1-NEXT: vmovq %xmm0, %rax			; AVX1-NEXT: vpblendw {{.*#+}} xmm3 = xmm4[0,1,2,3],xmm3[4,5,6,7]
	; AVX1-NEXT: vmovq %xmm1, %rcx			; AVX1-NEXT: vpsrlq %xmm1, %xmm0, %xmm1
	; AVX1-NEXT: sarq %cl, %rax			; AVX1-NEXT: vpsrlq %xmm5, %xmm0, %xmm0
	; AVX1-NEXT: vmovq %rax, %xmm0			; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm1[0,1,2,3],xmm0[4,5,6,7]
	; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm3[0]			; AVX1-NEXT: vpxor %xmm3, %xmm0, %xmm0
				; AVX1-NEXT: vpsubq %xmm3, %xmm0, %xmm0
	; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: var_shift_v4i64:			; AVX2-LABEL: var_shift_v4i64:
	; AVX2: # BB#0:			; AVX2: # BB#0:
	; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm2			; AVX2-NEXT: vpbroadcastq {{.*}}(%rip), %ymm2
	; AVX2-NEXT: vpextrq $1, %xmm2, %rax			; AVX2-NEXT: vpsrlvq %ymm1, %ymm2, %ymm3
	; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm3			; AVX2-NEXT: vpxor %ymm2, %ymm0, %ymm0
	; AVX2-NEXT: vpextrq $1, %xmm3, %rcx			; AVX2-NEXT: vpsrlvq %ymm1, %ymm0, %ymm0
	; AVX2-NEXT: sarq %cl, %rax			; AVX2-NEXT: vpsubq %ymm3, %ymm0, %ymm0
	; AVX2-NEXT: vmovq %rax, %xmm4
	; AVX2-NEXT: vmovq %xmm2, %rax
	; AVX2-NEXT: vmovq %xmm3, %rcx
	; AVX2-NEXT: sarq %cl, %rax
	; AVX2-NEXT: vmovq %rax, %xmm2
	; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm4[0]
	; AVX2-NEXT: vpextrq $1, %xmm0, %rax
	; AVX2-NEXT: vpextrq $1, %xmm1, %rcx
	; AVX2-NEXT: sarq %cl, %rax
	; AVX2-NEXT: vmovq %rax, %xmm3
	; AVX2-NEXT: vmovq %xmm0, %rax
	; AVX2-NEXT: vmovq %xmm1, %rcx
	; AVX2-NEXT: sarq %cl, %rax
	; AVX2-NEXT: vmovq %rax, %xmm0
	; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm3[0]
	; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%shift = ashr <4 x i64> %a, %b			%shift = ashr <4 x i64> %a, %b
	ret <4 x i64> %shift			ret <4 x i64> %shift
	}			}
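
	The AVX2 sequence for `var_shift_v4i64` applies the xor before the shift rather than after it. The sketch below models that instruction order on four lanes and compares it with a signed reference. It assumes the value loaded by `vpbroadcastq` is the sign bit (the check line only matches `{{.*}}(%rip)`, so this is an inference from the AVX1 constant) and that every shift amount is in the range 0 to 63. The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1
SIGN_BIT = 1 << 63


def lshr_lanes(values, shifts):
    """Per-lane logical shift right, modelling vpsrlvq for amounts 0..63."""
    return [(v & MASK64) >> s for v, s in zip(values, shifts)]


def ashr_v4i64_avx2_order(a, b):
    """Follow the order of the AVX2 check lines, one Python list per ymm register."""
    sign = [SIGN_BIT] * len(a)                       # vpbroadcastq (assumed sign-bit splat)
    m = lshr_lanes(sign, b)                          # vpsrlvq %ymm1, %ymm2, %ymm3
    t = [x ^ s for x, s in zip(a, sign)]             # vpxor   %ymm2, %ymm0, %ymm0
    t = lshr_lanes(t, b)                             # vpsrlvq %ymm1, %ymm0, %ymm0
    return [(x - y) & MASK64 for x, y in zip(t, m)]  # vpsubq  %ymm3, %ymm0, %ymm0


def ashr_reference(a, b):
    """Reference result using Python's arithmetic shift on signed values."""
    out = []
    for x, s in zip(a, b):
        signed = x - (1 << 64) if x & SIGN_BIT else x
        out.append((signed >> s) & MASK64)
    return out
```

	Because a logical shift distributes over xor, shifting `a ^ sign` gives the same lanes as `lshr(a) ^ lshr(sign)`, so this ordering is equivalent to the shift-then-xor form used in the AVX1 checks.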

	define <8 x i32> @var_shift_v8i32(<8 x i32> %a, <8 x i32> %b) nounwind {			define <8 x i32> @var_shift_v8i32(<8 x i32> %a, <8 x i32> %b) nounwind {
	; AVX1-LABEL: var_shift_v8i32:			; AVX1-LABEL: var_shift_v8i32:
	; AVX1: # BB#0:			; AVX1: # BB#0:
	[... 175 lines not shown ...]

	;			;
	; Uniform Variable Shifts			; Uniform Variable Shifts
	;			;

	define <4 x i64> @splatvar_shift_v4i64(<4 x i64> %a, <4 x i64> %b) nounwind {			define <4 x i64> @splatvar_shift_v4i64(<4 x i64> %a, <4 x i64> %b) nounwind {
	; AVX1-LABEL: splatvar_shift_v4i64:			; AVX1-LABEL: splatvar_shift_v4i64:
	; AVX1: # BB#0:			; AVX1: # BB#0:
	; AVX1-NEXT: vmovddup {{.*#+}} xmm1 = xmm1[0,0]			; AVX1-NEXT: vmovdqa {{.*#+}} xmm2 = [9223372036854775808,9223372036854775808]
	; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2			; AVX1-NEXT: vpsrlq %xmm1, %xmm2, %xmm2
	; AVX1-NEXT: vpextrq $1, %xmm2, %rdx			; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm3
	; AVX1-NEXT: vpextrq $1, %xmm1, %rax			; AVX1-NEXT: vpsrlq %xmm1, %xmm3, %xmm3
	; AVX1-NEXT: movb %al, %cl			; AVX1-NEXT: vpxor %xmm2, %xmm3, %xmm3
	; AVX1-NEXT: sarq %cl, %rdx			; AVX1-NEXT: vpsubq %xmm2, %xmm3, %xmm3
	; AVX1-NEXT: vmovq %rdx, %xmm3			; AVX1-NEXT: vpsrlq %xmm1, %xmm0, %xmm0
	; AVX1-NEXT: vmovq %xmm2, %rsi			; AVX1-NEXT: vpxor %xmm2, %xmm0, %xmm0
	; AVX1-NEXT: vmovq %xmm1, %rdx			; AVX1-NEXT: vpsubq %xmm2, %xmm0, %xmm0
	; AVX1-NEXT: movb %dl, %cl			; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm0
	; AVX1-NEXT: sarq %cl, %rsi
	; AVX1-NEXT: vmovq %rsi, %xmm1
	; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
	; AVX1-NEXT: vpextrq $1, %xmm0, %rsi
	; AVX1-NEXT: movb %al, %cl
	; AVX1-NEXT: sarq %cl, %rsi
	; AVX1-NEXT: vmovq %rsi, %xmm2
	; AVX1-NEXT: vmovq %xmm0, %rax
	; AVX1-NEXT: movb %dl, %cl
	; AVX1-NEXT: sarq %cl, %rax
	; AVX1-NEXT: vmovq %rax, %xmm0
	; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: splatvar_shift_v4i64:			; AVX2-LABEL: splatvar_shift_v4i64:
	; AVX2: # BB#0:			; AVX2: # BB#0:
	; AVX2-NEXT: vpbroadcastq %xmm1, %ymm1			; AVX2-NEXT: vpbroadcastq {{.*}}(%rip), %ymm2
	; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm2			; AVX2-NEXT: vpsrlq %xmm1, %ymm2, %ymm2
	; AVX2-NEXT: vpextrq $1, %xmm2, %rax			; AVX2-NEXT: vpsrlq %xmm1, %ymm0, %ymm0
	; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm3			; AVX2-NEXT: vpxor %ymm2, %ymm0, %ymm0
	; AVX2-NEXT: vpextrq $1, %xmm3, %rcx			; AVX2-NEXT: vpsubq %ymm2, %ymm0, %ymm0
	; AVX2-NEXT: sarq %cl, %rax
	; AVX2-NEXT: vmovq %rax, %xmm4
	; AVX2-NEXT: vmovq %xmm2, %rax
	; AVX2-NEXT: vmovq %xmm3, %rcx
	; AVX2-NEXT: sarq %cl, %rax
	; AVX2-NEXT: vmovq %rax, %xmm2
	; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm4[0]
	; AVX2-NEXT: vpextrq $1, %xmm0, %rax
	; AVX2-NEXT: vpextrq $1, %xmm1, %rcx
	; AVX2-NEXT: sarq %cl, %rax
	; AVX2-NEXT: vmovq %rax, %xmm3
	; AVX2-NEXT: vmovq %xmm0, %rax
	; AVX2-NEXT: vmovq %xmm1, %rcx
	; AVX2-NEXT: sarq %cl, %rax
	; AVX2-NEXT: vmovq %rax, %xmm0
	; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm3[0]
	; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%splat = shufflevector <4 x i64> %b, <4 x i64> undef, <4 x i32> zeroinitializer			%splat = shufflevector <4 x i64> %b, <4 x i64> undef, <4 x i32> zeroinitializer
	%shift = ashr <4 x i64> %a, %splat			%shift = ashr <4 x i64> %a, %splat
	ret <4 x i64> %shift			ret <4 x i64> %shift
	}			}

	define <8 x i32> @splatvar_shift_v8i32(<8 x i32> %a, <8 x i32> %b) nounwind {			define <8 x i32> @splatvar_shift_v8i32(<8 x i32> %a, <8 x i32> %b) nounwind {
	; AVX1-LABEL: splatvar_shift_v8i32:			; AVX1-LABEL: splatvar_shift_v8i32:
	[... 127 lines not shown ...]
	;			;
	; Constant Shifts			; Constant Shifts
	;			;

	define <4 x i64> @constant_shift_v4i64(<4 x i64> %a) nounwind {			define <4 x i64> @constant_shift_v4i64(<4 x i64> %a) nounwind {
	; AVX1-LABEL: constant_shift_v4i64:			; AVX1-LABEL: constant_shift_v4i64:
	; AVX1: # BB#0:			; AVX1: # BB#0:
	; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm1			; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm1
	; AVX1-NEXT: vpextrq $1, %xmm1, %rax			; AVX1-NEXT: vpsrlq $62, %xmm1, %xmm2
	; AVX1-NEXT: sarq $62, %rax			; AVX1-NEXT: vpsrlq $31, %xmm1, %xmm1
	; AVX1-NEXT: vmovq %rax, %xmm2			; AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0,1,2,3],xmm2[4,5,6,7]
	; AVX1-NEXT: vmovq %xmm1, %rax			; AVX1-NEXT: vmovdqa {{.*#+}} xmm2 = [4294967296,2]
	; AVX1-NEXT: sarq $31, %rax			; AVX1-NEXT: vpxor %xmm2, %xmm1, %xmm1
	; AVX1-NEXT: vmovq %rax, %xmm1			; AVX1-NEXT: vpsubq %xmm2, %xmm1, %xmm1
	; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]			; AVX1-NEXT: vpsrlq $7, %xmm0, %xmm2
	; AVX1-NEXT: vpextrq $1, %xmm0, %rax			; AVX1-NEXT: vpsrlq $1, %xmm0, %xmm0
	; AVX1-NEXT: sarq $7, %rax			; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm2[4,5,6,7]
	; AVX1-NEXT: vmovq %rax, %xmm2			; AVX1-NEXT: vmovdqa {{.*#+}} xmm2 = [4611686018427387904,72057594037927936]
	; AVX1-NEXT: vmovq %xmm0, %rax			; AVX1-NEXT: vpxor %xmm2, %xmm0, %xmm0
	; AVX1-NEXT: sarq %rax			; AVX1-NEXT: vpsubq %xmm2, %xmm0, %xmm0
	; AVX1-NEXT: vmovq %rax, %xmm0
	; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: constant_shift_v4i64:			; AVX2-LABEL: constant_shift_v4i64:
	; AVX2: # BB#0:			; AVX2: # BB#0:
	; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX2-NEXT: vpsrlvq {{.*}}(%rip), %ymm0, %ymm0
	; AVX2-NEXT: vpextrq $1, %xmm1, %rax			; AVX2-NEXT: vmovdqa {{.*#+}} ymm1 = [4611686018427387904,72057594037927936,4294967296,2]
	; AVX2-NEXT: sarq $62, %rax			; AVX2-NEXT: vpxor %ymm1, %ymm0, %ymm0
	; AVX2-NEXT: vmovq %rax, %xmm2			; AVX2-NEXT: vpsubq %ymm1, %ymm0, %ymm0
	; AVX2-NEXT: vmovq %xmm1, %rax
	; AVX2-NEXT: sarq $31, %rax
	; AVX2-NEXT: vmovq %rax, %xmm1
	; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
	; AVX2-NEXT: vpextrq $1, %xmm0, %rax
	; AVX2-NEXT: sarq $7, %rax
	; AVX2-NEXT: vmovq %rax, %xmm2
	; AVX2-NEXT: vmovq %xmm0, %rax
	; AVX2-NEXT: sarq %rax
	; AVX2-NEXT: vmovq %rax, %xmm0
	; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%shift = ashr <4 x i64> %a, <i64 1, i64 7, i64 31, i64 62>			%shift = ashr <4 x i64> %a, <i64 1, i64 7, i64 31, i64 62>
	ret <4 x i64> %shift			ret <4 x i64> %shift
	}			}
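
	The mask constants in the `constant_shift_v4i64` checks can be derived from the shift amounts `<1, 7, 31, 62>`. The short sketch below does this, assuming each mask lane is the sign bit shifted right by that lane's amount; `sign_masks` is a hypothetical helper, not something defined in the patch.

```python
SIGN_BIT = 1 << 63


def sign_masks(shifts):
    """Position of the sign bit in each lane after a logical shift right."""
    return [SIGN_BIT >> s for s in shifts]


shifts = [1, 7, 31, 62]
masks = sign_masks(shifts)

# AVX1 works on two 128-bit halves; AVX2 uses one 256-bit constant.
low_half, high_half = masks[:2], masks[2:]
```

	With these inputs `low_half` and `high_half` correspond to the two `vmovdqa` constants in the AVX1 checks, and `masks` as a whole corresponds to the single `ymm1` constant in the AVX2 checks.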

	define <8 x i32> @constant_shift_v8i32(<8 x i32> %a) nounwind {			define <8 x i32> @constant_shift_v8i32(<8 x i32> %a) nounwind {
	; AVX1-LABEL: constant_shift_v8i32:			; AVX1-LABEL: constant_shift_v8i32:
	; AVX1: # BB#0:			; AVX1: # BB#0:
	[... 243 lines not shown ...]

utils/update_llc_test_checks.py

[... 51 lines not shown; hunk context: def scrub_asm(asm): ...]
  asm = ASM_SCRUB_TRAILING_WHITESPACE_RE.sub(r'', asm)		  asm = ASM_SCRUB_TRAILING_WHITESPACE_RE.sub(r'', asm)
  return asm		  return asm


def main():		def main():
  parser = argparse.ArgumentParser(description=__doc__)		  parser = argparse.ArgumentParser(description=__doc__)
  parser.add_argument('-v', '--verbose', action='store_true',		  parser.add_argument('-v', '--verbose', action='store_true',
                      help='Show verbose output')		                      help='Show verbose output')
  parser.add_argument('--llc-binary', default='llc',		  parser.add_argument('--llc-binary', default='/Users/simon/LLVM/build/Release+Asserts/bin/llc',
                      help='The "llc" binary to use to generate the test case')		                      help='The "llc" binary to use to generate the test case')
  parser.add_argument(		  parser.add_argument(
      '--function', help='The function in the test file to update')		      '--function', help='The function in the test file to update')
  parser.add_argument('tests', nargs='+')		  parser.add_argument('tests', nargs='+')
  args = parser.parse_args()		  args = parser.parse_args()

  run_line_re = re.compile('^\s*;\s*RUN:\s*(.*)$')		  run_line_re = re.compile('^\s*;\s*RUN:\s*(.*)$')
  ir_function_re = re.compile('^\s*define\s+(?:internal\s+)?[^@]*@(\w+)\s*\(')		  ir_function_re = re.compile('^\s*define\s+(?:internal\s+)?[^@]*@(\w+)\s*\(')
[... 145 lines not shown ...]
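
The two regular expressions in `main()` pick out RUN lines and function definitions from a test file. The sketch below shows how they might be applied to lines like those in `vector-shift-ashr-256.ll`. It assumes the quantifiers read `\s*`, `(.*)` and `[^@]*` (the archived page appears to have lost the asterisks), and `scan_test_lines` is a hypothetical helper rather than a function from the script.

```python
import re

# Patterns as assumed above; raw strings avoid escape-sequence warnings.
RUN_LINE_RE = re.compile(r'^\s*;\s*RUN:\s*(.*)$')
IR_FUNCTION_RE = re.compile(r'^\s*define\s+(?:internal\s+)?[^@]*@(\w+)\s*\(')


def scan_test_lines(lines):
    """Collect RUN commands and defined function names from test-file lines."""
    run_commands = []
    function_names = []
    for line in lines:
        run_match = RUN_LINE_RE.match(line)
        if run_match:
            run_commands.append(run_match.group(1))
            continue
        func_match = IR_FUNCTION_RE.match(line)
        if func_match:
            function_names.append(func_match.group(1))
    return run_commands, function_names


sample = [
    '; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefix=AVX2',
    'define <4 x i64> @var_shift_v4i64(<4 x i64> %a, <4 x i64> %b) nounwind {',
    '  %shift = ashr <4 x i64> %a, %b',
]
run_commands, function_names = scan_test_lines(sample)
```

Here `run_commands` holds the text after `RUN:` and `function_names` holds `var_shift_v4i64`; the `ashr` line matches neither pattern and is skipped.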
