This is an archive of the discontinued LLVM Phabricator instance.

Adjust the cost of vectorized SHL/SRL/SRA
Needs Review · Public

Authored by wmi on May 21 2015, 4:15 PM.

Details

Summary

This patch solves the problem in https://llvm.org/bugs/show_bug.cgi?id=23582. It adjusts the cost of vectorized SHL/SRL/SRA and makes sure they are lowered to vectorized shift instructions.

There are a bunch of testcases that need to be adjusted. I am sending out the patch first to see whether the approach is OK in general, and will update the patch with adjusted tests.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi updated this revision to Diff 26282. May 21 2015, 4:15 PM
wmi retitled this revision to Adjust the cost of vectorized SHL/SRL/SRA.
wmi updated this object.
wmi edited the test plan for this revision.
wmi added reviewers: nadav, aschwaighofer.
wmi set the repository for this revision to rL LLVM.
wmi added subscribers: Unknown Object (MLST), davidxl.
nadav edited edge metadata. May 21 2015, 4:18 PM

Looks good to me assuming that you write the cost model tests, and that you write tests for the ISel changes.

Thanks for looking at this - comments below. I should mention I have some similar work in progress in D9474 and D9645 that is trying to efficiently vectorize more integer shifts.

lib/Target/X86/X86ISelLowering.cpp
16421 (On Diff #26282)

I think this is out of date - Elena did some refactoring on this recently.

lib/Target/X86/X86TargetTransformInfo.cpp
274

Why is v4i64 declared here?

284

SSE only has fast variable shift support for uniform values; these cost values should surely reflect the likely cost of a general (per-lane variable) shift.

It would be better to update the TargetTransformInfo::OK_UniformConstantValue tables + code above this to support TargetTransformInfo::OK_UniformValue as well.
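
To make this concrete, below is a minimal standalone sketch of the idea, not the actual X86TargetTransformInfo code: a separate cost table for shifts whose amount is uniform across all lanes, consulted only when the operand-kind information says the shift amount is uniform. The enum values, the table name and the lookup helper are hypothetical.

#include <optional>

// Hypothetical operand-kind and type enums; the real ones live in
// TargetTransformInfo and MVT.
enum class OperandKind { AnyValue, UniformValue, UniformConstantValue };
enum class ShiftISD { SHL, SRL, SRA };
enum class SimpleVT { v16i8, v8i16, v4i32, v2i64 };

struct ShiftCostEntry {
  ShiftISD Op;
  SimpleVT Ty;
  unsigned Cost;
};

// Assumed costs for shifts whose amount is a uniform but non-constant value;
// SSE2 handles these with the register forms of PSLL/PSRL/PSRA. There is no
// PSRAQ, so the v2i64 SRA entry is intentionally missing.
static const ShiftCostEntry SSE2UniformShiftCostTable[] = {
    {ShiftISD::SHL, SimpleVT::v8i16, 1}, {ShiftISD::SHL, SimpleVT::v4i32, 1},
    {ShiftISD::SHL, SimpleVT::v2i64, 1}, {ShiftISD::SRL, SimpleVT::v8i16, 1},
    {ShiftISD::SRL, SimpleVT::v4i32, 1}, {ShiftISD::SRL, SimpleVT::v2i64, 1},
    {ShiftISD::SRA, SimpleVT::v8i16, 1}, {ShiftISD::SRA, SimpleVT::v4i32, 1},
};

static std::optional<unsigned> lookupUniformShiftCost(ShiftISD Op, SimpleVT Ty) {
  for (const ShiftCostEntry &E : SSE2UniformShiftCostTable)
    if (E.Op == Op && E.Ty == Ty)
      return E.Cost;
  return std::nullopt;
}

unsigned getShiftCost(ShiftISD Op, SimpleVT Ty, OperandKind ShiftAmtKind,
                      unsigned GenericCost) {
  // The cheap path applies only when every lane is shifted by the same amount.
  if (ShiftAmtKind == OperandKind::UniformValue ||
      ShiftAmtKind == OperandKind::UniformConstantValue)
    if (auto Cost = lookupUniformShiftCost(Op, Ty))
      return *Cost;
  // Otherwise fall back to the generic (typically scalarized) estimate.
  return GenericCost;
}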

aschwaighofer edited edge metadata. May 22 2015, 7:13 AM

I share Simon's concerns. Please make sure that we still get a good estimate for kernels like the following (these are from the rdar mentioned in the commit).

#define TYPE char
#define OP >>
#define SIZE 1024
#define TYPE_ALIGN __attribute__((aligned(16)))

TYPE A1[SIZE] TYPE_ALIGN;
TYPE B1[SIZE] TYPE_ALIGN;
TYPE C1[SIZE] TYPE_ALIGN;

void kernel1() {
  for (int i = 0; i < SIZE; ++i) {
    A1[i] = B1[i] OP C1[i];
  }
}

or:

for(k=0, r=0; k<pos; k++)
  r += (MAX_UNSIGNED) 1 << k;
wmi added a comment. May 22 2015, 10:03 AM

I share Simon's concerns. Please make sure that we still get a good estimate for kernels like the following (these are from the rdar mentioned in the commit).

#define TYPE char
#define OP >>
#define SIZE 1024
#define TYPE_ALIGN __attribute__((aligned(16)))

TYPE A1[SIZE] TYPE_ALIGN;
TYPE B1[SIZE] TYPE_ALIGN;
TYPE C1[SIZE] TYPE_ALIGN;

void kernel1() {
  for (int i = 0; i < SIZE; ++i) {
    A1[i] = B1[i] OP C1[i];
  }
}

or:

for(k=0, r=0; k<pos; k++)
  r += (MAX_UNSIGNED) 1 << k;

Thanks for sharing the testcase. For the first testcase:

Without the patch, the generated code for the kernel loop is:
.LBB0_1: # %for.body
         # =>This Inner Loop Header: Depth=1
movsbl  B1+1024(%rax), %edx
movb    C1+1024(%rax), %cl
sarl    %cl, %edx
movb    %dl, A1+1024(%rax)
incq    %rax
jne     .LBB0_1

With the patch, the generated code for the kernel loop is:
.LBB0_1: # %vector.body
         # =>This Inner Loop Header: Depth=1
movd    B1+1024(%rax), %xmm1    # xmm1 = mem[0],zero,zero,zero
punpcklbw       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklwd       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3]
psrad   $24, %xmm1
movd    C1+1024(%rax), %xmm2    # xmm2 = mem[0],zero,zero,zero
punpcklbw       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklwd       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3]
psrad   $24, %xmm2
psrad   %xmm2, %xmm1
pand    %xmm0, %xmm1
packuswb        %xmm1, %xmm1
packuswb        %xmm1, %xmm1
movd    %xmm1, A1+1024(%rax)
addq    $4, %rax
jne     .LBB0_1

The vectorized version is slightly better than the scalarized version, but the cost estimation used to choose the VF is not very good: it reports a cost of 8 for VF==1 and a cost of 2 for VF==4. The estimated costs of the vectorized sext and trunc are too low and don't match the real costs.

Another problem is that the vectorizer doesn't know the char->int type promotion here is unnecessary.

Can you give me the whole version of the second testcase? I am not sure my tweaked version is the right one.

wmi added a comment. May 22 2015, 10:18 AM

Sorry, the formatting of the assembly code was lost when I replied from Phabricator. Repasting it here:

#define TYPE char
#define OP >>
#define SIZE 1024
#define TYPE_ALIGN __attribute__((aligned(16)))

TYPE A1[SIZE] TYPE_ALIGN;
TYPE B1[SIZE] TYPE_ALIGN;
TYPE C1[SIZE] TYPE_ALIGN;

void kernel1() {
  for (int i = 0; i < SIZE; ++i) {
    A1[i] = B1[i] OP C1[i];
  }
}

Without the patch, the kernel loop:
.LBB0_1: # %for.body

movsbl  B1+1024(%rax), %edx
movb    C1+1024(%rax), %cl
sarl    %cl, %edx
movb    %dl, A1+1024(%rax)
incq    %rax
jne     .LBB0_1

With the patch, the kernel loop:
.LBB0_1: # %vector.body

movd    B1+1024(%rax), %xmm1    # xmm1 = mem[0],zero,zero,zero
punpcklbw       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklwd       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3]
psrad   $24, %xmm1
movd    C1+1024(%rax), %xmm2    # xmm2 = mem[0],zero,zero,zero
punpcklbw       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklwd       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3]
psrad   $24, %xmm2
psrad   %xmm2, %xmm1
pand    %xmm0, %xmm1
packuswb        %xmm1, %xmm1
packuswb        %xmm1, %xmm1
movd    %xmm1, A1+1024(%rax)
addq    $4, %rax
jne     .LBB0_1

Although the vectorized version is slightly better, the cost estimation is not precise (the vectorizer's cost estimation says VF==4 (cost is 2) is much better than VF==1 (cost is 8)).

Wei.

wmi added a comment. May 22 2015, 11:25 AM

Ah, the code generated by this patch is incorrect. As Simon said, SSE
only has fast variable shift support for uniform values. I
misunderstood the meaning of the vectorized shift instruction. Will
rewrite the patch.

wmi updated this revision to Diff 26536. May 26 2015, 2:19 PM
wmi edited edge metadata.

I updated the patch.

The cost of a vectorized shift with a uniform scalar shift amount is adjusted. lowerShift already contains the logic to lower such shifts to ISD::VSHL/VSRL/VSRA properly.
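
As a concrete illustration of the kind of loop this targets, here is a hypothetical kernel (not from the patch or the rdar), similar to the ones above but with a single non-constant shift amount shared by all iterations:

#define SIZE 1024

int A2[SIZE], B2[SIZE];

// Hypothetical example: every iteration shifts by the same non-constant
// amount 's', so after vectorization the whole vector can be shifted by one
// PSRAD with the amount held in an XMM register, instead of scalarizing the
// shift per lane.
void kernel2(int s) {
  for (int i = 0; i < SIZE; ++i)
    A2[i] = B2[i] >> s;
}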

It works for the motivating case in PR23582, and the LLVM unit tests pass.

Thank you for the update. I don't know enough about the LoopVectorize code to review that part, but the TTI cost model changes look correct.

Please can you add uniform non-constant tests to:

test/Analysis/CostModel/X86/testshiftashr.ll
test/Analysis/CostModel/X86/testshiftlshr.ll
test/Analysis/CostModel/X86/testshiftshl.ll

wmi updated this revision to Diff 26983. Jun 2 2015, 10:10 AM

Please can you add uniform non-constant tests to:
test/Analysis/CostModel/X86/testshiftashr.ll
test/Analysis/CostModel/X86/testshiftlshr.ll
test/Analysis/CostModel/X86/testshiftshl.ll

Thanks for the review. I added uniform non-constant tests to those files. Another change in this patch is to try to set the operand kind to TargetTransformInfo::OK_UniformValue in the CostModelAnalysis pass.
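
For illustration only, here is a standalone sketch of that classification idea rather than the actual CostModelAnalysis code; the simplified structures and names are hypothetical. The point is that a shift amount can be treated as uniform when every lane is known to hold the same value, even if that value is not a compile-time constant:

#include <vector>

// Hypothetical operand-kind enum mirroring the TTI operand-kind idea.
enum class OperandKind { AnyValue, UniformValue, UniformConstantValue };

// Simplified description of one lane of the shift-amount operand.
struct Lane {
  bool IsConstant;   // lane value known at compile time
  long long Value;   // meaningful only when IsConstant is true
  int SourceId;      // identifies the scalar value feeding this lane
};

OperandKind classifyShiftAmount(const std::vector<Lane> &Lanes) {
  if (Lanes.empty())
    return OperandKind::AnyValue;

  bool AllSameScalar = true;
  bool AllSameConstant = Lanes[0].IsConstant;
  for (const Lane &L : Lanes) {
    AllSameScalar = AllSameScalar && (L.SourceId == Lanes[0].SourceId);
    AllSameConstant = AllSameConstant && L.IsConstant &&
                      (L.Value == Lanes[0].Value);
  }

  if (AllSameConstant)
    return OperandKind::UniformConstantValue; // splat of a single constant
  if (AllSameScalar)
    return OperandKind::UniformValue;         // splat of one non-constant value
  return OperandKind::AnyValue;
}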