This patch adds costs for the vectorized implementations of CTPOP. The default values seriously underestimated the cost of these and were encouraging vectorization on targets where serialized use of POPCNT would be much better.
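For context, the new costs take the form of per-type cost-table entries in X86TargetTransformInfo.cpp. The sketch below only illustrates the shape of such a table in the style the file already uses for other operations; the table name, the set of types, and the cost numbers are placeholders, not the actual values added by this diff.

```cpp
// Illustrative only: per-type CTPOP costs keyed on the legalized vector type.
static const CostTblEntry AVX2CostTbl[] = {
  { ISD::CTPOP, MVT::v4i64,  7 },
  { ISD::CTPOP, MVT::v8i32,  7 },
  { ISD::CTPOP, MVT::v16i16, 6 },
  { ISD::CTPOP, MVT::v32i8,  3 },
};

// ...in the intrinsic cost query, after legalizing the type to MTy:
if (ST->hasAVX2())
  if (const auto *Entry = CostTableLookup(AVX2CostTbl, ISD, MTy))
    return LT.first * Entry->Cost;
```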
Diff Detail
- Repository: rL LLVM
Event Timeline
lib/Target/X86/X86TargetTransformInfo.cpp
- Lines 970–973 ↗ (On Diff #64294): Add a general comment to explain why we have these numbers? Also add a comment in LowerVectorCTPOP() that the TTI cost model should be updated if the algorithm changes.
test/Transforms/SLPVectorizer/X86/ctpop.ll
- Lines 3–5 ↗ (On Diff #64294): Can we use -mattr=sse4.2 / avx / avx2 instead of -mcpu?
Is the plan to make these costs also dependent on the host CPU? For example, IIRC the vector ctpop lowerings have serially dependent pshufbs, which have 1-cycle latency on big Intel cores but 4-cycle latency on Jaguar, according to Agner.
Also, on Jaguar scalar popcnt is "as cheap as an add", but on e.g. Skylake scalar popcnt has 4x lower throughput than an add and 3x higher latency.
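For reference, the vector lowering being costed here is essentially the nibble-LUT popcount built around PSHUFB. A rough C++ intrinsics sketch of the 128-bit byte-element case is below; it is only meant to show why PSHUFB dominates the cost, not to reproduce the exact sequence LowerVectorCTPOP() emits.

```cpp
#include <immintrin.h> // SSSE3 intrinsics (compile with -mssse3 or later)

// Per-byte popcount via two PSHUFB nibble-LUT lookups. Wider element types
// then reduce these byte counts (e.g. PSADBW is typically used for i64).
static __m128i popcnt_epi8_pshufb(__m128i v) {
  const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                    1, 2, 2, 3, 2, 3, 3, 4);
  const __m128i nibble_mask = _mm_set1_epi8(0x0f);
  __m128i lo = _mm_and_si128(v, nibble_mask);                    // low nibbles
  __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), nibble_mask); // high nibbles
  __m128i cnt_lo = _mm_shuffle_epi8(lut, lo);                    // PSHUFB #1
  __m128i cnt_hi = _mm_shuffle_epi8(lut, hi);                    // PSHUFB #2
  return _mm_add_epi8(cnt_lo, cnt_hi);
}
```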
Someday - I think the priority is to get down to one set of cost tables. Sanjay knows this better than I do, but IIRC there are 3 or 4 separate sets of costs in the codebase - some based on approximate latency, others (like this one) on the throughput of recent big Intel cores. I don't think any of them use the scheduler models or anything overly target-specific.
> Also, on Jaguar scalar popcnt is "as cheap as an add", but on e.g. Skylake scalar popcnt has 4x lower throughput than an add and 3x higher latency.
I haven't added scalar throughput costs here; the costs are for the vector implementations, which, as you say, are dominated by PSHUFB. I was put off dealing with the scalars by TargetTransformInfo::PopcntSupportKind, which seems to be trying to do something similar.
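For anyone unfamiliar with it, this is roughly what that existing scalar-side hook does on X86 (a paraphrased sketch, not a verbatim copy of the current code): it only distinguishes "has a POPCNT instruction" from "software fallback", with no throughput or latency information attached.

```cpp
// Paraphrased sketch of the X86 TTI hook for scalar popcount support.
TargetTransformInfo::PopcntSupportKind
X86TTIImpl::getPopcntSupport(unsigned TyWidth) {
  assert(isPowerOf2_32(TyWidth) && "type width must be a power of 2");
  return ST->hasPOPCNT() ? TargetTransformInfo::PSK_FastHardware
                         : TargetTransformInfo::PSK_Software;
}
```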
A couple of links for reference...
- @mkuper mentioned changing the scalar cost of an op here: https://llvm.org/bugs/show_bug.cgi?id=28434#c4
- There was some discussion about the various cost models here: https://llvm.org/bugs/show_bug.cgi?id=26837
@mzolotukhin made a good point there: it's not clear whether we actually should differentiate for CPUs at this level (this is IR after all). A better solution might be to have the most conservative or the most common values as part of the cost model here, and then expect the backend to fix that up for particular CPU models.
The scary part about that currently is how bad the backend is at deconstructing too-wide vector IR code. This shows up even today in the semi-legal case of AVX, where we have 256-bit registers but no 256-bit integer ops. It causes trouble in several cases I've seen; i.e., we would have done much better if we had just pretended that we only have 128-bit registers for those targets.