This is an archive of the discontinued LLVM Phabricator instance.

[x86] @llvm.ctpop.v8i32 custom lowering
AbandonedPublic

Authored by bruno on Dec 4 2014, 8:22 AM.

Details

Reviewers

chandlerc
nadav
delena
andreadb

Summary

This patch adds x86 custom lowering for the @llvm.ctpop.v8i32 intrinsic.

Currently, @llvm.ctpop.v8i32 is expanded into vector element extractions,
insertions and individual calls to @llvm.ctpop.i32. Local Haswell measurements
show that a vector-parallel bit-twiddling approach is faster for @llvm.ctpop.v8i32
than calling @llvm.ctpop.i32 on each element. The approach is based on:

v = v - ((v >> 1) & 0x55555555);
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
v = (v + (v >> 4)) & 0x0F0F0F0F;
v = v + (v >> 8);
v = v + (v >> 16);
v = v & 0x0000003F;
(from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel)
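
As an illustration only, the sequence above can be written as a scalar C++ function that models a single 32-bit lane; the vector lowering applies the same steps to all eight lanes at once. The names popcount32_bitmath and ctpop_v8i32_model are made up for this sketch and do not appear in the patch.

```cpp
#include <array>
#include <cassert> // assert, handy for quick spot checks
#include <cstddef>
#include <cstdint>

// Scalar model of one 32-bit lane of the parallel bit-math population count.
// Hypothetical helper for illustration; not taken from the patch.
inline std::uint32_t popcount32_bitmath(std::uint32_t v) {
  v = v - ((v >> 1) & 0x55555555u);                 // counts per 2-bit field
  v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); // counts per 4-bit field
  v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 // counts per byte
  v = v + (v >> 8);                                 // fold adjacent bytes
  v = v + (v >> 16);                                // fold the two halves
  return v & 0x3Fu;                                 // total fits in 6 bits
}

// Lane-by-lane model of what a v8i32 ctpop computes. The real lowering runs
// the steps above on a whole vector register instead of looping.
inline std::array<std::uint32_t, 8>
ctpop_v8i32_model(const std::array<std::uint32_t, 8> &lanes) {
  std::array<std::uint32_t, 8> out{};
  for (std::size_t i = 0; i < lanes.size(); ++i)
    out[i] = popcount32_bitmath(lanes[i]);
  return out;
}
```

The adds and shifts in the last three steps replace the multiply-and-shift used by the referenced bithacks version; the final mask discards the partial sums left in the upper bytes.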

A toy microbenchmark showed a nearly 2x speedup (18.39 vs. 10.35 for v8i32 on core-avx2), whereas vector types with fewer elements
are still better off with the old approach (see results below). Hence this
patch only implements it for the v8i32 type. The results indicate it might also be profitable
to implement this approach for v32i8 and v16i16, but I haven't measured that yet.

On AVX1, ctpop.v8i32 is split into two ctpop.v4i32 operations, which is only slightly better than the old expansion. However,
this patch does not implement custom lowering for the general ctpop.v4i32 type, since it's not profitable.

[core-avx2]

v8i32-new: 10.3506
v8i32-old: 18.3879
v4i32-new: 10.3699
v4i32-old: 8.01387
v4i64-new: 11.7464
v4i64-old: 10.3043
v2i64-new: 11.7922
v2i64-old: 5.20916

[corei7-avx]

v8i32-new: 16.5359
v8i32-old: 18.2479
v4i32-new: 10.2069
v4i32-old: 8.03686
v4i64-new: 17.8085
v4i64-old: 10.2366
v2i64-new: 11.7623
v2i64-old: 5.11533

Diff Detail

Repository: rL LLVM

Event Timeline

bruno updated this revision to Diff 16929. Dec 4 2014, 8:22 AM

bruno retitled this revision from to [x86] @llvm.ctpop.v8i32 custom lowering.

bruno updated this object.

bruno edited the test plan for this revision.

bruno added reviewers: nadav, chandlerc, andreadb, delena.

bruno set the repository for this revision to rL LLVM.

bruno added a subscriber: Unknown Object (MLST).

LGTM! Thank you for the detailed measurements!

Do you think that other targets may also benefit from this kind of transformation?

Thanks,
Nadav

Chandler,

Thanks for the help, the assembly for v8i32-new/old:
http://pastebin.com/4gnd41Je

About the principled split: I'd rather go the other way around. Since SelectionDAGLegalize::ExpandBitCount already emits the bit math for scalarized versions, it makes more sense to custom split to other known vector types only when we already know it's profitable.

Nadav and Hal,

I believe there are potential benefits for other targets, but this customisation generates a bunch of vector instructions. I'm afraid that if one or another of those vector instructions isn't well supported on a target, we could end up with a lot of scalarized instructions and worse code than before. I might be wrong though. I'd rather go in the direction that, if other targets implement it and succeed, we then move it to target-independent code. Additional thoughts?

Actually, back to x86: if popcnt isn't supported by some x86 target, we currently emit this scalarized bit-math code for each element, and it would always be profitable to emit the vectorized code instead. I tested it for v4i32, v2i64, v4i64 and v8i32 and it performs better. I'm going to update the patch to reflect that. For instance, "-arch x86_64" doesn't assume popcnt by default, since it is a separate feature; in cases like this we would always win.

Patch updated!

Ping :-)

I see you've gone ahead and committed this.

Please actually implement the significantly better algorithm I pointed you at
if you're going to have an x86-specific implementation. Also please
implement this for the other vector types. I really don't want this to be
left in a half-done state forever.

Expanding to other vector types doesn't seem unreasonable to do in a
follow-up patch, but I think it would have been better to start off with
the final algorithm. We now have a pretty substantial pile of code in the
x86 backend that will be completely replaced. =/

> I see you've gone ahead and committed this.
>
> Please actually implement the significantly better algorithm I pointed you at
> if you're going to have an x86-specific implementation. Also please
> implement this for the other vector types. I really don't want this to be
> left in a half-done state forever.

I understand your concern, I'll get to it, promise :-)

> Expanding to other vector types doesn't seem unreasonable to do in a
> follow-up patch, but I think it would have been better to start off with
> the final algorithm. We now have a pretty substantial pile of code in the
> x86 backend that will be completely replaced. =/

Although you're probably right, I'd rather not do it before taking proper
measurements, which I haven't been able to get yet. Also, since we already know this performs
better than the previous expansions, at least we have better code generation for ctpop
in the meantime.

Thanks for the feedback and ideas :D

And three months later, you still haven't implemented the changes requested during code review.

Please do so, and quickly. I'm really unhappy about the behavior of promising to make changes requested during code review in order to get past code review, and then failing to follow through on them. I'm very tempted to just revert the patch until you actually have time to address this fully.

chandlerc requested changes to this revision. Mar 29 2015, 2:23 PM

chandlerc edited edge metadata.

This revision now requires changes to proceed. Mar 29 2015, 2:23 PM

I guess this makes it four months later now =T

Really sorry for the late reply. You're totally right; my bad, I
haven't tackled this item on my priority list despite my promises.
I intend to resume this work in a week or two, but feel
free to revert it if that sounds like another vague promise :-)

Cheers,

Hi,

This patch implements a faster vector population count based on the algorithm
described in http://wm.ite.pl/articles/sse-popcount.html

It does so by using an in-register lookup table and the pshufb instruction to
compute the popcnt for each byte. Additional instructions are then used to sum
the bytes and produce the result for wider element types. Numbers:

v4i32-avx:

sselookup (v4i32): 1.10211
scalar + ctpop (v4i32): 0.907016 <-- best == ToT
parallelbitmath (v4i32): 1.14124

v8i32-avx:

sselookup (v8i32): 1.97514 <-- best == patch
scalar + ctpop (v8i32): 2.37118

v8i32-avx2:

sselookup (v8i32): 1.17823
parallelbitmath (v8i32): 1.15288 <-- best == ToT

v2i64-avx:

scalar + ctpop (v2i64): 0.589292 <-- best == ToT
sselookup (v2i64): 0.865797
parallelbitmath (v2i64): 1.31027

v4i64-avx:

scalar + ctpop (v4i64): 0.903523 <-- best == ToT
sselookup (v4i64): 1.11988

v4i64-avx2:

scalar + ctpop (v4i64): 0.895486
sselookup (v4i64): 0.677801 <-- best == patch
parallelbitmath (v4i64): 1.02711

v16i8-avx:

scalar + ctpop (v16i8): 4.1569
sselookup (v16i8): 0.508693 <-- best == patch

v32i8-avx:

scalar + ctpop (v32i8): 8.32336
sselookup (v32i8): 0.961657 <-- best == patch

v32i8-avx2:

scalar + ctpop (v32i8): 8.79509
sselookup (v32i8): 0.487716 <-- best == patch

v8i16-avx:

scalar + ctpop (v8i16): 1.86908
sselookup (v8i16): 0.755885 <-- best == patch

v16i16-avx:

scalar + ctpop (v16i16): 4.08575
sselookup (v16i16): 1.32838 <-- best == patch

v16i16-avx2:

scalar + ctpop (v16i16): 4.19101
sselookup (v16i16): 1.18095 <-- best == patch

More info available at
https://github.com/bcardosolopes/llvm-vpopcount
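
To make the lookup-table idea described at the top of this comment concrete, here is a rough, portable C++ model of the per-byte step. It emulates a 128-bit PSHUFB with a plain loop instead of using intrinsics, so it shows the data flow only, not the generated code. emulate_pshufb and popcount_bytes_lut are hypothetical names chosen for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Portable model of a 128-bit PSHUFB: each result byte is table[idx & 0x0F],
// or zero when bit 7 of the index byte is set.
inline Bytes16 emulate_pshufb(const Bytes16 &table, const Bytes16 &idx) {
  Bytes16 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = (idx[i] & 0x80) ? std::uint8_t{0} : table[idx[i] & 0x0F];
  return out;
}

// Per-byte population count using a 16-entry table indexed by nibble.
inline Bytes16 popcount_bytes_lut(const Bytes16 &in) {
  // lut[n] is the number of set bits in the 4-bit value n.
  static const Bytes16 lut = {0, 1, 1, 2, 1, 2, 2, 3,
                              1, 2, 2, 3, 2, 3, 3, 4};
  Bytes16 high{}, low{};
  for (std::size_t i = 0; i < in.size(); ++i) {
    high[i] = static_cast<std::uint8_t>((in[i] >> 4) & 0x0F); // shifted-right high nibble
    low[i] = static_cast<std::uint8_t>(in[i] & 0x0F);         // masked low nibble
    assert(high[i] < 16 && low[i] < 16);
  }
  const Bytes16 highCnt = emulate_pshufb(lut, high);
  const Bytes16 lowCnt = emulate_pshufb(lut, low);
  Bytes16 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = static_cast<std::uint8_t>(highCnt[i] + lowCnt[i]);
  return out;
}
```

Each output byte is at most 8, so the byte-wise add cannot overflow. Wider element types need an extra summing step on top of this per-byte result.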

One unexpected case is v8i32-avx2. Although sselookup and parallelbitmath vary
in which runs faster, I've seen the latter yield slightly better results across
multiple runs. I would expect sselookup to always be faster because it has
fewer instructions, but it looks like there's some latency/resource conflict issue
going on.

Given the slight perf difference between sselookup and parallelbitmath for
v8i32-avx2, I've removed parallelbitmath completely in this patch and left
sselookup as the default for this type too. We can later change the behavior
for this type back to parallelbitmath (see the next paragraph).

This patch only improves the x86-specific part of vector popcnt. The previous
approach implemented for x86 in Dec 2014, parallelbitmath, is generally
inferior. Given its target-independent nature, it will be resubmitted in a follow-up
patch as a target-independent expansion for vector popcnt, since (although no
longer on x86) it's much better than the scalar expansion we
currently do.
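
For reference, the type selection that this patch's setOperationAction calls appear to encode can be modelled roughly as below. This is a simplified sketch, not code from the patch: the Features struct and usesCustomCtpop are hypothetical, type names are passed as strings for brevity, and it assumes that hasFp256() corresponds to AVX and hasInt256() to AVX2.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the X86Subtarget feature queries used in the patch.
struct Features {
  bool ssse3;  // models Subtarget->hasSSSE3()
  bool avx;    // models Subtarget->hasFp256() (assumed to mean AVX)
  bool avx2;   // models Subtarget->hasInt256() (assumed to mean AVX2)
  bool popcnt; // models Subtarget->hasPOPCNT()
};

// Returns true when ISD::CTPOP would be marked Custom for the vector type
// named by vt, following the conditions visible in the diff.
inline bool usesCustomCtpop(const std::string &vt, const Features &f) {
  // AVX2: v4i64 is always custom lowered.
  if (f.avx2 && vt == "v4i64")
    return true;
  if (f.avx) {
    if (vt == "v32i8" || vt == "v16i16" || vt == "v8i32")
      return true;
    // AVX without AVX2: scalar popcnt is faster for v4i64 when available.
    if (vt == "v4i64")
      return !f.popcnt;
  }
  if (f.ssse3) {
    if (vt == "v16i8" || vt == "v8i16")
      return true;
    // 32/64-bit elements: prefer scalar popcnt when the target has it.
    if (vt == "v4i32" || vt == "v2i64")
      return !f.popcnt;
  }
  return false;
}
```

This matches the benchmark annotations above: the cases marked "best == ToT" with scalar ctpop (v4i32, v2i64 and v4i64 on AVX) stay on the scalar path whenever popcnt is available.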

The assembly for v8i32-avx2 cases:

https://github.com/bcardosolopes/llvm-vpopcount/blob/master/v8i32/avx2/v8i32-bitmath.s
https://github.com/bcardosolopes/llvm-vpopcount/blob/master/v8i32/avx2/v8i32-sselookup.s

chandlerc mentioned this in rL238391: [x86] Refactor the tests for popcnt. May 27 2015, 7:44 PM

chandlerc mentioned this in D10084: [x86] Implement a faster vector population count based on the PSHUFB in-register LUT technique. May 28 2015, 2:16 AM

chandlerc mentioned this in rL238636: [x86] Implement a faster vector population count based on the PSHUFB. May 29 2015, 8:25 PM

Abandoning this one since an improved version was committed a while back in http://reviews.llvm.org/D10084.

Revision Contents

Path                                       Size
lib/Target/X86/X86ISelLowering.h           3 lines
lib/Target/X86/X86ISelLowering.cpp         356 lines
lib/Target/X86/X86InstrFragmentsSIMD.td    3 lines
lib/Target/X86/X86InstrSSE.td              14 lines
test/CodeGen/X86/avx-popcnt.ll             382 lines
test/CodeGen/X86/avx2-popcnt.ll            93 lines
test/CodeGen/X86/vector-ctpop.ll

Diff 26286

lib/Target/X86/X86ISelLowering.h

@@ ... @@ enum NodeType {

       /// Insert the lower 16-bits of a 32-bit value to a vector,
       /// corresponds to X86::PINSRW.
       PINSRW, MMX_PINSRW,

       /// Shuffle 16 8-bit values within a vector.
       PSHUFB,

+      /// Compute Sum of Absolute Differences.
+      PSADBW,
+
       /// Bitwise Logical AND NOT of Packed FP values.
       ANDNP,

       /// Copy integer sign.
       PSIGN,

       /// Blend where the selector is an immediate.
       BLENDI,

lib/Target/X86/X86ISelLowering.cpp

@@ ... @@ if (!TM.Options.UseSoftFloat && Subtarget->hasSSE2()) {
     setOperationAction(ISD::SETCC, MVT::v4i32, Custom);

     setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v16i8, Custom);
     setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v8i16, Custom);
     setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v8i16, Custom);
     setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v4i32, Custom);
     setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v4f32, Custom);

-    // Only provide customized ctpop vector bit twiddling for vector types we
-    // know to perform better than using the popcnt instructions on each vector
-    // element. If popcnt isn't supported, always provide the custom version.
-    if (!Subtarget->hasPOPCNT()) {
-      setOperationAction(ISD::CTPOP, MVT::v4i32, Custom);
-      setOperationAction(ISD::CTPOP, MVT::v2i64, Custom);
-    }
+    if (Subtarget->hasSSSE3()) {
+      setOperationAction(ISD::CTPOP, MVT::v16i8, Custom);
+      setOperationAction(ISD::CTPOP, MVT::v8i16, Custom);
+      // It is faster to extract 32/64 bit elements and use scalar ctpop
+      // instructions on v4i32/v4i64 elements than to custom lower ctpop.
+      if (!Subtarget->hasPOPCNT()) {
+        setOperationAction(ISD::CTPOP, MVT::v4i32, Custom);
+        setOperationAction(ISD::CTPOP, MVT::v2i64, Custom);
+      }
+    }

     // Custom lower build_vector, vector_shuffle, and extract_vector_elt.
     for (int i = MVT::v16i8; i != MVT::v2i64; ++i) {
       MVT VT = (MVT::SimpleValueType)i;
       // Do not attempt to custom lower non-power-of-2 vectors
       if (!isPowerOf2_32(VT.getVectorNumElements()))
         continue;
       // Do not attempt to custom lower non-128-bit vectors
@@ ... @@ if (!TM.Options.UseSoftFloat && Subtarget->hasFp256()) {
     setOperationAction(ISD::ZERO_EXTEND, MVT::v16i16, Custom);
     setOperationAction(ISD::ANY_EXTEND, MVT::v4i64, Custom);
     setOperationAction(ISD::ANY_EXTEND, MVT::v8i32, Custom);
     setOperationAction(ISD::ANY_EXTEND, MVT::v16i16, Custom);
     setOperationAction(ISD::TRUNCATE, MVT::v16i8, Custom);
     setOperationAction(ISD::TRUNCATE, MVT::v8i16, Custom);
     setOperationAction(ISD::TRUNCATE, MVT::v4i32, Custom);

+    setOperationAction(ISD::CTPOP, MVT::v32i8, Custom);
+    setOperationAction(ISD::CTPOP, MVT::v16i16, Custom);
+    setOperationAction(ISD::CTPOP, MVT::v8i32, Custom);
+    // It is faster to extract 64 bit elements and use scalar ctpop
+    // instructions on v4i64 elements for avx only (not avx2). But
+    // always profitable if scalar popcnt is not available.
+    if (!Subtarget->hasPOPCNT())
+      setOperationAction(ISD::CTPOP, MVT::v4i64, Custom);
+
     if (Subtarget->hasFMA() || Subtarget->hasFMA4()) {
       setOperationAction(ISD::FMA, MVT::v8f32, Legal);
       setOperationAction(ISD::FMA, MVT::v4f64, Legal);
       setOperationAction(ISD::FMA, MVT::v4f32, Legal);
       setOperationAction(ISD::FMA, MVT::v2f64, Legal);
       setOperationAction(ISD::FMA, MVT::f32, Legal);
       setOperationAction(ISD::FMA, MVT::f64, Legal);
     }
@@ ... @@ if (Subtarget->hasInt256()) {
       setOperationAction(ISD::MUL, MVT::v16i16, Legal);
       setOperationAction(ISD::MUL, MVT::v32i8, Custom);

       setOperationAction(ISD::UMUL_LOHI, MVT::v8i32, Custom);
       setOperationAction(ISD::SMUL_LOHI, MVT::v8i32, Custom);
       setOperationAction(ISD::MULHU, MVT::v16i16, Legal);
       setOperationAction(ISD::MULHS, MVT::v16i16, Legal);

+      // Always custom lower if avx2 is available.
+      setOperationAction(ISD::CTPOP, MVT::v4i64, Custom);
+
       // The custom lowering for UINT_TO_FP for v8i32 becomes interesting
       // when we have a 256bit-wide blend with immediate.
       setOperationAction(ISD::UINT_TO_FP, MVT::v8i32, Custom);

-      // Only provide customized ctpop vector bit twiddling for vector types we
-      // know to perform better than using the popcnt instructions on each
-      // vector element. If popcnt isn't supported, always provide the custom
-      // version.
-      if (!Subtarget->hasPOPCNT())
-        setOperationAction(ISD::CTPOP, MVT::v4i64, Custom);
-
-      // Custom CTPOP always performs better on natively supported v8i32
-      setOperationAction(ISD::CTPOP, MVT::v8i32, Custom);
-
       // AVX2 also has wider vector sign/zero extending loads, VPMOV[SZ]X
       setLoadExtAction(ISD::SEXTLOAD, MVT::v16i16, MVT::v16i8, Legal);
       setLoadExtAction(ISD::SEXTLOAD, MVT::v8i32, MVT::v8i8, Legal);
       setLoadExtAction(ISD::SEXTLOAD, MVT::v4i64, MVT::v4i8, Legal);
       setLoadExtAction(ISD::SEXTLOAD, MVT::v8i32, MVT::v8i16, Legal);
       setLoadExtAction(ISD::SEXTLOAD, MVT::v4i64, MVT::v4i16, Legal);
       setLoadExtAction(ISD::SEXTLOAD, MVT::v4i64, MVT::v4i32, Legal);

@@ ... @@ if (DstVT==MVT::i64 && SrcVT.isVector())
     return Op;
   // MMX <=> MMX conversions are Legal.
   if (SrcVT.isVector() && DstVT.isVector())
     return Op;
   // All other conversions need to be expanded.
   return SDValue();
 }

-static SDValue LowerCTPOP(SDValue Op, const X86Subtarget *Subtarget,
-                          SelectionDAG &DAG) {
-  SDNode *Node = Op.getNode();
-  SDLoc dl(Node);
-
-  Op = Op.getOperand(0);
-  EVT VT = Op.getValueType();
-  assert((VT.is128BitVector() || VT.is256BitVector()) &&
-         "CTPOP lowering only implemented for 128/256-bit wide vector types");
-
-  unsigned NumElts = VT.getVectorNumElements();
-  EVT EltVT = VT.getVectorElementType();
-  unsigned Len = EltVT.getSizeInBits();
-
-  // This is the vectorized version of the "best" algorithm from
-  // http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
-  // with a minor tweak to use a series of adds + shifts instead of vector
-  // multiplications. Implemented for the v2i64, v4i64, v4i32, v8i32 types:
-  //
-  // v2i64, v4i64, v4i32 => Only profitable w/ popcnt disabled
-  // v8i32 => Always profitable
-  //
-  // FIXME: There are a couple of possible improvements:
-  //
-  // 1) Support for i8 and i16 vectors (needs measurements if popcnt enabled).
-  // 2) Use strategies from http://wm.ite.pl/articles/sse-popcount.html
-  //
-  assert(EltVT.isInteger() && (Len == 32 || Len == 64) && Len % 8 == 0 &&
-         "CTPOP not implemented for this vector element type.");
-
-  // X86 canonicalize ANDs to vXi64, generate the appropriate bitcasts to avoid
-  // extra legalization.
-  bool NeedsBitcast = EltVT == MVT::i32;
-  MVT BitcastVT = VT.is256BitVector() ? MVT::v4i64 : MVT::v2i64;
-
-  SDValue Cst55 = DAG.getConstant(APInt::getSplat(Len, APInt(8, 0x55)), dl,
-                                  EltVT);
-  SDValue Cst33 = DAG.getConstant(APInt::getSplat(Len, APInt(8, 0x33)), dl,
-                                  EltVT);
-  SDValue Cst0F = DAG.getConstant(APInt::getSplat(Len, APInt(8, 0x0F)), dl,
-                                  EltVT);
-
-  // v = v - ((v >> 1) & 0x55555555...)
-  SmallVector<SDValue, 8> Ones(NumElts, DAG.getConstant(1, dl, EltVT));
-  SDValue OnesV = DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Ones);
-  SDValue Srl = DAG.getNode(ISD::SRL, dl, VT, Op, OnesV);
-  if (NeedsBitcast)
-    Srl = DAG.getNode(ISD::BITCAST, dl, BitcastVT, Srl);
-
-  SmallVector<SDValue, 8> Mask55(NumElts, Cst55);
-  SDValue M55 = DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Mask55);
-  if (NeedsBitcast)
-    M55 = DAG.getNode(ISD::BITCAST, dl, BitcastVT, M55);
-
-  SDValue And = DAG.getNode(ISD::AND, dl, Srl.getValueType(), Srl, M55);
-  if (VT != And.getValueType())
-    And = DAG.getNode(ISD::BITCAST, dl, VT, And);
-  SDValue Sub = DAG.getNode(ISD::SUB, dl, VT, Op, And);
-
-  // v = (v & 0x33333333...) + ((v >> 2) & 0x33333333...)
-  SmallVector<SDValue, 8> Mask33(NumElts, Cst33);
-  SDValue M33 = DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Mask33);
-  SmallVector<SDValue, 8> Twos(NumElts, DAG.getConstant(2, dl, EltVT));
-  SDValue TwosV = DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Twos);
-
-  Srl = DAG.getNode(ISD::SRL, dl, VT, Sub, TwosV);
-  if (NeedsBitcast) {
-    Srl = DAG.getNode(ISD::BITCAST, dl, BitcastVT, Srl);
-    M33 = DAG.getNode(ISD::BITCAST, dl, BitcastVT, M33);
-    Sub = DAG.getNode(ISD::BITCAST, dl, BitcastVT, Sub);
-  }
-
-  SDValue AndRHS = DAG.getNode(ISD::AND, dl, M33.getValueType(), Srl, M33);
-  SDValue AndLHS = DAG.getNode(ISD::AND, dl, M33.getValueType(), Sub, M33);
-  if (VT != AndRHS.getValueType()) {
-    AndRHS = DAG.getNode(ISD::BITCAST, dl, VT, AndRHS);
-    AndLHS = DAG.getNode(ISD::BITCAST, dl, VT, AndLHS);
-  }
-  SDValue Add = DAG.getNode(ISD::ADD, dl, VT, AndLHS, AndRHS);
-
-  // v = (v + (v >> 4)) & 0x0F0F0F0F...
-  SmallVector<SDValue, 8> Fours(NumElts, DAG.getConstant(4, dl, EltVT));
-  SDValue FoursV = DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Fours);
-  Srl = DAG.getNode(ISD::SRL, dl, VT, Add, FoursV);
-  Add = DAG.getNode(ISD::ADD, dl, VT, Add, Srl);
-
-  SmallVector<SDValue, 8> Mask0F(NumElts, Cst0F);
-  SDValue M0F = DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Mask0F);
-  if (NeedsBitcast) {
-    Add = DAG.getNode(ISD::BITCAST, dl, BitcastVT, Add);
-    M0F = DAG.getNode(ISD::BITCAST, dl, BitcastVT, M0F);
-  }
-  And = DAG.getNode(ISD::AND, dl, M0F.getValueType(), Add, M0F);
-  if (VT != And.getValueType())
-    And = DAG.getNode(ISD::BITCAST, dl, VT, And);
-
-  // The algorithm mentioned above uses:
-  // v = (v * 0x01010101...) >> (Len - 8)
-  //
-  // Change it to use vector adds + vector shifts which yield faster results on
-  // Haswell than using vector integer multiplication.
-  //
-  // For i32 elements:
-  // v = v + (v >> 8)
-  // v = v + (v >> 16)
-  //
-  // For i64 elements:
-  // v = v + (v >> 8)
-  // v = v + (v >> 16)
-  // v = v + (v >> 32)
-  //
-  Add = And;
-  SmallVector<SDValue, 8> Csts;
-  for (unsigned i = 8; i <= Len/2; i *= 2) {
-    Csts.assign(NumElts, DAG.getConstant(i, dl, EltVT));
-    SDValue CstsV = DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Csts);
-    Srl = DAG.getNode(ISD::SRL, dl, VT, Add, CstsV);
-    Add = DAG.getNode(ISD::ADD, dl, VT, Add, Srl);
-    Csts.clear();
-  }
-
-  // The result is on the least significant 6-bits on i32 and 7-bits on i64.
-  SDValue Cst3F = DAG.getConstant(APInt(Len, Len == 32 ? 0x3F : 0x7F), dl,
-                                  EltVT);
-  SmallVector<SDValue, 8> Cst3FV(NumElts, Cst3F);
-  SDValue M3F = DAG.getNode(ISD::BUILD_VECTOR, dl, VT, Cst3FV);
-  if (NeedsBitcast) {
-    Add = DAG.getNode(ISD::BITCAST, dl, BitcastVT, Add);
-    M3F = DAG.getNode(ISD::BITCAST, dl, BitcastVT, M3F);
-  }
-  And = DAG.getNode(ISD::AND, dl, M3F.getValueType(), Add, M3F);
-  if (VT != And.getValueType())
-    And = DAG.getNode(ISD::BITCAST, dl, VT, And);
-
-  return And;
-}
+static SDValue LowerCTPOPInRegLUT(SDValue Op, SDLoc DL,
+                                  const X86Subtarget *Subtarget,
+                                  SelectionDAG &DAG) {
+  EVT VT = Op.getValueType();
+  MVT EltVT = VT.getVectorElementType().getSimpleVT();
+  unsigned VecSize = VT.getSizeInBits();
+
+  // Implement a lookup table in register by using an algorithm based on:
+  // http://wm.ite.pl/articles/sse-popcount.html
+  //
+  // The general idea is that every lower byte nibble in the input vector is an
+  // index into a in-register pre-computed pop count table. We then split up the
+  // input vector in two new ones: (1) a vector with only the shifted-right
+  // higher nibbles for each byte and (2) a vector with the lower nibbles (and
+  // masked out higher ones) for each byte. PSHUFB is used separately with both
+  // to index the in-register table. Next, both are added and the result is a
+  // i8 vector where each element contains the pop count for input byte.
+  //
+  // To obtain the pop count for elements != i8, we follow up with the same
+  // approach and use additional tricks as described below.
+  //
+  const int LUT[16] = {/* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
+                       /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
+                       /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
+                       /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4};
+
+  unsigned NumByteElts = VecSize / 8;
+  MVT ByteVecVT = MVT::getVectorVT(MVT::i8, NumByteElts);
+  SDValue In = DAG.getNode(ISD::BITCAST, DL, ByteVecVT, Op);
+  SmallVector<SDValue, 16> LUTVec;
+  for (unsigned i = 0; i < NumByteElts; ++i)
+    LUTVec.push_back(DAG.getConstant(LUT[i % 16], DL, MVT::i8));
+  SDValue InRegLUT = DAG.getNode(ISD::BUILD_VECTOR, DL, ByteVecVT, LUTVec);
+  SmallVector<SDValue, 16> Mask0F(NumByteElts,
+                                  DAG.getConstant(0x0F, DL, MVT::i8));
+  SDValue M0F = DAG.getNode(ISD::BUILD_VECTOR, DL, ByteVecVT, Mask0F);
+
+  // High nibbles
+  SmallVector<SDValue, 16> Four(NumByteElts, DAG.getConstant(4, DL, MVT::i8));
+  SDValue FourV = DAG.getNode(ISD::BUILD_VECTOR, DL, ByteVecVT, Four);
+  SDValue HighNibbles = DAG.getNode(ISD::SRL, DL, ByteVecVT, In, FourV);
+  HighNibbles = DAG.getNode(ISD::AND, DL, ByteVecVT, HighNibbles, M0F);
+
+  // Low nibbles
+  SDValue LowNibbles = DAG.getNode(ISD::AND, DL, ByteVecVT, In, M0F);
+
+  // The input vector is used as the shuffle mask that index elements into the
+  // LUT. After counting low and high nibbles, add the vector to obtain the
+  // final pop count per i8 element.
+  SDValue HighPopCnt =
+      DAG.getNode(X86ISD::PSHUFB, DL, ByteVecVT, InRegLUT, HighNibbles);
+  SDValue LowPopCnt =
+      DAG.getNode(X86ISD::PSHUFB, DL, ByteVecVT, InRegLUT, LowNibbles);
+  SDValue PopCnt = DAG.getNode(ISD::ADD, DL, ByteVecVT, HighPopCnt, LowPopCnt);
+
+  if (EltVT == MVT::i8)
+    return PopCnt;
+
+  // PSADBW instruction horizontally add all bytes and leave the result in i64
+  // chunks, thus directly computes the pop count for v2i64 and v4i64.
+  if (EltVT == MVT::i64) {
+    SDValue Zeros = getZeroVector(ByteVecVT, Subtarget, DAG, DL);
+    PopCnt = DAG.getNode(X86ISD::PSADBW, DL, ByteVecVT, PopCnt, Zeros);
+    return DAG.getNode(ISD::BITCAST, DL, VT, PopCnt);
+  }
+
+  // Mask and shift to extract 32-bit components, use two PSADBW to pop count
+  // each one and OR the result.
+  if (EltVT == MVT::i32) {
+    unsigned Vec64NumByteElts = VecSize / 64;
+    MVT Vec64 = MVT::getVectorVT(MVT::i64, Vec64NumByteElts);
+    PopCnt = DAG.getNode(ISD::BITCAST, DL, Vec64, PopCnt);
+
+    SmallVector<SDValue, 4> MaskLow(
+        Vec64NumByteElts,
+        DAG.getConstant(APInt::getLowBitsSet(64, 32), DL, MVT::i64));
+    SmallVector<SDValue, 4> Dword(Vec64NumByteElts,
+                                  DAG.getConstant(32, DL, MVT::i64));
+
+    SDValue Low = DAG.getNode(ISD::BUILD_VECTOR, DL, Vec64, MaskLow);
+    SDValue High =
+        DAG.getNode(ISD::SRL, DL, Vec64, PopCnt,
+                    DAG.getNode(ISD::BUILD_VECTOR, DL, Vec64, Dword));
+    Low = DAG.getNode(ISD::AND, DL, Vec64, PopCnt, Low);
+
+    SDValue Zeros = getZeroVector(ByteVecVT, Subtarget, DAG, DL);
+    High = DAG.getNode(X86ISD::PSADBW, DL, ByteVecVT,
+                       DAG.getNode(ISD::BITCAST, DL, ByteVecVT, High), Zeros);
+    Low = DAG.getNode(X86ISD::PSADBW, DL, ByteVecVT,
+                      DAG.getNode(ISD::BITCAST, DL, ByteVecVT, Low), Zeros);
+
+    High = DAG.getNode(ISD::SHL, DL, Vec64,
+                       DAG.getNode(ISD::BITCAST, DL, Vec64, High),
+                       DAG.getNode(ISD::BUILD_VECTOR, DL, Vec64, Dword));
+
+    PopCnt = DAG.getNode(ISD::OR, DL, Vec64, High,
+                         DAG.getNode(ISD::BITCAST, DL, Vec64, Low));
+    return DAG.getNode(ISD::BITCAST, DL, VT, PopCnt);
+  }
+
+  // To obtain pop count for each i16 element, shuffle the byte pop count to get
+  // even and odd elements into distinct vectors, add them and zero-extend each
+  // i8 element into i16, i.e.:
+  //
+  // B -> pop count per i8
+  // W -> pop count per i16
+  //
+  // Y = shuffle B, undef <0, 2, ...>
+  // Z = shuffle B, undef <1, 3, ...>
+  // W = zext <... x i8> to <... x i16> (Y + Z)
+  //
+  // Use a byte shuffle mask that matches PSHUFB.
+  //
+  assert(EltVT == MVT::i16 && "Unknown how to handle type");
+  SDValue Undef = DAG.getUNDEF(ByteVecVT);
+  SmallVector<int, 32> MaskA, MaskB;
+
+  if (NumByteElts <= 16) {
+    for (unsigned i = 0; i < NumByteElts / 2; ++i) {
+      MaskA.push_back(i * 2);
+      MaskB.push_back((i * 2) + 1);
+    }
+    for (unsigned i = NumByteElts / 2; i < NumByteElts; ++i) {
+      MaskA.push_back(-1);
+      MaskB.push_back(-1);
+    }
+    SDValue ShuffA =
+        DAG.getVectorShuffle(ByteVecVT, DL, PopCnt, Undef, &MaskA[0]);
+    SDValue ShuffB =
+        DAG.getVectorShuffle(ByteVecVT, DL, PopCnt, Undef, &MaskB[0]);
+    PopCnt = DAG.getNode(ISD::ADD, DL, ByteVecVT, ShuffA, ShuffB);
+
+    // In AVX2, PSHUFB does not support cross-lane shuffle. Therefore, shuffle
+    // the bytes in their own lane. This requires an extra shuffle to move the
+    // result from the second lane to the first, i.e.:
+    //
+    // Y = shuffle B, undef <0, ... 14, -1, ... -1, 16 ...>
+    // Z = shuffle B, undef <1, ... 15, -1, ... -1, 17 ...>
+    // tmp = bitcast to v4i64 (Y + Z)
+    // tmp = shuffle tmp, undef <0, 2, -1, -1>
+    // tmp = bitcast to v32i8 tmp
+    // W = zext <... x i8> to <... x i16> tmp
+    //
+  } else {
+    assert(NumByteElts == 32 && "Unknown i8 vector length to handle");
+    int Idx = 0;
+    for (int i = 0; i < 4; ++i) {
+      for (int j = 0; j < 8; ++j) {
+        if (i % 2 == 0) {
+          MaskA.push_back(Idx++);
+          MaskB.push_back(Idx++);
+        } else {
+          MaskA.push_back(-1);
+          MaskB.push_back(-1);
+        }
+      }
+    }
+    SDValue ShuffA =
+        DAG.getVectorShuffle(ByteVecVT, DL, PopCnt, Undef, &MaskA[0]);
+    SDValue ShuffB =
+        DAG.getVectorShuffle(ByteVecVT, DL, PopCnt, Undef, &MaskB[0]);
+    PopCnt = DAG.getNode(ISD::ADD, DL, ByteVecVT, ShuffA, ShuffB);
+    PopCnt = DAG.getNode(ISD::BITCAST, DL, MVT::v4i64, PopCnt);
+    SmallVector<int, 4> Mask({0, 2, -1, -1});
+    PopCnt = DAG.getVectorShuffle(MVT::v4i64, DL, PopCnt,
+                                  DAG.getUNDEF(MVT::v4i64), &Mask[0]);
+    PopCnt = DAG.getNode(ISD::BITCAST, DL, ByteVecVT, PopCnt);
+  }
+
+  // Zero extend i8 into i16 elts
+  SmallVector<int, 16> ZExtInRegMask;
+  for (unsigned i = 0, Idx = 0; i < NumByteElts; i += 2, ++Idx) {
+    ZExtInRegMask.push_back(Idx);
+    ZExtInRegMask.push_back(NumByteElts);
+  }
+
+  return DAG.getNode(
+      ISD::BITCAST, DL, VT,
+      DAG.getVectorShuffle(ByteVecVT, DL, PopCnt,
+                           getZeroVector(ByteVecVT, Subtarget, DAG, DL),
+                           &ZExtInRegMask[0]));
+}
+
+static SDValue LowerCTPOP(SDValue Op, const X86Subtarget *Subtarget,
+                          SelectionDAG &DAG) {
+  MVT VT = Op.getSimpleValueType();
+  assert((VT.is256BitVector() || VT.is128BitVector()) &&
+         "Unknown CTPOP type to handle");
+  SDLoc dl(Op.getNode());
+
+  if (Op.getValueType().is256BitVector() && !Subtarget->hasInt256()) {
+    unsigned NumElems = VT.getVectorNumElements();
+
+    // Extract each 128-bit vector, compute pop count and concat the result.
+    SDValue Op0 = Op.getOperand(0);
+    SDValue LHS = Extract128BitVector(Op0, 0, DAG, dl);
+    SDValue RHS = Extract128BitVector(Op0, NumElems/2, DAG, dl);
+
+    return DAG.getNode(ISD::CONCAT_VECTORS, dl, VT,
+                       LowerCTPOPInRegLUT(LHS, dl, Subtarget, DAG),
+                       LowerCTPOPInRegLUT(RHS, dl, Subtarget, DAG));
+  }
+
+  return LowerCTPOPInRegLUT(Op.getOperand(0), dl, Subtarget, DAG);
+}

 static SDValue LowerLOAD_SUB(SDValue Op, SelectionDAG &DAG) {
   SDNode *Node = Op.getNode();
   SDLoc dl(Node);
   EVT T = Node->getValueType(0);
   SDValue negOp = DAG.getNode(ISD::SUB, dl, T,
                               DAG.getConstant(0, dl, T), Node->getOperand(2));
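
The widening steps in LowerCTPOPInRegLUT (turning per-byte counts into counts for i64, i32 and i16 elements) can be followed more easily in a scalar model. The sketch below assumes a 128-bit vector in little-endian byte order and emulates PSADBW against a zero operand with a loop; sad_against_zero, widen_to_i64, widen_to_i32 and widen_to_i16 are hypothetical names, and the i16 case is reduced to its arithmetic effect rather than the actual shuffles.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Model of 128-bit PSADBW with an all-zero second operand: each group of
// eight bytes is summed into the corresponding 64-bit chunk.
inline std::array<std::uint64_t, 2> sad_against_zero(const Bytes16 &b) {
  std::array<std::uint64_t, 2> out{};
  for (std::size_t i = 0; i < b.size(); ++i)
    out[i / 8] += b[i];
  return out;
}

// i64 elements: one PSADBW yields the per-element count directly.
inline std::array<std::uint64_t, 2> widen_to_i64(const Bytes16 &byteCnt) {
  return sad_against_zero(byteCnt);
}

// i32 elements: split each 64-bit chunk into its low dword (AND with a mask)
// and its high dword (shift right by 32), sum each with PSADBW, then shift
// the high sum back up and OR the two together.
inline std::array<std::uint32_t, 4> widen_to_i32(const Bytes16 &byteCnt) {
  Bytes16 low{}, high{};
  for (std::size_t chunk = 0; chunk < 2; ++chunk) {
    for (std::size_t j = 0; j < 4; ++j) {
      low[chunk * 8 + j] = byteCnt[chunk * 8 + j];
      high[chunk * 8 + j] = byteCnt[chunk * 8 + 4 + j];
    }
  }
  const auto lowSum = sad_against_zero(low);
  const auto highSum = sad_against_zero(high);
  std::array<std::uint32_t, 4> out{};
  for (std::size_t chunk = 0; chunk < 2; ++chunk) {
    const std::uint64_t merged = (highSum[chunk] << 32) | lowSum[chunk];
    out[2 * chunk] = static_cast<std::uint32_t>(merged);
    out[2 * chunk + 1] = static_cast<std::uint32_t>(merged >> 32);
  }
  return out;
}

// i16 elements: add the even and odd byte counts of each pair and
// zero-extend the sum to 16 bits.
inline std::array<std::uint16_t, 8> widen_to_i16(const Bytes16 &byteCnt) {
  std::array<std::uint16_t, 8> out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = static_cast<std::uint16_t>(byteCnt[2 * i] + byteCnt[2 * i + 1]);
  return out;
}
```

The input to these helpers is the per-byte pop count vector, so every byte is at most 8 and none of the sums can overflow their element type.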

lib/Target/X86/X86InstrFragmentsSIMD.td

	Show First 20 Lines • Show All 72 Lines • ▼ Show 20 Lines
 def X86hsub : SDNode<"X86ISD::HSUB", SDTIntBinOp>;
 def X86comi : SDNode<"X86ISD::COMI", SDTX86CmpTest>;
 def X86ucomi : SDNode<"X86ISD::UCOMI", SDTX86CmpTest>;
 def X86cmps : SDNode<"X86ISD::FSETCC", SDTX86Cmps>;
 //def X86cmpsd : SDNode<"X86ISD::FSETCCsd", SDTX86Cmpsd>;
 def X86pshufb : SDNode<"X86ISD::PSHUFB",
 SDTypeProfile<1, 2, [SDTCisVec<0>, SDTCisSameAs<0,1>,
 SDTCisSameAs<0,2>]>>;
+def X86psadbw : SDNode<"X86ISD::PSADBW",
+SDTypeProfile<1, 2, [SDTCisVec<0>, SDTCisSameAs<0,1>,
+SDTCisSameAs<0,2>]>>;
 def X86andnp : SDNode<"X86ISD::ANDNP",
 SDTypeProfile<1, 2, [SDTCisVec<0>, SDTCisSameAs<0,1>,
 SDTCisSameAs<0,2>]>>;
 def X86psign : SDNode<"X86ISD::PSIGN",
 SDTypeProfile<1, 2, [SDTCisVec<0>, SDTCisSameAs<0,1>,
 SDTCisSameAs<0,2>]>>;
 def X86pextrb : SDNode<"X86ISD::PEXTRB",
 SDTypeProfile<1, 2, [SDTCisVT<0, i32>, SDTCisPtrTy<2>]>>;

lib/Target/X86/X86InstrSSE.td


 defm PMADDWD : PDI_binop_all_int<0xF5, "pmaddwd", int_x86_sse2_pmadd_wd,
 int_x86_avx2_pmadd_wd, SSE_PMADD, 1>;
 defm PAVGB : PDI_binop_all_int<0xE0, "pavgb", int_x86_sse2_pavg_b,
 int_x86_avx2_pavg_b, SSE_INTALU_ITINS_P, 1>;
 defm PAVGW : PDI_binop_all_int<0xE3, "pavgw", int_x86_sse2_pavg_w,
 int_x86_avx2_pavg_w, SSE_INTALU_ITINS_P, 1>;
 defm PSADBW : PDI_binop_all_int<0xF6, "psadbw", int_x86_sse2_psad_bw,
 int_x86_avx2_psad_bw, SSE_PMADD, 1>;

+let Predicates = [HasAVX2] in
+def : Pat<(v4i64 (bitconvert (v32i8 (X86psadbw (v32i8 VR256:$src1),
+(v32i8 VR256:$src2))))),
+(VPSADBWYrr VR256:$src2, VR256:$src1)>;
+
+let Predicates = [HasAVX] in
+def : Pat<(v2i64 (bitconvert (v16i8 (X86psadbw (v16i8 VR128:$src1),
+(v16i8 VR128:$src2))))),
+(VPSADBWrr VR128:$src2, VR128:$src1)>;
+
+def : Pat<(v2i64 (bitconvert (v16i8 (X86psadbw (v16i8 VR128:$src1),
+(v16i8 VR128:$src2))))),
+(PSADBWrr VR128:$src2, VR128:$src1)>;
+
 let Predicates = [HasAVX] in
 defm VPMULUDQ : PDI_binop_rm2<0xF4, "vpmuludq", X86pmuludq, v2i64, v4i32, VR128,
 loadv2i64, i128mem, SSE_INTMUL_ITINS_P, 1, 0>,
 VEX_4V;
 let Predicates = [HasAVX2] in
 defm VPMULUDQY : PDI_binop_rm2<0xF4, "vpmuludq", X86pmuludq, v4i64, v8i32,
 VR256, loadv4i64, i256mem,
 SSE_INTMUL_ITINS_P, 1, 0>, VEX_4V, VEX_L;

test/CodeGen/X86/avx-popcnt.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx -mattr=+popcnt | FileCheck -check-prefix=AVX %s
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx -mattr=-popcnt | FileCheck -check-prefix=AVX-NOPOPCNT %s

				define <4 x i32> @testv4i32(<4 x i32> %in) {
				; AVX-LABEL: testv4i32:
				; AVX: # BB#0:
				; AVX-NEXT: vpextrd $1, %xmm0, %eax
				; AVX-NEXT: popcntl %eax, %eax
				; AVX-NEXT: vmovd %xmm0, %ecx
				; AVX-NEXT: popcntl %ecx, %ecx
				; AVX-NEXT: vmovd %ecx, %xmm1
				; AVX-NEXT: vpinsrd $1, %eax, %xmm1, %xmm1
				; AVX-NEXT: vpextrd $2, %xmm0, %eax
				; AVX-NEXT: popcntl %eax, %eax
				; AVX-NEXT: vpinsrd $2, %eax, %xmm1, %xmm1
				; AVX-NEXT: vpextrd $3, %xmm0, %eax
				; AVX-NEXT: popcntl %eax, %eax
				; AVX-NEXT: vpinsrd $3, %eax, %xmm1, %xmm0
				; AVX-NEXT: retq
				; AVX-NOPOPCNT-LABEL: testv4i32:
				; AVX-NOPOPCNT: # BB#0:
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm2
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm2, %xmm3, %xmm2
				; AVX-NOPOPCNT-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm0, %xmm3, %xmm0
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm2, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand {{.*}}(%rip), %xmm0, %xmm1
				; AVX-NOPOPCNT-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX-NOPOPCNT-NEXT: vpsadbw %xmm1, %xmm2, %xmm1
				; AVX-NOPOPCNT-NEXT: vpsrlq $32, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpsadbw %xmm0, %xmm2, %xmm0
				; AVX-NOPOPCNT-NEXT: vpsllq $32, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpor %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: retq
				%out = call <4 x i32> @llvm.ctpop.v4i32(<4 x i32> %in)
				ret <4 x i32> %out
				}

				define <32 x i8> @testv32i8(<32 x i8> %in) {
				; AVX-LABEL: testv32i8:
				; AVX: # BB#0:
				; AVX-NEXT: vextractf128 $1, %ymm0, %xmm1
				; AVX-NEXT: vmovaps {{.*#+}} xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NEXT: vandps %xmm2, %xmm1, %xmm3
				; AVX-NEXT: vmovdqa {{.*#+}} xmm4 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NEXT: vpshufb %xmm3, %xmm4, %xmm3
				; AVX-NEXT: vpsrlw $4, %xmm1, %xmm1
				; AVX-NEXT: vpand %xmm2, %xmm1, %xmm1
				; AVX-NEXT: vpand %xmm2, %xmm1, %xmm1
				; AVX-NEXT: vpshufb %xmm1, %xmm4, %xmm1
				; AVX-NEXT: vpaddb %xmm3, %xmm1, %xmm1
				; AVX-NEXT: vandps %xmm2, %xmm0, %xmm3
				; AVX-NEXT: vpshufb %xmm3, %xmm4, %xmm3
				; AVX-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NEXT: vpand %xmm2, %xmm0, %xmm0
				; AVX-NEXT: vpand %xmm2, %xmm0, %xmm0
				; AVX-NEXT: vpshufb %xmm0, %xmm4, %xmm0
				; AVX-NEXT: vpaddb %xmm3, %xmm0, %xmm0
				; AVX-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
				; AVX-NEXT: retq
				; AVX-NOPOPCNT-LABEL: testv32i8:
				; AVX-NOPOPCNT: # BB#0:
				; AVX-NOPOPCNT-NEXT: vextractf128 $1, %ymm0, %xmm1
				; AVX-NOPOPCNT-NEXT: vmovaps {{.*#+}} xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NOPOPCNT-NEXT: vandps %xmm2, %xmm1, %xmm3
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm4 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm3, %xmm4, %xmm3
				; AVX-NOPOPCNT-NEXT: vpsrlw $4, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpand %xmm2, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpand %xmm2, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm1, %xmm4, %xmm1
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm3, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vandps %xmm2, %xmm0, %xmm3
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm3, %xmm4, %xmm3
				; AVX-NOPOPCNT-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm2, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm2, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm0, %xmm4, %xmm0
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm3, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
				; AVX-NOPOPCNT-NEXT: retq
				%out = call <32 x i8> @llvm.ctpop.v32i8(<32 x i8> %in)
				ret <32 x i8> %out
				}

				define <4 x i64> @testv4i64(<4 x i64> %in) {
				; AVX-LABEL: testv4i64:
				; AVX: # BB#0:
				; AVX-NEXT: vextractf128 $1, %ymm0, %xmm1
				; AVX-NEXT: vpextrq $1, %xmm1, %rax
				; AVX-NEXT: popcntq %rax, %rax
				; AVX-NEXT: vmovq %rax, %xmm2
				; AVX-NEXT: vmovq %xmm1, %rax
				; AVX-NEXT: popcntq %rax, %rax
				; AVX-NEXT: vmovq %rax, %xmm1
				; AVX-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
				; AVX-NEXT: vpextrq $1, %xmm0, %rax
				; AVX-NEXT: popcntq %rax, %rax
				; AVX-NEXT: vmovq %rax, %xmm2
				; AVX-NEXT: vmovq %xmm0, %rax
				; AVX-NEXT: popcntq %rax, %rax
				; AVX-NEXT: vmovq %rax, %xmm0
				; AVX-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
				; AVX-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
				; AVX-NEXT: retq
				; AVX-NOPOPCNT-LABEL: testv4i64:
				; AVX-NOPOPCNT: # BB#0:
				; AVX-NOPOPCNT-NEXT: vextractf128 $1, %ymm0, %xmm1
				; AVX-NOPOPCNT-NEXT: vmovaps {{.*#+}} xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NOPOPCNT-NEXT: vandps %xmm2, %xmm1, %xmm3
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm4 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm3, %xmm4, %xmm3
				; AVX-NOPOPCNT-NEXT: vpsrlw $4, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpand %xmm2, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpand %xmm2, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm1, %xmm4, %xmm1
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm3, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpxor %xmm3, %xmm3, %xmm3
				; AVX-NOPOPCNT-NEXT: vpsadbw %xmm1, %xmm3, %xmm1
				; AVX-NOPOPCNT-NEXT: vandps %xmm2, %xmm0, %xmm5
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm5, %xmm4, %xmm5
				; AVX-NOPOPCNT-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm2, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm2, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm0, %xmm4, %xmm0
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm5, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpsadbw %xmm0, %xmm3, %xmm0
				; AVX-NOPOPCNT-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
				; AVX-NOPOPCNT-NEXT: retq
				%out = call <4 x i64> @llvm.ctpop.v4i64(<4 x i64> %in)
				ret <4 x i64> %out
				}

				define <8 x i32> @testv8i32(<8 x i32> %in) {
				; AVX-LABEL: testv8i32:
				; AVX: # BB#0:
				; AVX-NEXT: vextractf128 $1, %ymm0, %xmm1
				; AVX-NEXT: vmovaps {{.*#+}} xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NEXT: vandps %xmm2, %xmm1, %xmm3
				; AVX-NEXT: vmovdqa {{.*#+}} xmm4 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NEXT: vpshufb %xmm3, %xmm4, %xmm3
				; AVX-NEXT: vpsrlw $4, %xmm1, %xmm1
				; AVX-NEXT: vpand %xmm2, %xmm1, %xmm1
				; AVX-NEXT: vpand %xmm2, %xmm1, %xmm1
				; AVX-NEXT: vpshufb %xmm1, %xmm4, %xmm1
				; AVX-NEXT: vpaddb %xmm3, %xmm1, %xmm1
				; AVX-NEXT: vmovdqa {{.*#+}} xmm3 = [4294967295,4294967295]
				; AVX-NEXT: vpand %xmm3, %xmm1, %xmm5
				; AVX-NEXT: vpxor %xmm6, %xmm6, %xmm6
				; AVX-NEXT: vpsadbw %xmm5, %xmm6, %xmm5
				; AVX-NEXT: vpsrlq $32, %xmm1, %xmm1
				; AVX-NEXT: vpsadbw %xmm1, %xmm6, %xmm1
				; AVX-NEXT: vpsllq $32, %xmm1, %xmm1
				; AVX-NEXT: vpor %xmm5, %xmm1, %xmm1
				; AVX-NEXT: vandps %xmm2, %xmm0, %xmm5
				; AVX-NEXT: vpshufb %xmm5, %xmm4, %xmm5
				; AVX-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NEXT: vpand %xmm2, %xmm0, %xmm0
				; AVX-NEXT: vpand %xmm2, %xmm0, %xmm0
				; AVX-NEXT: vpshufb %xmm0, %xmm4, %xmm0
				; AVX-NEXT: vpaddb %xmm5, %xmm0, %xmm0
				; AVX-NEXT: vpand %xmm3, %xmm0, %xmm2
				; AVX-NEXT: vpsadbw %xmm2, %xmm6, %xmm2
				; AVX-NEXT: vpsrlq $32, %xmm0, %xmm0
				; AVX-NEXT: vpsadbw %xmm0, %xmm6, %xmm0
				; AVX-NEXT: vpsllq $32, %xmm0, %xmm0
				; AVX-NEXT: vpor %xmm2, %xmm0, %xmm0
				; AVX-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
				; AVX-NEXT: retq
				; AVX-NOPOPCNT-LABEL: testv8i32:
				; AVX-NOPOPCNT: # BB#0:
				; AVX-NOPOPCNT-NEXT: vextractf128 $1, %ymm0, %xmm1
				; AVX-NOPOPCNT-NEXT: vmovaps {{.*#+}} xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NOPOPCNT-NEXT: vandps %xmm2, %xmm1, %xmm3
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm4 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm3, %xmm4, %xmm3
				; AVX-NOPOPCNT-NEXT: vpsrlw $4, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpand %xmm2, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpand %xmm2, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm1, %xmm4, %xmm1
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm3, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm3 = [4294967295,4294967295]
				; AVX-NOPOPCNT-NEXT: vpand %xmm3, %xmm1, %xmm5
				; AVX-NOPOPCNT-NEXT: vpxor %xmm6, %xmm6, %xmm6
				; AVX-NOPOPCNT-NEXT: vpsadbw %xmm5, %xmm6, %xmm5
				; AVX-NOPOPCNT-NEXT: vpsrlq $32, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpsadbw %xmm1, %xmm6, %xmm1
				; AVX-NOPOPCNT-NEXT: vpsllq $32, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpor %xmm5, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vandps %xmm2, %xmm0, %xmm5
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm5, %xmm4, %xmm5
				; AVX-NOPOPCNT-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm2, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm2, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm0, %xmm4, %xmm0
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm5, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm3, %xmm0, %xmm2
				; AVX-NOPOPCNT-NEXT: vpsadbw %xmm2, %xmm6, %xmm2
				; AVX-NOPOPCNT-NEXT: vpsrlq $32, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpsadbw %xmm0, %xmm6, %xmm0
				; AVX-NOPOPCNT-NEXT: vpsllq $32, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpor %xmm2, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
				; AVX-NOPOPCNT-NEXT: retq
				%out = call <8 x i32> @llvm.ctpop.v8i32(<8 x i32> %in)
				ret <8 x i32> %out
				}

				define <2 x i64> @testv2i64(<2 x i64> %in) {
				; AVX-LABEL: testv2i64:
				; AVX: # BB#0:
				; AVX-NEXT: vpextrq $1, %xmm0, %rax
				; AVX-NEXT: popcntq %rax, %rax
				; AVX-NEXT: vmovq %rax, %xmm1
				; AVX-NEXT: vmovq %xmm0, %rax
				; AVX-NEXT: popcntq %rax, %rax
				; AVX-NEXT: vmovq %rax, %xmm0
				; AVX-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
				; AVX-NEXT: retq
				; AVX-NOPOPCNT-LABEL: testv2i64:
				; AVX-NOPOPCNT: # BB#0:
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm2
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm2, %xmm3, %xmm2
				; AVX-NOPOPCNT-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm0, %xmm3, %xmm0
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm2, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVX-NOPOPCNT-NEXT: vpsadbw %xmm0, %xmm1, %xmm0
				; AVX-NOPOPCNT-NEXT: retq
				%out = call <2 x i64> @llvm.ctpop.v2i64(<2 x i64> %in)
				ret <2 x i64> %out
				}

				define <16 x i8> @testv16i8(<16 x i8> %in) {
				; AVX-LABEL: testv16i8:
				; AVX: # BB#0:
				; AVX-NEXT: vmovdqa {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NEXT: vpand %xmm1, %xmm0, %xmm2
				; AVX-NEXT: vmovdqa {{.*#+}} xmm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NEXT: vpshufb %xmm2, %xmm3, %xmm2
				; AVX-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NEXT: vpshufb %xmm0, %xmm3, %xmm0
				; AVX-NEXT: vpaddb %xmm2, %xmm0, %xmm0
				; AVX-NEXT: retq
				; AVX-NOPOPCNT-LABEL: testv16i8:
				; AVX-NOPOPCNT: # BB#0:
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm2
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm2, %xmm3, %xmm2
				; AVX-NOPOPCNT-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm0, %xmm3, %xmm0
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm2, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: retq
				%out = call <16 x i8> @llvm.ctpop.v16i8(<16 x i8> %in)
				ret <16 x i8> %out
				}

				define <16 x i16> @testv16i16(<16 x i16> %in) {
				; AVX-LABEL: testv16i16:
				; AVX: # BB#0:
				; AVX-NEXT: vmovdqa {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NEXT: vpand %xmm1, %xmm0, %xmm2
				; AVX-NEXT: vmovdqa {{.*#+}} xmm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NEXT: vpshufb %xmm2, %xmm3, %xmm2
				; AVX-NEXT: vpsrlw $4, %xmm0, %xmm4
				; AVX-NEXT: vpand %xmm1, %xmm4, %xmm4
				; AVX-NEXT: vpand %xmm1, %xmm4, %xmm4
				; AVX-NEXT: vpshufb %xmm4, %xmm3, %xmm4
				; AVX-NEXT: vpaddb %xmm2, %xmm4, %xmm2
				; AVX-NEXT: vmovdqa {{.*#+}} xmm4 = <1,3,5,7,9,11,13,15,u,u,u,u,u,u,u,u>
				; AVX-NEXT: vpshufb %xmm4, %xmm2, %xmm5
				; AVX-NEXT: vmovdqa {{.*#+}} xmm6 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
				; AVX-NEXT: vpshufb %xmm6, %xmm2, %xmm2
				; AVX-NEXT: vpaddb %xmm5, %xmm2, %xmm2
				; AVX-NEXT: vpmovzxbw {{.*#+}} xmm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
				; AVX-NEXT: vextractf128 $1, %ymm0, %xmm0
				; AVX-NEXT: vpand %xmm1, %xmm0, %xmm5
				; AVX-NEXT: vpshufb %xmm5, %xmm3, %xmm5
				; AVX-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NEXT: vpshufb %xmm0, %xmm3, %xmm0
				; AVX-NEXT: vpaddb %xmm5, %xmm0, %xmm0
				; AVX-NEXT: vpshufb %xmm4, %xmm0, %xmm1
				; AVX-NEXT: vpshufb %xmm6, %xmm0, %xmm0
				; AVX-NEXT: vpaddb %xmm1, %xmm0, %xmm0
				; AVX-NEXT: vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
				; AVX-NEXT: vinsertf128 $1, %xmm0, %ymm2, %ymm0
				; AVX-NEXT: retq
				; AVX-NOPOPCNT-LABEL: testv16i16:
				; AVX-NOPOPCNT: # BB#0:
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm2
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm2, %xmm3, %xmm2
				; AVX-NOPOPCNT-NEXT: vpsrlw $4, %xmm0, %xmm4
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm4, %xmm4
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm4, %xmm4
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm4, %xmm3, %xmm4
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm2, %xmm4, %xmm2
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm4 = <1,3,5,7,9,11,13,15,u,u,u,u,u,u,u,u>
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm4, %xmm2, %xmm5
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm6 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm6, %xmm2, %xmm2
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm5, %xmm2, %xmm2
				; AVX-NOPOPCNT-NEXT: vpmovzxbw {{.*#+}} xmm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
				; AVX-NOPOPCNT-NEXT: vextractf128 $1, %ymm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm5
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm5, %xmm3, %xmm5
				; AVX-NOPOPCNT-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm0, %xmm3, %xmm0
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm5, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm4, %xmm0, %xmm1
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm6, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
				; AVX-NOPOPCNT-NEXT: vinsertf128 $1, %xmm0, %ymm2, %ymm0
				; AVX-NOPOPCNT-NEXT: retq
				%out = call <16 x i16> @llvm.ctpop.v16i16(<16 x i16> %in)
				ret <16 x i16> %out
				}

				define <8 x i16> @testv8i16(<8 x i16> %in) {
				; AVX-LABEL: testv8i16:
				; AVX: # BB#0:
				; AVX-NEXT: vmovdqa {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NEXT: vpand %xmm1, %xmm0, %xmm2
				; AVX-NEXT: vmovdqa {{.*#+}} xmm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NEXT: vpshufb %xmm2, %xmm3, %xmm2
				; AVX-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NEXT: vpshufb %xmm0, %xmm3, %xmm0
				; AVX-NEXT: vpaddb %xmm2, %xmm0, %xmm0
				; AVX-NEXT: vpshufb {{.*#+}} xmm1 = xmm0[1,3,5,7,9,11,13,15,u,u,u,u,u,u,u,u]
				; AVX-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
				; AVX-NEXT: vpaddb %xmm1, %xmm0, %xmm0
				; AVX-NEXT: vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
				; AVX-NEXT: retq
				; AVX-NOPOPCNT-LABEL: testv8i16:
				; AVX-NOPOPCNT: # BB#0:
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm2
				; AVX-NOPOPCNT-NEXT: vmovdqa {{.*#+}} xmm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm2, %xmm3, %xmm2
				; AVX-NOPOPCNT-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpshufb %xmm0, %xmm3, %xmm0
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm2, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpshufb {{.*#+}} xmm1 = xmm0[1,3,5,7,9,11,13,15,u,u,u,u,u,u,u,u]
				; AVX-NOPOPCNT-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
				; AVX-NOPOPCNT-NEXT: vpaddb %xmm1, %xmm0, %xmm0
				; AVX-NOPOPCNT-NEXT: vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
				; AVX-NOPOPCNT-NEXT: retq
				%out = call <8 x i16> @llvm.ctpop.v8i16(<8 x i16> %in)
				ret <8 x i16> %out
				}

				declare <4 x i32> @llvm.ctpop.v4i32(<4 x i32>)
				declare <32 x i8> @llvm.ctpop.v32i8(<32 x i8>)
				declare <4 x i64> @llvm.ctpop.v4i64(<4 x i64>)
				declare <8 x i32> @llvm.ctpop.v8i32(<8 x i32>)
				declare <2 x i64> @llvm.ctpop.v2i64(<2 x i64>)
				declare <16 x i8> @llvm.ctpop.v16i8(<16 x i8>)
				declare <16 x i16> @llvm.ctpop.v16i16(<16 x i16>)
				declare <8 x i16> @llvm.ctpop.v8i16(<8 x i16>)
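
The byte-level sequence that recurs in the CHECK lines above (mask with 15, `vpshufb` into the table `[0,1,1,2,...,4]`, `vpsrlw $4`, mask again, second `vpshufb`, `vpaddb`) is a nibble lookup-table population count. A scalar sketch of what it computes per byte, assuming a 16-byte vector modelled as `std::array`; the name `popcountBytesLUT` is illustrative and does not appear in the patch:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Per-byte population count via two 4-bit table lookups.
// Each table entry is the number of set bits in its index (0..15).
Bytes16 popcountBytesLUT(const Bytes16 &in) {
  static constexpr uint8_t LUT[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4};
  Bytes16 out{};
  for (std::size_t i = 0; i < in.size(); ++i) {
    const uint8_t lo = in[i] & 0x0F;        // low nibble (the vpand step)
    const uint8_t hi = (in[i] >> 4) & 0x0F; // high nibble (shift, then mask)
    out[i] = static_cast<uint8_t>(LUT[lo] + LUT[hi]); // the vpaddb step
  }
  return out;
}
```

For the i8 element types this is already the final result; wider element types need an additional horizontal sum of the per-byte counts.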

test/CodeGen/X86/avx2-popcnt.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 -mattr=+popcnt | FileCheck %s
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 -mattr=-popcnt | FileCheck %s

				; When avx2 is enabled, we should always generate the same code regardless
				; of popcnt instruction availability.

				define <32 x i8> @testv32i8(<32 x i8> %in) {
				; CHECK-LABEL: testv32i8:
				; CHECK: # BB#0:
				; CHECK-NEXT: vmovdqa {{.*#+}} ymm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm2
				; CHECK-NEXT: vmovdqa {{.*#+}} ymm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; CHECK-NEXT: vpshufb %ymm2, %ymm3, %ymm2
				; CHECK-NEXT: vpsrlw $4, %ymm0, %ymm0
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpshufb %ymm0, %ymm3, %ymm0
				; CHECK-NEXT: vpaddb %ymm2, %ymm0, %ymm0
				; CHECK-NEXT: retq
				%out = call <32 x i8> @llvm.ctpop.v32i8(<32 x i8> %in)
				ret <32 x i8> %out
				}

				define <4 x i64> @testv4i64(<4 x i64> %in) {
				; CHECK-LABEL: testv4i64:
				; CHECK: # BB#0:
				; CHECK-NEXT: vmovdqa {{.*#+}} ymm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm2
				; CHECK-NEXT: vmovdqa {{.*#+}} ymm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; CHECK-NEXT: vpshufb %ymm2, %ymm3, %ymm2
				; CHECK-NEXT: vpsrlw $4, %ymm0, %ymm0
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpshufb %ymm0, %ymm3, %ymm0
				; CHECK-NEXT: vpaddb %ymm2, %ymm0, %ymm0
				; CHECK-NEXT: vpxor %ymm1, %ymm1, %ymm1
				; CHECK-NEXT: vpsadbw %ymm0, %ymm1, %ymm0
				; CHECK-NEXT: retq
				%out = call <4 x i64> @llvm.ctpop.v4i64(<4 x i64> %in)
				ret <4 x i64> %out
				}

				define <8 x i32> @testv8i32(<8 x i32> %in) {
				; CHECK-LABEL: testv8i32:
				; CHECK: # BB#0:
				; CHECK-NEXT: vmovdqa {{.*#+}} ymm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm2
				; CHECK-NEXT: vmovdqa {{.*#+}} ymm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; CHECK-NEXT: vpshufb %ymm2, %ymm3, %ymm2
				; CHECK-NEXT: vpsrlw $4, %ymm0, %ymm0
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpshufb %ymm0, %ymm3, %ymm0
				; CHECK-NEXT: vpaddb %ymm2, %ymm0, %ymm0
				; CHECK-NEXT: vpbroadcastq {{.*}}(%rip), %ymm1
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm1
				; CHECK-NEXT: vpxor %ymm2, %ymm2, %ymm2
				; CHECK-NEXT: vpsadbw %ymm1, %ymm2, %ymm1
				; CHECK-NEXT: vpsrlq $32, %ymm0, %ymm0
				; CHECK-NEXT: vpsadbw %ymm0, %ymm2, %ymm0
				; CHECK-NEXT: vpsllq $32, %ymm0, %ymm0
				; CHECK-NEXT: vpor %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: retq
				%out = call <8 x i32> @llvm.ctpop.v8i32(<8 x i32> %in)
				ret <8 x i32> %out
				}

				define <16 x i16> @testv16i16(<16 x i16> %in) {
				; CHECK-LABEL: testv16i16:
				; CHECK: # BB#0:
				; CHECK-NEXT: vmovdqa {{.*#+}} ymm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm2
				; CHECK-NEXT: vmovdqa {{.*#+}} ymm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; CHECK-NEXT: vpshufb %ymm2, %ymm3, %ymm2
				; CHECK-NEXT: vpsrlw $4, %ymm0, %ymm0
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpshufb %ymm0, %ymm3, %ymm0
				; CHECK-NEXT: vpaddb %ymm2, %ymm0, %ymm0
				; CHECK-NEXT: vpshufb {{.*#+}} ymm1 = ymm0[1,3,5,7,9,11,13,15,u,u,u,u,u,u,u,u,17,19,21,23,25,27,29,31,u,u,u,u,u,u,u,u]
				; CHECK-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u,16,18,20,22,24,26,28,30,u,u,u,u,u,u,u,u]
				; CHECK-NEXT: vpaddb %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
				; CHECK-NEXT: vpmovzxbw {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero,xmm0[8],zero,xmm0[9],zero,xmm0[10],zero,xmm0[11],zero,xmm0[12],zero,xmm0[13],zero,xmm0[14],zero,xmm0[15],zero
				; CHECK-NEXT: retq
				%out = call <16 x i16> @llvm.ctpop.v16i16(<16 x i16> %in)
				ret <16 x i16> %out
				}

				declare <32 x i8> @llvm.ctpop.v32i8(<32 x i8>)
				declare <4 x i64> @llvm.ctpop.v4i64(<4 x i64>)
				declare <8 x i32> @llvm.ctpop.v8i32(<8 x i32>)
				declare <16 x i16> @llvm.ctpop.v16i16(<16 x i16>)
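
In the v8i32 and v4i64 CHECK lines above, the per-byte counts are combined with `vpsadbw` against a zeroed register. PSADBW sums the absolute differences of eight unsigned bytes per 64-bit lane, so against zero it yields the plain byte sum for that lane. For i32 elements the lane is processed twice: once with the upper 32 bits masked off, once shifted right by 32, and the two sums are merged with a shift and an OR. A scalar sketch of one 64-bit lane, with hypothetical helper names:

```cpp
#include <cstdint>

// Models PSADBW against an all-zero operand on one 64-bit lane:
// the result is the sum of the lane's eight unsigned bytes.
uint64_t sadAgainstZero(uint64_t lane) {
  uint64_t sum = 0;
  for (int i = 0; i < 8; ++i)
    sum += (lane >> (8 * i)) & 0xFF;
  return sum;
}

// byteCounts holds per-byte bit counts for two adjacent i32 elements.
// Returns the two per-element totals packed back into one 64-bit lane.
uint64_t sumBytesPerI32(uint64_t byteCounts) {
  const uint64_t lo = sadAgainstZero(byteCounts & 0xFFFFFFFFull); // vpand + vpsadbw
  const uint64_t hi = sadAgainstZero(byteCounts >> 32);           // vpsrlq $32 + vpsadbw
  return (hi << 32) | lo;                                         // vpsllq $32 + vpor
}
```

Because the absolute difference is symmetric in its two operands, the order in which the zero vector and the data are passed to PSADBW does not change the result.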

test/CodeGen/X86/vector-ctpop.ll

This file was deleted.

	; RUN: llc < %s -mtriple=x86_64-apple-darwin -mattr=avx2 | FileCheck -check-prefix=AVX2 %s
	; RUN: llc < %s -mtriple=x86_64-apple-darwin -mattr=avx -mattr=-popcnt | FileCheck -check-prefix=AVX1-NOPOPCNT %s
	; RUN: llc < %s -mtriple=x86_64-apple-darwin -mattr=avx2 -mattr=-popcnt | FileCheck -check-prefix=AVX2-NOPOPCNT %s

	; Vector version of:
	; v = v - ((v >> 1) & 0x55555555)
	; v = (v & 0x33333333) + ((v >> 2) & 0x33333333)
	; v = (v + (v >> 4) & 0xF0F0F0F)
	; v = v + (v >> 8)
	; v = v + (v >> 16)
	; v = v + (v >> 32) ; i64 only

	define <8 x i32> @test0(<8 x i32> %x) {
	; AVX2-LABEL: @test0
	entry:
	; AVX2: vpsrld $1, %ymm
	; AVX2-NEXT: vpbroadcastd
	; AVX2-NEXT: vpand
	; AVX2-NEXT: vpsubd
	; AVX2-NEXT: vpbroadcastd
	; AVX2-NEXT: vpand
	; AVX2-NEXT: vpsrld $2
	; AVX2-NEXT: vpand
	; AVX2-NEXT: vpaddd
	; AVX2-NEXT: vpsrld $4
	; AVX2-NEXT: vpaddd
	; AVX2-NEXT: vpbroadcastd
	; AVX2-NEXT: vpand
	; AVX2-NEXT: vpsrld $8
	; AVX2-NEXT: vpaddd
	; AVX2-NEXT: vpsrld $16
	; AVX2-NEXT: vpaddd
	; AVX2-NEXT: vpbroadcastd
	; AVX2-NEXT: vpand
	%y = call <8 x i32> @llvm.ctpop.v8i32(<8 x i32> %x)
	ret <8 x i32> %y
	}

	define <4 x i64> @test1(<4 x i64> %x) {
	; AVX2-NOPOPCNT-LABEL: @test1
	entry:
	; AVX2-NOPOPCNT: vpsrlq $1, %ymm
	; AVX2-NOPOPCNT-NEXT: vpbroadcastq
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX2-NOPOPCNT-NEXT: vpsubq
	; AVX2-NOPOPCNT-NEXT: vpbroadcastq
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX2-NOPOPCNT-NEXT: vpsrlq $2
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX2-NOPOPCNT-NEXT: vpaddq
	; AVX2-NOPOPCNT-NEXT: vpsrlq $4
	; AVX2-NOPOPCNT-NEXT: vpaddq
	; AVX2-NOPOPCNT-NEXT: vpbroadcastq
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX2-NOPOPCNT-NEXT: vpsrlq $8
	; AVX2-NOPOPCNT-NEXT: vpaddq
	; AVX2-NOPOPCNT-NEXT: vpsrlq $16
	; AVX2-NOPOPCNT-NEXT: vpaddq
	; AVX2-NOPOPCNT-NEXT: vpsrlq $32
	; AVX2-NOPOPCNT-NEXT: vpaddq
	; AVX2-NOPOPCNT-NEXT: vpbroadcastq
	; AVX2-NOPOPCNT-NEXT: vpand
	%y = call <4 x i64> @llvm.ctpop.v4i64(<4 x i64> %x)
	ret <4 x i64> %y
	}

	define <4 x i32> @test2(<4 x i32> %x) {
	; AVX2-NOPOPCNT-LABEL: @test2
	; AVX1-NOPOPCNT-LABEL: @test2
	entry:
	; AVX2-NOPOPCNT: vpsrld $1, %xmm
	; AVX2-NOPOPCNT-NEXT: vpbroadcastd
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX2-NOPOPCNT-NEXT: vpsubd
	; AVX2-NOPOPCNT-NEXT: vpbroadcastd
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX2-NOPOPCNT-NEXT: vpsrld $2
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX2-NOPOPCNT-NEXT: vpaddd
	; AVX2-NOPOPCNT-NEXT: vpsrld $4
	; AVX2-NOPOPCNT-NEXT: vpaddd
	; AVX2-NOPOPCNT-NEXT: vpbroadcastd
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX2-NOPOPCNT-NEXT: vpsrld $8
	; AVX2-NOPOPCNT-NEXT: vpaddd
	; AVX2-NOPOPCNT-NEXT: vpsrld $16
	; AVX2-NOPOPCNT-NEXT: vpaddd
	; AVX2-NOPOPCNT-NEXT: vpbroadcastd
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX1-NOPOPCNT: vpsrld $1, %xmm
	; AVX1-NOPOPCNT-NEXT: vpand
	; AVX1-NOPOPCNT-NEXT: vpsubd
	; AVX1-NOPOPCNT-NEXT: vmovdqa
	; AVX1-NOPOPCNT-NEXT: vpand
	; AVX1-NOPOPCNT-NEXT: vpsrld $2
	; AVX1-NOPOPCNT-NEXT: vpand
	; AVX1-NOPOPCNT-NEXT: vpaddd
	; AVX1-NOPOPCNT-NEXT: vpsrld $4
	; AVX1-NOPOPCNT-NEXT: vpaddd
	; AVX1-NOPOPCNT-NEXT: vpand
	; AVX1-NOPOPCNT-NEXT: vpsrld $8
	; AVX1-NOPOPCNT-NEXT: vpaddd
	; AVX1-NOPOPCNT-NEXT: vpsrld $16
	; AVX1-NOPOPCNT-NEXT: vpaddd
	; AVX1-NOPOPCNT-NEXT: vpand
	%y = call <4 x i32> @llvm.ctpop.v4i32(<4 x i32> %x)
	ret <4 x i32> %y
	}

	define <2 x i64> @test3(<2 x i64> %x) {
	; AVX2-NOPOPCNT-LABEL: @test3
	; AVX1-NOPOPCNT-LABEL: @test3
	entry:
	; AVX2-NOPOPCNT: vpsrlq $1, %xmm
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX2-NOPOPCNT-NEXT: vpsubq
	; AVX2-NOPOPCNT-NEXT: vmovdqa
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX2-NOPOPCNT-NEXT: vpsrlq $2
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX2-NOPOPCNT-NEXT: vpaddq
	; AVX2-NOPOPCNT-NEXT: vpsrlq $4
	; AVX2-NOPOPCNT-NEXT: vpaddq
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX2-NOPOPCNT-NEXT: vpsrlq $8
	; AVX2-NOPOPCNT-NEXT: vpaddq
	; AVX2-NOPOPCNT-NEXT: vpsrlq $16
	; AVX2-NOPOPCNT-NEXT: vpaddq
	; AVX2-NOPOPCNT-NEXT: vpsrlq $32
	; AVX2-NOPOPCNT-NEXT: vpaddq
	; AVX2-NOPOPCNT-NEXT: vpand
	; AVX1-NOPOPCNT: vpsrlq $1, %xmm
	; AVX1-NOPOPCNT-NEXT: vpand
	; AVX1-NOPOPCNT-NEXT: vpsubq
	; AVX1-NOPOPCNT-NEXT: vmovdqa
	; AVX1-NOPOPCNT-NEXT: vpand
	; AVX1-NOPOPCNT-NEXT: vpsrlq $2
	; AVX1-NOPOPCNT-NEXT: vpand
	; AVX1-NOPOPCNT-NEXT: vpaddq
	; AVX1-NOPOPCNT-NEXT: vpsrlq $4
	; AVX1-NOPOPCNT-NEXT: vpaddq
	; AVX1-NOPOPCNT-NEXT: vpand
	; AVX1-NOPOPCNT-NEXT: vpsrlq $8
	; AVX1-NOPOPCNT-NEXT: vpaddq
	; AVX1-NOPOPCNT-NEXT: vpsrlq $16
	; AVX1-NOPOPCNT-NEXT: vpaddq
	; AVX1-NOPOPCNT-NEXT: vpsrlq $32
	; AVX1-NOPOPCNT-NEXT: vpaddq
	; AVX1-NOPOPCNT-NEXT: vpand
	%y = call <2 x i64> @llvm.ctpop.v2i64(<2 x i64> %x)
	ret <2 x i64> %y
	}

	declare <4 x i32> @llvm.ctpop.v4i32(<4 x i32>)
	declare <2 x i64> @llvm.ctpop.v2i64(<2 x i64>)

	declare <8 x i32> @llvm.ctpop.v8i32(<8 x i32>)
	declare <4 x i64> @llvm.ctpop.v4i64(<4 x i64>)
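
For reference, the bit-twiddling sequence quoted in the header comment of the deleted test can be written out in scalar form as below. This is a sketch of the well-known parallel bit count with explicit parentheses, not code from the patch; the function names are illustrative, and the final masks (0x3F and 0x7F) are chosen here to keep only the low byte's count.

```cpp
#include <cstdint>

// Parallel bit count for a 32-bit value.
uint32_t popcount32(uint32_t v) {
  v = v - ((v >> 1) & 0x55555555u);                 // 2-bit partial sums
  v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); // 4-bit partial sums
  v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 // 8-bit partial sums
  v = v + (v >> 8);                                 // fold bytes together
  v = v + (v >> 16);
  return v & 0x3Fu;                                 // total is at most 32
}

// Same reduction for a 64-bit value, with the extra ">> 32" step.
uint64_t popcount64(uint64_t v) {
  v = v - ((v >> 1) & 0x5555555555555555ull);
  v = (v & 0x3333333333333333ull) + ((v >> 2) & 0x3333333333333333ull);
  v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0Full;
  v = v + (v >> 8);
  v = v + (v >> 16);
  v = v + (v >> 32);
  return v & 0x7Full;                               // total is at most 64
}
```

The vector lowering checked by the deleted test applies these same steps to every element at once, which is why its CHECK lines alternate shifts, `vpand`, and adds.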