This is an archive of the discontinued LLVM Phabricator instance.

Differential D45173

[InstCombine] Recognize idioms for ctpop and ctlz
Needs ReviewPublic

Authored by kparzysz on Apr 2 2018, 12:03 PM.

Details

Reviewers

aaboud
spatel
craig.topper

Summary

Look for code that does the population count and count leading zeros that corresponds to the typical bit-twiddling algorithms from Hacker's Delight.
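For readers unfamiliar with the idioms, the two computations can be sketched in plain C++ roughly as follows. This is an illustration only, not code from the patch: the names `hd_pop32` and `hd_clz32` are made up for this example, and the fixed 32-bit width is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Population count, "divide and conquer" form: each step adds adjacent
// fields of 1, 2, 4, 8 and 16 bits. (Hypothetical helper name.)
uint32_t hd_pop32(uint32_t x) {
  x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  x = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);
  x = (x & 0x00FF00FFu) + ((x >> 8) & 0x00FF00FFu);
  x = (x & 0x0000FFFFu) + ((x >> 16) & 0x0000FFFFu);
  return x;
}

// Count leading zeros: smear the highest set bit into every lower
// position, then subtract the number of set bits from the bit width.
// (Hypothetical helper name.)
uint32_t hd_clz32(uint32_t n) {
  n |= n >> 1;
  n |= n >> 2;
  n |= n >> 4;
  n |= n >> 8;
  n |= n >> 16;
  return 32u - hd_pop32(n);
}
```

The patch aims to replace sequences of this shape with the llvm.ctpop and llvm.ctlz intrinsics.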

Diff Detail

Repository: rL LLVM

Event Timeline

kparzysz created this revision. Apr 2 2018, 12:03 PM

Herald added a subscriber: mgorny. Apr 2 2018, 12:03 PM

Added comments.

Does this trigger on the compiler-rt implementation of popcountsi2/clzsi2? I guess it technically isn't a problem if it does, given the LLVM backend currently doesn't generate calls to them, but it might be worth adding a backend testcase to make sure we don't generate an infinite loop.

lib/Transforms/AggressiveInstCombine/BitCountCombine.cpp
157	You might want to match patterns which don't include the subtraction; it could easily get combined away before you get here (if someone computes "ctlz(n)-1", or "ctlz(x)-ctlz(y)", etc.).

In D45173#1054694, @efriedma wrote:

Does this trigger on the compiler-rt implementation of popcountsi2/clzsi2? I guess it technically isn't a problem if it does, given the LLVM backend currently doesn't generate calls to them, but it might be worth adding a backend testcase to make sure we don't generate an infinite loop.

Would it make sense to check the function name for these two above and exit early if it matches?

Removed the explicit check for subtraction when checking ctlz pattern.

Added a check to avoid optimizing compiler-rt functions.

kparzysz marked an inline comment as done. Apr 2 2018, 2:35 PM

craig.topper added inline comments. Apr 2 2018, 9:36 PM

lib/Transforms/AggressiveInstCombine/BitCountCombine.cpp
62	m_APInt doesn't match a specific APInt. It just matches any ConstantInt or splat and returns a pointer to the APInt it found. The pointer passed in would normally be uninitialized in the caller.
99	Could this make use of APInt::getSplat?
107	Spell this out? 'SA' isn't meaningful without looking at the lambda being called.

Addressed comments on the previous diff.

kparzysz marked 3 inline comments as done. Apr 3 2018, 6:34 AM

kparzysz added inline comments.

lib/Transforms/AggressiveInstCombine/BitCountCombine.cpp
62	Wow, this was bad. Thanks for catching this.

craig.topper added inline comments. Apr 3 2018, 1:20 PM

lib/Transforms/AggressiveInstCombine/BitCountCombine.cpp
53	Should take APInt M by const reference if possible to avoid a costly copy if it's larger than 64 bits.
74	APInt::isSameValue is most useful when the APInts have different widths. Is that the case here? If not we should just use operator==
195	What if the ctpop returns a vector type?

Also use update_test_checks.py to generate the CHECK lines in the tests.

kparzysz marked 3 inline comments as done. Apr 3 2018, 2:04 PM

kparzysz added inline comments.

lib/Transforms/AggressiveInstCombine/BitCountCombine.cpp
74	Yes, this can legitimately happen.

Addressed comments.

I think we need to evaluate what popcount sequences we want to handle. The code you're handling isn't the most optimal version.

For example, compiler-rt uses this:

su_int x = (su_int)a;
x = x - ((x >> 1) & 0x55555555);
/* Every 2 bits holds the sum of every pair of bits */
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
/* Every 4 bits holds the sum of every 4-set of bits (3 significant bits) */
x = (x + (x >> 4)) & 0x0F0F0F0F;
/* Every 8 bits holds the sum of every 8-set of bits (4 significant bits) */
x = (x + (x >> 16));
/* The lower 16 bits hold two 8 bit sums (5 significant bits).*/
/*    Upper 16 bits are garbage */
return (x + (x >> 8)) & 0x0000003F;  /* (6 significant bits) */

Then there is another form here that uses a multiply in the last step.

https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

What do you propose? The ones from the webpage, plus what compiler-rt does?

I'm not sure what I propose. The code in compiler-rt is in Hacker's Delight, but the last set of shifts and adds is in a different order. Did you have a real-world application or benchmark that was motivating this change?

I think it came from a customer application. I tried to put it in instcombine a long time back, but it was rejected on the grounds that it was too time-consuming. It's been sitting around in my local repo until I thought about committing it again, this time in the aggressive instcombine which didn't exist back then. Since I already have it, I didn't want to delete it.

Recognizing more patterns is doable, but it can take some time.

Even back then there was some interference with the other parts of instcombine, and this code had to be put in the right place to avoid it. The problem was that instcombine would "preprocess" this computation into a form that was no longer "recursively symmetric", and the code, as is, would no longer work.
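A related effect is documented in the comment on the getMask helper in the posted diff: by the time the matcher runs, the textbook masks have already been narrowed to the bits that are actually demanded, so the matcher has to look for the narrowed constants. The sketch below reproduces both families of masks for 32-bit values. It uses plain `uint32_t` instead of APInt, and the names `full_mask` and `narrowed_mask` are hypothetical; only the "S-1 replicated every S bits" rule is taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Textbook mask for the step that produces S-bit fields: the low S/2 bits
// of every S-bit group are set. S must be a power of two in [2, 32].
uint32_t full_mask(unsigned S) {
  uint32_t group = (1u << (S / 2)) - 1u;
  uint32_t mask = 0;
  for (unsigned shift = 0; shift < 32; shift += S)
    mask |= group << shift;
  return mask;
}

// Narrowed mask following the rule in the patch's getMask: the value S-1
// (log2(S) set bits) replicated every S bits.
uint32_t narrowed_mask(unsigned S) {
  uint32_t group = S - 1u;
  uint32_t mask = 0;
  for (unsigned shift = 0; shift < 32; shift += S)
    mask |= group << shift;
  return mask;
}
```

For S = 8, 16 and 32 the narrowed masks come out as 0x07070707, 0x000F000F and 0x0000001F, which in decimal are the constants 117901063, 983055 and 31.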

I don't know what the long-term strategy is for dealing with the interactions between pattern-recognition code and instcombine. This is a recurring issue for the polynomial multiplication code in HexagonLoopIdiomRecognition and it is likely to affect any code of that nature.

In D45173#1056850, @kparzysz wrote:

I don't know what the long-term strategy is for dealing with the interactions between pattern-recognition code and instcombine. This is a recurring issue for the polynomial multiplication code in HexagonLoopIdiomRecognition and it is likely to affect any code of that nature.

I don't know if there's an actual strategy. There's no formal definition of 'canonical IR' AFAIK, so we continue to simplify code via peepholes in instcombine. Anything downstream of that has to adjust to those changes. I've dealt with that many times as an interaction between instcombine and DAG combine.

Let's look at pop32 as an example. We have 3 non-loop variations to consider so far IIUC: (a) all add ops (Hacker's Delight), (b) replace first add+mask with sub, and (c) replace ending mask+shift+add with multiply.

As IR, these are:

define i32 @pop32_all_adds(i32 %x) {
  %v0 = and i32 %x, 1431655765
  %v1 = lshr i32 %x, 1
  %v2 = and i32 %v1, 1431655765
  %v3 = add nuw i32 %v0, %v2
  %v4 = and i32 %v3, 858993459
  %v5 = lshr i32 %v3, 2
  %v6 = and i32 %v5, 858993459
  %v7 = add nuw nsw i32 %v4, %v6
  %v8 = and i32 %v7, 117901063
  %v9 = lshr i32 %v7, 4
  %v10 = and i32 %v9, 117901063
  %v11 = add nuw nsw i32 %v8, %v10
  %v12 = and i32 %v11, 983055
  %v13 = lshr i32 %v11, 8
  %v14 = and i32 %v13, 983055
  %v15 = add nuw nsw i32 %v12, %v14
  %v16 = and i32 %v15, 31
  %v17 = lshr i32 %v15, 16
  %v18 = add nuw nsw i32 %v16, %v17
  ret i32 %v18
}

define i32 @pop32_sub(i32 %x) {
  %shr = lshr i32 %x, 1
  %and = and i32 %shr, 1431655765
  %sub = sub i32 %x, %and
  %shr1 = lshr i32 %sub, 2
  %and2 = and i32 %shr1, 858993459
  %and3 = and i32 %sub, 858993459
  %add = add nuw nsw i32 %and2, %and3
  %shr4 = lshr i32 %add, 4
  %add5 = add nuw nsw i32 %shr4, %add
  %and6 = and i32 %add5, 252645135
  %shr7 = lshr i32 %and6, 16
  %add8 = add nuw nsw i32 %shr7, %and6
  %shr9 = lshr i32 %add8, 8
  %add10 = add nuw nsw i32 %shr9, %add8
  %and11 = and i32 %add10, 63
  ret i32 %and11
}

define i32 @pop32_mul(i32 %x) {
  %shr = lshr i32 %x, 1
  %and = and i32 %shr, 1431655765
  %sub = sub i32 %x, %and
  %and1 = and i32 %sub, 858993459
  %shr2 = lshr i32 %sub, 2
  %and3 = and i32 %shr2, 858993459
  %add = add nuw nsw i32 %and3, %and1
  %shr4 = lshr i32 %add, 4
  %add5 = add nuw nsw i32 %shr4, %add
  %and6 = and i32 %add5, 252645135
  %mul = mul i32 %and6, 16843009
  %shr7 = lshr i32 %mul, 24
  ret i32 %shr7
}
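Read back as source code, the three IR functions above correspond roughly to the C++ forms below. This is a sketch for comparison, not code from the patch or from compiler-rt: variant (a) uses the textbook masks rather than the narrowed constants that appear in the IR, and `pop32_reference` and `variants_agree` are helpers invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// (a) All add operations, as in Hacker's Delight.
uint32_t pop32_all_adds(uint32_t x) {
  x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  x = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);
  x = (x & 0x00FF00FFu) + ((x >> 8) & 0x00FF00FFu);
  x = (x & 0x0000FFFFu) + ((x >> 16) & 0x0000FFFFu);
  return x;
}

// (b) First add+mask replaced with a subtraction; the final steps shift
// by 16 and then by 8, as in the compiler-rt snippet quoted earlier.
uint32_t pop32_sub(uint32_t x) {
  x = x - ((x >> 1) & 0x55555555u);
  x = ((x >> 2) & 0x33333333u) + (x & 0x33333333u);
  x = (x + (x >> 4)) & 0x0F0F0F0Fu;
  x = x + (x >> 16);
  return (x + (x >> 8)) & 0x3Fu;
}

// (c) Ending mask+shift+add replaced with a multiply that sums the four
// byte counts into the top byte.
uint32_t pop32_mul(uint32_t x) {
  x = x - ((x >> 1) & 0x55555555u);
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  x = (x + (x >> 4)) & 0x0F0F0F0Fu;
  return (x * 0x01010101u) >> 24;
}

// Bit-by-bit reference used only for checking (hypothetical helper).
uint32_t pop32_reference(uint32_t x) {
  uint32_t count = 0;
  for (; x != 0; x >>= 1)
    count += x & 1u;
  return count;
}

// True if all three variants agree with the reference for this input.
bool variants_agree(uint32_t x) {
  uint32_t expected = pop32_reference(x);
  return pop32_all_adds(x) == expected && pop32_sub(x) == expected &&
         pop32_mul(x) == expected;
}
```

All three compute the same value, which is why a matcher that only knows form (a) misses source written as (b) or (c).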

First, do we have consensus on which of these is canonical? Generally, we prefer the form with fewer instructions (pop32_mul), but is the instruction count reduction justified by using a mul?
Second, can we add instcombines that would reduce one or more of these to another form? If so, let's add those. If not, then this pass needs to match all of those forms to be effective (but that doesn't have to happen in one patch, of course).

In D45173#1056890, @spatel wrote:

I don't know if there's an actual strategy. There's no formal definition of 'canonical IR' AFAIK, so we continue to simplify code via peepholes in instcombine. Anything downstream of that has to adjust to those changes. I've dealt with that many times as an interaction between instcombine and DAG combine.

This isn't sustainable in the long run. Recognizing complex computations and replacing them with short equivalents (such as intrinsics that targets may provide efficient implementations of) is arguably better than only doing peephole optimizations, and yet the current model makes it really difficult to write such code.

In D45173#1057039, @kparzysz wrote:

In D45173#1056890, @spatel wrote:

I don't know if there's an actual strategy. There's no formal definition of 'canonical IR' AFAIK, so we continue to simplify code via peepholes in instcombine. Anything downstream of that has to adjust to those changes. I've dealt with that many times as an interaction between instcombine and DAG combine.

This isn't sustainable in the long run. Recognizing complex computations and replacing them with short equivalents (such as intrinsics that targets may provide efficient implementations of) is arguably better than only doing peephole optimizations, and yet the current model makes it really difficult to write such code.

There was some discussion about optimal graph rewriting, but I don't know if there's any work/progress on that yet.

Until then, I think we're seeing the alternatives of the current model in this patch: either we add code to instcombine and coordinate this pass with instcombine's preferred form, or we increase the pattern matching complexity here...or we acknowledge that it's impossible to match all the variants, and let it slide.

FWIW, here are reductions of the patterns that we could transform in instcombine, but I suspect we don't want to add such narrow transforms there. It's probably better to keep the specialized pattern matching cost and complexity here:

; https://rise4fun.com/Alive/0ej

Name: 2_bit_sum
  %v0 = and i32 %x, 1431655765 ; 0x55555555
  %v1 = lshr i32 %x, 1
  %v2 = and i32 %v1, 1431655765
  %v3 = add i32 %v0, %v2
=>
  %v1 = lshr i32 %x, 1
  %v2 = and i32 %v1, 1431655765
  %v3 = sub i32 %x, %v2

; https://rise4fun.com/Alive/ly5

Name: shift_add_to_mul
  %s1 = and i32 %x, 252645135 ; 0x0f0f0f0f
  %s2 = lshr i32 %s1, 16
  %s3 = add i32 %s1, %s2
  %s4 = lshr i32 %s3, 8
  %s5 = add i32 %s3, %s4
  %r = and i32 %s5, 63 ; 0x3f
=>
  %s1 = and i32 %x, 252645135 ; 0x0f0f0f0f
  %m1 = mul i32 %s1, 16843009 ; 0x01010101
  %r = lshr i32 %m1, 24
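The Alive links above prove these two rewrites for every 32-bit input. As a quick sanity check outside Alive, the same identities can be spot-checked in C++ on a sample of inputs. This is only a sketch: the function names are invented here, and the sampling loop (a simple linear congruential generator) is not a proof.

```cpp
#include <cassert>
#include <cstdint>

// 2_bit_sum, left-hand side: mask both halves, then add.
uint32_t two_bit_sum_add(uint32_t x) {
  return (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
}

// 2_bit_sum, right-hand side: subtract the shifted, masked value.
uint32_t two_bit_sum_sub(uint32_t x) {
  return x - ((x >> 1) & 0x55555555u);
}

// shift_add_to_mul, left-hand side: fold the four nibble-sized byte
// values together with shifts and adds.
uint32_t byte_sum_shift_add(uint32_t x) {
  uint32_t s1 = x & 0x0F0F0F0Fu;
  uint32_t s3 = s1 + (s1 >> 16);
  uint32_t s5 = s3 + (s3 >> 8);
  return s5 & 0x3Fu;
}

// shift_add_to_mul, right-hand side: one multiply and one shift.
uint32_t byte_sum_mul(uint32_t x) {
  uint32_t s1 = x & 0x0F0F0F0Fu;
  return (s1 * 0x01010101u) >> 24;
}

// Compare both sides of each rewrite on `count` pseudo-random inputs.
bool identities_hold_on_samples(unsigned count) {
  uint32_t x = 0x12345678u;
  for (unsigned i = 0; i < count; ++i) {
    if (two_bit_sum_add(x) != two_bit_sum_sub(x))
      return false;
    if (byte_sum_shift_add(x) != byte_sum_mul(x))
      return false;
    x = x * 1664525u + 1013904223u; // advance the generator
  }
  return true;
}
```

The multiply form works because each byte of the masked value is at most 15, so the partial sums in the product never carry from one byte into the next.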

This isn't sustainable in the long run. Recognizing complex computations and replacing them with short equivalents (such as intrinsics that targets may provide efficient implementations of) is arguably better than only doing peephole optimizations, and yet the current model makes it really difficult to write such code.

Recognizing every pattern which is equivalent to popcount is basically impossible. (I mean, you could theoretically integrate something like Alive into LLVM to recognize any straight-line code which is equivalent to popcount, but that's probably not practical.) So you have to come up with some specific set of patterns to recognize, based on what the source code looks like and what transforms run before your pattern-recognition code.

As you've noted, running "instcombine" before your pattern-recognition code is going to be a constant source of trouble, because it isn't a fixed set of transforms, and it runs very early in the pass pipeline. So either you run your pattern-recognition before instcombine, or you add some tests to the LLVM regression tests which will break if a new transform breaks your pattern-recognition, and hope for the best. I guess it might help a bit if we came up with some restricted criteria for allowed transforms in instcombine, and put the rest into aggressive-instcombine. But we're never going to completely freeze the optimization pipeline, so there's fundamentally some maintenance burden.

spatel mentioned this in D44266: [InstCombine] remove use restriction for min/max with not operands (PR35875). Apr 6 2018, 8:03 AM

In D45173#1057742, @efriedma wrote:

As you've noted, running "instcombine" before your pattern-recognition code is going to be a constant source of trouble, because it isn't a fixed set of transforms, and it runs very early in the pass pipeline. So either you run your pattern-recognition before instcombine, or you add some tests to the LLVM regression tests which will break if a new transform breaks your pattern-recognition, and hope for the best. I guess it might help a bit if we came up with some restricted criteria for allowed transforms in instcombine, and put the rest into aggressive-instcombine. But we're never going to completely freeze the optimization pipeline, so there's fundamentally some maintenance burden.

The problem right now is that, with both instcombine and complex pattern matching in place, ongoing development of instcombine will incur very high maintenance costs on the pattern matching code. This is what I'd like to avoid. I'd like to have both optimizations, but without the interference. Running pattern recognition first sounds like the least invasive change and would hopefully keep everybody happy.

Moving aggressive instcombine pass to before the regular instcombine, just to demonstrate the idea.

Herald added a subscriber: mehdi_amini. Apr 6 2018, 9:33 AM

Would a change like the one in PassManagerBuilder.cpp be acceptable? There may be some work to make sure that the pre-existing components of the aggressive instcombine still apply, but I'm wondering if this direction is something we could agree on.

In D45173#1059921, @kparzysz wrote:

Moving aggressive instcombine pass to before the regular instcombine, just to demonstrate the idea.

I'm curious to know, while this does sidestep the issue of having to adapt to the instcombine/earlier passes being smarter, how does this not either
a) significantly increase the amount of matching this pass needs to do (due to being in the earlier position in the pipe), or
b) reduce the number of patterns this pass will handle?

Unless of course the only desire is to match exactly those patterns that are being tested, and exactly those patterns only...

In D45173#1059947, @lebedev.ri wrote:

In D45173#1059921, @kparzysz wrote:

Moving aggressive instcombine pass to before the regular instcombine, just to demonstrate the idea.

I'm curious to know, while this does sidestep the issue of having to adapt to the instcombine/earlier passes being smarter, how does this not either [...]

The patterns that we're trying to match correspond to some "typical implementations", so it's not intended to detect any possible computation that, for example, calculates population count. Different popcount algorithms would obviously need different matching codes. That's not a loss, since instcombine cannot now and likely will never be able to reorganize entire algorithms into a common way. The big problem with instcombine now is that its ongoing development results in a continuous stream of small changes to the output which then require ongoing adjustments to the matching code. Look at HexagonLoopIdiomRecognition.cpp to see the extent to which it goes to try and see through what instcombine has done.

In D45173#1059970, @kparzysz wrote:

In D45173#1059947, @lebedev.ri wrote:

In D45173#1059921, @kparzysz wrote:

Moving aggressive instcombine pass to before the regular instcombine, just to demonstrate the idea.

I'm curious to know, while this does sidestep the issue of having to adapt to the instcombine/earlier passes being smarter, how does this not either [...]

The patterns that we're trying to match correspond to some "typical implementations", so it's not intended to detect any possible computation that, for example, calculates population count.

Ah, i see. Disappointing. That design decision should really be specified in the documentation.

To point out the obvious: if one takes one of these tests, runs it through instcombine, and then runs aggressive-instcombine, chances are it will no longer be matched.
(Well, the main point is that the same goes for any unexpected, slightly different pattern that i would normally expect to be canonicalized beforehand.)

Also, while that may be a net positive tradeoff for this particular combine (BitCountCombine),
is that also true for all other possible future combines in this pass?

Different popcount algorithms would obviously need different matching codes. That's not a loss, since instcombine cannot now and likely will never be able to reorganize entire algorithms into a common way. The big problem with instcombine now is that its ongoing development results in a continuous stream of small changes to the output which then require ongoing adjustments to the matching code. Look at HexagonLoopIdiomRecognition.cpp to see the extent to which it goes to try and see through what instcombine has done.

a.elovikov added a subscriber: a.elovikov. Apr 6 2018, 12:39 PM

To add to that, i think you should be able to get rid of at least some of the pain of detection
of when instcombine got smarter and these folds no longer match, by adding kind-of end-to-end optimization tests.

I.e. i don't see why you could not add a second run-line so it is something like:

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -aggressive-instcombine -S < %s | FileCheck %s --check-prefixes=PLAIN
; RUN: opt -instcombine -aggressive-instcombine -S < %s | FileCheck %s --check-prefixes=POSTINSTCOMBINE

Caveats:

- I'm not sure there is much (any?) precedent for such end-to-end testing in LLVM.
- I'm not 100% sure update_test_checks.py already supports more than one run-line.
- Whoever adds an instcombine fold that breaks an aggressive-instcombine fold will have to fix the aggressive-instcombine fold. But that is a good thing i suppose :)

In D45173#1060626, @lebedev.ri wrote:

To add to that, i think you should be able to get rid of at least some of the pain of detection
of when instcombine got smarter and these folds no longer match, by adding kind-of end-to-end optimization tests.

Detection is not a problem.

In D45173#1060061, @lebedev.ri wrote:

To point out the obvious: if one takes one of these tests, runs it through instcombine, and then runs aggressive-instcombine, chances are it will no longer be matched.
(Well, the main point is that the same goes for any unexpected, slightly different pattern that i would normally expect to be canonicalized beforehand.)

I'm still not sure what your argument is. The idea here is that the aggressive instcombine (or generally, complex pattern matching code) will always run before instcombine. Yes, the fact that instcombine will change something in the code being matched is exactly what this idea is trying to avoid. Instcombine is a continuously evolving pass that keeps changing the IR. Any matching code running after it needs to be continuously updated to reflect those changes. This is a big problem for any non-trivial pattern matching. Running such pattern matching before instcombine does not eliminate the need for updates (if the preceding passes change), but it significantly reduces the frequency of such updates.

spatel mentioned this in D45731: [InstCombine] Adjusting bswap pattern matching to hold for And/Shift mixed case. Apr 17 2018, 4:44 PM

lebedev.ri mentioned this in D68189: [InstCombine] recognize popcount implemented in hacker's delight. Oct 6 2019, 11:01 AM

Revision Contents

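Path                                                                  Size
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp        8 lines
lib/Transforms/AggressiveInstCombine/AggressiveInstCombineInternal.h  24 lines
lib/Transforms/AggressiveInstCombine/BitCountCombine.cpp              264 lines
lib/Transforms/AggressiveInstCombine/CMakeLists.txt                   1 line
test/Transforms/AggressiveInstCombine/ctlz-combine.ll                 160 lines
test/Transforms/AggressiveInstCombine/ctpop-combine.ll                139 lines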

Diff 140663

lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp

[65 lines not shown]

 bool AggressiveInstCombinerLegacyPass::runOnFunction(Function &F) {
   auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
   auto &TLI = getAnalysis<TargetLibraryInfoWrapperPass>().getTLI();
   auto &DL = F.getParent()->getDataLayout();

   bool MadeIRChange = false;

+  // Handle bit counting patterns.
+  BitCountCombine BCC(TLI, DL);
+  MadeIRChange |= BCC.run(F);
+
   // Handle TruncInst patterns
   TruncInstCombine TIC(TLI, DL, DT);
   MadeIRChange |= TIC.run(F);

   // TODO: add more patterns to handle...

   return MadeIRChange;
 }

 PreservedAnalyses AggressiveInstCombinePass::run(Function &F,
                                                  FunctionAnalysisManager &AM) {
   auto &DT = AM.getResult<DominatorTreeAnalysis>(F);
   auto &TLI = AM.getResult<TargetLibraryAnalysis>(F);
   auto &DL = F.getParent()->getDataLayout();
   bool MadeIRChange = false;

+  // Handle bit counting patterns.
+  BitCountCombine BCC(TLI, DL);
+  MadeIRChange |= BCC.run(F);
+
   // Handle TruncInst patterns
   TruncInstCombine TIC(TLI, DL, DT);
   MadeIRChange |= TIC.run(F);
   if (!MadeIRChange)
     // No changes, all analyses are preserved.
     return PreservedAnalyses::all();

   // Mark all the analyses that instcombine updates as preserved.

[19 lines not shown]

lib/Transforms/AggressiveInstCombine/AggressiveInstCombineInternal.h

[112 lines not shown]

 private:

   /// Create a new expression dag using the reduced /p SclTy type and replace
   /// the old expression dag with it. Also erase all instructions in the old
   /// dag, except those that are still needed outside the dag.
   ///
   /// \param SclTy scalar version of new type to reduce expression dag into.
   void ReduceExpressionDag(Type *SclTy);
 };

+//===----------------------------------------------------------------------===//
+// BitCountCombine - looks for code that does the population count and
+// count leading zeros that corresponds to the typical bit-twiddling algorithms
+// from Hacker's Delight.
+//===----------------------------------------------------------------------===//
+
+class BitCountCombine {
+public:
+  BitCountCombine(const TargetLibraryInfo &TLI, const DataLayout &DL)
+      : TLI(TLI), DL(DL) {}
+
+  bool run(Function &F);
+
+private:
+  Value *matchCtpopW(Instruction &In, unsigned BW);
+  Value *optimizeToCtpop(Instruction &In);
+  Value *optimizeToCtlz(Instruction &In);
+  bool runOnBlock(BasicBlock &B);
+
+  const TargetLibraryInfo &TLI;
+  const DataLayout &DL;
+};
+
 } // end namespace llvm.

lib/Transforms/AggressiveInstCombine/BitCountCombine.cpp

This file was added.

//===- BitCountCombine.cpp ------------------------------------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
// This code looks for code that calculates the population count and the
// leading zeros count using the bit-manipulation methods from Hacker's
// Delight.
//
// ctpop(n):
//   bw = bitwidth(n)
//   n = (n & 0x01010101..0101) + (n & 0x10101010..1010 >> 1)
//   n = (n & 0x00110011..0011) + (n & 0x11001100..1100 >> 2)
//   ...
//   n = (n & 0x000000..111111) + (n & 0x111111..000000 >> bw/2)
//   return n
//
// ctlz(n):
//   bw = bitwidth(n)
//   n = n | (n >> 1)
//   n = n | (n >> 2)
//   n = n | (n >> 4)
//   ...
//   n = n | (n >> bw/2)
//   return bw - ctpop(n)
//
//===----------------------------------------------------------------------===//

#include "AggressiveInstCombineInternal.h"
#include "llvm/Analysis/Utils/Local.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/PatternMatch.h"
#include "llvm/Support/KnownBits.h"

using namespace llvm;
using namespace llvm::PatternMatch;

// Check if In matches a ctpop calculation pattern for a value of width BW
// bits. If so, return the argument V such that ctpop(V) would be a candidate
// for replacing In.
// The matched pattern is:
//   x0 := (V & 0x55..5) + ((V>>1) & 0x55..5)
//   x1 := (x0 & 0x33..3) + ((x0>>2) & 0x33..3)
//   ...
//   xn := (xn-1 & 0x00..0FF..F) + ((xn-1>>S/2) & 0x00..0FF..F)
// where xn is the candidate for ctpop(V).
Value *BitCountCombine::matchCtpopW(Instruction &In, unsigned BW) {
  auto matchStep = [] (Value *V, unsigned S, APInt M, bool ShiftAlone)
        -> Value* {
[inline comment] craig.topper (Done): Should take APInt M by const reference if possible to avoid a costly copy if it's larger than 64 bits.
    Value *Op0 = nullptr, *Op1 = nullptr;
    if (!match(V, m_Add(m_Value(Op0), m_Value(Op1))))
      return nullptr;

    auto matchAndShift = [S,M,ShiftAlone] (Value *V0, Value *V1) -> Value* {
      Value *V = nullptr;
      const APInt *P = &M;
      auto Mask = m_APInt(P);
      auto Shift = m_SpecificInt(S);
[inline comment] craig.topper (Done): m_APInt doesn't match a specific APInt. It just matches any ConstantInt or splat and returns a pointer to the APInt it found. The pointer passed in would normally be uninitialized in the caller.
[inline comment] kparzysz (Author, Not Done): Wow, this was bad. Thanks for catching this.

      if (!match(V0, m_And(m_Value(V), Mask)))
        return nullptr;
      if (ShiftAlone) {
        if (!match(V1, m_LShr(m_Specific(V), Shift)))
          return nullptr;
      } else {
        if (!match(V1, m_And(m_LShr(m_Specific(V), Shift), Mask)))
          return nullptr;
      }
      return V;
    };
[inline comment] craig.topper (Not Done): APInt::isSameValue is most useful when the APInts have different widths. Is that the case here? If not we should just use operator==
[inline comment] kparzysz (Author, Not Done): Yes, this can legitimately happen.

    if (Value *T = matchAndShift(Op0, Op1))
      return T;
    if (Value *T = matchAndShift(Op1, Op0))
      return T;
    return nullptr;
  };

  // Generate the bitmask for the & operation. BW is the bit-width of the
  // entire mask. The masks are:
  //   0b01010101..01010101   0x55..55   1 bit every 2 bits
  //   0b00110011..00110011   0x33..33   2 bits every 4 bits
  //   0b00000111..00000111   0x07..07   3 bits every 8 bits
  //   ...                    ...        logS bits every S bits
  // Normally the masks would be 01010101, 00110011, 00001111, i.e. the
  // number of contiguous 1 bits in each group would be twice the number
  // in the previous mask, but by the time this code runs, the "demanded"
  // bits have been optimized to only require one more 1 bit in each
  // subsequent mask. This function generates the post-optimized masks.
  auto getMask = [] (unsigned S, unsigned BW) -> APInt {
    assert(isPowerOf2_32(S));
    APInt M(BW, S-1);
    APInt T(BW, 0);
    while (M != 0) {
      T |= M;
[inline comment] craig.topper (Done): Could this make use of APInt::getSplat?
      M <<= S;
    }
    return T;
  };

  Value *V = &In;
  bool SA = true;
  unsigned N = BW;
[inline comment] craig.topper (Done): Spell this out? 'SA' isn't meaningful without looking at the lambda being called.
  while (N > 1) {
    unsigned S = N/2;
    V = matchStep(V, S, getMask(N, BW), SA);
    if (!V)
      return nullptr;
    N = S;
    SA = false;
  }

  return V;
}

// If In is an expression that evaluates popcnt via shift/add pattern,
// return the equivalent expression using the ctpop intrinsic. Otherwise
// return nullptr.
Value *BitCountCombine::optimizeToCtpop(Instruction &In) {
  IntegerType *Ty = dyn_cast<IntegerType>(In.getType());
  if (!Ty)
    return nullptr;

  // Take the first shift amount feeding the add, and assume this is the
  // last shift in the popcnt computation.
  Value *Op0 = nullptr, *Op1 = nullptr;
  if (!match(&In, m_Add(m_Value(Op0), m_Value(Op1))))
    return nullptr;

  // Shift by half-width.
  uint64_t SH = 0;
  if (!match(Op0, m_And(m_Value(), m_LShr(m_Value(), m_ConstantInt(SH)))) &&
      !match(Op1, m_And(m_Value(), m_LShr(m_Value(), m_ConstantInt(SH)))) &&
      !match(Op0, m_LShr(m_Value(), m_ConstantInt(SH))) &&
      !match(Op1, m_LShr(m_Value(), m_ConstantInt(SH))))
    return nullptr;

  if (SH < 4 || !isPowerOf2_64(SH))
    return nullptr;

  Value *V = matchCtpopW(In, 2*SH);
  if (!V)
    return nullptr;

  unsigned TW = Ty->getBitWidth(), BW = 2*SH;
  if (BW < TW) {
    // BW is the bit width of the expression whose population count is
    // being calculated. TW is the bit width of the type associated with
    // that expression. Usually they are the same, but for ctpop8 the
    // type may be "unsigned", i.e. 32-bit, while the ctpop8 would only
    // consider the low 8 bits. In that case BW=8 and TW=32.
    KnownBits K(TW);
    computeKnownBits(V, K, DL);
[inline comment] efriedma (Done): You might want to match patterns which don't include the subtraction; it could easily get combined away before you get here (if someone computes "ctlz(n)-1", or "ctlz(x)-ctlz(y)", etc.).
    APInt Need0 = APInt::getBitsSet(TW, BW, TW);
    if ((K.Zero & Need0) != Need0)
      return nullptr;
  }

  IRBuilder<> Builder(&In);
  Module *M = In.getParent()->getParent()->getParent();
  Value *Func = Intrinsic::getDeclaration(M, Intrinsic::ctpop, {V->getType()});
  CallInst *CI = Builder.CreateCall(Func, {V});
  CI->setDebugLoc(In.getDebugLoc());
  return CI;
}

				Value *BitCountCombine::optimizeToCtlz(Instruction &In) {
				// Let bw = bitwidth(n),
				// convert
				// n = n | (n>>1)
				// n = n | (n>>2)
				// n = n | (n>>4)
				// ...
				// n = n | (n>>bw/2)
				// bw - ctpop(n)
				// to
				// ctlz(n).
				// This code expects that the ctpop intrinsic has already been generated.

				uint64_t BW = 0;
				if (!match(&In, m_Sub(m_ConstantInt(BW), m_Intrinsic<Intrinsic::ctpop>())))
				return nullptr;
				// Get the argument of the ctpop.
				Value *V = cast<User>(In.getOperand(1))->getOperand(0);

				// The argument to ctpop can be zero-extended in some cases. It is safe
				// to ignore the zext.
				if (auto *Z = dyn_cast<ZExtInst>(V))
				V = Z->getOperand(0);

				IntegerType *Ty = cast<IntegerType>(V->getType());
				[Inline comment] craig.topper: What if the ctpop returns a vector type?
				if (BW < Ty->getBitWidth())
				return nullptr;

				auto matchOrShift = [] (Value *V, unsigned S) -> Value * {
				Value *Op0 = nullptr, *Op1 = nullptr;
				if (!match(V, m_Or(m_Value(Op0), m_Value(Op1))))
				return nullptr;
				if (match(Op0, m_LShr(m_Specific(Op1), m_SpecificInt(S))))
				return Op1;
				if (match(Op1, m_LShr(m_Specific(Op0), m_SpecificInt(S))))
				return Op0;
				return nullptr;
				};

				unsigned N = BW;
				while (N > 1) {
				N /= 2;
				V = matchOrShift(V, N);
				if (!V)
				return nullptr;
				}

				// The value of BW is the one that determines the type of ctlz's argument.
				IRBuilder<> Builder(&In);
				if (BW > Ty->getBitWidth()) {
				IntegerType *ATy = IntegerType::get(In.getContext(), BW);
				V = Builder.CreateZExt(V, ATy);
				}
				Module *M = In.getParent()->getParent()->getParent();
				Value *Func = Intrinsic::getDeclaration(M, Intrinsic::ctlz, {V->getType()});
				Value *False = ConstantInt::getFalse(In.getContext());
				CallInst *CI = Builder.CreateCall(Func, {V, False});
				CI->setDebugLoc(In.getDebugLoc());
				if (In.getType() != CI->getType())
				return Builder.CreateZExt(CI, In.getType());
				return CI;
				}

				bool BitCountCombine::runOnBlock(BasicBlock &B) {
				bool Changed = false;

				// Iterate over the block as long as there are more intrinsics generated.
				while (true) {
				Value *Int = nullptr;
				for (Instruction &In : reverse(B)) {
				Int = optimizeToCtpop(In);
				if (!Int)
				Int = optimizeToCtlz(In);
				if (Int) {
				Changed = true;
				In.replaceAllUsesWith(Int);
				RecursivelyDeleteTriviallyDeadInstructions(&In, &TLI);
				break;
				}
				}
				if (!Int)
				break;
				}

				return Changed;
				}

				bool BitCountCombine::run(Function &F) {
				bool Changed = false;
				for (BasicBlock &B : F)
				Changed |= runOnBlock(B);

				return Changed;
				}
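
For readers who want to see the two source-level idioms that optimizeToCtpop and optimizeToCtlz look for, the following is a minimal standalone C++ sketch, independent of LLVM. The function names (popShiftAdd32, ctlzSmear32, popNaive32, ctlzNaive32) are made up for illustration and are not part of the patch; the bodies follow the pop32 and ctlz32 C snippets quoted in the test files of this revision. The zero-input case returns 32, which is consistent with the pass emitting ctlz with the second argument set to false.

```cpp
#include <cstdint>

// Shift/add population count for a 32-bit value (the pop32 shape).
constexpr uint32_t popShiftAdd32(uint32_t T0) {
  uint32_t T1 = (T0 & 0x55555555u) + ((T0 >> 1) & 0x55555555u);
  uint32_t T2 = (T1 & 0x33333333u) + ((T1 >> 2) & 0x33333333u);
  uint32_t T3 = (T2 & 0x0F0F0F0Fu) + ((T2 >> 4) & 0x0F0F0F0Fu);
  uint32_t T4 = (T3 & 0x00FF00FFu) + ((T3 >> 8) & 0x00FF00FFu);
  return (T4 & 0x0000FFFFu) + ((T4 >> 16) & 0x0000FFFFu);
}

// Smear the highest set bit into every lower position, then compute
// bw - ctpop(n). With bw = 32 this equals the number of leading zeros.
constexpr uint32_t ctlzSmear32(uint32_t N) {
  N |= N >> 1;
  N |= N >> 2;
  N |= N >> 4;
  N |= N >> 8;
  N |= N >> 16;
  return 32u - popShiftAdd32(N);
}

// Naive references used only to check the idioms (illustrative helpers).
constexpr uint32_t popNaive32(uint32_t X) {
  uint32_t C = 0;
  for (; X != 0; X >>= 1)
    C += X & 1u;
  return C;
}

constexpr uint32_t ctlzNaive32(uint32_t X) {
  uint32_t C = 0;
  for (uint32_t Bit = 0x80000000u; Bit != 0 && (X & Bit) == 0; Bit >>= 1)
    ++C;
  return C;
}

static_assert(popShiftAdd32(0xF0F0F0F0u) == popNaive32(0xF0F0F0F0u), "ctpop idiom");
static_assert(ctlzSmear32(0x00010000u) == ctlzNaive32(0x00010000u), "ctlz idiom");
static_assert(ctlzSmear32(0u) == 32u, "zero input yields the bit width");
```

The static_asserts only spot-check a few inputs; they are not a proof that the pattern matcher in the patch is correct.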

lib/Transforms/AggressiveInstCombine/CMakeLists.txt

				add_llvm_library(LLVMAggressiveInstCombine
				AggressiveInstCombine.cpp
				BitCountCombine.cpp
				TruncInstCombine.cpp

				ADDITIONAL_HEADER_DIRS
				${LLVM_MAIN_INCLUDE_DIR}/llvm/Transforms
				${LLVM_MAIN_INCLUDE_DIR}/llvm/Transforms/AggressiveInstCombine

				DEPENDS
				intrinsics_gen
				)

test/Transforms/AggressiveInstCombine/ctlz-combine.ll

This file was added.

				; RUN: opt -aggressive-instcombine -S < %s | FileCheck %s

				; unsigned ctlz16(unsigned short t0) {
				; t0 = t0 | (t0>>1);
				; t0 = t0 | (t0>>2);
				; t0 = t0 | (t0>>4);
				; t0 = t0 | (t0>>8);
				; unsigned t1 = (t0 & 0x5555) + ((t0>>1) & 0x5555);
				; unsigned t2 = (t1 & 0x3333) + ((t1>>2) & 0x3333);
				; unsigned t3 = (t2 & 0x0F0F) + ((t2>>4) & 0x0F0F);
				; unsigned t4 = (t3 & 0x00FF) + ((t3>>8) & 0x00FF);
				; return 16-t4;
				; }
				;
				; CHECK-LABEL: define i32 @ctlz16
				; CHECK: [[V0:%[a-zA-Z0-9_]+]] = call i16 @llvm.ctlz.i16(i16 %a0, i1 false)
				; CHECK: zext i16 [[V0]] to i32
				define i32 @ctlz16(i16 zeroext %a0) local_unnamed_addr #0 {
				b0:
				%v0 = lshr i16 %a0, 1
				%v1 = or i16 %v0, %a0
				%v2 = lshr i16 %v1, 2
				%v3 = or i16 %v2, %v1
				%v4 = lshr i16 %v3, 4
				%v5 = or i16 %v4, %v3
				%v6 = lshr i16 %v5, 8
				%v7 = or i16 %v6, %v5
				%v8 = zext i16 %v7 to i32
				%v9 = and i32 %v8, 21845
				%v10 = lshr i32 %v8, 1
				%v11 = and i32 %v10, 21845
				%v12 = add nuw nsw i32 %v9, %v11
				%v13 = and i32 %v12, 13107
				%v14 = lshr i32 %v12, 2
				%v15 = and i32 %v14, 13107
				%v16 = add nuw nsw i32 %v13, %v15
				%v17 = and i32 %v16, 1799
				%v18 = lshr i32 %v16, 4
				%v19 = and i32 %v18, 1799
				%v20 = add nuw nsw i32 %v17, %v19
				%v21 = and i32 %v20, 15
				%v22 = lshr i32 %v20, 8
				%v23 = add nuw nsw i32 %v21, %v22
				%v24 = sub nsw i32 16, %v23
				ret i32 %v24
				}

				; unsigned ctlz32(unsigned t0) {
				; t0 = t0 | (t0>>1);
				; t0 = t0 | (t0>>2);
				; t0 = t0 | (t0>>4);
				; t0 = t0 | (t0>>8);
				; t0 = t0 | (t0>>16);
				; unsigned t1 = (t0 & 0x55555555) + ((t0>>1) & 0x55555555);
				; unsigned t2 = (t1 & 0x33333333) + ((t1>>2) & 0x33333333);
				; unsigned t3 = (t2 & 0x0F0F0F0F) + ((t2>>4) & 0x0F0F0F0F);
				; unsigned t4 = (t3 & 0x00FF00FF) + ((t3>>8) & 0x00FF00FF);
				; unsigned t5 = (t4 & 0x0000FFFF) + ((t4>>16) & 0x0000FFFF);
				; return 32-t5;
				; }
				;
				; CHECK-LABEL: define i32 @ctlz32
				; CHECK: @llvm.ctlz.i32(i32 %a0, i1 false)
				define i32 @ctlz32(i32 %a0) local_unnamed_addr #1 {
				b0:
				%v0 = lshr i32 %a0, 1
				%v1 = or i32 %v0, %a0
				%v2 = lshr i32 %v1, 2
				%v3 = or i32 %v1, %v2
				%v4 = lshr i32 %v3, 4
				%v5 = or i32 %v3, %v4
				%v6 = lshr i32 %v5, 8
				%v7 = or i32 %v5, %v6
				%v8 = lshr i32 %v7, 16
				%v9 = or i32 %v7, %v8
				%v10 = and i32 %v9, 1431655765
				%v11 = lshr i32 %v9, 1
				%v12 = and i32 %v11, 1431655765
				%v13 = add nuw i32 %v10, %v12
				%v14 = and i32 %v13, 858993459
				%v15 = lshr i32 %v13, 2
				%v16 = and i32 %v15, 858993459
				%v17 = add nuw nsw i32 %v14, %v16
				%v18 = and i32 %v17, 117901063
				%v19 = lshr i32 %v17, 4
				%v20 = and i32 %v19, 117901063
				%v21 = add nuw nsw i32 %v18, %v20
				%v22 = and i32 %v21, 983055
				%v23 = lshr i32 %v21, 8
				%v24 = and i32 %v23, 983055
				%v25 = add nuw nsw i32 %v22, %v24
				%v26 = and i32 %v25, 31
				%v27 = lshr i32 %v25, 16
				%v28 = add nuw nsw i32 %v26, %v27
				%v29 = sub nsw i32 32, %v28
				ret i32 %v29
				}

				; typedef unsigned long long u64_t;
				; u64_t ctlz64(u64_t t0) {
				; t0 = t0 | (t0>>1);
				; t0 = t0 | (t0>>2);
				; t0 = t0 | (t0>>4);
				; t0 = t0 | (t0>>8);
				; t0 = t0 | (t0>>16);
				; t0 = t0 | (t0>>32);
				; u64_t t1 = (t0 & 0x5555555555555555LL) + ((t0>>1) & 0x5555555555555555LL);
				; u64_t t2 = (t1 & 0x3333333333333333LL) + ((t1>>2) & 0x3333333333333333LL);
				; u64_t t3 = (t2 & 0x0F0F0F0F0F0F0F0FLL) + ((t2>>4) & 0x0F0F0F0F0F0F0F0FLL);
				; u64_t t4 = (t3 & 0x00FF00FF00FF00FFLL) + ((t3>>8) & 0x00FF00FF00FF00FFLL);
				; u64_t t5 = (t4 & 0x0000FFFF0000FFFFLL) + ((t4>>16) & 0x0000FFFF0000FFFFLL);
				; u64_t t6 = (t5 & 0x00000000FFFFFFFFLL) + ((t5>>32) & 0x00000000FFFFFFFFLL);
				; return 64-t6;
				; }
				;
				; CHECK-LABEL: define i64 @ctlz64
				; CHECK: @llvm.ctlz.i64(i64 %a0, i1 false)
				define i64 @ctlz64(i64 %a0) local_unnamed_addr #1 {
				b0:
				%v0 = lshr i64 %a0, 1
				%v1 = or i64 %v0, %a0
				%v2 = lshr i64 %v1, 2
				%v3 = or i64 %v1, %v2
				%v4 = lshr i64 %v3, 4
				%v5 = or i64 %v3, %v4
				%v6 = lshr i64 %v5, 8
				%v7 = or i64 %v5, %v6
				%v8 = lshr i64 %v7, 16
				%v9 = or i64 %v7, %v8
				%v10 = lshr i64 %v9, 32
				%v11 = or i64 %v9, %v10
				%v12 = and i64 %v11, 6148914691236517205
				%v13 = lshr i64 %v11, 1
				%v14 = and i64 %v13, 6148914691236517205
				%v15 = add nuw i64 %v12, %v14
				%v16 = and i64 %v15, 3689348814741910323
				%v17 = lshr i64 %v15, 2
				%v18 = and i64 %v17, 3689348814741910323
				%v19 = add nuw nsw i64 %v16, %v18
				%v20 = and i64 %v19, 506381209866536711
				%v21 = lshr i64 %v19, 4
				%v22 = and i64 %v21, 506381209866536711
				%v23 = add nuw nsw i64 %v20, %v22
				%v24 = and i64 %v23, 4222189076152335
				%v25 = lshr i64 %v23, 8
				%v26 = and i64 %v25, 4222189076152335
				%v27 = add nuw nsw i64 %v24, %v26
				%v28 = and i64 %v27, 133143986207
				%v29 = lshr i64 %v27, 16
				%v30 = and i64 %v29, 133143986207
				%v31 = add nuw nsw i64 %v28, %v30
				%v32 = and i64 %v31, 63
				%v33 = lshr i64 %v31, 32
				%v34 = add nuw nsw i64 %v32, %v33
				%v35 = sub nsw i64 64, %v34
				ret i64 %v35
				}

				attributes #0 = { norecurse nounwind readnone uwtable }
				attributes #1 = { nounwind uwtable }
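
The IR in @ctlz16 uses narrower constants than the C source in its comment: 1799 (0x0707) instead of 0x0F0F, 15 instead of 0x00FF, and no mask on the final shifted term. Why they differ is not stated in the review; presumably an earlier simplification narrowed them, which is an assumption. The sketch below only checks that the two forms agree on every zero-extended 16-bit input. The function names are hypothetical and do not appear in the patch.

```cpp
#include <cstdint>

// pop16 with the masks as written in the C source comment.
constexpr uint32_t pop16SourceMasks(uint32_t T0) {
  uint32_t T1 = (T0 & 0x5555u) + ((T0 >> 1) & 0x5555u);
  uint32_t T2 = (T1 & 0x3333u) + ((T1 >> 2) & 0x3333u);
  uint32_t T3 = (T2 & 0x0F0Fu) + ((T2 >> 4) & 0x0F0Fu);
  return (T3 & 0x00FFu) + ((T3 >> 8) & 0x00FFu);
}

// The same computation with the constants that appear in the IR:
// 21845 = 0x5555, 13107 = 0x3333, 1799 = 0x0707, 15 = 0x000F.
constexpr uint32_t pop16IRMasks(uint32_t V8) {
  uint32_t V12 = (V8 & 21845u) + ((V8 >> 1) & 21845u);
  uint32_t V16 = (V12 & 13107u) + ((V12 >> 2) & 13107u);
  uint32_t V20 = (V16 & 1799u) + ((V16 >> 4) & 1799u);
  return (V20 & 15u) + (V20 >> 8);
}

// Exhaustively compare the two forms over all values that fit in 16 bits,
// i.e. every possible result of "zext i16 ... to i32".
inline bool masksAgreeOn16BitInputs() {
  for (uint32_t X = 0; X <= 0xFFFFu; ++X)
    if (pop16SourceMasks(X) != pop16IRMasks(X))
      return false;
  return true;
}
```

The narrower masks work because after the 2-bit stage each 4-bit field holds a count of at most 4, and after the 4-bit stage each byte holds at most 8, so the dropped mask bits are always zero for these inputs.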

test/Transforms/AggressiveInstCombine/ctpop-combine.ll

This file was added.

				; RUN: opt -aggressive-instcombine -S < %s | FileCheck %s

				; unsigned pop8(unsigned char t0) {
				; unsigned t1 = (t0 & 0x55) + ((t0>>1) & 0x55);
				; unsigned t2 = (t1 & 0x33) + ((t1>>2) & 0x33);
				; unsigned t3 = (t2 & 0x0F) + ((t2>>4) & 0x0F);
				; return t3;
				; }
				;
				; CHECK-LABEL: define i32 @pop8
				; CHECK: [[ARG8:%[a-zA-Z0-9_]+]] = zext i8 %a0 to i32
				; CHECK: @llvm.ctpop.i32(i32 [[ARG8]])
				define i32 @pop8(i8 zeroext %a0) local_unnamed_addr #0 {
				b0:
				%v0 = zext i8 %a0 to i32
				%v1 = and i32 %v0, 85
				%v2 = lshr i32 %v0, 1
				%v3 = and i32 %v2, 85
				%v4 = add nuw nsw i32 %v1, %v3
				%v5 = and i32 %v4, 51
				%v6 = lshr i32 %v4, 2
				%v7 = and i32 %v6, 51
				%v8 = add nuw nsw i32 %v5, %v7
				%v9 = and i32 %v8, 7
				%v10 = lshr i32 %v8, 4
				%v11 = add nuw nsw i32 %v9, %v10
				ret i32 %v11
				}

				; unsigned pop16(unsigned short t0) {
				; unsigned t1 = (t0 & 0x5555) + ((t0>>1) & 0x5555);
				; unsigned t2 = (t1 & 0x3333) + ((t1>>2) & 0x3333);
				; unsigned t3 = (t2 & 0x0F0F) + ((t2>>4) & 0x0F0F);
				; unsigned t4 = (t3 & 0x00FF) + ((t3>>8) & 0x00FF);
				; return t4;
				; }
				;
				; CHECK-LABEL: define i32 @pop16
				; CHECK: [[ARG16:%[a-zA-Z0-9_]+]] = zext i16 %a0 to i32
				; CHECK: @llvm.ctpop.i32(i32 [[ARG16]])
				define i32 @pop16(i16 zeroext %a0) local_unnamed_addr #1 {
				b0:
				%v0 = zext i16 %a0 to i32
				%v1 = and i32 %v0, 21845
				%v2 = lshr i32 %v0, 1
				%v3 = and i32 %v2, 21845
				%v4 = add nuw nsw i32 %v1, %v3
				%v5 = and i32 %v4, 13107
				%v6 = lshr i32 %v4, 2
				%v7 = and i32 %v6, 13107
				%v8 = add nuw nsw i32 %v5, %v7
				%v9 = and i32 %v8, 1799
				%v10 = lshr i32 %v8, 4
				%v11 = and i32 %v10, 1799
				%v12 = add nuw nsw i32 %v9, %v11
				%v13 = and i32 %v12, 15
				%v14 = lshr i32 %v12, 8
				%v15 = add nuw nsw i32 %v13, %v14
				ret i32 %v15
				}

				; unsigned pop32(unsigned t0) {
				; unsigned t1 = (t0 & 0x55555555) + ((t0>>1) & 0x55555555);
				; unsigned t2 = (t1 & 0x33333333) + ((t1>>2) & 0x33333333);
				; unsigned t3 = (t2 & 0x0F0F0F0F) + ((t2>>4) & 0x0F0F0F0F);
				; unsigned t4 = (t3 & 0x00FF00FF) + ((t3>>8) & 0x00FF00FF);
				; unsigned t5 = (t4 & 0x0000FFFF) + ((t4>>16) & 0x0000FFFF);
				; return t5;
				; }
				;
				; CHECK-LABEL: define i32 @pop32
				; CHECK: @llvm.ctpop.i32(i32 %a0)
				define i32 @pop32(i32 %a0) local_unnamed_addr #1 {
				b0:
				%v0 = and i32 %a0, 1431655765
				%v1 = lshr i32 %a0, 1
				%v2 = and i32 %v1, 1431655765
				%v3 = add nuw i32 %v0, %v2
				%v4 = and i32 %v3, 858993459
				%v5 = lshr i32 %v3, 2
				%v6 = and i32 %v5, 858993459
				%v7 = add nuw nsw i32 %v4, %v6
				%v8 = and i32 %v7, 117901063
				%v9 = lshr i32 %v7, 4
				%v10 = and i32 %v9, 117901063
				%v11 = add nuw nsw i32 %v8, %v10
				%v12 = and i32 %v11, 983055
				%v13 = lshr i32 %v11, 8
				%v14 = and i32 %v13, 983055
				%v15 = add nuw nsw i32 %v12, %v14
				%v16 = and i32 %v15, 31
				%v17 = lshr i32 %v15, 16
				%v18 = add nuw nsw i32 %v16, %v17
				ret i32 %v18
				}

				; typedef unsigned long long u64_t;
				; u64_t pop64(u64_t t0) {
				; u64_t t1 = (t0 & 0x5555555555555555LL) + ((t0>>1) & 0x5555555555555555LL);
				; u64_t t2 = (t1 & 0x3333333333333333LL) + ((t1>>2) & 0x3333333333333333LL);
				; u64_t t3 = (t2 & 0x0F0F0F0F0F0F0F0FLL) + ((t2>>4) & 0x0F0F0F0F0F0F0F0FLL);
				; u64_t t4 = (t3 & 0x00FF00FF00FF00FFLL) + ((t3>>8) & 0x00FF00FF00FF00FFLL);
				; u64_t t5 = (t4 & 0x0000FFFF0000FFFFLL) + ((t4>>16) & 0x0000FFFF0000FFFFLL);
				; u64_t t6 = (t5 & 0x00000000FFFFFFFFLL) + ((t5>>32) & 0x00000000FFFFFFFFLL);
				; return t6;
				; }
				;
				; CHECK-LABEL: define i64 @pop64
				; CHECK: @llvm.ctpop.i64(i64 %a0)
				define i64 @pop64(i64 %a0) local_unnamed_addr #1 {
				b0:
				%v0 = and i64 %a0, 6148914691236517205
				%v1 = lshr i64 %a0, 1
				%v2 = and i64 %v1, 6148914691236517205
				%v3 = add nuw i64 %v0, %v2
				%v4 = and i64 %v3, 3689348814741910323
				%v5 = lshr i64 %v3, 2
				%v6 = and i64 %v5, 3689348814741910323
				%v7 = add nuw nsw i64 %v4, %v6
				%v8 = and i64 %v7, 506381209866536711
				%v9 = lshr i64 %v7, 4
				%v10 = and i64 %v9, 506381209866536711
				%v11 = add nuw nsw i64 %v8, %v10
				%v12 = and i64 %v11, 4222189076152335
				%v13 = lshr i64 %v11, 8
				%v14 = and i64 %v13, 4222189076152335
				%v15 = add nuw nsw i64 %v12, %v14
				%v16 = and i64 %v15, 133143986207
				%v17 = lshr i64 %v15, 16
				%v18 = and i64 %v17, 133143986207
				%v19 = add nuw nsw i64 %v16, %v18
				%v20 = and i64 %v19, 63
				%v21 = lshr i64 %v19, 32
				%v22 = add nuw nsw i64 %v20, %v21
				ret i64 %v22
				}

				attributes #0 = { norecurse nounwind readnone }
				attributes #1 = { nounwind uwtable }
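
The @pop8 and @pop16 tests exercise the BW < TW case described in optimizeToCtpop: the idiom counts only the low BW bits, but the replacement is a ctpop on the full TW-bit value, so the bits from BW up to TW must be known to be zero. The sketch below illustrates that requirement on concrete values, in plain C++ with no LLVM dependency. The helper names are hypothetical, and highBitsZero is only a concrete-value analogue of the (K.Zero & Need0) == Need0 test, not the known-bits analysis itself.

```cpp
#include <cstdint>

// The pop8 idiom from the test, applied to a 32-bit value (BW = 8, TW = 32).
constexpr uint32_t pop8Idiom(uint32_t T0) {
  uint32_t T1 = (T0 & 0x55u) + ((T0 >> 1) & 0x55u);
  uint32_t T2 = (T1 & 0x33u) + ((T1 >> 2) & 0x33u);
  return (T2 & 0x0Fu) + ((T2 >> 4) & 0x0Fu);
}

// Naive full-width population count, standing in for ctpop on i32.
constexpr uint32_t popReference32(uint32_t X) {
  uint32_t C = 0;
  for (; X != 0; X >>= 1)
    C += X & 1u;
  return C;
}

// True when bits [BW, 32) of X are all zero.
constexpr bool highBitsZero(uint32_t X, unsigned BW) {
  return BW >= 32 || (X >> BW) == 0;
}

// With the high bits zero, the narrow idiom and the wide count agree.
static_assert(highBitsZero(0xFFu, 8), "0xFF fits in 8 bits");
static_assert(pop8Idiom(0xFFu) == popReference32(0xFFu), "safe to rewrite");

// With bit 8 set they disagree, so the rewrite would change the result.
static_assert(!highBitsZero(0x100u, 8), "bit 8 is set");
static_assert(pop8Idiom(0x100u) != popReference32(0x100u), "unsafe to rewrite");
```

In the tests the zero bits come from "zext i8 %a0 to i32" and "zext i16 %a0 to i32", which is why the expected output is a ctpop.i32 on the zero-extended argument.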