This is an archive of the discontinued LLVM Phabricator instance.

Recognize CTLZ builtin
ClosedPublic

Authored by evstupac on Apr 27 2017, 11:22 AM.

Details

Summary

This patch recognizes loops like the following and rewrites them using ctlz (making them countable):

while (n) {
  n >>= 1;
  i++;
  body();
}
use(i);

Should be replaced with:

ii = type_bitwidth(n) - ctlz(n);
for (j = 0; j < ii; j++) {
  body();
}
use(ii + i);
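
(As an illustration only, not part of the patch: a minimal standalone check of this equivalence, assuming a 32-bit unsigned n and the Clang/GCC __builtin_clz builtin, which is undefined at zero, hence the separate n == 0 case:)

  #include <cassert>
  #include <climits>

  // Trip count of the original loop: how many times the body runs.
  static unsigned loop_count(unsigned n) {
    unsigned i = 0;
    while (n) {
      n >>= 1;
      i++;
    }
    return i;
  }

  // Closed form from the summary: type_bitwidth(n) - ctlz(n).
  static unsigned ctlz_count(unsigned n) {
    const unsigned BitWidth = sizeof(unsigned) * CHAR_BIT;
    return n ? BitWidth - (unsigned)__builtin_clz(n) : 0;
  }

  int main() {
    for (unsigned n = 0; n < (1u << 20); n++)
      assert(loop_count(n) == ctlz_count(n));
    return 0;
  }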

Diff Detail

Repository
rL LLVM

Event Timeline

evstupac created this revision.Apr 27 2017, 11:22 AM

Regarding performance: almost all benchmarks build the same. Some image benchmarks that use the Huffman algorithm get a good improvement (~5-10%):
https://github.com/mozilla/mozjpeg/blob/master/jchuff.c

joerg added a subscriber: joerg.Apr 27 2017, 11:53 AM

What about platforms that don't have ctlz/ffs hardware support?

rengolin edited edge metadata.Apr 27 2017, 11:59 AM

Hi Evgeny,

I think this is an interesting concept, but I share Joerg's concern. This has to, at least, calculate the cost of CTLZ per architecture and compare it with the cost of the loop.

For example, if the loop count is known and short, and the platform has no CTLZ, then it's quite possible that the transformation will render poorer code. It doesn't have to be a very accurate cost, but something needs to be done.

You'd also need some more testing on non-x86 platforms, to make sure the decisions are good for all of them.

cheers,
--renato

lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1191 ↗ (On Diff #96950)

You don't need the extra lexical blocks.

http://llvm.org/docs/CodingStandards.html

test/Transforms/LoopIdiom/ctlz.ll
1 ↗ (On Diff #96950)

This won't work on builders that don't build x86. Yes, we do have those. :)

http://llvm.org/docs/TestingGuide.html#platform-specific-tests

What about platforms that don't have ctlz/ffs hardware support?

I can guard this. Or keep it as is, if we consider converting the loop to a countable one generally good. In that case the builtin should be expanded at codegen in the most profitable way.

In that case the builtin should be expanded at codegen in the most profitable way.

The expansion might not be the "most profitable way". :)

Hi Renato,

I assume that if the CPU has ctlz the transformation is always profitable. For x86 CPUs this is the case.
In the LLVM source code we replace such loops with the builtin in the APInt module for better compile time.
For other architectures it is questionable (and I'm not able to test this); however, I agree to guard this.

Thanks,
Evgeny

hfinkel edited edge metadata.Apr 27 2017, 12:19 PM

We could add a TTI callback like we have for popcnt. You could also argue that this is the better canonical form because it can enable other analysis/optimizations (and we should fix the expansion if it is suboptimal).
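
(For reference, the popcnt hook mentioned here is TargetTransformInfo::getPopcntSupport; a minimal hedged sketch of that style of guard, with an illustrative helper name:)

  #include "llvm/Analysis/TargetTransformInfo.h"
  using namespace llvm;

  // Sketch of the existing popcnt-style guard: only form the intrinsic
  // when the target reports fast hardware support for this bit width.
  static bool popcntIsProfitable(const TargetTransformInfo &TTI,
                                 unsigned BitWidth) {
    return TTI.getPopcntSupport(BitWidth) ==
           TargetTransformInfo::PSK_FastHardware;
  }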

We could add a TTI callback like we have for popcnt. You could also argue that this is the better canonical form because it can enable other analysis/optimizations (and we should fix the expansion if it is suboptimal).

I also feel that a "not perfect expand" for a particular CPU is an issue that should be resolved by the CPU owner.
Besides converting the loop to a countable one, we also get:

  1. A clear range for the ctlz result (from 0 to bitwidth).
  2. Some combine optimizations that can apply.
  3. The ability to expand the intrinsic in the most profitable way for each CPU.

Guarding is easier. Let's do this only if someone complains about performance.

joerg added a comment.Apr 27 2017, 2:08 PM

I agree. If the CPU has it, it will be beneficial. If it doesn't, it is only a useful transformation if the intrinsic can be constant folded.

evstupac updated this revision to Diff 96995.Apr 27 2017, 2:09 PM

Fixed inline comments.

Guarding is easier. Let's do this only if someone complains about performance.

You mean guarding with a TTI callback, right?

Guarding is easier. Let's do this only if someone complains about performance.

You mean guarding with a TTI callback, right?

Yes.

Yes.

Sounds good.

Yes.

Sounds good.

Ok. The only missing part now is review and approval. :-)

Wait, we also need the TTI callbacks. I thought you were going to introduce them.

I thought everybody agreed on the following.

We could add a TTI callback like we have for popcnt. You could also argue that this is the better canonical form because it can enable other analysis/optimizations (and we should fix the expansion if it is suboptimal).

I also feel that a "not perfect expand" for a particular CPU is an issue that should be resolved by the CPU owner.
Besides converting the loop to a countable one, we also get:

  1. A clear range for the ctlz result (from 0 to bitwidth).
  2. Some combine optimizations that can apply.
  3. The ability to expand the intrinsic in the most profitable way for each CPU.

Guarding is easier. Let's do this only if someone complains about performance.

I agree. If the CPU has it, it will be beneficial. If it doesn't, it is only a useful transformation if the intrinsic can be constant folded.

@joerg, can you clarify what you agree with?

It sounds to me like you're worried that the intrinsics will be worse in some cases?

ARM(32/64) and x86_64 have CLZ, so it should always be beneficial, but I was just worried I was missing something important.

cheers,
--renato

joerg added a comment.May 9 2017, 2:52 PM

That's not true. ARMv4, for example, has no clz; it's a V5T feature. That's my point: if the CPU has no direct lowering for the intrinsic, this transform is beneficial only if the resulting intrinsic can be constant folded. But I wonder if we don't catch those cases already with SCEV-based optimisations. If a CPU has no direct lowering, like on ARMv4, it will add a libcall, with a high chance of being more expensive than the optimisation.

The target-independent lowering code emits this for CTLZ when it's not supported. I think the popcount expands to http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel if it's not supported. So there shouldn't be a libcall unless the target is changing the default behavior.

// for now, we do this:
// x = x | (x >> 1);
// x = x | (x >> 2);
// ...
// x = x | (x >>16);
// x = x | (x >>32); // for 64-bit input
// return popcount(~x);
//
// Ref: "Hacker's Delight" by Henry Warren
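
(Hedged as an illustration, the same expansion written out for a 32-bit input, with popcount inlined via the parallel bit-count from the bithacks link above:)

  // Generic software CTLZ: smear the leading one bit rightwards, then
  // count the ones in the complement (all positions above the leading bit).
  static unsigned ctlz32_generic(unsigned x) {
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x = ~x;
    // Parallel popcount ("Hacker's Delight" / Stanford bithacks).
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
  }

For a 32-bit input this comes to roughly twenty operations, which is in the ballpark of the instruction counts discussed below.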

if the CPU has no direct lowering for the intrinsic, this transform is beneficial only if the resulting intrinsic can be constant folded

Why?
What about converting the loop to a countable one?
What about InstCombine optimizations (not sure how useful they are, but still)?

// fold (srl (ctlz x), "5") -> x  iff x has one bit set (the low bit).
// select_cc seteq X, 0, sizeof(X), ctlz(X) -> ctlz(X) 
// select_cc seteq X, 0, sizeof(X), ctlz_zero_undef(X) -> ctlz(X)
// select_cc seteq X, 0, sizeof(X), cttz(X) -> cttz(X)
// select_cc seteq X, 0, sizeof(X), cttz_zero_undef(X) -> cttz(X)

.....
What about clear range of CTLZ(X): 0 <= CTLZ(X) <= bitwidth(X)?

joerg added a comment.May 9 2017, 3:26 PM

Yeah, but even the generic expansion results in ~19 instructions on ARMv4. Compare that to one instruction in the loop and it can hardly be said to be a general win.

Yeah, but even the generic expansion results in ~19 instructions on ARMv4. Compare that to one instruction in the loop and it can hardly be said to be a general win.

There should be at least 3 instructions in the loop: add, shift, and branch. For a 32-bit input it will run 16 iterations on average, so 16 * 3 = 48 > 19.

joerg added a comment.May 9 2017, 4:05 PM

if the CPU has no direct lowering for the intrinsic, this transform is beneficial only if the resulting intrinsic can be constant folded

Why?
What about converting the loop to a countable one?
What about clear range of CTLZ(X): 0 <= CTLZ(X) <= bitwidth(X)?

I don't think this transform changes anything about the countability of the loop; SCEV should certainly be able to do that.

What about InstCombine optimizations (not sure how useful they are, but still)?

// fold (srl (ctlz x), "5") -> x  iff x has one bit set (the low bit).
// select_cc seteq X, 0, sizeof(X), ctlz(X) -> ctlz(X) 
// select_cc seteq X, 0, sizeof(X), ctlz_zero_undef(X) -> ctlz(X)
// select_cc seteq X, 0, sizeof(X), cttz(X) -> cttz(X)
// select_cc seteq X, 0, sizeof(X), cttz_zero_undef(X) -> cttz(X)

.....

Yeah, but even the generic expansion results in ~19 instructions on ARMv4. Compare that to one instruction in the loop and it can hardly be said to be a general win.

There should be at least 3 instructions in the loop: add, shift, and branch. For a 32-bit input it will run 16 iterations on average, so 16 * 3 = 48 > 19.

You are miscounting. The very example you originally gave trades a shift+count based loop for a clz + increment based loop. Naively speaking, without subtleties of the architecture, that saves one instruction in the loop. Any expansion of clz will be worse most of the time.

The udivmodsi4.S implementation in compiler-rt is a good example -- if clz can be used, it provides a nice optimization. Otherwise the stupid linear checking performs better in the majority of cases. Please also keep in mind that we are talking about potentially executing fewer instructions vs blowing up .text here.

if the CPU has no direct lowering for the intrinsic, this transform is beneficial only if the resulting intrinsic can be constant folded

Why?
What about converting the loop to a countable one?
What about clear range of CTLZ(X): 0 <= CTLZ(X) <= bitwidth(X)?

I don't think this transform changes anything about the countability of the loop; SCEV should certainly be able to do that.

It does change it. SCEV is unable to compute the BECount for a while (n >>= 1) loop. Actually the BECount should be CTLZ(n) - 1...

There should be at least 3 instructions in the loop: add, shift, and branch. For a 32-bit input it will run 16 iterations on average, so 16 * 3 = 48 > 19.

You are miscounting. The very example you originally gave trades a shift+count based loop for a clz + increment based loop. Naively speaking, without subtleties of the architecture, that saves one instruction in the loop. Any expansion of clz will be worse most of the time.

If we get rid of the loop, I'm not miscounting.
If the loop is just converted to a countable one, other optimizations become applicable, like unroll, LSR, vectorization... with potentially great impact.

The udivmodsi4.S implementation in compiler-rt is a good example -- if clz can be used, it provides a nice optimization. Otherwise the stupid linear checking performs better in the majority of cases. Please also keep in mind that we are talking about potentially executing fewer instructions vs blowing up .text here.

Is it ok if we apply the optimization only in the case where the whole loop is converted to CTLZ?
If we just convert it to a countable loop, then there could be corner cases, which we can guard with TTI for architectures that get regressions (if we get any).

joerg added a comment.May 9 2017, 4:36 PM

If the loop is just converted to a countable one, other optimizations become applicable, like unroll, LSR, vectorization... with potentially great impact.

That is something SCEV should be able to discover on its own.

The udivmodsi4.S implementation in compiler-rt is a good example -- if clz can be used, it provides a nice optimization. Otherwise the stupid linear checking performs better in the majority of cases. Please also keep in mind that we are talking about potentially executing fewer instructions vs blowing up .text here.

Is it ok if we apply the optimization only in the case where the whole loop is converted to CTLZ?

Replacing the full loop with the intrinsic is ok. The current default lowering is broken, but improving that is orthogonal. I.e. from a code size perspective, trading the loop for a libcall is still an improvement when using an optimized library version.

If we just convert it to a countable loop, then there could be corner cases, which we can guard with TTI for architectures that get regressions (if we get any).

Hoisting the computation out of the loop without removing it should be guarded by the CPU support for CTLZ, correct.

If the loop is just converted to a countable one, other optimizations become applicable, like unroll, LSR, vectorization... with potentially great impact.

That is something SCEV should be able to discover on its own.

Could you briefly describe how? The only way to get the BECount is to calculate CTLZ. What should SCEV discover?
The only optimization that could avoid the CTLZ calculation is a full unroll.

Replacing the full loop with the intrinsic is ok. The current default lowering is broken, but improving that is orthogonal. I.e. from a code size perspective, trading the loop for a libcall is still an improvement when using an optimized library version.

Loop idiom recognition is not the best place to check whether we'll be able to delete a loop or not. If we decide to insert a TTI guard, I'll do it at the beginning.
If you know that the default lowering is broken, please file a bug report. Someone could use __builtin_ctlz and get wrong code.

That's not true. ARMv4, for example, has no clz; it's a V5T feature.

I keep forgetting about ARMv4... :)

My base point here is that we should avoid arguments like "an issue that should be resolved by the CPU owner".

For example, no one I know *benchmarks on ARMv4*. I know people who test stuff on it (Joerg, Saleem), but they don't benchmark it. This would have passed unnoticed for how long?

I think we need to take a more conservative view on optimisations, get either an approval or a "don't care" from CPU owners, and *then* let them work out the problems later, if they come.

I particularly care about ARMv7+, and that's why I wanted to make sure Joerg was happy, because he cares about areas I usually don't.

And I agree with Joerg, this will have a large impact on ARMv4, not just performance, but also code size. This feature *needs* a hook.

cheers,
--renato

joerg added a comment.May 10 2017, 5:01 AM

Just as a side note: this is not only ARMv4. Older SPARC, M68K, SH, IA64, and Alpha all seem to lack an FFS / CLZ instruction, from a cursory check.

evstupac updated this revision to Diff 98699.May 11 2017, 4:18 PM

Add a TTI guard using the existing "getIntrinsicCost" function.
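
(A minimal hedged sketch of what such a guard can look like against the TTI interface of this period, assuming InitX is the ctlz input and ZeroCheck its is-known-nonzero flag; the helper name is illustrative, and getIntrinsicCost has since been superseded by newer cost APIs:)

  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Intrinsics.h"
  using namespace llvm;

  // Only form llvm.ctlz when the target considers the intrinsic cheap.
  // The second ctlz operand encodes whether a zero input is undefined.
  static bool ctlzIsCheap(const TargetTransformInfo &TTI, Value *InitX,
                          bool ZeroCheck) {
    LLVMContext &Ctx = InitX->getContext();
    const Value *Args[] = {InitX, ZeroCheck ? ConstantInt::getTrue(Ctx)
                                            : ConstantInt::getFalse(Ctx)};
    return TTI.getIntrinsicCost(Intrinsic::ctlz, InitX->getType(), Args) <=
           TargetTransformInfo::TCC_Basic;
  }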

evstupac added inline comments.May 11 2017, 4:21 PM
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1304–1307 ↗ (On Diff #98699)

Actually the arguments are unused now, and we don't need to fill them in for TTI->getIntrinsicCost. However, to avoid bugs in the future, it is better to rely on the real arguments and type.

Right, just to make sure it's doing what we expect it to, did you try to run this with targets "armv7a" and "armv4t"? They should have different results.

Also, having a new test on the ARM side with two RUN lines, one for each, would make the intentions much clearer.

cheers,
--renato

evstupac updated this revision to Diff 98828.May 12 2017, 1:30 PM

Add a test for armv4t and armv7a.
Update the x86 tests to check CPUs that do and do not support ctlz.
Add a check on the number of instructions in the loop, intended to determine whether idiom recognition will remove the loop or not.
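
(A hedged sketch of that instruction-count check; the six-instruction figure and helper name are illustrative:)

  #include "llvm/Analysis/LoopInfo.h"

  // Treat the idiom as deleting the loop when the header contains nothing
  // beyond the recognized pattern (two phis, shift, compare, increment,
  // branch -- six instructions). In that case the transform pays off even
  // without a cheap hardware CTLZ.
  static bool idiomDeletesLoop(const llvm::Loop &L) {
    const unsigned IdiomCanonicalSize = 6; // illustrative constant
    return L.getHeader()->size() == IdiomCanonicalSize;
  }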

rengolin added inline comments.May 12 2017, 1:42 PM
test/Transforms/LoopIdiom/ARM/ctlz.ll
150 ↗ (On Diff #98828)

I'll let Joerg comment on this. It'll depend on the generated code vs. the loop's size, branches, etc.

evstupac added inline comments.May 12 2017, 1:52 PM
test/Transforms/LoopIdiom/ARM/ctlz.ll
150 ↗ (On Diff #98828)

It looks like Joerg already answered this:

Replacing the full loop with the intrinsic is ok.

I thought it would be complicated to check, but it looks like the instruction count is a good way to do it.

rengolin accepted this revision.May 13 2017, 3:32 AM

Perfect, thanks! LGTM.

This revision is now accepted and ready to land.May 13 2017, 3:32 AM

Can you check the countable part in one of the cases? Otherwise it looks good.

Can you check the countable part in one of the cases? Otherwise it looks good.

If I get you right, "ctlz_and_other" checks the conversion to a countable loop, and that CPUs that do not support ctlz will not insert it in this case.

This revision was automatically updated to reflect the committed changes.

Can you check the countable part in one of the cases? Otherwise it looks good.

If I get you right, "ctlz_and_other" checks the conversion to a countable loop, and that CPUs that do not support ctlz will not insert it in this case.

I mean: it doesn't translate the loop to use ctlz, but the comment suggests that it still transforms the loop into something better understood. But it doesn't test that part?

Can you check the countable part in one of the cases? Otherwise it looks good.

If I get yuo right, "ctlz_and_other" checks conversion to countable loop and that CPUs that not support ctlz will not insert it in this case.

I mean: it doesn't translate the loop to use ctlz, but the comment suggests that it still transforms the loop into something better understood. But it doesn't test that part?

Got it.
Yes, you are right, it would be good to check that the loop exit condition was changed as well.
I'll update the test accordingly.

evstupac added inline comments.May 16 2017, 3:07 PM
llvm/trunk/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1296

Just made a follow-up commit, r303212, adding a "!IsCntPhiUsedOutsideLoop" check here.
Previously the code was potentially buggy: in the IsCntPhiUsedOutsideLoop case we inserted CTLZ(X >> 1) but relied on the check X != 0 (instead of (X >> 1) != 0).