This is an archive of the discontinued LLVM Phabricator instance.

Recognize CTLZ builtin
ClosedPublic

Authored by evstupac on Apr 27 2017, 11:22 AM.

Details

Summary

The patch targeted to recognize and replace with ctlz (make countable) loops like:

while(n) {

n >>= 1;
i++;
body();

}
use(i);

Should be replaced with:
ii = type_bitwidth(n) - ctlz(n);
for(j = 0; j < ii; j++) {
body();
}
use(ii + i)

Diff Detail

Repository
rL LLVM

Event Timeline

evstupac created this revision.Apr 27 2017, 11:22 AM

Regarding performance. Almost all benchmark are build same.
Some image benchmarks that uses Huffman algorithm get good improve (~5-10%):
https://github.com/mozilla/mozjpeg/blob/master/jchuff.c

joerg added a subscriber: joerg.Apr 27 2017, 11:53 AM

What about platforms that don't have ctlz/ffs hardware support?

rengolin edited edge metadata.Apr 27 2017, 11:59 AM

Hi Evgeny,

I think this is an interesting concept, but I share Joerg's concern. This has to, at least, calculate the cost of CTLZ, per architecture, and compare with the cost of the loop.

For example, if the loop count is known and short, and the platform has no CTLZ, then it's quite possible that the transformation will render poorer code. It doesn't have to be a very accurate cost, but something needs to be done.

You'd also need some more testing on non-x86 platforms, to make sure the decisions are good for all of them.

cheers,
--renato

lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1206

You don't need the extra lexical blocks.

http://llvm.org/docs/CodingStandards.html

test/Transforms/LoopIdiom/ctlz.ll
1 ↗(On Diff #96950)

This won't work on builders that don't build x86. Yes, we do have those. :)

http://llvm.org/docs/TestingGuide.html#platform-specific-tests

What about platforms that don't have ctlz/ffs hardware support?

I can guard this. Or keep as is if we consider converting loop to countable is generally good. In this case builtin should be expanded at codegen in the most profitable way.

In this case builtin should be expanded at codegen in the most profitable way.

The expansion might not be the "most profitable way". :)

Hi Renato,

I assume if CPU have ctlz the transformation is always profitable. For x86 CPUs it is so.
In LLVM source code we replace such loops with the builtin in APInt module for better compile time.
For other architectures it is questionable (and I'm not able to test this), however I agree to guard this.

Thanks,
Evgeny

hfinkel edited edge metadata.Apr 27 2017, 12:19 PM

We could add a TTI callback like we have for popcnt. You could also argue that this is the better canonical form because it can enable other analysis/optimizations (and we should fix the expansion if it is suboptimal).

We could add a TTI callback like we have for popcnt. You could also argue that this is the better canonical form because it can enable other analysis/optimizations (and we should fix the expansion if it is suboptimal).

I also feel like "not a perfect expand" for particular CPU is the issue that should be resolved by CPU owner.
Besides converting to countable loop we also get:

  1. Clear range for lzct result (from 0 to bitwidth).
  2. Some Combine optimizations that can apply.
  3. An ability to expand intrinsic to the most profitable way for each CPU.

Guarding is easier. Let's do this only if someone complains about performance.

joerg added a comment.Apr 27 2017, 2:08 PM

I agree. If the CPU has it, it will be beneficial. If it doesn't, it is only a useful transformation if the intrinsic can be constant folded.

evstupac updated this revision to Diff 96995.Apr 27 2017, 2:09 PM

Fixed inline comments.

Guarding is easier. Let's do this only if someone complains about performance.

You mean guarding with TTI callback, right?

Guarding is easier. Let's do this only if someone complains about performance.

You mean guarding with TTI callback, right?

Yes.

Yes.

Sounds good.

Yes.

Sounds good.

Ok. The only missing part to start getting respond is review and approval. :-)

Wait, we also need the TTI callbacks. I thought you were going to introduce them.

I thought everybody agreed on the following.

We could add a TTI callback like we have for popcnt. You could also argue that this is the better canonical form because it can enable other analysis/optimizations (and we should fix the expansion if it is suboptimal).

I also feel like "not a perfect expand" for particular CPU is the issue that should be resolved by CPU owner.
Besides converting to countable loop we also get:

  1. Clear range for lzct result (from 0 to bitwidth).
  2. Some Combine optimizations that can apply.
  3. An ability to expand intrinsic to the most profitable way for each CPU.

Guarding is easier. Let's do this only if someone complains about performance.

I agree. If the CPU has it, it will be beneficial. If it doesn't, it is only a useful transformation if the intrinsic can be constant folded.

@joerg, can you clarify what you agree with?

It sounds to be that you're worried that the intrinsics will be worse in some cases?

ARM(32/64) and x86_64 have CLZ, so it should always be beneficial, but I was just worries I'm missing something important.

cheers,
--renato

joerg added a comment.May 9 2017, 2:52 PM

That's not true. ARMv4 for example has no clz, it's a V5T feature. That's my point: if the CPU has no direct lowering for the intrinsic, this transform is beneficial only if the resulting intrinsic can be constant folded. But I wonder if we don't catch those cases already with SCEV based optimisations. If a CPU has no direct lowering like on ARMv4, it will add a libcall and with a high chance of being more expensive than the optimisation.

The target independent lowering code emits this for CTLZ when its not supported. I think the popcount expands to this http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel if its not supported. So there shouldn't be a libcall unless the target is changing the default behavior.

// for now, we do this:
// x = x | (x >> 1);
// x = x | (x >> 2);
// ...
// x = x | (x >>16);
// x = x | (x >>32); // for 64-bit input
// return popcount(~x);
//
// Ref: "Hacker's Delight" by Henry Warren

if the CPU has no direct lowering for the intrinsic, this transform is beneficial only if the resulting intrinsic can be constant folded

Why?
What about converting loop to countable?
What about InstCombine optimizations (not sure how useful they are, but still)?

// fold (srl (ctlz x), "5") -> x  iff x has one bit set (the low bit).
// select_cc seteq X, 0, sizeof(X), ctlz(X) -> ctlz(X) 
// select_cc seteq X, 0, sizeof(X), ctlz_zero_undef(X) -> ctlz(X)
// select_cc seteq X, 0, sizeof(X), cttz(X) -> cttz(X)
// select_cc seteq X, 0, sizeof(X), cttz_zero_undef(X) -> cttz(X)

.....
What about clear range of CTLZ(X): 0 <= CTLZ(X) <= bitwidth(X)?

joerg added a comment.May 9 2017, 3:26 PM

Yeah, but even the generic expansion results in ~19 instructions on ARMv4. Compare that to one instruction in the loop and it can hardly be said to be a general win.

Yeah, but even the generic expansion results in ~19 instructions on ARMv4. Compare that to one instruction in the loop and it can hardly be said to be a general win.

There should be at least 3 instructions in the loop: add, shift and branch. For 32 bit instruction it will run 16 iterations average, so 16 * 3 > 19.

joerg added a comment.May 9 2017, 4:05 PM

if the CPU has no direct lowering for the intrinsic, this transform is beneficial only if the resulting intrinsic can be constant folded

Why?
What about converting loop to countable?
What about clear range of CTLZ(X): 0 <= CTLZ(X) <= bitwidth(X)?

I don't think this transform changes anything about the countability of the loop, SCEV should certainly be able to do

What about InstCombine optimizations (not sure how useful they are, but still)?

// fold (srl (ctlz x), "5") -> x  iff x has one bit set (the low bit).
// select_cc seteq X, 0, sizeof(X), ctlz(X) -> ctlz(X) 
// select_cc seteq X, 0, sizeof(X), ctlz_zero_undef(X) -> ctlz(X)
// select_cc seteq X, 0, sizeof(X), cttz(X) -> cttz(X)
// select_cc seteq X, 0, sizeof(X), cttz_zero_undef(X) -> cttz(X)

.....

Yeah, but even the generic expansion results in ~19 instructions on ARMv4. Compare that to one instruction in the loop and it can hardly be said to be a general win.

There should be at least 3 instructions in the loop: add, shift and branch. For 32 bit instruction it will run 16 iterations average, so 16 * 3 > 19.

You are miscounting. The very example you originally gave trades a shift+count based loop into a clz + increment based loop. Naively speaking without subtleties of the architecture,
that saves one instruction in the loop. Any expansion of clz will be worse most of the time.

The udivmodsi4.S implementation in compiler-rt is a good example -- if clz can be used, it provides a nice optimization. Otherwise the stupid linear checking performs better in the majority of cases. Please keep also in mind that we are talking about potentially executing fewer instructions vs blowing up .text here.

if the CPU has no direct lowering for the intrinsic, this transform is beneficial only if the resulting intrinsic can be constant folded

Why?
What about converting loop to countable?
What about clear range of CTLZ(X): 0 <= CTLZ(X) <= bitwidth(X)?

I don't think this transform changes anything about the countability of the loop, SCEV should certainly be able to do

It changes. SCEV is unable to get BECount for while (n>>=1) loop. Actually BECount should be CTLZ(n) - 1...

There should be at least 3 instructions in the loop: add, shift and branch. For 32 bit instruction it will run 16 iterations average, so 16 * 3 > 19.

You are miscounting. The very example you originally gave trades a shift+count based loop into a clz + increment based loop. Naively speaking without subtleties of the architecture,
that saves one instruction in the loop. Any expansion of clz will be worse most of the time.

If we get rid of a loop I'm not miscounting.
If loop is just converted to countable other optimizations are applicable like unroll, LSR, vectorization... with potential great impact.

The udivmodsi4.S implementation in compiler-rt is a good example -- if clz can be used, it provides a nice optimization. Otherwise the stupid linear checking performs better in the majority of cases. Please keep also in mind that we are talking about potentially executing fewer instructions vs blowing up .text here.

If we apply the optimization only in case whole loop is converted to CTLZ this is ok?
If just convert to countable - then there could be corner cases, which we can guard with TTI for architects that get regressions (if we get).

joerg added a comment.May 9 2017, 4:36 PM

If loop is just converted to countable other optimizations are applicable like unroll, LSR, vectorization... with potential great impact.

That is something SCEV should be able to discover on its own.

The udivmodsi4.S implementation in compiler-rt is a good example -- if clz can be used, it provides a nice optimization. Otherwise the stupid linear checking performs better in the majority of cases. Please keep also in mind that we are talking about potentially executing fewer instructions vs blowing up .text here.

If we apply the optimization only in case whole loop is converted to CTLZ this is ok?

Replacing the full loop with the intrinsic is ok. The current default lowering is broken, but improving that is orthogonal. I.e. from a code size perspective, trading the loop for a libcall is still an improvement when using an optimized library version.

If just convert to countable - then there could be corner cases, which we can guard with TTI for architects that get regressions (if we get).

Hoisting the computation out of the loop without removing it should be guarded by the CPU support for CTLZ, correct.

If loop is just converted to countable other optimizations are applicable like unroll, LSR, vectorization... with potential great impact.

That is something SCEV should be able to discover on its own.

Could you briefly describe how? The only way to get BECount is to calculate CTLZ. What SCEV should discover?
The only optimization that could avoid CTLZ calculation is full unroll

Replacing the full loop with the intrinsic is ok. The current default lowering is broken, but improving that is orthogonal. I.e. from a code size perspective, trading the loop for a libcall is still an improvement when using an optimized library version.

Loop idiom recognition is not the best place to check if we'll be able to delete a loop or not. If we decided to insert TTI guard I'll do this in the beginning.
If you know that default lowering is broken please file a bug report. Someone could use __builtin_ctlz and get a wrong code.

That's not true. ARMv4 for example has no clz, it's a V5T feature.

I keep forgetting about ARMv4... :)

My base point here is that we should avoid arguments like "the issue that should be resolved by CPU owner".

For example, no one I know *benchmarks on ARMv4*. I know people that test stuff on it (Joerg, Saleem), but not benchmark. This would have passed unnoticed for how long?

I think we need to take a more conservative view on optimisations, and get either an approval or "don't care" by CPU ownser, and *then* let them them work out the problems later, if they come.

I particularly care about ARMv7+, and that's why I wanted to make sure Joerg was happy, because he cares about areas I usually don't.

And I agree with Joerg, this will have a large impact on ARMv4, not just performance, but also code size. This feature *needs* a hook.

cheers,
--renato

joerg added a comment.May 10 2017, 5:01 AM

Just as a side note: this is not only ARMv4. Older SPARC, M68K, SH, IA64, Alpha all seem to lack a FFS / CLZ instruction from cursory check.

evstupac updated this revision to Diff 98699.May 11 2017, 4:18 PM

Add a TTI guard using existing "getIntrinsicCost" function.

evstupac added inline comments.May 11 2017, 4:21 PM
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1304–1307

Actually the arguments are unused now. And we don't need to fill them for TTI->getIntrinsicCost. However to avoid bugs in future it is better to rely on real args and type.

Right, just to make sure it's doing what we expect it to, did you try to run this with targets "armv7a" and "armv4t"? They should have different results.

Also, having a new test in the ARM side with two RUN lines, one for each, would make it much clearer the intentions.

cheers,
--renato

evstupac updated this revision to Diff 98828.May 12 2017, 1:30 PM

Add a test for armv4t and armv7a.
Update x86 tests to check CPUs that support ctlz and not.
Add a check on insns number in a loop - supposed to check if idiom recognition will remove the loop or not.

rengolin added inline comments.May 12 2017, 1:42 PM
test/Transforms/LoopIdiom/ARM/ctlz.ll
150 ↗(On Diff #98828)

I'll let Joerg comment on this. It'll depend on the generated code vs. the loop's size, branches, etc.

evstupac added inline comments.May 12 2017, 1:52 PM
test/Transforms/LoopIdiom/ARM/ctlz.ll
150 ↗(On Diff #98828)

It looks like Joerg already answered on this:

Replacing the full loop with the intrinsic is ok.

I thought it would be complicated to check, but looks like insns count is a good way to do this.

rengolin accepted this revision.May 13 2017, 3:32 AM

Perfect, thanks! LGTM.

This revision is now accepted and ready to land.May 13 2017, 3:32 AM

Can you check the countable part in one of the cases? Otherwise it looks good.

Can you check the countable part in one of the cases? Otherwise it looks good.

If I get yuo right, "ctlz_and_other" checks conversion to countable loop and that CPUs that not support ctlz will not insert it in this case.

This revision was automatically updated to reflect the committed changes.

Can you check the countable part in one of the cases? Otherwise it looks good.

If I get yuo right, "ctlz_and_other" checks conversion to countable loop and that CPUs that not support ctlz will not insert it in this case.

I mean: it doesn't translate the loop to use ctlz, but the comment suggests that it still transforms the loop into something better understood. But it doesn't test that part?

Can you check the countable part in one of the cases? Otherwise it looks good.

If I get yuo right, "ctlz_and_other" checks conversion to countable loop and that CPUs that not support ctlz will not insert it in this case.

I mean: it doesn't translate the loop to use ctlz, but the comment suggests that it still transforms the loop into something better understood. But it doesn't test that part?

Get it.
Yes you are right, it would be good to check that loop exit condition was changed as well.
I'll update the test accordingly.

evstupac added inline comments.May 16 2017, 3:07 PM
llvm/trunk/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1296 ↗(On Diff #99042)

Just made a follow up commit r303212 adding "!IsCntPhiUsedOutsideLoop" check here.
Previously the code was potentially buggy because in case IsCntPhiUsedOutsideLoop we inserted CTLZ(X >> 1) and rely on check X != 0 (instead of (X >> 1) != 0).