This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-basics.ll
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-dereferencing-pointer.ll
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-non-argument-value.ll
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-zero-element.ll
llvm/test/Transforms/AggressiveInstCombine/negative-lower-table-based-cttz.ll
llvm/test/Transforms/PhaseOrdering/lower-table-based-cttz.ll

Differential D113291

[AggressiveInstCombine] Lower Table Based CTTZ
ClosedPublic

Authored by djtodoro on Nov 5 2021, 8:58 AM.


Details

Reviewers

craig.topper
spatel
lebedev.ri
fhahn
dmgreen
xbolva00

Commits

rGfec01ee3f524: [AggressiveInstCombine] Lower Table Based CTTZ

Summary

This patch adds recognition of table-based ctz implementations to AggressiveInstCombine.
This fixes [0].

[0] https://bugs.llvm.org/show_bug.cgi?id=46434

TODO: Get the data on SPEC benchmark.
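
For readers who have not seen the idiom, here is a minimal sketch of the kind of source pattern under discussion. It uses the well-known 32-bit "multiply and lookup" variant; the function and table names are made up for illustration, and the patch itself may match other constants and table shapes.

```c
#include <stdint.h>

/* Lookup table for the multiplier 0x077CB531. Slot 0 is also what a zero
   input reads, so this variant returns 0 for v == 0. */
static const unsigned char kDeBruijnPos[32] = {
    0,  1,  28, 2,  29, 14, 24, 3,  30, 22, 20, 15, 25, 17, 4,  8,
    31, 27, 13, 23, 21, 19, 16, 7,  26, 12, 18, 6,  11, 5,  10, 9};

/* Table-based count of trailing zeros for a 32-bit value. */
unsigned table_ctz32(uint32_t v) {
  uint32_t lowest = v & (0u - v);                /* isolate the lowest set bit */
  uint32_t index = (lowest * 0x077CB531u) >> 27; /* top 5 bits pick the slot */
  return kDeBruijnPos[index];
}
```

The and/negate, multiply, shift, and table load are the pieces a matcher has to find before it can replace the sequence with a cttz intrinsic.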

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline


craig.topper added reviewers: spatel, lebedev.ri.Nov 12 2021, 11:58 AM

djtodoro marked 7 inline comments as done.Nov 15 2021, 4:02 AM

djtodoro added inline comments.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
371	OK, sure.
451	Agree with this.
483	Yep.
523	Actually, we don't need it.

Refactor && update the tests
Address the comments

Harbormaster completed remote builds in B134216: Diff 387204.Nov 15 2021, 4:04 AM

xbolva00 added inline comments.Nov 15 2021, 4:18 AM

llvm/test/Transforms/AggressiveInstCombine/AARCH64/lower-table-based-ctz-basics.ll
1 ↗	(On Diff #387204)	AggressiveInstCombine/AARCH64 -> AggressiveInstCombine/AArch64

update the test dir name

Harbormaster completed remote builds in B134224: Diff 387215.Nov 15 2021, 4:52 AM

any other comment here? :)

TODO: Get the data on SPEC benchmark.

Did you manage to collect any perf data yet to motivate this change?

The tests already contain a few things that seem unrelated, it would be good to clean those things up.

llvm/test/Transforms/AggressiveInstCombine/AArch64/dereferencing-pointer.ll
28 ↗	(On Diff #387215)	That's out of date.
40 ↗	(On Diff #387215)	Can the tests instead just take `i64 %b`, or does it need to be a pointer?
50 ↗	(On Diff #387215)	Is all the metadata needed? Same for the other tests.
llvm/test/Transforms/AggressiveInstCombine/AArch64/non-argument-value.ll
31 ↗	(On Diff #387215)	not needed

djtodoro edited the summary of this revision.Nov 26 2021, 1:28 AM

djtodoro added a reviewer: fhahn.

In D113291#3153219, @fhahn wrote:

TODO: Get the data on SPEC benchmark.

Did you manage to collect any perf data yet to motivate this change?

Not yet, but I will share ASAP.

Thanks for your comments!

llvm/test/Transforms/AggressiveInstCombine/AArch64/dereferencing-pointer.ll
40 ↗	(On Diff #387215)	This test is meant to test the pointer type. There are other tests checking non-ptr type.
50 ↗	(On Diff #387215)	Yep, I'll reduce the tests in the next update.

-Clean up the tests

Harbormaster completed remote builds in B136159: Diff 389938.Nov 26 2021, 2:32 AM

AggressiveInstCombine is an extension of InstCombine. That is, it's a target-independent canonicalization pass. Therefore, we shouldn't use any target-specific cost model to enable the transform. Since we have a generic intrinsic for cttz, it's fine to create that for all targets as long as we can guarantee that a generic expansion of that intrinsic in the backend will not be worse than the original code.

But I'm not sure if we can make that guarantee? If not, this should be implemented as a late IR or codegen pass (as it was when first posted).

In D113291#3238355, @spatel wrote:

AggressiveInstCombine is an extension of InstCombine. That is, it's a target-independent canonicalization pass. Therefore, we shouldn't use any target-specific cost model to enable the transform. Since we have a generic intrinsic for cttz, it's fine to create that for all targets as long as we can guarantee that a generic expansion of that intrinsic in the backend will not be worse than the original code.

But I'm not sure if we can make that guarantee? If not, this should be implemented as a late IR or codegen pass (as it was when first posted).

Why is it ok to use DataLayout in InstCombine/AggressiveInstCombine, but not TTI? The cttz seems like it could enable other optimizations so I don't think we want it late. In particular, we should give the optimizer a chance to prove that the input isn't 0 to remove the select that gets generated after the cttz intrinsic. That could require computeKnownBits or CorrelatedValuePropagation. LoopIdiomRecognize queries TTI before generating cttz from loops.

In D113291#3238387, @craig.topper wrote:

In D113291#3238355, @spatel wrote:

AggressiveInstCombine is an extension of InstCombine. That is, it's a target-independent canonicalization pass. Therefore, we shouldn't use any target-specific cost model to enable the transform. Since we have a generic intrinsic for cttz, it's fine to create that for all targets as long as we can guarantee that a generic expansion of that intrinsic in the backend will not be worse than the original code.

But I'm not sure if we can make that guarantee? If not, this should be implemented as a late IR or codegen pass (as it was when first posted).

Why is it ok to use DataLayout in InstCombine/AggressiveInstCombine, but not TTI? The cttz seems like it could enable other optimizations so I don't think we want it late. In particular, we should give the optimizer a chance to prove that the input isn't 0 to remove the select that gets generated after the cttz intrinsic. That could require computeKnownBits or CorrelatedValuePropagation. LoopIdiomRecognize queries TTI before generating cttz from loops.

Yeah, it's fuzzy. I think we're only supposed to be using DataLayout to determine where creating an illegal type would obviously lead to worse codegen. But that's not much different than asking if some operation is legal on target X.
We've had several other requests for a cost-aware version of instcombine over the years, so maybe we should just use this as an opportunity to reframe/rename AggressiveInstCombine.

Is the handling of "0" (accessing index 0 for x = 0) not acceptable?

What is the reason for this patch being on hold?

We've had several other requests for a cost-aware version of instcombine over the years, so maybe we should just use this as an opportunity to reframe/rename AggressiveInstCombine.

It could help with doing things like https://reviews.llvm.org/D114964 earlier too, where the transforms are not always profitable and not reversible back to the original code, but would be beneficial to do earlier to get better cost modelling and vectorization.

Hello,

I have a small query regarding this patch. The patch emits following llvm assembly for ctz table -

-----Patch assembly-------
// %bb.0:

rbit w8, w0
cmp w0, #0
clz w8, w8
csel w0, wzr, w8, eq
ret

but in gcc, we have the following assembly being emitted -

------------GCC---------------------
f(unsigned int):

rbit    w0, w0
clz     w0, w0
and     w0, w0, 31
ret

Reference - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

[1] My question is - To solve the bug, do I have to generate assembly similar to GCC?

Please give me suggestions on moving forward in solving this bug.
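
A small C model of why the two sequences can agree. It assumes the table stores 0 for a zero input (which is what the `csel ... wzr` in the patch output suggests) and that `clz` yields 32 for a zero operand on a 32-bit register. The helper names are hypothetical.

```c
#include <stdint.h>

/* Models rbit + clz on a 32-bit register: trailing zero count, 32 for x == 0. */
static unsigned rbit_clz32(uint32_t x) {
  unsigned n = 0;
  if (x == 0)
    return 32;
  while ((x & 1u) == 0) {
    x >>= 1;
    ++n;
  }
  return n;
}

/* Shape of the patch output: compare against zero, then select. */
unsigned ctz_with_select(uint32_t x) { return x == 0 ? 0u : rbit_clz32(x); }

/* Shape of the GCC output: mask with 31, which maps 32 to 0. */
unsigned ctz_with_mask(uint32_t x) { return rbit_clz32(x) & 31u; }
```

Under those assumptions the mask only changes the result for x == 0, where 32 & 31 is 0, so both forms return the same value for every input. If the table stored something other than 0 for a zero input, the mask form would not be a drop-in replacement.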

In D113291#3274616, @gsocshubham wrote:
Hello,

I have a small query regarding this patch. The patch emits following llvm assembly for ctz table -

-----Patch assembly-------
// %bb.0:
rbit w8, w0
cmp w0, #0
clz w8, w8
csel w0, wzr, w8, eq
ret
but in gcc, we have the following assembly being emitted -

------------GCC---------------------
f(unsigned int):
rbit    w0, w0
clz     w0, w0
and     w0, w0, 31
ret
Reference - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

[1] My question is - To solve the bug, do I have to generate assembly similar to GCC?

Please give me suggestions on moving forward in solving this bug.

Hi,

Is this really a bug?

I have a small query regarding this patch. The patch emits following llvm assembly for ctz table -

-----Patch assembly-------
// %bb.0:
rbit w8, w0
cmp w0, #0
clz w8, w8
csel w0, wzr, w8, eq
ret
but in gcc, we have the following assembly being emitted -

------------GCC---------------------
f(unsigned int):
rbit    w0, w0
clz     w0, w0
and     w0, w0, 31
ret

That sounds like a backend optimization that could happen given we know the semantics of the AArch64 instruction.

In D113291#3275222, @dmgreen wrote:
I have a small query regarding this patch. The patch emits following llvm assembly for ctz table -

-----Patch assembly-------
// %bb.0:
rbit w8, w0
cmp w0, #0
clz w8, w8
csel w0, wzr, w8, eq
ret
but in gcc, we have the following assembly being emitted -

------------GCC---------------------
f(unsigned int):
rbit    w0, w0
clz     w0, w0
and     w0, w0, 31
ret
That sounds like a backend optimization that could happen given we know the semantics of the AArch64 instruction.

Yep, +1 for this. I'd leave this job to backends.

@craig.topper @spatel Do you think this can go as is?

(P.S. I don't have access to any AArch64 board currently, so I can't look into the SPEC numbers.)

I am wondering about the general direction.

Is it worth it? Are we on the way to becoming a compiler just for benchmarks?

Spending compile time just to optimize one very, very specific pattern from SPEC is a bad thing, imho.

Can you show us some other real-world “hits”? I assume any sane project already uses a builtin to compute this value efficiently.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
383	getZExtValue may assert large ints

Can you please provide compile time data with @nikic's compile time tracker? This should be *very cheap* to be acceptable.

This revision now requires changes to proceed.Feb 1 2022, 2:17 AM

In D113291#3286784, @xbolva00 wrote:

I am wondering about the general direction.

Is it worth it? Are we on the way to becoming a compiler just for benchmarks?

Spending compile time just to optimize one very, very specific pattern from SPEC is a bad thing, imho.

Can you show us some other real-world “hits”? I assume any sane project already uses a builtin to compute this value efficiently.

The same algorithm is documented here https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightMultLookup

It's also nearby in the code in primesieve from the recent tzcnt discussion. Around line 143 if you expand the context on Erat.hpp here https://github.com/kimwalisch/primesieve/pull/109/files Granted, that code knows when to use the tzcnt builtin instead of the table. I'm only mentioning it to show it is a known way to implement tzcnt that is used in more than just SPEC.

gsocshubham mentioned this in D119010: [AggressiveInstCombine] Recognize table-based ctz implementation and enable it for AARCH64 at -O3.Feb 4 2022, 9:04 AM

In D113291#3286803, @xbolva00 wrote:

Can you please provide compile time data with @nikic's compile time tracker? This should be *very cheap* to be acceptable.


It is done: http://llvm-compile-time-tracker.com/compare.php?from=9920943ea201189f9b34918c7663d8a03d7e4676&to=666dd20021db313e4ead3e39ac4c8a12b9525521&stat=instructions

gsocshubham mentioned this in D120462: [AArch64InstrInfo.td] - Lowering fix for cttz intrinsic.Feb 24 2022, 1:36 AM

rahular-rrlogic added a child revision: D120462: [AArch64InstrInfo.td] - Lowering fix for cttz intrinsic.Apr 14 2022, 12:48 AM

rahular-rrlogic mentioned this in D123782: [AArch64] Generate AND in place of CSEL for Table Based CTTZ lowering in -O3.Apr 14 2022, 4:45 AM

rahular-rrlogic added a child revision: D123782: [AArch64] Generate AND in place of CSEL for Table Based CTTZ lowering in -O3.

@xbolva00 ping :)

Herald added a project: Restricted Project.May 5 2022, 1:58 AM

Herald added a subscriber: StephenFan.

How hard is it to add x86 support?

(Not blocking this)

This revision now requires review to proceed.May 5 2022, 2:04 AM

Thanks!

In D113291#3493276, @xbolva00 wrote:

How hard is it to add x86 support?

Is it even worth implementing it for x86?

In D113291#3495933, @djtodoro wrote:

Thanks!

In D113291#3493276, @xbolva00 wrote:

How hard is it to add x86 support?

Is it even worth implementing it for x86?

Yes. Intel CPUs from about 2013 have a TZCNT instruction.

OK, great! it will be on my TODO list!

But I think that x86 support doesn’t block this.

Ideally this transformation should just emit proper intrinsic without need for target hook.

What is the real problem?

rahular-rrlogic removed a child revision: D123782: [AArch64] Generate AND in place of CSEL for Table Based CTTZ lowering in -O3.May 8 2022, 9:06 PM

dmgreen mentioned this in D125755: [AggressiveInstcombine] Conditionally fold saturated fptosi to llvm.fptosi.sat.May 17 2022, 3:03 AM

Hello - We were having a discussion about a very similar patch in D125755. I think the outcome for this patch is that either:

1. We need to do this later (maybe in CodeGenPrepare).
2. We need to do this unconditionally without the call to TTI.preferCTTZLowering() and have the reverse transform later for targets that do not have a cheaper alternative.
3. We need to argue some more :)

There are more details about why in D125755. I would go for the first option if it doesn't lead to worse performance, as for the second I'm not sure when it would be profitable to transform back and emit the table. You may not want to do that for non-hot ctzs? It sounds like it may be difficult to get right, but maybe I'm overestimating it.

as for the second I'm not sure when it would be profitable to transform back and emit the table

You really just have to weigh it against the current default expansion on targets where ctlz/cttz aren't legal, which is popcount(v & -v). It should be a straightforward comparison, generally. If you have popcount, use it. If multiply is legal, use a table lookup. Otherwise... maybe stick with the popcount expansion? Probably any approach is expensive at that point.

Compare the generated code for arm-eabi.

You may not want to do that for non-hot ctzs?

As opposed to what, calling into compiler-rt?
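
As a rough illustration of the popcount route mentioned above: read literally, `popcount(v & -v)` is 1 for any nonzero v, so this sketch counts the bits below the lowest set bit instead, using the mask `~v & (v - 1)`. I believe that is close to what the generic expansion does, but treat the exact form as an assumption; the helper names are hypothetical.

```c
#include <stdint.h>

/* Portable population count, one iteration per set bit. */
static unsigned popcount32(uint32_t v) {
  unsigned n = 0;
  while (v) {
    v &= v - 1u; /* clear the lowest set bit */
    ++n;
  }
  return n;
}

/* cttz via popcount: ~v & (v - 1) has ones exactly below the lowest set bit. */
unsigned cttz_via_popcount(uint32_t v) {
  return popcount32(~v & (v - 1u));
}
```

For v == 0 the mask is all ones, so this form returns 32, the bit width, rather than whatever a lookup table would store in slot 0.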

In D113291#3539840, @dmgreen wrote:

Hello - We were having a discussion about a very similar patch in D125755. I think the outcome for this patch is that either:

We need to do this later (maybe in CodeGenPrepare).

The initial version of the patch was implemented within CodeGenPrepare, and I think it should not introduce any performance regression.

We need to do this unconditionally without the call to TTI.preferCTTZLowering() and have the reverse transform later for targets that do not have a cheaper alternative.

Hmm... I need to think about that.

We need to argue some more :)

Seems that I need to find some time to catch up on the conversation in D125755. :)

There are more details about why in D125755. I would go for the first option if it doesn't lead to worse performance, as for the second I'm not sure when it would be profitable to transform back and emit the table. You may not want to do that for non-hot ctzs?

I am not sure I get the question.

In D113291#3539965, @efriedma wrote:

as for the second I'm not sure when it would be profitable to transform back and emit the table

You really just have to weigh it against the current default expansion on targets where ctlz/cttz aren't legal, which is popcount(v & -v). It should be a straightforward comparison, generally. If you have popcount, use it. If multiply is legal, use a table lookup. Otherwise... maybe stick with the popcount expansion? Probably any approach is expensive at that point.

Compare the generated code for arm-eabi.

I guess we could measure something like that, but seems to me that it could introduce some performance regressions...

In D113291#3539965, @efriedma wrote:

as for the second I'm not sure when it would be profitable to transform back and emit the table

You really just have to weigh it against the current default expansion on targets where ctlz/cttz aren't legal, which is popcount(v & -v). It should be a straightforward comparison, generally. If you have popcount, use it. If multiply is legal, use a table lookup. Otherwise... maybe stick with the popcount expansion? Probably any approach is expensive at that point.

Compare the generated code for arm-eabi.

You may not want to do that for non-hot ctzs?

As opposed to what, calling into compiler-rt?

I meant that it can be difficult for the compiler to recognize _when_ a ctz is performance critical. If the size of the table is large (which I was possibly over-estimating in my mind), then you may not want to emit the table for every ctz in the program. Currently, the places where this is used have said, by the fact that they wrote it this way, that these ctz's are important. It just depends on whether converting to a table is always better, and if the table is small enough to be reasonable in the common case. 32 x i8 doesn't sound too big compared to what I originally imagined, if that is all it needs.

In D113291#3540303, @dmgreen wrote:

In D113291#3539965, @efriedma wrote:

as for the second I'm not sure when it would be profitable to transform back and emit the table

You really just have to weigh it against the current default expansion on targets where ctlz/cttz aren't legal, which is popcount(v & -v). It should be a straightforward comparison, generally. If you have popcount, use it. If multiply is legal, use a table lookup. Otherwise... maybe stick with the popcount expansion? Probably any approach is expensive at that point.

Compare the generated code for arm-eabi.

You may not want to do that for non-hot ctzs?

As opposed to what, calling into compiler-rt?

I meant that it can be difficult for the compiler to recognize _when_ a ctz is performance critical. If the size of the table is large (which I was possibly over-estimating in my mind), then you may not want to emit the table for every ctz in the program. Currently, the places where this is used have said, by the fact that they wrote it this way, that these ctz's are important. It just depends on whether converting to a table is always better, and if the table is small enough to be reasonable in the common case. 32 x i8 doesn't sound too big compared to what I originally imagined, if that is all it needs.

For the table lookup, is there an algorithm for creating the special constant that is used in the multiply? Or would we just hardcode known constants for common sizes?

In D113291#3539965, @efriedma wrote:

as for the second I'm not sure when it would be profitable to transform back and emit the table

You really just have to weigh it against the current default expansion on targets where ctlz/cttz aren't legal, which is popcount(v & -v). It should be a straightforward comparison, generally. If you have popcount, use it. If multiply is legal, use a table lookup. Otherwise... maybe stick with the popcount expansion? Probably any approach is expensive at that point.

Note, the popcount expansion already uses a multiply without checking if it is legal.

For the table lookup, is there an algorithm for creating the special constant that is used in the multiply? Or would we just hardcode known constants for common sizes?

https://en.wikipedia.org/wiki/De_Bruijn_sequence has a description of the algorithm. Probably we'd just hard-code constants, though; practically speaking, the only sizes we actually care about are 16, 32, and 64. (For anything that doesn't fit in a single register, we probably just want to split the cttz.)

Note, the popcount expansion already uses a multiply without checking if it is legal.

It's a relatively cheap multiply to expand on a target with a shifter, though.
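
To make the "hard-code the constant" option concrete: once a multiplier is fixed, the table can be derived mechanically. The sketch below does that for 32 bits and also rejects multipliers that send two bit positions to the same slot. `build_ctz_table32` is a hypothetical helper for illustration, not an LLVM API.

```c
#include <stdint.h>

/* Fills table[32] for the given multiplier. Returns 1 if every bit position
   lands in its own slot (the constant is usable), otherwise 0. */
int build_ctz_table32(uint32_t magic, unsigned char table[32]) {
  unsigned char used[32] = {0};
  for (unsigned i = 0; i < 32; ++i) {
    /* (1u << i) * magic is the same as magic << i; keep the top 5 bits. */
    uint32_t index = (uint32_t)(magic << i) >> 27;
    if (used[index])
      return 0; /* two inputs collide: not a valid constant */
    used[index] = 1;
    table[index] = (unsigned char)i;
  }
  return 1;
}
```

With the commonly used 32-bit constant 0x077CB531 this fills all 32 slots; a 64-bit version would be the same loop with a 6-bit index and a shift of 58.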

Thanks a lot for the comments! Can someone please sum up the things that need to be done for this?

Implementing this in CodeGenPrepare doesn't imply that we should get rid of the call to TTI.preferCTTZLowering().
If we decide to go with the unconditional cttz lowering (not caring about the target) and get rid of the call to TTI.preferCTTZLowering(), what should be done?

Concretely, my preferred solution looks something like:

1. Perform the transform unconditionally in AggressiveInstCombine (so this patch without the preferCTTZLowering() bits).
2. Teach TargetLowering::expandCTTZ to emit a table lookup.

drop target dependent hooks

In D113291#3542941, @efriedma wrote:

Concretely, my preferred solution looks something like:

Perform the transform unconditionally in AggressiveInstCombine (so this patch without the preferCTTZLowering() bits).

The latest update implements this.

Teach TargetLowering::expandCTTZ to emit a table lookup.

Harbormaster completed remote builds in B168559: Diff 435145.Jun 8 2022, 7:51 AM

In D113291#3566581, @djtodoro wrote:

In D113291#3542941, @efriedma wrote:

Concretely, my preferred solution looks something like:

Perform the transform unconditionally in AggressiveInstCombine (so this patch without the preferCTTZLowering() bits).

The latest update implements this.

Teach TargetLowering::expandCTTZ to emit a table lookup.

@djtodoro - Will you be sending patch for (2) "Teach TargetLowering::expandCTTZ to emit a table lookup."?

In D113291#3569114, @gsocshubham wrote:

In D113291#3566581, @djtodoro wrote:

In D113291#3542941, @efriedma wrote:

Concretely, my preferred solution looks something like:

Perform the transform unconditionally in AggressiveInstCombine (so this patch without the preferCTTZLowering() bits).

The latest update implements this.

Teach TargetLowering::expandCTTZ to emit a table lookup.

@djtodoro - Will you be sending patch for (2) "Teach TargetLowering::expandCTTZ to emit a table lookup."?

Unfortunately, I don’t have time to do it right now. If you are interested, please go ahead with the implementation.

In D113291#3569225, @djtodoro wrote:

In D113291#3569114, @gsocshubham wrote:

In D113291#3566581, @djtodoro wrote:

In D113291#3542941, @efriedma wrote:

Concretely, my preferred solution looks something like:

Perform the transform unconditionally in AggressiveInstCombine (so this patch without the preferCTTZLowering() bits).

The latest update implements this.

Teach TargetLowering::expandCTTZ to emit a table lookup.

@djtodoro - Will you be sending patch for (2) "Teach TargetLowering::expandCTTZ to emit a table lookup."?

Unfortunately, I don’t have time to do it right now. If you are interested, please go ahead with the implementation.

@djtodoro - Sure. I am interested in doing it. Can you elaborate on (2) in a bit more detail?

In D113291#3569456, @gsocshubham wrote:

In D113291#3569225, @djtodoro wrote:

In D113291#3569114, @gsocshubham wrote:

In D113291#3566581, @djtodoro wrote:

In D113291#3542941, @efriedma wrote:

Concretely, my preferred solution looks something like:

Perform the transform unconditionally in AggressiveInstCombine (so this patch without the preferCTTZLowering() bits).

The latest update implements this.

Teach TargetLowering::expandCTTZ to emit a table lookup.

@djtodoro - Will you be sending patch for (2) "Teach TargetLowering::expandCTTZ to emit a table lookup."?

Unfortunately, I don’t have time to do it right now. If you are interested, please go ahead with the implementation.

@djtodoro - Sure. I am interested in doing it. Can you elaborate on (2) in a bit more detail?

Basically, this patch implements lowering of the table-based cttz implementation into @llvm.cttz unconditionally. For some targets it won't be that beneficial, so during TargetLowering::expandCTTZ we should emit a table lookup again. @efriedma may have something to add.

I don't think I have much further to say. Emitting a table lookup from TargetLowering::expandCTTZ should be straightforward, I think. See DAGCombiner::convertSelectOfFPConstantsToLoadOffset for an example of how to emit a constant table.

In D113291#3602808, @efriedma wrote:

I don't think I have much further to say. Emitting a table lookup from TargetLowering::expandCTTZ should be straightforward, I think. See DAGCombiner::convertSelectOfFPConstantsToLoadOffset for an example of how to emit a constant table.

Do we need to be careful with vector CTTZ which can also go through expand CTTZ?

gsocshubham mentioned this in D128911: Emit table lookup from TargetLowering::expandCTTZ().Jun 30 2022, 7:17 AM

gsocshubham edited child revisions, added: D128911: Emit table lookup from TargetLowering::expandCTTZ(); removed: D120462: [AArch64InstrInfo.td] - Lowering fix for cttz intrinsic.Jun 30 2022, 7:20 AM

Is this patch ready to be merged? This is the parent patch of https://reviews.llvm.org/D128911. Does the child get merged before the parent?

Anyways, https://reviews.llvm.org/D128911 can be merged independently of this patch.

In D113291#3670999, @gsocshubham wrote:

Is this patch ready to be merged? This is the parent patch of https://reviews.llvm.org/D128911. Does the child get merged before the parent?

Anyways, https://reviews.llvm.org/D128911 can be merged independently of this patch.

D128911 needs to go in first, once that is done we can move forward with this one. It could do with a rebase and a clang-format.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
444	We will need to support opaque pointers nowadays.
483	It's probably better to just say "64bit targets" as opposed to a specific target.
494	I believe it's the top `Bitwidth - Log2(Bitwidth)` bits.

dmgreen added inline comments.Jul 23 2022, 4:47 AM

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
486	Does the extend between the lshr and the mul ever happen? From what I can tell, the type of the VT should be based on the type of these operations.
690	The TTI's can be removed now (and if you rebase they may already be present, but still are not needed by the new code any more).
llvm/test/Transforms/AggressiveInstCombine/AArch64/lower-table-based-ctz-basics.ll
1 ↗	(On Diff #435145)	The tests can be moved out of the AArch64 directory, so long as they drop the AArch64 triple.

gsocshubham removed a child revision: D128911: Emit table lookup from TargetLowering::expandCTTZ().Aug 4 2022, 5:07 AM

dmgreen mentioned this in rGab4fc87a9d96: [DAG] Emit table lookup from TargetLowering::expandCTTZ().Aug 8 2022, 4:08 AM

@djtodoro Do you have any time to update this? Otherwise, do you mind if we take it over so we can update it and get it reviewed? Thanks.

I promise I will find some time to update this - it is coming next week.

support opaque pointers
remove leftovers (since this was aarch64 only)
move the tests in a non-target dir

Harbormaster completed remote builds in B181687: Diff 453206.Aug 17 2022, 12:20 AM

djtodoro marked 3 inline comments as done.Aug 17 2022, 1:08 AM

djtodoro added inline comments.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
486	It does not happen in all the cases.

remove a duplicated test

Harbormaster completed remote builds in B181697: Diff 453228.Aug 17 2022, 2:10 AM

Why are some test files still specifying a triple in the RUN line?

It would be good to consolidate tests into fewer files if possible with better names/comments to explain exactly what differences are being tested in the sequence of tests. There should also be negative tests (wrong table constants, wrong magic multiplier, wrong shift amount, etc), so we know that the transform is not firing on mismatches.

In D113291#3728796, @spatel wrote:

Why are some test files still specifying a triple in the RUN line?

Leftovers. Thanks.

It would be good to consolidate tests into fewer files if possible with better names/comments to explain exactly what differences are being tested in the sequence of tests. There should also be negative tests (wrong table constants, wrong magic multiplier, wrong shift amount, etc), so we know that the transform is not firing on mismatches.

I will remove one redundant test. There are C producers as well as some top-level comments that explain what it should test (if the comment is needed). Furthermore, thanks for the suggestion for adding some negative tests - will add it.

adding negative tests
rename the tests
clean the target triple leftovers from tests

Harbormaster completed remote builds in B181937: Diff 453574.Aug 18 2022, 2:39 AM

dmgreen added inline comments.Aug 18 2022, 4:24 AM

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
486	Do you have a test case where the extend is between the shift and the mul?
493	If the length of the table is larger than the InputBits, how are we sure that the matched elements will be the correct ones? Is this always guaranteed? I think I would have expected a check that `for each i=0..InputBits-1, Table[(Mul<<i)>>Shift] == i`. With a check that the index is in range. Are they always equivalent with the larger tables too?
552	Can this always just use the Load type?
564	I think User here could be GlobalVariable
568	GEPUser->getOperand(0) -> Global->getInitializer(). It is worth adding a test where the global is extern.
590	It might be better to switch this logic around: `unsigned InputBits = X1->getType()->getScalarSizeInBits(); if (InputBits != 32 && InputBits != 64) return false;`
593–594	Log2_32_Ceil -> Log2_Ceil if we know the InputBits is a power of 2. The -1 case is for a larger table with more elements but that can handle zero values?
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-non-argument-value.ll
79	It is better to pass x as a parameter, although I'm not sure it matters much where x comes from for the rest of the pattern.
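
The per-bit check suggested in the comment on line 493 can be sketched in C as follows. This is only an illustration of the property `Table[(Mul<<i)>>Shift] == i` with a range check on the index, written for 32-bit inputs; `table_matches_cttz32` is a hypothetical name and the patch's actual matching code may be organized differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if table[((1 << i) * mul) >> shift] == i for every bit i of a
   32-bit input and every index stays inside the table; otherwise 0. */
int table_matches_cttz32(const short *table, size_t table_len, uint32_t mul,
                         unsigned shift) {
  if (shift >= 32)
    return 0; /* shifting a 32-bit value by 32 or more is not meaningful */
  for (unsigned i = 0; i < 32; ++i) {
    uint32_t index = (uint32_t)(mul << i) >> shift; /* same as (1u << i) * mul */
    if (index >= table_len)
      return 0; /* lookup would run off the end of the table */
    if (table[index] != (short)i)
      return 0; /* wrong entry for this bit position */
  }
  return 1;
}
```

A check of this shape ignores slots that no single-bit input can reach, which is why it also accepts larger tables that contain filler entries.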

spatel added inline comments.Aug 18 2022, 7:25 AM

llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-basics.ll
124	This is identical to the previous test, so it is not adding value here. I realize that the source example it is intended to model is slightly different. If you want to verify that we end up with cttz from the IR produced by clang, then I'd recommend adding a file to test/Transforms/PhaseOrdering and "RUN -O3". When I create tests like that, I grab the unoptimized IR using "clang -S -o - -emit-llvm -Xclang -disable-llvm-optzns". Then reduce it by running it through "opt -mem2reg", so it's not completely full of junk IR.
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-non-argument-value.ll
79	Right - as far as this patch is concerned, this is identical to the previous test, so it shouldn't be here. See my earlier comment about PhaseOrdering tests if we want more end-to-end coverage for `opt -O3`.

Thanks for the comments.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
486	I was completely sure that I had a case for it, but I am not able to produce it actually -- so I deleted it for now.
493	Hmmm, can you please walk me through an example?
552	I think yes, nothing comes to mind that can break it.
590	Sounds good.
593–594	Log2_32_Ceil -> Log2_Ceil if we know the InputBits is a power of 2.

Right... But you meant `Log2_64()`, right? It is a power of 2, since it is either 32 or 64, so no need to add any assert here.

The -1 case is for a larger table with more elements but that can handle zero values?

int ctz2(unsigned x)
{
#define u 0
  static short table[64] =
  {
    32, 0, 1, 12, 2, 6, u, 13, 3, u, 7, u, u, u, u, 14,
    10, 4, u, u, 8, u, u, 25, u, u, u, u, u, 21, 27, 15,
    31, 11, 5, u, u, u, u, u, 9, u, u, 24, u, u, 20, 26,
    30, u, u, u, u, 23, u, 19, 29, u, 22, 18, 28, 17, 16, u
  };
  x = (x & -x) * 0x0450FBAF;
  return table[x >> 26];
}
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-basics.ll
124	This is identical to the previous test, so it is not adding value here.

Right, thanks!

I realize that the source example it is intended to model is slightly different. If you want to verify that we end up with cttz from the IR produced by clang, then I'd recommend adding a file to test/Transforms/PhaseOrdering and "RUN -O3". When I create tests like that, I grab the unoptimized IR using "clang -S -o - -emit-llvm -Xclang -disable-llvm-optzns". Then reduce it by running it through "opt -mem2reg", so it's not completely full of junk IR.

I usually do create tests that way, but these may be a bit stale. I will add the `PhaseOrdering` test, thanks for the suggestion.
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-non-argument-value.ll
79	Yes, agree.
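
The ctz2 example quoted in the comment on lines 593–594 can be checked outside of LLVM. The following is a minimal sketch, not part of the patch: `kCtz2Table`, `naiveCttz`, and `agreesOnSingleBits` are hypothetical names introduced here, and `naiveCttz` is assumed to return 32 for a zero input. It compares the table lookup against a bit-by-bit count for every single-bit input and for zero.

```cpp
#include <cassert>
#include <cstdint>

// Table copied from the ctz2 example; `u` marks slots that are never indexed.
constexpr short u = 0;
constexpr short kCtz2Table[64] = {
    32, 0,  1, 12, 2, 6,  u, 13, 3,  u, 7,  u,  u,  u,  u,  14,
    10, 4,  u, u,  8, u,  u, 25, u,  u, u,  u,  u,  21, 27, 15,
    31, 11, 5, u,  u, u,  u, u,  9,  u, u,  24, u,  u,  20, 26,
    30, u,  u, u,  u, 23, u, 19, 29, u, 22, 18, 28, 17, 16, u};

// Table-based count of trailing zeros, following ctz2: isolate the lowest set
// bit, multiply by the constant, and index with the top six bits.
constexpr int ctz2(std::uint32_t x) {
  x = (x & (0u - x)) * 0x0450FBAFu;
  return kCtz2Table[x >> 26];
}

// Hypothetical reference: counts trailing zeros one bit at a time and
// returns the bit width (32) for a zero input.
constexpr int naiveCttz(std::uint32_t x) {
  if (x == 0)
    return 32;
  int n = 0;
  while ((x & 1u) == 0) {
    x >>= 1;
    ++n;
  }
  return n;
}

// True if the table lookup agrees with the reference for 0 and every 1 << i.
constexpr bool agreesOnSingleBits() {
  for (int i = 0; i < 32; ++i)
    if (ctz2(std::uint32_t{1} << i) != naiveCttz(std::uint32_t{1} << i))
      return false;
  return ctz2(0) == naiveCttz(0);
}

static_assert(agreesOnSingleBits(), "ctz2 table disagrees with the reference");
```

Because `table[0]` is 32, the lookup for a zero input lands on the bit width, which is why the larger 64-entry table can handle zero without a separate check.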

removed more duplicated tests
added a llvm/test/Transforms/PhaseOrdering test
refactor the code a bit

Harbormaster completed remote builds in B182019: Diff 453684.Aug 18 2022, 10:10 AM

dmgreen added inline comments.Aug 19 2022, 2:01 AM

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
493	Hmm. No, I'm not sure I can. I was thinking about the ctz2 case, and whether there could be cases where the u's have different values that make them "match", but the values are different that make them wrong. So the items in the table accessed by the DeBruijn constant would produce incorrect values, but there are still InputBits number of matches.
	;; int ctz2(unsigned x)
	;; {
	;; #define u 0
	;; static short table[64] =
	;; {
	;; 32, 0, 1, 12, 2, 6, u, 13, 3, u, 7, u, u, u, u, 14,
	;; 10, 4, u, u, 8, u, u, 25, u, u, u, u, u, 21, 27, 15,
	;; 31, 11, 5, u, u, u, u, u, 9, u, u, 24, u, u, 20, 26,
	;; 30, u, u, u, u, 23, u, 19, 29, u, 22, 18, 28, 17, 16, u
	;; };
	;; x = (x & -x) * 0x0450FBAF;
	;; return table[x >> 26];
	;; }
	But I don't think that is something that can come up. I was finding it hard to prove, but if the Mul is InputBits in length there are only at most InputBits separate elements that it can access. And multiple elements cannot map successfully back to the same i. I ran a sat solver overnight, and it is still going but hasn't found any counter examples, which is a good sign. (It is able to find valid DeBruijn CTTZ tables given the chance). It might be worth adding a comment explaining why this correctly matches the table in all cases.
547	One thing I forgot to mention - llvm has a code style that is best explained as "just run clang-format on the patch". These returns are all in the wrong place, for example, and could do with a cleanup.
593–594	Ah, yeah - I meant Log2_32, but deleted the wrong part of the function name.
595–596	This is true by definition now.
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-dereferencing-pointer.ll
24	I usually remove dso_local
llvm/test/Transforms/PhaseOrdering/lower-table-based-cttz.ll
23	Can remove the `; Function Attrs`
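
The question raised on line 493 is whether counting matches over the whole table is as strict as checking, for each bit position i, that the entry selected by `(Mul << i) >> Shift` equals i. The sketch below models both checks in plain C++ without LLVM types. All names (`ctz2Table`, `indexFor`, `countingCheck`, `directCheck`, `brokenCtz2Table`) are hypothetical, and `countingCheck` is only an approximation of the patch's `isCTTZTable`. It runs both checks on the ctz2 table with different filler values for the unused slots.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ctz2 table from the review comment; unused slots are filled with `u`.
std::vector<std::uint64_t> ctz2Table(std::uint64_t u) {
  return {32, 0,  1, 12, 2, 6,  u, 13, 3,  u, 7,  u,  u,  u,  u,  14,
          10, 4,  u, u,  8, u,  u, 25, u,  u, u,  u,  u,  21, 27, 15,
          31, 11, 5, u,  u, u,  u, u,  9,  u, u,  24, u,  u,  20, 26,
          30, u,  u, u,  u, 23, u, 19, 29, u, 22, 18, 28, 17, 16, u};
}

// Same table, but the slot that should hold 0 is corrupted.
std::vector<std::uint64_t> brokenCtz2Table() {
  std::vector<std::uint64_t> t = ctz2Table(0);
  t[1] = 3;
  return t;
}

// Index produced by the multiply-and-shift sequence for the input 1 << i,
// computed in `bits`-wide arithmetic (bits is 32 or 64, i < bits).
std::uint64_t indexFor(std::uint64_t mul, unsigned i, unsigned shift,
                       unsigned bits) {
  std::uint64_t product = mul << i;
  if (bits < 64)
    product &= (std::uint64_t{1} << bits) - 1;
  return product >> shift;
}

// Counting check modelled on isCTTZTable: walk every entry and count those
// that sit at the index their own value would select.
bool countingCheck(const std::vector<std::uint64_t> &table, std::uint64_t mul,
                   unsigned shift, unsigned bits) {
  if (table.size() < bits || table.size() > 2 * bits)
    return false;
  unsigned matched = 0;
  for (std::size_t i = 0; i < table.size(); ++i) {
    std::uint64_t element = table[i];
    if (element >= bits)
      continue;
    if (indexFor(mul, static_cast<unsigned>(element), shift, bits) == i)
      ++matched;
  }
  return matched == bits;
}

// Per-position check as described in the review: Table[(Mul << i) >> Shift]
// must equal i for every i, and the index must be in range.
bool directCheck(const std::vector<std::uint64_t> &table, std::uint64_t mul,
                 unsigned shift, unsigned bits) {
  for (unsigned i = 0; i < bits; ++i) {
    std::uint64_t idx = indexFor(mul, i, shift, bits);
    if (idx >= table.size() || table[idx] != i)
      return false;
  }
  return true;
}
```

On these inputs the two checks agree whatever value fills the unused slots, which is consistent with the argument in the comment: a filler entry only counts as a match if it sits at the one index its value selects, so two different entries cannot both count for the same i.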

djtodoro marked an inline comment as done.Aug 21 2022, 3:06 AM

djtodoro added inline comments.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
493	Yeah, good sign. I will try to make a reasonable comment. Thanks.
547	I've changed the style to `Google`, accidentally. Thanks.
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-dereferencing-pointer.ll
24	me as well, leftover :/ thanks!

clang-format
clean up tests

Harbormaster completed remote builds in B182443: Diff 454291.Aug 21 2022, 5:01 AM

Thanks for the updates. I don't think I have anything other than what is below. Any other comments from anyone else?

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
493	A comment explaining this function would still be useful.
552	This can be changed to just the Load type then.
570–573	if (!GVTable || !GVTable->hasInitializer()) return false;
606–608	Remove this check, as it is always true as far as I can tell.

spatel added inline comments.Aug 24 2022, 1:28 PM

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
476–477	There was a request to put a comment on this function, and I'll second that request. It's not clear why we are counting matches rather than just bailing out on the first mismatch. I think that's because you can construct/recognize a table with unaccessed/undefined elements?

djtodoro marked 4 inline comments as done.Aug 27 2022, 6:31 AM

djtodoro added inline comments.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
476–477	Yes, that is the reason - we are iterating over the elements of the table, so there could be mismatches that we can ignore. A comment is coming.

addressing comments

Harbormaster completed remote builds in B183750: Diff 456115.Aug 27 2022, 7:51 AM

LGTM

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
584–586	Could shorten this by using something like: if (!match(GEP->idx_begin()->get(), m_ZeroInt())) return false;

This revision is now accepted and ready to land.Aug 27 2022, 11:32 AM

craig.topper added inline comments.Aug 27 2022, 11:52 AM

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
604	If we’re only handling 32 and 64, this comment should be 5..6

craig.topper added inline comments.Aug 27 2022, 12:11 PM

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
486	I think this can be Mask = APInt::getBitsSetFrom(InputBits , Shift)

Thanks for the updates. LGTM

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
604	I believe it is 7 because the table can be twice the size. Hence the -1 in the formula below. See the ctz2 test.
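
The "5..7" range discussed on line 604 follows from the two accepted shift amounts per input width. As a rough illustration, with hypothetical helper names (`log2Floor` stands in for LLVM's `Log2_32`; `isAcceptedShift` and `indexBits` are not in the patch), the check can be modelled like this:

```cpp
#include <cassert>

// floor(log2(v)) for v > 0; a stand-in for LLVM's Log2_32.
constexpr unsigned log2Floor(unsigned v) {
  unsigned r = 0;
  while (v >>= 1)
    ++r;
  return r;
}

// Mirrors the shift check in the patch: the shift must leave either
// log2(InputBits) index bits or one more (a table of twice the size).
constexpr bool isAcceptedShift(unsigned inputBits, unsigned shift) {
  if (inputBits != 32 && inputBits != 64)
    return false;
  unsigned exact = inputBits - log2Floor(inputBits);
  return shift == exact || shift == exact - 1;
}

// Number of top bits that remain after the right shift.
constexpr unsigned indexBits(unsigned inputBits, unsigned shift) {
  return inputBits - shift;
}

static_assert(isAcceptedShift(32, 27) && indexBits(32, 27) == 5, "32-bit, 32 entries");
static_assert(isAcceptedShift(32, 26) && indexBits(32, 26) == 6, "32-bit, 64 entries");
static_assert(isAcceptedShift(64, 58) && indexBits(64, 58) == 6, "64-bit, 64 entries");
static_assert(isAcceptedShift(64, 57) && indexBits(64, 57) == 7, "64-bit, 128 entries");
```

Under this reading, 7 index bits only arise for a 64-bit input with the doubled table; the ctz2 test covers the 32-bit doubled case with 6 index bits.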

Thanks for your comments.

addressing comments

Harbormaster completed remote builds in B183797: Diff 456178.Aug 28 2022, 2:27 AM

craig.topper added inline comments.Aug 28 2022, 11:35 AM

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
485	extra space before comma. Looks like I mistyped it in my comment. Sorry.

djtodoro updated this revision to Diff 456267.Aug 29 2022, 12:20 AM

Harbormaster completed remote builds in B183865: Diff 456267.Aug 29 2022, 1:33 AM

Closed by commit rGfec01ee3f524: [AggressiveInstCombine] Lower Table Based CTTZ (authored by djtodoro).Sep 2 2022, 8:28 AM

This revision was automatically updated to reflect the committed changes.

djtodoro added a commit: rGfec01ee3f524: [AggressiveInstCombine] Lower Table Based CTTZ.

Hi, I'm seeing a heap-use-after-free in AggressiveInstCombine in a build shortly after this landed, within a function added here.

test/Transforms/AggressiveInstCombine/X86/sqrt.ll fails as follows under asan: https://gist.github.com/zygoloid/a270e65d32ab5b05504b3b0d5717f83b

Please can you fix or revert?

rsmith added a reverting change: rG053841c5624c: Revert "[AggressiveInstCombine] Lower Table Based CTTZ".Sep 2 2022, 4:19 PM

In D113291#3768076, @rsmith wrote:

test/Transforms/AggressiveInstCombine/X86/sqrt.ll fails as follows under asan: https://gist.github.com/zygoloid/a270e65d32ab5b05504b3b0d5717f83b

I've gone ahead and reverted for now to unblock things. Let me know if you're not able to reproduce the ASan failure and I can dig more into it.

Thanks a lot. I will check ASAP.

@rsmith recommitted with f879939157. Thanks!

In D113291#3777119, @djtodoro wrote:

@rsmith recommitted with f879939157. Thanks!

What was the bug, how was it fixed, and is there a new test to verify the fix? That should have been mentioned in the new commit message.

In D113291#3777195, @spatel wrote:

In D113291#3777119, @djtodoro wrote:

@rsmith recommitted with f879939157. Thanks!

What was the bug, how was it fixed, and is there a new test to verify the fix? That should have been mentioned in the new commit message.

You are right. I reverted the recommit, and I will recommit it with proper message, sorry I missed it. :/

Actually, the issue was that your patch D129167 introduced eraseFromParent and the tryToRecognizeTableBasedCttz would try to use the instruction (dyn_cast) after free. I just moved the tryToRecognizeTableBasedCttz above foldSqrt. I guess it does not need any additional test case.

In D113291#3777356, @djtodoro wrote:

Actually, the issue was that your patch D129167 introduced eraseFromParent and the tryToRecognizeTableBasedCttz would try to use the instruction (dyn_cast) after free. I just moved the tryToRecognizeTableBasedCttz above foldSqrt. I guess it does not need any additional test case.

Ah, I see. Please put a comment on that call then - foldSqrt (or any other erasing transform) needs to be accounted for (last in the loop for now), or we might hit that bug. And yes, looks like we don't need another test since the existing regression test was flagged by the asan bot.
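
The ordering constraint described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyInst`, `toyFoldSqrt`, `toyRecognizeCttz`, and the other names are hypothetical, and erasure is modelled with a flag instead of actually freeing memory so that the example stays well-defined. It shows that a fold running after an erasing fold ends up inspecting an erased instruction, while the reverse order does not.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for an IR instruction. `erased` models eraseFromParent().
struct ToyInst {
  bool isSqrtCall = false;
  bool erased = false;
};

// A fold returns true if it changed something; it records in
// `inspectedErased` whether it was handed an already-erased instruction.
using ToyFold = bool (*)(ToyInst &inst, bool &inspectedErased);

// Models foldSqrt: replaces a sqrt call and erases the original.
bool toyFoldSqrt(ToyInst &inst, bool &inspectedErased) {
  if (inst.erased) {
    inspectedErased = true;
    return false;
  }
  if (!inst.isSqrtCall)
    return false;
  inst.erased = true;
  return true;
}

// Models tryToRecognizeTableBasedCttz: it only inspects the instruction
// (the dyn_cast in the real code); a sqrt call never matches.
bool toyRecognizeCttz(ToyInst &inst, bool &inspectedErased) {
  if (inst.erased) {
    inspectedErased = true;
    return false;
  }
  return false;
}

// Runs the folds in order on one instruction and reports whether any fold
// looked at it after it had been erased.
bool inspectsErasedInst(const std::vector<ToyFold> &folds, ToyInst inst) {
  bool inspectedErased = false;
  for (ToyFold fold : folds)
    fold(inst, inspectedErased);
  return inspectedErased;
}

ToyInst makeSqrtCall() {
  ToyInst inst;
  inst.isSqrtCall = true;
  return inst;
}

// The order that triggered the ASan report, and the order after the fix.
std::vector<ToyFold> sqrtFirst() { return {toyFoldSqrt, toyRecognizeCttz}; }
std::vector<ToyFold> cttzFirst() { return {toyRecognizeCttz, toyFoldSqrt}; }
```

With `sqrtFirst()` the second fold sees an erased instruction, which corresponds to the use-after-free; with `cttzFirst()` the erasing fold runs last and nothing touches the instruction afterwards.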

djtodoro mentioned this in rGdf868edee561: "Recommit "[AggressiveInstCombine] Lower Table Based CTTZ"".Sep 9 2022, 1:30 AM

libin049 added a subscriber: libin049.Mar 30 2023, 11:56 PM

Revision Contents

Path	Size
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp	163 lines
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-basics.ll	257 lines
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-dereferencing-pointer.ll	42 lines
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-non-argument-value.ll	45 lines
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-zero-element.ll	26 lines
llvm/test/Transforms/AggressiveInstCombine/negative-lower-table-based-cttz.ll	112 lines
llvm/test/Transforms/PhaseOrdering/lower-table-based-cttz.ll	37 lines

Diff 457606

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp

[... 362 lines not shown ...]	if (match(MulOp0, m_And(m_c_Add(m_LShr(m_Value(ShiftOp0), m_SpecificInt(4)),
      }
    }
  }

  return false;
}

/// Fold smin(smax(fptosi(x), C1), C2) to llvm.fptosi.sat(x), providing C1 and
/// C2 saturate the value of the fp conversion. The transform is not reversable
		craig.topper (Not Done): Can we pass the Width in and not make this a template function? Maybe making use of APInt to manage the width if needed?
		djtodoro (Author, Done): OK, sure.
/// as the fptosi.sat is more defined than the input - all values produce a
/// valid value for the fptosi.sat, where as some produce poison for original
/// that were out of range of the integer conversion. The reversed pattern may
/// use fmax and fmin instead. As we cannot directly reverse the transform, and
		craig.topper (Done): 8 should be `CHAR_BIT`, but if we can do this without using a template function I'd rather do that.
/// it is not always profitable, we make it conditional on the cost being
/// reported as lower by TTI.
static bool tryToFPToSat(Instruction &I, TargetTransformInfo &TTI) {
  // Look for min(max(fptosi, converting to fptosi_sat.
  Value *In;
  const APInt *MinC, *MaxC;
  if (!match(&I, m_SMax(m_OneUse(m_SMin(m_OneUse(m_FPToSI(m_Value(In))),
                                        m_APInt(MinC))),
		xbolva00 (Not Done): getZExtValue may assert large ints
                        m_APInt(MaxC))) &&
      !match(&I, m_SMin(m_OneUse(m_SMax(m_OneUse(m_FPToSI(m_Value(In))),
                                        m_APInt(MaxC))),
                        m_APInt(MinC))))
    return false;

  // Check that the constants clamp a saturate.
  if (!(*MinC + 1).isPowerOf2() || -*MaxC != *MinC + 1)
[... 44 lines not shown ...]	foldSqrt(Instruction &I, TargetTransformInfo &TTI, TargetLibraryInfo &TLI) {
  // Match a call to sqrt mathlib function.
  auto *Call = dyn_cast<CallInst>(&I);
  if (!Call)
    return false;

  Module *M = Call->getModule();
  LibFunc Func;
  if (!TLI.getLibFunc(*Call, Func) || !isLibFuncEmittable(M, &TLI, Func))
    return false;
		dmgreen (Done): We will need to support opaque pointers nowadays.

  if (Func != LibFunc_sqrt && Func != LibFunc_sqrtf && Func != LibFunc_sqrtl)
    return false;

  // If (1) this is a sqrt libcall, (2) we can assume that NAN is not created
  // (because NNAN or the operand arg must not be less than -0.0) and (2) we
  // would not end up lowering to a libcall anyway (which could change the value
		craig.topper (Not Done): The _or_null here feels overly paranoid. A load won't ever have a null pointer operand will it? So dyn_cast should be ok. Same with the other dyn_cast_or_null. I don't think any of them should ever be null to start.
		djtodoro (Author, Done): Agree with this.
  // of errno), then:
  // (1) errno won't be set.
  // (2) it is safe to convert this to an intrinsic call.
  Type *Ty = Call->getType();
  Value *Arg = Call->getArgOperand(0);
  if (TTI.haveFastSqrt(Ty) &&
      (Call->hasNoNaNs() || CannotBeOrderedLessThanZero(Arg, &TLI))) {
    IRBuilder<> Builder(&I);
    IRBuilderBase::FastMathFlagGuard Guard(Builder);
    Builder.setFastMathFlags(Call->getFastMathFlags());

    Function *Sqrt = Intrinsic::getDeclaration(M, Intrinsic::sqrt, Ty);
    Value *NewSqrt = Builder.CreateCall(Sqrt, Arg, "sqrt");
    I.replaceAllUsesWith(NewSqrt);

    // Explicitly erase the old call because a call with side effects is not
    // trivially dead.
    I.eraseFromParent();
    return true;
  }

  return false;
}

		// Check if this array of constants represents a cttz table.
		// Iterate over the elements from \p Table by trying to find/match all
		spatel (Not Done): There was a request to put a comment on this function, and I'll second that request. It's not clear why we are counting matches rather than just bailing out on the first mismatch. I think that's because you can construct/recognize a table with unaccessed/undefined elements?
		djtodoro (Author, Done): Yes, that is the reason - we are iterating over the elements of the table, so there could be mismatches that we can ignore. A comment is coming.
		// the numbers from 0 to \p InputBits that should represent cttz results.
		static bool isCTTZTable(const ConstantDataArray &Table, uint64_t Mul,
		                        uint64_t Shift, uint64_t InputBits) {
		  unsigned Length = Table.getNumElements();
		  if (Length < InputBits || Length > InputBits * 2)
		    return false;
		craig.topper (Not Done): Isn't it usually spelled AArch64 with 2 capital As?
		djtodoro (Author, Done): Yep.
		dmgreen (Not Done): It's probably better to just say "64bit targets" as opposed to a specific target.

		  APInt Mask = APInt::getBitsSetFrom(InputBits, Shift);
		craig.topper (Not Done): extra space before comma. Looks like I mistyped it in my comment. Sorry.
		  unsigned Matched = 0;
		dmgreen (Not Done): Does the extend between the lshr and the mul ever happen? From what I can tell, the type of the VT should be based on the type of these operations.
		djtodoro (Author, Done): It does not happen in all the cases.
		dmgreen (Not Done): Do you have a test case where the extend is between the shift and the mul?
		djtodoro (Author, Done): I was completely sure that I had a case for it, but I am not able to produce it actually -- so I deleted it for now.
		craig.topper (Done): I think this can be Mask = APInt::getBitsSetFrom(InputBits , Shift)

		  for (unsigned i = 0; i < Length; i++) {
		craig.topper (Not Done): I think you can use m_Deferred(X1) in place of m_Value(X2), but @lebedev.ri or @spatel would know better.
		    uint64_t Element = Table.getElementAsInteger(i);
		    if (Element >= InputBits)
		      continue;

		    // Check if \p Element matches a concrete answer. It could fail for some
		dmgreen (Not Done): If the length of the table is larger than the InputBits, how are we sure that the matched elements will be the correct ones? Is this always guaranteed? I think I would have expected a check that `for each i=0..InputBits-1, Table[(Mul<<i)>>Shift] == i`. With a check that the index is in range. Are they always equivalent with the larger tables too?
		djtodoro (Author, Done): Hmmm, can you please walk me through an example?
		dmgreen (Not Done): Hmm. No, I'm not sure I can. I was thinking about the ctz2 case, and whether there could be cases where the u's have different values that make them "match", but the values are different that make them wrong. So the items in the table accessed by the DeBruijn constant would produce incorrect values, but there are still InputBits number of matches.
		;; int ctz2(unsigned x)
		;; {
		;; #define u 0
		;; static short table[64] =
		;; {
		;; 32, 0, 1, 12, 2, 6, u, 13, 3, u, 7, u, u, u, u, 14,
		;; 10, 4, u, u, 8, u, u, 25, u, u, u, u, u, 21, 27, 15,
		;; 31, 11, 5, u, u, u, u, u, 9, u, u, 24, u, u, 20, 26,
		;; 30, u, u, u, u, 23, u, 19, 29, u, 22, 18, 28, 17, 16, u
		;; };
		;; x = (x & -x) * 0x0450FBAF;
		;; return table[x >> 26];
		;; }
		But I don't think that is something that can come up. I was finding it hard to prove, but if the Mul is InputBits in length there are only at most InputBits separate elements that it can access. And multiple elements cannot map successfully back to the same i. I ran a sat solver overnight, and it is still going but hasn't found any counter examples, which is a good sign. (It is able to find valid DeBruijn CTTZ tables given the chance). It might be worth adding a comment explaining why this correctly matches the table in all cases.
		djtodoro (Author, Done): Yeah, good sign. I will try to make a reasonable comment. Thanks.
		dmgreen (Done): A comment explaining this function would still be useful.
		    // elements that are never accessed, so we keep iterating over each element
		dmgreen (Done): I believe it's the top `Bitwidth - Log2(Bitwidth)` bits.
		    // from the table. The number of matched elements should be equal to the
		    // number of potential right answers which is \p InputBits actually.
		    if ((((Mul << Element) & Mask.getZExtValue()) >> Shift) == i)
		      Matched++;
		  }

		  return Matched == InputBits;
		}

		// Try to recognize table-based ctz implementation.
		// E.g., an example in C (for more cases please see the llvm/tests):
		// int f(unsigned x) {
		// static const char table[32] =
		// {0, 1, 28, 2, 29, 14, 24, 3, 30,
		// 22, 20, 15, 25, 17, 4, 8, 31, 27,
		// 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
		// return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
		// }
		// this can be lowered to `cttz` instruction.
		// There is also a special case when the element is 0.
		//
		// Here are some examples or LLVM IR for a 64-bit target:
		//
		// CASE 1:
		craig.topper (Done): `B.getInt1(!DefinedForZero);`
		// %sub = sub i32 0, %x
		// %and = and i32 %sub, %x
		// %mul = mul i32 %and, 125613361
		// %shr = lshr i32 %mul, 27
		// %idxprom = zext i32 %shr to i64
		craig.topper (Not Done): We don't need this ICmp and Select if DefinedForZero is true right?
		djtodoro (Author, Done): Actually, we don't need it.
		// %arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @ctz1.table, i64 0,
		// i64 %idxprom %0 = load i8, i8* %arrayidx, align 1, !tbaa !8
		//
		// CASE 2:
		// %sub = sub i32 0, %x
		// %and = and i32 %sub, %x
		// %mul = mul i32 %and, 72416175
		// %shr = lshr i32 %mul, 26
		// %idxprom = zext i32 %shr to i64
		// %arrayidx = getelementptr inbounds [64 x i16], [64 x i16]* @ctz2.table, i64
		// 0, i64 %idxprom %0 = load i16, i16* %arrayidx, align 2, !tbaa !8
		//
		// CASE 3:
		// %sub = sub i32 0, %x
		// %and = and i32 %sub, %x
		// %mul = mul i32 %and, 81224991
		// %shr = lshr i32 %mul, 27
		// %idxprom = zext i32 %shr to i64
		// %arrayidx = getelementptr inbounds [32 x i32], [32 x i32]* @ctz3.table, i64
		// 0, i64 %idxprom %0 = load i32, i32* %arrayidx, align 4, !tbaa !8
		//
		// CASE 4:
		// %sub = sub i64 0, %x
		// %and = and i64 %sub, %x
		dmgreen (Not Done): One thing I forgot to mention - llvm has a code style that is best explained as "just run clang-format on the patch". These returns are all in the wrong place, for example, and could do with a cleanup.
		djtodoro (Author, Done): I've changed the style to `Google`, accidentally. Thanks.
		// %mul = mul i64 %and, 283881067100198605
		// %shr = lshr i64 %mul, 58
		// %arrayidx = getelementptr inbounds [64 x i8], [64 x i8]* @table, i64 0, i64
		// %shr %0 = load i8, i8* %arrayidx, align 1, !tbaa !8
		//
		dmgreen (Not Done): Can this always just use the Load type?
		djtodoro (Author, Done): I think yes, nothing comes up to my mind that can break it.
		dmgreen (Done): This can be changed to just the Load type then.
		// All this can be lowered to @llvm.cttz.i32/64 intrinsic.
		static bool tryToRecognizeTableBasedCttz(Instruction &I) {
		  LoadInst *LI = dyn_cast<LoadInst>(&I);
		  if (!LI)
		    return false;

		  Type *AccessType = LI->getType();
		  if (!AccessType->isIntegerTy())
		    return false;

		  GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(LI->getPointerOperand());
		  if (!GEP || !GEP->isInBounds() || GEP->getNumIndices() != 2)
		dmgreen (Done): I think User here could be GlobalVariable
		    return false;

		  if (!GEP->getSourceElementType()->isArrayTy())
		    return false;
		dmgreen (Not Done): GEPUser->getOperand(0) -> Global->getInitializer(). It is worth adding a test where the global is extern.

		  uint64_t ArraySize = GEP->getSourceElementType()->getArrayNumElements();
		  if (ArraySize != 32 && ArraySize != 64)
		    return false;

		dmgreen (Done): `if (!GVTable || !GVTable->hasInitializer()) return false;`
		  GlobalVariable *GVTable = dyn_cast<GlobalVariable>(GEP->getPointerOperand());
		  if (!GVTable || !GVTable->hasInitializer())
		    return false;

		  ConstantDataArray *ConstData =
		      dyn_cast<ConstantDataArray>(GVTable->getInitializer());
		  if (!ConstData)
		    return false;

		  if (!match(GEP->idx_begin()->get(), m_ZeroInt()))
		    return false;

		  Value *Idx2 = std::next(GEP->idx_begin())->get();
		spatel (Done): Could shorten this by using something like: `if (!match(GEP->idx_begin()->get(), m_ZeroInt())) return false;`
		  Value *X1;
		  uint64_t MulConst, ShiftConst;
		  // FIXME: 64-bit targets have `i64` type for the GEP index, so this match will
		  // probably fail for other (e.g. 32-bit) targets.
		dmgreen (Not Done): It might be better to switch this logic around - `unsigned InputBits = X1->getType()->getScalarSizeInBits(); if (InputBits != 32 && InputBits != 64) return false;`
		djtodoro (Author, Done): Sounds good.
		  if (!match(Idx2, m_ZExtOrSelf(
		                       m_LShr(m_Mul(m_c_And(m_Neg(m_Value(X1)), m_Deferred(X1)),
		                                    m_ConstantInt(MulConst)),
		                              m_ConstantInt(ShiftConst)))))
		dmgreen (Not Done): Log2_32_Ceil -> Log2_Ceil if we know the InputBits is a power of 2. The -1 case is for a larger table with more elements but that can handle zero values?
		djtodoro (Author, Done):
		> Log2_32_Ceil -> Log2_Ceil if we know the InputBits is a power of 2.
		Right... But you meant `Log2_64()`, right? It is a power of 2, since it is either 32 or 64, so no need to add any assert here.
		> The -1 case is for a larger table with more elements but that can handle zero values?
		int ctz2(unsigned x) { #define u 0 static short table[64] = { 32, 0, 1, 12, 2, 6, u, 13, 3, u, 7, u, u, u, u, 14, 10, 4, u, u, 8, u, u, 25, u, u, u, u, u, 21, 27, 15, 31, 11, 5, u, u, u, u, u, 9, u, u, 24, u, u, 20, 26, 30, u, u, u, u, 23, u, 19, 29, u, 22, 18, 28, 17, 16, u }; x = (x & -x) * 0x0450FBAF; return table[x >> 26]; }
		dmgreen (Not Done): Ah, yeah - I meant Log2_32, but deleted the wrong part of the function name.
		    return false;

		dmgreen (Not Done): This is true by definition now.
		  unsigned InputBits = X1->getType()->getScalarSizeInBits();
		  if (InputBits != 32 && InputBits != 64)
		    return false;

		  // Shift should extract top 5..7 bits.
		  if (InputBits - Log2_32(InputBits) != ShiftConst &&
		      InputBits - Log2_32(InputBits) - 1 != ShiftConst)
		    return false;
		craig.topper (Not Done): If we’re only handling 32 and 64, this comment should be 5..6
		dmgreen (Not Done): I believe it is 7 because the table can be twice the size. Hence the -1 in the formula below. See the ctz2 test.

		  if (!isCTTZTable(*ConstData, MulConst, ShiftConst, InputBits))
		    return false;

		dmgreen (Done): Remove this check, as it is always true as far as I can tell.
		  auto ZeroTableElem = ConstData->getElementAsInteger(0);
		  bool DefinedForZero = ZeroTableElem == InputBits;

		  IRBuilder<> B(LI);
		  ConstantInt *BoolConst = B.getInt1(!DefinedForZero);
		  Type *XType = X1->getType();
		  auto Cttz = B.CreateIntrinsic(Intrinsic::cttz, {XType}, {X1, BoolConst});
		  Value *ZExtOrTrunc = nullptr;

		  if (DefinedForZero) {
		    ZExtOrTrunc = B.CreateZExtOrTrunc(Cttz, AccessType);
		  } else {
		    // If the value in elem 0 isn't the same as InputBits, we still want to
		    // produce the value from the table.
		    auto Cmp = B.CreateICmpEQ(X1, ConstantInt::get(XType, 0));
		    auto Select =
		        B.CreateSelect(Cmp, ConstantInt::get(XType, ZeroTableElem), Cttz);

		    // NOTE: If the table[0] is 0, but the cttz(0) is defined by the Target
		    // it should be handled as: `cttz(x) & (typeSize - 1)`.

		    ZExtOrTrunc = B.CreateZExtOrTrunc(Select, AccessType);
		  }

		  LI->replaceAllUsesWith(ZExtOrTrunc);

		  return true;
		}

/// This is the entry point for folds that could be implemented in regular
/// InstCombine, but they are separated because they are not expected to
/// occur frequently and/or have more than a constant-length pattern match.
static bool foldUnusualPatterns(Function &F, DominatorTree &DT,
                                TargetTransformInfo &TTI,
                                TargetLibraryInfo &TLI) {
  bool MadeChange = false;
  for (BasicBlock &BB : F) {
    // Ignore unreachable basic blocks.
    if (!DT.isReachableFromEntry(&BB))
      continue;

    // Walk the block backwards for efficiency. We're matching a chain of
    // use->defs, so we're more likely to succeed by starting from the bottom.
    // Also, we want to avoid matching partial patterns.
    // TODO: It would be more efficient if we removed dead instructions
    // iteratively in this loop rather than waiting until the end.
    for (Instruction &I : make_early_inc_range(llvm::reverse(BB))) {
      MadeChange |= foldAnyOrAllBitsSet(I);
      MadeChange |= foldGuardedFunnelShift(I, DT);
      MadeChange |= tryToRecognizePopCount(I);
      MadeChange |= tryToFPToSat(I, TTI);
      MadeChange |= foldSqrt(I, TTI, TLI);
      MadeChange |= tryToRecognizeTableBasedCttz(I);
    }
  }

  // We're done with transforms, so remove dead instructions.
  if (MadeChange)
    for (BasicBlock &BB : F)
      SimplifyInstructionsInBlock(&BB);

[12 lines collapsed; context: static bool runImpl(Function &F, AssumptionCache &AC, TargetTransformInfo &TTI,]
  return MadeChange;
}

void AggressiveInstCombinerLegacyPass::getAnalysisUsage(
    AnalysisUsage &AU) const {
  AU.setPreservesCFG();
  AU.addRequired<AssumptionCacheTracker>();
  AU.addRequired<DominatorTreeWrapperPass>();
  AU.addRequired<TargetLibraryInfoWrapperPass>();
dmgreen (Done): The TTI's can be removed now (and if you rebase they may already be present, but still are not needed by the new code any more).
  AU.addRequired<TargetTransformInfoWrapperPass>();
  AU.addPreserved<AAResultsWrapperPass>();
  AU.addPreserved<BasicAAWrapperPass>();
  AU.addPreserved<DominatorTreeWrapperPass>();
  AU.addPreserved<GlobalsAAWrapperPass>();
}

bool AggressiveInstCombinerLegacyPass::runOnFunction(Function &F) {
[50 lines collapsed]
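
To make the effect of the lowering concrete, here is a minimal C++ sketch that models the value the patch builds (a `cttz` call, plus a `select` on zero when `table[0]` differs from the input bit width) and compares it with the original table lookup. The table and multiplier come from the `ctz1` reproducer in the tests; `tableCtz`, `cttz32`, `loweredCtz`, and `checkLoweringMatchesTable` are hypothetical names used only for illustration, and `cttz32` is a plain-loop stand-in for `llvm.cttz.i32`, not the intrinsic itself.

```cpp
#include <cassert>
#include <cstdint>

// Table and multiplier copied from the ctz1 reproducer in the tests.
static const unsigned char Ctz1Table[32] = {
    0,  1,  28, 2,  29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4,  8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6,  11, 5,  10, 9};

// The source pattern: isolate the lowest set bit, multiply by the magic
// constant, shift, and index the table.
inline unsigned tableCtz(uint32_t X) {
  uint32_t Lsb = X & (0u - X);
  return Ctz1Table[static_cast<uint32_t>(Lsb * 0x077CB531u) >> 27];
}

// Hypothetical stand-in for llvm.cttz.i32. It returns 32 for zero, which
// models the "defined for zero" flavour of the intrinsic.
inline unsigned cttz32(uint32_t X) {
  if (X == 0)
    return 32;
  unsigned N = 0;
  while ((X & 1u) == 0) {
    X >>= 1;
    ++N;
  }
  return N;
}

// Model of the replacement value: if table[0] equals the bit width, cttz
// alone is enough; otherwise a select supplies table[0] for a zero input.
inline unsigned loweredCtz(uint32_t X, unsigned ZeroTableElem,
                           unsigned InputBits) {
  bool DefinedForZero = ZeroTableElem == InputBits;
  unsigned Cttz = cttz32(X);
  if (DefinedForZero)
    return Cttz;
  return X == 0 ? ZeroTableElem : Cttz;
}

// Usage: the lowered form agrees with the table lookup on zero and on
// every single-bit input.
inline void checkLoweringMatchesTable() {
  assert(loweredCtz(0, Ctz1Table[0], 32) == tableCtz(0));
  for (unsigned I = 0; I < 32; ++I)
    assert(loweredCtz(1u << I, Ctz1Table[0], 32) == tableCtz(1u << I));
}
```

In the IR, the `DefinedForZero` case corresponds to `i1 false` on the intrinsic call, and the other case to `i1 true` followed by `icmp eq` and `select`, as the CHECK lines in the tests show.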

llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-basics.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -aggressive-instcombine -S < %s | FileCheck %s

				;; These cases test lowering of various implementations of table-based ctz
				;; algorithms to the llvm.cttz instruction.

				;; C reproducers:
				;; int ctz1 (unsigned x)
				;; {
				;; static const char table[32] =
				;; {
				;; 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
				;; 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
				;; };
				;; return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
				;; }

				;; int ctz2(unsigned x)
				;; {
				;; #define u 0
				;; static short table[64] =
				;; {
				;; 32, 0, 1, 12, 2, 6, u, 13, 3, u, 7, u, u, u, u, 14,
				;; 10, 4, u, u, 8, u, u, 25, u, u, u, u, u, 21, 27, 15,
				;; 31, 11, 5, u, u, u, u, u, 9, u, u, 24, u, u, 20, 26,
				;; 30, u, u, u, u, 23, u, 19, 29, u, 22, 18, 28, 17, 16, u
				;; };
				;; x = (x & -x) * 0x0450FBAF;
				;; return table[x >> 26];
				;; }

				;; int ctz3(unsigned x)
				;; {
				;; static int table[32] =
				;; {
				;; 0, 1, 2, 24, 3, 19, 6, 25, 22, 4, 20, 10, 16, 7, 12, 26,
				;; 31, 23, 18, 5, 21, 9, 15, 11, 30, 17, 8, 14, 29, 13, 28, 27
				;; };
				;; if (x == 0) return 32;
				;; x = (x & -x) * 0x04D7651F;
				;; return table[x >> 27];
				;; }

				;; static const unsigned long long magic = 0x03f08c5392f756cdULL;
				;;
				;; static const int table[64] = {
				;; 0, 1, 12, 2, 13, 22, 17, 3, 14, 33, 23, 36, 18, 58, 28, 4,
				;; 62, 15, 34, 26, 24, 48, 50, 37, 19, 55, 59, 52, 29, 44, 39, 5,
				;; 63, 11, 21, 16, 32, 35, 57, 27, 61, 25, 47, 49, 54, 51, 43, 38,
				;; 10, 20, 31, 56, 60, 46, 53, 42, 9, 30, 45, 41, 8, 40, 7, 6,
				;; };
				;;
				;; int ctz4 (unsigned long long b)
				;; {
				;; unsigned long long lsb = b & -b;
				;; return table[(lsb * magic) >> 58];
				;; }
				;;
				;; int ctz5(unsigned x)
				;; {
				;; static char table[32] =
				;; {
				;; 0, 1, 2, 24, 3, 19, 6, 25, 22, 4, 20, 10, 16, 7, 12, 26,
				;; 31, 23, 18, 5, 21, 9, 15, 11, 30, 17, 8, 14, 29, 13, 28, 27
				;; };
				;; x = (x & -x)*0x04D7651F;
				;; return table[x >> 27];
				;; }

				;; int indexes[] = {
				;; 63, 0, 58, 1, 59, 47, 53, 2,60, 39, 48, 27, 54, 33, 42, 3,
				;; 61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22, 4,
				;; 62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
				;; 56, 45, 25, 31, 35, 16, 9, 12, 44, 24, 15, 8, 23, 7, 6, 5
				;; };
				;;
				;; int ctz6(unsigned long n)
				;; {
				;; return indexes[((n & (~n + 1)) * 0x07EDD5E59A4E28C2ull) >> 58];
				;; }
				;;
				;; int ctz8(unsigned v)
				;; {
				;; static const int table[] =
				;; {
				;; 31 ,0 ,1 ,23 ,2 ,18 ,5 ,24 ,21 ,3 ,19 ,9 ,15 ,6 ,11 ,25 ,30 ,22 ,17 ,4 ,20 ,8 ,14 ,10 ,29 ,16 ,7 ,13 ,28 ,12 ,27 ,26
				;; };
				;; unsigned x =(-v & v);
				;; return table[(unsigned)(x * 0x9AECA3EU) >> 27];
				;; }

				@ctz7.table = internal unnamed_addr constant [32 x i8] c"\00\01\1C\02\1D\0E\18\03\1E\16\14\0F\19\11\04\08\1F\1B\0D\17\15\13\10\07\1A\0C\12\06\0B\05\0A\09", align 1

				define i32 @ctz1(i32 %x) {
				; CHECK-LABEL: @ctz1(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.cttz.i32(i32 [[X:%.*]], i1 true)
				; CHECK-NEXT: [[TMP1:%.*]] = icmp eq i32 [[X]], 0
				; CHECK-NEXT: [[TMP2:%.*]] = select i1 [[TMP1]], i32 0, i32 [[TMP0]]
				; CHECK-NEXT: [[TMP3:%.*]] = trunc i32 [[TMP2]] to i8
				; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP3]] to i32
				; CHECK-NEXT: ret i32 [[CONV]]
				;
				entry:
				%sub = sub i32 0, %x
				%and = and i32 %sub, %x
				%mul = mul i32 %and, 125613361
				%shr = lshr i32 %mul, 27
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @ctz7.table, i64 0, i64 %idxprom
				%0 = load i8, i8* %arrayidx, align 1
				%conv = zext i8 %0 to i32
				ret i32 %conv
				}

				@ctz2.table = internal unnamed_addr constant [64 x i16] [i16 32, i16 0, i16 1, i16 12, i16 2, i16 6, i16 0, i16 13, i16 3, i16 0, i16 7, i16 0, i16 0, i16 0, i16 0, i16 14, i16 10, i16 4, i16 0, i16 0, i16 8, i16 0, i16 0, i16 25, i16 0, i16 0, i16 0, i16 0, i16 0, i16 21, i16 27, i16 15, i16 31, i16 11, i16 5, i16 0, i16 0, i16 0, i16 0, i16 0, i16 9, i16 0, i16 0, i16 24, i16 0, i16 0, i16 20, i16 26, i16 30, i16 0, i16 0, i16 0, i16 0, i16 23, i16 0, i16 19, i16 29, i16 0, i16 22, i16 18, i16 28, i16 17, i16 16, i16 0], align 2

				define i32 @ctz2(i32 %x) {
				; CHECK-LABEL: @ctz2(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.cttz.i32(i32 [[X:%.*]], i1 false)
				; CHECK-NEXT: [[TMP1:%.*]] = trunc i32 [[TMP0]] to i16
				; CHECK-NEXT: [[CONV:%.*]] = sext i16 [[TMP1]] to i32
				; CHECK-NEXT: ret i32 [[CONV]]
				spatel (Not Done): This is identical to the previous test, so it is not adding value here. I realize that the source example it is intended to model is slightly different. If you want to verify that we end up with cttz from the IR produced by clang, then I'd recommend adding a file to test/Transforms/PhaseOrdering and "RUN -O3". When I create tests like that, I grab the unoptimized IR using "clang -S -o - -emit-llvm -Xclang -disable-llvm-optzns". Then reduce it by running it through "opt -mem2reg", so it's not completely full of junk IR.
				djtodoro (Author, Done):
				> This is identical to the previous test, so it is not adding value here.
				Right, thanks!
				> I realize that the source example it is intended to model is slightly different. If you want to verify that we end up with cttz from the IR produced by clang, then I'd recommend adding a file to test/Transforms/PhaseOrdering and "RUN -O3". When I create tests like that, I grab the unoptimized IR using "clang -S -o - -emit-llvm -Xclang -disable-llvm-optzns". Then reduce it by running it through "opt -mem2reg", so it's not completely full of junk IR.
				I usually do create tests that way, but these may be a bit stale. I will add the `PhaseOrdering` test, thanks for the suggestion.
				;
				entry:
				%sub = sub i32 0, %x
				%and = and i32 %sub, %x
				%mul = mul i32 %and, 72416175
				%shr = lshr i32 %mul, 26
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [64 x i16], [64 x i16]* @ctz2.table, i64 0, i64 %idxprom
				%0 = load i16, i16* %arrayidx, align 2
				%conv = sext i16 %0 to i32
				ret i32 %conv
				}

				@ctz3.table = internal unnamed_addr constant [32 x i32] [i32 0, i32 1, i32 2, i32 24, i32 3, i32 19, i32 6, i32 25, i32 22, i32 4, i32 20, i32 10, i32 16, i32 7, i32 12, i32 26, i32 31, i32 23, i32 18, i32 5, i32 21, i32 9, i32 15, i32 11, i32 30, i32 17, i32 8, i32 14, i32 29, i32 13, i32 28, i32 27], align 4

				define i32 @ctz3(i32 %x) {
				; CHECK-LABEL: @ctz3(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[X:%.*]], 0
				; CHECK-NEXT: br i1 [[CMP]], label [[RETURN:%.*]], label [[IF_END:%.*]]
				; CHECK: if.end:
				; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.cttz.i32(i32 [[X]], i1 true)
				; CHECK-NEXT: [[TMP1:%.*]] = icmp eq i32 [[X]], 0
				; CHECK-NEXT: br label [[RETURN]]
				; CHECK: return:
				; CHECK-NEXT: [[RETVAL_0:%.*]] = phi i32 [ [[TMP0]], [[IF_END]] ], [ 32, [[ENTRY:%.*]] ]
				; CHECK-NEXT: ret i32 [[RETVAL_0]]
				;
				entry:
				%cmp = icmp eq i32 %x, 0
				br i1 %cmp, label %return, label %if.end

				if.end: ; preds = %entry
				%sub = sub i32 0, %x
				%and = and i32 %sub, %x
				%mul = mul i32 %and, 81224991
				%shr = lshr i32 %mul, 27
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [32 x i32], [32 x i32]* @ctz3.table, i64 0, i64 %idxprom
				%0 = load i32, i32* %arrayidx, align 4
				br label %return

				return: ; preds = %entry, %if.end
				%retval.0 = phi i32 [ %0, %if.end ], [ 32, %entry ]
				ret i32 %retval.0
				}

				@table = internal unnamed_addr constant [64 x i32] [i32 0, i32 1, i32 12, i32 2, i32 13, i32 22, i32 17, i32 3, i32 14, i32 33, i32 23, i32 36, i32 18, i32 58, i32 28, i32 4, i32 62, i32 15, i32 34, i32 26, i32 24, i32 48, i32 50, i32 37, i32 19, i32 55, i32 59, i32 52, i32 29, i32 44, i32 39, i32 5, i32 63, i32 11, i32 21, i32 16, i32 32, i32 35, i32 57, i32 27, i32 61, i32 25, i32 47, i32 49, i32 54, i32 51, i32 43, i32 38, i32 10, i32 20, i32 31, i32 56, i32 60, i32 46, i32 53, i32 42, i32 9, i32 30, i32 45, i32 41, i32 8, i32 40, i32 7, i32 6], align 4

				define i32 @ctz4(i64 %b) {
				; CHECK-LABEL: @ctz4(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.cttz.i64(i64 [[B:%.*]], i1 true)
				; CHECK-NEXT: [[TMP1:%.*]] = icmp eq i64 [[B]], 0
				; CHECK-NEXT: [[TMP2:%.*]] = select i1 [[TMP1]], i64 0, i64 [[TMP0]]
				; CHECK-NEXT: [[TMP3:%.*]] = trunc i64 [[TMP2]] to i32
				; CHECK-NEXT: ret i32 [[TMP3]]
				;
				entry:
				%sub = sub i64 0, %b
				%and = and i64 %sub, %b
				%mul = mul i64 %and, 283881067100198605
				%shr = lshr i64 %mul, 58
				%arrayidx = getelementptr inbounds [64 x i32], [64 x i32]* @table, i64 0, i64 %shr
				%0 = load i32, i32* %arrayidx, align 4
				ret i32 %0
				}

				@ctz5.table = internal unnamed_addr constant [32 x i8] c"\00\01\02\18\03\13\06\19\16\04\14\0A\10\07\0C\1A\1F\17\12\05\15\09\0F\0B\1E\11\08\0E\1D\0D\1C\1B", align 1

				define i32 @ctz5(i32 %x) {
				; CHECK-LABEL: @ctz5(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.cttz.i32(i32 [[X:%.*]], i1 true)
				; CHECK-NEXT: [[TMP1:%.*]] = icmp eq i32 [[X]], 0
				; CHECK-NEXT: [[TMP2:%.*]] = select i1 [[TMP1]], i32 0, i32 [[TMP0]]
				; CHECK-NEXT: [[TMP3:%.*]] = trunc i32 [[TMP2]] to i8
				; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP3]] to i32
				; CHECK-NEXT: ret i32 [[CONV]]
				;
				entry:
				%sub = sub i32 0, %x
				%and = and i32 %sub, %x
				%mul = mul i32 %and, 81224991
				%shr = lshr i32 %mul, 27
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @ctz5.table, i64 0, i64 %idxprom
				%0 = load i8, i8* %arrayidx, align 1
				%conv = zext i8 %0 to i32
				ret i32 %conv
				}

				@ctz6.table = global [64 x i32] [i32 63, i32 0, i32 58, i32 1, i32 59, i32 47, i32 53, i32 2, i32 60, i32 39, i32 48, i32 27, i32 54, i32 33, i32 42, i32 3, i32 61, i32 51, i32 37, i32 40, i32 49, i32 18, i32 28, i32 20, i32 55, i32 30, i32 34, i32 11, i32 43, i32 14, i32 22, i32 4, i32 62, i32 57, i32 46, i32 52, i32 38, i32 26, i32 32, i32 41, i32 50, i32 36, i32 17, i32 19, i32 29, i32 10, i32 13, i32 21, i32 56, i32 45, i32 25, i32 31, i32 35, i32 16, i32 9, i32 12, i32 44, i32 24, i32 15, i32 8, i32 23, i32 7, i32 6, i32 5], align 4

				define i32 @ctz6(i64 %n) {
				; CHECK-LABEL: @ctz6(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.cttz.i64(i64 [[N:%.*]], i1 true)
				; CHECK-NEXT: [[TMP1:%.*]] = icmp eq i64 [[N]], 0
				; CHECK-NEXT: [[TMP2:%.*]] = select i1 [[TMP1]], i64 63, i64 [[TMP0]]
				; CHECK-NEXT: [[TMP3:%.*]] = trunc i64 [[TMP2]] to i32
				; CHECK-NEXT: ret i32 [[TMP3]]
				;
				entry:
				%add = sub i64 0, %n
				%and = and i64 %add, %n
				%mul = mul i64 %and, 571347909858961602
				%shr = lshr i64 %mul, 58
				%arrayidx = getelementptr inbounds [64 x i32], [64 x i32]* @ctz6.table, i64 0, i64 %shr
				%0 = load i32, i32* %arrayidx, align 4
				ret i32 %0
				}

				@ctz8.table = internal unnamed_addr constant [32 x i32] [i32 31, i32 0, i32 1, i32 23, i32 2, i32 18, i32 5, i32 24, i32 21, i32 3, i32 19, i32 9, i32 15, i32 6, i32 11, i32 25, i32 30, i32 22, i32 17, i32 4, i32 20, i32 8, i32 14, i32 10, i32 29, i32 16, i32 7, i32 13, i32 28, i32 12, i32 27, i32 26], align 4

				define i32 @ctz8(i32 %v) {
				; CHECK-LABEL: @ctz8(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.cttz.i32(i32 [[V:%.*]], i1 true)
				; CHECK-NEXT: [[TMP1:%.*]] = icmp eq i32 [[V]], 0
				; CHECK-NEXT: [[TMP2:%.*]] = select i1 [[TMP1]], i32 31, i32 [[TMP0]]
				; CHECK-NEXT: ret i32 [[TMP2]]
				;
				entry:
				%sub = sub i32 0, %v
				%and = and i32 %sub, %v
				%mul = mul i32 %and, 162449982
				%shr = lshr i32 %mul, 27
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [32 x i32], [32 x i32]* @ctz8.table, i64 0, i64 %idxprom
				%0 = load i32, i32* %arrayidx, align 4
				ret i32 %0
				}

llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-dereferencing-pointer.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -aggressive-instcombine -S < %s | FileCheck %s

				;; static const unsigned long long magic = 0x03f08c5392f756cdULL;
				;;
				;; static const int table[64] = {
				;; 0, 1, 12, 2, 13, 22, 17, 3,
				;; 14, 33, 23, 36, 18, 58, 28, 4,
				;; 62, 15, 34, 26, 24, 48, 50, 37,
				;; 19, 55, 59, 52, 29, 44, 39, 5,
				;; 63, 11, 21, 16, 32, 35, 57, 27,
				;; 61, 25, 47, 49, 54, 51, 43, 38,
				;; 10, 20, 31, 56, 60, 46, 53, 42,
				;; 9, 30, 45, 41, 8, 40, 7, 6,
				;; };
				;;
				;; int ctz6 (unsigned long long * const b) {
				;; return table[(((b) & -(b)) * magic) >> 58];
				;; }

				@table = internal unnamed_addr constant [64 x i32] [i32 0, i32 1, i32 12, i32 2, i32 13, i32 22, i32 17, i32 3, i32 14, i32 33, i32 23, i32 36, i32 18, i32 58, i32 28, i32 4, i32 62, i32 15, i32 34, i32 26, i32 24, i32 48, i32 50, i32 37, i32 19, i32 55, i32 59, i32 52, i32 29, i32 44, i32 39, i32 5, i32 63, i32 11, i32 21, i32 16, i32 32, i32 35, i32 57, i32 27, i32 61, i32 25, i32 47, i32 49, i32 54, i32 51, i32 43, i32 38, i32 10, i32 20, i32 31, i32 56, i32 60, i32 46, i32 53, i32 42, i32 9, i32 30, i32 45, i32 41, i32 8, i32 40, i32 7, i32 6], align 4

				define i32 @ctz6(i64* nocapture readonly %b) {
				; CHECK-LABEL: @ctz6(
				dmgreen (Not Done): I usually remove dso_local
				djtodoro (Author, Done): me as well, leftover :/ thanks!
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = load i64, i64* [[B:%.*]], align 8
				; CHECK-NEXT: [[TMP1:%.*]] = call i64 @llvm.cttz.i64(i64 [[TMP0]], i1 true)
				; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i64 [[TMP0]], 0
				; CHECK-NEXT: [[TMP3:%.*]] = select i1 [[TMP2]], i64 0, i64 [[TMP1]]
				; CHECK-NEXT: [[TMP4:%.*]] = trunc i64 [[TMP3]] to i32
				; CHECK-NEXT: ret i32 [[TMP4]]
				;
				entry:
				%0 = load i64, i64* %b, align 8
				%sub = sub i64 0, %0
				%and = and i64 %0, %sub
				%mul = mul i64 %and, 283881067100198605
				%shr = lshr i64 %mul, 58
				%arrayidx = getelementptr inbounds [64 x i32], [64 x i32]* @table, i64 0, i64 %shr
				%1 = load i32, i32* %arrayidx, align 4
				ret i32 %1
				}

llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-non-argument-value.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -aggressive-instcombine -S < %s | FileCheck %s

				;; C reproducers:
				;; #include "stdio.h"
				;; unsigned x;
				;;
				;; int test ()
				;; {
				;; static const char table[32] =
				;; {
				;; 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
				;; 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
				;; };
				;; return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
				;; }
				;;

				@x = global i32 0, align 4
				@.str = private constant [3 x i8] c"%u\00", align 1
				@test.table = internal constant [32 x i8] c"\00\01\1C\02\1D\0E\18\03\1E\16\14\0F\19\11\04\08\1F\1B\0D\17\15\13\10\07\1A\0C\12\06\0B\05\0A\09", align 1

				define i32 @test() {
				; CHECK-LABEL: @test(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* @x, align 4
				; CHECK-NEXT: [[TMP1:%.*]] = call i32 @llvm.cttz.i32(i32 [[TMP0]], i1 true)
				; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i32 [[TMP0]], 0
				; CHECK-NEXT: [[TMP3:%.*]] = select i1 [[TMP2]], i32 0, i32 [[TMP1]]
				; CHECK-NEXT: [[TMP4:%.*]] = trunc i32 [[TMP3]] to i8
				; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP4]] to i32
				; CHECK-NEXT: ret i32 [[CONV]]
				;
				entry:
				%0 = load i32, i32* @x, align 4
				%sub = sub i32 0, %0
				%and = and i32 %0, %sub
				%mul = mul i32 %and, 125613361
				%shr = lshr i32 %mul, 27
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @test.table, i64 0, i64 %idxprom
				%1 = load i8, i8* %arrayidx, align 1
				%conv = zext i8 %1 to i32
				ret i32 %conv
				}
				dmgreen (Done): It is better to pass x as a parameter, although I'm not sure it matters much where x comes from for the rest of the pattern.
				spatel (Done): Right - as far as this patch is concerned, this is identical to the previous test, so it shouldn't be here. See my earlier comment about PhaseOrdering tests if we want more end-to-end coverage for `opt -O3`.
				djtodoro (Author, Done): Yes, agree.

llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-zero-element.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -aggressive-instcombine -S < %s | FileCheck %s

				@ctz1.table = internal constant [32 x i8] c"\00\01\1C\02\1D\0E\18\03\1E\16\14\0F\19\11\04\08\1F\1B\0D\17\15\13\10\07\1A\0C\12\06\0B\05\0A\09", align 1

				define i32 @ctz1(i32 %x) {
				; CHECK-LABEL: @ctz1(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.cttz.i32(i32 [[X:%.*]], i1 true)
				; CHECK-NEXT: [[TMP1:%.*]] = icmp eq i32 [[X]], 0
				; CHECK-NEXT: [[TMP2:%.*]] = select i1 [[TMP1]], i32 0, i32 [[TMP0]]
				; CHECK-NEXT: [[TMP3:%.*]] = trunc i32 [[TMP2]] to i8
				; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP3]] to i32
				; CHECK-NEXT: ret i32 [[CONV]]
				;
				entry:
				%sub = sub i32 0, %x
				%and = and i32 %sub, %x
				%mul = mul i32 %and, 125613361
				%shr = lshr i32 %mul, 27
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @ctz1.table, i64 0, i64 %idxprom
				%0 = load i8, i8* %arrayidx, align 1
				%conv = zext i8 %0 to i32
				ret i32 %conv
				}
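
The NOTE in the lowering code mentions an alternative for this zero-element case: when `table[0]` is 0 and `cttz(0)` is defined, the `select` could in principle be replaced by `cttz(x) & (typeSize - 1)`. The sketch below illustrates why that identity holds for 32-bit inputs. It is only an illustration of the NOTE, not something the patch shown here emits; `cttzDefinedAtZero`, `maskedCtz`, and `checkMaskedCtz` are hypothetical names, and the loop is a stand-in for the intrinsic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for llvm.cttz.i32(x, false): defined for zero,
// where it returns the bit width (32).
inline unsigned cttzDefinedAtZero(uint32_t X) {
  if (X == 0)
    return 32;
  unsigned N = 0;
  while ((X & 1u) == 0) {
    X >>= 1;
    ++N;
  }
  return N;
}

// For nonzero x the count is in [0, 31], so the mask leaves it unchanged.
// For x == 0 the count is 32, and 32 & 31 == 0, which equals table[0].
inline unsigned maskedCtz(uint32_t X) {
  return cttzDefinedAtZero(X) & (32u - 1u);
}

// Usage: zero maps to 0, and every single-bit input maps to its bit index.
inline void checkMaskedCtz() {
  assert(maskedCtz(0) == 0);
  for (unsigned I = 0; I < 32; ++I)
    assert(maskedCtz(1u << I) == I);
}
```

The mask only reproduces the table when `table[0]` is exactly 0; for tables such as `ctz8` above, where `table[0]` is 31, the `select` form is still needed.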

llvm/test/Transforms/AggressiveInstCombine/negative-lower-table-based-cttz.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -aggressive-instcombine -S < %s | FileCheck %s --implicit-check-not=llvm.cttz

				;; These cases should ensure we are not lowering incorrect implementations
				;; of table-based ctz algorithms to the llvm.cttz intrinsic.

				@ctz7.table = internal unnamed_addr constant [32 x i8] c"\05\01\1C\02\1D\0E\18\03\1E\16\14\0F\19\11\04\08\1F\1B\0D\17\15\13\10\07\1A\0C\12\06\0B\05\0A\09", align 1

				;; This is a negative test with a wrong table constant.

				define i32 @ctz1(i32 %x) {
				entry:
				%sub = sub i32 0, %x
				%and = and i32 %sub, %x
				%mul = mul i32 %and, 125613361
				%shr = lshr i32 %mul, 27
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @ctz7.table, i64 0, i64 %idxprom
				%0 = load i8, i8* %arrayidx, align 1
				%conv = zext i8 %0 to i32
				ret i32 %conv
				}

				;; These are some negative tests with wrong instruction sequences.

				@ctz1.table = internal unnamed_addr constant [32 x i8] c"\00\01\1C\02\1D\0E\18\03\1E\16\14\0F\19\11\04\08\1F\1B\0D\17\15\13\10\07\1A\0C\12\06\0B\05\0A\09", align 1

				define i32 @ctz2(i32 %x) {
				entry:
				%sub = sub i32 1, %x
				%and = and i32 %sub, %x
				%mul = mul i32 %and, 125613361
				%shr = lshr i32 %mul, 27
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @ctz1.table, i64 0, i64 %idxprom
				%0 = load i8, i8* %arrayidx, align 1
				%conv = zext i8 %0 to i32
				ret i32 %conv
				}

				define i32 @ctz3(i32 %x) {
				entry:
				%sub = sub i32 0, %x
				%and = and i32 %sub, %x
				%mul = mul i32 %and, 125613362
				%shr = lshr i32 %mul, 27
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @ctz1.table, i64 0, i64 %idxprom
				%0 = load i8, i8* %arrayidx, align 1
				%conv = zext i8 %0 to i32
				ret i32 %conv
				}

				define i32 @ctz4(i32 %x) {
				entry:
				%sub = sub i32 0, %x
				%and = and i32 %sub, %x
				%mul = mul i32 %and, 125613361
				%shr = lshr i32 %mul, 26
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @ctz1.table, i64 0, i64 %idxprom
				%0 = load i8, i8* %arrayidx, align 1
				%conv = zext i8 %0 to i32
				ret i32 %conv
				}

				;; This is a negative test with a wrong table size and constants.

				@ctz3.table = internal unnamed_addr constant [128 x i8] c"\00\01\1C\02\1D\0E\18\03\1E\16\14\0F\19\11\04\08\1F\1B\0D\17\15\13\10\07\1A\0C\12\06\0B\05\0A\09\00\01\1C\02\1D\0E\18\03\1E\16\14\0F\19\11\04\08\1F\1B\0D\17\15\13\10\07\1A\0C\12\06\0B\05\0A\09\00\01\1C\02\1D\0E\18\03\1E\16\14\0F\19\11\04\08\1F\1B\0D\17\15\13\10\07\1A\0C\12\06\0B\05\0A\09\00\01\1C\02\1D\0E\18\03\1E\16\14\0F\19\11\04\08\1F\1B\0D\17\15\13\10\07\1A\0C\12\06\0B\05\0A\09", align 1

				define i32 @ctz5(i32 %x) {
				entry:
				%sub = sub i32 0, %x
				%and = and i32 %sub, %x
				%mul = mul i32 %and, 125613361
				%shr = lshr i32 %mul, 27
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [128 x i8], [128 x i8]* @ctz3.table, i64 0, i64 %idxprom
				%0 = load i8, i8* %arrayidx, align 1
				%conv = zext i8 %0 to i32
				ret i32 %conv
				}

				;; A test with an extern global variable representing the table.
				;; extern int table[32];
				;;
				;; int ctz6(unsigned x) {
				;; if (x == 0) return 32;
				;; x = (x & -x) * 0x04D7651F;
				;; return table[x >> 27];
				;; }

				@table = external global [32 x i32], align 16
				define i32 @ctz6(i32 noundef %x) {
				entry:
				%cmp = icmp eq i32 %x, 0
				br i1 %cmp, label %return, label %if.end

				if.end: ; preds = %entry
				%sub = sub i32 0, %x
				%and = and i32 %sub, %x
				%mul = mul i32 %and, 81224991
				%shr = lshr i32 %mul, 27
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [32 x i32], [32 x i32]* @table, i64 0, i64 %idxprom
				%0 = load i32, i32* %arrayidx, align 4
				br label %return

				return: ; preds = %entry, %if.end
				%retval.0 = phi i32 [ %0, %if.end ], [ 32, %entry ]
				ret i32 %retval.0
				}
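
The negative tests above vary the table contents, the multiplier, the shift amount, and the table size. As a rough illustration of why such variants should be rejected, here is a minimal C++ sketch of one way a 32-bit table could be validated: every bit position `e` must be found at the index that `(Mul << e) >> Shift` selects. This is an assumption-laden sketch, not the check from the patch; `looksLikeCttzTable32`, `GoodTable`, and `checkNegativeCases` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// The valid table used by the positive tests (multiplier 0x077CB531).
static const unsigned char GoodTable[32] = {
    0,  1,  28, 2,  29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4,  8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6,  11, 5,  10, 9};

// Hypothetical validator for 32-bit inputs. An entry holding the value e
// counts as matched when multiplying the single-bit input (1 << e) by Mul
// and shifting right by Shift lands on that entry's own index.
inline bool looksLikeCttzTable32(const unsigned char *Table,
                                 std::size_t Length, uint32_t Mul,
                                 unsigned Shift) {
  const unsigned InputBits = 32;
  if (Length < InputBits || Length > 2 * InputBits)
    return false;
  if (Shift >= InputBits)
    return false;
  unsigned Matched = 0;
  for (std::size_t I = 0; I < Length; ++I) {
    unsigned Element = Table[I];
    if (Element >= InputBits)
      continue; // Out-of-range entries cannot be a trailing-zero count.
    uint32_t Index = static_cast<uint32_t>(Mul << Element) >> Shift;
    if (Index == I)
      ++Matched;
  }
  // Every bit position must be reachable through exactly its own entry.
  return Matched == InputBits;
}

// Usage: the valid table passes; a wrong first element, a wrong
// multiplier, and a wrong shift are all rejected.
inline void checkNegativeCases() {
  unsigned char Bad[32];
  for (unsigned I = 0; I < 32; ++I)
    Bad[I] = GoodTable[I];
  Bad[0] = 5; // Same corruption as the first negative test.
  assert(looksLikeCttzTable32(GoodTable, 32, 0x077CB531u, 27));
  assert(!looksLikeCttzTable32(Bad, 32, 0x077CB531u, 27));
  assert(!looksLikeCttzTable32(GoodTable, 32, 125613362u, 27));
  assert(!looksLikeCttzTable32(GoodTable, 32, 0x077CB531u, 26));
}
```

The length check in this sketch is what rejects an oversized table such as the 128-entry one in `ctz5`; the external-table case (`ctz6`) is a different matter, since a table without a visible constant initializer cannot be inspected at all.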

llvm/test/Transforms/PhaseOrdering/lower-table-based-cttz.ll

This file was added.

				;; This tests lowering of the implementations of table-based ctz
				;; algorithm to the llvm.cttz instruction in the -O3 case.

				;; C reproducer:
				;; int ctz1 (unsigned x)
				;; {
				;; static const char table[32] =
				;; {
				;; 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
				;; 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
				;; };
				;; return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
				;; }
				;; Compiled as: clang -O3 test.c -S -emit-llvm -Xclang -disable-llvm-optzns

				; RUN: opt -O3 -S < %s | FileCheck %s

				; CHECK: call i32 @llvm.cttz.i32

				@ctz1.table = internal constant [32 x i8] c"\00\01\1C\02\1D\0E\18\03\1E\16\14\0F\19\11\04\08\1F\1B\0D\17\15\13\10\07\1A\0C\12\06\0B\05\0A\09", align 16

				define i32 @ctz1(i32 noundef %x) {
				entry:
				dmgreen (Done): Can remove the `; Function Attrs`
				%x.addr = alloca i32, align 4
				store i32 %x, ptr %x.addr, align 4
				%0 = load i32, ptr %x.addr, align 4
				%1 = load i32, ptr %x.addr, align 4
				%sub = sub i32 0, %1
				%and = and i32 %0, %sub
				%mul = mul i32 %and, 125613361
				%shr = lshr i32 %mul, 27
				%idxprom = zext i32 %shr to i64
				%arrayidx = getelementptr inbounds [32 x i8], ptr @ctz1.table, i64 0, i64 %idxprom
				%2 = load i8, ptr %arrayidx, align 1
				%conv = sext i8 %2 to i32
				ret i32 %conv
				}

Revision Contents

Diff 457606
