This is an archive of the discontinued LLVM Phabricator instance.

[AggressiveInstCombine] Lower Table Based CTTZ
ClosedPublic

Authored by djtodoro on Nov 5 2021, 8:58 AM.

Details

Summary

This patch introduces recognition of table-based ctz implementation during the AggressiveInstCombine.
This fixes the [0].

[0] https://bugs.llvm.org/show_bug.cgi?id=46434

TODO: Get the data on SPEC benchmark.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
djtodoro marked 7 inline comments as done.Nov 15 2021, 4:02 AM
djtodoro added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
371

OK, sure.

451

Agree with this.

483

Yep.

523

Actually, we don't need it.

djtodoro updated this revision to Diff 387204.Nov 15 2021, 4:04 AM
  • Refactor && update the tests
  • Address the comments
xbolva00 added inline comments.Nov 15 2021, 4:18 AM
llvm/test/Transforms/AggressiveInstCombine/AARCH64/lower-table-based-ctz-basics.ll
1 ↗(On Diff #387204)

AggressiveInstCombine/AARCH64 -> AggressiveInstCombine/AArch64

djtodoro updated this revision to Diff 387215.Nov 15 2021, 4:51 AM
  • update the test dir name

any other comment here? :)

fhahn added a subscriber: fhahn.Nov 25 2021, 2:23 AM

TODO: Get the data on SPEC benchmark.

Did you manage to collect any perf data yet to motivate this change?

The tests already contain a few things that seem unrelated, it would be good to clean those things up.

llvm/test/Transforms/AggressiveInstCombine/AArch64/dereferencing-pointer.ll
28 ↗(On Diff #387215)

That's out of date.

40 ↗(On Diff #387215)

can the tests instead just take i64 %b or does it need to be a pointer? (

50 ↗(On Diff #387215)

is all the mote data needed? Same for the other tests

llvm/test/Transforms/AggressiveInstCombine/AArch64/non-argument-value.ll
31 ↗(On Diff #387215)

not needed

djtodoro edited the summary of this revision. (Show Details)Nov 26 2021, 1:28 AM
djtodoro added a reviewer: fhahn.

TODO: Get the data on SPEC benchmark.

Did you manage to collect any perf data yet to motivate this change?

Not yet, but I will share ASAP.

Thanks for your comments!

llvm/test/Transforms/AggressiveInstCombine/AArch64/dereferencing-pointer.ll
40 ↗(On Diff #387215)

This test is meant to test the pointer type. There are other tests checking non-ptr type.

50 ↗(On Diff #387215)

Yep, I'll reduce the tests in the next update.

djtodoro updated this revision to Diff 389938.Nov 26 2021, 1:31 AM

-Clean up the tests

AggressiveInstCombine is an extension of InstCombine. That is, it's a target-independent canonicalization pass. Therefore, we shouldn't use any target-specific cost model to enable the transform. Since we have a generic intrinsic for cttz, it's fine to create that for all targets as long as we can guarantee that a generic expansion of that intrinsic in the backend will not be worse than the original code.

But I'm not sure if we can make that guarantee? If not, this should be implemented as a late IR or codegen pass (as it was when first posted).

AggressiveInstCombine is an extension of InstCombine. That is, it's a target-independent canonicalization pass. Therefore, we shouldn't use any target-specific cost model to enable the transform. Since we have a generic intrinsic for cttz, it's fine to create that for all targets as long as we can guarantee that a generic expansion of that intrinsic in the backend will not be worse than the original code.

But I'm not sure if we can make that guarantee? If not, this should be implemented as a late IR or codegen pass (as it was when first posted).

Why is it ok to use DataLayout in InstCombine/AggressiveInstCombine, but not TTI? The cttz seems like it could enable other optimizations so I don't think we want it late. In particular, we should give the optimizer a chance to prove that the input isn't 0 to remove the select that gets generated after the cttz intrinsic. That could require computeKnownBits or CorrelatedValuePropagation. LoopIdiomRecognize queries TTI before generating cttz from loops.

AggressiveInstCombine is an extension of InstCombine. That is, it's a target-independent canonicalization pass. Therefore, we shouldn't use any target-specific cost model to enable the transform. Since we have a generic intrinsic for cttz, it's fine to create that for all targets as long as we can guarantee that a generic expansion of that intrinsic in the backend will not be worse than the original code.

But I'm not sure if we can make that guarantee? If not, this should be implemented as a late IR or codegen pass (as it was when first posted).

Why is it ok to use DataLayout in InstCombine/AggressiveInstCombine, but not TTI? The cttz seems like it could enable other optimizations so I don't think we want it late. In particular, we should give the optimizer a chance to prove that the input isn't 0 to remove the select that gets generated after the cttz intrinsic. That could require computeKnownBits or CorrelatedValuePropagation. LoopIdiomRecognize queries TTI before generating cttz from loops.

Yeah, it's fuzzy. I think we're only supposed to be using DataLayout to determine where creating an illegal type would obviously lead to worse codegen. But that's not much different than asking if some operation is legal on target X.
We've had several other requests for a cost-aware version of instcombine over the years, so maybe we should just use this an opportunity to reframe/rename AggressiveInstCombine.

Does handling of "0", when accessing index 0 for x = 0 not acceptable?

What is the reason for this patch being on hold?

We've had several other requests for a cost-aware version of instcombine over the years, so maybe we should just use this an opportunity to reframe/rename AggressiveInstCombine.

It could help with doing thing like https://reviews.llvm.org/D114964 earlier too, where the transfroms are not always profitable and not reversible back to the original code, but would be beneficial to do earlier to get better cost modelling and vectorization.

Hello,

I have a small query regarding this patch. The patch emits following llvm assembly for ctz table -

-----Patch assembly-------
// %bb.0:

rbit w8, w0
cmp w0, #0
clz w8, w8
csel w0, wzr, w8, eq
ret

but in gcc, we have the following assembly being emitted -

------------GCC---------------------
f(unsigned int):

rbit    w0, w0
clz     w0, w0
and     w0, w0, 31
ret

Reference - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

[1] My question is - To solve the bug, do I have to generate assembly similar to GCC?

Please give me suggestions on moving forward in solving this bug.

Hello,

I have a small query regarding this patch. The patch emits following llvm assembly for ctz table -

-----Patch assembly-------
// %bb.0:

rbit w8, w0
cmp w0, #0
clz w8, w8
csel w0, wzr, w8, eq
ret

but in gcc, we have the following assembly being emitted -

------------GCC---------------------
f(unsigned int):

rbit    w0, w0
clz     w0, w0
and     w0, w0, 31
ret

Reference - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

[1] My question is - To solve the bug, do I have to generate assembly similar to GCC?

Please give me suggestions on moving forward in solving this bug.

Hi,

Is this really a bug?

I have a small query regarding this patch. The patch emits following llvm assembly for ctz table -

-----Patch assembly-------
// %bb.0:

rbit w8, w0
cmp w0, #0
clz w8, w8
csel w0, wzr, w8, eq
ret

but in gcc, we have the following assembly being emitted -

------------GCC---------------------
f(unsigned int):

rbit    w0, w0
clz     w0, w0
and     w0, w0, 31
ret

That sounds like a backend optimization that could happen given we know the semantics of the AArch64 instruction.

I have a small query regarding this patch. The patch emits following llvm assembly for ctz table -

-----Patch assembly-------
// %bb.0:

rbit w8, w0
cmp w0, #0
clz w8, w8
csel w0, wzr, w8, eq
ret

but in gcc, we have the following assembly being emitted -

------------GCC---------------------
f(unsigned int):

rbit    w0, w0
clz     w0, w0
and     w0, w0, 31
ret

That sounds like a backend optimization that could happen given we know the semantics of the AArch64 instruction.

Yep, +1 for this. I'd leave this job to backends.

@craig.topper @spatel Do you think this can go as is?

(P.S. I don't have access to any AARCH64 board currently, so I can see into the SPEC numbers.)

I am wondering about general direction..

Is it worth it? On way to become compiler just for benchmarks?

Spend compile time just to optimize one very very specific pattern from spec is bad thing imho.

Can you show us some other real world “hits”? I assume any sane project already uses builtin to compute this value efficiently.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
383

getZExtValue may assert large ints

xbolva00 requested changes to this revision.Feb 1 2022, 2:17 AM

Can you please provide compile time data with @nikic's compile time tracker? This should be *very cheap* to be acceptable.

.

This revision now requires changes to proceed.Feb 1 2022, 2:17 AM

I am wondering about general direction..

Is it worth it? On way to become compiler just for benchmarks?

Spend compile time just to optimize one very very specific pattern from spec is bad thing imho.

Can you show us some other real world “hits”? I assume any sane project already uses builtin to compute this value efficiently.

The same algorithm is documented here https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightMultLookup

It's also nearby in the code in primesieve from the recent tzcnt discusson. Around line 143 if you expand the context on Erat.hpp here https://github.com/kimwalisch/primesieve/pull/109/files Granted that code knows when to use the tzcnt builtin instead of that code. I'm only mentioning it to show it is a known way to implement tzcnt that is used in more than just spec.

Can you please provide compile time data with @nikic's compile time tracker? This should be *very cheap* to be acceptable.

.

It is done: http://llvm-compile-time-tracker.com/compare.php?from=9920943ea201189f9b34918c7663d8a03d7e4676&to=666dd20021db313e4ead3e39ac4c8a12b9525521&stat=instructions

Herald added a project: Restricted Project. · View Herald TranscriptMay 5 2022, 1:58 AM
Herald added a subscriber: StephenFan. · View Herald Transcript

How hard is to add x86 support?

xbolva00 resigned from this revision.May 5 2022, 2:04 AM

(Not blocking this)

This revision now requires review to proceed.May 5 2022, 2:04 AM

Thanks!

How hard is to add x86 support?

Is it even worth of implementing it for x86?

Thanks!

How hard is to add x86 support?

Is it even worth of implementing it for x86?

Yes. Intel CPUs from about 2013 have a TZCNT instruction.

OK, great! it will be on my TODO list!

But I think that x86 support doesn’t block this.

Ideally this transformation should just emit proper intrinsic without need for target hook.

What is the real problem?

Hello - We were having a discussion about a very similar patch in D125755. I think the outcome for this patch is that either:

  • We need to do this later (maybe in CodeGenPrepare).
  • We need to do this unconditionally without the call to TTI.preferCTTZLowering() and have the reverse transform later for targets that do not have a cheaper alternative.
  • We need to argue some more :)

There are more details about why in D125755. I would go for the first option if it doesn't lead to worse performance, as for the second I'm not sure when it would be profitable to transform back and emit the table. You may not want to do that for non-hot ctzs? It sounds like it may be difficult to get right, but maybe I'm overestimating it.

as for the second I'm not sure when it would be profitable to transform back and emit the table

You really just have to weigh it against the current default expansion on targets where ctlz/cttz aren't legal, which is popcount(v & -v). It should be a straightforward comparison, generally. If you have popcount, use it. If multiply is legal, use a table lookup. Otherwise... maybe stick with the popcount expansion? Probably any approach is expensive at that point.

Compare the generated code for arm-eabi.

You may not want to do that for non-hot ctzs?

As opposed to what, calling into compiler-rt?

Hello - We were having a discussion about a very similar patch in D125755. I think the outcome for this patch is that either:

  • We need to do this later (maybe in CodeGenPrepare).

The initial version of patch was implemented within CodeGenPrepare. And I think it should not introduce any performance regression.

  • We need to do this unconditionally without the call to TTI.preferCTTZLowering() and have the reverse transform later for targets that do not have a cheaper alternative.

Hmm... I need to think about that.

  • We need to argue some more :)

Seems that I need to find some time to catch up the conversation in D125755. :)

There are more details about why in D125755. I would go for the first option if it doesn't lead to worse performance, as for the second I'm not sure when it would be profitable to transform back and emit the table. You may not want to do that for non-hot ctzs?

I am not sure I get the question.

as for the second I'm not sure when it would be profitable to transform back and emit the table

You really just have to weigh it against the current default expansion on targets where ctlz/cttz aren't legal, which is popcount(v & -v). It should be a straightforward comparison, generally. If you have popcount, use it. If multiply is legal, use a table lookup. Otherwise... maybe stick with the popcount expansion? Probably any approach is expensive at that point.

Compare the generated code for arm-eabi.

I guess we could measure something like that, but seems to me that it could introduce some performance regressions...

as for the second I'm not sure when it would be profitable to transform back and emit the table

You really just have to weigh it against the current default expansion on targets where ctlz/cttz aren't legal, which is popcount(v & -v). It should be a straightforward comparison, generally. If you have popcount, use it. If multiply is legal, use a table lookup. Otherwise... maybe stick with the popcount expansion? Probably any approach is expensive at that point.

Compare the generated code for arm-eabi.

You may not want to do that for non-hot ctzs?

As opposed to what, calling into compiler-rt?

I was meaning - it can be difficult for the compiler to recognize _when_ a ctz is performance critical. If the size of the table is large (which I was possible over-estimating the size of in my mind), then you may not want to emit the table for every ctz in the program. Currently that places where this is used have said, from the fact that they wrote it this way, that these ctz's are important. It just depends on whether converting to a table is always better, and if the table it small enough to be reasonable in the common case. 32 x i8 doesn't sound too big compared to what I originally imagined, if that is all it needs.

as for the second I'm not sure when it would be profitable to transform back and emit the table

You really just have to weigh it against the current default expansion on targets where ctlz/cttz aren't legal, which is popcount(v & -v). It should be a straightforward comparison, generally. If you have popcount, use it. If multiply is legal, use a table lookup. Otherwise... maybe stick with the popcount expansion? Probably any approach is expensive at that point.

Compare the generated code for arm-eabi.

You may not want to do that for non-hot ctzs?

As opposed to what, calling into compiler-rt?

I was meaning - it can be difficult for the compiler to recognize _when_ a ctz is performance critical. If the size of the table is large (which I was possible over-estimating the size of in my mind), then you may not want to emit the table for every ctz in the program. Currently that places where this is used have said, from the fact that they wrote it this way, that these ctz's are important. It just depends on whether converting to a table is always better, and if the table it small enough to be reasonable in the common case. 32 x i8 doesn't sound too big compared to what I originally imagined, if that is all it needs.

For the table lookup, is there an algorithm for creating the special constant that is used in the multiply? Or would we just hardcode known constants for common sizes.

as for the second I'm not sure when it would be profitable to transform back and emit the table

You really just have to weigh it against the current default expansion on targets where ctlz/cttz aren't legal, which is popcount(v & -v). It should be a straightforward comparison, generally. If you have popcount, use it. If multiply is legal, use a table lookup. Otherwise... maybe stick with the popcount expansion? Probably any approach is expensive at that point.

Note, the popcount expansion already uses a multiply without checking if it is legal.

For the table lookup, is there an algorithm for creating the special constant that is used in the multiply? Or would we just hardcode known constants for common sizes.

https://en.wikipedia.org/wiki/De_Bruijn_sequence has a description of the algorithm. Probably we'd just hard-code constants, though; practically speaking, the only sizes we actually care about are 16, 32, and 64. (For anything that doesn't fit in a single register, we probably just want to split the cttz.)

Note, the popcount expansion already uses a multiply without checking if it is legal.

It's a relatively cheap multiply to expand on a target with a shifter, though.

Thanks a lot for the comments! Can someone please sum the things up that need to be done for this?

By implementing this in the CodeGenPrepare doesn't include/imply that we should get rid of the call to TTI.preferCTTZLowering().
If we decide to go with the unconditional (not to care about Target) cttz lowering and to get rid of the call to TTI.preferCTTZLowering(), what should be done?

Concretely, my preferred solution looks something like:

  1. Perform the transform unconditionally in AggressiveInstCombine (so this patch without the preferCTTZLowering() bits).
  2. Teach TargetLowering::expandCTTZ to emit a table lookup.
djtodoro updated this revision to Diff 435145.Jun 8 2022, 7:05 AM
djtodoro retitled this revision from [AggressiveInstCombine] Lower Table Based CTTZ and enable it for AARCH64 in -O3 to [AggressiveInstCombine] Lower Table Based CTTZ .
djtodoro edited the summary of this revision. (Show Details)
  • drop target dependent hooks

Concretely, my preferred solution looks something like:

  1. Perform the transform unconditionally in AggressiveInstCombine (so this patch without the preferCTTZLowering() bits).

The latest update implements this.

  1. Teach TargetLowering::expandCTTZ to emit a table lookup.

Concretely, my preferred solution looks something like:

  1. Perform the transform unconditionally in AggressiveInstCombine (so this patch without the preferCTTZLowering() bits).

The latest update implements this.

  1. Teach TargetLowering::expandCTTZ to emit a table lookup.

@djtodoro - Will you be sending patch for (2) "Teach TargetLowering::expandCTTZ to emit a table lookup."?

Concretely, my preferred solution looks something like:

  1. Perform the transform unconditionally in AggressiveInstCombine (so this patch without the preferCTTZLowering() bits).

The latest update implements this.

  1. Teach TargetLowering::expandCTTZ to emit a table lookup.

@djtodoro - Will you be sending patch for (2) "Teach TargetLowering::expandCTTZ to emit a table lookup."?

Unfortunately, I don’t have time to do it right now. If you are interested, please go ahead with the implementation.

Concretely, my preferred solution looks something like:

  1. Perform the transform unconditionally in AggressiveInstCombine (so this patch without the preferCTTZLowering() bits).

The latest update implements this.

  1. Teach TargetLowering::expandCTTZ to emit a table lookup.

@djtodoro - Will you be sending patch for (2) "Teach TargetLowering::expandCTTZ to emit a table lookup."?

Unfortunately, I don’t have time to do it right now. If you are interested, please go ahead with the implementation.

@djtodoro - Sure. I am interested to do it. Can you elaborate (2) a bit in detail?

Concretely, my preferred solution looks something like:

  1. Perform the transform unconditionally in AggressiveInstCombine (so this patch without the preferCTTZLowering() bits).

The latest update implements this.

  1. Teach TargetLowering::expandCTTZ to emit a table lookup.

@djtodoro - Will you be sending patch for (2) "Teach TargetLowering::expandCTTZ to emit a table lookup."?

Unfortunately, I don’t have time to do it right now. If you are interested, please go ahead with the implementation.

@djtodoro - Sure. I am interested to do it. Can you elaborate (2) a bit in detail?

Basically this patch implements lowering of the table-based cttz implementation into @llvm.cttz unconditionally. For some targets it won't be that beneficial, so during the TargetLowering::expandCTTZ we should emit table lookup again. @efriedma May have something to add.

I don't think I have much further to say. Emitting a table lookup from TargetLowering::expandCTTZ should be straightforward, I think. See DAGCombiner::convertSelectOfFPConstantsToLoadOffset for an example of how to emit a constant table.

I don't think I have much further to say. Emitting a table lookup from TargetLowering::expandCTTZ should be straightforward, I think. See DAGCombiner::convertSelectOfFPConstantsToLoadOffset for an example of how to emit a constant table.

Do we need to be careful with vector CTTZ which can also go through expand CTTZ?

Is this patch ready to be merged? This is parent patch of https://reviews.llvm.org/D128911. Does child get merged before parent?

Anyways, https://reviews.llvm.org/D128911 can be merged independently of this patch.

Is this patch ready to be merged? This is parent patch of https://reviews.llvm.org/D128911. Does child get merged before parent?

Anyways, https://reviews.llvm.org/D128911 can be merged independently of this patch.

D128911 needs to go in first, once that is done we can move forward with this one. It could do with a rebase and a clang-format.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
444

We will need to support opaque pointers nowadays.

483

It's probably better to just say "64bit targets" as opposed to a specific target.

494

I believe it's the top Bitwidth - Log2(Bitwidth) bits.

dmgreen added inline comments.Jul 23 2022, 4:47 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
486

Does the extend between the lshr and the mul every happen? From what I can tell, the type of the VT should be based on the type of these operations.

693

The TTI's can be removed now (and if you rebase they may already be present, but still are not needed by the new code any more).

llvm/test/Transforms/AggressiveInstCombine/AArch64/lower-table-based-ctz-basics.ll
1 ↗(On Diff #435145)

The tests can be moved out of the AArch64 directory, so long as they drop the AArch64 triple.

@djtodoro Do you have any time to update this? Otherwise do you mind we take it over and we can update it and get it reviewed. Thanks.

I promise I will find some time to update this - it is coming next week.

djtodoro updated this revision to Diff 453206.Aug 16 2022, 11:44 PM
  • support opaque pointers
  • remove leftovers (since this was aarch64 only)
  • move the tests in a non-target dir
djtodoro marked 3 inline comments as done.Aug 17 2022, 1:08 AM
djtodoro added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
486

It does not happen in all the cases.

djtodoro updated this revision to Diff 453228.Aug 17 2022, 1:14 AM
  • remove a duplicated test

Why are some test files still specifying a triple in the RUN line?

It would be good to consolidate tests into less files if possible with better names/comments to explain exactly what differences are being tested in the sequence of tests. There should also be negative tests (wrong table constants, wrong magic multiplier, wrong shift amount, etc), so we know that the transform is not firing on mismatches.

Why are some test files still specifying a triple in the RUN line?

Leftovers. Thanks.

It would be good to consolidate tests into less files if possible with better names/comments to explain exactly what differences are being tested in the sequence of tests. There should also be negative tests (wrong table constants, wrong magic multiplier, wrong shift amount, etc), so we know that the transform is not firing on mismatches.

I will remove one redundant test. There are C producers as well as some top-level comments that explain what it should test (if the comment is needed). Furthermore, thanks for the suggestion for adding some negative tests - will add it.

djtodoro updated this revision to Diff 453574.Aug 18 2022, 1:54 AM
  • adding negative tests
  • rename the tests
  • clean the target triple leftovers from tests
dmgreen added inline comments.Aug 18 2022, 4:24 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
486

Do you have a test case where the extend is between the shift and the mul?

493

If the length of the table is larger than the InputBits, how are we sure that the matched elements will be the correct ones? Is this always guaranteed?
I think I would have expected a check that for each i=0..InputBits-1, Table[(Mul<<i)>>Shift] == i. With a check that the index is in range. Are they always equivalent with the larger tables too?

552

Can this always just use the Load type?

564

I think User here could be GlobalVariable

568

GEPUser->getOperand(0) -> Global->getInitializer(). It is worth adding a test where the global is extern.

590

It might be better to switch this logical around -

unsigned InputBits = X1->getType()->getScalarSizeInBits();
if (InputBits != 32 && InputBits != 64)
  return false;
593–594

Log2_32_Ceil -> Log2_Ceil if we know the InputBits is a power of 2.

The -1 case is for a larger table with more elements but that can handle zero values?

llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-non-argument-value.ll
79

It is better to pass x as a parameter, although I'm not sure it matter much where x comes from for the rest of the pattern.

spatel added inline comments.Aug 18 2022, 7:25 AM
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-basics.ll
124

This is identical to the previous test, so it is not adding value here.

I realize that the source example it intended to model is slightly different. If you want to verify that we end up with cttz from the IR produced by clang, then I'd recommend adding a file to test/Transforms/PhaseOrdering and "RUN -O3".

When I create tests like that, I grab the unoptimized IR using "clang -S -o - -emit-llvm -Xclang -disable-llvm-optzns". Then reduce it by running it through "opt -mem2reg", so it's not completely full of junk IR.

llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-non-argument-value.ll
79

Right - as far as this patch is concerned, this is identical to the previous test, so it shouldn't be here.

See my earlier comment about PhaseOrdering tests if we want more end-to-end coverage for opt -O3.

djtodoro marked 3 inline comments as done.Aug 18 2022, 9:23 AM

Thanks for the comments.

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
486

I was completely sure that I had a case for it, but I am not able to produce it actually -- so I deleted it for now.

493

Hmmm, can you please walk me through out an example?

552

I think yes, nothing comes up to my mind that can break it.

590

Sounds good.

593–594

Log2_32_Ceil -> Log2_Ceil if we know the InputBits is a power of 2.

Right...
Bu you meant Log2_64(), right? It is a power of 2, since it is either 32 or 64, so no need to add any assert here.

The -1 case is for a larger table with more elements but that can handle zero values?

int ctz2(unsigned x)
{
#define u 0
 static short table[64] =
  {
    32, 0, 1, 12, 2, 6, u, 13, 3, u, 7, u, u, u, u, 14,
    10, 4, u, u, 8, u, u, 25, u, u, u, u, u, 21, 27, 15,
    31, 11, 5, u, u, u, u, u, 9, u, u, 24, u, u, 20, 26,
    30, u, u, u, u, 23, u, 19, 29, u, 22, 18, 28, 17, 16, u
   };
 x = (x & -x) * 0x0450FBAF;
 return table[x >> 26];
 }
llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-basics.ll
124

This is identical to the previous test, so it is not adding value here.

Right, thanks!

I realize that the source example it intended to model is slightly different. If you want to verify that we end up with cttz from the IR produced by clang, then I'd recommend adding a file to test/Transforms/PhaseOrdering and "RUN -O3".

When I create tests like that, I grab the unoptimized IR using "clang -S -o - -emit-llvm -Xclang -disable-llvm-optzns". Then reduce it by running it through "opt -mem2reg", so it's not completely full of junk IR.

I usually do create tests that way, but these may be stale a bit. I will add the `PhaseOrdering``` test, thanks for the suggestion.

llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-non-argument-value.ll
79

Yes, agree.

djtodoro updated this revision to Diff 453684.Aug 18 2022, 9:24 AM
djtodoro marked an inline comment as done.
  • removed more duplicated tests
  • added a llvm/test/Transforms/PhaseOrdering test
  • refactor the code a bit
dmgreen added inline comments.Aug 19 2022, 2:01 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
493

Hmm. No, I'm not sure I can.

I was thinking about the ctz2 case, and whether there could be cases where the u's have different values that make them "match", but the values are different that make them wrong. So the items in the table accessed by the DeBruijn constant would produce incorrect values, but there are still InputBits number of matches.

;; int ctz2(unsigned x)
;; {
;; #define u 0
;;  static short table[64] =
;;   {
;;     32, 0, 1, 12, 2, 6, u, 13, 3, u, 7, u, u, u, u, 14,
;;     10, 4, u, u, 8, u, u, 25, u, u, u, u, u, 21, 27, 15,
;;     31, 11, 5, u, u, u, u, u, 9, u, u, 24, u, u, 20, 26,
;;     30, u, u, u, u, 23, u, 19, 29, u, 22, 18, 28, 17, 16, u
;;   };
;; x = (x & -x) * 0x0450FBAF;
;; return table[x >> 26];
;; }

But I don't think that is something that can come up. I was finding it hard to prove, but if the Mul is InputBits in length there are only at most InputBits separate elements that it can access. And multiple elements cannot map successfully back to the same i.

I ran a sat solver overnight, and it is still going but hasn't found any counter examples, which is a good sign. (It is able to find valid DeBruijn CTTZ tables given the chance).

It might be worth adding a comment explaining why this correctly matches the table in all cases.

547

One think I forgot to mention - llvm has a code style that is best explained as "just run clang-format on the patch". These returns are all in the wrong place, for example, and could do with a cleanup.

593–594

Ah, yeah - I meant Log2_32, but delete the wrong part of the function name.

595–596

This is true by definition now.

llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-dereferencing-pointer.ll
24

I usually remove dso_local

llvm/test/Transforms/PhaseOrdering/lower-table-based-cttz.ll
23

Can remove the ; Function Attrs

djtodoro marked an inline comment as done.Aug 21 2022, 3:06 AM
djtodoro added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
493

Yeah, good sign. I will try to make a reasonable comment. Thanks.

547

I've changed the style to Google, accidentally. Thanks.

llvm/test/Transforms/AggressiveInstCombine/lower-table-based-cttz-dereferencing-pointer.ll
24

me as well, leftover :/ thanks!

djtodoro updated this revision to Diff 454291.Aug 21 2022, 3:14 AM
  • clang-format
  • clean up tests

Thanks for the updates. I don't think I have anything else than is what is below. Any other comments from anyone else?

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
493

A comment explaining this function would still be useful.

552

This can be changed to just the Load type then.

570–573
if (!GVTable || !GVTable->hasInitializer())
  return false;
606–608

Remove this check, as it is always true as far as I can tell.

spatel added inline comments.Aug 24 2022, 1:28 PM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
476–477

There was a request to put a comment on this function, and I'll second that request. It's not clear why we are counting matches rather than just bailing out on the first mismatch. I think that's because you can construct/recognize a table with unaccessed/undefined elements?

djtodoro marked 4 inline comments as done.Aug 27 2022, 6:31 AM
djtodoro added inline comments.
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
476–477

Yes, that is the reason - we are iterating over the elements of the table, so there could be mismatch that we can ignore. A comment is coming.

djtodoro updated this revision to Diff 456115.Aug 27 2022, 6:35 AM
  • addressing comments
spatel accepted this revision.Aug 27 2022, 11:32 AM

LGTM

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
584–586

Could shorten this by using something like:

if (!match(GEP->idx_begin()->get(), m_ZeroInt())) return false;
This revision is now accepted and ready to land.Aug 27 2022, 11:32 AM
craig.topper added inline comments.Aug 27 2022, 11:52 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
604

If we’re only handling 32 and 64, this comment should be 5..6

craig.topper added inline comments.Aug 27 2022, 12:11 PM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
486

I think this can be
Mask = APInt::getBitsSetFrom(InputBits , Shift)

dmgreen accepted this revision.Aug 28 2022, 1:15 AM

Thanks for the updates. LGTM

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
604

I believe it is 7 because the table can be twice the size. Hence the -1 in the formula below. See the ctz2 test.

djtodoro marked 2 inline comments as done.Aug 28 2022, 1:42 AM

Thanks for your comments.

djtodoro updated this revision to Diff 456178.Aug 28 2022, 1:43 AM
  • addressing comments
craig.topper added inline comments.Aug 28 2022, 11:35 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
485

extra space before comma. Looks like I mistyped it in my comment. Sorry.

djtodoro updated this revision to Diff 456267.Aug 29 2022, 12:20 AM
This revision was automatically updated to reflect the committed changes.
rsmith added a subscriber: rsmith.Sep 2 2022, 3:49 PM

Hi, I'm seeing a heap-use-after-free in AggressiveInstCombine in a build shortly after this landed, within a function added here.

test/Transforms/AggressiveInstCombine/X86/sqrt.ll fails as follows under asan: https://gist.github.com/zygoloid/a270e65d32ab5b05504b3b0d5717f83b

Please can you fix or revert?

rsmith added a comment.Sep 2 2022, 4:20 PM

test/Transforms/AggressiveInstCombine/X86/sqrt.ll fails as follows under asan: https://gist.github.com/zygoloid/a270e65d32ab5b05504b3b0d5717f83b

I've gone ahead and reverted for now to unblock things. Let me know if you're not able to reproduce the ASan failure and I can dig more into it.

Thanks a lot. I will check ASAP.

@rsmith recommitted with f879939157. Thanks!

spatel added a comment.Sep 8 2022, 7:59 AM

@rsmith recommitted with f879939157. Thanks!

What was the bug, how was it fixed, and is there a new test to verify the fix? That should have been mentioned in the new commit message.

@rsmith recommitted with f879939157. Thanks!

What was the bug, how was it fixed, and is there a new test to verify the fix? That should have been mentioned in the new commit message.

You are right. I reverted the recommit, and I will recommit it with proper message, sorry I missed it. :/

Actually, the issue was that your patch D129167 introduced eraseFromParent and the tryToRecognizeTableBasedCttz would try to use the instruction (dyn_cast) after free. I just moved the tryToRecognizeTableBasedCttz above foldSqrt. I guess it does not need any additional test case.

spatel added a comment.Sep 8 2022, 9:28 AM

Actually, the issue was that your patch D129167 introduced eraseFromParent and the tryToRecognizeTableBasedCttz would try to use the instruction (dyn_cast) after free. I just moved the tryToRecognizeTableBasedCttz above foldSqrt. I guess it does not need any additional test case.

Ah, I see. Please put a comment on that call then - foldSqrt (or any other erasing transform) needs to be accounted for (last in the loop for now), or we might hit that bug. And yes, looks like we don't need another test since the existing regression test was flagged by the asan bot.