This patch introduces recognition of table-based ctz implementation during the AggressiveInstCombine.
This fixes the [0].
[0] https://bugs.llvm.org/show_bug.cgi?id=46434
TODO: Get the data on SPEC benchmark.
Differential D113291
[AggressiveInstCombine] Lower Table Based CTTZ djtodoro on Nov 5 2021, 8:58 AM. Authored by
Details This patch introduces recognition of table-based ctz implementation during the AggressiveInstCombine. [0] https://bugs.llvm.org/show_bug.cgi?id=46434 TODO: Get the data on SPEC benchmark.
Diff Detail Event TimelineThere are a very large number of changes, so older changes are hidden. Show Older Changes
Comment Actions
Did you manage to collect any perf data yet to motivate this change? The tests already contain a few things that seem unrelated, it would be good to clean those things up.
Comment Actions AggressiveInstCombine is an extension of InstCombine. That is, it's a target-independent canonicalization pass. Therefore, we shouldn't use any target-specific cost model to enable the transform. Since we have a generic intrinsic for cttz, it's fine to create that for all targets as long as we can guarantee that a generic expansion of that intrinsic in the backend will not be worse than the original code. But I'm not sure if we can make that guarantee? If not, this should be implemented as a late IR or codegen pass (as it was when first posted). Comment Actions Why is it ok to use DataLayout in InstCombine/AggressiveInstCombine, but not TTI? The cttz seems like it could enable other optimizations so I don't think we want it late. In particular, we should give the optimizer a chance to prove that the input isn't 0 to remove the select that gets generated after the cttz intrinsic. That could require computeKnownBits or CorrelatedValuePropagation. LoopIdiomRecognize queries TTI before generating cttz from loops. Comment Actions Yeah, it's fuzzy. I think we're only supposed to be using DataLayout to determine where creating an illegal type would obviously lead to worse codegen. But that's not much different than asking if some operation is legal on target X. Comment Actions Does handling of "0", when accessing index 0 for x = 0 not acceptable? What is the reason for this patch being on hold? Comment Actions
It could help with doing thing like https://reviews.llvm.org/D114964 earlier too, where the transfroms are not always profitable and not reversible back to the original code, but would be beneficial to do earlier to get better cost modelling and vectorization. Comment Actions Hello, I have a small query regarding this patch. The patch emits following llvm assembly for ctz table - -----Patch assembly------- rbit w8, w0 cmp w0, #0 clz w8, w8 csel w0, wzr, w8, eq ret but in gcc, we have the following assembly being emitted - ------------GCC--------------------- rbit w0, w0 clz w0, w0 and w0, w0, 31 ret Reference - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838 [1] My question is - To solve the bug, do I have to generate assembly similar to GCC? Please give me suggestions on moving forward in solving this bug. Comment Actions
That sounds like a backend optimization that could happen given we know the semantics of the AArch64 instruction. Comment Actions @craig.topper @spatel Do you think this can go as is? (P.S. I don't have access to any AARCH64 board currently, so I can see into the SPEC numbers.) Comment Actions I am wondering about general direction.. Is it worth it? On way to become compiler just for benchmarks? Spend compile time just to optimize one very very specific pattern from spec is bad thing imho. Can you show us some other real world “hits”? I assume any sane project already uses builtin to compute this value efficiently.
Comment Actions
. Comment Actions The same algorithm is documented here https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightMultLookup It's also nearby in the code in primesieve from the recent tzcnt discusson. Around line 143 if you expand the context on Erat.hpp here https://github.com/kimwalisch/primesieve/pull/109/files Granted that code knows when to use the tzcnt builtin instead of that code. I'm only mentioning it to show it is a known way to implement tzcnt that is used in more than just spec. Comment Actions Ideally this transformation should just emit proper intrinsic without need for target hook. What is the real problem? Comment Actions Hello - We were having a discussion about a very similar patch in D125755. I think the outcome for this patch is that either:
There are more details about why in D125755. I would go for the first option if it doesn't lead to worse performance, as for the second I'm not sure when it would be profitable to transform back and emit the table. You may not want to do that for non-hot ctzs? It sounds like it may be difficult to get right, but maybe I'm overestimating it. Comment Actions
You really just have to weigh it against the current default expansion on targets where ctlz/cttz aren't legal, which is popcount(v & -v). It should be a straightforward comparison, generally. If you have popcount, use it. If multiply is legal, use a table lookup. Otherwise... maybe stick with the popcount expansion? Probably any approach is expensive at that point. Compare the generated code for arm-eabi.
As opposed to what, calling into compiler-rt? Comment Actions The initial version of patch was implemented within CodeGenPrepare. And I think it should not introduce any performance regression.
Hmm... I need to think about that.
Seems that I need to find some time to catch up the conversation in D125755. :)
I am not sure I get the question. Comment Actions I guess we could measure something like that, but seems to me that it could introduce some performance regressions... Comment Actions I was meaning - it can be difficult for the compiler to recognize _when_ a ctz is performance critical. If the size of the table is large (which I was possible over-estimating the size of in my mind), then you may not want to emit the table for every ctz in the program. Currently that places where this is used have said, from the fact that they wrote it this way, that these ctz's are important. It just depends on whether converting to a table is always better, and if the table it small enough to be reasonable in the common case. 32 x i8 doesn't sound too big compared to what I originally imagined, if that is all it needs. Comment Actions For the table lookup, is there an algorithm for creating the special constant that is used in the multiply? Or would we just hardcode known constants for common sizes. Comment Actions Note, the popcount expansion already uses a multiply without checking if it is legal. Comment Actions
https://en.wikipedia.org/wiki/De_Bruijn_sequence has a description of the algorithm. Probably we'd just hard-code constants, though; practically speaking, the only sizes we actually care about are 16, 32, and 64. (For anything that doesn't fit in a single register, we probably just want to split the cttz.)
It's a relatively cheap multiply to expand on a target with a shifter, though. Comment Actions Thanks a lot for the comments! Can someone please sum the things up that need to be done for this? By implementing this in the CodeGenPrepare doesn't include/imply that we should get rid of the call to TTI.preferCTTZLowering(). Comment Actions Concretely, my preferred solution looks something like:
Comment Actions The latest update implements this.
Comment Actions @djtodoro - Will you be sending patch for (2) "Teach TargetLowering::expandCTTZ to emit a table lookup."? Comment Actions Unfortunately, I don’t have time to do it right now. If you are interested, please go ahead with the implementation. Comment Actions Basically this patch implements lowering of the table-based cttz implementation into @llvm.cttz unconditionally. For some targets it won't be that beneficial, so during the TargetLowering::expandCTTZ we should emit table lookup again. @efriedma May have something to add. Comment Actions I don't think I have much further to say. Emitting a table lookup from TargetLowering::expandCTTZ should be straightforward, I think. See DAGCombiner::convertSelectOfFPConstantsToLoadOffset for an example of how to emit a constant table. Comment Actions Is this patch ready to be merged? This is parent patch of https://reviews.llvm.org/D128911. Does child get merged before parent? Anyways, https://reviews.llvm.org/D128911 can be merged independently of this patch. Comment Actions D128911 needs to go in first, once that is done we can move forward with this one. It could do with a rebase and a clang-format.
Comment Actions @djtodoro Do you have any time to update this? Otherwise do you mind we take it over and we can update it and get it reviewed. Thanks. Comment Actions
Comment Actions Why are some test files still specifying a triple in the RUN line? It would be good to consolidate tests into less files if possible with better names/comments to explain exactly what differences are being tested in the sequence of tests. There should also be negative tests (wrong table constants, wrong magic multiplier, wrong shift amount, etc), so we know that the transform is not firing on mismatches. Comment Actions Leftovers. Thanks.
I will remove one redundant test. There are C producers as well as some top-level comments that explain what it should test (if the comment is needed). Furthermore, thanks for the suggestion for adding some negative tests - will add it.
Comment Actions Thanks for the comments.
Comment Actions
Comment Actions Thanks for the updates. I don't think I have anything else than is what is below. Any other comments from anyone else?
Comment Actions LGTM
Comment Actions Thanks for the updates. LGTM
Comment Actions Hi, I'm seeing a heap-use-after-free in AggressiveInstCombine in a build shortly after this landed, within a function added here. test/Transforms/AggressiveInstCombine/X86/sqrt.ll fails as follows under asan: https://gist.github.com/zygoloid/a270e65d32ab5b05504b3b0d5717f83b Please can you fix or revert? Comment Actions I've gone ahead and reverted for now to unblock things. Let me know if you're not able to reproduce the ASan failure and I can dig more into it. Comment Actions What was the bug, how was it fixed, and is there a new test to verify the fix? That should have been mentioned in the new commit message. Comment Actions You are right. I reverted the recommit, and I will recommit it with proper message, sorry I missed it. :/ Actually, the issue was that your patch D129167 introduced eraseFromParent and the tryToRecognizeTableBasedCttz would try to use the instruction (dyn_cast) after free. I just moved the tryToRecognizeTableBasedCttz above foldSqrt. I guess it does not need any additional test case. Comment Actions Ah, I see. Please put a comment on that call then - foldSqrt (or any other erasing transform) needs to be accounted for (last in the loop for now), or we might hit that bug. And yes, looks like we don't need another test since the existing regression test was flagged by the asan bot. |
clang-format: please reformat the code