Add a TLI hook to allow SelectionDAG to fine tune the conversion of CTPOP to a chain of "x & (x - 1)" when CTPOP isn't legal.
A subsequent patch will attempt to fine tune the X86 code gen.
Differential D89952
[SelectionDAG] Enable CTPOP optimization fine tuning davezarzycki on Oct 22 2020, 5:38 AM. Authored by
Details Add a TLI hook to allow SelectionDAG to fine tune the conversion of CTPOP to a chain of "x & (x - 1)" when CTPOP isn't legal. A subsequent patch will attempt to fine tune the X86 code gen.
Diff Detail
Unit Tests
Event TimelineComment Actions Also, this patch ought to be refined to optimize something like CTPOP(x) > 62 into CTPOP(NOT(x)) <= 2 or a chain of (x | (x + 1)) and a final comparison against a mask value instead of zero (whatever is more efficient). Comment Actions This is a bit general/off-topic for the patch review.
I'm not finding an example where this escapes from InstCombine - please file a bug report if you have one. There's no explicit transform for this, but it happens via 2 other transforms: InstCombinerImpl::foldICmpUsingKnownBits() converts the predicate to NE, and we have a fold for ctpop in InstCombinerImpl::foldICmpEqIntrinsicWithConstant(). You can see the sequence of transforms with "opt -instcombine -debug"
AFAICT, this always works for scalars, but not vectors. There's a chain reaction of transforms, but that is not triggered for vectors because range metadata doesn't support vectors: We can add an explicit fold to instsimplify to make this direct/consistent. Comment Actions Looking a bit closer: there's a more general bit of analysis in ValueTracking that seems to be missing that should solve this part. I'll post a patch.
Comment Actions Updated CTPOP emulation cost heuristics and test diffs so people can see the fallout. Overall, I tried to aim for the same or fewer number of instructions as before, which should be a win because chains of "x & (x - 1)" don't require memory loads, unlike actual CTPOP emulation. Comment Actions @spatel -- Here is the bug you requested: https://bugs.llvm.org/show_bug.cgi?id=47949 Comment Actions @spatel Somewhat related question if InstCombine canonicalizes (x && (x & x-1) ==0) to ctpop x == 1. Does that mean if ctpop isn't legal, the vectorization cost model will return a high cost for the ctpop part? Comment Actions Yes, that's a gap in the cost model that we don't currently have a good way to solve (AFAIK). We need to consider the cost of the expanded icmp (ctpop), C as a whole rather than individual instructions. Comment Actions Hi @craig.topper and @spatel – I'd like to break this patch into two parts. This patch is the first part: the mechanism that enables targets to tune the length of "x & (x - 1)" chains used to emulate unsigned greater-than/less-than comparisons against a CTPOP result. Comment Actions I added another IR fold for ctpop here: 5a6e66ec7238 So now I can't tell from the test diffs what the real motivating cases are. We usually like test coverage, but I wonder if we should remove tests for all of the un-optimized IR patterns that we now expect to be folded in IR. Comment Actions Hi @spatel – We can certainly remove the code gen tests that should have been handled by earlier optimization passes. Thanks for working on those. :-) As to the use case, I have AVX512 machines without BITALG/VPOPCNTDQ and a program that is doing hamming distance checks in critical sections. I.e. lots of low numbered "popcnt(x) < smallNumber" kind of checks that would be more efficiently implemented as a chain of legal "x & (x - 1)" instructions instead of the current/custom POPCNT emulation logic. This is especially true on my Knights Landing test box, which lacks AVX512BW and falling back to AVX2 to emulate VPOPCNTDQ is quite expensive. Comment Actions Ok - I have no doubts that there are cases we can optimize better. It would be great to see 1 or more reduced examples of these code patterns in source code. That way, we can confirm that we're not still missing any pre-codegen optimizations . Can you file a bug to show that? Comment Actions I added another example to the original bug: https://bugs.llvm.org/show_bug.cgi?id=47825 I think pre codegen just needs to worry about the boundary conditions: comparisons effectively against zero and comparisons effectively >= element/scalar bit size. Comment Actions Rebase. No change. The patch now has no test suite impact because all of the vector CTPOP tests that should be eliminated before code gen have been removed on trunk/master. Comment Actions Is this a no-functional-change patch for cases that are not reduced in IR? Do we have regression test coverage for a target with a legal CTPOP vector instruction (like AArch64 cnt)? Comment Actions I believe so. I'm only an occasional contributor to LLVM, so I could be missing something subtle.
As far as I can tell, the AArch64 test coverage for vector CTPOP is fairly sparse. I can try and expand it you want. Comment Actions That would be good - make it less likely that we have missed any cracks in the type/legality checks. I don't see any holes by inspection, so LGTM. But I think I misread this the 1st time - we could hoist the isOperationLegal check for vectors as the minimum NFC patch first?
Comment Actions I've rebased and pre-committed AArch64 and PPC tests. This is still NFC. Also, as requested, I've pre-committed the distracting code hoist. |
Remove this comment (and the copy of it below) if we have generalized this enough with the TLI override?