This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG] Enable CTPOP optimization fine tuning
ClosedPublic

Authored by davezarzycki on Oct 22 2020, 5:38 AM.

Details

Summary

Add a TLI hook to allow SelectionDAG to fine tune the conversion of CTPOP to a chain of "x & (x - 1)" when CTPOP isn't legal.

A subsequent patch will attempt to fine tune the X86 code gen.

Diff Detail

Event Timeline

davezarzycki created this revision.Oct 22 2020, 5:38 AM
davezarzycki requested review of this revision.Oct 22 2020, 5:38 AM

Also, this patch ought to be refined to optimize something like CTPOP(x) > 62 into CTPOP(NOT(x)) <= 2 or a chain of (x | (x + 1)) and a final comparison against a mask value instead of zero (whatever is more efficient).

This is a bit general/off-topic for the patch review.

Also, I'm not an expert at LLVM, but it seems like InstCombine is missing boundary optimizations that should have been caught before SelectionDAG. In particular:

ctpop(x) > 0 --> x != 0

I'm not finding an example where this escapes from InstCombine - please file a bug report if you have one. There's no explicit transform for this, but it happens via 2 other transforms: InstCombinerImpl::foldICmpUsingKnownBits() converts the predicate to NE, and we have a fold for ctpop in InstCombinerImpl::foldICmpEqIntrinsicWithConstant(). You can see the sequence of transforms with "opt -instcombine -debug"

ctpop(x) > (any size >= element size) --> always false
ctpop(x) < (any size >= element size + 1) --> always true

AFAICT, this always works for scalars, but not vectors. There's a chain reaction of transforms, but that is not triggered for vectors because range metadata doesn't support vectors:
http://llvm.org/docs/LangRef.html#range-metadata

We can add an explicit fold to instsimplify to make this direct/consistent.

ctpop(x) > (any size >= element size) --> always false
ctpop(x) < (any size >= element size + 1) --> always true

AFAICT, this always works for scalars, but not vectors. There's a chain reaction of transforms, but that is not triggered for vectors because range metadata doesn't support vectors:
http://llvm.org/docs/LangRef.html#range-metadata

We can add an explicit fold to instsimplify to make this direct/consistent.

Looking a bit closer: there's a more general bit of analysis in ValueTracking that seems to be missing that should solve this part. I'll post a patch.

craig.topper added inline comments.Oct 22 2020, 10:12 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
5340

What do you mean by horizontal add here?

5344

Why does having BWI increase the number? And why is element size not factored in here?

davezarzycki added inline comments.Oct 22 2020, 10:49 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
5340

I was being imprecise. I meant VPSADBW(zero, VPOPCNTB(x)), which for CTPOP emulation is purposes is a horizontal add of the byte-wise CTPOP.

5344

Please ignore that for now. I'm refining these heuristics. It's complicated by the fact that the vector CTPOP emulation on x86 doesn't always behave like I'd expect, so I need to go case by case. I'll upload the current heuristics that I'm working with before I stop working for the day.

Updated CTPOP emulation cost heuristics and test diffs so people can see the fallout. Overall, I tried to aim for the same or fewer number of instructions as before, which should be a win because chains of "x & (x - 1)" don't require memory loads, unlike actual CTPOP emulation.

craig.topper retitled this revision from [SelectionDAG] Fine tune CTPOP optimizations to [SelectionDAG][X86] Fine tune CTPOP optimizations.Oct 22 2020, 11:25 AM

@spatel Somewhat related question if InstCombine canonicalizes (x && (x & x-1) ==0) to ctpop x == 1. Does that mean if ctpop isn't legal, the vectorization cost model will return a high cost for the ctpop part?

@spatel Somewhat related question if InstCombine canonicalizes (x && (x & x-1) ==0) to ctpop x == 1. Does that mean if ctpop isn't legal, the vectorization cost model will return a high cost for the ctpop part?

Yes, that's a gap in the cost model that we don't currently have a good way to solve (AFAIK). We need to consider the cost of the expanded icmp (ctpop), C as a whole rather than individual instructions.

davezarzycki retitled this revision from [SelectionDAG][X86] Fine tune CTPOP optimizations to [SelectionDAG] Enable CTPOP optimization fine tuning.
davezarzycki edited the summary of this revision. (Show Details)

Hi @craig.topper and @spatel – I'd like to break this patch into two parts. This patch is the first part: the mechanism that enables targets to tune the length of "x & (x - 1)" chains used to emulate unsigned greater-than/less-than comparisons against a CTPOP result.

I added another IR fold for ctpop here: 5a6e66ec7238

So now I can't tell from the test diffs what the real motivating cases are. We usually like test coverage, but I wonder if we should remove tests for all of the un-optimized IR patterns that we now expect to be folded in IR.

Hi @spatel – We can certainly remove the code gen tests that should have been handled by earlier optimization passes. Thanks for working on those. :-)

As to the use case, I have AVX512 machines without BITALG/VPOPCNTDQ and a program that is doing hamming distance checks in critical sections. I.e. lots of low numbered "popcnt(x) < smallNumber" kind of checks that would be more efficiently implemented as a chain of legal "x & (x - 1)" instructions instead of the current/custom POPCNT emulation logic. This is especially true on my Knights Landing test box, which lacks AVX512BW and falling back to AVX2 to emulate VPOPCNTDQ is quite expensive.

Hi @spatel – We can certainly remove the code gen tests that should have been handled by earlier optimization passes. Thanks for working on those. :-)

As to the use case, I have AVX512 machines without BITALG/VPOPCNTDQ and a program that is doing hamming distance checks in critical sections. I.e. lots of low numbered "popcnt(x) < smallNumber" kind of checks that would be more efficiently implemented as a chain of legal "x & (x - 1)" instructions instead of the current/custom POPCNT emulation logic. This is especially true on my Knights Landing test box, which lacks AVX512BW and falling back to AVX2 to emulate VPOPCNTDQ is quite expensive.

Ok - I have no doubts that there are cases we can optimize better. It would be great to see 1 or more reduced examples of these code patterns in source code. That way, we can confirm that we're not still missing any pre-codegen optimizations . Can you file a bug to show that?

Hi @spatel – We can certainly remove the code gen tests that should have been handled by earlier optimization passes. Thanks for working on those. :-)

As to the use case, I have AVX512 machines without BITALG/VPOPCNTDQ and a program that is doing hamming distance checks in critical sections. I.e. lots of low numbered "popcnt(x) < smallNumber" kind of checks that would be more efficiently implemented as a chain of legal "x & (x - 1)" instructions instead of the current/custom POPCNT emulation logic. This is especially true on my Knights Landing test box, which lacks AVX512BW and falling back to AVX2 to emulate VPOPCNTDQ is quite expensive.

Ok - I have no doubts that there are cases we can optimize better. It would be great to see 1 or more reduced examples of these code patterns in source code. That way, we can confirm that we're not still missing any pre-codegen optimizations . Can you file a bug to show that?

I added another example to the original bug: https://bugs.llvm.org/show_bug.cgi?id=47825

I think pre codegen just needs to worry about the boundary conditions: comparisons effectively against zero and comparisons effectively >= element/scalar bit size.

Rebase. No change.

The patch now has no test suite impact because all of the vector CTPOP tests that should be eliminated before code gen have been removed on trunk/master.

Rebase. No change.

The patch now has no test suite impact because all of the vector CTPOP tests that should be eliminated before code gen have been removed on trunk/master.

Is this a no-functional-change patch for cases that are not reduced in IR? Do we have regression test coverage for a target with a legal CTPOP vector instruction (like AArch64 cnt)?

Rebase. No change.

The patch now has no test suite impact because all of the vector CTPOP tests that should be eliminated before code gen have been removed on trunk/master.

Is this a no-functional-change patch for cases that are not reduced in IR?

I believe so. I'm only an occasional contributor to LLVM, so I could be missing something subtle.

Do we have regression test coverage for a target with a legal CTPOP vector instruction (like AArch64 cnt)?

As far as I can tell, the AArch64 test coverage for vector CTPOP is fairly sparse. I can try and expand it you want.

spatel accepted this revision.Nov 8 2020, 7:06 AM

As far as I can tell, the AArch64 test coverage for vector CTPOP is fairly sparse. I can try and expand it you want.

That would be good - make it less likely that we have missed any cracks in the type/legality checks.

I don't see any holes by inspection, so LGTM. But I think I misread this the 1st time - we could hoist the isOperationLegal check for vectors as the minimum NFC patch first?

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
3408–3410

Remove this comment (and the copy of it below) if we have generalized this enough with the TLI override?

This revision is now accepted and ready to land.Nov 8 2020, 7:06 AM

I've rebased and pre-committed AArch64 and PPC tests. This is still NFC.

Also, as requested, I've pre-committed the distracting code hoist.

Removed the comment about x86 bias now that more testing exists.

This revision was landed with ongoing or failed builds.Nov 9 2020, 10:49 AM
This revision was automatically updated to reflect the committed changes.
llvm/lib/Target/X86/X86ISelLowering.cpp