Download Raw Diff

Details

Reviewers

craig.topper
spatel
RKSimon
bkramer
fhahn
pengfei
ecnelises

Commits

rGa41ea782c8e1: [SelectionDAG] Enable CTPOP optimization fine tuning

Summary

Add a TLI hook to allow SelectionDAG to fine tune the conversion of CTPOP to a chain of "x & (x - 1)" when CTPOP isn't legal.

A subsequent patch will attempt to fine tune the X86 code gen.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

davezarzycki created this revision.Oct 22 2020, 5:38 AM

Herald added a subscriber: hiraditya. · View Herald TranscriptOct 22 2020, 5:38 AM

davezarzycki requested review of this revision.Oct 22 2020, 5:38 AM

Harbormaster completed remote builds in B76021: Diff 299935.Oct 22 2020, 5:39 AM

Also, this patch ought to be refined to optimize something like CTPOP(x) > 62 into CTPOP(NOT(x)) <= 2 or a chain of (x | (x + 1)) and a final comparison against a mask value instead of zero (whatever is more efficient).

This is a bit general/off-topic for the patch review.

Also, I'm not an expert at LLVM, but it seems like InstCombine is missing boundary optimizations that should have been caught before SelectionDAG. In particular:

ctpop(x) > 0 --> x != 0

I'm not finding an example where this escapes from InstCombine - please file a bug report if you have one. There's no explicit transform for this, but it happens via 2 other transforms: InstCombinerImpl::foldICmpUsingKnownBits() converts the predicate to NE, and we have a fold for ctpop in InstCombinerImpl::foldICmpEqIntrinsicWithConstant(). You can see the sequence of transforms with "opt -instcombine -debug"

ctpop(x) > (any size >= element size) --> always false
ctpop(x) < (any size >= element size + 1) --> always true

AFAICT, this always works for scalars, but not vectors. There's a chain reaction of transforms, but that is not triggered for vectors because range metadata doesn't support vectors:
http://llvm.org/docs/LangRef.html#range-metadata

We can add an explicit fold to instsimplify to make this direct/consistent.

In D89952#2347710, @spatel wrote:

ctpop(x) > (any size >= element size) --> always false
ctpop(x) < (any size >= element size + 1) --> always true

AFAICT, this always works for scalars, but not vectors. There's a chain reaction of transforms, but that is not triggered for vectors because range metadata doesn't support vectors:
http://llvm.org/docs/LangRef.html#range-metadata

We can add an explicit fold to instsimplify to make this direct/consistent.

Looking a bit closer: there's a more general bit of analysis in ValueTracking that seems to be missing that should solve this part. I'll post a patch.

craig.topper added inline comments.Oct 22 2020, 10:12 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
5340 ↗	(On Diff #299935)	What do you mean by horizontal add here?
5344 ↗	(On Diff #299935)	Why does having BWI increase the number? And why is element size not factored in here?

spatel mentioned this in D89976: [ValueTracking] add range limits for ctpop.Oct 22 2020, 10:46 AM

davezarzycki added inline comments.Oct 22 2020, 10:49 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
5340 ↗	(On Diff #299935)	I was being imprecise. I meant VPSADBW(zero, VPOPCNTB(x)), which for CTPOP emulation is purposes is a horizontal add of the byte-wise CTPOP.
5344 ↗	(On Diff #299935)	Please ignore that for now. I'm refining these heuristics. It's complicated by the fact that the vector CTPOP emulation on x86 doesn't always behave like I'd expect, so I need to go case by case. I'll upload the current heuristics that I'm working with before I stop working for the day.

Updated CTPOP emulation cost heuristics and test diffs so people can see the fallout. Overall, I tried to aim for the same or fewer number of instructions as before, which should be a win because chains of "x & (x - 1)" don't require memory loads, unlike actual CTPOP emulation.

craig.topper retitled this revision from [SelectionDAG] Fine tune CTPOP optimizations to [SelectionDAG][X86] Fine tune CTPOP optimizations.Oct 22 2020, 11:25 AM

Harbormaster completed remote builds in B76070: Diff 300054.Oct 22 2020, 11:29 AM

@spatel -- Here is the bug you requested: https://bugs.llvm.org/show_bug.cgi?id=47949

@spatel Somewhat related question if InstCombine canonicalizes (x && (x & x-1) ==0) to ctpop x == 1. Does that mean if ctpop isn't legal, the vectorization cost model will return a high cost for the ctpop part?

In D89952#2348102, @craig.topper wrote:

@spatel Somewhat related question if InstCombine canonicalizes (x && (x & x-1) ==0) to ctpop x == 1. Does that mean if ctpop isn't legal, the vectorization cost model will return a high cost for the ctpop part?

Yes, that's a gap in the cost model that we don't currently have a good way to solve (AFAIK). We need to consider the cost of the expanded icmp (ctpop), C as a whole rather than individual instructions.

spatel mentioned this in rG748ecc6b3260: [ValueTracking] add range limits for ctpop.Oct 23 2020, 5:18 AM

spatel mentioned this in rGc72198079df6: [ValueTracking] add range limits for cttz.Oct 23 2020, 6:04 AM

spatel mentioned this in rG3fb0d6b0d55b: [ValueTracking] add range limits for ctlz.

Hi @craig.topper and @spatel – I'd like to break this patch into two parts. This patch is the first part: the mechanism that enables targets to tune the length of "x & (x - 1)" chains used to emulate unsigned greater-than/less-than comparisons against a CTPOP result.

Harbormaster completed remote builds in B76202: Diff 300296.Oct 23 2020, 8:08 AM

I added another IR fold for ctpop here: 5a6e66ec7238

So now I can't tell from the test diffs what the real motivating cases are. We usually like test coverage, but I wonder if we should remove tests for all of the un-optimized IR patterns that we now expect to be folded in IR.

Hi @spatel – We can certainly remove the code gen tests that should have been handled by earlier optimization passes. Thanks for working on those. :-)

As to the use case, I have AVX512 machines without BITALG/VPOPCNTDQ and a program that is doing hamming distance checks in critical sections. I.e. lots of low numbered "popcnt(x) < smallNumber" kind of checks that would be more efficiently implemented as a chain of legal "x & (x - 1)" instructions instead of the current/custom POPCNT emulation logic. This is especially true on my Knights Landing test box, which lacks AVX512BW and falling back to AVX2 to emulate VPOPCNTDQ is quite expensive.

In D89952#2355869, @davezarzycki wrote:

Hi @spatel – We can certainly remove the code gen tests that should have been handled by earlier optimization passes. Thanks for working on those. :-)

As to the use case, I have AVX512 machines without BITALG/VPOPCNTDQ and a program that is doing hamming distance checks in critical sections. I.e. lots of low numbered "popcnt(x) < smallNumber" kind of checks that would be more efficiently implemented as a chain of legal "x & (x - 1)" instructions instead of the current/custom POPCNT emulation logic. This is especially true on my Knights Landing test box, which lacks AVX512BW and falling back to AVX2 to emulate VPOPCNTDQ is quite expensive.

Ok - I have no doubts that there are cases we can optimize better. It would be great to see 1 or more reduced examples of these code patterns in source code. That way, we can confirm that we're not still missing any pre-codegen optimizations . Can you file a bug to show that?

In D89952#2356168, @spatel wrote:

In D89952#2355869, @davezarzycki wrote:

Hi @spatel – We can certainly remove the code gen tests that should have been handled by earlier optimization passes. Thanks for working on those. :-)

As to the use case, I have AVX512 machines without BITALG/VPOPCNTDQ and a program that is doing hamming distance checks in critical sections. I.e. lots of low numbered "popcnt(x) < smallNumber" kind of checks that would be more efficiently implemented as a chain of legal "x & (x - 1)" instructions instead of the current/custom POPCNT emulation logic. This is especially true on my Knights Landing test box, which lacks AVX512BW and falling back to AVX2 to emulate VPOPCNTDQ is quite expensive.

Ok - I have no doubts that there are cases we can optimize better. It would be great to see 1 or more reduced examples of these code patterns in source code. That way, we can confirm that we're not still missing any pre-codegen optimizations . Can you file a bug to show that?

I added another example to the original bug: https://bugs.llvm.org/show_bug.cgi?id=47825

I think pre codegen just needs to worry about the boundary conditions: comparisons effectively against zero and comparisons effectively >= element/scalar bit size.

Rebase. No change.

The patch now has no test suite impact because all of the vector CTPOP tests that should be eliminated before code gen have been removed on trunk/master.

Harbormaster completed remote builds in B76747: Diff 301312.Oct 28 2020, 11:09 AM

In D89952#2359357, @davezarzycki wrote:

Rebase. No change.

The patch now has no test suite impact because all of the vector CTPOP tests that should be eliminated before code gen have been removed on trunk/master.

Is this a no-functional-change patch for cases that are not reduced in IR? Do we have regression test coverage for a target with a legal CTPOP vector instruction (like AArch64 cnt)?

In D89952#2371747, @spatel wrote:

In D89952#2359357, @davezarzycki wrote:

Rebase. No change.

The patch now has no test suite impact because all of the vector CTPOP tests that should be eliminated before code gen have been removed on trunk/master.

Is this a no-functional-change patch for cases that are not reduced in IR?

I believe so. I'm only an occasional contributor to LLVM, so I could be missing something subtle.

Do we have regression test coverage for a target with a legal CTPOP vector instruction (like AArch64 cnt)?

As far as I can tell, the AArch64 test coverage for vector CTPOP is fairly sparse. I can try and expand it you want.

In D89952#2375652, @davezarzycki wrote:

As far as I can tell, the AArch64 test coverage for vector CTPOP is fairly sparse. I can try and expand it you want.

That would be good - make it less likely that we have missed any cracks in the type/legality checks.

I don't see any holes by inspection, so LGTM. But I think I misread this the 1st time - we could hoist the isOperationLegal check for vectors as the minimum NFC patch first?