This is an archive of the discontinued LLVM Phabricator instance.

[TargetLowering] Simplify (ctpop x) == 1
ClosedPublic

Authored by xbolva00 on Jun 7 2019, 4:04 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

xbolva00 created this revision.Jun 7 2019, 4:04 AM
Herald added a project: Restricted Project.Jun 7 2019, 4:04 AM

How does the output differ from the default expansion/optimization of ctpop? Please commit the tests with baseline CHECK lines, so we just show that diff. Add tests for another target too (AArch64?).

RKSimon added inline comments.Jun 7 2019, 6:53 AM
test/CodeGen/X86/ctpop-combine.ll
3 (On Diff #203530)

Enable common codegen; also, please commit this to trunk with the current codegen and rebase so the patch shows the diff.

Also, do we need corei7 if we're explicitly setting/clearing the popcnt attribute?

; RUN: llc < %s -mtriple=x86_64-unknown -mcpu=corei7 -mattr=+popcnt | FileCheck %s -check-prefixes=CHECK,POPCOUNT
; RUN: llc < %s -mtriple=x86_64-unknown -mcpu=corei7 -mattr=-popcnt | FileCheck %s -check-prefixes=CHECK,NO-POPCOUNT

Or should we prefer (x-1) < (x & -x) ? This seems like a much faster choice.

xbolva00 updated this revision to Diff 203577.Jun 7 2019, 10:02 AM

Faster expansion

Or should we prefer (x-1) < (x & -x) ? This seems like a much faster choice.

I'm guessing lack of CTPOP also likely implies lack of BMI1? ((x & x-1) is a single blsr instruction)
While that second variant looks like it has simpler IR, the asm looks more complex in general, with more arithmetic:
https://godbolt.org/z/iwIIKK
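
For illustration only (this snippet is ours, not part of the patch or the thread), the x & (x-1) form expressed via the BMI1 intrinsic; assuming an x86 compiler with BMI1 enabled (-mbmi), it compiles to a single BLSR:

#include <immintrin.h>

// _blsr_u32(x) computes x & (x - 1), i.e. clears the lowest set bit.
// The result is zero exactly when x had at most one bit set.
bool isPow2OrZero_blsr(unsigned x) {
  return _blsr_u32(x) == 0;
}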

xbolva00 added a comment.EditedJun 7 2019, 10:18 AM

@spatel Maybe we should just expand it in instcombine? I found something interesting...

Example: https://pastebin.com/NdMSsuri
Compiled with -O3 flag.

Even with this patch, the f1 and f2 variants are not the same (execution time); f2 is much faster. If I disable loop vectorization, they are almost the same. Maybe the loop vectorizer cannot handle ctpop very well? @craig.topper

bool f1(unsigned x) { // 0.245s
  return x && (x & x-1) == 0;
}

bool f2(unsigned x) { // 0.264s
  return (x-1) < (x & -x);
}

bool f3(unsigned x) { // 0.660s
  return __builtin_popcount(x) == 1; // current trunk
}
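
The pastebin is not reproduced here, so the harness below is only a hypothetical sketch of the kind of loop presumably being timed; with -O3 the call is inlined and the loop becomes a candidate for the loop vectorizer, which is where the ctpop difference shows up:

#include <cstdint>
#include <cstdio>

static bool f3(unsigned x) { return __builtin_popcount(x) == 1; }

int main() {
  uint64_t hits = 0;
  // Assumed workload: sweep all non-zero 32-bit values once.
  for (unsigned i = 1; i != 0; ++i)
    hits += f3(i);
  std::printf("%llu\n", (unsigned long long)hits);
  return 0;
}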

But if I expand __builtin_popcount(x) == 1 to the first or second pattern late, i.e. here in TargetLowering, I get about 0.6s too...

Or should we prefer (x-1) < (x & -x) ? This seems like a much faster choice.

I'm guessing lack of CTPOP also likely implies lack of BMI1? ((x & x-1) is a single blsr instruction)
While that second variant looks like it has simpler IR, the asm looks more complex in general, with more arithmetic:
https://godbolt.org/z/iwIIKK

Also, maybe (x << __builtin_clz(x)) == INT_MIN

So with -O3 we can have:

bsr     ecx, edi
xor     ecx, 31
sal     edi, cl
cmp     edi, -2147483648
sete    al
ret

But this looks slower in practice (see my previous post with the small "benchmark" code example).
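
For reference, a hedged C++ sketch of that clz variant (our code, not the patch); note that __builtin_clz(0) is undefined, so a real transform would need to guard x == 0 or rely on the target's CLZ-of-zero semantics:

#include <climits>

bool isPow2_clz(unsigned x) {
  // Shifting the highest set bit up to the MSB leaves exactly INT_MIN
  // iff no lower bits were set, i.e. x is a power of two (x != 0 assumed).
  return (x << __builtin_clz(x)) == (unsigned)INT_MIN;
}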

We need to check other targets if we do this as a generic combine

Or should we prefer (x-1) < (x & -x) ? This seems like a much faster choice.

I'm guessing lack of CTPOP also likely implies lack of BMI1? ((x & x-1) is a single blsr instruction)
While that second variant looks like it has simpler IR, the asm looks more complex in general, with more arithmetic:
https://godbolt.org/z/iwIIKK

And yes, if I turn off vectorization in my example, x && (x & x-1) == 0 is fastest.

We need to check other targets if we do this as a generic combine

Since we have already implemented transformations in TargetLowering like:
(ctpop x) u< 2 -> (x & x-1) == 0
(ctpop x) u> 1 -> (x & x-1) != 0

I think I should just take the first version (as suggested in the comment) and go ahead with the review...
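
For reference, a hedged sketch (our code, not the committed patch) of how the fold could be built in SelectionDAG terms inside TargetLowering, rewriting (ctpop X) == 1 as (X != 0) && ((X & (X - 1)) == 0):

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Helper name and signature are ours; DAG, dl and the setcc result type CCVT
// would come from the surrounding SimplifySetCC-style combine.
static SDValue expandCtpopEq1(SelectionDAG &DAG, const SDLoc &dl, EVT CCVT,
                              SDValue X) {
  EVT VT = X.getValueType();
  SDValue Zero = DAG.getConstant(0, dl, VT);
  SDValue One = DAG.getConstant(1, dl, VT);
  // X & (X - 1) clears the lowest set bit of X.
  SDValue Sub = DAG.getNode(ISD::SUB, dl, VT, X, One);
  SDValue And = DAG.getNode(ISD::AND, dl, VT, X, Sub);
  // (X & (X - 1)) == 0 is true for powers of two and for zero...
  SDValue IsPow2OrZero = DAG.getSetCC(dl, CCVT, And, Zero, ISD::SETEQ);
  // ...so additionally require X != 0.
  SDValue NotZero = DAG.getSetCC(dl, CCVT, X, Zero, ISD::SETNE);
  return DAG.getNode(ISD::AND, dl, CCVT, IsPow2OrZero, NotZero);
}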

xbolva00 updated this revision to Diff 203632.Jun 7 2019, 3:07 PM

Use first pattern version

Since @bkramer worked on this code some years ago, I added him as a reviewer.

spatel added a comment.Jun 9 2019, 8:00 AM

@spatel Maybe we should just expand it in instcombine? I found something interesting...

There needs to be justification for expansion of the intrinsic in IR. For example, the expansion allows further pattern matching/folds within instcombine that are made more difficult/impossible when the code uses the intrinsic. That would then have to be weighed against the likely harm caused to inlining/unrolling based on the IR cost models.

Even with this patch, the f1 and f2 variants are not the same (execution time); f2 is much faster. If I disable loop vectorization, they are almost the same. Maybe the loop vectorizer cannot handle ctpop very well? @craig.topper

That sounds like a vectorizer or cost model bug/enhancement rather than a good reason to change IR canonicalization.

@spatel Thanks for the review.

What about the current revision? Is it OK for you?

cc @craig.topper

@RKSimon had some patches for ctpop costs in the cost model, so maybe he could check my example to see if the cost model is the issue here.

spatel added a comment.Jun 9 2019, 8:40 AM

@spatel Thanks for the review.

What about the current revision? Is it OK for you?

Seems ok, but you didn't answer the request to add tests for another target. We want to have more confidence that this is universally good, so something besides x86 should be tested.

xbolva00 updated this revision to Diff 203743.Jun 9 2019, 9:13 AM

Precommitted ARM NEON test case.
Rebased.

spatel accepted this revision.Jun 9 2019, 9:58 AM

LGTM

This revision is now accepted and ready to land.Jun 9 2019, 9:58 AM
This revision was automatically updated to reflect the committed changes.

I know this is a little late, but is the second RUN line of test/CodeGen/AArch64/arm64-popcnt.ll correct? If I build with -DLLVM_TARGETS_TO_BUILD=AArch64;X86 I get an error. As far as I can tell (using grep -l 'RUN.*triple=armv8' -R ../test/), armv8a is dealt with as an ARM target, not an AArch64 one, which is weird in and of itself.

xbolva00 added a comment.EditedJun 11 2019, 3:43 AM

I know this is a little late, but is the second RUN line of test/CodeGen/AArch64/arm64-popcnt.ll correct? If I build with -DLLVM_TARGETS_TO_BUILD=AArch64;X86 I get an error. As far as I can tell (using grep -l 'RUN.*triple=armv8' -R ../test/), armv8a is dealt with as an ARM target, not an AArch64 one, which is weird in and of itself.

Ok, I will try to move it. Thanks for the report.