This is an archive of the discontinued LLVM Phabricator instance.

[TargetLowering] Simplify (ctpop x) == 1
ClosedPublic

Authored by xbolva00 on Jun 7 2019, 4:04 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

xbolva00 created this revision.Jun 7 2019, 4:04 AM
Herald added a project: Restricted Project.Jun 7 2019, 4:04 AM

How does the output differ from the default expansion/optimization of ctpop? Please commit the tests with baseline CHECK lines, so we just show that diff. Add tests for another target too (AArch64?).

RKSimon added inline comments.Jun 7 2019, 6:53 AM
test/CodeGen/X86/ctpop-combine.ll
3 (On Diff #203530)

Enable common codegen; also, please commit this to trunk with the current codegen and rebase so the patch shows the diff.

Also, do we need corei7 if we're explicitly setting/clearing the popcnt attribute?

; RUN: llc < %s -mtriple=x86_64-unknown -mcpu=corei7 -mattr=+popcnt | FileCheck %s -check-prefixes=CHECK,POPCOUNT
; RUN: llc < %s -mtriple=x86_64-unknown -mcpu=corei7 -mattr=-popcnt | FileCheck %s -check-prefixes=CHECK,NO-POPCOUNT

Or should we prefer (x-1) < (x & -x) ? This seems like a much faster choice.

xbolva00 updated this revision to Diff 203577.Jun 7 2019, 10:02 AM

Faster expansion

Or should we prefer (x-1) < (x & -x) ? This seems like a much faster choice.

I'm guessing lack of CTPOP also likely implies lack of BMI1? ((x & x-1) is a single blsr instruction)
While that second variant looks like it has simpler IR, the asm looks more complex in general, with more arithmetic:
https://godbolt.org/z/iwIIKK
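
For illustration only (this snippet is ours, not part of the patch or the thread), the x & (x-1) form expressed via the BMI1 intrinsic; assuming an x86 compiler with BMI1 enabled (-mbmi), it compiles to a single BLSR:

#include <immintrin.h>

// _blsr_u32(x) computes x & (x - 1), i.e. clears the lowest set bit.
// The result is zero exactly when x had at most one bit set.
bool isPow2OrZero_blsr(unsigned x) {
  return _blsr_u32(x) == 0;
}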

xbolva00 added a comment.EditedJun 7 2019, 10:18 AM

@spatel Maybe we should just expand it in instcombine? I found something interesting...

Example: https://pastebin.com/NdMSsuri
Compiled with -O3 flag.

Even with this patch, the f1 and f2 variants are not the same (execution time); f2 is much faster. If I disable loop vectorization, they are almost the same. Maybe the loop vectorizer cannot handle ctpop very well? @craig.topper

bool f1(unsigned x) { // 0.245s
  return x && (x & x-1) == 0;
}

bool f2(unsigned x) { // 0.264s
  return (x-1) < (x & -x);
}

bool f3(unsigned x) { // 0.660s
  return __builtin_popcount(x) == 1; // current trunk
}
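
The pastebin is not reproduced here, so the harness below is only a hypothetical sketch of the kind of loop presumably being timed; with -O3 the call is inlined and the loop becomes a candidate for the loop vectorizer, which is where the ctpop difference shows up:

#include <cstdint>
#include <cstdio>

static bool f3(unsigned x) { return __builtin_popcount(x) == 1; }

int main() {
  uint64_t hits = 0;
  // Assumed workload: sweep all non-zero 32-bit values once.
  for (unsigned i = 1; i != 0; ++i)
    hits += f3(i);
  std::printf("%llu\n", (unsigned long long)hits);
  return 0;
}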

But if I expand __builtin_popcount(x) == 1 to the first or second pattern late, i.e. here in TargetLowering, I get about 0.6s too...

Or should we prefer (x-1) < (x & -x) ? This seems like a much faster choice.

I'm guessing lack of CTPOP also likely implies lack of BMI1? ((x & x-1) is a single blsr instruction)
While that second variant looks like it has simpler IR, the asm looks more complex in general, with more arithmetic:
https://godbolt.org/z/iwIIKK

Also, maybe (x << __builtin_clz(x)) == INT_MIN

So with -O3 we can have:

bsr     ecx, edi
xor     ecx, 31
sal     edi, cl
cmp     edi, -2147483648
sete    al
ret

But this looks slower in practice (see my previous post with the small "benchmark" code example).
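
For reference, a hedged C++ sketch of that clz variant (our code, not the patch); note that __builtin_clz(0) is undefined, so a real transform would need to guard x == 0 or rely on the target's CLZ-of-zero semantics:

#include <climits>

bool isPow2_clz(unsigned x) {
  // Shifting the highest set bit up to the MSB leaves exactly INT_MIN
  // iff no lower bits were set, i.e. x is a power of two (x != 0 assumed).
  return (x << __builtin_clz(x)) == (unsigned)INT_MIN;
}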

We need to check other targets if we do this as a generic combine

Or should we prefer (x-1) < (x & -x) ? This seems like a much faster choice.

I'm guessing lack of CTPOP also likely implies lack of BMI1? ((x & x-1) is a single blsr instruction)
While that second variant looks like it has simpler IR, the asm looks more complex in general, with more arithmetic:
https://godbolt.org/z/iwIIKK

And yes, if I turn off vectorization in my example, x && (x & x-1) == 0 is fastest.

We need to check other targets if we do this as a generic combine

Since we have already implemented transformations in TargetLowering like:
(ctpop x) u< 2 -> (x & x-1) == 0
(ctpop x) u> 1 -> (x & x-1) != 0

I think I should just take the first version (as suggested in the comment) and go ahead with the review...
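
For reference, a hedged sketch (our code, not the committed patch) of how the fold could be built in SelectionDAG terms inside TargetLowering, rewriting (ctpop X) == 1 as (X != 0) && ((X & (X - 1)) == 0):

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Helper name and signature are ours; DAG, dl and the setcc result type CCVT
// would come from the surrounding SimplifySetCC-style combine.
static SDValue expandCtpopEq1(SelectionDAG &DAG, const SDLoc &dl, EVT CCVT,
                              SDValue X) {
  EVT VT = X.getValueType();
  SDValue Zero = DAG.getConstant(0, dl, VT);
  SDValue One = DAG.getConstant(1, dl, VT);
  // X & (X - 1) clears the lowest set bit of X.
  SDValue Sub = DAG.getNode(ISD::SUB, dl, VT, X, One);
  SDValue And = DAG.getNode(ISD::AND, dl, VT, X, Sub);
  // (X & (X - 1)) == 0 is true for powers of two and for zero...
  SDValue IsPow2OrZero = DAG.getSetCC(dl, CCVT, And, Zero, ISD::SETEQ);
  // ...so additionally require X != 0.
  SDValue NotZero = DAG.getSetCC(dl, CCVT, X, Zero, ISD::SETNE);
  return DAG.getNode(ISD::AND, dl, CCVT, IsPow2OrZero, NotZero);
}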

xbolva00 updated this revision to Diff 203632.Jun 7 2019, 3:07 PM

Use first pattern version

Since @bkramer worked on this code some years ago, I added him as a reviewer.

spatel added a comment.Jun 9 2019, 8:00 AM

@spatel Maybe we should just expand it in instcombine? I found something interesting...

There needs to be justification for expansion of the intrinsic in IR. For example, the expansion allows further pattern matching/folds within instcombine that are made more difficult/impossible when the code uses the intrinsic. That would then have to be weighed against the likely harm caused to inlining/unrolling based on the IR cost models.

Even with this patch, the f1 and f2 variants are not the same (execution time); f2 is much faster. If I disable loop vectorization, they are almost the same. Maybe the loop vectorizer cannot handle ctpop very well? @craig.topper

That sounds like a vectorizer or cost model bug/enhancement rather than a good reason to change IR canonicalization.

@spatel Thanks for the review.

What about the current revision? Is it OK for you?

cc @craig.topper

@RKSimon had some patches for ctpop costs in the cost model, so maybe he could check my example to see if the cost model is the issue here.

spatel added a comment.Jun 9 2019, 8:40 AM

@spatel Thanks for the review.

What about the current revision? Is it OK for you?

Seems ok, but you didn't answer the request to add tests for another target. We want to have more confidence that this is universally good, so something besides x86 should be tested.

xbolva00 updated this revision to Diff 203743.Jun 9 2019, 9:13 AM

Precommitted ARM NEON test case.
Rebased.

spatel accepted this revision.Jun 9 2019, 9:58 AM

LGTM

This revision is now accepted and ready to land.Jun 9 2019, 9:58 AM
This revision was automatically updated to reflect the committed changes.

I know this is a little late, but is the second RUN line of test/CodeGen/AArch64/arm64-popcnt.ll correct? If I build with -DLLVM_TARGETS_TO_BUILD=AArch64;X86 I get an error. As far as I can tell (using grep -l 'RUN.*triple=armv8' -R ../test/), armv8a is dealt with as an ARM target, not an AArch64 one, which is weird in and of itself.

xbolva00 added a comment.EditedJun 11 2019, 3:43 AM

I know this is a little late, but is the second RUN line of test/CodeGen/AArch64/arm64-popcnt.ll correct? If I build with -DLLVM_TARGETS_TO_BUILD=AArch64;X86 I get an error. As far as I can tell (using grep -l 'RUN.*triple=armv8' -R ../test/), armv8a is dealt with as an ARM target, not an AArch64 one, which is weird in and of itself.

Ok, I will try to move it. Thanks for the report.