Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
How does the output differ from the default expansion/optimization of ctpop? Please commit the tests with baseline CHECK lines, so we just show that diff. Add tests for another target too (AArch64?).
test/CodeGen/X86/ctpop-combine.ll | ||
---|---|---|
3 ↗ | (On Diff #203530) | Enable common codegen, plus please can you commit this to trunk with current codegen and rebase so the patch shows the diff. Also, do we need corei7 if we're explicitly setting/clearing the popcnt attribute? ; RUN: llc < %s -mtriple=x86_64-unknown -mcpu=corei7 -mattr=+popcnt | FileCheck %s -check-prefixes=CHECK,POPCOUNT ; RUN: llc < %s -mtriple=x86_64-unknown -mcpu=corei7 -mattr=-popcnt | FileCheck %s -check-prefixes=CHECK,NO-POPCOUNT |
I'm guessing lack of CTPOP also likely implies lack of BMI1? ((x & x-1) is a single blsr instruction)
While that second variant looks like it has simpler ir, the ask looks more complex in general, more arithmetic:
https://godbolt.org/z/iwIIKK
@spatel Maybe we should just expand in it instcombine? I found something interesting...
Example: https://pastebin.com/NdMSsuri
Compiled with -O3 flag.
Even with this patch, f1 and f2 variants are not same (execution time), f2 is much faster. If I disable loop vectorization, there are almost same. Maybe loop vectorizer cannot handle ctpop very well? @craig.topper
bool f1(unsigned x) { // 0.245s
return x && (x & x-1) == 0;
}
bool f2(unsigned x) { // 0,264s
return (x-1) < (x & -x);
}
bool f3(unsigned x) { // 0.660s
return __builtin_popcount(x) == 1; // current trunk
}
But I expand __builtin_popcount(x) == 1 to first or second pattern late, ie. here in TargetLowering, I get +- 0.6s too...
Also maybe (x << __builtin_clz (x)) == INT_MIN
So with -O3 we can have:
bsr ecx, edi xor ecx, 31 sal edi, cl cmp edi, -2147483648 sete al ret
But this looks slower in practise (see my previous post with small "benchmark" code example).
Since we have already implemented tranformations in TargetLowering like:
(ctpop x) u< 2 -> (x & x-1) == 0
(ctpop x) u> 1 -> (x & x-1) != 0
I think I just should take this first version (was suggested in the comment) and go ahead with review...
There needs to be justification for expansion of the intrinsic in IR. For example, the expansion allows further pattern matching/folds within instcombine that are made more difficult/impossible when the code uses the intrinsic. That would then have to be weighed against the likely harm caused to inlining/unrolling based on the IR cost models.
Even with this patch, f1 and f2 variants are not same (execution time), f2 is much faster. If I disable loop vectorization, there are almost same. Maybe loop vectorizer cannot handle ctpop very well? @craig.topper
That sounds like a vectorizer or cost model bug/enhancement rather than a good reason to change IR canonicalization.
@RKSimon had some patches for ctpop costs in the cost model.. So maybe he could check my example if the cost model is issue here.
Seems ok, but you didn't answer the request to add tests for another target. We want to have more confidence that this is universally good, so something besides x86 should be tested.
I know this is a little late, but is the second run line of test/CodeGen/AArch64/arm64-popcnt.ll correct? If I build with -DLLVM_TARGETS_TO_BUILD=AArch64;X86 I get an error. As far as I can tell, using grep -l 'RUN.*triple=armv8' -R ../test/ armv8a is dealt with as an ARM target, not an AArch64 one, which is weird in and of itself.