Details

Reviewers

craig.topper
spatel
RKSimon
bkramer

Commits

rGdcf5e6abdf0c: [TargetLowering] Simplify (ctpop x) == 1
rL362912: [TargetLowering] Simplify (ctpop x) == 1

Diff Detail

Repository: rL LLVM

Event Timeline

xbolva00 created this revision. Jun 7 2019, 4:04 AM

Herald added a project: Restricted Project. Jun 7 2019, 4:04 AM

Herald added a subscriber: llvm-commits.

How does the output differ from the default expansion/optimization of ctpop? Please commit the tests with baseline CHECK lines, so we just show that diff. Add tests for another target too (AArch64?).

RKSimon added inline comments. Jun 7 2019, 6:53 AM

test/CodeGen/X86/ctpop-combine.ll (line 3)
Enable common codegen, and please commit this to trunk with the current codegen and rebase so the patch shows the diff. Also, do we need corei7 if we're explicitly setting/clearing the popcnt attribute?
; RUN: llc < %s -mtriple=x86_64-unknown -mcpu=corei7 -mattr=+popcnt | FileCheck %s -check-prefixes=CHECK,POPCOUNT
; RUN: llc < %s -mtriple=x86_64-unknown -mcpu=corei7 -mattr=-popcnt | FileCheck %s -check-prefixes=CHECK,NO-POPCOUNT

xbolva00 mentioned this in rG43f8ce44b7c5: [NFC] Added tests for D63004. Jun 7 2019, 7:03 AM

xbolva00 mentioned this in rL362801: [NFC] Added tests for D63004.

Rebased

Or should we prefer (x-1) < (x & -x) ? This seems like a much faster choice.

Faster expansion

In D63004#1534270, @xbolva00 wrote:

Or should we prefer (x-1) < (x & -x) ? This seems like a much faster choice.

I'm guessing lack of CTPOP also likely implies lack of BMI1? ((x & x-1) is a single blsr instruction)
While that second variant looks like it has simpler IR, the asm looks more complex in general, with more arithmetic:
https://godbolt.org/z/iwIIKK

@spatel Maybe we should just expand it in instcombine? I found something interesting...

Example: https://pastebin.com/NdMSsuri
Compiled with -O3 flag.

Even with this patch, the f1 and f2 variants are not the same in execution time; f2 is much faster. If I disable loop vectorization, they are almost the same. Maybe the loop vectorizer cannot handle ctpop very well? @craig.topper

bool f1(unsigned x) { // 0.245s
  return x && (x & x-1) == 0;
}

bool f2(unsigned x) { // 0.264s
  return (x-1) < (x & -x);
}

bool f3(unsigned x) { // 0.660s
  return __builtin_popcount(x) == 1; // current trunk
}

But if I expand __builtin_popcount(x) == 1 to the first or second pattern late, i.e. here in TargetLowering, I still get about 0.6s...
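
As a sanity check on the variants above, the following sketch compares the f1 and f2 patterns against a reference bit count over all 16-bit inputs plus a few wide values. It is an illustration only: `popcount_ref`, `is_pow2_and`, `is_pow2_cmp`, and `variants_agree` are hypothetical names, not code from the patch, and the sketch checks equivalence, not the timings quoted above.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Reference bit count (hypothetical helper, not part of the patch).
inline unsigned popcount_ref(std::uint32_t x) {
  return static_cast<unsigned>(std::bitset<32>(x).count());
}

// f1: x is non-zero and clearing its lowest set bit leaves zero.
inline bool is_pow2_and(std::uint32_t x) {
  return x != 0 && (x & (x - 1)) == 0;
}

// f2: x - 1 is below the isolated lowest set bit.
// For x == 0 the subtraction wraps to 0xFFFFFFFF and the test is false.
inline bool is_pow2_cmp(std::uint32_t x) {
  return (x - 1) < (x & (0u - x));
}

// True if both variants match popcount(x) == 1 on every tested input.
inline bool variants_agree() {
  const std::uint32_t wide[] = {0x80000000u, 0xFFFFFFFFu, 0x80000001u,
                                0x7FFFFFFFu};
  for (std::uint32_t x = 0; x <= 0xFFFFu; ++x) {
    bool want = popcount_ref(x) == 1;
    if (is_pow2_and(x) != want || is_pow2_cmp(x) != want) return false;
  }
  for (std::uint32_t x : wide) {
    bool want = popcount_ref(x) == 1;
    if (is_pow2_and(x) != want || is_pow2_cmp(x) != want) return false;
  }
  return true;
}

// Usage: aborts if any tested input disagrees.
inline void self_test() { assert(variants_agree()); }
```

The check uses `std::bitset` rather than `__builtin_popcount` only to stay compiler-independent.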

In D63004#1534334, @lebedev.ri wrote:

In D63004#1534270, @xbolva00 wrote:

Or should we prefer (x-1) < (x & -x) ? This seems like a much faster choice.

I'm guessing lack of CTPOP also likely implies lack of BMI1? ((x & x-1) is a single blsr instruction)
While that second variant looks like it has simpler IR, the asm looks more complex in general, with more arithmetic:
https://godbolt.org/z/iwIIKK

Also maybe (x << __builtin_clz (x)) == INT_MIN

So with -O3 we can have:

bsr     ecx, edi
xor     ecx, 31
sal     edi, cl
cmp     edi, -2147483648
sete    al
ret

But this looks slower in practice (see my previous post with the small "benchmark" code example).
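
One caveat with the clz-based form is the zero input: GCC and Clang document the result of `__builtin_clz(0)` as undefined, so a source-level version needs a guard. The sketch below is an assumption-laden illustration, not code from the review; `clz32`, `is_pow2_clz`, and `clz_variant_matches_popcount` are hypothetical names, and a portable loop stands in for the builtin.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Portable count-leading-zeros (hypothetical stand-in for __builtin_clz).
// Precondition: x != 0, otherwise the loop would never terminate.
inline unsigned clz32(std::uint32_t x) {
  unsigned n = 0;
  while ((x & 0x80000000u) == 0) {
    x <<= 1;
    ++n;
  }
  return n;
}

// (x << clz(x)) == INT_MIN, written with unsigned arithmetic and a zero guard.
inline bool is_pow2_clz(std::uint32_t x) {
  if (x == 0) return false;  // zero has no set bits and clz is unusable here
  return (x << clz32(x)) == 0x80000000u;
}

// True if the clz form matches popcount(x) == 1 on every tested input.
inline bool clz_variant_matches_popcount() {
  for (std::uint32_t x = 0; x <= 0xFFFFu; ++x) {
    bool want = std::bitset<32>(x).count() == 1;
    if (is_pow2_clz(x) != want) return false;
  }
  return is_pow2_clz(0x80000000u) && !is_pow2_clz(0xC0000000u);
}
```

Shifting the highest set bit into the sign position leaves exactly `0x80000000` only when no lower bits were set, which is why the comparison works for non-zero inputs.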

We need to check other targets if we do this as a generic combine

In D63004#1534334, @lebedev.ri wrote:

In D63004#1534270, @xbolva00 wrote:

Or should we prefer (x-1) < (x & -x) ? This seems like a much faster choice.

I'm guessing lack of CTPOP also likely implies lack of BMI1? ((x & x-1) is a single blsr instruction)
While that second variant looks like it has simpler IR, the asm looks more complex in general, with more arithmetic:
https://godbolt.org/z/iwIIKK

And yes, if I turn off vectorization in my example, x && (x & x-1) == 0 is fastest.

In D63004#1534364, @RKSimon wrote:

We need to check other targets if we do this as a generic combine

Since we have already implemented transformations in TargetLowering like:
(ctpop x) u< 2 -> (x & x-1) == 0
(ctpop x) u> 1 -> (x & x-1) != 0

I think I should just take the first version (as suggested in the comment) and go ahead with the review...
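
The two existing folds mentioned above can be checked at the source level, which also shows why the `== 1` case needs an additional non-zero test. This is a minimal sketch under the assumption that scalar C++ arithmetic models the DAG semantics; `existing_folds_hold` and `zero_needs_extra_check` are hypothetical names.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Checks both existing folds against a reference bit count for 16-bit inputs.
inline bool existing_folds_hold() {
  for (std::uint32_t x = 0; x <= 0xFFFFu; ++x) {
    unsigned pop = static_cast<unsigned>(std::bitset<32>(x).count());
    bool and_is_zero = (x & (x - 1)) == 0;
    if ((pop < 2) != and_is_zero) return false;   // (ctpop x) u< 2
    if ((pop > 1) != !and_is_zero) return false;  // (ctpop x) u> 1
  }
  return true;
}

// x == 0 passes the (x & x-1) == 0 test but has no bits set, so
// (ctpop x) == 1 needs the extra x != 0 condition.
inline bool zero_needs_extra_check() {
  std::uint32_t x = 0;
  bool and_is_zero = (x & (x - 1)) == 0;
  return and_is_zero && std::bitset<32>(x).count() != 1;
}
```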

Use first pattern version

Since @bkramer worked on this code some years ago, I added him as a reviewer.

In D63004#1534341, @xbolva00 wrote:

@spatel Maybe we should just expand it in instcombine? I found something interesting...

There needs to be justification for expansion of the intrinsic in IR. For example, the expansion allows further pattern matching/folds within instcombine that are made more difficult/impossible when the code uses the intrinsic. That would then have to be weighed against the likely harm caused to inlining/unrolling based on the IR cost models.

Even with this patch, the f1 and f2 variants are not the same in execution time; f2 is much faster. If I disable loop vectorization, they are almost the same. Maybe the loop vectorizer cannot handle ctpop very well? @craig.topper

That sounds like a vectorizer or cost model bug/enhancement rather than a good reason to change IR canonicalization.

@spatel Thanks for review

What about current revision? Is it OK for you?

cc @craig.topper

@RKSimon had some patches for ctpop costs in the cost model, so maybe he could check my example to see whether the cost model is the issue here.

In D63004#1535595, @xbolva00 wrote:

@spatel Thanks for review

What about current revision? Is it OK for you?

Seems ok, but you didn't answer the request to add tests for another target. We want to have more confidence that this is universally good, so something besides x86 should be tested.

Precommitted ARM NEON test case.
Rebased.

Herald added a subscriber: javed.absar. Jun 9 2019, 9:13 AM

xbolva00 mentioned this in rL362908: [NFC] Adjust test for D63004. Jun 9 2019, 9:14 AM

xbolva00 mentioned this in rG96ccd690f8eb: [NFC] Adjust test for D63004.

LGTM

This revision is now accepted and ready to land. Jun 9 2019, 9:58 AM

Closed by commit rL362912: [TargetLowering] Simplify (ctpop x) == 1 (authored by xbolva00). Jun 9 2019, 11:15 AM

This revision was automatically updated to reflect the committed changes.

I know this is a little late, but is the second run line of test/CodeGen/AArch64/arm64-popcnt.ll correct? If I build with -DLLVM_TARGETS_TO_BUILD=AArch64;X86 I get an error. As far as I can tell, using grep -l 'RUN.*triple=armv8' -R ../test/ armv8a is dealt with as an ARM target, not an AArch64 one, which is weird in and of itself.

In D63004#1537476, @joelkevinjones wrote:

I know this is a little late, but is the second run line of test/CodeGen/AArch64/arm64-popcnt.ll correct? If I build with -DLLVM_TARGETS_TO_BUILD=AArch64;X86 I get an error. As far as I can tell, using grep -l 'RUN.*triple=armv8' -R ../test/ armv8a is dealt with as an ARM target, not an AArch64 one, which is weird in and of itself.

Ok, I will try to move it. Thanks for the report.

spatel mentioned this in rG0baacea2c7ea: [AArch64][x86] add tests for ctpop != 1; NFC. Jun 25 2019, 6:42 AM

spatel mentioned this in rL364314: [AArch64][x86] add tests for ctpop != 1; NFC. Jun 25 2019, 6:46 AM

spatel mentioned this in rL364319: [SDAG] expand ctpop != 1. Jun 25 2019, 7:52 AM

spatel mentioned this in rG685c5cbc654f: [SDAG] expand ctpop != 1.

Diff 203530

lib/CodeGen/SelectionDAG/TargetLowering.cpp

[2,683 lines not shown; enclosing context: if (CTPOP.hasOneUse() && CTPOP.getOpcode() == ISD::CTPOP &&]
       if ((Cond == ISD::SETULT && C1 == 2) || (Cond == ISD::SETUGT && C1 == 1)){
         SDValue Sub = DAG.getNode(ISD::SUB, dl, CTVT, CTOp,
                                   DAG.getConstant(1, dl, CTVT));
         SDValue And = DAG.getNode(ISD::AND, dl, CTVT, CTOp, Sub);
         ISD::CondCode CC = Cond == ISD::SETULT ? ISD::SETEQ : ISD::SETNE;
         return DAG.getSetCC(dl, VT, And, DAG.getConstant(0, dl, CTVT), CC);
       }
 
-      // TODO: (ctpop x) == 1 -> x && (x & x-1) == 0 iff ctpop is illegal.
+      // (ctpop x) == 1 -> x && (x & x-1) == 0 iff ctpop is illegal.
+      if (Cond == ISD::SETEQ && C1 == 1 &&
+          !isOperationLegalOrCustom(ISD::CTPOP, CTVT)) {
+        SDValue Sub =
+            DAG.getNode(ISD::SUB, dl, CTVT, CTOp, DAG.getConstant(1, dl, CTVT));
+        SDValue And = DAG.getNode(ISD::AND, dl, CTVT, CTOp, Sub);
+        SDValue LHS = DAG.getSetCC(dl, VT, CTOp, DAG.getConstant(0, dl, CTVT),
+                                   ISD::SETUGT);
+        SDValue RHS =
+            DAG.getSetCC(dl, VT, And, DAG.getConstant(0, dl, CTVT), ISD::SETEQ);
+        return DAG.getNode(ISD::AND, dl, VT, LHS, RHS);
+      }
     }
 
     // (zext x) == C --> x == (trunc C)
     // (sext x) == C --> x == (trunc C)
     if ((Cond == ISD::SETEQ || Cond == ISD::SETNE) &&
         DCI.isBeforeLegalize() && N0->hasOneUse()) {
       unsigned MinBits = N0.getValueSizeInBits();
       SDValue PreExt;
[3,386 lines not shown]
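
To make the node sequence in the added hunk easier to follow, here is a scalar model of it, with one statement per DAG node. It is a sketch under the assumption that unsigned C++ arithmetic mirrors the DAG semantics; the template parameter `T` stands in for `CTVT`, and `ctpop_eq_one_expanded` and `matches_popcount_exhaustively` are hypothetical names that do not exist in LLVM.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <limits>

// Scalar model of the nodes built for (ctpop x) == 1; T plays the role of CTVT.
template <typename T>
bool ctpop_eq_one_expanded(T x) {
  T sub = static_cast<T>(x - T(1));   // ISD::SUB x, 1 (wraps for x == 0)
  T masked = static_cast<T>(x & sub); // ISD::AND x, sub
  bool lhs = x > T(0);                // setcc x, 0, SETUGT
  bool rhs = masked == T(0);          // setcc and, 0, SETEQ
  return lhs && rhs;                  // ISD::AND lhs, rhs on i1-like values
}

// Exhaustive comparison against a reference bit count.
// Only intended for 8-bit and 16-bit unsigned types.
template <typename T>
bool matches_popcount_exhaustively() {
  static_assert(sizeof(T) <= 2, "exhaustive check is meant for small types");
  const std::uint32_t max = std::numeric_limits<T>::max();
  for (std::uint32_t v = 0; v <= max; ++v) {
    bool want = std::bitset<32>(v).count() == 1;
    if (ctpop_eq_one_expanded<T>(static_cast<T>(v)) != want) return false;
  }
  return true;
}
```

The explicit casts back to `T` matter for narrow types, because C++ promotes `uint8_t` and `uint16_t` operands to `int` before the subtraction.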

test/CodeGen/X86/ctpop-combine.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=x86_64-unknown -mcpu=corei7 | FileCheck %s
+; RUN: llc < %s -mtriple=x86_64-unknown -mcpu=corei7 -mattr=+popcnt | FileCheck %s
+; RUN: llc < %s -mtriple=x86_64-unknown -mcpu=corei7 -mattr=-popcnt | FileCheck %s -check-prefix=NO-POPCOUNT

 declare i8 @llvm.ctpop.i8(i8) nounwind readnone
 declare i64 @llvm.ctpop.i64(i64) nounwind readnone
 
 define i32 @test1(i64 %x) nounwind readnone {
 ; CHECK-LABEL: test1:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: leaq -1(%rdi), %rcx
 ; CHECK-NEXT: xorl %eax, %eax
 ; CHECK-NEXT: testq %rcx, %rdi
 ; CHECK-NEXT: setne %al
 ; CHECK-NEXT: retq
+;
+; NO-POPCOUNT-LABEL: test1:
+; NO-POPCOUNT: # %bb.0:
+; NO-POPCOUNT-NEXT: leaq -1(%rdi), %rcx
+; NO-POPCOUNT-NEXT: xorl %eax, %eax
+; NO-POPCOUNT-NEXT: testq %rcx, %rdi
+; NO-POPCOUNT-NEXT: setne %al
+; NO-POPCOUNT-NEXT: retq
   %count = tail call i64 @llvm.ctpop.i64(i64 %x)
   %cast = trunc i64 %count to i32
   %cmp = icmp ugt i32 %cast, 1
   %conv = zext i1 %cmp to i32
   ret i32 %conv
 }
 
 
 define i32 @test2(i64 %x) nounwind readnone {
 ; CHECK-LABEL: test2:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: leaq -1(%rdi), %rcx
 ; CHECK-NEXT: xorl %eax, %eax
 ; CHECK-NEXT: testq %rcx, %rdi
 ; CHECK-NEXT: sete %al
 ; CHECK-NEXT: retq
+;
+; NO-POPCOUNT-LABEL: test2:
+; NO-POPCOUNT: # %bb.0:
+; NO-POPCOUNT-NEXT: leaq -1(%rdi), %rcx
+; NO-POPCOUNT-NEXT: xorl %eax, %eax
+; NO-POPCOUNT-NEXT: testq %rcx, %rdi
+; NO-POPCOUNT-NEXT: sete %al
+; NO-POPCOUNT-NEXT: retq
   %count = tail call i64 @llvm.ctpop.i64(i64 %x)
   %cmp = icmp ult i64 %count, 2
   %conv = zext i1 %cmp to i32
   ret i32 %conv
 }
 
 define i32 @test3(i64 %x) nounwind readnone {
 ; CHECK-LABEL: test3:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: popcntq %rdi, %rcx
 ; CHECK-NEXT: andb $63, %cl
 ; CHECK-NEXT: xorl %eax, %eax
 ; CHECK-NEXT: cmpb $2, %cl
 ; CHECK-NEXT: setb %al
 ; CHECK-NEXT: retq
+;
+; NO-POPCOUNT-LABEL: test3:
+; NO-POPCOUNT: # %bb.0:
+; NO-POPCOUNT-NEXT: movq %rdi, %rax
+; NO-POPCOUNT-NEXT: shrq %rax
+; NO-POPCOUNT-NEXT: movabsq $6148914691236517205, %rcx # imm = 0x5555555555555555
+; NO-POPCOUNT-NEXT: andq %rax, %rcx
+; NO-POPCOUNT-NEXT: subq %rcx, %rdi
+; NO-POPCOUNT-NEXT: movabsq $3689348814741910323, %rax # imm = 0x3333333333333333
+; NO-POPCOUNT-NEXT: movq %rdi, %rcx
+; NO-POPCOUNT-NEXT: andq %rax, %rcx
+; NO-POPCOUNT-NEXT: shrq $2, %rdi
+; NO-POPCOUNT-NEXT: andq %rax, %rdi
+; NO-POPCOUNT-NEXT: addq %rcx, %rdi
+; NO-POPCOUNT-NEXT: movq %rdi, %rax
+; NO-POPCOUNT-NEXT: shrq $4, %rax
+; NO-POPCOUNT-NEXT: addq %rdi, %rax
+; NO-POPCOUNT-NEXT: movabsq $1085102592571150095, %rcx # imm = 0xF0F0F0F0F0F0F0F
+; NO-POPCOUNT-NEXT: andq %rax, %rcx
+; NO-POPCOUNT-NEXT: movabsq $72340172838076673, %rdx # imm = 0x101010101010101
+; NO-POPCOUNT-NEXT: imulq %rcx, %rdx
+; NO-POPCOUNT-NEXT: shrq $56, %rdx
+; NO-POPCOUNT-NEXT: andb $63, %dl
+; NO-POPCOUNT-NEXT: xorl %eax, %eax
+; NO-POPCOUNT-NEXT: cmpb $2, %dl
+; NO-POPCOUNT-NEXT: setb %al
+; NO-POPCOUNT-NEXT: retq
   %count = tail call i64 @llvm.ctpop.i64(i64 %x)
   %cast = trunc i64 %count to i6 ; Too small for 0-64
   %cmp = icmp ult i6 %cast, 2
   %conv = zext i1 %cmp to i32
   ret i32 %conv
 }
 
 define i8 @test4(i8 %x) nounwind readnone {
 ; CHECK-LABEL: test4:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: andl $127, %edi
 ; CHECK-NEXT: popcntl %edi, %eax
 ; CHECK-NEXT: # kill: def $al killed $al killed $eax
 ; CHECK-NEXT: retq
+;
+; NO-POPCOUNT-LABEL: test4:
+; NO-POPCOUNT: # %bb.0:
+; NO-POPCOUNT-NEXT: # kill: def $edi killed $edi def $rdi
+; NO-POPCOUNT-NEXT: andb $127, %dil
+; NO-POPCOUNT-NEXT: movl %edi, %eax
+; NO-POPCOUNT-NEXT: shrb %al
+; NO-POPCOUNT-NEXT: andb $21, %al
+; NO-POPCOUNT-NEXT: subb %al, %dil
+; NO-POPCOUNT-NEXT: movl %edi, %eax
+; NO-POPCOUNT-NEXT: andb $51, %al
+; NO-POPCOUNT-NEXT: shrb $2, %dil
+; NO-POPCOUNT-NEXT: andb $51, %dil
+; NO-POPCOUNT-NEXT: addb %al, %dil
+; NO-POPCOUNT-NEXT: movl %edi, %eax
+; NO-POPCOUNT-NEXT: shrb $4, %al
+; NO-POPCOUNT-NEXT: addl %edi, %eax
+; NO-POPCOUNT-NEXT: andb $15, %al
+; NO-POPCOUNT-NEXT: # kill: def $al killed $al killed $eax
+; NO-POPCOUNT-NEXT: retq
   %x2 = and i8 %x, 127
   %count = tail call i8 @llvm.ctpop.i8(i8 %x2)
   %and = and i8 %count, 7
   ret i8 %and
 }
 
+define i32 @test5(i64 %x) nounwind readnone {
+; CHECK-LABEL: test5:
+; CHECK: # %bb.0:
+; CHECK-NEXT: popcntq %rdi, %rcx
+; CHECK-NEXT: xorl %eax, %eax
+; CHECK-NEXT: cmpq $1, %rcx
+; CHECK-NEXT: sete %al
+; CHECK-NEXT: retq
+;
+; NO-POPCOUNT-LABEL: test5:
+; NO-POPCOUNT: # %bb.0:
+; NO-POPCOUNT-NEXT: leaq -1(%rdi), %rax
+; NO-POPCOUNT-NEXT: testq %rax, %rdi
+; NO-POPCOUNT-NEXT: sete %al
+; NO-POPCOUNT-NEXT: testq %rdi, %rdi
+; NO-POPCOUNT-NEXT: setne %cl
+; NO-POPCOUNT-NEXT: andb %al, %cl
+; NO-POPCOUNT-NEXT: movzbl %cl, %eax
+; NO-POPCOUNT-NEXT: retq
+  %count = tail call i64 @llvm.ctpop.i64(i64 %x)
+  %cmp = icmp eq i64 %count, 1
+  %conv = zext i1 %cmp to i32
+  ret i32 %conv
+}
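
The "Too small for 0-64" comment in test3 can be illustrated with a small scalar model. This is one reading of that comment, not something stated in the review: the bit count of a 64-bit value ranges over 0..64, and 64 wraps to 0 when truncated to six bits, so the truncated compare can disagree with the `(x & x-1) == 0` fold. The names `trunc_i6_ult_2` and `and_fold` are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Models: %cast = trunc i64 %count to i6 ; %cmp = icmp ult i6 %cast, 2
inline bool trunc_i6_ult_2(std::uint64_t x) {
  unsigned count = static_cast<unsigned>(std::bitset<64>(x).count());
  unsigned cast = count & 0x3Fu;  // keep the low six bits, as trunc to i6 does
  return cast < 2;
}

// The fold that is valid when the compare sees the untruncated count.
inline bool and_fold(std::uint64_t x) { return (x & (x - 1)) == 0; }

// For all-ones the count is 64, which truncates to 0, so the two disagree.
inline bool all_ones_disagrees() {
  const std::uint64_t x = ~0ULL;
  return trunc_i6_ult_2(x) && !and_fold(x);
}
```

This matches the shape of the expected output for test3, where the count is still computed and masked with `andb $63` before the compare with 2.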

This is an archive of the discontinued LLVM Phabricator instance.

[TargetLowering] Simplify (ctpop x) == 1
ClosedPublic
