This is an archive of the discontinued LLVM Phabricator instance.

Differential D69127

[DAGCombiner] widen zext of popcount based on target support
ClosedPublic

Authored by spatel on Oct 17 2019, 11:28 AM.

Details

Reviewers

nemanjai
jsji
RKSimon
xbolva00
craig.topper

Commits

rGe6c145e0548e: [DAGCombiner] widen zext of popcount based on target support

Summary

zext (ctpop X) --> ctpop (zext X)

This is a prerequisite step for canonicalizing in the other direction (narrow the popcount) in IR - PR43688:
https://bugs.llvm.org/show_bug.cgi?id=43688

I'm not sure if any other targets are affected, but I found a missing fold for PPC, so I added tests based on that.
We widen all the way to 64-bit in these tests because the initial DAG looks something like this:

t5: i8 = ctpop t4
t6: i32 = zero_extend t5  <-- created based on IR, but unused node?
  t7: i64 = zero_extend t5
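
The fold in the summary relies on the two forms producing the same value. A minimal sketch of that equivalence for an i8 source widened to i64 might look like the following. The helper names (`ctpop`, `zextOfCtpop`, `ctpopOfZext`, `foldIsValueEquivalent`) are hypothetical and only model the DAG nodes in plain C++; none of them are LLVM APIs.

```cpp
#include <cstdint>

// Hypothetical helper (not an LLVM API): counts the set bits in the low
// `Width` bits of `V`, modeling a ctpop node on a Width-bit value.
unsigned ctpop(uint64_t V, unsigned Width) {
  unsigned Count = 0;
  for (unsigned I = 0; I < Width; ++I)
    Count += (V >> I) & 1;
  return Count;
}

// zext (ctpop X): pop-count in the narrow i8 type, then zero-extend to i64.
uint64_t zextOfCtpop(uint8_t X) {
  uint8_t Narrow = static_cast<uint8_t>(ctpop(X, 8));
  return static_cast<uint64_t>(Narrow);
}

// ctpop (zext X): zero-extend to i64 first, then pop-count at 64 bits.
uint64_t ctpopOfZext(uint8_t X) {
  uint64_t Wide = static_cast<uint64_t>(X);
  return ctpop(Wide, 64);
}

// Compares both forms over every possible i8 input.
bool foldIsValueEquivalent() {
  for (unsigned X = 0; X <= 0xFF; ++X) {
    uint8_t Byte = static_cast<uint8_t>(X);
    if (zextOfCtpop(Byte) != ctpopOfZext(Byte))
      return false;
  }
  return true;
}
```

The zero-extended high bits contribute nothing to the count, which is why the wider pop-count can replace the narrow one followed by an extend.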

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Oct 17 2019, 11:28 AM

Herald added a project: Restricted Project. Oct 17 2019, 11:28 AM

Herald added subscribers: MaskRay, hiraditya, mcrosier.

Looks good

LGTM. Thanks.
Do we want to consider any_extend as well?

This revision is now accepted and ready to land. Oct 18 2019, 1:34 PM

In D69127#1715078, @jsji wrote:

LGTM. Thanks.
Do we want to consider any_extend as well?

I guess that's possible (convert the anyext to zext?) although I'm not sure how to create a test case.

For example, if we had:
i16 = anyext (i8 = ctpop 0xF0)

The result must be "i16 0x??04" because we only know the value of the lower bits exactly.

i16 = ctpop (i16 = zext 0xF0)

The result must be "i16 0x0004", so it's more defined than the original, but it's probably still a perf win to use a supported ctpop operation?
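
The anyext reasoning above can be sketched in plain C++. This is only a model under stated assumptions: the undefined upper bits of an any-extend are represented by an explicit `HighByte` parameter, and all function names are hypothetical rather than LLVM APIs.

```cpp
#include <cstdint>

// Hypothetical helper (not an LLVM API): pop-count of a 16-bit value.
unsigned ctpop16(uint16_t V) {
  unsigned Count = 0;
  for (; V != 0; V >>= 1)
    Count += V & 1;
  return Count;
}

// Models i16 = anyext (i8 Low). HighByte stands in for the undefined
// upper bits, which may hold any value.
uint16_t anyExtend(uint8_t Low, uint8_t HighByte) {
  return static_cast<uint16_t>((HighByte << 8) | Low);
}

// Original form: i16 = anyext (i8 = ctpop X), for one choice of HighByte.
uint16_t anyextOfCtpop(uint8_t X, uint8_t HighByte) {
  return anyExtend(static_cast<uint8_t>(ctpop16(X)), HighByte);
}

// Widened form: i16 = ctpop (i16 = zext X).
uint16_t ctpopOfZext16(uint8_t X) {
  return static_cast<uint16_t>(ctpop16(static_cast<uint16_t>(X)));
}

// The widened result is an allowed outcome of the original form if some
// choice of the undefined bits reproduces it exactly.
bool widenedIsAllowed(uint8_t X) {
  uint16_t Wide = ctpopOfZext16(X);
  return anyextOfCtpop(X, static_cast<uint8_t>(Wide >> 8)) == Wide;
}
```

For the input 0xF0, `ctpopOfZext16` yields 0x0004, which is one of the values the any-extended form is allowed to take, so the widened form is "more defined" without being wrong.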

Yeah.

I am asking because I see we will use any_extend with the simple call.

define i8 @popz_i8_i8(i8 %x) {
  %pop = tail call i8 @llvm.ctpop.i8(i8 %x)
  ret i8 %pop
}

Widening in this example might not yield a perf gain,
but I think in some cases we might win, or at least not lose?

And since you mention this is a prerequisite for canonicalizing,
maybe that will bring more opportunities for a win?

So the assumption here is that the popcnt we choose here is the same performance as any smaller popcnt that LegalizeDAG would pick through promotion. Or near enough that it's better than having 2 zexts?

any_extend might get generated when using ctpop for 'allones' tests:

define i64 @var_ctpop_i32(i32 %a) {
  %1 = call i32 @llvm.ctpop.i32(i32 %a)
  %2 = zext i32 %1 to i64 ; SimplifyDemandedBits may turn zext (or sext) into aext
  %3 = and i64 %2, 32
  ret i64 %3
}
declare i32 @llvm.ctpop.i32(i32)
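
A rough C++ model of the IR above may help show why the extend can become an any-extend: the mask only demands bit 5 of the result, so the upper 32 bits never matter. The function names and the explicit `HighBits` parameter (standing in for the undefined bits of an aext) are assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical helper (not an LLVM API): pop-count of a 32-bit value.
unsigned ctpop32(uint32_t V) {
  unsigned Count = 0;
  for (; V != 0; V >>= 1)
    Count += V & 1;
  return Count;
}

// Mirrors @var_ctpop_i32 with a true zero-extension.
uint64_t varCtpopZext(uint32_t A) {
  uint64_t Ext = ctpop32(A);
  return Ext & 32;
}

// Same computation, but the upper 32 bits of the extended value are
// arbitrary (HighBits), as they would be after an any-extend.
uint64_t varCtpopAnyext(uint32_t A, uint32_t HighBits) {
  uint64_t Ext = (static_cast<uint64_t>(HighBits) << 32) | ctpop32(A);
  return Ext & 32;
}
```

Because a 32-bit pop-count ranges from 0 to 32, bit 5 is set only when all 32 input bits are set, which is the 'allones' test mentioned above. Both variants return the same value for any `HighBits`.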

In D69127#1715373, @craig.topper wrote:

So the assumption here is that the popcnt we choose here is the same performance as any smaller popcnt that LegalizeDAG would pick through promotion. Or near enough that it's better than having 2 zexts?

Almost - we're saying that ctpop+zext is no better perf than the wider ctpop. For example, the i8 test on PPC ends up like this after promotion without this patch:

    t13: i32 = and t11, Constant:i32<255>
  t14: i32 = ctpop t13
t18: i64 = zero_extend t14

This patch would already have hoisted the zext ahead of the i32 ctpop in that basic example, but if we had started with this i32 code, then we'd widen to i64 ctpop because we assume it's no more expensive than i32 ctpop+zext. For the PPC examples, the zext becomes a mask op, so we eliminate that instruction.

We shouldn't be touching anything that's expanded. So for base x86, we're not going to widen an i8 op because that expansion should be cheaper than any wider expansion.

In D69127#1715640, @spatel wrote:

This patch would already have hoisted the zext ahead of the i32 ctpop in that basic example, but if we had started with this i32 code, then we'd widen to i64 ctpop because we assume it's no more expensive than i32 ctpop+zext. For the PPC examples, the zext becomes a mask op, so we eliminate that instruction.

Oops - that last part was wrong. (And there's already a test for i32 -> i64.) Since PPC has legal ops for both types, we won't do anything for that pattern.
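
The conditions discussed in this exchange can be modeled with a small stand-in for the target legality query. This is a sketch, not the patch itself: `MockTarget` and `shouldWidenZextOfCtpop` are hypothetical names, and the sets of legal widths are simplifications taken from the comments in this thread (a PPC-like target with legal i32 and i64 pop-counts, and a base-x86-like target where pop-count is always expanded).

```cpp
#include <set>

// Hypothetical stand-in for the target's legality information. The real
// patch queries TLI.isOperationLegalOrCustom(ISD::CTPOP, VT) instead.
struct MockTarget {
  std::set<unsigned> LegalCtpopBits; // bit widths with legal/custom ctpop

  bool isCtpopLegalOrCustom(unsigned Bits) const {
    return LegalCtpopBits.count(Bits) != 0;
  }
};

// Models the guard on the fold: widen only when the narrow ctpop has a
// single use, is not supported in the source type, and is supported in
// the destination type.
bool shouldWidenZextOfCtpop(const MockTarget &Target, unsigned SrcBits,
                            unsigned DstBits, bool CtpopHasOneUse) {
  return CtpopHasOneUse && !Target.isCtpopLegalOrCustom(SrcBits) &&
         Target.isCtpopLegalOrCustom(DstBits);
}
```

With a PPC-like target, an i8 pop-count zero-extended to i64 passes the guard, while an i32 pop-count zero-extended to i64 does not because the narrow type is already legal. With no legal widths at all, nothing is widened, which matches the point about not touching operations that will be expanded.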

spatel mentioned this in rGb74d7e5cccb5: [PowerPC] add test for popcnt with any_extend; NFC. Oct 25 2019, 9:48 AM

In D69127#1715638, @RKSimon wrote:
any_extend might get generated when using ctpop for 'allones' tests:
define i64 @var_ctpop_i32(i32 %a) {
  %1 = call i32 @llvm.ctpop.i32(i32 %a)
  %2 = zext i32 %1 to i64 ; SimplifyDemandedBits may turn zext (or sext) into aext
  %3 = and i64 %2, 32
  ret i64 %3
}
declare i32 @llvm.ctpop.i32(i32)

Thanks - I added a variant of this in:
rGb74d7e5cccb5
I'll add a TODO comment here and follow-up to generalize for that case.

I'll add a TODO comment here and follow-up to generalize for that case.

Great. Thanks @spatel !

Closed by commit rGe6c145e0548e: [DAGCombiner] widen zext of popcount based on target support (authored by spatel). Oct 25 2019, 11:12 AM

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in rG1ebd4a2e3ad0: [DAGCombiner] widen any_ext of popcount based on target support. Oct 28 2019, 7:12 AM

any_ext added with:
rG1ebd4a2e3ad0

Revision Contents

Path: Size
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp: 12 lines
llvm/test/CodeGen/PowerPC/popcnt-zext.ll: 15 lines

Diff 226468

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

... 9,915 lines not shown (context: if ((N0.getOpcode() == ISD::SHL || N0.getOpcode() == ISD::SRL) &&) ...
     return DAG.getNode(N0.getOpcode(), DL, VT,
                        DAG.getNode(ISD::ZERO_EXTEND, DL, VT, N0.getOperand(0)),
                        ShAmt);
   }

   if (SDValue NewVSel = matchVSelectOpSizesWithSetCC(N))
     return NewVSel;

+  // If the target does not support a pop-count in the narrow source type but
+  // does support it in the destination type, widen the pop-count to this type:
+  // zext (ctpop X) --> ctpop (zext X)
+  // TODO: Generalize this to handle starting from anyext.
+  if (N0.getOpcode() == ISD::CTPOP && N0.hasOneUse() &&
+      !TLI.isOperationLegalOrCustom(ISD::CTPOP, N0.getValueType()) &&
+      TLI.isOperationLegalOrCustom(ISD::CTPOP, VT)) {
+    SDLoc DL(N);
+    SDValue NewZext = DAG.getZExtOrTrunc(N0.getOperand(0), DL, VT);
+    return DAG.getNode(ISD::CTPOP, DL, VT, NewZext);
+  }
+
   return SDValue();
 }

 SDValue DAGCombiner::visitANY_EXTEND(SDNode *N) {
   SDValue N0 = N->getOperand(0);
   EVT VT = N->getValueType(0);

   if (SDValue Res = tryToFoldExtendOfConstant(N, TLI, DAG, LegalTypes))
... 10,935 lines not shown ...

llvm/test/CodeGen/PowerPC/popcnt-zext.ll

... 35 lines not shown (context: ; SLOW-NEXT: blr) ...
   %z = zext i8 %x to i16
   %pop = tail call i16 @llvm.ctpop.i16(i16 %z)
   ret i16 %pop
 }

 define i16 @popz_i8_i16(i8 %x) {
 ; FAST-LABEL: popz_i8_i16:
 ; FAST: # %bb.0:
-; FAST-NEXT: rlwinm 3, 3, 0, 24, 31
-; FAST-NEXT: popcntw 3, 3
-; FAST-NEXT: clrldi 3, 3, 32
+; FAST-NEXT: clrldi 3, 3, 56
+; FAST-NEXT: popcntd 3, 3
 ; FAST-NEXT: blr
 ;
 ; SLOW-LABEL: popz_i8_i16:
 ; SLOW: # %bb.0:
 ; SLOW-NEXT: clrlwi 5, 3, 24
 ; SLOW-NEXT: rlwinm 3, 3, 31, 0, 31
 ; SLOW-NEXT: andi. 3, 3, 85
 ; SLOW-NEXT: lis 4, 13107
... 54 lines not shown (context: ; SLOW-NEXT: blr) ...
   %z = zext i8 %x to i32
   %pop = tail call i32 @llvm.ctpop.i32(i32 %z)
   ret i32 %pop
 }

 define i32 @popz_i8_32(i8 %x) {
 ; FAST-LABEL: popz_i8_32:
 ; FAST: # %bb.0:
-; FAST-NEXT: rlwinm 3, 3, 0, 24, 31
-; FAST-NEXT: popcntw 3, 3
-; FAST-NEXT: clrldi 3, 3, 32
+; FAST-NEXT: clrldi 3, 3, 56
+; FAST-NEXT: popcntd 3, 3
 ; FAST-NEXT: blr
 ;
 ; SLOW-LABEL: popz_i8_32:
 ; SLOW: # %bb.0:
 ; SLOW-NEXT: clrlwi 5, 3, 24
 ; SLOW-NEXT: rlwinm 3, 3, 31, 0, 31
 ; SLOW-NEXT: andi. 3, 3, 85
 ; SLOW-NEXT: lis 4, 13107
... 54 lines not shown (context: ; SLOW-NEXT: blr) ...
   %z = zext i16 %x to i32
   %pop = tail call i32 @llvm.ctpop.i32(i32 %z)
   ret i32 %pop
 }

 define i32 @popz_i16_32(i16 %x) {
 ; FAST-LABEL: popz_i16_32:
 ; FAST: # %bb.0:
-; FAST-NEXT: rlwinm 3, 3, 0, 16, 31
-; FAST-NEXT: popcntw 3, 3
-; FAST-NEXT: clrldi 3, 3, 32
+; FAST-NEXT: clrldi 3, 3, 48
+; FAST-NEXT: popcntd 3, 3
 ; FAST-NEXT: blr
 ;
 ; SLOW-LABEL: popz_i16_32:
 ; SLOW: # %bb.0:
 ; SLOW-NEXT: clrlwi 5, 3, 16
 ; SLOW-NEXT: rlwinm 3, 3, 31, 0, 31
 ; SLOW-NEXT: andi. 3, 3, 21845
 ; SLOW-NEXT: lis 4, 13107
... 142 lines not shown ...