This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] widen zext of popcount based on target support
ClosedPublic

Authored by spatel on Oct 17 2019, 11:28 AM.

Details

Summary

zext (ctpop X) --> ctpop (zext X)

This is a prerequisite step for canonicalizing in the other direction (narrow the popcount) in IR - PR43688:
https://bugs.llvm.org/show_bug.cgi?id=43688

I'm not sure if any other targets are affected, but I found a missing fold for PPC, so added tests based on that.
The reason we widen all the way to 64-bit in these tests is because the initial DAG looks something like this:

t5: i8 = ctpop t4
t6: i32 = zero_extend t5  <-- created based on IR, but unused node?
  t7: i64 = zero_extend t5

Diff Detail

Event Timeline

spatel created this revision.Oct 17 2019, 11:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 17 2019, 11:28 AM
jsji accepted this revision.Oct 18 2019, 1:34 PM

LGTM. Thanks.
Do we want to consider any_extend as well?

This revision is now accepted and ready to land.Oct 18 2019, 1:34 PM

LGTM. Thanks.
Do we want to consider any_extend as well?

I guess that's possible (convert the anyext to zext?) although I'm not sure how to create a test case.

For example, if we had:
i16 = anyext (i8 = ctpop 0xF0)

The result must be "i16 0x??04" because we only know the value of the lower bits exactly.

i16 = ctpop (i16 = zext 0xF0)

The result must be "i16 0x0004", so it's more defined than the original, but it's probably still a perf win to use a supported ctpop operation?

jsji added a comment.Oct 18 2019, 2:37 PM

Yeah.

I am asking because I see we will use any_extend with the simple call.

define i8 @popz_i8_i8(i8 %x) {
  %pop = tail call i8 @llvm.ctpop.i8(i8 %x)
  ret i8 %pop
}

widening in this example might not get a perf gain,
but I think in some case we might win, at least not loss?

And since you mention this is a prerequisite for canonicalizing,
so maybe that will bring more possibility of win?

So the assumption here is that the popcnt we choose here is the same performance as any smaller popcnt that LegalizeDAG would pick through promotion. Or near enough that its better than having 2 zexts?

any_extend might get generated when using ctpop for 'allones' tests:

define i64 @var_ctpop_i32(i32 %a) {
  %1 = call i32 @llvm.ctpop.i32(i32 %a)
  %2 = zext i32 %1 to i64 ; SimplifyDemandedBits may turn zext (or sext) into aext
  %3 = and i64 %2, 32
  ret i64 %3
}
declare i32 @llvm.ctpop.i32(i32)
spatel added a comment.EditedOct 19 2019, 8:21 AM

So the assumption here is that the popcnt we choose here is the same performance as any smaller popcnt that LegalizeDAG would pick through promotion. Or near enough that its better than having 2 zexts?

Almost - we're saying that ctpop+zext is no better perf than the wider ctpop. For example, the i8 test on PPC ends up like this after promotion without this patch:

    t13: i32 = and t11, Constant:i32<255>
  t14: i32 = ctpop t13
t18: i64 = zero_extend t14

This patch would already have hoisted the zext ahead of the i32 ctpop in that basic example, but if we had started with this i32 code, then we'd widen to i64 ctpop because we assume it's no more expensive than i32 ctpop+zext. For the PPC examples, the zext becomes a mask op, so we eliminate that instruction.

We shouldn't be touching anything that's expanded. So for base x86, we're not going to widen an i8 op because that expansion should be cheaper than any wider expansion.

This patch would already have hoisted the zext ahead of the i32 ctpop in that basic example, but if we had started with this i32 code, then we'd widen to i64 ctpop because we assume it's no more expensive than i32 ctpop+zext. For the PPC examples, the zext becomes a mask op, so we eliminate that instruction.

Oops - that last part was wrong. (And there's already a test for i32 -> i64.) Since PPC has legal ops for both types, we won't do anything for that pattern.

any_extend might get generated when using ctpop for 'allones' tests:

define i64 @var_ctpop_i32(i32 %a) {
  %1 = call i32 @llvm.ctpop.i32(i32 %a)
  %2 = zext i32 %1 to i64 ; SimplifyDemandedBits may turn zext (or sext) into aext
  %3 = and i64 %2, 32
  ret i64 %3
}
declare i32 @llvm.ctpop.i32(i32)

Thanks - I added a variant of this in:
rGb74d7e5cccb5
I'll add a TODO comment here and follow-up to generalize for that case.

jsji added a comment.Oct 25 2019, 10:43 AM

I'll add a TODO comment here and follow-up to generalize for that case.

Great. Thanks @spatel !

This revision was automatically updated to reflect the committed changes.

any_ext added with:
rG1ebd4a2e3ad0