This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] reverse 'trunc X to <N x i1>' canonicalization
ClosedPublic

Authored by spatel on Oct 1 2018, 2:02 PM.

Details

Summary

icmp ne (and X, 1), 0 --> trunc X to <N x i1>
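
For reference, a minimal vector sketch of that fold (the value names are hypothetical, not taken from the patch's tests):

%a = and <4 x i32> %X, <i32 1, i32 1, i32 1, i32 1>
%r = icmp ne <4 x i32> %a, zeroinitializer

becomes:

%r = trunc <4 x i32> %X to <4 x i1>

Both forms test only bit 0 of each element, since trunc to i1 keeps just the low bit.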

Ideally, I think we'd do the same for scalars, but I'm afraid of unintended consequences.
The motivating vector case is from PR37549:
https://bugs.llvm.org/show_bug.cgi?id=37549

define <4 x float> @bitwise_select(<4 x float> %x, <4 x float> %y, <4 x float> %z, <4 x float> %w) {
  %c = fcmp ole <4 x float> %x, %y
  %s = sext <4 x i1> %c to <4 x i32>
  %s1 = shufflevector <4 x i32> %s, <4 x i32> undef, <4 x i32> <i32 0, i32 0, i32 1, i32 1>
  %s2 = shufflevector <4 x i32> %s, <4 x i32> undef, <4 x i32> <i32 2, i32 2, i32 3, i32 3>
  %cond = or <4 x i32> %s1, %s2
  %condtr = trunc <4 x i32> %cond to <4 x i1>
  %r = select <4 x i1> %condtr, <4 x float> %z, <4 x float> %w
  ret <4 x float> %r
}

Here's a sampling of the vector codegen for that case using mask+icmp (current behavior) vs. trunc (with this patch):

AVX before:

vcmpleps	%xmm1, %xmm0, %xmm0
vpermilps	$80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
vpermilps	$250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
vorps	%xmm0, %xmm1, %xmm0
vandps	LCPI0_0(%rip), %xmm0, %xmm0
vxorps	%xmm1, %xmm1, %xmm1
vpcmpeqd	%xmm1, %xmm0, %xmm0
vblendvps	%xmm0, %xmm3, %xmm2, %xmm0

AVX after:

vcmpleps	%xmm1, %xmm0, %xmm0
vpermilps	$80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
vpermilps	$250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
vorps	%xmm0, %xmm1, %xmm0
vblendvps	%xmm0, %xmm2, %xmm3, %xmm0

AVX512f before:

vcmpleps	%xmm1, %xmm0, %xmm0
vpermilps	$80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
vpermilps	$250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
vorps	%xmm0, %xmm1, %xmm0
vpbroadcastd	LCPI0_0(%rip), %xmm1 ## xmm1 = [1,1,1,1]
vptestnmd	%zmm1, %zmm0, %k1
vblendmps	%zmm3, %zmm2, %zmm0 {%k1}

AVX512f after:

vcmpleps	%xmm1, %xmm0, %xmm0
vpermilps	$80, %xmm0, %xmm1 ## xmm1 = xmm0[0,0,1,1]
vpermilps	$250, %xmm0, %xmm0 ## xmm0 = xmm0[2,2,3,3]
vorps	%xmm0, %xmm1, %xmm0
vpslld	$31, %xmm0, %xmm0
vptestmd	%zmm0, %zmm0, %k1
vblendmps	%zmm2, %zmm3, %zmm0 {%k1}

AArch64 before:

fcmge	v0.4s, v1.4s, v0.4s
zip1	v1.4s, v0.4s, v0.4s
zip2	v0.4s, v0.4s, v0.4s
orr	v0.16b, v1.16b, v0.16b
movi	v1.4s, #1
and	v0.16b, v0.16b, v1.16b
cmeq	v0.4s, v0.4s, #0
bsl	v0.16b, v3.16b, v2.16b

AArch64 after:

fcmge	v0.4s, v1.4s, v0.4s
zip1	v1.4s, v0.4s, v0.4s
zip2	v0.4s, v0.4s, v0.4s
orr	v0.16b, v1.16b, v0.16b
bsl	v0.16b, v2.16b, v3.16b

PowerPC-le before:

xvcmpgesp 34, 35, 34
vspltisw 0, 1
vmrglw 3, 2, 2
vmrghw 2, 2, 2
xxlor 0, 35, 34
xxlxor 35, 35, 35
xxland 34, 0, 32
vcmpequw 2, 2, 3
xxsel 34, 36, 37, 34

PowerPC-le after:

xvcmpgesp 34, 35, 34
vmrglw 3, 2, 2
vmrghw 2, 2, 2
xxlor 0, 35, 34
xxsel 34, 37, 36, 0

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision. Oct 1 2018, 2:02 PM
craig.topper added inline comments. Oct 1 2018, 11:17 PM
lib/Transforms/InstCombine/InstCombineCompares.cpp
1712 ↗(On Diff #167820)

Should this be in foldICmpAndConstConst? And it should use the APInts we already extracted?

Ideally, I think we'd do the same for scalars, but I'm afraid of unintended consequences.

I *think* this originates from rL67635.

spatel added a comment. Oct 2 2018, 7:24 AM

Ideally, I think we'd do the same for scalars, but I'm afraid of unintended consequences.

I *think* this originates from rL67635.

Thanks for digging that up! As the codegen examples here show, the icmp variant is not always better, at least for vector codegen (we could just fix the backend, but since this also reduces the IR, I figured that's the better option).

My bigger worry is that we're going to expose IR-level holes for trunc patterns in scalar code (and-of-icmps and similar, or the shift example now shown here). Those seem less likely for vector code.
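
For illustration of such a hole (a hypothetical scalar example, not taken from the patch's tests), an and-of-icmps pattern like:

%a1 = and i8 %x, 1
%c1 = icmp ne i8 %a1, 0
%a2 = and i8 %y, 1
%c2 = icmp ne i8 %a2, 0
%r = and i1 %c1, %c2

is equivalent to the trunc-based form:

%t1 = trunc i8 %x to i1
%t2 = trunc i8 %y to i1
%r = and i1 %t1, %t2

and folds that match the icmp spelling won't necessarily fire on the trunc spelling.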

lib/Transforms/InstCombine/InstCombineCompares.cpp
1712 ↗(On Diff #167820)

There are 2 independent problems here, and I should have put this in a code comment:

  1. If we use the already extracted APInt values, we won't handle vectors with undefs because m_APInt doesn't match those (yet). So in cases like this, I've been using the more specific matcher even if it looks redundant. I will add a test that includes undefs in the constant vector values.
  2. Depending on where we position this fold, it exposes another canonicalization question because it will affect patterns with shifts like this:
%shr = ashr <2 x i84> %X, <i84 4, i84 4>
%and = and <2 x i84> %shr, <i84 1, i84 1>
%cmp = icmp ne <2 x i84> %and, zeroinitializer

Should that become:

%m = and <2 x i84> %X, <i84 16, i84 16>
%cmp = icmp ne <2 x i84> %m, zeroinitializer

or:

%sh = lshr <2 x i84> %X, <i84 4, i84 4>
%cmp = trunc <2 x i84> %sh to <2 x i1>

This patch sidesteps that question by allowing the larger pattern to match first, but we could make that a prerequisite step for this patch.

spatel added a comment. Oct 2 2018, 7:57 AM

If we use the already extracted APInt values, we won't handle vectors with undefs because m_APInt doesn't match those (yet).
So in cases like this, I've been using the more specific matcher even if it looks redundant.
I will add a test that includes undefs in the constant vector values.

Looking at this closer... as the patch is written currently, we would fail to match if the compare constant (zero) contains undefs, because we already used m_APInt as a condition to get here in the first place.
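
For example (a hypothetical constant vector, not one of the added tests), a zero compare constant that contains an undef lane would be skipped by that m_APInt guard:

%a = and <4 x i32> %x, <i32 1, i32 1, i32 1, i32 1>
%c = icmp ne <4 x i32> %a, <i32 0, i32 undef, i32 0, i32 0>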

spatel updated this revision to Diff 168059. Oct 2 2018, 5:43 PM

Patch updated:

  1. Moved the and+icmp --> trunc transform earlier in visitICmpInst, so we have a better idea about potential regressions.
  2. This required adding 2 trunc folds to avoid known regressions. These transforms have phantom (cosmetic-only) test diffs in apint-shift.ll and icmp.ll::icmp_and_or_lshr_cst_vec(), so we can see that the new code is firing on the patterns with trunc.
  3. The other test diffs are all wins in IR (fewer instructions). Included in that, we see that one of the existing icmp transforms that we're replacing doesn't work if the operands are commuted.
  4. The loop vectorizer tests (running with full -O3 in that test file...) produce mixed codegen results: both tests improve (fewer instructions in the inner loop) on KNL, but regress (more instructions in the inner loop) with AVX2. Note: that diff should've been in the previous rev of this patch, but I missed it.

So we have IR improvements in all cases shown here (but there could be regressions for patterns that have no vector test coverage), codegen improvements for the motivating blendv cases across a range of targets, codegen improvements on larger loop tests on AVX512, but codegen regressions with that same IR on AVX2.

@craig.topper The final codegen from the updated IR in masked_load_store.ll regresses because the masked stores don't take advantage of only requiring the MSB of each mask element (lots of SIGN_EXTEND_INREG etc.). X86ISelLowering's combineMaskedStore only handles the PCMPGT case; how tricky would it be to replace it with a general SimplifyDemandedBits call?

spatel added a comment. Oct 6 2018, 6:03 AM

@craig.topper The final codegen from the updated IR in masked_load_store.ll regresses because the masked stores don't take advantage of only requiring the MSB of each mask element (lots of SIGN_EXTEND_INREG etc.). X86ISelLowering's combineMaskedStore only handles the PCMPGT case; how tricky would it be to replace it with a general SimplifyDemandedBits call?

I have a draft of that patch in progress. Let me add some tests, clean it up, and post it. That's the only regression that I'm aware of from this patch, so we can make this patch dependent on that one.

Now that D52964 has landed - is there anything stopping this?

spatel added a comment. Oct 9 2018, 7:18 AM

Now that D52964 has landed - is there anything stopping this?

IMO, no. The known IR regressions are now handled with additional trunc pattern matching, so all changes in IR shown here are improvements. All known codegen regressions have been squashed.

RKSimon accepted this revision. Oct 9 2018, 8:02 AM

LGTM - thanks

This revision is now accepted and ready to land. Oct 9 2018, 8:02 AM
This revision was automatically updated to reflect the committed changes.