This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] move splat-shuffle after binop with splat constant
Abandoned · Public

Authored by spatel on Apr 3 2019, 10:03 AM.

Details

Summary

This is a subset of a transform that we do in instcombine:

binop (splat X), SplatC --> splat (binop X, SplatC)

...and it usually seems to be working as intended (we appear to get more demanded-bits/demanded-elements optimizations), but I'm still going through these test diffs.
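
As a rough IR-level sketch of the reordering (hypothetical values, with a plain add standing in for the binop; this is not one of the patch's tests):

%s = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> zeroinitializer
%r = add <4 x i32> %s, <i32 7, i32 7, i32 7, i32 7>

becomes:

%b = add <4 x i32> %x, <i32 7, i32 7, i32 7, i32 7>
%r = shufflevector <4 x i32> %b, <4 x i32> undef, <4 x i32> zeroinitializer

The splat only reads lane 0 of %b, so the result is unchanged, but the binop now operates on the original operand, which is what exposes the demanded-bits/elements folds.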

The end motivation is to reduce the number of patterns that we have to match when trying to scalarize with D60150.

This is intentionally limited to only work when the constant has no undef lanes, but we might be able to ease that as a follow-up. Otherwise, at least in the more general transform in instcombine, we have to deal with potential poison propagation and undef simplification.
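
To see the hazard with undef lanes, consider a hypothetical example (mine, not from the patch's tests):

%s = shufflevector <2 x i64> %x, <2 x i64> undef, <2 x i32> <i32 1, i32 1>
%a = add <2 x i64> %s, <i64 42, i64 undef>

Reordering with the existing mask would splat lane 1 of 'add %x, <i64 42, i64 undef>', so the undef result lane would replace lane 0, which previously held the well-defined value x[1] + 42.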

Event Timeline

spatel created this revision. Apr 3 2019, 10:03 AM
Herald added a project: Restricted Project. Apr 3 2019, 10:03 AM

My only concern is the width reduction on the BROADCAST - as these could regress AVX512 memory folding - but then again that isn't working great in many cases as it is (due to existing SimplifyDemanded calls, shuffle combining, etc.) - @craig.topper what do you reckon?

spatel updated this revision to Diff 194344 (edited). Apr 9 2019, 8:57 AM

Patch updated:

  1. Rebase - D60150 allowed scalarization of x86 FP ops, so we see that happening in the motivating examples in 'test/CodeGen/X86/scalarize-fp.ll'.
  2. Refined the conditions where we can use the existing shuffle mask vs. creating a new splat w/o undef lanes. For example:
%s = shufflevector <2 x i64> %x, <2 x i64> undef, <2 x i32> <i32 1, i32 undef>
%a = and <2 x i64> %s, <i64 42, i64 42>

To be safe, the new shuffle must actually splat element 1 of the source operand:

LCPI0_0:
.quad	42                      ## 0x2a
.quad	42                      ## 0x2a
pand	LCPI0_0(%rip), %xmm0
pshufd	$238, %xmm0, %xmm0      ## xmm0 = xmm0[2,3,2,3]

Rather than:

pand	LCPI0_0(%rip), %xmm0
pshufd	$78, %xmm0, %xmm0       ## xmm0 = xmm0[2,3,0,1]

A couple of minor comments, but this looks almost ready to me; the AVX512 broadcast folds are a known issue.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
18844

Like in getKnownUndefForVectorBinop - add a TODO about not creating nodes on the fly.

llvm/test/CodeGen/ARM/reg_sequence.ll
274–289

Add the other checks for this test to trunk - to show the full codegen delta in this patch.

spatel updated this revision to Diff 195586. Apr 17 2019, 9:34 AM
spatel marked an inline comment as done.

Patch updated:

  1. Added a TODO comment about adding a more efficient constant query.
  2. Updated ARM test assertions to show more complete diff.
RKSimon accepted this revision. Apr 17 2019, 9:49 AM

Cheers, LGTM

This revision is now accepted and ready to land. Apr 17 2019, 9:49 AM
spatel planned changes to this revision. Apr 17 2019, 9:51 AM

I haven't hand-written FileCheck lines in a long time, and I messed that up...another update coming up.

spatel updated this revision to Diff 195593. Apr 17 2019, 10:04 AM

Patch updated:
Made the ARM test checks more exact, so we can see that there actually is an extra instruction here (vext.32). That seems like a regression, but I'm not sure whether it's important or just another known shortcoming in shuffle folding.

This revision is now accepted and ready to land. Apr 17 2019, 10:04 AM
efriedma added inline comments. Apr 17 2019, 12:49 PM
llvm/test/CodeGen/ARM/reg_sequence.ll
277

I think the reason you're seeing a regression here is that ARM has a combined "splat-and-multiply" operation. If the splat isn't an operand to the multiply, you end up with an extra instruction. If DAGCombine is going to reorder the operations, we might need some ARM/AArch64-specific pattern-matching code for multiplies.

Granted, this isn't really a great testcase to demonstrate that issue; it's sort of degenerate. Instead, consider something like the following:

define <4 x i32> @x(<4 x i32> %a) {
entry:
  %splat = shufflevector <4 x i32> %a, <4 x i32> undef, <4 x i32> zeroinitializer
  %mul = mul <4 x i32> %splat, <i32 3, i32 3, i32 3, i32 3>
  ret <4 x i32> %mul
}

On AArch64, this currently compiles to something like "movi v1.4s, #3; mul v0.4s, v1.4s, v0.s[0]". If you reorder the operations, I think you end up with three instructions.
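
(As a guess at the reordered output - not verified compiler output - that would be something like "movi v1.4s, #3; mul v0.4s, v0.4s, v1.4s; dup v0.4s, v0.s[0]"; the multiply can no longer use the by-element form, so a separate dup is needed to materialize the splat.)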

RKSimon requested changes to this revision. Apr 17 2019, 1:45 PM

Please add support for ARM/AArch64 splat-and-multiply instructions.

This revision now requires changes to proceed. Apr 17 2019, 1:45 PM
spatel abandoned this revision. May 26 2020, 5:25 AM

Abandoning - see D79886 for a limited version of this.
I couldn't find this patch locally, so I wrote another version.