This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] fold shuffle-of-binops with common operand
ClosedPublic

Authored by spatel on Oct 15 2021, 10:34 AM.

Details

Summary

shuf (bo X, Y), (bo X, W) --> bo (shuf X), (shuf Y, W)

This is motivated by an example in D111800 (although that patch would avoid the problem for that particular example).

The pattern is shown in reduced form with:
https://llvm.org/PR52178
https://alive2.llvm.org/ce/z/d8zB4D

There is no difference on the PhaseOrdering test from D111800 because the aarch64 cost model says that the shuffle cost is 3 while the fadd cost is 2. That seems wrong for a simple v4f32 shuffle, but that should be another patch if correct.

Diff Detail

Event Timeline

spatel created this revision.Oct 15 2021, 10:34 AM
spatel requested review of this revision.Oct 15 2021, 10:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 15 2021, 10:34 AM
RKSimon added inline comments.Oct 18 2021, 8:00 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
1084

Are we OK with accepting the fold if ShufCost == BinopCost ?

spatel marked an inline comment as done.Oct 18 2021, 8:44 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
1084

This follows the lead of other vector combines and instcombine in general: if we can rearrange code without incurring cost, then it might unlock further transforms, so we try it.

Direct motivation is seen in the example from D111800 - in that case, we get shuffle-of-shuffle as the 1st instructions in the function and that can be reduced by the backend (x86 at least).

In a minimal case where there's no further optimization, we end up with something that is probably neutral. For example, here's the 'and' v2i64 test on x86 and aarch64:

define <2 x i64> @and_and_shuf_v2i64_yy_swap(<2 x i64> %x, <2 x i64> %y, <2 x i64> %z) {
  %b0 = and <2 x i64> %x, %y
  %b1 = and <2 x i64> %y, %z
  %r = shufflevector <2 x i64> %b0, <2 x i64> %b1, <2 x i32> <i32 3, i32 0>
  ret <2 x i64> %r
}

define <2 x i64> @shuf_shuf_and_v2i64_yy_swap(<2 x i64> %x, <2 x i64> %y, <2 x i64> %z) {
  %a0 = shufflevector <2 x i64> %y, <2 x i64> poison, <2 x i32> <i32 1, i32 0>
  %a1 = shufflevector <2 x i64> %x, <2 x i64> %z, <2 x i32> <i32 3, i32 0>
  %r = and <2 x i64> %a0, %a1
  ret <2 x i64> %r
}
before:
	andps	%xmm1, %xmm0
	andps	%xmm1, %xmm2
	shufps	$78, %xmm0, %xmm2               ## xmm2 = xmm2[2,3],xmm0[0,1]
after:	
	pshufd	$78, %xmm1, %xmm1               ## xmm1 = xmm1[2,3,0,1]
	shufps	$78, %xmm0, %xmm2               ## xmm2 = xmm2[2,3],xmm0[0,1]
	pand	%xmm2, %xmm1
before:
	and	v2.16b, v1.16b, v2.16b
	and	v0.16b, v0.16b, v1.16b
	ext	v0.16b, v2.16b, v0.16b, #8
after:
	ext	v1.16b, v1.16b, v1.16b, #8
	ext	v0.16b, v2.16b, v0.16b, #8
	and	v0.16b, v1.16b, v0.16b
RKSimon accepted this revision.Oct 18 2021, 8:51 AM

Yes that makes sense - LGTM.

This revision is now accepted and ready to land.Oct 18 2021, 8:51 AM
This revision was landed with ongoing or failed builds.Oct 21 2021, 9:38 AM
This revision was automatically updated to reflect the committed changes.
spatel marked an inline comment as done.