This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] combine a shuffle and an extract subvector shuffle
ClosedPublic

Authored by spatel on Oct 9 2018, 1:09 PM.

Details

Summary

This is part of the missing IR-level folding noted in D52912.
This should be OK as a canonicalization because the new shuffle mask can't be any more complicated than the existing shuffle mask. If there's some target where the shorter vector shuffle is not legal, it should just end up expanding to something like the pair of shuffles we're starting with here.
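As a sketch of the mask arithmetic involved (a Python model for illustration, not the LLVM code itself; the function name is hypothetical): the outer extract-subvector shuffle reads lane `extract_mask[i]` of the inner shuffle's result, so the folded mask is just a composition of the two masks and cannot be any more complicated than the inner mask.

```python
# Python model (not the LLVM implementation) of folding an
# extract-subvector shuffle of another shuffle into one shuffle.
# As in LLVM, -1 denotes an undef mask lane.

def fold_extract_of_shuffle(inner_mask, extract_mask):
    """Compose the masks: the outer shuffle reads lane extract_mask[i]
    of the inner shuffle's result, so the combined mask simply indexes
    into inner_mask. Undef (-1) lanes stay undef."""
    return [-1 if i < 0 else inner_mask[i] for i in extract_mask]

# Inner shuffle of two <4 x i32> sources (lanes 4..7 name the 2nd source);
# the outer shuffle extracts the low two lanes <0, 1>.
inner = [0, 4, 1, 5]
extract = [0, 1]
print(fold_extract_of_shuffle(inner, extract))  # [0, 4]
```

The combined mask only ever contains entries copied from the inner mask (or undef), which is why the result is never more complicated than the starting pair.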

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision. Oct 9 2018, 1:09 PM
lebedev.ri added inline comments. Oct 10 2018, 8:38 AM
lib/Transforms/InstCombine/InstCombineVectorOps.cpp
1496 ↗(On Diff #168860)

I'm being paranoid here. What is the expected outcome of this transform?
The new shuffle mask: do we expect (read: require) it to be smaller than the existing ones?

spatel added inline comments. Oct 10 2018, 9:01 AM
lib/Transforms/InstCombine/InstCombineVectorOps.cpp
1496 ↗(On Diff #168860)

The new mask is the same size as the extracting shuffle (Shuf). The extracting shuffle's size must be smaller than the first shuffle's size. This is guaranteed by 'isIdentityWithExtract', but I'll add an assert to make that clearer.

spatel updated this revision to Diff 169088. Oct 10 2018, 2:03 PM
spatel marked 2 inline comments as done.

Patch updated:
Added an assert to (hopefully) make this transform clearer.

Side note: while trying to edit this patch, I realized that I had accidentally already committed it to trunk with rL344082. I reverted that in rL344178.

Sorry about that... on the plus side, there were no bot complaints or bug reports while it was in trunk. :)

lebedev.ri added inline comments. Oct 12 2018, 12:48 AM
lib/Transforms/InstCombine/InstCombineVectorOps.cpp
1493 ↗(On Diff #169088)

I think this comment is too concise.
The inner shuffle isn't *required* to be identity, is it?
Also, can it have undefs?

RKSimon accepted this revision. Oct 12 2018, 10:32 AM

LGTM - cheers

lib/Transforms/InstCombine/InstCombineVectorOps.cpp
1493 ↗(On Diff #169088)

Add an undef to the inner mask if you can - but it's not a big issue.

test/Transforms/InstCombine/vec_shuffle.ll
173 ↗(On Diff #169088)

Remove TODO?

187 ↗(On Diff #169088)

Remove TODO?

This revision is now accepted and ready to land. Oct 12 2018, 10:32 AM
spatel added inline comments. Oct 12 2018, 10:51 AM
lib/Transforms/InstCombine/InstCombineVectorOps.cpp
1493 ↗(On Diff #169088)

Good observations!

  1. Yes, the inner shuffle can have undefs (one of the regression tests shows that possibility).
  2. No, it is not technically required to be identity, but it is required by our current standards of shuffle combining. I.e., if it wasn't an identity, we'd potentially be creating a truly new shuffle mask (shuffling elements in a way that the original code did not), and that's not allowed.

I'll clarify these points in a revised code comment.
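As a hypothetical illustration of both points (a Python model, not the committed C++; the function name is made up): undef (-1) lanes in the inner mask just pass through, and because the outer mask is an identity extract <0, 1, ..., N-1>, folding amounts to truncating the inner mask, so no genuinely new shuffle mask is ever created.

```python
# Hypothetical Python model of the fold discussed above; -1 means undef.

def fold_identity_extract(inner_mask, num_extract_elts):
    """The outer shuffle is an identity-with-extract mask <0, 1, ..., N-1>,
    so folding simply truncates the inner mask to its first N entries;
    undef (-1) lanes pass through unchanged."""
    assert num_extract_elts < len(inner_mask), \
        "extract must be narrower than the inner shuffle"
    return inner_mask[:num_extract_elts]

# Inner shuffle with undef lanes; extract the low 3 of 4 lanes.
inner = [-1, 5, -1, 6]
folded = fold_identity_extract(inner, 3)
print(folded)  # [-1, 5, -1]
# Every entry of the folded mask already appears in the inner mask,
# so the transform never invents lane movements the original code lacked.
assert all(m in inner for m in folded)
```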

test/Transforms/InstCombine/vec_shuffle.ll
173 ↗(On Diff #169088)

Yes - forgot to update the comment here and below.

This revision was automatically updated to reflect the committed changes.
fhahn added a subscriber: fhahn. Feb 5 2019, 8:36 AM

I found a case where this combine causes a codegen regression on AArch64. In the example below, %s0 puts data into a 128-bit vector and %s1 and %s2 extract the lower and upper halves. Without folding %s0 and %s1, we can generate a single AArch64 tbl instruction for %s0 and a mov instruction for %s2. With the fold in this patch, we generate 3 additional instructions: an additional tbl for %s2 and 2 instructions for loading the mask.

So on AArch64, the combine produces worse code in the case where we can generate a single tbl instruction for the top-level shuffle and extracting the lower and upper halves is cheap. Do you have an idea how best to address the issue?

define <8 x i16> @test(<16 x i8> %s) {
entry:
  %0 = sub <16 x i8> <i8 undef, i8 undef, i8 undef, i8 -1, i8 undef, i8 undef, i8 undef, i8 -1, i8 undef, i8 undef, i8 undef, i8 -1, i8 undef, i8 undef, i8 undef, i8 -1>, %s
  %s0 = shufflevector <16 x i8> %0, <16 x i8> undef, <16 x i32> <i32 3, i32 3, i32 3, i32 3, i32 7, i32 7, i32 7, i32 7, i32 11, i32 11, i32 11, i32 11, i32 15, i32 15, i32 15, i32 15>
  %s1 = shufflevector <16 x i8> %s0, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %s2 = shufflevector <16 x i8> %s0, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %a = call <8 x i16> @fn(<8 x i8> %s1, <8 x i8> %s2) #6
  ret <8 x i16> %a
}

declare <8 x i16> @fn(<8 x i8>, <8 x i8>)
Herald added a project: Restricted Project. Feb 5 2019, 8:36 AM

That case is specifically interesting because when it's split, each half is lowered to a different shuffle, and both shuffles are expensive.

Obviously you could try to find that specific pattern in SelectionDAG on AArch64, but I'm not sure it makes sense to expect the backend to try; instcombine probably shouldn't be creating difficult-to-solve optimization puzzles.

> That case is specifically interesting because when it's split, each half is lowered to a different shuffle, and both shuffles are expensive.
>
> Obviously you could try to find that specific pattern in SelectionDAG on AArch64, but I'm not sure it makes sense to expect the backend to try; instcombine probably shouldn't be creating difficult-to-solve optimization puzzles.

But can you be sure there won't ever be such a shuffle via some other means?
I don't know that much about AArch64, but this really sounds like something to be dealt with in the backend..

> That case is specifically interesting because when it's split, each half is lowered to a different shuffle, and both shuffles are expensive.
>
> Obviously you could try to find that specific pattern in SelectionDAG on AArch64, but I'm not sure it makes sense to expect the backend to try; instcombine probably shouldn't be creating difficult-to-solve optimization puzzles.

The goal for this patch was to make things easier for the backend, but I didn't think of this case of course. How about limiting the transform like this:

Index: InstCombine/InstCombineVectorOps.cpp
===================================================================
--- InstCombine/InstCombineVectorOps.cpp	(revision 353169)
+++ InstCombine/InstCombineVectorOps.cpp	(working copy)
@@ -1498,6 +1498,11 @@
   if (!match(Op0, m_ShuffleVector(m_Value(X), m_Value(Y), m_Constant(Mask))))
     return nullptr;
 
+  // Be extra conservative with shuffle transforms. If we can't kill the 1st
+  // shuffle, then combining may result in worse codegen.
+  if (!Op0->hasOneUse())
+    return nullptr;
+
   // We are extracting a subvector from a shuffle. Remove excess elements from
   // the 1st shuffle mask to eliminate the extract.
   //
fhahn added a comment. Feb 5 2019, 12:57 PM

>> That case is specifically interesting because when it's split, each half is lowered to a different shuffle, and both shuffles are expensive.
>>
>> Obviously you could try to find that specific pattern in SelectionDAG on AArch64, but I'm not sure it makes sense to expect the backend to try; instcombine probably shouldn't be creating difficult-to-solve optimization puzzles.
>
> The goal for this patch was to make things easier for the backend, but I didn't think of this case of course. How about limiting the transform like this:
>
> Index: InstCombine/InstCombineVectorOps.cpp
> ===================================================================
> --- InstCombine/InstCombineVectorOps.cpp	(revision 353169)
> +++ InstCombine/InstCombineVectorOps.cpp	(working copy)
> @@ -1498,6 +1498,11 @@
>    if (!match(Op0, m_ShuffleVector(m_Value(X), m_Value(Y), m_Constant(Mask))))
>      return nullptr;
>
> +  // Be extra conservative with shuffle transforms. If we can't kill the 1st
> +  // shuffle, then combining may result in worse codegen.
> +  if (!Op0->hasOneUse())
> +    return nullptr;
> +
>    // We are extracting a subvector from a shuffle. Remove excess elements from
>    // the 1st shuffle mask to eliminate the extract.
>    //

That would solve the issue, thanks. I think it would make sense to keep this transform quite conservative without knowing more about the backend. If it is helpful for some backends, it might make sense to do it at a later stage in the pipeline. Merging the shuffles back together in the backend might be possible in some cases, but I think it would be better if the backend did not have to fight instcombine in this case.

spatel added a comment. Feb 5 2019, 1:53 PM

>> That case is specifically interesting because when it's split, each half is lowered to a different shuffle, and both shuffles are expensive.
>>
>> Obviously you could try to find that specific pattern in SelectionDAG on AArch64, but I'm not sure it makes sense to expect the backend to try; instcombine probably shouldn't be creating difficult-to-solve optimization puzzles.
>
> But can you be sure there won't ever be such a shuffle via some other means?
> I don't know that much about AArch64, but this really sounds like something to be dealt with in the backend..

Yes, that's our default argument for IR canonicalization: if the source code was already in this form to begin with, then there must be some missing optimization already out there.
But we have a special restriction for shufflevector: we assume that masks are only ever created if they are easy for the target to digest. The corollary is that we don't create all-new shuffle masks in target-independent passes. That means we're going to miss some optimizations in IR that would have fallen out from intermediate transforms, but I think we have to live with that. As discussed in the earlier comments, this patch is right at the edge of the mask restriction.

spatel added a comment. Feb 5 2019, 3:01 PM

Added one-use check:
rL353235

fhahn added a comment. Feb 6 2019, 6:17 AM

Great, thanks Sanjay!