Page MenuHomePhabricator

[InstCombine] Extend collectShuffleElements to support extract/zext/insert patterns
AbandonedPublic

Authored by joey on Aug 16 2018, 6:47 AM.

Details

Summary

collectShuffleElements already handles combining the following into a single shufflevector:

%elt0 = extractelement <8 x i16> %in, i32 3
%elt1 = extractelement <8 x i16> %in, i32 1
%elt2 = extractelement <8 x i16> %in2, i32 0
%elt3 = extractelement <8 x i16> %in, i32 2

%vec.0 = insertelement <4 x i16> undef, i16 %elt0, i32 0
%vec.1 = insertelement <4 x i16> %vec.0, i16 %elt1, i32 1
%vec.2 = insertelement <4 x i16> %vec.1, i16 %elt2, i32 2
%vec.3 = insertelement <4 x i16> %vec.2, i16 %elt3, i32 3

This patch extends it to handle the following, by turning it into shufflevector + ext.

%elt0e = extractelement <8 x i16> %in, i32 3
%elt1e = extractelement <8 x i16> %in, i32 1
%elt2e = extractelement <8 x i16> %in, i32 0
%elt3e = extractelement <8 x i16> %in, i32 3

%elt0 = zext i16 %elt0e to i32
%elt1 = zext i16 %elt1e to i32
%elt2 = zext i16 %elt2e to i32
%elt3 = zext i16 %elt3e to i32

%vec.0 = insertelement <4 x i32> undef, i32 %elt0, i32 0
%vec.1 = insertelement <4 x i32> %vec.0, i32 %elt1, i32 1
%vec.2 = insertelement <4 x i32> %vec.1, i32 %elt2, i32 2
%vec.3 = insertelement <4 x i32> %vec.2, i32 %elt3, i32 3

Diff Detail

Event Timeline

joey created this revision.Aug 16 2018, 6:47 AM
  1. Why is this limited to extensions? Why can't this also be done for trunc? Or more generally, why don't we want to do this if the same operation is applied for all the elements?
  2. Do the vectorizer passes handle this? (Especially in light of the last question of 1.)
lib/Transforms/InstCombine/InstCombineVectorOps.cpp
477

Why ZI?
Is this limited to zext?
CI perhaps?

801

Same

joey added a comment.Aug 16 2018, 7:12 AM
  1. Why is this limited to extensions? Why can't this also be done for trunc? Or more generally, why don't we want to do this if the same operation is applied for all the elements?
  2. Do the vectorizer passes handle this? (Especially in light of the last question of 1.)
  1. No real reason, just the test case I wrote the code for. I can do the same for trunc as well. I'm not sure if it makes sense for the other CastInst operators though.
  2. I tried the test case with opt -slp-vectorizer and it doesn't catch it. Anything else I can try?
lib/Transforms/InstCombine/InstCombineVectorOps.cpp
477

ZI just because I originally wrote the patch for ZExtInst. I can change that to CI. It works for zext and sext currently.

  1. Why is this limited to extensions? Why can't this also be done for trunc? Or more generally, why don't we want to do this if the same operation is applied for all the elements?
  2. Do the vectorizer passes handle this? (Especially in light of the last question of 1.)
  1. No real reason, just the test case I wrote the code for. I can do the same for trunc as well. I'm not sure if it makes sense for the other CastInst operators though.
  1. I tried the test case with opt -slp-vectorizer and it doesn't catch it. Anything else I can try?

I see, CC'ing @ABataev / @RKSimon / @mssimpso
I really wonder if this is one of these cases where we shouldn't do it in instcombine, even if it can be done, but elsewhere..
Because it isn't perfectly obvious [to me] why we would want to stop on these 3 casts, and not keep piling more stuff here.

It could be the kind of thing we should do in slp @ABataev what do you think?

It could be the kind of thing we should do in slp @ABataev what do you think?

Yes, looks like the opportunity for the SLP Vectorizer.

joey added a comment.Aug 16 2018, 8:27 AM

Here is the (reduced) motivating example:

__kernel void foo(__global uchar4 *p1, __global ushort2 *p2)
{
    uchar4 t0 = p1[0];
    uchar4 t1 = p1[1];
    
    ushort2 t00 = (ushort2)((ushort)t0.x, (ushort)t0.y);
    ushort2 t10 = (ushort2)((ushort)t1.x, (ushort)t1.y);
    
    *p2 += (t00 * t10);
}

I haven't worked with the SLPVectorizer before, so would need some guidance in making the change there. Or someone could take over the change, if it's easier.

I found that if I apply the following patch:

diff --git a/lib/Transforms/Vectorize/SLPVectorizer.cpp b/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 32df6d58157..76103732adc 100644
--- a/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/lib/Transforms/Vectorize/SLPVectorizer.cpp 
@@ -2431,8 +2431,10 @@ bool BoUpSLP::isFullyVectorizableTinyTree() {
     return true;
 
   // Gathering cost would be too much for tiny trees.
+/*
   if (VectorizableTree[0].NeedToGather || VectorizableTree[1].NeedToGather)
     return false;
+*/

Using the test:

define <4 x i32> @test3(<8 x i16> %in, <8 x i16> %in2) {
  %elt0e = extractelement <8 x i16> %in, i32 3
  %elt1e = extractelement <8 x i16> %in, i32 1
  %elt2e = extractelement <8 x i16> %in, i32 0
  %elt3e = extractelement <8 x i16> %in, i32 3
  %elt0 = zext i16 %elt0e to i32
  %elt1 = zext i16 %elt1e to i32
  %elt2 = zext i16 %elt2e to i32
  %elt3 = zext i16 %elt3e to i32
  %vec.0 = insertelement <4 x i32> undef, i32 %elt0, i32 0
  %vec.1 = insertelement <4 x i32> %vec.0, i32 %elt1, i32 1
  %vec.2 = insertelement <4 x i32> %vec.1, i32 %elt2, i32 2
  %vec.3 = insertelement <4 x i32> %vec.2, i32 %elt3, i32 3
  ret <4 x i32> %vec.3
}

The SLPVectorizer produces:

define <4 x i32> @test3(<8 x i16> %in, <8 x i16> %in2) {
  %elt0e = extractelement <8 x i16> %in, i32 3
  %elt1e = extractelement <8 x i16> %in, i32 1
  %elt2e = extractelement <8 x i16> %in, i32 0
  %1 = insertelement <4 x i16> undef, i16 %elt0e, i32 0
  %2 = insertelement <4 x i16> %1, i16 %elt1e, i32 1
  %3 = insertelement <4 x i16> %2, i16 %elt2e, i32 2
  %4 = insertelement <4 x i16> %3, i16 %elt0e, i32 3
  %5 = zext <4 x i16> %4 to <4 x i32>
  %6 = extractelement <4 x i32> %5, i32 0
  %vec.0 = insertelement <4 x i32> undef, i32 %6, i32 0
  %7 = extractelement <4 x i32> %5, i32 1
  %vec.1 = insertelement <4 x i32> %vec.0, i32 %7, i32 1
  %8 = extractelement <4 x i32> %5, i32 2
  %vec.2 = insertelement <4 x i32> %vec.1, i32 %8, i32 2
  %9 = extractelement <4 x i32> %5, i32 3
  %vec.3 = insertelement <4 x i32> %vec.2, i32 %9, i32 3
  ret <4 x i32> %vec.3
}

Then InstCombine can clean that up into:

define <4 x i32> @test3(<8 x i16> %in, <8 x i16> %in2) {
  %1 = shufflevector <8 x i16> %in, <8 x i16> undef, <4 x i32> <i32 3, i32 1, i32 0, i32 3>
  %2 = zext <4 x i16> %1 to <4 x i32>
  ret <4 x i32> %2
}

So it looks like the SLPVectorizer already can do this, with some tweaks.

renlin added a subscriber: renlin.Aug 16 2018, 9:03 AM

I haven't worked with the SLPVectorizer before, so would need some guidance in making the change there. Or someone could take over the change, if it's easier.

I found that if I apply the following patch:

diff --git a/lib/Transforms/Vectorize/SLPVectorizer.cpp b/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 32df6d58157..76103732adc 100644
--- a/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/lib/Transforms/Vectorize/SLPVectorizer.cpp 
@@ -2431,8 +2431,10 @@ bool BoUpSLP::isFullyVectorizableTinyTree() {
     return true;
 
   // Gathering cost would be too much for tiny trees.
+/*
   if (VectorizableTree[0].NeedToGather || VectorizableTree[1].NeedToGather)
     return false;
+*/

It seems that, the buildvector instructions are not counted in the VectorizableTree while the cost is properly included as a parameter to tryToVectorizeList.

lebedev.ri requested changes to this revision.Aug 29 2018, 2:01 AM

As discussed, most likely something for a proper vectorizer.
Removing from review queue..

This revision now requires changes to proceed.Aug 29 2018, 2:01 AM
joey added a comment.Sep 3 2018, 2:14 AM

Some guidance on how to fix this in the SLPVectorzier would be helpful. Or if it's small enough that someone else can fix it, that's fine with me too.

joey abandoned this revision.Nov 22 2018, 3:52 AM

Filed a bug report, so we don't forget this: https://bugs.llvm.org/show_bug.cgi?id=39768