Page MenuHomePhabricator

[SLP] Fix for PR31880: shuffle and vectorize repeated scalar ops on extracted elements

Authored by ABataev on Feb 21 2017, 7:19 AM.



Currently most of the time vectors of extractelement instructions are
treated as scalars that must be gathered into vectors. But in some
cases, like when we have extractelement instructions from single vector
with different constant indeces or from 2 vectors of the same size, we
can treat this operations as shuffle of a single vector or blending of 2

define <2 x i8> @g(<2 x i8> %x, <2 x i8> %y) {
  %x0 = extractelement <2 x i8> %x, i32 0
  %y1 = extractelement <2 x i8> %y, i32 1
  %x0x0 = mul i8 %x0, %x0
  %y1y1 = mul i8 %y1, %y1
  %ins1 = insertelement <2 x i8> undef, i8 %x0x0, i32 0
  %ins2 = insertelement <2 x i8> %ins1, i8 %y1y1, i32 1
  ret <2 x i8> %ins2

can be converted to something like

define <2 x i8> @g(<2 x i8> %x, <2 x i8> %y) {
  %1 = shufflevector <2 x i8> %x, <2 x i8> %y, <2 x i32> <i32 0, i32 3>
  %2 = mul <2 x i8> %1, %1
  ret <2 x i8> %2

Currently this type of conversion is considered as high cost

Event Timeline

ABataev created this revision.Feb 21 2017, 7:19 AM
mkuper edited edge metadata.Feb 21 2017, 2:00 PM

Thanks, Alexey.
Some fairly minor comments inline.


I think the name is a bit weird. If it's a blend, it's definitely a shuffle. What this really does is determine whether we're extracting from at most 2 sources, right? Maybe give a name that indicates this?


I'm not a fan of using SK_Alternate here. Can you use Optional<>?


Maybe define all 3 on one line? (If you think this is better, feel free to leave as is).


I think you're guaranteed that all members of a VL are the same instruction type, so you can just cast<> here, and avoid the check below.
(Please verify I'm right, though. :-) )


I think if the order of extracts and inserts just happens to be the same, we'll overestimate the cost. But that's fine for now. Could you check, and if we indeed overestimate, add a FIXME?


When can !UsedElements1.any() be true?


Maybe check whether VL[0] is an extract here? It would make it clearer that the code below is relevant only for extracts, as opposed to checking internally.

ABataev added inline comments.Feb 22 2017, 3:37 AM

Renamed it to isShuffle. Is this ok?


Oh, yes, thanks for the hint.


Ok, fine.


No, we're not guaranteed. This function works when we found out that this is gathering of elements. In this case, we have no information that all instructions are of the same type.


Yes, we're overestimated. Reworked it a bit. If UsedElements1.any() is false, it is SK_Alternate (i.e. blending, because we're not crossing lanes in extractelement instructions for different vectors), otherwise it is still estimated as SK_PermuteTwoSrc or SK_PermuteSingleSrc.


Ok, will do.

ABataev updated this revision to Diff 89347.Feb 22 2017, 4:12 AM

Address Michael's comments + minor fixes

mkuper added inline comments.

I think that makes sense. It's not ideal, but I don't have any good ideas either. "IsUpToTwoVectorShuffle" is... meh.


Ah, I see, ok.


Oh, I'm sorry, I misread &= as |=.

This still looks wrong, though. The way I understand it, right now SK_Alternate (which usually, for x86, maps to a blend - but the two are not equivalent) has very specific requirements for the mask. See isAlternateVectorMask() in CostModel.cpp. I don't think your code matches that.

Adding Elena who should probably know better than I do.

Elena, some comments?

A couple of comments, but otherwise I think this looks almost ready to go.


@mkuper is right - SK_Alternate is not a simple blend, it should only match shuffles which alternate between sequential elements from the 2 vectors (e.g. 0, 5, 2, 7). I've no idea why we went for that instead of a general SK_Blend but that's where we are...


Any way that this can be tied up? Maybe pull out cast<ExtractElementInst>(VL[I])->getIndexOperand())

I hate seeing -> on new lines....


I think this move to BoUpSLP::areAllUsersVectorized can be done as a NFC pre-commit.

ABataev marked an inline comment as done.Jul 27 2017, 8:36 AM
ABataev added inline comments.

Reworked this part


Ok, will do

RKSimon added inline comments.Jul 27 2017, 8:59 AM

use for range loop?

ABataev marked an inline comment as done.Jul 27 2017, 9:22 AM
ABataev updated this revision to Diff 108498.Jul 27 2017, 10:18 AM

Update after review

RKSimon edited edge metadata.Jul 27 2017, 11:24 AM

@mkuper Any further comments?

mkuper added inline comments.Jul 27 2017, 2:45 PM

Shouldn't this be something like

else if (Vec1 != Vec)



The condition of the ternary expression is kind of confusing. Maybe unpack it a bit?

ABataev marked 2 inline comments as done.Jul 28 2017, 6:12 AM
ABataev added inline comments.

Yes, you're correct, thanks.

ABataev updated this revision to Diff 108639.Jul 28 2017, 6:19 AM
ABataev marked an inline comment as done.

Update after review

One minor comment, but other than that it looks ok - @mkuper?


You could merge this and the isNegative test above into a single bounds test by using:

if (Idx->getValue().uge(Size))
unsigned IntIdx = Idx->getValue().getZExtValue();
ABataev updated this revision to Diff 108655.Jul 28 2017, 7:45 AM

Update after review

RKSimon accepted this revision.Aug 1 2017, 5:17 AM


This revision is now accepted and ready to land.Aug 1 2017, 5:17 AM
This revision was automatically updated to reflect the committed changes.
dorit added a subscriber: dorit.Aug 10 2017, 12:01 AM

Hi Alexey,

With this patch I'm seeing cases where SLP no longer vectorizes code that was vectorized before.
SLP detects that it is possible that the extracts will be optimized into a shuffle,
so it uses the cost of the shuffle in its profitability evaluation.
However the cost of the shuffle comes out very high, so SLP is abandoned…

In the specific case at hand the scenario is the following:
With the patch, SLP is asking about getShuffleCost(Permute_Single_Src, 16xi64).
So for AVX2 we fall back to return 12*Base::getShuffleCost(Permute_Two_Src,4xi64),
and since that cost is not modeled this in turn returns 12*getPermuteShuffleOverhead, which comes out to = 12*8 = 96 ….
In the end the overall cost of the shuffle (after subtracting 16 inserts) is 80,
compared to the cost of the sequence before the patch, which was 16 (using the result of getGatherCost).
96 is certainly an excessive cost, and should be fixed in the target. But regardless...:

I wonder if it may make sense to also have SLP examine both costs (gatherCost as before and ShuffleCost) and select the minimum one?

Another concern is that *maybe* SLP is not asking the exact right question when it calls getShuffleCost(Permute_Single_Src, 16xi64);
Apparently InstCombine does reach the decision to optimize the sequence into shuffles, so this may suggest that the question that SLP asks TTI when it tries to predict what instCombine will do, does not match the question that InstCombine will ask TTI...?

In any case, I'm working on fixing the cost that X86TTI returns, which fixes (or works around?) the problem for AVX2, but not yet for SSE2 on Silvermont (still under investigation).
But maybe the other directions of fixes in the SLPVectorizer suggested above should be considered as well...


Hi Dorit,
With this patch I tried to simulate what InstCombiner expects, i.e. I tried to synchronize the behavior of SLP vectorizer and InstCombiner. So, if you have some troubles with the cost, it means that it must be fixed for your target.
I don't think this is the right solution to calculate the gather cost here because InstCombiner does not treat these ops as gather. Just like I said, it is better to fix the cost model.

@dorit I've implemented most of the x86 single source shuffle costs at rL310632, please can you see if this has improved things for you?

dorit added a comment.Aug 20 2017, 5:41 AM

@dorit I've implemented most of the x86 single source shuffle costs at rL310632, please can you see if this has improved things for you?

Yes it does :-) thanks!