Page MenuHomePhabricator

[TTI][SLP] Add TTI interface to estimate cost of chain of vector inserts/extracts.
AbandonedPublic

Authored by vdmitrie on Mar 27 2020, 3:03 PM.

Details

Summary

Cost for single insert/extract to/from vector is correct. But when summing costs
of single insert/extract instructions it can be overestimated as extraction/insertion
of 128 bits sub-vectors only needed once per group of elements that go to/from
same 128 bits sub-vector.
Using this interface in SLP fixes some cases that previously bailed due to
high gathering cost.

Diff Detail

Event Timeline

vdmitrie created this revision.Mar 27 2020, 3:03 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 27 2020, 3:03 PM
vdmitrie updated this revision to Diff 253266.Mar 27 2020, 5:57 PM

Fixed format and test failures.

We already have getScalarizationOverhead - can't we extend that instead?

vdmitrie added a comment.EditedMar 30 2020, 10:09 AM

We already have getScalarizationOverhead - can't we extend that instead?

Could you please be more specific about your suggestion?
Unfortunately the API documentation is no better than nothing.
I really do not understand what this API is supposed to do:

unsigned getScalarizationOverhead(Type *Ty, bool Insert, bool Extract) const;

Type here I believe must be a vector type.
Questions:

  • Whether it calculates cost of every vector element extract/insert (i.e. scalarization of whole vector)?
  • It is not clear whether boolean arguments can both be true (in my opinion the purpose of the interface could be much clearer if they were documented mutually exclusive).

if yes, and if both are true – whether any scalar computation supposed to be in between.
With no scalar computation between extracts and inserts they all can be folded into shuffle or optimized away.

Updated later:
After I've seen basic TTI implementation (I also found interface descriptions there, not clear why not in TargetTransformInfo.h where it supposed to be) I know answers for these questions. Thus never mind.
As I see getScalarizationOverhead is simply a convenience wrapper around getVectorInstrCost. I still do not understand how It can be extended without first extending the latter. What definitely missed is that existing getScalarizationOverhead should be changed to use new interface too as it directly related.

vdmitrie updated this revision to Diff 253742.Mar 30 2020, 5:29 PM

Updated getScalarizationOverhead to use new TTI API for chained inserts/extracts.
This reduced cost overestimates in many cases brought with https://reviews.llvm.org/D74976.
Cost model tests adjusted to reflect that.
Tests Transforms/SLPVectorizer/X86/resched.ll and Transforms/LoopVectorize/X86/strided_load_cost.ll reverted to state prior to D74976.

vdmitrie updated this revision to Diff 253769.Mar 30 2020, 7:13 PM

clang-format

lebedev.ri added inline comments.Mar 30 2020, 11:39 PM
llvm/include/llvm/Analysis/TargetTransformInfo.h
958

s/iether/either/

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
475

I wonder if this should be

return std::accumulate(Indices.begin(), Indices.end(), unsigned(0), 
                       [&](unsigned CostSoFar, unsigned Index) {
                         return CostSoFar + getVectorInstrCost(Opcode, Val, Index);
                       }
       );

or this is too advanced for this particular impl

vdmitrie marked 2 inline comments as done.Mar 31 2020, 9:20 AM
vdmitrie added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
475

I see your point. There is an assumption here that cost is one per index. I agree that it is better to avoid any assumptions but I'd use just for loop here if you have no objections. Will fix it.

vdmitrie updated this revision to Diff 253912.Mar 31 2020, 9:22 AM

Addressed comments.

spatel added inline comments.Apr 8 2020, 3:04 PM
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29

Did SLP fail to recognize that this is a splat shuffle? I would have expected it to produce splat IR:
(shuffle (insert X, 0), zeroinitializer)
...instead of a chain of inserts.

vdmitrie marked an inline comment as done.Apr 8 2020, 5:20 PM
vdmitrie added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29

No. It did not fail to recognize a splat. As I see from code single element is not shuffled deliberately:

Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {
...

// Do not shuffle single element or if number of unique values is not power
// of 2.
if (UniqueValues.size() == VL.size() || UniqueValues.size() <= 1 ||
    !llvm::isPowerOf2_32(UniqueValues.size()))
  ReuseShuffleIndicies.clear();

...

spatel added inline comments.Apr 9 2020, 9:30 AM
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29

I tried to step through the debug spew from SLP, but I can't tell what is happening on this example. I only see a call to getGatherCost() at current line 3318 of SLPVectorizer.cpp, so I thought that is the point where we check for a splat.

I don't understand this part of the model/usage very well so others should review this patch, but there's still a concern here: ideally, SLP should not be producing this chain of inserts if it's a splat op. I don't know if that changes how we view the diff for the cost model.

ABataev added inline comments.Apr 9 2020, 9:43 AM
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29

I think, SLP recognizes splat here, but models it through gather, and relies on the InstCombiner to transform it into a real shuffle.

RKSimon added inline comments.Apr 9 2020, 9:52 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
676

I think we can avoid the need for getVectorInstrChainCost at all by adding a APInt DemandedElts mask to this?

vdmitrie marked an inline comment as done.Apr 9 2020, 10:10 AM
vdmitrie added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29

Nope. That is not the point where splat is detected.
First SLP checks for splat when it builds a tree. Such tree entries are created with NeedToGather state. You can locate the place with "SLP: Gathering due to C,S,B,O." message.
Then cost is calculated in int BoUpSLP::getEntryCost(TreeEntry *E).
Ideally we want cost calculation routine to calculate cost for such entry taking into account how VectorizeTree then would handle it.
And here is what I see there:

if (E->State == TreeEntry::NeedToGather) {
  if (allConstant(VL))
    return 0;
  if (isSplat(VL)) {
    return ReuseShuffleCost +
           TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy, 0);
  }

And that is obviously different from what vectorizeTree does(see my previous post). This is a problem. And it explains your debugging experience.
It definitely has to be fixed but I believe not in this patch. This patch specifically targets calculation of gathering cost for multiple indexes for vectors with size greater than 128bits. Specifically it fixes regression in this area caused by D74976
and wrt to this test case (and also one other) it is a revert of D74976 changes.

ABataev added inline comments.Apr 9 2020, 10:25 AM
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29

And this what I meant. SLP vectorizer does not emit shuffle directly here, instead it relies on InstCombiner. Sure, it can be improved.

vdmitrie added inline comments.Apr 9 2020, 10:26 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
676

This interface is a second level API. i.e. it is a convenience interface that handles repeated pattern of getVectorInstrCost TTI interface use. I do not see how it can be extended without extending the latter to handle multiple indexes at a time.
We can probably make getVectorInstrChainCost private and extend getScalarizationOverhead with mask argument but can't totally avoid creating it. Is this is the way you want me to change it?

vdmitrie added inline comments.Apr 9 2020, 10:30 AM
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29

I see. This at least deserves a comment in the code.

ABataev added inline comments.Apr 9 2020, 10:58 AM
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29

Agree. Actually, this is very old code. Originally, SLP vectorizer did not emit shuffles at all, it just emitted sets of ExtractElement/InsertElement instructions completely relying on the InstCombiner. We just had no time to improve/fix everything here.

@vdmitrie I've created D78216 to demonstrate the approach I had in mind.

@vdmitrie I've created D78216 to demonstrate the approach I had in mind.

As I see D78216 does not target the problem I was trying to resolve here. I specifically targeted extracts/inserts to/from vectors that are 256bits or 512 bits.

If we for example need to estimate cost to insert elements 6 and 7 into <8 x float> vector current approach is: getVectorInstrCost for index 6 plus getVectorInstrCost for index 7. D78216 does not change that.
That isn't quite accurate. And here is why.
It estimates as if generated code was like this:

  1. extract upper 128bits subvector from 256 bits vector
  2. insert element 6 into the 128bits subvector
  3. insert upper 128bits subvector back to 256 vector
  4. extract upper 128bits subvector from 256 bits vector
  5. insert element 7 into the 128bits subvector
  6. insert upper 128bits subvector back to 256 vector

But actual code will be:

  1. extract upper 128bits subvector from 256 bits vector
  2. insert element 6 into the 128bits subvector
  3. insert element 7 into the 128bits subvector
  4. insert upper 128bits subvector back to 256 vector
vdmitrie abandoned this revision.Apr 28 2020, 1:08 PM