The cost of a single insert/extract to/from a vector is correct. But when summing the costs of multiple single insert/extract instructions, the total can be overestimated: the extraction/insertion of a 128-bit sub-vector is only needed once per group of elements that go to/from the same 128-bit sub-vector.
Using this interface in SLP fixes some cases that previously bailed out due to a high gathering cost.
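For illustration, here is a rough sketch of the call-site change this enables, assuming the new chained interface has a shape along the lines of getVectorInstrChainCost(Opcode, Val, Indices); the exact signature in the patch may differ:

  // Before: each index is costed independently, so on AVX targets the
  // 128-bit subvector extract/insert is charged once per element.
  unsigned Cost = 0;
  for (unsigned Idx : Indices)
    Cost += TTI.getVectorInstrCost(Instruction::InsertElement, VecTy, Idx);

  // After (hypothetical signature): the whole group of indices is costed at
  // once, so the subvector extract/insert is charged once per 128-bit lane.
  unsigned ChainCost =
      TTI.getVectorInstrChainCost(Instruction::InsertElement, VecTy, Indices);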
Diff Detail
Repository: rG LLVM Github Monorepo

Event Timeline
Could you please be more specific about your suggestion?
Unfortunately, the API documentation is no better than having none at all.
I really do not understand what this API is supposed to do:
unsigned getScalarizationOverhead(Type *Ty, bool Insert, bool Extract) const;
I believe Ty here must be a vector type.
Questions:
- Does it calculate the cost of extracting/inserting every vector element (i.e., scalarization of the whole vector)?
- It is not clear whether both boolean arguments can be true (in my opinion the purpose of the interface would be much clearer if they were documented as mutually exclusive). If they can, and both are true, is some scalar computation supposed to happen in between? With no scalar computation between the extracts and inserts, they can all be folded into a shuffle or optimized away.
Updated later:
After seeing the basic TTI implementation (I also found the interface descriptions there; it is not clear why they are not in TargetTransformInfo.h where they are supposed to be), I know the answers to these questions, so never mind.
As I see it, getScalarizationOverhead is simply a convenience wrapper around getVectorInstrCost. I still do not understand how it can be extended without first extending the latter. What is definitely missing is that the existing getScalarizationOverhead should be changed to use the new interface too, as it is directly related.
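For reference, the base implementation is essentially a per-element loop; the following is a paraphrase of the BasicTTIImpl.h version (approximate, not the verbatim source):

  unsigned getScalarizationOverhead(Type *Ty, bool Insert, bool Extract) {
    assert(Ty->isVectorTy() && "Can only scalarize vectors");
    unsigned Cost = 0;
    for (unsigned I = 0, E = Ty->getVectorNumElements(); I != E; ++I) {
      // Sums independent per-element costs; there is no notion here of
      // elements sharing the same 128-bit subvector.
      if (Insert)
        Cost += getVectorInstrCost(Instruction::InsertElement, Ty, I);
      if (Extract)
        Cost += getVectorInstrCost(Instruction::ExtractElement, Ty, I);
    }
    return Cost;
  }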
Updated getScalarizationOverhead to use the new TTI API for chained inserts/extracts.
This reduces the cost overestimates that https://reviews.llvm.org/D74976 introduced in many cases. Cost model tests were adjusted to reflect that.
Tests Transforms/SLPVectorizer/X86/resched.ll and Transforms/LoopVectorize/X86/strided_load_cost.ll were reverted to their state prior to D74976.
llvm/include/llvm/Analysis/TargetTransformInfo.h
958: s/iether/either/
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
475: I wonder if this should be

  return std::accumulate(Indices.begin(), Indices.end(), unsigned(0),
                         [&](unsigned CostSoFar, unsigned Index) {
                           return CostSoFar + getVectorInstrCost(Opcode, Val, Index);
                         });

or if this is too advanced for this particular impl.
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
475: I see your point. There is an assumption here that the cost is one per index. I agree that it is better to avoid any assumptions, but I'd use just a for loop here if you have no objections. Will fix it.
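For clarity, the plain-loop shape being proposed would be something like this (a sketch, not the exact patch code):

  unsigned Cost = 0;
  for (unsigned Index : Indices)
    Cost += getVectorInstrCost(Opcode, Val, Index); // no per-index cost assumptions
  return Cost;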
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29: Did SLP fail to recognize that this is a splat shuffle? I would have expected it to produce splat IR:
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29: No, it did not fail to recognize a splat. As I see from the code, a single element is deliberately not shuffled:

  Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {
    // Do not shuffle single element or if number of unique values is not power
    // of 2.
    if (UniqueValues.size() == VL.size() || UniqueValues.size() <= 1 ||
        !llvm::isPowerOf2_32(UniqueValues.size()))
      ReuseShuffleIndicies.clear();
    ...
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29: I tried to step through the debug spew from SLP, but I can't tell what is happening on this example. I only see a call to getGatherCost() at current line 3318 of SLPVectorizer.cpp, so I thought that is the point where we check for a splat. I don't understand this part of the model/usage very well, so others should review this patch, but there's still a concern here: ideally, SLP should not be producing this chain of inserts if it's a splat op. I don't know if that changes how we view the diff for the cost model.
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29: I think SLP recognizes the splat here but models it through a gather and relies on the InstCombiner to transform it into a real shuffle.
llvm/include/llvm/Analysis/TargetTransformInfo.h
676: I think we can avoid the need for getVectorInstrChainCost at all by adding an APInt DemandedElts mask to this?
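For what it's worth, a sketch of what that suggestion could look like; the DemandedElts-taking signature and the caller snippet below are hypothetical, not the current API:

  // Hypothetical overload: bit I of DemandedElts set means element I is
  // inserted/extracted, replacing an explicit list of chained indices.
  unsigned getScalarizationOverhead(Type *Ty, const APInt &DemandedElts,
                                    bool Insert, bool Extract) const;

  // Caller side: cost only elements 6 and 7 of an <8 x float>.
  APInt DemandedElts = APInt::getNullValue(8);
  DemandedElts.setBit(6);
  DemandedElts.setBit(7);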
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29: Nope, that is not the point where the splat is detected:

  if (E->State == TreeEntry::NeedToGather) {
    if (allConstant(VL))
      return 0;
    if (isSplat(VL)) {
      return ReuseShuffleCost +
             TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy, 0);
    }

And that is obviously different from what vectorizeTree does (see my previous post). This is a problem, and it explains your debugging experience.
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29: And this is what I meant: the SLP vectorizer does not emit a shuffle directly here; instead it relies on the InstCombiner. Sure, it can be improved.
llvm/include/llvm/Analysis/TargetTransformInfo.h
676: This interface is a second-level API, i.e. a convenience interface that handles a repeated pattern of getVectorInstrCost TTI interface use. I do not see how it can be extended without extending the latter to handle multiple indices at a time.
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29: I see. This at least deserves a comment in the code.
llvm/test/Transforms/SLPVectorizer/X86/resched.ll
22–29: Agreed. Actually, this is very old code. Originally, the SLP vectorizer did not emit shuffles at all; it just emitted sets of ExtractElement/InsertElement instructions, relying completely on the InstCombiner. We just had no time to improve/fix everything here.
As I see it, D78216 does not target the problem I was trying to resolve here. I specifically targeted extracts/inserts to/from vectors that are 256 or 512 bits wide.
If, for example, we need to estimate the cost of inserting elements 6 and 7 into an <8 x float> vector, the current approach is getVectorInstrCost for index 6 plus getVectorInstrCost for index 7. D78216 does not change that.
That is not quite accurate, and here is why.
It estimates as if the generated code were like this:
- extract the upper 128-bit subvector from the 256-bit vector
- insert element 6 into the 128-bit subvector
- insert the upper 128-bit subvector back into the 256-bit vector
- extract the upper 128-bit subvector from the 256-bit vector
- insert element 7 into the 128-bit subvector
- insert the upper 128-bit subvector back into the 256-bit vector
But the actual code will be:
- extract the upper 128-bit subvector from the 256-bit vector
- insert element 6 into the 128-bit subvector
- insert element 7 into the 128-bit subvector
- insert the upper 128-bit subvector back into the 256-bit vector
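To make the arithmetic concrete, here is a small self-contained sketch of the grouping idea; the function name and unit-cost parameters are purely illustrative and not the patch's actual code:

  #include <set>
  #include <vector>

  // Charge the 128-bit subvector extract/reinsert once per touched lane
  // instead of once per inserted element.
  unsigned chainedInsertCost(const std::vector<unsigned> &Indices,
                             unsigned EltBits, unsigned SubVecCost,
                             unsigned EltCost) {
    const unsigned EltsPerLane = 128 / EltBits;
    std::set<unsigned> Lanes;
    unsigned Cost = 0;
    for (unsigned Idx : Indices) {
      Cost += EltCost;                 // one element insert per index
      Lanes.insert(Idx / EltsPerLane); // 128-bit lane this index lives in
    }
    // One extract plus one reinsert of the subvector per touched lane.
    return Cost + 2 * SubVecCost * Lanes.size();
  }

  // For elements 6 and 7 of <8 x float> (EltsPerLane == 4), both indices fall
  // into lane 1, so with unit costs this gives 2 + 2 = 4 operations instead of
  // the per-element estimate of 2 * 3 = 6.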