Currently when building gather/buildvector node, we try to build nodes
shuffles without taking into account separate vector registers. WE can
improve final codegen and the whole vectorization process by including
this info into the analysis and the vector code emission, allows to emit
better vectorized code.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2490 | Update the \returns description |
Hi Alexey,
I tried to measure performance impact of this patch. Observed ~10% regression on nnet_test spec from coremark-pro test suite. (on avx2 with -flto).
Could you provide more details what is the cause? The patch itself just improves shuffles emissions, most probably there are some problems with the TTI cost model.
command to reproduce: opt -passes=slp-vectorizer -mtriple=x86_64 -mcpu=core-avx2 -S test.ll
BTW. which gcc version you use to build? I've experienced problem building with both 7.5.0 and 8.5.0 (I did not try any later versions) while there were no issues w/o patch.
Here is fail log:
<dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3573:24: error: declaration of llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI [-fpermissive]
TargetTransformInfo *TTI; ^~~
In file included from <dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:47:0:
<dir>/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:205:29: error: changes meaning of TTI from typedef class llvm::TargetTransformInfo llvm::TTI [-fpermissive]
typedef TargetTransformInfo TTI;
^~~
HMM, llvm-mca shows that the code after patch is better than the code before (with the patch the throughput is 4.0, without - 5.0)
https://godbolt.org/z/zvzTKedPd
BTW. which gcc version you use to build? I've experienced problem building with both 7.5.0 and 8.5.0 (I did not try any later versions) while there were no issues w/o patch.
Here is fail log:
<dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3573:24: error: declaration of llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI [-fpermissive]TargetTransformInfo *TTI; ^~~In file included from <dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:47:0:
<dir>/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:205:29: error: changes meaning of TTI from typedef class llvm::TargetTransformInfo llvm::TTI [-fpermissive]
typedef TargetTransformInfo TTI;^~~
It is most probably because of the using TTI:: in the function declaration, will fix this.
It puzzled me too actually. It was the only vectorization candidate that changed behavior due to the patch in the test. I could not find any obvious flaw in handling subsequent vectorization sequences.
Eyeballing of generated code before/after also did not reveal any obvious issues.
I fixed cost estimation, now it results in the same translation (I added your test, no changes)
Rebase, ping!!!
Required to continue the work on cost model unification and non-power-2.
Why are we doing this in SLP and not TTI getShuffleCost? It looks like TTI should be doing a better job of recognizing how it should split the shuffle mask and determine the shuffle kinds per split?
I thought about this. Unfortunately, TTI is too late here + it does not have idion of TreeEntry.
What this function does? It scans the list of scalars and then checks, if it can build a shuffle, using previously build TreeEntries (use previously build vector to build the new one) to reduce number of inserts.
It supports only 1- and 2-vector shuffles, as 3- and more vector shuffles may not be effective for some targets.
Why cannot it be done in TTI? 1. TTI does not know about TreeEntries. 2. Without this knowledge it won't help in complex situations.
Assume we have 3 already built TreeEntries: <abcd>, <efgh>, <ijkl>. But the actual vector register includes only 2 elements (i.e. each this vector uses 2 vector registers). And we trying to build a new buildvector (NeedToGather) entry with elements <acih>.
Original function will help to translate this gather node into this:
%s1 = shuffle <abcd>, <efgh>, <0,2,poison,7> %v = insertelement %s1, i, 2 - might be very cost ineffective + there might be several such inserts, increasing the overall cost of the buildvector/gather node.
This reworked function can do this:
%s1 = shuffle <abcd>, poison, <0,2> %s2 = shuffle <efgh>, <ijkl>, <4,3> %v = shuffle %s1, %s2, <0,1,2,3> - free insert subvector shuffle
TTI won't be able to transform the first variant to the second one without knowing that there is <ijkl> vector already exists. It requires teaching it about the whole SLP graph idiom.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
9011–9013 | Stylistic suggestion here. This function is quite complicated, and following all the changes through are tricky. I *think* this is just performing the same operation for each sub-range (divided by register). If so, I'd suggest leaving the function alone, and introducing a wrapper which just calls the old version with a sub-range of the VLs, and builds up the vector results. If this works here, then applying the same basic idea throughout the patch would make it much easier to follow and review. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
9011–9013 | Yes, it splits the multi-register vector into several one-register vectors and just performs the same analysis, as before, to avoid long multi-vector shuffles. Will try to move the previous logic into a separate function. |
I've done my best, but I'm really struggling to understand a lot of this patch :(
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
7156 | Add method description, also why is it called estimateNodesPermuteCost when it doesn't seem to use/return costs? | |
7165 | This logic makes very little sense to me | |
7676 | Would we be better off merging the 1TE + 2TE add() methods? |
Generally speaking, it does the same as before, just splits the node into vector registers and tries to do per register shuffles instead of per node shuffles. It may reduce final number of shuffles, since we do not need to permute all the register in the node, just some of them. The single node may result in several vector registers, no need to permute all of them.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
7156 | It adds the cost to final cost estimation, see e.g. lines 7169 and 7176. | |
7165 | If we decided to permute E1 and (possibly) E2 already, and we found that we again need to permute the same nodes, no need to do it immediately, instead need to include the sub-Mask into CommonMask and do it later to avoid double shuffling of the same nodes. | |
7676 | I'll try to rework this function to avoid it. |