Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Phabricator shutdown timeline

[SLP]Improve isGatherShuffledEntry by trying per-register shuffle.
Needs ReviewPublic

Authored by ABataev on May 3 2023, 5:47 AM.

Details

Reviewers
RKSimon
vdmitrie
Summary

Currently when building gather/buildvector node, we try to build nodes
shuffles without taking into account separate vector registers. WE can
improve final codegen and the whole vectorization process by including
this info into the analysis and the vector code emission, allows to emit
better vectorized code.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2023, 5:47 AM
ABataev requested review of this revision.May 3 2023, 5:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2023, 5:47 AM
RKSimon added inline comments.May 4 2023, 3:54 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2490

Update the \returns description

ABataev updated this revision to Diff 519650.May 4 2023, 2:06 PM

Address comments

Hi Alexey,

I tried to measure performance impact of this patch. Observed ~10% regression on nnet_test spec from coremark-pro test suite. (on avx2 with -flto).

Hi Alexey,

I tried to measure performance impact of this patch. Observed ~10% regression on nnet_test spec from coremark-pro test suite. (on avx2 with -flto).

Could you provide more details what is the cause? The patch itself just improves shuffles emissions, most probably there are some problems with the TTI cost model.

command to reproduce: opt -passes=slp-vectorizer -mtriple=x86_64 -mcpu=core-avx2 -S test.ll

BTW. which gcc version you use to build? I've experienced problem building with both 7.5.0 and 8.5.0 (I did not try any later versions) while there were no issues w/o patch.
Here is fail log:
<dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3573:24: error: declaration of llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI [-fpermissive]

TargetTransformInfo *TTI;
                     ^~~

In file included from <dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:47:0:
<dir>/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:205:29: error: changes meaning of TTI from typedef class llvm::TargetTransformInfo llvm::TTI [-fpermissive]
typedef TargetTransformInfo TTI;

^~~
ABataev added a comment.EditedJun 29 2023, 9:55 AM

command to reproduce: opt -passes=slp-vectorizer -mtriple=x86_64 -mcpu=core-avx2 -S test.ll

HMM, llvm-mca shows that the code after patch is better than the code before (with the patch the throughput is 4.0, without - 5.0)

https://godbolt.org/z/zvzTKedPd

BTW. which gcc version you use to build? I've experienced problem building with both 7.5.0 and 8.5.0 (I did not try any later versions) while there were no issues w/o patch.
Here is fail log:
<dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3573:24: error: declaration of llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI [-fpermissive]

TargetTransformInfo *TTI;
                     ^~~

In file included from <dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:47:0:
<dir>/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:205:29: error: changes meaning of TTI from typedef class llvm::TargetTransformInfo llvm::TTI [-fpermissive]
typedef TargetTransformInfo TTI;

^~~

It is most probably because of the using TTI:: in the function declaration, will fix this.

command to reproduce: opt -passes=slp-vectorizer -mtriple=x86_64 -mcpu=core-avx2 -S test.ll

HMM, llvm-mca shows that the code after patch is better than the code before (with the patch the throughput is 4.0, without - 5.0)

https://godbolt.org/z/zvzTKedPd

It puzzled me too actually. It was the only vectorization candidate that changed behavior due to the patch in the test. I could not find any obvious flaw in handling subsequent vectorization sequences.
Eyeballing of generated code before/after also did not reveal any obvious issues.

command to reproduce: opt -passes=slp-vectorizer -mtriple=x86_64 -mcpu=core-avx2 -S test.ll

HMM, llvm-mca shows that the code after patch is better than the code before (with the patch the throughput is 4.0, without - 5.0)

https://godbolt.org/z/zvzTKedPd

It puzzled me too actually. It was the only vectorization candidate that changed behavior due to the patch in the test. I could not find any obvious flaw in handling subsequent vectorization sequences.
Eyeballing of generated code before/after also did not reveal any obvious issues.

I fixed cost estimation, now it results in the same translation (I added your test, no changes)

ABataev updated this revision to Diff 538237.Jul 7 2023, 1:02 PM

Rebase, ping!

ABataev updated this revision to Diff 538755.Jul 10 2023, 11:46 AM

Rebase, ping!

ABataev updated this revision to Diff 540995.Jul 17 2023, 6:40 AM

Rebase, ping!

ABataev updated this revision to Diff 544471.Jul 26 2023, 12:20 PM

Rebase, ping!

ABataev updated this revision to Diff 546043.Aug 1 2023, 7:14 AM

Rebase, ping!!!
Required to continue the work on cost model unification and non-power-2.

ABataev updated this revision to Diff 547869.Aug 7 2023, 11:25 AM

Rebase, ping!

ABataev updated this revision to Diff 548992.Aug 10 2023, 5:33 AM

Rebase, ping!

Why are we doing this in SLP and not TTI getShuffleCost? It looks like TTI should be doing a better job of recognizing how it should split the shuffle mask and determine the shuffle kinds per split?

Why are we doing this in SLP and not TTI getShuffleCost? It looks like TTI should be doing a better job of recognizing how it should split the shuffle mask and determine the shuffle kinds per split?

I thought about this. Unfortunately, TTI is too late here + it does not have idion of TreeEntry.

What this function does? It scans the list of scalars and then checks, if it can build a shuffle, using previously build TreeEntries (use previously build vector to build the new one) to reduce number of inserts.
It supports only 1- and 2-vector shuffles, as 3- and more vector shuffles may not be effective for some targets.
Why cannot it be done in TTI? 1. TTI does not know about TreeEntries. 2. Without this knowledge it won't help in complex situations.

Assume we have 3 already built TreeEntries: <abcd>, <efgh>, <ijkl>. But the actual vector register includes only 2 elements (i.e. each this vector uses 2 vector registers). And we trying to build a new buildvector (NeedToGather) entry with elements <acih>.
Original function will help to translate this gather node into this:

%s1 = shuffle <abcd>, <efgh>, <0,2,poison,7>
%v = insertelement %s1, i, 2 - might be very cost ineffective + there might be several such inserts, increasing the overall cost of the buildvector/gather node.

This reworked function can do this:

%s1 = shuffle <abcd>, poison, <0,2>
%s2 = shuffle <efgh>, <ijkl>, <4,3>
%v = shuffle %s1, %s2, <0,1,2,3> - free insert subvector shuffle

TTI won't be able to transform the first variant to the second one without knowing that there is <ijkl> vector already exists. It requires teaching it about the whole SLP graph idiom.

ABataev updated this revision to Diff 552339.Aug 22 2023, 6:51 AM

Rebase, ping!

ABataev updated this revision to Diff 554447.Tue, Aug 29, 11:59 AM

Rebase, ping!

ikelarev removed a subscriber: ikelarev.Wed, Sep 6, 8:10 AM
reames added a subscriber: reames.Wed, Sep 6, 2:44 PM
reames added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9011–9013

Stylistic suggestion here. This function is quite complicated, and following all the changes through are tricky. I *think* this is just performing the same operation for each sub-range (divided by register). If so, I'd suggest leaving the function alone, and introducing a wrapper which just calls the old version with a sub-range of the VLs, and builds up the vector results.

If this works here, then applying the same basic idea throughout the patch would make it much easier to follow and review.

ABataev added inline comments.Wed, Sep 6, 3:04 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9011–9013

Yes, it splits the multi-register vector into several one-register vectors and just performs the same analysis, as before, to avoid long multi-vector shuffles. Will try to move the previous logic into a separate function.

ABataev updated this revision to Diff 556298.Fri, Sep 8, 1:12 PM

Rebase, address comments

ABataev updated this revision to Diff 556434.Mon, Sep 11, 8:05 AM

Rebase, ping!

ABataev updated this revision to Diff 556877.Fri, Sep 15, 12:47 PM

Rebase, ping!

I've done my best, but I'm really struggling to understand a lot of this patch :(

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7156

Add method description, also why is it called estimateNodesPermuteCost when it doesn't seem to use/return costs?

7165

This logic makes very little sense to me

7676

Would we be better off merging the 1TE + 2TE add() methods?

I've done my best, but I'm really struggling to understand a lot of this patch :(

Generally speaking, it does the same as before, just splits the node into vector registers and tries to do per register shuffles instead of per node shuffles. It may reduce final number of shuffles, since we do not need to permute all the register in the node, just some of them. The single node may result in several vector registers, no need to permute all of them.

ABataev added inline comments.Wed, Sep 20, 1:13 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7156

It adds the cost to final cost estimation, see e.g. lines 7169 and 7176.

7165

If we decided to permute E1 and (possibly) E2 already, and we found that we again need to permute the same nodes, no need to do it immediately, instead need to include the sub-Mask into CommonMask and do it later to avoid double shuffling of the same nodes.

7676

I'll try to rework this function to avoid it.

ABataev updated this revision to Diff 557185.Thu, Sep 21, 9:11 AM

Rebase, address comments