This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
llvm/test/Transforms/SLPVectorizer/X86/multi-nodes-to-shuffle.ll

Differential D149742

[SLP]Improve isGatherShuffledEntry by trying per-register shuffle.
ClosedPublic

Authored by ABataev on May 3 2023, 5:47 AM.

Details

Reviewers

RKSimon
vdmitrie

Commits

rG196d154ab7a7: [SLP]Improve isGatherShuffledEntry by trying per-register shuffle.
rG560bad013ebc: [SLP]Improve isGatherShuffledEntry by trying per-register shuffle.

Summary

Currently, when building a gather/buildvector node, we try to build node
shuffles without taking separate vector registers into account. We can
improve the final codegen and the whole vectorization process by including
this information in the analysis and the vector code emission, which allows
us to emit better vectorized code.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ABataev created this revision.May 3 2023, 5:47 AM

Herald added a project: Restricted Project. May 3 2023, 5:47 AM

Herald added subscribers: vporpo, hiraditya.

ABataev requested review of this revision. May 3 2023, 5:47 AM

Herald added a project: Restricted Project. May 3 2023, 5:47 AM

Herald added a subscriber: pcwang-thead.

Harbormaster completed remote builds in B229664: Diff 519040.May 3 2023, 6:34 AM

Rebase

Harbormaster completed remote builds in B229790: Diff 519219.May 3 2023, 2:13 PM

RKSimon added inline comments.May 4 2023, 3:54 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2515	Update the \returns description

Address comments

Harbormaster completed remote builds in B230093: Diff 519650.May 4 2023, 3:13 PM

Rebase

Harbormaster completed remote builds in B230293: Diff 519927.May 5 2023, 11:30 AM

Rebase

Harbormaster completed remote builds in B231090: Diff 520984.May 10 2023, 7:44 AM

Rebase

Harbormaster completed remote builds in B232330: Diff 522636.May 16 2023, 10:36 AM

Rebase

Harbormaster completed remote builds in B233887: Diff 524736.May 23 2023, 10:38 AM

Rebase

Harbormaster completed remote builds in B235278: Diff 526595.May 30 2023, 7:22 AM

Ping!

Hi Alexey,

I tried to measure performance impact of this patch. Observed ~10% regression on nnet_test spec from coremark-pro test suite. (on avx2 with -flto).

In D149742#4396406, @vdmitrie wrote:

Hi Alexey,

I tried to measure performance impact of this patch. Observed ~10% regression on nnet_test spec from coremark-pro test suite. (on avx2 with -flto).

Could you provide more details on the cause? The patch itself just improves shuffle emission; most probably there are some problems with the TTI cost model.

ikelarev added a subscriber: ikelarev.Jun 6 2023, 10:02 AM

test.ll (935 B)

command to reproduce: opt -passes=slp-vectorizer -mtriple=x86_64 -mcpu=core-avx2 -S test.ll

BTW, which gcc version do you use to build? I've experienced problems building with both 7.5.0 and 8.5.0 (I did not try any later versions), while there were no issues without the patch.
Here is the failure log:
<dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3573:24: error: declaration of llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI [-fpermissive]
TargetTransformInfo *TTI;
                     ^~~
In file included from <dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:47:0:
<dir>/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:205:29: error: changes meaning of TTI from typedef class llvm::TargetTransformInfo llvm::TTI [-fpermissive]
typedef TargetTransformInfo TTI;
                            ^~~

In D149742#4401811, @vdmitrie wrote:

test.ll (935 B)

command to reproduce: opt -passes=slp-vectorizer -mtriple=x86_64 -mcpu=core-avx2 -S test.ll

Hmm, llvm-mca shows that the code after the patch is better than the code before (the throughput is 4.0 with the patch and 5.0 without).

https://godbolt.org/z/zvzTKedPd

BTW. which gcc version you use to build? I've experienced problem building with both 7.5.0 and 8.5.0 (I did not try any later versions) while there were no issues w/o patch.
Here is fail log:
<dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3573:24: error: declaration of llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI [-fpermissive]
TargetTransformInfo *TTI;
                     ^~~
In file included from <dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:47:0:
<dir>/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:205:29: error: changes meaning of TTI from typedef class llvm::TargetTransformInfo llvm::TTI [-fpermissive]
typedef TargetTransformInfo TTI;
^~~

This is most probably because of the use of TTI:: in the function declaration; I will fix this.
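
The GCC diagnostic quoted above comes from the C++ rule that a name used inside a class must keep the same meaning once the class is complete. Below is a minimal sketch of how such a clash can arise and one way to avoid it. It uses hypothetical stand-in types in a `demo` namespace, not the real LLVM classes, and the cost values are arbitrary.

```cpp
#include <cassert>

namespace demo {

// Stand-in for llvm::TargetTransformInfo; not the real class.
class TargetTransformInfo {
public:
  enum ShuffleKind { SK_Select, SK_PermuteTwoSrc };
  int getShuffleCost(ShuffleKind Kind) const {
    return Kind == SK_Select ? 1 : 2; // arbitrary illustrative costs
  }
};

// Namespace-scope alias, mirroring the typedef shown in the log.
typedef TargetTransformInfo TTI;

class BoUpSLP {
public:
  explicit BoUpSLP(TargetTransformInfo *Info) : TTI(Info) {}

  // Writing `TTI::ShuffleKind` in this declaration would bind `TTI` to the
  // typedef first; the data member declared below would then change what
  // `TTI` means inside the class, which GCC rejects. Spelling out the class
  // name avoids using `TTI` before the member is declared.
  int costOf(TargetTransformInfo::ShuffleKind Kind) const {
    return TTI->getShuffleCost(Kind); // here `TTI` is the data member
  }

private:
  TargetTransformInfo *TTI; // hides the namespace-scope typedef
};

} // namespace demo
```

Whether the actual fix in the patch took this form is not stated in the thread; the sketch only shows the general shape of the problem.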

In D149742#4460486, @ABataev wrote:

In D149742#4401811, @vdmitrie wrote:

test.ll (935 B)

command to reproduce: opt -passes=slp-vectorizer -mtriple=x86_64 -mcpu=core-avx2 -S test.ll

HMM, llvm-mca shows that the code after patch is better than the code before (with the patch the throughput is 4.0, without - 5.0)

https://godbolt.org/z/zvzTKedPd

It puzzled me too actually. It was the only vectorization candidate that changed behavior due to the patch in the test. I could not find any obvious flaw in handling subsequent vectorization sequences.
Eyeballing of generated code before/after also did not reveal any obvious issues.

Rebase

In D149742#4460718, @vdmitrie wrote:

In D149742#4460486, @ABataev wrote:

In D149742#4401811, @vdmitrie wrote:

test.ll (935 B)

command to reproduce: opt -passes=slp-vectorizer -mtriple=x86_64 -mcpu=core-avx2 -S test.ll

HMM, llvm-mca shows that the code after patch is better than the code before (with the patch the throughput is 4.0, without - 5.0)

https://godbolt.org/z/zvzTKedPd

It puzzled me too actually. It was the only vectorization candidate that changed behavior due to the patch in the test. I could not find any obvious flaw in handling subsequent vectorization sequences.
Eyeballing of generated code before/after also did not reveal any obvious issues.

I fixed the cost estimation; it now results in the same translation (I added your test, no changes).

Harbormaster completed remote builds in B242216: Diff 535967.Jun 29 2023, 3:08 PM

Rebase, ping!

Herald added a subscriber: wangpc. Jul 7 2023, 1:02 PM

Harbormaster completed remote builds in B243843: Diff 538237.Jul 7 2023, 4:04 PM

Rebase, ping!

Harbormaster completed remote builds in B244229: Diff 538755.Jul 10 2023, 2:12 PM

Rebase, ping!

Harbormaster completed remote builds in B245824: Diff 540995.Jul 17 2023, 9:32 AM

Rebase, ping!

Harbormaster completed remote builds in B248341: Diff 544471.Jul 26 2023, 7:33 PM

Rebase, ping!!!
Required to continue the work on cost model unification and non-power-2.

Harbormaster completed remote builds in B249480: Diff 546043.Aug 1 2023, 7:46 AM

ABataev updated this revision to Diff 547869.Aug 7 2023, 11:25 AM

Rebase, ping!

Harbormaster completed remote builds in B250853: Diff 547869.Aug 7 2023, 3:32 PM

Rebase, ping!

Harbormaster completed remote builds in B251657: Diff 548992.Aug 10 2023, 9:37 AM

Why are we doing this in SLP and not TTI getShuffleCost? It looks like TTI should be doing a better job of recognizing how it should split the shuffle mask and determine the shuffle kinds per split?

In D149742#4587735, @RKSimon wrote:

Why are we doing this in SLP and not TTI getShuffleCost? It looks like TTI should be doing a better job of recognizing how it should split the shuffle mask and determine the shuffle kinds per split?

I thought about this. Unfortunately, TTI comes too late here, and it has no notion of a TreeEntry.

What does this function do? It scans the list of scalars and checks whether it can build a shuffle from previously built TreeEntries (i.e. reuse previously built vectors to build the new one) to reduce the number of inserts.
It supports only 1- and 2-vector shuffles, as shuffles of 3 or more vectors may not be efficient on some targets.
Why can't this be done in TTI? 1. TTI does not know about TreeEntries. 2. Without this knowledge it won't help in complex situations.

Assume we have 3 already built TreeEntries: <abcd>, <efgh>, <ijkl>, but the actual vector register holds only 2 elements (i.e. each of these vectors uses 2 vector registers), and we are trying to build a new buildvector (NeedToGather) entry with elements <acih>.
The original function translates this gather node into:

%s1 = shuffle <abcd>, <efgh>, <0,2,poison,7>
%v = insertelement %s1, i, 2 - may be very costly, and there may be several such inserts, increasing the overall cost of the buildvector/gather node.

The reworked function can do this instead:

%s1 = shuffle <abcd>, poison, <0,2>
%s2 = shuffle <efgh>, <ijkl>, <4,3>
%v = shuffle %s1, %s2, <0,1,2,3> - free insert subvector shuffle

TTI won't be able to transform the first variant into the second one without knowing that the <ijkl> vector already exists. That would require teaching it about the whole SLP graph idiom.
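
The <acih> example can be modelled with a small standalone sketch. This is not the SLPVectorizer code: `PartShuffle`, `shufflePart` and `shufflePerRegister` are hypothetical names, scalars are modelled as characters, every entry is assumed to have the same width, and a slice that needs more than two source entries is simply rejected (whereas the original function described above falls back to a two-entry shuffle plus inserts).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Result for one register-sized slice: which entries feed the shuffle
// (at most two, in ascending order) and the mask over their concatenation.
struct PartShuffle {
  std::vector<std::size_t> Sources;
  std::vector<int> Mask;
};

// Try to express Scalars[Begin, Begin + Len) as a shuffle of at most two
// previously built entries. Returns nullopt if that is not possible.
std::optional<PartShuffle> shufflePart(const std::vector<std::string> &Entries,
                                       const std::string &Scalars,
                                       std::size_t Begin, std::size_t Len) {
  std::vector<std::pair<std::size_t, std::size_t>> Found; // (entry, lane)
  PartShuffle Result;
  for (std::size_t I = Begin; I < Begin + Len; ++I) {
    bool Matched = false;
    for (std::size_t E = 0; E < Entries.size() && !Matched; ++E) {
      const std::size_t Lane = Entries[E].find(Scalars[I]);
      if (Lane == std::string::npos)
        continue;
      Found.emplace_back(E, Lane);
      if (std::find(Result.Sources.begin(), Result.Sources.end(), E) ==
          Result.Sources.end())
        Result.Sources.push_back(E);
      Matched = true;
    }
    if (!Matched)
      return std::nullopt; // scalar is not part of any built entry
  }
  if (Result.Sources.size() > 2)
    return std::nullopt; // only 1- and 2-vector shuffles are modelled
  std::sort(Result.Sources.begin(), Result.Sources.end());
  for (const auto &[E, Lane] : Found) {
    const std::size_t Slot =
        std::find(Result.Sources.begin(), Result.Sources.end(), E) -
        Result.Sources.begin();
    // Lanes of the second source follow the lanes of the first one.
    Result.Mask.push_back(static_cast<int>(Slot * Entries[E].size() + Lane));
  }
  return Result;
}

// Split the scalar list into NumParts register-sized slices and analyse
// each slice independently.
std::vector<std::optional<PartShuffle>>
shufflePerRegister(const std::vector<std::string> &Entries,
                   const std::string &Scalars, std::size_t NumParts) {
  assert(NumParts > 0 && Scalars.size() % NumParts == 0);
  const std::size_t Len = Scalars.size() / NumParts;
  std::vector<std::optional<PartShuffle>> Parts;
  for (std::size_t P = 0; P < NumParts; ++P)
    Parts.push_back(shufflePart(Entries, Scalars, P * Len, Len));
  return Parts;
}
```

With entries "abcd", "efgh", "ijkl" and scalars "acih", the whole-node query touches three entries and is rejected, while splitting into two slices gives mask <0,2> over <abcd> and mask <4,3> over <efgh>, <ijkl>, matching %s1 and %s2 above.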

Rebase, ping!

Harbormaster completed remote builds in B254081: Diff 552339.Aug 22 2023, 7:34 AM

Rebase, ping!

Harbormaster completed remote builds in B255599: Diff 554447.Aug 29 2023, 4:35 PM

Ping!

ikelarev removed a subscriber: ikelarev.Sep 6 2023, 8:10 AM

reames added a subscriber: reames.Sep 6 2023, 2:44 PM

reames added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9177–9179	Stylistic suggestion here. This function is quite complicated, and following all the changes through is tricky. I think this is just performing the same operation for each sub-range (divided by register). If so, I'd suggest leaving the function alone, and introducing a wrapper which just calls the old version with a sub-range of the VLs, and builds up the vector results. If this works here, then applying the same basic idea throughout the patch would make it much easier to follow and review.

ABataev added inline comments.Sep 6 2023, 3:04 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9177–9179	Yes, it splits the multi-register vector into several one-register vectors and just performs the same analysis, as before, to avoid long multi-vector shuffles. Will try to move the previous logic into a separate function.

Rebase, address comments

Harbormaster completed remote builds in B256889: Diff 556298.Sep 8 2023, 3:27 PM

Rebase, ping!

Fix typo

Rebase, ping!

Harbormaster completed remote builds in B257292: Diff 556877.Sep 15 2023, 1:35 PM

I've done my best, but I'm really struggling to understand a lot of this patch :(

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7260	Add method description, also why is it called estimateNodesPermuteCost when it doesn't seem to use/return costs?
7269	This logic makes very little sense to me
7830	Would we be better off merging the 1TE + 2TE add() methods?

In D149742#4648725, @RKSimon wrote:

I've done my best, but I'm really struggling to understand a lot of this patch :(

Generally speaking, it does the same as before, but splits the node into vector registers and tries to do per-register shuffles instead of per-node shuffles. This may reduce the final number of shuffles: a single node may result in several vector registers, and we only need to permute some of them, not all.

ABataev added inline comments.Sep 20 2023, 1:13 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7260	It adds the cost to final cost estimation, see e.g. lines 7169 and 7176.
7269	If we have already decided to permute E1 and (possibly) E2, and we find that we need to permute the same nodes again, there is no need to do it immediately; instead, we include the sub-Mask in CommonMask and do it later, to avoid shuffling the same nodes twice.
7830	I'll try to rework this function to avoid it.
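
The deferral described for line 7269 can be pictured with a small sketch. It is an assumption-laden illustration, not the patch's code: `deferSubMask` and `PoisonElem` are hypothetical names, and the common mask is modelled as a plain vector with one slice per register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int PoisonElem = -1; // hypothetical marker for "no element chosen"

// Record the sub-mask for register number Part in CommonMask instead of
// emitting a shuffle right away; a single shuffle over CommonMask can then
// be emitted later for the same pair of nodes.
void deferSubMask(std::vector<int> &CommonMask, const std::vector<int> &SubMask,
                  std::size_t Part) {
  const std::size_t Offset = Part * SubMask.size();
  assert(Offset + SubMask.size() <= CommonMask.size() && "Slice out of range.");
  for (std::size_t I = 0; I < SubMask.size(); ++I)
    if (SubMask[I] != PoisonElem)
      CommonMask[Offset + I] = SubMask[I];
}
```

Lanes that are poison in the sub-mask leave the corresponding lanes of the common mask untouched, so repeated requests against the same nodes accumulate into one mask.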

Rebase, address comments

Harbormaster completed remote builds in B257502: Diff 557185.Sep 21 2023, 11:15 AM

Ping!

Rebase, ping!

Harbormaster completed remote builds in B257770: Diff 557616.Oct 5 2023, 12:22 PM

Rebase

Harbormaster completed remote builds in B257818: Diff 557697.Oct 12 2023, 9:04 PM

RKSimon added inline comments.Oct 17 2023, 7:34 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9523	Expected positive
9531	assert(VL.size() % NumParts == 0)?
10343	Pull these kinds of NFC cleanups out of the patch
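
The two checks suggested for lines 9523 and 9531 amount to preconditions on splitting a value list into register-sized slices. A minimal sketch follows, assuming a hypothetical helper `getSlice` over a plain `std::vector`; the real code works on LLVM's own container types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns slice number Part out of NumParts equal
// slices of VL.
template <typename T>
std::vector<T> getSlice(const std::vector<T> &VL, std::size_t NumParts,
                        std::size_t Part) {
  assert(NumParts > 0 && "Expected positive number of registers.");
  assert(VL.size() % NumParts == 0 && "Expected evenly divisible list.");
  assert(Part < NumParts && "Part index out of range.");
  const std::size_t SliceSize = VL.size() / NumParts;
  const auto First = VL.begin() + static_cast<std::ptrdiff_t>(Part * SliceSize);
  return std::vector<T>(First, First + static_cast<std::ptrdiff_t>(SliceSize));
}
```

For example, with VL = {10, 11, 12, 13} and NumParts = 2, slice 0 is {10, 11} and slice 1 is {12, 13}.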

Rebase, address comments

RKSimon added inline comments.Oct 18 2023, 3:03 PM

llvm/test/Transforms/SLPVectorizer/X86/multi-nodes-to-shuffle.ll

1–3

Please can you add an AVX2 run to confirm that the shuffles are still being suitably split for legal 256-bit vectors?

; RUN: opt -passes=slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux -slp-threshold=-115 | FileCheck %s --check-prefix=SSE
; RUN: opt -passes=slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux -mattr=+avx2 -slp-threshold=-115 | FileCheck %s --check-prefix=AVX

Harbormaster completed remote builds in B257862: Diff 557763.Oct 18 2023, 3:07 PM

vdmitrie added inline comments.Oct 18 2023, 6:20 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7007	I'm not sure I understand the purpose of this flag. Could you please describe? It is quite confusing that it is already set right at construction when no estimation was done yet.
7275	InVectors.size() >=1 is same as !InVectors.empty() Having this check assumes that InVectors can be empty here, but at the same time InVectors.front() below (at line 7267) assumes it is not empty. So which one is the right assumption?
7741–7742	perhaps worth renaming to GatherShuffles since it becomes an array of optionals
7743	This ScalarTy declaration hides one at the function scope.
7749	ScalarTy ?
7759	VecTy is already defined at 7674. Can just reuse it.

Rebase, address comments

Fix comment

Fix formatting

@vdmitrie Any more comments?

vdmitrie added inline comments.Oct 23 2023, 1:22 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7007	Would this sound more accurate: "While set, we are still trying ..."? I only ever saw it cleared.
7275	What about the assumption? I believe "!InVectors.empty()" is always true. That is why I asked the question about the assumption. This method is called twice in the patch and both call sites return early when InVectors is empty.
9523	Typo: "poistive"
10420–10421	nit: suggest hoisting "Entries.front().front()" subexpr and assign to dedicated local variable of type "const TreeEntry *"

ABataev added inline comments.Oct 23 2023, 2:13 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7275	Yes, this check can be dropped here.

Rebase, address comments

LGTM

This revision is now accepted and ready to land.Oct 23 2023, 2:32 PM

Harbormaster completed remote builds in B257918: Diff 557850.Oct 23 2023, 5:44 PM

LGTM

This revision was landed with ongoing or failed builds.Oct 26 2023, 6:00 AM

Closed by commit rG560bad013ebc: [SLP]Improve isGatherShuffledEntry by trying per-register shuffle. (authored by ABataev).

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rG560bad013ebc: [SLP]Improve isGatherShuffledEntry by trying per-register shuffle..

ABataev added a reverting change: rGc65ec9d9195a: Revert "[SLP]Improve isGatherShuffledEntry by trying per-register shuffle.".Oct 26 2023, 8:37 AM

ABataev added a commit: rG196d154ab7a7: [SLP]Improve isGatherShuffledEntry by trying per-register shuffle..Oct 26 2023, 8:52 AM

uabelho added a subscriber: uabelho.Oct 27 2023, 7:20 AM

Hi, this reland causes a crash: https://github.com/llvm/llvm-project/commit/196d154ab7a76e8ccb11addf61ff53387e397130.

ld.lld: /b/s/w/ir/cache/builder/src/third_party/llvm/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7602: void llvm::slpvectorizer::BoUpSLP::ShuffleCostEstimator::add(const TreeEntry &, ArrayRef<int>): Assertion `NumParts > 0 && NumParts < Mask.size() && "Expected positive number of registers."' failed.

@zequanwu Do you have a repro please?

In D149742#4655482, @RKSimon wrote:

@zequanwu Do you have a repro please?

This only happens when using -fprofile-use=. I'll try to come up with a smaller repro on Monday.

Stack trace:

clang: /usr/local/google/home/zequanwu/workspace/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7602: void llvm::slpvectorizer::BoUpSLP::ShuffleCostEstimator::add(const TreeEntry &, ArrayRef<int>): Assertion `NumParts > 0 && NumParts < Mask.size() && "Expected positive number of registers."' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: /usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang -MMD -MF obj/third_party/boringssl/boringssl/curve25519.o.d -DUSE_UDEV -DUSE_AURA=1 -DUSE_GLIB=1 -DUSE_OZONE=1 -DOFFICIAL_BUILD -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -DNO_UNWIND_TABLES -D_GNU_SOURCE -D_LIBCPP_ENABLE_SAFE_MODE=1 -D_LIBCPP_DISABLE_VISIBILITY_ANNOTATIONS -D_LIBCXXABI_DISABLE_VISIBILITY_ANNOTATIONS -DCR_LIBCXX_REVISION=9b27200e8219bbc9e05ba8435456554a6e35d67d -DCR_SYSROOT_KEY=20230611T210420Z-2 -DNDEBUG -DNVALGRIND -DDYNAMIC_ANNOTATIONS_ENABLED=0 -DBORINGSSL_IMPLEMENTATION -D_BORINGSSL_LIBPKI_ -DBORINGSSL_ALLOW_CXX_RUNTIME -DBORINGSSL_NO_STATIC_INITIALIZER -DOPENSSL_SMALL -I../.. -Igen -I../../buildtools/third_party/libc++ -I../../third_party/boringssl/src/include -fno-delete-null-pointer-checks -fno-ident -fno-strict-aliasing -fstack-protector -fno-unwind-tables -fno-asynchronous-unwind-tables -fPIC -pthread -fcolor-diagnostics -fmerge-all-constants -fcrash-diagnostics-dir=../../tools/clang/crashreports -mllvm -instcombine-lower-dbg-declare=0 -mllvm -split-threshold-for-reg-with-hint=0 -ffp-contract=off -fcomplete-member-pointers -m64 -msse3 -ffile-compilation-dir=. 
-no-canonical-prefixes -ftrivial-auto-var-init=pattern -fno-omit-frame-pointer -gdwarf-4 -g2 -gdwarf-aranges -ggnu-pubnames -Xclang -fuse-ctor-homing -fprofile-use=../../chrome/build/pgo_profiles/chrome-linux-main-1698407887-6c91c384bff1766cf9949635c133936a2feea1bd.profdata -Wno-profile-instr-unprofiled -Wno-profile-instr-out-of-date -Wno-backend-plugin -mllvm -enable-ext-tsp-block-placement=1 -fvisibility=hidden -Wheader-hygiene -Wstring-conversion -Wtautological-overlap-compare -Wall -Wno-unused-variable -Wno-c++11-narrowing -Wno-unused-but-set-variable -Wno-misleading-indentation -Wno-missing-field-initializers -Wno-unused-parameter -Wno-psabi -Wloop-analysis -Wno-unneeded-internal-declaration -Wenum-compare-conditional -Wno-ignored-pragma-optimize -Wno-deprecated-builtins -Wno-bitfield-constant-conversion -Wno-deprecated-this-capture -Wno-invalid-offsetof -Wno-vla-extension -Wno-thread-safety-reference-return -Wno-delayed-template-parsing-in-cxx20 -Werror -O2 -fdata-sections -ffunction-sections -fno-unique-section-names -fno-math-errno -std=c11 --sysroot=../../build/linux/debian_bullseye_amd64-sysroot -c ../../third_party/boringssl/src/crypto/curve25519/curve25519.c -o obj/third_party/boringssl/boringssl/curve25519.o
1.      <eof> parser at end of file
2.      Optimizer
 #0 0x0000556e5dfa28c8 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x7cbf8c8)
 #1 0x0000556e5dfa048e llvm::sys::RunSignalHandlers() (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x7cbd48e)
 #2 0x0000556e5df0a786 CrashRecoverySignalHandler(int) CrashRecoveryContext.cpp:0:0
 #3 0x00007f92cb85a510 (/lib/x86_64-linux-gnu/libc.so.6+0x3c510)
 #4 0x00007f92cb8a80fc __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #5 0x00007f92cb85a472 raise ./signal/../sysdeps/posix/raise.c:27:6
 #6 0x00007f92cb8444b2 abort ./stdlib/abort.c:81:7
 #7 0x00007f92cb8443d5 _nl_load_domain ./intl/loadmsgcat.c:1177:9
 #8 0x00007f92cb8533a2 (/lib/x86_64-linux-gnu/libc.so.6+0x353a2)
 #9 0x0000556e5f69792c llvm::slpvectorizer::BoUpSLP::ShuffleCostEstimator::add(llvm::slpvectorizer::BoUpSLP::TreeEntry const&, llvm::ArrayRef<int>) SLPVectorizer.cpp:0:0
#10 0x0000556e5f692f38 llvm::slpvectorizer::BoUpSLP::getEntryCost(llvm::slpvectorizer::BoUpSLP::TreeEntry const*, llvm::ArrayRef<llvm::Value*>, llvm::SmallPtrSetImpl<llvm::Value*>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93aff38)
#11 0x0000556e5f69c33b llvm::slpvectorizer::BoUpSLP::getTreeCost(llvm::ArrayRef<llvm::Value*>) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93b933b)
#12 0x0000556e5f6c0a1a llvm::SLPVectorizerPass::tryToVectorizeList(llvm::ArrayRef<llvm::Value*>, llvm::slpvectorizer::BoUpSLP&, bool) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93dda1a)
#13 0x0000556e5f6c1a2c llvm::SLPVectorizerPass::tryToVectorize(llvm::Instruction*, llvm::slpvectorizer::BoUpSLP&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93dea2c)
#14 0x0000556e5f6c4be2 llvm::SLPVectorizerPass::tryToVectorize(llvm::ArrayRef<llvm::WeakTrackingVH>, llvm::slpvectorizer::BoUpSLP&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93e1be2)
#15 0x0000556e5f6c4b2f llvm::SLPVectorizerPass::vectorizeRootInstruction(llvm::PHINode*, llvm::Instruction*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93e1b2f)
#16 0x0000556e5f6bbc2a llvm::SLPVectorizerPass::vectorizeChainsInBlock(llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93d8c2a)
#17 0x0000556e5f6b96fc llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93d66fc)
#18 0x0000556e5f6b8c9e llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93d5c9e)
#19 0x0000556e5f37319d llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) PassBuilder.cpp:0:0
#20 0x0000556e5d9f2a24 llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x770fa24)
#21 0x0000556e5be0377d llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) AMDGPUTargetMachine.cpp:0:0
#22 0x0000556e5d9f6cb3 llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x7713cb3)
#23 0x0000556e5be0351d llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) AMDGPUTargetMachine.cpp:0:0
#24 0x0000556e5d9f1c04 llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x770ec04)
#25 0x0000556e5e74f5b5 (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline(clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>&, std::unique_ptr<llvm::ToolOutputFile, std::default_delete<llvm::ToolOutputFile>>&) BackendUtil.cpp:0:0
#26 0x0000556e5e746102 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8463102)
#27 0x0000556e5ec426b1 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) CodeGenAction.cpp:0:0
#28 0x0000556e60390b56 clang::ParseAST(clang::Sema&, bool, bool) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0xa0adb56)
#29 0x0000556e5eb59f7f clang::FrontendAction::Execute() (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8876f7f)
#30 0x0000556e5eacb01d clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x87e801d)
#31 0x0000556e5ec3ad4e clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8957d4e)
#32 0x0000556e5ba36012 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x5753012)
#33 0x0000556e5ba323ed ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&, llvm::ToolContext const&) driver.cpp:0:0
#34 0x0000556e5e92b9b9 void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const::$_0>(long) Job.cpp:0:0
#35 0x0000556e5df0a4c6 llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x7c274c6)
#36 0x0000556e5e92b0c2 clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x86480c2)
#37 0x0000556e5e8e6107 clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8603107)
#38 0x0000556e5e8e6647 clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&, bool) const (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8603647)
#39 0x0000556e5e9065e9 clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x86235e9)
#40 0x0000556e5ba31836 clang_main(int, char**, llvm::ToolContext const&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x574e836)
#41 0x0000556e5ba42871 main (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x575f871)
#42 0x00007f92cb8456ca __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:74:3
#43 0x00007f92cb845785 call_init ./csu/../csu/libc-start.c:128:20
#44 0x00007f92cb845785 __libc_start_main ./csu/../csu/libc-start.c:347:5
#45 0x0000556e5ba2e8e1 _start (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x574b8e1)
clang: error: clang frontend command failed with exit code 134 (use -v to see invocation)
clang version 18.0.0 (git@github.com:ZequanWu/llvm-project.git 196d154ab7a76e8ccb11addf61ff53387e397130)

In D149742#4655482, @RKSimon wrote:

@zequanwu Do you have a repro please?

I prepared the repro myself, will fix it ASAP.

If this is known broken at head, can we revert while investigation is underway?

In D149742#4655598, @thakis wrote:

If this is known broken at head, can we revert while investigation is underway?

Will commit a fix in a minute

In D149742#4655491, @zequanwu wrote:

In D149742#4655482, @RKSimon wrote:

@zequanwu Do you have a repro please?

This only happens when using -fprofile-use=. I'll try to come up with a smaller repro on Monday.

Stack trace:

clang: /usr/local/google/home/zequanwu/workspace/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7602: void llvm::slpvectorizer::BoUpSLP::ShuffleCostEstimator::add(const TreeEntry &, ArrayRef<int>): Assertion `NumParts > 0 && NumParts < Mask.size() && "Expected positive number of registers."' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: /usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang -MMD -MF obj/third_party/boringssl/boringssl/curve25519.o.d -DUSE_UDEV -DUSE_AURA=1 -DUSE_GLIB=1 -DUSE_OZONE=1 -DOFFICIAL_BUILD -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -DNO_UNWIND_TABLES -D_GNU_SOURCE -D_LIBCPP_ENABLE_SAFE_MODE=1 -D_LIBCPP_DISABLE_VISIBILITY_ANNOTATIONS -D_LIBCXXABI_DISABLE_VISIBILITY_ANNOTATIONS -DCR_LIBCXX_REVISION=9b27200e8219bbc9e05ba8435456554a6e35d67d -DCR_SYSROOT_KEY=20230611T210420Z-2 -DNDEBUG -DNVALGRIND -DDYNAMIC_ANNOTATIONS_ENABLED=0 -DBORINGSSL_IMPLEMENTATION -D_BORINGSSL_LIBPKI_ -DBORINGSSL_ALLOW_CXX_RUNTIME -DBORINGSSL_NO_STATIC_INITIALIZER -DOPENSSL_SMALL -I../.. -Igen -I../../buildtools/third_party/libc++ -I../../third_party/boringssl/src/include -fno-delete-null-pointer-checks -fno-ident -fno-strict-aliasing -fstack-protector -fno-unwind-tables -fno-asynchronous-unwind-tables -fPIC -pthread -fcolor-diagnostics -fmerge-all-constants -fcrash-diagnostics-dir=../../tools/clang/crashreports -mllvm -instcombine-lower-dbg-declare=0 -mllvm -split-threshold-for-reg-with-hint=0 -ffp-contract=off -fcomplete-member-pointers -m64 -msse3 -ffile-compilation-dir=. 
-no-canonical-prefixes -ftrivial-auto-var-init=pattern -fno-omit-frame-pointer -gdwarf-4 -g2 -gdwarf-aranges -ggnu-pubnames -Xclang -fuse-ctor-homing -fprofile-use=../../chrome/build/pgo_profiles/chrome-linux-main-1698407887-6c91c384bff1766cf9949635c133936a2feea1bd.profdata -Wno-profile-instr-unprofiled -Wno-profile-instr-out-of-date -Wno-backend-plugin -mllvm -enable-ext-tsp-block-placement=1 -fvisibility=hidden -Wheader-hygiene -Wstring-conversion -Wtautological-overlap-compare -Wall -Wno-unused-variable -Wno-c++11-narrowing -Wno-unused-but-set-variable -Wno-misleading-indentation -Wno-missing-field-initializers -Wno-unused-parameter -Wno-psabi -Wloop-analysis -Wno-unneeded-internal-declaration -Wenum-compare-conditional -Wno-ignored-pragma-optimize -Wno-deprecated-builtins -Wno-bitfield-constant-conversion -Wno-deprecated-this-capture -Wno-invalid-offsetof -Wno-vla-extension -Wno-thread-safety-reference-return -Wno-delayed-template-parsing-in-cxx20 -Werror -O2 -fdata-sections -ffunction-sections -fno-unique-section-names -fno-math-errno -std=c11 --sysroot=../../build/linux/debian_bullseye_amd64-sysroot -c ../../third_party/boringssl/src/crypto/curve25519/curve25519.c -o obj/third_party/boringssl/boringssl/curve25519.o
1.      <eof> parser at end of file
2.      Optimizer
 #0 0x0000556e5dfa28c8 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x7cbf8c8)
 #1 0x0000556e5dfa048e llvm::sys::RunSignalHandlers() (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x7cbd48e)
 #2 0x0000556e5df0a786 CrashRecoverySignalHandler(int) CrashRecoveryContext.cpp:0:0
 #3 0x00007f92cb85a510 (/lib/x86_64-linux-gnu/libc.so.6+0x3c510)
 #4 0x00007f92cb8a80fc __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #5 0x00007f92cb85a472 raise ./signal/../sysdeps/posix/raise.c:27:6
 #6 0x00007f92cb8444b2 abort ./stdlib/abort.c:81:7
 #7 0x00007f92cb8443d5 _nl_load_domain ./intl/loadmsgcat.c:1177:9
 #8 0x00007f92cb8533a2 (/lib/x86_64-linux-gnu/libc.so.6+0x353a2)
 #9 0x0000556e5f69792c llvm::slpvectorizer::BoUpSLP::ShuffleCostEstimator::add(llvm::slpvectorizer::BoUpSLP::TreeEntry const&, llvm::ArrayRef<int>) SLPVectorizer.cpp:0:0
#10 0x0000556e5f692f38 llvm::slpvectorizer::BoUpSLP::getEntryCost(llvm::slpvectorizer::BoUpSLP::TreeEntry const*, llvm::ArrayRef<llvm::Value*>, llvm::SmallPtrSetImpl<llvm::Value*>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93aff38)
#11 0x0000556e5f69c33b llvm::slpvectorizer::BoUpSLP::getTreeCost(llvm::ArrayRef<llvm::Value*>) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93b933b)
#12 0x0000556e5f6c0a1a llvm::SLPVectorizerPass::tryToVectorizeList(llvm::ArrayRef<llvm::Value*>, llvm::slpvectorizer::BoUpSLP&, bool) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93dda1a)
#13 0x0000556e5f6c1a2c llvm::SLPVectorizerPass::tryToVectorize(llvm::Instruction*, llvm::slpvectorizer::BoUpSLP&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93dea2c)
#14 0x0000556e5f6c4be2 llvm::SLPVectorizerPass::tryToVectorize(llvm::ArrayRef<llvm::WeakTrackingVH>, llvm::slpvectorizer::BoUpSLP&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93e1be2)
#15 0x0000556e5f6c4b2f llvm::SLPVectorizerPass::vectorizeRootInstruction(llvm::PHINode*, llvm::Instruction*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93e1b2f)
#16 0x0000556e5f6bbc2a llvm::SLPVectorizerPass::vectorizeChainsInBlock(llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93d8c2a)
#17 0x0000556e5f6b96fc llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93d66fc)
#18 0x0000556e5f6b8c9e llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93d5c9e)
#19 0x0000556e5f37319d llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) PassBuilder.cpp:0:0
#20 0x0000556e5d9f2a24 llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x770fa24)
#21 0x0000556e5be0377d llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) AMDGPUTargetMachine.cpp:0:0
#22 0x0000556e5d9f6cb3 llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x7713cb3)
#23 0x0000556e5be0351d llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) AMDGPUTargetMachine.cpp:0:0
#24 0x0000556e5d9f1c04 llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x770ec04)
#25 0x0000556e5e74f5b5 (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline(clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>&, std::unique_ptr<llvm::ToolOutputFile, std::default_delete<llvm::ToolOutputFile>>&) BackendUtil.cpp:0:0
#26 0x0000556e5e746102 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8463102)
#27 0x0000556e5ec426b1 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) CodeGenAction.cpp:0:0
#28 0x0000556e60390b56 clang::ParseAST(clang::Sema&, bool, bool) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0xa0adb56)
#29 0x0000556e5eb59f7f clang::FrontendAction::Execute() (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8876f7f)
#30 0x0000556e5eacb01d clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x87e801d)
#31 0x0000556e5ec3ad4e clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8957d4e)
#32 0x0000556e5ba36012 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x5753012)
#33 0x0000556e5ba323ed ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&, llvm::ToolContext const&) driver.cpp:0:0
#34 0x0000556e5e92b9b9 void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const::$_0>(long) Job.cpp:0:0
#35 0x0000556e5df0a4c6 llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x7c274c6)
#36 0x0000556e5e92b0c2 clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x86480c2)
#37 0x0000556e5e8e6107 clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8603107)
#38 0x0000556e5e8e6647 clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&, bool) const (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8603647)
#39 0x0000556e5e9065e9 clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x86235e9)
#40 0x0000556e5ba31836 clang_main(int, char**, llvm::ToolContext const&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x574e836)
#41 0x0000556e5ba42871 main (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x575f871)
#42 0x00007f92cb8456ca __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:74:3
#43 0x00007f92cb845785 call_init ./csu/../csu/libc-start.c:128:20
#44 0x00007f92cb845785 __libc_start_main ./csu/../csu/libc-start.c:347:5
#45 0x0000556e5ba2e8e1 _start (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x574b8e1)
clang: error: clang frontend command failed with exit code 134 (use -v to see invocation)
clang version 18.0.0 (git@github.com:ZequanWu/llvm-project.git 196d154ab7a76e8ccb11addf61ff53387e397130)

Fixed in af15c46777208a4cb4b276c4974a5b556608a415

Thanks, confirmed that is fixed.
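
For readers following the per-register bookkeeping in this patch, the slice arithmetic used in `ShuffleCostEstimator::add` and the `transformMaskAfterShuffle` helper can be sketched in plain C++ roughly as follows. This is a standalone illustration under stated assumptions, not the LLVM code: `std::vector<int>` stands in for `ArrayRef<int>`/`MutableArrayRef<int>`, `firstUsedPart` is a hypothetical name, and `NumParts` is passed in directly instead of being queried from `TTI.getNumberOfParts`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Same sentinel value LLVM uses for undefined shuffle lanes.
constexpr int PoisonMaskElem = -1;

// Hypothetical helper: index of the register-sized slice that contains the
// first non-poison lane of Mask, assuming Mask splits evenly into NumParts.
unsigned firstUsedPart(const std::vector<int> &Mask, unsigned NumParts) {
  assert(NumParts > 0 && NumParts <= Mask.size() && "Expected valid part count.");
  unsigned SliceSize = static_cast<unsigned>(Mask.size()) / NumParts;
  auto It = std::find_if(Mask.begin(), Mask.end(),
                         [](int Idx) { return Idx != PoisonMaskElem; });
  return static_cast<unsigned>(std::distance(Mask.begin(), It)) / SliceSize;
}

// Mirrors the helper in the diff: every lane that Mask defines now refers to
// the already-shuffled result, so it becomes an identity lane in CommonMask.
void transformMaskAfterShuffle(std::vector<int> &CommonMask,
                               const std::vector<int> &Mask) {
  for (std::size_t Idx = 0, Sz = CommonMask.size(); Idx < Sz; ++Idx)
    if (Mask[Idx] != PoisonMaskElem)
      CommonMask[Idx] = static_cast<int>(Idx);
}
```

With an 8-lane mask split into 2 registers, a mask whose first defined lane is lane 4 maps to part 1. In this sketch an all-poison mask yields `NumParts`, one past the last valid part, so a caller would need at least one defined lane.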

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	462 lines
llvm/test/Transforms/SLPVectorizer/X86/multi-nodes-to-shuffle.ll	45 lines

Diff 557894

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

[2,501 lines hidden]	private:
Value *createBuildVector(const TreeEntry *E);		Value *createBuildVector(const TreeEntry *E);

/// Returns the instruction in the bundle, which can be used as a base point		/// Returns the instruction in the bundle, which can be used as a base point
/// for scheduling. Usually it is the last instruction in the bundle, except		/// for scheduling. Usually it is the last instruction in the bundle, except
/// for the case when all operands are external (in this case, it is the first		/// for the case when all operands are external (in this case, it is the first
/// instruction in the list).		/// instruction in the list).
Instruction &getLastInstructionInBundle(const TreeEntry *E);		Instruction &getLastInstructionInBundle(const TreeEntry *E);

/// Checks if the gathered \p VL can be represented as shuffle(s) of previous		/// Checks if the gathered \p VL can be represented as a single register
/// tree entries.		/// shuffle(s) of previous tree entries.
/// \param TE Tree entry checked for permutation.		/// \param TE Tree entry checked for permutation.
/// \param VL List of scalars (a subset of the TE scalar), checked for		/// \param VL List of scalars (a subset of the TE scalar), checked for
/// permutations.		/// permutations. Must form single-register vector.
/// \returns ShuffleKind, if gathered values can be represented as shuffles of		/// \returns ShuffleKind, if gathered values can be represented as shuffles of
		RKSimon: Update the \returns description
/// previous tree entries. \p Mask is filled with the shuffle mask.		/// previous tree entries. \p Part of \p Mask is filled with the shuffle mask.
std::optional<TargetTransformInfo::ShuffleKind>		std::optional<TargetTransformInfo::ShuffleKind>
isGatherShuffledEntry(const TreeEntry *TE, ArrayRef<Value *> VL,		isGatherShuffledSingleRegisterEntry(
SmallVectorImpl<int> &Mask,		const TreeEntry *TE, ArrayRef<Value *> VL, MutableArrayRef<int> Mask,
SmallVectorImpl<const TreeEntry *> &Entries);		SmallVectorImpl<const TreeEntry *> &Entries, unsigned Part);

		/// Checks if the gathered \p VL can be represented as multi-register
		/// shuffle(s) of previous tree entries.
		/// \param TE Tree entry checked for permutation.
		/// \param VL List of scalars (a subset of the TE scalar), checked for
		/// permutations.
		/// \returns per-register series of ShuffleKind, if gathered values can be
		/// represented as shuffles of previous tree entries. \p Mask is filled with
		/// the shuffle mask (also on per-register base).
		SmallVector<std::optional<TargetTransformInfo::ShuffleKind>>
		isGatherShuffledEntry(
		const TreeEntry *TE, ArrayRef<Value *> VL, SmallVectorImpl<int> &Mask,
		SmallVectorImpl<SmallVector<const TreeEntry *>> &Entries,
		unsigned NumParts);

/// \returns the scalarization cost for this list of values. Assuming that		/// \returns the scalarization cost for this list of values. Assuming that
/// this subtree gets vectorized, we may need to extract the values from the		/// this subtree gets vectorized, we may need to extract the values from the
/// roots. This method calculates the cost of extracting the values.		/// roots. This method calculates the cost of extracting the values.
/// \param ForPoisonSrc true if initial vector is poison, false otherwise.		/// \param ForPoisonSrc true if initial vector is poison, false otherwise.
InstructionCost getGatherCost(ArrayRef<Value *> VL, bool ForPoisonSrc) const;		InstructionCost getGatherCost(ArrayRef<Value *> VL, bool ForPoisonSrc) const;

/// Set the Builder insert point to one after the last instruction in		/// Set the Builder insert point to one after the last instruction in
[4,456 lines hidden]	class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
SmallVector<int> CommonMask;		SmallVector<int> CommonMask;
SmallVector<PointerUnion<Value *, const TreeEntry *>, 2> InVectors;		SmallVector<PointerUnion<Value *, const TreeEntry *>, 2> InVectors;
const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;
InstructionCost Cost = 0;		InstructionCost Cost = 0;
SmallDenseSet<Value *> VectorizedVals;		SmallDenseSet<Value *> VectorizedVals;
BoUpSLP &R;		BoUpSLP &R;
SmallPtrSetImpl<Value *> &CheckedExtracts;		SmallPtrSetImpl<Value *> &CheckedExtracts;
constexpr static TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;		constexpr static TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
		/// While set, still trying to estimate the cost for the same nodes and we
		vdmitrie: I'm not sure I understand the purpose of this flag. Could you please describe? It is quite confusing that it is already set right at construction when no estimation was done yet.
		vdmitrie: Would this sound more accurate: "While set, we are still trying ..."? I only ever see it being cleared.
		/// can delay actual cost estimation (virtual shuffle instruction emission).
		/// May help better estimate the cost if same nodes must be permuted + allows
		/// to move most of the long shuffles cost estimation to TTI.
		bool SameNodesEstimated = true;

static Constant *getAllOnesValue(const DataLayout &DL, Type *Ty) {		static Constant *getAllOnesValue(const DataLayout &DL, Type *Ty) {
if (Ty->getScalarType()->isPointerTy()) {		if (Ty->getScalarType()->isPointerTy()) {
Constant *Res = ConstantExpr::getIntToPtr(		Constant *Res = ConstantExpr::getIntToPtr(
ConstantInt::getAllOnesValue(		ConstantInt::getAllOnesValue(
IntegerType::get(Ty->getContext(),		IntegerType::get(Ty->getContext(),
DL.getTypeStoreSizeInBits(Ty->getScalarType()))),		DL.getTypeStoreSizeInBits(Ty->getScalarType()))),
Ty->getScalarType());		Ty->getScalarType());
[224 lines hidden]	for (auto [Idx, V] : enumerate(VL)) {

// If we have a series of extracts which are not consecutive and hence		// If we have a series of extracts which are not consecutive and hence
// cannot re-use the source vector register directly, compute the shuffle		// cannot re-use the source vector register directly, compute the shuffle
// cost to extract the vector with EltsPerVector elements.		// cost to extract the vector with EltsPerVector elements.
Cost += TTI.getShuffleCost(RegisterSK, RegisterVecTy, RegMask);		Cost += TTI.getShuffleCost(RegisterSK, RegisterVecTy, RegMask);
}		}
return Cost;		return Cost;
}		}
		/// Transforms mask \p CommonMask per given \p Mask to make proper set after
		/// shuffle emission.
		static void transformMaskAfterShuffle(MutableArrayRef<int> CommonMask,
		ArrayRef<int> Mask) {
		for (unsigned Idx = 0, Sz = CommonMask.size(); Idx < Sz; ++Idx)
		if (Mask[Idx] != PoisonMaskElem)
		CommonMask[Idx] = Idx;
		}
		/// Adds the cost of reshuffling \p E1 and \p E2 (if present), using given
		RKSimon: Add method description, also why is it called estimateNodesPermuteCost when it doesn't seem to use/return costs?
		ABataev: It adds the cost to final cost estimation, see e.g. lines 7169 and 7176.
		/// mask \p Mask, register number \p Part, that includes \p SliceSize
		/// elements.
		void estimateNodesPermuteCost(const TreeEntry &E1, const TreeEntry *E2,
		ArrayRef<int> Mask, unsigned Part,
		unsigned SliceSize) {
		if (SameNodesEstimated) {
		// Delay the cost estimation if the same nodes are reshuffling.
		// If we already requested the cost of reshuffling of E1 and E2 before, no
		// need to estimate another cost with the sub-Mask, instead include this
		RKSimon: This logic makes very little sense to me
		ABataev: If we have already decided to permute E1 and (possibly) E2, and we find that we need to permute the same nodes again, there is no need to do it immediately; instead, we include the sub-Mask into CommonMask and do it later to avoid double shuffling of the same nodes.
		// sub-Mask into the CommonMask to estimate it later and avoid double cost
		// estimation.
		if ((InVectors.size() == 2 &&
		InVectors.front().get<const TreeEntry *>() == &E1 &&
		InVectors.back().get<const TreeEntry *>() == E2) ||
		(!E2 && InVectors.front().get<const TreeEntry *>() == &E1)) {
		vdmitrie: InVectors.size() >=1 is same as !InVectors.empty(). Having this check assumes that InVectors can be empty here, but at the same time InVectors.front() below (at line 7267) assumes it is not empty. So which one is the right assumption?
		vdmitrie: What about the assumption? I believe "!InVectors.empty()" is always true. That is why I asked the question about assumption. This method is called twice in the patch and both have early return when InVectors is empty.
		ABataev: Yes, this check can be dropped here.
		assert(all_of(ArrayRef(CommonMask).slice(Part * SliceSize, SliceSize),
		[](int Idx) { return Idx == PoisonMaskElem; }) &&
		"Expected all poisoned elements.");
		ArrayRef<int> SubMask =
		ArrayRef(Mask).slice(Part * SliceSize, SliceSize);
		copy(SubMask, std::next(CommonMask.begin(), SliceSize * Part));
		return;
		}
		// Found non-matching nodes - need to estimate the cost for the matched
		// and transform mask.
		Cost += createShuffle(InVectors.front(),
		InVectors.size() == 1 ? nullptr : InVectors.back(),
		CommonMask);
		transformMaskAfterShuffle(CommonMask, CommonMask);
		}
		SameNodesEstimated = false;
		Cost += createShuffle(&E1, E2, Mask);
		transformMaskAfterShuffle(CommonMask, Mask);
		}

class ShuffleCostBuilder {		class ShuffleCostBuilder {
const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;

static bool isEmptyOrIdentity(ArrayRef<int> Mask, unsigned VF) {		static bool isEmptyOrIdentity(ArrayRef<int> Mask, unsigned VF) {
int Index = -1;		int Index = -1;
return Mask.empty() ||		return Mask.empty() ||
(VF == Mask.size() &&		(VF == Mask.size() &&
[247 lines hidden]	for (const auto &Data : ExtractVectorsTys) {
}		}
}		}
// Check that gather of extractelements can be represented as just a		// Check that gather of extractelements can be represented as just a
// shuffle of a single/two vectors the scalars are extracted from.		// shuffle of a single/two vectors the scalars are extracted from.
// Found the bunch of extractelement instructions that must be gathered		// Found the bunch of extractelement instructions that must be gathered
// into a vector and can be represented as a permutation elements in a		// into a vector and can be represented as a permutation elements in a
// single input vector or of 2 input vectors.		// single input vector or of 2 input vectors.
Cost += computeExtractCost(VL, Mask, ShuffleKind);		Cost += computeExtractCost(VL, Mask, ShuffleKind);
		InVectors.assign(1, E);
		CommonMask.assign(Mask.begin(), Mask.end());
		transformMaskAfterShuffle(CommonMask, CommonMask);
		SameNodesEstimated = false;
return VecBase;		return VecBase;
}		}
void add(const TreeEntry *E1, const TreeEntry *E2, ArrayRef<int> Mask) {		void add(const TreeEntry &E1, const TreeEntry &E2, ArrayRef<int> Mask) {
if (E1 == E2) {		if (&E1 == &E2) {
assert(all_of(Mask,		assert(all_of(Mask,
[=](int Idx) {		[&](int Idx) {
return Idx < static_cast<int>(E1->getVectorFactor());		return Idx < static_cast<int>(E1.getVectorFactor());
}) &&		}) &&
"Expected single vector shuffle mask.");		"Expected single vector shuffle mask.");
add(E1, Mask);		add(E1, Mask);
return;		return;
}		}
		if (InVectors.empty()) {
CommonMask.assign(Mask.begin(), Mask.end());		CommonMask.assign(Mask.begin(), Mask.end());
InVectors.assign({E1, E2});		InVectors.assign({&E1, &E2});
		return;
		}
		assert(!CommonMask.empty() && "Expected non-empty common mask.");
		auto *MaskVecTy =
		FixedVectorType::get(E1.Scalars.front()->getType(), Mask.size());
		unsigned NumParts = TTI.getNumberOfParts(MaskVecTy);
		assert(NumParts > 0 && NumParts < Mask.size() &&
		"Expected positive number of registers.");
		unsigned SliceSize = Mask.size() / NumParts;
		const auto *It =
		find_if(Mask, [](int Idx) { return Idx != PoisonMaskElem; });
		unsigned Part = std::distance(Mask.begin(), It) / SliceSize;
		estimateNodesPermuteCost(E1, &E2, Mask, Part, SliceSize);
}		}
void add(const TreeEntry *E1, ArrayRef<int> Mask) {		void add(const TreeEntry &E1, ArrayRef<int> Mask) {
		if (InVectors.empty()) {
CommonMask.assign(Mask.begin(), Mask.end());		CommonMask.assign(Mask.begin(), Mask.end());
InVectors.assign(1, E1);		InVectors.assign(1, &E1);
		return;
		}
		assert(!CommonMask.empty() && "Expected non-empty common mask.");
		auto *MaskVecTy =
		FixedVectorType::get(E1.Scalars.front()->getType(), Mask.size());
		unsigned NumParts = TTI.getNumberOfParts(MaskVecTy);
		assert(NumParts > 0 && NumParts < Mask.size() &&
		"Expected positive number of registers.");
		unsigned SliceSize = Mask.size() / NumParts;
		const auto *It =
		find_if(Mask, [](int Idx) { return Idx != PoisonMaskElem; });
		unsigned Part = std::distance(Mask.begin(), It) / SliceSize;
		estimateNodesPermuteCost(E1, nullptr, Mask, Part, SliceSize);
		if (!SameNodesEstimated && InVectors.size() == 1)
		InVectors.emplace_back(&E1);
}		}
/// Adds another one input vector and the mask for the shuffling.		/// Adds another one input vector and the mask for the shuffling.
void add(Value *V1, ArrayRef<int> Mask) {		void add(Value *V1, ArrayRef<int> Mask) {
assert(CommonMask.empty() && InVectors.empty() &&		if (InVectors.empty()) {
"Expected empty input mask/vectors.");		assert(CommonMask.empty() && "Expected empty input mask/vectors.");
CommonMask.assign(Mask.begin(), Mask.end());		CommonMask.assign(Mask.begin(), Mask.end());
InVectors.assign(1, V1);		InVectors.assign(1, V1);
		return;
		}
		assert(InVectors.size() == 1 && InVectors.front().is<const TreeEntry *>() &&
		!CommonMask.empty() && "Expected only single entry from extracts.");
		InVectors.push_back(V1);
		unsigned VF = CommonMask.size();
		for (unsigned Idx = 0; Idx < VF; ++Idx)
		if (Mask[Idx] != PoisonMaskElem && CommonMask[Idx] == PoisonMaskElem)
		CommonMask[Idx] = Mask[Idx] + VF;
}		}
Value *gather(ArrayRef<Value *> VL, Value *Root = nullptr) {		Value *gather(ArrayRef<Value *> VL, Value *Root = nullptr) {
Cost += getBuildVectorCost(VL, Root);		Cost += getBuildVectorCost(VL, Root);
if (!Root) {		if (!Root) {
assert(InVectors.empty() && "Unexpected input vectors for buildvector.");		assert(InVectors.empty() && "Unexpected input vectors for buildvector.");
// FIXME: Need to find a way to avoid use of getNullValue here.		// FIXME: Need to find a way to avoid use of getNullValue here.
SmallVector<Constant *> Vals;		SmallVector<Constant *> Vals;
for (Value *V : VL) {		for (Value *V : VL) {
[45 lines hidden]
};		};

InstructionCost		InstructionCost
BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,		BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
SmallPtrSetImpl<Value *> &CheckedExtracts) {		SmallPtrSetImpl<Value *> &CheckedExtracts) {
ArrayRef<Value *> VL = E->Scalars;		ArrayRef<Value *> VL = E->Scalars;

Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();
		if (E->State != TreeEntry::NeedToGather) {
if (auto *SI = dyn_cast<StoreInst>(VL[0]))		if (auto *SI = dyn_cast<StoreInst>(VL[0]))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
else if (auto *CI = dyn_cast<CmpInst>(VL[0]))		else if (auto *CI = dyn_cast<CmpInst>(VL[0]))
ScalarTy = CI->getOperand(0)->getType();		ScalarTy = CI->getOperand(0)->getType();
else if (auto *IE = dyn_cast<InsertElementInst>(VL[0]))		else if (auto *IE = dyn_cast<InsertElementInst>(VL[0]))
ScalarTy = IE->getOperand(1)->getType();		ScalarTy = IE->getOperand(1)->getType();
		}
		if (!FixedVectorType::isValidElementType(ScalarTy))
		return InstructionCost::getInvalid();
auto *VecTy = FixedVectorType::get(ScalarTy, VL.size());		auto *VecTy = FixedVectorType::get(ScalarTy, VL.size());
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;

// If we have computed a smaller type for the expression, update VecTy so		// If we have computed a smaller type for the expression, update VecTy so
// that the costs will be accurate.		// that the costs will be accurate.
auto It = MinBWs.find(VL.front());		auto It = MinBWs.find(VL.front());
if (It != MinBWs.end()) {		if (It != MinBWs.end()) {
ScalarTy = IntegerType::get(F->getContext(), It->second.first);		ScalarTy = IntegerType::get(F->getContext(), It->second.first);
VecTy = FixedVectorType::get(ScalarTy, VL.size());		VecTy = FixedVectorType::get(ScalarTy, VL.size());
}		}
unsigned EntryVF = E->getVectorFactor();		unsigned EntryVF = E->getVectorFactor();
auto *FinalVecTy = FixedVectorType::get(VecTy->getElementType(), EntryVF);		auto *FinalVecTy = FixedVectorType::get(ScalarTy, EntryVF);

bool NeedToShuffleReuses = !E->ReuseShuffleIndices.empty();		bool NeedToShuffleReuses = !E->ReuseShuffleIndices.empty();
if (E->State == TreeEntry::NeedToGather) {		if (E->State == TreeEntry::NeedToGather) {
if (allConstant(VL))		if (allConstant(VL))
return 0;		return 0;
if (isa<InsertElementInst>(VL[0]))		if (isa<InsertElementInst>(VL[0]))
return InstructionCost::getInvalid();		return InstructionCost::getInvalid();
// The gather nodes use small bitwidth only if all operands use the same		// The gather nodes use small bitwidth only if all operands use the same
[16 lines hidden]	if (E->State == TreeEntry::NeedToGather) {
// mask.		// mask.
SmallVector<int> ReorderMask;		SmallVector<int> ReorderMask;
inversePermutation(E->ReorderIndices, ReorderMask);		inversePermutation(E->ReorderIndices, ReorderMask);
if (!ReorderMask.empty())		if (!ReorderMask.empty())
reorderScalars(GatheredScalars, ReorderMask);		reorderScalars(GatheredScalars, ReorderMask);
SmallVector<int> Mask;		SmallVector<int> Mask;
SmallVector<int> ExtractMask;		SmallVector<int> ExtractMask;
std::optional<TargetTransformInfo::ShuffleKind> ExtractShuffle;		std::optional<TargetTransformInfo::ShuffleKind> ExtractShuffle;
std::optional<TargetTransformInfo::ShuffleKind> GatherShuffle;		SmallVector<std::optional<TargetTransformInfo::ShuffleKind>> GatherShuffles;
SmallVector<const TreeEntry *> Entries;		SmallVector<SmallVector<const TreeEntry *>> Entries;
		vdmitrie: perhaps worth renaming to GatherShuffles since it becomes an array of optionals
// Check for gathered extracts.		// Check for gathered extracts.
		vdmitrie: This ScalarTy declaration hides one at the function scope.
ExtractShuffle = tryToGatherSingleRegisterExtractElements(GatheredScalars, ExtractMask);		ExtractShuffle =
		tryToGatherSingleRegisterExtractElements(GatheredScalars, ExtractMask);

bool Resized = false;		bool Resized = false;
		unsigned NumParts = TTI->getNumberOfParts(VecTy);
		if (NumParts == 0 || NumParts >= GatheredScalars.size())
		vdmitrie: ScalarTy ?
		NumParts = 1;
if (Value *VecBase = Estimator.adjustExtracts(		if (Value *VecBase = Estimator.adjustExtracts(
E, ExtractMask, ExtractShuffle.value_or(TTI::SK_PermuteTwoSrc)))		E, ExtractMask, ExtractShuffle.value_or(TTI::SK_PermuteTwoSrc))) {
if (auto *VecBaseTy = dyn_cast<FixedVectorType>(VecBase->getType()))		if (auto *VecBaseTy = dyn_cast<FixedVectorType>(VecBase->getType()))
if (VF == VecBaseTy->getNumElements() && GatheredScalars.size() != VF) {		if (VF == VecBaseTy->getNumElements() && GatheredScalars.size() != VF) {
Resized = true;		Resized = true;
GatheredScalars.append(VF - GatheredScalars.size(),		GatheredScalars.append(VF - GatheredScalars.size(),
PoisonValue::get(ScalarTy));		PoisonValue::get(ScalarTy));
}		}
		} else if (ExtractShuffle &&
		vdmitrie: VecTy is already defined at 7674. Can just reuse it.
		TTI->getNumberOfParts(VecTy) == VecTy->getNumElements()) {
		copy(VL, GatheredScalars.begin());
		}

// Do not try to look for reshuffled loads for gathered loads (they will be		// Do not try to look for reshuffled loads for gathered loads (they will be
// handled later), for vectorized scalars, and cases, which are definitely		// handled later), for vectorized scalars, and cases, which are definitely
// not profitable (splats and small gather nodes.)		// not profitable (splats and small gather nodes.)
if (ExtractShuffle || E->getOpcode() != Instruction::Load ||		if (ExtractShuffle || E->getOpcode() != Instruction::Load ||
E->isAltShuffle() ||		E->isAltShuffle() ||
all_of(E->Scalars, [this](Value *V) { return getTreeEntry(V); }) ||		all_of(E->Scalars, [this](Value *V) { return getTreeEntry(V); }) ||
isSplat(E->Scalars) ||		isSplat(E->Scalars) ||
(E->Scalars != GatheredScalars && GatheredScalars.size() <= 2))		(E->Scalars != GatheredScalars && GatheredScalars.size() <= 2))
GatherShuffle = isGatherShuffledEntry(E, GatheredScalars, Mask, Entries);		GatherShuffles =
if (GatherShuffle) {		isGatherShuffledEntry(E, GatheredScalars, Mask, Entries, NumParts);
assert((Entries.size() == 1 || Entries.size() == 2) &&		if (!GatherShuffles.empty()) {
"Expected shuffle of 1 or 2 entries.");		if (GatherShuffles.size() == 1 &&
if (*GatherShuffle == TTI::SK_PermuteSingleSrc &&		*GatherShuffles.front() == TTI::SK_PermuteSingleSrc &&
Entries.front()->isSame(E->Scalars)) {		Entries.front().front()->isSame(E->Scalars)) {
// Perfect match in the graph, will reuse the previously vectorized		// Perfect match in the graph, will reuse the previously vectorized
// node. Cost is 0.		// node. Cost is 0.
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "SLP: perfect diamond match for gather bundle "		<< "SLP: perfect diamond match for gather bundle "
<< shortBundleName(VL) << ".\n");		<< shortBundleName(VL) << ".\n");
// Restore the mask for previous partially matched values.		// Restore the mask for previous partially matched values.
for (auto [I, V] : enumerate(E->Scalars)) {		for (auto [I, V] : enumerate(E->Scalars)) {
if (isa<PoisonValue>(V)) {		if (isa<PoisonValue>(V)) {
Mask[I] = PoisonMaskElem;		Mask[I] = PoisonMaskElem;
continue;		continue;
}		}
if (Mask[I] == PoisonMaskElem)		if (Mask[I] == PoisonMaskElem)
Mask[I] = Entries.front()->findLaneForValue(V);		Mask[I] = Entries.front().front()->findLaneForValue(V);
}		}
Estimator.add(Entries.front(), Mask);		Estimator.add(*Entries.front().front(), Mask);
return Estimator.finalize(E->ReuseShuffleIndices);		return Estimator.finalize(E->ReuseShuffleIndices);
}		}
if (!Resized) {		if (!Resized) {
unsigned VF1 = Entries.front()->getVectorFactor();		if (GatheredScalars.size() != VF &&
unsigned VF2 = Entries.back()->getVectorFactor();		any_of(Entries, [&](ArrayRef<const TreeEntry *> TEs) {
if ((VF == VF1 \|\| VF == VF2) && GatheredScalars.size() != VF)		return any_of(TEs, [&](const TreeEntry *TE) {
		return TE->getVectorFactor() == VF;
		});
		}))
GatheredScalars.append(VF - GatheredScalars.size(),		GatheredScalars.append(VF - GatheredScalars.size(),
PoisonValue::get(ScalarTy));		PoisonValue::get(ScalarTy));
}		}
// Remove shuffled elements from list of gathers.		// Remove shuffled elements from list of gathers.
for (int I = 0, Sz = Mask.size(); I < Sz; ++I) {		for (int I = 0, Sz = Mask.size(); I < Sz; ++I) {
if (Mask[I] != PoisonMaskElem)		if (Mask[I] != PoisonMaskElem)
GatheredScalars[I] = PoisonValue::get(ScalarTy);		GatheredScalars[I] = PoisonValue::get(ScalarTy);
}		}
LLVM_DEBUG(dbgs() << "SLP: shuffled " << Entries.size()		LLVM_DEBUG(dbgs() << "SLP: shuffled " << Entries.size()
<< " entries for bundle "		<< " entries for bundle "
<< shortBundleName(VL) << ".\n");		<< shortBundleName(VL) << ".\n");
Estimator.add(Entries.front(), Entries.back(), Mask);		unsigned SliceSize = E->Scalars.size() / NumParts;
		SmallVector<int> VecMask(Mask.size(), PoisonMaskElem);
		for (const auto [I, TEs] : enumerate(Entries)) {
		if (TEs.empty()) {
		assert(!GatherShuffles[I] &&
		"No shuffles with empty entries list expected.");
		continue;
		}
		assert((TEs.size() == 1 \|\| TEs.size() == 2) &&
		"Expected shuffle of 1 or 2 entries.");
		auto SubMask = ArrayRef(Mask).slice(I * SliceSize, SliceSize);
		VecMask.assign(VecMask.size(), PoisonMaskElem);
		copy(SubMask, std::next(VecMask.begin(), I * SliceSize));
		Estimator.add(TEs.front(), TEs.back(), VecMask);
		}
if (all_of(GatheredScalars, PoisonValue ::classof))		if (all_of(GatheredScalars, PoisonValue ::classof))
return Estimator.finalize(E->ReuseShuffleIndices);		return Estimator.finalize(E->ReuseShuffleIndices);
		RKSimon: Would we be better off merging the 1TE + 2TE add() methods?
		ABataev (author): I'll try to rework this function to avoid it.
return Estimator.finalize(		return Estimator.finalize(
E->ReuseShuffleIndices, E->Scalars.size(),		E->ReuseShuffleIndices, E->Scalars.size(),
[&](Value *&Vec, SmallVectorImpl<int> &Mask) {		[&](Value *&Vec, SmallVectorImpl<int> &Mask) {
Vec = Estimator.gather(GatheredScalars,		Vec = Estimator.gather(GatheredScalars,
Constant::getNullValue(FixedVectorType::get(		Constant::getNullValue(FixedVectorType::get(
ScalarTy, GatheredScalars.size())));		ScalarTy, GatheredScalars.size())));
});		});
}		}
if (!all_of(GatheredScalars, PoisonValue::classof)) {		if (!all_of(GatheredScalars, PoisonValue::classof)) {
auto Gathers = ArrayRef(GatheredScalars).take_front(VL.size());		auto Gathers = ArrayRef(GatheredScalars).take_front(VL.size());
bool SameGathers = VL.equals(Gathers);		bool SameGathers = VL.equals(Gathers);
Value *BV = Estimator.gather(		if (!SameGathers)
Gathers, SameGathers ? nullptr		return Estimator.finalize(
: Constant::getNullValue(FixedVectorType::get(		E->ReuseShuffleIndices, E->Scalars.size(),
		[&](Value *&Vec, SmallVectorImpl<int> &Mask) {
		Vec = Estimator.gather(
		GatheredScalars, Constant::getNullValue(FixedVectorType::get(
ScalarTy, GatheredScalars.size())));		ScalarTy, GatheredScalars.size())));
		});
		Value *BV = Estimator.gather(Gathers);
SmallVector<int> ReuseMask(Gathers.size(), PoisonMaskElem);		SmallVector<int> ReuseMask(Gathers.size(), PoisonMaskElem);
std::iota(ReuseMask.begin(), ReuseMask.end(), 0);		std::iota(ReuseMask.begin(), ReuseMask.end(), 0);
Estimator.add(BV, ReuseMask);		Estimator.add(BV, ReuseMask);
}		}
if (ExtractShuffle)
Estimator.add(E, std::nullopt);
return Estimator.finalize(E->ReuseShuffleIndices);		return Estimator.finalize(E->ReuseShuffleIndices);
}		}
InstructionCost CommonCost = 0;		InstructionCost CommonCost = 0;
SmallVector<int> Mask;		SmallVector<int> Mask;
if (!E->ReorderIndices.empty() &&		if (!E->ReorderIndices.empty() &&
E->State != TreeEntry::PossibleStridedVectorize) {		E->State != TreeEntry::PossibleStridedVectorize) {
SmallVector<int> NewMask;		SmallVector<int> NewMask;
if (E->getOpcode() == Instruction::Store) {		if (E->getOpcode() == Instruction::Store) {
[1,306 lines not shown]	#ifndef NDEBUG
if (ViewSLPTree)		if (ViewSLPTree)
ViewGraph(this, "SLP" + F->getName(), false, Str);		ViewGraph(this, "SLP" + F->getName(), false, Str);
#endif		#endif

return Cost;		return Cost;
}		}

std::optional<TargetTransformInfo::ShuffleKind>		std::optional<TargetTransformInfo::ShuffleKind>
BoUpSLP::isGatherShuffledEntry(const TreeEntry *TE, ArrayRef<Value *> VL,		BoUpSLP::isGatherShuffledSingleRegisterEntry(
SmallVectorImpl<int> &Mask,		const TreeEntry *TE, ArrayRef<Value *> VL, MutableArrayRef<int> Mask,
SmallVectorImpl<const TreeEntry *> &Entries) {		SmallVectorImpl<const TreeEntry *> &Entries, unsigned Part) {
		reames: Stylistic suggestion here. This function is quite complicated, and following all the changes through are tricky. I think this is just performing the same operation for each sub-range (divided by register). If so, I'd suggest leaving the function alone, and introducing a wrapper which just calls the old version with a sub-range of the VLs, and builds up the vector results. If this works here, then applying the same basic idea throughout the patch would make it much easier to follow and review.
		ABataev (author): Yes, it splits the multi-register vector into several one-register vectors and just performs the same analysis, as before, to avoid long multi-vector shuffles. Will try to move the previous logic into a separate function.
Entries.clear();		Entries.clear();
// No need to check for the topmost gather node.
if (TE == VectorizableTree.front().get())
return std::nullopt;
Mask.assign(VL.size(), PoisonMaskElem);
assert(TE->UserTreeIndices.size() == 1 &&
"Expected only single user of the gather node.");
// TODO: currently checking only for Scalars in the tree entry, need to count		// TODO: currently checking only for Scalars in the tree entry, need to count
// reused elements too for better cost estimation.		// reused elements too for better cost estimation.
const EdgeInfo &TEUseEI = TE->UserTreeIndices.front();		const EdgeInfo &TEUseEI = TE->UserTreeIndices.front();
const Instruction *TEInsertPt = &getLastInstructionInBundle(TEUseEI.UserTE);		const Instruction *TEInsertPt = &getLastInstructionInBundle(TEUseEI.UserTE);
const BasicBlock *TEInsertBlock = nullptr;		const BasicBlock *TEInsertBlock = nullptr;
// Main node of PHI entries keeps the correct order of operands/incoming		// Main node of PHI entries keeps the correct order of operands/incoming
// blocks.		// blocks.
if (auto *PHI = dyn_cast<PHINode>(TEUseEI.UserTE->getMainOp())) {		if (auto *PHI = dyn_cast<PHINode>(TEUseEI.UserTE->getMainOp())) {
[58 lines not shown]	for (const TreeEntry *TEPtr : ValueToGatherNodes.find(V)->second) {
"Expected only single user of a gather node.");		"Expected only single user of a gather node.");
const EdgeInfo &UseEI = TEPtr->UserTreeIndices.front();		const EdgeInfo &UseEI = TEPtr->UserTreeIndices.front();

PHINode *UserPHI = dyn_cast<PHINode>(UseEI.UserTE->getMainOp());		PHINode *UserPHI = dyn_cast<PHINode>(UseEI.UserTE->getMainOp());
const Instruction *InsertPt =		const Instruction *InsertPt =
UserPHI ? UserPHI->getIncomingBlock(UseEI.EdgeIdx)->getTerminator()		UserPHI ? UserPHI->getIncomingBlock(UseEI.EdgeIdx)->getTerminator()
: &getLastInstructionInBundle(UseEI.UserTE);		: &getLastInstructionInBundle(UseEI.UserTE);
if (TEInsertPt == InsertPt) {		if (TEInsertPt == InsertPt) {
// If 2 gathers are operands of the same entry (regardless of whether		// If 2 gathers are operands of the same entry (regardless of whether
// user is PHI or else), compare operands indices, use the earlier one		// user is PHI or else), compare operands indices, use the earlier one
// as the base.		// as the base.
if (TEUseEI.UserTE == UseEI.UserTE && TEUseEI.EdgeIdx < UseEI.EdgeIdx)		if (TEUseEI.UserTE == UseEI.UserTE && TEUseEI.EdgeIdx < UseEI.EdgeIdx)
continue;		continue;
// If the user instruction is used for some reason in different		// If the user instruction is used for some reason in different
// vectorized nodes - make it depend on index.		// vectorized nodes - make it depend on index.
if (TEUseEI.UserTE != UseEI.UserTE && TE->Idx < TEPtr->Idx)		if (TEUseEI.UserTE != UseEI.UserTE && TE->Idx < TEPtr->Idx)
continue;		continue;
[48 lines not shown]	if (UsedTEs.empty()) {
continue;		continue;
UsedTEs.push_back(SavedVToTEs);		UsedTEs.push_back(SavedVToTEs);
Idx = UsedTEs.size() - 1;		Idx = UsedTEs.size() - 1;
}		}
UsedValuesEntry.try_emplace(V, Idx);		UsedValuesEntry.try_emplace(V, Idx);
}		}
}		}

if (UsedTEs.empty())		if (UsedTEs.empty()) {
		Entries.clear();
return std::nullopt;		return std::nullopt;
		}

unsigned VF = 0;		unsigned VF = 0;
if (UsedTEs.size() == 1) {		if (UsedTEs.size() == 1) {
// Keep the order to avoid non-determinism.		// Keep the order to avoid non-determinism.
SmallVector<const TreeEntry *> FirstEntries(UsedTEs.front().begin(),		SmallVector<const TreeEntry *> FirstEntries(UsedTEs.front().begin(),
UsedTEs.front().end());		UsedTEs.front().end());
sort(FirstEntries, [](const TreeEntry *TE1, const TreeEntry *TE2) {		sort(FirstEntries, [](const TreeEntry *TE1, const TreeEntry *TE2) {
return TE1->Idx < TE2->Idx;		return TE1->Idx < TE2->Idx;
});		});
// Try to find the perfect match in another gather node at first.		// Try to find the perfect match in another gather node at first.
auto It = find_if(FirstEntries, [=](const TreeEntry *EntryPtr) {		auto It = find_if(FirstEntries, [=](const TreeEntry *EntryPtr) {
return EntryPtr->isSame(VL) || EntryPtr->isSame(TE->Scalars);		return EntryPtr->isSame(VL) || EntryPtr->isSame(TE->Scalars);
});		});
if (It != FirstEntries.end() && (*It)->getVectorFactor() == VL.size()) {		if (It != FirstEntries.end() && (*It)->getVectorFactor() == VL.size()) {
Entries.push_back(*It);		Entries.push_back(*It);
std::iota(Mask.begin(), Mask.end(), 0);		std::iota(std::next(Mask.begin(), Part * VL.size()),
		std::next(Mask.begin(), (Part + 1) * VL.size()), 0);
// Clear undef scalars.		// Clear undef scalars.
for (int I = 0, Sz = VL.size(); I < Sz; ++I)		for (int I = 0, Sz = VL.size(); I < Sz; ++I)
if (isa<PoisonValue>(VL[I]))		if (isa<PoisonValue>(VL[I]))
Mask[I] = PoisonMaskElem;		Mask[I] = PoisonMaskElem;
return TargetTransformInfo::SK_PermuteSingleSrc;		return TargetTransformInfo::SK_PermuteSingleSrc;
}		}
// No perfect match, just shuffle, so choose the first tree node from the		// No perfect match, just shuffle, so choose the first tree node from the
// tree.		// tree.
[120 lines not shown]	for (unsigned I = 0, Sz = Entries.size(); I < Sz; ++I) {
// These indices are used when calculating final shuffle mask as the vector		// These indices are used when calculating final shuffle mask as the vector
// offset.		// offset.
for (std::pair<unsigned, int> &Pair : EntryLanes)		for (std::pair<unsigned, int> &Pair : EntryLanes)
if (Pair.first == I)		if (Pair.first == I)
Pair.first = TempEntries.size();		Pair.first = TempEntries.size();
TempEntries.push_back(Entries[I]);		TempEntries.push_back(Entries[I]);
}		}
Entries.swap(TempEntries);		Entries.swap(TempEntries);
if (EntryLanes.size() == Entries.size() && !VL.equals(TE->Scalars)) {		if (EntryLanes.size() == Entries.size() &&
		!VL.equals(ArrayRef(TE->Scalars)
		.slice(Part * VL.size(),
		std::min<int>(VL.size(), TE->Scalars.size())))) {
// We may have here 1 or 2 entries only. If the number of scalars is equal		// We may have here 1 or 2 entries only. If the number of scalars is equal
// to the number of entries, no need to do the analysis, it is not very		// to the number of entries, no need to do the analysis, it is not very
// profitable. Since VL is not the same as TE->Scalars, it means we already		// profitable. Since VL is not the same as TE->Scalars, it means we already
// have some shuffles before. Cut off not profitable case.		// have some shuffles before. Cut off not profitable case.
Entries.clear();		Entries.clear();
return std::nullopt;		return std::nullopt;
}		}
// Build the final mask, check for the identity shuffle, if possible.		// Build the final mask, check for the identity shuffle, if possible.
bool IsIdentity = Entries.size() == 1;		bool IsIdentity = Entries.size() == 1;
// Pair.first is the offset to the vector, while Pair.second is the index of		// Pair.first is the offset to the vector, while Pair.second is the index of
// scalar in the list.		// scalar in the list.
for (const std::pair<unsigned, int> &Pair : EntryLanes) {		for (const std::pair<unsigned, int> &Pair : EntryLanes) {
Mask[Pair.second] = Pair.first * VF +		unsigned Idx = Part * VL.size() + Pair.second;
		Mask[Idx] = Pair.first * VF +
Entries[Pair.first]->findLaneForValue(VL[Pair.second]);		Entries[Pair.first]->findLaneForValue(VL[Pair.second]);
IsIdentity &= Mask[Pair.second] == Pair.second;		IsIdentity &= Mask[Idx] == Pair.second;
}		}
switch (Entries.size()) {		switch (Entries.size()) {
case 1:		case 1:
if (IsIdentity || EntryLanes.size() > 1 || VL.size() <= 2)		if (IsIdentity || EntryLanes.size() > 1 || VL.size() <= 2)
return TargetTransformInfo::SK_PermuteSingleSrc;		return TargetTransformInfo::SK_PermuteSingleSrc;
break;		break;
case 2:		case 2:
if (EntryLanes.size() > 2 || VL.size() <= 2)		if (EntryLanes.size() > 2 || VL.size() <= 2)
return TargetTransformInfo::SK_PermuteTwoSrc;		return TargetTransformInfo::SK_PermuteTwoSrc;
break;		break;
default:		default:
break;		break;
}		}
Entries.clear();		Entries.clear();
		// Clear the corresponding mask elements.
		std::fill(std::next(Mask.begin(), Part * VL.size()),
		std::next(Mask.begin(), (Part + 1) * VL.size()), PoisonMaskElem);
return std::nullopt;		return std::nullopt;
}		}

		SmallVector<std::optional<TargetTransformInfo::ShuffleKind>>
		BoUpSLP::isGatherShuffledEntry(
		const TreeEntry *TE, ArrayRef<Value *> VL, SmallVectorImpl<int> &Mask,
		SmallVectorImpl<SmallVector<const TreeEntry *>> &Entries,
		unsigned NumParts) {
		assert(NumParts > 0 && NumParts < VL.size() &&
		"Expected positive number of registers.");
		RKSimon: Expected positive
		vdmitrie: Typo: "poistive"
		Entries.clear();
		// No need to check for the topmost gather node.
		if (TE == VectorizableTree.front().get())
		return {};
		Mask.assign(VL.size(), PoisonMaskElem);
		assert(TE->UserTreeIndices.size() == 1 &&
		"Expected only single user of the gather node.");
		assert(VL.size() % NumParts == 0 &&
		RKSimon: assert(VL.size() % NumParts == 0)?
		"Number of scalars must be divisible by NumParts.");
		unsigned SliceSize = VL.size() / NumParts;
		SmallVector<std::optional<TTI::ShuffleKind>> Res;
		for (unsigned Part = 0; Part < NumParts; ++Part) {
		ArrayRef<Value *> SubVL = VL.slice(Part * SliceSize, SliceSize);
		SmallVectorImpl<const TreeEntry *> &SubEntries = Entries.emplace_back();
		std::optional<TTI::ShuffleKind> SubRes =
		isGatherShuffledSingleRegisterEntry(TE, SubVL, Mask, SubEntries, Part);
		if (!SubRes)
		SubEntries.clear();
		Res.push_back(SubRes);
		if (SubEntries.size() == 1 &&
		SubRes.value_or(TTI::SK_PermuteTwoSrc) == TTI::SK_PermuteSingleSrc &&
		SubEntries.front()->getVectorFactor() == VL.size() &&
		(SubEntries.front()->isSame(TE->Scalars) ||
		SubEntries.front()->isSame(VL))) {
		Entries.clear();
		Res.clear();
		std::iota(Mask.begin(), Mask.end(), 0);
		// Clear undef scalars.
		for (int I = 0, Sz = VL.size(); I < Sz; ++I)
		if (isa<PoisonValue>(VL[I]))
		Mask[I] = PoisonMaskElem;
		Entries.emplace_back(1, SubEntries.front());
		Res.push_back(TargetTransformInfo::SK_PermuteSingleSrc);
		return Res;
		}
		}
		if (all_of(Res,
		[](const std::optional<TTI::ShuffleKind> &SK) { return !SK; })) {
		Entries.clear();
		return {};
		}
		return Res;
		}
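
The per-register wrapper discussed in the review (split the scalar list into `NumParts` equal slices, run the single-register analysis on each slice, and collect one result per slice while all slices write into one shared mask) can be illustrated with a small standalone sketch. This is not the patch's code: `analyzeSlice`, `analyzePerRegister`, `ShuffleKind`, `kPoisonLane`, and the toy matching rule (look each scalar up in a single `Source` vector) are hypothetical stand-ins chosen so the example compiles without LLVM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-ins; none of these names come from the patch.
constexpr int kPoisonLane = -1; // plays the role of PoisonMaskElem
enum class ShuffleKind { SingleSrc, TwoSrc };

// Toy single-register analysis: the slice [Begin, Begin + Size) of VL matches
// if every scalar is found in Source. On success the slice's mask lanes hold
// the lanes in Source; on failure the slice's mask lanes are reset to poison.
std::optional<ShuffleKind> analyzeSlice(const std::vector<int> &VL,
                                        std::size_t Begin, std::size_t Size,
                                        const std::vector<int> &Source,
                                        std::vector<int> &Mask) {
  for (std::size_t I = 0; I < Size; ++I) {
    auto It = std::find(Source.begin(), Source.end(), VL[Begin + I]);
    if (It == Source.end()) {
      // Clear only this slice's lanes, leaving other slices untouched.
      std::fill(Mask.begin() + Begin, Mask.begin() + Begin + Size,
                kPoisonLane);
      return std::nullopt;
    }
    Mask[Begin + I] = static_cast<int>(It - Source.begin());
  }
  return ShuffleKind::SingleSrc;
}

// Wrapper: one analysis per register-sized slice, one result per slice.
// Returns an empty vector when no slice could be matched at all.
std::vector<std::optional<ShuffleKind>>
analyzePerRegister(const std::vector<int> &VL, std::size_t NumParts,
                   const std::vector<int> &Source, std::vector<int> &Mask) {
  assert(NumParts > 0 && VL.size() % NumParts == 0 &&
         "Number of scalars must be divisible by NumParts.");
  Mask.assign(VL.size(), kPoisonLane);
  const std::size_t SliceSize = VL.size() / NumParts;
  std::vector<std::optional<ShuffleKind>> Res;
  for (std::size_t Part = 0; Part < NumParts; ++Part)
    Res.push_back(
        analyzeSlice(VL, Part * SliceSize, SliceSize, Source, Mask));
  if (std::all_of(Res.begin(), Res.end(),
                  [](const std::optional<ShuffleKind> &SK) { return !SK; }))
    return {};
  return Res;
}
```

With `Source = {10, 20, 30, 40}`, `VL = {20, 10, 99, 30}` and two parts, the first slice matches and the second does not, so only the first half of the mask is populated. The sketch omits the patch's early exit for a whole-vector perfect match.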

InstructionCost BoUpSLP::getGatherCost(ArrayRef<Value *> VL,		InstructionCost BoUpSLP::getGatherCost(ArrayRef<Value *> VL,
bool ForPoisonSrc) const {		bool ForPoisonSrc) const {
// Find the type of the operands in VL.		// Find the type of the operands in VL.
Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
auto *VecTy = FixedVectorType::get(ScalarTy, VL.size());		auto *VecTy = FixedVectorType::get(ScalarTy, VL.size());
bool DuplicateNonConst = false;		bool DuplicateNonConst = false;
[450 lines not shown]	for (int I = 0, Sz = Mask.size(); I < Sz; ++I) {
}))		}))
continue;		continue;
R.eraseInstruction(EI);		R.eraseInstruction(EI);
}		}
return VecBase;		return VecBase;
}		}
/// Checks if the specified entry \p E needs to be delayed because of its		/// Checks if the specified entry \p E needs to be delayed because of its
/// dependency nodes.		/// dependency nodes.
Value *needToDelay(const TreeEntry *E, ArrayRef<const TreeEntry *> Deps) {		Value *needToDelay(const TreeEntry *E,
		ArrayRef<SmallVector<const TreeEntry *>> Deps) {
// No need to delay emission if all deps are ready.		// No need to delay emission if all deps are ready.
if (all_of(Deps, [](const TreeEntry *TE) { return TE->VectorizedValue; }))		if (all_of(Deps, [](ArrayRef<const TreeEntry *> TEs) {
		return all_of(
		TEs, [](const TreeEntry *TE) { return TE->VectorizedValue; });
		}))
return nullptr;		return nullptr;
// Postpone gather emission, will be emitted after the end of the		// Postpone gather emission, will be emitted after the end of the
// process to keep correct order.		// process to keep correct order.
auto *VecTy = FixedVectorType::get(E->Scalars.front()->getType(),		auto *VecTy = FixedVectorType::get(E->Scalars.front()->getType(),
E->getVectorFactor());		E->getVectorFactor());
return Builder.CreateAlignedLoad(		return Builder.CreateAlignedLoad(
VecTy, PoisonValue::get(PointerType::getUnqual(VecTy->getContext())),		VecTy, PoisonValue::get(PointerType::getUnqual(VecTy->getContext())),
MaybeAlign());		MaybeAlign());
[286 lines not shown]	SmallVector<int> ReuseShuffleIndicies(E->ReuseShuffleIndices.begin(),
E->ReuseShuffleIndices.end());		E->ReuseShuffleIndices.end());
SmallVector<Value *> GatheredScalars(E->Scalars.begin(), E->Scalars.end());		SmallVector<Value *> GatheredScalars(E->Scalars.begin(), E->Scalars.end());
// Build a mask out of the reorder indices and reorder scalars per this		// Build a mask out of the reorder indices and reorder scalars per this
// mask.		// mask.
SmallVector<int> ReorderMask;		SmallVector<int> ReorderMask;
inversePermutation(E->ReorderIndices, ReorderMask);		inversePermutation(E->ReorderIndices, ReorderMask);
if (!ReorderMask.empty())		if (!ReorderMask.empty())
reorderScalars(GatheredScalars, ReorderMask);		reorderScalars(GatheredScalars, ReorderMask);
auto FindReusedSplat = [&](MutableArrayRef<int> Mask, unsigned InputVF) {		auto FindReusedSplat = [&](MutableArrayRef<int> Mask, unsigned InputVF) {
		RKSimon: Pull these kind of NFC cleanups out of the patch
if (!isSplat(E->Scalars) \|\| none_of(E->Scalars, [](Value *V) {		if (!isSplat(E->Scalars) \|\| none_of(E->Scalars, [](Value *V) {
return isa<UndefValue>(V) && !isa<PoisonValue>(V);		return isa<UndefValue>(V) && !isa<PoisonValue>(V);
}))		}))
return false;		return false;
TreeEntry *UserTE = E->UserTreeIndices.back().UserTE;		TreeEntry *UserTE = E->UserTreeIndices.back().UserTE;
unsigned EdgeIdx = E->UserTreeIndices.back().EdgeIdx;		unsigned EdgeIdx = E->UserTreeIndices.back().EdgeIdx;
if (UserTE->getNumOperands() != 2)		if (UserTE->getNumOperands() != 2)
return false;		return false;
[15 lines not shown]	auto FindReusedSplat = [&](MutableArrayRef<int> Mask, unsigned InputVF) {
}		}
return true;		return true;
};		};
BVTy ShuffleBuilder(Params...);		BVTy ShuffleBuilder(Params...);
ResTy Res = ResTy();		ResTy Res = ResTy();
SmallVector<int> Mask;		SmallVector<int> Mask;
SmallVector<int> ExtractMask;		SmallVector<int> ExtractMask;
std::optional<TargetTransformInfo::ShuffleKind> ExtractShuffle;		std::optional<TargetTransformInfo::ShuffleKind> ExtractShuffle;
std::optional<TargetTransformInfo::ShuffleKind> GatherShuffle;		SmallVector<std::optional<TargetTransformInfo::ShuffleKind>> GatherShuffles;
SmallVector<const TreeEntry *> Entries;		SmallVector<SmallVector<const TreeEntry *>> Entries;
Type *ScalarTy = GatheredScalars.front()->getType();		Type *ScalarTy = GatheredScalars.front()->getType();
		unsigned NumParts = TTI->getNumberOfParts(
		FixedVectorType::get(ScalarTy, GatheredScalars.size()));
		if (NumParts == 0 || NumParts >= GatheredScalars.size())
		NumParts = 1;
if (!all_of(GatheredScalars, UndefValue::classof)) {		if (!all_of(GatheredScalars, UndefValue::classof)) {
// Check for gathered extracts.		// Check for gathered extracts.
ExtractShuffle =		ExtractShuffle =
tryToGatherSingleRegisterExtractElements(GatheredScalars, ExtractMask);		tryToGatherSingleRegisterExtractElements(GatheredScalars, ExtractMask);
bool Resized = false;		bool Resized = false;
if (Value *VecBase = ShuffleBuilder.adjustExtracts(E, ExtractMask))		if (Value *VecBase = ShuffleBuilder.adjustExtracts(E, ExtractMask))
if (auto *VecBaseTy = dyn_cast<FixedVectorType>(VecBase->getType()))		if (auto *VecBaseTy = dyn_cast<FixedVectorType>(VecBase->getType()))
if (VF == VecBaseTy->getNumElements() && GatheredScalars.size() != VF) {		if (VF == VecBaseTy->getNumElements() && GatheredScalars.size() != VF) {
Resized = true;		Resized = true;
GatheredScalars.append(VF - GatheredScalars.size(),		GatheredScalars.append(VF - GatheredScalars.size(),
PoisonValue::get(ScalarTy));		PoisonValue::get(ScalarTy));
}		}
// Gather extracts after we check for full matched gathers only.		// Gather extracts after we check for full matched gathers only.
if (ExtractShuffle || E->getOpcode() != Instruction::Load ||		if (ExtractShuffle || E->getOpcode() != Instruction::Load ||
E->isAltShuffle() ||		E->isAltShuffle() ||
all_of(E->Scalars, [this](Value *V) { return getTreeEntry(V); }) ||		all_of(E->Scalars, [this](Value *V) { return getTreeEntry(V); }) ||
isSplat(E->Scalars) ||		isSplat(E->Scalars) ||
(E->Scalars != GatheredScalars && GatheredScalars.size() <= 2)) {		(E->Scalars != GatheredScalars && GatheredScalars.size() <= 2)) {
GatherShuffle = isGatherShuffledEntry(E, GatheredScalars, Mask, Entries);		GatherShuffles =
		isGatherShuffledEntry(E, GatheredScalars, Mask, Entries, NumParts);
}		}
if (GatherShuffle) {		if (!GatherShuffles.empty()) {
if (Value *Delayed = ShuffleBuilder.needToDelay(E, Entries)) {		if (Value *Delayed = ShuffleBuilder.needToDelay(E, Entries)) {
// Delay emission of gathers which are not ready yet.		// Delay emission of gathers which are not ready yet.
PostponedGathers.insert(E);		PostponedGathers.insert(E);
// Postpone gather emission, will be emitted after the end of the		// Postpone gather emission, will be emitted after the end of the
// process to keep correct order.		// process to keep correct order.
return Delayed;		return Delayed;
}		}
assert((Entries.size() == 1 \|\| Entries.size() == 2) &&		if (GatherShuffles.size() == 1 &&
"Expected shuffle of 1 or 2 entries.");		*GatherShuffles.front() == TTI::SK_PermuteSingleSrc &&
if (*GatherShuffle == TTI::SK_PermuteSingleSrc &&		Entries.front().front()->isSame(E->Scalars)) {
Entries.front()->isSame(E->Scalars)) {
// Perfect match in the graph, will reuse the previously vectorized		// Perfect match in the graph, will reuse the previously vectorized
// node. Cost is 0.		// node. Cost is 0.
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "SLP: perfect diamond match for gather bundle "		<< "SLP: perfect diamond match for gather bundle "
<< shortBundleName(E->Scalars) << ".\n");		<< shortBundleName(E->Scalars) << ".\n");
// Restore the mask for previous partially matched values.		// Restore the mask for previous partially matched values.
if (Entries.front()->ReorderIndices.empty() &&		const TreeEntry *FrontTE = Entries.front().front();
		vdmitrie: nit: suggest hoisting "Entries.front().front()" subexpr and assign to dedicated local variable of type "const TreeEntry *"
((Entries.front()->ReuseShuffleIndices.empty() &&		if (FrontTE->ReorderIndices.empty() &&
E->Scalars.size() == Entries.front()->Scalars.size()) \|\|		((FrontTE->ReuseShuffleIndices.empty() &&
(E->Scalars.size() ==		E->Scalars.size() == FrontTE->Scalars.size()) \|\|
Entries.front()->ReuseShuffleIndices.size()))) {		(E->Scalars.size() == FrontTE->ReuseShuffleIndices.size()))) {
std::iota(Mask.begin(), Mask.end(), 0);		std::iota(Mask.begin(), Mask.end(), 0);
} else {		} else {
for (auto [I, V] : enumerate(E->Scalars)) {		for (auto [I, V] : enumerate(E->Scalars)) {
if (isa<PoisonValue>(V)) {		if (isa<PoisonValue>(V)) {
Mask[I] = PoisonMaskElem;		Mask[I] = PoisonMaskElem;
continue;		continue;
}		}
Mask[I] = Entries.front()->findLaneForValue(V);		Mask[I] = FrontTE->findLaneForValue(V);
}		}
}		}
ShuffleBuilder.add(Entries.front()->VectorizedValue, Mask);		ShuffleBuilder.add(FrontTE->VectorizedValue, Mask);
Res = ShuffleBuilder.finalize(E->getCommonMask());		Res = ShuffleBuilder.finalize(E->getCommonMask());
return Res;		return Res;
}		}
if (!Resized) {		if (!Resized) {
unsigned VF1 = Entries.front()->getVectorFactor();		if (GatheredScalars.size() != VF &&
unsigned VF2 = Entries.back()->getVectorFactor();		any_of(Entries, [&](ArrayRef<const TreeEntry *> TEs) {
if ((VF == VF1 \|\| VF == VF2) && GatheredScalars.size() != VF)		return any_of(TEs, [&](const TreeEntry *TE) {
		return TE->getVectorFactor() == VF;
		});
		}))
GatheredScalars.append(VF - GatheredScalars.size(),		GatheredScalars.append(VF - GatheredScalars.size(),
PoisonValue::get(ScalarTy));		PoisonValue::get(ScalarTy));
}		}
// Remove shuffled elements from list of gathers.		// Remove shuffled elements from list of gathers.
for (int I = 0, Sz = Mask.size(); I < Sz; ++I) {		for (int I = 0, Sz = Mask.size(); I < Sz; ++I) {
if (Mask[I] != PoisonMaskElem)		if (Mask[I] != PoisonMaskElem)
GatheredScalars[I] = PoisonValue::get(ScalarTy);		GatheredScalars[I] = PoisonValue::get(ScalarTy);
}		}
[83 lines not shown]	if (NumNonConsts == 1) {
ReuseMask[I] = PoisonMaskElem;		ReuseMask[I] = PoisonMaskElem;
if (isa<UndefValue>(Scalars[I]))		if (isa<UndefValue>(Scalars[I]))
Scalars[I] = PoisonValue::get(ScalarTy);		Scalars[I] = PoisonValue::get(ScalarTy);
}		}
NeedFreeze = true;		NeedFreeze = true;
}		}
}		}
};		};
if (ExtractShuffle \|\| GatherShuffle) {		if (ExtractShuffle \|\| !GatherShuffles.empty()) {
bool IsNonPoisoned = true;		bool IsNonPoisoned = true;
bool IsUsedInExpr = false;		bool IsUsedInExpr = true;
Value *Vec1 = nullptr;		Value *Vec1 = nullptr;
if (ExtractShuffle) {		if (ExtractShuffle) {
// Gather of extractelements can be represented as just a shuffle of		// Gather of extractelements can be represented as just a shuffle of
// a single/two vectors the scalars are extracted from.		// a single/two vectors the scalars are extracted from.
// Find input vectors.		// Find input vectors.
Value *Vec2 = nullptr;		Value *Vec2 = nullptr;
for (unsigned I = 0, Sz = ExtractMask.size(); I < Sz; ++I) {		for (unsigned I = 0, Sz = ExtractMask.size(); I < Sz; ++I) {
if (ExtractMask[I] == PoisonMaskElem \|\|		if (ExtractMask[I] == PoisonMaskElem \|\|
(!Mask.empty() && Mask[I] != PoisonMaskElem)) {		(!Mask.empty() && Mask[I] != PoisonMaskElem)) {
ExtractMask[I] = PoisonMaskElem;		ExtractMask[I] = PoisonMaskElem;
continue;		continue;
}		}
if (isa<UndefValue>(E->Scalars[I]))		if (isa<UndefValue>(E->Scalars[I]))
continue;		continue;
auto *EI = cast<ExtractElementInst>(E->Scalars[I]);		auto *EI = cast<ExtractElementInst>(E->Scalars[I]);
if (!Vec1) {		if (!Vec1) {
Vec1 = EI->getVectorOperand();		Vec1 = EI->getVectorOperand();
} else if (Vec1 != EI->getVectorOperand()) {		} else if (Vec1 != EI->getVectorOperand()) {
assert((!Vec2 \|\| Vec2 == EI->getVectorOperand()) &&		assert((!Vec2 \|\| Vec2 == EI->getVectorOperand()) &&
"Expected only 1 or 2 vectors shuffle.");		"Expected only 1 or 2 vectors shuffle.");
Vec2 = EI->getVectorOperand();		Vec2 = EI->getVectorOperand();
}		}
}		}
if (Vec2) {		if (Vec2) {
		IsUsedInExpr = false;
IsNonPoisoned &=		IsNonPoisoned &=
isGuaranteedNotToBePoison(Vec1) && isGuaranteedNotToBePoison(Vec2);		isGuaranteedNotToBePoison(Vec1) && isGuaranteedNotToBePoison(Vec2);
ShuffleBuilder.add(Vec1, Vec2, ExtractMask);		ShuffleBuilder.add(Vec1, Vec2, ExtractMask);
} else if (Vec1) {		} else if (Vec1) {
IsUsedInExpr = FindReusedSplat(		IsUsedInExpr &= FindReusedSplat(
ExtractMask,		ExtractMask,
cast<FixedVectorType>(Vec1->getType())->getNumElements());		cast<FixedVectorType>(Vec1->getType())->getNumElements());
ShuffleBuilder.add(Vec1, ExtractMask);		ShuffleBuilder.add(Vec1, ExtractMask);
IsNonPoisoned &= isGuaranteedNotToBePoison(Vec1);		IsNonPoisoned &= isGuaranteedNotToBePoison(Vec1);
} else {		} else {
		IsUsedInExpr = false;
ShuffleBuilder.add(PoisonValue::get(FixedVectorType::get(		ShuffleBuilder.add(PoisonValue::get(FixedVectorType::get(
ScalarTy, GatheredScalars.size())),		ScalarTy, GatheredScalars.size())),
ExtractMask);		ExtractMask);
}		}
}		}
if (GatherShuffle) {		if (!GatherShuffles.empty()) {
if (Entries.size() == 1) {		unsigned SliceSize = E->Scalars.size() / NumParts;
IsUsedInExpr = FindReusedSplat(		SmallVector<int> VecMask(Mask.size(), PoisonMaskElem);
Mask,		for (const auto [I, TEs] : enumerate(Entries)) {
cast<FixedVectorType>(Entries.front()->VectorizedValue->getType())		if (TEs.empty()) {
		assert(!GatherShuffles[I] &&
		"No shuffles with empty entries list expected.");
		continue;
		}
		assert((TEs.size() == 1 \|\| TEs.size() == 2) &&
		"Expected shuffle of 1 or 2 entries.");
		auto SubMask = ArrayRef(Mask).slice(I * SliceSize, SliceSize);
		VecMask.assign(VecMask.size(), PoisonMaskElem);
		copy(SubMask, std::next(VecMask.begin(), I * SliceSize));
		if (TEs.size() == 1) {
		IsUsedInExpr &= FindReusedSplat(
		VecMask,
		cast<FixedVectorType>(TEs.front()->VectorizedValue->getType())
->getNumElements());		->getNumElements());
ShuffleBuilder.add(Entries.front()->VectorizedValue, Mask);		ShuffleBuilder.add(TEs.front()->VectorizedValue, VecMask);
IsNonPoisoned &=		IsNonPoisoned &=
isGuaranteedNotToBePoison(Entries.front()->VectorizedValue);		isGuaranteedNotToBePoison(TEs.front()->VectorizedValue);
} else {		} else {
ShuffleBuilder.add(Entries.front()->VectorizedValue,		IsUsedInExpr = false;
Entries.back()->VectorizedValue, Mask);		ShuffleBuilder.add(TEs.front()->VectorizedValue,
		TEs.back()->VectorizedValue, VecMask);
IsNonPoisoned &=		IsNonPoisoned &=
isGuaranteedNotToBePoison(Entries.front()->VectorizedValue) &&		isGuaranteedNotToBePoison(TEs.front()->VectorizedValue) &&
isGuaranteedNotToBePoison(Entries.back()->VectorizedValue);		isGuaranteedNotToBePoison(TEs.back()->VectorizedValue);
		}
}		}
}		}
// Try to figure out best way to combine values: build a shuffle and insert		// Try to figure out best way to combine values: build a shuffle and insert
// elements or just build several shuffles.		// elements or just build several shuffles.
// Insert non-constant scalars.		// Insert non-constant scalars.
SmallVector<Value *> NonConstants(GatheredScalars);		SmallVector<Value *> NonConstants(GatheredScalars);
int EMSz = ExtractMask.size();		int EMSz = ExtractMask.size();
int MSz = Mask.size();		int MSz = Mask.size();
// Try to build constant vector and shuffle with it only if currently we		// Try to build constant vector and shuffle with it only if currently we
// have a single permutation and more than 1 scalar constants.		// have a single permutation and more than 1 scalar constants.
bool IsSingleShuffle = !ExtractShuffle || !GatherShuffle;		bool IsSingleShuffle = !ExtractShuffle || GatherShuffles.empty();
bool IsIdentityShuffle =		bool IsIdentityShuffle =
(ExtractShuffle.value_or(TTI::SK_PermuteTwoSrc) ==		(ExtractShuffle.value_or(TTI::SK_PermuteTwoSrc) ==
TTI::SK_PermuteSingleSrc &&		TTI::SK_PermuteSingleSrc &&
none_of(ExtractMask, [&](int I) { return I >= EMSz; }) &&		none_of(ExtractMask, [&](int I) { return I >= EMSz; }) &&
ShuffleVectorInst::isIdentityMask(ExtractMask, EMSz)) ||		ShuffleVectorInst::isIdentityMask(ExtractMask, EMSz)) ||
(GatherShuffle.value_or(TTI::SK_PermuteTwoSrc) ==		(!GatherShuffles.empty() &&
TTI::SK_PermuteSingleSrc &&		all_of(GatherShuffles,
		[](const std::optional<TTI::ShuffleKind> &SK) {
		return SK.value_or(TTI::SK_PermuteTwoSrc) ==
		TTI::SK_PermuteSingleSrc;
		}) &&
none_of(Mask, [&](int I) { return I >= MSz; }) &&		none_of(Mask, [&](int I) { return I >= MSz; }) &&
ShuffleVectorInst::isIdentityMask(Mask, MSz));		ShuffleVectorInst::isIdentityMask(Mask, MSz));
bool EnoughConstsForShuffle =		bool EnoughConstsForShuffle =
IsSingleShuffle &&		IsSingleShuffle &&
(none_of(GatheredScalars,		(none_of(GatheredScalars,
[](Value *V) {		[](Value *V) {
return isa<UndefValue>(V) && !isa<PoisonValue>(V);		return isa<UndefValue>(V) && !isa<PoisonValue>(V);
}) ||		}) ||
… 159 lines not shown …	case Instruction::PHI: {
return NewPhi;		return NewPhi;
}		}

if (!VisitedBBs.insert(IBB).second) {		if (!VisitedBBs.insert(IBB).second) {
NewPhi->addIncoming(NewPhi->getIncomingValueForBlock(IBB), IBB);		NewPhi->addIncoming(NewPhi->getIncomingValueForBlock(IBB), IBB);
continue;		continue;
}		}

		// if (any_of(E->getOperand(i), [&](Value *V) {
		// auto *I = dyn_cast<Instruction>(V);
		// return I && I->getParent() == IBB;
		// }))
Builder.SetInsertPoint(IBB->getTerminator());		Builder.SetInsertPoint(IBB->getTerminator());
		// else
		// Builder.SetInsertPoint(IBB->getFirstNonPHIOrDbgOrLifetime());
Builder.SetCurrentDebugLocation(PH->getDebugLoc());		Builder.SetCurrentDebugLocation(PH->getDebugLoc());
Value *Vec = vectorizeOperand(E, i, /*PostponedPHIs=*/true);		Value *Vec = vectorizeOperand(E, i, /*PostponedPHIs=*/true);
NewPhi->addIncoming(Vec, IBB);		NewPhi->addIncoming(Vec, IBB);
}		}

assert(NewPhi->getNumIncomingValues() == PH->getNumIncomingValues() &&		assert(NewPhi->getNumIncomingValues() == PH->getNumIncomingValues() &&
"Invalid number of incoming values");		"Invalid number of incoming values");
return NewPhi;		return NewPhi;
… 647 lines not shown …	for (const TreeEntry *E : PostponedNodes) {
// If user is a PHI node, its vector code have to be inserted right before		// If user is a PHI node, its vector code have to be inserted right before
// block terminator. Since the node was delayed, there were some unresolved		// block terminator. Since the node was delayed, there were some unresolved
// dependencies at the moment when stab instruction was emitted. In a case		// dependencies at the moment when stab instruction was emitted. In a case
// when any of these dependencies turn out an operand of another PHI, coming		// when any of these dependencies turn out an operand of another PHI, coming
// from this same block, position of a stab instruction will become invalid.		// from this same block, position of a stab instruction will become invalid.
// The is because source vector that supposed to feed this gather node was		// The is because source vector that supposed to feed this gather node was
// inserted at the end of the block [after stab instruction]. So we need		// inserted at the end of the block [after stab instruction]. So we need
// to adjust insertion point again to the end of block.		// to adjust insertion point again to the end of block.
if (isa<PHINode>(UserI))		if (isa<PHINode>(UserI)) {
Builder.SetInsertPoint(PrevVec->getParent()->getTerminator());		// Insert before all users.
else		Instruction *InsertPt = PrevVec->getParent()->getTerminator();
		for (User *U : PrevVec->users()) {
		if (U == UserI)
		continue;
		auto *UI = dyn_cast<Instruction>(U);
		if (!UI || isa<PHINode>(UI) || UI->getParent() != InsertPt->getParent())
		continue;
		if (UI->comesBefore(InsertPt))
		InsertPt = UI;
		}
		Builder.SetInsertPoint(InsertPt);
		} else {
Builder.SetInsertPoint(PrevVec);		Builder.SetInsertPoint(PrevVec);
		}
Builder.SetCurrentDebugLocation(UserI->getDebugLoc());		Builder.SetCurrentDebugLocation(UserI->getDebugLoc());
Value *Vec = vectorizeTree(TE, /*PostponedPHIs=*/false);		Value *Vec = vectorizeTree(TE, /*PostponedPHIs=*/false);
PrevVec->replaceAllUsesWith(Vec);		PrevVec->replaceAllUsesWith(Vec);
PostponedValues.try_emplace(Vec).first->second.push_back(TE);		PostponedValues.try_emplace(Vec).first->second.push_back(TE);
// Replace the stub vector node, if it was used before for one of the		// Replace the stub vector node, if it was used before for one of the
// buildvector nodes already.		// buildvector nodes already.
auto It = PostponedValues.find(PrevVec);		auto It = PostponedValues.find(PrevVec);
if (It != PostponedValues.end()) {		if (It != PostponedValues.end()) {
… 4,433 lines not shown …
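
For readers following the `if (!GatherShuffles.empty())` hunk above, the per-register mask handling can be sketched on its own. This is only an illustration, not the patch's code: `sliceMaskForPart` and `kPoisonMaskElem` are hypothetical names, plain `std::vector` stands in for LLVM's `SmallVector`/`ArrayRef`, and the sketch assumes the mask length is an exact multiple of the number of registers. It mirrors the `SliceSize`/`VecMask` steps: start from an all-poison mask of full width, then copy in only the slice that belongs to register `Part`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-in for LLVM's PoisonMaskElem (-1), so the sketch builds
// without LLVM headers.
constexpr int kPoisonMaskElem = -1;

// Hypothetical helper: returns a full-width mask that keeps only the lanes
// belonging to register `Part`; every other lane is set to poison.
std::vector<int> sliceMaskForPart(const std::vector<int> &Mask,
                                  std::size_t NumParts, std::size_t Part) {
  // Assumption: the mask divides evenly into NumParts slices.
  assert(NumParts != 0 && Mask.size() % NumParts == 0 && Part < NumParts);
  const std::size_t SliceSize = Mask.size() / NumParts;
  std::vector<int> VecMask(Mask.size(), kPoisonMaskElem);
  // Copy the sub-mask for this register into the same position it occupies
  // in the full mask.
  std::copy_n(Mask.begin() + Part * SliceSize, SliceSize,
              VecMask.begin() + Part * SliceSize);
  return VecMask;
}
```

Each per-register mask produced this way can be handed to a separate shuffle step, which is how the hunk calls `ShuffleBuilder.add` once per non-empty entry list instead of once for the whole node.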

llvm/test/Transforms/SLPVectorizer/X86/multi-nodes-to-shuffle.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -passes=slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux -slp-threshold=-107 | FileCheck %s			; RUN: opt -passes=slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux -slp-threshold=-115 | FileCheck %s
	; RUN: opt -passes=slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux -slp-threshold=-107 -mattr=+avx2 | FileCheck %s			; RUN: opt -passes=slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux -slp-threshold=-115 -mattr=+avx2 | FileCheck %s --check-prefix=AVX2
				RKSimon (inline comment): Please can you add a AVX2 run to confirm that the shuffles are still being suitably split for legal 256-bit vectors?
				; RUN: opt -passes=slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux -slp-threshold=-115 | FileCheck % --check-prefix=SSE
				; RUN: opt -passes=slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux mattr=+avx2 -slp-threshold=-115 | FileCheck % --check-prefix=AVX

	define void @test(i64 %p0, i64 %p1, i64 %p2, i64 %p3) {			define void @test(i64 %p0, i64 %p1, i64 %p2, i64 %p3) {
	; CHECK-LABEL: @test(			; CHECK-LABEL: @test(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <4 x i64> poison, i64 [[P0:%.*]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <4 x i64> poison, i64 [[P0:%.*]], i32 0
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x i64> [[TMP0]], i64 [[P1:%.*]], i32 1			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x i64> [[TMP0]], i64 [[P1:%.*]], i32 1
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i64> [[TMP1]], i64 [[P2:%.*]], i32 2			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x i64> [[TMP1]], i64 [[P2:%.*]], i32 2
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i64> [[TMP2]], i64 [[P3:%.*]], i32 3			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i64> [[TMP2]], i64 [[P3:%.*]], i32 3
	; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i64> [[TMP3]], [[TMP3]]			; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i64> [[TMP3]], [[TMP3]]
	; CHECK-NEXT: [[TMP5:%.*]] = mul <4 x i64> [[TMP3]], [[TMP3]]			; CHECK-NEXT: [[TMP5:%.*]] = mul <4 x i64> [[TMP3]], [[TMP3]]
	; CHECK-NEXT: [[TMP6:%.*]] = sdiv <4 x i64> [[TMP3]], [[TMP3]]			; CHECK-NEXT: [[TMP6:%.*]] = sdiv <4 x i64> [[TMP3]], [[TMP3]]
	; CHECK-NEXT: [[TMP7:%.*]] = sub <4 x i64> [[TMP5]], [[TMP6]]			; CHECK-NEXT: [[TMP7:%.*]] = sub <4 x i64> [[TMP5]], [[TMP6]]
	; CHECK-NEXT: [[TMP8:%.*]] = shl <4 x i64> [[TMP4]], [[TMP7]]			; CHECK-NEXT: [[TMP8:%.*]] = shl <4 x i64> [[TMP4]], [[TMP7]]
	; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x i64> [[TMP4]], <4 x i64> [[TMP5]], <4 x i32> <i32 0, i32 4, i32 poison, i32 4>			; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x i64> [[TMP4]], <4 x i64> [[TMP5]], <4 x i32> <i32 0, i32 4, i32 poison, i32 poison>
	; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x i64> [[TMP9]], <4 x i64> [[TMP6]], <4 x i32> <i32 0, i32 1, i32 4, i32 3>			; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x i64> [[TMP6]], <4 x i64> [[TMP5]], <4 x i32> <i32 poison, i32 poison, i32 0, i32 4>
	; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <4 x i64> [[TMP4]], <4 x i64> [[TMP5]], <4 x i32> <i32 1, i32 5, i32 poison, i32 5>			; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <4 x i64> [[TMP9]], <4 x i64> [[TMP10]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>
	; CHECK-NEXT: [[TMP12:%.*]] = shufflevector <4 x i64> [[TMP11]], <4 x i64> [[TMP6]], <4 x i32> <i32 0, i32 1, i32 5, i32 3>			; CHECK-NEXT: [[TMP12:%.*]] = shufflevector <4 x i64> [[TMP4]], <4 x i64> [[TMP5]], <4 x i32> <i32 1, i32 5, i32 poison, i32 poison>
	; CHECK-NEXT: [[TMP13:%.*]] = or <4 x i64> [[TMP10]], [[TMP12]]			; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <4 x i64> [[TMP6]], <4 x i64> [[TMP5]], <4 x i32> <i32 poison, i32 poison, i32 1, i32 5>
	; CHECK-NEXT: [[TMP14:%.*]] = trunc <4 x i64> [[TMP13]] to <4 x i32>			; CHECK-NEXT: [[TMP14:%.*]] = shufflevector <4 x i64> [[TMP12]], <4 x i64> [[TMP13]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>
				; CHECK-NEXT: [[TMP15:%.*]] = or <4 x i64> [[TMP11]], [[TMP14]]
				; CHECK-NEXT: [[TMP16:%.*]] = trunc <4 x i64> [[TMP15]] to <4 x i32>
	; CHECK-NEXT: br label [[BB:%.*]]			; CHECK-NEXT: br label [[BB:%.*]]
	; CHECK: bb:			; CHECK: bb:
	; CHECK-NEXT: [[TMP15:%.*]] = phi <4 x i32> [ [[TMP16:%.*]], [[BB]] ], [ [[TMP14]], [[ENTRY:%.*]] ]			; CHECK-NEXT: [[TMP17:%.*]] = phi <4 x i32> [ [[TMP18:%.*]], [[BB]] ], [ [[TMP16]], [[ENTRY:%.*]] ]
	; CHECK-NEXT: [[TMP16]] = trunc <4 x i64> [[TMP8]] to <4 x i32>			; CHECK-NEXT: [[TMP18]] = trunc <4 x i64> [[TMP8]] to <4 x i32>
	; CHECK-NEXT: br label [[BB]]			; CHECK-NEXT: br label [[BB]]
	;			;
				; AVX2-LABEL: @test(
				; AVX2-NEXT: entry:
				; AVX2-NEXT: [[TMP0:%.*]] = insertelement <4 x i64> poison, i64 [[P0:%.*]], i32 0
				; AVX2-NEXT: [[TMP1:%.*]] = insertelement <4 x i64> [[TMP0]], i64 [[P1:%.*]], i32 1
				; AVX2-NEXT: [[TMP2:%.*]] = insertelement <4 x i64> [[TMP1]], i64 [[P2:%.*]], i32 2
				; AVX2-NEXT: [[TMP3:%.*]] = insertelement <4 x i64> [[TMP2]], i64 [[P3:%.*]], i32 3
				; AVX2-NEXT: [[TMP4:%.*]] = add <4 x i64> [[TMP3]], [[TMP3]]
				; AVX2-NEXT: [[TMP5:%.*]] = mul <4 x i64> [[TMP3]], [[TMP3]]
				; AVX2-NEXT: [[TMP6:%.*]] = sdiv <4 x i64> [[TMP3]], [[TMP3]]
				; AVX2-NEXT: [[TMP7:%.*]] = sub <4 x i64> [[TMP5]], [[TMP6]]
				; AVX2-NEXT: [[TMP8:%.*]] = shl <4 x i64> [[TMP4]], [[TMP7]]
				; AVX2-NEXT: [[TMP9:%.*]] = shufflevector <4 x i64> [[TMP4]], <4 x i64> [[TMP5]], <4 x i32> <i32 0, i32 4, i32 poison, i32 4>
				; AVX2-NEXT: [[TMP10:%.*]] = shufflevector <4 x i64> [[TMP9]], <4 x i64> [[TMP6]], <4 x i32> <i32 0, i32 1, i32 4, i32 3>
				; AVX2-NEXT: [[TMP11:%.*]] = shufflevector <4 x i64> [[TMP4]], <4 x i64> [[TMP5]], <4 x i32> <i32 1, i32 5, i32 poison, i32 5>
				; AVX2-NEXT: [[TMP12:%.*]] = shufflevector <4 x i64> [[TMP11]], <4 x i64> [[TMP6]], <4 x i32> <i32 0, i32 1, i32 5, i32 3>
				; AVX2-NEXT: [[TMP13:%.*]] = or <4 x i64> [[TMP10]], [[TMP12]]
				; AVX2-NEXT: [[TMP14:%.*]] = trunc <4 x i64> [[TMP13]] to <4 x i32>
				; AVX2-NEXT: br label [[BB:%.*]]
				; AVX2: bb:
				; AVX2-NEXT: [[TMP15:%.*]] = phi <4 x i32> [ [[TMP16:%.*]], [[BB]] ], [ [[TMP14]], [[ENTRY:%.*]] ]
				; AVX2-NEXT: [[TMP16]] = trunc <4 x i64> [[TMP8]] to <4 x i32>
				; AVX2-NEXT: br label [[BB]]
				;
	entry:			entry:
	%a0 = add i64 %p0, %p0			%a0 = add i64 %p0, %p0
	%a1 = add i64 %p1, %p1			%a1 = add i64 %p1, %p1
	%a2 = add i64 %p2, %p2			%a2 = add i64 %p2, %p2
	%a3 = add i64 %p3, %p3			%a3 = add i64 %p3, %p3
	%m0 = mul i64 %p0, %p0			%m0 = mul i64 %p0, %p0
	%m1 = mul i64 %p1, %p1			%m1 = mul i64 %p1, %p1
	%m2 = mul i64 %p2, %p2			%m2 = mul i64 %p2, %p2
	… 34 lines not shown …

Revision Contents

Diff 557894
