Currently when building gather/buildvector node, we try to build nodes
shuffles without taking into account separate vector registers. WE can
improve final codegen and the whole vectorization process by including
this info into the analysis and the vector code emission, allows to emit
better vectorized code.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2515 | Update the \returns description |
Hi Alexey,
I tried to measure performance impact of this patch. Observed ~10% regression on nnet_test spec from coremark-pro test suite. (on avx2 with -flto).
Could you provide more details what is the cause? The patch itself just improves shuffles emissions, most probably there are some problems with the TTI cost model.
command to reproduce: opt -passes=slp-vectorizer -mtriple=x86_64 -mcpu=core-avx2 -S test.ll
BTW. which gcc version you use to build? I've experienced problem building with both 7.5.0 and 8.5.0 (I did not try any later versions) while there were no issues w/o patch.
Here is fail log:
<dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3573:24: error: declaration of llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI [-fpermissive]
TargetTransformInfo *TTI; ^~~
In file included from <dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:47:0:
<dir>/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:205:29: error: changes meaning of TTI from typedef class llvm::TargetTransformInfo llvm::TTI [-fpermissive]
typedef TargetTransformInfo TTI;
^~~
HMM, llvm-mca shows that the code after patch is better than the code before (with the patch the throughput is 4.0, without - 5.0)
https://godbolt.org/z/zvzTKedPd
BTW. which gcc version you use to build? I've experienced problem building with both 7.5.0 and 8.5.0 (I did not try any later versions) while there were no issues w/o patch.
Here is fail log:
<dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3573:24: error: declaration of llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI [-fpermissive]TargetTransformInfo *TTI; ^~~In file included from <dir>/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:47:0:
<dir>/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:205:29: error: changes meaning of TTI from typedef class llvm::TargetTransformInfo llvm::TTI [-fpermissive]
typedef TargetTransformInfo TTI;^~~
It is most probably because of the using TTI:: in the function declaration, will fix this.
It puzzled me too actually. It was the only vectorization candidate that changed behavior due to the patch in the test. I could not find any obvious flaw in handling subsequent vectorization sequences.
Eyeballing of generated code before/after also did not reveal any obvious issues.
I fixed cost estimation, now it results in the same translation (I added your test, no changes)
Rebase, ping!!!
Required to continue the work on cost model unification and non-power-2.
Why are we doing this in SLP and not TTI getShuffleCost? It looks like TTI should be doing a better job of recognizing how it should split the shuffle mask and determine the shuffle kinds per split?
I thought about this. Unfortunately, TTI is too late here + it does not have idion of TreeEntry.
What this function does? It scans the list of scalars and then checks, if it can build a shuffle, using previously build TreeEntries (use previously build vector to build the new one) to reduce number of inserts.
It supports only 1- and 2-vector shuffles, as 3- and more vector shuffles may not be effective for some targets.
Why cannot it be done in TTI? 1. TTI does not know about TreeEntries. 2. Without this knowledge it won't help in complex situations.
Assume we have 3 already built TreeEntries: <abcd>, <efgh>, <ijkl>. But the actual vector register includes only 2 elements (i.e. each this vector uses 2 vector registers). And we trying to build a new buildvector (NeedToGather) entry with elements <acih>.
Original function will help to translate this gather node into this:
%s1 = shuffle <abcd>, <efgh>, <0,2,poison,7> %v = insertelement %s1, i, 2 - might be very cost ineffective + there might be several such inserts, increasing the overall cost of the buildvector/gather node.
This reworked function can do this:
%s1 = shuffle <abcd>, poison, <0,2> %s2 = shuffle <efgh>, <ijkl>, <4,3> %v = shuffle %s1, %s2, <0,1,2,3> - free insert subvector shuffle
TTI won't be able to transform the first variant to the second one without knowing that there is <ijkl> vector already exists. It requires teaching it about the whole SLP graph idiom.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
9177–9179 | Stylistic suggestion here. This function is quite complicated, and following all the changes through are tricky. I *think* this is just performing the same operation for each sub-range (divided by register). If so, I'd suggest leaving the function alone, and introducing a wrapper which just calls the old version with a sub-range of the VLs, and builds up the vector results. If this works here, then applying the same basic idea throughout the patch would make it much easier to follow and review. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
9177–9179 | Yes, it splits the multi-register vector into several one-register vectors and just performs the same analysis, as before, to avoid long multi-vector shuffles. Will try to move the previous logic into a separate function. |
I've done my best, but I'm really struggling to understand a lot of this patch :(
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
7260 | Add method description, also why is it called estimateNodesPermuteCost when it doesn't seem to use/return costs? | |
7269 | This logic makes very little sense to me | |
7830 | Would we be better off merging the 1TE + 2TE add() methods? |
Generally speaking, it does the same as before, just splits the node into vector registers and tries to do per register shuffles instead of per node shuffles. It may reduce final number of shuffles, since we do not need to permute all the register in the node, just some of them. The single node may result in several vector registers, no need to permute all of them.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
7260 | It adds the cost to final cost estimation, see e.g. lines 7169 and 7176. | |
7269 | If we decided to permute E1 and (possibly) E2 already, and we found that we again need to permute the same nodes, no need to do it immediately, instead need to include the sub-Mask into CommonMask and do it later to avoid double shuffling of the same nodes. | |
7830 | I'll try to rework this function to avoid it. |
llvm/test/Transforms/SLPVectorizer/X86/multi-nodes-to-shuffle.ll | ||
---|---|---|
1–3 | Please can you add a AVX2 run to confirm that the shuffles are still being suitably split for legal 256-bit vectors? ; RUN: opt -passes=slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux -slp-threshold=-115 | FileCheck % --check-prefix=SSE ; RUN: opt -passes=slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux mattr=+avx2 -slp-threshold=-115 | FileCheck % --check-prefix=AVX |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
7007 | I'm not sure I understand the purpose of this flag. Could you please describe? | |
7275 | InVectors.size() >=1 is same as !InVectors.empty() Having this check assumes that InVectors can be empty here, but at the same time InVectors.front() below (at line 7267) assumes it is not empty. So which one is the right assumption? | |
7741–7742 | perhaps worth renaming to GatherShuffles since it becomes an array of optionals | |
7743 | This ScalarTy declaration hides one at the function scope. | |
7749 | ScalarTy ? | |
7759 | VecTy is already defined at 7674. Can just reuse it. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
7007 | Would this sound more accurate: "While set, we are still trying ..." ? | |
7275 |
What about the assumption? I believe "!InVectors.empty()" is always true. That is why I asked the question about assumption. This method is called twice in the patch and both have early return when InVectors is empty. | |
9523 | Typo: "poistive" | |
10420–10421 | nit: suggest hoisting "Entries.front().front()" subexpr and assign to dedicated local variable of type "const TreeEntry *" |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
7275 | Yes, this check can be dropped here. |
Hi, this reland causes crash: https://github.com/llvm/llvm-project/commit/196d154ab7a76e8ccb11addf61ff53387e397130.
ld.lld: /b/s/w/ir/cache/builder/src/third_party/llvm/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7602: void llvm::slpvectorizer::BoUpSLP::ShuffleCostEstimator::add(const TreeEntry &, ArrayRef<int>): Assertion `NumParts > 0 && NumParts < Mask.size() && "Expected positive number of registers."' failed.
This only happens when using -fprofile-use=. I'll try to come up with a smaller repro on Monday.
Stack trace:
clang: /usr/local/google/home/zequanwu/workspace/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7602: void llvm::slpvectorizer::BoUpSLP::ShuffleCostEstimator::add(const TreeEntry &, ArrayRef<int>): Assertion `NumParts > 0 && NumParts < Mask.size() && "Expected positive number of registers."' failed. PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script. Stack dump: 0. Program arguments: /usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang -MMD -MF obj/third_party/boringssl/boringssl/curve25519.o.d -DUSE_UDEV -DUSE_AURA=1 -DUSE_GLIB=1 -DUSE_OZONE=1 -DOFFICIAL_BUILD -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -DNO_UNWIND_TABLES -D_GNU_SOURCE -D_LIBCPP_ENABLE_SAFE_MODE=1 -D_LIBCPP_DISABLE_VISIBILITY_ANNOTATIONS -D_LIBCXXABI_DISABLE_VISIBILITY_ANNOTATIONS -DCR_LIBCXX_REVISION=9b27200e8219bbc9e05ba8435456554a6e35d67d -DCR_SYSROOT_KEY=20230611T210420Z-2 -DNDEBUG -DNVALGRIND -DDYNAMIC_ANNOTATIONS_ENABLED=0 -DBORINGSSL_IMPLEMENTATION -D_BORINGSSL_LIBPKI_ -DBORINGSSL_ALLOW_CXX_RUNTIME -DBORINGSSL_NO_STATIC_INITIALIZER -DOPENSSL_SMALL -I../.. -Igen -I../../buildtools/third_party/libc++ -I../../third_party/boringssl/src/include -fno-delete-null-pointer-checks -fno-ident -fno-strict-aliasing -fstack-protector -fno-unwind-tables -fno-asynchronous-unwind-tables -fPIC -pthread -fcolor-diagnostics -fmerge-all-constants -fcrash-diagnostics-dir=../../tools/clang/crashreports -mllvm -instcombine-lower-dbg-declare=0 -mllvm -split-threshold-for-reg-with-hint=0 -ffp-contract=off -fcomplete-member-pointers -m64 -msse3 -ffile-compilation-dir=. -no-canonical-prefixes -ftrivial-auto-var-init=pattern -fno-omit-frame-pointer -gdwarf-4 -g2 -gdwarf-aranges -ggnu-pubnames -Xclang -fuse-ctor-homing -fprofile-use=../../chrome/build/pgo_profiles/chrome-linux-main-1698407887-6c91c384bff1766cf9949635c133936a2feea1bd.profdata -Wno-profile-instr-unprofiled -Wno-profile-instr-out-of-date -Wno-backend-plugin -mllvm -enable-ext-tsp-block-placement=1 -fvisibility=hidden -Wheader-hygiene -Wstring-conversion -Wtautological-overlap-compare -Wall -Wno-unused-variable -Wno-c++11-narrowing -Wno-unused-but-set-variable -Wno-misleading-indentation -Wno-missing-field-initializers -Wno-unused-parameter -Wno-psabi -Wloop-analysis -Wno-unneeded-internal-declaration -Wenum-compare-conditional -Wno-ignored-pragma-optimize -Wno-deprecated-builtins -Wno-bitfield-constant-conversion -Wno-deprecated-this-capture -Wno-invalid-offsetof -Wno-vla-extension -Wno-thread-safety-reference-return -Wno-delayed-template-parsing-in-cxx20 -Werror -O2 -fdata-sections -ffunction-sections -fno-unique-section-names -fno-math-errno -std=c11 --sysroot=../../build/linux/debian_bullseye_amd64-sysroot -c ../../third_party/boringssl/src/crypto/curve25519/curve25519.c -o obj/third_party/boringssl/boringssl/curve25519.o 1. <eof> parser at end of file 2. Optimizer #0 0x0000556e5dfa28c8 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x7cbf8c8) #1 0x0000556e5dfa048e llvm::sys::RunSignalHandlers() (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x7cbd48e) #2 0x0000556e5df0a786 CrashRecoverySignalHandler(int) CrashRecoveryContext.cpp:0:0 #3 0x00007f92cb85a510 (/lib/x86_64-linux-gnu/libc.so.6+0x3c510) #4 0x00007f92cb8a80fc __pthread_kill_implementation ./nptl/pthread_kill.c:44:76 #5 0x00007f92cb85a472 raise ./signal/../sysdeps/posix/raise.c:27:6 #6 0x00007f92cb8444b2 abort ./stdlib/abort.c:81:7 #7 0x00007f92cb8443d5 _nl_load_domain ./intl/loadmsgcat.c:1177:9 #8 0x00007f92cb8533a2 (/lib/x86_64-linux-gnu/libc.so.6+0x353a2) #9 0x0000556e5f69792c llvm::slpvectorizer::BoUpSLP::ShuffleCostEstimator::add(llvm::slpvectorizer::BoUpSLP::TreeEntry const&, llvm::ArrayRef<int>) SLPVectorizer.cpp:0:0 #10 0x0000556e5f692f38 llvm::slpvectorizer::BoUpSLP::getEntryCost(llvm::slpvectorizer::BoUpSLP::TreeEntry const*, llvm::ArrayRef<llvm::Value*>, llvm::SmallPtrSetImpl<llvm::Value*>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93aff38) #11 0x0000556e5f69c33b llvm::slpvectorizer::BoUpSLP::getTreeCost(llvm::ArrayRef<llvm::Value*>) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93b933b) #12 0x0000556e5f6c0a1a llvm::SLPVectorizerPass::tryToVectorizeList(llvm::ArrayRef<llvm::Value*>, llvm::slpvectorizer::BoUpSLP&, bool) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93dda1a) #13 0x0000556e5f6c1a2c llvm::SLPVectorizerPass::tryToVectorize(llvm::Instruction*, llvm::slpvectorizer::BoUpSLP&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93dea2c) #14 0x0000556e5f6c4be2 llvm::SLPVectorizerPass::tryToVectorize(llvm::ArrayRef<llvm::WeakTrackingVH>, llvm::slpvectorizer::BoUpSLP&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93e1be2) #15 0x0000556e5f6c4b2f llvm::SLPVectorizerPass::vectorizeRootInstruction(llvm::PHINode*, llvm::Instruction*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93e1b2f) #16 0x0000556e5f6bbc2a llvm::SLPVectorizerPass::vectorizeChainsInBlock(llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93d8c2a) #17 0x0000556e5f6b96fc llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93d66fc) #18 0x0000556e5f6b8c9e llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x93d5c9e) #19 0x0000556e5f37319d llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) PassBuilder.cpp:0:0 #20 0x0000556e5d9f2a24 llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x770fa24) #21 0x0000556e5be0377d llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) AMDGPUTargetMachine.cpp:0:0 #22 0x0000556e5d9f6cb3 llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x7713cb3) #23 0x0000556e5be0351d llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) AMDGPUTargetMachine.cpp:0:0 #24 0x0000556e5d9f1c04 llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x770ec04) #25 0x0000556e5e74f5b5 (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline(clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>&, std::unique_ptr<llvm::ToolOutputFile, std::default_delete<llvm::ToolOutputFile>>&) BackendUtil.cpp:0:0 #26 0x0000556e5e746102 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8463102) #27 0x0000556e5ec426b1 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) CodeGenAction.cpp:0:0 #28 0x0000556e60390b56 clang::ParseAST(clang::Sema&, bool, bool) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0xa0adb56) #29 0x0000556e5eb59f7f clang::FrontendAction::Execute() (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8876f7f) #30 0x0000556e5eacb01d clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x87e801d) #31 0x0000556e5ec3ad4e clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8957d4e) #32 0x0000556e5ba36012 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x5753012) #33 0x0000556e5ba323ed ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&, llvm::ToolContext const&) driver.cpp:0:0 #34 0x0000556e5e92b9b9 void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const::$_0>(long) Job.cpp:0:0 #35 0x0000556e5df0a4c6 llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x7c274c6) #36 0x0000556e5e92b0c2 clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x86480c2) #37 0x0000556e5e8e6107 clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8603107) #38 0x0000556e5e8e6647 clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&, bool) const (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x8603647) #39 0x0000556e5e9065e9 clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x86235e9) #40 0x0000556e5ba31836 clang_main(int, char**, llvm::ToolContext const&) (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x574e836) #41 0x0000556e5ba42871 main (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x575f871) #42 0x00007f92cb8456ca __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:74:3 #43 0x00007f92cb845785 call_init ./csu/../csu/libc-start.c:128:20 #44 0x00007f92cb845785 __libc_start_main ./csu/../csu/libc-start.c:347:5 #45 0x0000556e5ba2e8e1 _start (/usr/local/google/home/zequanwu/workspace/llvm-project/build/cmake/bin/clang+0x574b8e1) clang: error: clang frontend command failed with exit code 134 (use -v to see invocation) clang version 18.0.0 (git@github.com:ZequanWu/llvm-project.git 196d154ab7a76e8ccb11addf61ff53387e397130)
Update the \returns description