Improved/fixed cost modeling for shuffles by providing masks, improved
cost model for non-identity insertelements.
Details
- Reviewers
RKSimon anton-afanasyev dtemirbulatov - Commits
- rG2faacf61a50e: [SLP]Improve shuffles cost estimation where possible.
rGcac60940b771: [SLP]Improve shuffles cost estimation where possible.
rG9980c9971892: [SLP]Improve shuffles cost estimation where possible.
rGfd5a6ce9dcb7: [SLP]Improve shuffles cost estimation where possible.
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Time | Test | |
---|---|---|
60,090 ms | x64 debian > libFuzzer.libFuzzer::fuzzer-leak.test | |
60,040 ms | x64 debian > libFuzzer.libFuzzer::large.test |
Event Timeline
I'm not certain I've entirely following this, but it looks like its just trying to account for some weaknesses in getScalarizationOverhead()? Might we be better off trying to improve that?
Actually, this is a bug fix. We're not quite correct in some cases, the patch fixesbthe cost of some of the shuffles.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5580–5583 | I'm not sure I understand why this would be a SK_Select. That is a bit of a X86 special as far as I understand and doesn't always correlate well to other architectures. Why is the Mask missing too? That might be enough to help avoid the regressions if it was re-added. | |
llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll | ||
1335 ↗ | (On Diff #431724) | This seems worse I'm afraid - I don't think it should be keeping all these individual loads that are inserted. The insert_subvector cost should be low enough for them to be profitable to vectorize undef AArch64 - they are just a s register load. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5580–5583 |
But I'll check it. | |
llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll | ||
1335 ↗ | (On Diff #431724) | I'll try to improve cost estimation there with insert_subvector cost. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5580–5583 | AArch64 (and most other architectures AFAIU) do not have SK_Select shuffles, so is not a lot better than SK_PermuteTwoSrc. A Mask can help to improve the cost though, if the backend can come up with something more accurate for it. I'm surprised this is not a SK_InsertSubvector with adjacent elements though - that seems like the most natural fit, unless I'm missing how this works. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5580–5583 | Yep, you right, it must be an InserSubvector kind, changed it to Select because some cost for InsertSubvector were not implemented. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5580–5583 | was this on x86 / aarch64 or some other target? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5580–5583 | x86, IIRC. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5580–5583 | if you can email a test case I'll take a look |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
5580–5583 | Some of the lit test, do not remember already. Most of the cases were fixed already, I believe. |
Seems that the relanded 9980c9971892378ea82475e000de8df210a58e69 caused wasm test failures for Halide.
I'll try to find folks to give a smaller reproduce (I myself know nearly nothing about Halide), but here is the stack trace in case you notice something immediate.
# related to correctness_interleave in wasm-32-wasmrt-wasm_simd128-wasm_signext-wasm_sat_float_to_int mode assert.h assertion failed at llvm/CodeGen/BasicTTIImpl.h:152 in llvm::InstructionCost llvm::BasicTTIImplBase<llvm::WebAssemblyTTIImpl>::getInsertSubvectorOverhead(llvm::VectorType *, int, llvm::FixedVectorType *) [T = llvm::WebAssemblyTTIImpl]: (!isa<FixedVectorType>(VTy) || (Index + NumSubElts) <= (int)cast<FixedVectorType>(VTy)->getNumElements()) && "SK_InsertSubvector index out of range"
*** Check failure stack trace: *** @ 0x5606259259e4 absl::log_internal::LogMessage::SendToLog() @ 0x56062592518a absl::log_internal::LogMessage::Flush() @ 0x560625925dc9 absl::log_internal::LogMessageFatal::~LogMessageFatal() @ 0x560625905ce4 __assert_fail @ 0x5606232d2f51 llvm::BasicTTIImplBase<>::getInsertSubvectorOverhead() @ 0x5606232ce898 llvm::BasicTTIImplBase<>::getShuffleCost() @ 0x560624ce91c5 llvm::TargetTransformInfo::getShuffleCost() @ 0x56062477966b llvm::slpvectorizer::BoUpSLP::getEntryCost() @ 0x56062477ed63 llvm::slpvectorizer::BoUpSLP::getTreeCost() @ 0x560624797c28 llvm::SLPVectorizerPass::tryToVectorizeList() @ 0x56062479c4d8 llvm::SLPVectorizerPass::vectorizeInsertElementInst() @ 0x56062479c67c llvm::SLPVectorizerPass::vectorizeSimpleInstructions() @ 0x560624793e82 llvm::SLPVectorizerPass::vectorizeChainsInBlock() @ 0x560624791dbc llvm::SLPVectorizerPass::runImpl() @ 0x560624791168 llvm::SLPVectorizerPass::run() @ 0x5606236a4b92 llvm::detail::PassModel<>::run() @ 0x56062512c0b5 llvm::PassManager<>::run() @ 0x560622c278b2 llvm::detail::PassModel<>::run() @ 0x5606251300e0 llvm::ModuleToFunctionPassAdaptor::run() @ 0x560621fc84f2 llvm::detail::PassModel<>::run() @ 0x56062512b1de llvm::PassManager<>::run() @ 0x560621f9d82b Halide::Internal::CodeGen_LLVM::optimize_module() @ 0x560621f9ae1a Halide::Internal::CodeGen_LLVM::finish_codegen() @ 0x560621f9bdf5 Halide::Internal::CodeGen_LLVM::compile() @ 0x56062250c909 Halide::Internal::WasmModuleContents::WasmModuleContents() @ 0x5606225118f2 Halide::Internal::WasmModule::compile() @ 0x560622155f52 Halide::Pipeline::compile_jit() @ 0x5606221584ea Halide::Pipeline::realize() @ 0x560622157f21 Halide::Pipeline::realize() @ 0x560622157701 Halide::Pipeline::realize() @ 0x560621ef6ac5 Halide::Func::realize() @ 0x560621ec01ad main @ 0x7f33b32b08d3 __libc_start_main @ 0x560621ebe90a _start
I also bisected a bunch of failing assertions on ARM and aarch64 to this commit. Here's a reduced reproducer:
$ cat repro.c char *a; long b; int c() { int d, e = d = 0; for (; d < 8; d++) e += a[d] - a[b ^ d] - a[b] >> -a[d] >> 1; return e; } $ clang -target aarch64-linux-gnu -c repro.c -O3 clang: ../include/llvm/CodeGen/BasicTTIImpl.h:149: llvm::InstructionCost llvm::BasicTTIImplBase<T>::getInsertSubvectorOverhead(llvm::VectorType*, int, llvm::FixedVectorType*) [with T = llvm::AArch64TTIImpl]: Assertion `(!isa<FixedVectorType>(VTy) || (Index + NumSubElts) <= (int)cast<FixedVectorType>(VTy)->getNumElements()) && "SK_InsertSubvector index out of range"' failed.
Looks like cac60940b771a0685d058a5b471c84cea05fdc46 causes a miscompile again.
This can be reproduced at this commit or current head.
The commit causes https://github.com/pytorch/cpuinfo src/x86/isa.c:cpuinfo_x86_detect_isa to be miscompiled at least in -Os -fsanitize=memory -march=haswell -fsanitize-memory-param-retval -fsanitize-memory-use-after-dtor mode.
For the test test/init.cc, the address of isa soon incorrectly becomes zero:
(gdb) p &isa $3 = (struct cpuinfo_x86_isa *) 0x0
Then the test will crash here isa.sysenter = !!(basic_info.edx & UINT32_C(0x00000800)); since the address of isa.sysenter is incorrectly considered as 0x2.
cac60940b771a0685d058a5b471c84cea05fdc46 is not fixed by f96fbc5d96869e7c75f64dacab6e4894ed291530 [SLP]Fix a crash when insert subvector is out of range..
I am sorry as I am preparing a revert.
Here is -mllvm -print-changed -mllvm -print-module-scope output before SLPVectorizer: https://gist.github.com/MaskRay/23f0db50e136127fda1b4f83db2488da
Double checked the patch and the results. The patch itself does not change the vectorization, just adjusts the cost model estimation. Some of the buildvector sequences are not profitable for re-vectorization after this patch, nothing else. Plus, the code after this patch results in the same code, being applied to the compiler without the patch, i.e. the code transformations are the same, just different vectorization.
Maybe the debug info is corrupted somehow? Or there are other effects?
I have added IR before/after SLPVectorizer, with and without the two commits, and generated assembly to https://gist.github.com/MaskRay/23f0db50e136127fda1b4f83db2488da
Hope they are useful.
Maybe the debug info is corrupted somehow? Or there are other effects?
The codegen is corrupted. I provide debug info just in case it helps analyze the problem.
Try to run something like:
opt -slp-vectorizer -S ./post-slp-bad.ll -o post-slp-after.ll
(opt without these patches) and compare post-slp-after.ll with post-slp-good.ll. It will be the same. The only thing the patch does is prevents some insertelement vectorization in your case, nothing else. That's why the result will be same. Most probably it just reveals the bug somewhere in the compiler, maybe in lowering.
Yes, I suspect that this change just exposed a bug in the X86 backend handling inline asm. I have added some notes on D128461 and am keeping investigation.
Heads up: I am seeing a clang crash on arm with this commit:
commit 2faacf61a50e7f23fd10927cbbb98c59799bfcd0
Author: Alexey Bataev <a.bataev@outlook.com>
Date: Thu Dec 9 10:34:08 2021 -0800
CommitDate: Fri Jun 24 09:28:01 2022 -0700
[SLP]Improve shuffles cost estimation where possible.
I am trying to create a reduced test case.
Still crashes on trunk.
C-reduce Test case:
typedef __INT64_TYPE__ int64_t; int sbr_autocorrelate_c_x_i; void phiautocorr_calc(int64_t ); void sbr_autocorrelate_c_x(void) { int(*x)[2] = sbr_autocorrelate_c_x; int64_t accu_re , accu_im = 0; for (; sbr_autocorrelate_c_x_i; sbr_autocorrelate_c_x_i++) { accu_re += x[sbr_autocorrelate_c_x_i][0] * x[sbr_autocorrelate_c_x_i + 2][0]; accu_re += x[sbr_autocorrelate_c_x_i][1] * x[sbr_autocorrelate_c_x_i + 2][1]; accu_im += x[sbr_autocorrelate_c_x_i][0] * x[sbr_autocorrelate_c_x_i + 2][1]; accu_im -= x[sbr_autocorrelate_c_x_i][1] * x[sbr_autocorrelate_c_x_i + 2][0]; } phiautocorr_calc(accu_im); phiautocorr_calc(accu_re); }
Crashes with
clang -Os -c test.c --target=armv7a-linux-gnueabihf -mfpu=neon -Wno-error
llvm/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7744: llvm::Value *llvm::slpvectorizer::BoUpSLP::createBuildVector(ArrayRef<llvm::Value *>): Assertion `any_of(VectorizableTree, [VL](const std::unique_ptr<TreeEntry> &TE) { return TE->State == TreeEntry::NeedToGather && TE->isSame(VL); }) && "Non-matching gather node."' failed.
Thanks, there is PR56251 already. The patch is not a cause of the crash, it has an extra assert that reveals internal bug in SLP vectorizer. Will send a patch with a fix in a few minutes.
I saw another crash on chromium builds - not sure if this is related to the previous mentioned crash but here's a creduced repro:
struct f { float g; float h; }; struct j { j() = default; j(float k, float l) : c(k), d(l) {} j operator*(j k) const { return j(k.a + c, b + k.d); } float a = 1.0f; float b = 0.0f; float c = 0.0f; float d = 1.0f; float e = 0.0f; }; struct m { j n() const; f o; j p; }; j m::n() const { if (o.g || o.h) return j(); j a; return p * a; }
build with
clang -cc1 -O2 -vectorize-slp -emit-llvm -fno-delete-null-pointer-checks t.cpp
I'm not sure I understand why this would be a SK_Select. That is a bit of a X86 special as far as I understand and doesn't always correlate well to other architectures. Why is the Mask missing too? That might be enough to help avoid the regressions if it was re-added.