This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve shuffles cost estimation where possible.
ClosedPublic

Authored by ABataev on Dec 9 2021, 12:20 PM.

Diff Detail

Event Timeline

ABataev created this revision.Dec 9 2021, 12:20 PM
ABataev requested review of this revision.Dec 9 2021, 12:20 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2021, 12:20 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2022, 10:02 AM
Herald added a subscriber: vporpo. · View Herald Transcript

I'm not certain I've entirely following this, but it looks like its just trying to account for some weaknesses in getScalarizationOverhead()? Might we be better off trying to improve that?

I'm not certain I've entirely following this, but it looks like its just trying to account for some weaknesses in getScalarizationOverhead()? Might we be better off trying to improve that?

Actually, this is a bug fix. We're not quite correct in some cases, the patch fixesbthe cost of some of the shuffles.

RKSimon accepted this revision.May 26 2022, 7:05 AM

LGTM

This revision is now accepted and ready to land.May 26 2022, 7:05 AM
dmgreen added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844

I'm not sure I understand why this would be a SK_Select. That is a bit of a X86 special as far as I understand and doesn't always correlate well to other architectures. Why is the Mask missing too? That might be enough to help avoid the regressions if it was re-added.

llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
1335 ↗(On Diff #431724)

This seems worse I'm afraid - I don't think it should be keeping all these individual loads that are inserted. The insert_subvector cost should be low enough for them to be profitable to vectorize undef AArch64 - they are just a s register load.

ABataev added inline comments.May 26 2022, 8:35 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844
  1. It is a permuatation of 2 sub-vectors: the root of the buildvector and a subvector after the vectorization. Since it was a buildvector, the compiler selects elements from the root and corresponding elements from the resulting vector.
  1. Mask is not required, if TTI::SK_Select is used, mask is used only with SK_PermuteSingleSrc and SK_PermuteTwoSrc.

But I'll check it.

llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
1335 ↗(On Diff #431724)

I'll try to improve cost estimation there with insert_subvector cost.

dmgreen added inline comments.May 26 2022, 8:48 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844

AArch64 (and most other architectures AFAIU) do not have SK_Select shuffles, so is not a lot better than SK_PermuteTwoSrc. A Mask can help to improve the cost though, if the backend can come up with something more accurate for it.

I'm surprised this is not a SK_InsertSubvector with adjacent elements though - that seems like the most natural fit, unless I'm missing how this works.

RKSimon requested changes to this revision.May 26 2022, 9:20 AM

reopen to address @dmgreen's comments

This revision now requires changes to proceed.May 26 2022, 9:20 AM
ABataev added inline comments.May 26 2022, 9:20 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844

Yep, you right, it must be an InserSubvector kind, changed it to Select because some cost for InsertSubvector were not implemented.

RKSimon added inline comments.May 26 2022, 9:54 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844

was this on x86 / aarch64 or some other target?

ABataev added inline comments.May 26 2022, 9:57 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844

x86, IIRC.

RKSimon added inline comments.May 26 2022, 9:58 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844

if you can email a test case I'll take a look

ABataev updated this revision to Diff 432337.May 26 2022, 11:24 AM

Address comments.

ABataev added inline comments.May 26 2022, 11:26 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844

Some of the lit test, do not remember already. Most of the cases were fixed already, I believe.

RKSimon accepted this revision.May 27 2022, 4:30 AM

LGTM

This revision is now accepted and ready to land.May 27 2022, 4:30 AM

Thanks for the updates.

This revision was landed with ongoing or failed builds.Jun 1 2022, 11:04 AM
This revision was automatically updated to reflect the committed changes.
MaskRay added a subscriber: MaskRay.EditedJun 2 2022, 7:26 PM

Seems that the relanded 9980c9971892378ea82475e000de8df210a58e69 caused wasm test failures for Halide.
I'll try to find folks to give a smaller reproduce (I myself know nearly nothing about Halide), but here is the stack trace in case you notice something immediate.

# related to correctness_interleave in wasm-32-wasmrt-wasm_simd128-wasm_signext-wasm_sat_float_to_int mode

assert.h assertion failed at llvm/CodeGen/BasicTTIImpl.h:152 in llvm::InstructionCost llvm::BasicTTIImplBase<llvm::WebAssemblyTTIImpl>::getInsertSubvectorOverhead(llvm::VectorType *, int, llvm::FixedVectorType *) [T = llvm::WebAssemblyTTIImpl]: (!isa<FixedVectorType>(VTy) || (Index + NumSubElts) <= (int)cast<FixedVectorType>(VTy)->getNumElements()) && "SK_InsertSubvector index out of range"
*** Check failure stack trace: ***
    @     0x5606259259e4  absl::log_internal::LogMessage::SendToLog()
    @     0x56062592518a  absl::log_internal::LogMessage::Flush()
    @     0x560625925dc9  absl::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x560625905ce4  __assert_fail
    @     0x5606232d2f51  llvm::BasicTTIImplBase<>::getInsertSubvectorOverhead()
    @     0x5606232ce898  llvm::BasicTTIImplBase<>::getShuffleCost()
    @     0x560624ce91c5  llvm::TargetTransformInfo::getShuffleCost()
    @     0x56062477966b  llvm::slpvectorizer::BoUpSLP::getEntryCost()
    @     0x56062477ed63  llvm::slpvectorizer::BoUpSLP::getTreeCost()
    @     0x560624797c28  llvm::SLPVectorizerPass::tryToVectorizeList()
    @     0x56062479c4d8  llvm::SLPVectorizerPass::vectorizeInsertElementInst()
    @     0x56062479c67c  llvm::SLPVectorizerPass::vectorizeSimpleInstructions()
    @     0x560624793e82  llvm::SLPVectorizerPass::vectorizeChainsInBlock()
    @     0x560624791dbc  llvm::SLPVectorizerPass::runImpl()
    @     0x560624791168  llvm::SLPVectorizerPass::run()
    @     0x5606236a4b92  llvm::detail::PassModel<>::run()
    @     0x56062512c0b5  llvm::PassManager<>::run()
    @     0x560622c278b2  llvm::detail::PassModel<>::run()
    @     0x5606251300e0  llvm::ModuleToFunctionPassAdaptor::run()
    @     0x560621fc84f2  llvm::detail::PassModel<>::run()
    @     0x56062512b1de  llvm::PassManager<>::run()
    @     0x560621f9d82b  Halide::Internal::CodeGen_LLVM::optimize_module()
    @     0x560621f9ae1a  Halide::Internal::CodeGen_LLVM::finish_codegen()
    @     0x560621f9bdf5  Halide::Internal::CodeGen_LLVM::compile()
    @     0x56062250c909  Halide::Internal::WasmModuleContents::WasmModuleContents()
    @     0x5606225118f2  Halide::Internal::WasmModule::compile()
    @     0x560622155f52  Halide::Pipeline::compile_jit()
    @     0x5606221584ea  Halide::Pipeline::realize()
    @     0x560622157f21  Halide::Pipeline::realize()
    @     0x560622157701  Halide::Pipeline::realize()
    @     0x560621ef6ac5  Halide::Func::realize()
    @     0x560621ec01ad  main
    @     0x7f33b32b08d3  __libc_start_main
    @     0x560621ebe90a  _start

I also bisected a bunch of failing assertions on ARM and aarch64 to this commit. Here's a reduced reproducer:

$ cat repro.c 
char *a;
long b;
int c() {
  int d, e = d = 0;
  for (; d < 8; d++)
    e += a[d] - a[b ^ d] - a[b] >> -a[d] >> 1;
  return e;
}
$ clang -target aarch64-linux-gnu -c repro.c -O3
clang: ../include/llvm/CodeGen/BasicTTIImpl.h:149: llvm::InstructionCost llvm::BasicTTIImplBase<T>::getInsertSubvectorOverhead(llvm::VectorType*, int, llvm::FixedVectorType*) [with T = llvm::AArch64TTIImpl]: Assertion `(!isa<FixedVectorType>(VTy) || (Index + NumSubElts) <= (int)cast<FixedVectorType>(VTy)->getNumElements()) && "SK_InsertSubvector index out of range"' failed.
MaskRay added a comment.EditedJun 22 2022, 10:13 PM

Looks like cac60940b771a0685d058a5b471c84cea05fdc46 causes a miscompile again.
This can be reproduced at this commit or current head.
The commit causes https://github.com/pytorch/cpuinfo src/x86/isa.c:cpuinfo_x86_detect_isa to be miscompiled at least in -Os -fsanitize=memory -march=haswell -fsanitize-memory-param-retval -fsanitize-memory-use-after-dtor mode.

For the test test/init.cc, the address of isa soon incorrectly becomes zero:

(gdb) p &isa
$3 = (struct cpuinfo_x86_isa *) 0x0

Then the test will crash here isa.sysenter = !!(basic_info.edx & UINT32_C(0x00000800)); since the address of isa.sysenter is incorrectly considered as 0x2.

cac60940b771a0685d058a5b471c84cea05fdc46 is not fixed by f96fbc5d96869e7c75f64dacab6e4894ed291530 [SLP]Fix a crash when insert subvector is out of range..
I am sorry as I am preparing a revert.

Here is -mllvm -print-changed -mllvm -print-module-scope output before SLPVectorizer: https://gist.github.com/MaskRay/23f0db50e136127fda1b4f83db2488da

Here is -mllvm -print-changed -mllvm -print-module-scope output before SLPVectorizer: https://gist.github.com/MaskRay/23f0db50e136127fda1b4f83db2488da

Double checked the patch and the results. The patch itself does not change the vectorization, just adjusts the cost model estimation. Some of the buildvector sequences are not profitable for re-vectorization after this patch, nothing else. Plus, the code after this patch results in the same code, being applied to the compiler without the patch, i.e. the code transformations are the same, just different vectorization.
Maybe the debug info is corrupted somehow? Or there are other effects?

MaskRay added a comment.EditedJun 23 2022, 12:59 PM

Here is -mllvm -print-changed -mllvm -print-module-scope output before SLPVectorizer: https://gist.github.com/MaskRay/23f0db50e136127fda1b4f83db2488da

Double checked the patch and the results. The patch itself does not change the vectorization, just adjusts the cost model estimation. Some of the buildvector sequences are not profitable for re-vectorization after this patch, nothing else. Plus, the code after this patch results in the same code, being applied to the compiler without the patch, i.e. the code transformations are the same, just different vectorization.

I have added IR before/after SLPVectorizer, with and without the two commits, and generated assembly to https://gist.github.com/MaskRay/23f0db50e136127fda1b4f83db2488da
Hope they are useful.

Maybe the debug info is corrupted somehow? Or there are other effects?

The codegen is corrupted. I provide debug info just in case it helps analyze the problem.

Here is -mllvm -print-changed -mllvm -print-module-scope output before SLPVectorizer: https://gist.github.com/MaskRay/23f0db50e136127fda1b4f83db2488da

Double checked the patch and the results. The patch itself does not change the vectorization, just adjusts the cost model estimation. Some of the buildvector sequences are not profitable for re-vectorization after this patch, nothing else. Plus, the code after this patch results in the same code, being applied to the compiler without the patch, i.e. the code transformations are the same, just different vectorization.

I have added IR before/after SLPVectorizer, with and without the two commits, and generated assembly to https://gist.github.com/MaskRay/23f0db50e136127fda1b4f83db2488da
Hope they are useful.

Maybe the debug info is corrupted somehow? Or there are other effects?

The codegen is corrupted. I provide debug info just in case it helps analyze the problem.

Try to run something like:
opt -slp-vectorizer -S ./post-slp-bad.ll -o post-slp-after.ll

(opt without these patches) and compare post-slp-after.ll with post-slp-good.ll. It will be the same. The only thing the patch does is prevents some insertelement vectorization in your case, nothing else. That's why the result will be same. Most probably it just reveals the bug somewhere in the compiler, maybe in lowering.

Yes, I suspect that this change just exposed a bug in the X86 backend handling inline asm. I have added some notes on D128461 and am keeping investigation.

I more firmly believe I made a mistake. Sorry for that. You may recommit.

I more firmly believe I made a mistake. Sorry for that. You may recommit.

No problem

manojgupta added a subscriber: manojgupta.EditedJun 26 2022, 9:44 PM

Heads up: I am seeing a clang crash on arm with this commit:

commit 2faacf61a50e7f23fd10927cbbb98c59799bfcd0
Author: Alexey Bataev <a.bataev@outlook.com>
Date: Thu Dec 9 10:34:08 2021 -0800
CommitDate: Fri Jun 24 09:28:01 2022 -0700

[SLP]Improve shuffles cost estimation where possible.

I am trying to create a reduced test case.

Heads up: I am seeing a clang crash on arm with this commit:

commit 2faacf61a50e7f23fd10927cbbb98c59799bfcd0
Author: Alexey Bataev <a.bataev@outlook.com>
Date: Thu Dec 9 10:34:08 2021 -0800
CommitDate: Fri Jun 24 09:28:01 2022 -0700

[SLP]Improve shuffles cost estimation where possible.

I am trying to create a reduced test case.

It must be fixed already, please, check trunc.

Still crashes on trunk.

C-reduce Test case:

typedef __INT64_TYPE__ int64_t;
int sbr_autocorrelate_c_x_i;
void phiautocorr_calc(int64_t );

void sbr_autocorrelate_c_x(void) {
  int(*x)[2] = sbr_autocorrelate_c_x;
  int64_t accu_re , accu_im = 0;
  for (; sbr_autocorrelate_c_x_i; sbr_autocorrelate_c_x_i++) {
    accu_re +=
        x[sbr_autocorrelate_c_x_i][0] * x[sbr_autocorrelate_c_x_i + 2][0];
    accu_re +=
        x[sbr_autocorrelate_c_x_i][1] * x[sbr_autocorrelate_c_x_i + 2][1];
    accu_im +=
        x[sbr_autocorrelate_c_x_i][0] * x[sbr_autocorrelate_c_x_i + 2][1];
    accu_im -=
        x[sbr_autocorrelate_c_x_i][1] * x[sbr_autocorrelate_c_x_i + 2][0];
  }
  phiautocorr_calc(accu_im);
  phiautocorr_calc(accu_re);
}

Crashes with
clang -Os -c test.c --target=armv7a-linux-gnueabihf -mfpu=neon -Wno-error

llvm/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7744: llvm::Value *llvm::slpvectorizer::BoUpSLP::createBuildVector(ArrayRef<llvm::Value *>): Assertion `any_of(VectorizableTree, [VL](const std::unique_ptr<TreeEntry> &TE) { return TE->State == TreeEntry::NeedToGather && TE->isSame(VL); }) && "Non-matching gather node."' failed.

Still crashes on trunk.

C-reduce Test case:

typedef __INT64_TYPE__ int64_t;
int sbr_autocorrelate_c_x_i;
void phiautocorr_calc(int64_t );

void sbr_autocorrelate_c_x(void) {
  int(*x)[2] = sbr_autocorrelate_c_x;
  int64_t accu_re , accu_im = 0;
  for (; sbr_autocorrelate_c_x_i; sbr_autocorrelate_c_x_i++) {
    accu_re +=
        x[sbr_autocorrelate_c_x_i][0] * x[sbr_autocorrelate_c_x_i + 2][0];
    accu_re +=
        x[sbr_autocorrelate_c_x_i][1] * x[sbr_autocorrelate_c_x_i + 2][1];
    accu_im +=
        x[sbr_autocorrelate_c_x_i][0] * x[sbr_autocorrelate_c_x_i + 2][1];
    accu_im -=
        x[sbr_autocorrelate_c_x_i][1] * x[sbr_autocorrelate_c_x_i + 2][0];
  }
  phiautocorr_calc(accu_im);
  phiautocorr_calc(accu_re);
}

Crashes with
clang -Os -c test.c --target=armv7a-linux-gnueabihf -mfpu=neon -Wno-error

llvm/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7744: llvm::Value *llvm::slpvectorizer::BoUpSLP::createBuildVector(ArrayRef<llvm::Value *>): Assertion `any_of(VectorizableTree, [VL](const std::unique_ptr<TreeEntry> &TE) { return TE->State == TreeEntry::NeedToGather && TE->isSame(VL); }) && "Non-matching gather node."' failed.

Thanks, there is PR56251 already. The patch is not a cause of the crash, it has an extra assert that reveals internal bug in SLP vectorizer. Will send a patch with a fix in a few minutes.

I saw another crash on chromium builds - not sure if this is related to the previous mentioned crash but here's a creduced repro:

struct f {
  float g;
  float h;
};
struct j {
  j() = default;
  j(float k, float l) : c(k), d(l) {}
  j operator*(j k) const { return j(k.a + c, b + k.d); }
  float a = 1.0f;
  float b = 0.0f;
  float c = 0.0f;
  float d = 1.0f;
  float e = 0.0f;
};
struct m {
  j n() const;
  f o;
  j p;
};
j m::n() const {
  if (o.g || o.h)
    return j();
  j a;
  return p * a;
}

build with

clang -cc1 -O2 -vectorize-slp -emit-llvm -fno-delete-null-pointer-checks t.cpp

I saw another crash on chromium builds - not sure if this is related to the previous mentioned crash but here's a creduced repro:

struct f {
  float g;
  float h;
};
struct j {
  j() = default;
  j(float k, float l) : c(k), d(l) {}
  j operator*(j k) const { return j(k.a + c, b + k.d); }
  float a = 1.0f;
  float b = 0.0f;
  float c = 0.0f;
  float d = 1.0f;
  float e = 0.0f;
};
struct m {
  j n() const;
  f o;
  j p;
};
j m::n() const {
  if (o.g || o.h)
    return j();
  j a;
  return p * a;
}

build with

clang -cc1 -O2 -vectorize-slp -emit-llvm -fno-delete-null-pointer-checks t.cpp

Yes, the cause is the same but need to adjust the patch D128680 a bit.

I saw another crash on chromium builds - not sure if this is related to the previous mentioned crash but here's a creduced repro:

struct f {
  float g;
  float h;
};
struct j {
  j() = default;
  j(float k, float l) : c(k), d(l) {}
  j operator*(j k) const { return j(k.a + c, b + k.d); }
  float a = 1.0f;
  float b = 0.0f;
  float c = 0.0f;
  float d = 1.0f;
  float e = 0.0f;
};
struct m {
  j n() const;
  f o;
  j p;
};
j m::n() const {
  if (o.g || o.h)
    return j();
  j a;
  return p * a;
}

build with

clang -cc1 -O2 -vectorize-slp -emit-llvm -fno-delete-null-pointer-checks t.cpp

Fixed in D128680