This is an archive of the discontinued LLVM Phabricator instance.

I'm not certain I'm entirely following this, but it looks like it's just trying to account for some weaknesses in getScalarizationOverhead()? Might we be better off trying to improve that?

In D115462#3484627, @RKSimon wrote:

I'm not certain I'm entirely following this, but it looks like it's just trying to account for some weaknesses in getScalarizationOverhead()? Might we be better off trying to improve that?

Actually, this is a bug fix. We're not quite correct in some cases; the patch fixes the cost of some of the shuffles.

Rebase

Harbormaster completed remote builds in B166087: Diff 431724. May 24 2022, 11:53 AM

LGTM

This revision is now accepted and ready to land. May 26 2022, 7:05 AM

dmgreen added a subscriber: dmgreen. May 26 2022, 8:17 AM

dmgreen added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844	I'm not sure I understand why this would be a SK_Select. That is a bit of an X86 special as far as I understand and doesn't always correlate well to other architectures. Why is the Mask missing too? That might be enough to help avoid the regressions if it was re-added.
llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
1335 ↗	(On Diff #431724)	This seems worse I'm afraid - I don't think it should be keeping all these individual loads that are inserted. The insert_subvector cost should be low enough for them to be profitable to vectorize under AArch64 - they are just an s register load.

ABataev added inline comments. May 26 2022, 8:35 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844	It is a permutation of 2 sub-vectors: the root of the buildvector and a subvector after the vectorization. Since it was a buildvector, the compiler selects elements from the root and corresponding elements from the resulting vector. A mask is not required if TTI::SK_Select is used; the mask is used only with SK_PermuteSingleSrc and SK_PermuteTwoSrc. But I'll check it.
llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
1335 ↗	(On Diff #431724)	I'll try to improve cost estimation there with insert_subvector cost.

dmgreen added inline comments. May 26 2022, 8:48 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844	AArch64 (and most other architectures AFAIU) do not have SK_Select shuffles, so it is not a lot better than SK_PermuteTwoSrc. A Mask can help to improve the cost though, if the backend can come up with something more accurate for it. I'm surprised this is not a SK_InsertSubvector with adjacent elements though - that seems like the most natural fit, unless I'm missing how this works.
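To make the distinction in this exchange concrete, here is a minimal standalone sketch of how a two-source shuffle mask could be classified as "select"-like or "insert subvector"-like. The helper names isSelectStyleMask and isInsertSubvectorStyleMask are hypothetical and written only for illustration; they are not the TTI interface and not the code in this patch. The sketch assumes the usual convention that lanes 0..N-1 refer to the first source, lanes N..2N-1 to the second, and -1 marks an undefined lane.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helpers for illustration only; not the LLVM TTI API.
// Mask lanes 0..N-1 index the first source, N..2N-1 the second, -1 is undef.

// "Select" style: result lane I takes lane I from one of the two sources.
bool isSelectStyleMask(const std::vector<int> &Mask) {
  const int N = static_cast<int>(Mask.size());
  for (int I = 0; I < N; ++I) {
    if (Mask[I] == -1)
      continue;
    if (Mask[I] != I && Mask[I] != I + N)
      return false;
  }
  return true;
}

// "Insert subvector" style: the result is the first source with one contiguous
// run of lanes [Index, Index + NumSubElts) replaced by lanes 0..NumSubElts-1
// of the second source. On a match, Index and NumSubElts are filled in.
bool isInsertSubvectorStyleMask(const std::vector<int> &Mask, int &Index,
                                int &NumSubElts) {
  const int N = static_cast<int>(Mask.size());
  int First = -1, Last = -1;
  for (int I = 0; I < N; ++I) {
    if (Mask[I] >= N) {
      if (First < 0)
        First = I;
      Last = I;
    }
  }
  if (First < 0)
    return false; // No lane comes from the second source.
  for (int I = 0; I < N; ++I) {
    if (Mask[I] == -1)
      continue;
    if (I >= First && I <= Last) {
      if (Mask[I] != N + (I - First))
        return false;
    } else if (Mask[I] != I) {
      return false;
    }
  }
  Index = First;
  NumSubElts = Last - First + 1;
  return true;
}
```

For a 4-lane vector, the mask {0, 1, 6, 7} keeps every lane in place and only chooses the source, so it is select-like, while {0, 1, 4, 5} moves lanes 0 and 1 of the second source into positions 2 and 3, which is the adjacent-elements insert that the comment above describes.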

reopen to address @dmgreen's comments

This revision now requires changes to proceed. May 26 2022, 9:20 AM

ABataev added inline comments. May 26 2022, 9:20 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844	Yep, you're right, it must be an InsertSubvector kind; I changed it to Select because some costs for InsertSubvector were not implemented.

RKSimon added inline comments. May 26 2022, 9:54 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844	Was this on x86 / aarch64 or some other target?

ABataev added inline comments. May 26 2022, 9:57 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844	x86, IIRC.

RKSimon added inline comments. May 26 2022, 9:58 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844	If you can email a test case I'll take a look.

Address comments.

ABataev added inline comments. May 26 2022, 11:26 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4841–4844	Some of the lit tests, I don't remember which anymore. Most of the cases were fixed already, I believe.

Harbormaster completed remote builds in B166510: Diff 432337. May 26 2022, 12:08 PM

LGTM

This revision is now accepted and ready to land. May 27 2022, 4:30 AM

Thanks for the updates.

This revision was landed with ongoing or failed builds. Jun 1 2022, 11:04 AM

Closed by commit rGfd5a6ce9dcb7: [SLP]Improve shuffles cost estimation where possible. (authored by ABataev).

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rGfd5a6ce9dcb7: [SLP]Improve shuffles cost estimation where possible.

ABataev added a reverting change: rG73020b45407f: Revert "[SLP]Improve shuffles cost estimation where possible.". Jun 1 2022, 3:47 PM

ABataev added a commit: rG9980c9971892: [SLP]Improve shuffles cost estimation where possible. Jun 2 2022, 11:20 AM

Seems that the relanded 9980c9971892378ea82475e000de8df210a58e69 caused wasm test failures for Halide.
I'll try to find folks to give a smaller reproducer (I myself know nearly nothing about Halide), but here is the stack trace in case you notice something immediate.

# related to correctness_interleave in wasm-32-wasmrt-wasm_simd128-wasm_signext-wasm_sat_float_to_int mode

assert.h assertion failed at llvm/CodeGen/BasicTTIImpl.h:152 in llvm::InstructionCost llvm::BasicTTIImplBase<llvm::WebAssemblyTTIImpl>::getInsertSubvectorOverhead(llvm::VectorType *, int, llvm::FixedVectorType *) [T = llvm::WebAssemblyTTIImpl]: (!isa<FixedVectorType>(VTy) || (Index + NumSubElts) <= (int)cast<FixedVectorType>(VTy)->getNumElements()) && "SK_InsertSubvector index out of range"

*** Check failure stack trace: ***
    @     0x5606259259e4  absl::log_internal::LogMessage::SendToLog()
    @     0x56062592518a  absl::log_internal::LogMessage::Flush()
    @     0x560625925dc9  absl::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x560625905ce4  __assert_fail
    @     0x5606232d2f51  llvm::BasicTTIImplBase<>::getInsertSubvectorOverhead()
    @     0x5606232ce898  llvm::BasicTTIImplBase<>::getShuffleCost()
    @     0x560624ce91c5  llvm::TargetTransformInfo::getShuffleCost()
    @     0x56062477966b  llvm::slpvectorizer::BoUpSLP::getEntryCost()
    @     0x56062477ed63  llvm::slpvectorizer::BoUpSLP::getTreeCost()
    @     0x560624797c28  llvm::SLPVectorizerPass::tryToVectorizeList()
    @     0x56062479c4d8  llvm::SLPVectorizerPass::vectorizeInsertElementInst()
    @     0x56062479c67c  llvm::SLPVectorizerPass::vectorizeSimpleInstructions()
    @     0x560624793e82  llvm::SLPVectorizerPass::vectorizeChainsInBlock()
    @     0x560624791dbc  llvm::SLPVectorizerPass::runImpl()
    @     0x560624791168  llvm::SLPVectorizerPass::run()
    @     0x5606236a4b92  llvm::detail::PassModel<>::run()
    @     0x56062512c0b5  llvm::PassManager<>::run()
    @     0x560622c278b2  llvm::detail::PassModel<>::run()
    @     0x5606251300e0  llvm::ModuleToFunctionPassAdaptor::run()
    @     0x560621fc84f2  llvm::detail::PassModel<>::run()
    @     0x56062512b1de  llvm::PassManager<>::run()
    @     0x560621f9d82b  Halide::Internal::CodeGen_LLVM::optimize_module()
    @     0x560621f9ae1a  Halide::Internal::CodeGen_LLVM::finish_codegen()
    @     0x560621f9bdf5  Halide::Internal::CodeGen_LLVM::compile()
    @     0x56062250c909  Halide::Internal::WasmModuleContents::WasmModuleContents()
    @     0x5606225118f2  Halide::Internal::WasmModule::compile()
    @     0x560622155f52  Halide::Pipeline::compile_jit()
    @     0x5606221584ea  Halide::Pipeline::realize()
    @     0x560622157f21  Halide::Pipeline::realize()
    @     0x560622157701  Halide::Pipeline::realize()
    @     0x560621ef6ac5  Halide::Func::realize()
    @     0x560621ec01ad  main
    @     0x7f33b32b08d3  __libc_start_main
    @     0x560621ebe90a  _start

uabelho added a subscriber: uabelho. Jun 2 2022, 10:37 PM

I also bisected a bunch of failing assertions on ARM and aarch64 to this commit. Here's a reduced reproducer:

$ cat repro.c 
char *a;
long b;
int c() {
  int d, e = d = 0;
  for (; d < 8; d++)
    e += a[d] - a[b ^ d] - a[b] >> -a[d] >> 1;
  return e;
}
$ clang -target aarch64-linux-gnu -c repro.c -O3
clang: ../include/llvm/CodeGen/BasicTTIImpl.h:149: llvm::InstructionCost llvm::BasicTTIImplBase<T>::getInsertSubvectorOverhead(llvm::VectorType*, int, llvm::FixedVectorType*) [with T = llvm::AArch64TTIImpl]: Assertion `(!isa<FixedVectorType>(VTy) || (Index + NumSubElts) <= (int)cast<FixedVectorType>(VTy)->getNumElements()) && "SK_InsertSubvector index out of range"' failed.
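
The condition in the failing assertion can be restated as a small standalone predicate. This is only an illustrative sketch: insertSubvectorInRange is a hypothetical name, and the non-negativity checks are an added assumption that the original assertion does not spell out.

```cpp
#include <cassert>

// Hypothetical standalone restatement of the asserted condition; not LLVM code.
// Inserting NumSubElts lanes starting at lane Index into a fixed vector of
// NumElts lanes is in range only if the inserted run ends at or before the
// last lane of the destination vector.
bool insertSubvectorInRange(int Index, int NumSubElts, int NumElts) {
  return Index >= 0 && NumSubElts >= 0 && Index + NumSubElts <= NumElts;
}
```

For example, inserting 4 lanes at index 4 of an 8-lane vector is in range, while inserting 4 lanes at index 6 is not, and a cost query of that second shape is the kind that trips the assertion quoted above.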

MaskRay added a reverting change: rGdf0f30dc36c1: Revert "[SLP]Improve shuffles cost estimation where possible.". Jun 3 2022, 12:30 AM

ABataev added a commit: rGcac60940b771: [SLP]Improve shuffles cost estimation where possible. Jun 3 2022, 8:08 AM

Looks like cac60940b771a0685d058a5b471c84cea05fdc46 causes a miscompile again.
This can be reproduced at this commit or current head.
The commit causes https://github.com/pytorch/cpuinfo src/x86/isa.c:cpuinfo_x86_detect_isa to be miscompiled at least in -Os -fsanitize=memory -march=haswell -fsanitize-memory-param-retval -fsanitize-memory-use-after-dtor mode.

For the test test/init.cc, the address of isa soon incorrectly becomes zero:

(gdb) p &isa
$3 = (struct cpuinfo_x86_isa *) 0x0

The test then crashes at isa.sysenter = !!(basic_info.edx & UINT32_C(0x00000800)); because the address of isa.sysenter is incorrectly treated as 0x2.

Herald added a subscriber: nlopes. Jun 22 2022, 10:13 PM

cac60940b771a0685d058a5b471c84cea05fdc46 is not fixed by f96fbc5d96869e7c75f64dacab6e4894ed291530 [SLP]Fix a crash when insert subvector is out of range..
Sorry, I am preparing a revert.

Here is -mllvm -print-changed -mllvm -print-module-scope output before SLPVectorizer: https://gist.github.com/MaskRay/23f0db50e136127fda1b4f83db2488da

MaskRay added a reverting change: rG1ffd2d99c29e: Revert D115462 "[SLP]Improve shuffles cost estimation where possible.". Jun 22 2022, 11:16 PM

In D115462#3603899, @MaskRay wrote:

Here is -mllvm -print-changed -mllvm -print-module-scope output before SLPVectorizer: https://gist.github.com/MaskRay/23f0db50e136127fda1b4f83db2488da

Double-checked the patch and the results. The patch itself does not change the vectorization; it just adjusts the cost model estimation. After this patch, some of the buildvector sequences are no longer profitable for re-vectorization, nothing else. Also, the code produced with this patch, when fed to the compiler without the patch, results in the same code, i.e. the code transformations are the same, just with different vectorization.
Maybe the debug info is corrupted somehow? Or there are other effects?

In D115462#3605808, @ABataev wrote:

Double-checked the patch and the results. The patch itself does not change the vectorization; it just adjusts the cost model estimation. After this patch, some of the buildvector sequences are no longer profitable for re-vectorization, nothing else. Also, the code produced with this patch, when fed to the compiler without the patch, results in the same code, i.e. the code transformations are the same, just with different vectorization.

I have added IR before/after SLPVectorizer, with and without the two commits, and generated assembly to https://gist.github.com/MaskRay/23f0db50e136127fda1b4f83db2488da
Hope they are useful.

Maybe the debug info is corrupted somehow? Or there are other effects?

The codegen is corrupted. I provide debug info just in case it helps analyze the problem.

In D115462#3606101, @MaskRay wrote:

The codegen is corrupted. I provide debug info just in case it helps analyze the problem.

Try to run something like:
opt -slp-vectorizer -S ./post-slp-bad.ll -o post-slp-after.ll

(opt without these patches) and compare post-slp-after.ll with post-slp-good.ll. They will be the same. The only thing the patch does is prevent some insertelement vectorization in your case, nothing else. That's why the result will be the same. Most probably it just reveals a bug somewhere else in the compiler, maybe in lowering.

Yes, I suspect that this change just exposed a bug in the X86 backend's handling of inline asm. I have added some notes on D128461 and am continuing to investigate.

I more firmly believe I made a mistake. Sorry for that. You may recommit.

In D115462#3607097, @MaskRay wrote:

I more firmly believe I made a mistake. Sorry for that. You may recommit.

No problem

ABataev added a commit: rG2faacf61a50e: [SLP]Improve shuffles cost estimation where possible. Jun 24 2022, 9:30 AM

Heads up: I am seeing a clang crash on arm with this commit:

commit 2faacf61a50e7f23fd10927cbbb98c59799bfcd0
Author: Alexey Bataev <a.bataev@outlook.com>
Date: Thu Dec 9 10:34:08 2021 -0800
CommitDate: Fri Jun 24 09:28:01 2022 -0700

[SLP]Improve shuffles cost estimation where possible.

I am trying to create a reduced test case.

In D115462#3611255, @manojgupta wrote:
Heads up: I am seeing a clang crash on arm with this commit:

It should be fixed already; please check trunk.

Still crashes on trunk.

C-reduce Test case:

typedef __INT64_TYPE__ int64_t;
int sbr_autocorrelate_c_x_i;
void phiautocorr_calc(int64_t );

void sbr_autocorrelate_c_x(void) {
  int(*x)[2] = sbr_autocorrelate_c_x;
  int64_t accu_re , accu_im = 0;
  for (; sbr_autocorrelate_c_x_i; sbr_autocorrelate_c_x_i++) {
    accu_re +=
        x[sbr_autocorrelate_c_x_i][0] * x[sbr_autocorrelate_c_x_i + 2][0];
    accu_re +=
        x[sbr_autocorrelate_c_x_i][1] * x[sbr_autocorrelate_c_x_i + 2][1];
    accu_im +=
        x[sbr_autocorrelate_c_x_i][0] * x[sbr_autocorrelate_c_x_i + 2][1];
    accu_im -=
        x[sbr_autocorrelate_c_x_i][1] * x[sbr_autocorrelate_c_x_i + 2][0];
  }
  phiautocorr_calc(accu_im);
  phiautocorr_calc(accu_re);
}

Crashes with
clang -Os -c test.c --target=armv7a-linux-gnueabihf -mfpu=neon -Wno-error

llvm/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7744: llvm::Value *llvm::slpvectorizer::BoUpSLP::createBuildVector(ArrayRef<llvm::Value *>): Assertion `any_of(VectorizableTree, [VL](const std::unique_ptr<TreeEntry> &TE) { return TE->State == TreeEntry::NeedToGather && TE->isSame(VL); }) && "Non-matching gather node."' failed.

In D115462#3613388, @manojgupta wrote:
Still crashes on trunk.

Thanks, there is PR56251 already. The patch is not the cause of the crash; it has an extra assert that reveals an internal bug in the SLP vectorizer. I will send a patch with a fix in a few minutes.

I saw another crash on chromium builds - not sure if this is related to the previously mentioned crash but here's a creduced repro:

struct f {
  float g;
  float h;
};
struct j {
  j() = default;
  j(float k, float l) : c(k), d(l) {}
  j operator*(j k) const { return j(k.a + c, b + k.d); }
  float a = 1.0f;
  float b = 0.0f;
  float c = 0.0f;
  float d = 1.0f;
  float e = 0.0f;
};
struct m {
  j n() const;
  f o;
  j p;
};
j m::n() const {
  if (o.g || o.h)
    return j();
  j a;
  return p * a;
}

build with

clang -cc1 -O2 -vectorize-slp -emit-llvm -fno-delete-null-pointer-checks t.cpp

In D115462#3613570, @akhuang wrote:

I saw another crash on chromium builds - not sure if this is related to the previous mentioned crash but here's a creduced repro:

Yes, the cause is the same but need to adjust the patch D128680 a bit.

In D115462#3613570, @akhuang wrote:

I saw another crash on chromium builds - not sure if this is related to the previous mentioned crash but here's a creduced repro:

Fixed in D128680

Revision Contents

Path                                                                    Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp                         110 lines
llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll    24 lines
llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll                 24 lines

Diff 393252

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp


[4,339 lines hidden] computeExtractCost(ArrayRef<Value *> VL, FixedVectorType *VecTy,
   bool AllConsecutive = true;
   unsigned EltsPerVector = VecTy->getNumElements() / NumOfParts;
   unsigned Idx = -1;
   InstructionCost Cost = 0;

   // Process extracts in blocks of EltsPerVector to check if the source vector
   // operand can be re-used directly. If not, add the cost of creating a shuffle
   // to extract the values into a vector register.
+  SmallVector<int> RegMask(EltsPerVector, UndefMaskElem);
   for (auto *V : VL) {
     ++Idx;

-    // Need to exclude undefs from analysis.
-    if (isa<UndefValue>(V) || Mask[Idx] == UndefMaskElem)
-      continue;
-
     // Reached the start of a new vector registers.
     if (Idx % EltsPerVector == 0) {
+      RegMask.assign(EltsPerVector, UndefMaskElem);
       AllConsecutive = true;
       continue;
     }

+    // Need to exclude undefs from analysis.
+    if (isa<UndefValue>(V) || Mask[Idx] == UndefMaskElem)
+      continue;
+
     // Check all extracts for a vector register on the target directly
     // extract values in order.
     unsigned CurrentIdx = *getExtractIndex(cast<Instruction>(V));
     if (!isa<UndefValue>(VL[Idx - 1]) && Mask[Idx - 1] != UndefMaskElem) {
       unsigned PrevIdx = *getExtractIndex(cast<Instruction>(VL[Idx - 1]));
       AllConsecutive &= PrevIdx + 1 == CurrentIdx &&
                         CurrentIdx % EltsPerVector == Idx % EltsPerVector;
+      RegMask[Idx % EltsPerVector] = CurrentIdx % EltsPerVector;
     }

     if (AllConsecutive)
       continue;

     // Skip all indices, except for the last index per vector block.
     if ((Idx + 1) % EltsPerVector != 0 && Idx + 1 != VL.size())
       continue;

     // If we have a series of extracts which are not consecutive and hence
     // cannot re-use the source vector register directly, compute the shuffle
-    // cost to extract the a vector with EltsPerVector elements.
+    // cost to extract the vector with EltsPerVector elements.
     Cost += TTI.getShuffleCost(
         TargetTransformInfo::SK_PermuteSingleSrc,
-        FixedVectorType::get(VecTy->getElementType(), EltsPerVector));
+        FixedVectorType::get(VecTy->getElementType(), EltsPerVector), RegMask);
   }
   return Cost;
 }

 /// Build shuffle mask for shuffle graph entries and lists of main and alternate
 /// operations operands.
 static void
 buildSuffleEntryMask(ArrayRef<Value *> VL, ArrayRef<unsigned> ReorderIndices,
[378 lines hidden] case Instruction::ExtractElement: {
         AdjustExtractsCost(CommonCost);
       }
       return CommonCost;
     }
     case Instruction::InsertElement: {
       assert(E->ReuseShuffleIndices.empty() &&
              "Unique insertelements only are expected.");
       auto *SrcVecTy = cast<FixedVectorType>(VL0->getType());

       unsigned const NumElts = SrcVecTy->getNumElements();
       unsigned const NumScalars = VL.size();

+      unsigned NumOfParts = TTI->getNumberOfParts(SrcVecTy);
+
+      unsigned OffsetBeg = *getInsertIndex(VL.front(), 0);
+      unsigned OffsetEnd = *getInsertIndex(VL.back(), 0);
+      unsigned VecSz = NumElts;
+      unsigned VecScalarsSz = NumScalars;
+      if (NumOfParts > 0) {
+        VecScalarsSz = NumElts / NumOfParts;
+        VecSz = PowerOf2Ceil(
+            (1 + OffsetEnd / VecScalarsSz - OffsetBeg / VecScalarsSz) *
+            VecScalarsSz);
+      }
+
       APInt DemandedElts = APInt::getZero(NumElts);
       // TODO: Add support for Instruction::InsertValue.
       SmallVector<int> Mask;
       if (!E->ReorderIndices.empty()) {
         inversePermutation(E->ReorderIndices, Mask);
-        Mask.append(NumElts - NumScalars, UndefMaskElem);
       } else {
-        Mask.assign(NumElts, UndefMaskElem);
+        Mask.assign(NumScalars, UndefMaskElem);
         std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0);
       }
-      unsigned Offset = *getInsertIndex(VL0, 0);
       bool IsIdentity = true;
-      SmallVector<int> PrevMask(NumElts, UndefMaskElem);
+      SmallVector<int> PrevMask(VecSz, UndefMaskElem);
       Mask.swap(PrevMask);
+      unsigned Offset = VecScalarsSz * (OffsetBeg / VecScalarsSz);
       for (unsigned I = 0; I < NumScalars; ++I) {
         Optional<int> InsertIdx = getInsertIndex(VL[PrevMask[I]], 0);
         if (!InsertIdx || *InsertIdx == UndefMaskElem)
           continue;
         DemandedElts.setBit(*InsertIdx);
         IsIdentity &= *InsertIdx - Offset == I;
         Mask[*InsertIdx - Offset] = I;
       }
       assert(Offset < NumElts && "Failed to find vector index offset");

       InstructionCost Cost = 0;
       Cost -= TTI->getScalarizationOverhead(SrcVecTy, DemandedElts,
                                             /*Insert*/ true, /*Extract*/ false);

-      if (IsIdentity && NumElts != NumScalars && Offset % NumScalars != 0) {
-        // FIXME: Replace with SK_InsertSubvector once it is properly supported.
-        unsigned Sz = PowerOf2Ceil(Offset + NumScalars);
-        Cost += TTI->getShuffleCost(
-            TargetTransformInfo::SK_PermuteSingleSrc,
-            FixedVectorType::get(SrcVecTy->getElementType(), Sz));
-      } else if (!IsIdentity) {
-        auto *FirstInsert =
-            cast<Instruction>(*find_if(E->Scalars, [E](Value *V) {
-              return !is_contained(E->Scalars,
-                                   cast<Instruction>(V)->getOperand(0));
-            }));
-        if (isUndefVector(FirstInsert->getOperand(0))) {
-          Cost += TTI->getShuffleCost(TTI::SK_PermuteSingleSrc, SrcVecTy, Mask);
-        } else {
-          SmallVector<int> InsertMask(NumElts);
-          std::iota(InsertMask.begin(), InsertMask.end(), 0);
-          for (unsigned I = 0; I < NumElts; I++) {
-            if (Mask[I] != UndefMaskElem)
-              InsertMask[Offset + I] = NumElts + I;
-          }
-          Cost +=
-              TTI->getShuffleCost(TTI::SK_PermuteTwoSrc, SrcVecTy, InsertMask);
-        }
-      }
+      // First cost - resize to actual vector size if not identity shuffle or
+      // need to shift the vector.
+      // Do not calculate the cost if the actual size is the register size and
+      // we can merge this shuffle with the following SK_Select.
+      auto *ActualVecTy =
+          FixedVectorType::get(SrcVecTy->getElementType(), VecSz);
+      if ((!IsIdentity || Offset != OffsetBeg) && VecScalarsSz != VecSz)
+        Cost += TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc,
+                                    ActualVecTy, Mask);
+      auto *FirstInsert = cast<Instruction>(*find_if(E->Scalars, [E](Value *V) {
+        return !is_contained(E->Scalars, cast<Instruction>(V)->getOperand(0));
+      }));
+      // Second cost - permutation with subvector, if some elements are from the
+      // initial vector or inserting a subvector.
+      // TODO: Implement the analysis of the FirstInsert->getOperand(0)
+      // subvector of ActualVecTy.
+      if (!isUndefVector(FirstInsert->getOperand(0)) && Offset != OffsetBeg)
+        Cost += TTI->getShuffleCost(
+            TTI::SK_Select,
+            NumOfParts > 0
+                ? FixedVectorType::get(SrcVecTy->getElementType(), VecScalarsSz)
+                : ActualVecTy);

       return Cost;
     }
     case Instruction::ZExt:
     case Instruction::SExt:
     case Instruction::FPToUI:
     case Instruction::FPToSI:
     case Instruction::FPExt:
[287 lines hidden] case Instruction::ShuffleVector: {
         auto *Src0Ty = FixedVectorType::get(Src0SclTy, VL.size());
         auto *Src1Ty = FixedVectorType::get(Src1SclTy, VL.size());
         VecCost = TTI->getCastInstrCost(E->getOpcode(), VecTy, Src0Ty,
                                         TTI::CastContextHint::None, CostKind);
         VecCost += TTI->getCastInstrCost(E->getAltOpcode(), VecTy, Src1Ty,
                                          TTI::CastContextHint::None, CostKind);
       }

+      if (E->ReuseShuffleIndices.empty()) {
+        CommonCost =
+            TTI->getShuffleCost(TargetTransformInfo::SK_Select, FinalVecTy);
+      } else {
         SmallVector<int> Mask;
         buildSuffleEntryMask(
             E->Scalars, E->ReorderIndices, E->ReuseShuffleIndices,
             [E](Instruction *I) {
               assert(E->isOpcodeOrAlt(I) && "Unexpected main/alternate opcode");
               return I->getOpcode() == E->getAltOpcode();
             },
             Mask);
-        CommonCost =
-            TTI->getShuffleCost(TargetTransformInfo::SK_Select, FinalVecTy, Mask);
+        CommonCost = TTI->getShuffleCost(TargetTransformInfo::SK_PermuteTwoSrc,
+                                         FinalVecTy, Mask);
+      }
       LLVM_DEBUG(dumpTreeCosts(E, CommonCost, VecCost, ScalarCost));
       return CommonCost + VecCost - ScalarCost;
     }
     default:
       llvm_unreachable("Unknown instruction");
     }
   }

...
if (MinBWs.count(ScalarRoot)) {
TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, EU.Lane);		TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, EU.Lane);
}		}
}		}

InstructionCost SpillCost = getSpillCost();		InstructionCost SpillCost = getSpillCost();
Cost += SpillCost + ExtractCost;		Cost += SpillCost + ExtractCost;
if (FirstUsers.size() == 1) {		if (FirstUsers.size() == 1) {
int Limit = ShuffleMask.front().size() * 2;		int Limit = ShuffleMask.front().size() * 2;
if (all_of(ShuffleMask.front(), [Limit](int Idx) { return Idx < Limit; }) &&		if (!all_of(ShuffleMask.front(),
		[Limit](int Idx) { return Idx < Limit; }) ||
!ShuffleVectorInst::isIdentityMask(ShuffleMask.front())) {		!ShuffleVectorInst::isIdentityMask(ShuffleMask.front())) {
InstructionCost C = TTI->getShuffleCost(		InstructionCost C = TTI->getShuffleCost(
TTI::SK_PermuteSingleSrc,		TTI::SK_PermuteSingleSrc,
cast<FixedVectorType>(FirstUsers.front()->getType()),		cast<FixedVectorType>(FirstUsers.front()->getType()),
ShuffleMask.front());		ShuffleMask.front());
LLVM_DEBUG(dbgs() << "SLP: Adding cost " << C		LLVM_DEBUG(dbgs() << "SLP: Adding cost " << C
<< " for final shuffle of insertelement external users "		<< " for final shuffle of insertelement external users "
<< *VectorizableTree.front()->Scalars.front() << ".\n"		<< *VectorizableTree.front()->Scalars.front() << ".\n"

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll

...
;
%r5 = insertelement <8 x i32> %r4, i32 %ab5, i32 5		%r5 = insertelement <8 x i32> %r4, i32 %ab5, i32 5
%r6 = insertelement <8 x i32> %r5, i32 %ab6, i32 6		%r6 = insertelement <8 x i32> %r5, i32 %ab6, i32 6
%r7 = insertelement <8 x i32> %r6, i32 %ab7, i32 7		%r7 = insertelement <8 x i32> %r6, i32 %ab7, i32 7
ret <8 x i32> %r7		ret <8 x i32> %r7
}		}

define <8 x i32> @ashr_lshr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) {		define <8 x i32> @ashr_lshr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) {
; SSE-LABEL: @ashr_lshr_shl_v8i32(		; SSE-LABEL: @ashr_lshr_shl_v8i32(
; SSE-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>		; SSE-NEXT: [[A6:%.*]] = extractelement <8 x i32> [[A:%.*]], i32 6
; SSE-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>		; SSE-NEXT: [[A7:%.*]] = extractelement <8 x i32> [[A]], i32 7
		; SSE-NEXT: [[B6:%.*]] = extractelement <8 x i32> [[B:%.*]], i32 6
		; SSE-NEXT: [[B7:%.*]] = extractelement <8 x i32> [[B]], i32 7
		; SSE-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
		; SSE-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; SSE-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]]		; SSE-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]]
; SSE-NEXT: [[TMP4:%.*]] = lshr <4 x i32> [[TMP1]], [[TMP2]]		; SSE-NEXT: [[TMP4:%.*]] = lshr <4 x i32> [[TMP1]], [[TMP2]]
; SSE-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>		; SSE-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>
; SSE-NEXT: [[TMP6:%.*]] = lshr <8 x i32> [[A]], [[B]]		; SSE-NEXT: [[TMP6:%.*]] = lshr <8 x i32> [[A]], [[B]]
; SSE-NEXT: [[TMP7:%.*]] = shufflevector <8 x i32> [[TMP6]], <8 x i32> poison, <2 x i32> <i32 4, i32 5>		; SSE-NEXT: [[TMP7:%.*]] = shufflevector <8 x i32> [[TMP6]], <8 x i32> poison, <2 x i32> <i32 4, i32 5>
; SSE-NEXT: [[TMP8:%.*]] = shl <8 x i32> [[A]], [[B]]		; SSE-NEXT: [[AB6:%.*]] = shl i32 [[A6]], [[B6]]
; SSE-NEXT: [[TMP9:%.*]] = shufflevector <8 x i32> [[TMP8]], <8 x i32> poison, <2 x i32> <i32 6, i32 7>		; SSE-NEXT: [[AB7:%.*]] = shl i32 [[A7]], [[B7]]
; SSE-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>		; SSE-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
; SSE-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>		; SSE-NEXT: [[TMP9:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
; SSE-NEXT: [[R52:%.*]] = shufflevector <8 x i32> [[TMP10]], <8 x i32> [[TMP11]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 undef, i32 undef>		; SSE-NEXT: [[R51:%.*]] = shufflevector <8 x i32> [[TMP8]], <8 x i32> [[TMP9]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 undef, i32 undef>
; SSE-NEXT: [[TMP12:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>		; SSE-NEXT: [[R6:%.*]] = insertelement <8 x i32> [[R51]], i32 [[AB6]], i32 6
; SSE-NEXT: [[R71:%.*]] = shufflevector <8 x i32> [[R52]], <8 x i32> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 8, i32 9>		; SSE-NEXT: [[R7:%.*]] = insertelement <8 x i32> [[R6]], i32 [[AB7]], i32 7
; SSE-NEXT: ret <8 x i32> [[R71]]		; SSE-NEXT: ret <8 x i32> [[R7]]
;		;
; SLM-LABEL: @ashr_lshr_shl_v8i32(		; SLM-LABEL: @ashr_lshr_shl_v8i32(
; SLM-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>		; SLM-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; SLM-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>		; SLM-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; SLM-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]]		; SLM-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]]
; SLM-NEXT: [[TMP4:%.*]] = lshr <4 x i32> [[TMP1]], [[TMP2]]		; SLM-NEXT: [[TMP4:%.*]] = lshr <4 x i32> [[TMP1]], [[TMP2]]
; SLM-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>		; SLM-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>
; SLM-NEXT: [[TMP6:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>		; SLM-NEXT: [[TMP6:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>

llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll

...
;
%r5 = insertelement <8 x i32> %r4, i32 %ab5, i32 5		%r5 = insertelement <8 x i32> %r4, i32 %ab5, i32 5
%r6 = insertelement <8 x i32> %r5, i32 %ab6, i32 6		%r6 = insertelement <8 x i32> %r5, i32 %ab6, i32 6
%r7 = insertelement <8 x i32> %r6, i32 %ab7, i32 7		%r7 = insertelement <8 x i32> %r6, i32 %ab7, i32 7
ret <8 x i32> %r7		ret <8 x i32> %r7
}		}

define <8 x i32> @ashr_lshr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) {		define <8 x i32> @ashr_lshr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) {
; SSE-LABEL: @ashr_lshr_shl_v8i32(		; SSE-LABEL: @ashr_lshr_shl_v8i32(
; SSE-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>		; SSE-NEXT: [[A6:%.*]] = extractelement <8 x i32> [[A:%.*]], i32 6
; SSE-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>		; SSE-NEXT: [[A7:%.*]] = extractelement <8 x i32> [[A]], i32 7
		; SSE-NEXT: [[B6:%.*]] = extractelement <8 x i32> [[B:%.*]], i32 6
		; SSE-NEXT: [[B7:%.*]] = extractelement <8 x i32> [[B]], i32 7
		; SSE-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
		; SSE-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; SSE-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]]		; SSE-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]]
; SSE-NEXT: [[TMP4:%.*]] = lshr <4 x i32> [[TMP1]], [[TMP2]]		; SSE-NEXT: [[TMP4:%.*]] = lshr <4 x i32> [[TMP1]], [[TMP2]]
; SSE-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>		; SSE-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>
; SSE-NEXT: [[TMP6:%.*]] = lshr <8 x i32> [[A]], [[B]]		; SSE-NEXT: [[TMP6:%.*]] = lshr <8 x i32> [[A]], [[B]]
; SSE-NEXT: [[TMP7:%.*]] = shufflevector <8 x i32> [[TMP6]], <8 x i32> poison, <2 x i32> <i32 4, i32 5>		; SSE-NEXT: [[TMP7:%.*]] = shufflevector <8 x i32> [[TMP6]], <8 x i32> poison, <2 x i32> <i32 4, i32 5>
; SSE-NEXT: [[TMP8:%.*]] = shl <8 x i32> [[A]], [[B]]		; SSE-NEXT: [[AB6:%.*]] = shl i32 [[A6]], [[B6]]
; SSE-NEXT: [[TMP9:%.*]] = shufflevector <8 x i32> [[TMP8]], <8 x i32> poison, <2 x i32> <i32 6, i32 7>		; SSE-NEXT: [[AB7:%.*]] = shl i32 [[A7]], [[B7]]
; SSE-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>		; SSE-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
; SSE-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>		; SSE-NEXT: [[TMP9:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
; SSE-NEXT: [[R52:%.*]] = shufflevector <8 x i32> [[TMP10]], <8 x i32> [[TMP11]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 undef, i32 undef>		; SSE-NEXT: [[R51:%.*]] = shufflevector <8 x i32> [[TMP8]], <8 x i32> [[TMP9]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 undef, i32 undef>
; SSE-NEXT: [[TMP12:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>		; SSE-NEXT: [[R6:%.*]] = insertelement <8 x i32> [[R51]], i32 [[AB6]], i32 6
; SSE-NEXT: [[R71:%.*]] = shufflevector <8 x i32> [[R52]], <8 x i32> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 8, i32 9>		; SSE-NEXT: [[R7:%.*]] = insertelement <8 x i32> [[R6]], i32 [[AB7]], i32 7
; SSE-NEXT: ret <8 x i32> [[R71]]		; SSE-NEXT: ret <8 x i32> [[R7]]
;		;
; SLM-LABEL: @ashr_lshr_shl_v8i32(		; SLM-LABEL: @ashr_lshr_shl_v8i32(
; SLM-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>		; SLM-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; SLM-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>		; SLM-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; SLM-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]]		; SLM-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]]
; SLM-NEXT: [[TMP4:%.*]] = lshr <4 x i32> [[TMP1]], [[TMP2]]		; SLM-NEXT: [[TMP4:%.*]] = lshr <4 x i32> [[TMP1]], [[TMP2]]
; SLM-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>		; SLM-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>
; SLM-NEXT: [[TMP6:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>		; SLM-NEXT: [[TMP6:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
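
For readers following the SK_Select vs. SK_PermuteTwoSrc question in this diff, here is a minimal sketch of the distinction, written as standalone C++ rather than LLVM code. The helper name `isSelectStyleMask` is hypothetical. It assumes a "select" mask is one where every lane takes the element in the same lane position from one of the two sources, and that negative indices stand for undef lanes. Unlike LLVM's own mask classification, this simplified check does not require that both sources are actually used.

```cpp
#include <vector>

// Hypothetical helper for illustration only; not part of SLPVectorizer.cpp.
// Returns true if every defined lane I of Mask reads lane I of the first
// source (index I) or lane I of the second source (index NumElts + I).
// Negative entries are treated as undef lanes and are always accepted.
bool isSelectStyleMask(const std::vector<int> &Mask) {
  const int NumElts = static_cast<int>(Mask.size());
  for (int I = 0; I < NumElts; ++I) {
    const int Idx = Mask[I];
    if (Idx < 0)
      continue; // undef lane, matches anything
    if (Idx != I && Idx != NumElts + I)
      return false; // element moves across lanes: a general permute
  }
  return true;
}
```

Under these assumptions, the 4-wide mask `<0, 1, 6, 7>` from the `TMP5` check line counts as a select, while the 8-wide mask `<0, 1, 2, 3, 8, 9, undef, undef>` from the `R51`/`R52` check lines does not, because lane 4 reads lane 0 of the second source. A mask of the second kind is the sort of case that would be costed as a two-source permute.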


[SLP]Improve shuffles cost estimation where possible. (Closed, Public)

Diff 393252

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll

llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll
