This is an archive of the discontinued LLVM Phabricator instance.

[COST]Improve cost model for shuffles in SLP.
ClosedPublic

Authored by ABataev on Apr 14 2021, 8:19 AM.

Details

Summary

Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.

Event Timeline

ABataev created this revision.Apr 14 2021, 8:19 AM
ABataev requested review of this revision.Apr 14 2021, 8:19 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2021, 8:19 AM

Adding some AArch64 reviewers

fhahn added a subscriber: fhahn.Apr 15 2021, 3:41 AM

Can you split off the target specific cost model changes? This makes it easier to track down potential regressions.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1408 ↗(On Diff #337463)

This should be moved up I think, before the scalable vector handling. It would also be good to have cost-model tests for those shuffles.

sdesmalen added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4487–4488

Can the finding of a more specific ShuffleKind be done by getShuffleCost when a Mask is given?
It seems a bit inconvenient to have to do that manually before calling this function.

RKSimon added inline comments.Apr 15 2021, 4:39 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4487–4488

I made a similar comment on D100495 - we could do with a generic 'ShuffleKind' decoder helper function (e.g. in Analysis\VectorUtils.h) that everybody can use.

ABataev added inline comments.Apr 16 2021, 6:30 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4487–4488

In this case, we need to update getShuffleCost function for all the targets to translate the mask. Is it ok if I'll do it in this patch?

sdesmalen added inline comments.Apr 20 2021, 1:21 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4487–4488

There's already a few different changes in this patch (AArch64 cost-model, X86 cost-model and changes to the SLPVectorizer), so I think it makes more sense to do this in a separate patch.

Matt added a subscriber: Matt.Apr 20 2021, 6:51 AM
ABataev added inline comments.Apr 20 2021, 9:24 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1408 ↗(On Diff #337463)

There is getIntrinsicInstrCost-vector-reverse.ll already affected by the cost changes.

ABataev updated this revision to Diff 343123.May 5 2021, 11:35 AM

Rebase + fixes

ABataev updated this revision to Diff 345554.May 14 2021, 1:54 PM

Rebase and fixes

Just because we've legalized to a certain vector type, is that sufficient guarantee that we have an appropriate shuffle operations on that legal vector type?

Just because we've legalized to a certain vector type, is that sufficient guarantee that we have an appropriate shuffle operations on that legal vector type?

I think so, at least the existing code relies on this. This code just improves the existing situation. Currently, we conservatively calculate the cost of the shuffles as the permutation of all source subvectors. But in presence of mask we can estimate this cost more precisely, choosing only subvectors actually used in the permutations.

ABataev updated this revision to Diff 351488.Jun 11 2021, 10:06 AM

Rebase + fixed gathering cost calculation.

Adding @lebedev.ri who's worked on improving x86 shuffle/buildvector costmodels recently.

lebedev.ri resigned from this revision.EditedJun 16 2021, 12:55 AM

No i didn't.
I think while this may be somewhat correct,
is not really correct. For example, before AVX,
there is no non-32-bit shuffles, only unpacks.

No i didn't.
I think while this may be somewhat correct,
is not really correct. For example, before AVX,
there is no non-32-bit shuffles, only unpacks.

Checked the cost model calculations and looks like it is correct. E.g. the cost of :

; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V256 = shufflevector <16 x i16> %src256, <16 x i16> undef, <16 x i32> <i32 15, i32 14, i32 13, i32 13, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>

We have a permutation of 2 registers here. We have a general SK_PermuteSingleSrc for the first register <i32 15, i32 14, i32 13, i32 13, i32 11, i32 10, i32 9, i32 8>:

SSE2
{TTI::SK_PermuteSingleSrc, MVT::v8i16, 5}, // 2*pshuflw + 2*pshufhw + pshufd/unpck

and we have SK_Reverse for the second register `<i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>:

{TTI::SK_Reverse, MVT::v8i16, 3}, // pshuflw + pshufhw + pshufd

. The sum result is 8.

ABataev updated this revision to Diff 354335.Jun 24 2021, 12:54 PM

Added correct calculation of the cost for shuffle reuses.

ABataev added inline comments.Jun 24 2021, 12:57 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4206–4210

We incorrectly calculate the cost here (using the wrong vector type), with this fix and without this patch at least 4 X86 tests are not vectorized anymore.

llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
6–16 ↗(On Diff #354335)

These regressions are caused by the incomplete cost model for AArch64, no cost for PermuteSingleSrc shuffle kind for VFxi16

Thanks for looking at this - this is well overdue!

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1213

We already have something similar in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE - do you think we could have a single version of the code some place?

ABataev added inline comments.Jul 15 2021, 5:39 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1213

Will check if can merge it somehow.

ABataev updated this revision to Diff 362162.Jul 27 2021, 1:22 PM

Rebase + address comments.

Currently in hangs the compiler in tests test/CodeGen/Thumb2/mve-vst4.ll and llvm/test/CodeGen/Thumb2/mve-vst3.ll, need to investigate.

I hadn't been expecting you'd have to change the legalization implemention, at least not as an initial patch - can we not keep closer to the original SplitVecRes_VECTOR_SHUFFLE implementation?

I hadn't been expecting you'd have to change the legalization implemention, at least not as an initial patch - can we not keep closer to the original SplitVecRes_VECTOR_SHUFFLE implementation?

Do you mean, use extract/inserts in case if we have more than 2 input vectors?

ABataev updated this revision to Diff 362526.Jul 28 2021, 1:49 PM

Rebase + address comments.

vporpo added a subscriber: vporpo.Nov 11 2021, 7:57 PM

Anything else here?

Sorry for the slow response on this one - please can you split the DAG legalization change out as the first patch and the TTI change as a followup?

llvm/include/llvm/Analysis/VectorUtils.h
401

Description comment?

402

Should this be splitShuffleMasks?

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
2141

I'm rather surprised this didn't reduce further (same for the TTI) - maybe there's more functionality that can be moved into the computeShuffleMasks helper?

Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2022, 1:58 PM
RKSimon added inline comments.Apr 20 2022, 2:23 PM
llvm/test/Analysis/CostModel/X86/reduction.ll
64

Something might be still going wrong here - SSE max legal type is v4i32 so this cost should be 0 as its referencing a single existing vector

ABataev added inline comments.Apr 20 2022, 2:35 PM
llvm/test/Analysis/CostModel/X86/reduction.ll
64

I'm kind of pessimistic here and if the dest reg is not the same as src, I consider it as a reg copy and add cost TCC_Basic.

RKSimon accepted this revision.Apr 26 2022, 5:09 AM

LGTM

This revision is now accepted and ready to land.Apr 26 2022, 5:09 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a comment.Apr 28 2022, 6:56 AM

Just a heads up that with this change building llvm-test-suite/SingleSource/UnitTests/matrix-types-spec.cpp crashes on X86. I'll see if I can extract a reproducer.

Just a heads up that with this change building llvm-test-suite/SingleSource/UnitTests/matrix-types-spec.cpp crashes on X86. I'll see if I can extract a reproducer.

Hi, most probably the crash is caused by the opaque pointers. This patch does not change the vectorizer behavior, just adjusts the cost of the vector shuffles ops.

fhahn added a comment.Apr 28 2022, 7:02 AM

Just a heads up that with this change building llvm-test-suite/SingleSource/UnitTests/matrix-types-spec.cpp crashes on X86. I'll see if I can extract a reproducer.

Hi, most probably the crash is caused by the opaque pointers. This patch does not change the vectorizer behavior, just adjusts the cost of the vector shuffles ops.

It crashes when calculating the cost with

Assertion failed: (idx < size()), function operator[], file SmallVector.h, line 273.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments:  clang++ -DNDEBUG -O3 -O3 -DNDEBUG -isysroot ...
1.	<eof> parser at end of file
2.	Optimizer
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  clang-15                 0x000000010a03f8c7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 39
1  clang-15                 0x000000010a03e6f8 llvm::sys::RunSignalHandlers() + 248
2  clang-15                 0x000000010a03ed20 llvm::sys::CleanupOnSignal(unsigned long) + 208
3  clang-15                 0x0000000109f6417a (anonymous namespace)::CrashRecoveryContextImpl::HandleCrash(int, unsigned long) + 106
4  clang-15                 0x0000000109f6435e CrashRecoverySignalHandler(int) + 110
5  libsystem_platform.dylib 0x00007ff808cfddfd _sigtramp + 29
6  libsystem_platform.dylib 000000000000000000 _sigtramp + 18446603370433094176
7  libsystem_c.dylib        0x00007ff808c33d24 abort + 123
8  libsystem_c.dylib        0x00007ff808c330cb err + 0
9  clang-15                 0x000000010d263f33 llvm::processShuffleMasks(llvm::ArrayRef<int>, unsigned int, unsigned int, unsigned int, llvm::function_ref<void ()>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>) (.cold.9) + 35
10 clang-15                 0x00000001091758c4 llvm::processShuffleMasks(llvm::ArrayRef<int>, unsigned int, unsigned int, unsigned int, llvm::function_ref<void ()>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>) + 2244
11 clang-15                 0x000000010886d07b llvm::X86TTIImpl::getShuffleCost(llvm::TargetTransformInfo::ShuffleKind, llvm::VectorType*, llvm::ArrayRef<int>, int, llvm::VectorType*, llvm::ArrayRef<llvm::Value const*>) + 1755
12 clang-15                 0x00000001088666a7 llvm::TargetTransformInfoImplCRTPBase<llvm::X86TTIImpl>::getUserCost(llvm::User const*, llvm::ArrayRef<llvm::Value const*>, llvm::TargetTransformInfo::TargetCostKind) + 4551
13 clang-15                 0x000000010913f4b2 llvm::TargetTransformInfo::getUserCost(llvm::User const*, llvm::ArrayRef<llvm::Value const*>, llvm::TargetTransformInfo::TargetCostKind) const + 18
14 clang-15                 0x0000000109d40056 llvm::TargetTransformInfo::getUserCost(llvm::User const*, llvm::TargetTransformInfo::TargetCostKind) const + 214
15 clang-15                 0x0000000108f97ed5 (anonymous namespace)::CallAnalyzer::visitInstruction(llvm::Instruction&) + 37

Just a heads up that with this change building llvm-test-suite/SingleSource/UnitTests/matrix-types-spec.cpp crashes on X86. I'll see if I can extract a reproducer.

Hi, most probably the crash is caused by the opaque pointers. This patch does not change the vectorizer behavior, just adjusts the cost of the vector shuffles ops.

It crashes when calculating the cost with

Assertion failed: (idx < size()), function operator[], file SmallVector.h, line 273.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments:  clang++ -DNDEBUG -O3 -O3 -DNDEBUG -isysroot ...
1.	<eof> parser at end of file
2.	Optimizer
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  clang-15                 0x000000010a03f8c7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 39
1  clang-15                 0x000000010a03e6f8 llvm::sys::RunSignalHandlers() + 248
2  clang-15                 0x000000010a03ed20 llvm::sys::CleanupOnSignal(unsigned long) + 208
3  clang-15                 0x0000000109f6417a (anonymous namespace)::CrashRecoveryContextImpl::HandleCrash(int, unsigned long) + 106
4  clang-15                 0x0000000109f6435e CrashRecoverySignalHandler(int) + 110
5  libsystem_platform.dylib 0x00007ff808cfddfd _sigtramp + 29
6  libsystem_platform.dylib 000000000000000000 _sigtramp + 18446603370433094176
7  libsystem_c.dylib        0x00007ff808c33d24 abort + 123
8  libsystem_c.dylib        0x00007ff808c330cb err + 0
9  clang-15                 0x000000010d263f33 llvm::processShuffleMasks(llvm::ArrayRef<int>, unsigned int, unsigned int, unsigned int, llvm::function_ref<void ()>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>) (.cold.9) + 35
10 clang-15                 0x00000001091758c4 llvm::processShuffleMasks(llvm::ArrayRef<int>, unsigned int, unsigned int, unsigned int, llvm::function_ref<void ()>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>) + 2244
11 clang-15                 0x000000010886d07b llvm::X86TTIImpl::getShuffleCost(llvm::TargetTransformInfo::ShuffleKind, llvm::VectorType*, llvm::ArrayRef<int>, int, llvm::VectorType*, llvm::ArrayRef<llvm::Value const*>) + 1755
12 clang-15                 0x00000001088666a7 llvm::TargetTransformInfoImplCRTPBase<llvm::X86TTIImpl>::getUserCost(llvm::User const*, llvm::ArrayRef<llvm::Value const*>, llvm::TargetTransformInfo::TargetCostKind) + 4551
13 clang-15                 0x000000010913f4b2 llvm::TargetTransformInfo::getUserCost(llvm::User const*, llvm::ArrayRef<llvm::Value const*>, llvm::TargetTransformInfo::TargetCostKind) const + 18
14 clang-15                 0x0000000109d40056 llvm::TargetTransformInfo::getUserCost(llvm::User const*, llvm::TargetTransformInfo::TargetCostKind) const + 214
15 clang-15                 0x0000000108f97ed5 (anonymous namespace)::CallAnalyzer::visitInstruction(llvm::Instruction&) + 37

Ok, the reproducer should help.

Just a heads up that with this change building llvm-test-suite/SingleSource/UnitTests/matrix-types-spec.cpp crashes on X86. I'll see if I can extract a reproducer.

Hi, most probably the crash is caused by the opaque pointers. This patch does not change the vectorizer behavior, just adjusts the cost of the vector shuffles ops.

It crashes when calculating the cost with

Assertion failed: (idx < size()), function operator[], file SmallVector.h, line 273.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments:  clang++ -DNDEBUG -O3 -O3 -DNDEBUG -isysroot ...
1.	<eof> parser at end of file
2.	Optimizer
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  clang-15                 0x000000010a03f8c7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 39
1  clang-15                 0x000000010a03e6f8 llvm::sys::RunSignalHandlers() + 248
2  clang-15                 0x000000010a03ed20 llvm::sys::CleanupOnSignal(unsigned long) + 208
3  clang-15                 0x0000000109f6417a (anonymous namespace)::CrashRecoveryContextImpl::HandleCrash(int, unsigned long) + 106
4  clang-15                 0x0000000109f6435e CrashRecoverySignalHandler(int) + 110
5  libsystem_platform.dylib 0x00007ff808cfddfd _sigtramp + 29
6  libsystem_platform.dylib 000000000000000000 _sigtramp + 18446603370433094176
7  libsystem_c.dylib        0x00007ff808c33d24 abort + 123
8  libsystem_c.dylib        0x00007ff808c330cb err + 0
9  clang-15                 0x000000010d263f33 llvm::processShuffleMasks(llvm::ArrayRef<int>, unsigned int, unsigned int, unsigned int, llvm::function_ref<void ()>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>) (.cold.9) + 35
10 clang-15                 0x00000001091758c4 llvm::processShuffleMasks(llvm::ArrayRef<int>, unsigned int, unsigned int, unsigned int, llvm::function_ref<void ()>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>) + 2244
11 clang-15                 0x000000010886d07b llvm::X86TTIImpl::getShuffleCost(llvm::TargetTransformInfo::ShuffleKind, llvm::VectorType*, llvm::ArrayRef<int>, int, llvm::VectorType*, llvm::ArrayRef<llvm::Value const*>) + 1755
12 clang-15                 0x00000001088666a7 llvm::TargetTransformInfoImplCRTPBase<llvm::X86TTIImpl>::getUserCost(llvm::User const*, llvm::ArrayRef<llvm::Value const*>, llvm::TargetTransformInfo::TargetCostKind) + 4551
13 clang-15                 0x000000010913f4b2 llvm::TargetTransformInfo::getUserCost(llvm::User const*, llvm::ArrayRef<llvm::Value const*>, llvm::TargetTransformInfo::TargetCostKind) const + 18
14 clang-15                 0x0000000109d40056 llvm::TargetTransformInfo::getUserCost(llvm::User const*, llvm::TargetTransformInfo::TargetCostKind) const + 214
15 clang-15                 0x0000000108f97ed5 (anonymous namespace)::CallAnalyzer::visitInstruction(llvm::Instruction&) + 37

Could you just share your compile options for the source file compilation?

fhahn added a comment.Apr 28 2022, 7:28 AM

Reproducer below crashes with bin/opt -passes="print<cost-model>"

target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-darwin"

define void @test() {
entry:
  %matins.2.2 = shufflevector <9 x double> undef, <9 x double> undef, <9 x i32> <i32 0, i32 3, i32 6, i32 1, i32 4, i32 7, i32 2, i32 5, i32 8>
  ret void
}

Reproducer below crashes with bin/opt -passes="print<cost-model>"

target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-darwin"

define void @test() {
entry:
  %matins.2.2 = shufflevector <9 x double> undef, <9 x double> undef, <9 x i32> <i32 0, i32 3, i32 6, i32 1, i32 4, i32 7, i32 2, i32 5, i32 8>
  ret void
}

Thanks, will revert the patch to fix the bug.

Hi Alexey. Here is another crash reproducer:

; bin/opt -mcpu=corei7-avx -passes="print<cost-model>" -S test.ll

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local <12 x i64> @foo(<12 x i64> noundef %src) {
entry:

%shuffle = shufflevector <12 x i64> %src, <12 x i64> poison, <12 x i32> <i32 0, i32 3, i32 6, i32 9, i32 1, i32 4, i32 7, i32 10, i32 2, i32 5, i32 8, i32 11>
ret <12 x i64> %shuffle

}

Hi Alexey. Here is another crash reproducer:

; bin/opt -mcpu=corei7-avx -passes="print<cost-model>" -S test.ll

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local <12 x i64> @foo(<12 x i64> noundef %src) {
entry:

%shuffle = shufflevector <12 x i64> %src, <12 x i64> poison, <12 x i32> <i32 0, i32 3, i32 6, i32 9, i32 1, i32 4, i32 7, i32 10, i32 2, i32 5, i32 8, i32 11>
ret <12 x i64> %shuffle

}

Fixed in 371412e065a63107d5d79330da6757ff693d91cc

jgorbe added a subscriber: jgorbe.May 5 2022, 11:14 AM

We currently root-caused a 20% regression in eigen complex to this patch and it is blocking our compiler release.
I need some time to do additional testing, given there are at least a few patches on top of this. So far I see a crash fix, two NFCs and two additional changes that may also affect performance (https://reviews.llvm.org/D114171 and https://reviews.llvm.org/D115750).
Can you please hold off on any additional changes while we test further, or include them here for review so I can check if they resolve the regressions seen so far?

We currently root-caused a 20% regression in eigen complex to this patch and it is blocking our compiler release.
I need some time to do additional testing, given there are at least a few patches on top of this. So far I see a crash fix, two NFCs and two additional changes that may also affect performance (https://reviews.llvm.org/D114171 and https://reviews.llvm.org/D115750).
Can you please hold off on any additional changes while we test further, or include them here for review so I can check if they resolve the regressions seen so far?

Most probably, some overoptimistic cost model estimations turned on some extra vectorization. The reproducer will help, if you can provide it.

We currently root-caused a 20% regression in eigen complex to this patch and it is blocking our compiler release.
I need some time to do additional testing, given there are at least a few patches on top of this. So far I see a crash fix, two NFCs and two additional changes that may also affect performance (https://reviews.llvm.org/D114171 and https://reviews.llvm.org/D115750).
Can you please hold off on any additional changes while we test further, or include them here for review so I can check if they resolve the regressions seen so far?

Most probably, some overoptimistic cost model estimations turned on some extra vectorization. The reproducer will help, if you can provide it.

The reproducer is open source: https://gitlab.com/libeigen/eigen
Regressions are on haswell platform, xfdo configuration; the compiler used is bootstrapped against itself; revision tested includes this patch and all follow-up changes to the SLPVectorizer including D114171 and excluding D115750.
A few number samples:
BM_MatrixVectorMultiply_Complex_EigenRowMajorDynamic_float_16x64_ 16%
BM_MatrixVectorMultiply_Complex_EigenRowMajorDynamic_float_64x64_ 17%
BM_MatrixVectorMultiply_Complex_EigenRowMajorFixed_float_64x64_ 15%
BM_MatrixVectorMultiply_Complex_EigenRowMajorFixed_float_16x64_ 14%

We currently root-caused a 20% regression in eigen complex to this patch and it is blocking our compiler release.
I need some time to do additional testing, given there are at least a few patches on top of this. So far I see a crash fix, two NFCs and two additional changes that may also affect performance (https://reviews.llvm.org/D114171 and https://reviews.llvm.org/D115750).
Can you please hold off on any additional changes while we test further, or include them here for review so I can check if they resolve the regressions seen so far?

Most probably, some overoptimistic cost model estimations turned on some extra vectorization. The reproducer will help, if you can provide it.

The reproducer is open source: https://gitlab.com/libeigen/eigen
Regressions are on haswell platform, xfdo configuration; the compiler used is bootstrapped against itself; revision tested includes this patch and all follow-up changes to the SLPVectorizer including D114171 and excluding D115750.
A few number samples:
BM_MatrixVectorMultiply_Complex_EigenRowMajorDynamic_float_16x64_ 16%
BM_MatrixVectorMultiply_Complex_EigenRowMajorDynamic_float_64x64_ 17%
BM_MatrixVectorMultiply_Complex_EigenRowMajorFixed_float_64x64_ 15%
BM_MatrixVectorMultiply_Complex_EigenRowMajorFixed_float_16x64_ 14%

Could you at least extract the llvm ir code for one of these functions before the changes and after, so I coukd at least compare them?

We currently root-caused a 20% regression in eigen complex to this patch and it is blocking our compiler release.
I need some time to do additional testing, given there are at least a few patches on top of this. So far I see a crash fix, two NFCs and two additional changes that may also affect performance (https://reviews.llvm.org/D114171 and https://reviews.llvm.org/D115750).
Can you please hold off on any additional changes while we test further, or include them here for review so I can check if they resolve the regressions seen so far?

Most probably, some overoptimistic cost model estimations turned on some extra vectorization. The reproducer will help, if you can provide it.

The reproducer is open source: https://gitlab.com/libeigen/eigen
Regressions are on haswell platform, xfdo configuration; the compiler used is bootstrapped against itself; revision tested includes this patch and all follow-up changes to the SLPVectorizer including D114171 and excluding D115750.
A few number samples:
BM_MatrixVectorMultiply_Complex_EigenRowMajorDynamic_float_16x64_ 16%
BM_MatrixVectorMultiply_Complex_EigenRowMajorDynamic_float_64x64_ 17%
BM_MatrixVectorMultiply_Complex_EigenRowMajorFixed_float_64x64_ 15%
BM_MatrixVectorMultiply_Complex_EigenRowMajorFixed_float_16x64_ 14%

Could you at least extract the llvm ir code for one of these functions before the changes and after, so I coukd at least compare them?

Here are the only differences I see for dynamic, 64x64; base is before the patch, experiment includes this - revert+recommit -, https://reviews.llvm.org/rG371412e065a63107d5d79330da6757ff693d91cc, https://reviews.llvm.org/D114171 and the NFCs.
Looks like the change is in the presence of poison:


Another 30-40% regression on a non-public benchmark was root-caused to https://reviews.llvm.org/D114171, so investigating that.

We currently root-caused a 20% regression in eigen complex to this patch and it is blocking our compiler release.
I need some time to do additional testing, given there are at least a few patches on top of this. So far I see a crash fix, two NFCs and two additional changes that may also affect performance (https://reviews.llvm.org/D114171 and https://reviews.llvm.org/D115750).
Can you please hold off on any additional changes while we test further, or include them here for review so I can check if they resolve the regressions seen so far?

Most probably, some overoptimistic cost model estimations turned on some extra vectorization. The reproducer will help, if you can provide it.

The reproducer is open source: https://gitlab.com/libeigen/eigen
Regressions are on haswell platform, xfdo configuration; the compiler used is bootstrapped against itself; revision tested includes this patch and all follow-up changes to the SLPVectorizer including D114171 and excluding D115750.
A few number samples:
BM_MatrixVectorMultiply_Complex_EigenRowMajorDynamic_float_16x64_ 16%
BM_MatrixVectorMultiply_Complex_EigenRowMajorDynamic_float_64x64_ 17%
BM_MatrixVectorMultiply_Complex_EigenRowMajorFixed_float_64x64_ 15%
BM_MatrixVectorMultiply_Complex_EigenRowMajorFixed_float_16x64_ 14%

Could you at least extract the llvm ir code for one of these functions before the changes and after, so I coukd at least compare them?

Here are the only differences I see for dynamic, 64x64; base is before the patch, experiment includes this - revert+recommit -, https://reviews.llvm.org/rG371412e065a63107d5d79330da6757ff693d91cc, https://reviews.llvm.org/D114171 and the NFCs.
Looks like the change is in the presence of poison:


Another 30-40% regression on a non-public benchmark was root-caused to https://reviews.llvm.org/D114171, so investigating that.

Hmm, I don't see the difference, actually, just changed the order of evaluation and lanes switched, no actual difference in IR. Most probably, the regression is caused by something else, maybe some lowering passes were tweaked for the previous order but not for the current.

Following up on the regression I was seeing with https://reviews.llvm.org/D114171, the only meaningful IR change seems to be:
Before the patch (after slp):

%65 = shufflevector <2 x float> %9, <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
%66 = insertelement <4 x float> %65, float %16, i32 2  
%67 = insertelement <4 x float> %66, float %15, i32 3
%68 = call <4 x float> @llvm.fabs.v4f32(<4 x float> %67)
%69 = fcmp oeq <4 x float> %68, <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000>
%70 = freeze <4 x i1> %69 
%71 = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %70)

After the patch (after slp)

%65 = shufflevector <2 x float> %9, <2 x float> poison, <4 x i32> <i32 1, i32 0, i32 undef, i32 undef>  
%66 = shufflevector <4 x float> poison, <4 x float> %65, <4 x i32> <i32 4, i32 5, i32 2, i32 3> 
%67 = insertelement <4 x float> %66, float %16, i32 2  
%68 = insertelement <4 x float> %67, float %15, i32 3
%69 = call <4 x float> @llvm.fabs.v4f32(<4 x float> %68)
%70 = fcmp oeq <4 x float> %69, <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000>
%71 = freeze <4 x i1> %70  
%72 = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %71)

I'm not clear why the order of 0,1 in the shufflevector changed or the purpose of the second shufflevector adding poison on positions that will be overwritten, but the operations that follow seem to not affected by these changes.
Just providing an update, and not digging further into it at this time.

Following up on the regression I was seeing with https://reviews.llvm.org/D114171, the only meaningful IR change seems to be:
Before the patch (after slp):

%65 = shufflevector <2 x float> %9, <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
%66 = insertelement <4 x float> %65, float %16, i32 2  
%67 = insertelement <4 x float> %66, float %15, i32 3
%68 = call <4 x float> @llvm.fabs.v4f32(<4 x float> %67)
%69 = fcmp oeq <4 x float> %68, <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000>
%70 = freeze <4 x i1> %69 
%71 = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %70)

After the patch (after slp)

%65 = shufflevector <2 x float> %9, <2 x float> poison, <4 x i32> <i32 1, i32 0, i32 undef, i32 undef>  
%66 = shufflevector <4 x float> poison, <4 x float> %65, <4 x i32> <i32 4, i32 5, i32 2, i32 3> 
%67 = insertelement <4 x float> %66, float %16, i32 2  
%68 = insertelement <4 x float> %67, float %15, i32 3
%69 = call <4 x float> @llvm.fabs.v4f32(<4 x float> %68)
%70 = fcmp oeq <4 x float> %69, <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000>
%71 = freeze <4 x i1> %70  
%72 = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %71)

I'm not clear why the order of 0,1 in the shufflevector changed or the purpose of the second shufflevector adding poison on positions that will be overwritten, but the operations that follow seem to not affected by these changes.
Just providing an update, and not digging further into it at this time.

I'm still working on the reductions improvements. Hope to fix all these regressions.