Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Can you split off the target specific cost model changes? This makes it easier to track down potential regressions.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1408 ↗ | (On Diff #337463) | This should be moved up I think, before the scalable vector handling. It would also be good to have cost-model tests for those shuffles. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
3823–3824 | Can the finding of a more specific ShuffleKind be done by getShuffleCost when a Mask is given? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
3823–3824 | In this case, we need to update getShuffleCost function for all the targets to translate the mask. Is it ok if I'll do it in this patch? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
3823–3824 | There's already a few different changes in this patch (AArch64 cost-model, X86 cost-model and changes to the SLPVectorizer), so I think it makes more sense to do this in a separate patch. |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1408 ↗ | (On Diff #337463) | There is getIntrinsicInstrCost-vector-reverse.ll already affected by the cost changes. |
Just because we've legalized to a certain vector type, is that sufficient guarantee that we have an appropriate shuffle operations on that legal vector type?
I think so, at least the existing code relies on this. This code just improves the existing situation. Currently, we conservatively calculate the cost of the shuffles as the permutation of all source subvectors. But in presence of mask we can estimate this cost more precisely, choosing only subvectors actually used in the permutations.
Adding @lebedev.ri who's worked on improving x86 shuffle/buildvector costmodels recently.
No i didn't.
I think while this may be somewhat correct,
is not really correct. For example, before AVX,
there is no non-32-bit shuffles, only unpacks.
Checked the cost model calculations and looks like it is correct. E.g. the cost of :
; SSE2-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %V256 = shufflevector <16 x i16> %src256, <16 x i16> undef, <16 x i32> <i32 15, i32 14, i32 13, i32 13, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
We have a permutation of 2 registers here. We have a general SK_PermuteSingleSrc for the first register <i32 15, i32 14, i32 13, i32 13, i32 11, i32 10, i32 9, i32 8>:
SSE2 {TTI::SK_PermuteSingleSrc, MVT::v8i16, 5}, // 2*pshuflw + 2*pshufhw + pshufd/unpck
and we have SK_Reverse for the second register `<i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>:
{TTI::SK_Reverse, MVT::v8i16, 3}, // pshuflw + pshufhw + pshufd
. The sum result is 8.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
3641–3645 | We incorrectly calculate the cost here (using the wrong vector type), with this fix and without this patch at least 4 X86 tests are not vectorized anymore. | |
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll | ||
6–17 | These regressions are caused by the incomplete cost model for AArch64, no cost for PermuteSingleSrc shuffle kind for VFxi16 |
Thanks for looking at this - this is well overdue!
llvm/lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
1192 | We already have something similar in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE - do you think we could have a single version of the code some place? |
llvm/lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
1192 | Will check if can merge it somehow. |
Rebase + address comments.
Currently in hangs the compiler in tests test/CodeGen/Thumb2/mve-vst4.ll and llvm/test/CodeGen/Thumb2/mve-vst3.ll, need to investigate.
I hadn't been expecting you'd have to change the legalization implemention, at least not as an initial patch - can we not keep closer to the original SplitVecRes_VECTOR_SHUFFLE implementation?
Sorry for the slow response on this one - please can you split the DAG legalization change out as the first patch and the TTI change as a followup?
llvm/include/llvm/Analysis/VectorUtils.h | ||
---|---|---|
401 ↗ | (On Diff #392814) | Description comment? |
402 ↗ | (On Diff #392814) | Should this be splitShuffleMasks? |
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
2140 ↗ | (On Diff #392814) | I'm rather surprised this didn't reduce further (same for the TTI) - maybe there's more functionality that can be moved into the computeShuffleMasks helper? |
llvm/test/Analysis/CostModel/X86/reduction.ll | ||
---|---|---|
66 | Something might be still going wrong here - SSE max legal type is v4i32 so this cost should be 0 as its referencing a single existing vector |
llvm/test/Analysis/CostModel/X86/reduction.ll | ||
---|---|---|
66 | I'm kind of pessimistic here and if the dest reg is not the same as src, I consider it as a reg copy and add cost TCC_Basic. |
Just a heads up that with this change building llvm-test-suite/SingleSource/UnitTests/matrix-types-spec.cpp crashes on X86. I'll see if I can extract a reproducer.
Hi, most probably the crash is caused by the opaque pointers. This patch does not change the vectorizer behavior, just adjusts the cost of the vector shuffles ops.
It crashes when calculating the cost with
Assertion failed: (idx < size()), function operator[], file SmallVector.h, line 273. PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script. Stack dump: 0. Program arguments: clang++ -DNDEBUG -O3 -O3 -DNDEBUG -isysroot ... 1. <eof> parser at end of file 2. Optimizer Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it): 0 clang-15 0x000000010a03f8c7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 39 1 clang-15 0x000000010a03e6f8 llvm::sys::RunSignalHandlers() + 248 2 clang-15 0x000000010a03ed20 llvm::sys::CleanupOnSignal(unsigned long) + 208 3 clang-15 0x0000000109f6417a (anonymous namespace)::CrashRecoveryContextImpl::HandleCrash(int, unsigned long) + 106 4 clang-15 0x0000000109f6435e CrashRecoverySignalHandler(int) + 110 5 libsystem_platform.dylib 0x00007ff808cfddfd _sigtramp + 29 6 libsystem_platform.dylib 000000000000000000 _sigtramp + 18446603370433094176 7 libsystem_c.dylib 0x00007ff808c33d24 abort + 123 8 libsystem_c.dylib 0x00007ff808c330cb err + 0 9 clang-15 0x000000010d263f33 llvm::processShuffleMasks(llvm::ArrayRef<int>, unsigned int, unsigned int, unsigned int, llvm::function_ref<void ()>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>) (.cold.9) + 35 10 clang-15 0x00000001091758c4 llvm::processShuffleMasks(llvm::ArrayRef<int>, unsigned int, unsigned int, unsigned int, llvm::function_ref<void ()>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>) + 2244 11 clang-15 0x000000010886d07b llvm::X86TTIImpl::getShuffleCost(llvm::TargetTransformInfo::ShuffleKind, llvm::VectorType*, llvm::ArrayRef<int>, int, llvm::VectorType*, llvm::ArrayRef<llvm::Value const*>) + 1755 12 clang-15 0x00000001088666a7 llvm::TargetTransformInfoImplCRTPBase<llvm::X86TTIImpl>::getUserCost(llvm::User const*, llvm::ArrayRef<llvm::Value const*>, llvm::TargetTransformInfo::TargetCostKind) + 4551 13 clang-15 0x000000010913f4b2 llvm::TargetTransformInfo::getUserCost(llvm::User const*, llvm::ArrayRef<llvm::Value const*>, llvm::TargetTransformInfo::TargetCostKind) const + 18 14 clang-15 0x0000000109d40056 llvm::TargetTransformInfo::getUserCost(llvm::User const*, llvm::TargetTransformInfo::TargetCostKind) const + 214 15 clang-15 0x0000000108f97ed5 (anonymous namespace)::CallAnalyzer::visitInstruction(llvm::Instruction&) + 37
Reproducer below crashes with bin/opt -passes="print<cost-model>"
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-apple-darwin" define void @test() { entry: %matins.2.2 = shufflevector <9 x double> undef, <9 x double> undef, <9 x i32> <i32 0, i32 3, i32 6, i32 1, i32 4, i32 7, i32 2, i32 5, i32 8> ret void }
Hi Alexey. Here is another crash reproducer:
; bin/opt -mcpu=corei7-avx -passes="print<cost-model>" -S test.ll
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
define dso_local <12 x i64> @foo(<12 x i64> noundef %src) {
entry:
%shuffle = shufflevector <12 x i64> %src, <12 x i64> poison, <12 x i32> <i32 0, i32 3, i32 6, i32 9, i32 1, i32 4, i32 7, i32 10, i32 2, i32 5, i32 8, i32 11> ret <12 x i64> %shuffle
}
We currently root-caused a 20% regression in eigen complex to this patch and it is blocking our compiler release.
I need some time to do additional testing, given there are at least a few patches on top of this. So far I see a crash fix, two NFCs and two additional changes that may also affect performance (https://reviews.llvm.org/D114171 and https://reviews.llvm.org/D115750).
Can you please hold off on any additional changes while we test further, or include them here for review so I can check if they resolve the regressions seen so far?
Most probably, some overoptimistic cost model estimations turned on some extra vectorization. The reproducer will help, if you can provide it.
The reproducer is open source: https://gitlab.com/libeigen/eigen
Regressions are on haswell platform, xfdo configuration; the compiler used is bootstrapped against itself; revision tested includes this patch and all follow-up changes to the SLPVectorizer including D114171 and excluding D115750.
A few number samples:
BM_MatrixVectorMultiply_Complex_EigenRowMajorDynamic_float_16x64_ 16%
BM_MatrixVectorMultiply_Complex_EigenRowMajorDynamic_float_64x64_ 17%
BM_MatrixVectorMultiply_Complex_EigenRowMajorFixed_float_64x64_ 15%
BM_MatrixVectorMultiply_Complex_EigenRowMajorFixed_float_16x64_ 14%
Could you at least extract the llvm ir code for one of these functions before the changes and after, so I coukd at least compare them?
Here are the only differences I see for dynamic, 64x64; base is before the patch, experiment includes this - revert+recommit -, https://reviews.llvm.org/rG371412e065a63107d5d79330da6757ff693d91cc, https://reviews.llvm.org/D114171 and the NFCs.
Looks like the change is in the presence of poison:
Another 30-40% regression on a non-public benchmark was root-caused to https://reviews.llvm.org/D114171, so investigating that.
Hmm, I don't see the difference, actually, just changed the order of evaluation and lanes switched, no actual difference in IR. Most probably, the regression is caused by something else, maybe some lowering passes were tweaked for the previous order but not for the current.
Following up on the regression I was seeing with https://reviews.llvm.org/D114171, the only meaningful IR change seems to be:
Before the patch (after slp):
%65 = shufflevector <2 x float> %9, <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> %66 = insertelement <4 x float> %65, float %16, i32 2 %67 = insertelement <4 x float> %66, float %15, i32 3 %68 = call <4 x float> @llvm.fabs.v4f32(<4 x float> %67) %69 = fcmp oeq <4 x float> %68, <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000> %70 = freeze <4 x i1> %69 %71 = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %70)
After the patch (after slp)
%65 = shufflevector <2 x float> %9, <2 x float> poison, <4 x i32> <i32 1, i32 0, i32 undef, i32 undef> %66 = shufflevector <4 x float> poison, <4 x float> %65, <4 x i32> <i32 4, i32 5, i32 2, i32 3> %67 = insertelement <4 x float> %66, float %16, i32 2 %68 = insertelement <4 x float> %67, float %15, i32 3 %69 = call <4 x float> @llvm.fabs.v4f32(<4 x float> %68) %70 = fcmp oeq <4 x float> %69, <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000> %71 = freeze <4 x i1> %70 %72 = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %71)
I'm not clear why the order of 0,1 in the shufflevector changed or the purpose of the second shufflevector adding poison on positions that will be overwritten, but the operations that follow seem to not affected by these changes.
Just providing an update, and not digging further into it at this time.
We already have something similar in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE - do you think we could have a single version of the code some place?