Introduced masks where they are not added and improved target dependent
cost models to avoid returning of the incorrect cost results after
adding masks.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Thanks for looking at this - this is well overdue!
llvm/lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
1229 | We already have something similar in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE - do you think we could have a single version of the code some place? |
llvm/lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
1229 | Will check if can merge it somehow. |
Rebase + address comments.
Currently in hangs the compiler in tests test/CodeGen/Thumb2/mve-vst4.ll and llvm/test/CodeGen/Thumb2/mve-vst3.ll, need to investigate.
I hadn't been expecting you'd have to change the legalization implemention, at least not as an initial patch - can we not keep closer to the original SplitVecRes_VECTOR_SHUFFLE implementation?
Sorry for the slow response on this one - please can you split the DAG legalization change out as the first patch and the TTI change as a followup?
llvm/include/llvm/Analysis/VectorUtils.h | ||
---|---|---|
401 | Description comment? | |
402 | Should this be splitShuffleMasks? | |
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
2488 | I'm rather surprised this didn't reduce further (same for the TTI) - maybe there's more functionality that can be moved into the computeShuffleMasks helper? |
llvm/test/Analysis/CostModel/X86/reduction.ll | ||
---|---|---|
64 | Something might be still going wrong here - SSE max legal type is v4i32 so this cost should be 0 as its referencing a single existing vector |
llvm/test/Analysis/CostModel/X86/reduction.ll | ||
---|---|---|
64 | I'm kind of pessimistic here and if the dest reg is not the same as src, I consider it as a reg copy and add cost TCC_Basic. |
Just a heads up that with this change building llvm-test-suite/SingleSource/UnitTests/matrix-types-spec.cpp crashes on X86. I'll see if I can extract a reproducer.
Hi, most probably the crash is caused by the opaque pointers. This patch does not change the vectorizer behavior, just adjusts the cost of the vector shuffles ops.
It crashes when calculating the cost with
Assertion failed: (idx < size()), function operator[], file SmallVector.h, line 273. PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script. Stack dump: 0. Program arguments: clang++ -DNDEBUG -O3 -O3 -DNDEBUG -isysroot ... 1. <eof> parser at end of file 2. Optimizer Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it): 0 clang-15 0x000000010a03f8c7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 39 1 clang-15 0x000000010a03e6f8 llvm::sys::RunSignalHandlers() + 248 2 clang-15 0x000000010a03ed20 llvm::sys::CleanupOnSignal(unsigned long) + 208 3 clang-15 0x0000000109f6417a (anonymous namespace)::CrashRecoveryContextImpl::HandleCrash(int, unsigned long) + 106 4 clang-15 0x0000000109f6435e CrashRecoverySignalHandler(int) + 110 5 libsystem_platform.dylib 0x00007ff808cfddfd _sigtramp + 29 6 libsystem_platform.dylib 000000000000000000 _sigtramp + 18446603370433094176 7 libsystem_c.dylib 0x00007ff808c33d24 abort + 123 8 libsystem_c.dylib 0x00007ff808c330cb err + 0 9 clang-15 0x000000010d263f33 llvm::processShuffleMasks(llvm::ArrayRef<int>, unsigned int, unsigned int, unsigned int, llvm::function_ref<void ()>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>) (.cold.9) + 35 10 clang-15 0x00000001091758c4 llvm::processShuffleMasks(llvm::ArrayRef<int>, unsigned int, unsigned int, unsigned int, llvm::function_ref<void ()>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>, llvm::function_ref<void (llvm::ArrayRef<int>, unsigned int, unsigned int)>) + 2244 11 clang-15 0x000000010886d07b llvm::X86TTIImpl::getShuffleCost(llvm::TargetTransformInfo::ShuffleKind, llvm::VectorType*, llvm::ArrayRef<int>, int, llvm::VectorType*, llvm::ArrayRef<llvm::Value const*>) + 1755 12 clang-15 0x00000001088666a7 llvm::TargetTransformInfoImplCRTPBase<llvm::X86TTIImpl>::getUserCost(llvm::User const*, llvm::ArrayRef<llvm::Value const*>, llvm::TargetTransformInfo::TargetCostKind) + 4551 13 clang-15 0x000000010913f4b2 llvm::TargetTransformInfo::getUserCost(llvm::User const*, llvm::ArrayRef<llvm::Value const*>, llvm::TargetTransformInfo::TargetCostKind) const + 18 14 clang-15 0x0000000109d40056 llvm::TargetTransformInfo::getUserCost(llvm::User const*, llvm::TargetTransformInfo::TargetCostKind) const + 214 15 clang-15 0x0000000108f97ed5 (anonymous namespace)::CallAnalyzer::visitInstruction(llvm::Instruction&) + 37
Reproducer below crashes with bin/opt -passes="print<cost-model>"
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-apple-darwin" define void @test() { entry: %matins.2.2 = shufflevector <9 x double> undef, <9 x double> undef, <9 x i32> <i32 0, i32 3, i32 6, i32 1, i32 4, i32 7, i32 2, i32 5, i32 8> ret void }
Hi Alexey. Here is another crash reproducer:
; bin/opt -mcpu=corei7-avx -passes="print<cost-model>" -S test.ll
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
define dso_local <12 x i64> @foo(<12 x i64> noundef %src) {
entry:
%shuffle = shufflevector <12 x i64> %src, <12 x i64> poison, <12 x i32> <i32 0, i32 3, i32 6, i32 9, i32 1, i32 4, i32 7, i32 10, i32 2, i32 5, i32 8, i32 11> ret <12 x i64> %shuffle
}
We currently root-caused a 20% regression in eigen complex to this patch and it is blocking our compiler release.
I need some time to do additional testing, given there are at least a few patches on top of this. So far I see a crash fix, two NFCs and two additional changes that may also affect performance (https://reviews.llvm.org/D114171 and https://reviews.llvm.org/D115750).
Can you please hold off on any additional changes while we test further, or include them here for review so I can check if they resolve the regressions seen so far?
Most probably, some overoptimistic cost model estimations turned on some extra vectorization. The reproducer will help, if you can provide it.
The reproducer is open source: https://gitlab.com/libeigen/eigen
Regressions are on haswell platform, xfdo configuration; the compiler used is bootstrapped against itself; revision tested includes this patch and all follow-up changes to the SLPVectorizer including D114171 and excluding D115750.
A few number samples:
BM_MatrixVectorMultiply_Complex_EigenRowMajorDynamic_float_16x64_ 16%
BM_MatrixVectorMultiply_Complex_EigenRowMajorDynamic_float_64x64_ 17%
BM_MatrixVectorMultiply_Complex_EigenRowMajorFixed_float_64x64_ 15%
BM_MatrixVectorMultiply_Complex_EigenRowMajorFixed_float_16x64_ 14%
Could you at least extract the llvm ir code for one of these functions before the changes and after, so I coukd at least compare them?
Here are the only differences I see for dynamic, 64x64; base is before the patch, experiment includes this - revert+recommit -, https://reviews.llvm.org/rG371412e065a63107d5d79330da6757ff693d91cc, https://reviews.llvm.org/D114171 and the NFCs.
Looks like the change is in the presence of poison:
Another 30-40% regression on a non-public benchmark was root-caused to https://reviews.llvm.org/D114171, so investigating that.
Hmm, I don't see the difference, actually, just changed the order of evaluation and lanes switched, no actual difference in IR. Most probably, the regression is caused by something else, maybe some lowering passes were tweaked for the previous order but not for the current.
Following up on the regression I was seeing with https://reviews.llvm.org/D114171, the only meaningful IR change seems to be:
Before the patch (after slp):
%65 = shufflevector <2 x float> %9, <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> %66 = insertelement <4 x float> %65, float %16, i32 2 %67 = insertelement <4 x float> %66, float %15, i32 3 %68 = call <4 x float> @llvm.fabs.v4f32(<4 x float> %67) %69 = fcmp oeq <4 x float> %68, <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000> %70 = freeze <4 x i1> %69 %71 = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %70)
After the patch (after slp)
%65 = shufflevector <2 x float> %9, <2 x float> poison, <4 x i32> <i32 1, i32 0, i32 undef, i32 undef> %66 = shufflevector <4 x float> poison, <4 x float> %65, <4 x i32> <i32 4, i32 5, i32 2, i32 3> %67 = insertelement <4 x float> %66, float %16, i32 2 %68 = insertelement <4 x float> %67, float %15, i32 3 %69 = call <4 x float> @llvm.fabs.v4f32(<4 x float> %68) %70 = fcmp oeq <4 x float> %69, <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000> %71 = freeze <4 x i1> %70 %72 = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %71)
I'm not clear why the order of 0,1 in the shufflevector changed or the purpose of the second shufflevector adding poison on positions that will be overwritten, but the operations that follow seem to not affected by these changes.
Just providing an update, and not digging further into it at this time.