Currently fcopysign for VLS vectors lowers through NEON even when the
vector width is wider than a NEON vector, causing bad codegen as the
vectors are split. This patch causes SVE to be used for these vectors
instead, giving much better codegen on wide VLS vectors.
|60,120 ms||x64 debian > AddressSanitizer-x86_64-linux.TestCases::scariness_score_test.cpp|
Script: -- : 'RUN: at line 4'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -fsanitize=address -mno-omit-leaf-frame-pointer -fno-omit-frame-pointer -fno-optimize-sibling-calls -gline-tables-only -m64 -O0 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/asan/TestCases/scariness_score_test.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/asan/X86_64LinuxConfig/TestCases/Output/scariness_score_test.cpp.tmp
|1,600 ms||x64 debian > LLVM.CodeGen/AArch64::sve-fcopysign.ll|
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/sve-fcopysign.ll -mtriple aarch64-eabi -mattr=+sve -o - | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefixes=CHECK,CHECK-NO-EXTEND-ROUND /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/sve-fcopysign.ll
|1,910 ms||x64 debian > LLVM.CodeGen/AArch64::sve-fixed-length-fcopysign.ll|
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -aarch64-sve-vector-bits-min=256 < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/sve-fixed-length-fcopysign.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/sve-fixed-length-fcopysign.ll -check-prefixes=CHECK,VBITS_GE_256,CHECK_NO_EXTEND_ROUND
|1,450 ms||x64 debian > LLVM.CodeGen/AArch64::sve2-fcopysign.ll|
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/sve2-fcopysign.ll -mtriple aarch64-eabi -mattr=+sve2 -o - | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefixes=CHECK,CHECK_NO_EXTEND_ROUND /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/sve2-fcopysign.ll
|1,660 ms||x64 debian > LLVM.CodeGen/AArch64::sve2-fixed-length-fcopysign.ll|
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -aarch64-sve-vector-bits-min=256 < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/sve2-fixed-length-fcopysign.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AArch64/sve2-fixed-length-fcopysign.ll -check-prefixes=CHECK,VBITS_GE_256,CHECK_NO_EXTEND_ROUND
When checking the output for SVE2 I see no difference, which means we're missing out on the BSL optimisation we get for scalable vectors. I think this is because you're handling the fixed->scalable lowering too late. I think you really need to edit LowerFCOPYSIGN to first convert the fixed length ISD::FCOPYSIGN to a scalable one, then let the existing scalable vector code decide how best to lower it.
Rather than have this dangling there's a large ordered/sorted block further down.
I think you mean VT.isScalableVector() here. However...
Given this bug fix, it makes me wonder whether the following code was ever exercised before this patch. Given my SVE2 comment, I think we can in fact keep the original code and just remove the fixedSVEVectorVT code?
This doesn't look safe with respect to the extend/rounding code just below. When faced with differing types, the results from both convertToScalableVector calls will be types of the same size. However, their element counts will be different. For example, take the case:
fcopysign v8f64, v8f32
this will result in:
In1 = nxv2f64 In2 = nxv4f32
which I doubt the remaining logic will handle properly. The most likely effect is a getNode assert firing for invalid operands.
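To make the element-count mismatch concrete, here is a small standalone model of the container selection. It is only a sketch: `VecTy`, `containerFor` and `sameShape` are hypothetical names invented for illustration, not LLVM APIs, and the model assumes the container always fills one 128-bit granule with elements of the input's width.

```cpp
#include <cassert>

// Simplified stand-in for a vector type (hypothetical, not llvm::EVT).
// For scalable types NumElts is the count per 128-bit granule, i.e. the
// N in nxvN.
struct VecTy {
  unsigned EltBits;
  unsigned NumElts;
  bool Scalable;
};

// Assumed model of choosing an SVE container for a fixed-length vector:
// keep the element width and fill one 128-bit granule.
VecTy containerFor(VecTy Fixed) {
  assert(Fixed.EltBits != 0 && 128u % Fixed.EltBits == 0);
  return VecTy{Fixed.EltBits, 128u / Fixed.EltBits, true};
}

// Two operands can only be combined lane by lane when their element
// counts (and scalability) agree.
bool sameShape(VecTy A, VecTy B) {
  return A.NumElts == B.NumElts && A.Scalable == B.Scalable;
}
```

Under this model, `containerFor({64, 8, false})` has 2 elements per granule (nxv2f64) while `containerFor({32, 8, false})` has 4 (nxv4f32), so `sameShape` is false for the two operands even though both fixed-length inputs had 8 elements.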
My guess is that you're not seeing this because In1 and In2 always have the same type, and indeed I couldn't immediately see a way to exercise this logic. I think this means your "mixtype" tests are likely exercising nothing new and are redundant. This is likely also true for your original patch, where you added the initial scalable vector support. If they are not exercising this code, as I suspect, then you need to either rewrite them or just remove them if there's no actual route to test this logic.
Personally I think the safest route is to simply rewrite the fixed length fcopysign into a scalable vector one after any necessary extending/rounding of the input has taken place.
For what it's worth I also think the use of FP_EXTEND/FP_ROUND is not the most efficient way to get the sign bits to align but that can be changed later.
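As one possible reading of the point about aligning sign bits, the scalar sketch below moves the sign bit into place with an integer shift rather than converting the sign operand's floating-point value. `copysignMixed` is a hypothetical helper written for this illustration only; it is not what the patch or LLVM does, just the underlying bit manipulation for an f64 magnitude and an f32 sign.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Returns Mag with the sign of Sgn, without converting Sgn to double.
double copysignMixed(double Mag, float Sgn) {
  std::uint64_t MagBits;
  std::uint32_t SgnBits;
  std::memcpy(&MagBits, &Mag, sizeof MagBits);
  std::memcpy(&SgnBits, &Sgn, sizeof SgnBits);

  const std::uint64_t SignMask = std::uint64_t(1) << 63;

  // Move the f32 sign bit (bit 31) to the f64 sign position (bit 63).
  std::uint64_t Sign = (static_cast<std::uint64_t>(SgnBits) << 32) & SignMask;

  // Bit-select: magnitude bits from Mag, sign bit from Sgn.
  std::uint64_t ResBits = (MagBits & ~SignMask) | Sign;

  double Res;
  std::memcpy(&Res, &ResBits, sizeof Res);
  return Res;
}
```

The final select is the same shape of operation that a bitwise-select instruction performs in one step, which is why only the sign bit's position, not the full converted value, matters here.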
Isn't this original code now fine and you instead just need to remove the following
  // Don't expand for NEON
  if (VT.isFixedLengthVector())
    return SDValue();
block because that is covered by the !VT.isScalableVector() check?
I believe I've corrected this now; I think you're right that the inputs will always be the same type anyway though. I agree that it is safer to leave the handling in just in case that does get triggered.
Looking generally good but I see some possible minor improvements/cleanup.
Nit: Does isFixedSVE want to move down with the use?
Is this line necessary or could it be pushed up? At a glance it appears it should already be an integer VT derived from VT. Same question for the VT assignment.
What about return EnableVectorFcopysignExtendRound;?
By this point we know the result type is legal because results are legalised before operands. What's important here is the result type remains legal after splitting the operands. Given the result and first operands have the same type this means ensuring the types of LHSLo and LHSHi are legal after splitting.
There's a function GetSplitDestVTs which returns the types expected from splitting. I mention this because I think it's better to query whether the expected types are legal before performing the actual splitting.
Ah ok I think I was considering this wrong, I thought that the result type of the concat (which is the result type of the original FCOPYSIGN) needed to be legal for us to do the transform
If that's already legal, is there a problem? Is there a case where splitting an already legal vector in two would make a vector illegal? (genuine question I'm not sure when this would pop up)
Or do we need RHSLo to be legal?
You can have multiple legal types for the same vector element type. For NEON v4f32 and v2f32 are legal. So it is possible for the result type to be legal and yet still be legal after splitting. Likewise v1f32 is not legal for NEON and so it is possible to enter with a legal type that would become illegal when split.
For the former case we can split the operation in two as you've done. For the latter we're better reverting to the original code path of calling UnrollVector. So generally what you've done is fine, it is just you're checking the wrong type (i.e. N's result type rather than the expected result type of the new FCOPYSIGN operations). Plus my comment that you probably want to use GetSplitDestVTs so you only call SplitVector for the cases that are safe.
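The decision described here can be sketched with a toy legality model. Everything below is hypothetical and standalone: `isLegalF32Vec` hard-codes the NEON f32 cases mentioned above (v2f32 and v4f32), and `splitDestElts` only mirrors the role of GetSplitDestVTs by reporting the half sizes without doing any splitting.

```cpp
#include <cassert>
#include <utility>

enum class Action { Split, Unroll };

// Toy legality table for f32 vectors: only v2f32 and v4f32 are legal.
bool isLegalF32Vec(unsigned NumElts) {
  return NumElts == 2 || NumElts == 4;
}

// Reports the element counts of the low and high halves that a split
// would produce, without performing the split.
std::pair<unsigned, unsigned> splitDestElts(unsigned NumElts) {
  unsigned Lo = NumElts / 2;
  return std::make_pair(Lo, NumElts - Lo);
}

// Check the types of the *new* operations, not the original result type:
// split only when both halves stay legal, otherwise fall back to unrolling.
Action chooseAction(unsigned NumElts) {
  std::pair<unsigned, unsigned> Halves = splitDestElts(NumElts);
  if (isLegalF32Vec(Halves.first) && isLegalF32Vec(Halves.second))
    return Action::Split;
  return Action::Unroll;
}
```

With this model, v4f32 splits into two legal v2f32 halves, whereas v2f32 would split into illegal v1f32 halves and so takes the unroll path.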
Up to you but I think EnableVectorFCopySignExtendRound looks better.
Not new, but can this be removed? It can never happen given the SrcVT.bitsLT/SrcVT.bitsGT code above.
This can be assumed, plus getContainerForFixedLengthVector will ensure the type is legal anyway.
Bookending the fixed length lowering like this has pitfalls and can complicate the code. It's better to just rewrite the fixed length operations using scalable vector types and then let the scalable vector lowering handle any complexity. Towards the start of the function you can do:
  EVT ContainerVT = getContainerForFixedLengthVector(DAG, VT);
  In1 = convertToScalableVector(DAG, ContainerVT, In1);
  In2 = convertToScalableVector(DAG, ContainerVT, In2);
  SDValue Res = DAG.getNode(ISD::FCOPYSIGN, DL, ContainerVT, In1, In2);
  return convertFromScalableVector(DAG, VT, Res);
This way it doesn't matter how complicated the scalable vector lowering gets. Doing this also means you no longer need sve2-fixed-length-fcopysign.ll because there's nothing SVE2 special about the lowering code you've added (i.e. the original sve2-fcopysign.ll tests are good enough to protect that functionality).
Documentation for combiner-vector-fcopysign-extend-round needs updating but otherwise looks good.
Please drop this part of the documentation. Although this is why you've added the flag, it is not the only reason somebody might want to use it (i.e. somebody might actually want to enable the optimisation).
You could just return DAG.getNode(....
Bogus blank line.