This fixes https://github.com/llvm/llvm-project/issues/50872 on x86 with SSE or AVX2, bringing
sincos() vectorization with -fveclib=libmvec in line with the handling of sin() and cos().
Event Timeline
Of the two reproducers linked in https://bugs.llvm.org/show_bug.cgi?id=51530, sincos_simd.cpp now auto-vectorizes when compiled with clang++ -fveclib=libmvec -O2 -march=core-avx2 sincos_simd.cpp, even when the #pragma omp simd annotation is left out of the source.
For the templated example sincos_simd_template.cpp, the user still needs to lend a helping hand by annotating the loop with #pragma omp simd and adding -fopenmp to the compiler flags, but at least vectorization is now possible without manually substituting the function calls.
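For illustration, here is a minimal sketch (my own, not the reproducer itself) of the kind of loop that should now be auto-vectorized; sincos() is the GNU extension declared in <math.h>:

```cpp
// Hypothetical example compiled with clang++ -fveclib=libmvec -O2 -march=core-avx2:
// the sincos() call should now be widened to the AVX2 libmvec variant
// _ZGVdN4vvv_sincos, analogous to how sin() and cos() already map to
// _ZGVdN4v_sin and _ZGVdN4v_cos.
#include <math.h>

void sincos_all(const double* phases, double* sines, double* cosines, int n) {
  for (int i = 0; i < n; i++)
    sincos(phases[i], &sines[i], &cosines[i]);
}
```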
In line with existing behavior, I have not added vector function definitions for AVX-512.
I have not included vector function definitions for @llvm.sincos.f(32|64) either as I believe nothing would be able to generate them.
They would be straightforward and I could add them if requested.
I've added regression tests to the same extent as for the existing functions, i.e. checking auto-vectorisation with float and double for vector widths 2, 4 and 8 where those are covered by SSE or AVX2.
At some point we might want to limit the number of tests. In that case I'd still recommend keeping all tests I've added, because of the unique signature of sincos() producing two results from one call.
Instead I'd suggest pruning the existing coverage of combinations of sin(), cos(), vector lengths, float/double, and libm functions / LLVM intrinsics.
We might also consider updating the documentation to mention that further functions might be auto-vectorized on some platforms, and that some combinations are not supported on others. Then again, a conscientious user would likely be able to deduce that themselves.
I haven't figured out how to link to https://bugs.llvm.org/show_bug.cgi?id=51530 so that it would automatically get closed on merging. I suppose this isn't possible anymore with bugzilla being frozen?
A potential issue is that working sincos() support in libmvec is "only" a bit over 5 years old.
E.g. a user on CentOS 7 who compiles code containing a vectorizable call to sincos() with -fveclib=libmvec and -O2 or higher will now run into an undefined reference to the vectorized sincos() function and will either have to disable auto-vectorization or update their libmvec.
Given the very specific circumstances where this applies, I am not sure whether to consider that a bug or a feature - a user that concerned with the performance of numerical code might well appreciate the change.
Correct. Bugzilla won't change anymore; any changes will be done on GitHub. I'm not sure if Phab has a way to auto-close a GitHub issue, though.
This patch looks good to me, pretty straightforward stuff, but I'm not a libmvec expert. Please wait until the relevant folks have had a look at it and approve.
Thanks!
The patch looks good to me. But please wait for the approval from other vectorization experts in the reviewer list.
Thank you.
I do not have commit access anyway, so can someone please commit this once the review is done.
llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll:362

Is this correct? This looks like it creates a sincos signature that takes vectors of pointers to doubles, but I expect most sincos vector implementations to actually use pointers to vectors of doubles. Something like:

```llvm
void @sincos(<2 x double>, <2 x double>*, <2 x double>*)
```

I hit something almost identical here: https://llvm.org/PR38424
llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll:362

I stumbled over this as well. Unfortunately the libmvec Vector ABI Spec isn't particularly enlightening on the matter:

> 2.3. Element Data Type to Vector Data Type Mapping
>
> The vector data types for parameters are selected depending on ISA, vector length, data type of original parameter, and parameter specification. For uniform and linear parameters (detailed description could be found in [1]), the original data type is preserved. For vector parameters, vector data types are selected by the compiler. The mapping from element data type to vector data type is described as below.
>
> * The bit size of vector data type of parameter is computed as: size_of_vector_data_type = VLEN * sizeof(original_parameter_data_type) * 8
>   For instance, for SSE version of vector function with parameter data type "int": If VLEN = 4, size_of_vector_data_type = 4 * 4 * 8 = 128 (bits), which means one argument of type __m128 to be passed.
> * If the size_of_vector_data_type is greater than the width of the vector register, multiple vector registers are selected and the parameter will be passed in multiple vector registers.
>   For instance, for SSE version of vector function with parameter data type "int": If VLEN = 8, size_of_vector_data_type = 8 * 4 * 8 = 256 (bits), so the vector data type is __m256, which means 2 arguments of type __m128 are to be passed.

I interpret that as the vvv part of the signature indicating the three scalar arguments as being duplicated inside vector registers, which would make the last two arguments vectors of pointers, rather than pointers to vectors. I also tested that the generated code actually works with libmvec.
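To make the two readings concrete, here is a sketch of the competing prototype shapes as C declarations (my own illustration using <immintrin.h> types; the vl8l8 variant is shown purely for contrast and is not claimed to be provided by libmvec):

```cpp
#include <immintrin.h>

// "vvv" reading (the variant this patch maps sincos() to): all three arguments
// are widened, so the two pointer arguments become vectors of pointers, passed
// packed in ymm registers (modelled here as __m256i holding four pointer lanes).
extern "C" void _ZGVdN4vvv_sincos(__m256d x, __m256i sine_ptrs, __m256i cosine_ptrs);

// "Pointer to vector" / linear reading (hypothetical vl8l8 shape): one base
// pointer per output, advanced by 8 bytes per lane, with the callee storing
// four contiguous doubles behind each pointer.
extern "C" void _ZGVdN4vl8l8_sincos(__m256d x, double* sines, double* cosines);
```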
llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll:362

Good catch! I totally missed that.

Tim, how did you test this? It's possible that vector of pointers "just worked" on X86 because it's supported, but this would probably break on non-SVE Arm.

Regardless, that's the wrong implementation, we want just vectors. Can you share the asm output of this sequence you're getting?
llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll:362

[un-inlining the discussion, as testcase + asm output are somewhat lengthy]
I have beefed up my testcase to demonstrate why I had to choose the _ZGVdN4vvv_sincos() variant for correctness, even though _ZGVdN4vl8l8_sincos() would be desirable from a performance perspective:
We have no control over what pointers the user passes in across different loop iterations.
sincosarr.cpp:
```cpp
#include <math.h>

void sincos_arr(double* sines, double* cosines, double* phases, int* indices, int size) {
  #pragma unroll 1
  for (int i = 0; i < size; i++) {
    sincos(phases[indices[i]], sines + indices[i], cosines + indices[i]);
  }
}
```
main.cpp:
```cpp
#include <stdio.h>
#include <math.h>

void sincos_arr(double* sins, double* coses, double* phases, int* indices, int size);

int main() {
  const int N = 32;
  int indices[N];
  double phases[N], sins[N], coses[N];
  for (int i = 0; i < N; i++) {
    phases[i] = i;
    indices[i] = (i < 2) ? 1 : (indices[i-2] + indices[i-1]) % N;
  }
  sincos_arr(sins, coses, phases, indices, N);
  for (int i = 0; i < N; i++) {
    int j = indices[i];
    printf("sin(%2d) == %10f == %10f | cos(%2d) == %10f == %10f\n",
           j, sin(phases[j]), sins[j], j, cos(phases[j]), coses[j]);
  }
  return 0;
}
```
Vectorized inner loop x86 assembly from clang++ -march=core-avx2 -fveclib=libmvec -O2 -S sincosarr.cpp:
```asm
	.p2align	4, 0x90
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
	vpmovsxdq	(%r14,%r12), %ymm1
	vpextrq	$1, %xmm1, %rax
	vextracti128	$1, %ymm1, %xmm0
	vpextrq	$1, %xmm0, %rcx
	vmovq	%xmm0, %rdx
	vmovsd	(%rbx,%rdx,8), %xmm0            # xmm0 = mem[0],zero
	vmovhps	(%rbx,%rcx,8), %xmm0, %xmm0     # xmm0 = xmm0[0,1],mem[0,1]
	vmovq	%xmm1, %rcx
	vmovsd	(%rbx,%rcx,8), %xmm2            # xmm2 = mem[0],zero
	vmovhps	(%rbx,%rax,8), %xmm2, %xmm2     # xmm2 = xmm2[0,1],mem[0,1]
	vinsertf128	$1, %xmm0, %ymm2, %ymm0
	vpsllq	$3, %ymm1, %ymm2
	vpaddq	48(%rsp), %ymm2, %ymm1          # 32-byte Folded Reload
	vpaddq	16(%rsp), %ymm2, %ymm2          # 32-byte Folded Reload
	callq	_ZGVdN4vvv_sincos
	addq	$16, %r12
	cmpq	%r12, %r15
	jne	.LBB0_4
```
And a variant of the code above shows that even the transformation to the vvv variant isn't safe in all cases.
I am looking into adding variations of the code above to the test-suite ahead of enabling the vectorizing transformation, to be sure the transformation is not applied when unsafe, and that the behavior of the underlying vector library matches my interpretation of the Vector ABI.
I don't see any existing tests of that kind for the transformations that are already performed.
Also, I see lots of regression tests to prevent potential performance regressions, ensuring the vectorizing transformation is not missed. But I don't see any tests currently to guard against correctness issues, ensuring the transformation is not applied in unsafe cases.
I think there is groundwork to be done before this change can be made in confidence. So please do not yet commit, even if someone should approve.
That is true, but the vectoriser won't generate code that it deems unsafe (no known bounds, aliasing) and that's why I'm assuming you need the pragmas to force vectorisation in your tests.
In your example below, size is known, and the compiler assumes the access to the i-th element from each pointer is sane (even though it could be undefined), so it can vectorise the call inside the loop regardless of what the original pointers held.
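As a hypothetical aside (my own sketch, not code from this review): the usual ways to hand the vectoriser that independence guarantee are restrict qualifiers or an OpenMP simd pragma, e.g.:

```cpp
#include <math.h>

// Sketch only: the restrict-qualified version gives the vectoriser a language-level
// no-alias guarantee, while the pragma version is the user's explicit assertion that
// the loop iterations are independent (it needs -fopenmp or -fopenmp-simd to be honoured).
void sincos_restrict(const double* __restrict phases, double* __restrict sines,
                     double* __restrict cosines, int n) {
  for (int i = 0; i < n; i++)
    sincos(phases[i], &sines[i], &cosines[i]);
}

void sincos_pragma(const double* phases, double* sines, double* cosines, int n) {
  #pragma omp simd
  for (int i = 0; i < n; i++)
    sincos(phases[i], &sines[i], &cosines[i]);
}
```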
> And a variant of the code above shows that even the transformation to the vvv variant isn't safe in all cases.
Is this with pragma or without? The compiler sometimes treats pragmas as "the user said it's safe, then it probably is".
Does this code generate unsafe transformations without any pragma or forced parameters?
> I am looking into adding variations of the code above to the test-suite ahead of enabling the vectorizing transformation, to be sure the transformation is not applied when unsafe, and that the behavior of the underlying vector library matches my interpretation of the Vector ABI.
> I don't see any existing tests of that kind for the transformations that are already performed.
Awesome, thanks!
@fpetrogalli @spatel @fhahn do you know of any tests for math library vectorisation?
> Also, I see lots of regression tests to prevent potential performance regressions, ensuring the vectorizing transformation is not missed. But I don't see any tests currently to guard against correctness issues, ensuring the transformation is not applied in unsafe cases.
It's harder to create adversarial tests than benign ones, unfortunately. That's not an excuse, just a fact. :)
It'd be awesome if we had more of such tests... (wink)
> I think there is groundwork to be done before this change can be made in confidence. So please do not yet commit, even if someone should approve.
Ack. We usually don't merge other people's patches unless they ask for it (like those that don't have commit permissions), so we should be safe.
I have just submitted the test for the testsuite. Once that is merged, this change should also be fine to go in.
> And a variant of the code above shows that even the transformation to the vvv variant isn't safe in all cases.
I must have made a mistake there, as I cannot reproduce the failure anymore.
> I'm assuming you need the pragmas to force vectorisation in your tests.
The pragma is only for convenience to shorten the loop body for manual inspection. It has no other impact on the result.
> We usually don't merge other people's patches unless they ask for it (like those that don't have commit permissions), so we should be safe.
I just wanted to be explicit, as further up I had already requested merging.