
[llvm] Allow auto-vectorization of sincos() using libmvec
Changes Planned · Public

Authored by tim.schmielau on Jan 9 2022, 12:29 AM.

Details

Summary

This fixes https://github.com/llvm/llvm-project/issues/50872 on x86 with SSE or AVX2, bringing
sincos() vectorization with -fveclib=libmvec in line with the handling of sin() and cos().

Diff Detail

Unit Tests: Failed

Time: 2,560 ms
Test: x64 debian > libFuzzer.libFuzzer::fuzzer-finalstats.test
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -O2 -gline-tables-only -fsanitize=address,fuzzer -I/var/lib/buildkite-agent/builds/llvm-project/compiler-rt/lib/fuzzer -m64 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/fuzzer/SimpleTest.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/fuzzer/X86_64DefaultLinuxConfig/Output/fuzzer-finalstats.test.tmp-SimpleTest

Event Timeline

tim.schmielau created this revision. · Jan 9 2022, 12:29 AM
tim.schmielau requested review of this revision. · Jan 9 2022, 12:29 AM
Herald added a project: Restricted Project. · Jan 9 2022, 12:29 AM
tim.schmielau edited the summary of this revision. · Jan 9 2022, 12:46 AM
tim.schmielau added a comment (edited). · Jan 9 2022, 12:53 AM

Of the two reproducers linked in https://bugs.llvm.org/show_bug.cgi?id=51530, sincos_simd.cpp now auto-vectorizes when compiled with clang++ -fveclib=libmvec -O2 -march=core-avx2 sincos_simd.cpp, even when the #pragma omp simd annotation is left out of the source.

The templated example sincos_simd_template.cpp still needs a helping hand from the user: the loop has to be annotated with #pragma omp simd and -fopenmp added to the compiler flags. But at least vectorization is now possible without the user having to manually substitute function calls.
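For reference, the shape of loop this is about looks roughly like the following (a minimal sketch, not the actual reproducer; function and variable names are made up):

// Minimal sketch (not the reproducer from the bug report): with
// -fveclib=libmvec -fopenmp -O2 -march=core-avx2 this loop should now be
// vectorized into calls to a _ZGV*_sincos entry point.
#include <math.h>

void sincos_all(const double* phases, double* sines, double* cosines, int n) {
#pragma omp simd
    for (int i = 0; i < n; i++)
        sincos(phases[i], &sines[i], &cosines[i]);
}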

In line with existing behavior, I have not added vector function definitions for AVX-512.

I have not included vector function definitions for @llvm.sincos.f(32|64) either as I believe nothing would be able to generate them.
They would be straightforward and I could add them if requested.

I've added regression tests to the same extent as the existing ones, i.e. checking auto-vectorisation with float and double for vector widths 2, 4, and 8 where those are covered by SSE or AVX2.
At some point we might want to limit the number of tests. In that case I'd still recommend keeping all the tests I've added, because of the unique signature of sincos(), which produces two results from one call.
Instead, I'd suggest pruning the existing coverage of combinations of sin(), cos(), vector lengths, float/double, and libm functions / LLVM intrinsics.

We might also consider updating the documentation to mention that further functions may be auto-vectorized on some platforms and that some combinations are not supported on others. Then again, a conscientious user would likely be able to deduce that themselves.

I haven't figured out how to link to https://bugs.llvm.org/show_bug.cgi?id=51530 so that it would automatically get closed on merging. I suppose this isn't possible anymore with bugzilla being frozen?

tim.schmielau edited the summary of this revision. · Jan 9 2022, 1:15 AM

A potential issue is that working sincos() support in libmvec is "only" a bit over 5 years old.
E.g. a user on CentOS 7 who compiles code containing a vectorizable call to sincos() with -fveclib=libmvec and -O2 or higher will now run into an undefined reference to the vectorized sincos() function, and will either have to disable auto-vectorization or update their libmvec.

Given the very specific circumstances in which this applies, I am not sure whether to consider that a bug or a feature; a user that concerned with the performance of numerical code might well appreciate the change.
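For anyone who wants to check up front whether their installed libmvec already exports the vectorized entry point, a rough runtime probe could look like this (a sketch only, not part of the patch; it merely tests for the presence of the AVX2 symbol by name and says nothing about the behavior of the implementation behind it):

// Probe: does the installed glibc libmvec export the AVX2 sincos entry point
// that the vectorizer will now call? Build with: clang++ probe.cpp -ldl
#include <dlfcn.h>
#include <cstdio>

int main() {
    void* libmvec = dlopen("libmvec.so.1", RTLD_LAZY);
    if (!libmvec) {
        std::puts("no libmvec found - glibc is too old for -fveclib=libmvec");
        return 1;
    }
    void* sym = dlsym(libmvec, "_ZGVdN4vvv_sincos");
    std::puts(sym ? "_ZGVdN4vvv_sincos is available"
                  : "_ZGVdN4vvv_sincos is missing");
    dlclose(libmvec);
    return sym ? 0 : 1;
}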

I haven't figured out how to link to https://bugs.llvm.org/show_bug.cgi?id=51530 so that it would automatically get closed on merging. I suppose this isn't possible anymore with bugzilla being frozen?

Correct. Bugzilla won't change any more. Any changes will be done on Github. I'm not sure if Phab has a way to auto-close a Github issue, though.

This patch looks good to me, pretty straightforward stuff, but I'm not a libmvec expert. Please wait until the relevant folks have had a look at it and approved.

Thanks!

The patch looks good to me, but please wait for approval from the other vectorization experts on the reviewer list.

tim.schmielau added a comment (edited). · Jan 9 2022, 10:45 AM

Thank you.
I do not have commit access anyway, so could someone please commit this once the review is complete?

RKSimon added a subscriber: RKSimon.
RKSimon added inline comments.
llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll
362

Is this correct? This looks like it creates a sincos signature that takes vectors of pointers to doubles, but I expect most sincos vector implementations to actually use pointers to vectors of doubles. Something like:

void @sincos(<2 x double>, <2 x double>*, <2 x double>*)

I hit something almost identical here: https://llvm.org/PR38424

tim.schmielau added inline comments. · Jan 9 2022, 6:42 PM
llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll
362

I stumbled over this as well. Unfortunately the libmvec Vector ABI Spec isn't particularly enlightening on the matter:

2.3. Element Data Type to Vector Data Type Mapping
 
The vector data types for parameters are selected depending on ISA, vector length, data type of original parameter, and parameter specification.
For uniform and linear parameters (detailed description could be found in [1]), the original data type is preserved.
For vector parameters, vector data types are selected by the compiler. The mapping from element data type to vector data type is described as below.
* The bit size of vector data type of parameter is computed as: 
size_of_vector_data_type = VLEN * sizeof(original_parameter_data_type) * 8
For instance, for SSE version of vector function with parameter data type "int":
If VLEN = 4, size_of_vector_data_type = 4 * 4 * 8 = 128 (bits), which means one argument of type __m128 to be passed.
* If the size_of_vector_data_type is greater than the width of the vector register, multiple vector registers are selected and the parameter will be passed in multiple vector registers.
For instance, for SSE version of vector function with parameter data type "int": If VLEN = 8, size_of_vector_data_type = 8 * 4 * 8 = 256 (bits), so the vector data type is __m256, which means 2 arguments of type __m128 are to be passed.

I interpret the vvv part of the signature as indicating that all three scalar arguments are duplicated across lanes of vector registers, which would make the last two arguments vectors of pointers rather than pointers to vectors. I also tested that the generated code actually works with libmvec.
However, given that the vector ABI spec never specifically mentions pointer parameters, I don't feel particularly confident about this interpretation.
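To make the two readings concrete, here is how I would spell the AVX2 variant under each interpretation (illustrative C prototypes only, not copied from any header; the l8l8-mangled variant is hypothetical here and may not exist in a given libmvec):

// Sketch of the two possible AVX2 signatures; declarations are illustrative.
#include <immintrin.h>

extern "C" {
// "vvv" reading: all three parameters are widened per lane, so the two
// double* arguments become vectors of four 64-bit pointers each, passed in
// integer vector registers - scattered per-lane destinations.
void _ZGVdN4vvv_sincos(__m256d x, __m256i sine_ptrs, __m256i cosine_ptrs);

// "l8l8" reading: x is widened, the pointers stay scalar and step by 8 bytes
// per lane, i.e. they point at contiguous blocks of four doubles
// ("pointer to vector").
void _ZGVdN4vl8l8_sincos(__m256d x, double* sines, double* cosines);
}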

rengolin added inline comments. · Jan 10 2022, 1:02 AM
llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll
362

Good catch! I totally missed that. Tim, how did you test this?

It's possible that vectors of pointers "just worked" on X86 because they're supported there, but this would probably break on non-SVE Arm. Regardless, that's the wrong implementation; we want just vectors.

Can you share the asm output of this sequence you're getting?

tim.schmielau added inline comments. · Jan 10 2022, 12:36 PM
llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll
362

[un-inlining the discussion, as the test case and asm output are somewhat lengthy]

tim.schmielau added a comment (edited). · Jan 10 2022, 12:37 PM

I have beefed up my test case to demonstrate why I had to choose the _ZGVdN4vvv_sincos() variant for correctness, even though _ZGVdN4vl8l8_sincos() would be desirable from a performance perspective:
We have no control over what pointers the user passes in across loop iterations.

sincosarr.cpp:

#include <math.h>

// Indexed (scatter) variant: each iteration reads the phase and writes the
// results through user-supplied indices, so the destinations need not be
// contiguous. The pragma merely disables unrolling to keep the generated
// loop body short for inspection.
void sincos_arr(double* sines, double* cosines, double* phases, int* indices, int size) {
#pragma unroll 1
    for (int i=0; i<size; i++) {
        sincos(phases[indices[i]], sines+indices[i], cosines+indices[i]);
    }
}

main.cpp:

#include <stdio.h>
#include <math.h>

void sincos_arr(double* sins, double* coses, double* phases, int* indices, int size);

int main()
{
    const int N=32;
    int indices[N];
    double phases[N], sins[N], coses[N];
    // Fibonacci-style indices mod N: non-contiguous and possibly repeated targets.
    for (int i=0; i<N; i++) {
        phases[i] = i;
        indices[i] = (i < 2) ? 1 : (indices[i-2] + indices[i-1]) % N;
    }
    sincos_arr(sins, coses, phases, indices, N);
    // Compare the (vectorized) sincos_arr results against scalar sin()/cos().
    for (int i=0; i<N; i++) {
        int j = indices[i];
        printf("sin(%2d) == %10f == %10f | cos(%2d) == %10f == %10f\n",
               j, sin(phases[j]), sins[j],
               j, cos(phases[j]), coses[j]);
    }
    return 0;
}

Vectorized inner loop x86 assembly from clang++ -march=core-avx2 -fveclib=libmvec -O2 -S sincosarr.cpp:

    .p2align    4, 0x90
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
    vpmovsxdq   (%r14,%r12), %ymm1
    vpextrq $1, %xmm1, %rax
    vextracti128    $1, %ymm1, %xmm0
    vpextrq $1, %xmm0, %rcx
    vmovq   %xmm0, %rdx
    vmovsd  (%rbx,%rdx,8), %xmm0            # xmm0 = mem[0],zero
    vmovhps (%rbx,%rcx,8), %xmm0, %xmm0     # xmm0 = xmm0[0,1],mem[0,1]
    vmovq   %xmm1, %rcx
    vmovsd  (%rbx,%rcx,8), %xmm2            # xmm2 = mem[0],zero
    vmovhps (%rbx,%rax,8), %xmm2, %xmm2     # xmm2 = xmm2[0,1],mem[0,1]
    vinsertf128 $1, %xmm0, %ymm2, %ymm0
    vpsllq  $3, %ymm1, %ymm2
    vpaddq  48(%rsp), %ymm2, %ymm1          # 32-byte Folded Reload
    vpaddq  16(%rsp), %ymm2, %ymm2          # 32-byte Folded Reload
    callq   _ZGVdN4vvv_sincos
    addq    $16, %r12
    cmpq    %r12, %r15
    jne .LBB0_4
tim.schmielau planned changes to this revision. · Jan 12 2022, 2:11 AM

And a variant of the code above shows that even the transformation to the vvv variant isn't safe in all cases.
I am looking into adding variations of the code above to the test-suite ahead of enabling the vectorizing transformation, to make sure the transformation is not applied when it is unsafe and that the behavior of the underlying vector library matches my interpretation of the vector ABI.
I don't see any existing tests around the transformations that are already performed.
Also, I see lots of regression tests guarding against performance regressions, i.e. ensuring the vectorizing transformation is not missed, but I don't see any tests guarding against correctness issues, i.e. ensuring the transformation is not applied in unsafe cases.

I think there is groundwork to be done before this change can be made with confidence. So please do not commit yet, even if someone should approve.

I have beefed up my test case to demonstrate why I had to choose the _ZGVdN4vvv_sincos() variant for correctness, even though _ZGVdN4vl8l8_sincos() would be desirable from a performance perspective:
We have no control over what pointers the user passes in across loop iterations.

That is true, but the vectoriser won't generate code that it deems unsafe (no known bounds, aliasing), which is why I'm assuming you need the pragmas to force vectorisation in your tests.

In your example below, size is known and the compiler assumes that accessing the i-th element through each pointer is sane (even though it could be undefined), so it can vectorise the call inside the loop regardless of what the original pointers point at.

And a variant of the code above shows that even the transformation to the vvv variant isn't safe in all cases.

Is this with the pragma or without? The compiler sometimes treats pragmas as "the user said it's safe, so it probably is".

Does this code generate unsafe transformations without any pragma or forced parameters?

I am looking into adding variations of the code above to the test-suite ahead of enabling the vectorizing transformation, to make sure the transformation is not applied when it is unsafe and that the behavior of the underlying vector library matches my interpretation of the vector ABI.
I don't see any existing tests around the transformations that are already performed.

Awesome, thanks!

@fpetrogalli @spatel @fhahn do you know of any tests for math library vectorisation?

Also, I see lots of regression tests guarding against performance regressions, i.e. ensuring the vectorizing transformation is not missed, but I don't see any tests guarding against correctness issues, i.e. ensuring the transformation is not applied in unsafe cases.

It's harder to create adversarial tests than benign ones, unfortunately. That's not an excuse, just a fact. :)

It'd be awesome if we had more such tests... (wink)

I think there is groundwork to be done before this change can be made with confidence. So please do not commit yet, even if someone should approve.

Ack. We usually don't merge other people's patches unless they ask for it (like those who don't have commit permissions), so we should be safe.

Matt added a subscriber: Matt. · Jan 25 2022, 3:13 PM

I have just submitted the test for the test-suite. Once that is merged, this change should also be fine to go in.

And a variant of the code above shows that even the transformation to the vvv variant isn't safe in all cases.

I must have made a mistake there, as I cannot reproduce the failure anymore.

I'm assuming you need the pragmas to force vectorisation in your tests.

The pragma is only for convenience to shorten the loop body for manual inspection. It has no other impact on the result.

We usually don't merge other people's patches unless they ask for it (like those that don't have commit permissions), so we should be safe.

I just wanted to be explicit, as further up I had already requested merging.

Herald added a project: Restricted Project. · Mar 4 2022, 1:08 AM