This is an archive of the discontinued LLVM Phabricator instance.

[llvm] Allow forced auto-vectorization of sincos() using libmvec
Needs ReviewPublic

Authored by tim.schmielau on Jan 9 2022, 12:29 AM.

Details

Summary

Currently auto-vectorization lacks the ability to analyze
memory dependencies caused by function calls, only dependencies
caused by explicit load and store instructions are considered.

In order to still be able to vectorize loops with calls
to basic mathematical functions, any function listed in
include/llvm/Analysis/VecFuncs.def was implicitly assumed
to be safe.

This prevents addition of sincos() and other functions returning
multiple values via pointer operands to VecFuncs.def.

As a first step we only vectorize functions with pointer
arguments if the user forcibly skips dependency checks
via #pragma clang loop vectorize(assume_safety).

Diff Detail

Event Timeline

tim.schmielau created this revision.Jan 9 2022, 12:29 AM
tim.schmielau requested review of this revision.Jan 9 2022, 12:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 9 2022, 12:29 AM
tim.schmielau edited the summary of this revision. (Show Details)Jan 9 2022, 12:46 AM
tim.schmielau added a comment.EditedJan 9 2022, 12:53 AM

Of the two reproducers linked in https://bugs.llvm.org/show_bug.cgi?id=51530, the sincos_simd.cpp now auto-vectorizes when compiling with clang++ -fveclib=libmvec -O2 -march=core-avx2 sincos_simd.cpp, even when the #pragma omp simd annotation is left out in the source.

On the templated example sincos_simd_template.cpp a helping hand from the user is needed by annotating the loop with #pragma omp simd and adding -fopenmp to the compiler flags, but at least vectorization is now possible without the user having to manually substitute function calls.

In line with existing behavior, I have not added vector function definitions for AVX-512.

I have not included vector function definitions for @llvm.sincos.f(32|64) either as I believe nothing would be able to generate them.
They would be straightforward and I could add them if requested.

I've added regression tests to the same extend as for the existing, i.e. checking auto-vectorisation with float and double for vector widths 2, 4 and 8 where those are covered by SSE or AVX-2.
At some point we might want to limit the number of tests. In that case I'd still recommend keeping all tests I've added, because of the unique signature of sincos() producing two results from one call.
Instead I'd suggest to prune the existing coverage of combinations of sin(), cos(), vector lengths, float/double and libm functions / LLVM intrinsics.

We might also consider updating the documentation by mentioning that further functions might be auto-vectorized on some platforms as well as some combinations not being supported by others, Then again, a conscient user would likely be able to deduce that themselves.

I haven't figured out how to link to https://bugs.llvm.org/show_bug.cgi?id=51530 so that it would automatically get closed on merging. I suppose this isn't possible anymore with bugzilla being frozen?

tim.schmielau edited the summary of this revision. (Show Details)Jan 9 2022, 1:15 AM

A potential issue is that working sincos() support in libmvec is "only" a bit over 5 years old.
E.g. a user on Centos 7 compiling code with a vectorizable call to sincos() and compiling with -fveclib=libmvec and -O2 or higher will now run into an undefined reference to the vectorized sincos() function and will either have to deactivate auto-vectorization or update their libmvec.

Given the very specific circumstances where this applies I am not sure whether to consider that a bug or a feature - a user that much concerned with performance of numerical codes might well appreciate the change.

I haven't figured out how to link to https://bugs.llvm.org/show_bug.cgi?id=51530 so that it would automatically get closed on merging. I suppose this isn't possible anymore with bugzilla being frozen?

Correct. Bugzilla won't change any more. Any changes will be done on Github. I'm not sure if Phab has a way to auto-close a Github issue, though.

This patch looks good to me, pretty straight forward stuff, but I'm not a libmvec expert. Please wait until the relevant folks have a look at it and approve.

Thanks!

The patch looks good to me. But please wait for the approval from other vectorization experts in the reviewer list.

tim.schmielau added a comment.EditedJan 9 2022, 10:45 AM

Thank you.
I do not have commit access anyway, so can someone please commit once review is sufficient.

RKSimon added a subscriber: RKSimon.
RKSimon added inline comments.
llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll
362

Is this correct? This looks like it creates a sincos signature that takes vectors of pointers to doubles, but I expect most sincos vector implementations to actually use pointers to vectors of doubles. Something like:

void @sincos(<2 x double>, <2 x double>*, <2 x double>*)

I hit something almost identical here: https://llvm.org/PR38424

tim.schmielau added inline comments.Jan 9 2022, 6:42 PM
llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll
362

I stumbled over this as well. Unfortunately the libmvec Vector ABI Spec isn't particularly enlightening on the matter:

2.3. Element Data Type to Vector Data Type Mapping
 
The vector data types for parameters are selected depending on ISA, vector length, data type of original parameter, and parameter specification.
For uniform and linear parameters (detailed description could be found in [1]), the original data type is preserved.
For vector parameters, vector data types are selected by the compiler. The mapping from element data type to vector data type is described as below.
* The bit size of vector data type of parameter is computed as: 
size_of_vector_data_type = VLEN * sizeof(original_parameter_data_type) * 8
For instance, for SSE version of vector function with parameter data type "int":
If VLEN = 4, size_of_vector_data_type = 4 * 4 * 8 = 128 (bits), which means one argument of type __m128 to be passed.
* If the size_of_vector_data_type is greater than the width of the vector register, multiple vector registers are selected and the parameter will be passed in multiple vector registers.
For instance, for SSE version of vector function with parameter data type "int": If VLEN = 8, size_of_vector_data_type = 8 * 4 * 8 = 256 (bits), so the vector data type is __m256, which means 2 arguments of type __m128 are to be passed.

I interpret that as the vvv part of the signature indicating the three scalar arguments as being duplicated inside vector registers, which would make the last two arguments vectors of pointers, rather than pointers to vectors. I also tested that the generated code actually works with libmvec.
However, given the lack of specific mention of pointers in the vector ABI spec I don't feel particularly confident about my interpretation.

rengolin added inline comments.Jan 10 2022, 1:02 AM
llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll
362

Good catch! I totally missed that. Tim, how did you test this?

It's possible that vector of pointers "just worked" on X86 because it's supported, but this would probably break on non-SVE Arm. Regardless, that's the wrong implementation, we want just vectors.

Can you share the asm output of this sequence you're getting?

tim.schmielau added inline comments.Jan 10 2022, 12:36 PM
llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls-VF2-VF8.ll
362

[un-inlining the discussion, as testcase + asm output are somewhat lengthy]

tim.schmielau added a comment.EditedJan 10 2022, 12:37 PM

I have beefed up my testcase to demonstrate why I had to choose the _ZGVdN4vvv_sincos() variant for correctness, even though _ZGVdN4vl8l8_sincos() would be desirable from a performance perspective:
We have no control over what pointers the user is passing in in different loop iterations.

sincosarr.cpp:

#include <math.h>

void sincos_arr(double* sines, double* cosines, double* phases, int* indices, int size) {
#pragma unroll 1
    for (int i=0; i<size; i++) {
        sincos(phases[indices[i]], sines+indices[i], cosines+indices[i]);
    }
}

main.cpp:

#include <stdio.h>
#include <math.h>

void sincos_arr(double* sins, double* coses, double* phases, int* indices, int size);

int main()
{
    const int N=32;
    int indices[N];
    double phases[N], sins[N], coses[N];
    for (int i=0; i<N; i++) {
        phases[i] = i;
        indices[i] = (i < 2) ? 1 : (indices[i-2] + indices[i-1]) % N;
    }
    sincos_arr(sins, coses, phases, indices, N);
    for (int i=0; i<N; i++) {
        int j = indices[i];
        printf("sin(%2d) == %10f == %10f | cos(%2d) == %10f == %10f\n",
               j, sin(phases[j]), sins[j],
               j, cos(phases[j]), coses[j]);
    }
    return 0;
}

Vectorized inner loop x86 assembly from clang++ -march=core-avx2 -fveclib=libmvec -O2 -S sincosarr.cpp:

    .p2align    4, 0x90
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
    vpmovsxdq   (%r14,%r12), %ymm1
    vpextrq $1, %xmm1, %rax
    vextracti128    $1, %ymm1, %xmm0
    vpextrq $1, %xmm0, %rcx
    vmovq   %xmm0, %rdx
    vmovsd  (%rbx,%rdx,8), %xmm0            # xmm0 = mem[0],zero
    vmovhps (%rbx,%rcx,8), %xmm0, %xmm0     # xmm0 = xmm0[0,1],mem[0,1]
    vmovq   %xmm1, %rcx
    vmovsd  (%rbx,%rcx,8), %xmm2            # xmm2 = mem[0],zero
    vmovhps (%rbx,%rax,8), %xmm2, %xmm2     # xmm2 = xmm2[0,1],mem[0,1]
    vinsertf128 $1, %xmm0, %ymm2, %ymm0
    vpsllq  $3, %ymm1, %ymm2
    vpaddq  48(%rsp), %ymm2, %ymm1          # 32-byte Folded Reload
    vpaddq  16(%rsp), %ymm2, %ymm2          # 32-byte Folded Reload
    callq   _ZGVdN4vvv_sincos
    addq    $16, %r12
    cmpq    %r12, %r15
    jne .LBB0_4
tim.schmielau planned changes to this revision.Jan 12 2022, 2:11 AM

And a variant of the code above shows that even the transformation to the vvv variant isn't safe in all cases.
I am looking into adding variations of the code above into the test-suite ahead of enabling the vectorizing transformation, to be sure the transformation is not applied when unsafe, and that the behavior of the underlying vector library matches my interpretation of the VectorAPI.
I don't see any existing tests around the transformations already performed.
Also, I see lots of regression tests to prevent potential performance regressions, ensuring the vectorizing transformation is not missed. But I don't see any tests currently to guard against correctness issues, ensuring the transformation is not applied in unsafe cases.

I think there is groundwork to be done before this change can be made in confidence. So please do not yet commit, even if someone should approve.

I have beefed up my testcase to demonstrate why I had to choose the _ZGVdN4vvv_sincos() variant for correctness, even though _ZGVdN4vl8l8_sincos() would be desirable from a performance perspective:
We have no control over what pointers the user is passing in in different loop iterations.

That is true, but the vectoriser won't generate code that it deems unsafe (no known bounds, aliasing) and that's why I'm assuming you need the pragmas to force vectorisation in your tests.

In your example below, size is known and the compiler assumes access to the [ith] element from each pointer is sane (even though it could be undefined), and can vectorise the call inside the loop, regardless of what the original pointers had in hand.

And a variant of the code above shows that even the transformation to the vvv variant isn't safe in all cases.

Is this with pragma or without? The compiler sometimes treats pragmas as "the user said it's safe, then it probably is".

Does this code generate unsafe transformations without any pragma or forced parameters?

I am looking into adding variations of the code above into the test-suite ahead of enabling the vectorizing transformation, to be sure the transformation is not applied when unsafe, and that the behavior of the underlying vector library matches my interpretation of the VectorAPI.
I don't see any existing tests around the transformations already performed.

Awesome, thanks!

@fpetrogalli @spatel @fhahn do you know of any tests for math library vectorisation?

Also, I see lots of regression tests to prevent potential performance regressions, ensuring the vectorizing transformation is not missed. But I don't see any tests currently to guard against correctness issues, ensuring the transformation is not applied in unsafe cases.

It's harder to create adversarial tests than benign ones, unfortunately. That's not an excuse, just a fact. :)

It'd be awesome if we had more of such tests... (wink)

I think there is groundwork to be done before this change can be made in confidence. So please do not yet commit, even if someone should approve.

Ack. We usually don't merge other people's patches unless they ask for it (like those that don't have commit permissions), so we should be safe.

Matt added a subscriber: Matt.Jan 25 2022, 3:13 PM

I have just submitted the test for the testsuite. Once that is merged, this change should also be fine to go in.

And a variant of the code above shows that even the transformation to the vvv variant isn't safe in all cases.

I must have made a mistake there, as I cannot reproduce the failure anymore.

I'm assuming you need the pragmas to force vectorisation in your tests.

The pragma is only for convenience to shorten the loop body for manual inspection. It has no other impact on the result.

We usually don't merge other people's patches unless they ask for it (like those that don't have commit permissions), so we should be safe.

I just wanted to be explicit, as further up I had already requested merging.

Herald added a project: Restricted Project. · View Herald TranscriptMar 4 2022, 1:08 AM

It's been a while since I last looked at this. However there don't seem to be many users of -fveclib=libmvec in general as demonstrated by the fact that it has been broken since before I've submitted this patch.
Before returning to allowing sincos() vectorization, I've thus submitted a separate patch to reenable vectorization of all other functions declared in VecFuncs.def already: D134732.

tim.schmielau retitled this revision from [llvm] Allow auto-vectorization of sincos() using libmvec to [llvm] Allow forced auto-vectorization of sincos() using libmvec.
tim.schmielau edited the summary of this revision. (Show Details)

To finally get this out of the way without having the necessary analysis available, I am retreating a bit and just allow forced auto-vectorization via #pragma clang loop vectorize(assume_safety).

I'm struggling to find much documentation for libmvec other than the VectorABI.txt you mentioned - do you happen to know where we can review a reference header or source code please? I've tried searching by there's plenty of libs out there with similar names....

I'm still not sure the vector-of-pointers pattern what we want to support tbh

llvm/lib/Analysis/LoopAccessAnalysis.cpp
2230–2231

I'm not sure this will work well once we want to handle sincos(<2 x double>, <2 x double>*, <2 x double>*) style patterns as well.

Reset parent to reflect reviewer comment there (removing duplicate call && test).

The failing LLVM.Transforms/LoopVectorize/AArch64::scalable-call.ll test makes me wonder what the exact semantics of the vector-function-abi-variant call attribute is in the presence of pointer arguments.

Apparently it does not imply vectorization is safe, because then we wouldn't need to invoke LoopAccessAnalysis at all. But how is LoopAccessAnalysis supposed to work if we don't know the length of the array pointed at? Do we assume pointers always point to a single element?

tim.schmielau added inline comments.Sep 29 2022, 8:46 AM
llvm/lib/Analysis/LoopAccessAnalysis.cpp
2230–2231

It is indeed only meant as a stopgap solution - don't vectorize any function with pointer arguments unless the user has asserted it is safe to do so.
How would your pattern cause an issue if the user has already asserted it is safe to vectorize?

(On an unrelated note I should probably recordAnalysis() though when rejecting vectorization.)

I'm struggling to find much documentation for libmvec other than the VectorABI.txt you mentioned - do you happen to know where we can review a reference header or source code please? I've tried searching by there's plenty of libs out there with similar names....

I'm still not sure the vector-of-pointers pattern what we want to support tbh

I find the libmvec sources hard to read, and agree there isn't much useful material out there (at least not that I am aware of).
Most illuminating is probably the commit that changed the sincos() vector variant signature to what it currently is: https://sourceware.org/git/?p=glibc.git;a=commit;h=ee2196bb6766ca7e63a1ba22ebb7619a3266776a

Here is the discussion that lead up to the glibc / libmvec commit linked above: https://marc.info/?t=146472287500003

The glibc Bugzilla ticket probably makes the best reading so far: https://sourceware.org/bugzilla/show_bug.cgi?id=20024

n-omer added a subscriber: n-omer.Jul 13 2023, 2:34 AM

Hi @tim.schmielau, your assumption that libm-vec expects vectors filled with pointers is correct, as shown by https://elixir.bootlin.com/glibc/glibc-2.37.9000/source/sysdeps/x86_64/fpu/test-vector-abi-sincos.h#L46.

Do you plan on continuing to work on this patch?