This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Neon Polynomial vadd Intrinsic Fix
AbandonedPublic

Authored by rsanthir.quic on Apr 14 2021, 11:45 AM.

Details

Summary

The Neon vadd intrinsic for poly128 was added to both the ARMSIMD and
AArch64SIMD Intrinsic maps. The poly128 type is not support on ARM,
this patch removes the poly128 mapping from ARMSIMDIntrinsicMap
that was added in https://reviews.llvm.org/D96825

Diff Detail

Event Timeline

rsanthir.quic created this revision.Apr 14 2021, 11:45 AM
rsanthir.quic requested review of this revision.Apr 14 2021, 11:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2021, 11:45 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

What's your logic for these being Arm only?

I looked up the ones that were added:

vadd_p8
vadd_p16
vadd_p64
vaddq_p8
vaddq_p16
vaddq_p64
vaddq_p128

E.g. https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vadd_p8

It says that this is enabled for v7/A32/A64. However the pseudocode does use CheckFPAdvSIMDEnabled64 which might imply AArch64 only. There is a AArch32.CheckAdvSIMDOrFPEnabled for AArch32 but looking at vabdq_u32 which is Arm and AArch64, it also uses CheckFPAdvSIMDEnabled64 so clearly that doesn't mean much.

The weird thing is that the header already guards this with __aarch64__ so that must be based on some other property than simply being in these tables. (GCC agrees)

How did you find this? Presumably you couldn't use them from C, even without this patch.

As you mentioned, I thought it was only supported due to CheckFPAdvSIMDEnabled64. If the header is also guarding for AArch64 does that not support the idea that it is AArch64 specific?

Both clang and GCC have their issues when it comes to matching the ACLE, so I wouldn't take the header guard as fact. It could be that we never implemented the A32 path for these functions/when they were added the document was in flux/no on ever tried this on A32.

I think you could implement vadd_p8 on A32 with:

veor.u8 d0, d0, d0

I think we just show A64 versions in the documentation. I think. Possible that what I've got above is a simd instruction but not an "advanced simd" instruction and that somehow doesn't count?

(caveat: I've mostly been making sure the function prototypes match the ACLE, not actually using these to do real work)

If I bodge the header to have vadd_p8 on Arm I get:

$ cat /tmp/test.c
#include <arm_neon.h>

poly8x8_t test_vadd_p8(poly8x8_t a, poly8x8_t b) {
    return vadd_p8 (a, b);
}
$ ./bin/clang -target arm-arm-none-eabi -mcpu=cortex-a57 -S -o - /tmp/test.c -O3
<...>
test_vadd_p8:
        .fnstart
        vmov    d16, r0, r1
        vmov    d17, r2, r3
        veor    d16, d17, d16
        vmov    r0, r1, d16
        bx      lr

Which seems to confirm but I don't know why it's put behind the __aarch64__ guard.

Ok, it's behind the aarch64 guard because there's a giant #ifdef around it that I didn't see. If you move it in arm_neon.td down to outside that #ifdef you get the definitions. Next problem is that the poly128_t type isn't implemented for AArch32. Not sure why but for example in the bf16 tests:

clang/test/CodeGen/arm-bf16-reinterpret-intrinsics.c:
// TODO: poly128_t not implemented on aarch32
// CHCK-LABEL: @test_vreinterpretq_p128_bf16

If you split the def in two, one for the non Q, one for the Q parts that can work and you get the non poly128_t definitions for Arm and then I can compile my test program without modifications to the header.

--- a/clang/include/clang/Basic/arm_neon.td
+++ b/clang/include/clang/Basic/arm_neon.td
@@ -1160,7 +1160,8 @@ def SM4E : SInst<"vsm4e", "...", "QUi">;
 def SM4EKEY : SInst<"vsm4ekey", "...", "QUi">;
 }

-def VADDP   : WInst<"vadd", "...", "PcPsPlQPcQPsQPlQPk">;
+// TODO: poly128_t isn't implemented for AArch32
+def VADDP   : WInst<"vadd", "...", "QPcQPsQPlQPk">;

 ////////////////////////////////////////////////////////////////////////////////
 // Float -> Int conversions with explicit rounding mode
@@ -1630,6 +1631,9 @@ def SCALAR_VDUP_LANEQ : IInst<"vdup_laneq", "1QI", "ScSsSiSlSfSdSUcSUsSUiSUlSPcS
 }
 }

+// Everything that doesn't use poly128_t
+def VADDP_Q   : WInst<"vadd", "...", "PcPsPl">;
+
 // ARMv8.2-A FP16 vector intrinsics for A32/A64.
 let ArchGuard = "defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)" in {

However I ran into issues trying to get the relevant bits of aarc64-poly-add.c to compile for Arm and don't have time time to pursue it myself. If you can get those to run (or make an arm specific test file, some others do already) then all you'd need to do is remove the vaddq_p128 in CGBuiltin.cpp to prevent anyone thinking it is implemented for AArch32. (no one would be able to hit it without a modified header anyway)

My conclusion being that this group of intrinsics should be for A32 and A64 as the ACLE says. However we can't do all of them on A32 without poly128_t.

Thanks for looking into this @DavidSpickett ! What you found makes sense. I'll update this patch to remove only the poly128 vadd from the mapping. I'll also add another patch that will correctly enable the remaining vadd intrinsics for ARM.

only the pol128 intrinsic is incompatible with ARM, the rest should be supported.

rsanthir.quic edited the summary of this revision. (Show Details)Apr 19 2021, 9:02 AM

Here's the fix for enabling these on ARM:
https://reviews.llvm.org/D100772