This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add ARMv8.2-A FP16 vector intrinsics - Continuation
ClosedPublic

Authored by az on Jun 13 2017, 12:57 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

az created this revision.Jun 13 2017, 12:57 PM
t.p.northover accepted this revision.Jun 13 2017, 3:05 PM
t.p.northover added a subscriber: t.p.northover.

Looks like a pretty straightforward bugfix to me.

This revision is now accepted and ready to land.Jun 13 2017, 3:05 PM
SjoerdMeijer edited edge metadata.Jun 15 2017, 4:07 AM

Not sure if I am missing something, but the new regression tests arm-v8.2a-neon-intrinsics.c and aarch64-neon-intrinsics.c are failing for me (after first applying patch D32511 and then this one here in D34161). For example, the very first test in arm-v8.2a-neon-intrinsics.c checks that the function argument is <4 x half> %a, but <4 x half> %1 is generated. With anything other than -O0 you will get the %a (and then it becomes a tailcall).

az added a comment.Jun 15 2017, 12:10 PM

Not sure if I am missing something, but the new regression tests arm-v8.2a-neon-intrinsics.c and aarch64-neon-intrinsics.c are failing for me (after first applying patch D32511 and then this one here in D34161). For example, the very first test in arm-v8.2a-neon-intrinsics.c checks that the function argument is <4 x half> %a, but <4 x half> %1 is generated. With anything other than -O0 you will get the %a (and then it becomes a tailcall).

I renamed arm-v8.2a-neon-intrinsics.c to aarch64-v8.2a-neon-intrinsics.c and modified the +fp16 flag. Could it be that you are still running the old file from the old patch with the old flag? If that is not the case, then I will add this:

For the newly added file aarch64-v8.2a-neon-intrinsics.c, It passed for me in this patch and in the original one (D32511). However, it failed for other people (in the same way as you are describing: %1 vs. %a) in the original patch and the changes were reverted. That lead me to execute this test individually in D32511. It gave me a warning about the +fp16 flag and I replaced it with the correct +fullfp16 flag after consulting with you about it. I was hoping that fixing the flag and the warning will make the test pass for all. Note that the conversion of <4 x half> %1, to ##<4 x half> %a# is done by mem2reg pass (opt -S mem2reg). I am not sure why this optimization does not work for you on this test. I can either 1) replace the occurrence of variables such as %a by * in the test because we are not really testing the optimization after all or 2) send me your cmake command so that I can build and test in the same way to be able to reproduce and investigate the problem on my side.

For the old test arm-neon-intrinsics.c, I did not run it in D32511 because it has REQUIRES: long-tests and our testing configuration did not run those. Once I run it, I can reproduce the problem and I fixed it in this patch. Does it still fail for you?

Just to avoid any confusion/mistakes, can you upload the final patch that you intend to commit?
My understanding is that will include: D32511 + D34161 - arm-v8.2a-neon-intrinsics.c.
And I was indeed accidentally running the old test, but its replacement aarch64-neon-intrinsics.c that
I was also running is giving me similar problems. This looks like a very easy fix to me: if you add -O2
you don't need run it separately through opt and mem2reg (with and without gives exactly the result for me anyway),
and the only thing you need to change is e.g this expected string:

[[ABS:%.*]] = call <4 x half> @llvm.fabs.v4f16(<4 x half> %a)

to match this:

%vabs1.i = tail call <4 x half> @llvm.fabs.v4f16(<4 x half> %a)

where the only difference is "tail".

So I agree that is a straightforward bugfix and also the correct thing to do (i.e. using option +fullfp16).

az added a comment.Jun 16 2017, 4:49 PM

Ok, I will upload the combination of the two patches here.

I am kind of undecided between your suggestion of using -O2 or my previous solution of removing 'opt -S -mem2reg'. the former makes the test more readable but unstable given that future changes to optimizations at -O2 level may result in updating the tests (but, I do no expect much changes given that they are small tests).

az updated this revision to Diff 103097.Jun 19 2017, 1:41 PM

I am resubmitting the combination of D32511 and D34161 in here to avoid any confusion.

Also, I experimented with our build system and figure out a way to reproduce the problem we were discussing and it looks like adding the flag -disable-O0-optnone fixes the issue as this will not disable "opt -memreg" optimization after running clang.

SjoerdMeijer accepted this revision.Jun 20 2017, 1:41 AM

Hi Abderrazek,
Thanks! I've run the patch through some testing, and it looks all good now!
Inlined is one nit that I missed earlier, but no need for another review.
Cheers,
Sjoerd.

clang/lib/Basic/Targets.cpp
6176 ↗(On Diff #103097)

Renaming this to HasFullFP16 is more consistent.

This revision was automatically updated to reflect the committed changes.
az marked an inline comment as done.

Unfortunately this is causing problems in testing:

SplitVectorOperand Op #3: t8: ch = llvm.arm.neon.vst4lane<ST32[%0](align=2)> t3, TargetConstant:i32<699>, FrameIndex:i32<0>, undef:v4f16, undef:v4f16, undef:v4f16, undef:v4f16, Constant:i32<1>, Constant:i32<2>
fatal error: error in backend: Do not know how to split this operator's operand!

This assert is triggered in lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:1457

Reproducer:

typedef __attribute__((neon_vector_type(4))) short uint16x4_t;
void dotests_1082() {
__fp16 result[1];
  uint16x4_t __s1_0_3;
  uint16x4_t __s1_0_2;
  uint16x4_t __s1_0_1;
  uint16x4_t __s1_0_0;
  __builtin_neon_vst4_lane_v(result, __s1_0_0, __s1_0_1, __s1_0_2, __s1_0_3,
                             1, 8);
}

Compile with:

clang -O2 -target armv8-linux-gnueabihf -mcpu=cortex-a57 -D__ARM_NEON_FP16_INTRINSICS -S t.c

Befor this patch the IR looked like this:

call void @llvm.arm.neon.vst4lane.p0i8.v4i16(i8* nonnull %0, <4 x i16> undef, <4 x i16> undef, <4 x i16> undef, <4 x i16> undef, i32 1, i32 2)

And now after:

call void @llvm.arm.neon.vst4lane.p0i8.v4f16(i8* nonnull %0, <4 x half> undef, <4 x half> undef, <4 x half> undef, <4 x half> undef, i32 1, i32 2)

Note that the i16 types have changed into half and the intrinsic from i16 to f16.
It has problems with legalising type v4f16.