This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add vmulxh_lane FP16 intrinsics
AbandonedPublic

Authored by SjoerdMeijer on Mar 7 2018, 12:01 PM.

Details

Summary

Add 2 vmulxh_lane vector intrinsics that were commented out.

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Mar 7 2018, 12:01 PM
evandro accepted this revision.Mar 7 2018, 12:57 PM

Looks pretty straightforward to me.

This revision is now accepted and ready to land.Mar 7 2018, 12:57 PM
SjoerdMeijer added inline comments.Mar 8 2018, 2:22 AM
include/clang/Basic/arm_neon.td
1504

I found that unfortunately it's not that straightforward. This leads to wrong code generation as it is generating a fmul instead of fmulx. I am suspecting this instruction description should be using OP_SCALAR_MULX_LN, but also the type decls are wrong. Need to dig a bit further here.

az added inline comments.Mar 8 2018, 5:20 PM
include/clang/Basic/arm_neon.td
1504

Sorry for confusion as the commented code was never intended to be used and it is a copy of the code for the intrinsic vmulh_lane(). It was done that way in order to point out that vmulh_lane() and vmulxh_lane() intrinsics should be implemented in a similar way. The only useful thing in the commented code is the explanation that we need the scalar intrinsic vmulxh_f16() which was implemented in the scalar intrinsic patch later on.

If we look at how vmulh_lane (a, b, lane) is implemented:
x = extract (b, lane);
res = a * x;
return res;

Similarly, I thought at the time that vmulxh_lane (a, b, lane) can be implemented:
x = extract (b, lane);
res = vmulxh_f16 (a, x); // no llvm native mulx instruction, so we use the fp16 scalar intrinsic.
return res;

I am not sure now that we can easily use scalar intrinsic while generating the arm_neon.h file. In case we can not do that, I am thinking that the frontend should generate a new builtin for intrinsic vmulxh_lane() that the backend recognizes and generate the right code for it which is fmulx h0, h0, v1.h[lane]. If you made or will be making progress on this, then that is great. Otherwise, I can look at a frontend solution for it.

SjoerdMeijer added inline comments.Mar 12 2018, 4:53 AM
include/clang/Basic/arm_neon.td
1504

Hi Abderrazek,
Thanks for the clarifications! And I agree with your observations.
This simple changed looked to do the right thing, because as you also said, this vmulx is just an extract and a multiply, but then it was incorrectly generating a fmul which should be a fmulx. I briefly looked at fixing this, but also didn't see how I could use the scalar intrinsic here. Looks like passing a builtin is indeed the best thing, also because fmulx is instruction selected based on a intrinsic:

defm FMULX    : SIMDThreeSameVectorFP<0,0,0b011,"fmulx", int_aarch64_neon_fmulx>;

If you have the bandwidth to pick this up, that would be great; I started looking into the other failing AArch64 vector intrinsics.

Cheers,
Sjoerd.

az added a comment.Mar 16 2018, 4:02 PM

Was not able to update this particular review with the new code, So I created a new one in https://reviews.llvm.org/D44591

I manage to reuse the mulx scalar intrinsic work, not exactly calling the fp16 scalar intrinsic itself which is not available here but the same frontend codegen work with an extract instruction before that.

SjoerdMeijer abandoned this revision.Mar 19 2018, 6:55 AM

This is implemented in D44591.