This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add vmulxh_lane FP16 intrinsics
Abandoned · Public

Authored by SjoerdMeijer on Mar 7 2018, 12:01 PM.

Details

Reviewers

az
evandro
olista01

Summary

Add 2 vmulxh_lane vector intrinsics that were commented out.

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Mar 7 2018, 12:01 PM

Herald added subscribers: kristof.beyls, javed.absar, rengolin. Mar 7 2018, 12:01 PM

Looks pretty straightforward to me.

This revision is now accepted and ready to land. Mar 7 2018, 12:57 PM

SjoerdMeijer added inline comments. Mar 8 2018, 2:22 AM

include/clang/Basic/arm_neon.td
1504	Unfortunately, I found that it's not that straightforward. This leads to wrong code generation: it generates an fmul instead of an fmulx. I suspect this instruction description should be using OP_SCALAR_MULX_LN, but the type declarations are also wrong. I need to dig a bit further here.

az added inline comments. Mar 8 2018, 5:20 PM

include/clang/Basic/arm_neon.td
1504	Sorry for the confusion: the commented-out code was never intended to be used. It is a copy of the code for the intrinsic vmulh_lane(), placed there to point out that the vmulh_lane() and vmulxh_lane() intrinsics should be implemented in a similar way. The only useful thing in the commented-out code is the explanation that we need the scalar intrinsic vmulxh_f16(), which was implemented later in the scalar intrinsic patch.

If we look at how vmulh_lane (a, b, lane) is implemented:

    x = extract (b, lane);
    res = a * x;
    return res;

Similarly, I thought at the time that vmulxh_lane (a, b, lane) could be implemented as:

    x = extract (b, lane);
    res = vmulxh_f16 (a, x); // no llvm native mulx instruction, so we use the fp16 scalar intrinsic.
    return res;

I am no longer sure that we can easily use the scalar intrinsic while generating the arm_neon.h file. If we cannot, I think the frontend should generate a new builtin for the vmulxh_lane() intrinsic that the backend recognizes and turns into the right code, which is fmulx h0, h0, v1.h[lane]. If you have made or will be making progress on this, that is great. Otherwise, I can look at a frontend solution for it.
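For readers following the thread, the difference between the two lane sequences described above can be modelled in portable C. This is only a sketch on plain float values, not the Clang or arm_neon.h implementation: the helper names mulx_ref, mul_lane_ref, mulx_lane_ref and mulx_lane_demo are hypothetical, and the model assumes the architectural FMULX behaviour in which zero times infinity yields 2.0 with the sign of the exclusive-or of the operand signs, rather than NaN.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical reference model of FMULX on plain floats (not a Clang API).
   It behaves like an ordinary multiply, except that (+/-0) * (+/-inf)
   yields +/-2.0 instead of NaN. */
static float mulx_ref(float a, float b) {
    int zero_times_inf = (a == 0.0f && isinf(b)) || (isinf(a) && b == 0.0f);
    if (zero_times_inf) {
        /* The result sign is the XOR of the two operand signs. */
        int negative = (signbit(a) != 0) != (signbit(b) != 0);
        return negative ? -2.0f : 2.0f;
    }
    return a * b;
}

/* Models the vmulh_lane pattern: extract a lane, then a plain multiply. */
static float mul_lane_ref(float a, const float *b, int lane) {
    float x = b[lane];
    return a * x;
}

/* Models the intended vmulxh_lane pattern: extract a lane, then mulx. */
static float mulx_lane_ref(float a, const float *b, int lane) {
    float x = b[lane];
    return mulx_ref(a, x);
}

/* Small self-check showing where the two sequences diverge. */
static void mulx_lane_demo(void) {
    const float b[4] = {1.0f, 2.0f, 3.0f, INFINITY};
    assert(mul_lane_ref(2.0f, b, 1) == mulx_lane_ref(2.0f, b, 1));
    assert(isnan(mul_lane_ref(0.0f, b, 3)));    /* plain multiply: NaN */
    assert(mulx_lane_ref(0.0f, b, 3) == 2.0f);  /* multiply extended: 2.0 */
}
```

Calling mulx_lane_demo() from a test driver exercises both paths. For ordinary operands the two sequences agree, which is presumably why emitting a plain fmul can look right at first glance; they only diverge in the zero-times-infinity case.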

SjoerdMeijer added inline comments. Mar 12 2018, 4:53 AM

include/clang/Basic/arm_neon.td
1504	Hi Abderrazek,

Thanks for the clarifications! I agree with your observations. This simple change looked like it did the right thing because, as you also said, this vmulx is just an extract and a multiply, but it then incorrectly generated an fmul where it should have generated an fmulx. I briefly looked at fixing this, but also didn't see how I could use the scalar intrinsic here. Passing a builtin does look like the best option, also because fmulx is instruction-selected based on an intrinsic:

    defm FMULX : SIMDThreeSameVectorFP<0,0,0b011,"fmulx", int_aarch64_neon_fmulx>;

If you have the bandwidth to pick this up, that would be great; I have started looking into the other failing AArch64 vector intrinsics.

Cheers, Sjoerd.

I was not able to update this particular review with the new code, so I created a new one at https://reviews.llvm.org/D44591

I managed to reuse the mulx scalar intrinsic work: not by calling the fp16 scalar intrinsic itself, which is not available here, but by using the same frontend codegen work with an extract instruction before it.

This is implemented in D44591.

Revision Contents

Path                                            Size
include/clang/Basic/arm_neon.td                 9 lines
test/CodeGen/aarch64-v8.2a-neon-intrinsics.c    26 lines

Diff 137452

include/clang/Basic/arm_neon.td

 let ArchGuard = "defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC) && defined(__aarch64__)" in {
 // Scalar floating point multiply (scalar, by element)
 def SCALAR_FMUL_LANEH : IOpInst<"vmul_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>;
 def SCALAR_FMUL_LANEQH : IOpInst<"vmul_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>;

 // Mulx lane
 def VMULX_LANEH : IOpInst<"vmulx_lane", "ddgi", "hQh", OP_MULX_LN>;
 def VMULX_LANEQH : IOpInst<"vmulx_laneq", "ddji", "hQh", OP_MULX_LN>;
 def VMULX_NH : IOpInst<"vmulx_n", "dds", "hQh", OP_MULX_N>;
-// TODO: Scalar floating point multiply extended (scalar, by element)
-// Below ones are commented out because they need vmulx_f16(float16_t, float16_t)
-// which will be implemented later with fp16 scalar intrinsic (arm_fp16.h)
-//def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>;
-//def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>;
+// Scalar floating point multiply extended (scalar, by element)
+def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>;
+def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>;

 // ARMv8.2-A FP16 reduction vector intrinsics.
 def VMAXVH : SInst<"vmaxv", "sd", "hQh">;
 def VMINVH : SInst<"vminv", "sd", "hQh">;
 def FMAXNMVH : SInst<"vmaxnmv", "sd", "hQh">;
 def FMINNMVH : SInst<"vminnmv", "sd", "hQh">;

 // Data processing intrinsics - section 5

test/CodeGen/aarch64-v8.2a-neon-intrinsics.c

 // CHECK: [[TMP6:%.*]] = insertelement <8 x half> [[TMP5]], half %b, i32 6
 // CHECK: [[TMP7:%.*]] = insertelement <8 x half> [[TMP6]], half %b, i32 7
 // CHECK: [[MUL:%.*]] = call <8 x half> @llvm.aarch64.neon.fmulx.v8f16(<8 x half> %a, <8 x half> [[TMP7]])
 // CHECK: ret <8 x half> [[MUL]]
 float16x8_t test_vmulxq_n_f16(float16x8_t a, float16_t b) {
 return vmulxq_n_f16(a, b);
 }

-/* TODO: Not implemented yet (needs scalar intrinsic from arm_fp16.h)
-// CCHECK-LABEL: test_vmulxh_lane_f16
-// CCHECK: [[CONV0:%.*]] = fpext half %a to float
-// CCHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float
-// CCHECK: [[MUL:%.*]] = fmul float [[CONV0:%.*]], [[CONV0:%.*]]
-// CCHECK: [[CONV3:%.*]] = fptrunc float %mul to half
-// CCHECK: ret half [[CONV3:%.*]]
+// CHECK-LABEL: test_vmulxh_lane_f16
+// CHECK: [[CONV0:%.*]] = fpext half %a to float
+// CHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float
+// CHECK: [[MUL:%.*]] = fmul float [[CONV0:%.*]], [[CONV0:%.*]]
+// CHECK: [[CONV3:%.*]] = fptrunc float %mul to half
+// CHECK: ret half [[CONV3:%.*]]
 float16_t test_vmulxh_lane_f16(float16_t a, float16x4_t b) {
 return vmulxh_lane_f16(a, b, 3);
 }

-// CCHECK-LABEL: test_vmulxh_laneq_f16
-// CCHECK: [[CONV0:%.*]] = fpext half %a to float
-// CCHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float
-// CCHECK: [[MUL:%.*]] = fmul float [[CONV0:%.*]], [[CONV0:%.*]]
-// CCHECK: [[CONV3:%.*]] = fptrunc float %mul to half
-// CCHECK: ret half [[CONV3:%.*]]
+// CHECK-LABEL: test_vmulxh_laneq_f16
+// CHECK: [[CONV0:%.*]] = fpext half %a to float
+// CHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float
+// CHECK: [[MUL:%.*]] = fmul float [[CONV0:%.*]], [[CONV0:%.*]]
+// CHECK: [[CONV3:%.*]] = fptrunc float %mul to half
+// CHECK: ret half [[CONV3:%.*]]
 float16_t test_vmulxh_laneq_f16(float16_t a, float16x8_t b) {
 return vmulxh_laneq_f16(a, b, 7);
 }
-*/

 // CHECK-LABEL: test_vmaxv_f16
 // CHECK: [[TMP0:%.*]] = bitcast <4 x half> %a to <8 x i8>
 // CHECK: [[TMP1:%.*]] = bitcast <8 x i8> [[TMP0]] to <4 x half>
 // CHECK: [[MAX:%.*]] = call half @llvm.aarch64.neon.fmaxv.f16.v4f16(<4 x half> [[TMP1]])
 // CHECK: ret half [[MAX]]
 float16_t test_vmaxv_f16(float16x4_t a) {
 return vmaxv_f16(a);