This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Replace DUP scalar by DUP element
AbandonedPublic

Authored by jaykang10 on Apr 12 2023, 8:12 AM.

Details

Summary

gcc generates fewer instructions than llvm for the intrinsic examples below.

#include <arm_neon.h>

uint8x8_t test1(uint8x8_t a) {
    return vdup_n_u8(vrshrd_n_u64(vaddlv_u8(a), 3));
}

uint8x8_t test2(uint8x8_t a) {
    return vrshrn_n_u16(vdupq_n_u16(vaddlv_u8(a)), 3); 
}

gcc output
test1:
	uaddlv	h0, v0.8b
	umov	w0, v0.h[0]
	fmov	d0, x0
	urshr	d0, d0, 3
	dup	v0.8b, v0.b[0]
	ret

test2:
	uaddlv	h0, v0.8b
	dup	v0.8h, v0.h[0]
	rshrn	v0.8b, v0.8h, 3
	ret

llvm output
test1:                                  // @test1
	uaddlv	h0, v0.8b
	fmov	w8, s0
	and	w8, w8, #0xffff
	fmov	d0, x8
	urshr	d0, d0, #3
	fmov	x8, d0
	dup	v0.8b, w8
	ret

test2:                                  // @test2
	uaddlv	h0, v0.8b
	fmov	w8, s0
	dup	v0.8h, w8
	rshrn	v0.8b, v0.8h, #3
	ret

We can see additional fmov instructions in the llvm output.
The uaddlv instruction writes its result to an FPR, while this form of dup reads its source from a GPR. A COPY instruction is therefore needed to convert between the FPR and GPR register classes, and it is expanded to fmov.
There is also a dup variant whose source is a SIMD register lane, called DUP (element). Using it lets us remove the COPY instruction, because the FPRs are aliased with the SIMD registers.
With this patch, llvm generates the output below.

test1:                                  // @test1
	uaddlv	h0, v0.8b
	fmov	w8, s0
	and	w8, w8, #0xffff
	fmov	d0, x8
	urshr	d0, d0, #3
	dup	v0.8b, v0.b[0]
	ret

test2:                                  // @test2
	uaddlv	h1, v0.8b
	dup	v0.8h, v1.h[0]
	rshrn	v0.8b, v0.8h, #3
	ret

Diff Detail

Event Timeline

jaykang10 created this revision. · Apr 12 2023, 8:12 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 12 2023, 8:12 AM
jaykang10 requested review of this revision. · Apr 12 2023, 8:12 AM

For the i64 neon intrinsics it would be possible to change the representation in DAG combine so that they passed and used a v1i64 node, which might help that case simplify in SDAG. Maybe something similar could be done for UADDLV too.

This way has other advantages with it being shared between SDAG and GlobalISel. And might come up in other cases, like across basic-block boundaries. Can we extend it to all the DUP sizes?

jaykang10 abandoned this revision. · Apr 13 2023, 9:04 AM

> For the i64 neon intrinsics it would be possible to change the representation in DAG combine so that they passed and used a v1i64 node, which might help that case simplify in SDAG. Maybe something similar could be done for UADDLV too.
>
> This way has other advantages with it being shared between SDAG and GlobalISel. And might come up in other cases, like across basic-block boundaries. Can we extend it to all the DUP sizes?

I agree with you.
It would be better to solve this issue at the SelectionDAG level.
Let me close this patch.

OK, this might have uses in general. Like I said, a MIPeephole can have advantages. We can reinstate it if we find a good use, or if we get stuck doing it the other way.

> OK, this might have uses in general. Like I said MIPeephole can have advantages. We can reinstate it if we find a good use, or get stuck doing it the other way.

Yep, I think casting the type to v1i64 would be the better way to fix this issue.