GCC generates fewer instructions than LLVM for the intrinsic example below.

#include <arm_neon.h>

uint8x8_t test1(uint8x8_t a) {
    return vdup_n_u8(vrshrd_n_u64(vaddlv_u8(a), 3));
}

uint8x8_t test2(uint8x8_t a) {
    return vrshrn_n_u16(vdupq_n_u16(vaddlv_u8(a)), 3);
}

gcc output

test1:
        uaddlv  h0, v0.8b
        umov    w0, v0.h[0]
        fmov    d0, x0
        urshr   d0, d0, 3
        dup     v0.8b, v0.b[0]
        ret

test2:
        uaddlv  h0, v0.8b
        dup     v0.8h, v0.h[0]
        rshrn   v0.8b, v0.8h, 3
        ret

llvm output

test1:                                  // @test1
        uaddlv  h0, v0.8b
        fmov    w8, s0
        and     w8, w8, #0xffff
        fmov    d0, x8
        urshr   d0, d0, #3
        fmov    x8, d0
        dup     v0.8b, w8
        ret

test2:                                  // @test2
        uaddlv  h0, v0.8b
        fmov    w8, s0
        dup     v0.8h, w8
        rshrn   v0.8b, v0.8h, #3
        ret

The LLVM output contains additional fmov instructions.
uaddlv writes its result to an FPR, while the form of dup selected here reads its source from a GPR. A COPY instruction is therefore needed to convert between the FPR and GPR register classes, and it is expanded to fmov.
There is also a form of dup that takes a SIMD register lane as its source, called dup (element). Using it removes the COPY instruction, because the FPR is shared with the SIMD register.
With this patch, LLVM generates the output below.

test1:                                  // @test1
        uaddlv  h0, v0.8b
        fmov    w8, s0
        and     w8, w8, #0xffff
        fmov    d0, x8
        urshr   d0, d0, #3
        dup     v0.8b, v0.b[0]
        ret

test2:                                  // @test2
        uaddlv  h1, v0.8b
        dup     v0.8h, v1.h[0]
        rshrn   v0.8b, v0.8h, #3
        ret
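
For reference, the value that test1 and test2 compute can be modeled in portable C without NEON. This is only a rough sketch for checking codegen changes against expected results; the names sum_u8x8, model_test1, model_test2 and check_models_agree are hypothetical and not part of the patch. Both functions broadcast the rounded value (sum + 4) >> 3, which always fits in a byte because the sum of eight bytes is at most 2040.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for vaddlv_u8: widening sum of eight bytes (at most 2040). */
uint16_t sum_u8x8(const uint8_t a[8]) {
    uint16_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum = (uint16_t)(sum + a[i]);
    return sum;
}

/* Model of test1: rounding shift right on a 64-bit scalar (vrshrd_n_u64),
 * then broadcast of the low byte to all lanes (vdup_n_u8). */
void model_test1(const uint8_t a[8], uint8_t out[8]) {
    uint64_t r = ((uint64_t)sum_u8x8(a) + 4u) >> 3;
    for (int i = 0; i < 8; ++i)
        out[i] = (uint8_t)r;
}

/* Model of test2: broadcast the sum to 16-bit lanes (vdupq_n_u16),
 * then rounding shift right and narrow each lane to a byte (vrshrn_n_u16). */
void model_test2(const uint8_t a[8], uint8_t out[8]) {
    uint16_t lanes[8];
    uint16_t sum = sum_u8x8(a);
    for (int i = 0; i < 8; ++i)
        lanes[i] = sum;
    for (int i = 0; i < 8; ++i)
        out[i] = (uint8_t)(((uint32_t)lanes[i] + 4u) >> 3);
}

/* Hypothetical helper: asserts that both models agree on every lane. */
void check_models_agree(const uint8_t a[8]) {
    uint8_t r1[8], r2[8];
    model_test1(a, r1);
    model_test2(a, r2);
    for (int i = 0; i < 8; ++i)
        assert(r1[i] == r2[i]);
}
```

The two models differ only in where the broadcast happens, which mirrors why the two tests exercise different dup patterns after uaddlv.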