For the example below, gcc generates fewer instructions than llvm.
#include <arm_neon.h>
uint8x8_t foo(uint8x8_t a) {
return vdup_n_u8(vrshrd_n_u64(vaddlv_u8(a), 3));
}
gcc output
foo:
uaddlv h0, v0.8b
urshr d0, d0, 3
dup v0.8b, v0.b[0]
ret
llvm output
foo:
uaddlv h0, v0.8b
fmov w8, s0
fmov d0, x8
urshr d0, d0, #3
dup v0.8b, v0.b[0]
ret
The llvm output contains copy instructions between GPR and FPR (the two fmov instructions). We can remove them with the pattern below.
def : Pat<(v1i64 (scalar_to_vector (i64 (zext (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))))),
          (INSERT_SUBREG (v1i64 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$Rn), hsub)>;
With the above pattern, llvm generates the output below.
foo: // @foo
uaddlv h0, v0.8b
urshr d0, d0, #3
dup v0.8b, v0.b[0]
ret
The pattern may be too specific to this example. If you have other ideas for generalizing this case, please let me know.
I think we should be able to generalize this to other operations that only produce a result in the low element, but I guess we can leave that for a follow-up.
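For readers who want to check the arithmetic, here is a minimal portable C sketch of what foo computes per lane. It is not part of the patch: the name foo_scalar_model is hypothetical, and the model assumes the usual intrinsic semantics (vaddlv_u8 sums the eight lanes into a 16-bit value, vrshrd_n_u64 is a rounding shift right, vdup_n_u8 broadcasts the low byte). It illustrates that the sum is at most 8 * 255 = 2040, so widening it to 64 bits never changes its value.

```c
#include <stdint.h>

/* Hypothetical scalar reference model of foo(); illustration only.
 * Returns the byte that foo() would broadcast to all eight lanes. */
static uint8_t foo_scalar_model(const uint8_t a[8]) {
    /* vaddlv_u8 / uaddlv: widening sum of all lanes, at most 2040. */
    uint16_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum = (uint16_t)(sum + a[i]);

    /* Implicit conversion to the uint64_t argument of vrshrd_n_u64:
     * a zero-extension that leaves the value unchanged. */
    uint64_t wide = sum;

    /* vrshrd_n_u64(x, 3) / urshr #3: add 1 << (3 - 1), then shift right. */
    uint64_t rounded = (wide + 4u) >> 3;

    /* vdup_n_u8 / dup: only the low byte is broadcast. */
    return (uint8_t)rounded;
}
```

With this model, an input of eight 1s gives a sum of 8 and a result of 1, and an input of eight 255s gives a sum of 2040 and a result of 255.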