For the example below, gcc generates fewer instructions than llvm.
#include <arm_neon.h>
uint8x8_t bar(uint8x8_t a) {
return vrshrn_n_u16(vdupq_n_u16(vaddlv_u8(a)), 3);
}
gcc output
bar:
uaddlv h0, v0.8b
dup v0.8h, v0.h[0]
rshrn v0.8b, v0.8h, 3
ret
llvm output
bar:
uaddlv h0, v0.8b
fmov w8, s0
dup v0.8h, w8
rshrn v0.8b, v0.8h, #3
ret

The llvm output has an extra copy instruction (fmov) from FPR to GPR. We can select the vector (lane) form of dup instead of the scalar form to remove the copy, as in the pattern below.
def : Pat<(v8i16 (AArch64dup (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))),
          (v8i16 (DUPv8i16lane
            (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$Rn), hsub),
            (i64 0)))>;

With the above pattern, llvm generates the output below.
bar: // @bar
uaddlv h0, v0.8b
dup v0.8h, v0.h[0]
rshrn v0.8b, v0.8h, #3
ret

The pattern may be too specific to this example. If you have other ideas for generalizing it, please let me know.
This could be extended to i16 uaddlv as well, but we can leave that for a followup, I guess.
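
For checking that the new pattern does not change results, a scalar model of what bar computes may be useful. The sketch below is only an illustration: bar_ref is a hypothetical helper name, not part of the patch, and it models uaddlv as a widening sum and rshrn #3 as a rounding shift right by 3 that keeps the low 8 bits.

```c
#include <stdint.h>

/* Scalar model of bar(); bar_ref is a hypothetical name used for illustration. */
static void bar_ref(const uint8_t a[8], uint8_t out[8]) {
    /* uaddlv: widening sum of the eight u8 lanes.
       The maximum is 8 * 255 = 2040, so the sum fits in 16 bits. */
    uint16_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum = (uint16_t)(sum + a[i]);

    /* dup + rshrn #3: every lane holds the same sum, so each output lane is
       the sum plus the rounding constant (1 << 2), shifted right by 3 and
       truncated to 8 bits. */
    uint8_t narrowed = (uint8_t)(((uint32_t)sum + 4u) >> 3);
    for (int i = 0; i < 8; ++i)
        out[i] = narrowed;
}
```

For input lanes 1 through 8 the sum is 36 and (36 + 4) >> 3 is 5, so every output lane is 5. Both the scalar dup and the lane dup sequences should agree with this model.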