For the example below, gcc generates fewer instructions than llvm.
#include <arm_neon.h>
uint8x8_t bar(uint8x8_t a) {
return vrshrn_n_u16(vdupq_n_u16(vaddlv_u8(a)), 3);
}
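For readers less familiar with these intrinsics, here is a rough scalar sketch of what bar() computes, in portable C so that it does not need arm_neon.h. The name bar_model and the array-based signature are hypothetical and only for illustration. It assumes the usual semantics: vaddlv_u8 is a widening sum across the 8 lanes, vdupq_n_u16 broadcasts that sum, and vrshrn_n_u16(x, 3) is a rounding shift right by 3 followed by narrowing.

```c
#include <stdint.h>

/* Hypothetical scalar model of bar(); not part of the patch. */
void bar_model(const uint8_t a[8], uint8_t out[8]) {
    /* vaddlv_u8: widening add across all 8 lanes (max 8 * 255 = 2040). */
    uint16_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += a[i];

    /* vrshrn_n_u16(x, 3): add the rounding constant 1 << 2, shift right
       by 3, then narrow to 8 bits. (2040 + 4) >> 3 = 255 still fits. */
    uint8_t r = (uint8_t)((sum + 4u) >> 3);

    /* vdupq_n_u16 broadcast the sum, so every result lane is identical. */
    for (int i = 0; i < 8; ++i)
        out[i] = r;
}
```

Since every lane holds the same value, the dup only needs lane 0 of the uaddlv result, which is why the value does not have to leave the vector register file.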
gcc output
bar:
uaddlv h0, v0.8b
dup v0.8h, v0.h[0]
rshrn v0.8b, v0.8h, 3
ret
llvm output
bar:
uaddlv h0, v0.8b
fmov w8, s0
dup v0.8h, w8
rshrn v0.8b, v0.8h, #3
ret
In the llvm output there is a copy instruction between GPR and FPR (the fmov w8, s0). We could change the scalar dup to a vector (lane) dup to remove the copy instruction, as below.
def : Pat<(v8i16 (AArch64dup (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))),
(v8i16 (DUPv8i16lane
(INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$Rn), hsub),
(i64 0)))>;
With the above pattern, llvm generates the output below.
bar: // @bar
uaddlv h0, v0.8b
dup v0.8h, v0.h[0]
rshrn v0.8b, v0.8h, #3
ret
The pattern may be too specific to this example. If you have other ideas for generalizing this case, please let me know.
I think this can be the same as SDT_AArch64uaddlp