gcc generates fewer instructions than llvm for the example below.

  #include <arm_neon.h>

  uint8x8_t bar(uint8x8_t a) {
    return vrshrn_n_u16(vdupq_n_u16(vaddlv_u8(a)), 3);
  }

gcc output

  bar:
          uaddlv  h0, v0.8b
          dup     v0.8h, v0.h[0]
          rshrn   v0.8b, v0.8h, 3
          ret

llvm output

  bar:
          uaddlv  h0, v0.8b
          fmov    w8, s0
          dup     v0.8h, w8
          rshrn   v0.8b, v0.8h, #3
          ret

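The llvm output contains an extra copy between FPR and GPR (`fmov w8, s0`), because the uaddlv result is moved to a general-purpose register before being duplicated. We could change the scalar dup to a vector (lane) dup to remove the copy instruction, as in the pattern below.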

  def : Pat<(v8i16 (AArch64dup (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))),
            (v8i16 (DUPv8i16lane
              (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)),
                             (UADDLVv8i8v V64:$Rn), hsub),
              (i64 0)))>;

With the above pattern, llvm generates the output below.

  bar:                                    // @bar
          uaddlv  h0, v0.8b
          dup     v0.8h, v0.h[0]
          rshrn   v0.8b, v0.8h, #3
          ret

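As a sanity check on what both instruction sequences should compute, here is a minimal scalar sketch of `bar` in portable C. It is not part of the patch; the `model_*` names are hypothetical helpers introduced only for illustration, and the code assumes the usual semantics of the intrinsics (widening sum across lanes, broadcast, then rounding shift right with truncating narrow).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of vaddlv_u8: widening sum of eight u8 lanes. */
uint16_t model_uaddlv_u8(const uint8_t a[8]) {
    uint16_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum = (uint16_t)(sum + a[i]);   /* at most 8 * 255 = 2040, fits in u16 */
    return sum;
}

/* Hypothetical scalar model of one lane of vrshrn_n_u16:
   add the rounding constant, shift right, truncate to u8. */
uint8_t model_rshrn_u16(uint16_t x, int shift) {
    assert(shift >= 1 && shift <= 8);
    uint32_t rounded = (uint32_t)x + (1u << (shift - 1));
    return (uint8_t)(rounded >> shift);
}

/* Model of bar(): the sum is broadcast, so every output lane is identical.
   Only lane 0 of the uaddlv result is ever read, which is why a lane dup
   from the vector register is sufficient. */
void model_bar(const uint8_t a[8], uint8_t out[8]) {
    uint16_t sum = model_uaddlv_u8(a);
    uint8_t r = model_rshrn_u16(sum, 3);
    for (int i = 0; i < 8; ++i)
        out[i] = r;
}
```

For example, with input lanes 1 through 8 the sum is 36, and each output lane is (36 + 4) >> 3 = 5. The result does not depend on whether the sum travels through a GPR, so dropping the `fmov` should not change behavior.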
The pattern may be too specific to this example. If you have other ideas for generalizing this case, please let me know.
This could be extended to i16 uaddlv as well, but we can leave that for a followup, I guess.