For the example below, gcc generates fewer instructions than llvm.
#include <arm_neon.h>

uint8x8_t bar(uint8x8_t a) {
  return vrshrn_n_u16(vdupq_n_u16(vaddlv_u8(a)), 3);
}

gcc output:

bar:
        uaddlv  h0, v0.8b
        dup     v0.8h, v0.h[0]
        rshrn   v0.8b, v0.8h, 3
        ret

llvm output:

bar:
        uaddlv  h0, v0.8b
        fmov    w8, s0
        dup     v0.8h, w8
        rshrn   v0.8b, v0.8h, #3
        ret
The llvm output contains an extra copy between the FPR and GPR register files (fmov w8, s0). To remove it, we could change the scalar dup to a vector (lane) dup, as in the pattern below.
def : Pat<(v8i16 (AArch64dup (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))),
          (v8i16 (DUPv8i16lane
            (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$Rn), hsub),
            (i64 0)))>;
With the above pattern, llvm generates the output below.
bar:                                    // @bar
        uaddlv  h0, v0.8b
        dup     v0.8h, v0.h[0]
        rshrn   v0.8b, v0.8h, #3
        ret
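As a sanity check on why the scalar dup and the lane dup are interchangeable here, the following portable C sketch models what bar computes. It is a scalar reference model, not NEON code and not part of the patch. The name bar_model is hypothetical, and the rounding formula assumes the usual definition of a rounding shift right narrow (add 1 << (n - 1), shift by n, truncate).

```c
#include <stdint.h>

/* Hypothetical scalar model of bar(); for illustration only. */
static void bar_model(const uint8_t a[8], uint8_t out[8]) {
    /* vaddlv_u8: widen each byte to 16 bits and sum all eight lanes.
       The maximum is 8 * 255 = 2040, so the sum fits in 16 bits. */
    uint16_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum = (uint16_t)(sum + a[i]);

    /* vdupq_n_u16: every 16-bit lane holds the same sum, whether it was
       broadcast from a GPR or from lane 0 of a vector register.
       vrshrn_n_u16(..., 3): rounding shift right by 3, then narrow to 8 bits. */
    uint8_t r = (uint8_t)((uint16_t)(sum + (1u << 2)) >> 3);

    for (int i = 0; i < 8; ++i)
        out[i] = r;
}
```

Since the broadcast value is the same either way, taking it from lane 0 of the uaddlv result only changes where the value lives, not the result.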
The pattern may be too specific to this example. If you have other ideas for generalizing this case, please let me know.
I think this can be the same as SDT_AArch64uaddlp