For the example below, gcc generates fewer instructions than llvm.
#include <arm_neon.h>

uint8x8_t foo(uint8x8_t a) {
  return vdup_n_u8(vrshrd_n_u64(vaddlv_u8(a), 3));
}

gcc output:

foo:
        uaddlv  h0, v0.8b
        urshr   d0, d0, 3
        dup     v0.8b, v0.b[0]
        ret

llvm output:

foo:
        uaddlv  h0, v0.8b
        fmov    w8, s0
        fmov    d0, x8
        urshr   d0, d0, #3
        dup     v0.8b, v0.b[0]
        ret
The llvm output contains copy instructions between GPR and FPR (the two fmov instructions). We can remove them with the pattern below.
def : Pat<(v1i64 (scalar_to_vector (i64 (zext (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))))),
          (INSERT_SUBREG (v1i64 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$Rn), hsub)>;
With the above pattern, llvm generates the output below.
foo:                                    // @foo
        uaddlv  h0, v0.8b
        urshr   d0, d0, #3
        dup     v0.8b, v0.b[0]
        ret
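As a sanity check on why dropping the GPR round trip should not change the result, here is a minimal scalar sketch of what foo computes. It is a model written for illustration, not code from the patch: the name foo_model and the array-based signature are hypothetical. It relies on two facts: the sum of eight bytes is at most 8 * 255 = 2040, so it fits in 16 bits and zero-extending it to 64 bits leaves the value unchanged, and vrshrd_n_u64(x, 3) is a rounding shift, i.e. (x + 4) >> 3.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of foo(); names are illustrative only. */
static uint8_t foo_model(const uint8_t a[8]) {
    /* vaddlv_u8: widening sum across the eight byte lanes. */
    uint16_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum = (uint16_t)(sum + a[i]);

    /* The sum never exceeds 8 * 255, so it always fits in 16 bits. */
    assert(sum <= 2040);

    /* Zero-extend to 64 bits; this does not change the value. */
    uint64_t wide = sum;

    /* vrshrd_n_u64(x, 3): rounding shift right, (x + 2^(3-1)) >> 3. */
    uint64_t rounded = (wide + 4u) >> 3;

    /* vdup_n_u8 takes a uint8_t, so only the low byte is broadcast
       to all eight lanes; the model returns that single byte. */
    return (uint8_t)rounded;
}
```

For example, for the input {1, 2, 3, 4, 5, 6, 7, 8} the sum is 36 and the model returns (36 + 4) >> 3 = 5.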
The pattern may be too specific to this example. If you have other ideas for generalizing this case, please let me know.
I think we should be able to generalize this to other operations that only produce a result in the low element, but I guess we can leave that for a followup.