GCC generates fewer instructions than LLVM for the intrinsic example below.
#include <arm_neon.h>

unsigned foo(uint16x8_t b) {
  return vaddlvq_u32(vpadalq_u16(vdupq_n_u32(0), b));
}
GCC output:

foo:
        uaddlv  s31, v0.8h
        fmov    x0, d31
        ret
LLVM output:

foo:
        uaddlp  v0.4s, v0.8h
        uaddlv  d0, v0.4s
        fmov    x0, d0
We could do uaddlv(uaddlp(x)) ==> uaddlv(x).
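As a sanity check on why this fold is safe, here is a minimal scalar sketch in portable C. It is not the patch itself. The `model_*` helpers and `fold_is_equivalent` are hypothetical names invented for this illustration, and they only approximate the lane arithmetic of UADDLP and UADDLV as I understand it. With a zero accumulator, `vpadalq_u16` reduces to a pairwise widening add, and the sum of eight 16-bit lanes is at most 8 * 65535 = 524280, so the result always fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model (assumption) of UADDLP v.4s, v.8h:
   add adjacent u16 pairs into four u32 lanes. */
void model_uaddlp_u16x8(const uint16_t in[8], uint32_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = (uint32_t)in[2 * i] + (uint32_t)in[2 * i + 1];
}

/* Scalar model (assumption) of UADDLV d, v.4s:
   widening sum across four u32 lanes into a u64. */
uint64_t model_uaddlv_u32x4(const uint32_t in[4]) {
    uint64_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += in[i];
    return sum;
}

/* Scalar model (assumption) of UADDLV s, v.8h:
   widening sum across eight u16 lanes into a u32. */
uint32_t model_uaddlv_u16x8(const uint16_t in[8]) {
    uint32_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += in[i];
    return sum;
}

/* Returns 1 when uaddlv(uaddlp(x)) and uaddlv(x) agree for this input. */
int fold_is_equivalent(const uint16_t in[8]) {
    uint32_t pairs[4];
    model_uaddlp_u16x8(in, pairs);
    uint64_t wide = model_uaddlv_u32x4(pairs);
    /* The two-step result never exceeds 32 bits, so the upper bits are zero. */
    assert(wide <= UINT32_MAX);
    return wide == (uint64_t)model_uaddlv_u16x8(in);
}
```

Calling `fold_is_equivalent` with all lanes set to 65535 exercises the largest possible sum, which is the case where overflow would show up if the 32-bit result were too narrow.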
After adding a TableGen pattern for this fold, the LLVM output is as follows.
foo:
        uaddlv  s0, v0.8h
        fmov    x0, d0
It is probably quite a minor point, but can you change this to (v4i32 (SUBREG_TO_REG (i64 0), (UADDLVv8i16v V128:$op), ssub))? The EXTRACT_SUBREG is using the fact that the higher lanes will be implicitly zeroed.