gcc generates less instructions than llvm from below intrinsic example.
#include <arm_neon.h> float16x8_t test1(const float32x4_t a) { float16x4_t b = vcvt_f16_f32(a); return vcombine_f16(b, vdup_n_f16(0.0)); } uint8x8_t test2(uint16_t *in, uint8x8_t *dst, uint8x8_t idx) { return vtbl1_u8(vshrn_n_u16(vld1q_u16(in), 4), idx); } gcc output test1: fcvtn v0.4h, v0.4s fmov d0, d0 ret test2: ldr q1, [x0] shrn v1.8b, v1.8h, 4 tbl v0.8b, {v1.16b}, v0.8b ret llvm output test1: // @test1 movi d1, #0000000000000000 fcvtn v0.4h, v0.4s mov v0.d[1], v1.d[0] ret test2: // @test2 ldr q1, [x0] movi v2.2d, #0000000000000000 shrn v1.8b, v1.8h, #4 mov v1.d[1], v2.d[0] tbl v0.8b, { v1.16b }, v0.8b ret
The fcvtn and shrn instructions set zero for high 64-bits implicitly so we do not need mov 0 instruction for high 64-bits. It looks gcc has patterns for the cases. For example,
the gcc rtl pattern for test2 function's shrn (define_insn "aarch64_shrn<mode>_insn_le" [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w") (vec_concat:<VNARROWQ2> (truncate:<VNARROWQ> (lshiftrt:VQN (match_operand:VQN 1 "register_operand" "w") (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_<vn_mode>"))) (match_operand:<VNARROWQ> 3 "aarch64_simd_or_scalar_imm_zero")))] "TARGET_SIMD && !BYTES_BIG_ENDIAN" "shrn\\t%0.<Vntype>, %1.<Vtype>, %2" [(set_attr "type" "neon_shift_imm_narrow_q")] )
llvm could also add tablegen patterns for them like gcc but it could be better to handle the patterns on MIR Peephole optimization pass because they have common sub patterns and the pass can consider multiple basic blocks.
With this patch, llvm generates below output.
llvm output test1: // @test1 fcvtn v0.4h, v0.4s ret test2: // @test2 ldr q1, [x0] shrn v1.8b, v1.8h, #4 tbl v0.8b, { v1.16b }, v0.8b ret
Can we extend this to all the instructions that are like FCVTNv4i16/SHRNv8i8_shift? For example maybe these, which I think produce 64bit results and are similar to the instructions you already have:
We might be able to get away with "Any instruction that defs a FPR64", but that might need more careful checking and there are quite a few of them. We should probably try and get these classes of instruction though, not just the exact sizes.