GCC generates fewer instructions than LLVM for the intrinsic example below.
```c
#include <arm_neon.h>

void foo(int16x8_t a, int32x4_t acc, int32x4_t *out, const int32_t *p) {
  int16x8_t b = vcombine_s16(vmovn_s32(vld1q_s32(&p[0])),
                             vmovn_s32(vld1q_s32(&p[4])));
  acc = vmlsl_s16(acc, vget_low_s16(a), vget_low_s16(b));
  acc = vmlsl_s16(acc, vget_high_s16(a), vget_high_s16(b));
  *out = acc;
}
```
GCC output
```
foo:
        ldp     q2, q3, [x1]
        uzp1    v2.8h, v2.8h, v3.8h
        smlsl   v1.4s, v0.4h, v2.4h
        smlsl2  v1.4s, v0.8h, v2.8h
        str     q1, [x0]
        ret
```
LLVM output
```
        ldp     q2, q3, [x1]
        ext     v4.16b, v0.16b, v0.16b, #8
        xtn     v2.4h, v2.4s
        smlsl   v1.4s, v0.4h, v2.4h
        xtn     v0.4h, v3.4s
        smlsl   v1.4s, v4.4h, v0.4h
        str     q1, [x0]
        ret
```
It looks like GCC keeps the intrinsic calls as builtin function calls through the middle end.
For example, the vmovn and vcombine intrinsic calls are matched to the uzp1 pattern as shown below.
```
_4 = __builtin_aarch64_xtnv4si (_3);     (insn 9 8 10 (set (reg:V4SI 107)
_6 = __builtin_aarch64_xtnv4si (_5);     (insn 12 11 13 (set (reg:V4SI 109)
_7 = {_4, _6};
...
(insn 10 9 11 (set (reg:V8HI 108)
        (vec_concat:V8HI (truncate:V4HI (reg:V4SI 107))
            (const_vector:V4HI [
                    (const_int 0 [0]) repeated x4
                ]))) (nil))
(insn 11 10 0 (set (reg:V4HI 93 [ _5 ])
        (subreg:V4HI (reg:V8HI 108) 0)) (nil))
(insn 13 12 14 (set (reg:V8HI 110)
        (vec_concat:V8HI (truncate:V4HI (reg:V4SI 109))
            (const_vector:V4HI [
                    (const_int 0 [0]) repeated x4
                ]))) (nil))
(insn 14 13 0 (set (reg:V4HI 95 [ _7 ])
        (subreg:V4HI (reg:V8HI 110) 0)) (nil))
(insn 15 14 16 (set (reg:V8HI 111)
        (vec_concat:V8HI (reg:V4HI 93 [ _5 ])
            (reg:V4HI 95 [ _7 ]))) (nil))
...
(define_insn "*aarch64_narrow_trunc<mode>"
  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
        (vec_concat:<VNARROWQ2>
          (truncate:<VNARROWQ>
            (match_operand:VQN 1 "register_operand" "w"))
          (truncate:<VNARROWQ>
            (match_operand:VQN 2 "register_operand" "w"))))]
  "TARGET_SIMD"
{
  if (!BYTES_BIG_ENDIAN)
    return "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>";
  else
    return "uzp1\\t%0.<V2ntype>, %2.<V2ntype>, %1.<V2ntype>";
}
  [(set_attr "type" "neon_permute<q>")]
)
```
Clang, by contrast, generates IR definitions for some of the intrinsic functions. After inlining, some of the intrinsic calls are optimized away, as shown below.
```llvm
define dso_local void @foo(<8 x i16> noundef %a, <4 x i32> noundef %acc, ptr nocapture noundef writeonly %out, ptr nocapture noundef readonly %p) local_unnamed_addr #0 {
entry:
  %0 = load <4 x i32>, ptr %p, align 4
  %vmovn.i = trunc <4 x i32> %0 to <4 x i16>
  %arrayidx2 = getelementptr inbounds i32, ptr %p, i64 4
  %1 = load <4 x i32>, ptr %arrayidx2, align 4
  %vmovn.i17 = trunc <4 x i32> %1 to <4 x i16>
  %shuffle.i18 = shufflevector <8 x i16> %a, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %vmull2.i.i = tail call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %shuffle.i18, <4 x i16> %vmovn.i)
  %shuffle.i19 = shufflevector <8 x i16> %a, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %vmull2.i.i20 = tail call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %shuffle.i19, <4 x i16> %vmovn.i17)
  %2 = add <4 x i32> %vmull2.i.i, %vmull2.i.i20
  %sub.i21 = sub <4 x i32> %acc, %2
  store <4 x i32> %sub.i21, ptr %out, align 16, !tbaa !6
  ret void
}
```
For the uzp1 instruction, it is hard to match the existing uzp1 patterns without the concat_vectors node that would come from vcombine_s16: after inlining and optimization, the truncated values feed the smull calls directly, and no concat remains in the IR.
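For comparison, here is a minimal hand-written IR sketch (not taken from the clang output above; the function name is made up) of the shape that does map onto uzp1: two truncs combined by a single shufflevector, which becomes a concat_vectors during selection and which I would expect the AArch64 backend to select as one uzp1.

```llvm
; Hypothetical example: concatenating two narrowing truncs is the shape that
; should select to a single "uzp1 v0.8h, v0.8h, v1.8h" on AArch64.
define <8 x i16> @narrow_pair(<4 x i32> %lo, <4 x i32> %hi) {
  %t0 = trunc <4 x i32> %lo to <4 x i16>
  %t1 = trunc <4 x i32> %hi to <4 x i16>
  %r = shufflevector <4 x i16> %t0, <4 x i16> %t1,
                     <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i16> %r
}
```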
If clang did not generate definitions for these intrinsics and instead left the calls for the backend to lower, we could get code similar to GCC's. However, I do not think that is a good approach; it is better to keep generating the intrinsic definitions and to optimize the code after inlining.
Alternatively, I have tried matching the MIR sequence around smlsl in the AArch64MIPeepholeOpt pass. With this patch, the LLVM output is as below.
```
foo:
        ldp     q2, q3, [x1]
        uzp1    v2.8h, v2.8h, v3.8h
        smlsl   v1.4s, v0.4h, v2.4h
        smlsl2  v1.4s, v0.8h, v2.8h
        str     q1, [x0]
        ret
```
This (and maybe the distance checks below) would make the algorithm O(N^2) in the number of instructions in the block.
It does allow the algorithm to be quite general - it can match any truncate, with the UZP getting a free truncate for what may be an unrelated instruction. It may not always be valid though - could the truncate depend on the result of the UZP, or vice versa? It does have the advantage that it works with either SDAG or GlobalISel, though.
From what I have seen, mulls often come in pairs. For example, the code in smlsl_smlsl2_v4i32_uzp1 contains such a pair.
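A rough sketch of such a pair, modelled on the inlined IR above rather than copied from the test (all names here are made up): each smull takes one half of %a and a separately truncated operand, so each trunc currently selects to its own xtn.

```llvm
declare <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16>, <4 x i16>)

; Hypothetical sketch: a pair of smulls, one on the low half of %a and one on
; the high half, each multiplying a separately truncated vector and feeding
; the accumulator through chained subs (the smlsl/smlsl2 shape).
define <4 x i32> @paired_smull(<8 x i16> %a, <4 x i32> %x, <4 x i32> %y, <4 x i32> %acc) {
  %t0  = trunc <4 x i32> %x to <4 x i16>
  %t1  = trunc <4 x i32> %y to <4 x i16>
  %alo = shufflevector <8 x i16> %a, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %ahi = shufflevector <8 x i16> %a, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %m0  = call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %alo, <4 x i16> %t0)
  %m1  = call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %ahi, <4 x i16> %t1)
  %s0  = sub <4 x i32> %acc, %m0
  %s1  = sub <4 x i32> %s0, %m1
  ret <4 x i32> %s1
}
```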
If it was processing the smlsl2, it might be able to look at the extract-high of the first operand, see that it has two uses with the other being an smull(extract-low(.)), and use the other operand of that smull in the UZP instead of the undef when creating it in the DAG? It has to check a number of things (and does not help with GlobalISel), but hopefully it fits in as an extension to the existing code in SDAG.
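Expressed at the IR level rather than as a DAG combine, a hypothetical sketch of the shape that suggestion would effectively aim for (again, all names made up): both smulls read halves of one concatenated vector, so a single uzp1 feeds both the smlsl and the smlsl2.

```llvm
declare <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16>, <4 x i16>)

; Hypothetical target shape: one shuffle concatenates both truncated halves
; (one uzp1), and the two smulls read its low and high halves, so the second
; multiply can use the full registers and select smlsl2.
define <4 x i32> @paired_smull_uzp1(<8 x i16> %a, <4 x i32> %x, <4 x i32> %y, <4 x i32> %acc) {
  %t0  = trunc <4 x i32> %x to <4 x i16>
  %t1  = trunc <4 x i32> %y to <4 x i16>
  %b   = shufflevector <4 x i16> %t0, <4 x i16> %t1,
                       <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %alo = shufflevector <8 x i16> %a, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %ahi = shufflevector <8 x i16> %a, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %blo = shufflevector <8 x i16> %b, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %bhi = shufflevector <8 x i16> %b, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %m0  = call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %alo, <4 x i16> %blo)
  %m1  = call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %ahi, <4 x i16> %bhi)
  %s0  = sub <4 x i32> %acc, %m0
  %s1  = sub <4 x i32> %s0, %m1
  ret <4 x i32> %s1
}
```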