This is an archive of the discontinued LLVM Phabricator instance.

[ARM,MVE] Add vector-scalar intrinsics
ClosedPublic

Authored by miyuki on Feb 14 2020, 8:18 AM.

Details

Summary

This patch adds vector-scalar variants to the following families of
MVE intrinsics:

  • vaddq
  • vsubq
  • vmulq
  • vqaddq
  • vqsubq
  • vhaddq
  • vhsubq
  • vqdmulhq
  • vqrdmulhq

The vector-scalar variants perform a splat operation on the scalar
operand and then perform the same operations as their vector-vector
counterparts. Code generation is done accordingly (using LLVM IR 'insert'
and 'shuffle' operations which are later converted into an ARMvdup
SDNode).

Diff Detail

Event Timeline

miyuki created this revision.Feb 14 2020, 8:18 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptFeb 14 2020, 8:19 AM

I would prefer not to format the Clang test cases to avoid code churn.

I like how this uses a splat for all the register arguments. That sounds like a good idea.

The one's that worry me are the floating point instructions. Last time we tried those it was actually causing performance regressions because of extra sp->gpr mov's left in the loop.

If this is just the backend patterns though, not the sinking of splats into loops too, then I think it should be OK. On it's own I don't think it will usually cause problems. And some quick tests seem to verify that.

clang/test/CodeGen/arm-mve-intrinsics/vaddq.c
2–3

Why is this running the entire -O1 pass pipeline? These tests deliberately uses a limit subset to not need adjusting with every midend llvm change. (But not be littered with clang's verbose ir output).

I'm guessing the half args are being a pain again. Is it something to do with halfs?

llvm/lib/Target/ARM/ARMInstrMVE.td
4496

These GPR's can use the same regclass as the instruction. rGPR in this case I think?

4566

I find all these if's at different levels a little hard to follow. It looks OK, but is it possible to rearrange things to not need it here?

miyuki marked an inline comment as done.Feb 17 2020, 2:58 AM
miyuki added inline comments.
clang/test/CodeGen/arm-mve-intrinsics/vaddq.c
2–3

Yes, with just -mem2reg I get:

define arm_aapcs_vfpcc <8 x half> @test_vaddq_n_f16(<8 x half> %a, float %b.coerce) #0 {
entry:
  %b = alloca half, align 2
  %tmp = alloca float, align 4
  store float %b.coerce, float* %tmp, align 4
  %0 = bitcast float* %tmp to i8*
  %1 = bitcast half* %b to i8*
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 2 %1, i8* align 4 %0, i32 2, i1 false)
  %b1 = load half, half* %b, align 2
[...]

SROA seems to do the trick.

miyuki updated this revision to Diff 244936.Feb 17 2020, 3:48 AM

Addressed reviewer's comments

miyuki marked 3 inline comments as done.Feb 17 2020, 3:50 AM
dmgreen accepted this revision.Feb 17 2020, 9:17 AM

LGTM.

llvm/lib/Target/ARM/ARMInstrMVE.td
4566

I meant trying to remove this defvar unpred_op = !if( line. I think it's always going to be a bit difficult to follow whatever we do here though (or we end up repeating ourselves a lot), so you can ignore this one :)

This revision is now accepted and ready to land.Feb 17 2020, 9:17 AM
This revision was automatically updated to reflect the committed changes.