This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Implement vector tuple intrinsics
ClosedPublic

Authored by c-rhodes on Mar 5 2020, 4:15 AM.

Details

Summary

This patch adds the following intrinsics for creating two-tuple,
three-tuple and four-tuple scalable vectors:

  • llvm.aarch64.sve.tuple.create2
  • llvm.aarch64.sve.tuple.create3
  • llvm.aarch64.sve.tuple.create4

As well as:

  • llvm.aarch64.sve.tuple.get
  • llvm.aarch64.sve.tuple.set

For extracting and inserting scalable vectors from vector tuples. These
intrinsics are intended to be used by the ACLE functions svcreate<n>,
svget and svset.

This patch also includes calling convention support for passing and
returning tuples of scalable vectors to/from functions.

Diff Detail

Event Timeline

c-rhodes created this revision.Mar 5 2020, 4:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 5 2020, 4:15 AM
c-rhodes updated this revision to Diff 248453.Mar 5 2020, 5:29 AM

Fix warnings reported by Harbor.

c-rhodes updated this revision to Diff 248670.Mar 6 2020, 2:07 AM

Added full context.

efriedma added inline comments.Mar 12 2020, 3:48 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7038 ↗(On Diff #248670)

This assertion seems ambitious; I'm not sure how you plan to ensure that the sve.tuple.get intrinsic is in the same basic block as the intrinsic that produces the value.

Is there some reason you're putting this handling into the target-independent SelectionDAGBuilder, as opposed to target-specific code?

I think that in this patch we are missing tests that check the output code in situations in which the tuple.get operates on tuple that are passed to the definition of the function, as in the following example:

define <vscale x 16 x i8> @foo(<vscale x 32 x i8> %res) {
; CHECK-LABEL: ld2b_i8_1:
; CHECK: mov z0.d, z1.d
; CHECK-NEXT: ret
  %v2 = call <vscale x 16 x i8> @llvm.aarch64.sve.tuple.get.nxv32i8(<vscale x 32 x i8> %res, i32 1)
  ret <vscale x 16 x i8> %v2
}

@sdesmalen - should we add them? Am I missing something?

Reference comment: https://reviews.llvm.org/D75751#inline-697705

Francesco

c-rhodes updated this revision to Diff 265514.May 21 2020, 9:00 AM

Changes:

  • Rebased on top of D80139 implementing copyToReg/copyFromReg support for illegal scalable vectors. This enables copying vector tuple types between BBs.
  • Removed assert in the lowering of sve_tuple_get/sve_tuple_set intrinsics checking the tuple is CONCAT_VECTOR.
  • Vector tuple elements are now extracted with EXTRACT_SUBVECTOR.
  • Removed target specific calling conv methods from AArch64ISelLowering.
  • Updated test/CodeGen/AArch64/sve-intrinsics-create-tuple.ll to test extract in different BB.
c-rhodes added inline comments.May 21 2020, 10:13 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7038 ↗(On Diff #248670)

This assertion seems ambitious; I'm not sure how you plan to ensure that the sve.tuple.get intrinsic is in the same basic block as the intrinsic that produces the value.

Sander's patch D80139 fixes copying between BBs and I've updated this patch to use EXTRACT_SUBVECTOR to get the vector from the tuple type, rather than looking through the CONCAT_VECTOR, so this assert has been removed.

Is there some reason you're putting this handling into the target-independent SelectionDAGBuilder, as opposed to target-specific code?

Tuple types for structured loads/stores were originally implemented with sizeless structs downstream but adding sizeless structs to the C standard isn't going to happen, so we chose to represent them as multiples of 128-bit vectors, e.g. svint32x2_t -> <n x 8 x i32>. For vector tuples containing 2 and 4 vectors (LD2/LD4, ST2/ST4) this was fine as the types are powers of 2 and can be broken down by the existing mechanisms, but for LD3/ST3 the types are odd sizes:

  • svint8x3_t -> <vscale x 48 x i8>
  • svint16x3_t -> <vscale x 24 x i16>
  • svint32x3_t -> <vscale x 12 x i32>
  • svint64x2_t -> <vscale x 6 x i64>

Which are problematic for legalisation when it comes to widening/splitting. To do this lowering in target-specific AArch64ISelLowering MVTs would have to be defined for these types which aren't legal. By lowering here we avoid these issues as these types don't reach type legalization.

efriedma added inline comments.May 21 2020, 10:55 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7038 ↗(On Diff #248670)

If you need to run before legalization, you can put code in AArch64TargetLowering::PerformDAGCombine.

Alternatively, additional MVTs wouldn't be that terrible.

c-rhodes updated this revision to Diff 265762.May 22 2020, 9:50 AM

Changes:

  • Moved lowering of vector tuple intrinsics to AArch64TargetLowering::PerformDAGCombine.
  • Renamed function args in test/CodeGen/AArch64/sve-intrinsics-create-tuple.ll from x<n> to z<n>.
  • Fixed bug in lowering of sve.tuple.set where the index (of element to update) was being used to extract subvectors from the tuple vector causing the wrong vectors to be copied. Added tests to catch this.
c-rhodes added inline comments.May 22 2020, 10:03 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7038 ↗(On Diff #248670)

If you need to run before legalization, you can put code in AArch64TargetLowering::PerformDAGCombine.

That's a much better idea :) I've implemented your suggestion, cheers!

Alternatively, additional MVTs wouldn't be that terrible.

I'm a little confused about this, there's a comment at the top of MachineValueType.h:

// This file defines the set of machine-level target independent types which
// legal values in the code generator use.

which to me implies MVTs should only be implemented for legal types which tuple vectors aren't, although I've also seen v3i32 which seems to be used in the AMDGPU backend but I don't know if the ISA specifies a register for that. I don't think I want to implements MVTs in this patch but I'm curious if it's ok to define them for illegal types?

efriedma accepted this revision.May 22 2020, 10:23 AM

LGTM

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7038 ↗(On Diff #248670)

We treat the "legal values in the code generator" bit a little loosely; really, it's any type that useful to define as an MVT for some backend.

And yes, if I recall that discussion correctly, I think AMDGPU actually does have operations on v3i32 registers.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13664

No point to explicitly checking isBeforeLegalizeOps(); the intrinsics should be gone after that.

This revision is now accepted and ready to land.May 22 2020, 10:23 AM
c-rhodes updated this revision to Diff 266147.May 26 2020, 3:02 AM

Changes:

  • Test extract of tuple vector passed into function.
  • Remove explicit check to exit early if after legalization when lowering intrinsics.
c-rhodes marked an inline comment as done.May 26 2020, 3:03 AM

I think that in this patch we are missing tests that check the output code in situations in which the tuple.get operates on tuple that are passed to the definition of the function, as in the following example:

I've added a few tests to test/CodeGen/AArch64/sve-intrinsics-insert-extract-tuple.ll to test this.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13664

Doh, done!

LGTM

Thanks for reviewing!

I'll hold off on merging this until Sander lands D80139

This revision was automatically updated to reflect the committed changes.
craig.topper added inline comments.
llvm/include/llvm/IR/IntrinsicsAArch64.td
815

Sorry for digging up an old patch here. Why is this IntrArgMemOnly? And why are any of these IntrReadMem? We're working on similar intrinsics for RISCV vector support and were wondering if we were missing something.

rkruppe removed a subscriber: rkruppe.Dec 23 2020, 3:43 AM
c-rhodes added inline comments.Jan 5 2021, 10:30 AM
llvm/include/llvm/IR/IntrinsicsAArch64.td
815

Sorry for digging up an old patch here. Why is this IntrArgMemOnly? And why are any of these IntrReadMem? We're working on similar intrinsics for RISCV vector support and were wondering if we were missing something.

Apologies for the delayed response.

When these intrinsics were first implemented CopyToReg/CopyFromReg didn't support some of the more unusual tuple types like nxv6i64 which could be created by an LD3 for example. This prevented copying between basic blocks since the compiler would crash. To get around this a temporary constraint of these intrinsics was they must immediately be followed by an extract to prevent the backend from ever seeing such types. I think at the time this was ok since these intrinsics were only used the the Arm C Language Extension (ACLE) intrinsics so we could apply this constraint, although it required marking the intrinsics as touching memory to prevent the mid end from mucking around with them. The CopyToReg/CopyFromReg issue has since been fixed however, so I can see no reason why these intrinsics can't be updated to mark them as not touching memory, I'll make a note to address that, cheers.