This is an archive of the discontinued LLVM Phabricator instance.

[POC][SVE] Allow code generation for fixed length vectorised loops [Patch 2/2].
AbandonedPublic

Authored by paulwalker-arm on Dec 20 2019, 6:46 AM.

Details

Reviewers
rengolin
efriedma
Summary

No expectations for review at this stage unless you are super keen.

This extends the proof of concept introduced by https://reviews.llvm.org/D71760 to show the work required to custom lower the main fixed length vector operations. The general scheme being:

<n x ty> op (<n x ty> op1, <n x ty> op2...

becomes

pg = create_predicate_for(<n x ty>)
new_op1 = convertToSVE(op1)
new_op2 = convertToSVE(op2)
...
return convertFromSVE(sve_op(pg, new_op1, new_op2...

To keep the patch small I've reused the existing intrinsic isel rules. This provides a route to reasonable performance whilst allowing us to start work on immediate packing, condition code handling and better utilisation of reversed instructions and movprfx. The ultimately goal is that the first patch ensures everything can run, whilst this and subsequent patches improve performance.

WARNING: The more operations we lower the more existing code paths will need to be cleaned so they no longer assume that (isTypeLegal(VectorType) implies VectorType.getSizeInBits() >= 128). This can be seen in this patch which required a charge to one of the AArch64 DAGCombines.

Diff Detail

Event Timeline

paulwalker-arm created this revision.Dec 20 2019, 6:46 AM
Herald added a reviewer: efriedma. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript

Is there any particular reason to lower ISD::AND on a fixed vector to an intrinsic, as opposed to simply lowering it to an ISD::AND on a scalable vector? The patterns should mostly be there to make it work.

Otherwise, the general approach seems reasonable.

Err, oops, ignore me; somehow I thought this was new.

This seems like a good approach.

Do we have any tests for it? I notice a simple fixed-width argument test blows up (unless I'm making a mistake on my side):

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -aarch64-isel-sve-vector-bits=512 < %s | FileCheck %s

define <16 x float> @fadd_float_512b(<16 x float> %a, <16 x float> %b) {
; CHECK-LABEL: fadd_float_512b:
; CHECK:       // %bb.0:
; CHECK-NEXT:    fadd z0.s, z0.s, z1.s
; CHECK-NEXT:    ret
  %fadd = fadd <16 x float> %a, %b
  ret <16 x float> %fadd
}

I have a somewhat hacky solution to the calling convention issues, which I'll share soon. Hoping for a better solution though, since mine is clunky...

The handling of larger than NEON fixed length vector function parameters is still up in the air. We cannot break existing NEON support so we're almost certainly going to need a new function attribute for fixed length SVE. For this reason and generally not knowing if the approach this patch is based on will be acceptable I did not prioritise writing tests, instead I focused on execution testing. I'll see about resolving the lack of tests albeit at this stage using pass by reference rather than value.

fhahn added a subscriber: fhahn.Mar 20 2020, 10:45 AM

Ah, ok. So you wouldn't see fixed-width vector arguments since the vectorizers works at the loop/block/etc level. That's what I misunderstood.

We have a pre-IR vectorizer and vectorize lib calls, so our needs are a little askew.

What about a scheme where we make the fixed-width vector types proper subregs of the scalable types? That way we can add the fixed-width types to the calling conventions and register classes. Would that be of interest you?

It would be similar to X86's XMM->YMM->ZMM registers (i.e. 128b->256b->512b), except that all of SVE's sub-registers would print the same register name. it's a little hacky, but might polish up ok.

I'll see about resolving the lack of tests albeit at this stage using pass by reference rather than value.

Thanks, but I'm fine without them for now. Up to you...

What about a scheme where we make the fixed-width vector types proper subregs of the scalable types? That way we can add the fixed-width types to the calling conventions and register classes. Would that be of interest you?

It would be similar to X86's XMM->YMM->ZMM registers (i.e. 128b->256b->512b), except that all of SVE's sub-registers would print the same register name. it's a little hacky, but might polish

The problem here is what fixed-width vector types would you use? The mapping would be dependent on the target. We did experiment with HwMode but it didn't work out. At this stage I think the simplest option is for the vectorised functions to use scalable argument and then provide a way to insert/extract fixed width vectors into/from them. During code generation these insert/extract operations will become _SUBVECTOR operations that should generate no code. This is akin to the general philosophy I have used for fixed width code generation.

What about a scheme where we make the fixed-width vector types proper subregs of the scalable types? That way we can add the fixed-width types to the calling conventions and register classes. Would that be of interest you?

It would be similar to X86's XMM->YMM->ZMM registers (i.e. 128b->256b->512b), except that all of SVE's sub-registers would print the same register name. it's a little hacky, but might polish

The problem here is what fixed-width vector types would you use? The mapping would be dependent on the target.

I'm not sure if I'm following. If we don't add a vector type to a register class for a target, then it shouldn't be a problem, I think. We should be able to do that programmatically in ISel since we know the max vector width at compile time. E.g. Don't add the 1024b vector types to the ZPR register class for a 512b wide target (or we could create a ZPR and subreg hierarchy instead).

I've posted a back-of-the-envelope patch so we can look at something concrete, D77224. It definitely has some wart, but I think it could be workable going forward.

vkmr added a subscriber: vkmr.Apr 2 2020, 7:18 AM
dancgr added a subscriber: dancgr.May 28 2020, 10:39 AM

Nothing to see here, just rebasing for those who want to experiment.

paulwalker-arm planned changes to this revision.Jul 13 2020, 10:18 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 10 2020, 8:13 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
paulwalker-arm planned changes to this revision.Aug 10 2020, 8:13 AM

@cameron.mcinally this is the patch I mentioned the other day, which contains the nodes where once I've written suitable tests I'll push separate patches for. Anything else is fair game. This patch implements VSELECT but that was just to investigate what we talked about during the previous sync call so I'll ignore it if you're planning to push on with your original work?

@cameron.mcinally this is the patch I mentioned the other day, which contains the nodes where once I've written suitable tests I'll push separate patches for.

Thanks, Paul. You mentioned that you would be focusing on another project for a few weeks. Would it help if I attempted to cherry-pick some of this Diff into individual patches (with new tests) for you? Or would I be stepping on your toes too much?

Anything else is fair game. This patch implements VSELECT but that was just to investigate what we talked about during the previous sync call so I'll ignore it if you're planning to push on with your original work?

I can carry on with VSELECT. I think I have enough intuition built around it now. I haven't yet looked into the state of the unpack[lo|hi] instructions, but will do so today.

@cameron.mcinally this is the patch I mentioned the other day, which contains the nodes where once I've written suitable tests I'll push separate patches for.

Thanks, Paul. You mentioned that you would be focusing on another project for a few weeks. Would it help if I attempted to cherry-pick some of this Diff into individual patches (with new tests) for you? Or would I be stepping on your toes too much?

That would be great, thanks. I already have patches up for the extends and am currently focusing on setcc, sub and the shifts, which leaves min/max and divides. That said, one area I've not looked at yet are the VECREDUCE_ nodes. I don't anticipate them to be that problematic but having proof of this would be nice. Let me know what you decide.

MIN/MAX and DIVs are up my alley. I'll try those. Will check out VREDUCE if I get through the others.

paulwalker-arm planned changes to this revision.Aug 13 2020, 5:15 AM
paulwalker-arm abandoned this revision.Sep 2 2020, 4:22 AM

With the exception of VSELECT lowering, which is being worked under D85364, everything else is available in master.