This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SME]: Generate streaming-compatible code for ld2-alloca.
ClosedPublic

Authored by hassnaa-arm on Nov 28 2022, 4:44 AM.

Details

Summary

To generate code compatible to streaming mode:

  • disable lowering interleaved load to avoid generating invalid NEON intrinsics.

Diff Detail

Event Timeline

hassnaa-arm created this revision.Nov 28 2022, 4:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 28 2022, 4:44 AM
hassnaa-arm requested review of this revision.Nov 28 2022, 4:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 28 2022, 4:44 AM
sdesmalen added inline comments.Nov 28 2022, 6:10 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13965–13966

This also allows the case where the total VecSize == 32 (for e.g. <4 x i8> which is currently not supported by Neon), or whether the number of elements is not a power of 2 (e.g. <6 x i8>. Can you add a test for this case?

13966

The indentation seems weird, did you use clang-format?

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ld2-alloca.ll
163

nit: can you add nounwind as one of the parameters, to avoid the .cfi directives in the output?

hassnaa-arm marked an inline comment as done.Nov 28 2022, 9:24 AM
hassnaa-arm added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13965–13966

In the IR, I don't understand why it allocates 16 elements while it only loads 8.
Is there a reason behind that ?

hassnaa-arm marked an inline comment as done.

Add nounwind to avoid .cfi directives get generated.

Add extra test cases

sdesmalen added inline comments.Nov 30 2022, 1:18 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13965–13966

Based on the code you've added, I expected the tests you've added to use an ld2, do you know why it doesn't?

hassnaa-arm marked an inline comment as done.Nov 30 2022, 5:56 AM

Change the used mask in the test cases, so that interleaved load get used.

sdesmalen added inline comments.Nov 30 2022, 6:09 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967

Sorry, I just realise you'll also need to add a check that we can generate a predicate pattern for the number of elements (e.g. vl2, vl3, ...), because e.g. a <9 x i8> has no corresponding predicate pattern. You can use Optional<unsigned> getSVEPredPatternFromNumElements() for this (defined in Utils/AArch64BaseInfo.h).

Can you also add a test for <9 x i8> ?

hassnaa-arm added inline comments.Nov 30 2022, 6:13 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967

Sorry, I don't understand why I should check for e.g. a <9 x i8> ? How is that related to the condition in the code ?
and what do you mean by "You can use Optional<unsigned> getSVEPredPatternFromNumElements() for this " ?
Do you mean that in addition to adding the check, I will add a code change also ?

sdesmalen added inline comments.Nov 30 2022, 6:20 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967

What I meant was that if you do a shuffle mask like this:

%load = load <6 x i8>, ptr %alloc
%strided.vec = shufflevector <6 x i8> %load, <6 x i8> poison, <3 x i32> <i32 1, i32 3, i32 5>

You get the following LD2 instruction:

ptrue p0.b, vl3
ld2b { z0.b, z1.b }, p0/z, [x8]

Which uses vl3 for the predicate, which means enable 3 lanes. But there is no vl9, so if you'd end up with NumElements == 9, then you can't code-generate the interleaved access using LD2. To ask if there is a ptrue predicate for NumElements, you can use getSVEPredPatternFromNumElements.

(I meant <9 x i8> as the result type of the shuffle by the way)

hassnaa-arm added inline comments.Nov 30 2022, 6:31 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967

you mean if the result type of the shuffle is <9 x i8> there will be a problem, and you are asking me to add a test cases for that and fix its problem, correct ?

sdesmalen added inline comments.Nov 30 2022, 6:32 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967

Correct.

hassnaa-arm added inline comments.Nov 30 2022, 6:44 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967

but why did you choose vl9 specifically ? what about other vl ? e.g. vl10.

sdesmalen added inline comments.Nov 30 2022, 6:50 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967

Because the available predicate patterns are vl1, vl2, vl3, ..., vl8, vl16, ... So vl9 is the first non-power-of-2 vector length that can't be represented.

hassnaa-arm marked an inline comment as done.Nov 30 2022, 7:08 AM

Skip using interleaved_load for vl9, because vl9 predicate is not available.

sdesmalen accepted this revision.Nov 30 2022, 7:11 AM

LGTM with nit addressed

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13955

nit: you can just do !getSVEPredPatternFromNumElements(NumElements))

This revision is now accepted and ready to land.Nov 30 2022, 7:11 AM
hassnaa-arm marked an inline comment as done.Nov 30 2022, 7:57 AM

Add check for hasSVE() to avoid affecting NEON interleaved-access tests

Failed Tests (1):

LLVM :: CodeGen/AArch64/sve-streaming-mode-fixed-length-ld2-alloca.ll

Testing Time: 345.12s

Skipped          :    38
Unsupported      :  1357
Passed           : 88906
Expectedly Failed:   195
Failed           :     1

https://lab.llvm.org/buildbot/#/builders/85/builds/12665

uabelho added a subscriber: uabelho.Dec 1 2022, 3:37 AM