This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Lower shuffles to permute instructions: zip1/2, uzp1/2, trn1/2
ClosedPublic

Authored by wwei on Nov 7 2021, 6:40 PM.

Details

Summary

Attempt to lower a shuffle as a permute instruction(zip/uzp/trn) for fixed length SVE.

Diff Detail

Event Timeline

wwei created this revision.Nov 7 2021, 6:40 PM
wwei requested review of this revision.Nov 7 2021, 6:40 PM
david-arm added inline comments.Nov 8 2021, 1:47 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19264

This looks a bit wrong because VT=Fixed Length Vector Type, but Op1 and Op2 have scalable vector types.

wwei added inline comments.Nov 8 2021, 3:40 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19264

No, the code here is correct. VT and ShuffleMask are used to check the type of the mask value, like zip or uzp mask.
Op1 and Op2 are used to generate new permute nodes, which will be scalable aarch64 SDNodes here.
In this patch, I encapsulated a new function for the code lowering shuffles to neon ZIP/UZP/TRN. Fortunately, the same code can be used for fixed-length SVE.

peterwaller-arm accepted this revision.Nov 8 2021, 3:49 AM

LGTM, thanks.

This revision is now accepted and ready to land.Nov 8 2021, 3:49 AM
paulwalker-arm requested changes to this revision.Nov 8 2021, 5:13 AM
paulwalker-arm added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9678–9681

I don't believe this logic is universally safe. The use of ZIP2 specifically relies on knowing the exact size of the register to know which indices represent the "high half" of a vector register. This is something that is only known when sve-vector-bits-min==sve-vector-bits-max.

I also suspect functions like isZIPMask were written when only 64 and 128 bit legal vectors existed. I doubt the logic holds for the case when a vector is legal but not the exact size of the target vector register, as is the case with the SVE fixed length support.

I have not looked into the extent at which the other shuffles are affected but I suspect each have their own complexity.

llvm/test/CodeGen/AArch64/sve-fix-length-permute-zip-uzp-trn.ll
21

To highlight my previous comment. Index 16 is only the first byte of the second half of a 256bit vector and thus if the target has bigger vector register the use of zip2 will not have the correct behaviour.

This revision now requires changes to proceed.Nov 8 2021, 5:13 AM
wwei added inline comments.Nov 8 2021, 8:02 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9678–9681

Thanks, I get it. It seems that there are potential problems for the fixed-length shuffle vector lowering, include the patch https://reviews.llvm.org/D105289. I will update a new revision later to fix it by adding the check: sve-vector-bits-min==sve-vector-bits-max

paulwalker-arm added inline comments.Nov 8 2021, 8:36 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9678–9681

I believe that patch is ok given the restrictions it imposes. It is fine using the is###Mask functions as a way to identify a named shuffle. What is unsafe is assuming is###Mask means the ### instruction can be used directly when converted to scalable vectors.

So taking D105289 you can see that is uses isEXTMask to determine the type of shuffle but then the actually lowering code only handles a specific case and that case does not emit an EXT instruction as that would have fallen into the same trap.

wwei added inline comments.Nov 9 2021, 12:55 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9678–9681

Yeah, you are right. The logic in D105289 is correct, using EXTRACT_VECTOR_ELT and INSR instructions to match a specific EXT pattern, it's a smart solution.

wwei updated this revision to Diff 385727.Nov 9 2021, 1:03 AM

update the patch, adding sve min/max size check.

paulwalker-arm added inline comments.Nov 9 2021, 8:23 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19267–19268

I still don't think this is enough as you've missed my previous I doubt the logic holds for the case when a vector is legal but not the exact size of the target vector register comment. Further discussion is attached to the test zip_v32i8 below...

I'm starting to wonder if it's worth breaking out lowerShuffleToZIP_UZP_TRN as I feel like once support is added for more of the combinations there's not likely to be much reuse. For example taking the zip1/zip2 case I think supporting zip1 is likely simple and just works for all legal VTs, but zip2 requires the VT to be exactly the size of the target vector (i.e. VT.getSizeInBits() == Subtarget->getMinSVEVectorSizeInBits() == Subtarget->getMaxSVEVectorSizeInBits()) or perhaps a different instruction sequence.

llvm/test/CodeGen/AArch64/sve-fix-length-permute-zip-uzp-trn.ll
25–26

The output for VBITS_EQ_256 and VBITS_EQ_512 is the same, which doesn't make sense considering the target vector length is different.

For the VBITS_EQ_256 case index 16 is the start of the high half of the target vector and thus zip2 works. But for the VBITS_EQ_512 case index 16 is actually the start of the second quarter of the target vector and thus zip2 will do the wrong thing.

wwei added inline comments.Nov 10 2021, 1:00 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19267–19268

Thanks, I agree with you. Perhaps the current implementation of lowerShuffleToZIP_UZP_TRN reusing neon code is not so good, since SVE may have variable length.
Maybe we should identify and classify different shuffle patterns. For zip/uzp/trn/rev these four shuffle types, we can roughly divide them into two categories:
zip1/uzp1/uzp2/trn1/trn2/revb/revh/revw (it' simple,and works for all legal VTs)
zip2/rev (complex,and need consider high half or whole register)
I think we can implement these two scenarios in steps.

Matt added a subscriber: Matt.Nov 17 2021, 12:01 PM
wwei updated this revision to Diff 391320.Dec 2 2021, 7:05 AM
wwei added a comment.Dec 2 2021, 7:17 AM

update the patch,also there's another patch D114960 to support rev insts.

paulwalker-arm added inline comments.Dec 9 2021, 8:09 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19267–19268

Looking at the six shuffles trn1, trn2, uzp1, uzp2, zip1, zip2 I believe that when min_length != max_length != VT.getSizeInBits() only trn1, trn2 & zip1 are valid. We agree that zip2 is not valid because the top half of the input fixed length vector does not necessarily map to the top half of a scalable vector. The reason I believe both uzp variants are also invalid is because their underlying operation is to concat both input vectors and then extract every second element. However, the vectors we're concatenating may contain junk so you end up with:

expected uzp1 [A | B | C | D], [E | F | G | H] => [A | C | E | G]

SVE_VLS uzp [A | B | C | D | ? | ? | ? | ?], [E | F | G | H | ? | ? | ?] => [A | C | ? | ? | E | G | ? | ?]

And thus when you extract the fixed length result you'll end up with [A | C | ? | ?]

This I think further reduces the value of having lowerShuffleToZIP_UZP_TRN as it would be clearer to just handle each of the scenarios separately.

wwei added inline comments.Dec 9 2021, 8:59 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19267–19268

@paulwalker-arm Thanks for your detailed explanation. I fully agree with your point of view. Indeed, we don’t need lowerShuffleToZIP_UZP_TRN anymore, handle each case separately will make the code clearer. I will try to refactor the code later.

wwei updated this revision to Diff 393828.Dec 13 2021, 3:18 AM
wwei added a comment.Dec 13 2021, 7:51 AM

update the patch, lowerShuffleToZIP_UZP_TRN removed and test file refactored.

wwei updated this revision to Diff 394577.Dec 15 2021, 8:18 AM

rebased.

Just a passing review I'm afraid. I'll take a proper look tomorrow.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19267

I think you can use ContainerVT here? It might help with the formatting.

Same goes for the other places where you use Op1.getValueType().

19279–19288

I've not thought about it deeply so might be wrong but given these cases only use Op1 I'm wondering if they're always safe and thus don't need to be part of the MinSVESize == MaxSVESize == VT.getSizeInBits() restricted set?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19286–19289

I figured this comment could be better and started to write something but ended up with

Functions like isZIPMask return true when a ISD::VECTOR_SHUFFLE's mask represents the same logical operation as performed by a ZIP instruction.  In isolation these functions do not mean the ISD::VECTOR_SHUFFLE is exactly equivalent to an AArch64 instruction.  There's the extra component of ISD::VECTOR_SHUFFLE's value type to consider.  Prior to SVE these functions only operated on 64/128bit vector types that have a direct mapping to a target register and so an exact mapping is implied.

However, when using SVE for fixed length vectors, most legal vector types are actually sub-vectors of a larger SVE register.  When mapping ISD::VECTOR_SHUFFLE to an SVE instruction care must be taken to consider how the mask's indices translate.  Specifically, when the mapping requires an exact meaning for a specific vector index (e.g. Index X is the last vector element in the register) then such mappings are often only safe when the exact SVE register size is know.  The main exception to this is when indices are logically relative to the first element of either ISD::VECTOR_SHUFFLE operand because these relative indices don't change when converting from fixed-length to scalable vector types (i.e. the start of a fixed length vector is always the start of a scalable vector).

Which is more like a novel than a comment :) I've posted it anyway just in case there's something in there that's useful.

llvm/test/CodeGen/AArch64/sve-fixed-length-permute-zip-uzp-trn.ll
10–14 ↗(On Diff #394577)

It looks like InterleavedAccessPass is causing your code to be bypassed. I couldn't immediately see a way to disable the pass other than using -O0 which might cause other issues but I think if you use volatile loads and stores within these tests you'll get what you need.

413 ↗(On Diff #394577)

Please place the attributes together at the end of the file because otherwise they're hard to find when trying to see what attributes exist for a specific function.

627 ↗(On Diff #394577)

As above.

wwei updated this revision to Diff 395107.Dec 17 2021, 6:05 AM
wwei added inline comments.Dec 17 2021, 6:24 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19267

Changed, using ContainerVT to replace Op1.getValueType()

19279–19288

I don't think it's safe. For undef case, Op1 will be used as input twice, it still has the problem like below:

uzp [A | B | C | D | ? | ? | ? | ?], [A | B | C | D | ? | ? | ?] => [A | C | ? | ? |A | C | ? | ?]
19286–19289

Thanks for your comment, I used it directly in the code

llvm/test/CodeGen/AArch64/sve-fixed-length-permute-zip-uzp-trn.ll
10–14 ↗(On Diff #394577)

Thanks,volatile can work

413 ↗(On Diff #394577)

done

paulwalker-arm accepted this revision.Dec 20 2021, 10:15 AM
paulwalker-arm added inline comments.
llvm/test/CodeGen/AArch64/sve-fixed-length-permute-zip-uzp-trn.ll
419 ↗(On Diff #395107)

I think there's value in adding a comment that states why we can safely emit zip2 instructions. Not too detailed, it's just worth drawing the readers attention to the fact that for this and related tests the runtime vector length is known.

481 ↗(On Diff #395107)

As above, there's value in adding a comment that mentions why we can safely emit uzp instructions.

627 ↗(On Diff #395107)

Please add a small comment to highlight this is a negative test and why only zip1 instructions are emitted.

This revision is now accepted and ready to land.Dec 20 2021, 10:15 AM
This revision was landed with ongoing or failed builds.Dec 21 2021, 2:46 AM
This revision was automatically updated to reflect the committed changes.