This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Improve code generation for VLS i1 masks
ClosedPublic

Authored by DavidTruby on Oct 6 2021, 5:33 AM.

Details

Summary

This patch partially resolves an issue for VLS code generation
where a mask is generated from a smaller width integer comparison
than the instruction using the mask requires.

Instead of sign extending a p register by converting it to a z
register, extending that, and converting back, we instead just
do an unpack of the p register.

A separate issue causes the code generation to still be poor when
the mask generation would fit in a neon register, as we then use
a neon comparison operation and have to convert that to a p register.
This will be resolved in a separate patch.

Diff Detail

Event Timeline

DavidTruby created this revision.Oct 6 2021, 5:33 AM
DavidTruby requested review of this revision.Oct 6 2021, 5:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 6 2021, 5:33 AM
peterwaller-arm added inline comments.Oct 6 2021, 6:24 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15846

nit. As many of the other combines have, please could you introduce a comment showing the form of the intended combine?

16595

nit. As many of the other combines have, please could you introduce a comment showing the form of the combine?

paulwalker-arm added inline comments.Oct 6 2021, 9:49 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16603

I believe this transform is sound for a normal SETCC but N is a SETCC_MERGE_ZERO and so consideration must be paid to the predicate to ensure the inactive lanes get zero'd.

The easiest way to make the combine safe is to check Pred is all active. The problem with doing that is it'll probably mean the combine doesn't fire for the case you care about.

Matt added a subscriber: Matt.Oct 6 2021, 10:45 AM

Added an extra condition to make the transformation sound.

In essence, we need to check that the predicate is the same
before and after the transform, modulo its size since we're
moving from a smaller to larger predicate.

DavidTruby marked an inline comment as done.Oct 12 2021, 5:40 AM
bsmith added inline comments.Oct 25 2021, 3:34 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
36

Rogue include?

16605

Is this assuming that Pred and OrigPred and ptrues? If they're not, won't this break?

Add additional check that predicates are ptrues

DavidTruby marked 2 inline comments as done.Oct 27 2021, 5:57 AM

Add negative tests for the case where the ptrue vl is not the same, or where
one of the predicates is not a ptrue at all.

llvm/test/CodeGen/AArch64/sve-punpklo-combine.ll
29

ptrue

Improve added ACLE intrinsic tests

Fix spelling mistake

Fix tests after spelling mistake (whoops...)

llvm/test/CodeGen/AArch64/sve-punpklo-combine.ll
19

Unused %dup? (I guess you replaced it with zeroinitializer?)

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15846

This comment doesn't match the code below. At the DAG level this is really sunpklo(sext(pred)) -> sext(extract_low_half(pred)).

Also, is this transform universally good? As in, if the later transformation does not occur, is punpklo always better than sunpklo?

15854

Can you be explicit here given we know we always want a sign extend?

16604

Perhaps I've confused myself with the number of getOperand() calls but this doesn't looks quite right:

LHS == ISD::SIGN_EXTEND
LHS->getOperand(0) == ISD::EXTRACT_SUBVECTOR
LHS->getOperand(0)->getOperand(0) == vector we're extracting the subvector from
LHS->getOperand(0)->getOperand(0)->getOperand(0) == ???? and how do you know LHS->getOperand(0)->getOperand(0) has an operand to get?

I can see the following code then ensures LHS->getOperand(0)->getOperand(0)->getOperand(0) is a PTRUE but before then depending on LHS->getOperand(0)->getOperand(0) the extra call to getOperand(0) might assert. Also, is it really the case that is doesn't matter how the PTRUE is processed before being passed to ISD::EXTRACT_SUBVECTOR?

DavidTruby updated this revision to Diff 393121.Dec 9 2021, 5:16 AM

Refactor SetCC combine and rebase

DavidTruby updated this revision to Diff 393125.Dec 9 2021, 5:32 AM

replace getSextOrTrunc with getNode(SIGN_EXTEND)

DavidTruby updated this revision to Diff 393126.Dec 9 2021, 5:33 AM

Update comments

DavidTruby marked 3 inline comments as done.Dec 9 2021, 5:35 AM
DavidTruby added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15846

I'm not sure that the transform is universally good, however it does significantly simplify the following transform. This transform has the effect of moving the setcc that converts from a i1 register to another register after the sign_extend, which becomes just a single sign extend rather than a chain of unpacks of unknown length. This is a much easier pattern to recognize later.

16604

I've refactored all of this to make it a bit clearer what's going on, and added an additional check so that we can make sure that getOperand(0) actually exists.

peterwaller-arm added inline comments.Dec 9 2021, 8:47 AM
llvm/test/CodeGen/AArch64/sve-punpklo-combine.ll
225

I believe these should be spelled "v2i16", etc?

DavidTruby marked an inline comment as done.

Fix up acle tests

DavidTruby marked 3 inline comments as done.Dec 13 2021, 5:11 AM
DavidTruby marked 2 inline comments as done.Dec 13 2021, 5:31 AM
peterwaller-arm accepted this revision.Dec 13 2021, 7:03 AM
This revision is now accepted and ready to land.Dec 13 2021, 7:03 AM

I've not hit the "Request Changes" button just in case I'm wrong but please don't commit the patch until my AArch64SVEPredPattern::all query is resolved one way or the other.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15846

Sounds reasonable to me. Given it's not clear cut perhaps it's worth adding a line after // sunpklo(... to highlight this combine works in partnership with performSetCCPunpkCombine?

16552

Is this required? I'm thinking that because the return type of AArch64ISD::SETCC_MERGE_ZERO is always a scalable i1 vector, the previous Extract->getValueType(0) != N->getValueType(0) already has you covered here.

16553

FYI: I wasn't going to say this and you may well want to ignore this in the interest of time but I cannot convince myself this restriction is necessary. I say this because regardless of it's value the result will be a sequence of PUNPK instructions which should all do what we want. The downside is that we'll need more test coverage and given this code is a special case it's probably best to leave as is.

16560–16562

This comment doesn't match the code. The punpklo part should be extract_subvector?

It's also worth copying this as a function comment because without it you need to jump here first before you can understand why the code above exists. Perhaps with a starting statement of "Remove redundant predicate trunc(sext()) sequences."

Actually if you do that then I'm wondering if here words might better describe what is going on. I'm think something like

By this point we've effectively got zero_inactive_lanes_and_trunc_i1(sext_i1(A)). If we can prove A's inactive lanes are already zero then the trunc(sext()) sequence is redundant and we can operate on A directly.
16563

Up to you but given you have InnerSetCC perhaps this should be InnerPred?

16566

Is this enough? I'm not totally sure but my initial reaction is "If this is AArch64SVEPredPattern::all then that means the number of active lanes for InnerSetCC will be different to N", so I'm thinking this needs to be restricted to the cases where a specific vector length pattern is used?

Add additional check that ptrues are not vl_all, and associated tests.

Improve comments describing what the punpk combine does in combination with
the other combine.

Ensure ptrue has a fixed vector length

paulwalker-arm added inline comments.Dec 17 2021, 7:32 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15856–15858

This can be just return DAG.getNode(ISD::SIGN_EXTEND...

16542

To be consistent, either the above punpklo needs to be extract_subvector or this need should be punpklo.

16552

If you go with punpklo above then can you please add something like // punpklo == extract_subvector here.

llvm/test/CodeGen/AArch64/sve-punpklo-combine.ll
20–21

I think you can simplify all the tests in this file by removing the extending load part and returning the comparison result directly. I say this because the new DAG combines don't care about them and so they shouldn't be necessary to exercise them.

243

From the function name alone it's not obvious this is a negative test. Please add a small comment that highlights this is a negative test and why the optimisation is not applicable.

I've put the comment here but it's a general one applicable to the other functions.

Add comments to negative tests explaining why the optimisation can't apply

Directly return predicate from tests rather than adding an extending load

paulwalker-arm accepted this revision.Dec 17 2021, 8:15 AM
paulwalker-arm added inline comments.
llvm/test/CodeGen/AArch64/sve-punpklo-combine.ll
247–252

These can be removed also?

This revision was landed with ongoing or failed builds.Dec 17 2021, 8:44 AM
This revision was automatically updated to reflect the committed changes.