This is an archive of the discontinued LLVM Phabricator instance.

[SVE][LoopVectorize] Optimise code generated by widenPHIInstruction
ClosedPublic

Authored by RosieSumpter on Sep 8 2021, 9:04 AM.

Details

Summary

For SVE, when scalarising the PHI instruction the whole vector part is
generated as opposed to creating instructions for each lane for fixed-
width vectors. However, in some cases the lane values may be needed
later (e.g for a load instruction) so we still need to calculate
these values to avoid extractelement being called on the vector part.

Diff Detail

Event Timeline

RosieSumpter created this revision.Sep 8 2021, 9:04 AM
RosieSumpter requested review of this revision.Sep 8 2021, 9:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 8 2021, 9:04 AM

Just a minor nit on the commit message, this patch is not really specific to AArch64 SVE but rather to scalable vectors.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771

Hi @RosieSumpter, I think it's worth elaborating a little bit more on the 'generate better code' in the comment.

[(too) long explanation here]

From what I understand, the code is better because the extractelement instruction that is otherwise generated (for scalar uses of this vector) may not always be folded away if the stepvector has multiple uses, leading to a redundant move (in case of element 0 for the vector-element-0 -> gpr move) or possibly expensive extractelement instructions (to extract a fixed-width lane from a scalable vector) for element > 0.

In the former case, the value for element 0 is freely available because it is the start value of the stepvector.
In the latter case, there will be a cost regardless. Either the additional add/gep generated below to offset from the start value of the stepvector, or the extract from the stepvector itself. It's just expected that the scalar code will be cheaper.

Can you maybe capture some of that in the comment? (albeit more succinctly)

SjoerdMeijer added inline comments.Sep 9 2021, 8:16 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771

I had exactly the same questions as Sander. The main question I think is indeed why this is better, which it's not (that) obvious from the test changes. Thus, I was wondering, does this deserve adding some CodeGen tests?

sdesmalen added inline comments.Sep 9 2021, 8:25 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771

IMO an improved description should be sufficient.

SjoerdMeijer added inline comments.Sep 9 2021, 9:00 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771

Ok, but to be more explicit: this shows that the IR -> asm step isn't tested, is it? Why would we not test this? I think it would help too with explaining why this is better.

Improved comment

RosieSumpter added inline comments.Sep 10 2021, 1:28 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771

For now I have just updated the comment (hopefully it makes sense - I have tried to add some detail but keep it concise, but am happy to change it!)

Also happy to add a codegen test if it's decided that it's necessary.

Thanks @RosieSumpter that explains it well I think.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771

Ok, but to be more explicit: this shows that the IR -> asm step isn't tested, is it? Why would we not test this?

Correct, it normally doesn't happen for individual IR passes that the resulting asm for a particular target is tested for a change to the transformation, right? IMO that should be avoided whenever possible, because it defeats the purpose of a unit test.

In this case it should be sufficient to test explicitly that the extractelement does not exist after loop-vectorize. That happens with the extra CHECK-NOT line in sve-widen-phi.ll. The fmov is the code-generated equivalent to the extract element of element 0, so would there be any value in also code-generating this for AArch64 to show that the fmov is removed?

Perhaps @RosieSumpter can add a similar comment to clarify the CHECK-NOT line to clarify this a bit more in the test itself?

4771

nit: s/is not folded/is not folded away/

4771–4772

nit: s/we still need to calculate/we still calculate/
nit: s/to avoid a redundant move/to avoid redundant moves/

SjoerdMeijer added inline comments.Sep 10 2021, 2:16 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771

I was proposing an additional test, thus I was not proposing that a codegen test should replace an IR test, which would indeed be a bad idea.

This change is/was talking about codegen improvements. The IR changes here don't make that very obvious IMO. So if I was doing this work, I would add additional codegen tests with the before/after IR as input, and check its codegen to make sure these improvements are there.

Not sure if I am missing something or should be discussing adding tests here. Don't have strong opinions either, so will leave finishing the review up to you.

Amended original comment, and added comment to sve-widen-phi.ll test.

sdesmalen accepted this revision.Sep 10 2021, 3:52 AM
sdesmalen added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4769–4771

Thanks, I understood what you meant about adding an additional test. In this case it's more about InstCombine not folding it away because the vector has multiple uses than it is about the target not being able to do something special with it, so from my perspective adding a codegen test for this would be a bit out of place.

This revision is now accepted and ready to land.Sep 10 2021, 3:52 AM
This revision was landed with ongoing or failed builds.Sep 10 2021, 4:06 AM
This revision was automatically updated to reflect the committed changes.
This comment was removed by kmclaughlin.

Hi all, apologies for my last comment on this patch - a mistake with email auto-complete led to an internal email being forwarded to Phabricator.