This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Add support for scalable vectorization of invariant stores
ClosedPublic

Authored by david-arm on Jun 21 2021, 1:50 AM.

Details

Summary

Previously in setCostBasedWideningDecision if we encountered an
invariant store we just assumed that we could scalarize the store
and called getUniformMemOpCost to get the associated cost.
However, for scalable vectors this is not an option because it is
not currently possibly to scalarize the store. At the moment we
crash in VPReplicateRecipe::execute when trying to scalarize the
store.

Therefore, I have changed setCostBasedWideningDecision so that if
we are storing a scalable vector out to a uniform address and the
target supports scatter instructions, then we should use those
instead.

Tests have been added here:

Transforms/LoopVectorize/AArch64/sve-inv-store.ll

Diff Detail

Event Timeline

david-arm created this revision.Jun 21 2021, 1:50 AM
david-arm requested review of this revision.Jun 21 2021, 1:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2021, 1:50 AM
Matt added a subscriber: Matt.Jun 21 2021, 8:49 AM
peterwaller-arm accepted this revision.Jun 28 2021, 8:32 AM

I'm relatively inexperienced in this area of the code, but giving this a +1 on the basis of reading the diff and thinking about this.

I understand that the result might not be ideal because a scatter may be slow when every lane is writing to the same address. I also understand that it seems reasonable to do this in the short term to reach functional correctness.

This revision is now accepted and ready to land.Jun 28 2021, 8:32 AM
fhahn added a subscriber: fhahn.Jun 28 2021, 8:41 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7354

I guess with that we are still crashing on targets with scalable vectors but without gather/scatter? Is there a way to also handle the case without gather/scatters without crashing?

david-arm added inline comments.Jun 28 2021, 8:50 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7354

I think there is potentially a way to do this, but may involve changing VPReplicateRecipe::execute to only store out the last lane value. I wasn't sure how difficult this was, or whether this was even an acceptable solution. So I chose this simpler route for now, especially since I wasn't sure how much we really had to worry about lack of scatter support at the moment. I was thinking to revisit this again at some point and investigate the possibility of changing the replicate recipe. I think essentially for fixed width we generate N stores to the same address, and let subsequent passes optimise all but the last away.

This revision was landed with ongoing or failed builds.Jun 29 2021, 3:56 AM
This revision was automatically updated to reflect the committed changes.