This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Synthesize mask operands for vector variants as needed
ClosedPublic

Authored by huntergr on Aug 23 2022, 3:45 AM.

Details

Summary

This patch allows loop vectorization with function calls in cases where masks are not required but the only available vectorized function variants are masked.

Some of the code was originally written by @paulwalker-arm

Diff Detail

Event Timeline

huntergr created this revision.Aug 23 2022, 3:45 AM
huntergr requested review of this revision.Aug 23 2022, 3:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2022, 3:45 AM
mgabka added a subscriber: mgabka.Aug 23 2022, 10:54 PM

I recommend you split this patch into the following patches:

  • First, add support for calling the predicated call version when masking is not required for correctness. This requires all the plumbing for costing both versions of the call (if both are available), and let's you test those bits.
  • Second, add support for predicated calls via scalar-with-predication. This requires speculation reasoning (which you don't appear to have), and is complicated enough to need testing on it's own. (This may partially exist already, but some of the code at least is incomplete since e.g. isPredicatedInst always returns false for calls.)
  • Third, come back and add support for predication of calls via masking (as opposed to replication.)

The first and second step are independent, and could be done in either order.

llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll
1

Please autogenerate these tests for readability.

huntergr updated this revision to Diff 459684.Sep 13 2022, 2:04 AM

Split out the functionality for synthesizing a mask when required, added a cost for mask generation.

huntergr marked an inline comment as done.Sep 13 2022, 2:09 AM

Thanks for the review. I'm tempted to add a masking equivalent to -force-target-supports-scalable-vectors=true in order to have target-independent tests, but I can add that in another patch.

Ping.

@reames -- is this roughly what you expected for the case of allowing a masked variant to be used when no mask is required? I've added a cost for generating the mask (per-call for now, instead of potentially sharing it) so that we can compare costs for different VFs with and without a masked variant, but I think we would always prefer the non-masked variant for the same VF if a mask is not required.

In any case, I'm now working on the third patch.

fhahn added inline comments.
llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
478 ↗(On Diff #459684)

It would be good if the decision whether to used the masked or non-masked variant would be taken at the time of VPlan construction instead of during executing.

It would probably also be good to pass in the mask as operand to the recipe, especially if we want to support non-trivial masks in the future.

huntergr added inline comments.Oct 13 2022, 2:29 AM
llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
478 ↗(On Diff #459684)

So the problem I had with trying to decide up front was that you might have both masked and unmasked variants available, and the decision on which one to use left to the cost model -- which I think is calculated after VPlan construction.

For example, on AArch64 you might have a non-masked NEON variant and a masked SVE variant. If you know the implementation width is 128b, then the cost would be slightly higher for generating the mask for the SVE variant. If it's 256b or higher, it might be worth the extra cost due to additional parallelism.

Is there a (straightforward) way to tell VPlan that it may need to construct different recipes based on masked/non-masked variants being available? Or would this need some reworking of VPlan?

I did add the mask as an operand in the case where it is required though. If we can generate multiple recipes easily then it can be added to the operands when a dummy mask is required (and possibly shared if there are multiple calls, giving a more accurate cost).

Matt added a subscriber: Matt.Oct 19 2022, 5:18 AM
huntergr updated this revision to Diff 477717.Nov 24 2022, 3:20 AM
huntergr retitled this revision from [LoopVectorize] Support masked function vectorization to [LoopVectorize] Synthesize mask operands for vector variants as needed.
huntergr edited the summary of this revision. (Show Details)

Updated to store the pointer to the vector function in the recipe rather than looking it up again during recipe execution. Forced generation of a plan per VF when there are variants available for those VFs. Added some new tests for masked vs. unmasked variants.

I'm not fond of adding the optional parameters LoopVectorizationCostModel::getVectorCallCost -- it feels like function lookup needs to be split out of it, but I'd like to get some feedback from others before doing so.

huntergr marked an inline comment as done.Nov 24 2022, 3:21 AM

Hi @huntergr, I've only partially reviewed this I'm afraid, but here are the comments I have so far. I'll try to review more this week!

llvm/include/llvm/Analysis/VectorUtils.h
129

Might be worth adding /// comments here, since the others all have them?

134

I think this will be compiled away in a release build, right? So really it's just a non-release wrapper around getParamIndexForOptionalMask. Given it's only called in one place is it worth just making getParamIndexForOptionalMask public instead and putting an assert in LoopVectorize.cpp that a mask exists?

137

This function is never called - can it be deleted?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3442

This looks a little strange to me. In my mind, the ability to emit an active lane mask based on two integer inputs is orthogonal to how cheap it is to broadcast a true bit across a predicate. For example, an architecture may cheaply support the latter, but not the former. Maybe X86 is such an example? Can we not just let the mask cost decide the behaviour? That way you can simplify this to just

if (!VecFunc) {
   ...
3456

Do we really need both the Variant and the NeedToScalarize parameter? It looks naively that setting Variant = VecFunc is synonymous with NeedToScalarize = false. I haven't looked into this in detail so I could be wrong, but it might make more sense to remove the NeedToScalarize in favour of setting Variant?

8359

Doesn't this mean we may end up picking the least optimal VF? For example, if there are v2i32 and v4i32 masked variants we'll only ever pick the v2i32, i.e. the lowest VF?

huntergr updated this revision to Diff 490161.Jan 18 2023, 7:50 AM

Rebased, changed as requested.

huntergr marked 3 inline comments as done.Jan 18 2023, 7:58 AM
huntergr added inline comments.
llvm/include/llvm/Analysis/VectorUtils.h
137

There's now a use for it in the assert when building a recipe. (This was used in the original patch, but was left in when splitting into 3 parts).

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3442

My thinking was to treat the capability to emit an active lane mask as a proxy for being able to use masks at all, but perhaps that's a little too conservative.

I don't know if we should add a proper TTI interface to represent that capability, or just rely on the VFDatabase only having entries which the target is capable of supporting.

In any case, I've removed that check for now.

8359

No. Since we now store the pointer to the Function in the recipe, we need to force vplan to generate different plans for each VF that has a vector variant available.

See the vplan checks for 'test_v2_v4m' in synthesize-mask-for-call.ll -- there are separate VF=2 and VF=4 plans, with a widened call to different functions.

Thanks a lot for addressing all the comments @huntergr! I just have a few more minor comments then I think it's good to go. :)

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8359

OK, can you add some comments here explaining why you are forcing the creation of a new vplan for every subsequent VF after discovering a vector variant?

8363

nit: Could you change this to

VFShape Shape = VFShape::get(*CI, VariantVF, /*HasGlobalPred=*/true);

llvm/lib/Transforms/Vectorize/VPlan.h
949

Can you add some comments explaining what this is please? For example, that there should be one recipe for every VF because the variant requires a 1:1 mapping with the VF?

llvm/test/Transforms/LoopVectorize/AArch64/synthesize-mask-for-call.ll
8 ↗(On Diff #490161)

I think it's also worth having a test for the case when both VF=2 and VF=4 use unmasked variants, because even in this case we must create separate vplans for each. The Variant member added to the recipe requires it now I think.

huntergr updated this revision to Diff 494239.Feb 2 2023, 3:25 AM

Rebased, addressed latest requests.

huntergr marked 4 inline comments as done.Feb 2 2023, 3:27 AM
david-arm accepted this revision.Feb 9 2023, 1:18 AM

LGTM! Thanks for making all the changes @huntergr. :)

This revision is now accepted and ready to land.Feb 9 2023, 1:18 AM
This revision was landed with ongoing or failed builds.Feb 14 2023, 6:52 AM
This revision was automatically updated to reflect the committed changes.