Page MenuHomePhabricator

[LoopVectorize] Synthesize mask operands for vector variants as needed
Needs ReviewPublic

Authored by huntergr on Aug 23 2022, 3:45 AM.

Details

Summary

This patch allows loop vectorization with function calls in cases where masks are not required but the only available vectorized function variants are masked.

Some of the code was originally written by @paulwalker-arm

Diff Detail

Unit TestsFailed

TimeTest
1,820 msx64 debian > libFuzzer.libFuzzer::fuzzer-finalstats.test
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -O2 -gline-tables-only -fsanitize=address,fuzzer -I/var/lib/buildkite-agent/builds/llvm-project/compiler-rt/lib/fuzzer -m64 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/fuzzer/SimpleTest.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/fuzzer/X86_64DefaultLinuxConfig/Output/fuzzer-finalstats.test.tmp-SimpleTest

Event Timeline

huntergr created this revision.Aug 23 2022, 3:45 AM
huntergr requested review of this revision.Aug 23 2022, 3:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2022, 3:45 AM
mgabka added a subscriber: mgabka.Aug 23 2022, 10:54 PM

I recommend you split this patch into the following patches:

  • First, add support for calling the predicated call version when masking is not required for correctness. This requires all the plumbing for costing both versions of the call (if both are available), and let's you test those bits.
  • Second, add support for predicated calls via scalar-with-predication. This requires speculation reasoning (which you don't appear to have), and is complicated enough to need testing on it's own. (This may partially exist already, but some of the code at least is incomplete since e.g. isPredicatedInst always returns false for calls.)
  • Third, come back and add support for predication of calls via masking (as opposed to replication.)

The first and second step are independent, and could be done in either order.

llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll
10–1

Please autogenerate these tests for readability.

huntergr updated this revision to Diff 459684.Sep 13 2022, 2:04 AM

Split out the functionality for synthesizing a mask when required, added a cost for mask generation.

huntergr marked an inline comment as done.Sep 13 2022, 2:09 AM

Thanks for the review. I'm tempted to add a masking equivalent to -force-target-supports-scalable-vectors=true in order to have target-independent tests, but I can add that in another patch.

Ping.

@reames -- is this roughly what you expected for the case of allowing a masked variant to be used when no mask is required? I've added a cost for generating the mask (per-call for now, instead of potentially sharing it) so that we can compare costs for different VFs with and without a masked variant, but I think we would always prefer the non-masked variant for the same VF if a mask is not required.

In any case, I'm now working on the third patch.

fhahn added inline comments.
llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
474

It would be good if the decision whether to used the masked or non-masked variant would be taken at the time of VPlan construction instead of during executing.

It would probably also be good to pass in the mask as operand to the recipe, especially if we want to support non-trivial masks in the future.

huntergr added inline comments.Oct 13 2022, 2:29 AM
llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
474

So the problem I had with trying to decide up front was that you might have both masked and unmasked variants available, and the decision on which one to use left to the cost model -- which I think is calculated after VPlan construction.

For example, on AArch64 you might have a non-masked NEON variant and a masked SVE variant. If you know the implementation width is 128b, then the cost would be slightly higher for generating the mask for the SVE variant. If it's 256b or higher, it might be worth the extra cost due to additional parallelism.

Is there a (straightforward) way to tell VPlan that it may need to construct different recipes based on masked/non-masked variants being available? Or would this need some reworking of VPlan?

I did add the mask as an operand in the case where it is required though. If we can generate multiple recipes easily then it can be added to the operands when a dummy mask is required (and possibly shared if there are multiple calls, giving a more accurate cost).

Matt added a subscriber: Matt.Oct 19 2022, 5:18 AM
huntergr updated this revision to Diff 477717.Thu, Nov 24, 3:20 AM
huntergr retitled this revision from [LoopVectorize] Support masked function vectorization to [LoopVectorize] Synthesize mask operands for vector variants as needed.
huntergr edited the summary of this revision. (Show Details)

Updated to store the pointer to the vector function in the recipe rather than looking it up again during recipe execution. Forced generation of a plan per VF when there are variants available for those VFs. Added some new tests for masked vs. unmasked variants.

I'm not fond of adding the optional parameters LoopVectorizationCostModel::getVectorCallCost -- it feels like function lookup needs to be split out of it, but I'd like to get some feedback from others before doing so.

huntergr marked an inline comment as done.Thu, Nov 24, 3:21 AM