Page MenuHomePhabricator

[LoopVectorize] Permit fixed-width epilogue loops for scalable vector bodies
Needs ReviewPublic

Authored by david-arm on Sep 8 2021, 4:14 AM.

Details

Summary

At the moment in LoopVectorizationCostModel::selectEpilogueVectorizationFactor
we bail out if the main vector loop uses a scalable VF. This patch adds
support for generating epilogue vector loops using a fixed-width VF when the
main vector loop uses a scalable VF.

I've changed LoopVectorizationCostModel::selectEpilogueVectorizationFactor
so that we convert the scalable VF into a fixed-width VF and do profitability
checks on that instead. In addition, since the scalable and fixed-width VFs
live in different VPlans that means I had to change the calls to
LVP.hasPlanWithVFs so that we only pass in the fixed-width VF.

New tests added here:

Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll

Diff Detail

Event Timeline

david-arm created this revision.Sep 8 2021, 4:14 AM
david-arm requested review of this revision.Sep 8 2021, 4:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 8 2021, 4:14 AM

Different plans may have different recipes, so if we don't find a single plan that supports both the main VF and the epilogue VF, I worry we may get unintended codegen. Furthermore, what guarantees that a vplan with a fixed-width conversion from the scalable VF actually exists? Would it be possible to build scalable plans that include the corresponding fixed width (and some lower powers of 2) VFs?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6262–6271

"EpilogueVF" is confusing here, since the rest of the logic expects main-loop VF. Suggest renaming to "KnownMainLoopVF" or "FixedMainLoopVF" or "FixedWidthMainLoopVF".

6266

include the fixed-width value in the debug message?

Different plans may have different recipes, so if we don't find a single plan that supports both the main VF and the epilogue VF, I worry we may get unintended codegen. Furthermore, what guarantees that a vplan with a fixed-width conversion from the scalable VF actually exists? Would it be possible to build scalable plans that include the corresponding fixed width (and some lower powers of 2) VFs?

Hi @bmahjour thanks for taking a look at the patch and leaving some comments!

So in LoopVectorizationPlanner::plan we do try to build vplans with equivalent ranges of VFs, i.e.

buildVPlansWithVPRecipes(ElementCount::getFixed(1), MaxFactors.FixedVF);
buildVPlansWithVPRecipes(ElementCount::getScalable(1), MaxFactors.ScalableVF);

Even if for some reason we are missing the fixed-width equivalent from the vplan the worst thing that can happen is we just won't produce an epilogue loop I think? In future we do intend to also add support for using scalable vectors in the epilogue loop as well, but I imagine that this is less likely to be profitable because "VF=vscale x 2" will likely cover more scalar iterations than "VF=2". This means there is a lower chance of us ever entering the epilogue loop.

However, you make a good point about the possibility of having different recipes for the same loop depending upon whether it's scalable or fixed-width. I wasn't sure if this would actually be a problem or not because the vectorised loops are independent. At the moment I can't think of a scenario where the recipes would be different, but perhaps I can add code to bail out if the recipes are different?

david-arm updated this revision to Diff 372266.Sep 13 2021, 8:41 AM
  • Renamed EpilogueVF -> FixedMainLoopVF
  • Improved new LLVM_DEBUG message to include FixedMainLoopVF
david-arm marked 2 inline comments as done.Sep 13 2021, 8:41 AM

However, you make a good point about the possibility of having different recipes for the same loop depending upon whether it's scalable or fixed-width.

Even for non-scalable targets, this patch may cause the epilogue plan to be different from that of the main loop. One could argue that it may be desirable to have different plans (recipes) for the two in some cases, but at least the selection criteria should not be arbitrary (as it would be if we just look for *a* plan that matches the VF).

I'm still wondering why we currently create two separate plans one for the scalable VFs and one for the fixed-width VFs, instead of one plan that includes a union of fixed-width VFs and scalable VFs? If we had such a single plan, then the workarounds in this patch would not be necessary.

Matt added a subscriber: Matt.Sep 16 2021, 1:22 PM
david-arm updated this revision to Diff 375560.Tue, Sep 28, 6:58 AM
  • Fixed a bug in createEpilogueVectorizedLoopSkeleton where we created the wrong Step value.

I have run LNT and SPEC2006 on this patch and all tests pass with these changes.

I found no overall change in performance with SPEC2006 when building with scalable vectorisation on a A64FX machine.

Hi @david-arm, thanks for working on this. I don't see any fundamental issues with having a different VPlan for the epilogue loop and the main vector body. In fact, it makes sense to me to pick the most suitable and cost-effective (VF, Plan) pair for the epilogue. If we want to use scalable vectors for the epilogue loop at some point, then we'll also want to use predication, so I could see that requiring a different VPlan. It doesn't seem to me that this patch makes arbitrary decisions about which VPlan to choose, it aims to pick the VF with the lowest cost, although it currently falls back (consistently) to a fixed-width VPlan when the main loop has a scalable VF.

Thanks for testing that there are no functional regressions when having different VPlans for the main- and epilogue vector loops.

I made some suggestions to simplify the implementation, hope my comments make sense.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6244

Can this change be committed as a separate patch, this seems like a sensible bugfix to me. (although no idea how to test for it)

6263

nit: Can you rewrite this as:

if (MainLoopVF.isScalable())
  LLVM_DEBUG(...);

auto FixedMainLoopVF = ElementCount::getFixed(FixedMainLoopVF.getKnownMinValue());

FixedMainLoopVF.isScalable() reads like a contradictory expression that should never happen.

8193

I see that you're just building on top of the current mechanism, but I'd rather see setBestPlan replaced by getBestPlanFor, which returns a pointer to the VPlan that can be passed to LoopVectorizationPlanner::executePlan. That way, you don't need to do all this odd trickery with removing vplans and requiring BackupPlans to repopulate the set for a second call to setBestPlan.

Then the code becomes a bit simpler to follow:

auto Plan = getBestPlanFor(VF, UF)
LVP.executePlan(Plan, VF, UF, ILV, DT);

I don't know whether the type of auto Plan can be VPlan* or whether it needs to be an instance of std::unique_ptr<VPlan>. It would be convenient if the LoopVectorizationPlanner can keep ownership of all VPlans until the end, where LoopVectorizationPlanner::executePlan just invokes the relevant VPlan to execute.

8397

Not sure I fully understand it, but is the VF at this point always guaranteed to be a fixed-width VF? If so, can we avoid making this change here (and instead s/getKnownMinValue/getFixedValue/)? I'm sure we'll want this change at some point when we make the epilogue VF scalable, but perhaps this patch is not the one to change it in.

I'm still wondering why we currently create two separate plans one for the scalable VFs and one for the fixed-width VFs, instead of one plan that includes a union of fixed-width VFs and scalable VFs? If we had such a single plan, then the workarounds in this patch would not be necessary.

Scalable VFs require different VPlans because parts of the plan that are legal for fixed-width VFs may not work for scalable VFs, or may not be profitable. For example, scalable vectors may use gather/scatter instructions, whereas for fixed-width vectors it may resort to scalarised loads/stores.

I found no overall change in performance with SPEC2006 when building with scalable vectorisation on a A64FX machine.

Are you sure that you actually enabled/used tail vectorization? I would have expected differences in performance if this was used.

I found no overall change in performance with SPEC2006 when building with scalable vectorisation on a A64FX machine.

Are you sure that you actually enabled/used tail vectorization? I would have expected differences in performance if this was used.

Yeah I'm sure. I'm happy to run some more tests though on a different machine! This simply points to a couple of things I think:

  1. It may suggest that not much hot C/C++ code in SPEC2006 is currently vectorisable. We might see more difference if we could test Fortran benchmarks.
  2. If there are any vectorised loops then the main body trip count is possibly far larger than the tail trip count. I imagine tail vectorisation has the largest impact on smaller loops?
david-arm added inline comments.Mon, Oct 4, 1:04 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8397

This fix is needed for correctness in the tests below, i.e. see

; CHECK:         [[IDX_NXT]] = add nuw i64 [[IDX]], [[VEC_ITS1]]

Without this fix we end up incrementing the induction variable by the wrong VF, which leads to subsequent crashes in a later pass too.

david-arm added a comment.EditedMon, Oct 4, 1:15 AM

Hi @sdesmalen, also I guess we may still not be vectorising much on A64FX due to the cost model. Perhaps I can test this out in conjunction with my other patches to tune the cost model?

david-arm added inline comments.Mon, Oct 4, 8:53 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8193

Hi @sdesmalen, Im happy to take a look at this. I think in order to keep the VPlans in LoopVectorizationPlanner and return a pointer to the best vplan from getBestPlanFor I'd need to change VPlans to use VPlan* instead of std::unique_ptr<VPlan>. Not sure if @fhahn or @bmahjour have any thoughts on this?

david-arm added inline comments.Mon, Oct 4, 9:20 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8193

Actually, creating function prototypes like this seems to work:

PlanPtr& getBestPlanFor(ElementCount VF, unsigned UF);
void executePlan(VPlanPtr &BestPlan, InnerLoopVectorizer &LB, DominatorTree *DT);
david-arm updated this revision to Diff 377117.Tue, Oct 5, 1:57 AM
  • Spun-off patch D111125 to remove setBestPlan in favour of getBestPlanFor and using that in this patch.
  • Addressed other review comments.
david-arm marked 4 inline comments as done.Tue, Oct 5, 2:33 AM

I observed no regressions when running SPEC2006 on a Neoverse-N1 machine. In fact, overall I saw a 0.9% performance improvement using the geometric mean of all results. Biggest outliers:

471.omnetpp: 2.7% faster
429.mcf: 2.2% faster
483.xalancbmk: 1.7% faster

david-arm updated this revision to Diff 378179.Fri, Oct 8, 5:20 AM
  • Rebase.