This is an archive of the discontinued LLVM Phabricator instance.

[LV] Use VScaleForTuning to fine-tune the cost per lane.
ClosedPublic

Authored by sdesmalen on Nov 4 2021, 11:23 AM.

Details

Summary

When targeting a specific CPU with scalable vectorization, the knowledge
of that particular CPU's vscale value can be used to tune the cost-model
and make the cost per lane less pessimistic.

If the target implements 'TTI.getVScaleForTuning()', the cost-per-lane
is calculated as:

Cost / (VScaleForTuning * VF.KnownMinLanes)

Otherwise, a value of 1 is assumed, meaning that the behavior is
unchanged and the cost is calculated as:

Cost / VF.KnownMinLanes
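
For illustration, a minimal sketch of the computation described above, written against the Optional-based API used at the time of this patch; 'CostValue' (the already-computed cost for the whole VF, as a plain unsigned) is an assumed name, and the exact code in LoopVectorize.cpp may differ:

unsigned EstimatedWidth = VF.getKnownMinValue();
if (VF.isScalable())
  if (Optional<unsigned> VScale = TTI.getVScaleForTuning())
    EstimatedWidth *= VScale.getValue();
// Without a tuning value, EstimatedWidth stays at VF.KnownMinLanes,
// so the behavior is unchanged.
float CostPerLane = float(CostValue) / EstimatedWidth;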

Diff Detail

Event Timeline

sdesmalen created this revision. · Nov 4 2021, 11:23 AM
sdesmalen requested review of this revision. · Nov 4 2021, 11:23 AM
Herald added a project: Restricted Project. · Nov 4 2021, 11:23 AM
kmclaughlin accepted this revision. · Nov 5 2021, 3:50 AM

Thanks @sdesmalen, this LGTM!

This revision is now accepted and ready to land. · Nov 5 2021, 3:50 AM

How does this interact with vscale_range? Could it perhaps automatically infer getVScaleForTuning using that? Or is the idea the target ultimately chooses?

The two are actually quite different: vscale_range specifies the range of vscale values that the compiled binary is compatible with, and LLVM guarantees that the binary is correct for that range. VScaleForTuning can be set separately via -mcpu/-mtune and purely tunes the cost-model without changing the requirements on vscale. This means it doesn't change the compatibility of the binary; it just helps choose a better VF for the CPU being compiled for.

Fair enough, thanks for the explanation!
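
To make the distinction above concrete, a target could report its tuning value through TTI roughly as in the sketch below; MyTargetTTIImpl, ST->supportsScalableVectors() and ST->getTuningVScale() are hypothetical names used for illustration, not the actual AArch64 implementation:

Optional<unsigned> MyTargetTTIImpl::getVScaleForTuning() const {
  // Purely a cost-model hint (e.g. 2 for a 256-bit implementation);
  // unlike the vscale_range attribute, it places no constraint on the
  // vscale values the generated binary may run with.
  if (ST->supportsScalableVectors())
    return ST->getTuningVScale();
  return None;
}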

david-arm added inline comments. · Nov 5 2021, 7:35 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6009

nit: Could you remove the unnecessary whitespace before merging? Thanks!

6025

nit: This is just a suggestion, but it feels a little odd to be rescaling the ElementCount by vscale, because vscale is already implicit in a scalable element count. For example, what we're doing here is taking a <vscale x 4> ElementCount and multiplying it by another vscale, so we effectively end up with an ElementCount like <vscale x vscale x 4>. Is it worth just using unsigned values instead?

unsigned EstimatedWidthA = A.Width.getKnownMinValue();
unsigned EstimatedWidthB = B.Width.getKnownMinValue();
if (Optional<unsigned> VScale = TTI.getVScaleForTuning()) {
  if (A.Width.isScalable())
    EstimatedWidthA *= VScale.getValue();
  if (B.Width.isScalable())
    EstimatedWidthB *= VScale.getValue();
}
...
sdesmalen added inline comments. · Nov 5 2021, 8:15 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6025

I was a bit worried it would get confusing to use A.Width.isScalable() in combination with EstimatedWidthA, but perhaps splitting it up is clearer, as you suggest.

sdesmalen updated this revision to Diff 385079. · Nov 5 2021, 8:15 AM

Use unsigned instead of ElementCount for 'EstimatedWidth'.

david-arm accepted this revision. · Nov 8 2021, 3:20 AM

LGTM! Thanks for making the changes @sdesmalen. :partyparrot

This revision was automatically updated to reflect the committed changes.