This is an archive of the discontinued LLVM Phabricator instance.

[VPlan] Add field to track if intrinsic should be used for call. (NFC)
ClosedPublic

Authored by fhahn on Aug 24 2022, 12:03 PM.

Details

Summary

This patch moves the cost-based decision whether to use an intrinsic or
library call to the point where the recipe is created. This untangles
code-gen from the cost model and also avoids doing some extra work as
the information is already computed at construction.

Diff Detail

Event Timeline

fhahn created this revision.Aug 24 2022, 12:03 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 24 2022, 12:03 PM
fhahn requested review of this revision.Aug 24 2022, 12:03 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 24 2022, 12:03 PM
Herald added a subscriber: vkmr. · View Herald Transcript
Ayal added inline comments.Aug 28 2022, 11:10 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8302

We already get the vector intrinsic ID here, reuse it instead of getting it again repeatedly below? (Independent of patch.)

8308–8309

WillWiden >> CanUseVectorCall? Using an intrinsic is also widening.

8308–8309

Above comment deserves an update.

8316

Avoid considering CallCost if NeedToScalarize is true?

Avoid getting decision and clamping Range if !ID, when a vector call can be used, e.g., w/o clamping Range (WillWiden)?

The compound decision for which (range of) VF's to use an intrinsic vs. call vs. neither should probably be retained instead of decomposing it into two independent clamps? Calls for better test coverage to make sure patch is indeed NFC.

fhahn updated this revision to Diff 456296.Aug 29 2022, 2:58 AM

Address comments, replace boolen flag with Intrinsic::ID, which will either be the chosen ID or Intrinsic::not_intrinsic. This removes the need for TLI in D132586.

fhahn marked 3 inline comments as done.Aug 29 2022, 3:06 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8302
8308–8309

Updated, thanks!

8308–8309

The comment should be updated in the latest version.

8316

Avoid considering CallCost if NeedToScalarize is true?

I am not sure if we need to handle this explicitly, as the cost comparison should either chose the vector intrinsic (if it is cheaper than the lib call which may get scalarized) or CanUseVectorCall will be also false.

Avoid getting decision and clamping Range if !ID, when a vector call can be used, e.g., w/o clamping Range (WillWiden)?

Added a check, thanks!

The compound decision for which (range of) VF's to use an intrinsic vs. call vs. neither should probably be retained instead of decomposing it into two independent clamps? Calls for better test coverage to make sure patch is indeed NFC.

I think we need to clamp both separately. Before, we could have VPlans where we either use lib functions or intrinsics for the same call for different VFs. Now we need to split them to track whether an intrinsic or libfunc should be used. I added a test case to show this: 005d1a8ff533

It should only change the debug output (VPlan printing) but not the generated code, so arguably this can be considered NFC (from the perspective of the generated code) or not.

Ayal added inline comments.Aug 29 2022, 7:16 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4166–4167

This is now also redundant given VectorIntrinsicID?

4174–4175

nit: can ask if (!VectorIntrinsicID || ...) given that Intrinsic::not_intrinsic is fixed to zero.

4185

nit: can ask if (VectorIntrinsicID) given that Intrinsic::not_intrinsic is fixed to zero.

8316

The compound decision for which (range of) VF's to use an intrinsic vs. call vs. neither should probably be retained instead of decomposing it into two independent clamps? Calls for better test coverage to make sure patch is indeed NFC.

I think we need to clamp both separately. Before, we could have VPlans where we either use lib functions or intrinsics for the same call for different VFs. Now we need to split them to track whether an intrinsic or libfunc should be used. I added a test case to show this: 005d1a8ff533

Hmm, getDecisionAndClampRange() works with boolean decisions rather than 3-way ones. May result in excessive clamping, which is ok albeit potentially conservative. E.g., say first VF=2 of range can make a vector call but next VF=4 cannot, where both can more efficiently make an intrinsic call, range would clamp after VF=2 needlessly.

One way to optimize the clamping is to figure out the compound decision for first VF of range and then getDecisionAndClampRange() accordingly - worth the hassle?

bool ScalarBetterThanVectorAtStart;
InstructionCost CallCostAtStart =
          CM.getVectorCallCost(CI, Range.Start, ScalarBetterThanVectorAtStart);
bool IntrinsicBestAtStart = ID && CM.getVectorIntrinsicCost(CI, Range.Start) < CallCostAtStart;

LoopVectorizationPlanner::getDecisionAndClampRange(
    [&](ElementCount VF) -> bool {
      bool ScalarBetterThanVectorAtVF;
      // Is it beneficial to perform intrinsic call compared to lib call?
      InstructionCost CallCostAtVF =
          CM.getVectorCallCost(CI, VF, ScalarBetterThanVectorAtVF);
      bool IntrinsicBestAtVF = ID && CM.getVectorIntrinsicCost(CI, VF) < CallCostAtVF;
      return (IntrinsicBestAtStart == IntrinsicBestAtVF) &&
                 (IntrinsicBestAtStart || ScalarBetterThanVectorAtVF == ScalarBetterThanVectorAtVF);
    },
    Range);

CM.getVectorCallCost() already compares vector call cost with scalar call cost, returning the cheaper along with an indicator which is it.
Perhaps worth extending this API to compare the three alternatives, returning the cheapest along with an indicator(s) which is it(?)

It should only change the debug output (VPlan printing) but not the generated code, so arguably this can be considered NFC (from the perspective of the generated code) or not.

8322

nit: can ask if ID given that Intrinsic::not_intrinsic is fixed to zero.

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
454

nit: can ask if (VectorIntrinsicID) given that Intrinsic::not_intrinsic is fixed to zero.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
80–81

Pass some Intrinsic::ID instead of true?

llvm/unittests/Transforms/Vectorize/VPlanTest.cpp
809

false is synonymous with Intrinsic::not_intrinsic being zero?

1069

ditto

fhahn updated this revision to Diff 456994.Aug 31 2022, 9:51 AM
fhahn marked 8 inline comments as done.

Address latest comments, thanks!

fhahn added inline comments.Aug 31 2022, 9:57 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4166–4167

Yes, should be removed!

4174–4175

Though explicitly checking == Intrinsic::not_intrinsic may be clearer, but it seems too verbose. I simplified it.

4185

Simplified, thanks!

8316

Hm I tried to restructure to code to make things a bit clearer.

If we can use an intrinsic call, clamp the decision to the range of intrinsic calls and return the recipe. If the intrinsic call is profitable at the start, we clamp the range until it becomes unprofitable. If it is not profitable at the beginning, we should clamp the range until it becomes profitable.

If it is not profitable to use an intrinsic call at the start, it must be profitable to use a lib call. Now clamp to the range until lib calls are not profitable.

I *think* that should avoid excessive clamping in most cases in practice and the code seems easier to follow. WDYT?

8322

Simplified, thanks!

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
454

Simplified thanks!

Ayal accepted this revision.Aug 31 2022, 1:01 PM

Thanks for addressing, looks good to me, adding minor last nits.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8316

Hm I tried to restructure to code to make things a bit clearer.

If we can use an intrinsic call, clamp the decision to the range of intrinsic calls and return the recipe. If the intrinsic call is profitable at the start, we clamp the range until it becomes unprofitable. If it is not profitable at the beginning, we should clamp the range until it becomes profitable.

Agreed! "profitable" here means "most profitable/best", i.e., better than scalarizing and better than calling a vector library function.

If it is not profitable to use an intrinsic call at the start, it must be profitable to use a lib call. Now clamp to the range until lib calls are not profitable.

It is also possible that scalarizing is most profitable at start. In any case it's indeed fine to now clamp based on the better between scalarizing and using a lib call (which is best, i.e., also better than using an intrinsic), as done below.

I *think* that should avoid excessive clamping in most cases in practice and the code seems easier to follow. WDYT?

Agreed, excessive clamping is avoided and code is clearer, LGTM!

8325

nits: can drop Should, {}

8333

Maybe the following:

// The flag shows whether it is better to scalarize the call than to call a vectorized version of the function.

is a bit more accurate?

8340

nits: can drop Should, {}

llvm/lib/Transforms/Vectorize/VPlan.h
953

nit: comment that not_intrinsic/false indicates that a library call is used instead of an intrinsic.

This revision is now accepted and ready to land.Aug 31 2022, 1:01 PM
This revision was automatically updated to reflect the committed changes.
fhahn marked 3 inline comments as done.
fhahn added inline comments.Sep 1 2022, 5:23 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8325

Done in the committed version, thanks!

8333

I updated the comment to

+  // Is better to call a vectorized version of the function than to to scalarize
+  // the call?

in the committed version/

8340

Done in the committed version, thanks!

llvm/lib/Transforms/Vectorize/VPlan.h
953

Added a comment in the committed version, thanks!