This patch moves the cost-based decision whether to use an intrinsic or
library call to the point where the recipe is created. This untangles
code-gen from the cost model and also avoids doing some extra work as
the information is already computed at construction.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
8302 | We already get the vector intrinsic ID here, reuse it instead of getting it again repeatedly below? (Independent of patch.) | |
8308–8309 | WillWiden >> CanUseVectorCall? Using an intrinsic is also widening. | |
8308–8309 | Above comment deserves an update. | |
8316 | Avoid considering CallCost if NeedToScalarize is true? Avoid getting decision and clamping Range if !ID, when a vector call can be used, e.g., w/o clamping Range (WillWiden)? The compound decision for which (range of) VF's to use an intrinsic vs. call vs. neither should probably be retained instead of decomposing it into two independent clamps? Calls for better test coverage to make sure patch is indeed NFC. |
Address comments, replace boolen flag with Intrinsic::ID, which will either be the chosen ID or Intrinsic::not_intrinsic. This removes the need for TLI in D132586.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
8302 | ||
8308–8309 | Updated, thanks! | |
8308–8309 | The comment should be updated in the latest version. | |
8316 |
I am not sure if we need to handle this explicitly, as the cost comparison should either chose the vector intrinsic (if it is cheaper than the lib call which may get scalarized) or CanUseVectorCall will be also false.
Added a check, thanks!
I think we need to clamp both separately. Before, we could have VPlans where we either use lib functions or intrinsics for the same call for different VFs. Now we need to split them to track whether an intrinsic or libfunc should be used. I added a test case to show this: 005d1a8ff533 It should only change the debug output (VPlan printing) but not the generated code, so arguably this can be considered NFC (from the perspective of the generated code) or not. |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4166–4167 | This is now also redundant given VectorIntrinsicID? | |
4174–4175 | nit: can ask if (!VectorIntrinsicID || ...) given that Intrinsic::not_intrinsic is fixed to zero. | |
4185 | nit: can ask if (VectorIntrinsicID) given that Intrinsic::not_intrinsic is fixed to zero. | |
8316 |
Hmm, getDecisionAndClampRange() works with boolean decisions rather than 3-way ones. May result in excessive clamping, which is ok albeit potentially conservative. E.g., say first VF=2 of range can make a vector call but next VF=4 cannot, where both can more efficiently make an intrinsic call, range would clamp after VF=2 needlessly. One way to optimize the clamping is to figure out the compound decision for first VF of range and then getDecisionAndClampRange() accordingly - worth the hassle? bool ScalarBetterThanVectorAtStart; InstructionCost CallCostAtStart = CM.getVectorCallCost(CI, Range.Start, ScalarBetterThanVectorAtStart); bool IntrinsicBestAtStart = ID && CM.getVectorIntrinsicCost(CI, Range.Start) < CallCostAtStart; LoopVectorizationPlanner::getDecisionAndClampRange( [&](ElementCount VF) -> bool { bool ScalarBetterThanVectorAtVF; // Is it beneficial to perform intrinsic call compared to lib call? InstructionCost CallCostAtVF = CM.getVectorCallCost(CI, VF, ScalarBetterThanVectorAtVF); bool IntrinsicBestAtVF = ID && CM.getVectorIntrinsicCost(CI, VF) < CallCostAtVF; return (IntrinsicBestAtStart == IntrinsicBestAtVF) && (IntrinsicBestAtStart || ScalarBetterThanVectorAtVF == ScalarBetterThanVectorAtVF); }, Range); CM.getVectorCallCost() already compares vector call cost with scalar call cost, returning the cheaper along with an indicator which is it.
| |
8322 | nit: can ask if ID given that Intrinsic::not_intrinsic is fixed to zero. | |
llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp | ||
454 | nit: can ask if (VectorIntrinsicID) given that Intrinsic::not_intrinsic is fixed to zero. | |
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp | ||
80–81 | Pass some Intrinsic::ID instead of true? | |
llvm/unittests/Transforms/Vectorize/VPlanTest.cpp | ||
809 | false is synonymous with Intrinsic::not_intrinsic being zero? | |
1069 | ditto |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4166–4167 | Yes, should be removed! | |
4174–4175 | Though explicitly checking == Intrinsic::not_intrinsic may be clearer, but it seems too verbose. I simplified it. | |
4185 | Simplified, thanks! | |
8316 | Hm I tried to restructure to code to make things a bit clearer. If we can use an intrinsic call, clamp the decision to the range of intrinsic calls and return the recipe. If the intrinsic call is profitable at the start, we clamp the range until it becomes unprofitable. If it is not profitable at the beginning, we should clamp the range until it becomes profitable. If it is not profitable to use an intrinsic call at the start, it must be profitable to use a lib call. Now clamp to the range until lib calls are not profitable. I *think* that should avoid excessive clamping in most cases in practice and the code seems easier to follow. WDYT? | |
8322 | Simplified, thanks! | |
llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp | ||
454 | Simplified thanks! |
Thanks for addressing, looks good to me, adding minor last nits.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
8316 |
Agreed! "profitable" here means "most profitable/best", i.e., better than scalarizing and better than calling a vector library function.
It is also possible that scalarizing is most profitable at start. In any case it's indeed fine to now clamp based on the better between scalarizing and using a lib call (which is best, i.e., also better than using an intrinsic), as done below.
Agreed, excessive clamping is avoided and code is clearer, LGTM! | |
8325 | nits: can drop Should, {} | |
8333 | Maybe the following: // The flag shows whether it is better to scalarize the call than to call a vectorized version of the function. is a bit more accurate? | |
8340 | nits: can drop Should, {} | |
llvm/lib/Transforms/Vectorize/VPlan.h | ||
953 | nit: comment that not_intrinsic/false indicates that a library call is used instead of an intrinsic. |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
8325 | Done in the committed version, thanks! | |
8333 | I updated the comment to + // Is better to call a vectorized version of the function than to to scalarize + // the call? in the committed version/ | |
8340 | Done in the committed version, thanks! | |
llvm/lib/Transforms/Vectorize/VPlan.h | ||
953 | Added a comment in the committed version, thanks! |
This is now also redundant given VectorIntrinsicID?