This is an archive of the discontinued LLVM Phabricator instance.

Replace vector intrinsics with call to vector library
ClosedPublic

Authored by LukasSommerTu on Jan 25 2021, 9:22 AM.

Details

Summary

This patch adds a pass to replace calls to vector intrinsics (i.e., LLVM intrinsics operating on vector operands) with calls to a vector library.

Currently, calls to LLVM intrinsics are only replaced with calls to vector libraries when scalar calls to intrinsics are vectorized by the Loop- or SLP-Vectorizer.

With this pass, it is now possible to replace calls to LLVM intrinsics already operating on vector operands, e.g., if such code was generated by MLIR. For the replacement, information from the TargetLibraryInfo, e.g., as specified via -vector-library is used.

Diff Detail

Event Timeline

LukasSommerTu created this revision.Jan 25 2021, 9:22 AM
LukasSommerTu requested review of this revision.Jan 25 2021, 9:22 AM

What's the intended use-case of this pass?

What's the intended use-case of this pass?

Some frontends or in my case MLIR, generate calls to LLVM intrinsics operating on vector operands, e.g.:
%call = call <4 x double> @llvm.exp.v4f64(<4 x double> %in)

Those calls are currently not replaced with calls to vector libraries (e.g., libmvec, SVML), because this replacement only happens for scalar calls in the Loop-/SLP-vectorizer.

With the pass, it is now possible to replace the calls for improved performance, resulting in the following for SVML & the example above:
%call = call <4 x double> @__svml_exp4(<4 x double> %in)

What's the intended use-case of this pass?

Some frontends or in my case MLIR, generate calls to LLVM intrinsics operating on vector operands, e.g.:
%call = call <4 x double> @llvm.exp.v4f64(<4 x double> %in)

Those calls are currently not replaced with calls to vector libraries (e.g., libmvec, SVML), because this replacement only happens for scalar calls in the Loop-/SLP-vectorizer.

With the pass, it is now possible to replace the calls for improved performance, resulting in the following for SVML & the example above:
%call = call <4 x double> @__svml_exp4(<4 x double> %in)

I understand what transformation this performs, that wasn't my question.

What i mean is that this is the opposite of the transformation
i'd expect to happen for middle-end IR optimizations,
and this lowering should be happening somewhere in backend part
of the compilation pipeline. So i'm mainly asking where this is intended to be run

@lebedev.ri: Thanks for your feedback!

My reasoning to implement this as an IR pass was that the replacement for the scalar version of the intrinsics is also happening as part of the middle-end, so it made sense to me to implement this in a similar location and use similar mechansisms.

My motivating usage scenario was to add this pass to the pipeline that is applied after converting from MLIR to LLVM IR, which introduces these calls. As the replacement requires some target-specific information (TargetLibraryInfo), implementing it on the LLVM side rather than in MLIR seemed reasonable to me. If other frontends produce similar IR, they can also benefit from this pass/transformation.

@lebedev.ri: Thanks for your feedback!

My reasoning to implement this as an IR pass was that the replacement for the scalar version of the intrinsics is also happening as part of the middle-end, so it made sense to me to implement this in a similar location and use similar mechansisms.

Hm, it does? That seems quite surprising to me.

My motivating usage scenario was to add this pass to the pipeline that is applied after converting from MLIR to LLVM IR, which introduces these calls. As the replacement requires some target-specific information (TargetLibraryInfo), implementing it on the LLVM side rather than in MLIR seemed reasonable to me. If other frontends produce similar IR, they can also benefit from this pass/transformation.

The thing is, the passes that now know how to deal with whatever IR instruction/intrinsic,
will now need to also know about N flavors of those intrinsics shaped in the form of a libcall.
This doesn't quite seem like the right direction.
Also, tangentially related, is there cost modelling being performed for them?

So i would still expect this to happen somewhere in codegen (llc) pipeline, not optimization (opt) pipeline..

@lebedev.ri: Thanks for your feedback!

My reasoning to implement this as an IR pass was that the replacement for the scalar version of the intrinsics is also happening as part of the middle-end, so it made sense to me to implement this in a similar location and use similar mechansisms.

Hm, it does? That seems quite surprising to me.

Yes, inject-tli-mappings add the VFABI-attributes to the intrinsic calls and the loop-vectorizer performs the replacement as part of the vectorization.
For example, %call = tail call double @llvm.sin.f64(double %conv) will be replaced inside a loop that is vectorized by %5 = call <4 x double> @_ZGVdN4v_sin(<4 x double> %4) using opt -vector-library=LIBMVEC-X86 -S -inject-tli-mappings -loop-vectorize

More examples can be found in https://github.com/llvm/llvm-project/blob/main/llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls.ll

I get your point about placing such a transformation as last as possible in the pipeline or in the codegen, but as the Loop- and SLP-vectorizer perform a similar transformation, I thought it might be a good idea to reuse similar mechanisms and infrastructure.

I haven't looked at the trade-offs, but if we can do this transform later ( SelectionDAGLegalize::ConvertNodeToLibcall() ? ) that would seem to be more flexible...and smaller patch?
For reference:
D8131 - added support for external vector library calls
D70107 - added inject-tli-mappings

fhahn added a subscriber: fhahn.Jan 26 2021, 5:34 AM

I haven't looked at the trade-offs, but if we can do this transform later ( SelectionDAGLegalize::ConvertNodeToLibcall() ? ) that would seem to be more flexible...and smaller patch?
For reference:
D8131 - added support for external vector library calls
D70107 - added inject-tli-mappings

+1 on doing this as late as possible. Ideally we would then just get rid of the logic in LV/SLP that inserts the library calls.

fhahn edited reviewers, added: fhahn; removed: Florian.Jan 26 2021, 5:36 AM
lebedev.ri requested changes to this revision.Jan 26 2021, 5:38 AM
This revision now requires changes to proceed.Jan 26 2021, 5:38 AM

I haven't looked at the trade-offs, but if we can do this transform later ( SelectionDAGLegalize::ConvertNodeToLibcall() ? ) that would seem to be more flexible...and smaller patch?
For reference:
D8131 - added support for external vector library calls
D70107 - added inject-tli-mappings

+1 on doing this as late as possible. Ideally we would then just get rid of the logic in LV/SLP that inserts the library calls.

I think SelectionDAG would scalarize the operation in LegalizeVectorOps before they reach LegalizeDAG.

I haven't looked at the trade-offs, but if we can do this transform later ( SelectionDAGLegalize::ConvertNodeToLibcall() ? ) that would seem to be more flexible...and smaller patch?
For reference:
D8131 - added support for external vector library calls
D70107 - added inject-tli-mappings

+1 on doing this as late as possible. Ideally we would then just get rid of the logic in LV/SLP that inserts the library calls.

I think SelectionDAG would scalarize the operation in LegalizeVectorOps before they reach LegalizeDAG.

So an IR codegen pass that runs after the optimizer would be best? If the patch is small enough, we could glom this into CGP?

I haven't looked at the trade-offs, but if we can do this transform later ( SelectionDAGLegalize::ConvertNodeToLibcall() ? ) that would seem to be more flexible...and smaller patch?
For reference:
D8131 - added support for external vector library calls
D70107 - added inject-tli-mappings

+1 on doing this as late as possible. Ideally we would then just get rid of the logic in LV/SLP that inserts the library calls.

I think SelectionDAG would scalarize the operation in LegalizeVectorOps before they reach LegalizeDAG.

So an IR codegen pass that runs after the optimizer would be best? If the patch is small enough, we could glom this into CGP?

Does GlobalISel also use CGP? Do we want to emit vector libcalls for vector IR intrinsics at O0? If I recall, CGP is disabled at O0.

I haven't looked at the trade-offs, but if we can do this transform later ( SelectionDAGLegalize::ConvertNodeToLibcall() ? ) that would seem to be more flexible...and smaller patch?
For reference:
D8131 - added support for external vector library calls
D70107 - added inject-tli-mappings

+1 on doing this as late as possible. Ideally we would then just get rid of the logic in LV/SLP that inserts the library calls.

I think SelectionDAG would scalarize the operation in LegalizeVectorOps before they reach LegalizeDAG.

So an IR codegen pass that runs after the optimizer would be best? If the patch is small enough, we could glom this into CGP?

Can this be scheduled as a pre code generation pass ?

Does GlobalISel also use CGP? Do we want to emit vector libcalls for vector IR intrinsics at O0? If I recall, CGP is disabled at O0.

GlobalISel should not use CGP in the long-run since CGP was purely a hack to overcome SDAG's single-block limitation. I'm not sure what the status is for GlobalISel targets currently though.

Can this be scheduled as a pre code generation pass ?

I haven't actually looked at the patch details, but if there's agreement that it's ok to leave it as an IR transform, then it should be a small adjustment to make this an IR codegen pass. It would be similar to the existing ExpandReductions pass.

Does GlobalISel also use CGP? Do we want to emit vector libcalls for vector IR intrinsics at O0? If I recall, CGP is disabled at O0.

GlobalISel should not use CGP in the long-run since CGP was purely a hack to overcome SDAG's single-block limitation. I'm not sure what the status is for GlobalISel targets currently though.

Can this be scheduled as a pre code generation pass ?

I haven't actually looked at the patch details, but if there's agreement that it's ok to leave it as an IR transform, then it should be a small adjustment to make this an IR codegen pass. It would be similar to the existing ExpandReductions pass.

That is what i was suggesting to do FWIW.
Perhaps, this makes sense as middle-end optimization, but then we should have intrinsics instead of all these libcalls,
and somehow everything would need to be updated to be okay with that, but that will require a *much* larger changes.

Thanks to everyone for your feedback so far!

I haven't actually looked at the patch details, but if there's agreement that it's ok to leave it as an IR transform, then it should be a small adjustment to make this an IR codegen pass. It would be similar to the existing ExpandReductions pass.

As this solution (i.e., keeping it as an IR pass, but move it to be an IR codegen pass) seems to be acceptable, I would change the implementation accordingly and update the patch.

@lebedev.ri, @fhahn, @spatel, @venkataramanan.kumar.llvm, @craig.topper: Any objections?

Thanks to everyone for your feedback so far!

I haven't actually looked at the patch details, but if there's agreement that it's ok to leave it as an IR transform, then it should be a small adjustment to make this an IR codegen pass. It would be similar to the existing ExpandReductions pass.

As this solution (i.e., keeping it as an IR pass, but move it to be an IR codegen pass) seems to be acceptable, I would change the implementation accordingly and update the patch.

@lebedev.ri, @fhahn, @spatel, @venkataramanan.kumar.llvm, @craig.topper: Any objections?

No objections from me.

Thanks to everyone for your feedback so far!

I haven't actually looked at the patch details, but if there's agreement that it's ok to leave it as an IR transform, then it should be a small adjustment to make this an IR codegen pass. It would be similar to the existing ExpandReductions pass.

As this solution (i.e., keeping it as an IR pass, but move it to be an IR codegen pass) seems to be acceptable, I would change the implementation accordingly and update the patch.

@lebedev.ri, @fhahn, @spatel, @venkataramanan.kumar.llvm, @craig.topper: Any objections?

I am ok with this.

I've updated the implementation to have this pass as an IR codegen pass.

I did not add an explicit flag to deactivate the pass (as it for example is the case for ExpandReductions), as the replacement can be deactivated by setting --vector-library=none (which is also the default). The pass is also not added to the pipeline for -O0.

Similar to ExpandReductions, the pass now explicitly deletes the call instructions to intrinsics that have been replaced and does not rely on DCE, given that it is now very late in pipeline.

I've updated the implementation to have this pass as an IR codegen pass.

I did not add an explicit flag to deactivate the pass (as it for example is the case for ExpandReductions), as the replacement can be deactivated by setting --vector-library=none (which is also the default). The pass is also not added to the pipeline for -O0.

Similar to ExpandReductions, the pass now explicitly deletes the call instructions to intrinsics that have been replaced and does not rely on DCE, given that it is now very late in pipeline.

[EDIT] I realized there was a inconsistency between CodeGenPassBuilder and TargetPassConfig, which is now corrected.

spatel added inline comments.Feb 3 2021, 12:13 PM
llvm/lib/CodeGen/ReplaceWithVeclib.cpp
80

Can we assert here that the FunctionType of Replacement is the same as the FunctionType of CI?

166

typo: intrinsic

187

typo: intrinsics

llvm/test/CodeGen/Generic/replace-intrinsics-with-veclib.ll
15

FileCheck typo: "sAME"

Can you use utils/update_test_checks.py to auto-generate the assertions?

Thanks @spatel for the detailed review!

I've addressed your comments and updated the patch. The test is now generated by utils/update_test_checks.py and I've added the assertion.

LukasSommerTu marked 4 inline comments as done.Feb 4 2021, 7:07 AM
spatel accepted this revision.Feb 4 2021, 8:00 AM

LGTM - see inline for a couple of minor points.

llvm/lib/CodeGen/ReplaceWithVeclib.cpp
80

Move this above the replaceAllUsesWith() to be more effective (not just for FP calls)?

llvm/test/CodeGen/Generic/replace-intrinsics-with-veclib.ll
64

The test comments that were here in the previous revision were helpful. I'd put those back (but put them outside of the function body to avoid interfering with the script's CHECK lines).

@spatel: Thanks for your quick feedback, I've addressed your comments and updated the patch.

I do not have commit access, could someone please commit for me? Thanks!

LukasSommerTu marked 2 inline comments as done.Feb 4 2021, 9:25 AM

@spatel: Thanks for your quick feedback, I've addressed your comments and updated the patch.

I do not have commit access, could someone please commit for me? Thanks!

I can commit it for you. Let's check if there are any more comments though - @lebedev.ri , was making this a codegen pass sufficient for your change request?

This revision was not accepted when it landed; it landed in state Needs Review.Feb 5 2021, 11:25 AM
This revision was automatically updated to reflect the committed changes.

@spatel: Thanks for your quick feedback, I've addressed your comments and updated the patch.

I do not have commit access, could someone please commit for me? Thanks!

I can commit it for you. Let's check if there are any more comments though - @lebedev.ri , was making this a codegen pass sufficient for your change request?

Oops - I didn't notice that I had this patch lined up with other local changes and pushed it just now. Let me know if I should revert.

@spatel: I got notified about some build-bot failures, so maybe it's better to revert the commit and I will check for the cause of the build-bot failures.

@spatel: I got notified about some build-bot failures, so maybe it's better to revert the commit and I will check for the cause of the build-bot failures.

Ok - reverted:
c981f6f8e16

spatel reopened this revision.Feb 5 2021, 12:12 PM

@spatel: I got notified about some build-bot failures, so maybe it's better to revert the commit and I will check for the cause of the build-bot failures.

Ok - reverted:
c981f6f8e16

My guess on the problem is that the test file was put in the "Generic" test directory but specified an x86 target in the text. BuildBots may not build the X86 target. We either need to make the test truly generic (target-independent) or put the file in the X86 dir.

The failure also seems to happen with X86 builds, so I suspect some connections to changes to the new PassManager, as the pass is not recognized at all. I will check that in detail and also make sure that the test works correctly, if the X86 target is not built.

fhahn added a comment.Feb 8 2021, 7:27 AM

@LukasSommerTu are you also looking into unifying/removing the code in SLPVectorizer/LV to create library calls?

llvm/include/llvm/CodeGen/ReplaceWithVeclib.h
22

nit: I think the coding style recommends using struct if all members are public.

llvm/lib/CodeGen/ReplaceWithVeclib.cpp
72

nit: Why IRBuilder{&CI} here but Args(CI.arg_operands()) below? can this just be IRBuilder(&CI) for consistency?

I've found the reason for the build-bot failure, I had overlooked that codegen passes are currently pinned to the legacy PM in opt. I fixed this and also addressed @fhahn's inline comments (thanks for the feedback!).

LukasSommerTu marked 2 inline comments as done.Feb 8 2021, 9:24 AM

@LukasSommerTu are you also looking into unifying/removing the code in SLPVectorizer/LV to create library calls?

@fhahn: No, I haven't looked into that so far. I assume it would make sense to have a separate patch for that?

Should I try pushing this again?

I reproduced the cause of the failure in the build-bot locally, fixed the bug and successfully ran the tests locally, so if you do not have any additional things that I should change, you could try again. Thanks!

I reproduced the cause of the failure in the build-bot locally, fixed the bug and successfully ran the tests locally, so if you do not have any additional things that I should change, you could try again. Thanks!

D96011 changed the TLI.getVectorizedFunction() API, so we need a small adjustment to make this compile. Let me know if this looks right ( cc @david-arm ):

diff --git a/llvm/lib/CodeGen/ReplaceWithVeclib.cpp b/llvm/lib/CodeGen/ReplaceWithVeclib.cpp
index 943199933494..bec0fb772c3e 100644
--- a/llvm/lib/CodeGen/ReplaceWithVeclib.cpp
+++ b/llvm/lib/CodeGen/ReplaceWithVeclib.cpp
@@ -104,7 +104,7 @@ static bool replaceWithCallToVeclib(const TargetLibraryInfo &TLI,
 
   // Convert vector arguments to scalar type and check that
   // all vector operands have identical vector width.
-  unsigned VF = 0;
+  ElementCount VF;
   SmallVector<Type *> ScalarTypes;
   for (auto Arg : enumerate(CI.arg_operands())) {
     auto *ArgType = Arg.value()->getType();
@@ -121,17 +121,17 @@ static bool replaceWithCallToVeclib(const TargetLibraryInfo &TLI,
         // the replacement.
         return false;
       }
-      auto NumElements = VectorArgTy->getElementCount();
+      ElementCount NumElements = VectorArgTy->getElementCount();
       if (NumElements.isScalable()) {
         // The current implementation does not support
         // scalable vectors.
         return false;
       }
-      if (VF && VF != NumElements.getFixedValue()) {
+      if (VF && VF != NumElements) {
         // The different arguments differ in vector size.
         return false;
       } else {
-        VF = NumElements.getFixedValue();
+        VF = NumElements;
       }
       ScalarTypes.push_back(VectorArgTy->getElementType());
     }

Hi @spatel, I think that fix looks sensible, although I'd recommend initialising the ElementCount to something in the same way as VF = 0 rather than relying upon the default constructor behaviour. You can write this as:

ElementCount VF = ElementCount::getFixed(0);

and this ensures it's initialised to the value that you expect. Also you might want to explicitly check for non-zero element counts, i.e.:

if (VF.isNonZero() && VF != NumElements) {

Thanks!

@spatel, @david-arm: Thanks for the update, I will integrate your proposed changes and update the patch after running some tests locally.

@spatel, @david-arm: I integrated your proposed changes/updates and updated the patch. Local tests run fine, so it should be ready for committing from my side.

This revision was not accepted when it landed; it landed in state Needs Review.Feb 12 2021, 9:53 AM
This revision was automatically updated to reflect the committed changes.