Page MenuHomePhabricator

[MemCpyOpt] Enable memcpy optimizations unconditionally.
ClosedPublic

Authored by tra on Jun 23 2021, 11:14 AM.

Details

Summary

The patch does not depend on the availability of the library functions for memcpy/memset as it operates on LLVM intrinsics.
The optimizations are useful on the targets that have these functions disabled (e.g. NVPTX & AMDGPU).

Diff Detail

Event Timeline

tra created this revision.Jun 23 2021, 11:14 AM
tra requested review of this revision.Jun 23 2021, 11:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2021, 11:14 AM
jlebar added a subscriber: jlebar.Jun 23 2021, 11:17 AM

MemCpyOpt works on intrinsics, not libcalls. Why is it checking TLI at all?

I think the only place TLI should come in is when SimplifyLibCall converts memcpy into llvm.memcpy, but MemCpyOpt itself shouldn't care about it.

tra added a comment.Jun 23 2021, 11:56 AM

MemCpyOpt works on intrinsics, not libcalls. Why is it checking TLI at all?

I think the only place TLI should come in is when SimplifyLibCall converts memcpy into llvm.memcpy, but MemCpyOpt itself shouldn't care about it.

The check has been added long ago to deal with -fno-builtin, but it's not clear what exactly was the issue.
https://github.com/llvm/llvm-project/commit/23f61a09aff4a3c5ca4bba9410878dcfebf0656c

My best guess here is that this exists to guard against introduction of memset/memcpy where none existed before. MemCpyOpt mostly does optimizations that remove/replace memcpys/memsets, but there are two that convert loads/stored into memcpy/memset. Possibly we should only be guarding those two and let everything else happen?

Not sure who would be familiar with this area.

tra added a subscriber: lattner.Jun 23 2021, 12:35 PM

This change only enables the pass for NVPTX where use of memcpy/memset intrinsics is fine and it does not change anything for other back-ends.

Figuring out a better criteria for enabling/disabling the pass can be dealt with separately by someone with better understanding of the pass than myself.

Not sure who would be familiar with this area.

We may need to ask @lattner who authored the change, but it was long ago, so our chances are probably not great.

This change only enables the pass for NVPTX where use of memcpy/memset intrinsics is fine and it does not change anything for other back-ends.

Figuring out a better criteria for enabling/disabling the pass can be dealt with separately by someone with better understanding of the pass than myself.

I think these questions are rather related. Why does NVPTX require special handling here? If these libcalls are actually available, then you need to enable them in TLI. If they aren't, but the intrinsics form is still usable, then the libcall checks in MemCpyOpt are wrong and we should adjust those. A separate TTI hook seems like the wrong solution in either case.

tra added a comment.Jun 23 2021, 1:01 PM

I think these questions are rather related. Why does NVPTX require special handling here? If these libcalls are actually available, then you need to enable them in TLI. If they aren't, but the intrinsics form is still usable, then the libcall checks in MemCpyOpt are wrong and we should adjust those. A separate TTI hook seems like the wrong solution in either case.

NVPTX is... special.
There's no standard library of any kind on the GPU, hence no libcalls.
We do make an effort to provide enough functionality to keep things working.
E.g. most of the math functions end up being pulled from an external bitcode library.
We rely on memcpy/memset intrinsics to get replaced by loops by one of NVPTX-specific passes.

So, reporting libcalls as unavailable is correct.
"memcpy/memset intrinsics are still usable" is... mostly true. It is for the early passes. We do have some corner cases where we en up materializing them too late, but it's rarely an issue in practice.
Given that this behavior is very specific to this particular back-end and that the NVPTX back-end does something rather peculiar, TTI is an appropriate source for the hint, IMO.
I don't think we have a good way to express "things work in a weird way here" in a back-end agnostic way here.

fhahn added a subscriber: fhahn.Jun 24 2021, 2:31 AM

I think these questions are rather related. Why does NVPTX require special handling here? If these libcalls are actually available, then you need to enable them in TLI. If they aren't, but the intrinsics form is still usable, then the libcall checks in MemCpyOpt are wrong and we should adjust those. A separate TTI hook seems like the wrong solution in either case.

NVPTX is... special.
There's no standard library of any kind on the GPU, hence no libcalls.
We do make an effort to provide enough functionality to keep things working.
E.g. most of the math functions end up being pulled from an external bitcode library.
We rely on memcpy/memset intrinsics to get replaced by loops by one of NVPTX-specific passes.

So, reporting libcalls as unavailable is correct.
"memcpy/memset intrinsics are still usable" is... mostly true. It is for the early passes. We do have some corner cases where we en up materializing them too late, but it's rarely an issue in practice.
Given that this behavior is very specific to this particular back-end and that the NVPTX back-end does something rather peculiar, TTI is an appropriate source for the hint, IMO.
I don't think we have a good way to express "things work in a weird way here" in a back-end agnostic way here.

So IIUC the behavior you want to express is that certain lib functions actually are considered available up to a point? I'm not sure about having a dedicated TTI hook to specifically enable/disable the pass. Would it be possible to mark the lib functions as available until the pass that lowers them runs (there's TLI->disableAllFunctions)?

tra added a comment.Jun 24 2021, 12:08 PM

So IIUC the behavior you want to express is that certain lib functions actually are considered available up to a point?

No, *intrinsics* are available. We can not currently lower any of them to libcalls and rely on alternative lowering mechanisms.

I think the pass' use of "is libcall available" is an imperfect proxy for either "should we bother looking for memcpy/memset ops" or for "can we materialize new memcpy/memset". It appears to assume that intrinsics and builtins either both supported or not supported.

I'm not sure about having a dedicated TTI hook to specifically enable/disable the pass.

IMO TTI is the standard mechanism to make target-specific information available to otherwise generic passes. There are ~50 passes under lib/Transforms that utilize it.

Would it be possible to mark the lib functions as available until the pass that lowers them runs (there's TLI->disableAllFunctions)?

I don't think it's feasible. Back-end has no idea or control over when/where the pass is going to run.
E.g. I can run opt -memcpyopt my.ll and I would want it to work with NVPTX.

I agree that this seems overly specific and unfortunate, but I'm a competent reviewer for this area, so take that with a grain of salt.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1417 ↗(On Diff #353830)

typo providing

tra updated this revision to Diff 354330.Jun 24 2021, 12:27 PM
tra edited the summary of this revision. (Show Details)

typo fix

tra marked an inline comment as done.Jun 24 2021, 12:35 PM

I agree that this seems overly specific and unfortunate, but I'm a competent reviewer for this area, so take that with a grain of salt.

The TTI knob can be generalized to something like "hasMemoryIntrinsicsAlwaysAvailable" which would allow decoupling availability of memcpy libcalls from usability of memcpy intrinsics without being overly specific about MemCpyOpt pass.

I'm open about any other suggestions on a better way to get the pass working for NVPTX.

fhahn added a comment.Mon, Jun 28, 5:21 AM

So IIUC the behavior you want to express is that certain lib functions actually are considered available up to a point?

No, *intrinsics* are available. We can not currently lower any of them to libcalls and rely on alternative lowering mechanisms.

Sure, the intrinsics should *always* be available on any target regardless of whether they are lowered to lib calls or now.

I think the pass' use of "is libcall available" is an imperfect proxy for either "should we bother looking for memcpy/memset ops" or for "can we materialize new memcpy/memset". It appears to assume that intrinsics and builtins either both supported or not supported.

Agreed, the current check is not ideal. IIUC the current check acts as a proxy to check whether the backends can lower the intrinsic using a lib call. At the moment, I think this mainly guards against MemCpyOpt introducing llvm.memcpy calls that later get lowered to library calls in the backend, even if the library function is not marked as available.

Perhaps an alternative to the TLI check would be to provide a generic lowering for llvm.memcpy intrinsics when lib calls are not available? (One thing to note is that I think both Clang and GCC require memcpy and a few others to be available, even in freestanding environments)

I'm not sure about having a dedicated TTI hook to specifically enable/disable the pass.

IMO TTI is the standard mechanism to make target-specific information available to otherwise generic passes. There are ~50 passes under lib/Transforms that utilize it.

Yes, but at the moment, I don't think there's much precedence for backends specifically disabling certain passes. This could get messy very quickly, e.g. even if we add only hooks for some of the available passes. Those hooks are also not composable. IMO it would be preferable to model this in a way so other passes that may want to introduce llvm.memcpy calls also benefit.

Generalizing the hook to whether the intrinsics can be lowered without lib calls seems a step in the right direction to me and this could be helpful for other passes as well (as you suggested in a later comment). If backends could easily lower any llvm.memcpy call without needing to fall back to library calls, that would also be compelling.

tra added a comment.Mon, Jun 28, 9:45 AM

Perhaps an alternative to the TLI check would be to provide a generic lowering for llvm.memcpy intrinsics when lib calls are not available? (One thing to note is that I think both Clang and GCC require memcpy and a few others to be available, even in freestanding environments)

I'm not ready to bite that much.

I wonder if we could use existing int_memcpy_inline in cases where libcals are not available. It's supposed to guarantee that it does not call external functions.
https://github.com/llvm/llvm-project/blob/44826ecd929bdd33b3c86650198a5f8a57965cc7/llvm/include/llvm/IR/Intrinsics.td#L616

I'm not sure about having a dedicated TTI hook to specifically enable/disable the pass.

IMO TTI is the standard mechanism to make target-specific information available to otherwise generic passes. There are ~50 passes under lib/Transforms that utilize it.

Yes, but at the moment, I don't think there's much precedence for backends specifically disabling certain passes. This could get messy very quickly, e.g. even if we add only hooks for some of the available passes. Those hooks are also not composable. IMO it would be preferable to model this in a way so other passes that may want to introduce llvm.memcpy calls also benefit.

In this case the knob only *allows* using the pass in wider range of cases. Passes that don't want or don't need it are not affected.
In a way enabling/disabling libcalls already acts as such external enable/disable knob already. So do the various other existing knobs provided by TTI, if you squint just right. E.g. setting inlining threshold very low would effectively disable inlining pass. I don't think this pass introduces anything conceptually new.

Generalizing the hook to whether the intrinsics can be lowered without lib calls seems a step in the right direction to me and this could be helpful for other passes as well (as you suggested in a later comment). If backends could easily lower any llvm.memcpy call without needing to fall back to library calls, that would also be compelling.

OK. Let me try to see if I can make MemCpyOpt pass fall back to inline version of memcpy intrinsic if libcalls are not available.

If that does not pan out, plan B would be to provide a generic TTI knob (e.g. optional<bool> canLowerIntrinsic()) and then change MemOptPass to use that, instead of checking libcalls.

tra updated this revision to Diff 355705.Wed, Jun 30, 3:41 PM

Allow the pass to run regradless of libcall availability.

arsenm added a subscriber: arsenm.Wed, Jun 30, 3:48 PM

I've always hated how TLI acts as if the llvm intrinsics and libc/libm functions are interchangeable. TLI should have no bearing on the introduction of an intrinsic call which can be easily lowered without relying on a host library call. We should just always canonicalize to llvm.memcpy and let the backend expand it if the target needs to later

tra added inline comments.Wed, Jun 30, 3:50 PM
llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
1757–1761

AFAICT the check does not affect correctness.

The "little point of doing anything here" argument is questionable. These optimizations do allow other optimizations to happen and that's exactly how I ended up here -- a memcpy that was not replaced with memset ended up keeping other things alive which resulted in more work for the compiler and slower code at runtime.

In practice only AMDGPU and NVPTX disable these libcalls and for both targets this pass is beneficial. For the others the change is a no-op.

tra added a comment.Thu, Jul 15, 3:17 PM

Ping. PTAL. The patch no longer relies on TTI and has been simplified to remove the checks for libcall availability.

nikic accepted this revision.Fri, Jul 16, 10:46 AM

LGTM, but the patch description needs an update.

This revision is now accepted and ready to land.Fri, Jul 16, 10:46 AM
tra retitled this revision from [MemCpyOpt] Enable memcpy optimization for NVPTX back-end. to [MemCpyOpt] Enable memcpy optimizations unconditionally..Fri, Jul 16, 11:06 AM
tra edited the summary of this revision. (Show Details)
tra added a comment.Fri, Jul 16, 11:06 AM

LGTM, but the patch description needs an update.

Done.

This revision was landed with ongoing or failed builds.Mon, Jul 19, 11:58 AM
This revision was automatically updated to reflect the committed changes.

The one thing I'm concerned about with this change is that we might potentially make a memcpy implementation call memcpy, which seems bad.

The one thing I'm concerned about with this change is that we might potentially make a memcpy implementation call memcpy, which seems bad.

memcpy implementation is recognized by LoopIdiom pass.

tra added a comment.Mon, Jul 19, 2:45 PM

The one thing I'm concerned about with this change is that we might potentially make a memcpy implementation call memcpy, which seems bad.

Turns out that the patch breaks sanitizers that didn't expect memset to materialize when they were compiling their runtime.
https://lab.llvm.org/buildbot/#/builders/37/builds/5453

I've reverted the patch for now and will try again once I figure out how to keep sanitizers happy.

Besides that this patch created the intrinsics llvm.memset which in our target was unsupported, it also created pointers that may be invalid on the target, e.g. i8 addrspace(X)* where addrspace(X) only supports vector memory. While this is probably all solvable, please consider adding TTI switches for targets to disable this optimization (e.g. for the intrinsics).

tra added a comment.Tue, Jul 20, 9:44 AM

Besides that this patch created the intrinsics llvm.memset which in our target was unsupported, it also created pointers that may be invalid on the target, e.g. i8 addrspace(X)* where addrspace(X) only supports vector memory. While this is probably all solvable, please consider adding TTI switches for targets to disable this optimization (e.g. for the intrinsics).

...and that brings us back to where this patch has started -- with a TTI knob to control whether it wants this pass enabled. :-/

That said, I'm not quite sure that "this type/AS combination" is invalid is something you can expect all parts of LLVM to respect. While this pass may happen to tickle it, I can see other cases where the type of the pointer may change. I would think that dealing with the quirks of type support of particular target would be something for legalizer to do.

Besides that this patch created the intrinsics llvm.memset which in our target was unsupported, it also created pointers that may be invalid on the target, e.g. i8 addrspace(X)* where addrspace(X) only supports vector memory. While this is probably all solvable, please consider adding TTI switches for targets to disable this optimization (e.g. for the intrinsics).

...and that brings us back to where this patch has started -- with a TTI knob to control whether it wants this pass enabled. :-/

That said, I'm not quite sure that "this type/AS combination" is invalid is something you can expect all parts of LLVM to respect. While this pass may happen to tickle it, I can see other cases where the type of the pointer may change. I would think that dealing with the quirks of type support of particular target would be something for legalizer to do.

I agree to above. But in practice, since this change is quite drastic, I would recommend adding some kind of control for targets not wanting this. Also a flat-rate knob to disable the pass altogether would be fine IMO.