This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Respect libfunc availability when lowering intrinsic memcpy
Needs RevisionPublic

Authored by mstorsjo on Aug 15 2020, 2:36 PM.

Details

Summary

If the memcpy libfunc is marked as not being available (e.g. via the -fno-builtin flag), we shouldn't lower intrinsic memcpy into an actual memcpy function call.

The same should probably be done for other intrinsics (that can be spontaneously generated by e.g. clang without the functions being explicitly mentioned) that can end up lowered into a function call, but this is the one I've run into in practice.

Adding a test for this that runs on AArch64, as the patch requires separate changes to all of SelectionDAG, FastISel and GlobalISel.

Diff Detail

Event Timeline

mstorsjo created this revision.Aug 15 2020, 2:36 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 15 2020, 2:36 PM
mstorsjo requested review of this revision.Aug 15 2020, 2:36 PM
mstorsjo added inline comments.Aug 15 2020, 2:39 PM
llvm/test/CodeGen/AArch64/intrinsic-memcpy-nobuiltin.ll
10

Ideally, I'd have the test also test without the no-builtins attribute, to check that the test still does what it's supposed to, i.e. that without the attribute, these still are lowered to an actual memcpy call.

If we can’t lower it into a libcall, what *can* we do with it beyond optimizing into an unrolled in-line sequence? Just generate a whole memcpy implementation inline?

If we can’t lower it into a libcall, what *can* we do with it beyond optimizing into an unrolled in-line sequence? Just generate a whole memcpy implementation inline?

I guess those are the options, yeah.

For the case I'm looking at, code with an explicit struct assignment without the code calling memcpy or anything equivalent, and building it with -fno-builtin, I think the expectation is to just expand it into an inline memcpy, however long that sequence might be.

Conventionally, we've assumed that memcpy is available despite -fno-builtin etc., simply because generating it inline is too painful. For example, consider what happens to your testcase if +strict-align is specified. (This is following gcc, which has a similar expectation.)

If the requirement of having a symbol named "memcpy" is problematic, I guess we could look into making compiler-rt provide a symbol with equivalent semantics, but use a compiler-reserved name so the user can't disable it with -fno-builtin. For example, __aeabi_memcpy is defined on ARM.

aemerson added inline comments.Aug 17 2020, 12:57 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
6438

So if memcpy is not available but we have a non-constant size, then we still generate a memcpy call here?

arsenm requested changes to this revision.Aug 17 2020, 1:05 PM
arsenm added a subscriber: arsenm.

The memcpy intrinsic has nothing to do with library availability, and handling it should not depend on the library function. The backend may choose to emit a libcall for it, but this is not required (e.g. AMDGPU has no libcalls whatsoever). GlobalISel should be able to handle arbitrary memcpys with ease since it's possible to emit a loop. With SelectionDAG you have to expand the arbitrary case in the IR, but it still handles constant size cases.

llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
1787

This needs to be dropped. D85199 switches these to using proper legalizeable instructions. It may make more sense for the libcall action to fail if the libraryinfo does not have memcpy

This revision now requires changes to proceed.Aug 17 2020, 1:05 PM

Conventionally, we've assumed that memcpy is available despite -fno-builtin etc., simply because generating it inline is too painful. For example, consider what happens to your testcase if +strict-align is specified. (This is following gcc, which has a similar expectation.)

Right... The actual case I'm trying to fix is on x86_64, and there gcc doesn't seem to generate a memcpy lightly, but for aarch64 it does indeed seem to generate such a call despite -fno-builtin, just as you say. (The reason for having the testcase for aarch64 is that the fix needed separate cases for selectiondag/fastisel/globalisel, so testing it on aarch64 felt easiest.)

If the requirement of having a symbol named "memcpy" is problematic, I guess we could look into making compiler-rt provide a symbol with equivalent semantics, but use a compiler-reserved name so the user can't disable it with -fno-builtin. For example, __aeabi_memcpy is defined on ARM.

Actually, the issue isn't that memcpy is missing, but that it has a different calling convention. In wine, certain DLLs (which are built as a regular ELF .so or MachO .dylib) would only include windows mode headers, and use a memcpy function with windows calling conventions. As long as memcpy is called explicitly, it ends up fine, but for the cases when the compiler backend invents memcpy calls, they end up using the platform's default calling convention.

This is mostly an issue on x86_64, where the calling conventions differ significantly - on other architectures the calling conventions are similar enough that there's no difference in a memcpy call.

In any case, a separate builtin function to generate a call to would probably help here as well, but doing that would break cases when building with clang but linking libgcc, no?

The memcpy intrinsic has nothing to do with library availability, and handling it should not depend on the library function. The backend may choose to emit a libcall for it, but this is not required (e.g. AMDGPU has no libcalls whatsoever).

Yeah, that's kind of the behaviour I expected when building with -fno-builtin. But then again, as @efriedma pointed out, GCC also does generate similar memcpy libcalls despite -fno-builtin (it didn't in the cases I tested before, on x86_64, but I now checked more and found cases where it does as well), so I'm a bit more hesitant about actually proceeding with this now...

GlobalISel should be able to handle arbitrary memcpys with ease since it's possible to emit a loop. With SelectionDAG you have to expand the arbitrary case in the IR, but it still handles constant size cases.

Ok, so where would I start out doing that, more concretely? (I have close to no experience in either transforms that change IR or GlobalISel.)

llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
1787

The reason for placing it here, was that if we within translateKnownIntrinsic decide we don't want to handle it that way, and return false from that method, we just proceed further down here, instead of making this method return false (prompting the caller to handle it differently).

But it does indeed break cases where the target would have a different, non-libcall based lowering of the intrinsic...

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
6438

Yeah, that's what the current patch does - that's clearly not ideal either...

GlobalISel should be able to handle arbitrary memcpys with ease since it's possible to emit a loop. With SelectionDAG you have to expand the arbitrary case in the IR, but it still handles constant size cases.

Ok, so where would I start out doing that, more concretely? (I have close to no experience in either transforms that change IR or GlobalISel.)

The actual expansion code (at least in the non-loop case) is already implemented, it's just in the wrong place. Currently the memory intrinsics can be expanded by the code in CombinerHelper::optimizeMemcpy/optimizeMemmove/optimizeMemset, and treat this as an optimization. I think this should be treated as a legalization decision, and moved into LegalizerHelper. LegalizerHelper::lower would then dispatch to a lowerMemcpy(), which would look similar to optimizeMemcpy today. After that, more code would be needed for the loop cases. I'll probably get to this in a month or two, but if you want to take care of it that would be great.

I would then probably add a utility to LegalizeRuleSet to help targets control this. Something like lowerIfLibcallUnavailable(TLI, LibFunc)

GlobalISel should be able to handle arbitrary memcpys with ease since it's possible to emit a loop. With SelectionDAG you have to expand the arbitrary case in the IR, but it still handles constant size cases.

Ok, so where would I start out doing that, more concretely? (I have close to no experience in either transforms that change IR or GlobalISel.)

The actual expansion code (at least in the non-loop case) is already implemented, it's just in the wrong place. Currently the memory intrinsics can be expanded by the code in CombinerHelper::optimizeMemcpy/optimizeMemmove/optimizeMemset, and treat this as an optimization. I think this should be treated as a legalization decision, and moved into LegalizerHelper. LegalizerHelper::lower would then dispatch to a lowerMemcpy(), which would look similar to optimizeMemcpy today. After that, more code would be needed for the loop cases. I'll probably get to this in a month or two, but if you want to take care of it that would be great.

We can discuss this in a separate GISel thread, but I don't see how memcpy inlining is any more of an instruction legality decision than an optimization one. I'm not strongly against it, but IMO it's no more correct than a combine.

I would then probably add a utility to LegalizeRuleSet to help targets control this. Something like lowerIfLibcallUnavailable(TLI, LibFunc)

GlobalISel should be able to handle arbitrary memcpys with ease since it's possible to emit a loop. With SelectionDAG you have to expand the arbitrary case in the IR, but it still handles constant size cases.

Ok, so where would I start out doing that, more concretely? (I have close to no experience in either transforms that change IR or GlobalISel.)

The actual expansion code (at least in the non-loop case) is already implemented, it's just in the wrong place. Currently the memory intrinsics can be expanded by the code in CombinerHelper::optimizeMemcpy/optimizeMemmove/optimizeMemset, and treat this as an optimization. I think this should be treated as a legalization decision, and moved into LegalizerHelper. LegalizerHelper::lower would then dispatch to a lowerMemcpy(), which would look similar to optimizeMemcpy today. After that, more code would be needed for the loop cases. I'll probably get to this in a month or two, but if you want to take care of it that would be great.

We can discuss this in a separate GISel thread, but I don't see how memcpy inlining is any more of an instruction legality decision than an optimization one. I'm not strongly against it, but IMO it's no more correct than a combine.

Because whether or not emitting a library call is a valid strategy depends on the target. An inlined memcpy may not be optional, in which case it is not an optimization. The legalizer code would essentially be identical to the combiner code. The combiner could also do the same thing, but these should share code.

Conventionally, we've assumed that memcpy is available despite -fno-builtin etc., simply because generating it inline is too painful. For example, consider what happens to your testcase if +strict-align is specified. (This is following gcc, which has a similar expectation.)

Right... The actual case I'm trying to fix is on x86_64, and there gcc doesn't seem to generate a memcpy lightly, but for aarch64 it does indeed seem to generate such a call despite -fno-builtin, just as you say. (The reason for having the testcase for aarch64 is that the fix needed separate cases for selectiondag/fastisel/globalisel, so testing it on aarch64 felt easiest.)

gcc will generate memcpy on x86 too; the threshold is just ridiculously high for -mtune=generic. (I don't think this choice is affected by -fno-builtin.)

I really don't really want to go down this path of inlining all memcpys for CPU targets: the code for an efficient memcpy is way too big to reasonably inline, particularly on non-x86 targets.

Note that LLVM optimizations will currently avoid forming memcpy under -fno-builtin (mostly to avoid weird issues with implementing memcpy). This doesn't affect clang.

If the requirement of having a symbol named "memcpy" is problematic, I guess we could look into making compiler-rt provide a symbol with equivalent semantics, but use a compiler-reserved name so the user can't disable it with -fno-builtin. For example, __aeabi_memcpy is defined on ARM.

Actually, the issue isn't that memcpy is missing, but that it has a different calling convention. In wine, certain DLLs (which are built as a regular ELF .so or MachO .dylib) would only include windows mode headers, and use a memcpy function with windows calling conventions. As long as memcpy is called explicitly, it ends up fine, but for the cases when the compiler backend invents memcpy calls, they end up using the platform's default calling convention.

This is mostly an issue on x86_64, where the calling conventions differ significantly - on other architectures the calling conventions are similar enough that there's no difference in a memcpy call.

You mean, the header explicitly declares a function named memcpy, and uses an attribute to explicitly overrides the calling convention so it's not the same as the "C" calling convention? Messing with the backend's calling convention for memcpy isn't hard, but I'm not sure what conditions would be appropriate. We already support clang --target=x86_64-pc-win32-elf.

In any case, a separate builtin function to generate a call to would probably help here as well, but doing that would break cases when building with clang but linking libgcc, no?

Yes.

The actual expansion code (at least in the non-loop case) is already implemented, it's just in the wrong place. Currently the memory intrinsics can be expanded by the code in CombinerHelper::optimizeMemcpy/optimizeMemmove/optimizeMemset, and treat this as an optimization. I think this should be treated as a legalization decision, and moved into LegalizerHelper. LegalizerHelper::lower would then dispatch to a lowerMemcpy(), which would look similar to optimizeMemcpy today. After that, more code would be needed for the loop cases. I'll probably get to this in a month or two, but if you want to take care of it that would be great.

Thanks for the pointers! I think I might hold off of this for now though - since the target where it's needed is x86_64, I'd primarily only need it in SelectionDAG, but I tried to fix it consistently in all cases at once.

gcc will generate memcpy on x86 too; the threshold is just ridiculously high for -mtune=generic. (I don't think this choice is affected by -fno-builtin.)

That does indeed seem to be the case. For -mtune=generic, it seems to generate something that boils down to rep movsq - and you're right that -fno-builtin doesn't seem to affect it.

I really don't really want to go down this path of inlining all memcpys for CPU targets: the code for an efficient memcpy is way too big to reasonably inline, particularly on non-x86 targets.

Do you mean there's a lot of code, built with -fno-builtin, where the code itself didn't mention memcpy, but where a call to it ideally should be generated? (Not questioning the assumption, just trying to understand the concern better.)

Note that LLVM optimizations will currently avoid forming memcpy under -fno-builtin (mostly to avoid weird issues with implementing memcpy). This doesn't affect clang.

Hmm, which cases do you refer to here?

If the requirement of having a symbol named "memcpy" is problematic, I guess we could look into making compiler-rt provide a symbol with equivalent semantics, but use a compiler-reserved name so the user can't disable it with -fno-builtin. For example, __aeabi_memcpy is defined on ARM.

Actually, the issue isn't that memcpy is missing, but that it has a different calling convention. In wine, certain DLLs (which are built as a regular ELF .so or MachO .dylib) would only include windows mode headers, and use a memcpy function with windows calling conventions. As long as memcpy is called explicitly, it ends up fine, but for the cases when the compiler backend invents memcpy calls, they end up using the platform's default calling convention.

This is mostly an issue on x86_64, where the calling conventions differ significantly - on other architectures the calling conventions are similar enough that there's no difference in a memcpy call.

You mean, the header explicitly declares a function named memcpy, and uses an attribute to explicitly overrides the calling convention so it's not the same as the "C" calling convention?

Exactly. I guess that's maybe technically invalid C, but still a scenario that there's a demand to handle in some way.

Messing with the backend's calling convention for memcpy isn't hard, but I'm not sure what conditions would be appropriate. We already support clang --target=x86_64-pc-win32-elf.

I fear that tucking the decision away behind a target triple can be a bit opaque. This is also a transition in progress - 12 months ago, the memcpy called within Wine builtin modules was the host libc's, but they're transitioning towards making the wine builtin modules actual freestanding DLLs. They can still either be built as ELF/MachO, or as a real PE DLL, but they don't interact with the host environment, only other DLLs, and the memcpy used within them is one with the windows calling convention regardless of the object file format.

I really don't really want to go down this path of inlining all memcpys for CPU targets: the code for an efficient memcpy is way too big to reasonably inline, particularly on non-x86 targets.

Do you mean there's a lot of code, built with -fno-builtin, where the code itself didn't mention memcpy, but where a call to it ideally should be generated? (Not questioning the assumption, just trying to understand the concern better.)

Yes. There's a lot of code built with something like -ffreestanding, that's compatible with the current understanding of the rules for when the compiler can generate memcpy.

Note that LLVM optimizations will currently avoid forming memcpy under -fno-builtin (mostly to avoid weird issues with implementing memcpy). This doesn't affect clang.

Hmm, which cases do you refer to here?

Loop idiom recognition, things like that. Basically, if you write something like void *memcpy(void *dest, const void *src, size_t n) { while (n--) *dest++ = *src++; }, we don't want to turn that into a recursive memcpy call.

If the requirement of having a symbol named "memcpy" is problematic, I guess we could look into making compiler-rt provide a symbol with equivalent semantics, but use a compiler-reserved name so the user can't disable it with -fno-builtin. For example, __aeabi_memcpy is defined on ARM.

Actually, the issue isn't that memcpy is missing, but that it has a different calling convention. In wine, certain DLLs (which are built as a regular ELF .so or MachO .dylib) would only include windows mode headers, and use a memcpy function with windows calling conventions. As long as memcpy is called explicitly, it ends up fine, but for the cases when the compiler backend invents memcpy calls, they end up using the platform's default calling convention.

This is mostly an issue on x86_64, where the calling conventions differ significantly - on other architectures the calling conventions are similar enough that there's no difference in a memcpy call.

You mean, the header explicitly declares a function named memcpy, and uses an attribute to explicitly overrides the calling convention so it's not the same as the "C" calling convention?

Exactly. I guess that's maybe technically invalid C, but still a scenario that there's a demand to handle in some way.

Messing with the backend's calling convention for memcpy isn't hard, but I'm not sure what conditions would be appropriate. We already support clang --target=x86_64-pc-win32-elf.

I fear that tucking the decision away behind a target triple can be a bit opaque. This is also a transition in progress - 12 months ago, the memcpy called within Wine builtin modules was the host libc's, but they're transitioning towards making the wine builtin modules actual freestanding DLLs. They can still either be built as ELF/MachO, or as a real PE DLL, but they don't interact with the host environment, only other DLLs, and the memcpy used within them is one with the windows calling convention regardless of the object file format.

The point here would just be that if you use x86_64-pc-win32-elf, we generate code as if the target were a Windows target; this might be useful if you want to generate an ELF library that globally uses the Windows calling convention. It might not be right if you only wanted to mess with the calling convention of memcpy.