This is an archive of the discontinued LLVM Phabricator instance.

[LTO] Allow library calls to be internalized in freestanding mode
Needs Review · Public

Authored by jhuber6 on Jul 3 2023, 9:33 AM.

Details

Summary

Currently, we mark known library calls (e.g. malloc / printf) as used
symbols because the linker expects these to be present. However, this
means that they cannot be removed through LTO. The current LTO
configuration allows for the Freestanding option, which implies that we
should not emit library calls from the backend, as we may not have them.
This patch extends that logic to also imply that we can internalize
these library call symbols if they are present, because in freestanding
mode they should bind to regular symbols.

This is currently important for my GPU libc project. We rely on LTO
linking for the GPU to optimize binaries, but certain symbols
cannot be internalized by the LTO pass because they match the known
library calls. This is intentional, but because the GPU is an unhosted
environment and everything is linked statically, we should be able to
internalize these.

This patch primarily adds a new IR symbol table flag to indicate a
known library call. This is then used to special-case symbols
that are marked as used but can be internalized under the Freestanding
logic.
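
As a rough sketch of the shape of this change (the flag and accessor names below are hypothetical, not necessarily what the patch uses):

```cpp
#include "llvm/LTO/Config.h"
#include "llvm/Object/IRSymtab.h"

// Sketch only: suppose the IR symbol table gains a flag bit, next to the
// existing FB_used bit, recording "this name matches a known runtime libcall
// from RuntimeLibcalls.def". It is surfaced below as a hypothetical
// Sym.mayBeLibcall() accessor (not an existing API).
static bool mustPreserve(const llvm::lto::Config &Conf,
                         const llvm::irsymtab::Symbol &Sym) {
  // In freestanding mode, a symbol that is only marked used because it is a
  // recognized libcall does not need to survive LTO, so it can be
  // internalized and dead-stripped like any other static definition.
  if (Conf.Freestanding && Sym.mayBeLibcall()) // hypothetical accessor
    return false;
  return Sym.isUsed();
}
```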

Diff Detail

Event Timeline

jhuber6 created this revision. Jul 3 2023, 9:33 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 3 2023, 9:33 AM
jhuber6 requested review of this revision. Jul 3 2023, 9:33 AM
jdoerfert added inline comments. Jul 3 2023, 9:36 AM
llvm/lib/Object/IRSymtab.cpp
284

This is how we find out it is a libcall? If so, then we might want to rename the bool above, maybe even more.

jhuber6 added inline comments. Jul 3 2023, 9:50 AM
llvm/lib/Object/IRSymtab.cpp
284

I think it used to be called "IsBuiltinFunc", but it got renamed to something more generic because symbols like __ssp_canary_word and __stack_chk_guard aren't contained in llvm/IR/RuntimeLibcalls.def. Thinking about it, we may need to filter those out here, since they're not true library calls but will be marked as such.
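
For context, the list in question is built in llvm/lib/Object/IRSymtab.cpp roughly like this (simplified here):

```cpp
// Roughly how IRSymtab.cpp assembles the names it force-marks as used: every
// libcall name from RuntimeLibcalls.def, plus two stack-protector globals
// that are not libcalls at all but that the backend may still reference.
static const char *PreservedSymbols[] = {
#define HANDLE_LIBCALL(code, name) name,
#include "llvm/IR/RuntimeLibcalls.def"
#undef HANDLE_LIBCALL
    "__ssp_canary_word",
    "__stack_chk_guard",
};
// Any global whose name appears in this list then gets the FB_used flag set
// in the IR symbol table, which is what prevents LTO from internalizing it.
```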

RuntimeLibcalls.def describes libcalls which can be emitted by various backends. This means that, absent a promise from a particular backend that it won't call a given symbol, the symbols in question must be available post-LTO. Erasing the whole PreservedSymbols simply doesn't work, at least not for the targets I'm familiar with; there are symbols which will be used unconditionally for certain IR constructs, so we must preserve those symbols. (For example, if you're on a soft-float target, and you're compiling code that uses floating-point, you need the compiler-rt soft-float routines.)

If RuntimeLibcalls.def contains symbols that you don't actually need in specific scenarios, we need a mechanism to narrow the list somehow. Maybe we can add a target hook to get a target-specific list of relevant symbols.


I was under the impression that was what the freestanding flag did for LTO: it sets TLII->disableAllFunctions(). I'm guessing that's not sufficient here?

it sets TLII->disableAllFunctions(). I'm guessing that's not sufficient here?

That just disables TargetLibraryInfo, i.e. recognition of non-intrinsic symbols as having some particular meaning. Backends generally still assume that compiler-rt.builtins or libgcc, plus memcpy/memmove/memset, are available.
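
For reference, this is approximately all the Freestanding flag amounts to when the LTO backend sets up its pass pipeline (a simplified sketch of the existing behavior, not new functionality):

```cpp
#include "llvm/Analysis/TargetLibraryInfo.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/TargetParser/Triple.h"

using namespace llvm;

// Simplified sketch: with Freestanding set, every TLI entry is disabled, so
// the *optimizer* stops recognizing malloc/printf/sin/... as library
// routines. The code generator may still introduce calls to
// compiler-rt/libgcc routines and to memcpy/memmove/memset during
// legalization.
static void registerFreestandingTLI(FunctionAnalysisManager &FAM,
                                    const Module &M, bool Freestanding) {
  TargetLibraryInfoImpl TLII(Triple(M.getTargetTriple()));
  if (Freestanding)
    TLII.disableAllFunctions();
  // registerPass constructs the analysis immediately, copying TLII into it.
  FAM.registerPass([&] { return TargetLibraryAnalysis(TLII); });
}
```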


The functions that I'm interested in internalizing are the math functions. Since the GPU resolves everything statically we should be able to internalize all of this and remove the overhead. I'm not sure if there's a good way to express that. Maybe a secondary list?

I think the strategy generally works. Those libcall functions don't really matter if you have a linker that can do dead-stripping, but marking them can help other linkers.

I wonder if the better strategy for libcall symbols is to patch them up at LTO time, rather than encoding them as used in the bitcode symtab; then you can decide based on the LTO configuration. Also, when there is a version mismatch between the compiler and LTO, it is more important to know what LTO considers a libcall function, since it is doing the code generation.

I think the math functions (sin() etc.) are only used if the corresponding intrinsics are used in LLVM IR (llvm.sin() etc.). We should be able to add a check along those lines.
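
A minimal sketch of that kind of check (the helper name and the exact set of intrinsics here are just for illustration):

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Sketch: only treat the math libcalls as "needed" if the corresponding
// intrinsics actually appear in the merged module; otherwise they would be
// safe to internalize even though they show up in RuntimeLibcalls.def.
static bool moduleUsesFloatMathIntrinsics(const Module &M) {
  for (const Function &F : M) {
    switch (F.getIntrinsicID()) {
    case Intrinsic::sin:
    case Intrinsic::cos:
    case Intrinsic::pow:
    case Intrinsic::exp:
    case Intrinsic::log:
      if (!F.use_empty())
        return true;
      break;
    default:
      break;
    }
  }
  return false;
}
```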


Yeah, I'm not sure if this would block us from generating some needed runtime calls. I don't think that this would collide with the GPU at least but I'm not an expert. I'm wondering if there is a good way to slice this, because in general I think it's not ideal that we cannot treat these functions as standard symbols if we are the ones providing the implementation. E.g. if we implement sin on the GPU we expect to resolve it statically like any other symbol.

arsenm added a comment. Jul 5 2023, 5:20 AM


There is no AMDGPU compiler-rt. We'll just fail if we end up accidentally hitting the libcall legalizations in the DAG, and we try to inline expand everything. It may not always be this way; I do think we'd be better off emitting compiler-rt style calls in a few instances, but we're not going to be doing that any time soon. The worst offenders are the giant f64 math intrinsics, which are just totally unhandled.


If nothing else I could probably keep this approach but only enable it on NVPTX and AMDGPU as they are totally unhosted as far as I'm aware.

arsenm added inline comments. Jul 5 2023, 5:26 AM
llvm/include/llvm/Object/IRSymtab.h
117

Is this a visible IR change? I've wondered if we needed a new form of linkage for this

jhuber6 added inline comments. Jul 5 2023, 5:28 AM
llvm/include/llvm/Object/IRSymtab.h
117

Only the IR symbol table, I thought. I guess the ideal solution would be to have a special linkage for "compiler-rt"-type functions that the backend can emit. Then each target lists the compiler builtins it expects to emit.

I've spent a surprising amount of time staring at this libcall list (originally in the context of https://discourse.llvm.org/t/lto-deplibs-and-libcalls-oh-my/64510). It's struck me as a bit of a hack; targets are free to change the names of libcalls in TLI or to include additional target-specific ones, and neither would end up in this list. LTO also depends on the accuracy of this list for correctness, since the symbol table of the link must be finalized before LTO codegen begins.

If RuntimeLibcalls.def contains symbols that you don't actually need in specific scenarios, we need a mechanism to narrow the list somehow. Maybe we can add a target hook to get a target-specific list of relevant symbols.

I went a little ways down this route in https://reviews.llvm.org/D127885 to resolve https://github.com/llvm/llvm-project/issues/56070; maintaining per-target lists like this in the linker is a pretty rough maintenance burden.

It seems like it might be advantageous to provide a more direct way for the linker to query the code generator for this list via the LTO API; the code generator should be able to produce an accurate list from the configuration it's provided by LLD.
Otherwise, we're left with the status quo, where the linker has to maintain in parallel a conservative description of the eventual behavior of the code generator. Any way in which that description is wrong is a way in which you can't use LTO.
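
To make the status quo concrete: the linker effectively takes the whole conservative libcall list and keeps every matching symbol visible through LTO, something like the following (illustrative only; real linkers wire this up differently):

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/LTO/LTO.h"

using namespace llvm;

// Illustrative only: if a symbol's name is on the conservative libcall list,
// the linker must assume post-LTO code generation might reference it, so it
// is kept visible and cannot be internalized or dead-stripped.
static void conservativelyPreserveLibcall(ArrayRef<const char *> Libcalls,
                                          StringRef SymName,
                                          lto::SymbolResolution &Res) {
  if (llvm::is_contained(Libcalls, SymName))
    Res.VisibleToRegularObj = true;
}
```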

It's struck me as a bit of a hack; targets are free to change the names of libcalls in TLI or to include additional target-specific ones, and neither would end up in this list. LTO also depends on the accuracy of this list for correctness, since the symbol table of the link must be finalized before LTO codegen begins.

Right.

It seems like it might be advantageous to provide a more direct way for the linker to query the code generator for this list via the LTO API; the code generator should be able to produce an accurate list from the configuration it's provided by LLD.

This makes sense; I'm not sure how hard it would be to implement, though.

If we can maintain a map from some compiler builtin intrinsics to possible libcall functions that codegen can emit from them, it is possible that those functions can just be emitted as Undef in the LTO object symbol table. We can drop all the hacks that mark them as used to prevent them from being dead-stripped.

If we can maintain a map from some compiler builtin intrinsics to possible libcall functions that codegen can emit from them, it is possible that those functions can just be emitted as Undef in the LTO object symbol table

The idea here is that we do the computation at compile-time, instead of making the linker do it? I suspect we end up with a few intrinsics where we can compute the mapping to libcalls somewhat precisely, alongside a long list of libcalls where the result is unpredictable because it depends on codegen optimizations. So almost every object file would contain a long list of target-specific libcalls; I'm not sure that's the right tradeoff.
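
A sketch of what such a compile-time mapping could look like (entirely hypothetical; this table and helper do not exist in LLVM):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/Intrinsics.h"

using namespace llvm;

// Hypothetical compile-time mapping from intrinsics seen in the module to the
// libcall names codegen might lower them to; the compiler would then record
// only these names as undefined in the LTO object symbol table.
static SmallVector<StringRef, 4> possibleLibcallsFor(Intrinsic::ID ID) {
  switch (ID) {
  case Intrinsic::sin:
    return {"sinf", "sin", "sinl"};
  case Intrinsic::memcpy:
    return {"memcpy"};
  default:
    // For many intrinsics the answer depends on codegen decisions, which is
    // exactly the unpredictability concern raised above.
    return {};
  }
}
```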


The LTO pass has access to the target machine when we run the backend. I'd presume that we could just let the target machine optionally export a list of libcalls it expects to be able to emit safely. It might be messy to plumb that information into the existing LTO handling, however. I remember hearing @jdoerfert mention there's a similar set somewhere.
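
If it helps frame the discussion, the hook being imagined here might look something like this (a purely hypothetical API, not something that exists on TargetMachine today):

```cpp
#include "llvm/ADT/ArrayRef.h"

namespace llvm {
// Hypothetical extension point: a target reports the runtime libcalls its
// code generator may still introduce after IR optimization. The LTO driver
// could then preserve only these names (or none, for fully unhosted targets
// such as AMDGPU/NVPTX) instead of the whole RuntimeLibcalls.def list.
class TargetMachineLibcallInfo { // hypothetical, for illustration only
public:
  virtual ~TargetMachineLibcallInfo() = default;
  virtual ArrayRef<const char *> emittableLibcalls() const { return {}; }
};
} // namespace llvm
```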