When we build the libomptarget device runtime library targeting bitcode,
we need to take special care that certain functions are not optimized
out. This is because we manually internalize and optimize these
definitions, ignoring their standard linkage semantics. When we build
the static library instead, those semantics are preserved and we do not
need to keep these functions alive. Worse, keeping them alive prevents
them from being removed during LTO, which stops us from completely
internalizing IsSPMDMode and removing several other functions. This
patch drops the keep-alive annotations for the static library target by
gating them behind a macro definition.
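As a rough sketch of the gating described above (the KEEP_ALIVE name and the example symbol are illustrative, not the actual DeviceRTL identifiers; the define mirrors the -DLIBOMPTARGET_BC_TARGET flag discussed later in this review):

```cpp
// Sketch only: macro-gated keep-alive annotation.
#ifdef LIBOMPTARGET_BC_TARGET
// Bitcode build: put the definition on the llvm.used list so eager
// internalization and DCE cannot drop it before it is actually needed.
#define KEEP_ALIVE __attribute__((used))
#else
// Static library build: normal archive/LTO semantics keep the definition
// around until the final link, so no annotation is required.
#define KEEP_ALIVE
#endif

// Hypothetical runtime entry point that must survive early optimization.
extern "C" KEEP_ALIVE int __example_exec_mode_query() { return 0; }
```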
Event Timeline
I don't see why going via the static library is changing the semantics of the optimisation pass. That seems bad. What're we currently doing that stops us removing the magic functions after linking the devicertl?
The bitcode library is a bit of a hack; we don't do true linking on it. When we link it in via -mlink-builtin-bitcode we eagerly internalize everything, with the result that certain functions get optimized out when we want them to remain alive. We worked around this with hacks that just put them in the used list. The magic functions aren't removed because removing them would optimize out some functions or variables we might still need. The difference with the static library is that it maintains all the standard semantics of a static library, so nothing is optimized out until the final linking job when we have all the information, and there's no worry about anything being removed prematurely.
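Roughly what the eager internalization amounts to, using invented names for illustration:

```cpp
// Illustration only. In the .bc library this helper has external linkage:
extern "C" void __example_rtl_helper() { /* ... */ }

// After -mlink-builtin-bitcode splices the module into the user's TU and
// internalizes it, it behaves much as if it had been written like this:
static void __example_rtl_helper_internal() { /* ... */ }
// With internal linkage and no caller emitted yet (OpenMPOpt may only
// introduce the call later), dead-code elimination is free to delete it,
// which is why the used-list hack exists for the bitcode build.
```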
Is the difference that we link the .bc vs the .a at different points in the pipeline, or the behaviour of -mlink-builtin-bitcode, which does internalisation and some attribute propagation, and maybe other things?
On the face of it we should do exactly the same thing with the devicertl whether it is wrapped in an archive or not, so same point in pipeline and splice in the contents using the same hacks we have in place without it. Otherwise we can expect a long tail of basically spurious behaviour changes based on whether the archive convenience feature is in use or not.
We link the .bc early and internalize eagerly, the .a we link at the very end and treat it just like any other library. We could get rid of the internalization for the bitcode linking in Clang, but then the performance would be awful.
> On the face of it we should do exactly the same thing with the devicertl whether it is wrapped in an archive or not, so same point in pipeline and splice in the contents using the same hacks we have in place without it. Otherwise we can expect a long tail of basically spurious behaviour changes based on whether the archive convenience feature is in use or not.
The existing code is just a hacky workaround that's not necessary when using the static library, it shouldn't change any behavior otherwise. Right now we could get rid of the bitcode library and just use this static library, but that would tank the performance of non-LTO builds on NVPTX. I don't think it's too much of a problem to just keep this workaround where it's needed, but remove it where it's not.
Removing / reducing the keepalive hack seems like a good thing but I expect -DLIBOMPTARGET_BC_TARGET to come back and bite us. It means putting (slightly) different bitcode in a static library vs linking the bitcode directly and I expect that'll lead to bugs that repro on one config and not the other.
Instead of that, let's go with whichever setup works best for LTO and accept a minor performance regression on the non-LTO case. That gets us identical bitcode on each path and a straightforward message for users about choosing between compile time cost and runtime performance.
I don't think that will work; necessary functions will be optimized out otherwise. The best solution is to just not build the bitcode version in the first place and only use the static library, but that would tank performance pretty heavily on non-LTO builds since we can't inline runtime functions or optimize them together anymore. In the future we can make LTO the default since it improves performance in almost every case, but I don't want to make that switch right now. This is kind of a short-term workaround so we can get better IR with LTO without completely deleting the bitcode library. I think it'll be fine to have a slightly different version for the two methods; if one fails we can always tell users to use LTO until that's the default.
Can we link the bitcode after openmp-opt? I was never keen on llvm emitting calls to functions that were linked in earlier and may have since been deleted. Optimising by runtime function name before linking them should be equivalent to what we have now and that renders the keepalive hack unnecessary.
You would need to link it after any optimizations that could potentially inline the functions, which somewhat defeats the purpose of doing this over the static library for now.
This is working around design mistakes elsewhere. Essentially, linking N copies of the RTL, internalising them, and then trying to access symbols from the internalised lib later is a mess. But this particular change isn't _that_ likely to make things worse, and it slightly reduces the scope of some of the hacks.