We build deviceRTLs with -O1 by default, which also triggers OpenMPOpt. When
the info cache is created, some attributes are removed. As a result, although we
mark a few functions noinline, they are still inlined when the bitcode library
is generated. This can cause problems in middle-end optimization of application
code, which relies on those calls still being present.
Repository: rG LLVM Github Monorepo
Event Timeline
Does the same issue not affect application code? Or in other words, what was the issue in the middle end that disabling the pass fixes?
That issue does not affect application code. The issue is that when OpenMPOpt builds the information cache, such attributes are stripped. For example, although we add the noinline attribute to the function __kmpc_parallel_level, it is stripped. As a result, the function call in __kmpc_parallel_51 is inlined. That is a problem when we later optimize application code, especially with the Attributor, because we rely on the function call still being there. However, disabling the pass is not a problem when we build the deviceRTLs: all OpenMP optimization must start with kernels, and when building the deviceRTLs there is no kernel, so no optimization from OpenMPOpt is fired.
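To make the concern concrete, here is a minimal source-level sketch of the pattern being described. The names parallel_level and parallel_51 are hypothetical stand-ins for __kmpc_parallel_level and __kmpc_parallel_51; the real runtime functions have different signatures and bodies, and the sketch assumes a Clang or GCC compatible compiler for the attribute spelling.

```cpp
// Hypothetical stand-ins for the runtime entry points discussed above.
// This only illustrates the noinline contract, not the real deviceRTL code.
static int level = 0;

// noinline asks the compiler to keep a real call at every call site.
// Later optimizations may look for exactly this call.
__attribute__((noinline)) int parallel_level() { return level; }

// The caller relies on the call to parallel_level() staying visible.
// If the attribute is stripped and the call is inlined while the library
// is built, nothing is left to match when application code is optimized.
int parallel_51(void (*body)()) {
  ++level;                      // enter the parallel region
  body();                       // run the outlined region body
  int seen = parallel_level();  // the call that must not be inlined
  --level;                      // leave the parallel region
  return seen;
}
```

Calling parallel_51 with an empty body returns 1 whether or not the call is inlined; the difference is only visible in the generated IR, which is why the problem shows up in later optimization rather than in behavior.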
We disable OpenMPOpt when building the deviceRTLs. This does not cause any problem, because the pass is eventually triggered when the runtime is optimized along with the application code.
Could you clarify:
all OpenMP optimization must start with kernels
Is the problem that the deviceRTL defines functions with names that are pattern-matched by the optimisation pass, in which case this patch is definitely necessary? That would be analogous to building libc itself with -ffreestanding.
I'm slightly surprised that we don't have an early return in the pass when there are no kernels in the IR module since that is probably quicker than mutating the module and then mutating it back. If we implemented that then we wouldn't need to special case the cmake here, but it might still be clearer to keep it explicitly disabled.
We don't mutate and mutate back. We want to run on openmp-device modules without kernels to do things like globalization removal if the kernel is split across TUs, at least until we do LTO by default.
I was thinking of D106707, haven't looked for more examples
We want to run on openmp-device modules without kernels to do things like globalization removal if the kernel is split across TUs, at least until we do LTO by default.
OK, cool. If we have/want optimisations that run without visibility of the top level kernel then we don't want to skip the pass when there are no top level kernels.
I think all is good here, I just didn't understand the commit message.
I'm confused. What has that to do with this? In that patch we preserve symbols that have been made internal in the earlier linking step so we can use them later. All of that happens after the runtime is already built, optimized, and shipped.
The behaviour of the optimisation pass seems germane to disabling the optimisation pass.
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt | |
---|---|
210 | Tangential to this, I think we should run opt -O2 or similar across the linked bitcode library before it ships, and internalize all the symbols that aren't __kmpc_ prefixed at the same time. |
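
The internalization policy suggested in that comment can be sketched as a simple name filter. This is a toy model under stated assumptions, not the LLVM internalize API: keepExternal and symbolsToInternalize are hypothetical helpers, and the only rule taken from the discussion is the __kmpc_ prefix.

```cpp
#include <string>
#include <vector>

// Toy model of the suggested policy: symbols with the __kmpc_ prefix stay
// externally visible, everything else becomes internal to the library.
// Hypothetical helper, not part of LLVM.
bool keepExternal(const std::string &name) {
  // rfind with position 0 only matches at the start of the string.
  return name.rfind("__kmpc_", 0) == 0;
}

// Returns the subset of symbols that would be internalized.
std::vector<std::string>
symbolsToInternalize(const std::vector<std::string> &symbols) {
  std::vector<std::string> result;
  for (const std::string &s : symbols)
    if (!keepExternal(s))
      result.push_back(s);
  return result;
}
```

With this filter, __kmpc_parallel_51 would stay exported while an unprefixed helper would be internalized. A real implementation would also have to keep any other symbols that the compiler or application code references directly.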
amdgpu is still using openmp-opt-disable-internalization, going to patch that shortly
remove this line then.