This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][NVPTX] Disable OpenMPOpt when building deviceRTLs
ClosedPublic

Authored by tianshilei1992 on Jul 23 2021, 1:49 PM.

Details

Summary

We build deviceRTLs with -O1 by default, which also triggers OpenMPOpt. When
the info cache is created, some attributes are removed. As a result, although we
mark a few functions noinline, they are still inlined when the bitcode library
is generated. This can cause an issue in middle end optimization.

Diff Detail

Event Timeline

tianshilei1992 created this revision.Jul 23 2021, 1:49 PM
tianshilei1992 requested review of this revision.Jul 23 2021, 1:49 PM
Herald added a project: Restricted Project. · View Herald Transcript
jdoerfert accepted this revision.Jul 24 2021, 8:10 PM

LGTM

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
156

remove this line then.

This revision is now accepted and ready to land.Jul 24 2021, 8:10 PM
tianshilei1992 marked an inline comment as done.Jul 25 2021, 7:37 AM
This revision was landed with ongoing or failed builds.Jul 25 2021, 7:38 AM
This revision was automatically updated to reflect the committed changes.

Does the same issue not affect application code? Or in other words, what was the issue in the middle end that disabling the pass fixes?

Does the same issue not affect application code? Or in other words, what was the issue in the middle end that disabling the pass fixes?

That issue will not affect application. The issue is, in OpenMPOpt, when we build the information cache, such attributes will be stripped. For example, although we add noinline attribute to the function __kmpc_parallel_level, it is stripped. As a result, the function call in __kmpc_parallel_51 is inlined. That is a problem when we optimize application code especially using attributor because we are using the information that the function call should be there. However, it is not a problem when we build the deviceRTLs as all OpenMP optimization must start with kernels, but when building the deviceRTLs there is no kernel so no optimization from OpenMPOpt is fired.
We disable OpenMPOpt when building deviceRTLs. It would not cause any problem as eventually it will be triggered along with application code.

Could you clarify:

all OpenMP optimization must start with kernels

Is the problem that the deviceRTL defines functions with names that are pattern matched from the optimisation pass, in which case this patch is definitely necessary? Analogous to building libc itself as ffreestanding.

I'm slightly surprised that we don't have an early return in the pass when there are no kernels in the IR module since that is probably quicker than mutating the module and then mutating it back. If we implemented that then we wouldn't need to special case the cmake here, but it might still be clearer to keep it explicitly disabled.

I'm slightly surprised that we don't have an early return in the pass when there are no kernels in the IR module since that is probably quicker than mutating the module and then mutating it back.

We don't mutate and mutate back. We want to run on openmp-device modules without kernels to do things like globalization removal if the kernel is split across TUs, at least until we do LTO by default.

I'm slightly surprised that we don't have an early return in the pass when there are no kernels in the IR module since that is probably quicker than mutating the module and then mutating it back.

We don't mutate and mutate back.

I was thinking of D106707, haven't looked for more examples

We want to run on openmp-device modules without kernels to do things like globalization removal if the kernel is split across TUs, at least until we do LTO by default.

OK, cool. If we have/want optimisations that run without visibility of the top level kernel then we don't want to skip the pass when there are no top level kernels.

I think all is good here, I just didn't understand the commit message.

I'm slightly surprised that we don't have an early return in the pass when there are no kernels in the IR module since that is probably quicker than mutating the module and then mutating it back.

We don't mutate and mutate back.

I was thinking of D106707, haven't looked for more examples

I'm confused. What has that to do with this? In the patch we preserve symbols that have been made internal in the earlier linking step so we can use them later, all that happens after the runtime is already build and optimized and shipped.

We want to run on openmp-device modules without kernels to do things like globalization removal if the kernel is split across TUs, at least until we do LTO by default.

OK, cool. If we have/want optimisations that run without visibility of the top level kernel then we don't want to skip the pass when there are no top level kernels.

I think all is good here, I just didn't understand the commit message.

The behaviour of the optimisation pass seems germane to disabling the optimisation pass.

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
210

tangential to this, I think we should run opt -O2 or similar across the linked bitcode library before it ships, and internalize all the symbols that aren't __kmpc_ prefixed at the same time

amdgpu is still using openmp-opt-disable-internalization, going to patch that shortly