This is an archive of the discontinued LLVM Phabricator instance.

Differential D106710

[OpenMP][NVPTX] Disable OpenMPOpt when building deviceRTLs
Closed (Public)

Authored by tianshilei1992 on Jul 23 2021, 1:49 PM.

Details

Reviewers

jdoerfert

Commits

rGf1b8fa55d033: [OpenMP][NVPTX] Disable OpenMPOpt when building deviceRTLs

Summary

We build the deviceRTLs with -O1 by default, which also triggers OpenMPOpt. When OpenMPOpt
creates its information cache, some attributes are removed. As a result, although we
mark a few functions noinline, they are still inlined when the bitcode library
is generated. This can cause problems for middle-end optimizations that rely on those calls still being present.
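For reference, the flag change amounts to the following CMake excerpt. It is a sketch reconstructed from the diff further down this page. Only the changed option is shown in context, and the remaining compile flags are omitted.

```cmake
# Compile flags for the NVPTX deviceRTL bitcode (excerpt).
# Before this patch the list contained:
#   -mllvm -openmp-opt-disable-internalization
# which only turned off OpenMPOpt's internalization, so the pass still ran at -O1
# and its information cache still stripped the noinline attributes.
# After this patch OpenMPOpt is switched off entirely for the runtime build:
set(bc_flags -S -x c++ -O1 -std=c++14
  -mllvm -openmp-opt-disable
  -target nvptx64
  -Xclang -emit-llvm-bc
  # ... the remaining flags are unchanged and not repeated here
)
```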

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

tianshilei1992 created this revision. Jul 23 2021, 1:49 PM

Herald added subscribers: guansong, yaxunl, mgorny. Jul 23 2021, 1:49 PM

tianshilei1992 requested review of this revision. Jul 23 2021, 1:49 PM

Herald added a reviewer: jdoerfert. Jul 23 2021, 1:49 PM

Herald added a project: Restricted Project.

Herald added subscribers: openmp-commits, sstefan1.

Harbormaster completed remote builds in B115949: Diff 361337. Jul 23 2021, 2:29 PM

LGTM

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt, line 156: remove this line then.

This revision is now accepted and ready to land. Jul 24 2021, 8:10 PM

fix comments

tianshilei1992 marked an inline comment as done. Jul 25 2021, 7:37 AM

This revision was landed with ongoing or failed builds. Jul 25 2021, 7:38 AM

Closed by commit rGf1b8fa55d033: [OpenMP][NVPTX] Disable OpenMPOpt when building deviceRTLs (authored by tianshilei1992).

This revision was automatically updated to reflect the committed changes.

tianshilei1992 added a commit: rGf1b8fa55d033: [OpenMP][NVPTX] Disable OpenMPOpt when building deviceRTLs.

Harbormaster completed remote builds in B116072: Diff 361505. Jul 25 2021, 8:10 AM

Does the same issue not affect application code? Or in other words, what was the issue in the middle end that disabling the pass fixes?

In D106710#2903223, @JonChesterfield wrote:

Does the same issue not affect application code? Or in other words, what was the issue in the middle end that disabling the pass fixes?

That issue will not affect applications. The issue is that OpenMPOpt strips some attributes when it builds its information cache. For example, although we add the noinline attribute to __kmpc_parallel_level, it is stripped, and as a result the call to it in __kmpc_parallel_51 gets inlined. That is a problem when we later optimize application code, especially with the Attributor, because we rely on that call still being there. Disabling the pass costs nothing when building the deviceRTLs: all OpenMP optimizations must start from kernels, and there are no kernels in the deviceRTLs, so OpenMPOpt never fires an optimization there anyway.
We disable OpenMPOpt when building the deviceRTLs. This does not cause any problem, because the pass will eventually run together with the application code.
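One way to check the effect described above is to disassemble the shipped bitcode library with llvm-dis and confirm that a call to __kmpc_parallel_level is still present. The script below is a hypothetical helper: its file name and the LL_FILE variable are assumptions, not part of the patch. It only checks that at least one call site survives somewhere in the module, not specifically inside __kmpc_parallel_51.

```cmake
# check_parallel_level_call.cmake -- hypothetical helper, not part of this patch.
# Usage:
#   llvm-dis libomptarget-nvptx-sm_<sm>.bc -o deviceRTL.ll
#   cmake -DLL_FILE=deviceRTL.ll -P check_parallel_level_call.cmake
cmake_minimum_required(VERSION 3.13)

if(NOT DEFINED LL_FILE)
  message(FATAL_ERROR "Pass -DLL_FILE=<textual IR file produced by llvm-dis>")
endif()

file(READ "${LL_FILE}" ir)

# A surviving call site looks like "call <ret type> @__kmpc_parallel_level(" on a
# single line (possibly preceded by "tail"). The declaration and definition start
# with "declare"/"define" instead, so they do not match.
string(REGEX MATCH "call [^\n]*@__kmpc_parallel_level\\(" call_site "${ir}")

if(call_site)
  message(STATUS "Found call site: ${call_site}")
else()
  message(FATAL_ERROR
    "No call to __kmpc_parallel_level found; it may have been inlined everywhere")
endif()
```

Running the same check on a library built with -openmp-opt-disable-internalization instead of -openmp-opt-disable would be the natural comparison point.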

Could you clarify:

all OpenMP optimization must start with kernels

Is the problem that the deviceRTL defines functions with names that are pattern-matched by the optimisation pass? In that case this patch is definitely necessary, analogous to building libc itself with -ffreestanding.

I'm slightly surprised that we don't have an early return in the pass when there are no kernels in the IR module since that is probably quicker than mutating the module and then mutating it back. If we implemented that then we wouldn't need to special case the cmake here, but it might still be clearer to keep it explicitly disabled.

In D106710#2904091, @JonChesterfield wrote:

I'm slightly surprised that we don't have an early return in the pass when there are no kernels in the IR module since that is probably quicker than mutating the module and then mutating it back.

We don't mutate and mutate back. We want to run on openmp-device modules without kernels to do things like globalization removal if the kernel is split across TUs, at least until we do LTO by default.

In D106710#2904210, @jdoerfert wrote:

In D106710#2904091, @JonChesterfield wrote:

I'm slightly surprised that we don't have an early return in the pass when there are no kernels in the IR module since that is probably quicker than mutating the module and then mutating it back.

We don't mutate and mutate back.

I was thinking of D106707, haven't looked for more examples

We want to run on openmp-device modules without kernels to do things like globalization removal if the kernel is split across TUs, at least until we do LTO by default.

OK, cool. If we have/want optimisations that run without visibility of the top level kernel then we don't want to skip the pass when there are no top level kernels.

I think all is good here, I just didn't understand the commit message.

In D106710#2904241, @JonChesterfield wrote:

In D106710#2904210, @jdoerfert wrote:

In D106710#2904091, @JonChesterfield wrote:

I'm slightly surprised that we don't have an early return in the pass when there are no kernels in the IR module since that is probably quicker than mutating the module and then mutating it back.

We don't mutate and mutate back.

I was thinking of D106707, haven't looked for more examples

I'm confused. What does that have to do with this? In that patch we preserve symbols that were made internal in the earlier linking step so we can use them later; all of that happens after the runtime has already been built, optimized, and shipped.

We want to run on openmp-device modules without kernels to do things like globalization removal if the kernel is split across TUs, at least until we do LTO by default.

OK, cool. If we have/want optimisations that run without visibility of the top level kernel then we don't want to skip the pass when there are no top level kernels.

I think all is good here, I just didn't understand the commit message.

The behaviour of the optimisation pass seems germane to disabling the optimisation pass.

JonChesterfield added inline comments. Jul 26 2021, 10:12 AM

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt, line 210: tangential to this, I think we should run opt -O2 or similar across the linked bitcode library before it ships, and internalize all the symbols that aren't `__kmpc_` prefixed at the same time.
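A rough sketch of that suggestion as an extra CMake step after the link step shown in the diff. `opt_tool` and `bclib_opt_name` are hypothetical names. The sketch also assumes an `opt` that understands `-passes=` pipelines and accepts glob patterns in `-internalize-public-api-list`, which may not hold for every LLVM version.

```cmake
# Hypothetical post-link step (not part of this patch): run opt once over the
# linked runtime and internalize every symbol except the __kmpc_* entry points.
# opt_tool and bclib_opt_name are assumed names for illustration only.
set(bclib_opt_name "libomptarget-nvptx-sm_${sm}-opt.bc")

add_custom_command(OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/${bclib_opt_name}
  COMMAND ${opt_tool}
          # Keep OpenMPOpt off here too, for the reason given in this patch.
          -openmp-opt-disable
          # Internalize first so the O2 pipeline can drop unused internal symbols.
          "-passes=internalize,default<O2>"
          # Assumes this option accepts glob patterns.
          "-internalize-public-api-list=__kmpc_*"
          ${CMAKE_CURRENT_BINARY_DIR}/${bclib_name}
          -o ${CMAKE_CURRENT_BINARY_DIR}/${bclib_opt_name}
  DEPENDS ${CMAKE_CURRENT_BINARY_DIR}/${bclib_name}
  COMMENT "Optimizing and internalizing ${bclib_name}"
  # VERBATIM keeps the shell from interpreting "<", ">" and "*".
  VERBATIM
)
```

A pattern limited to `__kmpc_*` would also internalize other externally visible runtime entry points, such as the `omp_*` API functions the device runtime defines, so the preserved list would presumably need to be broader in practice.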

amdgpu is still using -openmp-opt-disable-internalization; going to patch that shortly.

JonChesterfield mentioned this in rGf420939b8276: [libomptarget] Apply D106710 to amdgcn devicertl. Aug 18 2021, 5:35 PM

Revision Contents

Path: openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
Size: 2 lines

Diff 361506

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt

[147 lines not shown; enclosing block: set(cuda_src_files]
  ${devicertl_common_directory}/src/sync.cu
  ${devicertl_common_directory}/src/task.cu
  ${devicertl_common_directory}/src/shuffle.cpp
  src/target_impl.cu
  )

  # Set flags for LLVM Bitcode compilation.
  set(bc_flags -S -x c++ -O1 -std=c++14
- -mllvm -openmp-opt-disable-internalization
+ -mllvm -openmp-opt-disable
  [Inline comment by jdoerfert, marked Done: "remove this line then."]
  -target nvptx64
  -Xclang -emit-llvm-bc
  -Xclang -aux-triple -Xclang ${aux_triple}
  -fopenmp -fopenmp-cuda-mode -Xclang -fopenmp-is-device
  -Xclang -target-feature -Xclang +ptx61
  -D__CUDACC__
  -I${devicertl_base_directory}
  -I${devicertl_common_directory}/include
[37 lines not shown; enclosing block: foreach(src ${cuda_src_files})]
  set_property(DIRECTORY APPEND PROPERTY ADDITIONAL_MAKE_CLEAN_FILES ${outfile})

  list(APPEND bc_files ${outfile})
  endforeach()

  set(bclib_name "libomptarget-nvptx-sm_${sm}.bc")

  # Link to a bitcode library.
  add_custom_command(OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/${bclib_name}
  [Inline comment by JonChesterfield, Not Done: "tangential to this, I think we should run opt -O2 or similar across the linked bitcode library before it ships, and internalize all the symbols that aren't `__kmpc_` prefixed at the same time"]
  COMMAND ${bc_linker}
  -o ${CMAKE_CURRENT_BINARY_DIR}/${bclib_name} ${bc_files}
  DEPENDS ${bc_files}
  COMMENT "Linking LLVM bitcode ${bclib_name}"
  )
  if("${bc_linker}" STREQUAL "$<TARGET_FILE:llvm-link>")
  # Add a file-level dependency to ensure that llvm-link is up-to-date.
  # By default, add_custom_command only builds llvm-link if the
[26 lines not shown]