When we build the libomptarget device runtime library targeting bitcode,
we need to take special care that certain functions are not optimized
out. This is because we manually internalize and optimize these
definitions, ignoring their standard linkage semantics. When we build
the static library instead, those semantics are preserved and we do not
need to keep these functions alive. Worse, keeping them alive prevents
them from being removed during LTO, which stops us from completely
internalizing IsSPMDMode and removing several other functions. This
patch drops the keep-alive annotations for the static library target by
gating them behind a macro definition.
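As a rough sketch of the gating described above (the KEEP_ALIVE name and the example symbol are illustrative, not the actual DeviceRTL identifiers; the define mirrors the -DLIBOMPTARGET_BC_TARGET flag discussed later in this review):

```cpp
// Sketch only: macro-gated keep-alive annotation.
#ifdef LIBOMPTARGET_BC_TARGET
// Bitcode build: put the definition on the llvm.used list so eager
// internalization and DCE cannot drop it before it is actually needed.
#define KEEP_ALIVE __attribute__((used))
#else
// Static library build: normal archive/LTO semantics keep the definition
// around until the final link, so no annotation is required.
#define KEEP_ALIVE
#endif

// Hypothetical runtime entry point that must survive early optimization.
extern "C" KEEP_ALIVE int __example_exec_mode_query() { return 0; }
```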
Event Timeline
I don't see why going via the static library is changing the semantics of the optimisation pass. That seems bad. What're we currently doing that stops us removing the magic functions after linking the devicertl?
The bitcode library is a bit of a hack; we don't do true linking on it. When we link it in via -mlink-builtin-bitcode we eagerly internalize everything, with the result that certain functions get optimized out when we want them to remain alive. We worked around this with hacks that just put them in the used list. The magic functions aren't removed because removing them would optimize out some functions or variables we might still need. The difference with the static library is that it maintains all the standard semantics of a static library, so nothing is optimized out until the final linking job when we have all the information, and there's no worry about anything being removed prematurely.
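Roughly what the eager internalization amounts to, using invented names for illustration:

```cpp
// Illustration only. In the .bc library this helper has external linkage:
extern "C" void __example_rtl_helper() { /* ... */ }

// After -mlink-builtin-bitcode splices the module into the user's TU and
// internalizes it, it behaves much as if it had been written like this:
static void __example_rtl_helper_internal() { /* ... */ }
// With internal linkage and no caller emitted yet (OpenMPOpt may only
// introduce the call later), dead-code elimination is free to delete it,
// which is why the used-list hack exists for the bitcode build.
```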
Is the difference that we link the .bc vs the .a at different points in the pipeline, or the behaviour of -mlink-builtin-bitcode, which does internalisation and some attribute propagation, and maybe other things?
On the face of it we should do exactly the same thing with the devicertl whether it is wrapped in an archive or not, so same point in pipeline and splice in the contents using the same hacks we have in place without it. Otherwise we can expect a long tail of basically spurious behaviour changes based on whether the archive convenience feature is in use or not.
We link the .bc early and internalize eagerly, the .a we link at the very end and treat it just like any other library. We could get rid of the internalization for the bitcode linking in Clang, but then the performance would be awful.
> On the face of it we should do exactly the same thing with the devicertl whether it is wrapped in an archive or not, so same point in pipeline and splice in the contents using the same hacks we have in place without it. Otherwise we can expect a long tail of basically spurious behaviour changes based on whether the archive convenience feature is in use or not.
The existing code is just a hacky workaround that's not necessary when using the static library, it shouldn't change any behavior otherwise. Right now we could get rid of the bitcode library and just use this static library, but that would tank the performance of non-LTO builds on NVPTX. I don't think it's too much of a problem to just keep this workaround where it's needed, but remove it where it's not.
Removing / reducing the keepalive hack seems like a good thing but I expect -DLIBOMPTARGET_BC_TARGET to come back and bite us. It means putting (slightly) different bitcode in a static library vs linking the bitcode directly and I expect that'll lead to bugs that repro on one config and not the other.
Instead of that, let's go with whichever setup works best for LTO and accept a minor performance regression on the non-LTO case. That gets us identical bitcode on each path and a straightforward message for users about choosing between compile time cost and runtime performance.
I don't think that will work; necessary functions will be optimized out otherwise. The best solution is to just not build the bitcode version in the first place and only use the static library, but that would tank performance pretty heavily on non-LTO builds since we can't inline runtime functions or optimize them together anymore. In the future we can make LTO the default since it improves performance in almost every case, but I don't want to make that switch right now. This is kind of a short-term workaround so we can get better IR with LTO without completely deleting the bitcode library. I think it'll be fine to have a slightly different version for the two methods; if one fails we can always tell users to use LTO until that's the default.
Can we link the bitcode after openmp-opt? I was never keen on llvm emitting calls to functions that were linked in earlier and may have since been deleted. Optimising by runtime function name before linking them should be equivalent to what we have now and that renders the keepalive hack unnecessary.
You would need to link it after any optimizations that could potentially inline the functions, which somewhat defeats the purpose of doing this over the static library for now.
This is working around design mistakes elsewhere. Essentially, linking N copies of the RTL, internalising them, and then trying to access symbols from the internalised lib later is a mess. But this particular change isn't _that_ likely to make things worse, and it slightly reduces the scope of some of the hacks.