This is an archive of the discontinued LLVM Phabricator instance.

Differential D149527

[libc] Support global constructors and destructors on NVPTX
ClosedPublic

Authored by jhuber6 on Apr 29 2023, 12:38 PM.

Details

Reviewers

jdoerfert
tianshilei1992
tra
sivachandra
lntue
michaelrj

Commits

rG2e1c0ec62979: [libc] Support global constructors and destructors on NVPTX

Summary

This patch adds the necessary hacks to support global constructors and
destructors. This is an incredibly hacky process caused by the primary
fact that Nvidia does not provide any binary tools and very little
linker support. We first had to emit references to these functions and
their priority in D149451. Then we dig them out of the module once it's
loaded to manually create the list that the linker should have made for
us. This patch also contains a few Nvidia specific hacks, but it passes
the test, albeit with a stack size warning from ptxas for the
callback. But this should be fine given the resource usage of a common
test.

This also adds a dependency on LLVM to the NVPTX loader, which hopefully doesn't
cause problems with our CUDA buildbot.

Depends on D149451
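
The manual reconstruction described in the summary can be sketched on the host roughly as follows. This is a minimal illustration using only the standard library, not the loader's actual code: `parse_priority` and `collect` are hypothetical helpers, and entries with a malformed priority are simply skipped here, whereas the loader in this patch reports an error.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: reads the trailing "_<priority>" of a symbol name.
// Returns false if the suffix is missing, non-numeric, or exceeds 16 bits.
bool parse_priority(const std::string &name, uint16_t &priority) {
  std::size_t pos = name.rfind('_');
  if (pos == std::string::npos || pos + 1 == name.size())
    return false;
  uint32_t value = 0;
  for (std::size_t i = pos + 1; i < name.size(); ++i) {
    if (name[i] < '0' || name[i] > '9')
      return false;
    value = value * 10 + static_cast<uint32_t>(name[i] - '0');
    if (value > UINT16_MAX)
      return false;
  }
  priority = static_cast<uint16_t>(value);
  return true;
}

struct CtorDtorLists {
  std::vector<std::pair<std::string, uint16_t>> ctors;
  std::vector<std::pair<std::string, uint16_t>> dtors;
};

// Hypothetical helper: filters a symbol list by prefix and orders the result.
CtorDtorLists collect(const std::vector<std::string> &symbols) {
  CtorDtorLists lists;
  for (const std::string &name : symbols) {
    bool is_init = name.rfind("__init_array_object_", 0) == 0;
    bool is_fini = name.rfind("__fini_array_object_", 0) == 0;
    if (!is_init && !is_fini)
      continue;
    uint16_t priority = 0;
    if (!parse_priority(name, priority))
      continue; // Assumption: skip instead of aborting as the loader does.
    (is_init ? lists.ctors : lists.dtors).emplace_back(name, priority);
  }
  auto by_priority = [](const std::pair<std::string, uint16_t> &x,
                        const std::pair<std::string, uint16_t> &y) {
    return x.second < y.second;
  };
  // Constructors run in ascending priority, destructors in descending.
  std::stable_sort(lists.ctors.begin(), lists.ctors.end(), by_priority);
  std::stable_sort(lists.dtors.begin(), lists.dtors.end(), by_priority);
  std::reverse(lists.dtors.begin(), lists.dtors.end());
  return lists;
}
```

Given made-up names such as `__init_array_object_b_200` and `__init_array_object_a_101`, `collect` places the priority-101 constructor first, while destructors come out highest priority first.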

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jhuber6 created this revision. Apr 29 2023, 12:38 PM

Herald added projects: Restricted Project, Restricted Project. Apr 29 2023, 12:38 PM

Herald added subscribers: libc-commits, mattd, gchakrabarti and 4 others.

jhuber6 requested review of this revision. Apr 29 2023, 12:38 PM

Harbormaster completed remote builds in B229060: Diff 518227. Apr 29 2023, 1:18 PM

jhuber6 added a child revision: D149532: [libc] Enable running libc unit tests on NVPTX. Apr 29 2023, 2:19 PM

jhuber6 added a child revision: D149581: [libc] Change GPU startup and loader to use multiple kernels. May 1 2023, 6:09 AM

tra added a subscriber: MaskRay. May 1 2023, 11:11 AM

tra added inline comments.

libc/startup/gpu/nvptx/start.cpp
23	The comment is somewhat puzzling. The sections themselves would be created by whatever generates the object files, before the linker gets involved. IIRC from our exchange on discourse, the actual problem was that nvlink discards the sections it's not familiar with, and that's why we can't just put the initializers into known init/fini sections and instead have to place them among regular data and use explicit symbols to find them.
libc/test/IntegrationTest/test.cpp
79	What exactly is the 'toolchain' in this context?
79	What exactly needs this symbol? I'm surprised we need to care about DSOs on NVPTX as we do not have any there. Googling around (https://stackoverflow.com/questions/34308720/where-is-dso-handle-defined) suggests that we may avoid the issue by compiling with `-fno-use-cxa-atexit`. @MaskRay -- any suggestions on what's the right way to deal with this?
libc/utils/gpu/loader/nvptx/CMakeLists.txt
12–13	Can you check how long the clean build of the tool with `-j 6` would take now? If it's in the ballpark of a minute or so, we can probably live with that. Otherwise we should build the tool along with clang/LLVM, similar to how we deal with `libc-hdrgen`.

jhuber6 added inline comments. May 1 2023, 11:23 AM

libc/startup/gpu/nvptx/start.cpp
23	Sorry, I meant to say "symbols" here. Normally when the linker finds a `.fini_array` or `.init_array` section it will provide these symbol names to let you traverse the section. This is the behavior in `ld.lld`, which is what AMD uses. Nvidia does not provide a way to put these in the `.init_array` section, nor does the linker create these symbols if you were to force them to exist. The latter could potentially be solved by reinventing `nvlink` in `lld`; the former is a more difficult problem. Maybe there's a way to hack around this in the PTX Compiler API.
libc/test/IntegrationTest/test.cpp
79	I mean this as when going through a compilation targeting `nvptx64` e.g. `clang++ test.cpp --target=nvptx64-nvidia-cuda -march=sm_70 -c`
79	That might work for these hermetic tests since we should provide the base `atexit`. @sivachandra do you think we could use this? It seems to be supported on both Clang and GCC.
libc/utils/gpu/loader/nvptx/CMakeLists.txt
12–13	Does this assume we need to rebuild the libraries? I figured these would be copied somewhere in the build environment. On my machine rebuilding the loader takes about five seconds.
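
For context on the symbols discussed in this thread: with a linker that provides them, startup code only needs the start and end pointers to walk the table. The sketch below imitates that with an ordinary array. `table`, `init_array_start`, and `init_array_end` are stand-ins defined by hand for illustration (the real symbols would come from the linker or, in this patch, be filled in by the loader), and the entries are typed function pointers rather than the `uintptr_t` values used in `start.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using InitCallback = void(int, char **, char **);

// Records the order in which the callbacks ran.
static std::vector<int> trace;

static void first(int, char **, char **) { trace.push_back(1); }
static void second(int, char **, char **) { trace.push_back(2); }

// Stand-in for the contents of an '.init_array' section.
static InitCallback *table[] = {&first, &second};

// Stand-ins for the linker-provided start and end symbols.
static InitCallback **init_array_start = table;
static InitCallback **init_array_end = table + 2;

// Walks the half-open range [start, end) and calls every entry in order.
void call_init_array(int argc, char **argv, char **envp) {
  std::size_t count = static_cast<std::size_t>(init_array_end - init_array_start);
  for (std::size_t i = 0; i < count; ++i)
    init_array_start[i](argc, argv, envp);
}
```

When neither the section nor the symbols exist, as described for `nvlink`, both the table and the two pointers have to be produced some other way, which is what the loader changes in this revision do.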

sivachandra added inline comments. May 1 2023, 11:37 AM

libc/test/IntegrationTest/test.cpp
79	You can choose to build for the GPUs with `-fno-use-cxa-atexit`.
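
As background for the `__dso_handle` question, here is a toy model of the two registration styles. It is only a sketch under stated assumptions: `toy_cxa_atexit`, `toy_atexit`, `toy_dso_handle`, and the `Global` type are hypothetical stand-ins, and the code a compiler actually emits will differ. The point is that the `__cxa_atexit` style passes a DSO handle along with the destructor and object, while the plain `atexit` style registers a zero-argument stub and needs no handle.

```cpp
#include <cassert>
#include <vector>

// Toy exit-handler list standing in for the C library's internal registry.
struct Handler {
  void (*with_arg)(void *); // Set for the __cxa_atexit-style entry.
  void *arg;
  void (*plain)();          // Set for the atexit-style entry.
};
static std::vector<Handler> handlers;

// Stand-in for the '__dso_handle' object referenced by the default scheme.
static char toy_dso_handle;

// Same shape as the Itanium C++ ABI's __cxa_atexit(fn, arg, dso_handle).
int toy_cxa_atexit(void (*fn)(void *), void *arg, void *dso) {
  (void)dso; // Only meaningful when shared objects can be unloaded.
  handlers.push_back({fn, arg, nullptr});
  return 0;
}

// Same shape as the C standard's atexit(fn).
int toy_atexit(void (*fn)()) {
  handlers.push_back({nullptr, nullptr, fn});
  return 0;
}

static int destroyed = 0;
struct Global {
  int *counter;
};
static Global global = {&destroyed};

static void destroy_global(void *object) {
  ++*static_cast<Global *>(object)->counter;
}
static void destroy_global_stub() { destroy_global(&global); }

// Default style: destructor, object, and a DSO handle.
void register_with_cxa_atexit() {
  toy_cxa_atexit(&destroy_global, &global, &toy_dso_handle);
}

// '-fno-use-cxa-atexit' style: a zero-argument stub, no DSO handle.
void register_with_atexit() { toy_atexit(&destroy_global_stub); }

// Runs the handlers in reverse registration order and reports the count.
int run_handlers() {
  while (!handlers.empty()) {
    Handler handler = handlers.back();
    handlers.pop_back();
    if (handler.with_arg)
      handler.with_arg(handler.arg);
    else
      handler.plain();
  }
  return destroyed;
}
```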

tra added inline comments. May 1 2023, 11:43 AM

libc/utils/gpu/loader/nvptx/CMakeLists.txt
12–13	We only copy the clang installation directory to the GPU machines, so when you build the tests, the build directory itself will be empty and the libraries would have to be rebuilt. "On my machine rebuilding the loader takes about five seconds." Is that a clean build? Can you check what `ninja clean; ninja -nv nvptx_loader | wc -l` shows?

jhuber6 added inline comments. May 1 2023, 11:55 AM

libc/utils/gpu/loader/nvptx/CMakeLists.txt
12–13	When done in the `runtimes/runtimes-bins` directory. [1/1] Cleaning all built files... Cleaning... 509 files. 3

Changing NVPTX to use -fno-use-cxa-atexit.

Harbormaster completed remote builds in B229289: Diff 518521. May 1 2023, 12:16 PM

tra added inline comments. May 1 2023, 12:59 PM

libc/utils/gpu/loader/nvptx/CMakeLists.txt
12–13	Rebuilding it on a cloud machine with 6 cores may take too long. An un-ccached clean rebuild of LLVMSupport and LLVMObject on 6 real cores took ~2.5 minutes. On the build bots it will probably take ~2x that much (cloud counts hyperthreads as cores, IIUC, and GPU bots have 6 of them, so only 3 physical cores), which would almost double the wall time of each test run, which currently is between 5-8 minutes. We could live with it short-term, but I think we do need to move nvptx_loader into the main clang build and allow libc tests to be configured to use it as an externally provided tool.

jhuber6 added inline comments. May 1 2023, 1:02 PM

libc/utils/gpu/loader/nvptx/CMakeLists.txt
12–13	Yeah, I can definitely do that in a follow-up patch. I think I might've needed to do this anyway, because in the future we're going to want to export an RPC server library that OpenMP can use to implement things like `printf` in `libomptarget`. So to do that we'd need to build this first during the projects build.

Glad you could at least hack this. If NVIDIA ever fixed their tools we can get rid of this stuff.

@tra, you think this is good to go with a follow up to avoid compile time increase and provide more flexibility?

I'm OK with fixing launcher build in the followup.

This revision is now accepted and ready to land. May 1 2023, 1:29 PM

This revision was landed with ongoing or failed builds. May 4 2023, 5:13 AM

Closed by commit rG2e1c0ec62979: [libc] Support global constructors and destructors on NVPTX (authored by jhuber6).

This revision was automatically updated to reflect the committed changes.

jhuber6 added a commit: rG2e1c0ec62979: [libc] Support global constructors and destructors on NVPTX.

Revision Contents

Path                                                          Size
libc/cmake/modules/LLVMLibCTestRules.cmake                    6 lines
libc/startup/gpu/nvptx/CMakeLists.txt                         2 lines
libc/startup/gpu/nvptx/start.cpp                              78 lines
libc/test/IntegrationTest/test.cpp                            4 lines
libc/test/integration/startup/gpu/CMakeLists.txt              15 lines
libc/test/integration/startup/gpu/init_fini_array_test.cpp    2 lines
libc/utils/gpu/loader/CMakeLists.txt                          4 lines
libc/utils/gpu/loader/nvptx/CMakeLists.txt                    6 lines
libc/utils/gpu/loader/nvptx/Loader.cpp                        128 lines

Diff 519449

libc/cmake/modules/LLVMLibCTestRules.cmake

[491 lines not shown]
PRIVATE
${LIBC_BUILD_DIR}		${LIBC_BUILD_DIR}
${LIBC_BUILD_DIR}/include		${LIBC_BUILD_DIR}/include
)		)
target_compile_options(${fq_build_target_name}		target_compile_options(${fq_build_target_name}
PRIVATE -fpie -ffreestanding ${INTEGRATION_TEST_COMPILE_OPTIONS})		PRIVATE -fpie -ffreestanding ${INTEGRATION_TEST_COMPILE_OPTIONS})
# The GPU build requires overriding the default CMake triple and architecture.		# The GPU build requires overriding the default CMake triple and architecture.
if(LIBC_GPU_TARGET_ARCHITECTURE_IS_AMDGPU)		if(LIBC_GPU_TARGET_ARCHITECTURE_IS_AMDGPU)
target_compile_options(${fq_build_target_name} PRIVATE		target_compile_options(${fq_build_target_name} PRIVATE
-mcpu=${LIBC_GPU_TARGET_ARCHITECTURE} -flto		-mcpu=${LIBC_GPU_TARGET_ARCHITECTURE}
--target=${LIBC_GPU_TARGET_TRIPLE})		-flto --target=${LIBC_GPU_TARGET_TRIPLE})
elseif(LIBC_GPU_TARGET_ARCHITECTURE_IS_NVPTX)		elseif(LIBC_GPU_TARGET_ARCHITECTURE_IS_NVPTX)
get_nvptx_compile_options(nvptx_options ${LIBC_GPU_TARGET_ARCHITECTURE})		get_nvptx_compile_options(nvptx_options ${LIBC_GPU_TARGET_ARCHITECTURE})
target_compile_options(${fq_build_target_name} PRIVATE		target_compile_options(${fq_build_target_name} PRIVATE
${nvptx_options}		${nvptx_options} -fno-use-cxa-atexit
--target=${LIBC_GPU_TARGET_TRIPLE})		--target=${LIBC_GPU_TARGET_TRIPLE})
endif()		endif()

target_link_options(${fq_build_target_name} PRIVATE -nostdlib -static)		target_link_options(${fq_build_target_name} PRIVATE -nostdlib -static)
target_link_libraries(		target_link_libraries(
${fq_build_target_name}		${fq_build_target_name}
# The NVIDIA 'nvlink' linker does not currently support static libraries.		# The NVIDIA 'nvlink' linker does not currently support static libraries.
$<$<NOT:$<BOOL:${LIBC_GPU_TARGET_ARCHITECTURE_IS_NVPTX}>>:${fq_target_name}.__libc__>		$<$<NOT:$<BOOL:${LIBC_GPU_TARGET_ARCHITECTURE_IS_NVPTX}>>:${fq_target_name}.__libc__>
[200 lines not shown]

libc/startup/gpu/nvptx/CMakeLists.txt

	get_nvptx_compile_options(nvptx_options ${LIBC_GPU_TARGET_ARCHITECTURE})			get_nvptx_compile_options(nvptx_options ${LIBC_GPU_TARGET_ARCHITECTURE})
	add_startup_object(			add_startup_object(
	crt1			crt1
	SRC			SRC
	start.cpp			start.cpp
	DEPENDS			DEPENDS
	libc.src.__support.RPC.rpc_client			libc.src.__support.RPC.rpc_client
	libc.src.__support.GPU.utils			libc.src.__support.GPU.utils
				libc.src.stdlib.exit
				libc.src.stdlib.atexit
	COMPILE_OPTIONS			COMPILE_OPTIONS
	-ffreestanding # To avoid compiler warnings about calling the main function.			-ffreestanding # To avoid compiler warnings about calling the main function.
	-fno-builtin			-fno-builtin
	-nogpulib # Do not include any GPU vendor libraries.			-nogpulib # Do not include any GPU vendor libraries.
	--target=${LIBC_GPU_TARGET_TRIPLE}			--target=${LIBC_GPU_TARGET_TRIPLE}
	${nvptx_options}			${nvptx_options}
	NO_GPU_BUNDLE # Compile this file directly without special GPU handling.			NO_GPU_BUNDLE # Compile this file directly without special GPU handling.
	)			)
	[9 lines not shown]

libc/startup/gpu/nvptx/start.cpp

	//===-- Implementation of crt for nvptx -----------------------------------===//			//===-- Implementation of crt for nvptx -----------------------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "src/__support/GPU/utils.h"			#include "src/__support/GPU/utils.h"
	#include "src/__support/RPC/rpc_client.h"			#include "src/__support/RPC/rpc_client.h"
				#include "src/stdlib/atexit.h"
				#include "src/stdlib/exit.h"

	extern "C" int main(int argc, char **argv, char **envp);			extern "C" int main(int argc, char **argv, char **envp);

	namespace __llvm_libc {			namespace __llvm_libc {

	static cpp::Atomic<uint32_t> lock = 0;			static cpp::Atomic<uint32_t> lock = 0;

	static cpp::Atomic<uint32_t> init = 0;			static cpp::Atomic<uint32_t> count = 0;

	void init_rpc(void *in, void *out, void *buffer) {			extern "C" {
	// Only a single thread should update the RPC data.			// Nvidia's 'nvlink' linker does not provide these symbols. We instead need
				// to manually create them and update the globals in the loader implementation.
				uintptr_t *__init_array_start [[gnu::visibility("protected")]];
				uintptr_t *__init_array_end [[gnu::visibility("protected")]];
				uintptr_t *__fini_array_start [[gnu::visibility("protected")]];
				uintptr_t *__fini_array_end [[gnu::visibility("protected")]];
				}

				using InitCallback = void(int, char **, char **);
				using FiniCallback = void(void);

				static uint64_t get_grid_size() {
				return gpu::get_num_threads() * gpu::get_num_blocks();
				}

				static void call_init_array_callbacks(int argc, char **argv, char **env) {
				size_t init_array_size = __init_array_end - __init_array_start;
				for (size_t i = 0; i < init_array_size; ++i)
				reinterpret_cast<InitCallback *>(__init_array_start[i])(argc, argv, env);
				}

				static void call_fini_array_callbacks() {
				size_t fini_array_size = __fini_array_end - __fini_array_start;
				for (size_t i = 0; i < fini_array_size; ++i)
				reinterpret_cast<FiniCallback *>(__fini_array_start[i])();
				}

				// TODO: Put this in a separate kernel and call it with one thread.
				void initialize(int argc, char **argv, char **env, void *in, void *out,
				void *buffer) {
				// We need a single GPU thread to perform the initialization of the global
				// constructors and data. We simply mask off all but a single thread and
				// execute.
				count.fetch_add(1, cpp::MemoryOrder::RELAXED);
	if (gpu::get_thread_id() == 0 && gpu::get_block_id() == 0) {			if (gpu::get_thread_id() == 0 && gpu::get_block_id() == 0) {
				// We need to set up the RPC client first in case any of the constructors
				// require it.
	rpc::client.reset(&lock, in, out, buffer);			rpc::client.reset(&lock, in, out, buffer);
	init.store(1, cpp::MemoryOrder::RELAXED);
				// We want the fini array callbacks to be run after other atexit
				// callbacks are run. So, we register them before running the init
				// array callbacks as they can potentially register their own atexit
				// callbacks.
				// FIXME: The function pointer escaping this TU causes warnings.
				__llvm_libc::atexit(&call_fini_array_callbacks);
				call_init_array_callbacks(argc, argv, env);
	}			}

	// Wait until the previous thread signals that the data has been written.			// We wait until every single thread launched on the GPU has seen the
	while (!init.load(cpp::MemoryOrder::RELAXED))			// initialization code. This will get very, very slow for high thread counts,
				// but for testing purposes it is unlikely to matter.
				while (count.load(cpp::MemoryOrder::RELAXED) != get_grid_size())
	rpc::sleep_briefly();			rpc::sleep_briefly();
				gpu::sync_threads();
				}

	// Wait for the threads in the block to converge and fence the write.			// TODO: Put this in a separate kernel and call it with one thread.
				void finalize(int retval) {
				// We wait until every single thread launched on the GPU has finished
				// executing and reached the finalize region.
				count.fetch_sub(1, cpp::MemoryOrder::RELAXED);
				while (count.load(cpp::MemoryOrder::RELAXED) != 0)
				rpc::sleep_briefly();
	gpu::sync_threads();			gpu::sync_threads();
				if (gpu::get_thread_id() == 0 && gpu::get_block_id() == 0) {
				// Only a single thread should call `exit` here, the rest should gracefully
				// return from the kernel. This is so only one thread calls the destructors
				// registered with 'atexit' above.
				__llvm_libc::exit(retval);
				}
	}			}

	} // namespace __llvm_libc			} // namespace __llvm_libc

	extern "C" [[gnu::visibility("protected"), clang::nvptx_kernel]] void			extern "C" [[gnu::visibility("protected"), clang::nvptx_kernel]] void
	_start(int argc, char **argv, char **envp, int *ret, void *in, void *out,			_start(int argc, char **argv, char **envp, int *ret, void *in, void *out,
	void *buffer) {			void *buffer) {
	__llvm_libc::init_rpc(in, out, buffer);			__llvm_libc::initialize(argc, argv, envp, in, out, buffer);

	__atomic_fetch_or(ret, main(argc, argv, envp), __ATOMIC_RELAXED);			__atomic_fetch_or(ret, main(argc, argv, envp), __ATOMIC_RELAXED);

				__llvm_libc::finalize(*ret);
	}			}
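
The ordering argument in `initialize` (register the fini callbacks with `atexit` before running the init callbacks, so that they run after any handlers the constructors register) relies on exit handlers running in reverse order of registration. The host-side sketch below models that with a hand-rolled handler stack; `toy_atexit`, `toy_exit`, and the callback names are hypothetical and only imitate the behavior, they are not the libc implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of an exit-handler stack: last registered, first called.
static std::vector<void (*)()> exit_handlers;
static std::vector<std::string> log_lines;

static int toy_atexit(void (*fn)()) {
  exit_handlers.push_back(fn);
  return 0;
}

static void toy_exit() {
  while (!exit_handlers.empty()) {
    void (*fn)() = exit_handlers.back();
    exit_handlers.pop_back();
    fn();
  }
}

// Stand-in for call_fini_array_callbacks().
static void fini_callbacks() { log_lines.push_back("fini_array"); }

// A handler that a global constructor might register for itself.
static void user_cleanup() { log_lines.push_back("ctor-registered atexit"); }

// Stand-in for call_init_array_callbacks(); one constructor calls atexit.
static void init_callbacks() {
  log_lines.push_back("init_array");
  toy_atexit(&user_cleanup);
}

// Mirrors the order used in the patch: register fini first, then run init.
std::vector<std::string> run_startup_sequence() {
  log_lines.clear();
  toy_atexit(&fini_callbacks);
  init_callbacks();
  log_lines.push_back("main");
  toy_exit();
  return log_lines;
}
```

In this model the handler registered by the constructor runs before the fini callbacks, because the fini callbacks sit at the bottom of the stack.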

libc/test/IntegrationTest/test.cpp

	[16 lines not shown]
	namespace __llvm_libc {			namespace __llvm_libc {

	int bcmp(const void *lhs, const void *rhs, size_t count);			int bcmp(const void *lhs, const void *rhs, size_t count);
	void bzero(void *ptr, size_t count);			void bzero(void *ptr, size_t count);
	int memcmp(const void *lhs, const void *rhs, size_t count);			int memcmp(const void *lhs, const void *rhs, size_t count);
	void *memcpy(void *__restrict, const void *__restrict, size_t);			void *memcpy(void *__restrict, const void *__restrict, size_t);
	void *memmove(void *dst, const void *src, size_t count);			void *memmove(void *dst, const void *src, size_t count);
	void *memset(void *ptr, int value, size_t count);			void *memset(void *ptr, int value, size_t count);
				int atexit(void (*func)(void));

	} // namespace __llvm_libc			} // namespace __llvm_libc

	extern "C" {			extern "C" {

	int bcmp(const void *lhs, const void *rhs, size_t count) {			int bcmp(const void *lhs, const void *rhs, size_t count) {
	return __llvm_libc::bcmp(lhs, rhs, count);			return __llvm_libc::bcmp(lhs, rhs, count);
	}			}
	void bzero(void *ptr, size_t count) { __llvm_libc::bzero(ptr, count); }			void bzero(void *ptr, size_t count) { __llvm_libc::bzero(ptr, count); }
	int memcmp(const void *lhs, const void *rhs, size_t count) {			int memcmp(const void *lhs, const void *rhs, size_t count) {
	return __llvm_libc::memcmp(lhs, rhs, count);			return __llvm_libc::memcmp(lhs, rhs, count);
	}			}
	void *memcpy(void *__restrict dst, const void *__restrict src, size_t count) {			void *memcpy(void *__restrict dst, const void *__restrict src, size_t count) {
	return __llvm_libc::memcpy(dst, src, count);			return __llvm_libc::memcpy(dst, src, count);
	}			}
	void *memmove(void *dst, const void *src, size_t count) {			void *memmove(void *dst, const void *src, size_t count) {
	return __llvm_libc::memmove(dst, src, count);			return __llvm_libc::memmove(dst, src, count);
	}			}
	void *memset(void *ptr, int value, size_t count) {			void *memset(void *ptr, int value, size_t count) {
	return __llvm_libc::memset(ptr, value, count);			return __llvm_libc::memset(ptr, value, count);
	}			}

				// This is needed if the test was compiled with '-fno-use-cxa-atexit'.
				int atexit(void (*func)(void)) { return __llvm_libc::atexit(func); }

	} // extern "C"			} // extern "C"

	// Integration tests cannot use the SCUDO standalone allocator as SCUDO pulls			// Integration tests cannot use the SCUDO standalone allocator as SCUDO pulls
	// various other parts of the libc. Since SCUDO development does not use			// various other parts of the libc. Since SCUDO development does not use
	// LLVM libc build rules, it is very hard to keep track or pull all that SCUDO			// LLVM libc build rules, it is very hard to keep track or pull all that SCUDO
	// requires. Hence, as a work around for this problem, we use a simple allocator			// requires. Hence, as a work around for this problem, we use a simple allocator
	// which just hands out continuous blocks from a statically allocated chunk of			// which just hands out continuous blocks from a statically allocated chunk of
	// memory.			// memory.
	[12 lines not shown]
	void free(void *) {}			void free(void *) {}

	void *realloc(void *ptr, size_t s) {			void *realloc(void *ptr, size_t s) {
	free(ptr);			free(ptr);
	return malloc(s);			return malloc(s);
	}			}

	// Integration tests are linked with -nostdlib. BFD linker expects			// Integration tests are linked with -nostdlib. BFD linker expects
	// __dso_handle when -nostdlib is used.			// __dso_handle when -nostdlib is used.
	void *__dso_handle = nullptr;			void *__dso_handle = nullptr;
	} // extern "C"			} // extern "C"

libc/test/integration/startup/gpu/CMakeLists.txt

[20 lines not shown]
add_integration_test(
DEPENDS		DEPENDS
libc.src.__support.RPC.rpc_client		libc.src.__support.RPC.rpc_client
libc.src.__support.GPU.utils		libc.src.__support.GPU.utils
LOADER_ARGS		LOADER_ARGS
--blocks 16		--blocks 16
--threads 1		--threads 1
)		)

# Constructors are currently only supported on AMDGPU.
if(LIBC_GPU_TARGET_ARCHITECTURE_IS_AMDGPU)
add_integration_test(		add_integration_test(
init_fini_array_test		init_fini_array_test
SUITE libc-startup-tests		SUITE libc-startup-tests
SRCS		SRCS
init_fini_array_test.cpp		init_fini_array_test.cpp
)		)
endif()

libc/test/integration/startup/gpu/init_fini_array_test.cpp

	[47 lines not shown]
	__attribute__((constructor)) void set_initval() {			__attribute__((constructor)) void set_initval() {
	initval = INITVAL_INITIALIZER;			initval = INITVAL_INITIALIZER;
	}			}
	__attribute__((destructor(1))) void reset_initval() {			__attribute__((destructor(1))) void reset_initval() {
	ASSERT_TRUE(global_destroyed);			ASSERT_TRUE(global_destroyed);
	initval = 0;			initval = 0;
	}			}

	TEST_MAIN() {			TEST_MAIN(int argc, char **argv, char **env) {
	ASSERT_EQ(global.get(GLOBAL_INDEX), INITVAL_INITIALIZER);			ASSERT_EQ(global.get(GLOBAL_INDEX), INITVAL_INITIALIZER);
	ASSERT_EQ(initval, INITVAL_INITIALIZER);			ASSERT_EQ(initval, INITVAL_INITIALIZER);
	return 0;			return 0;
	}			}

libc/utils/gpu/loader/CMakeLists.txt

	add_library(gpu_loader OBJECT Main.cpp)			add_library(gpu_loader OBJECT Main.cpp)
	target_include_directories(gpu_loader PUBLIC			target_include_directories(gpu_loader PUBLIC
	${CMAKE_CURRENT_SOURCE_DIR}			${CMAKE_CURRENT_SOURCE_DIR}
	${LIBC_SOURCE_DIR}			${LIBC_SOURCE_DIR}
	)			)

	find_package(hsa-runtime64 QUIET 1.2.0 HINTS ${CMAKE_INSTALL_PREFIX} PATHS /opt/rocm)			find_package(hsa-runtime64 QUIET 1.2.0 HINTS ${CMAKE_INSTALL_PREFIX} PATHS /opt/rocm)
	if(hsa-runtime64_FOUND)			if(hsa-runtime64_FOUND)
	add_subdirectory(amdgpu)			add_subdirectory(amdgpu)
	else()			else()
	message(STATUS "Skipping HSA loader for gpu target, no HSA was detected")			message(STATUS "Skipping HSA loader for gpu target, no HSA was detected")
	endif()			endif()

	find_package(CUDAToolkit QUIET)			find_package(CUDAToolkit QUIET)
	if(CUDAToolkit_FOUND)			# The CUDA loader requires LLVM to traverse the ELF image for symbols.
				find_package(LLVM QUIET)
				if(CUDAToolkit_FOUND AND LLVM_FOUND)
	add_subdirectory(nvptx)			add_subdirectory(nvptx)
	else()			else()
	message(STATUS "Skipping CUDA loader for gpu target, no CUDA was detected")			message(STATUS "Skipping CUDA loader for gpu target, no CUDA was detected")
	endif()			endif()

	# Add a custom target to be used for testing.			# Add a custom target to be used for testing.
	if(TARGET amdhsa_loader AND LIBC_GPU_TARGET_ARCHITECTURE_IS_AMDGPU)			if(TARGET amdhsa_loader AND LIBC_GPU_TARGET_ARCHITECTURE_IS_AMDGPU)
	add_custom_target(libc.utils.gpu.loader)			add_custom_target(libc.utils.gpu.loader)
	[15 lines not shown]

libc/utils/gpu/loader/nvptx/CMakeLists.txt

	add_executable(nvptx_loader Loader.cpp)			add_executable(nvptx_loader Loader.cpp)
	add_dependencies(nvptx_loader libc.src.__support.RPC.rpc)			add_dependencies(nvptx_loader libc.src.__support.RPC.rpc)

				if(NOT LLVM_ENABLE_RTTI)
				target_compile_options(nvptx_loader PRIVATE -fno-rtti)
				endif()
				target_include_directories(nvptx_loader PRIVATE ${LLVM_INCLUDE_DIRS})
	target_link_libraries(nvptx_loader			target_link_libraries(nvptx_loader
	PRIVATE			PRIVATE
	gpu_loader			gpu_loader
	CUDA::cuda_driver			CUDA::cuda_driver
				LLVMObject
				LLVMSupport
	)			)

libc/utils/gpu/loader/nvptx/Loader.cpp

[11 lines not shown]
// function.		// function.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "Loader.h"		#include "Loader.h"
#include "Server.h"		#include "Server.h"

#include "cuda.h"		#include "cuda.h"

		#include "llvm/Object/ELF.h"
		#include "llvm/Object/ELFObjectFile.h"

#include <cstddef>		#include <cstddef>
#include <cstdio>		#include <cstdio>
#include <cstdlib>		#include <cstdlib>
#include <cstring>		#include <cstring>
		#include <vector>

		using namespace llvm;
		using namespace object;

/// The arguments to the '_start' kernel.		/// The arguments to the '_start' kernel.
struct kernel_args_t {		struct kernel_args_t {
int argc;		int argc;
void *argv;		void *argv;
void *envp;		void *envp;
void *ret;		void *ret;
void *inbox;		void *inbox;
[14 lines not shown]
static void handle_error(CUresult err) {
exit(1);		exit(1);
}		}

static void handle_error(const char *msg) {		static void handle_error(const char *msg) {
fprintf(stderr, "%s\n", msg);		fprintf(stderr, "%s\n", msg);
exit(EXIT_FAILURE);		exit(EXIT_FAILURE);
}		}

		// Gets the names of all the globals that contain functions to initialize or
		// deinitialize. We need to do this manually because the NVPTX toolchain does
		// not contain the necessary binary manipulation tools.
		template <typename Alloc>
		Expected<void *> get_ctor_dtor_array(const void *image, const size_t size,
		Alloc allocator, CUmodule binary) {
		auto mem_buffer = MemoryBuffer::getMemBuffer(
		StringRef(reinterpret_cast<const char *>(image), size), "image",
		/*RequiresNullTerminator=*/false);
		Expected<ELF64LEObjectFile> elf_or_err =
		ELF64LEObjectFile::create(*mem_buffer);
		if (!elf_or_err)
		handle_error(toString(elf_or_err.takeError()).c_str());

		std::vector<std::pair<const char *, uint16_t>> ctors;
		std::vector<std::pair<const char *, uint16_t>> dtors;
		// CUDA has no way to iterate over all the symbols so we need to inspect the
		// ELF directly using the LLVM libraries.
		for (const auto &symbol : elf_or_err->symbols()) {
		auto name_or_err = symbol.getName();
		if (!name_or_err)
		handle_error(toString(name_or_err.takeError()).c_str());

		// Search for all symbols that contain a constructor or destructor.
		if (!name_or_err->starts_with("__init_array_object_") &&
		!name_or_err->starts_with("__fini_array_object_"))
		continue;

		uint16_t priority;
		if (name_or_err->rsplit('_').second.getAsInteger(10, priority))
		handle_error("Invalid priority for constructor or destructor");

		if (name_or_err->starts_with("__init"))
		ctors.emplace_back(std::make_pair(name_or_err->data(), priority));
		else
		dtors.emplace_back(std::make_pair(name_or_err->data(), priority));
		}
		// Lower priority constructors are run before higher ones. The reverse is true
		// for destructors.
		llvm::sort(ctors, [](auto x, auto y) { return x.second < y.second; });
		llvm::sort(dtors, [](auto x, auto y) { return x.second < y.second; });
		llvm::reverse(dtors);

		// Allocate host pinned memory to make these arrays visible to the GPU.
		CUdeviceptr *dev_memory = reinterpret_cast<CUdeviceptr *>(allocator(
		ctors.size() * sizeof(CUdeviceptr) + dtors.size() * sizeof(CUdeviceptr)));
		uint64_t global_size = 0;

		// Get the address of the global and then store the address of the constructor
		// function to call in the constructor array.
		CUdeviceptr *dev_ctors_start = dev_memory;
		CUdeviceptr *dev_ctors_end = dev_ctors_start + ctors.size();
		for (uint64_t i = 0; i < ctors.size(); ++i) {
		CUdeviceptr dev_ptr;
		if (CUresult err =
		cuModuleGetGlobal(&dev_ptr, &global_size, binary, ctors[i].first))
		handle_error(err);
		if (CUresult err =
		cuMemcpyDtoH(&dev_ctors_start[i], dev_ptr, sizeof(uintptr_t)))
		handle_error(err);
		}

		// Get the address of the global and then store the address of the destructor
		// function to call in the destructor array.
		CUdeviceptr *dev_dtors_start = dev_ctors_end;
		CUdeviceptr *dev_dtors_end = dev_dtors_start + dtors.size();
		for (uint64_t i = 0; i < dtors.size(); ++i) {
		CUdeviceptr dev_ptr;
		if (CUresult err =
		cuModuleGetGlobal(&dev_ptr, &global_size, binary, dtors[i].first))
		handle_error(err);
		if (CUresult err =
		cuMemcpyDtoH(&dev_dtors_start[i], dev_ptr, sizeof(uintptr_t)))
		handle_error(err);
		}

		// Obtain the address of the pointers the startup implementation uses to
		// iterate the constructors and destructors.
		CUdeviceptr init_start;
		if (CUresult err = cuModuleGetGlobal(&init_start, &global_size, binary,
		"__init_array_start"))
		handle_error(err);
		CUdeviceptr init_end;
		if (CUresult err = cuModuleGetGlobal(&init_end, &global_size, binary,
		"__init_array_end"))
		handle_error(err);
		CUdeviceptr fini_start;
		if (CUresult err = cuModuleGetGlobal(&fini_start, &global_size, binary,
		"__fini_array_start"))
		handle_error(err);
		CUdeviceptr fini_end;
		if (CUresult err = cuModuleGetGlobal(&fini_end, &global_size, binary,
		"__fini_array_end"))
		handle_error(err);

		// Copy the pointers to the newly written array to the symbols so the startup
		// implementation can iterate them.
		if (CUresult err =
		cuMemcpyHtoD(init_start, &dev_ctors_start, sizeof(uintptr_t)))
		handle_error(err);
		if (CUresult err = cuMemcpyHtoD(init_end, &dev_ctors_end, sizeof(uintptr_t)))
		handle_error(err);
		if (CUresult err =
		cuMemcpyHtoD(fini_start, &dev_dtors_start, sizeof(uintptr_t)))
		handle_error(err);
		if (CUresult err = cuMemcpyHtoD(fini_end, &dev_dtors_end, sizeof(uintptr_t)))
		handle_error(err);

		return dev_memory;
		}
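For readers following the logic in `get_ctor_dtor_array`, the symbol filtering and ordering can be tried on the host without CUDA or LLVM. The following is a minimal sketch in standard C++17, not the loader's code: `parse_priority`, `collect`, `Entry`, and `CtorDtorLists` are hypothetical names introduced for illustration. It assumes, as the patch does, that the priority is the decimal suffix after the last underscore and that it must fit in a `uint16_t`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: extract the trailing decimal priority from a symbol
// name. Returns std::nullopt when the suffix is missing, is not a decimal
// number, or does not fit in 16 bits.
std::optional<uint16_t> parse_priority(const std::string &name) {
  std::size_t pos = name.rfind('_');
  if (pos == std::string::npos || pos + 1 == name.size())
    return std::nullopt;
  uint32_t value = 0;
  for (std::size_t i = pos + 1; i < name.size(); ++i) {
    if (name[i] < '0' || name[i] > '9')
      return std::nullopt;
    value = value * 10 + static_cast<uint32_t>(name[i] - '0');
    if (value > UINT16_MAX)
      return std::nullopt;
  }
  return static_cast<uint16_t>(value);
}

using Entry = std::pair<std::string, uint16_t>;

struct CtorDtorLists {
  std::vector<Entry> ctors; // ascending priority
  std::vector<Entry> dtors; // descending priority
};

// Hypothetical helper: keep only the symbols carrying the init/fini prefixes
// used in the patch, then order them the way the loader does. Returns
// std::nullopt if a matching symbol has an invalid priority suffix.
std::optional<CtorDtorLists> collect(const std::vector<std::string> &symbols) {
  CtorDtorLists lists;
  for (const std::string &name : symbols) {
    bool is_ctor = name.rfind("__init_array_object_", 0) == 0;
    bool is_dtor = name.rfind("__fini_array_object_", 0) == 0;
    if (!is_ctor && !is_dtor)
      continue;
    std::optional<uint16_t> priority = parse_priority(name);
    if (!priority)
      return std::nullopt;
    (is_ctor ? lists.ctors : lists.dtors).emplace_back(name, *priority);
  }
  auto by_priority = [](const Entry &x, const Entry &y) {
    return x.second < y.second;
  };
  // Lower priorities run first for constructors; destructors are reversed.
  std::stable_sort(lists.ctors.begin(), lists.ctors.end(), by_priority);
  std::stable_sort(lists.dtors.begin(), lists.dtors.end(), by_priority);
  std::reverse(lists.dtors.begin(), lists.dtors.end());
  return lists;
}
```

Passing `collect` a list of symbol names returns the constructor entries in ascending priority and the destructor entries in descending priority, and ignores names without either prefix. The sketch uses `std::stable_sort` so that constructors with equal priority keep their discovery order; that is a choice made here, whereas the patch uses `llvm::sort`.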

int load(int argc, char **argv, char **envp, void *image, size_t size,		int load(int argc, char **argv, char **envp, void *image, size_t size,
const LaunchParameters &params) {		const LaunchParameters &params) {

if (CUresult err = cuInit(0))		if (CUresult err = cuInit(0))
handle_error(err);		handle_error(err);

// Obtain the first device found on the system.		// Obtain the first device found on the system.
CUdevice device;		CUdevice device;
if (CUresult err = cuDeviceGet(&device, 0))		if (CUresult err = cuDeviceGet(&device, 0))
handle_error(err);		handle_error(err);

// Initialize the CUDA context and claim it for this execution.		// Initialize the CUDA context and claim it for this execution.
CUcontext context;		CUcontext context;
if (CUresult err = cuDevicePrimaryCtxRetain(&context, device))		if (CUresult err = cuDevicePrimaryCtxRetain(&context, device))
Show All 19 Lines	int load(int argc, char **argv, char **envp, void *image, size_t size,
// Allocate pinned memory on the host to hold the pointer array for the		// Allocate pinned memory on the host to hold the pointer array for the
// copied argv and allow the GPU device to access it.		// copied argv and allow the GPU device to access it.
auto allocator = [&](uint64_t size) -> void * {		auto allocator = [&](uint64_t size) -> void * {
void *dev_ptr;		void *dev_ptr;
if (CUresult err = cuMemAllocHost(&dev_ptr, size))		if (CUresult err = cuMemAllocHost(&dev_ptr, size))
handle_error(err);		handle_error(err);
return dev_ptr;		return dev_ptr;
};		};

		auto memory_or_err = get_ctor_dtor_array(image, size, allocator, binary);
		if (!memory_or_err)
		handle_error(toString(memory_or_err.takeError()).c_str());

void *dev_argv = copy_argument_vector(argc, argv, allocator);		void *dev_argv = copy_argument_vector(argc, argv, allocator);
if (!dev_argv)		if (!dev_argv)
handle_error("Failed to allocate device argv");		handle_error("Failed to allocate device argv");

// Allocate pinned memory on the host to hold the pointer array for the		// Allocate pinned memory on the host to hold the pointer array for the
// copied environment array and allow the GPU device to access it.		// copied environment array and allow the GPU device to access it.
void *dev_envp = copy_environment(envp, allocator);		void *dev_envp = copy_environment(envp, allocator);
if (!dev_envp)		if (!dev_envp)
▲ Show 20 Lines • Show All 46 Lines • ▼ Show 20 Lines	int load(int argc, char **argv, char **envp, void *image, size_t size,
int host_ret = 0;		int host_ret = 0;
if (CUresult err = cuMemcpyDtoH(&host_ret, dev_ret, sizeof(int)))		if (CUresult err = cuMemcpyDtoH(&host_ret, dev_ret, sizeof(int)))
handle_error(err);		handle_error(err);

if (CUresult err = cuStreamSynchronize(stream))		if (CUresult err = cuStreamSynchronize(stream))
handle_error(err);		handle_error(err);

// Free the memory allocated for the device.		// Free the memory allocated for the device.
		if (CUresult err = cuMemFreeHost(*memory_or_err))
		handle_error(err);
if (CUresult err = cuMemFree(dev_ret))		if (CUresult err = cuMemFree(dev_ret))
handle_error(err);		handle_error(err);
if (CUresult err = cuMemFreeHost(dev_argv))		if (CUresult err = cuMemFreeHost(dev_argv))
handle_error(err);		handle_error(err);
if (CUresult err = cuMemFreeHost(server_inbox))		if (CUresult err = cuMemFreeHost(server_inbox))
handle_error(err);		handle_error(err);
if (CUresult err = cuMemFreeHost(server_outbox))		if (CUresult err = cuMemFreeHost(server_outbox))
handle_error(err);		handle_error(err);
Show All 10 Lines
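The array layout that `get_ctor_dtor_array` builds (constructors first, destructors immediately after, with four pointers marking the two half-open ranges) can be illustrated on the host. This is a rough sketch under stated assumptions rather than the startup code from this revision: `Callback`, `CallbackArrays`, `make_arrays`, and `run_callbacks` are hypothetical names, the callbacks take no arguments, and a `std::vector` stands in for the pinned allocation. It also assumes the consumer walks both ranges front to back, since the loader has already reversed the destructor order.

```cpp
#include <cstddef>
#include <vector>

// Assumed callback type for this sketch; real startup code may pass arguments.
using Callback = void (*)();

// One contiguous buffer holding constructors followed by destructors,
// mirroring the single allocation made by the loader.
struct CallbackArrays {
  std::vector<Callback> storage;
  std::size_t num_ctors = 0;

  // Bounds analogous to __init_array_start / __init_array_end.
  const Callback *init_start() const { return storage.data(); }
  const Callback *init_end() const { return storage.data() + num_ctors; }
  // Bounds analogous to __fini_array_start / __fini_array_end.
  const Callback *fini_start() const { return init_end(); }
  const Callback *fini_end() const { return storage.data() + storage.size(); }
};

// Pack already-sorted constructor and destructor lists into one buffer.
CallbackArrays make_arrays(const std::vector<Callback> &ctors,
                           const std::vector<Callback> &dtors) {
  CallbackArrays arrays;
  arrays.storage = ctors;
  arrays.storage.insert(arrays.storage.end(), dtors.begin(), dtors.end());
  arrays.num_ctors = ctors.size();
  return arrays;
}

// Invoke every callback in the half-open range [start, end) in order.
void run_callbacks(const Callback *start, const Callback *end) {
  for (const Callback *it = start; it != end; ++it)
    (*it)();
}
```

A caller would run `run_callbacks(arrays.init_start(), arrays.init_end())` before the program's entry point and `run_callbacks(arrays.fini_start(), arrays.fini_end())` after it returns. Because the end of the constructor range is also the start of the destructor range, one allocation and one free cover both arrays, which matches the single `cuMemFreeHost(*memory_or_err)` call in `load`.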

Revision Contents

Diff 519449
