This is an archive of the discontinued LLVM Phabricator instance.

Differential D149398

[libc] Add support for global ctors / dtors for AMDGPU
ClosedPublic

Authored by jhuber6 on Apr 27 2023, 6:25 PM.

Details

Reviewers

tra
jdoerfert
tianshilei1992
sivachandra
michaelrj
lntue
MaskRay
JonChesterfield

Commits

rG1b823abea74d: [libc] Add support for global ctors / dtors for AMDGPU

Summary

This patch makes the necessary changes to support calling global
constructors and destructors on the GPU. The patch in D149340 allows the
lld linker to create the symbols pointing us to these globals. These
should be executed by a single thread, which is more difficult on the
GPU because all threads are active. I chose to use an atomic counter to
sync every thread on the GPU. This is very slow if you use more than a
few thousand threads, but for testing purposes it should be sufficient.
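
The counting scheme described above can be sketched on the host with standard C++ threads. This is only an analogy, not the code from the patch: `run_barrier_demo`, `worker`, `arrived`, and `shared_value` are hypothetical names, `std::thread` stands in for GPU threads, and `num_threads` plays the role of the grid size. In this sketch thread 0 checks in only after it has finished its setup, so the counter reaching the total implies the setup is complete; that ordering is a choice made for the sketch.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical host-side stand-ins; none of these names come from the patch.
static std::atomic<uint32_t> arrived{0};
static int shared_value = 0; // Plays the role of a global set by a constructor.

// Each worker checks in on the counter and spins until everyone has arrived.
static int worker(uint32_t id, uint32_t num_threads) {
  if (id == 0)
    shared_value = 42; // One-time setup performed by a single thread.
  // Release on the increment publishes thread 0's write; the acquire load
  // below makes it visible to every thread that leaves the loop.
  arrived.fetch_add(1, std::memory_order_release);
  while (arrived.load(std::memory_order_acquire) != num_threads)
    std::this_thread::yield();
  return shared_value;
}

// Launches the workers and returns how many of them saw the initialized value.
int run_barrier_demo(uint32_t num_threads) {
  assert(num_threads > 0);
  arrived.store(0);
  shared_value = 0;
  std::vector<int> seen(num_threads, 0);
  std::vector<std::thread> threads;
  for (uint32_t i = 0; i < num_threads; ++i)
    threads.emplace_back(
        [&seen, i, num_threads] { seen[i] = worker(i, num_threads); });
  for (auto &t : threads)
    t.join();
  int ok = 0;
  for (int v : seen)
    ok += (v == 42);
  return ok;
}
```

Calling `run_barrier_demo(8)` should report that all eight workers observed the value written by worker 0. Every thread spins on one shared counter, which is why this approach scales poorly as the thread count grows.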

Depends on D149340 D149363

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jhuber6 created this revision. Apr 27 2023, 6:25 PM

Herald added projects: Restricted Project, Restricted Project. Apr 27 2023, 6:25 PM

Herald added subscribers: libc-commits, kosarev, ecnelises and 7 others.

jhuber6 requested review of this revision. Apr 27 2023, 6:25 PM

Herald added a subscriber: wdng. Apr 27 2023, 6:25 PM

Harbormaster completed remote builds in B228710: Diff 517765. Apr 27 2023, 6:25 PM

jhuber6 mentioned this in D149451: [NVPTX] Add NVPTXCtorDtorLoweringPass to handle global ctors / dtors. Apr 28 2023, 8:13 AM

Remove unused dependency.

Harbormaster completed remote builds in B228915: Diff 518045. Apr 28 2023, 2:10 PM

sivachandra added inline comments. Apr 28 2023, 10:28 PM

libc/startup/gpu/amdgpu/start.cpp
61	Nit: Explicit namespace scoping to make it clear to the reader which `atexit` is being called: `__llvm_libc::atexit`.
67	Can this be avoided at all? As in, if there are globals that have to be initialized on the GPU, then all threads have to wait until they can start using those globals?

sivachandra accepted this revision. Apr 28 2023, 10:28 PM

This revision is now accepted and ready to land. Apr 28 2023, 10:28 PM

Counting on the global looks like a DIY barrier, which is ok, but I can't see anything that stops reordering of operations past the initialisation code run on thread zero.

libc/startup/gpu/amdgpu/start.cpp
71	This is missing a fence. Noth

This revision now requires changes to proceed. Apr 28 2023, 11:49 PM

In D149398#4307296, @JonChesterfield wrote:

Counting on the global looks like a DIY barrier, which is ok, but I can't see anything that stops reordering of operations past the initialisation code run on thread zero.

Obviously we need the DIY barrier because there's no built-in functionality to globally sync on the device. Once the globals have been initialized I simply assume that they'll call main in an orderly fashion and then we wait again at the barrier before finishing.

libc/startup/gpu/amdgpu/start.cpp
67	Generally I just assume it's unsafe to have any GPU threads calling `main` before we've run all the global constructors. We could reduce this to a regular sync if we placed every global object in thread shared memory, however. Then this would be a simple `gpu::sync_threads`. But that would require modifying the source to put `[[clang::address_space(3)]]` around everything, and shared memory is a pretty scarce resource.
71	The implementation of `sync_threads` has a fence.
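
The reordering concern in this exchange can be illustrated with a small host-side sketch in standard C++. It is an analogy only and says nothing about what `gpu::sync_threads` does on the device: `producer`, `consumer`, `ready`, and `payload` are hypothetical names. The point it shows is that a relaxed atomic flag does not by itself order a plain write before the flag becomes visible; a release fence on the writer paired with an acquire fence on the reader does.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration; not taken from the patch.
static std::atomic<int> ready{0};
static int payload = 0; // Plain, non-atomic data written by the initializer.

static void producer() {
  payload = 0x600D;
  // Release fence: the payload write may not be reordered past the flag store.
  std::atomic_thread_fence(std::memory_order_release);
  ready.store(1, std::memory_order_relaxed);
}

static int consumer() {
  while (ready.load(std::memory_order_relaxed) == 0)
    std::this_thread::yield();
  // Acquire fence: pairs with the release fence, so the read below is
  // ordered after the producer's write to payload.
  std::atomic_thread_fence(std::memory_order_acquire);
  return payload;
}

// Runs one producer and one consumer and returns what the consumer read.
int run_fence_demo() {
  ready.store(0);
  payload = 0;
  int result = 0;
  std::thread c([&result] { result = consumer(); });
  std::thread p(producer);
  p.join();
  c.join();
  assert(ready.load() == 1);
  return result;
}
```

Without the two fences, the relaxed flag alone would leave the read of `payload` a data race under the C++ memory model.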

JonChesterfield resigned from this revision. Apr 29 2023, 6:27 AM

This revision is now accepted and ready to land. Apr 29 2023, 6:27 AM

Closed by commit rG1b823abea74d: [libc] Add support for global ctors / dtors for AMDGPU (authored by jhuber6). Apr 29 2023, 6:40 AM

This revision was automatically updated to reflect the committed changes.

jhuber6 added a commit: rG1b823abea74d: [libc] Add support for global ctors / dtors for AMDGPU.

jhuber6 added inline comments. Apr 29 2023, 7:14 AM

libc/startup/gpu/amdgpu/start.cpp
67	Another solution here is to have a separate kernel that we call to do the initialization and then we call `main`. Generally it thwarts a few optimizations to have global state shared between kernel calls, but I don't think we care about that here. I'll make a patch to do that instead sometime in the future rather than having a weird global barrier the hardware doesn't support. That should allow us to run tests on a fully saturated GPU.
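
The separate-kernel idea can be pictured with a host-side sketch, assuming that finishing one launch before starting the next gives the ordering a barrier would otherwise provide. Everything here is hypothetical: `init_kernel`, `main_kernel`, and `run_two_phase_demo` are illustrative names, and joining a `std::thread` stands in for waiting on a kernel to complete.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical global state shared between the two phases.
static int global_state = 0;

static void init_kernel() { global_state = 7; }   // Runs constructors once.
static int main_kernel() { return global_state; } // Stands in for `main`.

// Returns how many "main" threads observed the initialized state.
int run_two_phase_demo(unsigned num_threads) {
  assert(num_threads > 0);
  global_state = 0;

  // Phase 1: a single thread runs the initialization to completion.
  std::thread init(init_kernel);
  init.join(); // The join plays the role of the boundary between launches.

  // Phase 2: every thread runs the main body; no in-kernel barrier is needed.
  std::vector<int> seen(num_threads, 0);
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < num_threads; ++i)
    threads.emplace_back([&seen, i] { seen[i] = main_kernel(); });
  for (auto &t : threads)
    t.join();

  int ok = 0;
  for (int v : seen)
    ok += (v == 7);
  return ok;
}
```

Because the first phase has fully finished before the second starts, the second phase can use as many threads as desired without any of them spinning on a counter.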

jhuber6 mentioned this in rGf05ce9045af4: [NVPTX] Add NVPTXCtorDtorLoweringPass to handle global ctors / dtors. May 4 2023, 5:13 AM

Revision Contents

Path                                                          Size
libc/startup/gpu/amdgpu/CMakeLists.txt                        2 lines
libc/startup/gpu/amdgpu/start.cpp                             71 lines
libc/test/integration/startup/gpu/CMakeLists.txt              10 lines
libc/test/integration/startup/gpu/init_fini_array_test.cpp    60 lines

Diff 518163

libc/startup/gpu/amdgpu/CMakeLists.txt

	add_startup_object(			add_startup_object(
	crt1			crt1
	SRC			SRC
	start.cpp			start.cpp
	DEPENDS			DEPENDS
	libc.src.__support.RPC.rpc_client			libc.src.__support.RPC.rpc_client
	libc.src.__support.GPU.utils			libc.src.__support.GPU.utils
				libc.src.stdlib.exit
				libc.src.stdlib.atexit
	COMPILE_OPTIONS			COMPILE_OPTIONS
	-ffreestanding # To avoid compiler warnings about calling the main function.			-ffreestanding # To avoid compiler warnings about calling the main function.
	-fno-builtin			-fno-builtin
	-nogpulib # Do not include any GPU vendor libraries.			-nogpulib # Do not include any GPU vendor libraries.
	-mcpu=${LIBC_GPU_TARGET_ARCHITECTURE}			-mcpu=${LIBC_GPU_TARGET_ARCHITECTURE}
	-emit-llvm # AMDGPU's intermediate object file format is bitcode.			-emit-llvm # AMDGPU's intermediate object file format is bitcode.
	--target=${LIBC_GPU_TARGET_TRIPLE}			--target=${LIBC_GPU_TARGET_TRIPLE}
	NO_GPU_BUNDLE # Compile this file directly without special GPU handling.			NO_GPU_BUNDLE # Compile this file directly without special GPU handling.
	Show All 11 Lines

libc/startup/gpu/amdgpu/start.cpp

	//===-- Implementation of crt for amdgpu ----------------------------------===//			//===-- Implementation of crt for amdgpu ----------------------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "src/__support/GPU/utils.h"			#include "src/__support/GPU/utils.h"
	#include "src/__support/RPC/rpc_client.h"			#include "src/__support/RPC/rpc_client.h"
				#include "src/stdlib/atexit.h"
				#include "src/stdlib/exit.h"

	extern "C" int main(int argc, char **argv, char **envp);			extern "C" int main(int argc, char **argv, char **envp);

	namespace __llvm_libc {			namespace __llvm_libc {

	static cpp::Atomic<uint32_t> lock = 0;			static cpp::Atomic<uint32_t> lock = 0;

	static cpp::Atomic<uint32_t> init = 0;			static cpp::Atomic<uint32_t> count = 0;

	void init_rpc(void *in, void *out, void *buffer) {			extern "C" uintptr_t __init_array_start[];
	// Only a single thread should update the RPC data.			extern "C" uintptr_t __init_array_end[];
				extern "C" uintptr_t __fini_array_start[];
				extern "C" uintptr_t __fini_array_end[];

				using InitCallback = void(int, char **, char **);
				using FiniCallback = void(void);

				static uint64_t get_grid_size() {
				return gpu::get_num_threads() * gpu::get_num_blocks();
				}

				static void call_init_array_callbacks(int argc, char **argv, char **env) {
				size_t init_array_size = __init_array_end - __init_array_start;
				for (size_t i = 0; i < init_array_size; ++i)
				reinterpret_cast<InitCallback *>(__init_array_start[i])(argc, argv, env);
				}

				static void call_fini_array_callbacks() {
				size_t fini_array_size = __fini_array_end - __fini_array_start;
				for (size_t i = 0; i < fini_array_size; ++i)
				reinterpret_cast<FiniCallback *>(__fini_array_start[i])();
				}

				void initialize(int argc, char **argv, char **env, void *in, void *out,
				void *buffer) {
				// We need a single GPU thread to perform the initialization of the global
				// constructors and data. We simply mask off all but a single thread and
				// execute.
				count.fetch_add(1, cpp::MemoryOrder::RELAXED);
	if (gpu::get_thread_id() == 0 && gpu::get_block_id() == 0) {			if (gpu::get_thread_id() == 0 && gpu::get_block_id() == 0) {
				// We need to set up the RPC client first in case any of the constructors
				// require it.
	rpc::client.reset(&lock, in, out, buffer);			rpc::client.reset(&lock, in, out, buffer);
	init.store(1, cpp::MemoryOrder::RELAXED);
				// We want the fini array callbacks to be run after other atexit
				// callbacks are run. So, we register them before running the init
				// array callbacks as they can potentially register their own atexit
				// callbacks.
				atexit(&call_fini_array_callbacks);
				call_init_array_callbacks(argc, argv, env);
	}			}

	// Wait until the previous thread signals that the data has been written.			// We wait until every single thread launched on the GPU has seen the
	while (!init.load(cpp::MemoryOrder::RELAXED))			// initialization code. This will get very, very slow for high thread counts,
				// but for testing purposes it is unlikely to matter.
				while (count.load(cpp::MemoryOrder::RELAXED) != get_grid_size())
	rpc::sleep_briefly();			rpc::sleep_briefly();
				gpu::sync_threads();
				}

	// Wait for the threads in the block to converge and fence the write.			void finalize(int retval) {
				// We wait until every single thread launched on the GPU has finished
				// executing and reached the finalize region.
				count.fetch_sub(1, cpp::MemoryOrder::RELAXED);
				while (count.load(cpp::MemoryOrder::RELAXED) != 0)
				rpc::sleep_briefly();
	gpu::sync_threads();			gpu::sync_threads();
				if (gpu::get_thread_id() == 0 && gpu::get_block_id() == 0) {
				// Only a single thread should call `exit` here, the rest should gracefully
				// return from the kernel. This is so only one thread calls the destructors
				// registered with 'atexit' above.
				__llvm_libc::exit(retval);
				}
	}			}

	} // namespace __llvm_libc			} // namespace __llvm_libc

	extern "C" [[gnu::visibility("protected"), clang::amdgpu_kernel]] void			extern "C" [[gnu::visibility("protected"), clang::amdgpu_kernel]] void
	_start(int argc, char **argv, char **envp, int *ret, void *in, void *out,			_start(int argc, char **argv, char **envp, int *ret, void *in, void *out,
	void *buffer) {			void *buffer) {
	__llvm_libc::init_rpc(in, out, buffer);			__llvm_libc::initialize(argc, argv, envp, in, out, buffer);

	__atomic_fetch_or(ret, main(argc, argv, envp), __ATOMIC_RELAXED);			__atomic_fetch_or(ret, main(argc, argv, envp), __ATOMIC_RELAXED);

				__llvm_libc::finalize(*ret);
	}			}
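
The ordering argument in the `initialize` comments (register the fini callbacks with `atexit` before running the init callbacks, so they run after any handlers the constructors register) can be checked with a small host-side sketch. It relies on exit handlers running in reverse order of registration. All names are hypothetical: `fake_atexit` and `fake_exit` are stand-ins so the example does not terminate the process, and `init_array` / `fini_array` are ordinary arrays rather than the linker-provided symbols.

```cpp
#include <cassert>
#include <string>
#include <vector>

using Callback = void (*)();

// Hypothetical stand-ins for atexit/exit; handlers form a LIFO stack.
static std::string trace; // Records the order in which callbacks run.
static std::vector<Callback> exit_stack;

static void fake_atexit(Callback cb) { exit_stack.push_back(cb); }

static void fake_exit() {
  // Handlers run in reverse order of registration.
  while (!exit_stack.empty()) {
    Callback cb = exit_stack.back();
    exit_stack.pop_back();
    cb();
  }
}

static void user_handler() { trace += "U"; }
static void ctor() {
  trace += "C";
  fake_atexit(user_handler); // A constructor registering its own handler.
}
static void dtor() { trace += "D"; }

// Ordinary arrays standing in for the init and fini arrays.
static Callback init_array[] = {ctor};
static Callback fini_array[] = {dtor};

static void call_fini_array_callbacks() {
  for (Callback cb : fini_array)
    cb();
}

// Mirrors the startup order: register fini first, run init, "main", then exit.
std::string run_startup_demo() {
  trace.clear();
  exit_stack.clear();
  fake_atexit(call_fini_array_callbacks);
  for (Callback cb : init_array)
    cb();
  trace += "M"; // Stands in for the call to main.
  fake_exit();
  assert(exit_stack.empty());
  return trace;
}
```

The returned trace is `CMUD`: constructor, main, the handler registered by the constructor, and only then the fini callbacks, because they were registered first.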

libc/test/integration/startup/gpu/CMakeLists.txt

Show All 19 Lines	SRCS
rpc_test.cpp		rpc_test.cpp
DEPENDS		DEPENDS
libc.src.__support.RPC.rpc_client		libc.src.__support.RPC.rpc_client
libc.src.__support.GPU.utils		libc.src.__support.GPU.utils
LOADER_ARGS		LOADER_ARGS
--blocks 16		--blocks 16
--threads 1		--threads 1
)		)

		# Constructors are currently only supported on AMDGPU.
		if(LIBC_GPU_TARGET_ARCHITECTURE_IS_AMDGPU)
		add_integration_test(
		init_fini_array_test
		SUITE libc-startup-tests
		SRCS
		init_fini_array_test.cpp
		)
		endif()

libc/test/integration/startup/gpu/init_fini_array_test.cpp

This file was added.

				//===-- Loader test to test init and fini array iteration -----------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "test/IntegrationTest/test.h"

				#include <stddef.h>

				int global_destroyed = false;

				class A {
				private:
				int val[1024];

				public:
				A(int i, int a) {
				for (int k = 0; k < 1024; ++k)
				val[k] = 0;
				val[i] = a;
				}

				~A() { global_destroyed = true; }

				int get(int i) const { return val[i]; }
				};

				int GLOBAL_INDEX = 512;
				int INITVAL_INITIALIZER = 0x600D;
				int BEFORE_INITIALIZER = 0xFEED;

				A global(GLOBAL_INDEX, INITVAL_INITIALIZER);

				int initval = 0;
				int before = 0;

				__attribute__((constructor(101))) void run_before() {
				before = BEFORE_INITIALIZER;
				}

				__attribute__((constructor(65535))) void run_after() {
				ASSERT_EQ(before, BEFORE_INITIALIZER);
				}

				__attribute__((constructor)) void set_initval() {
				initval = INITVAL_INITIALIZER;
				}
				__attribute__((destructor(1))) void reset_initval() {
				ASSERT_TRUE(global_destroyed);
				initval = 0;
				}

				TEST_MAIN() {
				ASSERT_EQ(global.get(GLOBAL_INDEX), INITVAL_INITIALIZER);
				ASSERT_EQ(initval, INITVAL_INITIALIZER);
				return 0;
				}
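
The test above depends on prioritized constructors running in priority order. A minimal host-side sketch of that behaviour follows, assuming GCC or Clang, where `__attribute__((constructor(N)))` functions with numerically lower `N` run first and a constructor without a priority runs after the prioritized ones. The names `early`, `late`, `order`, and `early_ran_first` are hypothetical. The recording storage is a plain zero-initialized array so it is safe to touch before any C++ dynamic initializers have run.

```cpp
#include <cassert>

// Zero-initialized plain storage; no dynamic initialization is involved.
static int order[2];
static int next_slot = 0;

// Declared first in the source but given no priority, so it runs later.
__attribute__((constructor)) void late() { order[next_slot++] = 2; }

// Priority 101 is the lowest value available to user code in GCC and Clang.
__attribute__((constructor(101))) void early() { order[next_slot++] = 1; }

// True if both constructors ran and the prioritized one ran first.
bool early_ran_first() {
  return next_slot == 2 && order[0] == 1 && order[1] == 2;
}

// Convenience wrapper that aborts if the expected order was not observed.
void check_constructor_order() { assert(early_ran_first()); }
```

Calling `check_constructor_order()` from `main` should pass on a hosted GCC or Clang toolchain, since both functions have already run by the time `main` starts.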