The GPU has a different execution model from standard `_start`
implementations. On the GPU, all threads are active at the start of a
kernel, but to correctly initialize the C library and call the global
constructors we want single-threaded semantics. Previously, this was done
using a makeshift global barrier built from atomics. It is simpler to put
the portions of the code that must be single threaded into separate
kernels and launch those with only one thread. Generally, sharing global
state between kernel launches makes optimization more difficult, much like
calling a function outside of the TU, but for testing it is better to be
correct.
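As a sketch of the approach (the `_begin`/`_start`/`_end` kernel names, the attribute spelling, and the `LIBC_NAMESPACE` helper declarations are assumptions for illustration, not the exact patch contents), the split could look like this, with the loader launching `_begin` and `_end` with a single thread and `_start` with the full grid:

```cpp
// Hedged sketch, not the actual patch. Assumes the AMDGPU spelling of the
// kernel calling convention and illustrative LIBC_NAMESPACE helpers.
extern "C" int main(int argc, char **argv, char **envp);

namespace LIBC_NAMESPACE {
void call_init_array_callbacks(int, char **, char **);
void call_fini_array_callbacks();
int atexit(void (*)());
[[noreturn]] void exit(int);
} // namespace LIBC_NAMESPACE

// Launched with exactly one thread: register destructors, run constructors.
extern "C" [[clang::amdgpu_kernel]] void _begin(int argc, char **argv,
                                                char **envp) {
  LIBC_NAMESPACE::atexit(&LIBC_NAMESPACE::call_fini_array_callbacks);
  LIBC_NAMESPACE::call_init_array_callbacks(argc, argv, envp);
}

// Launched with the full grid: every thread runs main and folds its result
// into the shared return value.
extern "C" [[clang::amdgpu_kernel]] void _start(int argc, char **argv,
                                                char **envp, int *ret) {
  __atomic_fetch_or(ret, main(argc, argv, envp), __ATOMIC_RELAXED);
}

// Launched with exactly one thread: exit runs the atexit handlers
// registered in _begin, so the fini array callbacks still fire.
extern "C" [[clang::amdgpu_kernel]] void _end(int retval) {
  LIBC_NAMESPACE::exit(retval);
}
```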
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Splitting into three kernels seems better. These implementations are very heavily copy-and-pasted between amdgpu and nvptx, though. A bunch of the code is inherently target specific, notably the kernel launch machinery, but things like the kernel argument structs are very easily factored into a header, and start.cpp is roughly a hundred lines of fairly subtle code that is identical on nvptx and amdgpu except for the spelling of the kernel calling convention. I think that would be worth cleaning up; it's likely to be quicker to deduplicate than to review the duplication.
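For instance, the argument packs the loader fills in for each kernel launch could live in one shared header along these lines (struct names and fields here are illustrative assumptions, not necessarily what the patch uses):

```cpp
// Hedged sketch of a shared header for the kernel argument structs, so the
// amdgpu and nvptx loaders stop duplicating them. Names are illustrative.
struct begin_args_t {
  int argc;
  void *argv;
  void *envp;
};

struct start_args_t {
  int argc;
  void *argv;
  void *envp;
  void *ret; // pointer to the shared return value _start folds into
};

struct end_args_t {
  int retval;
};
```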
`libc/startup/gpu/nvptx/start.cpp:71`
> Is there a missing call to `libc::finalize()` here?
The structs should definitely be common, you're right. I think the startup code itself should remain separate in different directories; that's how the other libc targets do it, just for clarity of implementation.
`libc/startup/gpu/nvptx/start.cpp:71`
> That was only necessary for the weird "global barrier" hack I implemented. `exit` should be sufficient.
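For context, a hedged reconstruction of what such a makeshift barrier looks like (this is not the actual removed code; `get_thread_id()` is a hypothetical flat-thread-id helper):

```cpp
#include <atomic>

// Hedged reconstruction of the "global barrier" hack being replaced: one
// elected thread runs the constructors while every other thread spins on an
// atomic flag until initialization is visible.
extern "C" unsigned get_thread_id(); // hypothetical flat-thread-id helper
void call_init_array_callbacks(int, char **, char **); // as sketched above

static std::atomic<int> init_done{0};

void makeshift_single_threaded_init(int argc, char **argv, char **envp) {
  if (get_thread_id() == 0) {
    call_init_array_callbacks(argc, argv, envp); // global constructors
    init_done.store(1, std::memory_order_release);
  } else {
    while (!init_done.load(std::memory_order_acquire))
      ; // spin until thread 0 publishes completion
  }
}
```

A grid-wide spin like this is only safe if every thread is resident at once, which GPUs do not guarantee across workgroups/blocks, so it can deadlock. Launching the single-threaded pieces as their own one-thread kernels sidesteps that hazard entirely, which is the motivation in the description above.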
Thanks for moving the structs. I think the copy-and-pasted startup code is likely to be bug-prone, but as that's relatively likely to be your maintenance burden I'll accept.