
Details

Reviewers

jdoerfert
tianshilei1992
JonChesterfield
tra
sivachandra
michaelrj
lntue

Commits

rG719d77ed28b6: [libc] Begin implementing a library for the RPC server

Summary

This patch begins providing a generic static library that wraps around
the raw rpc.h interface. As discussed in the corresponding RFC,
https://discourse.llvm.org/t/rfc-libc-exporting-the-rpc-interface-for-the-gpu-libc/71030,
we want to begin exporting RPC services to external users. To do this,
we decided not to expose the rpc.h header and instead wrap its
functionality. The wrapper uses a C interface because we make heavy use
of callbacks, and because it lets us provide a predictable interface.
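
As a rough illustration of the approach described in the summary, the sketch below hides a C++ object behind `extern "C"` functions that take allocation callbacks. Every name here (`demo_init`, `demo_shutdown`, `demo_get_buffer`, `HiddenServer`, `host_alloc`, `host_free`) is hypothetical and chosen only for this example; it is not the interface added by the patch, and plain host `malloc` stands in for the device-visible memory a real loader would allocate.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Public C interface (hypothetical names, not the patch's API).
extern "C" {
// Caller-supplied callback that allocates memory both sides can access.
typedef void *(*demo_alloc_ty)(uint64_t size, void *data);
// Caller-supplied callback that releases memory from demo_alloc_ty.
typedef void (*demo_free_ty)(void *ptr, void *data);

void demo_init(demo_alloc_ty alloc, void *data);
void demo_shutdown(demo_free_ty dealloc, void *data);
void *demo_get_buffer(void);
}

namespace {
// Stand-in for the C++ implementation that stays private to the library.
struct HiddenServer {
  void *buffer = nullptr;
};
HiddenServer server;
} // namespace

extern "C" void demo_init(demo_alloc_ty alloc, void *data) {
  // The library never allocates directly; it asks the caller to do it.
  server.buffer = alloc(64, data);
}

extern "C" void demo_shutdown(demo_free_ty dealloc, void *data) {
  dealloc(server.buffer, data);
  server.buffer = nullptr;
}

extern "C" void *demo_get_buffer(void) { return server.buffer; }

// Example callbacks: host malloc/free, with `data` pointing at a counter of
// live allocations (an assumption made purely for this demonstration).
static void *host_alloc(uint64_t size, void *data) {
  ++*static_cast<int *>(data);
  return std::malloc(size);
}

static void host_free(void *ptr, void *data) {
  --*static_cast<int *>(data);
  std::free(ptr);
}
```

The opaque `void *data` argument is what lets a caller pass state (a memory pool, a device handle) through the C boundary without the library knowing its type.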

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jhuber6 created this revision. Mar 28 2023, 8:20 AM

Herald added projects: Restricted Project, Restricted Project. Mar 28 2023, 8:20 AM

Herald added subscribers: libc-commits, kosarev, mattd and 5 others.

jhuber6 requested review of this revision. Mar 28 2023, 8:20 AM

Herald added subscribers: jplehr, sstefan1. Mar 28 2023, 8:20 AM

Harbormaster completed remote builds in B222259: Diff 509025. Mar 28 2023, 8:26 AM

Make include directory public to simplify usage.

Harbormaster completed remote builds in B222270: Diff 509038. Mar 28 2023, 8:58 AM

Changing implementation to use a C-based interface. This is primarily because
it's required to actually hide the implementation details of the RPC server. The
rpc.h header should not be visible outside of internal libc projects. This
also makes it easier to provide as a simple library with an expected symbol
name.

Harbormaster completed remote builds in B222505: Diff 509345. Mar 29 2023, 6:52 AM

Forgot to reset memory, and ping.

Harbormaster completed remote builds in B223019: Diff 510043. Mar 31 2023, 8:10 AM

jhuber6 edited the summary of this revision. Jun 5 2023, 1:25 PM

jhuber6 added parent revisions: D151735: [libc] Implement basic `malloc` and `free` support on the GPU, D151282: [libc] Add initial support for 'puts' and 'fputs' to the GPU.

Updating and rebasing.

I'm not completely happy with the interface but we can always modify it later.
This should provide what the libc developers wanted and separate the
implementation of rpc.h and the provided server. This has some downsides
compared to just exporting the full header somehow, but also has some
advantages, considering that we may not need to really provide that much custom
utility outside of libc for users, so it's easier to consolidate the
functionality in a defined interface.

Let me know what should be changed, there's a lot of cruft associated with
registering the custom handlers for malloc / free but I don't think there's
another way to get around it.

Harbormaster completed remote builds in B236716: Diff 528564. Jun 5 2023, 1:29 PM

Answering the question from discord: Normally, a single header file named <some>_service.h is to be used by both the client and the server. Clients will use the client API from that header file, whereas servers will use the server API. That way, the message types (or opcodes, as you are calling them) will be in a single shared header file.
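
A minimal sketch of that shared-header idea, assuming a hypothetical `demo_service.h` whose contents are inlined below. The names `DemoOpcode`, `DemoMessage`, `demo_make_exit`, and `demo_server_exit_status` are invented for illustration, and the message layout (opcode in slot 0, argument in slot 1) only loosely mirrors how the loaders in this patch read `buffer->data`.

```cpp
#include <cassert>
#include <cstdint>

// --- contents of the hypothetical shared header "demo_service.h" ---

// Opcodes known to both the client and the server.
enum DemoOpcode : uint64_t { DEMO_NOOP = 0, DEMO_PRINT = 1, DEMO_EXIT = 2 };

// Fixed-size message: data[0] carries the opcode, the rest carry arguments.
struct DemoMessage {
  uint64_t data[8];
};

// Client API: encode an "exit" request.
inline DemoMessage demo_make_exit(uint64_t status) {
  DemoMessage msg = {};
  msg.data[0] = DEMO_EXIT;
  msg.data[1] = status;
  return msg;
}

// Server API: decode a message. Returns the requested exit status, or -1 if
// the message is not an exit request.
inline int demo_server_exit_status(const DemoMessage &msg) {
  switch (static_cast<DemoOpcode>(msg.data[0])) {
  case DEMO_EXIT:
    return static_cast<int>(msg.data[1]);
  default:
    return -1;
  }
}
```

Because both sides include the same definitions, an opcode cannot be renumbered on one side without the other side seeing the change at compile time.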

libc/utils/gpu/server/Server.h
46 ↗	(On Diff #528564)	Why are there separate functions for allocation and deallocation?
51 ↗	(On Diff #528564)	Update comment.
54 ↗	(On Diff #528564)	Ditto.
61 ↗	(On Diff #528564)	Why not just `rpc_shutdown`?

jhuber6 marked 3 inline comments as done. Jun 6 2023, 4:17 AM

jhuber6 added inline comments.

libc/utils/gpu/server/Server.h
46 ↗	(On Diff #528564)	I'll change the name to `free` but we need both to adequately deallocate the shared memory.

Making suggested changes

Harbormaster completed remote builds in B236899: Diff 528799. Jun 6 2023, 4:56 AM

jhuber6 mentioned this in D152283: [libc] Export GPU extensions to `libc` for external use. Jun 6 2023, 9:13 AM

jhuber6 added a parent revision: D152283: [libc] Export GPU extensions to `libc` for external use. Jun 7 2023, 5:38 AM

Ping

OK from my side but GPU review should be done by a GPU expert.

This revision is now accepted and ready to land. Jun 12 2023, 10:45 PM

JonChesterfield added inline comments. Jun 13 2023, 12:10 PM

libc/utils/gpu/loader/amdgpu/Loader.cpp
55	Call into the other one, static void handle_error(rpc_status_t) { handle_error("Failure in the RPC server"); }
libc/utils/gpu/loader/nvptx/Loader.cpp
50	I still really dislike the copy/paste going on here
72	Does this work? It looks like the same stream running the kernel is being used to provide malloc/free, and I'd expect that to deadlock
libc/utils/gpu/server/Server.cpp
35 ↗	(On Diff #528799)	Could we go with a vector of Device instead of the new+array construct?
44 ↗	(On Diff #528799)	Why is this a heap allocated thing, as opposed to `static State state;` ?
51 ↗	(On Diff #528799)	Could make the counter 64 bit and delete the test against max, as a counter >= address space size can't overflow. In general the DIY reference counting is a bit odd - is there a reason this isn't a shared_ptr?
107 ↗	(On Diff #528799)	I'm still hopeful that we'll come up with a better idea than rpc::MAX_LANE_SIZE
libc/utils/gpu/server/Server.h
57 ↗	(On Diff #528799)	This looks like the type of a function, not the type of a function pointer. It's used as an argument to functions where it'll decay to the pointer type. More conventionally written with an extra *: `typedef void (*rpc_free_ty)(void *ptr, void *data);` Is there a benefit to declaring this as the function type as opposed to the function pointer type?
70 ↗	(On Diff #528799)	Want `(void)` if this is meant to be usable from C (the guards above suggest it is). C++ thinks foo() is a function of no arguments. C thinks it's some aberration from the past (though that might have been dropped in the last standard).
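
To make the function-type versus function-pointer point concrete, here is a small C++ sketch. The typedef and function names (`free_fn_ty`, `free_ptr_ty`, `shutdown_a`, `shutdown_b`, `count_free`) are hypothetical. It shows that a parameter declared with a function type is adjusted to a pointer, so both spellings yield the same function signature; the difference only matters where the typedef is used outside a parameter list, for example as a struct member.

```cpp
#include <cassert>
#include <type_traits>

// Function type: names the signature itself.
typedef void(free_fn_ty)(void *ptr, void *data);
// Pointer-to-function type: the more conventional spelling for callbacks.
typedef void (*free_ptr_ty)(void *ptr, void *data);

static int calls = 0;

// Trivial callback that only records that it was invoked.
static void count_free(void *, void *) { ++calls; }

// A parameter of function type is adjusted to a pointer to that function.
void shutdown_a(free_fn_ty dealloc, void *data) { dealloc(nullptr, data); }
void shutdown_b(free_ptr_ty dealloc, void *data) { dealloc(nullptr, data); }

static_assert(std::is_function<free_fn_ty>::value,
              "free_fn_ty is a function type");
static_assert(std::is_pointer<free_ptr_ty>::value,
              "free_ptr_ty is a pointer type");
static_assert(
    std::is_same<decltype(shutdown_a), decltype(shutdown_b)>::value,
    "after parameter adjustment both functions have the same type");
```

On the second comment: before C23, a C declaration such as `void f();` leaves the parameters unspecified, whereas `void f(void);` declares a function taking no arguments, which is why the explicit `(void)` is the safer spelling in a header meant for both languages.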

jhuber6 added inline comments. Jun 13 2023, 12:33 PM

libc/utils/gpu/loader/nvptx/Loader.cpp
72	There's a test for this that's been running on https://lab.llvm.org/buildbot/#/builders/46 for a few weeks now and it hasn't deadlocked as far as I can tell. It's a completely separate stream called `memory_stream` that's just created here. The one running the kernel is just called `stream`. This requires CUDA 11.2 IIRC.
libc/utils/gpu/server/Server.cpp
35 ↗	(On Diff #528799)	It's a static size so a constant sized array should be more correct.
44 ↗	(On Diff #528799)	It's just easier to check if it's been initialized because the pointer is nullable. We could probably make it a static thing and have a flag instead if you'd like.
51 ↗	(On Diff #528799)	Would a shared pointer give us the same semantics? We would be allocating it multiple times and not copying it.
107 ↗	(On Diff #528799)	We could use a vector and push back into it instead, or preallocate according to the size above, but this was the easiest solution.
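
One way the alternative discussed in this exchange could look is sketched below: a statically allocated state object with an explicit `initialized` flag and a 64-bit reference count, instead of a heap-allocated, nullable pointer. The names `DemoState`, `state_init`, and `state_shutdown` are hypothetical and this is not the code from Server.cpp; it only illustrates the trade-off.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical library-wide state kept in static storage.
struct DemoState {
  uint64_t ref_count = 0;   // 64-bit, so no practical overflow check needed
  bool initialized = false; // replaces the "is the pointer null?" test
  int num_devices = 0;
};

static DemoState state;

// The first caller initializes the state; later callers only bump the count.
bool state_init(int num_devices) {
  if (state.ref_count++ == 0) {
    state.num_devices = num_devices;
    state.initialized = true;
  }
  return state.initialized;
}

// The last caller to shut down resets the state. Returns false if the state
// was never initialized.
bool state_shutdown() {
  if (!state.initialized || state.ref_count == 0)
    return false;
  if (--state.ref_count == 0)
    state = DemoState{};
  return true;
}
```

A `std::shared_ptr` would manage lifetime of a single allocation that is copied between owners; the pattern here differs in that repeated `state_init` calls share one static object without copying a handle, which is roughly the concern raised in the reply above.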

jhuber6 marked 5 inline comments as done. Jun 14 2023, 1:27 PM

jhuber6 added inline comments.

libc/utils/gpu/server/Server.h
57 ↗	(On Diff #528799)	I just forgot to add it and my IDE took care of the conversions when I got an error, will change.
70 ↗	(On Diff #528799)	Forgot about that quirk of C, thanks.

Addressing comments

Harbormaster completed remote builds in B238926: Diff 531484. Jun 14 2023, 1:28 PM

Interesting that the cuda path doesn't deadlock, I wonder if that's a fix from previous cuda revisions. Thanks

In D147054#4422839, @JonChesterfield wrote:

Interesting that the cuda path doesn't deadlock, I wonder if that's a fix from previous cuda revisions. Thanks

Appreciate the thorough review.

Yeah, I heard about that problem as well, which made me nervous about how well we could support a real malloc in the future. But pleasingly it's been running on the sm_70 and sm_60 testers for a while and it seems good as long as your CUDA is new enough. As it stands we'll probably need to turn this function into a hard error if we integrate it into OpenMP and CUDA is too old.

This revision was landed with ongoing or failed builds. Jun 15 2023, 9:02 AM

Closed by commit rG719d77ed28b6: [libc] Begin implementing a library for the RPC server (authored by jhuber6).

This revision was automatically updated to reflect the committed changes.

jhuber6 added a commit: rG719d77ed28b6: [libc] Begin implementing a library for the RPC server.

jhuber6 mentioned this in rGdcdfc963d793: [libc] Export GPU extensions to `libc` for external use.

Diff 509345

libc/utils/gpu/CMakeLists.txt

				add_subdirectory(server)
	add_subdirectory(loader)			add_subdirectory(loader)

libc/utils/gpu/loader/amdgpu/CMakeLists.txt

	add_executable(amdhsa_loader Loader.cpp)			add_executable(amdhsa_loader Loader.cpp)
	add_dependencies(amdhsa_loader libc.src.__support.RPC.rpc)

	target_include_directories(amdhsa_loader PRIVATE ${LIBC_SOURCE_DIR})
	target_link_libraries(amdhsa_loader			target_link_libraries(amdhsa_loader
	PRIVATE			PRIVATE
	hsa-runtime64::hsa-runtime64			hsa-runtime64::hsa-runtime64
				rpc_server
	gpu_loader			gpu_loader
	)			)

libc/utils/gpu/loader/amdgpu/Loader.cpp

Show All 9 Lines
// architecture. The file launches the '_start' kernel which should be provided		// architecture. The file launches the '_start' kernel which should be provided
// by the device application start code and call ultimately call the 'main'		// by the device application start code and call ultimately call the 'main'
// function.		// function.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "Loader.h"		#include "Loader.h"

#include "src/__support/RPC/rpc.h"		#include "rpc_server.h"

#include <hsa/hsa.h>		#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>		#include <hsa/hsa_ext_amd.h>

#include <cstdio>		#include <cstdio>
#include <cstdlib>		#include <cstdlib>
#include <cstring>		#include <cstring>
		#include <tuple>
#include <utility>		#include <utility>

/// The name of the kernel we will launch. All AMDHSA kernels end with '.kd'.		/// The name of the kernel we will launch. All AMDHSA kernels end with '.kd'.
constexpr const char *KERNEL_START = "_start.kd";		constexpr const char *KERNEL_START = "_start.kd";

/// The arguments to the '_start' kernel.		/// The arguments to the '_start' kernel.
struct kernel_args_t {		struct kernel_args_t {
int argc;		int argc;
void *argv;		void *argv;
void *envp;		void *envp;
void *ret;		void *ret;
void *inbox;		void *inbox;
void *outbox;		void *outbox;
void *buffer;		void *buffer;
};		};

static __llvm_libc::rpc::Server server;

/// Queries the RPC client at least once and performs server-side work if there
/// are any active requests.
void handle_server() {
while (server.handle(
[&](__llvm_libc::rpc::Buffer *buffer) {
switch (static_cast<__llvm_libc::rpc::Opcode>(buffer->data[0])) {
case __llvm_libc::rpc::Opcode::PRINT_TO_STDERR: {
fputs(reinterpret_cast<const char *>(&buffer->data[1]), stderr);
break;
}
case __llvm_libc::rpc::Opcode::EXIT: {
exit(buffer->data[1]);
break;
}
default:
return;
};
},
[](__llvm_libc::rpc::Buffer *buffer) {}))
;
}

/// Print the error code and exit if \p code indicates an error.		/// Print the error code and exit if \p code indicates an error.
static void handle_error(hsa_status_t code) {		static void handle_error(hsa_status_t code) {
if (code == HSA_STATUS_SUCCESS || code == HSA_STATUS_INFO_BREAK)		if (code == HSA_STATUS_SUCCESS || code == HSA_STATUS_INFO_BREAK)
return;		return;

const char *desc;		const char *desc;
if (hsa_status_string(code, &desc) != HSA_STATUS_SUCCESS)		if (hsa_status_string(code, &desc) != HSA_STATUS_SUCCESS)
desc = "Unknown error";		desc = "Unknown error";
fprintf(stderr, "%s\n", desc);		fprintf(stderr, "%s\n", desc);
exit(EXIT_FAILURE);		exit(EXIT_FAILURE);
}		}

static void handle_error(const char *msg) {		static void handle_error(const char *msg) {
fprintf(stderr, "%s\n", msg);		fprintf(stderr, "%s\n", msg);
exit(EXIT_FAILURE);		exit(EXIT_FAILURE);
}		}

/// Generic interface for iterating using the HSA callbacks.		/// Generic interface for iterating using the HSA callbacks.
template <typename elem_ty, typename func_ty, typename callback_ty>		template <typename elem_ty, typename func_ty, typename callback_ty>
hsa_status_t iterate(func_ty func, callback_ty cb) {		hsa_status_t iterate(func_ty func, callback_ty cb) {
auto l = [](elem_ty elem, void *data) -> hsa_status_t {		auto l = [](elem_ty elem, void *data) -> hsa_status_t {
Show All 219 Lines	int load(int argc, char **argv, char **envp, void *image, size_t size) {
void *dev_ret;		void *dev_ret;
if (hsa_status_t err =		if (hsa_status_t err =
hsa_amd_memory_pool_allocate(coarsegrained_pool, sizeof(int),		hsa_amd_memory_pool_allocate(coarsegrained_pool, sizeof(int),
/*flags=*/0, &dev_ret))		/*flags=*/0, &dev_ret))
handle_error(err);		handle_error(err);
hsa_amd_memory_fill(dev_ret, 0, sizeof(int));		hsa_amd_memory_fill(dev_ret, 0, sizeof(int));

// Allocate finegrained memory for the RPC server and client to share.		// Allocate finegrained memory for the RPC server and client to share.
void *server_inbox;		auto rpc_data = std::make_tuple(finegrained_pool, dev_agent);
void *server_outbox;		auto rpc_allocator = [](uint64_t size, void *data) -> void * {
void *buffer;		auto &[finegrained_pool, dev_agent] =
if (hsa_status_t err = hsa_amd_memory_pool_allocate(		*reinterpret_cast<decltype(rpc_data) *>(data);
finegrained_pool, sizeof(__llvm_libc::cpp::Atomic<int>),		void *dev_ptr = nullptr;
/*flags=*/0, &server_inbox))		if (hsa_status_t err = hsa_amd_memory_pool_allocate(finegrained_pool, size,
handle_error(err);		/*flags=*/0, &dev_ptr))
if (hsa_status_t err = hsa_amd_memory_pool_allocate(		handle_error(err);
finegrained_pool, sizeof(__llvm_libc::cpp::Atomic<int>),		hsa_amd_agents_allow_access(1, &dev_agent, nullptr, dev_ptr);
/*flags=*/0, &server_outbox))		return dev_ptr;
handle_error(err);		};
if (hsa_status_t err = hsa_amd_memory_pool_allocate(		rpc_init(rpc_allocator, &rpc_data);
finegrained_pool, sizeof(__llvm_libc::rpc::Buffer),
/*flags=*/0, &buffer))
handle_error(err);
hsa_amd_agents_allow_access(1, &dev_agent, nullptr, server_inbox);
hsa_amd_agents_allow_access(1, &dev_agent, nullptr, server_outbox);
hsa_amd_agents_allow_access(1, &dev_agent, nullptr, buffer);

// Initialie all the arguments (explicit and implicit) to zero, then set the		// Initialie all the arguments (explicit and implicit) to zero, then set the
// explicit arguments to the values created above.		// explicit arguments to the values created above.
std::memset(args, 0, args_size);		std::memset(args, 0, args_size);
kernel_args_t *kernel_args = reinterpret_cast<kernel_args_t *>(args);		kernel_args_t *kernel_args = reinterpret_cast<kernel_args_t *>(args);
kernel_args->argc = argc;		kernel_args->argc = argc;
kernel_args->argv = dev_argv;		kernel_args->argv = dev_argv;
kernel_args->envp = dev_envp;		kernel_args->envp = dev_envp;
kernel_args->ret = dev_ret;		kernel_args->ret = dev_ret;
kernel_args->inbox = server_outbox;		kernel_args->inbox = rpc_get_outbox();
kernel_args->outbox = server_inbox;		kernel_args->outbox = rpc_get_inbox();
kernel_args->buffer = buffer;		kernel_args->buffer = rpc_get_buffer();

// Obtain a packet from the queue.		// Obtain a packet from the queue.
uint64_t packet_id = hsa_queue_add_write_index_relaxed(queue, 1);		uint64_t packet_id = hsa_queue_add_write_index_relaxed(queue, 1);
while (packet_id - hsa_queue_load_read_index_scacquire(queue) >= queue_size)		while (packet_id - hsa_queue_load_read_index_scacquire(queue) >= queue_size)
;		;

const uint32_t mask = queue_size - 1;		const uint32_t mask = queue_size - 1;
hsa_kernel_dispatch_packet_t *packet =		hsa_kernel_dispatch_packet_t *packet =
Show All 15 Lines	int load(int argc, char **argv, char **envp, void *image, size_t size) {
packet->kernel_object = kernel;		packet->kernel_object = kernel;
packet->kernarg_address = args;		packet->kernarg_address = args;

// Create a signal to indicate when this packet has been completed.		// Create a signal to indicate when this packet has been completed.
if (hsa_status_t err =		if (hsa_status_t err =
hsa_signal_create(1, 0, nullptr, &packet->completion_signal))		hsa_signal_create(1, 0, nullptr, &packet->completion_signal))
handle_error(err);		handle_error(err);

// Initialize the RPC server's buffer for host-device communication.
server.reset(server_inbox, server_outbox, buffer);

// Initialize the packet header and set the doorbell signal to begin execution		// Initialize the packet header and set the doorbell signal to begin execution
// by the HSA runtime.		// by the HSA runtime.
uint16_t header =		uint16_t header =
(HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |		(HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
(HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE) |		(HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE) |
(HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);		(HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);
__atomic_store_n(&packet->header, header | (packet->setup << 16),		__atomic_store_n(&packet->header, header | (packet->setup << 16),
__ATOMIC_RELEASE);		__ATOMIC_RELEASE);
hsa_signal_store_relaxed(queue->doorbell_signal, packet_id);		hsa_signal_store_relaxed(queue->doorbell_signal, packet_id);

// Wait until the kernel has completed execution on the device. Periodically		// Wait until the kernel has completed execution on the device. Periodically
// check the RPC client for work to be performed on the server.		// check the RPC client for work to be performed on the server.
while (hsa_signal_wait_scacquire(		while (hsa_signal_wait_scacquire(
packet->completion_signal, HSA_SIGNAL_CONDITION_EQ, 0,		packet->completion_signal, HSA_SIGNAL_CONDITION_EQ, 0,
/*timeout_hint=*/1024, HSA_WAIT_STATE_ACTIVE) != 0)		/*timeout_hint=*/1024, HSA_WAIT_STATE_ACTIVE) != 0)
handle_server();		rpc_handle();

// Create a memory signal and copy the return value back from the device into		// Create a memory signal and copy the return value back from the device into
// a new buffer.		// a new buffer.
hsa_signal_t memory_signal;		hsa_signal_t memory_signal;
if (hsa_status_t err = hsa_signal_create(1, 0, nullptr, &memory_signal))		if (hsa_status_t err = hsa_signal_create(1, 0, nullptr, &memory_signal))
handle_error(err);		handle_error(err);

void *host_ret;		void *host_ret;
Show All 38 Lines

libc/utils/gpu/loader/nvptx/CMakeLists.txt

	add_executable(nvptx_loader Loader.cpp)			add_executable(nvptx_loader Loader.cpp)
	add_dependencies(nvptx_loader libc.src.__support.RPC.rpc)

	target_include_directories(nvptx_loader PRIVATE ${LIBC_SOURCE_DIR})
	target_link_libraries(nvptx_loader			target_link_libraries(nvptx_loader
	PRIVATE			PRIVATE
	gpu_loader			gpu_loader
				rpc_server
	CUDA::cuda_driver			CUDA::cuda_driver
	)			)

libc/utils/gpu/loader/nvptx/Loader.cpp

Show All 9 Lines
// architecture. The file launches the '_start' kernel which should be provided		// architecture. The file launches the '_start' kernel which should be provided
// by the device application start code and call ultimately call the 'main'		// by the device application start code and call ultimately call the 'main'
// function.		// function.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "Loader.h"		#include "Loader.h"

#include "src/__support/RPC/rpc.h"		#include "rpc_server.h"

#include "cuda.h"		#include "cuda.h"
#include <cstddef>		#include <cstddef>
#include <cstdio>		#include <cstdio>
#include <cstdlib>		#include <cstdlib>
#include <cstring>		#include <cstring>

/// The arguments to the '_start' kernel.		/// The arguments to the '_start' kernel.
struct kernel_args_t {		struct kernel_args_t {
int argc;		int argc;
void *argv;		void *argv;
void *envp;		void *envp;
void *ret;		void *ret;
void *inbox;		void *inbox;
void *outbox;		void *outbox;
void *buffer;		void *buffer;
};		};

static __llvm_libc::rpc::Server server;

/// Queries the RPC client at least once and performs server-side work if there
/// are any active requests.
void handle_server() {
while (server.handle(
[&](__llvm_libc::rpc::Buffer *buffer) {
switch (static_cast<__llvm_libc::rpc::Opcode>(buffer->data[0])) {
case __llvm_libc::rpc::Opcode::PRINT_TO_STDERR: {
fputs(reinterpret_cast<const char *>(&buffer->data[1]), stderr);
break;
}
case __llvm_libc::rpc::Opcode::EXIT: {
exit(buffer->data[1]);
break;
}
default:
return;
};
},
[](__llvm_libc::rpc::Buffer *buffer) {}))
;
}

static void handle_error(CUresult err) {		static void handle_error(CUresult err) {
if (err == CUDA_SUCCESS)		if (err == CUDA_SUCCESS)
return;		return;

const char *err_str = nullptr;		const char *err_str = nullptr;
CUresult result = cuGetErrorString(err, &err_str);		CUresult result = cuGetErrorString(err, &err_str);
if (result != CUDA_SUCCESS)		if (result != CUDA_SUCCESS)
fprintf(stderr, "Unknown Error\n");		fprintf(stderr, "Unknown Error\n");
else		else
fprintf(stderr, "%s\n", err_str);		fprintf(stderr, "%s\n", err_str);
exit(1);		exit(1);
}		}

static void handle_error(const char *msg) {		static void handle_error(const char *msg) {
fprintf(stderr, "%s\n", msg);		fprintf(stderr, "%s\n", msg);
exit(EXIT_FAILURE);		exit(EXIT_FAILURE);
}		}

int load(int argc, char **argv, char **envp, void *image, size_t size) {		int load(int argc, char **argv, char **envp, void *image, size_t size) {
if (CUresult err = cuInit(0))		if (CUresult err = cuInit(0))
handle_error(err);		handle_error(err);

// Obtain the first device found on the system.		// Obtain the first device found on the system.
CUdevice device;		CUdevice device;
if (CUresult err = cuDeviceGet(&device, 0))		if (CUresult err = cuDeviceGet(&device, 0))
handle_error(err);		handle_error(err);

// Initialize the CUDA context and claim it for this execution.		// Initialize the CUDA context and claim it for this execution.
CUcontext context;		CUcontext context;
if (CUresult err = cuDevicePrimaryCtxRetain(&context, device))		if (CUresult err = cuDevicePrimaryCtxRetain(&context, device))
handle_error(err);		handle_error(err);
if (CUresult err = cuCtxSetCurrent(context))		if (CUresult err = cuCtxSetCurrent(context))
handle_error(err);		handle_error(err);

// Initialize a non-blocking CUDA stream to execute the kernel.		// Initialize a non-blocking CUDA stream to execute the kernel.
CUstream stream;		CUstream stream;
if (CUresult err = cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING))		if (CUresult err = cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING))
handle_error(err);		handle_error(err);

// Load the image into a CUDA module.		// Load the image into a CUDA module.
CUmodule binary;		CUmodule binary;
if (CUresult err = cuModuleLoadDataEx(&binary, image, 0, nullptr, nullptr))		if (CUresult err = cuModuleLoadDataEx(&binary, image, 0, nullptr, nullptr))
handle_error(err);		handle_error(err);

Show All 22 Lines	int load(int argc, char **argv, char **envp, void *image, size_t size) {

// Allocate space for the return pointer and initialize it to zero.		// Allocate space for the return pointer and initialize it to zero.
CUdeviceptr dev_ret;		CUdeviceptr dev_ret;
if (CUresult err = cuMemAlloc(&dev_ret, sizeof(int)))		if (CUresult err = cuMemAlloc(&dev_ret, sizeof(int)))
handle_error(err);		handle_error(err);
if (CUresult err = cuMemsetD32(dev_ret, 0, 1))		if (CUresult err = cuMemsetD32(dev_ret, 0, 1))
handle_error(err);		handle_error(err);

void *server_inbox = allocator(sizeof(__llvm_libc::cpp::Atomic<int>));		// Allocate finegrained memory for the RPC server and client to share.
void *server_outbox = allocator(sizeof(__llvm_libc::cpp::Atomic<int>));		auto rpc_allocator = [](uint64_t size, void *) -> void * {
void *buffer = allocator(sizeof(__llvm_libc::rpc::Buffer));		void *dev_ptr;
if (!server_inbox || !server_outbox || !buffer)		if (CUresult err = cuMemAllocHost(&dev_ptr, size))
handle_error("Failed to allocate memory the RPC client / server.");		handle_error(err);
		return dev_ptr;
		};
		rpc_init(rpc_allocator, nullptr);

// Set up the arguments to the '_start' kernel on the GPU.		// Set up the arguments to the '_start' kernel on the GPU.
uint64_t args_size = sizeof(kernel_args_t);		uint64_t args_size = sizeof(kernel_args_t);
kernel_args_t args;		kernel_args_t args;
std::memset(&args, 0, args_size);		std::memset(&args, 0, args_size);
args.argc = argc;		args.argc = argc;
args.argv = dev_argv;		args.argv = dev_argv;
args.envp = dev_envp;		args.envp = dev_envp;
args.ret = reinterpret_cast<void *>(dev_ret);		args.ret = reinterpret_cast<void *>(dev_ret);
args.inbox = server_outbox;		args.inbox = rpc_get_outbox();
args.outbox = server_inbox;		args.outbox = rpc_get_inbox();
args.buffer = buffer;		args.buffer = rpc_get_buffer();
void *args_config[] = {CU_LAUNCH_PARAM_BUFFER_POINTER, &args,		void *args_config[] = {CU_LAUNCH_PARAM_BUFFER_POINTER, &args,
CU_LAUNCH_PARAM_BUFFER_SIZE, &args_size,		CU_LAUNCH_PARAM_BUFFER_SIZE, &args_size,
CU_LAUNCH_PARAM_END};		CU_LAUNCH_PARAM_END};

// Initialize the RPC server's buffer for host-device communication.
server.reset(server_inbox, server_outbox, buffer);

// Call the kernel with the given arguments.		// Call the kernel with the given arguments.
if (CUresult err =		if (CUresult err =
cuLaunchKernel(function, /*gridDimX=*/1, /*gridDimY=*/1,		cuLaunchKernel(function, /*gridDimX=*/1, /*gridDimY=*/1,
/*gridDimZ=*/1, /*blockDimX=*/1, /*blockDimY=*/1,		/*gridDimZ=*/1, /*blockDimX=*/1, /*blockDimY=*/1,
/*bloackDimZ=*/1, 0, stream, nullptr, args_config))		/*bloackDimZ=*/1, 0, stream, nullptr, args_config))
handle_error(err);		handle_error(err);

// Wait until the kernel has completed execution on the device. Periodically		// Wait until the kernel has completed execution on the device. Periodically
// check the RPC client for work to be performed on the server.		// check the RPC client for work to be performed on the server.
while (cuStreamQuery(stream) == CUDA_ERROR_NOT_READY)		while (cuStreamQuery(stream) == CUDA_ERROR_NOT_READY)
handle_server();		rpc_handle();

// Copy the return value back from the kernel and wait.		// Copy the return value back from the kernel and wait.
int host_ret = 0;		int host_ret = 0;
if (CUresult err = cuMemcpyDtoH(&host_ret, dev_ret, sizeof(int)))		if (CUresult err = cuMemcpyDtoH(&host_ret, dev_ret, sizeof(int)))
handle_error(err);		handle_error(err);

if (CUresult err = cuStreamSynchronize(stream))		if (CUresult err = cuStreamSynchronize(stream))
handle_error(err);		handle_error(err);

// Destroy the context and the loaded binary.		// Destroy the context and the loaded binary.
if (CUresult err = cuModuleUnload(binary))		if (CUresult err = cuModuleUnload(binary))
handle_error(err);		handle_error(err);
if (CUresult err = cuDevicePrimaryCtxRelease(device))		if (CUresult err = cuDevicePrimaryCtxRelease(device))
handle_error(err);		handle_error(err);
return host_ret;		return host_ret;
}		}

libc/utils/gpu/server/CMakeLists.txt

This file was added.

				add_library(rpc_server STATIC rpc_server.h rpc_server.cpp)

				# Include the RPC implemenation from libc.
				add_dependencies(rpc_server libc.src.__support.RPC.rpc)
				target_include_directories(rpc_server PRIVATE ${LIBC_SOURCE_DIR})
				target_include_directories(rpc_server PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})

libc/utils/gpu/server/rpc_server.h

This file was added.

				//===-- Shared memory RPC server instantiation ------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_UTILS_GPU_SERVER_RPC_SERVER_H
				#define LLVM_LIBC_UTILS_GPU_SERVER_RPC_SERVER_H

				#include <stdint.h>

				#ifdef __cplusplus
				extern "C" {
				#endif

				typedef void *(rpc_alloc_ty)(uint64_t size, void *data);

				typedef void(rpc_dealloc_ty)(void *ptr, void *data);

				/// Initialize the server with unified memory to communicate with the client.
				void rpc_init(rpc_alloc_ty alloc, void *data);

				/// Deallocate the memory associated with the server.
				void rpc_deinit(rpc_dealloc_ty, void *data);

				/// Queries the RPC client at least once and performs server-side work if there
				/// are any active requests.
				void rpc_handle();

				/// Get the pointer to the data inbox.
				/// TODO: We should try to compress this into a single buffer.
				void *rpc_get_inbox();

				/// Get the pointer to the data outbox.
				void *rpc_get_outbox();

				/// Get the pointer to the data buffer.
				void *rpc_get_buffer();

				#ifdef __cplusplus
				}
				#endif

				#endif

libc/utils/gpu/server/rpc_server.cpp

This file was added.

				//===-- Shared memory RPC server instantiation ------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "rpc_server.h"

				#include "src/__support/RPC/rpc.h"

				#include <cstdio>
				#include <cstdlib>

				/// The server instance used to communicate with the libc client.
				__llvm_libc::rpc::Server server;

				void rpc_init(rpc_alloc_ty alloc, void *data) {
				void *inbox = alloc(sizeof(__llvm_libc::cpp::Atomic<int>), data);
				void *outbox = alloc(sizeof(__llvm_libc::cpp::Atomic<int>), data);
				void *buffer = alloc(sizeof(__llvm_libc::rpc::Buffer), data);
				server.reset(inbox, outbox, buffer);
				}

				void rpc_deinit(rpc_dealloc_ty dealloc, void *data) {
				dealloc(server.inbox, data);
				dealloc(server.outbox, data);
				dealloc(server.buffer, data);
				}

				void rpc_handle() {
				while (server.handle(
				[&](__llvm_libc::rpc::Buffer *buffer) {
				switch (static_cast<__llvm_libc::rpc::Opcode>(buffer->data[0])) {
				case __llvm_libc::rpc::Opcode::PRINT_TO_STDERR: {
				fputs(reinterpret_cast<const char *>(&buffer->data[1]), stderr);
				break;
				}
				case __llvm_libc::rpc::Opcode::EXIT: {
				exit(buffer->data[1]);
				break;
				}
				default:
				return;
				};
				},
				[](__llvm_libc::rpc::Buffer *buffer) {}))
				;
				}

				void *rpc_get_inbox() { return server.inbox; }

				void *rpc_get_outbox() { return server.outbox; }

				void *rpc_get_buffer() { return server.buffer; }

This is an archive of the discontinued LLVM Phabricator instance.

[libc] Begin implementing a library for the RPC server
ClosedPublic

