This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add initial support for an RPC mechanism for the GPU
ClosedPublic

Authored by jhuber6 on Mar 13 2023, 2:38 AM.

Details

Summary

This patch adds initial support for an RPC client / server architecture.
The GPU is unable to perform several system operations on its own, so in
order to implement features like printing or memory allocation we need
to be able to communicate with the executing process. This is done via a
buffer of "sharable" memory, that is, a buffer with a unified pointer
that both the client and server can use to communicate.

The implementation here is based on Jon Chesterfield's minimal RPC
example from his prior work. We use an inbox and an outbox to signal
whether there is a pending RPC request and when the work is done.
We use a fixed-size buffer for the communication channel. It is fixed
size so that we can ensure there is enough space for all compute units
on the GPU to issue work to any of the ports. Right now the
implementation is single-threaded, so there is only a single buffer
that is not shared.

This implementation is still missing several features needed to be
complete, such as multi-threaded support and asynchronous calls.
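
As a rough illustration of the handshake described above, here is a minimal sketch using std::atomic and plain spin loops (the patch itself uses the libc-internal cpp::Atomic and places this state in unified shared memory):

#include <atomic>
#include <cstdint>

// Sketch of the shared state; both sides see the same memory.
struct SharedState {
  std::atomic<uint32_t> inbox{0};  // Written by the server, read by the client.
  std::atomic<uint32_t> outbox{0}; // Written by the client, read by the server.
  uint64_t buffer[8];              // Fixed-size 64-byte message slot.
};

// Client: publish a request, wait for the reply, then reset to idle.
void client_run(SharedState &s) {
  // ... fill s.buffer with the opcode and arguments ...
  s.outbox.store(1, std::memory_order_release); // Signal "work available".
  while (s.inbox.load(std::memory_order_acquire) != 1)
    ; // Spin until the server has finished.
  // ... read any results back out of s.buffer ...
  s.outbox.store(0, std::memory_order_release); // Return to the idle state.
}

// Server: wait for a request, service it, then acknowledge.
void server_run(SharedState &s) {
  while (s.outbox.load(std::memory_order_acquire) != 1)
    ; // Spin until a request arrives.
  // ... dispatch on the opcode in s.buffer and write results back ...
  s.inbox.store(1, std::memory_order_release); // Signal "work done".
  while (s.outbox.load(std::memory_order_acquire) != 0)
    ; // Wait for the client to consume the reply.
  s.inbox.store(0, std::memory_order_release); // Idle again.
}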

Depends on D145912

Diff Detail

Event Timeline

jhuber6 created this revision.Mar 13 2023, 2:38 AM
Herald added projects: Restricted Project, Restricted Project.Mar 13 2023, 2:38 AM
jhuber6 requested review of this revision.Mar 13 2023, 2:38 AM
jhuber6 updated this revision to Diff 505214.Mar 14 2023, 12:17 PM

Few fixes and more comments.

jdoerfert added inline comments.Mar 14 2023, 2:00 PM
libc/src/__support/OSUtil/gpu/io.cpp
27

Nit: braces

libc/src/__support/RPC/rpc.h
8

Add a file comment explaining how this is going to work.

Also add a web documentation page.

21

Reserve it; don't call it noop for now.

51

Add a comment for all classes, enums, and functions.

76

Comment with full sentences, please.

84

maybe handle, or query?

jhuber6 marked 3 inline comments as done.Mar 14 2023, 2:04 PM
jhuber6 added inline comments.
libc/src/__support/RPC/rpc.h
8

The file comment is good. I'll add the online documentation in a later patch.

21

How about null? I think it's important that 0 doesn't do anything, as we expect that to be the default clean state of this buffer.

84

Handle would probably be good, since I decided to make this non-blocking to make it more versatile on the server-side.
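
For illustration, a non-blocking server entry point might look like the sketch below, building on the SharedState sketch above (the name and signature are assumptions, not the patch's actual API):

// Returns immediately if no client has posted work, so the server loop is
// free to do other things between polls.
bool handle(SharedState &s) {
  if (s.outbox.load(std::memory_order_acquire) != 1)
    return false; // Nothing pending; the caller can retry later.
  // ... service the request and acknowledge as in the sketch above ...
  return true;
}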

jdoerfert added inline comments.Mar 14 2023, 4:50 PM
libc/src/__support/OSUtil/gpu/io.cpp
28

Why is this function unused? (also braces for the loop).
Let's avoid "print" for now and go with fputs and exit first.
We certainly want fputs to be the fast path for fprintf, and the non-f versions can be resolved on the device.

libc/src/__support/RPC/rpc.h
21

OK, null.

31

Since data[0] is supposed to be the opcode, should we not make it an opcode member and then 7 * 8 bytes of payload?
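
Concretely, the suggestion amounts to something like this sketch (not the committed layout):

struct Buffer {
  uint64_t opcode;  // Dedicated opcode slot instead of overloading data[0].
  uint64_t data[7]; // 7 * 8 = 56 bytes of payload, still 8-byte aligned.
};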

jhuber6 marked an inline comment as done.Mar 14 2023, 5:24 PM
jhuber6 added inline comments.
libc/src/__support/OSUtil/gpu/io.cpp
28

It's the function to use the return value from the server. Since this is void there's nothing to use. In the future this should be "asynchronous" so that we don't wait for the return value.

libc/src/__support/RPC/rpc.h
31

I'm wondering if we should provide a bunch of structs that outline the structure of the arguments and just reinterpret_cast the underlying buffer to that struct type.
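
As a sketch of that idea, using the Buffer layout sketched above (PrintArgs is a hypothetical layout, not part of the patch):

// A per-opcode view of the raw payload, overlaid via reinterpret_cast.
struct PrintArgs {
  uint64_t size;  // Number of valid bytes in this fragment.
  char data[48];  // The string fragment itself.
};

void handle_print(Buffer *buffer) {
  auto *args = reinterpret_cast<PrintArgs *>(buffer->data);
  // ... use args->size and args->data instead of raw indices ...
}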

jplehr added a subscriber: jplehr.Mar 15 2023, 2:50 AM
jhuber6 updated this revision to Diff 505523.Mar 15 2023, 9:07 AM

Addressing more comments.

The GPU plumbing looks OK to me. Implementing the minimal can-write-stderr version first is good too; details can be revised in tree as needed.

michaelrj added inline comments.Mar 15 2023, 10:13 AM
libc/src/__support/RPC/rpc.h
59

nit: communication

There is no usage of these functions currently. Adding test support is WIP. Right now the following works if I use the tools manually:

namespace __llvm_libc {
void write_to_stderr(const char *msg);
void quick_exit(int);
} // namespace __llvm_libc

using namespace __llvm_libc;

int main(int argc, char **argv) {
  for (int i = 0; i < argc; ++i) {
    write_to_stderr(argv[i]);
    write_to_stderr("\n");
  }
  quick_exit(127);
}
$ clang++ crt1.o rpc_client.o io.o quick_exit.o main.cpp -flto --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -o image
$ ./amdhsa_loader image args to the main function
image
args
to
the
main
function
$ echo $?
127
sivachandra accepted this revision.Mar 16 2023, 8:43 AM

I have no major comments so stepping aside.

libc/src/__support/RPC/rpc.h
2

Should this RPC directory be nested under OSUtil/gpu as it is GPU specific? AFAICT there is nothing GPU specific in it though.

40

Constants are in UPPER_CASE style.

libc/src/__support/RPC/rpc_client.cpp
23

false? Also, is this being used anywhere?

This revision is now accepted and ready to land.Mar 16 2023, 8:43 AM
jhuber6 marked an inline comment as done.Mar 16 2023, 8:48 AM
jhuber6 added inline comments.
libc/src/__support/RPC/rpc.h
2

Right now it's "gpu specific" but theoretically it could be used by any heterogeneous system that can share a memory region atomically. Maybe we could port it to run on FPGAs some day. But no other targets will use this right now, so it would be fine to put it under a GPU only directory if you want.

libc/src/__support/RPC/rpc_client.cpp
23

This isn't used right now. The idea is that when we provide this as a static library, any entrypoint that requires the RPC will pull in this file to resolve the symbol. That will in turn pull in this externally visible symbol which we can then read from the image directly. Therefore, if the GPU image does not contain __llvm_libc_rpc we don't need to bother with the runtime cost of spinning up a server on the host CPU.

It's set to false because constants don't get emitted as symbols if they are undefined, and we want to use the default initializer. It might be fine to make it true, but the presence of the symbol is the boolean here.
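
In other words, the intended scheme is roughly the following (a sketch of the idea, not the actual code; the host-side helpers are pseudocode):

// rpc_client.cpp: a non-constant, externally visible definition, so the
// symbol is emitted into the GPU image whenever the RPC code is linked in.
bool __llvm_libc_rpc = false;

// Host side (pseudocode): only start a server if the image asks for one.
//   if (image_has_symbol(image, "__llvm_libc_rpc"))
//     start_rpc_server();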

sivachandra added inline comments.Mar 16 2023, 8:59 AM
libc/src/__support/RPC/rpc_client.cpp
23

It's set to false because constants don't get emitted as symbols if they are undefined, and we want to use the default initializer. It might be fine to make it true, but the presence of the symbol is the boolean here.

My comment was about changing it to false instead of 0.

jhuber6 updated this revision to Diff 505840.Mar 16 2023, 9:23 AM

Addressing comments

I stumbled upon this code while looking at write_to_stderr and I think that the PRINT_TO_STDERR opcode is buggy (see comments). Also, since we're exchanging messages of 64B, why waste 8B on the opcode? Do we need such a huge opcode space?
I'm working on a patch to try to fix these issues, I'll ping back when it's ready for discussion.

libc/src/__support/OSUtil/gpu/io.cpp
24

The number of bytes copied in each round is either buffer_len (i.e. 56B) or the total length of the string; it should be the remainder.
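
A corrected loop might look like this sketch (buffer_len and len named as in the surrounding code):

uint64_t sent = 0;
while (sent < len) {
  uint64_t remaining = len - sent;
  // Clamp each chunk to the bytes still remaining, not the total length.
  uint64_t chunk = remaining < buffer_len ? remaining : buffer_len;
  // ... copy `chunk` bytes starting at msg + sent into the RPC buffer ...
  sent += chunk;
}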

libc/utils/gpu/loader/amdgpu/Loader.cpp
50

If the message's length is greater than 56B, &buffer->data[1] contains only a fragment of the message and is not null-terminated, leading fputs to read past its allocated buffer.
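
One way to avoid relying on null termination is to transmit the fragment's length explicitly and write exactly that many bytes, e.g. (a sketch; the length slot in data[1] is an assumption, not the patch's layout):

uint64_t size = buffer->data[1]; // Assumed length slot sent by the client.
fwrite(reinterpret_cast<const char *>(&buffer->data[2]), 1, size, stderr);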

314

Are there any guarantees that this piece of memory has the same alignment as __llvm_libc::cpp::Atomic<int>?
Same for the CUDA loader.

jhuber6 added a comment.EditedMar 31 2023, 7:20 AM

I stumbled upon this code while looking at write_to_stderr and I think that the PRINT_TO_STDERR opcode is buggy (see comments). Also, since we're exchanging messages of 64B, why waste 8B on the opcode? Do we need such a huge opcode space?
I'm working on a patch to try to fix these issues, I'll ping back when it's ready for discussion.

This code is heavily WIP. Almost all of this is going to change at some point in the future. This exists mainly to let us stand up unit tests. The reason for the opcode's size right now was mainly just alignment. We want the actual arguments to have at least 8B alignment. I'm going to change this in the future. Something like the following, but I'm still thinking about it:

struct Port {
  cpp::Atomic<uint32_t> flags; // Combined mailbox state bits.
  uint32_t opcode[WARP_SIZE];  // One opcode per lane in the warp.
  uint64_t activemask;         // Mask of lanes participating in the call.
  Buffer data[WARP_SIZE];      // One fixed-size buffer per lane.
};
libc/src/__support/OSUtil/gpu/io.cpp
24

Yeah, don't know why this one works. Probably more accidental behavior given that the memory allocation is a lot larger than the size of the buffer I'm using.

libc/utils/gpu/loader/amdgpu/Loader.cpp
50

You're right. This worked in practice so I never noticed, most likely because allocating fine-grained memory gives you a full page at a minimum. I never write to the rest of that data and it's implicitly initialized to zero.

314

Fine-grained memory in this context is always going to be aligned to a page as far as I know. So that'll usually be aligned on a 4096 byte boundary.

gchatelet added inline comments.Apr 1 2023, 5:04 AM
libc/utils/gpu/loader/amdgpu/Loader.cpp
314

Does that mean that you're reserving one page for just a few bytes? Maybe it would make more sense to reserve a larger chunk of memory and place the objects ourselves (not sure what this implies for the GPU).

Regarding the rpc mechanism, should the two atomics share the same cache line or would it make sense to have them in separate cache lines to prevent false sharing? It's probably OK to have them on the same cache line if there is only one client and one server.

Performance wise, it seems important that the atomics don't cross cache line boundaries though. And since placement is important (alignof(cpp::atomic<T>) should be honored for proper codegen), maybe the server should just allocate a sufficiently large chunk of memory and let a common function do the actual placement. This would prevent duplicate logic in each server (loader).
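
A sketch of such a placement (64 bytes is an assumed cache-line size; std::atomic stands in for cpp::Atomic):

// Each single-writer atomic gets its own cache line so client and server
// stores never contend on the same line.
struct SharedMemory {
  alignas(64) std::atomic<uint32_t> inbox;  // Written only by the server.
  alignas(64) std::atomic<uint32_t> outbox; // Written only by the client.
  alignas(64) uint64_t buffer[8];           // The 64-byte message slot.
};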

JonChesterfield added inline comments.Apr 1 2023, 5:25 AM
libc/utils/gpu/loader/amdgpu/Loader.cpp
314

The two atomics are single-writer. Not only should they be on different cache lines, they should probably be on different pages (so that a given page is only ever written by one of the agents involved).

Atomic variables definitely shouldn't cross cache lines. I don't know whether they'd work on GPUs if they did, that seems likely to be a correctness issue. But as they're naturally aligned and smaller than cache lines we're fine.

jhuber6 added inline comments.Apr 1 2023, 5:25 AM
libc/utils/gpu/loader/amdgpu/Loader.cpp
314

Does that mean that you're reserving one page for just a few bytes? Maybe it would make more sense to reserve a larger chunk of memory and place the objects ourselves (not sure what this implies for the GPU).

Yes, this is really wasteful but I'm ignoring it for now since the resource usage of these tests is quite low. This will change when we need to support multiple client calls from GPU threads. To meet hardware requirements the size of the buffer will probably be about 2 MB.

Regarding the rpc mechanism, should the two atomics share the same cache line or would it make sense to have them in separate cache lines to prevent false sharing? It's probably OK to have them on the same cache line if there is only one client and one server.

I'm probably going to merge these into a single atomic that we use as a bit-field. This is mostly because the underlying RPC state machine will need to support more than four states in the future (client send, server reply, client use, server clean). I'm actually not entirely sure what false sharing would mean in this context. This memory is basically a whole page that's shared with the GPU via the PCI(e) bus. I'm not privy enough to the internals to know if this can cause issues.
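
For illustration, packing the state machine into one atomic might look like this sketch (the state names and encoding are assumptions, not the eventual design):

// One 32-bit atomic encodes the whole state machine, leaving room for more
// states than two binary mailboxes can express.
enum State : uint32_t {
  IDLE = 0,
  CLIENT_SEND = 1,  // Client has posted a request.
  SERVER_REPLY = 2, // Server has written its response.
  CLIENT_USE = 3,   // Client is consuming the response.
  SERVER_CLEAN = 4, // Server is resetting the buffer.
};

std::atomic<uint32_t> state{IDLE};

// The client may only post work from the idle state.
bool try_send() {
  uint32_t expected = IDLE;
  return state.compare_exchange_strong(expected, CLIENT_SEND);
}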

Performance wise, it seems important that the atomics don't cross cache line boundaries though. And since placement is important (alignof(cpp::atomic<T>) should be honored for proper codegen), maybe the server should just allocate a sufficiently large chunk of memory and let a common function do the actual placement. This would prevent duplicate logic in each server (loader).

We should definitely put alignas on the struct. I was thinking about doing the above. Having a single buffer makes it much easier to write to the GPU as well. I'm planning on just making this an array of some struct type.