This is an archive of the discontinued LLVM Phabricator instance.

[libc] Enable multiple threads to use RPC on the GPU
ClosedPublic

Authored by jhuber6 on Apr 21 2023, 10:03 AM.

Details

Summary

The execution model of the GPU expects that groups of threads will
execute in lock-step in SIMD fashion. It's both important for
performance and correctness that we treat this as the smallest possible
granularity for an RPC operation. Thus, we map multiple threads to a
single larger buffer and ship that across the wire.
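For illustration, a minimal sketch of that mapping (hypothetical names and sizes, not the actual rpc.h layout): each lane gets its own fixed-size slot inside one shared packet, and the whole packet is the unit that crosses the wire.

```cpp
#include <stdint.h>

// Largest warp/wavefront we expect to serve; illustrative value only.
constexpr uint64_t MAX_LANE_SIZE = 64;

// Hypothetical per-lane payload.
struct Buffer {
  uint64_t data[8];
};

// One RPC packet carries a slot for every lane plus the mask of lanes that
// actually participated, so partially-active warps still round-trip intact.
struct Packet {
  uint64_t activemask;
  Buffer slots[MAX_LANE_SIZE];
};
```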

This patch makes the necessary changes to support executing the RPC on
the GPU with multiple threads. This requires some workarounds to mimic
the model when handling the protocol from the CPU. I'm not completely
happy with some of the workarounds required, but I think it should work.

Uses some of the implementation details from D148191.

Diff Detail

Event Timeline

jhuber6 created this revision.Apr 21 2023, 10:03 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 21 2023, 10:03 AM
jhuber6 requested review of this revision.Apr 21 2023, 10:03 AM
jhuber6 updated this revision to Diff 515818.Apr 21 2023, 10:12 AM

Cleanup leftover debugging code.

Drive

libc/src/__support/RPC/rpc.h
48

Or

56

We should have a single generic warp size macro

115

why is data not typed here?

284

shouldn't 0 be something like gpu::get_id_in_lane()? Also below.

368

return;
}

jhuber6 marked 2 inline comments as done.Apr 21 2023, 11:09 AM
jhuber6 added inline comments.
libc/src/__support/RPC/rpc.h
56

True, we can put this in the GPU utils.
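As a sketch of what that could look like, assuming clang's usual target macros (the real GPU utils header may spell it differently):

```cpp
#include <stdint.h>

#if defined(__AMDGCN__)
// Assumption: the predefined wavefront-size macro is available here.
constexpr uint32_t LANE_SIZE = __AMDGCN_WAVEFRONT_SIZE__; // 32 or 64
#elif defined(__NVPTX__)
constexpr uint32_t LANE_SIZE = 32; // warp size on NVPTX
#else
constexpr uint32_t LANE_SIZE = 1;  // host: every "lane" is its own thread
#endif
```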

115

Just easier to manage since this needs to be copied from the host to the GPU and it's easier to model arguments to the kernel as void pointers.

284

This is basically just to let the CPU pretend it has all the "threads" the GPU has. So given the GPU's lane size of 32, it'll execute this loop once. The CPU, however, will execute it 32 times because we set its lane size to 1. The idx is just an offset as you march in multiples of the lane size.
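To make that concrete, a simplified sketch of the pattern being described (not the exact rpc.h code), using hypothetical stand-ins for the GPU utilities:

```cpp
#include <stdint.h>

// Hypothetical stand-ins (see the sketches above): on the GPU, LANE_SIZE
// would be the warp/wavefront width and get_lane_id() the hardware lane id;
// on the host they collapse to 1 and 0.
constexpr uint32_t LANE_SIZE = 1;
inline uint32_t get_lane_id() { return 0; }
struct Buffer { uint64_t data[8]; };

template <typename F>
void invoke_rpc(F &&fn, Buffer *slots, uint32_t gpu_lane_size,
                uint64_t lane_mask) {
  // GPU:  LANE_SIZE == gpu_lane_size, so the loop body runs once and each
  //       hardware lane touches only its own slot.
  // Host: LANE_SIZE == 1, so this single thread loops gpu_lane_size times,
  //       standing in for each GPU "lane"; idx is the offset it marches by.
  for (uint32_t idx = 0; idx < gpu_lane_size; idx += LANE_SIZE)
    if (lane_mask & (1ull << (idx + get_lane_id())))
      fn(&slots[idx + get_lane_id()]);
}
```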

jhuber6 updated this revision to Diff 515848.Apr 21 2023, 11:18 AM

Address comments

jhuber6 updated this revision to Diff 515849.Apr 21 2023, 11:21 AM

Add MAX_LANE_SIZE for situations where we get the host to allocate enough to accommodate all sizes unconditionally.

This works as expected on my gfx1030, but seems to fail spuriously on sm_70. It's probably something to do with the Volta threading model; I'll need to investigate some more.

jhuber6 updated this revision to Diff 515969.Apr 21 2023, 6:00 PM

Updating to use an overloaded function type rather than making every function take a uint32_t.

jhuber6 updated this revision to Diff 516156.Apr 23 2023, 4:37 AM

Add Volta warp syncs on divergent paths.

This should work as expected now on both AMDGPU and NVPTX.

jhuber6 updated this revision to Diff 516171.Apr 23 2023, 8:34 AM

Fix some iffy control flow that sometimes caused problems on AMDGPU. I've run
the tests 1000 times on both my gfx1030 and sm_70 and didn't observe any
failures, so I'm reasonably confident.

jhuber6 updated this revision to Diff 516819.Apr 25 2023, 8:26 AM

Update some names and allocate the proper amount of memory.

Also this adds a hack to work around a problem with the optimizer. The GPU
executes groups of threads in SIMD style, which means basic blocks turn into
thread masks. The problem was occurring when the optimizer would sink the
common `close()` call that releases the thread lock until after another
`open()` call within the same lane. This can be seen at
https://godbolt.org/z/cfEr4n5TE. There is some ongoing work to make this
actually work on AMD, but for now it's a hack to prevent these function calls
from being merged. It has a performance penalty, but it's minor.
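For illustration only, one way such a barrier can look (the actual workaround in the patch may differ): an empty volatile asm statement that the optimizer will not merge or move calls across, placed so the close cannot be sunk past a later open in the same lane.

```cpp
// Hypothetical sketch, not the patch's exact hack.
inline void optimization_barrier() { __asm__ volatile("" ::: "memory"); }

void rpc_transaction(void (*open_port)(), void (*close_port)()) {
  open_port();
  // ... exchange data through the shared buffer ...
  optimization_barrier(); // keep close_port() pinned here so it cannot be
  close_port();           // merged with / sunk past a following open_port()
}
```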

jhuber6 updated this revision to Diff 517874.Apr 28 2023, 4:50 AM
jhuber6 added a subscriber: nhaehnle.

Add a lane sync to address the convergence problem. Thanks to @nhaehnle for the suggestion.
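A sketch of what such a lane sync can look like on the two targets, assuming clang's builtins (the actual wrapper in the GPU utils may differ):

```cpp
#include <stdint.h>

inline void sync_lane(uint64_t lane_mask) {
#if defined(__NVPTX__)
  // Volta's independent thread scheduling requires an explicit reconvergence
  // point for the named lanes of the warp.
  __nvvm_bar_warp_sync(static_cast<uint32_t>(lane_mask));
#elif defined(__AMDGCN__)
  // Wavefronts already run in lockstep; the wave barrier pins a convergence
  // point that the compiler must respect.
  (void)lane_mask;
  __builtin_amdgcn_wave_barrier();
#else
  (void)lane_mask; // single-threaded host: nothing to synchronize
#endif
}
```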

jhuber6 updated this revision to Diff 517889.Apr 28 2023, 6:04 AM

Add acquire / release semantics on the device lock. The NVPTX backend doesn't
support release on stores, so simply use an explicit fence and a relaxed
operation. The fence here will actually reduce to a full memory barrier, but
that should just be a slight performance hit because of Nvidia's lack of
support.
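A hedged sketch of that pattern, shown with std::atomic for readability (the real code uses the libc-internal atomic wrappers):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> lock{0}; // illustrative per-port lock word

void unlock_port() {
  // Preferred form: lock.store(0, std::memory_order_release);
  // Since the NVPTX backend cannot lower a release store, use an explicit
  // release fence followed by a relaxed store instead. The fence lowers to
  // a full memory barrier there, which is the slight performance hit noted
  // above, but the ordering is still correct.
  std::atomic_thread_fence(std::memory_order_release);
  lock.store(0, std::memory_order_relaxed);
}
```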

What's the reasoning behind the placement of sync_lane calls? If it's meant to be one per basic block, some are missing. The comment says it's added to close but the code has more calls than that.

If it's an amdgpu-specific hack it should probably be named based on that and excluded on nvptx. Currently it reads like nvptx picks up a lot of synchronisation it doesn't need.

jhuber6 added a comment.EditedApr 28 2023, 6:42 AM

What's the reasoning behind the placement of sync_lane calls? If it's meant to be one per basic block, some are missing. The comment says it's added to close but the code has more calls than that.

If it's an amdgpu-specific hack it should probably be named based on that and excluded on nvptx. Currently it reads like nvptx picks up a lot of synchronisation it doesn't need.

So, the convergence problem was present on Nvidia as well; it's just that its hardware model prevented it from deadlocking, so it instead serialized the entire region. Right now we have a sync_lane in front of the open and the close, mostly for safety, to enforce the expected convergence of these operations, e.g. every thread that opens the port should close the port as well in the same context. The last sync is definitely necessary on the send_n call because the threads could copy variable amounts of data.

jhuber6 updated this revision to Diff 518056.Apr 28 2023, 2:29 PM

Add a check to detect divergence.

jhuber6 updated this revision to Diff 519306.May 3 2023, 5:00 PM

Updating to the new interface from @JonChesterfield

jhuber6 updated this revision to Diff 519573.May 4 2023, 11:07 AM

Rebasing; the send and recv interface still needs to use the mask stored in the buffer because it's shared.

jhuber6 updated this revision to Diff 519620.May 4 2023, 12:54 PM

Remove file moved into another patch.

jhuber6 updated this revision to Diff 519664.May 4 2023, 2:45 PM

Rebasing; this currently deadlocks on the RPC test. Need to fix that.

jhuber6 updated this revision to Diff 519686.May 4 2023, 3:52 PM

Fixed the deadlock; the problem was that the second lane sync wasn't rebased in.

JonChesterfield accepted this revision.May 4 2023, 4:08 PM

I think this is OK. It seems likely that we'll be able to simplify parts of it later. I'll remove the outdated part of the commit message shortly; the state machine has changed since it was written.

libc/src/__support/RPC/rpc.h
106–107

Not keen on this being a runtime value but it's not that clear how to avoid that on the host side so this seems OK

188

a bit frustrated that we need this sync for now

419–423

I think we're better off without the is_first_lane here, i.e. let all active lanes write the same value to the buffer, but that should only be a codegen improvement.
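A small sketch of the suggestion with hypothetical names: the guarded form costs a divergent branch, while letting every active lane store the identical value gives the same result without one.

```cpp
#include <stdint.h>

struct Header { uint64_t opcode; }; // illustrative header field

// Hypothetical helper, simplified: true only for the lowest set lane.
bool is_first_lane(uint64_t lane_mask) { return (lane_mask & 1) != 0; }

// Guarded form: only the first active lane performs the store.
void set_opcode_guarded(Header *buffer, uint64_t lane_mask, uint64_t opcode) {
  if (is_first_lane(lane_mask))
    buffer->opcode = opcode;
}

// Suggested form: every active lane writes the identical value to the same
// location, so the result is unchanged but no branch is needed.
void set_opcode_unguarded(Header *buffer, uint64_t opcode) {
  buffer->opcode = opcode;
}
```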

libc/src/__support/RPC/rpc_util.h
33

weird that these show up in the phab diff, they should already be in trunk

This revision is now accepted and ready to land.May 4 2023, 4:08 PM
JonChesterfield edited the summary of this revision. (Show Details)May 4 2023, 4:08 PM
JonChesterfield edited the summary of this revision. (Show Details)
This revision was landed with ongoing or failed builds.May 4 2023, 5:32 PM
This revision was automatically updated to reflect the committed changes.