Extends the current model to handle N=64 concurrent calls. Requires API changes
for a runtime-specified N, e.g. sizing based on a given GPU.
This is a minimal abstraction for mutually exclusive access to shared memory.
Provided some API call can allocate memory that is writeable from two different
processes, those two processes can use this to make syscall-style function
calls or otherwise stream arbitrary data back and forth, e.g. between a GPU and
a host, between two GPUs, or between different Linux processes. It compiles the
same code for each of the paired processes, which allows testing the logic with
both sides running on an architecture with thread sanitizers or similar.
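As a rough sketch of that pairing (names invented here, and standard <atomic>
used for brevity rather than whatever the patch itself uses): both processes
map the same zero-initialised allocation, and the identical struct is compiled
into each side with a compile-time flag selecting which mailbox word is written
locally, which is what allows both halves to be built for x86 and run under a
thread sanitizer.

  #include <atomic>
  #include <cstdint>

  // Shared allocation, writable from both participants.
  struct shared_state {
    std::atomic<uint32_t> mailbox[2]; // [0] written by side 0, [1] by side 1
    alignas(64) char buffer[4096];    // mutually exclusive payload area
  };

  // Identical code is compiled for both processes; only `side` differs.
  template <unsigned side> struct process {
    static_assert(side == 0 || side == 1, "exactly two participants");
    shared_state *state;
    std::atomic<uint32_t> &outbox() { return state->mailbox[side]; }
    std::atomic<uint32_t> &inbox() { return state->mailbox[1 - side]; }
  };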
Once memory has been successfully allocated and zero-initialised, there are no
further failure modes. It is not necessary to amend function calls to pass a
failure indicator back and forth in addition to their usual values.
The interface exposes exactly the concept of applying a function to the
mutex-protected state, giving ownership to the other process and waiting for
the other process to release it. This is a moderately annoying model to program
against; see D148288 for an example of a friendlier interface one might wish
to write on top of this.
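For a rough picture of that model (apply/post/wait and the buffer layout below
are invented for illustration, not the patch's interface): the only operations
are applying a callback to the shared buffer while this process owns it,
handing ownership to the other process, and blocking until ownership comes
back.

  #include <atomic>
  #include <cstdint>

  struct buffer { uint64_t data[8]; };

  // One call slot: the shared buffer plus the two words guarding it.
  struct port {
    buffer *shared;
    std::atomic<uint32_t> *inbox;   // written by the other process
    std::atomic<uint32_t> *outbox;  // written by this process
    template <typename F> void apply(F f) { f(shared); } // only while owned
    void post() { outbox->fetch_add(1, std::memory_order_acq_rel); }
    void wait() { // owned again once the other side has echoed our outbox
      uint32_t sent = outbox->load(std::memory_order_relaxed);
      while (inbox->load(std::memory_order_acquire) != sent) { /* spin */ }
    }
  };

  // A syscall-style round trip: fill arguments, hand over, wait, read results.
  template <typename Fill, typename Use> void call(port p, Fill fill, Use use) {
    p.apply(fill);
    p.post();
    p.wait();
    p.apply(use);
  }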
Multiple concurrent calls are supported using N copies of the underlying state.
This follows the OpenCL forward progress guarantees, i.e. assumes none. The
distinct calls have distinct port instances and cannot affect one another.
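A sketch of how the N copies might be laid out, with invented names and an
assumed claiming scheme: each concurrent call takes its own slot, and nothing
done on one slot can stall another, which is all that can be relied on in the
absence of forward progress guarantees.

  #include <atomic>
  #include <cstdint>
  #include <optional>

  inline constexpr unsigned N = 64;

  struct slot {
    std::atomic<uint32_t> lock{0};   // claimed by at most one local caller
    std::atomic<uint32_t> inbox{0};
    std::atomic<uint32_t> outbox{0};
    alignas(64) char buffer[1024]{};
  };

  struct shared_state { slot slots[N]; };

  // Try to claim any currently free slot; callers simply retry on nullopt.
  std::optional<unsigned> try_open(shared_state &s) {
    for (unsigned i = 0; i < N; ++i)
      if (s.slots[i].lock.exchange(1, std::memory_order_acquire) == 0)
        return i;
    return std::nullopt;
  }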
This is a simplified version with regard to compile-time invariants. If we wish
to go further with that, it is possible to provide a correct-by-construction
implementation: there is sufficient information at _compile time_ to guarantee
that exactly one of the two processes uses the buffer at a given time and that
the resources are dropped when no longer in use. Such an implementation cannot
deadlock.
In the present simplified version, the port_t<> type represents the state of
the inbox and outbox as best known by the current process. The outbox is
accurate; the inbox is possibly out of date. Misuse of this is possible, e.g.
conjuring new ports out of raw integers, though many typos are still caught as
unused-value warnings. Changing to move semantics would help, depending on how
libc handles std::move, especially in combination with clang-tidy's
bugprone-use-after-move.
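An illustrative sketch, with invented names, of what carrying that state in the
template parameters can look like: each operation consumes a port value and
returns one of a different type, so reusing a stale port fails to type-check
and ignoring a result shows up as an unused-value warning.

  #include <cstdint>

  // Outbox value is exact, inbox value is the last one observed.
  template <bool Inbox, bool Outbox> struct [[nodiscard]] port_t {
    uint32_t index; // which of the N slots this refers to
  };

  // Handing the buffer over toggles our outbox; the result has a new type, so
  // ignoring it is an unused-value warning and reusing the old port no longer
  // type-checks. Move-only semantics would additionally let clang-tidy's
  // bugprone-use-after-move flag reuse of the consumed value.
  template <bool I, bool O> port_t<I, !O> post(port_t<I, O> p) {
    /* acq_rel fetch_add on the outbox bit */
    return {p.index};
  }

  // Waiting refreshes our view of the inbox until it matches our outbox.
  template <bool I, bool O> port_t<O, O> wait(port_t<I, O> p) {
    /* relaxed polling of the inbox bit */
    return {p.index};
  }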
The interface as written here is slightly short of optimal. The 'wait'
primitive is better expressed as a query that returns either an unchanged port
or an unavailable one. That is left out as the current client/server layer does
not use it. A query that does an atomic load is simple; avoiding a repeat of
that load on success requires either less or more type safety than this patch
uses.
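A sketch of that query shape (invented names; load_inbox is a hypothetical
relaxed load of the slot's inbox bit): the caller gets back either a port whose
type already reflects the refreshed inbox, or the unchanged port marked
unavailable.

  #include <cstdint>
  #include <variant>

  template <bool Inbox, bool Outbox> struct [[nodiscard]] port_t { uint32_t index; };
  template <bool Inbox, bool Outbox> struct unavailable_t { port_t<Inbox, Outbox> port; };

  bool load_inbox(uint32_t slot); // hypothetical relaxed load of one inbox bit

  // Single load: either ownership has arrived (the typed result records the
  // new inbox value, so it need not be loaded again) or the port comes back
  // unchanged.
  template <bool I, bool O>
  std::variant<port_t<O, O>, unavailable_t<I, O>> query(port_t<I, O> p) {
    if (load_inbox(p.index) == O)
      return port_t<O, O>{p.index};
    return unavailable_t<I, O>{p};
  }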
This is tested on various amdgpu, nvptx and x86 architectures. It requires some
means of allocating memory that supports a relaxed atomic load and an acq_rel
fetch_add, e.g. over PCI Express. The protocol was designed to place minimal
requirements on the processes in the hope that it can ultimately bind to more
exotic architectures.
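A sketch of the two operations that requirement amounts to, using standard
<atomic> here for brevity:

  #include <atomic>
  #include <cstdint>

  // Polling a mailbox uses a relaxed load.
  inline uint32_t poll_mailbox(const std::atomic<uint32_t> &word) {
    return word.load(std::memory_order_relaxed);
  }

  // Handing the buffer to the other process uses an acq_rel fetch_add.
  inline void publish(std::atomic<uint32_t> &word) {
    word.fetch_add(1, std::memory_order_acq_rel);
  }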
Extends the inbox/outbox variables to a bitmap, which is mostly responsible for
mapping from an index to a specific bit in that structure. Moves the mailbox
operations from the client/server classes into Process so that invariants are
more readily tracked, and models the state machine in the template parameters
of the port index.
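A sketch of that index-to-bit mapping, with invented names and widths; fetch_xor
is used here to toggle a single bit without disturbing its neighbours, which
may differ from the patch's choice of read-modify-write operation.

  #include <atomic>
  #include <cstdint>

  template <unsigned N> struct bitmap {
    std::atomic<uint32_t> words[(N + 31) / 32];

    static constexpr uint32_t word_index(uint32_t slot) { return slot / 32; }
    static constexpr uint32_t bit_mask(uint32_t slot) { return 1u << (slot % 32); }

    // Read the bit belonging to one port index.
    bool read_bit(uint32_t slot) const {
      return (words[word_index(slot)].load(std::memory_order_relaxed) &
              bit_mask(slot)) != 0;
    }

    // Toggle the bit belonging to one port index, leaving the other 31 bits
    // in the containing word untouched.
    void toggle_bit(uint32_t slot) {
      words[word_index(slot)].fetch_xor(bit_mask(slot),
                                        std::memory_order_acq_rel);
    }
  };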
Does not handle a wavefront size of 64; it is still undecided how best to
represent that on the Loader.cpp memory allocation side. There is a bug to fix
in the clang driver for nvptx freestanding before this can work there.
Extra testing is TBD; I have not yet understood the libc framework for it.
The client interface will need to change to pass the active threads down the
stack for nvidia, as various intrinsics require it.
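A sketch of the direction of that change, with invented names: the active-lane
mask is captured once at the outermost entry point, e.g. from __activemask() on
nvptx (widened to 64 bits), and passed down explicitly rather than re-derived
in lower layers.

  #include <cstdint>

  struct buffer { uint64_t data[64]; };

  // Every layer receives the mask instead of re-deriving it with an intrinsic.
  inline void fill_for_active_lanes(uint64_t lane_mask, buffer &b, uint64_t v) {
    for (uint32_t lane = 0; lane < 64; ++lane)
      if (lane_mask & (uint64_t{1} << lane))
        b.data[lane] = v;
  }

  // Hypothetical call shape: the mask is captured once at the top and handed
  // down the stack to whatever fills or reads the shared buffer.
  inline void client_call(uint64_t lane_mask, buffer &b) {
    fill_for_active_lanes(lane_mask, b, /*v=*/42);
  }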