This is an archive of the discontinued LLVM Phabricator instance.

[libc] Concurrent GPU RPC
Needs Review · Public

Authored by JonChesterfield on Apr 12 2023, 8:12 PM.

Details

Summary

Extends the current model to handle N=64 concurrent calls. Requires API changes
for runtime-specified sizing, e.g. based on a given GPU.

This is a minimal abstraction for mutually exclusive access to shared memory.

Provided some API call can allocate memory that is writeable from two different
processes, those two processes can use this to make syscall-style function
calls or stream arbitrary data back and forth, e.g. between a GPU and a host,
between two GPUs, or between different Linux processes. It compiles the same
code for each of the paired processes, which allows testing the logic with
both sides running on an architecture with thread sanitizers or similar.

Once memory has been successfully allocated and zero-initialised, there are no
further failure modes. It is not necessary to amend function calls to pass a
failure indicator back and forth in addition to their usual values.

The interface exposes exactly the concept of applying a function to the
mutex-protected state, giving ownership to the other process, and waiting for
the other process to release it. This is a moderately annoying model to
program against; see D148288 for an example of a friendlier interface one
might wish to write on top of this.
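
As a self-contained toy of that model, using plain host memory and std::thread
in place of a GPU/host pairing (names are illustrative, not the rpc.h
interface):

```
#include <atomic>
#include <cstdio>
#include <thread>

// One slot of mutex-protected state plus two mailbox counters, one written
// by each process. Each side owns the buffer when, from its point of view,
// the counts line up the right way.
struct slot_t {
  std::atomic<unsigned> client_box{0};
  std::atomic<unsigned> server_box{0};
  int buffer = 0; // the protected state
};

int main() {
  slot_t s;
  std::thread server([&s] {
    // Wait until the client posts, apply the handler, post ownership back.
    while (s.client_box.load(std::memory_order_acquire) ==
           s.server_box.load(std::memory_order_relaxed))
      ;
    s.buffer *= 2;
    s.server_box.fetch_add(1, std::memory_order_acq_rel);
  });
  s.buffer = 21; // apply: fill the request while we own the buffer
  s.client_box.fetch_add(1, std::memory_order_acq_rel); // post: hand over
  while (s.server_box.load(std::memory_order_acquire) !=
         s.client_box.load(std::memory_order_relaxed))
    ; // wait: ownership returns once the server has replied
  server.join();
  std::printf("%d\n", s.buffer); // prints 42
}
```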

Multiple concurrent calls are supported using N copies of the underlying state.
This follows the OpenCL forward progress guarantees, i.e. assumes none. The
distinct calls have distinct port instances and cannot affect one another.

This is a simplified version with regard to compile-time invariants. If we wish
to go further, it is possible to provide a correct-by-construction
implementation of this: there is sufficient information at _compile time_ to
guarantee that exactly one of the two processes uses the buffer at a given time
and that the resources are dropped when no longer in use. That cannot deadlock.

In the present simplified version, the port_t<> type represents the state of
the inbox and outbox as best known by the current process. The outbox is
accurate; the inbox is possibly out of date. Misuse of this is possible, e.g.
conjuring new ports out of raw integers, though many typos are still caught as
unused-value warnings. Changing to move semantics would help, depending on how
libc does std::move, especially in combination with clang-tidy's
bugprone-use-after-move.
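
A minimal sketch of that typestate encoding, with hypothetical names (the real
port_t<> differs):

```
#include <cstdint>

// The template parameters record the last known inbox/outbox values, so the
// type system tracks ownership: this process owns the buffer only in states
// where the two match. Hypothetical simplification of the patch's port_t<>.
template <uint32_t Inbox, uint32_t Outbox> struct port_t {
  uint32_t index; // which of the N slots this port names; a raw integer,
                  // which is why ports can be conjured out of thin air
};

// post is only callable on an owned port and returns the handed-over state.
inline port_t<0, 1> post(port_t<0, 0> p) {
  /* toggle the outbox bit for p.index */
  return {p.index};
}
// wait is only callable on a handed-over port; it spins until the inbox
// catches up, returning an owned port.
inline port_t<1, 1> wait(port_t<0, 1> p) {
  /* spin on the inbox bit for p.index */
  return {p.index};
}
```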

The interface as written here is slightly off optimal. The 'wait' primitive is
better expressed as a query that returns either an unchanged port or an
unavailable one. That is left out as the current client/server layer does not
use it. A query that does an atomic load is simple; avoiding a repeat of that
load on success requires either less or more type safety than this patch uses.
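
Continuing the port_t sketch above, that query shape might look like the
following, with std::optional standing in for whatever either-type libc would
use and inbox_bit as a hypothetical accessor:

```
#include <cstdint>
#include <optional>

bool inbox_bit(uint32_t index); // hypothetical: relaxed load of one inbox bit

// Non-blocking alternative to wait: a single atomic load of the inbox,
// returning the advanced port if the other process has replied and nothing
// otherwise; the caller can retry later or do other work.
inline std::optional<port_t<1, 1>> query(port_t<0, 1> p) {
  if (inbox_bit(p.index))
    return port_t<1, 1>{p.index};
  return std::nullopt; // unavailable
}
```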

This is tested on various amdgpu, nvptx and x86 architectures. It requires some
means of allocating memory that supports a relaxed atomic load and an acq_rel
fetch_add, e.g. over PCI Express. The protocol was designed to place minimal
requirements on the processes in the hope that it can ultimately bind to more
exotic architectures.
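
A hedged sketch of just those two operations, assuming one 32-bit word per
mailbox with the flag carried in the parity of a counter (an assumption for
illustration; the patch packs slots into a bitmap):

```
#include <atomic>
#include <cstdint>

// Relaxed load to observe the peer's mailbox flag.
uint32_t read_mailbox(const std::atomic<uint32_t> &word) {
  return word.load(std::memory_order_relaxed) & 1; // parity is the flag
}
// Acq_rel fetch_add to hand ownership over: flips the parity and publishes
// any preceding writes to the shared buffer.
void post_mailbox(std::atomic<uint32_t> &word) {
  word.fetch_add(1, std::memory_order_acq_rel);
}
```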

Extends the inbox/outbox variables to a bitmap which is mostly responsible for
mapping from an index to a specific bit in that structure. Moves the mailbox
operations into Process from the client/server classes so that invariants are
more readily tracked, and models the state machine in the template parameters
of the port holding that index.
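
A sketch of the index-to-bit mapping such a bitmap performs, with illustrative
names (the patch's exact atomic operations may differ; fetch_xor is used here
for simplicity):

```
#include <atomic>
#include <cstdint>

// Each of the N ports owns one bit of an array of 32-bit words.
struct bitmap_t {
  std::atomic<uint32_t> words[64 / 32]; // enough bits for N = 64 ports
  static uint32_t word_of(uint32_t index) { return index / 32; }
  static uint32_t mask_of(uint32_t index) { return UINT32_C(1) << (index % 32); }
  bool read_bit(uint32_t index) const {
    return words[word_of(index)].load(std::memory_order_relaxed) & mask_of(index);
  }
  void toggle_bit(uint32_t index) {
    words[word_of(index)].fetch_xor(mask_of(index), std::memory_order_acq_rel);
  }
};
```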

Does not handle a wavefront size of 64; still undecided how best to represent
that from the Loader.cpp memory allocation side. There's a bug to fix in the
clang driver for nvptx freestanding before this can work there.

Extra testing is TBD; I haven't understood the libc framework for it yet.

Diff Detail

Event Timeline

Herald added projects: Restricted Project, Restricted Project. Apr 12 2023, 8:12 PM
JonChesterfield requested review of this revision. Apr 12 2023, 8:12 PM

arc and phab are resisting me adding @lntue to the reviewers. Others copied from D145913.

Two dubious compile-time assumptions in this draft.

One is that wavefront==64 is not a thing. Changing that on the GPU side is calling a macro; the tedious part is that Loader.cpp will need to query the hardware to see what the expected wavesize is, allocate space accordingly, and deal with the server type varying based on it in some reasonable fashion.
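
For the loader side, a hedged sketch of that hardware query, assuming the HSA API (error handling elided):

```
#include <hsa/hsa.h>
#include <stdint.h>

// Ask the agent for its wavefront size (32 or 64) so the loader can size the
// shared allocation and server type to match.
static uint32_t wavefront_size(hsa_agent_t agent) {
  uint32_t size = 0;
  hsa_agent_get_info(agent, HSA_AGENT_INFO_WAVEFRONT_SIZE, &size);
  return size;
}
```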

Second is that mutual exclusion within a process involves a locks bitmap. This shouldn't be in shared/fine-grain/pinned memory as the other process never accesses it. That's presently implemented with static globals. Changing the number of slots to a runtime value (perhaps based on the number of compute units) involves allocating GPU memory in the loader and passing that along with the fine-grain memory, which is a bigger API change than I wanted to make here.
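
A sketch of that distinction, with a fixed slot count and illustrative names: the locks live in ordinary static storage, and only the mailboxes and buffers need the shared allocation.

```
#include <atomic>
#include <cstdint>

// Private to this process; the peer never reads it, so ordinary memory is fine.
static std::atomic<uint32_t> locks[64 / 32];

inline bool try_lock(uint32_t index) {
  uint32_t mask = UINT32_C(1) << (index % 32);
  // fetch_or one bit: we hold the lock iff the bit was previously clear
  return (locks[index / 32].fetch_or(mask, std::memory_order_acq_rel) & mask) == 0;
}
inline void unlock(uint32_t index) {
  uint32_t mask = UINT32_C(1) << (index % 32);
  locks[index / 32].fetch_and(~mask, std::memory_order_release);
}
```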

  • fix overallocation error
libc/src/__support/RPC/rpc.h:346

Going to have to change the client interface to pass the active threads down the stack for nvidia; various intrinsics require it.

I'm very close to standing up a patch that reworks the RPC interface for a single thread. So, we should wait until that's settled and then apply the concurrency changes on top of it. I'm also planning on putting a stress test of the RPC in that patch. As it stands, we only use the RPC in the existing tests if they fail.

I'm not worried about adapting the interface. The client::run and server::handle are just what was there before. Worth mentioning that this doesn't change the existing algorithm: it stamps out N identical copies of the structure and hands them out to threads in roughly round-robin fashion. It'll compose with your changes.
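
A sketch of that round-robin hand-out, reusing the try_lock sketched in the earlier comment (the counter and probing are illustrative, not the patch's exact scheme):

```
#include <atomic>
#include <cstdint>

// Pick a starting slot from a monotonic counter, then probe until a lock is
// acquired. With no forward progress guarantees this only spins across slots;
// it never blocks waiting on one particular slot.
static std::atomic<uint32_t> next_slot{0};

inline uint32_t open_slot(uint32_t num_slots) {
  uint32_t start = next_slot.fetch_add(1, std::memory_order_relaxed);
  for (uint32_t i = 0;; ++i) {
    uint32_t candidate = (start + i) % num_slots;
    if (try_lock(candidate)) // try_lock as sketched above
      return candidate;
  }
}
```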

Definitely agreed on more testing; I haven't got a handle on the libc/cmake plumbing around this.

  • write the comments, rename send to post
JonChesterfield edited the summary of this revision. Apr 14 2023, 2:16 PM
JonChesterfield added a reviewer: lntue.
JonChesterfield retitled this revision from [libc][wip] Draft of concurrent GPU RPC to [libc] Concurrent GPU RPC. Apr 14 2023, 2:22 PM