Extends the current model to handle N=64 concurrent calls. Requires API changes
for a runtime-specified N, e.g. sizing based on a given GPU.
This is a minimal abstraction for mutually exclusive access to shared memory.
Provided some API call can allocate memory that is writeable from two different
processes, those two processes can use this to make syscall-style function
calls or otherwise stream arbitrary data back and forth, e.g. between a GPU and
a host, between two GPUs, or between different Linux processes. It compiles the
same code for each of the paired processes, which allows testing the logic with
both sides running on an architecture with thread sanitizers or similar.
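As a rough sketch of that pairing (names invented here, and standard <atomic>
used for brevity rather than whatever the patch itself uses): both processes
map the same zero-initialised allocation, and the identical struct is compiled
into each side with a compile-time flag selecting which mailbox word is written
locally, which is what allows both halves to be built for x86 and run under a
thread sanitizer.

  #include <atomic>
  #include <cstdint>

  // Shared allocation, writable from both participants.
  struct shared_state {
    std::atomic<uint32_t> mailbox[2]; // [0] written by side 0, [1] by side 1
    alignas(64) char buffer[4096];    // mutually exclusive payload area
  };

  // Identical code is compiled for both processes; only `side` differs.
  template <unsigned side> struct process {
    static_assert(side == 0 || side == 1, "exactly two participants");
    shared_state *state;
    std::atomic<uint32_t> &outbox() { return state->mailbox[side]; }
    std::atomic<uint32_t> &inbox() { return state->mailbox[1 - side]; }
  };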
Once memory has been successfully allocated and zero-initialised, there are no
further failure modes. It is not necessary to amend function calls to pass a
failure indicator back and forth in addition to their usual values.
The interface exposes exactly the concept of applying a function to the
mutex-protected state, giving ownership to the other process and waiting for
the other process to release it. This is a moderately annoying model to program
against; see D148288 for an example of a friendlier interface one might wish
to write on top of this.
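For a rough picture of that model (apply/post/wait and the buffer layout below
are invented for illustration, not the patch's interface): the only operations
are applying a callback to the shared buffer while this process owns it,
handing ownership to the other process, and blocking until ownership comes
back.

  #include <atomic>
  #include <cstdint>

  struct buffer { uint64_t data[8]; };

  // One call slot: the shared buffer plus the two words guarding it.
  struct port {
    buffer *shared;
    std::atomic<uint32_t> *inbox;   // written by the other process
    std::atomic<uint32_t> *outbox;  // written by this process
    template <typename F> void apply(F f) { f(shared); } // only while owned
    void post() { outbox->fetch_add(1, std::memory_order_acq_rel); }
    void wait() { // owned again once the other side has echoed our outbox
      uint32_t sent = outbox->load(std::memory_order_relaxed);
      while (inbox->load(std::memory_order_acquire) != sent) { /* spin */ }
    }
  };

  // A syscall-style round trip: fill arguments, hand over, wait, read results.
  template <typename Fill, typename Use> void call(port p, Fill fill, Use use) {
    p.apply(fill);
    p.post();
    p.wait();
    p.apply(use);
  }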
Multiple concurrent calls are supported using N copies of the underlying state.
This follows the OpenCL forward progress guarantees, i.e. assumes none. The
distinct calls have distinct port instances and cannot affect one another.
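A sketch of how the N copies might be laid out, with invented names and an
assumed claiming scheme: each concurrent call takes its own slot, and nothing
done on one slot can stall another, which is all that can be relied on in the
absence of forward progress guarantees.

  #include <atomic>
  #include <cstdint>
  #include <optional>

  inline constexpr unsigned N = 64;

  struct slot {
    std::atomic<uint32_t> lock{0};   // claimed by at most one local caller
    std::atomic<uint32_t> inbox{0};
    std::atomic<uint32_t> outbox{0};
    alignas(64) char buffer[1024]{};
  };

  struct shared_state { slot slots[N]; };

  // Try to claim any currently free slot; callers simply retry on nullopt.
  std::optional<unsigned> try_open(shared_state &s) {
    for (unsigned i = 0; i < N; ++i)
      if (s.slots[i].lock.exchange(1, std::memory_order_acquire) == 0)
        return i;
    return std::nullopt;
  }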
This is a simplified version with regard to compile-time invariants. If we wish
to go further with that, it is possible to provide a correct-by-construction
implementation: there is sufficient information at _compile time_ to guarantee
that exactly one of the two processes uses the buffer at a given time and that
the resources are dropped when no longer in use. Such an implementation cannot
deadlock.
In the present simplified version, the port_t<> type represents the state of
the inbox and outbox as best known by the current process. The outbox is
accurate; the inbox is possibly out of date. Misuse of this is possible, e.g.
conjuring new ports out of raw integers, though many typos are still caught as
unused-value warnings. Changing to move semantics would help, depending on how
libc handles std::move, especially in combination with clang-tidy's
bugprone-use-after-move.
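An illustrative sketch, with invented names, of what carrying that state in the
template parameters can look like: each operation consumes a port value and
returns one of a different type, so reusing a stale port fails to type-check
and ignoring a result shows up as an unused-value warning.

  #include <cstdint>

  // Outbox value is exact, inbox value is the last one observed.
  template <bool Inbox, bool Outbox> struct [[nodiscard]] port_t {
    uint32_t index; // which of the N slots this refers to
  };

  // Handing the buffer over toggles our outbox; the result has a new type, so
  // ignoring it is an unused-value warning and reusing the old port no longer
  // type-checks. Move-only semantics would additionally let clang-tidy's
  // bugprone-use-after-move flag reuse of the consumed value.
  template <bool I, bool O> port_t<I, !O> post(port_t<I, O> p) {
    /* acq_rel fetch_add on the outbox bit */
    return {p.index};
  }

  // Waiting refreshes our view of the inbox until it matches our outbox.
  template <bool I, bool O> port_t<O, O> wait(port_t<I, O> p) {
    /* relaxed polling of the inbox bit */
    return {p.index};
  }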
The interface as written here is slightly short of optimal. The 'wait'
primitive is better expressed as a query that returns either an unchanged port
or an unavailable one. That is left out as the current client/server layer does
not use it. A query that does an atomic load is simple; avoiding a repeat of
that load on success requires either less or more type safety than this patch
uses.
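A sketch of that query shape (invented names; load_inbox is a hypothetical
relaxed load of the slot's inbox bit): the caller gets back either a port whose
type already reflects the refreshed inbox, or the unchanged port marked
unavailable.

  #include <cstdint>
  #include <variant>

  template <bool Inbox, bool Outbox> struct [[nodiscard]] port_t { uint32_t index; };
  template <bool Inbox, bool Outbox> struct unavailable_t { port_t<Inbox, Outbox> port; };

  bool load_inbox(uint32_t slot); // hypothetical relaxed load of one inbox bit

  // Single load: either ownership has arrived (the typed result records the
  // new inbox value, so it need not be loaded again) or the port comes back
  // unchanged.
  template <bool I, bool O>
  std::variant<port_t<O, O>, unavailable_t<I, O>> query(port_t<I, O> p) {
    if (load_inbox(p.index) == O)
      return port_t<O, O>{p.index};
    return unavailable_t<I, O>{p};
  }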
This is tested on various amdgpu, nvptx and x86 architectures. It requires some
means of allocating memory that supports a relaxed atomic load and an acq_rel
fetch_add, e.g. over PCI Express. The protocol was designed to place minimal
requirements on the processes in the hope that it can ultimately bind to more
exotic architectures.
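A sketch of the two operations that requirement amounts to, using standard
<atomic> here for brevity:

  #include <atomic>
  #include <cstdint>

  // Polling a mailbox uses a relaxed load.
  inline uint32_t poll_mailbox(const std::atomic<uint32_t> &word) {
    return word.load(std::memory_order_relaxed);
  }

  // Handing the buffer to the other process uses an acq_rel fetch_add.
  inline void publish(std::atomic<uint32_t> &word) {
    word.fetch_add(1, std::memory_order_acq_rel);
  }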
Extends the inbox/outbox variables to a bitmap, which is mostly responsible for
mapping from an index to a specific bit in that structure. Moves the mailbox
operations from the client/server classes into Process so that invariants are
more readily tracked, and models the state machine in the template parameters
of the port index.
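A sketch of that index-to-bit mapping, with invented names and widths; fetch_xor
is used here to toggle a single bit without disturbing its neighbours, which
may differ from the patch's choice of read-modify-write operation.

  #include <atomic>
  #include <cstdint>

  template <unsigned N> struct bitmap {
    std::atomic<uint32_t> words[(N + 31) / 32];

    static constexpr uint32_t word_index(uint32_t slot) { return slot / 32; }
    static constexpr uint32_t bit_mask(uint32_t slot) { return 1u << (slot % 32); }

    // Read the bit belonging to one port index.
    bool read_bit(uint32_t slot) const {
      return (words[word_index(slot)].load(std::memory_order_relaxed) &
              bit_mask(slot)) != 0;
    }

    // Toggle the bit belonging to one port index, leaving the other 31 bits
    // in the containing word untouched.
    void toggle_bit(uint32_t slot) {
      words[word_index(slot)].fetch_xor(bit_mask(slot),
                                        std::memory_order_acq_rel);
    }
  };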
Does not handle a wavefront size of 64; it is still undecided how best to
represent that on the Loader.cpp memory allocation side. There is a bug to fix
in the clang driver for nvptx freestanding before this can work there.
Extra testing is TBD; I have not yet understood the libc framework for it.
The client interface will need to change to pass the active threads down the
stack for nvidia, as various intrinsics require it.
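A sketch of the direction of that change, with invented names: the active-lane
mask is captured once at the outermost entry point, e.g. from __activemask() on
nvptx (widened to 64 bits), and passed down explicitly rather than re-derived
in lower layers.

  #include <cstdint>

  struct buffer { uint64_t data[64]; };

  // Every layer receives the mask instead of re-deriving it with an intrinsic.
  inline void fill_for_active_lanes(uint64_t lane_mask, buffer &b, uint64_t v) {
    for (uint32_t lane = 0; lane < 64; ++lane)
      if (lane_mask & (uint64_t{1} << lane))
        b.data[lane] = v;
  }

  // Hypothetical call shape: the mask is captured once at the top and handed
  // down the stack to whatever fills or reads the shared buffer.
  inline void client_call(uint64_t lane_mask, buffer &b) {
    fill_for_active_lanes(lane_mask, b, /*v=*/42);
  }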