This patch begins providing a generic static library that wraps around    
the raw rpc.h interface. As discussed in the corresponding RFC,    
https://discourse.llvm.org/t/rfc-libc-exporting-the-rpc-interface-for-the-gpu-libc/71030,    
we want to begin exporting RPC services to external users. In order to    
do this we decided to not expose the rpc.h header by wrapping around    
its functionality. This is done with a C-interface as we make heavy use    
of callbacks and allows us to provide a predictable interface.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Changing implementation to use a C-based interface. This is primarily because
it's required to actually hide the implementation details of the RPC server. The
rpc.h header should not be visible outside of internal libc projects. This
also makes it easier to provide as a simple library with an expected symbol
name.
Updating and rebasing.
I'm not completely happy with the interface but we can always modify it later.
This should provide what the libc developers wanted and separate the
implementation of rpc.h and the provided server. This has some downsides
compared to just exporting the full header somehow, but also has some
advantages, considering that we may not need to really provde that much custom
utility outside of libc for users, so it's easier to consoliate the
functionality in a defined interface.
Let me know what should be changed, there's a lot of cruft associated with
registering the custom handlers for malloc / free but I don't think there's
another way to get around it.
Answering the question from discord: Normally, a single header file named <some>_service.h is to be used by both the client and the server. Client's will use the client API from that header file, where as servers will use the server API. That way, the message types (or opcodes as you are calling) will be in a single shared header file.
| libc/utils/gpu/server/Server.h | ||
|---|---|---|
| 46 ↗ | (On Diff #528564) | Why are there separate functions allocation and deallocation? | 
| 51 ↗ | (On Diff #528564) | Update comment. | 
| 54 ↗ | (On Diff #528564) | Ditto. | 
| 61 ↗ | (On Diff #528564) | Why not just rpc_shutdown? | 
| libc/utils/gpu/server/Server.h | ||
|---|---|---|
| 46 ↗ | (On Diff #528564) | I'll change the name to free but we need both to adequately de allocate the shared memory. | 
| libc/utils/gpu/loader/amdgpu/Loader.cpp | ||
|---|---|---|
| 55 | Call into the other one, static void handle_error(rpc_status_t) {
  handle_error("Failure in the RPC server");
} | |
| libc/utils/gpu/loader/nvptx/Loader.cpp | ||
| 50 | i still really dislike the copy/paste going on here | |
| 72 | Does this work? It looks like the same stream running the kernel is being used to provide malloc/free, and I'd expect that to deadlock | |
| libc/utils/gpu/server/Server.cpp | ||
| 35 ↗ | (On Diff #528799) | Could we go with a vector of Device instead of the new+array construct? | 
| 44 ↗ | (On Diff #528799) | Why is this a heap allocated thing, as opposed to static State state; ? | 
| 51 ↗ | (On Diff #528799) | Could make the counter 64 bit and delete the test against max as a counter >= address space size can't overflow In general the DIY reference counting is a bit odd - is there a reason this isn't a shared_ptr? | 
| 107 ↗ | (On Diff #528799) | I'm still hopeful that we'll come up with a better idea than rpc::MAX_LANE_SIZE | 
| libc/utils/gpu/server/Server.h | ||
| 57 ↗ | (On Diff #528799) | This looks like the type of a function, not the type of a function pointer. It's used as an argument to functions where it'll decay to the pointer type. More conventionally written with an extra * typedef void(*rpc_free_ty)(void *ptr, void *data); Is there a benefit to declaring this as the function type as opposed to the function pointer type? | 
| 70 ↗ | (On Diff #528799) | Want (void) if this is meant to be usable from C (the guards about suggest it is) C++ thinks foo() is a function of no arguments. C thinks it's some aberration from the past (though that might have been dropped in the last standard). | 
| libc/utils/gpu/loader/nvptx/Loader.cpp | ||
|---|---|---|
| 72 | There's a test for this that's been running on https://lab.llvm.org/buildbot/#/builders/46 for a few weeks now and it hasn't deadlocked as far as I can tell. It's a completely separate stream called memory_stream that's just created here. The one running the kernel is just called stream. This requires CUDA 11.2 IIRC. | |
| libc/utils/gpu/server/Server.cpp | ||
| 35 ↗ | (On Diff #528799) | It's a static size so a constant sized array should be more correct. | 
| 44 ↗ | (On Diff #528799) | It's just easier to check if it's been initialized because the pointer is nullable. We coiuld probably make it a static thing and have a flag instead if you'd like. | 
| 51 ↗ | (On Diff #528799) | Would a shared pointer give us the same semantics? We would be allocating it multiple times and not copying it. | 
| 107 ↗ | (On Diff #528799) | We could use a vector and push back into it instead, or preallocate according to the size above, but this was the easiest solution. | 
Interesting that the cuda path doesn't deadlock, I wonder if that's a fix from previous cuda revisions. Thanks
Appreciate the thorough review.
Yeah I heard about that problem as well which made me nervous about how well we could support a real malloc in the future. But pleasingly it's been running on the sm_70 and sm_60 testers for awhile and it seems good as long as your CUDA is new enough. As it stands we'll probably need to turn this function into a hard error if we integrate it into OpenMP and CUDA is too old.
Call into the other one,
static void handle_error(rpc_status_t) { handle_error("Failure in the RPC server"); }