This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add initial support for an RPC mechanism for the GPU
ClosedPublic

Authored by jhuber6 on Mar 13 2023, 2:38 AM.

Details

Summary

This patch adds initial support for an RPC client / server architecture.
The GPU cannot perform several system operations on its own, so in
order to implement features like printing or memory allocation we need
to be able to communicate with the executing process. This is done via a
buffer of "sharable" memory: a buffer with a unified pointer that both
the client and server can use to communicate.

The implementation here is based on Jon Chesterfield's minimal RPC
example from his prior work. We use an inbox and an outbox to signal
when an RPC request has been made and when the work is done. We use a
fixed-size buffer for the communication channel so that we can ensure
there is enough space for all compute units on the GPU to issue work to
any of the ports. Right now the implementation is single-threaded, so
there is only a single buffer that is not shared.

This implementation is still missing several features needed to be
complete, such as multi-threaded support and asynchronous calls.
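
A minimal sketch of the handshake described above, assuming illustrative names and std::atomic in place of the patch's own types:

#include <atomic>
#include <cstdint>

// Each side writes only its own outbox and reads only its inbox.
struct Channel {
  std::atomic<uint32_t> client_outbox; // the server's inbox
  std::atomic<uint32_t> server_outbox; // the client's inbox
  char buffer[64]; // fixed-size shared communication buffer
};

// Client: publish a request and spin until the server finishes it.
void client_run(Channel &c) {
  // ... fill c.buffer with the opcode and arguments ...
  c.client_outbox.store(1, std::memory_order_release);
  while (c.server_outbox.load(std::memory_order_acquire) != 1)
    ; // wait for the server's reply
  // ... read any results back out of c.buffer ...
  c.client_outbox.store(0, std::memory_order_release);
}

// Server: wait for a request, service it, and signal completion.
void server_run(Channel &c) {
  while (c.client_outbox.load(std::memory_order_acquire) != 1)
    ; // no pending work
  // ... service the request in c.buffer ...
  c.server_outbox.store(1, std::memory_order_release);
  while (c.client_outbox.load(std::memory_order_acquire) != 0)
    ; // wait for the client to consume the result
  c.server_outbox.store(0, std::memory_order_release); // back to clean state
}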

Depends on D145912

Diff Detail

Event Timeline

jhuber6 created this revision.Mar 13 2023, 2:38 AM
Herald added projects: Restricted Project, Restricted Project.Mar 13 2023, 2:38 AM
jhuber6 requested review of this revision.Mar 13 2023, 2:38 AM
jhuber6 updated this revision to Diff 505214.Mar 14 2023, 12:17 PM

Few fixes and more comments.

jdoerfert added inline comments.Mar 14 2023, 2:00 PM
libc/src/__support/OSUtil/gpu/io.cpp
27

Nit: braces

libc/src/__support/RPC/rpc.h
8

file comment how this is going to work.

Also add a web documentation document

21

reserve it, don't call it noop for now.

51

Comment for all classes and enums and functions.

76

Comment with full sentences, please.

84

maybe handle, or query?

jhuber6 marked 3 inline comments as done.Mar 14 2023, 2:04 PM
jhuber6 added inline comments.
libc/src/__support/RPC/rpc.h
8

File comment is good. I'll add online documentation in a later patch.

21

How about null? I think it's important that 0 doesn't do anything as we expect that to be the default clean state of this buffer.

84

Handle would probably be good, since I decided to make this non-blocking to make it more versatile on the server-side.

jdoerfert added inline comments.Mar 14 2023, 4:50 PM
libc/src/__support/OSUtil/gpu/io.cpp
27

Why is this function unused? (also braces for the loop).
Let's avoid "print" for now and go with fputs and exit first.
We certainly want fputs to be the fast path for fprintf, and the non-f versions can be resolved on the device.

libc/src/__support/RPC/rpc.h
21

OK, null.

30

Since data[0] is supposed to be the Opcode, should we not make it an opcode member and then 7 * 8 bytes of payload?

jhuber6 marked an inline comment as done.Mar 14 2023, 5:24 PM
jhuber6 added inline comments.
libc/src/__support/OSUtil/gpu/io.cpp
27

It's the function to use the return value from the server. Since this is void there's nothing to use. In the future this should be "asynchronous", where we don't wait for the return value.

libc/src/__support/RPC/rpc.h
30

I'm wondering if we should provide a bunch of structs that outline the structure of the arguments and just reinterpret cast the underlying buffer to that struct type.
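
A sketch of that idea, with hypothetical type and field names:

#include <cstdint>

// The packet gets an explicit opcode member, leaving 7 * 8 bytes of
// opcode-specific payload that stays 8-byte aligned.
struct Buffer {
  uint64_t opcode;
  uint64_t data[7];
};

// One possible view of the payload for a "write to stderr" call.
struct StderrArgs {
  uint64_t size; // number of valid bytes in this chunk
  char text[48];
};

inline StderrArgs *as_stderr_args(Buffer *buffer) {
  return reinterpret_cast<StderrArgs *>(buffer->data);
}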

jplehr added a subscriber: jplehr.Mar 15 2023, 2:50 AM
jhuber6 updated this revision to Diff 505523.Mar 15 2023, 9:07 AM

Addressing more comments.

The GPU plumbing looks ok to me. Implementing the minimal can-write-stderr version first is good too, details can be revised in tree as needed.

michaelrj added inline comments.Mar 15 2023, 10:13 AM
libc/src/__support/RPC/rpc.h
59

nit: communication

There is no usage of these functions currently. Adding test support is WIP. Right now the following works if I use the tools manually:

namespace __llvm_libc {
void write_to_stderr(const char *msg);
void quick_exit(int);
} // namespace __llvm_libc

using namespace __llvm_libc;

int main(int argc, char **argv) {
  for (int i = 0; i < argc; ++i) {
    write_to_stderr(argv[i]);
    write_to_stderr("\n");
  }
  quick_exit(127);
}
$ clang++ crt1.o rpc_client.o io.o quick_exit.o main.cpp -flto --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -o image
$ ./amdhsa_loader image args to the main function
image
args
to
the
main
function
$ echo $?
127
sivachandra accepted this revision.Mar 16 2023, 8:43 AM

I have no major comments so stepping aside.

libc/src/__support/RPC/rpc.h
2

Should this RPC directory be nested under OSUtil/gpu as it is GPU specific? AFAICT there is nothing GPU specific though.

40

Constants are in UPPER_CASE style.

libc/src/__support/RPC/rpc_client.cpp
23

false? Also, is this being used anywhere?

This revision is now accepted and ready to land.Mar 16 2023, 8:43 AM
jhuber6 marked an inline comment as done.Mar 16 2023, 8:48 AM
jhuber6 added inline comments.
libc/src/__support/RPC/rpc.h
2

Right now it's "gpu specific" but theoretically it could be used by any heterogeneous system that can share a memory region atomically. Maybe we could port it to run on FPGAs some day. But no other targets will use this right now, so it would be fine to put it under a GPU only directory if you want.

libc/src/__support/RPC/rpc_client.cpp
23

This isn't used right now. The idea is that when we provide this as a static library, any entrypoint that requires the RPC will pull in this file to resolve the symbol. That will in turn pull in this externally visible symbol which we can then read from the image directly. Therefore, if the GPU image does not contain __llvm_libc_rpc we don't need to bother with the runtime cost of spinning up a server on the host CPU.

It's set to false because constants don't get emitted as symbols if they are undefined, and we want to use the default initializer. It might be fine to make it true, but the presence of the symbol is the boolean here.
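
A sketch of the scheme (the exact declaration in the patch may differ):

// Defined in rpc_client.cpp, which any RPC-using entrypoint pulls in. The
// host only checks the image for the symbol's presence; the value itself
// is irrelevant, so the default zero initializer is kept.
extern "C" {
bool __llvm_libc_rpc = false;
}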

sivachandra added inline comments.Mar 16 2023, 8:59 AM
libc/src/__support/RPC/rpc_client.cpp
23

It's set to false because constants don't get emitted as symbols if they are undefined, and we want to use the default initializer. It might be fine to make it true, but the presence of the symbol is the boolean here.

My comment was about changing it to false instead of 0.

jhuber6 updated this revision to Diff 505840.Mar 16 2023, 9:23 AM

Addressing comments

I stumbled upon this code while looking at write_to_stderr and I think that the PRINT_TO_STDERR opcode is buggy (see comments). Also since we're exchanging messages of 64B, why waste 8B on the Opcode? Do we need such a huge opcode space ?
I'm working on a patch to try to fix these issues, I'll ping back when it's ready for discussion.

libc/src/__support/OSUtil/gpu/io.cpp
25

The number of bytes copied at each round is either buffer_len (i.e. 56B) or the total length of the string; it should be the remainder.
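
A sketch of the suggested fix, with hypothetical function and variable names:

#include <cstring>

// Copy buffer_len bytes per round except the last, which copies only the
// remaining bytes of the string.
void send_string(const char *msg, char *buffer, size_t buffer_len) {
  size_t remaining = strlen(msg) + 1; // include the null terminator
  while (remaining > 0) {
    size_t chunk = remaining < buffer_len ? remaining : buffer_len;
    memcpy(buffer, msg, chunk);
    // ... post the chunk to the server and wait for acknowledgement ...
    msg += chunk;
    remaining -= chunk;
  }
}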

libc/utils/gpu/loader/amdgpu/Loader.cpp
50

If the message's length is greater than 56B, &buffer->data[1] contains only a fragment of the message and is not null-terminated, leading fputs to read past its allocated buffer.
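
A sketch of one possible fix, assuming the client sends an explicit byte count with each chunk so the server can use fwrite instead of fputs:

#include <cstdint>
#include <cstdio>

void handle_stderr_chunk(const uint64_t *data) {
  uint64_t size = data[1]; // hypothetical: byte count follows the opcode
  fwrite(reinterpret_cast<const char *>(&data[2]), 1, size, stderr);
}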

314

Are there any guarantees that this piece of memory has the same alignment as __llvm_libc::cpp::Atomic<int>?
Same for the CUDA loader.

jhuber6 added a comment.EditedMar 31 2023, 7:20 AM

I stumbled upon this code while looking at write_to_stderr and I think that the PRINT_TO_STDERR opcode is buggy (see comments). Also since we're exchanging messages of 64B, why waste 8B on the Opcode? Do we need such a huge opcode space ?
I'm working on a patch to try to fix these issues, I'll ping back when it's ready for discussion.

This code is heavily WIP. Almost all of this is going to change at some point in the future. This exists mainly to let us stand up unit tests. The reason for the opcode's size right now was mainly just alignment: we want the actual arguments to have at least 8B alignment. I'm going to change this in the future to something like the following, but I'm still thinking about it.

struct Port {
  cpp::Atomic<uint32_t> flags; // mailbox state bits for this port
  uint32_t opcode[WARP_SIZE];  // one opcode per lane
  uint64_t activemask;         // which lanes issued the call
  Buffer data[WARP_SIZE];      // one payload buffer per lane
};
libc/src/__support/OSUtil/gpu/io.cpp
25

Yeah, I don't know why this one works. Probably more accidental behavior, given that the memory allocation is a lot larger than the size of the buffer I'm using.

libc/utils/gpu/loader/amdgpu/Loader.cpp
50

You're right. This worked in practice so I never noticed, most likely because allocating fine-grained memory gives you a full page at a minimum. I never write to the rest of that data and it's implicitly initialized to zero.

314

Fine-grained memory in this context is always going to be aligned to a page as far as I know. So that'll usually be aligned on a 4096 byte boundary.

gchatelet added inline comments.Apr 1 2023, 5:04 AM
libc/utils/gpu/loader/amdgpu/Loader.cpp
314

Does that mean that you're reserving one page for just a few bytes? Maybe it would make more sense to reserve a larger chunk of memory and place the objects ourselves (not sure what this implies for the GPU).

Regarding the RPC mechanism, should the two atomics share the same cache line, or would it make sense to have them in separate cache lines to prevent false sharing? It's probably OK to have them on the same cache line if there is only one client and one server.

Performance wise, it seems important that the atomics don't cross cache-line boundaries though. And since placement is important (alignof(cpp::Atomic<T>) should be honored for proper codegen), maybe the server should just allocate a sufficiently large chunk of memory and let a common function do the actual placement. This would prevent duplicate logic in each server (loader).
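
A sketch of such a shared placement helper, with a hypothetical name:

#include <cstddef>
#include <cstdint>

// Round a raw pointer up to the requested alignment so every loader can
// carve properly aligned objects out of one large allocation.
inline void *align_up(void *ptr, size_t alignment) {
  uintptr_t value = reinterpret_cast<uintptr_t>(ptr);
  return reinterpret_cast<void *>((value + alignment - 1) & ~(alignment - 1));
}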

JonChesterfield added inline comments.Apr 1 2023, 5:25 AM
libc/utils/gpu/loader/amdgpu/Loader.cpp
314

The two atomics are single-writer. Not only should they be on different cache lines, they should probably be on different pages (so that a given page is only ever written by one of the agents involved).

Atomic variables definitely shouldn't cross cache lines. I don't know whether they'd work on GPUs if they did; that seems likely to be a correctness issue. But as they're naturally aligned and smaller than cache lines, we're fine.

jhuber6 added inline comments.Apr 1 2023, 5:25 AM
libc/utils/gpu/loader/amdgpu/Loader.cpp
314

Does that mean that you're reserving one page for just a few bytes? Maybe it would make more sense to reserve a larger chunk of memory and place the objects ourselves (not sure what this implies for the GPU).

Yes, this is really wasteful but I'm ignoring it for now since the resource usage of these tests is quite low. This will change when we need to support multiple client calls from GPU threads. To meet hardware requirements the size of the buffer will probably be about 2 MB.

Regarding the RPC mechanism, should the two atomics share the same cache line, or would it make sense to have them in separate cache lines to prevent false sharing? It's probably OK to have them on the same cache line if there is only one client and one server.

I'm probably going to merge these into a single atomic that we use as a bit-field. This is mostly because the underlying RPC state machine will need to support more than four states in the future (client send, server reply, client use, server clean). I'm actually not entirely sure what false sharing would mean in this context. This memory is basically a whole page that's shared with the GPU via the PCI(e) bus. I'm not privy enough to the internals to know if this can cause issues.
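
A sketch of those four states with an illustrative bit encoding:

#include <cstdint>

// Two mailbox bits packed into one atomic word; the protocol cycles
// through these states in order, changing one bit per transition.
enum PortState : uint32_t {
  SERVER_CLEAN = 0b00, // buffer is reset and available
  CLIENT_SEND  = 0b01, // client has posted work
  SERVER_REPLY = 0b11, // server has produced a result
  CLIENT_USE   = 0b10, // client is consuming the result
};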

Performance wise, it seems important that the atomics don't cross cache-line boundaries though. And since placement is important (alignof(cpp::Atomic<T>) should be honored for proper codegen), maybe the server should just allocate a sufficiently large chunk of memory and let a common function do the actual placement. This would prevent duplicate logic in each server (loader).

We should definitely put alignas on the struct. I was thinking about doing the above. Having a single buffer makes it much easier to write to the GPU as well. I'm planning on just making this an array of some struct type.
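
A sketch of that direction, with illustrative names and sizes (the patch would use __llvm_libc::cpp::Atomic rather than std::atomic):

#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr size_t NUM_PORTS = 64; // illustrative port count

// alignas keeps each mailbox on its own cache line so writes from the two
// sides never share a line.
struct alignas(64) Mailbox {
  std::atomic<uint32_t> state;
};

struct Slot {
  Mailbox in;      // written only by the other side
  Mailbox out;     // written only by this side
  char buffer[64]; // message payload
};

Slot shared_region[NUM_PORTS]; // one contiguous allocation, one slot per port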