This is an archive of the discontinued LLVM Phabricator instance.

[libc] Update RPC interface for system utilities on the GPU
ClosedPublic

Authored by jhuber6 on Apr 13 2023, 7:38 PM.

Details

Summary

This patch reworks the RPC interface to allow more generic memory
operations using the shared better. This patch decomposes the entire RPC
interface into opening a port and calling send or recv on it.

The send function sends a single packet of the length of the buffer.
The recv function is paired with the send call to then use the data.
So, any aribtrary combination of sending packets is possible. The only
restriction is that the client initiates the exchange with a send
while the server consumes it with a recv.

The operation of this is driven by two independent state machines that
tracks the buffer ownership during loads / stores. We keep track of two
so that we can transition between a send state and a recv state without
an extra wait. State transitions are observed via bit toggling, e.g.

This interface supports an efficient send -> ack -> send -> ack -> send
interface and allows for the last send to be ignored without checking
the ack.

A following patch will add some more comprehensive testing to this interface. I
I informally made an RPC call that simply incremented an integer and it took
roughly 10 microsends to complete an RPC call.

Diff Detail

Event Timeline

jhuber6 created this revision.Apr 13 2023, 7:38 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 13 2023, 7:38 PM
jhuber6 requested review of this revision.Apr 13 2023, 7:38 PM
goldstein.w.n added inline comments.
libc/src/__support/RPC/rpc.h
57

IIUC this means it must be send/recv/senc/recv back-and-forth pattern? This seems contradictory to "mostly arbitrary combinations of 'send' and 'recv'" no?

goldstein.w.n added inline comments.Apr 13 2023, 8:28 PM
libc/src/__support/RPC/rpc.h
255

given that is only two states, exchange is enough, no need for CAS.

Likewise below.

jhuber6 updated this revision to Diff 513565.Apr 14 2023, 6:15 AM

Add sleeps and update recv_n.

jhuber6 added inline comments.Apr 14 2023, 7:30 AM
libc/src/__support/RPC/rpc.h
57

Yes, poor choice of words. I meant arbitrary combinations on one side as long as it's mirrored on the other side. So, the following will work:

Client:  send -> send -> send -> recv -> send -> recv -> recv -> send -> recv
Server: recv -> recv -> recv -> send -> recv -> send -> send -> recv -> send
255

Good point, it's mostly a placeholder right now as we only support single threaded access. (At least this means if you call it with multiple threads it will wait until the other one is done).

jhuber6 updated this revision to Diff 513590.Apr 14 2023, 7:37 AM

Change atomic and adjust comments.

jhuber6 updated this revision to Diff 513617.Apr 14 2023, 8:51 AM

Remove unused variable.

jhuber6 marked 2 inline comments as done.Apr 14 2023, 8:52 AM
lntue added inline comments.Apr 14 2023, 9:42 AM
libc/src/__support/OSUtil/gpu/io.cpp
19

Does send_n still expect msg.data() is null-terminated?

jhuber6 added inline comments.Apr 14 2023, 10:04 AM
libc/src/__support/OSUtil/gpu/io.cpp
19

We copy the entire buffer now, including the terminating null byte. Previously we had to add a null byte because we printed it in chunks. it worked alright but would be sub-optimal when multiple threads are involved.

lntue added inline comments.Apr 14 2023, 8:02 PM
libc/src/__support/OSUtil/gpu/quick_exit.cpp
21

capture-by-value is better here.

libc/src/__support/RPC/rpc.h
33

Can this be enum class ?

74

Since this class is internally used, shall we make it enum class to be a bit safer? If you don't want to keep adding Flags::Data, you can add using Flags::Data and using Flags::Ack, or simply add using enum Flags with C++20 inside struct Process.

167

a != b is a safer option for boolean than bitwise-xor.

213

capture-by-value is better here.

230

capture-by-reference is not needed.

233

I think capture-by-value is good enough here.

250

a != b is a safer option for boolean than bitwise-xor.

262

a != b is a safer option for boolean than bitwise-xor.

265

nit: I actually prefer while (true) to for (;;)

287

a != b is actually a safer xor operator for boolean than bitwise-xor ^. So logically, this is equivalent to

bool(in & Data) == bool(out & Process::Ack)

Can you double check if it is correct?

299

nit: I actually prefer while (true) to for (;;)

libc/utils/gpu/loader/Server.h
41

nothing is captured with this lambda.

47

nothing is captured with this lambda.

jhuber6 updated this revision to Diff 513899.Apr 15 2023, 6:45 AM
jhuber6 marked 7 inline comments as done.

Addressing comments

jhuber6 added inline comments.Apr 15 2023, 6:47 AM
libc/src/__support/RPC/rpc.h
33

We could, but since this is passed over the wire I think it's okay to treat it as just an integer.

167

These are bits, we can't do things like & if it's an enum class without a ton of conversions. It should be uint32_t however.

230

This copies the size out.

265

I saw examples of both in the codebase, Maybe we should make a style rule for that.

287

The compiler will reduce it to an XOR AFAIK, but I explicitly use the language "bits are not equal" in the logic, so it's a lot clearer.

jhuber6 updated this revision to Diff 513937.Apr 15 2023, 2:25 PM

Make a few changes. Add a placeholder for an index for when multiple ports
will be supported. Also improve performance by removing the atomic load of the
write-only variable out.

jhuber6 retitled this revision from [libc] Update RPC interface for system utilities on th GPU to [libc] Update RPC interface for system utilities on the GPU.Apr 15 2023, 2:28 PM
jhuber6 updated this revision to Diff 514058.Apr 16 2023, 4:13 PM

Get rid of RAII interface for now. Was causing issues when using multiple threads.

some comments, waiting on the next version now.

libc/src/__support/RPC/rpc.h
42

I would argue data should be int8_t[62] to get the size right.

74

Enum class or just constexpr constants.

164–166

Complex logic, please put it into a helper with comment. Probably the entire while loop.

173–174

reorder to simplify?

185–186

Same as above.

236

Here and above, move the cast out the loop, or at least stmt, to make it simpler (to read).

258–259

This should use the helper as well, and pass a retry flag as false, hence give up rather than loop. The scheme above otherwise exists 4 times.

jhuber6 updated this revision to Diff 514356.Apr 17 2023, 12:13 PM

Addressing comments.

jhuber6 marked 10 inline comments as done.Apr 17 2023, 1:14 PM
jhuber6 added inline comments.
libc/src/__support/RPC/rpc.h
164–166

Since the other use is an if I think it's just helpful to put the logic to a named function. I think showing an explicit busy-wait is good.

Last real issue I have is the try open loop, see below.

libc/src/__support/RPC/rpc.h
43–44

static assert sizeof

303

I think {try,}open need a common helper impl. but should not use each other.
The problem right now is that we check the in/out multiple times for no good reason.
In fact, we might get stuck in a weird state which we could reasonably report as error.
Why not a single impl. with a flag, in case it is set you loop, otherwise you give up?

libc/utils/gpu/loader/Server.h
42

FWIW, later we want an easier way to get a single Ty value from the RPC. Seems like something common that should hide the casts and encoding.

jhuber6 marked an inline comment as done.Apr 17 2023, 2:33 PM
jhuber6 added inline comments.
libc/src/__support/RPC/rpc.h
43–44

My future plan is to collapse all this inbox / outbox stuff into a struct of arrays, but we should go for a static 64-byte buffer in any case.

303

The logic is that the client can only open a port capable of sending and the server can only open a port capable of receiving.

libc/utils/gpu/loader/Server.h
42

Definitely, I'm already tired of reinterpret cast.

JonChesterfield added a comment.EditedApr 17 2023, 7:24 PM
This comment has been deleted.
libc/src/__support/RPC/rpc.h
253

This leaks ports if they're closed after a non-even number of send or recv calls

libc/utils/gpu/loader/Server.h
48

This leaks the port here^

jhuber6 added inline comments.Apr 17 2023, 7:35 PM
libc/src/__support/RPC/rpc.h
253

The port can be opened in the Ack / Data state 0, 0 or 1, 1. it doesn't get stuck at 1, 1 because we treat 1, 1 the same as 0, 0 in the logic.

libc/src/__support/RPC/rpc.h
253

Uh, maybe. I'm still having trouble mapping between in/data in/ack out/data out/ack bits and client!=server and the two bit, client==server models.

I'm fine with this. Any objections?

libc/utils/gpu/loader/Server.h
27
JonChesterfield added a comment.EditedApr 18 2023, 1:18 PM

Main objection from me is it uses twice as much shared state as necessary. Using many more states than necessary makes the invariants difficult to follow.

Specifically, this implements the four states GPU owns, send to host, host owns, send to GPU, which are the right states for mutex here, but it does it using four bits of state instead of two. It misses the fundamental symmetry in the system - send and recv are the same thing.

Joseph and I are slowly working through that offline.

jdoerfert accepted this revision.Apr 19 2023, 10:09 AM

We can reduce the bits from 4 to 2 later if that results in less complexity or better performance. I don't think that is in itself a blocker. When we see the patch to change it we can probably easily tell if it is simpler, and we hopefully have more stress testing then to trust whatever changes we make.

This revision is now accepted and ready to land.Apr 19 2023, 10:09 AM

I think we should land D148191 and then adopt this send/recv scheme on top of it if preferred. That patch keeps the current semantics while extending to multiple warps, in a fashion which is known to support the interface proposed here.

What this patch does is simultaneously change the underlying state machine and the layers of code built on top of it, in a fashion which I think will need to be revised again when changing to support multiple warps.

Going to land this with the expectation that we can hack on it later and it will make subsequent patches easier to follow. Portions from D148191 will be added soon.