This is an archive of the discontinued LLVM Phabricator instance.

[libc] Support concurrent RPC port access on the GPU
ClosedPublic

Authored by jhuber6 on May 1 2023, 10:51 AM.

Details

Summary

Previously we used a single port to implement the RPC. This was
sufficient for single threaded tests but can potentially cause deadlocks
when using multiple threads. The reason for this is that GPUs make no
forward progress guarantees. Therefore one group of threads waiting on
another group of threads can spin forever because there is no guarantee
that the other threads will continue executing. The typical workaround
for this is to allocate enough memory that a sufficiently large number
of work groups can make progress. As long as this number is somewhat
close to the amount of total concurrency we can obtain reliable
execution around a shared resource.

This patch enables using multiple ports by widening the arrays to a
predetermined size and indexes into them. Empty ports are currently
obtained via a trivial linker scan. This should be imporoved in the
future for performance reasons. Portions of D148191 were applied to
achieve parallel support.

Depends on D149581

Diff Detail

Event Timeline

jhuber6 created this revision.May 1 2023, 10:51 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMay 1 2023, 10:51 AM
jhuber6 requested review of this revision.May 1 2023, 10:51 AM
libc/src/__support/RPC/rpc.h
353

Your data representation induces overhead here.

inbox.load hits the bus to retrieve one bit of information. If it fails, next step through the loop starts over from scratch.

If you change to packed bits, the first atomic load gets you the state of ~32 mailboxes in one tick. If your current index doesn't work out, you now have enough information to pick the next credible candidate without hitting the bus at all.

Also you're loading from the outbox despite the value of that being known already. Maybe the latency on that redundant load is hidden by the latency on loading from the inbox.

libc/src/__support/RPC/rpc.h
355

You could make this less vulnerable to data races on volta by getting the appropriate thread mask outside of the loop. That's also likely to be faster on amdgpu.

jhuber6 added inline comments.May 3 2023, 10:30 AM
libc/src/__support/RPC/rpc.h
353

I can see the performance benefits of bit-packing here I'd assume unless the RPC client is mostly unused the performance penalty of false sharing more atomic indices will be small compared to saving several atomic checks in the client / server. We can also definitely hoist the outbox load.

355

Good point, since this is marked convergent it's unlikely that the compiler will remove this from the loop itself.

libc/src/__support/RPC/rpc.h
398

Is all this a straight copy/paste from the above? Why are client/server open even different functions, they could be identical

jhuber6 added inline comments.May 3 2023, 10:52 AM
libc/src/__support/RPC/rpc.h
398

They're slightly different. The client open function only works if the port is capable of sending data while the server only works if it's capable of receiving it. I could probably template that somehow, but I didn't see the need.

libc/src/__support/RPC/rpc.h
398

Refer to D148191 for an example of using a single function for both. Traditional arguments for not copy/paste/slightly-change functions which pick up performance related changes over time apply really.

jhuber6 updated this revision to Diff 519322.May 3 2023, 6:29 PM

Update to new interface after @JonChesterfield's merge

jhuber6 updated this revision to Diff 519330.May 3 2023, 7:01 PM

Fix again.

JonChesterfield accepted this revision.May 5 2023, 8:07 AM

OK, much easier to parse than the earlier versions. Thanks! There's some cleanup to do post commit I think - in particular the try_open methods are messy and really similar, and it's probably worth having the inbox/outbox/locks pointers private to process.

libc/src/__support/RPC/rpc.h
106

port_count? - its how many ports are allocated, not how big a port is

149

I don't see a remaining use of *this, could be a free function (or at least static)

This revision is now accepted and ready to land.May 5 2023, 8:07 AM
This revision was landed with ongoing or failed builds.May 5 2023, 8:12 AM
This revision was automatically updated to reflect the committed changes.