This is an archive of the discontinued LLVM Phabricator instance.

[libc] Enable multiple threads to use RPC on the GPU
ClosedPublic

Authored by jhuber6 on Apr 21 2023, 10:03 AM.

Details

Summary

The execution model of the GPU expects that groups of threads will
execute in lock-step in SIMD fashion. It's both important for
performance and correctness that we treat this as the smallest possible
granularity for an RPC operation. Thus, we map multiple threads to a
single larger buffer and ship that across the wire.
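For illustration, a minimal sketch of that mapping (hypothetical names and sizes, not the actual rpc.h layout): each lane gets its own fixed-size slot inside one shared packet, and the whole packet is the unit that crosses the wire.

```cpp
#include <stdint.h>

// Largest warp/wavefront we expect to serve; illustrative value only.
constexpr uint64_t MAX_LANE_SIZE = 64;

// Hypothetical per-lane payload.
struct Buffer {
  uint64_t data[8];
};

// One RPC packet carries a slot for every lane plus the mask of lanes that
// actually participated, so partially-active warps still round-trip intact.
struct Packet {
  uint64_t activemask;
  Buffer slots[MAX_LANE_SIZE];
};
```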

This patch makes the necessary changes to support executing the RPC on
the GPU with multiple threads. This requires some workarounds to mimic
the model when handling the protocol from the CPU. I'm not completely
happy with some of the workarounds required, but I think it should work.

Uses some of the implementation details from D148191.

Diff Detail

Event Timeline

jhuber6 created this revision.Apr 21 2023, 10:03 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 21 2023, 10:03 AM
jhuber6 requested review of this revision.Apr 21 2023, 10:03 AM
jhuber6 updated this revision to Diff 515818.Apr 21 2023, 10:12 AM

Cleanup leftover debugging code.

Drive

libc/src/__support/RPC/rpc.h
48

Or

56

We should have a single generic warp size macro

115

why is data not typed here?

284

shouldn't 0 be something like gpu::get_id_in_lane()? Also below.

368

return;
}

jhuber6 marked 2 inline comments as done.Apr 21 2023, 11:09 AM
jhuber6 added inline comments.
libc/src/__support/RPC/rpc.h
56

True, we can put this in the GPU utils.
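As a sketch of what that could look like, assuming clang's usual target macros (the real GPU utils header may spell it differently):

```cpp
#include <stdint.h>

#if defined(__AMDGCN__)
// Assumption: the predefined wavefront-size macro is available here.
constexpr uint32_t LANE_SIZE = __AMDGCN_WAVEFRONT_SIZE__; // 32 or 64
#elif defined(__NVPTX__)
constexpr uint32_t LANE_SIZE = 32; // warp size on NVPTX
#else
constexpr uint32_t LANE_SIZE = 1;  // host: every "lane" is its own thread
#endif
```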

115

Just easier to manage since this needs to be copied from the host to the GPU and it's easier to model arguments to the kernel as void pointers.

284

This is basically just to let the CPU pretend it has all the "threads" the GPU has. So given the GPU's lane size of 32, it'll execute this loop once. The CPU, however, will execute it 32 times because we set its lane size to 1. The idx is just an offset as you march in multiples of the lane size.
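To make that concrete, a simplified sketch of the pattern being described (not the exact rpc.h code), using hypothetical stand-ins for the GPU utilities:

```cpp
#include <stdint.h>

// Hypothetical stand-ins (see the sketches above): on the GPU, LANE_SIZE
// would be the warp/wavefront width and get_lane_id() the hardware lane id;
// on the host they collapse to 1 and 0.
constexpr uint32_t LANE_SIZE = 1;
inline uint32_t get_lane_id() { return 0; }
struct Buffer { uint64_t data[8]; };

template <typename F>
void invoke_rpc(F &&fn, Buffer *slots, uint32_t gpu_lane_size,
                uint64_t lane_mask) {
  // GPU:  LANE_SIZE == gpu_lane_size, so the loop body runs once and each
  //       hardware lane touches only its own slot.
  // Host: LANE_SIZE == 1, so this single thread loops gpu_lane_size times,
  //       standing in for each GPU "lane"; idx is the offset it marches by.
  for (uint32_t idx = 0; idx < gpu_lane_size; idx += LANE_SIZE)
    if (lane_mask & (1ull << (idx + get_lane_id())))
      fn(&slots[idx + get_lane_id()]);
}
```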

jhuber6 updated this revision to Diff 515848.Apr 21 2023, 11:18 AM

Address comments

jhuber6 updated this revision to Diff 515849.Apr 21 2023, 11:21 AM

Add MAX_LANE_SIZE for situations where we get the host to allocate enough to accommodate all sizes unconditionally.

This works as expected on my gfx1030, but seems to fail spuriously on sm_70. It's probably something to do with the Volta threading model; I'll need to investigate some more.

jhuber6 updated this revision to Diff 515969.Apr 21 2023, 6:00 PM

Updating to use an overloaded function type rather than making every function take a uint32_t.

jhuber6 updated this revision to Diff 516156.Apr 23 2023, 4:37 AM

Add Volta warp syncs on divergent paths.

This should work as expected now on both AMDGPU and NVPTX.

jhuber6 updated this revision to Diff 516171.Apr 23 2023, 8:34 AM

Fix some iffy control flow that sometimes caused problems on AMDGPU. I've run
the tests 1000 times on both my gfx1030 and sm_70 and didn't observe any
failures, so I'm reasonably confident.

jhuber6 updated this revision to Diff 516819.Apr 25 2023, 8:26 AM

Update some names and allocate the proper amount of memory.

Also this adds a hack to work around a problem with the optimizer. The GPU
executes groups of threads in SIMD style, which means basic blocks turn into
thread masks. The problem was occurring when the optimizer would sink the
common `close()` call that releases the thread lock until after another
`open()` call within the same lane. This can be seen at
https://godbolt.org/z/cfEr4n5TE. There is some ongoing work to make this
actually work on AMD, but for now it's a hack to prevent these function calls
from being merged. It has a performance penalty, but it's minor.
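For illustration only, one way such a barrier can look (the actual workaround in the patch may differ): an empty volatile asm statement that the optimizer will not merge or move calls across, placed so the close cannot be sunk past a later open in the same lane.

```cpp
// Hypothetical sketch, not the patch's exact hack.
inline void optimization_barrier() { __asm__ volatile("" ::: "memory"); }

void rpc_transaction(void (*open_port)(), void (*close_port)()) {
  open_port();
  // ... exchange data through the shared buffer ...
  optimization_barrier(); // keep close_port() pinned here so it cannot be
  close_port();           // merged with / sunk past a following open_port()
}
```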

jhuber6 updated this revision to Diff 517874.Apr 28 2023, 4:50 AM
jhuber6 added a subscriber: nhaehnle.

Add a lane sync to address the convergence problem. Thanks to @nhaehnle for the suggestion.
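A sketch of what such a lane sync can look like on the two targets, assuming clang's builtins (the actual wrapper in the GPU utils may differ):

```cpp
#include <stdint.h>

inline void sync_lane(uint64_t lane_mask) {
#if defined(__NVPTX__)
  // Volta's independent thread scheduling requires an explicit reconvergence
  // point for the named lanes of the warp.
  __nvvm_bar_warp_sync(static_cast<uint32_t>(lane_mask));
#elif defined(__AMDGCN__)
  // Wavefronts already run in lockstep; the wave barrier pins a convergence
  // point that the compiler must respect.
  (void)lane_mask;
  __builtin_amdgcn_wave_barrier();
#else
  (void)lane_mask; // single-threaded host: nothing to synchronize
#endif
}
```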

jhuber6 updated this revision to Diff 517889.Apr 28 2023, 6:04 AM

Add acquire / release semantics on the device lock. The NVPTX backend doesn't
support release on stores, so simply use an explicit fence and a relaxed
operation. The fence here will actually reduce to a full memory barrier, but
that should just be a slight performance hit because of Nvidia's lack of
support.
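A hedged sketch of that pattern, shown with std::atomic for readability (the real code uses the libc-internal atomic wrappers):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> lock{0}; // illustrative per-port lock word

void unlock_port() {
  // Preferred form: lock.store(0, std::memory_order_release);
  // Since the NVPTX backend cannot lower a release store, use an explicit
  // release fence followed by a relaxed store instead. The fence lowers to
  // a full memory barrier there, which is the slight performance hit noted
  // above, but the ordering is still correct.
  std::atomic_thread_fence(std::memory_order_release);
  lock.store(0, std::memory_order_relaxed);
}
```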

What's the reasoning behind the placement of sync_lane calls? If it's meant to be one per basic block, some are missing. The comment says it's added to close but the code has more calls than that.

If it's an amdgpu-specific hack it should probably be named based on that and excluded on nvptx. Currently it reads like nvptx picks up a lot of synchronisation it doesn't need.

jhuber6 added a comment.EditedApr 28 2023, 6:42 AM

What's the reasoning behind the placement of sync_lane calls? If it's meant to be one per basic block, some are missing. The comment says it's added to close but the code has more calls than that.

If it's an amdgpu-specific hack it should probably be named based on that and excluded on nvptx. Currently it reads like nvptx picks up a lot of synchronisation it doesn't need.

So, the convergence problem was present on Nvidia as well; it's just that its hardware model prevented it from deadlocking, so it instead serialized the entire region. Right now we have a sync_lane in front of the open and the close, mostly for safety, to enforce the expected convergence of these operations, e.g. every thread that opens the port should close the port as well in the same context. The last sync is definitely necessary on the send_n call because the threads could copy variable amounts of data.

jhuber6 updated this revision to Diff 518056.Apr 28 2023, 2:29 PM

Add a check to detect divergence.

jhuber6 updated this revision to Diff 519306.May 3 2023, 5:00 PM

Updating to the new interface from @JonChesterfield

jhuber6 updated this revision to Diff 519573.May 4 2023, 11:07 AM

Rebasing; the send and recv interface still needs to use the mask stored in the buffer because it's shared.

jhuber6 updated this revision to Diff 519620.May 4 2023, 12:54 PM

Remove file moved into another patch.

jhuber6 updated this revision to Diff 519664.May 4 2023, 2:45 PM

Rebasing; this currently deadlocks on the RPC test. Need to fix that.

jhuber6 updated this revision to Diff 519686.May 4 2023, 3:52 PM

Fixed the deadlock; the problem was that the second lane sync wasn't rebased in.

JonChesterfield accepted this revision.May 4 2023, 4:08 PM

I think this is OK. It seems likely that we'll be able to simplify parts of it later. I'll remove the outdated part of the commit message shortly; the state machine has changed since it was written.

libc/src/__support/RPC/rpc.h
106–107

Not keen on this being a runtime value but it's not that clear how to avoid that on the host side so this seems OK

188

a bit frustrated that we need this sync for now

419–423

I think we're better off without the is_first_lane here, i.e. let all active lanes write the same value to the buffer, but that should only be a codegen improvement.
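A small sketch of the suggestion with hypothetical names: the guarded form costs a divergent branch, while letting every active lane store the identical value gives the same result without one.

```cpp
#include <stdint.h>

struct Header { uint64_t opcode; }; // illustrative header field

// Hypothetical helper, simplified: true only for the lowest set lane.
bool is_first_lane(uint64_t lane_mask) { return (lane_mask & 1) != 0; }

// Guarded form: only the first active lane performs the store.
void set_opcode_guarded(Header *buffer, uint64_t lane_mask, uint64_t opcode) {
  if (is_first_lane(lane_mask))
    buffer->opcode = opcode;
}

// Suggested form: every active lane writes the identical value to the same
// location, so the result is unchanged but no branch is needed.
void set_opcode_unguarded(Header *buffer, uint64_t opcode) {
  buffer->opcode = opcode;
}
```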

libc/src/__support/RPC/rpc_util.h
33

weird that these show up in the phab diff, they should already be in trunk

This revision is now accepted and ready to land.May 4 2023, 4:08 PM
JonChesterfield edited the summary of this revision. (Show Details)May 4 2023, 4:08 PM
JonChesterfield edited the summary of this revision. (Show Details)
This revision was landed with ongoing or failed builds.May 4 2023, 5:32 PM
This revision was automatically updated to reflect the committed changes.