This is an archive of the discontinued LLVM Phabricator instance.

[libc][gpu] Thread divergence fix on volta
ClosedPublic

Authored by JonChesterfield on Aug 31 2023, 5:49 AM.

Details

Summary

The inbox/outbox loads are performed by the current warp, not a single thread.

The outbox load indicates whether a port has been successfully opened. If some
lanes in the warp think it has and others think the port open failed, because the
warp happened to be diverged when the load occurred, all the subsequent control
flow will be incorrect.

The inbox load indicates whether the machine on the other side of the RPC channel
has progressed. If lanes in the warp have different ideas about that, some will
try to progress their state transition while others won't. As far as the RPC layer
is concerned this is a performance problem, not a correctness one: no lane can
start the transition early, only miss it and start late. In practice, though, the
calls layered on top of RPC do not have the interface required to detect this event
and retry the load on the stalled lanes, so those calls will be broken.

None of this is broken on amdgpu, but it's likely that the readfirstlane will have
beneficial performance properties there. Possibly significant enough that it's
worth landing this ahead of fixing gpu::broadcast_value on volta.
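
The fix amounts to broadcasting the loaded value from the first active lane so
the whole warp agrees on it. A minimal sketch of the pattern, assuming the
libc-internal names gpu::broadcast_value, inbox, and cpp::MemoryOrder (an
illustration of the technique, not the verbatim diff):

```cpp
// Sketch: a Process member that makes the inbox load warp-uniform.
// Each active lane performs the load; under divergence the loads race
// with the store from the other side, so lanes can observe different
// values. Broadcasting the first active lane's result restores a
// single, consistent view across the warp.
LIBC_INLINE uint32_t load_inbox(uint64_t lane_mask, uint64_t index) {
  return gpu::broadcast_value(
      lane_mask, inbox[index].load(cpp::MemoryOrder::RELAXED));
}
```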

Essentially, volta wasn't adequately considered when writing this part of the protocol.
It's a bug present in the initial prototype and propagated thus far because none of
the test cases push volta into a warp-diverged state in the middle of the RPC sequence.

We should have some test cases for volta where port_open and equivalent are called
from diverged warps.
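
A sketch of what such a test could look like, forcing divergence before the port
is opened; the opcode and client API names here are assumptions for illustration:

```cpp
// Force divergence, then open and close a port on each side of the
// branch, so port_open runs from a diverged warp on volta.
uint64_t lane = gpu::get_lane_id();
if (lane % 2) {
  rpc::Client::Port port = rpc::client.open<RPC_TEST_INCREMENT>();
  port.close();
} else {
  rpc::Client::Port port = rpc::client.open<RPC_TEST_INCREMENT>();
  port.close();
}
```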

Diff Detail

Event Timeline

Herald added projects: Restricted Project, Restricted Project. · Aug 31 2023, 5:49 AM
JonChesterfield requested review of this revision. · Aug 31 2023, 5:49 AM

We have a single test that opens and closes in a divergent state; it was the original one that caught deadlocks on AMDGPU: https://github.com/llvm/llvm-project/blob/main/libc/test/integration/startup/gpu/rpc_test.cpp. This is definitely a good idea on AMDGPU; when I originally tested it we ended up exchanging about 7 VGPRs for about 6 SGPRs, which is a very good trade-off. I think the best we can do is just maintain the divergence that we know of when we open the RPC interface. That is, when we broadcast the value we should write it to the mask we know of, since that's always a subset of the "true mask", right?

  • pass lane_mask to broadcast_value
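
A hedged sketch of what passing the mask through could look like on NVPTX,
written with CUDA intrinsics for illustration (the in-tree version uses clang
builtins; the signature here is an assumption):

```cpp
// Broadcast from the lowest lane set in the known mask, rather than
// assuming lane 0 is active.
__device__ uint32_t broadcast_value(uint64_t lane_mask, uint32_t x) {
  uint32_t first = __ffsll(lane_mask) - 1; // lowest active lane id
  return __shfl_sync(static_cast<uint32_t>(lane_mask), x, first);
}
```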

I think the best we can do is just maintain the divergence that we know of when we open the RPC interface. That is, when we broadcast the value we should write it to the mask we know of, since that's always a subset of the "true mask" right?

The path on volta is heavily constrained by the architecture. We cannot know the 'true' lane mask because the architecture doesn't expose it. The general workarounds for handling warp-level divergence on volta are:

1/ conjure the currently active lanes once and hope that's close enough (see the sketch after this list)
2/ pass ~0 into all the intrinsics and hope that's OK
3/ require the application to pass the lanemask around
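
For reference, option 1/ above amounts to something like the following, assuming
CUDA's __activemask(), which reports the lanes that happen to be converged at the
call site rather than any architectural "true" mask:

```cpp
// Conjure the active lanes once, at the start of the RPC sequence, and
// thread this value through every subsequent sync intrinsic.
__device__ uint64_t get_lane_mask() { return __activemask(); }
```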

Fortunately, the RPC layer is structured so that what the GPU logic considers a single RPC call can be handled by multiple calls on the server. We should probably document that design constraint somewhere if it isn't in any of the papers. That means a wholly converged warp makes a single trip to the server, and a worst-case diverged warp might make 32 trips to the server, but the application logic is unchanged and all that happens is a performance slowdown, which we sometimes observe in testing.
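
A hypothetical shape of the server loop this constraint implies (rpc::Server,
try_open, and handle_opcode are illustrative names, not necessarily the in-tree
API):

```cpp
void service(rpc::Server &server) {
  // A wholly converged warp is serviced in one trip through this loop;
  // a fully diverged one may open up to 32 ports for one logical call.
  while (auto port = server.try_open()) {
    handle_opcode(*port);
    port->close();
  }
}
```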

That gives us workaround option 4/: transparently deal with a partial lane-mask implementation. That mostly makes the port manipulation logic harder to write; try_lock and unlock are where this surfaces in our implementation. It also means it's _really_ important that we consistently use the same lane_mask value we opened a port with, which is hopefully constrained by the type system and our tendency to store it in Process.
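
An assumption-laden sketch of where that surfaces in try_lock: only the first
lane of the mask we know about touches the lock word, and the result is broadcast
so every lane in that mask agrees on the outcome (find_first_lane and the lock
array are hypothetical stand-ins):

```cpp
LIBC_INLINE bool try_lock(uint64_t lane_mask, uint64_t index) {
  bool acquired = false;
  // Lanes outside lane_mask never run this, which is why the same
  // mask must be carried through to the matching unlock.
  if (gpu::get_lane_id() == find_first_lane(lane_mask))
    acquired = lock[index].fetch_or(1, cpp::MemoryOrder::ACQUIRE) == 0;
  return gpu::broadcast_value(lane_mask, acquired);
}
```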

This particular bug is that I missed the race condition on loads over PCIe from a diverged warp.

JonChesterfield retitled this revision from [libc][gpu] Thread divergence fix on volta, WIP to [libc][gpu] Thread divergence fix on volta. · Aug 31 2023, 6:14 AM
JonChesterfield edited the summary of this revision.
libc/src/__support/RPC/rpc.h
337

These values may be a problem. I think Port instances usually end up on the stack in the hope that SROA will disassemble the pieces. lane_mask and index should be invariant over the lifetime of the Port. I'm not sure what the out variable is; my recollection is that it's a cache of the outbox state, but if that's true it should be a bool.

I'd guess out/receive/owns are mutable and toggle back and forth as the machine executes.

Mutable or not, I think these should all be warp invariant, and whether that invariant is known at compile time or not is probably an emergent feature of the optimiser at the moment.
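
A hypothetical layout of the state being discussed, with the invariants from the
comments above (field names follow the discussion, not necessarily the committed
header):

```cpp
struct Port {
  uint64_t lane_mask;  // warp-invariant for the Port's lifetime
  uint64_t index;      // warp-invariant for the Port's lifetime
  uint32_t out;        // cached outbox state; arguably should be a bool
  bool receive;        // mutable, toggles as the state machine advances
  bool owns_buffer;    // mutable, toggles with buffer ownership
};
```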

jhuber6 accepted this revision. · Aug 31 2023, 6:20 AM

LG with nits.

libc/src/__support/GPU/amdgpu/utils.h
128–130

You can just leave out the argument, same below.
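
The nit in code form, a sketch assuming the AMDGPU broadcast is a wave-wide
readfirstlane (the builtin is the real clang builtin; the surrounding signature
is illustrative): rather than keeping the parameter and casting it to void,
leave it unnamed.

```cpp
LIBC_INLINE uint32_t broadcast_value(uint64_t /*lane_mask*/, uint32_t x) {
  // readfirstlane is uniform across the wave, so the mask is unused.
  return __builtin_amdgcn_readfirstlane(x);
}
```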

libc/src/__support/GPU/nvptx/utils.h
127

Unrelated?

This revision is now accepted and ready to land. · Aug 31 2023, 6:20 AM
  • got libc to build, fix smoke test
  • review comment, drop the void cast of unused lane mask
This revision was landed with ongoing or failed builds. · Aug 31 2023, 6:34 AM
This revision was automatically updated to reflect the committed changes.