This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add support for global ctors / dtors for AMDGPU
Closed · Public

Authored by jhuber6 on Apr 27 2023, 6:25 PM.

Details

Summary

This patch makes the necessary changes to support calling global
constructors and destructors on the GPU. The patch in D149340 allows the
lld linker to create the symbols pointing us to these globals. These
should be executed by a single thread, which is more difficult on the
GPU because all threads are active. I chose to use an atomic counter to
sync every thread on the GPU. This is very slow if you use more than a
few thousand threads, but for testing purposes it should be sufficient.
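The atomic-counter approach described above can be sketched on the host as follows. This is an illustration under stated assumptions, not the code from the patch: `CounterBarrier`, `init_array`, `fini_array`, and `run_simulated_launch` are hypothetical names, `std::thread` stands in for GPU threads, and a hand-written callback table stands in for the symbols the linker would provide.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical global that a constructor sets up and a destructor tears down.
static int global_value = 0;
static void construct_global() { global_value = 42; }
static void destroy_global() { global_value = 0; }

// Hypothetical stand-ins for the linker-provided constructor/destructor lists.
using Callback = void();
static Callback *const init_array[] = {construct_global};
static Callback *const fini_array[] = {destroy_global};

// A monotonic counter used as a barrier: each thread announces its arrival
// and then spins until `target` arrivals have been counted in total.
struct CounterBarrier {
  std::atomic<uint32_t> arrived{0};
  void arrive_and_wait(uint32_t target) {
    arrived.fetch_add(1, std::memory_order_acq_rel);
    while (arrived.load(std::memory_order_acquire) < target)
      std::this_thread::yield();
  }
};

// Returns how many threads observed the constructed global inside "main".
static int run_simulated_launch(uint32_t num_threads) {
  assert(num_threads > 0);
  CounterBarrier barrier;
  std::atomic<int> saw_constructed{0};
  std::vector<std::thread> threads;
  for (uint32_t id = 0; id < num_threads; ++id) {
    threads.emplace_back([&, id] {
      if (id == 0) // only one thread runs the constructors
        for (Callback *ctor : init_array)
          ctor();
      barrier.arrive_and_wait(num_threads); // nobody enters "main" early

      if (global_value == 42) // stand-in for the call to main
        saw_constructed.fetch_add(1, std::memory_order_relaxed);

      barrier.arrive_and_wait(2 * num_threads); // everyone has left "main"
      if (id == 0) // only one thread runs the destructors
        for (Callback *dtor : fini_array)
          dtor();
    });
  }
  for (std::thread &t : threads)
    t.join();
  return saw_constructed.load();
}
```

Calling `run_simulated_launch(8)` should report that all eight threads saw the constructed value. Every thread spins on the same counter, which is consistent with the summary's remark that this scales poorly to thousands of threads.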

Depends on D149340 D149363

Event Timeline

jhuber6 created this revision. Apr 27 2023, 6:25 PM
Herald added projects: Restricted Project, Restricted Project. Apr 27 2023, 6:25 PM
jhuber6 requested review of this revision. Apr 27 2023, 6:25 PM
jhuber6 updated this revision to Diff 518045. Apr 28 2023, 1:55 PM

Remove unused dependency.

sivachandra added inline comments. Apr 28 2023, 10:28 PM
libc/startup/gpu/amdgpu/start.cpp
61

Nit: Explicit namespace scoping to make it clear to the reader which atexit is being called: __llvm_libc::atexit.

67

Can this be avoided at all? As in, if there are globals that have to be initialized on the GPU, then all threads have to wait until they can start using those globals?

sivachandra accepted this revision. Apr 28 2023, 10:28 PM
This revision is now accepted and ready to land. Apr 28 2023, 10:28 PM
JonChesterfield requested changes to this revision. Apr 28 2023, 11:49 PM

Counting on the global looks like a DIY barrier, which is ok, but I can't see anything that stops reordering of operations past the initialisation code run on thread zero.

libc/startup/gpu/amdgpu/start.cpp
71

This is missing a fence. Noth

This revision now requires changes to proceed. Apr 28 2023, 11:49 PM
jhuber6 marked an inline comment as done. Apr 29 2023, 4:11 AM

> Counting on the global looks like a DIY barrier, which is ok, but I can't see anything that stops reordering of operations past the initialisation code run on thread zero.

Obviously we need the DIY barrier because there's no built-in functionality to globally sync on the device. Once the globals have been initialized I simply assume that they'll call main in an orderly fashion and then we wait again at the barrier before finishing.

libc/startup/gpu/amdgpu/start.cpp
67

Generally I just assume it's unsafe to have any GPU threads calling main before we've run all the global constructors. We could reduce this to a regular sync if we placed every global object in thread shared memory, however. Then this would be a simple gpu::sync_threads. But that would require modifying the source to put [[clang::address_space(3)]] around everything, and shared memory is a pretty scarce resource.

71

The implementation of sync_threads has a fence.
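To illustrate what the fence contributes, here is a minimal host-side sketch, assuming standard C++ atomics rather than the GPU intrinsics used in the patch. The counter itself uses relaxed operations, so the explicit release and acquire fences are the only thing keeping thread zero's initialization ordered before the other threads' reads. `payload` and `run_fenced_barrier` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical state that stands in for something a global constructor sets up.
static int payload = 0;

// Returns how many threads saw the initialized payload after the barrier.
static int run_fenced_barrier(uint32_t num_threads) {
  assert(num_threads > 0);
  payload = 0;
  std::atomic<uint32_t> arrived{0};
  std::atomic<int> saw_init{0};
  std::vector<std::thread> threads;
  for (uint32_t id = 0; id < num_threads; ++id) {
    threads.emplace_back([&, id] {
      if (id == 0)
        payload = 7; // initialization performed by thread zero only

      // Release fence: the write above cannot be reordered after the
      // increment that announces this thread's arrival.
      std::atomic_thread_fence(std::memory_order_release);
      arrived.fetch_add(1, std::memory_order_relaxed);
      while (arrived.load(std::memory_order_relaxed) < num_threads)
        std::this_thread::yield();
      // Acquire fence: the read below cannot be reordered before the
      // spin loop that observed every arrival.
      std::atomic_thread_fence(std::memory_order_acquire);

      if (payload == 7)
        saw_init.fetch_add(1, std::memory_order_relaxed);
    });
  }
  for (std::thread &t : threads)
    t.join();
  return saw_init.load();
}
```

Without the two fences, the relaxed counter alone would give no ordering guarantee for `payload`, which is the reordering concern raised in the review.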

JonChesterfield resigned from this revision. Apr 29 2023, 6:27 AM
This revision is now accepted and ready to land. Apr 29 2023, 6:27 AM
This revision was automatically updated to reflect the committed changes.
jhuber6 added inline comments. Apr 29 2023, 7:14 AM
libc/startup/gpu/amdgpu/start.cpp
67

Another solution here is to have a separate kernel that we call to do the initialization and then we call main. Generally it thwarts a few optimizations to have global state shared between kernel calls, but I don't think we care about that here. I'll make a patch to do that instead sometime in the future rather than having a weird global barrier the hardware doesn't support. That should allow us to run tests on a fully saturated GPU.
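The separate-kernel idea could look roughly like the host-side sketch below. It is only a model under stated assumptions: the hypothetical `launch` helper runs a callable on a number of host threads and waits for all of them, standing in for a kernel launch followed by a wait for completion, and `table_size` and `run_two_phase` are likewise made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical global that needs construction before "main" may run.
static int table_size = 0;

// Hypothetical launch helper: run kernel(thread_id) on n host threads and
// return only once all of them have finished.
static void launch(uint32_t n, const std::function<void(uint32_t)> &kernel) {
  std::vector<std::thread> threads;
  for (uint32_t id = 0; id < n; ++id)
    threads.emplace_back(kernel, id);
  for (std::thread &t : threads)
    t.join();
}

// Returns how many "main" threads observed the constructed global.
static int run_two_phase(uint32_t num_threads) {
  assert(num_threads > 0);
  table_size = 0;
  std::atomic<int> saw_init{0};

  // Phase 1: a single-thread launch runs the constructors.
  launch(1, [](uint32_t) { table_size = 128; });

  // Phase 2: the full launch runs "main"; no in-kernel barrier is needed
  // because phase 1 has already completed.
  launch(num_threads, [&](uint32_t) {
    if (table_size == 128)
      saw_init.fetch_add(1, std::memory_order_relaxed);
  });

  // Phase 3: a single-thread launch runs the destructors.
  launch(1, [](uint32_t) { table_size = 0; });
  return saw_init.load();
}
```

Because each phase finishes before the next one starts, ordering comes from the launch boundaries rather than from a spin loop, so the number of threads in the middle phase is not limited by barrier cost.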