This patch makes the necessary changes to support calling global
constructors and destructors on the GPU. The patch in D149340 allows the
lld linker to create the symbols pointing us to these globals. These
should be executed by a single thread, which is more difficult on the
GPU because all threads are active. I chose to use an atomic counter to
sync every thread on the GPU. This is very slow if you use more than a
few thousand threads, but for testing purposes it should be sufficient.
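As a rough illustration of the counter-based sync described above, here is a minimal host-side sketch that uses `std::thread` in place of GPU threads. It is not the code from this patch: `counter_barrier`, `run_with_barrier`, and `global_state` are hypothetical names, and the real startup code uses the GPU utilities in libc rather than the C++ standard library. The release/acquire ordering on the counter is what makes thread 0's initialization visible to the other threads once they leave the barrier.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical one-shot barrier: each thread bumps the counter once and then
// spins until all `num_threads` increments are visible.
static void counter_barrier(std::atomic<uint32_t> &counter,
                            uint32_t num_threads) {
  // acq_rel: publish this thread's earlier writes and join the release
  // sequence started by the threads that arrived before it.
  counter.fetch_add(1, std::memory_order_acq_rel);
  // acquire: once the final count is observed, every write made before any
  // thread's increment (including the "constructors") is visible here.
  while (counter.load(std::memory_order_acquire) < num_threads)
    std::this_thread::yield();
}

// Simulates the startup sequence on the host and returns how many threads
// observed the initialized global while running their stand-in for main.
int run_with_barrier(uint32_t num_threads) {
  assert(num_threads > 0);
  std::atomic<uint32_t> init_counter{0};
  std::atomic<uint32_t> fini_counter{0};
  std::atomic<int> saw_initialized{0};
  int global_state = 0; // stands in for a global with a constructor

  std::vector<std::thread> threads;
  for (uint32_t id = 0; id < num_threads; ++id) {
    threads.emplace_back([&, id] {
      if (id == 0)
        global_state = 42; // "global constructors", run by one thread only
      counter_barrier(init_counter, num_threads);

      // Stand-in for main: every thread reads the initialized global.
      if (global_state == 42)
        saw_initialized.fetch_add(1, std::memory_order_relaxed);

      // Wait again so that teardown cannot overlap with any thread's main.
      counter_barrier(fini_counter, num_threads);
      if (id == 0)
        global_state = 0; // "global destructors", run by one thread only
    });
  }
  for (std::thread &t : threads)
    t.join();
  return saw_initialized.load();
}
```

Calling `run_with_barrier(8)` should report that all eight threads saw the initialized value. The spin loop is also where the cost comes from: every thread keeps polling one shared counter until the last one arrives, which is why this approach scales poorly as the thread count grows.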
Repository: rG LLVM Github Monorepo
Event Timeline
libc/startup/gpu/amdgpu/start.cpp | ||
---|---|---|
61 | Nit: Explicit namespace scoping to make it clear to the reader which atexit is being called: __llvm_libc::atexit. | |
67 | Can this be avoided at all? As in, if there are globals that have to be initialized on the GPU, then all threads have to wait until they can start using those globals? |
Counting on the global looks like a DIY barrier, which is ok, but I can't see anything that stops reordering of operations past the initialisation code run on thread zero.
libc/startup/gpu/amdgpu/start.cpp | ||
---|---|---|
71 | This is missing a fence. Noth |
Obviously we need the DIY barrier because there's no built-in functionality to globally sync on the device. Once the globals have been initialized I simply assume that they'll call main in an orderly fashion and then we wait again at the barrier before finishing.
libc/startup/gpu/amdgpu/start.cpp | ||
---|---|---|
67 | Generally I just assume it's unsafe to have any GPU threads calling main before we've run all the global constructors. However, we could reduce this to a regular sync if we placed every global object in thread shared memory; then this would be a simple gpu::sync_threads. But that would require modifying the source to put [[clang::address_space(3)]] on everything, and shared memory is a pretty scarce resource. | |
71 | The implementation of sync_threads has a fence. |
libc/startup/gpu/amdgpu/start.cpp | ||
---|---|---|
67 | Another solution here is to have a separate kernel that we call to do the initialization and then we call main. Generally it thwarts a few optimizations to have global state shared between kernel calls, but I don't think we care about that here. I'll make a patch to do that instead sometime in the future rather than having a weird global barrier the hardware doesn't support. That should allow us to run tests on a fully saturated GPU. |
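The separate-kernel idea in the last comment can be sketched on the host in the same way. This is only an analogy under stated assumptions: each `std::thread` launch and `join` stands in for a kernel launch and its completion, and `run_with_separate_phases` and `global_state` are hypothetical names, not part of the patch or of any GPU runtime API. The point is that the boundary between launches provides the ordering, so no in-kernel barrier is needed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Simulates three back-to-back "kernel launches" on the host and returns how
// many threads of the main phase observed the initialized global.
int run_with_separate_phases(uint32_t num_threads) {
  assert(num_threads > 0);
  int global_state = 0; // stands in for a global with a constructor

  // Phase 1: single-threaded "init kernel" runs the global constructors.
  std::thread init([&] { global_state = 42; });
  init.join(); // completion of the launch orders the write before phase 2

  // Phase 2: "main kernel" with many threads and no barrier at all.
  std::atomic<int> saw_initialized{0};
  std::vector<std::thread> threads;
  for (uint32_t id = 0; id < num_threads; ++id) {
    threads.emplace_back([&] {
      if (global_state == 42)
        saw_initialized.fetch_add(1, std::memory_order_relaxed);
    });
  }
  for (std::thread &t : threads)
    t.join();

  // Phase 3: single-threaded "fini kernel" runs the global destructors.
  std::thread fini([&] { global_state = 0; });
  fini.join();

  return saw_initialized.load();
}
```

Because nothing in the main phase waits on other threads, the thread count there is limited only by what the device can run, which matches the stated goal of testing on a fully saturated GPU.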