The GPU has a different execution model from standard _start
implementations. On the GPU, all threads are active at the start of a
kernel. To correctly initialize and call the constructors, we want
single-threaded semantics. Previously, this was done using a makeshift
global barrier built from atomics. However, it is simpler to put the
portions of the code that must be single-threaded in separate kernels
and launch those with only one thread. Generally, sharing global state
between kernel launches makes optimizations more difficult, similar to
calling a function outside of the TU, but for testing it is better to
be correct.
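As a rough illustration of the idea, the host-side sketch below simulates splitting startup into three kernels, where only the middle one runs with every thread. It is not the code from this patch: `launch`, `State`, `begin_kernel`, `main_kernel`, `end_kernel`, and `run_program` are hypothetical names, and the launch helper simply loops over thread ids on the CPU instead of dispatching to a GPU.

```cpp
#include <cstdint>

// Hypothetical stand-in for a loader's kernel launch: invokes `kernel` once
// per thread id. A real loader would dispatch the kernel to the device.
template <typename Kernel>
void launch(Kernel kernel, uint32_t num_threads) {
  for (uint32_t tid = 0; tid < num_threads; ++tid)
    kernel(tid);
}

// Hypothetical global state shared across the three launches.
struct State {
  int ctor_calls = 0; // times the constructor phase ran
  int main_calls = 0; // times the main phase ran
  int dtor_calls = 0; // times the destructor phase ran
};

// Phase 1: initialization and constructors; must run exactly once.
void begin_kernel(State &s, uint32_t /*tid*/) { ++s.ctor_calls; }

// Phase 2: the program body; runs on every thread.
void main_kernel(State &s, uint32_t /*tid*/) { ++s.main_calls; }

// Phase 3: destructors and teardown; must run exactly once.
void end_kernel(State &s, uint32_t /*tid*/) { ++s.dtor_calls; }

// Launch the single-threaded phases with one thread and the body with
// `num_threads`, so no in-kernel barrier is needed to serialize startup.
State run_program(uint32_t num_threads) {
  State s;
  launch([&](uint32_t tid) { begin_kernel(s, tid); }, 1);
  launch([&](uint32_t tid) { main_kernel(s, tid); }, num_threads);
  launch([&](uint32_t tid) { end_kernel(s, tid); }, 1);
  return s;
}
```

Calling `run_program(64)` in this simulation runs the constructor and destructor phases once each and the main phase 64 times. The single-threaded guarantee comes from the launch dimensions rather than from synchronization inside the kernel, at the cost of keeping state alive across launches.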
Repository: rG LLVM Github Monorepo
Event Timeline
Splitting into three kernels seems better. These implementations are very heavily copy&pasted between amdgpu and nvptx though. A bunch of stuff is inherently target-specific, notably the kernel launch machinery, but things like the kernel argument structs are very easily factored into a header, and I think start.cpp is roughly a hundred lines of fairly subtle code that is identical on nvptx and amdgpu except for the spelling of the kernel calling convention. I think that would be worth cleaning up: it's likely to be quicker to deduplicate than to review the duplication.
| libc/startup/gpu/nvptx/start.cpp | |
|---|---|
| 71 | Is there a missing call to libc::finalize() here? |
The structs should definitely be common, you're right. The startup code itself, I think, should remain separate in different directories. That is how the other libc targets do it, for clarity of implementation.
| libc/startup/gpu/nvptx/start.cpp | |
|---|---|
| 71 | That was only necessary for the weird "global barrier" hack I implemented. exit should be sufficient. |
Thanks for moving the structs. I think the copy&paste startup code is likely to be bug-prone, but as that's relatively likely to be your maintenance burden, I'll accept.