Currently, AMDGPU has special handling for constructors and destructors.
We manuall emit a kernel that calls the functoins listed in the global
constructor / destructor list. This currently has two main problems. The
first is that we do not repsect the priortiy and simply call them in any
order. The second is that we redefine the symbol unconditionally which
coulid have a different definition, meaning we cannot merge any code
with a constructor post-codegen. This patch changes the handling to
instead use the standard support for travering the .init_array and
.fini_array sections the compiler creates. This allows us to emit a
single kernel with odr semantics, so even if we emit this multiple
times they will be merged into a single kernel.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUCtorDtorLowering.cpp | ||
---|---|---|
80 | Use the enum or query from datalayout | |
85 | Use the enum or query from datalayout | |
89 | Ditto | |
91 | Ditto | |
96 | Ditto | |
106 | I don't like getUnqual, it's a bad name that implies they're 0="unqualified" which is not really how things work. Query the function's address space | |
110 | Should just save the ptr type to a variable and reuse throughout |
I've tested this working with a standalone constructor, but it may be good to double check with the asan case.
llvm/test/CodeGen/AMDGPU/lower-ctor-dtor-constexpr-alias.ll | ||
---|---|---|
1 | can we add a check that init_array_start and init_array_end are created by lld as expected? |
llvm/test/CodeGen/AMDGPU/lower-ctor-dtor-constexpr-alias.ll | ||
---|---|---|
1 | That might be a little tough since lld is a separate project that might not be built. There are tests for it in there though, but it would be best to know for sure. Might need to be a runtime test? For now you can just take my word for it since it's what we use in the libc tests which have a buildbot https://lab.llvm.org/staging/#/builders/247. |
llvm/test/CodeGen/AMDGPU/lower-ctor-dtor-constexpr-alias.ll | ||
---|---|---|
1 | OK. LGTM |
llvm/lib/Target/AMDGPU/AMDGPUCtorDtorLowering.cpp | ||
---|---|---|
30–36 | Should just use the exiting AMDGPUAS::* instead of inventing a new copy |
I tested the above patch downstream with my asan-toolchain setup on r8 systems and found that the patch has no regressions with respect to both Hip/OpenMP runtimes of launching global ctor/dtors on amdgpu device code.
I have tested the above patches with asan test cases that stress test on the gpu side global-buffer-overflow conditions both with hip and openmp test cases and found the poisoning of redzones to be happening as we want.
I can share the test cases if any people has a working asan toolchain setup with the current patchset included.
Thanks
Amit
Wondering if it's worth maintaining a way to keep the old non-indirect codegen for LTO
llvm/lib/Target/AMDGPU/AMDGPUCtorDtorLowering.cpp | ||
---|---|---|
111–112 | CreateConstInBoundsGEP1_64 | |
llvm/test/CodeGen/AMDGPU/lower-ctor-dtor-constexpr-alias.ll | ||
20 | It's not really wrong but maybe we should use the default program address space instead |
That's true, these are indirect function calls so they can't be inlined or anything. But I feel like it may not be worthwhile considering that we execute this kernel with a single thread anyway, overhead from a few indirect function calls probably isn't going to add a lot of latency overall.
llvm/test/CodeGen/AMDGPU/lower-ctor-dtor-constexpr-alias.ll | ||
---|---|---|
20 | So I put it in addrspace(1) because I was getting weird failures. I think the LLVM ctor / dtor globals were either addrsapce(1) or default, the latter then causing problems. |
Should just use the exiting AMDGPUAS::* instead of inventing a new copy