This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Rewrite device ctor / dtor handling to use .init / .fini sections
ClosedPublic

Authored by jhuber6 on May 16 2023, 7:56 AM.

Details

Summary

Currently, AMDGPU has special handling for constructors and destructors.
We manuall emit a kernel that calls the functoins listed in the global
constructor / destructor list. This currently has two main problems. The
first is that we do not repsect the priortiy and simply call them in any
order. The second is that we redefine the symbol unconditionally which
coulid have a different definition, meaning we cannot merge any code
with a constructor post-codegen. This patch changes the handling to
instead use the standard support for travering the .init_array and
.fini_array sections the compiler creates. This allows us to emit a
single kernel with odr semantics, so even if we emit this multiple
times they will be merged into a single kernel.

Diff Detail

Event Timeline

jhuber6 created this revision.May 16 2023, 7:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 16 2023, 7:56 AM
jhuber6 requested review of this revision.May 16 2023, 7:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 16 2023, 7:56 AM
arsenm added inline comments.May 16 2023, 8:13 AM
llvm/lib/Target/AMDGPU/AMDGPUCtorDtorLowering.cpp
80

Use the enum or query from datalayout

85

Use the enum or query from datalayout

89

Ditto

91

Ditto

96

Ditto

106

I don't like getUnqual, it's a bad name that implies they're 0="unqualified" which is not really how things work. Query the function's address space

110

Should just save the ptr type to a variable and reuse throughout

jhuber6 updated this revision to Diff 522639.May 16 2023, 8:29 AM

Addressing comments

jhuber6 marked 7 inline comments as done.May 16 2023, 8:30 AM

I've tested this working with a standalone constructor, but it may be good to double check with the asan case.

yaxunl added inline comments.May 16 2023, 8:43 AM
llvm/test/CodeGen/AMDGPU/lower-ctor-dtor-constexpr-alias.ll
1

can we add a check that init_array_start and init_array_end are created by lld as expected?

jhuber6 added inline comments.May 16 2023, 8:46 AM
llvm/test/CodeGen/AMDGPU/lower-ctor-dtor-constexpr-alias.ll
1

That might be a little tough since lld is a separate project that might not be built. There are tests for it in there though, but it would be best to know for sure. Might need to be a runtime test? For now you can just take my word for it since it's what we use in the libc tests which have a buildbot https://lab.llvm.org/staging/#/builders/247.

yaxunl added inline comments.May 16 2023, 8:52 AM
llvm/test/CodeGen/AMDGPU/lower-ctor-dtor-constexpr-alias.ll
1

OK. LGTM

arsenm added inline comments.May 16 2023, 12:25 PM
llvm/lib/Target/AMDGPU/AMDGPUCtorDtorLowering.cpp
30–36

Should just use the exiting AMDGPUAS::* instead of inventing a new copy

jhuber6 updated this revision to Diff 522740.May 16 2023, 12:32 PM

Move to AMDGPUAS.

I've tested this working with a standalone constructor, but it may be good to double check with the asan case.

I tested the above patch downstream with my asan-toolchain setup on r8 systems and found that the patch has no regressions with respect to both Hip/OpenMP runtimes of launching global ctor/dtors on amdgpu device code.

I have tested the above patches with asan test cases that stress test on the gpu side global-buffer-overflow conditions both with hip and openmp test cases and found the poisoning of redzones to be happening as we want.

I can share the test cases if any people has a working asan toolchain setup with the current patchset included.

Thanks
Amit

arsenm accepted this revision.May 19 2023, 11:33 AM

Wondering if it's worth maintaining a way to keep the old non-indirect codegen for LTO

llvm/lib/Target/AMDGPU/AMDGPUCtorDtorLowering.cpp
111–112

CreateConstInBoundsGEP1_64

llvm/test/CodeGen/AMDGPU/lower-ctor-dtor-constexpr-alias.ll
20

It's not really wrong but maybe we should use the default program address space instead

This revision is now accepted and ready to land.May 19 2023, 11:33 AM

Wondering if it's worth maintaining a way to keep the old non-indirect codegen for LTO

That's true, these are indirect function calls so they can't be inlined or anything. But I feel like it may not be worthwhile considering that we execute this kernel with a single thread anyway, overhead from a few indirect function calls probably isn't going to add a lot of latency overall.

llvm/test/CodeGen/AMDGPU/lower-ctor-dtor-constexpr-alias.ll
20

So I put it in addrspace(1) because I was getting weird failures. I think the LLVM ctor / dtor globals were either addrsapce(1) or default, the latter then causing problems.