This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Add wrapper code generation for registering CUDA images
ClosedPublic

Authored by jhuber6 on Apr 14 2022, 12:28 PM.

Details

Summary

This patch adds the necessary code generation to create the wrapper code
that registers all the globals in CUDA. We create the necessary
functions and iterate through the list of
__start_cuda_offloading_entries to find which globals must be
registered. This is very similar to the code generation done currently
in Clang for non-rdc builds, but here we are registering a fully linked
fatbinary and finding the globals via the above sections.

With this we should be able to fully support basic RDC / LTO building of CUDA
code.

It's also worth noting that this does not include the necessary PTX to JIT the
image, so to use this support the offloading architecture must match the
system's architecture.

Depends on D123810

Diff Detail

Event Timeline

jhuber6 created this revision.Apr 14 2022, 12:28 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2022, 12:28 PM
jhuber6 requested review of this revision.Apr 14 2022, 12:28 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2022, 12:28 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript

For reference, this is how I used this driver to offload a CUDA and OpenMP offload kernel that both called an external "hello world" function in device.cu.

$ clang++ hello.cu device.cu -foffload-new-driver --offload-arch=sm_70 -c
$ clang++ openmp.cpp -fopenmp-new-driver -fopenmp -fopenmp-targets=nvptx64 -Xopenmp-target=nvptx64 -march=sm_70 -fopenmp-offload-mandatory -c
$ clang++ hello.o device.o openmp.o -foffload-new-driver -fopenmp -fopenmp-targets=nvptx64
tra added inline comments.Apr 14 2022, 1:43 PM
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
351

Do you think generation of the CUDA registration glue could be shared with the front-end?

jhuber6 added inline comments.Apr 14 2022, 1:47 PM
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
351

I was thinking about it, but ultimately decided to keep the noise outside of the new driver to a minimum. Maybe if we move to the offloading entries being a common format we can easily share this code. Keeping it in Clang would have the advantage that it's easier to test directly and ensures we don't de-sync if anything changes. The only downside is that in the future I may want to push this functionality to a linker plugin or similar, which would require pulling it out of Clang again to prevent us needing to link in Clang to build LLVM.

Also needing to do this all through the builder API isn't ideal, it would be nice if we had some kind of runtime to call to do this for us, but I didn't feel like adding yet another shared library for CUDA. I considered putting it inside the cuda header wrappers as well, but forcing every CUDA file to have some externally visible weak registration blob didn't sit well with me.

tra added inline comments.Apr 14 2022, 2:08 PM
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
351

Perhaps front-end is not the right place for it, indeed. LLVM itself may be a better choice. We already have some things there for somewhat similar purposes (like lib/WindowsManifest) so adding a helper function to generate runtime glue for CUDA should not be unreasonable.

jhuber6 added inline comments.Apr 14 2022, 2:46 PM
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
351

I think it's fine here for this patch, but I definitely want to move it into LLVM in the future once I start generalizing more of this stuff.

need a CodeGenCUDA test for the registering. Also need a Driver test for the subcommands.

Also, I am wondering whether we should document the new embedding scheme: section names, symbol names, entries, etc, if it has not bee done.

need a CodeGenCUDA test for the registering. Also need a Driver test for the subcommands.

Testing things inside the linker wrapper is a little hairy. I may need to add a special option for doing dry runs and printing the wrapping code to we can test this more satisfactorily. Doing that will create some more noise for review unfortunately. For OpenMP is simply ran all of our unit tests using the new driver and considered that sufficient evidence that it was working. Also which driver sub-commands should be tested?

Also, I am wondering whether we should document the new embedding scheme: section names, symbol names, entries, etc, if it has not bee done.

There's some existing documentation I wrote for OpenMP offloading once we started using this scheme. I was planning on updating it with this scheme for CUDA once it's landed, no sense writing documentation before it's final.

jhuber6 updated this revision to Diff 423796.Apr 19 2022, 7:52 PM

Addings tests for wrapper codegen and fatbinary usage.

yaxunl added inline comments.Apr 20 2022, 12:01 PM
clang/test/Driver/linker-wrapper-image.c
32 ↗(On Diff #423796)

what happens if there are multiple binaries for different GPUs? will the linker-wrapper generates one fatbinary containing both elfs and embed the fatbinary as one image?

clang/test/Driver/linker-wrapper.c
46 ↗(On Diff #423796)

This option is the same as the preceding option. Is this intentional? Can we have a test that embeds multiple binaries for different GPUs?

jhuber6 added inline comments.Apr 20 2022, 12:17 PM
clang/test/Driver/linker-wrapper-image.c
32 ↗(On Diff #423796)

Yes, I'll add it to the other test.

clang/test/Driver/linker-wrapper.c
46 ↗(On Diff #423796)

It's intentional to show that we can pull out two objects embedded in a single file (Like if someone did ld -r or something). I'll add binaries for different GPUs to show that works.

jhuber6 updated this revision to Diff 423993.Apr 20 2022, 12:20 PM

Adjusting tests.

jhuber6 updated this revision to Diff 424534.Apr 22 2022, 10:46 AM
This comment was removed by jhuber6.
jhuber6 updated this revision to Diff 424536.Apr 22 2022, 10:51 AM

Applied changes to wrong commit, whoops.

jhuber6 updated this revision to Diff 426467.May 2 2022, 11:38 AM

Updating code generation. Previously we would seg-fault in the case that no offloading entries were created. To solve this we simply check that Begin != End before trying to register anything.

Are there unresolved concerns we should address in this review?

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
351

I'm OK with it being here but the place to consider (IMHO) is llvm/lib/Frontend, maybe /CUDA/Register.cpp.

LGTM.

Did you forget to accept the revision? D123810 and D123471 still need to be looked at, but these are mostly non-intrusive so I don't think they'll break anything.

tra accepted this revision.May 10 2022, 3:52 PM
tra added inline comments.
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
351

OK. I'm fine keeping it all here for now.
Please add a comment pointing towards the origin of this code. and, maybe, a TODO item to consolidate and move it into a better place.

This revision is now accepted and ready to land.May 10 2022, 3:52 PM
jhuber6 added inline comments.May 10 2022, 3:54 PM
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
351

Will do, thanks for the reviews. I'll land these tomorrow morning and see if anything breaks.