This is an archive of the discontinued LLVM Phabricator instance.

[HIP] Add support for handling HIP in the linker wrapper
ClosedPublic

Authored by jhuber6 on Jun 30 2022, 7:30 AM.

Details

Summary

This patch adds the necessary changes required to bundle and wrap HIP
files. The bundling is currently done with clang-offload-bundler to
mimic fatbinary, and the wrapping uses runtime calls very similar to
CUDA's. This still does not support managed / surface / texture
variables; that would require some additional information in the entry.

One difference in the code generation compared to AMD's is that I don't
check whether the handle is null before destructing it; I'm not sure if
that's required.

With this we should be able to support HIP with the new driver.
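For reference, the wrapping registers the embedded fatbinary with the HIP runtime much like the CUDA path does. A minimal C++ sketch of the emitted behavior (a sketch only: the real output is generated LLVM IR, the per-entry registration is elided, and the runtime entry points take more parameters than shown):

extern "C" void **__hipRegisterFatBinary(void *Data);
extern "C" void __hipUnregisterFatBinary(void **Handle);

static void **Handle = nullptr;

__attribute__((constructor)) static void registerFatbin() {
  // Register the embedded fatbinary, then register each kernel and global
  // from the offloading entry table against the returned handle.
  Handle = __hipRegisterFatBinary(/* wrapper for the embedded fatbin */ nullptr);
}

__attribute__((destructor)) static void unregisterFatbin() {
  // As noted above, no null check on the handle before unregistering.
  __hipUnregisterFatBinary(Handle);
}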

Depends on D128850

Diff Detail

Event Timeline

jhuber6 created this revision. Jun 30 2022, 7:30 AM
Herald added a project: Restricted Project. Jun 30 2022, 7:30 AM
jhuber6 requested review of this revision. Jun 30 2022, 7:30 AM
tra added a comment. Jun 30 2022, 12:36 PM

Syntax/style looks OK to me with a few nits.

clang/test/Driver/linker-wrapper.c
120

Nit: This test case does not have any CHECK lines and could use a comment describing what it's supposed to test. AFAICT it's intended to make sure that no temporary files are left around, but I'm not 100% sure.

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
614–616

We probably do not want to hardcode the assumption that the host is x86_64 linux.

Bundle alignment should probably also be target-dependent, but 4K is common enough and is probably fine in practice.

1218

I'd move it to the end where the buffer is actually used.

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
393

We should probably have a helper function returning a properly prefixed name, similar to what we do in clang:
https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CGCUDANV.cpp#L184

Thanks for the comments.

clang/test/Driver/linker-wrapper.c
120

Yes, it ensures that the files extracted from the static library are not left over as temp files; this was a problem previously that I fixed. I'll add a comment explaining that.

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
614–616

This is exactly the way it is in the Clang source for HIP. HIP uses clang-offload-bundler, which expects a host file and host triple, ergo the dummy triple and input from /dev/null. This obviously isn't great; maybe in the future I'll be able to convince the AMD folks to use my format instead.
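As a rough illustration only (the paths, offload-arch, and helper name here are invented), the bundler job ends up looking something like:

#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Program.h"

int runBundler(llvm::StringRef BundlerPath) {
  // clang-offload-bundler insists on a host entry, so a dummy host triple
  // is paired with /dev/null as its input alongside the real device image.
  llvm::SmallVector<llvm::StringRef> Args = {
      "clang-offload-bundler", "-type=o", "-bundle-align=4096",
      "-targets=host-x86_64-unknown-linux,hipv4-amdgcn-amd-amdhsa--gfx908",
      "-input=/dev/null", "-input=device.img", "-output=module.hipfb"};
  return llvm::sys::ExecuteAndWait(BundlerPath, Args);
}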

1218

Sure, I'll do that for the others as well.

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
393

I had that thought, but unless I wanted to use regular expressions it would be a little weird, since there are many different spellings here, e.g. __cuda, .cuda, and _cuda. I figured it was easier to just make two strings rather than carry around three different functions to handle these cases, or introduce some weird regex.
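For the record, a helper in that spirit might look like the sketch below (illustrative only; it covers just one of the three spellings, which is the point):

#include <string>
#include "llvm/ADT/Twine.h"

// Handles only the "__cuda" / "__hip" spelling; the ".cuda" and "_cuda"
// variants would still need their own handling, so two plain strings
// ended up simpler than three helpers or a regex.
static std::string getPrefixedName(bool IsHIP, const llvm::Twine &Suffix) {
  return ((IsHIP ? llvm::Twine("__hip") : llvm::Twine("__cuda")) + Suffix).str();
}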

jhuber6 updated this revision to Diff 442356. Jul 5 2022, 10:46 AM

Addressing some comments.

JonChesterfield accepted this revision. Jul 11 2022, 8:29 AM

Code looks good to me. It's hard to be sure whether it works without running a bunch of HIP test cases through it; have you already done so? If it doesn't work out of the box it should be close enough to fix up post-commit, e.g. when trying to move HIP over to this by default.

This revision is now accepted and ready to land. Jul 11 2022, 8:29 AM

Thanks for the review. I ran a couple of mini-apps with HIP versions (XSBench, RSBench, SU3Bench) using this method and they passed without issue. The only thing I was unsure about was whether or not the handle needed to be checked for null, because my testing suggested it's unnecessary; I was hoping one of the HIP developers would let me know. We can think about making this the default approach when I make the new driver work for non-RDC mode compilations.

There is only one fatbin in -fgpu-rdc mode, but the fatbin unregister function is called multiple times, once in each TU. The HIP runtime expects each fatbin to be unregistered only once. The old embedding scheme introduced a weak symbol to track whether the fatbin had been unregistered and to make sure it is only unregistered once.
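In C++ terms, that guard works roughly like the sketch below (the flag name and exact shape are assumptions for illustration, not what clang emits verbatim):

extern "C" void __hipUnregisterFatBinary(void **Handle);

// Weak, so the copy emitted into every TU collapses to one program-wide flag.
__attribute__((weak)) bool FatbinUnregistered = false;

static void unregisterOnce(void **Handle) {
  if (!FatbinUnregistered) { // only the first destructor call unregisters
    FatbinUnregistered = true;
    __hipUnregisterFatBinary(Handle);
  }
}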

I see; this wrapping will only happen in RDC mode, so it's probably safe to ignore here? When I support non-RDC mode in the new driver it will most likely rely on the old code generation, although it's entirely feasible to make RDC mode the default; there's no runtime overhead when using LTO.

This revision was landed with ongoing or failed builds. Jul 11 2022, 12:49 PM
This revision was automatically updated to reflect the committed changes.

If you only unregister the fatbin once for the whole program, then it should be safe for -fgpu-rdc. I am not sure if that is the case.

My experience with -fgpu-rdc is that it causes much longer link times for large applications like PyTorch or TensorFlow, and LTO does not help. This is because the compiler has lots of interprocedural optimization passes which take more than linear time. Because of that, those apps need to be compiled with -fno-gpu-rdc. Actually, most CUDA/HIP applications use -fno-gpu-rdc.

> If you only unregister the fatbin once for the whole program, then it should be safe for -fgpu-rdc. I am not sure if that is the case.

It should be here; the generated handle is private to the registration module we created. We only make one, and it's impossible for anyone else to touch it, even when mixing RDC and non-RDC code.

> My experience with -fgpu-rdc is that it causes much longer link times for large applications like PyTorch or TensorFlow, and LTO does not help. This is because the compiler has lots of interprocedural optimization passes which take more than linear time. Because of that, those apps need to be compiled with -fno-gpu-rdc. Actually, most CUDA/HIP applications use -fno-gpu-rdc.

Yes, it's actually pretty difficult to find a CUDA application using -fgpu-rdc; it seems much more common to just stick everything that's needed in the file. I've considered finding a CUDA / HIP benchmark suite and comparing compile times using the new driver stuff. The benefit of having -fgpu-rdc be the default is that device code basically behaves exactly like host code, and LTO makes -fgpu-rdc behave like -fno-gpu-rdc performance-wise. The downside, as you mentioned, is compile time.

tra added a comment. Jul 11 2022, 4:27 PM

For what it's worth, NCCL is the only nontrivial library that needs RDC compilation that I'm aware of. It's also self-contained: for RDC purposes we only need to use RDC on the library TUs and do not need to propagate it to all CUDA TUs in the build.

I believe such 'constrained' RDC compilation will likely be the reasonable practical trade-off. It may not become the default compilation mode, but we should be able to control where the "fully linked GPU executable" boundary is, and it's not necessarily going to match the fully linked host executable.

Theoretically we could do this with a relocatable link using the linker wrapper. The only problem with this approach is the __start/__stop linker-defined variables that we use to iterate the globals to be registered, as these are tied to the section name specifically. Potentially, we could move these to a unique section so they don't interfere with anything. So it would be something like this:

clang-linker-wrapper -r a.o b.o c.o -o registered.o // Contains RTL calls to register all globals at section 'cuda_offloading_entries_<ID>'
llvm-strip --remove-section .llvm.offloading registered.o // Remove embedded IR so no other files will link against it
llvm-objcopy --rename-section cuda_offloading_entries=cuda_offloading_entries_<ID> registered.o // Change the registration section to something unique

Think this would work?
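For context on why the rename matters, the registration code finds its entries through linker-defined bounds tied to the exact section name, roughly like this sketch (the entry layout is approximate, and _<ID> stands for whatever unique suffix the rename step picks):

#include <cstddef>
#include <cstdint>

struct OffloadEntry { // approximate layout of an offloading entry
  void *Addr;
  char *Name;
  size_t Size;
  int32_t Flags;
  int32_t Reserved;
};

// The linker only defines __start_<section>/__stop_<section> for this exact
// name, so renaming to cuda_offloading_entries_<ID> gives each relocatable
// link its own private, non-colliding pair of bounds.
extern OffloadEntry __start_cuda_offloading_entries[];
extern OffloadEntry __stop_cuda_offloading_entries[];

static void registerEntries(void **Handle) {
  for (OffloadEntry *E = __start_cuda_offloading_entries;
       E != __stop_cuda_offloading_entries; ++E)
    ; // call __hipRegisterFunction / __hipRegisterVar on *E
}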

thakis added a subscriber: thakis. Jul 11 2022, 6:25 PM

This breaks check-clang on mac: http://45.33.8.238/macm1/39907/step_7.txt

Please take a look and revert for now if it takes a while to fix.

I changed some of the argument formats in a previous patch and probably messed up the rebase. I'll try to land a patch real quick.

Let me know if rGfe6a391357fc resolved the issue.

Bot's happy again. Thanks for the quick fix!