This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][HIP] Fix host used external kernel in archive
ClosedPublic

Authored by yaxunl on Apr 8 2022, 9:15 PM.

Details

Summary

For -fgpu-rdc, a host function may call an external kernel
which is defined in an archive of bitcode. Since this external
kernel is only referenced in host function, the device
bitcode does not contain reference to this external
kernel, then the linker will not try to resolve this external
kernel in the archive.

To fix this issue, host-used external kernels and device
variables are tracked. A global array containing pointers
to these external kernels and variables is emitted which
serves as an artificial references to the external kernels
and variables used by host.

Diff Detail

Event Timeline

yaxunl created this revision.Apr 8 2022, 9:15 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 8 2022, 9:15 PM
yaxunl requested review of this revision.Apr 8 2022, 9:15 PM
tra added a comment.Apr 12 2022, 11:59 AM

LGTM in principle. This will keep around the GPU code we do need.

That said, it seems to be a rather blunt hammer. I think we'll end up linking almost everything in an archive into the final executable as we'll likely have a host-visible symbol in most of the GPU objects (e.g. most of them would have a kernel).
Device-side linking would also be unaware of which objects were actually linked into the host executable and thus would link in more objects than necessary. We could have achieved about the same result by linking with --whole-archive.

The root of the problem here is that in isolation GPU-side linking does not know what will really be needed by the host and thus has to link in everything, except, maybe, object files where we may have __device__ functions only.
Ideally, the linking should be a two-phase process -- link CPU side, extract references to the GPU symbols (host-side compilation would have to be augmented to place them in a well known location) and pass them to the GPU-side linker which would then have all the info necessary to pull in relevant GPU-side objects without compiler having to force having nearly all of them linked in.

I realize that this would be a nontrivial change to the compilation pipeline. As a short-to-medium term solution, this patch may do, though I'd probably prefer just linking with --whole-archive as it would, in theory, be simpler.

LGTM in principle. This will keep around the GPU code we do need.

That said, it seems to be a rather blunt hammer. I think we'll end up linking almost everything in an archive into the final executable as we'll likely have a host-visible symbol in most of the GPU objects (e.g. most of them would have a kernel).
Device-side linking would also be unaware of which objects were actually linked into the host executable and thus would link in more objects than necessary. We could have achieved about the same result by linking with --whole-archive.

The root of the problem here is that in isolation GPU-side linking does not know what will really be needed by the host and thus has to link in everything, except, maybe, object files where we may have __device__ functions only.
Ideally, the linking should be a two-phase process -- link CPU side, extract references to the GPU symbols (host-side compilation would have to be augmented to place them in a well known location) and pass them to the GPU-side linker which would then have all the info necessary to pull in relevant GPU-side objects without compiler having to force having nearly all of them linked in.

I realize that this would be a nontrivial change to the compilation pipeline. As a short-to-medium term solution, this patch may do, though I'd probably prefer just linking with --whole-archive as it would, in theory, be simpler.

This approach will only link in kernels and device variables used by host code, whereas --whole-archive will keep everything in the archive. There are use cases where the archive contains a large amount of kernels that the application only use a few of them.

Also, --whole-archive will require users to carefully arrange --whole-archive and --no-whole-archive options for the archives they use. This approach avoids that.

tra added a comment.Apr 12 2022, 12:41 PM

This approach will only link in kernels and device variables used by host code

In the absence of the explicit reference info from the host side, GPU-side linker must link all objects with symbols that may be used by the host.
E.g if we have a library with three objects, each has one kernel (and thus potentially used by the host), but the main TU only refers to a kernel from one of them, GPU-side linker would still have to link in all three objects from the library, as any of them may have been referenced by the host.

--whole-archive will require users to carefully arrange --whole-archive and --no-whole-archive options for the archives they use.

This would be done by the driver. My understanding is that we already have to do nontrivial stuff under the hood (e.g. unbundling) so telling the linker that static archives must always use --whole-archive should be doable.
I don't insist on it, just exploring alternative options we may have.

This approach will only link in kernels and device variables used by host code

In the absence of the explicit reference info from the host side, GPU-side linker must link all objects with symbols that may be used by the host.
E.g if we have a library with three objects, each has one kernel (and thus potentially used by the host), but the main TU only refers to a kernel from one of them, GPU-side linker would still have to link in all three objects from the library, as any of them may have been referenced by the host.

You are talking about a use case that actually needs --whole-archive option. If the main TU does not reference some symbols in the archive but wants all symbols in the archive to be linked in, it is justifiable to use --whole-archive and HIP toolchain can support passing -Wl,--whole-archive specified in the command line to the device linking step.

However, in normal use cases, users only want to link in symbols referenced by the main TU. They do not need to link every symbol in the archive.

Also, I don't see the advantage of resolving this issue through toolchains. You still need to detect kernels and device variables referenced by host code, and generate IR's which introduce artificial references to them. It just becomes more complicated since you have to do them with external tools and handle extra outputs and inputs with the toolchain. Whereas in the current approach the information is directly available in AST and the IR can be generated by clang codegen.

tra accepted this revision.Apr 12 2022, 2:13 PM

You are talking about a use case that actually needs --whole-archive option. If the main TU does not reference some symbols in the archive but wants all symbols in the archive to be linked in, it is justifiable to use --whole-archive and HIP toolchain can support passing -Wl,--whole-archive specified in the command line to the device linking step.

Hmm. My point was the opposite -- only one object should be linked and I saw no way to do that without conservatively including everything.
I think I've misunderstood what your patch does.

So, a main TU with just __global__ void kernel(); would emit a reference when it's compiled on the GPU side. That, in turn will tell the linker what it needs to pull from the libraries and things should just work.
If that's the case, then it would work in my example, too.

However, in normal use cases, users only want to link in symbols referenced by the main TU. They do not need to link every symbol in the archive.

Agreed.

clang/lib/CodeGen/CodeGenModule.cpp
602

This is not HIP-specific and should have a more generic name. @gpu.used.external ?

This revision is now accepted and ready to land.Apr 12 2022, 2:13 PM
yaxunl marked an inline comment as done.Apr 12 2022, 8:08 PM

So, a main TU with just __global__ void kernel(); would emit a reference when it's compiled on the GPU side. That, in turn will tell the linker what it needs to pull from the libraries and things should just work.
If that's the case, then it would work in my example, too.

Yes, this patch creates artificial references in the device IR originating from the host functions in the same TU, or, in other words, it creates the missing references which should be there but are not there and no more references than those.

clang/lib/CodeGen/CodeGenModule.cpp
602

will do when committing.

This revision was landed with ongoing or failed builds.Apr 13 2022, 7:48 AM
This revision was automatically updated to reflect the committed changes.
yaxunl marked an inline comment as done.
Herald added a project: Restricted Project. · View Herald TranscriptApr 13 2022, 7:48 AM