This patch adds the code generation needed to create the wrapper code
that registers all the globals in CUDA. We create the necessary
functions and iterate through the list of entries beginning at
__start_cuda_offloading_entries to find which globals must be
registered. This is very similar to the code generation Clang currently
does for non-RDC builds, but here we are registering a fully linked
fatbinary and finding the globals via the above section.
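
As a rough illustration of the iteration described above, here is a minimal sketch in plain C. It is not the code this patch generates. The entry layout, the rule that a size of 0 marks a kernel, and the names register_function, register_var and register_globals are assumptions made for the example. In a real link the bounds would be the linker-defined __start_cuda_offloading_entries symbol and its matching stop symbol; here they are passed in as ordinary pointers so the sketch stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one offloading entry (illustrative only). */
typedef struct {
  void *addr;       /* host-side address of the kernel stub or variable */
  const char *name; /* symbol name inside the device image */
  size_t size;      /* assumed: 0 for a kernel, byte size for a variable */
  int32_t flags;
  int32_t reserved;
} offload_entry;

/* Hypothetical stand-ins for the CUDA runtime registration calls.
   They only count what they were asked to register. */
int kernels_registered = 0;
int vars_registered = 0;

void register_function(void **handle, const offload_entry *e) {
  (void)handle;
  (void)e;
  ++kernels_registered;
}

void register_var(void **handle, const offload_entry *e) {
  (void)handle;
  (void)e;
  ++vars_registered;
}

/* Walk every entry in [begin, end) and register it against the
   fatbinary handle. In the real wrapper, begin would be
   __start_cuda_offloading_entries and end the matching stop symbol. */
void register_globals(void **handle, const offload_entry *begin,
                      const offload_entry *end) {
  assert(begin <= end);
  for (const offload_entry *e = begin; e != end; ++e) {
    if (e->size == 0)
      register_function(handle, e); /* treated as a kernel */
    else
      register_var(handle, e);      /* treated as a global variable */
  }
}
```

Calling register_globals(handle, begin, end) once at startup, after the fatbinary itself has been registered, would cover every entry the linker gathered into the section.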
With this we should be able to fully support basic RDC / LTO building of CUDA
code.
Note that this does not include the PTX needed to JIT the
image, so to use this support the offloading architecture must match the
architecture of the system's GPU.
Depends on D123810
What happens if there are multiple binaries for different GPUs? Will the linker-wrapper generate one fatbinary containing both ELFs and embed the fatbinary as one image?