One current downside of LLVM's support for CUDA in RDC-mode is that we
cannot JIT from the PTX image, so the user must provide the specific
target architecture when offloading. CUDA's runtime uses a special
method to link the separate PTX files in RDC-mode, which LLVM cannot
do with the chosen approach to supporting RDC-mode compilation.
However, if we embed bitcode via LTO, we can use the single linked PTX
image for the whole module and include it in the fatbinary. This allows
us to run the following command and have the result execute even
without the correct architecture specified:

    clang foo.cu -foffload-lto -fgpu-rdc --offload-new-driver -lcudart

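For illustration, a minimal `foo.cu` that could be used to exercise the command above might look like the sketch below. The file contents, the kernel name, and the value written are assumptions made for this example and are not part of the patch. The error check after the launch is there because a missing device image would be expected to surface at that point rather than at compile time.

```cuda
// foo.cu: hypothetical test input, not taken from the patch.
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: writes a known value so the host can confirm it ran.
__global__ void kernel(int *out) { *out = 42; }

int main() {
  int *dev = nullptr;
  int host = 0;

  if (cudaMalloc(&dev, sizeof(int)) != cudaSuccess)
    return 1;

  kernel<<<1, 1>>>(dev);

  // A launch with no usable image for the device reports its error here.
  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess)
    err = cudaMemcpy(&host, dev, sizeof(int), cudaMemcpyDeviceToHost);

  cudaFree(dev);

  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  std::printf("result = %d\n", host);
  return 0;
}
```

If the linked PTX image is present in the fatbinary, the driver should be able to JIT it for the installed GPU, so the program would be expected to run to completion without an explicit architecture on the command line.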
It is also worth noting that in full-LTO mode, RDC-mode will behave
exactly like non-RDC mode after linking.

Depends on D127246