Previously, when using the new driver we created a fatbinary with the
PTX and Cubin output. This was mainly done in an attempt to create some
backwards compatibility with the existing CUDA support that embeds the
fatbinary in each TU. This will most likely be more work than necessary
to actually implement. The linker wrapper cannot do anything with these
embedded PTX files because we do not know how to link them, and if we
did want to include multiple files it should go through the
clang-offload-packager instead. This also didn't respect the setting
that disables embedding PTX (although that setting wasn't used anyway).
Diff Detail
Repository: rG LLVM Github Monorepo

Event Timeline
Remove comment that is no longer true now that getInputFilename always returns a .cubin variant for object types.
The linker wrapper cannot do anything with these embedded PTX files because we do not know how to link them,
Neither, apparently, does nvlink. It does have an --emit-ptx <file> option, but only if LTO is enabled, which matches the new driver's behavior.
Thanks for the review. I'm not sure exactly how CUDA does it, but for their RDC support they do somehow link PTX from multiple TUs at runtime for JIT. I'm guessing they just compile each file upon initialization and link them with nvlink. I think using LTO for JIT support is the saner option in that case.
This change breaks clang++ --cuda-device-only compilation. Clang does not create any output in this case. Reverting the change fixes the problem.
Reproducible with:
echo '__global__ void k(){}' | bin/clang++ --offload-arch=sm_70 -x cuda - --cuda-device-only -v -c -o foo123.o
Compilation succeeds, but there's no foo123.o to be found.
That's the output name clang passes to ptxas, but the file is treated as a temporary and removed at the end of compilation, so the user gets nothing.
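This failure mode (exit status 0 but no output file) is easy to miss in scripts that only check the compiler's return code. A minimal sketch of a guard for it — `check_output` is a hypothetical helper written for this review, not part of clang or the patch:

```shell
# Hypothetical helper: run a compile command and verify the expected
# output file actually appears, since the driver bug above makes the
# compilation succeed while silently discarding the object file.
check_output() {
  out="$1"; shift
  rm -f "$out"
  "$@" || return 1            # compile step failed outright
  [ -e "$out" ] || {
    # exit status was 0, but the output file was never written
    echo "error: compiler exited 0 but did not create $out" >&2
    return 1
  }
}
```

With the reproducer above this would be invoked roughly as `check_output foo123.o bin/clang++ --offload-arch=sm_70 -x cuda - --cuda-device-only -c -o foo123.o`, and it fails on the broken revision even though clang itself reports success.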