This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Do not embed a fatbinary when using the new driver
ClosedPublic

Authored by jhuber6 on Jun 23 2022, 7:10 AM.

Details

Summary

Previously, when using the new driver we created a fatbinary containing the
PTX and cubin output. This was mainly done in an attempt to provide some
backwards compatibility with the existing CUDA support that embeds the
fatbinary in each TU. Fully implementing that would most likely be more work
than it is worth. The linker wrapper cannot do anything with these
embedded PTX files because we do not know how to link them, and if we
did want to include multiple files they should go through the
clang-offload-packager instead. This also did not respect the setting
that disables embedding PTX (although that setting was not used anyway).
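
For reference, a rough sketch of what routing multiple device images through clang-offload-packager would look like; the input file names and the sm_70 architecture here are only illustrative, and the exact --image key/value syntax is my understanding of the tool rather than something stated in this review:

clang-offload-packager -o device_images.bin \
  --image=file=a.cubin,triple=nvptx64-nvidia-cuda,arch=sm_70,kind=cuda \
  --image=file=b.cubin,triple=nvptx64-nvidia-cuda,arch=sm_70,kind=cuda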

Diff Detail

Event Timeline

jhuber6 created this revision.Jun 23 2022, 7:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2022, 7:10 AM
jhuber6 requested review of this revision.Jun 23 2022, 7:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2022, 7:10 AM
jhuber6 updated this revision to Diff 439384.Jun 23 2022, 7:11 AM

Remove comment that is no longer true now that getInputFilename always returns a .cubin variant for object types.

tra accepted this revision.Jun 23 2022, 10:54 AM

The linker wrapper cannot do anything with these embedded PTX files because we do not know how to link them,

Neither, apparently, does nvlink. It does have an --emit-ptx <file> option, but only if LTO is enabled, which matches the new driver behavior.

This revision is now accepted and ready to land.Jun 23 2022, 10:54 AM

The linker wrapper cannot do anything with these embedded PTX files because we do not know how to link them,

Neither, apparently, does nvlink. It does have an --emit-ptx <file> option, but only if LTO is enabled, which matches the new driver behavior.

Thanks for the review. I'm not sure exactly how CUDA does it, but for their RDC support they do somehow link PTX from multiple TUs at runtime for JIT. I'm guessing they just compile each file upon initialization and link them with nvlink. I think using LTO for JIT support is the saner option in that case.
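
Roughly what that guessed runtime approach would look like if done by hand with the CUDA tools; the file names and sm_70 are placeholders, and the exact ptxas/nvlink flags are my assumption rather than anything confirmed here:

ptxas -arch=sm_70 -c a.ptx -o a.cubin   # compile each TU's PTX to a relocatable cubin
ptxas -arch=sm_70 -c b.ptx -o b.cubin
nvlink -arch=sm_70 a.cubin b.cubin -o linked.cubin   # device-link the relocatable cubins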

This revision was landed with ongoing or failed builds.Jun 23 2022, 12:40 PM
This revision was automatically updated to reflect the committed changes.
tra added a comment.Aug 5 2022, 11:26 AM

This change breaks clang++ --cuda-device-only compilation. Clang does not create any output in this case. Reverting the change fixes the problem.

Reproducible with:

echo '__global__ void k(){}' | bin/clang++  --offload-arch=sm_70 -x cuda -  --cuda-device-only -v  -c -o foo123.o

Compilation succeeds, but there's no foo123.o to be found.

This change breaks clang++ --cuda-device-only compilation. Clang does not create any output in this case. Reverting the change fixes the problem.

Reproducible with:

echo '__global__ void k(){}' | bin/clang++  --offload-arch=sm_70 -x cuda -  --cuda-device-only -v  -c -o foo123.o

Compilation succeeds, but there's no foo123.o to be found.

Is it spitting it out as foo123.cubin instead?

tra added a comment.Aug 5 2022, 11:49 AM

Is it spitting it out as foo123.cubin instead?

That's the output name it passes to ptxas, but it's treated as a temporary file and is removed at the end, so the user gets nothing.