This is an archive of the discontinued LLVM Phabricator instance.

[LinkerWrapper] Add PTX output to CUDA fatbinary in LTO-mode
Needs Review · Public

Authored by jhuber6 on Jun 15 2022, 12:57 PM.

Details

Summary

One current downside of LLVM's support for CUDA in RDC-mode is that we
cannot JIT from the PTX image, which forces the user to specify the
exact architecture when offloading. CUDA's runtime uses a special
method to link the separate PTX files in RDC-mode, but LLVM cannot do
this with the approach it chose for RDC-mode compilation. However, if
we embed bitcode via LTO, we can use the single linked PTX image for
the whole module and include it in the fatbinary. This allows us to do
the following and have it execute even without the correct architecture
specified.

clang foo.cu -foffload-lto -fgpu-rdc --offload-new-driver -lcudart

It is also worth noting that in full-LTO mode, RDC-mode will behave
exactly like non-RDC mode after linking.
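
For instance, with a minimal foo.cu like the following (an illustrative test case, not from the patch), the embedded PTX lets the CUDA runtime JIT the kernel even when the machine's GPU does not match the cubin's target:

  // foo.cu -- minimal illustrative test case.
  // If the fatbinary only held a cubin for the wrong arch, the kernel would
  // silently never run; with PTX embedded, the driver can JIT it instead.
  #include <cstdio>

  __global__ void hello() { printf("Hello from the GPU\n"); }

  int main() {
    hello<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
  }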

Depends on D127246

Diff Detail

Event Timeline

jhuber6 created this revision. Jun 15 2022, 12:57 PM
jhuber6 requested review of this revision. Jun 15 2022, 12:57 PM
Herald added a project: Restricted Project. Jun 15 2022, 12:57 PM
Herald added a subscriber: cfe-commits.
tra added a comment. Jun 16 2022, 2:40 PM

Playing devil's advocate, I've got to ask -- do we even want to support JIT?

JIT brings more trouble than benefits.

  • substantial start-up time on nontrivial apps. The last time I tried launching a tensorflow app that needed to JIT its kernels, it took about half an hour until the JIT was done.
  • substantial increase in the size of the executable. Statically linked tensorflow apps are already pushing the limits of executables that use the small memory model (-mcmodel=small is the default for clang and gcc, AFAICT).
  • very easy to make a mistake, compile for the wrong GPU, and not notice it, because JIT will try to keep it running using PTX.
  • makes executables and tests non-hermetic -- the code that will run on the GPU (and thus the behavior) will depend on the particular driver version the app uses at runtime.

Benefits: It *may* allow us to run a miscompiled/outdated CUDA app. Whether it's actually a benefit is questionable. To me it looks like a way to paper over a problem.

We (google) have experienced all of the above and ended up disabling PTX JIT'ting altogether.

That said, we do embed PTX by default at the moment, so this patch does not really change the status quo. I'm not opposed to it, as long as we can disable PTX embedding if we need/want to.

Playing devil's advocate, I've got to ask -- do we even want to support JIT? [...]

I guess it's one of those situations where I figured that since we have it when we do LTO anyway, I may as well add it. I don't know much about its usage w.r.t. performance, but I figured this was a shortcoming of the RDC-mode support in Clang, considering that NVIDIA can JIT RDC-mode compilations. We could definitely have an argument that disables this; I'm assuming Clang already has an argument for that which we could overload to pass something to the linker wrapper. Or we could decide which behaviour we want to be the default.

The problem with LTO, however, is that many "compile-only" flags are suddenly relevant during linking. So let's say for a build someone did clang foo.cu -c -no-embed-ptx -foffload-lto and then clang foo.o; we won't have the argument at link time. I think regular LTO can embed the command line in the bitcode or something. We also have the option to embed the arguments in the binary format I made.
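
Spelled out as separate steps (keeping the hypothetical -no-embed-ptx flag from above), the issue is that the option only exists at the compile step:

  # hypothetical: -no-embed-ptx is only seen by the compile step
  clang foo.cu -c -no-embed-ptx -foffload-lto -fgpu-rdc --offload-new-driver -o foo.o
  # the LTO link, where PTX is actually generated, never sees the flag
  clang foo.o -lcudart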

Also, one problem with the RDC-mode support here is that we don't gracefully error if something was wrong with the image. So the following is really unhelpful:

clang app.cu --offload-arch=sm_<not correct> -fgpu-rdc --offload-new-driver
./a.out // Gives no output, kernel simply never executes.
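
For illustration, a graceful failure would look something like checking the launch result instead of ignoring it (a sketch, not part of the patch; with a wrong-arch image and no PTX to fall back on, the launch should report cudaErrorNoKernelImageForDevice or similar):

  // check.cu -- illustrative launch-error check
  #include <cstdio>

  __global__ void kernel() {}

  int main() {
    kernel<<<1, 1>>>();
    cudaError_t err = cudaGetLastError(); // reports the failed launch, if any
    if (err != cudaSuccess) {
      fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
      return 1;
    }
    cudaDeviceSynchronize();
    return 0;
  }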

Do we want JIT -> YES, but specializing LLVM-IR JIT.
Do we want/need PTX? I do not, but I don't mind having it. Someone will ask for it eventually.

tra added a comment. Jun 22 2022, 2:38 PM

Do we want/need PTX, I do not, but I don't mind having it. Someone will ask for it eventually.

Fair enough.

However, if we embed bitcode via LTO, we can use the single linked PTX image for the whole module and include it in the fatbinary. [...]

Then we do need a knob controlling whether we want to embed PTX or not. The default should be "off", IMO.
We currently have --[no-]cuda-include-ptx=, which we may reuse for that purpose.

This brings another question -- which GPU variant will we generate PTX for? One? All (if more than one is specified)? The ones specified by --[no-]cuda-include-ptx= ?
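
For reference, the existing flag takes 'all' or a specific architecture, so the selection could look something like this (illustrative invocations):

  # embed PTX only for sm_70, drop it for sm_52
  clang foo.cu --offload-arch=sm_70 --offload-arch=sm_52 --no-cuda-include-ptx=sm_52
  # or drop PTX entirely
  clang foo.cu --offload-arch=sm_70 --no-cuda-include-ptx=all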

Then we do need a knob controlling whether we want to embed PTX or not. The default should be "off", IMO.
We currently have --[no-]cuda-include-ptx=, which we may reuse for that purpose.

We could definitely re-use that. It's another option that probably needs to go inside the binary itself, since normally those options aren't passed to the linker. We'll probably just use the same default as that flag (which is on, I think).

This brings another question -- which GPU variant will we generate PTX for? One? All (if more than one is specified)? The ones specified by --[no-]cuda-include-ptx= ?

Right now, it'll be the one that's attached to the LTO job. So if the user specified sm_70 they'll get PTX for sm_70.

tra added a comment. Jun 22 2022, 4:39 PM

Then we do need a knob controlling whether we want to embed PTX or not. The default should be "off", IMO.
We currently have --[no-]cuda-include-ptx=, which we may reuse for that purpose.

We could definitely re-use that. It's another option that probably needs to go inside the binary itself, since normally those options aren't passed to the linker.

I'm not sure I follow. WDYM by "go inside the binary itself"? I assume you mean the per-GPU offload binaries inside the per-TU .o, so that it could be used when that GPU object gets linked into the GPU executable?

What if different TUs that we're linking were compiled using different/contradictory options?

The problem is that conceptually the "--cuda-include-ptx" option ultimately affects the final GPU executable. If we're in RDC mode, then PTX is probably useless for JIT-ing purposes, as you can't link PTX and create the final executable. Well, I guess it might sort of be possible by concatenating the .s files, adding a bunch of forward declarations for the functions, merging debug info, removing duplicate weak functions... Basically, by writing a linker for a new "PTX" architecture. Doable, but so not worth it, IMO.

TUs are compiled to IR, then PTX generation shifts to the final link phase. I think we may need to rely on the user to supply PTX controls there explicitly. Or, at the very least, check that the cuda-include-ptx setting propagated from the TUs is consistent across all of them.
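
For example (hypothetical invocations), nothing currently stops the per-TU settings from contradicting each other:

  clang a.cu -c -fgpu-rdc --offload-new-driver -foffload-lto --no-cuda-include-ptx=all -o a.o
  clang b.cu -c -fgpu-rdc --offload-new-driver -foffload-lto --cuda-include-ptx=all -o b.o
  clang a.o b.o -lcudart   # which of the two settings should the LTO link honor?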

We'll probably just use the same default as that flag (which is on, I think).

This brings another question -- which GPU variant will we generate PTX for? One? All (if more than one is specified)? The ones specified by --[no-]cuda-include-ptx= ?

Right now, it'll be the one that's attached to the LTO job. So if the user specified sm_70 they'll get PTX for sm_70.

I mean, when the user specifies more than one GPU variant to target.
E.g. both sm_70 and sm_50.
PTX for the former would probably provide better performance if we run on a newer GPU (e.g. sm_80).
On the other hand, it will likely fail if we were to attempt running from PTX on sm_60.
Both would probably fail if we were to run on sm_35. Including all PTX variants is wasteful (Tensorflow-using applications are already pushing the limits of the small memory model and sometimes fail to link due to the executable being too large).

The point is that there's no "one true choice" for the PTX architecture (as there's no safe/sensible choice for the offload target). Only the end user would know their intent. We do need explicit controls and a documented policy on what we produce by default.
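
Concretely, the ambiguity shows up as soon as the command line names two targets (illustrative invocation):

  clang foo.cu -fgpu-rdc --offload-new-driver -foffload-lto \
    --offload-arch=sm_50 --offload-arch=sm_70 -lcudart
  # embed sm_50 PTX? sm_70 PTX? both? neither?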

I'm not sure I follow. WDYM by "go inside the binary itself"? [...]

I just mean that right now --[no-]cuda-include-ptx is handled at the compilation phase, whereas this happens during LTO, so we'd need to make sure we have those arguments. It's true that we could just require the user to pass it to the linker instead, but conceptually PTX generation happens in the "compiler" and not the linker.

The point is that there's no "one true choice" for the PTX architecture (as there's no safe/sensible choice for the offload target). [...]

This is a good point I hadn't thought of. Right now this is basically just a by-product of the LTO pass: we run LTO for the target, and since we got a PTX output we might as well include it. This may be what we do in Clang as well; I think we just include the PTX output alongside the cubin for each offload job. Even if we went to LLVM-IR we'd still be restricted by some features, I think. As it stands, this patch just makes

clang++ cuda.cu --offload-new-driver -fgpu-rdc --offload-arch=sm_60 -foffload-lto

give a fatbinary with sm_60 PTX / cubins. I think that is controlled by the user, as it's only going to generate PTX for the architecture they specified via --offload-arch (or the default).
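
One way to check what actually landed in the fatbinary, assuming the output is in a format CUDA's cuobjdump understands (an illustrative sanity check, not part of the patch):

  cuobjdump --list-ptx a.out   # expect a PTX entry for sm_60
  cuobjdump --list-elf a.out   # and the matching sm_60 cubin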