This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Add driver support for compiling CUDA with the new driver
Closed, Public

Authored by jhuber6 on Feb 21 2022, 11:46 AM.

Details

Summary

This patch adds the basic support for the clang driver to compile and link CUDA
using the new offloading driver. This requires handling the CUDA offloading kind
and embedding the generated files into the host. This will allow us to link
OpenMP code with CUDA code in the linker wrapper. More support will be required
to create functional CUDA / HIP binaries using this method.

Depends on D120270 D120271 D120934

Diff Detail

Event Timeline

jhuber6 created this revision.Feb 21 2022, 11:46 AM
jhuber6 requested review of this revision.Feb 21 2022, 11:46 AM
Herald added a project: Restricted Project.Feb 21 2022, 11:46 AM

Tests?

clang/lib/Driver/Driver.cpp
4050

Everything up to here can be an NFC commit, right? Let's split it off.

4202

Can we have a doxygen comment explaining what these helpers do?

4250

With the NFC commit we can probably also include some of this but restricted to OFK_OpenMP. Try to minimize the functional change that one has to think about.

jhuber6 updated this revision to Diff 412812.Mar 3 2022, 12:29 PM

Updating: embed fatbinaries now, plus small changes.

Herald added a project: Restricted Project.Mar 3 2022, 12:29 PM
Herald added a subscriber: dang.
jhuber6 edited the summary of this revision.Mar 3 2022, 12:38 PM
jhuber6 updated this revision to Diff 412816.Mar 3 2022, 12:44 PM

Adding comments

jhuber6 updated this revision to Diff 412821.Mar 3 2022, 1:21 PM

Adding test

yaxunl added inline comments.Mar 3 2022, 1:57 PM
clang/lib/Driver/Driver.cpp
4193–4196

The final set depends on the order of the --offload-arch and --no-offload-arch options; e.g., --offload-arch=gfx906 --no-offload-arch=gfx906 and --no-offload-arch=gfx906 --offload-arch=gfx906 are different. There is also --no-offload-arch=all, which removes all preceding --offload-arch= options.

4201

should this be HIP?

jhuber6 added inline comments.Mar 3 2022, 1:59 PM
clang/lib/Driver/Driver.cpp
4193–4196

I see, so we need to iterate the arguments in order and insert or erase architectures as they appear. I'll fix it.
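
Roughly, the in-order handling could look like the sketch below (a hypothetical fragment in the clang driver context; the set name and structure are illustrative, not the code that landed):

  // Sketch only: process --offload-arch / --no-offload-arch in command-line
  // order so that later options override earlier ones.
  llvm::DenseSet<llvm::StringRef> Archs;
  for (const llvm::opt::Arg *A : Args) {
    if (A->getOption().matches(options::OPT_offload_arch_EQ)) {
      Archs.insert(A->getValue());
    } else if (A->getOption().matches(options::OPT_no_offload_arch_EQ)) {
      if (llvm::StringRef(A->getValue()) == "all")
        Archs.clear(); // --no-offload-arch=all drops everything seen so far.
      else
        Archs.erase(A->getValue());
    }
  }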

4201

Whoops, haven't tested it with HIP yet.

jhuber6 updated this revision to Diff 412908.Mar 3 2022, 7:48 PM

Correctly handle offloading architecture options.

jhuber6 updated this revision to Diff 412996.Mar 4 2022, 6:57 AM
tra added inline comments.Mar 4 2022, 10:41 AM
clang/lib/Driver/Driver.cpp
4201

Nit: It would be nice to report the specific option that triggered the diag. Reporting --offload-arch when it's triggered by --no-offload-arch would be somewhat confusing.

Args.hasArgNoClaim(options::OPT_offload_arch_EQ) ? "--offload-arch" : "--no-offload-arch" ?

4226–4228

If we do not have constants for the default CUDA/HIP arch yet, we should probably add them and use them here.

4263–4269

If we do not allow non-relocatable compilation with the new driver, perhaps we should make -fgpu-rdc enabled by default with the new driver and error out if someone attempts to use -fno-gpu-rdc. Otherwise we're virtually guaranteed that everyone attempting to use -foffload-new-driver will hit this error.

4264

Do we want to return early here?

4315–4316

Nit: We do have llvm::zip for iterating over multiple containers. However, unpacking the loop variable may be more trouble than it's worth in such a small loop. Up to you.
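
For reference, a minimal illustration of llvm::zip from llvm/ADT/STLExtras.h (container names are made up for the example):

  // Iterate two containers in lockstep; stops at the end of the shorter one.
  for (auto &&[Input, TC] : llvm::zip(Inputs, ToolChains)) {
    // Input and TC refer to the corresponding elements of the two containers.
  }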

clang/lib/Driver/Driver.cpp
4226–4228

Defaulting HIP to gfx803 is unlikely to be helpful. It won't run on other architectures, i.e. there's no conservative default that will run on most things. I guess that's an existing quirk of the HIP toolchain?

tra added inline comments.Mar 4 2022, 11:33 AM
clang/lib/Driver/Driver.cpp
4226–4228

I agree that there's no "safe" choice of GPU target. It applies to CUDA as well, at least for GPU binaries.
That said, we still want clang -c foo.cu to work.

For CUDA we use the oldest variant still supported by the vendor. It produces PTX assembly which we embed along with the GPU binary.
That PTX is valid for newer GPU architectures, and the CUDA runtime will be able to compile it for whatever GPU the app happens to run on. It's not ideal, but it works.

While for AMDGPU we do not have such a forward-compatibility mode as we have with CUDA, being able to compile for *something* by default is still useful, IMO.

jhuber6 updated this revision to Diff 413087.Mar 4 2022, 11:47 AM

Addressing nits.

jhuber6 added inline comments.Mar 4 2022, 11:48 AM
clang/lib/Driver/Driver.cpp
4226–4228

I just copied GFX803 because that's what the previous driver used. I agree we should just default to something; maybe in the AMD case we can issue a warning telling users to specify the architecture properly with the flag.

jhuber6 marked 8 inline comments as done.Mar 4 2022, 12:03 PM
jhuber6 updated this revision to Diff 414371.Mar 10 2022, 7:03 AM

Fix architecture parsing and still include the GPU binary so cuobjcopy can use it.

jhuber6 updated this revision to Diff 415439.Mar 15 2022, 8:03 AM

We shouldn't need to restrict this to RDC only if implemented properly.

jhuber6 updated this revision to Diff 415460.Mar 15 2022, 8:58 AM

Accidentally clang-formatted an entire file.

jhuber6 updated this revision to Diff 415622.Mar 15 2022, 4:14 PM

Fix wrong condition for picking up input.

tra added inline comments.Mar 17 2022, 11:25 AM
clang/include/clang/Basic/Cuda.h
108–110

Nit: those could be just enum values.

  ...
  LAST,
  DefaultCudaArch = SM_35,
  DefaultHIPArch = GFX803,
};
clang/lib/Driver/Driver.cpp
4270

This loop can be folded into the loop above.

Alternatively, for simple loops like this one, llvm::for_each could be used.
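
As a small illustration (the container and lambda here are placeholders), llvm::for_each from llvm/ADT/STLExtras.h is just the range-based equivalent of a plain loop:

  // Equivalent to a range-based for loop over Inputs.
  llvm::for_each(Inputs, [&](const auto &Input) {
    // ... build the device action for this input ...
  });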

4298

Could you elaborate why TCAndArch is incremented only here? Don't other cases need to advance it, too? At the very least we need some comments here and for the loop in general, describing what's going on.

clang/lib/Driver/ToolChains/Clang.cpp
6967

Extracting the arch name from the file name looks... questionable. Where does that file name come from? Can't we get this information directly?

jhuber6 added inline comments.Mar 17 2022, 12:29 PM
clang/include/clang/Basic/Cuda.h
108–110

Will do.

clang/lib/Driver/Driver.cpp
4270

It could. I think keeping them separate makes the intention clearer (i.e., make an input for every toolchain and architecture), but I can merge them if that's better.

4298

It shouldn't be; I forgot to move this out of the conditional once I added more things. I'll explain the usage as well.

clang/lib/Driver/ToolChains/Clang.cpp
6967

Yeah, it's not ideal, but it was the easiest way to do it. The alternative is to find a way to traverse the job action list, find the nodes with a bound architecture set, and hope they're in the same order. I can try to do that.

jhuber6 updated this revision to Diff 416292.Mar 17 2022, 12:55 PM

Addressing comments

jhuber6 updated this revision to Diff 421232.Apr 7 2022, 8:41 AM

Update with the more generic clang argument handling.

MaskRay added inline comments.Apr 7 2022, 10:07 AM
clang/test/Driver/cuda-openmp-driver.cu
11

Better to add -NEXT whenever applicable.

MaskRay added inline comments.Apr 7 2022, 10:08 AM
clang/test/Driver/cuda-openmp-driver.cu
2

clang-driver is unneeded.

jhuber6 updated this revision to Diff 421270.Apr 7 2022, 10:26 AM

Make -foffload-new-driver imply GPU RDC mode, since it won't work otherwise. Also adjust tests.

jhuber6 updated this revision to Diff 423645.Apr 19 2022, 8:57 AM

Adding new test for CUDA phases.

tra added a comment.Apr 19 2022, 10:34 AM

Thank you for adding the compilation pipeline tests.

LGTM overall.

clang/lib/Driver/ToolChains/Clang.cpp
6225–6232

Combine both ifs, so we don't add -fgpu-rdc twice?

6897–6898

Ditto.

clang/test/Driver/cuda-openmp-driver.cu
19

You probably want to check for clang -cc1 ... -fgpu-rdc, too.

jhuber6 updated this revision to Diff 423685.Apr 19 2022, 11:20 AM

Making suggested changes.

tra added inline comments.Apr 22 2022, 10:21 AM
clang/lib/Driver/ToolChains/Clang.cpp
6225–6231

If the user specifies both -fno-gpu-rdc and -foffload-new-driver, we would still enable RDC compilation.
We may want to at least issue a warning.

Considering that we have multiple places where we may check for -f[no-]gpu-rdc, we should make sure we don't get different ideas about whether RDC has been enabled.
I think it may make sense to provide a common way to figure it out, either via a helper function that processes the CLI arguments or by calculating it once and saving it somewhere.

jhuber6 added inline comments.Apr 22 2022, 10:27 AM
clang/lib/Driver/ToolChains/Clang.cpp
6225–6231

I haven't quite finalized how to handle this. The new driver should be compatible with a non-RDC build, since we simply wouldn't embed the device image or create offloading entries. It's a little more difficult here since the new method is opt-in, so it requires a flag. We should definitely emit a warning if both are enabled (I'm assuming there's one for passing both -fgpu-rdc and -fno-gpu-rdc). I'll add one in.

Also, we could consider the new driver *the* RDC implementation in the future, which would be the easiest. The problem is whether we want to support CUDA's method of RDC, considering how other build systems seem to expect it. I could see us embedding the fatbinary in the object file, even if unused, just so that cuobjdump works. However, we couldn't support the generation of the __cudaRegisterFatBinary_nv.... functions because those would cause linker errors. WDYT?

tra added inline comments.Apr 22 2022, 10:36 AM
clang/lib/Driver/ToolChains/Clang.cpp
6225–6231

I'm assuming there's one for passing both -fgpu-rdc and -fno-gpu-rdc

This is not a valid assumption. The whole idea behind -fno-something is that the options can be overridden. E.g. if the build specifies a standard set of compiler options, but we need to override some of them when building a particular file. We can only do so by appending to the standard options. Potentially we may end up having those options overridden again. While it's not very common, it's certainly possible. It's also possible to start with '-fno-gpu-rdc' and then override it with -fgpu-rdc.

In this case, we care about the final state of RDC specified by -f*gpu-rdc options, not by the fact that -fno-gpu-rdc is present. Args.hasFlag(options::OPT_fgpu_rdc, options::OPT_fno_gpu_rdc, false) does exactly that -- gives you the final state. If it returns false, but we have -foffload-new-driver, then we need a warning as these options are contradictory.

tra added inline comments.Apr 22 2022, 10:44 AM
clang/lib/Driver/ToolChains/Clang.cpp
6225–6231

The new driver should be compatible with a non-RDC build

In that case, we don't need a warning, but we do need a test verifying this behavior.

Also, we could consider the new driver *the* RDC implementation in the future, which would be the easiest.

SGTM. I do not know how it all will work out in the end. Your proposed model makes a lot of sense, and I'm guardedly optimistic about it.
Eventually we would deprecate the RDC options, but we still need to behave sensibly when the user specifies a mix of these options.

jhuber6 updated this revision to Diff 424537.Apr 22 2022, 10:52 AM

Adding a warning for using both -fno-gpu-rdc and -foffload-new-driver. I think this is a good warning to have for now while this is opt-in and still being worked on. Once this has matured I plan on adding the necessary logic to handle RDC and non-RDC builds correctly with this. But for the purposes of this patch, just warning is fine.

jhuber6 added inline comments.Apr 22 2022, 10:56 AM
clang/lib/Driver/ToolChains/Clang.cpp
6225–6231

In that case, we don't need a warning, but we do need a test verifying this behavior.

It's possible, but I don't have the logic here to do it; I figured we can cross that bridge later.

SGTM. I do not know how it all will work out in the end. Your proposed model makes a lot of sense, and I'm guardedly optimistic about it.

The only downsides I know of are that we don't currently replicate CUDA's magic to JIT RDC code (we can do this with LTO anyway), and that registering offload entries relies on the linker defining __start / __stop variables, which I don't know if linkers on Windows / macOS provide. I'd be really interested if someone on the LLD team knew the answer to that.
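
For context, the linker-provided section bounds work roughly like the sketch below on ELF; the section name, struct layout, and function are placeholders, not the actual ones used by the offloading runtime:

  // Entries are emitted into a custom section whose name is a valid C
  // identifier; GNU-style linkers then synthesize __start_<section> and
  // __stop_<section> symbols around it.
  struct OffloadEntry { void *Addr; char *Name; long Size; int Flags; };

  __attribute__((section("offload_entries"), used))
  static OffloadEntry ExampleEntry = {nullptr, nullptr, 0, 0};

  extern "C" OffloadEntry __start_offload_entries;
  extern "C" OffloadEntry __stop_offload_entries;

  static void registerAllEntries() {
    for (OffloadEntry *E = &__start_offload_entries;
         E != &__stop_offload_entries; ++E) {
      // hand E off to the runtime's registration routine
    }
  }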

tra added a subscriber: rnk.Apr 22 2022, 11:11 AM
tra added inline comments.
clang/lib/Driver/ToolChains/Clang.cpp
6225–6230

This has to be Args.hasArg(options::OPT_fno_gpu_rdc) && Args.hasFlag(options::OPT_fgpu_rdc, options::OPT_fno_gpu_rdc, false) == false

E.g. we don't want a warning if we have -foffload-new-driver -fno-gpu-rdc -fgpu-rdc.
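
In other words, something along these lines (a sketch only, assuming the flag is exposed as options::OPT_foffload_new_driver; the diagnostic itself is omitted):

  // True only if -fgpu-rdc is the last of the -f[no-]gpu-rdc options; the
  // default is false when neither option is present.
  bool HasRDC =
      Args.hasFlag(options::OPT_fgpu_rdc, options::OPT_fno_gpu_rdc, false);
  // Warn only when -fno-gpu-rdc actually "won" and the new driver is requested.
  if (!HasRDC && Args.hasArg(options::OPT_fno_gpu_rdc) &&
      Args.hasArg(options::OPT_foffload_new_driver)) {
    // emit the "conflicting options" warning here
  }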

6225–6231

relies on the linker defining __start / __stop variables, which I don't know if linkers on Windows / macOS provide. I'd be really interested if someone on the LLD team knew the answer to that.

@MaskRay, @rnk - would you happen to know the answer?

6228

The warning does not take any parameters, and this one looks wrong anyway.

6896

I'm not sure why we're no longer checking for OPT_foffload_new_driver here. Don't we want to have the same RDC mode on the host and device sides?
I think we do, as that affects the way we mangle some symbols and it has to be consistent on both sides.

jhuber6 added inline comments.Apr 22 2022, 11:13 AM
clang/lib/Driver/ToolChains/Clang.cpp
6225–6230

K, will do

6228

Whoops, deleted that previously but had a little SNAFU with my commits and forgot to do it again.

6896

This is only checked with CudaDeviceInput, which we don't use with the new driver.

jhuber6 updated this revision to Diff 424543.Apr 22 2022, 11:16 AM

Making suggested changes.

jhuber6 updated this revision to Diff 425261.Apr 26 2022, 10:31 AM

Changing this to simply require that the user manually passes -fgpu-rdc in order to use the new driver. I think this makes more sense in the short term, and we can move to making the new driver the default RDC approach later. I tested this and the following should work:

clang foo.cu -fgpu-rdc -foffload-new-driver -c
clang bar.cu -c
clang foo.o bar.o -fgpu-rdc -foffload-new-driver

tra accepted this revision.Apr 26 2022, 11:41 AM

Changing this to simply require that the user manually passes -fgpu-rdc in order to use the new driver. I think this makes more sense in the short term, and we can move to making the new driver the default RDC approach later. I tested this and the following should work:

SGTM.

This revision is now accepted and ready to land.Apr 26 2022, 11:41 AM
rnk added inline comments.Apr 26 2022, 11:50 AM
clang/lib/Driver/ToolChains/Clang.cpp
6225–6231

I believe MachO has an equivalent mechanism, but I'm not familiar with it. For PE/COFF, you can search the ASan code to see how the start / stop symbols are defined on Windows using various pragmas and __declspec(allocate) to set up sections and sort them accordingly.

I would love to have a doc that writes up how to implement this array registration mechanism portably for all major platforms, given that we believe it is possible everywhere.
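
For reference, the PE/COFF technique rnk mentions looks roughly like this (section names and symbols are made up for illustration, not what ASan or the offloading runtime actually use):

  // MSVC/COFF section bracketing: the linker sorts grouped sections by the
  // suffix after '$' and merges ".ofld$A" .. ".ofld$Z" into a single ".ofld".
  #pragma section(".ofld$A", read)
  #pragma section(".ofld$M", read)
  #pragma section(".ofld$Z", read)

  // Dummy begin/end markers; real entries would be emitted into ".ofld$M".
  __declspec(allocate(".ofld$A")) static void *EntriesBegin = nullptr;
  __declspec(allocate(".ofld$Z")) static void *EntriesEnd = nullptr;

  // A runtime can then walk from &EntriesBegin + 1 up to &EntriesEnd, skipping
  // any null padding the incremental linker may insert.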

jhuber6 added inline comments.Apr 26 2022, 12:09 PM
clang/lib/Driver/ToolChains/Clang.cpp
6225–6231

I believe MachO has an equivalent mechanism, but I'm not familiar with it. For PE/COFF, you can search the ASan code to see how the start / stop symbols are defined on Windows using various pragmas and __declspec(allocate) to set up sections and sort them accordingly.
I would love to have a doc that writes up how to implement this array registration mechanism portably for all major platforms, given that we believe it is possible everywhere.

Thanks for the information, I was having a hard time figuring out whether it was possible to implement this on other platforms. Some documentation for handling this on each platform would definitely be useful, as I am hoping this can become the standard way to compile / register offloading languages in LLVM. Let me know if I can do anything to help on that front.

This revision was landed with ongoing or failed builds.Apr 29 2022, 6:15 AM
This revision was automatically updated to reflect the committed changes.