This patch adds the basic support for the clang driver to compile and link CUDA
using the new offloading driver. This requires handling the CUDA offloading kind
and embedding the generated files into the host. This will allow us to link
OpenMP code with CUDA code in the linker wrapper. More support will be required
to create functional CUDA / HIP binaries using this method.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
Tests?
clang/lib/Driver/Driver.cpp | ||
---|---|---|
4021 | Everything till here can be a NFC commit, right? Let's split it off | |
4173 | Can we have a doxygen comment explaining what these helpers do? | |
4221 | With the NFC commit we can probably also include some of this but restricted to OFK_OpenMP. Try to minimize the functional change that one has to think about. |
clang/lib/Driver/Driver.cpp | ||
---|---|---|
4164–4167 | The final set depends on the order of -offload-arch and -no-offload-arch options, e.g. --offload-arch=gfx906 --no-offload-arch=gfx906 and --no-offload-arch=gfx906 --offload-arch=gfx906 is different. Also there is --no-offload-arch=all which removes all precedent --offload-arch= options. | |
4172 | should this be HIP? |
clang/lib/Driver/Driver.cpp | ||
---|---|---|
4172 | Nit: It would be nice to report specific option which triggered the diag. Reporting --offload-arch when it's triggered by --no-offload-arch would be somewhat confusing. Args.hasArgNoClaim(options::OPT_offload_arch_EQ) ? "--offload-arch" : --no-offload-arch ? | |
4197–4199 | If we do not have constants for the default CUDA/HIP arch yet, we should probably add them and use them here. | |
4234–4240 | If we do not allow non-relocatable compilation with the new driver, perhaps we should make -fgpu-rdc enabled by default with the new driver and error out if someone attempts to use -fno-gpu-rdc. Otherwise we're virtually guaranteed that everyone attempting to use -foffload-new-driver will hit this error. | |
4235 | Do we want to return early here? | |
4286–4287 | Nit: We do have llvm::zip for iterating over multiple containers. However, unpacking loop variable results maybe more trouble than it's worth it in such a small loop, Up to you. |
clang/lib/Driver/Driver.cpp | ||
---|---|---|
4197–4199 | Defaulting hip to gfx803 is unlikely to be helpful. It won't run on other architectures, i.e. there's no conservative default that will run on most things. I guess that's an existing quirk of the hip toolchain? |
clang/lib/Driver/Driver.cpp | ||
---|---|---|
4197–4199 | I agree that there's no "safe" choice of GPU target. It applies to CUDA, as well, at least for GPU binaries. For CUDA we use the oldest variant still supported by the vendor. It produces PTX assembly which we embed along with the GPU binary. While for AMDGPU we do not have such forward compatibility mode as we have with CUDA, being able to compile for *something* by default is still useful, IMO. |
clang/lib/Driver/Driver.cpp | ||
---|---|---|
4197–4199 | I just copied GFX803 because that's what the previous driver used. I agree we should just default to something, maybe in the AMD case we can issue a warning telling them to use the flag to properly specify it. |
clang/include/clang/Basic/Cuda.h | ||
---|---|---|
108–110 | Nit: those could be just enum values. ... LAST, DefaultCudaArch = SM_35, DefaultHIPArch = GFX803, }; | |
clang/lib/Driver/Driver.cpp | ||
4241 | This loop can be folded into the loop above. Alternatively, for simple loops like this one could use llvm::for_each. | |
4269 | Could you elaborate why TCAndArch is incremented only here? Don't other cases need to advance it, too? At the very least we need some comments here and for the loop in general, describing what's going on. | |
clang/lib/Driver/ToolChains/Clang.cpp | ||
7008 | Extracting arch name from the file name looks... questionable. Where does that file name come from? Can't we get this information directly? |
clang/include/clang/Basic/Cuda.h | ||
---|---|---|
108–110 | Will do. | |
clang/lib/Driver/Driver.cpp | ||
4241 | It could, I think the intention is clearer (i.e. make an input for every toolchain and architecture) having them separate. But I can merge them if that's better. | |
4269 | It shouldn't be, I forgot to move this out of the conditional once I added more things, I'll explain the usage as well. | |
clang/lib/Driver/ToolChains/Clang.cpp | ||
7008 | Yeah, it's not ideal but it was the easiest way to do it. The alternative is to find a way to traverse the job action list and find the nodes with a bound architecture set and hope they're in the same order. I can try to do that. |
clang/test/Driver/cuda-openmp-driver.cu | ||
---|---|---|
11 | Better to add -NEXT whenever applicable |
clang/test/Driver/cuda-openmp-driver.cu | ||
---|---|---|
2 | clang-driver is unneeded. |
Make -foffload-new-driver imply GPU-RDC mode, it won't work otherwise. Also adjust tests.
clang/lib/Driver/ToolChains/Clang.cpp | ||
---|---|---|
6249–6252 | If user specifies both -fno-gpu-rdc and -foffload-new-driver we would still enable RDC compilation. Considering that we have multiple places where we may check for -f[no]gpu-rdc we should make sure we don't get different ideas whether RDC has been enabled. |
clang/lib/Driver/ToolChains/Clang.cpp | ||
---|---|---|
6249–6252 | I haven't quite finalized how to handle this. The new driver should be compatible with a non-RDC build since we simply wouldn't embed the device image or create offloading entries. It's a little bit more difficult here since the new method is opt-in so it requires a flag. We should definitely emit a warning if both are enabled (I'm assuming there's one for passing both fgpu-rdc and fno-gpu-rdc). I'll add one in. Also we could consider the new driver *the* RDC in the future which would be the easiest. The problem is if we want to support CUDA's method of RDC considering how other build systems seem to expect it. I could see us embedding the fatbinary in the object file, even if unused, just so that cuobjdump works. However we couldn't support the generation of __cudaRegisterFatBinary_nv.... functions because then those would cause linker errors. WDYT? |
clang/lib/Driver/ToolChains/Clang.cpp | ||
---|---|---|
6249–6252 |
This is not a valid assumption. The whole idea behind -fno-something is that the options can be overridden. E.g. if the build specifies a standard set of compiler options, but we need to override some of them when building a particular file. We can only do so by appending to the standard options. Potentially we may end up having those options overridden again. While it's not very common, it's certainly possible. It's also possible to start with '-fno-gpu-rdc' and then override it with -fgpu-rdc. In this case, we care about the final state of RDC specified by -f*gpu-rdc options, not by the fact that -fno-gpu-rdc is present. Args.hasFlag(options::OPT_fgpu_rdc, options::OPT_fno_gpu_rdc, false) does exactly that -- gives you the final state. If it returns false, but we have -foffload-new-driver, then we need a warning as these options are contradictory. |
clang/lib/Driver/ToolChains/Clang.cpp | ||
---|---|---|
6249–6252 |
In that case, we don't need a warning, but we do need a test verifying this behavior.
SGTM. I do not know how it all will work out in the end. Your proposed model makes a lot of sense, and I'm guardedly optimistic about it. |
Adding warning for using both -fno-gpu-rdc and -foffload-new-driver. I think this is a good warning to have for now while this is being worked in as opt-in. Once this has matured I plan on adding the necessary logic to handle RDC and non-RDC builds correctly with this. But for the purposes of this patch just warning is fine.
clang/lib/Driver/ToolChains/Clang.cpp | ||
---|---|---|
6249–6252 |
It's possible but I don't have the logic here to do it, figured we can cross that bridge later.
So the only downsides I know of, is that we don't currently replicate CUDA's magic to JIT RDC code (We can do this with LTO anyway), and that registering offload entries relies on the linker defining __start / __stop variables, which I don't know if linkers on Windows / MacOS provide. I'd be really interested if someone on the LLD team knew the answer to that. |
clang/lib/Driver/ToolChains/Clang.cpp | ||
---|---|---|
6249 | This has to be Args.hasArg(options::OPT_fno_gpu_rdc) && Args.hasFlag(options::OPT_fgpu_rdc, options::OPT_fno_gpu_rdc, false) == false E.g. we don't want a warning if we have -foffload-new-driver -fno-gpu-rdc -fgpu-rdc. | |
6249–6252 | ||
6252 | The warning does not take any parameters and this one looks wrong anyways. | |
6934 | I'm not sure why we're no longer checking for OPT_foffload_new_driver here. Don't we want to have the same RDC mode on the host and device sides? |
Changing this to simply require that the user manually passes -fgpu-rdc in order to use the new driver. I think this makes more sense in the short term and we can move to make the new driver the default rdc approach later. I tested this and the following should workd,
clang foo.cu -fgpu-rdc -foffload-new-driver -c clang bar.cu -c clang foo.o bar.o -fgpu-rdc -foffload-new-driver
clang/lib/Driver/ToolChains/Clang.cpp | ||
---|---|---|
6249–6252 | I believe MachO has an equivalent mechanism, but I'm not familiar with it. For PE/COFF, you can search the ASan code to see how the start / stop symbols are defined on Windows using various pragmas and __declspec(allocate) to set up sections and sort them accordingly. I would love to have a doc that writes up how to implement this array registration mechanism portably for all major platforms, given that we believe it is possible everywhere. |
clang/lib/Driver/ToolChains/Clang.cpp | ||
---|---|---|
6249–6252 |
Thanks for the information, I was having a hard to figuring out if it was possible to implement this on other platforms. Some documentation for handling this on each platform would definitely be useful as I am hoping this can become the standard way to compile / register offloading languages in LLVM. Let me know if I can do anything to help on that front. |
Nit: those could be just enum values.