This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Add driver support for compiling CUDA with the new driver
ClosedPublic

Authored by jhuber6 on Feb 21 2022, 11:46 AM.

Details

Summary

This patch adds the basic support for the clang driver to compile and link CUDA
using the new offloading driver. This requires handling the CUDA offloading kind
and embedding the generated files into the host. This will allow us to link
OpenMP code with CUDA code in the linker wrapper. More support will be required
to create functional CUDA / HIP binaries using this method.

Depends on D120270 D120271 D120934

Diff Detail

Event Timeline

jhuber6 created this revision.Feb 21 2022, 11:46 AM
jhuber6 requested review of this revision.Feb 21 2022, 11:46 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 21 2022, 11:46 AM

Tests?

clang/lib/Driver/Driver.cpp
4121

Everything up to here can be an NFC commit, right? Let's split it off.

4273

Can we have a doxygen comment explaining what these helpers do?

4321

With the NFC commit we can probably also include some of this but restricted to OFK_OpenMP. Try to minimize the functional change that one has to think about.

jhuber6 updated this revision to Diff 412812.Mar 3 2022, 12:29 PM

Update: embed fatbinaries now, plus other small changes.

Herald added a project: Restricted Project. · View Herald TranscriptMar 3 2022, 12:29 PM
Herald added a subscriber: dang. · View Herald Transcript
jhuber6 edited the summary of this revision. (Show Details)Mar 3 2022, 12:38 PM
jhuber6 updated this revision to Diff 412816.Mar 3 2022, 12:44 PM

Adding comments

jhuber6 updated this revision to Diff 412821.Mar 3 2022, 1:21 PM

Adding test

yaxunl added inline comments.Mar 3 2022, 1:57 PM
clang/lib/Driver/Driver.cpp
4264–4267

The final set depends on the order of the --offload-arch and --no-offload-arch options, e.g. --offload-arch=gfx906 --no-offload-arch=gfx906 and --no-offload-arch=gfx906 --offload-arch=gfx906 are different. There is also --no-offload-arch=all, which removes all preceding --offload-arch= options.

4272

should this be HIP?

jhuber6 added inline comments.Mar 3 2022, 1:59 PM
clang/lib/Driver/Driver.cpp
4264–4267

I see, so we need to iterate the arguments in order and insert or erase them as they appear. I'll fix it.
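A minimal sketch of that order-sensitive handling (the option names come from this review; the loop itself is illustrative, not the patch's actual code):

  llvm::DenseSet<llvm::StringRef> Archs;
  for (const Arg *A : Args) {
    if (A->getOption().matches(options::OPT_offload_arch_EQ))
      Archs.insert(A->getValue());
    else if (A->getOption().matches(options::OPT_no_offload_arch_EQ)) {
      // --no-offload-arch=all drops every architecture seen so far.
      if (llvm::StringRef(A->getValue()) == "all")
        Archs.clear();
      else
        Archs.erase(A->getValue());
    }
  }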

4272

Whoops, haven't tested it with HIP yet.

jhuber6 updated this revision to Diff 412908.Mar 3 2022, 7:48 PM

Correctly handle offloading architecture options.

jhuber6 updated this revision to Diff 412996.Mar 4 2022, 6:57 AM
tra added inline comments.Mar 4 2022, 10:41 AM
clang/lib/Driver/Driver.cpp
4272

Nit: It would be nice to report the specific option which triggered the diag. Reporting --offload-arch when it's triggered by --no-offload-arch would be somewhat confusing.

Args.hasArgNoClaim(options::OPT_offload_arch_EQ) ? "--offload-arch" : "--no-offload-arch"?

4297–4299

If we do not have constants for the default CUDA/HIP arch yet, we should probably add them and use them here.

4334–4340

If we do not allow non-relocatable compilation with the new driver, perhaps we should make -fgpu-rdc enabled by default with the new driver and error out if someone attempts to use -fno-gpu-rdc. Otherwise we're virtually guaranteed that everyone attempting to use -foffload-new-driver will hit this error.

4335

Do we want to return early here?

4386–4387

Nit: We do have llvm::zip for iterating over multiple containers. However, unpacking the loop variables may be more trouble than it's worth in such a small loop. Up to you.
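For reference, a tiny illustration of llvm::zip (the container contents are made up):

  #include "llvm/ADT/STLExtras.h"    // llvm::zip
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/Support/raw_ostream.h"

  llvm::SmallVector<llvm::StringRef> Archs = {"sm_52", "sm_70"};
  llvm::SmallVector<llvm::StringRef> Inputs = {"a.cubin", "b.cubin"};
  // Each iteration yields a tuple of references to the elements at the same
  // index in both containers.
  for (auto &&[Arch, Input] : llvm::zip(Archs, Inputs))
    llvm::errs() << Arch << " -> " << Input << "\n";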

clang/lib/Driver/Driver.cpp
4297–4299

Defaulting HIP to gfx803 is unlikely to be helpful. It won't run on other architectures, i.e. there's no conservative default that will run on most things. I guess that's an existing quirk of the HIP toolchain?

tra added inline comments.Mar 4 2022, 11:33 AM
clang/lib/Driver/Driver.cpp
4297–4299

I agree that there's no "safe" choice of GPU target. It applies to CUDA, as well, at least for GPU binaries.
That said, we still want clang -c foo.cu to work.

For CUDA we use the oldest variant still supported by the vendor. It produces PTX assembly which we embed along with the GPU binary.
That PTX is valid for newer GPU architectures, and the CUDA runtime will be able to compile it for the GPU the app happens to run on. It's not ideal, but it works.

While for AMDGPU we do not have such a forward-compatibility mode as we have with CUDA, being able to compile for *something* by default is still useful, IMO.

jhuber6 updated this revision to Diff 413087.Mar 4 2022, 11:47 AM

Addressing nits.

jhuber6 added inline comments.Mar 4 2022, 11:48 AM
clang/lib/Driver/Driver.cpp
4297–4299

I just copied GFX803 because that's what the previous driver used. I agree we should just default to something; maybe in the AMD case we can issue a warning telling users to specify it properly with the flag.

jhuber6 marked 8 inline comments as done.Mar 4 2022, 12:03 PM
jhuber6 updated this revision to Diff 414371.Mar 10 2022, 7:03 AM

Fix architecture parsing and still include the GPU binary so cuobjdump can use it.

jhuber6 updated this revision to Diff 415439.Mar 15 2022, 8:03 AM

We shouldn't need to restrict this to RDC only if implemented properly.

jhuber6 updated this revision to Diff 415460.Mar 15 2022, 8:58 AM

Accidentally clang-formatted an entire file.

jhuber6 updated this revision to Diff 415622.Mar 15 2022, 4:14 PM

Fix wrong condition for picking up input.

tra added inline comments.Mar 17 2022, 11:25 AM
clang/include/clang/Basic/Cuda.h
108–110

Nit: those could be just enum values.

  ...
  LAST,
  DefaultCudaArch = SM_35,
  DefaultHIPArch = GFX803,
};
clang/lib/Driver/Driver.cpp
4341

This loop can be folded into the loop above.

Alternatively, for simple loops like this, one could use llvm::for_each.
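As a point of reference, llvm::for_each (also in llvm/ADT/STLExtras.h) just applies a callable over a whole range; the names below are hypothetical, not from the patch:

  // Equivalent to a plain range-for over OffloadArchs.
  llvm::for_each(OffloadArchs, [&](llvm::StringRef Arch) {
    DeviceInputs.push_back(MakeDeviceInput(Arch));  // MakeDeviceInput is a hypothetical helper
  });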

4369

Could you elaborate why TCAndArch is incremented only here? Don't other cases need to advance it, too? At the very least we need some comments here and for the loop in general, describing what's going on.

clang/lib/Driver/ToolChains/Clang.cpp
6983

Extracting arch name from the file name looks... questionable. Where does that file name come from? Can't we get this information directly?

jhuber6 added inline comments.Mar 17 2022, 12:29 PM
clang/include/clang/Basic/Cuda.h
108–110

Will do.

clang/lib/Driver/Driver.cpp
4341

It could. I think the intention (i.e., make an input for every toolchain and architecture) is clearer having them separate, but I can merge them if that's better.

4369

It shouldn't be; I forgot to move this out of the conditional once I added more things. I'll explain the usage as well.

clang/lib/Driver/ToolChains/Clang.cpp
6983

Yeah, it's not ideal but it was the easiest way to do it. The alternative is to find a way to traverse the job action list and find the nodes with a bound architecture set and hope they're in the same order. I can try to do that.

jhuber6 updated this revision to Diff 416292.Mar 17 2022, 12:55 PM

Addressing comments

jhuber6 updated this revision to Diff 421232.Apr 7 2022, 8:41 AM

Update with the more generic clang argument handling.

MaskRay added inline comments.Apr 7 2022, 10:07 AM
clang/test/Driver/cuda-openmp-driver.cu
11

Better to add -NEXT whenever applicable

MaskRay added inline comments.Apr 7 2022, 10:08 AM
clang/test/Driver/cuda-openmp-driver.cu
2

clang-driver is unneeded.

jhuber6 updated this revision to Diff 421270.Apr 7 2022, 10:26 AM

Make -foffload-new-driver imply GPU RDC mode; it won't work otherwise. Also adjust tests.

jhuber6 updated this revision to Diff 423645.Apr 19 2022, 8:57 AM

Adding new test for CUDA phases.

tra added a comment.Apr 19 2022, 10:34 AM

Thank you for adding the compilation pipeline tests.

LGTM overall.

clang/lib/Driver/ToolChains/Clang.cpp
6242–6244

Combine both ifs, so we don't add -fgpu-rdc twice?

6913–6914

Ditto.

clang/test/Driver/cuda-openmp-driver.cu
19

You probably want to check for clang -cc1 ... -fgpu-rdc, too.

jhuber6 updated this revision to Diff 423685.Apr 19 2022, 11:20 AM

Making suggested changes.

tra added inline comments.Apr 22 2022, 10:21 AM
clang/lib/Driver/ToolChains/Clang.cpp
6242–6243

If the user specifies both -fno-gpu-rdc and -foffload-new-driver, we would still enable RDC compilation.
We may want to at least issue a warning.

Considering that we have multiple places where we may check for -f[no-]gpu-rdc, we should make sure we don't get different ideas about whether RDC has been enabled.
I think it may make sense to provide a common way to figure it out, either via a helper function that processes the CLI arguments or by calculating it once and saving it somewhere.

jhuber6 added inline comments.Apr 22 2022, 10:27 AM
clang/lib/Driver/ToolChains/Clang.cpp
6242–6243

I haven't quite finalized how to handle this. The new driver should be compatible with a non-RDC build, since we simply wouldn't embed the device image or create offloading entries. It's a little more difficult here since the new method is opt-in, so it requires a flag. We should definitely emit a warning if both are enabled (I'm assuming there's one for passing both -fgpu-rdc and -fno-gpu-rdc). I'll add one in.

Also, we could consider the new driver *the* RDC method in the future, which would be the easiest. The problem is whether we want to support CUDA's method of RDC, considering how other build systems seem to expect it. I could see us embedding the fatbinary in the object file, even if unused, just so that cuobjdump works. However, we couldn't support the generation of the __cudaRegisterFatBinary_nv.... functions because those would cause linker errors. WDYT?

tra added inline comments.Apr 22 2022, 10:36 AM
clang/lib/Driver/ToolChains/Clang.cpp
6242–6243

I'm assuming there's one for passing both -fgpu-rdc and -fno-gpu-rdc

This is not a valid assumption. The whole idea behind -fno-something is that the options can be overridden. E.g., the build may specify a standard set of compiler options, but we need to override some of them when building a particular file; we can only do so by appending to the standard options. Potentially we may end up having those options overridden again. While it's not very common, it's certainly possible. It's also possible to start with -fno-gpu-rdc and then override it with -fgpu-rdc.

In this case, we care about the final state of RDC specified by -f*gpu-rdc options, not by the fact that -fno-gpu-rdc is present. Args.hasFlag(options::OPT_fgpu_rdc, options::OPT_fno_gpu_rdc, false) does exactly that -- gives you the final state. If it returns false, but we have -foffload-new-driver, then we need a warning as these options are contradictory.
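A sketch of what that check could look like (the diagnostic name below is a placeholder, and the exact option spellings are assumed, not taken from the patch):

  // hasFlag() resolves repeated or overridden options to the last one on the
  // command line, giving the final RDC state.
  bool IsRDC =
      Args.hasFlag(options::OPT_fgpu_rdc, options::OPT_fno_gpu_rdc, false);
  if (!IsRDC && Args.hasArg(options::OPT_foffload_new_driver))
    D.Diag(diag::warn_conflicting_offload_options)  // placeholder diagnostic name
        << "-fno-gpu-rdc" << "-foffload-new-driver";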

tra added inline comments.Apr 22 2022, 10:44 AM
clang/lib/Driver/ToolChains/Clang.cpp
6242–6243

The new driver should be compatible with a non-RDC build

In that case, we don't need a warning, but we do need a test verifying this behavior.

Also, we could consider the new driver *the* RDC method in the future, which would be the easiest.

SGTM. I do not know how it all will work out in the end. Your proposed model makes a lot of sense, and I'm guardedly optimistic about it.
Eventually we would deprecate the RDC options, but we still need to work sensibly when the user specifies a mix of these options.

jhuber6 updated this revision to Diff 424537.Apr 22 2022, 10:52 AM

Adding a warning for using both -fno-gpu-rdc and -foffload-new-driver. I think this is a good warning to have for now while this is being worked on as opt-in. Once this has matured, I plan on adding the necessary logic to handle RDC and non-RDC builds correctly. But for the purposes of this patch, just warning is fine.

jhuber6 added inline comments.Apr 22 2022, 10:56 AM
clang/lib/Driver/ToolChains/Clang.cpp
6242–6243

In that case, we don't need a warning, but we do need a test verifying this behavior.

It's possible, but I don't have the logic here to do it; I figured we can cross that bridge later.

SGTM. I do not know how it all will work out in the end. Your proposed model makes a lot of sense, and I'm guardedly optimistic about it.

So the only downsides I know of are that we don't currently replicate CUDA's magic to JIT RDC code (we can do this with LTO anyway), and that registering offload entries relies on the linker defining __start / __stop symbols, which I don't know whether linkers on Windows / macOS provide. I'd be really interested if someone on the LLD team knew the answer to that.
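For context, here is a self-contained ELF illustration of the mechanism being relied on: when a section name is a valid C identifier, the linker synthesizes __start_<name> / __stop_<name> symbols that bracket every entry placed in that section (the entry layout and section name below are made up):

  #include <cstdio>

  struct OffloadEntry { const char *Name; };  // illustrative layout only

  // Every TU contributes its entries to the same named section.
  __attribute__((section("offload_entries"), used))
  static OffloadEntry Entry = {"kernel_foo"};

  // Synthesized by ELF linkers because "offload_entries" is a C identifier.
  extern OffloadEntry __start_offload_entries[];
  extern OffloadEntry __stop_offload_entries[];

  int main() {
    for (OffloadEntry *E = __start_offload_entries;
         E != __stop_offload_entries; ++E)
      std::printf("registering %s\n", E->Name);
  }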

tra added a subscriber: rnk.Apr 22 2022, 11:11 AM
tra added inline comments.
clang/lib/Driver/ToolChains/Clang.cpp
6242

This has to be Args.hasArg(options::OPT_fno_gpu_rdc) && Args.hasFlag(options::OPT_fgpu_rdc, options::OPT_fno_gpu_rdc, false) == false

E.g. we don't want a warning if we have -foffload-new-driver -fno-gpu-rdc -fgpu-rdc.

6242–6243

relies on the linker defining __start / __stop symbols, which I don't know whether linkers on Windows / macOS provide. I'd be really interested if someone on the LLD team knew the answer to that.

@MaskRay, @rnk - would you happen to know the answer?

6245

The warning does not take any parameters, and this one looks wrong anyway.

6912

I'm not sure why we're no longer checking for OPT_foffload_new_driver here. Don't we want to have the same RDC mode on the host and device sides?
I think we do, as that affects the way we mangle some symbols, and it has to be consistent on both sides.

jhuber6 added inline comments.Apr 22 2022, 11:13 AM
clang/lib/Driver/ToolChains/Clang.cpp
6242

K, will do

6245

Whoops, I deleted that previously but had a little SNAFU with my commits and forgot to do it again.

6912

This is only checked with CudaDeviceInput, which we don't use with the new driver.

jhuber6 updated this revision to Diff 424543.Apr 22 2022, 11:16 AM

Making suggested changes.

jhuber6 updated this revision to Diff 425261.Apr 26 2022, 10:31 AM

Changing this to simply require that the user manually passes -fgpu-rdc in order to use the new driver. I think this makes more sense in the short term, and we can move to make the new driver the default RDC approach later. I tested this, and the following should work:

clang foo.cu -fgpu-rdc -foffload-new-driver -c
clang bar.cu -c
clang foo.o bar.o -fgpu-rdc -foffload-new-driver
tra accepted this revision.Apr 26 2022, 11:41 AM

Changing this to simply require that the user manually passes -fgpu-rdc in order to use the new driver. I think this makes more sense in the short term, and we can move to make the new driver the default RDC approach later. I tested this, and the following should work:

SGTM.

This revision is now accepted and ready to land.Apr 26 2022, 11:41 AM
rnk added inline comments.Apr 26 2022, 11:50 AM
clang/lib/Driver/ToolChains/Clang.cpp
6242–6243

I believe MachO has an equivalent mechanism, but I'm not familiar with it. For PE/COFF, you can search the ASan code to see how the start / stop symbols are defined on Windows using various pragmas and __declspec(allocate) to set up sections and sort them accordingly.

I would love to have a doc that writes up how to implement this array registration mechanism portably for all major platforms, given that we believe it is possible everywhere.
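For reference, a hedged sketch of the MSVC-style approach being pointed at (section names and the entry type are invented here, not taken from ASan or the patch): COFF linkers merge and sort grouped sections whose names share the prefix before $, so sentinels placed in $A and $Z end up bracketing the entries placed in $M:

  #include <cstdio>

  struct OffloadEntry { const char *Name; };  // illustrative layout only

  #pragma section("offl$A", read)
  #pragma section("offl$M", read)
  #pragma section("offl$Z", read)

  __declspec(allocate("offl$A")) static OffloadEntry EntriesBegin = {nullptr};
  __declspec(allocate("offl$Z")) static OffloadEntry EntriesEnd = {nullptr};
  // Real entries, one or more per TU, land between the two sentinels.
  __declspec(allocate("offl$M")) static OffloadEntry Entry = {"kernel_foo"};

  int main() {
    for (OffloadEntry *E = &EntriesBegin + 1; E != &EntriesEnd; ++E)
      if (E->Name)  // skip any zero padding the linker inserts between objects
        std::printf("registering %s\n", E->Name);
  }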

jhuber6 added inline comments.Apr 26 2022, 12:09 PM
clang/lib/Driver/ToolChains/Clang.cpp
6242–6243

I believe MachO has an equivalent mechanism, but I'm not familiar with it. For PE/COFF, you can search the ASan code to see how the start / stop symbols are defined on Windows using various pragmas and __declspec(allocate) to set up sections and sort them accordingly.
I would love to have a doc that writes up how to implement this array registration mechanism portably for all major platforms, given that we believe it is possible everywhere.

Thanks for the information; I was having a hard time figuring out whether it was possible to implement this on other platforms. Some documentation for handling this on each platform would definitely be useful, as I am hoping this can become the standard way to compile and register offloading languages in LLVM. Let me know if I can do anything to help on that front.

This revision was landed with ongoing or failed builds.Apr 29 2022, 6:15 AM
This revision was automatically updated to reflect the committed changes.