This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][OpenMP] Fix the new driver crashing on multiple device-only outputs
ClosedPublic

Authored by jhuber6 on Aug 19 2022, 8:57 AM.

Details

Summary

The new driver supports device-only compilation for the offloading
device. The way this is handled is a little different from the old
offloading driver. The old driver would put all the outputs in the final
action list, akin to a linker job. The new driver, however, generates
these in the middle of the host's job, so we instead put them all in a
single offloading action. However, we only handled these kinds of
offloading actions correctly when there was a single input. With
multiple inputs we would instead attempt to get the host job, which
didn't exist, and crash.

This patch adds some extra logic to generate the jobs for all
dependencies if there is no host action.
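As a minimal sketch, an invocation of the shape that previously crashed (flag spellings assumed from clang's offloading options; file and architecture names are illustrative):

clang++ -x cuda axpy.cu --offload-new-driver --cuda-device-only --offload-arch=sm_70 --offload-arch=sm_80 -c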

Diff Detail

Event Timeline

jhuber6 created this revision.Aug 19 2022, 8:57 AM
Herald added a project: Restricted Project.Aug 19 2022, 8:57 AM
jhuber6 requested review of this revision.Aug 19 2022, 8:57 AM
jhuber6 updated this revision to Diff 454036.Aug 19 2022, 9:02 AM

Forgot to use the new driver in the test.

tra added a comment.Aug 19 2022, 10:43 AM

> The old driver would put all the outputs in the final action list akin to a linker job.

IIRC that's where HIP and CUDA behaved differently. CUDA compilation does not allow device-only compilation for multiple targets if an output has been explicitly specified. It does produce individual per-GPU .o files if compiled without -o.

bin/clang++ --cuda-path=$HOME/local/cuda-11.7 --offload-arch=sm_80 --offload-arch=sm_86 -x cuda axpy.cu  --cuda-device-only -O3  -c -o axpy.o
clang-15: error: cannot specify -o when generating multiple output files
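
For comparison, the same compile without -o succeeds, producing the per-GPU object files described above (output file names are driver-dependent):

bin/clang++ --cuda-path=$HOME/local/cuda-11.7 --offload-arch=sm_80 --offload-arch=sm_86 -x cuda axpy.cu --cuda-device-only -O3 -c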
clang/test/Driver/cuda-bindings.cu
160

If we've specified -o foo.o, where do those multiple outputs go? The old driver disallowed using -o when compiling for multiple GPUs.

jhuber6 added a comment.

> CUDA compilation does not allow device-only compilation for multiple targets if an output has been explicitly specified.

Is this an architectural limitation? I'd imagine they'd just behave the same way here in my implementation.

clang/test/Driver/cuda-bindings.cu
160

Good catch; right now it'll just write both of them to the same file.

tra added a comment.Aug 19 2022, 11:45 AM

> Is this an architectural limitation? I'd imagine they'd just behave the same way here in my implementation.

The constraint here is that we have to stick with a single output per compiler invocation, and the format of that output should be consistent. E.g. for C++ we'd expect to see an ELF file when we compile with -c, and text assembly when we compile with -S.

We could pack GPU objects into a fat binary, but for consistency it would have to be done for single-target compilations, too. Packing a single object into a fat binary would make little sense, but producing an object file or a fat binary depending on the number of targets would be inconsistent.
Similarly, compilation with -S gets tricky -- do you bundle the text output? That would not be particularly useful, as presumably one would want to examine the assembly. We could concatenate the ASM files, but that would produce an assembly source we can't actually assemble.
On top of that, CUDA compilation has been around for a while, and changing the output format would be somewhat disruptive.

In the end, CUDA stuck with erroring out when -o has been specified but multiple outputs would need to be produced.
HIP grew a --[no-]gpu-bundle-output option to control whether to bundle the outputs of device-only compilation.
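
As an illustrative sketch of the HIP side (the --gpu-bundle-output option is the one named above; the rest of the invocation is assumed), a bundled device-only compile looks roughly like:

clang++ -x hip axpy.hip --offload-arch=gfx906 --offload-arch=gfx908 --cuda-device-only --gpu-bundle-output -c -o axpy.o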

jhuber6 added a comment.

Thanks for the background. I'm assuming HIP did this because they use the old clang-offload-bundler, which supports bundling multiple file types, while my new method relies on having some LLVM-IR to embed things in. I wasn't a huge fan of outputting bundles because it meant we couldn't do things like clang -o - | opt or similar. For my implementation I will probably make HIP do what CUDA does, as I feel that is more reasonable, unless someone has a major objection.
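
For example, the kind of pipeline that bundled outputs rule out but single-file outputs allow (a sketch; flags assumed from the discussion above, with a single target so only one output is produced):

clang++ -x cuda axpy.cu --offload-new-driver --cuda-device-only --offload-arch=sm_80 -S -emit-llvm -o - | opt -S -O2 -o -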

jhuber6 updated this revision to Diff 454076.Aug 19 2022, 11:59 AM

Updating to error with -o and multiple files.

tra added a comment.Aug 19 2022, 2:18 PM

I'm OK with that.

@yaxunl -- what are your thoughts on whether this approach would work for HIP? On one hand HIP already has a lot of features that the new driver is intended to provide, so AMD may have no pressure to change to something else. On the other hand, long term it would make sense to unify the driver pipeline across the different offloading mechanisms we have now.

yaxunl added a comment.

I am OK as long as it works for HIP and does not break the old driver.

clang/test/Driver/cuda-bindings.cu
154

Can you add similar tests for HIP? Thanks.

jhuber6 updated this revision to Diff 454839.Aug 23 2022, 7:32 AM

Adding HIP test

yaxunl accepted this revision.Aug 23 2022, 7:51 AM

LGTM. Thanks.

This revision is now accepted and ready to land.Aug 23 2022, 7:51 AM