This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Do not embed a fatbinary when using the new driver
ClosedPublic

Authored by jhuber6 on Jun 23 2022, 7:10 AM.

Details

Summary

Previously, when using the new driver we created a fatbinary containing the
PTX and cubin output. This was mainly done in an attempt to provide some
backwards compatibility with the existing CUDA support that embeds the
fatbinary in each TU. Fully implementing that would most likely be more work
than it is worth. The linker wrapper cannot do anything with these
embedded PTX files because we do not know how to link them, and if we
did want to include multiple files they should go through the
clang-offload-packager instead. This also did not respect the setting
that disables embedding PTX (although that setting was not used anyway).
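
For reference, a rough sketch of what routing multiple device images through clang-offload-packager would look like; the input file names and the sm_70 architecture here are only illustrative, and the exact --image key/value syntax is my understanding of the tool rather than something stated in this review:

clang-offload-packager -o device_images.bin \
  --image=file=a.cubin,triple=nvptx64-nvidia-cuda,arch=sm_70,kind=cuda \
  --image=file=b.cubin,triple=nvptx64-nvidia-cuda,arch=sm_70,kind=cuda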

Diff Detail

Event Timeline

jhuber6 created this revision.Jun 23 2022, 7:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2022, 7:10 AM
jhuber6 requested review of this revision.Jun 23 2022, 7:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2022, 7:10 AM
jhuber6 updated this revision to Diff 439384.Jun 23 2022, 7:11 AM

Remove comment that is no longer true now that getInputFilename always returns a .cubin variant for object types.

tra accepted this revision.Jun 23 2022, 10:54 AM

The linker wrapper cannot do anything with these embedded PTX files because we do not know how to link them,

Neither, apparently, does nvlink. It does have an --emit-ptx <file> option, but only if LTO is enabled, which matches the new driver behavior.

This revision is now accepted and ready to land.Jun 23 2022, 10:54 AM

The linker wrapper cannot do anything with these embedded PTX files because we do not know how to link them,

Neither, apparently, does nvlink. It does have an --emit-ptx <file> option, but only if LTO is enabled, which matches the new driver behavior.

Thanks for the review. I'm not sure exactly how CUDA does it, but for their RDC support they do somehow link PTX from multiple TUs at runtime for JIT. I'm guessing they just compile each file upon initialization and link them with nvlink. I think using LTO for JIT support is the saner option in that case.
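
Roughly what that guessed runtime approach would look like if done by hand with the CUDA tools; the file names and sm_70 are placeholders, and the exact ptxas/nvlink flags are my assumption rather than anything confirmed here:

ptxas -arch=sm_70 -c a.ptx -o a.cubin   # compile each TU's PTX to a relocatable cubin
ptxas -arch=sm_70 -c b.ptx -o b.cubin
nvlink -arch=sm_70 a.cubin b.cubin -o linked.cubin   # device-link the relocatable cubins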

This revision was landed with ongoing or failed builds.Jun 23 2022, 12:40 PM
This revision was automatically updated to reflect the committed changes.
tra added a comment.Aug 5 2022, 11:26 AM

This change breaks clang++ --cuda-device-only compilation. Clang does not create any output in this case. Reverting the change fixes the problem.

Reproducible with:

echo '__global__ void k(){}' | bin/clang++  --offload-arch=sm_70 -x cuda -  --cuda-device-only -v  -c -o foo123.o

Compilation succeeds, but there's no foo123.o to be found.

This change breaks clang++ --cuda-device-only compilation. Clang does not create any output in this case. Reverting the change fixes the problem.

Reproducible with:

echo '__global__ void k(){}' | bin/clang++  --offload-arch=sm_70 -x cuda -  --cuda-device-only -v  -c -o foo123.o

Compilation succeeds, but there's no foo123.o to be found.

Is it spitting it out as foo123.cubin instead?

tra added a comment.Aug 5 2022, 11:49 AM

Is it spitting it out as foo123.cubin instead?

That's the output name it passes to ptxas, but it's treated as a temporary file and is removed at the end, so the user gets nothing.