This is an archive of the discontinued LLVM Phabricator instance.

[Offload][OpenMP][CUDA] Allow fembed-bitcode for device offload
Needs Revision · Public

Authored by jdoerfert on Apr 15 2021, 4:53 PM.

Details

Reviewers
tra
bollu
Summary

This is a fix for the problem reported here:
https://lists.llvm.org/pipermail/llvm-dev/2021-March/149529.html

That is, the target information was missing when we embedded bitcode and
that caused the NVPTX backend to fail.
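A rough reproducer (a hedged sketch; the exact invocation and file name are assumptions, not copied verbatim from the report):

  clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -fembed-bitcode=all -c offload.cpp

The device-side compile step that consumes the embedded bitcode is missing the target information, and the NVPTX backend fails on it.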

Diff Detail

Event Timeline

jdoerfert created this revision. Apr 15 2021, 4:53 PM
jdoerfert requested review of this revision. Apr 15 2021, 4:53 PM
Herald added a project: Restricted Project. Apr 15 2021, 4:53 PM
Herald added a subscriber: sstefan1.

I'm not really sure about the test; my local setup didn't have CUDA attached properly, but this should work in principle ;)

tra added inline comments. Apr 16 2021, 9:58 AM
clang/lib/Driver/ToolChains/Clang.cpp
4442–4446

This duplicates code that appears a bit further down in the function. I think you should just set -target-cpu for everyone before diving into the if (embedBitcodeInObject) branch.

clang/test/Driver/embed-bitcode-nvptx.cu
1

This command line looks extremely odd to me.
If you are compiling with --cuda-device-only, then clang should've already set the right triple and the features.

Could you tell me more about what the intent of the compilation is and why you use this particular set of options?
I.e., why not just do clang -x cuda --offload-arch=sm_70 --cuda-device-only -nocudalib -nocudainc?
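For illustration, a driver test along those lines could look roughly like this (a sketch, not the final test; a stronger check would pin -target-cpu to the second, IR-to-PTX cc1 invocation rather than the first one):

  // RUN: %clang -### -x cuda --offload-arch=sm_70 --cuda-device-only \
  // RUN:   -nocudalib -nocudainc -fembed-bitcode=all -c %s 2>&1 | FileCheck %s
  // CHECK: "-target-cpu" "sm_70"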

jdoerfert added inline comments. Apr 16 2021, 10:42 AM
clang/lib/Driver/ToolChains/Clang.cpp
4442–4446

Fair. I'll update it.

clang/test/Driver/embed-bitcode-nvptx.cu
1

Could you tell me more about what the intent of the compilation is and why you use this particular set of options?

Because I never really compiled CUDA ;)

I'll go with your options.

tra requested changes to this revision. Apr 16 2021, 11:23 AM
tra added inline comments.
clang/test/Driver/embed-bitcode-nvptx.cu
1

Something still does not add up.

AFAICT, the real problem is not that we're not adding -target-cpu, but rather that -fembed-bitcode=all splits the -S compilation into two phases -- source-to-bitcode (this part gets all the right command-line options and compiles fine) and IR->PTX, which ends up with only a subset of the options and fails because the intrinsics are not enabled.

I think what we want to do in this case is to prevent splitting GPU-side compilation. Adding '-target-cpu' to the IR->PTX subcompilation may make things work in this case, but it does not really fix the root cause. E.g. we should also pass through the features set by the driver and, possibly, other options to keep both the source->IR and IR->PTX compilations in sync.
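To illustrate the split (simplified; not the literal -### output), the first phase is something like

  clang -cc1 -triple nvptx64-nvidia-cuda -target-cpu sm_70 -emit-llvm-bc foo.cu

while the embed-bitcode backend phase becomes roughly

  clang -cc1 -triple nvptx64-nvidia-cuda -x ir -S -fembed-bitcode=all foo.bc

i.e. without -target-cpu/-target-feature, which is where the intrinsics end up disabled.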

This revision now requires changes to proceed. Apr 16 2021, 11:23 AM
jdoerfert added inline comments. Apr 16 2021, 2:13 PM
clang/test/Driver/embed-bitcode-nvptx.cu
1

I think what we want to do in this case is to prevent splitting GPU-side compilation.

I doubt that is as easy as it sounds. Where would we take the IR from then? (I want the GPU IR embedded, after all.)

E.g. we should also pass through the features set by the driver and ..

I agree. What if I move the embedding handling to the end, keep the "blacklist" that removes arguments we don't want, and see where that leads us?

tra added inline comments. Apr 16 2021, 2:36 PM
clang/test/Driver/embed-bitcode-nvptx.cu
1

Ah, so you do grab the intermediate IR. I assume that the PTX does get used, too.

Another way to deal with this may be to do two independent compilations -- source-to-IR and source-to-PTX. Each would use the standard compilation flags. The downside is that parsing and optimization time will double, so split compilation combined with filtering args is probably more practical.
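Roughly (assuming plain driver invocations; file names are placeholders):

  clang -x cuda --offload-arch=sm_70 --cuda-device-only -nocudalib -nocudainc -S -emit-llvm foo.cu -o foo.ll
  clang -x cuda --offload-arch=sm_70 --cuda-device-only -nocudalib -nocudainc -S foo.cu -o foo.ptx

i.e. one full compilation to get the IR to embed and a second full compilation to get the PTX, each with the complete standard flag set.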