This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Improve CUDA compilation pipeline creation.
Closed, Public

Authored by tra on Jul 16 2015, 3:07 PM.


Details

Reviewers

dblaikie
echristo
bogner

Commits

rGf8144ab44fba: [CUDA] Improve CUDA compilation pipeline creation.
rC246174: [CUDA] Improve CUDA compilation pipeline creation.
rL246174: [CUDA] Improve CUDA compilation pipeline creation.

Summary

The current implementation tries to guess which Action will result in a job that needs to incorporate device-side GPU binaries. The guessing attempted to work around the fact that multiple actions may be combined into a single compiler invocation. If CudaHostAction ends up being combined (and thus bypassed during action list traversal), none of the device-side actions it points to are processed. The guessing worked for most of the usual cases, but fell apart when an external assembler was used.

This change removes the guessing and makes sure we create and pass device-side jobs regardless of how the jobs get combined.

CudaHostAction is always inserted either at the Compile phase or at the FinalPhase of the current compilation, whichever happens first.
If selectToolForJob combines CudaHostAction with other actions, it passes info about the CudaHostAction up to the caller.
When it sees that CudaHostAction got combined with other actions (and hence will never be passed to BuildJobsForAction), BuildJobsForAction creates device-side jobs the same way they would be created if CudaHostAction had been passed to BuildJobsForAction directly.
Added two more test cases to make sure GPU binaries are passed to the correct jobs.
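
The collapsing behaviour described above can be illustrated with a small self-contained model. This is only a sketch under stated assumptions: `Action`, `Job`, `Builder`, `selectTool`, and `runDemo` are hypothetical stand-ins invented for illustration, not clang's Driver classes, and only the "backend combined with compile" case is modeled.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for the driver's action graph (hypothetical, not clang's API).
enum class Kind { Input, Compile, Backend, CudaHost };

struct Action {
  Kind K;
  std::string Name;
  std::vector<const Action *> Inputs;        // host-side inputs
  std::vector<const Action *> DeviceActions; // only used when K == CudaHost
};

struct Job {
  std::string Tool;
  std::vector<std::string> DeviceInputs; // GPU binaries handed to this job
};

// Picks a tool for JA. When a backend action is combined with the preceding
// compile action, a CudaHost wrapper sitting between them is skipped and
// reported back through CollapsedCHA so the caller can still handle it.
std::string selectTool(const Action &JA,
                       const std::vector<const Action *> *&Inputs,
                       const Action *&CollapsedCHA, bool SaveTemps) {
  CollapsedCHA = nullptr;
  if (JA.K == Kind::Backend && !SaveTemps) {
    assert(Inputs->size() == 1);
    const Action *In = Inputs->front();
    const Action *CHA = In->K == Kind::CudaHost ? In : nullptr;
    const Action *CompileJA = CHA ? CHA->Inputs.front() : In;
    assert(CompileJA->K == Kind::Compile);
    Inputs = &CompileJA->Inputs; // bypass the combined actions
    CollapsedCHA = CHA;
    return "clang(compile+backend)";
  }
  return "tool-for-" + JA.Name;
}

struct Builder {
  bool SaveTemps;
  std::vector<Job> Jobs;
  std::vector<std::string> PendingDevice; // device outputs awaiting a host job

  void addDeviceJobs(const Action &CHA) {
    for (const Action *DA : CHA.DeviceActions) {
      Jobs.push_back({"clang(device:" + DA->Name + ")", {}});
      PendingDevice.push_back(DA->Name + ".ptx");
    }
  }

  void build(const Action *A) {
    if (A->K == Kind::Input)
      return;
    if (A->K == Kind::CudaHost) { // visited directly, nothing was collapsed
      addDeviceJobs(*A);
      A = A->Inputs.front();      // continue with the wrapped host action
    }
    const std::vector<const Action *> *Inputs = &A->Inputs;
    const Action *CollapsedCHA = nullptr;
    std::string Tool = selectTool(*A, Inputs, CollapsedCHA, SaveTemps);
    if (CollapsedCHA)             // wrapper was bypassed: handle it here
      addDeviceJobs(*CollapsedCHA);
    std::vector<std::string> Device = std::move(PendingDevice);
    PendingDevice.clear();
    for (const Action *In : *Inputs)
      build(In);
    Jobs.push_back({Tool, Device});
  }
};

// Builds input -> compile -> cuda-host(sm_35) -> backend and returns the jobs.
std::vector<Job> runDemo(bool SaveTemps) {
  Action In{Kind::Input, "input", {}, {}};
  Action Dev{Kind::Compile, "sm_35", {}, {}};
  Action Compile{Kind::Compile, "compile", {&In}, {}};
  Action CHA{Kind::CudaHost, "cuda-host", {&Compile}, {&Dev}};
  Action Backend{Kind::Backend, "backend", {&CHA}, {}};
  Builder B{SaveTemps, {}, {}};
  B.build(&Backend);
  return B.Jobs;
}
```

In this model, `runDemo(false)` combines compile and backend into one job and `runDemo(true)` keeps them separate; in both cases the job that performs host compilation receives the device output, which is the property the change is meant to guarantee.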

Diff Detail

Repository: rL LLVM

Event Timeline

tra updated this revision to Diff 29947. Jul 16 2015, 3:07 PM

tra retitled this revision to [CUDA] Improve CUDA compilation pipeline creation.

tra updated this object.

tra added reviewers: echristo, bogner, dblaikie.

tra added a subscriber: cfe-commits.

tra added a child revision: D11310: [CUDA] Moved device-side triple calculation to buildCudaActions. Jul 17 2015, 1:45 PM

Manuel - just got this private email from Phab. Seems that should've gone
to all the subscribers & ended up on the mailing list, but didn't?

tra removed a child revision: D11310: [CUDA] Moved device-side triple calculation to buildCudaActions. Jul 20 2015, 2:35 PM

tra mentioned this in D11310: [CUDA] Moved device-side triple calculation to buildCudaActions. Jul 20 2015, 2:39 PM

This seems like a decent incremental improvement. I think we still need to separate out the pipeline a bit further and have the cuda compilation just be separate actions that don't need this "inject" bit, but this can at least prep the code for that cleanup.

-eric

This revision is now accepted and ready to land. Aug 24 2015, 1:25 PM

Closed by commit rL246174: [CUDA] Improve CUDA compilation pipeline creation. (authored by tra). Aug 27 2015, 11:11 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                     Size
cfe/trunk/lib/Driver/Driver.cpp          91 lines
cfe/trunk/test/Driver/cuda-options.cu    73 lines

Diff 33341

cfe/trunk/lib/Driver/Driver.cpp

[1,227 unchanged lines elided]	void Driver::BuildInputs(const ToolChain &TC, DerivedArgList &Args,
if (CCCIsCPP() && Inputs.empty()) {		if (CCCIsCPP() && Inputs.empty()) {
// If called as standalone preprocessor, stdin is processed		// If called as standalone preprocessor, stdin is processed
// if no other input is present.		// if no other input is present.
Arg *A = MakeInputArg(Args, Opts, "-");		Arg *A = MakeInputArg(Args, Opts, "-");
Inputs.push_back(std::make_pair(types::TY_C, A));		Inputs.push_back(std::make_pair(types::TY_C, A));
}		}
}		}

// For each unique --cuda-gpu-arch= argument creates a TY_CUDA_DEVICE input		// For each unique --cuda-gpu-arch= argument creates a TY_CUDA_DEVICE
// action and then wraps each in CudaDeviceAction paired with appropriate GPU		// input action and then wraps each in CudaDeviceAction paired with
// arch name. If we're only building device-side code, each action remains		// appropriate GPU arch name. In case of partial (i.e preprocessing
// independent. Otherwise we pass device-side actions as inputs to a new		// only) or device-only compilation, each device action is added to /p
// CudaHostAction which combines both host and device side actions.		// Actions and /p Current is released. Otherwise the function creates
		// and returns a new CudaHostAction which wraps /p Current and device
		// side actions.
static std::unique_ptr<Action>		static std::unique_ptr<Action>
buildCudaActions(const Driver &D, const ToolChain &TC, DerivedArgList &Args,		buildCudaActions(const Driver &D, const ToolChain &TC, DerivedArgList &Args,
const Arg *InputArg, std::unique_ptr<Action> HostAction,		const Arg *InputArg, std::unique_ptr<Action> HostAction,
ActionList &Actions) {		ActionList &Actions) {

// Collect all cuda_gpu_arch parameters, removing duplicates.		// Collect all cuda_gpu_arch parameters, removing duplicates.
SmallVector<const char *, 4> GpuArchList;		SmallVector<const char *, 4> GpuArchList;
llvm::StringSet<> GpuArchNames;		llvm::StringSet<> GpuArchNames;
[167 unchanged lines elided]	if (InitialPhase > FinalPhase) {
Diag(clang::diag::warn_drv_input_file_unused)		Diag(clang::diag::warn_drv_input_file_unused)
<< InputArg->getAsString(Args) << getPhaseName(InitialPhase)		<< InputArg->getAsString(Args) << getPhaseName(InitialPhase)
<< !!FinalPhaseArg		<< !!FinalPhaseArg
<< (FinalPhaseArg ? FinalPhaseArg->getOption().getName() : "");		<< (FinalPhaseArg ? FinalPhaseArg->getOption().getName() : "");
continue;		continue;
}		}

phases::ID CudaInjectionPhase;		phases::ID CudaInjectionPhase;
if (isSaveTempsEnabled()) {		bool InjectCuda = (InputType == types::TY_CUDA &&
// All phases are done independently, inject GPU blobs during compilation		!Args.hasArg(options::OPT_cuda_host_only));
// phase as that's where we generate glue code to init them.
CudaInjectionPhase = phases::Compile;
} else {
// Assumes that clang does everything up until linking phase, so we inject
// cuda device actions at the last step before linking. Otherwise CUDA
// host action forces preprocessor into a separate invocation.
CudaInjectionPhase = FinalPhase;		CudaInjectionPhase = FinalPhase;
if (FinalPhase == phases::Link)		for (auto &Phase : PL)
for (auto PI = PL.begin(), PE = PL.end(); PI != PE; ++PI) {		if (Phase <= FinalPhase && Phase == phases::Compile) {
auto next = PI + 1;		CudaInjectionPhase = Phase;
if (next != PE && *next == phases::Link)		break;
CudaInjectionPhase = *PI;
}
}		}

// Build the pipeline for this file.		// Build the pipeline for this file.
std::unique_ptr<Action> Current(new InputAction(*InputArg, InputType));		std::unique_ptr<Action> Current(new InputAction(*InputArg, InputType));
for (SmallVectorImpl<phases::ID>::iterator i = PL.begin(), e = PL.end();		for (SmallVectorImpl<phases::ID>::iterator i = PL.begin(), e = PL.end();
i != e; ++i) {		i != e; ++i) {
phases::ID Phase = *i;		phases::ID Phase = *i;

// We are done if this step is past what the user requested.		// We are done if this step is past what the user requested.
[11 unchanged lines elided]	for (SmallVectorImpl<phases::ID>::iterator i = PL.begin(), e = PL.end();
// encode this in the steps because the intermediate type depends on		// encode this in the steps because the intermediate type depends on
// arguments. Just special case here.		// arguments. Just special case here.
if (Phase == phases::Assemble && Current->getType() != types::TY_PP_Asm)		if (Phase == phases::Assemble && Current->getType() != types::TY_PP_Asm)
continue;		continue;

// Otherwise construct the appropriate action.		// Otherwise construct the appropriate action.
Current = ConstructPhaseAction(TC, Args, Phase, std::move(Current));		Current = ConstructPhaseAction(TC, Args, Phase, std::move(Current));

if (InputType == types::TY_CUDA && Phase == CudaInjectionPhase &&		if (InjectCuda && Phase == CudaInjectionPhase) {
!Args.hasArg(options::OPT_cuda_host_only)) {
Current = buildCudaActions(*this, TC, Args, InputArg,		Current = buildCudaActions(*this, TC, Args, InputArg,
std::move(Current), Actions);		std::move(Current), Actions);
if (!Current)		if (!Current)
break;		break;
}		}

if (Current->getType() == types::TY_Nothing)		if (Current->getType() == types::TY_Nothing)
break;		break;
[197 unchanged lines elided]	if (!A->isClaimed()) {
}		}

Diag(clang::diag::warn_drv_unused_argument)		Diag(clang::diag::warn_drv_unused_argument)
<< A->getAsString(C.getArgs());		<< A->getAsString(C.getArgs());
}		}
}		}
}		}

static const Tool *SelectToolForJob(Compilation &C, bool SaveTemps,		// Returns a Tool for a given JobAction. In case the action and its
		// predecessors can be combined, updates Inputs with the inputs of the
		// first combined action. If one of the collapsed actions is a
		// CudaHostAction, updates CollapsedCHA with the pointer to it so the
		// caller can deal with extra handling such action requires.
		static const Tool *selectToolForJob(Compilation &C, bool SaveTemps,
const ToolChain *TC, const JobAction *JA,		const ToolChain *TC, const JobAction *JA,
const ActionList *&Inputs) {		const ActionList *&Inputs,
		const CudaHostAction *&CollapsedCHA) {
const Tool *ToolForJob = nullptr;		const Tool *ToolForJob = nullptr;
		CollapsedCHA = nullptr;

// See if we should look for a compiler with an integrated assembler. We match		// See if we should look for a compiler with an integrated assembler. We match
// bottom up, so what we are actually looking for is an assembler job with a		// bottom up, so what we are actually looking for is an assembler job with a
// compiler input.		// compiler input.

if (TC->useIntegratedAs() && !SaveTemps &&		if (TC->useIntegratedAs() && !SaveTemps &&
!C.getArgs().hasArg(options::OPT_via_file_asm) &&		!C.getArgs().hasArg(options::OPT_via_file_asm) &&
!C.getArgs().hasArg(options::OPT__SLASH_FA) &&		!C.getArgs().hasArg(options::OPT__SLASH_FA) &&
!C.getArgs().hasArg(options::OPT__SLASH_Fa) &&		!C.getArgs().hasArg(options::OPT__SLASH_Fa) &&
isa<AssembleJobAction>(JA) && Inputs->size() == 1 &&		isa<AssembleJobAction>(JA) && Inputs->size() == 1 &&
isa<BackendJobAction>(*Inputs->begin())) {		isa<BackendJobAction>(*Inputs->begin())) {
// A BackendJob is always preceded by a CompileJob, and without		// A BackendJob is always preceded by a CompileJob, and without
// -save-temps they will always get combined together, so instead of		// -save-temps they will always get combined together, so instead of
// checking the backend tool, check if the tool for the CompileJob		// checking the backend tool, check if the tool for the CompileJob
// has an integrated assembler.		// has an integrated assembler.
const ActionList *BackendInputs = &(*Inputs)[0]->getInputs();		const ActionList *BackendInputs = &(*Inputs)[0]->getInputs();
JobAction *CompileJA = cast<CompileJobAction>(*BackendInputs->begin());		// Compile job may be wrapped in CudaHostAction, extract it if
		// that's the case and update CollapsedCHA if we combine phases.
		CudaHostAction *CHA = dyn_cast<CudaHostAction>(*BackendInputs->begin());
		JobAction *CompileJA =
		cast<CompileJobAction>(CHA ? *CHA->begin() : *BackendInputs->begin());
		assert(CompileJA && "Backend job is not preceeded by compile job.");
const Tool *Compiler = TC->SelectTool(*CompileJA);		const Tool *Compiler = TC->SelectTool(*CompileJA);
if (!Compiler)		if (!Compiler)
return nullptr;		return nullptr;
if (Compiler->hasIntegratedAssembler()) {		if (Compiler->hasIntegratedAssembler()) {
Inputs = &(*BackendInputs)[0]->getInputs();		Inputs = &CompileJA->getInputs();
ToolForJob = Compiler;		ToolForJob = Compiler;
		CollapsedCHA = CHA;
}		}
}		}

// A backend job should always be combined with the preceding compile job		// A backend job should always be combined with the preceding compile job
// unless OPT_save_temps is enabled and the compiler is capable of emitting		// unless OPT_save_temps is enabled and the compiler is capable of emitting
// LLVM IR as an intermediate output.		// LLVM IR as an intermediate output.
if (isa<BackendJobAction>(JA)) {		if (isa<BackendJobAction>(JA)) {
// Check if the compiler supports emitting LLVM IR.		// Check if the compiler supports emitting LLVM IR.
assert(Inputs->size() == 1);		assert(Inputs->size() == 1);
JobAction *CompileJA;		// Compile job may be wrapped in CudaHostAction, extract it if
// Extract real host action, if it's a CudaHostAction.		// that's the case and update CollapsedCHA if we combine phases.
if (CudaHostAction *CudaHA = dyn_cast<CudaHostAction>(*Inputs->begin()))		CudaHostAction *CHA = dyn_cast<CudaHostAction>(*Inputs->begin());
CompileJA = cast<CompileJobAction>(*CudaHA->begin());		JobAction *CompileJA =
else		cast<CompileJobAction>(CHA ? *CHA->begin() : *Inputs->begin());
CompileJA = cast<CompileJobAction>(*Inputs->begin());		assert(CompileJA && "Backend job is not preceeded by compile job.");

const Tool *Compiler = TC->SelectTool(*CompileJA);		const Tool *Compiler = TC->SelectTool(*CompileJA);
if (!Compiler)		if (!Compiler)
return nullptr;		return nullptr;
if (!Compiler->canEmitIR() || !SaveTemps) {		if (!Compiler->canEmitIR() || !SaveTemps) {
Inputs = &(*Inputs)[0]->getInputs();		Inputs = &CompileJA->getInputs();
ToolForJob = Compiler;		ToolForJob = Compiler;
		CollapsedCHA = CHA;
}		}
}		}

// Otherwise use the tool for the current job.		// Otherwise use the tool for the current job.
if (!ToolForJob)		if (!ToolForJob)
ToolForJob = TC->SelectTool(*JA);		ToolForJob = TC->SelectTool(*JA);

// See if we should use an integrated preprocessor. We do so when we have		// See if we should use an integrated preprocessor. We do so when we have
[67 unchanged lines elided]	BuildJobsForAction(
CDA->getGpuArchName(), CDA->isAtTopLevel(),		CDA->getGpuArchName(), CDA->isAtTopLevel(),
/*MultipleArchs*/ true, LinkingOutput, Result);		/*MultipleArchs*/ true, LinkingOutput, Result);
return;		return;
}		}

const ActionList *Inputs = &A->getInputs();		const ActionList *Inputs = &A->getInputs();

const JobAction *JA = cast<JobAction>(A);		const JobAction *JA = cast<JobAction>(A);
const Tool *T = SelectToolForJob(C, isSaveTempsEnabled(), TC, JA, Inputs);		const CudaHostAction *CollapsedCHA = nullptr;
		const Tool *T =
		selectToolForJob(C, isSaveTempsEnabled(), TC, JA, Inputs, CollapsedCHA);
if (!T)		if (!T)
return;		return;

		// If we've collapsed action list that contained CudaHostAction we
		// need to build jobs for device-side inputs it may have held.
		if (CollapsedCHA) {
		InputInfo II;
		for (const Action *DA : CollapsedCHA->getDeviceActions()) {
		BuildJobsForAction(C, DA, TC, "", AtTopLevel,
		/*MultipleArchs*/ false, LinkingOutput, II);
		CudaDeviceInputInfos.push_back(II);
		}
		}

// Only use pipes when there is exactly one input.		// Only use pipes when there is exactly one input.
InputInfoList InputInfos;		InputInfoList InputInfos;
for (const Action *Input : *Inputs) {		for (const Action *Input : *Inputs) {
// Treat dsymutil and verify sub-jobs as being at the top-level too, they		// Treat dsymutil and verify sub-jobs as being at the top-level too, they
// shouldn't get temporary output names.		// shouldn't get temporary output names.
// FIXME: Clean this up.		// FIXME: Clean this up.
bool SubJobAtTopLevel = false;		bool SubJobAtTopLevel = false;
if (AtTopLevel && (isa<DsymutilJobAction>(A) || isa<VerifyJobAction>(A)))		if (AtTopLevel && (isa<DsymutilJobAction>(A) || isa<VerifyJobAction>(A)))
[505 unchanged lines elided]
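
The new CudaInjectionPhase selection in the Driver.cpp changes above can be summarized with the following standalone sketch. The `Phase` enum is an assumed mirror of the ordering of clang's `phases::ID`, and `pickCudaInjectionPhase` is a hypothetical helper name used only for illustration; the driver performs this selection inline.

```cpp
#include <cassert> // used by the checks in the usage note
#include <vector>

// Assumed mirror of the phase ordering in clang's phases::ID.
enum Phase { Preprocess, Precompile, Compile, Backend, Assemble, Link };

// Returns the phase at which device-side actions get attached: the Compile
// phase if the pipeline reaches it, otherwise the final requested phase.
Phase pickCudaInjectionPhase(const std::vector<Phase> &PL, Phase FinalPhase) {
  Phase Injection = FinalPhase;
  for (Phase P : PL)
    if (P <= FinalPhase && P == Compile) {
      Injection = P;
      break;
    }
  return Injection;
}
```

With a phase list of Preprocess, Compile, Backend, Assemble, Link, a final phase of Link or Assemble yields Compile, while a preprocess-only run (final phase Preprocess) yields Preprocess. This matches the "Compile phase or FinalPhase, whichever happens first" rule from the summary.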

cfe/trunk/test/Driver/cuda-options.cu

	// Tests CUDA compilation pipeline construction in Driver.			// Tests CUDA compilation pipeline construction in Driver.
	// REQUIRES: clang-driver			// REQUIRES: clang-driver
	// REQUIRES: x86-registered-target			// REQUIRES: x86-registered-target
	// REQUIRES: nvptx-registered-target			// REQUIRES: nvptx-registered-target

	// Simple compilation case:			// Simple compilation case:
	// RUN: %clang -### -target x86_64-linux-gnu -c %s 2>&1 \			// RUN: %clang -### -target x86_64-linux-gnu -c %s 2>&1 \
	// Compile device-side to PTX assembly and make sure we use it on the host side.			// Compile device-side to PTX assembly and make sure we use it on the host side.
	// RUN: | FileCheck -check-prefix CUDA-D1 \			// RUN: | FileCheck -check-prefix CUDA-D1 -check-prefix CUDA-D1NS\
	// Then compile host side and incorporate device code.			// Then compile host side and incorporate device code.
	// RUN: -check-prefix CUDA-H -check-prefix CUDA-H-I1 \			// RUN: -check-prefix CUDA-H -check-prefix CUDA-H-I1 \
	// Make sure we don't link anything.			// Make sure we don't link anything.
	// RUN: -check-prefix CUDA-NL %s			// RUN: -check-prefix CUDA-NL %s

	// Typical compilation + link case:			// Typical compilation + link case:
	// RUN: %clang -### -target x86_64-linux-gnu %s 2>&1 \			// RUN: %clang -### -target x86_64-linux-gnu %s 2>&1 \
	// Compile device-side to PTX assembly and make sure we use it on the host side			// Compile device-side to PTX assembly and make sure we use it on the host side
	// RUN: | FileCheck -check-prefix CUDA-D1 \			// RUN: | FileCheck -check-prefix CUDA-D1 -check-prefix CUDA-D1NS\
	// Then compile host side and incorporate device code.			// Then compile host side and incorporate device code.
	// RUN: -check-prefix CUDA-H -check-prefix CUDA-H-I1 \			// RUN: -check-prefix CUDA-H -check-prefix CUDA-H-I1 \
	// Then link things.			// Then link things.
	// RUN: -check-prefix CUDA-L %s			// RUN: -check-prefix CUDA-L %s

	// Verify that -cuda-no-device disables device-side compilation and linking			// Verify that -cuda-no-device disables device-side compilation and linking
	// RUN: %clang -### -target x86_64-linux-gnu --cuda-host-only %s 2>&1 \			// RUN: %clang -### -target x86_64-linux-gnu --cuda-host-only %s 2>&1 \
	// Make sure we didn't run device-side compilation.			// Make sure we didn't run device-side compilation.
	// RUN: | FileCheck -check-prefix CUDA-ND \			// RUN: | FileCheck -check-prefix CUDA-ND \
	// Then compile host side and make sure we don't attempt to incorporate GPU code.			// Then compile host side and make sure we don't attempt to incorporate GPU code.
	// RUN: -check-prefix CUDA-H -check-prefix CUDA-H-NI \			// RUN: -check-prefix CUDA-H -check-prefix CUDA-H-NI \
	// Linking is allowed to happen, even if we're missing GPU code.			// Linking is allowed to happen, even if we're missing GPU code.
	// RUN: -check-prefix CUDA-L %s			// RUN: -check-prefix CUDA-L %s

	// Verify that -cuda-no-host disables host-side compilation and linking			// Verify that -cuda-no-host disables host-side compilation and linking
	// RUN: %clang -### -target x86_64-linux-gnu --cuda-device-only %s 2>&1 \			// RUN: %clang -### -target x86_64-linux-gnu --cuda-device-only %s 2>&1 \
	// Compile device-side to PTX assembly			// Compile device-side to PTX assembly
	// RUN: | FileCheck -check-prefix CUDA-D1 \			// RUN: | FileCheck -check-prefix CUDA-D1 -check-prefix CUDA-D1NS\
	// Make sure there are no host cmpilation or linking.			// Make sure there are no host cmpilation or linking.
	// RUN: -check-prefix CUDA-NH -check-prefix CUDA-NL %s			// RUN: -check-prefix CUDA-NH -check-prefix CUDA-NL %s

	// Verify that with -S we compile host and device sides to assembly			// Verify that with -S we compile host and device sides to assembly
	// and incorporate device code on the host side.			// and incorporate device code on the host side.
	// RUN: %clang -### -target x86_64-linux-gnu -S -c %s 2>&1 \			// RUN: %clang -### -target x86_64-linux-gnu -S -c %s 2>&1 \
	// Compile device-side to PTX assembly			// Compile device-side to PTX assembly
	// RUN: | FileCheck -check-prefix CUDA-D1 \			// RUN: | FileCheck -check-prefix CUDA-D1 -check-prefix CUDA-D1NS\
	// Then compile host side and incorporate GPU code.			// Then compile host side and incorporate GPU code.
	// RUN: -check-prefix CUDA-H -check-prefix CUDA-H-I1 \			// RUN: -check-prefix CUDA-H -check-prefix CUDA-H-I1 \
	// Make sure we don't link anything.			// Make sure we don't link anything.
	// RUN: -check-prefix CUDA-NL %s			// RUN: -check-prefix CUDA-NL %s

	// Verify that --cuda-gpu-arch option passes correct GPU			// Verify that --cuda-gpu-arch option passes correct GPU
	// archtecture info to device compilation.			// archtecture info to device compilation.
	// RUN: %clang -### -target x86_64-linux-gnu --cuda-gpu-arch=sm_35 -c %s 2>&1 \			// RUN: %clang -### -target x86_64-linux-gnu --cuda-gpu-arch=sm_35 -c %s 2>&1 \
	// Compile device-side to PTX assembly.			// Compile device-side to PTX assembly.
	// RUN: | FileCheck -check-prefix CUDA-D1 -check-prefix CUDA-D1-SM35 \			// RUN: | FileCheck -check-prefix CUDA-D1 -check-prefix CUDA-D1NS \
				// RUN: -check-prefix CUDA-D1-SM35 \
	// Then compile host side and incorporate GPU code.			// Then compile host side and incorporate GPU code.
	// RUN: -check-prefix CUDA-H -check-prefix CUDA-H-I1 \			// RUN: -check-prefix CUDA-H -check-prefix CUDA-H-I1 \
	// Make sure we don't link anything.			// Make sure we don't link anything.
	// RUN: -check-prefix CUDA-NL %s			// RUN: -check-prefix CUDA-NL %s

	// Verify that there is device-side compilation per --cuda-gpu-arch args			// Verify that there is device-side compilation per --cuda-gpu-arch args
	// and that all results are included on the host side.			// and that all results are included on the host side.
	// RUN: %clang -### -target x86_64-linux-gnu --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_30 -c %s 2>&1 \			// RUN: %clang -### -target x86_64-linux-gnu \
				// RUN: --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_30 -c %s 2>&1 \
	// Compile both device-sides to PTX assembly			// Compile both device-sides to PTX assembly
	// RUN: | FileCheck \			// RUN: | FileCheck \
	// RUN: -check-prefix CUDA-D1 -check-prefix CUDA-D1-SM35 \			// RUN: -check-prefix CUDA-D1 -check-prefix CUDA-D1NS -check-prefix CUDA-D1-SM35 \
	// RUN: -check-prefix CUDA-D2 -check-prefix CUDA-D2-SM30 \			// RUN: -check-prefix CUDA-D2 -check-prefix CUDA-D2-SM30 \
	// Then compile host side and incorporate both device-side outputs			// Then compile host side and incorporate both device-side outputs
	// RUN: -check-prefix CUDA-H -check-prefix CUDA-H-I1 -check-prefix CUDA-H-I2 \			// RUN: -check-prefix CUDA-H -check-prefix CUDA-HNS \
				// RUN: -check-prefix CUDA-H-I1 -check-prefix CUDA-H-I2 \
	// Make sure we don't link anything.			// Make sure we don't link anything.
	// RUN: -check-prefix CUDA-NL %s			// RUN: -check-prefix CUDA-NL %s

				// Verify that device-side results are passed to correct tool when
				// -save-temps is used
				// RUN: %clang -### -target x86_64-linux-gnu -save-temps -c %s 2>&1 \
				// Compile device-side to PTX assembly and make sure we use it on the host side.
				// RUN: | FileCheck -check-prefix CUDA-D1 -check-prefix CUDA-D1S \
				// Then compile host side and incorporate device code.
				// RUN: -check-prefix CUDA-H -check-prefix CUDA-HS -check-prefix CUDA-HS-I1 \
				// Make sure we don't link anything.
				// RUN: -check-prefix CUDA-NL %s

				// Verify that device-side results are passed to correct tool when
				// -fno-integrated-as is used
				// RUN: %clang -### -target x86_64-linux-gnu -fno-integrated-as -c %s 2>&1 \
				// Compile device-side to PTX assembly and make sure we use it on the host side.
				// RUN: | FileCheck -check-prefix CUDA-D1 -check-prefix CUDA-D1NS \
				// Then compile host side and incorporate device code.
				// RUN: -check-prefix CUDA-H -check-prefix CUDA-HNS -check-prefix CUDA-HS-I1 \
				// RUN: -check-prefix CUDA-H-AS \
				// Make sure we don't link anything.
				// RUN: -check-prefix CUDA-NL %s

				// Match device-side preprocessor, and compiler phases with -save-temps
				// CUDA-D1S: "-cc1" "-triple" "nvptx{{(64)?}}-nvidia-cuda"
				// CUDA-D1S-SAME: "-fcuda-is-device"
				// CUDA-D1S-SAME: "-x" "cuda"
				// CUDA-D1S: "-cc1" "-triple" "nvptx{{(64)?}}-nvidia-cuda"
				// CUDA-D1S-SAME: "-fcuda-is-device"
				// CUDA-D1S-SAME: "-x" "cuda-cpp-output"

	// --cuda-host-only should never trigger unused arg warning.			// --cuda-host-only should never trigger unused arg warning.
	// RUN: %clang -### -target x86_64-linux-gnu --cuda-host-only -c %s 2>&1 | \			// RUN: %clang -### -target x86_64-linux-gnu --cuda-host-only -c %s 2>&1 | \
	// RUN: FileCheck -check-prefix CUDA-NO-UNUSED-CHO %s			// RUN: FileCheck -check-prefix CUDA-NO-UNUSED-CHO %s
	// RUN: %clang -### -target x86_64-linux-gnu --cuda-host-only -x c -c %s 2>&1 | \			// RUN: %clang -### -target x86_64-linux-gnu --cuda-host-only -x c -c %s 2>&1 | \
	// RUN: FileCheck -check-prefix CUDA-NO-UNUSED-CHO %s			// RUN: FileCheck -check-prefix CUDA-NO-UNUSED-CHO %s

	// --cuda-device-only should not produce warning compiling CUDA files			// --cuda-device-only should not produce warning compiling CUDA files
	// RUN: %clang -### -target x86_64-linux-gnu --cuda-device-only -c %s 2>&1 | \			// RUN: %clang -### -target x86_64-linux-gnu --cuda-device-only -c %s 2>&1 | \
	// RUN: FileCheck -check-prefix CUDA-NO-UNUSED-CDO %s			// RUN: FileCheck -check-prefix CUDA-NO-UNUSED-CDO %s

	// --cuda-device-only should warn during non-CUDA compilation.			// --cuda-device-only should warn during non-CUDA compilation.
	// RUN: %clang -### -target x86_64-linux-gnu --cuda-device-only -x c -c %s 2>&1 | \			// RUN: %clang -### -target x86_64-linux-gnu --cuda-device-only -x c -c %s 2>&1 | \
	// RUN: FileCheck -check-prefix CUDA-UNUSED-CDO %s			// RUN: FileCheck -check-prefix CUDA-UNUSED-CDO %s

	// Match device-side compilation			// Match the job that produces PTX assembly
	// CUDA-D1: "-cc1" "-triple" "nvptx{{(64)?}}-nvidia-cuda"			// CUDA-D1: "-cc1" "-triple" "nvptx{{(64)?}}-nvidia-cuda"
	// CUDA-D1-SAME: "-fcuda-is-device"			// CUDA-D1-SAME: "-fcuda-is-device"
	// CUDA-D1-SM35-SAME: "-target-cpu" "sm_35"			// CUDA-D1-SM35-SAME: "-target-cpu" "sm_35"
	// CUDA-D1-SAME: "-o" "[[GPUBINARY1:[^"]*]]"			// CUDA-D1-SAME: "-o" "[[GPUBINARY1:[^"]*]]"
	// CUDA-D1-SAME: "-x" "cuda"			// CUDA-D1NS-SAME: "-x" "cuda"
				// CUDA-D1S-SAME: "-x" "ir"

	// Match anothe device-side compilation			// Match anothe device-side compilation
	// CUDA-D2: "-cc1" "-triple" "nvptx{{(64)?}}-nvidia-cuda"			// CUDA-D2: "-cc1" "-triple" "nvptx{{(64)?}}-nvidia-cuda"
	// CUDA-D2-SAME: "-fcuda-is-device"			// CUDA-D2-SAME: "-fcuda-is-device"
	// CUDA-D2-SM30-SAME: "-target-cpu" "sm_30"			// CUDA-D2-SM30-SAME: "-target-cpu" "sm_30"
	// CUDA-D2-SAME: "-o" "[[GPUBINARY2:[^"]*]]"			// CUDA-D2-SAME: "-o" "[[GPUBINARY2:[^"]*]]"
	// CUDA-D2-SAME: "-x" "cuda"			// CUDA-D2-SAME: "-x" "cuda"

	// Match no device-side compilation			// Match no device-side compilation
	// CUDA-ND-NOT: "-cc1" "-triple" "nvptx{{64?}}-nvidia-cuda"			// CUDA-ND-NOT: "-cc1" "-triple" "nvptx{{(64)?}}-nvidia-cuda"
	// CUDA-ND-SAME-NOT: "-fcuda-is-device"			// CUDA-ND-SAME-NOT: "-fcuda-is-device"

				// Match host-side preprocessor job with -save-temps
				// CUDA-HS: "-cc1" "-triple"
				// CUDA-HS-SAME-NOT: "nvptx{{(64)?}}-nvidia-cuda"
				// CUDA-HS-SAME-NOT: "-fcuda-is-device"
				// CUDA-HS-SAME: "-x" "cuda"

	// Match host-side compilation			// Match host-side compilation
	// CUDA-H: "-cc1" "-triple"			// CUDA-H: "-cc1" "-triple"
	// CUDA-H-SAME-NOT: "nvptx{{64?}}-nvidia-cuda"			// CUDA-H-SAME-NOT: "nvptx{{(64)?}}-nvidia-cuda"
	// CUDA-H-SAME-NOT: "-fcuda-is-device"			// CUDA-H-SAME-NOT: "-fcuda-is-device"
	// CUDA-H-SAME: "-o" "[[HOSTOBJ:[^"]*]]"			// CUDA-H-SAME: "-o" "[[HOSTOUTPUT:[^"]*]]"
	// CUDA-H-SAME: "-x" "cuda"			// CUDA-HNS-SAME: "-x" "cuda"
				// CUDA-HS-SAME: "-x" "cuda-cpp-output"
	// CUDA-H-I1-SAME: "-fcuda-include-gpubinary" "[[GPUBINARY1]]"			// CUDA-H-I1-SAME: "-fcuda-include-gpubinary" "[[GPUBINARY1]]"
	// CUDA-H-I2-SAME: "-fcuda-include-gpubinary" "[[GPUBINARY2]]"			// CUDA-H-I2-SAME: "-fcuda-include-gpubinary" "[[GPUBINARY2]]"

				// Match external assembler that uses compilation output
				// CUDA-H-AS: "-o" "{{.*}}.o" "[[HOSTOUTPUT]]"

	// Match no GPU code inclusion.			// Match no GPU code inclusion.
	// CUDA-H-NI-NOT: "-fcuda-include-gpubinary"			// CUDA-H-NI-NOT: "-fcuda-include-gpubinary"

	// Match no CUDA compilation			// Match no CUDA compilation
	// CUDA-NH-NOT: "-cc1" "-triple"			// CUDA-NH-NOT: "-cc1" "-triple"
	// CUDA-NH-SAME-NOT: "-x" "cuda"			// CUDA-NH-SAME-NOT: "-x" "cuda"

	// Match linker			// Match linker
	// CUDA-L: "{{.*}}{{ld|link}}{{(.exe)?}}"			// CUDA-L: "{{.*}}{{ld|link}}{{(.exe)?}}"
	// CUDA-L-SAME: "[[HOSTOBJ]]"			// CUDA-L-SAME: "[[HOSTOUTPUT]]"

	// Match no linker			// Match no linker
	// CUDA-NL-NOT: "{{.*}}{{ld|link}}{{(.exe)?}}"			// CUDA-NL-NOT: "{{.*}}{{ld|link}}{{(.exe)?}}"

	// CUDA-NO-UNUSED-CHO-NOT: warning: argument unused during compilation: '--cuda-host-only'			// CUDA-NO-UNUSED-CHO-NOT: warning: argument unused during compilation: '--cuda-host-only'
	// CUDA-UNUSED-CDO: warning: argument unused during compilation: '--cuda-device-only'			// CUDA-UNUSED-CDO: warning: argument unused during compilation: '--cuda-device-only'
	// CUDA-NO-UNUSED-CDO-NOT: warning: argument unused during compilation: '--cuda-device-only'			// CUDA-NO-UNUSED-CDO-NOT: warning: argument unused during compilation: '--cuda-device-only'