This is an archive of the discontinued LLVM Phabricator instance.

Differential D57488

[CUDA] add support for the new kernel launch API in CUDA-9.2+.
ClosedPublic

Authored by tra on Jan 30 2019, 4:36 PM.

Details

Reviewers

jlebar

Commits

rGc62214da3de0: [CUDA] add support for the new kernel launch API in CUDA-9.2+.
rC352799: [CUDA] add support for the new kernel launch API in CUDA-9.2+.
rL352799: [CUDA] add support for the new kernel launch API in CUDA-9.2+.

Summary

Instead of calling the CUDA runtime to arrange function arguments,
the new API constructs the arguments in a local array, and kernels
are launched with cudaLaunchKernel().

The old API has been deprecated and is expected to go away
in the next CUDA release.
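
As a rough illustration of the summary above, the host-side stub under the new scheme can be pictured as follows. This is a self-contained sketch, not the code Clang emits: `fakeLaunchKernel` is a hypothetical stand-in for the runtime's launch function, and `Dim3`, `kernelStub` and `kernelHandle` are likewise invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for dim3; not the CUDA type.
struct Dim3 {
  unsigned x, y, z;
};

// Hypothetical stand-in for the runtime launch function. It only demonstrates
// that the callee receives an array of pointers to the kernel arguments.
int fakeLaunchKernel(const void *func, Dim3 grid, Dim3 block, void **args,
                     std::size_t sharedMem, void *stream) {
  assert(func != nullptr && args != nullptr);
  (void)sharedMem;
  (void)stream;
  int i = *static_cast<int *>(args[0]);
  float f = *static_cast<float *>(args[1]);
  std::printf("grid.x=%u block.x=%u i=%d f=%.1f\n", grid.x, block.x, i, f);
  return i; // echo the first argument so a caller can check the round trip
}

// Sketch of a host-side stub for a kernel taking (int, float).
int kernelStub(int i, float f) {
  static const char kernelHandle = 0; // placeholder for the kernel's address
  // Pointers to the arguments are collected in a local array; nothing is
  // handed to the runtime one argument at a time.
  void *kernelArgs[] = {&i, &f};
  return fakeLaunchKernel(&kernelHandle, Dim3{1, 1, 1}, Dim3{32, 1, 1},
                          kernelArgs, 0, nullptr);
}
```

Calling `kernelStub(7, 2.5f)` passes both arguments through the pointer array in a single call, which is the shape the patch moves to.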

Diff Detail

Build Status

Buildable 27549
Build 27548: arc lint + arc unit

Event Timeline

tra created this revision. Jan 30 2019, 4:36 PM

Herald added subscribers: bixia, sanjoy. Jan 30 2019, 4:36 PM

Harbormaster completed remote builds in B27516: Diff 184405. Jan 30 2019, 4:36 PM

tra added a parent revision: D57487: [CUDA] Propagate detected version of CUDA to cc1. Jan 30 2019, 4:37 PM

tra mentioned this in D57487: [CUDA] Propagate detected version of CUDA to cc1.

LGTM, mostly nits.

clang/include/clang/Sema/Sema.h
10316	Could we be a little less vague? What exactly is the launch-configuration function? (Could be as simple as adding `e.g. cudaFooBar()`.)
clang/lib/CodeGen/CGCUDANV.cpp
201–297	nit `in a local array`
212	Nit, s/`1UL`/`uint64{1}`/ or size_t, whatever this function takes. As-is we're baking in the assumption that unsigned long is the same as the type returned by Args.size(), which isn't necessarily true. As an alternative, you could do `std::max<size_t>(1, Args.size())` or whatever the appropriate type is.
239	Unfixed FIXME?
260	I see lots of references to `__cudaPushCallConfiguration`, but this is the only reference I see to `__cudaPopCallConfiguration`. Is this a typo? Also are we supposed to emit matching push and pop function calls? Kind of weird to do one without the other...
266	Whitespace nit, maybe move this whitespace line before the comment?
clang/lib/Headers/__clang_cuda_runtime_wrapper.h
429	s/undocumented function/this undocumented function/?
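
The `1UL` nit on line 212 comes down to template argument deduction: `std::max` takes both arguments as the same type, so mixing a literal with the result of `size()` only compiles when the two types happen to coincide on the target. A minimal sketch of the suggested alternative, with `argSlots` as a hypothetical helper name that does not appear in the patch:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of pointer slots to reserve for a kernel's
// arguments, never fewer than one so the array pointer is always valid.
std::size_t argSlots(const std::vector<int> &args) {
  // std::max(1UL, args.size()) would compile only where size_t is
  // unsigned long. Naming the type converts both operands to size_t first.
  std::size_t n = std::max<std::size_t>(1, args.size());
  assert(n >= 1);
  return n;
}
```

With an empty argument list this yields one slot, matching the comment in the patch about allocating a single pointer when there are no arguments.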

This revision is now accepted and ready to land. Jan 30 2019, 5:05 PM

Addressed Justin's comments.

Harbormaster completed remote builds in B27549: Diff 184543. Jan 31 2019, 10:30 AM

tra added inline comments. Jan 31 2019, 10:37 AM

clang/lib/CodeGen/CGCUDANV.cpp
239	Fixed the comment. :-) There's not much we can do if we have no declaration for cudaLaunchKernel, so throwing the error here is the best we can do.
260	The `pop` part is indeed used only here. `Push` takes user-specified parameters, so we get Sema to check them. `Pop` is much simpler and has no direct user exposure, so we can just create and use it here. As for matching, it is balanced. `Push` is called at the kernel launch site with the parameters of `<<<>>>`. `Pop` is done in the host-side kernel stub, where we retrieve those parameters and pass them to the CUDA runtime. Essentially, push/pop are poor names for these functions, as the nesting is never more than one level deep. We could've just stashed the arguments in a fixed buffer somewhere.
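
The pairing described in the reply on line 260 can be sketched with a single fixed slot, as that comment suggests would have sufficed. Everything here is hypothetical and for illustration only: `pushConfig`, `popConfig`, `LaunchConfig`, `Dim3` and `stubTotalThreads` are not CUDA runtime declarations, and the real runtime's storage is not shown in this review.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; none of these are CUDA runtime declarations.
struct Dim3 {
  unsigned x, y, z;
};

struct LaunchConfig {
  Dim3 grid;
  Dim3 block;
  std::size_t sharedMem;
  void *stream;
};

// One slot is enough because the nesting is never more than one level deep.
static LaunchConfig g_slot;
static bool g_slotFull = false;

// Called at the kernel launch site with the parameters of <<<...>>>.
int pushConfig(Dim3 grid, Dim3 block, std::size_t sharedMem, void *stream) {
  assert(!g_slotFull && "a previous configuration was never popped");
  g_slot = {grid, block, sharedMem, stream};
  g_slotFull = true;
  return 0;
}

// Called from the host-side stub to retrieve what the launch site stored.
int popConfig(Dim3 *grid, Dim3 *block, std::size_t *sharedMem, void **stream) {
  assert(g_slotFull && "pop without a matching push");
  *grid = g_slot.grid;
  *block = g_slot.block;
  *sharedMem = g_slot.sharedMem;
  *stream = g_slot.stream;
  g_slotFull = false;
  return 0;
}

// Sketch of a stub: pops the configuration and reports the thread count it
// would hand to the launch function.
unsigned stubTotalThreads() {
  Dim3 grid, block;
  std::size_t sharedMem;
  void *stream;
  popConfig(&grid, &block, &sharedMem, &stream);
  return grid.x * grid.y * grid.z * block.x * block.y * block.z;
}
```

Each push at a launch site is consumed by exactly one pop inside the stub, so the calls stay balanced even though they are emitted in different places.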

Updated ASTMatchers unit test.

Harbormaster completed remote builds in B27564: Diff 184592. Jan 31 2019, 1:24 PM

Closed by commit rC352799: [CUDA] add support for the new kernel launch API in CUDA-9.2+. (authored by tra). Jan 31 2019, 1:34 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
clang/include/clang/Basic/DiagnosticSemaKinds.td	2 lines
clang/include/clang/Sema/Sema.h	5 lines
clang/lib/CodeGen/CGCUDANV.cpp	110 lines
clang/lib/Headers/__clang_cuda_runtime_wrapper.h	10 lines
clang/lib/Sema/SemaCUDA.cpp	19 lines
clang/lib/Sema/SemaDecl.cpp	7 lines
clang/test/CodeGenCUDA/Inputs/cuda.h	13 lines
clang/test/CodeGenCUDA/device-stub.cu	65 lines
clang/test/CodeGenCUDA/kernel-args-alignment.cu	16 lines
clang/test/CodeGenCUDA/kernel-call.cu	17 lines
clang/test/Driver/cuda-simple.cu	6 lines
clang/test/SemaCUDA/Inputs/cuda.h	12 lines
clang/test/SemaCUDA/config-type.cu	8 lines

Diff 184543

clang/include/clang/Basic/DiagnosticSemaKinds.td

[...]	def err_deleted_inherited_ctor_use : Error<
"constructor inherited by %0 from base class %1 is implicitly deleted">;		"constructor inherited by %0 from base class %1 is implicitly deleted">;

def note_called_by : Note<"called by %0">;		def note_called_by : Note<"called by %0">;
def err_kern_type_not_void_return : Error<		def err_kern_type_not_void_return : Error<
"kernel function type %0 must have void return type">;		"kernel function type %0 must have void return type">;
def err_kern_is_nonstatic_method : Error<		def err_kern_is_nonstatic_method : Error<
"kernel function %0 must be a free function or static member function">;		"kernel function %0 must be a free function or static member function">;
def err_config_scalar_return : Error<		def err_config_scalar_return : Error<
"CUDA special function 'cudaConfigureCall' must have scalar return type">;		"CUDA special function '%0' must have scalar return type">;
def err_kern_call_not_global_function : Error<		def err_kern_call_not_global_function : Error<
"kernel call to non-global function %0">;		"kernel call to non-global function %0">;
def err_global_call_not_config : Error<		def err_global_call_not_config : Error<
"call to global function %0 not configured">;		"call to global function %0 not configured">;
def err_ref_bad_target : Error<		def err_ref_bad_target : Error<
"reference to %select{__device__|__global__|__host__|__host__ __device__}0 "		"reference to %select{__device__|__global__|__host__|__host__ __device__}0 "
"function %1 in %select{__device__|__global__|__host__|__host__ __device__}2 function">;		"function %1 in %select{__device__|__global__|__host__|__host__ __device__}2 function">;
def err_ref_bad_target_global_initializer : Error<		def err_ref_bad_target_global_initializer : Error<
[...]

clang/include/clang/Sema/Sema.h

[...]	public:

/// Check whether NewFD is a valid overload for CUDA. Emits		/// Check whether NewFD is a valid overload for CUDA. Emits
/// diagnostics and invalidates NewFD if not.		/// diagnostics and invalidates NewFD if not.
void checkCUDATargetOverload(FunctionDecl *NewFD,		void checkCUDATargetOverload(FunctionDecl *NewFD,
const LookupResult &Previous);		const LookupResult &Previous);
/// Copies target attributes from the template TD to the function FD.		/// Copies target attributes from the template TD to the function FD.
void inheritCUDATargetAttrs(FunctionDecl *FD, const FunctionTemplateDecl &TD);		void inheritCUDATargetAttrs(FunctionDecl *FD, const FunctionTemplateDecl &TD);

		/// Returns the name of the launch configuration function. This is the name
		/// of the function that will be called to configure kernel call, with the
		/// parameters specified via <<<>>>.
		std::string getCudaConfigureFuncName() const;

/// \name Code completion		/// \name Code completion
//@{		//@{
/// Describes the context in which code completion occurs.		/// Describes the context in which code completion occurs.
enum ParserCompletionContext {		enum ParserCompletionContext {
/// Code completion occurs at top-level or namespace context.		/// Code completion occurs at top-level or namespace context.
PCC_Namespace,		PCC_Namespace,
/// Code completion occurs within a class, struct, or union.		/// Code completion occurs within a class, struct, or union.
PCC_Class,		PCC_Class,
[...]

clang/lib/CodeGen/CGCUDANV.cpp

[...]
// runtime library.		// runtime library.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "CGCUDARuntime.h"		#include "CGCUDARuntime.h"
#include "CodeGenFunction.h"		#include "CodeGenFunction.h"
#include "CodeGenModule.h"		#include "CodeGenModule.h"
#include "clang/AST/Decl.h"		#include "clang/AST/Decl.h"
		#include "clang/Basic/Cuda.h"
		#include "clang/CodeGen/CodeGenABITypes.h"
#include "clang/CodeGen/ConstantInitBuilder.h"		#include "clang/CodeGen/ConstantInitBuilder.h"
#include "llvm/IR/BasicBlock.h"		#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CallSite.h"		#include "llvm/IR/CallSite.h"
#include "llvm/IR/Constants.h"		#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"		#include "llvm/IR/DerivedTypes.h"
#include "llvm/Support/Format.h"		#include "llvm/Support/Format.h"

using namespace clang;		using namespace clang;
[...]	llvm::BasicBlock *DummyBlock =
llvm::BasicBlock::Create(Context, "", DummyFunc);		llvm::BasicBlock::Create(Context, "", DummyFunc);
CGBuilderTy FuncBuilder(CGM, Context);		CGBuilderTy FuncBuilder(CGM, Context);
FuncBuilder.SetInsertPoint(DummyBlock);		FuncBuilder.SetInsertPoint(DummyBlock);
FuncBuilder.CreateRetVoid();		FuncBuilder.CreateRetVoid();

return DummyFunc;		return DummyFunc;
}		}

void emitDeviceStubBody(CodeGenFunction &CGF, FunctionArgList &Args);		void emitDeviceStubBodyLegacy(CodeGenFunction &CGF, FunctionArgList &Args);
		void emitDeviceStubBodyNew(CodeGenFunction &CGF, FunctionArgList &Args);

public:		public:
CGNVCUDARuntime(CodeGenModule &CGM);		CGNVCUDARuntime(CodeGenModule &CGM);

void emitDeviceStub(CodeGenFunction &CGF, FunctionArgList &Args) override;		void emitDeviceStub(CodeGenFunction &CGF, FunctionArgList &Args) override;
void registerDeviceVar(llvm::GlobalVariable &Var, unsigned Flags) override {		void registerDeviceVar(llvm::GlobalVariable &Var, unsigned Flags) override {
DeviceVars.push_back(std::make_pair(&Var, Flags));		DeviceVars.push_back(std::make_pair(&Var, Flags));
}		}
[...]	llvm::FunctionType *CGNVCUDARuntime::getRegisterLinkedBinaryFnTy() const {
llvm::Type *Params[] = {RegisterGlobalsFnTy->getPointerTo(), VoidPtrTy,		llvm::Type *Params[] = {RegisterGlobalsFnTy->getPointerTo(), VoidPtrTy,
VoidPtrTy, CallbackFnTy->getPointerTo()};		VoidPtrTy, CallbackFnTy->getPointerTo()};
return llvm::FunctionType::get(VoidTy, Params, false);		return llvm::FunctionType::get(VoidTy, Params, false);
}		}

void CGNVCUDARuntime::emitDeviceStub(CodeGenFunction &CGF,		void CGNVCUDARuntime::emitDeviceStub(CodeGenFunction &CGF,
FunctionArgList &Args) {		FunctionArgList &Args) {
EmittedKernels.push_back(CGF.CurFn);		EmittedKernels.push_back(CGF.CurFn);
emitDeviceStubBody(CGF, Args);		if (CudaFeatureEnabled(CGM.getTarget().getSDKVersion(),
		CudaFeature::CUDA_USES_NEW_LAUNCH))
		emitDeviceStubBodyNew(CGF, Args);
		else
		emitDeviceStubBodyLegacy(CGF, Args);
		}

		// CUDA 9.0+ uses new way to launch kernels. Parameters are packed in a local
		// array and kernels are launched using cudaLaunchKernel().
		void CGNVCUDARuntime::emitDeviceStubBodyNew(CodeGenFunction &CGF,
		FunctionArgList &Args) {
		// Build the shadow stack entry at the very start of the function.

		// Calculate amount of space we will need for all arguments. If we have no
		// args, allocate a single pointer so we still have a valid pointer to the
		// argument array that we can pass to runtime, even if it will be unused.
		Address KernelArgs = CGF.CreateTempAlloca(
		VoidPtrTy, CharUnits::fromQuantity(16), "kernel_args",
		llvm::ConstantInt::get(SizeTy, std::max<size_t>(1, Args.size())));
		// Store pointers to the arguments in a locally allocated launch_args.
		for (unsigned i = 0; i < Args.size(); ++i) {
		llvm::Value* VarPtr = CGF.GetAddrOfLocalVar(Args[i]).getPointer();
		llvm::Value *VoidVarPtr = CGF.Builder.CreatePointerCast(VarPtr, VoidPtrTy);
		CGF.Builder.CreateDefaultAlignedStore(
		VoidVarPtr, CGF.Builder.CreateConstGEP1_32(KernelArgs.getPointer(), i));
		}

		llvm::BasicBlock *EndBlock = CGF.createBasicBlock("setup.end");

		// Lookup cudaLaunchKernel function.
		// cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim, dim3 blockDim,
		// void **args, size_t sharedMem,
		// cudaStream_t stream);
		TranslationUnitDecl *TUDecl = CGM.getContext().getTranslationUnitDecl();
		DeclContext *DC = TranslationUnitDecl::castToDeclContext(TUDecl);
		IdentifierInfo &cudaLaunchKernelII =
		CGM.getContext().Idents.get("cudaLaunchKernel");
		FunctionDecl *cudaLaunchKernelFD = nullptr;
		for (const auto &Result : DC->lookup(&cudaLaunchKernelII)) {
		if (FunctionDecl *FD = dyn_cast<FunctionDecl>(Result))
		cudaLaunchKernelFD = FD;
		}

		if (cudaLaunchKernelFD == nullptr) {
		CGM.Error(CGF.CurFuncDecl->getLocation(),
		"Can't find declaration for cudaLaunchKernel()");
		return;
		}
		// Create temporary dim3 grid_dim, block_dim.
		ParmVarDecl *GridDimParam = cudaLaunchKernelFD->getParamDecl(1);
		QualType Dim3Ty = GridDimParam->getType();
		Address GridDim =
		CGF.CreateMemTemp(Dim3Ty, CharUnits::fromQuantity(8), "grid_dim");
		Address BlockDim =
		CGF.CreateMemTemp(Dim3Ty, CharUnits::fromQuantity(8), "block_dim");
		Address ShmemSize =
		CGF.CreateTempAlloca(SizeTy, CGM.getSizeAlign(), "shmem_size");
		Address Stream =
		CGF.CreateTempAlloca(VoidPtrTy, CGM.getPointerAlign(), "stream");
		llvm::Constant *cudaPopConfigFn = CGM.CreateRuntimeFunction(
		llvm::FunctionType::get(IntTy,
		{/*gridDim=*/GridDim.getType(),
		/*blockDim=*/BlockDim.getType(),
		/*ShmemSize=*/ShmemSize.getType(),
		/*Stream=*/Stream.getType()},
		/*isVarArg=*/false),
		"__cudaPopCallConfiguration");

		CGF.EmitRuntimeCallOrInvoke(cudaPopConfigFn,
		{GridDim.getPointer(), BlockDim.getPointer(),
		ShmemSize.getPointer(), Stream.getPointer()});

		// Emit the call to cudaLaunch
		llvm::Value *Kernel = CGF.Builder.CreatePointerCast(CGF.CurFn, VoidPtrTy);
		CallArgList LaunchKernelArgs;
		LaunchKernelArgs.add(RValue::get(Kernel),
		cudaLaunchKernelFD->getParamDecl(0)->getType());
		LaunchKernelArgs.add(RValue::getAggregate(GridDim), Dim3Ty);
		LaunchKernelArgs.add(RValue::getAggregate(BlockDim), Dim3Ty);
		LaunchKernelArgs.add(RValue::get(KernelArgs.getPointer()),
		cudaLaunchKernelFD->getParamDecl(3)->getType());
		LaunchKernelArgs.add(RValue::get(CGF.Builder.CreateLoad(ShmemSize)),
		cudaLaunchKernelFD->getParamDecl(4)->getType());
		LaunchKernelArgs.add(RValue::get(CGF.Builder.CreateLoad(Stream)),
		cudaLaunchKernelFD->getParamDecl(5)->getType());

		QualType QT = cudaLaunchKernelFD->getType();
		QualType CQT = QT.getCanonicalType();
		llvm::Type *Ty = CGM.getTypes().ConvertFunctionType(CQT, cudaLaunchKernelFD);
		llvm::FunctionType *FTy = dyn_cast<llvm::FunctionType>(Ty);

		const CGFunctionInfo &FI =
		CGM.getTypes().arrangeFunctionDeclaration(cudaLaunchKernelFD);
		llvm::Constant *cudaLaunchKernelFn =
		CGM.CreateRuntimeFunction(FTy, "cudaLaunchKernel");
		CGF.EmitCall(FI, CGCallee::forDirect(cudaLaunchKernelFn), ReturnValueSlot(),
		LaunchKernelArgs);
		CGF.EmitBranch(EndBlock);

		CGF.EmitBlock(EndBlock);
}		}

void CGNVCUDARuntime::emitDeviceStubBody(CodeGenFunction &CGF,		void CGNVCUDARuntime::emitDeviceStubBodyLegacy(CodeGenFunction &CGF,
FunctionArgList &Args) {		FunctionArgList &Args) {
// Emit a call to cudaSetupArgument for each arg in Args.		// Emit a call to cudaSetupArgument for each arg in Args.
llvm::Constant *cudaSetupArgFn = getSetupArgumentFn();		llvm::Constant *cudaSetupArgFn = getSetupArgumentFn();
llvm::BasicBlock *EndBlock = CGF.createBasicBlock("setup.end");		llvm::BasicBlock *EndBlock = CGF.createBasicBlock("setup.end");
CharUnits Offset = CharUnits::Zero();		CharUnits Offset = CharUnits::Zero();
for (const VarDecl *A : Args) {		for (const VarDecl *A : Args) {
CharUnits TyWidth, TyAlign;		CharUnits TyWidth, TyAlign;
std::tie(TyWidth, TyAlign) =		std::tie(TyWidth, TyAlign) =
CGM.getContext().getTypeInfoInChars(A->getType());		CGM.getContext().getTypeInfoInChars(A->getType());
[...]

clang/lib/Headers/__clang_cuda_runtime_wrapper.h

	[...]
	#define dim3 __cuda_builtin_blockDim_t			#define dim3 __cuda_builtin_blockDim_t
	#define uint3 __cuda_builtin_threadIdx_t			#define uint3 __cuda_builtin_threadIdx_t
	#include "curand_mtgp32_kernel.h"			#include "curand_mtgp32_kernel.h"
	#pragma pop_macro("dim3")			#pragma pop_macro("dim3")
	#pragma pop_macro("uint3")			#pragma pop_macro("uint3")
	#pragma pop_macro("__USE_FAST_MATH__")			#pragma pop_macro("__USE_FAST_MATH__")
	#pragma pop_macro("__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__")			#pragma pop_macro("__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__")

				// CUDA runtime uses this undocumented function to access kernel launch
				// configuration. The declaration is in crt/device_functions.h but that file
				// includes a lot of other stuff we don't want. Instead, we'll provide our own
				// declaration for it here.
				#if CUDA_VERSION >= 9020
				extern "C" unsigned __cudaPushCallConfiguration(dim3 gridDim, dim3 blockDim,
				size_t sharedMem = 0,
				void *stream = 0);
				#endif

	#endif // __CUDA__			#endif // __CUDA__
	#endif // __CLANG_CUDA_RUNTIME_WRAPPER_H__			#endif // __CLANG_CUDA_RUNTIME_WRAPPER_H__

clang/lib/Sema/SemaCUDA.cpp

//===--- SemaCUDA.cpp - Semantic Analysis for CUDA constructs -------------===//		//===--- SemaCUDA.cpp - Semantic Analysis for CUDA constructs -------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
/// \file		/// \file
/// This file implements semantic analysis for CUDA constructs.		/// This file implements semantic analysis for CUDA constructs.
///		///
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "clang/AST/ASTContext.h"		#include "clang/AST/ASTContext.h"
#include "clang/AST/Decl.h"		#include "clang/AST/Decl.h"
#include "clang/AST/ExprCXX.h"		#include "clang/AST/ExprCXX.h"
		#include "clang/Basic/Cuda.h"
#include "clang/Lex/Preprocessor.h"		#include "clang/Lex/Preprocessor.h"
#include "clang/Sema/Lookup.h"		#include "clang/Sema/Lookup.h"
#include "clang/Sema/Sema.h"		#include "clang/Sema/Sema.h"
#include "clang/Sema/SemaDiagnostic.h"		#include "clang/Sema/SemaDiagnostic.h"
#include "clang/Sema/SemaInternal.h"		#include "clang/Sema/SemaInternal.h"
#include "clang/Sema/Template.h"		#include "clang/Sema/Template.h"
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
[...]	bool Sema::PopForceCUDAHostDevice() {
return true;		return true;
}		}

ExprResult Sema::ActOnCUDAExecConfigExpr(Scope *S, SourceLocation LLLLoc,		ExprResult Sema::ActOnCUDAExecConfigExpr(Scope *S, SourceLocation LLLLoc,
MultiExprArg ExecConfig,		MultiExprArg ExecConfig,
SourceLocation GGGLoc) {		SourceLocation GGGLoc) {
FunctionDecl *ConfigDecl = Context.getcudaConfigureCallDecl();		FunctionDecl *ConfigDecl = Context.getcudaConfigureCallDecl();
if (!ConfigDecl)		if (!ConfigDecl)
return ExprError(		return ExprError(Diag(LLLLoc, diag::err_undeclared_var_use)
Diag(LLLLoc, diag::err_undeclared_var_use)		<< getCudaConfigureFuncName());
<< (getLangOpts().HIP ? "hipConfigureCall" : "cudaConfigureCall"));
QualType ConfigQTy = ConfigDecl->getType();		QualType ConfigQTy = ConfigDecl->getType();

DeclRefExpr *ConfigDR = new (Context)		DeclRefExpr *ConfigDR = new (Context)
DeclRefExpr(Context, ConfigDecl, false, ConfigQTy, VK_LValue, LLLLoc);		DeclRefExpr(Context, ConfigDecl, false, ConfigQTy, VK_LValue, LLLLoc);
MarkFunctionReferenced(LLLLoc, ConfigDecl);		MarkFunctionReferenced(LLLLoc, ConfigDecl);

return ActOnCallExpr(S, ConfigDR, LLLLoc, ExecConfig, GGGLoc, nullptr,		return ActOnCallExpr(S, ConfigDR, LLLLoc, ExecConfig, GGGLoc, nullptr,
/*IsExecConfig=*/true);		/*IsExecConfig=*/true);
[...]

void Sema::inheritCUDATargetAttrs(FunctionDecl *FD,		void Sema::inheritCUDATargetAttrs(FunctionDecl *FD,
const FunctionTemplateDecl &TD) {		const FunctionTemplateDecl &TD) {
const FunctionDecl &TemplateFD = *TD.getTemplatedDecl();		const FunctionDecl &TemplateFD = *TD.getTemplatedDecl();
copyAttrIfPresent<CUDAGlobalAttr>(*this, FD, TemplateFD);		copyAttrIfPresent<CUDAGlobalAttr>(*this, FD, TemplateFD);
copyAttrIfPresent<CUDAHostAttr>(*this, FD, TemplateFD);		copyAttrIfPresent<CUDAHostAttr>(*this, FD, TemplateFD);
copyAttrIfPresent<CUDADeviceAttr>(*this, FD, TemplateFD);		copyAttrIfPresent<CUDADeviceAttr>(*this, FD, TemplateFD);
}		}

		std::string Sema::getCudaConfigureFuncName() const {
		if (getLangOpts().HIP)
		return "hipConfigureCall";

		// New CUDA kernel launch sequence.
		if (CudaFeatureEnabled(Context.getTargetInfo().getSDKVersion(),
		CudaFeature::CUDA_USES_NEW_LAUNCH))
		return "__cudaPushCallConfiguration";

		// Legacy CUDA kernel configuration call
		return "cudaConfigureCall";
		}

clang/lib/Sema/SemaDecl.cpp

[...]	if (D.isRedeclaration() && !Previous.empty()) {
checkDLLAttributeRedeclaration(*this, Prev, NewFD,		checkDLLAttributeRedeclaration(*this, Prev, NewFD,
isMemberSpecialization ||		isMemberSpecialization ||
isFunctionTemplateSpecialization,		isFunctionTemplateSpecialization,
D.isFunctionDefinition());		D.isFunctionDefinition());
}		}

if (getLangOpts().CUDA) {		if (getLangOpts().CUDA) {
IdentifierInfo *II = NewFD->getIdentifier();		IdentifierInfo *II = NewFD->getIdentifier();
if (II &&		if (II && II->isStr(getCudaConfigureFuncName()) &&
II->isStr(getLangOpts().HIP ? "hipConfigureCall"
: "cudaConfigureCall") &&
!NewFD->isInvalidDecl() &&		!NewFD->isInvalidDecl() &&
NewFD->getDeclContext()->getRedeclContext()->isTranslationUnit()) {		NewFD->getDeclContext()->getRedeclContext()->isTranslationUnit()) {
if (!R->getAs<FunctionType>()->getReturnType()->isScalarType())		if (!R->getAs<FunctionType>()->getReturnType()->isScalarType())
Diag(NewFD->getLocation(), diag::err_config_scalar_return);		Diag(NewFD->getLocation(), diag::err_config_scalar_return)
		<< getCudaConfigureFuncName();
Context.setcudaConfigureCallDecl(NewFD);		Context.setcudaConfigureCallDecl(NewFD);
}		}

// Variadic functions, other than a declaration of printf, are not allowed		// Variadic functions, other than a declaration of printf, are not allowed
// in device-side CUDA code, unless someone passed		// in device-side CUDA code, unless someone passed
// -fcuda-allow-variadic-functions.		// -fcuda-allow-variadic-functions.
if (!getLangOpts().CUDAAllowVariadicFunctions && NewFD->isVariadic() &&		if (!getLangOpts().CUDAAllowVariadicFunctions && NewFD->isVariadic() &&
(NewFD->hasAttr<CUDADeviceAttr>() ||		(NewFD->hasAttr<CUDADeviceAttr>() ||
[...]

clang/test/CodeGenCUDA/Inputs/cuda.h

	[...]
	#define __launch_bounds__(...) __attribute__((launch_bounds(__VA_ARGS__)))			#define __launch_bounds__(...) __attribute__((launch_bounds(__VA_ARGS__)))

	struct dim3 {			struct dim3 {
	unsigned x, y, z;			unsigned x, y, z;
	__host__ __device__ dim3(unsigned x, unsigned y = 1, unsigned z = 1) : x(x), y(y), z(z) {}			__host__ __device__ dim3(unsigned x, unsigned y = 1, unsigned z = 1) : x(x), y(y), z(z) {}
	};			};

	typedef struct cudaStream *cudaStream_t;			typedef struct cudaStream *cudaStream_t;
				typedef enum cudaError {} cudaError_t;
	#ifdef __HIP__			#ifdef __HIP__
	int hipConfigureCall(dim3 gridSize, dim3 blockSize, size_t sharedSize = 0,			int hipConfigureCall(dim3 gridSize, dim3 blockSize, size_t sharedSize = 0,
	cudaStream_t stream = 0);			cudaStream_t stream = 0);
	#else			#else
	int cudaConfigureCall(dim3 gridSize, dim3 blockSize, size_t sharedSize = 0,			extern "C" int cudaConfigureCall(dim3 gridSize, dim3 blockSize,
				size_t sharedSize = 0,
				cudaStream_t stream = 0);
				extern "C" int __cudaPushCallConfiguration(dim3 gridSize, dim3 blockSize,
				size_t sharedSize = 0,
	cudaStream_t stream = 0);			cudaStream_t stream = 0);
				extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim,
				dim3 blockDim, void **args,
				size_t sharedMem, cudaStream_t stream);
	#endif			#endif

	extern "C" __device__ int printf(const char*, ...);			extern "C" __device__ int printf(const char*, ...);

clang/test/CodeGenCUDA/device-stub.cu

	// RUN: echo "GPU binary would be here" > %t			// RUN: echo "GPU binary would be here" > %t
	// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \			// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
	// RUN: -fcuda-include-gpubinary %t -o - \			// RUN: -target-sdk-version=8.0 -fcuda-include-gpubinary %t -o - \
	// RUN: | FileCheck -allow-deprecated-dag-overlap %s --check-prefixes=ALL,NORDC,CUDA,CUDANORDC			// RUN: | FileCheck -allow-deprecated-dag-overlap %s \
				// RUN: --check-prefixes=ALL,NORDC,CUDA,CUDANORDC,CUDA-OLD
				// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
				// RUN: -target-sdk-version=8.0 -fcuda-include-gpubinary %t \
				// RUN: -o - -DNOGLOBALS \
				// RUN: | FileCheck -allow-deprecated-dag-overlap %s \
				// RUN: -check-prefixes=NOGLOBALS,CUDANOGLOBALS
				// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
				// RUN: -target-sdk-version=8.0 -fgpu-rdc -fcuda-include-gpubinary %t \
				// RUN: -o - \
				// RUN: | FileCheck -allow-deprecated-dag-overlap %s \
				// RUN: --check-prefixes=ALL,RDC,CUDA,CUDARDC,CUDA-OLD
	// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \			// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
	// RUN: -fcuda-include-gpubinary %t -o - -DNOGLOBALS \			// RUN: -target-sdk-version=8.0 -o - \
	// RUN: | FileCheck -allow-deprecated-dag-overlap %s -check-prefixes=NOGLOBALS,CUDANOGLOBALS			// RUN: | FileCheck -allow-deprecated-dag-overlap %s -check-prefix=NOGPUBIN

	// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \			// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
	// RUN: -fgpu-rdc -fcuda-include-gpubinary %t -o - \			// RUN: -target-sdk-version=9.2 -fcuda-include-gpubinary %t -o - \
	// RUN: | FileCheck -allow-deprecated-dag-overlap %s --check-prefixes=ALL,RDC,CUDA,CUDARDC			// RUN: | FileCheck %s -allow-deprecated-dag-overlap \
	// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s -o - \			// RUN: --check-prefixes=ALL,NORDC,CUDA,CUDANORDC,CUDA-NEW
				// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
				// RUN: -target-sdk-version=9.2 -fcuda-include-gpubinary %t -o - -DNOGLOBALS \
				// RUN: | FileCheck -allow-deprecated-dag-overlap %s \
				// RUN: --check-prefixes=NOGLOBALS,CUDANOGLOBALS
				// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
				// RUN: -target-sdk-version=9.2 -fgpu-rdc -fcuda-include-gpubinary %t -o - \
				// RUN: | FileCheck %s -allow-deprecated-dag-overlap \
				// RUN: --check-prefixes=ALL,RDC,CUDA,CUDARDC,CUDA_NEW
				// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
				// RUN: -target-sdk-version=9.2 -o - \
	// RUN: | FileCheck -allow-deprecated-dag-overlap %s -check-prefix=NOGPUBIN			// RUN: | FileCheck -allow-deprecated-dag-overlap %s -check-prefix=NOGPUBIN

	// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \			// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
	// RUN: -fcuda-include-gpubinary %t -o - -x hip\			// RUN: -fcuda-include-gpubinary %t -o - -x hip\
	// RUN: | FileCheck -allow-deprecated-dag-overlap %s --check-prefixes=ALL,NORDC,HIP,HIPEF			// RUN: | FileCheck -allow-deprecated-dag-overlap %s --check-prefixes=ALL,NORDC,HIP,HIPEF
	// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \			// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
	// RUN: -fcuda-include-gpubinary %t -o - -DNOGLOBALS -x hip \			// RUN: -fcuda-include-gpubinary %t -o - -DNOGLOBALS -x hip \
	// RUN: | FileCheck -allow-deprecated-dag-overlap %s -check-prefixes=NOGLOBALS,HIPNOGLOBALS			// RUN: | FileCheck -allow-deprecated-dag-overlap %s -check-prefixes=NOGLOBALS,HIPNOGLOBALS
	▲ Show 20 Lines • Show All 78 Lines • ▼ Show 20 Lines
	// * Alias to global symbol containing the NVModuleID.			// * Alias to global symbol containing the NVModuleID.
	// RDC: @__fatbinwrap[[MODULE_ID]] = alias { i32, i32, i8*, i8* }			// RDC: @__fatbinwrap[[MODULE_ID]] = alias { i32, i32, i8*, i8* }
	// RDC-SAME: { i32, i32, i8*, i8* }* @__[[PREFIX]]_fatbin_wrapper			// RDC-SAME: { i32, i32, i8*, i8* }* @__[[PREFIX]]_fatbin_wrapper

	// Test that we build the correct number of calls to cudaSetupArgument followed			// Test that we build the correct number of calls to cudaSetupArgument followed
	// by a call to cudaLaunch.			// by a call to cudaLaunch.

	// ALL: define{{.*}}kernelfunc			// ALL: define{{.*}}kernelfunc
	// ALL: call{{.*}}[[PREFIX]]SetupArgument
	// ALL: call{{.*}}[[PREFIX]]SetupArgument			// New launch sequence stores arguments into local buffer and passes array of
	// ALL: call{{.*}}[[PREFIX]]SetupArgument			// pointers to them directly to cudaLaunchKernel
	// ALL: call{{.*}}[[PREFIX]]Launch			// CUDA-NEW: alloca
				// CUDA-NEW: store
				// CUDA-NEW: store
				// CUDA-NEW: store
				// CUDA-NEW: call{{.*}}__cudaPopCallConfiguration
				// CUDA-NEW: call{{.*}}cudaLaunchKernel

				// Legacy style launch sequence sets up arguments by passing them to
				// [cuda|hip]SetupArgument.
				// CUDA-OLD: call{{.*}}[[PREFIX]]SetupArgument
				// CUDA-OLD: call{{.*}}[[PREFIX]]SetupArgument
				// CUDA-OLD: call{{.*}}[[PREFIX]]SetupArgument
				// CUDA-OLD: call{{.*}}[[PREFIX]]Launch

				// HIP: call{{.*}}[[PREFIX]]SetupArgument
				// HIP: call{{.*}}[[PREFIX]]SetupArgument
				// HIP: call{{.*}}[[PREFIX]]SetupArgument
				// HIP: call{{.*}}[[PREFIX]]Launch
	__global__ void kernelfunc(int i, int j, int k) {}			__global__ void kernelfunc(int i, int j, int k) {}

	// Test that we've built correct kernel launch sequence.			// Test that we've built correct kernel launch sequence.
	// ALL: define{{.*}}hostfunc			// ALL: define{{.*}}hostfunc
	// ALL: call{{.*}}[[PREFIX]]ConfigureCall			// CUDA-OLD: call{{.*}}[[PREFIX]]ConfigureCall
				// CUDA-NEW: call{{.*}}__cudaPushCallConfiguration
				// HIP: call{{.*}}[[PREFIX]]ConfigureCall
	// ALL: call{{.*}}kernelfunc			// ALL: call{{.*}}kernelfunc
	void hostfunc(void) { kernelfunc<<<1, 1>>>(1, 1, 1); }			void hostfunc(void) { kernelfunc<<<1, 1>>>(1, 1, 1); }
	#endif			#endif

	// Test that we've built a function to register kernels and global vars.			// Test that we've built a function to register kernels and global vars.
	// ALL: define internal void @__[[PREFIX]]_register_globals			// ALL: define internal void @__[[PREFIX]]_register_globals
	// ALL: call{{.*}}[[PREFIX]]RegisterFunction(i8** %0, {{.*}}kernelfunc			// ALL: call{{.*}}[[PREFIX]]RegisterFunction(i8** %0, {{.*}}kernelfunc
	// ALL-DAG: call{{.*}}[[PREFIX]]RegisterVar(i8** %0, {{.*}}device_var{{.*}}i32 0, i32 4, i32 0, i32 0			// ALL-DAG: call{{.*}}[[PREFIX]]RegisterVar(i8** %0, {{.*}}device_var{{.*}}i32 0, i32 4, i32 0, i32 0
	[58 lines not shown]

clang/test/CodeGenCUDA/kernel-args-alignment.cu

	// RUN: %clang_cc1 --std=c++11 -triple x86_64-unknown-linux-gnu -emit-llvm -o - %s | \			// New CUDA kernel launch sequence does not require explicit specification of
	// RUN: FileCheck -check-prefix HOST -check-prefix CHECK %s			// size/offset for each argument, so only the old way is tested.
				//
				// RUN: %clang_cc1 --std=c++11 -triple x86_64-unknown-linux-gnu -emit-llvm \
				// RUN: -target-sdk-version=8.0 -o - %s \
				// RUN: | FileCheck -check-prefixes=HOST-OLD,CHECK %s

	// RUN: %clang_cc1 --std=c++11 -fcuda-is-device -triple nvptx64-nvidia-cuda \			// RUN: %clang_cc1 --std=c++11 -fcuda-is-device -triple nvptx64-nvidia-cuda \
	// RUN: -emit-llvm -o - %s | FileCheck -check-prefix DEVICE -check-prefix CHECK %s			// RUN: -emit-llvm -o - %s | FileCheck -check-prefixes=DEVICE,CHECK %s

	#include "Inputs/cuda.h"			#include "Inputs/cuda.h"

	struct U {			struct U {
	short x;			short x;
	} __attribute__((packed));			} __attribute__((packed));

	struct S {			struct S {
	int *ptr;			int *ptr;
	char a;			char a;
	U u;			U u;
	};			};

	// Clang should generate a packed LLVM struct for S (denoted by the <>s),			// Clang should generate a packed LLVM struct for S (denoted by the <>s),
	// otherwise this test isn't interesting.			// otherwise this test isn't interesting.
	// CHECK: %struct.S = type <{ i32*, i8, %struct.U, [5 x i8] }>			// CHECK: %struct.S = type <{ i32*, i8, %struct.U, [5 x i8] }>

	static_assert(alignof(S) == 8, "Unexpected alignment.");			static_assert(alignof(S) == 8, "Unexpected alignment.");

	// HOST-LABEL: @_Z6kernelc1SPi			// HOST-LABEL: @_Z6kernelc1SPi
	// Marshalled kernel args should be:			// Marshalled kernel args should be:
	// 1. offset 0, width 1			// 1. offset 0, width 1
	// 2. offset 8 (because alignof(S) == 8), width 16			// 2. offset 8 (because alignof(S) == 8), width 16
	// 3. offset 24, width 8			// 3. offset 24, width 8
	// HOST: call i32 @cudaSetupArgument({{[^,]*}}, i64 1, i64 0)			// HOST-OLD: call i32 @cudaSetupArgument({{[^,]*}}, i64 1, i64 0)
	// HOST: call i32 @cudaSetupArgument({{[^,]*}}, i64 16, i64 8)			// HOST-OLD: call i32 @cudaSetupArgument({{[^,]*}}, i64 16, i64 8)
	// HOST: call i32 @cudaSetupArgument({{[^,]*}}, i64 8, i64 24)			// HOST-OLD: call i32 @cudaSetupArgument({{[^,]*}}, i64 8, i64 24)

	// DEVICE-LABEL: @_Z6kernelc1SPi			// DEVICE-LABEL: @_Z6kernelc1SPi
	// DEVICE-SAME: i8{{[^,]*}}, %struct.S* byval align 8{{[^,]*}}, i32*			// DEVICE-SAME: i8{{[^,]*}}, %struct.S* byval align 8{{[^,]*}}, i32*
	__global__ void kernel(char a, S s, int *b) {}			__global__ void kernel(char a, S s, int *b) {}
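
The offsets and widths listed in the "Marshalled kernel args" comment can be reproduced by hand. The following is a minimal sketch, assuming an x86-64 Linux target and a GCC- or Clang-compatible compiler (for `__attribute__((packed))`). The names `ArgSlot`, `alignUp`, `appendArg` and `kernelArgLayout` are hypothetical helpers introduced only for illustration; they are not part of Clang or the CUDA runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same types as in the test above.
struct U {
  short x;
} __attribute__((packed));

struct S {
  int *ptr;
  char a;
  U u;
};

// One marshalled argument: where it starts and how many bytes it occupies.
struct ArgSlot {
  std::size_t offset;
  std::size_t width;
};

// Round `offset` up to the next multiple of `align` (a power of two).
inline std::size_t alignUp(std::size_t offset, std::size_t align) {
  assert(align != 0 && (align & (align - 1)) == 0 && "power-of-two alignment");
  return (offset + align - 1) & ~(align - 1);
}

// Place an argument of type T at the next suitably aligned offset.
template <typename T>
void appendArg(std::vector<ArgSlot> &slots, std::size_t &cursor) {
  cursor = alignUp(cursor, alignof(T));
  slots.push_back({cursor, sizeof(T)});
  cursor += sizeof(T);
}

// Layout for: __global__ void kernel(char a, S s, int *b)
std::vector<ArgSlot> kernelArgLayout() {
  std::vector<ArgSlot> slots;
  std::size_t cursor = 0;
  appendArg<char>(slots, cursor);   // offset 0, width 1
  appendArg<S>(slots, cursor);      // aligned up to alignof(S)
  appendArg<int *>(slots, cursor);  // follows S directly
  return slots;
}
```

Under those assumptions `kernelArgLayout()` yields the three (offset, width) pairs that the `HOST-OLD` lines check as the last two arguments of each `cudaSetupArgument` call.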

clang/test/CodeGenCUDA/kernel-call.cu

	// RUN: %clang_cc1 -emit-llvm %s -o - | FileCheck %s --check-prefixes=CUDA,CHECK			// RUN: %clang_cc1 -target-sdk-version=8.0 -emit-llvm %s -o - \
	// RUN: %clang_cc1 -x hip -emit-llvm %s -o - | FileCheck %s --check-prefixes=HIP,CHECK			// RUN: | FileCheck %s --check-prefixes=CUDA-OLD,CHECK
				// RUN: %clang_cc1 -target-sdk-version=9.2 -emit-llvm %s -o - \
				// RUN: | FileCheck %s --check-prefixes=CUDA-NEW,CHECK
				// RUN: %clang_cc1 -x hip -emit-llvm %s -o - \
				// RUN: | FileCheck %s --check-prefixes=HIP,CHECK


	#include "Inputs/cuda.h"			#include "Inputs/cuda.h"

	// CHECK-LABEL: define{{.*}}g1			// CHECK-LABEL: define{{.*}}g1
	// HIP: call{{.*}}hipSetupArgument			// HIP: call{{.*}}hipSetupArgument
	// HIP: call{{.*}}hipLaunchByPtr			// HIP: call{{.*}}hipLaunchByPtr
	// CUDA: call{{.*}}cudaSetupArgument			// CUDA-OLD: call{{.*}}cudaSetupArgument
	// CUDA: call{{.*}}cudaLaunch			// CUDA-OLD: call{{.*}}cudaLaunch
				// CUDA-NEW: call{{.*}}__cudaPopCallConfiguration
				// CUDA-NEW: call{{.*}}cudaLaunchKernel
	__global__ void g1(int x) {}			__global__ void g1(int x) {}

	// CHECK-LABEL: define{{.*}}main			// CHECK-LABEL: define{{.*}}main
	int main(void) {			int main(void) {
	// HIP: call{{.*}}hipConfigureCall			// HIP: call{{.*}}hipConfigureCall
	// CUDA: call{{.*}}cudaConfigureCall			// CUDA-OLD: call{{.*}}cudaConfigureCall
				// CUDA-NEW: call{{.*}}__cudaPushCallConfiguration
	// CHECK: icmp			// CHECK: icmp
	// CHECK: br			// CHECK: br
	// CHECK: call{{.*}}g1			// CHECK: call{{.*}}g1
	g1<<<1, 1>>>(42);			g1<<<1, 1>>>(42);
	}			}
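
To make the `CUDA-NEW` expectations easier to read, here is a rough host-only sketch of the call order they describe: the launch site pushes the configuration, and the kernel's host stub stores its arguments in a local array of pointers, pops the configuration, and launches. Everything prefixed `mock` is a hypothetical stand-in written for this sketch, not the CUDA runtime; only the parameter list of `mockLaunchKernel` is modelled on the `cudaLaunchKernel` declaration in `Inputs/cuda.h`, and the shape of the pop function is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct dim3 {
  unsigned x, y, z;
  dim3(unsigned x = 1, unsigned y = 1, unsigned z = 1) : x(x), y(y), z(z) {}
};
typedef struct mockStream *mockStream_t;

struct LaunchConfig {
  dim3 grid, block;
  std::size_t sharedMem;
  mockStream_t stream;
};

static std::vector<LaunchConfig> configStack;  // pending <<<...>>> configs
static std::vector<std::string> callLog;       // order of mock runtime calls
static int lastKernelArg = 0;                  // first argument seen at launch

// Mock of the "push" step emitted at the launch site.
int mockPushCallConfiguration(dim3 grid, dim3 block, std::size_t sharedMem = 0,
                              mockStream_t stream = nullptr) {
  callLog.push_back("push");
  configStack.push_back({grid, block, sharedMem, stream});
  return 0;  // 0 means "go ahead and call the stub"
}

// Mock of the "pop" step emitted inside the kernel's host stub (assumed shape).
int mockPopCallConfiguration(dim3 *grid, dim3 *block, std::size_t *sharedMem,
                             mockStream_t *stream) {
  callLog.push_back("pop");
  assert(!configStack.empty());
  LaunchConfig c = configStack.back();
  configStack.pop_back();
  *grid = c.grid;
  *block = c.block;
  *sharedMem = c.sharedMem;
  *stream = c.stream;
  return 0;
}

// Mock launch: receives an array of pointers to the arguments.
int mockLaunchKernel(const void *func, dim3 grid, dim3 block, void **args,
                     std::size_t sharedMem, mockStream_t stream) {
  (void)func; (void)grid; (void)block; (void)sharedMem; (void)stream;
  callLog.push_back("launch");
  lastKernelArg = *static_cast<int *>(args[0]);
  return 0;
}

// Host-side stub for `__global__ void g1(int x)`, written out by hand.
void g1_stub(int x) {
  void *args[] = {&x};  // local array of pointers to the arguments
  dim3 grid, block;
  std::size_t sharedMem = 0;
  mockStream_t stream = nullptr;
  mockPopCallConfiguration(&grid, &block, &sharedMem, &stream);
  mockLaunchKernel(reinterpret_cast<const void *>(&g1_stub), grid, block, args,
                   sharedMem, stream);
}

// Roughly what `g1<<<1, 1>>>(42);` turns into at the call site.
void launchG1() {
  if (mockPushCallConfiguration(dim3(1), dim3(1)) == 0)
    g1_stub(42);
}
```

The conditional in `launchG1` corresponds to the `icmp` and `br` lines that the test checks between the configuration call and the call to `g1`.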

clang/test/Driver/cuda-simple.cu

	// Verify that we can parse a simple CUDA file with or without -save-temps			// Verify that we can parse a simple CUDA file with or without -save-temps
	// http://llvm.org/PR22936			// http://llvm.org/PR22936
	// RUN: %clang -nocudainc -nocudalib -Werror -fsyntax-only -c %s			// RUN: %clang -nocudainc -nocudalib -Werror -fsyntax-only -c %s
	//			//
	// Verify that we pass -x cuda-cpp-output to compiler after			// Verify that we pass -x cuda-cpp-output to compiler after
	// preprocessing a CUDA file			// preprocessing a CUDA file
	// RUN: %clang -Werror -### -save-temps -c %s 2>&1 | FileCheck %s			// RUN: %clang -Werror -### -save-temps -c %s 2>&1 | FileCheck %s
	// CHECK: "-cc1"			// CHECK: "-cc1"
	// CHECK: "-E"			// CHECK: "-E"
	// CHECK: "-x" "cuda"			// CHECK: "-x" "cuda"
	// CHECK-NEXT: "-cc1"			// CHECK-NEXT: "-cc1"
	// CHECK: "-x" "cuda-cpp-output"			// CHECK: "-x" "cuda-cpp-output"
	//			//
	// Verify that compiler accepts CUDA syntax with "-x cuda-cpp-output".			// Verify that compiler accepts CUDA syntax with "-x cuda-cpp-output".
	// RUN: %clang -Werror -fsyntax-only -x cuda-cpp-output -c %s			// RUN: %clang -Werror -fsyntax-only -x cuda-cpp-output -c %s

	int cudaConfigureCall(int, int);			extern "C" int cudaConfigureCall(int, int);
				extern "C" int __cudaPushCallConfiguration(int, int);

	__attribute__((global)) void kernel() {}			__attribute__((global)) void kernel() {}

	void func() {			void func() {
	kernel<<<1,1>>>();			kernel<<<1,1>>>();
	}			}

clang/test/SemaCUDA/Inputs/cuda.h

	[12 lines not shown]
	#define __launch_bounds__(...) __attribute__((launch_bounds(__VA_ARGS__)))			#define __launch_bounds__(...) __attribute__((launch_bounds(__VA_ARGS__)))

	struct dim3 {			struct dim3 {
	unsigned x, y, z;			unsigned x, y, z;
	__host__ __device__ dim3(unsigned x, unsigned y = 1, unsigned z = 1) : x(x), y(y), z(z) {}			__host__ __device__ dim3(unsigned x, unsigned y = 1, unsigned z = 1) : x(x), y(y), z(z) {}
	};			};

	typedef struct cudaStream *cudaStream_t;			typedef struct cudaStream *cudaStream_t;
				typedef enum cudaError {} cudaError_t;

	int cudaConfigureCall(dim3 gridSize, dim3 blockSize, size_t sharedSize = 0,			extern "C" int cudaConfigureCall(dim3 gridSize, dim3 blockSize,
				size_t sharedSize = 0,
	cudaStream_t stream = 0);			cudaStream_t stream = 0);
				extern "C" int __cudaPushCallConfiguration(dim3 gridSize, dim3 blockSize,
				size_t sharedSize = 0,
				cudaStream_t stream = 0);
				extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim,
				dim3 blockDim, void **args,
				size_t sharedMem, cudaStream_t stream);

	// Host- and device-side placement new overloads.			// Host- and device-side placement new overloads.
	void *operator new(__SIZE_TYPE__, void *p) { return p; }			void *operator new(__SIZE_TYPE__, void *p) { return p; }
	void *operator new[](__SIZE_TYPE__, void *p) { return p; }			void *operator new[](__SIZE_TYPE__, void *p) { return p; }
	__device__ void *operator new(__SIZE_TYPE__, void *p) { return p; }			__device__ void *operator new(__SIZE_TYPE__, void *p) { return p; }
	__device__ void *operator new[](__SIZE_TYPE__, void *p) { return p; }			__device__ void *operator new[](__SIZE_TYPE__, void *p) { return p; }

	#endif // !__NVCC__			#endif // !__NVCC__

clang/test/SemaCUDA/config-type.cu

	// RUN: %clang_cc1 -fsyntax-only -verify %s			// RUN: %clang_cc1 -fno-cuda-new-launch -fsyntax-only -verify=legacy-launch %s
				// RUN: %clang_cc1 -fcuda-new-launch -fsyntax-only -verify=new-launch %s

	void cudaConfigureCall(unsigned gridSize, unsigned blockSize); // expected-error {{must have scalar return type}}			// legacy-launch-error@+1 {{must have scalar return type}}
				void cudaConfigureCall(unsigned gridSize, unsigned blockSize);
				// new-launch-error@+1 {{must have scalar return type}}
				void __cudaPushCallConfiguration(unsigned gridSize, unsigned blockSize);