The changes implement an end-to-end CUDA compilation pipeline in the driver (i.e. a single clang invocation produces a usable host object file that incorporates the GPU code) and the runtime code necessary to initialize that GPU code.
- Launch device-side compilation(s):
- Added the '--cuda-gpu-arch=<sm_XX>' option; the GPU architecture defaults to sm_20.
- For each GPU architecture, launch cc1 with '-fcuda-is-device -target-cpu <GPU>'.
- Internally, each device-side compilation action is wrapped in CudaDeviceAction(GPU), which selects the appropriate toolchain based on the GPU and then proceeds to construct the compilation pipeline.
- Added the --cuda-host-only and --cuda-device-only options to skip the device-side or host-side part of the compilation, respectively.
- Incorporate the GPU code generated by the device-side compilation into the host object file:
- Added the "-fcuda-include-gpubinary <FILE>" option to specify a file with GPU code to incorporate.
- Internally, the host-side compilation action is wrapped in CudaHostAction(input.cu, [list of files produced by device-side compilation]). When the driver builds jobs for a CudaHostAction, the host compilation jobs are constructed normally; at the end, each device-side output is passed to the host-side compilation by adding a "-fcuda-include-gpubinary <device-side-output.s>" option. (See the usage example after this list.)
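For illustration, here is how the new driver flags are intended to be used on a trivial CUDA file (the kernel and the file/output names below are made up for the example):

    // axpy.cu -- a made-up input used only to illustrate the new pipeline.
    __global__ void axpy(float a, float *x, float *y) {
      y[threadIdx.x] = a * x[threadIdx.x];
    }

    // Illustrative driver invocations:
    //   clang -c axpy.cu                         # device (sm_20) + host -> single axpy.o
    //   clang -c axpy.cu --cuda-gpu-arch=sm_35   # device-side compilation targets sm_35
    //   clang -S axpy.cu --cuda-device-only      # produce the device-side output only
    //   clang -c axpy.cu --cuda-host-only        # produce the host object only
    // In the combined case the driver forwards each device-side output to the
    // host-side compilation via "-fcuda-include-gpubinary <device output>".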
- The CGCUDARuntime class was extended to provide an API for per-module constructor/destructor creation.
- CGNVCUDARuntime: implemented ModuleCtorFunction() and ModuleDtorFunction() to generate the initialization code required for cudart-style kernel launches to work.
- ModuleCtorFunction():
- creates a .cuda_register_functions(fatbin_handle) function which calls __cudaRegisterFunction(...) for each kernel emitted with EmitDeviceStub().
- creates and returns the .cuda_module_ctor() function. For each -fcuda-include-gpubinary:
- creates a constant string with the contents of the specified file.
- creates an initialized __fatBinC_Wrapper_t struct which points to that string.
- generates a call to __cudaRegisterFatBinary(&wrapper_struct) and stores the returned handle in a variable.
- generates a call to .cuda_register_functions(handle).
NOTE: Even though we're calling __cudaRegisterFatBinary(), which would imply that it expects GPU code encapsulated in NVIDIA's proprietary 'FatBinary' format, we're actually passing the GPU code as a NUL-terminated string containing PTX assembly. Alas, the fatbin format is not documented. Fortunately, the low-level driver API for loading GPU code accepts cubin/fatbin/NUL-terminated string formats, and cudart appears to pass the string through to the driver, so we can skip fatbin altogether.
- ModuleDtorFunction(): creates and returns the .cuda_module_dtor() function, which generates a call to __cudaUnregisterFatBinary(saved_handle) for each GPU code blob registered in ModuleCtorFunction().
- CodeGenModule.cpp: during host-side CUDA compilation, calls CUDARuntime->ModuleCtorFunction()/ModuleDtorFunction() and adds the returned functions to the global constructor/destructor lists. (A rough C-level sketch of the emitted code follows below.)
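To make the generated structure more concrete, here is a rough, hand-written C++ approximation of the host-side code emitted for a single GPU code blob and a single kernel. The cudart registration entry points and the __fatBinC_Wrapper_t layout/magic values are undocumented, so the declarations below are assumptions reconstructed for illustration only; the kernel name 'axpy' and stub name 'axpy_stub' are likewise made up.

    // Hand-written approximation (not the actual generated IR) of the host-side
    // registration code. The extern "C" declarations are assumptions about
    // cudart's undocumented interface; treat them as illustrative only.
    extern "C" void **__cudaRegisterFatBinary(void *fatCubin);
    extern "C" void __cudaUnregisterFatBinary(void **fatCubinHandle);
    extern "C" void __cudaRegisterFunction(void **fatCubinHandle,
                                           const char *hostFun, char *deviceFun,
                                           const char *deviceName, int threadLimit,
                                           void *tid, void *bid, void *blockDim,
                                           void *gridDim, int *wordSize);

    // Wrapper pointing at the embedded GPU code. Per the NOTE above, 'data'
    // points at a NUL-terminated PTX string rather than a real fatbin.
    // Field layout and magic/version values are assumptions.
    struct __fatBinC_Wrapper_t {
      int magic;
      int version;
      const void *data;
      void *unused;
    };

    static const char gpu_code[] = "...PTX assembly...";  // contents of the -fcuda-include-gpubinary file
    static __fatBinC_Wrapper_t fatbin_wrapper = {0x466243b1, 1, gpu_code, nullptr};
    static void **fatbin_handle;

    void axpy_stub();  // host-side stub emitted by EmitDeviceStub() for kernel 'axpy'

    // Equivalent of .cuda_register_functions(fatbin_handle): one
    // __cudaRegisterFunction() call per kernel stub.
    static void cuda_register_functions(void **handle) {
      __cudaRegisterFunction(handle, (const char *)(void *)axpy_stub,
                             const_cast<char *>("axpy"), "axpy",
                             /*threadLimit=*/-1, nullptr, nullptr, nullptr,
                             nullptr, nullptr);
    }

    // Equivalent of .cuda_module_ctor(), added to the global constructor list.
    static void cuda_module_ctor() {
      fatbin_handle = __cudaRegisterFatBinary(&fatbin_wrapper);
      cuda_register_functions(fatbin_handle);
    }

    // Equivalent of .cuda_module_dtor(), added to the global destructor list.
    static void cuda_module_dtor() {
      __cudaUnregisterFatBinary(fatbin_handle);
    }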
- Added a test case to verify CUDA pipeline construction in the driver.
I'm wondering about the "gpucode" mnemonic :-) It's unusual and kind of ambiguous. What does "gpucode" mean here? PTX? Maybe "PTX" would be more explicit, then?
PTX is probably not too specific, since this flag begins with "cuda_", so it's already clear we're in the CUDA/PTX flow.
[this applies to other uses of "gpucode" too]