This is an archive of the discontinued LLVM Phabricator instance.

End-to-end CUDA compilation.
AbandonedPublic

Authored by tra on Mar 19 2015, 2:28 PM.

Details

Reviewers
eliben
echristo
Summary

The changes implement end-to-end CUDA compilation pipeline (i.e single clang invocation produces usable host object file which incorporates GPU code) in the driver and necessary runtime to initialize GPU code.

  • Launch device-side compilation(s):
    • Added '--cuda-gpu-arch=<sm_XX>' option. ARCH defaults to sm_20.
    • For each GPU architecture launch cc1 with -fcuda-is-device -target-cpu <GPU>
    • internally each device-side compilation action is wrapped in CudaDeviceAction(GPU) which selects appropriate toolchain based on GPU and then proceeds construction compilation pipeline.
    • added --cuda-host-only and --cuda-device-only options to skip host/device compilation parts.
  • Incorporate GPU code generated by device-side into the host object file:
    • added "-fcuda-include-gpubinary <FILE>" option to specify file with GPU code to incorporate.
    • internally host-side compilation action is wrapped in CudaHostAction(input.cu, [list of files produced by device-side compilation]). When driver builds jobs for CudaHostAction, host compilation jobs are constructed normally. At the end each device-side output is passed to the host-side compilation by adding "-fcuda-include-gpubinary <device-side-output.s>" options.
    • CGCUDARuntime class was extended to provide API for per-module constructor/destructor creation.
    • CGNVCUDARuntime

      Implemented ModuleCtorFunction and ModuleDtorFunction to generate initialization code required for cudart-style kernel launches to work.
    • ModuleCtorFunction():
      • creates .cuda_register_functions(fatbin_handle) function which calls __cudaRegisterFunction(...) for each kernel emitted with EmitDeviceStub().
      • creates and returns .cuda_module_ctor() function: For each -fcuda-include-gpubinary:
        • creates a constant string with contents of the file specified.
        • creates an initialized __fatBinC_Wrapper_t struct which points to the string.
        • generates call to __cudaRegisterFatBinary(&wrapper_struct) and stores returned handle in a variable.
        • generates a call to .cuda_register_functions(handle)

          NOTE: Even though we're calling __cudaRegisterFatBinary() which would imply that it expects GPU code to be encapsulated in nvidia's proprietary 'FatBinary' format, we're actually passing GPU code as a NUL-terminated string with PTX assembly in it. Alas fatbin format is not documented. Fortunately the low-level driver API for loading GPU code accepts cubin/fatbin/NUL-terminated string formats and cudart seems to pass the string to the driver, so we can skip fatbin altogether.
    • ModuleDtorFunction(): creates and returns .cuda_module_dtor() function which generates a call to __cudaUnregisterFatBinary(saved_handle) for each GPU code blob initialized in ModuleCtorFunction().
    • CodeGenModule.cpp During host-side CUDA compilation calls CUDARuntime->ModuleCtorFunction()/ModuleDtorFunction() and adds returned value to a global constructor/destructor list.
  • Added test case to verify CUDA pipeline construction in the driver.

Diff Detail

Event Timeline

tra updated this revision to Diff 22301.Mar 19 2015, 2:28 PM
tra retitled this revision from to Preliminary driver changes to build and stitch together host and device-side CUDA compilation pipelines..
tra updated this object.
tra edited the test plan for this revision. (Show Details)
tra added reviewers: eliben, echristo.
tra added a subscriber: Unknown Object (MLST).
echristo edited edge metadata.Mar 22 2015, 8:44 PM

Were we going to abandon this one in favor of the write up you're doing or
is this something else?

-eric

tra updated this revision to Diff 23107.Apr 1 2015, 5:26 PM
tra updated this object.
tra edited edge metadata.

The changes no longer depend on external ptxwrap tool.
CUDA runtime support in CGCUDANV.cpp now generates per-module constructor/destructor to load and initialize GPU code.

tra retitled this revision from Preliminary driver changes to build and stitch together host and device-side CUDA compilation pipelines. to End-to-end CUDA compilation..Apr 2 2015, 2:05 PM
tra updated this object.
tra added a comment.Apr 2 2015, 2:15 PM

Updated description -- added detailed description for the changes.

eliben edited edge metadata.Apr 3 2015, 2:30 PM

Thanks for the updated description.

Here's an initial round of comments. I'm leaving the driver parts mostly to the driver experts.

lib/CodeGen/CGCUDANV.cpp
109

Use C++11 {...} initialization?

166

If this is pseudocode example, second level of // comments is superfluous

182

const?

184

Can you document the C signature of the called function somewhere for clarity?

186

leftovers?

208

same here re second-level //

240

Is the 4 in [4] needed?

lib/CodeGen/CGCUDARuntime.h
39

It would really be great not to have data inside this abstract interface; is this necessary?

Note that "fatbin handles" sounds very NVIDIA CUDA runtime specific, though this interface is allegedly generic :)

46

Please document these APIs

eliben added inline comments.Apr 3 2015, 2:30 PM
include/clang/Driver/Action.h
140

Can you give an example in this comment? like sm_30, etc.

144

IIRC _[A-Z] names are discouraged, and against the style anyway

include/clang/Driver/CC1Options.td
605

I'm wondering about the "gpucode" mnemonic :-) It's unusual and kinda ambiguous. What does gpucode mean here? PTX? Maybe PTX can be more explicit then?

PTX is probably not too specific since this flag begins with "cuda_" so it's already about the CUDA/PTX flow.

[this applies to other uses of "gpucode" too]

include/clang/Driver/Options.td
462

Is it possible to make these flags positive, with false-by-default values?

1079

What is this for?

include/clang/Frontend/CodeGenOptions.h
164

s/Files/Blobs/ or "strings"? And as above, maybe PTX would be better than "GpuCode"

lib/CodeGen/CGCUDANV.cpp
50

Put doc comments for the new functions/methods you're adding, and preferably for the data fields as well, unless they're completely obvious

95

Do you really need Zeros as a member? You only use it once. Also, if you just declare it you can use the nice C++11 {...} initializer in the place of use, making the code even shorter.

tra updated this revision to Diff 23285.Apr 6 2015, 11:01 AM
tra edited edge metadata.

Addressed eliben@'s review comments.

tra added inline comments.Apr 6 2015, 11:04 AM
include/clang/Driver/Action.h
140

Done

144

Done.

include/clang/Driver/CC1Options.td
605

It's actually an opaque blob. clang does not care what's in the file as it just passes the bits to cudart which passes it to the driver. The driver can digest PTX (which we pass in this case), but it will as happily accept GPU code packed in fatbin or cubin formats. If/when we grow ability to compile device-side to SASS, we would just do "-cuda-include-gpucode gpu-code-packed-in.cubin" and it should work with no other changes on the host side.

So, 'gpucode' was the best approximation I could come up with that would keep "GPU code in any shape or form as long as it's PTX/fatbin or cubin".

I'd be happy to change it. Suggestions?

include/clang/Driver/Options.td
462

Sure. Changed the options to -fcuda-host-only/-fcuda-device-only

1079

I've added for (partial) compatibility with nvcc. I've removed it for now as drop-in nvcc compatibility is not the purpose of this patch.

include/clang/Frontend/CodeGenOptions.h
164

It's a vector of strings containing names of files that contain GPU code blobs, whatever their format may be. I'll rename the variable to CudaGpuCodeFileNames and will update the comment to reflect that.

How about this?

/// A list of file names passed with -cuda-include-gpucode options to forward
/// to CUDA runtime back-end for incorporating them into host-side object
/// file.
std::vector<std::string> CudaGpuCodeFileNames;
lib/CodeGen/CGCUDANV.cpp
50

Moved some fields out of the class and into local variables where they are used.
Documented the rest.

95

Done. Also moved number of other things with single use down to where they are used.

109

OK.

166

The idea I wanted to convey is that I'm not really generating the loop, but rather rather generate a call for each kernel, in effect unrolling the loop.

I've changed pseudocode to linear sequence of calls which is what those functions really generate.

182

Nope. CreateBitCast wants non-const Function*:

../../../tools/clang/lib/CodeGen/CGCUDANV.cpp:198:31: error: cannot initialize a parameter of type 'llvm::Value *' with an lvalue of type 'const llvm::Function *'

Builder.CreateBitCast(Kernel, VoidPtrTy), // kernel stub addr
184

I've moved CreateRuntimeFunction(...,"__cudaRegisterFunction") along with its signature in the comments into makeRegisterKernelsFn, so it should be visible close to where it's used.

186

Yes. Removed.

208

Fixed.

240

Not really. Removed.

lib/CodeGen/CGCUDARuntime.h
39

List of generated kernels is something that I expect to be useful for all subclasses of CUDARuntime.
That's why I've put EmittedKernels there and a non-virtual methodEmitDeviceStub() to populate it.

FatbinHandles, on the other hand, is indeed cudart-specific. I've moved it into CGCUDANV.

46

Done.

eliben added a comment.Apr 6 2015, 3:02 PM

A couple of replies to comments; will do another pass on the new revision

include/clang/Driver/CC1Options.td
605

I see - some generic mnemonic is needed, I agree (so PTX is not a good idea). But "--gpu-code" is a nvcc flag that means something completely different :-/ So "gpu code" here may still be confusing. Maybe "gpublob" or "gpuobject" or "gpubinary" or something like that. I can't think of a perfect solution right now.

I'll leave it to your discretion.

include/clang/Frontend/CodeGenOptions.h
164

Yeah, if this is for file names, it's a good idea to have "FileNames" in the name

lib/CodeGen/CGCUDARuntime.h
39

I would still remove EmittedKernels for now; we only have a single CUDA runtime at this time in upstream, so this feels redundant, as it makes the runtime interface / implementation barrier less clean than it should be. In the future if/when new runtime implementations are added, we'll figure out what's the best way to factor common code out is.

YAGNI, essentially :)

tra added inline comments.Apr 6 2015, 3:25 PM
include/clang/Driver/CC1Options.td
605

gpubinary wins.

lib/CodeGen/CGCUDARuntime.h
39

OK.

eliben added a comment.Apr 6 2015, 3:36 PM

Where are the tests for emitting ctors/dtors, registering kernels, etc?

include/clang/Driver/CC1Options.td
605

Should we prefix all cuda-related flags with -f for consistency with the existing ones? Don't know if it makes sense given that the cl_ ones above (for example) have no -f, but at least the CUDA ones should be consistent among themselves

lib/CodeGen/CGCUDANV.cpp
38

s/VMContext/Context/

41

Document FatbinHandles

94

extra line

181

Can you include the parameter names in this declaration? It would be much easier to follow

I believe this comes from host_runtime.h?

190

Please document what BlobHandlePtr means here and how it's used

252

Use a named constant for the magic number -- it will then document itself

253

I'd go for a constant here as well

These can be class level, probably

260

Comment explaining why

lib/CodeGen/CGCUDARuntime.h
48

I'd move this to the implementation as well, along with EmittedKernels.

Just reading the documentation of this method makes little sense given that it lives in an abstract interface. The code will be easier to untangle if the interface stays completely functionality-free. At this time this won't even add code duplication since we just have a single implementation.

lib/Driver/Driver.cpp
1235

you can just return new CudaHostAction... here, no?

1520

Remove

lib/Driver/Tools.cpp
2590

Can you explain a bit more why/what this means in the comment?

tra added a comment.Apr 6 2015, 5:20 PM

I'm still working on the changes to address your comments, so "done" means "done but not submitted yet". I'll finish remaining bits and will update the patch tomorrow (Tue).

include/clang/Driver/CC1Options.td
605

Just had a chat with chandlerc@ and echristo@ on the subject.
Consensus appears to be that options related to driver behavior should be --cuda-something[=value] and options passed down to cc1 -fcuda-something[=value]. I'll rename the options I've added accordingly.

lib/CodeGen/CGCUDANV.cpp
38

Done.

41

Done.

94

Removed.

252

It would be an overkill IMO. There's nothing more informative I could add to the comment that it's a magic number.

260

That's what nvcc does. I don't know whether there's a good reason for it. Removing it does not seem to break loading of GPU binary, so I'll remove explicit alignment.

lib/CodeGen/CGCUDARuntime.h
48

Done that already while I was moving EmittedKernels out.

lib/Driver/Driver.cpp
1235

Done.

1520

That was not intended to be committed. Will fix shortly.

lib/Driver/Tools.cpp
2590

General assumption that compilation deals with a single source file.
When we're compiling CUDA, driver may generate additional build passes and we may end
up with an action that has more than one action input. The check makes sure that all those inputs were results of compilation of the same source file.

Hmm. That's another case where I need info about source file type. Let me see if I can add a function to dig that out from the action chain and then this loop will not be necessary as I can explicitly check whether we're compiling a CUDA file.

tra updated this revision to Diff 23366.Apr 7 2015, 1:40 PM

Round #2 of clean-ups to address eliben@'s commens

Renamed new options to be more consistent.
Added more comments, fixed formatting errors.

tra updated this object.Apr 7 2015, 1:45 PM
tra updated this revision to Diff 23372.Apr 7 2015, 3:01 PM

Added test case for IR generation for module constructor/destructor.

eliben added a comment.Apr 8 2015, 8:29 AM

non-driver parts LGTM

Hi Art,

Starting to look pretty good here. I've got a few inline nits and a couple of small requests, but I think we're almost ready to go here. Sorry for the delays.

-eric

lib/CodeGen/CGCUDANV.cpp
164

"with the CUDA runtime".

166

The function name begins with a .? Ugh.

197–207

clang-format?

lib/Driver/Driver.cpp
183–186

"and partial CUDA compilations only run up"

1194

Some comment on the default here.

1672–1674

Do you need the declaration up here? Why not just pull the static function up if so?

1732

Probably would prefer "DeviceTriple" here.

lib/Driver/Tools.cpp
2583–2588

Comment about what's going on here.

2696

Might be nice to pull this sort of change out so it isn't affecting the rest of the diff.

5741–5747

Please pull this out into a separate patch.

test/CodeGenCUDA/device-stub.cu
40–46 ↗(On Diff #23372)

Should some of these be CHECK-NEXT?

tra updated this revision to Diff 24968.May 5 2015, 1:22 PM

Addressed most of echristo@'s comments.
I will split out the parts Eric suggested and runtime glue code generation into separate sets of changes.

tra added inline comments.May 5 2015, 1:24 PM
lib/CodeGen/CGCUDANV.cpp
164

Done.

166

Replaced with __

197–207

Done. I've also replaced last argument with a plain NullPtr.

lib/Driver/Driver.cpp
183–186

Fixed.

1194

Done.

1672–1674

That would clutter the changes for no good reason. Whenever bunch of code moved from one place to another, it's always a pain figuring out whether things were just copied or copied *and* changed. Forward declaration is a lesser crime, IMO.

1732

Done.

lib/Driver/Tools.cpp
2583–2588

Done.

2696

Sure. I was also thinking of splitting code generation into a separate commit as well as it's largely independent of the driver changes.

test/CodeGenCUDA/device-stub.cu
40–46 ↗(On Diff #23372)

Some. Changed to CHECK-NEXT where it was possible.

tra updated this revision to Diff 24973.May 5 2015, 1:43 PM

Fixed error checking in createInvocationFromCommandLine() so it can deal with multiple jobs created during cuda compilation.
Added a test case to make sure external tools can parse cuda files.

tra updated this revision to Diff 25077.May 6 2015, 11:52 AM
tra updated this object.
This comment was removed by tra.
tra abandoned this revision.May 6 2015, 12:42 PM

Ignore diff 25077 which was unintentionally added to this review.
This review has been split into D9509, D9507 and D9506

mkuron added a subscriber: mkuron.Sep 2 2015, 5:06 AM