This is an archive of the discontinued LLVM Phabricator instance.

End-to-end CUDA compilation.
AbandonedPublic

Authored by tra on Mar 19 2015, 2:28 PM.

Details

Reviewers
eliben
echristo
Summary

The changes implement end-to-end CUDA compilation pipeline (i.e single clang invocation produces usable host object file which incorporates GPU code) in the driver and necessary runtime to initialize GPU code.

  • Launch device-side compilation(s):
    • Added '--cuda-gpu-arch=<sm_XX>' option. ARCH defaults to sm_20.
    • For each GPU architecture launch cc1 with -fcuda-is-device -target-cpu <GPU>
    • internally each device-side compilation action is wrapped in CudaDeviceAction(GPU) which selects appropriate toolchain based on GPU and then proceeds construction compilation pipeline.
    • added --cuda-host-only and --cuda-device-only options to skip host/device compilation parts.
  • Incorporate GPU code generated by device-side into the host object file:
    • added "-fcuda-include-gpubinary <FILE>" option to specify file with GPU code to incorporate.
    • internally host-side compilation action is wrapped in CudaHostAction(input.cu, [list of files produced by device-side compilation]). When driver builds jobs for CudaHostAction, host compilation jobs are constructed normally. At the end each device-side output is passed to the host-side compilation by adding "-fcuda-include-gpubinary <device-side-output.s>" options.
    • CGCUDARuntime class was extended to provide API for per-module constructor/destructor creation.
    • CGNVCUDARuntime

      Implemented ModuleCtorFunction and ModuleDtorFunction to generate initialization code required for cudart-style kernel launches to work.
    • ModuleCtorFunction():
      • creates .cuda_register_functions(fatbin_handle) function which calls __cudaRegisterFunction(...) for each kernel emitted with EmitDeviceStub().
      • creates and returns .cuda_module_ctor() function: For each -fcuda-include-gpubinary:
        • creates a constant string with contents of the file specified.
        • creates an initialized __fatBinC_Wrapper_t struct which points to the string.
        • generates call to __cudaRegisterFatBinary(&wrapper_struct) and stores returned handle in a variable.
        • generates a call to .cuda_register_functions(handle)

          NOTE: Even though we're calling __cudaRegisterFatBinary() which would imply that it expects GPU code to be encapsulated in nvidia's proprietary 'FatBinary' format, we're actually passing GPU code as a NUL-terminated string with PTX assembly in it. Alas fatbin format is not documented. Fortunately the low-level driver API for loading GPU code accepts cubin/fatbin/NUL-terminated string formats and cudart seems to pass the string to the driver, so we can skip fatbin altogether.
    • ModuleDtorFunction(): creates and returns .cuda_module_dtor() function which generates a call to __cudaUnregisterFatBinary(saved_handle) for each GPU code blob initialized in ModuleCtorFunction().
    • CodeGenModule.cpp During host-side CUDA compilation calls CUDARuntime->ModuleCtorFunction()/ModuleDtorFunction() and adds returned value to a global constructor/destructor list.
  • Added test case to verify CUDA pipeline construction in the driver.

Diff Detail

Event Timeline

tra updated this revision to Diff 22301.Mar 19 2015, 2:28 PM
tra retitled this revision from to Preliminary driver changes to build and stitch together host and device-side CUDA compilation pipelines..
tra updated this object.
tra edited the test plan for this revision. (Show Details)
tra added reviewers: eliben, echristo.
tra added a subscriber: Unknown Object (MLST).
echristo edited edge metadata.Mar 22 2015, 8:44 PM

Were we going to abandon this one in favor of the write up you're doing or
is this something else?

-eric

tra updated this revision to Diff 23107.Apr 1 2015, 5:26 PM
tra updated this object.
tra edited edge metadata.

The changes no longer depend on external ptxwrap tool.
CUDA runtime support in CGCUDANV.cpp now generates per-module constructor/destructor to load and initialize GPU code.

tra retitled this revision from Preliminary driver changes to build and stitch together host and device-side CUDA compilation pipelines. to End-to-end CUDA compilation..Apr 2 2015, 2:05 PM
tra updated this object.
tra added a comment.Apr 2 2015, 2:15 PM

Updated description -- added detailed description for the changes.

eliben edited edge metadata.Apr 3 2015, 2:30 PM

Thanks for the updated description.

Here's an initial round of comments. I'm leaving the driver parts mostly to the driver experts.

lib/CodeGen/CGCUDANV.cpp
108

Use C++11 {...} initialization?

165

If this is pseudocode example, second level of // comments is superfluous

181

const?

183

Can you document the C signature of the called function somewhere for clarity?

185

leftovers?

207

same here re second-level //

239

Is the 4 in [4] needed?

lib/CodeGen/CGCUDARuntime.h
39

It would really be great not to have data inside this abstract interface; is this necessary?

Note that "fatbin handles" sounds very NVIDIA CUDA runtime specific, though this interface is allegedly generic :)

46

Please document these APIs

eliben added inline comments.Apr 3 2015, 2:30 PM
include/clang/Driver/Action.h
140 ↗(On Diff #23107)

Can you give an example in this comment? like sm_30, etc.

144 ↗(On Diff #23107)

IIRC _[A-Z] names are discouraged, and against the style anyway

include/clang/Driver/CC1Options.td
612

I'm wondering about the "gpucode" mnemonic :-) It's unusual and kinda ambiguous. What does gpucode mean here? PTX? Maybe PTX can be more explicit then?

PTX is probably not too specific since this flag begins with "cuda_" so it's already about the CUDA/PTX flow.

[this applies to other uses of "gpucode" too]

include/clang/Driver/Options.td
456 ↗(On Diff #23107)

Is it possible to make these flags positive, with false-by-default values?

1074 ↗(On Diff #23107)

What is this for?

include/clang/Frontend/CodeGenOptions.h
167

s/Files/Blobs/ or "strings"? And as above, maybe PTX would be better than "GpuCode"

lib/CodeGen/CGCUDANV.cpp
49

Put doc comments for the new functions/methods you're adding, and preferably for the data fields as well, unless they're completely obvious

94

Do you really need Zeros as a member? You only use it once. Also, if you just declare it you can use the nice C++11 {...} initializer in the place of use, making the code even shorter.

tra updated this revision to Diff 23285.Apr 6 2015, 11:01 AM
tra edited edge metadata.

Addressed eliben@'s review comments.

tra added inline comments.Apr 6 2015, 11:04 AM
include/clang/Driver/Action.h
140 ↗(On Diff #23107)

Done

144 ↗(On Diff #23107)

Done.

include/clang/Driver/CC1Options.td
612

It's actually an opaque blob. clang does not care what's in the file as it just passes the bits to cudart which passes it to the driver. The driver can digest PTX (which we pass in this case), but it will as happily accept GPU code packed in fatbin or cubin formats. If/when we grow ability to compile device-side to SASS, we would just do "-cuda-include-gpucode gpu-code-packed-in.cubin" and it should work with no other changes on the host side.

So, 'gpucode' was the best approximation I could come up with that would keep "GPU code in any shape or form as long as it's PTX/fatbin or cubin".

I'd be happy to change it. Suggestions?

include/clang/Driver/Options.td
456 ↗(On Diff #23107)

Sure. Changed the options to -fcuda-host-only/-fcuda-device-only

1074 ↗(On Diff #23107)

I've added for (partial) compatibility with nvcc. I've removed it for now as drop-in nvcc compatibility is not the purpose of this patch.

include/clang/Frontend/CodeGenOptions.h
167

It's a vector of strings containing names of files that contain GPU code blobs, whatever their format may be. I'll rename the variable to CudaGpuCodeFileNames and will update the comment to reflect that.

How about this?

/// A list of file names passed with -cuda-include-gpucode options to forward
/// to CUDA runtime back-end for incorporating them into host-side object
/// file.
std::vector<std::string> CudaGpuCodeFileNames;
lib/CodeGen/CGCUDANV.cpp
49

Moved some fields out of the class and into local variables where they are used.
Documented the rest.

94

Done. Also moved number of other things with single use down to where they are used.

108

OK.

165

The idea I wanted to convey is that I'm not really generating the loop, but rather rather generate a call for each kernel, in effect unrolling the loop.

I've changed pseudocode to linear sequence of calls which is what those functions really generate.

181

Nope. CreateBitCast wants non-const Function*:

../../../tools/clang/lib/CodeGen/CGCUDANV.cpp:198:31: error: cannot initialize a parameter of type 'llvm::Value *' with an lvalue of type 'const llvm::Function *'

Builder.CreateBitCast(Kernel, VoidPtrTy), // kernel stub addr
183

I've moved CreateRuntimeFunction(...,"__cudaRegisterFunction") along with its signature in the comments into makeRegisterKernelsFn, so it should be visible close to where it's used.

185

Yes. Removed.

207

Fixed.

239

Not really. Removed.

lib/CodeGen/CGCUDARuntime.h
39

List of generated kernels is something that I expect to be useful for all subclasses of CUDARuntime.
That's why I've put EmittedKernels there and a non-virtual methodEmitDeviceStub() to populate it.

FatbinHandles, on the other hand, is indeed cudart-specific. I've moved it into CGCUDANV.

46

Done.

eliben added a comment.Apr 6 2015, 3:02 PM

A couple of replies to comments; will do another pass on the new revision

include/clang/Driver/CC1Options.td
612

I see - some generic mnemonic is needed, I agree (so PTX is not a good idea). But "--gpu-code" is a nvcc flag that means something completely different :-/ So "gpu code" here may still be confusing. Maybe "gpublob" or "gpuobject" or "gpubinary" or something like that. I can't think of a perfect solution right now.

I'll leave it to your discretion.

include/clang/Frontend/CodeGenOptions.h
167

Yeah, if this is for file names, it's a good idea to have "FileNames" in the name

lib/CodeGen/CGCUDARuntime.h
39

I would still remove EmittedKernels for now; we only have a single CUDA runtime at this time in upstream, so this feels redundant, as it makes the runtime interface / implementation barrier less clean than it should be. In the future if/when new runtime implementations are added, we'll figure out what's the best way to factor common code out is.

YAGNI, essentially :)

tra added inline comments.Apr 6 2015, 3:25 PM
include/clang/Driver/CC1Options.td
612

gpubinary wins.

lib/CodeGen/CGCUDARuntime.h
39

OK.

eliben added a comment.Apr 6 2015, 3:36 PM

Where are the tests for emitting ctors/dtors, registering kernels, etc?

include/clang/Driver/CC1Options.td
612

Should we prefix all cuda-related flags with -f for consistency with the existing ones? Don't know if it makes sense given that the cl_ ones above (for example) have no -f, but at least the CUDA ones should be consistent among themselves

lib/CodeGen/CGCUDANV.cpp
37

s/VMContext/Context/

40

Document FatbinHandles

93

extra line

180

Can you include the parameter names in this declaration? It would be much easier to follow

I believe this comes from host_runtime.h?

189

Please document what BlobHandlePtr means here and how it's used

251

Use a named constant for the magic number -- it will then document itself

252

I'd go for a constant here as well

These can be class level, probably

259

Comment explaining why

lib/CodeGen/CGCUDARuntime.h
48

I'd move this to the implementation as well, along with EmittedKernels.

Just reading the documentation of this method makes little sense given that it lives in an abstract interface. The code will be easier to untangle if the interface stays completely functionality-free. At this time this won't even add code duplication since we just have a single implementation.

lib/Driver/Driver.cpp
1235 ↗(On Diff #23285)

you can just return new CudaHostAction... here, no?

1522 ↗(On Diff #23285)

Remove

lib/Driver/Tools.cpp
2591 ↗(On Diff #23285)

Can you explain a bit more why/what this means in the comment?

tra added a comment.Apr 6 2015, 5:20 PM

I'm still working on the changes to address your comments, so "done" means "done but not submitted yet". I'll finish remaining bits and will update the patch tomorrow (Tue).

include/clang/Driver/CC1Options.td
612

Just had a chat with chandlerc@ and echristo@ on the subject.
Consensus appears to be that options related to driver behavior should be --cuda-something[=value] and options passed down to cc1 -fcuda-something[=value]. I'll rename the options I've added accordingly.

lib/CodeGen/CGCUDANV.cpp
37

Done.

40

Done.

93

Removed.

251

It would be an overkill IMO. There's nothing more informative I could add to the comment that it's a magic number.

259

That's what nvcc does. I don't know whether there's a good reason for it. Removing it does not seem to break loading of GPU binary, so I'll remove explicit alignment.

lib/CodeGen/CGCUDARuntime.h
48

Done that already while I was moving EmittedKernels out.

lib/Driver/Driver.cpp
1235 ↗(On Diff #23285)

Done.

1522 ↗(On Diff #23285)

That was not intended to be committed. Will fix shortly.

lib/Driver/Tools.cpp
2591 ↗(On Diff #23285)

General assumption that compilation deals with a single source file.
When we're compiling CUDA, driver may generate additional build passes and we may end
up with an action that has more than one action input. The check makes sure that all those inputs were results of compilation of the same source file.

Hmm. That's another case where I need info about source file type. Let me see if I can add a function to dig that out from the action chain and then this loop will not be necessary as I can explicitly check whether we're compiling a CUDA file.

tra updated this revision to Diff 23366.Apr 7 2015, 1:40 PM

Round #2 of clean-ups to address eliben@'s commens

Renamed new options to be more consistent.
Added more comments, fixed formatting errors.

tra updated this object.Apr 7 2015, 1:45 PM
tra updated this revision to Diff 23372.Apr 7 2015, 3:01 PM

Added test case for IR generation for module constructor/destructor.

eliben added a comment.Apr 8 2015, 8:29 AM

non-driver parts LGTM

Hi Art,

Starting to look pretty good here. I've got a few inline nits and a couple of small requests, but I think we're almost ready to go here. Sorry for the delays.

-eric

lib/CodeGen/CGCUDANV.cpp
163

"with the CUDA runtime".

165

The function name begins with a .? Ugh.

196–206

clang-format?

lib/Driver/Driver.cpp
183 ↗(On Diff #23372)

"and partial CUDA compilations only run up"

1194 ↗(On Diff #23372)

Some comment on the default here.

1672–1674 ↗(On Diff #23372)

Do you need the declaration up here? Why not just pull the static function up if so?

1732 ↗(On Diff #23372)

Probably would prefer "DeviceTriple" here.

lib/Driver/Tools.cpp
2583–2588 ↗(On Diff #23372)

Comment about what's going on here.

2696 ↗(On Diff #23372)

Might be nice to pull this sort of change out so it isn't affecting the rest of the diff.

5741–5744 ↗(On Diff #23372)

Please pull this out into a separate patch.

test/CodeGenCUDA/device-stub.cu
40–46

Should some of these be CHECK-NEXT?

tra updated this revision to Diff 24968.May 5 2015, 1:22 PM

Addressed most of echristo@'s comments.
I will split out the parts Eric suggested and runtime glue code generation into separate sets of changes.

tra added inline comments.May 5 2015, 1:24 PM
lib/CodeGen/CGCUDANV.cpp
163

Done.

165

Replaced with __

196–206

Done. I've also replaced last argument with a plain NullPtr.

lib/Driver/Driver.cpp
183 ↗(On Diff #23372)

Fixed.

1194 ↗(On Diff #23372)

Done.

1672–1674 ↗(On Diff #23372)

That would clutter the changes for no good reason. Whenever bunch of code moved from one place to another, it's always a pain figuring out whether things were just copied or copied *and* changed. Forward declaration is a lesser crime, IMO.

1732 ↗(On Diff #23372)

Done.

lib/Driver/Tools.cpp
2583–2588 ↗(On Diff #23372)

Done.

2696 ↗(On Diff #23372)

Sure. I was also thinking of splitting code generation into a separate commit as well as it's largely independent of the driver changes.

test/CodeGenCUDA/device-stub.cu
40–46

Some. Changed to CHECK-NEXT where it was possible.

tra updated this revision to Diff 24973.May 5 2015, 1:43 PM

Fixed error checking in createInvocationFromCommandLine() so it can deal with multiple jobs created during cuda compilation.
Added a test case to make sure external tools can parse cuda files.

tra updated this revision to Diff 25077.May 6 2015, 11:52 AM
tra updated this object.
This comment was removed by tra.
tra abandoned this revision.May 6 2015, 12:42 PM

Ignore diff 25077 which was unintentionally added to this review.
This review has been split into D9509, D9507 and D9506

mkuron added a subscriber: mkuron.Sep 2 2015, 5:06 AM