The changes implement an end-to-end CUDA compilation pipeline in the driver (i.e. a single clang invocation produces a usable host object file that incorporates the GPU code) and the runtime code necessary to initialize that GPU code.
- Launch device-side compilation(s):
- Added the '--cuda-gpu-arch=<sm_XX>' option; the GPU architecture defaults to sm_20.
- For each GPU architecture, launch cc1 with '-fcuda-is-device -target-cpu <GPU>'.
- Internally, each device-side compilation action is wrapped in CudaDeviceAction(GPU), which selects the appropriate toolchain based on the GPU and then proceeds to construct the compilation pipeline.
- Added the --cuda-host-only and --cuda-device-only options to skip the device-side or host-side part of the compilation, respectively.
- Incorporate the GPU code generated by the device-side compilation into the host object file:
- Added the "-fcuda-include-gpubinary <FILE>" option to specify a file with GPU code to incorporate.
- Internally, the host-side compilation action is wrapped in CudaHostAction(input.cu, [list of files produced by device-side compilation]). When the driver builds jobs for a CudaHostAction, the host compilation jobs are constructed normally; at the end, each device-side output is passed to the host-side compilation by adding a "-fcuda-include-gpubinary <device-side-output.s>" option. (See the usage example after this list.)
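For illustration, here is how the new driver flags are intended to be used on a trivial CUDA file (the kernel and the file/output names below are made up for the example):

    // axpy.cu -- a made-up input used only to illustrate the new pipeline.
    __global__ void axpy(float a, float *x, float *y) {
      y[threadIdx.x] = a * x[threadIdx.x];
    }

    // Illustrative driver invocations:
    //   clang -c axpy.cu                         # device (sm_20) + host -> single axpy.o
    //   clang -c axpy.cu --cuda-gpu-arch=sm_35   # device-side compilation targets sm_35
    //   clang -S axpy.cu --cuda-device-only      # produce the device-side output only
    //   clang -c axpy.cu --cuda-host-only        # produce the host object only
    // In the combined case the driver forwards each device-side output to the
    // host-side compilation via "-fcuda-include-gpubinary <device output>".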
- The CGCUDARuntime class was extended to provide an API for per-module constructor/destructor creation.
- CGNVCUDARuntime: implemented ModuleCtorFunction() and ModuleDtorFunction() to generate the initialization code required for cudart-style kernel launches to work.
- ModuleCtorFunction():
- creates a .cuda_register_functions(fatbin_handle) function which calls __cudaRegisterFunction(...) for each kernel emitted with EmitDeviceStub().
- creates and returns the .cuda_module_ctor() function. For each -fcuda-include-gpubinary:
- creates a constant string with the contents of the specified file.
- creates an initialized __fatBinC_Wrapper_t struct which points to that string.
- generates a call to __cudaRegisterFatBinary(&wrapper_struct) and stores the returned handle in a variable.
- generates a call to .cuda_register_functions(handle).
NOTE: Even though we're calling __cudaRegisterFatBinary(), which would imply that it expects GPU code encapsulated in NVIDIA's proprietary 'FatBinary' format, we're actually passing the GPU code as a NUL-terminated string containing PTX assembly. Alas, the fatbin format is not documented. Fortunately, the low-level driver API for loading GPU code accepts cubin/fatbin/NUL-terminated string formats, and cudart appears to pass the string through to the driver, so we can skip fatbin altogether.
- ModuleDtorFunction(): creates and returns the .cuda_module_dtor() function, which generates a call to __cudaUnregisterFatBinary(saved_handle) for each GPU code blob registered in ModuleCtorFunction().
- CodeGenModule.cpp: during host-side CUDA compilation, calls CUDARuntime->ModuleCtorFunction()/ModuleDtorFunction() and adds the returned functions to the global constructor/destructor lists. (A rough C-level sketch of the emitted code follows below.)
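To make the generated structure more concrete, here is a rough, hand-written C++ approximation of the host-side code emitted for a single GPU code blob and a single kernel. The cudart registration entry points and the __fatBinC_Wrapper_t layout/magic values are undocumented, so the declarations below are assumptions reconstructed for illustration only; the kernel name 'axpy' and stub name 'axpy_stub' are likewise made up.

    // Hand-written approximation (not the actual generated IR) of the host-side
    // registration code. The extern "C" declarations are assumptions about
    // cudart's undocumented interface; treat them as illustrative only.
    extern "C" void **__cudaRegisterFatBinary(void *fatCubin);
    extern "C" void __cudaUnregisterFatBinary(void **fatCubinHandle);
    extern "C" void __cudaRegisterFunction(void **fatCubinHandle,
                                           const char *hostFun, char *deviceFun,
                                           const char *deviceName, int threadLimit,
                                           void *tid, void *bid, void *blockDim,
                                           void *gridDim, int *wordSize);

    // Wrapper pointing at the embedded GPU code. Per the NOTE above, 'data'
    // points at a NUL-terminated PTX string rather than a real fatbin.
    // Field layout and magic/version values are assumptions.
    struct __fatBinC_Wrapper_t {
      int magic;
      int version;
      const void *data;
      void *unused;
    };

    static const char gpu_code[] = "...PTX assembly...";  // contents of the -fcuda-include-gpubinary file
    static __fatBinC_Wrapper_t fatbin_wrapper = {0x466243b1, 1, gpu_code, nullptr};
    static void **fatbin_handle;

    void axpy_stub();  // host-side stub emitted by EmitDeviceStub() for kernel 'axpy'

    // Equivalent of .cuda_register_functions(fatbin_handle): one
    // __cudaRegisterFunction() call per kernel stub.
    static void cuda_register_functions(void **handle) {
      __cudaRegisterFunction(handle, (const char *)(void *)axpy_stub,
                             const_cast<char *>("axpy"), "axpy",
                             /*threadLimit=*/-1, nullptr, nullptr, nullptr,
                             nullptr, nullptr);
    }

    // Equivalent of .cuda_module_ctor(), added to the global constructor list.
    static void cuda_module_ctor() {
      fatbin_handle = __cudaRegisterFatBinary(&fatbin_wrapper);
      cuda_register_functions(fatbin_handle);
    }

    // Equivalent of .cuda_module_dtor(), added to the global destructor list.
    static void cuda_module_dtor() {
      __cudaUnregisterFatBinary(fatbin_handle);
    }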
- Added a test case to verify CUDA pipeline construction in the driver.
I'm wondering about the "gpucode" mnemonic :-) It's unusual and kind of ambiguous. What does "gpucode" mean here? PTX? Maybe "PTX" would be more explicit, then?
PTX is probably not too specific, since this flag begins with "cuda_", so it's already clear we're in the CUDA/PTX flow.
[this applies to other uses of "gpucode" too]