This is an archive of the discontinued LLVM Phabricator instance.

[doc] Compile CUDA with LLVM
ClosedPublic

Authored by jingyue on Nov 4 2015, 9:45 PM.

Details

Summary

This patch adds documentation on compiling CUDA with LLVM as requested by many
engineers and researchers. It includes not only user guides but also some
internals (mostly optimizations) so that early adopters can start hacking and
contributing.

Quite a few researchers who contacted us haven't used LLVM before, which is
unsurprising as it hasn't been long since LLVM picked up CUDA. So I added a
short summary to help these folks get started with LLVM.

I expect this document to evolve substantially down the road. The user guides
will be much simplified after the Clang integration is done. However, the
internals should continue growing to include for example performance debugging
and key areas to improve.

Diff Detail

Event Timeline

jingyue updated this revision to Diff 39315. Nov 4 2015, 9:45 PM
jingyue retitled this revision from to [doc] Compile CUDA with LLVM.
jingyue updated this object.
jingyue added reviewers: tra, chandlerc, meheff, broune.
jingyue added a subscriber: eliben.
broune accepted this revision. Nov 5 2015, 11:08 AM
broune edited edge metadata.
broune added inline comments.
docs/CompileCudaWithLLVM.rst
14

Could be:

It is aimed at both users who want to compile CUDA with LLVM and developers who want to improve LLVM for GPUs.

70

naively -> natively

82

Could be:

Therefore, for early adopters using CUDA with LLVM now, it is necessary to manually ...

85

"vector" is correct here, though it could suggest a std::vector. "array" wouldn't have that connotation.

Also, maybe "(this operation is sometimes referred to as AXPY)", as just "(AXPY)" would likely seem rather cryptic to someone who doesn't know what AXPY is.
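
For readers who have never seen the term, here is a minimal sketch of what such an AXPY kernel looks like (illustrative only; the signature matches the document's running example, but the body is an assumption):

// y = a * x + y, computed element-wise with one GPU thread per element.
__global__ void axpy(float a, float* x, float* y) {
  y[threadIdx.x] = a * x[threadIdx.x] + y[threadIdx.x];
}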

132

to a separate file (supposingly axpy.cu) -> to a separate file axpy.cu.

142

to PTX (supposingly axpy.ptx) -> to a PTX file axpy.ptx

164

capabitliy -> capability

170

host code (supposingly axpy.cc) -> host code in axpy.cc

224

and superscalar -> and is superscalar

225

these differences -> such differences

(the list is not exhaustive)

229

This suggests that these are the only major ones. "The list below shows some of the more important optimizations for GPUs."

230

I had difficulty understanding this sentence. If I understood it correctly, this could be:

"A few of the optimizations have not been upstreamed due to ..."

239

so that emits fast special loads -> so that the backend can emit faster specialized loads

244

more encouraged -> needs to be more aggressive

247

is yet -> has yet

260

knows -> infers

261

is yet -> has yet

This revision is now accepted and ready to land. Nov 5 2015, 11:09 AM
tra added inline comments. Nov 5 2015, 11:39 AM
docs/CompileCudaWithLLVM.rst
79–81

Three more patches (D13144, D13170, D13171) and one will be able to compile many CUDA files with clang as is. A number of the details in this guide will no longer be applicable in a few weeks.

82

No need to manually launch the kernel; <<<...>>> works with PTX. See below for details.

91

Where does this file come from? I think CUDA samples had one. I don't think it's essential for this document and could be removed or replaced.

146–147

Why not let clang compile the file all the way to PTX?

Splitting would make sense if you wanted to link with libdevice code and run NVVMReflect/internalize on bitcode afterwards, but these days clang can do all of that.

If you want to link with libdevice (with NVVMReflect and internalize), just pass the appropriate bitcode file:

"-mlink-cuda-bitcode" "/usr/local/cuda-7.5/nvvm/libdevice/libdevice.compute_35.10.bc"
197–198

It's not quite true. The CUDA runtime will accept raw PTX if you initialize it the way nvcc does.

208

You can pass device-side PTX to the host's cc1 with "-fcuda-include-gpubinary axpy.ptx",
and clang will embed the PTX into the host object file and generate code to register the kernels, so that they can be launched with <<<...>>> without any additional steps.

jingyue added inline comments. Nov 5 2015, 4:09 PM
docs/CompileCudaWithLLVM.rst
208

Can you clarify how to do this? I tried using -Xclang to set the -fcuda-include-gpubinary flag, but got the following.

$ clang++ -Xclang -fcuda-include-gpubinary -Xclang axpy.ptx axpy.cc -I$CUDA_ROOT/include -I$CUDA_ROOT/samples/common/inc -L$CUDA_ROOT/lib64 -lcudart_static -lcuda -ldl -lrt -pthread
axpy.cc:39:3: error: use of undeclared identifier 'axpy'
  axpy<<<1, kDataLen>>>(a, device_x, device_y);
  ^
axpy.cc:39:9: error: expected expression
  axpy<<<1, kDataLen>>>(a, device_x, device_y);
        ^
axpy.cc:39:23: error: expected expression
  axpy<<<1, kDataLen>>>(a, device_x, device_y);
                      ^
axpy.cc:39:25: warning: expression result unused [-Wunused-value]
  axpy<<<1, kDataLen>>>(a, device_x, device_y);
                        ^
axpy.cc:39:28: warning: expression result unused [-Wunused-value]
  axpy<<<1, kDataLen>>>(a, device_x, device_y);
                           ^~~~~~~~
2 warnings and 3 errors generated.
tra added inline comments. Nov 5 2015, 4:58 PM
docs/CompileCudaWithLLVM.rst
208

The kernel must be present in axpy.cu during host compilation so that the compiler can generate the host-side stub for the kernel launch, so this only works without splitting.
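
A rough sketch of that constraint (the kernel matches the document's example; the launch_axpy wrapper is purely illustrative): the kernel definition and the <<<...>>> launch have to live in the same .cu translation unit.

// axpy.cu -- both pieces in one file, so the host-side compilation sees the
// kernel definition and can emit the registration/launch stub for <<<...>>>.
__global__ void axpy(float a, float* x, float* y) {
  y[threadIdx.x] = a * x[threadIdx.x] + y[threadIdx.x];
}

void launch_axpy(float a, float* device_x, float* device_y, int n) {
  // Works only because the definition above is visible in this file.
  axpy<<<1, n>>>(a, device_x, device_y);
}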

jingyue added inline comments. Nov 5 2015, 10:39 PM
docs/CompileCudaWithLLVM.rst
208

Still have issues with that.

However, I managed to apply your three pending patches, and the patched version works great! So, I think it makes more sense for this document to ask early adopters to apply the patches and try the more functional patched version. Agree?

tra added inline comments. Nov 6 2015, 9:59 AM
docs/CompileCudaWithLLVM.rst
208

Sure. The patches simplify a large portion of this section down to

clang++ -o axpy [...] axpy.cu

I'll need to add details on various CUDA-related options I've added to clang.
Do you want to incorporate them into this patch, or should I do that after you've committed the docs?

I'll let you do that after this patch. You know those options much better than I do.

silvas added a subscriber: silvas. Nov 6 2015, 7:18 PM

My biggest concern is to avoid giving users the false impression that what is described here is an officially supported long-term interface from clang. Would it be accurate to say that this document is meant for "LLVM developers" (or otherwise people working inside LLVM)?

Other than clarifying that, the content LGTM.

docs/CompileCudaWithLLVM.rst
12–17

From reading this document, I think it would be worth it to somehow qualify the audience description to say "early adopters who are willing to use unstable internal interfaces" or something like that.

27–30

Do you feel strongly about replicating this here from the GettingStarted page? I would prefer to avoid the duplication. But if you feel strongly that your audience will benefit from this, then we can leave it in.

146

cc1 is officially an internal interface. Please put a big fat warning here that the cc1 interface is unstable and can be broken at any time.

jingyue updated this revision to Diff 39631. Nov 6 2015, 10:53 PM
jingyue edited edge metadata.

simplify the doc

My biggest concern is to avoid giving users the false impression that what is described here is an officially supported long-term interface from clang. Would it be accurate to say that this document is meant for "LLVM developers" (or otherwise people working inside LLVM)?

Other than clarifying that, the content LGTM.

Hi Sean,

Thanks for your suggestions. You may like the new version better. As discussed with Artem, I think it makes more sense for users to apply a temporary patch (I will keep it up-to-date) to avoid so much hacking.

Makes sense. LGTM.

tra added inline comments. Nov 10 2015, 10:45 AM
docs/CompileCudaWithLLVM.rst
84–86

clang will -include cuda_runtime.h (nvcc does, too), so it's not necessary to include it in the source.

clang's cuda_runtime.h wrapper will include cuda_builtin_vars.h, so including that explicitly here is not necessary either.

helper_cuda.h comes from the CUDA samples. I would suggest adding a note that the CUDA samples need to be installed as well, because it's possible to have CUDA installed without them.
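
If the above holds, the explicit includes in the example can shrink to something like this (a sketch; the helper_cuda.h location is whatever the local CUDA samples install provides):

// No explicit CUDA includes needed: clang pre-includes cuda_runtime.h, and
// its wrapper pulls in cuda_builtin_vars.h (threadIdx, blockIdx, and friends).
#include <helper_cuda.h>  // from the CUDA samples' common/inc directory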

130

"-I<CUDA install path>/include" -- unnecessary. clang would add it.

You also need to add -std=c++11 in order to use nullptr.

I've also found a weird issue with my patch -- without optimizations, the kernel launch fails (silently, in your example). For the time being, compile with -O2. I'll find and fix the problem ASAP.

tra accepted this revision. Nov 10 2015, 10:46 AM
tra edited edge metadata.
jingyue updated this revision to Diff 39849. Nov 10 2015, 1:18 PM
jingyue edited edge metadata.

Simplify the command lines and header file inclusion

tra added inline comments. Nov 10 2015, 1:32 PM
docs/CompileCudaWithLLVM.rst
130

False alarm about the bug. The failure was due to my local changes. The patch mentioned in the doc appears to work fine.

jingyue updated this revision to Diff 39851. Nov 10 2015, 2:36 PM

Replace the link to the raw diff with more instructions.

That link appears to be temporary.

jingyue closed this revision. Nov 10 2015, 2:38 PM