===================================
Compiling CUDA C/C++ with LLVM
===================================

.. contents::
   :local:

Introduction
============

This document contains the user guides and the internals of compiling CUDA
C/C++ with LLVM. It is aimed at both users who want to compile CUDA with LLVM
and developers who want to improve LLVM for GPUs. This document assumes a basic
familiarity with CUDA. Information about CUDA programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

How to Download and Build LLVM
==============================

The support for CUDA is still in progress and temporarily relies on `this patch
<http://reviews.llvm.org/D14452>`_. Below is a quick summary of downloading and
building LLVM with CUDA support. Consult the `Getting Started
<http://llvm.org/docs/GettingStarted.html>`_ page for more details on setting
up LLVM.

#. Checkout LLVM

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm

#. Checkout Clang

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ cd llvm/tools
     $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang

#. Apply the temporary patch for CUDA support.

   If you have installed `Arcanist
   <http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-command-line>`_,
   you can apply this patch using

   .. code-block:: console

     $ arc patch D14452

   Otherwise, download the raw diff and apply it manually using

   .. code-block:: console

     $ wget http://reviews.llvm.org/file/data/a4prwg2ydrlmviyrcajn/PHID-FILE-ny5ghn5vnpxhajn3hhnp/D14452.diff
     $ patch -p0 < D14452.diff

#. Configure and build LLVM and Clang

   .. code-block:: console

     $ cd <where you want to build llvm>
     $ mkdir build
     $ cd build
     $ cmake [options] <path to llvm sources>
     $ make
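
   For example (a sketch only; the build type and target list shown here are
   choices rather than requirements), one possible configuration that builds
   the NVPTX backend used for CUDA is:

   .. code-block:: console

     $ cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" <path to llvm sources>

   Make sure ``NVPTX`` is among the targets built; it provides the GPU backend
   that the CUDA support relies on.
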
How to Compile CUDA C/C++ with LLVM
===================================

We assume you have installed the CUDA driver and runtime, as well as the CUDA
samples (the example below uses ``helper_cuda.h`` from the samples). Consult
the `NVIDIA CUDA Installation Guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
you have not.
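
As a quick sanity check (this assumes the default installation prefix
``/usr/local/cuda``; adjust the paths if your installation lives elsewhere),
verify that the headers the example needs are present:

.. code-block:: console

  $ ls /usr/local/cuda/include/cuda_runtime.h
  $ ls /usr/local/cuda/samples/common/inc/helper_cuda.h
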
Suppose you want to compile and run the following CUDA program (``axpy.cu``)
which multiplies a ``float`` array by a ``float`` scalar (this operation is
sometimes referred to as AXPY).

.. code-block:: c++

  #include <cuda.h>
  #include <cuda_runtime.h>
  #include <helper_cuda.h> // for checkCudaErrors
  #include <cuda_builtin_vars.h>
Not Done ReplyInline Actions"vector" is correct here, though it could suggest a std::vector. "array" wouldn't have that connotation. Also, maybe "(this operation is sometimes referred to as AXPY)", as just "(AXPY)" would likely seem rather cryptic to someone who doesn't know what AXPY is. broune: "vector" is correct here, though it could suggest a std::vector. "array" wouldn't have that… | |||||
traUnsubmitted Not Done ReplyInline Actionsclang will -include cuda_runtime.h (nvcc does, too), so it's not necessary to include it from source. clang's cuda_runtime.h wrapper will include cuda_builtin_vars.h, so including it explicitly here is not necessary as well. helper_cuda.h comes from CUDA samples. I would suggest adding a note that we need CUDA samples installed as well because it's possible to have CUDA installed without them. tra: clang will -include cuda_runtime.h (nvcc does, too), so it's not necessary to include it from… | |||||

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main(int argc, char* argv[]) {
    const int kDataLen = 4;

    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Copy input data to device.
    float* device_x = nullptr;
    float* device_y = nullptr;
    checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
    checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
    checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
                               cudaMemcpyHostToDevice));

    // Launch the kernel.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data to host.
    checkCudaErrors(cudaDeviceSynchronize());
    checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
                               cudaMemcpyDeviceToHost));

    // Print the results.
    for (int i = 0; i < kDataLen; ++i) {
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    cudaDeviceReset();
    return 0;
  }

The command line for compilation is similar to what you would use for C++
(``-std=c++11`` is needed because the program uses ``nullptr``):

.. code-block:: console

  $ clang++ -o axpy -I<CUDA install path>/include -I<CUDA install path>/samples/common/inc -L<CUDA install path>/<lib64 or lib> axpy.cu -std=c++11 -lcudart_static -lcuda -ldl -lrt -pthread

Running the executable prints each element of the scaled vector:

.. code-block:: console

  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

Optimizations
=============

CPUs and GPUs have different design philosophies and architectures. For
example, a typical CPU has branch prediction, out-of-order execution, and is
superscalar, whereas a typical GPU has none of these. Due to such differences,
an optimization pipeline well-tuned for CPUs may not be suitable for GPUs.

LLVM performs several general and CUDA-specific optimizations for GPUs. The
list below shows some of the more important optimizations for GPUs. Most of
them have been upstreamed to ``lib/Transforms/Scalar`` and
``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
customizable target-independent optimization pipeline.

* **Straight-line scalar optimizations**. These optimizations reduce redundancy
  in straight-line code; a sketch of the kind of code they target follows this
  list. Details can be found in the `design document for straight-line scalar
  optimizations <https://goo.gl/4Rb9As>`_.

* **Inferring memory spaces**. `This optimization
  <http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html>`_
  infers the memory space of an address so that the backend can emit faster
  special loads and stores from it. Details can be found in the `design
  document for memory space inference <https://goo.gl/5wH2Ct>`_.

* **Aggressive loop unrolling and function inlining**. Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. They also promote other
  optimizations, such as constant propagation and SROA, which sometimes speed
  up code by over 10x. An empirical inline threshold for GPUs is 1100. This
  configuration has yet to be upstreamed with a target-specific optimization
  pipeline. LLVM also provides `loop unrolling pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))`` for programmers to force unrolling and
  inlining.

* **Aggressive speculative execution**. `This transformation
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is
  mainly for promoting straight-line scalar optimizations, which are most
  effective on code along dominator paths.

* **Memory-space alias analysis**. `This alias analysis
  <http://llvm.org/docs/NVPTXUsage.html>`_ infers that two pointers in
  different special memory spaces do not alias. It has yet to be integrated
  into the new alias analysis infrastructure; the new infrastructure does not
  run target-specific alias analysis.

* **Bypassing 64-bit divides**. `An existing optimization
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
  enabled in the NVPTX backend. 64-bit integer divides are much slower than
  32-bit ones on NVIDIA GPUs due to the lack of a divide unit. Many of the
  64-bit divides in our benchmarks have a divisor and dividend which fit in
  32 bits at run time. This optimization provides a fast path for this common
  case.
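
To make the first three items above concrete, here is a small, hypothetical
kernel (the name and sizes are made up for illustration; it is not part of the
AXPY example) that benefits from memory space inference, forced unrolling, and
the straight-line scalar optimizations:

.. code-block:: c++

  // Hypothetical: a 3-point moving average over a shared-memory tile,
  // launched as avg3<<<1, 64>>>(in, out) with 64-element arrays.
  __global__ void avg3(const float* in, float* out) {
    __shared__ float tile[64 + 2];
    // "tile" provably lives in shared memory, so memory space inference
    // allows the backend to emit ld.shared/st.shared instead of slower
    // generic loads and stores.
    tile[threadIdx.x + 1] = in[threadIdx.x];
    if (threadIdx.x == 0) {
      tile[0] = 0.0f;       // zero-pad the halo elements
      tile[64 + 1] = 0.0f;
    }
    __syncthreads();

    float sum = 0.0f;
    // Forcing full unrolling turns the loop into straight-line code whose
    // redundant address computations the straight-line scalar optimizations
    // can then remove.
    #pragma unroll
    for (int i = 0; i < 3; ++i)
      sum += tile[threadIdx.x + i];
    out[threadIdx.x] = sum / 3.0f;
  }

Zero-padding the halo keeps the loop body branch-free, which is what lets the
whole body become straight-line code once the loop is unrolled.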