Index: docs/CompileCudaWithLLVM.rst =================================================================== --- /dev/null +++ docs/CompileCudaWithLLVM.rst @@ -0,0 +1,188 @@ +=================================== +Compiling CUDA C/C++ with LLVM +=================================== + +.. contents:: + :local: + +Introduction +============ + +This document contains the user guides and the internals of compiling CUDA +C/C++ with LLVM. It is aimed at both users who want to compile CUDA with LLVM +and developers who want to improve LLVM for GPUs. This document assumes a basic +familiarity with CUDA. Information about CUDA programming can be found in the +`CUDA programming guide +`_. + +How to Download and Build LLVM +============================== + +The support for CUDA is still in progress and temporarily relies on `this patch +`_. Below is a quick summary of downloading and +building LLVM with CUDA support. Consult the `Getting Started +`_ page for more details on setting +up LLVM. + +#. Checkout LLVM + + .. code-block:: console + + $ cd where-you-want-llvm-to-live + $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm + +#. Checkout Clang + + .. code-block:: console + + $ cd where-you-want-llvm-to-live + $ cd llvm/tools + $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang + +#. Apply the temporary patch for CUDA support. + + If you have installed `Arcanist + `_, + you can apply this patch using + + .. code-block:: console + + $ arc patch D14452 + + Otherwise, download the raw diff and apply it manually using + + .. code-block:: console + + $ wget http://reviews.llvm.org/file/data/a4prwg2ydrlmviyrcajn/PHID-FILE-ny5ghn5vnpxhajn3hhnp/D14452.diff + $ patch -p0 < D14452.diff + +#. Configure and build LLVM and Clang + + .. code-block:: console + + $ cd + $ mkdir build + $ cd build + $ cmake [options] + $ make + +How to Compile CUDA C/C++ with LLVM +=================================== + +We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA +CUDA installation Guide +`_ if +you have not. + +Suppose you want to compile and run the following CUDA program (``axpy.cu``) +which multiplies a ``float`` array by a ``float`` scalar (AXPY). + +.. code-block:: c++ + + #include // for checkCudaErrors + + #include + + __global__ void axpy(float a, float* x, float* y) { + y[threadIdx.x] = a * x[threadIdx.x]; + } + + int main(int argc, char* argv[]) { + const int kDataLen = 4; + + float a = 2.0f; + float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f}; + float host_y[kDataLen]; + + // Copy input data to device. + float* device_x; + float* device_y; + checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float))); + checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float))); + checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float), + cudaMemcpyHostToDevice)); + + // Launch the kernel. + axpy<<<1, kDataLen>>>(a, device_x, device_y); + + // Copy output data to host. + checkCudaErrors(cudaDeviceSynchronize()); + checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float), + cudaMemcpyDeviceToHost)); + + // Print the results. + for (int i = 0; i < kDataLen; ++i) { + std::cout << "y[" << i << "] = " << host_y[i] << "\n"; + } + + checkCudaErrors(cudaDeviceReset()); + return 0; + } + +The command line for compilation is similar to what you would use for C++. + +.. code-block:: console + + $ clang++ -o axpy -I/samples/common/inc -L/ axpy.cu -lcudart_static -lcuda -ldl -lrt -pthread + $ ./axpy + y[0] = 2 + y[1] = 4 + y[2] = 6 + y[3] = 8 + +Note that ``helper_cuda.h`` comes from the CUDA samples, so you need the +samples installed for this example. ```` is the root +directory where you installed CUDA SDK, typically ``/usr/local/cuda``. + +Optimizations +============= + +CPU and GPU have different design philosophies and architectures. For example, a +typical CPU has branch prediction, out-of-order execution, and is superscalar, +whereas a typical GPU has none of these. Due to such differences, an +optimization pipeline well-tuned for CPUs may be not suitable for GPUs. + +LLVM performs several general and CUDA-specific optimizations for GPUs. The +list below shows some of the more important optimizations for GPUs. Most of +them have been upstreamed to ``lib/Transforms/Scalar`` and +``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a +customizable target-independent optimization pipeline. + +* **Straight-line scalar optimizations**. These optimizations reduce redundancy + in straight-line code. Details can be found in the `design document for + straight-line scalar optimizations `_. + +* **Inferring memory spaces**. `This optimization + `_ + infers the memory space of an address so that the backend can emit faster + special loads and stores from it. Details can be found in the `design + document for memory space inference `_. + +* **Aggressive loop unrooling and function inlining**. Loop unrolling and + function inlining need to be more aggressive for GPUs than for CPUs because + control flow transfer in GPU is more expensive. They also promote other + optimizations such as constant propagation and SROA which sometimes speed up + code by over 10x. An empirical inline threshold for GPUs is 1100. This + configuration has yet to be upstreamed with a target-specific optimization + pipeline. LLVM also provides `loop unrolling pragmas + `_ + and ``__attribute__((always_inline))`` for programmers to force unrolling and + inling. + +* **Aggressive speculative execution**. `This transformation + `_ is + mainly for promoting straight-line scalar optimizations which are most + effective on code along dominator paths. + +* **Memory-space alias analysis**. `This alias analysis + `_ infers that two pointers in different + special memory spaces do not alias. It has yet to be integrated to the new + alias analysis infrastructure; the new infrastructure does not run + target-specific alias analysis. + +* **Bypassing 64-bit divides**. `An existing optimization + `_ + enabled in the NVPTX backend. 64-bit integer divides are much slower than + 32-bit ones on NVIDIA GPUs due to lack of a divide unit. Many of the 64-bit + divides in our benchmarks have a divisor and dividend which fit in 32-bits at + runtime. This optimization provides a fast path for this common case. Index: docs/index.rst =================================================================== --- docs/index.rst +++ docs/index.rst @@ -86,6 +86,7 @@ GetElementPtr Frontend/PerformanceTips MCJITDesignAndImplementation + CompileCudaWithLLVM :doc:`GettingStarted` Discusses how to get up and running quickly with the LLVM infrastructure. @@ -371,6 +372,9 @@ :doc:`FaultMaps` LLVM support for folding control flow into faulting machine instructions. +:doc:`CompileCudaWithLLVM` + LLVM support for CUDA. + Development Process Documentation =================================