Index: docs/CompileCudaWithLLVM.rst
===================================================================
--- /dev/null
+++ docs/CompileCudaWithLLVM.rst
@@ -0,0 +1,269 @@
+===================================
+Compiling CUDA C/C++ with LLVM
+===================================
+
+.. contents::
+   :local:
+
+Introduction
+============
+
+This document describes the usage and internals of compiling CUDA C/C++ with
+LLVM. It is written both for users who want to compile CUDA with LLVM and for
+developers who want to improve LLVM for GPUs. This document assumes a basic
+familiarity with CUDA. Information about CUDA programming can be found in the
+`CUDA programming guide <http://docs.nvidia.com/cuda/cuda-c-programming-guide/>`_.
+
+How to Download and Build LLVM
+==============================
+
+The trunk of LLVM has the best support for CUDA. Below is a quick summary of
+downloading and building LLVM's trunk. Consult the `Getting Started
+<http://llvm.org/docs/GettingStarted.html>`_ page for more details on setting
+up LLVM.
+
+#. Check out LLVM
+
+   * ``cd where-you-want-llvm-to-live``
+   * ``svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm``
+
+#. Check out Clang
+
+   * ``cd where-you-want-llvm-to-live``
+   * ``cd llvm/tools``
+   * ``svn co http://llvm.org/svn/llvm-project/cfe/trunk clang``
+
+#. Configure and build LLVM and Clang (a consolidated example invocation is
+   shown after this list)
+
+   * ``cd where-you-want-to-build-llvm``
+   * ``mkdir build``
+   * ``cd build``
+   * ``cmake [options] <path to llvm sources>``
+
+     Some common options:
+
+     * ``-DCMAKE_INSTALL_PREFIX=directory`` --- Specify for *directory* the
+       full pathname of where you want the LLVM tools and libraries to be
+       installed (default ``/usr/local``).
+
+     * ``-DCMAKE_BUILD_TYPE=type`` --- Valid options for *type* are Debug,
+       Release, RelWithDebInfo, and MinSizeRel. Default is Debug.
+
+     * ``-DLLVM_ENABLE_ASSERTIONS=On`` --- Compile with assertion checks
+       enabled (default is Yes for Debug builds, No for all other build
+       types).
+
+   * Run ``make``
+
+     * The default target (i.e. ``make``) will build all of LLVM.
+
+     * The ``check-all`` target (i.e. ``make check-all``) will run the
+       regression tests to ensure everything is in working order.
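+
+For concreteness, the whole configure-and-build sequence might look like the
+following sketch. The in-tree sibling ``build`` directory, the ``Release``
+build type, and enabling assertions are illustrative choices, not
+requirements.
+
+.. code-block:: console
+
+  $ cd where-you-want-llvm-to-live
+  $ mkdir build
+  $ cd build
+  $ cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=On ../llvm
+  $ make
+  $ make check-all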
+
+How to Compile CUDA C/C++ with LLVM
+===================================
+
+The standard compilation of CUDA C/C++ programs consists of compiling the
+functions that run on the GPU (so-called *kernels*) into a virtual ISA format
+called *PTX*. The CUDA driver compiles the PTX at runtime to the low-level
+machine instruction set called *SASS* that executes natively on the GPU.
+
+Unlike OpenCL, which requires *host code* (code running on the CPU) and
+*device code* (code running on the GPU) to be separated, CUDA mixes host and
+device code in the same translation unit (C++ source file), with a special
+invocation syntax (``<<<...>>>``) that allows host code to invoke device code.
+Compiling such mixed code differs from traditional C++ compilation because two
+different architectures are targeted simultaneously.
+
+While Clang will ultimately be able to compile mixed-mode CUDA programs, the
+current implementation does not support mixed-mode code yet. Therefore, as an
+early adopter, you have to manually extract the device code into a separate
+file, compile it to PTX, and modify the host code to load and launch the
+kernel.
+
+For example, suppose you want to compile the following mixed-mode CUDA program
+that multiplies a ``float`` vector by a ``float`` scalar (AXPY).
+
+.. code-block:: c++
+
+  #include <cuda.h>
+  #include <cuda_runtime.h>
+  #include <helper_cuda.h> // for checkCudaErrors
+
+  #include <iostream>
+
+  __global__ void axpy(float a, float* x, float* y) {
+    y[threadIdx.x] = a * x[threadIdx.x];
+  }
+
+  int main(int argc, char *argv[]) {
+    const int kDataLen = 4;
+
+    float a = 2.0f;
+    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
+    float host_y[kDataLen];
+
+    // Copy input data to device.
+    float *device_x, *device_y;
+    checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
+    checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
+    checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
+                               cudaMemcpyHostToDevice));
+
+    // Launch the kernel.
+    axpy<<<1, kDataLen>>>(a, device_x, device_y);
+
+    // Copy output data to host.
+    checkCudaErrors(cudaDeviceSynchronize());
+    checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
+                               cudaMemcpyDeviceToHost));
+
+    // Print the results.
+    for (int i = 0; i < kDataLen; ++i) {
+      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
+    }
+
+    cudaDeviceReset();
+    return 0;
+  }
+
+To compile it with the current version of LLVM, you need to:
+
+#. Extract the kernel to a separate file (say ``axpy.cu``)
+
+   .. code-block:: c++
+
+     #define __global__ __attribute__((global))
+
+     __global__ void axpy(float a, float* x, float* y) {
+       y[threadIdx.x] = a * x[threadIdx.x];
+     }
+
+#. Compile the device code ``axpy.cu`` to PTX (say ``axpy.ptx``)
+
+   .. code-block:: console
+
+     $ clang -cc1 -triple <target triple> -target-cpu <GPU compute capability> -include <path to cuda_builtin_vars.h> -emit-llvm -fcuda-is-device -O3 axpy.cu -o axpy.ll
+     $ llc -mcpu=<GPU compute capability> axpy.ll -o axpy.ptx
+
+   The first command (``clang -cc1``) compiles the CUDA code to LLVM IR and
+   optimizes it with ``-O3``. The emitted LLVM IR follows a `defined set of
+   conventions <http://llvm.org/docs/NVPTXUsage.html>`_ used to represent GPU
+   programming concepts. The second command (``llc``) translates the optimized
+   LLVM IR to PTX. It also runs several `NVPTX-specific optimizations`_.
+
+   .. _`NVPTX-specific optimizations`: `Optimizations`_
+
+   The meaning of the command line arguments:
+
+   * **target triple**: ``nvptx64-nvidia-cuda`` if your system is 64-bit;
+     ``nvptx-nvidia-cuda`` otherwise.
+
+   * **GPU compute capability**: the compute capability of your GPU. This can
+     be looked up at the `CUDA GPUs <https://developer.nvidia.com/cuda-gpus>`_
+     page. For example, if your GPU is a Tesla K40, the compute capability
+     should be ``sm_35``.
+
+   * **path to cuda_builtin_vars.h**: typically
+     ``<LLVM install path>/lib/clang/<LLVM version>/include``.
+
+#. Modify the host code (say ``axpy.cc``) to load and launch the kernel in the
+   PTX file
+
+   .. code-block:: c++
+
+     ...
+     checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
+                                cudaMemcpyHostToDevice));
+
+     // Load the kernel.
+     CUmodule module;
+     checkCudaErrors(cuModuleLoad(&module, "axpy.ptx"));
+     CUfunction kernel;
+     checkCudaErrors(cuModuleGetFunction(&kernel, module, "_Z4axpyfPfS_"));
+
+     // Launch the kernel.
+     void* args[3] = {(void *)&a, (void *)&device_x, (void *)&device_y};
+     checkCudaErrors(cuLaunchKernel(kernel,
+                                    /*grid*/ 1, 1, 1,
+                                    /*block*/ kDataLen, 1, 1,
+                                    /*sharedMemBytes*/ 0,
+                                    /*stream*/ 0, args,
+                                    /*extra*/ 0));
+
+     // Copy output data to host.
+     ...
+
+   Note that the CUDA runtime API is unable to load kernels from PTX files.
+   Therefore, we resort to the CUDA driver API just for this portion of the
+   code. (A self-contained sketch that uses only the driver API is shown after
+   this list.)
+
+#. Use Clang to compile the host code as regular C++. We assume you have
+   installed the CUDA driver and runtime. Consult the `NVIDIA CUDA installation
+   Guide <http://docs.nvidia.com/cuda/cuda-installation-guide-linux/>`_ if you
+   have not.
+
+   .. code-block:: console
+
+     $ clang++ -I<CUDA install path>/include -I<CUDA install path>/samples/common/inc -L<CUDA install path>/<lib64 or lib> axpy.cc -lcudart_static -lcuda -ldl -lrt -pthread
+
+#. Run the compiled program
+
+   .. code-block:: console
+
+     $ ./a.out
+     y[0] = 2
+     y[1] = 4
+     y[2] = 6
+     y[3] = 8
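+
+The host code in step 3 above is elided for brevity. For reference, below is a
+minimal sketch of a standalone host program that allocates memory, loads
+``axpy.ptx``, and launches the kernel using only the CUDA driver API. It is
+illustrative rather than part of the recipe above: the ``DriverCheck`` helper
+is ad hoc, and the mangled kernel name must match what Clang actually emitted
+for your kernel.
+
+.. code-block:: c++
+
+  #include <cuda.h>
+
+  #include <cstdio>
+  #include <cstdlib>
+
+  // Ad-hoc error checking for driver API calls (a real program may prefer
+  // the helpers in helper_cuda.h).
+  static void DriverCheck(CUresult result, const char* what) {
+    if (result != CUDA_SUCCESS) {
+      std::fprintf(stderr, "%s failed with CUDA driver error %d\n", what,
+                   result);
+      std::exit(1);
+    }
+  }
+
+  int main() {
+    const int kDataLen = 4;
+    float a = 2.0f;
+    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
+    float host_y[kDataLen];
+
+    // Initialize the driver API and create a context on device 0.
+    DriverCheck(cuInit(0), "cuInit");
+    CUdevice device;
+    DriverCheck(cuDeviceGet(&device, 0), "cuDeviceGet");
+    CUcontext context;
+    DriverCheck(cuCtxCreate(&context, 0, device), "cuCtxCreate");
+
+    // Allocate device memory and copy the input vector over.
+    CUdeviceptr device_x, device_y;
+    DriverCheck(cuMemAlloc(&device_x, kDataLen * sizeof(float)), "cuMemAlloc");
+    DriverCheck(cuMemAlloc(&device_y, kDataLen * sizeof(float)), "cuMemAlloc");
+    DriverCheck(cuMemcpyHtoD(device_x, host_x, kDataLen * sizeof(float)),
+                "cuMemcpyHtoD");
+
+    // Load the PTX module and look up the kernel by its mangled name.
+    CUmodule module;
+    DriverCheck(cuModuleLoad(&module, "axpy.ptx"), "cuModuleLoad");
+    CUfunction kernel;
+    DriverCheck(cuModuleGetFunction(&kernel, module, "_Z4axpyfPfS_"),
+                "cuModuleGetFunction");
+
+    // Launch one block of kDataLen threads; each kernel argument is passed
+    // by address through the args array.
+    void* args[] = {&a, &device_x, &device_y};
+    DriverCheck(cuLaunchKernel(kernel,
+                               /*grid*/ 1, 1, 1,
+                               /*block*/ kDataLen, 1, 1,
+                               /*sharedMemBytes*/ 0,
+                               /*stream*/ 0, args,
+                               /*extra*/ 0),
+                "cuLaunchKernel");
+
+    // Wait for the kernel, copy the result back, and print it.
+    DriverCheck(cuCtxSynchronize(), "cuCtxSynchronize");
+    DriverCheck(cuMemcpyDtoH(host_y, device_y, kDataLen * sizeof(float)),
+                "cuMemcpyDtoH");
+    for (int i = 0; i < kDataLen; ++i) {
+      std::printf("y[%d] = %g\n", i, host_y[i]);
+    }
+
+    DriverCheck(cuCtxDestroy(context), "cuCtxDestroy");
+    return 0;
+  }
+
+Because this sketch does not use the CUDA runtime, it only needs to be linked
+against the driver library, e.g. ``clang++ -I<CUDA install path>/include
+axpy_driver.cc -lcuda`` (the file name is hypothetical).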
+
+Optimizations
+=============
+
+CPUs and GPUs have different design philosophies and architectures. For
+example, a typical CPU has branch prediction, out-of-order execution, and
+superscalar execution, whereas a typical GPU has none of these. Due to these
+differences, an optimization pipeline well-tuned for CPUs may not be suitable
+for GPUs.
+
+LLVM performs several general and CUDA-specific optimizations for GPUs. Below
+is a list of the major ones. Most of them have been upstreamed to
+``lib/Transforms/Scalar`` and ``lib/Target/NVPTX``. Some have not been
+upstreamed yet due to the lack of a customizable target-independent
+optimization pipeline.
+
+* **Straight-line scalar optimizations**. These optimizations reduce redundancy
+  in straight-line code. Details can be found in the `design document for
+  straight-line scalar optimizations `_.
+
+* **Inferring memory spaces**. `This optimization `_ infers the memory space of
+  an address so that the backend can emit faster special loads and stores for
+  it. Details can be found in the `design document for memory space
+  inference `_.
+
+* **Aggressive loop unrolling and function inlining**. Loop unrolling and
+  function inlining are more beneficial for GPUs than for CPUs because control
+  flow transfer on a GPU is more expensive. They also promote other
+  optimizations, such as constant propagation and SROA, which sometimes speed
+  up code by over 10x. An empirical inline threshold for GPUs is 1100. This
+  configuration has yet to be upstreamed together with a target-specific
+  optimization pipeline. LLVM also provides `loop unrolling pragmas `_ and
+  ``__attribute__((always_inline))`` for programmers to force unrolling and
+  inlining.
+
+* **Aggressive speculative execution**. `This transformation `_ is mainly for
+  promoting straight-line scalar optimizations, which are most effective on
+  code along dominator paths.
+
+* **Memory-space alias analysis**. `This alias analysis `_ knows that two
+  pointers in different special memory spaces do not alias. It has not yet been
+  integrated into the new alias analysis infrastructure; the new infrastructure
+  does not run target-specific alias analyses.
+
+* **Bypassing 64-bit divides**. `An existing optimization `_ enabled in the
+  NVPTX backend. 64-bit integer divides are much slower than 32-bit ones on
+  NVIDIA GPUs because the hardware lacks a divide unit. Many of the 64-bit
+  divides in our benchmarks have a divisor and dividend which fit in 32 bits at
+  runtime. This optimization provides a fast path for this common case.
Index: docs/index.rst
===================================================================
--- docs/index.rst
+++ docs/index.rst
@@ -86,6 +86,7 @@
    GetElementPtr
    Frontend/PerformanceTips
    MCJITDesignAndImplementation
+   CompileCudaWithLLVM

 :doc:`GettingStarted`
    Discusses how to get up and running quickly with the LLVM infrastructure.
@@ -371,6 +372,9 @@
 :doc:`FaultMaps`
   LLVM support for folding control flow into faulting machine instructions.

+:doc:`CompileCudaWithLLVM`
+  LLVM support for CUDA.
+
 Development Process Documentation
 =================================