Index: docs/CompileCudaWithLLVM.rst
===================================================================
--- /dev/null
+++ docs/CompileCudaWithLLVM.rst
@@ -0,0 +1,269 @@
+===================================
+Compiling CUDA C/C++ with LLVM
+===================================
+
+.. contents::
+   :local:
+
+Introduction
+============
+
+This document describes the usage and internals of compiling CUDA C/C++ with
+LLVM. It is written both for users who want to compile CUDA with LLVM and for
+developers who want to improve LLVM for GPUs. This document assumes a basic
+familiarity with CUDA. Information about CUDA programming can be found in the
+`CUDA programming guide <http://docs.nvidia.com/cuda/cuda-c-programming-guide/>`_.
+
+How to Download and Build LLVM
+==============================
+
+The trunk of LLVM has the best support for CUDA. Below is a quick summary of
+downloading and building LLVM's trunk. Consult the `Getting Started
+<http://llvm.org/docs/GettingStarted.html>`_ page for more details on setting
+up LLVM.
+
+#. Check out LLVM
+
+   * ``cd where-you-want-llvm-to-live``
+   * ``svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm``
+
+#. Check out Clang
+
+   * ``cd where-you-want-llvm-to-live``
+   * ``cd llvm/tools``
+   * ``svn co http://llvm.org/svn/llvm-project/cfe/trunk clang``
+
+#. Configure and build LLVM and Clang (a consolidated example invocation is
+   shown after this list)
+
+   * ``cd where-you-want-to-build-llvm``
+   * ``mkdir build``
+   * ``cd build``
+   * ``cmake [options] <path to llvm sources>``
+
+     Some common options:
+
+     * ``-DCMAKE_INSTALL_PREFIX=directory`` --- Specify for *directory* the
+       full pathname of where you want the LLVM tools and libraries to be
+       installed (default ``/usr/local``).
+
+     * ``-DCMAKE_BUILD_TYPE=type`` --- Valid options for *type* are Debug,
+       Release, RelWithDebInfo, and MinSizeRel. Default is Debug.
+
+     * ``-DLLVM_ENABLE_ASSERTIONS=On`` --- Compile with assertion checks
+       enabled (default is Yes for Debug builds, No for all other build
+       types).
+
+   * Run ``make``
+
+     * The default target (i.e. ``make``) will build all of LLVM.
+
+     * The ``check-all`` target (i.e. ``make check-all``) will run the
+       regression tests to ensure everything is in working order.
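+
+For concreteness, the whole configure-and-build sequence might look like the
+following sketch. The in-tree sibling ``build`` directory, the ``Release``
+build type, and enabling assertions are illustrative choices, not
+requirements.
+
+.. code-block:: console
+
+  $ cd where-you-want-llvm-to-live
+  $ mkdir build
+  $ cd build
+  $ cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=On ../llvm
+  $ make
+  $ make check-all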
+
+How to Compile CUDA C/C++ with LLVM
+===================================
+
+The standard compilation of CUDA C/C++ programs consists of compiling the
+functions that run on the GPU (so-called *kernels*) into a virtual ISA format
+called *PTX*. The CUDA driver compiles the PTX at runtime to the low-level
+machine instruction set called *SASS* that executes natively on the GPU.
+
+Unlike OpenCL, which requires *host code* (code running on the CPU) and
+*device code* (code running on the GPU) to be separated, CUDA mixes host and
+device code in the same translation unit (C++ source file), with a special
+invocation syntax (``<<<...>>>``) that allows host code to invoke device code.
+Compiling such mixed code differs from traditional C++ compilation because two
+different architectures are targeted simultaneously.
+
+While Clang will ultimately be able to compile mixed-mode CUDA programs, the
+current implementation does not support mixed-mode code yet. Therefore, as an
+early adopter, you have to manually extract the device code into a separate
+file, compile it to PTX, and modify the host code to load and launch the
+kernel.
+
+For example, suppose you want to compile the following mixed-mode CUDA program
+that multiplies a ``float`` vector by a ``float`` scalar (AXPY).
+
+.. code-block:: c++
+
+  #include <cuda.h>
+  #include <cuda_runtime.h>
+  #include <helper_cuda.h> // for checkCudaErrors
+
+  #include <iostream>
+
+  __global__ void axpy(float a, float* x, float* y) {
+    y[threadIdx.x] = a * x[threadIdx.x];
+  }
+
+  int main(int argc, char *argv[]) {
+    const int kDataLen = 4;
+
+    float a = 2.0f;
+    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
+    float host_y[kDataLen];
+
+    // Copy input data to device.
+    float *device_x, *device_y;
+    checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
+    checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
+    checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
+                               cudaMemcpyHostToDevice));
+
+    // Launch the kernel.
+    axpy<<<1, kDataLen>>>(a, device_x, device_y);
+
+    // Copy output data to host.
+    checkCudaErrors(cudaDeviceSynchronize());
+    checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
+                               cudaMemcpyDeviceToHost));
+
+    // Print the results.
+    for (int i = 0; i < kDataLen; ++i) {
+      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
+    }
+
+    cudaDeviceReset();
+    return 0;
+  }
+
+To compile it with the current version of LLVM, you need to:
+
+#. Extract the kernel to a separate file (say ``axpy.cu``)
+
+   .. code-block:: c++
+
+     #define __global__ __attribute__((global))
+
+     __global__ void axpy(float a, float* x, float* y) {
+       y[threadIdx.x] = a * x[threadIdx.x];
+     }
+
+#. Compile the device code ``axpy.cu`` to PTX (say ``axpy.ptx``)
+
+   .. code-block:: console
+
+     $ clang -cc1 -triple <target triple> -target-cpu <GPU compute capability> -include <path to cuda_builtin_vars.h> -emit-llvm -fcuda-is-device -O3 axpy.cu -o axpy.ll
+     $ llc -mcpu=<GPU compute capability> axpy.ll -o axpy.ptx
+
+   The first command (``clang -cc1``) compiles the CUDA code to LLVM IR and
+   optimizes it with ``-O3``. The emitted LLVM IR follows a `defined set of
+   conventions <http://llvm.org/docs/NVPTXUsage.html>`_ used to represent GPU
+   programming concepts. The second command (``llc``) translates the optimized
+   LLVM IR to PTX. It also runs several `NVPTX-specific optimizations`_.
+
+   .. _`NVPTX-specific optimizations`: `Optimizations`_
+
+   The meaning of the command line arguments:
+
+   * **target triple**: ``nvptx64-nvidia-cuda`` if your system is 64-bit;
+     ``nvptx-nvidia-cuda`` otherwise.
+
+   * **GPU compute capability**: the compute capability of your GPU. This can
+     be looked up at the `CUDA GPUs <https://developer.nvidia.com/cuda-gpus>`_
+     page. For example, if your GPU is a Tesla K40, the compute capability
+     should be ``sm_35``.
+
+   * **path to cuda_builtin_vars.h**: typically
+     ``<LLVM install path>/lib/clang/<LLVM version>/include``.
+
+#. Modify the host code (say ``axpy.cc``) to load and launch the kernel in the
+   PTX file
+
+   .. code-block:: c++
+
+     ...
+     checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
+                                cudaMemcpyHostToDevice));
+
+     // Load the kernel.
+     CUmodule module;
+     checkCudaErrors(cuModuleLoad(&module, "axpy.ptx"));
+     CUfunction kernel;
+     checkCudaErrors(cuModuleGetFunction(&kernel, module, "_Z4axpyfPfS_"));
+
+     // Launch the kernel.
+     void* args[3] = {(void *)&a, (void *)&device_x, (void *)&device_y};
+     checkCudaErrors(cuLaunchKernel(kernel,
+                                    /*grid*/ 1, 1, 1,
+                                    /*block*/ kDataLen, 1, 1,
+                                    /*sharedMemBytes*/ 0,
+                                    /*stream*/ 0, args,
+                                    /*extra*/ 0));
+
+     // Copy output data to host.
+     ...
+
+   Note that the CUDA runtime API is unable to load kernels from PTX files.
+   Therefore, we resort to the CUDA driver API just for this portion of the
+   code. (A self-contained sketch that uses only the driver API is shown after
+   this list.)
+
+#. Use Clang to compile the host code as regular C++. We assume you have
+   installed the CUDA driver and runtime. Consult the `NVIDIA CUDA installation
+   Guide <http://docs.nvidia.com/cuda/cuda-installation-guide-linux/>`_ if you
+   have not.
+
+   .. code-block:: console
+
+     $ clang++ -I<CUDA install path>/include -I<CUDA install path>/samples/common/inc -L<CUDA install path>/<lib64 or lib> axpy.cc -lcudart_static -lcuda -ldl -lrt -pthread
+
+#. Run the compiled program
+
+   .. code-block:: console
+
+     $ ./a.out
+     y[0] = 2
+     y[1] = 4
+     y[2] = 6
+     y[3] = 8
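+
+The host code in step 3 above is elided for brevity. For reference, below is a
+minimal sketch of a standalone host program that allocates memory, loads
+``axpy.ptx``, and launches the kernel using only the CUDA driver API. It is
+illustrative rather than part of the recipe above: the ``DriverCheck`` helper
+is ad hoc, and the mangled kernel name must match what Clang actually emitted
+for your kernel.
+
+.. code-block:: c++
+
+  #include <cuda.h>
+
+  #include <cstdio>
+  #include <cstdlib>
+
+  // Ad-hoc error checking for driver API calls (a real program may prefer
+  // the helpers in helper_cuda.h).
+  static void DriverCheck(CUresult result, const char* what) {
+    if (result != CUDA_SUCCESS) {
+      std::fprintf(stderr, "%s failed with CUDA driver error %d\n", what,
+                   result);
+      std::exit(1);
+    }
+  }
+
+  int main() {
+    const int kDataLen = 4;
+    float a = 2.0f;
+    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
+    float host_y[kDataLen];
+
+    // Initialize the driver API and create a context on device 0.
+    DriverCheck(cuInit(0), "cuInit");
+    CUdevice device;
+    DriverCheck(cuDeviceGet(&device, 0), "cuDeviceGet");
+    CUcontext context;
+    DriverCheck(cuCtxCreate(&context, 0, device), "cuCtxCreate");
+
+    // Allocate device memory and copy the input vector over.
+    CUdeviceptr device_x, device_y;
+    DriverCheck(cuMemAlloc(&device_x, kDataLen * sizeof(float)), "cuMemAlloc");
+    DriverCheck(cuMemAlloc(&device_y, kDataLen * sizeof(float)), "cuMemAlloc");
+    DriverCheck(cuMemcpyHtoD(device_x, host_x, kDataLen * sizeof(float)),
+                "cuMemcpyHtoD");
+
+    // Load the PTX module and look up the kernel by its mangled name.
+    CUmodule module;
+    DriverCheck(cuModuleLoad(&module, "axpy.ptx"), "cuModuleLoad");
+    CUfunction kernel;
+    DriverCheck(cuModuleGetFunction(&kernel, module, "_Z4axpyfPfS_"),
+                "cuModuleGetFunction");
+
+    // Launch one block of kDataLen threads; each kernel argument is passed
+    // by address through the args array.
+    void* args[] = {&a, &device_x, &device_y};
+    DriverCheck(cuLaunchKernel(kernel,
+                               /*grid*/ 1, 1, 1,
+                               /*block*/ kDataLen, 1, 1,
+                               /*sharedMemBytes*/ 0,
+                               /*stream*/ 0, args,
+                               /*extra*/ 0),
+                "cuLaunchKernel");
+
+    // Wait for the kernel, copy the result back, and print it.
+    DriverCheck(cuCtxSynchronize(), "cuCtxSynchronize");
+    DriverCheck(cuMemcpyDtoH(host_y, device_y, kDataLen * sizeof(float)),
+                "cuMemcpyDtoH");
+    for (int i = 0; i < kDataLen; ++i) {
+      std::printf("y[%d] = %g\n", i, host_y[i]);
+    }
+
+    DriverCheck(cuCtxDestroy(context), "cuCtxDestroy");
+    return 0;
+  }
+
+Because this sketch does not use the CUDA runtime, it only needs to be linked
+against the driver library, e.g. ``clang++ -I<CUDA install path>/include
+axpy_driver.cc -lcuda`` (the file name is hypothetical).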
+
+Optimizations
+=============
+
+CPUs and GPUs have different design philosophies and architectures. For
+example, a typical CPU has branch prediction, out-of-order execution, and
+superscalar execution, whereas a typical GPU has none of these. Due to these
+differences, an optimization pipeline well-tuned for CPUs may not be suitable
+for GPUs.
+
+LLVM performs several general and CUDA-specific optimizations for GPUs. Below
+is a list of the major ones. Most of them have been upstreamed to
+``lib/Transforms/Scalar`` and ``lib/Target/NVPTX``. Some have not been
+upstreamed yet due to the lack of a customizable target-independent
+optimization pipeline.
+
+* **Straight-line scalar optimizations**. These optimizations reduce redundancy
+  in straight-line code. Details can be found in the `design document for
+  straight-line scalar optimizations `_.
+
+* **Inferring memory spaces**. `This optimization `_ infers the memory space of
+  an address so that the backend can emit faster special loads and stores for
+  it. Details can be found in the `design document for memory space
+  inference `_.
+
+* **Aggressive loop unrolling and function inlining**. Loop unrolling and
+  function inlining are more beneficial for GPUs than for CPUs because control
+  flow transfer on a GPU is more expensive. They also promote other
+  optimizations, such as constant propagation and SROA, which sometimes speed
+  up code by over 10x. An empirical inline threshold for GPUs is 1100. This
+  configuration has yet to be upstreamed together with a target-specific
+  optimization pipeline. LLVM also provides `loop unrolling pragmas `_ and
+  ``__attribute__((always_inline))`` for programmers to force unrolling and
+  inlining.
+
+* **Aggressive speculative execution**. `This transformation `_ is mainly for
+  promoting straight-line scalar optimizations, which are most effective on
+  code along dominator paths.
+
+* **Memory-space alias analysis**. `This alias analysis `_ knows that two
+  pointers in different special memory spaces do not alias. It has not yet been
+  integrated into the new alias analysis infrastructure; the new infrastructure
+  does not run target-specific alias analyses.
+
+* **Bypassing 64-bit divides**. `An existing optimization `_ enabled in the
+  NVPTX backend. 64-bit integer divides are much slower than 32-bit ones on
+  NVIDIA GPUs because the hardware lacks a divide unit. Many of the 64-bit
+  divides in our benchmarks have a divisor and dividend which fit in 32 bits at
+  runtime. This optimization provides a fast path for this common case.
Index: docs/index.rst
===================================================================
--- docs/index.rst
+++ docs/index.rst
@@ -86,6 +86,7 @@
    GetElementPtr
    Frontend/PerformanceTips
    MCJITDesignAndImplementation
+   CompileCudaWithLLVM

 :doc:`GettingStarted`
    Discusses how to get up and running quickly with the LLVM infrastructure.
@@ -371,6 +372,9 @@
 :doc:`FaultMaps`
   LLVM support for folding control flow into faulting machine instructions.

+:doc:`CompileCudaWithLLVM`
+  LLVM support for CUDA.
+
 Development Process Documentation
 =================================