
[doc] Compile CUDA with LLVM
ClosedPublic

Authored by jingyue on Nov 4 2015, 9:45 PM.

Details

Summary

This patch adds documentation on compiling CUDA with LLVM as requested by many
engineers and researchers. It includes not only user guides but also some
internals (mostly optimizations) so that early adopters can start hacking and
contributing.

Quite a few researchers who contacted us haven't used LLVM before, which is
unsurprising as it hasn't been long since LLVM picked up CUDA. So I added a
short summary to help these folks get started with LLVM.

I expect this document to evolve substantially down the road. The user guides
will be much simplified once the Clang integration is done. However, the
internals section should continue to grow, covering, for example, performance
debugging and key areas to improve.

Diff Detail

Event Timeline

jingyue updated this revision to Diff 39315.Nov 4 2015, 9:45 PM
jingyue retitled this revision from to [doc] Compile CUDA with LLVM.
jingyue updated this object.
jingyue added reviewers: tra, chandlerc, meheff, broune.
jingyue added a subscriber: eliben.
broune accepted this revision.Nov 5 2015, 11:08 AM
broune edited edge metadata.
broune added inline comments.
docs/CompileCudaWithLLVM.rst
14

Could be:

It is aimed at both users who want to compile CUDA with LLVM and developers who want to improve LLVM for GPUs.

70

naively -> natively

82

Could be:

Therefore, for early adopters using CUDA with LLVM now, it is necessary to manually ...

85

"vector" is correct here, though it could suggest a std::vector. "array" wouldn't have that connotation.

Also, maybe "(this operation is sometimes referred to as AXPY)", as just "(AXPY)" would likely seem rather cryptic to someone who doesn't know what AXPY is.
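
For readers who haven't seen the term, AXPY is the BLAS name for the elementwise update a*x + y. A minimal host-side C++ sketch of the semantics the example kernel computes (the function name and types here are illustrative, not taken from the patch):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for AXPY: out[i] = a * x[i] + y[i].
// The CUDA kernel in the document computes the same thing, with one
// GPU thread per element instead of this loop.
std::vector<float> axpy(float a, const std::vector<float>& x,
                        const std::vector<float>& y) {
  assert(x.size() == y.size());
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    out[i] = a * x[i] + y[i];
  return out;
}
```

With a = 2, x = {1, 2, 3}, y = {4, 5, 6}, this yields {6, 9, 12}.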

132

to a separate file (supposingly axpy.cu) -> to a separate file axpy.cu.

142

to PTX (supposingly axpy.ptx) -> to a PTX file axpy.ptx

164

capabitliy -> capability

170

host code (supposingly axpy.cc) -> host code in axpy.cc

224

and superscalar -> and is superscalar

225

these differences -> such differences

(the list is not exhaustive)

229

This suggests that these are the only major ones. "The list below shows some of the more important optimizations for GPUs."

230

I had difficulty understanding this sentence. If I understood it correctly, this could be:

"A few of the optimizations have not been upstreamed due to ..."

239

so that emits fast special loads -> so that the backend can emit faster specialized loads

244

more encouraged -> needs to be more aggressive

247

is yet -> has yet

260

knows -> infers

261

is yet -> has yet

This revision is now accepted and ready to land.Nov 5 2015, 11:09 AM
tra added inline comments.Nov 5 2015, 11:39 AM
docs/CompileCudaWithLLVM.rst
79–81

With three more patches (D13144, D13170, D13171), one can compile many CUDA files with clang as is. A number of details in this guide will no longer be applicable in a few weeks.

82

No need to manually launch the kernel; <<<>>> works with PTX. See below for details.

91

Where does this file come from? I think CUDA samples had one. I don't think it's essential for this document and could be removed or replaced.

146–147

Why not let clang compile the file all the way to PTX?

Splitting would make sense if you wanted to link with libdevice code and run NVVMReflect/internalize on bitcode afterwards, but these days clang can do all of that.

If you want to link with libdevice (with NVVMReflect and internalize), just pass appropriate bitcode file:

"-mlink-cuda-bitcode" "/usr/local/cuda-7.5/nvvm/libdevice/libdevice.compute_35.10.bc"
197–198

That's not quite true. The CUDA runtime will accept raw PTX if you initialize it the way nvcc does.

208

You can pass device-side PTX to the host's cc1 with "-fcuda-include-gpubinary axpy.ptx",
and clang will embed the PTX into the host object file and generate code to register the kernels, so that they can be launched with <<<...>>> without any additional steps.

jingyue added inline comments.Nov 5 2015, 4:09 PM
docs/CompileCudaWithLLVM.rst
208

Can you clarify how to do this? I tried using -Xclang to set the -fcuda-include-gpubinary flag, but got the following.

$ clang++ -Xclang -fcuda-include-gpubinary -Xclang axpy.ptx axpy.cc -I$CUDA_ROOT/include -I$CUDA_ROOT/samples/common/inc -L$CUDA_ROOT/lib64 -lcudart_static -lcuda -ldl -lrt -pthread
axpy.cc:39:3: error: use of undeclared identifier 'axpy'
  axpy<<<1, kDataLen>>>(a, device_x, device_y);
  ^
axpy.cc:39:9: error: expected expression
  axpy<<<1, kDataLen>>>(a, device_x, device_y);
        ^
axpy.cc:39:23: error: expected expression
  axpy<<<1, kDataLen>>>(a, device_x, device_y);
                      ^
axpy.cc:39:25: warning: expression result unused [-Wunused-value]
  axpy<<<1, kDataLen>>>(a, device_x, device_y);
                        ^
axpy.cc:39:28: warning: expression result unused [-Wunused-value]
  axpy<<<1, kDataLen>>>(a, device_x, device_y);
                           ^~~~~~~~
2 warnings and 3 errors generated.
tra added inline comments.Nov 5 2015, 4:58 PM
docs/CompileCudaWithLLVM.rst
208

The kernel must be present in axpy.cu during host compilation so that the compiler can generate a host-side stub for the kernel launch; it therefore only works without splitting.
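
Putting tra's two comments together, the working shape of the command would keep the kernel definition in the host translation unit, e.g. (a hypothetical sketch modeled on the invocation tried above; -fcuda-include-gpubinary is an internal cc1 flag and may change, and CUDA_ROOT is assumed to point at the local CUDA install):

```shell
# Sketch only: keep the kernel definition in axpy.cu for the host
# compilation so clang can generate the launch stub. cc1 flags are
# internal and unstable; axpy.ptx is the device-side PTX from the
# earlier step.
clang++ -Xclang -fcuda-include-gpubinary -Xclang axpy.ptx \
  axpy.cu -o axpy \
  -I$CUDA_ROOT/include \
  -L$CUDA_ROOT/lib64 -lcudart_static -lcuda -ldl -lrt -pthread
```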

jingyue added inline comments.Nov 5 2015, 10:39 PM
docs/CompileCudaWithLLVM.rst
208

I still have issues with that.

However, I managed to apply your three pending patches, and the patched version works great! So, I think it makes more sense for this document to ask early adopters to apply the patches and try the more functional patched version. Agree?

tra added inline comments.Nov 6 2015, 9:59 AM
docs/CompileCudaWithLLVM.rst
208

Sure. The patches simplify a large portion of this section down to

clang++ -o axpy [...] axpy.cu

I'll need to add details on the various CUDA-related options I've added to clang.
Do you want to incorporate them into this patch, or should I do that after you've committed the docs?

I'll let you do that after this patch. You know those options much better
than I do.

silvas added a subscriber: silvas.Nov 6 2015, 7:18 PM

My biggest concern is to avoid giving users the false impression that what is described here is an officially supported long-term interface from clang. Would it be accurate to say that this document is meant for "LLVM developers" (or otherwise people working inside LLVM)?

Other than clarifying that, the content LGTM.

docs/CompileCudaWithLLVM.rst
12–17

From reading this document, I think it would be worth it to somehow qualify the audience description to say "early adopters who are willing to use unstable internal interfaces" or something like that.

27–30

Do you feel strongly about replicating this here from the GettingStarted page? I would prefer to avoid the duplication. But if you feel strongly that your audience will benefit from this, then we can leave it in.

146

cc1 is officially an internal interface. Please put a big fat warning here that the cc1 interface is unstable and can be broken at any time.

jingyue updated this revision to Diff 39631.Nov 6 2015, 10:53 PM
jingyue edited edge metadata.

simplify the doc

My biggest concern is to avoid giving users the false impression that what is described here is an officially supported long-term interface from clang. Would it be accurate to say that this document is meant for "LLVM developers" (or otherwise people working inside LLVM)?

Other than clarifying that, the content LGTM.

Hi Sean,

Thanks for your suggestions. You may like the new version better. As discussed with Artem, I think it makes more sense for users to apply a temporary patch (I will keep it up-to-date) to avoid so much hacking.

Makes sense. LGTM.

tra added inline comments.Nov 10 2015, 10:45 AM
docs/CompileCudaWithLLVM.rst
84–86

clang will -include cuda_runtime.h (nvcc does, too), so it's not necessary to include it from the source.

clang's cuda_runtime.h wrapper will include cuda_builtin_vars.h, so including it explicitly here is not necessary either.

helper_cuda.h comes from the CUDA samples. I would suggest adding a note that the CUDA samples need to be installed as well, because it's possible to have CUDA installed without them.

130

"-I<CUDA install path>/include" -- unnecessary. clang would add it.

You also need to add -std=c++11 in order to use nullptr.

I've also found a weird issue with my patch: without optimizations, the kernel launch fails (silently, in your example). For the time being, compile with -O2. I'll find and fix the problem ASAP.
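
Collecting these notes, the simplified invocation with the patched clang might look like this (a sketch assembled from flags mentioned in this thread, not an officially documented command line; CUDA_ROOT is assumed to point at the local CUDA install):

```shell
# Sketch only: -std=c++11 because the example uses nullptr; -O2 per
# the note above; the CUDA include path is added by clang itself.
clang++ -o axpy -O2 -std=c++11 axpy.cu \
  -L$CUDA_ROOT/lib64 -lcudart_static -lcuda -ldl -lrt -pthread
```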

tra accepted this revision.Nov 10 2015, 10:46 AM
tra edited edge metadata.
jingyue updated this revision to Diff 39849.Nov 10 2015, 1:18 PM
jingyue edited edge metadata.

Simplify the command lines and header file inclusion

tra added inline comments.Nov 10 2015, 1:32 PM
docs/CompileCudaWithLLVM.rst
130

False alarm about the bug. The failure was due to my local changes. The patch mentioned in the doc appears to work fine.

jingyue updated this revision to Diff 39851.Nov 10 2015, 2:36 PM

Replace the link to the raw diff with more instructions.

That link appears to be temporary.

jingyue closed this revision.Nov 10 2015, 2:38 PM