This is an archive of the discontinued LLVM Phabricator instance.

Differential D130193

[LLVM][Docs] Document the new driver for CUDA compilation
Needs Review · Public

Authored by jhuber6 on Jul 20 2022, 11:53 AM.

Details

Reviewers

jdoerfert
tra

Summary

This patch adds some brief documentation for RDC support in clang, covering
both the current default driver and the new driver. Previously there was no
information on whether this was supported in clang, so this should help.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jhuber6 created this revision. · Jul 20 2022, 11:53 AM

Herald added a project: Restricted Project. · Jul 20 2022, 11:53 AM

Herald added subscribers: mattd, yaxunl.

jhuber6 requested review of this revision. · Jul 20 2022, 11:53 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B176565: Diff 446227. · Jul 20 2022, 12:50 PM

I think, as phrased, the section mixes two somewhat independent things: the new driver, and RDC compilation as one of the features the new driver makes possible. I'd suggest restructuring the text along the lines of:

Using the new driver for CUDA/HIP/... compilation. Describe the general changes and the improvements to the way offload binaries are compiled and embedded, including downsides and caveats.

Using the new driver for RDC compilation. Explain what RDC is (essentially, each TU compiles to .o, same as the host). Describe the default (each TU produces a fully linked executable) and the current limitation (clang can compile to .o, but requires final linking and glue generation to be handled by external build tools). Maybe describe how to do intermediate linking to a full GPU executable when one builds a library, as we've discussed via email.

Describe other use cases, like LTO.
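For readers following the thread, the RDC point above (each TU compiles to its own .o and device symbols are resolved at link time) can be sketched roughly as follows. This is an illustration only, not part of the patch: the file names `foo.cu` and `bar.cu` are borrowed from the patch's example, while the function `scale`, the kernel, the `sm_70` architecture, and the `/usr/local/cuda` path are assumptions made up for the sketch. The build commands mirror the ones in the patch, with `--offload-new-driver` added on the assumption that the experimental driver must be requested explicitly; they only run when `RUN_BUILD=1` is set, since they need a clang build with CUDA support.

```shell
#!/bin/sh
set -eu

# foo.cu (hypothetical): defines a device function that another TU will call.
cat > foo.cu <<'EOF'
__device__ int scale(int x) { return 2 * x; }
EOF

# bar.cu (hypothetical): only declares scale(). Its definition lives in foo.cu,
# so the device-side reference must be resolved when the objects are linked.
cat > bar.cu <<'EOF'
#include <cstdio>

extern __device__ int scale(int x);

__global__ void kernel(int *out) { *out = scale(21); }

int main() {
  int *out;
  cudaMallocManaged(&out, sizeof(int));
  kernel<<<1, 1>>>(out);
  cudaDeviceSynchronize();
  printf("%d\n", *out);
  cudaFree(out);
  return 0;
}
EOF

# Assumed defaults; replace with your GPU architecture and CUDA install path.
ARCH="${ARCH:-sm_70}"
CUDA_PATH="${CUDA_PATH:-/usr/local/cuda}"

if [ "${RUN_BUILD:-0}" = "1" ]; then
  # Step 1: compile each TU to a relocatable object, as on the host.
  clang++ foo.cu bar.cu --offload-new-driver -fgpu-rdc --offload-arch="$ARCH" -c
  # Step 2: link host and device code together (use lib instead of lib64 if
  # that is where your CUDA install keeps its libraries).
  clang++ foo.o bar.o --offload-link -L"$CUDA_PATH/lib64" \
    -lcudart_static -ldl -lrt -pthread
fi
```

Without RDC, the reference to `scale` in `bar.cu` could not be resolved, because each TU's device code would be linked on its own.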

So I guess I should split this between several subsections, or maybe make a new section for it.

Considering the ton of new functionality that your changes made possible, it would be warranted, IMO.

Speaking of that, it may be worth documenting the new-driver changes in clang, where they would be visible to a more relevant audience. After all, the driver is part of clang's machinery and has little to do with LLVM. The current CompileCudaWithLLVM.rst is a bit of a historic artifact that evolved from the times when we only had the NVPTX back-end in LLVM and very little to no support for GPU offloading in clang.

I previously wrote https://clang.llvm.org/docs/OffloadingDesign.html when I first made the version for OpenMP. It's a little outdated and only deals with the internals. It may be beneficial to have some documentation geared more towards users. Any suggestions on where to put that?

My guess is that llvm/docs/CompileCudaWithLLVM.rst should be transformed into proper documentation and placed in clang/docs/CudaAndHIPSupport.rst. That may be a bit more work than just adding the bits relevant to your changes, so I'm OK with keeping them in LLVM for now.

llvm/docs/CompileCudaWithLLVM.rst

 * ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
   compiler may emit calls to faster, approximate versions of transcendental
   functions, instead of using the slower, fully IEEE-compliant versions. For
   example, this flag allows clang to emit the ptx ``sin.approx.f32``
   instruction.

   This is implied by ``-ffast-math``.

+RDC-mode compilation
+--------------------
+
+Relocatable device code allows for CUDA device symbols to be linked between
+multiple CUDA files. clang by default does not support RDC-mode compilation,
+instead relying on `external tools<
+https://cmake.org/cmake/help/latest/prop_tgt/CUDA_SEPARABLE_COMPILATION.html>`
+to perform the necessary steps to link the device. However, experimental driver
+support exists in clang since llvm 15.0.
+
+The experimental driver can be invoked using the ``--offload-new-driver`` flag.
+This enables support for RDC-mode compilation via ``-fgpu-rdc``, device LTO via
+``-foffload-lto``, and static libraries containing device code. Compiling CUDA
+code using the experimental driver should cause RDC-mode compilation to behave
+like compilation on the host. An example is shown below.
+
+.. code-block:: console
+
+  $ clang++ foo.cu bar.cu -fgpu-rdc --offload-arch=<GPU arch> -c
+  $ clang++ foo.o bar.o --offload-link -L<CUDA install path>/<lib64 or lib> \
+    -lcudart_static -ldl -lrt -pthread
+
+Note that this new driver is not ABI compatible with binaries created by nvcc
+and is only supported on Unix machines. Additionally, the experimental driver
+does not emit PTX and will not be able to JIT incompatible device images.
+
 Standard library support
 ========================

 In clang and nvcc, most of the C++ standard library is not supported on the
 device side.

 ``<math.h>`` and ``<cmath>``
 ----------------------------