diff --git a/openmp/docs/SupportAndFAQ.rst b/openmp/docs/SupportAndFAQ.rst --- a/openmp/docs/SupportAndFAQ.rst +++ b/openmp/docs/SupportAndFAQ.rst @@ -52,13 +52,15 @@ Q: How to build an OpenMP GPU offload capable compiler? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To build an *effective* OpenMP offload capable compiler, only one extra CMake -option, `LLVM_ENABLE_RUNTIMES="openmp"`, is needed when building LLVM (Generic +option, ``LLVM_ENABLE_RUNTIMES="openmp"``, is needed when building LLVM (Generic information about building LLVM is available `here -`__.). Make sure all backends that -are targeted by OpenMP to be enabled. By default, Clang will be built with all -backends enabled. When building with `LLVM_ENABLE_RUNTIMES="openmp"` OpenMP -should not be enabled in `LLVM_ENABLE_PROJECTS` because it is enabled by -default. +`__.). Make sure all backends that +are targeted by OpenMP are enabled. That can be done by adjusting the CMake +option ``LLVM_TARGETS_TO_BUILD``. The corresponding targets for offloading to AMD +and Nvidia GPUs are ``"AMDGPU"`` and ``"NVPTX"``, respectively. By default, +Clang will be built with all backends enabled. When building with +``LLVM_ENABLE_RUNTIMES="openmp"`` OpenMP should not be enabled in +``LLVM_ENABLE_PROJECTS`` because it is enabled by default. For Nvidia offload, please see :ref:`build_nvidia_offload_capable_compiler`. For AMDGPU offload, please see :ref:`build_amdgpu_offload_capable_compiler`. @@ -72,14 +74,14 @@ .. _build_nvidia_offload_capable_compiler: -Q: How to build an OpenMP NVidia offload capable compiler? +Q: How to build an OpenMP Nvidia offload capable compiler? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The Cuda SDK is required on the machine that will execute the openmp application. 
If your build machine is not the target machine or automatic detection of the available GPUs failed, you should also set: -- `LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=YY` where `YY` is the numeric compute capacity of your GPU, e.g., 75. +- ``LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=YY`` where ``YY`` is the numeric compute capability of your GPU, e.g., 75. .. _build_amdgpu_offload_capable_compiler: @@ -349,7 +351,7 @@ The architecture can either be specified manually using ``--offload-arch=``. If ``--offload-arch=`` is present no ``-fopenmp-targets=`` flag is present then the targets will be inferred from the architectures. Conversely, if -``--fopenmp-targets=`` is present with no ``--offload-arch`` then the target +``--fopenmp-targets=`` is present with no ``--offload-arch`` then the target architecture will be set to a default value, usually the architecture supported by the system LLVM was built on. @@ -451,3 +453,155 @@ For more information on how this is implemented in LLVM/OpenMP's offloading runtime, refer to the `runtime documentation `_. + +Q: What command line options can I use for OpenMP offloading? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +``-fopenmp-targets`` +"""""""""""""""""""" +Specify which OpenMP offloading targets should be supported. For example, you +may specify ``-fopenmp-targets=amdgcn-amd-amdhsa,nvptx64``. This option is +often optional when :ref:`offload_arch` is provided. + +.. _offload_arch: + +``--offload-arch`` +"""""""""""""""""" +Specify the device architecture for OpenMP offloading. For instance +``--offload-arch=sm_80`` to target an Nvidia Tesla A100, +``--offload-arch=gfx90a`` to target an AMD Instinct MI250X, or +``--offload-arch=sm_80,gfx90a`` to target both. + +``--offload-device-only`` +""""""""""""""""""""""""" +Compile only the code that goes on the device. This option is mainly for +debugging purposes. 
It is primarily used for inspecting the intermediate
+representation (IR) output when compiling for the device. It may also be used
+if device-only runtimes are created.
+
+``--offload-host-only``
+""""""""""""""""""""""""""
+Compile only the code that goes on the host. With this option enabled, the
+``.llvm.offloading`` section with embedded device code will not be included in
+the intermediate representation.
+
+``--offload-host-device``
+""""""""""""""""""""""""""""""""""""""""""""""""""""
+Compile the target regions for both the host and the device. That is the
+default option.
+
+``-Xopenmp-target <arg>``
+"""""""""""""""""""""""""
+Pass an argument ``<arg>`` to the offloading toolchain, for instance
+``-Xopenmp-target -march=sm_80``.
+
+``-Xopenmp-target=<triple> <arg>``
+""""""""""""""""""""""""""""""""""
+Pass an argument ``<arg>`` to the offloading toolchain for the target
+``<triple>``. That is especially useful when an argument must differ for each
+triple. For instance ``-Xopenmp-target=nvptx64 --offload-arch=sm_80
+-Xopenmp-target=amdgcn --offload-arch=gfx90a`` to specify the device
+architecture. Alternatively, :ref:`Xarch_host` and :ref:`Xarch_device` can
+pass an argument to the host and device compilation toolchain.
+
+``-Xoffload-linker<triple> <arg>``
+""""""""""""""""""""""""""""""""""
+Pass an argument ``<arg>`` to the offloading linker for the target specified in
+``<triple>``.
+
+.. _Xarch_device:
+
+``-Xarch_device <arg>``
+"""""""""""""""""""""""
+Pass an argument ``<arg>`` to the device compilation toolchain.
+
+.. _Xarch_host:
+
+``-Xarch_host <arg>``
+"""""""""""""""""""""
+Pass an argument ``<arg>`` to the host compilation toolchain.
+
+``-foffload-lto[=<arg>]``
+"""""""""""""""""""""""""
+Enable device link time optimization (LTO) and select the LTO mode ``<arg>``.
+Select either ``-foffload-lto=thin`` or ``-foffload-lto=full``. Thin LTO takes
+less time while still achieving some performance gains. If no argument is set,
+this option defaults to ``-foffload-lto=full``.
+
+``-fopenmp-offload-mandatory``
+""""""""""""""""""""""""""""""
+| This option is set to avoid generating the host fallback code
+  executed when offloading to the device fails. That is
+  helpful when the target contains code that cannot be compiled for the host, for
+  instance, if it contains unguarded device intrinsics.
+| This option can also be used to reduce compile time.
+| This option should not be used when one wants to verify that the code is being
+  offloaded to the device. Instead, set the environment variable
+  ``OMP_TARGET_OFFLOAD='MANDATORY'`` to confirm that the code is being offloaded to
+  the device.
+
+``-fopenmp-target-debug[=<arg>]``
+"""""""""""""""""""""""""""""""""
+Enable debugging in the device runtime library (RTL). Note that it is necessary
+both to configure the debugging in the device runtime at compile-time with
+``-fopenmp-target-debug=<arg>`` and to enable debugging at runtime with the
+environment variable ``LIBOMPTARGET_DEVICE_RTL_DEBUG=<arg>``. Further, it is
+currently only supported for Nvidia targets as of July 2023. Alternatively, the
+environment variable ``LIBOMPTARGET_DEBUG`` can be set to debug both Nvidia and
+AMD GPU targets. For more information, see the
+`debugging instructions `_.
+The debugging instructions list the supported debugging arguments.
+
+``-fopenmp-target-jit``
+"""""""""""""""""""""""
+| Emit code that is Just-in-Time (JIT) compiled for OpenMP offloading. Embed
+  LLVM-IR for the device code in the object files rather than binary code for the
+  respective target. At runtime, the LLVM-IR is optimized again and compiled for
+  the target device. The optimization level can be set at runtime with
+  ``LIBOMPTARGET_JIT_OPT_LEVEL``, for instance,
+  ``LIBOMPTARGET_JIT_OPT_LEVEL=-O3``. See the
+  `OpenMP JIT details `_
+  for instructions on extracting the embedded device code before or after the
+  JIT and more.
+| We want to emphasize that JIT for OpenMP offloading is good for debugging.
+
+``--offload-new-driver``
+""""""""""""""""""""""""
+In upstream LLVM, OpenMP only uses the new driver. However, this option must be
+enabled for experimental linking with CUDA or HIP files.
+
+``--no-offload-new-driver``
+"""""""""""""""""""""""""""
+Do not use the new driver for offloading compilation.
+
+``--offload-link``
+""""""""""""""""""
+Use the new offloading linker ``clang-linker-wrapper`` to perform the link job.
+``clang-linker-wrapper`` is the default offloading linker for OpenMP. This option
+enables the new offloading linker in toolchains that do not use it automatically.
+It is necessary to enable this option when linking with CUDA or HIP files.
+
+``-nogpulib``
+"""""""""""""
+Do not link the device library for CUDA or HIP device compilation.
+
+``-nogpuinc``
+"""""""""""""
+Do not include the default CUDA or HIP headers, and do not add CUDA or HIP
+include paths.
+
+Q: Why is my build taking a long time?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+When installing OpenMP and other LLVM components, the build time on multicore
+systems can be significantly reduced with parallel build jobs. As suggested in
+*LLVM Techniques, Tips, and Best Practices*, one could consider using ``ninja`` as the
+generator. This can be done by invoking CMake with ``-G Ninja``. Afterward,
+use ``ninja install`` and specify the number of parallel jobs with ``-j``. The build
+time can also be reduced by setting the build type to ``Release`` with the
+``CMAKE_BUILD_TYPE`` option. Recompilation can also be sped up by caching previous
+compilations. Consider enabling ``ccache`` with
+``CMAKE_CXX_COMPILER_LAUNCHER=ccache``.
+
+Q: Did this FAQ not answer your question?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Feel free to post questions or browse old threads at
+`LLVM Discourse `__.
\ No newline at end of file
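Reviewer note, not part of the patch: the options documented above can be tried on a small program. The following is a minimal sketch, assuming a Clang built with offloading support. The names ``vector_add`` and ``count_mismatches``, the file name ``vector_add.c``, and the compile lines in the comment are illustrative assumptions; the architecture values are the ones used as examples in the FAQ text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example routine; it does not come from the patch.
 *
 * Possible compile lines (assumed, adjust to the local toolchain):
 *   clang -fopenmp --offload-arch=sm_80  -c vector_add.c    Nvidia example
 *   clang -fopenmp --offload-arch=gfx90a -c vector_add.c    AMD example
 *
 * A compiler invoked without OpenMP support ignores the pragma, so the
 * loop then simply runs on the host.
 */
void vector_add(const float *a, const float *b, float *c, int n) {
  assert(a != NULL && b != NULL && c != NULL && n >= 0);
#pragma omp target teams distribute parallel for \
    map(to: a[0:n], b[0:n]) map(from: c[0:n])
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i];
}

/* Host-side check: returns how many elements differ from a[i] + b[i]. */
int count_mismatches(const float *a, const float *b, const float *c, int n) {
  int mismatches = 0;
  for (int i = 0; i < n; ++i)
    if (c[i] != a[i] + b[i])
      ++mismatches;
  return mismatches;
}
```

To confirm that the region really runs on the device, the FAQ text suggests running the linked program with ``OMP_TARGET_OFFLOAD='MANDATORY'`` in the environment rather than relying on ``-fopenmp-offload-mandatory``.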