
[OpenMP][deviceRTLs] Build the deviceRTLs with OpenMP instead of target dependent language
ClosedPublic

Authored by tianshilei1992 on Jan 14 2021, 8:15 PM.

Details

Summary

With this patch (plus some already-landed patches), deviceRTLs is treated as a regular OpenMP program containing just declare target regions. In this way, ideally, deviceRTLs can be written directly in OpenMP: no CUDA, no HIP anymore. (AMD is still working on getting this to work; for now, AMDGCN is still compiled the original way.) Some target-specific functions are still required, but they are no longer written in a target-specific language. For example, the CUDA parts have all been refined by replacing CUDA intrinsics and builtins with LLVM/Clang/NVVM intrinsics.
Here is a list of changes in this patch:

  1. For NVPTX, DEVICE is defined empty in order to keep the common parts working with AMDGCN. Later, once AMDGCN is also available, we will completely remove DEVICE and probably some other macros.
  2. Shared variables are implemented with the OpenMP allocator, which is defined in allocator.h (see the sketch after this list). Again, this feature is not yet available on AMDGCN, so two macros are redefined appropriately.
  3. The CUDA header cuda.h is dropped from the source code. In order to deal with code differences across CUDA versions, we build one bitcode library for each supported CUDA version. For each CUDA version, the highest PTX version it supports is used, just as we currently do for CUDA compilation.
  4. Correspondingly, the compiler driver is also updated to support the CUDA version encoded in the name of the bitcode library. The bitcode library for NVPTX is now named libomptarget-nvptx-cuda_[cuda_version]-sm_[sm_number].bc, e.g. libomptarget-nvptx-cuda_80-sm_20.bc.
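
For illustration, a team-shared variable under item 2 looks roughly like the snippet below (a sketch mirroring code quoted later in this thread; MAX_THREADS_PER_TEAM and WARPSIZE are deviceRTLs macros, and omp_pteam_mem_alloc is the predefined OpenMP per-team allocator):

#include <cstdint>
// Team-shared storage placed via the OpenMP allocator instead of CUDA's
// __shared__ qualifier; omp_pteam_mem_alloc requests per-team (shared) memory.
extern uint8_t parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];
#pragma omp allocate(parallelLevel) allocator(omp_pteam_mem_alloc)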

With this change, there are also multiple features to expect in the near future:

  1. CUDA will be completely dropped when compiling OpenMP. By then, we will also build bitcode libraries for every supported SM, multiplied by every supported CUDA version.
  2. Atomic operations used in deviceRTLs can be replaced by omp atomic once the OpenMP 5.1 feature is fully supported. For now, the IR generated for it is totally wrong.
  3. Target-specific parts will be wrapped into declare variant with an isa selector once that works properly (see the sketch after this list). No target-specific macros will be needed anymore.
  4. (Maybe more...)
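
As a rough sketch of how item 3 might look (illustrative only: the function name, return values, and the isa trait value are placeholders, not code from this patch):

// Generic fallback, used when no variant matches.
static unsigned getWarpSize() { return 64; }

// Target-specific variant selected by an isa trait instead of an #ifdef;
// the trait value "sm_70" is a placeholder here.
#pragma omp begin declare variant match(device = {isa("sm_70")})
static unsigned getWarpSize() { return 32; }
#pragma omp end declare variant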

Diff Detail

Unit Tests: Failed

Time       Test
820 ms     x64 debian > Clang.Driver::openmp-offload-gpu.c
  Script: -- : 'RUN: at line 17'; /mnt/disks/ssd0/agent/llvm-project/build/bin/clang -### -no-canonical-prefixes -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_35 -Xopenmp-target -march=sm_60 /mnt/disks/ssd0/agent/llvm-project/clang/test/Driver/openmp-offload-gpu.c 2>&1 | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck -check-prefix=CHK-FOPENMP-TARGET-ARCHS /mnt/disks/ssd0/agent/llvm-project/clang/test/Driver/openmp-offload-gpu.c
11,140 ms  x64 windows > Clang.Driver::openmp-offload-gpu.c
  Script: -- : 'RUN: at line 17'; c:\ws\w16-1\llvm-project\premerge-checks\build\bin\clang.exe -### -no-canonical-prefixes -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_35 -Xopenmp-target -march=sm_60 C:\ws\w16-1\llvm-project\premerge-checks\clang\test\Driver\openmp-offload-gpu.c 2>&1 | c:\ws\w16-1\llvm-project\premerge-checks\build\bin\filecheck.exe -check-prefix=CHK-FOPENMP-TARGET-ARCHS C:\ws\w16-1\llvm-project\premerge-checks\clang\test\Driver\openmp-offload-gpu.c

Event Timeline

tianshilei1992 edited the summary of this revision. (Show Details)Jan 20 2021, 6:49 PM

Yeah... that didn't go well. Unless @jdoerfert can point us at a better way to spell these in OpenMP, let's add 'fix target atomics' to the todo pile and use clang intrinsics for the time being.

Clang intrinsics are fine. OpenMP 5.1 atomics should allow us to do this in OpenMP (if we want to).
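
For reference, the two spellings being compared look roughly like this (a sketch with names borrowed from the warning log below, not code from the patch):

#include <cstdint>

// Today: the clang atomic builtin spelling.
uint32_t fetchAdd(uint32_t *Address, uint32_t Val) {
  return __atomic_fetch_add(Address, Val, __ATOMIC_SEQ_CST);
}

// The eventual OpenMP 5.1 spelling, once the IR generated for it is correct.
uint32_t fetchAddOmp(uint32_t *Address, uint32_t Val) {
  uint32_t Old;
#pragma omp atomic capture seq_cst
  {
    Old = *Address;
    *Address += Val;
  }
  return Old;
}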

Rebased and fixed issues with SHARED

tianshilei1992 edited the summary of this revision. (Show Details)Jan 21 2021, 9:33 AM
tianshilei1992 edited the summary of this revision. (Show Details)

Removed unnecessary forward declarations

Get rid of <cuda.h> by building bc libs for all supported CUDA versions

jdoerfert added inline comments.Jan 21 2021, 3:47 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
24

we need the 32 bit versions as well, I guess?

tianshilei1992 edited the summary of this revision. (Show Details)Jan 21 2021, 5:37 PM

Added changes in the driver

Herald added a project: Restricted Project. · View Herald TranscriptJan 21 2021, 6:16 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Dropped the forward declaration and rewrote CUDA intrinsics with LLVM intrinsics

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
24

We could limit to 64 and see if a feature request for 32 comes in.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
62

Unintended?

tianshilei1992 added inline comments.Jan 22 2021, 6:37 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
24

I agree.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
62

Oh, yeah. I was testing whether __CUDA_ARCH__ can be set by the CUDA FE automatically, but it turns out it can't.

tianshilei1992 edited the summary of this revision. (Show Details)Jan 23 2021, 2:02 PM

rebase and prep for a new patch for CUDA intrinsics

rebased and dropped cuda header

openmp/libomptarget/deviceRTLs/common/src/libcall.cu
319

I think we could safely delete this function: a call to it in device code is always treated as the builtin, so this function will never actually be called. WDYT? @jdoerfert @JonChesterfield

openmp/libomptarget/deviceRTLs/common/src/libcall.cu
319

Yes. There is an existing bug that &omp_is_initial_device doesn't work, but because of that nothing can call this function. We can reinstate it when replacing the builtin.
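
To illustrate the folding being discussed (a host-side sketch assuming an offloading-capable clang; not code from this patch):

#include <cstdio>
#include <omp.h>

int main() {
  int OnInitial = 1;
#pragma omp target map(from : OnInitial)
  // In device code the compiler folds this call to a constant, so the
  // deviceRTLs definition of omp_is_initial_device is never actually called.
  OnInitial = omp_is_initial_device();
  std::printf("inside target region: %d\n", OnInitial);
}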

Added the missing critical option -fopenmp-cuda-mode

tianshilei1992 added inline comments.Jan 25 2021, 1:48 PM
clang/lib/Driver/ToolChains/Cuda.cpp
785

This change also requires changes in the driver tests.

Fixed failing driver test

Couple of minor suggestions inline, but overall this looks pretty good. Hopefully device-only compilation works already, the rest could be left for after this patch.

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
16

From @jdoerfert,

-Xclang -fopenmp-is-device  -Xclang -emit-llvm-bc

should do device-only
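
Combined with the -fopenmp/-aux-triple flags discussed below, a full device-only invocation then looks roughly like this (a sketch with a placeholder source path and arch; the exact command the build uses appears in a log near the end of this thread):

clang -S -x c++ -target nvptx64 -fopenmp -fopenmp-cuda-mode \
  -Xclang -fopenmp-is-device -Xclang -emit-llvm-bc \
  -Xclang -aux-triple -Xclang x86_64-unknown-linux-gnu \
  -Xclang -target-cpu -Xclang sm_70 \
  src/loop.cu -o loop.bc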

102

Variant sounds good. Should also be able to use #ifdef __CUDA_ARCH__, as amdgpu shouldn't be setting that

115–121

s/correlation/correspondence, or maybe mapping

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
25

Suggest we drop the DEVICE annotation and change ALIGN to alignas() or similar, but in a later patch. This one is already quite noisy.

tianshilei1992 added inline comments.Jan 25 2021, 3:31 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
16

If I drop -fopenmp -Xclang -aux-triple -Xclang ${aux_triple}, the following warnings are emitted:

/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:129:10: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  return __atomic_fetch_add(Address, Val, __ATOMIC_SEQ_CST);
         ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:136:10: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  return __atomic_fetch_max(Address, Val, __ATOMIC_SEQ_CST);
         ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:141:3: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  __atomic_exchange(Address, &Val, &R, __ATOMIC_SEQ_CST);
  ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:147:9: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  (void)__atomic_compare_exchange(Address, &Compare, &Val, false,
        ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:155:3: warning: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  __atomic_exchange(Address, &Val, &R, __ATOMIC_SEQ_CST);
  ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:161:10: warning: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  return __atomic_fetch_add(Address, Val, __ATOMIC_SEQ_CST);
         ^
6 warnings generated.
tianshilei1992 added inline comments.Jan 25 2021, 3:33 PM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
25

Yes, let's just keep the changes in this patch minimal. We can optimize everything afterwards.

the access size (8 bytes) exceeds the max lock-free size (0 bytes) [-Watomic-alignment]

That's not a useful warning. Every size is greater than 0. I guess nvptx hasn't set a value somewhere in clang for the max lock-free size.

Fortunately we only compile with clang, so let's just pass -Wno-atomic-alignment.
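
In the nvptx CMakeLists that could look like the following (a sketch; the bc_flags variable name is an assumption about the build file, not verified here):

# Only clang builds this bitcode library, so suppressing the diagnostic is safe.
list(APPEND bc_flags -Wno-atomic-alignment)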

Final refinement before moving to review

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
98

This should be firing now that cuda.h is removed

openmp/libomptarget/deviceRTLs/common/omptarget.h
301

This will break on amdgpu, at least until the cmake for amdgpu changes over to openmp

If we spell it like:

#if _OPENMP
extern DEVICE uint8_t parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];
#pragma omp allocate(parallelLevel) allocator(omp_pteam_mem_alloc)
#else
extern DEVICE
    uint8_t EXTERN_SHARED(parallelLevel)[MAX_THREADS_PER_TEAM / WARPSIZE];
#endif

then amdgpu will continue working. IIRC this is the only shared array variable; the rest will be fine.

Fixed comments

openmp/libomptarget/deviceRTLs/common/allocator.h
17

If we guard this with #ifdef _OPENMP, and we change amdgcn/src/target_impl.h from

#define SHARED __attribute__((shared))
to
#define SHARED(NAME) __attribute__((shared)) NAME
#define EXTERN_SHARED(NAME) __attribute__((shared)) NAME

then there's a credible chance the downstream amdgpu hip build will continue working

  • Fixed CMake error on CMake 3.16 and lower, as ZIP_LISTS doesn't work there;
  • Fixed (hopefully) the compilation break on AMDGCN by guarding allocator.h with a macro.
tianshilei1992 retitled this revision from [OpenMP][WIP] Build the deviceRTLs with OpenMP instead of target dependent language - NOT FOR REVIEW to [OpenMP][deviceRTLs] Build the deviceRTLs with OpenMP instead of target dependent language.Jan 25 2021, 6:55 PM
tianshilei1992 edited the summary of this revision. (Show Details)
tianshilei1992 marked 19 inline comments as done.Jan 25 2021, 6:57 PM
JonChesterfield accepted this revision.Jan 25 2021, 7:05 PM

LGTM! Thank you for iterating on this, and for trying to keep amdgpu working.

With this patch, the devicertl no longer needs the cuda sdk installed to compile. I believe there are still some tests in the cmake that need to go before builds will succeed on a machine without cuda installed.

This revision is now accepted and ready to land.Jan 25 2021, 7:05 PM
This revision was landed with ongoing or failed builds.Jan 26 2021, 9:28 AM
This revision was automatically updated to reflect the committed changes.

For me this patch breaks building llvm. Before this patch, I successfully built llvm using an llvm/10 installation on the system. What is probably special about our llvm installation is that we use libc++ by default rather than libstdc++.

FAILED: projects/openmp/libomptarget/deviceRTLs/nvptx/loop.cu-cuda_110-sm_60.bc 
cd BUILD/projects/openmp/libomptarget/deviceRTLs/nvptx && /home/pj416018/sw/UTIL/ccache/bin/clang -S -x c++ -target nvptx64 -Xclang -emit-llvm-bc -Xclang -aux-triple -Xclang x86_64-unknown-linux-gnu -fopenmp -fopenmp-cuda-mode -Xclang -fopenmp-is-device -D__CUDACC__ -I${llvm-SOURCE}/openmp/libomptarget/deviceRTLs -I${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src -DOMPTARGET_NVPTX_DEBUG=0 -Xclang -target-cpu -Xclang sm_60 -D__CUDA_ARCH__=600 -Xclang -target-feature -Xclang +ptx70 -DCUDA_VERSION=11000 ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/src/loop.cu -o loop.cu-cuda_110-sm_60.bc
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/src/loop.cu:16:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/omptarget.h:18:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/debug.h:31:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/device_environment.h:16:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h:18:
In file included from ${llvm-INSTALL}/10.0.0/bin/../include/c++/v1/stdlib.h:100:
In file included from ${llvm-INSTALL}/10.0.0/bin/../include/c++/v1/math.h:312:
${llvm-INSTALL}/10.0.0/bin/../include/c++/v1/limits:406:89: error: host requires 128 bit size 'std::__1::__libcpp_numeric_limits<long double, true>::type' (aka 'long double') type support, but device 'nvptx64' does not support it
    _LIBCPP_INLINE_VISIBILITY static _LIBCPP_CONSTEXPR type lowest() _NOEXCEPT {return -max();}
                                                                                        ^~~~~

My cmake call looks like:

cmake -GNinja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=$INSTALLDIR \
  -DLLVM_ENABLE_LIBCXX=ON \
  -DCLANG_DEFAULT_CXX_STDLIB=libc++ \
  -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_70 \
  -DLIBOMPTARGET_ENABLE_DEBUG=on \
  -DLIBOMPTARGET_NVPTX_ENABLE_BCLIB=true \
  -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60,70 \
  -DLLVM_ENABLE_PROJECTS="clang;compiler-rt;libcxxabi;libcxx;libunwind;openmp" \
  -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
  $LLVM_SOURCE

I also tried to build using my newest installed llvm build (7dd198852b4db52ae22242dfeda4eccda83aa8b2):

In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:14:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h:18:
${llvm-INSTALL}/bin/../include/c++/v1/stdlib.h:128:10: error: '__builtin_fabsl' requires 128 bit size 'long double' type support, but device 'nvptx64' does not support it
  return __builtin_fabsl(__lcpp_x);
         ^
jdoerfert added a comment.EditedFeb 3 2021, 8:32 AM

This is tracked in PR48933; could you give D95928 and its two dependencies a try?

JonChesterfield added a comment.EditedFeb 3 2021, 8:34 AM

I think there's a bug report about this. Sycl (iirc) introduced a change that caught invalid things with types that were previously ignored. @jdoerfert is on point I think.

-DLLVM_ENABLE_PROJECTS="clang;compiler-rt;libcxxabi;libcxx;libunwind;openmp"
^ That works, if the compiler in question can build things like an nvptx devicertl, which essentially means if it's a (sometimes very) recent clang. A generally easier path is
-DLLVM_ENABLE_RUNTIMES="openmp"
as that will build clang first, then use that clang to build openmp.
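
For example (a sketch adapted from the cmake call quoted above, with the other options elided):

cmake -GNinja -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DLLVM_ENABLE_RUNTIMES="openmp" \
  $LLVM_SOURCE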

Won't help in this particular instance - if I understand correctly, it's a misfire from using glibc headers on the nvptx subsystem - though that stdlib.h looks like it shipped as part of libc++.

edit: I am too slow...

I tried @jdoerfert 's patches, but they are not even necessary to address my build issue. Just delaying the build of OpenMP by setting -DLLVM_ENABLE_RUNTIMES="openmp" helped.
From my perspective, we should error out if people try to build OpenMP as a project rather than a runtime, and print an error message about what to change.

In any case, a stand-alone build of OpenMP still fails with any of the older clang compilers. Should we disable building libomptarget if an old clang is used? A CMake diagnostic could suggest using an in-tree build for libomptarget.

Sorry, I meant to disable building of the cuda device-runtime, or whatever is breaking.

There are a couple of things:

  1. deviceRTLs is disabled by default on CUDA-free systems. LIBOMPTARGET_BUILD_NVPTX_BCLIB is the option used now; we have updated many CMake options.
  2. Older versions of Clang might not be able to compile the current deviceRTLs: we used to have issues with C++ OpenMP offloading programs, and the current deviceRTLs is exactly such a program.
  3. We might want to raise an error when a user tries to build deviceRTLs with an older compiler (see the sketch below).
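
A sketch of such a guard (hypothetical; the minimum version shown is a placeholder, not a decided policy):

if(CMAKE_C_COMPILER_ID STREQUAL "Clang" AND
   CMAKE_C_COMPILER_VERSION VERSION_LESS 11.0.0)
  message(FATAL_ERROR
    "Building the deviceRTLs requires a recent clang; found ${CMAKE_C_COMPILER_VERSION}")
endif()
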
jdenny added a subscriber: jdenny.Feb 6 2021, 11:26 AM