This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][deviceRTLs] Build the deviceRTLs with OpenMP instead of target dependent language
ClosedPublic

Authored by tianshilei1992 on Jan 14 2021, 8:15 PM.

Details

Summary

From this patch (plus some landed patches), deviceRTLs is taken as a regular OpenMP program with just declare target regions. In this way, ideally, deviceRTLs can be written in OpenMP directly. No CUDA, no HIP anymore. (Well, AMD is still working on getting it work. For now AMDGCN still uses original way to compile) However, some target specific functions are still required, but they're no longer written in target specific language. For example, CUDA parts have all refined by replacing CUDA intrinsic and builtins with LLVM/Clang/NVVM intrinsics.
Here're a list of changes in this patch.

  1. For NVPTX, DEVICE is defined empty in order to make the common parts still work with AMDGCN. Later once AMDGCN is also available, we will completely remove DEVICE or probably some other macros.
  2. Shared variable is implemented with OpenMP allocator, which is defined in allocator.h. Again, this feature is not available on AMDGCN, so two macros are redefined properly.
  3. CUDA header cuda.h is dropped in the source code. In order to deal with code difference in various CUDA versions, we build one bitcode library for each supported CUDA version. For each CUDA version, the highest PTX version it supports will be used, just as what we currently use for CUDA compilation.
  4. Correspondingly, compiler driver is also updated to support CUDA version encoded in the name of bitcode library. Now the bitcode library for NVPTX is named as libomptarget-nvptx-cuda_[cuda_version]-sm_[sm_number].bc, such as libomptarget-nvptx-cuda_80-sm_20.bc.

With this change, there are also multiple features to be expected in the near future:

  1. CUDA will be completely dropped when compiling OpenMP. By the time, we also build bitcode libraries for all supported SM, multiplied by all supported CUDA version.
  2. Atomic operations used in deviceRTLs can be replaced by omp atomic if OpenMP 5.1 feature is fully supported. For now, the IR generated is totally wrong.
  3. Target specific parts will be wrapped into declare variant with isa selector if it can work properly. No target specific macro is needed anymore.
  4. (Maybe more...)

Diff Detail

Event Timeline

tianshilei1992 created this revision.Jan 14 2021, 8:15 PM
tianshilei1992 requested review of this revision.Jan 14 2021, 8:15 PM
Herald added a project: Restricted Project. · View Herald Transcript
JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
74

We can (and should) call clang's ffs/popcount instead of these

Rebased and fixed some issues

Continue to add some forward declarations

tianshilei1992 marked an inline comment as done.Jan 17 2021, 11:52 AM
tianshilei1992 added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
74

We could also directly include __clang_openmp_device_functions.h if they're already covered by the header.

tianshilei1992 edited the summary of this revision. (Show Details)Jan 17 2021, 11:56 AM
tianshilei1992 marked an inline comment as done.

This patch can pass compilation now

tianshilei1992 edited the summary of this revision. (Show Details)Jan 19 2021, 1:30 PM
tianshilei1992 edited the summary of this revision. (Show Details)
openmp/libomptarget/deviceRTLs/common/debug.h
132 ↗(On Diff #317675)

Not sure fine grained pragma omp declare target is worth the noise, we can put a declare at the top of the file (before the includes) and an end declare at the end with the same semantics.

openmp/libomptarget/deviceRTLs/common/omptargeti.h
16 ↗(On Diff #317675)

if we put them at the start/end of the source, don't need this noise in any of the headers

openmp/libomptarget/deviceRTLs/common/target_atomic.h
22

Not keen. This is common/, shouldn't be calling functions from cuda.

How about we move target_atomic.h under nvptx, and implement __kmpc_atomic_add etc there?

The amdgpu target_atomic.h would be simpler if it moved under amdgpu, as presently it implements atomicAdd in terms of another function, and could elide that layer entirely if __kmpc_atomic_add called the intrinsic.

We could also implement these in terms of clang intrinsics (i.e., identically as done by amdgpu now), and the result would then work for both platforms.

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
100

This is suspect - why does openmp want to claim to be cuda?

104

i guess this survives until the last use of cuda.h is dropped

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
21 ↗(On Diff #317675)

shouldn't these be in the cuda header above, and also in the clang-injected cuda headers?

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
27

example implementation using omp allocators at D93135

79

let's just use x < y ? x : y, as it'll codegen to the same thing or better anyway

tianshilei1992 marked 4 inline comments as done.Jan 19 2021, 2:14 PM
tianshilei1992 added inline comments.
openmp/libomptarget/deviceRTLs/common/debug.h
132 ↗(On Diff #317675)

clang has a bug before that if there is global variable, it crashed. So at the very beginning, it crashed on extern FILE *stdin. Now it has been fixed so I guess we could do that.

openmp/libomptarget/deviceRTLs/common/target_atomic.h
22

I'm okay with your proposal. I'd do it in another patch after this one can work because I want minimal changes in this patch to make everything work, and then start to optimize them. BTW, the template functions doesn't call CUDA here. They're just declarations, and will be lowered to variant of functions based on their type like fatomicAdd, which are defined in eventually in different implementations.

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
100

To maintain minimal change. There is an include wrapped into a macro in interface.h. For AMD GPU, it includes one header in AMD implementation, and for CUDA device, it includes a header in NVPTX implementation.

104

yep

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
21 ↗(On Diff #317675)

All functions that can be called by CUDA are declared as __device__. In declare target, we cannot call those functions. Instead, we need them to be in a format of OpenMP, so those in cuda.h cannot be used. If not those CUDA version macros, we can drop the header.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
27

yep, will do it later.

tianshilei1992 marked 4 inline comments as done.Jan 19 2021, 2:14 PM
openmp/libomptarget/deviceRTLs/common/target_atomic.h
22

For what it's worth, atomicInc is not a template for amdgcn (we only ever use it on a uint32_t, and a generic implementation in terms of CAS would be very slow). Implementation looks like:

template <typename T> DEVICE T atomicCAS(T *address, T compare, T val) {
  (void)__atomic_compare_exchange(address, &compare, &val, false,
                                  __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
  return compare;
}

INLINE uint32_t atomicInc(uint32_t *address, uint32_t max) {
  return __builtin_amdgcn_atomic_inc32(address, max, __ATOMIC_SEQ_CST, "");
}

so declaring a template with the same name here will break amdgpu. Which is not a disaster, as it doesn't build on trunk anyway, but will cause me some minor headaches when this flows into rocm.

If I find some time I'll write a patch moving the atomic functions around, should be orthogonal to this change.

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
100

Ah, that's probably my fault. May as well leave it for now.

I think we should expose a macro for openmp that indicates whether we're doing offloading to nvptx, or offloading to amdgpu, or just compiling for the host. Or, I think equivalently, replace some #if with variant.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
21 ↗(On Diff #317675)

I think the right answer to the cuda version macros is to compile this file in the deviceRTL twice, once for < 9000 and once for >9000. It seems reasonable to have a different implementation for the cuda API change. Clang knows what version it is compiling applications for so could pick the matching deviceRTL.bc.

That would let us totally decouple from cuda with some slightly ugly stuff like
return __nvvm_shfl_down_i32(Var, Delta, ((WARPSIZE - Width) << 8) | 0x1f);
as typeset in https://reviews.llvm.org/D94731?vs=316809&id=316820#toc

jdoerfert added inline comments.Jan 19 2021, 4:20 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
100

Please don't use defines if we have begin/end declare variant for it.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
21 ↗(On Diff #317675)

It's been pointed out to me that we already include ~4k of source at the top of source files that are compiled as openmp, even if they #include no header files. Mostly bits of libm. I'm not pleased to discover that, but it does mean that adding an implementation of __kmpc_impl_activemask etc to a new header won't change the status quo. Let's do that.

Updated positions of #pragma omp declare target to make fewer changes

Let's split this. Declare target in all the source files can go first, doesn't hurt anyone.
Forward declarations should also work fine with existing compilation, so that would be two.
Conditional defines and a SHARED(...) macro would be three.
Four is the cmake magic. Potentially a different folder to build it as C++ + OpenMP alternatively.

Let's split this. Declare target in all the source files can go first, doesn't hurt anyone.

D95048

tianshilei1992 retitled this revision from [OpenMP][WIP] Build the deviceRTLs with OpenMP instead of target dependent language to [OpenMP][WIP] Build the deviceRTLs with OpenMP instead of target dependent language - NOT FOR REVIEW.Jan 20 2021, 9:38 AM
tianshilei1992 edited the summary of this revision. (Show Details)
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
74

D95060 patches ffs/popc/min

Rebased and rewrote atomics with OpenMP.

tianshilei1992 added inline comments.Jan 20 2021, 2:12 PM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
144–151

@jdoerfert @JonChesterfield You might want to review this part. If that works, we could take them back to common parts afterwards.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
154–161

I can't remember what the semantics of atomic_inc are but I do remember them being surprising.

In general I prefer the clang intrinsics, but if this generates the same IR then so be it.

What IR does it emit? Will be easier to tell if we land D95093 first as we can llvm-dis target_impl.bc, instead of scraping together examples of the calls from various places.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
154–161

I can't remember what the semantics of atomic_inc are but I do remember them being surprising.

From docs, writes ((old >= val) ? 0 : (old+1)) to memory and returns old. I would guess that needs to be spelled
*Address = (Old >= Val) ? 0 : (Old+1). We'd also need the *Address read and the *Address store to occur atomically for this to be correct.

Fixed computation in __kmpc_atomic_inc

JonChesterfield added a comment.EditedJan 20 2021, 2:51 PM

Tried ~/llvm-build/llvm/bin/clang++ -O2 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_50 atomic_inc.cpp -c -emit-llvm -S --cuda-device-only -o -

#include <stdint.h>

#pragma omp declare target

template <typename T> T atomicInc(T *, T);

uint32_t __kmpc_atomic_inc_omp(uint32_t *Address, uint32_t Val) {
  uint32_t Old;
#pragma omp atomic capture
  {
    Old = *Address;
    *Address += Old >= Val ? 0 : 1;
  }
  return Old;
}

uint32_t __kmpc_atomic_inc_omp2(uint32_t *Address, uint32_t Val) {
  uint32_t Old;
#pragma omp atomic capture
  {
    Old = *Address;
    *Address = ((Old >= Val) ? 0 : (Old+1));
  }
  return Old;
}


#pragma omp end declare target

Got

target triple = "nvptx64-nvidia-cuda"

; Function Attrs: nofree norecurse nounwind
define hidden i32 @_Z21__kmpc_atomic_inc_ompPjj(i32* nocapture readonly %Address, i32 %Val) local_unnamed_addr #0 {
entry:
  %0 = load atomic i32, i32* %Address monotonic, align 4
  ret i32 %0
}

; Function Attrs: nofree norecurse nounwind
define hidden i32 @_Z22__kmpc_atomic_inc_omp2Pjj(i32* nocapture %Address, i32 %Val) local_unnamed_addr #0 {
entry:
  %0 = atomicrmw xchg i32* %Address, i32 0 monotonic
  ret i32 %0
}

Neither looks right to me. Exchange with zero isn't an increment, and neither is a load.

From the IR generated, seems like OpenMP cannot handle these operations except atomicAdd.

; Function Attrs: noinline nounwind optnone mustprogress
define linkonce_odr hidden i32 @_Z17__kmpc_atomic_addIiET_PS0_S0_(i32* %Address, i32 %Val) #3 comdat {
entry:
  %Address.addr = alloca i32*, align 8
  %Val.addr = alloca i32, align 4
  %Old = alloca i32, align 4
  store i32* %Address, i32** %Address.addr, align 8
  store i32 %Val, i32* %Val.addr, align 4
  %0 = load i32*, i32** %Address.addr, align 8
  %1 = load i32, i32* %Val.addr, align 4
  %2 = atomicrmw add i32* %0, i32 %1 monotonic
  store i32 %2, i32* %Old, align 4
  %3 = load i32, i32* %Old, align 4
  ret i32 %3
}

; Function Attrs: noinline nounwind optnone mustprogress
define linkonce_odr hidden i32 @_Z17__kmpc_atomic_incIiET_PS0_S0_(i32* %Address, i32 %Val) #3 comdat {
entry:
  %Address.addr = alloca i32*, align 8
  %Val.addr = alloca i32, align 4
  %Old = alloca i32, align 4
  store i32* %Address, i32** %Address.addr, align 8
  store i32 %Val, i32* %Val.addr, align 4
  %0 = load i32*, i32** %Address.addr, align 8
  %1 = load i32, i32* %Old, align 4
  %2 = load i32, i32* %Val.addr, align 4
  %cmp = icmp sge i32 %1, %2
  br i1 %cmp, label %cond.true, label %cond.false

cond.true:                                        ; preds = %entry
  br label %cond.end

cond.false:                                       ; preds = %entry
  %3 = load i32, i32* %Old, align 4
  %add = add nsw i32 %3, 1
  br label %cond.end

cond.end:                                         ; preds = %cond.false, %cond.true
  %cond = phi i32 [ 0, %cond.true ], [ %add, %cond.false ]
  %4 = atomicrmw xchg i32* %0, i32 %cond monotonic
  store i32 %4, i32* %Old, align 4
  %5 = load i32, i32* %Old, align 4
  ret i32 %5
}

; Function Attrs: noinline nounwind optnone mustprogress
define linkonce_odr hidden i32 @_Z17__kmpc_atomic_maxIiET_PS0_S0_(i32* %Address, i32 %Val) #3 comdat {
entry:
  %Address.addr = alloca i32*, align 8
  %Val.addr = alloca i32, align 4
  %Old = alloca i32, align 4
  store i32* %Address, i32** %Address.addr, align 8
  store i32 %Val, i32* %Val.addr, align 4
  %0 = load i32*, i32** %Address.addr, align 8
  %1 = load i32, i32* %Old, align 4
  %2 = load i32, i32* %Val.addr, align 4
  %cmp = icmp sgt i32 %1, %2
  br i1 %cmp, label %cond.true, label %cond.false

cond.true:                                        ; preds = %entry
  %3 = load i32, i32* %Old, align 4
  br label %cond.end

cond.false:                                       ; preds = %entry
  %4 = load i32, i32* %Val.addr, align 4
  br label %cond.end

cond.end:                                         ; preds = %cond.false, %cond.true
  %cond = phi i32 [ %3, %cond.true ], [ %4, %cond.false ]
  %5 = atomicrmw xchg i32* %0, i32 %cond monotonic
  store i32 %5, i32* %Old, align 4
  %6 = load i32, i32* %Old, align 4
  ret i32 %6
}

; Function Attrs: noinline nounwind optnone mustprogress
define linkonce_odr hidden i32 @_Z22__kmpc_atomic_exchangeIiET_PS0_S0_(i32* %Address, i32 %Val) #3 comdat {
entry:
  %Address.addr = alloca i32*, align 8
  %Val.addr = alloca i32, align 4
  %Old = alloca i32, align 4
  store i32* %Address, i32** %Address.addr, align 8
  store i32 %Val, i32* %Val.addr, align 4
  %0 = load i32*, i32** %Address.addr, align 8
  %1 = load i32, i32* %Val.addr, align 4
  %2 = atomicrmw xchg i32* %0, i32 %1 monotonic
  store i32 %2, i32* %Old, align 4
  %3 = load i32, i32* %Old, align 4
  ret i32 %3
}

; Function Attrs: noinline nounwind optnone mustprogress
define linkonce_odr hidden i32 @_Z17__kmpc_atomic_casIiET_PS0_S0_S0_(i32* %Address, i32 %Compare, i32 %Val) #3 comdat {
entry:
  %Address.addr = alloca i32*, align 8
  %Compare.addr = alloca i32, align 4
  %Val.addr = alloca i32, align 4
  %Old = alloca i32, align 4
  store i32* %Address, i32** %Address.addr, align 8
  store i32 %Compare, i32* %Compare.addr, align 4
  store i32 %Val, i32* %Val.addr, align 4
  %0 = load i32*, i32** %Address.addr, align 8
  %1 = load i32, i32* %Old, align 4
  %2 = load i32, i32* %Compare.addr, align 4
  %cmp = icmp eq i32 %1, %2
  br i1 %cmp, label %cond.true, label %cond.false

cond.true:                                        ; preds = %entry
  %3 = load i32, i32* %Val.addr, align 4
  br label %cond.end

cond.false:                                       ; preds = %entry
  %4 = load i32, i32* %Old, align 4
  br label %cond.end

cond.end:                                         ; preds = %cond.false, %cond.true
  %cond = phi i32 [ %3, %cond.true ], [ %4, %cond.false ]
  %5 = atomicrmw xchg i32* %0, i32 %cond monotonic
  store i32 %5, i32* %Old, align 4
  %6 = load i32, i32* %Old, align 4
  ret i32 %6
}

Reading from Address to Old and all following operations are NOT one atomic transaction.

Yeah... That's not gone well. Unless @jdoerfert can point us at a better way to spell these in openmp, let's add 'fix target atomics' to the todo pile and use clang intrinsics for the time being.

Allocate shared variables with OpenMP omp_pteam_mem_alloc allocator

tianshilei1992 edited the summary of this revision. (Show Details)Jan 20 2021, 6:49 PM

Yeah... That's not gone well. Unless @jdoerfert can point us at a better way to spell these in openmp, let's add 'fix target atomics' to the todo pile and use clang intrinsics for the time being.

Clang intrinsics are fine. OpenMP 5.1 atomics should allow us to do this in OpenMP (if we want to).

Rebased and fixed issues about SHARED

tianshilei1992 edited the summary of this revision. (Show Details)Jan 21 2021, 9:33 AM
tianshilei1992 edited the summary of this revision. (Show Details)

Removed unnecessary forward declarations

Get rid of <cuda.h> by building bc libs for all supported CUDA versions

jdoerfert added inline comments.Jan 21 2021, 3:47 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
24

we need the 32 bit versions as well, I guess?

tianshilei1992 edited the summary of this revision. (Show Details)Jan 21 2021, 5:37 PM

Added changes in the driver

Herald added a project: Restricted Project. · View Herald TranscriptJan 21 2021, 6:16 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Droped the forward declaration and rewrote CUDA intrinsics with LLVM instrinsics

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
24

We could limit to 64 and see if a feature request for 32 comes in.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
65

Unintended?

tianshilei1992 added inline comments.Jan 22 2021, 6:37 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
24

I agree.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
65

Oh, yeah. I was testing whether __CUDA_ARCH__ can be set by CUDA FE automatically but it turns out no.

tianshilei1992 edited the summary of this revision. (Show Details)Jan 23 2021, 2:02 PM

rebase and prep for a new patch for CUDA instrinsics

rebased and dropped cuda header

openmp/libomptarget/deviceRTLs/common/src/libcall.cu
319–322

I think we could safely delete this function because function call in device code can always be taken as builtin so this function will never be called. WDYT? @jdoerfert @JonChesterfield

openmp/libomptarget/deviceRTLs/common/src/libcall.cu
319–322

Yes. There is an existing bug that &omp_is_initial_device doesn't work, but because of that nothing can call this function. We can reinstate it when replacing the builtin.

Added the missing critical option -fopenmp-cuda-mode

tianshilei1992 added inline comments.Jan 25 2021, 1:48 PM
clang/lib/Driver/ToolChains/Cuda.cpp
785 ↗(On Diff #319109)

This change also requires changes in the driver tests.

Fixed failed driver test

Couple of minor suggestions inline, but overall this looks pretty good. Hopefully device-only compilation works already, the rest could be left for after this patch.

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
16

From @jdoerfert,

-Xclang -fopenmp-is-device  -Xclang -emit-llvm-bc

should do device-only

100

Variant sounds good. Should also be able to use #ifdef __CUDA_ARCH__, as amdgpu shouldn't be setting that

116–120

s/correlation/correspondence, or maybe mapping

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
26

Suggest we drop the DEVICE annotation and change ALIGN to alignas() or similar, but in a later patch. This one is already quite noisy.

tianshilei1992 added inline comments.Jan 25 2021, 3:31 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
16

If I drop -fopenmp -Xclang -aux-triple -Xclang ${aux_triple}, a warning will be emitted:

/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:129:10: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  return __atomic_fetch_add(Address, Val, __ATOMIC_SEQ_CST);
         ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:136:10: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  return __atomic_fetch_max(Address, Val, __ATOMIC_SEQ_CST);
         ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:141:3: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  __atomic_exchange(Address, &Val, &R, __ATOMIC_SEQ_CST);
  ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:147:9: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  (void)__atomic_compare_exchange(Address, &Compare, &Val, false,
        ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:155:3: warning: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  __atomic_exchange(Address, &Val, &R, __ATOMIC_SEQ_CST);
  ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:161:10: warning: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  return __atomic_fetch_add(Address, Val, __ATOMIC_SEQ_CST);
         ^
6 warnings generated.
tianshilei1992 added inline comments.Jan 25 2021, 3:33 PM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
26

yes, just maintain minimal changes in the code for this patch. We can optimize everything afterwards.

the access size (8 bytes) exceeds the max lock-free size (0 bytes) [-Watomic-alignment

That's not a useful warning. Every size is greater than 0. I guess nvptx hasn't set a value somewhere in clang for max lock free size.

Fortunately we only compile with clang, so let's just pass Wno-atomic-alignment.

Final refinement before moving to review

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
110

This should be firing now that cuda.h is removed

openmp/libomptarget/deviceRTLs/common/omptarget.h
311

This will break on amdgpu, at least until the cmake for amdgpu changes over to openmp

If we spell it like:

#if _OPENMP
extern DEVICE uint8_t parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];
#pragma omp allocate(parallelLevel) allocator(omp_pteam_mem_alloc)
#else
extern DEVICE
    uint8_t EXTERN_SHARED(parallelLevel)[MAX_THREADS_PER_TEAM / WARPSIZE];
#endif

then amdgpu will continue working. Iirc this is the only shared array variable, the rest will be fine.

Fixed comments

openmp/libomptarget/deviceRTLs/common/allocator.h
16 ↗(On Diff #319176)

If we guard this with #ifdef _OPENMP, and we change amdgcn/src/target_impl.h from

#define SHARED __attribute__((shared))
to
#define SHARED(NAME) __attribute__((shared)) NAME
#define EXTERN_SHARED(NAME) __attribute__((shared)) NAME

then there's a credible chance the downstream amdgpu hip build will continue working

  • Fixed CMake error on CMake 3.16 or lower version as ZIP_LISTS doesn't work;
  • Fixed (hopefully) compilation break on AMDGCN by gaurding allocator.h with macro.
tianshilei1992 retitled this revision from [OpenMP][WIP] Build the deviceRTLs with OpenMP instead of target dependent language - NOT FOR REVIEW to [OpenMP][deviceRTLs] Build the deviceRTLs with OpenMP instead of target dependent language.Jan 25 2021, 6:55 PM
tianshilei1992 edited the summary of this revision. (Show Details)
tianshilei1992 marked 19 inline comments as done.Jan 25 2021, 6:57 PM
JonChesterfield accepted this revision.Jan 25 2021, 7:05 PM

LGTM! Thank you for iterating on this, and for trying to keep amdgpu working.

With this patch, the devicertl no longer needs the cuda sdk installed to compile. I believe there are still some tests in the cmake that need to go before builds will succeed on a machine without cuda installed.

This revision is now accepted and ready to land.Jan 25 2021, 7:05 PM
This revision was landed with ongoing or failed builds.Jan 26 2021, 9:28 AM
This revision was automatically updated to reflect the committed changes.

For me this patch breaks building llvm. Before this patch, I successfully built llvm using a llvm/10 installation on the system. What is probably special with our llvm installation is that we by default use libc++ rather than libstdc++.

FAILED: projects/openmp/libomptarget/deviceRTLs/nvptx/loop.cu-cuda_110-sm_60.bc 
cd BUILD/projects/openmp/libomptarget/deviceRTLs/nvptx && /home/pj416018/sw/UTIL/ccache/bin/clang -S -x c++ -target nvptx64 -Xclang -emit-llvm-bc -Xclang -aux-triple -Xclang x86_64-unknown-linux-gnu -fopenmp -fopenmp-cuda-mode -Xclang -fopenmp-is-device -D__CUDACC__ -I${llvm-SOURCE}/openmp/libomptarget/deviceRTLs -I${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src -DOMPTARGET_NVPTX_DEBUG=0 -Xclang -target-cpu -Xclang sm_60 -D__CUDA_ARCH__=600 -Xclang -target-feature -Xclang +ptx70 -DCUDA_VERSION=11000 ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/src/loop.cu -o loop.cu-cuda_110-sm_60.bc
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/src/loop.cu:16:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/omptarget.h:18:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/debug.h:31:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/device_environment.h:16:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h:18:
In file included from ${llvm-INSTALL}/10.0.0/bin/../include/c++/v1/stdlib.h:100:
In file included from ${llvm-INSTALL}/10.0.0/bin/../include/c++/v1/math.h:312:
${llvm-INSTALL}/10.0.0/bin/../include/c++/v1/limits:406:89: error: host requires 128 bit size 'std::__1::__libcpp_numeric_limits<long double, true>::type' (aka 'long double') type support, but device 'nvptx64' does not support it
    _LIBCPP_INLINE_VISIBILITY static _LIBCPP_CONSTEXPR type lowest() _NOEXCEPT {return -max();}
                                                                                        ^~~~~

My cmake call looks like:

cmake -GNinja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=$INSTALLDIR \
  -DLLVM_ENABLE_LIBCXX=ON \
  -DCLANG_DEFAULT_CXX_STDLIB=libc++ \
  -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_70 \
  -DLIBOMPTARGET_ENABLE_DEBUG=on \
  -DLIBOMPTARGET_NVPTX_ENABLE_BCLIB=true \
  -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60,70 \
  -DLLVM_ENABLE_PROJECTS="clang;compiler-rt;libcxxabi;libcxx;libunwind;openmp" \
  -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
  $LLVM_SOURCE

I also tried to build using my newest installed llvm build (7dd198852b4db52ae22242dfeda4eccda83aa8b2):

In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:14:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h:18:
${llvm-INSTALL}/bin/../include/c++/v1/stdlib.h:128:10: error: '__builtin_fabsl' requires 128 bit size 'long double' type support, but device 'nvptx64' does not support it
  return __builtin_fabsl(__lcpp_x);
         ^
jdoerfert added a comment.EditedFeb 3 2021, 8:32 AM

For me this patch breaks building llvm. Before this patch, I successfully built llvm using a llvm/10 installation on the system. What is probably special with our llvm installation is that we by default use libc++ rather than libstdc++.

FAILED: projects/openmp/libomptarget/deviceRTLs/nvptx/loop.cu-cuda_110-sm_60.bc 
cd BUILD/projects/openmp/libomptarget/deviceRTLs/nvptx && /home/pj416018/sw/UTIL/ccache/bin/clang -S -x c++ -target nvptx64 -Xclang -emit-llvm-bc -Xclang -aux-triple -Xclang x86_64-unknown-linux-gnu -fopenmp -fopenmp-cuda-mode -Xclang -fopenmp-is-device -D__CUDACC__ -I${llvm-SOURCE}/openmp/libomptarget/deviceRTLs -I${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src -DOMPTARGET_NVPTX_DEBUG=0 -Xclang -target-cpu -Xclang sm_60 -D__CUDA_ARCH__=600 -Xclang -target-feature -Xclang +ptx70 -DCUDA_VERSION=11000 ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/src/loop.cu -o loop.cu-cuda_110-sm_60.bc
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/src/loop.cu:16:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/omptarget.h:18:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/debug.h:31:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/device_environment.h:16:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h:18:
In file included from ${llvm-INSTALL}/10.0.0/bin/../include/c++/v1/stdlib.h:100:
In file included from ${llvm-INSTALL}/10.0.0/bin/../include/c++/v1/math.h:312:
${llvm-INSTALL}/10.0.0/bin/../include/c++/v1/limits:406:89: error: host requires 128 bit size 'std::__1::__libcpp_numeric_limits<long double, true>::type' (aka 'long double') type support, but device 'nvptx64' does not support it
    _LIBCPP_INLINE_VISIBILITY static _LIBCPP_CONSTEXPR type lowest() _NOEXCEPT {return -max();}
                                                                                        ^~~~~

My cmake call looks like:

cmake -GNinja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=$INSTALLDIR \
  -DLLVM_ENABLE_LIBCXX=ON \
  -DCLANG_DEFAULT_CXX_STDLIB=libc++ \
  -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_70 \
  -DLIBOMPTARGET_ENABLE_DEBUG=on \
  -DLIBOMPTARGET_NVPTX_ENABLE_BCLIB=true \
  -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60,70 \
  -DLLVM_ENABLE_PROJECTS="clang;compiler-rt;libcxxabi;libcxx;libunwind;openmp" \
  -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
  $LLVM_SOURCE

I also tried to build using my newest installed llvm build (7dd198852b4db52ae22242dfeda4eccda83aa8b2):

In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:14:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h:18:
${llvm-INSTALL}/bin/../include/c++/v1/stdlib.h:128:10: error: '__builtin_fabsl' requires 128 bit size 'long double' type support, but device 'nvptx64' does not support it
  return __builtin_fabsl(__lcpp_x);
         ^

This is tracked in PR48933, could you give D95928 and its two dependences a try?

JonChesterfield added a comment.EditedFeb 3 2021, 8:34 AM

I think there's a bug report about this. Sycl (iirc) introduced a change that caught invalid things with types that were previously ignored. @jdoerfert is on point I think.

-DLLVM_ENABLE_PROJECTS="clang;compiler-rt;libcxxabi;libcxx;libunwind;openmp"
^ That works, if the compiler in question can build things like an nvptx devicertl, which essentially means if it's a (sometimes very) recent clang. A generally easier path is
-DLLVM_ENABLE_RUNTIMES="openmp"
as that will build clang first, then use that clang to build openmp.

Won't help in this particular instance - if I understand correctly, it's a misfire from using glibc headers on the nvptx subsystem - though that stdlib.h looks like it shipped as part of libc++.

edit: I am too slow...

I think there's a bug report about this. Sycl (iirc) introduced a change that caught invalid things with types that were previously ignored. @jdoerfert is on point I think.

-DLLVM_ENABLE_PROJECTS="clang;compiler-rt;libcxxabi;libcxx;libunwind;openmp"
^ That works, if the compiler in question can build things like an nvptx devicertl, which essentially means if it's a (sometimes very) recent clang. A generally easier path is
-DLLVM_ENABLE_RUNTIMES="openmp"
as that will build clang first, then use that clang to build openmp.

Won't help in this particular instance - if I understand correctly, it's a misfire from using glibc headers on the nvptx subsystem - though that stdlib.h looks like it shipped as part of libc++.

edit: I am too slow...

I tried @jdoerfert 's patches, but they are not even necessary to address my building issue. Just delaying the build of OpenMP by setting -DLLVM_ENABLE_RUNTIMES="openmp" helped.
From my perspective, we should error out if people try to build OpenMP as a project rather than runtime and print a error message about what to change.

In any case, a stand-alone build of OpenMP still fails with any of the older clang compilers. Should we disable building of libomptarget, if an old clang is used? CMake diagnostic could suggest to use in-tree build for libomptarget.

Sorry, I meant to disable building of the cuda device-runtime, or whatever is breaking.

I think there's a bug report about this. Sycl (iirc) introduced a change that caught invalid things with types that were previously ignored. @jdoerfert is on point I think.

-DLLVM_ENABLE_PROJECTS="clang;compiler-rt;libcxxabi;libcxx;libunwind;openmp"
^ That works, if the compiler in question can build things like an nvptx devicertl, which essentially means if it's a (sometimes very) recent clang. A generally easier path is
-DLLVM_ENABLE_RUNTIMES="openmp"
as that will build clang first, then use that clang to build openmp.

Won't help in this particular instance - if I understand correctly, it's a misfire from using glibc headers on the nvptx subsystem - though that stdlib.h looks like it shipped as part of libc++.

edit: I am too slow...

I tried @jdoerfert 's patches, but they are not even necessary to address my building issue. Just delaying the build of OpenMP by setting -DLLVM_ENABLE_RUNTIMES="openmp" helped.
From my perspective, we should error out if people try to build OpenMP as a project rather than runtime and print a error message about what to change.

In any case, a stand-alone build of OpenMP still fails with any of the older clang compilers. Should we disable building of libomptarget, if an old clang is used? CMake diagnostic could suggest to use in-tree build for libomptarget.

There're a couple of things:

  1. deviceRTLs is disabled on CUDA-free system by default. LIBOMPTARGET_BUILD_NVPTX_BCLIB is being used now. We have updated many CMake options.
  2. Older version of Clang might not be able to compile current deviceRTLs because we had issues with C++ OpenMP offloading program, and current deviceRTLs is a C++ OpenMP offloading program.
  3. We might want to raise an error when user is using older version to build deviceRTLs.
jdenny added a subscriber: jdenny.Feb 6 2021, 11:26 AM