This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][deviceRTLs] Build the deviceRTLs with OpenMP instead of target dependent language
ClosedPublic

Authored by tianshilei1992 on Jan 14 2021, 8:15 PM.

Details

Summary

With this patch (plus some already-landed patches), deviceRTLs is treated as a regular OpenMP program with just declare target regions. In this way, ideally, deviceRTLs can be written directly in OpenMP. No CUDA, no HIP anymore. (Well, AMD is still working on getting it to work. For now AMDGCN still compiles the original way.) Some target-specific functions are still required, but they are no longer written in a target-specific language. For example, the CUDA parts have all been refined by replacing CUDA intrinsics and builtins with LLVM/Clang/NVVM intrinsics.
Here is the list of changes in this patch.

  1. For NVPTX, DEVICE is defined as empty in order to keep the common parts working with AMDGCN. Later, once AMDGCN is also available, we will completely remove DEVICE and probably some other macros.
  2. Shared variables are implemented with an OpenMP allocator, which is defined in allocator.h. Again, this feature is not yet available on AMDGCN, so two macros are redefined accordingly (see the sketch after this list).
  3. The CUDA header cuda.h is dropped from the source code. To deal with code differences across CUDA versions, we build one bitcode library for each supported CUDA version. For each CUDA version, the highest PTX version it supports is used, just as we currently do for CUDA compilation.
  4. Correspondingly, the compiler driver is also updated to support the CUDA version encoded in the name of the bitcode library. The bitcode library for NVPTX is now named libomptarget-nvptx-cuda_[cuda_version]-sm_[sm_number].bc, such as libomptarget-nvptx-cuda_80-sm_20.bc.
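
A minimal sketch of the allocator-based shared variable from item 2 (the variable name is illustrative; the real declarations live in allocator.h and the deviceRTL sources):

#include <stdint.h>

#pragma omp declare target
// Placed in team-shared (LDS/shared) memory via the predefined OpenMP
// pteam allocator instead of CUDA's __shared__ qualifier.
static uint32_t usedSlotIdx [[clang::loader_uninitialized]];
#pragma omp allocate(usedSlotIdx) allocator(omp_pteam_mem_alloc)
#pragma omp end declare target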

With this change, multiple follow-up features can be expected in the near future:

  1. CUDA will be completely dropped when compiling OpenMP. At that point, we will also build bitcode libraries for all supported SMs, multiplied by all supported CUDA versions.
  2. Atomic operations used in deviceRTLs can be replaced by omp atomic once the OpenMP 5.1 feature is fully supported. For now, the IR generated for them is totally wrong.
  3. Target-specific parts will be wrapped into declare variant with an isa selector once that works properly. No target-specific macros will be needed anymore.
  4. (Maybe more...)

Diff Detail

Event Timeline

tianshilei1992 created this revision.Jan 14 2021, 8:15 PM
tianshilei1992 requested review of this revision.Jan 14 2021, 8:15 PM
Herald added a project: Restricted Project.
JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
71

We can (and should) call clang's ffs/popcount instead of these
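
For reference, a sketch of such wrappers in terms of the compiler builtins (the __kmpc_impl_* names follow the surrounding file and are assumptions here):

#include <stdint.h>

// Replacements for the CUDA __ffs/__popc intrinsics; clang lowers these
// builtins to the corresponding LLVM intrinsics on nvptx.
static inline int32_t __kmpc_impl_ffs(uint32_t x) { return __builtin_ffs(x); }
static inline int32_t __kmpc_impl_popc(uint32_t x) { return __builtin_popcount(x); }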

Rebased and fixed some issues

Continue to add some forward declarations

tianshilei1992 marked an inline comment as done.Jan 17 2021, 11:52 AM
tianshilei1992 added inline comments.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
71

We could also directly include __clang_openmp_device_functions.h if they're already covered by the header.

tianshilei1992 edited the summary of this revision. (Show Details)Jan 17 2021, 11:56 AM
tianshilei1992 marked an inline comment as done.

This patch now passes compilation

tianshilei1992 edited the summary of this revision. (Show Details)Jan 19 2021, 1:30 PM
tianshilei1992 edited the summary of this revision. (Show Details)
openmp/libomptarget/deviceRTLs/common/debug.h
132 ↗(On Diff #317675)

Not sure fine-grained #pragma omp declare target is worth the noise; we can put a declare target at the top of the file (before the includes) and an end declare target at the end with the same semantics.
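
A sketch of the file-level form:

#pragma omp declare target   // first line of the file, before any includes
// ... all includes and definitions in the file ...
#pragma omp end declare target   // last line of the file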

openmp/libomptarget/deviceRTLs/common/omptargeti.h
16 ↗(On Diff #317675)

if we put them at the start/end of the source, don't need this noise in any of the headers

openmp/libomptarget/deviceRTLs/common/target_atomic.h
22 ↗(On Diff #317675)

Not keen. This is common/, shouldn't be calling functions from cuda.

How about we move target_atomic.h under nvptx, and implement __kmpc_atomic_add etc there?

The amdgpu target_atomic.h would be simpler if it moved under amdgpu, as presently it implements atomicAdd in terms of another function, and could elide that layer entirely if __kmpc_atomic_add called the intrinsic.

We could also implement these in terms of clang intrinsics (i.e., identically as done by amdgpu now), and the result would then work for both platforms.
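
For instance, a sketch of the integer cases in terms of the GCC-style __atomic builtins clang accepts on both targets (not necessarily this patch's code; floating-point add would still need a target-specific path):

// Lowers to an atomicrmw add on both nvptx and amdgcn.
template <typename T> static inline T __kmpc_atomic_add(T *Address, T Val) {
  return __atomic_fetch_add(Address, Val, __ATOMIC_SEQ_CST);
}

template <typename T> static inline T __kmpc_atomic_exchange(T *Address, T Val) {
  T R;
  __atomic_exchange(Address, &Val, &R, __ATOMIC_SEQ_CST);
  return R;
}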

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
102

This is suspect - why does openmp want to claim to be cuda?

105

i guess this survives until the last use of cuda.h is dropped

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
19

shouldn't these be in the cuda header above, and also in the clang-injected cuda headers?

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
26

example implementation using omp allocators at D93135

76

let's just use x < y ? x : y, as it'll codegen to the same thing or better anyway

tianshilei1992 marked 4 inline comments as done.Jan 19 2021, 2:14 PM
tianshilei1992 added inline comments.
openmp/libomptarget/deviceRTLs/common/debug.h
132 ↗(On Diff #317675)

clang used to have a bug where it crashed on global variables, so at the very beginning it crashed on extern FILE *stdin. That has been fixed now, so I guess we could do that.

openmp/libomptarget/deviceRTLs/common/target_atomic.h
22 ↗(On Diff #317675)

I'm okay with your proposal. I'd do it in another patch after this one works, because I want minimal changes in this patch to make everything work, and then start to optimize things. BTW, the template functions don't call CUDA here. They're just declarations, and will be lowered to variants based on their type, like fatomicAdd, which are eventually defined in the different implementations.

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
102

To keep the change minimal. There is an include wrapped in a macro in interface.h: for AMD GPUs it includes a header from the AMD implementation, and for CUDA devices it includes a header from the NVPTX implementation.

105

yep

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
19

All functions that can be called from CUDA are declared __device__. In a declare target region, we cannot call those functions; we need them in an OpenMP-compatible form, so those in cuda.h cannot be used. If it were not for the CUDA version macros, we could drop the header.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
26

yep, will do it later.

tianshilei1992 marked 4 inline comments as done.Jan 19 2021, 2:14 PM
openmp/libomptarget/deviceRTLs/common/target_atomic.h
22 ↗(On Diff #317675)

For what it's worth, atomicInc is not a template for amdgcn (we only ever use it on a uint32_t, and a generic implementation in terms of CAS would be very slow). Implementation looks like:

template <typename T> DEVICE T atomicCAS(T *address, T compare, T val) {
  (void)__atomic_compare_exchange(address, &compare, &val, false,
                                  __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
  return compare;
}

INLINE uint32_t atomicInc(uint32_t *address, uint32_t max) {
  return __builtin_amdgcn_atomic_inc32(address, max, __ATOMIC_SEQ_CST, "");
}

so declaring a template with the same name here will break amdgpu. Which is not a disaster, as it doesn't build on trunk anyway, but will cause me some minor headaches when this flows into rocm.

If I find some time I'll write a patch moving the atomic functions around, should be orthogonal to this change.

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
102

Ah, that's probably my fault. May as well leave it for now.

I think we should expose a macro for openmp that indicates whether we're doing offloading to nvptx, or offloading to amdgpu, or just compiling for the host. Or, I think equivalently, replace some #if with variant.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
19

I think the right answer to the CUDA version macros is to compile this file into the deviceRTL twice, once for CUDA_VERSION < 9000 and once for >= 9000. It seems reasonable to have a different implementation for the CUDA API change. Clang knows what CUDA version it is compiling applications for, so it could pick the matching deviceRTL.bc.

That would let us totally decouple from cuda with some slightly ugly stuff like
return __nvvm_shfl_down_i32(Var, Delta, ((WARPSIZE - Width) << 8) | 0x1f);
as typeset in https://reviews.llvm.org/D94731?vs=316809&id=316820#toc
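
Compiled once per CUDA version, the two copies might then differ only in which builtin they call; a sketch (INLINE and WARPSIZE as defined by the deviceRTL; the sync builtin exists from CUDA 9 / PTX 6.0 onwards):

INLINE int32_t __kmpc_impl_shfl_down_sync(uint32_t Mask, int32_t Var,
                                          uint32_t Delta, int32_t Width) {
#if CUDA_VERSION >= 9000
  return __nvvm_shfl_sync_down_i32(Mask, Var, Delta,
                                   ((WARPSIZE - Width) << 8) | 0x1f);
#else
  return __nvvm_shfl_down_i32(Var, Delta, ((WARPSIZE - Width) << 8) | 0x1f);
#endif
}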

jdoerfert added inline comments.Jan 19 2021, 4:20 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
102

Please don't use defines if we have begin/end declare variant for it.
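
For example, a sketch using the begin/end form clang supports:

#pragma omp begin declare variant match(device = {arch(nvptx, nvptx64)})
// nvptx-only declarations go here instead of behind an #ifdef
#pragma omp end declare variant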

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu
19

It's been pointed out to me that we already include ~4k of source at the top of source files that are compiled as openmp, even if they #include no header files. Mostly bits of libm. I'm not pleased to discover that, but it does mean that adding an implementation of __kmpc_impl_activemask etc to a new header won't change the status quo. Let's do that.
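
A sketch of what such a header implementation could look like (assuming CUDA_VERSION is defined by the build, the activemask.b32 instruction from PTX ISA 6.2, and the __nvvm_vote_ballot builtin; not necessarily the exact spelling the patch uses):

#include <stdint.h>

static inline uint32_t __kmpc_impl_activemask() {
#if CUDA_VERSION >= 9000
  // Query the mask of currently-active lanes directly.
  uint32_t Mask;
  asm volatile("activemask.b32 %0;" : "=r"(Mask));
  return Mask;
#else
  // Pre-Volta: every active lane votes true, yielding the active mask.
  return __nvvm_vote_ballot(1);
#endif
}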

Updated positions of #pragma omp declare target to make fewer changes

Let's split this. Declare target in all the source files can go first, doesn't hurt anyone.
Forward declarations should also work fine with existing compilation, so that would be two.
Conditional defines and a SHARED(...) macro would be three.
Four is the cmake magic. Potentially a different folder to build it as C++ + OpenMP alternatively.

Let's split this. Declare target in all the source files can go first, doesn't hurt anyone.

D95048

tianshilei1992 retitled this revision from [OpenMP][WIP] Build the deviceRTLs with OpenMP instead of target dependent language to [OpenMP][WIP] Build the deviceRTLs with OpenMP instead of target dependent language - NOT FOR REVIEW.Jan 20 2021, 9:38 AM
tianshilei1992 edited the summary of this revision. (Show Details)
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
71

D95060 patches ffs/popc/min

Rebased and rewrote atomics with OpenMP.

tianshilei1992 added inline comments.Jan 20 2021, 2:12 PM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
131–132

@jdoerfert @JonChesterfield You might want to review this part. If that works, we could take them back to common parts afterwards.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
131–132

I can't remember what the semantics of atomic_inc are but I do remember them being surprising.

In general I prefer the clang intrinsics, but if this generates the same IR then so be it.

What IR does it emit? Will be easier to tell if we land D95093 first as we can llvm-dis target_impl.bc, instead of scraping together examples of the calls from various places.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
131–132

I can't remember what the semantics of atomic_inc are but I do remember them being surprising.

From the docs, it writes ((old >= val) ? 0 : (old+1)) to memory and returns old. I would guess that needs to be spelled *Address = (Old >= Val) ? 0 : (Old+1). We'd also need the *Address read and the *Address store to occur atomically for this to be correct.
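
One way to make the read and the store a single atomic transaction, without CUDA's atomicInc, is a compare-exchange loop over the clang atomic builtins; a generic sketch, not what the patch ends up doing:

#include <stdint.h>

static inline uint32_t atomicInc(uint32_t *Address, uint32_t Val) {
  uint32_t Old = *Address;
  for (;;) {
    uint32_t New = (Old >= Val) ? 0 : (Old + 1);
    // On failure, Old is refreshed with the current value of *Address.
    if (__atomic_compare_exchange_n(Address, &Old, New, /*weak=*/false,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
      return Old;
  }
}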

Fixed computation in __kmpc_atomic_inc

JonChesterfield added a comment.EditedJan 20 2021, 2:51 PM

Tried ~/llvm-build/llvm/bin/clang++ -O2 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_50 atomic_inc.cpp -c -emit-llvm -S --cuda-device-only -o -

#include <stdint.h>

#pragma omp declare target

template <typename T> T atomicInc(T *, T);

uint32_t __kmpc_atomic_inc_omp(uint32_t *Address, uint32_t Val) {
  uint32_t Old;
#pragma omp atomic capture
  {
    Old = *Address;
    *Address += Old >= Val ? 0 : 1;
  }
  return Old;
}

uint32_t __kmpc_atomic_inc_omp2(uint32_t *Address, uint32_t Val) {
  uint32_t Old;
#pragma omp atomic capture
  {
    Old = *Address;
    *Address = ((Old >= Val) ? 0 : (Old+1));
  }
  return Old;
}


#pragma omp end declare target

Got

target triple = "nvptx64-nvidia-cuda"

; Function Attrs: nofree norecurse nounwind
define hidden i32 @_Z21__kmpc_atomic_inc_ompPjj(i32* nocapture readonly %Address, i32 %Val) local_unnamed_addr #0 {
entry:
  %0 = load atomic i32, i32* %Address monotonic, align 4
  ret i32 %0
}

; Function Attrs: nofree norecurse nounwind
define hidden i32 @_Z22__kmpc_atomic_inc_omp2Pjj(i32* nocapture %Address, i32 %Val) local_unnamed_addr #0 {
entry:
  %0 = atomicrmw xchg i32* %Address, i32 0 monotonic
  ret i32 %0
}

Neither looks right to me. Exchange with zero isn't an increment, and neither is a load.

From the IR generated, it seems OpenMP cannot handle any of these operations except atomicAdd.

; Function Attrs: noinline nounwind optnone mustprogress
define linkonce_odr hidden i32 @_Z17__kmpc_atomic_addIiET_PS0_S0_(i32* %Address, i32 %Val) #3 comdat {
entry:
  %Address.addr = alloca i32*, align 8
  %Val.addr = alloca i32, align 4
  %Old = alloca i32, align 4
  store i32* %Address, i32** %Address.addr, align 8
  store i32 %Val, i32* %Val.addr, align 4
  %0 = load i32*, i32** %Address.addr, align 8
  %1 = load i32, i32* %Val.addr, align 4
  %2 = atomicrmw add i32* %0, i32 %1 monotonic
  store i32 %2, i32* %Old, align 4
  %3 = load i32, i32* %Old, align 4
  ret i32 %3
}

; Function Attrs: noinline nounwind optnone mustprogress
define linkonce_odr hidden i32 @_Z17__kmpc_atomic_incIiET_PS0_S0_(i32* %Address, i32 %Val) #3 comdat {
entry:
  %Address.addr = alloca i32*, align 8
  %Val.addr = alloca i32, align 4
  %Old = alloca i32, align 4
  store i32* %Address, i32** %Address.addr, align 8
  store i32 %Val, i32* %Val.addr, align 4
  %0 = load i32*, i32** %Address.addr, align 8
  %1 = load i32, i32* %Old, align 4
  %2 = load i32, i32* %Val.addr, align 4
  %cmp = icmp sge i32 %1, %2
  br i1 %cmp, label %cond.true, label %cond.false

cond.true:                                        ; preds = %entry
  br label %cond.end

cond.false:                                       ; preds = %entry
  %3 = load i32, i32* %Old, align 4
  %add = add nsw i32 %3, 1
  br label %cond.end

cond.end:                                         ; preds = %cond.false, %cond.true
  %cond = phi i32 [ 0, %cond.true ], [ %add, %cond.false ]
  %4 = atomicrmw xchg i32* %0, i32 %cond monotonic
  store i32 %4, i32* %Old, align 4
  %5 = load i32, i32* %Old, align 4
  ret i32 %5
}

; Function Attrs: noinline nounwind optnone mustprogress
define linkonce_odr hidden i32 @_Z17__kmpc_atomic_maxIiET_PS0_S0_(i32* %Address, i32 %Val) #3 comdat {
entry:
  %Address.addr = alloca i32*, align 8
  %Val.addr = alloca i32, align 4
  %Old = alloca i32, align 4
  store i32* %Address, i32** %Address.addr, align 8
  store i32 %Val, i32* %Val.addr, align 4
  %0 = load i32*, i32** %Address.addr, align 8
  %1 = load i32, i32* %Old, align 4
  %2 = load i32, i32* %Val.addr, align 4
  %cmp = icmp sgt i32 %1, %2
  br i1 %cmp, label %cond.true, label %cond.false

cond.true:                                        ; preds = %entry
  %3 = load i32, i32* %Old, align 4
  br label %cond.end

cond.false:                                       ; preds = %entry
  %4 = load i32, i32* %Val.addr, align 4
  br label %cond.end

cond.end:                                         ; preds = %cond.false, %cond.true
  %cond = phi i32 [ %3, %cond.true ], [ %4, %cond.false ]
  %5 = atomicrmw xchg i32* %0, i32 %cond monotonic
  store i32 %5, i32* %Old, align 4
  %6 = load i32, i32* %Old, align 4
  ret i32 %6
}

; Function Attrs: noinline nounwind optnone mustprogress
define linkonce_odr hidden i32 @_Z22__kmpc_atomic_exchangeIiET_PS0_S0_(i32* %Address, i32 %Val) #3 comdat {
entry:
  %Address.addr = alloca i32*, align 8
  %Val.addr = alloca i32, align 4
  %Old = alloca i32, align 4
  store i32* %Address, i32** %Address.addr, align 8
  store i32 %Val, i32* %Val.addr, align 4
  %0 = load i32*, i32** %Address.addr, align 8
  %1 = load i32, i32* %Val.addr, align 4
  %2 = atomicrmw xchg i32* %0, i32 %1 monotonic
  store i32 %2, i32* %Old, align 4
  %3 = load i32, i32* %Old, align 4
  ret i32 %3
}

; Function Attrs: noinline nounwind optnone mustprogress
define linkonce_odr hidden i32 @_Z17__kmpc_atomic_casIiET_PS0_S0_S0_(i32* %Address, i32 %Compare, i32 %Val) #3 comdat {
entry:
  %Address.addr = alloca i32*, align 8
  %Compare.addr = alloca i32, align 4
  %Val.addr = alloca i32, align 4
  %Old = alloca i32, align 4
  store i32* %Address, i32** %Address.addr, align 8
  store i32 %Compare, i32* %Compare.addr, align 4
  store i32 %Val, i32* %Val.addr, align 4
  %0 = load i32*, i32** %Address.addr, align 8
  %1 = load i32, i32* %Old, align 4
  %2 = load i32, i32* %Compare.addr, align 4
  %cmp = icmp eq i32 %1, %2
  br i1 %cmp, label %cond.true, label %cond.false

cond.true:                                        ; preds = %entry
  %3 = load i32, i32* %Val.addr, align 4
  br label %cond.end

cond.false:                                       ; preds = %entry
  %4 = load i32, i32* %Old, align 4
  br label %cond.end

cond.end:                                         ; preds = %cond.false, %cond.true
  %cond = phi i32 [ %3, %cond.true ], [ %4, %cond.false ]
  %5 = atomicrmw xchg i32* %0, i32 %cond monotonic
  store i32 %5, i32* %Old, align 4
  %6 = load i32, i32* %Old, align 4
  ret i32 %6
}

The read from Address into Old and all the following operations are NOT one atomic transaction.

Yeah... That's not gone well. Unless @jdoerfert can point us at a better way to spell these in openmp, let's add 'fix target atomics' to the todo pile and use clang intrinsics for the time being.

Allocate shared variables with OpenMP omp_pteam_mem_alloc allocator

tianshilei1992 edited the summary of this revision. (Show Details)Jan 20 2021, 6:49 PM

Yeah... That's not gone well. Unless @jdoerfert can point us at a better way to spell these in openmp, let's add 'fix target atomics' to the todo pile and use clang intrinsics for the time being.

Clang intrinsics are fine. OpenMP 5.1 atomics should allow us to do this in OpenMP (if we want to).

Rebased and fixed issues about SHARED

tianshilei1992 edited the summary of this revision. (Show Details)Jan 21 2021, 9:33 AM
tianshilei1992 edited the summary of this revision. (Show Details)

Removed unnecessary forward declarations

Get rid of <cuda.h> by building bc libs for all supported CUDA versions

jdoerfert added inline comments.Jan 21 2021, 3:47 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
24

we need the 32 bit versions as well, I guess?

tianshilei1992 edited the summary of this revision. (Show Details)Jan 21 2021, 5:37 PM

Added changes in the driver

Herald added a project: Restricted Project.Jan 21 2021, 6:16 PM
Herald added a subscriber: cfe-commits.

Dropped the forward declarations and rewrote CUDA intrinsics with LLVM intrinsics

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
24

We could limit to 64 and see if a feature request for 32 comes in.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
62

Unintended?

tianshilei1992 added inline comments.Jan 22 2021, 6:37 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
24

I agree.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
62

Oh, yeah. I was testing whether __CUDA_ARCH__ can be set by the CUDA FE automatically, but it turns out it cannot.

tianshilei1992 edited the summary of this revision. (Show Details)Jan 23 2021, 2:02 PM

rebase and prep for a new patch for CUDA intrinsics

rebased and dropped cuda header

openmp/libomptarget/deviceRTLs/common/src/libcall.cu
319

I think we could safely delete this function: a call to it in device code is always treated as the builtin, so the function itself will never be called. WDYT? @jdoerfert @JonChesterfield

openmp/libomptarget/deviceRTLs/common/src/libcall.cu
319

Yes. There is an existing bug that &omp_is_initial_device doesn't work, but because of that nothing can call this function. We can reinstate it when replacing the builtin.

Added the missing critical option -fopenmp-cuda-mode

tianshilei1992 added inline comments.Jan 25 2021, 1:48 PM
clang/lib/Driver/ToolChains/Cuda.cpp
785

This change also requires changes in the driver tests.

Fixed failed driver test

Couple of minor suggestions inline, but overall this looks pretty good. Hopefully device-only compilation works already; the rest can be left for after this patch.

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
16

From @jdoerfert,

-Xclang -fopenmp-is-device  -Xclang -emit-llvm-bc

should do device-only

102

Variant sounds good. Should also be able to use #ifdef __CUDA_ARCH__, as amdgpu shouldn't be setting that

115–121

s/correlation/correspondence, or maybe mapping

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
25

Suggest we drop the DEVICE annotation and change ALIGN to alignas() or similar, but in a later patch. This one is already quite noisy.

tianshilei1992 added inline comments.Jan 25 2021, 3:31 PM
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
16

If I drop -fopenmp -Xclang -aux-triple -Xclang ${aux_triple}, a warning will be emitted:

/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:129:10: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  return __atomic_fetch_add(Address, Val, __ATOMIC_SEQ_CST);
         ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:136:10: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  return __atomic_fetch_max(Address, Val, __ATOMIC_SEQ_CST);
         ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:141:3: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  __atomic_exchange(Address, &Val, &R, __ATOMIC_SEQ_CST);
  ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:147:9: warning: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  (void)__atomic_compare_exchange(Address, &Compare, &Val, false,
        ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:155:3: warning: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  __atomic_exchange(Address, &Val, &R, __ATOMIC_SEQ_CST);
  ^
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:161:10: warning: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (0  bytes) [-Watomic-alignment]
  return __atomic_fetch_add(Address, Val, __ATOMIC_SEQ_CST);
         ^
6 warnings generated.
tianshilei1992 added inline comments.Jan 25 2021, 3:33 PM
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
25

yes, just maintain minimal changes in the code for this patch. We can optimize everything afterwards.

the access size (8 bytes) exceeds the max lock-free size (0 bytes) [-Watomic-alignment]

That's not a useful warning. Every size is greater than 0. I guess nvptx hasn't set a value somewhere in clang for the max lock-free size.

Fortunately we only compile with clang, so let's just pass -Wno-atomic-alignment.

Final refinement before moving to review

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
98

This should be firing now that cuda.h is removed

openmp/libomptarget/deviceRTLs/common/omptarget.h
301

This will break on amdgpu, at least until the cmake for amdgpu changes over to openmp

If we spell it like:

#if _OPENMP
extern DEVICE uint8_t parallelLevel[MAX_THREADS_PER_TEAM / WARPSIZE];
#pragma omp allocate(parallelLevel) allocator(omp_pteam_mem_alloc)
#else
extern DEVICE
    uint8_t EXTERN_SHARED(parallelLevel)[MAX_THREADS_PER_TEAM / WARPSIZE];
#endif

then amdgpu will continue working. Iirc this is the only shared array variable, the rest will be fine.

Fixed comments

openmp/libomptarget/deviceRTLs/common/allocator.h
17

If we guard this with #ifdef _OPENMP, and we change amdgcn/src/target_impl.h from

#define SHARED __attribute__((shared))
to
#define SHARED(NAME) __attribute__((shared)) NAME
#define EXTERN_SHARED(NAME) __attribute__((shared)) NAME

then there's a credible chance the downstream amdgpu hip build will continue working

  • Fixed CMake error on CMake 3.16 and lower, as ZIP_LISTS doesn't work there;
  • Fixed (hopefully) the compilation break on AMDGCN by guarding allocator.h with a macro.
tianshilei1992 retitled this revision from [OpenMP][WIP] Build the deviceRTLs with OpenMP instead of target dependent language - NOT FOR REVIEW to [OpenMP][deviceRTLs] Build the deviceRTLs with OpenMP instead of target dependent language.Jan 25 2021, 6:55 PM
tianshilei1992 edited the summary of this revision. (Show Details)
tianshilei1992 marked 19 inline comments as done.Jan 25 2021, 6:57 PM
JonChesterfield accepted this revision.Jan 25 2021, 7:05 PM

LGTM! Thank you for iterating on this, and for trying to keep amdgpu working.

With this patch, the devicertl no longer needs the CUDA SDK installed to compile. I believe there are still some checks in the cmake that need to go before builds will succeed on a machine without CUDA installed.

This revision is now accepted and ready to land.Jan 25 2021, 7:05 PM
This revision was landed with ongoing or failed builds.Jan 26 2021, 9:28 AM
This revision was automatically updated to reflect the committed changes.

For me this patch breaks building llvm. Before this patch, I successfully built llvm using an llvm/10 installation on the system. What is probably special about our llvm installation is that we use libc++ rather than libstdc++ by default.

FAILED: projects/openmp/libomptarget/deviceRTLs/nvptx/loop.cu-cuda_110-sm_60.bc 
cd BUILD/projects/openmp/libomptarget/deviceRTLs/nvptx && /home/pj416018/sw/UTIL/ccache/bin/clang -S -x c++ -target nvptx64 -Xclang -emit-llvm-bc -Xclang -aux-triple -Xclang x86_64-unknown-linux-gnu -fopenmp -fopenmp-cuda-mode -Xclang -fopenmp-is-device -D__CUDACC__ -I${llvm-SOURCE}/openmp/libomptarget/deviceRTLs -I${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src -DOMPTARGET_NVPTX_DEBUG=0 -Xclang -target-cpu -Xclang sm_60 -D__CUDA_ARCH__=600 -Xclang -target-feature -Xclang +ptx70 -DCUDA_VERSION=11000 ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/src/loop.cu -o loop.cu-cuda_110-sm_60.bc
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/src/loop.cu:16:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/omptarget.h:18:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/debug.h:31:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/common/device_environment.h:16:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h:18:
In file included from ${llvm-INSTALL}/10.0.0/bin/../include/c++/v1/stdlib.h:100:
In file included from ${llvm-INSTALL}/10.0.0/bin/../include/c++/v1/math.h:312:
${llvm-INSTALL}/10.0.0/bin/../include/c++/v1/limits:406:89: error: host requires 128 bit size 'std::__1::__libcpp_numeric_limits<long double, true>::type' (aka 'long double') type support, but device 'nvptx64' does not support it
    _LIBCPP_INLINE_VISIBILITY static _LIBCPP_CONSTEXPR type lowest() _NOEXCEPT {return -max();}
                                                                                        ^~~~~

My cmake call looks like:

cmake -GNinja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=$INSTALLDIR \
  -DLLVM_ENABLE_LIBCXX=ON \
  -DCLANG_DEFAULT_CXX_STDLIB=libc++ \
  -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_70 \
  -DLIBOMPTARGET_ENABLE_DEBUG=on \
  -DLIBOMPTARGET_NVPTX_ENABLE_BCLIB=true \
  -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60,70 \
  -DLLVM_ENABLE_PROJECTS="clang;compiler-rt;libcxxabi;libcxx;libunwind;openmp" \
  -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
  $LLVM_SOURCE

I also tried to build using my newest installed llvm build (7dd198852b4db52ae22242dfeda4eccda83aa8b2):

In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:14:
In file included from ${llvm-SOURCE}/openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h:18:
${llvm-INSTALL}/bin/../include/c++/v1/stdlib.h:128:10: error: '__builtin_fabsl' requires 128 bit size 'long double' type support, but device 'nvptx64' does not support it
  return __builtin_fabsl(__lcpp_x);
         ^
jdoerfert added a comment.EditedFeb 3 2021, 8:32 AM


This is tracked in PR48933, could you give D95928 and its two dependences a try?

JonChesterfield added a comment.EditedFeb 3 2021, 8:34 AM

I think there's a bug report about this. Sycl (iirc) introduced a change that caught invalid things with types that were previously ignored. @jdoerfert is on point I think.

-DLLVM_ENABLE_PROJECTS="clang;compiler-rt;libcxxabi;libcxx;libunwind;openmp"
^ That works, if the compiler in question can build things like an nvptx devicertl, which essentially means if it's a (sometimes very) recent clang. A generally easier path is
-DLLVM_ENABLE_RUNTIMES="openmp"
as that will build clang first, then use that clang to build openmp.

Won't help in this particular instance - if I understand correctly, it's a misfire from using glibc headers on the nvptx subsystem - though that stdlib.h looks like it shipped as part of libc++.

edit: I am too slow...


I tried @jdoerfert 's patches, but they were not even necessary to address my build issue. Just delaying the build of OpenMP by setting -DLLVM_ENABLE_RUNTIMES="openmp" helped.
From my perspective, we should error out if people try to build OpenMP as a project rather than a runtime, and print an error message about what to change.

In any case, a stand-alone build of OpenMP still fails with any of the older clang compilers. Should we disable building libomptarget if an old clang is used? A CMake diagnostic could suggest using an in-tree build for libomptarget.

Sorry, I meant to disable building of the cuda device-runtime, or whatever is breaking.


There are a couple of things:

  1. deviceRTLs is disabled on CUDA-free systems by default; LIBOMPTARGET_BUILD_NVPTX_BCLIB is the option being used now. We have updated many CMake options.
  2. Older versions of Clang might not be able to compile the current deviceRTLs, because we had issues with C++ OpenMP offloading programs, and the current deviceRTLs is a C++ OpenMP offloading program.
  3. We might want to raise an error when a user tries to build the deviceRTLs with an older version.
jdenny added a subscriber: jdenny.Feb 6 2021, 11:26 AM