This is an archive of the discontinued LLVM Phabricator instance.

Differential D95752

[OpenMP][DeviceRTL] Extract shuffle idiom and port it to declare variant
ClosedPublic

Authored by jdoerfert on Jan 30 2021, 5:01 PM.


Details

Reviewers

JonChesterfield
tianshilei1992
bollu

Commits

rG66ba494b4974: [OpenMP][DeviceRTL] Extract shuffle idiom and port it to declare variant

Summary

The shuffle idiom is implemented differently across our supported targets.
To reduce the "target_impl" file we now move the shuffle idiom into its
own self-contained header that provides the implementation for AMDGPU
and NVPTX. A fallback can be added later on.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jdoerfert created this revision. Jan 30 2021, 5:01 PM

Herald added a reviewer: bollu. Jan 30 2021, 5:01 PM

Herald added subscribers: guansong, tpr, yaxunl and 2 others.

jdoerfert requested review of this revision. Jan 30 2021, 5:01 PM

Herald added a project: Restricted Project. Jan 30 2021, 5:01 PM

Herald added a subscriber: sstefan1.

LGTM. This will be the first example to merge different implementations into one file.

openmp/libomptarget/deviceRTLs/common/src/data_sharing.cu
17	One nit: you might have `shuffle.h` before `target_impl.h` if using `clang-format`.

This revision is now accepted and ready to land. Jan 30 2021, 5:11 PM

Harbormaster completed remote builds in B87288: Diff 320332. Jan 30 2021, 5:28 PM

I can see many warnings emitted, like the following one:

In file included from /home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/src/loop.cu:17:
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/include/shuffle.h:96:16: warning: inline function '__kmpc_impl_shfl_down_sync' is not defined [-Wundefined-inline]
inline int32_t __kmpc_impl_shfl_down_sync(int64_t Mask, int32_t Var,
               ^

Comments inline. Not totally sure this is better, code seems longer and more complicated than it was before. Aim is merging the two GPU implementations?

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
28	Extern C? Also, why forward declare instead of include header?
41	uint64_t for mask. Sign bit is just one of the lanes.
89	Seems bad, both because it's a macro instead of variant, and because I thought we'd already got rid of that macro
118	Not sure about int constants near places where the 32/64 bit distinction is important

JonChesterfield added inline comments. Jan 30 2021, 5:45 PM

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
67	Wonder if this compiles without a forward declare of the intrinsic

In D95752#2532552, @tianshilei1992 wrote:

I can see many warnings emitted, like the following one:

In file included from /home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/src/loop.cu:17:
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/include/shuffle.h:96:16: warning: inline function '__kmpc_impl_shfl_down_sync' is not defined [-Wundefined-inline]
inline int32_t __kmpc_impl_shfl_down_sync(int64_t Mask, int32_t Var,
               ^

I will fix this in clang first, though.

jdoerfert added inline comments. Jan 30 2021, 7:21 PM

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
28	because our headers are a mess. These are C++ functions, the interface is a mess too
118	That's why it's int64_t above, -1 sign extended stays -1.
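The point about `-1` can be checked in isolation. The following is a minimal sketch, not code from the patch: `widenSignedMask` and `widenUnsignedMask` are hypothetical helper names introduced only for illustration. It shows that a signed 32-bit `-1` still selects every lane after widening to a 64-bit mask, whereas an unsigned 32-bit all-ones value only covers the low 32 lanes.

```cpp
#include <stdint.h>

// Hypothetical helpers (not part of the patch): widen a 32-bit lane mask to
// the 64-bit mask type used by the shuffle header.
constexpr uint64_t widenSignedMask(int32_t Mask) {
  // Signed-to-unsigned conversion is defined modulo 2^64, so -1 becomes
  // 64 one bits (the same result as sign extension).
  return static_cast<uint64_t>(Mask);
}

constexpr uint64_t widenUnsignedMask(uint32_t Mask) {
  // Unsigned widening zero-extends, so the upper 32 lanes stay cleared.
  return static_cast<uint64_t>(Mask);
}

// Compile-time checks of both conversions.
static_assert(widenSignedMask(-1) == 0xffffffffffffffffull,
              "-1 selects all 64 lanes");
static_assert(widenUnsignedMask(0xffffffffu) == 0x00000000ffffffffull,
              "unsigned all-ones only selects the low 32 lanes");
```

This only matters on targets with more than 32 lanes per warp; for a 32-lane warp both conversions select the same lanes.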

Addressed comments: uint64_t for the mask, and static to avoid the warning. Once we have a default
impl we can actually remove the first decl variant and the static if we want.

Harbormaster completed remote builds in B87293: Diff 320338. Jan 30 2021, 10:47 PM

I'll update this tomorrow, certain parts are not great.

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
89	we will, ptx selection. I'll actually update this patch.

Replace the CUDA_VERSION macro, provide a mock default impl for shuffle

This revision is now accepted and ready to land. Jan 31 2021, 10:14 AM

jdoerfert added a parent revision: D95765: [OpenMP] Introduce the `disable_selector_propagation` variant selector trait. Jan 31 2021, 10:15 AM

tianshilei1992 added inline comments. Jan 31 2021, 10:48 AM

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131	There is no `shuffle.cpp`

Harbormaster completed remote builds in B87312: Diff 320362. Jan 31 2021, 10:50 AM

^ shuffle.cpp doesn't exist. Apart from that, I tested locally and it works properly. LGTM.

Add shuffle.cpp

Harbormaster completed remote builds in B87313: Diff 320368. Jan 31 2021, 12:00 PM

Not too keen on mixing code for different targets in the same file. Would prefer the prototype declared in a header and a nvptx.cpp, amdgcn.cpp, other.cpp implementing that interface, preferably via variant so that the whole thing compiles out for some arch.

If the 'fallback/default' works as I suspect, it makes a really nice way to implement generic versions. Ptx uses asm for an operation that amdgpu uses a shift for, would be nice to use the pure c++ one as the default and substitute in the specialised one for only the targets that use it.

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
106	I like this a lot. Way better than CUDA_VERSION macros.

In D95752#2534578, @JonChesterfield wrote:

Not too keen on mixing code for different targets in the same file. Would prefer the prototype declared in a header and a nvptx.cpp, amdgcn.cpp, other.cpp implementing that interface, preferably via variant so that the whole thing compiles out for some arch.

I'm hoping for a structure where we put a feature into a file, rather than an architecture per file. If we do both, we get 4 files per feature. This is not intrinsically bad but makes updates less convenient. At the end of the day, it doesn't matter much what we do since we perform LTO on this. I was hoping the approach with one header per feature and one cpp file for externally used functions would spur reuse. That is, helpers are defined only once, opportunities to share code are easier to spot during updates, etc.

If the 'fallback/default' works as I suspect, it makes a really nice way to implement generic versions. Ptx uses asm for an operation that amdgpu uses a shift for, would be nice to use the pure c++ one as the default and substitute in the specialised one for only the targets that use it.

So this is what's happening, at least once we have a "fallback" target model for which we can define what a "shuffle" means.

Header per target-specific-function as opposed to a list of things like __kmpc_impl_syncwarp seems reasonable. Not the way I'd have sliced it but clearly equivalent.

Minor request, let it be target/shuffle.h, so that we build up a list of functions (stuff in that directory) that new architectures need to implement in one place, distinct from code that will hopefully work out of the box for a new target.

This starts us down a path towards:

devicertl
  - src
     - parallel.cpp
     - target
        - shuffle.h

as opposed to the current nvptx/amdgcn/common split. That has the attraction of being a more conventional build system (it's weird treating one source file as openmp for one target and hip for another, the semantics don't really match) and, given we're building with clang, even using a single cmake file to build for all targets.

ronlieb added a subscriber: ronlieb. Feb 1 2021, 3:24 PM

ronlieb added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131	Is there an equivalent amdgcn CMakeLists.txt change? Should there be?

We could presumably replace the #if CUDA_VERSION >= 9000 in the target_impl.cu file (we should rename these!) with variant, orthogonal to this change. Doing that for the five instances, even just within that file, would let us significantly reduce the number of devicertl libraries compiled.

jdoerfert added inline comments. Feb 1 2021, 3:26 PM

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131	probably, I'll look and add it. FWIW, if we had tests and CI for this, e.g., AMD CI that builds the runtime for AMDGPU, that would expose such a mistake right away ;)

In D95752#2535155, @JonChesterfield wrote:

We could presumably replace the #if CUDA_VERSION >= 9000 in the target_impl.cu file (we should rename these!) with variant, orthogonal to this change. Doing that for the five instances, even just within that file, would let us significantly reduce the number of devicertl libraries compiled.

yes, it's case by case though. We should check what ptx version, or other criterion, is a good selector and replace them. Here it was rather easy in the end.

ronlieb added inline comments. Feb 1 2021, 5:01 PM

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131	I agree we really do need an AMD CI, and to get there we also need to be upstreaming our clang support. So in the spirit of making more progress on this, could you do another review of Singh's patch https://reviews.llvm.org/D94961

Address comments

Harbormaster completed remote builds in B87593: Diff 320922. Feb 2 2021, 4:12 PM

This revision was landed with ongoing or failed builds. Mar 11 2021, 9:31 PM

Closed by commit rG66ba494b4974: [OpenMP][DeviceRTL] Extract shuffle idiom and port it to declare variant (authored by jdoerfert).

This revision was automatically updated to reflect the committed changes.

jdoerfert added a commit: rG66ba494b4974: [OpenMP][DeviceRTL] Extract shuffle idiom and port it to declare variant.

JonChesterfield added inline comments. Mar 15 2021, 11:48 AM

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131	Note to self, amdgcn does indeed need shuffle.cpp added to the cmake list (plus include path)

jdoerfert mentioned this in D98677: [OpenMP][FIX] Repair accidental replacement of _shfl_sync with _shfl. Mar 15 2021, 7:53 PM

jdoerfert mentioned this in rG0a954a528b87: [OpenMP][FIX] Repair accidental replacement of _shfl_sync with _shfl. Mar 15 2021, 8:46 PM

Revision Contents

Path

Size

openmp/

libomptarget/

deviceRTLs/

amdgcn/

src/

target_impl.hip

16 lines

common/

include/

shuffle.h

124 lines

src/

data_sharing.cu

1 line

loop.cu

1 line

reduction.cu

13 lines

nvptx/

CMakeLists.txt

2 lines

src/

target_impl.cu

21 lines

target_interface.h

6 lines

Diff 320362

openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.hip

@@ ... @@ DEVICE double __kmpc_impl_get_wtime() {
   return 0;
 }

 // Warp vote function
 DEVICE __kmpc_impl_lanemask_t __kmpc_impl_activemask() {
   return __builtin_amdgcn_read_exec();
 }

-DEVICE int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t, int32_t var,
-                                     int32_t srcLane) {
-  int width = WARPSIZE;
-  int self = GetLaneId();
-  int index = srcLane + (self & ~(width - 1));
-  return __builtin_amdgcn_ds_bpermute(index << 2, var);
-}
-
-DEVICE int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t, int32_t var,
-                                          uint32_t laneDelta, int32_t width) {
-  int self = GetLaneId();
-  int index = self + laneDelta;
-  index = (int)(laneDelta + (self & (width - 1))) >= width ? self : index;
-  return __builtin_amdgcn_ds_bpermute(index << 2, var);
-}
-
 static DEVICE SHARED uint32_t L1_Barrier;

 DEVICE void __kmpc_impl_target_init() {
   // Don't have global ctors, and shared memory is not zero init
   __atomic_store_n(&L1_Barrier, 0u, __ATOMIC_RELEASE);
 }

 DEVICE void __kmpc_impl_named_sync(uint32_t num_threads) {

openmp/libomptarget/deviceRTLs/common/include/shuffle.h

This file was added.

//===- shuffle.h - OpenMP variants of the shuffle idiom for all targets -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// Shuffle function implementations for all supported targets.
//
// Note: We unify the mask type to uint64_t instead of __kmpc_impl_lanemask_t.
//
//===----------------------------------------------------------------------===//

#ifndef LIBOMPTARGET_DEVICERTL_SHUFFLE_H
#define LIBOMPTARGET_DEVICERTL_SHUFFLE_H

#include <assert.h>
#include <inttypes.h>

#pragma omp declare target

/// External shuffle API
///
///{

extern "C" {
int32_t __kmpc_shuffle_int32(int32_t val, int16_t delta, int16_t size);
int64_t __kmpc_shuffle_int64(int64_t val, int16_t delta, int16_t size);
}

///}

/// Forward declarations
///
///{
unsigned GetLaneId();
unsigned GetWarpSize();
void __kmpc_impl_unpack(uint64_t val, uint32_t &lo, uint32_t &hi);
uint64_t __kmpc_impl_pack(uint32_t lo, uint32_t hi);
///}

/// Fallback implementations of the shuffle sync idiom.
///
///{

inline int32_t __kmpc_impl_shfl_sync(uint64_t Mask, int32_t Var,
                                     int32_t SrcLane) {
  assert(false &&
         "Fallback version of __kmpc_impl_shfl_sync is not available!");
}

inline int32_t __kmpc_impl_shfl_down_sync(uint64_t Mask, int32_t Var,
                                          uint32_t Delta, int32_t Width) {
  assert(false &&
         "Fallback version of __kmpc_impl_shfl_down_sync is not available!");
}

///}

/// AMDGCN implementations of the shuffle sync idiom.
///
///{
#pragma omp begin declare variant match(device = {arch(amdgcn)})

inline int32_t __kmpc_impl_shfl_sync(uint64_t Mask, int32_t Var,
                                     int32_t SrcLane) {
  int Width = GetWarpSize();
  int Self = GetLaneId();
  int Index = SrcLane + (Self & ~(Width - 1));
  return __builtin_amdgcn_ds_bpermute(Index << 2, Var);
}

inline int32_t __kmpc_impl_shfl_down_sync(uint64_t Mask, int32_t Var,
                                          uint32_t LaneDelta, int32_t Width) {
  int Self = GetLaneId();
  int Index = Self + LaneDelta;
  Index = (int)(LaneDelta + (Self & (Width - 1))) >= Width ? Self : Index;
  return __builtin_amdgcn_ds_bpermute(Index << 2, Var);
}

#pragma omp end declare variant
///}

/// NVPTX implementations of the shuffle and shuffle sync idiom.
///
///{
#pragma omp begin declare variant match( \
    device = {arch(nvptx, nvptx64)}, \
    implementation = {extension(match_any, disable_selector_propagation)})

inline int32_t __kmpc_impl_shfl_sync(uint64_t Mask, int32_t Var,
                                     int32_t SrcLane) {
  return __nvvm_shfl_idx_i32(Var, SrcLane, 0x1f);
}

inline int32_t __kmpc_impl_shfl_down_sync(uint64_t Mask, int32_t Var,
                                          uint32_t Delta, int32_t Width) {
  int32_t T = ((GetWarpSize() - Width) << 8) | 0x1f;
  return __nvvm_shfl_down_i32(Var, Delta, T);
}

// According to clang, ptx60 and higher supports the _sync versions. Thus, we
// only need to filter ptx42 as it is the last supported ptx version below 60.
#pragma omp begin declare variant match( \
    device = {isa(ptx42)}, implementation = {extension(match_none)})
inline int32_t __kmpc_impl_shfl_sync(uint64_t Mask, int32_t Var,
                                     int32_t SrcLane) {
  return __nvvm_shfl_sync_idx_i32(Mask, Var, SrcLane, 0x1f);
}

inline int32_t __kmpc_impl_shfl_down_sync(uint64_t Mask, int32_t Var,
                                          uint32_t Delta, int32_t Width) {
  int32_t T = ((GetWarpSize() - Width) << 8) | 0x1f;
  return __nvvm_shfl_sync_down_i32(Mask, Var, Delta, T);
}
#pragma omp end declare variant

#pragma omp end declare variant
///}

#pragma omp end declare target

#endif
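To make the lane selection in the AMDGCN variant of `__kmpc_impl_shfl_down_sync` easier to follow, here is a minimal host-side sketch of just the index computation. It is an illustration under assumptions, not code from the patch: `shuffleDownSourceLane` is a hypothetical name, the lane id is passed in as a parameter instead of being read with `GetLaneId()`, and `Width` is assumed to be a power of two.

```cpp
#include <stdint.h>

// Hypothetical host-side model of the index computation in the AMDGCN
// __kmpc_impl_shfl_down_sync variant. Returns the lane whose value lane
// `Self` would read. Assumes Width is a power of two.
constexpr int shuffleDownSourceLane(int Self, uint32_t LaneDelta, int Width) {
  // (Self & (Width - 1)) is the position of Self inside its Width-sized
  // segment. If stepping down by LaneDelta would leave the segment, the lane
  // reads its own value; otherwise it reads from Self + LaneDelta.
  return static_cast<int>(LaneDelta +
                          static_cast<uint32_t>(Self & (Width - 1))) >= Width
             ? Self
             : static_cast<int>(static_cast<uint32_t>(Self) + LaneDelta);
}

// Full 64-lane segment: lane 60 reads lane 62, lane 63 would overrun.
static_assert(shuffleDownSourceLane(60, 2, 64) == 62, "in range");
static_assert(shuffleDownSourceLane(63, 2, 64) == 63, "keeps own value");
// 16-lane segments: lane 17 is position 1 of the second segment.
static_assert(shuffleDownSourceLane(17, 4, 16) == 21, "second segment");
static_assert(shuffleDownSourceLane(14, 4, 16) == 14, "segment boundary");
```

In the patch, the selected lane is then shifted left by two before being passed to `__builtin_amdgcn_ds_bpermute`; the sketch stops at the lane number.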

openmp/libomptarget/deviceRTLs/common/src/data_sharing.cu

 //===----- data_sharing.cu - OpenMP GPU data sharing ------------- CUDA -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 // This file contains the implementation of data sharing environments
 //
 //===----------------------------------------------------------------------===//
 #pragma omp declare target

 #include "common/omptarget.h"
+#include "shuffle.h"
 #include "target_impl.h"

 // Return true if this is the master thread.
 INLINE static bool IsMasterThread(bool isSPMDExecutionMode) {
   return !isSPMDExecutionMode && GetMasterThreadID() == GetThreadIdInBlock();
 }

 ////////////////////////////////////////////////////////////////////////////////
 // Runtime functions for trunk data sharing scheme.
 ////////////////////////////////////////////////////////////////////////////////

openmp/libomptarget/deviceRTLs/common/src/loop.cu

 //===------------ loop.cu - NVPTX OpenMP loop constructs --------- CUDA -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 // This file contains the implementation of the KMPC interface
 // for the loop construct plus other worksharing constructs that use the same
 // interface as loops.
 //
 //===----------------------------------------------------------------------===//
 #pragma omp declare target

 #include "common/omptarget.h"
+#include "shuffle.h"
 #include "target_impl.h"

 ////////////////////////////////////////////////////////////////////////////////
 ////////////////////////////////////////////////////////////////////////////////
 // template class that encapsulate all the helper functions
 //
 // T is loop iteration type (32 | 64) (unsigned | signed)
 // ST is the signed version of T

openmp/libomptarget/deviceRTLs/common/src/reduction.cu

	//===---- reduction.cu - GPU OpenMP reduction implementation ----- CUDA -*-===//			//===---- reduction.cu - GPU OpenMP reduction implementation ----- CUDA -*-===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file contains the implementation of reduction with KMPC interface.			// This file contains the implementation of reduction with KMPC interface.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	#pragma omp declare target			#pragma omp declare target

	#include "common/omptarget.h"			#include "common/omptarget.h"
				#include "shuffle.h"
	#include "target_impl.h"			#include "target_impl.h"

	EXTERN			EXTERN
	void __kmpc_nvptx_end_reduce(int32_t global_tid) {}			void __kmpc_nvptx_end_reduce(int32_t global_tid) {}

	EXTERN			EXTERN
	void __kmpc_nvptx_end_reduce_nowait(int32_t global_tid) {}			void __kmpc_nvptx_end_reduce_nowait(int32_t global_tid) {}

	EXTERN int32_t __kmpc_shuffle_int32(int32_t val, int16_t delta, int16_t size) {
	return __kmpc_impl_shfl_down_sync(__kmpc_impl_all_lanes, val, delta, size);
	}

	EXTERN int64_t __kmpc_shuffle_int64(int64_t val, int16_t delta, int16_t size) {
	uint32_t lo, hi;
	__kmpc_impl_unpack(val, lo, hi);
	hi = __kmpc_impl_shfl_down_sync(__kmpc_impl_all_lanes, hi, delta, size);
	lo = __kmpc_impl_shfl_down_sync(__kmpc_impl_all_lanes, lo, delta, size);
	return __kmpc_impl_pack(lo, hi);
	}
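
The removed `__kmpc_shuffle_int64` above moves a 64-bit value by shuffling its two 32-bit halves separately and packing them back together. A minimal host-side sketch of that idea follows. It is not the runtime's code: `shuffleDown32`, `shuffleDown64`, `kWarpSize`, and the array-per-warp model are hypothetical names invented for illustration, and the out-of-segment behaviour (a lane keeps its own value) is an assumption modelled on the usual shuffle-down semantics.

```cpp
#include <array>
#include <cstdint>

constexpr int kWarpSize = 32; // assumption: 32 lanes per warp, as on NVPTX

using Lanes32 = std::array<uint32_t, kWarpSize>;
using Lanes64 = std::array<uint64_t, kWarpSize>;

// Hypothetical host-side model of a 32-bit shuffle-down. Lane i reads the
// value of lane i + delta; if that lane lies outside i's width-sized segment
// (or outside the warp), lane i keeps its own value.
Lanes32 shuffleDown32(const Lanes32 &vals, int delta, int width) {
  Lanes32 out{};
  for (int lane = 0; lane < kWarpSize; ++lane) {
    int segmentEnd = (lane / width + 1) * width; // one past the segment's last lane
    int src = lane + delta;
    bool inRange = src < segmentEnd && src < kWarpSize;
    out[lane] = inRange ? vals[src] : vals[lane];
  }
  return out;
}

// 64-bit shuffle built from two 32-bit shuffles: unpack, shuffle each half,
// pack again.
Lanes64 shuffleDown64(const Lanes64 &vals, int delta, int width) {
  Lanes32 lo{}, hi{};
  for (int lane = 0; lane < kWarpSize; ++lane) {
    lo[lane] = static_cast<uint32_t>(vals[lane]);       // low half
    hi[lane] = static_cast<uint32_t>(vals[lane] >> 32); // high half
  }
  lo = shuffleDown32(lo, delta, width);
  hi = shuffleDown32(hi, delta, width);
  Lanes64 out{};
  for (int lane = 0; lane < kWarpSize; ++lane)
    out[lane] = (static_cast<uint64_t>(hi[lane]) << 32) | lo[lane];
  return out;
}
```

Because both halves use the same `delta` and `width`, each lane reads both halves from the same source lane, so the recombined value is exactly the source lane's 64-bit value.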

	INLINE static void gpu_regular_warp_reduce(void *reduce_data,			INLINE static void gpu_regular_warp_reduce(void *reduce_data,
	kmp_ShuffleReductFctPtr shflFct) {			kmp_ShuffleReductFctPtr shflFct) {
	for (uint32_t mask = WARPSIZE / 2; mask > 0; mask /= 2) {			for (uint32_t mask = WARPSIZE / 2; mask > 0; mask /= 2) {
	shflFct(reduce_data, /*LaneId - not used= */ 0,			shflFct(reduce_data, /*LaneId - not used= */ 0,
	/*Offset = */ mask, /*AlgoVersion=*/0);			/*Offset = */ mask, /*AlgoVersion=*/0);
	}			}
	}			}
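
The loop in `gpu_regular_warp_reduce` halves the shuffle offset on every step (16, 8, 4, 2, 1 for a 32-lane warp), which is the usual tree-shaped warp reduction. The sketch below models that pattern on the host for a plain integer sum. It is an illustration only: `warpReduceSum` and `kWarpSize` are hypothetical names, the reduction operator is assumed to be addition, and lanes whose source falls outside the warp simply keep their value, which is a simplification of what a compiler-generated `shflFct` would do.

```cpp
#include <array>
#include <cstdint>

constexpr uint32_t kWarpSize = 32; // assumption: 32 lanes per warp

// Tree reduction over one simulated warp. After the last step, lane 0 holds
// the sum of all lanes; the other lanes hold partial sums.
int32_t warpReduceSum(std::array<int32_t, kWarpSize> lanes) {
  for (uint32_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
    std::array<int32_t, kWarpSize> next = lanes;
    for (uint32_t lane = 0; lane < kWarpSize; ++lane) {
      uint32_t src = lane + offset; // lane we "shuffle down" from
      if (src < kWarpSize)
        next[lane] = lanes[lane] + lanes[src];
    }
    lanes = next; // all lanes update in lockstep
  }
  return lanes[0];
}
```

Five steps suffice for 32 lanes because each step halves the number of lanes that still carry an unmerged partial result.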


openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt

set(cuda_src_files		set(cuda_src_files
${devicertl_common_directory}/src/loop.cu		${devicertl_common_directory}/src/loop.cu
${devicertl_common_directory}/src/omp_data.cu		${devicertl_common_directory}/src/omp_data.cu
${devicertl_common_directory}/src/omptarget.cu		${devicertl_common_directory}/src/omptarget.cu
${devicertl_common_directory}/src/parallel.cu		${devicertl_common_directory}/src/parallel.cu
${devicertl_common_directory}/src/reduction.cu		${devicertl_common_directory}/src/reduction.cu
${devicertl_common_directory}/src/support.cu		${devicertl_common_directory}/src/support.cu
${devicertl_common_directory}/src/sync.cu		${devicertl_common_directory}/src/sync.cu
${devicertl_common_directory}/src/task.cu		${devicertl_common_directory}/src/task.cu
		${devicertl_common_directory}/src/shuffle.cpp
		tianshilei1992: There is no `shuffle.cpp`.
		ronlieb: Is there an equivalent change for the amdgcn CMakeLists.txt? Should there be?
		jdoerfert: Probably, I'll look and add it. FWIW, if we had tests and CI for this, e.g., AMD CI that builds the runtime for AMDGPU, that would expose such a mistake right away ;)
		ronlieb: I agree we really do need an AMD CI, and to get there we also need to be upstreaming our clang support. So in the spirit of making more progress on this, could you do another review of Singh's patch https://reviews.llvm.org/D94961
		JonChesterfield: Note to self, amdgcn does indeed need shuffle.cpp added to the cmake list (plus include path).
src/target_impl.cu		src/target_impl.cu
)		)

# Set flags for LLVM Bitcode compilation.		# Set flags for LLVM Bitcode compilation.
set(bc_flags -S -x c++ -O1 -std=c++14		set(bc_flags -S -x c++ -O1 -std=c++14
-target nvptx64		-target nvptx64
-Xclang -emit-llvm-bc		-Xclang -emit-llvm-bc
-Xclang -aux-triple -Xclang ${aux_triple}		-Xclang -aux-triple -Xclang ${aux_triple}
-fopenmp -fopenmp-cuda-mode -Xclang -fopenmp-is-device		-fopenmp -fopenmp-cuda-mode -Xclang -fopenmp-is-device
-D__CUDACC__		-D__CUDACC__
-I${devicertl_base_directory}		-I${devicertl_base_directory}
		-I${devicertl_common_directory}/include
-I${devicertl_nvptx_directory}/src)		-I${devicertl_nvptx_directory}/src)

if(${LIBOMPTARGET_NVPTX_DEBUG})		if(${LIBOMPTARGET_NVPTX_DEBUG})
list(APPEND bc_flags -DOMPTARGET_NVPTX_DEBUG=-1)		list(APPEND bc_flags -DOMPTARGET_NVPTX_DEBUG=-1)
else()		else()
list(APPEND bc_flags -DOMPTARGET_NVPTX_DEBUG=0)		list(APPEND bc_flags -DOMPTARGET_NVPTX_DEBUG=0)
endif()		endif()


openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu

#if CUDA_VERSION < 9020		#if CUDA_VERSION < 9020
return __nvvm_vote_ballot(1);		return __nvvm_vote_ballot(1);
#else		#else
unsigned int Mask;		unsigned int Mask;
asm volatile("activemask.b32 %0;" : "=r"(Mask));		asm volatile("activemask.b32 %0;" : "=r"(Mask));
return Mask;		return Mask;
#endif		#endif
}		}

// In Cuda 9.0, the *_sync() version takes an extra argument 'mask'.
DEVICE int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t Mask, int32_t Var,
int32_t SrcLane) {
#if CUDA_VERSION >= 9000
return __nvvm_shfl_sync_idx_i32(Mask, Var, SrcLane, 0x1f);
#else
return __nvvm_shfl_idx_i32(Var, SrcLane, 0x1f);
#endif // CUDA_VERSION
}

DEVICE int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t Mask,
int32_t Var, uint32_t Delta,
int32_t Width) {
int32_t T = ((WARPSIZE - Width) << 8) | 0x1f;
#if CUDA_VERSION >= 9000
return __nvvm_shfl_sync_down_i32(Mask, Var, Delta, T);
#else
return __nvvm_shfl_down_i32(Var, Delta, T);
#endif // CUDA_VERSION
}
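
The removed `__kmpc_impl_shfl_down_sync` above packs the shuffle width into a single control word, `((WARPSIZE - Width) << 8) | 0x1f`. As far as I understand the PTX `shfl` operand layout, bits 8 to 12 carry a segment mask and bits 0 to 4 carry the clamp value for the source lane. The sketch below only reproduces that arithmetic on the host; `shflDownControl`, `segmentMask`, `clampValue`, and `kWarpSize` are hypothetical helper names, not part of the runtime.

```cpp
#include <cstdint>

constexpr int32_t kWarpSize = 32; // assumption: matches WARPSIZE on NVPTX

// Same expression as in the diff: segment mask in the upper field, clamp
// value 0x1f (last lane of the warp) in the lower field.
constexpr int32_t shflDownControl(int32_t width) {
  return ((kWarpSize - width) << 8) | 0x1f;
}

// Decode the two packed fields again (assumed layout, see the note above).
constexpr int32_t segmentMask(int32_t control) { return (control >> 8) & 0x1f; }
constexpr int32_t clampValue(int32_t control) { return control & 0x1f; }
```

For a full-warp shuffle (`width == 32`) the segment mask is zero and the control word is just `0x1f`; for `width == 16` it becomes `0x101f`, and for `width == 8` it becomes `0x181f`.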

DEVICE void __kmpc_impl_syncthreads() { __syncthreads(); }		DEVICE void __kmpc_impl_syncthreads() { __syncthreads(); }

DEVICE void __kmpc_impl_syncwarp(__kmpc_impl_lanemask_t Mask) {		DEVICE void __kmpc_impl_syncwarp(__kmpc_impl_lanemask_t Mask) {
#if CUDA_VERSION >= 9000		#if CUDA_VERSION >= 9000
__nvvm_bar_warp_sync(Mask);		__nvvm_bar_warp_sync(Mask);
#else		#else
// In Cuda < 9.0 no need to sync threads in warps.		// In Cuda < 9.0 no need to sync threads in warps.
#endif // CUDA_VERSION		#endif // CUDA_VERSION

openmp/libomptarget/deviceRTLs/target_interface.h

	EXTERN void __kmpc_impl_unpack(uint64_t val, uint32_t &lo, uint32_t &hi);			EXTERN void __kmpc_impl_unpack(uint64_t val, uint32_t &lo, uint32_t &hi);
	EXTERN uint64_t __kmpc_impl_pack(uint32_t lo, uint32_t hi);			EXTERN uint64_t __kmpc_impl_pack(uint32_t lo, uint32_t hi);
	EXTERN __kmpc_impl_lanemask_t __kmpc_impl_lanemask_lt();			EXTERN __kmpc_impl_lanemask_t __kmpc_impl_lanemask_lt();
	EXTERN __kmpc_impl_lanemask_t __kmpc_impl_lanemask_gt();			EXTERN __kmpc_impl_lanemask_t __kmpc_impl_lanemask_gt();
	EXTERN uint32_t __kmpc_impl_smid();			EXTERN uint32_t __kmpc_impl_smid();

	EXTERN __kmpc_impl_lanemask_t __kmpc_impl_activemask();			EXTERN __kmpc_impl_lanemask_t __kmpc_impl_activemask();

	EXTERN int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t Mask, int32_t Var,
	int32_t SrcLane);
	EXTERN int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t Mask,
	int32_t Var, uint32_t Delta,
	int32_t Width);

	EXTERN void __kmpc_impl_syncthreads();			EXTERN void __kmpc_impl_syncthreads();
	EXTERN void __kmpc_impl_syncwarp(__kmpc_impl_lanemask_t Mask);			EXTERN void __kmpc_impl_syncwarp(__kmpc_impl_lanemask_t Mask);

	// Kernel initialization			// Kernel initialization
	EXTERN void __kmpc_impl_target_init();			EXTERN void __kmpc_impl_target_init();

	// Memory			// Memory
	EXTERN void *__kmpc_impl_malloc(size_t);			EXTERN void *__kmpc_impl_malloc(size_t);
	EXTERN void __kmpc_impl_free(void *);			EXTERN void __kmpc_impl_free(void *);

	// Barrier until num_threads arrive.			// Barrier until num_threads arrive.
	EXTERN void __kmpc_impl_named_sync(uint32_t num_threads);			EXTERN void __kmpc_impl_named_sync(uint32_t num_threads);

	#endif // _OMPTARGET_TARGET_INTERFACE_H_			#endif // _OMPTARGET_TARGET_INTERFACE_H_

Revision Contents

Diff 320362
