This is an archive of the discontinued LLVM Phabricator instance.


Differential D95752

[OpenMP][DeviceRTL] Extract shuffle idiom and port it to declare variant
ClosedPublic

Authored by jdoerfert on Jan 30 2021, 5:01 PM.

Details

Reviewers

JonChesterfield
tianshilei1992
bollu

Commits

rG66ba494b4974: [OpenMP][DeviceRTL] Extract shuffle idiom and port it to declare variant

Summary

The shuffle idiom is implemented differently in our supported targets.
To reduce the "target_impl" file we now move the shuffle idiom into its
own self-contained header that provides the implementation for AMDGPU
and NVPTX. A fallback can be added later on.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jdoerfert created this revision. · Jan 30 2021, 5:01 PM

Herald added a reviewer: bollu. · Jan 30 2021, 5:01 PM

Herald added subscribers: guansong, tpr, yaxunl and 2 others.

jdoerfert requested review of this revision. · Jan 30 2021, 5:01 PM

Herald added a project: Restricted Project. · Jan 30 2021, 5:01 PM

Herald added a subscriber: sstefan1.

LGTM. This will be the first example to merge different implementations into one file.

openmp/libomptarget/deviceRTLs/common/src/data_sharing.cu
17	One nit: you might have `shuffle.h` before `target_impl.h` if using `clang-format`.

This revision is now accepted and ready to land. · Jan 30 2021, 5:11 PM

Harbormaster completed remote builds in B87288: Diff 320332. · Jan 30 2021, 5:28 PM

I can see many warnings emitted, like the following one:

In file included from /home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/src/loop.cu:17:
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/include/shuffle.h:96:16: warning: inline function '__kmpc_impl_shfl_down_sync' is not defined [-Wundefined-inline]
inline int32_t __kmpc_impl_shfl_down_sync(int64_t Mask, int32_t Var,
               ^

Comments inline. Not totally sure this is better, code seems longer and more complicated than it was before. Aim is merging the two GPU implementations?

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
27 ↗	(On Diff #320332)	Extern C? Also, why forward declare instead of include header?
40 ↗	(On Diff #320332)	uint64_t for mask. Sign bit is just one of the lanes.
88 ↗	(On Diff #320332)	Seems bad, both because it's a macro instead of variant, and because I thought we'd already got rid of that macro
117 ↗	(On Diff #320332)	Not sure about int constants near places where the 32/64 bit distinction is important

JonChesterfield added inline comments. · Jan 30 2021, 5:45 PM

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
66 ↗	(On Diff #320332)	Wonder if this compiles without a forward declare of the intrinsic

In D95752#2532552, @tianshilei1992 wrote:

I can see many warnings emitted, like the following one:

In file included from /home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/src/loop.cu:17:
/home/shiltian/Documents/vscode/llvm-project/openmp/libomptarget/deviceRTLs/common/include/shuffle.h:96:16: warning: inline function '__kmpc_impl_shfl_down_sync' is not defined [-Wundefined-inline]
inline int32_t __kmpc_impl_shfl_down_sync(int64_t Mask, int32_t Var,
               ^

I will fix this in clang first, though.

jdoerfert added inline comments. · Jan 30 2021, 7:21 PM

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
27 ↗	(On Diff #320332)	because our headers are a mess. These are C++ functions, the interface is a mess too
117 ↗	(On Diff #320332)	That's why it's int64_t above, -1 sign extended stays -1.
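
The exchange about the mask type can be checked on the host. The following is a minimal sketch, assuming only standard C++; `laneIsSet` and `LowLanesOnly` are hypothetical names introduced for illustration. It shows that converting `-1` to an unsigned 64-bit mask selects every lane, and that the top bit is a lane like any other.

```cpp
#include <cstdint>

// Converting -1 to an unsigned type yields the all-ones value of that type,
// so one constant covers both 32-lane and 64-lane masks.
constexpr uint64_t AllLanes = -1;

// Hypothetical helper: report whether a lane is selected in a 64-bit mask.
constexpr bool laneIsSet(uint64_t Mask, unsigned Lane) {
  return (Mask >> Lane) & 1u;
}

static_assert(AllLanes == 0xffffffffffffffffULL, "every lane is selected");
static_assert(laneIsSet(AllLanes, 63), "the top bit is an ordinary lane");

// A 32-bit all-ones literal is zero-extended and misses lanes 32..63.
constexpr uint64_t LowLanesOnly = 0xffffffffu;
static_assert(!laneIsSet(LowLanesOnly, 32), "upper lanes are not selected");
```

With an unsigned mask there is no sign bit to special-case, which is the concern raised in the inline comment on line 40.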

Addressed comments. uint64, and static to avoid warning, once we have a default
impl we can actually remove the first decl variant and the static if we want.

Harbormaster completed remote builds in B87293: Diff 320338. · Jan 30 2021, 10:47 PM

I'll update this tomorrow, certain parts are not great.

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
88 ↗	(On Diff #320332)	we will, ptx selection. I'll actually update this patch.

Replace the CUDA_VERSION macro, provide a mock default impl for shuffle

This revision is now accepted and ready to land. · Jan 31 2021, 10:14 AM

jdoerfert added a parent revision: D95765: [OpenMP] Introduce the `disable_selector_propagation` variant selector trait. · Jan 31 2021, 10:15 AM

tianshilei1992 added inline comments. · Jan 31 2021, 10:48 AM

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131	There is no `shuffle.cpp`

Harbormaster completed remote builds in B87312: Diff 320362. · Jan 31 2021, 10:50 AM

^ shuffle.cpp doesn't exist. Except that, I tested locally and it works properly. LGTM.

Add shuffle.cpp

Harbormaster completed remote builds in B87313: Diff 320368. · Jan 31 2021, 12:00 PM

Not too keen on mixing code for different targets in the same file. Would prefer the prototype declared in a header and a nvptx.cpp, amdgcn.cpp, other.cpp implementing that interface, preferably via variant so that the whole thing compiles out for some arch.

If the 'fallback/default' works as I suspect, it makes a really nice way to implement generic versions. Ptx uses asm for an operation that amdgpu uses a shift for, would be nice to use the pure c++ one as the default and substitute in the specialised one for only the targets that use it.

openmp/libomptarget/deviceRTLs/common/include/shuffle.h
105 ↗	(On Diff #320368)	I like this a lot. Way better than CUDA_VERSION macros.

In D95752#2534578, @JonChesterfield wrote:

Not too keen on mixing code for different targets in the same file. Would prefer the prototype declared in a header and a nvptx.cpp, amdgcn.cpp, other.cpp implementing that interface, preferably via variant so that the whole thing compiles out for some arch.

I'm hoping for a structure where we put a feature into a file, rather than an architecture per file. If we do both, we get 4 files per feature. This is not intrinsically bad but makes updates less convenient. At the end of the day, it doesn't matter much what we do since we perform LTO on this. I was hoping the approach with one header per feature and one cpp file for externally used functions would spur reuse. That is, helpers are defined only once, opportunities to share code are easier to spot during updates, etc.

If the 'fallback/default' works as I suspect, it makes a really nice way to implement generic versions. Ptx uses asm for an operation that amdgpu uses a shift for, would be nice to use the pure c++ one as the default and substitute in the specialised one for only the targets that use it.

So this is what's happening, at least once we have a "fallback" target model for which we can define what a "shuffle" means.
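
The "generic default, specialised override" structure discussed here can be illustrated on the host. This is only an analogy: the patch relies on OpenMP `declare variant`, whereas the sketch below swaps in plain C++ template specialization. `GenericTarget`, `WideWarpTarget`, `WarpTraits` and the lane counts are hypothetical and not part of the runtime.

```cpp
#include <cstdint>

// Hypothetical target tags, for illustration only.
struct GenericTarget {};
struct WideWarpTarget {};

// The primary template plays the role of the fallback implementation.
template <typename Target> struct WarpTraits {
  static constexpr uint32_t laneCount() { return 32; }
};

// A specialization overrides the fallback only for the target that needs it.
template <> struct WarpTraits<WideWarpTarget> {
  static constexpr uint32_t laneCount() { return 64; }
};

static_assert(WarpTraits<GenericTarget>::laneCount() == 32, "fallback used");
static_assert(WarpTraits<WideWarpTarget>::laneCount() == 64, "override used");
```

As with the variant-based header, a new target gets the default for free and only supplies an override where its behaviour differs.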

Header per target-specific-function as opposed to a list of things like __kmpc_impl_syncwarp seems reasonable. Not the way I'd have sliced it but clearly equivalent.

Minor request, let it be target/shuffle.h, so that we build up a list of functions (stuff in that directory) that new architectures need to implement in one place, distinct from code that will hopefully work out of the box for a new target.

This starts us down a path towards:

devicertl
  - src
     - parallel.cpp
     - target
        - shuffle.h

as opposed to the current nvptx/amdgcn/common split. That has the attraction of being a more conventional build system (it's weird treating one source file as openmp for one target and hip for another, the semantics don't really match) and, given we're building with clang, even using a single cmake file to build for all targets.

ronlieb added a subscriber: ronlieb. · Feb 1 2021, 3:24 PM

ronlieb added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131	is there an amdgcn CMakeLists.txt equivalent change ? should there be ?

We could presumably replace the #if CUDA_VERSION >= 9000 in the target_impl.cu file (we should rename these!) with variant, orthogonal to this change. Doing that for the five instances, even just within that file, would let us significantly reduce the number of devicertl libraries compiled.

jdoerfert added inline comments. · Feb 1 2021, 3:26 PM

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131	probably, I'll look and add it. FWIW, if we had tests and CI for this, e.g., AMD CI that builds the runtime for AMDGPU, that would expose such a mistake right away ;)

In D95752#2535155, @JonChesterfield wrote:

We could presumably replace the #if CUDA_VERSION >= 9000 in the target_impl.cu file (we should rename these!) with variant, orthogonal to this change. Doing that for the five instances, even just within that file, would let us significantly reduce the number of devicertl libraries compiled.

yes, it's case by case though. We should check what ptx version, or other criterion, is a good selector and replace them. Here it was rather easy in the end.

ronlieb added inline comments. · Feb 1 2021, 5:01 PM

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131	i agree we really do need an AMD CI, and to get there we also need to be upstreaming our clang support. so in the spirit of making more progress on this, could you do another review of Singh's patch https://reviews.llvm.org/D94961

Address comments

Harbormaster completed remote builds in B87593: Diff 320922. · Feb 2 2021, 4:12 PM

This revision was landed with ongoing or failed builds. · Mar 11 2021, 9:31 PM

Closed by commit rG66ba494b4974: [OpenMP][DeviceRTL] Extract shuffle idiom and port it to declare variant (authored by jdoerfert).

This revision was automatically updated to reflect the committed changes.

jdoerfert added a commit: rG66ba494b4974: [OpenMP][DeviceRTL] Extract shuffle idiom and port it to declare variant.

JonChesterfield added inline comments. · Mar 15 2021, 11:48 AM

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
131	Note to self, amdgcn does indeed need shuffle.cpp added to the cmake list (plus include path)

jdoerfert mentioned this in D98677: [OpenMP][FIX] Repair accidental replacement of _shfl_sync with _shfl. · Mar 15 2021, 7:53 PM

jdoerfert mentioned this in rG0a954a528b87: [OpenMP][FIX] Repair accidental replacement of _shfl_sync with _shfl. · Mar 15 2021, 8:46 PM

Revision Contents

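Path                                                            Size
openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt            2 lines
openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.hip       16 lines
openmp/libomptarget/deviceRTLs/common/include/target/shuffle.h  107 lines
openmp/libomptarget/deviceRTLs/common/src/data_sharing.cu       1 line
openmp/libomptarget/deviceRTLs/common/src/loop.cu               1 line
openmp/libomptarget/deviceRTLs/common/src/reduction.cu          13 lines
openmp/libomptarget/deviceRTLs/common/src/shuffle.cpp           29 lines
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt             2 lines
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu         12 lines
openmp/libomptarget/deviceRTLs/target_interface.h               6 lines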

Diff 330143

openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt

[... 62 lines not shown ...] set(cuda_sources
   ${devicertl_base_directory}/common/src/data_sharing.cu
   ${devicertl_base_directory}/common/src/libcall.cu
   ${devicertl_base_directory}/common/src/loop.cu
   ${devicertl_base_directory}/common/src/omp_data.cu
   ${devicertl_base_directory}/common/src/omptarget.cu
   ${devicertl_base_directory}/common/src/parallel.cu
   ${devicertl_base_directory}/common/src/reduction.cu
   ${devicertl_base_directory}/common/src/support.cu
+  ${devicertl_base_directory}/common/src/shuffle.cpp
   ${devicertl_base_directory}/common/src/sync.cu
   ${devicertl_base_directory}/common/src/task.cu)

 set(h_files
   ${CMAKE_CURRENT_SOURCE_DIR}/src/amdgcn_interface.h
   ${CMAKE_CURRENT_SOURCE_DIR}/src/target_impl.h
   ${devicertl_base_directory}/common/debug.h
   ${devicertl_base_directory}/common/device_environment.h
[... 28 lines not shown ...] set(cu_cmd ${AOMP_BINDIR}/clang++
   -D__AMDGCN__
   -Xclang -target-cpu -Xclang ${mcpu}
   -fvisibility=default
   -Wno-unused-value
   -nogpulib
   -O${optimization_level}
   ${CUDA_DEBUG}
   -I${CMAKE_CURRENT_SOURCE_DIR}/src
+  -I${devicertl_base_directory}/common/include
   -I${devicertl_base_directory})

 set(bc1_files)

 foreach(file ${ARGN})
   get_filename_component(fname ${file} NAME_WE)
   set(bc1_filename ${fname}.${mcpu}.bc)

[... 34 lines not shown ...]

openmp/libomptarget/deviceRTLs/amdgcn/src/target_impl.hip

[... 46 lines not shown ...] EXTERN double __kmpc_impl_get_wtime() {
   return 0;
 }

 // Warp vote function
 EXTERN __kmpc_impl_lanemask_t __kmpc_impl_activemask() {
   return __builtin_amdgcn_read_exec();
 }

-EXTERN int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t, int32_t var,
-                                     int32_t srcLane) {
-  int width = WARPSIZE;
-  int self = GetLaneId();
-  int index = srcLane + (self & ~(width - 1));
-  return __builtin_amdgcn_ds_bpermute(index << 2, var);
-}
-
-EXTERN int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t, int32_t var,
-                                          uint32_t laneDelta, int32_t width) {
-  int self = GetLaneId();
-  int index = self + laneDelta;
-  index = (int)(laneDelta + (self & (width - 1))) >= width ? self : index;
-  return __builtin_amdgcn_ds_bpermute(index << 2, var);
-}
-
 uint32_t __kmpc_L1_Barrier [[clang::loader_uninitialized]];
 #pragma allocate(__kmpc_L1_Barrier) allocator(omp_pteam_mem_alloc)

 EXTERN void __kmpc_impl_target_init() {
   // Don't have global ctors, and shared memory is not zero init
   __atomic_store_n(&__kmpc_L1_Barrier, 0u, __ATOMIC_RELEASE);
 }

[... 148 lines not shown ...]

openmp/libomptarget/deviceRTLs/common/include/target/shuffle.h

This file was added.

//===- shuffle.h - OpenMP variants of the shuffle idiom for all targets -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// Shuffle function implementations for all supported targets.
//
// Note: We unify the mask type to uint64_t instead of __kmpc_impl_lanemask_t.
//
//===----------------------------------------------------------------------===//

#ifndef LIBOMPTARGET_DEVICERTL_SHUFFLE_H
#define LIBOMPTARGET_DEVICERTL_SHUFFLE_H

#include <assert.h>
#include <inttypes.h>

#pragma omp declare target

/// External shuffle API
///
///{

extern "C" {
int32_t __kmpc_shuffle_int32(int32_t val, int16_t delta, int16_t size);
int64_t __kmpc_shuffle_int64(int64_t val, int16_t delta, int16_t size);
}

///}

/// Forward declarations
///
///{
unsigned GetLaneId();
unsigned GetWarpSize();
void __kmpc_impl_unpack(uint64_t val, uint32_t &lo, uint32_t &hi);
uint64_t __kmpc_impl_pack(uint32_t lo, uint32_t hi);
///}

/// Fallback implementations of the shuffle sync idiom.
///
///{

inline int32_t __kmpc_impl_shfl_sync(uint64_t Mask, int32_t Var,
                                     int32_t SrcLane) {
  assert(false &&
         "Fallback version of __kmpc_impl_shfl_sync is not available!");
}

inline int32_t __kmpc_impl_shfl_down_sync(uint64_t Mask, int32_t Var,
                                          uint32_t Delta, int32_t Width) {
  assert(false &&
         "Fallback version of __kmpc_impl_shfl_down_sync is not available!");
}

///}

/// AMDGCN implementations of the shuffle sync idiom.
///
///{
#pragma omp begin declare variant match(device = {arch(amdgcn)})

inline int32_t __kmpc_impl_shfl_sync(uint64_t Mask, int32_t Var,
                                     int32_t SrcLane) {
  int Width = GetWarpSize();
  int Self = GetLaneId();
  int Index = SrcLane + (Self & ~(Width - 1));
  return __builtin_amdgcn_ds_bpermute(Index << 2, Var);
}

inline int32_t __kmpc_impl_shfl_down_sync(uint64_t Mask, int32_t Var,
                                          uint32_t LaneDelta, int32_t Width) {
  int Self = GetLaneId();
  int Index = Self + LaneDelta;
  Index = (int)(LaneDelta + (Self & (Width - 1))) >= Width ? Self : Index;
  return __builtin_amdgcn_ds_bpermute(Index << 2, Var);
}

#pragma omp end declare variant
///}

/// NVPTX implementations of the shuffle and shuffle sync idiom.
///
///{
#pragma omp begin declare variant match(                                       \
    device = {arch(nvptx, nvptx64)}, implementation = {extension(match_any)})

inline int32_t __kmpc_impl_shfl_sync(uint64_t Mask, int32_t Var,
                                     int32_t SrcLane) {
  return __nvvm_shfl_idx_i32(Var, SrcLane, 0x1f);
}

inline int32_t __kmpc_impl_shfl_down_sync(uint64_t Mask, int32_t Var,
                                          uint32_t Delta, int32_t Width) {
  int32_t T = ((GetWarpSize() - Width) << 8) | 0x1f;
  return __nvvm_shfl_down_i32(Var, Delta, T);
}

#pragma omp end declare variant
///}

#pragma omp end declare target

#endif
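
The AMDGCN variant of `__kmpc_impl_shfl_down_sync` in this header picks its source lane with a small piece of index arithmetic. Below is a minimal host-side sketch of that arithmetic, assuming `Width` is a power of two. `shuffleDownSourceLane` is a hypothetical name, and the lane exchange itself (`__builtin_amdgcn_ds_bpermute`) is not modelled.

```cpp
#include <cstdint>

// Hypothetical host-side stand-in for the AMDGCN index computation: returns
// the lane whose value lane `Self` would read during a shuffle-down by
// `LaneDelta` inside segments of `Width` lanes (Width assumed power of two).
constexpr int shuffleDownSourceLane(int Self, uint32_t LaneDelta,
                                    int32_t Width) {
  // (Self & (Width - 1)) is the position of this lane inside its segment.
  // If adding the offset leaves the segment, the lane keeps its own value.
  return static_cast<int>(LaneDelta + (Self & (Width - 1))) >= Width
             ? Self
             : Self + static_cast<int>(LaneDelta);
}

// Lane 3 reads from lane 5 when shifting down by 2 in an 8-lane segment.
static_assert(shuffleDownSourceLane(3, 2, 8) == 5, "in-segment read");
// Lane 6 would step past the end of its segment, so it reads itself.
static_assert(shuffleDownSourceLane(6, 4, 8) == 6, "clamped to self");
```

On the device the resulting index is shifted left by 2 before the builtin is called; as far as I understand, that scales the lane index to the byte offset `ds_bpermute` expects.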

openmp/libomptarget/deviceRTLs/common/src/data_sharing.cu

 //===----- data_sharing.cu - OpenMP GPU data sharing ------------- CUDA -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 // This file contains the implementation of data sharing environments
 //
 //===----------------------------------------------------------------------===//
 #pragma omp declare target

 #include "common/omptarget.h"
+#include "target/shuffle.h"
 #include "target_impl.h"

[Inline comment] tianshilei1992: One nit: you might have `shuffle.h` before `target_impl.h` if using `clang-format`.

 // Return true if this is the master thread.
 INLINE static bool IsMasterThread(bool isSPMDExecutionMode) {
   return !isSPMDExecutionMode && GetMasterThreadID() == GetThreadIdInBlock();
 }

 ////////////////////////////////////////////////////////////////////////////////
 // Runtime functions for trunk data sharing scheme.
 ////////////////////////////////////////////////////////////////////////////////
[... 256 lines not shown ...]

openmp/libomptarget/deviceRTLs/common/src/loop.cu

 //===------------ loop.cu - NVPTX OpenMP loop constructs --------- CUDA -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 // This file contains the implementation of the KMPC interface
 // for the loop construct plus other worksharing constructs that use the same
 // interface as loops.
 //
 //===----------------------------------------------------------------------===//
 #pragma omp declare target

 #include "common/omptarget.h"
+#include "target/shuffle.h"
 #include "target_impl.h"

 ////////////////////////////////////////////////////////////////////////////////
 ////////////////////////////////////////////////////////////////////////////////
 // template class that encapsulate all the helper functions
 //
 // T is loop iteration type (32 | 64) (unsigned | signed)
 // ST is the signed version of T
[... 738 lines not shown ...]

openmp/libomptarget/deviceRTLs/common/src/reduction.cu

 //===---- reduction.cu - GPU OpenMP reduction implementation ----- CUDA -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 // This file contains the implementation of reduction with KMPC interface.
 //
 //===----------------------------------------------------------------------===//
 #pragma omp declare target

 #include "common/omptarget.h"
+#include "target/shuffle.h"
 #include "target_impl.h"

 EXTERN
 void __kmpc_nvptx_end_reduce(int32_t global_tid) {}

 EXTERN
 void __kmpc_nvptx_end_reduce_nowait(int32_t global_tid) {}

-EXTERN int32_t __kmpc_shuffle_int32(int32_t val, int16_t delta, int16_t size) {
-  return __kmpc_impl_shfl_down_sync(__kmpc_impl_all_lanes, val, delta, size);
-}
-
-EXTERN int64_t __kmpc_shuffle_int64(int64_t val, int16_t delta, int16_t size) {
-  uint32_t lo, hi;
-  __kmpc_impl_unpack(val, lo, hi);
-  hi = __kmpc_impl_shfl_down_sync(__kmpc_impl_all_lanes, hi, delta, size);
-  lo = __kmpc_impl_shfl_down_sync(__kmpc_impl_all_lanes, lo, delta, size);
-  return __kmpc_impl_pack(lo, hi);
-}
-
 INLINE static void gpu_regular_warp_reduce(void *reduce_data,
                                            kmp_ShuffleReductFctPtr shflFct) {
   for (uint32_t mask = WARPSIZE / 2; mask > 0; mask /= 2) {
     shflFct(reduce_data, /*LaneId - not used= */ 0,
             /*Offset = */ mask, /*AlgoVersion=*/0);
   }
 }

[... 275 lines not shown ...]

openmp/libomptarget/deviceRTLs/common/src/shuffle.cpp

This file was added.

//===--- shuffle.cpp - Implementation of the external shuffle idiom API -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
//===----------------------------------------------------------------------===//

#include "target/shuffle.h"

#pragma omp declare target

static constexpr uint64_t AllLanes = -1;

int32_t __kmpc_shuffle_int32(int32_t val, int16_t delta, int16_t size) {
  return __kmpc_impl_shfl_down_sync(AllLanes, val, delta, size);
}

int64_t __kmpc_shuffle_int64(int64_t val, int16_t delta, int16_t size) {
  uint32_t lo, hi;
  __kmpc_impl_unpack(val, lo, hi);
  hi = __kmpc_impl_shfl_down_sync(AllLanes, hi, delta, size);
  lo = __kmpc_impl_shfl_down_sync(AllLanes, lo, delta, size);
  return __kmpc_impl_pack(lo, hi);
}

#pragma omp end declare target
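
`__kmpc_shuffle_int64` moves a 64-bit value by shuffling its two 32-bit halves separately and re-joining them. The sketch below shows only the split and join on the host. `unpack`, `pack` and `Halves` are hypothetical stand-ins for `__kmpc_impl_unpack` and `__kmpc_impl_pack`, whose definitions are not part of this revision; treating `lo` as the low 32 bits is an assumption based on the parameter names.

```cpp
#include <cstdint>

// Hypothetical host-side stand-ins; the device helpers are only declared,
// not defined, in the header shown in this revision.
struct Halves {
  uint32_t Lo;
  uint32_t Hi;
};

// Split a 64-bit value into its low and high 32-bit words.
constexpr Halves unpack(uint64_t Val) {
  return Halves{static_cast<uint32_t>(Val & 0xffffffffu),
                static_cast<uint32_t>(Val >> 32)};
}

// Join two 32-bit words back into one 64-bit value.
constexpr uint64_t pack(uint32_t Lo, uint32_t Hi) {
  return (static_cast<uint64_t>(Hi) << 32) | Lo;
}

// Splitting and re-joining must reproduce the input; a 32-bit shuffle applied
// to each half in between therefore moves the whole 64-bit value.
constexpr bool roundTrips(uint64_t Val) {
  return pack(unpack(Val).Lo, unpack(Val).Hi) == Val;
}

static_assert(roundTrips(0x0123456789abcdefULL), "split/join is lossless");
```

Because both halves are shuffled with the same delta and width, they always come from the same source lane, so the re-joined value is that lane's original 64-bit value.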

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt

[... 122 lines not shown ...] set(cuda_src_files
   ${devicertl_common_directory}/src/loop.cu
   ${devicertl_common_directory}/src/omp_data.cu
   ${devicertl_common_directory}/src/omptarget.cu
   ${devicertl_common_directory}/src/parallel.cu
   ${devicertl_common_directory}/src/reduction.cu
   ${devicertl_common_directory}/src/support.cu
   ${devicertl_common_directory}/src/sync.cu
   ${devicertl_common_directory}/src/task.cu
+  ${devicertl_common_directory}/src/shuffle.cpp

[Inline comment] tianshilei1992: There is no `shuffle.cpp`
[Inline comment] ronlieb: is there an amdgcn CMakeLists.txt equivalent change ? should there be ?
[Inline comment] jdoerfert: probably, I'll look and add it. FWIW, if we had tests and CI for this, e.g., AMD CI that builds the runtime for AMDGPU, that would expose such a mistake right away ;)
[Inline comment] ronlieb: i agree we really do need an AMD CI, and to get there we also need to be upstreaming our clang support. so in the spirit of making more progress on this, could you do another review of Singh's patch https://reviews.llvm.org/D94961
[Inline comment] JonChesterfield: Note to self, amdgcn does indeed need shuffle.cpp added to the cmake list (plus include path)

   src/target_impl.cu
 )

 # Set flags for LLVM Bitcode compilation.
 set(bc_flags -S -x c++ -O1 -std=c++14
   -target nvptx64
   -Xclang -emit-llvm-bc
   -Xclang -aux-triple -Xclang ${aux_triple}
   -fopenmp -fopenmp-cuda-mode -Xclang -fopenmp-is-device
   -Xclang -target-feature -Xclang +ptx61
   -D__CUDACC__
   -I${devicertl_base_directory}
+  -I${devicertl_common_directory}/include
   -I${devicertl_nvptx_directory}/src)

 if(${LIBOMPTARGET_NVPTX_DEBUG})
   list(APPEND bc_flags -DOMPTARGET_NVPTX_DEBUG=-1)
 else()
   list(APPEND bc_flags -DOMPTARGET_NVPTX_DEBUG=0)
 endif()

[... 54 lines not shown ...]

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu

[... 53 lines not shown ...]
 }

 DEVICE __kmpc_impl_lanemask_t __kmpc_impl_activemask() {
   unsigned int Mask;
   asm volatile("activemask.b32 %0;" : "=r"(Mask));
   return Mask;
 }

-DEVICE int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t Mask, int32_t Var,
-                                     int32_t SrcLane) {
-  return __nvvm_shfl_sync_idx_i32(Mask, Var, SrcLane, 0x1f);
-}
-
-DEVICE int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t Mask,
-                                          int32_t Var, uint32_t Delta,
-                                          int32_t Width) {
-  int32_t T = ((WARPSIZE - Width) << 8) | 0x1f;
-  return __nvvm_shfl_sync_down_i32(Mask, Var, Delta, T);
-}
-
 DEVICE void __kmpc_impl_syncthreads() { __syncthreads(); }

 DEVICE void __kmpc_impl_syncwarp(__kmpc_impl_lanemask_t Mask) {
   __nvvm_bar_warp_sync(Mask);
 }

 // NVPTX specific kernel initialization
 DEVICE void __kmpc_impl_target_init() { /* nvptx needs no extra setup */
[... 103 lines not shown ...]
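
The NVPTX shuffle-down packs two values into the operand `T`. The sketch below reproduces that expression on the host, assuming a warp size of 32. `shflDownOperand` and `WarpSize` are hypothetical names, and the description of the bit fields reflects my reading of the PTX `shfl` encoding rather than anything stated in this review.

```cpp
#include <cstdint>

// Assumed NVPTX warp size for this sketch.
constexpr int32_t WarpSize = 32;

// Mirrors the expression in the source. As I understand the encoding,
// bits 4:0 hold the clamp value (0x1f, the highest lane index) and
// bits 12:8 hold the mask that splits the warp into Width-sized segments.
constexpr int32_t shflDownOperand(int32_t Width) {
  return ((WarpSize - Width) << 8) | 0x1f;
}

static_assert(shflDownOperand(32) == 0x001f, "full warp, no segmentation");
static_assert(shflDownOperand(16) == 0x101f, "segments of 16 lanes");
static_assert(shflDownOperand(8) == 0x181f, "segments of 8 lanes");
```

For a power-of-two `Width`, `WarpSize - Width` sets exactly the lane-index bits that identify the segment, which is why a subtraction is enough to build the mask.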

openmp/libomptarget/deviceRTLs/target_interface.h

[... 51 lines not shown ...]
 EXTERN void __kmpc_impl_unpack(uint64_t val, uint32_t &lo, uint32_t &hi);
 EXTERN uint64_t __kmpc_impl_pack(uint32_t lo, uint32_t hi);
 EXTERN __kmpc_impl_lanemask_t __kmpc_impl_lanemask_lt();
 EXTERN __kmpc_impl_lanemask_t __kmpc_impl_lanemask_gt();
 EXTERN uint32_t __kmpc_impl_smid();

 EXTERN __kmpc_impl_lanemask_t __kmpc_impl_activemask();

-EXTERN int32_t __kmpc_impl_shfl_sync(__kmpc_impl_lanemask_t Mask, int32_t Var,
-                                     int32_t SrcLane);
-EXTERN int32_t __kmpc_impl_shfl_down_sync(__kmpc_impl_lanemask_t Mask,
-                                          int32_t Var, uint32_t Delta,
-                                          int32_t Width);
-
 EXTERN void __kmpc_impl_syncthreads();
 EXTERN void __kmpc_impl_syncwarp(__kmpc_impl_lanemask_t Mask);

 // Kernel initialization
 EXTERN void __kmpc_impl_target_init();

 // Memory
 EXTERN void *__kmpc_impl_malloc(size_t);
 EXTERN void __kmpc_impl_free(void *);

 // Barrier until num_threads arrive.
 EXTERN void __kmpc_impl_named_sync(uint32_t num_threads);

 #endif // _OMPTARGET_TARGET_INTERFACE_H_
