This is an archive of the discontinued LLVM Phabricator instance.

Differential D71404

[libomptarget][nfc] Introduce atomic wrapper function
ClosedPublic

Authored by JonChesterfield on Dec 12 2019, 2:46 AM.

Details

Reviewers

ABataev
jdoerfert
grokos

Commits

rG2caeaf2f455d: [libomptarget][nfc] Introduce atomic wrapper function

Summary

[libomptarget][nfc] Introduce atomic wrapper function

Wraps atomic functions in templates prefixed with __kmpc_atomic that
dispatch to CUDA or HIP atomic functions. Intended to be easily extended
to dispatch to OpenCL or C++ atomics for a third target.
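
A minimal host-side sketch of the wrapper idea in the summary, not the committed code. The `atomicAdd` overloads are stand-ins written with the GCC/Clang `__atomic_fetch_add` builtin so the example compiles without CUDA or HIP; on a device build they would come from the target's headers. The `sketch` namespace, the wrapper name without leading underscores, and `example()` are assumptions made for illustration.

```cpp
#include <cassert>

namespace sketch {

// Stand-ins for the target-provided overload set (hypothetical on the host;
// CUDA and HIP provide functions with this name for device code).
inline unsigned atomicAdd(unsigned *address, unsigned val) {
  return __atomic_fetch_add(address, val, __ATOMIC_SEQ_CST);
}
inline unsigned long long atomicAdd(unsigned long long *address,
                                    unsigned long long val) {
  return __atomic_fetch_add(address, val, __ATOMIC_SEQ_CST);
}

// One template that forwards to whichever overload matches T.
// Returns the value stored at `address` before the add.
template <typename T> inline T kmpc_atomic_add(T *address, T val) {
  return atomicAdd(address, val);
}

// Exercises both overloads through the wrapper.
inline void example() {
  unsigned counter = 5;
  unsigned old = kmpc_atomic_add(&counter, 3u);
  assert(old == 5 && counter == 8);

  unsigned long long wide = 1;
  assert(kmpc_atomic_add(&wide, 2ull) == 1 && wide == 3);
}

} // namespace sketch
```

A third target would only need to supply its own overload set (or a different wrapper body); call sites that use the wrapper stay unchanged.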

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

JonChesterfield created this revision. Dec 12 2019, 2:46 AM

Herald added a project: Restricted Project. Dec 12 2019, 2:46 AM

Herald added subscribers: openmp-commits, jfb, mgorny.

Harbormaster completed remote builds in B42373: Diff 233552. Dec 12 2019, 2:46 AM

consistent underbar

Harbormaster completed remote builds in B42374: Diff 233553. Dec 12 2019, 2:53 AM

Actually - turns out I don't need this for amdgcn. It appears HIP explicitly supports the same functions as cuda with the same names. https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_kernel_language.md

This abstraction is probably still a good idea for OpenCL-based builds, but could perhaps be postponed until then.

Related: D71412. Inclined to postpone this renaming / wrapping exercise until a non cuda, non hip target is proposed. Thoughts?

I think it is worth having this layer of abstraction. Aren't there other atomics except add? I remember writing a patch like this for more than atomic add (it never got in, obviously, though one should be able to find it).

In D71404#1782271, @jdoerfert wrote:

I think it is worth having this layer of abstraction. Aren't there other atomics except add?

Yep. Complete list of functions used by the current deviceRTL is add, inc, max, exch, cas in a few integer variants. The others were left out for the (admittedly minor) reduction in work available from choosing a naming convention before the find&replace.

I think I'd also prefer to have the abstraction layer. It's part of viewing common/ as C++ plus well defined extensions.

In D71404#1782646, @JonChesterfield wrote:

In D71404#1782271, @jdoerfert wrote:

I think it is worth having this layer of abstraction. Aren't there other atomics except add?

Yep. Complete list of functions used by the current deviceRTL is add, inc, max, exch, cas in a few integer variants. The others were left out for the (admittedly minor) reduction in work available from choosing a naming convention before the find&replace.

I think I'd also prefer to have the abstraction layer. It's part of viewing common/ as C++ plus well defined extensions.

I would propose:

Name them __kmpc_atomic_XXX as that matches the runtime naming scheme. Changing it locally seems not helpful.
Do the search and replace for this commit already; let's get all atomics moved in a single swoop.
Consider a template solution for both the declaration and implementation. If we get more repetition and types it might be worth it. (At least that is what I thought in D64217)

__kmpc_atomic_foo works for me, as does running sed once.

Templates I can't see in this instance. Could you sketch the implementation for one of the functions, e.g. the add example from here?

In D71404#1782813, @JonChesterfield wrote:

__kmpc_atomic_foo works for me, as does running sed once.

Templates I can't see in this instance. Could you sketch the implementation for one of the functions, e.g. the add example from here?

See D64217

/// Atomically exchange the pointee of \p Ptr with \p Val and return the
/// original value of the pointee.
template <typename T> T __kmpc_impl_atomic_exchange(T *Ptr, T Val) {
  return atomicExch(Ptr, Val);
}

Also, no need for stdint.h.

JonChesterfield mentioned this in D71446: [libomptarget] Build most of common/src for amdgcn. Dec 12 2019, 5:11 PM

In D71404#1782848, @jdoerfert wrote:

template <typename T> T __kmpc_impl_atomic_exchange(T *Ptr, T Val) {
  return atomicExch(Ptr, Val);
}

OK, cool. So we still need the list of underlying functions from cuda.h or equivalent, and if the instantiation type isn't on that list (or implicitly convertible to a type that is), we get a slightly less readable compile-time error. That seems like a good tradeoff. Agreed.

Edit: a drawback is that the underlying symbols must be exposed in the header alongside the template wrapper, so calling them directly produces no compile-time error. With extra code instead, the declaration can be separate and we get a degree of checking that the wrappers are used consistently. I'm still cautiously in favour of the templates.
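
The compile-time checking discussed in this comment can be illustrated on the host. This is a sketch under assumptions: the two `atomicExch` overloads stand in for the target's list (written with the GCC/Clang `__atomic_exchange_n` builtin), and the `has_atomic_exch` detector is a hypothetical helper that is not part of the patch. It only shows which instantiation types the overload set would accept.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

namespace sketch {

// Hypothetical host stand-ins for the target's atomicExch overload set.
inline int atomicExch(int *address, int val) {
  return __atomic_exchange_n(address, val, __ATOMIC_SEQ_CST);
}
inline unsigned atomicExch(unsigned *address, unsigned val) {
  return __atomic_exchange_n(address, val, __ATOMIC_SEQ_CST);
}

// Wrapper in the style sketched in the review. The body only compiles when
// some atomicExch overload accepts (T *, T).
template <typename T> T kmpc_impl_atomic_exchange(T *Ptr, T Val) {
  return atomicExch(Ptr, Val);
}

// Detects, without a hard error, whether the overload set has an entry
// usable for T.
template <typename T>
auto probe_exch(int)
    -> decltype(atomicExch(std::declval<T *>(), std::declval<T>()),
                std::true_type{});
template <typename T> std::false_type probe_exch(...);
template <typename T> using has_atomic_exch = decltype(probe_exch<T>(0));

static_assert(has_atomic_exch<int>::value, "int is on the list");
static_assert(has_atomic_exch<unsigned>::value, "unsigned is on the list");
// No overload takes double *, so instantiating the wrapper with double
// would fail to compile inside the wrapper body.
static_assert(!has_atomic_exch<double>::value, "double is not on the list");

inline void example() {
  unsigned word = 1;
  assert(kmpc_impl_atomic_exchange(&word, 0u) == 1 && word == 0);
}

} // namespace sketch
```

Because the stand-in overloads share a header with the wrapper, nothing stops a caller from using `atomicExch` directly, which is the drawback noted in the edit to this comment.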

There's an outstanding design point here.

Logically, the implementation is per target so should be in arch/src/target_atomic.h, with call sites including target_atomic.

However, the nvptx and amdgcn implementations will be (somewhat spuriously) the same. So either copy & paste time, or each needs to include a header from common containing the implementation which is otherwise not included anywhere.

This awkward redirection is necessary under the current scheme to allow a new arch to provide a header that is picked up by common/src/foo.cpp.

I wonder if that means a different redirection scheme is better, e.g. where headers are used from common unless one with the same name is provided under the arch. That seems error prone however.

In D71404#1782914, @JonChesterfield wrote:

There's an outstanding design point here.

Logically, the implementation is per target so should be in arch/src/target_atomic.h, with call sites including target_atomic.

However, the nvptx and amdgcn implementations will be (somewhat spuriously) the same. So either copy & paste time, or each needs to include a header from common containing the implementation which is otherwise not included anywhere.

This awkward redirection is necessary under the current scheme to allow a new arch to provide a header that is picked up by common/src/foo.cpp.

I wonder if that means a different redirection scheme is better, e.g. where headers are used from common unless one with the same name is provided under the arch. That seems error prone however.

To be honest, I did not follow this so if my response doesn't make sense let me know.

Could we have a generic atomic header with the template definition that is something like this:

template <typename T> T __kmpc_impl_atomic_exchange(T *Ptr, T Val) {
  T Tmp;
  // The read-then-write block form needs the 'capture' clause.
  #pragma omp atomic capture
  { Tmp = *Ptr; *Ptr = Val; }
  return Tmp;
}

In the target_impl.h you can provide the specialized implementation.

template <typename T> T __kmpc_impl_atomic_exchange(T *Ptr, T Val) {
  return atomicExch(Ptr, Val);
}

Only one template version will be visible at any time. Would that solve the problem?

In D71404#1782942, @jdoerfert wrote:

To be honest, I did not follow this so if my response doesn't make sense let me know.

Apologies. I'll rephrase the problem now that it's a more reasonable time of day.

Could we have a generic atomic header with the template definition that is something like this:

...

In the target_impl.h you can provide the specialized implementation.

Only one template version will be visible at any time. Would that solve the problem?

I don't believe so.

Consider common/src/loop.cu, which calls __kmpc_atomic_add. Today, that atomic add could be implemented under common/atomic.h and included as common/atomic.h or similar and all would work well.

In the future, a third target wants to provide implementations for the atomics using C++ or OpenCL, so writes their own arch/src/atomic.h. This file would then be ignored by the source under common, because it's in the unexpected place.

We can guard against this by writing the target specific implementation under nvptx/src/atomic.h, in which case common/src will include atomic.h and everything will work out as intended for the third target. This is why we have #include "target_impl.h" and #include "common/debug.h", with careful include paths - to disambiguate.

However, the amdgcn implementation would be very similar to the nvptx one. So this is a new use case for the shared source model:

Let nvptx use generic foo.h
Let amdgcn use generic foo.h
Let third party use specialised foo.h

such that common code picks up the generic case or the specialised one if available.

This could be implemented by:

code duplication
making fooi.h and putting it under common, with an amdgcn stub which declares some functions then includes fooi.h
adding complexity to cmake
possibly by playing games with include resolution order
a defaults plus override model, e.g. D68310, where we hope the defaults are broadly applicable. Can fail on a fourth target
always call unqualified headers (no common prefix) from everywhere, and include stubs that include shared text from the targets. So we pay with files like debug.h: #include "common/debug.h", but everything composes exactly correctly.

The general problem is how to statically compose source code such that the end result combines common and target specific code without collapsing under the complexity of the scheme, while using somewhat crude tools.

The last bullet point is the totally explicit representation of what one wants to happen; various other schemes move noise from the source into the build system.
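
One of the options listed in this comment, a default plus an override, can be sketched in a single file. The thread talks about include paths and stub headers, not `__has_include`; the C++17 `__has_include` check is used here only as a stand-in for "use the target's header if it exists". The header name `sketch_target_atomic.h`, the macro, and the function names are hypothetical.

```cpp
#include <cassert>

// Override hook. A target that needs its own atomics would put a header
// with this (hypothetical) name on its include path and define
// sketch::kmpc_atomic_add there.
#ifdef __has_include
#if __has_include("sketch_target_atomic.h")
#include "sketch_target_atomic.h"
#define SKETCH_TARGET_PROVIDES_ATOMICS 1
#endif
#endif

#ifndef SKETCH_TARGET_PROVIDES_ATOMICS
namespace sketch {
// Default shared by every target that did not supply an override. Uses the
// GCC/Clang __atomic builtin as a host stand-in for a device atomic.
template <typename T> T kmpc_atomic_add(T *address, T val) {
  return __atomic_fetch_add(address, val, __ATOMIC_SEQ_CST);
}
} // namespace sketch
#endif

namespace sketch {
// "Common" code: calls the wrapper without knowing which definition it got.
inline unsigned claim_slots(unsigned *next_free, unsigned count) {
  return kmpc_atomic_add(next_free, count);
}

inline void example() {
  unsigned next_free = 4;
  assert(claim_slots(&next_free, 3) == 4 && next_free == 7);
}
} // namespace sketch
```

The weakness raised in the thread shows up here too: a misnamed or misplaced override header is silently ignored and the default is used instead.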

In D71404#1783287, @JonChesterfield wrote:

In D71404#1782942, @jdoerfert wrote:

To be honest, I did not follow this so if my response doesn't make sense let me know.

Apologies. I'll rephrase the problem now that it's a more reasonable time of day.

Could we have a generic atomic header with the template definition that is something like this:

...

In the target_impl.h you can provide the specialized implementation.

Only one template version will be visible at any time. Would that solve the problem?

I don't believe so.

Consider common/src/loop.cu, which calls __kmpc_atomic_add. Today, that atomic add could be implemented under common/atomic.h and included as common/atomic.h or similar and all would work well.

In the future, a third target wants to provide implementations for the atomics using C++ or OpenCL, so writes their own arch/src/atomic.h. This file would then be ignored by the source under common, because it's in the unexpected place.

We can guard against this by writing the target specific implementation under nvptx/src/atomic.h, in which case common/src will include atomic.h and everything will work out as intended for the third target. This is why we have #include "target_impl.h" and #include "common/debug.h", with careful include paths - to disambiguate.

However, the amdgcn implementation would be very similar to the nvptx one. So this is a new use case for the shared source model:

Let nvptx use generic foo.h

Let amdgcn use generic foo.h

Let third party use specialised foo.h

such that common code picks up the generic case or the specialised one if available.

This could be implemented by:

code duplication

making fooi.h and putting it under common, with an amdgcn stub which declares some functions then includes fooi.h

adding complexity to cmake

possibly by playing games with include resolution order

a defaults plus override model, e.g. D68310, where we hope the defaults are broadly applicable. Can fail on a fourth target

always call unqualified headers (no common prefix) from everywhere, and include stubs that include shared text from the targets. So we pay with files like debug.h: #include "common/debug.h", but everything composes exactly correctly.

The general problem is how to statically compose source code such that the end result combines common and target specific code without collapsing under the complexity of the scheme, while using somewhat crude tools.

The last bullet point is the totally explicit representation of what one wants to happen; various other schemes move noise from the source into the build system.

We have two targets only right now, let's not complicate things too much. I'm fine with any solution that is reasonable and working now. I have a proposal below but other ways are fine as well.

If both targets agree on the atomics, put the code in common/atomics.h. Once we have a third target that doesn't, we put the target code in {nvptx,amdgcn,third}/atomics.h and provide a pseudo/cpu target cpu/atomics.h with the generic OpenMP-based template. In cmake we choose which include path you get, as we do now. (That is, I would not insist on avoiding code duplication.)

That sounds pragmatic. A single header located under common/ will work for nvptx and amdgcn without #ifdef, so let's go with that. We can rearrange the code if necessary when a third target arises or when amdgcn/nvptx wants to implement the atomics differently.

(amdgcn presently has a choice between clang builtins or one of the hip, opencl, hc libraries for the implementation)

In D71404#1783828, @JonChesterfield wrote:

That sounds pragmatic. A single header located under common/ will work for nvptx and amdgcn without #ifdef, so let's go with that. We can rearrange the code if necessary when a third target arises or when amdgcn/nvptx wants to implement the atomics differently.

(amdgcn presently has a choice between clang builtins or one of the hip, opencl, hc libraries for the implementation)

Let's do that then.

In D71404#1783988, @jdoerfert wrote:

Let's do that then.

Yep. Just a note to say this is not forgotten; it merely hasn't hit the top of the priority queue just yet.

Change to template implementation

Herald added a project: Restricted Project. Dec 17 2019, 6:27 PM

Herald added subscribers: llvm-commits, dexonsmith, mgrang, jvesely.

Harbormaster completed remote builds in B42705: Diff 234439. Dec 17 2019, 6:27 PM

Change to template implementation

Harbormaster completed remote builds in B42706: Diff 234440. Dec 17 2019, 6:36 PM

LGTM.

Thanks for changing all atomics now.

This revision is now accepted and ready to land. Dec 18 2019, 9:16 AM

JonChesterfield edited the summary of this revision. Dec 18 2019, 12:05 PM

Herald added a subscriber: Anastasia. Dec 18 2019, 12:05 PM

Closed by commit rG2caeaf2f455d: [libomptarget][nfc] Introduce atomic wrapper function (authored by JonChesterfield). Dec 18 2019, 12:14 PM

This revision was automatically updated to reflect the committed changes.

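Revision Contents

Path	Size
openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt	1 line
openmp/libomptarget/deviceRTLs/common/omptargeti.h	6 lines
openmp/libomptarget/deviceRTLs/common/src/libcall.cu	1 line
openmp/libomptarget/deviceRTLs/common/src/loop.cu	9 lines
openmp/libomptarget/deviceRTLs/common/src/reduction.cu	13 lines
openmp/libomptarget/deviceRTLs/common/state-queuei.h	19 lines
openmp/libomptarget/deviceRTLs/common/target_atomic.h	38 lines
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu	9 lines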

Diff 234596

openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt

@@ 70 lines not shown @@ set(h_files
   ${CMAKE_CURRENT_SOURCE_DIR}/src/amdgcn_interface.h
   ${CMAKE_CURRENT_SOURCE_DIR}/src/hip_atomics.h
   ${CMAKE_CURRENT_SOURCE_DIR}/src/target_impl.h
   ${devicertl_base_directory}/common/debug.h
   ${devicertl_base_directory}/common/device_environment.h
   ${devicertl_base_directory}/common/omptarget.h
   ${devicertl_base_directory}/common/omptargeti.h
   ${devicertl_base_directory}/common/state-queue.h
+  ${devicertl_base_directory}/common/target_atomic.h
   ${devicertl_base_directory}/common/state-queuei.h
   ${devicertl_base_directory}/common/support.h)

 # for both in-tree and out-of-tree build
 if (NOT CMAKE_ARCHIVE_OUTPUT_DIRECTORY)
   set(OUTPUTDIR ${CMAKE_CURRENT_BINARY_DIR})
 else()
   set(OUTPUTDIR ${CMAKE_ARCHIVE_OUTPUT_DIRECTORY})
@@ 62 lines not shown @@

openmp/libomptarget/deviceRTLs/common/omptargeti.h

 //===---- omptargeti.h - OpenMP GPU initialization --------------- CUDA -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 // This file contains the declarations of all library macros, types,
 // and functions.
 //
 //===----------------------------------------------------------------------===//

+#include "common/target_atomic.h"
+
 ////////////////////////////////////////////////////////////////////////////////
 // Task Descriptor
 ////////////////////////////////////////////////////////////////////////////////

 INLINE omp_sched_t omptarget_nvptx_TaskDescr::GetRuntimeSched() const {
   // sched starts from 1..4; encode it as 0..3; so add 1 here
   uint8_t rc = (items.flags & TaskDescr_SchedMask) + 1;
   return (omp_sched_t)rc;
@@ 180 lines not shown @@
 ////////////////////////////////////////////////////////////////////////////////

 INLINE void omptarget_nvptx_SimpleMemoryManager::Release() {
   ASSERT0(LT_FUSSY, usedSlotIdx < MAX_SM,
           "SlotIdx is too big or uninitialized.");
   ASSERT0(LT_FUSSY, usedMemIdx < OMP_STATE_COUNT,
           "MemIdx is too big or uninitialized.");
   MemDataTy &MD = MemData[usedSlotIdx];
-  atomicExch((unsigned *)&MD.keys[usedMemIdx], 0);
+  __kmpc_atomic_exchange((unsigned *)&MD.keys[usedMemIdx], 0u);
 }

 INLINE const void *omptarget_nvptx_SimpleMemoryManager::Acquire(const void *buf,
                                                                  size_t size) {
   ASSERT0(LT_FUSSY, usedSlotIdx < MAX_SM,
           "SlotIdx is too big or uninitialized.");
   const unsigned sm = usedSlotIdx;
   MemDataTy &MD = MemData[sm];
   unsigned i = hash(GetBlockIdInKernel());
-  while (atomicCAS((unsigned *)&MD.keys[i], 0, 1) != 0) {
+  while (__kmpc_atomic_cas((unsigned *)&MD.keys[i], 0u, 1u) != 0) {
     i = hash(i + 1);
   }
   usedSlotIdx = sm;
   usedMemIdx = i;
   return static_cast<const char *>(buf) + (sm * OMP_STATE_COUNT + i) * size;
 }
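
A likely reason for the `0` to `0u` changes in this file, stated as an inference since the review does not spell it out: with a wrapper of the form `template <typename T> T f(T *, T)`, the pointer argument and the value argument must deduce the same `T`. The host sketch below checks that at compile time. `atomicExch` here is a stand-in built on the GCC/Clang `__atomic_exchange_n` builtin, and `wrapper_accepts` is a hypothetical detector.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

namespace sketch {

// Hypothetical host stand-in for the non-template target function.
inline unsigned atomicExch(unsigned *address, unsigned val) {
  return __atomic_exchange_n(address, val, __ATOMIC_SEQ_CST);
}

// Template wrapper: T appears in both parameters, so both arguments must
// agree on T for deduction to succeed.
template <typename T> T kmpc_atomic_exchange(T *address, T val) {
  return atomicExch(address, val);
}

// Yields std::true_type when kmpc_atomic_exchange(Ptr, Val) is callable
// with T deduced from the arguments, std::false_type otherwise.
template <typename Ptr, typename Val>
auto wrapper_accepts(int)
    -> decltype(kmpc_atomic_exchange(std::declval<Ptr>(), std::declval<Val>()),
                std::true_type{});
template <typename Ptr, typename Val> std::false_type wrapper_accepts(...);

// Calling the plain function with the int literal 0 works: int converts.
static_assert(
    std::is_same<decltype(atomicExch(std::declval<unsigned *>(), 0)),
                 unsigned>::value,
    "non-template call converts int to unsigned");
// Through the wrapper, an int value deduces T = int while the pointer
// deduces T = unsigned, so the call is rejected.
static_assert(!decltype(wrapper_accepts<unsigned *, int>(0))::value,
              "conflicting deduction");
// An unsigned value (such as 0u) makes both deductions agree.
static_assert(decltype(wrapper_accepts<unsigned *, unsigned>(0))::value,
              "consistent deduction");

inline void example() {
  unsigned key = 1;
  assert(kmpc_atomic_exchange(&key, 0u) == 1 && key == 0);
}

} // namespace sketch
```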

openmp/libomptarget/deviceRTLs/common/src/libcall.cu

 //===------------ libcall.cu - OpenMP GPU user calls ------------- CUDA -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 // This file implements the OpenMP runtime functions that can be
 // invoked by the user in an OpenMP region
 //
 //===----------------------------------------------------------------------===//

 #include "common/omptarget.h"
+#include "common/target_atomic.h"
 #include "target_impl.h"

 EXTERN double omp_get_wtick(void) {
   double rc = __target_impl_get_wtick();
   PRINT(LD_IO, "omp_get_wtick() returns %g\n", rc);
   return rc;
 }

@@ 391 lines not shown @@

openmp/libomptarget/deviceRTLs/common/src/loop.cu

 //===------------ loop.cu - NVPTX OpenMP loop constructs --------- CUDA -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 // This file contains the implementation of the KMPC interface
 // for the loop construct plus other worksharing constructs that use the same
 // interface as loops.
 //
 //===----------------------------------------------------------------------===//

 #include "common/omptarget.h"
 #include "target_impl.h"
+#include "common/target_atomic.h"

 ////////////////////////////////////////////////////////////////////////////////
 ////////////////////////////////////////////////////////////////////////////////
 // template class that encapsulate all the helper functions
 //
 // T is loop iteration type (32 | 64) (unsigned | signed)
 // ST is the signed version of T
 ////////////////////////////////////////////////////////////////////////////////
@@ 367 lines not shown @@ public:
   INLINE static uint64_t NextIter() {
     __kmpc_impl_lanemask_t active = __kmpc_impl_activemask();
     uint32_t leader = __kmpc_impl_ffs(active) - 1;
     uint32_t change = __kmpc_impl_popc(active);
     __kmpc_impl_lanemask_t lane_mask_lt = __kmpc_impl_lanemask_lt();
     unsigned int rank = __kmpc_impl_popc(active & lane_mask_lt);
     uint64_t warp_res;
     if (rank == 0) {
-      warp_res = atomicAdd(
+      warp_res = __kmpc_atomic_add(
           (unsigned long long *)&omptarget_nvptx_threadPrivateContext->Cnt(),
-          change);
+          (unsigned long long)change);
     }
     warp_res = Shuffle(active, warp_res, leader);
     return warp_res + rank;
   }

   INLINE static int DynamicNextChunk(T &lb, T &ub, T chunkSize,
                                      T loopLowerBound, T loopUpperBound) {
     T N = NextIter();
@@ 376 lines not shown @@ if (gtid == 0)
       *Buffer = 0; // Reset to minimum loop iteration value.

     // Barrier.
     syncWorkersInGenericMode(NumThreads);

     // Atomic max of iterations.
     uint64_t *varArray = (uint64_t *)array;
     uint64_t elem = varArray[i];
-    (void)atomicMax((unsigned long long int *)Buffer,
-                    (unsigned long long int)elem);
+    (void)__kmpc_atomic_max((unsigned long long int *)Buffer,
+                            (unsigned long long int)elem);

     // Barrier.
     syncWorkersInGenericMode(NumThreads);

     // Read max value and update thread private array.
     varArray[i] = *Buffer;

     // Barrier.
     syncWorkersInGenericMode(NumThreads);
   }
 }
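
The `NextIter` hunk in this file keeps a pattern in which only the lowest active lane performs the atomic add, on behalf of every active lane, and each lane then adds its rank to the broadcast result. The following is a sequential host simulation of that arithmetic, not device code: there are no real lanes, shuffles, or atomics here. The names `popc`, `LaneResult`, and `next_iter_for_warp` are hypothetical, and `__builtin_popcount` is a GCC/Clang builtin.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

namespace sketch {

// Number of set bits in a 32-bit lane mask.
inline uint32_t popc(uint32_t x) {
  return static_cast<uint32_t>(__builtin_popcount(x));
}

struct LaneResult {
  uint32_t lane; // lane index within the warp
  uint64_t iter; // iteration number handed to that lane
};

// For every lane set in `active`, computes the value the NextIter pattern
// would return, given a shared counter. Inactive lanes get no entry.
inline std::vector<LaneResult> next_iter_for_warp(uint32_t active,
                                                  uint64_t &counter) {
  std::vector<LaneResult> out;
  if (active == 0)
    return out;
  const uint32_t change = popc(active);
  // The rank-0 lane (lowest active lane) does one add for the whole warp
  // and its result is broadcast to the other lanes.
  const uint64_t warp_res = counter;
  counter += change;
  for (uint32_t lane = 0; lane < 32; ++lane) {
    if (!(active & (1u << lane)))
      continue;
    const uint32_t lane_mask_lt = (1u << lane) - 1u; // lanes below this one
    const uint32_t rank = popc(active & lane_mask_lt);
    out.push_back({lane, warp_res + rank});
  }
  return out;
}

inline void example() {
  uint64_t counter = 0;
  std::vector<LaneResult> r = next_iter_for_warp(0x3u, counter);
  assert(r.size() == 2 && r[0].iter == 0 && r[1].iter == 1 && counter == 2);
}

} // namespace sketch
```

With four active lanes, one add of 4 replaces four separate adds of 1, which is the point of aggregating at warp level.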

openmp/libomptarget/deviceRTLs/common/src/reduction.cu

//===---- reduction.cu - GPU OpenMP reduction implementation ----- CUDA -*-===//		//===---- reduction.cu - GPU OpenMP reduction implementation ----- CUDA -*-===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This file contains the implementation of reduction with KMPC interface.		// This file contains the implementation of reduction with KMPC interface.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "common/omptarget.h"		#include "common/omptarget.h"
		#include "common/target_atomic.h"
#include "target_impl.h"		#include "target_impl.h"

EXTERN		EXTERN
void __kmpc_nvptx_end_reduce(int32_t global_tid) {}		void __kmpc_nvptx_end_reduce(int32_t global_tid) {}

EXTERN		EXTERN
void __kmpc_nvptx_end_reduce_nowait(int32_t global_tid) {}		void __kmpc_nvptx_end_reduce_nowait(int32_t global_tid) {}

▲ Show 20 Lines • Show All 215 Lines • ▼ Show 20 Lines	if (ThreadId == 0) {
char *scratchpad = GetTeamsReductionScratchpad();		char *scratchpad = GetTeamsReductionScratchpad();

scratchFct(reduce_data, scratchpad, TeamId, NumTeams);		scratchFct(reduce_data, scratchpad, TeamId, NumTeams);
__kmpc_impl_threadfence();		__kmpc_impl_threadfence();

// atomicInc increments 'timestamp' and has a range [0, NumTeams-1].		// atomicInc increments 'timestamp' and has a range [0, NumTeams-1].
// It resets 'timestamp' back to 0 once the last team increments		// It resets 'timestamp' back to 0 once the last team increments
// this counter.		// this counter.
unsigned val = atomicInc(timestamp, NumTeams - 1);		unsigned val = __kmpc_atomic_inc(timestamp, NumTeams - 1);
IsLastTeam = val == NumTeams - 1;		IsLastTeam = val == NumTeams - 1;
}		}

// We have to wait on L1 barrier because in GENERIC mode the workers		// We have to wait on L1 barrier because in GENERIC mode the workers
// are waiting on barrier 0 for work.		// are waiting on barrier 0 for work.
//		//
// If we guard this barrier as follows it leads to deadlock, probably		// If we guard this barrier as follows it leads to deadlock, probably
// because of a compiler bug: if (!IsGenericMode()) __syncthreads();		// because of a compiler bug: if (!IsGenericMode()) __syncthreads();
▲ Show 20 Lines • Show All 118 Lines • ▼ Show 20 Lines
}		}

EXTERN int32_t __kmpc_nvptx_teams_reduce_nowait_simple(kmp_Ident *loc,		EXTERN int32_t __kmpc_nvptx_teams_reduce_nowait_simple(kmp_Ident *loc,
int32_t global_tid,		int32_t global_tid,
kmp_CriticalName *crit) {		kmp_CriticalName *crit) {
if (checkSPMDMode(loc) && GetThreadIdInBlock() != 0)		if (checkSPMDMode(loc) && GetThreadIdInBlock() != 0)
return 0;		return 0;
// The master thread of the team actually does the reduction.		// The master thread of the team actually does the reduction.
while (atomicCAS((uint32_t *)crit, 0, 1))		while (__kmpc_atomic_cas((uint32_t *)crit, 0u, 1u))
;		;
return 1;		return 1;
}		}

EXTERN void		EXTERN void
__kmpc_nvptx_teams_end_reduce_nowait_simple(kmp_Ident *loc, int32_t global_tid,		__kmpc_nvptx_teams_end_reduce_nowait_simple(kmp_Ident *loc, int32_t global_tid,
kmp_CriticalName *crit) {		kmp_CriticalName *crit) {
__kmpc_impl_threadfence_system();		__kmpc_impl_threadfence_system();
(void)atomicExch((uint32_t *)crit, 0);		(void)__kmpc_atomic_exchange((uint32_t *)crit, 0u);
}		}

INLINE static bool isMaster(kmp_Ident *loc, uint32_t ThreadId) {		INLINE static bool isMaster(kmp_Ident *loc, uint32_t ThreadId) {
return checkGenericMode(loc) \|\| IsTeamMaster(ThreadId);		return checkGenericMode(loc) \|\| IsTeamMaster(ThreadId);
}		}

INLINE static uint32_t roundToWarpsize(uint32_t s) {		INLINE static uint32_t roundToWarpsize(uint32_t s) {
if (s < WARPSIZE)		if (s < WARPSIZE)
Show All 28 Lines	EXTERN int32_t __kmpc_nvptx_teams_reduce_nowait_v2(
SHARED unsigned ChunkTeamCount;		SHARED unsigned ChunkTeamCount;

// Block progress for teams greater than the current upper		// Block progress for teams greater than the current upper
// limit. We always only allow a number of teams less or equal		// limit. We always only allow a number of teams less or equal
// to the number of slots in the buffer.		// to the number of slots in the buffer.
bool IsMaster = isMaster(loc, ThreadId);		bool IsMaster = isMaster(loc, ThreadId);
while (IsMaster) {		while (IsMaster) {
// Atomic read		// Atomic read
Bound = atomicAdd((uint32_t *)&IterCnt, 0);		Bound = __kmpc_atomic_add((uint32_t *)&IterCnt, 0u);
if (TeamId < Bound + num_of_records)		if (TeamId < Bound + num_of_records)
break;		break;
}		}

if (IsMaster) {		if (IsMaster) {
int ModBockId = TeamId % num_of_records;		int ModBockId = TeamId % num_of_records;
if (TeamId < num_of_records)		if (TeamId < num_of_records)
lgcpyFct(global_buffer, ModBockId, reduce_data);		lgcpyFct(global_buffer, ModBockId, reduce_data);
else		else
lgredFct(global_buffer, ModBockId, reduce_data);		lgredFct(global_buffer, ModBockId, reduce_data);
__kmpc_impl_threadfence_system();		__kmpc_impl_threadfence_system();

// Increment team counter.		// Increment team counter.
// This counter is incremented by all teams in the current		// This counter is incremented by all teams in the current
// BUFFER_SIZE chunk.		// BUFFER_SIZE chunk.
ChunkTeamCount = atomicInc((uint32_t *)&Cnt, num_of_records - 1);		ChunkTeamCount = __kmpc_atomic_inc((uint32_t *)&Cnt, num_of_records - 1u);
}		}
// Synchronize		// Synchronize
if (checkSPMDMode(loc))		if (checkSPMDMode(loc))
__kmpc_barrier(loc, global_tid);		__kmpc_barrier(loc, global_tid);

// reduce_data is global or shared so before being reduced within the		// reduce_data is global or shared so before being reduced within the
// warp we need to bring it in local memory:		// warp we need to bring it in local memory:
// local_reduce_data = reduce_data[i]		// local_reduce_data = reduce_data[i]
[58 lines not shown]	if (IsMaster) {
IterCnt = 0;		IterCnt = 0;
return 1;		return 1;
}		}
return 0;		return 0;
}		}
if (IsMaster && ChunkTeamCount == num_of_records - 1) {		if (IsMaster && ChunkTeamCount == num_of_records - 1) {
// Allow SIZE number of teams to proceed writing their		// Allow SIZE number of teams to proceed writing their
// intermediate results to the global buffer.		// intermediate results to the global buffer.
atomicAdd((uint32_t *)&IterCnt, num_of_records);		__kmpc_atomic_add((uint32_t *)&IterCnt, uint32_t(num_of_records));
}		}

return 0;		return 0;
}		}
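The team counter in the hunk above depends on the wrap-around behaviour of CUDA's atomicInc, which stores `(old >= limit) ? 0 : old + 1` and returns `old`. A minimal host-side sketch of that behaviour follows, assuming the GCC/Clang `__atomic` builtins are available; `host_atomic_inc` is a hypothetical name and is not part of this patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side emulation of CUDA's atomicInc (not part of the patch).
// Stores (old >= limit) ? 0 : old + 1 and returns the old value.
// Assumes the GCC/Clang __atomic builtins.
inline uint32_t host_atomic_inc(uint32_t *address, uint32_t limit) {
  uint32_t old = __atomic_load_n(address, __ATOMIC_SEQ_CST);
  uint32_t next;
  do {
    next = (old >= limit) ? 0u : old + 1u;
    // On failure the builtin refreshes 'old' with the current value,
    // so the loop recomputes 'next' and retries.
  } while (!__atomic_compare_exchange_n(address, &old, next, /*weak=*/false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
  return old;
}
```

With a limit of `num_of_records - 1`, the caller that receives `num_of_records - 1` is the last team of the chunk, and the counter is back at zero for the next chunk. That is the condition tested by `ChunkTeamCount == num_of_records - 1` in the diff.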

openmp/libomptarget/deviceRTLs/common/state-queuei.h

	//===------- state-queue.cu - NVPTX OpenMP GPU State Queue ------- CUDA -*-===//			//===------- state-queuei.h - OpenMP GPU State Queue ------------- CUDA -*-===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file contains the implementation of a queue to hand out OpenMP state			// This file contains the implementation of a queue to hand out OpenMP state
	// objects to teams of one or more kernels.			// objects to teams of one or more kernels.
	//			//
	// Reference:			// Reference:
	// Thomas R.W. Scogland and Wu-chun Feng. 2015.			// Thomas R.W. Scogland and Wu-chun Feng. 2015.
	// Design and Evaluation of Scalable Concurrent Queues for Many-Core			// Design and Evaluation of Scalable Concurrent Queues for Many-Core
	// Architectures. International Conference on Performance Engineering.			// Architectures. International Conference on Performance Engineering.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "state-queue.h"			#include "state-queue.h"
				#include "common/target_atomic.h"

	template <typename ElementType, uint32_t SIZE>			template <typename ElementType, uint32_t SIZE>
	INLINE uint32_t omptarget_nvptx_Queue<ElementType, SIZE>::ENQUEUE_TICKET() {			INLINE uint32_t omptarget_nvptx_Queue<ElementType, SIZE>::ENQUEUE_TICKET() {
	return atomicAdd((unsigned int *)&tail, 1);			return __kmpc_atomic_add((unsigned int *)&tail, 1u);
	}			}

	template <typename ElementType, uint32_t SIZE>			template <typename ElementType, uint32_t SIZE>
	INLINE uint32_t omptarget_nvptx_Queue<ElementType, SIZE>::DEQUEUE_TICKET() {			INLINE uint32_t omptarget_nvptx_Queue<ElementType, SIZE>::DEQUEUE_TICKET() {
	return atomicAdd((unsigned int *)&head, 1);			return __kmpc_atomic_add((unsigned int *)&head, 1u);
	}			}

	template <typename ElementType, uint32_t SIZE>			template <typename ElementType, uint32_t SIZE>
	INLINE uint32_t			INLINE uint32_t
	omptarget_nvptx_Queue<ElementType, SIZE>::ID(uint32_t ticket) {			omptarget_nvptx_Queue<ElementType, SIZE>::ID(uint32_t ticket) {
	return (ticket / SIZE) * 2;			return (ticket / SIZE) * 2;
	}			}

	template <typename ElementType, uint32_t SIZE>			template <typename ElementType, uint32_t SIZE>
	INLINE bool omptarget_nvptx_Queue<ElementType, SIZE>::IsServing(uint32_t slot,			INLINE bool omptarget_nvptx_Queue<ElementType, SIZE>::IsServing(uint32_t slot,
	uint32_t id) {			uint32_t id) {
	return atomicAdd((unsigned int *)&ids[slot], 0) == id;			return __kmpc_atomic_add((unsigned int *)&ids[slot], 0u) == id;
	}			}

	template <typename ElementType, uint32_t SIZE>			template <typename ElementType, uint32_t SIZE>
	INLINE void			INLINE void
	omptarget_nvptx_Queue<ElementType, SIZE>::PushElement(uint32_t slot,			omptarget_nvptx_Queue<ElementType, SIZE>::PushElement(uint32_t slot,
	ElementType *element) {			ElementType *element) {
	atomicExch((unsigned long long *)&elementQueue[slot],			__kmpc_atomic_exchange((unsigned long long *)&elementQueue[slot],
	(unsigned long long)element);			(unsigned long long)element);
	}			}

	template <typename ElementType, uint32_t SIZE>			template <typename ElementType, uint32_t SIZE>
	INLINE ElementType *			INLINE ElementType *
	omptarget_nvptx_Queue<ElementType, SIZE>::PopElement(uint32_t slot) {			omptarget_nvptx_Queue<ElementType, SIZE>::PopElement(uint32_t slot) {
	return (ElementType *)atomicAdd((unsigned long long *)&elementQueue[slot],			return (ElementType *)__kmpc_atomic_add(
	(unsigned long long)0);			(unsigned long long *)&elementQueue[slot], (unsigned long long)0);
	}			}

	template <typename ElementType, uint32_t SIZE>			template <typename ElementType, uint32_t SIZE>
	INLINE void omptarget_nvptx_Queue<ElementType, SIZE>::DoneServing(uint32_t slot,			INLINE void omptarget_nvptx_Queue<ElementType, SIZE>::DoneServing(uint32_t slot,
	uint32_t id) {			uint32_t id) {
	atomicExch((unsigned int *)&ids[slot], (id + 1) % MAX_ID);			__kmpc_atomic_exchange((unsigned int *)&ids[slot], (id + 1) % MAX_ID);
	}			}

	template <typename ElementType, uint32_t SIZE>			template <typename ElementType, uint32_t SIZE>
	INLINE void			INLINE void
	omptarget_nvptx_Queue<ElementType, SIZE>::Enqueue(ElementType *element) {			omptarget_nvptx_Queue<ElementType, SIZE>::Enqueue(ElementType *element) {
	uint32_t ticket = ENQUEUE_TICKET();			uint32_t ticket = ENQUEUE_TICKET();
	uint32_t slot = ticket % SIZE;			uint32_t slot = ticket % SIZE;
	uint32_t id = ID(ticket) + 1;			uint32_t id = ID(ticket) + 1;
	[20 lines not shown]
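
The visible part of the queue hands out tickets with a fetch-and-add and derives a slot and an id from each ticket. The sketch below models only that arithmetic on the host, using `std::atomic` in place of the device atomics. `TicketModel` and its member names are hypothetical, and the parts of `Enqueue` that are not shown in the diff are deliberately left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of the ticket arithmetic (not part of the patch).
template <uint32_t SIZE> struct TicketModel {
  std::atomic<uint32_t> tail{0};

  // Mirrors ENQUEUE_TICKET(): fetch-and-add returns the previous value,
  // so every caller gets a distinct ticket.
  uint32_t enqueue_ticket() { return tail.fetch_add(1u); }

  // Mirrors "ticket % SIZE" in Enqueue(): the slot wraps around the buffer.
  static uint32_t slot(uint32_t ticket) { return ticket % SIZE; }

  // Mirrors ID(): (ticket / SIZE) * 2, i.e. the id advances by two
  // each time the ticket counter completes a lap of the buffer.
  static uint32_t id(uint32_t ticket) { return (ticket / SIZE) * 2; }
};
```

For `SIZE == 4`, tickets 0 to 3 map to slots 0 to 3 with id 0, and ticket 5 maps to slot 1 with id 2.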

openmp/libomptarget/deviceRTLs/common/target_atomic.h

This file was added.

				//===---- target_atomic.h - OpenMP GPU target atomic functions ---- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// Declarations of atomic functions provided by each target
				//
				//===----------------------------------------------------------------------===//

				#ifndef OMPTARGET_TARGET_ATOMIC_H
				#define OMPTARGET_TARGET_ATOMIC_H

				#include "target_impl.h"

				template <typename T> INLINE T __kmpc_atomic_add(T *address, T val) {
				return atomicAdd(address, val);
				}

				template <typename T> INLINE T __kmpc_atomic_inc(T *address, T val) {
				return atomicInc(address, val);
				}

				template <typename T> INLINE T __kmpc_atomic_max(T *address, T val) {
				return atomicMax(address, val);
				}

				template <typename T> INLINE T __kmpc_atomic_exchange(T *address, T val) {
				return atomicExch(address, val);
				}

				template <typename T> INLINE T __kmpc_atomic_cas(T *address, T compare, T val) {
				return atomicCAS(address, compare, val);
				}

				#endif
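
Each wrapper above takes `T *address` and `T val` with a single template parameter, so both arguments take part in deduction. Passing a `uint32_t *` together with a plain `0` would deduce `T` as both `uint32_t` and `int`, which is presumably why the call sites in this diff switch to `0u`, `1u` and `uint32_t(num_of_records)`. The host-side sketch below illustrates this, assuming the GCC/Clang `__atomic_fetch_add` builtin as a stand-in for `atomicAdd`. The `sketch_` names and the `identity` helper are hypothetical, and the non-deduced variant is an alternative that the patch does not use.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host stand-in with the same shape as the wrapper in the patch.
// Uses the GCC/Clang __atomic_fetch_add builtin instead of CUDA's atomicAdd.
template <typename T> inline T sketch_atomic_add(T *address, T val) {
  return __atomic_fetch_add(address, val, __ATOMIC_SEQ_CST);
}

// Alternative, also hypothetical: put 'val' in a non-deduced context so that
// T comes from the pointer only and plain literals convert implicitly.
template <typename T> struct identity { using type = T; };
template <typename T>
inline T sketch_atomic_add_nd(T *address, typename identity<T>::type val) {
  return __atomic_fetch_add(address, val, __ATOMIC_SEQ_CST);
}

inline uint32_t demo_add() {
  uint32_t counter = 5;
  // sketch_atomic_add(&counter, 1);  // would not compile: T is deduced as
  //                                  // uint32_t from the pointer, int from 1
  sketch_atomic_add(&counter, 1u);    // both arguments agree on the type
  sketch_atomic_add_nd(&counter, 1);  // int converts to uint32_t
  return counter;
}
```

The patch keeps the simpler signature and fixes the literals at the call sites, which avoids silent conversions at the cost of a few casts.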

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu

	//===---------- target_impl.cu - NVPTX OpenMP GPU options ------- CUDA -*-===//			//===---------- target_impl.cu - NVPTX OpenMP GPU options ------- CUDA -*-===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// Definitions of target specific functions			// Definitions of target specific functions
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "target_impl.h"			#include "target_impl.h"
	#include "common/debug.h"			#include "common/debug.h"
				#include "common/target_atomic.h"

	#define __OMP_SPIN 1000			#define __OMP_SPIN 1000
	#define UNSET 0			#define UNSET 0u
	#define SET 1			#define SET 1u

	EXTERN void __kmpc_impl_init_lock(omp_lock_t *lock) {			EXTERN void __kmpc_impl_init_lock(omp_lock_t *lock) {
	omp_unset_lock(lock);			omp_unset_lock(lock);
	}			}

	EXTERN void __kmpc_impl_destroy_lock(omp_lock_t *lock) {			EXTERN void __kmpc_impl_destroy_lock(omp_lock_t *lock) {
	omp_unset_lock(lock);			omp_unset_lock(lock);
	}			}

	EXTERN void __kmpc_impl_set_lock(omp_lock_t *lock) {			EXTERN void __kmpc_impl_set_lock(omp_lock_t *lock) {
	// int atomicCAS(int* address, int compare, int val);			// int atomicCAS(int* address, int compare, int val);
	// (old == compare ? val : old)			// (old == compare ? val : old)

	// TODO: not sure spinning is a good idea here..			// TODO: not sure spinning is a good idea here..
	while (atomicCAS(lock, UNSET, SET) != UNSET) {			while (__kmpc_atomic_cas(lock, UNSET, SET) != UNSET) {
	clock_t start = clock();			clock_t start = clock();
	clock_t now;			clock_t now;
	for (;;) {			for (;;) {
	now = clock();			now = clock();
	clock_t cycles = now > start ? now - start : now + (0xffffffff - start);			clock_t cycles = now > start ? now - start : now + (0xffffffff - start);
	if (cycles >= __OMP_SPIN * GetBlockIdInKernel()) {			if (cycles >= __OMP_SPIN * GetBlockIdInKernel()) {
	break;			break;
	}			}
	}			}
	} // wait for 0 to be the read value			} // wait for 0 to be the read value
	}			}

	EXTERN void __kmpc_impl_unset_lock(omp_lock_t *lock) {			EXTERN void __kmpc_impl_unset_lock(omp_lock_t *lock) {
	(void)atomicExch(lock, UNSET);			(void)__kmpc_atomic_exchange(lock, UNSET);
	}			}

	EXTERN int __kmpc_impl_test_lock(omp_lock_t *lock) {			EXTERN int __kmpc_impl_test_lock(omp_lock_t *lock) {
	// int atomicCAS(int* address, int compare, int val);			// int atomicCAS(int* address, int compare, int val);
	// (old == compare ? val : old)			// (old == compare ? val : old)
	return atomicAdd(lock, 0);			return atomicAdd(lock, 0);
	}			}
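
The lock functions above follow a simple protocol: compare-and-swap UNSET to SET to acquire, exchange with UNSET to release, and an add of zero as an atomic read. A minimal host-side model of that protocol is sketched below, assuming the GCC/Clang `__atomic` builtins. The `sketch_` names and the `kUnset`/`kSet` constants are hypothetical, and the clock-based backoff of `__kmpc_impl_set_lock` is omitted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of the lock protocol (not the patch's code).
// Assumes the GCC/Clang __atomic builtins.
constexpr uint32_t kUnset = 0u;
constexpr uint32_t kSet = 1u;

// Returns the value observed before the attempt, like atomicCAS.
inline uint32_t sketch_cas(uint32_t *lock, uint32_t compare, uint32_t val) {
  uint32_t expected = compare;
  __atomic_compare_exchange_n(lock, &expected, val, /*weak=*/false,
                              __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
  return expected; // unchanged on success, current value on failure
}

// One acquisition attempt; the device code retries this in a loop.
inline bool sketch_try_lock(uint32_t *lock) {
  return sketch_cas(lock, kUnset, kSet) == kUnset;
}

inline void sketch_unlock(uint32_t *lock) {
  (void)__atomic_exchange_n(lock, kUnset, __ATOMIC_SEQ_CST);
}

inline uint32_t sketch_test_lock(uint32_t *lock) {
  // Atomic read expressed as "add zero", as in __kmpc_impl_test_lock.
  return __atomic_fetch_add(lock, 0u, __ATOMIC_SEQ_CST);
}
```

Making UNSET and SET unsigned, as the diff does with `0u` and `1u`, keeps all three arguments of the templated compare-and-swap wrapper at the same type.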