This is an archive of the discontinued LLVM Phabricator instance.

Differential D64218

[OpenMP][NFCI] Cleanup the target synchronization implementation
Needs Review, Public

Authored by jdoerfert on Jul 4 2019, 12:46 PM.

Details

Reviewers

openmp-commits

Summary

Note: WIP patch 2/3 to go with an RFC for the device RTL design (see D64217)

This NFCI patch includes the following cleanup steps:
  - Adjust the code according to the LLVM coding style, especially wrt.
    variable and method names.
  - Document the code with doxygen comments.
  - Change the comments to be less NVPTX specific.
  - Wrap CUDA specific calls into __kmpc_impl_XXX functions and define
    them in a separate target_impl.h file.
  - Use a templated barrier implementation to remove code duplication.
  - Use a (macro) generator to reduce code duplication.
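
The "(macro) generator" step in the list above can be pictured with a small host-side sketch. It is only an illustration under stated assumptions: the names `DEFINE_REGION_DELIMITERS`, `isTeamMaster`, `region_master`, and `region_single` are hypothetical stand-ins, not the runtime's actual entry points, and "thread 0 is the master" is assumed purely for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical predicate: in this sketch, thread 0 is the team master.
static bool isTeamMaster(int32_t GlobalTID) { return GlobalTID == 0; }

// Generate a pair of functions, region_NAME and region_end_NAME.
// region_NAME returns non-zero if the calling thread should enter the region;
// region_end_NAME asserts that only such a thread reaches the region end.
#define DEFINE_REGION_DELIMITERS(NAME, ENTERING_PREDICATE)                     \
  int32_t region_##NAME(int32_t GlobalTID) {                                   \
    return ENTERING_PREDICATE(GlobalTID);                                      \
  }                                                                            \
  void region_end_##NAME(int32_t GlobalTID) {                                  \
    assert(ENTERING_PREDICATE(GlobalTID) && "thread should not have entered"); \
    (void)GlobalTID;                                                           \
  }

// Two regions that share one predicate, generated from a single definition.
DEFINE_REGION_DELIMITERS(master, isTeamMaster)
DEFINE_REGION_DELIMITERS(single, isTeamMaster)
```

Each macro use expands to one begin/end pair, so a change to the shared logic only has to be made once.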

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 34393
Build 34392: arc lint + arc unit

Event Timeline

jdoerfert created this revision. Jul 4 2019, 12:46 PM

Herald added a project: Restricted Project. Jul 4 2019, 12:46 PM

Herald added subscribers: bollu, mgorny.

jdoerfert retitled this revision from [OpenMP][NFCI] Cleanup the target synchronization implementation Note: WIP patch 2/3 to go with a RFC for the device RTL design (see D64217) to [OpenMP][NFCI] Cleanup the target synchronization implementation. Jul 4 2019, 12:48 PM

jdoerfert edited the summary of this revision.

jdoerfert mentioned this in D64219: [OpenMP][NFCI] Cleanup the target worksharing implementation. Jul 4 2019, 12:49 PM

Harbormaster completed remote builds in B34391: Diff 208072. Jul 4 2019, 12:50 PM

Add missing return type

Harbormaster completed remote builds in B34393: Diff 208076. Jul 4 2019, 12:54 PM

ABataev added inline comments. Jul 4 2019, 3:48 PM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48	I don't think this is correct. The `IsSPMD` flag should be passed as a function parameter. Sometimes we cannot determine the execution mode at compile time and can only determine it at execution time (for example, if the parallel region is called in an orphaned function, marked as noinline, or compiled without optimizations, etc.)

jdoerfert marked an inline comment as done. Jul 5 2019, 10:35 AM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48	It is "correct" and it "works" with the rest of the code base but we can change it regardless: It works this way because we have explicit `__kmpc_barrier_XXXXX` functions for the SPMD and non-SPMD case. Through that level of abstraction we know the required barrier implementation at compile time. If we want to move away from the different barrier types that have the mode baked into their name, we would need to turn the template parameters into function arguments for sure. Long story short, I do not have strong feelings about this and it should not matter after inlining and constant propagation.

ping

ABataev added inline comments. Jul 17 2019, 2:50 PM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48	Actually, we use `__kmpc_barrier` in many cases. Even in SPMD mode. `__kmpc_barrier_xxxx` variants are used in very rare cases. And your change may lead to incorrect results in case of orphaned directives because you hardcoded `IsSpmd` to `false` in `__kmpc_barrier`. The fact that it works for you just means that you have a very limited test set. Inlining and constant propagation is not an option here. What if the user compiled the code at `O0`, without optimizations? Just to debug the code? Should we produce different results at `O0` and `O3`? Or what if the user explicitly marked the function as `noinline`?

jdoerfert marked an inline comment as done. Aug 5 2019, 3:28 PM

jdoerfert added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48	> Actually, we use `__kmpc_barrier` in many cases. Even in SPMD mode. `__kmpc_barrier_xxxx` variants are used in very rare cases. And your change may lead to incorrect results in case of orphaned directives because you hardcoded `IsSpmd` to `false` in `__kmpc_barrier`. The fact that it works for you just means that you have very limited test set.
	Please take a look at line 52. As before, a call to `__kmpc_barrier` will first check if we are in SPMD mode. (That check happens even if the template argument is `false`; if it is `true`, the check is skipped.) Thus, it is no different from the behavior we had.
	> Inlining and constant propagation is not an option here.
	I do not understand what you are talking about. This does not, as nothing ever can, rely on inlining and constant propagation.
	> What if the user compiled the code at `O0`, without optimizations? [...]
	They get the same semantics but slower.
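
The point made in this reply can be sketched on the host as follows. This is a minimal illustration and not the patch's code: `checkSPMDModeAtRuntime`, `RuntimeInSPMD`, `BarrierKind`, and `selectBarrier` are hypothetical names, and the real implementation also consults `checkRuntimeUninitialized`.

```cpp
#include <cassert>

// Hypothetical stand-in for the runtime mode query.
static bool RuntimeInSPMD = false;
static bool checkSPMDModeAtRuntime() { return RuntimeInSPMD; }

enum class BarrierKind { SPMD, Generic };

// IsSPMD == true means "statically known to be SPMD", so the runtime query is
// skipped. IsSPMD == false means "not known statically", not "known to be
// generic": the runtime query still decides which barrier is used.
template <bool IsSPMD> BarrierKind selectBarrier() {
  bool InSPMD = IsSPMD || checkSPMDModeAtRuntime();
  return InSPMD ? BarrierKind::SPMD : BarrierKind::Generic;
}
```

In this sketch the selection is made by an ordinary runtime branch, so the result does not depend on the optimization level; optimizations can only remove the branch when the flag is `true`.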

ABataev added inline comments. Aug 5 2019, 3:51 PM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48	> Please take a look at line 52. As before, a call to __kmpc_barrier will first check if we are in SPMD mode. (Even if the template argument is false that happens, if it is true it is not going to happen though). Thus, it is no different to the behavior we had.
	Then why do we need this template argument if it does nothing but just confuses people?

ABataev added inline comments. Aug 5 2019, 4:06 PM

openmp/libomptarget/deviceRTLs/nvptx/src/sync.cpp
18	We don't support cancellation in the GPU runtime currently, so I think it is better to set `IsCancellable` to `false` to make it clear that cancellation is not supported yet.
18	I think I got the idea for these params. But it is better to give them some other names, currently they might confuse users.
37	Not sure that this instantiation provides the same functionality as the original implementation. Originally, it did not check the number of active threads, it just synced all non-SPMD threads unconditionally.
56	Can we invent something else here, not the macros?
88	It is impossible to get rid of this for the NVPTX runtime, at least. Better to make it an NVPTX specific function.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
55	This is not correct. `kmpc_barrier` can be called even in SPMD mode. Generally speaking, the `spmd` suffix also must be generated dynamically.
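
The concern raised for line 37 can be made concrete with a little arithmetic. The sketch below mirrors the thread-count computations visible in the deleted `sync.cu` and in the new `__kmpc_impl_barrier`, under two assumptions: a warp size of 32, and thread counts passed in as plain integers (how `GetNumberOfOmpThreads` computes its result is not shown on this page). The function names are hypothetical.

```cpp
#include <cassert>

constexpr int WarpSize = 32; // assumption: NVPTX warp size

// Round a thread count up to the next multiple of the warp size.
constexpr int roundUpToWarp(int NumThreads) {
  return WarpSize * ((NumThreads + WarpSize - 1) / WarpSize);
}

// Old __kmpc_barrier_simple_generic: the count is the block size minus one
// (master) warp, and the named barrier is executed unconditionally.
constexpr int oldSimpleGenericSyncCount(int ThreadsInBlock) {
  return roundUpToWarp(ThreadsInBlock - WarpSize);
}

// Templated version, generic path: the count comes from the number of active
// OpenMP threads, and the barrier is skipped (0 here) for a single thread.
constexpr int newGenericSyncCount(int ActiveOmpThreads) {
  return ActiveOmpThreads > 1 ? roundUpToWarp(ActiveOmpThreads) : 0;
}
```

Whether the two paths agree therefore depends on whether the active OpenMP thread count equals the block size minus one warp at every call site, which is the question the comment raises.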

JonChesterfield added a subscriber: JonChesterfield. Aug 6 2019, 3:15 AM

JonChesterfield added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48	> Inlining and constant propagation is not an option here. What if the user compiled the code at O0, without optimizations?
	Could you expand on your concern here? The code looks like it does the same thing at O0 and O3 to me (no calls to `__builtin_constant_p`) so O0 just means slower. That seems OK.

@ABataev I am confused by the comments you make. Could you describe a situation in which we do not execute the same code before and after this patch? Also, the O0/O3 inline & constant propagation comment is not clear to me, what is the issue there?

In D64218#1616951, @jdoerfert wrote:

> @ABataev I am confused by the comments you make. Could you describe a situation in which we do not execute the same code before and after this patch? Also, the O0/O3 inline & constant propagation comment is not clear to me, what is the issue there?

I was confused by the names of the template arguments.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
48	That was the wrong assumption on the meaning of the template arguments, I got the idea already but it is better to rename the arguments somehow because currently, they are confusing.

In D64218#1616959, @ABataev wrote:

> In D64218#1616951, @jdoerfert wrote:
>
>> @ABataev I am confused by the comments you make. Could you describe a situation in which we do not execute the same code before and after this patch? Also, the O0/O3 inline & constant propagation comment is not clear to me, what is the issue there?
>
> I was confused by the names of the template arguments.

I'm happy to rename them if we want to go ahead with this __kmpc_impl_barrier (potentially without the rest in here). Name suggestions are always welcome.

In D64218#1616979, @jdoerfert wrote:

> In D64218#1616959, @ABataev wrote:
>
>> In D64218#1616951, @jdoerfert wrote:
>>
>>> @ABataev I am confused by the comments you make. Could you describe a situation in which we do not execute the same code before and after this patch? Also, the O0/O3 inline & constant propagation comment is not clear to me, what is the issue there?
>>
>> I was confused by the names of the template arguments.
>
> I'm happy to rename them if we want to go ahead with this __kmpc_impl_barrier (potentially without the rest in here). Name suggestions are always welcome.

It would be good to gather the common functionality into some kind of class, maybe? Common functionality could be encapsulated in a templated class, while target-specific functions could be encapsulated in a target-specific class that is used as the template argument when instantiating the common class.
Something like:

template <typename TargetT>
class CommonFunctionality : public TargetT {
  void barrier(....) {
     if (spmd-mode) {
       TargetT::spmd_barrier();
     } else if (num_threads > 1) {
       TargetT::non_spmd_barrier();
     } else {
       TargetT::flush(); 
     }
  }
};

What do you think about it?
For the template arguments, maybe just do not make it a templated function? Maybe it is better to make a set of simple functions and call them in the complex ones?
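
A compilable host-side version of the class layout suggested above might look like the following. It is a sketch under assumptions, not code from the patch: `FakeTarget` and its logging are hypothetical and stand in for a target class that would wrap device intrinsics, and the mode and thread count are passed as plain arguments.

```cpp
#include <cassert>
#include <string>

// Hypothetical target class; it records which primitive was called.
struct FakeTarget {
  static std::string &log() {
    static std::string Log;
    return Log;
  }
  static void spmdBarrier() { log() += "spmd;"; }
  static void nonSpmdBarrier() { log() += "generic;"; }
  static void flush() { log() += "flush;"; }
};

// Target-independent logic, parameterized by the target class.
template <typename TargetT> struct CommonFunctionality : TargetT {
  static void barrier(bool InSPMD, int NumThreads) {
    if (InSPMD)
      TargetT::spmdBarrier();
    else if (NumThreads > 1)
      TargetT::nonSpmdBarrier();
    else
      TargetT::flush();
  }
};
```

With this layout, supporting another target would mean supplying another class with the same three static functions, while the dispatch logic stays in one place.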

@JonChesterfield @atmnpatel @tianshilei1992 Unclear if this is still needed and/or applies. Feel free to take a look.

Herald added subscribers: sstefan1, yaxunl. Dec 15 2020, 10:07 AM

This looks like a difficult rebase. Some parts are obsolete (__kmpc_impl_active_thread_mask). Lifting runtime parameters with associated branches to compile time via the template is interesting but I wouldn't have guessed it's where we're losing most performance. Constant propagation probably does the same job with the bitcode RTL.

I'd be inclined to abandon this patch and recreate it if desired

In D64218#2455642, @JonChesterfield wrote:

> This looks like a difficult rebase. Some parts are obsolete (__kmpc_impl_active_thread_mask). Lifting runtime parameters with associated branches to compile time via the template is interesting but I wouldn't have guessed it's where we're losing most performance. Constant propagation probably does the same job with the bitcode RTL.

It was not about performance but code duplication. Back then, and maybe now too, we had various copies of the barrier implementation, which is sub-optimal.

> I'd be inclined to abandon this patch and recreate it if desired

Agreed. But I might not be able to do that any time soon.

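Revision Contents

Path	Size
openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt	2 lines
openmp/libomptarget/deviceRTLs/nvptx/src/sync.cpp	92 lines
openmp/libomptarget/deviceRTLs/nvptx/src/sync.cu
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h	51 lines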

Diff 208076

openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt

set(cuda_src_files		set(cuda_src_files
src/cancel.cu		src/cancel.cu
src/critical.cu		src/critical.cu
src/data_sharing.cu		src/data_sharing.cu
src/libcall.cu		src/libcall.cu
src/loop.cu		src/loop.cu
src/omptarget-nvptx.cu		src/omptarget-nvptx.cu
src/parallel.cu		src/parallel.cu
src/reduction.cu		src/reduction.cu
src/sync.cu		src/sync.cpp
src/task.cu		src/task.cu
)		)

set(omp_data_objects src/omp_data.cu)		set(omp_data_objects src/omp_data.cu)

# Get the compute capability the user requested or use SM_35 by default.		# Get the compute capability the user requested or use SM_35 by default.
# SM_35 is what clang uses by default.		# SM_35 is what clang uses by default.
set(default_capabilities 35)		set(default_capabilities 35)

openmp/libomptarget/deviceRTLs/nvptx/src/sync.cpp

This file was added.

				//===--- sync.cpp --- OpenMP synchronization operations ---------- C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// Generic implementation for synchronization primitives.
				//
				//===----------------------------------------------------------------------===//

				#include "debug.h"
				#include "target_impl.h"

				/// Perform a barrier operation that might cause a cancellation.
				EXTERN int32_t __kmpc_cancel_barrier(kmp_Ident *Loc, int32_t TID) {
				__kmpc_impl_barrier</* IsCancellable */ true, /* IsSimple */ false,
				/* IsSPMD */ false>(Loc, TID);
				return /* should be cancelled */ false;
				}

				/// Perform a barrier operation.
				EXTERN void __kmpc_barrier(kmp_Ident *Loc, int32_t TID) {
				__kmpc_impl_barrier</* IsCancellable */ false, /* IsSimple */ false,
				/* IsSPMD */ false>(Loc, TID);
				}

				/// Perform a simple barrier operation in SPMD-mode.
				EXTERN void __kmpc_barrier_simple_spmd(kmp_Ident *Loc, int32_t TID) {
				__kmpc_impl_barrier</* IsCancellable */ false, /* IsSimple */ true,
				/* IsSPMD */ true>(Loc, TID);
				}

				/// Perform a simple barrier operation in non-SPMD-mode.
				EXTERN void __kmpc_barrier_simple_generic(kmp_Ident *Loc, int32_t TID) {
				__kmpc_impl_barrier</* IsCancellable */ false, /* IsSimple */ true,
				/* IsSPMD */ false>(Loc, TID);
				}

				/// Function to be called at the beginning of an "ordered" region.
				EXTERN void __kmpc_ordered(kmp_Ident *, int32_t) {
				PRINT0(LD_IO, "call kmpc_ordered\n");
				}

				/// Function to be called at the end of an "ordered" region.
				EXTERN void __kmpc_end_ordered(kmp_Ident *, int32_t) {
				PRINT0(LD_IO, "call kmpc_end_ordered\n");
				}

				/// Create two functions, one to be called before entering region which returns
				/// a non-zero value if the region should be entered, and one to be called after
				/// the region was executed. The names of the function will be __kmpc_NAME and
				/// __kmpc_end_NAME. The predicate under which the region is entered is provided
				/// as ENTERING_PREDICATE.
				#define REGION_DELIMITERS(NAME, ENTERING_PREDICATE) \
				\
				EXTERN int32_t __kmpc_##NAME(kmp_Ident *, int32_t GlobalTID) { \
				PRINT0(LD_IO, "call " #NAME "\n"); \
				return ENTERING_PREDICATE(GlobalTID); \
				} \
				\
				EXTERN void __kmpc_end_##NAME(kmp_Ident *, int32_t GlobalTID) { \
				PRINT0(LD_IO, "call " #NAME "\n"); \
				ASSERT0(LT_FUSSY, ENTERING_PREDICATE(GlobalTID), \
				"Region end function executed by thread which should not have " \
				"entered"); \
				}

				/// Region delimiter functions for "master".
				///{
				REGION_DELIMITERS(master, IsTeamMaster)
				///}

				/// Region delimiter functions for "single" implemented the same as master.
				///{
				REGION_DELIMITERS(single, IsTeamMaster)
				///}

				/// Perform a "flush" operation.
				EXTERN void __kmpc_flush(kmp_Ident *Loc) {
				PRINT0(LD_IO, "call kmpc_flush\n");
				__kmpc_impl_flush(Loc);
				}

				/// Return the bit-mask of active threads in the warp.
				///
				/// FIXME: Warps are a detail we should get rid of here.
				EXTERN int32_t __kmpc_warp_active_thread_mask() {
				PRINT0(LD_IO, "call __kmpc_warp_active_thread_mask\n");
				return __kmpc_impl_active_thread_mask();
				}

openmp/libomptarget/deviceRTLs/nvptx/src/sync.cu

This file was deleted.

	//===------------ sync.h - NVPTX OpenMP synchronizations --------- CUDA -*-===//
	//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//
	//===----------------------------------------------------------------------===//
	//
	// Include all synchronization.
	//
	//===----------------------------------------------------------------------===//

	#include "omptarget-nvptx.h"

	////////////////////////////////////////////////////////////////////////////////
	// KMP Ordered calls
	////////////////////////////////////////////////////////////////////////////////

	EXTERN void __kmpc_ordered(kmp_Ident *loc, int32_t tid) {
	PRINT0(LD_IO, "call kmpc_ordered\n");
	}

	EXTERN void __kmpc_end_ordered(kmp_Ident *loc, int32_t tid) {
	PRINT0(LD_IO, "call kmpc_end_ordered\n");
	}

	////////////////////////////////////////////////////////////////////////////////
	// KMP Barriers
	////////////////////////////////////////////////////////////////////////////////

	// a team is a block: we can use CUDA native synchronization mechanism
	// FIXME: what if not all threads (warps) participate to the barrier?
	// We may need to implement it differently

	EXTERN int32_t __kmpc_cancel_barrier(kmp_Ident *loc_ref, int32_t tid) {
	PRINT0(LD_IO, "call kmpc_cancel_barrier\n");
	__kmpc_barrier(loc_ref, tid);
	PRINT0(LD_SYNC, "completed kmpc_cancel_barrier\n");
	return 0;
	}

	EXTERN void __kmpc_barrier(kmp_Ident *loc_ref, int32_t tid) {
	if (checkRuntimeUninitialized(loc_ref)) {
	ASSERT0(LT_FUSSY, checkSPMDMode(loc_ref),
	"Expected SPMD mode with uninitialized runtime.");
	__kmpc_barrier_simple_spmd(loc_ref, tid);
	} else {
	tid = GetLogicalThreadIdInBlock(checkSPMDMode(loc_ref));
	int numberOfActiveOMPThreads =
	GetNumberOfOmpThreads(checkSPMDMode(loc_ref));
	if (numberOfActiveOMPThreads > 1) {
	if (checkSPMDMode(loc_ref)) {
	__kmpc_barrier_simple_spmd(loc_ref, tid);
	} else {
	// The #threads parameter must be rounded up to the WARPSIZE.
	int threads =
	WARPSIZE * ((numberOfActiveOMPThreads + WARPSIZE - 1) / WARPSIZE);

	PRINT(LD_SYNC,
	"call kmpc_barrier with %d omp threads, sync parameter %d\n",
	(int)numberOfActiveOMPThreads, (int)threads);
	// Barrier #1 is for synchronization among active threads.
	named_sync(L1_BARRIER, threads);
	}
	} // numberOfActiveOMPThreads > 1
	PRINT0(LD_SYNC, "completed kmpc_barrier\n");
	}
	}

	// Emit a simple barrier call in SPMD mode. Assumes the caller is in an L0
	// parallel region and that all worker threads participate.
	EXTERN void __kmpc_barrier_simple_spmd(kmp_Ident *loc_ref, int32_t tid) {
	PRINT0(LD_SYNC, "call kmpc_barrier_simple_spmd\n");
	// FIXME: use __syncthreads instead when the function copy is fixed in LLVM.
	__SYNCTHREADS();
	PRINT0(LD_SYNC, "completed kmpc_barrier_simple_spmd\n");
	}

	// Emit a simple barrier call in Generic mode. Assumes the caller is in an L0
	// parallel region and that all worker threads participate.
	EXTERN void __kmpc_barrier_simple_generic(kmp_Ident *loc_ref, int32_t tid) {
	int numberOfActiveOMPThreads = GetNumberOfThreadsInBlock() - WARPSIZE;
	// The #threads parameter must be rounded up to the WARPSIZE.
	int threads =
	WARPSIZE * ((numberOfActiveOMPThreads + WARPSIZE - 1) / WARPSIZE);

	PRINT(LD_SYNC,
	"call kmpc_barrier_simple_generic with %d omp threads, sync parameter "
	"%d\n",
	(int)numberOfActiveOMPThreads, (int)threads);
	// Barrier #1 is for synchronization among active threads.
	named_sync(L1_BARRIER, threads);
	PRINT0(LD_SYNC, "completed kmpc_barrier_simple_generic\n");
	}

	////////////////////////////////////////////////////////////////////////////////
	// KMP MASTER
	////////////////////////////////////////////////////////////////////////////////

	EXTERN int32_t __kmpc_master(kmp_Ident *loc, int32_t global_tid) {
	PRINT0(LD_IO, "call kmpc_master\n");
	return IsTeamMaster(global_tid);
	}

	EXTERN void __kmpc_end_master(kmp_Ident *loc, int32_t global_tid) {
	PRINT0(LD_IO, "call kmpc_end_master\n");
	ASSERT0(LT_FUSSY, IsTeamMaster(global_tid), "expected only master here");
	}

	////////////////////////////////////////////////////////////////////////////////
	// KMP SINGLE
	////////////////////////////////////////////////////////////////////////////////

	EXTERN int32_t __kmpc_single(kmp_Ident *loc, int32_t global_tid) {
	PRINT0(LD_IO, "call kmpc_single\n");
	// decide to implement single with master; master get the single
	return IsTeamMaster(global_tid);
	}

	EXTERN void __kmpc_end_single(kmp_Ident *loc, int32_t global_tid) {
	PRINT0(LD_IO, "call kmpc_end_single\n");
	// decide to implement single with master: master get the single
	ASSERT0(LT_FUSSY, IsTeamMaster(global_tid), "expected only master here");
	// sync barrier is explicitely called... so that is not a problem
	}

	////////////////////////////////////////////////////////////////////////////////
	// Flush
	////////////////////////////////////////////////////////////////////////////////

	EXTERN void __kmpc_flush(kmp_Ident *loc) {
	PRINT0(LD_IO, "call kmpc_flush\n");
	__threadfence();
	}

	////////////////////////////////////////////////////////////////////////////////
	// Vote
	////////////////////////////////////////////////////////////////////////////////

	EXTERN int32_t __kmpc_warp_active_thread_mask() {
	PRINT0(LD_IO, "call __kmpc_warp_active_thread_mask\n");
	return __ACTIVEMASK();
	}

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h

	//===--- target_impl.h - OpenMP device RTL target code impl. ---- C++ --===//			//===--- target_impl.h - OpenMP device RTL target code impl. ---- C++ --===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// Definitions of target specific functions needed in the generic part of the			// Definitions of target specific functions needed in the generic part of the
	// device RTL implementation.			// device RTL implementation.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef TARGET_IMPL_H			#ifndef TARGET_IMPL_H
	#define TARGET_IMPL_H			#define TARGET_IMPL_H

				#include "omptarget-nvptx.h"

	/// Atomically increment the pointee of \p Ptr by \p Val and return the original			/// Atomically increment the pointee of \p Ptr by \p Val and return the original
	/// value of the pointee.			/// value of the pointee.
	template <typename T> T __kmpc_impl_atomic_add(T *Ptr, T Val) {			template <typename T> T __kmpc_impl_atomic_add(T *Ptr, T Val) {
	return atomicAdd(Ptr, Val);			return atomicAdd(Ptr, Val);
	}			}

	/// Atomically exchange the pointee of \p Ptr with \p Val and return the			/// Atomically exchange the pointee of \p Ptr with \p Val and return the
	/// original value of the pointee.			/// original value of the pointee.
	template <typename T> T __kmpc_impl_atomic_exchange(T *Ptr, T Val) {			template <typename T> T __kmpc_impl_atomic_exchange(T *Ptr, T Val) {
	return atomicExch(Ptr, Val);			return atomicExch(Ptr, Val);
	}			}

				/// Return the bit-mask representing active threads.
				template <typename T> T __kmpc_impl_active_thread_mask() {
				return __ACTIVEMASK();
				}

				/// Perform an "omp flush" operation.
				void __kmpc_impl_flush(kmp_Ident *) {
				__threadfence();
				}

				/// Perform an "omp barrier" operation for various modes described as
				/// combinations of "(non)-cancellable", "(non-)simple", and "(non-)SPMD".
				///
				/// Note: A team is a block: we can use CUDA native synchronization mechanism.
				///
				/// FIXME: What if not all threads (warps) participate to the barrier? We may
				/// need to implement it differently
				template <bool IsCancellable, bool IsSimple, bool IsSPMD>
				void __kmpc_impl_barrier(kmp_Ident *Loc, int32_t TID) {
				// Try to justify SPMD mode first as it allows a simple barrier
				// implementation.
				bool InSPMD = IsSPMD || checkRuntimeUninitialized(Loc) || checkSPMDMode(Loc);

				if (InSPMD) {
				PRINT(LD_SYNC, "call kmpc%s_barrier%s_spmd\n",
				IsCancellable ? "_cancel" : "", IsSimple ? "_simple" : "");
				// FIXME: use __syncthreads instead when the function copy is fixed in LLVM.
				__SYNCTHREADS();
				} else {
				int NumberOfActiveOMPThreads = GetNumberOfOmpThreads(InSPMD);
				if (NumberOfActiveOMPThreads > 1) {
				// The #threads parameter must be rounded up to the WARPSIZE.
				int NumThreads =
				WARPSIZE * ((NumberOfActiveOMPThreads + WARPSIZE - 1) / WARPSIZE);

				PRINT(LD_SYNC,
				"call kmpc%s_barrier%s with %d omp NumThreads, sync parameter %d\n",
				IsCancellable ? "_cancel" : "", IsSimple ? "_simple" : "",
				NumberOfActiveOMPThreads, NumThreads);

				// Barrier #1 is for synchronization among active NumThreads.
				named_sync(L1_BARRIER, NumThreads);
				}
				}
				PRINT(LD_SYNC, "completed kmpc%s_barrier%s%s\n",
				IsCancellable ? "_cancel" : "", IsSimple ? "_simple" : "",
				InSPMD ? "_spmd" : "");
				}

	#endif // TARGET_IMPL_H			#endif // TARGET_IMPL_H