This is an archive of the discontinued LLVM Phabricator instance.

Differential D145290

[OpenMP] Ensure memory fences are created with barriers for AMDGPUs
Closed, Public

Authored by jdoerfert on Mar 3 2023, 5:53 PM.

Details

Reviewers

jhuber6
arsenm
JonChesterfield
tianshilei1992
t-tye
b-sumner
ye-luo

Commits

rG36d6217c4eb0: [OpenMP] Ensure memory fences are created with barriers for AMDGPUs

Summary

It turns out that the __builtin_amdgcn_s_barrier() alone does not emit
a fence. We somehow got away with this and assumed it would work as it
(hopefully) is correct on the NVIDIA path where we just emit a
__syncthreads. After talking to @arsenm we now (mostly) align with the
OpenCL barrier implementation [1] and emit explicit fences for AMDGPUs.
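The shape described above (a fence, then the execution barrier, then another fence) can be sketched in portable C++ as follows. This is only a host-side illustration: `barrierWithFences` is a hypothetical helper, the barrier is passed in as a callback, and `std::atomic_thread_fence` stands in for the target-specific fence builtins used by the patch.

```cpp
#include <atomic>

// Hypothetical helper, not part of DeviceRTL: wraps an execution-only
// barrier with explicit memory fences.
template <typename BarrierFn> void barrierWithFences(BarrierFn &&Barrier) {
  // Make this thread's earlier writes available before it waits.
  std::atomic_thread_fence(std::memory_order_release);
  // The barrier itself only synchronizes execution in this model.
  Barrier();
  // Make other threads' writes visible after the barrier is passed.
  std::atomic_thread_fence(std::memory_order_acquire);
}
```

On a device the callback would be the hardware barrier; here any callable works, which keeps the sketch testable on a host.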

It seems this was the underlying cause for #59759, but I am not 100%
certain. There is a chance this simply hides the problem.

Fixes: https://github.com/llvm/llvm-project/issues/59759

[1] https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/07b347366eb2c6ebc3414af323c623cbbbafc854/opencl/src/workgroup/wgbarrier.cl#L21

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jdoerfert created this revision. Mar 3 2023, 5:53 PM

Herald added a project: Restricted Project. Mar 3 2023, 5:53 PM

Herald added subscribers: kosarev, guansong, bollu and 5 others.

jdoerfert requested review of this revision. Mar 3 2023, 5:53 PM

Herald added subscribers: sstefan1, wdng. Mar 3 2023, 5:53 PM

Harbormaster completed remote builds in B217327: Diff 502323. Mar 3 2023, 5:57 PM

jdoerfert added inline comments. Mar 3 2023, 6:01 PM

openmp/libomptarget/DeviceRTL/src/Kernel.cpp
89	Actually, the ones to form aligned regions should all use relaxed ordering as we really only want the barrier for synchronizing the threads.

I compared it to the HIP implementation linked and from that point of view it looks reasonable to me, but I don't have a good understanding of the internals yet. @JonChesterfield can you comment on the topic?

openmp/libomptarget/DeviceRTL/include/Synchronization.h
107

Herald added a subscriber: sunshaoce. Mar 15 2023, 3:36 AM

Do we actually have seq_cst ordering on GPUs? It means every thread sees the same ordering which I'd guess has to be done by a RMW atomic operation. Maybe a fetch_add. Plus these aren't scoped, so the memory underlying it has to be accessible from all threads, which probably means it goes to host shared memory. Aka fetch_add on CPU memory to get the ordering relation. That seems expensive to the extent that it probably isn't implemented. @arsenm how do we usually deal with that?

@jdoerfert what motivates seq_cst here instead of acquire/release? I commented inline that this ordering argument only makes sense when it's consistent across all threads that are participating, which might be worth approximating as constant per call site, i.e. raise it to a template parameter.

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
264–273	I don't understand this. Why is the ordering on a fenceteam related to the ordering on syncthreads in this way? What about acquire_release? In general it seems hazardous that ordering is a runtime variable, if different threads passed in different ordering this would turn into a horrendous mess. Perhaps we should move it to a template parameter, and maybe static_assert in the syncThreads implementation that it meets whatever the constraints on it are.
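A minimal sketch of the template-parameter idea from this comment, assuming a host-side model: `syncThreadsT`, `fenceTeamStub` and `barrierStub` are hypothetical names, the stubs only log what a device implementation would emit, and the set of orderings accepted by the `static_assert` is an assumption chosen for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for atomic::OrderingTy.
enum class Ordering { relaxed, acquire, release, acq_rel, seq_cst };

// Stand-ins for the target fence and barrier; they only record the call.
inline void fenceTeamStub(std::vector<std::string> &Log, const char *Kind) {
  Log.push_back(std::string("fence ") + Kind);
}
inline void barrierStub(std::vector<std::string> &Log) {
  Log.push_back("barrier");
}

// The ordering is a compile-time constant per call site, so every thread
// reaching this instance necessarily uses the same ordering.
template <Ordering O> void syncThreadsT(std::vector<std::string> &Log) {
  // Assumed constraint: only the orderings the patch distinguishes.
  static_assert(O == Ordering::relaxed || O == Ordering::acq_rel ||
                    O == Ordering::seq_cst,
                "unsupported barrier ordering");
  if (O != Ordering::relaxed)
    fenceTeamStub(Log, O == Ordering::acq_rel ? "release" : "seq_cst");
  barrierStub(Log);
  if (O != Ordering::relaxed)
    fenceTeamStub(Log, O == Ordering::acq_rel ? "acquire" : "seq_cst");
}
```

A call such as `syncThreadsT<Ordering::release>(Log)` would then fail to compile instead of silently being treated like seq_cst.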

In D145290#4197573, @JonChesterfield wrote:

That seems expensive to the extent that it probably isn't implemented. @arsenm how do we usually deal with that?

Why wouldn't it be implemented? The exact treatment varies per subtarget but implies setting some cache bits and using some cache flush instructions

In D145290#4197774, @arsenm wrote:

In D145290#4197573, @JonChesterfield wrote:

That seems expensive to the extent that it probably isn't implemented. @arsenm how do we usually deal with that?

Why wouldn't it be implemented? The exact treatment varies per subtarget but implies setting some cache bits and using some cache flush instructions

That sounds sufficient for within a single GPU but not between GPUs. Though I'm not sure just how far the total order of events is supposed to be taken, maybe it's reasonable to observe different orders when variables are only directly visible to some subset of threads.

In D145290#4197794, @JonChesterfield wrote:

In D145290#4197774, @arsenm wrote:

In D145290#4197573, @JonChesterfield wrote:

That seems expensive to the extent that it probably isn't implemented. @arsenm how do we usually deal with that?

Why wouldn't it be implemented? The exact treatment varies per subtarget but implies setting some cache bits and using some cache flush instructions

That sounds sufficient for within a single GPU but not between GPUs. Though I'm not sure just how far the total order of events is supposed to be taken, maybe it's reasonable to observe different orders when variables are only directly visible to some subset of threads.

We are looking at workgroup fences here. We don't even need per GPU fencing. I think this discussion derailed a bit.

In D145290#4197828, @jdoerfert wrote:

We are looking at workgroup fences here. We don't even need per GPU fencing. I think this discussion derailed a bit.

It's the total order requested by choosing sequentially consistent for the memory model. I'm not sure what the semantics of that are on a system with multiple nested address spaces as that's not the C++ model.

What's the semantics you want here? All warps in a workgroup see the same total order of events, while independent workgroups could see different orders? If so then we can ask whether that's what seq_cst is lowered to on a gpu, which it might be.

Or if the global total ordering is not necessary, should we go with acquire/release on the fences instead?

In D145290#4198120, @JonChesterfield wrote:

In D145290#4197828, @jdoerfert wrote:

We are looking at workgroup fences here. We don't even need per GPU fencing. I think this discussion derailed a bit.

It's the total order requested by choosing sequentially consistent for the memory model. I'm not sure what the semantics of that are on a system with multiple nested address spaces as that's not the C++ model.

What's the semantics you want here? All warps in a workgroup see the same total order of events, while independent workgroups could see different orders? If so then we can ask whether that's what seq_cst is lowered to on a gpu, which it might be.

Or if the global total ordering is not necessary, should we go with acquire/release on the fences instead?

Given that we might need to flush global memory, I went with what the OpenCL folks do.

In D145290#4200291, @jdoerfert wrote:

In D145290#4198120, @JonChesterfield wrote:

Or if the global total ordering is not necessary, should we go with acquire/release on the fences instead?

Given that we might need to flush global memory, I went with what the OpenCL folks do.

Fair enough, that's a heuristic I'm happy with.

Raising the argument to a compile-time template parameter is better I think (I'm not totally confident the code as written would constant fold at O0), but we could leave that for another day / until it comes up in practice.

Add test, correct fence kinds

Harbormaster completed remote builds in B220624: Diff 506836. Mar 20 2023, 8:27 PM

@JonChesterfield is this patch good to go?

LGTM but Matt's the expert here

In D145290#4210633, @JonChesterfield wrote:

LGTM but Matt's the expert here

I know nothing about memory models

In D145290#4210638, @arsenm wrote:

I know nothing about memory models

That's exciting. I've tagged Tony and Brian as my next guess. I'm reasonably clear on memory models in general but haven't reverse engineered what they mean to amdgpu. It's somewhere on my to-do list.

Keep things moving.

This revision is now accepted and ready to land. Mar 23 2023, 8:47 PM

Closed by commit rG36d6217c4eb0: [OpenMP] Ensure memory fences are created with barriers for AMDGPUs (authored by ye-luo). Mar 24 2023, 6:40 PM

This revision was automatically updated to reflect the committed changes.

ye-luo added a commit: rG36d6217c4eb0: [OpenMP] Ensure memory fences are created with barriers for AMDGPUs.

Herald added a project: Restricted Project. Mar 24 2023, 6:40 PM

Herald added a subscriber: openmp-commits.

ye-luo added a reverting change: rGead2d86ee9b1: Revert "[OpenMP] Ensure memory fences are created with barriers for AMDGPUs". Mar 24 2023, 7:10 PM

Got test failure

Failed Tests (1):
  libomptarget :: x86_64-pc-linux-gnu :: offloading/barrier_fence.c

recommit via https://reviews.llvm.org/rG67fed132f39c81e8006c4463ab1f173fea5e4e4b

dhruvachak added a subscriber: dhruvachak. Apr 27 2023, 4:20 PM

dhruvachak added inline comments.

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
266	What if I want a release fence before the barrier and nothing else? As a client, I pass in atomic::release and I get a seq_cst fence before and after the barrier. Seems like an overkill.

jdoerfert added inline comments. May 1 2023, 1:32 PM

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
266	If you want functionality beyond what is implemented, you need to implement it.
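The request in this exchange (a fence on only one side of the barrier) could be modeled on the host roughly as follows. This is a sketch of a hypothetical interface, not something the patch implements: `syncThreadsSplit` and the `Ordering` enum are made-up names, and the function only records which steps a device implementation would perform.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for atomic::OrderingTy.
enum class Ordering { relaxed, acquire, release, acq_rel, seq_cst };

// Human-readable name for an ordering, used only for the recorded steps.
inline const char *orderingName(Ordering O) {
  switch (O) {
  case Ordering::relaxed:
    return "relaxed";
  case Ordering::acquire:
    return "acquire";
  case Ordering::release:
    return "release";
  case Ordering::acq_rel:
    return "acq_rel";
  case Ordering::seq_cst:
    return "seq_cst";
  }
  return "unknown";
}

// Hypothetical variant: the caller chooses the fence before and after the
// barrier independently; relaxed means "no fence on that side".
inline std::vector<std::string> syncThreadsSplit(Ordering Before,
                                                 Ordering After) {
  std::vector<std::string> Steps;
  if (Before != Ordering::relaxed)
    Steps.push_back(std::string("fence ") + orderingName(Before));
  Steps.push_back("barrier");
  if (After != Ordering::relaxed)
    Steps.push_back(std::string("fence ") + orderingName(After));
  return Steps;
}
```

With this shape, `syncThreadsSplit(Ordering::release, Ordering::relaxed)` records a release fence followed by the barrier and nothing else, whereas the patch as written maps `atomic::release` to seq_cst fences on both sides.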

Revision Contents

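Path                                                       Size
openmp/libomptarget/DeviceRTL/include/Synchronization.h    60 lines
openmp/libomptarget/DeviceRTL/src/Kernel.cpp               14 lines
openmp/libomptarget/DeviceRTL/src/Parallelism.cpp          20 lines
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp      32 lines
openmp/libomptarget/test/offloading/barrier_fence.c        75 lines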

Diff 508262

openmp/libomptarget/DeviceRTL/include/Synchronization.h

Show All 10 Lines

#ifndef OMPTARGET_DEVICERTL_SYNCHRONIZATION_H

#define OMPTARGET_DEVICERTL_SYNCHRONIZATION_H

#include "Types.h"

namespace ompx {

namespace synchronize {

/// Initialize the synchronization machinery. Must be called by all threads.

void init(bool IsSPMD);

/// Synchronize all threads in a warp identified by \p Mask.

void warp(LaneMaskTy Mask);

/// Synchronize all threads in a block.

void threads();

/// Synchronizing threads is allowed even if they all hit different instances of

/// `synchronize::threads()`. However, `synchronize::threadsAligned()` is more

/// restrictive in that it requires all threads to hit the same instance. The

/// noinline is removed by the openmp-opt pass and helps to preserve the

/// information till then.

///{

#pragma omp begin assumes ext_aligned_barrier

/// Synchronize all threads in a block, they are are reaching the same

/// instruction (hence all threads in the block are "aligned").

__attribute__((noinline)) void threadsAligned();

#pragma omp end assumes

///}

} // namespace synchronize

namespace atomic {

enum OrderingTy {

relaxed = __ATOMIC_RELAXED,

aquire = __ATOMIC_ACQUIRE,

release = __ATOMIC_RELEASE,

acq_rel = __ATOMIC_ACQ_REL,

seq_cst = __ATOMIC_SEQ_CST,

▲ Show 20 Lines • Show All 51 Lines • ▼ Show 20 Lines

#undef ATOMIC_COMMON_OP

#undef ATOMIC_INT_OP

#undef ATOMIC_FP_OP

///}

} // namespace atomic

namespace synchronize {

/// Initialize the synchronization machinery. Must be called by all threads.

void init(bool IsSPMD);

/// Synchronize all threads in a warp identified by \p Mask.

void warp(LaneMaskTy Mask);

/// Synchronize all threads in a block and perform a fence before and after the

/// barrier according to \p Ordering. Note that the fence might be part of the

/// barrier.

void threads(atomic::OrderingTy Ordering);

/// Synchronizing threads is allowed even if they all hit different instances of

/// `synchronize::threads()`. However, `synchronize::threadsAligned()` is more

/// restrictive in that it requires all threads to hit the same instance. The

/// noinline is removed by the openmp-opt pass and helps to preserve the

/// information till then.

///{

#pragma omp begin assumes ext_aligned_barrier

/// Synchronize all threads in a block, they are reaching the same instruction

jplehr (inline comment, unsubmitted, not done):

- /// Synchronize all threads in a block, they are are reaching the same

+ /// Synchronize all threads in a block, they are reaching the same

/// (hence all threads in the block are "aligned"). Also perform a fence before

/// and after the barrier according to \p Ordering. Note that the

/// fence might be part of the barrier if the target offers this.

__attribute__((noinline)) void threadsAligned(atomic::OrderingTy Ordering);

#pragma omp end assumes

///}

} // namespace synchronize

namespace fence {

/// Memory fence with \p Ordering semantics for the team.

void team(atomic::OrderingTy Ordering);

/// Memory fence with \p Ordering semantics for the contention group.

void kernel(atomic::OrderingTy Ordering);

/// Memory fence with \p Ordering semantics for the system.

void system(atomic::OrderingTy Ordering);

} // namespace fence

} // namespace ompx

#endif

openmp/libomptarget/DeviceRTL/src/Kernel.cpp

Show All 34 Lines	static void genericStateMachine(IdentTy *Ident) {
    FunctionTracingRAII();

    uint32_t TId = mapping::getThreadIdInBlock();

    do {
      ParallelRegionFnTy WorkFn = nullptr;

      // Wait for the signal that we have a new work function.
-     synchronize::threads();
+     synchronize::threads(atomic::seq_cst);

      // Retrieve the work function from the runtime.
      bool IsActive = __kmpc_kernel_parallel(&WorkFn);

      // If there is nothing more to do, break out of the state machine by
      // returning to the caller.
      if (!WorkFn)
        return;

      if (IsActive) {
        ASSERT(!mapping::isSPMDMode());
        ((void (*)(uint32_t, uint32_t))WorkFn)(0, TId);
        __kmpc_kernel_end_parallel();
      }

-     synchronize::threads();
+     synchronize::threads(atomic::seq_cst);

    } while (true);
  }

  extern "C" {

  /// Initialization
  ///
  /// \param Ident Source location identification, can be NULL.
  ///
  int32_t __kmpc_target_init(IdentTy *Ident, int8_t Mode,
                             bool UseGenericStateMachine) {
    FunctionTracingRAII();
    const bool IsSPMD =
        Mode & llvm::omp::OMPTgtExecModeFlags::OMP_TGT_EXEC_MODE_SPMD;
    if (IsSPMD) {
      inititializeRuntime(/* IsSPMD */ true);
-     synchronize::threadsAligned();
+     synchronize::threadsAligned(atomic::relaxed);
    } else {
      inititializeRuntime(/* IsSPMD */ false);
      // No need to wait since only the main threads will execute user
      // code and workers will run into a barrier right away.
    }

    if (IsSPMD) {
      state::assumeInitialState(IsSPMD);

+     // Synchronize to ensure the assertions above are in an aligned region.
+     // The barrier is eliminated later.
+     synchronize::threadsAligned(atomic::relaxed);

jdoerfert (author, inline comment, done): Actually, the ones to form aligned regions should all use relaxed ordering as we really only want the barrier for synchronizing the threads.

      return -1;
    }

    if (mapping::isInitialThreadInLevel0(IsSPMD))
      return -1;

    // Enter the generic state machine if enabled and if this thread can possibly
    // be an active worker thread.
Show All 33 Lines
  /// and also any memory dynamically allocated by the runtime.
  ///
  /// \param Ident Source location identification, can be NULL.
  ///
  void __kmpc_target_deinit(IdentTy *Ident, int8_t Mode) {
    FunctionTracingRAII();
    const bool IsSPMD =
        Mode & llvm::omp::OMPTgtExecModeFlags::OMP_TGT_EXEC_MODE_SPMD;

+   synchronize::threadsAligned(atomic::acq_rel);
    state::assumeInitialState(IsSPMD);
+   synchronize::threadsAligned(atomic::relaxed);

    if (IsSPMD)
      return;

    // Signal the workers to exit the state machine and exit the kernel.
    state::ParallelRegionFn = nullptr;
  }

  int8_t __kmpc_is_spmd_exec_mode() {
    FunctionTracingRAII();
    return mapping::isSPMDMode();
  }
  }

  #pragma omp end declare target

openmp/libomptarget/DeviceRTL/src/Parallelism.cpp

Show First 20 Lines • Show All 107 Lines • ▼ Show 20 Lines	void __kmpc_parallel_51(IdentTy *ident, int32_t, int32_t if_expr,

    // From this point forward we know that there is no thread state used.
    ASSERT(state::HasThreadState == false);

    uint32_t NumThreads = determineNumberOfThreads(num_threads);
    if (mapping::isSPMDMode()) {
      // Avoid the race between the read of the `icv::Level` above and the write
      // below by synchronizing all threads here.
-     synchronize::threadsAligned();
+     synchronize::threadsAligned(atomic::seq_cst);
      {
        // Note that the order here is important. `icv::Level` has to be updated
        // last or the other updates will cause a thread specific state to be
        // created.
        state::ValueRAII ParallelTeamSizeRAII(state::ParallelTeamSize, NumThreads,
                                              1u, TId == 0, ident,
                                              /* ForceTeamState */ true);
        state::ValueRAII ActiveLevelRAII(icv::ActiveLevel, 1u, 0u, TId == 0,
                                         ident, /* ForceTeamState */ true);
        state::ValueRAII LevelRAII(icv::Level, 1u, 0u, TId == 0, ident,
                                   /* ForceTeamState */ true);

        // Synchronize all threads after the main thread (TId == 0) set up the
        // team state properly.
-       synchronize::threadsAligned();
+       synchronize::threadsAligned(atomic::acq_rel);

        state::ParallelTeamSize.assert_eq(NumThreads, ident,
                                          /* ForceTeamState */ true);
        icv::ActiveLevel.assert_eq(1u, ident, /* ForceTeamState */ true);
        icv::Level.assert_eq(1u, ident, /* ForceTeamState */ true);

+       // Ensure we synchronize before we run user code to avoid invalidating the
+       // assumptions above.
+       synchronize::threadsAligned(atomic::relaxed);

        if (TId < NumThreads)
          invokeMicrotask(TId, 0, fn, args, nargs);

        // Synchronize all threads at the end of a parallel region.
-       synchronize::threadsAligned();
+       synchronize::threadsAligned(atomic::seq_cst);
      }

      // Synchronize all threads to make sure every thread exits the scope above;
      // otherwise the following assertions and the assumption in
      // __kmpc_target_deinit may not hold.
-     synchronize::threadsAligned();
+     synchronize::threadsAligned(atomic::acq_rel);

      state::ParallelTeamSize.assert_eq(1u, ident, /* ForceTeamState */ true);
      icv::ActiveLevel.assert_eq(0u, ident, /* ForceTeamState */ true);
      icv::Level.assert_eq(0u, ident, /* ForceTeamState */ true);

+     // Ensure we synchronize to create an aligned region around the assumptions.
+     synchronize::threadsAligned(atomic::relaxed);

      return;
    }

    // We do not create a new data environment because all threads in the team
    // that are active are now running this parallel region. They share the
    // TeamState, which has an increase level-var and potentially active-level
    // set, but they do not have individual ThreadStates yet. If they ever
    // modify the ICVs beyond this point a ThreadStates will be allocated.
▲ Show 20 Lines • Show All 77 Lines • ▼ Show 20 Lines	state::ValueRAII ParallelRegionFnRAII(state::ParallelRegionFn, wrapper_fn,
                                            (void *)nullptr, true, ident,
                                            /* ForceTeamState */ true);
      state::ValueRAII ActiveLevelRAII(icv::ActiveLevel, 1u, 0u, true, ident,
                                       /* ForceTeamState */ true);
      state::ValueRAII LevelRAII(icv::Level, 1u, 0u, true, ident,
                                 /* ForceTeamState */ true);

      // Master signals work to activate workers.
-     synchronize::threads();
+     synchronize::threads(atomic::seq_cst);
      // Master waits for workers to signal.
-     synchronize::threads();
+     synchronize::threads(atomic::seq_cst);
    }

    if (nargs)
      __kmpc_end_sharing_variables();
  }

  __attribute__((noinline)) bool
  __kmpc_kernel_parallel(ParallelRegionFnTy *WorkFn) {
▲ Show 20 Lines • Show All 45 Lines • Show Last 20 Lines

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp

Show First 20 Lines • Show All 117 Lines • ▼ Show 20 Lines
  // Forward declarations defined to be defined for AMDGCN and NVPTX.
  uint32_t atomicInc(uint32_t *A, uint32_t V, atomic::OrderingTy Ordering);
  void namedBarrierInit();
  void namedBarrier();
  void fenceTeam(atomic::OrderingTy Ordering);
  void fenceKernel(atomic::OrderingTy Ordering);
  void fenceSystem(atomic::OrderingTy Ordering);
  void syncWarp(__kmpc_impl_lanemask_t);
- void syncThreads();
- void syncThreadsAligned() { syncThreads(); }
+ void syncThreads(atomic::OrderingTy Ordering);
+ void syncThreadsAligned(atomic::OrderingTy Ordering) { syncThreads(Ordering); }
  void unsetLock(omp_lock_t *);
  int testLock(omp_lock_t *);
  void initLock(omp_lock_t *);
  void destroyLock(omp_lock_t *);
  void setLock(omp_lock_t *);
  void unsetCriticalLock(omp_lock_t *);
  void setCriticalLock(omp_lock_t *);

▲ Show 20 Lines • Show All 120 Lines • ▼ Show 20 Lines	case atomic::seq_cst:
      return __builtin_amdgcn_fence(atomic::seq_cst, "");
    }
  }

  void syncWarp(__kmpc_impl_lanemask_t) {
    // AMDGCN doesn't need to sync threads in a warp
  }

- void syncThreads() { __builtin_amdgcn_s_barrier(); }
- void syncThreadsAligned() { syncThreads(); }
+ void syncThreads(atomic::OrderingTy Ordering) {
+   if (Ordering != atomic::relaxed)
+     fenceTeam(Ordering == atomic::acq_rel ? atomic::release : atomic::seq_cst);

dhruvachak (inline comment, not done): What if I want a release fence before the barrier and nothing else? As a client, I pass in atomic::release and I get a seq_cst fence before and after the barrier. Seems like an overkill.

jdoerfert (author, inline comment, done): If you want functionality beyond what is implemented, you need to implement it.

+   __builtin_amdgcn_s_barrier();

+   if (Ordering != atomic::relaxed)
+     fenceTeam(Ordering == atomic::acq_rel ? atomic::aquire : atomic::seq_cst);
+ }
+ void syncThreadsAligned(atomic::OrderingTy Ordering) { syncThreads(Ordering); }

JonChesterfield (inline comment, not done): I don't understand this. Why is the ordering on a fenceteam related to the ordering on syncthreads in this way? What about acquire_release? In general it seems hazardous that ordering is a runtime variable, if different threads passed in different ordering this would turn into a horrendous mess. Perhaps we should move it to a template parameter, and maybe static_assert in the syncThreads implementation that it meets whatever the constraints on it are.

  // TODO: Don't have wavefront lane locks. Possibly can't have them.
  void unsetLock(omp_lock_t *) { __builtin_trap(); }
  int testLock(omp_lock_t *) { __builtin_trap(); }
  void initLock(omp_lock_t *) { __builtin_trap(); }
  void destroyLock(omp_lock_t *) { __builtin_trap(); }
  void setLock(omp_lock_t *) { __builtin_trap(); }

▲ Show 20 Lines • Show All 48 Lines • ▼ Show 20 Lines
  void fenceTeam(atomic::OrderingTy) { __nvvm_membar_cta(); }

  void fenceKernel(atomic::OrderingTy) { __nvvm_membar_gl(); }

  void fenceSystem(atomic::OrderingTy) { __nvvm_membar_sys(); }

  void syncWarp(__kmpc_impl_lanemask_t Mask) { __nvvm_bar_warp_sync(Mask); }

- void syncThreads() {
+ void syncThreads(atomic::OrderingTy Ordering) {
    constexpr int BarrierNo = 8;
    asm volatile("barrier.sync %0;" : : "r"(BarrierNo) : "memory");
  }

- void syncThreadsAligned() { __syncthreads(); }
+ void syncThreadsAligned(atomic::OrderingTy Ordering) { __syncthreads(); }

  constexpr uint32_t OMP_SPIN = 1000;
  constexpr uint32_t UNSET = 0;
  constexpr uint32_t SET = 1;

  // TODO: This seems to hide a bug in the declare variant handling. If it is
  // called before it is defined
  // here the overload won't happen. Investigate lalter!
Show All 32 Lines

  void synchronize::init(bool IsSPMD) {
    if (!IsSPMD)
      impl::namedBarrierInit();
  }

  void synchronize::warp(LaneMaskTy Mask) { impl::syncWarp(Mask); }

- void synchronize::threads() { impl::syncThreads(); }
+ void synchronize::threads(atomic::OrderingTy Ordering) {
+   impl::syncThreads(Ordering);
+ }

- void synchronize::threadsAligned() { impl::syncThreadsAligned(); }
+ void synchronize::threadsAligned(atomic::OrderingTy Ordering) {
+   impl::syncThreadsAligned(Ordering);
+ }

  void fence::team(atomic::OrderingTy Ordering) { impl::fenceTeam(Ordering); }

  void fence::kernel(atomic::OrderingTy Ordering) { impl::fenceKernel(Ordering); }

  void fence::system(atomic::OrderingTy Ordering) { impl::fenceSystem(Ordering); }

  #define ATOMIC_COMMON_OP(TY) \
▲ Show 20 Lines • Show All 104 Lines • ▼ Show 20 Lines	if (mapping::isSPMDMode())
      return __kmpc_barrier_simple_spmd(Loc, TId);

    impl::namedBarrier();
  }

  __attribute__((noinline)) void __kmpc_barrier_simple_spmd(IdentTy *Loc,
                                                            int32_t TId) {
    FunctionTracingRAII();
-   synchronize::threadsAligned();
+   synchronize::threadsAligned(atomic::OrderingTy::seq_cst);
  }

  __attribute__((noinline)) void __kmpc_barrier_simple_generic(IdentTy *Loc,
                                                               int32_t TId) {
    FunctionTracingRAII();
-   synchronize::threads();
+   synchronize::threads(atomic::OrderingTy::seq_cst);
  }

  int32_t __kmpc_master(IdentTy *Loc, int32_t TId) {
    FunctionTracingRAII();
    return omp_get_thread_num() == 0;
  }

  void __kmpc_end_master(IdentTy *Loc, int32_t TId) { FunctionTracingRAII(); }
▲ Show 20 Lines • Show All 48 Lines • Show Last 20 Lines

openmp/libomptarget/test/offloading/barrier_fence.c

This file was added.

// RUN: %libomptarget-compile-generic -fopenmp-offload-mandatory -O3
// RUN: %libomptarget-run-generic

#include <omp.h>
#include <stdio.h>

struct IdentTy;
void __kmpc_barrier_simple_spmd(struct IdentTy *Loc, int32_t TId);
void __kmpc_barrier_simple_generic(struct IdentTy *Loc, int32_t TId);

#pragma omp begin declare target device_type(nohost)
static int A[512] __attribute__((address_space(3), loader_uninitialized));
static int B[512 * 32] __attribute__((loader_uninitialized));
#pragma omp end declare target

int main() {
  printf("Testing simple spmd barrier\n");
  for (int r = 0; r < 50; r++) {
#pragma omp target teams distribute thread_limit(512) num_teams(440)
    for (int j = 0; j < 512 * 32; ++j) {
#pragma omp parallel firstprivate(j)
      {
        int TId = omp_get_thread_num();
        int TeamId = omp_get_team_num();
        int NT = omp_get_num_threads();
        // Sequential
        for (int i = 0; i < NT; ++i) {
          // Test shared memory globals
          if (TId == i)
            A[i] = i + j;
          __kmpc_barrier_simple_spmd(0, TId);
          if (A[i] != i + j)
            __builtin_trap();
          __kmpc_barrier_simple_spmd(0, TId);
          // Test generic globals
          if (TId == i)
            B[TeamId] = i;
          __kmpc_barrier_simple_spmd(0, TId);
          if (B[TeamId] != i)
            __builtin_trap();
          __kmpc_barrier_simple_spmd(0, TId);
        }
      }
    }
  }

  printf("Testing simple generic barrier\n");
  for (int r = 0; r < 50; r++) {
#pragma omp target teams distribute thread_limit(512) num_teams(440)
    for (int j = 0; j < 512 * 32; ++j) {
#pragma omp parallel firstprivate(j)
      {
        int TId = omp_get_thread_num();
        int TeamId = omp_get_team_num();
        int NT = omp_get_num_threads();
        // Sequential
        for (int i = 0; i < NT; ++i) {
          if (TId == i)
            A[i] = i + j;
          __kmpc_barrier_simple_generic(0, TId);
          if (A[i] != i + j)
            __builtin_trap();
          __kmpc_barrier_simple_generic(0, TId);
          if (TId == i)
            B[TeamId] = i;
          __kmpc_barrier_simple_generic(0, TId);
          if (B[TeamId] != i)
            __builtin_trap();
          __kmpc_barrier_simple_generic(0, TId);
        }
      }
    }
  }
  return 0;
}
