This is an archive of the discontinued LLVM Phabricator instance.

Differential D154172

[OpenMP] Added memory scope to atomic::inc API and used the device scope in reduction.
Closed · Public

Authored by dhruvachak on Jun 29 2023, 5:42 PM.

Details

Reviewers

jdoerfert
arsenm
carlo.bertolli
tianshilei1992
jhuber6
ronlieb
JonChesterfield

Commits

rG6a1d1f7eefe8: [OpenMP] Added memory scope to atomic::inc API and used the device scope in…

Summary

With https://reviews.llvm.org/D137524, memory scope and ordering
attributes are being used to generate the required instructions for
atomic inc/dec on AMDGPU. This patch adds the memory scope attribute to
the atomic::inc API and uses the device scope in reduction. Without
the device scope in atomic_inc, the default system scope leads to
unnecessary L2 write-backs/invalidates.
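
For readers unfamiliar with the "increment and wrap" operation, here is a minimal host-side sketch of the semantics that the extended API describes. It is not the DeviceRTL implementation: `incWrap` and the `sketch_inc` namespace are hypothetical names, the wrap rule is assumed to match the usual GPU atomic-inc rule (store 0 when the old value has reached `V`), and the scope argument is accepted but unused because ISO C++ atomics have no notion of memory scope.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

namespace sketch_inc {

// Hypothetical stand-in for the scope enum added by the patch.
enum MemScopeTy { all, device, cgroup };

// Atomically increment Addr, wrapping to 0 once the old value reaches V.
// Returns the old value.
inline uint32_t incWrap(std::atomic<uint32_t> &Addr, uint32_t V,
                        std::memory_order Ordering,
                        MemScopeTy MemScope = MemScopeTy::all) {
  (void)MemScope; // Assumption: no scope concept on the host, so it is unused.
  uint32_t Old = Addr.load(std::memory_order_relaxed);
  uint32_t New;
  do {
    New = (Old >= V) ? 0u : Old + 1u;
  } while (!Addr.compare_exchange_weak(Old, New, Ordering,
                                       std::memory_order_relaxed));
  return Old;
}

// Usage: with V == 2 the counter cycles through 0, 1, 2 and back to 0.
inline void incWrapExample() {
  std::atomic<uint32_t> Cnt{0};
  assert(incWrap(Cnt, 2u, std::memory_order_seq_cst, device) == 0u);
  assert(incWrap(Cnt, 2u, std::memory_order_seq_cst, device) == 1u);
  assert(incWrap(Cnt, 2u, std::memory_order_seq_cst, device) == 2u);
  assert(Cnt.load() == 0u);
}

} // namespace sketch_inc
```

On a GPU target the scope argument is what lets the backend pick cheaper cache maintenance; on the host it only documents intent.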

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

dhruvachak created this revision. Jun 29 2023, 5:42 PM

Herald added a project: Restricted Project. Jun 29 2023, 5:42 PM

Herald added subscribers: sunshaoce, guansong, tpr, yaxunl.

dhruvachak requested review of this revision. Jun 29 2023, 5:42 PM

Herald added a reviewer: jdoerfert. Jun 29 2023, 5:43 PM

Herald added a project: Restricted Project.

Herald added subscribers: openmp-commits, jplehr, sstefan1.

dhruvachak added reviewers: arsenm, carlo.bertolli, tianshilei1992, jhuber6, ronlieb. Jun 29 2023, 5:44 PM

Herald added a subscriber: wdng. Jun 29 2023, 5:44 PM

Harbormaster completed remote builds in B242296: Diff 536070. Jun 29 2023, 5:46 PM

The memscope types added are kept consistent with the attributes in OpenMP spec 6.0 TR11. I think the system, device, and team scopes should be sufficient for the OpenMP use cases. Additional scopes are supported by the AMDGPU backend but they haven't been added here.

dhruvachak added a reviewer: JonChesterfield. Jun 29 2023, 5:54 PM

arsenm added inline comments. Jun 29 2023, 6:10 PM

openmp/libomptarget/DeviceRTL/include/Synchronization.h
29	Why does this need a new subset abstraction for scopes? I could understand a complete enum around all the named target scopes.
openmp/libomptarget/DeviceRTL/src/Reduction.cpp
226	I doubt this is target specific

dhruvachak added inline comments. Jun 29 2023, 6:32 PM

openmp/libomptarget/DeviceRTL/include/Synchronization.h
29	I tried to keep the memscope arch-independent, hence the additional abstraction. I kept the same memscopes as what the upcoming OpenMP spec is going to have. This way, the call to atomic::inc in DeviceRTL (e.g. in reduction) does not have to be arch-dependent. In other words, I did not want to pass in "agent" directly from the reduction code for amdgpu, as an example.
openmp/libomptarget/DeviceRTL/src/Reduction.cpp
226	Well, I don't know whether it is required for NVPTX for example, so I kept it around. As an aside, I think the system fence is too conservative, but again I kept it unchanged for archs other than amdgpu.
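
The arch-independent abstraction described above can be pictured as a small translation table from the OpenMP-style scope to the target's sync-scope name. The sketch below is illustrative only: `amdgpuSyncScope` and the `sketch_scope` namespace are hypothetical and not part of the patch, which instead spells the scope strings out as literals at each builtin call. The three strings themselves are the ones that appear in the diff.

```cpp
#include <cassert>
#include <cstring>

namespace sketch_scope {

// Mirrors the enum added in Synchronization.h.
enum MemScopeTy {
  all,    // All threads on all devices
  device, // All threads on the device
  cgroup  // All threads in the contention group, e.g. the team
};

// Hypothetical helper: the AMDGPU sync-scope string the patch uses per scope.
// The empty string selects the default (system) scope.
constexpr const char *amdgpuSyncScope(MemScopeTy Scope) {
  switch (Scope) {
  case all:
    return "";
  case device:
    return "agent";
  case cgroup:
    return "workgroup";
  }
  return ""; // Fall back to the widest scope for unknown values.
}

// Usage: generic code names the scope, never the target-specific string.
inline void scopeExample() {
  assert(std::strcmp(amdgpuSyncScope(device), "agent") == 0);
  assert(std::strcmp(amdgpuSyncScope(cgroup), "workgroup") == 0);
}

} // namespace sketch_scope
```

With this split, a caller such as the reduction code only ever writes the generic scope, and each target decides what it means.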

jdoerfert added inline comments. Jun 29 2023, 6:42 PM

openmp/libomptarget/DeviceRTL/include/Synchronization.h
33	Can we put a namespace or something around this? `atomic::device` and `atomic::all` are confusing. What about `atomic::scope::ABC`?
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
207	The above is expanded below.
213–226
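
The naming concern raised above is that an unscoped enum leaks its enumerators into the enclosing namespace, so a call site reads `atomic::all` with no hint that it names a memory scope. One way to address it, sketched here under assumed names (`sketch_ns`, `scope::Ty`, and `describe` are all hypothetical and not taken from the patch), is to nest the enum in its own namespace so the qualifier appears at every use.

```cpp
#include <cassert>
#include <cstring>

namespace sketch_ns {
namespace scope {
// Enumerators are reachable only as sketch_ns::scope::<name>.
enum Ty { all, device, cgroup };
} // namespace scope

// Hypothetical helper that returns the comment text attached to each scope.
constexpr const char *describe(scope::Ty S) {
  switch (S) {
  case scope::all:
    return "all threads on all devices";
  case scope::device:
    return "all threads on the device";
  case scope::cgroup:
    return "all threads in the contention group";
  }
  return "unknown";
}

// Usage: the call site now says what kind of thing `device` is.
inline void namespaceExample() {
  assert(std::strcmp(describe(scope::device), "all threads on the device") == 0);
}

} // namespace sketch_ns
```

An `enum class` would give the same qualification without the extra namespace; which spelling the final commit chose is not shown on this page.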

jdoerfert added inline comments. Jun 29 2023, 6:44 PM

openmp/libomptarget/DeviceRTL/src/Reduction.cpp
226	> As an aside, I think the system fence is too conservative, but again I kept it unchanged for archs other than amdgpu.
	This is unrelated. Could you elaborate on why you think the change is sound (on AMDGPUs)?

dhruvachak added inline comments. Jun 30 2023, 12:17 AM

openmp/libomptarget/DeviceRTL/src/Reduction.cpp
226	1. Agreed this is unrelated. I can separate this out if you like.
	2. With reference to https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-memory-model-gfx90a (gfx90a as an example), the guide points out the required instructions. And that's what we get with this patch for the atomic inc:

	    s_waitcnt vmcnt(0) lgkmcnt(0)
	    global_atomic_inc v0, v1, v0, s[2:3] glc
	    s_waitcnt vmcnt(0)
	    buffer_wbinvl1_vol

	If we bring back the system fence, here's the resulting sequence (fence followed by the atomic inc):

	    buffer_wbl2
	    s_waitcnt vmcnt(0) lgkmcnt(0)
	    buffer_invl2
	    buffer_wbinvl1_vol
	    s_waitcnt vmcnt(0) lgkmcnt(0)
	    global_atomic_inc v0, v1, v0, s[2:3] glc
	    s_waitcnt vmcnt(0)
	    buffer_wbinvl1_vol

	According to the guide, the L2 instructions are required for coherence between different agents. For the reduction use case, we need coherence within a GPU, which is provided by the buffer_wbinvl1_vol following the atomic inc. So we can drop the L2 instructions here. The duplicate waitcnt and the buffer_wbinvl1_vol can be dropped too. In other words, as long as the ordering and scope are correct on the atomic operation, we should not need an additional fence. Perhaps it was required before D137524.

Addressed feedback. Removed the fence change, used the suggested macro, and used the enum scope for clarity.

Harbormaster completed remote builds in B242483: Diff 536334. Jun 30 2023, 11:30 AM

dhruvachak marked 5 inline comments as done. Jun 30 2023, 11:31 AM

dhruvachak added inline comments.

openmp/libomptarget/DeviceRTL/src/Reduction.cpp
226	I removed this unrelated change, will post a separate patch for it.
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
213–226	Used the suggested macro. Thanks @jdoerfert
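
The macro suggested in the inline comment on Synchronization.cpp replaces five near-identical helper functions with a nested switch: the outer switch turns the runtime ordering into a constant, and the inner switch does the same for the scope. A host-compilable sketch of that shape follows. Everything in the `sketch_dispatch` namespace is hypothetical: `fakeBuiltinInc` stands in for `__builtin_amdgcn_atomic_inc32`, is not atomic, and merely records the arguments it was given. A `break` after the inner switch was added here so an out-of-range scope does not fall through to the next ordering.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

namespace sketch_dispatch {

enum OrderingTy { relaxed, acquire, release, acq_rel, seq_cst };
enum MemScopeTy { all, device, cgroup };

// Records what the stand-in "builtin" was last asked to do.
struct Call {
  OrderingTy Order;
  const char *Scope;
};
inline Call &lastCall() {
  static Call C{relaxed, ""};
  return C;
}

// Hypothetical stand-in for the target builtin. Single-threaded, NOT atomic.
inline uint32_t fakeBuiltinInc(uint32_t *A, uint32_t V, OrderingTy Order,
                               const char *Scope) {
  lastCall() = Call{Order, Scope};
  uint32_t Old = *A;
  *A = (Old >= V) ? 0u : Old + 1u;
  return Old;
}

inline uint32_t incDispatch(uint32_t *A, uint32_t V, OrderingTy Ordering,
                            MemScopeTy MemScope) {
// Inner switch: every call site passes a literal scope string.
#define ScopeSwitch(ORDER) \
  switch (MemScope) { \
  case all: \
    return fakeBuiltinInc(A, V, ORDER, ""); \
  case device: \
    return fakeBuiltinInc(A, V, ORDER, "agent"); \
  case cgroup: \
    return fakeBuiltinInc(A, V, ORDER, "workgroup"); \
  } \
  break;
// Outer switch case: every call site passes a constant ordering.
#define Case(ORDER) \
  case ORDER: \
    ScopeSwitch(ORDER)

  switch (Ordering) {
    Case(relaxed)
    Case(acquire)
    Case(release)
    Case(acq_rel)
    Case(seq_cst)
  }
#undef Case
#undef ScopeSwitch
  assert(false && "unhandled ordering or scope");
  return 0;
}

} // namespace sketch_dispatch
```

Calling `incDispatch(&X, 3, seq_cst, device)` on a zero-initialised `X` returns 0, leaves `X` at 1, and records the `"agent"` scope, which is the combination the reduction code asks for.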

dhruvachak edited the summary of this revision. Jun 30 2023, 11:35 AM

arsenm accepted this revision. Jun 30 2023, 11:58 AM

This revision is now accepted and ready to land. Jun 30 2023, 11:58 AM

Closed by commit rG6a1d1f7eefe8: [OpenMP] Added memory scope to atomic::inc API and used the device scope in… (authored by dhruvachak). Jun 30 2023, 12:05 PM

This revision was automatically updated to reflect the committed changes.

dhruvachak marked an inline comment as done.

dhruvachak added a commit: rG6a1d1f7eefe8: [OpenMP] Added memory scope to atomic::inc API and used the device scope in….

Revision Contents

Path                                                       Size
openmp/libomptarget/DeviceRTL/include/Synchronization.h    9 lines
openmp/libomptarget/DeviceRTL/src/Reduction.cpp            5 lines
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp      96 lines

Diff 536070

openmp/libomptarget/DeviceRTL/include/Synchronization.h

[20 lines not shown]
  enum OrderingTy {
    relaxed = __ATOMIC_RELAXED,
    aquire = __ATOMIC_ACQUIRE,
    release = __ATOMIC_RELEASE,
    acq_rel = __ATOMIC_ACQ_REL,
    seq_cst = __ATOMIC_SEQ_CST,
  };

+ enum MemScopeTy {
  [inline comment] arsenm: Why does this need a new subset abstraction for scopes? I could understand a complete enum…
  [inline comment] dhruvachak (author): I tried to keep the memscope arch-independent, hence the additional abstraction. I kept the…
+   all, // All threads on all devices
+   device, // All threads on the device
+   cgroup // All threads in the contention group, e.g. the team
+ };
  [inline comment] jdoerfert: Can we put a namespace or something around this? `atomic::device` and `atomic::all` are confusing.

  /// Atomically increment \p *Addr and wrap at \p V with \p Ordering semantics.
- uint32_t inc(uint32_t *Addr, uint32_t V, OrderingTy Ordering);
+ uint32_t inc(uint32_t *Addr, uint32_t V, OrderingTy Ordering,
+              MemScopeTy MemScope = MemScopeTy::all);

  /// Atomically perform <op> on \p V and \p *Addr with \p Ordering semantics. The
  /// result is stored in \p *Addr;
  /// {

  #define ATOMIC_COMMON_OP(TY) \
  TY add(TY *Addr, TY V, OrderingTy Ordering); \
  TY mul(TY *Addr, TY V, OrderingTy Ordering); \
[95 lines not shown]

openmp/libomptarget/DeviceRTL/src/Reduction.cpp

[217 lines not shown] int32_t __kmpc_nvptx_teams_reduce_nowait_v2(

  if (IsMaster) {
    int ModBockId = TeamId % num_of_records;
    if (TeamId < num_of_records) {
      lgcpyFct(GlobalBuffer, ModBockId, reduce_data);
    } else
      lgredFct(GlobalBuffer, ModBockId, reduce_data);

+ #ifndef __AMDGCN__
  [inline comment] arsenm: I doubt this is target specific
  [inline comment] dhruvachak (author): Well, I don't know whether it is required for NVPTX for example, so I kept it around. As an…
  [inline comment] jdoerfert: > As an aside, I think the system fence is too conservative, but again I kept it unchanged for…
  [inline comment] dhruvachak (author): 1. Agreed this is unrelated. I can separate this out if you like. 2. With reference to https…
  [inline comment] dhruvachak (author): I removed this unrelated change, will post a separate patch for it.
    fence::system(atomic::seq_cst);
+ #endif

    // Increment team counter.
    // This counter is incremented by all teams in the current
    // BUFFER_SIZE chunk.
-   ChunkTeamCount = atomic::inc(&Cnt, num_of_records - 1u, atomic::seq_cst);
+   ChunkTeamCount =
+       atomic::inc(&Cnt, num_of_records - 1u, atomic::seq_cst, atomic::device);
  }
  // Synchronize
  if (mapping::isSPMDMode())
    __kmpc_barrier(Loc, TId);

  // reduce_data is global or shared so before being reduced within the
  // warp we need to bring it in local memory:
  // local_reduce_data = reduce_data[i]
[79 lines not shown]

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp

[23 lines not shown]
  using namespace ompx;

  namespace impl {

  /// Atomics
  ///
  ///{
  /// NOTE: This function needs to be implemented by every target.
- uint32_t atomicInc(uint32_t *Address, uint32_t Val,
-                    atomic::OrderingTy Ordering);
+ uint32_t atomicInc(uint32_t *Address, uint32_t Val, atomic::OrderingTy Ordering,
+                    atomic::MemScopeTy MemScope);

  template <typename Ty>
  Ty atomicAdd(Ty *Address, Ty Val, atomic::OrderingTy Ordering) {
    return __atomic_fetch_add(Address, Val, Ordering);
  }

  template <typename Ty>
  Ty atomicMul(Ty *Address, Ty V, atomic::OrderingTy Ordering) {
[69 lines not shown]
  uint32_t atomicExchange(uint32_t *Address, uint32_t Val,
                          atomic::OrderingTy Ordering) {
    uint32_t R;
    __atomic_exchange(Address, &Val, &R, Ordering);
    return R;
  }
  ///}

  // Forward declarations defined to be defined for AMDGCN and NVPTX.
- uint32_t atomicInc(uint32_t *A, uint32_t V, atomic::OrderingTy Ordering);
+ uint32_t atomicInc(uint32_t *A, uint32_t V, atomic::OrderingTy Ordering,
+                    atomic::MemScopeTy MemScope);
  void namedBarrierInit();
  void namedBarrier();
  void fenceTeam(atomic::OrderingTy Ordering);
  void fenceKernel(atomic::OrderingTy Ordering);
  void fenceSystem(atomic::OrderingTy Ordering);
  void syncWarp(__kmpc_impl_lanemask_t);
  void syncThreads(atomic::OrderingTy Ordering);
  void syncThreadsAligned(atomic::OrderingTy Ordering) { syncThreads(Ordering); }
  void unsetLock(omp_lock_t *);
  int testLock(omp_lock_t *);
  void initLock(omp_lock_t *);
  void destroyLock(omp_lock_t *);
  void setLock(omp_lock_t *);
  void unsetCriticalLock(omp_lock_t *);
  void setCriticalLock(omp_lock_t *);

  /// AMDGCN Implementation
  ///
  ///{
  #pragma omp begin declare variant match(device = {arch(amdgcn)})

- uint32_t atomicInc(uint32_t *A, uint32_t V, atomic::OrderingTy Ordering) {
+ uint32_t atomicIncRelaxed(uint32_t *A, uint32_t V,
+                           atomic::MemScopeTy MemScope) {
+   switch (MemScope) {
+   default:
+     __builtin_unreachable();
+   case atomic::all:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::relaxed, "");
+   case atomic::device:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::relaxed, "agent");
+   case atomic::cgroup:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::relaxed, "workgroup");
+   }
+ }
+
+ uint32_t atomicIncAquire(uint32_t *A, uint32_t V, atomic::MemScopeTy MemScope) {
+   switch (MemScope) {
+   default:
+     __builtin_unreachable();
+   case atomic::all:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::aquire, "");
+   case atomic::device:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::aquire, "agent");
+   case atomic::cgroup:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::aquire, "workgroup");
+   }
+ }
+
+ uint32_t atomicIncRelease(uint32_t *A, uint32_t V,
+                           atomic::MemScopeTy MemScope) {
+   switch (MemScope) {
+   default:
+     __builtin_unreachable();
+   case atomic::all:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::release, "");
+   case atomic::device:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::release, "agent");
+   case atomic::cgroup:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::release, "workgroup");
+   }
+ }
+
+ uint32_t atomicIncAcqRel(uint32_t *A, uint32_t V, atomic::MemScopeTy MemScope) {
+   switch (MemScope) {
+   default:
+     __builtin_unreachable();
+   case atomic::all:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::acq_rel, "");
+   case atomic::device:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::acq_rel, "agent");
+   case atomic::cgroup:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::acq_rel, "workgroup");
+   }
+ }
+
+ uint32_t atomicIncSeqCst(uint32_t *A, uint32_t V, atomic::MemScopeTy MemScope) {
+   switch (MemScope) {
+   default:
+     __builtin_unreachable();
+   case atomic::all:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::seq_cst, "");
+   case atomic::device:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::seq_cst, "agent");
+   case atomic::cgroup:
+     return __builtin_amdgcn_atomic_inc32(A, V, atomic::seq_cst, "workgroup");
+   }
+ }
  [inline comment] jdoerfert (Not Done): The above is expanded below.
+
+ uint32_t atomicInc(uint32_t *A, uint32_t V, atomic::OrderingTy Ordering,
+                    atomic::MemScopeTy MemScope) {
    // builtin_amdgcn_atomic_inc32 should expand to this switch when
    // passed a runtime value, but does not do so yet. Workaround here.
    switch (Ordering) {
    default:
      __builtin_unreachable();
    case atomic::relaxed:
-     return __builtin_amdgcn_atomic_inc32(A, V, atomic::relaxed, "");
+     return atomicIncRelaxed(A, V, MemScope);
    case atomic::aquire:
-     return __builtin_amdgcn_atomic_inc32(A, V, atomic::aquire, "");
+     return atomicIncAquire(A, V, MemScope);
    case atomic::release:
-     return __builtin_amdgcn_atomic_inc32(A, V, atomic::release, "");
+     return atomicIncRelease(A, V, MemScope);
    case atomic::acq_rel:
-     return __builtin_amdgcn_atomic_inc32(A, V, atomic::acq_rel, "");
+     return atomicIncAcqRel(A, V, MemScope);
    case atomic::seq_cst:
-     return __builtin_amdgcn_atomic_inc32(A, V, atomic::seq_cst, "");
+     return atomicIncSeqCst(A, V, MemScope);
    }

  [inline comment, suggested edit] jdoerfert (Done):
    // passed a runtime value, but does not do so yet. Workaround here.
  - switch (Ordering) {
  - default:
  - __builtin_unreachable();
  - case atomic::relaxed:
  - return atomicIncRelaxed(A, V, MemScope);
  - case atomic::aquire:
  - return atomicIncAquire(A, V, MemScope);
  - case atomic::release:
  - return atomicIncRelease(A, V, MemScope);
  - case atomic::acq_rel:
  - return atomicIncAcqRel(A, V, MemScope);
  - case atomic::seq_cst:
  - return atomicIncSeqCst(A, V, MemScope);
  + #define ScopeSwitch(ORDER) \
  + switch (MemScope) { \
  + case atomic::all: \
  + return __builtin_amdgcn_atomic_inc32(A, V, ORDER, ""); \
  + case atomic::device: \
  + return __builtin_amdgcn_atomic_inc32(A, V, ORDER, "agent"); \
  + case atomic::cgroup: \
  + return __builtin_amdgcn_atomic_inc32(A, V, ORDER, "workgroup"); \
  + }
  + #define Case(ORDER) case ORDER: ScopeSwitch(ORDER)
  + Case(atomic::relaxed);
  + Case(atomic::aquire);
  + Case(atomic::release);
  + Case(atomic::acq_rel);
  + Case(atomic::seq_cst);
  + #undef Case
  + #undef ScopeSwitch
    }
    uint32_t SHARED(namedBarrierTracker);
  [inline comment] dhruvachak (author, Done): Used the suggested macro. Thanks @jdoerfert

  }

  uint32_t SHARED(namedBarrierTracker);

  void namedBarrierInit() {
    // Don't have global ctors, and shared memory is not zero init
    atomic::store(&namedBarrierTracker, 0u, atomic::release);
  }
[137 lines not shown]

  /// NVPTX Implementation
  ///
  ///{
  #pragma omp begin declare variant match( \
      device = {arch(nvptx, nvptx64)}, \
      implementation = {extension(match_any)})

- uint32_t atomicInc(uint32_t *Address, uint32_t Val,
-                    atomic::OrderingTy Ordering) {
+ uint32_t atomicInc(uint32_t *Address, uint32_t Val, atomic::OrderingTy Ordering,
+                    atomic::MemScopeTy MemScope) {
    return __nvvm_atom_inc_gen_ui(Address, Val);
  }

  void namedBarrierInit() {}

  void namedBarrier() {
    uint32_t NumThreads = omp_get_num_threads();
    ASSERT(NumThreads % 32 == 0);
[154 lines not shown]
  ATOMIC_FP_OP(double, int64_t, uint64_t)

  #undef ATOMIC_INT_ONLY_OP
  #undef ATOMIC_FP_ONLY_OP
  #undef ATOMIC_COMMON_OP
  #undef ATOMIC_INT_OP
  #undef ATOMIC_FP_OP

- uint32_t atomic::inc(uint32_t *Addr, uint32_t V, atomic::OrderingTy Ordering) {
-   return impl::atomicInc(Addr, V, Ordering);
+ uint32_t atomic::inc(uint32_t *Addr, uint32_t V, atomic::OrderingTy Ordering,
+                      atomic::MemScopeTy MemScope) {
+   return impl::atomicInc(Addr, V, Ordering, MemScope);
  }

  void unsetCriticalLock(omp_lock_t *Lock) { impl::unsetLock(Lock); }
  void setCriticalLock(omp_lock_t *Lock) { impl::setLock(Lock); }

  extern "C" {
  void __kmpc_ordered(IdentTy *Loc, int32_t TId) { FunctionTracingRAII(); }
[93 lines not shown]
