This is an archive of the discontinued LLVM Phabricator instance.

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
282	This doesn't work. You check the lock first, then you set it. It has to be an atomic step. Use an atomicCAS, the result tells you if it worked/if it is now set. This is the setLock code w/o the while. You can even call this in the while below.
290
294	Shouldn't the release and acquire semantics be part of this CAS? If we run relaxed who is to say we see the update of another thread. I would have assumed on failure we want an acquire fence and on success a release fence. Not sure if we need additional ones.

doru1004 added inline comments.Mar 15 2023, 1:17 PM

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
294	The idea is that the thread only executes the acquire once when it enters the critical region to prevent it from being executed at every iteration of the loop (which comes with performance penalties if it happens). I would like to understand why you say there's a chance we don't see the update of another thread. The unsetting of the lock is happening here: void unsetLock(omp_lock_t Lock) { (void)atomicExchange((uint32_t )Lock, UNSET, atomic::acq_rel); }

Herald added a subscriber: jplehr. · View Herald TranscriptMar 15 2023, 1:17 PM

doru1004 updated this revision to Diff 505607.Mar 15 2023, 1:20 PM

doru1004 marked 2 inline comments as done.

Harbormaster completed remote builds in B219720: Diff 505607.Mar 15 2023, 1:24 PM

jdoerfert added inline comments.Mar 15 2023, 3:48 PM

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
294	The relaxed unset doesn't make other memory effects visible to a thread that takes the lock next. I believe it should come with a proper fence, hence be atomic::release. Similarly, I am not sure if the update of the lock itself is guaranteed to be observed if the update and check are relaxed.

jdoerfert added inline comments.Mar 15 2023, 3:49 PM

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
294	UnsetLock actually has a release fence, testLock does not have proper fencing, sorry.

doru1004 updated this revision to Diff 505648.Mar 15 2023, 4:11 PM

doru1004 added inline comments.

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
294	I fixed testLock to use acq_rel.

Harbormaster completed remote builds in B219749: Diff 505648.Mar 15 2023, 4:14 PM

doru1004 added inline comments.Mar 16 2023, 10:08 AM

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
294	As far as I can tell the current combination of fences and CAS atomics works, please let me know if you have any comments or if I haven't addressed something.

dhruvachak added inline comments.Mar 16 2023, 10:16 AM

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
276	In theory, unset could simply do an atomic write to the Lock with release memorder. Any reason that's not used here? You are not using the prev value anyways.
291	I am getting confused by this pattern. You have: while (!atomicCAS(...) != UNSET) Can't the above be written as while (atomicCAS(...) == UNSET) But what does atomicCAS return? Bool or the prev value? If it is the prev value, shouldn't the condition be? while (atomicCAS(...) == SET) { sleep(); }
292	I think the CAS should take an acq_rel as the success memorder and an acq as the failure memorder. The acq in both the cases will ensure it sees an update from another thread. The release from the other updating thread is not sufficient for this thread to see that update. And when this thread succeeds, the release will ensure other threads see the update of this thread. With the above memorders, I would think we could get rid of the fences.

dhruvachak added inline comments.Mar 19 2023, 4:44 PM

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
290	Why do we need the release fence in setLock before the CAS?
292	I think the CAS should take an acq_rel as the success memorder and an acq as the failure memorder. The acq in both the cases will ensure it sees an update from another thread. The release from the other updating thread is not sufficient for this thread to see that update. And when this thread succeeds, the release will ensure other threads see the update of this thread. With the above memorders, I would think we could get rid of the fences. We had an offline discussion. Based on that, I am ok with the current memorders in the patch for the atomicCAS in setLock. C++11 spec has this statement "Implementations should make atomic stores visible to atomic loads within a reasonable amount of time." and since the CAS is done in a loop, the setLock should eventually see the update made by another thread that executed the unsetLock.

doru1004 added inline comments.Mar 20 2023, 7:11 AM

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
290	You have to release all of the ordinary stores before you write to the lock.

doru1004 marked an inline comment as done.Mar 20 2023, 7:12 AM

doru1004 added inline comments.

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
292	Sounds good. I'll mark as resolved.

doru1004 marked 2 inline comments as done.Mar 20 2023, 7:13 AM

doru1004 marked an inline comment as done.

doru1004 updated this revision to Diff 506614.Mar 20 2023, 9:03 AM

doru1004 marked an inline comment as done.

Harbormaster completed remote builds in B220464: Diff 506614.Mar 20 2023, 9:06 AM

I'm really sure that locks at thread scope do not work on amdgpu or pre-volta nvptx. One of the threads wins the cas, all the others do not, and it immediately deadlocks.

Critical sections can be done by rewriting the cfg, general purpose locks can't.

What am I missing here?

dhruvachak added inline comments.Mar 20 2023, 7:13 PM

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
290	In theory, setLock should have acquire semantics and unsetLock should have release semantics. So the setLock does not really need the release fence. But this patch is consistent between the set and the unset, namely both use acq_rel semantics. So I am ok with the current state of the patch on this aspect.

In D145831#4207338, @JonChesterfield wrote:

I'm really sure that locks at thread scope do not work on amdgpu or pre-volta nvptx. One of the threads wins the cas, all the others do not, and it immediately deadlocks.

Critical sections can be done by rewriting the cfg, general purpose locks can't.

What am I missing here?

What's missing is the outer loop which only allows the lowest thread in the wave to actually be active. All the other inactive wave threads are waiting at a synchronization point. Once the first thread in the wave executes the critical it joins the other threads and loops around again but this time the next thread will be active and so on. Does this help answer your question? Some of that code requires inspection of LLVM IR if you want to see it. This here is just the runtime bit.

In D145831#4207338, @JonChesterfield wrote:

I'm really sure that locks at thread scope do not work on amdgpu or pre-volta nvptx. One of the threads wins the cas, all the others do not, and it immediately deadlocks.

Critical sections can be done by rewriting the cfg, general purpose locks can't.

What am I missing here?

Agreed: this patch does not address "unstructured" locks in general, when there is inter-wave/warp thread divergence in GPUs that cannot guarantee forward progress for diverging lanes within the same wave.

What this is for is to support OpenMP critical sections, whose implementation is based on setLock/unsetLock. That's why the LIT test uses critical. The implementation - it is my understanding - calls very carefully set/unsetLock as @doru1004 indicated (one lane in the wave calss set/unsetLock and the others are waiting at a wave sync point).

doru1004 updated this revision to Diff 507120.Mar 21 2023, 2:11 PM

Harbormaster completed remote builds in B220833: Diff 507120.Mar 21 2023, 2:15 PM

OK then as written this is definitely going to blow up on us. We shouldn't implement the general purpose lock API if it deadlocks unless called in a very specific situation.

Probably best to emit the CAS in line as part of the IR transform, but otherwise we could add more runtime functions specific to critical. Uses of the general purpose omp_lock should be a compile time error on platforms that can't do it (it's unfortunate that lock returns void), but until then builtin_trap at least looks clearer when debugging than deadlock.

This revision now requires changes to proceed.Mar 21 2023, 3:20 PM

In D145831#4211430, @JonChesterfield wrote:

OK then as written this is definitely going to blow up on us. We shouldn't implement the general purpose lock API if it deadlocks unless called in a very specific situation.

Probably best to emit the CAS in line as part of the IR transform, but otherwise we could add more runtime functions specific to critical. Uses of the general purpose omp_lock should be a compile time error on platforms that can't do it (it's unfortunate that lock returns void), but until then builtin_trap at least looks clearer when debugging than deadlock.

How about introducing __kmpc_* functions for all the set/unset/test lock functions and have the compiler call the kmpc versions? The set/unset/test versions can trap for the time being. That way, this patch can go in with interface name changes.

In D145831#4211430, @JonChesterfield wrote:

OK then as written this is definitely going to blow up on us. We shouldn't implement the general purpose lock API if it deadlocks unless called in a very specific situation.

Could you provide a small reproducer for this issue? I'd like to include it in the testing + fix.

NB: I think I understand now that you probably mean the use of omp_set_lock and omp_unset_lock directly (so no need to provide an example I thought you meant you had some critical region situation) in which case yes this patch does not cater to the direct usage of those API functions.

How about introducing __kmpc_* functions for all the set/unset/test lock functions and have the compiler call the kmpc versions? The set/unset/test versions can trap for the time being. That way, this patch can go in with interface name changes.

Agreed. As Carlo stated this patch attempts to fix the critical region not omp_set_lock / omp_unset_lock in general. I will make the necessary changes to reflect that. I thought omp_set_lock and omp_unset_lock are internal to the runtime but since they are not I will return them to their original unimplemented state.

doru1004 updated this revision to Diff 507753.Mar 23 2023, 8:37 AM

Harbormaster completed remote builds in B221321: Diff 507753.Mar 23 2023, 8:40 AM

doru1004 updated this revision to Diff 507755.Mar 23 2023, 8:44 AM

Harbormaster completed remote builds in B221323: Diff 507755.Mar 23 2023, 8:48 AM

In D145831#4211430, @JonChesterfield wrote:

OK then as written this is definitely going to blow up on us. We shouldn't implement the general purpose lock API if it deadlocks unless called in a very specific situation.

Probably best to emit the CAS in line as part of the IR transform, but otherwise we could add more runtime functions specific to critical. Uses of the general purpose omp_lock should be a compile time error on platforms that can't do it (it's unfortunate that lock returns void), but until then builtin_trap at least looks clearer when debugging than deadlock.

@JonChesterfield I believe I have addressed your concerns and the changes I made are now confined to critical regions only.

jdoerfert added inline comments.Mar 23 2023, 9:54 AM

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
378	Nit: Move it into the generic part, plz. AMDGPU can overwrite it still.

doru1004 updated this revision to Diff 507908.Mar 23 2023, 3:51 PM

Harbormaster completed remote builds in B221439: Diff 507908.Mar 23 2023, 3:55 PM

I'm not confident that an acq_rel exchange in combination with the fences is correct. If it is correct, I think it must be suboptimal. Why have you gone with that implementation?

edit: I'd remove the requested changes if I could work out how to. Thanks for that change

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
279	I expected unset to be a relaxed store of unset, possibly surrounded by fences. I'm not totally confident an acq_rel ordering here synchronises with the fences in setCriticalLock, though it might do.
290	The fences probably need to be outside the branch on active thread, assuming the intent is for the lock to cover all the threads in the warp. Though I'm not sure it'll make any difference to codegen. I think relaxed ordering on both is right here.
473	These read like a copy/paste error - most functions nearby are calling impl:: with a similar name. Could you add a comment to the effect that nvptx uses the same lock implementation for critical sections as for other locks?
openmp/libomptarget/test/offloading/target_critical_region.cpp
5	why not nvptx? also why not x64, but that seems less immediately worrying

doru1004 added inline comments.Mar 24 2023, 6:35 AM

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
279	The acq_rel semantics apply to the lock variable (not the fences) and the atomic store assumed here synchronizes with the atomic load in setCriticalLock.
290	The fence here is to ensure all other ordinary stores happen before the fence so it is enough that the lane that gets through to execute it. Eventually all lanes will execute it.
473	I can add the comment I am not sure what the copy / paste error is.
openmp/libomptarget/test/offloading/target_critical_region.cpp
5	The nvptx implementation is broken so we avoid enabling it for this test. x64 is disabled since this test is really meant to be for critical regions in target regions.

OK, talked to some more people. Fences are fine inside the branch.

We need acquire/release fencing around the critical block so that code which expects to see writes from other threads going through the same critical block works.

We need mutual exclusion so that we actually have the critical semantics.

As written this patch should do that. Taking the mutex & doing device scope fencing for each lane in the wavefront is a slow thing to do but should work. Better would be to take the mutex once per warp, something like:

if (id == ffs(activemask)) {
  while (atomicCAS(...)) builtin_sleep();
  fence_acquire(agent)
}
for each lane in warp {
  fence_acquire(workgroup);
  critical-region-here
  fence_release(workgroup);
}
drop-mutex(agent)

This revision is now accepted and ready to land.Mar 24 2023, 10:34 AM

LG. All of my concerns have been resolved.

doru1004 marked 5 inline comments as done.Mar 24 2023, 11:18 AM

Doru Bercea <doru.bercea@amd.com> mentioned this in rG737291f1691a: Add support for critical regions in device code..Mar 24 2023, 11:21 AM

Commit: 737291f1691ace49688d6cf0a725ae4579b64dbe

Revision Contents

Path

Size

openmp/

libomptarget/

DeviceRTL/

src/

Synchronization.cpp

33 lines

test/

offloading/

target_critical_region.cpp

36 lines

Diff 507908

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp

Show First 20 Lines • Show All 124 Lines • ▼ Show 20 Lines

void syncWarp(__kmpc_impl_lanemask_t);

void syncThreads();

void syncThreadsAligned() { syncThreads(); }

void unsetLock(omp_lock_t *);

int testLock(omp_lock_t *);

void initLock(omp_lock_t *);

void destroyLock(omp_lock_t *);

void setLock(omp_lock_t *);

void unsetCriticalLock(omp_lock_t *);

void setCriticalLock(omp_lock_t *);

/// AMDGCN Implementation

///

///{

#pragma omp begin declare variant match(device = {arch(amdgcn)})

uint32_t atomicInc(uint32_t *A, uint32_t V, atomic::OrderingTy Ordering) {

// builtin_amdgcn_atomic_inc32 should expand to this switch when

▲ Show 20 Lines • Show All 123 Lines • ▼ Show 20 Lines

// TODO: Don't have wavefront lane locks. Possibly can't have them.

void unsetLock(omp_lock_t *) { __builtin_trap(); }

int testLock(omp_lock_t *) { __builtin_trap(); }

void initLock(omp_lock_t *) { __builtin_trap(); }

void destroyLock(omp_lock_t *) { __builtin_trap(); }

void setLock(omp_lock_t *) { __builtin_trap(); }

constexpr uint32_t UNSET = 0;

constexpr uint32_t SET = 1;

dhruvachakUnsubmitted

Not Done

In theory, unset could simply do an atomic write to the Lock with release memorder. Any reason that's not used here? You are not using the prev value anyways.

dhruvachak: In theory, unset could simply do an atomic write to the Lock with release memorder. Any reason…

void unsetCriticalLock(omp_lock_t *Lock) {

(void)atomicExchange((uint32_t *)Lock, UNSET, atomic::acq_rel);

}

JonChesterfieldUnsubmitted

Done

I expected unset to be a relaxed store of unset, possibly surrounded by fences. I'm not totally confident an acq_rel ordering here synchronises with the fences in setCriticalLock, though it might do.

JonChesterfield: I expected unset to be a relaxed store of unset, possibly surrounded by fences. I'm not totally…

doru1004AuthorUnsubmitted

Done

The acq_rel semantics apply to the lock variable (not the fences) and the atomic store assumed here synchronizes with the atomic load in setCriticalLock.

doru1004: The acq_rel semantics apply to the lock variable (not the fences) and the atomic store assumed…

void setCriticalLock(omp_lock_t *Lock) {

uint64_t LowestActiveThread = utils::ffs(mapping::activemask()) - 1;

jdoerfertUnsubmitted

Done

This doesn't work. You check the lock first, then you set it.
It has to be an atomic step. Use an atomicCAS, the result tells you if it worked/if it is now set. This is the setLock code w/o the while. You can even call this in the while below.

jdoerfert: This doesn't work. You check the lock first, then you set it. It has to be an atomic step. Use…

if (mapping::getThreadIdInWarp() == LowestActiveThread) {

fenceKernel(atomic::release);

while (!atomicCAS((uint32_t *)Lock, UNSET, SET, atomic::relaxed,

atomic::relaxed)) {

__builtin_amdgcn_s_sleep(32);

}

fenceKernel(atomic::aquire);

}

jdoerfertUnsubmitted

Done

void setLock(omp_lock_t *Lock) {

- uint64_t lowestActiveThread = utils::ffs(mapping::activemask()) - 1;

+ uint64_t LowestActiveThread = utils::ffs(mapping::activemask()) - 1;

if (mapping::getThreadIdInWarp() == lowestActiveThread) {

jdoerfert:

dhruvachakUnsubmitted

Done

Why do we need the release fence in setLock before the CAS?

dhruvachak: Why do we need the release fence in setLock before the CAS?

doru1004AuthorUnsubmitted

Done

You have to release all of the ordinary stores before you write to the lock.

doru1004: You have to release all of the ordinary stores before you write to the lock.

dhruvachakUnsubmitted

Done

In theory, setLock should have acquire semantics and unsetLock should have release semantics. So the setLock does not really need the release fence. But this patch is consistent between the set and the unset, namely both use acq_rel semantics. So I am ok with the current state of the patch on this aspect.

dhruvachak: In theory, setLock should have acquire semantics and unsetLock should have release semantics.

JonChesterfieldUnsubmitted

Done

The fences probably need to be outside the branch on active thread, assuming the intent is for the lock to cover all the threads in the warp. Though I'm not sure it'll make any difference to codegen.

I think relaxed ordering on both is right here.

JonChesterfield: The fences probably need to be outside the branch on active thread, assuming the intent is for…

doru1004AuthorUnsubmitted

Done

The fence here is to ensure all other ordinary stores happen before the fence so it is enough that the lane that gets through to execute it. Eventually all lanes will execute it.

doru1004: The fence here is to ensure all other ordinary stores happen before the fence so it is enough…

}

dhruvachakUnsubmitted

Done

I am getting confused by this pattern. You have:

while (!atomicCAS(...) != UNSET)

Can't the above be written as

while (atomicCAS(...) == UNSET)

But what does atomicCAS return? Bool or the prev value? If it is the prev value, shouldn't the condition be?

while (atomicCAS(...) == SET) { sleep(); }

dhruvachak: I am getting confused by this pattern. You have: while (!atomicCAS(...) != UNSET) Can't the…

dhruvachakUnsubmitted

Done

I think the CAS should take an acq_rel as the success memorder and an acq as the failure memorder.

The acq in both the cases will ensure it sees an update from another thread. The release from the other updating thread is not sufficient for this thread to see that update. And when this thread succeeds, the release will ensure other threads see the update of this thread.

With the above memorders, I would think we could get rid of the fences.

dhruvachak: I think the CAS should take an acq_rel as the success memorder and an acq as the failure…

dhruvachakUnsubmitted

Done

I think the CAS should take an acq_rel as the success memorder and an acq as the failure memorder.

The acq in both the cases will ensure it sees an update from another thread. The release from the other updating thread is not sufficient for this thread to see that update. And when this thread succeeds, the release will ensure other threads see the update of this thread.

With the above memorders, I would think we could get rid of the fences.

We had an offline discussion. Based on that, I am ok with the current memorders in the patch for the atomicCAS in setLock. C++11 spec has this statement "Implementations should make atomic stores visible to atomic loads within a reasonable amount of time." and since the CAS is done in a loop, the setLock should eventually see the update made by another thread that executed the unsetLock.

dhruvachak: > I think the CAS should take an acq_rel as the success memorder and an acq as the failure…

doru1004AuthorUnsubmitted

Done

Sounds good. I'll mark as resolved.

doru1004: Sounds good. I'll mark as resolved.

#pragma omp end declare variant

///}

jdoerfertUnsubmitted

Not Done

Shouldn't the release and acquire semantics be part of this CAS?
If we run relaxed who is to say we see the update of another thread. I would have assumed on failure we want an acquire fence and on success a release fence. Not sure if we need additional ones.

jdoerfert: Shouldn't the release and acquire semantics be part of this CAS? If we run relaxed who is to…

doru1004AuthorUnsubmitted

Done

The idea is that the thread only executes the acquire once when it enters the critical region to prevent it from being executed at every iteration of the loop (which comes with performance penalties if it happens).

I would like to understand why you say there's a chance we don't see the update of another thread. The unsetting of the lock is happening here:

void unsetLock(omp_lock_t *Lock) {
  (void)atomicExchange((uint32_t *)Lock, UNSET, atomic::acq_rel);
}

doru1004: The idea is that the thread only executes the acquire once when it enters the critical region…

jdoerfertUnsubmitted

Not Done

The relaxed unset doesn't make other memory effects visible to a thread that takes the lock next. I believe it should come with a proper fence, hence be atomic::release.

Similarly, I am not sure if the update of the lock itself is guaranteed to be observed if the update and check are relaxed.

jdoerfert: The relaxed unset doesn't make other memory effects visible to a thread that takes the lock…

jdoerfertUnsubmitted

Not Done

UnsetLock actually has a release fence, testLock does not have proper fencing, sorry.

jdoerfert: UnsetLock actually has a release fence, testLock does not have proper fencing, sorry.

doru1004AuthorUnsubmitted

Done

I fixed testLock to use acq_rel.

doru1004: I fixed testLock to use acq_rel.

doru1004AuthorUnsubmitted

Done

As far as I can tell the current combination of fences and CAS atomics works, please let me know if you have any comments or if I haven't addressed something.

doru1004: As far as I can tell the current combination of fences and CAS atomics works, please let me…

/// NVPTX Implementation

///

///{

#pragma omp begin declare variant match( \

device = {arch(nvptx, nvptx64)}, implementation = {extension(match_any)})

uint32_t atomicInc(uint32_t *Address, uint32_t Val,

▲ Show 20 Lines • Show All 67 Lines • ▼ Show 20 Lines

}

#pragma omp end declare variant

///}

} // namespace impl

void synchronize::init(bool IsSPMD) {

if (!IsSPMD)

jdoerfertUnsubmitted

Done

Nit: Move it into the generic part, plz. AMDGPU can overwrite it still.

jdoerfert: Nit: Move it into the generic part, plz. AMDGPU can overwrite it still.

impl::namedBarrierInit();

}

void synchronize::warp(LaneMaskTy Mask) { impl::syncWarp(Mask); }

void synchronize::threads() { impl::syncThreads(); }

void synchronize::threadsAligned() { impl::syncThreadsAligned(); }

▲ Show 20 Lines • Show All 78 Lines • ▼ Show 20 Lines

#undef ATOMIC_FP_ONLY_OP

#undef ATOMIC_COMMON_OP

#undef ATOMIC_INT_OP

#undef ATOMIC_FP_OP

uint32_t atomic::inc(uint32_t *Addr, uint32_t V, atomic::OrderingTy Ordering) {

return impl::atomicInc(Addr, V, Ordering);

}

JonChesterfieldUnsubmitted

Done

These read like a copy/paste error - most functions nearby are calling impl:: with a similar name. Could you add a comment to the effect that nvptx uses the same lock implementation for critical sections as for other locks?

JonChesterfield: These read like a copy/paste error - most functions nearby are calling impl:: with a similar…

doru1004AuthorUnsubmitted

Done

I can add the comment I am not sure what the copy / paste error is.

doru1004: I can add the comment I am not sure what the copy / paste error is.

void unsetCriticalLock(omp_lock_t *Lock) {

impl::unsetLock(Lock);

}

void setCriticalLock(omp_lock_t *Lock) {

impl::setLock(Lock);

}

extern "C" {

void __kmpc_ordered(IdentTy *Loc, int32_t TId) { FunctionTracingRAII(); }

void __kmpc_end_ordered(IdentTy *Loc, int32_t TId) { FunctionTracingRAII(); }

int32_t __kmpc_cancel_barrier(IdentTy *Loc, int32_t TId) {

FunctionTracingRAII();

__kmpc_barrier(Loc, TId);

▲ Show 20 Lines • Show All 52 Lines • ▼ Show 20 Lines

void __kmpc_syncwarp(uint64_t Mask) {

FunctionTracingRAII();

synchronize::warp(Mask);

}

void __kmpc_critical(IdentTy *Loc, int32_t TId, CriticalNameTy *Name) {

FunctionTracingRAII();

omp_set_lock(reinterpret_cast<omp_lock_t *>(Name));

impl::setCriticalLock(reinterpret_cast<omp_lock_t *>(Name));

}

void __kmpc_end_critical(IdentTy *Loc, int32_t TId, CriticalNameTy *Name) {

FunctionTracingRAII();

omp_unset_lock(reinterpret_cast<omp_lock_t *>(Name));

impl::unsetCriticalLock(reinterpret_cast<omp_lock_t *>(Name));

}

void omp_init_lock(omp_lock_t *Lock) { impl::initLock(Lock); }

void omp_destroy_lock(omp_lock_t *Lock) { impl::destroyLock(Lock); }

void omp_set_lock(omp_lock_t *Lock) { impl::setLock(Lock); }

void omp_unset_lock(omp_lock_t *Lock) { impl::unsetLock(Lock); }

int omp_test_lock(omp_lock_t *Lock) { return impl::testLock(Lock); }

} // extern "C"

#pragma omp end declare target

openmp/libomptarget/test/offloading/target_critical_region.cpp

This file was added.

				// RUN: %libomptarget-compilexx-run-and-check-generic

				// UNSUPPORTED: nvptx64-nvidia-cuda
				// UNSUPPORTED: nvptx64-nvidia-cuda-LTO
				// UNSUPPORTED: x86_64-pc-linux-gnu
				JonChesterfieldUnsubmitted Done Reply Inline Actions why not nvptx? also why not x64, but that seems less immediately worrying JonChesterfield: why not nvptx? also why not x64, but that seems less immediately worrying
				doru1004AuthorUnsubmitted Done Reply Inline Actions The nvptx implementation is broken so we avoid enabling it for this test. x64 is disabled since this test is really meant to be for critical regions in target regions. doru1004: The nvptx implementation is broken so we avoid enabling it for this test. x64 is disabled since…
				// UNSUPPORTED: x86_64-pc-linux-gnu-LTO

				#include <omp.h>
				#include <stdio.h>

				#define N 1000000

				int A[N];
				int main() {
				for (int i = 0; i < N; i++)
				A[i] = 1;

				int sum[1];
				sum[0] = 0;

				#pragma omp target teams distribute parallel for num_teams(256) \
				schedule(static, 1) map(to \
				: A[:N]) map(tofrom \
				: sum[:1])
				{
				for (int i = 0; i < N; i++) {
				#pragma omp critical
				{ sum[0] += A[i]; }
				}
				}

				// CHECK: SUM = 1000000
				printf("SUM = %d\n", sum[0]);

				return 0;
				}

This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][libomptarget] Add support for critical regions in AMD GPU device offloadingClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 507908

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp

openmp/libomptarget/test/offloading/target_critical_region.cpp

[OpenMP][libomptarget] Add support for critical regions in AMD GPU device offloading
ClosedPublic