
[OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement
Needs ReviewPublic

Authored by tianshilei1992 on Jun 16 2021, 1:24 PM.

Details

Summary

Asynchronous data movement can cause a data race if the target supports it.
Details can be found in [1]. This patch fixes the problem by attaching
an event to each entry of the data mapping table. Here are the details.

For each issued data movement, a new event is generated by calling createEvent and
returned to libomptarget. The event is attached to the corresponding mapping table
entry.

For each data mapping lookup, if no data movement is needed, the
attached event is inserted into the queue to guarantee that all following
operations in the queue execute only after the event is fulfilled.

This design avoids synchronization on the host side.

To support this logic, the following changes are needed:

  • Three plugin interfaces to create, destroy, and wait for an event.
  • A new mutable data member Event in HostDataToTargetTy.

Note that we are using CUDA terminology here. A similar mechanism is assumed to
be supported by other targets. Even if a target doesn't support it natively, it can
easily be implemented in the following fallback way:

  • Event can be any kind of flag with at least two states, 0 and 1.
  • waitEvent can simply busy-wait while Event is still 0.
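For a target without native events, the two-state fallback sketched above could look like this. This is a host-side model with hypothetical names (FallbackEvent, fulfillEvent, waitEvent), not the actual plugin code:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical fallback "event": a flag with two states, 0 (pending) and 1 (fulfilled).
struct FallbackEvent {
  std::atomic<int> Status{0};
};

// Fulfilling the event publishes all writes made before it.
void fulfillEvent(FallbackEvent &E) {
  E.Status.store(1, std::memory_order_release);
}

// waitEvent busy-loops while the event is still 0, as described above.
void waitEvent(const FallbackEvent &E) {
  while (E.Status.load(std::memory_order_acquire) == 0)
    std::this_thread::yield();
}
```

The release/acquire pair ensures that data written before fulfillEvent is visible to any thread that returns from waitEvent.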

My local testing shows that bug49334.cpp passes.

Reference:
[1] https://bugs.llvm.org/show_bug.cgi?id=49940

Diff Detail

Event Timeline

tianshilei1992 requested review of this revision.Jun 16 2021, 1:24 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 16 2021, 1:24 PM

This seems very coarse. If we want to schedule one task after another, we should use an API that adds that dependency edge. Using one for 'wait until everything is done, then continue' means as soon as one of those is scheduled, unrelated tasks slow down.

Is this only applicable per-async-object, and thus only affects a single async sequence?

Our async API could do with some cleanup and documentation to define what the constraints on it are.

openmp/libomptarget/include/omptarget.h
139 ↗(On Diff #352537)

Can async info be an opaque object? One void pointer seems sufficient for arbitrary targets, perhaps with 'queue' renamed to 'state' or similar

openmp/libomptarget/include/omptargetplugin.h
148

'barrier'? Not sure what dependency means here

openmp/libomptarget/src/device.h
53

'volatile mutable' reads as 'buggy', why is this volatile or mutable? Also, why is this target thing exposed to the host runtime at all, instead of being purely within an async object?

openmp/libomptarget/src/omptarget.cpp
604

Volatile != atomic. Also, I thought the point of this was to do sync on the target, so why is the host busy waiting on something?

jdoerfert added a comment.EditedJun 16 2021, 5:10 PM

I don't understand why we have an event in the async_info object at all. I would have assumed only to have some plugin-specific "lock" in the table. We want the plugin API to have: "void *getLock(async_info *)" and "waitForLock(void *, async_info *)".
If you have events and streams:
getLock -> create an event, add it to the stream, and return it.
waitForLock -> enqueue a wait in the stream.
If you don't have events:
getLock -> create a host lock and return it.
waitForLock -> wait for the host lock.
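A minimal host-side sketch of the fallback path of this proposed API (hypothetical names and a hypothetical releaseLock counterpart; the event-capable path would instead record an event and enqueue a wait into the stream):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical host "lock" for targets without native events:
// a small one-shot latch built on a condition variable.
struct HostLock {
  std::mutex M;
  std::condition_variable CV;
  bool Done = false;
};

// getLock: create a host lock (the event path would create a device event instead).
HostLock *getLock() { return new HostLock(); }

// releaseLock: called when the guarded operation (e.g. a data transfer) finishes.
void releaseLock(HostLock *L) {
  {
    std::lock_guard<std::mutex> G(L->M);
    L->Done = true;
  }
  L->CV.notify_all();
}

// waitForLock: block until the operation completed (the event path would enqueue
// a wait into the stream instead of blocking the host).
void waitForLock(HostLock *L) {
  std::unique_lock<std::mutex> G(L->M);
  L->CV.wait(G, [&] { return L->Done; });
}
```

The condition variable avoids the busy-wait of the simplest fallback at the cost of a heavier host-side primitive.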

openmp/libomptarget/src/device.h
53

It has to be here. As discussed in our call today. You need a lock per entry in the mapping table.
That said, I don't know why we want it to be volatile or mutable, the latter if we only get const references somewhere but that would be odd anyway.

tianshilei1992 added inline comments.Jun 16 2021, 5:11 PM
openmp/libomptarget/include/omptarget.h
139 ↗(On Diff #352537)

We need them both.

openmp/libomptarget/include/omptargetplugin.h
148

That's the interesting part. It is not a barrier. This operation is non-blocking: it just inserts the event into the queue, and that's it. So there is no wait on the host.

openmp/libomptarget/src/device.h
53

mutable is because std::set::iterator is const. It makes sense because we don't want to modify the key for the set. However, here HstPtrBegin is the key. In order to modify it via the iterator, we need to define it as mutable, or use const_cast when we want to modify it.

openmp/libomptarget/src/omptarget.cpp
604

There can still be a gap between creating an entry in the mapping table and issuing the data movement.

I don't understand why we have an event in the async_info object at all. I would have assumed only to have some plugin specific "lock" in the table.

That's the only way to get the event back from the plugin to libomptarget.

tianshilei1992 added inline comments.Jun 16 2021, 5:48 PM
openmp/libomptarget/include/omptarget.h
139 ↗(On Diff #352537)

I'll refine this part to remove it from async info.

So is the race this is trying to fix between the host and the target, or between regions running on the target?

Passing data back to the host suggests the former, but that doesn't make sense, because the async operations are queued up within an async object that is not shared between host threads (and has a lexically scoped lifetime, iirc).

If it's a race between different target regions on the device, we should be able to resolve that with no host involvement at all.

openmp/libomptarget/include/omptargetplugin.h
148

So if it's not a barrier, what is it? What is this intended to do?

protze.joachim added a subscriber: protze.joachim.EditedJun 17 2021, 1:05 AM

I must admit that I have no idea about the internals of libomptarget. But here is my high-level view on this issue:

From my perspective, the compiler should help to distinguish two situations (a weakness of OpenMP is that the DAG only evolves during execution):
(1) target tasks with dependencies that only synchronize with other target tasks on the same device (outgoing edges are limited to target tasks on the same device);
(2) target tasks with dependencies that synchronize with host code or target tasks on other devices.
If the compiler cannot prove (1), fall back to (2).

If the compiler can prove that a target task is of class (1), a fast path is possible by submitting all related device code onto the same device queue ("stream"). Similarly, if the runtime can prove that outgoing edges are limited to the same device and no new edges might be added, the runtime might identify additional target tasks of class (1). The latter might be difficult, because the information is in libomp, not in libomptarget.

#pragma omp target enter data map(to:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp target depend(inout:a[0:N]) nowait
{...}
#pragma omp target exit data map(from:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp taskwait

In this case, no host tasks are dependent on target task dependencies -> just submit the operations onto the same device queue.

Only if the host (possibly) has dependencies on the target task, fine-grained synchronization with the host is necessary:

#pragma omp target enter data map(to:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp target depend(inout:a[0:N]) nowait
{...}
#pragma omp target exit data map(from:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp task depend(inout:a[0:N])
{}

The first two target tasks can still rely on device queue synchronization; the last target task must ensure synchronization with the host. I think this could be mapped to the OpenMP completion-event model:

#pragma omp target exit data map(from:a[:N]) depend(inout:a[0:N]) detach(event) nowait
#pragma omp task depend(inout:a[0:N])
{}

Libomptarget calls omp_fulfill_event(event) when the exit data is completed, e.g., completion of the asynchronous memcpy is signaled with a callback. This should result in a clean interface for synchronization between libomp and libomptarget.
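The detach(event)/omp_fulfill_event interplay described here can be modeled on the host. This is a sketch with hypothetical names (OmpEvent, fulfill, waitDependent), not the libomp implementation:

```cpp
#include <cassert>
#include <future>
#include <thread>

// Hypothetical model of a detachable-task completion event: the plugin's
// completion callback fulfills the event, which releases the dependent task.
struct OmpEvent {
  std::promise<void> P;
  std::future<void> F = P.get_future();
};

// Stands in for omp_fulfill_event(event), invoked from the completion callback
// once the asynchronous transfer has finished.
void fulfill(OmpEvent &E) { E.P.set_value(); }

// The dependent host task blocks on the event instead of on the raw device queue.
void waitDependent(OmpEvent &E) { E.F.wait(); }
```

The promise/future pair gives the same happens-before guarantee the completion-event model needs: work done before fulfill is visible after waitDependent returns.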

I can see three candidates where things might break at the moment:

  • actions for target tasks with the same dependency are not submitted to the same device queue
  • device queues don't enforce the necessary ordering for asynchronous execution (copy-to-device, kernel launch, copy-from-device)
  • class (2) target tasks fail to ensure completion before they complete and release outgoing dependencies.

Especially the last point seems like a hot candidate: if the target task just launches the asynchronous execution but does not ensure completion before the task is marked completed, the implicit barrier at the end of BlockMatMul_TargetNowait could pass although not all device activity has completed.

(Quoted in full: protze.joachim's comment above.)

Thanks for the analysis and details. The root cause of this problem (for a target supporting async data movement internally) is in libomptarget. It is unrelated to tasks in libomp. At the end of each target task, there is a mandatory synchronization to guarantee that the task can only finish once everything it issued has finished. This assures that if a regular task depends on a target task, the existing mechanism works. You can refer to Section 3.4 of [1] for more details. As for detachable tasks, [1] also has a detailed analysis.

Reference:
[1] https://tianshilei.me/wp-content/uploads/concurrent-lcpc2020.pdf

tianshilei1992 added a comment.EditedJun 17 2021, 8:04 AM

So is the race this is trying to fix between the host and the target, or between regions running on the target?

Passing data back to the host suggests the former, but that doesn't make sense, because the async operations are queued up within an async object that is not shared between host threads (and has a lexically scoped lifetime, iirc).

If it's a race between different target regions on the device, we should be able to resolve that with no host involvement at all.

It's a data race between the host and the plugin. https://bugs.llvm.org/show_bug.cgi?id=49940 has a detailed analysis. Long story short, let's say we have two concurrent target regions T1 and T2, both reading memory region M. Assume T1 is scheduled to execute first. It then maps M to the device. The mapping consists of several steps:

  0. Get the lock of the mapping table.
  1. Look up the mapping table to see if M is already mapped. - No.
  2. Allocate device memory for M and create a mapping entry for M.
  3. Issue the data transfer.
  4. Release the lock.
  5. Do the remaining things, such as more data transfers, or directly launch the kernel.

Everything seems fine. Now T2 starts to execute. It also maps M to the device. However, the steps are slightly different:

  0. Get the lock of the mapping table.
  1. Look up the mapping table to see if M is already mapped. - Yes.
  2. Since the mapping is already there, T2 will NOT issue a data transfer. This improves performance; it doesn't make sense to transfer the same data multiple times.
  3. Do the remaining things, like Step 5 above.

This logic also looks fine. However, there is a data race. Since we now have async data movement, Step 3 for T1 can only guarantee that the movement of M has been issued; it cannot guarantee that after Step 3 the data is already on the device. T2's logic is: if the mapping is already in the table, it "assumes" the data is already on the device, and therefore it doesn't issue a data transfer. And then, bang! If T2 is scheduled to execute by the GPU scheduler ahead of the data movement of M, the memory region on the device contains nothing from the host; its contents are random.

So in one sentence, the root cause is the mismatch between the assumption in the mapping lookup logic and the non-atomic data movement operations. It is not a problem for targets that don't support async data movement, because everything can be guarded by the lock. Also, for targets that do support async data movement, we don't want to fall back to synchronous data movement here, for the obvious reason: performance. Data movement is extremely expensive, and multiple data movements can potentially run in parallel, even within a single target region.

What I'm trying to do in this patch is to establish the dependency between the data movement and all other tasks using that data. We want to utilize device-side synchronization for this, using an event. The interesting thing about the event is that, from the host's perspective, "wait for an event" just inserts the event into the queue, and that's it. It is not like a barrier, because it does not block the host. But yes, it is a barrier for the device scheduler: all enqueued operations following the event can only start execution once the event is fulfilled.
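The T1/T2 scenario and the event-based fix can be modeled on the host with plain threads. This is a sketch with hypothetical names (MapEntry, mapAndIssueCopy, reuseMapping); the real patch attaches a device event to the entry and enqueues the wait into the device queue rather than busy-waiting:

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <mutex>
#include <thread>

// Hypothetical model of the fix: each mapping-table entry carries an "event"
// that is fulfilled when its asynchronous host-to-device copy completes.
struct MapEntry {
  std::atomic<bool> CopyDone{false}; // the attached event
  int DeviceData = 0;                // stands in for device memory
};

std::mutex TableLock;
std::map<void *, MapEntry> MappingTable;

// T1 path: create the entry under the table lock, then issue the copy
// asynchronously (modeled as a worker thread).
std::thread mapAndIssueCopy(void *HostPtr, int HostValue) {
  std::lock_guard<std::mutex> G(TableLock);
  MapEntry &E = MappingTable[HostPtr];
  return std::thread([&E, HostValue] {
    E.DeviceData = HostValue;                          // simulated async transfer
    E.CopyDone.store(true, std::memory_order_release); // fulfill the event
  });
}

// T2 path: the entry is already mapped, so no new transfer is issued --
// but T2 must wait on the attached event before using the data.
int reuseMapping(void *HostPtr) {
  MapEntry *E;
  {
    std::lock_guard<std::mutex> G(TableLock);
    E = &MappingTable[HostPtr];
  }
  while (!E->CopyDone.load(std::memory_order_acquire))
    std::this_thread::yield(); // the real fix enqueues a device-side wait here
  return E->DeviceData;
}
```

Without the CopyDone check, reuseMapping could read DeviceData before the simulated transfer lands, which is exactly the race described above.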

(Quoted in full: the question and the detailed answer above.)

Thanks for the explanation, I think, I got the issue now.

Following your example, if T1 and T2 are really concurrent, I think it is an application issue (a data race) if they both perform the initial mapping of the same memory without synchronization. I would bring this to the OpenMP LC.

For bug49334.cpp this means that depend(in: BlockA[0], BlockB[0]) is not sufficient and the dependency should rather be depend(inout: BlockA[0], BlockB[0]). The out dependency reflects the writing to the device copy of the variables or, respectively, the possible initial mapping.

(Quoted: the root-cause summary and patch description above.)

I believe that your root cause description is correct, but I don't understand how the locks currently present in the mapping logic can fix the case where a target device only implements the synchronous API (which is the first problem described in 49940). Let me explain what I mean.

It's a data race between host and plugin. https://bugs.llvm.org/show_bug.cgi?id=49940 has detailed analysis. Long story short, let's say we have two concurrent target regions T1 and T2 both reading memory region M. Assume T1 is scheduled to execute first. It then maps M to the device. The mapping consists of several steps:
0. Get the lock of mapping table.

  1. Look up into the mapping table to see if M is already mapped. - No.
  2. Allocate device memory for M and create a mapping entry for M.
  3. Issue data transfer.
  4. Release the lock.
  5. Do remaining things, such as more data transfer, or directly launch the kernel.

Everything seems fine. Now T2 starts to execute. It also maps M to the device. However, steps are slightly different.
0. Get the lock of mapping table.

  1. Look up into the mapping table to see if M is already mapped. - Yes.
  2. Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.
  3. Do remaining things, like Step 5 above.

It also looks good for this logic. However, there is a data race. Since we now have async data movement, Step 3 for T1 can only guarantee that movement of M is issued, but it cannot guarantee that after Step 3, the data is already on the device. For T2, its logic is, if the mapping is already in the table, it "assumes" that the data is already on the device, and therefore it doesn't issue data transfer. And then, bang! For the case that T2 is scheduled to execute by the GPU scheduler ahead of the data movement of M, the memory region on the device contains nothing from host. It is random.

To the best of my knowledge, the given step-by-step has a small problem: the mapping table lock is not released in step 4; instead, it is released between steps 2 and 3.

Since the proposed patch stores the event in the AsyncInfoTy, when interacting with a synchronous-only plugin we can still have the same data corruption problem: T1 may be paused right after the mapping table lock is released (between steps 2 and 3), and T2 will go through its target region execution without issuing the appropriate data transfer.

If what I described above is correct, we can fix this synchronous case with two simple changes added to the https://reviews.llvm.org/D104382 patch:

  1. Add a mutex to HostDataToTargetTy that synchronizes concurrent external accesses (meaning accesses from outside the DeviceTy code);
  2. TargetPointerResultTy locks this mutex on construction and unlocks it on destruction (a simple std::unique_lock suffices).

This way we ensure that multiple threads dealing with the same mapping-table entry are synchronized, fixing the first problem in 49940. What do you think?
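The two proposed changes can be sketched as follows. This is a hypothetical simplification of HostDataToTargetTy and TargetPointerResultTy, not the actual libomptarget definitions:

```cpp
#include <cassert>
#include <mutex>

// Proposal sketch: a per-entry mutex that serializes external accesses
// to a single mapping-table entry.
struct HostDataToTargetTy {
  std::mutex EntryLock;
  void *TgtPtrBegin = nullptr;
};

// The result type holds the entry lock for its whole lifetime:
// locked on construction, released on destruction (RAII via std::unique_lock).
struct TargetPointerResultTy {
  HostDataToTargetTy *Entry;
  std::unique_lock<std::mutex> Guard;
  explicit TargetPointerResultTy(HostDataToTargetTy &E)
      : Entry(&E), Guard(E.EntryLock) {}
};
```

Tying the lock to the result object's lifetime means callers cannot forget to release it, and the window between "lookup" and "use" stays protected.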

Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.

That doesn't sound right. There's no reason to expect the data that was copied across will not be modified by the first region. The openmp semantics look like copy to device at target region start and copy back at the end. Picking up partially written data from another executing kernel breaks that.

Given two target regions that both map the same data, I'm fairly sure both maps have to happen, unless we do significantly more thorough analysis to determine that the output of one kernel is definitely the same as the input to the next, in which case we would be safer fusing them.

ye-luo added a subscriber: ye-luo.Jun 17 2021, 9:49 AM

Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.

That doesn't sound right. There's no reason to expect the data that was copied across will not be modified by the first region. The openmp semantics look like copy to device at target region start and copy back at the end. Picking up partially written data from another executing kernel breaks that.

Given two target regions that both map the same data, I'm fairly sure both maps have to happen, unless we do significantly more thorough analysis to determine that the output of one kernel is definitely the same as the input to the next, in which case we would be safer fusing them.

I don't agree with you. T2 should follow the reference-counting transfer behavior and not issue a data transfer. If T1 changes things on the device and surprises T2, it is a bug in the user code; the user is responsible for expressing the dependency between the target tasks.

Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.

That doesn't sound right. There's no reason to expect the data that was copied across will not be modified by the first region. The openmp semantics look like copy to device at target region start and copy back at the end. Picking up partially written data from another executing kernel breaks that.

Given two target regions that both map the same data, I'm fairly sure both maps have to happen, unless we do significantly more thorough analysis to determine that the output of one kernel is definitely the same as the input to the next, in which case we would be safer fusing them.

Like Ye said, the two target regions both *read* from the same memory region. It is not a race condition from the user's perspective.

(Quoted: the root-cause summary and patch description above.)

but I don't understand how the mentioned locks currently present in mapping logic can fix the case when a target device only implements the synchronous API (which is the first problem described in 49940).

That's a good point. Thanks for pointing it out.

(Quoted in full: the step-by-step and the proposed two-step fix above.)

That sounds right, which means for each entry, there can only be one thread touching it. Or actually we can pass the update decision to getOrAllocTgtPtr such that we don't have to deal with lock(s) across functions.

Or actually we can pass the update decision to getOrAllocTgtPtr such that we don't have to deal with lock(s) across functions.

This could be a nicer solution, alongside a proper renaming of getOrAllocTgtPtr to reflect its multiple responsibilities (if you go forward with it, of course). Just two comments though:

  1. It would be good to check whether other "users" of DeviceTy need the same synchronization mechanism. If so, they could use the same changes introduced by https://reviews.llvm.org/D104382 (getTgtPtrBegin used by targetDataEnd is the only example I can think of);
  2. Maybe letting the lifetime of TargetPointerResultTy instances determine how long an entry stays locked is the most future-proof solution. I am thinking about future changes to targetDataBegin (and maybe other functions) that could use the additional synchronization. But this is just a thought.

Following your example, if T1 and T2 are really concurrent, I think it should be an application issue (data race), if they both perform initial mapping on the same memory without synchronization. I would bring this to the OpenMP LC.

Not exactly. Two concurrent target regions can read from same memory region.

int array[1024];
#pragma omp target map(to: data) nowait
{ /* read array */ }
#pragma omp target map(to: array) nowait
{ /* read array */ }

I think this is a valid program, both from the user's perspective and the specification's perspective. How to handle data mapping is an implementation detail IMO, but the implementation should guarantee that array is the same in the two regions. That is currently not the case for us.

For bug49334.cpp this means, that depend(in: BlockA[0], BlockB[0]) is not sufficient and the dependency should rather be depend(inout: BlockA[0], BlockB[0]). The out dependency reflects the writing to the device copy of the variables or, respectively, the possible initial mapping.

Correct, but in the body of the task, BlockA[0] and BlockB[0] are only read. It's valid to mark them as in only.

Correct, but in the body of the task, BlockA[0] and BlockB[0] are only read. It's valid to mark them as in only.

What I mean is the write to the corresponding variable implied by the map.
Do you agree that the following would be invalid and needs inout for target enter data?

#pragma omp target enter data nowait map(to: BlockA[:BS * BS]) depend(in: BlockA[0])
#pragma omp target nowait depend(in: BlockA[0])
{}

Correct, but in the body of the task, BlockA[0] and BlockB[0] are only read. It's valid to mark them as in only.

What I mean is the write to the corresponding variable implied by the map.
Do you agree that the following would be invalid and needs inout for target enter data?

#pragma omp target enter data nowait map(to: BlockA[:BS * BS]) depend(in: BlockA[0])
#pragma omp target nowait depend(in: BlockA[0])
{}

That is actually a very interesting question. Theoretically, the dependency setting here is not wrong, but it indeed doesn't work. However, whether a variable is written or read cannot be implied by the map. We can definitely write to a variable in a target region but not transfer it back (map(to)), and we can also only read the variable yet still transfer it back (map(tofrom)). Both are valid.

tianshilei1992 planned changes to this revision.Jun 18 2021, 12:39 PM
tianshilei1992 added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
24

This should be unrelated...

openmp/libomptarget/src/device.cpp
14

Must be code automatically generated by clangd.

openmp/libomptarget/src/device.h
16

unrelated

53

We have to use shared_ptr here as this data structure has to be copied but std::atomic is non-copyable.

openmp/libomptarget/src/omptarget.cpp
591–601

Here is very critical. Expect opinions.

595

But with D104555, these whole bunch of code will be moved into getTargetPointer.

openmp/libomptarget/test/offloading/bug49334.cpp
73 ↗(On Diff #353077)

With D104552, this is no longer needed.

tianshilei1992 requested review of this revision.Jun 18 2021, 12:41 PM
tianshilei1992 edited the summary of this revision. (Show Details)Jun 18 2021, 12:43 PM

The logic now looks much more sane to me. This is what I was expecting. Plugins can implement "events" the way they want to.

openmp/libomptarget/src/device.h
53

Why is this a shared ptr?

openmp/libomptarget/src/omptarget.cpp
602

Worth a helper to avoid duplication.

615

I don't get why we need this. At the moment we look at the Entry, I would expect it either has an event or it doesn't, but neither case is cause for waiting. When a new entry is created, it should only be exposed to others once the event has been set up as well. Plugins that don't need events can just use nullptr without it being a problem. If this does turn out to be a problem, I suspect we want a separate flag in the entry for the status: the event should be allowed to be null, and the flag can be "in_progress", which indicates the rest needs to wait.

tianshilei1992 added inline comments.Jun 18 2021, 1:04 PM
openmp/libomptarget/src/omptarget.cpp
615

This can be removed once this whole chunk of code is moved to the new function in D104555. Otherwise it can still happen that the new entry is created but the data movement has not been issued yet, because this function is not guarded by the lock.

tianshilei1992 added inline comments.Jun 18 2021, 4:15 PM
openmp/libomptarget/src/device.h
53

Because unique_ptr is non-copyable.

openmp/libomptarget/src/device.h
53

Could write a copy constructor. It's also possible it only needs to be moved, not copied. Don't really want to add an ad hoc garbage collector (aka shared_ptr) to the existing complexity if we can avoid it; lifetimes are opaque enough already.

removed unrelated changes

tianshilei1992 retitled this revision from [PoC][WIP][OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement to [OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement.Jun 20 2021, 5:30 PM
tianshilei1992 edited the summary of this revision. (Show Details)
protze.joachim added inline comments.Jun 23 2021, 8:24 AM
openmp/libomptarget/src/device.cpp
170–173

Which patch removes the bool &IsHostPtr argument?
I cannot see this change in any of the patches in the stack or on main.

This patch fails to apply after applying the other two patches in the stack.

openmp/libomptarget/src/omptarget.cpp
501–506

This probably shouldn't be here?

openmp/libomptarget/src/omptarget.cpp
501–506

Yeah, it's a rebase error. I'll reorganize the patch stack.

Reading the event-based solution, one corner case came to my mind (it is odd, but possible):

#pragma omp target map(to: A, B) nowait
{ /* read A and B */ }

#pragma omp target map(to: B, A) nowait
{ /* read A and B */ }

Note that the only difference between them (besides their internal code) is the ordering of the buffers in the mapping list.

Since both regions just read A and B, there is no need for a dependency between them (from the programmers perspective) and, thus, they can be executed by two different threads as follows:

  • T1: alloc A -> issue A submission -> create event for A -> alloc B (not new, MoveData = false) -> wait for B event indefinitely ...
  • T2: alloc B -> issue B submission -> create event for B -> alloc A (not new, MoveData = false) -> wait for A event indefinitely ...

This issue seems similar to the lock ordering problem. It can occur depending on the answers to the questions below:

  1. Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (__tgt_rtl_synchronize)? If yes, then the above problem can occur since no synchronization call is done.
  2. Should a plugin implementation indirectly synchronize the queue where an event was created when a call to __tgt_rtl_wait_event is made? If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.
  3. Are the variables in the mapping list ordered somehow? If that is the case, then the (possible) problem described above will never happen as well (like resolving it through lock ordering).

If this is a problem, sorting the mapping list or ensuring the behavior described in the second question would probably solve it. What do you think? Is this line of thought correct?

Reading the event-based solution, one corner case came to my mind (it is odd, but possible):

#pragma omp target map(to: A, B) nowait
{ /* read A and B */ }

#pragma omp target map(to: B, A) nowait
{ /* read A and B */ }

Note that the only difference between them (besides their internal code) is the ordering of the buffers in the mapping list.

Since both regions just read A and B, there is no need for a dependency between them (from the programmers perspective) and, thus, they can be executed by two different threads as follows:

  • T1: alloc A -> issue A submission -> create event for A -> alloc B (not new, MoveData = false) -> wait for B event indefinitely ...
  • T2: alloc B -> issue B submission -> create event for B -> alloc A (not new, MoveData = false) -> wait for A event indefinitely ...

The case you mentioned is not a problem. Say we have two queues for the two target regions. The contents of each queue are:
Q1: data movement of A, notification of event E1, wait on E2, kernel issue.
Q2: data movement of B, notification of event E2, wait on E1, kernel issue.
You can see the notification of the two events always comes before the waits. No matter the execution order, it will not deadlock.

  1. Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (__tgt_rtl_synchronize)? If yes, then the above problem can occur since no synchronization call is done.

I'm not sure I follow. The plugin will never defer any operation. If the target supports async operations, the plugin just issues the operation, and that's it. Whether the device defers it or not is transparent to the plugin.

  2. Should a plugin implementation indirectly synchronize the queue where an event was created when a call to __tgt_rtl_wait_event is made? If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.

Creating an event just needs to "take a snapshot" of the current queue. It doesn't need to synchronize. We just need to make sure that:

  • The event will be notified only if all operations in the "snapshot" are finished, or say all operations before it are finished.
  • If any operation depends on the event, it will not be executed until the event is fulfilled/notified.
  3. Are the variables in the mapping list ordered somehow? If that is the case, then the (possible) problem described above will never happen as well (like resolving it through lock ordering).

That's a good question. I don't have a clear answer for that. I recall the front end sorts them somehow, but I don't know in what order.

This comment was removed by gValarini.

The case you mentioned is not a problem. Say we have two queues for the two target regions. The contents of each queue are:
Q1: data movement of A, notification of event E1, wait on E2, kernel issue.
Q2: data movement of B, notification of event E2, wait on E1, kernel issue.
You can see the notification of the two events always comes before the waits. No matter the execution order, it will not deadlock.

Uhm, I think I misunderstood how the events would interact with the queues. But now I doubt whether __tgt_rtl_wait_event should be allowed to be blocking. Let me explain with my first question at #2850555.

  1. Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (__tgt_rtl_synchronize)? If yes, then the above problem can occur since no synchronization call is done.

I'm not sure I follow. The plugin will never defer any operation. If the target supports async operations, the plugin just issues the operation, and that's it. Whether the device defers it or not is transparent to the plugin.

Some of the async APIs that I am most used to (like the C++ async library and the non-blocking MPI functions) have the restriction that one should never consider any computation/communication to be done (or even in-flight) until their respective synchronization functions are properly called. This is needed considering that a functional (but not optimized) implementation of those async APIs would be to defer their execution until such a synchronization point. If I recall correctly, this restriction is even present in the CUDA streams specification, but please don't quote me on that since I am not a CUDA expert.

If libomptarget follows such principles, __tgt_rtl_wait_event should not be allowed to be blocking, since that would imply that the "notification of event E#" you described could be assumed to be completed before __tgt_rtl_synchronize is called, which would be a bit strange for an async API. This behavior is what led me to my second question at #2850555 regarding possible indirect synchronizations done by __tgt_rtl_wait_event.

  2. Should a plugin implementation indirectly synchronize the queue where an event was created when a call to __tgt_rtl_wait_event is made? If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.

Creating an event just needs to "take a snapshot" of the current queue. It doesn't need to synchronize. We just need to make sure that:

  • The event will be notified only if all operations in the "snapshot" are finished, or say all operations before it are finished.
  • If any operation depends on the event, it will not be executed until the event is fulfilled/notified.

Do these async principles make sense for libomptarget? What about restricting __tgt_rtl_wait_event to be only non-blocking and thus always be inserted in the async queue?

By the way, thanks for the clarifications, the event API is much clear to me now.

jdoerfert added inline comments.Wed, Jun 30, 2:48 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
1198

@gValarini

Do these async principles make sense for libomptarget? What about restricting __tgt_rtl_wait_event to be only non-blocking and thus always be inserted in the async queue?

If we have a queue, we can and will insert the wait in there (see above). I don't think it is needed, as "wait_event", similar to "synchronize", is a proper synchronization point.

Not sure what the problem here is.

The case you mentioned is not a problem. Say we have two queues for the two target regions. The contents of each queue are:
Q1: data movement of A, notification of event E1, wait on E2, kernel issue.
Q2: data movement of B, notification of event E2, wait on E1, kernel issue.
You can see the notification of the two events always comes before the waits. No matter the execution order, it will not deadlock.

Uhm, I think I misunderstood how the events would interact with the queues. But now I doubt whether __tgt_rtl_wait_event should be allowed to be blocking. Let me explain with my first question at #2850555.

  1. Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (__tgt_rtl_synchronize)? If yes, then the above problem can occur since no synchronization call is done.

I'm not sure I follow. The plugin will never defer any operation. If the target supports async operations, the plugin just issues the operation, and that's it. Whether the device defers it or not is transparent to the plugin.

Some of the async APIs that I am most used to (like the C++ async library and the non-blocking MPI functions) have the restriction that one should never consider any computation/communication to be done (or even in-flight) until their respective synchronization functions are properly called. This is needed considering that a functional (but not optimized) implementation of those async APIs would be to defer their execution until such a synchronization point. If I recall correctly, this restriction is even present in the CUDA streams specification, but please don't quote me on that since I am not a CUDA expert.

If libomptarget follows such principles, __tgt_rtl_wait_event should not be allowed to be blocking, since that would imply that the "notification of event E#" you described could be assumed to be completed before __tgt_rtl_synchronize is called, which would be a bit strange for an async API. This behavior is what led me to my second question at #2850555 regarding possible indirect synchronizations done by __tgt_rtl_wait_event.

  2. Should a plugin implementation indirectly synchronize the queue where an event was created when a call to __tgt_rtl_wait_event is made? If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.

Creating an event just needs to "take a snapshot" of the current queue. It doesn't need to synchronize. We just need to make sure that:

  • The event will be notified only if all operations in the "snapshot" are finished, or say all operations before it are finished.
  • If any operation depends on the event, it will not be executed until the event is fulfilled/notified.

Do these async principles make sense for libomptarget? What about restricting __tgt_rtl_wait_event to be only non-blocking and thus always be inserted in the async queue?

Actually, the function naming here is kind of confusing, although I inherited the CUDA terminology. One of the most confusing things is that "wait for an event" doesn't really mean "waiting", at least from the host's perspective. Let me explain, so it becomes much clearer.

For CUDA, event wait and event synchronize are different. If you look at the CUDA documentation (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g7840e3984799941a61839de40413d1d9), the signature of event wait is cudaStreamWaitEvent ( cudaStream_t stream, cudaEvent_t event, unsigned int flags ). It makes all future work submitted to the stream wait for the event and then returns immediately; it is non-blocking. The GPU scheduler ensures that all operations enqueued to the stream afterwards will not be executed until the event is fulfilled. Event synchronization is different. As shown here (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html#group__CUDART__EVENT_1g949aa42b30ae9e622f6ba0787129ff22), it only accepts the event, so it simply blocks host execution until the event is fulfilled.

So whether __tgt_rtl_wait_event is blocking or non-blocking is implementation dependent. For example, a target that doesn't support the whole event mechanism we mentioned before, or doesn't support async operations at all, can implement the event wait as blocking in the following way:

  • An event is an atomic integer, initialized to 0;
  • When the event is fulfilled (or notified), it is simply set to 1;
  • Waiting for an event simply spins in a loop until it becomes 1.
tianshilei1992 marked 11 inline comments as done.Thu, Jul 22, 7:04 PM