This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement
ClosedPublic

Authored by tianshilei1992 on Jun 16 2021, 1:24 PM.

Details

Summary

Asynchronous data movement can cause a data race if the target supports it.
Details can be found in [1]. This patch fixes the problem by attaching
an event to each entry of the data mapping table. The details are as follows.

For each issued data movement, a new event is generated by calling createEvent and
returned to libomptarget. The event is attached to the corresponding mapping table
entry.

For each data mapping lookup, if no data movement is needed, the attached event
has to be inserted into the queue to guarantee that all following operations in the
queue can only be executed once the event is fulfilled.

This design avoids synchronization on the host side.

Note that we are using CUDA terminology here. A similar mechanism is assumed to
be supported by other targets. Even if a target doesn't support it, it can
easily be implemented in the following fallback way:

  • An event can be any kind of flag with at least two states, 0 and 1.
  • waitEvent can simply busy-wait while the event is still 0.
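
A minimal sketch of such a fallback in C++ (illustrative names only, not the actual plugin interface): the event is an atomic flag, fulfilling it sets the flag, and waitEvent busy-waits until it is set.

#include <atomic>

// Hypothetical fallback event: an atomic flag with two states, 0 (pending) and 1 (fulfilled).
struct FallbackEvent {
  std::atomic<int> State{0};
};

// Called once the associated data movement has finished.
void fulfillEvent(FallbackEvent &E) { E.State.store(1, std::memory_order_release); }

// waitEvent simply busy-waits until the event is fulfilled. A target with
// native event support would instead enqueue the wait into its device queue.
void waitEvent(FallbackEvent &E) {
  while (E.State.load(std::memory_order_acquire) == 0) {
    // spin
  }
}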

My local test shows that bug49334.cpp can pass.

Reference:
[1] https://bugs.llvm.org/show_bug.cgi?id=49940

Diff Detail

Event Timeline

tianshilei1992 added inline comments.Jun 16 2021, 5:11 PM
openmp/libomptarget/src/device.h
53

mutable is needed because std::set::iterator is const. That makes sense: the set must not allow its keys to be modified, and here HstPtrBegin is the key. To modify the other members of an entry through the iterator, we either need to declare them mutable or use const_cast whenever we want to modify them.
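
A standalone illustration of that constraint (the struct below is a stand-in, not the actual HostDataToTargetTy): std::set only hands out const access to its elements, so any member we still want to update in place must be declared mutable (or const_cast'ed), as long as the change does not affect the ordering key.

#include <cstdint>
#include <set>

struct Entry {
  uintptr_t HstPtrBegin;         // part of the ordering key
  mutable uintptr_t TgtPtrBegin; // non-key state we still want to update

  bool operator<(const Entry &O) const { return HstPtrBegin < O.HstPtrBegin; }
};

int main() {
  std::set<Entry> Map;
  Map.insert({0x1000, 0});
  auto It = Map.find({0x1000, 0});
  // *It is const; this assignment compiles only because TgtPtrBegin is mutable.
  It->TgtPtrBegin = 0x2000;
}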

openmp/libomptarget/src/omptarget.cpp
603

There can still be a gap between creating an entry in the mapping table and issuing the data movement.

I don't understand why we have an event in the async_info object at all. I would have assumed only to have some plugin specific "lock" in the table.

That's the only way to pass the event back from the plugin to libomptarget.

tianshilei1992 added inline comments.Jun 16 2021, 5:48 PM
openmp/libomptarget/include/omptarget.h
139 ↗(On Diff #352537)

I'll refine this part to remove it from async info.

So is the race this is trying to fix between the host and the target, or between regions running on the target?

Passing data back to the host suggests the former, but that doesn't make sense, because the async operations are queued up within an async object that is not shared between host threads (and has a lexically scoped lifetime, iirc).

If it's a race between different target regions on the device, we should be able to resolve that with no host involvement at all.

openmp/libomptarget/include/omptargetplugin.h
148 ↗(On Diff #352537)

So if it's not a barrier, what is it? What is this intended to do?

protze.joachim added a subscriber: protze.joachim.EditedJun 17 2021, 1:05 AM

I must admit that I have no idea about the internals of libomptarget. But here is my high-level view on this issue:

From my perspective, the compiler should help to distinguish two situations (a weakness of OpenMP is that the DAG only evolves during execution):
(1) target tasks with dependencies, which only synchronize with other target tasks on the same device (outgoing edges are limited to target tasks on the same device)
(2) target tasks with dependencies, which synchronize with host code or target tasks on other devices.
If the compiler cannot prove (1), fall back to (2).

If the compiler can prove that a target task is of class (1), a fast path is possible by submitting all related device code onto the same device queue ("stream"). Similarly, if the runtime can prove that outgoing edges are limited to the same device and no new edges can be added, it might identify additional target tasks of class (1). The latter might be difficult because the information is in libomp, not in libomptarget.

#pragma omp target enter data map(to:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp target depend(inout:a[0:N]) nowait
{...}
#pragma omp target exit data map(from:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp taskwait

In this case, no host tasks are dependent on target task dependencies -> just submit the operations onto the same device queue.

Only if the host (possibly) has dependencies on the target task, fine-grained synchronization with the host is necessary:

#pragma omp target enter data map(to:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp target depend(inout:a[0:N]) nowait
{...}
#pragma omp target exit data map(from:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp task depend(inout:a[0:N])
{}

The first two target tasks can still rely on device queue synchronization; the last target task must ensure synchronization with the host. I think this could be mapped to the OpenMP completion event model:

#pragma omp target exit data map(from:a[:N]) depend(inout:a[0:N]) detach(event) nowait
#pragma omp task depend(inout:a[0:N])
{}

Libomptarget "calls omp_fulfill_event(event)" when the exit data is completed, e.g., completion of asynchronous memcopy is signaled with a callback. This should result in a clean interface for synchronization between libomp and libomptarget.

I can see three candidates, where things might break at the moment:

  • actions for target tasks with the same dependency are not submitted to the same device queue
  • device queues don't enforce the necessary ordering for asynchronous execution (copy-to-device, kernel launch, copy-from-device)
  • class (2) target tasks fail to ensure device completion before they complete and release outgoing dependencies.

The last point especially seems like a hot candidate: if the target task just launches the asynchronous execution but does not ensure completion before the task is marked completed, the implicit barrier at the end of BlockMatMul_TargetNowait could pass even though not all device activity has completed.

Thanks for the analysis and details. The root cause of this problem (for a target supporting async data movement internally) is in libomptarget. It is unrelated to tasks in libomp. At the end of each target task, there is a mandatory synchronization to guarantee that the task can only finish once everything it issued has finished. This ensures that if a regular task depends on a target task, the existing mechanism works. You can refer to Section 3.4 of [1] for more details. As for detachable tasks, [1] also has a detailed analysis.

Reference:
[1] https://tianshilei.me/wp-content/uploads/concurrent-lcpc2020.pdf

tianshilei1992 added a comment.EditedJun 17 2021, 8:04 AM

So is the race this is trying to fix between the host and the target, or between regions running on the target?

Passing data back to the host suggests the former, but that doesn't make sense, because the async operations are queued up within an async object that is not shared between host threads (and has a lexically scoped lifetime, iirc).

If it's a race between different target regions on the device, we should be able to resolve that with no host involvement at all.

It's a data race between the host and the plugin. https://bugs.llvm.org/show_bug.cgi?id=49940 has a detailed analysis. Long story short, let's say we have two concurrent target regions T1 and T2, both reading memory region M. Assume T1 is scheduled to execute first. It then maps M to the device. The mapping consists of several steps:

  0. Acquire the mapping table lock.
  1. Look up the mapping table to see whether M is already mapped. - No.
  2. Allocate device memory for M and create a mapping entry for M.
  3. Issue the data transfer.
  4. Release the lock.
  5. Do the remaining work, such as more data transfers, or directly launch the kernel.

Everything seems fine. Now T2 starts to execute. It also maps M to the device. However, the steps are slightly different:

  0. Acquire the mapping table lock.
  1. Look up the mapping table to see whether M is already mapped. - Yes.
  2. Since the mapping is already there, T2 will NOT issue a data transfer. This improves performance; it doesn't make sense to transfer the same data several times.
  3. Do the remaining work, like Step 5 above.

This logic also looks good. However, there is a data race. Since we now have asynchronous data movement, Step 3 for T1 can only guarantee that the movement of M has been issued; it cannot guarantee that after Step 3 the data is already on the device. T2's logic is: if the mapping is already in the table, it "assumes" the data is already on the device and therefore issues no data transfer. And then, bang! If the GPU scheduler executes T2 ahead of the data movement of M, the memory region on the device contains nothing from the host; its contents are garbage.

So, in one sentence, the root cause is the mismatch between the assumption in the mapping lookup logic and the non-atomic data movement operations. It is not a problem for targets that don't support asynchronous data movement because everything can be guarded by the lock. For targets that do support asynchronous data movement, we don't want to fall back to synchronous data movement either, for the obvious reason of performance: data movement is extremely expensive, and multiple data movements can potentially run in parallel, even within a single target region.

What I'm trying to do in this patch is establish the dependency between the data movement and all other tasks using that data. We want to utilize device-side synchronization via events. The interesting thing about an event is that, from the host's perspective, "waiting for an event" just means inserting the event into the queue, and that's it. It is not like a barrier because it does not block the host. But yes, it is a barrier for the device scheduler: all enqueued operations following the event can only start executing once the event is fulfilled.
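
A minimal sketch of that flow in C++ (all names and the trivial stand-in functions below are illustrative, not the actual libomptarget/plugin interfaces): the first mapper records an event after issuing the copy, and later mappers that hit the table enqueue a device-side wait on it instead of blocking the host.

#include <cstddef>
#include <cstdlib>
#include <mutex>

// Trivial stand-ins for the plugin interface; the real API and names differ.
struct Queue {};
static int DummyEvent;
void *createEvent(Queue &) { return &DummyEvent; } // stand-in: the real call snapshots the queue
void waitEvent(void *, Queue &) {}                 // stand-in: the real call enqueues a device-side wait and returns
void *dataAlloc(size_t Size) { return std::malloc(Size); } // stand-in: device allocation
void submitAsync(void *, void *, size_t, Queue &) {}       // stand-in: issue the async H2D copy

struct MapEntry {
  void *TgtPtr = nullptr;
  void *Event = nullptr; // fulfilled once the pending H2D copy has finished
};

std::mutex MapLock;

void *getOrStartMapping(MapEntry &Entry, void *HstPtr, size_t Size, Queue &Q) {
  std::lock_guard<std::mutex> G(MapLock);
  if (!Entry.TgtPtr) {
    // First mapper: allocate, issue the copy, and attach an event for it.
    Entry.TgtPtr = dataAlloc(Size);
    submitAsync(Entry.TgtPtr, HstPtr, Size, Q);
    Entry.Event = createEvent(Q);
  } else if (Entry.Event) {
    // Table hit: no new copy, but make this queue wait for the pending one.
    waitEvent(Entry.Event, Q);
  }
  return Entry.TgtPtr;
}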

Thanks for the explanation, I think, I got the issue now.

Following your example, if T1 and T2 are really concurrent, I think it is an application issue (a data race) if they both perform the initial mapping of the same memory without synchronization. I would bring this to the OpenMP LC.

For bug49334.cpp this means that depend(in: BlockA[0], BlockB[0]) is not sufficient and the dependency should rather be depend(inout: BlockA[0], BlockB[0]). The out dependency reflects the writing to the device copy of the variables or, respectively, the possible initial mapping.

So in one sentence, the root cause is the misalignment between assumption of mapping lookup logic and non-atomic data movement operations. It is not a problem for targets not supporting async data movement because everything can be guarded by the lock. Also, for those targets supporting async data movement, we don't want to use synchronous data movement here as well. The reason is obvious, for better performance. Data movement is extremely expensive, and multiple data movements can be potentially parallel, even for one single target region.

What I'm trying to do in this patch is to establish the dependency between the data movement and all other tasks using that data. We want to utilize the device side synchronization to do that by using event. The interesting thing for the event is, from host's perspective, "wait for an event" means just insert the event to the queue, and that's it. It is not like a barrier because it will not block the host. But yes, it is a barrier for the device scheduler for sure that all enqueued operations following the event can only start execution if the event is fulfilled.

I believe that your root cause description is correct, but I don't understand how the mentioned locks currently present in mapping logic can fix the case when a target device only implements the synchronous API (which is the first problem described in 49940). Let me explain what I mean.

It's a data race between host and plugin. https://bugs.llvm.org/show_bug.cgi?id=49940 has detailed analysis. Long story short, let's say we have two concurrent target regions T1 and T2 both reading memory region M. Assume T1 is scheduled to execute first. It then maps M to the device. The mapping consists of several steps:
0. Get the lock of mapping table.

  1. Look up into the mapping table to see if M is already mapped. - No.
  2. Allocate device memory for M and create a mapping entry for M.
  3. Issue data transfer.
  4. Release the lock.
  5. Do remaining things, such as more data transfer, or directly launch the kernel.

Everything seems fine. Now T2 starts to execute. It also maps M to the device. However, steps are slightly different.
0. Get the lock of mapping table.

  1. Look up into the mapping table to see if M is already mapped. - Yes.
  2. Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.
  3. Do remaining things, like Step 5 above.

It also looks good for this logic. However, there is a data race. Since we now have async data movement, Step 3 for T1 can only guarantee that movement of M is issued, but it cannot guarantee that after Step 3, the data is already on the device. For T2, its logic is, if the mapping is already in the table, it "assumes" that the data is already on the device, and therefore it doesn't issue data transfer. And then, bang! For the case that T2 is scheduled to execute by the GPU scheduler ahead of the data movement of M, the memory region on the device contains nothing from host. It is random.

To the best of my knowledge, the given step-by-step has a small problem: the mapping table lock is not released in step 4, but instead, it is released between steps 2 and 3.

Since the proposed patch stores the event in the AsyncInfoTy, when interacting with a synchronous-only plugin we can still have the same data corruption problem: T1 may be paused right after the mapping table lock is released (between steps 2 and 3) and T2 will go through its target region execution not issuing the appropriate data transfer.

If what I described above is correct, we can fix this synchronous case with two simple changes added to https://reviews.llvm.org/D104382 patch:

  1. Add a mutex to HostDataToTargetTy that will synchronize concurrent external (meaning outside the DeviceTy code) accesses;
  2. TargetPointerResultTy locks this mutex on construction and unlocks it on destruction (a simple std::unique_lock can suffice).

This way we ensure that multiple threads dealing with the same mapping table entry are synchronized, fixing the first problem in 49940. What do you think?
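
A rough sketch of that proposal (field names are illustrative; the actual classes in D104382 may look different): the entry owns a mutex, and the result object holds a std::unique_lock for its lifetime.

#include <mutex>

struct HostDataToTargetTy {
  // Per-entry mutex guarding accesses made from outside the DeviceTy code.
  std::mutex UpdateMtx;
  void *TgtPtrBegin = nullptr;
};

struct TargetPointerResultTy {
  HostDataToTargetTy *Entry = nullptr;
  // Holding the lock for the result's lifetime keeps other threads from acting
  // on the entry until the owning thread has issued its data transfer.
  std::unique_lock<std::mutex> EntryLock;

  explicit TargetPointerResultTy(HostDataToTargetTy &E)
      : Entry(&E), EntryLock(E.UpdateMtx) {}
  // The unique_lock releases the mutex automatically on destruction.
};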

Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.

That doesn't sound right. There's no reason to expect that the data that was copied across will not be modified by the first region. The OpenMP semantics look like: copy to the device at target region start and copy back at the end. Picking up partially written data from another executing kernel breaks that.

Given two target regions that both map the same data, I'm fairly sure both maps have to happen, unless we do significantly more thorough analysis to determine that the output of one kernel is definitely the same as the input to the next, in which case we would be safer fusing them.

ye-luo added a subscriber: ye-luo.Jun 17 2021, 9:49 AM

Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.

That doesn't sound right. There's no reason to expect the data that was copied across will not be modified by the first region. The openmp semantics look like copy to device at target region start and copy back at the end. Picking up partially written data from another executing kernel breaks that.

Given two target regions that both map the same data, I'm fairly sure both maps have to happen, unless we do significantly more thorough analysis to determine that the output of one kernel is definitely the same as the input to the next, in which case we would be safer fusing them.

I don't agree with you. T2 should follow the reference-counting transfer behavior and not issue a data transfer. If T1 changes things on the device and surprises T2, it is a bug in the user code; the user is responsible for expressing the dependency between the target tasks.

Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.

That doesn't sound right. There's no reason to expect the data that was copied across will not be modified by the first region. The openmp semantics look like copy to device at target region start and copy back at the end. Picking up partially written data from another executing kernel breaks that.

Given two target regions that both map the same data, I'm fairly sure both maps have to happen, unless we do significantly more thorough analysis to determine that the output of one kernel is definitely the same as the input to the next, in which case we would be safer fusing them.

Like Ye said, the two target regions both only *read* from the same memory region. It is not a race condition from the user's perspective.

So in one sentence, the root cause is the misalignment between assumption of mapping lookup logic and non-atomic data movement operations. It is not a problem for targets not supporting async data movement because everything can be guarded by the lock. Also, for those targets supporting async data movement, we don't want to use synchronous data movement here as well. The reason is obvious, for better performance. Data movement is extremely expensive, and multiple data movements can be potentially parallel, even for one single target region.

What I'm trying to do in this patch is to establish the dependency between the data movement and all other tasks using that data. We want to utilize the device side synchronization to do that by using event. The interesting thing for the event is, from host's perspective, "wait for an event" means just insert the event to the queue, and that's it. It is not like a barrier because it will not block the host. But yes, it is a barrier for the device scheduler for sure that all enqueued operations following the event can only start execution if the event is fulfilled.

but I don't understand how the mentioned locks currently present in mapping logic can fix the case when a target device only implements the synchronous API (which is the first problem described in 49940).

That's a good point. Thanks for pointing it out.

It's a data race between host and plugin. https://bugs.llvm.org/show_bug.cgi?id=49940 has detailed analysis. Long story short, let's say we have two concurrent target regions T1 and T2 both reading memory region M. Assume T1 is scheduled to execute first. It then maps M to the device. The mapping consists of several steps:
0. Get the lock of mapping table.

  1. Look up into the mapping table to see if M is already mapped. - No.
  2. Allocate device memory for M and create a mapping entry for M.
  3. Issue data transfer.
  4. Release the lock.
  5. Do remaining things, such as more data transfer, or directly launch the kernel.

Everything seems fine. Now T2 starts to execute. It also maps M to the device. However, steps are slightly different.
0. Get the lock of mapping table.

  1. Look up into the mapping table to see if M is already mapped. - Yes.
  2. Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.
  3. Do remaining things, like Step 5 above.

It also looks good for this logic. However, there is a data race. Since we now have async data movement, Step 3 for T1 can only guarantee that movement of M is issued, but it cannot guarantee that after Step 3, the data is already on the device. For T2, its logic is, if the mapping is already in the table, it "assumes" that the data is already on the device, and therefore it doesn't issue data transfer. And then, bang! For the case that T2 is scheduled to execute by the GPU scheduler ahead of the data movement of M, the memory region on the device contains nothing from host. It is random.

To the best of my knowledge, the given step-by-step has a small problem: the mapping table lock is not released in step 4, but instead, it is released between steps 2 and 3.

Since the proposed patch stores the event in the AsyncInfoTy, when interacting with a synchronous-only plugin we can still have the same data corruption problem: T1 may be paused right after the mapping table lock is released (between steps 2 and 3) and T2 will go through its target region execution not issuing the appropriate data transfer.

If what I described above is correct, we can fix this synchronous case with two simple changes added to https://reviews.llvm.org/D104382 patch:

  1. Add a mutex to HostDataToTargetTy that will syncrhonize concurrent external (meaning outside the DeviceTy code) accesses;
  2. TargetPointerResultTy locks this mutex on construction and unlock it on destruction (a simple std::unique_lock can suffice).

This way we ensure that multiple threads dealing with the same mapping table entry are synchronized, fixing the first problem in 49940. What do you think?

That sounds right, which means for each entry, there can only be one thread touching it. Or actually we can pass the update decision to getOrAllocTgtPtr such that we don't have to deal with lock(s) across functions.

Or actually we can pass the update decision to getOrAllocTgtPtr such that we don't have to deal with lock(s) across functions.

This could be a nicer solution alongside a proper renaming of the getOrAllocTgtPtr to reflect its multiple responsibilities (if you go forward with it, of course). Just two comments though:

  1. It would be good to check whether other "users" of DeviceTy would need such a synchronization mechanism as well. If so, they could use the same changes introduced by https://reviews.llvm.org/D104382 (getTgtPtrBegin used by targetDataEnd is the only example that I can think of);
  2. Maybe letting the lifetime of TargetPointerResultTy instances determine how long an entry is locked is the most future-proof solution. I am thinking about future changes that could be made to the targetDataBegin function (and maybe other ones) that could use the additional synchronization. But this is just a thought.

Following your example, if T1 and T2 are really concurrent, I think it should be an application issue (data race), if they both perform initial mapping on the same memory without synchronization. I would bring this to the OpenMP LC.

Not exactly. Two concurrent target regions can read from the same memory region.

int array[1024];
#pragma omp target map(to: array) nowait
{ /* read array */ }
#pragma omp target map(to: array) nowait
{ /* read array */ }

I think this is a valid program, both from the user's perspective and from the specification's perspective. How to handle the data mapping is an implementation detail IMO, but the implementation should guarantee that array is the same in both regions. That is currently not the case for us.

For bug49334.cpp this means, that depend(in: BlockA[0], BlockB[0]) is not sufficient and the dependency should rather be depend(inout: BlockA[0], BlockB[0]). The out dependency reflects the writing to the device copy of the variables or, respectively, the possible initial mapping.

Correct, but in the body of the task, BlockA[0] and BlockB[0] are only read. It's valid to mark them as in only.

Correct, but in the body of the task, BlockA[0] and BlockB[0] are only read. It's valid to mark them as in only.

What I mean is the write to the corresponding variable implied by the map.
Do you agree that the following would be invalid and needs inout for target enter data?

#pragma omp target enter data nowait map(to: BlockA[:BS * BS]) depend(in: BlockA[0])
#pragma omp target nowait depend(in: BlockA[0])
{}

Correct, but in the body of the task, BlockA[0] and BlockB[0] are only read. It's valid to mark them as in only.

What I mean is the write to the corresponding variable implied by the map.
Do you agree that the following would be invalid and needs inout for target enter data?

#pragma omp target enter data nowait map(to: BlockA[:BS * BS]) depend(in: BlockA[0])
#pragma omp target nowait depend(in: BlockA[0])
{}

That is actually a very interesting question. Theoretically, the dependency setting here is not wrong, but it indeed doesn't work. However, whether a variable is written or read cannot be implied by the map. We can definitely write to a variable in a target region without transferring it back (map(to)), and we can also only read a variable and still transfer it back (map(tofrom)). Both are valid.

tianshilei1992 planned changes to this revision.Jun 18 2021, 12:39 PM
tianshilei1992 added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
24 ↗(On Diff #353077)

This should be unrelated...

openmp/libomptarget/src/device.cpp
14

This must be code automatically generated by clangd.

openmp/libomptarget/src/device.h
16

unrelated

53

We have to use shared_ptr here because this data structure has to be copied but std::atomic is non-copyable.

openmp/libomptarget/src/omptarget.cpp
590–614

Here is very critical. Expect opinions.

594

But with D104555, these whole bunch of code will be moved into getTargetPointer.

openmp/libomptarget/test/offloading/bug49334.cpp
73 ↗(On Diff #353077)

With D104552, this is no longer needed.

tianshilei1992 requested review of this revision.Jun 18 2021, 12:41 PM
tianshilei1992 edited the summary of this revision. (Show Details)Jun 18 2021, 12:43 PM

The logic now looks much more sane to me. This is what I was expecting. Plugins can implement "events" the way they want to.

openmp/libomptarget/src/device.h
53

Why is this a shared ptr?

openmp/libomptarget/src/omptarget.cpp
601

Worth a helper to avoid duplication.

614

I don't get why we need this. At the moment at which we are looking at the Entry, I would expect it either has an event or not, but neither is cause for waiting. When a new entry is created, it should only be exposed to others once the event has been set up as well. Plugins that don't need events can just use nullptr without it being a problem. If this is not a problem, I suspect we want a separate flag in the entry for the status. The event should be allowed to be null; the flag can be "in_progress", which indicates the rest needs to wait.

tianshilei1992 added inline comments.Jun 18 2021, 1:04 PM
openmp/libomptarget/src/omptarget.cpp
614

This can be removed when this whole chunk of code is moved to the new function in D104555; otherwise it can still happen that the new entry is created but the data movement has not been issued, because this function is not guarded by the lock.

tianshilei1992 added inline comments.Jun 18 2021, 4:15 PM
openmp/libomptarget/src/device.h
53

Because unique_ptr is non-copyable.

openmp/libomptarget/src/device.h
53

We could write a copy constructor. It's also possible it only needs to be moved, not copied. I don't really want to add an ad hoc garbage collector (aka shared_ptr) to the existing complexity if we can avoid it; lifetimes are opaque enough already.

removed unrelated changes

tianshilei1992 retitled this revision from [PoC][WIP][OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement to [OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement.Jun 20 2021, 5:30 PM
tianshilei1992 edited the summary of this revision. (Show Details)
protze.joachim added inline comments.Jun 23 2021, 8:24 AM
openmp/libomptarget/src/device.cpp
160–165

Which patch removes the bool &IsHostPtr argument?
I cannot see this change in any of the patches in the stack or on main.

This patch fails to apply after applying the other two patches in the stack.

openmp/libomptarget/src/omptarget.cpp
504–509

This probably shouldn't be here?

openmp/libomptarget/src/omptarget.cpp
504–509

yeah, it's a rebase error. I'll reorg the patch stack.

Reading the event-based solution, one corner case came up to my mind (it is odd, but possible):

#pragma omp target map(to: A, B) nowait
{ /* read A and B */ }

#pragma omp target map(to: B, A) nowait
{ /* read A and B */ }

Note that the only difference between them (besides their internal code) is the ordering of the buffers in the mapping list.

Since both regions just read A and B, there is no need for a dependency between them (from the programmer's perspective) and, thus, they can be executed by two different threads as follows:

  • T1: alloc A -> issue A submission -> create event for A -> alloc B (not new, MoveData = false) -> wait for the B event indefinitely ...
  • T2: alloc B -> issue B submission -> create event for B -> alloc A (not new, MoveData = false) -> wait for the A event indefinitely ...

This issue seems similar to the lock ordering problem. It can occur depending on the answers to the questions below:

  1. Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (__tgt_rtl_synchronize)? If yes, then the above problem can occur since no synchronization call is done.
  2. Should a plugin implementation indirectly synchronize the queue where an event was created when a call to __tgt_rtl_wait_event is made? If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.
  3. Are the variables in the mapping list ordered somehow? If that is the case, then the (possible) problem described above will never happen as well (like resolving it through lock ordering).

If this is a problem, sorting the mapping list or ensuring the behavior described in the second question would probably solve it. What do you think? Is this line of thought correct?

The case you mentioned is not a problem. Say we have two queues for the two target regions. So the contents of each queue is:
Q1: data movement of A, notification of event E1, wait on E2, kernel issue.
Q2: data movement of B, notification of event E2, wait on E1, kernel issue.
You can see that the notification of each event always comes before the corresponding wait. No matter what the execution order is, it will not deadlock.

  1. Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (__tgt_rtl_synchronize)? If yes, then the above problem can occur since no synchronization call is done.

I'm not sure I follow. The plugin never defers any operation. If the target supports async operations, the plugin just issues the operation, and that's it. Whether the device defers it or not is transparent to the plugin.

  2. Should a plugin implementation indirectly synchronize the queue where an event was created when a call to __tgt_rtl_wait_event is made? If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.

Creating an event just needs to "take a snapshot" of the current queue. It doesn't need to synchronize. We just need to make sure that:

  • The event will be notified only if all operations in the "snapshot" are finished, or say all operations before it are finished.
  • If any operation depends on the event, it will not be executed until the event is fulfilled/notified.

  3. Are the variables in the mapping list ordered somehow? If that is the case, then the (possible) problem described above will never happen as well (like resolving it through lock ordering).

That's a good question. I don't have a clear answer for that. I remember the front end sorts them somehow, but I don't know in what order.

This comment was removed by gValarini.

The case you mentioned is not a problem. Say we have two queues for the two target regions. So the contents of each queue is:
Q1: data movement of A, notification of event E1, wait on E2, kernel issue.
Q2: data movement of B, notification of event E2, wait on E1, kernel issue.
You can see the notification of the two events always before the wait. No matter what execution orders they will be, it will not dead lock.

Uhm, I think I misunderstood how the events would interact with the queues. But now I have a doubt about whether __tgt_rtl_wait_event should be allowed to be blocking. Let me explain with my first question at #2850555.

  1. Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (__tgt_rtl_synchronize)? If yes, then the above problem can occur since no synchronization call is done.

I'm not sure I follow. Plugin will never defer any operation. If the target supports async operations, the plugin just issues the operation, and that's it. As for whether the device will defer or not, it's plugin transparent.

Some of the async APIs that I am most used to (like the C++ async library and the non-blocking MPI functions) have the restriction that one should never consider any computation/communication to be done (or even in flight) until their respective synchronization functions are properly called. This is needed considering that a functional (but not optimized) implementation of those async APIs could simply defer their execution until such a synchronization point. If I recall correctly, this restriction is even present in the CUDA streams specification, but please don't quote me on that since I am not a CUDA expert.

If libomptarget follows such principles, __tgt_rtl_wait_event should not be allowed to be blocking, since that would imply that the "notification of event E#" that you described could be assumed to be completed before __tgt_rtl_synchronize is called, which would be a little bit strange considering the async API. This behavior was what led me to my second question at #2850555 regarding possible indirect synchronizations done by __tgt_rtl_wait_event.

  2. Should a plugin implementation indirectly synchronize the queue where an event was created when a call to __tgt_rtl_wait_event is made? If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.

Creating an event just needs to "take a snapshot" of current queue. It doesn't need to synchronize. We just need to make sure that:

  • The event will be notified only if all operations in the "snapshot" are finished, or say all operations before it are finished.
  • If any operation depends on the event, it will not be executed until the event is fulfilled/notified.

Do these async principles make sense for libomptarget? What about restricting __tgt_rtl_wait_event to be only non-blocking and thus always be inserted in the async queue?

By the way, thanks for the clarifications, the event API is much clear to me now.

jdoerfert added inline comments.Jun 30 2021, 2:48 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
1145 ↗(On Diff #353252)

@gValarini

Do these async principles make sense for libomptarget? What about restricting __tgt_rtl_wait_event to be only non-blocking and thus always be inserted in the async queue?

If we have a queue, we can and will insert the wait in there (see above). I don't think it is needed, as "wait_event", similar to "synchronize", is a proper synchronization point.

Not sure what the problem here is.

Actually, the function naming here is kind of confusing, although I inherited the CUDA terminology. One of the most confusing things is that "wait for an event" doesn't really mean "waiting", at least from the host's perspective. Let me explain so it becomes much clearer.

For CUDA, event wait and event synchronize are different. If you look at the CUDA documentation (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g7840e3984799941a61839de40413d1d9), the signature of event wait is cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event, unsigned int flags). It basically inserts the event into the stream and then returns; it is non-blocking. The GPU scheduler makes sure that all operations enqueued to the stream afterwards are not executed until the event is fulfilled. Event synchronization is different. As shown here (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html#group__CUDART__EVENT_1g949aa42b30ae9e622f6ba0787129ff22), it only accepts the event, so it simply blocks host execution until the event is fulfilled.
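
The difference can be seen directly in host code using the real CUDA runtime API (a minimal sketch with illustrative variable names, error checking omitted; this is not libomptarget code):

#include <cstddef>
#include <cuda_runtime.h>

void example(void *DevPtr, void *HstPtr, size_t Size,
             cudaStream_t Producer, cudaStream_t Consumer) {
  cudaEvent_t Ev;
  cudaEventCreate(&Ev);

  // Producer stream: issue the H2D copy and record the event right after it.
  cudaMemcpyAsync(DevPtr, HstPtr, Size, cudaMemcpyHostToDevice, Producer);
  cudaEventRecord(Ev, Producer);

  // Non-blocking for the host: Consumer simply will not run past this point
  // on the device until Ev is fulfilled.
  cudaStreamWaitEvent(Consumer, Ev, 0);

  // Blocking for the host: this call returns only once Ev is fulfilled.
  cudaEventSynchronize(Ev);

  cudaEventDestroy(Ev);
}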

So whether __tgt_rtl_wait_event is blocking or non-blocking is implementation dependent. For example, a target that doesn't support the whole event mechanism we mentioned before, or doesn't support async operations at all, can implement the event wait as blocking in the following way:

  • An event is an atomic integer, initialized to 0;
  • When the event is fulfilled (or notified), it simply sets it to 1;
  • Waiting for an event is simply equivalent to looping until it becomes 1.
tianshilei1992 marked 11 inline comments as done.Jul 22 2021, 7:04 PM
tianshilei1992 marked an inline comment as done.

rebase and fix comments

rebase again to remove unrelated changes

grokos accepted this revision.Aug 20 2021, 2:50 AM

A few minor nits. Generally the patch is now in good shape, its logic is much simpler than the first iteration and it's easy to see that we don't risk running into deadlocks or further races. We haven't been very enthusiastic about alternative approaches; meanwhile there's a race waiting to be fixed, so I think we should move on with providing a solution. LGTM on my side, but since a lot of other people have reviewed the patch, let's wait until we hear them, too.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
132 ↗(On Diff #364676)

It's a bit weird that createEvent is defined inside the top-level namespace, whereas waitEvent and destroyEvent are members of DeviceRTLTy.

1379 ↗(On Diff #364676)

DeviceId is not used in this function, so it doesn't need to be an argument.

openmp/libomptarget/src/omptarget.cpp
591–599

This code is a direct duplicate of the logic inside getTargetPointer. Can we factor it out into a new helper function?

This revision is now accepted and ready to land.Aug 20 2021, 2:50 AM
ye-luo requested changes to this revision.Aug 20 2021, 7:55 AM

There is a performance hit from frequently creating/destroying events at every transfer. The create/destroy logic can be separated from recording events.

This revision now requires changes to proceed.Aug 20 2021, 7:55 AM

There is a performance hit from frequently creating/destroying events at every transfer. The create/destroy logic can be separated from recording events.

What an event is and how it is handled is plugin-specific. createEvent and destroyEvent can be implemented e.g. via a pool of events where an event will be picked upon createEvent and returned back to the pool upon destroyEvent (similar to what we do for streams in the CUDA plugin). So these performance considerations can be fixed in a subsequent patch. Right now there is a race condition - a correctness concern - that we should fix. At the very least, since the logic in the base library is correct and we pretty much agree on it, we should let this patch land without the CUDA plugin changes; then we can provide optimized implementations for the __tgt_rtl_*_event functions for the plugins we care about in future patches.
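
For example, a pool along these lines (a sketch only; the actual CUDA plugin has its own resource-pool machinery) would let createEvent/destroyEvent reuse events instead of allocating and freeing one per transfer:

#include <cuda.h>
#include <mutex>
#include <vector>

// Minimal event pool sketch using the CUDA driver API.
class EventPool {
  std::mutex Mtx;
  std::vector<CUevent> Free;

public:
  CUevent get() {
    std::lock_guard<std::mutex> G(Mtx);
    if (!Free.empty()) {
      CUevent E = Free.back();
      Free.pop_back();
      return E;
    }
    CUevent E;
    cuEventCreate(&E, CU_EVENT_DEFAULT); // allocate only when the pool is empty
    return E;
  }

  void put(CUevent E) {
    std::lock_guard<std::mutex> G(Mtx);
    Free.push_back(E); // return to the pool instead of calling cuEventDestroy
  }
};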

There is a performance hit from frequently creating/destroying events at every transfer. The create/destroy logic can be separated from recording events.

What an event is and how it is handled is plugin-specific. createEvent and destroyEvent can be implemented e.g. via a pool of events where an event will be picked upon createEvent and returned back to the pool upon destroyEvent (similar to what we do for streams in the CUDA plugin). So these performance considerations can be fixed in a subsequent patch. Right now there is a race condition - a correctness concern - that we should fix. At the very least, since the logic in the base library is correct and we pretty much agree on it, we should let this patch land without the CUDA plugin changes; then we can provide optimized implementations for the __tgt_rtl_*_event functions for the plugins we care about in future patches.

Now Event is stored in HostDataToTargetTy. It is not plugin-specific logic.

Consider the issues in the current implementation I just found. I have a strong feeling that the current fix should not get in.
I think we should just create one event per HostDataToTargetTy in the constructor and use the same event for a given mapped region.

In order to avoid blocking the good pieces in this patch, it is better to move all the plugin API and CUDA plugin changes into a separate patch.
I demand splitting createEvent into createEvent and recordEvent, and making each API do a well-defined, minimal unit of work (see the sketch below).

Using a pool of events is an optimization we may try at a later stage. Right now, I'm more concerned about the issue I found above than about the performance impact.
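
For reference, a hypothetical shape of that split (these exact declarations are illustrative only, not the interface that eventually landed in omptargetplugin.h):

#include <cstdint>

struct __tgt_async_info; // opaque queue handle shared between libomptarget and plugins

// Create only allocates the event object; record captures the current queue
// contents; wait makes a queue depend on the recorded point; destroy releases it.
int32_t __tgt_rtl_create_event(int32_t DeviceId, void **Event);
int32_t __tgt_rtl_record_event(int32_t DeviceId, void *Event, __tgt_async_info *AsyncInfo);
int32_t __tgt_rtl_wait_event(int32_t DeviceId, void *Event, __tgt_async_info *AsyncInfo);
int32_t __tgt_rtl_destroy_event(int32_t DeviceId, void *Event);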

openmp/libomptarget/plugins/cuda/src/rtl.cpp
1362 ↗(On Diff #364676)

Why does waiting for an event need to pull in a stream?
cuEventSynchronize does not.

1379 ↗(On Diff #364676)

AsyncInfo is not needed. If the stream has been returned to the pool, you just print nullptr all the time.

openmp/libomptarget/src/device.cpp
269

I think the purpose of introducing this data transfer event is to achieve the atomic transfer behavior.
So it is unsafe to issue the transfer right away without checking the status of the Event,
e.g. if a map(always) comes from a different thread.

289

The current event never gets destroyed until the next transfer.

  1. The event isn't properly destroyed.
  2. It's almost like one event per map.
tianshilei1992 marked 2 inline comments as done.Aug 22 2021, 7:43 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
269

The event has to be created after the transfer is issued. At that moment, the entry might not have any associated event yet.

289

For your first comment, yes, and it is now fixed in an ugly way. A nicer solution would be related to the API design in D108528, where we can discuss it. The question is whether we want to pass the async info to the plugin for event creation, synchronization, and destruction.

For the second point, it's a compatibility consideration, i.e. whether there is a platform that doesn't allow an event to be reused after it is fulfilled. I don't know.

tianshilei1992 marked 2 inline comments as done.Aug 22 2021, 7:43 PM
tianshilei1992 added inline comments.Aug 22 2021, 7:46 PM
openmp/libomptarget/src/device.cpp
80

The change to the function signature would not be required if we didn't need to pass AsyncInfo to the plugin. Potentially we can wrap the native event in a struct whose destructor calls destroyEvent.
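
For what it's worth, such a wrapper could be a small RAII type like the sketch below (destroyEvent here is a stand-in for the plugin call; as discussed later in this review, the final design went with explicit destruction and an event pool instead):

#include <cstdint>

int32_t destroyEvent(void *Event); // stand-in for the plugin entry point

// RAII owner for a plugin event: releases the event when the wrapper goes away.
class EventRAII {
  void *Event = nullptr;

public:
  explicit EventRAII(void *E) : Event(E) {}
  EventRAII(const EventRAII &) = delete;
  EventRAII &operator=(const EventRAII &) = delete;
  EventRAII(EventRAII &&O) : Event(O.Event) { O.Event = nullptr; }
  ~EventRAII() {
    if (Event)
      destroyEvent(Event);
  }
  void *get() const { return Event; }
};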

jdoerfert added inline comments.Aug 23 2021, 11:42 AM
openmp/libomptarget/src/device.cpp
281

Should we skip the event stuff if we failed?

fix rebase errors

remove unrelated changes

ye-luo added inline comments.Aug 31 2021, 2:09 PM
openmp/libomptarget/src/device.cpp
100

I feel it is better to have a wrapper that wraps the raw Event and handles destroyEvent in its destructor.

Once the HostDataToTargetMap destructor is called, all the Events are leaked.

289

For the second point, it's a compatibility consideration, that is if there is a platform that doesn't allow event to be reused after it is fulfilled. I don't know.

We don't have such a platform to worry about. If there is one, we just mark the platform as not supporting events that we expect.

310

Missing return error check

318

After this point, if another thread changes Entry->Event and destroys the old event,
the currently copied Event becomes invalid.

tianshilei1992 marked an inline comment as done.Oct 16 2021, 4:41 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
100

HostDataToTargetMap is destroyed along with libomptarget. Because of the way libomptarget and plugins are connected, plugins could already have been destroyed at that moment, and if that is the case, it is undefined behavior to call functions in plugins. D111954 can make sure the events are correctly released.

tianshilei1992 marked an inline comment as done.

rebase and fix comments

tianshilei1992 marked 6 inline comments as done.Dec 27 2021, 8:32 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
281

It's a different design philosophy. Either is fine to me, but arguably, if a target supports events and the related interfaces return an error, I don't think we should continue (skip).

openmp/libomptarget/src/omptarget.cpp
601

Yes, this piece of code is kind of redundant. However, different locks are interleaved here; if we abstract it into another helper function, it could make the code more difficult to understand because we could end up with a mutex locked in the caller and unlocked in the callee. I don't think that's good practice; it opens the door to misuse of locks.

tianshilei1992 marked 2 inline comments as done.Dec 27 2021, 8:33 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
100

Now we have an event pool in the plugin, so it will not be an issue if we leave one event in libomptarget.

tianshilei1992 edited the summary of this revision. (Show Details)Dec 31 2021, 8:55 AM
JonChesterfield accepted this revision.Jan 4 2022, 10:25 AM

Fairly scary code but I think it's an improvement. Thanks for addressing the previous comments.

I still think the new/old event exchange should be removed.
If T1 records an event for a D2H transfer, and then T2 issues an H2D transfer, creates a new event, and then waits for the old event, the H2D and D2H transfers may happen together when they are on two distinct streams.
This violates the atomic data transfer behavior.
The solution is one persistent event per map, and always checking its status before issuing any transfer.

openmp/libomptarget/src/device.cpp
95

Use unique_ptr with Deleter for Event?

I still think the new/old event exchange should be removed.
If T1 records an event for a D2H transfer, and then T2 issues an H2D transfer, creates a new event, and then waits for the old event, the H2D and D2H transfers may happen together when they are on two distinct streams.
This violates the atomic data transfer behavior.
The solution is one persistent event per map, and always checking its status before issuing any transfer.

First, there is no event for D2H for now. This patch tries to make sure there is no race among H2D transfers of the same memory from multiple target tasks, especially when they all read from that memory. D2H is out of scope and has nothing to do with the scenario addressed in this patch. Without this patch, H2D and D2H can still happen at the same time on different streams.
Second, I think the case you describe is already a data race caused by the *user*. For two target tasks, if one task moves data in one direction while another task moves data in the other direction, the two tasks should not run at the same time. That is, they should have a dependency because one reads and the other writes. Even with the "atomic" behavior you mentioned, which I didn't find in the spec (maybe I missed it; could you pinpoint where it is?), it is still undetermined which one comes first.
Third, using one event all the time sounds like a feasible solution. If an entry already has an event, we can reuse it without needing to create a new one and destroy the old one, which can potentially save some overhead.

tianshilei1992 marked an inline comment as done.Jan 4 2022, 4:52 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
95

That sounds reasonable but is actually not practical. As I said before, we can't have automatic destruction of the event because when libomptarget starts to be destroyed, we don't know whether the plugin is already dead; calling into the plugin may cause a segmentation fault. Since we already have an event pool, all those events will be destroyed properly when a plugin is destroyed. For resources shared between libomptarget and plugins, libomptarget can only hold them, not own them. That being said, we can only have *explicit* destruction.

tianshilei1992 marked an inline comment as done.Jan 4 2022, 4:53 PM

reuse an event instead of always creating one

simplify logic

ye-luo added a comment.Jan 4 2022, 5:47 PM

You are right. No need to worry about mixed H2D and D2H. Since the event is really meant for H2D, it is better to name it EventH2D.

ye-luo accepted this revision.Jan 4 2022, 6:03 PM

Renaming Event to EventH2D, and the corresponding functions accordingly, is my last request. The rest looks good.

This revision is now accepted and ready to land.Jan 4 2022, 6:03 PM
ye-luo added inline comments.Jan 4 2022, 6:28 PM
openmp/libomptarget/src/device.cpp
320

After the first transfer, every lookup calls waitEvent, which is still costly.
It is better to return the event to the pool and avoid the waitEvent cost.

add clear comments on how Event should be used in current situation.

ye-luo added inline comments.Jan 4 2022, 7:47 PM
openmp/libomptarget/src/device.cpp
320

I misunderstood waitEvent. It just enqueues the event into the stream. We need to find a better place to reduce unnecessary waitEvent calls.

ye-luo added a comment.Jan 4 2022, 7:47 PM

add clear comments on how Event should be used in current situation.

That is good.

ye-luo added inline comments.Jan 4 2022, 8:12 PM
openmp/libomptarget/src/device.cpp
279

One more question. Should the event only protect IsNew, or both IsNew and HasFlagAlways? I feel that in the case of HasFlagAlways but not IsNew, it is the user's responsibility to handle the ordering of multiple transfers.

320

I think waitEvent can be bypassed if the present modifier is true.

JonChesterfield added inline comments.Jan 5 2022, 2:02 AM
openmp/libomptarget/src/device.cpp
95

This would be solvable by calling dlclose on the plugins that have been opened by dlopen before libomptarget is destructed. That seems like a good idea independent of this patch - the plugin lifetime can be strictly nested within the lifetime of libomptarget.so
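
A sketch of that strictly nested lifetime using plain POSIX dlopen/dlclose (illustrative only; the real plugin manager is more involved, and the reply below explains why ordering this against global destructors turned out to be hard):

#include <dlfcn.h>
#include <vector>

// Keep plugin handles in an object whose destructor runs while libomptarget is
// being torn down, so every plugin is dlclose'd before the host runtime goes away.
struct PluginManager {
  std::vector<void *> Handles;

  void *load(const char *Name) {
    void *H = dlopen(Name, RTLD_NOW | RTLD_LOCAL);
    if (H)
      Handles.push_back(H);
    return H;
  }

  ~PluginManager() {
    for (void *H : Handles)
      dlclose(H); // plugin lifetime strictly nested inside libomptarget's
  }
};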

tianshilei1992 marked 4 inline comments as done.Jan 5 2022, 4:21 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
95

IIRC, I did try that, but that didn't solve the problem. We want to have a destructor function, which contains the dlclose, called at the end of the life cycle of libomptarget, even after all globals are destructed. That is not an easy task.

279

It should protect both, because the race we are trying to fix in this patch is transparent to the user. Even two target tasks reading the same object/variable/memory can trigger the race.

tianshilei1992 marked 2 inline comments as done.Jan 5 2022, 4:21 PM

minor optimization to bypass event if present modifier appears

tianshilei1992 marked an inline comment as done.Jan 5 2022, 5:04 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
320

waitEvent will not be triggered for entry with present modifier because its attached event will be empty, but we can save a lock.

tianshilei1992 marked an inline comment as done.Jan 5 2022, 5:04 PM
This revision was landed with ongoing or failed builds.Jan 5 2022, 5:20 PM
This revision was automatically updated to reflect the committed changes.
ronlieb added a subscriber: ronlieb.Jan 5 2022, 5:27 PM

Hi Shilei
seems like this broke our amdgpu openmp buildbot
https://lab.llvm.org/buildbot/#/builders/193/builds/4072

Hi Shilei
seems like this broke our amdgpu openmp buildbot
https://lab.llvm.org/buildbot/#/builders/193/builds/4072

Yes. I'll fix it right away.