This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
openmp/libomptarget/src/
-
libomptarget/
-
src/
8/8
device.h
25/25
device.cpp
11/11
omptarget.cpp

Differential D104418

[OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement
ClosedPublic

Authored by tianshilei1992 on Jun 16 2021, 1:24 PM.

Download Raw Diff

Details

Reviewers

jdoerfert
grokos
JonChesterfield
ABataev
ye-luo

Commits

rG9584c6fa2fe2: [OpenMP][Offloading] Fixed data race in libomptarget caused by async data…

Summary

The async data movement can cause data race if the target supports it.
Details can be found in [1]. This patch tries to fix this problem by attaching
an event to the entry of data mapping table. Here are the details.

For each issued data movement, a new event is generated and returned to libomptarget
by calling createEvent. The event will be attached to the corresponding mapping table
entry.

For each data mapping lookup, if there is no need for a data movement, the
attached event has to be inserted into the queue to gaurantee that all following
operations in the queue can only be executed if the event is fulfilled.

This design is to avoid synchronization on host side.

Note that we are using CUDA terminolofy here. Similar mechanism is assumped to
be supported by another targets. Even if the target doesn't support it, it can
be easily implemented in the following fall back way:

Event can be any kind of flag that has at least two status, 0 and 1.
waitEvent can directly busy loop if Event is still 0.

My local test shows that bug49334.cpp can pass.

Reference:
[1] https://bugs.llvm.org/show_bug.cgi?id=49940

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

tianshilei1992 added inline comments.Jun 16 2021, 5:11 PM

openmp/libomptarget/src/device.h
53	`mutable` is because `std::set::iterator` is `const`. It makes sense because we don't want to modify the key for the set. However, here `HstPtrBegin` is the key. In order to modify it via the iterator, we need to define it as `mutable`, or use `const_cast` when we want to modify it.
openmp/libomptarget/src/omptarget.cpp
603	There can still be gap between creating an entry in map table and issuing the data movement.

In D104418#2823422, @jdoerfert wrote:

I don't understand why we have an event in the async_info object at all. I would have assumed only to have some plugin specific "lock" in the table.

That's the only way to take the event back from plugin to libomptarget.

tianshilei1992 added inline comments.Jun 16 2021, 5:48 PM

openmp/libomptarget/include/omptarget.h
139 ↗	(On Diff #352537)	I'll refine this part to remove it from async info.

So is the race this is trying to fix between the host and the target, or between regions running on the target?

Passing data back to the host suggests the former, but that doesn't make sense, because the async operations are queued up within an async object that is not shared between host threads (and has a lexically scoped lifetime, iirc).

If it's a race between different target regions on the device, we should be able to resolve that with no host involvement at all.

openmp/libomptarget/include/omptargetplugin.h
148 ↗	(On Diff #352537)	So if it's not a barrier, what is it? What is this intended to do?

I must admit that I have no idea about the internals of libomptarget. But here is my high-level view on this issue:

From my perspective, the compiler should help to distinguish two situations (a weakness of OpenMP is that the DAG only evolves during execution):
(1) target tasks with dependencies, which only synchronize with other target tasks on the same device (outgoing edges are limited to target tasks on the same device)
(2) target tasks with dependencies, which synchronize with host code or target tasks on other devices.
If the compiler cannot prove (1), fall back to (2).

If the compiler can prove, that a target task is of class (1), a fast path is possible by submitting all related device code onto the same device queue ("stream"). Similar, if the runtime can prove that outgoing edges are limited to the same device and no new edges might be added the runtime might identify additional target tasks of class (1). The latter might be difficult, because the information is in libomp not in libomptarget.

#pragma omp target enter data map(to:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp target depend(inout:a[0:N]) nowait
{...}
#pragma omp target exit data map(from:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp taskwait

In this case, no host tasks are dependent on target task dependencies -> just submit the operations onto the same device queue.

Only if the host (possibly) has dependencies on the target task, fine-grained synchronization with the host is necessary:

#pragma omp target enter data map(to:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp target depend(inout:a[0:N]) nowait
{...}
#pragma omp target exit data map(from:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp task depend(inout:a[0:N])
{}

The first two target tasks, can still rely on device queue synchronization, the last target task must ensure synchronization with the host. I think, this could be mapped to the OpenMP completion event model:

#pragma omp target exit data map(from:a[:N]) depend(inout:a[0:N]) detach(event) nowait
#pragma omp task depend(inout:a[0:N])
{}

Libomptarget "calls omp_fulfill_event(event)" when the exit data is completed, e.g., completion of asynchronous memcopy is signaled with a callback. This should result in a clean interface for synchronization between libomp and libomptarget.

I can see three candidates, where things might break at the moment:

actions for target tasks with the same dependency are not submitted to the same device queue
device queues don't enforce the necessary ordering for asynchronous execution (copy-to-device, kernel launch, copy-from-device)
class (2) target tasks miss to ensure completion before they complete and release outgoing dependencies.

Especially the last point seems like a hot candidate, if the target task just launches the asynchronous execution, but does not ensure completion before the task is marked completed.
In that case, the implicit barrier at the end of BlockMatMul_TargetNowait could pass, although not all device activity is completed.

In D104418#2823894, @protze.joachim wrote:
I must admit that I have no idea about the internals of libomptarget. But here is my high-level view on this issue:

From my perspective, the compiler should help to distinguish two situations (a weakness of OpenMP is that the DAG only evolves during execution):
(1) target tasks with dependencies, which only synchronize with other target tasks on the same device (outgoing edges are limited to target tasks on the same device)
(2) target tasks with dependencies, which synchronize with host code or target tasks on other devices.
If the compiler cannot prove (1), fall back to (2).

If the compiler can prove, that a target task is of class (1), a fast path is possible by submitting all related device code onto the same device queue ("stream"). Similar, if the runtime can prove that outgoing edges are limited to the same device and no new edges might be added the runtime might identify additional target tasks of class (1). The latter might be difficult, because the information is in libomp not in libomptarget.
#pragma omp target enter data map(to:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp target depend(inout:a[0:N]) nowait
{...}
#pragma omp target exit data map(from:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp taskwait
In this case, no host tasks are dependent on target task dependencies -> just submit the operations onto the same device queue.

Only if the host (possibly) has dependencies on the target task, fine-grained synchronization with the host is necessary:
#pragma omp target enter data map(to:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp target depend(inout:a[0:N]) nowait
{...}
#pragma omp target exit data map(from:a[:N]) depend(inout:a[0:N]) nowait
#pragma omp task depend(inout:a[0:N])
{}
The first two target tasks, can still rely on device queue synchronization, the last target task must ensure synchronization with the host. I think, this could be mapped to the OpenMP completion event model:
#pragma omp target exit data map(from:a[:N]) depend(inout:a[0:N]) detach(event) nowait
#pragma omp task depend(inout:a[0:N])
{}
Libomptarget "calls omp_fulfill_event(event)" when the exit data is completed, e.g., completion of asynchronous memcopy is signaled with a callback. This should result in a clean interface for synchronization between libomp and libomptarget.

I can see three candidates, where things might break at the moment:

actions for target tasks with the same dependency are not submitted to the same device queue

device queues don't enforce the necessary ordering for asynchronous execution (copy-to-device, kernel launch, copy-from-device)

class (2) target tasks miss to ensure completion before they complete and release outgoing dependencies.

Especially the last point seems like a hot candidate, if the target task just launches the asynchronous execution, but does not ensure completion before the task is marked completed.
In that case, the implicit barrier at the end of BlockMatMul_TargetNowait could pass, although not all device activity is completed.

Thanks for the analysis and details. The root cause of this problem (for a target supporting async data movement internally) is in libomptarget. It is unrelated to tasks in libomp. At the end of each target task, there is a mandatory synchronization to guarantee that the task can only be finished if everything is finished. This can assure that if a regular task depends on a target task, existing mechanism will work. You can refer to Section 3.4 [1] for more details. As for the detachable task, [1] also has detailed analysis.

Reference:
[1] https://tianshilei.me/wp-content/uploads/concurrent-lcpc2020.pdf

In D104418#2823848, @JonChesterfield wrote:

So is the race this is trying to fix between the host and the target, or between regions running on the target?

Passing data back to the host suggests the former, but that doesn't make sense, because the async operations are queued up within an async object that is not shared between host threads (and has a lexically scoped lifetime, iirc).

If it's a race between different target regions on the device, we should be able to resolve that with no host involvement at all.

It's a data race between host and plugin. https://bugs.llvm.org/show_bug.cgi?id=49940 has detailed analysis. Long story short, let's say we have two concurrent target regions T1 and T2 both reading memory region M. Assume T1 is scheduled to execute first. It then maps M to the device. The mapping consists of several steps:
0. Get the lock of mapping table.

Look up into the mapping table to see if M is already mapped. - No.
Allocate device memory for M and create a mapping entry for M.
Issue data transfer.
Release the lock.
Do remaining things, such as more data transfer, or directly launch the kernel.

Everything seems fine. Now T2 starts to execute. It also maps M to the device. However, steps are slightly different.
0. Get the lock of mapping table.

Look up into the mapping table to see if M is already mapped. - Yes.
Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.
Do remaining things, like Step 5 above.

It also looks good for this logic. However, there is a data race. Since we now have async data movement, Step 3 for T1 can only guarantee that movement of M is issued, but it cannot guarantee that after Step 3, the data is already on the device. For T2, its logic is, if the mapping is already in the table, it "assumes" that the data is already on the device, and therefore it doesn't issue data transfer. And then, bang! For the case that T2 is scheduled to execute by the GPU scheduler ahead of the data movement of M, the memory region on the device contains nothing from host. It is random.

So in one sentence, the root cause is the misalignment between assumption of mapping lookup logic and non-atomic data movement operations. It is not a problem for targets not supporting async data movement because everything can be guarded by the lock. Also, for those targets supporting async data movement, we don't want to use synchronous data movement here as well. The reason is obvious, for better performance. Data movement is extremely expensive, and multiple data movements can be potentially parallel, even for one single target region.

What I'm trying to do in this patch is to establish the dependency between the data movement and all other tasks using that data. We want to utilize the device side synchronization to do that by using event. The interesting thing for the event is, from host's perspective, "wait for an event" means just insert the event to the queue, and that's it. It is not like a barrier because it will not block the host. But yes, it is a barrier for the device scheduler for sure that all enqueued operations following the event can only start execution if the event is fulfilled.

In D104418#2824738, @tianshilei1992 wrote:

In D104418#2823848, @JonChesterfield wrote:

So is the race this is trying to fix between the host and the target, or between regions running on the target?

Passing data back to the host suggests the former, but that doesn't make sense, because the async operations are queued up within an async object that is not shared between host threads (and has a lexically scoped lifetime, iirc).

If it's a race between different target regions on the device, we should be able to resolve that with no host involvement at all.

It's a data race between host and plugin. https://bugs.llvm.org/show_bug.cgi?id=49940 has detailed analysis. Long story short, let's say we have two concurrent target regions T1 and T2 both reading memory region M. Assume T1 is scheduled to execute first. It then maps M to the device. The mapping consists of several steps:
0. Get the lock of mapping table.

Look up into the mapping table to see if M is already mapped. - No.

Allocate device memory for M and create a mapping entry for M.

Issue data transfer.

Release the lock.

Do remaining things, such as more data transfer, or directly launch the kernel.

Everything seems fine. Now T2 starts to execute. It also maps M to the device. However, steps are slightly different.
0. Get the lock of mapping table.

Look up into the mapping table to see if M is already mapped. - Yes.

Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.

Do remaining things, like Step 5 above.

It also looks good for this logic. However, there is a data race. Since we now have async data movement, Step 3 for T1 can only guarantee that movement of M is issued, but it cannot guarantee that after Step 3, the data is already on the device. For T2, its logic is, if the mapping is already in the table, it "assumes" that the data is already on the device, and therefore it doesn't issue data transfer. And then, bang! For the case that T2 is scheduled to execute by the GPU scheduler ahead of the data movement of M, the memory region on the device contains nothing from host. It is random.

So in one sentence, the root cause is the misalignment between assumption of mapping lookup logic and non-atomic data movement operations. It is not a problem for targets not supporting async data movement because everything can be guarded by the lock. Also, for those targets supporting async data movement, we don't want to use synchronous data movement here as well. The reason is obvious, for better performance. Data movement is extremely expensive, and multiple data movements can be potentially parallel, even for one single target region.

What I'm trying to do in this patch is to establish the dependency between the data movement and all other tasks using that data. We want to utilize the device side synchronization to do that by using event. The interesting thing for the event is, from host's perspective, "wait for an event" means just insert the event to the queue, and that's it. It is not like a barrier because it will not block the host. But yes, it is a barrier for the device scheduler for sure that all enqueued operations following the event can only start execution if the event is fulfilled.

Thanks for the explanation, I think, I got the issue now.

Following your example, if T1 and T2 are really concurrent, I think it should be an application issue (data race), if they both perform initial mapping on the same memory without synchronization. I would bring this to the OpenMP LC.

For bug49334.cpp this means, that depend(in: BlockA[0], BlockB[0]) is not sufficient and the dependency should rather be depend(inout: BlockA[0], BlockB[0]). The out dependency reflects the writing to the device copy of the variables or, respectively, the possible initial mapping.

In D104418#2824738, @tianshilei1992 wrote:

So in one sentence, the root cause is the misalignment between assumption of mapping lookup logic and non-atomic data movement operations. It is not a problem for targets not supporting async data movement because everything can be guarded by the lock. Also, for those targets supporting async data movement, we don't want to use synchronous data movement here as well. The reason is obvious, for better performance. Data movement is extremely expensive, and multiple data movements can be potentially parallel, even for one single target region.

What I'm trying to do in this patch is to establish the dependency between the data movement and all other tasks using that data. We want to utilize the device side synchronization to do that by using event. The interesting thing for the event is, from host's perspective, "wait for an event" means just insert the event to the queue, and that's it. It is not like a barrier because it will not block the host. But yes, it is a barrier for the device scheduler for sure that all enqueued operations following the event can only start execution if the event is fulfilled.

I believe that your root cause description is correct, but I don't understand how the mentioned locks currently present in mapping logic can fix the case when a target device only implements the synchronous API (which is the first problem described in 49940). Let me explain what I mean.

In D104418#2824738, @tianshilei1992 wrote:

It's a data race between host and plugin. https://bugs.llvm.org/show_bug.cgi?id=49940 has detailed analysis. Long story short, let's say we have two concurrent target regions T1 and T2 both reading memory region M. Assume T1 is scheduled to execute first. It then maps M to the device. The mapping consists of several steps:
0. Get the lock of mapping table.

Look up into the mapping table to see if M is already mapped. - No.

Allocate device memory for M and create a mapping entry for M.

Issue data transfer.

Release the lock.

Do remaining things, such as more data transfer, or directly launch the kernel.

Everything seems fine. Now T2 starts to execute. It also maps M to the device. However, steps are slightly different.
0. Get the lock of mapping table.

Look up into the mapping table to see if M is already mapped. - Yes.

Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.

Do remaining things, like Step 5 above.

It also looks good for this logic. However, there is a data race. Since we now have async data movement, Step 3 for T1 can only guarantee that movement of M is issued, but it cannot guarantee that after Step 3, the data is already on the device. For T2, its logic is, if the mapping is already in the table, it "assumes" that the data is already on the device, and therefore it doesn't issue data transfer. And then, bang! For the case that T2 is scheduled to execute by the GPU scheduler ahead of the data movement of M, the memory region on the device contains nothing from host. It is random.

To the best of my knowledge, the given step-by-step has a small problem: the mapping table lock is not released in step 4, but instead, it is released between steps 2 and 3.

Since the proposed patch stores the event in the AsyncInfoTy, when interacting with a synchronous-only plugin we can still have the same data corruption problem: T1 may be paused right after the mapping table lock is released (between steps 2 and 3) and T2 will go through its target region execution not issuing the appropriate data transfer.

If what I described above is correct, we can fix this synchronous case with two simple changes added to https://reviews.llvm.org/D104382 patch:

Add a mutex to HostDataToTargetTy that will syncrhonize concurrent external (meaning outside the DeviceTy code) accesses;
TargetPointerResultTy locks this mutex on construction and unlock it on destruction (a simple std::unique_lock can suffice).

This way we ensure that multiple threads dealing with the same mapping table entry are synchronized, fixing the first problem in 49940. What do you think?

Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.

That doesn't sound right. There's no reason to expect the data that was copied across will not be modified by the first region. The openmp semantics look like copy to device at target region start and copy back at the end. Picking up partially written data from another executing kernel breaks that.

Given two target regions that both map the same data, I'm fairly sure both maps have to happen, unless we do significantly more thorough analysis to determine that the output of one kernel is definitely the same as the input to the next, in which case we would be safer fusing them.

In D104418#2824984, @JonChesterfield wrote:

Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.

That doesn't sound right. There's no reason to expect the data that was copied across will not be modified by the first region. The openmp semantics look like copy to device at target region start and copy back at the end. Picking up partially written data from another executing kernel breaks that.

Given two target regions that both map the same data, I'm fairly sure both maps have to happen, unless we do significantly more thorough analysis to determine that the output of one kernel is definitely the same as the input to the next, in which case we would be safer fusing them.

I don't agree with you. T2 should follow the reference counting transfer behavior and not issue data transfer. If T1 change things on the device and surprise T2, it is a bug in the user code, user is responsible to express dependency on the target tasks.

In D104418#2824984, @JonChesterfield wrote:

Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.

That doesn't sound right. There's no reason to expect the data that was copied across will not be modified by the first region. The openmp semantics look like copy to device at target region start and copy back at the end. Picking up partially written data from another executing kernel breaks that.

Given two target regions that both map the same data, I'm fairly sure both maps have to happen, unless we do significantly more thorough analysis to determine that the output of one kernel is definitely the same as the input to the next, in which case we would be safer fusing them.

Like Ye said, two target regions both *read* only from same memory region. It is not a race condition from user's perspective.

In D104418#2824961, @gValarini wrote:

In D104418#2824738, @tianshilei1992 wrote:

So in one sentence, the root cause is the misalignment between assumption of mapping lookup logic and non-atomic data movement operations. It is not a problem for targets not supporting async data movement because everything can be guarded by the lock. Also, for those targets supporting async data movement, we don't want to use synchronous data movement here as well. The reason is obvious, for better performance. Data movement is extremely expensive, and multiple data movements can be potentially parallel, even for one single target region.

What I'm trying to do in this patch is to establish the dependency between the data movement and all other tasks using that data. We want to utilize the device side synchronization to do that by using event. The interesting thing for the event is, from host's perspective, "wait for an event" means just insert the event to the queue, and that's it. It is not like a barrier because it will not block the host. But yes, it is a barrier for the device scheduler for sure that all enqueued operations following the event can only start execution if the event is fulfilled.

but I don't understand how the mentioned locks currently present in mapping logic can fix the case when a target device only implements the synchronous API (which is the first problem described in 49940).

That's a good point. Thanks for pointing it out.

In D104418#2824738, @tianshilei1992 wrote:

It's a data race between host and plugin. https://bugs.llvm.org/show_bug.cgi?id=49940 has detailed analysis. Long story short, let's say we have two concurrent target regions T1 and T2 both reading memory region M. Assume T1 is scheduled to execute first. It then maps M to the device. The mapping consists of several steps:
0. Get the lock of mapping table.

Look up into the mapping table to see if M is already mapped. - No.

Allocate device memory for M and create a mapping entry for M.

Issue data transfer.

Release the lock.

Do remaining things, such as more data transfer, or directly launch the kernel.

Everything seems fine. Now T2 starts to execute. It also maps M to the device. However, steps are slightly different.
0. Get the lock of mapping table.

Look up into the mapping table to see if M is already mapped. - Yes.

Since the mapping is already there, T2 will NOT issue data transfer. This can improve performance. It doesn't make sense to transfer same data for several times.

Do remaining things, like Step 5 above.

It also looks good for this logic. However, there is a data race. Since we now have async data movement, Step 3 for T1 can only guarantee that movement of M is issued, but it cannot guarantee that after Step 3, the data is already on the device. For T2, its logic is, if the mapping is already in the table, it "assumes" that the data is already on the device, and therefore it doesn't issue data transfer. And then, bang! For the case that T2 is scheduled to execute by the GPU scheduler ahead of the data movement of M, the memory region on the device contains nothing from host. It is random.

To the best of my knowledge, the given step-by-step has a small problem: the mapping table lock is not released in step 4, but instead, it is released between steps 2 and 3.

Since the proposed patch stores the event in the AsyncInfoTy, when interacting with a synchronous-only plugin we can still have the same data corruption problem: T1 may be paused right after the mapping table lock is released (between steps 2 and 3) and T2 will go through its target region execution not issuing the appropriate data transfer.

If what I described above is correct, we can fix this synchronous case with two simple changes added to https://reviews.llvm.org/D104382 patch:

Add a mutex to HostDataToTargetTy that will syncrhonize concurrent external (meaning outside the DeviceTy code) accesses;

TargetPointerResultTy locks this mutex on construction and unlock it on destruction (a simple std::unique_lock can suffice).

This way we ensure that multiple threads dealing with the same mapping table entry are synchronized, fixing the first problem in 49940. What do you think?

That sounds right, which means for each entry, there can only be one thread touching it. Or actually we can pass the update decision to getOrAllocTgtPtr such that we don't have to deal with lock(s) across functions.

In D104418#2825161, @tianshilei1992 wrote:

Or actually we can pass the update decision to getOrAllocTgtPtr such that we don't have to deal with lock(s) across functions.

This could be a nicer solution alongside a proper renaming of the getOrAllocTgtPtr to reflect its multiple responsibilities (if you go forward with it, of course). Just two comments though:

It would be good to check if other "users" of the de DeviceTy wouldn't need such synchronization mechanism. If so, they could use the same changes introduced by https://reviews.llvm.org/D104382 (getTgtPtrBegin used by targetDataEnd is the only example that I can think of);
Maybe letting the lifetime of TargetPointerResultTy instances determine for how long an entry is locked is the best future-proof solution. I am thinking about future changes that could be done to the targetDataBegin function (and maybe other ones) that could use the additional synchronization. But this is just a thought though.

Following your example, if T1 and T2 are really concurrent, I think it should be an application issue (data race), if they both perform initial mapping on the same memory without synchronization. I would bring this to the OpenMP LC.

Not exactly. Two concurrent target regions can read from same memory region.

int array[1024];
#pragma omp target map(to: data) nowait
{ /* read array */ }
#pragma omp target map(to: data) nowait
{ /* read array */ }

I think this is a valid program, no matter from user's perspective or specification's perspective. How to deal with data mapping is implementation details IMO, but implementation should guarantee that array in the two regions should be same. It is not the case for us currently.

For bug49334.cpp this means, that depend(in: BlockA[0], BlockB[0]) is not sufficient and the dependency should rather be depend(inout: BlockA[0], BlockB[0]). The out dependency reflects the writing to the device copy of the variables or, respectively, the possible initial mapping.

Correct, but in the body of the task, BlockA[0] and BlockB[0] are only read. It's valid to mark them as in only.

In D104418#2825481, @tianshilei1992 wrote:

Correct, but in the body of the task, BlockA[0] and BlockB[0] are only read. It's valid to mark them as in only.

What I mean is the write to the corresponding variable implied by the map.
Do you agree that the following would be invalid and needs inout for target enter data?

#pragma omp target enter data nowait map(to: BlockA[:BS * BS]) depend(in: BlockA[0])
#pragma omp target nowait depend(in: BlockA[0])
{}

In D104418#2825536, @protze.joachim wrote:
In D104418#2825481, @tianshilei1992 wrote:

Correct, but in the body of the task, BlockA[0] and BlockB[0] are only read. It's valid to mark them as in only.

What I mean is the write to the corresponding variable implied by the map.
Do you agree that the following would be invalid and needs inout for target enter data?
#pragma omp target enter data nowait map(to: BlockA[:BS * BS]) depend(in: BlockA[0])
#pragma omp target nowait depend(in: BlockA[0])
{}

That is actually a very interesting question. Theoretically, the dependency setting here is not wrong, but it indeed doesn't work. But write/read variable cannot be implied by the map. We can definitely write to a variable in target region but don't transfer it back (map(to)), and we can also only read the variable and transfer it back (map(tofrom)). They are both valid.

tianshilei1992 mentioned this in D104555: [OpenMP][Offloading] Fix data race in data mapping by using two locks.Jun 18 2021, 11:00 AM

refine

Herald added a subscriber: jfb. · View Herald TranscriptJun 18 2021, 12:36 PM

tianshilei1992 planned changes to this revision.Jun 18 2021, 12:39 PM

tianshilei1992 added inline comments.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
24 ↗	(On Diff #353077)	This should be unrelated...
openmp/libomptarget/src/device.cpp
14	must be `clangd` automatic generated code
openmp/libomptarget/src/device.h
16	unrelated
53	We have to use `shared_ptr` here as this data structure has to be copied but `std::atomic` is un-copyable.
openmp/libomptarget/src/omptarget.cpp
590–614	Here is very critical. Expect opinions.
594	But with D104555, these whole bunch of code will be moved into `getTargetPointer`.
openmp/libomptarget/test/offloading/bug49334.cpp
73 ↗	(On Diff #353077)	With D104552, this is no longer needed.

tianshilei1992 requested review of this revision.Jun 18 2021, 12:41 PM

tianshilei1992 edited the summary of this revision. (Show Details)Jun 18 2021, 12:43 PM

The logic now looks much more sane to me. This is what I was expecting. Plugins can implement "events" the way they want to.

openmp/libomptarget/src/device.h
53	Why is this a shared ptr?
openmp/libomptarget/src/omptarget.cpp
601	Worth a helper to avoid duplication.
614	I don't get why we need this. At the moment at which we are looking at the Entry I would expect it has an event or not, but neither is cause for waiting. When a new entry is created it should only be exposed to others once the event has been setup as well. Plugins that don't need events can just use nullptr without it being a problem. If this is not a problem, I suspect we want a separate flag in the entry for the status. The event should be allows to be null, the flag can be "in_progress" which indicates the rest needs to wait.

tianshilei1992 added inline comments.Jun 18 2021, 1:04 PM

openmp/libomptarget/src/omptarget.cpp
614	This can be removed when this whole bunch of thing is moved to the new function in D104555, otherwise it can still happen that the new entry is created but data movement has not been issued because this function is not guarded by the lock.

tianshilei1992 added inline comments.Jun 18 2021, 4:15 PM

openmp/libomptarget/src/device.h
53	Because `unique_ptr` is non-copyable.

JonChesterfield added inline comments.Jun 19 2021, 1:17 AM

openmp/libomptarget/src/device.h
53	Could write a copy constructor. Also possible it only needs to be moved, not copied. Don't really want to add an ad hoc garbage collector (aka shared_ptr) to the existing complexity if we can avoid it, lifetimes are opaque enough already

Harbormaster completed remote builds in B109983: Diff 353077.Jun 19 2021, 7:41 AM

rebased on D104555

tianshilei1992 added a parent revision: D104555: [OpenMP][Offloading] Fix data race in data mapping by using two locks.Jun 20 2021, 5:23 PM

Harbormaster completed remote builds in B110108: Diff 353251.Jun 20 2021, 5:23 PM

removed unrelated changes

Harbormaster completed remote builds in B110109: Diff 353252.Jun 20 2021, 5:28 PM

tianshilei1992 retitled this revision from [PoC][WIP][OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement to [OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement.Jun 20 2021, 5:30 PM

tianshilei1992 edited the summary of this revision. (Show Details)

protze.joachim added inline comments.Jun 23 2021, 8:24 AM

openmp/libomptarget/src/device.cpp
160–165	Which patch removes the `bool &IsHostPtr,` argument? I cannot see this change in any of the patches in the stack or on main. This patch fails to apply after applying the other two patches in the stack.
openmp/libomptarget/src/omptarget.cpp
504–509	This probably shouldn't be here?

tianshilei1992 added inline comments.Jun 23 2021, 10:09 AM

openmp/libomptarget/src/omptarget.cpp
504–509	yeah, it's a rebase error. I'll reorg the patch stack.

Reading the event-based solution, one corner case came up to my mind (it is odd, but possible):

#pragma omp target map(to: A, B) nowait
{ /* read A and B */ }

#pragma omp target map(to: B, A) nowait
{ /* read A and B */ }

Note that the only difference between them (besides their internal code) is the ordering of the buffers in the mapping list.

Since both regions just read A and B, there is no need for a dependency between them (from the programmers perspective) and, thus, they can be executed by two different threads as follows:

T1: alloc A -> issue A submission -> create event for A -> alloc B (not new, MoveData = false) -> wait for B event indefinitly ...
T2: alloc B -> issue B submission -> create event for B -> alloc A (not new, MoveData = false) -> wait for A event indefinitly ...

This issue seems similar to the lock ordering problem. It can occur depending on the answers to the questions below:

Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (__tgt_rtl_synchronize)? If yes, then the above problem can occur since no synchronization call is done.
Should a plugin implementation indirectly synchronize the queue where an event was created when a call to __tgt_rtl_wait_event is made? If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.
Are the variables in the mapping list ordered somehow? If that is the case, then the (possible) problem described above will never happen as well (like resolving it through lock ordering).

If this is a problem, sorting the mapping list or ensuring the behavior described in the second question would probably solve it. What do you think? Is this line of thought correct?

In D104418#2850555, @gValarini wrote:
Reading the event-based solution, one corner case came up to my mind (it is odd, but possible):
#pragma omp target map(to: A, B) nowait
{ /* read A and B */ }

#pragma omp target map(to: B, A) nowait
{ /* read A and B */ }
Note that the only difference between them (besides their internal code) is the ordering of the buffers in the mapping list.

Since both regions just read A and B, there is no need for a dependency between them (from the programmers perspective) and, thus, they can be executed by two different threads as follows:

T1: alloc A -> issue A submission -> create event for A -> alloc B (not new, MoveData = false) -> wait for B event indefinitly ...

T2: alloc B -> issue B submission -> create event for B -> alloc A (not new, MoveData = false) -> wait for A event indefinitly ...

The case you mentioned is not a problem. Say we have two queues for the two target regions. So the contents of each queue is:
Q1: data movement of A, notification of event E1, wait on E2, kernel issue.
Q2: data movement of B, notification of event E2, wait on E1, kernel issue.
You can see the notification of the two events always before the wait. No matter what execution orders they will be, it will not dead lock.

Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (__tgt_rtl_synchronize)? If yes, then the above problem can occur since no synchronization call is done.

I'm not sure I follow. Plugin will never defer any operation. If the target supports async operations, the plugin just issues the operation, and that's it. As for whether the device will defer or not, it's plugin transparent.

Should a plugin implementation indirectly synchronize the queue where an event was created when a call to __tgt_rtl_wait_event is made? If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.

Creating an event just needs to "take a snapshot" of current queue. It doesn't need to synchronize. We just need to make sure that:

The event will be notified only if all operations in the "snapshot" are finished, or say all operations before it are finished.
If any operation depends on the event, it will not be executed until the event is fulfilled/notified.

Are the variables in the mapping list ordered somehow? If that is the case, then the (possible) problem described above will never happen as well (like resolving it through lock ordering).

That's a good question. I don't have a clear answer for that. I remember the front end will sort somehow, but don't know in what order.

gValarini added a comment.Jun 30 2021, 1:20 PM

This comment was removed by gValarini.

In D104418#2850619, @tianshilei1992 wrote:

The case you mentioned is not a problem. Say we have two queues for the two target regions. So the contents of each queue is:
Q1: data movement of A, notification of event E1, wait on E2, kernel issue.
Q2: data movement of B, notification of event E2, wait on E1, kernel issue.
You can see the notification of the two events always before the wait. No matter what execution orders they will be, it will not dead lock.

Uhm, I think I misunderstood how the events would interact with the queues. But now I have a doubt if __tgt_rtl_wait_event should be allowed to be blocking. Let me explain with my first question at #2850555.

In D104418#2850555, @gValarini wrote:

Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (__tgt_rtl_synchronize)? If yes, then the above problem can occur since no synchronization call is done.

I'm not sure I follow. Plugin will never defer any operation. If the target supports async operations, the plugin just issues the operation, and that's it. As for whether the device will defer or not, it's plugin transparent.

Some of the async APIs that I am most used to (like the C++ async library and the non-blocking MPI functions) have the restriction that one should never consider any computation/communication to be done (or even in-flight) until their respective synchronization functions are properly called. This is needed considering that a functional (but not optimized) implementation of those async APIs would be to defer their execution until such synchronization point. If I regard it correctly, this restriction is even present in the CUDA streams specification, but please don't quote me on that since I am not a CUDA expert.

If libomptarget follows such principles, __tgt_rtl_wait_event should not be allowed to be blocking, since that would imply that the "notification of event E#" that you described could be assumed to be completed before __tgt_rtl_synchronize is called, which would be a little bit strange considering the async API. This behavior was what even led me to my second question at #2850555 regarding possibles indirect synchronizations done by __tgt_rtl_wait_event.

In D104418#2850555, @gValarini wrote:

Should a plugin implementation indirectly synchronize the queue where an event was created when a call to __tgt_rtl_wait_event is made? If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.

Creating an event just needs to "take a snapshot" of current queue. It doesn't need to synchronize. We just need to make sure that:

The event will be notified only if all operations in the "snapshot" are finished, or say all operations before it are finished.

If any operation depends on the event, it will not be executed until the event is fulfilled/notified.

Do these async principles make sense for libomptarget? What about restricting __tgt_rtl_wait_event to be only non-blocking and thus always be inserted in the async queue?

By the way, thanks for the clarifications, the event API is much clear to me now.

jdoerfert added inline comments.Jun 30 2021, 2:48 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
1145 ↗	(On Diff #353252)	@gValarini Do these async principles make sense for libomptarget? What about restricting __tgt_rtl_wait_event to be only non-blocking and thus always be inserted in the async queue? If we have a queue, we can and will insert the wait in there (see above). I don't think it is needed as "wait_even", similar to "synchronize" is a proper synchronization point. Not sure what the problem here is.

In D104418#2851339, @gValarini wrote:

In D104418#2850619, @tianshilei1992 wrote:

The case you mentioned is not a problem. Say we have two queues for the two target regions. So the contents of each queue is:
Q1: data movement of A, notification of event E1, wait on E2, kernel issue.
Q2: data movement of B, notification of event E2, wait on E1, kernel issue.
You can see the notification of the two events always before the wait. No matter what execution orders they will be, it will not dead lock.

Uhm, I think I misunderstood how the events would interact with the queues. But now I have a doubt if __tgt_rtl_wait_event should be allowed to be blocking. Let me explain with my first question at #2850555.

In D104418#2850555, @gValarini wrote:

Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (__tgt_rtl_synchronize)? If yes, then the above problem can occur since no synchronization call is done.

I'm not sure I follow. Plugin will never defer any operation. If the target supports async operations, the plugin just issues the operation, and that's it. As for whether the device will defer or not, it's plugin transparent.

Some of the async APIs that I am most used to (like the C++ async library and the non-blocking MPI functions) have the restriction that one should never consider any computation/communication to be done (or even in-flight) until their respective synchronization functions are properly called. This is needed considering that a functional (but not optimized) implementation of those async APIs would be to defer their execution until such synchronization point. If I regard it correctly, this restriction is even present in the CUDA streams specification, but please don't quote me on that since I am not a CUDA expert.

If libomptarget follows such principles, __tgt_rtl_wait_event should not be allowed to be blocking, since that would imply that the "notification of event E#" that you described could be assumed to be completed before __tgt_rtl_synchronize is called, which would be a little bit strange considering the async API. This behavior was what even led me to my second question at #2850555 regarding possibles indirect synchronizations done by __tgt_rtl_wait_event.

In D104418#2850555, @gValarini wrote:

Should a plugin implementation indirectly synchronize the queue where an event was created when a call to __tgt_rtl_wait_event is made? If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.

Creating an event just needs to "take a snapshot" of current queue. It doesn't need to synchronize. We just need to make sure that:

The event will be notified only if all operations in the "snapshot" are finished, or say all operations before it are finished.

If any operation depends on the event, it will not be executed until the event is fulfilled/notified.

Do these async principles make sense for libomptarget? What about restricting __tgt_rtl_wait_event to be only non-blocking and thus always be inserted in the async queue?

Actually, the function naming here is kind of confusing, although I inherited CUDA terminology. One of the most confusing thing is, "wait for an event" doesn't really mean "waiting", at least from host's perspective. Let me explain this here so it will become much clearer.

For CUDA, event wait and event synchronize are different. If you look at CUDA document (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g7840e3984799941a61839de40413d1d9), the function signature of event wait is cudaStreamWaitEvent ( cudaStream_t stream, cudaEvent_t event, unsigned int flags). It basically works "like" that insert event to the stream, and then return. It is non-blocking. GPU scheduler will make sure that all operations enqueued to the stream afterwards will not be executed until the event is fulfilled. Event synchronization is different. As shown here (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html#group__CUDART__EVENT_1g949aa42b30ae9e622f6ba0787129ff22), it only accepts the event. So it simply blocks the host execution until the event is fulfilled.

So whether __tgt_rtl_wait_event is blocking or non-blocking is implementation dependent. For example, for a target that doesn't support the whole event mechanism we mentioned before, or it doesn't support async operation at all, it can implement the wait event to be blocking in the following way:

An event is an atomic integer, initialized to 0;
When the event is fulfilled (or notified), it simply sets it to 1;
Wait for an event is simply equivalent to set a loop until it becomes 1.

tianshilei1992 removed a parent revision: D104382: [OpenMP][Offloading] Refined return value of `DeviceTy::getOrAllocTgtPtr`.Jul 22 2021, 6:35 PM

rebase

tianshilei1992 marked 11 inline comments as done.Jul 22 2021, 7:04 PM

Harbormaster completed remote builds in B115757: Diff 361062.Jul 22 2021, 8:44 PM

tianshilei1992 mentioned this in rG18ce3d3f2c36: [OpenMP][Offloading] Fix data race in data mapping by using two locks.Jul 23 2021, 1:11 PM

rebase

Harbormaster completed remote builds in B115937: Diff 361322.Jul 23 2021, 1:52 PM

ye-luo added inline comments.Jul 23 2021, 1:56 PM

openmp/libomptarget/plugins/cuda/src/rtl.cpp
1189 ↗	(On Diff #361322)	CU_EVENT_WAIT_DEFAULT seems new from CUDA 11.1. cuStreamWaitEvent https://docs.nvidia.com/cuda/archive/11.1.0/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g6a898b652dfc6aa1d5c8d97062618b2f https://docs.nvidia.com/cuda/archive/11.0/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g6a898b652dfc6aa1d5c8d97062618b2f A solution seems like putting 0 for <11.1

rebase and fix comments

Harbormaster completed remote builds in B116188: Diff 361673.Jul 26 2021, 8:51 AM

ping

rebase

rebase again to remove unrelated changes

Harbormaster completed remote builds in B118301: Diff 364676.Aug 5 2021, 8:09 PM

tianshilei1992 mentioned this in D107656: [OpenMP] Use events and taskyield in target nowait task to unblock host threads.Aug 7 2021, 5:05 PM

A few minor nits. Generally the patch is now in good shape, its logic is much simpler than the first iteration and it's easy to see that we don't risk running into deadlocks or further races. We haven't been very enthusiastic about alternative approaches; meanwhile there's a race waiting to be fixed, so I think we should move on with providing a solution. LGTM on my side, but since a lot of other people have reviewed the patch, let's wait until we hear them, too.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
132 ↗	(On Diff #364676)	It's a bit weird that `createEvent` is defined inside the top-level namespace, whereas `waitEvent` and `destroyEvent` are members of `DeviceRTLTy`.
1379 ↗	(On Diff #364676)	`DeviceId` is not used in this function, so it doesn't need to be an argument.
openmp/libomptarget/src/omptarget.cpp
591–599	This code is a direct duplicate of the logic inside `getTargetPointer`. Can we factor it out into a new helper function?

This revision is now accepted and ready to land.Aug 20 2021, 2:50 AM

There is a performance hit from frequent creating/destroying events at very transfer. The create/destroy logic can be separate from recording events.

This revision now requires changes to proceed.Aug 20 2021, 7:55 AM

In D104418#2957432, @ye-luo wrote:

There is a performance hit from frequent creating/destroying events at very transfer. The create/destroy logic can be separate from recording events.

What an event is and how it is handled is plugin-specific. createEvent and destroyEvent can be implemented e.g. via a pool of events where an event will be picked upon createEvent and returned back to the pool upon destroyEvent (similar to what we do for streams in the CUDA plugin). So these performance considerations can be fixed in a subsequent patch. Right now there is a race condition - a correctness concern - that we should fix. At the very least, since the logic in the base library is correct and we pretty much agree on it, we should let this patch land without the CUDA plugin changes; then we can provide optimized implementations for the __tgt_rtl_*_event functions for the plugins we care about in future patches.

In D104418#2957609, @grokos wrote:

In D104418#2957432, @ye-luo wrote:

There is a performance hit from frequent creating/destroying events at very transfer. The create/destroy logic can be separate from recording events.

What an event is and how it is handled is plugin-specific. createEvent and destroyEvent can be implemented e.g. via a pool of events where an event will be picked upon createEvent and returned back to the pool upon destroyEvent (similar to what we do for streams in the CUDA plugin). So these performance considerations can be fixed in a subsequent patch. Right now there is a race condition - a correctness concern - that we should fix. At the very least, since the logic in the base library is correct and we pretty much agree on it, we should let this patch land without the CUDA plugin changes; then we can provide optimized implementations for the __tgt_rtl_*_event functions for the plugins we care about in future patches.

Now Event is stored in HostDataToTargetTy. It is not plugin-specific logic.

Consider the issues in the current implementation I just found. I have a strong feeling that the current fix should not get in.
I think we should just create one event per HostDataToTargetTy at the constructor and use the same event for a give mapped region.

In order to avoid blocking the good pieces in this patch, it is better to take out all the plugin API and CUDA plugin change to a separate patch.
I demand split createEvent into createEvent and recordEvent and also make each API doing well defined to minimal unit of work.

Using a pool of events is an optimization we may try at a later stage. Right now, I'm more concerned about the issue I found above instead of performance impact.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
1362 ↗	(On Diff #364676)	Why waiting for an event needs to pull a stream in. cuEventSynchronize
1379 ↗	(On Diff #364676)	AsyncInfo is not needed. If the stream has been returned to the poo, you just print nullptr all the time.
openmp/libomptarget/src/device.cpp
269	I think the purpose of introducing this data transfer event is to achieve the atomic transfer behavior. So it is unsafe to issue transfer right away without checking the status of Event. If a map(always comes from a different thread.
289	The current event never got destroyed unitl the next transfer. event doesn't got properly destroyed. almost like one event per map.

tianshilei1992 mentioned this in D108528: [OpenMP][Offloading] Add support for event related interfaces.Aug 22 2021, 3:42 PM

rebase

tianshilei1992 edited parent revisions, added: D108528: [OpenMP][Offloading] Add support for event related interfaces; removed: D104555: [OpenMP][Offloading] Fix data race in data mapping by using two locks.Aug 22 2021, 7:32 PM

tianshilei1992 marked 2 inline comments as done.Aug 22 2021, 7:43 PM

tianshilei1992 added inline comments.

openmp/libomptarget/src/device.cpp
269	The event has to be created after the issue. At that moment, the entry might not have any associated event.
289	For your first comment, yes, and now it is fixed by an ugly way. A nicer solution would be related to the API design in D108528where which we can discuss. The question is whether we want to pass async info to the plugin for event creation, synchronization, and destroy. For the second point, it's a compatibility consideration, that is if there is a platform that doesn't allow event to be reused after it is fulfilled. I don't know.

tianshilei1992 marked 2 inline comments as done.Aug 22 2021, 7:43 PM

tianshilei1992 added inline comments.Aug 22 2021, 7:46 PM

openmp/libomptarget/src/device.cpp
80	The change of the function signature would not be required if we don't need to pass `AsyncInfo` to the plugin. Potentially we can wrap the native event into a struct and the object will call `destroyEvent` in its destructor.

Harbormaster completed remote builds in B120727: Diff 368024.Aug 22 2021, 8:06 PM

jdoerfert added inline comments.Aug 23 2021, 11:42 AM

openmp/libomptarget/src/device.cpp
281	Should we skip the event stuff if we failed?

tianshilei1992 mentioned this in rG29df4ab3f3c9: [OpenMP][Offloading] Add support for event related interfaces.Aug 28 2021, 1:24 PM

rebase

Harbormaster completed remote builds in B121621: Diff 369279.Aug 28 2021, 2:30 PM

tianshilei1992 removed a parent revision: D108528: [OpenMP][Offloading] Add support for event related interfaces.Aug 28 2021, 3:23 PM

rebase

Harbormaster completed remote builds in B121626: Diff 369287.Aug 28 2021, 3:26 PM

fix rebase errors

remove unrelated changes

Harbormaster completed remote builds in B121628: Diff 369290.Aug 28 2021, 4:00 PM

ye-luo added inline comments.Aug 31 2021, 2:09 PM

openmp/libomptarget/src/device.cpp
100	I feel it is better to have a wrapper wraps the raw Event and handles destroyEvent at the destructor. Once HostDataToTargetMap destructor got called, all the Events are leaked.
289	For the second point, it's a compatibility consideration, that is if there is a platform that doesn't allow event to be reused after it is fulfilled. I don't know. We don't have such a platform to worry about. If there is one, we just mark the platform as not supporting events that we expect.
310	Missing return error check
318	after this point. If another thread changes Entry->Event and destroyed the old event. The current copied Event becomes invalid.

ye-luo mentioned this in D109017: [OpenMP][libomptarget] Add event query plugin API.Sep 10 2021, 10:32 AM

tianshilei1992 marked an inline comment as done.Oct 16 2021, 4:41 PM

tianshilei1992 added inline comments.

openmp/libomptarget/src/device.cpp
100	`HostDataToTargetMap` is destroyed along with `libomptarget`. Because of the way `libomptarget` and plugins are connected, plugins could already have been destroyed at that moment, and if that is the case, it is undefined behavior to call functions in plugins. D111954 can make sure the events are correctly released.

rebase and fix comments

Harbormaster completed remote builds in B140768: Diff 396368.Dec 27 2021, 8:32 PM

tianshilei1992 marked 6 inline comments as done.Dec 27 2021, 8:32 PM

tianshilei1992 added inline comments.

openmp/libomptarget/src/device.cpp
281	It's different design philosophy. Either is fine to me, but arguably if a target supports event and related interfaces return error, I don't think we should continue (skip).
openmp/libomptarget/src/omptarget.cpp
601	Yes, this piece of code is kind of redundant. However, there are different locks intervened, if we abstract it to another helper function, it could make the code more difficult to understand because we could find mutex are locked in caller and unlocked in callee. I don't think it's a good practice. It fairly opens a door to misuse of locks.

tianshilei1992 marked 2 inline comments as done.Dec 27 2021, 8:33 PM

tianshilei1992 added inline comments.

openmp/libomptarget/src/device.cpp
100	Now we have event pool in plugin so it will not be an issue if we leave one event in `libomptarget`.

ping

tianshilei1992 edited the summary of this revision. (Show Details)Dec 31 2021, 8:55 AM

Fairly scary code but I think it's an improvement. Thanks for addressing the previous comments.

I still think the new/old event exchange should be removed.
If T1 record an event for a D2H transfer and then T2 issues a H2D transfer and create an new event and then wait for the old event. the H2D or D2H may happen together when they are on two distinct streams.
This violates atomic data transfer behavior.
the solution is 1 persistent event per map and always check its status before issuing any transfer.

openmp/libomptarget/src/device.cpp
95	Use unique_ptr with Deleter for Event?

In D104418#3220189, @ye-luo wrote:

I still think the new/old event exchange should be removed.
If T1 record an event for a D2H transfer and then T2 issues a H2D transfer and create an new event and then wait for the old event. the H2D or D2H may happen together when they are on two distinct streams.
This violates atomic data transfer behavior.
the solution is 1 persistent event per map and always check its status before issuing any transfer.

First, there is no event for D2H for now. This patch tries to make sure there is no race among H2D of the same memory from multiple target tasks, especially when they all read from that memory. D2H is out of the scope, and has nothing to do with the scenario mentioned in this patch. Without this patch, H2D and D2H can still happen at the same time on different streams.
Second, I think the case you describe is already data race caused by *user*. For two target tasks, if one task tries to move data in one direction while another task tries to move data in another direction, the two tasks should not run at the same time. That said, they should have dependency because one tries to read and another tries to write. Even with the "atomic" behavior you mentioned, which I didn't find it on the spec (maybe I missed it. could you pin point where it is?) it is still undetermined that which one comes first.
Third, using one event all the time sounds like a feasible solution. If an entry already has an event, we can reuse that one w/o the need to create a new one and destroy the old one, which can potentially save some overhead.

tianshilei1992 marked an inline comment as done.Jan 4 2022, 4:52 PM

tianshilei1992 added inline comments.

openmp/libomptarget/src/device.cpp
95	That sounds reasonable but actually not practical. Same as what I said before, we can't have automatic destroy of the event because when `libomptarget` starts to destroy, we don't know whether the plugin is already dead. Calling to plugin may cause segment fault. Since we already have event pool, all those events will be destroyed properly when a plugin is destroyed. For those resources shared between `libomptarget` and plugins, `libomptarget` can only hold it, but not own it. That being said, we can only have explicit destruction.

tianshilei1992 marked an inline comment as done.Jan 4 2022, 4:53 PM

reuse an event instead of always creating one

simplify logic

Harbormaster completed remote builds in B141586: Diff 397429.Jan 4 2022, 5:11 PM

In D104418#3221061, @tianshilei1992 wrote:

In D104418#3220189, @ye-luo wrote:

I still think the new/old event exchange should be removed.
If T1 record an event for a D2H transfer and then T2 issues a H2D transfer and create an new event and then wait for the old event. the H2D or D2H may happen together when they are on two distinct streams.
This violates atomic data transfer behavior.
the solution is 1 persistent event per map and always check its status before issuing any transfer.

First, there is no event for D2H for now. This patch tries to make sure there is no race among H2D of the same memory from multiple target tasks, especially when they all read from that memory. D2H is out of the scope, and has nothing to do with the scenario mentioned in this patch. Without this patch, H2D and D2H can still happen at the same time on different streams.
Second, I think the case you describe is already data race caused by *user*. For two target tasks, if one task tries to move data in one direction while another task tries to move data in another direction, the two tasks should not run at the same time. That said, they should have dependency because one tries to read and another tries to write. Even with the "atomic" behavior you mentioned, which I didn't find it on the spec (maybe I missed it. could you pin point where it is?) it is still undetermined that which one comes first.
Third, using one event all the time sounds like a feasible solution. If an entry already has an event, we can reuse that one w/o the need to create a new one and destroy the old one, which can potentially save some overhead.

You are right. No need to worry about mixed H2D and D2H. Since the event is really meant for D2H, it is better to name it EventH2D.

Renaming Event to EventH2D and also the corresponding functions are my last request. The rest looks good.

This revision is now accepted and ready to land.Jan 4 2022, 6:03 PM

ye-luo added inline comments.Jan 4 2022, 6:28 PM

openmp/libomptarget/src/device.cpp
320	After the first transfer, every look up calls waitEvent which is still costly. It is better to return the even back to the pool and avoid waitEvent cost.

add clear comments on how Event should be used in current situation.

Harbormaster completed remote builds in B141593: Diff 397438.Jan 4 2022, 6:40 PM

ye-luo added inline comments.Jan 4 2022, 7:47 PM

openmp/libomptarget/src/device.cpp
320	I misunderstood waitEvent. It just enqueues the event to the stream. Need to find a better place to reduce the necessary waitEvent.

In D104418#3221205, @tianshilei1992 wrote:

add clear comments on how Event should be used in current situation.

That is good.

ye-luo added inline comments.Jan 4 2022, 8:12 PM

openmp/libomptarget/src/device.cpp
279	One more question. Should the event only protect IsNew or both IsNew and HasFlagAlways? I feel in the case of HasFlagAlways not IsNew, it is user's responsible to handle the ordering of multiple transfers.
320	I think waitEvent can be bypassed if the present modifier is true.

JonChesterfield added inline comments.Jan 5 2022, 2:02 AM

openmp/libomptarget/src/device.cpp
95	This would be solvable by calling dlclose on the plugins that have been opened by dlopen before libomptarget is destructed. That seems like a good idea independent of this patch - the plugin lifetime can be strictly nested within the lifetime of libomptarget.so

tianshilei1992 marked 4 inline comments as done.Jan 5 2022, 4:21 PM

tianshilei1992 added inline comments.

openmp/libomptarget/src/device.cpp
95	IIRC, I did try that, but that didn't solve the problem. We want to have a destructor function, which contains the `dlclose`, called at the end of the life cycle of `libomptarget`, even after all globals are destructed. That is not an easy task.
279	It should protect all because the race we are trying to fix in this patch is user transparent. Even two target tasks reading same object/variable/memory can trigger the race.

tianshilei1992 marked 2 inline comments as done.Jan 5 2022, 4:21 PM

minor optimization to bypass event if present modifier appears

tianshilei1992 marked an inline comment as done.Jan 5 2022, 5:04 PM

tianshilei1992 added inline comments.

openmp/libomptarget/src/device.cpp
320	`waitEvent` will not be triggered for entry with present modifier because its attached `event` will be empty, but we can save a lock.

tianshilei1992 marked an inline comment as done.Jan 5 2022, 5:04 PM

Harbormaster completed remote builds in B141802: Diff 397745.Jan 5 2022, 5:11 PM

This revision was landed with ongoing or failed builds.Jan 5 2022, 5:20 PM

Closed by commit rG9584c6fa2fe2: [OpenMP][Offloading] Fixed data race in libomptarget caused by async data… (authored by tianshilei1992). · Explain Why

This revision was automatically updated to reflect the committed changes.

tianshilei1992 added a commit: rG9584c6fa2fe2: [OpenMP][Offloading] Fixed data race in libomptarget caused by async data….

Hi Shilei
seems like this broke our amdgpu openmp buildbot
https://lab.llvm.org/buildbot/#/builders/193/builds/4072

In D104418#3224032, @ronlieb wrote:

Hi Shilei
seems like this broke our amdgpu openmp buildbot
https://lab.llvm.org/buildbot/#/builders/193/builds/4072

Yes. I'll fix it right away.

In D104418#3224032, @ronlieb wrote:

Hi Shilei
seems like this broke our amdgpu openmp buildbot
https://lab.llvm.org/buildbot/#/builders/193/builds/4072

Fixed it in https://reviews.llvm.org/D116716.

jdoerfert mentioned this in D117627: [OpenMP] Introduce an environment variable to disable atomic map clauses.Jan 18 2022, 5:02 PM

jdoerfert mentioned this in rG1e447d03e2f6: [OpenMP] Introduce an environment variable to disable atomic map clauses.Jan 19 2022, 8:15 PM

Revision Contents

Path

Size

openmp/

libomptarget/

src/

device.h

8 lines

device.cpp

73 lines

omptarget.cpp

32 lines

Diff 396368

openmp/libomptarget/src/device.h

//===----------- device.h - Target independent OpenMP target RTL ----------===//		//===----------- device.h - Target independent OpenMP target RTL ----------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// Declarations for managing devices that are handled by RTL plugins.		// Declarations for managing devices that are handled by RTL plugins.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef _OMPTARGET_DEVICE_H		#ifndef _OMPTARGET_DEVICE_H
#define _OMPTARGET_DEVICE_H		#define _OMPTARGET_DEVICE_H

#include <cassert>		#include <cassert>
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions unrelated tianshilei1992: unrelated
#include <cstddef>		#include <cstddef>
#include <list>		#include <list>
#include <map>		#include <map>
#include <memory>		#include <memory>
#include <mutex>		#include <mutex>
#include <set>		#include <set>
#include <vector>		#include <vector>

Show All 20 Lines	struct HostDataToTargetTy {
const uintptr_t HstPtrBase; // host info.		const uintptr_t HstPtrBase; // host info.
const uintptr_t HstPtrBegin;		const uintptr_t HstPtrBegin;
const uintptr_t HstPtrEnd; // non-inclusive.		const uintptr_t HstPtrEnd; // non-inclusive.
const map_var_info_t HstPtrName; // Optional source name of mapped variable.		const map_var_info_t HstPtrName; // Optional source name of mapped variable.

const uintptr_t TgtPtrBegin; // target info.		const uintptr_t TgtPtrBegin; // target info.

private:		private:
static const uint64_t INFRefCount = ~(uint64_t)0;		static const uint64_t INFRefCount = ~(uint64_t)0;
		JonChesterfieldUnsubmitted Done Reply Inline Actions 'volatile mutable' reads as 'buggy', why is this volatile or mutable? Also, why is this target thing exposed to the host runtime at all, instead of being purely within an async object? JonChesterfield: 'volatile mutable' reads as 'buggy', why is this volatile or mutable? Also, why is this target…
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions `mutable` is because `std::set::iterator` is `const`. It makes sense because we don't want to modify the key for the set. However, here `HstPtrBegin` is the key. In order to modify it via the iterator, we need to define it as `mutable`, or use `const_cast` when we want to modify it. tianshilei1992: `mutable` is because `std::set::iterator` is `const`. It makes sense because we don't want to…
		jdoerfertUnsubmitted Done Reply Inline Actions It has to be here. As discussed in our call today. You need a lock per entry in the mapping table. That said, I don't know why we want it to be volatile or mutable, the latter if we only get const references somewhere but that would be odd anyway. jdoerfert: It has to be here. As discussed in our call today. You need a lock per entry in the mapping…
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions We have to use `shared_ptr` here as this data structure has to be copied but `std::atomic` is un-copyable. tianshilei1992: We have to use `shared_ptr` here as this data structure has to be copied but `std::atomic` is…
		jdoerfertUnsubmitted Done Reply Inline Actions Why is this a shared ptr? jdoerfert: Why is this a shared ptr?
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions Because `unique_ptr` is non-copyable. tianshilei1992: Because `unique_ptr` is non-copyable.
		JonChesterfieldUnsubmitted Done Reply Inline Actions Could write a copy constructor. Also possible it only needs to be moved, not copied. Don't really want to add an ad hoc garbage collector (aka shared_ptr) to the existing complexity if we can avoid it, lifetimes are opaque enough already JonChesterfield: Could write a copy constructor. Also possible it only needs to be moved, not copied. Don't…
static std::string refCountToStr(uint64_t RefCount) {		static std::string refCountToStr(uint64_t RefCount) {
return RefCount == INFRefCount ? "INF" : std::to_string(RefCount);		return RefCount == INFRefCount ? "INF" : std::to_string(RefCount);
}		}

struct StatesTy {		struct StatesTy {
StatesTy(uint64_t DRC, uint64_t HRC)		StatesTy(uint64_t DRC, uint64_t HRC)
: DynRefCount(DRC), HoldRefCount(HRC) {}		: DynRefCount(DRC), HoldRefCount(HRC) {}
/// The dynamic reference count is the standard reference count as of OpenMP		/// The dynamic reference count is the standard reference count as of OpenMP
Show All 15 Lines	struct StatesTy {
uint64_t HoldRefCount;		uint64_t HoldRefCount;
/// This mutex will be locked when data movement is issued. For targets that		/// This mutex will be locked when data movement is issued. For targets that
/// doesn't support async data movement, this mutex can guarantee that after		/// doesn't support async data movement, this mutex can guarantee that after
/// it is released, memory region on the target is update to date. For		/// it is released, memory region on the target is update to date. For
/// targets that support async data movement, this can guarantee that data		/// targets that support async data movement, this can guarantee that data
/// movement has been issued. This mutex must be locked right before		/// movement has been issued. This mutex must be locked right before
/// releasing the mapping table lock.		/// releasing the mapping table lock.
std::mutex UpdateMtx;		std::mutex UpdateMtx;
		/// Pointer to the event corresponding to the data update of this map.
		void *Event = nullptr;
};		};
// When HostDataToTargetTy is used by std::set, std::set::iterator is const		// When HostDataToTargetTy is used by std::set, std::set::iterator is const
// use unique_ptr to make States mutable.		// use unique_ptr to make States mutable.
const std::unique_ptr<StatesTy> States;		const std::unique_ptr<StatesTy> States;

public:		public:
HostDataToTargetTy(uintptr_t BP, uintptr_t B, uintptr_t E, uintptr_t TB,		HostDataToTargetTy(uintptr_t BP, uintptr_t B, uintptr_t E, uintptr_t TB,
bool UseHoldRefCount, map_var_info_t Name = nullptr,		bool UseHoldRefCount, map_var_info_t Name = nullptr,
Show All 17 Lines	public:
}		}

/// Get the dynamic reference count.		/// Get the dynamic reference count.
uint64_t getDynRefCount() const { return States->DynRefCount; }		uint64_t getDynRefCount() const { return States->DynRefCount; }

/// Get the hold reference count.		/// Get the hold reference count.
uint64_t getHoldRefCount() const { return States->HoldRefCount; }		uint64_t getHoldRefCount() const { return States->HoldRefCount; }

		/// Get the event bound to this data map.
		void *getEvent() const { return States->Event; }

		/// Set the event bound to this data map.
		void setEvent(void *Event) const { States->Event = Event; }

/// Reset the specified reference count unless it's infinity. Reset to 1		/// Reset the specified reference count unless it's infinity. Reset to 1
/// (even if currently 0) so it can be followed by a decrement.		/// (even if currently 0) so it can be followed by a decrement.
void resetRefCount(bool UseHoldRefCount) const {		void resetRefCount(bool UseHoldRefCount) const {
uint64_t &ThisRefCount =		uint64_t &ThisRefCount =
UseHoldRefCount ? States->HoldRefCount : States->DynRefCount;		UseHoldRefCount ? States->HoldRefCount : States->DynRefCount;
if (ThisRefCount != INFRefCount)		if (ThisRefCount != INFRefCount)
ThisRefCount = 1;		ThisRefCount = 1;
}		}
▲ Show 20 Lines • Show All 284 Lines • Show Last 20 Lines

openmp/libomptarget/src/device.cpp

//===--------- device.cpp - Target independent OpenMP target RTL ----------===//		//===--------- device.cpp - Target independent OpenMP target RTL ----------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// Functionality for managing devices that are handled by RTL plugins.		// Functionality for managing devices that are handled by RTL plugins.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "device.h"		#include "device.h"
#include "private.h"		#include "private.h"
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions must be `clangd` automatic generated code tianshilei1992: must be `clangd` automatic generated code
#include "rtl.h"		#include "rtl.h"

#include <cassert>		#include <cassert>
#include <climits>		#include <climits>
#include <cstdio>		#include <cstdio>
#include <string>		#include <string>

DeviceTy::DeviceTy(RTLInfoTy *RTL)		DeviceTy::DeviceTy(RTLInfoTy *RTL)
▲ Show 20 Lines • Show All 49 Lines • ▼ Show 20 Lines	DP("Creating new map entry: HstBase=" DPxMOD ", HstBegin=" DPxMOD
newEntry.dynRefCountToStr().c_str(), newEntry.holdRefCountToStr().c_str());		newEntry.dynRefCountToStr().c_str(), newEntry.holdRefCountToStr().c_str());
(void)newEntry;		(void)newEntry;

DataMapMtx.unlock();		DataMapMtx.unlock();

return OFFLOAD_SUCCESS;		return OFFLOAD_SUCCESS;
}		}

int DeviceTy::disassociatePtr(void *HstPtrBegin) {		int DeviceTy::disassociatePtr(void *HstPtrBegin) {
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions The change of the function signature would not be required if we don't need to pass `AsyncInfo` to the plugin. Potentially we can wrap the native event into a struct and the object will call `destroyEvent` in its destructor. tianshilei1992: The change of the function signature would not be required if we don't need to pass `AsyncInfo`…
DataMapMtx.lock();		DataMapMtx.lock();

auto search = HostDataToTargetMap.find(HstPtrBeginTy{(uintptr_t)HstPtrBegin});		auto search = HostDataToTargetMap.find(HstPtrBeginTy{(uintptr_t)HstPtrBegin});
if (search != HostDataToTargetMap.end()) {		if (search != HostDataToTargetMap.end()) {
// Mapping exists		// Mapping exists
if (search->getHoldRefCount()) {		if (search->getHoldRefCount()) {
// This is based on OpenACC 3.1, sec 3.2.33 "acc_unmap_data", L3656-3657:		// This is based on OpenACC 3.1, sec 3.2.33 "acc_unmap_data", L3656-3657:
// "It is an error to call acc_unmap_data if the structured reference		// "It is an error to call acc_unmap_data if the structured reference
// count for the pointer is not zero."		// count for the pointer is not zero."
REPORT("Trying to disassociate a pointer with a non-zero hold reference "		REPORT("Trying to disassociate a pointer with a non-zero hold reference "
"count\n");		"count\n");
} else if (search->isDynRefCountInf()) {		} else if (search->isDynRefCountInf()) {
DP("Association found, removing it\n");		DP("Association found, removing it\n");
		void *Event = search->getEvent();
		if (Event)
		ye-luoUnsubmitted Done Reply Inline Actions Use unique_ptr with Deleter for Event? ye-luo: Use unique_ptr with Deleter for Event?
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions That sounds reasonable but actually not practical. Same as what I said before, we can't have automatic destroy of the event because when `libomptarget` starts to destroy, we don't know whether the plugin is already dead. Calling to plugin may cause segment fault. Since we already have event pool, all those events will be destroyed properly when a plugin is destroyed. For those resources shared between `libomptarget` and plugins, `libomptarget` can only hold it, but not own it. That being said, we can only have explicit destruction. tianshilei1992: That sounds reasonable but actually not practical. Same as what I said before, we can't have…
		JonChesterfieldUnsubmitted Done Reply Inline Actions This would be solvable by calling dlclose on the plugins that have been opened by dlopen before libomptarget is destructed. That seems like a good idea independent of this patch - the plugin lifetime can be strictly nested within the lifetime of libomptarget.so JonChesterfield: This would be solvable by calling dlclose on the plugins that have been opened by dlopen before…
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions IIRC, I did try that, but that didn't solve the problem. We want to have a destructor function, which contains the `dlclose`, called at the end of the life cycle of `libomptarget`, even after all globals are destructed. That is not an easy task. tianshilei1992: IIRC, I did try that, but that didn't solve the problem. We want to have a destructor function…
		destroyEvent(Event);
HostDataToTargetMap.erase(search);		HostDataToTargetMap.erase(search);
DataMapMtx.unlock();		DataMapMtx.unlock();
return OFFLOAD_SUCCESS;		return OFFLOAD_SUCCESS;
} else {		} else {
		ye-luoUnsubmitted Done Reply Inline Actions I feel it is better to have a wrapper wraps the raw Event and handles destroyEvent at the destructor. Once HostDataToTargetMap destructor got called, all the Events are leaked. ye-luo: I feel it is better to have a wrapper wraps the raw Event and handles destroyEvent at the…
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions `HostDataToTargetMap` is destroyed along with `libomptarget`. Because of the way `libomptarget` and plugins are connected, plugins could already have been destroyed at that moment, and if that is the case, it is undefined behavior to call functions in plugins. D111954 can make sure the events are correctly released. tianshilei1992: `HostDataToTargetMap` is destroyed along with `libomptarget`. Because of the way `libomptarget`…
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions Now we have event pool in plugin so it will not be an issue if we leave one event in `libomptarget`. tianshilei1992: Now we have event pool in plugin so it will not be an issue if we leave one event in…
REPORT("Trying to disassociate a pointer which was not mapped via "		REPORT("Trying to disassociate a pointer which was not mapped via "
"omp_target_associate_ptr\n");		"omp_target_associate_ptr\n");
}		}
} else {		} else {
REPORT("Association not found\n");		REPORT("Association not found\n");
}		}

// Mapping not found		// Mapping not found
▲ Show 20 Lines • Show All 43 Lines • ▼ Show 20 Lines	if (lr.Flags.ExtendsAfter) {
DP("WARNING: Pointer is already mapped but section extends beyond mapped "		DP("WARNING: Pointer is already mapped but section extends beyond mapped "
"region\n");		"region\n");
}		}

return lr;		return lr;
}		}

TargetPointerResultTy		TargetPointerResultTy
DeviceTy::getTargetPointer(void HstPtrBegin, void HstPtrBase, int64_t Size,		DeviceTy::getTargetPointer(void HstPtrBegin, void HstPtrBase, int64_t Size,
map_var_info_t HstPtrName, bool HasFlagTo,		map_var_info_t HstPtrName, bool HasFlagTo,
bool HasFlagAlways, bool IsImplicit,		bool HasFlagAlways, bool IsImplicit,
bool UpdateRefCount, bool HasCloseModifier,		bool UpdateRefCount, bool HasCloseModifier,
bool HasPresentModifier, bool HasHoldModifier,		bool HasPresentModifier, bool HasHoldModifier,
AsyncInfoTy &AsyncInfo) {		AsyncInfoTy &AsyncInfo) {
		protze.joachimUnsubmitted Done Reply Inline Actions Which patch removes the `bool &IsHostPtr,` argument? I cannot see this change in any of the patches in the stack or on main. This patch fails to apply after applying the other two patches in the stack. protze.joachim: Which patch removes the `bool &IsHostPtr,` argument? I cannot see this change in any of the…
void *TargetPointer = nullptr;		void *TargetPointer = nullptr;
bool IsHostPtr = false;		bool IsHostPtr = false;
bool IsNew = false;		bool IsNew = false;

DataMapMtx.lock();		DataMapMtx.lock();

LookupResult LR = lookupMapping(HstPtrBegin, Size);		LookupResult LR = lookupMapping(HstPtrBegin, Size);
auto Entry = LR.Entry;		auto Entry = LR.Entry;
▲ Show 20 Lines • Show All 87 Lines • ▼ Show 20 Lines	if (TargetPointer && !IsHostPtr && HasFlagTo && (IsNew \|\| HasFlagAlways)) {
// thread that could issue data movement will get the right result.		// thread that could issue data movement will get the right result.
Entry->lock();		Entry->lock();
// Release the mapping table lock right after the entry is locked.		// Release the mapping table lock right after the entry is locked.
DataMapMtx.unlock();		DataMapMtx.unlock();

DP("Moving %" PRId64 " bytes (hst:" DPxMOD ") -> (tgt:" DPxMOD ")\n", Size,		DP("Moving %" PRId64 " bytes (hst:" DPxMOD ") -> (tgt:" DPxMOD ")\n", Size,
DPxPTR(HstPtrBegin), DPxPTR(TargetPointer));		DPxPTR(HstPtrBegin), DPxPTR(TargetPointer));

int Ret = submitData(TargetPointer, HstPtrBegin, Size, AsyncInfo);		int Ret = submitData(TargetPointer, HstPtrBegin, Size, AsyncInfo);
		ye-luoUnsubmitted Done Reply Inline Actions I think the purpose of introducing this data transfer event is to achieve the atomic transfer behavior. So it is unsafe to issue transfer right away without checking the status of Event. If a map(always comes from a different thread. ye-luo: I think the purpose of introducing this data transfer event is to achieve the atomic transfer…
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions The event has to be created after the issue. At that moment, the entry might not have any associated event. tianshilei1992: The event has to be created after the issue. At that moment, the entry might not have any…

// Unlock the entry immediately after the data movement is issued.
Entry->unlock();

if (Ret != OFFLOAD_SUCCESS) {		if (Ret != OFFLOAD_SUCCESS) {
		Entry->unlock();
REPORT("Copying data to device failed.\n");		REPORT("Copying data to device failed.\n");
// We will also return nullptr if the data movement fails because that		// We will also return nullptr if the data movement fails because that
// pointer points to a corrupted memory region so it doesn't make any		// pointer points to a corrupted memory region so it doesn't make any
// sense to continue to use it.		// sense to continue to use it.
TargetPointer = nullptr;		TargetPointer = nullptr;
}		}

		void *Event = nullptr;
		ye-luoUnsubmitted Done Reply Inline Actions One more question. Should the event only protect IsNew or both IsNew and HasFlagAlways? I feel in the case of HasFlagAlways not IsNew, it is user's responsible to handle the ordering of multiple transfers. ye-luo: One more question. Should the event only protect IsNew or both IsNew and HasFlagAlways? I feel…
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions It should protect all because the race we are trying to fix in this patch is user transparent. Even two target tasks reading same object/variable/memory can trigger the race. tianshilei1992: It should protect all because the race we are trying to fix in this patch is user transparent.
		Ret = createEvent(&Event);
		if (Ret != OFFLOAD_SUCCESS) {
		jdoerfertUnsubmitted Done Reply Inline Actions Should we skip the event stuff if we failed? jdoerfert: Should we skip the event stuff if we failed?
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions It's different design philosophy. Either is fine to me, but arguably if a target supports event and related interfaces return error, I don't think we should continue (skip). tianshilei1992: It's different design philosophy. Either is fine to me, but arguably if a target supports event…
		Entry->unlock();
		REPORT("Failed to create event\n");
		return {{false /* IsNewEntry /, false / IsHostPointer */},
		{} /* MapTableEntry */,
		nullptr /* TargetPointer */};
		}

		// We cannot assume the event should not be nullptr because we don't
		ye-luoUnsubmitted Done Reply Inline Actions The current event never got destroyed unitl the next transfer. event doesn't got properly destroyed. almost like one event per map. ye-luo: The current event never got destroyed unitl the next transfer. 1. event doesn't got properly…
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions For your first comment, yes, and now it is fixed by an ugly way. A nicer solution would be related to the API design in D108528where which we can discuss. The question is whether we want to pass async info to the plugin for event creation, synchronization, and destroy. For the second point, it's a compatibility consideration, that is if there is a platform that doesn't allow event to be reused after it is fulfilled. I don't know. tianshilei1992: For your first comment, yes, and now it is fixed by an ugly way. A nicer solution would be…
		ye-luoUnsubmitted Done Reply Inline Actions For the second point, it's a compatibility consideration, that is if there is a platform that doesn't allow event to be reused after it is fulfilled. I don't know. We don't have such a platform to worry about. If there is one, we just mark the platform as not supporting events that we expect. ye-luo: > For the second point, it's a compatibility consideration, that is if there is a platform that…
		// know if the target support event. But if a target doesn't,
		// recordEvent should always return success.
		Ret = recordEvent(Event, AsyncInfo);
		if (Ret != OFFLOAD_SUCCESS) {
		Entry->unlock();
		REPORT("Failed to set dependence on event " DPxMOD "\n", DPxPTR(Event));
		return {{false /* IsNewEntry /, false / IsHostPointer */},
		{} /* MapTableEntry */,
		nullptr /* TargetPointer */};
		}
		void *OldEvent = Entry->getEvent();
		Entry->setEvent(Event);
		// We're done with the entry. Release the entry.
		Entry->unlock();
		// If there is an event attached, destroy it.
		if (OldEvent) {
		Ret = destroyEvent(OldEvent);
		if (Ret != OFFLOAD_SUCCESS)
		return {{false /* IsNewEntry /, false / IsHostPointer */},
		{} /* MapTableEntry */,
		nullptr /* TargetPointer */};
		ye-luoUnsubmitted Done Reply Inline Actions Missing return error check ye-luo: Missing return error check
		}
} else {		} else {
// Release the mapping table lock directly.		// Release the mapping table lock directly.
DataMapMtx.unlock();		DataMapMtx.unlock();
		// If not a host pointer, we need to wait for the event if it exists.
		if (!IsHostPtr) {
		Entry->lock();
		void *Event = Entry->getEvent();
		ye-luoUnsubmitted Done Reply Inline Actions after this point. If another thread changes Entry->Event and destroyed the old event. The current copied Event becomes invalid. ye-luo: after this point. If another thread changes Entry->Event and destroyed the old event. The…
		if (Event) {
		int Ret = waitEvent(Event, AsyncInfo);
		ye-luoUnsubmitted Done Reply Inline Actions After the first transfer, every look up calls waitEvent which is still costly. It is better to return the even back to the pool and avoid waitEvent cost. ye-luo: After the first transfer, every look up calls waitEvent which is still costly. It is better to…
		ye-luoUnsubmitted Done Reply Inline Actions I misunderstood waitEvent. It just enqueues the event to the stream. Need to find a better place to reduce the necessary waitEvent. ye-luo: I misunderstood waitEvent. It just enqueues the event to the stream. Need to find a better…
		ye-luoUnsubmitted Done Reply Inline Actions I think waitEvent can be bypassed if the present modifier is true. ye-luo: I think waitEvent can be bypassed if the present modifier is true.
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions `waitEvent` will not be triggered for entry with present modifier because its attached `event` will be empty, but we can save a lock. tianshilei1992: `waitEvent` will not be triggered for entry with present modifier because its attached `event`…
		Entry->unlock();
		if (Ret != OFFLOAD_SUCCESS) {
		// If it fails to wait for the event, we need to return nullptr in
		// case of any data race.
		REPORT("Failed to wait for event " DPxMOD ".\n", DPxPTR(Event));
		return {{false /* IsNewEntry /, false / IsHostPointer */},
		{} /* MapTableEntry */,
		nullptr /* TargetPointer */};
		}
		} else {
		Entry->unlock();
		}
		}
}		}

return {{IsNew, IsHostPtr}, Entry, TargetPointer};		return {{IsNew, IsHostPtr}, Entry, TargetPointer};
}		}

// Used by targetDataBegin, targetDataEnd, targetDataUpdate and target.		// Used by targetDataBegin, targetDataEnd, targetDataUpdate and target.
// Return the target pointer begin (where the data will be moved).		// Return the target pointer begin (where the data will be moved).
// Decrement the reference counter if called from targetDataEnd.		// Decrement the reference counter if called from targetDataEnd.
▲ Show 20 Lines • Show All 71 Lines • ▼ Show 20 Lines	void DeviceTy::getTgtPtrBegin(void HstPtrBegin, int64_t Size) {
}		}

return NULL;		return NULL;
}		}

int DeviceTy::deallocTgtPtr(void *HstPtrBegin, int64_t Size,		int DeviceTy::deallocTgtPtr(void *HstPtrBegin, int64_t Size,
bool HasHoldModifier) {		bool HasHoldModifier) {
// Check if the pointer is contained in any sub-nodes.		// Check if the pointer is contained in any sub-nodes.
int rc;		int Ret = OFFLOAD_SUCCESS;
DataMapMtx.lock();		DataMapMtx.lock();
LookupResult lr = lookupMapping(HstPtrBegin, Size);		LookupResult lr = lookupMapping(HstPtrBegin, Size);
if (lr.Flags.IsContained \|\| lr.Flags.ExtendsBefore \|\| lr.Flags.ExtendsAfter) {		if (lr.Flags.IsContained \|\| lr.Flags.ExtendsBefore \|\| lr.Flags.ExtendsAfter) {
auto &HT = *lr.Entry;		auto &HT = *lr.Entry;
if (HT.decRefCount(HasHoldModifier) == 0) {		if (HT.decRefCount(HasHoldModifier) == 0) {
DP("Deleting tgt data " DPxMOD " of size %" PRId64 "\n",		DP("Deleting tgt data " DPxMOD " of size %" PRId64 "\n",
DPxPTR(HT.TgtPtrBegin), Size);		DPxPTR(HT.TgtPtrBegin), Size);
deleteData((void *)HT.TgtPtrBegin);		deleteData((void *)HT.TgtPtrBegin);
INFO(OMP_INFOTYPE_MAPPING_CHANGED, DeviceID,		INFO(OMP_INFOTYPE_MAPPING_CHANGED, DeviceID,
"Removing map entry with HstPtrBegin=" DPxMOD ", TgtPtrBegin=" DPxMOD		"Removing map entry with HstPtrBegin=" DPxMOD ", TgtPtrBegin=" DPxMOD
", Size=%" PRId64 ", Name=%s\n",		", Size=%" PRId64 ", Name=%s\n",
DPxPTR(HT.HstPtrBegin), DPxPTR(HT.TgtPtrBegin), Size,		DPxPTR(HT.HstPtrBegin), DPxPTR(HT.TgtPtrBegin), Size,
(HT.HstPtrName) ? getNameFromMapping(HT.HstPtrName).c_str()		(HT.HstPtrName) ? getNameFromMapping(HT.HstPtrName).c_str()
: "unknown");		: "unknown");
		void *Event = lr.Entry->getEvent();
HostDataToTargetMap.erase(lr.Entry);		HostDataToTargetMap.erase(lr.Entry);
		if (Event && destroyEvent(Event) != OFFLOAD_SUCCESS) {
		REPORT("Failed to destroy event " DPxMOD "\n", DPxPTR(Event));
		Ret = OFFLOAD_FAIL;
		}
}		}
rc = OFFLOAD_SUCCESS;
} else {		} else {
REPORT("Section to delete (hst addr " DPxMOD ") does not exist in the"		REPORT("Section to delete (hst addr " DPxMOD ") does not exist in the"
" allocated memory\n",		" allocated memory\n",
DPxPTR(HstPtrBegin));		DPxPTR(HstPtrBegin));
rc = OFFLOAD_FAIL;		Ret = OFFLOAD_FAIL;
}		}

DataMapMtx.unlock();		DataMapMtx.unlock();
return rc;		return Ret;
}		}

/// Init device, should not be called directly.		/// Init device, should not be called directly.
void DeviceTy::init() {		void DeviceTy::init() {
// Make call to init_requires if it exists for this plugin.		// Make call to init_requires if it exists for this plugin.
if (RTL->init_requires)		if (RTL->init_requires)
RTL->init_requires(PM->RTLs.RequiresFlags);		RTL->init_requires(PM->RTLs.RequiresFlags);
int32_t Ret = RTL->init_device(RTLDeviceID);		int32_t Ret = RTL->init_device(RTLDeviceID);
▲ Show 20 Lines • Show All 212 Lines • Show Last 20 Lines

openmp/libomptarget/src/omptarget.cpp

Show First 20 Lines • Show All 495 Lines • ▼ Show 20 Lines	if (arg_types[i] & OMP_TGT_MAPTYPE_PTR_AND_OBJ) {
HasCloseModifier, HasPresentModifier, HasHoldModifier, AsyncInfo);		HasCloseModifier, HasPresentModifier, HasHoldModifier, AsyncInfo);
PointerTgtPtrBegin = Pointer_TPR.TargetPointer;		PointerTgtPtrBegin = Pointer_TPR.TargetPointer;
IsHostPtr = Pointer_TPR.Flags.IsHostPointer;		IsHostPtr = Pointer_TPR.Flags.IsHostPointer;
if (!PointerTgtPtrBegin) {		if (!PointerTgtPtrBegin) {
REPORT("Call to getTargetPointer returned null pointer (%s).\n",		REPORT("Call to getTargetPointer returned null pointer (%s).\n",
HasPresentModifier ? "'present' map type modifier"		HasPresentModifier ? "'present' map type modifier"
: "device failure or illegal mapping");		: "device failure or illegal mapping");
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}
DP("There are %zu bytes allocated at target address " DPxMOD " - is%s new"		DP("There are %zu bytes allocated at target address " DPxMOD " - is%s new"
"\n",		"\n",
sizeof(void *), DPxPTR(PointerTgtPtrBegin),		sizeof(void *), DPxPTR(PointerTgtPtrBegin),
(Pointer_TPR.Flags.IsNewEntry ? "" : " not"));		(Pointer_TPR.Flags.IsNewEntry ? "" : " not"));
Pointer_HstPtrBegin = HstPtrBase;		Pointer_HstPtrBegin = HstPtrBase;
protze.joachimUnsubmitted Done Reply Inline Actions This probably shouldn't be here? protze.joachim: This probably shouldn't be here?
tianshilei1992AuthorUnsubmitted Done Reply Inline Actions yeah, it's a rebase error. I'll reorg the patch stack. tianshilei1992: yeah, it's a rebase error. I'll reorg the patch stack.
// modify current entry.		// modify current entry.
HstPtrBase = (void *)HstPtrBase;		HstPtrBase = (void *)HstPtrBase;
// No need to update pointee ref count for the first element of the		// No need to update pointee ref count for the first element of the
// subelement that comes from mapper.		// subelement that comes from mapper.
UpdateRef =		UpdateRef =
(!FromMapper \|\| i != 0); // subsequently update ref count of pointee		(!FromMapper \|\| i != 0); // subsequently update ref count of pointee
}		}

▲ Show 20 Lines • Show All 58 Lines • ▼ Show 20 Lines	if (arg_types[i] & OMP_TGT_MAPTYPE_PTR_AND_OBJ && !IsHostPtr) {
Device.ShadowMtx.unlock();		Device.ShadowMtx.unlock();

DP("Update pointer (" DPxMOD ") -> [" DPxMOD "]\n",		DP("Update pointer (" DPxMOD ") -> [" DPxMOD "]\n",
DPxPTR(PointerTgtPtrBegin), DPxPTR(TgtPtrBegin));		DPxPTR(PointerTgtPtrBegin), DPxPTR(TgtPtrBegin));

void *&TgtPtrBase = AsyncInfo.getVoidPtrLocation();		void *&TgtPtrBase = AsyncInfo.getVoidPtrLocation();
TgtPtrBase = ExpectedTgtPtrBase;		TgtPtrBase = ExpectedTgtPtrBase;

int rt = Device.submitData(PointerTgtPtrBegin, &TgtPtrBase,		int Ret = Device.submitData(PointerTgtPtrBegin, &TgtPtrBase,
sizeof(void *), AsyncInfo);		sizeof(void *), AsyncInfo);
		if (Ret != OFFLOAD_SUCCESS) {
Pointer_TPR.MapTableEntry->unlock();		Pointer_TPR.MapTableEntry->unlock();

if (rt != OFFLOAD_SUCCESS) {
REPORT("Copying data to device failed.\n");		REPORT("Copying data to device failed.\n");
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}
		void *Event = nullptr;
		if (Device.createEvent(&Event) != OFFLOAD_SUCCESS) {
		Pointer_TPR.MapTableEntry->unlock();
		REPORT("Failed to create event.\n");
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions But with D104555, these whole bunch of code will be moved into `getTargetPointer`. tianshilei1992: But with D104555, these whole bunch of code will be moved into `getTargetPointer`.
		return OFFLOAD_FAIL;
		}
		// We cannot assume the event should not be nullptr because we don't
		// know if the target support event. But if a target doesn't,
		// recordEvent should always return success.
		grokosUnsubmitted Done Reply Inline Actions This code is a direct duplicate of the logic inside `getTargetPointer`. Can we factor it out into a new helper function? grokos: This code is a direct duplicate of the logic inside `getTargetPointer`. Can we factor it out…
		Ret = Device.recordEvent(Event, AsyncInfo);
		if (Ret != OFFLOAD_SUCCESS) {
		jdoerfertUnsubmitted Done Reply Inline Actions Worth a helper to avoid duplication. jdoerfert: Worth a helper to avoid duplication.
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions Yes, this piece of code is kind of redundant. However, there are different locks intervened, if we abstract it to another helper function, it could make the code more difficult to understand because we could find mutex are locked in caller and unlocked in callee. I don't think it's a good practice. It fairly opens a door to misuse of locks. tianshilei1992: Yes, this piece of code is kind of redundant. However, there are different locks intervened, if…
		Pointer_TPR.MapTableEntry->unlock();
		REPORT("Failed to set dependence on event " DPxMOD "\n",
		JonChesterfieldUnsubmitted Done Reply Inline Actions Volatile != atomic. Also, I thought the point of this was to do sync on the target, so why is the host busy waiting on something? JonChesterfield: Volatile != atomic. Also, I thought the point of this was to do sync on the target, so why is…
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions There can still be gap between creating an entry in map table and issuing the data movement. tianshilei1992: There can still be gap between creating an entry in map table and issuing the data movement.
		DPxPTR(Event));
		return OFFLOAD_FAIL;
		}
		// Exchange the old event with new created event
		void *OldEvent = Pointer_TPR.MapTableEntry->getEvent();
		Pointer_TPR.MapTableEntry->setEvent(Event);
		Pointer_TPR.MapTableEntry->unlock();
		// If the old event is not null, we need to destroy it.
		if (OldEvent)
		Device.destroyEvent(OldEvent);
} else		} else
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions Here is very critical. Expect opinions. tianshilei1992: Here is very critical. Expect opinions.
		jdoerfertUnsubmitted Done Reply Inline Actions I don't get why we need this. At the moment at which we are looking at the Entry I would expect it has an event or not, but neither is cause for waiting. When a new entry is created it should only be exposed to others once the event has been setup as well. Plugins that don't need events can just use nullptr without it being a problem. If this is not a problem, I suspect we want a separate flag in the entry for the status. The event should be allows to be null, the flag can be "in_progress" which indicates the rest needs to wait. jdoerfert: I don't get why we need this. At the moment at which we are looking at the Entry I would expect…
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions This can be removed when this whole bunch of thing is moved to the new function in D104555, otherwise it can still happen that the new entry is created but data movement has not been issued because this function is not guarded by the lock. tianshilei1992: This can be removed when this whole bunch of thing is moved to the new function in D104555…
Device.ShadowMtx.unlock();		Device.ShadowMtx.unlock();
}		}
}		}

return OFFLOAD_SUCCESS;		return OFFLOAD_SUCCESS;
}		}

namespace {		namespace {
▲ Show 20 Lines • Show All 888 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][Offloading] Fixed data race in libomptarget caused by async data movementClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 396368

openmp/libomptarget/src/device.h

openmp/libomptarget/src/device.cpp

openmp/libomptarget/src/omptarget.cpp

[OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement
ClosedPublic