
[OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement
ClosedPublic

Authored by tianshilei1992 on Jun 16 2021, 1:24 PM.

Details

Summary

Asynchronous data movement can cause a data race if the target supports it.
Details can be found in [1]. This patch fixes the problem by attaching
an event to each entry of the data mapping table. Here are the details.

For each issued data movement, a new event is generated and returned to libomptarget
by calling createEvent. The event is attached to the corresponding mapping table
entry.

For each data mapping lookup, if no data movement is needed, the attached event
is inserted into the queue to guarantee that all following operations in the
queue execute only after the event is fulfilled.

This design avoids synchronization on the host side.
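As a rough illustration of the flow described above (all names here, such as MappingEntry, submitDataMovement, and lookupExistingMapping, are hypothetical stand-ins rather than the actual libomptarget API, and device-side completion is simulated synchronously):

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the real libomptarget types.
struct Event { bool Fulfilled = false; };
struct AsyncQueue { /* stands in for a device stream */ };

struct MappingEntry {
  void *HstPtr = nullptr;
  void *TgtPtr = nullptr;
  std::shared_ptr<Event> Ev; // event attached to this mapping entry
};

// For each issued data movement: create an event and attach it to the entry.
void submitDataMovement(MappingEntry &E, AsyncQueue &/*Q*/) {
  E.Ev = std::make_shared<Event>(); // createEvent, then recordEvent
  // ... the actual copy would be enqueued on the queue here; the device
  // would fulfill the event asynchronously once the copy finishes ...
  E.Ev->Fulfilled = true; // simulated completion
}

// For each lookup that needs no data movement: order the queue behind the
// attached event so later operations wait for the copy, without blocking
// the host.
bool lookupExistingMapping(const MappingEntry &E, AsyncQueue &/*Q*/) {
  if (E.Ev)
    return E.Ev->Fulfilled; // a real waitEvent(Q, Ev) would enqueue a wait
  return true; // no event attached: nothing to order against
}
```

The key property is that waitEvent only establishes ordering inside the device queue; the host thread never blocks.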

Note that we are using CUDA terminology here. A similar mechanism is assumed to
be supported by other targets. Even if a target doesn't support it, it can
easily be implemented in the following fallback way:

  • An event can be any kind of flag with at least two states, 0 and 1.
  • waitEvent can simply busy-loop while the event is still 0.
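A minimal sketch of that fallback, assuming nothing beyond an atomic flag (the names FlagEvent, fulfillEvent, and waitEvent are illustrative):

```cpp
#include <atomic>
#include <cassert>

// Fallback "event": a flag with two states, 0 (pending) and 1 (fulfilled).
struct FlagEvent {
  std::atomic<int> State{0};
};

// Mark the event as fulfilled (would be done when the transfer completes).
void fulfillEvent(FlagEvent &E) {
  E.State.store(1, std::memory_order_release);
}

// waitEvent: busy-loop while the event is still 0.
void waitEvent(FlagEvent &E) {
  while (E.State.load(std::memory_order_acquire) == 0) {
    // spin until fulfilled
  }
}
```

The acquire/release pair ensures that writes made before fulfilling the event are visible to the waiter after the loop exits.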

My local test shows that bug49334.cpp can pass.

Reference:
[1] https://bugs.llvm.org/show_bug.cgi?id=49940

Diff Detail

Event Timeline

This revision now requires changes to proceed. Aug 20 2021, 7:55 AM

There is a performance hit from frequently creating/destroying events at every transfer. The create/destroy logic can be separated from recording events.

What an event is and how it is handled is plugin-specific. createEvent and destroyEvent can be implemented e.g. via a pool of events where an event will be picked upon createEvent and returned back to the pool upon destroyEvent (similar to what we do for streams in the CUDA plugin). So these performance considerations can be fixed in a subsequent patch. Right now there is a race condition - a correctness concern - that we should fix. At the very least, since the logic in the base library is correct and we pretty much agree on it, we should let this patch land without the CUDA plugin changes; then we can provide optimized implementations for the __tgt_rtl_*_event functions for the plugins we care about in future patches.
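The pooling idea could be sketched as below; EventPool and NativeEvent are hypothetical names modeled on the stream-pool approach mentioned for the CUDA plugin, not an existing API:

```cpp
#include <cassert>
#include <mutex>
#include <vector>

struct NativeEvent { int Id; };

// Hypothetical event pool: create() hands out a cached event if one is
// available, and destroy() returns it to the pool instead of freeing it.
class EventPool {
  std::mutex Mtx;
  std::vector<NativeEvent *> Free;
  int NextId = 0;

public:
  NativeEvent *create() {
    std::lock_guard<std::mutex> G(Mtx);
    if (!Free.empty()) {
      NativeEvent *E = Free.back();
      Free.pop_back();
      return E; // fast path: reuse a pooled event, no allocation
    }
    return new NativeEvent{NextId++}; // slow path: allocate a new one
  }

  void destroy(NativeEvent *E) {
    std::lock_guard<std::mutex> G(Mtx);
    Free.push_back(E); // return to the pool instead of freeing
  }

  ~EventPool() {
    for (NativeEvent *E : Free)
      delete E;
  }
};
```

With something like this behind __tgt_rtl_create_event/__tgt_rtl_destroy_event, the per-transfer overhead becomes a pool lookup without changing the base library's logic.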

Now Event is stored in HostDataToTargetTy. It is not plugin-specific logic.

Consider the issues in the current implementation I just found; I have a strong feeling that the current fix should not get in.
I think we should just create one event per HostDataToTargetTy in the constructor and use the same event for a given mapped region.

In order to avoid blocking the good pieces in this patch, it is better to move all the plugin API and CUDA plugin changes to a separate patch.
createEvent should also be split into createEvent and recordEvent, with each API doing a well-defined, minimal unit of work.

Using a pool of events is an optimization we may try at a later stage. Right now, I'm more concerned about the issue I found above than about the performance impact.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
1362 ↗(On Diff #364676)

Why does waiting for an event need to pull in a stream?
cuEventSynchronize

1379 ↗(On Diff #364676)

AsyncInfo is not needed. If the stream has been returned to the pool, you just print nullptr all the time.

openmp/libomptarget/src/device.cpp
269

I think the purpose of introducing this data-transfer event is to achieve atomic transfer behavior.
So it is unsafe to issue a transfer right away without checking the status of the Event,
if a map(always) comes from a different thread.

289

The current event never gets destroyed until the next transfer.

  1. The event doesn't get properly destroyed.
  2. It's almost like one event per map.
tianshilei1992 marked 2 inline comments as done. Aug 22 2021, 7:43 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
269

The event has to be created after the transfer is issued. At that moment, the entry might not have an associated event yet.

289

For your first comment: yes, and now it is fixed in an ugly way. A nicer solution would be related to the API design in D108528, where we can discuss it. The question is whether we want to pass async info to the plugin for event creation, synchronization, and destruction.

For the second point, it's a compatibility consideration: there might be a platform that doesn't allow an event to be reused after it is fulfilled. I don't know.

tianshilei1992 marked 2 inline comments as done. Aug 22 2021, 7:43 PM
tianshilei1992 added inline comments. Aug 22 2021, 7:46 PM
openmp/libomptarget/src/device.cpp
80

The change of the function signature would not be required if we didn't need to pass AsyncInfo to the plugin. Potentially we could wrap the native event in a struct whose destructor calls destroyEvent.

jdoerfert added inline comments. Aug 23 2021, 11:42 AM
openmp/libomptarget/src/device.cpp
281

Should we skip the event stuff if we failed?

fix rebase errors

remove unrelated changes

ye-luo added inline comments. Aug 31 2021, 2:09 PM
openmp/libomptarget/src/device.cpp
100

I feel it is better to have a wrapper that wraps the raw Event and handles destroyEvent in its destructor.

Once the HostDataToTargetMap destructor is called, all the Events are leaked.

289

For the second point, it's a compatibility consideration: there might be a platform that doesn't allow an event to be reused after it is fulfilled. I don't know.

We don't have such a platform to worry about. If one shows up, we just mark it as not supporting the events we expect.

310

Missing error check on the return value.

318

After this point, if another thread changes Entry->Event and destroys the old event,
the currently copied Event becomes invalid.

tianshilei1992 marked an inline comment as done. Oct 16 2021, 4:41 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
100

HostDataToTargetMap is destroyed along with libomptarget. Because of the way libomptarget and the plugins are connected, the plugins could already have been destroyed at that moment, and in that case it is undefined behavior to call functions in them. D111954 makes sure the events are correctly released.

tianshilei1992 marked an inline comment as done.

rebase and fix comments

tianshilei1992 marked 6 inline comments as done. Dec 27 2021, 8:32 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
281

It's a different design philosophy. Either is fine with me, but arguably, if a target supports events and the related interfaces return an error, I don't think we should continue (skip).

openmp/libomptarget/src/omptarget.cpp
601

Yes, this piece of code is somewhat redundant. However, different locks are involved, and if we abstract it into another helper function, the code could become harder to understand because mutexes would be locked in the caller and unlocked in the callee. I don't think that's good practice; it opens the door to lock misuse.

tianshilei1992 marked 2 inline comments as done. Dec 27 2021, 8:33 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
100

Now we have event pool in plugin so it will not be an issue if we leave one event in libomptarget.

tianshilei1992 edited the summary of this revision. Dec 31 2021, 8:55 AM
JonChesterfield accepted this revision. Jan 4 2022, 10:25 AM

Fairly scary code but I think it's an improvement. Thanks for addressing the previous comments.

I still think the new/old event exchange should be removed.
If T1 records an event for a D2H transfer and then T2 issues an H2D transfer, creates a new event, and then waits for the old event, the H2D and D2H transfers may happen concurrently if they are on two distinct streams.
This violates the atomic data-transfer behavior.
The solution is one persistent event per map, and always checking its status before issuing any transfer.

openmp/libomptarget/src/device.cpp
95

Use unique_ptr with Deleter for Event?
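A sketch of the suggested RAII approach; destroyEvent here is a stand-in for the plugin entry point, and the counter exists only to make destruction observable:

```cpp
#include <cassert>
#include <memory>

struct NativeEvent {};

// Stand-in for the plugin's destroyEvent entry point; a real implementation
// would call through the plugin interface.
int DestroyedEvents = 0;
void destroyEvent(NativeEvent *E) {
  ++DestroyedEvents;
  delete E;
}

// Custom deleter so unique_ptr releases the event through the plugin.
struct EventDeleter {
  void operator()(NativeEvent *E) const { destroyEvent(E); }
};

using EventPtr = std::unique_ptr<NativeEvent, EventDeleter>;
```

The trade-off discussed in this thread is that this ties event destruction to C++ object lifetime, which is problematic if the plugin can be torn down before the owning object is destroyed.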

I still think the new/old event exchange should be removed.
If T1 records an event for a D2H transfer and then T2 issues an H2D transfer, creates a new event, and then waits for the old event, the H2D and D2H transfers may happen concurrently if they are on two distinct streams.
This violates the atomic data-transfer behavior.
The solution is one persistent event per map, and always checking its status before issuing any transfer.

First, there is no event for D2H for now. This patch tries to make sure there is no race among H2D transfers of the same memory from multiple target tasks, especially when they all read from that memory. D2H is out of scope and has nothing to do with the scenario in this patch. Without this patch, H2D and D2H can still happen at the same time on different streams.
Second, I think the case you describe is already a data race caused by the *user*. For two target tasks, if one task moves data in one direction while another moves data in the other direction, the two tasks should not run at the same time; they should have a dependency, because one reads and the other writes. Even with the "atomic" behavior you mentioned, which I didn't find in the spec (maybe I missed it; could you pinpoint where it is?), it is still undetermined which one comes first.
Third, using one event all the time sounds like a feasible solution. If an entry already has an event, we can reuse it without creating a new one and destroying the old one, which can save some overhead.
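The reuse idea in the third point could look like the following sketch (hypothetical names; in the real code, creating and recording the event would be plugin calls):

```cpp
#include <cassert>

struct NativeEvent { int Generation = 0; };

// Counts allocations, so we can observe that only one event is ever created.
int CreatedEvents = 0;

struct MapEntry {
  NativeEvent *Ev = nullptr;
};

// Record an event for a new transfer, reusing the entry's existing event
// instead of creating a new one and destroying the old one.
void recordTransferEvent(MapEntry &E) {
  if (!E.Ev) {
    E.Ev = new NativeEvent(); // createEvent: first transfer only
    ++CreatedEvents;
  }
  ++E.Ev->Generation; // recordEvent: re-record into the same event
}
```

Subsequent transfers then pay only for recording, not for create/destroy round trips.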

tianshilei1992 marked an inline comment as done. Jan 4 2022, 4:52 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
95

That sounds reasonable but is actually not practical. As I said before, we can't have automatic destruction of the event, because when libomptarget starts to be destroyed, we don't know whether the plugin is already dead; calling into the plugin may cause a segmentation fault. Since we already have an event pool, all those events are destroyed properly when a plugin is destroyed. For resources shared between libomptarget and plugins, libomptarget can only hold them, not own them. That is to say, we can only have *explicit* destruction.

tianshilei1992 marked an inline comment as done. Jan 4 2022, 4:53 PM

reuse an event instead of always creating one

simplify logic

ye-luo added a comment. Jan 4 2022, 5:47 PM

You are right. No need to worry about mixed H2D and D2H. Since the event is really meant for H2D, it is better to name it EventH2D.

ye-luo accepted this revision. Jan 4 2022, 6:03 PM

Renaming Event to EventH2D, and the corresponding functions accordingly, is my last request. The rest looks good.

This revision is now accepted and ready to land. Jan 4 2022, 6:03 PM
ye-luo added inline comments. Jan 4 2022, 6:28 PM
openmp/libomptarget/src/device.cpp
311

After the first transfer, every lookup calls waitEvent, which is still costly.
It would be better to return the event to the pool and avoid the waitEvent cost.

add clear comments on how Event should be used in current situation.

ye-luo added inline comments. Jan 4 2022, 7:47 PM
openmp/libomptarget/src/device.cpp
311

I misunderstood waitEvent. It just enqueues the event into the stream. We need to find a better place to reduce unnecessary waitEvent calls.

ye-luo added a comment. Jan 4 2022, 7:47 PM

add clear comments on how Event should be used in current situation.

That is good.

ye-luo added inline comments. Jan 4 2022, 8:12 PM
openmp/libomptarget/src/device.cpp
279

One more question: should the event protect only IsNew, or both IsNew and HasFlagAlways? I feel that in the case of HasFlagAlways without IsNew, it is the user's responsibility to handle the ordering of multiple transfers.

311

I think waitEvent can be bypassed if the present modifier is true.

JonChesterfield added inline comments. Jan 5 2022, 2:02 AM
openmp/libomptarget/src/device.cpp
95

This would be solvable by calling dlclose on the plugins that were opened with dlopen before libomptarget is destructed. That seems like a good idea independent of this patch: the plugin lifetime can be strictly nested within the lifetime of libomptarget.so.

tianshilei1992 marked 4 inline comments as done. Jan 5 2022, 4:21 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
95

IIRC, I did try that, but it didn't solve the problem. We want a destructor function, containing the dlclose, that is called at the very end of libomptarget's life cycle, even after all globals are destructed. That is not an easy task.

279

It should protect both, because the race we are trying to fix in this patch is transparent to the user. Even two target tasks reading the same object/variable/memory can trigger the race.

tianshilei1992 marked 2 inline comments as done. Jan 5 2022, 4:21 PM

minor optimization to bypass event if present modifier appears

tianshilei1992 marked an inline comment as done. Jan 5 2022, 5:04 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
311

waitEvent will not be triggered for an entry with the present modifier because its attached event will be empty, but we can save a lock.

tianshilei1992 marked an inline comment as done. Jan 5 2022, 5:04 PM
This revision was landed with ongoing or failed builds. Jan 5 2022, 5:20 PM
This revision was automatically updated to reflect the committed changes.
ronlieb added a subscriber: ronlieb. Jan 5 2022, 5:27 PM

Hi Shilei
seems like this broke our amdgpu openmp buildbot
https://lab.llvm.org/buildbot/#/builders/193/builds/4072

Yes. I'll fix it right away.