This is an archive of the discontinued LLVM Phabricator instance.

Automatic asynchronous execution of OpenMP Target Regions
Needs ReviewPublic

Authored by randreshg on Aug 20 2022, 9:50 AM.

Details

Summary

This patch allows Automatic asynchronous execution of OpenMP Target Regions.
When the environment variable LIBOMPTARGET_INTRA_THREAD_ASYNC is enabled
(LIBOMPTARGET_INTRA_THREAD_ASYNC=1), the implicit barrier that exists at the end
of every target region is removed and synchronization is only performed on memory transfers
or at the end of the program when the OpenMP runtime calls its destructors

Diff Detail

Event Timeline

randreshg created this revision.Aug 20 2022, 9:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 20 2022, 9:50 AM
randreshg requested review of this revision.Aug 20 2022, 9:50 AM
randreshg edited the summary of this revision. (Show Details)Aug 20 2022, 1:31 PM
ye-luo requested changes to this revision.Aug 20 2022, 2:06 PM
ye-luo added a subscriber: ye-luo.

Could you please explain why removing synchronization at the end of a target region (without nowait) is a valid optimization. Also there should be no valid program relying on synchronization at the end of the program"

This revision now requires changes to proceed.Aug 20 2022, 2:06 PM

Ye,

It is only valid under the assumption of non unified shared memory. State across host and device is only visible during data movements. So it is up until then when changes in the host or device data is reflected. Assuming there are no external runtimes, it is possible to synchronize only on data movements, conserving the data dependencies between host and device.

When I ran target region A and then B and B consumes numbers generated by A on the device and there is no transfer involved. B may start before the A get the numbers ready.

As long as the RTL of the device provides a queue mechanism to execute target tasks sequentially, the dependencies among different tasks will be respected, for NVIDIA GPUs the queue mechanism is the stream. This patch, in contrast with the current OpenMP offloading implementation, launches the execution of tasks into the same stream.

CUDA streams are FIFO queues. But it’s true this will not work if the device queue is not FIFO. In the case of CUDA this works for the case you described without being a read after write dependency.

  1. right now in CUDA plugin, streams are pooled. There is no guarantee that A and B get the same Stream when multiple threads all are doing their own As and Bs.
  2. Need to minimize designing libomptarget based on CUDA behaviors.

I agree on 2. Any recommendations? We can move some of the logic there.

Regarding 1 this patch does not return the stream to the pool right away. It is held by the thread until a synchronization occurs. Synchronizations aré evaluated lazily on data movement. But the stream does not change between consecutive target invokes with no data movement

I agree on 2. Any recommendations? We can move some of the logic there.

Regarding 1 this patch does not return the stream to the pool right away. It is held by the thread until a synchronization occurs. Synchronizations aré evaluated lazily on data movement. But the stream does not change between consecutive target invokes with no data movement

Explicitly tied to thread context is even worse. If I wrap both region with openmp CPU tasks and make task B depends on task A. task A completes but the kernel is still flying. Task B may start at anytime and there is no guarantee of running on the same thread.

That one is a good point. Let us revise that. Since the effect we actually want to have is the creation of the task graph sequentially.

jdoerfert added inline comments.Aug 29 2022, 7:31 AM
openmp/libomptarget/include/device.h
340

Why is this static, that seems wrong.

475

Documentation, please.

openmp/libomptarget/include/omptarget.h
198–203

You cannot just remove this. See https://reviews.llvm.org/D132045, as it introduces a flag you can use to disable synchronization here.

openmp/libomptarget/src/device.cpp
69

Reverse the condition, single line alternative comes first, then the complex consequence (without the else).

81–83

Use early exits and no else after return.

josemonsalve2 added inline comments.Aug 31 2022, 2:25 PM
openmp/libomptarget/src/device.cpp
61–65

It may be better to merge these into:

DP("Asynchronous execution %s\n", AsyncFlag ? "Enabled" : "Disabled");

randreshg marked 4 inline comments as done.Sep 5 2022, 7:17 AM
randreshg added inline comments.Sep 5 2022, 7:31 AM
openmp/libomptarget/include/device.h
340

The idea of AsyncInfoMng is to have a way to control the AsyncInfo object to skip synchronization that is not needed. It's part of the Device class, so we can sync like this:

Device.syncAsyncInfo(AsyncInfo, true);

however, it is not necessary for every device to have a copy of AsyncInfoMng. That's why it's static.

randreshg marked 2 inline comments as done.Sep 13 2022, 6:26 AM
randreshg updated this revision to Diff 459736.EditedSep 13 2022, 6:36 AM

diff updated.

  • The AsyncInfoManager is now part of the omptarget.h and not device.h
  • Global variable AIM was added in interface.cpp
  • Flag was added to the Synchronization in asyncinfo destructor

Documentation added

Some more initial comments

openmp/docs/optimizations/OpenMPOpt.rst
113

Doesn't the length need to match?

116

For nowait regions this is maybe misleading. We should probably say they can be executed synchronously depending on the pragmas and implementation. In that case ...

134

This is misleading. Rather explain what would happen if the map on the target was triggering a transfer.

openmp/libomptarget/include/omptarget.h
232

This default is dangerous. Either swap it or avoid the default. Also, explain what it means to synchronize but not to force synchronize.

236

Use doxygen comments /// not //.

openmp/libomptarget/src/device.cpp
52

Unrelated

openmp/libomptarget/src/interface.cpp
25

We should not have a global like this. It probably belongs in/to the Device object.

254–259

A comment would be nice as this is the one place we do not force synchronization.

randreshg marked 8 inline comments as done.Sep 26 2022, 11:54 AM

This new version has the following changes:

  • Updated documentation
  • The AsyncInfoManager is part of the Device class and now has a map that contains thread ids as its key values.
jdoerfert added inline comments.Oct 18 2022, 10:40 AM
openmp/docs/optimizations/OpenMPOpt.rst
120
135–140
openmp/libomptarget/include/device.h
320

Documentation, please.

353

I would rather expose the AIM than have 3 functions that just forward to it.

openmp/libomptarget/include/omptarget.h
26

We can use LLVM data structures now too, e.g., maps.

openmp/libomptarget/src/interface.cpp
254–259

multi-line conditionals should have braces. The comment is unhelpful. It states what the code already says, not why the code is this way.

openmp/libomptarget/src/omptarget.cpp
58

Lot's of static flags to simply lookup a env var once. Why don't we do it the same way as for other env vars?

60
66
67

You did get an iterator, why do another lookup via [...]? You can do a single [...] lookup and use the reference result to check and update it.

69

And this is the 3rd lookup into the map..

70

Do we ever have to clear the map?

87

I doubt we need both these lines.

randreshg updated this revision to Diff 468932.Oct 19 2022, 8:50 AM
randreshg marked an inline comment as done.

This new patch addresses all previous comments from reviewers

randreshg marked 11 inline comments as done.Oct 19 2022, 8:51 AM

I know this is getting tiresome but we need to make sure people understand what's happening and this plays well with future extensions. More comments.

openmp/libomptarget/include/omptarget.h
195

ShouldSyncWhenDestroyed

232

You never return a nullptr, make it a reference.

240

The description is unhelpful. What AsyncInfo, etc. The function is also unused, do we need it?

openmp/libomptarget/src/interface.cpp
100

Now this pattern is somewhat unfortunate. You get the AsyncInfo from the AIM and then you need to be careful to call the right synchronize. If the AsynFlag was global you could move the new "sync" logic into the regular AsyncInfo sync, right? AIM would just be used to manage the map ID -> AsyncInfo. WDYT about this scheme? You could check the env variable once, like we do it for some others: https://github.com/llvm/llvm-project/blob/23bc343855fdf6fb7668abadf2b064034b207981/openmp/libomptarget/src/rtl.cpp#L43

openmp/libomptarget/src/omptarget.cpp
59–62
randreshg marked 6 inline comments as done.Oct 20 2022, 12:35 PM
randreshg added inline comments.
openmp/libomptarget/src/interface.cpp
100

Agreed!
The pattern you suggest allows for separating the synchronization logic and AsyncInfo management.

randreshg updated this revision to Diff 469594.Oct 21 2022, 6:59 AM
randreshg marked an inline comment as done.

I think this is fine for now. @ye-luo can this go in as opt-in feature that we probably refine as we go?

openmp/libomptarget/src/interface.cpp
100

Better. Not super happy about the explicit free call but that's fine for now.

How many testing has been done? it seems only workable on toy examples.

openmp/docs/optimizations/OpenMPOpt.rst
140

Through this example, I only see TT1 and TT2 racing when the async feature is enabled.

openmp/libomptarget/src/device.cpp
55

Why AIM captures a pointer instead of reference?

openmp/libomptarget/src/omptarget.cpp
30–38

It is mess. synchroning or not depends on a bunch of states. Move the if-statement to the caller side and make the logic directly exposed on the use side.

76

I think this is against coding principles, pass in a reference and delete its memory.

openmp/libomptarget/src/rtl.cpp
42

Is AsyncFlag documented?

ye-luo added inline comments.Oct 26 2022, 1:29 PM
openmp/libomptarget/include/omptarget.h
195

Add const

220

This is insufficient documentation. Please explain what this struct does actually not just what it is used for.

221

Why is this a struct instead of a class? It encourage code like AIM.AsyncInfoM[1]? Use class and appropriate private/public.

222

Why AsyncInfoTy needs to be associated with thread::id. Does this add unnecessary entanglement between target tasks.

ye-luo added inline comments.Oct 26 2022, 1:48 PM
openmp/libomptarget/include/omptarget.h
222

it is necessary to document the design choice that AsyncInfoTy objects and threadIDs have one to one mapping when AsyncFlag is true.

openmp/libomptarget/src/omptarget.cpp
59

New a pointer and then return its reference. "reference" types should not be used to manage the ownership.

openmp/libomptarget/src/rtl.cpp
42

I don't think this is a good design, that tries to modify the state of libomptarget from plugins, especially the corresponding env is called LIBOMPTARGET_ASYNC, which indicates it should be handled in libomptarget instead of in each plugin. Potentially I think we could add a plugin interface function to tell what opt-in feature is enabled. This could be done in a separate patch, but is better to land it before this patch. I don't think it makes sense to open the door (now) and then close it later, because I don't think "later" will come.

BTW, LIBOMPTARGET_ASYNC is a little bit confusing. It's too general.

openmp/libomptarget/include/omptarget.h
21–22

It is recommended to add a blank line between LLVM and STL headers.

222

Better to use LLVM ADT here

openmp/libomptarget/src/rtl.cpp
42

nvm, I didn't look at it right. It's in libomptarget. My bad. But the env name is true.

How many testing has been done? it seems only workable on toy examples.

Wrt toy examples:

openmp/docs/optimizations/OpenMPOpt.rst
140

As discussed yesterday, there is no race.

openmp/libomptarget/include/omptarget.h
222

This does not depend on tasks but threads. If threads are independent wrt offload, this extension allows them to run host and device code asynchronously. If threads are not independent wrt. offload, this extension cannot be used.

openmp/libomptarget/src/omptarget.cpp
30–38

Exposing this to the user side is not only not helpful but will actively harm things. The logic depends on two flags, not "a bunch of states". One is passed by the user, one describes the system setup. This is totally fine.

59

What about this:

Replace "get" with

AsyncInfoMng::register(AsyncInfo &AI);

Remove "free".
Register is implemented as no-op w/o the AsyncInfo flag set.
Otherwise it'll replace AI with a dynamically allocated one.

openmp/libomptarget/src/rtl.cpp
42

What about LIBOMPTARGET_INTRA_THREAD_ASYNC?

openmp/libomptarget/src/rtl.cpp
42

that sounds good!

ye-luo added inline comments.Oct 27 2022, 12:36 PM
openmp/libomptarget/include/omptarget.h
222

This does not depend on tasks but threads. If threads are independent wrt offload, this extension allows them to run host and device code asynchronously. If threads are not independent wrt. offload, this extension cannot be used.

Fair enough. This feature relies on thread id to impose dependency. It can only be used under certain restrictions.

openmp/libomptarget/src/omptarget.cpp
30–38

Two flags = 4 states. Not a low number of variants. Calling synchronize() but it may or may not do the sync. When should the user set ForceSync to true? Better to have some explanation.

59

Not getting what you mean.
I'm still expecting, AsyncInfoTy object being destroyed properly at the end of the target region, when AsyncInfo=false.

jdoerfert added inline comments.Oct 27 2022, 1:28 PM
openmp/libomptarget/src/omptarget.cpp
30–38

4 states, ok. However, they collapse to 2; it's a single conditional after all: synchronize or not.
The rules are fairly simple (one or condition) and documented:

/// Synchronize all pending actions when the LIBOMPTARGET_ASYNC env var
/// is disabled or when synchronization is forced (ForceSync = true) 
// Otherwise, synchronization is skipped
/// \returns OFFLOAD_FAIL or OFFLOAD_SUCCESS appropriately.
59

I'm still expecting, AsyncInfoTy object being destroyed properly at the end of the target region, when AsyncInfo=false.

It is properly destroyed with this patch, and it will be with the proposed scheme. In the proposed scheme we have a local AsyncInfo (as we have upstream now) and we use it if AsyncInfo=false.

randreshg added inline comments.Oct 28 2022, 12:28 PM
openmp/libomptarget/src/omptarget.cpp
59

Im not following this. Could you please provide a pseudocode?

randreshg marked 22 inline comments as done.Nov 21 2022, 7:52 AM
randreshg marked 5 inline comments as done.
randreshg updated this revision to Diff 476906.Nov 21 2022, 8:00 AM
randreshg edited the summary of this revision. (Show Details)

This patch addresses comments from reviewers:

  • Env var name is LIBOMPTARGET_INTRA_THREAD_ASYNC
  • The pattern to get and destroy the AsyncInfoTy object changed.
  • A DenseMap is used instead of the std::map.
randreshg updated this revision to Diff 506605.Mar 20 2023, 8:34 AM
randreshg edited the summary of this revision. (Show Details)

Patch updated to the trunk version.
Changes compared to the last patch:

  • the HasDataTransfer flag was added.
  • the functions AsyncInfoTy *get() was added to both AsyncInfoTy and TaskAsyncInfoWrapperTy

Adding another limitation to this approach. In the following code:

void aaaa(int b) {
    // TARGET A
    #pragma omp target
    {}

    // TARGET B
    #pragma omp target nowait
    {}

    // TARGET C
    #pragma omp target
    {}
}

The task A->B dependency should be respected, since A is originally synchronous, it should have been executed before B. However, B will be spawned in a different thread and A can potentially execute after B.

openmp/libomptarget/include/omptarget.h
208

It may be a good idea to add a comment here explaining why this is necessary. It has to do with garbage collection of the device RTL, as per our conversation today.

Adding another limitation to this approach. In the following code:

void aaaa(int b) {
    // TARGET A
    #pragma omp target
    {}

    // TARGET B
    #pragma omp target nowait
    {}

    // TARGET C
    #pragma omp target
    {}
}

The task A->B dependency should be respected, since A is originally synchronous, it should have been executed before B. However, B will be spawned in a different thread and A can potentially execute after B.

The thread which encounter A will not reach B until A has completed so how can B be spawned when A is active.

Hi Ravi,

The purpose of this is to have regions with nowait to be asynchronous (but ordered within the stream). (See paper). So using the solution presented by @randreshg will have this issue. The problem is that the encountering thread will push A to the queue, but B is going to be lowered as a host task. B can be pushed into another thread, and potentially delayed way before A finishes. A must be synchronized such that the original order is maintained. My example was rather simplistic, but it demonstrates the issue. While A and C will be in order, B will not be ordered w.r.t. A.

Of course this is not default OpenMP behavior where what you said does apply.

The thread which encounter A will not reach B until A has completed so how can B be spawned when A is active.

randreshg updated this revision to Diff 510619.Apr 3 2023, 3:18 PM

Test added