This patch allows Automatic asynchronous execution of OpenMP Target Regions.
When the environment variable LIBOMPTARGET_INTRA_THREAD_ASYNC is enabled
(LIBOMPTARGET_INTRA_THREAD_ASYNC=1), the implicit barrier that exists at the end
of every target region is removed and synchronization is only performed on memory transfers
or at the end of the program when the OpenMP runtime calls its destructors
Details
Diff Detail
Event Timeline
Could you please explain why removing synchronization at the end of a target region (without nowait) is a valid optimization. Also there should be no valid program relying on synchronization at the end of the program"
Ye,
It is only valid under the assumption of non unified shared memory. State across host and device is only visible during data movements. So it is up until then when changes in the host or device data is reflected. Assuming there are no external runtimes, it is possible to synchronize only on data movements, conserving the data dependencies between host and device.
When I ran target region A and then B and B consumes numbers generated by A on the device and there is no transfer involved. B may start before the A get the numbers ready.
As long as the RTL of the device provides a queue mechanism to execute target tasks sequentially, the dependencies among different tasks will be respected, for NVIDIA GPUs the queue mechanism is the stream. This patch, in contrast with the current OpenMP offloading implementation, launches the execution of tasks into the same stream.
CUDA streams are FIFO queues. But it’s true this will not work if the device queue is not FIFO. In the case of CUDA this works for the case you described without being a read after write dependency.
- right now in CUDA plugin, streams are pooled. There is no guarantee that A and B get the same Stream when multiple threads all are doing their own As and Bs.
- Need to minimize designing libomptarget based on CUDA behaviors.
I agree on 2. Any recommendations? We can move some of the logic there.
Regarding 1 this patch does not return the stream to the pool right away. It is held by the thread until a synchronization occurs. Synchronizations aré evaluated lazily on data movement. But the stream does not change between consecutive target invokes with no data movement
Explicitly tied to thread context is even worse. If I wrap both region with openmp CPU tasks and make task B depends on task A. task A completes but the kernel is still flying. Task B may start at anytime and there is no guarantee of running on the same thread.
That one is a good point. Let us revise that. Since the effect we actually want to have is the creation of the task graph sequentially.
openmp/libomptarget/include/device.h | ||
---|---|---|
420 | Why is this static, that seems wrong. | |
587 | Documentation, please. | |
openmp/libomptarget/include/omptarget.h | ||
227–240 | You cannot just remove this. See https://reviews.llvm.org/D132045, as it introduces a flag you can use to disable synchronization here. | |
openmp/libomptarget/src/device.cpp | ||
70 | Reverse the condition, single line alternative comes first, then the complex consequence (without the else). | |
82–84 | Use early exits and no else after return. |
openmp/libomptarget/src/device.cpp | ||
---|---|---|
62–66 | It may be better to merge these into: DP("Asynchronous execution %s\n", AsyncFlag ? "Enabled" : "Disabled"); |
openmp/libomptarget/include/device.h | ||
---|---|---|
420 | The idea of AsyncInfoMng is to have a way to control the AsyncInfo object to skip synchronization that is not needed. It's part of the Device class, so we can sync like this: Device.syncAsyncInfo(AsyncInfo, true); however, it is not necessary for every device to have a copy of AsyncInfoMng. That's why it's static. |
diff updated.
- The AsyncInfoManager is now part of the omptarget.h and not device.h
- Global variable AIM was added in interface.cpp
- Flag was added to the Synchronization in asyncinfo destructor
Some more initial comments
openmp/docs/optimizations/OpenMPOpt.rst | ||
---|---|---|
113 ↗ | (On Diff #461250) | Doesn't the length need to match? |
116 ↗ | (On Diff #461250) | For nowait regions this is maybe misleading. We should probably say they can be executed synchronously depending on the pragmas and implementation. In that case ... |
134 ↗ | (On Diff #461250) | This is misleading. Rather explain what would happen if the map on the target was triggering a transfer. |
openmp/libomptarget/include/omptarget.h | ||
317 | This default is dangerous. Either swap it or avoid the default. Also, explain what it means to synchronize but not to force synchronize. | |
321 | Use doxygen comments /// not //. | |
openmp/libomptarget/src/device.cpp | ||
53 | Unrelated | |
openmp/libomptarget/src/interface.cpp | ||
28 | We should not have a global like this. It probably belongs in/to the Device object. | |
347 | A comment would be nice as this is the one place we do not force synchronization. |
This new version has the following changes:
- Updated documentation
- The AsyncInfoManager is part of the Device class and now has a map that contains thread ids as its key values.
openmp/docs/optimizations/OpenMPOpt.rst | ||
---|---|---|
120 ↗ | (On Diff #462993) | |
135–140 ↗ | (On Diff #462993) | |
openmp/libomptarget/include/device.h | ||
402 | Documentation, please. | |
433 | I would rather expose the AIM than have 3 functions that just forward to it. | |
openmp/libomptarget/include/omptarget.h | ||
27 | We can use LLVM data structures now too, e.g., maps. | |
openmp/libomptarget/src/interface.cpp | ||
347 | multi-line conditionals should have braces. The comment is unhelpful. It states what the code already says, not why the code is this way. | |
openmp/libomptarget/src/omptarget.cpp | ||
73 | Lot's of static flags to simply lookup a env var once. Why don't we do it the same way as for other env vars? | |
75 | ||
81 | ||
82 | You did get an iterator, why do another lookup via [...]? You can do a single [...] lookup and use the reference result to check and update it. | |
84 | And this is the 3rd lookup into the map.. | |
85 | Do we ever have to clear the map? | |
102 | I doubt we need both these lines. |
I know this is getting tiresome but we need to make sure people understand what's happening and this plays well with future extensions. More comments.
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
223 | ShouldSyncWhenDestroyed | |
317 | You never return a nullptr, make it a reference. | |
325 | The description is unhelpful. What AsyncInfo, etc. The function is also unused, do we need it? | |
openmp/libomptarget/src/interface.cpp | ||
111–112 | Now this pattern is somewhat unfortunate. You get the AsyncInfo from the AIM and then you need to be careful to call the right synchronize. If the AsynFlag was global you could move the new "sync" logic into the regular AsyncInfo sync, right? AIM would just be used to manage the map ID -> AsyncInfo. WDYT about this scheme? You could check the env variable once, like we do it for some others: https://github.com/llvm/llvm-project/blob/23bc343855fdf6fb7668abadf2b064034b207981/openmp/libomptarget/src/rtl.cpp#L43 | |
openmp/libomptarget/src/omptarget.cpp | ||
74–77 |
openmp/libomptarget/src/interface.cpp | ||
---|---|---|
111–112 | Agreed! |
How many testing has been done? it seems only workable on toy examples.
openmp/docs/optimizations/OpenMPOpt.rst | ||
---|---|---|
140 ↗ | (On Diff #469594) | Through this example, I only see TT1 and TT2 racing when the async feature is enabled. |
openmp/libomptarget/src/device.cpp | ||
55–56 | Why AIM captures a pointer instead of reference? | |
openmp/libomptarget/src/omptarget.cpp | ||
32–49 | It is mess. synchroning or not depends on a bunch of states. Move the if-statement to the caller side and make the logic directly exposed on the use side. | |
91 | I think this is against coding principles, pass in a reference and delete its memory. | |
openmp/libomptarget/src/rtl.cpp | ||
45 | Is AsyncFlag documented? |
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
223 | Add const | |
305 | This is insufficient documentation. Please explain what this struct does actually not just what it is used for. | |
306 | Why is this a struct instead of a class? It encourage code like AIM.AsyncInfoM[1]? Use class and appropriate private/public. | |
307 | Why AsyncInfoTy needs to be associated with thread::id. Does this add unnecessary entanglement between target tasks. |
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
307 | it is necessary to document the design choice that AsyncInfoTy objects and threadIDs have one to one mapping when AsyncFlag is true. | |
openmp/libomptarget/src/omptarget.cpp | ||
74 | New a pointer and then return its reference. "reference" types should not be used to manage the ownership. |
openmp/libomptarget/src/rtl.cpp | ||
---|---|---|
45 | I don't think this is a good design, that tries to modify the state of libomptarget from plugins, especially the corresponding env is called LIBOMPTARGET_ASYNC, which indicates it should be handled in libomptarget instead of in each plugin. Potentially I think we could add a plugin interface function to tell what opt-in feature is enabled. This could be done in a separate patch, but is better to land it before this patch. I don't think it makes sense to open the door (now) and then close it later, because I don't think "later" will come. BTW, LIBOMPTARGET_ASYNC is a little bit confusing. It's too general. |
openmp/libomptarget/src/rtl.cpp | ||
---|---|---|
45 | nvm, I didn't look at it right. It's in libomptarget. My bad. But the env name is true. |
Wrt toy examples:
openmp/docs/optimizations/OpenMPOpt.rst | ||
---|---|---|
140 ↗ | (On Diff #469594) | As discussed yesterday, there is no race. |
openmp/libomptarget/include/omptarget.h | ||
307 | This does not depend on tasks but threads. If threads are independent wrt offload, this extension allows them to run host and device code asynchronously. If threads are not independent wrt. offload, this extension cannot be used. | |
openmp/libomptarget/src/omptarget.cpp | ||
32–49 | Exposing this to the user side is not only not helpful but will actively harm things. The logic depends on two flags, not "a bunch of states". One is passed by the user, one describes the system setup. This is totally fine. | |
74 | What about this: Replace "get" with AsyncInfoMng::register(AsyncInfo &AI); Remove "free". | |
openmp/libomptarget/src/rtl.cpp | ||
45 | What about LIBOMPTARGET_INTRA_THREAD_ASYNC? |
openmp/libomptarget/src/rtl.cpp | ||
---|---|---|
45 | that sounds good! |
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
307 |
Fair enough. This feature relies on thread id to impose dependency. It can only be used under certain restrictions. | |
openmp/libomptarget/src/omptarget.cpp | ||
32–49 | Two flags = 4 states. Not a low number of variants. Calling synchronize() but it may or may not do the sync. When should the user set ForceSync to true? Better to have some explanation. | |
74 | Not getting what you mean. |
openmp/libomptarget/src/omptarget.cpp | ||
---|---|---|
32–49 | 4 states, ok. However, they collapse to 2; it's a single conditional after all: synchronize or not. /// Synchronize all pending actions when the LIBOMPTARGET_ASYNC env var /// is disabled or when synchronization is forced (ForceSync = true) // Otherwise, synchronization is skipped /// \returns OFFLOAD_FAIL or OFFLOAD_SUCCESS appropriately. | |
74 |
It is properly destroyed with this patch, and it will be with the proposed scheme. In the proposed scheme we have a local AsyncInfo (as we have upstream now) and we use it if AsyncInfo=false. |
openmp/libomptarget/src/omptarget.cpp | ||
---|---|---|
74 | Im not following this. Could you please provide a pseudocode? |
This patch addresses comments from reviewers:
- Env var name is LIBOMPTARGET_INTRA_THREAD_ASYNC
- The pattern to get and destroy the AsyncInfoTy object changed.
- A DenseMap is used instead of the std::map.
Patch updated to the trunk version.
Changes compared to the last patch:
- the HasDataTransfer flag was added.
- the functions AsyncInfoTy *get() was added to both AsyncInfoTy and TaskAsyncInfoWrapperTy
Adding another limitation to this approach. In the following code:
void aaaa(int b) { // TARGET A #pragma omp target {} // TARGET B #pragma omp target nowait {} // TARGET C #pragma omp target {} }
The task A->B dependency should be respected, since A is originally synchronous, it should have been executed before B. However, B will be spawned in a different thread and A can potentially execute after B.
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
237 | It may be a good idea to add a comment here explaining why this is necessary. It has to do with garbage collection of the device RTL, as per our conversation today. |
The thread which encounter A will not reach B until A has completed so how can B be spawned when A is active.
Hi Ravi,
The purpose of this is to have regions with nowait to be asynchronous (but ordered within the stream). (See paper). So using the solution presented by @randreshg will have this issue. The problem is that the encountering thread will push A to the queue, but B is going to be lowered as a host task. B can be pushed into another thread, and potentially delayed way before A finishes. A must be synchronized such that the original order is maintained. My example was rather simplistic, but it demonstrates the issue. While A and C will be in order, B will not be ordered w.r.t. A.
Of course this is not default OpenMP behavior where what you said does apply.
Documentation, please.