This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Add non-blocking support for target nowait regions
ClosedPublic

Authored by gValarini on Aug 16 2022, 5:47 PM.

Details

Summary

This patch better integrates the target nowait functions with the tasking runtime. It splits the nowait execution into two stages: a dispatch stage, which triggers all the necessary asynchronous device operations and stores a set of post-processing procedures that must be executed after said operations; and a synchronization stage, responsible for synchronizing the previous operations in a non-blocking manner and running the appropriate post-processing functions. If the operations are not completed during the synchronization stage, the attached hidden helper task is re-enqueued to any hidden helper thread to be synchronized later, allowing other target nowait regions to be dispatched concurrently.
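In rough terms, the two stages look like the following sketch (the helper names here are illustrative, not the patch's exact symbols):

// Dispatch stage: issue asynchronous device operations and record what must
// run after they complete.
int dispatchTargetRegion(DeviceTy &Device, AsyncInfoTy &AsyncInfo) {
  // ... enqueue non-blocking data transfers and the kernel launch ...
  AsyncInfo.addPostProcessingFunction([&Device]() {
    // E.g., release mapped entries once the transfers have finished.
    return OFFLOAD_SUCCESS;
  });
  return OFFLOAD_SUCCESS;
}

// Synchronization stage: query progress instead of blocking on it.
int synchronizeTargetRegion(AsyncInfoTy &AsyncInfo) {
  if (!AsyncInfo.isDone())
    return reenqueueHelperTask(); // retry later on another helper thread
  return AsyncInfo.runPostProcessing(); // done: run the recorded procedures
}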

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.
gValarini edited the summary of this revision. (Show Details)Aug 16 2022, 5:50 PM
ye-luo added a subscriber: ye-luo.Aug 16 2022, 5:51 PM

If during the synchronization stage the operations are not completed, the attached hidden helper task is re-enqueued to any hidden helper thread to be later synchronized

Does this explicitly depend on helper threads and helper tasks, or can regular OpenMP threads and tasks be used?

gValarini edited the summary of this revision. (Show Details)Aug 16 2022, 5:51 PM
gValarini updated this revision to Diff 453181.Aug 16 2022, 6:16 PM

Added missing commits

If during the synchronization stage the operations are not completed, the attached hidden helper task is re-enqueued to any hidden helper thread to be later synchronized

Does this explicitly depend on helper threads and helper tasks, or can regular OpenMP threads and tasks be used?

This currently is intended only for helper tasks attached to target nowait regions.


The existing target nowait implementation doesn't distinguish between using helper tasks or not. Setting LIBOMP_USE_HIDDEN_HELPER_TASK=0 dispatches target tasks as regular tasks. As long as there are available OpenMP threads, these tasks can still run concurrently and gain asynchronicity, although this scheme has the issue of actively waiting and consuming host thread resources. I think this is the same issue you are trying to address with helper tasks.
Your scheme of splitting target tasks doesn't seem necessarily tied to helper tasks. Do you have a specific reason for restricting this feature to only target tasks?

gValarini updated this revision to Diff 453452.Aug 17 2022, 3:43 PM

Update re-enqueue logic to support target nowait regions with hidden helper threads disabled

Right now the synchronization is based on streams. Have you thought about synchronizing via a CUDA event and returning the stream to the pool early?

The existing target nowait implementation doesn't distinguish between using helper tasks or not. Setting LIBOMP_USE_HIDDEN_HELPER_TASK=0 dispatches target tasks as regular tasks. As long as there are available OpenMP threads, these tasks can still run concurrently and gain asynchronicity, although this scheme has the issue of actively waiting and consuming host thread resources. I think this is the same issue you are trying to address with helper tasks.

You have a good point. I thought you were talking about using the re-enqueueing scheme with normal OpenMP task regions.

Your scheme of splitting target tasks doesn't seem necessarily tied to helper tasks. Do you have a specific reason for restricting this feature to only target tasks?

There was no specific reason to limit that to hidden helper tasks. I have updated the code so it can also work when LIBOMP_USE_HIDDEN_HELPER_TASK=0. Here is the new implemented behavior:

  • HHT enabled: target nowait regions will use the new non-blocking synchronization scheme, and tasks will be re-enqueued/pushed to other hidden helper threads.
  • HHT disabled / Task with a task team: target nowait regions will use the new non-blocking synchronization scheme, and tasks will be re-enqueued/pushed to the other OpenMP threads of the same task team. E.g., when a target nowait is created inside a parallel region.
  • HHT disabled / Task without a task team: target nowait regions will not use the new non-blocking synchronization scheme. Synchronization will be done in a blocking manner as before, respecting the case where the target task is serialized. E.g., when a target nowait is created outside any parallel region. (See the example sketch below.)
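For concreteness, the two LIBOMP_USE_HIDDEN_HELPER_TASK=0 cases correspond to user code like this sketch (compute_on_device is a placeholder):

// Task with a task team: non-blocking synchronization; pending target tasks
// are re-enqueued to the other threads of the enclosing parallel region.
#pragma omp parallel
#pragma omp single
#pragma omp target nowait
  compute_on_device();

// Task without a task team: falls back to blocking synchronization.
#pragma omp target nowait
  compute_on_device();
#pragma omp taskwait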

What do you think about this implementation?


This is a lot better. The less specialization of the feature the better.

Right now the synchronization is based on streams. Have you thought about synchronizing via a CUDA event and returning the stream to the pool early?

I had not thought about that, but it could be a nice optimization. Since the CUDA plugin currently maintains a resizable pool of streams for each device, with an initial size of 32, I thought that would be enough for a first implementation.

CUDA events have the same API as streams for non-blocking synchronization using cudaEventQuery, so we could store a single event (completionEvent) per AsyncInfo and use that when synchronizing with SyncType::NON_BLOCKING. I have one question though: does querying for CUDA event completion synchronize all the operations prior to the event on the stream? Or must another thread on the host synchronize the stream? If only synchronizing the events is enough, it would make using them much simpler.
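For reference, a minimal sketch of the event-based pattern being discussed, using the CUDA runtime API (stream here stands for the AsyncInfo queue):

#include <cuda_runtime.h>

cudaEvent_t completionEvent;
cudaEventCreateWithFlags(&completionEvent, cudaEventDisableTiming);
// After the last asynchronous operation is enqueued on the stream:
cudaEventRecord(completionEvent, stream);
// Later, on the SyncType::NON_BLOCKING path:
cudaError_t err = cudaEventQuery(completionEvent);
if (err == cudaSuccess) {
  // The event, and all work recorded on the stream before it, is complete.
} else if (err == cudaErrorNotReady) {
  // Still pending: re-enqueue the task and query again later.
}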


My second thought on this is: let us do stream sync for now.
NVIDIA: if we sync with an event and return streams early, tasks may get serialized if two end up in the same stream.
AMD: at the HSA level, there are only signals (events).
Level Zero: it depends on which type of command list is being used.
So it seems libomptarget should stay flexible and let the plugin decide which mechanism to use.

Thanks for the patch. I'll take a close look next week.

Drive by.

openmp/libomptarget/include/omptarget.h
194

Drive by: I don't believe we want such a generic interface. The post-processing task should just be a fixed function, not multiple functions unknown at compile time.

openmp/libomptarget/src/omptarget.cpp
863

Just make this a standalone static function.

First of all, sorry for the late reply.

My second thought on this is: let us do stream sync for now.
NVIDIA: if we sync with an event and return streams early, tasks may get serialized if two end up in the same stream.
AMD: at the HSA level, there are only signals (events).
Level Zero: it depends on which type of command list is being used.
So it seems libomptarget should stay flexible and let the plugin decide which mechanism to use.

Ok, we can leave it as is for now. Just a comment on the NVIDIA case for future development: although we may not completely avoid the serialization problem using the pool of streams, I believe we can reduce it if the pool access pattern is changed. Currently, streams are acquired/released from/to the pool in a stack-like manner, which induces frequent reuse of the streams nearest to the top of the stack/pool (in the code, the streams near the beginning of the Resources vector). By changing this access pattern to be round-robin-like, the stream use frequency would be evened out across the pool, and serialization would become less of a problem.
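A sketch of the suggested round-robin acquisition, assuming a hypothetical pool that keeps its streams in a Resources vector (PoolMutex and NextIdx are illustrative members):

std::mutex PoolMutex;
std::vector<CUstream> Resources; // pre-populated pool (initial size 32)
size_t NextIdx = 0;

CUstream acquireStreamRoundRobin() {
  std::lock_guard<std::mutex> Lock(PoolMutex);
  // Stack-like reuse would always pick Resources.back(); round-robin
  // instead spreads the work evenly over the whole pool.
  CUstream Stream = Resources[NextIdx];
  NextIdx = (NextIdx + 1) % Resources.size();
  return Stream;
}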

Thanks for the patch. I'll take a close look next week.

Perfect, thanks!

gValarini updated this revision to Diff 455948.Aug 26 2022, 9:44 AM
gValarini marked an inline comment as done.

Update documentations and refactor some code

gValarini added inline comments.Aug 26 2022, 10:06 AM
openmp/libomptarget/include/omptarget.h
194

Just adding some context to why it was done this way:

  • Lambdas can easily store any local data needed by the post-processing procedures. This allows them to be generated locally by the normal code flow (which may improve code maintenance) and then stored in the lambda's implicit structure (solving any lifetime problems).
  • Having a function vector allows us to easily compose post-processing procedures. This is the case for target nowait regions, which must run both the targetDataEnd and processDataAfter post-processing functions.
  • If in the future we need to generate more asynchronous operations inside the post-processing functions, this can easily be done by pushing more lambdas to the vector.

With all that in mind, I know std::functions may lead to an additional heap allocation and one more level of indirection due to type erasure, preventing some compiler optimizations. If that is not desirable despite the presented context, I can change how the post-processing procedures are stored/called, but I would really like to keep some of the points described above, especially the lifetime correctness of the post-processing data. Do you have something in mind that could help me with that?

Maybe defining a different struct for each post-processing function and storing it in the AsyncInfoTy as a variant could be enough. What do you think?
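A condensed sketch of the lambda-based design described above (simplified; the actual members and signatures in the patch may differ):

#include <functional>
#include "llvm/ADT/SmallVector.h"

struct AsyncInfoTy {
  // Composable, state-carrying post-processing procedures.
  llvm::SmallVector<std::function<int()>, 4> PostProcessingFunctions;

  void addPostProcessingFunction(std::function<int()> &&Fn) {
    PostProcessingFunctions.emplace_back(std::move(Fn));
  }

  int runPostProcessing() {
    for (auto &Fn : PostProcessingFunctions)
      if (int RC = Fn())
        return RC;
    PostProcessingFunctions.clear();
    return 0; // OFFLOAD_SUCCESS
  }
};

A caller can then move local state straight into the capture, e.g. [Args = std::move(PrivateArgumentManager)]() mutable { return Args.free(); }, which is what solves the lifetime problem mentioned above.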

jdoerfert added inline comments.Aug 26 2022, 10:55 AM
openmp/libomptarget/include/omptarget.h
194

I would have assumed that if we have two potential callbacks, let's say static functions F and G, we would have two members, both void *:

void *PayloadForF = nullptr;
void *PayloadForG = nullptr;

and if they are not null we call F and G respectively passing the payload.

I'm fine with keeping it this way for now; we can see if there is a need to change it.

openmp/libomptarget/include/omptargetplugin.h
160

Describe the return value; it's not clear what would be returned if it does or doesn't synchronize.

openmp/libomptarget/src/device.cpp
634–635
openmp/libomptarget/src/interface.cpp
143

I assume we can outline some of this as it is probably the same as in the sync version, right? Let's avoid duplication as much as possible. Same below.

282

So, when do we call this but don't want to actually do targetDataUpdate? I am confused. Same above.

405

Clang format.

openmp/libomptarget/src/omptarget.cpp
698

Documentation, plz.

1199

Why are the const members problematic?

1511

FWIW, mutable is really not my favorite way of handling things.

openmp/libomptarget/src/private.h
214–217

Here and elsewhere, we should prefer early exit and no else after return.
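Illustratively, the requested shape is:

// Instead of:
//   if (Cond) {
//     /* long body */
//   } else
//     return OFFLOAD_FAIL;
// prefer the early exit:
if (!Cond)
  return OFFLOAD_FAIL;
/* long body, one indentation level less */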

224–228
openmp/runtime/src/kmp_tasking.cpp
1145

@tianshilei1992 you need to look at these changes.

5187

current task?

tianshilei1992 added inline comments.Aug 30 2022, 6:48 AM
openmp/runtime/src/kmp_tasking.cpp
5157

libomptarget (for now) doesn't require that the calling thread be an OpenMP thread, so using libomp's gtid generally breaks. Either we add the requirement, which needs to be discussed further, or an alternative method is needed to implement this. If libomp is executed on another fresh thread, a new root will be created.

tianshilei1992 added inline comments.Aug 30 2022, 6:59 AM
openmp/libomptarget/src/omptarget.cpp
700

nit: a vector of PostProcessingInfo called PostProcessingPtrs is confusing.

openmp/runtime/src/kmp_tasking.cpp
1099
jdoerfert retitled this revision from Add non-blocking support for target nowait regions to [OpenMP] Add non-blocking support for target nowait regions.Sep 7 2022, 10:21 AM

Refactor interface code and address some code review changes

gValarini marked 5 inline comments as done.Sep 14 2022, 1:19 PM
gValarini added inline comments.
openmp/libomptarget/include/omptarget.h
194

Okey, for now, I'll keep it like this then.

openmp/libomptarget/include/omptargetplugin.h
160

Good point. I have updated both the documentation and the function name to better reflect what this new interface should do. Do you think it is more clear now?

openmp/libomptarget/src/interface.cpp
143

Yep, you are correct. I have created two new "launchers" that can unify most of the code paths for the execution and data-related functions for the normal and nowait cases. Since all data-related interface entries practically have the same signature, a single launcher is enough for them all.

A new class called TaskAsyncInfoTy also unifies the code related to managing the task async handle when executing target nowait regions.

What do you think about this new code structure?

282

The goal is to dispatch the device-side operations (i.e., call the functions in the omptarget.cpp file) only when a new async handle is created. If we detect that the task already contains an async handle, it means the device-side operations were already dispatched, and we should only try to synchronize them in a non-blocking manner.

The new TaskAsyncInfoTy class has a function called shouldDispatch that now encapsulates this detection logic with proper documentation. Do you think it is clearer now? Should we add a comment to each call site as well?
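A sketch of how a call site uses that detection (everything except shouldDispatch is illustrative):

TaskAsyncInfoTy TaskAsyncInfo(Device);
if (TaskAsyncInfo.shouldDispatch()) {
  // First encounter: the handle is new, dispatch the device-side operations.
  Rc = targetDataBegin(Device, /*other args,*/ TaskAsyncInfo.getAsyncInfo());
}
// Every encounter: try a non-blocking synchronization; if operations are
// still pending, the task is re-enqueued and this code runs again later.
Rc = TaskAsyncInfo.getAsyncInfo().synchronize();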

openmp/libomptarget/src/omptarget.cpp
698

Done. Could you check if I missed anything?

700

Oops. Thanks, I have updated the var name.

1199

In summary, we want to be able to move PrivateArgumentManagerTy instances into the post-processing lambdas, so their lifetime is automatically managed by them.

The problem is with how llvm::SmallVector implements its move constructor. Unfortunately, it is implemented as a move-assignment instead of a proper constructor, meaning it cannot be generated for structs with const members. If we replace FirstPrivateArgInfo with a std::vector, the problem does not happen because the STL properly implements a move constructor for vectors.

Since I think we do not want to use std::vector anymore, I just removed the const from the members, since they are not even accessible outside the PrivateArgumentManagerTy class.

What do you think of this approach?
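A minimal illustration of that move-constructor limitation (FirstPrivateArgInfoTy here is a simplified stand-in):

#include <utility>

struct FirstPrivateArgInfoTy {
  const int Size; // const member: assignment operators are implicitly deleted
};

int main() {
  FirstPrivateArgInfoTy A{16};
  FirstPrivateArgInfoTy B(std::move(A)); // OK: move *construction* compiles
  // A = std::move(B); // error: move *assignment* is deleted
  // So a container whose move constructor is written as
  //   Container(Container &&O) { *this = std::move(O); }
  // cannot be instantiated with element types like this one.
  (void)B;
}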

1511

mutable was added because we need to call a non-const member function of PrivateArgumentManager (i.e., free).

I know that makes the lambda a function with internal state, since multiple calls to it will generate different results, but I don't know of another approach to it.

Maybe use call_once (IMHO, a little bit overkill) or remove the lambdas altogether and use another approach to store the post-processing functions and their payload. What do you think?

openmp/runtime/src/kmp_tasking.cpp
1145

Any comments on whether we can move the __kmp_unexecuted_hidden_helper_tasks decrement to this place?

5157

Uhm, I did not know about that. Although I think such a requirement makes sense, it may be out of the scope of this patch.

What we could do is check if the current thread is registered inside libomp somehow, falling back to the current execution path that does not depend on the task team information. Do you know if we can use __kmpc_global_thread_num's return value to verify that? Maybe assert that the returned GTID is valid and within a well-known range (e.g., [0, NUM_REGISTERED_OMP_THREADS]).

Just a note, NUM_REGISTERED_OMP_THREADS is not a valid variable; I just don't know where, or even if, such information is stored. Do you know where I can find this?

gValarini marked an inline comment as done.Sep 14 2022, 1:23 PM
gValarini added inline comments.
openmp/runtime/src/kmp_tasking.cpp
5187

Typo. It should have been current thread!

New set of comments, some minor but not all. Some comments are "out-of-order" as I started commenting top-down while not understanding it all. Read accordingly.

openmp/libomptarget/include/device.h
441

Reading this I don't know what this returns. SUCCESS if it completed and FAIL otherwise? Or FAIL only if something failed? "Must be called multiple times" is also unclear to me; I doubt that we should put that sentence here.

Update: I think I now understand the latter part but if I'm right we should change the interface.
So, queryAsync is supposed to be called before isDone to make sure isDone returns the right value, correct? If so, we should not expose queryAsync to the user as there doesn't seem to be a reason to call it otherwise. Arguably, calling it doesn't provide information, just a state change, thus a secondary query is necessary.

openmp/libomptarget/include/omptarget.h
215

Nit: Rename argument to avoid shadowing.

Make the version that takes the sync type, and probably the sync type, private. IsDone can call the private version, users only the blocking one.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
1272

return OFFLOAD_SUCCESS;

will reduce indentation and logic later on.

1278
openmp/libomptarget/src/interface.cpp
86

This is different but similar to the condition used in the other new "helper" below. I have the same concerns as there. When would we ever not call the target data function?

95

Is FromMapper ever set to true? Did I miss that?

308

There is more duplication in the callees to be moved here, no?
The two last arguments could be omitted and grabbed from AsyncInfo, also in the above rewrite.
Dispatch is not used?

341

This I don't understand. Why do we have to wait to enqueue the kernel? And even if, how does this not accidentally skip the target region and we will never execute it at all? Long story short, I doubt the conditional here makes sense.

openmp/libomptarget/src/omptarget.cpp
41–50

This would move to isDone below.

openmp/libomptarget/src/private.h
227

Should this be guarded by IsNew?

240

I don't understand what we want/need this dispatch idea for. It seems to skip operations but I don't understand how we would not forget about them and go back.

openmp/runtime/src/kmp_tasking.cpp
5186
gValarini marked 13 inline comments as done.Oct 17 2022, 12:46 PM
gValarini added inline comments.
openmp/libomptarget/include/device.h
441

I should probably update the documentation of this function to reflect the new one added to omptargetplugin.h:__tgt_rtl_query_async, which states the following:

Queries for the completion of asynchronous operations. Instead of blocking the calling thread as __tgt_rtl_synchronize, the progress of the operations stored in AsyncInfo->Queue is queried in a non-blocking manner, partially advancing their execution. If all operations are completed, AsyncInfo->Queue is set to nullptr. If there are still pending operations, AsyncInfo->Queue is kept as a valid queue. In any case of success (i.e., successful query with/without completing all operations), return zero. Otherwise, return an error code.

Thus, queryAsync (which calls __tgt_rtl_query_async) is a non-blocking version of synchronize. That means we must call it multiple times until all operations are completed and the plugin invalidates the queue inside AsyncInfo. Here, AsyncInfoTy::isDone is just a helper function that indicates whether the device-side operations are completed based on said queue. We need to expose queryAsync to its user AsyncInfoTy so it can call the non-blocking implementation.

Considering your comment, what do you think of making things more explicit by adding an output flag argument to queryAsync (and thus to __tgt_rtl_query_async) that is set to true if all operations are completed and false otherwise?

int32_t queryAsync(AsyncInfoTy &AsyncInfo, bool &IsCompleted);
openmp/libomptarget/include/omptarget.h
215

Thanks, I am renaming the type to SyncTypeTy to match the other type names.

Regarding the second comment, I don't quite understand what you mean by:

IsDone can call the private version, users only the blocking one.

isDone only checks if the operations inside an AsyncInfoTy instance are completed or not, it does not call any plugin function at all. Are you suggesting that we move all the non-blocking synchronization code into isDone? If so, this means we would have some code duplication regarding the post-processing functions due to two separate synchronization paths, but if you think that is better I can do it.


openmp/libomptarget/include/omptargetplugin.h
160

@jdoerfert any comments on the new function and its doc?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
1272

Perfect, done!

1278

Thanks, done!

openmp/libomptarget/src/interface.cpp
86

That was a code error. The target helper should use the Dispatch variable as well. Thanks for noticing.

86

With the RFC implemented, we are now re-enqueuing the same task multiple times until all the device side operations are completed. Because of that, we may call the __tgt_target_* functions multiple times as well. Since we want to dispatch the operations only once, we call the target data functions only when the target task is first encountered. The next calls will only synchronize the operations instead of dispatching them again, that's why AsyncInfo.synchronize is always called right below it.

95

Nope, it is not. I am removing it from the arguments and always passing false.

308

Yep, that was an error on my part, Dispatch should be used instead of AsyncInfo.isDone().

Regarding the other arguments, they are obtained from the wrapper TaskAsyncInfoTy, not from AsyncInfoTy. I can change that to unify the wrappers' code and AsyncInfoTy, but I would probably be putting two different responsibilities into the AsyncInfoTy struct. What do you think?

341

You are right, this was a leftover from before the refactoring of the interface file. Although it worked, it did so only because the queue pointer was null and isDone would return true at first. Replaced it with the Dispatch variable.

openmp/libomptarget/src/omptarget.cpp
41–50

Ok, I can do it, no problem. But just to make sure I got it right with respect to the other comments: you are suggesting this so we would use synchronize for the blocking synchronization and isDone for the non-blocking one, correct? If so, just remember that in the current code the PostProcessingFunctions must be called in both cases, so isDone would need to be called when synchronizing in a blocking manner as well.

openmp/libomptarget/src/private.h
227

Nope, IsNew indicates that a new task-attached AsyncInfo has just been allocated, not whether we should deallocate it. The variable is primarily used to indicate that we must dispatch new operations to the new handle. Maybe I should rename it to just ShouldDispatch. Deallocation is always done when AsyncInfo->isDone() returns true, which is checked beforehand.

240

Here is the main idea:

  1. The first time a target nowait (target inside an OpenMP task) is executed, a new AsyncInfo handle is created and stored in the OpenMP taskdata structure of said task. Since this is the first time we are executing it (which is detected by the IsNew variable), we dispatch the device-side operations and populate the post-processing function array by calling the proper omptarget.cpp functions (e.g., targetDataBegin, targetDataEnd, ...).
  2. Afterward, if the device operations are still pending, the OpenMP task is re-enqueued for execution.
  3. Later, when the task is re-executed, the same outlined function called at step 1 is called again. Here we can recover the AsyncInfo handle from the OpenMP taskdata and just synchronize it. Since this time the handle is not new, we know the operations were already dispatched previously and should not be dispatched again.

I have a presentation that explains it on slides 19-24, but I believe I am failing to describe that in the code. I'll try to come up with some better documentation for this dispatch/synchronize idea.
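In the meantime, here is that flow in rough, illustrative code (the helper names are placeholders, not the actual runtime symbols):

void executeTargetTask(kmp_task_t *Task, DeviceTy &Device) {
  // The async handle lives in the task's taskdata and survives re-executions.
  void **HandleSlot = getTaskAsyncHandleSlot(Task);
  if (*HandleSlot == nullptr) {
    // Step 1: first execution. Create the handle, dispatch the device-side
    // operations, and record the post-processing functions.
    *HandleSlot = new AsyncInfoTy(Device);
    dispatchDeviceOperations(*static_cast<AsyncInfoTy *>(*HandleSlot));
  }
  // Steps 2-3: non-blocking synchronization on every (re-)execution.
  auto *AsyncInfo = static_cast<AsyncInfoTy *>(*HandleSlot);
  if (AsyncInfo->isDone()) {
    delete AsyncInfo;
    *HandleSlot = nullptr; // signals the tasking runtime to finish the task
  }
  // Otherwise the tasking runtime re-enqueues the task (step 2).
}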

openmp/runtime/src/kmp_tasking.cpp
5157

@tianshilei1992 any comments on this?

I'll wait for the updated version to go through again. Below are two clarification questions. I think I am now closer to understanding some of the stuff I was confused about. If we end up keeping this scheme, we need to adjust some names. I am hoping we can simplify a few things, though.

openmp/libomptarget/src/private.h
227

I'm worried here that we might not delete the AsyncInfo, or might delete it multiple times. Are you saying there is exactly one TaskAsyncInfoTy that will own the AsyncInfo object at any given time? If not, how do we avoid a double free?

240

Ok, that makes more sense. Now to help (even) me understand this, why do we need to call the functions from step 1 in step 3? We seem to use the "Dispatch" argument to skip most of what they do (the target data part of a targetDataXXX) anyway, no?

jdoerfert added inline comments.Oct 17 2022, 5:06 PM
openmp/libomptarget/include/device.h
441

My point is: queryAsync is useless if not followed by an isDone, right?
Why do we expose it in the first place? We should merge the two functions, potentially keeping isDone around, but at least avoiding the implicit dependence that you have to call one before the other for any meaningful use case. The updated interface basically does this. It merges the "isDone" query into this function, allowing users to call isDone standalone and this function standalone while getting meaningful results each time.

tianshilei1992 added inline comments.Oct 17 2022, 5:09 PM
openmp/runtime/src/kmp_tasking.cpp
5157

IIRC some return values, other than the real thread ids, are reserved for specific purposes. That's one of the reasons I didn't use negative thread ids for hidden helper threads. I don't know for sure if there is a value designed to say it is not an OpenMP-managed thread. We can probably add one.

gValarini updated this revision to Diff 468566.Oct 18 2022, 8:34 AM
gValarini marked 4 inline comments as done.Oct 18 2022, 8:34 AM

Address more code review changes. This update also fixes the dispatch logic on target regions.

Added some more comments about the new execution flow and the thread id problem.

openmp/libomptarget/include/device.h
441

Uhm, I think I got your point. I'll update AsyncInfoTy::isDone so it can be called standalone without a prior call to AsyncInfoTy::synchronize (which calls the device DeviceTy::queryAsync). This indeed makes the interface better.

I was a little bit confused when you said DeviceTy::queryAsync should not be exposed, but now I got it.

openmp/libomptarget/include/omptarget.h
215

Just a note, now I got the correct idea: we should make isDone callable as a standalone function!

openmp/libomptarget/src/private.h
227

When executing a target nowait region, the actual owner of the AsyncInfo is the task itself. The structure is allocated when the task first executes and calls any of the libomptarget functions (IsNew is true) and it is deallocated when all the device-side operations are completed (AsyncInfo::isDone returns true).

Here, TaskAsyncInfoTy is just a wrapper around a task-owned AsyncInfoTy (stored inside the OpenMP task data) to mainly automate the allocation and deallocation logic. But following the OpenMP execution flow, since a task is owned and executed by only a single thread at any given time, only one TaskAsyncInfoTy will be managing the task-owned AsyncInfoTy object. This should avoid any double frees, but I understand this could be a weak assumption. If that is enough I could add documentation stating it, but probably having some code checks for that would be best. Maybe assertions at the task deallocation function ensuring no valid AsyncInfoTy address is left?

240

That happens because the dispatch and synchronization logic are placed in the same interface function. The first call to that function done by a task dispatches the operations, while the subsequent calls try to do the non-blocking synchronization.

Maybe a better way of doing it would be to add a new interface function with the sole purpose of executing said synchronization. This way, when a task is re-executed, it calls this new function to only do the synchronization instead of the previous outline function. What do you think? This can better split the dispatch and synchronization code.

openmp/runtime/src/kmp_tasking.cpp
5157

That would be perfect. I know we have KMP_GTID_DNE (value of -2) that represents the return value of a non-existent thread ID. My only problem is: do you know if __kmpc_global_thread_num returns that when called from a non-OpenMP thread? I'll do some local checking on that!

gValarini updated this revision to Diff 469237.Oct 20 2022, 8:16 AM

This update:

  • Internalize SyncType into AsyncInfoTy
  • Split dispatch and synchronization code paths. No more Dispatch checks! New interface function for target nowait synchronization.
  • Make isDone a standalone function. No need to call synchronize before isDone!

@jdoerfert I believe the new revision has a better code structure for the dispatch and synchronize stages. Now we have an exclusive function only for synchronization. No more Dispatch checks!

@tianshilei1992 I am still checking on the validity of the GTID returned from __kmpc_global_thread_num when called from outside an OpenMP thread. The best solution would be to pass the address of the taskdata async handle through the interface for the *_nowait functions directly as a new parameter, but that would require changes to the code generation step. I can do that right now, but I would prefer to change the clang code generation in a different patch, as I am not that familiar with it. What do you think?

@jdoerfert I believe the new revision has a better code structure for the dispatch and synchronize stages. Now we have an exclusive function only for synchronization. No more Dispatch checks!

Great, I think we are almost there. The code looks much cleaner and the approach is much clearer than it was in the beginning (IMHO, I hope people agree).

I left some final clarification questions and some cleanup requests.

One conceptual question which is not blocking the patch though:
Does this approach require hidden helper threads to execute the target task or could we enable it for non-hidden helper threads as well?

openmp/libomptarget/include/omptarget.h
224

Make it clear that this happens only once. Either here or via synchronize. Right now it could be read like every isDone call might invoke the post-processing functions.

376

It's void return but comment talks about the return value.

openmp/libomptarget/src/interface.cpp
64

If you make this a templated function accepting the (sub)type of the AsyncInfo object instead of the object itself, you can move all the remaining duplication at the call sites (namely: checkCtorDtor, get device, create AsyncInfo) into this function. WDYT?

303

Same comment as above wrt. templated version. The duplication we introduce is something I would like to avoid.

450

Do we know the above call is "noreturn"? If not, we should explicitly exit here. On second thought, we should exit either way.

453
openmp/libomptarget/src/omptarget.cpp
79

This is not a good idea, I think. The state of PostProcessingFunctions is undefined afterwards, even if this works in practice. Simply iterate PostProcessingFunctions and then clear it.

openmp/libomptarget/src/private.h
212

Is there a ! missing?

openmp/runtime/src/kmp_tasking.cpp
1802

Much better than the "execute but don't actually execute" version before. Thanks!
Do we need to, or should we, act on the updated async_handle value (I mean if it actually finished, should we do something different than when it hasn't)? Or is that already done?

5181

This can fail, right? If so, we should report it to the user and deal with it properly. Otherwise we should assert it can't fail.

One conceptual question which is not blocking the patch though:
Does this approach require hidden helper threads to execute the target task or could we enable it for non-hidden helper threads as well?

It should work with HHT and normal OpenMP threads (e.g., inside a parallel/single region). The only place where we could have problems would be in the re-enqueueing of the tasks, but that is already taken care of: if HHTs are disabled, the task will be given to another normal OpenMP thread; if they are enabled, the task will be given to another HHT. I'll do some more local testing, but we should not have any problems in either config.

tianshilei1992 added inline comments.Oct 21 2022, 1:17 PM
openmp/libomptarget/include/omptarget.h
189

SyncTypeTy looks weird. It's like having an LLVM class called TypeTy. I think SyncTy or SyncType are both fine.

openmp/libomptarget/src/interface.cpp
84

nit: targetDataFunction

openmp/runtime/src/kmp_tasking.cpp
5191

Do we need to check if gtid is valid here?

I have a general question. D81989 is using task yield to potentially improve concurrency instead of blocking the hidden helper task. I know task yield may sometimes have side effects. My question is: compared with using task yield, is creating and enqueuing tasks (repeatedly) better?

gValarini marked 15 inline comments as done.

This update:

  • Update AsyncInfo documentation
  • Reduce code duplication in the interface
  • Rename SyncTypeTy to SyncTy
  • Fix exporting __tgt_target_nowait_query
  • Refactor task async handle acquisition
  • Optimize fast queue completion
gValarini marked an inline comment as not done.Oct 25 2022, 10:19 AM

I believe I answered and fixed most of the comments on this revision. Waiting for the next round. 😉

openmp/libomptarget/include/omptarget.h
189

It makes sense, that was a little redundant. It is now renamed to SyncTy. Thanks!

376

Yep, that was a leftover from some previous revisions. Thanks!

openmp/libomptarget/src/interface.cpp
64

Indeed, that is a nice idea. Since TaskAsyncInfoWrapperTy is a wrapper around AsyncInfoTy, I only needed to acquire a reference to it so we would end up always using AsyncInfoTy.

84

Uhm, TargetDataFunction is a function pointer. Shouldn't we also capitalize the first word in this case?

303

Done as well.

450

Uhm, yep, you are right. We should always exit here. I am converting to using FATAL_MESSAGE0, so we directly abort the program.

openmp/libomptarget/src/omptarget.cpp
79

Uhm, really? When moving a SmallVector like this, wouldn't PostProcessingFunctions be emptied and all the values moved to Functions?

openmp/libomptarget/src/private.h
212

I forgot to submit some local changes! Done.

openmp/runtime/src/kmp_tasking.cpp
1802

Nice.

And yep, that is already done at the task finalization. Take a look at the changes on kmp_tasking.cpp:1099. When async_handle is not null, the task is re-enqueued, otherwise, the task is finished normally.

5181

Uhm, that makes sense. I'll try to add this functionality and fall back to the old execution flow if it fails.

5191

Uhm, yep we indeed need to check it. I'll add it here and return false if gtid is invalid. This way we can fall back to the old execution flow.

Generally fine from my end. @tianshilei1992 wdyt?

@kevinsala, FYI, there will be a new plugin API we need to port over to the new plugins.

openmp/libomptarget/src/omptarget.cpp
79

The standard says PostProcessingFunctions is in an unspecified state after the move. If we look at SmallVector we know the state, but I don't understand why we need to rely on implementation-defined behavior here at all. We don't save any lines, nor any work, compared to just iterating PostProcessingFunctions and then clearing it, no?

openmp/libomptarget/src/private.h
227

All asserts should have messages.

TBH I'd like to see/understand whether task yield is worse than this method. If not, I'm not sure why we'd want so much complicated code.

Another thing is, if we only have one target task, which is really long: based on this method, it's gonna check if it is done, enqueue the task, and then the task is scheduled to execute (potentially by another thread, which might hurt locality), check, enqueue the task, again and again. Though the existing method blocks the whole thread, at least it will not keep executing. I think we actually need a counter for the bottom-half task. If it is executed many times, it indicates we don't have enough other tasks to execute; in that case, just blocking the thread is fine.

TBH I'd like to see/understand whether task yield is worse than this method. If not, I'm not sure why we'd want so much complicated code.

Sorry, I missed your previous comment.

I believe this method can be better than using task yield because it has a lower probability of "starvation"; let me explain. AFAIK, task yield is currently implemented by executing other tasks in the middle of the current task's execution, meaning the task execution state (and the order of resumption) is stored inside the call stack. This can incur a starvation-like problem, where an earlier task whose operations have completed cannot be finished because it is at the bottom of the call stack. This can also hold up other tasks that depend on the "starved" one. Another problem is that, depending on the number of ready target regions, a program can even surpass the stack limit due to many in-flight tasks. If task yield were implemented in a coroutine-like model (maybe some future work), where yielded tasks could be re-enqueued and re-ordered, we would probably use it, since that makes the code much simpler.

Another thing is that this can also be an initial point of integration for the device-side resolution of dependencies. (although D81989 also did that as well).

gValarini added a comment.EditedOct 26 2022, 8:34 AM

Another thing is, if we only have one target task, which is really long: based on this method, it's gonna check if it is done, enqueue the task, and then the task is scheduled to execute (potentially by another thread, which might hurt locality), check, enqueue the task, again and again. Though the existing method blocks the whole thread, at least it will not keep executing. I think we actually need a counter for the bottom-half task. If it is executed many times, it indicates we don't have enough other tasks to execute; in that case, just blocking the thread is fine.

Uhm, that makes perfect sense. I'll implement the counting mechanism and update the patch.

gValarini updated this revision to Diff 474034.Nov 8 2022, 9:47 AM
gValarini marked 4 inline comments as done.

This update:

  • Unify synchronize and isDone methods. No more code duplication between them.
  • Add query async to plugin-nextgen.
  • Decide sync method based on per-thread exponential backoff counting.
gValarini marked 3 inline comments as done.Nov 8 2022, 10:05 AM
gValarini added inline comments.
openmp/libomptarget/src/omptarget.cpp
79

I was thinking about a future use case: a post-processing function that generates more asynchronous device-side operations. In that case, it may want to add a new post-processing function to the vector, but we cannot do that while iterating over it. The spec says that if a push_back/emplace_back resizes the vector, any previous iterator is invalidated, which would make the loop invalid. I think that is the case for SmallVectors as well.

Right now, the three post-processing functions do not do that. They all do synchronous device operations. But if in the future, someone adds a post-processing function that does that, it cannot be blindly done without taking care of this situation.

Do you think we could leave this "unimplemented" right now? If so, I can do the iteration-then-clear approach.

Just a "side question": which standard does SmallVector follows? I am asking that because the STL says that a vector is guaranteed to be empty after a move. If that is the case for SmallVectors, thus PostProcessingFunctions would be in a valid state, no?

1199

@jdoerfert any new comment on this?

1511

@jdoerfert any new comment on this?

openmp/runtime/src/kmp_tasking.cpp
1145

@tianshilei1992 is this correct?

I only have one remaining question. @tianshilei1992 might have more though.

openmp/libomptarget/src/interface.cpp
461

Why is this thread_local? Should it be tied to the async info object with non-blocking tasks?

openmp/libomptarget/src/omptarget.cpp
79

If you are worried about inserting, use

for (int i = 0; i < C.size(); ++i)

Anyhow, let's keep it this way for now but add an explicit clear at the end. Just to be sure (and explicit).
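So the loop would end up looking something like this sketch, index-based so post-processing functions could even append new entries safely:

for (size_t I = 0; I < PostProcessingFunctions.size(); ++I)
  if (int RC = PostProcessingFunctions[I]())
    return RC;
// Be explicit about the final state instead of relying on moved-from behavior:
PostProcessingFunctions.clear();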

tianshilei1992 added inline comments.Nov 23 2022, 8:23 PM
openmp/libomptarget/src/interface.cpp
461

Yeah, I'd expect the counter to be tied to a task.

Reverse ping, let's get this in; we can build dependence support via events on top of it.
There is only the thread_local remark remaining, I think.

gValarini updated this revision to Diff 481658.Dec 9 2022, 7:58 AM

This update:

  • Fix post-processing function run loop. Clear is now explicit.
gValarini marked 3 inline comments as done and an inline comment as not done.Dec 9 2022, 8:41 AM

@jdoerfert there are some other comments pending. How should we proceed?

openmp/libomptarget/src/interface.cpp
461

@jdoerfert @tianshilei1992 my idea here is that we should be able to go back and forth between non-blocking and blocking synchronization in a per-thread manner, depending on each thread's own state.

  • If the thread is doing too much non-blocking synchronization (similar to spin-wait), we should block-wait at least once to save resources.
  • If the thread has just completed a target region, let it go through other target regions in a non-blocking manner for a while, possibly completing other tasks and enqueueing some more.

This way, we allow the runtime threads to adapt themselves to the target region payloads.

I see two possible problems in placing the counter in a per-task manner:

  1. We can only go from non-blocking to blocking synchronization. Once the task is completed, the counter is destructed and won't be used anymore, so we lose its information for future task executions.
  2. If we have many tasks with high enough counters at the top of a thread's task queue, that same thread will be forced to only block-synchronize them, even though we could have tasks ready for completion at the bottom of the queue.

What do you think about the above points? Do they make sense?
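A sketch of the per-thread decision being described (the threshold and the exact backoff policy are illustrative assumptions):

thread_local unsigned UnsuccessfulQueries = 0; // reset when a region completes

bool shouldBlockSync() {
  // After too many unsuccessful non-blocking queries, block-wait once to
  // save resources, then return to non-blocking synchronization.
  if (UnsuccessfulQueries >= 32) {
    UnsuccessfulQueries = 0;
    return true;
  }
  ++UnsuccessfulQueries;
  return false;
}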

openmp/libomptarget/src/omptarget.cpp
79

Uhm, yep, making it explicit is better. What do you think?

jdoerfert accepted this revision.Dec 9 2022, 10:16 AM

LG

openmp/libomptarget/src/interface.cpp
461

Hm, it seems reasonable for your scenario. It is unclear what we should optimize for here. I'm OK with keeping it like this as it might be better for a "always blocking tasks" and "consistently mixed task" load. The per-task impl. would only be good if we have "totally independent tasks", I guess.

openmp/libomptarget/src/omptarget.cpp
79

It's fine for now.

1199

It's ok to remove the const.

1511

Add a TODO to look into this in the future.

openmp/runtime/src/kmp_tasking.cpp
1145

I think @tianshilei1992 mentioned to me this should be fine.

This revision is now accepted and ready to land.Dec 9 2022, 10:16 AM
tianshilei1992 accepted this revision.Dec 9 2022, 10:18 AM
gValarini updated this revision to Diff 481683.Dec 9 2022, 10:22 AM

This update:

  • Add a TODO to remove mutables from post-processing
gValarini marked 10 inline comments as done.Dec 9 2022, 10:24 AM
gValarini updated this revision to Diff 481699.Dec 9 2022, 11:06 AM

This update:

  • Fix stream return on next-gen plugin

Getting a compiler crash with this change:

# End machine code for function __tgt_target_nowait_query.

*** Bad machine code: FrameSetup is after another FrameSetup ***
- function:    __tgt_target_nowait_query
- basic block: %bb.5 init.check (0x556af1cc4098)
- instruction: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def $rsp, implicit-def $eflags, implicit-def $ssp, implicit $rsp, implicit $ssp, debug-location !6722; llvm-project/openmp/libomptarget/src/interface.cpp:328:42

*** Bad machine code: FrameDestroy is not after a FrameSetup ***
- function:    __tgt_target_nowait_query
- basic block: %bb.5 init.check (0x556af1cc4098)
- instruction: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp, debug-location !6722; llvm-project/openmp/libomptarget/src/interface.cpp:328:42
fatal error: error in backend: Found 2 machine code errors.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: clang++ --target=x86_64-unknown-linux-gnu -DEXPENSIVE_CHECKS -DGTEST_HAS_RTTI=0 -DOMPTARGET_DEBUG -DOMPT_SUPPORT=1 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -Illvm/include -Iinclude -Iruntimes/runtimes-bins/openmp/runtime/src -Illvm-project/openmp/libomptarget/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -Wall -Wcast-qual -Wformat-pedantic -Wimplicit-fallthrough -Wsign-compare -Wno-enum-constexpr-conversion -Wno-extra -Wno-pedantic -std=c++17 -g -fPIC -fno-exceptions -fno-rtti -gsplit-dwarf -MD -MT openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o -MF openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o.d -o openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o -c llvm-project/openmp/libomptarget/src/interface.cpp

Looks like it is caused by the placement of the static variable declaration.