
[OpenMP] Add non-blocking support for target nowait regions
Needs Review · Public

Authored by gValarini on Aug 16 2022, 5:47 PM.

Details

Summary

This patch better integrates the target nowait functions with the tasking runtime. It splits the nowait execution into two stages: a dispatch stage, which triggers all the necessary asynchronous device operations and stores a set of post-processing procedures that must run after those operations complete; and a synchronization stage, which synchronizes the previous operations in a non-blocking manner and runs the appropriate post-processing functions. If the operations have not completed during the synchronization stage, the attached hidden helper task is re-enqueued to any hidden helper thread to be synchronized later, allowing other target nowait regions to be dispatched concurrently.

Diff Detail

Event Timeline

gValarini created this revision. Aug 16 2022, 5:47 PM
Herald added a project: Restricted Project. Aug 16 2022, 5:47 PM
gValarini requested review of this revision. Aug 16 2022, 5:47 PM
gValarini edited the summary of this revision. Aug 16 2022, 5:50 PM
ye-luo added a subscriber: ye-luo. Aug 16 2022, 5:51 PM

If the operations are not completed during the synchronization stage, the attached hidden helper task is re-enqueued to any hidden helper thread to be synchronized later

Does this explicitly depend on helper threads and helper tasks, or can regular OpenMP threads and tasks be used?

gValarini edited the summary of this revision. Aug 16 2022, 5:51 PM
gValarini updated this revision to Diff 453181. Aug 16 2022, 6:16 PM

Added missing commits

If the operations are not completed during the synchronization stage, the attached hidden helper task is re-enqueued to any hidden helper thread to be synchronized later

Does this explicitly depend on helper threads and helper tasks, or can regular OpenMP threads and tasks be used?

This is currently intended only for helper tasks attached to target nowait regions.


The existing target nowait implementation doesn't distinguish between using helper tasks or not. Setting LIBOMP_USE_HIDDEN_HELPER_TASK=0 dispatches target tasks as regular tasks. As long as there are available OpenMP threads, these tasks can still run concurrently and gain asynchronicity, although this scheme has the issue of actively waiting and consuming host thread resources. I think this is the same issue you are trying to address with helper tasks.
Your scheme of splitting target tasks doesn't seem necessarily tied to helper tasks. Do you have a specific reason for restricting this feature to only target tasks?

gValarini updated this revision to Diff 453452. Aug 17 2022, 3:43 PM

Update re-enqueue logic to support target nowait regions with hidden helper threads disabled

Right now the synchronization is based on streams. Have you thought about synchronizing with a CUDA event and returning the stream to the pool early?

The existing target nowait implementation doesn't distinguish between using helper tasks or not. Setting LIBOMP_USE_HIDDEN_HELPER_TASK=0 dispatches target tasks as regular tasks. As long as there are available OpenMP threads, these tasks can still run concurrently and gain asynchronicity, although this scheme has the issue of actively waiting and consuming host thread resources. I think this is the same issue you are trying to address with helper tasks.

You have a good point. I thought you were talking about using the re-enqueueing scheme with normal OpenMP task regions.

Your scheme of splitting target tasks doesn't seem necessarily tied to helper tasks. Do you have a specific reason for restricting this feature to only target tasks?

There was no specific reason to limit that to hidden helper tasks. I have updated the code so it can also work when LIBOMP_USE_HIDDEN_HELPER_TASK=0. Here is the newly implemented behavior:

  • HHT enabled: target nowait regions will use the new non-blocking synchronization scheme, and tasks will be re-enqueued/pushed to other hidden helper threads.
  • HHT disabled / Task with a task team: target nowait regions will use the new non-blocking synchronization scheme, and tasks will be re-enqueued/pushed to the other OpenMP threads of the same tasking team, e.g., when a target nowait is created inside a parallel region.
  • HHT disabled / Task without a task team: target nowait regions will not use the new non-blocking synchronization scheme. Synchronization will be done in a blocking manner as before, respecting the case where the target task is serialized, e.g., when a target nowait is created outside any parallel region.

What do you think about this implementation?


This is a lot better. The less specialization of the feature the better.

Right now the synchronization is based on streams. Have you thought about synchronizing with a CUDA event and returning the stream to the pool early?

I have not thought about that yet, but it could be a nice optimization. Since the CUDA plugin currently maintains a resizable pool of streams for each device with an initial size of 32, I thought that for a first implementation this could be enough.

CUDA events have the same API as streams for non-blocking synchronization via cudaEventQuery, so we could store a single event (completionEvent) per AsyncInfo and use it when synchronizing with SyncType::NON_BLOCKING. I have one question though: does querying a CUDA event for completion synchronize all the operations prior to the event on the stream? Or must another host thread synchronize the stream? If synchronizing only the event is enough, it would make using them much simpler.


My second thought on this is: let us do stream sync for now.
NVIDIA: if we sync with an event and return the stream to the pool, tasks may get serialized if two happen to land on the same stream.
AMD: at the HSA level, there are only signals (events).
Level Zero: it depends on which type of command list is being used.
So it seems libomptarget should stay flexible and let each plugin decide which mechanism to use.

Thanks for the patch. I'll take a close look next week.

Drive by.

openmp/libomptarget/include/omptarget.h
194

Drive by: I don't believe we want such a generic interface. The post-processing task should just be a fixed function, not multiple functions unknown at compile time.

openmp/libomptarget/src/omptarget.cpp
939

Just make this a standalone static function.

First of all, sorry for the late reply.

My second thought on this is: let us do stream sync for now.
NVIDIA: if we sync with an event and return the stream to the pool, tasks may get serialized if two happen to land on the same stream.
AMD: at the HSA level, there are only signals (events).
Level Zero: it depends on which type of command list is being used.
So it seems libomptarget should stay flexible and let each plugin decide which mechanism to use.

Ok, we can leave it as is for now. Just a comment on the NVIDIA case for future development: although we may not completely avoid the serialization problem using the pool of streams, I believe we can reduce it if the pool access pattern is changed. Currently, streams are acquired/released from/to the pool in a stack-like manner, which induces frequent re-use of the streams nearest to the top of the stack/pool (in the code, the streams near the beginning of the Resources vector). By changing this access pattern to be round-robin-like, the stream use frequency is evened out over the pool and serialization becomes less of a problem.

Thanks for the patch. I'll take a close look next week.

Perfect, thanks!

gValarini updated this revision to Diff 455948. Aug 26 2022, 9:44 AM
gValarini marked an inline comment as done.

Update documentation and refactor some code

gValarini added inline comments. Aug 26 2022, 10:06 AM
openmp/libomptarget/include/omptarget.h
194

Just adding some context to why it was done this way:

  • Lambdas can easily store any local data needed by the post-processing procedures. This allows them to be generated locally by the normal code flow (which may improve code maintenance) and then stored in the lambda's implicit structure (solving any lifetime problems).
  • Having a function vector allows us to easily compose post-processing procedures. This is the case for target nowait regions, which should run both the targetDataEnd and processDataAfter post-processing functions.
  • If in the future we need to generate more asynchronous operations inside the post-processing functions, this can easily be done by pushing more lambdas to the vector.

With all that in mind, I know std::function may lead to an additional heap allocation and one more level of indirection due to type erasure, prohibiting some code optimizations. If that is not desirable despite the presented context, I can change how the post-processing procedures are stored/called, but I would really like to keep some of the points described above, especially the lifetime correctness of the post-processing data. Do you have something in mind that could help me with that?

Maybe defining a different struct for each post-processing function and storing it in the AsyncInfoTy as a variant could be enough. What do you think?

jdoerfert added inline comments. Aug 26 2022, 10:55 AM
openmp/libomptarget/include/omptarget.h
194

I would have assumed that if we have two potential callbacks, let's say static functions F and G, we would have two members, both void*:

void *PayloadForF = nullptr;
void *PayloadForG = nullptr;

and if they are not null we call F and G respectively, passing the payload.

I'm fine with keeping it for now this way, we can see if there is a need to change it.

openmp/libomptarget/include/omptargetplugin.h
160

Describe the return value; it's not clear what would be returned if it does or doesn't synchronize.

openmp/libomptarget/src/device.cpp
634–635
openmp/libomptarget/src/interface.cpp
164

I assume we can outline some of this as it is probably the same as in the sync version, right? Let's avoid duplication as much as possible. Same below.

239

So, when do we call this but don't want to actually do targetDataUpdate? I am confused. Same above.

347

Clang format.

openmp/libomptarget/src/omptarget.cpp
698

Documentation, plz.

1209

Why are the const members problematic?

1521

FWIW, mutable is really not my favorite way of handling things.

openmp/libomptarget/src/private.h
214–217

Here and elsewhere, we should prefer early exit and no else after return.

224–228
openmp/runtime/src/kmp_tasking.cpp
1145

@tianshilei1992 you need to look at these changes.

5187

current task?

tianshilei1992 added inline comments. Aug 30 2022, 6:48 AM
openmp/runtime/src/kmp_tasking.cpp
5157

libomptarget (for now) doesn't require that the thread be an OpenMP thread, and using libomp's gtid generally breaks. Either we add the requirement, which needs to be discussed further, or an alternative implementation method is needed. If libomp is executed on another fresh thread, a new root will be created.

tianshilei1992 added inline comments. Aug 30 2022, 6:59 AM
openmp/libomptarget/src/omptarget.cpp
700

nit: a vector of PostProcessingInfo called PostProcessingPtrs is confusing.

openmp/runtime/src/kmp_tasking.cpp
1099
jdoerfert retitled this revision from "Add non-blocking support for target nowait regions" to "[OpenMP] Add non-blocking support for target nowait regions". Wed, Sep 7, 10:21 AM

Refactor interface code and address some code review changes

gValarini marked 5 inline comments as done. Wed, Sep 14, 1:19 PM
gValarini added inline comments.
openmp/libomptarget/include/omptarget.h
194

Okay, for now I'll keep it like this then.

openmp/libomptarget/include/omptargetplugin.h
160

Good point. I have updated both the documentation and the function name to better reflect what this new interface should do. Do you think it is more clear now?

openmp/libomptarget/src/interface.cpp
164

Yep, you are correct. I have created two new "launchers" that can unify most of the code paths for the execution and data-related functions for the normal and nowait cases. Since all data-related interface entries practically have the same signature, a single launcher is enough for them all.

A new class called TaskAsyncInfoTy also unifies the code related to managing the task async handle when executing target nowait regions.

What do you think about this new code structure?

239

The goal is to dispatch the device-side operations (i.e., call the functions in the omptarget.cpp file) only when a new async handle is created. If we detect that the task already contains an async handle, it means the device-side operations were already dispatched and we should only try to synchronize them in a non-blocking manner.

The new TaskAsyncInfoTy class has a function called shouldDispatch that encapsulates this detection logic with proper documentation. Do you think it is clearer now? Should we add a comment to each call site as well?

openmp/libomptarget/src/omptarget.cpp
698

Done. Could you check if I missed anything?

700

Oops. Thanks, I have updated the variable name.

1209

In summary, we want to be able to move PrivateArgumentManagerTy instances into the post-processing lambdas, so their lifetime is automatically managed by them.

The problem is with how llvm::SmallVector implements its move constructor. Unfortunately, it is implemented in terms of move assignment instead of as a proper constructor, meaning it cannot be generated for structs with const members. If we replace FirstPrivateArgInfo with a std::vector, the problem does not happen because the STL properly implements a move constructor for vectors.

Since I think we no longer want to use std::vector, I just removed the const from the members, since they are not accessible outside the PrivateArgumentManagerTy class anyway.

What do you think of this approach?

1521

mutable was added because we need to call a non-const member function of PrivateArgumentManager (i.e., free).

I know that makes the lambda a function with internal state, since multiple calls to it will generate different results, but I don't know of another approach.

Maybe use std::call_once (IMHO, a little bit overkill), or remove the lambdas altogether and use another approach to store the post-processing functions and their payloads. What do you think?

openmp/runtime/src/kmp_tasking.cpp
1145

Any comments on whether we can move the __kmp_unexecuted_hidden_helper_tasks decrement to this place?

5157

Uhm, I did not know about that. Although I think such a requirement makes sense, it may be out of the scope of this patch.

What we could do is somehow check whether the current thread is registered inside libomp, falling back to the current execution path that does not depend on the task team information. Do you know if we can use __kmpc_global_thread_num's return value to verify that? Maybe assert that the returned GTID is valid and within a well-known range (e.g., [0, NUM_REGISTERED_OMP_THREADS]).

Just a note: NUM_REGISTERED_OMP_THREADS is not a valid variable. I just don't know where, or even if, such information is stored. Do you know where I can find this?

gValarini marked an inline comment as done. Wed, Sep 14, 1:23 PM
gValarini added inline comments.
openmp/runtime/src/kmp_tasking.cpp
5187

Typo. It should have been the current thread!