This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Use events and taskyield in target nowait task to unblock host threads
Needs Revision · Public

Authored by ye-luo on Aug 6 2021, 9:00 AM.

Details

Summary

Currently, in a target task, the host thread spins in the synchronization call after kernel/transfer submission.
This patch adds a LIBOMPTARGET_USE_NOWAIT_EVENT environment variable to enable a code path that unblocks the host thread in a deferred target task by recording an event for synchronization and calling taskyield.

LIBOMP_USE_HIDDEN_HELPER_TASK=0 LIBOMPTARGET_USE_NOWAIT_EVENT=1 are needed to make this feature work nicely.
The test case I played with is
https://github.com/ye-luo/openmp-target/blob/master/hands-on/gemv/7-gemv-omp-target-many-matrices-taskloop/gemv-omp-target-many-matrices-taskloop.cpp
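
For context, a rough sketch of the pattern that test exercises (illustrative only; the matrix sizes, names, and clauses below are placeholders rather than the actual benchmark source):

#include <vector>

void gemv_many(int NumMat, int N) {
  std::vector<std::vector<double>> A(NumMat, std::vector<double>(N * N, 1.0));
  std::vector<std::vector<double>> X(NumMat, std::vector<double>(N, 1.0));
  std::vector<std::vector<double>> Y(NumMat, std::vector<double>(N, 0.0));

#pragma omp parallel
#pragma omp single
#pragma omp taskloop
  for (int m = 0; m < NumMat; ++m) {
    double *a = A[m].data(), *x = X[m].data(), *y = Y[m].data();
    // One GEMV per taskloop iteration, offloaded asynchronously; the host
    // thread should be free to run other iterations while this one is in flight.
#pragma omp target teams distribute parallel for nowait \
    map(to: a[:N*N], x[:N]) map(from: y[:N])
    for (int i = 0; i < N; ++i) {
      double sum = 0;
      for (int j = 0; j < N; ++j)
        sum += a[i * N + j] * x[j];
      y[i] = sum;
    }
#pragma omp taskwait // each iteration waits only for its own target task
  }
}

The point of the patch is that the synchronization inside the target task should not pin a host thread in cuStreamSynchronize while the kernel runs.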

Diff Detail

Event Timeline

ye-luo created this revision.Aug 6 2021, 9:00 AM
ye-luo requested review of this revision.Aug 6 2021, 9:00 AM
ye-luo added a comment.Aug 6 2021, 9:09 AM

Q: for @AndreyChurbanov
Do you know what the constraint is exactly?
I need to set KMP_TASK_STEALING_CONSTRAINT to make __kmp_task_is_allowed() return true.

Q: for @AndreyChurbanov
Do you know what the constraint is exactly?

In short: a newly scheduled task should be a descendant of the current task if the current task is explicit and tied.

Details from specification:

Task Scheduling Constraints are as follows:

  1. Scheduling of new tied tasks is constrained by the set of task regions that are currently tied to the thread and that are not suspended in a barrier region. If this set is empty, any new tied task may be scheduled. Otherwise, a new tied task may be scheduled only if it is a descendant task of every task in the set.
  2. A dependent task shall not start its execution until its task dependences are fulfilled.
  3. A task shall not be scheduled while any task with which it is mutually exclusive has been scheduled but has not yet completed.
  4. When an explicit task is generated by a construct that contains an if clause for which the expression evaluated to false, and the previous constraints are already met, the task is executed immediately after generation of the task.

A program that relies on any other assumption about task scheduling is non-conforming.

I need to set KMP_TASK_STEALING_CONSTRAINT to make __kmp_task_is_allowed() return true.

I haven't got this. If KMP_TASK_STEALING_CONSTRAINT=0, then __kmp_task_is_allowed() should always return true
(if there is no mutexinoutset dependency on the task).
Otherwise it can return either true or false.
But with KMP_TASK_STEALING_CONSTRAINT=0 some tests may hang because of deadlock.

@AndreyChurbanov Thank you for the quick reply. I'm exploring this as a proof of concept. Right now, without setting KMP_TASK_STEALING_CONSTRAINT to 0, I don't see new tasks being scheduled when taskyield is called. It was because the task scheduling constraint check failed, and I didn't understand why it was failing.

  1. Scheduling of new tied tasks is constrained by the set of task regions that are currently tied to the thread and that are not suspended in a barrier region. If this set is empty, any new tied task may be scheduled. Otherwise, a new tied task may be scheduled only if it is a descendant task of every task in the set.
  2. A dependent task shall not start its execution until its task dependences are fulfilled.
  3. A task shall not be scheduled while any task with which it is mutually exclusive has been scheduled but has not yet completed.
  4. When an explicit task is generated by a construct that contains an if clause for which the expression evaluated to false, and the previous constraints are already met, the task is executed immediately after generation of the task.

I don't see problems with 2, 3, and 4, but 1 is the issue, as you said: "In short: a newly scheduled task should be a descendant of the current task if the current task is explicit and tied."
I think a target task is an explicit task, but it is not clear to me whether it is a tied task. That is probably the reason the task scheduling constraint fails.

If my understanding of the situation is correct, I'm wondering if we can mark the target task as untied and then get new tasks scheduled.
Setting KMP_TASK_STEALING_CONSTRAINT was only needed for this exploration.

tianshilei1992 added a comment.EditedAug 6 2021, 10:42 AM

I think a target task is an explicit task, but it is not clear to me whether it is a tied task. That is probably the reason the task scheduling constraint fails.

If my understanding of the situation is correct, I'm wondering if we can mark the target task as untied and then get new tasks scheduled.

A regular task is tied by default. That's why in __kmpc_omp_target_task_alloc we set it to untied if the hidden helper task is enabled. The spec says:

Target task: A mergeable and untied task that is generated by a device construct or a call to a device memory routine and that coordinates activity between the current device and the target device.

So I think we need to set it to untied regardless of whether the hidden helper task is enabled.

ye-luo updated this revision to Diff 364870.Aug 6 2021, 12:41 PM

Thanks @tianshilei1992. Set the target task as untied, as defined by the OpenMP spec. No need to fiddle with KMP_TASK_STEALING_CONSTRAINT.

ye-luo edited the summary of this revision. (Show Details)Aug 6 2021, 12:42 PM
RaviNarayanaswamy added inline comments.
openmp/libomptarget/src/interface.cpp
407–408

Is kmpc_omp_taskwait needed?

openmp/libomptarget/src/omptarget.cpp
73

Result is not set on all paths

ye-luo added inline comments.Aug 6 2021, 1:58 PM
openmp/libomptarget/src/interface.cpp
407–408

It is not needed. It has been removed by @tianshilei1992 in the main branch, so it will disappear after a rebase.

openmp/libomptarget/src/omptarget.cpp
73

When leaving line 62, the return value is OFFLOAD_SUCCESS because line 28 sets it.

grokos added a subscriber: grokos.Aug 6 2021, 2:44 PM
grokos added inline comments.
openmp/libomptarget/include/omptarget.h
349

This function is defined in libomp, so it needs to be declared with the weak attribute in private.h alongside the other API functions from libomp (see private.h, the code block around line 90). Otherwise, we make libomptarget dependent on libomp, whereas we want it to be buildable independently of any specific host OpenMP runtime.

openmp/libomptarget/src/CMakeLists.txt
38

If you move the declaration of __kmpc_target_task_yield to private.h and mark it as weak, we can skip linking against omp.

openmp/libomptarget/src/device.cpp
552

"fullfiled" --> "fulfilled"
"has not been not fullfiled" --> "has not been fulfilled"

openmp/libomptarget/src/interface.cpp
323

This single-team API function needs the same patching you applied to __tgt_target_teams_nowait_mapper.

openmp/libomptarget/src/omptarget.cpp
24

Should we check for invalid values of this env var?

openmp/libomptarget/src/device.cpp
546

Isn't this initialized to false when the AsyncInfo is created?

openmp/libomptarget/src/omptarget.cpp
73

I missed that.

tianshilei1992 added inline comments.Aug 6 2021, 5:39 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
1189

Event destruction is worth a separate function. We could add a new return value, such as OFFLOAD_NOT_DONE, to indicate the event is not fulfilled. It is not a good idea to mix event query and event destruction.

openmp/libomptarget/src/device.cpp
545

early return

openmp/libomptarget/src/omptarget.cpp
45

use early return

51

I'm thinking we can actually do more here. For example, keep a count of taskyield calls; when the count reaches a threshold, fall back to stream synchronization. The threshold could be configured via an environment variable, and so on.
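
A minimal, self-contained sketch of that idea; the environment variable and all names below are hypothetical and not part of this patch:

#include <cstdlib>
#include <functional>

// Poll a completion query a bounded number of times, yielding between polls,
// then fall back to a blocking synchronize. The callables stand in for the
// plugin's event query, taskyield, and stream-synchronize entry points.
bool waitWithYieldBudget(const std::function<bool()> &IsDone,
                         const std::function<void()> &Yield,
                         const std::function<void()> &BlockingSync) {
  const char *Env = std::getenv("LIBOMPTARGET_NOWAIT_YIELD_THRESHOLD"); // assumed knob
  const long Threshold = Env ? std::atol(Env) : 1000;
  for (long Yields = 0; Yields < Threshold; ++Yields) {
    if (IsDone())
      return true; // the event completed while we were yielding
    Yield();       // let the host thread run other ready tasks
  }
  BlockingSync(); // budget exhausted: fall back to blocking synchronization
  return true;
}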

openmp/runtime/src/kmp_tasking.cpp
1443

This change is worth a separate patch.

Why do you want to use taskyield? The semantics of taskyield are weird and not useful in many cases.
I think it would make much more sense to adopt the notion of detached tasks instead and call omp_fulfill_event to complete the hidden helper task once the device is done.

Why do you want to use taskyield?

Right now, there is a performance issue: the target task blocks a host thread while waiting for the device to complete. I want the target task to be suspended after the kernel launch and the host thread to continue making progress on other tasks. This patch makes that work well in my use cases.
On NVIDIA, with the existing implementation, host threads spin in cuStreamSynchronize regardless of whether hidden helper tasks are used.
The synchronization call may be replaced with other, smarter schemes, but that doesn't change the fact that a target task blocks a thread, whether it is a regular OpenMP thread or a hidden helper thread.

The semantics of taskyield are weird and not useful in so many cases.

Please elaborate on why they are weird. Are there any logic holes in my implementation?
I never claimed it is the one method for all cases, and it is added as an opt-in option.
If taskyield can be called inside a regular task, is there any reason not to allow it inside a target task?

I think, it would make much more sense to adopt the notion of detached tasks instead and call omp_fulfill_event to complete the hidden helper task once the device is done.

That is an optimization of the hidden helper task, and I'm happy to see it implemented. In my understanding, implementing the whole target task as a detached task doesn't resolve the issue of the task blocking a thread. You may rely on the OS switching threads to gain something, since these are hidden helper threads, but you may also suffer from thread oversubscription when the regular OpenMP threads already occupy all the cores. Many things can be discussed on this topic, but I would like to pull hidden helper tasks out of my equation and set them aside.

IMO, to have an efficient implementation of "target nowait", breaking up its operation seems necessary, and the breakup needs to happen after enqueuing kernels and transfers but before other operations like decreasing reference counts and freeing memory.

I desperately need a working implementation of target nowait for my app. With this one, my work can be unblocked.
Hidden helper tasks present functionality issues for me, and I don't have any answer for their performance.
Please keep improving hidden helper tasks so we can compare and gain a better understanding.
I would be happy with one scheme that fits all cases, but I don't think there is one right now, which is why we are exploring several schemes.

protze.joachim added a comment.EditedAug 7 2021, 1:58 PM

Regarding the weird nature of taskyield, I refer to https://link.springer.com/chapter/10.1007%2F978-3-319-98521-3_1
Not everything in the paper is applicable to your situation. The most dangerous point I see here is that taskyield, if not used with care, will effectively build a recursive call stack, so that a task that called taskyield can only proceed once all recursively called tasks have finished.

#pragma omp target nowait depend(inout:a)
{}

As I understand the current implementation, this code translates to something like:

#pragma omp (hidden)task depend(inout:a)
{
  a = kernel_launch_async();
  wait_async(a);
}

As I understand your proposal, you want to replace it with something like:

#pragma omp (hidden)task depend(inout:a)
{
  a = kernel_launch_async();
  while (!test_async(a))
  {
   #pragma omp taskyield
  }
}

Think of 3 ready target nowait regions: the target task for the first target region calls taskyield and schedules the second target task. The second task also calls taskyield and schedules the third task. The first task can only continue/complete after the second and third tasks have completed.
Depending on the number of available target tasks, you might even exceed the stack limit.

My proposed code pattern would be like:

#pragma omp (hidden)task depend(inout:a) detach(event)
{
  a = kernel_launch_async();
  a.register_signal(omp_fulfill_event, event); // this registers omp_fulfill_event as a callback to be invoked when the asynchronous execution is finished
} //<-- the hidden helper task is done executing; the event handling in omp_fulfill_event will take care of releasing the dependent tasks

Making the target task a detached task can be done by calling __kmpc_task_allow_completion_event. To signal completion, __kmp_fulfill_event would be the internal libomp function to use.
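
A hedged sketch of that direction, assuming the CUDA plugin can queue a host callback on the stream and that the detach event handle is made available to it (none of this is in the current patch; everything except the CUDA and OpenMP entry points is made up):

#include <cstdint>
#include <cuda.h>
#include <omp.h>

// Queued on the stream after all kernels/transfers of the target region; CUDA
// runs it on a runtime thread once the preceding stream work has completed.
static void CUDA_CB fulfillWhenStreamDone(void *UserData) {
  omp_fulfill_event((omp_event_handle_t)(uintptr_t)UserData);
}

// Called from the (hypothetical) target-task body right after the last
// asynchronous enqueue; the target task can then return immediately, and the
// dependent tasks are released only when the device work is done.
void registerCompletion(CUstream Stream, omp_event_handle_t Event) {
  cuLaunchHostFunc(Stream, fulfillWhenStreamDone, (void *)(uintptr_t)Event);
}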

ye-luo added a comment.Aug 7 2021, 3:49 PM

Regarding the weird nature of taskyield, I refer to https://link.springer.com/chapter/10.1007%2F978-3-319-98521-3_1
Not everything in the paper is applicable to your situation. The most dangerous point I see here is that taskyield, if not used with care, will effectively build a recursive call stack, so that a task that called taskyield can only proceed once all recursively called tasks have finished.

I don't have access to the paper, but I do understand the case of "a recursive call stack". It can cause performance issues. It also seems to be a property of the taskyield implementation in LLVM libomp, so this is real.
I think there is another issue: when there is no available task in the queue at the point of the yield, the target task will still behave as blocking.

In short, this implementation has limitations. However, that is not a big concern to me, as my use pattern doesn't suffer much from these issues.
I also agree that detached tasks have advantages. The details can only be sorted out once the implementation is done.
For example, in my understanding, the target task needs to be broken into parts. The initial parts can be turned into detached tasks. The finalization parts need to be a separate task that depends on the detached task. Also, some API changes are needed to pass the event token between libomp and libomptarget.
So this is quite involved; a dedicated person needs to work on it, and it takes time.

Right now my implementation using taskyield needs very limited change, and people can choose to opt in to see if some performance can be gained.
As long as it doesn't contain functionality bugs like deadlocks, I'd like to take advantage of this feature and move my application forward to prove that OpenMP asynchronous offload works in a real application.
My main job is on the application side, and I have been in a panic for years because of the lack of decent "target nowait" support in LLVM. So getting things moving is quite crucial.

ye-luo updated this revision to Diff 364987.Aug 7 2021, 3:55 PM

Rebase and address review comments.

ye-luo marked an inline comment as done.Aug 7 2021, 4:19 PM
ye-luo added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
1189

I'm trying to avoid adding things that are not immediately used.
Similar to Queue, the manipulation of Event stays within the plugin, and there is no need for APIs to create/destroy events from outside.

recordEvent is responsible for creating and recording an event.
queryEvent is responsible for querying and destroying an event upon completion.

#define OFFLOAD_SUCCESS (0)
#define OFFLOAD_FAIL (~0)

This is what I found, not even an enum. I don't see a clean way to extend it with OFFLOAD_NOT_DONE.
I think fixing the return-value style is not in the scope of this patch. Some design is needed for return values between the plugin and the device class, and between the device and omptarget.

In this case,
OFFLOAD_FAIL is for errors reported by the CUDA runtime. The event pointer signals whether the operation has completed or is still ongoing.
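
For illustration, a minimal sketch of the recordEvent/queryEvent pairing described above, written directly against the CUDA driver API; the real plugin wraps these calls in its own error reporting, so the structure here is a simplified placeholder:

#include <cuda.h>

#define OFFLOAD_SUCCESS (0)
#define OFFLOAD_FAIL (~0)

// Create an event and record it on the stream; the event captures all work
// enqueued on the stream up to this point.
int recordEvent(CUstream Stream, CUevent &Event) {
  if (cuEventCreate(&Event, CU_EVENT_DISABLE_TIMING) != CUDA_SUCCESS)
    return OFFLOAD_FAIL;
  return cuEventRecord(Event, Stream) == CUDA_SUCCESS ? OFFLOAD_SUCCESS
                                                      : OFFLOAD_FAIL;
}

// Query the event without blocking; destroy it once the captured work is done.
int queryEvent(CUevent &Event, bool &IsDone) {
  CUresult Res = cuEventQuery(Event);
  if (Res == CUDA_ERROR_NOT_READY) {
    IsDone = false;
    return OFFLOAD_SUCCESS;
  }
  if (Res != CUDA_SUCCESS)
    return OFFLOAD_FAIL;
  IsDone = true;
  return cuEventDestroy(Event) == CUDA_SUCCESS ? OFFLOAD_SUCCESS : OFFLOAD_FAIL;
}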

openmp/libomptarget/src/device.cpp
545

Changed to early return.

546

Yes, so I removed the call to setEventSupported.

552

Thank you for pointing these out. Corrected.

openmp/libomptarget/src/interface.cpp
323

I tend to first get the "target teams nowait" case implemented and then extend it to all the rest, not just this case but also all the update functions. If you think it is better to enable this function as well in this initial patch, let me know and I will add it.

openmp/libomptarget/src/omptarget.cpp
24

I wanted something like libomp, where TRUE/1/ON all map to 1, but I don't know how to handle that in libomptarget.
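
One possible way to accept 1/TRUE/ON, sketched in plain C++; the helper name is hypothetical and only the LIBOMPTARGET_USE_NOWAIT_EVENT variable comes from this patch:

#include <cctype>
#include <cstdlib>
#include <string>

// Case-insensitive boolean env var parsing; unrecognized values fall back to
// the default, which also answers the earlier question about invalid values.
static bool readBoolEnv(const char *Name, bool Default) {
  const char *Value = std::getenv(Name);
  if (!Value)
    return Default;
  std::string V(Value);
  for (char &C : V)
    C = std::toupper(static_cast<unsigned char>(C));
  if (V == "1" || V == "TRUE" || V == "ON" || V == "YES")
    return true;
  if (V == "0" || V == "FALSE" || V == "OFF" || V == "NO")
    return false;
  return Default;
}

// Usage: bool UseNowaitEvent = readBoolEnv("LIBOMPTARGET_USE_NOWAIT_EVENT", false);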

45

I rewrote the whole function to do mostly early returns.

51

This looks like an optimization that should be explored separately. I think I may use cuEventSynchronize.

73

Cleaned up the call and used early returns. More readable.

ye-luo added inline comments.Aug 7 2021, 4:27 PM
openmp/libomptarget/include/omptarget.h
349

This is exactly what I was looking for. All fixed.

tianshilei1992 added inline comments.Aug 7 2021, 5:05 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
1141

This function does too much. It contains:

  1. Create an event;
  2. Return the stream;
  3. Nullify the queue pointer.

Considering recordEvent will be used in many other places, such as D104418, please separate it.

1189

This is what I found, not even an enum. I don't see a clean way to extend it with OFFLOAD_NOT_DONE.

That is because these values are used in both libomptarget (C++ API) and the plugins (C API).

I think fixing the return-value style is not in the scope of this patch. Some design is needed for return values between the plugin and the device class, and between the device and omptarget.

I don't doubt that, but it's not good to "twist" the code to fit the existing code if there is apparently a better way to do it. If one part needs to be extended to support new features, just do it in another patch and make this one depend on it.

openmp/libomptarget/src/device.cpp
543

Whether the event is supported is per-device, so there is no need to put an indicator in every async info.

openmp/libomptarget/src/omptarget.cpp
51

If cuEventSynchronize is better than the stream version (e.g. the synchronization is no longer just spinning but something closer to a signal), it's worth splitting out a patch with something like:

// launch kernel
// create event
// synchronize

And in the CUDA plugin, the synchronize becomes an event synchronize. Then apply this patch on top of that.

ye-luo added inline comments.Aug 7 2021, 5:42 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
1141

It is not clear to me how you would prefer the event manipulation API in the plugins to look. Could you put up a separate patch by extracting those parts out of D104418? It seems that you need to expose an event in the API. Once you have that up, I can refactor/reorganize my side.

1189

OFFLOAD_NOT_DONE needs to come from the plugin. A query needs to return three states: fail, done, not done. I'm wondering how to do this properly. Is there an example to follow?
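
One possible shape, purely as a sketch: a three-state result on the libomptarget (C++) side that the plugin's C-style return code plus a separate completion flag could be mapped onto. None of these names exist in the current code:

#include <cstdint>

#define OFFLOAD_SUCCESS (0)

enum class EventQueryResult { Done, NotDone, Fail };

// The plugin keeps returning OFFLOAD_SUCCESS/OFFLOAD_FAIL and reports
// completion through an output flag; libomptarget folds both into one value.
inline EventQueryResult toQueryResult(int32_t RC, bool IsEventDone) {
  if (RC != OFFLOAD_SUCCESS)
    return EventQueryResult::Fail;
  return IsEventDone ? EventQueryResult::Done : EventQueryResult::NotDone;
}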

openmp/libomptarget/src/device.cpp
543

Indeed I wanted to change that. Is DeviceTy and its constructor the right place to keep and initialize this flag?

openmp/libomptarget/src/omptarget.cpp
51

Let us consolidate the API first. Any further optimization should be deferred.

protze.joachim requested changes to this revision.Aug 8 2021, 12:40 AM

Regarding the weird nature of taskyield, I refer to https://link.springer.com/chapter/10.1007%2F978-3-319-98521-3_1
Not everything in the paper is applicable to your situation. The most dangerous point I see here is that taskyield, if not used with care, will effectively build a recursive call stack, so that a task that called taskyield can only proceed once all recursively called tasks have finished.

I don't have access to the paper, but I do understand the case of "a recursive call stack". It can cause performance issues. It also seems to be a property of the taskyield implementation in LLVM libomp, so this is real.

Lmgtfy: http://montblanc-project.eu/wp-content/uploads/2018/10/The-impact-of-taskyield-on.pdf

I think there is another issue: when there is no available task in the queue at the point of the yield, the target task will still behave as blocking.

In such a case, you introduce busy waiting by polling on taskyield for as long as the target is not ready. Since the hidden helper tasks are pinned to the same cores as the application threads, this will impact the performance of the host threads. (Reject reason one.)

In short, this implementation has limitations. However, that is not a big concern to me, as my use pattern doesn't suffer much from these issues.

Please add a mockup of your use pattern as a test case, so that we can review and understand it.
IMHO, an implementation where significant drawbacks can be expected should not go into mainline libomptarget just for experimenting with performance.

I also agree that detached tasks have advantages. The details can only be sorted out once the implementation is done.
For example, in my understanding, the target task needs to be broken into parts. The initial parts can be turned into detached tasks. The finalization parts need to be a separate task that depends on the detached task. Also, some API changes are needed to pass the event token between libomp and libomptarget.
So this is quite involved; a dedicated person needs to work on it, and it takes time.

I'm not sure what you mean by finalization. The only case where I think a target task might need to be split into pieces is mapping data back from the device (I'm not sure whether the internal signalling model allows initiating the memory movement right after the kernel offload).
If such splitting were needed, we could limit the initial detach implementation to only support target regions without mapping at the end of the region. The application can always meet this requirement by splitting the mapping into separate directives:

#pragma omp target enter data map(to:A) depend(inout:A) nowait
#pragma omp target depend(inout:A) nowait
#pragma omp target exit data map(from:A) depend(inout:A) nowait

Right now my implementation using taskyield needs very limited change, and people can choose to opt in to see if some performance can be gained.
As long as it doesn't contain functionality bugs like deadlocks, I'd like to take advantage of this feature and move my application forward to prove that OpenMP asynchronous offload works in a real application.
My main job is on the application side, and I have been in a panic for years because of the lack of decent "target nowait" support in LLVM. So getting things moving is quite crucial.

This seems like a change to address a very limited use case, without explaining what the pattern of that use case actually is. We should discuss this in one of the upcoming calls.

This revision now requires changes to proceed.Aug 8 2021, 12:40 AM
ye-luo added a comment.EditedAug 8 2021, 3:07 AM

Regarding the weird nature of taskyield, I refer to https://link.springer.com/chapter/10.1007%2F978-3-319-98521-3_1
Not everything in the paper is applicable to your situation. The most dangerous point I see here is that taskyield, if not used with care, will effectively build a recursive call stack, so that a task that called taskyield can only proceed once all recursively called tasks have finished.

I don't have access to the paper, but I do understand the case of "a recursive call stack". It can cause performance issues. It also seems to be a property of the taskyield implementation in LLVM libomp, so this is real.

Lmgtfy: http://montblanc-project.eu/wp-content/uploads/2018/10/The-impact-of-taskyield-on.pdf

Thanks. It is consistent with my understanding of the "stack" implementation of taskyield.

I think there is another issue: when there is no available task in the queue at the point of the yield, the target task will still behave as blocking.

In such a case, you introduce busy waiting by polling on taskyield for as long as the target is not ready. Since the hidden helper tasks are pinned to the same cores as the application threads, this will impact the performance of the host threads. (Reject reason one.)

  1. Using hidden helper tasks or regular tasks is largely orthogonal to what we are discussing here. Using hidden helper tasks is not a must for having an efficient target nowait.
  2. The current implementation calling cuStreamSynchronize already blocks the application thread. My implementation allows it not to be blocked.

In short, this implementation has limitations. However, that is not a big concern to me, as my use pattern doesn't suffer much from these issues.

Please add a mockup of your use pattern as a test case, so that we can review and understand it.
IMHO, an implementation where significant drawbacks can be expected should not go into mainline libomptarget just for experimenting with performance.

I need it for production. The current "target nowait" does not work as expected.

I also agree that detached tasks have advantages. The details can only be sorted out once the implementation is done.
For example, in my understanding, the target task needs to be broken into parts. The initial parts can be turned into detached tasks. The finalization parts need to be a separate task that depends on the detached task. Also, some API changes are needed to pass the event token between libomp and libomptarget.
So this is quite involved; a dedicated person needs to work on it, and it takes time.

I'm not sure what you mean by finalization. The only case where I think a target task might need to be split into pieces is mapping data back from the device (I'm not sure whether the internal signalling model allows initiating the memory movement right after the kernel offload).
If such splitting were needed, we could limit the initial detach implementation to only support target regions without mapping at the end of the region. The application can always meet this requirement by splitting the mapping into separate directives:

#pragma omp target enter data map(to:A) depend(inout:A) nowait
#pragma omp target depend(inout:A) nowait
#pragma omp target exit data map(from:A) depend(inout:A) nowait

You need to decrease the reference count and free the memory if the count drops to 0 after the completion of all the asynchronous operations. If you can take care of that in the design, it is better than asking users for extra work.
Second, splitting in the way you suggested requires dependency resolution on the host, at least right now. The added latency is a huge loss in performance.

Right now my implementation using taskyield needs very limited change, and people can choose to opt in to see if some performance can be gained.
As long as it doesn't contain functionality bugs like deadlocks, I'd like to take advantage of this feature and move my application forward to prove that OpenMP asynchronous offload works in a real application.
My main job is on the application side, and I have been in a panic for years because of the lack of decent "target nowait" support in LLVM. So getting things moving is quite crucial.

This seems like a change to address a very limited use case, without explaining what the pattern of that use case actually is. We should discuss this in one of the upcoming calls.

The test code in the description is a distilled version of the app. I have slides and we can discuss them.

I think there is another issue: when there is no available task in the queue at the point of the yield, the target task will still behave as blocking.

In such a case, you introduce busy waiting by polling on taskyield for as long as the target is not ready. Since the hidden helper tasks are pinned to the same cores as the application threads, this will impact the performance of the host threads. (Reject reason one.)

  1. Using hidden helper tasks or regular tasks is largely orthogonal to what we are discussing here. Using hidden helper tasks is not a must for having an efficient target nowait.
  2. The current implementation calling cuStreamSynchronize already blocks the application thread. My implementation allows it not to be blocked.

Blocking the thread does not mean "keeping the thread busy and eating all the core's cycles".

In short, this implementation has limitations. However, that is not a big concern to me, as my use pattern doesn't suffer much from these issues.

Please add a mockup of your use pattern as a test case, so that we can review and understand it.
IMHO, an implementation where significant drawbacks can be expected should not go into mainline libomptarget just for experimenting with performance.

I need it for production. The current "target nowait" does not work as expected.

I completely agree with the statement that "target nowait" is not really implemented in libomptarget. I just disagree with the way you suggest fixing the implementation.

I also agree that detached tasks have advantages. The details can only be sorted out once the implementation is done.
For example, in my understanding, the target task needs to be broken into parts. The initial parts can be turned into detached tasks. The finalization parts need to be a separate task that depends on the detached task. Also, some API changes are needed to pass the event token between libomp and libomptarget.
So this is quite involved; a dedicated person needs to work on it, and it takes time.

I'm not sure what you mean by finalization. The only case where I think a target task might need to be split into pieces is mapping data back from the device (I'm not sure whether the internal signalling model allows initiating the memory movement right after the kernel offload).
If such splitting were needed, we could limit the initial detach implementation to only support target regions without mapping at the end of the region. The application can always meet this requirement by splitting the mapping into separate directives:

#pragma omp target enter data map(to:A) depend(inout:A) nowait
#pragma omp target depend(inout:A) nowait
#pragma omp target exit data map(from:A) depend(inout:A) nowait

You need to decrease the reference count and free the memory if the count drops to 0 after the completion of all the asynchronous operations. If you can take care of that in the design, it is better than asking users for extra work.
Second, splitting in the way you suggested requires dependency resolution on the host, at least right now. The added latency is a huge loss in performance.

I didn't suggest that this should be a permanent solution. It might be an intermediate step until splitting the task into parts is implemented.
I think @tianshilei1992 already has code in place to handle these dependencies on the device.
Also, for the code example you posted, there will be no freeing of data at the end of the target region.

By mapping all data to the device you assume all the data fits on the device at the same time. If you removed your enter/exit data directives for (de)allocation and relied on the mapping of the target region to move the data, you would still not be able to process larger chunks of data: because of the stacking nature of taskyield, no data will be moved off the device before all target regions have finished.

Right now my implementation using taskyield needs very limited change, and people can choose to opt in to see if some performance can be gained.
As long as it doesn't contain functionality bugs like deadlocks, I'd like to take advantage of this feature and move my application forward to prove that OpenMP asynchronous offload works in a real application.
My main job is on the application side, and I have been in a panic for years because of the lack of decent "target nowait" support in LLVM. So getting things moving is quite crucial.

This seems like a change to address a very limited use case, without explaining what the pattern of that use case actually is. We should discuss this in one of the upcoming calls.

The test code in the description is a distilled version of the app. I have slides and we can discuss them.

Thanks for pointing me to the link. The code convinced me even more that taskyield is not the right solution, even for your code example.

I also think that, without ignoring the task scheduling constraint, your code will only be able to schedule one task from the taskloop during your taskyield, and a nested taskyield cannot schedule a task:
When you reach the taskyield, you have a tied task from the taskloop scheduled in the barrier of the single region (or in the taskgroup for the task executing the taskloop).
The target task is untied and does not count, so you can schedule another task from the taskloop. Now, when you reach the taskyield, you have a tied task scheduled in the outer taskyield, and none of the tasks from the taskloop can be scheduled.
You might mitigate this limitation by adding untied to the taskloop, at the cost of untied tasks, and then run into the stack problem of taskyield.

ye-luo added a comment.EditedAug 8 2021, 10:36 AM

Thanks for pointing me to the link. The code convinced me even more that taskyield is not the right solution, even for your code example.

I also think that, without ignoring the task scheduling constraint, your code will only be able to schedule one task from the taskloop during your taskyield, and a nested taskyield cannot schedule a task:
When you reach the taskyield, you have a tied task from the taskloop scheduled in the barrier of the single region (or in the taskgroup for the task executing the taskloop).
The target task is untied and does not count, so you can schedule another task from the taskloop. Now, when you reach the taskyield, you have a tied task scheduled in the outer taskyield, and none of the tasks from the taskloop can be scheduled.

This is not what I observed.

The task scheduling constraints say "the set of task regions that are currently tied to the thread and that are not suspended in a barrier region". The tied task from the taskloop scheduled in the barrier of the single region (or in the taskgroup for the task executing the taskloop) doesn't count toward the set.

As the test is evolving, let me switch to a fixed commit.

Let us skip the second iteration of the taskloop; it gets scheduled after the first taskyield from the first target task and runs to completion, as it is CPU-only. The third iteration, which contains "target nowait", seems to be your concern. This task actually generates the target task and then runs to completion. Only after that does the second target task get scheduled, at which point "the set of task regions that are currently tied to the thread and that are not suspended in a barrier region" is empty. So the taskyield from the second target task does continue to build up the taskyield "stack".

When I manually stepped through libomp, I saw the stack being built up by the taskyield. nvprof also confirms the stack being built up.
Again, building up the stack is not my concern. Usually the loop under the taskloop has fewer than 5 iterations.