
[OpenMP] Use events and taskyield in target nowait task to unblock host threads
Needs Revision · Public

Authored by ye-luo on Aug 6 2021, 9:00 AM.

Details

Summary

Currently, in a target task, the host thread spins when invoking synchronization after kernel/transfer submission.
This patch adds the LIBOMPTARGET_USE_NOWAIT_EVENT environment variable to enable a code path that unblocks the host thread in a deferred target task by recording an event for synchronization and calling taskyield.

Set LIBOMP_USE_HIDDEN_HELPER_TASK=0 LIBOMPTARGET_USE_NOWAIT_EVENT=1 to make this feature work nicely.
https://github.com/ye-luo/openmp-target/blob/master/hands-on/gemv/7-gemv-omp-target-many-matrices-taskloop/gemv-omp-target-many-matrices-taskloop.cpp
is the test case I experimented with.
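For reference, a typical invocation would look like the line below (the binary name is taken from the linked test case; both variables are needed because the new code path only engages when hidden helper tasks are disabled):

```shell
# Disable hidden helper tasks and opt in to the event/taskyield path.
LIBOMP_USE_HIDDEN_HELPER_TASK=0 LIBOMPTARGET_USE_NOWAIT_EVENT=1 \
  ./gemv-omp-target-many-matrices-taskloop
```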

Diff Detail

Event Timeline

ye-luo created this revision.Aug 6 2021, 9:00 AM
ye-luo requested review of this revision.Aug 6 2021, 9:00 AM
ye-luo added a comment.Aug 6 2021, 9:09 AM

Q for @AndreyChurbanov:
Do you know what the constraint is, exactly?
I need to set KMP_TASK_STEALING_CONSTRAINT to make __kmp_task_is_allowed() return true.

Q for @AndreyChurbanov:
Do you know what the constraint is, exactly?

In short: a newly scheduled task should be a descendant of the current task if the current task is explicit and tied.

Details from specification:

Task Scheduling Constraints are as follows:

  1. Scheduling of new tied tasks is constrained by the set of task regions that are currently tied to the thread and that are not suspended in a barrier region. If this set is empty, any new tied task may be scheduled. Otherwise, a new tied task may be scheduled only if it is a descendant task of every task in the set.
  2. A dependent task shall not start its execution until its task dependences are fulfilled.
  3. A task shall not be scheduled while any task with which it is mutually exclusive has been scheduled but has not yet completed.
  4. When an explicit task is generated by a construct that contains an if clause for which the expression evaluated to false, and the previous constraints are already met, the task is executed immediately after generation of the task.

A program that relies on any other assumption about task scheduling is non-conforming.

I need to set KMP_TASK_STEALING_CONSTRAINT to make __kmp_task_is_allowed() return true.

I haven't understood this. If KMP_TASK_STEALING_CONSTRAINT=0, then __kmp_task_is_allowed() should always return true
(if there is no mutexinoutset dependency on a task).
Otherwise it can return true or false.
But with KMP_TASK_STEALING_CONSTRAINT=0 some tests may hang because of deadlock.

@AndreyChurbanov Thank you for the quick reply. I'm exploring this as a proof of concept. Right now, without setting KMP_TASK_STEALING_CONSTRAINT to 0, I don't see new tasks being scheduled when taskyield is called. It was because the task scheduling constraints (TSC) were failing, and I didn't understand why.

  1. Scheduling of new tied tasks is constrained by the set of task regions that are currently tied to the thread and that are not suspended in a barrier region. If this set is empty, any new tied task may be scheduled. Otherwise, a new tied task may be scheduled only if it is a descendant task of every task in the set.
  2. A dependent task shall not start its execution until its task dependences are fulfilled.
  3. A task shall not be scheduled while any task with which it is mutually exclusive has been scheduled but has not yet completed.
  4. When an explicit task is generated by a construct that contains an if clause for which the expression evaluated to false, and the previous constraints are already met, the task is executed immediately after generation of the task.

I don't see problems with 2, 3, and 4, but 1 matches what you said: "In short: a newly scheduled task should be a descendant of the current task if the current task is explicit and tied."
I think a target task is an explicit task, but it is not clear to me whether it is a tied task; that is probably why the TSC check fails.

If my understanding of the situation is correct, I'm wondering if we can mark the target task as untied and then get new tasks scheduled?
Setting KMP_TASK_STEALING_CONSTRAINT was only needed for the exploration.

tianshilei1992 added a comment.EditedAug 6 2021, 10:42 AM

I think a target task is an explicit task, but it is not clear to me whether it is a tied task; that is probably why the TSC check fails.

If my understanding of the situation is correct, I'm wondering if we can mark the target task as untied and then get new tasks scheduled?

A regular task is tied by default. That's why in __kmpc_omp_target_task_alloc we set it to untied if hidden helper tasks are enabled. The spec says:

Target task: A mergeable and untied task that is generated by a device construct or a call to a device memory routine and that coordinates activity between the current device and the target device.

So I think we need to set it to untied no matter whether hidden helper tasks are enabled.

ye-luo updated this revision to Diff 364870.Aug 6 2021, 12:41 PM

Thanks @tianshilei1992. The target task is now set as untied, as defined by the OpenMP spec. No need to fiddle with KMP_TASK_STEALING_CONSTRAINT.

ye-luo edited the summary of this revision. (Show Details)Aug 6 2021, 12:42 PM
RaviNarayanaswamy added inline comments.
openmp/libomptarget/src/interface.cpp
392–393

Is kmpc_omp_taskwait needed?

openmp/libomptarget/src/omptarget.cpp
55–85

Result is not set on all paths

ye-luo added inline comments.Aug 6 2021, 1:58 PM
openmp/libomptarget/src/interface.cpp
392–393

It is not needed; it has been removed by @tianshilei1992 in the main branch, so it will disappear after a rebase.

openmp/libomptarget/src/omptarget.cpp
55–85

When leaving line 62, the return value is OFFLOAD_SUCCESS because line 28 sets it.

grokos added a subscriber: grokos.Aug 6 2021, 2:44 PM
grokos added inline comments.
openmp/libomptarget/include/omptarget.h
354

This function is defined in libomp, so it needs to be declared with the weak attribute in private.h alongside the other API functions from libomp (see private.h, the code block around line 90). Otherwise, we make libomptarget depend on libomp, whereas we want it to be buildable independently of any specific host OpenMP runtime.

openmp/libomptarget/src/CMakeLists.txt
38 ↗(On Diff #364870)

If you move the declaration of __kmpc_target_task_yield to private.h and mark it as weak, we can skip linking against omp.

openmp/libomptarget/src/device.cpp
561

"fullfiled" --> "fulfilled"
"has not been not fullfiled" --> "has not been fulfilled"

openmp/libomptarget/src/interface.cpp
313

This single-team API function needs the same patching you applied to __tgt_target_teams_nowait_mapper.

openmp/libomptarget/src/omptarget.cpp
24

Should we check for invalid values of this env var?

openmp/libomptarget/src/device.cpp
555

Isn't this initialized to false when the AsyncInfo is created?

openmp/libomptarget/src/omptarget.cpp
55–85

I missed that.

tianshilei1992 added inline comments.Aug 6 2021, 5:39 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
1191

Event destruction deserves a separate function. We could add a new return value such as OFFLOAD_NOT_DONE to indicate the event is not yet fulfilled. It is not a good idea to mix event query and event destruction.

openmp/libomptarget/src/device.cpp
554

early return

openmp/libomptarget/src/omptarget.cpp
43

use early return

49

I'm thinking we can actually do more here. For example, keep a count of task yields; when the count reaches a threshold, fall back to stream synchronization. The threshold could be configured via an env var, and so on.

openmp/runtime/src/kmp_tasking.cpp
1439

This change deserves a separate patch.

protze.joachim added a comment.

Why do you want to use taskyield? The semantics of taskyield are weird and not useful in many cases.
I think it would make much more sense to adopt the notion of detached tasks and call omp_fulfill_event to complete the hidden helper task once the device is done.

Why do you want to use taskyield?

Right now, there is a performance issue: a target task blocks a host thread while waiting for the device to complete. I want the target task to be suspended after kernel launch so the host thread can continue to make progress on other tasks. This patch makes that work well in my use cases.
On NVIDIA with the existing implementation, host threads spin in cuStreamSynchronize regardless of whether hidden helper tasks are used.
Such a synchronization call may be replaced with smarter schemes, but that doesn't change the fact that a target task blocks a thread, whether it is an OpenMP thread or a hidden helper thread.

The semantics of taskyield are weird and not useful in so many cases.

Please elaborate on why it is weird. Are there any logic holes in my implementation?
I never claimed it is the one method for all cases, and it is added as an opt-in.
If taskyield can be called inside a regular task, is there any reason not to allow it inside a target task?

I think, it would make much more sense to adopt the notion of detached tasks instead and call omp_fulfill_event to complete the hidden helper task once the device is done.

That is an optimization of the hidden helper task; I'm happy to see it implemented. In my understanding, implementing the whole target task as a detached task doesn't resolve the issue of a task blocking a thread. You may rely on the OS switching threads to gain something, since these are hidden helper threads, but you may also suffer from thread oversubscription when regular OpenMP threads already occupy all the cores. There are many things to discuss on this topic, but I would like to pull helper tasks out of my equation and put them aside.

IMO, to have an efficient implementation of "target nowait", breaking up its operation seems necessary, and the breakup needs to happen after enqueuing kernels and transfers but before other operations like decreasing reference counts and freeing memory.

I desperately need a working implementation of target nowait for my app. I have one, and my work can be unblocked.
Hidden helper tasks present a functionality issue for me, and I don't have any answer for their performance.
Please keep improving hidden helper tasks so we can compare and understand better.
I would be happy with one scheme that fits all cases, but I don't think there is one right now, and that is why we are exploring several schemes.

protze.joachim added a comment.EditedAug 7 2021, 1:58 PM

Regarding the weird nature of taskyield, I refer to https://link.springer.com/chapter/10.1007%2F978-3-319-98521-3_1
Not everything in the paper applies to your situation. The most dangerous point I see here is that taskyield, if not used with care, will effectively build a recursive call stack, so that a task that called taskyield can only proceed once all recursively called tasks have finished.

#pragma omp target nowait depend(inout:a)
{}

As I understand the current implementation, this code translates to something like:

#pragma omp (hidden)task depend(inout:a)
{
  a = kernel_launch_async();
  wait_async(a);
}

As I understand your proposal, you want to replace it by something like:

#pragma omp (hidden)task depend(inout:a)
{
  a = kernel_launch_async();
  while (!test_async(a))
  {
    #pragma omp taskyield
  }
}

Think of 3 ready target nowait regions: the target task for the first target region calls taskyield and schedules the second target task. The second task also calls taskyield and schedules the third. The first task will only continue/complete after the second and third tasks have completed.
Depending on the number of available target tasks, you might even exceed the stack limit.

My proposed code pattern would be like:

#pragma omp (hidden)task depend(inout:a) detach(event)
{
  a = kernel_launch_async();
  a.register_signal(omp_fulfill_event, event); // this registers omp_fulfill_event as a callback to be called, when the asynchronous execution is finished
} //<-- the hidden helper task is done executing. the event handling in omp_fulfill_event will take care of releasing the dependent tasks 

Making the target task a detached task can be done by calling __kmpc_task_allow_completion_event. To signal completion, __kmp_fulfill_event would be the internal libomp function.

ye-luo added a comment.Aug 7 2021, 3:49 PM

Regarding the weird nature of taskyield I refer to https://link.springer.com/chapter/10.1007%2F978-3-319-98521-3_1
Not everything in the paper is applicable for your situation. The most dangerous point I see here is, that taskyield if not used with care will effectively build a recursive call stack, so that a task that called taskyield can only proceed, if all recursively called task have finished.

I don't have access to the paper, but I do understand the case of "a recursive call stack". It can cause performance issues. It also seems to be a feature of the taskyield implementation in LLVM libomp, so this is real.
I think there is another issue: when there is no available task in the queue at the point of yield, the target task will still block.

In short, this implementation has limitations. However, that is not a big concern to me, as my use pattern doesn't suffer much from these issues.
I also agree that detached tasks have advantages. The details can only be sorted out once the implementation is done.
For example, in my understanding, the target task needs to be broken into parts. The initial parts can be turned into detached tasks; the finalization parts need to be a separate task that depends on the detached task. Also, some API changes are needed to pass the event token between libomp and libomptarget.
So this is quite involved; a dedicated person needs to work on it, and it will take time.

Right now, my implementation using taskyield needs very limited changes, and people can opt in to see if any performance can be gained.
As long as it doesn't contain functionality bugs like deadlocks, I'd like to take advantage of this feature and move my application forward to prove that OpenMP asynchronous offload works in a real application.
My main job is on the application side, and I have been stuck for years because of the lack of decent "target nowait" support in LLVM, so getting things moving is crucial.

ye-luo updated this revision to Diff 364987.Aug 7 2021, 3:55 PM

rebase and address reviews.

ye-luo marked an inline comment as done.Aug 7 2021, 4:19 PM
ye-luo added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
1191

I'm trying to avoid adding things that are not immediately used.
Similar to Queue, the manipulation of Event stays within the plugin, and there is no need for APIs to create/destroy events from outside.

recordEvent is responsible for creating and recording an event.
queryEvent is responsible for querying an event and destroying it upon completion.

#define OFFLOAD_SUCCESS (0)
#define OFFLOAD_FAIL (~0)

This is what I found; they are not even an enum. I don't see a clean way to add OFFLOAD_NOT_DONE.
I think fixing the return-value style is out of scope for this patch. Some design is needed for return values between the plugin and the device class, and between the device and omptarget.

In this case,
OFFLOAD_FAIL is for errors reported by the CUDA runtime; the event query signals whether the operation is completed or still ongoing.

openmp/libomptarget/src/device.cpp
554

Changed to early return.

555

Yes, so I removed the call to setEventSupported.

561

Thank you for pointing these out. Corrected.

openmp/libomptarget/src/interface.cpp
313

I'd like to first get one case, "target teams nowait", implemented, and then extend it to all the rest, not just this case but also all the update calls. If you think it is better to enable this function as well in this initial patch, let me know and I will add it.

openmp/libomptarget/src/omptarget.cpp
24

I wanted something like libomp, where TRUE/1/ON all map to 1, but I don't know how to handle that in libomptarget.

43

I rewrote the whole function to mostly use early returns.

49

This looks like an optimization that should be explored separately. I think I may use cuEventSynchronize.

55–85

Code cleaned up to use early returns. More readable.

ye-luo added inline comments.Aug 7 2021, 4:27 PM
openmp/libomptarget/include/omptarget.h
354

This is exactly what I was looking for. All fixed.

tianshilei1992 added inline comments.Aug 7 2021, 5:05 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
1143

This function does too much. It:

  1. Creates an event;
  2. Returns the stream;
  3. Nullifies the queue pointer.

Considering recordEvent will be used in many other places, such as D104418, please separate it.

1191

This is what I found; they are not even an enum. I don't see a clean way to add OFFLOAD_NOT_DONE.

It is because these values are used in both libomptarget (C++ API) and plugins (C API).

I think fixing the return-value style is out of scope for this patch. Some design is needed for return values between the plugin and the device class, and between the device and omptarget.

I don't doubt that, but it's not good to "twist" the code to fit existing code when there is apparently a better way to do it. If one part needs to be extended to support new features, just do it in another patch and make this one depend on it.

openmp/libomptarget/src/device.cpp
552

Whether events are supported is per-device, so there is no need to put an indicator in every async info.

openmp/libomptarget/src/omptarget.cpp
49

If cuEventSynchronize is better than the stream one (e.g. the synchronization is no longer just spinning but something closer to a signal), it's worth a separate patch with something like:

// launch kernel
// create event
// synchronize

And in the CUDA plugin, the synchronize becomes an event synchronize. Then apply this patch on top of that.

ye-luo added inline comments.Aug 7 2021, 5:42 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
1143

It is not clear to me what you would prefer the event manipulation API in the plugins to look like. Could you put up a separate patch extracting those out of D104418? It seems you need to expose an event in the API. Once you have that up, I can refactor/reorganize my side.

1191

OFFLOAD_NOT_DONE needs to come from the plugin. A query needs to return 3 states: fail, done, not done. I'm wondering how to do it properly; is there an example to follow?

openmp/libomptarget/src/device.cpp
552

Indeed, I wanted to change that. Are DeviceTy and its constructor the right place to keep and initialize this flag?

openmp/libomptarget/src/omptarget.cpp
49

Let us consolidate the API first. Any further optimization should be deferred.

protze.joachim requested changes to this revision.Aug 8 2021, 12:40 AM

Regarding the weird nature of taskyield, I refer to https://link.springer.com/chapter/10.1007%2F978-3-319-98521-3_1
Not everything in the paper applies to your situation. The most dangerous point I see here is that taskyield, if not used with care, will effectively build a recursive call stack, so that a task that called taskyield can only proceed once all recursively called tasks have finished.

I don't have access to the paper, but I do understand the case of "a recursive call stack". It can cause performance issues. It also seems to be a feature of the taskyield implementation in LLVM libomp, so this is real.

Lmgtfy: http://montblanc-project.eu/wp-content/uploads/2018/10/The-impact-of-taskyield-on.pdf

I think there is another issue: when there is no available task in the queue at the point of yield, the target task will still block.

In that case, you introduce busy waiting by polling on taskyield as long as the target is not ready. Since the hidden helper threads are pinned to the same cores as application threads, this will impact the performance of host threads. (Reject reason one)

In short, this implementation has limitations. However, that is not a big concern to me, as my use pattern doesn't suffer much from these issues.

Please add a mockup of your use pattern as a test case so that we can review and understand it.
IMHO, an implementation with expected significant drawbacks should not go into mainline libomptarget just for experimenting with performance.

I also agree that detached tasks have advantages. The details can only be sorted out once the implementation is done.
For example, in my understanding, the target task needs to be broken into parts. The initial parts can be turned into detached tasks; the finalization parts need to be a separate task that depends on the detached task. Also, some API changes are needed to pass the event token between libomp and libomptarget.
So this is quite involved; a dedicated person needs to work on it, and it will take time.

I'm not sure what you mean by finalization. The only case where I think a target task might need to be split into pieces is mapping data back from the device (I'm not sure whether the internal signalling model allows initiating the memory movement right after kernel offloading).
If such splitting were needed, we could limit the initial detach implementation to target regions without mapping at the end of the region. The application can always meet this requirement by splitting the mapping into separate directives:

#pragma omp target enter data map(to:A) depend(inout:A) nowait
#pragma omp target depend(inout:A) nowait
#pragma omp target exit data map(from:A) depend(inout:A) nowait

Right now, my implementation using taskyield needs very limited changes, and people can opt in to see if any performance can be gained.
As long as it doesn't contain functionality bugs like deadlocks, I'd like to take advantage of this feature and move my application forward to prove that OpenMP asynchronous offload works in a real application.
My main job is on the application side, and I have been stuck for years because of the lack of decent "target nowait" support in LLVM, so getting things moving is crucial.

This seems like a change addressing a very limited use case without explaining what the use-case pattern actually is. We should discuss this in one of the upcoming calls.

This revision now requires changes to proceed.Aug 8 2021, 12:40 AM
ye-luo added a comment.EditedAug 8 2021, 3:07 AM

Regarding the weird nature of taskyield, I refer to https://link.springer.com/chapter/10.1007%2F978-3-319-98521-3_1
Not everything in the paper applies to your situation. The most dangerous point I see here is that taskyield, if not used with care, will effectively build a recursive call stack, so that a task that called taskyield can only proceed once all recursively called tasks have finished.

I don't have access to the paper, but I do understand the case of "a recursive call stack". It can cause performance issues. It also seems to be a feature of the taskyield implementation in LLVM libomp, so this is real.

Lmgtfy: http://montblanc-project.eu/wp-content/uploads/2018/10/The-impact-of-taskyield-on.pdf

Thanks. It is consistent with my understanding of the "stack" implementation of taskyield.

I think there is another issue: when there is no available task in the queue at the point of yield, the target task will still block.

In that case, you introduce busy waiting by polling on taskyield as long as the target is not ready. Since the hidden helper threads are pinned to the same cores as application threads, this will impact the performance of host threads. (Reject reason one)

  1. Using hidden tasks or regular tasks is largely orthogonal to what we discussed here. Using hidden tasks is not a must for an efficient target nowait.
  2. The current implementation calling cuStreamSynchronize already blocks the application thread; my implementation avoids blocking.

In short, this implementation has limitations. However, that is not a big concern to me, as my use pattern doesn't suffer much from these issues.

Please add a mockup of your use pattern as a test case so that we can review and understand it.
IMHO, an implementation with expected significant drawbacks should not go into mainline libomptarget just for experimenting with performance.

I need it for production. The current "target nowait" has not worked as expected.

I also agree that detached tasks have advantages. The details can only be sorted out once the implementation is done.
For example, in my understanding, the target task needs to be broken into parts. The initial parts can be turned into detached tasks; the finalization parts need to be a separate task that depends on the detached task. Also, some API changes are needed to pass the event token between libomp and libomptarget.
So this is quite involved; a dedicated person needs to work on it, and it will take time.

I'm not sure what you mean by finalization. The only case where I think a target task might need to be split into pieces is mapping data back from the device (I'm not sure whether the internal signalling model allows initiating the memory movement right after kernel offloading).
If such splitting were needed, we could limit the initial detach implementation to target regions without mapping at the end of the region. The application can always meet this requirement by splitting the mapping into separate directives:

#pragma omp target enter data map(to:A) depend(inout:A) nowait
#pragma omp target depend(inout:A) nowait
#pragma omp target exit data map(from:A) depend(inout:A) nowait

You need to decrease the refcount and free memory if the count reaches 0 after completion of all the asynchronous operations. If you can take care of that in the design, it is better than asking extra work from users.
Second, splitting in the way you suggested requires dependency resolution on the host, at least right now. The added latency is a huge performance loss.

Right now, my implementation using taskyield needs very limited changes, and people can opt in to see if any performance can be gained.
As long as it doesn't contain functionality bugs like deadlocks, I'd like to take advantage of this feature and move my application forward to prove that OpenMP asynchronous offload works in a real application.
My main job is on the application side, and I have been stuck for years because of the lack of decent "target nowait" support in LLVM, so getting things moving is crucial.

This seems like a change addressing a very limited use case without explaining what the use-case pattern actually is. We should discuss this in one of the upcoming calls.

The test code in the description is a distilled version of the app. I have slides, and we can discuss them.

protze.joachim added a comment.

I think there is another issue: when there is no available task in the queue at the point of yield, the target task will still block.

In that case, you introduce busy waiting by polling on taskyield as long as the target is not ready. Since the hidden helper threads are pinned to the same cores as application threads, this will impact the performance of host threads. (Reject reason one)

  1. Using hidden tasks or regular tasks is largely orthogonal to what we discussed here. Using hidden tasks is not a must for an efficient target nowait.
  2. The current implementation calling cuStreamSynchronize already blocks the application thread; my implementation avoids blocking.

Blocking the thread does not mean "keeping the thread busy and eating all the core's cycles".

In short, this implementation has limitations. However, that is not a big concern to me, as my use pattern doesn't suffer much from these issues.

Please add a mockup of your use pattern as a test case so that we can review and understand it.
IMHO, an implementation with expected significant drawbacks should not go into mainline libomptarget just for experimenting with performance.

I need it for production. The current "target nowait" has not worked as expected.

I completely agree with the statement that "target nowait" is not implemented in libomptarget; I just disagree with the way you suggest fixing the implementation.

I also agree that detached tasks have advantages. The details can only be sorted out once the implementation is done.
For example, in my understanding, the target task needs to be broken into parts. The initial parts can be turned into detached tasks; the finalization parts need to be a separate task that depends on the detached task. Also, some API changes are needed to pass the event token between libomp and libomptarget.
So this is quite involved; a dedicated person needs to work on it, and it will take time.

I'm not sure what you mean by finalization. The only case where I think a target task might need to be split into pieces is mapping data back from the device (I'm not sure whether the internal signalling model allows initiating the memory movement right after kernel offloading).
If such splitting were needed, we could limit the initial detach implementation to target regions without mapping at the end of the region. The application can always meet this requirement by splitting the mapping into separate directives:

#pragma omp target enter data map(to:A) depend(inout:A) nowait
#pragma omp target depend(inout:A) nowait
#pragma omp target exit data map(from:A) depend(inout:A) nowait

You need to decrease the refcount and free memory if the count reaches 0 after completion of all the asynchronous operations. If you can take care of that in the design, it is better than asking extra work from users.
Second, splitting in the way you suggested requires dependency resolution on the host, at least right now. The added latency is a huge performance loss.

I didn't suggest that this should be a permanent solution; it might be an intermediate step until splitting the task into parts is implemented.
I think @tianshilei1992 already has the code in place to handle these dependencies on the device.
Also, for the code example you posted, there will be no freeing of data at the end of the target region.

By mapping all data to the device, you assume all data fits on the device at the same time. If you removed your enter/exit data at (de)allocation and relied on the target region's mapping to move the data, you would still not be able to process larger chunks of data: because of the stacking nature of taskyield, no data will be moved from the device before all target regions have finished.

Right now, my implementation using taskyield needs very limited changes, and people can opt in to see if any performance can be gained.
As long as it doesn't contain functionality bugs like deadlocks, I'd like to take advantage of this feature and move my application forward to prove that OpenMP asynchronous offload works in a real application.
My main job is on the application side, and I have been stuck for years because of the lack of decent "target nowait" support in LLVM, so getting things moving is crucial.

This seems like a change addressing a very limited use case without explaining what the use-case pattern actually is. We should discuss this in one of the upcoming calls.

The test code in the description is a distilled version of the app. I have slides, and we can discuss them.

Thanks for pointing me to the link. The code convinced me even more that taskyield is not the right solution, even for your code example.

I also think that, without ignoring the task scheduling constraint, your code will only be able to schedule one task from the taskloop during your taskyield, and a nested taskyield cannot schedule a task:
When you reach the taskyield, you have a tied task from the taskloop scheduled in the barrier of the single region (or in the taskgroup of the task executing the taskloop).
The target task is untied and does not count, so you can schedule another task from the taskloop. Now, when you reach the taskyield, you have a tied task scheduled in the outer taskyield, and none of the tasks from the taskloop can be scheduled.
You might mitigate this limitation by adding untied to the taskloop, at the cost of untied tasks, and run into the stack problem of taskyield.

ye-luo added a comment.EditedAug 8 2021, 10:36 AM

Thanks for pointing me to the link. The code convinced me even more that taskyield is not the right solution, even for your code example.

I also think that, without ignoring the task scheduling constraint, your code will only be able to schedule one task from the taskloop during your taskyield, and a nested taskyield cannot schedule a task:
When you reach the taskyield, you have a tied task from the taskloop scheduled in the barrier of the single region (or in the taskgroup of the task executing the taskloop).
The target task is untied and does not count, so you can schedule another task from the taskloop. Now, when you reach the taskyield, you have a tied task scheduled in the outer taskyield, and none of the tasks from the taskloop can be scheduled.

This is not what I observed.

The task scheduling constraints say "the set of task regions that are currently tied to the thread and that are not suspended in a barrier region". The tied task from the taskloop scheduled in the barrier of the single region (or in the taskgroup of the task executing the taskloop) doesn't count toward the set.

As the test is evolving, let me switch to a fixed commit.

Let us skip the second iteration of the taskloop: it gets scheduled after the first taskyield from the first target task and runs to completion, as it is CPU-only. The third iteration, which contains "target nowait", seems to be your concern. This task actually generates the target task and then runs to completion. Only after that does the second target task get scheduled; "the set of task regions that are currently tied to the thread and that are not suspended in a barrier region" is then empty. So the taskyield from the second target task actually continues to build up the taskyield "stack".

When I manually stepped into libomp, I saw the stack being built up by taskyield; nvprof also confirms it.
Again, building up the stack is not my concern; usually the loop under the taskloop has fewer than 5 iterations.