This patch better integrates the target nowait functions with the tasking runtime. It splits the nowait execution into two stages: a dispatch stage, which triggers all the necessary asynchronous device operations and stores a set of post-processing procedures that must be executed after said operations; and a synchronization stage, responsible for synchronizing the previous operations in a non-blocking manner and running the appropriate post-processing functions. If the operations are not completed during the synchronization stage, the attached hidden helper task is re-enqueued to any hidden helper thread to be synchronized later, allowing other target nowait regions to be dispatched concurrently.
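To make the two stages concrete, here is a self-contained toy sketch of the control flow described above (the names AsyncInfo, isDone, and the simulated helper queue are illustrative only, not libomptarget's actual interfaces):

```cpp
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Toy stand-in for the task-attached async handle.
struct AsyncInfo {
  int RemainingOps = 0;                          // simulates the device-side queue
  std::vector<std::function<void()>> PostProcessing;

  // Non-blocking completion check: returns false while work is "pending" and
  // runs the stored post-processing exactly once when everything finished.
  bool isDone() {
    if (RemainingOps > 0) {
      --RemainingOps;                            // pretend one more operation completed
      return false;
    }
    for (auto &F : PostProcessing)
      F();
    PostProcessing.clear();
    return true;
  }
};

int main() {
  std::queue<AsyncInfo *> HelperQueue;           // stands in for the hidden helper task queue

  // Stage 1: dispatch. Trigger the asynchronous work and record what must run
  // after it (copy-backs, freeing private arguments, ...).
  AsyncInfo *AI = new AsyncInfo{/*RemainingOps=*/3};
  AI->PostProcessing.push_back([] { std::printf("post-processing ran\n"); });
  HelperQueue.push(AI);

  // Stage 2: synchronization. A helper thread polls the handle; if it is not
  // done yet, the task is re-enqueued instead of blocking, so other target
  // nowait regions could be dispatched in the meantime.
  while (!HelperQueue.empty()) {
    AsyncInfo *Cur = HelperQueue.front();
    HelperQueue.pop();
    if (!Cur->isDone())
      HelperQueue.push(Cur);
    else
      delete Cur;
  }
  return 0;
}
```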
If during the synchronization stage the operations are not completed, the attached hidden helper task is re-enqueued to any hidden helper thread to be later synchronized
Does this explicitly depend on helper threads and helper tasks, or can regular OpenMP threads and tasks be used?
The existing target nowait implementation doesn't distinguish between using helper tasks or not. Setting LIBOMP_USE_HIDDEN_HELPER_TASK=0 dispatches target tasks as regular tasks. As long as there are available OpenMP threads, these tasks can still run concurrently and gain asynchronicity, although this scheme has the issue of active waiting and consuming host thread resources. I think this is the same issue you are trying to address with helper tasks.
Your scheme of splitting target tasks doesn't seem necessarily tied to helper tasks. Do you have a specific reason for restricting this feature to only hidden helper tasks?
Update re-enqueue logic to support target nowait regions with hidden helper threads disabled
Right now the synchronization is based on streams. Have you thought about synchronizing with a CUDA event and returning the stream to the pool early?
You have a good point. I thought you were talking about using the re-enqueueing scheme with normal OpenMP task regions.
Your scheme of splitting target tasks doesn't seem necessarily tied to helper tasks. Do you have a specific reason for restricting this feature to only hidden helper tasks?
There was no specific reason to limit that to hidden helper tasks. I have updated the code so it can also work when LIBOMP_USE_HIDDEN_HELPER_TASK=0. Here is the newly implemented behavior:
- HHT enabled: target nowait regions will use the new non-blocking synchronization scheme, and tasks will be re-enqueued/pushed to other hidden helper threads.
- HHT disabled / Task with a task team: target nowait regions will use the new non-blocking synchronization scheme, and tasks will be re-enqueued/pushed to the other OpenMP threads of the same task team. E.g., when a target nowait is created inside a parallel region.
- HHT disabled / Task without a task team: target nowait regions will not use the new non-blocking synchronization scheme. Synchronization will be done in a blocking manner as before, respecting the case where the target task is serialized. E.g., when a target nowait is created outside any parallel region.
What do you think about this implementation?
I have not thought about that at the moment, but that could be a nice optimization. Since the CUDA plugin currently maintains a resizable pool of streams for each device with an initial size of 32, I thought that for a first implementation this could be enough.
CUDA events have the same API as streams for non-blocking synchronization using cudaEventQuery, so we could store a single event (completionEvent) per AsyncInfo and use that when synchronizing with SyncType::NON_BLOCKING. I have one question though: does querying a CUDA event for completion synchronize all the operations prior to the event on the stream? Or must another thread on the host synchronize the stream? If only synchronizing the event is enough, it would make using them much simpler.
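For reference, a minimal sketch of the event-based query being discussed, using the CUDA runtime API for brevity (the plugin itself is written against the driver API, so treat this purely as an illustration): the event is recorded after the last enqueued operation and then polled with cudaEventQuery.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  cudaStream_t Stream;
  cudaEvent_t CompletionEvent;
  cudaStreamCreate(&Stream);
  cudaEventCreateWithFlags(&CompletionEvent, cudaEventDisableTiming);

  // ... enqueue asynchronous work on Stream (H2D copies, kernel, D2H copies) ...

  // Record the event after the last operation. Per CUDA's documented behavior,
  // the event completes once all work submitted to Stream before the record
  // has finished; no host thread has to synchronize the stream for that.
  cudaEventRecord(CompletionEvent, Stream);

  // Non-blocking completion check, analogous to querying the stream itself:
  cudaError_t Err = cudaEventQuery(CompletionEvent);
  if (Err == cudaSuccess)
    std::printf("all operations before the event are done\n");
  else if (Err == cudaErrorNotReady)
    std::printf("still running, query again later\n");

  cudaEventDestroy(CompletionEvent);
  cudaStreamDestroy(Stream);
  return 0;
}
```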
My second thought on this is: let us do stream sync for now.
- NVIDIA: if we sync with an event and return streams to the pool early, tasks may get serialized if two of them end up in the same stream.
- AMD: at the HSA level, there are only signals (events).
- Level Zero: it depends on which type of command list is being used.
So it seems that, at the libomptarget level, it should be flexible and let the plugin decide which mechanism to use.
First of all, sorry for the late reply.
Ok, we can leave it as is for now. Just a comment on the NVIDIA case for future development: although we may not completely avoid the serialization problem using the pool of streams, I believe we can reduce it if the pool access pattern is changed. Currently, streams are acquired/released from/to the pool in a stack-like manner, which induces frequent re-use of the streams nearest to the top of the stack/pool (in the code, the streams near the beginning of the Resources vector). By changing this access pattern to be round-robin-like, the stream use frequency may be evened out over the pool and serialization becomes less of a problem.
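Purely as an illustration of that suggestion (hypothetical names, not the plugin's actual pool code): switching the free list from LIFO to FIFO already spreads reuse over the whole pool in a round-robin-like fashion.

```cpp
#include <deque>
#include <mutex>

// Hypothetical pool: resources are handed out in FIFO order instead of
// stack (LIFO) order, so the same few streams are not constantly reused.
template <typename ResourceTy> class RoundRobinPoolTy {
  std::deque<ResourceTy> Free;
  std::mutex Mtx;

public:
  void release(ResourceTy R) {
    std::lock_guard<std::mutex> Guard(Mtx);
    Free.push_back(std::move(R)); // released resources go to the back ...
  }

  bool acquire(ResourceTy &R) {
    std::lock_guard<std::mutex> Guard(Mtx);
    if (Free.empty())
      return false; // the caller would grow the pool here
    R = std::move(Free.front()); // ... and the least recently used one is handed out
    Free.pop_front();
    return true;
  }
};
```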
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
194 | Just adding some context to why it was done this way:
With all that in mind, I know std::function may lead to an additional heap allocation and one more level of indirection due to type erasure, preventing some code optimizations. If that is not desirable despite the presented context, I can change how the post-processing procedures are stored/called, but I would really like to keep some of the points described above, especially the lifetime correctness of the post-processing data. Do you have something in mind that could help me with that? Maybe defining a different struct for each post-processing function and storing it in the AsyncInfoTy as a variant could be enough. What do you think? |
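For context, a minimal sketch (hypothetical, simplified types; not the actual libomptarget declarations) of the lifetime pattern the comment above describes: each post-processing callable owns its payload by move, so the captured data lives exactly as long as the pending callback.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for the real types.
struct PrivateArgumentManagerTy {
  std::vector<char> Buffer;                // packed first-private arguments
  int free() { Buffer.clear(); return 0; } // 0 standing in for OFFLOAD_SUCCESS
};

struct AsyncInfoTy {
  // Type-erased post-processing procedures, run once synchronization succeeds.
  std::vector<std::function<int()>> PostProcessingFunctions;

  template <typename FuncTy> void addPostProcessingFunction(FuncTy &&F) {
    PostProcessingFunctions.emplace_back(std::forward<FuncTy>(F));
  }

  int runPostProcessing() {
    for (auto &F : PostProcessingFunctions)
      if (int Rc = F())
        return Rc;
    PostProcessingFunctions.clear();
    return 0;
  }
};

void targetExample(AsyncInfoTy &AsyncInfo) {
  PrivateArgumentManagerTy PrivateArgumentManager;
  // ... first-private arguments are packed and copied to the device
  //     asynchronously here ...

  // Moving the manager into the lambda ties its lifetime to the callback: the
  // data stays alive until post-processing runs and is destroyed only when the
  // stored std::function goes away. The lambda is mutable because free() is
  // non-const (the point raised at omptarget.cpp:1521 below).
  AsyncInfo.addPostProcessingFunction(
      [PAM = std::move(PrivateArgumentManager)]() mutable -> int {
        return PAM.free();
      });
}
```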
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
194 | I would have assumed that if we have 2 potential callbacks, let's say static functions F and G, we would have two members, both void* (void *PayloadForF = nullptr; void *PayloadForG = nullptr;), and if they are not null we call F and G respectively, passing the payload. I'm fine with keeping it this way for now; we can see if there is a need to change it. | |
openmp/libomptarget/include/omptargetplugin.h | ||
160 | Describe the return value, it's not clear what would be returned if it does or doesn't synchronize. | |
openmp/libomptarget/src/interface.cpp | ||
164 | I assume we can outline some of this as it is probably the same as in the sync version, right? Let's avoid duplication as much as possible. Same below. | |
239 | So, when do we call this but don't want to actually do targetDataUpdate? I am confused. Same above. | |
347 | Clang format. | |
openmp/libomptarget/src/omptarget.cpp | ||
698 | Documentation, plz. | |
1209 | Why are the const problematic? | |
1521 | FWIW, mutable is really not my favorite way of handling things. | |
openmp/libomptarget/src/private.h | ||
214–217 | Here and elsewhere, we should prefer early exit and no else after return. | |
openmp/runtime/src/kmp_tasking.cpp | ||
1145 | @tianshilei1992 you need to look at these changes. | |
5187 | current task? |
openmp/runtime/src/kmp_tasking.cpp | ||
---|---|---|
5157 | libomptarget (for now) doesn't require that the thread be an OpenMP thread. Using libomp's gtid generally breaks in that case. Either we add the requirement, which needs to be discussed further, or an alternative method is needed to implement that. If libomp is entered from a fresh thread, a new root will be created. |
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
194 | Okey, for now, I'll keep it like this then. | |
openmp/libomptarget/include/omptargetplugin.h | ||
160 | Good point. I have updated both the documentation and the function name to better reflect what this new interface should do. Do you think it is more clear now? | |
openmp/libomptarget/src/interface.cpp | ||
164 | Yep, you are correct. I have created two new "launchers" that can unify most of the code paths for the execution and data-related functions for the normal and nowait cases. Since all data-related interface entries practically have the same signature, a single launcher is enough for them all. A new class called TaskAsyncInfoTy also unifies the code related to managing the task async handle when executing target nowait regions. What do you think about this new code structure? | |
239 | The goal is to dispatch the device side operations (i.e., call the functions in the omptarget.cpp file) only when a new async handle is created. If we detect that the task already contains an async handle, that means that the device side operations were already dispatched and we should only try to synchronize it in a non-blocking manner. The new TaskAsyncInfoTy class has a function called shouldDispatch that now encapsulates this detection logic with proper documentation. Do you think it is more clear now? Should we add a comment to each call site as well? | |
openmp/libomptarget/src/omptarget.cpp | ||
698 | Done. Could you check if I missed anything? | |
700 | Oops. Thanks, I have updated the var name. | |
1209 | In summary, we want to be able to move PrivateArgumentManagerTy instances into the post-processing lambdas, so their lifetime is automatically managed by them. The problem is with how llvm::SmallVector implements its move constructor. Unfortunately, it is implemented in terms of move-assignment instead of as a proper move constructor, meaning it cannot be instantiated for structs with const members (see the reduced example after this batch of inline comments). If we replace FirstPrivateArgInfo with a std::vector, the problem does not happen, because the STL properly implements a move constructor for vectors. Since I think we do not want to use std::vector anymore, I just removed the const from the members; they are not even accessible outside the PrivateArgumentManagerTy class anyway. What do you think of this approach? | |
1521 | mutable was added because we need to call a non-const member function of PrivateArgumentManager (i.e., free). I know that makes the lambda a function with internal state, since multiple calls to it will generate different results, but I don't know of another approach. Maybe use call_once (IMHO, a little bit overkill) or remove the lambdas altogether and use another approach to store the post-processing functions and their payload. What do you think? | |
openmp/runtime/src/kmp_tasking.cpp | ||
1145 | Any comments on whether we can move the __kmp_unexecuted_hidden_helper_tasks decrement to this place? | |
5157 | Uhm, I did not know about that. Although I think such a requirement makes sense, it may be out of the scope of this patch. What we could do is check if the current thread is registered inside libomp somehow, falling back to the current execution path that does not depend on the task team information. Do you know if we can use __kmpc_global_thread_num's return value to verify that? Maybe assert that the returned GTID is valid and within a well-known range (e.g., [0, NUM_REGISTERED_OMP_THREADS]). Just a note, NUM_REGISTERED_OMP_THREADS is not a valid variable; I just don't know where, or even if, such information is stored. Do you know where I can find this? |
openmp/runtime/src/kmp_tasking.cpp | ||
---|---|---|
5187 | Typo. It should have been current thread! |
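As an aside on the omptarget.cpp:1209 discussion above, here is a reduced, self-contained illustration of the C++ rule involved (field names are hypothetical, not the actual FirstPrivateArgInfo layout): a const data member deletes the implicit move-assignment operator, while std::vector's own move constructor never needs element-wise move-assignment because it only steals the heap buffer.

```cpp
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical reduced version of a FirstPrivateArgInfo-like entry.
struct ArgInfoTy {
  const int AlignedSize; // const member: implicit move-assignment is deleted
  void *HstPtr;
};

static_assert(!std::is_move_assignable<ArgInfoTy>::value,
              "element-wise move-assignment is unavailable");
static_assert(std::is_move_constructible<ArgInfoTy>::value,
              "but construction from an rvalue still works");

int main() {
  std::vector<ArgInfoTy> Src;
  Src.push_back(ArgInfoTy{16, nullptr});
  // std::vector's move constructor steals the heap buffer and never needs to
  // move-assign elements, so this compiles even with the const member.
  std::vector<ArgInfoTy> Dst = std::move(Src);
  // A small-vector whose move constructor is written in terms of operator=
  // (as described in the comment above) would need ArgInfoTy to be
  // move-assignable and thus fail to compile, which is why the const was
  // dropped from the members.
  return Dst.size() == 1 ? 0 : 1;
}
```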
New set of comments, some minor but not all. Some comments are "out-of-order" as I started commenting top-down while not understanding it all. Read accordingly.
openmp/libomptarget/include/device.h | ||
---|---|---|
441 | Reading this, I don't know what this returns. SUCCESS if it completed and FAIL otherwise? Or FAIL only if something failed? "Must be called multiple times" is also unclear to me; I doubt that we should put that sentence here. Update: I think I now understand the latter part, but if I'm right we should change the interface. So, queryAsync is supposed to be called before isDone to make sure isDone returns the right value, correct? If so, we should not expose queryAsync to the user, as there doesn't seem to be a reason to call it otherwise. Arguably, calling it doesn't provide information, just a state change, thus a secondary query is necessary. | |
openmp/libomptarget/include/omptarget.h | ||
217 | Nit: Rename argument to avoid shadowing. Make the version that takes the sync type, and probably the sync type, private. IsDone can call the private version, users only the blocking one. | |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
1272 | return OFFLOAD_SUCCESS; will reduce indention and logic later on. | |
1278 | ||
openmp/libomptarget/src/interface.cpp | ||
86 | This is different but similar to the condition used in the other new "helper" below. I have the same concerns as there. When would we ever not call the target data function? | |
95 | Is FromMapper ever set to true? Did I miss that? | |
252 | There is more duplication in the callees to be moved here, no? | |
279 | This I don't understand. Why do we have to wait to enqueue the kernel? And even if, how does this not accidentally skip the target region and we will never execute it at all? Long story short, I doubt the conditional here makes sense. | |
openmp/libomptarget/src/omptarget.cpp | ||
41–50 | This would move to isDone below. | |
openmp/libomptarget/src/private.h | ||
227 | Should this be guarded by IsNew? | |
240 | I don't understand what we want/need this dispatch idea for. It seems to skip operations but I don't understand how we would not forget about them and go back. | |
openmp/libomptarget/include/device.h | ||
---|---|---|
441 | I should probably update the documentation of this function to reflect the new one added to omptargetplugin.h:__tgt_rtl_query_async, which states the following:
Thus, queryAsync (which calls __tgt_rtl_query_async) is a non-blocking version of synchronize. That means we must call it multiple times until all operations are completed and the plugin invalidates the queue inside AsyncInfo. Here, AsyncInfoTy::isDone is just a helper function that indicates whether the device-side operations are completed or not based on said queue. We need to externalize queryAsync to its user AsyncInfoTy so it can call the non-blocking implementation. Considering your comment, what do you think of making things more explicit by adding an output flag argument to queryAsync (and thus to __tgt_rtl_query_async) that returns true if all operations are completed and false otherwise? int32_t queryAsync(AsyncInfoTy &AsyncInfo, bool &IsCompleted); | |
openmp/libomptarget/include/omptarget.h | ||
217 | Thanks, I am renaming the type to SyncTypeTy to match the other type names. Regarding the second comment, I don't quite understand what you mean with:
isDone only checks whether the operations inside an AsyncInfoTy instance are completed or not; it does not call any plugin function at all. Are you suggesting that we move all the non-blocking synchronization code into isDone? If so, this means we would have some code duplication regarding the post-processing functions due to two separate synchronization paths, but if you think that is better I can do it. | |
openmp/libomptarget/include/omptargetplugin.h | ||
160 | @jdoerfert any comments on the new function and its doc? | |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
1272 | Perfect, done! | |
1278 | Thanks, done! | |
openmp/libomptarget/src/interface.cpp | ||
86 | That was a code error; the target helper should also use the Dispatch variable. Thanks for noticing. | |
86 | With the RFC implemented, we are now re-enqueuing the same task multiple times until all the device side operations are completed. Because of that, we may call the __tgt_target_* functions multiple times as well. Since we want to dispatch the operations only once, we call the target data functions only when the target task is first encountered. The next calls will only synchronize the operations instead of dispatching them again, that's why AsyncInfo.synchronize is always called right below it. | |
95 | Nope, it is not. I am removing it from the arguments and always passing false. | |
252 | Yep, that was an error on my part; Dispatch should be used instead of AsyncInfo.isDone(). Regarding the other arguments, they are obtained from the wrapper TaskAsyncInfoTy, not from AsyncInfoTy. I can change that to unify the wrapper code and AsyncInfoTy, but I would probably be putting two different responsibilities into the AsyncInfoTy struct. What do you think? | |
279 | You are right, this was a leftover from before the refactoring of the interface file. Although it worked, it only did so because the queue pointer was null and isDone would return true at first. Replaced it with the Dispatch variable. | |
openmp/libomptarget/src/omptarget.cpp | ||
41–50 | Ok, I can do it, no problem. But just to make sure I got it right with respect to the other comments: you are suggesting this so we would use synchronize for the blocking synchronization and isDone for the non-blocking one, correct? If that is so, just remember that in the current code the PostProcessingFunctions must be called in both cases, so isDone would need to be called when blocking synchronizing as well. | |
openmp/libomptarget/src/private.h | ||
227 | Nope, IsNew indicates that a new task-attached AsyncInfo has just been allocated, not whether we should deallocate it. The variable is primarily used to indicate that we must dispatch new operations to the new handle. Maybe I should rename it to just ShouldDispatch. Deallocation is always done when AsyncInfo->isDone() returns true, which is previously checked. | |
240 | Here is the main idea:
I have a presentation that explains it on slides 19-24, but I believe I am failing to describe that in the code. I'll try to come up with some better documentation for this dispatch/synchronize idea. | |
openmp/runtime/src/kmp_tasking.cpp | ||
5157 | @tianshilei1992 any comments on this? |
I'll wait for the updated version to go through again. Below are two clarification questions. I think I am now closer to understanding some of the stuff I was confused about. If we end up keeping this scheme, we need to adjust some names. I am hoping we can simplify a few things though.
openmp/libomptarget/src/private.h | ||
---|---|---|
227 | I'm worried here that we might not delete the AsyncInfo or delete it multiple times. Are you saying there is exactly one TaskAsyncInfoTy that will own the AsyncInfo object at any given time? If not, how do we avoid a double free? | |
240 | Ok, that makes more sense. Now to help (even) me understand this, why do we need to call the functions from step 1 in step 3? We seem to use the "Dispatch" argument to skip most of what they do (the target data part of a targetDataXXX) anyway, no? |
openmp/libomptarget/include/device.h | ||
---|---|---|
441 | My point is: queryAsync is useless if not followed by an isDone, right? |
openmp/runtime/src/kmp_tasking.cpp | ||
---|---|---|
5157 | IIRC some return values, other than the real thread ids, are reserved for specific purposes. That's one of the reasons that I didn't use negative thread ids for hidden helper threads. I don't know for sure if there is a value designed to say it is not an OpenMP-managed thread. We can probably add one. |
Address more code review changes. This update also fixes the dispatch logic on target regions.
Added some more comments about the new execution flow and the thread id problem.
openmp/libomptarget/include/device.h | ||
---|---|---|
441 | Uhm, I think I got your point. I'll update AsyncInfoTy::isDone so it can be called standalone without a prior call to AsyncInfoTy::synchronize (which calls the device DeviceTy::queryAsync). This indeed makes the interface better. I was a little bit confused when you said DeviceTy::queryAsync should not be exposed, but now I got it. | |
openmp/libomptarget/include/omptarget.h | ||
217 | Just a note, now I got the correct idea: we should make isDone a callable as a standalone function! | |
openmp/libomptarget/src/private.h | ||
227 | When executing a target nowait region, the actual owner of the AsyncInfo is the task itself. The structure is allocated when the task first executes and calls any of the libomptarget functions (IsNew is true) and it is deallocated when all the device-side operations are completed (AsyncInfo::isDone returns true). Here, TaskAsyncInfoTy is just a wrapper around a task-owned AsyncInfoTy (stored inside the OpenMP task data) to mainly automate the allocation and deallocation logic. But following the OpenMP execution flow, since a task is owned and executed by only a single thread at any given time, only one TaskAsyncInfoTy will be managing the task-owned AsyncInfoTy object. This should avoid any double frees, but I understand this could be a weak assumption. If that is enough I could add documentation stating it, but probably having some code checks for that would be best. Maybe assertions at the task deallocation function ensuring no valid AsyncInfoTy address is left? | |
240 | That happens because the dispatch and synchronization logic are placed in the same interface function. The first call to that function done by a task dispatches the operations, while the subsequent calls try to do the non-blocking synchronization. Maybe a better way of doing it would be to add a new interface function with the sole purpose of executing said synchronization. This way, when a task is re-executed, it calls this new function to only do the synchronization instead of the previous outline function. What do you think? This can better split the dispatch and synchronization code. | |
openmp/runtime/src/kmp_tasking.cpp | ||
5157 | That would be perfect. I know we have KMP_GTID_DNE (value of -2) that represents the return value of a non-existent thread ID. My only problem is: do you know if __kmpc_global_thread_num returns that when called from a non-OpenMP thread? I'll do some local checking on that! |
This update:
- Internalize SyncType into AsyncInfoTy
- Split dispatch and synchronization code paths. No more Dispatch checks! New interface function for target nowait synchronization.
- Make isDone a standalone function. No need to call synchronize before isDone!
@jdoerfert I believe the new revision has a better code structure for the dispatch and synchronize stages. Now we have an exclusive function only for synchronization. No more Dispatch checks!
@tianshilei1992 I am still checking on the validity of the returned GTID from __kmpc_global_thread_num when called from outside an OpenMP thread. The best solution would be to pass the address of the taskdata async handle through the interface for the *_nowait functions directly as a new parameter, but that would require changes to the code generation step. I can do that right now, but I would prefer to change the clang code generation on a different patch, as I am not that familiar with it. What do you think?
Great, I think we are almost there. The code looks much cleaner and the approach is much clearer than it was in the beginning (IMHO, I hope people agree).
I left some final clarification questions and some cleanup requests.
One conceptual question which is not blocking the patch though:
Does this approach require hidden helper threads to execute the target task or could we enable it for non-hidden helper threads as well?
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
226 | Make it clear that this happens only once. Either here or via synchronize. Right now it could be read like every isDone call might invoke the post-processing functions. | |
386 | It's void return but comment talks about the return value. | |
openmp/libomptarget/src/interface.cpp | ||
64 | If you make this a templated function accepting the (sub)type of the AsyncInfo object instead of the object itself, you can move all the remaining duplication at the call sites (namely: checkCtorDtor, get device, create AsyncInfo) into this function. WDYT? | |
247–252 | Same comment as above wrt. templated version. The duplication we introduce is something I would like to avoid. | |
392 | Do we know the above call is "noreturn"? If not, we should explicitly exit here. On second thought, we should exit either way. | |
openmp/libomptarget/src/omptarget.cpp | ||
79 | This is not a good idea, I think. The state of PostProcessingFunctions is undefined afterwards, even if this works in practice. Simply iterate PostProcessingFunctions and then clear it. | |
openmp/libomptarget/src/private.h | ||
212 | Is there a ! missing? | |
openmp/runtime/src/kmp_tasking.cpp | ||
1802 | Much better than the "execute but don't actually execute" version before. Thanks! | |
5181 | This can fail, right? If so, we should report it to the user and deal with it properly. Otherwise we should assert it can't fail. |
It should work with HHT and normal OpenMP threads (e.g., inside a parallel/single region). The only place where we could have some problems would be in the re-enqueueing of the tasks, but that is already taken care of: if HHTs are disabled, the task will be given to another normal OpenMP thread; if they are enabled, the task will be given to another HHT. I'll do some more local testing, but we should not have any problems in either config.
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
189 | SyncTypeTy looks weird. It's like having an LLVM class called TypeTy. I think SyncTy or SyncType are both fine. | |
openmp/libomptarget/src/interface.cpp | ||
84 | nit: targetDataFunction | |
openmp/runtime/src/kmp_tasking.cpp | ||
5191 | Do we need to check if gtid is valid here? |
I have a general question. D81989 is using task yield to potentially improve concurrency instead of blocking the hidden helper task. I know task yield may sometimes have side effects. My question is: compared with using task yield, is creating and enqueuing tasks (repeatedly) better?
This update:
- Update AsyncInfo documentation
- Reduce code duplication in the interface
- Rename SyncTypeTy to SyncTy
- Fix exporting __tgt_target_nowait_query
- Refactor task async handle acquisition
- Optimize fast queue completion
I believe I answered and fixed most of the comments on this revision. Waiting for the next round. 😉
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
189 | It makes sense, that was a little redundant. It is now renamed to SyncTy. Thanks! | |
386 | Yep, that was a leftover from some previous revisions. Thanks! | |
openmp/libomptarget/src/interface.cpp | ||
64 | Indeed, that is a nice idea. Since TaskAsyncInfoWrapperTy is a wrapper around AsyncInfoTy, I only needed to acquire a reference to it so we would end up always using AsyncInfoTy. | |
84 | Uhm, TargetDataFunction is a function pointer. Shouldn't we also capitalize the first word in this case? | |
247–252 | Done as well. | |
392 | Uhm, yep, you are right. We should always exit here. I am converting to using FATAL_MESSAGE0, so we directly abort the program. | |
openmp/libomptarget/src/omptarget.cpp | ||
79 | Uhm, really? When moving a SmallVector like this, wouldn't PostProcessingFunctions be emptied and all the values moved to Functions? | |
openmp/libomptarget/src/private.h | ||
212 | I forgot to submit some local changes! Done. | |
openmp/runtime/src/kmp_tasking.cpp | ||
1802 | Nice. And yep, that is already done at the task finalization. Take a look at the changes at kmp_tasking.cpp:1099. When async_handle is not null, the task is re-enqueued; otherwise, the task is finished normally. | |
5181 | Uhm, that makes sense. I'll try to add this functionality and fall back to the old execution flow if it fails. | |
5191 | Uhm, yep we indeed need to check it. I'll add it here and return false if gtid is invalid. This way we can fall back to the old execution flow. |
Generally fine from my end. @tianshilei1992 wdyt?
@kevinsala, FYI, there will be a new plugin API we need to port over to the new plugins.
openmp/libomptarget/src/omptarget.cpp | ||
---|---|---|
79 | The standard says PostProcessingFunctions is in an unspecified state after the move. If we look at SmallVector we know the state, but I don't understand why we need to rely on implementation-defined behavior here at all. We don't save any lines, nor any work, compared to just iterating PostProcessingFunctions and then clearing it, no? | |
openmp/libomptarget/src/private.h | ||
227 | All asserts should have messages. |
TBH I'd like to see/understand if task yield is worse than this method. If not, I'm not sure why we'd like to have so much complicated code.
Another thing is, if we only have one target task, which is really long, then based on this method it's gonna check if it is done, enqueue the task, and then the task is scheduled to execute (potentially by another thread, which might hurt locality), check, enqueue the task, again and again. Though the existing method blocks the whole thread, at least it will not keep re-executing. I think we actually need a counter for the bottom-half task. If it is executed too many times, that indicates we don't have enough other tasks to execute. In this case, just blocking the thread is fine.
Sorry, I missed your previous comment before.
I believe this method can be better than using task yield because it has a lower probability of "starvation"; let me explain. AFAIK, task yield is currently implemented by executing other tasks in the middle of the execution of the current task, meaning the task execution state (and the order of resumption) is stored on the call stack. This can incur a "starvation-like" problem, where an earlier task whose operations have completed cannot be finished because it is at the bottom of the call stack. This can also hold up other tasks that depend on this "starved" one. Another problem is that, depending on the number of ready target regions, a program can even exceed the stack limit due to the many in-flight tasks. If task yield were implemented in a coroutine-like model (maybe some future work), where yielded tasks could be re-enqueued and re-ordered, we would probably use it, since that would make the code much simpler.
Another thing is that this can also be an initial point of integration for the device-side resolution of dependencies (although D81989 did that as well).
Uhm, that makes perfect sense. I'll implement the counting mechanism and update the patch.
This update:
- Unify synchronize and isDone methods. No more code duplication between them.
- Add query async to plugin-nextgen.
- Decide the sync method based on per-thread exponential backoff counting (sketched below).
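To make the last bullet concrete, here is a minimal sketch of such a per-thread exponential backoff (hypothetical names, threshold, and growth policy; not the patch's actual code): repeated unsuccessful non-blocking queries grow a thread_local counter, and once it passes a threshold the thread falls back to blocking synchronization.

```cpp
#include <cstdint>

// Hypothetical sketch only; names and values are illustrative.
struct ExponentialBackoffTy {
  uint64_t Backoff = 1;
  static constexpr uint64_t BlockingThreshold = 1ULL << 10;

  // Called each time a non-blocking query finds the region still running.
  void increment() {
    if (Backoff < BlockingThreshold)
      Backoff *= 2;
  }
  // Called when a region completes quickly, so the thread goes back to
  // non-blocking queries.
  void reset() { Backoff = 1; }
  // After enough unsuccessful queries, just block: there apparently is not
  // enough other work to make re-enqueueing worthwhile.
  bool shouldBlock() const { return Backoff >= BlockingThreshold; }
};

// One counter per runtime thread, so each thread adapts to the target region
// payloads it actually observes (the per-thread vs. per-task trade-off is
// discussed further below).
static thread_local ExponentialBackoffTy QueryCounter;
```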
openmp/libomptarget/src/omptarget.cpp | ||
---|---|---|
79 | I was thinking about a future use case: if a post-processing function generates more asynchronous device-side operations, it may want to add a new post-processing function to the vector, but we cannot do that while iterating over it. The spec says that, if a push/emplace back resizes the vector, any previous iterator is invalidated, which would make the loop invalid. I think that is the case for SmallVector as well. Right now, the three post-processing functions do not do that; they all do synchronous device operations. But if in the future someone adds a post-processing function that does that, it cannot be blindly done without taking care of this situation. Do you think we could leave this "unimplemented" right now? If so, I can do the iteration-then-clear approach. Just a "side question": which standard does SmallVector follow? I am asking because the STL says that a vector is guaranteed to be empty after a move. If that is the case for SmallVector too, then PostProcessingFunctions would be in a valid state, no? | |
1209 | @jdoerfert any new comment on this? | |
1521 | @jdoerfert any new comment on this? | |
openmp/runtime/src/kmp_tasking.cpp | ||
1145 | @tianshilei1992 is this correct? |
I only have one remaining question. @tianshilei1992 might have more though.
openmp/libomptarget/src/interface.cpp | ||
---|---|---|
403 | Why is this thread_local? Should it be tied to the async info object with non-blocking tasks? | |
openmp/libomptarget/src/omptarget.cpp | ||
79 | If you are worried about inserting, use for (int i = 0; i < C.size(); ++i). Anyhow, let's keep it this way for now, but add an explicit clear at the end, just to be sure (and explicit). | |
openmp/libomptarget/src/interface.cpp | ||
---|---|---|
403 | Yeah, I'd expect the counter to be tied to a task. |
Reverse ping, let's get this in, we can build dependence via event support on top of it.
There is only the thread_local remark remaining, I think.
@jdoerfert there are some other comments pending. How should we proceed?
openmp/libomptarget/src/interface.cpp | ||
---|---|---|
403 | @jdoerfert @tianshilei1992 my idea here is that we should be able to go back and forth between non-blocking and blocking synchronization on a per-thread basis, depending on each thread's own state.
This way, we allow the runtime threads to adapt themselves to the target region payloads. I see two possible problems in placing the counter in a per-task manner:
What do you think about the above points? Do they make sense? | |
openmp/libomptarget/src/omptarget.cpp | ||
79 | Uhm, yep, making it explicit is better. What do you think? |
LG
openmp/libomptarget/src/interface.cpp | ||
---|---|---|
403 | Hm, it seems reasonable for your scenario. It is unclear what we should optimize for here. I'm OK with keeping it like this as it might be better for a "always blocking tasks" and "consistently mixed task" load. The per-task impl. would only be good if we have "totally independent tasks", I guess. | |
openmp/libomptarget/src/omptarget.cpp | ||
79 | It's fine for now. | |
1209 | It's ok to remove the const. | |
1521 | Add a TODO to look into this in the future. | |
openmp/runtime/src/kmp_tasking.cpp | ||
1145 | I think @tianshilei1992 mentioned to me this should be fine. |
Getting compiler crash with this change:
# End machine code for function __tgt_target_nowait_query.

*** Bad machine code: FrameSetup is after another FrameSetup ***
- function: __tgt_target_nowait_query
- basic block: %bb.5 init.check (0x556af1cc4098)
- instruction: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def $rsp, implicit-def $eflags, implicit-def $ssp, implicit $rsp, implicit $ssp, debug-location !6722; llvm-project/openmp/libomptarget/src/interface.cpp:328:42

*** Bad machine code: FrameDestroy is not after a FrameSetup ***
- function: __tgt_target_nowait_query
- basic block: %bb.5 init.check (0x556af1cc4098)
- instruction: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp, debug-location !6722; llvm-project/openmp/libomptarget/src/interface.cpp:328:42

fatal error: error in backend: Found 2 machine code errors.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0. Program arguments: clang++ --target=x86_64-unknown-linux-gnu -DEXPENSIVE_CHECKS -DGTEST_HAS_RTTI=0 -DOMPTARGET_DEBUG -DOMPT_SUPPORT=1 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -Illvm/include -Iinclude -Iruntimes/runtimes-bins/openmp/runtime/src -Illvm-project/openmp/libomptarget/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -Wall -Wcast-qual -Wformat-pedantic -Wimplicit-fallthrough -Wsign-compare -Wno-enum-constexpr-conversion -Wno-extra -Wno-pedantic -std=c++17 -g -fPIC -fno-exceptions -fno-rtti -gsplit-dwarf -MD -MT openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o -MF openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o.d -o openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o -c llvm-project/openmp/libomptarget/src/interface.cpp
Looks like it is caused by the placement of the static variable declaration.