This patch better integrates the target nowait functions with the tasking runtime. It splits the nowait execution into two stages: a dispatch stage, which triggers all the necessary asynchronous device operations and stores a set of post-processing procedures that must be executed after said operations; and a synchronization stage, responsible for synchronizing the previous operations in a non-blocking manner and running the appropriate post-processing functions. If the operations are not completed during the synchronization stage, the attached hidden helper task is re-enqueued to any hidden helper thread to be synchronized later, allowing other target nowait regions to be dispatched concurrently.
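To make the two stages concrete, here is a self-contained toy sketch of the control flow described above (the names AsyncInfo, isDone, and the simulated helper queue are illustrative only, not libomptarget's actual interfaces):

```cpp
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Toy stand-in for the task-attached async handle.
struct AsyncInfo {
  int RemainingOps = 0;                          // simulates the device-side queue
  std::vector<std::function<void()>> PostProcessing;

  // Non-blocking completion check: returns false while work is "pending" and
  // runs the stored post-processing exactly once when everything finished.
  bool isDone() {
    if (RemainingOps > 0) {
      --RemainingOps;                            // pretend one more operation completed
      return false;
    }
    for (auto &F : PostProcessing)
      F();
    PostProcessing.clear();
    return true;
  }
};

int main() {
  std::queue<AsyncInfo *> HelperQueue;           // stands in for the hidden helper task queue

  // Stage 1: dispatch. Trigger the asynchronous work and record what must run
  // after it (copy-backs, freeing private arguments, ...).
  AsyncInfo *AI = new AsyncInfo{/*RemainingOps=*/3};
  AI->PostProcessing.push_back([] { std::printf("post-processing ran\n"); });
  HelperQueue.push(AI);

  // Stage 2: synchronization. A helper thread polls the handle; if it is not
  // done yet, the task is re-enqueued instead of blocking, so other target
  // nowait regions could be dispatched in the meantime.
  while (!HelperQueue.empty()) {
    AsyncInfo *Cur = HelperQueue.front();
    HelperQueue.pop();
    if (!Cur->isDone())
      HelperQueue.push(Cur);
    else
      delete Cur;
  }
  return 0;
}
```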
If during the synchronization stage the operations are not completed, the attached hidden helper task is re-enqueued to any hidden helper thread to be later synchronized
Does this explicitly depend on helper threads and helper tasks, or can regular OpenMP threads and tasks be used?
The existing target nowait implementation doesn't distinguish between using helper tasks or not. Setting LIBOMP_USE_HIDDEN_HELPER_TASK=0 dispatches target tasks as regular tasks. As long as there are available OpenMP threads, these tasks can still run concurrently and gain asynchronicity, although this scheme has the issue of active waiting and consuming host thread resources. I think this is the same issue you are trying to address with helper tasks.
Your scheme of splitting target tasks doesn't seem necessarily tied to helper tasks. Do you have a specific reason for restricting this feature to only hidden helper tasks?
Update re-enqueue logic to support target nowait regions with hidden helper threads disabled
Right now the synchronization is based on streams. Have you thought about synchronizing with a CUDA event and returning the stream to the pool early?
You have a good point. I thought you were talking about using the re-enqueueing scheme with normal OpenMP task regions.
Your scheme of splitting target tasks doesn't seem necessarily tied to helper tasks. Do you have a specific reason for restricting this feature to only hidden helper tasks?
There was no specific reason to limit that to hidden helper tasks. I have updated the code so it can also work when LIBOMP_USE_HIDDEN_HELPER_TASK=0. Here is the newly implemented behavior:
- HHT enabled: target nowait regions will use the new non-blocking synchronization scheme, and tasks will be re-enqueued/pushed to other hidden helper threads.
- HHT disabled / Task with a task team: target nowait regions will use the new non-blocking synchronization scheme, and tasks will be re-enqueued/pushed to the other OpenMP threads of the same task team. E.g., when a target nowait is created inside a parallel region.
- HHT disabled / Task without a task team: target nowait regions will not use the new non-blocking synchronization scheme. Synchronization will be done in a blocking manner as before, respecting the case where the target task is serialized. E.g., when a target nowait is created outside any parallel region.
What do you think about this implementation?
I have not thought about that at the moment, but that could be a nice optimization. Since the CUDA plugin currently maintains a resizable pool of streams for each device with an initial size of 32, I thought that for a first implementation this could be enough.
CUDA events have the same API as streams for non-blocking synchronization using cudaEventQuery, so we could store a single event (completionEvent) per AsyncInfo and use that when synchronizing with SyncType::NON_BLOCKING. I have one question though: does querying a CUDA event for completion synchronize all the operations prior to the event on the stream? Or must another thread on the host synchronize the stream? If only synchronizing the event is enough, it would make using them much simpler.
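For reference, a minimal sketch of the event-based query being discussed, using the CUDA runtime API for brevity (the plugin itself is written against the driver API, so treat this purely as an illustration): the event is recorded after the last enqueued operation and then polled with cudaEventQuery.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  cudaStream_t Stream;
  cudaEvent_t CompletionEvent;
  cudaStreamCreate(&Stream);
  cudaEventCreateWithFlags(&CompletionEvent, cudaEventDisableTiming);

  // ... enqueue asynchronous work on Stream (H2D copies, kernel, D2H copies) ...

  // Record the event after the last operation. Per CUDA's documented behavior,
  // the event completes once all work submitted to Stream before the record
  // has finished; no host thread has to synchronize the stream for that.
  cudaEventRecord(CompletionEvent, Stream);

  // Non-blocking completion check, analogous to querying the stream itself:
  cudaError_t Err = cudaEventQuery(CompletionEvent);
  if (Err == cudaSuccess)
    std::printf("all operations before the event are done\n");
  else if (Err == cudaErrorNotReady)
    std::printf("still running, query again later\n");

  cudaEventDestroy(CompletionEvent);
  cudaStreamDestroy(Stream);
  return 0;
}
```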
My second thought on this is: let us do stream sync for now.
- NVIDIA: if we sync with an event and return streams to the pool early, tasks may get serialized if two of them end up in the same stream.
- AMD: at the HSA level, there are only signals (events).
- Level Zero: it depends on which type of command list is being used.
So it seems that, at the libomptarget level, it should be flexible and let the plugin decide which mechanism to use.
First of all, sorry for the late reply.
Ok, we can leave it as is for now. Just a comment on the NVIDIA case for future development: although we may not completely avoid the serialization problem using the pool of streams, I believe we can reduce it if the pool access pattern is changed. Currently, streams are acquired/released from/to the pool in a stack-like manner, which induces frequent re-use of the streams nearest to the top of the stack/pool (in the code, the streams near the beginning of the Resources vector). By changing this access pattern to be round-robin-like, the stream use frequency may be evened out over the pool and serialization becomes less of a problem.
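Purely as an illustration of that suggestion (hypothetical names, not the plugin's actual pool code): switching the free list from LIFO to FIFO already spreads reuse over the whole pool in a round-robin-like fashion.

```cpp
#include <deque>
#include <mutex>

// Hypothetical pool: resources are handed out in FIFO order instead of
// stack (LIFO) order, so the same few streams are not constantly reused.
template <typename ResourceTy> class RoundRobinPoolTy {
  std::deque<ResourceTy> Free;
  std::mutex Mtx;

public:
  void release(ResourceTy R) {
    std::lock_guard<std::mutex> Guard(Mtx);
    Free.push_back(std::move(R)); // released resources go to the back ...
  }

  bool acquire(ResourceTy &R) {
    std::lock_guard<std::mutex> Guard(Mtx);
    if (Free.empty())
      return false; // the caller would grow the pool here
    R = std::move(Free.front()); // ... and the least recently used one is handed out
    Free.pop_front();
    return true;
  }
};
```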
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
194 | Just adding some context to why it was done this way:
With all that in mind, I know std::function may lead to an additional heap allocation and one more level of indirection due to type erasure, preventing some code optimizations. If that is not desirable despite the presented context, I can change how the post-processing procedures are stored/called, but I would really like to keep some of the points described above, especially the lifetime correctness of the post-processing data. Do you have something in mind that could help me with that? Maybe defining a different struct for each post-processing function and storing it in the AsyncInfoTy as a variant could be enough. What do you think? |
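For context, a minimal sketch (hypothetical, simplified types; not the actual libomptarget declarations) of the lifetime pattern the comment above describes: each post-processing callable owns its payload by move, so the captured data lives exactly as long as the pending callback.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for the real types.
struct PrivateArgumentManagerTy {
  std::vector<char> Buffer;                // packed first-private arguments
  int free() { Buffer.clear(); return 0; } // 0 standing in for OFFLOAD_SUCCESS
};

struct AsyncInfoTy {
  // Type-erased post-processing procedures, run once synchronization succeeds.
  std::vector<std::function<int()>> PostProcessingFunctions;

  template <typename FuncTy> void addPostProcessingFunction(FuncTy &&F) {
    PostProcessingFunctions.emplace_back(std::forward<FuncTy>(F));
  }

  int runPostProcessing() {
    for (auto &F : PostProcessingFunctions)
      if (int Rc = F())
        return Rc;
    PostProcessingFunctions.clear();
    return 0;
  }
};

void targetExample(AsyncInfoTy &AsyncInfo) {
  PrivateArgumentManagerTy PrivateArgumentManager;
  // ... first-private arguments are packed and copied to the device
  //     asynchronously here ...

  // Moving the manager into the lambda ties its lifetime to the callback: the
  // data stays alive until post-processing runs and is destroyed only when the
  // stored std::function goes away. The lambda is mutable because free() is
  // non-const (the point raised at omptarget.cpp:1521 below).
  AsyncInfo.addPostProcessingFunction(
      [PAM = std::move(PrivateArgumentManager)]() mutable -> int {
        return PAM.free();
      });
}
```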
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
194 | I would have assumed that if we have 2 potential callbacks, let's say static functions F and G, we would have two members, both void* (void *PayloadForF = nullptr; void *PayloadForG = nullptr;), and if they are not null we call F and G respectively, passing the payload. I'm fine with keeping it this way for now; we can see if there is a need to change it. | |
openmp/libomptarget/include/omptargetplugin.h | ||
160 | Describe the return value, it's not clear what would be returned if it does or doesn't synchronize. | |
openmp/libomptarget/src/interface.cpp | ||
164 | I assume we can outline some of this as it is probably the same as in the sync version, right? Let's avoid duplication as much as possible. Same below. | |
239 | So, when do we call this but don't want to actually do targetDataUpdate? I am confused. Same above. | |
347 | Clang format. | |
openmp/libomptarget/src/omptarget.cpp | ||
698 | Documentation, plz. | |
1209 | Why are the const problematic? | |
1521 | FWIW, mutable is really not my favorite way of handling things. | |
openmp/libomptarget/src/private.h | ||
214–217 | Here and elsewhere, we should prefer early exit and no else after return. | |
openmp/runtime/src/kmp_tasking.cpp | ||
1145 | @tianshilei1992 you need to look at these changes. | |
5187 | current task? |
openmp/runtime/src/kmp_tasking.cpp | ||
---|---|---|
5157 | libomptarget (for now) doesn't require that the thread be an OpenMP thread. Using libomp's gtid generally breaks in that case. Either we add the requirement, which needs to be discussed further, or an alternative method is needed to implement that. If libomp is entered from a fresh thread, a new root will be created. |
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
194 | Okey, for now, I'll keep it like this then. | |
openmp/libomptarget/include/omptargetplugin.h | ||
160 | Good point. I have updated both the documentation and the function name to better reflect what this new interface should do. Do you think it is more clear now? | |
openmp/libomptarget/src/interface.cpp | ||
164 | Yep, you are correct. I have created two new "launchers" that can unify most of the code paths for the execution and data-related functions for the normal and nowait cases. Since all data-related interface entries practically have the same signature, a single launcher is enough for them all. A new class called TaskAsyncInfoTy also unifies the code related to managing the task async handle when executing target nowait regions. What do you think about this new code structure? | |
239 | The goal is to dispatch the device side operations (i.e., call the functions in the omptarget.cpp file) only when a new async handle is created. If we detect that the task already contains an async handle, that means that the device side operations were already dispatched and we should only try to synchronize it in a non-blocking manner. The new TaskAsyncInfoTy class has a function called shouldDispatch that now encapsulates this detection logic with proper documentation. Do you think it is more clear now? Should we add a comment to each call site as well? | |
openmp/libomptarget/src/omptarget.cpp | ||
698 | Done. Could you check if I missed anything? | |
700 | Oops. Thanks, I have updated the var name. | |
1209 | In summary, we want to be able to move PrivateArgumentManagerTy instances into the post-processing lambdas, so their lifetime is automatically managed by them. The problem is with how llvm::SmallVector implements its move constructor. Unfortunately, it is implemented in terms of move-assignment instead of as a proper move constructor, meaning it cannot be instantiated for structs with const members (see the reduced example after this batch of inline comments). If we replace FirstPrivateArgInfo with a std::vector, the problem does not happen, because the STL properly implements a move constructor for vectors. Since I think we do not want to use std::vector anymore, I just removed the const from the members; they are not even accessible outside the PrivateArgumentManagerTy class anyway. What do you think of this approach? | |
1521 | mutable was added because we need to call a non-const member function of PrivateArgumentManager (i.e., free). I know that makes the lambda a function with internal state, since multiple calls to it will generate different results, but I don't know of another approach. Maybe use call_once (IMHO, a little bit overkill) or remove the lambdas altogether and use another approach to store the post-processing functions and their payload. What do you think? | |
openmp/runtime/src/kmp_tasking.cpp | ||
1145 | Any comments on whether we can move the __kmp_unexecuted_hidden_helper_tasks decrement to this place? | |
5157 | Uhm, I did not know about that. Although I think such a requirement makes sense, it may be out of the scope of this patch. What we could do is check if the current thread is registered inside libomp somehow, falling back to the current execution path that does not depend on the task team information. Do you know if we can use __kmpc_global_thread_num's return value to verify that? Maybe assert that the returned GTID is valid and within a well-known range (e.g., [0, NUM_REGISTERED_OMP_THREADS]). Just a note, NUM_REGISTERED_OMP_THREADS is not a valid variable; I just don't know where, or even if, such information is stored. Do you know where I can find this? |
openmp/runtime/src/kmp_tasking.cpp | ||
---|---|---|
5187 | Typo. It should have been current thread! |
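As an aside on the omptarget.cpp:1209 discussion above, here is a reduced, self-contained illustration of the C++ rule involved (field names are hypothetical, not the actual FirstPrivateArgInfo layout): a const data member deletes the implicit move-assignment operator, while std::vector's own move constructor never needs element-wise move-assignment because it only steals the heap buffer.

```cpp
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical reduced version of a FirstPrivateArgInfo-like entry.
struct ArgInfoTy {
  const int AlignedSize; // const member: implicit move-assignment is deleted
  void *HstPtr;
};

static_assert(!std::is_move_assignable<ArgInfoTy>::value,
              "element-wise move-assignment is unavailable");
static_assert(std::is_move_constructible<ArgInfoTy>::value,
              "but construction from an rvalue still works");

int main() {
  std::vector<ArgInfoTy> Src;
  Src.push_back(ArgInfoTy{16, nullptr});
  // std::vector's move constructor steals the heap buffer and never needs to
  // move-assign elements, so this compiles even with the const member.
  std::vector<ArgInfoTy> Dst = std::move(Src);
  // A small-vector whose move constructor is written in terms of operator=
  // (as described in the comment above) would need ArgInfoTy to be
  // move-assignable and thus fail to compile, which is why the const was
  // dropped from the members.
  return Dst.size() == 1 ? 0 : 1;
}
```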
New set of comments, some minor but not all. Some comments are "out-of-order" as I started commenting top-down while not understanding it all. Read accordingly.
openmp/libomptarget/include/device.h | ||
---|---|---|
441 | Reading this, I don't know what this returns. SUCCESS if it completed and FAIL otherwise? Or FAIL only if something failed? "Must be called multiple times" is also unclear to me; I doubt that we should put that sentence here. Update: I think I now understand the latter part, but if I'm right we should change the interface. So, queryAsync is supposed to be called before isDone to make sure isDone returns the right value, correct? If so, we should not expose queryAsync to the user, as there doesn't seem to be a reason to call it otherwise. Arguably, calling it doesn't provide information, just a state change, thus a secondary query is necessary. | |
openmp/libomptarget/include/omptarget.h | ||
217 | Nit: Rename argument to avoid shadowing. Make the version that takes the sync type, and probably the sync type, private. IsDone can call the private version, users only the blocking one. | |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
1272 | return OFFLOAD_SUCCESS; will reduce indention and logic later on. | |
1278 | ||
openmp/libomptarget/src/interface.cpp | ||
86 | This is different but similar to the condition used in the other new "helper" below. I have the same concerns as there. When would we ever not call the target data function? | |
95 | Is FromMapper ever set to true? Did I miss that? | |
252 | There is more duplication in the callees to be moved here, no? | |
279 | This I don't understand. Why do we have to wait to enqueue the kernel? And even if, how does this not accidentally skip the target region and we will never execute it at all? Long story short, I doubt the conditional here makes sense. | |
openmp/libomptarget/src/omptarget.cpp | ||
41–50 | This would move to isDone below. | |
openmp/libomptarget/src/private.h | ||
227 | Should this be guarded by IsNew? | |
240 | I don't understand what we want/need this dispatch idea for. It seems to skip operations but I don't understand how we would not forget about them and go back. | |
openmp/libomptarget/include/device.h | ||
---|---|---|
441 | I should probably update the documentation of this function to reflect the new one added to omptargetplugin.h:__tgt_rtl_query_async, which states the following:
Thus, queryAsync (which calls __tgt_rtl_query_async) is a non-blocking version of synchronize. That means we must call it multiple times until all operations are completed and the plugin invalidates the queue inside AsyncInfo. Here, AsyncInfoTy::isDone is just a helper function that indicates whether the device-side operations are completed or not based on said queue. We need to externalize queryAsync to its user AsyncInfoTy so it can call the non-blocking implementation. Considering your comment, what do you think of making things more explicit by adding an output flag argument to queryAsync (and thus to __tgt_rtl_query_async) that returns true if all operations are completed and false otherwise? int32_t queryAsync(AsyncInfoTy &AsyncInfo, bool &IsCompleted); | |
openmp/libomptarget/include/omptarget.h | ||
217 | Thanks, I am renaming the type to SyncTypeTy to match the other type names. Regarding the second comment, I don't quite understand what you mean with:
isDone only checks whether the operations inside an AsyncInfoTy instance are completed or not; it does not call any plugin function at all. Are you suggesting that we move all the non-blocking synchronization code into isDone? If so, this means we would have some code duplication regarding the post-processing functions due to two separate synchronization paths, but if you think that is better I can do it. | |
openmp/libomptarget/include/omptargetplugin.h | ||
160 | @jdoerfert any comments on the new function and its doc? | |
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
1272 | Perfect, done! | |
1278 | Thanks, done! | |
openmp/libomptarget/src/interface.cpp | ||
86 | That was a code error; the target helper should also use the Dispatch variable. Thanks for noticing. | |
86 | With the RFC implemented, we are now re-enqueuing the same task multiple times until all the device side operations are completed. Because of that, we may call the __tgt_target_* functions multiple times as well. Since we want to dispatch the operations only once, we call the target data functions only when the target task is first encountered. The next calls will only synchronize the operations instead of dispatching them again, that's why AsyncInfo.synchronize is always called right below it. | |
95 | Nope, it is not. I am removing it from the arguments and always passing false. | |
252 | Yep, that was an error on my part; Dispatch should be used instead of AsyncInfo.isDone(). Regarding the other arguments, they are obtained from the wrapper TaskAsyncInfoTy, not from AsyncInfoTy. I can change that to unify the wrapper code and AsyncInfoTy, but I would probably be putting two different responsibilities into the AsyncInfoTy struct. What do you think? | |
279 | You are right, this was a leftover from before the refactoring of the interface file. Although it worked, it only did so because the queue pointer was null and isDone would return true at first. Replaced it with the Dispatch variable. | |
openmp/libomptarget/src/omptarget.cpp | ||
41–50 | Ok, I can do it, no problem. But just to make sure I got it right with respect to the other comments: you are suggesting this so we would use synchronize for the blocking synchronization and isDone for the non-blocking one, correct? If that is so, just remember that in the current code the PostProcessingFunctions must be called in both cases, so isDone would need to be called when blocking synchronizing as well. | |
openmp/libomptarget/src/private.h | ||
227 | Nope, IsNew indicates that a new task-attached AsyncInfo has just been allocated, not whether we should deallocate it. The variable is primarily used to indicate that we must dispatch new operations to the new handle. Maybe I should rename it to just ShouldDispatch. Deallocation is always done when AsyncInfo->isDone() returns true, which is previously checked. | |
240 | Here is the main idea:
I have a presentation that explains it on slides 19-24, but I believe I am failing to describe that in the code. I'll try to come up with some better documentation for this dispatch/synchronize idea. | |
openmp/runtime/src/kmp_tasking.cpp | ||
5157 | @tianshilei1992 any comments on this? |
I'll wait for the updated version to go through again. Below are two clarification questions. I think I am now closer to understanding some of the stuff I was confused about. If we end up keeping this scheme, we need to adjust some names. I am hoping we can simplify a few things though.
openmp/libomptarget/src/private.h | ||
---|---|---|
227 | I'm worried here that we might not delete the AsyncInfo or delete it multiple times. Are you saying there is exactly one TaskAsyncInfoTy that will own the AsyncInfo object at any given time? If not, how do we avoid a double free? | |
240 | Ok, that makes more sense. Now to help (even) me understand this, why do we need to call the functions from step 1 in step 3? We seem to use the "Dispatch" argument to skip most of what they do (the target data part of a targetDataXXX) anyway, no? |
openmp/libomptarget/include/device.h | ||
---|---|---|
441 | My point is: queryAsync is useless if not followed by an isDone, right? |
openmp/runtime/src/kmp_tasking.cpp | ||
---|---|---|
5157 | IIRC some return values, other than the real thread ids, are reserved for specific purposes. That's one of the reasons that I didn't use negative thread ids for hidden helper threads. I don't know for sure if there is a value designed to say it is not an OpenMP-managed thread. We can probably add one. |
Address more code review changes. This update also fixes the dispatch logic on target regions.
Added some more comments about the new execution flow and the thread id problem.
openmp/libomptarget/include/device.h | ||
---|---|---|
441 | Uhm, I think I got your point. I'll update AsyncInfoTy::isDone so it can be called standalone without a prior call to AsyncInfoTy::synchronize (which calls the device DeviceTy::queryAsync). This indeed makes the interface better. I was a little bit confused when you said DeviceTy::queryAsync should not be exposed, but now I got it. | |
openmp/libomptarget/include/omptarget.h | ||
217 | Just a note, now I got the correct idea: we should make isDone a callable as a standalone function! | |
openmp/libomptarget/src/private.h | ||
227 | When executing a target nowait region, the actual owner of the AsyncInfo is the task itself. The structure is allocated when the task first executes and calls any of the libomptarget functions (IsNew is true) and it is deallocated when all the device-side operations are completed (AsyncInfo::isDone returns true). Here, TaskAsyncInfoTy is just a wrapper around a task-owned AsyncInfoTy (stored inside the OpenMP task data) to mainly automate the allocation and deallocation logic. But following the OpenMP execution flow, since a task is owned and executed by only a single thread at any given time, only one TaskAsyncInfoTy will be managing the task-owned AsyncInfoTy object. This should avoid any double frees, but I understand this could be a weak assumption. If that is enough I could add documentation stating it, but probably having some code checks for that would be best. Maybe assertions at the task deallocation function ensuring no valid AsyncInfoTy address is left? | |
240 | That happens because the dispatch and synchronization logic are placed in the same interface function. The first call to that function done by a task dispatches the operations, while the subsequent calls try to do the non-blocking synchronization. Maybe a better way of doing it would be to add a new interface function with the sole purpose of executing said synchronization. This way, when a task is re-executed, it calls this new function to only do the synchronization instead of the previous outline function. What do you think? This can better split the dispatch and synchronization code. | |
openmp/runtime/src/kmp_tasking.cpp | ||
5157 | That would be perfect. I know we have KMP_GTID_DNE (value of -2) that represents the return value of a non-existent thread ID. My only problem is: do you know if __kmpc_global_thread_num returns that when called from a non-OpenMP thread? I'll do some local checking on that! |
This update:
- Internalize SyncType into AsyncInfoTy
- Split dispatch and synchronization code paths. No more Dispatch checks! New interface function for target nowait synchronization.
- Make isDone a standalone function. No need to call synchronize before isDone!
@jdoerfert I believe the new revision has a better code structure for the dispatch and synchronize stages. Now we have an exclusive function only for synchronization. No more Dispatch checks!
@tianshilei1992 I am still checking on the validity of the returned GTID from __kmpc_global_thread_num when called from outside an OpenMP thread. The best solution would be to pass the address of the taskdata async handle through the interface for the *_nowait functions directly as a new parameter, but that would require changes to the code generation step. I can do that right now, but I would prefer to change the clang code generation on a different patch, as I am not that familiar with it. What do you think?
Great, I think we are almost there. The code looks much cleaner and the approach is much clearer than it was in the beginning (IMHO, I hope people agree).
I left some final clarification questions and some cleanup requests.
One conceptual question which is not blocking the patch though:
Does this approach require hidden helper threads to execute the target task or could we enable it for non-hidden helper threads as well?
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
226 | Make it clear that this happens only once. Either here or via synchronize. Right now it could be read like every isDone call might invoke the post-processing functions. | |
386 | It's void return but comment talks about the return value. | |
openmp/libomptarget/src/interface.cpp | ||
64 | If you make this a templated function accepting the (sub)type of the AsyncInfo object instead of the object itself, you can move all the remaining duplication at the call sites (namely: checkCtorDtor, get device, create AsyncInfo) into this function. WDYT? | |
247–252 | Same comment as above wrt. templated version. The duplication we introduce is something I would like to avoid. | |
392 | Do we know the above call is "noreturn"? If not, we should explicitly exit here. On second thought, we should exit either way. | |
openmp/libomptarget/src/omptarget.cpp | ||
79 | This is not a good idea, I think. The state of PostProcessingFunctions is undefined afterwards, even if this works in practice. Simply iterate PostProcessingFunctions and then clear it. | |
openmp/libomptarget/src/private.h | ||
212 | Is there a ! missing? | |
openmp/runtime/src/kmp_tasking.cpp | ||
1802 | Much better than the "execute but don't actually execute" version before. Thanks! | |
5181 | This can fail, right? If so, we should report it to the user and deal with it properly. Otherwise we should assert it can't fail. |
It should work with HHT and normal OpenMP threads (e.g., inside a parallel/single region). The only place where we could have some problems would be in the re-enqueueing of the tasks, but that is already taken care of: if HHTs are disabled, the task will be given to another normal OpenMP thread; if they are enabled, the task will be given to another HHT. I'll do some more local testing, but we should not have any problems in either config.
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
189 | SyncTypeTy looks weird. It's like having an LLVM class called TypeTy. I think SyncTy or SyncType are both fine. | |
openmp/libomptarget/src/interface.cpp | ||
84 | nit: targetDataFunction | |
openmp/runtime/src/kmp_tasking.cpp | ||
5191 | Do we need to check if gtid is valid here? |
I have a general question. D81989 is using task yield to potentially improve concurrency instead of blocking the hidden helper task. I know task yield may sometimes have side effects. My question is: compared with using task yield, is creating and enqueuing tasks (repeatedly) better?
This update:
- Update AsyncInfo documentation
- Reduce code duplication in the interface
- Rename SyncTypeTy to SyncTy
- Fix exporting __tgt_target_nowait_query
- Refactor task async handle acquisition
- Optimize fast queue completion
I believe I answered and fixed most of the comments on this revision. Waiting for the next round. 😉
openmp/libomptarget/include/omptarget.h | ||
---|---|---|
189 | It makes sense, that was a little redundant. It is now renamed to SyncTy. Thanks! | |
386 | Yep, that was a leftover from some previous revisions. Thanks! | |
openmp/libomptarget/src/interface.cpp | ||
64 | Indeed, that is a nice idea. Since TaskAsyncInfoWrapperTy is a wrapper around AsyncInfoTy, I only needed to acquire a reference to it so we would end up always using AsyncInfoTy. | |
84 | Uhm, TargetDataFunction is a function pointer. Shouldn't we also capitalize the first word in this case? | |
247–252 | Done as well. | |
392 | Uhm, yep, you are right. We should always exit here. I am converting to using FATAL_MESSAGE0, so we directly abort the program. | |
openmp/libomptarget/src/omptarget.cpp | ||
79 | Uhm, really? When moving a SmallVector like this, wouldn't PostProcessingFunctions be emptied and all the values moved to Functions? | |
openmp/libomptarget/src/private.h | ||
212 | I forgot to submit some local changes! Done. | |
openmp/runtime/src/kmp_tasking.cpp | ||
1802 | Nice. And yep, that is already done at the task finalization. Take a look at the changes at kmp_tasking.cpp:1099. When async_handle is not null, the task is re-enqueued; otherwise, the task is finished normally. | |
5181 | Uhm, that makes sense. I'll try to add this functionality and fall back to the old execution flow if it fails. | |
5191 | Uhm, yep we indeed need to check it. I'll add it here and return false if gtid is invalid. This way we can fall back to the old execution flow. |
Generally fine from my end. @tianshilei1992 wdyt?
@kevinsala, FYI, there will be a new plugin API we need to port over to the new plugins.
openmp/libomptarget/src/omptarget.cpp | ||
---|---|---|
79 | The standard says PostProcessingFunctions is in an unspecified state after the move. If we look at SmallVector we know the state, but I don't understand why we need to rely on implementation-defined behavior here at all. We don't save any lines, nor any work, compared to just iterating PostProcessingFunctions and then clearing it, no? | |
openmp/libomptarget/src/private.h | ||
227 | All asserts should have messages. |
TBH I'd like to see/understand if task yield is worse than this method. If not, I'm not sure why we'd like to have so much complicated code.
Another thing is, if we only have one target task, which is really long, then based on this method it's gonna check if it is done, enqueue the task, and then the task is scheduled to execute (potentially by another thread, which might hurt locality), check, enqueue the task, again and again. Though the existing method blocks the whole thread, at least it will not keep re-executing. I think we actually need a counter for the bottom-half task. If it is executed too many times, that indicates we don't have enough other tasks to execute. In this case, just blocking the thread is fine.
Sorry, I missed your previous comment before.
I believe this method can be better than using task yield because it has a lower probability of "starvation"; let me explain. AFAIK, task yield is currently implemented by executing other tasks in the middle of the execution of the current task, meaning the task execution state (and the order of resumption) is stored on the call stack. This can incur a "starvation-like" problem, where an earlier task whose operations have completed cannot be finished because it is at the bottom of the call stack. This can also hold up other tasks that depend on this "starved" one. Another problem is that, depending on the number of ready target regions, a program can even exceed the stack limit due to the many in-flight tasks. If task yield were implemented in a coroutine-like model (maybe some future work), where yielded tasks could be re-enqueued and re-ordered, we would probably use it, since that would make the code much simpler.
Another thing is that this can also be an initial point of integration for the device-side resolution of dependencies (although D81989 did that as well).
Uhm, that makes perfect sense. I'll implement the counting mechanism and update the patch.
This update:
- Unify synchronize and isDone methods. No more code duplication between them.
- Add query async to plugin-nextgen.
- Decide the sync method based on per-thread exponential backoff counting (sketched below).
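To make the last bullet concrete, here is a minimal sketch of such a per-thread exponential backoff (hypothetical names, threshold, and growth policy; not the patch's actual code): repeated unsuccessful non-blocking queries grow a thread_local counter, and once it passes a threshold the thread falls back to blocking synchronization.

```cpp
#include <cstdint>

// Hypothetical sketch only; names and values are illustrative.
struct ExponentialBackoffTy {
  uint64_t Backoff = 1;
  static constexpr uint64_t BlockingThreshold = 1ULL << 10;

  // Called each time a non-blocking query finds the region still running.
  void increment() {
    if (Backoff < BlockingThreshold)
      Backoff *= 2;
  }
  // Called when a region completes quickly, so the thread goes back to
  // non-blocking queries.
  void reset() { Backoff = 1; }
  // After enough unsuccessful queries, just block: there apparently is not
  // enough other work to make re-enqueueing worthwhile.
  bool shouldBlock() const { return Backoff >= BlockingThreshold; }
};

// One counter per runtime thread, so each thread adapts to the target region
// payloads it actually observes (the per-thread vs. per-task trade-off is
// discussed further below).
static thread_local ExponentialBackoffTy QueryCounter;
```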
openmp/libomptarget/src/omptarget.cpp | ||
---|---|---|
79 | I was thinking about a future use case: if a post-processing function generates more asynchronous device-side operations, it may want to add a new post-processing function to the vector, but we cannot do that while iterating over it. The spec says that, if a push/emplace back resizes the vector, any previous iterator is invalidated, which would make the loop invalid. I think that is the case for SmallVector as well. Right now, the three post-processing functions do not do that; they all do synchronous device operations. But if in the future someone adds a post-processing function that does that, it cannot be blindly done without taking care of this situation. Do you think we could leave this "unimplemented" right now? If so, I can do the iteration-then-clear approach. Just a "side question": which standard does SmallVector follow? I am asking because the STL says that a vector is guaranteed to be empty after a move. If that is the case for SmallVector too, then PostProcessingFunctions would be in a valid state, no? | |
1209 | @jdoerfert any new comment on this? | |
1521 | @jdoerfert any new comment on this? | |
openmp/runtime/src/kmp_tasking.cpp | ||
1145 | @tianshilei1992 is this correct? |
I only have one remaining question. @tianshilei1992 might have more though.
openmp/libomptarget/src/interface.cpp | ||
---|---|---|
403 | Why is this thread_local? Should it be tied to the async info object with non-blocking tasks? | |
openmp/libomptarget/src/omptarget.cpp | ||
79 | If you are worried about inserting, use for (int i = 0; i < C.size(); ++i). Anyhow, let's keep it this way for now, but add an explicit clear at the end, just to be sure (and explicit). | |
openmp/libomptarget/src/interface.cpp | ||
---|---|---|
403 | Yeah, I'd expect the counter to be tied to a task. |
Reverse ping, let's get this in, we can build dependence via event support on top of it.
There is only the thread_local remark remaining, I think.
@jdoerfert there are some other comments pending. How should we proceed?
openmp/libomptarget/src/interface.cpp | ||
---|---|---|
403 | @jdoerfert @tianshilei1992 my idea here is that we should be able to go back and forth between non-blocking and blocking synchronization on a per-thread basis, depending on each thread's own state.
This way, we allow the runtime threads to adapt themselves to the target region payloads. I see two possible problems in placing the counter in a per-task manner:
What do you think about the above points? Do they make sense? | |
openmp/libomptarget/src/omptarget.cpp | ||
79 | Uhm, yep, making it explicit is better. What do you think? |
LG
openmp/libomptarget/src/interface.cpp | ||
---|---|---|
403 | Hm, it seems reasonable for your scenario. It is unclear what we should optimize for here. I'm OK with keeping it like this as it might be better for a "always blocking tasks" and "consistently mixed task" load. The per-task impl. would only be good if we have "totally independent tasks", I guess. | |
openmp/libomptarget/src/omptarget.cpp | ||
79 | It's fine for now. | |
1209 | It's ok to remove the const. | |
1521 | Add a TODO to look into this in the future. | |
openmp/runtime/src/kmp_tasking.cpp | ||
1145 | I think @tianshilei1992 mentioned to me this should be fine. |
Getting compiler crash with this change:
# End machine code for function __tgt_target_nowait_query.

*** Bad machine code: FrameSetup is after another FrameSetup ***
- function: __tgt_target_nowait_query
- basic block: %bb.5 init.check (0x556af1cc4098)
- instruction: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def $rsp, implicit-def $eflags, implicit-def $ssp, implicit $rsp, implicit $ssp, debug-location !6722; llvm-project/openmp/libomptarget/src/interface.cpp:328:42

*** Bad machine code: FrameDestroy is not after a FrameSetup ***
- function: __tgt_target_nowait_query
- basic block: %bb.5 init.check (0x556af1cc4098)
- instruction: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp, debug-location !6722; llvm-project/openmp/libomptarget/src/interface.cpp:328:42

fatal error: error in backend: Found 2 machine code errors.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0. Program arguments: clang++ --target=x86_64-unknown-linux-gnu -DEXPENSIVE_CHECKS -DGTEST_HAS_RTTI=0 -DOMPTARGET_DEBUG -DOMPT_SUPPORT=1 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -Illvm/include -Iinclude -Iruntimes/runtimes-bins/openmp/runtime/src -Illvm-project/openmp/libomptarget/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -Wall -Wcast-qual -Wformat-pedantic -Wimplicit-fallthrough -Wsign-compare -Wno-enum-constexpr-conversion -Wno-extra -Wno-pedantic -std=c++17 -g -fPIC -fno-exceptions -fno-rtti -gsplit-dwarf -MD -MT openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o -MF openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o.d -o openmp/libomptarget/src/CMakeFiles/omptarget.dir/interface.cpp.o -c llvm-project/openmp/libomptarget/src/interface.cpp
Looks like it is caused by the placement of the static variable declaration.