This patch adds support for event-related interfaces, which will be used later to fix a data race. See D104418 for more details.
Event Timeline
openmp/libomptarget/include/omptargetplugin.h:160
Please remove AsyncInfo, which is not used.

openmp/libomptarget/include/omptargetplugin.h:169
Please remove AsyncInfo, which is not used.

openmp/libomptarget/include/omptargetplugin.h:172
Please remove AsyncInfo, which is not used. If left in, it only adds confusion.

openmp/libomptarget/plugins/cuda/src/rtl.cpp:1400
I feel it is better not to hard-code 0.
openmp/libomptarget/include/omptargetplugin.h:160
I would keep it for compatibility considerations. Note that we are not writing a library only for CUDA; there might be programming models that require the queue.
openmp/libomptarget/include/omptargetplugin.h:160
If you think the compatibility consideration is real now, please provide details of one case where it is needed. It is not about CUDA; it is better to avoid hypothetical needs. Additional unused arguments add confusion and maintenance cost. With minimal arguments, the interface stays clear and easy to maintain.
openmp/libomptarget/include/omptargetplugin.h:160
That's true if the interface is easy to change. This one doesn't seem to be: we carry around legacy entry points in case they're still in use, at least for some period of time. I wonder if we should start putting version numbers in it to establish a path to dropping parts.

To a fair approximation, every C callback interface should take an additional void*. Otherwise you end up with qsort vs qsort_r when, inevitably, the callback has to do something with state. In this case, the interface is roughly:

    void *async_info_create();
    // do things with the async info
    // awkwardly, some of these make a new instance if passed 0
    void async_info_destroy(void *);

Threading a single void* through the sequence of calls gives the plugin a way of associating them with each other. Given multiple host threads and multiple targets, that is otherwise extremely difficult, hence every function getting an integer ID.

I would guess that eliding the void* will work fine until we want to have two independent asynchronous regions running on the same target at the same time, at which point we won't know which one to 'record_event' onto. That's not an argument for this interface in particular (I haven't thought about whether an event should have an identity independent of whatever it is coordinating), but each function in the __tgt_rtl interface that can do things asynchronously should get the async_info instance it is acting on.
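To make the two-regions point concrete, here is a minimal runnable sketch of the state-threading pattern being argued for. The example_* names, signatures, and stub bodies are illustrative assumptions in the spirit of the __tgt_rtl entry points, not the declarations this patch adds:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical shapes, for illustration only: an opaque per-region handle is
// created by the plugin and threaded through every asynchronous call, so the
// plugin can tell two concurrent regions on the same device apart. The stub
// bodies just trace the calls.
void *example_async_info_create(int32_t Dev) {
  std::printf("create async info on device %d\n", (int)Dev);
  return new int32_t(Dev);
}
int32_t example_record_event(int32_t Dev, void *Event, void *AsyncInfo) {
  std::printf("record event %p on device %d, region %p\n", Event, (int)Dev,
              AsyncInfo);
  return 0;
}
int32_t example_async_info_destroy(int32_t Dev, void *AsyncInfo) {
  (void)Dev;
  delete static_cast<int32_t *>(AsyncInfo);
  return 0;
}

// Two independent asynchronous regions on the same device: without the handle,
// the plugin could not know which in-flight work a record_event call refers to.
int main() {
  void *RegionA = example_async_info_create(0);
  void *RegionB = example_async_info_create(0);
  int EvA = 0, EvB = 0; // stand-ins for opaque event handles
  example_record_event(0, &EvA, RegionA); // unambiguously captures region A
  example_record_event(0, &EvB, RegionB); // unambiguously captures region B
  example_async_info_destroy(0, RegionA);
  example_async_info_destroy(0, RegionB);
  return 0;
}
```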
openmp/libomptarget/include/omptargetplugin.h:157
Is this a means of coordinating across different async objects? Across different targets, or just between a CPU thread and one target? I am trying to guess what the sequence of record / wait / sync might map onto. In particular, it is difficult to map this onto the HSA operations without knowing the CUDA terminology. This is presumably a way of making some kernels wait for others to complete before they start executing, but I'd expect that to be a single API call, probably named 'barrier'.

Does 'waitEvent' mean the GPU waits until all the kernels before it have completed, and 'syncEvent' mean the host thread waits for the same?
openmp/libomptarget/include/omptargetplugin.h:160
Changing an API may never be as easy as one might hope, but that is secondary.
openmp/libomptarget/include/omptargetplugin.h:160
What do wait_event or sync_event do? It sounds like they might involve launching a kernel on the GPU that writes to global memory to indicate it has executed, in which case it probably needs an async info. Destroy event probably has to wait for some resources on the GPU to no longer be in use, which may be at some distant point in the future, in which case the host probably doesn't want to wait for it.

Maybe this one should be renamed QueueWaitEvent or WaitEventInQueue to be more explicit. CUDA has this as cuStreamWaitEvent and categorizes it under Stream Management rather than Event Management.

Wait: attach a wait to the AsyncObj stream so that anything added after it does not run before the event is fulfilled.
Sync: block until the event is fulfilled.
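For readers less familiar with the CUDA terminology, here is a minimal sketch of the driver calls the two operations correspond to, assuming the event and streams already exist and omitting error handling (illustration only, not the plugin code):

```cpp
#include <cuda.h>

// Sketch only: work was enqueued on StreamA, StreamB must not run past the
// event, and the host can also block on it.
void sketch(CUevent Event, CUstream StreamA, CUstream StreamB) {
  // Record: capture the work enqueued on StreamA so far into Event.
  cuEventRecord(Event, StreamA);

  // Wait (cuStreamWaitEvent): work enqueued on StreamB after this call does
  // not start until Event is fulfilled; the host thread does not block here.
  cuStreamWaitEvent(StreamB, Event, 0 /* CU_EVENT_WAIT_DEFAULT */);

  // Sync: the host thread blocks until Event is fulfilled.
  cuEventSynchronize(Event);
}
```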
openmp/libomptarget/include/omptargetplugin.h:160
Remove the AsyncInfo from the ones that do not attach the event to a stream/AsyncInfo object.

Would 'fulfilled' mean that the kernels launched before it have all completed?
That probably involves a barrier packet on amdgpu. It needs to go on the same HSA queue as the associated kernels, which is probably what will be in the async info.
> Sync: block until the event is fulfilled.
That is probably doable without poking at the HSA queue for amdgpu but I'm not certain of it.
Can I invoke debug printing as a justification for passing the async object to all of them? I probably want to be able to tell what the lifecycle of a given event is in terms of what functions it was passed to.
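To make the barrier-packet idea concrete, a rough sketch of what a queue-side wait could look like on amdgpu, under the assumption that the event is represented by an hsa_signal_t; queue-full handling, fence scopes, and error checking are omitted:

```cpp
#include <hsa/hsa.h>
#include <cstring>

// Sketch: enqueue a barrier-AND packet that depends on EventSignal, so packets
// submitted to Queue after it do not launch until the signal is satisfied.
void enqueueWaitOnSignal(hsa_queue_t *Queue, hsa_signal_t EventSignal) {
  // Reserve a packet slot.
  uint64_t Index = hsa_queue_add_write_index_relaxed(Queue, 1);
  auto *Slot = &reinterpret_cast<hsa_barrier_and_packet_t *>(
      Queue->base_address)[Index & (Queue->size - 1)];

  // Fill the packet body: a single dependency on the event's signal.
  std::memset(Slot, 0, sizeof(*Slot));
  Slot->dep_signal[0] = EventSignal;

  // Publish the packet type last (real code also checks that the queue is not
  // full and marks the slot invalid while writing), then ring the doorbell.
  uint16_t Header = HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE;
  __atomic_store_n(&Slot->header, Header, __ATOMIC_RELEASE);
  hsa_signal_store_relaxed(Queue->doorbell_signal, Index);
}
```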
If you think of the event as a kernel on the stream/queue, it is fulfilled if it is "executed".
> Sync: block until the event is fulfilled.
> That is probably doable without poking at the HSA queue for amdgpu but I'm not certain of it.
> Can I invoke debug printing as a justification for passing the async object to all of them? I probably want to be able to tell what the lifecycle of a given event is in terms of what functions it was passed to.
I don't mind either way, passing it or not. Events are opaque too; you can even reference the AsyncInfo/stream/queue from them if you want. That said, I doubt you need the AsyncInfo to print stuff about the event.
In CUDA we really only need both to make the connection, which happens two times: put the event on the queue, and put a wait for the event in the queue.
If HSA needs it for anything else, we can also change the API before a release without much hassle.

It is actually meaningless. When you pass in an event, that event may or may not have been recorded on the AsyncInfo->Queue passed in. It is not a one-to-one mapping.
openmp/libomptarget/include/omptargetplugin.h:160
Changing plugin interfaces will be very painful. All the changes since 2019 have been to add new interfaces, just to make sure not to break existing applications. If we decide to do it one way and in the future we ask for another, existing applications will be broken. Of course we can then add another interface such as __tgt_rtl_create_event_async, but that is going to be more confusing. Anyway, I'm fine with removing it, but I am still more inclined to leave some room.

> Changing plugin interfaces will be very painful. All the changes since 2019 have been to add new interfaces, just to make sure not to break existing applications.

The interface you're talking about is the one between clang and libomptarget (the __tgt_target_* functions). Here we are dealing with the interface between the base library and the plugins (the __tgt_rtl_* functions), i.e. an internal interface within the library, so user applications have nothing to do with it. We can change it without breaking anything (I assume libomptarget and the plugins are compiled and distributed together; I don't expect anyone to mix and match components from different commits). The only caveat is that if the __tgt_rtl_* interface changes for one plugin, we have to propagate the change to all other plugins, but that shouldn't be a major concern.
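For context on why the __tgt_rtl_* boundary is internal: libomptarget dlopens the plugins shipped alongside it and resolves these symbols at runtime, so the two sides only need to agree within one source tree. A simplified sketch of that binding (the plugin name, symbol, and function type below are assumptions for illustration, not the real loader code):

```cpp
#include <dlfcn.h>
#include <cstdint>

// Simplified sketch of how the base library might bind a plugin entry point.
typedef int32_t (*record_event_ty)(int32_t DeviceId, void *Event,
                                   void *AsyncInfo);

record_event_ty loadRecordEvent() {
  void *Handle = dlopen("libomptarget.rtl.cuda.so", RTLD_NOW);
  if (!Handle)
    return nullptr;
  // If the plugin built in this tree does not export the symbol, the base
  // library just sees a null pointer and can treat the feature as unsupported.
  return reinterpret_cast<record_event_ty>(
      dlsym(Handle, "__tgt_rtl_record_event"));
}
```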
Oh, yeah, you're right. However, I can still remember back in 2019 when I added the *_async series: I did want to change the existing interfaces, but it was rejected.
I think we need to pursue well-defined APIs. I have not seen solid proof that the additional arguments are necessary.
Whether changing the API is easy or not is secondary to me.
openmp/libomptarget/plugins/cuda/src/rtl.cpp:1400
CU_EVENT_WAIT_DEFAULT is an enum value, so we cannot check whether it exists.
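To spell out why the 0 ends up hard-coded: CU_EVENT_WAIT_DEFAULT is an enumerator, not a preprocessor macro, so there is no #ifdef-style guard that can detect whether the header provides it. A small generic illustration of the limitation, using a made-up enumerator rather than cuda.h:

```cpp
// An enumerator behaves like this hypothetical one, not like a #define.
enum example_wait_flags { EXAMPLE_WAIT_DEFAULT = 0x0 };

unsigned int waitFlags() {
#ifdef EXAMPLE_WAIT_DEFAULT
  // Never taken: #ifdef only sees macros; enumerators are invisible to the
  // preprocessor even though the name is perfectly usable in normal code.
  return EXAMPLE_WAIT_DEFAULT;
#else
  // Hence the call site falls back to the enumerator's documented value, 0.
  return 0;
#endif
}
```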
openmp/libomptarget/plugins/cuda/src/rtl.cpp:132
I just realized that this one needs to be [...]. This is probably my last concern.

openmp/libomptarget/plugins/cuda/src/rtl.cpp:132
That's a good one. Done.

openmp/libomptarget/plugins/cuda/src/rtl.cpp:132
Should the return type be int, like the others?
openmp/libomptarget/src/device.cpp:560
If there is no event support, should createEvent be called at all? If it should not be called, OFFLOAD_FAIL is better. I don't have deep thoughts on this. What do you think?

openmp/libomptarget/src/device.cpp:560
Calling event-related APIs without actual plugin support would generally be considered undefined behavior, so it is better to avoid such calls. When a device is initialized, we can attempt to create an event. If it remains nullptr, we can flag events as not supported on this device.
openmp/libomptarget/src/device.cpp:560
If a plugin doesn't support the event mechanism, it simply returns OFFLOAD_SUCCESS for all the event interfaces. That is consistent behavior.
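One way the probe-at-init suggestion could look, sketched against a hypothetical device wrapper rather than the actual device.cpp structures:

```cpp
#include <cstdint>

// Hypothetical wrapper, for illustration: probe event support once at device
// init and record the result, so later code can skip the event path entirely
// instead of calling entry points the plugin does not really implement.
struct ExampleDevice {
  int32_t DeviceId = 0;
  bool HasEvents = false;

  // Stand-ins for the plugin's createEvent/destroyEvent entry points, which
  // may be null if the plugin does not export them.
  int32_t (*createEvent)(int32_t, void **) = nullptr;
  int32_t (*destroyEvent)(int32_t, void *) = nullptr;

  void init() {
    void *Probe = nullptr;
    if (createEvent && createEvent(DeviceId, &Probe) == 0 /*OFFLOAD_SUCCESS*/ &&
        Probe != nullptr) {
      HasEvents = true;
      destroyEvent(DeviceId, Probe);
    }
    // Otherwise HasEvents stays false and the event APIs are never invoked.
  }
};
```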
> Is this a means of coordinating across different async objects? Across different targets, or just between a CPU thread and one target? I am trying to guess what the sequence of record / wait / sync might map onto.
> In particular, it is difficult to map this onto the HSA operations without knowing the CUDA terminology. This is presumably a way of making some kernels wait for others to complete before they start executing, but I'd expect that to be a single API call, probably named 'barrier'.