This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][Offloading] Add support for event related interfaces
ClosedPublic

Authored by tianshilei1992 on Aug 22 2021, 3:42 PM.

Details

Summary

This patch adds the support form event related interfaces, which will be used
later to fix data race. See D104418 for more details.

Diff Detail

Event Timeline

tianshilei1992 created this revision.Aug 22 2021, 3:42 PM
tianshilei1992 requested review of this revision.Aug 22 2021, 3:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 22 2021, 3:42 PM

fixed a minor error

jdoerfert accepted this revision.Aug 23 2021, 8:43 AM

LG, this is needed for other fun things anyway.

This revision is now accepted and ready to land.Aug 23 2021, 8:43 AM

Typos. LGTM, waiting for @ye-luo to comment on the patch.

openmp/libomptarget/include/omptargetplugin.h
158

New line between 4) and 5)

openmp/libomptarget/src/device.h
292

dependency

ye-luo added inline comments.Aug 23 2021, 11:48 AM
openmp/libomptarget/include/omptargetplugin.h
160

Please remove AsyncInfo which is not used.

169

Please remove AsyncInfo which is not used.

172

Please remove AsyncInfo which is not used. If left, more confusion got added.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
1398

I feel better to avoid hard-code 0.
better to check if CU_EVENT_WAIT_DEFAULT exists, if not, set CU_EVENT_WAIT_DEFAULT to 0.

tianshilei1992 added inline comments.Aug 23 2021, 1:13 PM
openmp/libomptarget/include/omptargetplugin.h
160

I would keep it for compatibility consideration. Note that we are not writing a library only for CUDA. There might be programming models requiring the queue.

ye-luo added inline comments.Aug 23 2021, 1:30 PM
openmp/libomptarget/include/omptargetplugin.h
160

If you think compatibility consideration is real now, please provide details of one case of need. It is not about CUDA, but it is better to avoid hypothetical needs.

Additional unused arguments adds confusion and cost maintenance.

With the minimal arguments,
If a real needs shows up, it is easy to adapt and expand the list arguments.

openmp/libomptarget/include/omptargetplugin.h
160

That's true if the interface is easy to change. This one doesn't seem to be - we carry around legacy entry points in case they're still in use, at least for some period of time. I wonder if we should start putting version numbers in it to establish a path to dropping parts.

To a fair approximation, every C callback interface should take an additional void*. Otherwise you end up with qsort vs qsort_r when inevitably the callback has to do something with state. In this case, the interface is roughly:

void * async_info_create();
// do things with the async info
// awkwardly some of these make a new instance if passed 0
void async_info_destroy(void*);

Threading a single void* through the sequence of calls gives the plugin a way of associating them with each other. Given multiple host threads and multiple targets that is otherwise extremely difficult, hence every function getting an integer ID.

I would guess that eliding the void* will work fine until we want to have two independent asynchronous regions running on the same target at the same time, at which point we won't know which one to 'record_event' onto.

That's not an argument for this interface in particular, I haven't thought about whether an event should have an identity independent of whatever it is coordinating, but each function in the __tgt_rtl interface that can do things asynchronously should get the async_info instance it is acting on.

openmp/libomptarget/include/omptargetplugin.h
157

Is this a means of coordinating across different async objects? Across different targets, or just between a cpu thread and one target? Trying to guess what the sequence of record / wait / sync might map onto.

In particular it's difficult to map this onto the HSA operations without knowing the cuda terminology. This is presumably a way of making some kernels wait for others to complete before they start executing but I'd expect that to be a single API call, probably named 'barrier'.

Does 'waitEvent' mean the GPU waits until all the kernels before it have completed and 'syncEvent' mean the host thread waits for the same?

ye-luo added inline comments.Aug 23 2021, 3:02 PM
openmp/libomptarget/include/omptargetplugin.h
160

Changing API may never as easy as it might be but this is secondary.
I prefer not adding an argument which is not used. That means it cannot be precisely defined and can be potentially abused.
For this reason, I'm asking if there was a solid need.
tgt_rtl_record_event and tgt_rtl_wait_event should have AsyncInfo. I was only asking for the removal on tgt_rtl_wait_event tgt_rtl_sync_event and __tgt_rtl_destroy_event which are not asynchronous APIs.

openmp/libomptarget/include/omptargetplugin.h
160

What do wait_event or sync_event do? Sounds like they might involve launching a kernel on the GPU that writes to global memory to indicate it has executed, in which case it probably needs an async info.

Destroy event probably has to wait for some resources on the GPU to no longer be in use, which may be at some distant point in the future, in which case the host probably doesn't want to wait for it.

Does 'waitEvent' mean the GPU waits until all the kernels before it have completed and 'syncEvent' mean the host thread waits for the same?

Maybe this one should be renamed as QueueWaitEvent or WaitEventInQueue to be more explicit. CUDA has this as cuStreamWaitEvent and categorizes this API under Stream Management rather than Event Management.

Does 'waitEvent' mean the GPU waits until all the kernels before it have completed and 'syncEvent' mean the host thread waits for the same?

Wait, attach a wait into the AsyncObj stream to avoid anything added after runs before the event is fulfilled.
Sync, block until the event is fulfilled.

openmp/libomptarget/include/omptargetplugin.h
160

Remove the AsyncInfo in the ones that do not attach the event to a stream/AsyncInfo object.

Wait, attach a wait into the AsyncObj stream to avoid anything added after runs before the event is fulfilled.

Would 'fulfilled' mean the kernels that were launched before it have all completed?

That probably involves a barrier packet on amdgpu. Needs to go on the same HSA queue as the associated kernels, which is probably what will be in the async info.

Sync, block until the event is fulfilled.

That is probably doable without poking at the HSA queue for amdgpu but I'm not certain of it.

Can I invoke debug printing as a justification for passing the async object to all of them? I probably want to be able to tell what the lifecycle of a given event is in terms of what functions it was passed to.

Wait, attach a wait into the AsyncObj stream to avoid anything added after runs before the event is fulfilled.

Would 'fulfilled' mean the kernels that were launched before it have all completed?

That probably involves a barrier packet on amdgpu. Needs to go on the same HSA queue as the associated kernels, which is probably what will be in the async info.

If you think of the event as a kernel on the stream/queue, it is fulfilled if it is "executed".

Sync, block until the event is fulfilled.

That is probably doable without poking at the HSA queue for amdgpu but I'm not certain of it.

Can I invoke debug printing as a justification for passing the async object to all of them? I probably want to be able to tell what the lifecycle of a given event is in terms of what functions it was passed to.

I don't mind either way, passing it or not. Event's are opaque too, you can even reference the AsyncInfo/stream/queue if you want. That said, I doubt you need the AsyncInfo to print stuff about the event.
In CUDA we really only need both to make the connection, which happens two times: Put the event on the queue, put a wait for the event in the queue.
If HSA needs it for anything else, we can also change the API before a release w/o much hassle.

Wait, attach a wait into the AsyncObj stream to avoid anything added after runs before the event is fulfilled.

Would 'fulfilled' mean the kernels that were launched before it have all completed?

That probably involves a barrier packet on amdgpu. Needs to go on the same HSA queue as the associated kernels, which is probably what will be in the async info.

Sync, block until the event is fulfilled.

That is probably doable without poking at the HSA queue for amdgpu but I'm not certain of it.

Can I invoke debug printing as a justification for passing the async object to all of them? I probably want to be able to tell what the lifecycle of a given event is in terms of what functions it was passed to.

It is actually meaningless. When you pass in an event, this event may or may not have been recorded on the AsyncInfo->Queue passed in. It is not one-to-one mapping.

tianshilei1992 added inline comments.Aug 27 2021, 9:33 AM
openmp/libomptarget/include/omptargetplugin.h
160

Changing plugin interfaces will be very painful. All changes since 2019 were all to add new interfaces just to make sure not to break existing applications. If we decide to do it in one way and in the future we ask for another, existing applications will be broken. Of course we can then add another interface such as __tgt_rtl_create_event_async, but it's gonna be more confusing. Anyway, I'm fine to remove it, but still more inclined to leave some spaces.

Changing plugin interfaces will be very painful. All changes since 2019 were all to add new interfaces just to make sure not to break existing applications.

The interface you're talking about is between clang and libomptarget (__tgt_target_* functions). Here we are dealing with the interface between the base library and the plugins (__tgt_rtl_* functions), i.e. it is an internal interface within the library, so user applications have nothing to do with it. We can change it without breaking anything (I assume libomptarget and the plugins are compiled and distributed together, I don't expect anyone to mix-and-match components from different commits). The only caveat is that if the __tgt_rtl_* interface changes for one plugin, we have to propagate the change in all other plugins, but that shouldn't be a major concern.

Changing plugin interfaces will be very painful. All changes since 2019 were all to add new interfaces just to make sure not to break existing applications.

The interface you're talking about is between clang and libomptarget (__tgt_target_* functions). Here we are dealing with the interface between the base library and the plugins (__tgt_rtl_* functions), i.e. it is an internal interface within the library, so user applications have nothing to do with it. We can change it without breaking anything (I assume libomptarget and the plugins are compiled and distributed together, I don't expect anyone to mix-and-match components from different commits). The only caveat is that if the __tgt_rtl_* interface changes for one plugin, we have to propagate the change in all other plugins, but that shouldn't be a major concern.

Oh, yeah, you're right. However I can still remember back to 2019 when I added the *_async series, I did want to change existing interfaces, but it was rejected.

I think we need to pursue well defined APIs. I have not seen solid prof that additional arguments are necessary.
Changing API is easy or not is something secondary to me.

rebase and fix comments

tianshilei1992 marked 12 inline comments as done.Aug 27 2021, 3:54 PM
tianshilei1992 added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
1398

CU_EVENT_WAIT_DEFAULT is a enum value. We cannot check if it exists.

tianshilei1992 requested review of this revision.Aug 27 2021, 3:55 PM
tianshilei1992 marked an inline comment as done.
ye-luo added inline comments.Aug 27 2021, 4:08 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
132

I just realize that this one needs to be
int createEvent(void** createEvent)
and the return code needs to indicate the success/failure of cuEventCreate call.

This is probably my last concern.

rebase and fix comments

tianshilei1992 marked an inline comment as done.Aug 27 2021, 4:41 PM
tianshilei1992 added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
132

That's a good one. Done.

ye-luo added inline comments.Aug 27 2021, 4:51 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
132

Should the return type int as others?

ye-luo added inline comments.Aug 27 2021, 4:56 PM
openmp/libomptarget/src/device.cpp
560

If there is no event support. Should createEvent being called? if it should not be called, OFFLOAD_FAIL is better. I don't have deep thought on this. What do you think?

tianshilei1992 marked 3 inline comments as done.

fix comments

openmp/libomptarget/plugins/cuda/src/rtl.cpp
132

Sure

openmp/libomptarget/src/device.cpp
560

It will be called anyway. We will not check (for now) if a feature is supported before calling it.

ye-luo accepted this revision.Aug 28 2021, 12:37 PM
ye-luo added inline comments.
openmp/libomptarget/src/device.cpp
560

Calling events related APIs without actually plugin support would generally be considered undefined behaviors. So it is better to avoid calls.

When a device is initialized, we can attempt creating an event. If it remains nullptr, we can flag events not supported in this device.

This revision is now accepted and ready to land.Aug 28 2021, 12:37 PM
tianshilei1992 marked an inline comment as done.Aug 28 2021, 1:22 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
560

If a plugin doesn't support event mechanism, it simply returns OFFLOAD_SUCCESS for all interfaces. It is a consistent behavior.

This revision was landed with ongoing or failed builds.Aug 28 2021, 1:24 PM
This revision was automatically updated to reflect the committed changes.
tianshilei1992 marked an inline comment as done.