This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Add placeholder functions for the depend and nowait depend clauses for target data directives.
AbandonedPublic

Authored by gtbercea on Aug 1 2018, 1:39 PM.

Details

Summary

This patch adds placeholder functions to the interface for depend and nowait depend OpenMP target data begin, end and update directives.

Diff Detail

Repository
rOMP OpenMP

Event Timeline

gtbercea created this revision.Aug 1 2018, 1:39 PM

These functions are 1:1 the same as the current _nowait functions. If that's for handling depend clauses without nowait I think the compiler can just add another runtime call into libomp.

grokos added a comment.Aug 2 2018, 9:03 AM

These functions are 1:1 the same as the current _nowait functions. If that's for handling depend clauses without nowait I think the compiler can just add another runtime call into libomp.

Is there any benefit in having the compiler generate the call to libomp vs. embedding this call in extra interface functions?

These functions are 1:1 the same as the current _nowait functions. If that's for handling depend clauses without nowait I think the compiler can just add another runtime call into libomp.

Is there any benefit in having the compiler generate the call to libomp vs. embedding this call in extra interface functions?

Is there any benefit of having more functions? The interface of libomp is already there...

We should not be having calls to _-kmpc_omp_taskwait in libomptarget.
The wait should be done in the user code and then libomptarget routine should be called.

We should not be having calls to _-kmpc_omp_taskwait in libomptarget.
The wait should be done in the user code and then libomptarget routine should be called.

There is no way to implement asynchronous offloading without having knowledge of the dependencies. Intel failed to keep the separation strict for liboffload, so I don't think it helps to complain about the situation.

We should not be having calls to _-kmpc_omp_taskwait in libomptarget.
The wait should be done in the user code and then libomptarget routine should be called.

There is no way to implement asynchronous offloading without having knowledge of the dependencies. Intel failed to keep the separation strict for liboffload, so I don't think it helps to complain about the situation.

Instead, we may pass the pointer to the synchronization function and call it. The compiler may provide the pointer to this sync function. The expected function may be simple (like void sync()), the compiler may generate the required wrapper function around __kmpc_omp_taskwait

We should not be having calls to _-kmpc_omp_taskwait in libomptarget.
The wait should be done in the user code and then libomptarget routine should be called.

There is no way to implement asynchronous offloading without having knowledge of the dependencies. Intel failed to keep the separation strict for liboffload, so I don't think it helps to complain about the situation.

Instead, we may pass the pointer to the synchronization function and call it. The compiler may provide the pointer to this sync function. The expected function may be simple (like void sync()), the compiler may generate the required wrapper function around __kmpc_omp_taskwait

When truly implementing asynchronous offloading libomptarget needs to tightly integrate into libomp's task scheduling. This will require a larger set of functions and I don't think we should change the interface just because we want to improve the implementation.

Anyway it's libOMPtarget, I don't really see the point. The library isn't agnostic of the programming model, it obeys to OMP_ environment variables - which btw. are also queried from libomp!

It may be named libOMPtarget and is in openmp projects but we want to use the offloading library with other runtime libraries.

These calls are here because they require the interface of libomp library include these functions. A patch for Clang is in the works which calls these functions so they need to have some basic, correct implementation that works when used with libomp.
The implementation can/should be improved in the future. In our proprietary OpenMP library implementation we already do something more elaborate which is why we need the placeholders here.

gtbercea added a comment.EditedAug 2 2018, 10:16 AM

! In D50158#1185865, @Hahnfeld wrote:
Is there any benefit of having more functions? The interface of libomp is already there...

Yes, libomptarget can be used with other versions of the libomp library which may have a richer interface than the open source openmp runtime library.

These calls are here because they require the interface of libomp library include these functions. A patch for Clang is in the works which calls these functions so they need to have some basic, correct implementation that works when used with libomp.
The implementation can/should be improved in the future. In our proprietary OpenMP library implementation we already do something more elaborate which is why we need the placeholders here.

Can you at least partly explain why the compiler generated code cannot call __kmpc_omp_taskwait for directives that have no nowait clause? Including a change because "we need it for our proprietary implementation" is a weak argument IMO...

Implementation:

First, there missing sync for the _tgt_target_data_begin_depend, tgt_target_data_begin_depend_nowait, tgt_target_data_update_depend, tgt_target_data_update_depend_nowait,
Look at the end versions of the call for a possible implementation.
Second, it would be better IMO to use
kmpc_omp_wait_deps instead of the broader task wait.

Larger picture:

We negotiated these calls with Alexey: we could create target tasks for these, but Alexey estimated that creating target tasks for entries like begin/end/update, which have no target code, was significantly more complicated. And since it's not a big deal to do it in the runtime, that is why we agreed on having the async versions of these as entry points. If Alexey want to revisit his judgement, I have no issue with that. Its a lot of busy work for something that can easily be replaced in the runtime, but fine by me either ways.

Bigger pictures:

When truly implementing asynchronous offloading libomptarget needs to tightly integrate into libomp's task scheduling. This will require a larger set of functions and I don't think we should change the interface just because we want to improve the implementation.

Anyway it's libOMPtarget, I don't really see the point. The library isn't agnostic of the programming model, it obeys to OMP_ environment variables - which btw. are also queried from libomp!

Agreed, and what is needed here is a way to launch a target task, have libOMP filter dependences that can be satisfied on the device and the one that cannot. The ones that can be satisfied on the device, we need an token that will in effect describe an event/stream combination for NVIDIA, and they will be considered as satisfied from the host's perspective. The ones that are not satisfied on the device, the hose will have to wait for these as usual. Then once all the host dependences are satisfied, the actual task can be launched with the device dependence (list of token to event/stream) being inserted by libomptarget.

The mechanism in libOMP is fairly unobtrusive, in that the dependence graph nodes need to be augmented by one device descriptor (host or device id) and one dependence descriptor (which is device dependent). When scanning a target task, the compiler has to create a task with extra data to maintain up to one pointer per dependence. So the filter add the dependences that can be satisfied by the device in that extra storage associated with the task. When the target task is ready to launch (i.e. all the host dependences are satisfied), libomptarget will get this list of filtered dependences and act accordingly.

Implementation:

First, there missing sync for the _tgt_target_data_begin_depend, tgt_target_data_begin_depend_nowait, tgt_target_data_update_depend, __tgt_target_data_update_depend_nowait,
Look at the end versions of the call for a possible implementation.

I'm not sure what you mean here: libomptarget is completely synchronous for now, so the internal calls will only return after all data movement has happened. Or am I missing something here?

Second, it would be better IMO to use __kmpc_omp_wait_deps instead of the broader task wait.

Agreed. Didn't we have that at some point?

Larger picture:

We negotiated these calls with Alexey: we could create target tasks for these, but Alexey estimated that creating target tasks for entries like begin/end/update, which have no target code, was significantly more complicated. And since it's not a big deal to do it in the runtime, that is why we agreed on having the async versions of these as entry points. If Alexey want to revisit his judgement, I have no issue with that. Its a lot of busy work for something that can easily be replaced in the runtime, but fine by me either ways.

I agree with both of you here: It's easier to do in the runtime + extra optimization opportunities.

Bigger pictures:

When truly implementing asynchronous offloading libomptarget needs to tightly integrate into libomp's task scheduling. This will require a larger set of functions and I don't think we should change the interface just because we want to improve the implementation.

Anyway it's libOMPtarget, I don't really see the point. The library isn't agnostic of the programming model, it obeys to OMP_ environment variables - which btw. are also queried from libomp!

Agreed, and what is needed here is a way to launch a target task, have libOMP filter dependences that can be satisfied on the device and the one that cannot. The ones that can be satisfied on the device, we need an token that will in effect describe an event/stream combination for NVIDIA, and they will be considered as satisfied from the host's perspective. The ones that are not satisfied on the device, the hose will have to wait for these as usual. Then once all the host dependences are satisfied, the actual task can be launched with the device dependence (list of token to event/stream) being inserted by libomptarget.

That's actually the point why you need a dedicated function for depend without nowait. I still don't see why there is a need for _nowait_depend, all of that is already handled by _nowait with an empty dependency list.

gtbercea abandoned this revision.Aug 2 2018, 12:15 PM

No longer needed for trunk.

Hmm, as far as I understood there is a case for needing _depend (without nowait)