Page MenuHomePhabricator

[OpenMP] Introduce low level dependency process to target offloading
Needs ReviewPublic

Authored by tianshilei1992 on Jun 16 2020, 8:26 PM.

Details

Summary

Asynchronous offloading will be wrapped into a target task, and the
corresponding dependencies will go to the task. Only all dependencies are
full-filled, the task will be enqueued and dispatched. However, almost all
device runtime libraries provide ways for dependencies such that we don't need
to go back to host side to resolve the dependencies. For exmaple, we could wait
for a CUDA event before we push some operations into a stream. The wait is not
blocking so that all following enqueues will be proceeded. However, they will
not be executed until the waiting event is full-filled.

This patch lowers the dependency process of target task to the device side. It
supports depending on both host tasks and target tasks. For depending on target
tasks, the process goes to the device side. As for depending on host tasks,
current mechanism is still used with a tiny modification.

The following are design details:

When a target construct is encountered, Clang wraps it into a task, and emit
function call to __kmpc_omp_target_task_alloc. We mark all tasks allocated by
__kmpc_omp_target_task_alloc as _target task_s. The transformation is like:

#pragma omp target depend(D) nowait
{ /* target region */ }
// The above one will be transformed to the following one
#pragma omp task depend(D) shared(...) target_task
#pragma omp target
{ /* target region */ }

where target_task is just a flag that is not really a part of the construct.

After the target task is created, let's call it _A_ and assume it has
dependencies, __kmpc_omp_task_with_deps is called to resolve and process its
dependencies. The only change here is, when A depends on another target task,
let's call it _B_, B is add into A's predecessors which is a linked list
storing all A's *target* predecessors. Here we do NOT increase the counter
npredecessors of A. If B is a host/regular task, existing scheme is used, which
is to add A into B's successors, and increase A's npredecessors. This
approach indicates that a target task's npredecessors only represents the number
of *host/regular* tasks a target task depends. Finally, enqueue a target task no
matter whether its dependencies are full-filled.

Now let's switch to libomptarget and take target nowait as example, and
target data related stuffs are same. It first create a new __tgt_async_info
which contains three fields: DeviceID which is the index to use Devices,
Queue which is a queue-like data structure where to push device operations, and
Event that is a device-dependent event. In the function target_nowait, it
first checks whether asynchronous APIs are supported. If not, wait all its
dependencies to be full-filled by checking the counter and yield the current
task if it still has unfinished dependencies. Once all dependencies are done, it
calls the synchronous version of target by setting the __tgt_async_info to
nullptr to tell the device RTL to use synchronous APIs.

Let's get back to the asynchronous version. It first calls waitForDeps to process dependencies. Here
it checks whether the npredecessors is zero. If not, it means there is still depending host tasks that
have not been finished, then it yields the current task. After that, we know it has no depending
host/regular task, then it starts to check all depending target tasks. __kmpc_get_target_task_waiting_list
is called to fetch __tgt_async_info pointers of all its depending target task. We'll talk about how
__kmpc_get_target_task_waiting_list is implemented later. For each async info, if the depending
task and current task are from the same *type* of target device, which means we can ask the device
API to take care of the dependency, it calls device RTL function wait_event which is mapped to
the plugin interface __tgt_rtl_wait_event to insert the event before doing real offloading works.
I'll cover the map of plugin interfaces and their functionality later. The wait_event is expected to
be asynchronous and its effect is to tell the device RTL that all later enqueued operations can only
be started once the inserted events are full-filled. This mechanism will not work if current task depends
on a target task that is on another type of target device. In this case, we will perform queryAndWait
to check whether the corresponding event is full-filled, aka. the corresponding target task is finished.
If not, yield the current task. It's worth noting that we're not infinitely yielding current task. There is
a counter to tell how much time we have yielded. If it reaches a certain point, it will not yield again.
Instead, it will _synchronize_ the event, which is a blocking wait. This is an optimization to avoid
long-time looping when there is no task in the queue. Two target tasks are of same type if their device
RTLs are same.

Once we finish insert all waiting events, we can start the offloading work of current target task. Again,
the device RTL will make sure that our following offloading operations will not be started until all
waiting events are full-filled. The offloading work of current target task is done by target. It
basically transfers data to device, launches the kernel, and then transfers data back to host. Note that
all these operations are asynchronous. After that, we need to get an event which can only be full-filled
if all operations enqueued before are done. The event is fetched by calling recordEvent which is mapped
to __tgt_rtl_record_event. In fact, this step may not be necessary for some target devices if the event
is generated by each enqueue. In that case, just leave the __tgt_rtl_record_event empty and return
OFFLOAD_SUCCESS. Now we have the event, and we need to attach it to the current task by
calling __kmpc_set_async_info such that all its depended tasks can fetch and use it. There is an
optimization that if there is no dependency in this task, we don't need to do that. However, due to the
issue in current CG that it cannot pass right number into those functions, we cannot depend on it now.
As a consequence, we could only set it whatever. After this point, all its _depended_ tasks can get the
async info and starts their own wait by inserting the event. For current task, it performs queryAndWait
which basically is exactly the one we mentioned for the dependency waiting of two different types of
target devices, and finally finish the current target tasks.

So there are four new plugin interfaces:
__tgt_rtl_release_async_info: To release the asynchronous information, basically returning the Queue
and destroying the event.
__tgt_rtl_wait_event: Non-blocking wait for events. It is like just inserting the event and all following
enqueuing will not be started once the event is full-filled. Since it is non-blocking, we can still enqueue
operations even if the event is not full-filled. They just cannot be started. This can improve the
concurrency.
__tgt_rtl_record_event: Basically to generate an event which can only be full-filled when all previously
enqueued operations are finished. The _record_ here is a CUDA terminology. Feel free to comment
if you have a better name.
__tgt_rtl_check_event: To check whether the event is full-filled. If not, returns OFFLOAD_NOT_DONE;
If yes, returns OFFLOAD_SUCCESS. Return OFFLOAD_FAIL if anything is wrong in RTL.

The last part is about some functions implemented in libomp. We add two member data in the depnode
data structure because it is a per-task data structure and implemented with reference count. One is
a linked list successors, and another one is a void * pointer which is the async info of current
target task.

__kmpc_get_target_task_waiting_list basically goes through all nodes in successors and check
whether the corresponding async information pointer is nullptr. If yes, it means the target task has
not set the async info yet. We yield the current task here. If not, push the pointer to a list which will
be used by current task.

Like before, once the reference count of a depnode is zero, this node will be freed. It calls the function
__kmpc_free_async_info to release corresponding information and free the memory, and deref all
nodes in its successors.

Diff Detail

Event Timeline

tianshilei1992 created this revision.Jun 16 2020, 8:26 PM
Herald added a project: Restricted Project. · View Herald Transcript

Two high level comments below. We need to split this patch.

Can you explain the approach in a bit more detail in the commit message? (Also a typo in there).

openmp/libomptarget/src/omptarget.cpp
596

I guess we can make async info a pointer argument in a separate (NFC) patch to reduce this one, WDYT?

openmp/libomptarget/src/rtl.cpp
416

Style: Everywhere I have seen this we do /* name */ value. I know this was different here but I'd like us to align with LLVM & Clang on this one.
Feel free to commit the comments for all but the new argument as NFC without further review.

openmp/libomptarget/src/rtl.h
111

Typo in comment

Is there some design documentation on this? It's tricky to distinguish intent from quirks of cuda.

Amdgcn is built on the 'heterogenous system architecture' model which has a fair amount of support for managing graphs of tasks but also has challenging forward progress properties. I'm not immediately sure it would share much code with the nvptx implementation.

Is there some design documentation on this? It's tricky to distinguish intent from quirks of cuda.

Amdgcn is built on the 'heterogenous system architecture' model which has a fair amount of support for managing graphs of tasks but also has challenging forward progress properties. I'm not immediately sure it would share much code with the nvptx implementation.

I'll add some documentation.

The high level idea is:

  1. Add events to a queue. This operation is not blocking.
  2. Add following operations into the queue.
  3. Save the event from the second step.

Does AMD GCN support this pattern? The record event thing can be optional because I know some device RT generate the event when pushing an operation into a queue, like OpenCL.

ye-luo added a subscriber: ye-luo.Jun 17 2020, 2:21 PM

Fixed some issues and code style

tianshilei1992 edited the summary of this revision. (Show Details)Jun 18 2020, 9:49 AM
tianshilei1992 edited the summary of this revision. (Show Details)Jun 24 2020, 1:23 PM