Asynchronous offloading is wrapped into a target task, and the
corresponding dependencies are attached to the task. Only when all dependencies
are fulfilled is the task enqueued and dispatched. However, almost all
device runtime libraries provide ways to handle dependencies such that we don't
need to go back to the host side to resolve them. For example, we can wait
for a CUDA event before we push operations into a stream. The wait is
non-blocking, so all following enqueues can proceed; however, they will
not be executed until the awaited event is fulfilled.
This patch lowers the dependency processing of a target task to the device side.
It supports depending on both host tasks and target tasks. For dependencies on
target tasks, the processing moves to the device side; for dependencies on host
tasks, the current mechanism is still used, with a tiny modification.
The following are design details:
When a `target` construct is encountered, Clang wraps it into a task and emits a
function call to `__kmpc_omp_target_task_alloc`. We mark all tasks allocated by
`__kmpc_omp_target_task_alloc` as _target task_s.
After the target task, call it _A_, is created, and assuming it has
dependencies, `__kmpc_omp_task_with_deps` is called to resolve and process them.
The only change here is: when A depends on another target task, call it _B_,
B is added to A's `predecessors`, a linked list storing all of A's *target*
predecessors. Here we do NOT increase A's counter `npredecessors`. If B is a
host/regular task, the existing scheme is used: add A to B's `successors` and
increase A's `npredecessors`. As a result, a target task's `npredecessors` only
counts the *host/regular* tasks the target task depends on. Finally, the target
task is enqueued no matter whether its dependencies are fulfilled.
Now let's switch to `libomptarget`, taking `target nowait` as the example; the
`target data` variants work the same way. It first creates a new
`__tgt_async_info`, which contains three fields: `DeviceID`, an index into
`Devices`; `Queue`, a queue-like data structure onto which device operations are
pushed; and `Event`, a device-dependent event. The function `target_nowait`
first checks whether asynchronous APIs are supported. If not, it waits for all
dependencies to be fulfilled by checking the counter, yielding the current task
as long as it still has unfinished dependencies. Once all dependencies are done,
it calls the synchronous version of `target`, passing a `nullptr` for the
`__tgt_async_info` to tell the device RTL to use synchronous APIs.
Let's get back to the asynchronous version. It first calls `waitForDeps` to
process dependencies, checking whether `npredecessors` is zero. If not, there
are still depending host tasks that have not finished, so it yields the current
task. Once the counter reaches zero, the task has no pending host/regular
dependencies, and it starts to check all depending target tasks.
`__kmpc_get_target_task_waiting_list` is called to fetch the `__tgt_async_info`
pointers of all its depending target tasks; we'll cover its implementation
later. For each async info, if the depending task and the current task are on
the same *type* of target device (two target tasks are of the same type if
their device RTLs are the same), the device API can take care of the
dependency: we call the device RTL function `wait_event`, which is mapped to
the plugin interface `__tgt_rtl_wait_event`, to insert the event before doing
the real offloading work. I'll cover the mapping of plugin interfaces and their
functionality later. `wait_event` is expected to be asynchronous; its effect is
to tell the device RTL that all later enqueued operations may only start once
the inserted event is fulfilled. This mechanism cannot work if the current task
depends on a target task on a different type of target device. In that case, we
perform `queryAndWait` to check whether the corresponding event is fulfilled,
i.e. whether the corresponding target task has finished. If not, we yield the
current task. It's worth noting that we do not yield indefinitely: a counter
tracks how many times we have yielded, and once it reaches a certain threshold
we stop yielding and instead _synchronize_ on the event, which is a blocking
wait. This is an optimization to avoid long spinning when there is no other
task in the queue.
Once all waiting events are inserted, the offloading work of the current target
task can start. Again, the device RTL guarantees that the following offloading
operations will not start until all awaited events are fulfilled. The offloading
work of the current target task is done by `target`: it transfers data to the
device, launches the kernel, and then transfers data back to the host. Note that
all these operations are asynchronous. After that, we need an event that is only
fulfilled once all previously enqueued operations are done. The event is fetched
by calling `recordEvent`, which is mapped to `__tgt_rtl_record_event`. This step
may be unnecessary on target devices where an event is generated by each
enqueue; in that case, just leave `__tgt_rtl_record_event` empty and return
`OFFLOAD_SUCCESS`. With the event in hand, we attach it to the current task by
calling `__kmpc_set_async_info` so that all its dependent tasks can fetch and
use it. As an optimization, if no task depends on this one, this step could be
skipped; however, due to an issue in the current CodeGen that prevents passing
the right number into those functions, we cannot rely on that yet, so for now we
always set it. After this point, all dependent tasks can get the async info and
start their own waits by inserting the event. The current task then performs
`queryAndWait`, which is exactly the routine described above for dependencies
across different types of target devices, and finally finishes.
So there are four new plugin interfaces:
`__tgt_rtl_release_async_info`: Releases the asynchronous information, basically
returning the Queue and destroying the Event.
`__tgt_rtl_wait_event`: Non-blocking wait for an event. It inserts the event so
that all following enqueued operations will not start until the event is
fulfilled. Since it is non-blocking, we can still enqueue operations even if the
event is not yet fulfilled; they just cannot start. This can improve
concurrency.
`__tgt_rtl_record_event`: Generates an event that can only be fulfilled when all
previously enqueued operations have finished. The _record_ here is CUDA
terminology; feel free to comment if you have a better name.
`__tgt_rtl_check_event`: Checks whether the event is fulfilled. If not, returns
`OFFLOAD_NOT_DONE`; if yes, returns `OFFLOAD_SUCCESS`. Returns `OFFLOAD_FAIL` if
anything goes wrong in the RTL.
The last part covers some functions implemented in `libomp`. We add two data
members to the depnode structure, since it is a per-task data structure and
implemented with reference counting: a linked list `predecessors`, and a
`void *` pointer holding the async info of the current task.
`__kmpc_get_target_task_waiting_list` walks all nodes in `predecessors` and
checks whether the corresponding async info pointer is `nullptr`. If it is, the
target task has not set its async info yet, and we yield the current task.
Otherwise, the pointer is pushed onto a list that will be used by the current
task.
As before, once the reference count of a depnode reaches zero, the node is
freed. This calls `__kmpc_free_async_info` to release the corresponding
information and free the memory, and decrements the reference count of all
nodes in its `successors`.