Page MenuHomePhabricator

[OpenMP] Optimized stream selection by scheduling data mapping for the same target region into a same stream
ClosedPublic

Authored by tianshilei1992 on Mar 28 2020, 7:20 PM.

Details

Summary

This patch introduces two things for offloading:

  1. Asynchronous data transferring: those functions are suffix with _async. They have one more argument compared with their synchronous counterparts: __tgt_async_info*, which is a new struct that only has one field, void *Identifier. This struct is for information exchange between different asynchronous operations. It can be used for stream selection, like in this case, or operation synchronization, which is also used. We may expect more usages in the future.
  2. Optimization of stream selection for data mapping. Previous implementation was using asynchronous device memory transfer but synchronizing after each memory transfer. Actually, if we say kernel A needs four memory copy to device and two memory copy back to host, then we can schedule these seven operations (four H2D, two D2H, and one kernel launch) into a same stream and just need synchronization after memory copy from device to host. In this way, we can save a huge overhead compared with synchronization after each operation.

Diff Detail

Event Timeline

tianshilei1992 created this revision.Mar 28 2020, 7:20 PM
lildmh added a subscriber: lildmh.Mar 30 2020, 5:38 AM
lildmh added inline comments.
openmp/libomptarget/include/omptarget.h
119

Do you need a clang patch to generate code for this structure and new APIs?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
912

It seems no synchronization by default, not aligned with the original behavior

tianshilei1992 added inline comments.Mar 30 2020, 7:35 AM
openmp/libomptarget/include/omptarget.h
119

Currently no. But of course this information can be used by Clang in the future if needed.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
912

The assumption here is: if this function is called at the end of target function, Identifier must be a valid pointer because anyhow during kernel launch it is assigned with a valid CUstream.

jdoerfert added inline comments.Mar 30 2020, 7:41 AM
openmp/libomptarget/include/omptarget.h
119

We'll target the new API also/mainly by rewriting old API calls in OpenMPOpt

openmp/libomptarget/plugins/cuda/src/rtl.cpp
912

It seems wrong to call this with Identifier = nullptr, doesn't it? If so we should make it an assert.

916

Why reinterpret_Cast? Is a CUstream a pointer? If not, why do we move objects around?

tianshilei1992 added inline comments.Mar 30 2020, 8:12 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
912

Yeah, seems right. "If there is nothing to be synchronized, what do users expect to synchronize?" Yes, should do an assert here.

916

Yep.

typedef struct CUstream_st *CUstream;                     /**< CUDA stream */
jdoerfert added inline comments.Mar 30 2020, 8:03 PM
openmp/libomptarget/include/omptarget.h
118

I don't like Identifier as name. Maybe execution queue? Or the name we used for streams before? Also, add a comment what it is and why it is a void *

openmp/libomptarget/plugins/cuda/src/rtl.cpp
320

Nit: I know the style in here is mixed but I would remove the braces around single statements, especially return.

916

then we should be able to just use a static cast, shouldn't we? Maybe also add a static_assert just to be safe. Also other places.

openmp/libomptarget/src/device.h
188

Remove the old versions if we don't expose them to the outside. Let's not accumulate clutter. Similarly for other functions.

openmp/libomptarget/src/omptarget.cpp
646

Move this to the struct definition. It's C++ after all.

tianshilei1992 added inline comments.Mar 30 2020, 8:28 PM
openmp/libomptarget/include/omptarget.h
118

The basic idea is, this structure is device independent. For CUDA device, we use it as CUstream so yes, we can call it stream. However, we don't know how it can be used in other platforms, and that's why I use a "weird" word Identifier and a void * here.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
320

Speaking of this, is it required in LLVM that single-statement should not be wrapped into braces? It is often good practice to have these braces even just with one statement.

916

Correct me if I'm wrong. Compiler will generate code for the cast if using static_cast, but it will not for reinterpret_cast. My thought is, since it is already in CUDA plugin, we can make sure that this void * is actually a CUstream so we can directly reinterpret it to avoid one cast instruction.

openmp/libomptarget/src/device.h
188

Do we need to make AsyncInfoPtr with default value?

openmp/libomptarget/src/omptarget.cpp
646

Yep. Will do that.

jdoerfert added inline comments.Mar 31 2020, 12:26 AM
openmp/libomptarget/include/omptarget.h
118

We need a better word and documentation of the member. Also initialize it here

openmp/libomptarget/plugins/cuda/src/rtl.cpp
320

Coding style says no braces but the runtime is unfortunately not in accordance...

916

It could but not for such a pointer 2 pointer cast on a "normal" architecture. It's just that I don't see reinterpret_cast often, almost 1:4 against static_cast, but I'm fine with leaving it.

openmp/libomptarget/src/device.h
188

Not if we can modify all callers (which we can if the symbols is not exported). If we can't we should add the new API calls as you have.

tianshilei1992 added inline comments.Mar 31 2020, 6:54 AM
openmp/libomptarget/include/omptarget.h
118

Sure. Do you have any suggestion for the name?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
320

Okay. Will change that.

openmp/libomptarget/src/device.h
188

Got you. Will do that.

jdoerfert added inline comments.Mar 31 2020, 1:09 PM
openmp/libomptarget/include/omptarget.h
118

IssueQueue?

tianshilei1992 marked 27 inline comments as done.Apr 2 2020, 4:42 PM

Update code according to review

jdoerfert added inline comments.Apr 3 2020, 11:13 AM
openmp/libomptarget/src/rtl.h
39

Similar to my last comment, why do we need async versions here? Let them be the default (w/o the _async).

Removed all synchronous version and make asynchronous as default

tianshilei1992 marked an inline comment as done.Apr 3 2020, 4:46 PM

Added comments on why we need conditional statements in __tgt_rtl_data_retrieve and __tgt_rtl_data_submit

I think this is good. Do you happen to have some runtime results?

I think this is good. Do you happen to have some runtime results?

Not yet. Just got accel-1.3 and didn't have a chance to execute them yet. Will do maybe tomorrow.

Fixed an issue that passing arguments to dataSubmit in a wrong order

jdoerfert accepted this revision.Apr 5 2020, 10:01 AM

One nit, otherwise LGTM.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
740

Style: Remove the else case, there is a return before. Same below.

This revision is now accepted and ready to land.Apr 5 2020, 10:01 AM
tianshilei1992 added inline comments.Apr 5 2020, 10:15 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
740

Sorry, didn't get your point. The return above only applies when rc != OFFLOAD_SUCCESS.

tianshilei1992 added inline comments.Apr 5 2020, 11:39 AM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
740

Okay, will do that.

Remove else after return