[OpenMP] Improve D2D memcpy to use more efficient driver API
ClosedPublic

Authored by tianshilei1992 on May 27 2020, 11:09 AM.

Details

Summary

In the current implementation, a D2D memcpy first copies the data back to the host and
then copies it from the host to the destination device. This is very inefficient if the
device supports native D2D memcpy, as CUDA does.

In this patch, D2D memcpy first tries to use the natively supported driver API. If
that fails, it falls back to the original host-staged path. It is worth noting that
D2D memcpy in this scenario covers two cases:

  • Same devices: this is the D2D memcpy in the CUDA context.
  • Different devices: this is the PeerToPeer memcpy in the CUDA context.

My implementation merges these two cases. It chooses the best API according to
the source device and destination device.
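
The dispatch described above can be sketched roughly as follows. This is a minimal, host-only model and not the code from the patch: `FakeDevice`, `CopyPath`, `nativeCopy`, and `deviceToDeviceCopy` are hypothetical names, and device memory is simulated with byte vectors. In the CUDA plugin the two native cases would presumably map to driver calls such as `cuMemcpyDtoD` (same device) and `cuMemcpyPeer` (different devices); that mapping is an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a target device: an ID plus simulated memory.
struct FakeDevice {
  int Id;
  bool SupportsNativeD2D; // assumption: the plugin can report native support
  std::vector<unsigned char> Memory;
};

// Which path a copy ended up taking.
enum class CopyPath { SameDevice, PeerToPeer, HostStaged };

// Try the native path. Returns false when either side lacks native support,
// which tells the caller to fall back to staging through the host.
bool nativeCopy(FakeDevice &Src, std::size_t SrcOff, FakeDevice &Dst,
                std::size_t DstOff, std::size_t Size, CopyPath &Used) {
  if (!Src.SupportsNativeD2D || !Dst.SupportsNativeD2D)
    return false;
  // Same device: plain D2D copy. Different devices: peer-to-peer copy.
  Used = (Src.Id == Dst.Id) ? CopyPath::SameDevice : CopyPath::PeerToPeer;
  std::memmove(Dst.Memory.data() + DstOff, Src.Memory.data() + SrcOff, Size);
  return true;
}

// Copy Size bytes between devices, preferring the native path.
CopyPath deviceToDeviceCopy(FakeDevice &Src, std::size_t SrcOff,
                            FakeDevice &Dst, std::size_t DstOff,
                            std::size_t Size) {
  assert(SrcOff + Size <= Src.Memory.size() && "source range out of bounds");
  assert(DstOff + Size <= Dst.Memory.size() && "destination range out of bounds");

  CopyPath Used = CopyPath::HostStaged;
  if (Size == 0 || nativeCopy(Src, SrcOff, Dst, DstOff, Size, Used))
    return Used;

  // Fallback (the original behaviour): device -> host, then host -> device.
  std::vector<unsigned char> Host(Size);
  std::memcpy(Host.data(), Src.Memory.data() + SrcOff, Size);
  std::memcpy(Dst.Memory.data() + DstOff, Host.data(), Size);
  return CopyPath::HostStaged;
}
```

The caller only sees one entry point; whether the copy was same-device, peer-to-peer, or host-staged is decided inside it, which mirrors the "merge the two cases and fall back on failure" idea in the summary.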

Event Timeline

Here are the execution results, copied from a run on Summit.

==22767== NVPROF is profiling process 22767, command: ./d2d_memcpy
==22767== Profiling application: ./d2d_memcpy
PASS
==22767== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream          Src Dev   Src Ctx          Dst Dev   Dst Ctx  Name
949.72ms  1.7920us                    -               -         -         -         -        1B  544.96KB/s      Device    Pageable  Tesla V100-SXM2         1         7                -         -                -         -  [CUDA memcpy DtoH]
949.77ms  1.7920us                    -               -         -         -         -        1B  544.96KB/s      Device    Pageable  Tesla V100-SXM2         1         7                -         -                -         -  [CUDA memcpy DtoH]
949.80ms  1.5360us                    -               -         -         -         -        4B  2.4835MB/s    Pageable      Device  Tesla V100-SXM2         1         7                -         -                -         -  [CUDA memcpy HtoD]
949.87ms  457.87ms        (2097152 1 1)       (128 1 1)        44      946B        0B         -           -           -           -  Tesla V100-SXM2         1        19                -         -                -         -  __omp_offloading_32_a7b5d52_main_l34 [128]
1.40840s  22.820ms                    -               -         -         -         -  1.0000GB  43.822GB/s      Device      Device  Tesla V100-SXM2         1        19  Tesla V100-SXM2         1  Tesla V100-SXM2         2  [CUDA memcpy PtoP]
1.46565s  1.7920us                    -               -         -         -         -        1B  544.96KB/s      Device    Pageable  Tesla V100-SXM2         2        52                -         -                -         -  [CUDA memcpy DtoH]
1.46568s  1.7920us                    -               -         -         -         -        1B  544.96KB/s      Device    Pageable  Tesla V100-SXM2         2        52                -         -                -         -  [CUDA memcpy DtoH]
1.46572s  1.5360us                    -               -         -         -         -        4B  2.4835MB/s    Pageable      Device  Tesla V100-SXM2         2        52                -         -                -         -  [CUDA memcpy HtoD]
1.48614s  492.70ms        (2097152 1 1)       (128 1 1)        46      946B        0B         -           -           -           -  Tesla V100-SXM2         2        64                -         -                -         -  __omp_offloading_32_a7b5d52_main_l49 [149]
1.97885s  159.89ms                    -               -         -         -         -  1.0000GB  6.2542GB/s      Device    Pageable  Tesla V100-SXM2         2        64                -         -                -         -  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

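With PeerToPeer copy, the throughput reaches about 43.8 GB/s for the 1 GB transfer, compared with about 6.25 GB/s for the 1 GB DtoH copy in the same trace.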

Some high-level comments. I haven't read the main logic in depth yet, but it looks reasonable to me. Others should chime in too.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
804

Any reason not to make this a function and call it?

openmp/libomptarget/src/api.cpp
178

Please move this into the conditional.

openmp/libomptarget/src/device.cpp
398

We use the name? I somehow feel uneasy about this. Don't we have some form of ID?

openmp/libomptarget/src/device.h
160

Documentation please.

189

Documentation please.

grokos added inline comments.May 27 2020, 3:50 PM
openmp/libomptarget/src/device.cpp
398

I agree. Be careful because RTLInfoTy::RTLName is only available in debug builds, so this piece of code will break if we compile the library in release mode.

You can use a direct pointer comparison RTL == OtherDevice.RTL (devices managed by the same RTL will point to the same RTLInfoTy object).
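
A minimal sketch of that pointer comparison, assuming each device holds a pointer to the `RTLInfoTy` object of the plugin that manages it. The struct bodies are reduced to placeholders and `sharesRTLWith` is a hypothetical helper name, not part of libomptarget:

```cpp
// Placeholder for the per-plugin info object; the real struct has many fields.
struct RTLInfoTy {};

// Reduced stand-in for a device: only the pointer to its managing RTL.
struct DeviceTy {
  RTLInfoTy *RTL = nullptr;

  // Hypothetical helper: two devices are managed by the same plugin
  // exactly when they point at the same RTLInfoTy object.
  bool sharesRTLWith(const DeviceTy &Other) const {
    return RTL != nullptr && RTL == Other.RTL;
  }
};
```

This avoids depending on `RTLName`, so it behaves the same in debug and release builds.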

tianshilei1992 added inline comments.May 27 2020, 3:56 PM
openmp/libomptarget/src/device.cpp
398

Yes, that's why I removed the macro.
The pointer comparison may not work, because there is in fact one case that can violate it: OpenCL ICD, although it is not part of OpenMP. Maybe adding a new plugin interface here is more appropriate.

tianshilei1992 marked 5 inline comments as done.
grokos added a comment.Jun 2 2020, 1:15 PM

I'm happy with the changes, I think the patch looks good now.

@jdoerfert: Can you accept the patch once you are happy with it as well?

jdoerfert accepted this revision.Jun 2 2020, 2:22 PM

LGTM, one nit below.

openmp/libomptarget/src/device.h
189

^

This revision is now accepted and ready to land.Jun 2 2020, 2:22 PM

Updated documentation

tianshilei1992 marked 3 inline comments as done.Jun 2 2020, 4:23 PM
This revision was automatically updated to reflect the committed changes.