Page MenuHomePhabricator

[OpenMP] Introduce asynchronous malloc and free
Needs RevisionPublic

Authored by jdoerfert on Nov 3 2021, 8:52 AM.

Details

Summary

If the plugin supports asynchronous malloc and free we can avoid early
synchronization and hide more runtime work as part of the ongoing kernel
execution.

Diff Detail

Event Timeline

jdoerfert created this revision.Nov 3 2021, 8:52 AM
jdoerfert requested review of this revision.Nov 3 2021, 8:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 3 2021, 8:52 AM
Herald added a subscriber: sstefan1. · View Herald Transcript
NOTE: Missing the cmake magic to determine CUDA_VERSION.
jdoerfert updated this revision to Diff 384481.Nov 3 2021, 9:00 AM

Ensure synchronization for CUDA_VERSION < ...

grokos added a subscriber: grokos.Nov 3 2021, 9:39 AM

Diff shows rtl.h and device.h have been deleted?

jdoerfert updated this revision to Diff 384499.Nov 3 2021, 10:01 AM

Add files back, use the CUDART_VERSION macro to avoid need for cmake

jdoerfert updated this revision to Diff 384531.Nov 3 2021, 11:24 AM

Go back to CUDA_VERSION

grokos added inline comments.Nov 20 2021, 3:22 PM
openmp/libomptarget/plugins/common/MemoryManager/MemoryManager.h
146–152

We aren't we passing an AsyncInfo object as an extra argument to deleteOnDevice and subsequently call DeviceAllocator.free() with that AsyncInfo instead of nullptr? I suspect it's because we need to make room for subsequent allocations before attempting to re-allocate, if that's the case can you add a comment to make it clearer?

296

If there is a reason why deleteOnDevice() cannot be called with an extra AsyncInfo argument, we are we adding it here? It's not used anywhere in the body of free().

jdoerfert updated this revision to Diff 398477.Jan 9 2022, 6:25 PM

Address comments

Took me a while to get back to this. Way behind on certain things. Apologies.

openmp/libomptarget/plugins/common/MemoryManager/MemoryManager.h
146–152

I'll add a comment to explain that this operation should not be delayed.

296

I added it to keep the interface the same but I'll remove it now

tianshilei1992 added inline comments.Jan 9 2022, 6:32 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
1517

why would we need a sync before calling data alloc?

jdoerfert updated this revision to Diff 399822.Jan 13 2022, 4:07 PM

Address comment

jdoerfert updated this revision to Diff 399825.Jan 13 2022, 4:14 PM

put sync for delete back

jdoerfert marked an inline comment as done.Jan 13 2022, 4:21 PM

Ensure atomicity (via events) of async allocations

Fix issues, sync w/o pending work

ye-luo requested changes to this revision.Jan 20 2022, 10:18 AM
ye-luo added inline comments.
openmp/libomptarget/include/device.h
305

What about async malloc?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
321

pass StreamPoolTy& as an argument and there is no need of DeviceId

371

This class is one object per device. I would expect removing DeviceId replace vectors with a single element.

openmp/libomptarget/src/device.cpp
290

The current Event is intended for D2H, you expanded its usage. In case of one record after alloc and another record after the transfer. I think this will go wrong. Once the event is fulfilled after alloc. other thread may think the transfer has been completed.

This revision now requires changes to proceed.Jan 20 2022, 10:18 AM
jdoerfert marked an inline comment as done.Jan 20 2022, 10:33 AM
jdoerfert added inline comments.
openmp/libomptarget/include/device.h
305

I'll rename.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
321

sure.

371

That is somewhat orthogonal, a cleanup for all this is following.

openmp/libomptarget/src/device.cpp
290

Right.

So checking if we also do D2H would partially help but not completely. The stickiness of the event is still a problem. It might also be for the pure D2H case with multiple transfers as only the first is properly guarded.
I think we need to:

  1. only enter it here if we don't do D2H in the same directive (thus below).
  2. remove events after we synchronized the AsyncInfo object it was recorded in.
ye-luo added inline comments.Jan 20 2022, 1:17 PM
openmp/libomptarget/src/device.cpp
290

Sorry I meant H2D not D2H. But seems you got my point. Both 1 and 2 are desired. I thought about 2 when the event was introduced but I didn't immediately figure out what is the best location to do 2.

tianshilei1992 requested changes to this revision.EditedMar 5 2022, 7:34 PM

cuDeviceGetAttribute has to be used to check if the device supports memory pool; otherwise runtime error will be emitted.

Herald added a project: Restricted Project. · View Herald TranscriptMar 5 2022, 7:34 PM