This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][libomptarget] Implement memory lock/unlock API in NextGen plugins
ClosedPublic

Authored by kevinsala on Jan 8 2023, 8:57 AM.

Details

Summary

This patch implements the memory lock/unlock API, introduced in patch https://reviews.llvm.org/D139208, in the NextGen plugins.

The patch also re-organizes the map of host pinned allocations in PluginInterface.

Diff Detail

Event Timeline

kevinsala created this revision.Jan 8 2023, 8:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 8 2023, 8:57 AM
kevinsala requested review of this revision.Jan 8 2023, 8:57 AM

https://github.com/openucx/ucx/blob/master/src/ucs/datastruct/pgtable.h super optimized page table for a similar use case.

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
299

I would return std::optional<const EntryTy*> to avoid using nullptr to model failure.

332

I would return std::optional<void *> to avoid passing mutable pointers.

https://github.com/openucx/ucx/blob/master/src/ucs/datastruct/pgtable.h super optimized page table for a similar use case.

That's interesting. We have another use inside of libomptarget where we need to determine if a pointer lies inside of an already mapped memory region. We could profile it and see if we could get some better performance with a more optimal implementation.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1823–1824

Nit, why not one line.

jdoerfert added inline comments.Jan 8 2023, 11:51 AM
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
299

Why would we want that? It seems to me null is a fine, didn't find the entry response. We don't need to box things just to have boxed them.

332

This makes more sense. Though, again, why boxing. Returning the devptr content avoids mutable arguments just fine

tschuett added inline comments.Jan 8 2023, 12:12 PM
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
299

Because these are C idoms in C++ code. LLVM is inconsistent whether returning true or false denotes success. The boxing overhead is negligible. The intent of your APIs is much clearer. tryFindIntersecting is a fallible API. òptional` is a great tool to for fallible APIs.

jdoerfert added inline comments.Jan 8 2023, 12:46 PM
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
299
  • Pointers/nullptr is arguably a C++ concept, we don't need to box things because C++ supports objects.
  • I didn't argue about overheads, no need to start that discussion.
  • There is no true or false return here so no confusion, it's EntryTy * that is nullptr or not.
  • The API intent is to find an EntryTy, nullptr is arguably not an EntryTy object, hence none was found. How much clearer could it get if we would check an optional. Further, what would the set but nullptr case even mean?
  • As far as I can see, our coding standards page does not include the word optional. If you want to have this as part of the canonical design, feel free to write an RFC. In the nextgen plugins we only use optional in the JIT, hence this is not an outlier.

https://github.com/openucx/ucx/blob/master/src/ucs/datastruct/pgtable.h super optimized page table for a similar use case.

The license doesn't look like something we could copy. If there is a paper or similar we could reimplement it I assume.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1820

Brief documentation seems to be used elsewhere, add it here and below.

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
332

And remove the const cast at the use site. Cast it here if necessary.

jdoerfert added inline comments.Jan 8 2023, 1:07 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1903

It's not const. Change isHostPinnedBuffer to getHostPinnedBuffer and assign it in the conditional.

openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
499
kevinsala updated this revision to Diff 487238.Jan 8 2023, 2:46 PM

Fixing issues and format. Still missing registering the host memory in CUDA plugin.

kevinsala marked 5 inline comments as done.Jan 8 2023, 2:50 PM
kevinsala added inline comments.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1903

Done, however, I renamed it to getDevicePtrFromPinnedBuffer() since it returns the device pointer.

ye-luo added inline comments.Jan 8 2023, 2:55 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1903

DevicePtr usually refers to the memory on the device. I would call it DeviceAccessiblePtr

jdoerfert accepted this revision.Jan 8 2023, 3:32 PM

LG, with 3 nits:

  • Ye's comment.
  • The one below.
  • @carlo.bertolli needs to modify the plugin interface and if we land this first we need to remember to change it as the other patch makes it in. Otherwise we can wait with this one.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1898–1901
This revision is now accepted and ready to land.Jan 8 2023, 3:32 PM

I changed the old plugin interface for tgt_rtl_data_lock to return an error code. It now returns the lockedptr as function argument. Let me know if this is not what was called for.
Thanks for this extension!

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp
585–586

Just a nit: for amdgpu's we don't need to keep a table of locked pointers. This is already done by ROCr. I would consider making this optional for amdgpu's.

kevinsala marked an inline comment as done.Jan 10 2023, 10:37 AM
kevinsala added inline comments.
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp
585–586

You're right. But for the moment, we want to keep this information cached into a map at the plugin level. In this way, we can access the pointer info "faster" and keep track of the pinned memory buffers that are OpenMP-related.

Fixing pending comments, updating code documentation, and including API changes in https://reviews.llvm.org/D139208.

kevinsala marked 3 inline comments as done.Jan 10 2023, 11:10 AM

Still missing the pin/unpin calls in the CUDA plugin; it will be added in the following update of this patch.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1903

Just renamed the function and related code documentation.

What is locked memory, and could we define it in a comment/documentation somewhere if it isn't already?

Is it the same as mmap LOCKED? That plus extra?

Does it imply GPUs can read/write it? If so, can they read/write within a kernel execution, or can we implement locked memory by copy at the start and end of the execution?

What is locked memory, and could we define it in a comment/documentation somewhere if it isn't already?

Is it the same as mmap LOCKED? That plus extra?

Does it imply GPUs can read/write it? If so, can they read/write within a kernel execution, or can we implement locked memory by copy at the start and end of the execution?

Good question!
I always refer back to this article for a definition of locked/pinned:
https://lwn.net/Articles/600502/
but it might be stale, as my knowledge of OS is.
Aside from what I know or not know about OS, this is a kernel concept that we are reflecting in the plugin's.
In general, it is a concept used CUDA and HSA, so perhaps we can refer back to it?

We care about locking host memory because - roughly speaking - hsa_amd_memory_async_copy (asynchronous H2D and D2H memory copy) has a fast path for prelocked/pinned memory.
As locking memory is expensive, putting some "space" between lock operations, memory copies, and unlock operations is an effective optimization strategy. That's why we are giving users an API to prelock pointers when they know they can, and we avoid relocking them in the plugin's.
I hope this helps.

What is locked memory, and could we define it in a comment/documentation somewhere if it isn't already?

Is it the same as mmap LOCKED? That plus extra?

Does it imply GPUs can read/write it? If so, can they read/write within a kernel execution, or can we implement locked memory by copy at the start and end of the execution?

I forgot to answer this: a locked pointer is passed a set of agents that can access it. It is best described as a "agent accessible pointer". In the "first gen" plugin, we use the agent accessible pointer (via call to lock) to perform a memory_async_copy, passing in the GPU agent involved in the copy.

Good question!
I always refer back to this article for a definition of locked/pinned:
https://lwn.net/Articles/600502/
but it might be stale, as my knowledge of OS is.
Aside from what I know or not know about OS, this is a kernel concept that we are reflecting in the plugin's.
In general, it is a concept used CUDA and HSA, so perhaps we can refer back to it?

We care about locking host memory because - roughly speaking - hsa_amd_memory_async_copy (asynchronous H2D and D2H memory copy) has a fast path for prelocked/pinned memory.
As locking memory is expensive, putting some "space" between lock operations, memory copies, and unlock operations is an effective optimization strategy. That's why we are giving users an API to prelock pointers when they know they can, and we avoid relocking them in the plugin's.
I hope this helps.

For HSA we can either have coarse grain semantics on a locked pointer or fine grain depending on which hsa calls we make. That surfaces a user visible distinction - if their kernel writes to this memory, can the host see that write while the kernel executes? I don't know which answer is correct but would like us to choose one and document it, and probably have the same behaviour on nvptx.

Ping. What is left before merging?

JonChesterfield added a comment.EditedJan 21 2023, 8:39 AM

I think it's good to go. If this is something you've tried in qmcpack so we have a real world example of it improving things then even better.

I think it's good to go. If this is something you've tried in qmcpack so we have a real world example of it improving things then even better.

QMCPACK doesn't allocate pinned memroy via OpenMP. So this patch doesn't immediately benefit QMCPACK. This patch lays the foundation to a further optimization and the 16 release branch will be created very soon. So I'm pushing for merging this in time.

If you don't mind keeping an eye on CI and reverting it on breakage feel free to land it, I'm away from my desk

This patch needs a rebase. Applying patch failed.

kevinsala marked an inline comment as done.Jan 21 2023, 1:41 PM

I'll update the patch today. I'm improving it to feature ref counting and certain overlapping of locked areas (partial overlapping with extension of an already locked area is forbidden).

kevinsala updated this revision to Diff 491148.Jan 22 2023, 4:48 AM

Rebasing and adding support for ref counting on locked buffers and allow certain overlapping. Given an already locked buffer A, other buffers that are fully contained inside A can be locked, even if they are smaller than A. Extending an existing locked buffer is not allowed. The original region is unlocked once all its users have released the locked buffer and sub-buffers.

ye-luo accepted this revision.Jan 23 2023, 1:23 PM

Ref-counting lock/unlock makes a lot sense.

If this fixes these failing with the nextgen plugins, please remove the line disabling the nextgen in the test.

Failed Tests (2):
  libomptarget :: amdgcn-amd-amdhsa :: mapping/prelock.cpp
  libomptarget :: amdgcn-amd-amdhsa-LTO :: mapping/prelock.cpp

If this fixes these failing with the nextgen plugins, please remove the line disabling the nextgen in the test.

Failed Tests (2):
  libomptarget :: amdgcn-amd-amdhsa :: mapping/prelock.cpp
  libomptarget :: amdgcn-amd-amdhsa-LTO :: mapping/prelock.cpp

It doesn't fix these tests because they are apparently wrong. We cannot map a device accessible buffer, which is the agent_ptr returned when locking the buffer. (More info: https://reviews.llvm.org/D142399). We can fix it using is_device_ptr instead.