This patch implements the memory lock/unlock API, introduced in patch https://reviews.llvm.org/D139208, in the NextGen plugins.
The patch also re-organizes the map of host pinned allocations in PluginInterface.
Differential D141227
[OpenMP][libomptarget] Implement memory lock/unlock API in NextGen plugins kevinsala on Jan 8 2023, 8:57 AM. Authored by
Details This patch implements the memory lock/unlock API, introduced in patch https://reviews.llvm.org/D139208, in the NextGen plugins. The patch also re-organizes the map of host pinned allocations in PluginInterface.
Diff Detail Event TimelineComment Actions https://github.com/openucx/ucx/blob/master/src/ucs/datastruct/pgtable.h super optimized page table for a similar use case.
Comment Actions That's interesting. We have another use inside of libomptarget where we need to determine if a pointer lies inside of an already mapped memory region. We could profile it and see if we could get some better performance with a more optimal implementation.
Comment Actions The license doesn't look like something we could copy. If there is a paper or similar we could reimplement it I assume.
Comment Actions LG, with 3 nits:
Comment Actions I changed the old plugin interface for tgt_rtl_data_lock to return an error code. It now returns the lockedptr as function argument. Let me know if this is not what was called for.
Comment Actions Fixing pending comments, updating code documentation, and including API changes in https://reviews.llvm.org/D139208. Comment Actions Still missing the pin/unpin calls in the CUDA plugin; it will be added in the following update of this patch.
Comment Actions What is locked memory, and could we define it in a comment/documentation somewhere if it isn't already? Is it the same as mmap LOCKED? That plus extra? Does it imply GPUs can read/write it? If so, can they read/write within a kernel execution, or can we implement locked memory by copy at the start and end of the execution? Comment Actions Good question! We care about locking host memory because - roughly speaking - hsa_amd_memory_async_copy (asynchronous H2D and D2H memory copy) has a fast path for prelocked/pinned memory. Comment Actions I forgot to answer this: a locked pointer is passed a set of agents that can access it. It is best described as a "agent accessible pointer". In the "first gen" plugin, we use the agent accessible pointer (via call to lock) to perform a memory_async_copy, passing in the GPU agent involved in the copy.
Comment Actions For HSA we can either have coarse grain semantics on a locked pointer or fine grain depending on which hsa calls we make. That surfaces a user visible distinction - if their kernel writes to this memory, can the host see that write while the kernel executes? I don't know which answer is correct but would like us to choose one and document it, and probably have the same behaviour on nvptx. Comment Actions I think it's good to go. If this is something you've tried in qmcpack so we have a real world example of it improving things then even better. Comment Actions QMCPACK doesn't allocate pinned memroy via OpenMP. So this patch doesn't immediately benefit QMCPACK. This patch lays the foundation to a further optimization and the 16 release branch will be created very soon. So I'm pushing for merging this in time. Comment Actions If you don't mind keeping an eye on CI and reverting it on breakage feel free to land it, I'm away from my desk Comment Actions I'll update the patch today. I'm improving it to feature ref counting and certain overlapping of locked areas (partial overlapping with extension of an already locked area is forbidden). Comment Actions Rebasing and adding support for ref counting on locked buffers and allow certain overlapping. Given an already locked buffer A, other buffers that are fully contained inside A can be locked, even if they are smaller than A. Extending an existing locked buffer is not allowed. The original region is unlocked once all its users have released the locked buffer and sub-buffers. Comment Actions If this fixes these failing with the nextgen plugins, please remove the line disabling the nextgen in the test. Failed Tests (2): libomptarget :: amdgcn-amd-amdhsa :: mapping/prelock.cpp libomptarget :: amdgcn-amd-amdhsa-LTO :: mapping/prelock.cpp Comment Actions It doesn't fix these tests because they are apparently wrong. We cannot map a device accessible buffer, which is the agent_ptr returned when locking the buffer. (More info: https://reviews.llvm.org/D142399). We can fix it using is_device_ptr instead. |
Brief documentation seems to be used elsewhere, add it here and below.