Prepare amdgpu plugin for asynchronous implementation. This patch switches to using HSA API for asynchronous memory copy.
Moving away from hsa_memory_copy means that plugin is responsible for locking/unlocking host memory pointers.
Details
Diff Detail
Event Timeline
I think this is correct. Bunch of style requests inline but they could be done post commit if necessary (potentially by me). Getting rid of the hsa_memory_copy call is good for the path to async and probably good for reliability - locking the host pointer instead is an improvement.
openmp/libomptarget/plugins/amdgpu/impl/impl.cpp | ||
---|---|---|
48 | template <CopyDirection Dir>? Means we can static assert that it was one of H2D or D2H and lose the default: clause in the switch, e.g. static_assert((Dir == H2D) || (Dir == D2H),""); err = (Dir == H2D) ? invoke_hsa_copy(signal, dest, agent, lockedPtr, size) : invoke_hsa_copy(signal, lockedPtr, agent, src, size); Or maybe void * dstP = Dir == H2D ? dest : lockedPtr; void * srcP = Dir == H2D ? lockedPtr : src; err = invoke_hsa_copy(signal, destP, agent, srcP, size); since most of the arguments are the same in each case | |
53 | could have assert((src == lockingPtr) | (dst == lockingPtr)) here as the invariant is not obvious from the declaration | |
69 | Control flow is a little obfuscated here. Should go with the switch followed by unconditional unlocking: hsa_status_t unlockErr = hsa_amd_memory_unlock(lockingPtr); if (err != HSA_STATUS_SUCCESS) { return err; } if (unlockErr != HSA_STATUS_SUCCESS) { return unlockErr; } return HSA_STATUS_SUCCESS; | |
92 | Looks a bit like a bug as written because there are a lot of instances of if (err != SUCCESS) { return err; } elsewhere. That's probably why it is currently written return HSA_STATUS_SUCCESS;, I think we should stay with that. | |
111 | Not at all keen on the (pre-existing) duplication here, takes some effort reading both h2d and d2h to spot the differences. I think I'd like to take a pass over this after the patch lands and see if I can make the control flow clearer. | |
openmp/libomptarget/plugins/amdgpu/impl/impl_runtime.h | ||
22 | losing the const here is sad but unavoidable - the hsa call we're making doesn't have the pointer const qualified, though I think it could do | |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
488 | this is probably the only call sites for impl_memcpy_x2y, so if that was rendered as impl_memcpy<enum> we wouldn't lose much |
Don't you need to check if pointers are not already pinned before trying to lock it? HSA_EXT_POINTER_TYPE_HSA or HSA_EXT_POINTER_TYPE_LOCKED
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/fc99cf8516ef4bfc6311471b717838604a673b73/src/inc/hsa_ext_amd.h#L1820
hsa_amd_memory_lock and hsa_amd_memory_unlock are missing in hsa.cpp and hsa_ext_amd.h as well under openmp/libomptarget/plugins/amdgpu/dynamic_hsa
It seems hsa_ext_amd.h should define hsa_amd_memory_lock and hsa_amd_memory_unlock according to https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/inc/hsa_ext_amd.h
Also DLWRAP(hsa_amd_memory_lock, 5) in hsa.cpp
Yes, should revert and update the dlopen HSA source. Apologies for not checking this builds before committing it.
I don't know whether there's more ritual to to around whether it's already pinned - @carlo.bertolli did you look into testing if the memory is already pinned before doing so? Particularly interested if already pinned is a reason for lock to fail
I have not tried with memory that has already been locked, but I will. In any case, with this patch, if locking fails, then we revert to malloc+lock+unlock+free. This is not ideal, and this case is added for other reasons, but it should be supporting the case.
It is an AMD HSA extension. It builds fine on a system with rocm 4.5. What kind of problem are you seeing?
It will fail to build on a system where cmake fails to find rocr. On such systems there's a dlopen fallback path which needs to be updated for this.
I have not tried with memory that has already been locked, but I will. In any case, with this patch, if locking fails, then we revert to malloc+lock+unlock+free. This is not ideal, and this case is added for other reasons, but it should be supporting the case.
It will be better skipping lock/free if the memory is known to HSA already. I think IBM XL skips its pinned memory optimization when it sees the pointer pinned already for CUDA.
I have code managing lock/unlock via HIP. Even if a lock call from the plugin succeeds, and then a plugin unlock call succeeds, the user unlock call fails.
For this reason, check memory info is required.
fallback to "malloc+lock+unlock+free" is the worst option.
The following test works for me and it does not fall into the malloc+lock+unlock_free path. So user locking/unlocking and runtime locking/unlocking of the same pointer is not an issue for AMD HSA, according to this test.
#include<stdio.h> #include<omp.h> #include<hsa/hsa.h> #include <hsa/hsa_ext_amd.h> #define N 100293 int main() { int n = N; int *a = new int[n]; int *a_locked = nullptr; hsa_status_t herr = hsa_amd_memory_lock(a, n*sizeof(int), nullptr, 0, (void **)&a_locked); if (herr != HSA_STATUS_SUCCESS) { printf("Locking failed\n"); return 1; } #pragma omp target parallel for map(tofrom:a_locked[:n]) for(int i = 0; i < n; i++) a_locked[i] = i; herr = hsa_amd_memory_unlock(a); if (herr != HSA_STATUS_SUCCESS) { printf("Unlocking failed\n"); return 1; } int err = 0; for(int i = 0; i < n; i++) if (a[i] != i) { err++; printf("Err at %d, got %d expected %d\n", i, a[i], i); if (err >10) break; } delete[] a; return err; }
@ye-luo can you please share a minimal test that is failing for you? Thanks!
I failed to verify your first lock behaves as intended.
#include <hsa/hsa.h> #include <hsa/hsa_ext_amd.h> #include <omp.h> #include <stdio.h> #define N 100293 int checkLocked(void *ptr) { hsa_amd_pointer_info_t info; hsa_status_t herr; herr = hsa_amd_pointer_info(ptr, &info, NULL, NULL, NULL); if (herr != HSA_STATUS_SUCCESS) { printf(" hsa_amd_pointer_info failed\n"); return 1; } if (info.type != HSA_EXT_POINTER_TYPE_LOCKED) { printf(" pointer is noooooooooooot locked\n"); return 1; } else printf(" pointer is locked\n"); return 0; } int main() { int n = N; int *a = new int[n]; for (int i = 0; i < n; i++) a[i] = 0; int *a_locked = nullptr; hsa_status_t herr = hsa_amd_memory_lock(a, n * sizeof(int), nullptr, 0, (void **)&a_locked); if (herr != HSA_STATUS_SUCCESS) { printf("Locking failed\n"); return 1; } checkLocked(a); #pragma omp target parallel for map(tofrom : a_locked[:n]) for (int i = 0; i < n; i++) a_locked[i] = i; herr = hsa_amd_memory_unlock(a); if (herr != HSA_STATUS_SUCCESS) { printf("Unlocking failed\n"); return 1; } int err = 0; for (int i = 0; i < n; i++) if (a[i] != i) { err++; printf("Err at %d, got %d expected %d\n", i, a[i], i); if (err > 10) break; } delete[] a; return err; }
I got failure at the first check with "hsa_amd_pointer_info failed". Could you take a look?
Thanks for the test. This works for me and this is what I get:
pointer is locked
I believe that is what you expect?
Tracing shows we are running on the gpu correctly:
export LIBOMPTARGET_KERNEL_TRACE=2
./user_memory_locks
pointer is locked
DEVID: 0 SGN:2 ConstWGSize:256 args: 2 teamsXthrds:( 1X 256) reqd:( 1X 0) lds_usage:11304B sgpr_count:39 vgpr_count:22 sgpr_spill_count:0 vgpr_spill_count:0 tripcount:0 n:__omp_offloading_fd00_5882c9d_main_l43
In this run, I am using the latest of trunk with rocm 4.5 installed on the machine. GPU is a gfx90a.
I know what happened to my machine. Some CMake change caused offload plugins are not compiled. Sign. broken upstream.
My intention is to check pinned status. Before the first lock(not pinned), after the first lock(pinned), after the offload region(pinned), after the unlock(unpinned).
Could you also verify with rocprof hsa trace that the lock and unlock are both called twice?
That makes sense.
I ran it with gdb (running with debug symbols for impl/impl.cpp in the plugin) and all calls to memory_lock/unlock return success.
I am now expanding dynamic_hsa to include the missing calls - following @JonChesterfield suggestions.
Thanks!
[OpenMP] Add missing hsa declarations/definitions when building runtime without rocr (or hsa library) installed on the system
Would this be cmake failed to find libelf and thus didn't build the plugin? I think that's the symptom of our CI at present
I noticed the above test code does
#pragma omp target parallel for map(tofrom : a_locked[:n])
So it is not testing pointer a being locked by user and then again by openmp.
^ @carlo.bertolli please could you add the case that does the (extra) lock explicitly to the libomptarget tests?
template <CopyDirection Dir>? Means we can static assert that it was one of H2D or D2H and lose the default: clause in the switch, e.g.
Or maybe
since most of the arguments are the same in each case