Currently, the only way to obtain pinned memory with libomptarget is to use the custom allocator llvm_omp_target_alloc_host. This reflects the CUDA implementation of libomptarget well, but it does not correctly expose the AMDGPU runtime API, where any system-allocated page can be locked/unlocked through a call to hsa_amd_memory_lock/unlock.
This patch enables users to allocate memory through malloc (mmap, sbrk) and then pin the related memory pages with a special libomptarget call. It adds base support to the amdgpu libomptarget plugin so that users can prelock their host memory pages and the runtime does not need to lock them itself for asynchronous memory transfers.
It's interesting that locking locked memory succeeds but doesn't give you something that has to be unlocked twice. I'm not totally sure about the convenience vs. error-detection trade-off there. What does the proposed user-facing interface look like?
openmp/libomptarget/plugins/amdgpu/impl/impl.cpp | ||
---|---|---|
27 | Maybe return false here, "on failure return is locked" reads the opposite of the semantics | |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
2716 | could we keep this pattern on the implementation? if something goes wrong, return nullptr, as opposed to passing pointers to pointers that are sometimes assigned |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
2707 | Could we add both __tgt_rtl_data_lock and __tgt_rtl_data_unlock declaration to the include/omptargetplugin.h header? |
As in the example:
llvm_omp_target_lock_mem(locked, n * sizeof(int), omp_get_default_device());
llvm_omp_target_unlock_mem(locked, omp_get_default_device());
The parent patch of this one also implements OpenMP traits for pinned memory, and calls kmp_target_lock_mem and kmp_target_unlock_mem to implement the trait on malloc'ed memory. These will map to the targetLockExplicit and targetUnlockExplicit calls in this patch, once the two patches are in.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
2716 | Great catch, thanks so much for the input! |
openmp/libomptarget/plugins/amdgpu/impl/impl.cpp | ||
---|---|---|
28 | You can remove else since the if branch has a return. | |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
1826 | I think the 4th argument is num_agent. Please add it as a comment. In addition, is it always 0? | |
openmp/libomptarget/src/omptarget.cpp | ||
461 | What is this lock protecting? It appears PM->Devices. If that's so, why are accesses such as PM->Devices[DeviceNum] unprotected? Both a few lines down and in targetLockExplicit(). |
[OpenMP][libomptarget][AMDGPU] Apply requested changes and merge against trunk.
openmp/libomptarget/src/omptarget.cpp | ||
---|---|---|
461 | According to deviceIsReady in device.cpp, device size can only change while registering a new runtime lib. If we have enough devices to cover for the device_num passed in by the API caller, then we know that there will always be an RTL object corresponding to that device, so we don't need to lock/unlock again because we know that there is an object that can be dereferenced at device_num position in the array PM->Devices. |
openmp/libomptarget/src/omptarget.cpp | ||
---|---|---|
461 | The problem is that it's a vector: adding elements can cause reallocation, which will race with PM->Devices[DeviceNum]. We should hold the lock until we have the device, here and elsewhere. This is really only an issue when new runtime libs are added, so locking longer will not affect performance. |
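The pattern the reviewer describes can be sketched as follows. This is a minimal self-contained illustration, not the actual libomptarget code: PluginManagerTy, RTLsMtx, and getDevice are stand-ins for the real plugin-manager structures. The point is that the mutex is held across both the size check and the element access, so a concurrent push_back cannot invalidate the lookup.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a device record in PM->Devices.
struct DeviceTy {
  int DeviceID;
};

// Hypothetical stand-in for the plugin manager: the vector may grow while a
// new runtime library registers, and growth can reallocate the storage.
struct PluginManagerTy {
  std::mutex RTLsMtx;
  std::vector<std::unique_ptr<DeviceTy>> Devices;

  // Hold the lock across both the bounds check and the element access. The
  // returned pointer stays valid after the lock is dropped because the
  // vector stores owning pointers, not the DeviceTy objects themselves.
  DeviceTy *getDevice(size_t DeviceNum) {
    std::lock_guard<std::mutex> Guard(RTLsMtx);
    if (DeviceNum >= Devices.size())
      return nullptr;
    return Devices[DeviceNum].get();
  }
};
```

Once the DeviceTy pointer is obtained, the lock can be dropped, matching the later review comment that only the vector access itself needs protection.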
[OpenMP][libomptarget][AMDGPU] Add lock/unlock to prevent races on Devices vector in plugin manager.
openmp/libomptarget/src/omptarget.cpp | ||
---|---|---|
461 | Agreed, I've added locks for the Devices vector access. I will also write a new patch that does the same in all uses of that vector to prevent races. |
Lots of nits. I think they can be addressed pre-merge. LG
openmp/libomptarget/include/omptargetplugin.h | ||
---|---|---|
206 | keep it consistent, int32_t ID, also below. | |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
1837 | I doubt we need the check here but I don't mind keeping it (in the old plugin). | |
openmp/libomptarget/src/api.cpp | ||
84 | Mark the result as nodiscard to ensure people don't assume ptr is now locked. | |
openmp/libomptarget/src/omptarget.cpp | ||
429 | Nit: Style. | |
451 | Technically you only need to get the Device pointer, then you can drop the lock. Not that it should ever matter much. | |
482 | Same as above. | |
openmp/libomptarget/src/private.h | ||
56 | Nit: Style | |
openmp/libomptarget/test/mapping/prelock.cpp | ||
29 | you need to use the result. |
openmp/libomptarget/include/omptargetplugin.h | ||
---|---|---|
209 | We could return an error (int32_t) like in the rest of the plugin API functions. |
All the lock functions should return pointers via an argument.
All the lock and unlock functions should return error codes.
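Combining this with the earlier request to declare the entry points in include/omptargetplugin.h, the shape being asked for might look like the sketch below. The bodies here are stubs (a real plugin would call into the AMDGPU runtime); only the signatures — int32_t error codes, locked pointer through an out parameter — reflect the review requests, and OFFLOAD_SUCCESS/OFFLOAD_FAIL are the usual libomptarget conventions.

```cpp
#include <cassert>
#include <cstdint>

#define OFFLOAD_SUCCESS 0
#define OFFLOAD_FAIL 1

// Sketch of the requested plugin API shape: return an error code, and pass
// the locked (agent-accessible) pointer back via an out parameter.
int32_t __tgt_rtl_data_lock(int32_t DeviceId, void *HostPtr, int64_t Size,
                            void **LockedPtr) {
  if (!HostPtr || Size <= 0 || !LockedPtr)
    return OFFLOAD_FAIL;
  *LockedPtr = HostPtr; // Stub: a real plugin would return the agent base address.
  return OFFLOAD_SUCCESS;
}

int32_t __tgt_rtl_data_unlock(int32_t DeviceId, void *HostPtr) {
  return HostPtr ? OFFLOAD_SUCCESS : OFFLOAD_FAIL;
}
```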
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
2633 | Apart from returning, there is no error handling. The return type should be changed. |
openmp/libomptarget/test/mapping/prelock.cpp | ||
29 | Strange. Try to lock a nullptr? |
39 | Strange. Try to unlock a nullptr? |
[OpenMP][libomptarget][AMDGPU] Address comments and use prelocked pointers in map clause implementation.
Thanks for your patience with this patch update. I hit a problem with the is_locked function when the prelocked pointer is passed to the map clause: the address offset calculation was based on the system-allocator (e.g., malloc) pointer, but it did not work when the agentBasePointer (the locked pointer) was passed in. Fixed now.
I only added return codes to the plugin API, not to the user-level API. I could not find another example where we do that, and I thought that breaking with the convention was not a good thing. Let me know if you think the user-level APIs should also return an error code and place the locked pointer in a parameter.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
1837 | Checking if something is locked costs less than *locking* it. Because of that logic, we kept it the same way for *unlocking* here, but I have not run any tests to prove this. For *locking*, if I remember correctly, there was an order of magnitude difference between checking if locked and locking, so it is definitely worth doing. |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
2628 | Not HostPtr? |
[OpenMP][libomptarget][AMDGPU] Addressed comments.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
2628 | Good catch. I changed it in the _lock function as well. | |
openmp/libomptarget/test/mapping/prelock.cpp | ||
29 | You can do either and it will work (this, incidentally, unveiled a bug in the support that took some time to find). |
I still don't feel safe about the amd plugin implementation:
- how is_locked handles errors.
- lock_memory doesn't return an error code.
- lock_memory and unlock_memory behavior is not symmetric.
But that is only inside the plugin.
llvm_omp_target_lock_mem
llvm_omp_target_unlock_mem
are not well specified.
Once the renaming happens and the tests get fixed, it will be OK to merge.
openmp/libomptarget/include/omptargetplugin.h | ||
---|---|---|
206 | TgtPtr -> HostPtr. | |
210 | TgtPtr -> HostPtr. | |
openmp/libomptarget/plugins/amdgpu/impl/impl_runtime.h | ||
17 | The return value can only be meaningful if err_p is a success. | |
openmp/libomptarget/test/mapping/prelock.cpp | ||
22 | Better call it host_ptr; locked or not is a state. |
Agreed, it can be improved. Any suggestions?
- lock_memory doesn't return error code
The __tgt_rtl_target_data_lock and unlock do have an error code now. I am not propagating it beyond libomptarget. Are you suggesting that I should?
Happy to do that, just trying to get what you think is best.
- lock_memory and unlock_memory behavior is not symmetric.
In what sense?
But that is only inside the plugin.
llvm_omp_target_lock_mem
llvm_omp_target_unlock_mem
are not well specified.
Because of the missing error? Happy to add that.
Once the renaming happens and the tests get fixed, it will be OK to merge.
I already changed the tgt parameter names to hst. Perhaps you are looking at an intermediate version.
Thanks!
openmp/libomptarget/test/mapping/prelock.cpp | ||
---|---|---|
29 | I don't believe llvm_omp_target_lock_mem is well defined when OMP_TARGET_OFFLOAD=disabled. |
Please fix the merge errors in is_locked
openmp/libomptarget/plugins/amdgpu/impl/impl_runtime.h | ||
---|---|---|
17 | That's a scary interface choice. It reads as the comment above, but actually it returns false when things go wrong and unconditionally writes success through the out parameter, throwing away the actual return code. And then the call sites pass in a value and ignore the result anyway. Perhaps the result of git merge blowing up on you? |
lock_memory and unlock_memory not symmetric.
I mean
lock_memory(ptr) # lock happens
lock_memory(ptr) # no-op
unlock_memory(ptr) # unlock happens
unlock_memory(ptr) # no-op
I just feel unsafe. I'd feel safer with:
lock_memory(ptr) # lock happens
lock_memory(ptr) # no-op
unlock_memory(ptr) # no-op
unlock_memory(ptr) # unlock happens
but this requires doing reference counting.
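The reference-counted behavior sketched above can be made concrete with a small self-contained wrapper (class and method names here are hypothetical, not part of the patch). The wrapper only decides *when* the underlying runtime pin/unpin must actually happen: the first lock pins, repeated locks bump a counter, and only the matching final unlock unpins.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Hypothetical sketch of per-pointer reference counting for lock/unlock.
// Returns from lock()/unlock() tell the caller whether the underlying
// runtime operation (e.g. the HSA pin/unpin) should actually be performed.
class PinRefCounter {
  std::mutex Mtx;
  std::unordered_map<void *, size_t> RefCount;

public:
  // True only on the first lock of this pointer: that is when the real pin
  // must happen. Subsequent locks are counted no-ops.
  bool lock(void *Ptr) {
    std::lock_guard<std::mutex> Guard(Mtx);
    return ++RefCount[Ptr] == 1;
  }

  // True only when the count drops back to zero: that is when the real
  // unpin must happen. Unbalanced unlocks are rejected as no-ops.
  bool unlock(void *Ptr) {
    std::lock_guard<std::mutex> Guard(Mtx);
    auto It = RefCount.find(Ptr);
    if (It == RefCount.end())
      return false; // Unbalanced unlock: nothing to do.
    if (--It->second > 0)
      return false; // Still referenced elsewhere.
    RefCount.erase(It);
    return true;
  }
};
```

This gives exactly the safer interleaving listed above: lock/lock/unlock/unlock performs the pin on the first call and the unpin on the last. Partial overlaps between buffers would additionally need page-granularity bookkeeping, which this sketch does not attempt.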
openmp/libomptarget/plugins/amdgpu/impl/impl.cpp | ||
---|---|---|
15 | hsa_status_t is_locked(void *ptr, void **agentBaseAddress) Only read agentBaseAddress value if the return is a success. | |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
1819 | hsa_status_t lock_memory(void *mem, size_t size, **locked_ptr) |
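The convention suggested in these comments can be illustrated with a tiny self-contained example (StatusTy and queryLockedBase are made-up stand-ins, not the HSA API): the error code travels through the return value, and the out parameter is written only on success, so a caller can never consume a stale value after a failure.

```cpp
#include <cassert>

// Made-up status type standing in for hsa_status_t.
enum StatusTy { Success = 0, Error = 1 };

// Return the status; touch the out parameter only when the query succeeds.
// (The body is a placeholder for a real lookup such as hsa_amd_pointer_info.)
StatusTy queryLockedBase(bool SimulateFailure, void *HostPtr,
                         void **AgentBaseAddress) {
  if (SimulateFailure)
    return Error; // *AgentBaseAddress intentionally left untouched.
  *AgentBaseAddress = HostPtr; // Placeholder for the real agent base address.
  return Success;
}
```

Callers then check the status before reading the out parameter, which avoids the "unconditionally writes success and throws away the actual return code" hazard described above.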
OK, I failed to parse it. One of the three call sites does something with the return code, and it is set by something that reads like a variable declaration. So maybe not a merge misfire.
^ I share that concern about lock/unlock silently succeeding in some interleavings
openmp/libomptarget/plugins/amdgpu/impl/impl.cpp | ||
---|---|---|
20 | Oh, I missed this call as it looks like a variable declaration. So we do return different values through the out parameter, we just don't do anything with them. | |
87 | Here we return the err from is_locked | |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
1823 | Here we discard it | |
1836 | Here we discard it |
hsa_amd_memory_lock gives the corresponding region coarse-grained semantics: no update is visible until after a kernel exits. Is that right for this?
Specifically if someone mmaps some host address space then calls this on it, it'll succeed, but if they want to see the results of atomic operations on that memory while the kernel is running, they won't.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
1826 | I think this means lock for all GPUs. It's a deprecated interface, though. There are some calls around memory pools that offer more control, but they're harder to use. |
I've just tested this in a pure C++ (no OpenMP) program using ROCr directly.
Here's the result of the operations above (so I am not sure the second lock is a no-op at the ROCr level):
a is pinned after first lock
a is pinned after second lock
a is pinned after first unlock
a is not pinned after second unlock
Code (please fill the void as necessary, or I can provide actual file if you want to test locally):
double *a = (double *)aligned_alloc(4096, n * sizeof(double));
double *b = (double *)aligned_alloc(4096, n * sizeof(double));
double *pinned_a = nullptr;
err = hsa_amd_memory_lock((void *)a, n * sizeof(double), nullptr, 0,
                          (void **)&pinned_a);
CHECK(err);
{
  hsa_amd_pointer_info_t info;
  info.size = sizeof(hsa_amd_pointer_info_t);
  err = hsa_amd_pointer_info(a, &info, nullptr, nullptr, nullptr);
  CHECK(err);
  bool already_pinned = (info.type == HSA_EXT_POINTER_TYPE_LOCKED);
  if (already_pinned)
    printf("a is pinned after first lock\n");
  else
    printf("a is not pinned after first lock\n");
}
double *repinned_a = nullptr;
err = hsa_amd_memory_lock((void *)a, n * sizeof(double), nullptr, 0,
                          (void **)&repinned_a);
CHECK(err);
{
  hsa_amd_pointer_info_t info;
  info.size = sizeof(hsa_amd_pointer_info_t);
  err = hsa_amd_pointer_info(a, &info, nullptr, nullptr, nullptr);
  CHECK(err);
  bool already_pinned = (info.type == HSA_EXT_POINTER_TYPE_LOCKED);
  if (already_pinned)
    printf("a is pinned after second lock\n");
  else
    printf("a is not pinned after second lock\n");
}
hsa_amd_memory_unlock(pinned_a);
{
  hsa_amd_pointer_info_t info;
  info.size = sizeof(hsa_amd_pointer_info_t);
  err = hsa_amd_pointer_info(a, &info, nullptr, nullptr, nullptr);
  bool already_pinned = (info.type == HSA_EXT_POINTER_TYPE_LOCKED);
  if (already_pinned)
    printf("a is pinned after first unlock\n");
  else
    printf("a is not pinned after first unlock\n");
}
{
  hsa_amd_pointer_info_t info;
  info.size = sizeof(hsa_amd_pointer_info_t);
  hsa_amd_memory_unlock(a);
  err = hsa_amd_pointer_info(a, &info, nullptr, nullptr, nullptr);
  bool already_pinned = (info.type == HSA_EXT_POINTER_TYPE_LOCKED);
  if (already_pinned)
    printf("a is pinned after second unlock\n");
  else
    printf("a is not pinned after second unlock\n");
}
This looks like what you want, right?
openmp/libomptarget/plugins/amdgpu/impl/impl.cpp | ||
---|---|---|
20 | Correct, here's the description in the header file: I *believe* the size of the info struct could be smaller than what we use on older runtimes and/or gpus. I will update as necessary. Thanks for the catch. |
I do not *think* hsa_amd_memory_lock changes granularity. There is a call for that, but it is not the locking one.
In general, up to OpenMP 5.2, there is no system-scope atomic available in the language. That makes the fine/coarse-grain memory distinction irrelevant here. We will need to worry about it starting with OpenMP 6.0 TR1, I believe.
Specifically if someone mmaps some host address space then calls this on it, it'll succeed, but if they want to see the results of atomic operations on that memory while the kernel is running, they won't.
Correct, and in OpenMP there is no synchronization available during kernel execution, only at kernel boundaries. In any case, this is not something that "locking" host memory has anything to do with.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
1826 | You caught me on this one :-) |
There's a lock-memory-to-pool call which inherits the properties of the pool. That can be used to get fine- or coarse-grained synchronization on an mmapped host pointer. Maybe OpenMP not having atomics means we can ignore it, though if people use OpenCL-style atomic intrinsics in kernels today, they work.
Specifically if someone mmaps some host address space then calls this on it, it'll succeed, but if they want to see the results of atomic operations on that memory while the kernel is running, they won't.
Correct, and in OpenMP there is no synchronization available during kernel execution, only at kernel boundaries. In any case, this is not something that "locking" host memory has anything to do with.
Huh. I did not realise this. Then given this is an OpenMP call, and OpenMP can't write code which can tell the difference, coarse grain seems fine.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
1826 | That would be a regression in rocr - nullptr meaning all seems like a useful capability. I'd have guessed the right thing to do there is talk rocr's maintainers out of breaking backwards compatibility. Otherwise yeah, we can stash all agents in the system in an array and pass it in to various calls. Probably all CPU agents as well as all GPU agents. On the other hand, this is an interface intended to let people write longer but faster code. Perhaps it should be taking a device id and only locking it for that device. That seems likely to be faster than locking it for all of them. |
[OpenMP][libomptarget][AMDGPU] Address comments, including fix in error handling in is_locked function.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
1826 | DeviceNum is already part of the API. I changed the implementation to use the related HSA agent. We are good. Thanks for noticing this. | |
openmp/libomptarget/test/mapping/prelock.cpp | ||
29 | Good point. Looking at other API's in api.cpp, none of them seems to be well defined for this case. Happy to take care of this if there is a suggestion here. Both pointers work fine. |
Hi @carlo.bertolli, I am getting the following error in my local build after this commit:
/mybuild/openmp/libomptarget/plugins/amdgpu/impl/impl.cpp:17:3: error: ‘hsa_amd_pointer_info_t’ was not declared in this scope; did you mean ‘hsa_amd_agent_info_t’?
   17 |   hsa_amd_pointer_info_t info;
      |   ^~~~~~~~~~~~~~~~~~~~~~
      |   hsa_amd_agent_info_t
Is there a missing definition in hsa_ext_amd.h?
Hi Carlo,
Thank you for the prompt response! Direct push works for me. I suppose posting a message in https://reviews.llvm.org/D139208 with the new commit id should be enough to collect any post-commit review comments for the new code.
Thanks,
Slava
I think that's because we (people without an AMD card) are all building the dynamic AMD plugin, which has all the definitions of AMD HSA-related stuff. That one might be out of date. Please update it accordingly.
Pushed a fix to the non-amdgpu build:
7928d9e12d47fcc226d0c6984e11f5f463670f4a
Tested on machine without ROCm installation or AMDGPU.
Hi Slava, Shilei
I've just pushed a fix, tested against a machine without amdgpu's nor rocm. Please let me know if that fixes it for you:
7928d9e12d47fcc226d0c6984e11f5f463670f4a
Thanks for the notification+patience
- Carlo
@carlo.bertolli is there documentation regarding llvm_omp_target_lock_mem and llvm_omp_target_unlock_mem? While reading the comments on this patch, I see that the API ensures that locked areas feature reference counting. But I have the following doubts: should the API support locking memory buffers that were already locked (e.g., with complete or partial overlap)? What's the behavior in such a case?
I didn't add documentation for those two APIs, but indeed I should have. Is an API description in the code enough, or is there a better/different place for it?
According to the ROCr implementation of lock/pin and unlock/unpin:
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/inc/hsa_ext_amd.h#L1525
double locking just keeps things locked. There is a comment about overlapping pages for two calls to lock that leaves the pages locked.
I don't think we need to report double locks if ref counting works correctly, and I do think we should allow double locking.
My two cents. Thanks for the question!