Currently, the only way to obtain pinned memory with libomptarget is to use the custom allocator llvm_omp_target_alloc_host. This reflects the CUDA implementation of libomptarget well, but it does not correctly expose the AMDGPU runtime API, where any system-allocated page can be locked/unlocked through a call to hsa_amd_memory_lock/unlock.
This patch enables users to allocate memory through malloc (mmap, sbrk) and then pin the related memory pages with a special libomptarget call. It adds base support to the amdgpu libomptarget plugin so that users can prelock their host memory pages and the runtime does not need to lock them itself for asynchronous memory transfers.
It's interesting that locking locked memory succeeds but doesn't give you something that has to be unlocked twice. I'm not totally sure about the convenience vs. error-detection trade-off there. What does the proposed user-facing interface look like?
openmp/libomptarget/plugins/amdgpu/impl/impl.cpp | ||
---|---|---|
27 | Maybe return false here, "on failure return is locked" reads the opposite of the semantics | |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
2716 | could we keep this pattern on the implementation? if something goes wrong, return nullptr, as opposed to passing pointers to pointers that are sometimes assigned |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
2707 | Could we add both __tgt_rtl_data_lock and __tgt_rtl_data_unlock declaration to the include/omptargetplugin.h header? |
As in the example:
llvm_omp_target_lock_mem(locked, n * sizeof(int), omp_get_default_device());
llvm_omp_target_unlock_mem(locked, omp_get_default_device());
The parent patch of this one also implements OpenMP traits for pinned memory, and calls kmp_target_lock_mem and kmp_target_unlock_mem to implement the trait on malloc'ed memory. These will map to the targetLockExplicit and targetUnlockExplicit calls in this patch, once the two patches are in.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
2716 | Great catch, thanks so much for the input! |
openmp/libomptarget/plugins/amdgpu/impl/impl.cpp | ||
---|---|---|
28 | You can remove else since the if branch has a return. | |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
1826 | I think the 4th argument is num_agent. Please add it as a comment. In addition, is it always 0? | |
openmp/libomptarget/src/omptarget.cpp | ||
461 | What is this lock protecting? It appears PM->Devices. If that's so, why are accesses such as PM->Devices[DeviceNum] unprotected? Both a few lines down and in targetLockExplicit(). |
[OpenMP][libomptarget][AMDGPU] Apply requested changes and merge against trunk.
openmp/libomptarget/src/omptarget.cpp | ||
---|---|---|
461 | According to deviceIsReady in device.cpp, device size can only change while registering a new runtime lib. If we have enough devices to cover for the device_num passed in by the API caller, then we know that there will always be an RTL object corresponding to that device, so we don't need to lock/unlock again because we know that there is an object that can be dereferenced at device_num position in the array PM->Devices. |
openmp/libomptarget/src/omptarget.cpp | ||
---|---|---|
461 | The problem is that it's a vector: adding elements can cause reallocation, which will race with PM->Devices[DeviceNum]. We should hold the lock until we have the device, here and elsewhere. This is really only an issue when new runtime libs are added, so locking longer will not affect performance. |
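The pattern the reviewer describes can be sketched as follows. This is a minimal self-contained illustration, not the actual libomptarget code: PluginManagerTy, RTLsMtx, and getDevice are stand-ins for the real plugin-manager structures. The point is that the mutex is held across both the size check and the element access, so a concurrent push_back cannot invalidate the lookup.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a device record in PM->Devices.
struct DeviceTy {
  int DeviceID;
};

// Hypothetical stand-in for the plugin manager: the vector may grow while a
// new runtime library registers, and growth can reallocate the storage.
struct PluginManagerTy {
  std::mutex RTLsMtx;
  std::vector<std::unique_ptr<DeviceTy>> Devices;

  // Hold the lock across both the bounds check and the element access. The
  // returned pointer stays valid after the lock is dropped because the
  // vector stores owning pointers, not the DeviceTy objects themselves.
  DeviceTy *getDevice(size_t DeviceNum) {
    std::lock_guard<std::mutex> Guard(RTLsMtx);
    if (DeviceNum >= Devices.size())
      return nullptr;
    return Devices[DeviceNum].get();
  }
};
```

Once the DeviceTy pointer is obtained, the lock can be dropped, matching the later review comment that only the vector access itself needs protection.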
[OpenMP][libomptarget][AMDGPU] Add lock/unlock to prevent races on Devices vector in plugin manager.
openmp/libomptarget/src/omptarget.cpp | ||
---|---|---|
461 | Agreed, I've added locks for the Devices vector access. I will also write a new patch that does the same in all uses of that vector to prevent races. |
Lots of nits. I think they can be addressed pre-merge. LG
openmp/libomptarget/include/omptargetplugin.h | ||
---|---|---|
206 | keep it consistent, int32_t ID, also below. | |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
1837 | I doubt we need the check here but I don't mind keeping it (in the old plugin). | |
openmp/libomptarget/src/api.cpp | ||
84 | Mark the result as nodiscard to ensure people don't assume ptr is now locked. | |
openmp/libomptarget/src/omptarget.cpp | ||
429 | Nit: Style. | |
451 | Technically you only need to get the Device pointer, then you can drop the lock. Not that it should ever matter much. | |
482 | Same as above. | |
openmp/libomptarget/src/private.h | ||
56 | Nit: Style | |
openmp/libomptarget/test/mapping/prelock.cpp | ||
29 | you need to use the result. |
openmp/libomptarget/include/omptargetplugin.h | ||
---|---|---|
209 | We could return an error (int32_t) like in the rest of the plugin API functions. |
All the lock functions should return pointers via an argument.
All the lock and unlock functions should return error codes.
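Combining this with the earlier request to declare the entry points in include/omptargetplugin.h, the shape being asked for might look like the sketch below. The bodies here are stubs (a real plugin would call into the AMDGPU runtime); only the signatures — int32_t error codes, locked pointer through an out parameter — reflect the review requests, and OFFLOAD_SUCCESS/OFFLOAD_FAIL are the usual libomptarget conventions.

```cpp
#include <cassert>
#include <cstdint>

#define OFFLOAD_SUCCESS 0
#define OFFLOAD_FAIL 1

// Sketch of the requested plugin API shape: return an error code, and pass
// the locked (agent-accessible) pointer back via an out parameter.
int32_t __tgt_rtl_data_lock(int32_t DeviceId, void *HostPtr, int64_t Size,
                            void **LockedPtr) {
  if (!HostPtr || Size <= 0 || !LockedPtr)
    return OFFLOAD_FAIL;
  *LockedPtr = HostPtr; // Stub: a real plugin would return the agent base address.
  return OFFLOAD_SUCCESS;
}

int32_t __tgt_rtl_data_unlock(int32_t DeviceId, void *HostPtr) {
  return HostPtr ? OFFLOAD_SUCCESS : OFFLOAD_FAIL;
}
```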
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
2633 | Apart from returning, there is no error handling. The return type should be changed. |
openmp/libomptarget/test/mapping/prelock.cpp | ||
29 | Strange. Try to lock a nullptr? |
39 | Strange. Try to unlock a nullptr? |
[OpenMP][libomptarget][AMDGPU] Address comments and use prelocked pointers in map clause implementation.
Thanks for your patience with this patch update. I hit a problem with the is_locked function when the prelocked pointer is passed to the map clause: the address offset calculation was based on the system-allocator (e.g., malloc) pointer, but it did not work when the agentBasePointer (the locked pointer) was passed in. Fixed now.
I only added return codes to the plugin API, not to the user-level API. I could not find another example where we do that, and I thought that breaking with the convention was not a good thing. Let me know if you think the user-level APIs should also return an error code and place the locked pointer in a parameter.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
1837 | Checking if something is locked costs less than *locking* it. Because of that logic, we kept it the same way for *unlocking* here, but I have not run any tests to prove this. For *locking*, if I remember correctly, there was an order of magnitude difference between checking if locked and locking, so it is definitely worth doing. |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
2628 | Not HostPtr? |
[OpenMP][libomptarget][AMDGPU] Addressed comments.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
2628 | Good catch. I changed it in the _lock function as well. | |
openmp/libomptarget/test/mapping/prelock.cpp | ||
29 | You can do either and it will work (this, incidentally, unveiled a bug in the support that took some time to find). |
I still don't feel safe about the amd plugin implementation:
- how is_locked handles errors.
- lock_memory doesn't return an error code.
- lock_memory and unlock_memory behavior is not symmetric.
But that is only inside the plugin.
llvm_omp_target_lock_mem
llvm_omp_target_unlock_mem
are not well specified.
Once the renaming happens and the tests get fixed, it will be OK to merge.
openmp/libomptarget/include/omptargetplugin.h | ||
---|---|---|
206 | TgtPtr -> HostPtr. | |
210 | TgtPtr -> HostPtr. | |
openmp/libomptarget/plugins/amdgpu/impl/impl_runtime.h | ||
17 | The return value can only be meaningful if err_p is a success. | |
openmp/libomptarget/test/mapping/prelock.cpp | ||
22 | Better call it host_ptr; locked or not is a state. |
Agreed, it can be improved. Any suggestions?
- lock_memory doesn't return error code
The __tgt_rtl_target_data_lock and unlock do have an error code now. I am not propagating it beyond libomptarget. Are you suggesting that I should?
Happy to do that, just trying to get what you think is best.
- lock_memory and unlock_memory behavior is not symmetric.
In what sense?
But that is only inside the plugin.
llvm_omp_target_lock_mem
llvm_omp_target_unlock_mem
are not well specified.
Because of the missing error? Happy to add that.
Once the renaming happens and the tests get fixed, it will be OK to merge.
I already changed the tgt parameter names to hst. Perhaps you are looking at an intermediate version.
Thanks!
openmp/libomptarget/test/mapping/prelock.cpp | ||
---|---|---|
29 | I don't believe llvm_omp_target_lock_mem is well defined when OMP_TARGET_OFFLOAD=disabled. |
Please fix the merge errors in is_locked
openmp/libomptarget/plugins/amdgpu/impl/impl_runtime.h | ||
---|---|---|
17 | That's a scary interface choice. It reads as the comment above, but actually it returns false when things go wrong and unconditionally writes success through the out parameter, throwing away the actual return code. And then the call sites pass in a value and ignore the result anyway. Perhaps the result of git merge blowing up on you? |
lock_memory and unlock_memory not symmetric.
I mean
lock_memory(ptr) # lock happens
lock_memory(ptr) # no-op
unlock_memory(ptr) # unlock happens
unlock_memory(ptr) # no-op
I just feel unsafe. I'd feel safer with:
lock_memory(ptr) # lock happens
lock_memory(ptr) # no-op
unlock_memory(ptr) # no-op
unlock_memory(ptr) # unlock happens
but this requires doing reference counting.
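The reference-counted behavior sketched above can be made concrete with a small self-contained wrapper (class and method names here are hypothetical, not part of the patch). The wrapper only decides *when* the underlying runtime pin/unpin must actually happen: the first lock pins, repeated locks bump a counter, and only the matching final unlock unpins.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Hypothetical sketch of per-pointer reference counting for lock/unlock.
// Returns from lock()/unlock() tell the caller whether the underlying
// runtime operation (e.g. the HSA pin/unpin) should actually be performed.
class PinRefCounter {
  std::mutex Mtx;
  std::unordered_map<void *, size_t> RefCount;

public:
  // True only on the first lock of this pointer: that is when the real pin
  // must happen. Subsequent locks are counted no-ops.
  bool lock(void *Ptr) {
    std::lock_guard<std::mutex> Guard(Mtx);
    return ++RefCount[Ptr] == 1;
  }

  // True only when the count drops back to zero: that is when the real
  // unpin must happen. Unbalanced unlocks are rejected as no-ops.
  bool unlock(void *Ptr) {
    std::lock_guard<std::mutex> Guard(Mtx);
    auto It = RefCount.find(Ptr);
    if (It == RefCount.end())
      return false; // Unbalanced unlock: nothing to do.
    if (--It->second > 0)
      return false; // Still referenced elsewhere.
    RefCount.erase(It);
    return true;
  }
};
```

This gives exactly the safer interleaving listed above: lock/lock/unlock/unlock performs the pin on the first call and the unpin on the last. Partial overlaps between buffers would additionally need page-granularity bookkeeping, which this sketch does not attempt.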
openmp/libomptarget/plugins/amdgpu/impl/impl.cpp | ||
---|---|---|
15 | hsa_status_t is_locked(void *ptr, void **agentBaseAddress) Only read agentBaseAddress value if the return is a success. | |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
1819 | hsa_status_t lock_memory(void *mem, size_t size, **locked_ptr) |
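The convention suggested in these comments can be illustrated with a tiny self-contained example (StatusTy and queryLockedBase are made-up stand-ins, not the HSA API): the error code travels through the return value, and the out parameter is written only on success, so a caller can never consume a stale value after a failure.

```cpp
#include <cassert>

// Made-up status type standing in for hsa_status_t.
enum StatusTy { Success = 0, Error = 1 };

// Return the status; touch the out parameter only when the query succeeds.
// (The body is a placeholder for a real lookup such as hsa_amd_pointer_info.)
StatusTy queryLockedBase(bool SimulateFailure, void *HostPtr,
                         void **AgentBaseAddress) {
  if (SimulateFailure)
    return Error; // *AgentBaseAddress intentionally left untouched.
  *AgentBaseAddress = HostPtr; // Placeholder for the real agent base address.
  return Success;
}
```

Callers then check the status before reading the out parameter, which avoids the "unconditionally writes success and throws away the actual return code" hazard described above.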
OK, I failed to parse it. One of the three call sites does something with the return code, and it is set by something that reads like a variable declaration. So maybe not a merge misfire.
^ I share that concern about lock/unlock silently succeeding in some interleavings
openmp/libomptarget/plugins/amdgpu/impl/impl.cpp | ||
---|---|---|
20 | Oh, I missed this call as it looks like a variable declaration. So we do return different values through the out parameter, we just don't do anything with them. | |
87 | Here we return the err from is_locked | |
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
1823 | Here we discard it | |
1836 | Here we discard it |
hsa_amd_memory_lock gives the corresponding region coarse-grained semantics: no update is visible until after a kernel exits. Is that right for this?
Specifically if someone mmaps some host address space then calls this on it, it'll succeed, but if they want to see the results of atomic operations on that memory while the kernel is running, they won't.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
1826 | I think this means lock for all GPUs. It's a deprecated interface, though. There are some calls around memory pools that offer more control, but they're harder to use. |
I've just tested this in a pure C++ (no OpenMP) program using ROCr directly.
Here's the result of the operations above (so I am not sure the second lock is a no-op at the ROCr level):
a is pinned after first lock
a is pinned after second lock
a is pinned after first unlock
a is not pinned after second unlock
Code (please fill the void as necessary, or I can provide actual file if you want to test locally):
double *a = (double *)aligned_alloc(4096, n * sizeof(double));
double *b = (double *)aligned_alloc(4096, n * sizeof(double));
double *pinned_a = nullptr;
err = hsa_amd_memory_lock((void *)a, n * sizeof(double), nullptr, 0,
                          (void **)&pinned_a);
CHECK(err);
{
  hsa_amd_pointer_info_t info;
  info.size = sizeof(hsa_amd_pointer_info_t);
  err = hsa_amd_pointer_info(a, &info, nullptr, nullptr, nullptr);
  CHECK(err);
  bool already_pinned = (info.type == HSA_EXT_POINTER_TYPE_LOCKED);
  if (already_pinned)
    printf("a is pinned after first lock\n");
  else
    printf("a is not pinned after first lock\n");
}
double *repinned_a = nullptr;
err = hsa_amd_memory_lock((void *)a, n * sizeof(double), nullptr, 0,
                          (void **)&repinned_a);
CHECK(err);
{
  hsa_amd_pointer_info_t info;
  info.size = sizeof(hsa_amd_pointer_info_t);
  err = hsa_amd_pointer_info(a, &info, nullptr, nullptr, nullptr);
  CHECK(err);
  bool already_pinned = (info.type == HSA_EXT_POINTER_TYPE_LOCKED);
  if (already_pinned)
    printf("a is pinned after second lock\n");
  else
    printf("a is not pinned after second lock\n");
}
hsa_amd_memory_unlock(pinned_a);
{
  hsa_amd_pointer_info_t info;
  info.size = sizeof(hsa_amd_pointer_info_t);
  err = hsa_amd_pointer_info(a, &info, nullptr, nullptr, nullptr);
  bool already_pinned = (info.type == HSA_EXT_POINTER_TYPE_LOCKED);
  if (already_pinned)
    printf("a is pinned after first unlock\n");
  else
    printf("a is not pinned after first unlock\n");
}
{
  hsa_amd_pointer_info_t info;
  info.size = sizeof(hsa_amd_pointer_info_t);
  hsa_amd_memory_unlock(a);
  err = hsa_amd_pointer_info(a, &info, nullptr, nullptr, nullptr);
  bool already_pinned = (info.type == HSA_EXT_POINTER_TYPE_LOCKED);
  if (already_pinned)
    printf("a is pinned after second unlock\n");
  else
    printf("a is not pinned after second unlock\n");
}
This looks like what you want, right?
openmp/libomptarget/plugins/amdgpu/impl/impl.cpp | ||
---|---|---|
20 | Correct, here's the description in the header file: I *believe* the size of the info struct could be smaller than what we use on older runtimes and/or gpus. I will update as necessary. Thanks for the catch. |
I do not *think* hsa_amd_memory_lock changes granularity. There is a call for that, but it is not the locking one.
In general, up to OpenMP 5.2, there is no system-scope atomic available in the language. That makes the fine/coarse-grain memory distinction irrelevant here. We will need to worry about it starting with OpenMP 6.0 TR1, I believe.
Specifically if someone mmaps some host address space then calls this on it, it'll succeed, but if they want to see the results of atomic operations on that memory while the kernel is running, they won't.
Correct, and in OpenMP there is no synchronization available during kernel execution, only at kernel boundaries. In any case, this is not something that "locking" host memory has anything to do with.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
1826 | You caught me on this one :-) |
There's a lock-memory-to-pool call which inherits the properties of the pool. That can be used to get fine- or coarse-grained synchronization on an mmapped host pointer. Maybe OpenMP not having atomics means we can ignore it, though if people use OpenCL-style atomic intrinsics in kernels today, they work.
Specifically if someone mmaps some host address space then calls this on it, it'll succeed, but if they want to see the results of atomic operations on that memory while the kernel is running, they won't.
Correct, and in OpenMP there is no synchronization available during kernel execution, only at kernel boundaries. In any case, this is not something that "locking" host memory has anything to do with.
Huh. I did not realise this. Then given this is an OpenMP call, and OpenMP can't write code which can tell the difference, coarse grain seems fine.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
1826 | That would be a regression in rocr - nullptr meaning all seems like a useful capability. I'd have guessed the right thing to do there is talk rocr's maintainers out of breaking backwards compatibility. Otherwise yeah, we can stash all agents in the system in an array and pass it in to various calls. Probably all CPU agents as well as all GPU agents. On the other hand, this is an interface intended to let people write longer but faster code. Perhaps it should be taking a device id and only locking it for that device. That seems likely to be faster than locking it for all of them. |
[OpenMP][libomptarget][AMDGPU] Address comments, including fix in error handling in is_locked function.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp | ||
---|---|---|
1826 | DeviceNum is already part of the API. I changed the implementation to use the related HSA agent. We are good. Thanks for noticing this. | |
openmp/libomptarget/test/mapping/prelock.cpp | ||
29 | Good point. Looking at other API's in api.cpp, none of them seems to be well defined for this case. Happy to take care of this if there is a suggestion here. Both pointers work fine. |
Hi @carlo.bertolli, I am getting the following error in my local build after this commit:
/mybuild/openmp/libomptarget/plugins/amdgpu/impl/impl.cpp:17:3: error: ‘hsa_amd_pointer_info_t’ was not declared in this scope; did you mean ‘hsa_amd_agent_info_t’?
   17 |   hsa_amd_pointer_info_t info;
      |   ^~~~~~~~~~~~~~~~~~~~~~
      |   hsa_amd_agent_info_t
Is there a missing definition in hsa_ext_amd.h?
Hi Carlo,
Thank you for the prompt response! Direct push works for me. I suppose posting a message in https://reviews.llvm.org/D139208 with the new commit id should be enough to collect any post-commit review comments for the new code.
Thanks,
Slava
I think that's because we (people without an AMD card) are all building the dynamic AMD plugin, which has all the definitions of AMD HSA-related stuff. That one might be out of date. Please update it accordingly.
Pushed a fix to the non-amdgpu build:
7928d9e12d47fcc226d0c6984e11f5f463670f4a
Tested on machine without ROCm installation or AMDGPU.
Hi Slava, Shilei
I've just pushed a fix, tested against a machine without amdgpu's nor rocm. Please let me know if that fixes it for you:
7928d9e12d47fcc226d0c6984e11f5f463670f4a
Thanks for the notification+patience
- Carlo
@carlo.bertolli is there documentation regarding llvm_omp_target_lock_mem and llvm_omp_target_unlock_mem? While reading the comments on this patch, I see that the API ensures that locked areas feature reference counting. But I have the following doubts: should the API support locking memory buffers that were already locked (e.g., with complete or partial overlap)? What's the behavior in such a case?
I didn't add documentation for those two APIs, but indeed I should have. Is an API description in the code enough, or is there a better/different place for it?
According to the ROCr implementation of lock/pin and unlock/unpin:
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/inc/hsa_ext_amd.h#L1525
double locking just keeps things locked. There is a comment about overlapping pages for two calls to lock that leaves the pages locked.
I don't think we need to report double locks if ref counting works correctly, and I do think we should allow double locking.
My two cents. Thanks for the question!