This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][libomptarget] Implement memory lock/unlock API in NextGen plugins
ClosedPublic

Authored by kevinsala on Jan 8 2023, 8:57 AM.

Details

Summary

This patch implements the memory lock/unlock API, introduced in patch https://reviews.llvm.org/D139208, in the NextGen plugins.

The patch also re-organizes the map of host pinned allocations in PluginInterface.

Diff Detail

Event Timeline

kevinsala created this revision.Jan 8 2023, 8:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 8 2023, 8:57 AM
kevinsala requested review of this revision.Jan 8 2023, 8:57 AM

https://github.com/openucx/ucx/blob/master/src/ucs/datastruct/pgtable.h is a super-optimized page table for a similar use case.

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
295

I would return std::optional<const EntryTy*> to avoid using nullptr to model failure.

328

I would return std::optional<void *> to avoid passing mutable pointers.

https://github.com/openucx/ucx/blob/master/src/ucs/datastruct/pgtable.h is a super-optimized page table for a similar use case.

That's interesting. We have another use inside of libomptarget where we need to determine if a pointer lies inside of an already mapped memory region. We could profile it and see if we could get some better performance with a more optimal implementation.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1832–1833

Nit: why not one line?

jdoerfert added inline comments.Jan 8 2023, 11:51 AM
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
295

Why would we want that? It seems to me null is a fine "didn't find the entry" response. We don't need to box things just to have boxed them.

328

This makes more sense. Though, again, why the boxing? Returning the devptr content avoids mutable arguments just fine.

tschuett added inline comments.Jan 8 2023, 12:12 PM
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
295

Because these are C idioms in C++ code. LLVM is inconsistent about whether returning true or false denotes success. The boxing overhead is negligible. The intent of your APIs is much clearer. tryFindIntersecting is a fallible API. `optional` is a great tool for fallible APIs.

jdoerfert added inline comments.Jan 8 2023, 12:46 PM
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
295
  • Pointers/nullptr is arguably a C++ concept, we don't need to box things because C++ supports objects.
  • I didn't argue about overheads, no need to start that discussion.
  • There is no true or false return here so no confusion, it's EntryTy * that is nullptr or not.
  • The API intent is to find an EntryTy; nullptr is arguably not an EntryTy object, hence none was found. How much clearer would it get if we checked an optional instead? Further, what would the engaged-but-nullptr case even mean?
  • As far as I can see, our coding standards page does not include the word optional. If you want to have this as part of the canonical design, feel free to write an RFC. In the nextgen plugins we only use optional in the JIT, hence this is not an outlier.

https://github.com/openucx/ucx/blob/master/src/ucs/datastruct/pgtable.h is a super-optimized page table for a similar use case.

The license doesn't look like something we could copy. If there is a paper or similar, we could reimplement it, I assume.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1829

Brief documentation seems to be used elsewhere; add it here and below.

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
328

And remove the const cast at the use site. Cast it here if necessary.

jdoerfert added inline comments.Jan 8 2023, 1:07 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1912

It's not const. Change isHostPinnedBuffer to getHostPinnedBuffer and assign it in the conditional.

openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
498
kevinsala updated this revision to Diff 487238.Jan 8 2023, 2:46 PM

Fixing issues and format. Still missing registering the host memory in the CUDA plugin.

kevinsala marked 5 inline comments as done.Jan 8 2023, 2:50 PM
kevinsala added inline comments.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1912

Done, however, I renamed it to getDevicePtrFromPinnedBuffer() since it returns the device pointer.

ye-luo added inline comments.Jan 8 2023, 2:55 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1912

DevicePtr usually refers to memory on the device. I would call it DeviceAccessiblePtr.

jdoerfert accepted this revision.Jan 8 2023, 3:32 PM

LG, with 3 nits:

  • Ye's comment.
  • The one below.
  • @carlo.bertolli needs to modify the plugin interface, and if we land this first we need to remember to change it as the other patch makes it in. Otherwise, we can wait with this one.
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1907–1910
This revision is now accepted and ready to land.Jan 8 2023, 3:32 PM

I changed the old plugin interface for tgt_rtl_data_lock to return an error code. It now returns the lockedptr as function argument. Let me know if this is not what was called for.
Thanks for this extension!

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp
584–585

Just a nit: for AMDGPU we don't need to keep a table of locked pointers. This is already done by ROCr. I would consider making this optional for AMDGPU.

kevinsala marked an inline comment as done.Jan 10 2023, 10:37 AM
kevinsala added inline comments.
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp
584–585

You're right. But for the moment, we want to keep this information cached in a map at the plugin level. This way, we can access the pointer info faster and keep track of the pinned memory buffers that are OpenMP-related.

Fixing pending comments, updating code documentation, and including API changes in https://reviews.llvm.org/D139208.

kevinsala marked 3 inline comments as done.Jan 10 2023, 11:10 AM

Still missing the pin/unpin calls in the CUDA plugin; they will be added in the next update of this patch.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1912

Just renamed the function and related code documentation.

What is locked memory, and could we define it in a comment/documentation somewhere if it isn't already?

Is it the same as mmap LOCKED? That plus extra?

Does it imply GPUs can read/write it? If so, can they read/write within a kernel execution, or can we implement locked memory by copy at the start and end of the execution?

What is locked memory, and could we define it in a comment/documentation somewhere if it isn't already?

Is it the same as mmap LOCKED? That plus extra?

Does it imply GPUs can read/write it? If so, can they read/write within a kernel execution, or can we implement locked memory by copy at the start and end of the execution?

Good question!
I always refer back to this article for a definition of locked/pinned:
https://lwn.net/Articles/600502/
but it might be stale, as my OS knowledge might be.
Aside from what I know or don't know about the OS, this is a kernel concept that we are reflecting in the plugins.
In general, it is a concept used by CUDA and HSA, so perhaps we can refer back to those?

We care about locking host memory because, roughly speaking, hsa_amd_memory_async_copy (asynchronous H2D and D2H memory copy) has a fast path for prelocked/pinned memory.
As locking memory is expensive, putting some "space" between lock operations, memory copies, and unlock operations is an effective optimization strategy. That's why we are giving users an API to prelock pointers when they know they can, and we avoid relocking them in the plugins.
I hope this helps.

What is locked memory, and could we define it in a comment/documentation somewhere if it isn't already?

Is it the same as mmap LOCKED? That plus extra?

Does it imply GPUs can read/write it? If so, can they read/write within a kernel execution, or can we implement locked memory by copy at the start and end of the execution?

I forgot to answer this: a locked pointer is passed a set of agents that can access it. It is best described as an "agent-accessible pointer". In the "first gen" plugin, we use the agent-accessible pointer (obtained via the call to lock) to perform a memory_async_copy, passing in the GPU agent involved in the copy.

Good question!
I always refer back to this article for a definition of locked/pinned:
https://lwn.net/Articles/600502/
but it might be stale, as my OS knowledge might be.
Aside from what I know or don't know about the OS, this is a kernel concept that we are reflecting in the plugins.
In general, it is a concept used by CUDA and HSA, so perhaps we can refer back to those?

We care about locking host memory because, roughly speaking, hsa_amd_memory_async_copy (asynchronous H2D and D2H memory copy) has a fast path for prelocked/pinned memory.
As locking memory is expensive, putting some "space" between lock operations, memory copies, and unlock operations is an effective optimization strategy. That's why we are giving users an API to prelock pointers when they know they can, and we avoid relocking them in the plugins.
I hope this helps.

For HSA we can have either coarse-grain or fine-grain semantics on a locked pointer, depending on which HSA calls we make. That surfaces a user-visible distinction: if their kernel writes to this memory, can the host see that write while the kernel executes? I don't know which answer is correct, but I would like us to choose one and document it, and probably have the same behaviour on nvptx.

Ping. What is left before merging?

JonChesterfield added a comment.EditedJan 21 2023, 8:39 AM

I think it's good to go. If this is something you've tried in QMCPACK, so we have a real-world example of it improving things, then even better.

I think it's good to go. If this is something you've tried in QMCPACK, so we have a real-world example of it improving things, then even better.

QMCPACK doesn't allocate pinned memory via OpenMP, so this patch doesn't immediately benefit QMCPACK. This patch lays the foundation for further optimization, and the 16 release branch will be created very soon, so I'm pushing to merge this in time.

If you don't mind keeping an eye on CI and reverting on breakage, feel free to land it; I'm away from my desk.

This patch needs a rebase. Applying patch failed.

kevinsala marked an inline comment as done.Jan 21 2023, 1:41 PM

I'll update the patch today. I'm improving it to feature ref counting and certain overlapping of locked areas (partially overlapping an already locked area by extending it is forbidden).

kevinsala updated this revision to Diff 491148.Jan 22 2023, 4:48 AM

Rebasing, adding support for ref counting on locked buffers, and allowing certain overlapping. Given an already locked buffer A, other buffers fully contained inside A can be locked, even if they are smaller than A. Extending an existing locked buffer is not allowed. The original region is unlocked once all its users have released the locked buffer and sub-buffers.

ye-luo accepted this revision.Jan 23 2023, 1:23 PM

Ref-counting lock/unlock makes a lot of sense.

If this fixes these tests failing with the nextgen plugins, please remove the line disabling the nextgen plugins in the test.

Failed Tests (2):
  libomptarget :: amdgcn-amd-amdhsa :: mapping/prelock.cpp
  libomptarget :: amdgcn-amd-amdhsa-LTO :: mapping/prelock.cpp

If this fixes these tests failing with the nextgen plugins, please remove the line disabling the nextgen plugins in the test.

Failed Tests (2):
  libomptarget :: amdgcn-amd-amdhsa :: mapping/prelock.cpp
  libomptarget :: amdgcn-amd-amdhsa-LTO :: mapping/prelock.cpp

It doesn't fix these tests because they are apparently wrong. We cannot map a device-accessible buffer, which is the agent_ptr returned when locking the buffer. (More info: https://reviews.llvm.org/D142399). We can fix them using is_device_ptr instead.