
[OpenMP] Introduce target memory manager
Needs Review · Public

Authored by tianshilei1992 on Jun 2 2020, 10:12 PM.

Details

Summary

This patch introduces a target memory manager that manages target memory so
that buffers are not freed immediately once they are no longer in use, because
the overhead of memory allocation and deallocation is very large. On CUDA
devices, cuMemFree even blocks context switching on the device, which hurts
concurrent kernel execution.

The memory manager can be thought of as a memory pool. It divides the pool into
multiple buckets according to allocation size, so that allocations and frees
that land in different buckets do not affect each other.
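
A minimal sketch of such a bucket layout, with hypothetical names and fields rather than the patch's actual data structures:

#include <cstddef>
#include <list>
#include <mutex>
#include <unordered_map>

// One node per device buffer that the pool knows about.
struct NodeTy {
  void *Ptr;   // device pointer returned by the plugin
  size_t Size; // size of the device buffer in bytes
};

// Free buffers of one size class; each bucket has its own lock so traffic in
// different size classes does not serialize.
struct BucketTy {
  std::mutex Lock;
  std::list<NodeTy> FreeList; // buffers returned to the pool, ready for reuse
};

// The pool: a fixed number of power-of-two size classes plus a map from live
// device pointers back to their node, consulted when a buffer is freed.
struct MemoryPoolTy {
  static const size_t NumBuckets = 32;
  BucketTy Buckets[NumBuckets];
  std::mutex MapLock;
  std::unordered_map<void *, NodeTy> InUse;
};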

In this version, we use an exact-equality policy to find a free buffer. This is
an open question: would best-fit work better here? IMO, best-fit is not a good
choice for target memory management because computation on GPUs usually
requires GBs of data, so best-fit can waste a lot of memory. For example,
suppose there is a free buffer of 1960 MB and we now need a buffer of 1200 MB.
With best-fit, the 1960 MB buffer would be returned, wasting 760 MB.

An actual device allocation only happens when there is no suitable free buffer
left, and memory on the device is freed in the following two cases:

  1. The program ends. Obviously. There is a small problem, though: the plugin
library is destroyed before the memory manager is destroyed, so the final calls
into the target plugin will not succeed.

  2. The device is out of memory when we request new memory. The manager then
walks through all free buffers, starting from the bucket with the largest base
size, picks one buffer, frees it, and immediately retries the allocation. If
the retry succeeds, it returns right away rather than freeing every buffer in
the free lists. A sketch of this retry policy is given below.
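
As context for case 2, here is a sketch of the described retry policy, using hypothetical deviceAlloc/deviceFree stand-ins for the plugin's data_alloc/data_delete entry points (simplified, not the patch's actual code):

#include <cstddef>
#include <list>
#include <vector>

// Stand-ins for the plugin's allocation/free entry points.
void *deviceAlloc(size_t Size);
void deviceFree(void *Ptr);

struct NodeTy {
  void *Ptr;
  size_t Size;
};

// Buckets[I] holds the free buffers of the I-th size class; the largest size
// class is last.
using FreeBucketsTy = std::vector<std::list<NodeTy>>;

// Try to allocate Size bytes. On failure, release one held buffer at a time,
// starting from the bucket with the largest base size, and retry immediately
// instead of draining every free list first.
void *allocateWithEviction(FreeBucketsTy &Buckets, size_t Size) {
  if (void *Ptr = deviceAlloc(Size))
    return Ptr;

  for (auto Bucket = Buckets.rbegin(); Bucket != Buckets.rend(); ++Bucket) {
    while (!Bucket->empty()) {
      deviceFree(Bucket->back().Ptr);
      Bucket->pop_back();
      if (void *Ptr = deviceAlloc(Size))
        return Ptr;
    }
  }
  // Still out of memory after releasing everything the pool held.
  return nullptr;
}

Evicting one buffer at a time keeps the rest of the cached buffers alive whenever a single release is already enough to satisfy the request.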

Update:
A threshold has been added so that users can control which allocation sizes are
managed by the manager. A new plugin interface is also added so that a target
device can opt out of the memory manager if its device library already provides
a similar mechanism.
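
A sketch of how such a threshold could gate the manager; the function names and the environment variable below are illustrative, not the patch's actual interface:

#include <cstddef>
#include <cstdlib>

void *deviceAlloc(size_t Size);  // direct plugin allocation (stand-in)
void *managedAlloc(size_t Size); // allocation through the pool (stand-in)

// The threshold could be read once from an environment variable at startup;
// the variable name below is purely hypothetical.
size_t readThreshold() {
  if (const char *Env = std::getenv("OMP_TARGET_MEMORY_MANAGER_THRESHOLD"))
    return std::strtoull(Env, nullptr, 10);
  return 1 << 20; // some default, e.g. 1 MB
}

// Requests larger than Threshold bypass the pool; a threshold of 0 disables
// the manager entirely so every request goes straight to the device.
void *targetAlloc(size_t Size, size_t Threshold) {
  if (Threshold == 0 || Size > Threshold)
    return deviceAlloc(Size);
  return managedAlloc(Size);
}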

Diff Detail

Event Timeline


We definitely want faster memory allocation on the target. This is an interesting piece for that.

This patch implements a memory pool on top of Device->RTL->data_alloc. It's interesting that there's a performance hazard with CUDA there. The HSA layer that amdgpu builds on already has memory pools around kernel allocations, so I'm not sure this would be of direct benefit for amdgpu.

Memory allocators are notoriously difficult to implement efficiently in the face of unknown workload. Can you share benchmarks that lead to this design?

Similarly they're really easy to get wrong. There's a lot of subtle arithmetic in this patch. It would be prudent to cover this with tests, e.g. stub out the RTL->data_alloc calls for malloc/free and run targeted and fuzz tests under valgrind.

A second part of this puzzle is device side memory pools, so that malloc/free from a target region doesn't (always) have to call into the host. That may end up being quite platform dependent. That seems orthogonal to this patch.

openmp/libomptarget/src/memory.cpp
33

This maps:
0 -> 0
1 -> 1
2 -> 2
3 -> 2
4 -> 4
which is not the previous power of two. Rounding down to a power of two could be:
x < 2 ? x : 1 << (31 - __builtin_clz(x - 1))
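
For reference, a tiny standalone helper that maps a value to the largest power of two not exceeding it, using the same __builtin_clz approach (one possible bucketing function, not necessarily what the patch ends up using):

#include <cstdint>
#include <cstdio>

// Largest power of two that is <= X (X must be non-zero).
static uint32_t floorPowerOfTwo(uint32_t X) {
  return 1u << (31 - __builtin_clz(X));
}

int main() {
  // Prints: 1 -> 1, 2 -> 2, 3 -> 2, 4 -> 4, 5 -> 4
  for (uint32_t X = 1; X <= 5; ++X)
    std::printf("%u -> %u\n", X, floorPowerOfTwo(X));
  return 0;
}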

57

Tests for this arithmetic?

72

Why list over smallvector? I can't see a need for iterator stability here

95

This is quadratic - each pass around the loop walks through each node of the list

112

LLVM is built with exceptions disabled, so probably shouldn't raise here

jdoerfert added inline comments.Jun 3 2020, 9:00 AM
openmp/libomptarget/src/memory.cpp
112

This is the runtime, so exceptions "would work". However, no exceptions please. There is no defined interface and no reason to believe the user has a C++ exception handler waiting.

tianshilei1992 added inline comments.Jun 3 2020, 7:20 PM
openmp/libomptarget/src/memory.cpp
33

That is actually my expectation. This function is intended for numbers that are not a power of 2. The comment is not accurate, and I'll update it.
The intention here is to distribute different buffers to different buckets based on their previous power of two. For example, 1024, 1025, 1100, and 2000 will all go to the bucket with base size 1024.

72

Any good suggestions? I also think this style is a little weird, but I cannot find a better one.

95

In the worst case, yes. The worst case is equivalent to releasing all free buffers. That's why this procedure starts from the bucket with the largest size. Each time we release one buffer, we try the allocation once, until it succeeds.

tianshilei1992 added a comment.EditedJun 3 2020, 7:34 PM

Thank you Jon for the review! The comments are really valuable.

Memory allocators are notoriously difficult to implement efficiently in the face of unknown workload. Can you share benchmarks that lead to this design?

The benchmark is quite simple:

#pragma omp parallel for
for (int i = 0; i < 4096; ++i) {
#pragma omp target teams distribute map(...)
  { /* kernel here */ }
}

We have 4096 tasks, and depending on the number of threads N, we have N target regions offloaded at almost the same time, such that M of them might be executing simultaneously. For each kernel, the runtime allocates memory before its execution and frees the memory after the execution. From NVVP, we observed that cuMemFree is very expensive, especially when the computation is light but depends on a large amount of memory. What's more, during cuMemFree there is no context switch on the device, even though multiple kernels are actually executing at the same time. From profiling IBM XL OpenMP, we found that it does not call cuMemFree after each execution, and that is why we are thinking of having a memory pool.

Similarly they're really easy to get wrong. There's a lot of subtle arithmetic in this patch. It would be prudent to cover this with tests, e.g. stub out the RTL->data_alloc calls for malloc/free and run targeted and fuzz tests under valgrind.

That sounds reasonable. Will do it.

A second part of this puzzle is device side memory pools, so that malloc/free from a target region doesn't (always) have to call into the host. That may end up being quite platform dependent. That seems orthogonal to this patch.

That part should be covered in the plugin, which is currently not the focus of this patch. But maybe we could avoid doing memory allocation and free in the plugin, I guess.

ye-luo added a subscriber: ye-luo.EditedJun 18 2020, 3:02 PM

I think this optimization can be an option, but it should not replace the existing scheme of directly allocating/freeing memory.
An application may request device memory outside OpenMP and use a vendor-native programming model or libraries.
Having libomptarget hold on to large amounts of memory doesn't make sense.
You may consider using the pool only for very small allocation requests, e.g. <1 MB.
It is the application's responsibility to take care of large memory allocations.


I agree that this optimization could be optional, perhaps enabled/disabled by an environment variable, so that power users can still take care of everything on their own. However, not every user is a power user who uses OpenMP offloading in an expert way, allocating memory with device RTL functions and using it directly; most will still use the interfaces provided by OpenMP to allocate device memory.

Having an allocation size limit may work. Below MAXSIZE, go to the manager; above MAXSIZE, go directly to the device.
Power users can even set MAXSIZE=0 to fully skip the manager if they want.


I like that idea.
By default I want some memory management for people, and the smaller the allocations, the more important it is. That said, opting out should always be an option.

@tianshilei1992 please also make sure that the runtime shutdown routines will free this memory.

I happened to find that the huge overhead of cuMemFree might be due to the fact that it is called while the data is still being used. I will come back to this patch after I fix that issue and re-evaluate whether we still need this.

cdaley added a subscriber: cdaley.Wed, Jul 22, 4:09 PM

I left a bunch of comments below, from minor nits and style things to design suggestions. At a high level, I think there are a few things:

  • We want to manage device memory for allocations up to a user-defined maximum size. That means we need compile-time parameters here, and maybe we should also read environment variables at creation time.
  • We should consider "more complex" designs that allow trading off wasted memory against the number of allocations. For example, we can allocate an array with N elements of size S if we run out of S-sized nodes; see the sketch after this list. That further minimizes the number of runtime calls through the entire system. Let's leave that for later though.
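
A rough sketch of that slab idea, assuming a per-size-class free list and a hypothetical deviceAlloc stand-in (not part of this patch):

#include <cstddef>
#include <vector>

void *deviceAlloc(size_t Size); // plugin allocation (stand-in)

// When a size class runs dry, grab one slab of N * S bytes and carve it into
// N entries instead of issuing N separate runtime calls. Note that the slab
// must eventually be freed as a whole, so the pool has to remember it.
void refillSizeClass(std::vector<void *> &FreeList, size_t S, size_t N) {
  char *Slab = static_cast<char *>(deviceAlloc(S * N));
  if (!Slab)
    return; // out of device memory; the caller falls back to a direct call
  for (size_t I = 0; I < N; ++I)
    FreeList.push_back(Slab + I * S);
}
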
openmp/libomptarget/src/device.cpp
211

Nit: make tp a void * and cast the one use of it as uintptr_t instead.

304

Nit: Remove the cast.

openmp/libomptarget/src/memory.cpp
62

No inline but static please. Same above. This feels like something we could use from LLVM or share with libomp... this code duplication is a nightmare.

Anyway, as @JonChesterfield mentioned, we should aim for a unit test here. We could also add an executable test case that hits a bucket really hard to ensure we can deal with it.

69

Function and member comments in doxygen style please. Also in the header.


Add a static assert that the node size is 2 * sizeof(void*).
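
For instance, assuming the node holds just a device pointer and a size, such an assert could look like this (illustrative only):

#include <cstddef>

struct NodeTy {
  void *Ptr;        // device pointer
  std::size_t Size; // buffer size in bytes
};

// On the usual targets size_t is pointer-sized, so the node occupies exactly
// two pointers; this catches accidental growth of the struct.
static_assert(sizeof(NodeTy) == 2 * sizeof(void *),
              "NodeTy is expected to be as small as two pointers");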

71

Comment explaining this thing. I'm also very confused by the duplicated class declaration. Let's not do that.

78

Comments explaining these things. Maybe place the mutexes next to the things they protect.

Why a std::list and an unordered map?

Naturally, I would have gone with a vector or std::deque. To "delete" elements I would mark them taken. There should be 32-bit padding in a Node anyway. Though it is hard to predict what is good.

I am unsure about the map; without measurements it's guesswork, and I would go with the regular one, but this is fine.
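
A sketch of the "mark them taken" alternative (illustrative only, not a concrete proposal for the final code):

#include <cstdint>
#include <vector>

// Instead of erasing nodes from a std::list, keep them in a vector and flip a
// flag; a 32-bit flag fits into the padding a {pointer, 32-bit size} node
// would have on 64-bit targets anyway.
struct NodeTy {
  void *Ptr;
  uint32_t Size;
  uint32_t Taken; // 0 = free, 1 = handed out
};

// Reuse a "deleted" slot if one exists, otherwise append a new node.
NodeTy *takeNode(std::vector<NodeTy> &Nodes, void *Ptr, uint32_t Size) {
  for (NodeTy &N : Nodes)
    if (!N.Taken) {
      N = {Ptr, Size, 1};
      return &N;
    }
  Nodes.push_back({Ptr, Size, 1});
  return &Nodes.back();
}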

86

Comments on all of these please. Maybe allocateOnDevice as a name instead?

92

Hm.. running out of memory seems like an "edge case", and if it happens it seems "likely" to happen again. Why not use the opportunity to free everything in the free list while we are here? I mean, it will be "cheaper" complexity-wise, reasonably useful given that the next allocation will hit the same problem, and very much simpler.

151

If this is part of the device, the place we tear down the context, this issue should go away, I think.

154

At least for malloc and friends that is totally fine btw. If we filter this earlier we can leave the assert though.

159

Descriptive variable names are worth the trouble of typing.

173

We should round the size up to increase reuse. Also makes all blocks in a bucket the same size.

183

Nothing is wrong; we just ran out of memory. Return a nullptr and all is good.

198

As the lock is released, this can/should go into a helper function.

openmp/libomptarget/src/memory.h
22

If there is no private state I'd go for struct. Though I would have expected private state TBH.

36

I think the MemoryManager, like the StreamManager, is a thing that belongs to a Device. Different devices might choose different implementations, etc. That also reduces our global state footprint. Note that you can and should keep the memory.{h,cpp} files, but make the object part of a Device if possible.
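
A sketch of that ownership structure; DeviceTy and MemoryManagerTy here are only stand-ins for the real classes:

#include <memory>

// Stand-in for the real manager class declared in memory.h.
class MemoryManagerTy { /* ... */ };

// Hypothetical excerpt of the Device class: each device owns its own manager,
// so different devices can pick different implementations and there is no
// process-global pool.
struct DeviceTy {
  std::unique_ptr<MemoryManagerTy> MemoryManager;
  // ... other per-device state, e.g. the stream manager ...
};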

Refactored the whole patch. Corresponding tests will be added soon.

tianshilei1992 marked 18 inline comments as done.Sat, Aug 1, 7:09 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.cpp
211

Unrelated to this patch, so marking it as Done.

304

Unrelated to this patch, so marking it as Done.

openmp/libomptarget/src/memory.cpp
71

It's not a duplicate. I just don't want to put too many things into a header that will be included by others. Using PImpl could make things better.

151

The comment seems out of date.

154

The comment seems out of date.

173

The comment seems out of date.

198

The comment seems out of date.

tianshilei1992 marked 7 inline comments as done.Sat, Aug 1, 7:10 PM
tianshilei1992 edited the summary of this revision. (Show Details)Sat, Aug 1, 7:12 PM

Updated the function name in exports

tianshilei1992 added inline comments.Sat, Aug 1, 8:53 PM
openmp/libomptarget/src/memory.cpp
236

This line should be removed.

Removed a wrong line of code

Fixed a minor issue

Took type alias into the class

Added debug print

Rebased before moving to allocator

jdoerfert added inline comments.Mon, Aug 3, 3:47 PM
openmp/libomptarget/src/device.h
32

Can we call these things MemoryManagerInterface and MemoryManagerImpl instead?

openmp/libomptarget/src/memory.cpp
11

Can you add a description of the algorithm here, please? What is happening and why.

36
39
67

-inline +static

openmp/libomptarget/src/memory.h
32

Describe what Threshold does (in some detail)

tianshilei1992 added inline comments.Mon, Aug 3, 6:02 PM
openmp/libomptarget/src/memory.cpp
67

I didn't get that. Why does inline not work here? This function is so simple that I would like to see it inlined by the compiler.

Updated based on comments

tianshilei1992 marked 3 inline comments as done.Mon, Aug 3, 6:44 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/device.h
32

I renamed the implementation class to MemoryManagerImplTy.

tianshilei1992 marked an inline comment as done.Mon, Aug 3, 6:44 PM
jdoerfert added inline comments.Tue, Aug 4, 11:40 PM
openmp/libomptarget/src/memory.cpp
67

inline is two things: a "hint" which affects the inliner heuristic, and a way to get linkonce_odr linkage for functions. It is not a way to force inlining; that is __attribute__((always_inline)). That said, there is no need to tell the inliner what to do anyway, but we should always limit the lifetime of things, so make them static if possible. Take a look at https://godbolt.org/z/Mjnhe8 to see the effects the different annotations have.
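
A small illustration of the three annotations (mirroring the godbolt link above rather than code from the patch; compile it and inspect the emitted symbols to see the difference):

// inline: an optimizer hint plus linkonce_odr-style linkage; the symbol can
// still be emitted and referenced from other translation units.
inline int timesTwoInline(int X) { return X * 2; }

// static: internal linkage, so the definition is local to this translation
// unit and the optimizer is free to inline it or drop it entirely.
static int timesTwoStatic(int X) { return X * 2; }

// always_inline: actually forces inlining at every call site.
__attribute__((always_inline)) inline int timesTwoForced(int X) {
  return X * 2;
}

int user(int X) {
  return timesTwoInline(X) + timesTwoStatic(X) + timesTwoForced(X);
}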

Updated according to comments

tianshilei1992 marked 6 inline comments as done.Wed, Aug 5, 12:32 PM

Updated the calculation of NumBuckets

tianshilei1992 marked an inline comment as done.Wed, Aug 5, 12:39 PM

Use const_iterator

Make mutexes close to the variables they protect